A friend at Cloudera recently invited me to write a post for their corporate blog about how social scientists are using large scale computation.
I’ve been using Hadoop and MapReduce to study some really large datasets this year. I think it’s going to become more and more important and open the world of scientific computing to social scientists. I’m happy to evangelize for it.
One of the ideas that didn’t make its way into the final version is that even though the tools and data are becoming more widely available to laypeople, asking good social science questions — and answering them correctly — is still hard. It’s comparatively easy to ask the wrong question, use the wrong data, draw the wrong inference, and so on, epecially if the wrongness is subtle. As an example, I think the OkCupid blog is interesting, but it’s not social science.
Social science has long been concerned with sampling methods precisely because it’s dangerously easy to incorrectly extrapolate findings from a non-representative sample to an entire population. Drawing conclusions from internet-based interactions can be problematic because the sample frame doesn’t match the population of interest. Even though I learned to make a cigar box guitar from Make Magazine, I don’t assume I know that much about acoustic engineering. Likewise, recreational data analysis is fun, illuminating and perhaps suggestive of how our social world works, but one ought not conclude that correlations or trends tell the whole, correct story. However, if exploring and experimenting with data can spark an interest in quantitative analysis of our social world, then I think it’s all for the better.