<div dir="ltr">I wanted to follow up on this thread...<div class="gmail_extra"><div class="gmail_quote"><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word;line-break:after-white-space"><div>The median number of messages per group that a person has sent has a different shape than the obviously long tail distributions. While most people have a small number (because most people have sent few messages total) and the numbers decrease from there, there seems to be a flatter place on the curve, that there is a bunching towards sending on the order of 100 messages / group.</div></div></blockquote><div><br></div><div>This caught my eye because it illustrates a general point about descriptive statistical analysis.</div><div><br></div><div>There's a long-standing debate about what kinds of generative processes underlie observable long-tail distributions.</div><div><br></div><div>My conclusion after studying this for some time is that most of the time, these distributions are due to a fundamental law of probability: the Central Limit Theorem, applied to the logarithms of the factors, leads to heavy-tailed (approximately log-normal) distributions when many independent factors are multiplied. There's little to be learned from the mere observation of a heavy-tailed distribution, because it reflects mathematical law.</div><div><br></div><div>Similarly, while it is in one sense interesting that the median number of messages per group that a person has sent has a slightly different distribution, the first thing I'd want to do before interpreting that is to find out whether this is to be expected of data in a mathematically general way. The median is a specific way of finding the "middle" of a distribution, one which will naturally tend towards the "head" of a distribution, however long or heavy the "tail". 
So we shouldn't be surprised if the distribution of medians has a much thinner tail than, say, the distribution of means.</div><div><br></div><div>In general, I've found that working with real data is a great way to learn more about statistics and get an intuition for the relationship between different mathematical concepts. It's also a great way of seeing how these mathematical laws wind up shaping social reality. This is a too-often understated point that I think is the biggest take-away from all my computational social science work (part of why I wrote <a href="http://cosmosandhistory.org/index.php/journal/article/view/570/917">this</a>).</div><div><br></div><div>It does not, however, do much to generate compelling research theses of popular or scientific interest. To get these, we need a more sophisticated strategy. </div><div><br></div><div>One such strategy has two steps:</div><div> - Understand the mathematical baseline expectations (the 'null hypothesis'), which is what we're talking about now</div><div> - Identify under what conditions the data deviate from these expectations (the 'statistically significant' result)</div><div><br></div><div>Another works like this:</div><div> - Featurization: Identify a number of different factors that might be influencing the value of a dependent variable (for example, if the dependent variable is number of messages sent, independent 'features' might be gender, what organization they work for, etc.)</div><div> - Regression: Find out which features are correlated with the dependent variable.</div><div><br></div><div>There are others that get more sophisticated, of course. I'm just trying to paint a picture of how we might try to go beyond these exploratory histograms in the next phase.</div></div></div></div>
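<div><br></div><div>As a rough illustration of the baseline expectation discussed above, here is a minimal simulation sketch. It does not use the mailing list data: the factor range (uniform between 0.5 and 1.5), the number of factors (50), and the sample size are arbitrary assumptions chosen only to show the effect. It multiplies many independent positive factors, then compares the median with the mean and checks how skewed the products are before and after taking logarithms.</div>

```python
import math
import random
import statistics


def simulate_products(n_samples=10_000, n_factors=50, seed=0):
    """Each sample is the product of n_factors independent positive factors.

    The uniform(0.5, 1.5) factor distribution is an arbitrary assumption.
    """
    rng = random.Random(seed)
    return [
        math.prod(rng.uniform(0.5, 1.5) for _ in range(n_factors))
        for _ in range(n_samples)
    ]


def skewness(values):
    """Sample skewness: near 0 for symmetric data, large and positive
    when there is a heavy right tail."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return statistics.fmean([((v - mu) / sigma) ** 3 for v in values])


products = simulate_products()
logs = [math.log(p) for p in products]

# The median sits near the "head"; the mean is pulled out by the tail.
print("median of products:", statistics.median(products))
print("mean of products:  ", statistics.fmean(products))

# The products are strongly right-skewed, while their logarithms are
# roughly symmetric, as the Central Limit Theorem suggests for a sum
# of many independent terms.
print("skewness of products:     ", skewness(products))
print("skewness of log(products):", skewness(logs))
```

<div>If the real per-person message counts look like the simulated products, a heavy tail on its own would not be evidence of anything beyond this baseline; the interesting question is where the data depart from it.</div>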