[Bigbang-dev] research questions of interest for standard-setting participation
Sebastian Benthall
sbenthall at gmail.com
Wed Mar 28 17:35:46 CEST 2018
I wanted to follow up on this thread...
> The median number of messages per group that a person has sent has a
> different shape than the obviously long tail distributions. While most
> people have a small number (because most people have sent few messages
> total) and the numbers decrease from there, there seems to be a flatter
> place on the curve, that there is a bunching towards sending on the order
> of 100 messages / group.
>
This caught my eye because it's illustrative of a general point about
descriptive statistical analysis.
There's a long-standing debate about what kinds of generative processes
underlie observable long-tail distributions.
My conclusion after studying this for some time is that most of the time,
these distributions are due to a fundamental law of probability: the
Central Limit Theorem, applied to logarithms, implies that the product of
many independent positive factors tends toward a log-normal distribution,
which is heavy-tailed. There's little to be learned from the observation
of a heavy-tail distribution by itself because it reflects mathematical
law.
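To make the multiplicative argument concrete, here is a minimal sketch in
Python (standard library only). The factor range, the sample sizes, and the
function names are all assumptions for illustration; none of this comes from
BigBang or the list data. It multiplies many independent positive factors and
compares the skewness of the raw products with the skewness of their
logarithms.

```python
import math
import random
import statistics


def multiplicative_samples(n_samples=5000, n_factors=50, seed=0):
    """Each sample is the product of n_factors independent positive factors."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        product = 1.0
        for _ in range(n_factors):
            # Assumed factor distribution: uniform on [0.5, 1.5].
            product *= rng.uniform(0.5, 1.5)
        samples.append(product)
    return samples


def skewness(values):
    """Population skewness: the mean cubed z-score."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


samples = multiplicative_samples()
logs = [math.log(s) for s in samples]

raw_skew = skewness(samples)   # strongly right-skewed
log_skew = skewness(logs)      # close to symmetric, as the CLT suggests

print("skewness of products:    ", raw_skew)
print("skewness of log-products:", log_skew)
print("mean vs median of products:",
      statistics.fmean(samples), statistics.median(samples))
```

The logs are sums of independent terms, so they should look roughly normal,
while the products themselves have a long right tail with the mean sitting
well above the median.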
Similarly, while it is in one sense interesting that the median number of
messages per group that a person has sent has a slightly different
distribution, the first thing I'd want to do before interpreting that is to
find out whether this is to be expected of data in a mathematically general
way. The median is a specific way of finding the "middle" of a
distribution, one which will naturally tend towards the "head" of a
distribution, however long or heavy the "tail". So we shouldn't be
surprised if the distribution of medians has a much thinner tail than,
say, the distribution of means.
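One way to check this kind of baseline is by simulation. The sketch below
is only an illustration under stated assumptions: the per-group message counts
are synthetic log-normal draws, every person is in the same number of groups,
and the names and parameters are made up rather than taken from the archive
data. It computes a per-person median and a per-person mean and compares how
far out the upper tail of each reaches.

```python
import random
import statistics


def simulate_people(n_people=5000, n_groups=5, sigma=2.0, seed=1):
    """Synthetic per-group message counts for each person (heavy-tailed)."""
    rng = random.Random(seed)
    return [
        [rng.lognormvariate(0.0, sigma) for _ in range(n_groups)]
        for _ in range(n_people)
    ]


def percentile_99(values):
    # statistics.quantiles with n=100 returns 99 cut points;
    # the last one is the 99th percentile.
    return statistics.quantiles(values, n=100)[98]


people = simulate_people()
per_person_median = [statistics.median(counts) for counts in people]
per_person_mean = [statistics.fmean(counts) for counts in people]

p99_of_medians = percentile_99(per_person_median)
p99_of_means = percentile_99(per_person_mean)

print("99th percentile of per-person medians:", p99_of_medians)
print("99th percentile of per-person means:  ", p99_of_means)
```

If the real data show the same gap between the two summaries, the flatter
shape of the median curve may be a property of the statistic rather than of
the participants.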
In general, I've found that working with real data is a great way to learn
more about statistics and get an intuition for the relationship between
different mathematical concepts. It's also a great way of seeing how these
mathematical laws wind up shaping social reality. This is a too-often
understated point that I think is the biggest take-away from all my
computational social science work (part of why I wrote this
<http://cosmosandhistory.org/index.php/journal/article/view/570/917>).
Exploratory work of this kind does not, however, do much to generate
compelling research theses of popular or scientific interest. To get these,
we need a more sophisticated strategy.
One such strategy has two steps:
- understand the mathematical baseline expectations (the 'null
hypothesis'); this is what we're talking about now
- identify under what conditions the data deviate from these expectations
(the 'statistically significant' result)
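As a rough sketch of those two steps, here is a permutation test written
with the Python standard library. It is one of several ways to do this, not a
recommendation for the project. The null hypothesis is that two sets of
per-person message counts are exchangeable; the function name, the choice of
the median as test statistic, and the toy samples are all assumptions for
illustration.

```python
import random
import statistics


def permutation_p_value(sample_a, sample_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of medians."""
    rng = random.Random(seed)
    observed = abs(statistics.median(sample_a) - statistics.median(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under the null, group labels are arbitrary, so shuffle them.
        rng.shuffle(pooled)
        diff = abs(statistics.median(pooled[:n_a])
                   - statistics.median(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Toy samples, not real mailing-list data.
group_a = list(range(1, 21))
group_b = list(range(101, 121))

p_different = permutation_p_value(group_a, group_b)
p_same = permutation_p_value(group_a, group_a)

print("clearly different groups:", p_different)
print("identical groups:        ", p_same)
```

A small p-value says the observed gap is unlikely under the baseline; it
does not by itself say why the groups differ.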
Another of them works like this:
- Featurization: Identify a number of different factors that might be
influencing the value of a dependent variable (for example, if the
dependent variable is number of messages sent, independent 'features' might
be gender, what organization they work for, etc.)
- Regression: Find out which features are correlated with the dependent
variable.
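A minimal sketch of that workflow, using plain Pearson correlation as the
simplest stand-in for a regression. Everything here is hypothetical: the
records are synthetic, and the feature names (is_chair, from_large_org) are
invented for the example and are not fields in BigBang. One feature is built
to affect the (log) message count and the other is not.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Featurization: one record per person, with 0/1 features.
rng = random.Random(2)
records = []
for _ in range(500):
    is_chair = 1 if rng.random() < 0.5 else 0
    from_large_org = 1 if rng.random() < 0.5 else 0
    # By construction, only is_chair shifts the dependent variable.
    log_messages = 2.0 + 1.5 * is_chair + rng.gauss(0.0, 1.0)
    records.append({
        "is_chair": is_chair,
        "from_large_org": from_large_org,
        "log_messages": log_messages,
    })

# Regression step (simplified): correlate each feature with the target.
target = [r["log_messages"] for r in records]
feature_names = ["is_chair", "from_large_org"]
correlations = {
    name: pearson([r[name] for r in records], target)
    for name in feature_names
}

for name, value in correlations.items():
    print(name, value)
```

With real data one would more likely fit a multivariate model so that
correlated features can be separated, but the shape of the workflow is the
same.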
There are others that get more sophisticated, of course. I'm just trying to
paint a picture of how we might try to go beyond these exploratory
histograms in the next phase.