[Bigbang-dev] research questions of interest for standard-setting participation

Nick Doty npdoty at ischool.berkeley.edu
Fri Feb 16 01:26:01 CET 2018


On Feb 5, 2018, at 2:09 PM, Sebastian Benthall <sbenthall at gmail.com> wrote:
> 
> 2) I just figured out how to make time for this in the short term. So count me in.
> 
> Shall we plan a meeting about this?

Yeah, I'd love to do that! Would folks be interested in an audio chat next week? I will send around a Doodle poll if it's more than just me and Seb.

> On Feb 5, 2018 4:24 PM, "Sebastian Benthall" <sbenthall at gmail.com> wrote:
> These are great questions, Nick.
> 
> I'd love to work on them with you, especially because they are such general metrics.
> Sadly I've got almost no time to work on it until May, due to dissertation work.
> 
> Let me provide some recommendations based on my attempts to address similar questions on SciPy and other lists.

These comments are really helpful, thanks!

I am interested to understand the math better, and could really use your help on that. I definitely get your general point that because there's a long-tail distribution in any case, I need to find cases that don't fit that pattern in order to show meaningful results.

I'm not sure I understand the concentration parameter, but something like that does seem useful. I also thought there might be interesting graph analysis metrics, such as centrality, in a graph whose nodes are participants and lists and whose edges connect participants to the lists they post on.
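
To make the centrality idea concrete, here is a minimal sketch of degree centrality on a participant/list graph. It assumes the archive has already been reduced to (participant, list) pairs, one per message; the function name and input shape are hypothetical, not anything BigBang provides today.

```python
from collections import defaultdict

def bipartite_degree_centrality(posts):
    """Degree centrality for a two-mode participant/list graph.

    posts: iterable of (participant, mailing_list) pairs, one per message
    (an assumed input shape). Repeated pairs count as a single edge.
    """
    lists_of = defaultdict(set)    # participant -> lists they posted to
    members_of = defaultdict(set)  # list -> participants who posted there
    for person, mlist in posts:
        lists_of[person].add(mlist)
        members_of[mlist].add(person)

    n_people = len(lists_of)
    n_lists = len(members_of)
    # In a bipartite graph, a node's degree is normalized by the size of
    # the opposite node set: the maximum degree it could possibly have.
    person_centrality = {p: len(ls) / n_lists for p, ls in lists_of.items()}
    list_centrality = {l: len(ps) / n_people for l, ps in members_of.items()}
    return person_centrality, list_centrality
```

A participant with centrality 1.0 has posted to every list in the data set. Heavy posters will tend to score higher simply because they send more messages, so this would need the same care in interpretation as the raw counts.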

Thanks again for your thoughts!
—Nick

> * how many participants total in IETF work?
> 
> The odds are *very* high that the emails-per-person distribution is a heavy-tail distribution.
> Based on previous work <https://conference.scipy.org/proceedings/scipy2015/pdfs/sebastian_benthall.pdf>, I would test for fit to log normal and power law distributions.
> My money is on log normal being a better fit.
> 
> This is important because when interpreting the results, we have to keep in mind that
> the log normal distribution is essentially a noise pattern.
> So it's easy to read into the data relationships that may not be there,
> especially if you're using a linear rather than a log linear relationship as an indicator.
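
As a rough illustration of that comparison, the sketch below fits both distributions by maximum likelihood and returns their log-likelihoods. It makes simplifying assumptions: message counts are treated as continuous values, the power law starts at a fixed xmin of 1, and the log normal is not truncated. The function names are hypothetical.

```python
import math

def lognormal_loglik(counts):
    """Log-likelihood of a log normal fitted by maximum likelihood.

    Requires positive counts that are not all identical (variance > 0).
    """
    logs = [math.log(x) for x in counts]
    n = len(logs)
    mu = sum(logs) / n
    var = sum((v - mu) ** 2 for v in logs) / n  # MLE (population) variance
    # At the MLE the squared-deviation term sums to exactly n / 2.
    return -sum(logs) - 0.5 * n * math.log(2 * math.pi * var) - n / 2

def powerlaw_loglik(counts, xmin=1.0):
    """Log-likelihood of a continuous power law fitted by maximum likelihood.

    Density: p(x) = ((alpha - 1) / xmin) * (x / xmin) ** -alpha for x >= xmin.
    Requires every count >= xmin and at least one count > xmin.
    """
    n = len(counts)
    log_ratio = sum(math.log(x / xmin) for x in counts)
    alpha = 1 + n / log_ratio  # closed-form MLE for the exponent
    return n * math.log((alpha - 1) / xmin) - alpha * log_ratio
```

The larger log-likelihood indicates the better fit, but a raw difference is not a significance test. A dedicated tool such as the `powerlaw` Python package, which also estimates xmin, would be a better basis for real conclusions.
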
> 
> * how "sticky" is participation?
>         if people participate on a list, do they return? do they show up to f2f meetings?
>         what's the attrition rate?
>         what's the distribution of length of participation?
> 
> Assuming there is a heavy tail distribution of participation, then about half the contributors
> will only contribute once.
> 
> The distribution of attrition/retention will look more or less just like the distribution of participation.
> The length will look like it as well.
> 
> It's not clear how to interpret this, because the reasons why any particular person participates a lot
> or a little are very likely
> (a) myriad (no single reason, but rather a combination of many reasons), and
> (b) exogenous to the data itself.
> 
> For these reasons I expect you would get more interesting results if you can segment the population
> into categories of interest. You've mentioned gender and firms of employment, which are both good ones.
> 
> But for each category, you may want to have more than one parameter to
> characterize each one's participation distribution.
> Maybe mean and variance?
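
One way to read this suggestion, sketched below: if per-person message counts are roughly log normal, the mean and variance of the log counts are the natural two parameters per category. The input dictionaries and the function name are assumptions for illustration.

```python
import math
from collections import defaultdict

def log_stats_by_category(counts_by_person, category_of):
    """Mean and variance of log message counts for each category.

    counts_by_person: participant -> number of messages (all >= 1)
    category_of:      participant -> category label (e.g. employer)
    Both mappings are assumed inputs, not an existing data format.
    """
    logs = defaultdict(list)
    for person, count in counts_by_person.items():
        logs[category_of[person]].append(math.log(count))

    stats = {}
    for category, values in logs.items():
        n = len(values)
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / n  # population variance
        stats[category] = (mean, var)
    return stats
```

Comparing categories on the log scale avoids letting a handful of very heavy posters dominate the averages.
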
> 
> * who has participated longest? across the most groups?
>         is there a group of "elites" across working groups?
> 
> This is a great question.
> But keep in mind: the people who participate most will send far more messages
> across all lists than others do,
> so they will have more chances to participate in different lists.
> 
> You may want to be looking at, for each participant, their individual distribution of participation
> over many lists, and then look at the concentration parameter of that distribution:
> 
> https://en.wikipedia.org/wiki/Concentration_parameter
> 
> The math can be a bit tricky but I think it's worth tackling correctly.
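
As a first approximation, here is a sketch that scores how concentrated each participant's messages are across lists. Note that this is not the Dirichlet concentration parameter from the linked article; it swaps in a much simpler proxy, the Herfindahl-Hirschman index (the sum of squared shares). The input shape and function name are assumptions.

```python
from collections import Counter, defaultdict

def concentration_by_participant(posts):
    """Herfindahl-Hirschman index of each participant's messages over lists.

    posts: iterable of (participant, mailing_list) pairs, one per message
    (an assumed input shape). Returns participant -> index in (0, 1]:
    1.0 means every message went to one list; values near 1/k mean the
    messages were spread evenly over k lists.
    """
    per_person = defaultdict(Counter)
    for person, mlist in posts:
        per_person[person][mlist] += 1

    result = {}
    for person, counts in per_person.items():
        total = sum(counts.values())
        result[person] = sum((c / total) ** 2 for c in counts.values())
    return result
```

Anyone with a single message scores 1.0 by construction, so the index is only informative for participants with enough messages to have been able to spread out.
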
> 
>         how many participants are single-group?
> 
> Since most participants will only send one message, that's going to skew this metric
> unless you take that into account somehow.
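
One simple way to take that into account, sketched here under assumptions: compute the single-group share only among participants with at least some minimum number of messages, since a one-message sender is single-group by construction. The threshold parameter and function name are hypothetical.

```python
from collections import Counter, defaultdict

def single_group_share(posts, min_messages=1):
    """Share of participants who posted to exactly one list.

    posts: list of (participant, mailing_list) pairs, one per message
    (an assumed input shape). Only participants with at least
    min_messages messages are counted. Returns None if nobody qualifies.
    """
    message_count = Counter(person for person, _ in posts)
    groups = defaultdict(set)
    for person, mlist in posts:
        groups[person].add(mlist)

    eligible = [p for p, c in message_count.items() if c >= min_messages]
    if not eligible:
        return None
    single = sum(1 for p in eligible if len(groups[p]) == 1)
    return single / len(eligible)
```

Comparing the result for `min_messages=1` against `min_messages=2` gives a quick sense of how much of the single-group share comes from one-time posters.
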
> 
>         how many groups does the typical participant join?
> 
> As I believe I've mentioned to this group before, I've been looking into estimating gender in mailing list participation, including:
> 
> * What is the gender distribution of participants in Internet and Web technical standard-setting?
>     how does that distribution differ from the population at large? from employment at related firms?
>     does that distribution change over time?
>     are there sub-groups which have distinctly different distributions?
> * Does the gender distribution of conversation differ from the gender distribution of the participants?
> 
> Great questions.
> 
> Do you have questions you'd like to add to this list? Would you be interested in trying to measure/answer one of these questions? Which are the easiest and which are the most difficult? What features would we need to add to BigBang to make them answerable?
> 
> In sum, I think all these questions are great ones and related to each other.
> I think the biggest challenge is getting the statistical modeling right,
> so that the results are not misinterpreted.
> 
> - Seb
> 
