[Bigbang-dev] research questions of interest for standard-setting participation

Fri Feb 16 01:19:21 CET 2018

I've been working with the IETF mailing list corpus that I have collected so far. These aren't direct answers to the proposed questions, but perhaps there is enough data here that I can start getting your input on what I need to do to answer them.

In this corpus of 316 different mailing list archives, 101,510 "people" sent a combined total of 1.2 million messages. Most people sent only 1 message, and only a quarter of people sent more than 2. Confirming Seb's assumption, there is an extremely long tail.
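A minimal sketch of how counts like these could be computed, assuming the corpus has already been flattened into one (sender, list name) pair per message. The toy data and the function names are hypothetical and are not part of BigBang.

```python
from collections import Counter

# Hypothetical input: one (sender, list_name) pair per message.
messages = [
    ("alice@example.org", "httpbis"),
    ("alice@example.org", "httpbis"),
    ("alice@example.org", "tls"),
    ("bob@example.org", "tls"),
    ("carol@example.org", "quic"),
    ("dave@example.org", "quic"),
]

def sender_totals(messages):
    """Count messages per sender across all lists."""
    return Counter(sender for sender, _ in messages)

def share_sending_more_than(totals, threshold):
    """Fraction of senders whose total message count exceeds threshold."""
    if not totals:
        return 0.0
    return sum(1 for n in totals.values() if n > threshold) / len(totals)

totals = sender_totals(messages)
# Fraction of senders with exactly one message (the head of the long tail).
single_share = sum(1 for n in totals.values() if n == 1) / len(totals)

print(len(totals), "senders,", sum(totals.values()), "messages")
print("share sending exactly 1 message:", single_share)
print("share sending more than 2:", share_sending_more_than(totals, 2))
```

Note that a "sender" here is just a distinct address string, so one person posting from several addresses would be counted several times.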

The median number of messages a person has sent per group has a different shape than the obviously long-tailed distributions. Most people have a small median (because most people have sent few messages in total) and the counts decrease from there, but there seems to be a flatter place on the curve: a bunching around sending on the order of 100 messages per group.
[see attached histogram median-messages-per-group]
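One way the per-person median could be computed is sketched below. It assumes the median is taken only over lists a person has posted to at least once (lists with zero messages are left out), which may differ from how the attached histogram was produced. The input data and function names are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import median

# Hypothetical input: one (sender, list_name) pair per message.
messages = (
    [("alice@example.org", "httpbis")] * 3
    + [("alice@example.org", "tls")] * 1
    + [("bob@example.org", "tls")] * 1
    + [("carol@example.org", "quic")] * 120
    + [("carol@example.org", "tls")] * 80
)

def median_messages_per_group(messages):
    """Map each sender to the median of their per-list message counts."""
    per_list = defaultdict(Counter)
    for sender, list_name in messages:
        per_list[sender][list_name] += 1
    return {s: median(counts.values()) for s, counts in per_list.items()}

def order_of_magnitude_histogram(medians):
    """Bucket medians by digit count: 0 is 1-9, 1 is 10-99, 2 is 100-999."""
    return Counter(len(str(int(m))) - 1 for m in medians.values())

medians = median_messages_per_group(messages)
histogram = order_of_magnitude_histogram(medians)

print(medians)
print(dict(histogram))
```

Bucketing by order of magnitude is only one choice of binning; it is used here because a bump around 100 messages per group would show up as an unusually full bucket 2.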

I'm also comparing the number of messages sent with the number of groups sent to, in the hope of finding a cluster of people who have disproportionate connecting ability and might be an elite of interest. On this second graph there seem to be two natural-log-shaped curves: people who send a couple of messages to lots of lists, and people who have sent a lot of messages to a lot of lists. That second curve seems worthy of investigation to me.
[see attached scatter plot total-messages-vs-num-groups]
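The points behind a plot like this could be derived as in the sketch below, which also shows one crude way to pull out people who are both prolific and broad. The thresholds, data, and function names are hypothetical placeholders, not values taken from the corpus.

```python
from collections import defaultdict

# Hypothetical input: one (sender, list_name) pair per message.
messages = [
    ("alice@example.org", "httpbis"),
    ("alice@example.org", "httpbis"),
    ("alice@example.org", "tls"),
    ("alice@example.org", "quic"),
    ("bob@example.org", "httpbis"),
    ("bob@example.org", "tls"),
    ("bob@example.org", "quic"),
    ("carol@example.org", "dnsop"),
    ("carol@example.org", "dnsop"),
    ("carol@example.org", "dnsop"),
    ("carol@example.org", "dnsop"),
    ("carol@example.org", "dnsop"),
]

def participation_points(messages):
    """Return {sender: (total_messages, num_groups)} for a scatter plot."""
    totals = defaultdict(int)
    groups = defaultdict(set)
    for sender, list_name in messages:
        totals[sender] += 1
        groups[sender].add(list_name)
    return {s: (totals[s], len(groups[s])) for s in totals}

def broad_and_active(points, min_messages, min_groups):
    """Senders at or above both thresholds, sorted by name."""
    return sorted(
        s for s, (m, g) in points.items()
        if m >= min_messages and g >= min_groups
    )

points = participation_points(messages)
candidates = broad_and_active(points, min_messages=4, min_groups=3)

print(points)
print(candidates)
```

In this toy data, bob posts once to each of three lists and carol posts often to a single list, so only alice passes both thresholds. Separating the two curves in real data would likely need something less blunt than fixed cutoffs, for example looking at messages per group as well.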

I'm also curious about the clustering that is possible now that I have, for each person, a vector of the number of messages sent to each of a large number of groups. Is there some latent semantic analysis or vector-similarity clustering I can do to find groupings of people? Could we recommend email lists to people based on what other lists similar people contribute to?
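As one possible starting point for the recommendation question, the sketch below uses plain cosine similarity between per-person message-count vectors and scores unseen lists by the similarity-weighted activity of other people. This is a simple neighbourhood-style approach chosen for illustration, not latent semantic analysis, and the names and data are hypothetical.

```python
import math

# Hypothetical per-person vectors: list name -> number of messages sent.
vectors = {
    "alice": {"httpbis": 10, "tls": 5},
    "bob": {"httpbis": 8, "tls": 4, "quic": 6},
    "carol": {"dnsop": 7},
}

def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def recommend(person, vectors, top_n=3):
    """Rank lists the person has not posted to, weighted by similar people."""
    own = vectors[person]
    scores = {}
    for other, vec in vectors.items():
        if other == person:
            continue
        sim = cosine(own, vec)
        if sim <= 0:
            continue  # people with no overlapping lists contribute nothing
        for list_name, count in vec.items():
            if list_name not in own:
                scores[list_name] = scores.get(list_name, 0.0) + sim * count
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("alice", vectors))
```

Raw counts let very prolific senders dominate the scores; taking logarithms of the counts or normalizing each vector before comparison would be reasonable variations to try.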

I'm concerned that spam could seriously confuse or skew the results of my analysis. Many of the single-message senders are spammers, but I don't know how to figure out how many, or what important participation I'll ignore if I simply exclude them. One mailing list archive had collected so much spam in the years since the list was informally closed that I had to delete large amounts of it just to get my scripts to run in the RAM I had available.
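The sketch below does not detect spam. It only measures, per list, what share of messages would be dropped if every single-message sender were excluded, which could help show where that exclusion would cost the most. The data and function name are hypothetical.

```python
from collections import Counter

# Hypothetical input: one (sender, list_name) pair per message.
messages = [
    ("alice@example.org", "httpbis"),
    ("alice@example.org", "httpbis"),
    ("bob@example.org", "httpbis"),
    ("carol@example.org", "httpbis"),
    ("carol@example.org", "oldlist"),
    ("x1@example.net", "oldlist"),
    ("x2@example.net", "oldlist"),
    ("x3@example.net", "oldlist"),
]

def single_sender_share_by_list(messages):
    """Per list, the fraction of messages from senders with one message total."""
    totals = Counter(sender for sender, _ in messages)
    per_list_all = Counter(list_name for _, list_name in messages)
    per_list_single = Counter(
        list_name for sender, list_name in messages if totals[sender] == 1
    )
    return {
        name: per_list_single[name] / per_list_all[name]
        for name in per_list_all
    }

shares = single_sender_share_by_list(messages)
print(shares)
```

A list where most traffic comes from single-message senders (like the hypothetical "oldlist" here) might be a closed list collecting spam, or a busy list with many legitimate one-off questions, so the number is a prompt for manual inspection rather than a verdict.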

Here's the fork/branch where I'm doing this scientific analysis:
https://github.com/npdoty/bigbang/tree/ietf-participation

And here's the branch with some additional scripts and improved error handling for core that I've developed in the meantime:
https://github.com/npdoty/bigbang/tree/archive-activity
(This will be a pull request against datactive/bigbang at some point; I can do that now, or add a little more to it as I fix additional issues.)

> On Feb 1, 2018, at 11:02 AM, Nick Doty <npdoty at ischool.berkeley.edu> wrote:
> 
> I've asked around among a few standards folks about what they would be interested to learn about the demographics or patterns of participation that we might be able to understand from mailing lists. Here is a list of potential questions. I've framed these as IETF questions, but I think they could similarly apply to other standard-setting organizations (W3C, say), and maybe in some form to other online communities.
> 
> * how many participants total in IETF work?
> * how "sticky" is participation?
> 	if people participate on a list, do they return? do they show up to f2f meetings?
> 	what's the attrition rate?
> 	what's the distribution of length of participation?
> * who has participated longest? across the most groups?
> 	is there a group of "elites" across working groups?
> 	how many participants are single-group?
> 	how many groups does the typical participant join?
> 
> As I believe I've mentioned to this group before, I've been looking into estimating gender in mailing list participation, including:
> 
> * What is the gender distribution of participants in Internet and Web technical standard-setting?
>    how does that distribution differ from the population at large? from employment at related firms?
>    does that distribution change over time?
>    are there sub-groups which have distinctly different distributions?
> * Does the gender distribution of conversation differ from the gender distribution of the participants?
> 
> Do you have questions you'd like to add to this list? Would you be interested in trying to measure/answer one of these questions? Which are the easiest and which are the most difficult? What features would we need to add to BigBang to make them answerable?
> 
> Let me know, I'd love to dive in deeper to some of these questions with collaborators.
> 
> Cheers,
> Nick

-------------- next part --------------
A non-text attachment was scrubbed...
Name: median-messages-per-group.png
Type: image/png
Size: 9290 bytes
Desc: not available
URL: <http://lists.ghserv.net/pipermail/bigbang-dev/attachments/20180216/e75f0eab/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: total-messages-vs-num-groups.png
Type: image/png
Size: 20823 bytes
Desc: not available
URL: <http://lists.ghserv.net/pipermail/bigbang-dev/attachments/20180216/e75f0eab/attachment-0001.png>