<html><head><meta http-equiv="Content-Type" content="text/html; charset=us-ascii"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><div class="">I've been working on the IETF mailing list corpus that I have thus far. These aren't direct answers to the proposed questions, but they are perhaps enough data that I can start getting your input on what I need to do to answer them.</div><div class=""><br class=""></div><div class="">In this corpus of 316 different mailing list archives, 101,510 "people" sent a combined total of 1.2 million messages. Most people sent only 1 message; only a quarter of people sent more than 2. Confirming Seb's assumption, there is an extremely long tail.</div><div class=""><br class=""></div><div class="">The distribution of each person's median number of messages per group has a different shape than the obviously long-tailed distributions. While most people have a small number (because most people have sent few messages total) and the numbers decrease from there, there seems to be a flatter place on the curve: a bunching towards sending on the order of 100 messages per group.</div><div class="">[see attached histogram median-messages-per-group]</div><div class=""><br class=""></div><div class="">I'm also looking at comparing the number of messages sent with the number of groups sent to, in the hope of finding a cluster of people who have disproportionate connecting ability and might be an elite of interest. 
On this second graph there seem to be two natural-log-shaped curves: people who send a couple of messages to lots of lists, and people who have sent a lot of messages to a lot of lists. That second curve seems worthy of investigation to me.</div><div class="">[see attached scatter plot total-messages-vs-num-groups]</div><div class=""><br class=""></div><div class="">I'm also curious about the clustering possible now that I have, for each person, a vector of the number of messages sent to each of a large number of groups. Is there some latent semantic analysis or vector-similarity clustering I can do to find groupings of people? Could we recommend email lists to people based on what other lists similar people contribute to?</div><div class=""><br class=""></div>I'm concerned that spam might seriously confuse or skew the results of my analysis. Many of the single-message senders are spammers, but I don't know how to figure out how many, or what important participation I'll ignore if I simply exclude them. 
One mailing list archive had collected so much spam in the years since the mailing list was informally closed that I had to delete large amounts of it just to get my scripts to run in the RAM I had available.<div class=""><br class=""></div><div class="">Here's the fork/branch where I'm doing this scientific analysis:</div><div class=""><a href="https://github.com/npdoty/bigbang/tree/ietf-participation" class="">https://github.com/npdoty/bigbang/tree/ietf-participation</a></div><div class=""><br class=""></div><div class="">And here's the branch with some additional scripts and improved error handling for core that I've developed in the meantime:</div><div class=""><a href="https://github.com/npdoty/bigbang/tree/archive-activity" class="">https://github.com/npdoty/bigbang/tree/archive-activity</a></div><div class="">(This will be a pull request against datactive/bigbang at some point; I can do that now, or add a little more to it as I fix additional issues.)</div><div class=""><br class=""><div class=""><br class=""><div><img apple-inline="yes" id="7662373F-3C44-4229-BCDC-E19165A7912C" src="cid:16B752F9-657D-4878-BDA3-4F03F058ADA6" class=""><img apple-inline="yes" id="A3FF7147-F46E-4273-8608-ADE1329CB5CF" src="cid:05248BDC-EBA4-43F9-87E7-011518325F86" class=""><br class=""><blockquote type="cite" class=""><div class="">On Feb 1, 2018, at 11:02 AM, Nick Doty &lt;<a href="mailto:npdoty@ischool.berkeley.edu" class="">npdoty@ischool.berkeley.edu</a>&gt; wrote:</div><br class="Apple-interchange-newline"><div class=""><div class="">I've asked around among a few standards folks about what they would be interested to learn about the demographics or patterns of participation that we might be able to understand from mailing lists. Here is a list of potential questions. 
I've framed these as IETF questions, but I think they could similarly apply to other standard-setting organizations (W3C, say), and maybe in some form to other online communities.<br class=""><br class="">* how many participants total in IETF work?<br class="">* how "sticky" is participation?<br class=""><span class="Apple-tab-span" style="white-space:pre"> </span>if people participate on a list, do they return? do they show up to f2f meetings?<br class=""><span class="Apple-tab-span" style="white-space:pre"> </span>what's the attrition rate?<br class=""><span class="Apple-tab-span" style="white-space:pre"> </span>what's the distribution of length of participation?<br class="">* who has participated longest? across the most groups?<br class=""><span class="Apple-tab-span" style="white-space:pre"> </span>is there a group of "elites" across working groups?<br class=""><span class="Apple-tab-span" style="white-space:pre"> </span>how many participants are single-group?<br class=""><span class="Apple-tab-span" style="white-space:pre"> </span>how many groups does the typical participant join?<br class=""><br class="">As I believe I've mentioned to this group before, I've been looking into estimating gender in mailing list participation, including:<br class=""><br class="">* What is the gender distribution of participants in Internet and Web technical standard-setting?<br class=""> how does that distribution differ from the population at large? from employment at related firms?<br class=""> does that distribution change over time?<br class=""> are there sub-groups which have distinctly different distributions?<br class="">* Does the gender distribution of conversation differ from the gender distribution of the participants?<br class=""><br class="">Do you have questions you'd like to add to this list? Would you be interested in trying to measure/answer one of these questions? Which are the easiest and which are the most difficult? 
What features would we need to add to BigBang to make them answerable?<br class=""><br class="">Let me know, I'd love to dive in deeper to some of these questions with collaborators.<br class=""><br class="">Cheers,<br class="">Nick<br class=""></div></div></blockquote></div><br class=""></div></div></body></html>