[Bigbang-dev] Are gender diversity and draft productivity correlated? THE VERDICT

Sebastian Benthall sbenthall at gmail.com
Tue Sep 1 17:04:22 CEST 2020


Some updates:

- A plot of mailing list activity by gender and (final) draft output, with
the correlation values, is up here:
https://github.com/datactive/bigbang/pull/394#issuecomment-684917057

- None of the correlations of mailing list activity with draft output is
statistically significant! This reverses the previous verdict.

- I've made an issue for expanding the draft metadata collection to include
the submissions:
https://github.com/datactive/bigbang/issues/397

- Could I request a review of the code for this project thus far? It's
currently languishing a bit as a PR:
https://github.com/datactive/bigbang/pull/394
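
For anyone reviewing, the kind of significance check behind the second
bullet can be sketched roughly as below. This is a minimal standard-library
sketch, not the code in the PR: the names `pearson_r` and
`permutation_p_value` and the sample numbers are made up for illustration,
and I have swapped in a permutation test for the p-value in place of the
usual t-based one, so the values would not match exactly.

```python
import random
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Two-sided p-value: share of shuffles whose |r| reaches the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Invented numbers for illustration only, not data from the mailing list.
activity = [3, 5, 2, 8, 7, 1, 4, 6]
drafts = [1, 0, 2, 1, 0, 1, 2, 0]
r = pearson_r(activity, drafts)
p = permutation_p_value(activity, drafts)
print(f"r = {r:.3f}, p = {p:.3f}")
```

A constant series makes the denominator zero, so such series would need to
be filtered out before calling `pearson_r`.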

I've got to work on a few other projects for a bit but I'm excited to hear
where folks think we might go from here.

Best regards,
Seb


On Mon, Aug 31, 2020 at 10:52 AM Sebastian Benthall <sbenthall at gmail.com>
wrote:

> Thank you!
>
>
>> * The group is “httpbis” not “httpbisa”
>>
>
> Aha!
>
> I found `httpbisa` as the closest acronym to `httpbis` on this list of
> IETF mailing list archives:
>
> https://github.com/datactive/bigbang/blob/master/examples/url_collections/mm.ietf.org.txt
>
> Niels, does it make sense that the mailing list and the working group have
> different names in this case? Is that common?
>
> I can confirm that the records I pulled using the datatracker include
> drafts for working groups besides `httpbis`. group_from_acronym('nonsense')
> returns None. None passed as a group to the documents query results in a
> default query of all groups, I suppose.
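
A guard along these lines would stop a mistyped acronym from silently
widening the query. It is only a sketch: `drafts_for_group` is a
hypothetical helper, and `dt` is assumed to be a DataTracker-like object
exposing the three methods already used in this thread.

```python
def drafts_for_group(dt, acronym):
    """Return the draft documents for one group, refusing unknown acronyms.

    `dt` is assumed to provide group_from_acronym, document_type_from_slug
    and documents, as in the snippet quoted in this thread.
    """
    group = dt.group_from_acronym(acronym)
    if group is None:
        # Without this check, group=None appears to query every group.
        raise ValueError(f"no datatracker group with acronym {acronym!r}")
    draft_type = dt.document_type_from_slug("draft")
    return dt.documents(group=group, doctype=draft_type)
```

With this in place, `drafts_for_group(dt, "httpbisa")` would raise instead
of returning drafts from unrelated working groups.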
>
>
>> Also, remember to look at the submissions to find the different versions
>> of a draft, else you only get the most recent version.
>>
>> Try something like:
>>
>> from pathlib import Path
>>
>> from ietfdata.datatracker import DataTracker
>>
>> dt = DataTracker(cache_dir=Path("cache"))
>>
>> g = dt.group_from_acronym("httpbis")
>> # One blank line per draft, then one line per submitted revision.
>> for d in dt.documents(group=g, doctype=dt.document_type_from_slug("draft")):
>>     print("")
>>     for sub_url in d.submissions:
>>         sub = dt.submission(sub_url)
>>         print(f"{sub.document_date.strftime('%Y-%m-%d')} {sub.name}-{sub.rev}")
>>         for a in sub.parse_authors():
>>             print(f"           {a['name']} <{a['email']}>")
>>
>> This will find each submission of all the working group drafts for a
>> particular group. It doesn’t follow the history back to the pre-working
>> group individual submissions, but can be extended to do that if needed.
>>
>
> I see. Thanks again for this.
>
> I welcome input from any stakeholders about whether "productivity"
> should be operationalized in terms of final draft output and/or submissions.
>
>
>
>> Looking at https://datatracker.ietf.org/wg/httpbis/documents/ it seems
>>> that httpbis has 48 documents. Each of these will have gone through
>>> multiple versions as a draft, but even with ~20 drafts per document (which
>>> is roughly typical), that’s not close to thousands.
>>>
>>> Searching
>>> https://mailarchive.ietf.org/arch/browse/i-d-announce/?q=httpbis finds
>>> announcements for 721 internet drafts containing the string “httpbis”,
>>> which seems plausible.
>>>
>>> Colin
>>>
>>>
>>>
>>> Another issue here is that the draft output precedes the mailing list
>>> records (see attachment). Another is that there are very few emails sent by
>>> women (or, at least, identifiable as such by our detection method) in httpbisa:
>>>
>>> <image.png>
>>>
>>>
>>>
>>>
>>> On Wed, Aug 26, 2020 at 3:26 PM Niels ten Oever <mail at nielstenoever.net>
>>> wrote:
>>>
>>>> Httpbis is the one you're looking for :)
>>>>
>>>> DNSops is also a nice big one.
>>>>
>>>> Cheers,
>>>>
>>>> Niels
>>>> On Aug 26, 2020, at 21:17, Sebastian Benthall <sbenthall at gmail.com>
>>>> wrote:
>>>>>
>>>>> Hmmm.
>>>>>
>>>>> Web mail archives of the http list at
>>>>> https://ietf.org/mail-archive/text/http/ only go up to 2012.
>>>>> Does that make sense to you?
>>>>>
>>>>> It looks like there are several DNS working groups. Any one in
>>>>> particular you think would be worth looking at?
>>>>>
>>>>> Genericizing the code so that it can loop through many groups and
>>>>> compute results is the next step towards confirmation. Probably worth
>>>>> looking at a couple other concrete and well-understood examples before
>>>>> doing the big analysis though.
>>>>>
>>>>> - S
>>>>>
>>>>> On Wed, Aug 26, 2020 at 1:52 PM Niels ten Oever <
>>>>> mail at nielstenoever.net> wrote:
>>>>>
>>>>>> Very interesting. I'd say the number of drafts and authors in hrpc is
>>>>>> too low to make a statement about this though. Could we do this for the
>>>>>> HTTP and/or DNS WGs ?
>>>>>> On Aug 26, 2020, at 19:30, Sebastian Benthall < sbenthall at gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I'm revisiting the question of whether mailing list gender diversity
>>>>>>> and draft productivity of working groups are correlated.
>>>>>>>
>>>>>>> Putting aside for now all the methodological complications, here is
>>>>>>> how I am operationalizing the question:
>>>>>>>
>>>>>>>    - I'm looking specifically at the HRPC working group, with this
>>>>>>>    data:
>>>>>>>    [image: image.png]
>>>>>>>    - Gender is being detected based on first name birth records.
>>>>>>>    "unknown" is used for cases that cannot with the current data set be
>>>>>>>    determined as either men or women.
>>>>>>>    - I'm measuring "diversity" on any day as: (women's activity +
>>>>>>>    unknown's activity) / (men's activity). Because, you know, this is
>>>>>>>    close to what most people probably mean by diversity. (Recall that
>>>>>>>    non-Western names are more likely to be categorized as "unknown".)
>>>>>>>    - I'm using a 100 day rolling average on the activity counts.
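
The diversity measure and the smoothing described in the list above can be
sketched as follows. This is a standard-library illustration under stated
assumptions, not the project code: the names `rolling_mean` and `diversity`
are hypothetical, I am assuming the counts are smoothed first and the ratio
taken afterwards, that the window is trailing and simply shorter at the
start of the series, and that days with zero smoothed activity from men are
reported as None.

```python
from collections import deque


def rolling_mean(values, window=100):
    """Trailing rolling mean; early entries average whatever is available."""
    out, recent, total = [], deque(), 0.0
    for v in values:
        recent.append(v)
        total += v
        if len(recent) > window:
            total -= recent.popleft()
        out.append(total / len(recent))
    return out


def diversity(women, unknown, men, window=100):
    """(women + unknown) / men per day, computed on smoothed counts."""
    w, u, m = (rolling_mean(s, window) for s in (women, unknown, men))
    # None marks days where the smoothed men's count is zero (ratio undefined).
    return [(wi + ui) / mi if mi else None for wi, ui, mi in zip(w, u, m)]


# Toy daily message counts, invented for illustration.
women = [1, 0, 2, 1]
unknown = [0, 1, 1, 0]
men = [4, 2, 0, 6]
print(diversity(women, unknown, men, window=2))
```

Taking the ratio before smoothing would give different numbers, so the
order is worth stating explicitly wherever the measure is reported.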
>>>>>>>
>>>>>>> This is the matrix of Pearson correlations between each of these
>>>>>>> values:
>>>>>>>
>>>>>>>                women   unknown       men    drafts diversity
>>>>>>> women       1.000000  0.910922  0.804869  0.008890  0.160833
>>>>>>> unknown     0.910922  1.000000  0.808168  0.027502  0.245059
>>>>>>> men         0.804869  0.808168  1.000000  0.015406 -0.141915
>>>>>>> drafts      0.008890  0.027502  0.015406  1.000000  0.061884
>>>>>>> diversity   0.160833  0.245059 -0.141915  0.061884  1.000000
>>>>>>>
>>>>>>> Things to note:
>>>>>>>
>>>>>>>    - The activity of each gender is correlated with the activity of
>>>>>>>    other genders.
>>>>>>>    - Diversity is anticorrelated with the number of men. This is
>>>>>>>    expected based on how it was defined, and a good sanity check.
>>>>>>>    - Draft output is MORE correlated with diversity than it is with
>>>>>>>    any individual gender!
>>>>>>>
>>>>>>> This last point is quite nice. It resonates with the work of Scott
>>>>>>> Page on the value of diversity to collective intelligence, for example.
>>>>>>>
>>>>>>> These numbers are a bit hard to interpret. How much should we trust
>>>>>>> them? These are the *p*-values associated with each correlation:
>>>>>>>
>>>>>>>                women   unknown       men    drafts diversity
>>>>>>> women              0         0         0    0.6925         0
>>>>>>> unknown            0         0         0     0.221         0
>>>>>>> men                0         0         0     0.493         0
>>>>>>> drafts        0.6925     0.221     0.493         0    0.0059
>>>>>>> diversity          0         0         0    0.0059         0
>>>>>>>
>>>>>>> Generally, *p*-values below .01 are considered "statistically
>>>>>>> significant", i.e. publishable.
>>>>>>> This correlation between diversity and draft output makes the cut!!
>>>>>>>
>>>>>>> So the verdict is: for HRPC, YES, gender diversity is correlated
>>>>>>> with draft output.
>>>>>>>
>>>>>>> This result is robust to transformations of the activity scores into
>>>>>>> the log space, which is comforting.
>>>>>>> Further work is needed to see if this result is robust across other
>>>>>>> IETF working groups.
>>>>>>>
>>>>>>> Nick, what would you say to including a result like this in the
>>>>>>> paper about IETF and gender?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Seb
>>>>>>>
>>>>>>>       ------------------------------
>>>>>>>
>>>>>>> Bigbang-dev mailing list
>>>>>>> Bigbang-dev at data-activism.net
>>>>>>> https://lists.ghserv.net/mailman/listinfo/bigbang-dev
>>>>>>>
>>>>>>> <diversity-productivity-httpbisa.png>
>>>
>>>
>>>
>>
>>
>> --
>> Colin Perkins
>> https://csperkins.org/
>>
>>
>>
>>
>>