[Bigbang-dev] Are gender diversity and draft productivity correlated? THE VERDICT

Sebastian Benthall sbenthall at gmail.com
Mon Aug 31 16:52:49 CEST 2020


Thank you!


> * The group is “httpbis” not “httpbisa”
>

Aha!

I found `httpbisa` as the closest acronym to `httpbis` on this list of IETF
mailing list archives:
https://github.com/datactive/bigbang/blob/master/examples/url_collections/mm.ietf.org.txt

Niels, does it make sense that the mailing list and the working group have
different names in this case? Is that common?

I can confirm that the records I pulled using the datatracker include
drafts for working groups besides `httpbis`. group_from_acronym('nonsense')
returns None, and passing None as the group to the documents query
apparently falls back to a default query over all groups.
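
A minimal sketch of a guard against this, assuming the behaviour described
above (the lookup returns None for an unknown acronym). `require_group` and
`fake_lookup` are hypothetical names made up for illustration and are not
part of the datatracker library; in real use the lookup argument would be
something like dt.group_from_acronym.

```python
from typing import Callable, Optional, TypeVar

G = TypeVar("G")


def require_group(lookup: Callable[[str], Optional[G]], acronym: str) -> G:
    """Return the group for `acronym`, or raise instead of passing None on."""
    group = lookup(acronym)
    if group is None:
        # Fail loudly so a typo cannot silently turn into an all-groups query.
        raise ValueError(f"no group with acronym {acronym!r}")
    return group


# Stand-in for dt.group_from_acronym, for illustration only.
def fake_lookup(acronym: str) -> Optional[str]:
    known = {"httpbis": "group:httpbis"}
    return known.get(acronym)
```

With a check like this, a misspelled acronym such as `httpbisa` would raise
an error up front rather than pulling in drafts from every working group.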


> Also, remember to look at the submissions to find the different versions
> of a draft, else you only get the most recent version.
>
> Try something like:
>
> # Imports assumed from the ietfdata package.
> from pathlib import Path
> from ietfdata.datatracker import DataTracker
>
> dt = DataTracker(cache_dir=Path("cache"))
>
> g  = dt.group_from_acronym("httpbis")
> for d in dt.documents(group=g,
>                       doctype=dt.document_type_from_slug("draft")):
>     print("")
>     # Each submission is one posted revision of the draft.
>     for sub_url in d.submissions:
>         sub = dt.submission(sub_url)
>         print(f"{sub.document_date.strftime('%Y-%m-%d')} {sub.name}-{sub.rev}")
>         for a in sub.parse_authors():
>             print(f"           {a['name']} <{a['email']}>")
>
> This will find each submission of all the working group drafts for a
> particular group. It doesn’t follow the history back to the pre-working
> group individual submissions, but can be extended to do that if needed.
>

I see. Thanks again for this.

I welcome input from any stakeholders about whether "productivity"
should be operationalized in terms of final draft output and/or submissions.
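
To make the two options concrete, here is a rough sketch of both counts
over the same records. The `Submission` record is a hypothetical stand-in
that mirrors the sub.name, sub.rev and sub.document_date fields used in the
snippet above, and treating "the latest revision" as the final draft is my
own simplifying assumption.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, NamedTuple


class Submission(NamedTuple):
    # Hypothetical record; mirrors sub.name, sub.rev, sub.document_date.
    name: str
    rev: str
    document_date: date


def submissions_per_year(subs: Iterable[Submission]) -> Counter:
    """Count every submitted revision, grouped by year."""
    return Counter(s.document_date.year for s in subs)


def final_drafts_per_year(subs: Iterable[Submission]) -> Counter:
    """Count each draft once, in the year of its latest revision."""
    latest: Dict[str, date] = {}
    for s in subs:
        if s.name not in latest or s.document_date > latest[s.name]:
            latest[s.name] = s.document_date
    return Counter(d.year for d in latest.values())
```

The first measure rewards drafts that go through many revisions; the second
counts each document once, so the two can diverge a lot for a group with a
few heavily revised drafts.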



>> Looking at https://datatracker.ietf.org/wg/httpbis/documents/ it seems
>> that httpbis has 48 documents. Each of these will have gone through
>> multiple versions as a draft, but even with ~20 drafts per document (which
>> is roughly typical), that’s not close to thousands.
>>
>> Searching
>> https://mailarchive.ietf.org/arch/browse/i-d-announce/?q=httpbis finds
>> announcements for 721 internet drafts containing the string “httpbis”,
>> which seems plausible.
>>
>> Colin
>>
>>
>>
>> Another issue here is that the draft output precedes the mailing list
>> records (see attachment). Another is that there are very few emails sent by
>> women (or, so identifiable by our detection method) in httpbisa:
>>
>> <image.png>
>>
>>
>>
>>
>> On Wed, Aug 26, 2020 at 3:26 PM Niels ten Oever <mail at nielstenoever.net>
>> wrote:
>>
>>> Httpbis is the one you're looking for :)
>>>
>>> DNSops is also a nice big one.
>>>
>>> Cheers,
>>>
>>> Niels
>>> On Aug 26, 2020, at 21:17, Sebastian Benthall <sbenthall at gmail.com>
>>> wrote:
>>>>
>>>> Hmmm.
>>>>
>>>> Web mail archives of the http list at
>>>> https://ietf.org/mail-archive/text/http/ only go up to 2012.
>>>> Does that make sense to you?
>>>>
>>>> It looks like there are several DNS working groups. Any one in
>>>> particular you think would be worth looking at?
>>>>
>>>> Genericizing the code so that it can loop through many groups and
>>>> compute results is the next step towards confirmation. Probably worth
>>>> looking at a couple other concrete and well-understood examples before
>>>> doing the big analysis though.
>>>>
>>>> - S
>>>>
>>>> On Wed, Aug 26, 2020 at 1:52 PM Niels ten Oever <
>>>> mail at nielstenoever.net> wrote:
>>>>
>>>>> Very interesting. I'd say the number of drafts and authors in hrpc is
>>>>> too low to make a statement about this though. Could we do this for the
>>>>> HTTP and/or DNS WGs ?
>>>>> On Aug 26, 2020, at 19:30, Sebastian Benthall < sbenthall at gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I'm revisiting the question of whether mailing list gender diversity
>>>>>> and draft productivity of working groups are correlated.
>>>>>>
>>>>>> Putting aside for now all the methodological complications, here is
>>>>>> how I am operationalizing the question:
>>>>>>
>>>>>>    - I'm looking specifically at the HRPC working group, with this
>>>>>>    data:
>>>>>>    [image: image.png]
>>>>>>    - Gender is being detected based on first name birth records.
>>>>>>    "unknown" is used for cases that cannot with the current data set be
>>>>>>    determined as either men or women.
>>>>>>    - I'm measuring "diversity" on any day as: (women's activity +
>>>>>>    unknown's activity) / (men's activity). Because, you know, this is
>>>>>>    probably close to what most people mean by diversity. (Recall that
>>>>>>    non-Western names are more likely to be categorized as "unknown".)
>>>>>>    - I'm using a 100 day rolling average on the activity counts.
>>>>>>
>>>>>> This is the matrix of Pearson correlations between each of these
>>>>>> values:
>>>>>>
>>>>>>                women   unknown       men    drafts diversity
>>>>>> women       1.000000  0.910922  0.804869  0.008890  0.160833
>>>>>> unknown     0.910922  1.000000  0.808168  0.027502  0.245059
>>>>>> men         0.804869  0.808168  1.000000  0.015406 -0.141915
>>>>>> drafts      0.008890  0.027502  0.015406  1.000000  0.061884
>>>>>> diversity   0.160833  0.245059 -0.141915  0.061884  1.000000
>>>>>>
>>>>>> Things to note:
>>>>>>
>>>>>>    - The activity of each gender is correlated with the activity of
>>>>>>    other genders.
>>>>>>    - Diversity is anticorrelated with the number of men. This is
>>>>>>    expected based on how it was defined, and a good sanity check.
>>>>>>    - Draft output is MORE correlated with diversity than it is with
>>>>>>    any individual gender!
>>>>>>
>>>>>> This last point is quite nice. It resonates with the work of Scott
>>>>>> Page on the value of diversity to collective intelligence, for example.
>>>>>>
>>>>>> These numbers are a bit hard to interpret. How much should we trust
>>>>>> them? These are the *p*-values associated with each correlation:
>>>>>>
>>>>>>                women   unknown       men    drafts diversity
>>>>>> women              0         0         0    0.6925         0
>>>>>> unknown            0         0         0     0.221         0
>>>>>> men                0         0         0     0.493         0
>>>>>> drafts        0.6925     0.221     0.493         0    0.0059
>>>>>> diversity          0         0         0    0.0059         0
>>>>>>
>>>>>> Generally, *p*-values below .05 (or, more strictly, .01) are considered
>>>>>> "statistically significant", i.e. publishable.
>>>>>> This correlation between diversity and draft output makes the cut!!
>>>>>>
>>>>>> So the verdict is: for HRPC, YES, gender diversity is correlated with
>>>>>> draft output.
>>>>>>
>>>>>> This result is robust to transformations of the activity scores into
>>>>>> the log space, which is comforting.
>>>>>> Further work is needed to see if this result is robust across other
>>>>>> IETF working groups.
>>>>>>
>>>>>> Nick, what would you say to including a result like this in the paper
>>>>>> about IETF and gender?
>>>>>>
>>>>>> Cheers,
>>>>>> Seb
>>>>>>
>>>>>>       ------------------------------
>>>>>>
>>>>>> Bigbang-dev mailing list
>>>>>> Bigbang-dev at data-activism.net
>>>>>> https://lists.ghserv.net/mailman/listinfo/bigbang-dev
>>>>>>
>>>>>> <diversity-productivity-httpbisa.png>
>
>
> --
> Colin Perkins
> https://csperkins.org/