[Bigbang-dev] Are gender diversity and draft productivity correlated? THE VERDICT

Niels ten Oever mail at nielstenoever.net
Wed Sep 2 18:52:01 CEST 2020


Well, what is most shocking in this is the consistently near-zero female-identified participation in HTTPbis!

On 9/2/20 3:57 PM, Sebastian Benthall wrote:
> Hello,
> 
> Thanks for catching that.
> Indeed, the data was not correct.
> 
> I should have acted immediately when Colin suggested I look at the submissions.
> The submissions have a "document_date" field which is much more reliable than the document's "time" field.
> 
> I've fixed the data collection script and included the total number of drafts in the plot.
> https://github.com/datactive/bigbang/pull/394#issuecomment-685748156
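
A minimal sketch of how per-period draft totals might be tallied once every submission carries a `document_date`. The `Submission` record and the `drafts_per_year` helper are hypothetical stand-ins and not the BigBang or datatracker API; only the field names `name`, `rev` and `document_date` come from the thread.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Submission:
    """Hypothetical stand-in for one submitted draft revision."""
    name: str
    rev: str
    document_date: date


def drafts_per_year(submissions):
    """Count distinct (name, rev) submissions per calendar year of document_date."""
    seen = set()
    counts = Counter()
    for sub in submissions:
        key = (sub.name, sub.rev)
        if key in seen:  # skip duplicate records of the same revision
            continue
        seen.add(key)
        counts[sub.document_date.year] += 1
    return dict(sorted(counts.items()))


example = [
    Submission("draft-example-a", "00", date(2012, 3, 1)),
    Submission("draft-example-a", "01", date(2013, 6, 15)),
    Submission("draft-example-b", "00", date(2013, 9, 30)),
    Submission("draft-example-b", "00", date(2013, 9, 30)),  # duplicate record
]
print(drafts_per_year(example))
```

Printing these totals next to the plot would make an empty period easy to tell apart from a data collection gap.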
> 
> On Tue, Sep 1, 2020 at 7:09 PM Niels ten Oever <mail at nielstenoever.net> wrote:
> 
>     Hiya,
> 
>     On 9/1/20 5:04 PM, Sebastian Benthall wrote:
>     > Some updates:
>     >
>     > - A plot of mailing list activity by gender and (final) draft output, with the correlation values, is up here:
>     > https://github.com/datactive/bigbang/pull/394#issuecomment-684917057
> 
>     Interesting! In order to be able to judge https://github.com/datactive/bigbang/pull/394 it would be great if the y-axis of the graphs, and ideally also a data field in the notebook, would show the total number of drafts in a specific period. I have the feeling that the representation of the drafts is not fully correct.
> 
>     According to the graph (if I read it correctly) there should be no drafts for httpbis in the period 2012 - 2014, but a cursory glance at the datatracker [0] shows that RFC7230, 7231, 7232, 7233, 7235, 7236, and 7237 were published in 2014, and I am pretty sure these RFCs were all preceded by quite a number of drafts.
> 
>     Cheers,
> 
>     Niels
> 
>     [0] https://datatracker.ietf.org/doc/search?name=httpbis&sort=&rfcs=on&activedrafts=on&by=group&group=
> 
>     >
>     > - None of the correlations of mailing list activity with draft output is statistically significant! This reverses the previous verdict.
>     >
>     > - I've made an issue for expanding the draft metadata collection to include the submissions:
>     > https://github.com/datactive/bigbang/issues/397
>     >
>     > - Could I request a review of the code for this project thus far? It's currently languishing a bit as a PR:
>     > https://github.com/datactive/bigbang/pull/394
>     >
>     > I've got to work on a few other projects for a bit but I'm excited to hear where folks think we might go from here.
>     >
>     > Best regards,
>     > Seb
>     >
>     >
>     > On Mon, Aug 31, 2020 at 10:52 AM Sebastian Benthall <sbenthall at gmail.com> wrote:
>     >
>     >     Thank you!
>     >      
>     >
>     >         * The group is “httpbis” not “httpbisa”
>     >
>     >
>     >     Aha!
>     >
>     >     I found `httpbisa` as the closest acronym to `httpbis` on this list of IETF mailing list archives:
>     >     https://github.com/datactive/bigbang/blob/master/examples/url_collections/mm.ietf.org.txt
>     >
>     >     Niels, does it make sense that the mailing list and the working group have different names in this case? Is that common?
>     >
>     >     I can confirm that the records I pulled using the datatracker include drafts for working groups besides `httpbis`. group_from_acronym('nonsense') returns None. None passed as a group to the documents query results in a default query of all groups, I suppose.
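
A minimal sketch of a guard against the silent fallback described above, assuming only that `group_from_acronym` returns None for an unknown acronym, as reported in the thread. `require_group`, `UnknownGroupError` and `FakeTracker` are hypothetical names introduced for illustration; the fake tracker exists only so the example runs without network access.

```python
class UnknownGroupError(ValueError):
    """Raised when an acronym does not resolve to a group."""


def require_group(tracker, acronym):
    """Look up a group and fail loudly instead of passing None downstream."""
    group = tracker.group_from_acronym(acronym)
    if group is None:
        raise UnknownGroupError(f"no group found for acronym {acronym!r}")
    return group


class FakeTracker:
    """Hypothetical stand-in for the real tracker object."""
    _groups = {"httpbis": {"acronym": "httpbis"}}

    def group_from_acronym(self, acronym):
        return self._groups.get(acronym)


tracker = FakeTracker()
print(require_group(tracker, "httpbis")["acronym"])
try:
    require_group(tracker, "httpbisa")
except UnknownGroupError as err:
    print(err)
```

Failing at lookup time keeps a mistyped acronym from turning into a query over all groups.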
>     >      
>     >
>     >         Also, remember to look at the submissions to find the different versions of a draft, else you only get the most recent version. 
>     >
>     >         Try something like:
>     >
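>     >         # Assumes the ietfdata package, which provides the DataTracker class
>     >         from pathlib import Path
>     >         from ietfdata.datatracker import DataTracker
>     >
>     >         dt = DataTracker(cache_dir=Path("cache"))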
>     >
>     >         g  = dt.group_from_acronym("httpbis")
>     >         for d in dt.documents(group=g, doctype=dt.document_type_from_slug("draft")):
>     >             print("")
>     >             for sub_url in d.submissions:
>     >                 sub = dt.submission(sub_url)
>     >                 print(F"{sub.document_date.strftime('%Y-%m-%d')} {sub.name}-{sub.rev}")
>     >                 for a in sub.parse_authors():
>     >                     print(F"           {a['name']} <{a['email']}>")
>     >
>     >         This will find each submission of all the working group drafts for a particular group. It doesn’t follow the history back to the pre-working group individual submissions, but can be extended to do that if needed.
>     >
>     >
>     >     I see. Thanks again for this.
>     >
>     >     I welcome input from any stakeholders about whether "productivity" should be operationalized in terms of final draft output and/or submissions.
>     >
>     >       
>     >
>     >>             Looking at https://datatracker.ietf.org/wg/httpbis/documents/ it seems that httpbis has 48 documents. Each of these will have gone through multiple versions as a draft, but even with ~20 drafts per document (which is roughly typical), that’s not close to thousands.
>     >>
>     >>             Searching https://mailarchive.ietf.org/arch/browse/i-d-announce/?q=httpbis finds announcements for 721 internet drafts containing the string “httpbis”, which seems plausible.
>     >>
>     >>             Colin
>     >>
>     >>
>     >>
>     >>>             Another issue here is that the draft output precedes the mailing list records (see attachment). Another is that there are very few emails sent by women (or, at least, so identified by our detection method) in httpbisa:
>     >>>
>     >>>             <image.png>
>     >>>
>     >>>
>     >>>
>     >>>
>     >>>             On Wed, Aug 26, 2020 at 3:26 PM Niels ten Oever <mail at nielstenoever.net> wrote:
>     >>>
>     >>>                 Httpbis is the one you're looking for :)
>     >>>
>     >>>                 DNSops is also a nice big one.
>     >>>
>     >>>                 Cheers,
>     >>>
>     >>>                 Niels
>     >>>                 On Aug 26, 2020, at 21:17, Sebastian Benthall <sbenthall at gmail.com> wrote:
>     >>>
>     >>>                     Hmmm.
>     >>>
>     >>>                     Web mail archives of the http list at  https://ietf.org/mail-archive/text/http/ only go up to 2012.
>     >>>                     Does that make sense to you?
>     >>>
>     >>>                     It looks like there are several DNS working groups. Any one in particular you think would be worth looking at?
>     >>>
>     >>>                     Genericizing the code so that it can loop through many groups and compute results is the next step towards confirmation. Probably worth looking at a couple other concrete and well-understood examples before doing the big analysis though.
>     >>>
>     >>>                     - S
>     >>>
>     >>>                     On Wed, Aug 26, 2020 at 1:52 PM Niels ten Oever <mail at nielstenoever.net> wrote:
>     >>>
>     >>>                         Very interesting. I'd say the number of drafts and authors in hrpc is too low to make a statement about this though. Could we do this for the HTTP and/or DNS WGs?
>     >>>                         On Aug 26, 2020, at 19:30, Sebastian Benthall <sbenthall at gmail.com> wrote:
>     >>>
>     >>>                             Hello,
>     >>>
>     >>>                             I'm revisiting the question of whether mailing list gender diversity and draft productivity of working groups are correlated.
>     >>>
>     >>>                             Putting aside for now all the methodological complications, here is how I am operationalizing the question:
>     >>>
>     >>>                               * I'm looking specifically at the HRPC working group, with this data:
>     >>>                                 image.png
>     >>>                               * Gender is being detected based on first name birth records. "unknown" is used for cases that cannot be determined as either men or women with the current data set.
>     >>>                               * I'm measuring "diversity" on any day as: (women's activity + unknown's activity) / (men's activity). Because, you know, this is probably close to what most people probably mean by diversity. (Recall that non-Western names are more likely to be categorized as "unknown".)
>     >>>                               * I'm using a 100 day rolling average on the activity counts.
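
A minimal sketch of the diversity measure and rolling average described in the list above, in plain Python. The thread does not say whether smoothing happens before or after the ratio, whether the window is trailing or centered, or how days with zero men's activity are handled; this sketch assumes a trailing window, smooths first, and returns NaN for a zero denominator. The function names and sample counts are made up for illustration.

```python
import math
from collections import deque


def rolling_mean(values, window):
    """Trailing mean over up to `window` most recent values."""
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]  # this value is about to fall out of the window
        buf.append(v)
        total += v
        out.append(total / len(buf))
    return out


def diversity(women, unknown, men):
    """(women + unknown) / men per day; NaN where men's activity is zero."""
    return [
        (w + u) / m if m > 0 else math.nan
        for w, u, m in zip(women, unknown, men)
    ]


# Made-up daily message counts, not data from the thread.
women = [0, 1, 0, 2]
unknown = [1, 0, 1, 1]
men = [4, 2, 0, 6]
window = 2  # the thread uses 100 days; 2 keeps the example readable

smoothed = [rolling_mean(series, window) for series in (women, unknown, men)]
print(diversity(*smoothed))
```

Smoothing before dividing avoids most zero denominators, since a day with no messages from men still has a nonzero windowed average as long as the window contains some activity.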
>     >>>
>     >>>                             This is the matrix of Pearson correlations between each of these values:
>     >>>
>     >>>                                            women    unknown        men     drafts  diversity
>     >>>                             women       1.000000   0.910922   0.804869   0.008890   0.160833
>     >>>                             unknown     0.910922   1.000000   0.808168   0.027502   0.245059
>     >>>                             men         0.804869   0.808168   1.000000   0.015406  -0.141915
>     >>>                             drafts      0.008890   0.027502   0.015406   1.000000   0.061884
>     >>>                             diversity   0.160833   0.245059  -0.141915   0.061884   1.000000
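
A minimal sketch of how a Pearson correlation matrix like the one above can be computed, in plain Python. The thread does not show the code that produced the matrix, so the function names and the tiny input series here are illustrative assumptions and not the notebook's implementation.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        return math.nan  # undefined for a constant series
    return sum(a * b for a, b in zip(dx, dy)) / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Map each pair of column names to its Pearson correlation."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up series chosen so the correlations are easy to check by eye.
columns = {
    "men": [1, 2, 3, 4],
    "drafts": [2, 4, 6, 8],
    "diversity": [4, 3, 2, 1],
}
matrix = correlation_matrix(columns)
print(matrix["men"]["drafts"], matrix["men"]["diversity"])
```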
>     >>>
>     >>>
>     >>>                             Things to note:
>     >>>
>     >>>                               * The activity of each gender is correlated with the activity of other genders.
>     >>>                               * Diversity is anticorrelated with the number of men. This is expected based on how it was defined, and a good sanity check.
>     >>>                               * Draft output is MORE correlated with diversity than it is with any individual gender!
>     >>>
>     >>>                             This last point is quite nice. It resonates with the work of Scott Page on the value of diversity to collective intelligence, for example.
>     >>>
>     >>>                             These numbers are a bit hard to interpret. How much should we trust them? These are the /p/-values associated with each correlation:
>     >>>                                            women    unknown        men     drafts  diversity
>     >>>                             women              0          0          0     0.6925          0
>     >>>                             unknown            0          0          0      0.221          0
>     >>>                             men                0          0          0      0.493          0
>     >>>                             drafts        0.6925      0.221      0.493          0     0.0059
>     >>>                             diversity          0          0          0     0.0059          0
>     >>>
>     >>>
>     >>>                             Generally, /p/-values below .01 are considered "statistically significant", i.e. publishable.
>     >>>                             This correlation between diversity and draft output makes the cut!!
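
A minimal sketch of one way to attach a p-value to a correlation, using a permutation test in plain Python. This is a swapped-in technique for illustration and not necessarily how the p-values in the thread were computed. Like the standard test, it treats the daily observations as independent, which rolling-averaged series are not, so it is only a rough check. All names and data below are made up.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation; assumes neither series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(x, y, trials=2000, seed=0):
    """Two-sided p-value: share of shuffles with |r| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (trials + 1)  # add-one correction avoids reporting p = 0


x = list(range(20))
y = [2 * v + 1 for v in x]  # perfectly correlated with x
print(permutation_p_value(x, y))
```

The add-one correction means the smallest reportable value is 1 / (trials + 1) and never exactly 0.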
>     >>>
>     >>>                             So the verdict is: for HRPC, YES, gender diversity is correlated with draft output.
>     >>>
>     >>>                             This result is robust to transformations of the activity scores into the log space, which is comforting.
>     >>>                             Further work is needed to see if this result is robust across other IETF working groups.
>     >>>
>     >>>                             Nick, what would you say to including a result like this in the paper about IETF and gender?
>     >>>
>     >>>                             Cheers,
>     >>>                             Seb
>     >>>
>     >>>                                   
>     >>>                           
>     >>>                             _______________________________________________
>     >>>
>     >>>                             Bigbang-dev mailing list
>     >>>                             Bigbang-dev at data-activism.net
>     >>>                             https://lists.ghserv.net/mailman/listinfo/bigbang-dev
>     >>>
>     >>>             <diversity-productivity-httpbisa.png>
>     >>
>     >
>     >
>     >
>     >         -- 
>     >         Colin Perkins
>     >         https://csperkins.org/
>     >
>     >
>     >
>     >
> 
>     -- 
>     Niels ten Oever
>     Researcher and PhD Candidate - DATACTIVE Research Group - University of Amsterdam
>     Postdoctoral Scholar (abd) - Communications Department - Texas A&M University
>     Research Fellow - Centre for Internet and Human Rights - European University Viadrina
>     Associated Scholar - Centro de Tecnologia e Sociedade - Fundação Getúlio Vargas
> 
>     W: https://nielstenoever.net
>     E: mail at nielstenoever.net
>     T: @nielstenoever
>     P/S/WA: +31629051853
>     PGP: 2458 0B70 5C4A FD8A 9488 643A 0ED8 3F3A 468A C8B3
> 

