[Bigbang-dev] Are gender diversity and draft productivity correlated? THE VERDICT

Sebastian Benthall sbenthall at gmail.com
Wed Sep 2 19:53:27 CEST 2020


Maybe the reasons for this are similar to those for why so few women
participate in open source. There are good presentations of empirical
research about this in this video:
https://www.youtube.com/watch?v=d5XkVHQGqH4



On Wed, Sep 2, 2020 at 12:52 PM Niels ten Oever <mail at nielstenoever.net>
wrote:

> Well, what is most shocking from this is the consistently near zero female
> identified participation in HTTPbis!
>
> On 9/2/20 3:57 PM, Sebastian Benthall wrote:
> > Hello,
> >
> > Thanks for catching that.
> > Indeed, the data was not correct.
> >
> > I should have acted immediately when Colin suggested I look at the submissions.
> > The submissions have a "document_date" field which is much more reliable than the document's "time" field.
> >
> > I've fixed the data collection script and included the total number of drafts in the plot.
> > https://github.com/datactive/bigbang/pull/394#issuecomment-685748156
> >
> > On Tue, Sep 1, 2020 at 7:09 PM Niels ten Oever <mail at nielstenoever.net> wrote:
> >
> >     Hiya,
> >
> >     On 9/1/20 5:04 PM, Sebastian Benthall wrote:
> >     > Some updates:
> >     >
> >     > - A plot of mailing list activity, by gender, and (final) draft output, with the correlation values, is up here:
> >     >
> https://github.com/datactive/bigbang/pull/394#issuecomment-684917057
> >
> >     Interesting! In order to be able to judge https://github.com/datactive/bigbang/pull/394 it would be great if the y-axis of the graphs, and ideally also a data field in the notebook, would show the total number of drafts in a specific period. I have the feeling that the representation of the drafts is not fully correct.
> >
> >     According to the graph (if I read it correctly) there should be no drafts for httpbis in the period 2012 - 2014, but a cursory glance at the datatracker [0] shows that RFC7230, 7231, 7232, 7233, 7235, 7236, and 7237 were published in 2014, and I am pretty sure these RFCs were all preceded by quite a number of drafts.
> >
> >     Cheers,
> >
> >     Niels
> >
> >     [0]
> https://datatracker.ietf.org/doc/search?name=httpbis&sort=&rfcs=on&activedrafts=on&by=group&group=
> >
> >     >
> >     > - None of the correlations of mailing list activity with draft output is statistically significant! This reverses the previous verdict.
> >     >
> >     > - I've made an issue for expanding the draft metadata collection to include the submissions:
> >     > https://github.com/datactive/bigbang/issues/397
> >     >
> >     > - Could I request a review of the code for this project thus far?
> It's currently languishing a bit as a PR:
> >     > https://github.com/datactive/bigbang/pull/394
> >     >
> >     > I've got to work on a few other projects for a bit but I'm excited
> to hear where folks think we might go from here.
> >     >
> >     > Best regards,
> >     > Seb
> >     >
> >     >
> >     > On Mon, Aug 31, 2020 at 10:52 AM Sebastian Benthall <sbenthall at gmail.com> wrote:
> >     >
> >     >     Thank you!
> >     >
> >     >
> >     >         * The group is “httpbis” not “httpbisa”
> >     >
> >     >
> >     >     Aha!
> >     >
> >     >     I found `httpbisa` as the closest acronym to `httpbis` on this
> list of IETF mailing list archives:
> >     >
> https://github.com/datactive/bigbang/blob/master/examples/url_collections/mm.ietf.org.txt
> >     >
> >     >     Niels, does it make sense that the mailing list and the
> working group have different names in this case? Is that common?
> >     >
> >     >     I can confirm that the records I pulled using the datatracker include drafts for working groups besides `httpbis`. `group_from_acronym('nonsense')` returns `None`. `None` passed as a group to the documents query results in a default query of all groups, I suppose.
> >     >
> >     >
> >     >         Also, remember to look at the submissions to find the
> different versions of a draft, else you only get the most recent version.
> >     >
> >     >         Try something like:
> >     >
> >     >         from pathlib import Path
> >     >         from ietfdata.datatracker import DataTracker
> >     >
> >     >         dt = DataTracker(cache_dir=Path("cache"))
> >     >
> >     >         # Look up the working group, then walk every draft it owns.
> >     >         g  = dt.group_from_acronym("httpbis")
> >     >         for d in dt.documents(group=g, doctype=dt.document_type_from_slug("draft")):
> >     >             print("")
> >     >             # Each submission is one revision of the draft.
> >     >             for sub_url in d.submissions:
> >     >                 sub = dt.submission(sub_url)
> >     >                 print(f"{sub.document_date.strftime('%Y-%m-%d')} {sub.name}-{sub.rev}")
> >     >                 for a in sub.parse_authors():
> >     >                     print(f"           {a['name']} <{a['email']}>")
> >     >
> >     >         This will find each submission of all the working group drafts for a particular group. It doesn’t follow the history back to the pre-working group individual submissions, but can be extended to do that if needed.
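
Once every submission has been enumerated this way, the total number of drafts per period can be tallied from the submission dates. The following is a rough sketch under stated assumptions: the `Submission` tuple and `submissions_per_year` are hypothetical names, the records are made up for illustration, and only the `document_date`, `name` and `rev` fields mirror what the snippet above prints.

```python
from collections import Counter
from datetime import date
from typing import Iterable, NamedTuple


class Submission(NamedTuple):
    """Hypothetical stand-in for one revision of a draft."""
    name: str
    rev: str
    document_date: date


def submissions_per_year(subs: Iterable[Submission]) -> Counter:
    """Count submissions by the year of their document_date."""
    return Counter(s.document_date.year for s in subs)


# Made-up records, for illustration only.
example = [
    Submission("draft-example-a", "00", date(2012, 3, 1)),
    Submission("draft-example-a", "01", date(2012, 9, 15)),
    Submission("draft-example-b", "00", date(2013, 1, 20)),
]
print(sorted(submissions_per_year(example).items()))
```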
> >     >
> >     >
> >     >     I see. Thanks again for this.
> >     >
> >     >     I welcome input from any stakeholders about whether "productivity" should be operationalized in terms of final draft output and/or submissions.
> >     >
> >     >
> >     >
> >     >>             Looking at https://datatracker.ietf.org/wg/httpbis/documents/ it seems that httpbis has 48 documents. Each of these will have gone through multiple versions as a draft, but even with ~20 drafts per document (which is roughly typical), that’s not close to thousands.
> >     >>
> >     >>             Searching
> https://mailarchive.ietf.org/arch/browse/i-d-announce/?q=httpbis finds
> announcements for 721 internet drafts containing the string “httpbis”,
> which seems plausible.
> >     >>
> >     >>             Colin
> >     >>
> >     >>
> >     >>
> >     >>>             Another issue here is that the draft output precedes the mailing list records (see attachment). Another is that there are very few emails sent by women (or, so identifiable by our detection method) in httpbisa:
> >     >>>
> >     >>>             <image.png>
> >     >>>
> >     >>>
> >     >>>
> >     >>>
> >     >>>             On Wed, Aug 26, 2020 at 3:26 PM Niels ten Oever <mail at nielstenoever.net> wrote:
> >     >>>
> >     >>>                 Httpbis is the one you're looking for :)
> >     >>>
> >     >>>                 DNSops is also a nice big one.
> >     >>>
> >     >>>                 Cheers,
> >     >>>
> >     >>>                 Niels
> >     >>>                 On Aug 26, 2020, at 21:17, Sebastian Benthall <sbenthall at gmail.com> wrote:
> >     >>>
> >     >>>                     Hmmm.
> >     >>>
> >     >>>                     Web mail archives of the http list at
> https://ietf.org/mail-archive/text/http/ only go up to 2012.
> >     >>>                     Does that make sense to you?
> >     >>>
> >     >>>                     It looks like there are several DNS working
> groups. Any one in particular you think would be worth looking at?
> >     >>>
> >     >>>                     Genericizing the code so that it can loop
> through many groups and compute results is the next step towards
> confirmation. Probably worth looking at a couple other concrete and
> well-understood examples before doing the big analysis though.
> >     >>>
> >     >>>                     - S
> >     >>>
> >     >>>                     On Wed, Aug 26, 2020 at 1:52 PM Niels ten Oever <mail at nielstenoever.net> wrote:
> >     >>>
> >     >>>                         Very interesting. I'd say the number of drafts and authors in hrpc is too low to make a statement about this though. Could we do this for the HTTP and/or DNS WGs?
> >     >>>                         On Aug 26, 2020, at 19:30, Sebastian Benthall <sbenthall at gmail.com> wrote:
> >     >>>
> >     >>>                             Hello,
> >     >>>
> >     >>>                             I'm revisiting the question of
> whether mailing list gender diversity and draft productivity of working
> groups are correlated.
> >     >>>
> >     >>>                             Putting aside for now all the
> methodological complications, here is how I am operationalizing the
> question:
> >     >>>
> >     >>>                               * I'm looking specifically at the HRPC working group, with this data:
> >     >>>                                 image.png
> >     >>>                               * Gender is being detected based on first name birth records. "unknown" is used for cases that cannot, with the current data set, be determined as either men or women.
> >     >>>                               * I'm measuring "diversity" on any day as: (women's activity + unknown's activity) / (men's activity). Because, you know, this is probably close to what most people mean by diversity. (Recall that non-Western names are more likely to be categorized as "unknown".)
> >     >>>                               * I'm using a 100 day rolling average on the activity counts.
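
The operationalization above can be sketched in a few lines. This is an illustration rather than the code from the pull request: `rolling_mean` and `diversity` are hypothetical names, the counts are made up, the window is shortened to 2 days so the toy data stays readable (the thread uses 100), and it assumes the ratio is taken after smoothing the counts.

```python
from collections import deque
from typing import Iterable, List, Optional


def rolling_mean(values: Iterable[float], window: int) -> List[float]:
    """Trailing mean over up to `window` most recent values."""
    buf: deque = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]  # oldest value is about to be evicted
        buf.append(v)
        total += v
        out.append(total / len(buf))
    return out


def diversity(women: float, unknown: float, men: float) -> Optional[float]:
    """(women + unknown) / men; None when there is no activity by men."""
    if men == 0:
        return None
    return (women + unknown) / men


# Made-up daily message counts, for illustration only.
women = rolling_mean([0, 1, 2, 0], window=2)
unknown = rolling_mean([1, 1, 0, 2], window=2)
men = rolling_mean([4, 2, 6, 4], window=2)
scores = [diversity(w, u, m) for w, u, m in zip(women, unknown, men)]
print(scores)
```

Returning None for days with no activity by men is one possible choice; the thread does not say how that case is handled.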
> >     >>>
> >     >>>                             This is the matrix of Pearson correlations between each of these values:
> >     >>>
> >     >>>                                            women   unknown       men    drafts diversity
> >     >>>                             women       1.000000  0.910922  0.804869  0.008890  0.160833
> >     >>>                             unknown     0.910922  1.000000  0.808168  0.027502  0.245059
> >     >>>                             men         0.804869  0.808168  1.000000  0.015406 -0.141915
> >     >>>                             drafts      0.008890  0.027502  0.015406  1.000000  0.061884
> >     >>>                             diversity   0.160833  0.245059 -0.141915  0.061884  1.000000
> >     >>>
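
For readers who want to reproduce a matrix of this shape, a dependency-free sketch of the Pearson computation follows. The function names and the toy series are assumptions for illustration; the thread does not show which library produced the numbers above.

```python
from math import sqrt
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    # A constant series has zero variance and no defined correlation.
    return sxy / sqrt(sxx * syy)


def correlation_matrix(
    series: Dict[str, Sequence[float]]
) -> Dict[str, Dict[str, float]]:
    """Pairwise Pearson correlations between all named series."""
    return {
        a: {b: pearson(xa, xb) for b, xb in series.items()}
        for a, xa in series.items()
    }


# Toy series, for illustration only.
toy = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1]}
matrix = correlation_matrix(toy)
print(matrix["a"]["b"], matrix["a"]["c"])
```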
> >     >>>
> >     >>>                             Things to note:
> >     >>>
> >     >>>                               * The activity of each gender is correlated with the activity of other genders.
> >     >>>                               * Diversity is anticorrelated with the number of men. This is expected based on how it was defined, and a good sanity check.
> >     >>>                               * Draft output is MORE correlated with diversity than it is with any individual gender!
> >     >>>
> >     >>>                             This last point is quite nice. It
> resonates with the work of Scott Page on the value of diversity to
> collective intelligence, for example.
> >     >>>
> >     >>>                             These numbers are a bit hard to interpret. How much should we trust them? These are the /p/-values associated with each correlation:
> >     >>>
> >     >>>                                            women   unknown       men    drafts diversity
> >     >>>                             women              0         0         0    0.6925         0
> >     >>>                             unknown            0         0         0     0.221         0
> >     >>>                             men                0         0         0     0.493         0
> >     >>>                             drafts        0.6925     0.221     0.493         0    0.0059
> >     >>>                             diversity          0         0         0    0.0059         0
> >     >>>
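
The thread does not say how these p-values were computed. One way to get a comparable number without extra dependencies is a permutation test, which is a different technique from the usual t-based p-value for Pearson's r and is shown here only as a sketch. `permutation_p_value`, its parameters and the toy data are all hypothetical.

```python
import random
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def permutation_p_value(
    x: Sequence[float],
    y: Sequence[float],
    n_perm: int = 2000,
    seed: int = 0,
) -> float:
    """Two-sided permutation p-value for the Pearson correlation.

    Shuffles y repeatedly and counts how often the shuffled |r| is at
    least as large as the observed |r|.
    """
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Toy data with a perfect linear relationship, for illustration only.
xs = list(range(10))
ys = [2 * v + 1 for v in xs]
p = permutation_p_value(xs, ys)
print(p)
```

Shuffling treats the days as exchangeable, which a rolling average does not guarantee, so numbers from a sketch like this should be read with that caveat in mind.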
> >     >>>
> >     >>>                             Generally, /p/-values below .01 are
> considered "statistically significant", i.e. publishable.
> >     >>>                             This correlation between diversity
> and draft output makes the cut!!
> >     >>>
> >     >>>                             So the verdict is: for HRPC, YES,
> gender diversity is correlated with draft output.
> >     >>>
> >     >>>                             This result is robust to
> transformations of the activity scores into the log space, which is
> comforting.
> >     >>>                             Further work is needed to see if
> this result is robust across other IETF working groups.
> >     >>>
> >     >>>                             Nick, what would you say to
> including a result like this in the paper about IETF and gender?
> >     >>>
> >     >>>                             Cheers,
> >     >>>                             Seb
> >     >>>
> >     >>>
> >     >>>
> >
> >     >>>
> >     >>>                             Bigbang-dev mailing list
> >     >>>                             Bigbang-dev at data-activism.net
> >     >>>
> https://lists.ghserv.net/mailman/listinfo/bigbang-dev
> >     >>>
> >     >>>
> >     >>>             <diversity-productivity-httpbisa.png>
> >     >>
> >     >
> >     >
> >     >
> >     >         --
> >     >         Colin Perkins
> >     >         https://csperkins.org/
> >     >
> >     >
> >     >
> >     >
> >
> >     --
> >     Niels ten Oever
> >     Researcher and PhD Candidate - DATACTIVE Research Group - University
> of Amsterdam
> >     Postdoctoral Scholar (abd) - Communications Department - Texas A&M
> University
> >     Research Fellow - Centre for Internet and Human Rights - European
> University Viadrina
> >     Associated Scholar - Centro de Tecnologia e Sociedade - Fundação
> Getúlio Vargas
> >
> >     W: https://nielstenoever.net
> >     E: mail at nielstenoever.net <mailto:mail at nielstenoever.net>
> >     T: @nielstenoever
> >     P/S/WA: +31629051853
> >     PGP: 2458 0B70 5C4A FD8A 9488 643A 0ED8 3F3A 468A C8B3
> >
>
>