[Bigbang-dev] Are gender diversity and draft productivity correlated? THE VERDICT

Sebastian Benthall sbenthall at gmail.com
Thu Sep 3 14:54:44 CEST 2020


That's a good idea.
This sounds like a new issue though.

That PR is getting a bit overloaded.
If it serves as a modular improvement, then I hope it passes your review
and can be merged.

Then new features can be put into new issues and prioritized.

Because there's been a lot of work recently, it would make sense to start
preparing to cut a new release.


On Thu, Sep 3, 2020, 5:17 AM Niels ten Oever <mail at nielstenoever.net> wrote:

> To make this even more useful, we could have a field that exports the
> 'unknown' values? Perhaps even in a way that gender could be assigned to
> them?
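
One way the export suggested here might look, as a minimal sketch. It assumes the detection step yields a mapping from name to one of "women", "men", or "unknown"; the function names, CSV column names, and sample data are hypothetical and not part of BigBang.

```python
import csv
import io


def export_unknown_names(gender_by_name, out):
    """Write every name labelled 'unknown' to a CSV file object, leaving an
    empty column where a reviewer can assign a gender by hand."""
    writer = csv.writer(out)
    writer.writerow(["name", "assigned_gender"])
    unknown = sorted(n for n, g in gender_by_name.items() if g == "unknown")
    for name in unknown:
        writer.writerow([name, ""])
    return len(unknown)


def apply_assignments(gender_by_name, rows):
    """Merge hand-assigned labels back in; blank assignments stay as they were."""
    merged = dict(gender_by_name)
    for row in rows:
        label = row["assigned_gender"].strip()
        if label:
            merged[row["name"]] = label
    return merged


# Hypothetical detection output, for illustration only.
detected = {"Alice": "women", "Bob": "men", "Sasha": "unknown", "Jun": "unknown"}
buf = io.StringIO()
n_unknown = export_unknown_names(detected, buf)
```

After a reviewer fills in the second column, the rows can be read back with `csv.DictReader` and passed to `apply_assignments`.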
>
> Cheers,
>
> Niels
>
> On 9/2/20 7:53 PM, Sebastian Benthall wrote:
> > Maybe the reasons for this are similar to those for why so few women
> participate in open source. There are good presentations of empirical
> research about this in this video:
> > https://www.youtube.com/watch?v=d5XkVHQGqH4
> >
> >
> >
> > On Wed, Sep 2, 2020 at 12:52 PM Niels ten Oever <mail at nielstenoever.net> wrote:
> >
> >     Well, what is most shocking from this is the consistently near-zero
> female-identified participation in HTTPbis!
> >
> >     On 9/2/20 3:57 PM, Sebastian Benthall wrote:
> >     > Hello,
> >     >
> >     > Thanks for catching that.
> >     > Indeed, the data was not correct.
> >     >
> >     > I should have acted immediately when Colin suggested I look at the
> submissions.
> >     > The submissions have a "document_date" field which is much more
> reliable than the document's "time" field.
> >     >
> >     > I've fixed the data collection script and included the total
> number of drafts in the plot.
> >     >
> https://github.com/datactive/bigbang/pull/394#issuecomment-685748156
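
A rough illustration of counting drafts per period from the submissions' "document_date" field, as described above. This is a sketch, not the actual fix in the PR: the submission records are stand-in dictionaries with made-up names and dates, and `drafts_per_year` is a hypothetical helper.

```python
from collections import Counter
from datetime import date


def drafts_per_year(submissions):
    """Count submissions per calendar year, keyed on each record's document_date."""
    return Counter(s["document_date"].year for s in submissions)


# Stand-in records; real ones would come from the datatracker submissions.
submissions = [
    {"name": "draft-example-a", "rev": "00", "document_date": date(2013, 2, 1)},
    {"name": "draft-example-a", "rev": "01", "document_date": date(2013, 9, 15)},
    {"name": "draft-example-b", "rev": "00", "document_date": date(2014, 1, 7)},
]

counts = drafts_per_year(submissions)
for year in sorted(counts):
    print(year, counts[year])
```

Totals of this kind are what would let a reader check a plot against the datatracker listing for a given period.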
> >     >
> >     > On Tue, Sep 1, 2020 at 7:09 PM Niels ten Oever <mail at nielstenoever.net> wrote:
> >     >
> >     >     Hiya,
> >     >
> >     >     On 9/1/20 5:04 PM, Sebastian Benthall wrote:
> >     >     > Some updates:
> >     >     >
> >     >     > - A plot of mailing list activity by gender and (final)
> draft output, with the correlation values, is up here:
> >     >     >
> https://github.com/datactive/bigbang/pull/394#issuecomment-684917057
> >     >
> >     >     Interesting! In order to be able to judge
> https://github.com/datactive/bigbang/pull/394 it would be great if the
> y-axis of the graphs, and ideally also a data field in the notebook, would
> show the total number of drafts in a specific period. I have the feeling
> that the representation of the drafts is not fully correct.
> >     >
> >     >     According to the graph (if I read it correctly) there should
> be no drafts for httpbis in the period 2012 - 2014, but a cursory glance at
> the datatracker [0] shows that RFC7230, 7231, 7232, 7233, 7235, 7236, and
> 7237 were published in 2014, and I am pretty sure these RFCs were all
> preceded by quite a number of drafts.
> >     >
> >     >     Cheers,
> >     >
> >     >     Niels
> >     >
> >     >     [0]
> https://datatracker.ietf.org/doc/search?name=httpbis&sort=&rfcs=on&activedrafts=on&by=group&group=
> >     >
> >     >     >
> >     >     > - None of the correlations of mailing list activity with
> draft output is statistically significant! This reverses the previous
> verdict.
> >     >     >
> >     >     > - I've made an issue for expanding the draft metadata
> collection to include the submissions:
> >     >     > https://github.com/datactive/bigbang/issues/397
> >     >     >
> >     >     > - Could I request a review of the code for this project thus
> far? It's currently languishing a bit as a PR:
> >     >     > https://github.com/datactive/bigbang/pull/394
> >     >     >
> >     >     > I've got to work on a few other projects for a bit but I'm
> excited to hear where folks think we might go from here.
> >     >     >
> >     >     > Best regards,
> >     >     > Seb
> >     >     >
> >     >     >
> >     >     > On Mon, Aug 31, 2020 at 10:52 AM Sebastian Benthall <sbenthall at gmail.com> wrote:
> >     >     >
> >     >     >     Thank you!
> >     >     >
> >     >     >
> >     >     >         * The group is “httpbis” not “httpbisa”
> >     >     >
> >     >     >
> >     >     >     Aha!
> >     >     >
> >     >     >     I found `httpbisa` as the closest acronym to `httpbis`
> on this list of IETF mailing list archives:
> >     >     >
> https://github.com/datactive/bigbang/blob/master/examples/url_collections/mm.ietf.org.txt
> >     >     >
> >     >     >     Niels, does it make sense that the mailing list and the
> working group have different names in this case? Is that common?
> >     >     >
> >     >     >     I can confirm that the records I pulled using the
> datatracker include drafts for working groups besides `httpbis`.
> group_from_acronym('nonsense') returns None. None passed as a group to the
> documents query results in a default query of all groups, I suppose.
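
The fall-through described above (an unknown acronym yields None, and None then selects every group) could be guarded against along these lines. This is only a sketch: `require_group` is a hypothetical helper, and a plain dictionary lookup stands in for `dt.group_from_acronym` so that the example runs without network access.

```python
def require_group(lookup, acronym):
    """Resolve an acronym with `lookup` and fail loudly if nothing is found,
    so that a None group is never passed on to a documents query."""
    group = lookup(acronym)
    if group is None:
        raise ValueError(f"no group found for acronym {acronym!r}")
    return group


# Stand-in for the datatracker lookup; returns None for unknown acronyms.
known_groups = {"httpbis": {"acronym": "httpbis"}}

group = require_group(known_groups.get, "httpbis")
```

With the real client, the first argument would be the bound method `dt.group_from_acronym`, assuming it keeps returning None for unknown acronyms as observed here.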
> >     >     >
> >     >     >
> >     >     >         Also, remember to look at the submissions to find
> the different versions of a draft, else you only get the most recent
> version.
> >     >     >
> >     >     >         Try something like:
> >     >     >
> >     >     >         # Imports the snippet relies on (module path assumed for the ietfdata package)
> >     >     >         from pathlib import Path
> >     >     >         from ietfdata.datatracker import DataTracker
> >     >     >
> >     >     >         dt = DataTracker(cache_dir=Path("cache"))
> >     >     >
> >     >     >         g = dt.group_from_acronym("httpbis")
> >     >     >         draft = dt.document_type_from_slug("draft")
> >     >     >         for d in dt.documents(group=g, doctype=draft):
> >     >     >             print("")
> >     >     >             # Each submission is one revision of the draft
> >     >     >             for sub_url in d.submissions:
> >     >     >                 sub = dt.submission(sub_url)
> >     >     >                 print(f"{sub.document_date.strftime('%Y-%m-%d')} {sub.name}-{sub.rev}")
> >     >     >                 for a in sub.parse_authors():
> >     >     >                     print(f"           {a['name']} <{a['email']}>")
> >     >     >
> >     >     >         This will find each submission of all the working
> group drafts for a particular group. It doesn’t follow the history back to
> the pre-working group individual submissions, but can be extended to do
> that if needed.
> >     >     >
> >     >     >
> >     >     >     I see. Thanks again for this.
> >     >     >
> >     >     >     I welcome input from any stakeholders about whether
> "productivity" should be operationalized in terms of final draft
> output and/or submissions.
> >     >     >
> >     >     >
> >     >     >
> >     >     >>             Looking at
> https://datatracker.ietf.org/wg/httpbis/documents/ it seems that httpbis
> has 48 documents. Each of these will have gone through multiple versions as
> a draft, but even with ~20 drafts per document (which is roughly typical),
> that’s not close to thousands.
> >     >     >>
> >     >     >>             Searching
> https://mailarchive.ietf.org/arch/browse/i-d-announce/?q=httpbis finds
> announcements for 721 internet drafts containing the string “httpbis”,
> which seems plausible.
> >     >     >>
> >     >     >>             Colin
> >     >     >>
> >     >     >>
> >     >     >>
> >     >     >>>             Another issue here is that the draft output
> precedes the mailing list records (see attachment). Another is that there
> are very few emails sent by women (or, so identifiable by our detection method)
> in httpbisa:
> >     >     >>>
> >     >     >>>             <image.png>
> >     >     >>>
> >     >     >>>
> >     >     >>>
> >     >     >>>
> >     >     >>>             On Wed, Aug 26, 2020 at 3:26 PM Niels ten Oever <mail at nielstenoever.net> wrote:
> >     >     >>>
> >     >     >>>                 Httpbis is the one you're looking for :)
> >     >     >>>
> >     >     >>>                 DNSops is also a nice big one.
> >     >     >>>
> >     >     >>>                 Cheers,
> >     >     >>>
> >     >     >>>                 Niels
> >     >     >>>                 On Aug 26, 2020, at 21:17, Sebastian Benthall <sbenthall at gmail.com> wrote:
> >     >     >>>
> >     >     >>>                     Hmmm.
> >     >     >>>
> >     >     >>>                     Web mail archives of the http list at
> https://ietf.org/mail-archive/text/http/ only go up to 2012.
> >     >     >>>                     Does that make sense to you?
> >     >     >>>
> >     >     >>>                     It looks like there are several DNS
> working groups. Any one in particular you think would be worth looking at?
> >     >     >>>
> >     >     >>>                     Genericizing the code so that it can
> loop through many groups and compute results is the next step towards
> confirmation. Probably worth looking at a couple other concrete and
> well-understood examples before doing the big analysis though.
> >     >     >>>
> >     >     >>>                     - S
> >     >     >>>
> >     >     >>>                     On Wed, Aug 26, 2020 at 1:52 PM Niels ten Oever <mail at nielstenoever.net> wrote:
> >     >     >>>
> >     >     >>>                         Very interesting. I'd say the
> number of drafts and authors in hrpc is too low to make a statement about
> this though. Could we do this for the HTTP and/or DNS WGs?
> >     >     >>>                         On Aug 26, 2020, at 19:30, Sebastian Benthall <sbenthall at gmail.com> wrote:
> >     >     >>>
> >     >     >>>                             Hello,
> >     >     >>>
> >     >     >>>                             I'm revisiting the question of
> whether mailing list gender diversity and draft productivity of working
> groups are correlated.
> >     >     >>>
> >     >     >>>                             Putting aside for now all the
> methodological complications, here is how I am operationalizing the
> question:
> >     >     >>>
> >     >     >>>                               * I'm looking specifically
> at the HRPC working group, with this data:
> >     >     >>>                                 image.png
> >     >     >>>                               * Gender is being detected
> based on first name birth records. "unknown" is used for cases that cannot
> be determined as either men or women with the current data set.
> >     >     >>>                               * I'm measuring "diversity"
> on any day as: (women's activity + unknown's activity) / (men's activity).
> Because, you know, this is probably close to what most people probably mean
> by diversity. (Recall that non-Western names are more likely to be
> categorized as "unknown".)
> >     >     >>>                               * I'm using a 100 day
> rolling average on the activity counts.
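
The operationalization in the list above can be sketched as follows. The thread does not say how the window is aligned or how days with no activity by men are handled, so this sketch assumes a trailing window that averages over however many days are available at the start, and returns None where the denominator is zero. Both function names are hypothetical.

```python
def rolling_mean(values, window=100):
    """Trailing rolling mean: each output averages the current value and up to
    window - 1 preceding values (fewer at the start of the series)."""
    out = []
    total = 0.0
    for i, v in enumerate(values):
        total += v
        if i >= window:
            total -= values[i - window]  # drop the value that left the window
        out.append(total / min(i + 1, window))
    return out


def diversity(women, unknown, men):
    """Per-day (women + unknown) / men; None where men's activity is zero."""
    return [(w + u) / m if m else None for w, u, m in zip(women, unknown, men)]


# Made-up daily message counts, smoothed first and then combined.
women = rolling_mean([1, 0, 2, 1], window=2)
unknown = rolling_mean([0, 1, 1, 0], window=2)
men = rolling_mean([4, 6, 5, 3], window=2)
scores = diversity(women, unknown, men)
```

Smoothing the counts before taking the ratio follows the order the list implies; taking the ratio first and smoothing afterwards would give different values.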
> >     >     >>>
> >     >     >>>                             This is the matrix of Pearson
> correlations between each of these values:
> >     >     >>>
> >     >     >>>                                             women    unknown        men     drafts  diversity
> >     >     >>>                             women        1.000000   0.910922   0.804869   0.008890   0.160833
> >     >     >>>                             unknown      0.910922   1.000000   0.808168   0.027502   0.245059
> >     >     >>>                             men          0.804869   0.808168   1.000000   0.015406  -0.141915
> >     >     >>>                             drafts       0.008890   0.027502   0.015406   1.000000   0.061884
> >     >     >>>                             diversity    0.160833   0.245059  -0.141915   0.061884   1.000000
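
For reference, a matrix like the one above can be computed with nothing beyond the standard library. This is a sketch of the textbook Pearson formula, not the code behind the numbers in this thread; the series names and values are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.
    Raises ZeroDivisionError if either sequence is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(series):
    """Pairwise Pearson correlations for a dict of name -> sequence."""
    names = list(series)
    return {a: {b: pearson(series[a], series[b]) for b in names} for a in names}


# Made-up smoothed activity series, for illustration only.
series = {
    "women": [0.5, 1.0, 1.5, 1.0],
    "men": [4.0, 5.0, 5.5, 4.0],
    "drafts": [0.0, 1.0, 0.0, 2.0],
}
matrix = correlation_matrix(series)
```

The result is symmetric with ones on the diagonal, which is a quick way to check that the input series are aligned on the same days.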
> >     >     >>>
> >     >     >>>
> >     >     >>>                             Things to note:
> >     >     >>>
> >     >     >>>                               * The activity of each
> gender is correlated with the activity of other genders.
> >     >     >>>                               * Diversity is
> anticorrelated with the number of men. This is expected based on how it was
> defined, and a good sanity check.
> >     >     >>>                               * Draft output is MORE
> correlated with diversity than it is with any individual gender!
> >     >     >>>
> >     >     >>>                             This last point is quite nice.
> It resonates with the work of Scott Page on the value of diversity to
> collective intelligence, for example.
> >     >     >>>
> >     >     >>>                             These numbers are a bit hard
> to interpret. How much should we trust them? These are the /p/-values
> associated with each correlation:
> >     >     >>>                                             women    unknown        men     drafts  diversity
> >     >     >>>                             women               0          0          0     0.6925          0
> >     >     >>>                             unknown             0          0          0      0.221          0
> >     >     >>>                             men                 0          0          0      0.493          0
> >     >     >>>                             drafts         0.6925      0.221      0.493          0     0.0059
> >     >     >>>                             diversity           0          0          0     0.0059          0
> >     >     >>>
> >     >     >>>
> >     >     >>>                             Generally, /p/-values below
> .01 are considered "statistically significant", i.e. publishable.
> >     >     >>>                             This correlation between
> diversity and draft output makes the cut!!
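
The thread does not say how the p-values above were computed. As one illustrative alternative, not the method used here, a two-sided permutation test estimates how often a correlation at least as strong as the observed one appears when one series is shuffled. The function names, the number of permutations, and the fixed seed are assumptions of this sketch. Shuffling treats days as exchangeable, which rolling averages do not satisfy, so the result is best read as a rough check.

```python
import math
import random


def _pearson(x, y):
    """Pearson correlation; assumes neither sequence is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(
        sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)
    )


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation of x and y."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    observed = abs(_pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(_pearson(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Made-up, perfectly correlated series.
x = list(range(20))
y = [2 * v + 1 for v in x]
p = permutation_p_value(x, y, n_permutations=200)
```

Because of the add-one correction, the smallest value this estimate can report is 1 / (n_permutations + 1), so more permutations are needed to resolve very small p-values.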
> >     >     >>>
> >     >     >>>                             So the verdict is: for HRPC,
> YES, gender diversity is correlated with draft output.
> >     >     >>>
> >     >     >>>                             This result is robust to
> transformations of the activity scores into the log space, which is
> comforting.
> >     >     >>>                             Further work is needed to see
> if this result is robust across other IETF working groups.
> >     >     >>>
> >     >     >>>                             Nick, what would you say to
> including a result like this in the paper about IETF and gender?
> >     >     >>>
> >     >     >>>                             Cheers,
> >     >     >>>                             Seb
> >     >     >>>
> >     >     >>>
> >     >     >>>
> >     >     >>>                             ------------------------------------------------
> >     >     >>>
> >     >     >>>                             Bigbang-dev mailing list
> >     >     >>>                             Bigbang-dev at data-activism.net
> >     >     >>>                             https://lists.ghserv.net/mailman/listinfo/bigbang-dev
> >     >     >>>
> >     >     >>>
> >     >     >>>             <diversity-productivity-httpbisa.png>
> >     >     >>
> >     >     >
> >     >     >
> >     >     >
> >     >     >         --
> >     >     >         Colin Perkins
> >     >     >         https://csperkins.org/
> >     >     >
> >     >     >
> >     >     >
> >     >     >
> >     >
> >     >     --
> >     >     Niels ten Oever
> >     >     Researcher and PhD Candidate - DATACTIVE Research Group -
> University of Amsterdam
> >     >     Postdoctoral Scholar (abd) - Communications Department - Texas
> A&M University
> >     >     Research Fellow - Centre for Internet and Human Rights -
> European University Viadrina
> >     >     Associated Scholar - Centro de Tecnologia e Sociedade -
> Fundação Getúlio Vargas
> >     >
> >     >     W: https://nielstenoever.net
> >     >     E: mail at nielstenoever.net
> >     >     T: @nielstenoever
> >     >     P/S/WA: +31629051853
> >     >     PGP: 2458 0B70 5C4A FD8A 9488 643A 0ED8 3F3A 468A C8B3
> >     >
> >
> >
>
>