[Bigbang-dev] Are gender diversity and draft productivity correlated? THE VERDICT
Sebastian Benthall
sbenthall at gmail.com
Thu Sep 3 14:54:44 CEST 2020
That's a good idea.
This sounds like a new issue though.
That PR is getting a bit overloaded.
If it serves as a modular improvement, then I hope it passes your review
and can be merged.
Then new features can be put into new issues and prioritized.
Because there's been a lot of work recently, it would make sense to start
preparing to cut a new release.
On Thu, Sep 3, 2020, 5:17 AM Niels ten Oever <mail at nielstenoever.net> wrote:
> To make this even more useful, we could have a field that exports the
> 'unknown' values? Perhaps even in a way that gender could be assigned to
> them?
>
> Cheers,
>
> Niels
>
> On 9/2/20 7:53 PM, Sebastian Benthall wrote:
> > Maybe the reasons for this are similar to those for why so few women participate in open source. There are good presentations of empirical research about this in this video:
> > https://www.youtube.com/watch?v=d5XkVHQGqH4
> >
> >
> >
> > On Wed, Sep 2, 2020 at 12:52 PM Niels ten Oever <mail at nielstenoever.net> wrote:
> >
> > Well, what is most shocking about this is the consistently near-zero female-identified participation in HTTPbis!
> >
> > On 9/2/20 3:57 PM, Sebastian Benthall wrote:
> > > Hello,
> > >
> > > Thanks for catching that.
> > > Indeed, the data was not correct.
> > >
> > > I should have acted immediately when Colin suggested I look at the submissions.
> > > The submissions have a "document_date" field, which is much more reliable than the document's "time" field.
> > >
> > > I've fixed the data collection script and included the total number of drafts in the plot.
> > >
> > > https://github.com/datactive/bigbang/pull/394#issuecomment-685748156
> > >
> > > On Tue, Sep 1, 2020 at 7:09 PM Niels ten Oever <mail at nielstenoever.net> wrote:
> > >
> > > Hiya,
> > >
> > > On 9/1/20 5:04 PM, Sebastian Benthall wrote:
> > > > Some updates:
> > > >
> > > > - A plot of mailing list activity by gender and (final) draft output, with the correlation values, is up here:
> > > >
> > > > https://github.com/datactive/bigbang/pull/394#issuecomment-684917057
> > >
> > > Interesting! In order to be able to judge
> https://github.com/datactive/bigbang/pull/394 it would be great if the
> y-axis of the graphs, and ideally also a data field in the notebook, would
> show the total number of drafts in a specific period. I have the feeling
> that the representation of the drafts is not fully correct.
> > >
> > > According to the graph (if I read it correctly) there should
> be no drafts for httpbis in the period 2012 - 2014, but a cursory glance at
> the datatracker [0] shows that RFC7230, 7231, 7232, 7233, 7235, 7236, and
> 7237 were published in 2014, and I am pretty sure these RFCs were all
> preceded by quite a number of drafts.
> > >
> > > Cheers,
> > >
> > > Niels
> > >
> > > [0]
> https://datatracker.ietf.org/doc/search?name=httpbis&sort=&rfcs=on&activedrafts=on&by=group&group=
> > >
> > > >
> > > > - None of the correlations of mailing list activity with
> draft output is statistically significant! This reverses the previous
> verdict.
> > > >
> > > > - I've made an issue for expanding the draft metadata
> collection to include the submissions:
> > > > https://github.com/datactive/bigbang/issues/397
> > > >
> > > > - Could I request a review of the code for this project thus
> far? It's currently languishing a bit as a PR:
> > > > https://github.com/datactive/bigbang/pull/394
> > > >
> > > > I've got to work on a few other projects for a bit but I'm
> excited to hear where folks think we might go from here.
> > > >
> > > > Best regards,
> > > > Seb
> > > >
> > > >
> > > > On Mon, Aug 31, 2020 at 10:52 AM Sebastian Benthall <sbenthall at gmail.com> wrote:
> > > >
> > > > Thank you!
> > > >
> > > >
> > > > * The group is “httpbis” not “httpbisa”
> > > >
> > > >
> > > > Aha!
> > > >
> > > > I found `httpbisa` as the closest acronym to `httpbis`
> on this list of IETF mailing list archives:
> > > >
> https://github.com/datactive/bigbang/blob/master/examples/url_collections/mm.ietf.org.txt
> > > >
> > > > Niels, does it make sense that the mailing list and the
> working group have different names in this case? Is that common?
> > > >
> > > > I can confirm that the records I pulled using the
> datatracker include drafts for working groups besides `httpbis`.
> group_from_acronym('nonsense') returns None. None passed as a group to the
> documents query results in a default query of all groups, I suppose.
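A minimal sketch of a guard against the silent fallback described above, assuming (as reported in this thread) that `group_from_acronym` returns None for an unknown acronym and that None passed as a group queries all groups. The helper name `require_group` is hypothetical and not part of ietfdata; it accepts any object that exposes `group_from_acronym`.

```python
def require_group(tracker, acronym):
    """Look up a group by acronym and fail loudly if it does not exist.

    `tracker` is any object with a `group_from_acronym(acronym)` method,
    such as the DataTracker instance used in this thread. `require_group`
    itself is a hypothetical helper, not part of any library.
    """
    group = tracker.group_from_acronym(acronym)
    if group is None:
        # Without this check, None would be passed on to the documents
        # query, which reportedly falls back to querying every group.
        raise ValueError(f"no group with acronym {acronym!r}")
    return group
```

With a tracker `dt` in hand, `g = require_group(dt, "httpbis")` would then replace the bare lookup, so a misspelled acronym stops the script instead of widening the query.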
> > > >
> > > >
> > > > Also, remember to look at the submissions to find
> the different versions of a draft, else you only get the most recent
> version.
> > > >
> > > > Try something like:
> > > >
> > > >     from pathlib import Path
> > > >     from ietfdata.datatracker import DataTracker
> > > >
> > > >     # Cache datatracker responses locally to avoid repeated requests.
> > > >     dt = DataTracker(cache_dir=Path("cache"))
> > > >
> > > >     g = dt.group_from_acronym("httpbis")
> > > >     # Walk every working-group draft and print each submitted revision.
> > > >     for d in dt.documents(group=g, doctype=dt.document_type_from_slug("draft")):
> > > >         print("")
> > > >         for sub_url in d.submissions:
> > > >             sub = dt.submission(sub_url)
> > > >             print(f"{sub.document_date.strftime('%Y-%m-%d')} {sub.name}-{sub.rev}")
> > > >             for a in sub.parse_authors():
> > > >                 print(f"    {a['name']} <{a['email']}>")
> > > >
> > > > This will find each submission of all the working
> group drafts for a particular group. It doesn’t follow the history back to
> the pre-working group individual submissions, but can be extended to do
> that if needed.
> > > >
> > > >
> > > > I see. Thanks again for this.
> > > >
> > > > I welcome input from any stakeholders about whether "productivity" should be operationalized in terms of final draft output and/or submissions.
> > > >
> > > >
> > > >
> > > >> Looking at https://datatracker.ietf.org/wg/httpbis/documents/ it seems that httpbis has 48 documents. Each of these will have gone through multiple versions as a draft, but even with ~20 drafts per document (which is roughly typical), that’s not close to thousands.
> > > >>
> > > >> Searching
> https://mailarchive.ietf.org/arch/browse/i-d-announce/?q=httpbis finds
> announcements for 721 internet drafts containing the string “httpbis”,
> which seems plausible.
> > > >>
> > > >> Colin
> > > >>
> > > >>
> > > >>
> > > >>> Another issue here is that the draft output precedes the mailing list records (see attachment). Another is that there are very few emails sent by women (or, at least, identifiable as such by our detection method) in httpbisa:
> > > >>>
> > > >>> <image.png>
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>> On Wed, Aug 26, 2020 at 3:26 PM Niels ten Oever <mail at nielstenoever.net> wrote:
> > > >>>
> > > >>> Httpbis is the one you're looking for :)
> > > >>>
> > > >>> DNSops is also a nice big one.
> > > >>>
> > > >>> Cheers,
> > > >>>
> > > >>> Niels
> > > >>> On Aug 26, 2020, at 21:17, Sebastian Benthall <sbenthall at gmail.com> wrote:
> > > >>>
> > > >>> Hmmm.
> > > >>>
> > > >>> Web mail archives of the http list at
> https://ietf.org/mail-archive/text/http/ only go up to 2012.
> > > >>> Does that make sense to you?
> > > >>>
> > > >>> It looks like there are several DNS
> working groups. Any one in particular you think would be worth looking at?
> > > >>>
> > > >>> Genericizing the code so that it can
> loop through many groups and compute results is the next step towards
> confirmation. Probably worth looking at a couple other concrete and
> well-understood examples before doing the big analysis though.
> > > >>>
> > > >>> - S
> > > >>>
> > > >>> On Wed, Aug 26, 2020 at 1:52 PM Niels ten Oever <mail at nielstenoever.net> wrote:
> > > >>>
> > > >>> Very interesting. I'd say the number of drafts and authors in hrpc is too low to make a statement about this though. Could we do this for the HTTP and/or DNS WGs?
> > > >>> On Aug 26, 2020, at 19:30, Sebastian Benthall <sbenthall at gmail.com> wrote:
> > > >>>
> > > >>> Hello,
> > > >>>
> > > >>> I'm revisiting the question of
> whether mailing list gender diversity and draft productivity of working
> groups are correlated.
> > > >>>
> > > >>> Putting aside for now all the
> methodological complications, here is how I am operationalizing the
> question:
> > > >>>
> > > >>> * I'm looking specifically at the HRPC working group, with this data:
> > > >>>   image.png
> > > >>> * Gender is being detected based on first-name birth records. "unknown" is used for cases that cannot, with the current data set, be determined as either men or women.
> > > >>> * I'm measuring "diversity" on any day as: (women's activity + unknown's activity) / (men's activity). Because, you know, this is probably close to what most people mean by diversity. (Recall that non-Western names are more likely to be categorized as "unknown".)
> > > >>> * I'm using a 100-day rolling average on the activity counts.
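The operationalization in the list above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the inputs are plain lists of daily activity counts, the rolling average is a trailing window that averages over however many days are available at the start of the series, and days with zero activity from men yield None. The thread does not say how the original analysis handles either edge case.

```python
from collections import deque


def rolling_mean(values, window=100):
    """Trailing rolling average over the most recent `window` values.

    Assumption: early positions average over the days seen so far
    instead of being left undefined.
    """
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]  # this value is about to leave the window
        buf.append(v)
        total += v
        out.append(total / len(buf))
    return out


def diversity(women, unknown, men):
    """Daily (women + unknown) / men, as defined in the thread.

    Assumption: a day with no activity from men gives None.
    """
    return [(w + u) / m if m else None
            for w, u, m in zip(women, unknown, men)]
```

Whether the ratio is taken before or after smoothing is not stated in the thread; either order can be expressed by composing these two functions.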
> > > >>>
> > > >>> This is the matrix of Pearson
> correlations between each of these values:
> > > >>>
> > > >>>              women      unknown    men        drafts     diversity
> > > >>> women        1.000000   0.910922   0.804869   0.008890   0.160833
> > > >>> unknown      0.910922   1.000000   0.808168   0.027502   0.245059
> > > >>> men          0.804869   0.808168   1.000000   0.015406  -0.141915
> > > >>> drafts       0.008890   0.027502   0.015406   1.000000   0.061884
> > > >>> diversity    0.160833   0.245059  -0.141915   0.061884   1.000000
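For readers who want to reproduce a matrix of this shape, here is a minimal standard-library sketch of pairwise Pearson correlations. The function names and the dict-of-lists input are assumptions made for illustration; the thread does not show the code that produced the numbers above.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(series):
    """Nested dict of pairwise correlations for a dict of named series."""
    names = list(series)
    return {a: {b: pearson(series[a], series[b]) for b in names}
            for a in names}
```

Calling `correlation_matrix({"women": ..., "unknown": ..., "men": ..., "drafts": ..., "diversity": ...})` with equal-length series returns a symmetric table with ones on the diagonal, matching the layout above.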
> > > >>>
> > > >>>
> > > >>> Things to note:
> > > >>>
> > > >>> * The activity of each
> gender is correlated with the activity of other genders.
> > > >>> * Diversity is
> anticorrelated with the number of men. This is expected based on how it was
> defined, and a good sanity check.
> > > >>> * Draft output is MORE
> correlated with diversity than it is with any individual gender!
> > > >>>
> > > >>> This last point is quite nice.
> It resonates with the work of Scott Page on the value of diversity to
> collective intelligence, for example.
> > > >>>
> > > >>> These numbers are a bit hard to interpret. How much should we trust them? These are the p-values associated with each correlation:
> > > >>>
> > > >>>              women    unknown  men      drafts   diversity
> > > >>> women        0        0        0        0.6925   0
> > > >>> unknown      0        0        0        0.221    0
> > > >>> men          0        0        0        0.493    0
> > > >>> drafts       0.6925   0.221    0.493    0        0.0059
> > > >>> diversity    0        0        0        0.0059   0
> > > >>>
> > > >>>
> > > >>> Generally, p-values below .01 are considered "statistically significant", i.e. publishable.
> > > >>> This correlation between diversity and draft output makes the cut!!
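The thread does not say how these p-values were computed. As a hedged illustration only, the sketch below swaps in a two-sided permutation test, which needs nothing beyond the standard library; it is not the method used for the table above. The function names, the number of permutations, and the fixed seed are all assumptions.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the correlation of x and y.

    Shuffles y repeatedly and counts how often the shuffled correlation
    is at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

A result would then be compared against the .01 threshold mentioned above, for example `permutation_p_value(diversity_series, draft_series) < 0.01`, where both series names are placeholders.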
> > > >>>
> > > >>> So the verdict is: for HRPC,
> YES, gender diversity is correlated with draft output.
> > > >>>
> > > >>> This result is robust to
> transformations of the activity scores into the log space, which is
> comforting.
> > > >>> Further work is needed to see
> if this result is robust across other IETF working groups.
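The log-space check mentioned above could look something like this. The thread does not say which transform was used; `log1p` is an assumption chosen here because daily activity counts can be zero, and the function name is hypothetical.

```python
import math


def log_transform(counts):
    """Map activity counts into log space.

    Assumption: log1p (log of 1 + count) is used so that days with
    zero activity stay defined and map to zero.
    """
    return [math.log1p(c) for c in counts]
```

The robustness check then amounts to recomputing the correlations on the transformed series and confirming that the sign and significance of the diversity and drafts correlation are unchanged.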
> > > >>>
> > > >>> Nick, what would you say to
> including a result like this in the paper about IETF and gender?
> > > >>>
> > > >>> Cheers,
> > > >>> Seb
> > > >>>
> > > >>>
> > > >>>
> > >
> >
> > > >>> _______________________________________________
> > > >>> Bigbang-dev mailing list
> > > >>> Bigbang-dev at data-activism.net
> > > >>> https://lists.ghserv.net/mailman/listinfo/bigbang-dev
> > > >>>
> > > >>>
> > > >>> <diversity-productivity-httpbisa.png>
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Colin Perkins
> > > > https://csperkins.org/
> > > >
> > > >
> > > >
> > > >
> > >
> > > --
> > > Niels ten Oever
> > > Researcher and PhD Candidate - DATACTIVE Research Group - University of Amsterdam
> > > Postdoctoral Scholar (abd) - Communications Department - Texas A&M University
> > > Research Fellow - Centre for Internet and Human Rights - European University Viadrina
> > > Associated Scholar - Centro de Tecnologia e Sociedade - Fundação Getúlio Vargas
> > >
> > > W: https://nielstenoever.net
> > > E: mail at nielstenoever.net
> > > T: @nielstenoever
> > > P/S/WA: +31629051853
> > > PGP: 2458 0B70 5C4A FD8A 9488 643A 0ED8 3F3A 468A C8B3
> > >
> >
> >
>
>