[Bigbang-dev] Gender diversity and draft productivity

Sebastian Benthall sbenthall at gmail.com
Sat Jul 11 02:43:31 CEST 2020


Nick!

>> An issue that has not yet been settled is how we are measuring
>> "diversity", and how that measurement should reflect our uncertainty and
>> the possibility of more than two represented gender categories.
>>
>
> So far I haven’t been trying to capture or record people with non-binary
> genders, both because that’s not easily estimated by gender-detector and
> similar libraries and because of the ethical concern that it could out or
> identify people. In general, my research has been trying to estimate the
> gender breakdown of populations, not to record and publish individual
> people’s genders, to avoid individual misgendering and to avoid the privacy
> risks of disclosing someone’s gender.
>

That makes sense.

It may make sense to break down the unknown cases further when they
dominate. (See below)

For the sake of honing our intuitions here, I'm going to push back and say
that if we are using only public, expressed information, such as one's
stated name and public biography, to infer gender, then nothing we are doing
creates any new risk.

I guess I'm skeptical of the "outing" concern here.

>> The gender guess is then based on whether or not the preponderance of uses
>> of the name apply to "male" or "female" people. There's a confidence cutoff
>> that's actually quite strict; anything below this confidence rate gets an
>> "unknown" response.
>>
>
> Yes, these libraries import datasets that I believe come from local
> governments, which record names and genders assigned at birth. As you note,
> the cut-off requires quite high confidence: both that there are enough
> recorded instances of a name and that those instances are overwhelmingly
> of the identified gender.
>

Aha! Fascinating.
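
To make that concrete, here is a minimal sketch of the name-based guess,
assuming the gender-detector package's GenderDetector class and guess()
method as documented in its README (the example names are arbitrary):

    from gender_detector import GenderDetector

    detector = GenderDetector('us')  # backed by the US name/birth-record data

    for name in ['Alice', 'Robin', 'Xiaofeng']:
        # guess() returns 'male' or 'female' only when the name clears the
        # library's internal confidence cutoff; otherwise it returns 'unknown'.
        print(name, detector.guess(name))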

> Yes, I still see errors, and most often with names that in the US are
> strongly gendered but in other countries may not be gendered or may have a
> different gender balance. Those are cases where the US/Western focus also
> leads to incorrect data. But those instances have been rare when I’ve done
> manual checks with groups of people I know; more often the gender-detector
> library is recording genders as unknown.
>


It looks like an argument can switch the data backend to the UK name
set, which might be slightly better for European and perhaps other
continental names.

Because the IETF is global, we could run both and average the two estimates.
Or, if we get good national-origin metadata about participants, we could use
it to map each of them to the right dictionary.
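
Here is a rough sketch of the "run both and average" idea, assuming the same
GenderDetector API and that 'us' and 'uk' are among its supported country
codes; it averages the population-level fractions rather than reconciling
individual guesses:

    from collections import Counter
    from gender_detector import GenderDetector

    def fraction_estimate(names, country):
        """Estimate male/female/unknown fractions for a list of names
        using one of gender-detector's country datasets."""
        detector = GenderDetector(country)
        counts = Counter(detector.guess(name) for name in names)
        total = max(len(names), 1)
        return {label: counts[label] / total
                for label in ('male', 'female', 'unknown')}

    def averaged_estimate(names):
        """Average the aggregate estimates from the US and UK name sets."""
        us = fraction_estimate(names, 'us')
        uk = fraction_estimate(names, 'uk')
        return {label: (us[label] + uk[label]) / 2 for label in us}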

> I’d be interested in that. I have not looked at estimates of gender
> participation over time. I have compared different mailing lists/working
> groups, which seemed of interest. Some rough initial work in the graphs
> attached.
>
>
> <image.png>
>

Awesome.

Is there anything of theoretical interest that explains the differences in
the numbers?

For the cases where there's a preponderance of "unknowns", is it possible
to break them into smaller categories?

For example, I wonder whether dataset bias is causing a mailing list with a
strong non-Western regional presence to register as grey.
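
One cheap way to start subdividing the grey bucket, again assuming only the
guess() API above: separate the names the US data set cannot resolve into
those the UK data set can resolve and those unknown in both, which would hint
at how much of the grey is dataset coverage rather than genuinely ambiguous
names.

    from gender_detector import GenderDetector

    def split_unknowns(names):
        """Split the names the US dataset calls 'unknown' into those the
        UK dataset can resolve and those unknown in both datasets."""
        us, uk = GenderDetector('us'), GenderDetector('uk')
        resolved_by_uk, unknown_in_both = [], []
        for name in names:
            if us.guess(name) != 'unknown':
                continue
            if uk.guess(name) != 'unknown':
                resolved_by_uk.append(name)
            else:
                unknown_in_both.append(name)
        return resolved_by_uk, unknown_in_both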


> I think it would be better to use this method to look at the mailing list
> traffic by gender rather than the document authors: since there’s a small
> number of document editors, that’s something that could more easily be
> tagged by hand with higher precision.
>

I mostly agree.
The mailing lists should have more interesting aggregate numbers.
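
As a sketch of what the list-traffic version could look like, assuming the
messages are already in a pandas DataFrame with a 'From' column (for example
as collected by BigBang); the crude first-name parsing here is my assumption,
not how BigBang or your script actually does it:

    import re
    import pandas as pd
    from gender_detector import GenderDetector

    detector = GenderDetector('us')

    def first_name(from_header):
        """Crudely pull a first name out of a 'Full Name <address>' header."""
        match = re.match(r'\s*"?([A-Za-z]+)', str(from_header))
        return match.group(1) if match else ''

    def sender_gender(from_header):
        name = first_name(from_header)
        return detector.guess(name) if name else 'unknown'

    def traffic_by_gender(messages: pd.DataFrame) -> pd.Series:
        """Share of messages by inferred sender gender, 'unknown' included."""
        return messages['From'].map(sender_gender).value_counts(normalize=True)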

I mainly started with HRPC drafts because of the close connection between
the BigBang community and the HRPC community, and because with a small set
of authors I knew we could validate the results amongst ourselves. We'd be
our own guinea pigs, so to speak.

It might be notable if the gender breakdown of the draft authors were
unrepresentative of the breakdown of the corresponding mailing lists.

Or if draft content varied, on average, with draft author gender.
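
If we wanted a quick check of the first question, a standard contingency-table
test would do; the counts below are placeholders, not real data:

    from scipy.stats import chi2_contingency

    # Rows: draft authors vs. mailing list participants.
    # Columns: female, male (unknowns dropped). Placeholder counts only.
    table = [[3, 12],
             [40, 160]]
    chi2, p, dof, expected = chi2_contingency(table)
    print(f"chi2 = {chi2:.2f}, p = {p:.3f}")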


> I believe Jari was providing statistics on the gender of RFC authors which
> used (at least in part) a manual list. He wouldn’t make that list public
> for privacy reasons, but it could be something he would be willing to share
> with researchers as long as we also kept it private.
>

Yeah, let's stay away from that!

> Yes, I found the methods and caveats about them to be the most detailed
> part of working/writing on this topic. In the draft I’d put together so
> far, I started with all the limitations of the method, and then tried to
> explain why it still might be useful to look at these estimates. I’m still
> cautious about publishing that because I don’t know how much we can look
> past those limitations and whether any harm can be done by publishing
> estimates, but I’d be interested to hear other perspectives.
>

I think it's good work and you should publish it!

> Maybe it would be best to work on a paper together that could include
> multiple reviews and perspectives.
>

I'm all for that :)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gender-fraction-bars-20170718.png
Type: image/png
Size: 49686 bytes
Desc: not available
URL: <http://lists.ghserv.net/pipermail/bigbang-dev/attachments/20200710/3cb7aaf6/attachment-0002.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gender-pies-20170730.png
Type: image/png
Size: 2872434 bytes
Desc: not available
URL: <http://lists.ghserv.net/pipermail/bigbang-dev/attachments/20200710/3cb7aaf6/attachment-0003.png>

