[Bigbang-dev] Notes on gender estimation in mailing list analysis

Sat Oct 29 00:42:05 CEST 2016

# Notes on gender estimation in mailing list analysis

One high-level goal of mailing list analysis, as a way of understanding participation in collaborative groups (a key topic for BigBang), is evaluating the diversity of participants and of their participation. Who participates in software development, technical standard-setting, Internet governance or other groups? To what extent are the participants or levels of participation skewed in terms of organizational affiliation, nationality, gender or other demographic characteristics?

Feedback is appreciated, but in particular, I'd like to know:
* what research questions do you have where gender estimation might be useful?
* do you know of other projects (not already listed here) doing similar things which would be useful for comparison or collaboration?
* whom else should we be asking for feedback?

Want to work on this feature? Join this GitHub issue: https://github.com/datactive/bigbang/issues/249

## Research questions and methods

In the space of Internet governance broadly, I believe these questions are motivated by an interest in exploring the legitimacy of multistakeholder decision-making and how representative it might be of, for example, the users of technology being designed. What factors affect differences in participation by demographic?

In addition, we may wish to understand how the particular collaborative methods we are exploring affect the representation of participation. Do GitHub projects with codes of conduct show more active participation from women? Do mailing list messages show proportionately more or less participation from non-US regions compared to in-person meetings of Internet governance groups? Is there significant variation in participation rates (among genders, organizational affiliations, regions) between different working groups within a single organization?

As usual, quantitative methods could be used both inductively (to identify disparities in participation that are worthy of explanation by detailed qualitative investigation) and deductively (to either support or counter a hypothesis generated from qualitative study by surveying a large sample).

## Challenges

These notes refer to *estimating* gender to emphasize how difficult it will be to determine participant gender with high precision or confidence. Because email is a computer-mediated form of communication, the recipients of a message may not know much about the sender: not their legal name or how they appear in person, much less the details of their gender identity.

Automated means are further constrained. For mailing lists, our best methods of estimation are likely to be based on the sender's name (given and family names, where present) and on records of how often a particular name is used by a person of a particular gender. Existing gender estimation libraries rely on name records specific to a particular country (the same given name might be commonly associated with a different gender in the US vs. the UK, for example). Most of the Internet governance or software development mailing lists we're looking at, however, have participants from many different countries.
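To make the country problem concrete, here is a minimal sketch of how a name-record estimate could be computed. Everything in it is an assumption for illustration: the `TOY_RECORDS` table holds made-up counts (not real name statistics), and `female_share`, `label` and the 0.9 threshold are hypothetical names and values, not part of any existing library.

```python
# Made-up counts for illustration only; these are NOT real name statistics.
# Layout: {country: {given_name: (recorded_female, recorded_male)}}
TOY_RECORDS = {
    "us": {"andrea": (95, 5), "jane": (99, 1)},
    "it": {"andrea": (2, 98)},
}


def female_share(name, records, country=None):
    """Share of recorded bearers of `name` who were recorded as female.

    With country=None the counts are pooled over every country in
    `records`. Returns None when the name has no records at all.
    """
    key = name.lower()
    countries = [country] if country else list(records)
    female = male = 0
    for code in countries:
        f, m = records.get(code, {}).get(key, (0, 0))
        female += f
        male += m
    total = female + male
    return female / total if total else None


def label(share, threshold=0.9):
    """Turn a share into a coarse label, refusing to guess when unsure."""
    if share is None:
        return "unknown"
    if share >= threshold:
        return "female"
    if share <= 1 - threshold:
        return "male"
    return "ambiguous"
```

With these toy numbers, "Andrea" gets a confident but opposite label depending on which country's records are used, and becomes "ambiguous" once the counts are pooled, which is the situation an international mailing list puts us in.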

As Matias notes [0], it's possible to combine automated with human methods -- for example, the researcher or crowdsourced workers could look up a person's online presence and guess their gender. (I understand that Harsh Gupta determined gender by looking up the online identities of all participants in a group, for example [1].)

This work should not make claims about the gender of particular individuals, because these methods can't give a high level of confidence at that level and because it may not be appropriate either to out someone's gender status or misgender them.

## Code

Using the malev/gender-detector library [2], I've written initial code that attempts to extract given names from email headers and calculate aggregate gender information for the messages of a mailing list. You can see that code in use in a Python notebook [3]. As expected, the completeness of that estimate varies between mailing lists and suffers from two kinds of failure: not being able to determine the first/given name, and not being able to guess the gender from that name.
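For readers who don't want to open the notebook, the general shape of that pipeline can be sketched as follows. This is not the notebook's code: `given_name`, `aggregate`, `toy_guess` and the `TOY_GUESSES` table are hypothetical names made up for this sketch, and the name-splitting heuristics are assumptions. In practice the `guess` argument would be a wrapper around the library in [2]; a tiny lookup table stands in for it here so the example is self-contained.

```python
from collections import Counter
from email.utils import parseaddr


def given_name(from_header):
    """Best-effort given name from a From header, or None if unclear."""
    display_name, _address = parseaddr(from_header)
    display_name = display_name.strip().strip("\"'")
    if not display_name or "@" in display_name:
        return None  # no display name, or the address was used as the name
    if "," in display_name:
        # Assume "Family, Given" ordering when a comma is present.
        display_name = display_name.split(",", 1)[1]
    parts = display_name.split()
    if not parts:
        return None
    first = parts[0].strip(".")
    # A bare initial ("J.") tells us nothing about the given name.
    return first if len(first) > 1 else None


def aggregate(from_headers, guess):
    """Count messages per estimated gender; both failure modes are 'unknown'."""
    counts = Counter()
    for header in from_headers:
        name = given_name(header)
        counts[guess(name) if name else "unknown"] += 1
    return counts


# Stand-in for a real name-based estimator; illustrative entries only.
TOY_GUESSES = {"jane": "female", "john": "male"}


def toy_guess(name):
    return TOY_GUESSES.get(name.lower(), "unknown")
```

Calling `aggregate(headers, toy_guess)` on a list of From headers returns per-message counts only, which keeps the output at the aggregate level rather than labelling individual senders.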

As a next step, it would be useful to provide a way to input a spreadsheet of email addresses and human-guessed genders to supplement the name-based automated guess. We could also add functionality to export a list of names/email addresses that can't be guessed, to facilitate the most efficient use of human effort. It would be nice if we could develop a Mechanical Turk workflow for getting volunteer coding of gender and use inter-coder reliability to provide confidence. If possible, I'd love it if we could coordinate that work with others doing similar research.
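None of this exists yet, but a rough sketch may help discussion. The CSV layout (`email,gender` columns), the function names and the `"unknown"` marker are all assumptions of mine, and Cohen's kappa is just one common inter-coder reliability measure for two coders, chosen here for simplicity.

```python
import csv
import io


def load_overrides(csv_text):
    """Parse 'email,gender' rows into a dict keyed by lowercased address."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["email"].strip().lower(): row["gender"].strip() for row in reader}


def combine(auto_guesses, overrides):
    """Human-coded values take precedence over automated guesses."""
    return {
        email: overrides.get(email.lower(), guess)
        for email, guess in auto_guesses.items()
    }


def unresolved(combined):
    """Addresses still lacking an estimate, sorted, to hand to human coders."""
    return sorted(email for email, guess in combined.items() if guess == "unknown")


def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders' labels over the same items, in order."""
    if not coder_a or len(coder_a) != len(coder_b):
        raise ValueError("need two non-empty label lists of equal length")
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    labels = set(coder_a) | set(coder_b)
    # Agreement expected by chance, from each coder's label frequencies.
    expected = sum(
        (coder_a.count(lab) / n) * (coder_b.count(lab) / n) for lab in labels
    )
    if expected == 1:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

The idea is that `unresolved` produces the worklist for human coders, their answers come back through `load_overrides`, and a low kappa between coders would be a signal to treat the human-coded values with less confidence.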

This work has mostly focused on mailing lists, but GitHub activity would be a similar area to explore. I've been interested in learning more about research showing that pull requests from women are accepted more often if they're not identifiable as women [4].

[0] https://civic.mit.edu/blog/natematias/best-practices-for-ethical-gender-research-at-very-large-scales
[1] https://github.com/hargup/eme_diversity_analysis/blob/master/EME%20Diversity%20Analysis.ipynb
[2] https://github.com/malev/gender-detector
[3] https://github.com/npdoty/bigbang/blob/169ceabd98ba794abc2f1dcc78f7e2eeaf43fdf2/examples/Analyze%20Senders%20-%20Name%20and%20Gender.ipynb
[4] http://schedule.bid-seminar.com/speakers/75