[Bigbang-dev] Notes on gender estimation in mailing list analysis

Sebastian Benthall sbenthall at gmail.com
Mon Oct 31 22:06:26 CET 2016


Nick, thanks so much for championing development of this important feature!

If you think there's a better place to respond to your specific prompts
(issue, wiki?) then please tell me. For now, a couple thoughts in email...


> Feedback is appreciated, but in particular, I'd like to know:
> * what research questions do you have where gender estimation might be
> useful?
> * do you know of other projects (not already listed here) doing similar
> things which would be useful for comparison or collaboration?
> * whom else should we be asking for feedback?
>

I think Joe Hall at CDT mentioned an interest in this issue specifically. I
don't recall what his research questions were in particular.

You mentioned some work by Nathan Matias. I think it would be fantastic if
we could get him to weigh in on this.

## Research questions and methods
>
> In the space of Internet governance broadly, I believe these questions are
> motivated by an interest in exploring the legitimacy of multistakeholder
> decision-making and how representative it might be of, for example, the
> users of technology being designed. What factors affect differences in
> participation by demographic?
>

Personally, I see two sides to the diversity question. One which has a lot
of popular interest is the question of diversity and legitimacy. A
governing body that is not representative of the demographics of the people
it governs can be perceived as illegitimate. This is an important issue.

A related but different question (which I bring up just because it's closer
to my own substantive research interests) is the relationship between
diversity and collective intelligence. Scott Page has prominent work on
this:

https://democracyspot.net/2013/04/26/when-diversity-trumps-ability/

While I think it may be useful to distinguish the diversity-and-legitimacy
question from the diversity-and-productivity/intelligence questions, from a
data preprocessing perspective there's a lot of overlap in the requirements
needed to study either.

So I suppose I would argue that we may want to separate the concerns of
robust heuristics for gender estimation from any particular research
question.


> In addition, we may wish to understand how the particular collaborative
> methods we are exploring affect the representation of participation. Do
> Github projects with codes of conduct show more active participation from
> women? Do mailing list messages show proportionately more or less
> participation from non-US regions compared to in-person meetings of
> Internet governance groups? Is there significant variation in participation
> rates (among genders, organizational affiliations, regions) between
> different working groups within a single organization?
>

I think one thing these queries suggest is the importance of
intersectionality when considering the properties of these participants.

Putting it another way: I'd argue that we'll want to have a whole suite of
tools for gathering and estimating data about mailing list participants,
including gender but also organizational affiliation and geographic origin.
Then a lot of the interesting questions/answers will be in the correlations
of these different pieces of information.
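
To make that a little more concrete, here's the sort of per-participant
table I'm imagining (the field names and the pandas crosstab are just an
illustrative sketch, nothing that exists in BigBang today):

    import pandas as pd

    # Hypothetical per-participant records; each attribute would come from
    # its own estimator (name-based gender guess, affiliation from the email
    # domain, region from headers or a lookup, etc.).
    participants = pd.DataFrame([
        {"email": "a@example.org", "gender_est": "female",
         "affiliation": "Org A", "region": "EU"},
        {"email": "b@example.org", "gender_est": "male",
         "affiliation": "Org B", "region": "US"},
        {"email": "c@example.org", "gender_est": "unknown",
         "affiliation": "Org B", "region": "Asia"},
    ])

    # The interesting questions then come from crossing these columns,
    # e.g. how estimated gender breaks down by affiliation:
    print(pd.crosstab(participants["affiliation"], participants["gender_est"]))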


> These notes refer to *estimating* gender to emphasize how difficult it
> will be to determine participant gender with high precision or confidence.
> As a computer-mediated form of communication, recipients of an email may
> not know much about the sender, including their legal name, how they
> appear in person, much less details of gender identity.
>

I think that this is definitely an important caveat to any work we do on
this. I think we should be very active in soliciting criticism on our
efforts in this area, and specifically in engaging other interested
researchers in designing computational solutions to the identity questions
that this sort of work raises.


> As Matias notes [0], it's possible to combine automated with human methods
> -- for example, the researcher or crowdsourced workers could look up a
> person's online presence and guess their gender. (I understand that Harsh
> Gupta determined gender by looking up the online identities of all
> participants in a group, for example [1].)
>

I wonder if this process could be automated at all.

For example, I can imagine a script that:

   1. Takes a given name/email address/organizational affiliation
   2. Looks up the first few hits on Google (or some search engine with a
   public API)
   3. Counts pronouns used on those pages and compares them with global
   averages
   4. Uses that comparison as evidence in a gender estimation heuristic.

I wonder how accurate such a script would be.
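
Very roughly, and purely as a sketch, something like the following (the
web_search function is a stand-in for whatever engine/API we would actually
use, and the baseline number is made up):

    import re
    from collections import Counter

    def web_search(query, n_results=5):
        """Stand-in for a real search API call; assume it returns the text
        of the top few result pages for a query."""
        raise NotImplementedError("plug in a search engine with a public API")

    # Placeholder global baseline for gendered pronoun usage on the web;
    # this is not a measured value.
    BASELINE_SHE = 0.35

    def pronoun_evidence(name, email, affiliation):
        """Count gendered pronouns on pages mentioning this person and
        compare the ratio to the baseline, returning a crude likelihood
        ratio to feed into a gender estimation heuristic."""
        pages = web_search('"%s" "%s" "%s"' % (name, email, affiliation))
        counts = Counter()
        for text in pages:
            for token in re.findall(r"\b(he|she|him|her|his|hers)\b",
                                    text.lower()):
                counts["she" if token in ("she", "her", "hers") else "he"] += 1
        total = sum(counts.values())
        if total == 0:
            return None  # no evidence either way
        observed_she = counts["she"] / float(total)
        # Evidence for 'female' relative to the baseline rate.
        return observed_she / BASELINE_SHE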


> ## Code
>
> Using malev/gender-detector library [2], I've written initial code that
> attempts to extract given names from email headers and calculate aggregate
> gender information for the messages of a mailing list. You can see that
> code in use in a Python notebook [3]. As expected, the completeness of that
> estimate varies between mailing lists and suffers both from not being able
> to determine the first/given name and not being able to guess the gender
> from that name.
>

How extensible is this library?
Is it internationalized?


> As a next step, it would be useful to provide a way to input a spreadsheet
> of email addresses and human-guessed genders to supplement the name-based
> automated guess. We could also add functionality to export a list of
> names/email addresses that can't be guessed, to facilitate the most
> efficient use of human effort.
>

+1

One interesting insight from the summer school workshop was that a number
of participants got the most value out of BigBang by exporting data as .csv
and importing it into Excel, which they were more familiar with.

So in general, architecting with ease of data import/export in mind is
going to be good for adoption by non-developers.
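
To sketch the round trip I have in mind (the column names and file path are
just placeholders), using the plain csv module so the file opens cleanly in
Excel:

    import csv

    # Hypothetical estimates produced by the automated pass; None means
    # the name-based guess failed.
    estimates = {
        "jane@example.org": "female",
        "s.k.patel@example.net": None,
    }

    # Export the unresolved addresses for human annotation.
    with open("unresolved.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["email", "gender"])
        for email, gender in estimates.items():
            if gender is None:
                writer.writerow([email, ""])

    # Later, read the human-filled spreadsheet back in and let it override
    # or supplement the automated guesses.
    with open("unresolved.csv", newline="") as f:
        for row in csv.DictReader(f):
            if row["gender"]:
                estimates[row["email"]] = row["gender"]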


> This has mostly been focused on mailing lists, but Github activity would
> be another similar area. I've been interested in learning more about
> research showing that pull requests from women are accepted more often if
> they're not identifiable as women [4].
>

Excellent point.

I imagine that in the future, as we think about the ontology underlying our
analysis, we are going to want to talk about a Person, who will have one or
more identified email addresses and Git/GitHub usernames/credentials.
Resolving all of this information across media is going to be a big source
of insights, I believe.
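
As a very rough sketch of what that entity might look like (every field
name here is speculative):

    class Person(object):
        """Speculative cross-media identity record for the kind of ontology
        described above; the fields are just illustrative."""
        def __init__(self, name, emails=None, github_usernames=None,
                     gender_est=None, affiliation=None, region=None):
            self.name = name
            self.emails = emails or []                      # identified email address(es)
            self.github_usernames = github_usernames or []  # Git/GitHub credentials
            self.gender_est = gender_est                    # e.g. from the name-based pass
            self.affiliation = affiliation
            self.region = region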

Thanks again for bringing up all these important issues!
Seb