[Bigbang-dev] research questions of interest for standard-setting participation

Sebastian Benthall sbenthall at gmail.com
Fri Feb 16 17:35:22 CET 2018


Is there a web-accessible link to a dump of the IETF data that's ready?

I'm reinstalling BigBang fresh on a new machine and figure I should start
working with the IETF data set, as that's the topic of interest at the
moment.

On Fri, Feb 16, 2018 at 5:31 AM, Niels ten Oever <niels at article19.org>
wrote:

> I would love to at least listen-in!
>
> Cheers,
>
> Niels
> On 02/16/2018 01:26 AM, Nick Doty wrote:
> > On Feb 5, 2018, at 2:09 PM, Sebastian Benthall <sbenthall at gmail.com> wrote:
> >>
> >> 2) I just figured out how to make time for this in the short term. So
> >> count me in.
> >>
> >> Shall we plan a meeting about this?
> >
> > Yeah, I'd love to do that! Would folks be interested in an audio chat
> > next week? I will send around a Doodle poll if it's more than just me
> > and Seb.
> >
> >> On Feb 5, 2018 4:24 PM, "Sebastian Benthall" <sbenthall at gmail.com> wrote:
> >>
> >>     These are great questions, Nick.
> >>
> >>     I'd love to work on them with you, especially because they are
> >>     such general metrics.
> >>     Sadly I've got almost no time to work on it until May, due to
> >>     dissertation work.
> >>
> >>     Let me provide some recommendations based on my attempts to
> >>     address similar questions on SciPy and other lists.
> >
> > These comments are really helpful, thanks!
> >
> > I am interested to understand the math better, and could really use your
> > help on that. I definitely get your general point that because there's a
> > long-tail distribution in any case, I need to find cases that don't fit
> > that pattern in order to show meaningful results.
> >
> > I'm not sure I understand the concentration parameter, but it does seem
> > like something like that would be useful. I also thought there might be
> > interesting graph analysis metrics -- like centrality? -- in a graph
> > whose nodes are participants and lists, linked by participation.
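
A minimal sketch of one such metric, degree centrality on the participant/list graph, using only the Python standard library. The function name and the (participant, list) input format are assumptions for illustration, not part of BigBang. Each node's degree is divided by the size of the opposite node set, which is one common normalization for two-mode graphs.

```python
from collections import defaultdict


def bipartite_degree_centrality(edges):
    """Degree centrality for a participant/list graph.

    edges: iterable of (participant, mailing_list) pairs (hypothetical format).
    Returns (participant_centrality, list_centrality), each normalized by
    the number of nodes on the opposite side of the graph.
    """
    lists_of = defaultdict(set)   # participant -> lists they posted on
    people_on = defaultdict(set)  # list -> participants who posted there
    for person, mailing_list in edges:
        lists_of[person].add(mailing_list)
        people_on[mailing_list].add(person)
    person_c = {p: len(ls) / len(people_on) for p, ls in lists_of.items()}
    list_c = {m: len(ps) / len(lists_of) for m, ps in people_on.items()}
    return person_c, list_c


# Hypothetical toy data, not real IETF participation.
toy_edges = [("a", "x"), ("a", "y"), ("b", "x")]
person_c, list_c = bipartite_degree_centrality(toy_edges)
print(person_c, list_c)
```

Degree centrality ignores message volume, so heavy posters and one-time posters on the same list count the same here.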
> >
> > Thanks again for your thoughts!
> > —Nick
> >
> >>
> >>         * how many participants total in IETF work?
> >>
> >>
> >>     The odds are *very* high that the emails-per-person distribution
> >>     is a heavy-tail distribution.
> >>     Based on previous work
> >>     <https://conference.scipy.org/proceedings/scipy2015/pdfs/sebastian_benthall.pdf>,
> >>     I would test for fit to log normal and power law distributions.
> >>     My money is on log normal being a better fit.
> >>
> >>     This is important because when interpreting the results, we have
> >>     to keep in mind that
> >>     the log normal distribution is essentially a noise pattern.
> >>     So it's easy to read into the data relationships that may not be
> >>     there,
> >>     especially if you're using a linear rather than a log linear
> >>     relationship as an indicator.
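
A rough sketch of the log-normal versus power-law comparison, using only the Python standard library. It fits each distribution by maximum likelihood and compares log-likelihoods. Assumptions to note: both fits use continuous distributions (message counts are really discrete), the power law takes xmin as the smallest observation, and the sample is synthetic. A higher log-likelihood is only a hint, not a significance test.

```python
import math
import random


def lognormal_loglik(xs):
    """Log-likelihood of xs under a log-normal fitted by maximum likelihood."""
    logs = [math.log(x) for x in xs]
    n = len(logs)
    mu = sum(logs) / n
    var = sum((v - mu) ** 2 for v in logs) / n  # MLE variance divides by n
    return sum(
        -v - 0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
        for v in logs
    )


def power_law_loglik(xs):
    """Log-likelihood under a continuous power law with xmin = min(xs)."""
    xmin = min(xs)
    n = len(xs)
    s = sum(math.log(x / xmin) for x in xs)
    alpha = 1 + n / s  # MLE of the exponent
    return n * math.log((alpha - 1) / xmin) - alpha * s


# Synthetic stand-in for real emails-per-person counts (an assumption).
random.seed(0)
sample = [random.lognormvariate(1.0, 1.0) for _ in range(2000)]
print("log-normal:", lognormal_loglik(sample))
print("power law: ", power_law_loglik(sample))
```

Both functions require positive values that are not all identical; otherwise the fitted variance or exponent is undefined.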
> >>
> >>         * how "sticky" is participation?
> >>                 if people participate on a list, do they return? do
> >>         they show up to f2f meetings?
> >>                 what's the attrition rate?
> >>                 what's the distribution of length of participation?
> >>
> >>
> >>     Assuming there is a heavy tail distribution of participation, then
> >>     about half the contributors
> >>     will only contribute once.
> >>
> >>     The distribution of attrition/retention will look more or less
> >>     just like the distribution of participation.
> >>     The length will look like it as well.
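
To make the stickiness measures concrete, here is a small sketch that computes, per sender, the message count and the span in days between first and last message, plus the share of one-time contributors. The input format, a list of (sender, date) pairs, and the function names are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date


def participation_summary(messages):
    """Map each sender to (message_count, tenure_in_days).

    messages: iterable of (sender, datetime.date) pairs (hypothetical format).
    """
    dates = defaultdict(list)
    for sender, day in messages:
        dates[sender].append(day)
    return {s: (len(ds), (max(ds) - min(ds)).days) for s, ds in dates.items()}


def one_time_fraction(summary):
    """Share of senders who sent exactly one message."""
    return sum(1 for count, _ in summary.values() if count == 1) / len(summary)


# Hypothetical toy data, not real list traffic.
toy_messages = [
    ("a", date(2018, 1, 1)),
    ("a", date(2018, 2, 1)),
    ("a", date(2018, 3, 1)),
    ("b", date(2018, 1, 15)),
    ("c", date(2018, 2, 20)),
]
summary = participation_summary(toy_messages)
print(summary, one_time_fraction(summary))
```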
> >>
> >>     It's not clear how to interpret this, because the reasons why any
> >>     particular person participates a lot
> >>     or a little are very likely
> >>     (a) myriad (no single reason, but rather a combination of many
> >>     reasons), and
> >>     (b) exogenous to the data itself.
> >>
> >>     For these reasons I expect you would get more interesting results
> >>     if you can segment the population
> >>     into categories of interest. You've mentioned gender and firms of
> >>     employment, which are both good ones.
> >>
> >>     But for each category, you may want to have more than one parameter
> >>     to characterize its participation distribution.
> >>     Maybe mean /and/ variance?
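
A sketch of the two-parameter summary per category. Working on the log of message counts is an assumption that follows from the expected log-normal shape. The category labels and function name are hypothetical.

```python
import math
import statistics
from collections import defaultdict


def log_stats_by_category(counts, category_of):
    """Return {category: (mean, variance)} of log message counts.

    counts: dict sender -> message count (must be >= 1).
    category_of: dict sender -> category label (hypothetical labels).
    """
    groups = defaultdict(list)
    for sender, n in counts.items():
        groups[category_of[sender]].append(math.log(n))
    return {
        c: (statistics.mean(v), statistics.pvariance(v))
        for c, v in groups.items()
    }


# Hypothetical toy data.
toy_counts = {"a": 1, "b": 1, "c": 10, "d": 1000}
toy_category = {"a": "firm_x", "b": "firm_x", "c": "firm_y", "d": "firm_y"}
stats = log_stats_by_category(toy_counts, toy_category)
print(stats)
```

Two categories with the same mean can differ a lot in variance, which a single summary number per category would hide.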
> >>
> >>         * who has participated longest? across the most groups?
> >>                 is there a group of "elites" across working groups?
> >>
> >>
> >>     This is a great question.
> >>     But keep in mind: the people who participate most send far more
> >>     messages across all lists than others,
> >>     so they have more chances to participate in different lists.
> >>
> >>     You may want to be looking at, for each participant, their
> >>     individual distribution of participation
> >>     over many lists, and then look at the concentration parameter of
> >>     that distribution:
> >>
> >>     <https://en.wikipedia.org/wiki/Concentration_parameter>
> >>
> >>     The math can be a bit tricky but I think it's worth tackling
> >>     correctly.
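
As a simpler stand-in while the math gets worked out, here is a sketch of two descriptive measures of how concentrated one participant's messages are across lists: the Herfindahl index and normalized Shannon entropy. To be clear, neither is the Dirichlet concentration parameter from the linked article. They are swapped-in proxies, and the function names are hypothetical.

```python
import math


def herfindahl(counts):
    """Sum of squared shares: 1.0 when all messages go to one list."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)


def normalized_entropy(counts):
    """Shannon entropy of the shares, divided by log(number of lists).

    counts covers every list under study, with zeros for lists the
    participant never posted on. 0.0 = one list only, 1.0 = even spread.
    """
    if len(counts) < 2:
        return 0.0
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs) / math.log(len(counts))


# Hypothetical per-list message counts for two participants.
focused = [10, 0, 0, 0]
spread = [5, 5, 5, 5]
print(herfindahl(focused), normalized_entropy(focused))
print(herfindahl(spread), normalized_entropy(spread))
```

Both measures still depend on how many messages a person sent, so comparisons are probably only fair between participants with similar volumes.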
> >>
> >>
> >>                 how many participants are single-group?
> >>
> >>
> >>     Since most participants will only send one message, that's
> >>     going to skew this metric
> >>     unless you take that into account somehow.
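
One possible way to take that into account, sketched below, is to compute the single-list share only among participants with at least some minimum number of messages. The threshold parameter, the input format, and the function name are all assumptions for illustration.

```python
def single_list_share(posts, min_messages=1):
    """Share of eligible participants who posted on exactly one list.

    posts: dict sender -> dict list_name -> message count (hypothetical format).
    min_messages: only senders with at least this many messages are counted.
    Returns None when nobody meets the threshold.
    """
    eligible = [
        per_list
        for per_list in posts.values()
        if sum(per_list.values()) >= min_messages
    ]
    if not eligible:
        return None
    single = sum(
        1
        for per_list in eligible
        if sum(1 for c in per_list.values() if c > 0) == 1
    )
    return single / len(eligible)


# Hypothetical toy data.
toy_posts = {
    "a": {"x": 1},
    "b": {"x": 1},
    "c": {"x": 3, "y": 2},
    "d": {"y": 4},
}
print(single_list_share(toy_posts), single_list_share(toy_posts, min_messages=2))
```

A sender with one message is single-list by construction, so raising the threshold shows how much of the raw share comes from one-time posters.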
> >>
> >>
> >>                 how many groups does the typical participant join?
> >>
> >>         As I believe I've mentioned to this group before, I've been
> >>         looking into estimating gender in mailing list participation,
> >>         including:
> >>
> >>         * What is the gender distribution of participants in Internet
> >>         and Web technical standard-setting?
> >>             how does that distribution differ from the population at
> >>         large? from employment at related firms?
> >>             does that distribution change over time?
> >>             are there sub-groups which have distinctly different
> >>         distributions?
> >>         * Does the gender distribution of conversation differ from the
> >>         gender distribution of the participants?
> >>
> >>
> >>     Great questions.
> >>
> >>
> >>         Do you have questions you'd like to add to this list? Would
> >>         you be interested in trying to measure/answer one of these
> >>         questions? Which are the easiest and which are the most
> >>         difficult? What features would we need to add to BigBang to
> >>         make them answerable?
> >>
> >>
> >>     In sum, I think all these questions are great ones and related to
> >>     each other.
> >>     I think the biggest challenge is getting the statistical
> >>     modeling right,
> >>     so that the results are not misinterpreted.
> >>
> >>     - Seb
> >>
> >>
> >
> >
> >
> > _______________________________________________
> > Bigbang-dev mailing list
> > Bigbang-dev at data-activism.net
> > https://lists.ghserv.net/mailman/listinfo/bigbang-dev
> >