[Bigbang-dev] research questions of interest for standard-setting participation

Fri Feb 16 11:31:22 CET 2018

I would love to at least listen in!

Cheers,

Niels
On 02/16/2018 01:26 AM, Nick Doty wrote:
> On Feb 5, 2018, at 2:09 PM, Sebastian Benthall <sbenthall at gmail.com
> <mailto:sbenthall at gmail.com>> wrote:
>>
>> 2) I just figured out how to make time for this in the short term. So
>> count me in.
>>
>> Shall we plan a meeting about this?
> 
> Yeah, I'd love to do that! Would folks be interested in an audio chat
> next week? I will send around a Doodle poll if it's more than just me
> and Seb.
> 
>> On Feb 5, 2018 4:24 PM, "Sebastian Benthall" <sbenthall at gmail.com
>> <mailto:sbenthall at gmail.com>> wrote:
>>
>>     These are great questions, Nick.
>>
>>     I'd love to work on them with you, especially because they are
>>     such general metrics.
>>     Sadly I've got almost no time to work on it until May, due to
>>     dissertation work.
>>
>>     Let me provide some recommendations based on my attempts to
>>     address similar questions on SciPy and other lists.
> 
> These comments are really helpful, thanks!
> 
> I am interested to understand the math better, and could really use your
> help on that. I definitely get your general point that because there's a
> long-tail distribution in any case, I need to find cases that don't fit
> that pattern in order to show meaningful results. 
> 
> I'm not sure I understand the concentration parameter, but it does seem
> like something like that would be useful. I also thought there might be
> interesting graph analysis metrics -- like centrality? -- in a graph
> connecting participants and lists.
> 
> Thanks again for your thoughts!
> —Nick
>  
>>
>>         * how many participants total in IETF work?
>>
>>
>>     The odds are *very* high that the emails-per-person distribution
>>     is a heavy-tail distribution.
>>     Based on previous work
>>     <https://conference.scipy.org/proceedings/scipy2015/pdfs/sebastian_benthall.pdf>,
>>     I would test for fit to log normal and power law distributions.
>>     My money is on log normal being a better fit.
>>
>>     This is important because when interpreting the results, we have
>>     to keep in mind that
>>     the log normal distribution is essentially a noise pattern.
>>     So it's easy to read into the data relationships that may not be
>>     there,
>>     especially if you're using a linear rather than a log linear
>>     relationship as an indicator.
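
One way to make the log normal versus power law comparison concrete is to fit both by maximum likelihood and compare the log-likelihoods. The sketch below is a minimal, standard-library-only illustration, not BigBang code: the function names, the choice of x_min = 1, and the continuous approximation to message counts are all assumptions made for the example. The log normal here is not truncated at x_min, so the comparison is only rough; a dedicated fitting package would handle discrete data and significance testing more carefully.

```python
import math
import random


def lognormal_loglik(xs):
    """Log-likelihood of xs under a log normal fitted by maximum likelihood."""
    logs = [math.log(x) for x in xs]
    n = len(logs)
    mu = sum(logs) / n
    var = sum((v - mu) ** 2 for v in logs) / n  # MLE variance of log(x)
    # At the MLE, the squared-error term of the log-likelihood equals n / 2.
    return -sum(logs) - 0.5 * n * math.log(2 * math.pi * var) - 0.5 * n


def power_law_loglik(xs, x_min=1.0):
    """Log-likelihood of xs under a continuous power law on x >= x_min."""
    logs = [math.log(x / x_min) for x in xs]
    n = len(logs)
    # MLE exponent for the continuous case; fails if every x equals x_min.
    alpha = 1.0 + n / sum(logs)
    return n * math.log((alpha - 1.0) / x_min) - alpha * sum(logs)


def compare_fits(counts, x_min=1.0):
    """Return both log-likelihoods for the observations at or above x_min."""
    xs = [float(c) for c in counts if c >= x_min]
    return {
        "lognormal": lognormal_loglik(xs),
        "power_law": power_law_loglik(xs, x_min),
    }


# Synthetic data only: a sample drawn from a known log normal.
random.seed(0)
synthetic = [random.lognormvariate(2.0, 0.5) for _ in range(2000)]
print(compare_fits(synthetic))
```

The higher log-likelihood indicates the better-fitting family on that sample. On real emails-per-person counts the two values can be close, which is one reason to treat a visual straight line on a log-log plot with caution.
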
>>
>>         * how "sticky" is participation?
>>                 if people participate on a list, do they return? do
>>         they show up to f2f meetings?
>>                 what's the attrition rate?
>>                 what's the distribution of length of participation?
>>
>>
>>     Assuming there is a heavy tail distribution of participation, then
>>     about half the contributors
>>     will only contribute once.
>>
>>     The distribution of attrition/retention will look more or less
>>     just like the distribution of participation.
>>     The length will look like it as well.
>>
>>     It's not clear how to interpret this, because the reasons why any
>>     particular person participates a lot
>>     or a little are very likely 
>>     (a) myriad (no single reason, but rather a combination of many
>>     reasons), and
>>     (b) exogenous to the data itself.
>>
>>     For these reasons I expect you would get more interesting results
>>     if you can segment the population
>>     into categories of interest. You've mentioned gender and firms of
>>     employment, which are both good ones.
>>
>>     But for each category, you may want more than one parameter to
>>     characterize its participation distribution.
>>     Maybe mean /and/ variance?
>>
>>         * who has participated longest? across the most groups?
>>                 is there a group of "elites" across working groups?
>>
>>
>>     This is a great question.
>>     But keep in mind: the people who participate most will send far
>>     more messages across all lists than others do.
>>     So they will have more chances to participate in different lists.
>>
>>     You may want to be looking at, for each participant, their
>>     individual distribution of participation
>>     over many lists, and then look at the concentration parameter of
>>     that distribution:
>>
>>     https://en.wikipedia.org/wiki/Concentration_parameter
>>
>>     The math can be a bit tricky but I think it's worth tackling
>>     correctly.
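
As a simpler stand-in for the concentration idea, the sketch below scores how concentrated one participant's messages are across lists using the Herfindahl index (the sum of squared shares). This is a different statistic from the Dirichlet concentration parameter described on the linked page, swapped in here because it needs no model fitting. The function names and the example counts are hypothetical.

```python
def herfindahl(counts_by_list):
    """Sum of squared message shares: 1.0 means all messages on one list."""
    total = sum(counts_by_list.values())
    if total == 0:
        raise ValueError("participant has no messages")
    return sum((c / total) ** 2 for c in counts_by_list.values())


def effective_lists(counts_by_list):
    """Inverse Herfindahl index: the equivalent number of equally used lists."""
    return 1.0 / herfindahl(counts_by_list)


# Hypothetical per-list message counts for two participants.
focused = {"list-a": 40, "list-b": 1, "list-c": 1}
spread = {"list-a": 14, "list-b": 14, "list-c": 14}
print(effective_lists(focused), effective_lists(spread))
```

Both example participants appear on three lists, but the first is effectively a one-list participant. Comparing this score against total message volume is one way to check whether "elites" are genuinely spread across groups or merely prolific.
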
>>      
>>
>>                 how many participants are single-group?
>>
>>
>>     Since most participants will only send one message, that's
>>     going to skew this metric
>>     unless you take that into account somehow.
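
One way to take one-message senders into account is to report the single-list share only among participants above a minimum message count. A minimal sketch, assuming a hypothetical mapping from participant to per-list message counts and an arbitrary threshold:

```python
def single_list_share(activity, min_messages=1):
    """Share of participants active on exactly one list.

    Only participants with at least min_messages messages in total are counted.
    """
    eligible = [
        lists for lists in activity.values()
        if sum(lists.values()) >= min_messages
    ]
    if not eligible:
        return None  # nobody meets the threshold
    single = sum(
        1 for lists in eligible
        if sum(1 for c in lists.values() if c > 0) == 1
    )
    return single / len(eligible)


# Hypothetical data: participant -> {list name: message count}.
activity = {
    "p1": {"list-a": 1},
    "p2": {"list-b": 1},
    "p3": {"list-a": 12, "list-b": 3},
    "p4": {"list-a": 20},
}
print(single_list_share(activity), single_list_share(activity, min_messages=5))
```

Reporting the share at several thresholds shows how much of the single-group figure is driven by people who could not have appeared on a second list because they only sent one message.
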
>>      
>>
>>                 how many groups does the typical participant join?
>>
>>         As I believe I've mentioned to this group before, I've been
>>         looking into estimating gender in mailing list participation,
>>         including:
>>
>>         * What is the gender distribution of participants in Internet
>>         and Web technical standard-setting?
>>             how does that distribution differ from the population at
>>         large? from employment at related firms?
>>             does that distribution change over time?
>>             are there sub-groups which have distinctly different
>>         distributions?
>>         * Does the gender distribution of conversation differ from the
>>         gender distribution of the participants?
>>
>>
>>     Great questions.
>>      
>>
>>         Do you have questions you'd like to add to this list? Would
>>         you be interested in trying to measure/answer one of these
>>         questions? Which are the easiest and which are the most
>>         difficult? What features would we need to add to BigBang to
>>         make them answerable?
>>
>>
>>     In sum, I think all these questions are great ones and related to
>>     each other.
>>     I think the biggest challenge is getting the statistical
>>     modeling right,
>>     so that the results are not misinterpreted.
>>
>>     - Seb
>>      
>>
> 
> 
> 
> _______________________________________________
> Bigbang-dev mailing list
> Bigbang-dev at data-activism.net
> https://lists.ghserv.net/mailman/listinfo/bigbang-dev
> 
