[Bigbang-dev] research questions of interest for standard-setting participation
Sebastian Benthall
sbenthall at gmail.com
Fri Feb 16 17:51:12 CET 2018
Not urgent--I'll get the data with the crawler script and test as I go.
On Feb 16, 2018 11:35 AM, "Sebastian Benthall" <sbenthall at gmail.com> wrote:
> Is there a web-accessible link to a dump of the IETF data that's ready?
>
> I'm reinstalling bigbang fresh on a new machine and figure I should start
> working with the IETF data set, as that's the topic of interest at the
> moment.
>
> On Fri, Feb 16, 2018 at 5:31 AM, Niels ten Oever <niels at article19.org>
> wrote:
>
>> I would love to at least listen in!
>>
>> Cheers,
>>
>> Niels
>> On 02/16/2018 01:26 AM, Nick Doty wrote:
>> > On Feb 5, 2018, at 2:09 PM, Sebastian Benthall <sbenthall at gmail.com> wrote:
>> >>
>> >> 2) I just figured out how to make time for this in the short term. So
>> >> count me in.
>> >>
>> >> Shall we plan a meeting about this?
>> >
>> > Yeah, I'd love to do that! Would folks be interested in an audio chat
>> > next week? I will send around a Doodle poll if it's more than just me
>> > and Seb.
>> >
>> >> On Feb 5, 2018 4:24 PM, "Sebastian Benthall" <sbenthall at gmail.com> wrote:
>> >>
>> >> These are great questions, Nick.
>> >>
>> >> I'd love to work on them with you, especially because they are
>> >> such general metrics.
>> >> Sadly I've got almost no time to work on it until May, due to
>> >> dissertation work.
>> >>
>> >> Let me provide some recommendations based on my attempts to
>> >> address similar questions on SciPy and other lists.
>> >
>> > These comments are really helpful, thanks!
>> >
>> > I am interested to understand the math better, and could really use your
>> > help on that. I definitely get your general point that because there's a
>> > long-tail distribution in any case, I need to find cases that don't fit
>> > that pattern in order to show meaningful results.
>> >
>> > I'm not sure I understand the concentration parameter, but it does seem
>> > like something like that would be useful. I also thought there might be
>> > interesting graph analysis metrics -- like centrality? -- in a graph
>> > whose nodes are participants and lists, connected by participation.
>> >
>> > Thanks again for your thoughts!
>> > —Nick
>> >
>> >>
>> >> * how many participants total in IETF work?
>> >>
>> >>
>> >> The odds are *very* high that the emails-per-person distribution
>> >> is a heavy-tail distribution.
>> >> Based on previous work
>> >> <https://conference.scipy.org/proceedings/scipy2015/pdfs/sebastian_benthall.pdf>,
>> >> I would test for fit to log normal and power law distributions.
>> >> My money is on log normal being a better fit.
>> >>
>> >> This is important because when interpreting the results, we have
>> >> to keep in mind that
>> >> the log normal distribution is essentially a noise pattern.
>> >> So it's easy to read into the data relationships that may not be
>> >> there,
>> >> especially if you're using a linear rather than a log linear
>> >> relationship as an indicator.
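
A rough way to run that comparison, sketched in plain Python under loud assumptions: message counts are treated as continuous, the power law uses a fixed `xmin` of 1, and the log-normal is not truncated at `xmin`. The function names and the sample counts are made up for illustration and are not BigBang APIs. A careful analysis would also estimate `xmin` and use a proper likelihood-ratio test.

```python
import math


def fit_lognormal(counts):
    """MLE fit of a log-normal; returns (mu, sigma, log_likelihood)."""
    logs = [math.log(c) for c in counts]
    n = len(logs)
    mu = sum(logs) / n
    var = sum((y - mu) ** 2 for y in logs) / n  # MLE uses 1/n, not 1/(n-1)
    loglik = sum(
        -y - 0.5 * math.log(2 * math.pi * var) - (y - mu) ** 2 / (2 * var)
        for y in logs
    )
    return mu, math.sqrt(var), loglik


def fit_powerlaw(counts, xmin=1.0):
    """MLE fit of a continuous power law on x >= xmin.

    Returns (alpha, log_likelihood). Assumes every count is >= xmin
    and at least one count is strictly greater than xmin.
    """
    n = len(counts)
    s = sum(math.log(c / xmin) for c in counts)
    alpha = 1.0 + n / s
    loglik = n * math.log((alpha - 1.0) / xmin) - alpha * s
    return alpha, loglik


# Hypothetical emails-per-person counts, not real IETF data.
counts = [1, 1, 1, 2, 2, 3, 3, 4, 5, 7, 9, 14, 22, 40, 95]
mu, sigma, ll_lognormal = fit_lognormal(counts)
alpha, ll_powerlaw = fit_powerlaw(counts)
# A positive difference favors the log-normal on this sample.
print(ll_lognormal - ll_powerlaw)
```

Comparing raw log-likelihoods like this only says which fit is better on one sample, not whether the difference is significant.
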
>> >>
>> >> * how "sticky" is participation?
>> >> if people participate on a list, do they return? do
>> >> they show up to f2f meetings?
>> >> what's the attrition rate?
>> >> what's the distribution of length of participation?
>> >>
>> >>
>> >> Assuming there is a heavy tail distribution of participation, then
>> >> about half the contributors
>> >> will only contribute once.
>> >>
>> >> The distribution of attrition/retention will look more or less
>> >> just like the distribution of participation.
>> >> The length will look like it as well.
>> >>
>> >> It's not clear how to interpret this, because the reasons why any
>> >> particular person participates a lot
>> >> or a little are very likely
>> >> (a) myriad (no single reason, but rather a combination of many
>> >> reasons), and
>> >> (b) exogenous to the data itself.
>> >>
>> >> For these reasons I expect you would get more interesting results
>> >> if you can segment the population
>> >> into categories of interest. You've mentioned gender and firms of
>> >> employment, which are both good ones.
>> >>
>> >> But for each category, you may want to have more than one parameter to
>> >> characterize each one's participation distribution.
>> >> Maybe mean /and/ variance?
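
One possible reading of "mean and variance" per category, sketched in Python: if the counts are roughly log-normal, the natural two parameters are the mean and variance of the log counts. The category labels, the record layout, and the function name below are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean, pvariance


def summarize_by_category(records):
    """Map category -> (n, mean of log counts, variance of log counts)."""
    logs = defaultdict(list)
    for category, count in records:
        logs[category].append(math.log(count))
    return {
        cat: (len(ys), mean(ys), pvariance(ys))
        for cat, ys in logs.items()
    }


# Hypothetical (category, messages sent) pairs, not real data.
records = [("firm_a", 1), ("firm_a", 3), ("firm_a", 40),
           ("firm_b", 2), ("firm_b", 2), ("firm_b", 5)]
for cat, (n, log_mean, log_var) in summarize_by_category(records).items():
    print(cat, n, round(log_mean, 2), round(log_var, 2))
```

Two categories with similar log means can still differ a lot in log variance, which a single average would hide.
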
>> >>
>> >> * who has participated longest? across the most groups?
>> >> is there a group of "elites" across working groups?
>> >>
>> >>
>> >> This is a great question.
>> >> But keep in mind: the people who participate most will send far more
>> >> messages across all lists than others.
>> >> So they will have more chances to participate in different lists.
>> >>
>> >> You may want to be looking at, for each participant, their
>> >> individual distribution of participation
>> >> over many lists, and then look at the concentration parameter of
>> >> that distribution:
>> >>
>> >> https://en.wikipedia.org/wiki/Concentration_parameter
>> >>
>> >> The math can be a bit tricky but I think it's worth tackling
>> >> correctly.
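
Estimating a Dirichlet concentration parameter properly takes more machinery than fits here. As a simpler stand-in, and explicitly not the same quantity, this Python sketch scores how concentrated one person's posts are across lists using normalized Shannon entropy. The function name and the `n_lists` argument (the total number of lists in the corpus) are assumptions for illustration.

```python
import math
from collections import Counter


def list_concentration(list_names, n_lists):
    """Return 1 - normalized Shannon entropy of one person's posts.

    `list_names` has one entry per message, naming the list it went to.
    1.0 means every post went to a single list; 0.0 means posts were
    spread evenly over all `n_lists` lists.
    """
    if n_lists < 2:
        return 1.0
    counts = Counter(list_names)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return 1.0 - entropy / math.log(n_lists)


# Hypothetical participant who posted three times to one list, once to another.
print(list_concentration(["list_a", "list_a", "list_a", "list_b"], n_lists=4))
```

Anyone with a single message scores 1.0 trivially, so this score is only informative for people with enough messages to spread around.
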
>> >>
>> >>
>> >> how many participants are single-group?
>> >>
>> >>
>> >> Since most participants will only send one message, that's
>> >> going to skew this metric
>> >> unless you take that into account somehow.
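
One way to take that into account, sketched in Python: report the single-group share twice, once over everyone and once over people with at least two messages, since a one-message sender is single-group by construction. The record layout, names, and threshold are hypothetical.

```python
from collections import defaultdict


def single_group_share(messages, min_messages=1):
    """Fraction of senders active on exactly one list.

    `messages` is an iterable of (sender, list_name) pairs, one per email.
    Senders with fewer than `min_messages` emails are left out.
    Returns None if no sender meets the threshold.
    """
    lists_by_sender = defaultdict(set)
    count_by_sender = defaultdict(int)
    for sender, list_name in messages:
        lists_by_sender[sender].add(list_name)
        count_by_sender[sender] += 1
    kept = [s for s, n in count_by_sender.items() if n >= min_messages]
    if not kept:
        return None
    single = sum(1 for s in kept if len(lists_by_sender[s]) == 1)
    return single / len(kept)


# Hypothetical records, not real data.
messages = [("ann", "list_a"), ("bob", "list_a"), ("bob", "list_a"),
            ("cy", "list_a"), ("cy", "list_b"), ("dee", "list_b")]
print(single_group_share(messages))                  # all senders
print(single_group_share(messages, min_messages=2))  # repeat senders only
```
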
>> >>
>> >>
>> >> how many groups does the typical participant join?
>> >>
>> >> As I believe I've mentioned to this group before, I've been
>> >> looking into estimating gender in mailing list participation,
>> >> including:
>> >>
>> >> * What is the gender distribution of participants in Internet
>> >> and Web technical standard-setting?
>> >> how does that distribution differ from the population at
>> >> large? from employment at related firms?
>> >> does that distribution change over time?
>> >> are there sub-groups which have distinctly different
>> >> distributions?
>> >> * Does the gender distribution of conversation differ from the
>> >> gender distribution of the participants?
>> >>
>> >>
>> >> Great questions.
>> >>
>> >>
>> >> Do you have questions you'd like to add to this list? Would
>> >> you be interested in trying to measure/answer one of these
>> >> questions? Which are the easiest and which are the most
>> >> difficult? What features would we need to add to BigBang to
>> >> make them answerable?
>> >>
>> >>
>> >> In sum, I think all these questions are great ones and related to
>> >> each other.
>> >> I think the biggest challenge is getting the statistical
>> >> modeling right,
>> >> so that the results are not misinterpreted.
>> >>
>> >> - Seb
>> >>
>> >>
>> >
>> >
>> >
>> > _______________________________________________
>> > Bigbang-dev mailing list
>> > Bigbang-dev at data-activism.net
>> > https://lists.ghserv.net/mailman/listinfo/bigbang-dev
>> >
>>
>>
>>
>>
>