[Bigbang-dev] BigBang discrepancy between quantitative and qualitative findings - explanations?

Tue Jul 24 17:53:54 CEST 2018

Dear all,

I trust this email finds you well.

I have some questions after spending more time with the bigbang
notebooks, specifically "basic list statistics_Corinne's questions".

*Question 1:*
When I enter the following time frame:

date_from = pd.datetime(2014,10,1,tzinfo=pytz.utc)
date_to = pd.datetime(2015,11,30,tzinfo=pytz.utc)

I get conflicting answers in "#Q3: Number of emails in a time frame" and
"#then you can specify some years and have the break down per month".

[Q3]: 291
[some years per month]: 281 (11 for 2014 + 270 for months 1 through 11 of
2015)

2014:  tot 11
    1:   0
    2:   0
    3:   0
    4:   0
    5:   0
    6:   0
    7:   0
    8:   0
    9:   0
    10:   4
    11:   6
    12:   1
____________________
2015:  tot 297
    1:   15
    2:   12
    3:   14
    4:   8
    5:   46
    6:   11
    7:   37
    8:   11
    9:   11
    10:   61
    11:   44
    12:   13

And if I calculate backwards from the yearly totals for the period I am
interested in (October 2014 through November 2015): 11 + (297 - 13) =
295

which is again different from the answers I got above.
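
To make the arithmetic above easy to re-check, here is a minimal Python sketch that re-adds the monthly figures from the breakdown. The variable names are my own (hypothetical) and the numbers are copied by hand from the notebook output, so this only checks the sums, not how bigbang produced them.

```python
# Monthly counts copied by hand from the notebook output above.
# All variable names here are hypothetical, not part of bigbang.
monthly_2014 = {1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0,
                7: 0, 8: 0, 9: 0, 10: 4, 11: 6, 12: 1}
monthly_2015 = {1: 15, 2: 12, 3: 14, 4: 8, 5: 46, 6: 11,
                7: 37, 8: 11, 9: 11, 10: 61, 11: 44, 12: 13}

# Yearly totals as printed in the "tot" lines.
reported_total_2014 = 11
reported_total_2015 = 297

# Re-add the monthly rows for each year.
sum_2014 = sum(monthly_2014.values())
sum_2015 = sum(monthly_2015.values())

# October 2014 through November 2015, built from the monthly rows.
window_from_months = (
    sum(count for month, count in monthly_2014.items() if month >= 10)
    + sum(count for month, count in monthly_2015.items() if month <= 11)
)

# Same window, built from the printed yearly totals minus December 2015.
window_from_totals = reported_total_2014 + (reported_total_2015 - monthly_2015[12])

print("2014 rows:", sum_2014, "reported:", reported_total_2014)
print("2015 rows:", sum_2015, "reported:", reported_total_2015)
print("window from months:", window_from_months)
print("window from totals:", window_from_totals)
```

If I copied the figures correctly, the 2015 monthly rows add up to 283 rather than the reported "tot 297". That gap of 14 is exactly the difference between 281 and 295, though it still leaves the 291 from Q3 unexplained.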

What explains these discrepancies, and which number is the authoritative
answer? (The explanation might just be that my math skills suck, which is
very possible.)

*Question 2:*
#Q5 get threads with most replies

It would be useful if this also followed the time frame set in "#here
you can set the time frame", which it currently does not do.
I can tell because the thread it reports as having the most replies is out
of sync with my qualitative analysis for the time period (2014, 2015).

So, for instance, which threads have the most replies for

date_from = pd.datetime(2014,10,1,tzinfo=pytz.utc)
date_to = pd.datetime(2015,11,30,tzinfo=pytz.utc)

Or for

date_from = pd.datetime(2015,12,1,tzinfo=pytz.utc)
date_to = pd.datetime(2016,11,30,tzinfo=pytz.utc)

etc.
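
A minimal sketch of the behaviour I have in mind, assuming the messages are available as (thread, date) pairs. The data, the `top_threads` helper and the thread names are hypothetical and not taken from bigbang or the hrpc list. It uses the standard-library `datetime` rather than the `pd.datetime` call from the notebook, so it runs without pandas.

```python
from collections import Counter
from datetime import datetime, timezone

UTC = timezone.utc

# Hypothetical messages as (thread subject, sent date). Not real hrpc data.
messages = [
    ("thread A", datetime(2014, 10, 5, tzinfo=UTC)),
    ("thread A", datetime(2015, 3, 1, tzinfo=UTC)),
    ("thread A", datetime(2015, 6, 10, tzinfo=UTC)),
    ("thread B", datetime(2015, 1, 2, tzinfo=UTC)),
    ("thread B", datetime(2015, 11, 30, 12, 0, tzinfo=UTC)),
    ("thread C", datetime(2016, 1, 1, tzinfo=UTC)),
    ("thread C", datetime(2016, 1, 2, tzinfo=UTC)),
    ("thread C", datetime(2016, 1, 3, tzinfo=UTC)),
    ("thread C", datetime(2016, 1, 4, tzinfo=UTC)),
]


def top_threads(messages, date_from, date_to, n=5):
    """Count messages per thread, keeping only those sent within
    [date_from, date_to], and return the n largest threads."""
    counts = Counter(
        thread for thread, sent in messages if date_from <= sent <= date_to
    )
    return counts.most_common(n)


# Note: date_to is midnight at the start of 30 November, so in this sketch
# a message sent at noon on 30 November falls outside the window.
first_window = top_threads(
    messages,
    datetime(2014, 10, 1, tzinfo=UTC),
    datetime(2015, 11, 30, tzinfo=UTC),
)
second_window = top_threads(
    messages,
    datetime(2015, 12, 1, tzinfo=UTC),
    datetime(2016, 11, 30, tzinfo=UTC),
)

print(first_window)
print(second_window)
```

With this made-up data, "thread C" is the largest thread overall but does not appear at all in the first window, which is the kind of difference I am seeing against my qualitative analysis.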

*Question 3:*
In "#threads with most replies" I get the following results:

[hrpc] Examining existing Venue Selection criteria   71
[hrpc] Case three: DDoS   55
[hrpc] Human Rights Research Group Call on draft-irtf-hrpc-research-07   53
Re: [hrpc] draft-tenoever-hrpc-research-02   32
[hrpc] Comments about draft-irtf-hrpc-research-07   26

However, these counts don't hold up against my qualitative count (done by
hand), which has 34 responses to "[hrpc] Examining existing Venue
Selection criteria". They also don't match the IETF mailing list archive:
https://mailarchive.ietf.org/arch/browse/hrpc/?gbt=1&q=examining+venue
which says there are 35 responses to this thread.

Similarly, for

[hrpc] Human Rights Research Group Call on draft-irtf-hrpc-research-07
My hand-counted notes say there are 53 responses and bigbang has 53, but
the archive has 56. See:
https://mailarchive.ietf.org/arch/browse/hrpc/?gbt=1&q=Human+Rights+Research+Group+Call+on+draft-irtf-hrpc-research-07+

I am perfectly comfortable assuming that my hand count is off by a bit,
but the discrepancy between the IETF archive and bigbang is odd.

Especially because, for instance, for "[hrpc] Case three: DDoS", my notes,
the archive and bigbang all agree on 55 responses. See:
https://mailarchive.ietf.org/arch/browse/hrpc/?gbt=1&q=%5Bhrpc%5D+Case+three%3A+DDoS

I am sure this has something to do with how the bigbang tool counts versus
how the ietf counts versus how I count, but this does raise questions again
about what the authoritative answer is.
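
To illustrate how two counting conventions can disagree on the same thread, here is a minimal sketch with made-up messages. I do not know which convention bigbang or the IETF archive actually uses, so both functions and the data are hypothetical: one counts every message whose subject matches after stripping "Re:" prefixes, the other counts only replies reachable from the first message through reply headers.

```python
import re
from collections import defaultdict

# Hypothetical messages. "in_reply_to" stands in for the reply header.
messages = [
    {"id": "m1", "subject": "[x] Topic", "in_reply_to": None},
    {"id": "m2", "subject": "Re: [x] Topic", "in_reply_to": "m1"},
    {"id": "m3", "subject": "RE: Re: [x] Topic", "in_reply_to": "m2"},
    # Same subject, but the reply header is missing.
    {"id": "m4", "subject": "Re: [x] Topic", "in_reply_to": None},
    # Replies in the thread, but under a changed subject.
    {"id": "m5", "subject": "[x] Renamed (was: Topic)", "in_reply_to": "m3"},
]


def normalize(subject):
    """Strip leading 'Re:' prefixes and lowercase the subject."""
    stripped = re.sub(r"^(\s*re:\s*)+", "", subject, flags=re.IGNORECASE)
    return stripped.strip().lower()


def count_by_subject(messages, subject):
    """Count every message (including the first) with a matching subject."""
    target = normalize(subject)
    return sum(1 for m in messages if normalize(m["subject"]) == target)


def count_replies_by_header(messages, root_id):
    """Count replies reachable from root_id through reply headers,
    not counting the root message itself."""
    children = defaultdict(list)
    for m in messages:
        if m["in_reply_to"] is not None:
            children[m["in_reply_to"]].append(m["id"])
    replies, stack = 0, [root_id]
    while stack:
        current = stack.pop()
        for child in children[current]:
            replies += 1
            stack.append(child)
    return replies


by_subject = count_by_subject(messages, "[x] Topic")
by_header = count_replies_by_header(messages, "m1")
print("by subject:", by_subject)
print("by reply header:", by_header)
```

In this toy example the subject-based count is 4 and the header-based count is 3, so even a small thread gives two defensible answers depending on what is counted.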

Happy to think along! best,

-- 
Corinne Cath
Ph.D. Candidate, Oxford Internet Institute & Alan Turing Institute

Web: www.oii.ox.ac.uk/people/corinne-cath
Email: ccath at turing.ac.uk & corinnecath at gmail.com
Twitter: @C_Cath