<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><div class="">Hi Niels,</div><br class=""><div><blockquote type="cite" class=""><div class="">On Oct 3, 2017, at 3:39 AM, Niels ten Oever &lt;<a href="mailto:niels@article19.org" class="">niels@article19.org</a>&gt; wrote:</div><br class="Apple-interchange-newline"><div class=""><div class="">Hi NIck,<br class=""><br class="">Is there a reason you're focusing on this selection of lists, instead of<br class="">all lists [0]?<br class=""></div></div></blockquote><blockquote type="cite" class="">[0] <a href="https://www.ietf.org/mail-archive/text/" class="">https://www.ietf.org/mail-archive/text/</a><br class=""></blockquote><div><br class=""></div><div>I was trying to work from every active WG list, which I pulled and parsed from <a href="https://datatracker.ietf.org/list/wg/" class="">https://datatracker.ietf.org/list/wg/</a>; that page takes its mailing list archive URLs from the charters of all the active Working Groups. Several IETF Working Groups have mailing lists that aren't hosted on IETF infrastructure. (Some are on W3C infrastructure, and a handful seem to be privately hosted one-off lists.)
I did the same with W3C to get a corpus of all the active groups with mailing lists, although there are more archives than that.</div><div><br class=""></div><div>So I think my list of lists is shorter than yours, but it probably also has some important groups that yours doesn't.</div><div><br class=""></div><div>Issue #287 is related: <a href="https://github.com/datactive/bigbang/issues/287" class="">https://github.com/datactive/bigbang/issues/287</a></div><div><br class=""></div><blockquote type="cite" class=""><div class=""><div class="">I am getting the following error:<br class=""><br class="">DEBUG:chardet.universaldetector:no probers hit minimum threshold<br class="">Traceback (most recent call last):<br class="">  File "bin/collect_mail.py", line 41, in &lt;module&gt;<br class="">    main(sys.argv[1:])<br class="">  File "bin/collect_mail.py", line 38, in main<br class="">    mailman.collect_from_file(arg)<br class="">  File "/home/lem/Data/bigbang/bigbang/mailman.py", line 113, in<br class="">collect_from_file<br class="">    collect_from_url(url)<br class="">  File "/home/lem/Data/bigbang/bigbang/mailman.py", line 88, in<br class="">collect_from_url<br class="">    data = open_list_archives(url)<br class="">  File "/home/lem/Data/bigbang/bigbang/mailman.py", line 269, in<br class="">open_list_archives<br class="">    return messages_to_dataframe(messages)<br class="">  File "/home/lem/Data/bigbang/bigbang/mailman.py", line 333, in<br class="">messages_to_dataframe<br class="">    for m in messages if m.get('Message-ID')]<br class="">  File "/home/lem/Data/bigbang/bigbang/mailman.py", line 280, in get_text<br class="">    charset = chardet.detect(str(part))['encoding']<br class="">  File<br class="">"/home/lem/Data/anaconda2/envs/bigbang/lib/python2.7/site-packages/chardet/__init__.py",<br class="">line 39, in detect<br class="">    return detector.close()<br class="">  File<br
class="">"/home/lem/Data/anaconda2/envs/bigbang/lib/python2.7/site-packages/chardet/universaldetector.py",<br class="">line 271, in close<br class="">    for prober in self._charset_probers[0].probers:<br class="">IndexError: list index out of range<br class=""><br class="">Any suggestions?<br class=""></div></div></blockquote><div><br class=""></div><div class="">I see these comments at the top of the relevant function:</div><div class=""><br class=""></div><div class="">    ## This code for character detection and dealing with exceptions is terrible<br class="">    ## It is in need of refactoring badly. - sb<br class=""></div><div class=""><br class=""></div><div class="">It seems to me that there is no exception handling anywhere in this code and that there needs to be. We shouldn't assume that every email message will always have a consistent character set, if for no other reason than that there might sometimes be straight-up corrupt data in the archives we collect and analyze.</div><div class=""><br class=""></div><div class="">Niels, if you can pin down exactly which file (or better yet, which message) is causing this error, that would make it easier to write a test case. I think we could make a few small changes to the get_text method to catch character set exceptions and then log or swallow them.</div><div class=""><br class=""></div><div class="">I have a few error-handling fixes on my local branch that I haven't pushed back into master yet, although I don't think they would help with this particular error.</div><div class=""><br class=""></div><div class="">Is this the same issue as <a href="https://github.com/datactive/bigbang/issues/250" class="">https://github.com/datactive/bigbang/issues/250</a>? Actually, while that one is also about character encoding, it seems to be a different error. I'm not sure whether migrating to Python 3 will help with both of these issues.</div><div class=""><br class=""></div><div class="">Cheers,</div><div class="">Nick</div></div></body></html>