[Bigbang-dev] IETF list crawling
Nick Doty
npdoty at ischool.berkeley.edu
Wed Oct 4 02:44:58 CEST 2017
Hi Niels,
> On Oct 3, 2017, at 3:39 AM, Niels ten Oever <niels at article19.org> wrote:
>
> Hi Nick,
>
> Is there a reason you're focusing on this selection of lists, instead of
> all lists [0]?
> [0] https://www.ietf.org/mail-archive/text/
I was trying to work from every active WG list. I parsed https://datatracker.ietf.org/list/wg/, which pulls the mailing list archive URLs from the charters of all the active Working Groups. Several IETF Working Groups have mailing lists that aren't hosted on IETF infrastructure. (Some of them are on W3C infrastructure, and a handful seem to just be one-off, privately hosted lists.) I did the same with W3C, to get a corpus of all the active groups with mailing lists, although there are more archives than that.
So I think my list of lists is shorter than yours, but it also probably has some important groups that yours doesn't.
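For concreteness, here's a very rough sketch of the kind of crawl I mean. It is not the code I actually ran: it only grabs links on the datatracker index page that look like mailing list archives, the ARCHIVE_HINTS patterns are guesses, and in practice you may need to follow through to each group's charter page to find the archive URL.

import requests
from bs4 import BeautifulSoup

INDEX = "https://datatracker.ietf.org/list/wg/"
# Hypothetical patterns for "looks like a list archive" -- adjust as needed.
ARCHIVE_HINTS = ("mail-archive", "mailarchive.ietf.org", "lists.w3.org")

def find_archive_urls(page_url):
    """Return hrefs on a page that look like mailing list archives."""
    soup = BeautifulSoup(requests.get(page_url).text, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)
            if any(hint in a["href"] for hint in ARCHIVE_HINTS)]

if __name__ == "__main__":
    for url in sorted(set(find_archive_urls(INDEX))):
        print(url)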
Issue #287 is related: https://github.com/datactive/bigbang/issues/287
> I am getting the following error:
>
> DEBUG:chardet.universaldetector:no probers hit minimum threshold
> Traceback (most recent call last):
>   File "bin/collect_mail.py", line 41, in <module>
>     main(sys.argv[1:])
>   File "bin/collect_mail.py", line 38, in main
>     mailman.collect_from_file(arg)
>   File "/home/lem/Data/bigbang/bigbang/mailman.py", line 113, in collect_from_file
>     collect_from_url(url)
>   File "/home/lem/Data/bigbang/bigbang/mailman.py", line 88, in collect_from_url
>     data = open_list_archives(url)
>   File "/home/lem/Data/bigbang/bigbang/mailman.py", line 269, in open_list_archives
>     return messages_to_dataframe(messages)
>   File "/home/lem/Data/bigbang/bigbang/mailman.py", line 333, in messages_to_dataframe
>     for m in messages if m.get('Message-ID')]
>   File "/home/lem/Data/bigbang/bigbang/mailman.py", line 280, in get_text
>     charset = chardet.detect(str(part))['encoding']
>   File "/home/lem/Data/anaconda2/envs/bigbang/lib/python2.7/site-packages/chardet/__init__.py", line 39, in detect
>     return detector.close()
>   File "/home/lem/Data/anaconda2/envs/bigbang/lib/python2.7/site-packages/chardet/universaldetector.py", line 271, in close
>     for prober in self._charset_probers[0].probers:
> IndexError: list index out of range
>
> Any suggestions?
I see these comments at the top of the relevant function:
## This code for character detection and dealing with exceptions is terrible
## It is in need of refactoring badly. - sb
It seems to me that there is no exception handling anywhere in this code and that there needs to be. We shouldn't assume that every email message will always have a consistent character set, if for no other reason than there might sometimes be straight up corrupt data in the archives we collect and analyze.
Niels, if you can pin down exactly what file (or better yet, what message) is causing this error, that would make it easier to make a test case. I think we could make a few small changes to the get_text method to catch and log/swallow character set exceptions.
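Something along these lines is what I have in mind -- just a minimal sketch, not tested against the failing message, and the helper name and fallback charset are my own placeholders:

import logging
import chardet

def detect_charset(part, fallback='utf-8'):
    # Guess the charset of a message part without letting chardet crash the
    # whole collection run. chardet.detect() can raise IndexError on some
    # inputs (as in the traceback above), so treat any failure as "unknown"
    # and fall back to a default.
    try:
        charset = chardet.detect(str(part))['encoding']
    except Exception as exc:
        logging.warning("charset detection failed (%r); using %s", exc, fallback)
        charset = None
    return charset or fallback

In get_text, a call like this could replace the bare chardet.detect(...) line, with the rest of the payload-decoding logic left as it is.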
I have a few error-handling fixes on my local branch that I haven't pushed back into master yet, although I don't think they would help with this particular error.
Is this the same issue as https://github.com/datactive/bigbang/issues/250? Actually, while that one is also about character encoding, it seems to be a different error. I'm not sure whether migrating to Python 3 will help with both of these issues or not.
Cheers,
Nick