[Bigbang-dev] IETF list crawling
Niels ten Oever
niels at article19.org
Tue Oct 3 12:39:52 CEST 2017
Hi Nick,
Is there a reason you're focusing on this selection of lists, instead of
all lists [0]?
I am getting the following error:
DEBUG:chardet.universaldetector:no probers hit minimum threshold
Traceback (most recent call last):
  File "bin/collect_mail.py", line 41, in <module>
    main(sys.argv[1:])
  File "bin/collect_mail.py", line 38, in main
    mailman.collect_from_file(arg)
  File "/home/lem/Data/bigbang/bigbang/mailman.py", line 113, in collect_from_file
    collect_from_url(url)
  File "/home/lem/Data/bigbang/bigbang/mailman.py", line 88, in collect_from_url
    data = open_list_archives(url)
  File "/home/lem/Data/bigbang/bigbang/mailman.py", line 269, in open_list_archives
    return messages_to_dataframe(messages)
  File "/home/lem/Data/bigbang/bigbang/mailman.py", line 333, in messages_to_dataframe
    for m in messages if m.get('Message-ID')]
  File "/home/lem/Data/bigbang/bigbang/mailman.py", line 280, in get_text
    charset = chardet.detect(str(part))['encoding']
  File "/home/lem/Data/anaconda2/envs/bigbang/lib/python2.7/site-packages/chardet/__init__.py", line 39, in detect
    return detector.close()
  File "/home/lem/Data/anaconda2/envs/bigbang/lib/python2.7/site-packages/chardet/universaldetector.py", line 271, in close
    for prober in self._charset_probers[0].probers:
IndexError: list index out of range
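
The IndexError is raised inside chardet itself, when close() indexes an
empty prober list, so one possible workaround on our side is to guard the
detection call and fall back to a default charset. The following is only
an untested sketch: the helper name safe_detect, the injectable detect
argument and the UTF-8 fallback are all assumptions of mine, not existing
BigBang code.

```python
def safe_detect(data, detect=None, fallback="utf-8"):
    """Return a charset name for `data`, or `fallback` if detection fails.

    `safe_detect` is a hypothetical helper, not part of BigBang or chardet.
    """
    if detect is None:
        # chardet is a third-party package; it is imported lazily so that
        # a stub detector can be passed in for testing.
        import chardet
        detect = chardet.detect
    try:
        result = detect(data)
    except IndexError:
        # The failure seen in the traceback: the detector's close() step
        # indexes an empty list of probers.
        return fallback
    # chardet may also return {'encoding': None} when it cannot decide.
    return (result or {}).get("encoding") or fallback
```

In get_text this would stand in for the direct chardet.detect(...)['encoding']
lookup. Whether UTF-8 is the right default for these archives is an open
question.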
Any suggestions?
Cheers,
Niels
[0] https://www.ietf.org/mail-archive/text/
On 09/28/2017 07:11 AM, Nick Doty wrote:
> Hi Niels,
>
> Per the conversation on Gitter, I'm reviewing my logs from the IETF crawls that I did at the end of July and not immediately seeing any Unicode issues preventing downloads. I've attached the log files (which are long! we should maybe try to make these more consistent/informative). The initial log file has some failures, but the run on July 31st seems to have been more successful. This didn't include my provenance code, so I can't easily tell you exactly which version of BigBang code this was running.
>
> It's 13 gigabytes of email (!), and I don't think it's quite complete. I'm not sure my list of lists was comprehensive; I've attached that too.
>
> Hope this helps,
> Nick
>
>
>