[Bigbang-dev] parsing L-Soft Listserv archives (was Re: IETF Affiliation Analysis with BigBang -- Scheduling a call)
Nick Doty
npdoty at ischool.berkeley.edu
Tue Apr 7 05:17:47 CEST 2020
Hi Niels,
> On Mar 19, 2020, at 6:58 AM, Niels ten Oever <mail at nielstenoever.net> wrote:
>
> - The focus of my work recently has shifted from ICANN, IETF and RIPE to 3GPP and the IEEE. Unfortunately these organizations don't use Mailman, but L-Soft's Listserv 16.5 (https://list.etsi.org/scripts/wa.exe and https://listserv.ieee.org/cgi-bin/wa?HOME). Is there a way we could scrape these archives as well?
L-Soft Listserv is a new one for me; I hadn’t seen mailing list archives like this before.
If you can somehow download .mbox files or plain text archives from these groups, that would clearly be easiest. If not, though, I think you could write a small scraper for their online format and then it would be fairly easy to integrate into bigbang.
For W3C archives, I have w3crawl.py, which follows the links to individual message pages in W3C’s online archives; those archives use a version of pipermail that is pretty specific to them. We don’t have this formally as a subclass in BigBang, but we can and should do that at some point. For now, mailman.py just tries to determine from the URL which crawler it should use, and switches to w3crawl.py explicitly when the URL looks like a W3C archive.
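To make the URL-based switching concrete, here is a rough sketch of how a Listserv case could sit next to the existing ones. This is not the actual logic in mailman.py: the function name choose_crawler, the returned labels, and the specific host and path checks are all assumptions for illustration, based only on the two Listserv URLs quoted above.

```python
from urllib.parse import urlparse


def choose_crawler(url):
    """Return a label naming the crawler that should handle `url`.

    Hypothetical helper: the labels and the matching rules below are
    illustrative assumptions, not BigBang's real dispatch code.
    """
    parsed = urlparse(url)
    host = parsed.netloc.lower()

    # Assumption: W3C's public archives are served from lists.w3.org.
    if host == "lists.w3.org":
        return "w3c"

    # Assumption: the L-Soft web interface is a script named "wa" or
    # "wa.exe", as in the two example URLs from the quoted message.
    last_segment = parsed.path.rstrip("/").rsplit("/", 1)[-1].lower()
    if last_segment in ("wa", "wa.exe"):
        return "listserv"

    # Everything else falls through to the default Mailman handling.
    return "mailman"
```

Checking the script name rather than the hostname keeps the rule independent of which organization hosts the Listserv instance, but it is only a heuristic and may need tightening for real archives.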
You’d need to make a subclass of email.parser.Parser to create a single email message from a string, and then a version of collect_from_url() to handle the steps of finding all the messages for a list and parsing each. BigBang will then save the result as an mbox, which is easy to parse later.
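As a minimal sketch of those two pieces, using only the Python standard library: I haven't looked at what Listserv's message pages actually contain, so this assumes (purely for illustration) that each page embeds the raw message text, HTML-escaped, inside a <pre> element. The names PreBlockParser and collect_pages are hypothetical, and collect_pages takes already-downloaded page strings rather than a URL, so the crawling step that a real collect_from_url() would need is left out.

```python
import email.parser
import html
import mailbox
import re


class PreBlockParser(email.parser.Parser):
    """Hypothetical parser for a message page.

    Assumes the page wraps RFC 822-style message text in a <pre> block;
    real Listserv pages may be structured quite differently.
    """

    PRE_BLOCK = re.compile(r"<pre\b[^>]*>(.*?)</pre>", re.DOTALL | re.IGNORECASE)

    def parsestr(self, text, headersonly=False):
        # Pull out the <pre> contents if present; otherwise treat the
        # whole string as the message text.
        match = self.PRE_BLOCK.search(text)
        raw = match.group(1) if match else text
        # Undo HTML escaping (&amp;, &lt;, ...) and drop leading whitespace
        # so the first header line starts the message.
        raw = html.unescape(raw).lstrip()
        return super().parsestr(raw, headersonly=headersonly)


def collect_pages(pages, mbox_path):
    """Parse each page string and append the message to an mbox file."""
    parser = PreBlockParser()
    box = mailbox.mbox(mbox_path)  # created if it does not exist yet
    try:
        for page in pages:
            box.add(parser.parsestr(page))
        box.flush()
    finally:
        box.close()
```

Calling collect_pages(list_of_page_strings, "listname.mbox") would leave an mbox on disk that mailbox.mbox can reopen; the file name here is just an example.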
Hope this helps,
Nick