[Bigbang-dev] parsing L-Soft Listserv archives (was Re: IETF Affiliation Analysis with BigBang -- Scheduling a call)
Niels ten Oever
mail at nielstenoever.net
Tue Apr 7 10:04:13 CEST 2020
Hi Nick,
On 4/7/20 5:17 AM, Nick Doty wrote:
> Hi Niels,
>
>> On Mar 19, 2020, at 6:58 AM, Niels ten Oever <mail at nielstenoever.net <mailto:mail at nielstenoever.net>> wrote:
>>
>> - The focus of my work recently has shifted from ICANN, IETF and RIPE to 3GPP and the IEEE. Unfortunately these organizations don't use Mailman, but L-Soft's Listserv 16.5 (https://list.etsi.org/scripts/wa.exe and https://listserv.ieee.org/cgi-bin/wa?HOME). Is there a way we could scrape these archives as well?
>
> L-Soft Listserv is a new one for me; I hadn't seen mailing list archives like this before.
I had only seen it used by the NCSG in ICANN, which hosted its mailing list on a Syracuse University server that runs it. But then I found out that both the IEEE and 3GPP use it, which is quite a significant user base. I have been in contact with quite a few sysops and even the L-Soft people themselves, and they have made it clear that scraping is not something the system is optimized for - quite the opposite. There is some good news, though: the 3GPP list admin was kind enough to give me a copy of their list archive, and I am now working on converting it to the mbox format.
Thus far this has not been trivial to do without losing data. For instance, this script [0] does the conversion but messes up threading. This Perl script [1] seems to do better, but there are some issues with the mbox format it produces. A rough Python sketch of the same idea follows the links below.
[0] https://verify.rwth-aachen.de/psk/ls2mm/
[1] http://www.hypermail-project.org/archive/99/0520.html
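For reference, the rough shape of such a conversion might look like the sketch below. It is only a sketch: the divider line of '=' characters is an assumption about how the notebook logs delimit messages (the real 3GPP files may differ), and it simply trusts that each chunk between dividers is an ordinary RFC 2822 message.

# Rough sketch only: convert an L-Soft Listserv "notebook" log to mbox.
# Assumptions (to be checked against the actual 3GPP files): messages are
# separated by a divider line of '=' characters, and each chunk between
# dividers is a plain RFC 2822 message with its headers intact.
import email
import mailbox
import re

DIVIDER = re.compile(r"^={40,}\s*$")  # hypothetical message separator

def listserv_log_to_mbox(log_path, mbox_path):
    """Append every message found in one notebook log to an mbox file."""
    out = mailbox.mbox(mbox_path)
    chunk = []
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            if DIVIDER.match(line):
                if chunk:
                    out.add(email.message_from_string("".join(chunk)))
                    chunk = []
            else:
                chunk.append(line)
    if chunk:  # final message may not be followed by a divider
        out.add(email.message_from_string("".join(chunk)))
    out.flush()
    out.close()

Whether threading survives should then mostly come down to whether the Message-ID, In-Reply-To and References headers make it through unmodified, which is easy to spot-check on the resulting mbox.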
>
> If you can somehow download .mbox files or plain text archives from these groups, that would clearly be easiest. If not, though, I think you could write a small scraper for their online format and then it would be fairly easy to integrate into bigbang.
>
> For W3C archives, I have w3crawl.py, which follows the links to individual message pages in W3C’s online archives, which use a version of pipermail that is pretty specific to them. We don’t have this formally as a subclass in BigBang, but we can and should do that at some point. For now, mailman.py just tries to determine from the URL which crawler it should use, and switches to w3crawl.py explicitly when it looks to be a match.
>
> You’d need to make a subclass of email.parser.Parser to create a single email message from a string, and then a version of collect_from_url() to handle the steps of finding all the messages for a list and parsing each. Then bigbang will save it as an mbox and it’s easy to parse whenever.
Do you think it would be doable to create such a crawler for Listserv? If so, I think there could be quite a user base for it, or at least plenty of applications.
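As a very rough starting point, I imagine something along these lines. Everything below is hypothetical: the "A2=" link heuristic, the <pre> assumption about where the raw message lives, and the function names are placeholders until someone inspects the actual wa.exe pages.

# Hypothetical skeleton for a wa.exe (Listserv) web-archive crawler, loosely
# in the spirit of what w3crawl.py does for the W3C archives. The URL
# heuristics and HTML selectors are placeholders that would have to be
# replaced after inspecting the real IEEE / ETSI archive pages.
import email
import mailbox
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def message_links(index_url):
    """Yield absolute URLs of individual message pages linked from an index page."""
    soup = BeautifulSoup(requests.get(index_url).text, "html.parser")
    for a in soup.find_all("a", href=True):
        if "A2=" in a["href"]:  # placeholder: guess at how wa.exe marks message links
            yield urljoin(index_url, a["href"])

def parse_message_page(message_url):
    """Fetch one message page and parse it into an email.message.Message.

    Placeholder assumption: the raw headers and body sit in a <pre> element;
    the real pages may require requesting a 'raw' or 'source' view instead.
    """
    soup = BeautifulSoup(requests.get(message_url).text, "html.parser")
    pre = soup.find("pre")
    return email.message_from_string(pre.get_text()) if pre else None

def collect_listserv_archive(index_url, mbox_path):
    """Walk an archive index, parse each message, and save everything as an mbox."""
    out = mailbox.mbox(mbox_path)
    for url in message_links(index_url):
        msg = parse_message_page(url)
        if msg is not None:
            out.add(msg)
    out.flush()
    out.close()

If something like this works against the wa.exe pages, wiring it into BigBang would then mainly be a matter of teaching the URL dispatch in mailman.py to hand these archives off to it, the same way it already switches to w3crawl.py.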
Cheers,
Niels
>
> Hope this helps,
> Nick
--
Niels ten Oever
Researcher and PhD Candidate
DATACTIVE Research Group
University of Amsterdam
PGP fingerprint 2458 0B70 5C4A FD8A 9488
643A 0ED8 3F3A 468A C8B3