[Bigbang-dev] refactoring archives, collection and mailman.py

Tue Mar 2 19:42:37 CET 2021

Hi all,

Many thanks to Christoph for prompting us to think about refactoring and
clearer organization for the different classes and scripts we are using for
ingesting, parsing, saving and converting email archive formats.

A few notes from our chat today below:

   1. useful to use the existing Python email message classes and subclasses,
   e.g. https://docs.python.org/3.7/library/mailbox.html It's Pythonic and
   lets us take advantage of the existing work/gotchas of other Python people
   handling email

   2. mailman.py should be refactored to separate the mail
   collecting/provenance logic *from* the mailman-specific crawling/parsing

   3. an Archive class could be useful to handle conversion to/from
   different formats and maybe also switching between scrapers/crawlers
   based on the URL or the archive software (whether that's an Archive
   class and a Collector class or just an Archive class, we're not sure,
   but it also probably doesn't matter much)

   4. mbox format seems to capture all the contents of the scraping
   (whether from w3c or listserv) -- it's lossless, so we're not losing
   important information if we save and later re-open from that format
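
To make the notes above a bit more concrete, here is a rough sketch of what
an Archive class built on the standard library's mailbox module might look
like. None of this is existing BigBang code: the Archive class, its
from_mbox/to_mbox methods, the pick_collector function and its URL rules
are all hypothetical names and assumptions made up for illustration. Only
mailbox.mbox and email.message.EmailMessage are real standard-library
classes.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage
from urllib.parse import urlparse


class Archive:
    """Hypothetical container for the messages of one mailing list."""

    def __init__(self, messages):
        self.messages = list(messages)

    @classmethod
    def from_mbox(cls, path):
        # create=False raises NoSuchMailboxError instead of silently
        # creating an empty file when the path is wrong.
        box = mailbox.mbox(path, create=False)
        try:
            return cls(list(box))
        finally:
            box.close()

    def to_mbox(self, path):
        # Single-process sketch: real code should call box.lock() and
        # box.unlock() if other processes might touch the same file.
        box = mailbox.mbox(path)
        try:
            for msg in self.messages:
                box.add(msg)
            box.flush()
        finally:
            box.close()


def pick_collector(url):
    """Guess which (hypothetical) collector handles a URL.

    The matching rules below are assumptions for illustration only.
    """
    parsed = urlparse(url)
    host, path = parsed.netloc.lower(), parsed.path.lower()
    if "pipermail" in path:
        return "mailman"
    if host == "lists.w3.org":
        return "w3c"
    if "listserv" in host or "listserv" in path:
        return "listserv"
    raise ValueError(f"no collector known for {url}")


def demo_round_trip():
    """Save one message to mbox and read it back."""
    msg = EmailMessage()
    msg["From"] = "nick@example.org"
    msg["To"] = "bigbang-dev@example.org"
    msg["Subject"] = "refactoring archives"
    msg["Message-ID"] = "<1@example.org>"
    msg.set_content("mbox round-trip test\n")
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "list.mbox")
        Archive([msg]).to_mbox(path)
        restored = Archive.from_mbox(path)
    return restored.messages
```

Keeping the URL dispatch in a plain function rather than inside Archive is
just one option; whether that logic ends up in a separate Collector class
is the open question noted above.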

And I believe Christoph is going to open some issues; these tasks aren't
likely to be finishable during our hackathon, but will be useful for future
milestones so that it'll be faster and more comprehensible for those who
want to contribute new classes to ingest mailing list archives stored in
different systems.

Happy hacking,
Nick