[Bigbang-dev] refactoring archives, collection and mailman.py
Nick Doty
npdoty at ischool.berkeley.edu
Tue Mar 2 19:42:37 CET 2021
Hi all,
Many thanks to Christoph for prompting us to think about refactoring and
clearer organization for the different classes and scripts we are using for
ingesting, parsing, saving and converting email archive formats.
A few notes from our chat today below:
1. useful to use existing python email message classes and subclasses,
e.g. https://docs.python.org/3.7/library/mailbox.html -- it's Pythonic and
lets us take advantage of the existing work/gotchas of other Python people
handling email
2. mailman.py should be refactored to separate mail
collecting/provenance stuff *and* mailman-specific crawling/parsing
3. could be a useful Archive class to handle to/from different formats
and maybe also switching between scrapers/crawlers based on the URL or the
archive software (whether that's an Archive class and a Collector class or
just an Archive class, we're not sure, but it also probably doesn't matter
much)
4. mbox format seems to capture all the contents of the scraping
(whether from w3c or listserv) -- lossless, we're not losing important
information if we save and later re-open from that format
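To make the notes above a bit more concrete, here is a rough sketch of how
the standard-library mailbox classes, an Archive class and the mbox format
could fit together. This is only an illustration, not what is in the
BigBang code today: the Archive class, make_message and
guess_archive_software are hypothetical names, and the URL heuristics
(a "pipermail" path for Mailman, the lists.w3.org host for W3C) are
assumptions for the example. Only mailbox.mbox, email.message.EmailMessage
and urllib.parse.urlparse come from the Python standard library.

```python
import mailbox
from email.message import EmailMessage
from urllib.parse import urlparse


def make_message(sender, subject, body):
    """Build a stdlib EmailMessage (hypothetical helper for this sketch)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def guess_archive_software(url):
    """Pick a scraper name from the URL.

    The rules here are assumptions for illustration, not a tested
    detection scheme.
    """
    parsed = urlparse(url)
    if "pipermail" in parsed.path:
        return "mailman"
    if parsed.hostname == "lists.w3.org":
        return "w3c"
    return "unknown"


class Archive:
    """Hypothetical container for the messages of one mailing list."""

    def __init__(self, messages=None):
        self.messages = list(messages or [])

    def to_mbox(self, path):
        """Append all messages to an mbox file, creating it if needed."""
        box = mailbox.mbox(path)
        try:
            for msg in self.messages:
                box.add(msg)
            box.flush()
        finally:
            box.close()

    @classmethod
    def from_mbox(cls, path):
        """Re-open a saved mbox file as an Archive."""
        box = mailbox.mbox(path)
        try:
            # Iterating an mbox yields fully parsed mboxMessage objects.
            return cls(list(box))
        finally:
            box.close()


if __name__ == "__main__":
    import os
    import tempfile

    original = Archive([
        make_message("a@example.org", "first", "hello\n"),
        make_message("b@example.org", "second", "world\n"),
    ])
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "list.mbox")
        original.to_mbox(path)
        reopened = Archive.from_mbox(path)
    for msg in reopened.messages:
        print(msg["From"], msg["Subject"])
```

With something like this, each scraper would only need to produce standard
message objects, and saving/re-opening would go through one code path
regardless of which archive software the list came from.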
And I believe Christoph is going to open some issues; these tasks aren't
likely to be finishable during our hackathon, but will be useful for future
milestones so that it'll be faster and more comprehensible for those who
want to contribute new classes to ingest mailing list archives stored in
different systems.
Happy hacking,
Nick