[Bigbang-dev] provenance and sharing collected archives

Nick Doty npdoty at ischool.berkeley.edu
Sun Feb 4 02:45:30 CET 2018


Regarding the shared archives project, I've set up a private GitHub repo which contains IETF mailing list archives (at least the ones that my crawler and my list were able to consume as of October 2017), including provenance metadata files. It uses git-lfs to handle the archive files themselves, so that the repo itself isn't enormous and won't grow in size indefinitely.

It's still a lot of data -- it requires downloading 13 GB and takes 26 GB of disk space, and that's with a crawl of the archives that is probably somewhat incomplete. But I've documented how to do that with GitHub and git-lfs, so hopefully it'll be easier for the next collaborator. Let me know if you're interested and I can share the repo with you!

This costs a little bit of money, currently $5/month, because of the disk space and bandwidth with GitHub, so I'm open to alternatives if someone wants to run GitLab on a private server somewhere. (Hosted GitLab won't allow more than 10 GB, so that's out right from the start.)

Cheers,
Nick

> On Oct 25, 2017, at 6:55 PM, Nick Doty <npdoty at ischool.berkeley.edu> wrote:
> 
> Regarding shared archives, I'm finding that the number and size of the files make checking them straight into Git a little difficult; it takes minutes just to git add the files to a changelist. I suspect that the git-lfs extension would help with this. As I understand it, we would check in hashes of the files to a Git repository and store the full mail archives in another location (hosted by GitHub or GitLab); git-lfs would then download the full mail archives (but not every version of them) as needed. If that sounds reasonable, then I think we can more easily pursue the shared archives approach (for all the IETF or ICANN archives, say) with non-public hosting, either on GitLab or through a separate server as Niels had offered. I'll investigate more and let people know; if anyone on the list has experience with git-lfs (or, alternatively, git-annex) already, please let me know!
> 
> Cheers,
> Nick
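
For anyone unfamiliar with git-lfs, here is a rough sketch of the "check in hashes of files" idea from the quoted message. Under the Git LFS pointer format (spec v1), what gets committed in place of each large file is a small text pointer holding the file's SHA-256 hash and its size in bytes. The Python below only illustrates that format; git-lfs writes these pointers itself, and the function name `lfs_pointer` and the file name in the usage note are made up for the example.

```python
import hashlib


def lfs_pointer(path, chunk_size=1 << 20):
    """Return Git LFS v1 pointer text for the file at `path`.

    Illustration only: git-lfs generates pointers itself when a
    tracked file is added. `lfs_pointer` is a hypothetical helper.
    """
    digest = hashlib.sha256()
    size = 0
    # Read in chunks so that multi-gigabyte archives are never
    # loaded into memory all at once.
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    # Pointer layout: a version line, then the object id (SHA-256
    # hex digest) and the size of the real file in bytes.
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{digest.hexdigest()}\n"
        f"size {size}\n"
    )
```

Calling `lfs_pointer("some-list.mbox")` (a hypothetical file name) returns three short lines no matter how large the archive is, which is why the Git history stays small while the archives themselves live in separate LFS storage.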
