[Bigbang-dev] provenance and sharing collected archives
Sebastian Benthall
sbenthall at gmail.com
Mon Feb 5 22:25:44 CET 2018
Awesome, Nick!
On Sat, Feb 3, 2018 at 8:45 PM, Nick Doty <npdoty at ischool.berkeley.edu>
wrote:
> Regarding the shared archives project, I've set up a private GitHub repo
> which contains IETF mailing list archives (at least the ones that my
> crawler and my list were able to consume as of October 2017), including
> provenance metadata files. It uses git-lfs to handle the archive files
> themselves, so that the repo itself isn't enormous and won't grow in size
> indefinitely.
>
> It's still a lot of data -- it requires downloading 13 GB and takes 26 GB
> of disk space, and that's with a probably somewhat incomplete crawl of the
> archives. But I've documented how to do that with GitHub and git-lfs, so
> hopefully it'll be easier for the next collaborator. Let me know if you're
> interested and I can share the repo with you!
>
> This costs a little bit of money, currently $5/month, because of the disk
> space and bandwidth with GitHub, so I'm open to alternatives if someone
> wants to run GitLab on a private server somewhere. (Hosted GitLab won't
> allow more than 10GB, so that's out right from the start.)
>
> Cheers,
> Nick
>
> > On Oct 25, 2017, at 6:55 PM, Nick Doty <npdoty at ischool.berkeley.edu> wrote:
> >
> > Regarding shared archives, I'm finding that the number and size of the
> > files is making straight-up checking them into Git a little difficult;
> > it takes minutes just to git add the files to a changelist. I suspect
> > that the git-lfs extension would be a useful way to help with this.
> > Under my understanding, we would, in short, check in hashes of files to
> > a git repository and then the full additional mail archives to another
> > location (hosted by GitHub or GitLab) and then git-lfs will download the
> > full mail archives (but not every version of them) as needed. If that
> > sounds reasonable, then I think we can more easily pursue the shared
> > archives approach (for all the IETF or ICANN archives, say) with
> > non-public hosting, either on GitLab or through a separate server as
> > Niels had offered. I'll investigate more and let people know; if anyone
> > on the list has experience with git-lfs (or, alternatively, git-annex)
> > already, please let me know!
> >
> > Cheers,
> > Nick
>
>
> _______________________________________________
> Bigbang-dev mailing list
> Bigbang-dev at data-activism.net
> https://lists.ghserv.net/mailman/listinfo/bigbang-dev
>
>
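For anyone on the list who hasn't used git-lfs: the "check in hashes of
files" idea quoted above can be sketched roughly as below. This is only an
illustration, in Python, of the pointer file layout described in the Git LFS
specification (a version line, a SHA-256 oid, and a byte size). It is not how
Nick's repo was actually set up, and make_lfs_pointer and the sample archive
bytes are hypothetical names and data made up for the example.

```python
import hashlib

# Version line used by Git LFS pointer files (per the Git LFS spec).
LFS_SPEC_URL = "https://git-lfs.github.com/spec/v1"


def make_lfs_pointer(content: bytes) -> str:
    """Return pointer-file text for the given file content.

    Hypothetical helper for illustration; git-lfs itself generates
    these pointers when a tracked file is added.
    """
    oid = hashlib.sha256(content).hexdigest()
    return (
        f"version {LFS_SPEC_URL}\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


if __name__ == "__main__":
    # Stand-in for a mailing list archive file; real ones are far larger.
    archive = b"From nobody Mon Feb  5 22:25:44 2018\n" * 1000
    pointer = make_lfs_pointer(archive)
    print(pointer, end="")
    print(f"archive: {len(archive)} bytes, pointer: {len(pointer)} bytes")
```

The small pointer text is what ends up in the git history, while the archive
bytes themselves live in separate LFS storage and are fetched as needed.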