[Bigbang-dev] provenance and sharing collected archives

Niels ten Oever mail at nielstenoever.net
Tue Feb 6 14:04:25 CET 2018


Hiya Nick,

If you send me your public SSH key, I will provide you access to a server with sufficient space to host that :)

Maybe we can also keep a copy of all ICANN mailing lists and RFCs there.

Cheers,

Niels



On Mon, Feb 05, 2018 at 04:25:44PM -0500, Sebastian Benthall wrote:
> Awesome, Nick!
> 
> On Sat, Feb 3, 2018 at 8:45 PM, Nick Doty <npdoty at ischool.berkeley.edu>
> wrote:
> 
> > Regarding the shared archives project, I've set up a private GitHub repo
> > which contains IETF mailing list archives (at least the ones that my
> > crawler and my list were able to consume as of October 2017), including
> > provenance metadata files. It uses git-lfs to handle the archive files
> > themselves, so that the repo itself isn't enormous and won't grow in size
> > indefinitely.
> >
> > It's still a lot of data -- it requires downloading 13 GB and takes 26 GB
> > of disk space, and that's with a probably somewhat incomplete crawl of the
> > archives. But I've documented how to do that with GitHub and git-lfs, so
> > hopefully it'll be easier for the next collaborator. Let me know if you're
> > interested and I can share the repo with you!
> >
> > This costs a little bit of money, currently $5/month, because of the disk
> > space and bandwidth with GitHub, so I'm open to alternatives if someone
> > wants to run GitLab on a private server somewhere. (Hosted GitLab won't
> > allow more than 10GB, so that's out right from the start.)
> >
> > Cheers,
> > Nick
> >
> > > On Oct 25, 2017, at 6:55 PM, Nick Doty <npdoty at ischool.berkeley.edu>
> > > wrote:
> > >
> > > Regarding shared archives, I'm finding that the number and size of the
> > > files is making straight-up checking them into Git a little difficult; it
> > > takes minutes just to git add the files to a changelist. I suspect that the
> > > git-lfs extension would be a useful way to help with this. Under my
> > > understanding, we would, in short, check in hashes of files to a git
> > > repository and then the full additional mail archives to another location
> > > (hosted by GitHub or GitLab) and then git-lfs will download the full mail
> > > archives (but not every version of them) as needed. If that sounds
> > > reasonable, then I think we can more easily pursue the shared archives
> > > approach (for all the IETF or ICANN archives, say) with non-public hosting,
> > > either on GitLab or through a separate server as Niels had offered. I'll
> > > investigate more and let people know; if anyone on the list has experience
> > > with git-lfs (or, alternatively, git-annex) already, please let me know!
> > >
> > > Cheers,
> > > Nick
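
For readers unfamiliar with the "check in hashes of files" idea described above: with git-lfs, the git repository holds a small text pointer in place of each large file, recording a SHA-256 of the content and its size, while the content itself lives in separate storage. Below is a minimal Python sketch of what such a pointer contains, assuming the version 1 pointer format. The function name lfs_pointer and the sample bytes are made up for illustration; in normal use, `git lfs track` followed by `git add` generates these pointers for you.

```python
import hashlib

# Spec URL that opens every version 1 Git LFS pointer file.
LFS_SPEC = "https://git-lfs.github.com/spec/v1"


def lfs_pointer(content: bytes) -> str:
    """Return the pointer text that would stand in for `content` in git.

    Illustrative helper only (not part of git-lfs itself): the pointer
    records the SHA-256 of the content and its size in bytes.
    """
    oid = hashlib.sha256(content).hexdigest()
    return f"version {LFS_SPEC}\noid sha256:{oid}\nsize {len(content)}\n"


if __name__ == "__main__":
    # Hypothetical stand-in for one mail archive file.
    sample = b"From nobody Sat Feb  3 20:45:00 2018\n"
    print(lfs_pointer(sample), end="")
```

Because the pointer is only a few lines long regardless of the archive's size, the repository history stays small even when an archive file is updated repeatedly.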
> >
> >
> > _______________________________________________
> > Bigbang-dev mailing list
> > Bigbang-dev at data-activism.net
> > https://lists.ghserv.net/mailman/listinfo/bigbang-dev
> >
> >

-- 

Niels ten Oever
Researcher and PhD Candidate
Datactive Research Group
University of Amsterdam

PGP fingerprint	   2458 0B70 5C4A FD8A 9488  
                   643A 0ED8 3F3A 468A C8B3
