[Bigbang-dev] provenance and sharing collected archives

Wed Aug 16 17:31:29 CEST 2017

+1 on having data repositories.
That's a great idea.

Standalone GitHub repositories (not in BigBang but "next to" it) are
possible for smaller data sets. Versioning is nice.

Not sure how to do the bigger ones.

On Aug 11, 2017 10:09 AM, "Niels ten Oever" <niels at article19.org> wrote:

> Hi Nick,
>
> I am happy to work on keeping repositories for IETF and ICANN
> mailing lists. I can also provide server space for all three bodies (W3C,
> IETF, ICANN), which makes sense because they're connected.
>
> I am very sorry that the Datactive fork is still (far) behind my personal
> fork. We do want to organize a hackathon on this, and RIPE has shown
> interest in supporting this work, so hopefully we can organize something
> to work on this before the end of the year.
>
> Cheers,
>
> Niels
>
>
> On Tue, Aug 01, 2017 at 04:50:03PM -0700, Nick Doty wrote:
> > We've touched on this a couple of times before; I think we've decided
> > not to include collected mailing list archives in the BigBang repository
> > itself. There are few archives that would be relevant to all users, and
> > we're trying to write code for automated collection so that you can
> > download any archive you need for your own research.
> >
> > That being said, I wonder if it might be useful to have separate
> > repositories where interested researchers can share the archives they've
> > downloaded. I've been downloading mailing list archives for every active
> > W3C Working Group and Interest Group, and separately for every active IETF
> > Working Group; it comes to a lot of data, takes a good deal of time to
> > download, and may require some babysitting of those long-running processes.
> > Would others be interested in separate repos with snapshots of ML archives
> > for those organizations? Or are there any other common organizations/lists
> > it might be useful to have snapshot data for?
> >
> > To that point, I also think we'll need useful provenance metadata if we
> > get to the point of sharing archives. When were these downloaded, what was
> > the specific mailing list, what software was used to download them, etc.
> > Indeed, I feel like I should have that functionality just for my individual
> > work in order to maintain good research practice. I opened
> > https://github.com/datactive/bigbang/issues/283 on that 6 weeks ago, and
> > today I've written code to generate provenance.yaml files during the mail
> > collection process: https://github.com/npdoty/bigbang/tree/provenance
> >
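
For illustration, the provenance fields mentioned above (when an archive was
downloaded, which mailing list it came from, and what software collected it)
could be recorded roughly as follows. This is only a sketch under stated
assumptions: the field names, the build_provenance and to_yaml helpers, and
the example URL and tool name are all hypothetical, and this is not the
format produced by the provenance branch linked above.

```python
from datetime import datetime, timezone


def build_provenance(list_url, tool_name, tool_version, downloaded_at=None):
    """Return a dict describing how and when an archive was collected.

    All field names here are hypothetical, chosen only for illustration.
    """
    if downloaded_at is None:
        downloaded_at = datetime.now(timezone.utc)
    return {
        "mailing_list": list_url,
        "downloaded_at": downloaded_at.isoformat(),
        "software": {"name": tool_name, "version": tool_version},
    }


def to_yaml(record, indent=0):
    """Serialize a nested dict of strings as simple block-style YAML.

    Assumes values contain no double quotes or backslashes; a real
    implementation would more likely use a YAML library.
    """
    lines = []
    pad = "  " * indent
    for key, value in record.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.append(to_yaml(value, indent + 1))
        else:
            # Quote scalars so colons in URLs and timestamps stay unambiguous.
            lines.append(f'{pad}{key}: "{value}"')
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values; substitute the real list URL and collector details.
    record = build_provenance(
        "https://example.org/archives/some-list/", "example-collector", "0.0.0"
    )
    print(to_yaml(record))
```

Writing such a record next to each downloaded archive would let anyone who
later receives a shared snapshot see when and how it was collected.
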
> > I'd appreciate any feedback on the issue or on this list.
> >
> > I could try to create a minimal PR, but that's getting harder for me as
> > datactive/bigbang's master branch has not been updated in a long time and
> > my code may rely on other changes I've made in intervening months.
> >
> > Cheers,
> > Nick
>
>
>
> > _______________________________________________
> > Bigbang-dev mailing list
> > Bigbang-dev at data-activism.net
> > https://lists.ghserv.net/mailman/listinfo/bigbang-dev
>
>
> --
>
> Niels ten Oever
> Head of Digital
>
> Article 19
> www.article19.org
>
> PGP fingerprint    2458 0B70 5C4A FD8A 9488
>                    643A 0ED8 3F3A 468A C8B3
>
>