[Bigbang-dev] provenance and sharing collected archives

Sebastian Benthall sbenthall at gmail.com
Wed Aug 23 00:15:40 CEST 2017


GitHub has private repositories. One could manage permissions through that
system.

U.S. IRB says public data isn't human subjects data. I suppose it would be
fitting if the EU were stricter. But I believe even the GDPR says data that's
been explicitly made public is fair game.

Another possibility would be versioned cloud storage like an Amazon S3
bucket. There must be a sweet open source equivalent one could set up?

On Aug 21, 2017 8:43 AM, "Beraldo, Davide" <d.beraldo at uva.nl> wrote:

> Hi guys,
>
> first of all, thanks a lot for keeping this going! and apologies for the very
> long inactivity on my side; my resolution for the coming academic year is to get
> more involved with programming for good (aka not for evil marketing
> people)
>
> on the issue of public repositories: i am not an ethics fanatic myself, but
> working with people who are has made me a bit more paranoid; plus, the
> DATACTIVE project has made some pretty strict ethical commitments to the
> funders.
> consequently, i think that making the repositories public would be too
> much. i nonetheless see the good in having them stored somewhere and letting
> interested people access them.
> ---would it be possible to have the data stored, listed, but accessible
> only on request?
>
> in the meantime i can ask the ethics experts here what they think
> about it
>
> cheers!
>
> Davide
>
> ________________________________________
> From: Bigbang-dev [bigbang-dev-bounces at data-activism.net] on behalf of
> Niels ten Oever [niels at article19.org]
> Sent: Sunday, August 20, 2017 2:33 PM
> To: Nick Doty
> Cc: bigbang-dev at data-activism.net
> Subject: Re: [Bigbang-dev] provenance and sharing collected archives
>
> Github sounds good to me, but Davide might have some comments re:
> (research-)ethics?
>
> Cheers,
>
> Niels
>
>
> On Fri, Aug 18, 2017 at 03:28:46PM -0700, Nick Doty wrote:
> > Yeah, separate git repositories sounds like a good way forward. I think
> > having the provenance files will make it easier to collaborate and see the
> > current status of such a data repository.
> >
> > Niels, is there a particular reason to use separate server space for
> > these data repositories? Or should we just make them public GitHub
> > repositories? I could potentially see some privacy advantage in not making
> > a public mirror of these mailing list archives -- in the occasional case
> > where public mailing list archive managers remove sensitive messages, our
> > archives wouldn't automatically remove them as well -- but I expect that to
> > be notably rare for these groups that make a point of public archives.
> >
> > > On Aug 16, 2017, at 8:31 AM, Sebastian Benthall <sbenthall at gmail.com> wrote:
> > >
> > > +1 on having data repositories.
> > > That's a great idea.
> > >
> > > Standalone GitHub repositories (not in BigBang but "next to" it) are
> > > possible for smaller data sets. Versioning is nice.
> > >
> > > Not sure how to do the bigger ones.
> > >
> > > On Aug 11, 2017 10:09 AM, "Niels ten Oever" <niels at article19.org> wrote:
> > > Hi Nick,
> > >
> > > I am happy to work on keeping repositories for IETF and ICANN
> > > mailing lists. I can also provide server space for the three bodies (W3C,
> > > IETF, ICANN), which also makes sense because they're connected.
> > >
> > > I am very sorry that the Datactive fork is still (far) behind my
> > > personal fork. We do want to organize a hackathon on this; RIPE has shown
> > > interest in supporting this work, so hopefully we can organize something to
> > > work on this before the end of the year.
> > >
> > > Cheers,
> > >
> > > Niels
> > >
> > >
> > > On Tue, Aug 01, 2017 at 04:50:03PM -0700, Nick Doty wrote:
> > > > We've touched on this a couple of times before; I think we've
> > > > decided not to include collected mailing list archives in the BigBang
> > > > repository itself. There are few archives that would be relevant to all
> > > > users, and we're trying to write code for automated collection so that you
> > > > can download any archive you need for your own research.
> > > >
> > > > That being said, I wonder if it might be useful to have separate
> > > > repositories where interested researchers can share the archives they've
> > > > downloaded. I've been downloading mailing list archives for every active
> > > > W3C Working Group and Interest Group, and separately for every active IETF
> > > > Working Group; it comes to a lot of data, takes a good deal of time to
> > > > download and may require some babysitting of those long-running processes.
> > > > Would others be interested in separate repos with snapshots of ML archives
> > > > for those organizations? Or any other common organizations/lists it might
> > > > be useful to have snapshot data for?
> > > >
> > > > To that point, I also think we'll need useful provenance metadata if
> > > > we get to the point of sharing archives. When were these downloaded, what
> > > > was the specific mailing list, what software was used to download them,
> > > > etc. Indeed, I feel like I should have that functionality just for my
> > > > individual work in order to maintain good research practice. I opened
> > > > https://github.com/datactive/bigbang/issues/283 on that 6 weeks ago,
> > > > and today I've written code to generate provenance.yaml files during the
> > > > mail collection process: https://github.com/npdoty/bigbang/tree/provenance
> > > >
> > > > I'd appreciate any feedback on the issue or on this list.
> > > >
> > > > I could try to create a minimal PR, but that's getting harder for me
> > > > as datactive/bigbang's master branch has not been updated in a long time
> > > > and my code may rely on other changes I've made in intervening months.
> > > >
> > > > Cheers,
> > > > Nick
> > >
> > >
> > >
> > > > _______________________________________________
> > > > Bigbang-dev mailing list
> > > > Bigbang-dev at data-activism.net
> > > > https://lists.ghserv.net/mailman/listinfo/bigbang-dev
> > >
> > >
> > > --
> > >
> > > Niels ten Oever
> > > Head of Digital
> > >
> > > Article 19
> > > www.article19.org
> > >
> > > PGP fingerprint    2458 0B70 5C4A FD8A 9488
> > >                    643A 0ED8 3F3A 468A C8B3
> > >
> > >
> >
>
>
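
For reference, the provenance metadata discussed in the thread (when an archive was downloaded, which mailing list it came from, what software collected it) could be recorded roughly as sketched below. This is only an illustration: the field names, the `build_provenance` and `to_yaml` helpers, and the example list URL are all hypothetical, and it is not the format produced by the code in the npdoty/bigbang provenance branch.

```python
import json
from datetime import datetime, timezone


def build_provenance(list_url, tool_name, tool_version, downloaded_at=None):
    """Return a flat dict describing one collected archive.

    The field names are hypothetical, chosen to mirror the questions
    raised in the thread (when, which list, which software).
    """
    if downloaded_at is None:
        downloaded_at = datetime.now(timezone.utc)
    return {
        "list_url": list_url,
        "downloaded_at": downloaded_at.isoformat(),
        "tool": tool_name,
        "tool_version": tool_version,
    }


def to_yaml(record):
    """Serialize a flat str -> str dict as YAML text.

    Each value is written as a JSON double-quoted string, which is
    also a valid YAML scalar, so no YAML library is needed here.
    """
    return "".join(f"{key}: {json.dumps(value)}\n" for key, value in record.items())


if __name__ == "__main__":
    # Placeholder list URL and version string, for illustration only.
    record = build_provenance(
        "https://lists.example.org/archives/demo/", "bigbang", "unknown"
    )
    print(to_yaml(record), end="")
```

Keeping such a file next to each archive snapshot in a versioned repository would let collaborators see how and when that snapshot was collected.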