[Bigbang-user] R: R: Issue with listserv fetching (3GPP)
Christoph Becker
chrbecker01 at gmail.com
Fri May 7 09:45:08 CEST 2021
Hi Riccardo & Niels,
with the most recent PR and the release of version 0.3, you can now
scrape 3GPP & IEEE with:
- python3 bin/collect_mail.py -f examples/url_collections/listserv.3GPP.txt
- python3 bin/collect_mail.py -f examples/url_collections/listserv.IEEE.txt
Collecting the entire archive takes a while, so you could create several
.txt files that each contain only part of the complete list of URLs.
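Splitting the URL list can be scripted. The following is only a minimal sketch, assuming the .txt files hold one URL per line; the function name split_url_list, the chunk size and the output folder are made up for illustration and are not part of BigBang.

```python
from pathlib import Path
from typing import List


def split_url_list(source: Path, chunk_size: int, out_dir: Path) -> List[Path]:
    """Write the non-empty lines of `source` into files of at most `chunk_size` lines.

    Hypothetical helper, not part of BigBang. Assumes one URL per line.
    """
    urls = [line.strip() for line in source.read_text().splitlines() if line.strip()]
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    for start in range(0, len(urls), chunk_size):
        # Number the parts from 1: <stem>.part1.txt, <stem>.part2.txt, ...
        part = out_dir / f"{source.stem}.part{start // chunk_size + 1}.txt"
        part.write_text("\n".join(urls[start:start + chunk_size]) + "\n")
        parts.append(part)
    return parts
```

For example, split_url_list(Path("examples/url_collections/listserv.3GPP.txt"), 20, Path("url_parts")) would return the paths of the partial lists, each of which can then be passed to bin/collect_mail.py with -f in a separate run.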
The downloaded archives will be stored, relative to the directory from
which you execute the code, in:
/bigbang/archives/IEEE/
/bigbang/archives/3GPP/
Each scraped mailing list will appear inside these folders as an .mbox
formatted file, with the file name being the name of the mailing list.
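To check quickly that a download worked, the resulting files can be opened with the mailbox module from Python's standard library. This is only a sketch: summarise_mbox is a made-up helper, not a BigBang function, and the path you pass in depends on where you ran the scraper.

```python
import mailbox
from collections import Counter


def summarise_mbox(path: str) -> Counter:
    """Count the messages per 'From' header in one mbox file."""
    senders = Counter()
    # create=False raises an error instead of silently creating an empty file.
    box = mailbox.mbox(path, create=False)
    try:
        for message in box:
            senders[message.get("From", "(unknown)")] += 1
    finally:
        box.close()
    return senders
```

Calling sum(summarise_mbox(path).values()) gives the total number of messages in the file, which is a simple way to spot a list that was scraped only in part.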
Please let me know if you have any questions, suggestions, bug reports, ...
Have a nice weekend,
Chris
On Fri, 30 Apr 2021 at 13:54, Riccardo Nanni <riccardo.nanni9 at unibo.it> wrote:
> Hi everyone,
>
> how are you?
> I noticed that when running
>
> python3 bin/collect_mail.py -u https://atlarge-lists.icann.org/pipermail/idn-wg/
>
> it doesn't download all the emails, but only July to December 2007. Then I
> realised all the other emails are archived per quarter instead of month.
> Are the two things connected? Maybe it is something you're already aware
> of, but I thought it was useful to report it here.
>
> @Chris, thanks a lot for the alternative way to scrape Listserv you sent!
> Haven't tried it yet, sorry.
> Best,
>
> Riccardo
> ------------------------------
> *From:* Christoph Becker <chrbecker01 at gmail.com>
> *Sent:* Friday 23 April 2021 12:35
> *To:* Riccardo Nanni <riccardo.nanni9 at unibo.it>
> *Cc:* Niels ten Oever <mail at nielstenoever.net>; bigbang-user at data-activism.net
> *Subject:* Re: R: R: [Bigbang-user] Issue with listserv fetching (3GPP)
>
> Hi Riccardo,
> just realised that I forgot to declare what "auth_key_mock" is.
> You can set up an AuthSession on the 3GPP Listserv archive after creating
> an account there <https://list.etsi.org/scripts/wa.exe?GETPW1> and passing
> your credentials to the function as shown below.
>
>
> ---------------------------------------------------------------------------------------------------
> import bigbang
> from bigbang import listserv
> from bigbang.listserv import ListservArchive, ListservList, ListservMessage
>
> url_archive = "https://list.etsi.org/scripts/wa.exe?"
> url_list = url_archive + "A0=3GPP_TSG_CT_WG6"
> # Replace the placeholders with the credentials of your list.etsi.org account.
> auth_key_mock = {"username": "your_usrname", "password": "your_psw"}
>
> ListservArchive.from_url(
>     name="3GPP",
>     url_root=url_archive,
>     url_home=url_archive + "HOME",
>     login=auth_key_mock,
>     instant_save=True,
>     only_mlist_urls=False,
> )
>
> ---------------------------------------------------------------------------------------------------
>
> Please excuse this awkward way of explaining it.
> I will try to update the wiki on the git repo asap.
>
> Best Wishes,
> Christoph
>
>
> On Fri, 23 Apr 2021 at 10:53, Riccardo Nanni <riccardo.nanni9 at unibo.it> wrote:
>
> Great!
>
> Thank you again, Christoph and Niels, later I'll try it.
> Best,
>
> Riccardo
> ------------------------------
> *From:* Niels ten Oever <mail at nielstenoever.net>
> *Sent:* Friday 23 April 2021 11:50
> *To:* Riccardo Nanni <riccardo.nanni9 at unibo.it>; Christoph Becker <chrbecker01 at gmail.com>
> *Cc:* bigbang-user at data-activism.net
> *Subject:* Re: R: R: [Bigbang-user] Issue with listserv fetching (3GPP)
>
> Thanks Christoph!
>
> This was the content of the file example.py:
>
> import bigbang
> from bigbang import listserv
> from bigbang.listserv import ListservArchive, ListservList, ListservMessage
>
> url_archive = "https://list.etsi.org/scripts/wa.exe?"
> url_list = url_archive + "A0=3GPP_TSG_CT_WG6"
>
> ListservArchive.from_url(
>     name="3GPP",
>     url_root=url_archive,
>     url_home=url_archive + "HOME",
>     login=auth_key_mock,
>     instant_save=True,
>     only_mlist_urls=False,
> )
>
>
> Best,
>
> Niels
>
> On 23-04-2021 09:25, Riccardo Nanni wrote:
> > Dear Niels and Christoph,
> >
> > thanks a lot for your help!
> > I tried Niels' way and I keep getting the 'instant_dump' error.
> > I did 'git branch' and it shows the following:
> >
> > *main
> > master
> >
> > I understand I am on the 'main' branch, is that right?
> > Then I tried 'git pull' again and it says it is already up to date, but it keeps showing the 'instant_dump' message when I try the usual command.
> >
> > @Christoph: thank you for sharing the file on the alternative way to gather listserv emails, but I don't think it came through: all I can find is an error message that says an attachment was detected as malware (I guess my computer 'misread' your file?). Any chance you can share it again, please?
> >
> > Thanks a lot again, you're all very helpful! As I'm better at cooking than programming, when you come to Italy I owe you a dinner 🙂🙂
> > Cheers,
> >
> > Riccardo
> >
> > ------------------------------
> > *From:* Christoph Becker <chrbecker01 at gmail.com>
> > *Sent:* Friday 23 April 2021 00:23
> > *To:* Niels ten Oever <mail at nielstenoever.net>
> > *Cc:* Riccardo Nanni <riccardo.nanni9 at unibo.it>; bigbang-user at data-activism.net
> > *Subject:* Re: R: [Bigbang-user] Issue with listserv fetching (3GPP)
> >
> > Hi Niels & Riccardo,
> > the argument 'instant_dump' of the ListservArchive class no longer exists in the up-to-date 'main' branch of the git repo.
> > @Niels: Do you mean that you did a 'git pull' and also encountered the TypeError caused by the missing 'instant_dump'?
> >
> > But as I said in another message, we are not quite there yet for 3GPP and IEEE to use the 'conventional' way in which BigBang scrapes archives such as W3C.
> > I attached a small example that shows how you can currently scrape the 3GPP archive and save it to mbox files in the CONFIG.mail_path folder.
> > Be aware that this could take a very long time and could use a lot of memory.
> >
> > Best Wishes,
> > Christoph
> >
> >
> > On Thu, 22 Apr 2021 at 17:17, Niels ten Oever <mail at nielstenoever.net> wrote:
> >
> > Hi Riccardo and Christoph,
> >
> > I see there might be an issue with special characters in the mailing list URLs. To get it working I had to put a '\' in front of the '?', but this could also be fixed by putting " " around the URL. However, after that fetching did not work either, so let's ask Christoph (cc).
> >
> > Cheers,
> >
> > Niels
> >
> >
> >
> >
> >
> >
> > On 22-04-2021 17:43, Riccardo Nanni wrote:
> > > Hi Niels,
> > >
> > > thanks for your answer!
> > > I did, and I found the changes I can see on GitHub (e.g. the listserv.3GPP.txt file, etc.).
> > > I did it again when I saw it didn't work and it says 'già aggiornato' (already updated).
> > >
> > > Riccardo
> > >
> > >
> > > ------------------------------
> > > *From:* Bigbang-user <bigbang-user-bounces at data-activism.net> on behalf of Niels ten Oever <mail at nielstenoever.net>
> > > *Sent:* Thursday 22 April 2021 17:38
> > > *To:* bigbang-user at data-activism.net
> > > *Subject:* Re: [Bigbang-user] Issue with listserv
> > >
> > > Hi Riccardo,
> > >
> > > This is not a very informed response - but did you first do:
> > >
> > > git pull
> > >
> > > to ensure that you have the latest version with all the recent changes?
> > >
> > > Best,
> > >
> > > Niels
> > >
> > > On 22-04-2021 17:31, Riccardo Nanni wrote:
> > >> Dear all,
> > >>
> > >> how are you?
> > >> I tried to collect email from 3GPP by running these commands:
> > >> python bin/collect_mail.py -u https://list.etsi.org/scripts/wa.exe?
> > >> python3 bin/collect_mail.py -u https://list.etsi.org/scripts/wa.exe?
> > >> AND
> > >> python3 bin/collect_mail.py -f examples/url_collections/listserv.3GPP.txt
> > >>
> > >> I also tried to scrape a specific group's list with the same commands: https://list.etsi.org/scripts/wa.exe?A0=3GPP_TSG_RAN
> > >>
> > >> I get the following error:
> > >> TypeError: from_url() got an unexpected keyword argument 'instant_dump'
> > >>
> > >> I don't understand what I'm missing. Can you help me, please?
> > >> Thanks a lot in advance! The only similar question I could find on Stack Overflow has no answers...
> > >>
> > >> Riccardo
> > >>
> > >>
> > >>
> > >>
> > >> _______________________________________________
> > >> Bigbang-user mailing list
> > >> Bigbang-user at data-activism.net
> > >> https://lists.ghserv.net/mailman/listinfo/bigbang-user
> > >>
> > >
> > > --
> > > Niels ten Oever, PhD
> > > Postdoctoral Researcher - Media Studies Department - University of Amsterdam
> > > Research Fellow - Centre for Internet and Human Rights - European University Viadrina
> > > Associated Scholar - Centro de Tecnologia e Sociedade - Fundação Getúlio Vargas
> > >
> > > https://nielstenoever.net - mail at nielstenoever.net - @nielstenoever - +31629051853
> > > PGP: 2458 0B70 5C4A FD8A 9488 643A 0ED8 3F3A 468A C8B3
> > >
> > > Read my latest article on Internet infrastructure governance in New Media & Society here:
> > > https://journals.sagepub.com/doi/full/10.1177/1461444820929320
> > >
> >
> > --
> > <><><><><><><><><><><><><><><><>
> > Christoph Becker (he/him/his)
> > PhD at the
> > Institute for Data Science and
> > Institute for Computational Cosmology
> > Durham University
> > United Kingdom
> > christovis.github.io <http://christovis.github.io>
>
--
<><><><><><><><><><><><><><><><>
*Christoph Becker (he/him/his)*
*PhD at the*
*Institute for Data Science and*
*Institute for Computational Cosmology*
*Durham University*
*United Kingdom*
*christovis.github.io* <http://christovis.github.io>