[Bigbang-user] R: R: Issue with listserv fetching (3GPP)
Niels ten Oever
mail at nielstenoever.net
Fri May 7 18:01:24 CEST 2021
Hi Christoph,
Some things are going well - some things are not :) Please find some code outputs below:
gagarin at kosmos ~/Data/bigbang main v0.3.0 python3 bin/collect_mail.py -f examples/url_collections/listserv.3GPP.txt
Traceback (most recent call last):
  File "bin/collect_mail.py", line 71, in <module>
    main(args)
  File "bin/collect_mail.py", line 67, in main
    mailman.collect_from_file(args.f, notes=notes)
  File "/home/gagarin/Data/bigbang/bigbang/mailman.py", line 168, in collect_from_file
    collect_from_url(urls, archive_dir=archive_dir, notes=notes)
  File "/home/gagarin/Data/bigbang/bigbang/mailman.py", line 107, in collect_from_url
    url, archive_dir=archive_dir, notes=notes
  File "/home/gagarin/Data/bigbang/bigbang/mailman.py", line 311, in collect_archive_from_url
    instant_save=True,
  File "/home/gagarin/Data/bigbang/bigbang/listserv.py", line 992, in from_mailing_lists
    mlist.to_mbox(dir_out=dir_out)
  File "/home/gagarin/Data/bigbang/bigbang/listserv.py", line 844, in to_mbox
    msg.to_mbox(filepath, mode="a")
  File "/home/gagarin/Data/bigbang/bigbang/listserv.py", line 396, in to_mbox
    self.fromaddr,
  File "/home/gagarin/Data/bigbang/bigbang/listserv.py", line 372, in create_message_id
    message_id = (".").join([date, from_address])
TypeError: sequence item 1: expected str instance, NoneType found
gagarin at kosmos ~/Data/bigbang main v0.3.0 cd archives/3GPP 1 ↵ 1009 17:57:14
gagarin at kosmos ~/Data/bigbang/archives/3GPP main v0.3.0 ls ✔ 1010 17:57:20
3GPP_TSG_GERAN_WG1.mbox
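The TypeError in the traceback above can be reproduced in isolation: str.join refuses a list that contains None. The following is only a minimal sketch, assuming the sender address of one message was parsed as None. The helper create_message_id_safe and the "unknown" fallback are hypothetical and are not BigBang's code.

```python
def create_message_id_safe(date, from_address, fallback="unknown"):
    """Join date and sender into an id, replacing missing parts.

    Hypothetical helper for illustration only; not part of BigBang.
    """
    parts = [date, from_address]
    # str.join only accepts strings, so None is swapped for a placeholder.
    return ".".join(fallback if part is None else str(part) for part in parts)


# Reproduce the failure from the traceback with a missing sender address.
try:
    ".".join(["2021-05-07", None])
except TypeError as err:
    error_text = str(err)

# The guarded version returns a usable id instead of raising.
message_id = create_message_id_safe("2021-05-07", None)
print(error_text)
print(message_id)
```

Whether a placeholder id is acceptable for the mbox output is a design choice for the maintainers; skipping such messages would be another option.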
So the download does not run to completion, maybe because I have not set the account name and password? Where should I set them?
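On the question of where to set the credentials: the quoted example further down passes them as a dict through the login argument of ListservArchive.from_url. A minimal sketch, assuming one would rather keep them out of the script itself. The environment variable names and the helper login_from_env are hypothetical and not part of BigBang.

```python
import os


def login_from_env(user_var="LISTSERV_USERNAME", pass_var="LISTSERV_PASSWORD", environ=None):
    """Build a {"username": ..., "password": ...} dict from environment variables.

    The variable names are assumptions made for this sketch.
    """
    env = os.environ if environ is None else environ
    missing = [name for name in (user_var, pass_var) if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {"username": env[user_var], "password": env[pass_var]}


# Demo with an explicit mapping so the sketch runs without real credentials.
demo_login = login_from_env(
    environ={"LISTSERV_USERNAME": "your_usrname", "LISTSERV_PASSWORD": "your_psw"}
)
print(sorted(demo_login))
```

The returned dict has the same shape as auth_key_mock in the quoted example and could be passed as login=... there. Whether bin/collect_mail.py picks up credentials by itself is not something this thread confirms.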
Content of listserv.log can be found here: https://pastebin.com/mXuUBeHR
Thanks for the work!
Cheers,
Niels
On 07-05-2021 09:45, Christoph Becker wrote:
> Hi Riccardo & Niels,
> with the most recent PR and the release of the new version 0.3 you are now able to scrape 3GPP & IEEE with:
>
> * python3 bin/collect_mail.py -f examples/url_collections/listserv.3GPP.txt
> * python3 bin/collect_mail.py -f examples/url_collections/listserv.IEEE.txt
>
> Collecting the entire archive takes a while, so you could create several .txt files that each contain only part of the complete list of URLs.
> The downloaded archives will be stored under the directory from which you execute the code, in:
> /bigbang/archives/IEEE/
> /bigbang/archives/3GPP/
> Each scraped mailing list will appear inside these folders as an .mbox file named after the mailing list.
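The suggestion above to split the URL list into several smaller .txt files could be sketched as follows. This assumes the collection file holds one URL per line, which the thread does not state explicitly; the function split_url_file, the part-file naming, and the example.org URLs are hypothetical.

```python
from pathlib import Path
import tempfile


def split_url_file(src, chunk_size, out_dir):
    """Write the URLs in src to numbered part files of at most chunk_size lines."""
    src = Path(src)
    urls = [line.strip() for line in src.read_text().splitlines() if line.strip()]
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    for start in range(0, len(urls), chunk_size):
        part = out_dir / f"{src.stem}.part{start // chunk_size + 1}.txt"
        part.write_text("\n".join(urls[start:start + chunk_size]) + "\n")
        parts.append(part)
    return parts


# Demo on a temporary file with placeholder URLs.
with tempfile.TemporaryDirectory() as tmp:
    demo_src = Path(tmp) / "listserv.demo.txt"
    demo_src.write_text("\n".join(f"https://example.org/list{i}" for i in range(5)) + "\n")
    demo_parts = split_url_file(demo_src, chunk_size=2, out_dir=Path(tmp) / "parts")
    part_names = [p.name for p in demo_parts]
    part_sizes = [len(p.read_text().splitlines()) for p in demo_parts]

print(part_names, part_sizes)
```

Each part file could then be passed to bin/collect_mail.py with -f in a separate run.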
>
> Please let me know if you have any questions, suggestions, bug reports, ...
>
> Have a nice weekend,
> Chris
>
> On Fri 30 Apr 2021 at 13:54, Riccardo Nanni <riccardo.nanni9 at unibo.it> wrote:
>
> Hi everyone,
>
> how are you?
> I noticed that when running
>
> python3 bin/collect_mail.py -u https://atlarge-lists.icann.org/pipermail/idn-wg/
>
> it doesn't download all the emails, but only July to December 2007. Then I realised all the other emails are archived per quarter instead of month. Are the two things connected? Maybe it is something you're already aware of, but I thought it was useful to report it here.
>
> @Chris, thanks a lot for the alternative way to scrape Listserv you sent! Haven't tried it yet, sorry.
> Best,
>
> Riccardo
> ------------------------------------------------------------------------
> *From:* Christoph Becker <chrbecker01 at gmail.com>
> *Sent:* Friday, 23 April 2021 12:35
> *To:* Riccardo Nanni <riccardo.nanni9 at unibo.it>
> *Cc:* Niels ten Oever <mail at nielstenoever.net>; bigbang-user at data-activism.net
> *Subject:* Re: R: R: [Bigbang-user] Issue with listserv fetching (3GPP)
>
> Hi Riccardo,
> just realised that I forgot to declare what "auth_key_mock" is.
> You can set up an AuthSession on the 3GPP Listserv archive after creating an account there <https://list.etsi.org/scripts/wa.exe?GETPW1>, and then pass your credentials to the function as shown below.
>
> ---------------------------------------------------------------------------------------------------
> import bigbang
> from bigbang import listserv
> from bigbang.listserv import ListservArchive, ListservList, ListservMessage
>
> # Root URL of the Listserv web interface that hosts the 3GPP lists
> url_archive = "https://list.etsi.org/scripts/wa.exe?"
> # URL of a single mailing list (not used in the call below)
> url_list = url_archive + "A0=3GPP_TSG_CT_WG6"
> # Replace the placeholders with your own Listserv account credentials
> auth_key_mock = {"username": "your_usrname", "password": "your_psw"}
>
> ListservArchive.from_url(
>     name="3GPP",
>     url_root=url_archive,
>     url_home=url_archive + "HOME",
>     login=auth_key_mock,
>     instant_save=True,
>     only_mlist_urls=False,
> )
> ---------------------------------------------------------------------------------------------------
>
> Please excuse this awkward way of explaining it.
> I will try to update the wiki on the git repo asap.
>
> Best Wishes,
> Christoph
>
>
> On Fri 23 Apr 2021 at 10:53, Riccardo Nanni <riccardo.nanni9 at unibo.it> wrote:
>
> Great!
>
> Thank you again, Christoph and Niels, later I'll try it.
> Best,
>
> Riccardo
> ------------------------------------------------------------------------
> *From:* Niels ten Oever <mail at nielstenoever.net>
> *Sent:* Friday, 23 April 2021 11:50
> *To:* Riccardo Nanni <riccardo.nanni9 at unibo.it>; Christoph Becker <chrbecker01 at gmail.com>
> *Cc:* bigbang-user at data-activism.net
> *Subject:* Re: R: R: [Bigbang-user] Issue with listserv fetching (3GPP)
>
> Thanks Christoph!
>
> This was the content of the file example.py:
>
> import bigbang
> from bigbang import listserv
> from bigbang.listserv import ListservArchive, ListservList, ListservMessage
>
> url_archive = "https://list.etsi.org/scripts/wa.exe?"
> url_list = url_archive + "A0=3GPP_TSG_CT_WG6"
>
> ListservArchive.from_url(
>     name="3GPP",
>     url_root=url_archive,
>     url_home=url_archive + "HOME",
>     login=auth_key_mock,
>     instant_save=True,
>     only_mlist_urls=False,
> )
>
>
> Best,
>
> Niels
>
> On 23-04-2021 09:25, Riccardo Nanni wrote:
> > Dear Niels and Christoph,
> >
> > thanks a lot for your help!
> > I tried Niels' way and I keep getting the 'instant_dump'.
> > I did 'git branch' and it shows the following:
> >
> > *main
> > master
> >
> > I understand I am on the 'main' branch, is it right?
> > Then I tried 'git pull' again and it says it is already updated, but it keeps showing the 'instant_dump' message when I try the usual command.
> >
> > @Christoph: thank you for sharing the file on the alternative way to gather listserv emails, but I don't think it came through: all I can find is an error message that says an attachment was detected as malware (guess my computer 'misread' your file?). Any chance you can share it again, please?
> >
> > Thanks a lot again, you're all very helpful! As I'm better at cooking than programming, when you come to Italy I owe you a dinner 🙂🙂
> > Cheers,
> >
> > Riccardo
> > ------------------------------------------------------------------------
> > *From:* Christoph Becker <chrbecker01 at gmail.com>
> > *Sent:* Friday, 23 April 2021 00:23
> > *To:* Niels ten Oever <mail at nielstenoever.net>
> > *Cc:* Riccardo Nanni <riccardo.nanni9 at unibo.it>; bigbang-user at data-activism.net
> > *Subject:* Re: R: [Bigbang-user] Issue with listserv fetching (3GPP)
> >
> > Hi Niels & Riccardo,
> > the argument 'instant_dump' for the ListservArchive class object does not exist anymore in the up-to-date 'main' branch of the git repo.
> > @Niels: Do you mean that you did a 'git pull' and encountered the TypeError caused by missing 'instant_dump' too?
> >
> > But as I said in another message, we are not quite there yet for 3GPP and IEEE to use the 'conventional' way in which BigBang scrapes archives such as W3C.
> > I attached a small example that shows how you can currently scrape the 3GPP archive and save it to mbox files in the CONFIG.mail_path folder.
> > Be aware that this could take very long and could use a lot of memory.
> >
> > Best Wishes,
> > Christoph
> >
> >
> > On Thu 22 Apr 2021 at 17:17, Niels ten Oever <mail at nielstenoever.net> wrote:
> >
> > Hi Riccardo and Christoph,
> >
> > I see there might be an issue with special characters in the mailing list URLs. To get it working I had to put a '\' in front of the '?', but this could also be fixed by putting quotation marks (" ") around the URL. However, after that fetching did not work either - so let's ask Christoph (cc).
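The quoting issue described above arises because '?' is a wildcard character in shells such as zsh and bash, so an unquoted URL containing it may be treated as a filename pattern. A minimal sketch using Python's standard shlex module to build a safely quoted command line; the URL is the one from the original report, and composing the command as a string here is for illustration only.

```python
import shlex

url = "https://list.etsi.org/scripts/wa.exe?A0=3GPP_TSG_RAN"

# shlex.quote wraps the URL in single quotes because '?' is not shell-safe.
quoted_url = shlex.quote(url)
command = "python3 bin/collect_mail.py -u " + quoted_url

# Splitting the command the way a POSIX shell would recovers the URL intact.
parsed = shlex.split(command)
print(command)
print(parsed[-1] == url)
```

Typing the URL in single or double quotes by hand on the command line has the same effect.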
> >
> > Cheers,
> >
> > Niels
> >
> >
> >
> >
> >
> >
> > On 22-04-2021 17:43, Riccardo Nanni wrote:
> > > Hi Niels,
> > >
> > > thanks for your answer!
> > > I did, and I found the changes I can see in Github (e.g. the listserv.3GPP.txt file, etc.).
> > > I did it again when I saw it didn't work and it says 'già aggiornato' (already updated).
> > >
> > > Riccardo
> > >
> > ------------------------------------------------------------------------
> > > *From:* Bigbang-user <bigbang-user-bounces at data-activism.net> on behalf of Niels ten Oever <mail at nielstenoever.net>
> > > *Sent:* Thursday, 22 April 2021 17:38
> > > *To:* bigbang-user at data-activism.net
> > > *Subject:* Re: [Bigbang-user] Issue with listserv
> > >
> > > Hi Riccardo,
> > >
> > > This is not a very informed response - but did you first do:
> > >
> > > git pull
> > >
> > > to ensure that you have the latest version with all the recent changes?
> > >
> > > Best,
> > >
> > > Niels
> > >
> > > On 22-04-2021 17:31, Riccardo Nanni wrote:
> > >> Dear all,
> > >>
> > >> how are you?
> > >> I tried to collect email from 3GPP by running these commands:
> > >> python bin/collect_mail.py -u https://list.etsi.org/scripts/wa.exe?
> > >> python3 bin/collect_mail.py -u https://list.etsi.org/scripts/wa.exe?
> > >> AND
> > >> python3 bin/collect_mail.py -f examples/url_collections/listserv.3GPP.txt
> > >>
> > >> Also tried to scrape a specific group's list with the same commands: https://list.etsi.org/scripts/wa.exe?A0=3GPP_TSG_RAN
> > >>
> > >> I get the following error:
> > >> TypeError: from_url() got an unexpected keyword argument 'instant_dump'
> > >>
> > >> I don't understand what I'm missing. Can you help me, please?
> > >> Thanks a lot in advance! The only similar question I could find on Stack Overflow has no answers...
> > >>
> > >> Riccardo
> > >>
> > >>
> > >>
> > >>
> > >> _______________________________________________
> > >> Bigbang-user mailing list
> > >> Bigbang-user at data-activism.net
> > >> https://lists.ghserv.net/mailman/listinfo/bigbang-user
> > >>
> > >
> > > --
> > > Niels ten Oever, PhD
> > > Postdoctoral Researcher - Media Studies Department - University of Amsterdam
> > > Research Fellow - Centre for Internet and Human Rights - European University Viadrina
> > > Associated Scholar - Centro de Tecnologia e Sociedade - Fundação Getúlio Vargas
> > >
> > > https://nielstenoever.net - mail at nielstenoever.net - @nielstenoever - +31629051853
> > > PGP: 2458 0B70 5C4A FD8A 9488 643A 0ED8 3F3A 468A C8B3
> > >
> > > Read my latest article on Internet infrastructure governance in New Media & Society here: https://journals.sagepub.com/doi/full/10.1177/1461444820929320
> > >
> > --
> > <><><><><><><><><><><><><><><><>
> > Christoph Becker (he/him/his)
> > PhD at the
> > Institute for Data Science and
> > Institute for Computational Cosmology
> > Durham University
> > United Kingdom
> > christovis.github.io <http://christovis.github.io>
>
--
Niels ten Oever, PhD
Postdoctoral Researcher - Media Studies Department - University of Amsterdam
Research Fellow - Centre for Internet and Human Rights - European University Viadrina
Associated Scholar - Centro de Tecnologia e Sociedade - Fundação Getúlio Vargas
Affiliated Faculty - Digital Democracy Institute - Simon Fraser University
https://nielstenoever.net - mail at nielstenoever.net - @nielstenoever - +31629051853
PGP: 2458 0B70 5C4A FD8A 9488 643A 0ED8 3F3A 468A C8B3
Read my latest article on Internet infrastructure governance in New Media & Society here: https://journals.sagepub.com/doi/full/10.1177/1461444820929320