[Bigbang-user] R: R: R: Issue with listserv fetching (3GPP)

Riccardo Nanni riccardo.nanni9 at unibo.it
Fri Apr 30 14:53:58 CEST 2021


Hi everyone,

how are you?
I noticed that when running

python3 bin/collect_mail.py -u https://atlarge-lists.icann.org/pipermail/idn-wg/

it downloads only the emails from July to December 2007 rather than the whole archive. I then noticed that all the other emails are archived per quarter instead of per month. Are the two things connected? You may already be aware of this, but I thought it would be useful to report it here.
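
One way to check this hypothesis is to classify the archive file names listed on the index page by period. The sketch below is not part of BigBang: the function name `classify_period` and both patterns are assumptions, based on the usual Mailman pipermail naming of monthly volumes such as `2007-July.txt.gz` and quarterly volumes such as `2008q1.txt.gz`. A scraper that matches only the monthly pattern would skip the quarterly files.

```python
import re

# Assumed pipermail naming: "2007-July.txt.gz" (monthly), "2008q1.txt.gz" (quarterly).
MONTHLY = re.compile(r"^\d{4}-[A-Z][a-z]+\.txt(\.gz)?$")
QUARTERLY = re.compile(r"^\d{4}q[1-4]\.txt(\.gz)?$")


def classify_period(filename):
    """Return "month", "quarter", or None for an archive file name."""
    if MONTHLY.match(filename):
        return "month"
    if QUARTERLY.match(filename):
        return "quarter"
    return None


if __name__ == "__main__":
    for name in ["2007-July.txt.gz", "2008q1.txt.gz", "index.html"]:
        print(name, "->", classify_period(name))
```

If the file names on the idn-wg index page fall into the "quarter" group for everything outside July to December 2007, that would support the idea that only monthly volumes are being picked up.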

@Chris, thanks a lot for the alternative way to scrape Listserv that you sent! I haven't tried it yet, sorry.
Best,

Riccardo
________________________________
From: Christoph Becker <chrbecker01 at gmail.com>
Sent: Friday, 23 April 2021 12:35
To: Riccardo Nanni <riccardo.nanni9 at unibo.it>
Cc: Niels ten Oever <mail at nielstenoever.net>; bigbang-user at data-activism.net <bigbang-user at data-activism.net>
Subject: Re: R: R: [Bigbang-user] Issue with listserv fetching (3GPP)

Hi Riccardo,
I just realised that I forgot to declare what "auth_key_mock" is.
You can set up an AuthSession on the 3GPP Listserv archive after creating an account there <https://list.etsi.org/scripts/wa.exe?GETPW1> and passing your credentials to the function as shown below.

---------------------------------------------------------------------------------------------------
import bigbang
from bigbang import listserv
from bigbang.listserv import ListservArchive, ListservList, ListservMessage

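# Root URL of the 3GPP Listserv archive hosted by ETSI.
url_archive = "https://list.etsi.org/scripts/wa.exe?"
# URL of a single mailing list; defined for reference, not used below.
url_list = url_archive + "A0=3GPP_TSG_CT_WG6"
# Replace the placeholders with your own Listserv account credentials.
auth_key_mock = {"username": "your_username", "password": "your_password"}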

ListservArchive.from_url(
    name="3GPP",
    url_root=url_archive,
    url_home=url_archive + "HOME",
    login=auth_key_mock,
    instant_save=True,
    only_mlist_urls=False,
)
---------------------------------------------------------------------------------------------------
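
To avoid hard-coding the password in a script, the credentials dictionary could be built from environment variables. This is only a sketch: the helper `load_credentials` and the variable names `LISTSERV_USERNAME` and `LISTSERV_PASSWORD` are hypothetical and not part of BigBang. The only thing taken from the example above is the shape of the dictionary passed as `login`.

```python
import os


def load_credentials(env=os.environ):
    """Build the login dictionary from environment variables (hypothetical names)."""
    try:
        return {
            "username": env["LISTSERV_USERNAME"],
            "password": env["LISTSERV_PASSWORD"],
        }
    except KeyError as missing:
        # Report which variable is absent instead of failing with a bare KeyError.
        raise RuntimeError(f"environment variable {missing} is not set") from None


if __name__ == "__main__":
    # Demonstration with a stand-in mapping instead of the real environment.
    demo = load_credentials({"LISTSERV_USERNAME": "alice", "LISTSERV_PASSWORD": "secret"})
    print(sorted(demo))
```

The result could then be passed as `login=load_credentials()` in place of `auth_key_mock`.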

Please excuse this awkward way of explaining it.
I will try to update the wiki on the git repo asap.

Best Wishes,
Christoph


On Fri, 23 Apr 2021 at 10:53, Riccardo Nanni <riccardo.nanni9 at unibo.it> wrote:
Great!

Thank you again, Christoph and Niels. I'll try it later.
Best,

Riccardo
________________________________
From: Niels ten Oever <mail at nielstenoever.net>
Sent: Friday, 23 April 2021 11:50
To: Riccardo Nanni <riccardo.nanni9 at unibo.it>; Christoph Becker <chrbecker01 at gmail.com>
Cc: bigbang-user at data-activism.net <bigbang-user at data-activism.net>
Subject: Re: R: R: [Bigbang-user] Issue with listserv fetching (3GPP)

Thanks Christoph!

This was the content of the file example.py:

import bigbang
from bigbang import listserv
from bigbang.listserv import ListservArchive, ListservList, ListservMessage

url_archive = "https://list.etsi.org/scripts/wa.exe?"
url_list = url_archive + "A0=3GPP_TSG_CT_WG6"

ListservArchive.from_url(
    name="3GPP",
    url_root=url_archive,
    url_home=url_archive + "HOME",
    login=auth_key_mock,
    instant_save=True,
    only_mlist_urls=False,
)


Best,

Niels

On 23-04-2021 09:25, Riccardo Nanni wrote:
> Dear Niels and Christoph,
>
> thanks a lot for your help!
> I tried Niels' way and I keep getting the 'instant_dump' error.
> I did 'git branch' and it shows the following:
>
> *main
> master
>
> I understand I am on the 'main' branch, is that right?
> Then I tried 'git pull' again and it says it is already up to date, but I still get the 'instant_dump' error when I run the usual command.
>
> @Christoph: thank you for sharing the file on the alternative way to gather listserv emails, but I don't think it came through: all I can find is an error message that says an attachment was detected as malware (guess my computer 'misread' your file?). Any chance you can share it again, please?
>
> Thanks a lot again, you're all very helpful! As I'm better at cooking than programming, when you come to Italy I owe you a dinner 🙂🙂
> Cheers,
>
> Riccardo
> ________________________________
> *From:* Christoph Becker <chrbecker01 at gmail.com>
> *Sent:* Friday, 23 April 2021 00:23
> *To:* Niels ten Oever <mail at nielstenoever.net>
> *Cc:* Riccardo Nanni <riccardo.nanni9 at unibo.it>; bigbang-user at data-activism.net <bigbang-user at data-activism.net>
> *Subject:* Re: R: [Bigbang-user] Issue with listserv fetching (3GPP)
>
> Hi Niels & Riccardo,
> the argument 'instant_dump' for the ListservArchive class object does not exist anymore in the up-to-date 'main' branch of the git repo.
> @Niels: Do you mean that you did a 'git pull' and encountered the TypeError caused by missing 'instant_dump' too?
>
> But as I said in another message, 3GPP and IEEE cannot yet be scraped with the 'conventional' method that BigBang uses for archives such as W3C.
> I attached a small example that shows how you can currently scrape the 3GPP archive and save it to mbox files in the CONFIG.mail_path folder.
> Be aware that this could take very long and could use a lot of memory.
>
> Best Wishes,
> Christoph
>
>
> On Thu, 22 Apr 2021 at 17:17, Niels ten Oever <mail at nielstenoever.net> wrote:
>
>     Hi Riccardo and Christoph,
>
>     I see there might be an issue with special characters in the mailing list URLs. To get it working I had to put a '\' in front of the '?', although putting double quotes around the URL should also fix this. However, fetching still did not work after that, so let's ask Christoph (cc).
>
>     Cheers,
>
>     Niels
>
>     On 22-04-2021 17:43, Riccardo Nanni wrote:
>     > Hi Niels,
>     >
>     > thanks for your answer!
>     > I did, and I found the changes I can see in Github (e.g. the listserv.3GPP.txt file, etc.).
>     > I did it again when I saw it didn't work and it says 'già aggiornato' (already updated).
>     >
>     > Riccardo
>     >
>     > ________________________________
>     > *From:* Bigbang-user <bigbang-user-bounces at data-activism.net> on behalf of Niels ten Oever <mail at nielstenoever.net>
>     > *Sent:* Thursday, 22 April 2021 17:38
>     > *To:* bigbang-user at data-activism.net <bigbang-user at data-activism.net>
>     > *Subject:* Re: [Bigbang-user] Issue with listserv
>     >
>     > Hi Riccardo,
>     >
>     > This is not a very informed response - but did you first do:
>     >
>     > git pull
>     >
>     > to ensure that you have the latest version with all the recent changes?
>     >
>     > Best,
>     >
>     > Niels
>     >
>     > On 22-04-2021 17:31, Riccardo Nanni wrote:
>     >> Dear all,
>     >>
>     >> how are you?
>     >> I tried to collect email from 3GPP by running these commands:
>     >> python bin/collect_mail.py -u https://list.etsi.org/scripts/wa.exe?
>     >> python3 bin/collect_mail.py -u https://list.etsi.org/scripts/wa.exe?
>     >> AND
>     >> python3 bin/collect_mail.py -f examples/url_collections/listserv.3GPP.txt
>     >>
>     >> Also tried to scrape a specific group's list with the same commands: https://list.etsi.org/scripts/wa.exe?A0=3GPP_TSG_RAN
>     >>
>     >> I get the following error:
>     >> TypeError: from_url() got an unexpected keyword argument 'instant_dump'
>     >>
>     >> I don't understand what I'm missing. Can you help me, please?
>     >> Thanks a lot in advance! The only similar question I could find on Stack Overflow has no answers...
>     >>
>     >> Riccardo
>     >>
>     >>
>     >>
>     >>
>     >> _______________________________________________
>     >> Bigbang-user mailing list
>     >> Bigbang-user at data-activism.net
>     >> https://lists.ghserv.net/mailman/listinfo/bigbang-user
>     >>
>     >
>     > --
>     > Niels ten Oever, PhD
>     > Postdoctoral Researcher - Media Studies Department - University of Amsterdam
>     > Research Fellow - Centre for Internet and Human Rights - European University Viadrina
>     > Associated Scholar - Centro de Tecnologia e Sociedade - Fundação Getúlio Vargas
>     >
>     > https://nielstenoever.net - mail at nielstenoever.net - @nielstenoever - +31629051853
>     > PGP: 2458 0B70 5C4A FD8A 9488 643A 0ED8 3F3A 468A C8B3
>     >
>     > Read my latest article on Internet infrastructure governance in New Media & Society here: https://journals.sagepub.com/doi/full/10.1177/1461444820929320
>     >
>
>
>
> --
> <><><><><><><><><><><><><><><><>
> Christoph Becker (he/him/his)
> PhD at the
> Institute for Data Science and
> Institute for Computational Cosmology
> Durham University
> United Kingdom
> christovis.github.io <http://christovis.github.io>
