[Bigbang-dev] Clarifying theoretical commitments going into IETF 116

Xue Li x.li3 at uva.nl
Mon Jan 30 09:29:55 CET 2023


Hi all,

I was on holiday in China until today. I just saw that there are interesting comments and insights on the NLP side of the work.
I would be happy to join the meetings going forward and discuss the details further 😊.

Best,
Effy

From: Bigbang-dev <bigbang-dev-bounces at data-activism.net> on behalf of Priyanka Sinha <priyanka.sinha.iitg at gmail.com>
Date: Monday, 30 January 2023 at 07:03
To: Sebastian Benthall <sbenthall at gmail.com>
Cc: bigbang-dev at data-activism.net <bigbang-dev at data-activism.net>
Subject: Re: [Bigbang-dev] Clarifying theoretical commitments going into IETF 116
I agree with you. Please find my comments inline.

On Wed, 25 Jan 2023 at 18:18, Sebastian Benthall <sbenthall at gmail.com> wrote:
From a computational perspective, as I understand what you are saying, doing CI would mean I just look at the flow of dialogue, i.e., the turn-by-turn order of the messages (posts and comments) that participants have posted. In a graph theory sense, though, I can ignore the temporal aspect and treat the whole conversation together. Technically, this may avoid the problems of short, noisy text, where some statistical NLP methods struggle for lack of context. It may also be less complex computationally.
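A minimal sketch of the time-agnostic view described above, assuming each message is reduced to a hypothetical `(message_id, sender, in_reply_to)` record. The record layout, the sample data, and the function name `reply_graph` are illustrative assumptions and are not taken from BigBang. All replies are collapsed into one weighted directed graph, so the order of the messages plays no role:

```python
from collections import Counter

# Hypothetical message records: (message_id, sender, in_reply_to).
messages = [
    ("m1", "alice", None),
    ("m2", "bob", "m1"),
    ("m3", "carol", "m1"),
    ("m4", "alice", "m2"),
    ("m5", "bob", "m4"),
]

def reply_graph(messages):
    """Return weighted directed edges (replier, original_sender) -> count."""
    sender_of = {mid: sender for mid, sender, _ in messages}
    edges = Counter()
    for _, sender, parent in messages:
        # Skip thread starters and replies to messages outside the archive.
        if parent is not None and parent in sender_of:
            edges[(sender, sender_of[parent])] += 1
    return edges

edges = reply_graph(messages)
```

Because the edge weights are simple counts, the same structure can be built for the whole archive or for any subset of it without touching the message text.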

Aha. I see what you mean. This does seem computationally tractable.
It reminds me of some of the earliest work I did with BigBang.

What comes to mind is that different working groups might be different 'contexts' and so have different patterns to how the discourse unfolds.

To be honest, this is a bit of a stretch for CI as envisioned by Helen Nissenbaum. But when I originally approached Helen after working on BigBang, I was also thinking about mailing lists as contexts and the messages sent as information flows. I suppose making this connection in a publication would be worthwhile :)

To really make it work with CI, we would also need to track personal identifiers within email bodies, i.e., not only replies to people but also references to people. (This could potentially include legal persons, such as company names.) So entity recognition would be great for this, if it were working.

So, identifying whether email addresses that differ slightly refer to the exact same person is something my algorithm can do; I presented it at the AID workshop.
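To make the alias problem concrete, here is a deliberately naive sketch. It is not the algorithm presented at the AID workshop, whose details are not given in this thread. It only groups addresses whose display name, or address local part when no name is given, normalizes to the same key. The function names and the sample senders are hypothetical:

```python
import re
from collections import defaultdict

def normalize(name, address):
    """Crude identity key: lowercased display name with punctuation removed,
    falling back to the local part of the address when no name is given."""
    key = re.sub(r"[^a-z ]", "", name.lower()).strip()
    if not key:
        local = address.split("@")[0].lower()
        key = re.sub(r"[^a-z]+", " ", local).strip()
    return " ".join(key.split())

def group_aliases(senders):
    """Map each identity key to the set of addresses that produced it."""
    groups = defaultdict(set)
    for name, address in senders:
        groups[normalize(name, address)].add(address.lower())
    return dict(groups)

# Hypothetical (display name, address) pairs.
senders = [
    ("Jane Doe", "jane.doe@example.org"),
    ("Jane  Doe", "jdoe@example.com"),
    ("", "jane.doe@example.net"),
    ("John Smith", "john@example.org"),
]
groups = group_aliases(senders)
```

A key-based grouping like this will wrongly merge different people who share a name and will miss nicknames and transliterations, which is where a dedicated algorithm is needed.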

Within the email body, entity recognition, and perhaps coreference resolution (i.e., when the name of the person or company is not present but is referred to with a pronoun such as he/she/they), has varying accuracy. I was happy to learn of Effy's work in this direction. Myself, I would try to use Effy's published work and also try the work of Lauren Berk (now Lauren Wheelock), https://github.com/lauren897 https://dspace.mit.edu/handle/1721.1/127291?show=full , which, when I attended, worked well for cases with short context.


What kind of graph metrics would you find worth tracking?

This is an interesting question for me, since I haven't thought of the graph in terms of measures like betweenness centrality. I thought of it as a representation that we mine for insights using new graph neural network algorithms. For example, suppose we represent the discourse as a multi-edged temporal graph, where the different edge types represent the different aspects of the communication that we take into account. We could then extract graphlets, which in my mind are homeomorphic subgraph patterns (perhaps of around 15 nodes, which could be one set of folks who hold a particular view), and label these graphlets as different viewpoints on privacy. I apologize if this doesn't make sense; I haven't figured it out yet.

Alternatively, we could skip this and model the problem as an agent simulation whose goals are related to CI. Inside it, we represent the agents and their interactions in the graph structure, and we build a learning model whose weights we learn by trying to reach the goals based on the existing dialogue traces (i.e., the mailing list conversations) we have.
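A small sketch of the multi-edged temporal graph idea above, under stated assumptions: edges are hypothetical `(source, target, kind, day)` tuples, the edge kinds `"reply"` and `"mention"` are invented for illustration, and triangles stand in for graphlets because they are the smallest non-trivial pattern. Larger patterns of the size mentioned above, and any graph neural network step, are out of scope here:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical typed, timestamped edges: (source, target, kind, day).
edges = [
    ("alice", "bob", "reply", 1),
    ("bob", "carol", "reply", 2),
    ("carol", "alice", "reply", 3),
    ("alice", "dave", "mention", 3),
    ("alice", "bob", "mention", 4),
]

def layer(edges, kind):
    """Undirected adjacency sets for one edge type, ignoring timestamps."""
    adj = defaultdict(set)
    for src, dst, k, _ in edges:
        if k == kind and src != dst:
            adj[src].add(dst)
            adj[dst].add(src)
    return adj

def triangles(adj):
    """All 3-node cliques in one layer, each reported once as a sorted tuple."""
    found = set()
    for node, nbrs in adj.items():
        for a, b in combinations(sorted(nbrs), 2):
            if b in adj[a]:
                found.add(tuple(sorted((node, a, b))))
    return found

reply_triangles = triangles(layer(edges, "reply"))
mention_triangles = triangles(layer(edges, "mention"))
```

Keeping the edge kind and timestamp on every edge means the same list can be projected per layer, as here, or filtered by time before extracting patterns.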


If the WN world view is so fine-grained that we need to look at timestamps and model in the continuous-time domain, then I think that is too challenging for me, albeit interesting. If WN is just about major events, so that we can split our data into windows or chunks manually, then we avoid the problem.

I need to dig deeper to recall exactly how the computational sociology components of WN work.
But my sense is that the qualitative theory in WN is much richer than its technical operationalization.
That leaves a big gap that we can start trying to fill.

I don't think continuous time analysis will be necessary; windows or chunks should be fine.
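The windowing approach can be sketched as follows, assuming messages are hypothetical `(message_id, date)` pairs and the window boundaries are chosen by hand. The boundary dates below are arbitrary placeholders, not events from the WN literature:

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import date

# Arbitrary placeholder dates that separate the windows.
boundaries = [date(2015, 1, 1), date(2019, 1, 1)]

def window_index(day, boundaries):
    """0 for dates before the first boundary, 1 for the next window, etc.
    A message sent on a boundary date falls into the later window."""
    return bisect_right(boundaries, day)

def split_by_window(messages, boundaries):
    """Group message ids by the window their date falls into."""
    windows = defaultdict(list)
    for msg_id, day in messages:
        windows[window_index(day, boundaries)].append(msg_id)
    return dict(windows)

# Hypothetical (message_id, date) pairs.
messages = [
    ("m1", date(2012, 1, 10)),
    ("m2", date(2015, 1, 1)),
    ("m3", date(2016, 3, 4)),
    ("m4", date(2020, 7, 7)),
]
windows = split_by_window(messages, boundaries)
```

Each window can then be analysed as its own static graph, which sidesteps continuous-time modelling entirely.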

Awesome!!!!

- S

-priyanka