[Bigbang-dev] BigBang and LLMs

Sebastian Benthall sbenthall at gmail.com
Mon Nov 28 17:34:09 CET 2022


Hello,

I had a new idea about how to expand BigBang that I wanted to run by the
rest of you.

Perhaps you have been following recent developments in "Large Language
Models" (LLMs), which are deep learned models of natural language developed
at great expense with enormous corpora of text data and billions of
parameters. The LLM has enough knowledge of language to perform many basic
NLP tasks with competence. It can also be adapted to a new task in two
ways: by 'fine-tuning' (further training the model on labeled examples of
the task) or by 'in-context learning' (supplying a few examples directly in
the prompt, with no further training).
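As a concrete illustration of the in-context route, here is a minimal sketch (pure Python, no model required) of how a few-shot prompt for one of our tasks might be assembled; the example addresses and organization labels are invented placeholders, not real data:

```python
# Sketch: assemble a few-shot prompt for affiliation tagging.
# The labeled examples below are invented placeholders.
EXAMPLES = [
    ("jane.doe@vendor.example", "Example Vendor Inc."),
    ("j.smith@cs.example.edu", "Example University"),
]

def build_prompt(address: str) -> str:
    """Build a few-shot prompt asking the model to infer affiliation."""
    lines = ["Map each email address to the sender's organization."]
    for addr, org in EXAMPLES:
        lines.append(f"Address: {addr}\nOrganization: {org}")
    # The final, unanswered example is the query the model completes.
    lines.append(f"Address: {address}\nOrganization:")
    return "\n\n".join(lines)

print(build_prompt("a.lovelace@example.org"))
```

The prompt string would then be passed to the model's text-generation interface; no weights are updated.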

We have been using rather traditional NLP tools with BigBang, such as
bag-of-words models, and have not had great success at automating tasks
such as
attributing organizational affiliation to individual emails, or performing
entity resolution. It would make sense to try to fine-tune an LLM to
perform these tasks.

One open science, open access LLM that we could use is BLOOM:
https://bigscience.huggingface.co/blog/bloom

In particular, I expect that extracting information from email signatures,
despite the complications of nested and quoted text, could be easy work for
an LLM; this is normally quite challenging with earlier NLP paradigms. (One
wonders whether email data was part of its training set.)
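For comparison, here is a minimal sketch of the kind of brittle heuristic an LLM could replace. It assumes the common "> " quoting and "-- " signature-delimiter conventions, which are by no means universal in our corpora; that fragility is exactly the problem:

```python
# Heuristic baseline: strip quoted text, then take everything after
# the conventional "-- " signature delimiter. Assumes "> " quoting
# and the "-- " delimiter, neither of which is guaranteed in practice.

def extract_signature(body: str) -> str:
    """Return the text after the '-- ' delimiter, ignoring quoted lines."""
    unquoted = [ln for ln in body.splitlines() if not ln.startswith(">")]
    for i, line in enumerate(unquoted):
        if line == "-- ":
            return "\n".join(unquoted[i + 1:]).strip()
    return ""  # no delimiter found; an LLM could often still recover it

msg = "Thanks!\n> quoted reply\n-- \nJane Doe\nExample University"
print(extract_signature(msg))  # -> "Jane Doe\nExample University"
```

A fine-tuned model would instead be given the raw body and trained to emit the signature fields directly, sidestepping these formatting assumptions.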

I believe that such a fine-tuning exercise can be trained and tested using
rather standard ML techniques, such as k-fold cross-validation, as long as
we have enough gold-standard, hand-labeled data for training and testing.
We might be able to create those labels using the organizations dataset:
https://github.com/datactive/bigbang/tree/main/bigbang/datasets/organizations
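The evaluation loop itself is standard; here is a minimal sketch of k-fold splitting in pure Python (the fold count and dataset size are arbitrary, and the actual train/score steps are left as placeholders):

```python
import random

def k_fold_indices(n: int, k: int, seed: int = 0):
    """Shuffle n example indices and yield (train, test) splits, one per fold."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

# Usage: with, say, 100 hand-labeled emails and 5 folds, each example
# lands in exactly one test fold. Fine-tune on `train`, score on `test`.
counts = {}
for train, test in k_fold_indices(100, 5):
    assert set(train).isdisjoint(test)
    for j in test:
        counts[j] = counts.get(j, 0) + 1
print(all(c == 1 for c in counts.values()))  # -> True
```

Averaging the per-fold scores would give us an honest estimate of how well the fine-tuned model generalizes beyond the hand-labeled set.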

Best regards,
Seb