[liberationtech] Linguistics identifies anonymous users
Gregory Foster
gfoster at entersection.org
Wed Jan 9 07:20:25 PST 2013
29c3 - "Stylometry and Online Underground Markets" w/ Aylin Caliskan
Islam, Rachel Greenstadt, and Sadia Afroz:
http://www.youtube.com/watch?v=QRY2mfLpPCs
http://events.ccc.de/congress/2012/Fahrplan/events/5230.en.html
gf
On 1/9/13 7:34 AM, Shava Nerad wrote:
> Such a framework can be social engineered as easily as SEO. I make a
> small living as a ghost writer and speech writer - the informal
> version of that very process. Several of my clients say my writing
> sounds more like them in print than they do, because they are less
> facile writers - but that is a fault that could be avoided in
> competent forgeries. ;)
>
> SN
>
>
> On Jan 9, 2013 8:25 AM, "Eugen Leitl" <eugen at leitl.org> wrote:
>> http://www.scmagazine.com.au/News/328135,linguistics-identifies-anonymous-users.aspx
>>
>> Linguistics identifies anonymous users
>>
>> By Darren Pauli on Jan 9, 2013 9:49 AM
>>
>> Researchers reveal carders, hackers on underground forums.
>>
>> Up to 80 percent of certain anonymous underground forum users can be
>> identified using linguistics, researchers say.
>>
>> The techniques compare user posts to track them across forums and could
>> even unveil authors of thesis papers or blogs who had taken to
>> underground networks.
>>
>> "If our dataset contains 100 users we can at least identify 80 of them,"
>> researcher Sadia Afroz told an audience at the 29C3 Chaos Communication
>> Congress in Germany.
>>
>> "Function words are very specific to the writer. Even if you are writing
>> a thesis, you'll probably use the same function words in chat messages.
>>
>> "Even if your text is not clean, your writing style can give you away."
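
To make the function-word idea concrete, here is a minimal sketch in
Python. It is not the researchers' method and not JStylo's API. The word
list, the function names and the cosine-similarity comparison are all
assumptions chosen for illustration; real stylometric feature sets are far
larger and are fed to a trained classifier.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny list of English function words.
FUNCTION_WORDS = [
    "the", "of", "and", "to", "a", "in", "that", "it", "is",
    "was", "for", "on", "with", "as", "but", "if", "or", "not",
]


def function_word_profile(text):
    """Return the relative frequency of each function word in text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t in FUNCTION_WORDS)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_author(unknown_text, known_texts):
    """Pick the known author whose function-word profile is most similar.

    known_texts maps an author label to a writing sample by that author.
    """
    target = function_word_profile(unknown_text)
    return max(
        known_texts,
        key=lambda name: cosine_similarity(
            target, function_word_profile(known_texts[name])
        ),
    )
```

Calling closest_author(post, {"alice": sample_a, "bob": sample_b}) returns
the label of the nearest profile. With samples far shorter than the
thousands of words the article mentions, the result is not meaningful.
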
>>
>> The analysis techniques could also reveal botnet owners and malware tool
>> authors, and provide insight into the size and scope of underground
>> markets, making the research appealing to law enforcement.
>>
>> To achieve their results the researchers used techniques including
>> stylometric analysis, the authorship attribution framework JStylo, and
>> Latent Dirichlet allocation, which can distinguish a conversation on
>> stolen credit cards from one on exploit-writing, and similarly help
>> identify interesting people.
>>
>> The analysis was applied across millions of posts from tens of thousands
>> of users of a series of multilingual underground websites including
>> thebadhackerz.com, blackhatpalace.com, www.carders.cc, free-hack.com,
>> hackel1te.info, hack-sector.forumh.net, rootwarez.org, L33tcrew.org and
>> antichat.ru.
>>
>> It found up to 300 distinct discussion topics in the forums, with some
>> of the most popular being carding, encryption services, password
>> cracking and blackhat search engine optimisation tools.
>>
>> While successful, the work faces a series of challenges. Analysis could
>> only be performed using a minimum of 5000 words (this research used the
>> "gold standard" of 6500 words), which culled the list of potential
>> targets from tens of thousands to mere hundreds.
>>
>> It also needs to separate discussion on product information like credit
>> cards, exploits and drugs from conversational text in order to facilitate
>> machine learning to automate the process, according to researcher Aylin
>> Caliskan Islam.
>>
>> And posts must be translated to English, a process which boosted author
>> identification from 66 to around 80 per cent but was imperfect using
>> freely available tools like Google and Bing.
>>
>> However, both of these tasks were performed successfully, and further
>> development including the use of "exclusive" language translation tools
>> would only serve to boost the identification accuracy.
>>
>> Leetspeak, an alternative alphabet popular in some forum circles, cannot
>> be translated.
>>
>> The project is ongoing and future work promises to increase the capacity
>> to unmask users. This, Islam said, would include temporal information
>> which would exploit users who logged into forums from the same IP
>> addresses and wrote posts at around the same time.
>>
>> [Image caption: Antichat user analysis]
>>
>> "They might finish work, come home and log in," Islam said.
>>
>> It could also tie user identities to the topics they write about and
>> produce a map of their interactions, identify multiple accounts held by
>> a single author, and combine forum messages with internet relay chat
>> (IRC) data sets.
>>
>> "We want to automate the whole process."
>>
>> Afroz said while the work appeals to law enforcement and government
>> agencies, it is not designed to catch users out.
>>
>> "We aren't trying to identify users, we are trying to show them that
>> this is possible," she said.
>>
>> To this end, the researchers released tools last year, updated last
>> December, which help users to anonymise their writing.
>>
>> One tool, Anonymouth, takes a 500 word sample of a user's writing to
>> identify unique features such as function words which could make them
>> identifiable.
>>
>> The other, JStylo, is the machine learning engine which powers Anonymouth.
>>
>> The Drexel and George Mason universities research team is composed of
>> Sadia Afroz, Aylin Caliskan Islam, Ariel Stolerman, Rachel Greenstadt,
>> and Damon McCoy.
--
Gregory Foster || gfoster at entersection.org
@gregoryfoster <> http://entersection.com/