[liberationtech] Version 3.1 Complete GFW Rulebook for Wikipedia plus Comprehensive List for Websites, IPs, IMDB and AppStore (shortcut: goo.gl/aVgvO6)

Fri Feb 7 17:19:36 PST 2014

To all,

I just published Version 3.1 of my GFW research at http://goo.gl/aVgvO6.
And as always, all relevant links and the full lists can be found at
goo.gl/zKslcu (tweet link:
<https://twitter.com/SummerAgony/status/431959746338234368>).

This is a relatively small update, with two main changes:

- A few more Wikipedia terms were found to be blocked, notably the article
for 新公民运动 (New Citizens' Movement). The leaders of this social movement,
including Dr Xu Zhiyong, went on trial in Beijing two weeks ago.

- Somebody who prefers to remain anonymous provided me with a list of over 5
million website names. I applied my methodology to this larger set and
identified 650 new GFW website rules. The total number of URL terms now
stands at 4296. The new set also contains the longest website rule found so
far, a whopping 43-character string:
".dalai-lama-dharma-dharamsala-miniguide.com"!

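As a quick sanity check on that length, and on the 64-byte hard limit for
rule strings reported in the first report quoted further down, here is a
minimal Python sketch. The helper name `rule_size_bytes` and the assumption
that a rule's size is its UTF-8 byte length are illustrative only; the paper
defines the exact semantics.

```python
# Hard limit on rule strings, as reported in the first report (quoted below).
GFW_RULE_LIMIT_BYTES = 64


def rule_size_bytes(rule: str) -> int:
    """Hypothetical helper: size of a rule string, assuming UTF-8 encoding."""
    return len(rule.encode("utf-8"))


longest_rule = ".dalai-lama-dharma-dharamsala-miniguide.com"
size = rule_size_bytes(longest_rule)

# The rule is 43 bytes long, well within the reported 64-byte limit.
print(size, size <= GFW_RULE_LIMIT_BYTES)  # prints: 43 True
```
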
I tend to believe the person who offered me the big website list is from
the libtech mailing list. I really appreciate such help. Thank you!!!

Best,

On Wed, Dec 25, 2013 at 11:09 PM, 夏楚 <summer.agony at gmail.com> wrote:

> To all,
>
> Happy Holidays!
>
> I just published Version 3.0 of my GFW research.
>
> First of all, I created a "master spreadsheet" for all the findings and
> updates at http://goo.gl/zKslcu. It contains links to the papers and
> various lists. Also tweeted here:
> <https://twitter.com/SummerAgony/status/416102422826602496>.
>
> There are several major additions in this version (V3.0 is located at
> http://goo.gl/u971J7):
>
> 1. I created a monitoring pipeline that tracks GFW's updates on
> Wikipedia. (For updates, one can subscribe to the mailing list
> summeragony+subscribe at googlegroups.com).
>
> 2. I applied the methodology to four more areas:
>   A. I examined more than 1 million website names (obtained from Alexa and
> several online lists: greatfire, autoproxy) and identified 3644 GFW
> filtering rules targeting website names. This list is significantly more
> comprehensive and more precise than any previous list.
>   B. I applied the methodology to IMDB, examined 4M titles and identified
> 6 GFW rules.
>   C. I examined a big repository of AppStore apps (648,567 items) and
> identified 26 GFW rules.
>   D. I checked 786,432 IP strings and identified 130 GFW rules.
>
> 3. I discovered 9 new rules against Wikipedia (deployed after
> 2013-10-01).
>
> For readers who have seen V2.0 of the paper, the new sections are Section
> 9 (websites), Section 10 (IP strings), Section 11 (IMDB), Section 12
> (AppStore) and Appendix C (list of the 3644 websites).
>
> Again, this research is a solo project done in my spare time, and feedback
> is greatly appreciated. In particular, if you know of a large corpus
> that GFW may filter, I'd love that input. For example, I only examined 1M
> website names and ~60% of AppStore apps here. If you have a bigger
> collection of website names, or a way to get the full AppStore
> list, I'd love to take a look.
>
> Last but not least, as I mentioned in the paper, this study was
> originally motivated by Dr Xu Zhiyong (wiki page
> <https://en.wikipedia.org/wiki/Xu_Zhiyong>, news search
> <https://www.google.com/search?q=xu+zhiyong&tbm=nws>), whose Chinese
> Wikipedia page <http://zh.wikipedia.org/wiki/%E8%AE%B8%E5%BF%97%E6%B0%B8>
> is (surprisingly) accessible in China (it turned out that GFW blocked a
> non-standard variant of the page). Dr Xu is currently facing trial in
> Beijing and may be sentenced to several years in prison for his peaceful
> efforts to make China a place with a little bit more freedom, righteousness
> and love. China's New Citizens' Movement
> <https://en.wikipedia.org/wiki/New_Citizens%27_Movement_%28China%29>
> needs more support from the world!!
>
> Best,
>
> Xia Chu
>
>
> On Fri, Oct 18, 2013 at 6:20 PM, 夏楚 <summer.agony at gmail.com> wrote:
>
>> To all,
>>
>> I just wrote up my new study of GFW and it is available at
>> http://goo.gl/KfBCgT
>>
>> In this new version, I thoroughly studied GFW's HTTP response filtering
>> scheme, which has not been well studied in the past. The bulk of the new
>> results is in Section 5 (pp. 8-12). Below are some excerpts regarding
>> the new findings.
>>
>>
>> *Abstract*
>>
>> In Version 2.0, we studied GFW's filtering rules for HTTP responses
>> extensively and identified a comprehensive list (including those affecting
>> Wikipedia and beyond). This list is small (19 items), but it affects many
>> more pages on Wikipedia and other websites.
>>
>> *Section 5.3 Learnings and Mysteries of GFW's HTTP Response Filtering*
>>
>>
>>    - GFW's HTTP request filtering and response filtering are two
>>    separate systems. First, their filtering rules are entirely different.
>>    Second, GFW's HTTP request filtering is homogeneous and has a near-perfect
>>    trigger rate, but GFW's HTTP response filtering varies hugely, not only in
>>    the trigger rates, but also in the filtering rules in effect. For
>>    example, CERNET (China Education and Research Network) seems to have all
>>    the rules in place, but some other ISPs only have a subset.
>>
>>
>>    - One remarkable finding is that GFW does not just look at individual
>>    TCP packets; instead, it "remembers" the entire TCP session to look
>>    for offenders. This becomes evident when the filtering rule is "$term_A
>>    & $term_B" and the two terms show up far apart (hundreds of thousands
>>    of bytes from each other) on a webpage: GFW is still able to reset the
>>    connection. Achieving this requires significant investment in
>>    infrastructure, and it is probably also the reason why the rulebook is so
>>    much smaller for HTTP response filtering than for HTTP request filtering.
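
To illustrate the difference between per-packet and per-session matching
described in the bullet above, here is a minimal Python sketch. It is not
GFW's implementation, which is not public. The names `SessionMatcher`,
`feed` and `per_packet_match`, and the sample terms and payloads, are
hypothetical. The sketch assumes a non-empty list of non-empty byte-string
terms, and keeps one flag per term plus a short tail buffer so that a term
split across two packets is still found.

```python
class SessionMatcher:
    """Fires once every term of an AND rule has appeared in the session."""

    def __init__(self, terms):
        self.terms = list(terms)
        self.seen = {t: False for t in self.terms}
        # Bytes to carry over so a term straddling two payloads is caught.
        self.keep = max(len(t) for t in self.terms) - 1
        self.tail = b""

    def feed(self, payload: bytes) -> bool:
        """Consume the next in-order payload; return True when the rule fires."""
        window = self.tail + payload
        for t in self.terms:
            if not self.seen[t] and t in window:
                self.seen[t] = True
        self.tail = window[-self.keep:] if self.keep > 0 else b""
        return all(self.seen.values())


def per_packet_match(terms, payload: bytes) -> bool:
    """Stateless alternative: all terms must sit in one single payload."""
    return all(t in payload for t in terms)


terms = [b"term_A", b"term_B"]
# The two terms are ~200 KB apart, and term_B is split across two packets.
packets = [b"...term_A...", b"x" * 200_000, b"...ter", b"m_B..."]

matcher = SessionMatcher(terms)
session_hits = [matcher.feed(p) for p in packets]
packet_hits = [per_packet_match(terms, p) for p in packets]

print(session_hits)  # [False, False, False, True]
print(packet_hits)   # [False, False, False, False]
```

The memory cost per session is small in this toy version, but it has to be
paid for every open connection, which fits the remark above about the
infrastructure investment that session-level filtering requires.
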
>>
>>
>> Best,
>>
>> On Mon, Sep 30, 2013 at 4:26 PM, 夏楚 <summer.agony at gmail.com> wrote:
>>
>>> To all,
>>>
>>> I just finished writing up my research on GFW (Great Firewall of China)
>>> blacklist for Wikipedia. Some of you might find it interesting.
>>>
>>> The paper can be found at goo.gl/RnMvG1 (tweeted here:
>>> <https://twitter.com/SummerAgony/status/384820318402920448>).
>>> Here I paste excerpts from the Abstract and Conclusions below.
>>>
>>> *Abstract*
>>>
>>> In this report, we detail the *complete* and *exact* rulebook that the
>>> Great Firewall of China (GFW) exerts on Wikipedia. We call it a "rulebook"
>>> (instead of the common term "blacklist") because we not only identify the
>>> blacklisted terms, but also the exact string matching rules deployed by
>>> GFW. An efficient probing methodology makes this possible.
>>>
>>> ...
>>> Wikipedia contains millions of pages, e.g. more than 700,000
>>> articles for the Chinese version, and more than 4,240,000 articles for the
>>> English version. It seems a daunting and unfeasible task to test these
>>> pages exhaustively, hence there has been no well known attempt to gather
>>> the complete blacklist.
>>>
>>> While a small sample of the blacklist is useful, the complete picture
>>> can be much more powerful in revealing the underlying workings of GFW and
>>> its operators. In this study, we devised a methodology which efficiently
>>> examines the entire Wikipedia corpus, hence exposing to the world the
>>> complete GFW rulebook for Wikipedia for the first time. In total, there
>>> are 919 rules (excluding URL terms) which are applicable to Wikipedia,
>>> affecting 5336 pages in Chinese Wikipedia and 67 English Wikipedia pages.
>>>
>>> The revealed rulebook also demonstrates that the GFW operation is
>>> haphazard and ill-maintained. At the same time, Chinese
>>> censorship bureaucracy *intends* to be thorough and extensive.
>>>
>>> To be precise, the findings in this report are on two Wikipedia
>>> snapshots: 2013-09-08 for the Chinese version and 2013-09-04 for the
>>> English version.
>>>
>>> *Conclusion Remarks*
>>>
>>> In this study, we examined the entire Wikipedia corpus (Chinese version
>>> and English version) and revealed the complete and exact GFW rulebook for
>>> Wikipedia (with caveats described in Section 6).
>>>
>>> A sample of notable findings:
>>>
>>>    - There are 78 terms for which GFW blocks a non-standard variant but
>>>    not the canonical path. These are cases where the censors intend to block
>>>    a page but the block does not actually take effect, suggesting the censors
>>>    have a poor understanding of Wikipedia's serving system.
>>>    - Many obscure non-article pages are blocked, which raises the suspicion
>>>    that these pages were provided to the censorship bureaucrats by Wikipedia
>>>    editors who are very familiar with the content (e.g. those who participated
>>>    in the edit wars and/or discussions regarding self-censorship proposals).
>>>    - GFW string matching rules have a hard size limit of 64 bytes.
>>>
>>> The biggest learning out of this study, in my opinion, is that GFW
>>> operation is haphazard and ill-maintained. Also, there are many
>>> indications that the GFW operators are somewhat disconnected from the
>>> censorship bureaucrats.
>>>
>>> We hope these findings can be of interest to internet censorship watchers,
>>> Wikipedia researchers, China observers, and ordinary Chinese citizens.
>>>
>>>
>>> --
>>> Xia Chu (Twitter: @summer.agony)
>>>
>>
>>
>> --
>> Xia Chu (Twitter: @summer.agony)
>>
>
>
>
> --
> --
> Xia Chu (Twitter: @summer.agony; Google+: gplus.to/summer.agony)
>

-- 
--
Xia Chu (Twitter: @summer.agony; Google+: gplus.to/summer.agony)