[liberationtech] Version 3.0 Complete GFW Rulebook for Wikipedia plus Comprehensive List for Websites, IPs, IMDB and AppStore (shortcut: goo.gl/zKslcu)

Wed Dec 25 23:09:03 PST 2013

To all,

Happy Holidays!

I just published Version 3.0 of my GFW research.

First of all, I created a "master spreadsheet" for all the findings and
updates at http://goo.gl/zKslcu. It contains links to the papers and
various lists. Also tweeted here:
<https://twitter.com/SummerAgony/status/416102422826602496>

There are several major additions in this version (V3.0 is located at
http://goo.gl/u971J7):

1. I created a pipeline that monitors GFW's rule updates for
Wikipedia. (For updates, one can subscribe to the mailing list
summeragony+subscribe at googlegroups.com).

2. I applied the methodology to four more areas:
  A. I examined more than 1 million website names (obtained from Alexa and
several online lists: greatfire, autoproxy). I identified 3644 GFW
filtering rules targeting website names. This list is significantly more
comprehensive and more precise than any previously published one.
  B. I applied the methodology to IMDB, examined 4M titles and identified 6
GFW rules.
  C. I examined a big repository of AppStore apps (648,567 items) and
identified 26 GFW rules.
  D. I checked 786,432 IP strings and identified 130 GFW rules.

3. 9 new rules (deployed after 2013-10-01) against Wikipedia were
discovered.

For readers who have seen V2.0 of the paper, the new sections are Section 9
(websites), Section 10 (IP strings), Section 11 (IMDB), Section 12
(AppStore) and Appendix C (list of the 3644 websites).

Again, this research is a solo project in my spare time, and people's
feedback is greatly appreciated. In particular, if you know of a large corpus
that GFW may filter, I'd love that input. For example, I only examined 1M
website names and ~60% of AppStore apps here. If you have a bigger
collection of website names or a way to get the full AppStore
list, I'd love to take a look.

Last but not least, as I mentioned in the paper, this study was
originally motivated by Dr Xu Zhiyong (wiki page
<https://en.wikipedia.org/wiki/Xu_Zhiyong>,
news search <https://www.google.com/search?q=xu+zhiyong&tbm=nws>),
whose Chinese Wikipedia page
<http://zh.wikipedia.org/wiki/%E8%AE%B8%E5%BF%97%E6%B0%B8> is
(surprisingly) accessible in China (it turned out that GFW blocked a
non-standard variant of the page). Dr Xu is currently facing trial in
Beijing and may be sentenced to several years in prison for his peaceful
efforts to make China a place with a little bit more freedom, righteousness
and love. China's New Citizens' Movement
<https://en.wikipedia.org/wiki/New_Citizens%27_Movement_%28China%29>
needs more support from the world!!

Best,

Xia Chu

On Fri, Oct 18, 2013 at 6:20 PM, 夏楚 <summer.agony at gmail.com> wrote:

> To all,
>
> I just wrote up my new study of GFW and it is available at
> http://goo.gl/KfBCgT
>
> In this new version, I thoroughly studied GFW's HTTP response filtering
> scheme, which has not been well studied in the past. The bulk of the new
> results is in Section 5 (pp 8-12). The following are some excerpts regarding
> the new findings.
>
>
> *Abstract*
>
> In Version 2.0, we studied GFW's filtering rules for HTTP responses
> extensively and identified a comprehensive list (including those affecting
> Wikipedia and beyond). This list is small (19 items) but its rules affect many
> more pages on Wikipedia and other websites.
>
> *Section 5.3 Learnings and Mysteries of GFW's HTTP Response Filtering*
>
>
>    - GFW's HTTP request filtering and response filtering are two separate
>    systems. First, their filtering rules are entirely different. Second,
>    GFW's HTTP request filtering is homogeneous and has a near-perfect trigger
>    rate, but GFW's HTTP response filtering varies hugely, not only in the
>    triggering rates, but also in the filtering rules in effect. For example,
>    CERNET (Chinese Education and Research Network) seems to have all the rules
>    in place, but some other ISPs only have a subset.
>
>
>    - One remarkable finding is that GFW does not just look at individual
>    TCP packets; instead, it "remembers" the entire TCP session to look
>    for offenders. This becomes evident when the filtering rule is
>    "$term_A & $term_B" and the two terms show up far apart (hundreds of
>    thousands of bytes from each other) on a webpage: GFW is still able to
>    reset the connection. Achieving this requires significant investment in
>    infrastructure, and it is probably also the reason why the rulebook is so
>    much smaller for HTTP response filtering than HTTP request filtering.
>
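To make the session-level behaviour described above concrete, here is a minimal Python sketch of a matcher that keeps state across packet payloads for a two-term AND rule. It only illustrates the idea of remembering terms across a TCP session; the class name SessionMatcher, its fields, and the sample payloads are hypothetical and are not taken from the paper or from GFW itself.

```python
from dataclasses import dataclass, field


@dataclass
class SessionMatcher:
    """Hypothetical per-session state for an AND rule such as "$term_A & $term_B"."""
    terms: tuple                      # byte strings that must all appear
    seen: set = field(default_factory=set)
    tail: bytes = b""                 # end of the previous payload

    def feed(self, payload: bytes) -> bool:
        """Consume one packet payload; return True once every term has appeared."""
        # Prepend the previous tail so a term split across two packets still matches.
        window = self.tail + payload
        for term in self.terms:
            if term in window:
                self.seen.add(term)
        # Keep just enough bytes to complete the longest term on the next call.
        keep = max(len(t) for t in self.terms) - 1
        self.tail = window[-keep:] if keep > 0 else b""
        return len(self.seen) == len(self.terms)


def first_trigger(terms, payloads):
    """Return the index of the payload at which the rule fires, or None."""
    matcher = SessionMatcher(terms=tuple(terms))
    for i, payload in enumerate(payloads):
        if matcher.feed(payload):
            return i
    return None
```

A stateless per-packet filter would never fire on payloads where the two terms arrive in different packets, whereas this sketch fires as soon as the second term is seen, no matter how many bytes separate them. The cost is that state must be held for every open session, which is consistent with the remark about infrastructure investment.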
>
> Best,
>
> On Mon, Sep 30, 2013 at 4:26 PM, 夏楚 <summer.agony at gmail.com> wrote:
>
>> To all,
>>
>> I just finished writing up my research on GFW (Great Firewall of China)
>> blacklist for Wikipedia. Some of you might find it interesting.
>>
>> The paper can be found at goo.gl/RnMvG1 (tweeted here <https://twitter.com/SummerAgony/status/384820318402920448>).
>> Here I paste excerpts from the Abstract and Conclusions below.
>>
>> *Abstract*
>>
>> In this report, we detail the *complete* and *exact* rulebook that the
>> Great Firewall of China (GFW) applies to Wikipedia. We call it a "rulebook"
>> (instead of the common term "blacklist") because we identify not only the
>> blacklisted terms, but also the exact string matching rules deployed by
>> GFW. An efficient probing methodology makes this possible.
>>
>> ...
>> Wikipedia contains millions of pages, e.g. more than 700,000 articles for
>> the Chinese version, and more than 4,240,000 articles for the English
>> version. It seems a daunting and infeasible task to test these pages
>> exhaustively, hence there has been no well-known attempt to gather the
>> complete blacklist.
>>
>> While a small sample of the blacklist is useful, the complete picture
>> can be much more powerful in revealing the inner workings of GFW and
>> its operators. In this study, we devised a methodology which efficiently
>> examines the entire Wikipedia corpus, hence exposing to the world the
>> complete GFW rulebook for Wikipedia for the first time. In total, there are 919
>> rules (excluding URL terms) which are applicable to Wikipedia, affecting
>> 5336 pages in Chinese Wikipedia and 67 English Wikipedia pages.
>>
>> The revealed rulebook also demonstrates that the GFW operation is
>> haphazard and ill-maintained. At the same time, the Chinese
>> censorship bureaucracy *intends* to be thorough and extensive.
>>
>> To be precise, the findings in this report are on two Wikipedia
>> snapshots: 2013-09-08 for the Chinese version and 2013-09-04 for the
>> English version.
>>
>> *Concluding Remarks*
>>
>> In this study, we examined the entire Wikipedia corpus (Chinese version
>> and English version) and revealed the complete and exact GFW rulebook for
>> Wikipedia (with caveats described in Section 6).
>>
>> Some notable findings are:
>>
>>    - There are 78 terms for which GFW blocks a non-standard variant but
>>    not the canonical path. These are cases where the censors intended to
>>    block a page but the block does not take effect, suggesting the censors
>>    have a poor understanding of Wikipedia's serving system.
>>    - Many obscure non-article pages are blocked, which raises suspicion
>>    that these pages were provided to the censorship bureaucrats by Wikipedia
>>    editors who are very familiar with the content (e.g. those who participated
>>    in the edit wars and/or discussions regarding self-censorship proposals).
>>    - GFW string matching rules have a hard size limit of 64 bytes.
>>
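As an illustration of the first finding in the list above, the following Python sketch enumerates a few URLs under which one Chinese Wikipedia title can be reached. The function name candidate_urls and the particular variant prefixes chosen are assumptions for illustration only; the excerpt does not say which non-standard variants GFW actually blocks.

```python
from urllib.parse import quote


def candidate_urls(title, host="zh.wikipedia.org"):
    """Return the canonical URL first, followed by some non-canonical variants."""
    # Wikipedia paths use underscores for spaces and percent-encoded UTF-8.
    encoded = quote(title.replace(" ", "_"), safe="")
    paths = ["/wiki/" + encoded]                      # canonical path
    # Script/region variant prefixes (an illustrative subset).
    paths += ["/%s/%s" % (v, encoded) for v in ("zh-cn", "zh-tw", "zh-hk")]
    # The index.php form of the same page.
    paths.append("/w/index.php?title=" + encoded)
    return ["http://" + host + p for p in paths]
```

A rule that matches only one of the later entries leaves the first, canonical URL reachable, which is the kind of mismatch the finding describes.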
>> The biggest lesson from this study, in my opinion, is that GFW's operation
>> is haphazard and ill-maintained. Also, there are many indications that the
>> GFW operators are somewhat disconnected from the censorship bureaucrats.
>>
>> We hope these findings will be of interest to internet censorship watchers,
>> Wikipedia researchers, China observers, and ordinary Chinese citizens.
>>
>>
>> --
>> Xia Chu (Twitter: @summer.agony)
>>
>
>
> --
> Xia Chu (Twitter: @summer.agony)
>

-- 
--
Xia Chu (Twitter: @summer.agony; Google+: gplus.to/summer.agony)