[liberationtech] Version 2.0 Complete GFW Rulebook for Wikipedia
夏楚
summer.agony at gmail.com
Fri Oct 18 18:20:52 PDT 2013
To all,
I just wrote up my new study of GFW and it is available at
http://goo.gl/KfBCgT
In this new version, I thoroughly studied GFW's HTTP response filtering
scheme, which has not been well studied in the past. The bulk of the new
result is in Section 5 (pp. 8-12). Below are some excerpts covering the
new findings.
*Abstract*
In Version 2.0, we studied GFW's filtering rules for HTTP responses
extensively and identified a comprehensive list (including rules that
affect Wikipedia and beyond). The list is small (19 items), but those 19
rules affect a much larger number of pages on Wikipedia and on other
websites.
*Section 5.3 Learnings and Mysteries of GFW's HTTP Response Filtering*
- GFW's HTTP request filtering and response filtering are two separate
systems. First, their filtering rules are entirely different. Second,
GFW's HTTP request filtering is homogeneous and has a near-perfect
trigger rate, whereas its HTTP response filtering varies hugely, not
only in trigger rates but also in the filtering rules in effect. For
example, CERNET (the China Education and Research Network) seems to have
all the rules in place, while some other ISPs have only a subset.
- One remarkable finding is that GFW does not just inspect individual
TCP packets; instead, it "remembers" the entire TCP session while
looking for offending content. This becomes evident when the filtering
rule is a conjunction such as "$term_A & $term_B": even when the two
terms appear far apart on a webpage (hundreds of thousands of bytes from
each other), GFW can still reset the connection (sketched below).
Achieving this requires a significant investment in infrastructure,
which is probably also why the rulebook for HTTP response filtering is
so much smaller than the one for HTTP request filtering.
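
To make this concrete, below is a minimal Python sketch of the kind of
session-level conjunctive matching described above. The SessionMatcher
class and the term names are illustrative assumptions, not GFW's actual
implementation.

    class SessionMatcher:
        """Track which terms of a conjunctive rule ("$term_A & $term_B")
        have appeared so far in one TCP session, across packet boundaries."""

        def __init__(self, terms):
            self.terms = [t.encode() for t in terms]
            self.seen = set()
            # Keep a tail of previous data so a term split across two
            # packets is still found.
            self.keep = max(len(t) for t in self.terms) - 1
            self.tail = b""

        def feed(self, payload):
            """Feed one packet's payload; return True once every term of
            the rule has been seen somewhere in the session."""
            window = self.tail + payload
            for t in self.terms:
                if t not in self.seen and t in window:
                    self.seen.add(t)
            self.tail = window[-self.keep:] if self.keep else b""
            return len(self.seen) == len(self.terms)

    # The two terms arrive ~300 KB apart, in different 1460-byte segments;
    # the rule still fires because state survives across packets.
    stream = b"term_A" + b"x" * 300000 + b"term_B"
    matcher = SessionMatcher(["term_A", "term_B"])
    fired = False
    for i in range(0, len(stream), 1460):
        if matcher.feed(stream[i:i + 1460]):
            fired = True
    print("rule fired:", fired)   # -> rule fired: True

Note the per-session memory this implies: the matcher must hold state
for every live TCP session, which is why this filtering mode demands the
infrastructure investment mentioned above.
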
Best,
On Mon, Sep 30, 2013 at 4:26 PM, 夏楚 <summer.agony at gmail.com> wrote:
> To all,
>
> I just finished writing up my research on the GFW (Great Firewall of
> China) blacklist for Wikipedia. Some of you might find it interesting.
>
> The paper can be found at goo.gl/RnMvG1 (tweeted here:
> <https://twitter.com/SummerAgony/status/384820318402920448>).
> Excerpts from the Abstract and Conclusions are pasted below.
>
> *Abstract*
>
> In this report, we detail the *complete* and *exact* rulebook that the
> Great Firewall of China (GFW) enforces on Wikipedia. We call it a
> "rulebook" (instead of the common term "blacklist") because we identify
> not only the blacklisted terms but also the exact string-matching rules
> deployed by GFW. An efficient probing methodology makes this possible.
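>
> The excerpt does not publish the probing tooling, so the following is
> only a hedged sketch of the general idea: issue an HTTP request that
> carries a candidate term (assumed already percent-encoded) toward a
> cooperating host outside the firewall and treat an injected TCP reset
> as the censorship signal. The host name and the probe_term() helper are
> placeholders, not the paper's code.
>
>     import socket
>
>     def probe_term(term, host="example.org", port=80, timeout=5.0):
>         """Return True if the connection appears to be reset by an
>         on-path filter, False if a normal reply arrives."""
>         request = ("GET /" + term + " HTTP/1.1\r\n"
>                    "Host: " + host + "\r\nConnection: close\r\n\r\n")
>         try:
>             with socket.create_connection((host, port), timeout=timeout) as s:
>                 s.sendall(request.encode())
>                 s.recv(4096)      # an injected RST typically lands here
>             return False
>         except ConnectionResetError:
>             return True
>         except socket.timeout:
>             return False          # inconclusive; treat as not blocked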
>
> ...
> Wikipedia contains millions of pages: more than 700,000 articles in the
> Chinese version and more than 4,240,000 in the English version. Testing
> these pages exhaustively seems a daunting, infeasible task, hence there
> has been no well-known attempt to gather the complete blacklist.
>
> While a small sample of the blacklist is useful, the complete picture
> is far more powerful in revealing the inner workings of GFW and its
> operators. In this study, we devised a methodology that efficiently
> examines the entire Wikipedia corpus, exposing the complete GFW
> rulebook for Wikipedia to the world for the first time. In total, there
> are 919 rules (excluding URL terms) applicable to Wikipedia, affecting
> 5,336 pages in the Chinese Wikipedia and 67 pages in the English
> Wikipedia.
>
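> The abstract above does not spell out the methodology, so here is only
> a sketch of one standard way to make exhaustive testing tractable:
> group testing, where many candidate titles are packed into a single
> probe and any positive group is bisected. It reuses the hypothetical
> probe_term() from the earlier sketch; the paper's actual method may
> well differ.
>
>     def find_blocked(candidates, probe):
>         """Bisect a candidate list, probing whole groups at once,
>         until the individually blocked terms are isolated."""
>         if not candidates:
>             return []
>         if not probe("/".join(candidates)):   # whole group clean: prune
>             return []
>         if len(candidates) == 1:
>             return candidates                 # one blocked term found
>         mid = len(candidates) // 2
>         return (find_blocked(candidates[:mid], probe) +
>                 find_blocked(candidates[mid:], probe))
>
> With k blocked terms among n candidates this costs on the order of
> k*log(n) probes rather than n, the kind of saving needed to scan
> millions of page titles.
>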
> The revealed rulebook also demonstrates that the GFW operation is
> haphazard and ill-maintained. At the same time, the Chinese censorship
> bureaucracy *intends* to be thorough and extensive.
>
> To be precise, the findings in this report are based on two Wikipedia
> snapshots: 2013-09-08 for the Chinese version and 2013-09-04 for the
> English version.
>
> *Concluding Remarks*
>
> In this study, we examined the entire Wikipedia corpus (Chinese version
> and English version) and revealed the complete and exact GFW rulebook for
> Wikipedia (with caveats described in Section 6).
>
> A sample of notable findings:
>
> - There are 78 terms for which GFW blocks a non-standard variant but
> not the canonical path. These are cases the censors intended to block
> but where the block never actually takes effect, suggesting the censors
> have a poor understanding of Wikipedia's serving system.
> - Many obscure non-article pages are blocked, which raises the
> suspicion that these pages were supplied to the censorship bureaucrats
> by Wikipedia editors who are very familiar with the content (e.g. those
> who participated in the edit wars and/or discussions regarding
> self-censorship proposals).
> - GFW string-matching rules have a hard 64-byte size limit (see the
> measurement sketch below).
>
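> A hedged sketch of how such a limit could be measured (not necessarily
> the paper's procedure): binary-search the shortest prefix of a
> known-blocked string that still triggers filtering, reusing the
> hypothetical probe_term() from the first sketch. Per the finding above,
> the returned length would never exceed 64.
>
>     def effective_rule_length(blocked, probe):
>         """Length of the shortest prefix of `blocked` that still
>         triggers the filter, i.e. an upper bound on the rule's size."""
>         lo, hi = 1, len(blocked)
>         while lo < hi:
>             mid = (lo + hi) // 2
>             if probe(blocked[:mid]):
>                 hi = mid          # prefix still triggers; try shorter
>             else:
>                 lo = mid + 1      # too short; rule needs more bytes
>         return lo
>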
> The biggest lesson from this study, in my opinion, is that the GFW
> operation is haphazard and ill-maintained. There are also many
> indications that the GFW operators are somewhat disconnected from the
> censorship bureaucrats.
>
> We hope these revelations will be of interest to internet censorship
> watchers, Wikipedia researchers, China observers, and ordinary Chinese
> citizens.
>
>
> --
> Xia Chu (Twitter: @summer.agony)
>
--
Xia Chu (Twitter: @summer.agony)