[liberationtech] Complete GFW Rulebook for Wikipedia

夏楚 summer.agony at gmail.com
Mon Sep 30 16:26:01 PDT 2013


To all,

I just finished writing up my research on GFW (Great Firewall of China)
blacklist for Wikipedia. Some of you might find it interesting.

The paper can be found at goo.gl/RnMvG1 (tweeted
here <https://twitter.com/SummerAgony/status/384820318402920448>).
Excerpts from the Abstract and Conclusions are pasted below.

*Abstract*

In this report, we detail the *complete* and *exact* rulebook that the
Great Firewall of China (GFW) exerts on Wikipedia. We call it "rulebook"
(instead of the common term "blacklist") because we not only identify the
blacklisted terms, but also the exact string matching rules deployed by
GFW. An efficient probing methodology makes this possible.

...
Wikipedia contains millions of pages, e.g. more than 700,000 articles for
the Chinese version, and more than 4,240,000 articles for the English
version. It seems a daunting and infeasible task to test these pages
exhaustively, hence there has been no well-known attempt to gather the
complete blacklist.

While a small sample of the blacklist is useful, the complete picture
can be much more powerful in revealing the inner workings of GFW and
its operators. In this study, we devised a methodology which efficiently
examines the entire Wikipedia corpus, hence exposing to the world the
complete GFW rulebook for Wikipedia for the first time. In total, there are 919
rules (excluding URL terms) which are applicable to Wikipedia, affecting
5336 pages in Chinese Wikipedia and 67 English Wikipedia pages.

The revealed rulebook also demonstrates that the GFW operation is
haphazard and ill-maintained. At the same time, the Chinese
censorship bureaucracy *intends* to be thorough and extensive.

To be precise, the findings in this report are on two Wikipedia
snapshots: 2013-09-08 for the Chinese version and 2013-09-04 for the
English version.

*Concluding Remarks*

In this study, we examined the entire Wikipedia corpus (Chinese version
and English version) and revealed the complete and exact GFW rulebook for
Wikipedia (with caveats described in Section 6).

A sample of notable findings:

   - There are 78 terms for which GFW blocks a non-standard variant but not
   the canonical path. These are cases where the censors intend to block a
   page but the block does not actually take effect, suggesting the censors
   have a poor understanding of Wikipedia's serving system.
   - Many obscure non-article pages are blocked, which raises suspicion
   that these pages were provided to the censorship bureaucrats by Wikipedia
   editors who are very familiar with the content (e.g. those who participated
   in the edit wars and/or discussions regarding self-censorship proposals).
   - GFW string matching rules have a hard size limit of 64 bytes.
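
To make the first and third findings concrete, here is a minimal Python
sketch. It is not taken from the paper. It assumes that a rule is a plain
substring test against the percent-encoded request path, and that the
"non-standard variant" is a language-variant prefix such as /zh-cn/. The
title, the rule, and the helper names are all hypothetical.

```python
from urllib.parse import quote

RULE_LIMIT_BYTES = 64  # size limit reported in the findings above


def encoded_path(prefix: str, title: str) -> str:
    """Build a request path with the title percent-encoded as UTF-8."""
    return prefix + quote(title, safe="")


def is_blocked(path: str, rules: list[str]) -> bool:
    """Assumed matching model: a path is blocked if any rule is a substring."""
    return any(rule in path for rule in rules)


title = "示例"  # placeholder title, not a term from the rulebook
canonical = encoded_path("/wiki/", title)
variant = encoded_path("/zh-cn/", title)

# Hypothetical rule written against the variant path only.
rules = [variant]

print(is_blocked(variant, rules))    # the variant path matches the rule
print(is_blocked(canonical, rules))  # the canonical path does not

# Under the same assumption, one CJK character costs 9 bytes once
# percent-encoded, so the number of whole characters a rule can hold is:
bytes_per_char = len(quote("示", safe=""))
print(RULE_LIMIT_BYTES // bytes_per_char)
```

Under these assumptions, a rule that includes the path prefix only covers
that one URL form, which is one way a block could be intended yet never
take effect on the canonical page.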

The biggest lesson from this study, in my opinion, is that GFW operation
is haphazard and ill-maintained. Also, there are many indications that the
GFW operators are somewhat disconnected from the censorship bureaucrats.

We hope these findings will be of interest to internet censorship watchers,
Wikipedia researchers, China observers, and ordinary Chinese citizens.


--
Xia Chu (Twitter: @summer.agony)