[liberationtech] Complete GFW Rulebook for Wikipedia

Collin Anderson collin at averysmallbird.com
Tue Oct 1 12:45:21 PDT 2013


Congratulations, this is impressive work. I am also completely jealous -- a
colleague and I will be releasing a similar report for Iran in the
next two weeks. It is intended as part of a broader global project on Wikipedia
censorship ({{Citation Filtered}}) that I hope will merge well with
what you are doing.


On Mon, Sep 30, 2013 at 7:26 PM, 夏楚 <summer.agony at gmail.com> wrote:

> To all,
>
> I just finished writing up my research on GFW (Great Firewall of China)
> blacklist for Wikipedia. Some of you might find it interesting.
>
> The paper can be found at goo.gl/RnMvG1 (tweeted here<https://twitter.com/SummerAgony/status/384820318402920448>).
> Here I paste excerpts from the Abstract and Conclusions below.
>
> *Abstract*
>
> In this report, we detail the *complete* and *exact* rulebook that the
> Great Firewall of China (GFW) applies to Wikipedia. We call it a "rulebook"
> (instead of the common term "blacklist") because we not only identify the
> blacklisted terms, but also the exact string matching rules deployed by
> GFW. An efficient probing methodology makes this possible.
>
> ...
> Wikipedia contains millions of pages, e.g. more than 700,000 articles for
> the Chinese version, and more than 4,240,000 articles for the English
> version. It seems a daunting and infeasible task to test these pages
> exhaustively, hence there has been no well-known attempt to gather the
> complete blacklist.
>
> While a small sample of the blacklist is useful, the complete picture
> can be much more powerful in revealing the inner workings of GFW and
> its operators. In this study, we devised a methodology that efficiently
> examines the entire Wikipedia corpus, exposing to the world the
> complete GFW rulebook for Wikipedia for the first time. In total, there are 919
> rules (excluding URL terms) which are applicable to Wikipedia, affecting
> 5336 pages in Chinese Wikipedia and 67 English Wikipedia pages.
>
> The revealed rulebook also demonstrates that the GFW operation is
> haphazard and ill-maintained. At the same time, the Chinese
> censorship bureaucracy *intends* to be thorough and extensive.
>
> To be precise, the findings in this report are on two Wikipedia
> snapshots: 2013-09-08 for the Chinese version and 2013-09-04 for the
> English version.
>
> *Concluding Remarks*
>
> In this study, we examined the entire Wikipedia corpus (Chinese version
> and English version) and revealed the complete and exact GFW rulebook for
> Wikipedia (with caveats described in Section 6).
>
> A sample of notable findings:
>
>    - There are 78 terms for which GFW blocks a non-standard variant but
>    not the canonical path. These are cases where the censors intend to block
>    a page but the block does not actually take effect, suggesting the censors
>    have a poor understanding of Wikipedia's serving system.
>    - Many obscure non-article pages are blocked, which raises suspicion
>    that these pages were provided to the censorship bureaucrats by Wikipedia
>    editors who are very familiar with the content (e.g. those who participated
>    in the edit wars and/or discussions regarding self-censorship proposals).
>    - GFW string matching rules have a hard size limit of 64 bytes.
>
> The biggest lesson from this study, in my opinion, is that GFW operation
> is haphazard and ill-maintained. Also, there are many indications that the
> GFW operators are somewhat disconnected from the censorship bureaucrats.
>
> We hope these findings will be of interest to internet censorship watchers,
> Wikipedia researchers, China observers, and ordinary Chinese citizens.
>
>
> --
> Xia Chu (Twitter: @summer.agony)
>



-- 
*Collin David Anderson*
averysmallbird.com | @cda | Washington, D.C.