[liberationtech] Introduction --- Haystack

Sat Sep 11 18:42:56 PDT 2010

I'm Daniel Colascione. I wrote every line of Haystack and I am the
author of the FAQ on the Haystack website. I'm glad to have been
invited to this list.

Before I begin, let me say that I've always profoundly admired and
respected the work of the anti-censorship and cryptographic
communities. My own project would not be possible without the
pioneering work of others, some of whom are reading this list.

I owe a special debt of gratitude to the Tor project, which has
explored and developed solutions for a huge variety of potential
problems, and which showed us that a large-scale privacy system can be
made at all.

In this email, I hope to clear up some of the confusion surrounding
Haystack. It represents a good-faith effort to engage with the
community and undo some of the misgiving that's grown up around the
project.

Below, I describe the present state of Haystack and the thinking that
led to its current design. We have explored some areas more firmly
than others, and there are still some open problems. Nevertheless,
providing this unprecedented level of detail is intended to
demonstrate that we have at least given due consideration to our
design, and that any mistakes we make due to ignorance, nor
recklessness. We'd also like suggestions, of course.

INTRODUCTION
------------

I have recently, and _tentatively_, agreed to resume a more active
role in the Censorship Research Center, the organization I co-founded
June 2009 with Austin Heap, and to continue development of Haystack.
Several months ago, I distanced myself from the organization due to
fundamental disagreements over transparency, press relations, and
other issues. I've agreed to return provided certain changes are made,
including a greater commitment to openness, frankness, and involvement
with the community.

In the interest of transparency, let me make a few unambiguous
statements and hopefully resolve any confusion:

- Haystack is not complete. A large amount of work remains.
  Development was in hiatus until I returned.

- We have fewer than two dozen testers as far as I am aware.

- The "test" version of Haystack is an early functional testbed that
  does not provide the security guarantees we have advertised for the
  final product. It was never intended for distribution to anyone
  except a small cadre of testers. We are aware of its
  vulnerabilities.

- Our testers are aware that the test version is not the finished
  product, and they are aware that it does not provide the security
  properties we guarantee for the finished product.

- Our public statements about Haystack's capabilities apply to our
  design for the finished product, not to the test version.

- The Haystack client will be released as an open-core system under
  the GPLv3 or later.

- We will submit this open-core version to peer review and incorporate
  community suggestions.

- We will publish our threat model and rationale and open it to peer
  review.

- We will publish our cryptographic protocols and open them to peer
  review.

- We will release the system to the general public only after a third
  party has verified that the program operates as designed, and that
  our design provides the security properties we describe.

- We will never censor our users on the basis of content.

NON-GUARANTEES
--------------

- We reserve the right to selectively prioritize traffic, and to block
  certain protocols wholesale --- for network management purposes ---
  though we'd rather not have to resort to games like this.

- We will not publish our client-side obfuscation code, and probably
  will not publish the algorithms. As described below, we do not rely
  on this portion of the program for privacy or anonymity --- just for
  blocking-resistance, and feel that releasing it would do more harm
  than good. It will change frequently in any case.

- We will not block a release over concerns that are outside our
  published threat model.

- We will not publish our server code.

SECURITY GUARANTEES
-------------------

Haystack will:

- provide a secure channel with perfect forward secrecy over which a
  user can send arbitrary traffic.

- resist to automated traffic classification and blocking.

- attempt resist to resist offline detection and attempt to provide
  plausible deniability.

- mitigate eventual network-level blocking.

- not store unencrypted or identifiable state.

- ensure that external connection endpoints (reached after passing
  through our system) do not IP address of the connection originator
  except via unavoidable application-layer leakage.

PHILOSOPHICAL DECISIONS
-----------------------

Haystack is a centralized censorship-circumvention system. We
fundamentally disagree with the notion (spelled out by Dingledine, et.
al. [1]) that only a decentralized system can protect users, and feel
that a centralized system offers compelling advantages in performance,
reliability, and distribution that offset the risk that we would
suddenly decide to take actions contrary to our users' interests, or
that our facilities would be compromised.

I realize that many on this list and elsewhere are deeply suspicious
of centralized systems. I understand their objections. But the
argument is not one that can be resolved with discourse, and the
community must accept our decision in this area. We can work together
to build the best centralized system possible, or we can argue
endlessly over something that will never change.

Ultimately, we ask our users to trust us, and we will give them tools
to ensure that what they trust really comes from us. This level is
trust is greater than what some other tools ask, yes. In return, we
will do our best to protect them from everyone else. If users do not
feel comfortable providing this trust, nothing is compelling them to
use our system.

TECHNICAL DESIGN
----------------

Haystack is an obfuscating HTTP multiplexing circuit proxy designed to
circumvent censorship and provide users basic connectivity even in a
hostile network environment. (However, as discussed below, Haystack is
not a _pure_ circuit proxy.)

Haystack's primary differencing factors are that it mimics normal HTTP
traffic, and that we will use a novel distribution scheme to secure
the network as discussed below. The purpose of traffic obfuscation is
*NOT* to make Haystack absolutely undetectable: that is impossible. We
assume that a human looking at network traces could eventually learn
to differentiate the traffic.

The goal is to raise the false positive rate of any automated blocking
attempt to a level that is unpalatable for firewall administrators.
Similarly, we aim to make automated differentiation of Haystack and
non-Haystack users infeasible. For our purposes, a moderate degree of
obfuscation will suffice.

For emphasis,

 ** PERFECT STEGANOGRAPHY IS NOT A SECURITY GOAL OF HAYSTACK **

 ** HAYSTACK IS DESIGNED TO RESIST AUTOMATED, LARGE-SCALE AND RAPID
    TRAFFIC CLASSIFICATION **.

That said, we will use the strongest practical obfuscation we can
muster, subject to practical concerns like ciphertext expansion and
algorithmic complexity. and welcome suggestions from the community in
this area. We are merely stating that perfect stenography is not a
prerequisite for Haystack to function as designed.

NETWORK DESIGN
--------------

Haystack uses a three-tier network design composed of feeder nodes,
processing nodes, and exit nodes. We realize other members of the
community have different names for these concepts; these are ours.

Feeder nodes are numerous inexpensive machines that merely forward
traffic to the processing nodes. We will have a great number of them,
and we will change them constantly. Each client knows about only a
small subset of the set of feeder nodes then extant. Feeder nodes do
not decrypt traffic; they merely forward it. They do not re-encrypt
traffic either: because Haystack traffic is already encrypted, it can
be forwarded as-is to our processing nodes.

Feeder nodes do work with sensitive information: namely, the source IP
address of each connection. For that reason, we are not comfortable
using volunteer resources for these machines. But because the job of a
feeder is not computationally demanding, it can be quite inexpensive.

Processing nodes are powerful servers in a secure location that
decrypt Haystack user traffic and forward it on behalf of our users.
Processing nodes forward traffic to exit nodes.

Exit nodes are similar to entry nodes in that they merely forward
traffic. Their primary purpose is to obscure the location of our
processing nodes. We have considered using Tor as the exit node
network, but in addition to this being quite impolite, it would
constrain scalability. We would appreciate community input in this
area.

To reduce bandwidth costs, the processing and feeder nodes will run
locally-caching HTTP proxies, though with a short maximum expiration
time to minimize the impact of any compromise. Locally-caching proxies
would be worse than useless on feeder nodes.

Feeder nodes will attempt to act as transparent proxies for whatever
host is specified in the Host: header in an attempt to avoid some
kinds of detection. This plan is tentative because it might have
unforeseen consequences.

CRYPTOGRAPHY
------------

Unlike Tor and Ultrasurf, we do not use SSL or mimic SSL. While it is
true that all SSL streams are indistinguishable on the basis of their
content, firewall administrators may choose to block SSL traffic they
can't decrypt. Our system avoids this possibility by appearing to be
unencrypted traffic. We run a local HTTP proxy (as opposed to a SOCKS
proxy) in order to reliably take advantage of HTTP-specific
optimizations. Performance is a prime concern, and treating HTTP as a
special and dominant case allows us to improve performance; much
research [2,3,4] has gone into optimizing HTTP. Most of it is
inapplicable to the Internet at large due to the multiplicity of
servers and clients, but because we control both the server and
client, we believe we can use some of these ideas.

Our cryptosystem is quite boring: we establish 128-bit session keys
via Diffie-Hellman over Daniel J. Bernstein's Curve25519 [5], which
admits an efficient integer implementation and uses 256-bit keys. (A
256-bit elliptic curve key has the same strength as a 3072-bit RSA key
according to NIST [6].)

We use EC-KCDSA over the same curve to authenticate the client and
server to each other. This initial setup establishes a secure,
authenticated channel with perfect forward secrecy. For symmetric
encryption, we use 128-bit AES in CTR mode (128-bit counter
incremented in little-endian fashion) with each message authenticated
by the venerable HMAC-SHA1 algorithm. (We are considering truncating
the HMAC output to 32 or 64 bits to reduce message size, but are open
to input in this area. The risk would be minimal because even a
well-equipped attacker couldn't forge a 32-bit MAC fast enough to
insert messages in the stream.)

Authentication happens before encryption, just like SSL, but
Krawczyk's criticisms won't apply and the combination is secure.

(Why AES? It's The Standard and has seen more cryptanalysis than any
other cipher, except maybe 3DES. Also, Intel offers hardware
acceleration on its newer chips, which may be relevant for us in the
future.)

We use the Boost implementation of Mersenne Twister for our CSPRNG,
which we use to generate session keys, nonces, and general purpose
random numbers, though the specific CSPRNG used is subject to change,
at least before release. We seed the CSPRNG with entropy from the
system entropy source (CryptGenRandom under Windows and /dev/random
under Unix systems).

The above will be released as open-source so the implementation can be
verified. The AES and SHA1 implementations come from OpenBSD; we've
removed the AES decryption functions because they're not necessary for
CTR-mode operation. Curve25519 is DJB's, of course. We use SHA1 for
signing and other purposes as well: we are trying to minimize the
number of "moving parts" in our system.

I can provide a more formal description of the protocols at another
time. The goal is to ensure that Haystack is _AT_WORST_, no worse than
any other encrypted proxy in terms of formal security guarantees. It
is this property that permits us to tolerate a best-effort approach to
our obfuscation engine. Privacy is the single most important concern.

HIGH-LEVEL OPERATION
--------------------

We multiplex many local connections over the secure channel described
above; doing that allows fast session startup with only one round
trip, reducing latency. We have yet to decide on an optional
compression scheme. The output bits from this secure channel are then
fed into the obfuscation engine, which in turn makes HTTP connections
to the CRC's servers and sends traffic that mimics normal HTTP
communication. Replies from the server flow the other way through the
pipeline. We intend to eventually extend the system to scavenge bits
from protocols other than obfuscated HTTP as well --- we have
considered DNS and gaming protocols.

We also regularly sends "realistic cover traffic" (similar in concept
to [7], though not produced with that approach) in order to make the
user's total network footprint look more like that of an unprotected
user's and to frustrate traffic analysis.

Bootstrapping is achieved through a static list of proxies embedded in
each Haystack binary. Once a connection is established, new nodes are
downloaded to each client and stored encrypted and authenticated.

Some nodes are kept in reserve so that even if all the feeder nodes
active at a particular time are simultaneously blocked, connectivity
can be re-established at a later time.

DISTRIBUTION
------------

Each copy of Haystack is unique. We encourage users to redistribute
the program but require them to generate a new binary for each
recipient; we enforce this requirement on the server side. Each
generated binary has its own unique decryption key and its own random
subset of our overall feeder nodes. Of course, users receive new
binaries and their keys over the secure channel.

The binary itself is encrypted, preventing offline attacks that would
reveal the embedded list of feeder nodes, and to provide some measure
of plausible deniability. When possible, we also prevent Haystack
being written to the operating system page file. Before the program is
run for the first time, a small loader asks the user for a key, which
is used to decrypt the rest of the binary. We strongly encourage users
to never enter a key into a computer unless starting Haystack.

We intend to use (though have not implemented) well-known code
modification techniques for the loader portion of the Haystack binary.
This approach, if successful, would frustrate signature-based scanning
for Haystack. (As far as I know, this may be the only application of
this technique for the purposes of good; it's something usually done
by malware.)

We will keep records of each binary generated, and provide a
captcha-protected, rate-limited, web-accessible oracle that can be
used to verify that a particular copy of Haystack is authentic. The
rate limitation stops the oracle from being very useful for aiding
system scans, but gives users some measure of protection.

Of course, accesses to the Oracle, if unencrypted, can be monitored,
blocked, or logged, which seriously limits its use. This problem is
unsolved. One possible solution is to provide, instead of the oracle,
an offline, but *very* computationally-expensive method to verify a
binary (e.g., by computing the signature of the result of 10,000
iterations of SHA1 of the binary and appending it to the end), but any
such technique would require a custom tool, and that tool could be
itself rigged. A conventional signature would seem to allow
authorities to too easily scan an entire filesystem for Haystack.

Even if one copy of Haystack is compromised and its embedded list of
feeder nodes is revealed, the network as a whole will continue to
function because the attackers will have only obtained a small part of
the whole set.

Of course, an attacker could repeatedly obtain copies of the program
and expand his list of blocked feeder nodes. But because we will know
which feeder nodes are blocked, we believe we can heuristically detect
and preemptively block such attempts.

The goal, again, is not to be perfect, but to try to address some
common attacks while being no worse than ordinary distribution that
involves a plain download of millions of identical copies.

PROGRAM ORGANIZATION
--------------------

Haystack is designed to "just work" in almost all circumstances. It
has minimal external dependencies, no installation requirements, and a
small footprint. We are aiming at 500KB for the size of the
single-file program --- a small package places as few barriers as
possible to distribution even over low-quality networks.

Internally, Haystack is a small custom Lisp interpreter written in C++
optimized for small code size. While the cryptographic primitives are
implemented in C++, the handshaking logic, HTTP operation, and
obfuscation are implemented in portable Lisp. This design choice
allows faster development in general, particularly of new obfuscation
engines, and allows quicker deployment of new code, which can be run
without restarting the client: the program silently updates itself
over a pre-established secure channel.

NON-GOALS
---------

We focus on circumventing large-scale censorship technology while
preserving user privacy via encryption. We do not require perfect
steganography to meet our objectives, and do not purport to be able to
forever disguise traffic under close scrutiny of dedicated
professionals. All systems are vulnerable to that attack.

We do not plan to support public relays, zero-knowledge forwarding, or
onion routing. But because all parts of the network are controlled by
us, all attacks that take the form of "a malicious node does X" are
moot. A large portion of the attacks on other anti-censorship systems
take this form.

Tor undoubtedly better protection of anonymity.

END USER RISKS
--------------

End-user risks fall into two categories: those shared by all
anti-censorship systems, and those specific to Haystack. The former
should be familiar to everyone on this list: network applications
leaking personally identifiable information, malware, and browser
history. These can only be solved by user education and mitigated by
software.

Another risk shared by all tools, to varying degrees, is an external
authority detecting the use of a tool [8].

Some have expressed that the effective consequences of detection might
be greater for Haystack than for other systems because Haystack has
been "marketed toward dissidents". First of all, this is not a
technical objection and should be rejected out of hand. Second, I am
incredulous: Haystack is designed to bypass Internet censorship, and
that is its stated purpose. That capability is useful to dissidents,
but to many others besides. Haystack's userbase will include the same
kind of people who use Tor, Freegate, and Ultrasurf. One could argue
that many users of these programs could be targets for authorities.
Haystack is not special in this respect, and this entire line of
argumentation is unfounded.

Another possible attack is "fingerprinting": it's possible to guess
the content of an encrypted stream by looking at how that stream is
broken up for transit. SSL in particular is vulnerable. However,
because Haystack uses connection multiplexing and baseload traffic
generation, it is far less vulnerable to this attack than some other
solutions.

All programs to some extent are vulnerable to detection to varying
degrees. Haystack and many other projects go through pains to make the
detection as difficult as possible, but the risk remains.

One genuine risk found in Haystack, found also in Freegate and
Ultrasurf, but not in Tor, is the centralized trust. Let's example
what the worst-case scenario is:

A break-in of our central facility would reveal the content of user
traffic as long as attacks had control. Traffic older than a few hours
would not be vulnerable due to our cryptography having the "perfect
forward secrecy" property, though very recent traffic may still be in
the HTTP cache.

- A compromise of one of our feeder nodes would reveal the IP
  addresses of some users, though not the content of their traffic.

- A compromise of one of our exit nodes would not reveal very
  sensitive information except very recent cache contents and the
  identify of our processing nodes.

- If one of our secure development machines were compromised and the
  signing key passphrase mechanically extracted from my brain or my
  keyboard, an attacker would be able to execute a successful MITM
  attack against Haystack (assuming he knew at least some of the
  feeder nodes) and read traffic.

- If our obfuscation methods were defeated to such an extent that
  automatic traffic classification became possible and
  non-resource-prohibitive, authorities could detect and block
  Haystack. We expect that this will happen, and plan to change and
  improve obfuscation techniques often enough to defeat this attack
  overall.

  Instead of blocking Haystack, an attacker might simply log which
  users are connecting to it. By changing obfuscation techniques often
  enough, we can at least make this endeavor very expensive.

(N.B. --- the "obfuscation engine" is flexible enough to simply send
traffic over SSL if it turns out that's the best approach for a given
time and place.)

In short, there are risks. But we believe we can reduce all of them to
acceptable levels through secure administration practices, system
hardening, and excellent physical security. Suggestions for
improvement in this respect are welcome. Arguments over the
fundamental architecture are not.

CONCLUSION
----------

HAYSTACK IS NOT A PANACEA. No system is perfect [9]. Instead, Haystack
is meant to be an incremental improvement and a new option, trading
improvements in some area for worsening in others. Users still incur
significant risks and must be informed. But it is unreasonable to hold
Haystack to a standard that other systems do not themselves meet. As
long as users are safer than they would otherwise be and can choose a
system that fits their personal risk tolerances, it's a good thing.

REFERENCES
----------

[1] "Design of a blocking-resistant anonymity system Tor Project
technical report", Roger Dingledine, Nick Mathewson,
https://svn.torproject.org/svn/projects/design-paper/blocking.html

[2] "rproxy", Martin Pool,
http://rproxy.samba.org/

[3]  "Squeezing more bits out of HTTP caches", Mogul, J.C.,
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=844495

[4] "Design, implementation, and evaluation of duplicate transfer
detection in HTTP", Mogul, J.C. et. al.,
http://portal.acm.org/citation.cfm?id=1251175.1251179

[5] "A state-of-the-art Diffie-Hellman function", D.J. Bernstein
http://cr.yp.to/ecdh.html

[6] "The Case for Elliptic Curve Cryptography"
http://www.nsa.gov/business/programs/elliptic_curve.shtml

[7] "Preventing SSL Traffic Analysis with Realistic Cover Traffic",
Nabil Shear and Nikita Borisov,
http://www.cs.uiuc.edu/homes/nschear2/ccs09-ext-abstract.pdf

[8] "Active and Passive Tor Detection"
http://blog.tenablesecurity.com/2007/09/active-and-pass.html

[9] "Vulnerabilities in Tor: Past, present, future"
http://freehaven.net/~arma/slides-25c3.pdf

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 195 bytes
Desc: OpenPGP digital signature
URL: <http://mailman.stanford.edu/pipermail/liberationtech/attachments/20100911/648a9b0e/attachment.asc>