diff --git a/thesis.tex b/thesis.tex index 1f64df0..acd5896 100644 --- a/thesis.tex +++ b/thesis.tex @@ -11,6 +11,7 @@ \usepackage{graphicx} \usepackage{makeidx} \usepackage{microtype} +\usepackage[x11names]{xcolor} \usepackage{todonotes} % http://grad.berkeley.edu/academic-progress/dissertation/: @@ -21,11 +22,17 @@ % biblatex manual: % "When using the hyperref package, it is preferable to load it after biblatex." -\usepackage[backend=biber,bibencoding=utf8,maxbibnames=99,backref=true]{biblatex} +\usepackage[backend=biber,bibencoding=utf8,sortcites=true,indexing=true,maxbibnames=99,backref=true]{biblatex} +\bibliography{local,censor-local,censor} % Remove "URL: " prefix from bibliography URLs. Original declaration is from % texmf-dist/tex/latex/biblatex/biblatex.def. \DeclareFieldFormat{url}{\url{#1}} -\bibliography{local,censor-local,censor} +% Better look for citations that include a section reference like \cite[\S 3]{foobar}. +\renewcommand{\postnotedelim}{~} +% Index only author names, not work titles. +% https://tex.stackexchange.com/a/305486 +\renewbibmacro*{citeindex}{\ifciteindex{\indexnames{author}}{}} +\renewbibmacro*{bibindex}{\ifbibindex{\indexnames{author}}{}} \usepackage[hidelinks]{hyperref} \urlstyle{same} \def\chapterautorefname{Chapter} @@ -234,7 +241,7 @@ Other forms of censorship that are \emph{not} in scope include: \item forum moderation and deletion of social media posts \item deletion-resistant publishing in the vein of the Eternity Service~\cite{Anderson1996a} - (what Köpsell and Hillig call ``censorship resistant publishing systems''~\cite[\S~1]{Koepsell2004a}), + (what Köpsell and Hillig call ``censorship resistant publishing systems''~\cite[\S 1]{Koepsell2004a}), except insofar as access to such services may be blocked % Dagster~\cite{Stubblefield2001a} @@ -265,23 +272,45 @@ far more topics could fit under the umbrella of Internet censorship. 
Nevertheless, for the purpose of this thesis, I will continue to use ``Internet censorship'' without further qualification to refer to the border firewall case. -% Even within this narrowed scope, there is plenty to do. - - -\section{Overview} - -\dragons -\autoref{chap:principles} -is the ``thesis'' within the thesis. -experience with tor - -pluggable transports - -My blind spots: VPNs, -systems without research documentation (FreeGate, Ultrasurf, Shadowsocks), -foreign-language documentation and forums. +\section{Context and overview} + +This thesis contains knowledge I~have collected +and research projects I~have taken part in +over the last five years. +The next chapter, ``\nameref{chap:principles},'' +is the thesis of the thesis, +wherein I~lay out opinionated general principles +of the field. +The remaining chapters are split between +the topics of modeling and circumvention: +Chapters~\ref{chap:censor-modeling}--\ref{chap:proxy-probe} on censor modeling +and Chapters~\ref{chap:domain-fronting} and~\ref{chap:snowflake} on circumvention systems. + +One's point of view is colored by experience. +I~will therefore briefly describe the background to my research. +I~owe much of my experience to collaboration +with the Tor Project, +producers of the Tor anonymity network. +Although Tor was not originally intended as a circumvention system, +it has grown into one +thanks to pluggable transports, +a modularization system for circumvention implementations. +The Tor network has been the vehicle for deployment +of my circumvention systems, +as well as a common object of research. +I~know a lot about Tor +and pluggable transports, +but I~have less experience +(especially implementation experience) +with other systems, +particularly those +that are developed in a language other than English. 
+And while I~have plenty of operational experience---deploying +and maintaining systems with real users---I~have +not been in a situation where I~needed +to circumvent regularly, as a user. \chapter{Principles of circumvention} @@ -409,11 +438,11 @@ that are difficult for the censor to detect or block. The way of organizing censorship and circumvention techniques that I have presented is not the only way. Köpsell and Hillig divide detection into -``content'' and ``circumstances''~\cite[\S~4]{Koepsell2004a}; +``content'' and ``circumstances''~\cite[\S 4]{Koepsell2004a}; their circumstances include addresses and also what I would consider more content-like: timing, data transfer characteristics, and protocols. Philipp Winter divides circumvention into three problems: -bootstrapping, endpoint blocking, and traffic obfuscation~\cite[\S~1.1]{Winter2014c}. +bootstrapping, endpoint blocking, and traffic obfuscation~\cite[\S 1.1]{Winter2014c}. Endpoint blocking and traffic obfuscation correspond to my detection by address and detection by content; bootstrapping is the challenge of getting a copy of circumvention software @@ -426,17 +455,17 @@ break detection into four aspects: destinations, content, flow properties, -and protocol semantics~\cite[\S~2.4]{Khattak2016a}. +and protocol semantics~\cite[\S 2.4]{Khattak2016a}. I think of their ``content,'' ``flow properties,'' and ``protocol semantics'' as all fitting under the heading of content. -Tschantz et~al.\ identify ``setup'' and ``usage''~\cite[\S~V]{Tschantz2016a-local}, +Tschantz et~al.\ identify ``setup'' and ``usage''~\cite[\S V]{Tschantz2016a-local}, and Khattak, Elahi, et~al.\ identify -``communication establishment'' and ``conversation''~\cite[\S~3.1]{Khattak2016a}, +``communication establishment'' and ``conversation''~\cite[\S 3.1]{Khattak2016a}, as targets of obfuscation; these mostly correspond to address and content. 
What I call ``detection'' and ``blocking,'' -Khattak, Elahi, et~al.\ call ``fingerprinting'' and ``direct censorship''~\cite[\S~2.3]{Khattak2016a}, -and Tschantz et~al.\ call ``detection'' and ``action''~\cite[\S~II]{Tschantz2016a-local}. +Khattak, Elahi, et~al.\ call ``fingerprinting'' and ``direct censorship''~\cite[\S 2.3]{Khattak2016a}, +and Tschantz et~al.\ call ``detection'' and ``action''~\cite[\S II]{Tschantz2016a-local}. A major difficulty in developing circumvention systems is that however much you model and try to predict the reactions of a censor, @@ -597,7 +626,7 @@ that the censor doesn't mind blocking. In my opinion, collateral damage provides a more productive way to think about the behavior of censors than do alternatives. -Is is able to take into account different censors' +It is able to take into account different censors' differing resources and motivations, and so is more useful for generic modeling. Moreover, it gets to the heart of what makes @@ -609,7 +638,7 @@ called the essential element ``deniability,'' meaning that a user could plausibly claim to have been doing something other than circumventing when confronted with a log of their network activity. -Khattak, Elahi, et~al.~\cite[\S~4]{Khattak2016a} also consider +Khattak, Elahi, et~al.~\cite[\S 4]{Khattak2016a} also consider ``deniability'' separately from ``unblockability.'' % \cite{Houmansadr2011a} also says ``deniability'' % \cite{Burnett2010a} also says ``deniability'' @@ -679,7 +708,7 @@ is exactly such an irrational decision, at the greater societal level. \label{sec:obfuscation-strategies} \begin{itemize} -\item Sony thing on passive/active detection \cite[\S~5.1]{SladekBroseEANTC} +\item Sony thing on passive/active detection \cite[\S 5.1]{SladekBroseEANTC} \item relation to website fingerprinting---circumvention is potentially harder because you can't just use e.g. constant bitrate \end{itemize} @@ -708,7 +737,7 @@ and allows all others. 
This is not to say that steganography is strictly superior to polymorphism---there are tradeoffs in both directions. -Effective mimickry can be difficult to achieve, +Effective mimicry can be difficult to achieve, and in any case effectiveness can only be judged against a censor's specific computations of collateral damage. Whitelisting, by its nature, @@ -729,7 +758,7 @@ but proposed other attacks designed for efficiency and low false positives, against both steganographic and polymorphic protocols. Geddes et~al.~\cite{Geddes2013a} showed that even perfect imitation -(achieved via tunnelling) may leave vulnerabilities +(achieved via tunneling) may leave vulnerabilities due to mismatches between the cover protocol and the covert protocol---for instance randomly dropping packets may disrupt circumvention more than other uses of the cover protocol. @@ -738,11 +767,11 @@ perhaps entropy measurement, most of the attacks proposed in academic literature have not been used by censors in practice. Some systematizations -(for example those of Brubaker et~al.~\cite[\S~6]{Brubaker2014a}; -Wang et~al.~\cite[\S~2]{Wang2015a}; and -Khattak, Elahi, et~al.~\cite[\S~6.1]{Khattak2016a}) +(for example those of Brubaker et~al.~\cite[\S 6]{Brubaker2014a}; +Wang et~al.~\cite[\S 2]{Wang2015a}; and +Khattak, Elahi, et~al.~\cite[\S 6.1]{Khattak2016a}) further subdivide steganographic systems -into those based on mimickry +into those based on mimicry (attempting to replicate the behavior of a cover protocol) and tunneling (sending through a genuine implementation of the cover protocol). @@ -750,7 +779,7 @@ I do not find the distinction useful, except when speaking of concrete implementation choices; to me, there are various degrees of fidelity in imitation, and tunneling only tends to offer higher fidelity -than mimickry. +than mimicry. I will list some representative circumvention systems that exemplify the steganographic strategy. 
@@ -806,8 +835,7 @@ permitted fitting to distributions other than uniform). It was not susceptible to passive deobfuscation, relying on an out-of-band key exchange before each session. Shadowsocks~\cite{Shadowsocks} -is a lightweight encryption layer atop a simple proxy protocol, -widely used in China. +is a lightweight encryption layer atop a simple proxy protocol. There is a line of successive look-like-nothing protocols---known by the names obfs2, obfs3, ScrambleSuit, and obfs4---whose history @@ -835,7 +863,7 @@ and remains vulnerable to active probing. (The Great Firewall of China had begun active-probing for obfs2 by January 2013, and for obfs3 by February 2015, -or possibly as early as July 2013~\cite[\S~5.4]{Ensafi2015b}.) +or possibly as early as July 2013~\cite[\S 5.4]{Ensafi2015b}.) ScrambleSuit~\cite{Winter2013b}, first available to users in 2014~\cite{tor-blog-tor-browser-364-and-40-alpha-1-are-released}, arose in response to the active-probing of obfs3. @@ -876,6 +904,11 @@ it is not necessarily useful for detecting other server instances. \section{Address blocking resistance strategies} \label{sec:address-strategies} +\begin{itemize} +\item VPN Gate ``collaborative spy detection''~\cite[\S 4.3]{Nobori2014a}, other ways of fingerprinting censor +\item DEFIANCE~\cite{Lincoln2012a} +\end{itemize} + The first-order solution for reaching a destination whose address is blocked is to instead route through a proxy. But a single, static proxy is not much better than direct access, @@ -913,7 +946,7 @@ The simplest proxy infrastructure is no infrastructure at all: require every client to set up and maintain a proxy for their own personal use, or for a few of their friends. As long as the use of any single address remains low, -it may escape the censor's notice~\cite[\S~4.2]{tor-techreport-2006-11-001}. +it may escape the censor's notice~\cite[\S 4.2]{tor-techreport-2006-11-001}. The problem with this strategy, of course, is usability and scalability. 
If it were easy for everyone to set up their own proxy on an unblocked address, they would do it, @@ -960,7 +993,7 @@ despite public appeals for volunteers to run bridges there have never been more than a few thousand of them, and Dingledine reported in 2011 that the Great Firewall of China had managed to enumerate both the HTTPS and email distribution -pools~\cite[\S~1]{tor-techreport-2011-05-001}\cite[\S~1]{tor-techreport-2011-10-002}, +pools~\cites[\S 1]{tor-techreport-2011-05-001}[\S 1]{tor-techreport-2011-10-002}, presumably taking advantage of its greater resources. % (A curious fact, though, is that nearly But nearly all clients use the default bridges~\cite{Matic2017a}. % I will cover this seeming paradox in more detail in @@ -998,7 +1031,7 @@ A way to make proxy distribution more robust against censors is to ``poison'' the set of proxy addresses with the addresses of important servers, blocking which would result in high collateral damage. -VPN Gate employed this idea~\cite[\S~4.2]{Nobori2014a}, +VPN Gate employed this idea~\cite[\S 4.2]{Nobori2014a}, mixing into their public proxy list the addresses of root DNS servers and Windows Update servers. @@ -1006,14 +1039,13 @@ and Windows Update servers. Apart from ``in-band'' discovery of bridges via subversion of a proxy distribution system, one must also worry about ``out-of-band'' discovery, -for example by mass scanning~\cite[\S~6]{tor-techreport-2011-10-002}\cite[\S~9.3]{tor-techreport-2006-11-001}. +for example by mass scanning~\cites[\S 6]{tor-techreport-2011-10-002}[\S 9.3]{tor-techreport-2006-11-001}. Durumeric et~al. found about 80\% of existing (unobfuscated) -Tor bridges~\cite[\S~4.4]{Durumeric2013a} +Tor bridges~\cite[\S 4.4]{Durumeric2013a} by scanning all of IPv4 on a handful of common bridge ports. 
% surf and serve~\cite{McLachlan2009a} (didn't actually scan) % extensive analysis~\cite{Ling2012a} (didn't scan) -% https://lists.torproject.org/pipermail/tor-dev/2014-December/007957.html (Project Sonar) -Matic et~al.\ had similar results in 2017~\cite[\S~V.D]{Matic2017a}, +Matic et~al.\ had similar results in 2017~\cite[\S V.D]{Matic2017a}, using public search engines in lieu of active scanning. The best solution to the scanning problem is to do as ScrambleSuit and obfs4 do, @@ -1180,7 +1212,7 @@ is not responsible for forwarding the packets onward. Another parallel is that censors are susceptible to the same kinds of evasion and obfuscation attacks that affect network monitors more generally. In 1998, Ptacek and Newsham~\cite{Ptacek1998a} -and Paxson~\cite[\S~5.3]{Paxson1999a} outlined various attacks +and Paxson~\cite[\S 5.3]{Paxson1999a} outlined various attacks against network intrusion detection systems---such as manipulating the IP time-to-live field or sending overlapping IP fragments---that @@ -1231,7 +1263,7 @@ estimating the cost to counteract them. \section{Early censorship and circumvention} Internet censorship and circumvention began to rise to importance -in the mid-1900s, conciding with the popularization of the World Wide Web. +in the mid-1990s, coinciding with the popularization of the World Wide Web. At that time, online censorship focused mainly on the web. Computer security companies were developing technology for IP address, URL, and web page filtering. @@ -1277,7 +1309,7 @@ special client-side software other than a web browser. The difficulty they faced was second-order blocking as censors discovered and blocked the proxies themselves. Circumvention designers deployed some countermeasures; -for example Circumventor had a mailing list~\cite[\S~7.4]{tor-techreport-2006-11-001} +for example Circumventor had a mailing list~\cite[\S 7.4]{tor-techreport-2006-11-001} which would send out fresh proxy addresses every few days. 
A 1996 article by Rich Morin~\cite{Morin1996Rover} presented a prototype HTML-rewriting proxy called Rover, @@ -1314,7 +1346,7 @@ censors operate is as important as the censors themselves. A good example of this is the paper on Infranet, the first academic circumvention design I am aware of. Its authors argued, in 2001, -that TLS would not suffice as a cover protocol~\cite[\S~3.2]{Feamster2002a}, +that TLS would not suffice as a cover protocol~\cite[\S 3.2]{Feamster2002a}, because the relatively few TLS-using services at that time could \emph{all} be blocked without much harm. Certainly the circumstances are different today---domain @@ -1335,9 +1367,13 @@ just as censors do. \chapter{Understanding censors} +\label{chap:censor-modeling} \dragons +A detached view is helpful when taking a longer view. +(As long as it is not \emph{too} detached.) + The main tool we have to build relevant threat models is the natural study of censors. The study of censors is complicated by difficulty of access: @@ -1532,37 +1568,429 @@ what's used and what's not used \chapter{Active probing} \label{chap:active-probing} -\dragons +The Great Firewall of China rolled out an innovation +in the identification of proxy servers around 2010: +\emph{active probing} of suspected proxy addresses. +In active probing, the censor pretends to be a legitimate client, +making its own connections to suspected addresses +to see whether they speak a proxy protocol. +Any addresses that are found to be proxies +are added to a blacklist +so that the destination will be blocked in the future. +The input to active probing, a set of suspected addresses, +comes from passive observation of the content of client connections. +The censor sees a client connect to a destination. +Whenever the censor's content classifier is unsure +whether an ongoing connection is accessing a proxy, +it may pass the address of the destination to the active prober. 
+The active prober's connection then checks---with a low +chance of false positives---whether the destination actually is a proxy. +\autoref{fig:active-probing} illustrates the process. + +\begin{figure} +\centering +\parbox[c][2in][c]{\textwidth}{\centering placeholder for figure} +\caption{ +The censor watches a connection between a client and a destination. +If content inspection does not definitively indicate a circumvention protocol, +but also does not definitively rule it out, +the censor passes the destination's address to an active prober, +which itself attempts connections using various proxy protocols. +If any of the proxy connections succeeds, +the censor adds the destination +to an address blacklist. +} +\label{fig:active-probing} +\end{figure} + +Active probing makes good sense for the censor, +whose main restriction is the risk of false-positive classifications +that result in collateral damage. +Any classifier based purely on passive content inspection +must be very precise (have a low rate of false positives). +Active probing increases the precision, +by only blocking those servers determined through active inspection +to be proxies. +With active probing, +the censor can get away with a mediocre content-based classifier, +one that returns a rough superset of actual proxy connections, +because active probes will weed out any false positives it might have had. +The content-based classifier only has to reduce the total connections +to a small enough number that the active probing subsystem can handle them. +Another benefit, from the censor's point of view, +is that active probing can be run as a batch job, +separate from the firewall's responsibilities that require +a low response time. + +Active probing, as I~use the term in this chapter, +is distinguished by being reactive, +driven by observation of client connections. 
+It is distinct from proactive, wide-scale port scanning, +in which a censor regularly scans likely ports on the Internet +to find proxies, independent of client activity. +The potential for the latter kind of scanning +has been appreciated for over a decade. +Dingledine and Mathewson~\cite[\S 9.3]{tor-techreport-2006-11-001} +raised scanning resistance as an issue +in Tor's initial bridge design document. +McLachlan and Hopper~\cite[\S 3.2]{McLachlan2009a} +observed that the tendency of bridges to run on +a handful of popular ports +would make them discoverable in an Internet-wide scan, +which they estimated would take weeks. +Dingledine~\cite[\S 6]{tor-techreport-2011-10-002} +mentioned indiscriminate scanning as one of ten ways to discover Tor +bridges---while also bringing up the possibility of active probing +in the sense of the present chapter, +then just beginning to be used by the Great Firewall. +Durumeric et~al.~\cite[\S 4.4]{Durumeric2013a} +demonstrated the effectiveness of Internet-wide scanning, +discovering about 80\% of public bridges in a matter of hours, +targeting only two ports, 9001 and 443. +Tsyrklevich~\cite{tor-dev-internet-wide-bridge-scanning} and +Matic et~al.~\cite[\S V.D]{Matic2017a} +later showed how existing public repositories of Internet scan data +could reveal many bridges, without even the necessity +of manually running a new scan. + +The Great Firewall of China is the only censor known +to employ active probing. +Its sophistication has increased over time, +with the addition of new protocols +and a reduction in the delay before new servers get probed. +The Great Firewall has the documented ability +to active-probe plain Tor and some of its pluggable transports, +certain VPN protocols, +as well as certain HTTPS-based proxies. +The probing takes place +only seconds or minutes after +a connection by a legitimate client, +and the active-probing connections come +from a large range of source IP addresses. 
+The experimental results in this chapter +all have to do with China. + +Active probing lies somewhere in the middle of the dichotomy, +put forward in \autoref{chap:principles}, +of blocking by content and blocking by address. +The censor's active probing subsystem takes +addresses as input and produces addresses as output +(to be added to a blacklist). +But it is content-based classification that +produces the list of input addresses. +Active probing only became an issue because +content obfuscation had gotten better: +if a censor could easily identify +circumvention protocols by passive inspection, +it would not go to the extra trouble of active probing. + +Contemporary circumvention systems must be designed +to resist active probing attacks. +The look-like-nothing systems +ScrambleSuit~\cite{Winter2013b}, +obfs4~\cite{obfs4}, +and Shadowsocks~\cite{Shadowsocks-AEAD,BlessingStudio-why-do-shadowsocks-deprecate-ota} +do it by having the proxy authenticate client connections, +using a per-proxy password or private key. +Domain fronting (\autoref{chap:domain-fronting}) +and Snowflake (\autoref{chap:snowflake}) +deal with active probing differently. + -In 2015 I helped study the phenomenon of ``active probing'' -by the Great Firewall to discover hidden proxy servers. -In active probing, the censor pretends to be a legitimate client -of the proxy server: it connects to suspected servers -to check whether they speak a proxy protocol. -If they do, then they are blocked. -Active probing makes good sense for the censor: -it has high precision (low risk of collateral damage), -and is efficient because it can be run as a batch job -apart from a firewall's real-time responsibilities. -The Great Firewall can dynamically active-probe and block -the servers of a number of common circumvention protocols, -such as Tor, obfs2, and obfs3, -within only seconds or minutes of -a connection by a legitimate client. 
-The need to resist active probing has informed the design -of recent circumvention systems, including meek. - -My primary contribution to the active probing project was the analysis -of server logs to uncover the history -of about two and a half years of active probing. -My work revealed the wide distribution of active probing -source addresses (there were over 14,000 of them). -It also discovered previously undocumented types of probes, -for the protocol used by VPN Gate -and for a simple form of domain-fronted proxy. -I helped analyze the network ``fingerprints'' -of active probes and how they might be -distinguished from connections by legitimate clients. +\section{History of active probing research} + +Active probing research has primarily +had to do with Tor and its pluggable transports. +There is also some work on Shadowsocks. +\autoref{tab:active-probing-timeline} +summarizes the research of this section. + +\begin{table} +\begin{tabular}{lp{5 in}} +2010 August & +Nixon notices strange, random-looking connections from China in his SSH logs~\cite{Nixon-sshprobes}. +\\ +2011 May--June & +Nixon's random-looking probes are temporarily replaced +by TLS probes before changing back again~\cite{Nixon-sshprobes}. +\\ +2011 October & +hrimfaxi reports that Tor bridges are quickly detected by the GFW~\cite{tor-trac-4185}. +\\ +2011 November & +Nixon publishes observations and hypotheses about the strange SSH connections~\cite{Nixon-sshprobes}. +\\ +2011 December & +Tim Wilde investigates Tor probing~\cite{WildeGFW,tor-blog-knock-knock-knockin-bridges-doors,tor-trac-4744}. +He finds two kinds of probe: ``garbage'' random probes +and Tor-specific ones. +\\ +2012 February & +The obfs2 transport becomes available~\cite{tor-blog-obfsproxy-next-step-censorship-arms-race}. +The Great Firewall is initially unable to active-probe it. +\\ +2012 March & +Winter and Lindskog investigate Tor probing in detail~\cite{Winter2012a}. 
+\\ +2013 January & +The Great Firewall begins to active-probe obfs2~\cites{tor-trac-8591}[\S 4.3]{Ensafi2015b}. +The obfs3 transport becomes available~\cite{tor-blog-combined-flash-proxy-pyobfsproxy-browser-bundles}. +\\ +2013 June--July & +Majkowski observes TLS and garbage probes +and identifies fingerprintable features of the probers~\cite{Majkowski-fun-with-the-great-firewall}. +\\ +2013 August & +The Great Firewall begins to active-probe obfs3~\cite[Figure~8]{Ensafi2015b}. +\\ +2014 October & +The ScrambleSuit transport (resistant to active probing) +becomes available~\cite{tor-blog-tor-browser-40-released}. +\\ +2015 April & +The obfs4 transport (resistant to active probing) +becomes available~\cite{tor-blog-tor-browser-45-released}. +\\ +2015 August & +BreakWa11 discovers an active-probing weakness in +Shadowsocks~\cites{github-shadowsocks-rss-issue-38}[\S 2]{BlessingStudio-why-do-shadowsocks-deprecate-ota}. +\\ +2015 October & +Ensafi et~al.~\cite{Ensafi2015b} publish results of +multi-modal experiments on active probing. +\\ +2017 February & +Shadowsocks changes its protocol against active probing~\cite{github-shadowsocks-org-issue-42}. +\end{tabular} +\caption{ +Timeline of active probing research. +} +\label{tab:active-probing-timeline} +\end{table} + +Nixon~\cite{Nixon-sshprobes} in late 2011 published +an analysis of suspicious connections from IP addresses in China +that his servers had been receiving for a year. +The connections were to the SSH port, but did not follow the SSH protocol; +rather they contained apparently random bytes, +resulting in error messages in the log file. +Nixon discovered a pattern: the random-looking probes +were preceded, at an interval of 5--20 seconds, +by a legitimate SSH login from some other IP address in China. +The same pattern was repeated at three other sites. 
+Nixon supposed that the probes were triggered +by legitimate SSH users, +as their connections traversed the firewall; +and that the random payloads +were a simple form of service identification, +sending non-protocol-conforming data to see how the server would respond. +For a few weeks in May and June 2011, +the probes did not look random, but looked like TLS. + +In October 2011, Tor user hrimfaxi reported that +a newly set up, unpublished Tor bridge +would be blocked within 10~minutes of first +being accessed from China~\cite{tor-trac-4185}. +Moving the bridge's address to another port +on the same IP address would work temporarily, +but then be blocked again before another 10~minutes. +Wilde systematically investigated the phenomenon and +published an extensive analysis +of active probing behavior caused by +making a connection from inside China to outside~\cite{WildeGFW,tor-blog-knock-knock-knockin-bridges-doors}. +There were two kinds of probes: +``garbage'' random probes like those Nixon had described, +and specialized Tor probes that established a TLS session +and inside the session sent the Tor protocol. +The garbage probes were sent in response +to TLS connections to port 443 only, +and followed the triggering connection within moments. +The Tor probes were sent in response +to TLS connections on any port that shared +characteristics with Tor's client handshake~\cite{tor-trac-4744}, +and were not sent immediately, +but batched to the next quarter hour. +The probes came from diverse IP addresses in China: +20 different /8 networks~\cite{WildeProberIPs}. +Bridges using the obfs2 transport were +neither probed nor blocked. + +Winter and Lindskog revisited the question +of Tor probing +a few months later in 2012~\cite{Winter2012a}. +They used open proxies and a VPS in China +to reach bridges and relays in Russia, Singapore, and Sweden +(configured so that ordinary users would not +connect to them by accident). 
+They confirmed Wilde's finding that the blocking +of one port did not affect other ports on the same IP address. +Blocks expired after 12 hours. +By simulating multiple Tor connections, +they collected over 3,000 active probe samples in 17~days. +During that time, there were about 72 hours +that were mysteriously free of active probing. +Half of the probes came from a single IP address, +202.108.181.70\index{202.108.181.70 (active prober)}; +the other half were almost all unique. +Reverse-scanning the source IP addresses of probes +after a few minutes sometimes found a live host, +though usually with a different IP TTL than +was used during the probing, +which the authors suggest may be a sign of +address spoofing by the probing infrastructure. +% diurnal pattern in scanning delay +Because probing was triggered by patterns in the TLS client handshake, +they developed a server-side tool, brdgrd~\cite{brdgrd}\index{brdgrd}, +that rewrote the TCP window so that +the client's handshake would be split across packets. +The tool sufficed, at the time, to prevent active probing. + +The obfs2 pluggable transport, +first available in February 2012~\cite{tor-blog-obfsproxy-next-step-censorship-arms-race}, +worked against active probing +for about a year. +The first report of its active probing arrived in March~2013~\cite{tor-trac-8591}. +By analyzing the logs of my web server, +I~found evidence for an even earlier onset: +January~2013~\cite[\S 4.3]{Ensafi2015b}. +At about the same time, +the obfs3 pluggable transport became available~\cite{tor-blog-combined-flash-proxy-pyobfsproxy-browser-bundles}. +It was as vulnerable to active probing as obfs2 was, +but the firewall did not gain the ability +to active-probe it until August~2013~\cite[Figure~8]{Ensafi2015b}. + +Majkowski~\cite{Majkowski-fun-with-the-great-firewall} +observed a change in active-probing behavior +between June and July~2013. 
+In June, he reproduced the observations +of Winter and Lindskog, +eliciting pairs of TLS probes, +one from +202.108.181.70\index{202.108.181.70 (active prober)} +and one from another IP address. +He also provided TLS fingerprints for the probers, +which were distinct from the fingerprints of ordinary Tor clients. +In July, he began to see pairs of probes +with apparently random contents, +like the garbage probes Wilde described. +The TLS fingerprints of probes in July +differed from those seen earlier, +but were still identifiable. + +The ScrambleSuit transport, +designed to be immune to active-probing attacks, +first shipped with Tor Browser~4.0 +in October~2014~\cite{tor-blog-tor-browser-40-released}. +The successor transport obfs4, similarly immune, +shipped in Tor Browser~4.5 in +April 2015~\cite{tor-blog-tor-browser-45-released}. + +In August 2015, +developer BreakWa11 described an active-probing vulnerability +in the Shadowsocks protocol~\cites{github-shadowsocks-rss-issue-38}[\S 2]{BlessingStudio-why-do-shadowsocks-deprecate-ota}. +The flaw had to do with a lack of authentication of ciphertext, +allowing a prober to introduce errors and watch +how the server responds. +The Shadowsocks developers deployed a modified protocol, +a stopgap measure that proved to have its own vulnerabilities to probing. +Shadowsocks deployed another protocol change in +February~2017 fixing the problem~\cite{github-shadowsocks-org-issue-42}. +Despite the long window of vulnerability, +there is no evidence that the Great Firewall +tried to active-probe Shadowsocks servers~\cite{ProgramThink-comment1508314948860}. + +Ensafi et~al. (including me)~\cite{Ensafi2015b} did the largest +controlled study of active probing to date +throughout early 2015. 
+We collected data from a variety of sources: +a private network of our own bridges, +isolated so that only we and active probers would connect to them; +induced intensive probing of a single bridge +over a short time period, in the manner of Winter and Lindskog; +analysis of server log files going back to 2010; +and back-scanning active prober source IP addresses +using tools such as ping, traceroute, and Nmap\index{Nmap}. +Using these sources of data, +we investigated many aspects of active probing, +such as the types of probes the firewall is capable of sending, +the probers' source addresses, +and potentially fingerprintable peculiarities of the probers' +implementation of protocols. +Observations from this research project +appear in the remaining sections of this chapter. + +\section{Types of probes} + +Our experiments confirmed the existence of certain probe types +that had been documented in previous research, +and other types that had not been previously documented. +Of the probe types that had been documented before, +our observations were mostly consistent, +with some differences in the details. +Our research found, at varying times, these kinds of probes: +\begin{description} + +\item[Tor] +We expected to find probing for Tor, and so we did. +The probes we observed in 2015, however, +differed from those Wilde described in 2011: +ours had a lighter-weight check inside the TLS layer +that did not require building a circuit. +Also, in contrast to what Winter and Lindskog found in 2012, +our Tor probes were sent seconds after a connection, +no longer batched to a multiple of 15~minutes. + +\item[obfs2] +The obfs2 protocol has a weakness that makes it trivial to identify, +passively or retroactively, +as long as you have at least the first 20 bytes sent by the client. +We turned the weakness to our advantage. 
+The ready identifiability of obfs2 allowed us to distinguish it +from other random-looking contents and +isolate a set of connections that +could only belong to legitimate circumventors or active probers. + +\item[obfs3] +Unlike obfs2, the obfs3 protocol is not easily identified passively, +except by general characteristics like its random payloads +and certain bounds on message sizes during the initial handshake. +In certain of our experiments, we were running an obfs3 server +that was able to participate in the handshake and so confirm +that what was being sent was really obfs3. +In others, such as the passive log analysis, we called ``obfs3'' +those probes that looked random and were not obfs2. + +\item[SoftEther] +We were initially only looking for Tor-related active probing, +but in the process we inadvertently found other kinds of probes. +One of these was an HTTPS request, +``\texttt{POST /vpnsvc/connect.cgi HTTP/1.1}'', +which resembles the client handshake of the SoftEther VPN software +that underlies the VPN Gate circumvention system~\cite{Nobori2014a}. + +\item[AppSpot] +This type of probe is an HTTPS request, +\begin{verbatim} +GET / HTTP/1.1 +Host: webncsproxyXX.appspot.com +\end{verbatim} +where the `\texttt{XX}' is a number that varies. +The intent of this probe seems to be the discovery +of servers that are capable of domain-fronting for Google services. +(See \autoref{chap:domain-fronting} for more on domain fronting.) +At one time, there were simple proxies running at +\nolinkurl{webncsproxyXX.appspot.com}. + +\item[urllib] +\todo[inline]{describe; this one is new since the 2015 paper} +\end{description} +This is not an exhaustive list of the Great Firewall's +active probing capability; +these are just the probes we were able to document comprehensively. +The purpose of the random ``garbage'' probes that Nixon and Wilde +had described is still not known; +they were not obfs2 and were too early to be obfs3, +so they must have been something else. 
\begin{figure} \centering @@ -1571,45 +1999,179 @@ distinguished from connections by legitimate clients. Active probes received at my web server over five years. This is an updated version of Figure~8 -in our paper ``Examining How the Great Firewall -Discovers Hidden Circumvention Servers''; +from the paper ``Examining How the Great Firewall +Discovers Hidden Circumvention Servers''~\cite{Ensafi2015b}; the vertical blue stripe divides old and new data. +The ``short'' probes are those that looked random but +did not provide enough data (20~bytes) +for the obfs2 test; it is likely that they, +along with the ``empty'' probes, are really obfs2, obfs3, or Tor probes +that were truncated at the first +`\textbackslash 0' or `\textbackslash n' byte. +\todo[inline]{urllib} Active probing activity---at least against this server---has -subsided since 2016 +subsided since 2016. } \label{fig:active-probing-http} \end{figure} -\begin{table} -\begin{tabular}{ll} -August 2010 & Leif Nixon notices strange connections from China in his SSH logs~\cite{Nixon-sshprobes}. \\ -November 2011 & Leif Nixon publishes observations and speculation about the strange SSH connections~\cite{Nixon-sshprobes}. -\end{tabular} +Most of our experiments were designed around +exploring known Tor-related probe types: +plain Tor (without pluggable transports), obfs2, and obfs3. +The server log analysis, however, +unexpectedly turned up the other probe types. +The server log data set consisted of application-layer logs +from my personal web/mail server, +which was also a Tor bridge. +Application-layer logs lack much of the fidelity you would want +in a measurement experiment; +they do not have precise timestamps +or transport-layer headers, for example, +and web server logs truncate the client's payload +at a `\textbackslash 0' or `\textbackslash n' byte. +But they make up for all that with time coverage. 
+\autoref{fig:active-probing-http} shows the history of +probes received at my server since 2013 +(there were no probes before that, though the logs go back to 2010). +We started by searching the logs for likely probes: +those that passed the obfs2 test or otherwise looked like random garbage. +Then we looked at what else appeared in the logs +for the IP addresses that had sent those probes. +In a small fraction of cases, the other log lines +appeared to be genuine HTTP requests from legitimate clients; +but usually they were other probe-like payloads. +We continued this process, adding new classifiers +for likely probes, until reaching a fixed point. + +\section{Probing infrastructure} + +\begin{figure} +\centering +\includegraphics{figures/active-probing-tsval} \caption{ -Timeline of active probing. +TCP timestamp values from active probes. +% figures/active-probing.R: +% Total number of probes in http_tcpdump and https_tcpdump: 4239 +% Number of unique IP addresses in http_tcpdump and https_tcpdump: 3797 +Depicted are 4,239 probes from 3,797 distinct source IP addresses, +that nevertheless share only a few TCP timestamp sequences. +The shaded area +{\color{gray!60}\rule{1em}{1em}} +marks a gap in packet capture. } -\label{tab:active-probing-timeline} -\end{table} - -The work on active probing appeared in the 2015 research paper -``Examining How the Great Firewall Discovers Hidden Circumvention Servers''~\cite{Ensafi2015b}, -which I coauthored with -Roya Ensafi, Philipp Winter, Nick Feamster, Nicholas Weaver, Vern Paxson. - -Dingledine and Mathewson~\cite[\S~9.3]{tor-techreport-2006-11-001} -McLachlan and Hopper~\cite{McLachlan2009a} -Ling et~al.~\cite{Ling2012a} -Dingledine~\cite{tor-techreport-2011-10-002} +\label{fig:active-probing-tsval} +\end{figure} -breakwa11 documented an active-probing vulnerability -in Shadowsocks in 2015(?) -but no evidence of probing for it.
-\cite{BlessingStudio-why-do-shadowsocks-deprecate-ota} -\cite{ProgramThink-comment1508314948860} +The most salient feature of active probes, +when considered all together, +is the large number of source addresses +from which they are sent. +The 13,089 probes received by the HTTP and HTTPS ports +of my server came from 11,907 distinct IP addresses, +96\% of them appearing only once. +There is one extreme outlier, +the address 202.108.181.70\index{202.108.181.70 (active prober)}, +which by itself accounted for 2\% of the probes. +Among the address ranges are ones belonging to residential ISPs. + +Despite these many source addresses, +the sending of probes seems to be controlled +by only a few underlying processes. +The evidence for this lies in shared metadata patterns: +TCP initial sequence numbers +and TCP timestamps. +\autoref{fig:active-probing-tsval} shows patterns +in TCP timestamps +from about six months during which we ran a full packet capture +on the web server, in addition to application-layer logging. + +Wilde, and Winter and Lindskog, +had found that random ``garbage'' probes +were sent immediately after the client activity +that triggered them, +while Tor probes were batched and sent only every 15 minutes. +The Tor probing behavior had changed by 2015, +so that Tor probes were also sent immediately. + +We tried connecting back to the source address of probes. +Immediately after receiving a probe, +the probing IP address was completely unresponsive +to any stimulus we could think to apply. +In some cases though, within an hour the address would become responsive. +The responsive hosts looked like what you would expect to find +if you scanned such address ranges: +a variety of operating systems and open ports. 
\section{Fingerprinting the probers} -\dragons +A potential countermeasure against active probing is to have each proxy, +when it receives a connection, somehow decide whether the connection +comes from a legitimate client or a prober. +Of course, the right way to distinguish legitimate clients +is with proper cryptographic authentication, +whether at the transport layer (like BridgeSPA~\cite{Smits2011a}) +or at the application layer (like ScrambleSuit, obfs4, and Shadowsocks). +Failing that, one might hope to distinguish probers by their fingerprints, +idiosyncrasies in their implementation +that make them stand out from legitimate clients. +In the case of the Great Firewall, source IP address alone +does not suffice +because---apart from the special address 202.108.181.70\index{202.108.181.70 (active prober)}---the +probers' source addresses +come from many networks, including those where we might expect +legitimate clients to reside. +There are, however, certain fingerprints at the application layer. +While none of the ones we found were robust enough +to effectively exclude active probers, +they do hint at how the probing is implemented. + +The active probers have an unusual TLS fingerprint, +TLSv1.0 with a peculiar list of ciphersuites. +Tor probes sent only a VERSIONS cell~\cite[\S 4.1]{tor-spec}, +waited for a response, +then closed the connection. +The VERSIONS cell corresponded to a ``v2'' Tor handshake +that had been superseded since 2011 +(though one that was still in use by a small number of real clients). +The Tor probes described by Wilde in 2011 +went further into the protocol. +This difference hints at the possibility that at one time, +the active probers used a (possibly modified) Tor client, +and later switched to a lighter-weight custom implementation. + +The obfs2 probes were conformant with the protocol +and unremarkable except for the fact that sometimes +payloads were duplicated.
+obfs2 clients are supposed to use fresh randomness for each connection, +but a small fraction, about 0.65\%, of obfs2 probes +shared an identical payload with another probe. +The two probes in a pair came from different source IP addresses +and arrived within a second of each other. +The apparently separate probers therefore share some state +or a pseudorandom number generator. + +The obfs3 protocol calls for the client to send +random padding, the amount of padding being randomly distributed. +The active probers' implementation of the obfs3 protocol +gets the distribution wrong, +half the time sending too much padding. +This feature would be difficult to exploit for detection, though, +because it would rely on the application-layer proxy code +being able to infer TCP segment boundaries. + +The SoftEther probes seemed to have been based on an earlier +version of the official SoftEther client than was current at the time, +lacking an HTTP Host header. +They also differed from the official client +in that they were not preceded by a GET request. +The TLS fingerprint of the official client +is much different from that of the probers, +again hinting at a custom implementation. + +The AppSpot probes have a User-Agent header +that claims to be a specific version of Chromium; +however, the rest of the header +and the TLS fingerprint are inconsistent with Chromium. \chapter{Time delays in censors' reactions} @@ -1676,6 +2238,13 @@ at different network layers. \label{fig:domain-fronting} \end{figure} +Three places visible to the censor: +* DNS request +* SNI +* Server certificate +And one place not visible to the censor: +* Host header + \section{Related work on domain fronting} \cite{Koepsell2004a} @@ -1748,7 +2317,7 @@ carrying around 10 terabytes of user traffic each month.
Köpsell and Hillig were ahead of the game when in 2004 they posed -a hypothetical situation~\cite[\S~5.2]{Koepsell2004a}: +a hypothetical situation~\cite[\S 5.2]{Koepsell2004a}: ``Imagine that all web pages of the United States are only retrievable (from abroad) by sending encrypted request to one and only one special node. @@ -1774,8 +2343,10 @@ CloudTransport~\cite{Brubaker2014a}, \item First release of Orbot that had meek? \item Funding/grant timespans \item cost table +\item origin of the name \item ``Seeing Through Network-Protocol Obfuscation''~\cite{Wang2015a} October 2015 \item ``Towards Measuring Unobservability in Anonymous Communication Systems''~\cite{Tan2015} October 2015 +\item ``Research and Realization of Tor Anonymous Communication Identification Method Based on Meek''? 2016 \url{http://cdmd.cnki.com.cn/Article/CDMD-10004-1016120870.htm} \end{itemize} \begin{figure} @@ -1920,17 +2491,23 @@ Our final report, ``Blocking-resistant communication through high-value web services,'' was the kernel of our later paper on domain fronting. -% I began the process of getting -% meek integrated into Tor Browser in February 2014~\cite{tor-trac-10935}. -% A lot happened in the next few months, -% before the integration was finished in August 2014. +I~began the process of getting +meek integrated into Tor Browser in February 2014~\cite{tor-trac-10935}. +The initial integration would be completed in August 2014. +In the intervening time, along with much testing and debugging, +Chang Lan and I~wrote browser extensions +for Chrome and Firefox in order to hide +the TLS fingerprint of the base meek client. +I~placed meek's code in the public domain +(Creative Commons CC0~\cite{cc0}) +on February~8, 2014. +The choice of (non-)license +was a strategic decision to +encourage adoption by projects other than Tor. 
% I am grateful to the Tor Browser developers % Kathleen Brade, Georg Koppen, and Mark Smith; % and the volunteers on the \mbox{tor-qa} mailing list\index{tor-qa mailing list} % for their assistance during this time especially. -% Along the way, I extended the base meek client with -% a browser-based TLS camouflage module -% using the same Firefox core on which Tor Browser is based. In March 2014, I met some developers of Lantern at a one-day hackathon sponsored by OpenITP~\cite{openitp-usability-hackathon}. @@ -1998,7 +2575,7 @@ on September 15; numbers after that date are more trustworthy. In any case, the usage before this first release was tiny: the App Engine bill (\$0.12/GB, with one~GB free each day) -was less than \$1.00 per month for the first seven months of 2014~\cite[\S~Costs]{meek-wiki}. +was less than \$1.00 per month for the first seven months of 2014~\cite[\S Costs]{meek-wiki}. In August, the cost started to be nonzero every day, and would continue to rise from there. @@ -2034,7 +2611,7 @@ where it was accepted and appeared on June~30 at the symposium. The increasing use of domain fronting by various -circumvention tools begain to attract more attention. +circumvention tools began to attract more attention. A March 2015 article by Eva Dou and Alistair Barr in the Wall Street Journal~\cite{DouBarrWallStreetJournal} described domain fronting and ``collateral freedom'' in general, @@ -2070,7 +2647,7 @@ by having one of the backends not run as fast as possible. The deployment of domain fronting was being partly supported by a \$500/month grant from Google. Already the February 2015, the monthly cost for App Engine alone -began to exceed that amount~\cite[\S~Costs]{meek-wiki}. +began to exceed that amount~\cite[\S Costs]{meek-wiki}. 
In an effort to control costs, in May 2015 we began to rate-limit the App Engine and CloudFront bridges, deliberately slowing the service @@ -2119,13 +2696,13 @@ but it would not be available to the majority of users until the next release of which happened on August~11. Between September~30 and October~9, the CloudFront-fronted bridge was effectively down because of an expired TLS certificate. -When it rebooted on October~9, an adminstrative oversight -caused its Tor relay identity fingerprint changed---meaning +When it rebooted on October~9, an administrative oversight +caused its Tor relay identity fingerprint to change---meaning that clients expecting the former fingerprint would refuse to connect to it~\cite{tor-trac-17473}. The situation was not fully resolved until November~4 with the next release of Tor Browser: -cascading failured led to over a month of downtime. +cascading failures led to over a month of downtime. One of the benefits of building a circumvention system for Tor is the easy integration with Tor Metrics---the source of the user @@ -2894,8 +3471,11 @@ Analyzing Internet Censorship in Pakistan\cite{Aceto2016a} \backmatter -\printbibliography +\printbibliography[heading=bibintoc] +% \clearpage +% \phantomsection +% \addcontentsline{toc}{chapter}{\indexname} % \printindex \end{CJK}