diff --git a/thesis.tex b/thesis.tex index acd5896..dda49e2 100644 --- a/thesis.tex +++ b/thesis.tex @@ -11,7 +11,8 @@ \usepackage{graphicx} \usepackage{makeidx} \usepackage{microtype} -\usepackage[x11names]{xcolor} +\usepackage{relsize} +\usepackage[table,x11names]{xcolor} \usepackage{todonotes} % http://grad.berkeley.edu/academic-progress/dissertation/: @@ -39,6 +40,12 @@ \def\sectionautorefname{Section} \def\figurenutorefname{Figure} +% Use one sequence to number both figures and tables. +% https://stackoverflow.com/a/43266057 +\makeatletter +\let\c@table\c@figure +\let\ftype@table\ftype@figure +\makeatother % Disable metadata for reproducible PDF. % https://tex.stackexchange.com/a/313605 @@ -53,10 +60,12 @@ \makeindex \hyphenation{Web-RTC} \hyphenation{GAE-uploader} +\hyphenation{Go-Agent} \usepackage{yfonts} \newcommand{\dragons}{\bigskip\noindent\textfrak{\Large here be dragons:}\bigskip\noindent} +\newcommand{\placeholder}[1]{\colorbox{lightgray}{\parbox[c][#1][c]{\textwidth}{\centering placeholder for figure}}} \begin{document} @@ -707,6 +716,8 @@ is exactly such an irrational decision, at the greater societal level. \section{Content obfuscation strategies} \label{sec:obfuscation-strategies} +\index{blocking by content|(} + \begin{itemize} \item Sony thing on passive/active detection \cite[\S 5.1]{SladekBroseEANTC} \item relation to website fingerprinting---circumvention is potentially harder because you can't just use e.g. constant bitrate @@ -900,6 +911,7 @@ That means that even if a censor is able to build a profile for a particular server, it is not necessarily useful for detecting other server instances. +\index{blocking by content|)} \section{Address blocking resistance strategies} \label{sec:address-strategies} @@ -1589,7 +1601,7 @@ chance of false positives---whether the destination actually is a proxy. 
\begin{figure} \centering -\parbox[c][2in][c]{\textwidth}{\centering placeholder for figure} +\placeholder{2in} \caption{ The censor watches a connection between a client and a destination. If content inspection does not definitively indicate a circumvention protocol, @@ -2217,149 +2229,462 @@ of censors' priorities with respect to circumvention. \chapter{Domain fronting} \label{chap:domain-fronting} -\dragons - -My most influential contribution -to the world of circumvention is -my research on domain fronting. -While the basic idea is not mine, -the research I led -and the code I wrote -helped domain fronting become the -ubiquitous tool it is today. +Domain fronting is a general-purpose circumvention technique +based on HTTPS. +It disguises the true destination of a client's messages +by routing them through a large web server or +content delivery network that hosts +many web sites. +The messages appear to go not to their actual destination +but to some \emph{front domain}, +one whose blocking would result in high collateral damage. +Because (with certain caveats) the censor cannot distinguish +domain-fronted HTTPS requests from ordinary HTTPS requests, +it cannot block circumvention without blocking the front domain. +Domain fronting primarily addresses the problem +of detection by address, +but also deals with detection by content and active probing. +Domain fronting is today an important component +of many circumvention systems. + +The core idea of domain fronting is the +use of different domain names at different protocol layers. 
+When you make an HTTPS request, the domain name +of the server you're trying to access normally appears in three places +that are visible to the censor: +\begin{itemize} +\item the DNS query +\item the client's TLS Server Name Indication (SNI) extension~\cite[\S 3]{rfc6066} +\item the server's TLS certificate~\cite[\S 7.4.2]{rfc5246} +\end{itemize} +and in one place that is not visible to the censor, +because it is encrypted: +\begin{itemize} +\item the HTTP Host header~\cite[\S 5.4]{rfc7230} +\end{itemize} +In a normal request, the same domain name appears in all four places, +and all of them except for the Host header afford the censor +an easy basis for blocking by address. +The only difference in a domain-fronted request is that +the domain name that appears in +the Host header, on the ``inside'' of the request, +is not the same +as the domain that appears in the other places, +on the ``outside.'' +\autoref{fig:domain-fronting} illustrates. \begin{figure} \centering -\includegraphics[width=3in]{figures/domain-fronting} +\includegraphics[height=1.5in]{figures/domain-fronting} \caption{ Domain fronting uses different names -at different network layers. +at different protocol layers. +The forbidden destination domain is hidden under +ordinary TLS encryption. +The censor only sees a front domain, +one chosen to be expensive to block. } \label{fig:domain-fronting} \end{figure} -Three places visible to the censor: -* DNS request -* SNI -* Server certificate -And one place not visible to the censor: -* Host header - -\section{Related work on domain fronting} - -\cite{Koepsell2004a} -Bryce Boe -GoAgent -flashproxy - - - -Domain fronting assumes a rather strong censor model, -essentially equivalent to the state of the art of national censors -at the time of its popularization. -That is, a censor that can block IP addresses and domain names, -that can filter plaintext HTTP, -can fingerprint protocol implementations. 
-The main censor capabilities not provided for -are probabilistic classification by traffic flow characteristics, -and high-collateral-damage blocking of HTTPS on important web servers. -What I find most intellectually compelling -about domain fronting research -is that is finally begins to transcend -the ``cat-and-mouse'' paradigm that has plagued -thinking around circumvention, -and to put blocking resistance on a scientific basis. -By this I mean that one can state assumptions, -and consequences that hold as long as the assumptions are true. -For example, we do not make claims such as -``domain fronting is unblockable''; rather, -we may state hypotheses and consequents: -``if fronting through a domain with sufficient collateral damage, -such that the censor is unwilling to block it, -and if the censor does not find some side channel -that distinguishes fronted from non-fronted traffic, -then the communication will be unblocked.'' -This kind of thinking, -that of weighing censors' \emph{costs} and \emph{capabilities}, -underlies my thinking about threat modeling. - -Like flash proxy, domain fronting is primarily targeted at -the problem of address blocking -(though it is effective against content blocking and active probing as well). -The core idea is the use of different -domain names at different layers of communication. -The ``outside'' layers, those visible to the censor, -contain an innocuous ``front'' domain name, ideally -one that is hard to block because of the value of the -services behind it. -The ``inside'' layer, invisible to the censor under encryption, -contains the true, presumably censored, destination. -An intermediate server, whose name is the front domain name, -removes the outer layer of encryption and forwards the information -to the covert destination. -There are a number of important services that -support domain fronting, mainly cloud providers -and content delivery networks. 
-On top of this basic machinery, -it is relatively easy to build a general-purpose -covert bidirectional communications channel, -one that can even be made reasonably efficient. - -I wrote and continue to maintain the code of meek, -a circumvention transport for Tor based on domain fronting. -It first appeared in Tor Browser in October 2014, -and continues operation to the present. -My code has been forked and incorporated by other circumvention projects, -notably including Psiphon and Lantern, -with whom I continue to collaborate. -Today, meek is Tor's second-most-used transport, -carrying around 10 terabytes of user traffic each month. - -Köpsell and Hillig -were ahead of the game when in 2004 they posed -a hypothetical situation~\cite[\S 5.2]{Koepsell2004a}: +The SNI extension and the Host header serve similar purposes: +they both enable virtual hosting, +where one server handles requests for multiple domains. +Both fields allow the client to inform the server +of which domain it wants to access. +The SNI works at the TLS layer, +telling the server which certificate to send; +and the Host header works at the HTTP layer, +telling the server what contents to serve. +It is something of an accident that these +partially redundant fields both exist. +Before TLS, virtual hosting only required the Host header. +When HTTP is combined with TLS, +the client cannot send the Host header until the TLS handshake is complete, +and the TLS handshake cannot complete without the server knowing +which certificate to send. +The SNI extension resolves the deadlock by sending +the domain name in plaintext at the beginning of the handshake. +Domain fronting takes advantage of decoupling +the two normally coupled values. +It relies on the server decrypting the TLS layer +and throwing it away, +then routing requests internally according to the Host header. + +Virtual hosting, in the form of content delivery networks (CDNs), is now common. 
+A~CDN works by placing an ``edge server'' between +the client and the destination, called an ``origin server'' in this context. +When an edge server receives a request, +it forwards the request to the origin server according to the Host header. +The edge server receives the response, +optionally caches it, and forwards it back to the client. +The edge server is effectively a proxy: +the client never contacts the destination directly. +The contents of the client's messages, +as well as their true destination, +are protected by TLS encryption. +If the censor active-probes the server, +all it gets is whatever the edge server would return normally. +The censor may block edge servers or the front domain, +but only at the cost of blocking all other, +non-circumvention-related traffic +to the CDN or domain, +with the collateral damage that entails. + +Domain fronting may be an atypical use of HTTPS, +but it is not a way to get free CDN service. +The CDN will not forward requests to arbitrary destinations, +only to the domains of its customers. +Setting up domain fronting requires +paying for CDN service---and the costs can be high, +as \autoref{sec:meek-history} shows. + +It might seem at first that domain fronting is +only useful for accessing HTTPS resources, +and only when they are hosted on a service that supports fronting. +But extending to general-purpose circumvention +only requires a minor extra step: +host an HTTP-based tunneling proxy on the web service in question. +Domain fronting shields the address of the proxy, +which then provides access to arbitrary destinations. +HTTP tunneling underlies meek, +a circumvention system based on domain fronting, +discussed further in \autoref{sec:meek-impl}. + +One of the best features of domain fronting is that it does +not require any secret information, +completely bypassing the proxy distribution problem +(\autoref{sec:address-strategies}). 
+The address of the CDN edge server, +the address of the proxy hidden behind it, +the fact that some fraction of traffic to the edge server is circumventing---all +of these may be known by the censor. +This is not to say, of course, that domain fronting +is impossible to block---as always, +a censor's capacity to block depends on its +tolerance for collateral damage. +But the lack of secrecy makes the censor's choice especially stark: +either allow circumvention, or block a domain. +This is how we should think of all circumvention: +not ``can it be blocked,'' +but ``what does it cost to block.'' + + +\section{Work related to domain fronting} + +Neither I~nor my coauthors invented the technique +of domain fronting. +We did, however, give it a name, +popularize its use, +and produce an important implementation. +As far as I know, +the first implementation of domain fronting +in a circumvention system was in GoAgent circa 2012. +GoAgent employed a variant where +the SNI is omitted completely, +rather than being faked. +% GoAgent 2.0 began sending HTTPS requests: +% b4ab1f83f57b91eda34ae1743021fbb60ecd2f60 is the first bad commit +% commit b4ab1f83f57b91eda34ae1743021fbb60ecd2f60 +% Author: Phus Lu +% Date: Wed Aug 29 21:30:46 2012 +0800 +% +% merge 2.0 code +Earlier in 2012, +Bryce Boe wrote a blog post~\cite{Boe2012a} +outlining how to use Google App Engine as a proxy, +and suggested that sending a false SNI could +bypass SNI whitelisting. +Way back in 2004, +in an era when HTTPS and CDNs were less common +than they are today, +Köpsell and Hillig foresaw +the possibilities~\cite[\S 5.2]{Koepsell2004a}: ``Imagine that all web pages of the United States are only -retrievable (from abroad) by sending encrypted request to +retrievable (from abroad) by sending encrypted requests to one and only one special node. 
Clearly this idea belongs to the `all or nothing' concept because a blocker has to block all requests to this node.'' -The situation they describe---one server -hosting many sites, encrypted and indistinguishably---is -not far off from what exists today with CDNs and HTTPS. -Domain fronting removes the last remaining easy distinguisher, -the domain name that appears in the clear. -Domain fronting appeared in the 2015 research paper -``Blocking-resistant communication through domain fronting''~\cite{Fifield2015a-local}, -which I coauthored with Chang Lan, Rod Hynes, Percy Wegmann, and Vern Paxson. +Refraction networking is the name for +a class of circumvention techniques, +similar in spirit to domain fronting. +The idea was introduced in 2011 with the designs +Cirripede~\cite{Houmansadr2011a}, +CurveBall~\cite{Karlin2011a}, +and Telex~\cite{Wustrow2011a}. +In refraction networking, +it is network routers that act as proxies, +lying at the middle of network paths +rather than at the ends. +The client ``tags'' its messages +in a way that the censor cannot detect +(analogously to the way the Host header +is encrypted in domain fronting). +When the router finds a tagged message, +it shunts the message away from its nominal destination +and towards some other, covert destination. +Refraction networking +derives its blocking resistance +from the collateral damage that would result +from blocking the cover channel (typically TLS) +or the refraction-capable network routers. +Refraction networking has the potential +to be the basis of exceptionally high-performance circumvention, +as a test deployment in Spring 2017 demonstrated~\cite{Frolov2017a}. + +CloudTransport~\cite{Brubaker2014a}, proposed in 2014, +is similar to domain fronting in many respects. +It uses HTTPS to a shared server +(in this case a cloud storage server). 
+The specific storage area being accessed---what +the censor would like to know---is encrypted, +so the censor cannot block CloudTransport +without blocking the storage service completely. + +In 2015 I~published a paper on domain fronting~\cite{Fifield2015a-local} +with Chang Lan, Rod Hynes, Percy Wegmann, and Vern Paxson. +In it, we described the experience of deploying +domain fronting on Tor, +Lantern~\cite{lantern}, +and Psiphon~\cite{psiphon}, +and began an investigation of the +side channels, such as packet size and timing, +that a censor might use +to detect domain fronting. +The Tor deployment, called meek, +is the subject of Sections~\ref{sec:meek-impl} +and~\ref{sec:meek-history}. + +Later in 2015 there were a couple of papers on the detection +of circumvention transports, including meek. +Tan et~al.~\cite{Tan2015a} measured the +Kullback--Leibler divergence\index{Kullback--Leibler divergence}\index{relative entropy|see{Kullback--Leibler divergence}} +between the distributions of packet size and packet timing +in different protocols. +(The paper is written in Chinese +and my understanding of it is based on +an imperfect translation.) +Wang et~al.~\cite{Wang2015a} +built classifiers for meek +among other protocols +using entropy, timing, +and transport-layer features. +They emphasized practical classifiers +and tested their false-classification rates +against real traffic traces. + + +% \section{Fronting-capable web services} +% \label{sec:fronting-services} +% +% \dragons + + +\section{A pluggable transport for Tor} +\label{sec:meek-impl} + +I~am the main author and maintainer of meek, +a pluggable transport for Tor based on domain fronting. +meek uses domain-fronted HTTP POST requests +as the primitive operation to send or receive +chunks of data up to a few kilobytes in size. +The intermediate CDN forwards requests +to a bridge. 
+Auxiliary programs on the client and the bridge +convert between a sequence of HTTP requests +and the byte stream expected by Tor. +The Tor processes at either end are oblivious +to the domain-fronted transport between them. +\autoref{fig:meek} shows how the components +and protocol layers interact. + +\begin{figure} +\centering +\includegraphics[width=\textwidth]{figures/meek-architecture} +\caption{ +Putting it together: +domain fronting as the basic tool in a circumvention system. +The CDN acts as a limited sort of proxy, +capable of proxying only to destinations +within its own network +(one of which we control). +The node we control is a Tor bridge, +equipped with a plugin to interface +between the HTTP tunnel and the Tor protocol. +The bridge acts as a general-purpose proxy, +granting access to any destination. +} +\label{fig:meek} +\end{figure} + +When the client has something to send, +it issues a POST request with the data in the body. +Because in HTTP/1.1 there is no way +for an HTTP server to preemptively push data +to a client, +the meek server buffers data waiting to be sent +until it receives a client's request, +then includes the buffered data in the body of the HTTP response. +The client must poll the server periodically, +even when it has nothing to send, +to enable the server to send whatever buffered data it may have. +The meek server must handle multiple simultaneous clients. +Each client, at the beginning of a session, +generates a random session identifier string, +and includes it with its requests +in a special X-Session-Id HTTP header. +The server maintains separate connections +to the local Tor process for each session identifier. +\autoref{fig:meek-tunnel} shows a pattern of +request--response pairs. 
+ +\begin{figure} +\centering +\setlength{\topsep}{0pt} +\setlength{\partopsep}{0pt} +\begin{tabular}{ccc} +meek client & & meek server \\ +\cellcolor{lightgray} +\begin{minipage}{2in} +\footnotesize +\vspace{0.5em} +\begin{verbatim} +POST / HTTP/1.1 +Host: forbidden.example +X-Session-Id: cbIzfhx1Hn+ +Content-Length: 517 + +\x16\x03\x01\x02... +\end{verbatim} +\vspace{0.5em} +\end{minipage} & $\rightarrow$ & \\ + + & $\leftarrow$ & +\cellcolor{lightgray} +\begin{minipage}{2in} +\footnotesize +\vspace{0.5em} +\begin{verbatim} +HTTP/1.1 200 OK +Content-Length: 739 + +\x16\x03\x03\x00... +\end{verbatim} +\vspace{0.5em} +\end{minipage} \\ + +\cellcolor{lightgray} +\begin{minipage}{2in} +\footnotesize +\vspace{0.5em} +\begin{verbatim} +POST / HTTP/1.1 +Host: forbidden.example +X-Session-Id: cbIzfhx1Hn+ +Content-Length: 0 + +\end{verbatim} +\vspace{0.5em} +\end{minipage} & $\rightarrow$ & \\ + + & $\leftarrow$ & +\cellcolor{lightgray} +\begin{minipage}{2in} +\footnotesize +\vspace{0.5em} +\begin{verbatim} +HTTP/1.1 200 OK +Content-Length: 75 + +\x14\x03\x03\x00... +\end{verbatim} +\vspace{0.5em} +\end{minipage} \\ +\end{tabular} +\caption{ +The HTTP-based framing protocol of meek. +Each request and response is domain-fronted. +The second POST is an example of an empty polling request, +sent only to give the server an opportunity to send +data downstream. +} +\label{fig:meek-tunnel} +\end{figure} -CloudTransport~\cite{Brubaker2014a}, +Even with domain fronting to hide the destination request, +a censor may try to distinguish circumventing HTTPS connections +by their TLS fingerprint. +TLS implementations have a lot of latitude in composing +their handshake messages, enough that it is possible to +distinguish different TLS implementations +through passive observation. +For example, the Great Firewall had used +Tor's TLS fingerprint for detection~\cite{tor-trac-4744}. +For this reason, meek strives to make its TLS fingerprint +look like that of a browser. 
+It does this by relaying its HTTPS requests through +a local headless browser (which is completely separate from +the browser that the user interacts with). + +meek first appeared in Tor Browser in October 2014, +and continues to be used to the present. +It is Tor's second-most-used transport +behind obfs4. +The next section is a detailed history of deployment. \section{An unvarnished history of meek deployment} +\label{sec:meek-history} \begin{itemize} \item First release of Orbot that had meek? \item Funding/grant timespans -\item cost table \item origin of the name -\item ``Seeing Through Network-Protocol Obfuscation''~\cite{Wang2015a} October 2015 -\item ``Towards Measuring Unobservability in Anonymous Communication Systems''~\cite{Tan2015} October 2015 \item ``Research and Realization of Tor Anonymous Communication Identification Method Based on Meek''? 2016 \url{http://cdmd.cnki.com.cn/Article/CDMD-10004-1016120870.htm} \end{itemize} -\begin{figure} +\begin{figure}[p] \centering \includegraphics{figures/metrics-clients-meek} \caption{ Estimated mean number of concurrent users of the meek pluggable transport, with selected events. +This graph is an updated version of +Figure~5 from the 2015 paper +``Blocking-resistant communication through domain fronting''~\cite{Fifield2015a-local}. } \label{fig:metrics-clients-meek} \end{figure} +\begin{table}[p] +\centering +\input{figures/tab-meek-costs} +\caption{ +Costs for running meek, +compiled from my monthly reports~\cite[\S Costs]{meek-wiki}. +(The reference has minor +arithmetic errors that are corrected here.) +meek ran on three different web services: +Google App Engine, Amazon CloudFront, and Microsoft Azure. +The notation `{\color{gray} ---}' means meek wasn't deployed +on that service in that month; +for example, we stopped using Google after May 2016 +following the suspension of the service +(see discussion on p.~\pageref{para:meek-suspension}). 
+The notation `{\color{gray} ?}' marks the months +after I~stopped handling the invoices personally. +I~don't know the costs for those months, +so certain totals are marked with `\raisebox{.3ex}{\smaller +}' +to indicate that they are higher +than what is shown, +but I~don't know by how much. +} +\label{tab:meek-costs} +\end{table} + Fielding a circumvention and keeping it running is full of unexpected challenges. At the time of the publication of the domain fronting paper~\cite{Fifield2015a-local} in 2015, meek had been deployed only a year and a half. @@ -2578,6 +2903,8 @@ the App Engine bill (\$0.12/GB, with one~GB free each day) was less than \$1.00 per month for the first seven months of 2014~\cite[\S Costs]{meek-wiki}. In August, the cost started to be nonzero every day, and would continue to rise from there. +See \autoref{tab:meek-costs} for a history +of monthly costs. Tor Browser 4.0~\cite{tor-blog-tor-browser-40-released} was released on October 15, 2014. @@ -2704,6 +3031,31 @@ The situation was not fully resolved until November~4 with the next release of Tor Browser: cascading failures led to over a month of downtime. +In October 2015 there appeared a couple of research papers +that investigated meek's susceptibility to detection +via side channels. +Tan et~al.~\cite{Tan2015a} +(including Binxing Fang, the ``father of the Great Firewall'') +used Kullback--Leibler divergence\index{Kullback--Leibler divergence} +to quantify the differences between protocols, +with respect to packet size +and interarrival time distributions. +Their paper is written in Chinese, +so I~had to read it in machine translation. +Wang et~al.~\cite{Wang2015a} +published a more comprehensive report +on detecting meek (and other protocols), +emphasizing practicality and precision. +They showed that some previously proposed +detections would have untenable false-positive rates, +and constructed a classifier for meek +based on entropy and timing features. 
+It's worth noting that since the first reported +efforts to block meek in 2016, +censors have not used techniques like those +described in these papers, +as far as we can tell. + One of the benefits of building a circumvention system for Tor is the easy integration with Tor Metrics---the source of the user number estimates in this section. @@ -2753,6 +3105,8 @@ incrementing the final octet from\ .200 to~.201, causing it to become unblocked. I am aware of no similar incidents before or since. +\phantomsection +\label{para:meek-suspension} The next surprise was on May~13, 2016. meek's App Engine backend stopped working and I got a notice: \begin{quote} @@ -2767,7 +3121,7 @@ to the terms of service that had happened the previous year---but the true cause was unexpected. I tried repeatedly to contact Google and learn the nature of the ``general'' violation, but was stonewalled. -None of my inquiries received so much as an acknowlegement. +None of my inquiries received so much as an acknowledgement. It as not until June~18 that I got some insight as to what happened, through an unofficial channel. @@ -2953,6 +3307,9 @@ Just as before, we did not find an explanation for the increase. Between July~29 and August~17, meek-amazon had another outage due to an expired TLS certificate. +\todo[inline]{Blocking of look-like-nothing, and success of domain fronting +during the 19th Chinese Communist Party Congress} + \chapter{Snowflake} \label{chap:snowflake}