Saturday, July 30, 2011

KISSmetrics and life of an ETag

When I read "Researchers Expose Cunning Online Tracking Service That Can’t Be Dodged" on Slashdot, many commenters there thought that disabling JavaScript could prevent tracking, because the disclosure of how KISSmetrics works mentions that it serves two JavaScript files. However, JavaScript is a red herring here. The magic happens with the ETag, a cache validation feature of HTTP whose intended use is to speed up the loading of websites you have already visited by downloading only the content that has changed since your last visit.

When a web browser first requests a URL, the server responds with an entity tag (ETag) that identifies the version of the resource served at that URL. Whenever the resource is modified, its ETag is supposed to change. Subsequent browser requests include a conditional header (If-None-Match), essentially telling the server "if this URL has not changed (same ETag), then don't bother sending me the resource." A web browser cache remembers the ETag for as long as the resource is in the cache.
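
The validation exchange described above can be sketched in a few lines of Python. This is only an illustration of the intended, benign use of ETags: the function names `make_etag` and `respond` are made up for this example, and deriving the tag from a hash of the body is an assumption (servers are free to pick any scheme).

```python
import hashlib

def make_etag(body):
    """Derive a quoted entity tag from the body (hypothetical scheme)."""
    return '"' + hashlib.sha1(body).hexdigest() + '"'

def respond(body, if_none_match=None):
    """Return (status, headers, payload) for a GET on a single resource."""
    etag = make_etag(body)
    if if_none_match == etag:
        # The client's cached copy is still current: send no body.
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body

# First visit: no validator yet, so the full resource is sent.
status, headers, payload = respond(b"hello")

# Revisit: the browser replays the tag and gets an empty 304.
status2, _, payload2 = respond(b"hello", if_none_match=headers["ETag"])

# After the resource changes, the old tag no longer matches.
status3, headers3, _ = respond(b"hello, world", if_none_match=headers["ETag"])

print(status, status2, status3)
```

Nothing in this exchange requires the tag to describe the content. A server can just as well hand every visitor a different tag, which is what makes it usable as an identifier.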

NoScript will not block ETags. In fact, an ETag can be attached to any resource: an HTML file, an image, etc. The browser's incognito mode may not be sufficient if it shares the browser cache with non-incognito content (as it will send the same ETag). The only way to defeat ETag tracking is to disable or clear the browser cache, but this too may not be sufficient (more about this later).

KISSmetrics uses a combination of techniques. To find out how, I hand-crafted an HTTP request as follows and saved it in a text file with DOS line endings ("\r\n"). Note that the file needs a trailing blank line, which marks the end of the HTTP request header.
$ cat i.txt
GET /i.js HTTP/1.1
Host: i.kissmetrics.com

$ hexdump -C i.txt
00000000  47 45 54 20 2f 69 2e 6a  73 20 48 54 54 50 2f 31  |GET /i.js HTTP/1|
00000010  2e 31 0d 0a 48 6f 73 74  3a 20 69 2e 6b 69 73 73  |.1..Host: i.kiss|
00000020  6d 65 74 72 69 63 73 2e  63 6f 6d 0d 0a 0d 0a     |metrics.com....|
0000002f
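
For readers who would rather not fight their text editor over line endings, here is a small Python sketch that builds the same request bytes. The helper name `build_request` is hypothetical; the two header lines are the ones shown above.

```python
REQUEST_LINES = ["GET /i.js HTTP/1.1", "Host: i.kissmetrics.com"]

def build_request(lines):
    """Join header lines with CRLF and append the blank line that ends the header."""
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")

request = build_request(REQUEST_LINES)

print(len(request))                   # 47, i.e. 0x2f
print(request.endswith(b"\r\n\r\n"))  # True

# To save it for use with nc, uncomment:
# with open("i.txt", "wb") as f:
#     f.write(request)
```

The length of 47 bytes matches the final offset (0000002f) in the hexdump.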
Now make a request to be tracked.
$ cat i.txt | nc i.kissmetrics.com 80
HTTP/1.1 200 OK
Cache-Control: max-age=864000000, public
Date: Sat, 30 Jul 2011 19:49:32 GMT
ETag: "xy5cdaPdlMSI4u2xv8rndfudaAE"
Expires: Wed, 15 Dec 2038 19:49:32 GMT
Last-Modified: Sat, 30 Jul 2011 18:49:32 GMT
P3P: CP="NOI CURa ADMa DEVa TAIa OUR IND UNI INT"
Server: nginx
Set-Cookie: _km_cid=xy5cdaPdlMSI4u2xv8rndfudaAE;expires=Wed, 15 Dec 2038 19:49:32 GMT;path=/;
Content-Type: application/x-javascript
Content-Length: 79

var KMCID='xy5cdaPdlMSI4u2xv8rndfudaAE';if(typeof(_kmil) == 'function')_kmil();
Notice that the same identity is presented as ETag, a cookie, as well as a variable in the JavaScript. To my surprise, if I run the same command again, I get the same ETag.
$ cat i.txt| nc i.kissmetrics.com 80
HTTP/1.1 200 OK
Cache-Control: max-age=864000000, public
Date: Sat, 30 Jul 2011 19:49:32 GMT
ETag: "xy5cdaPdlMSI4u2xv8rndfudaAE"
Expires: Wed, 15 Dec 2038 19:49:32 GMT
Last-Modified: Sat, 30 Jul 2011 18:49:32 GMT
P3P: CP="NOI CURa ADMa DEVa TAIa OUR IND UNI INT"
Server: nginx
Age: 298
Content-Type: application/x-javascript
Content-Length: 79

var KMCID='xy5cdaPdlMSI4u2xv8rndfudaAE';if(typeof(_kmil) == 'function')_kmil();
This time, notice that it no longer tries to set a cookie, but it somehow remembers my ETag and adds an Age header. Running the same command again, I get:
$ cat i.txt| nc i.kissmetrics.com 80
HTTP/1.1 200 OK
Cache-Control: max-age=864000000, public
Date: Sat, 30 Jul 2011 19:49:32 GMT
ETag: "xy5cdaPdlMSI4u2xv8rndfudaAE"
Expires: Wed, 15 Dec 2038 19:49:32 GMT
Last-Modified: Sat, 30 Jul 2011 18:49:32 GMT
P3P: CP="NOI CURa ADMa DEVa TAIa OUR IND UNI INT"
Server: nginx
Age: 542
Content-Type: application/x-javascript
Content-Length: 79

var KMCID='xy5cdaPdlMSI4u2xv8rndfudaAE';if(typeof(_kmil) == 'function')_kmil();
Notice the longer age now.
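
As a side calculation, the freshness lifetime in these headers can be checked with a little Python. The values are copied from the response above; the variable names are mine.

```python
from datetime import datetime, timedelta

date = datetime(2011, 7, 30, 19, 49, 32)   # Date header (GMT)
max_age = timedelta(seconds=864000000)     # Cache-Control: max-age
expires = date + max_age                   # should agree with the Expires header

print(max_age.days)   # 10000
print(expires)        # 2038-12-15 19:49:32

# Age values seen in the second and third responses, in seconds.
for age in (298, 542):
    print(age, timedelta(seconds=age) < max_age)
```

So max-age is exactly 10,000 days, and it agrees with the Expires date of 15 Dec 2038. An Age of a few hundred seconds is nowhere near that limit, so anything that stores this response is entitled to treat it as fresh for roughly 27 years.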

Now, why does this result surprise me? If I hand-craft an HTTP request, the request should be perfectly stateless. I expect to get a different ETag every time I run the same command. But I am getting the same one every time, as if I'm still being tracked.

It turns out there is a co-conspirator. I'm using a mobile wireless connection right now, and there is a transparent proxy between my computer and KISSmetrics. The transparent proxy is part of the network infrastructure; it lessens the load on my provider's connection to other networks by sharing a cache among my provider's users. Evidence for the existence of this transparent proxy is the difference in server behavior. If I switch to a network without the transparent proxy, I get this:
$ cat i.txt | nc i.kissmetrics.com 80
HTTP/1.1 503 Service Unavailable.
Content-length:0

The web server, nginx, wants the connection to remain open at least until it starts sending the response, but the transparent proxy did not require this. This is probably a side effect of nginx's HTTP pipelining support. It's not too difficult to work around this problem by slightly modifying the command.
$ cat i.txt /dev/tty | nc i.kissmetrics.com 80
HTTP/1.1 200 OK
Cache-Control: max-age=864000000, public
Content-Type: application/x-javascript
Date: Sat, 30 Jul 2011 21:55:58 GMT
ETag: "ysdfEF8mCndrvOxrcnzF4tysDss"
Expires: Wed, 15 Dec 2038 21:55:58 GMT
Last-Modified: Sat, 30 Jul 2011 20:55:58 GMT
P3P: CP="NOI CURa ADMa DEVa TAIa OUR IND UNI INT"
Server: nginx
Set-Cookie: _km_cid=ysdfEF8mCndrvOxrcnzF4tysDss;expires=Wed, 15 Dec 2038 21:55:58 GMT;path=/;
Content-Length: 79
Connection: keep-alive

var KMCID='ysdfEF8mCndrvOxrcnzF4tysDss';if(typeof(_kmil) == 'function')_kmil();
After the server responds, I hit Ctrl-D to end the connection. I now get a fresh tag (as well as a cookie) every time.
$ cat i.txt /dev/tty | nc i.kissmetrics.com 80
HTTP/1.1 200 OK
...
ETag: "ikLBYzrQaWhFzc5lsacDhni3ftI"
...
Set-Cookie: _km_cid=ikLBYzrQaWhFzc5lsacDhni3ftI;expires=Wed, 15 Dec 2038 22:03:26 GMT;path=/;
Content-Length: 79
Connection: keep-alive

var KMCID='ikLBYzrQaWhFzc5lsacDhni3ftI';if(typeof(_kmil) == 'function')_kmil();
$ cat i.txt /dev/tty | nc i.kissmetrics.com 80
HTTP/1.1 200 OK
...
ETag: "fsxiUZH0lIdITI0YA4-uxXslRMQ"
...
Set-Cookie: _km_cid=fsxiUZH0lIdITI0YA4-uxXslRMQ;expires=Wed, 15 Dec 2038 22:03:33 GMT;path=/;
Content-Length: 79
Connection: keep-alive

var KMCID='fsxiUZH0lIdITI0YA4-uxXslRMQ';if(typeof(_kmil) == 'function')_kmil();
I abbreviated the irrelevant headers.
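
To make the "same identity in three places" observation concrete, here is a Python sketch that pulls the identifier out of the ETag header, the Set-Cookie header, and the JavaScript body of a raw response. The response text is abridged from the last transcript above; the function name `extract_ids` and the regular expressions are my own and only cover the response shape shown here.

```python
import re

RESPONSE = (
    'HTTP/1.1 200 OK\r\n'
    'ETag: "fsxiUZH0lIdITI0YA4-uxXslRMQ"\r\n'
    'Set-Cookie: _km_cid=fsxiUZH0lIdITI0YA4-uxXslRMQ;'
    'expires=Wed, 15 Dec 2038 22:03:33 GMT;path=/;\r\n'
    'Content-Length: 79\r\n'
    '\r\n'
    "var KMCID='fsxiUZH0lIdITI0YA4-uxXslRMQ';"
    "if(typeof(_kmil) == 'function')_kmil();"
)

def extract_ids(raw):
    """Return the identifier found in the ETag, the cookie, and the script."""
    head, _, body = raw.partition("\r\n\r\n")
    found = {
        "etag": re.search(r'^ETag: "([^"]+)"', head, re.M),
        "cookie": re.search(r'^Set-Cookie: _km_cid=([^;]+)', head, re.M),
        "script": re.search(r"KMCID='([^']+)'", body),
    }
    return {name: (m.group(1) if m else None) for name, m in found.items()}

ids = extract_ids(RESPONSE)
print(ids)
print(len(set(ids.values())) == 1)  # True: all three carry the same value
```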

Further, by modifying the HTTP request, I could get KISSmetrics to replay the ETag. A cookie is added to the HTTP request:
$ cat j.txt
GET /i.js HTTP/1.1
Host: i.kissmetrics.com
Cookie: _km_cid=fsxiUZH0lIdITI0YA4-uxXslRMQ

$ cat j.txt /dev/tty | nc i.kissmetrics.com 80
HTTP/1.1 200 OK
Cache-Control: max-age=864000000, public
Content-Type: application/x-javascript
Date: Sat, 30 Jul 2011 22:08:35 GMT
ETag: "fsxiUZH0lIdITI0YA4-uxXslRMQ"
Expires: Wed, 15 Dec 2038 22:08:35 GMT
Last-Modified: Sat, 30 Jul 2011 21:08:35 GMT
P3P: CP="NOI CURa ADMa DEVa TAIa OUR IND UNI INT"
Server: nginx
Set-Cookie: _km_cid=fsxiUZH0lIdITI0YA4-uxXslRMQ;expires=Wed, 15 Dec 2038 22:08:35 GMT;path=/;
Content-Length: 79
Connection: keep-alive

var KMCID='fsxiUZH0lIdITI0YA4-uxXslRMQ';if(typeof(_kmil) == 'function')_kmil();
What is interesting is that if I perform the If-None-Match query, KISSmetrics doesn't try to set the cookie back. I thought it should.
$ cat k.txt
GET /i.js HTTP/1.1
Host: i.kissmetrics.com
If-None-Match: "fsxiUZH0lIdITI0YA4-uxXslRMQ"

$ cat k.txt /dev/tty | nc i.kissmetrics.com 80
HTTP/1.1 304 Not Modified
Date: Sat, 30 Jul 2011 22:11:20 GMT
Server: nginx
Connection: keep-alive

This exercise reveals why the ETag is such a clever technique for tracking visitors. Because it leverages the transparent proxy cache, the end user has no way to opt out of tracking. In fact, the web browser cache is simply a leaf node of a larger Internet content distribution cache framework. By using ETags, KISSmetrics gets your internet service provider to do its dirty work. You can still be tracked through no fault of your web browser. As to who is responsible for tracking you, the distinction is blurred.
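
The interplay between the tracking server and a shared cache can be modeled with a toy simulation in Python. Everything here is a simplified assumption for illustration: `Origin` and `SharedCache` are hypothetical classes, the identifiers are counters rather than random strings, and the cache keys on the URL alone and drops Set-Cookie from the stored copy, which is merely consistent with the transcripts in this post.

```python
import itertools

class Origin:
    """Stand-in for the tracking server: new ID unless a cookie is replayed."""
    def __init__(self):
        self._counter = itertools.count(1)

    def get(self, cookie=None):
        ident = cookie if cookie is not None else "id-%d" % next(self._counter)
        return {"ETag": '"%s"' % ident, "Set-Cookie": "_km_cid=%s" % ident}

class SharedCache:
    """Stand-in for a transparent proxy shared by many users."""
    def __init__(self, origin):
        self.origin = origin
        self.store = {}

    def get(self, url, cookie=None):
        if url not in self.store:
            response = self.origin.get(cookie)
            # Keep a shared copy; later users get it regardless of who they are.
            self.store[url] = {"ETag": response["ETag"]}
            return response
        return self.store[url]

origin = Origin()
proxy = SharedCache(origin)

first = proxy.get("/i.js")                      # primes the cache
second = proxy.get("/i.js")                     # a different user, same tag
replay = proxy.get("/i.js", cookie="replayed")  # cookie never reaches the origin

direct_a = origin.get()                         # without the proxy: fresh IDs
direct_b = origin.get()

print(first["ETag"], second["ETag"], replay["ETag"])
print(direct_a["ETag"], direct_b["ETag"])
```

Whoever reaches the cache first decides the identity that everyone behind it receives afterwards, which is the behavior seen in the transcripts.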

To illustrate how the transparent proxy aids tracking: if I reconnect to the network with the transparent proxy and replay the cookie, the transparent proxy starts serving the replayed identity.
$ cat j.txt | nc i.kissmetrics.com 80  # Cookie-replayed request.
HTTP/1.1 200 OK
...
ETag: "fsxiUZH0lIdITI0YA4-uxXslRMQ"
...
$ cat i.txt | nc i.kissmetrics.com 80  # Untracked request.
HTTP/1.1 200 OK
...
ETag: "fsxiUZH0lIdITI0YA4-uxXslRMQ"
...
I disconnect and reconnect again. Now I issue the untracked request first, followed by a cookie replay and then an ETag replay. Notice how both replays are ignored, because the response to the new untracked request is already cached by the transparent proxy.
$ cat i.txt | nc i.kissmetrics.com 80  # Untracked request.
HTTP/1.1 200 OK
...
ETag: "Chw55f8kmJAUzkH15o0uP8Qz6i0"
...
$ cat j.txt | nc i.kissmetrics.com 80  # Cookie-replayed request.
HTTP/1.1 200 OK
...
ETag: "Chw55f8kmJAUzkH15o0uP8Qz6i0"
...
$ cat k.txt | nc i.kissmetrics.com 80  # ETag-replayed request.
HTTP/1.1 200 OK
...
ETag: "Chw55f8kmJAUzkH15o0uP8Qz6i0"
...
If you want to surf the web without being tracked, you can (1) disconnect from the network, (2) reconnect, and (3) prime the transparent proxy's cache with a new identity request; then, even without clearing the browser cache or cookies, you will be issued a new identity. However, it is possible that when the browser presents an old identity alongside the new one, KISSmetrics can correlate and merge the two identities. It is probably safer to clear the browser cache and cookies just to be sure.

I think this at least marks the beginning of a happy story. Even though the transparent proxy cache built into the network infrastructure by the internet provider facilitates tracking, it is still possible for an end-user to evade the tracking by manipulating the proxy in a certain way.

Finally, the identity here is not really personally identifiable information per se. To KISSmetrics, it is just a random string, and all it tells them is that the holder of that string has been seen visiting websites X, Y, and Z. Unless you provide personally identifiable information to websites X, Y, or Z, all they know is that the same person has used different internet providers to visit certain websites.

Tuesday, July 5, 2011

ISMM 2011 selected paper notes

  • Cache Index-Aware Memory Allocation. An allocator of fixed-size headerless objects whose sizes are multiples of the cache line size (e.g. object size 128 on cache lines of 64 bytes) may cause certain cache indices to be evicted often, because the objects uniformly map to the same cache lines. Adding spacers (punctuated array) between these objects could break the affinity. NOTE(liulk):
    • Cache index conflict should generally happen only at block level if the allocator uses superblocks to manage arenas.
    • Knuth boundary tags are natural spacers used for variable sized object allocation, and would not suffer this problem.
  • Compartmental Memory Management in a Modern Web Browser. Memory resources allocated for each origin (as in "same origin policy") are stored together in the same heap, and a web browser has a separate heap for each origin. Chrome spawns a process for each tab, but this may not be practical for mobile devices, where hardware address space isolation could be expensive. Per-origin heaps also shorten garbage collection time, because same-origin private heaps do not hold references to objects belonging to another origin.
  • Multicore Garbage Collection with Local Heaps. Illustrates how mutation handling makes it hard to separate local and global heaps, and how, despite the complicated mechanism, the approach achieves little gain in scalability.
  • Iterative Data-parallel Mark&Sweep on a GPU. Highlights challenges specific to GPU, namely the lack of deep recursion due to the relatively small size of video memory (~128MB) with no virtual memory and the large number of threads (1000+).
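
The cache index effect from the first paper above can be illustrated with a small Python calculation. This is my own toy model, not the paper's allocator: the number of cache sets (64), the spacer size, and the spacing interval are all assumed values, and only the set index of each object's first cache line is considered.

```python
LINE_SIZE = 64     # bytes per cache line (from the example in the notes)
NUM_SETS = 64      # hypothetical number of cache sets
OBJECT_SIZE = 128  # object size from the example in the notes

def set_index(address):
    """Cache set that the line containing `address` maps to."""
    return (address // LINE_SIZE) % NUM_SETS

def allocate(count, size, spacer_every=0, spacer=LINE_SIZE):
    """Start addresses of `count` objects laid out back to back.

    If spacer_every > 0, skip `spacer` bytes after every `spacer_every`
    objects (an assumed punctuation scheme).
    """
    addresses, cursor = [], 0
    for i in range(count):
        addresses.append(cursor)
        cursor += size
        if spacer_every and (i + 1) % spacer_every == 0:
            cursor += spacer
    return addresses

def sets_touched(addresses):
    """Distinct cache sets hit by the first line of each object."""
    return {set_index(a) for a in addresses}

dense = sets_touched(allocate(256, OBJECT_SIZE))
spaced = sets_touched(allocate(256, OBJECT_SIZE, spacer_every=32))

print(len(dense))   # 32: only the even-numbered sets are ever used
print(len(spaced))  # 64: the spacers shift later objects onto the odd sets
```

With 128-byte objects on 64-byte lines, object starts land only on every other set, so half of the cache indices take all the pressure. One line-sized spacer per 32 objects is enough in this model to spread the starts across all sets.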