The other day I was watching an old episode of Extreme Engineering, Subways of America. One of the problems with scheduling trains is collision avoidance. To reduce the likelihood of a collision, you either space out the trains or keep them moving slowly, but that means track utilization drops or commute times go up. Collisions are a huge problem for subways because visibility in an underground tunnel is low, and once an accident happens there are only very limited escape options; a smoky fire can quickly turn into a disaster. A signal system makes collision avoidance more predictable, so efficiency can be improved.
Although the episode focused more on the NYC subway, I take the MBTA to work every day, and I often wish its efficiency could be improved. The signal systems of the mass transit lines, the Red and the Blue, have recently been revamped (I don't take the Orange Line, so I don't know about it). The Green Line, on the other hand, still uses an old system: a signal light with a sensor and a timer is mounted to the right of the track. When a train passes, the sensor detects it and turns the signal red. After about 10 seconds the signal turns yellow, and after another minute it turns green (I have not actually timed the intervals; I'm pulling them from memory). Signal lights are spaced at intervals along the track. When you hear the MBTA announce a signaling problem, one of the bulbs has probably burned out or a light is stuck on green.
What a signal system boils down to is a way to estimate the distance between two trains so the conductor can adjust the speed of the train accordingly. But installing a signal system can be expensive, particularly if it requires new wiring along the tunnel: installation means taking parts of the tunnel offline, and servicing is costly. The ingenuity of the timed signal lights is that they are a decentralized, cost-effective solution, one that worked well 100 years ago when the pace of life was slower.
I propose a similarly decentralized signal system in which each train gains visibility into its distance from other trains without any new wiring being installed in the tunnel. The key idea is to apply Broadband over Power Lines technology, which turns the third rail or overhead power line into a shared medium over which trains can communicate. Similar to how GPS and NTP work, trains broadcast time-stamped messages at a fixed interval, and by measuring the round-trip time of these messages they can estimate the distances between one another. Stationary beacons can be placed at fixed positions along the track so trains can also track their absolute position.
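To make the ranging idea concrete, here is a minimal sketch of the arithmetic, assuming a train timestamps a broadcast ping, a peer echoes it after a known turnaround delay, and the signal propagates at roughly the speed of light (the constants and function names are illustrative, not part of any real system):

```python
SIGNAL_SPEED_M_PER_NS = 0.3   # assumed propagation speed: ~30 cm per nanosecond

def estimate_separation_m(t_sent_ns, t_echo_received_ns, peer_turnaround_ns):
    """Estimate the distance to the peer train from one ping/echo exchange.

    t_sent_ns           -- local clock reading when the ping was broadcast
    t_echo_received_ns  -- local clock reading when the echo came back
    peer_turnaround_ns  -- the peer's fixed, agreed-upon processing delay
    """
    round_trip_ns = (t_echo_received_ns - t_sent_ns) - peer_turnaround_ns
    return (round_trip_ns / 2) * SIGNAL_SPEED_M_PER_NS

# Example: the echo arrives 4,100 ns after the ping and the peer's turnaround
# is 100 ns, giving a 4,000 ns round trip, i.e. roughly 600 m of separation.
print(estimate_separation_m(0, 4_100, 100))  # 600.0
```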
The stationary beacons also detect whether a train passes without its broadcast working. If so, the beacon sends a warning to all trains. The train immediately behind the warning beacon asks the beacon how long ago the silent train passed and adjusts its speed accordingly, essentially reverting to the old system. The system is a success if it allows both old and new trains to run on the same track: a graceful upgrade option is also a graceful downgrade option.
Now that the logistics are out of the way, here is the technical part. Light travels about 30 cm each nanosecond; this is sometimes called a "light-foot." Electrical signals travel at a similar speed. We can convert the round-trip time to a distance, but we are constrained by the accuracy and precision with which we can measure time. Processors run at 1 GHz or more, so for the sake of discussion we can equate 1 nanosecond to 1 CPU cycle. It is generally possible to use off-the-shelf computers for this purpose, except that off-the-shelf operating systems are not up to the task: we need a dedicated real-time operating system that can achieve latency within a handful of CPU cycles. Since a main memory access incurs a penalty of 20 cycles or more, the system would run entirely out of the CPU's L2 cache.
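As a quick sanity check on the precision requirement, here is the same rule of thumb in code (a sketch assuming a 1 GHz clock, so one cycle equals one nanosecond; the numbers are illustrative):

```python
LIGHT_FOOT_M = 0.3   # distance a signal travels in one nanosecond (~one foot)
NS_PER_CYCLE = 1.0   # 1 GHz clock: one CPU cycle per nanosecond

def ranging_error_m(jitter_cycles):
    """One-way distance error caused by timing jitter of `jitter_cycles`."""
    jitter_ns = jitter_cycles * NS_PER_CYCLE
    return (jitter_ns * LIGHT_FOOT_M) / 2   # the jitter lands on the round trip

# A single stray main-memory access (20+ cycles) already costs ~3 m of accuracy,
# which is why the measurement loop should stay inside the cache.
print(ranging_error_m(20))  # 3.0
```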
Using a general-purpose programmable CPU has the benefit that we can easily design, test, and revise the time-distance measurement protocol to make it safer, more efficient, and more robust. If this were baked into hardware instead, it would take many years. GPS has its roots in the 1940s, was first deployed in its current form in the 1990s, and took more than 20 years to reach its current maturity. Although we are not launching satellites, the positioning system proposed here requires faster response times and more precise distance measurement, and would likely not be doable in hardware for another ten years. I think we could develop the technology in software in a matter of a few years.
Saturday, August 24, 2013
Friday, August 23, 2013
Towards a decentralized social network
These days, social networking platforms are a great way to share aspects of your life with people, but once you upload your life to a provider, you have very little control over it. Gaining complete control of your data means hosting it yourself, but hosting a site can forcibly expose other private information (e.g., WHOIS records) that you don't want to share. One solution is to host an individual's social network profile as a Tor hidden service. Here are some notes about how a social network might work in a completely distributed fashion using OpenID and OAuth.
A personal profile is a URL. The URL points to an HTML page whose link rel elements advertise an OpenID. The HTML page can also be marked up as a schema.org/Person. Other link rels can expose an RSS feed that your friends subscribe to. Which posts your friends receive over RSS depends on the OpenID they present when they log into your profile site. The streams page where you read your friends' posts is then simply a custom RSS reader. For this to work, you also need to log into all of your friends' profile sites in order to read their posts. You can use cookies to keep the authenticated state, but this is not ideal.
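For illustration, a profile page along these lines might carry link rels like the ones below, served here with Python's built-in http.server (the host names, paths, and handler are placeholders, not a prescribed layout):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical profile markup: OpenID 2.0 discovery links, an RSS feed link,
# and schema.org/Person markup, as described above.  All URLs are placeholders.
PROFILE_HTML = """<!DOCTYPE html>
<html>
<head>
  <link rel="openid2.provider" href="https://openid.example.net/auth">
  <link rel="openid2.local_id" href="https://alice.example.org/">
  <link rel="alternate" type="application/rss+xml" href="/feed.rss">
</head>
<body itemscope itemtype="https://schema.org/Person">
  <span itemprop="name">Alice Example</span>
</body>
</html>"""

class ProfileHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.end_headers()
        self.wfile.write(PROFILE_HTML.encode("utf-8"))

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), ProfileHandler).serve_forever()
```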
OAuth is an alternative that works differently. When you log into your streams page and want to see posts from all of your friends, each friend's site would ask your site to provide it with an OAuth access token. The token would let them access just enough of your public information, via OpenID Connect, to decide whether you are trustworthy. They would then publish an RSS feed appropriate for your identity. Two existing examples of this pattern are the Google+ Sign-In API and the Facebook OAuth dialog.
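Sketching just the final step of that flow, and assuming a friend's site protects its per-identity feed with a bearer token (the endpoint and the way the token was obtained are hypothetical, since, as noted next, no standard pins them down):

```python
import urllib.request

def fetch_friend_feed(feed_url: str, access_token: str) -> bytes:
    """Fetch a friend's RSS feed, presenting an OAuth bearer token so the
    friend's site can tailor the feed to our identity."""
    req = urllib.request.Request(
        feed_url,
        headers={"Authorization": f"Bearer {access_token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Hypothetical usage; the token would have been obtained earlier through the
# friend's own OAuth / OpenID Connect flow.
# feed = fetch_friend_feed("https://bob.example.org/feed.rss", token)
```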
Unfortunately there is no standard way to interact with other social networking services after an OAuth token is granted, e.g. posting messages publicly or privately to the other user.
Wednesday, August 21, 2013
Essence of a secure socket protocol
This evening I went back and read RFC 5246, the Transport Layer Security (TLS) protocol version 1.2, just for the heck of it. It is pretty complex. Pages 7 through 14 define how to serialize messages. Pages 15 through 63 define the structure of the different message types passed between server and client and their meaning. The remaining few pages define how client and server may use a shared secret to exchange data.
A secure socket protocol really boils down to this: Alice and Bob wish to talk to each other, so they send each other their public keys. The public key is used to encrypt a session key for the recipient, which is subsequently used to encrypt the actual messages from the sender. After receiving each other's public keys, they send the encrypted session keys, and encrypted data follows.
Since TLS is layered on top of TCP, this exchange can only begin after the TCP 3-way handshake, and the TLS handshake itself takes at least 6 one-way messages. In total that is 9 one-way trips, or 4.5 round trips.
Phase | Alice | Bob
1 | Sends public key to Bob. | Sends public key to Alice.
2 | On receipt of Bob's public key, tells Bob about how subsequent messages will be encrypted (session key). | On receipt of Alice's public key, tells Alice about how subsequent messages will be encrypted (session key).
3 | Starts sending ciphertext to Bob. | Starts sending ciphertext to Alice.
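A toy version of these three phases, showing just the Alice-to-Bob direction, might look like the following (a sketch only, assuming RSA-OAEP to wrap the session key and AES-GCM for the data, via the `cryptography` package; a real protocol would add negotiation, signatures, and replay protection):

```python
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# Phase 1: Bob generates a key pair and sends Alice his public key.
# (Alice does the same in the other direction; that half is omitted here.)
bob_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
bob_public = bob_key.public_key()

# Phase 2: Alice picks a session key for the Alice->Bob direction and sends it
# to Bob encrypted under his public key.
alice_session_key = AESGCM.generate_key(bit_length=128)
wrapped_key = bob_public.encrypt(alice_session_key, oaep)

# Phase 3: Bob unwraps the session key and Alice starts sending ciphertext.
bob_session_key = bob_key.decrypt(wrapped_key, oaep)
nonce = os.urandom(12)
ciphertext = AESGCM(alice_session_key).encrypt(nonce, b"hello from Alice", None)
print(AESGCM(bob_session_key).decrypt(nonce, ciphertext, None))
```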
Phase 1 can be used to advertise the capabilities of the owner of the public key, so that the peer can choose an appropriate encryption and/or compression method. TLS also defines a mechanism for resuming a previous session key; this can be folded into Phase 1 as an option. The additional information sent in Phases 1 and 2 must be signed by the sender (and is verifiable against the sender's public key). Authenticity and integrity of Phase 3 communication are provided by an HMAC keyed with the session key.
Here are some more ideas I would pursue if I were to design a secure socket protocol from scratch.
A lower-level secure socket protocol could embed Phase 1 in the SYN and Phase 2 in the ACK. If Alice is the client and Bob is the server, Alice would send her public key in the SYN, Bob would respond with his public key as well as the session key for Alice in the SYN-ACK, and Alice would then send the session key for Bob in the ACK. Both sides could then start communicating in ciphertext.
Notice that Phase 3 does not have to wait for Phase 2: a sender can start sending ciphertext as soon as its session key has been sent, which brings the handshake down to just one round trip. However, the potential for packet loss means the sender has to be ready to resend the session key, so it is probably easier, implementation-wise, for Phase 3 to wait.
Rather than sending the full public key every time, a variation is to send a key digest instead. If the recipient does not already know the sender's key, an additional round trip is required to obtain it. For a 4096-bit public key this saves about 512 bytes in the SYN, which is not a lot on today's high-speed networks.
Either party may terminate the connection early if:
- It does not trust the peer's public key.
- It does not understand how subsequent messages could be decrypted.
To ensure ciphertext authenticity and integrity, append an HMAC to the plaintext and encrypt the two together. This makes it harder for a third party to verify whether they have decrypted the message successfully. The receiver decrypts first and then verifies the HMAC.
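A sketch of that construction, using HMAC-SHA256 inside AES-CTR via the `cryptography` package (the key sizes, nonce handling, and framing are illustrative; this mirrors the MAC-then-encrypt layout described above rather than any standard record format):

```python
import hashlib
import hmac
import os

from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def seal(enc_key: bytes, mac_key: bytes, plaintext: bytes) -> bytes:
    """Append an HMAC to the plaintext, then encrypt both together."""
    tag = hmac.new(mac_key, plaintext, hashlib.sha256).digest()
    nonce = os.urandom(16)
    enc = Cipher(algorithms.AES(enc_key), modes.CTR(nonce)).encryptor()
    return nonce + enc.update(plaintext + tag) + enc.finalize()

def open_sealed(enc_key: bytes, mac_key: bytes, blob: bytes) -> bytes:
    """Decrypt first, then verify the trailing HMAC, as described above."""
    nonce, body = blob[:16], blob[16:]
    dec = Cipher(algorithms.AES(enc_key), modes.CTR(nonce)).decryptor()
    padded = dec.update(body) + dec.finalize()
    plaintext, tag = padded[:-32], padded[-32:]
    expected = hmac.new(mac_key, plaintext, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("HMAC verification failed")
    return plaintext

enc_key, mac_key = os.urandom(16), os.urandom(32)
print(open_sealed(enc_key, mac_key, seal(enc_key, mac_key, b"phase 3 payload")))
```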
The new secure socket would require extending the BSD socket API as follows:
- connect_ex() is like connect() but allows passing arbitrary data, limited to a single packet, as part of the SYN. Upon return, the caller also receives the handshake data sent by the peer.
- accept_ex() is like accept() but allows retrieving the peer's SYN data as well as sending its own, again limited to a single packet. This function might also take a callback that decides whether to accept or reject the peer's SYN data before sending ours.
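As a rough interface sketch, the proposed calls might look like the following user-space stand-ins layered over ordinary sockets (the names mirror the proposal above; a real implementation would need kernel support to carry the data on the actual SYN and SYN-ACK segments, so here it is simply exchanged right after the TCP handshake):

```python
import socket

MAX_SYN_DATA = 1024  # assumed per-packet limit for handshake data

def connect_ex(sock: socket.socket, addr, syn_data: bytes) -> bytes:
    """Proposed connect_ex(): connect, send our handshake data, return the peer's.

    A real implementation would piggyback syn_data on the SYN itself; this
    stand-in sends it immediately after the TCP handshake instead.
    """
    sock.connect(addr)
    sock.sendall(syn_data[:MAX_SYN_DATA])
    return sock.recv(MAX_SYN_DATA)

def accept_ex(listener: socket.socket, syn_data: bytes, decide=None):
    """Proposed accept_ex(): accept, read the peer's handshake data, optionally
    let a callback reject it, then send our own.  Returns (conn, addr, peer_data).
    """
    conn, addr = listener.accept()
    peer_data = conn.recv(MAX_SYN_DATA)
    if decide is not None and not decide(peer_data):
        conn.close()
        raise ConnectionRefusedError("peer handshake data rejected")
    conn.sendall(syn_data[:MAX_SYN_DATA])
    return conn, addr, peer_data
```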
Saturday, August 3, 2013
Eliminate Cache Coherence from HyperTransport
Processors are equipped with caches to speed up memory access because main memory speed has failed to keep up with processor speed. On a symmetric multi-processor system, each processor has a private cache but all of them share a memory bus. A cache coherence protocol makes sure the private caches maintain a consistent copy of the portions of main memory they hold, even though any processor may write to main memory at any time.
Things get more complicated with NUMA, where each processor comes with its own memory controller and its own local main memory, while still having a private cache. Processors no longer share a bus; instead they send messages to one another over point-to-point links, and a message may be routed through multiple hops in a network of processors. This is how HyperTransport works. Furthermore, a single socket can house two multi-core dies, so a quad-socket system can have quite an interesting topology.
In order to create the illusion of shared memory, a processor needs to talk to another processor to access remote main memory. The more hops between processors, the higher the latency. But HyperTransport 3.1 operates at 3.2 GHz, so we are still talking about just 1 or 2 CPU cycles per hop, whereas a memory access clocks in at 20 cycles or more.
Current-generation AMD processors cache both local and remote main memory. Cache coherence over HyperTransport is a lot more complicated. If a processor wishes to conduct a transaction that invalidates the caches of other processors, it must send a message to every processor and wait for all of them to respond. This creates a torrent of messages over the HyperTransport links.
A recent feature called HT assist uses parts of the L3 cache to keep a record of which processors' caches need to be invalidated. This reduces the number of messages that need to be sent, increasing effective memory bandwidth.
Having said all this, here is my epiphany: what if we prohibit a processor from caching remote memory? It would cache only local memory, but it would answer memory accesses on behalf of remote processors out of that cache. This could work well because accessing a remote cache over HyperTransport is still faster than accessing local main memory, and we could do without the overhead of the cache coherence protocol.
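To make the idea concrete, here is a toy model of the rule, assuming each address has a single home node that is the only node allowed to cache it (the node structure, cache policy, and values are purely illustrative):

```python
class Node:
    """Toy NUMA node under the proposed rule: cache local memory only,
    and serve remote readers out of that cache."""

    def __init__(self, node_id, local_memory):
        self.node_id = node_id
        self.memory = dict(local_memory)  # this node's local DRAM
        self.cache = {}                   # caches *local* addresses only

    def read_local(self, addr):
        # Local read: fill the cache from local DRAM on a miss.
        if addr not in self.cache:
            self.cache[addr] = self.memory[addr]
        return self.cache[addr]

    def write_local(self, addr, value):
        # Only the home node ever caches addr, so no invalidation
        # messages to other nodes are needed.
        self.memory[addr] = value
        self.cache[addr] = value

    def serve_remote_read(self, addr):
        # A remote processor's access is answered from the home node's cache;
        # the remote processor does not keep a copy of its own.
        return self.read_local(addr)

# Node 0 is the home of address 0x10; another node reads it remotely, never caching it.
home = Node(0, {0x10: 42})
print(home.serve_remote_read(0x10))   # 42, served out of the home node's cache
home.write_local(0x10, 7)             # no coherence traffic: nobody else caches it
print(home.serve_remote_read(0x10))   # 7
```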
Furthermore, as more programs become NUMA-aware, their structure will adapt so that they primarily access memory local to the processor and use remote memory access as a means of communication. Their performance would not be hindered by the lack of remote-memory caching, and it frees up precious HyperTransport bandwidth for useful communication.