Designing for Reliability

In the past months, I have not written much; I pushed forward with work for the Interpeer Project. More recently, I also started as a researcher at AnyWi Technologies, joining friends from a past job. There, we participate in public/private research projects into next-generation commercial drone platforms.

While the two domains differ in many ways, they share one strong overlap: the need for reliable and performant network connections over the public Internet.

When you know a little bit about Internet technologies and you read “reliable”, chances are good your mind immediately goes to DARPA’s venerable Transmission Control Protocol (TCP). Though other protocols providing reliable message transmission exist, TCP is by far the most widely used, as it underlies… well, you reading this, most likely. All web traffic today is passed over TCP/IP (with some experimental exceptions).

I went into these projects knowing full well that TCP has flaws that we may need to address – but one thing I was not prepared for: speaking to the partners in the different projects, it became apparent that everyone has different issues with TCP, all of which derive from wildly diverging opinions on what “reliability” actually means.

All of this reminded me of Humpty Dumpty. The full quote from Through the Looking-Glass goes like this:

“When I use a word,” Humpty Dumpty said in rather a scornful tone, “it means just what I choose it to mean – neither more nor less.”

“The question is,” said Alice, “whether you can make words mean so many different things.”

“The question is,” said Humpty Dumpty, “which is to be master – that’s all.”

Mastery of one definition is not the goal, but reconciliation is. However, in order to reconcile, one must first look at the motivation for each different position.

In this article, I will outline the different definitions of the term I have recently encountered, and very briefly address how existing protocols such as TCP fare with them. In follow-on posts, I will go into more detail on each topic, to arrive at a requirements list for a future protocol design.

Figure: “Haute nouveaute, Mme. Demorest’s reliable patterns. In sizes, illustrated & described. ‘What to wear,’ 15 cts. 7 1/2 stg. Portfolio of fashions. 15 cts. 7 1/2 stg. Paris, Vienna, New York, London, agencies everywhere. Demorest’s Monthly Magazine. 1879.” by Boston Public Library is licensed under CC BY 2.0

So what is reliability in networking?

Soft Delivery Guarantee

As mentioned before, the de-facto standard for reliability in networking is what TCP does. Its main feature is that it will actively try to re-send a packet if the receiving end has not acknowledged its receipt. In this way, TCP provides a soft delivery guarantee.

There is an implication to this: because TCP can only re-send what it hasn’t forgotten, it has to buffer all packets until they have been acknowledged. As send buffers are limited – and should be limited – that means that the application may not be able to send new packets until previous ones are known to have been received.

Because TCP is a stream-oriented protocol, it also guarantees that packets are received in the order in which they are sent. There is an equivalent buffer on the receiving end which holds packets as they arrive. If an earlier packet was lost, this buffer may contain data that has successfully been transmitted, but that the application cannot yet read. If the buffer fills up, the receiver must signal to the sender to stop sending more packets.
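To make this mechanism concrete, here is a minimal stop-and-wait sketch over UDP – not how TCP actually works, which uses sliding windows, but the same core idea: keep a packet buffered and re-send it until the receiver acknowledges it. The peer address, timeout, and retry limit are assumptions for illustration.

```python
import socket

PEER = ("127.0.0.1", 9999)   # hypothetical receiver
TIMEOUT = 0.5                # seconds to wait for an acknowledgement
MAX_TRIES = 5                # give up eventually -- hence a *soft* guarantee

def send_reliably(sock: socket.socket, seq: int, payload: bytes) -> bool:
    # Buffer the packet until it is acknowledged, re-sending on timeout.
    packet = seq.to_bytes(4, "big") + payload
    sock.settimeout(TIMEOUT)
    for _ in range(MAX_TRIES):
        sock.sendto(packet, PEER)              # (re-)transmit from the buffer
        try:
            ack, _ = sock.recvfrom(4)
            if int.from_bytes(ack, "big") == seq:
                return True                    # acknowledged: free the buffer
        except socket.timeout:
            continue                           # data or ACK lost: try again
    return False                               # delivery never confirmed

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_reliably(sock, 1, b"hello")
```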

What if an acknowledgement or this stop signal is itself carried in a packet that gets lost? Well, that is where TCP gets interesting. And by interesting, I mean complex. And by complex, I mean prone to error cascades and the thundering herd problem.

The upshot is that TCP is great when there is relatively little packet loss – but in that kind of scenario, it does provide a soft delivery guarantee.

Hard Delivery Guarantee

Some applications require a much harder delivery guarantee than TCP provides: they need to ensure that a message is delivered no matter what. Since realistically that goal cannot be fulfilled under signal loss conditions, “no matter what” also includes “no matter when” – that is, as long as a packet gets delivered eventually, all is well.

In practice, of course, there are also limits to either of these conditions; they are just a lot broader than what TCP provides. Typically, such hard delivery guarantees are implemented in a layer above the transport protocol, where persistent storage buffers keep application messages rather than TCP packets.

As communications rarely travel along a direct link between two endpoints – messages from Alice to Bob pass through Carol – it isn’t always Alice or Bob who performs the buffering. When Alice has a clear channel to Carol, but Carol has temporarily lost sight of Bob, messages are stored with Carol until she can forward them to Bob again. For this reason, this type of technique for guaranteeing hard delivery is usually called store and forward.
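A minimal sketch of Carol’s role, assuming a hypothetical forward() callable that hands a message to the next hop: messages live in persistent storage and are only deleted once the handoff has succeeded.

```python
import sqlite3

# Persistent message buffer: survives restarts, unlike an in-memory queue.
db = sqlite3.connect("relay_queue.db")
db.execute("CREATE TABLE IF NOT EXISTS queue (id INTEGER PRIMARY KEY, msg BLOB)")

def store(msg: bytes) -> None:
    # Accept a message for later delivery.
    with db:
        db.execute("INSERT INTO queue (msg) VALUES (?)", (msg,))

def forward_pending(forward) -> None:
    # Try to drain the queue, oldest message first.
    rows = db.execute("SELECT id, msg FROM queue ORDER BY id").fetchall()
    for row_id, msg in rows:
        try:
            forward(msg)                  # may raise while Bob is unreachable
        except ConnectionError:
            break                         # keep the message; retry on the next pass
        with db:                          # delete only after a successful handoff
            db.execute("DELETE FROM queue WHERE id = ?", (row_id,))
```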

Local Decision Making

In my previous post on distributed consensus, I ended up arguing that local decision making beats maintaining global state. The same is true for the smaller scale of two endpoints communicating.

Let’s take TCP again as an example: in order for its strict ordering of packets to work, both parties must agree on an initial sequence number for the first data packet. This is transmitted in a SYN packet, acknowledged in a SYN-ACK packet, and the acknowledgement is acknowledged in a final ACK packet. Only when all three have been exchanged do both parties have the same state, and data transmission can begin.
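A toy illustration of that exchange – plain dictionaries standing in for packets, since the point here is the shared state rather than the wire format:

```python
import random

# Each side picks a random initial sequence number (ISN), as real stacks do.
client_isn = random.getrandbits(32)
server_isn = random.getrandbits(32)

# 1. client -> server: the SYN carries the client's ISN.
syn = {"flags": "SYN", "seq": client_isn}

# 2. server -> client: the SYN-ACK carries the server's ISN and
#    acknowledges the client's.
syn_ack = {"flags": "SYN-ACK", "seq": server_isn, "ack": syn["seq"] + 1}

# 3. client -> server: the final ACK acknowledges the server's ISN.
ack = {"flags": "ACK", "seq": syn_ack["ack"], "ack": syn_ack["seq"] + 1}

# Only now do both parties share the same state, and data transmission can begin.
```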

Of course this poses the problem that if any of these three packets is lost, transmission stalls, which can be construed as a lack of reliability. This definition of reliability centres around timing guarantees, whereas the previous two have delivery guarantees as their focus.

To fulfil timing guarantees, one tactic is to generally prefer local decision making over requiring error-prone state synchronizing between communication endpoints.

Time-Sensitive Networking

Timing guarantees are such an important part of one definition of reliability, especially in the automotive industry, that a whole family of Ethernet standards – Time-Sensitive Networking (TSN) – has been created to provide them.

Of course, they address timing guarantees at a layer well below our TCP Internet standard. And yet, we can effectively shape traffic at upper layers by restricting the transmission rate and transmitting at regular intervals.

If the selected transmission rate is below the capacity of the path between the endpoints, chances are better that packets don’t get lost. And transmitting at regular intervals allows all nodes on the path to empty their buffers as fast as they are getting filled.
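A minimal application-layer pacing sketch, assuming a send_packet callable for the underlying transport; the rate and packet size are illustrative:

```python
import time

RATE_BPS = 125_000            # target rate: 1 Mbit/s, assumed below path capacity
CHUNK = 1250                  # bytes per packet
INTERVAL = CHUNK / RATE_BPS   # 10 ms between packets at this rate

def paced_send(send_packet, data: bytes) -> None:
    # Transmit at a fixed cadence instead of as fast as possible, so that
    # buffers along the path can drain as fast as they are filled.
    next_slot = time.monotonic()
    for offset in range(0, len(data), CHUNK):
        now = time.monotonic()
        if now < next_slot:
            time.sleep(next_slot - now)   # wait for the next regular slot
        send_packet(data[offset:offset + CHUNK])
        next_slot += INTERVAL
```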

These are techniques one would choose for better video streaming, or for online gaming. In either case – for varying reasons of user experience – it is much better to lose some data than to let the application wait for it. In this definition of reliability, guaranteed delivery is not as desirable as (relatively) timely delivery.

Tamper-Proofing

Tamper-proofing a network connection addresses a wholly different set of expectations than the previous examples. It’s all about ensuring that the packets you receive are actually from the expected sender and have not been modified in transit.

The typical approach is to employ message authentication codes (MACs), often as part of a wider cryptographic scheme, such as Transport Layer Security (TLS).

But tamper-proofing is conceptually separate enough to consider it as a definition of reliability in its own right, and MACs can be applied without encryption.
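A minimal sketch using Python’s standard library: the receiver can verify both the origin and integrity of a message without any encryption involved. The shared key here is a stand-in; real systems derive keys properly.

```python
import hashlib
import hmac

KEY = b"shared-secret-key"    # stand-in; negotiated out of band in practice

def tag(message: bytes) -> bytes:
    # Compute a MAC over the message with the shared key.
    return hmac.new(KEY, message, hashlib.sha256).digest()

def verify(message: bytes, received_tag: bytes) -> bool:
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(tag(message), received_tag)

msg = b"telemetry: battery at 81%"
assert verify(msg, tag(msg))                               # genuine message passes
assert not verify(b"telemetry: battery at 18%", tag(msg))  # tampering is detected
```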

Privacy

The other definition of reliability related to tamper-proofing is privacy. It is concerned with being able to rely on the fact that no unauthorized parties have access to the transmitted data.

This is generally solved with encryption on the otherwise public Internet – though of course entirely private networks are another technique for ensuring privacy.
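A sketch of the encryption approach, assuming the third-party cryptography package is available; only holders of the shared key can read the payload. (As it happens, its Fernet construction also authenticates each token, combining this definition of reliability with the previous one.)

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()    # shared out of band between the endpoints
cipher = Fernet(key)

# What travels over the public Internet is the opaque token, not the payload.
token = cipher.encrypt(b"position: 52.16N, 4.49E")
assert cipher.decrypt(token) == b"position: 52.16N, 4.49E"
```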

It is worth stressing that this and the previous reliability definition are relatively directly derived from the Authentication, Authorization and Accounting (AAA) architectural approach to distributed systems. Authenticated messages are tamper-proof, and only endpoints are authorized to read them.

Non-Interference

It is also fairly important that one communication link should not interfere with any other. Packet loss or congestion on one should not lead to similar conditions on the other, if at all avoidable.

This principle is reflected in TCP. For one, it tries to back off when it detects congestion on a link, giving other participants a chance to finish their transactions. For another, TCP implementations typically manage the state of all TCP connections in the system, allowing them to schedule fairly between them.

As intermediate nodes like our aforementioned Carol also play by these rules, TCP implementations actually behave relatively fairly between competing connections.

The way to leverage this is to send two independent data streams over two separate TCP connections. But while that is useful in limited settings, TCP’s 16-bit port space and per-connection state place an upper bound on how many connections a host can handle simultaneously – which can make life hard for servers with many inbound client connections.
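A sketch of that technique – the host and ports are assumptions: each connection gets its own congestion control and loss recovery, so a stall on the bulk stream does not hold up the small one.

```python
import socket

# Two independent streams over two separate TCP connections.
bulk = socket.create_connection(("example.net", 5001))
urgent = socket.create_connection(("example.net", 5002))

bulk.sendall(b"large video frame ...")   # high-volume, loss-tolerant traffic
urgent.sendall(b"battery=81%\n")         # small, latency-sensitive traffic
```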

Failover & Bonding

Finally, a definition of reliability is “by any means necessary” – that is, if there exist multiple ways to communicate with another party, all of them should be tried.

This definition comes straight from the drone world, where certain requirements need to be met to achieve certification. For example, Specific Operations Risk Assessment (SORA) defines so-called Specific Assurance and Integrity Levels (SAIL), which require command and control (C2) links to have some failover capabilities.

In practice, this can mean one of two things:

  1. That one link is a primary, and the other a backup (failover) link, or
  2. that data is multiplexed along all available links (bonding).

In the latter case, one can distinguish between data priorities and assign links according to this priority, to fulfil ever finer reliability definitions. A sketch of the simpler failover case follows below.
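This is a minimal failover sketch, with hypothetical link objects that expose a send() method and raise when their medium is down:

```python
class LinkDown(Exception):
    """Raised by a link when its medium is currently unavailable."""

def send_with_failover(links, payload: bytes) -> None:
    # `links` is ordered by priority: the primary first, backups after.
    for link in links:
        try:
            link.send(payload)
            return                 # delivered over the first working link
        except LinkDown:
            continue               # fail over to the next link
    raise LinkDown("no link available on any path")
```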

It’s worth pointing out that there exists a fairly widely deployed extension to TCP, Multipath TCP (MPTCP), that addresses this reliability definition. Other protocols, such as the Stream Control Transmission Protocol (SCTP), offer similar features.

Summary

It should be clear after reading this that many different definitions of reliability exist in the networking world, and that they are all derived from real-world concerns. Existing protocols tend to fulfil only a subset of the requirements derived from these – and some achieve their goal only by explicitly or implicitly excluding other reliability definitions from their scope.

One of the aims of both the Interpeer Project and AnyWi Technologies is to arrive at a (set of) protocols that can address all of these concerns – albeit with the expectation that each application domain demands extra components that tailor this into more specific solutions.


Published on August 28, 2020