Message in a Bottle

Since I took such a long break before my last post, I figured I should forge ahead and keep writing while I can. Previously, I outlined the packet envelope information we’re going to send over the wire. Today’s topic is the basic messaging framework, handshakes, and some considerations on channel establishment.

To recap, every packet belongs to a channel, and may contain one or more messages. In the absence of any other established channel, packets will belong to a default channel. This default channel shall be used for initiating connections.

This handshake can come in a number of different forms, to be discussed in detail at a later date. For now, let’s recall that we like the simplicity of WireGuard’s handshake compared to TLS. However, WireGuard relies on pre-shared keys, whereas TLS allows peers to send their keys and associated certificates. We also explored briefly how WireGuard’s model might be ideal for situations in which peers have already exchanged keys, while a key exchange handshake would additionally be required for the initial connection. That makes two different kinds of handshakes; let’s now add a third:

  1. Handshake with pre-shared keys (like WireGuard).
  2. Handshake with key exchange (like TLS).
  3. Handshake without encryption or authentication (like TCP).

In practice, it’s fine if this last handshake is never used. Implementations may choose not to support it. However, for the purposes of punting the heavy crypto stuff to a later date when some of the basics are already established, it’s a useful thing to discuss. And as we’ll see, it doesn’t actually require much implementation effort…

One more thing before we dig in: since we’re sending packets over UDP, and UDP is unreliable, the safest thing to do is to ensure that every packet we send fits into a single, unfragmented UDP datagram - in other words, we need to be aware of the Maximum Transmission Unit (MTU) of a connection.

Unfortunately, the actual MTU depends on the underlying data link protocol over which UDP is sent. Ethernet is famously capable of carrying ca. 1500 Bytes per frame, and jumbo frames carrying more are widely deployed. On the other hand, other protocols, especially wireless ones, may allow for less. A bit of research suggests that a value generally safe from fragmentation lies around 1200 Bytes. With the packet header out of the way, that leaves space for just over 1 KiB of data per packet.

The protocol itself is capable of dealing with much larger MTU sizes. However, every implementation should choose an MTU size it supports, and default to a small packet size - in a later extension, we may add path MTU discovery.
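
To make the numbers concrete, here is a minimal sketch of the size budget an implementation might start with; the header size below is a placeholder, since the real value follows from the envelope format of the previous post.

// A minimal sketch of the size budget, assuming a hypothetical header size;
// the real value follows from the envelope format of the previous post.
#include <cstddef>

constexpr std::size_t DEFAULT_MTU        = 1200; // conservatively safe from fragmentation
constexpr std::size_t PACKET_HEADER_SIZE = 120;  // placeholder, not the real header size
constexpr std::size_t MAX_PAYLOAD_SIZE   = DEFAULT_MTU - PACKET_HEADER_SIZE;

static_assert(MAX_PAYLOAD_SIZE > 1024, "leaves just over 1 KiB of data per packet");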

Figure: “Message in a bottle” by Infomastern is licensed under CC BY-SA 2.0

Unencrypted and Unauthenticated Handshake

So what exactly is required in a handshake when no cryptographic information needs to be exchanged? Well… nothing, really.

Incoming packets are identified by their sender and recipient peer identifiers. Of course, any implementation should reject (silently discard) packets addressed to recipients other than itself, as well as packets without content. But if those hurdles are passed, we can really treat every packet as valid.

A handshake then constitutes a packet from a sender who has not previously sent a packet, or whose last packet was received before a connection reset timeout.
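
As a rough sketch of that rule (all names and the timeout value here are made up for illustration), the receiving side might do something like this:

// Sketch: detecting a handshake in the unauthenticated case. All names and
// the timeout value are made up for illustration.
#include <chrono>
#include <cstdint>
#include <unordered_map>

using Clock = std::chrono::steady_clock;

constexpr auto CONNECTION_RESET_TIMEOUT = std::chrono::seconds(30); // hypothetical

struct PeerState {
  Clock::time_point last_seen;
};

// Returns true if this packet should be treated as a handshake.
bool is_handshake(std::unordered_map<uint64_t, PeerState>& peers,
                  uint64_t sender_id, Clock::time_point now)
{
  auto it = peers.find(sender_id);
  bool fresh = (it == peers.end()) ||
               (now - it->second.last_seen > CONNECTION_RESET_TIMEOUT);
  peers[sender_id].last_seen = now; // remember the peer either way
  return fresh;
}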

And that’s it?

Almost. I mean, it’s already important that connections have some kind of timeout associated with them. That alone makes a big difference in how implementations treat incoming packets.

But the other thing to consider is channels. Let’s assume that the sender picks a channel ID of, say, 42. The recipient, being unaware of the sender, won’t really know what to do with that channel ID. The application using the protocol on the recipient side won’t be aware of such a channel number, and therefore will not be able to understand what this channel is all about.

The meaning of handshakes is inextricably linked to channels; the only channel that the protocol implementation can handle without application support is the default channel. So let’s derive the first rule of channels, which is:

  1. Handshake messages shall be sent on the default channel only.

But that brings us around again to asking, what is a handshake without crypto setup? I mean, what kind of messages can we send as a handshake? Should we be permitted to just send application data immediately? That is what TCP would do (from an application perspective).

Well… it’s possible. But that messes a little with adding encryption later, because the easiest way to add encryption is to negotiate parameters per channel: remember, we do not want one channel to interfere with another. The TL;DR is that any crypto state needs to be held per channel, which leads us to a second rule:

  1. Handshake messages shall be sent on the default channel only.
  2. Application data shall only be sent on non-default channels.

That’s fine, but now we’re back to square one: what should a non-authenticated, non-encrypted handshake contain if there is nothing to exchange in the handshake, and application data may not be transmitted? Actually, it’s pretty straightforward: any messages related to establishing channels are fair game. You’ll need at least one in order to send application data, so why not put it into the initial packet?

So here we are at rule number three:

  1. Handshake messages shall be sent on the default channel only.
  2. Application data shall only be sent on non-default channels.
  3. Channel establishment messages shall be sent on the default channel only.

While the rules don’t explicitly state that you must send a channel establishment message in the first packet, it’s implied by the fact that empty packets will be rejected. In crypto handshakes, the channel establishment message would likely be appended to or sent after the last handshake messages.
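
For illustration, the three rules could be condensed into a simple filter like the sketch below; the message type names and the default channel ID are assumptions, not part of the spec.

// Sketch: the three rules as a per-message filter. The message type names and
// the default channel ID are placeholders for illustration.
#include <cstdint>

enum class MessageType : uint8_t { Handshake, ChannelEstablishment, Data };

constexpr uint32_t DEFAULT_CHANNEL = 0; // hypothetical ID of the default channel

bool message_allowed(MessageType type, uint32_t channel_id)
{
  switch (type) {
    case MessageType::Handshake:            // rule 1
    case MessageType::ChannelEstablishment: // rule 3
      return channel_id == DEFAULT_CHANNEL;
    case MessageType::Data:                 // rule 2
      return channel_id != DEFAULT_CHANNEL;
  }
  return false; // unknown types are rejected
}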

Messaging

Outside of the packet header, we’ve been talking about messages. In QUIC, they talk about frames instead; it’s the same thing. I just happen to think that messages translate better to application usage, where a message-queue-like interface is fairly understandable to developers. Also, Ethernet already uses frames, IP uses packets, and UDP uses datagrams (and hey, we also have a packet header), so it would be nice not to recycle terminology for what’s effectively the same thing yet again.

So messages it is, and a packet payload is a sequence of one or more messages, all belonging to the channel the packet is part of.

The most obvious messages we’ll need to work with are application data messages - they’re just simple encapsulations for message data. We’ll discuss them later in more detail, but there’s already something we can say about them: since it’s unknown how much data an application wants to send, these must be variable length messages.

On the other hand, crypto handshake messages tend to be fixed length: we know in advance precisely how many Bytes of key material and whatnot need to change hands, err, ports in order to establish a secure connection.

So we have different messages that we need to distinguish, but also some messages have a fixed length, and others a variable length. That gives us something to start with:

  1. The first part of every message is going to be the type of message, encoded as a unique value.
  2. Variable length messages will then have a length field. Fixed length messages have their length encoded in the protocol and don’t need this field.
  3. The next length Bytes then are the message payload, which follows a message-specific format.

Note how the presence or absence of a length field is comparable to the presence or absence of a length in a packet header; in the header case, it depends on whether the underlying transport works on datagram boundaries or is stream based. Here the reasons differ, but we also need a length when we can’t determine it any other way.

We’ll consider the message type and the length to be integer values, encoded as variable length integers. A variable length integer is one where each Byte carries 7 bits of value, and one flag bit that determines whether the next Byte is part of the value, or the value ends. That means a single Byte can encode values from 0-127, two Bytes values from 0-16383, etc.
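
Here’s a sketch of that encoding, plus how a variable length message from the list above would be laid out with it. The choice of the high bit as the “more Bytes follow” flag is just one common convention, and the function names are placeholders.

// Sketch: variable length integers (7 value bits per Byte, high bit set while
// more Bytes follow) and the variable length message layout from above.
#include <cstddef>
#include <cstdint>
#include <vector>

void encode_varint(uint64_t value, std::vector<uint8_t>& out)
{
  do {
    uint8_t byte = value & 0x7f;  // low 7 bits of the value
    value >>= 7;
    if (value != 0) byte |= 0x80; // flag: another Byte follows
    out.push_back(byte);
  } while (value != 0);
}

// Returns the number of Bytes consumed, or 0 on truncated input.
std::size_t decode_varint(const uint8_t* data, std::size_t len, uint64_t& value)
{
  value = 0;
  for (std::size_t i = 0; i < len && i < 10; ++i) { // 10 Bytes cover 64 bits
    value |= static_cast<uint64_t>(data[i] & 0x7f) << (7 * i);
    if ((data[i] & 0x80) == 0) return i + 1;
  }
  return 0;
}

// Variable length message: type, then length, then the payload itself.
void encode_data_message(uint64_t type, const std::vector<uint8_t>& payload,
                         std::vector<uint8_t>& out)
{
  encode_varint(type, out);
  encode_varint(payload.size(), out);
  out.insert(out.end(), payload.begin(), payload.end());
}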

The rationale either way is space saving, because our MTU might be fairly low. With an MTU of 1200, we may have data messages of around a KiB in size; that’s a length that cannot be encoded into a single length Byte. Two length Bytes for a 16 bit value should be sufficient for MTUs that conceivably fit into a data link frame, but that doesn’t account for e.g. stream-based local pipes as transports, where we may use larger data messages.

The best trade-off here is those two Bytes: for any value between 128 and 16383 Bytes, whether we encode the length as a fixed 16-bit value or a variable length integer makes no space difference. Larger values will be rare except for applications not on the Internet, and we may save a Byte for smaller values. Processing variable length integers is not so heavy that we can’t do it even on embedded computers.

For data message sizes, there’s probably not a lot gained or lost either way. The ca. 1 KiB data message that fits into the MTU will certainly use two Bytes in either encoding. However, for message types, we may well get away with single Bytes in most implementations, because it’s unlikely that we’ll have more than 127 different messages across all protocol extensions. And if we do, larger type values can still be encoded.

I mentioned packet sizes earlier because specifying packet sizes, whether implicitly or explicitly, and likewise specifying message lengths, leaves room for extra Bytes at the end of a packet that do not belong to any message. That’s on purpose: it’s for padding.

Padding should not be uninitialised memory, which could leak information, but should instead follow some kind of pattern. Padding is good for security: in the past, researchers managed to work out which websites users were visiting over TLS encrypted connections by matching packet sizes to those sent by a number of known websites. Introducing random amounts of junk at the end of packets certainly helps mitigate this kind of snooping, and is surely one of the reasons WireGuard does the same thing. At the very least, implementations should be free to add such padding for this reason.
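
A sketch of what adding such padding might look like; the zero-Byte pattern and the uniform size distribution are arbitrary choices for illustration.

// Sketch: append a random amount of patterned padding after the last message,
// up to the maximum packet size. Pattern and distribution are arbitrary here.
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

void pad_packet(std::vector<uint8_t>& packet, std::size_t max_packet_size,
                std::mt19937& rng)
{
  if (packet.size() >= max_packet_size) return;
  std::uniform_int_distribution<std::size_t> dist(0, max_packet_size - packet.size());
  // Deliberate pattern (zero Bytes) rather than uninitialised memory.
  packet.insert(packet.end(), dist(rng), uint8_t{0});
}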

Encrypted Messages

And that would be all for our basic messaging format considerations, except there is the problem of encrypted messages. If we are in a situation where a crypto handshake is used, and the encryption parameters are to be followed by a channel establishment message, should that message then be encrypted or not? And how does that fit into the rest of the setup?

Well, on the packet level, we have conveniently reserved ourselves some flags in the packet header. It’s no problem whatsoever for channels other than the default channel to set one flag when packets are encrypted (by parameters established in the default channel), and leave it unset when it’s a plain text channel.

But the default channel poses a special problem: handshake messages must be unencrypted, in order for the handshake to proceed. What about the channel establishment messages? Should they be encrypted or not? Security considerations say they should, but in practice it’s a little more complicated.

If you view the channel as a message stream, then all messages up to the (successful) end of a crypto handshake must be plain text, and following messages should be encrypted. But encryption occurs at the packet/channel level, not at the level of individual messages. How to proceed from here?

One option would be to force separation of handshake and other messages into individual unencrypted and encrypted packets. But that kind of defeats the purpose of 1-RTT WireGuard handshakes; you’d need a second packet just to establish the first channel(s). That seems wasteful.

Another option would be to force channel establishment messages off the default channel, onto a second default channel that becomes automatically active without any negotiation between peers once the handshake has concluded. But in the unencrypted use case, the handshake is the channel establishment, so we’d have to find a different way for that. Also, it would require a second packet again, since packets belong to one channel only.

On balance, the best approach can probably be outlined in this simple algorithm:

  1. If a packet header has the encrypted flag set, decrypt it to known channel parameters before passing on to message parsing.
  2. If a packet header does not have the encrypted flag set, assume it is unencrypted. This works for handshakes on the default channel.
  3. If a MSG_START_ENCRYPTION message is encountered in an unencrypted packet, treat the remaining packet payload as encrypted.

And here we have our first actual message definition.

MSG_START_ENCRYPTION:

  • Zero length message payload.
  • Marks the start of encrypted packet payload until the end of the packet.
  • Encrypted packet payload is treated exactly like the payload of an encrypted packet.
  • The message itself must, of course, always be unencrypted.

The mechanism doesn’t allow for an arbitrary mixture of encrypted and unencrypted messages in a packet; this would complicate implementations and is really not necessary in the multi-channel setup. However, it permits the special case of switching to encrypted channel establishment messages after a crypto handshake in a single packet, at the cost of a single message type definition. That seems like a fine tradeoff.
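
Putting the algorithm and the message together, payload processing might look roughly like the following sketch. decrypt_payload() and parse_message() are stand-ins for logic covered elsewhere, the numeric value of MSG_START_ENCRYPTION is hypothetical, and the message type is read as a single Byte for simplicity.

// Sketch: payload processing with the MSG_START_ENCRYPTION switch.
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint8_t MSG_START_ENCRYPTION = 0x01; // hypothetical type value

// Stand-ins so the sketch is self-contained.
void decrypt_payload(std::vector<uint8_t>&, std::size_t /*from*/) { /* channel crypto */ }
std::size_t parse_message(const std::vector<uint8_t>&, std::size_t offset) { return offset + 1; }

void process_payload(std::vector<uint8_t> payload, bool encrypted_flag)
{
  if (encrypted_flag) {
    decrypt_payload(payload, 0);              // rule 1: header flag set
  }                                           // rule 2: otherwise assume plain text
  std::size_t offset = 0;
  while (offset < payload.size()) {
    if (!encrypted_flag && payload[offset] == MSG_START_ENCRYPTION) {
      encrypted_flag = true;
      decrypt_payload(payload, offset + 1);   // rule 3: rest of the payload is encrypted
      offset += 1;                            // zero length payload: just skip the type Byte
      continue;
    }
    offset = parse_message(payload, offset);  // dispatch on the message type
  }
}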

Channel Establishment

We’ve already discussed most of the algorithmic considerations for establishing channels; the main thing that’s missing is the message definitions. We’ll be a little vague here, because some of this depends on future decisions, but we can certainly provide the main outline.

Channels are negotiated between peers, with the exception of the default channel. Negotiation is the key word: in a message-based system, there is no reason why two peers that have performed a handshake may not simultaneously decide that they want to establish a channel with, say, the hypothetical ID 42.

What to do here? If they agree on the channel ID, then they should just both accept that the channel exists and communicate over it, no?

Probably not. The reason for this is best expressed in terms of an API, though. Application code doesn’t just establish channels for no reason, so let’s look at a particular example. Let’s consider video conferencing, and for simplicity consider a two person call only.

Both parties may want to establish video and audio channels, and probably do so simultaneously. There are easy ways to select random channel IDs that have a low chance of collisions, but we must assume that it’s possible that peer A assigns channel ID 42 to its audio channel, and peer B to its video channel. Clearly the two peers are then in strong disagreement as to channel 42’s purpose.

On the API level, you can go with two reasonable choices: either let the application pick channel IDs, or let the protocol implementation do it.

// Let application pick channel ID
auto video = open_channel(connection, VIDEO_CHANNEL_ID);
auto audio = open_channel(connection, AUDIO_CHANNEL_ID);

// Let protocol implementation do it.
auto video = open_channel(connection);
auto audio = open_channel(connection);

// Later
send_channel(video, videodata, videolength);

Either way, whatever channel gets established becomes some kind of handle for the application code to then send video or audio data. Of course, an API can support both styles. In the first example, if a conflict is detected, the open_channel() function may raise some kind of error. In the second example, it could handle retries on errors itself, until some channel ID could be agreed upon.

The way we’ll solve this is for the initiating peer to propose a channel ID, and for the responding peer to acknowledge it. But it’s feasible for the initiating peer to want to open multiple channels per packet, so we’ll always have to disambiguate which channel we’re currently negotiating. Also, packet loss may mean that neither party actually receives the messages from their peer.

At the end of the day, the negotiation of each channel ID is a two peer version of a consensus protocol. Which protocol we choose determines the number and format of messages to be sent, so we’ll leave that for a later post. However, we can already decide upon a number of things:

  1. Each channel ID is a separate value negotiated by the consensus protocol, so each protocol message must reference the channel ID under negotiation, in order not to confuse multiple parallel negotiations.
  2. Proposal messages should also include channel characteristics, such as whether the channel should process packets reliably and in order like TCP, or skip either. Similarly, crypto parameters (if any) can be sent in the proposal messages; a rough sketch follows below.
  3. We really only care about two peer scenarios. Most consensus protocols consider multiple peers, so we can simplify things.
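
Without committing to a consensus protocol yet, a proposal message might carry something along these lines; every field name and characteristic here is an assumption rather than a final format.

// Sketch: a hypothetical channel proposal message body (points 1 and 2 above).
#include <cstdint>
#include <vector>

struct ChannelProposal {
  uint32_t channel_id;                 // the ID under negotiation
  bool     reliable;                   // retransmit lost packets, like TCP?
  bool     ordered;                    // deliver in order, like TCP?
  std::vector<uint8_t> crypto_params;  // per-channel crypto parameters, if any
};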

Summary

In this post, we’ve come to some understanding on how messaging works in general, but also how messages interact with channels. In particular, we’ve explored in the abstract the considerations when handshakes initiate encrypted or unencrypted connections, how that affects channels, etc.

In the next post, it’s time to look at channel establishment as well as simple data messages. At that point, we could implement a first, unencrypted version of the protocol.


Published on October 22, 2020