Information-Centric Networking Rethought

The recent issues with Google’s WEI proposal have brought a few more visitors to this blog and website, which makes it worth diving into our work a little again.

The previous post on resource access is quite old at this stage, after all.

Quick Recap

Under different grants, we’ve been working on a bunch of loosely related technologies. The highlights are:

Channeler – a protocol that can switch between UDP-like lossy and TCP-like lossless modes of connection, as well as novel modes suitable for live broadcast. Channeler can run on top of UDP, on top of IP, or conceivably on top of Ethernet (though this would require some additional routing protocol).
Vessel – a container file format that has (limited) self-ordering properties for use in eventually consistent transfers, offers multi-authorship features, can multiplex different content, and is optionally end-to-end encrypted.
Wyrd – a CRDT implementation that integrates with Vessel.
CAProck – a cryptographic capability framework for distributed authorization.

Each of these is a building block for a complete information-centric networking (ICN) stack we’re constructing.

Which raises some questions people keep asking, such as: what is different about this ICN approach? Why do this? And what’s next?

Let’s look at ICN as it’s usually approached now to begin answering that.

Problems with ICN

Problems with existing ICN stacks tend to fall into two broad categories: privacy/security on the one hand, and performance on the other. Most solutions favour one over the other.

Privacy and security oriented approaches usually consider data immutable. More specifically, they consider each version of a file immutable. A large file gets split into individual extents¹, and a hash gets calculated for each extent.

At that point, there are two basic approaches projects take: some construct a merkle trie from these hashes to calculate a root hash. Using that root hash, they can quickly validate that a list of extents belongs to the same data file, and in which order.

Changes to the data result in changes to the merkle trie, which in turn lead to a new root hash. A new version of the same file thus produces an entirely new identifier.
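As a minimal sketch of the root hash idea, assuming SHA-256, a tiny fixed extent size and a simple pairwise tree (none of which is prescribed by any particular ICN project), the following shows how a one-byte change yields a different identifier. The names `split_extents` and `merkle_root` are made up for this illustration.

```python
import hashlib

EXTENT_SIZE = 4  # bytes; deliberately tiny so the example stays readable


def split_extents(data: bytes, size: int = EXTENT_SIZE) -> list:
    """Split data into fixed-size extents (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def merkle_root(extents: list) -> str:
    """Hash each extent, then hash pairs level by level until one root remains."""
    level = [hashlib.sha256(e).digest() for e in extents]
    if not level:
        return hashlib.sha256(b"").hexdigest()
    while len(level) > 1:
        if len(level) % 2:
            # Odd number of nodes: duplicate the last one (an assumption of
            # this sketch; real schemes differ in how they pad).
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0].hex()


v1 = merkle_root(split_extents(b"hello, world"))
v2 = merkle_root(split_extents(b"hello, World"))  # a single byte changed
print(v1 != v2)  # the new version has an entirely new identifier
```

Only the extent containing the changed byte gets a new hash, but that change propagates up to the root, which is why the file identifier changes with every version.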

The other approach uses manifests: lists of extent hashes that constitute a file, possibly with metadata. These lists are files themselves, for which a hash can be generated, leading to an identifier for the file.

Here, changes update the manifests, and as a result the manifest hashes, and so the file identifier.
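The manifest variant can be sketched in the same spirit. The JSON layout, the field names `extents` and `meta`, and the function `manifest_id` are assumptions made for illustration only; real systems use their own encodings.

```python
import hashlib
import json
from typing import Optional


def extent_hashes(data: bytes, size: int) -> list:
    """Hash each fixed-size extent of the data."""
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]


def manifest_id(data: bytes, size: int = 4, metadata: Optional[dict] = None):
    """Build a manifest (a list of extent hashes plus metadata) and hash it.

    Returns (file identifier, manifest).
    """
    manifest = {"extents": extent_hashes(data, size), "meta": metadata or {}}
    # Canonical encoding so that equal manifests always hash identically.
    encoded = json.dumps(manifest, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest(), manifest


old_id, old_manifest = manifest_id(b"abcdefgh")
new_id, new_manifest = manifest_id(b"abcdefgX")  # last extent modified
print(old_id != new_id)
```

The unchanged first extent keeps its hash and can be reused from a cache, but the manifest itself, and therefore the file identifier, has to be communicated again.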

Either approach makes it relatively difficult to stream data as it is being generated, because each update requires re-negotiation of file metadata.

On the other hand, the basic approach works the same whether file extents are in plain text or end-to-end encrypted. So some of these approaches include encryption, which provides security and privacy (others, sadly, relegate this to the application layer).

These concepts also provide a neat layering of concerns: at the transport layer, all that happens is that extents are requested using their identifiers. From the transport point of view, how the data is encoded is largely irrelevant. All the other concerns are a layer or two above, in the presentation layer.

This conceptual separation is powerful for experimentation. But it also implies that the streaming issue outlined above is entirely invisible to the transport layer, which is awkward… because the job of streaming data lies here, after all.

An additional problem lies in how ICN routing works…

ICN Routing

ICN routing works differently from IP routing; the fundamental concept in ICN is to route to data, not route to machines.

Let’s first look at how IP does routing to machines: In IP packets, the header contains a source and a destination IP address. Routers have routing tables, which effectively state which IP address range(s) are reachable via which physical network interface (and then routers exchange routing information by various means).

Figure: Packet Switching as in IP

So in order to forward an IP packet to the right destination, all the router needs to do is look at the destination IP address in the packet header, find a matching interface in its routing table, and send the packet there.
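A minimal sketch of that lookup, using Python's standard `ipaddress` module. The routing table entries and interface names are invented for the example, and the most specific (longest) matching prefix is assumed to win, as is usual for IP forwarding.

```python
import ipaddress

# Hypothetical routing table: (address range, outgoing interface).
ROUTES = [
    (ipaddress.ip_network("10.0.0.0/8"), "eth0"),
    (ipaddress.ip_network("10.1.0.0/16"), "eth1"),
    (ipaddress.ip_network("0.0.0.0/0"), "eth2"),  # default route
]


def lookup(destination: str) -> str:
    """Return the interface for the most specific route matching destination."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, iface) for net, iface in ROUTES if addr in net]
    _, iface = max(matches, key=lambda match: match[0].prefixlen)
    return iface


print(lookup("10.1.2.3"))   # eth1 (the /16 is more specific than the /8)
print(lookup("10.2.0.1"))   # eth0
print(lookup("192.0.2.1"))  # eth2 (only the default route matches)
```

Note that the router needs nothing but the packet header and this table: no per-packet state is kept.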

TCP complicates this a tiny bit. In TCP, consecutive packets with the same source and destination addresses and ports belong to the same flow. Routing is usually best if all packets from a flow traverse the same path through various routers to the destination, which is why TCP includes flow management features (TL;DR). But this is mostly relevant in the case where a router knows multiple paths to the same destination. The point here is that routers may do more than the IP-destination-address-based routing. But those are in some sense optimizations of the same underlying mechanism.

Figure: Information-Centric Networking

By contrast, in ICN you don’t send a regular packet to a destination in the same way. Rather, you send an Interest packet, which contains a data identifier. ICN routers take note of the machine address an Interest was received from. They then have three basic options:

If they have the data stored locally, either because they are the data custodian, or because they’re caching the data, they can fulfil the Interest and return the data. Once that is done, the Interest is removed from their internal state.
If they do not have the data stored locally, they forward the Interest – but keep a copy internally. The machine they forward to either is
1. a known custodian of the Interest data, or
2. some configured upstream router.
When a response to this forwarded Interest is received from upstream, they forward this response back to the origin they recorded, and remove the Interest from their internal table.
If the data cannot be sourced, a negative response is returned and the Interest is nevertheless flushed from internal storage.
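The three options can be sketched as a small, synchronous model. This is a simplification under stated assumptions: a real router works asynchronously, and here every router has at most one configured upstream rather than a list of known custodians. The `Router` class and its attribute names are hypothetical.

```python
from typing import Optional


class Router:
    def __init__(self, name: str, upstream: "Optional[Router]" = None):
        self.name = name
        self.upstream = upstream  # configured upstream router, if any
        self.store = {}           # data id -> data (custodian copy or cache)
        self.pending = {}         # data id -> origin the Interest came from

    def handle_interest(self, data_id: str, origin: str) -> Optional[bytes]:
        """Return the data, or None as a negative response."""
        # Option 1: the data is stored locally, so fulfil the Interest.
        if data_id in self.store:
            return self.store[data_id]
        # Option 3: nowhere to source the data from.
        if self.upstream is None:
            return None
        # Option 2: forward upstream, but remember where the Interest came from.
        self.pending[data_id] = origin
        try:
            data = self.upstream.handle_interest(data_id, self.name)
        finally:
            # The Interest is flushed whether or not data came back.
            del self.pending[data_id]
        if data is not None:
            self.store[data_id] = data  # cache for later Interests
        return data


custodian = Router("custodian")
custodian.store["doc"] = b"payload"
edge = Router("edge", upstream=custodian)

print(edge.handle_interest("doc", "client"))      # b'payload'
print(edge.handle_interest("missing", "client"))  # None
```

The `pending` table is the per-Interest bookkeeping that plain IP forwarding does not need.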

So far, so good. This reads like a fine way for routing to data rather than to machines.

There is one obvious, and one less obvious issue with this approach.

The obvious issue here is that ICN adds a lot of overhead over IP. In IP, all the routing information (source and destination addresses) is encoded in the IP header. The only thing routers need to know is how to map those addresses to their physical devices, based on a routing table. They keep no state for each packet transmitted (excepting TCP’s flow management).

By requiring each Interest to be kept in order to record a return path for the data, ICN effectively asks for a fair amount of bookkeeping data in the routing layer.

The less obvious problem lies in the question of how data extent sizes relate to MTU on the path. In all likelihood, data extents are chosen to exceed MTU, which means routers have to deal with fragmentation of extents, and potentially with resends.
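To get a feeling for the numbers, here is a small back-of-the-envelope calculation. The extent size, MTU and per-packet header size are example values chosen for illustration, not figures from any specific ICN stack.

```python
def fragments_needed(extent_size: int, mtu: int, header_size: int) -> int:
    """Number of packets needed to carry one extent across a path."""
    payload = mtu - header_size
    if payload <= 0:
        raise ValueError("headers leave no room for payload")
    return -(-extent_size // payload)  # ceiling division


# A 32 KiB extent over a 1500 byte MTU with 48 bytes of headers (assumed):
print(fragments_needed(32 * 1024, mtu=1500, header_size=48))  # 23
```

Losing any one of those packets means the extent is incomplete, which is where resends come in.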

That means that any real ICN implementation either builds on top of TCP to receive these features for free, or perhaps on something like QUIC. Alternatively, it has to include comparable functionality, meaning the outline above is actually way too simple to fully explain the transport layer in ICN.

The Interpeer Stack

The stack we’re proposing works fundamentally differently. While it also distinguishes between the representation, routing and transport layers, it passes more information between them to better optimize for streaming. The guiding principle here is that if streaming can be made reasonably efficient, then non-streaming use cases are also served well – whereas the opposite is not true.

Representation Layer: Vessel

On the representation layer, Vessel works fundamentally differently from merkle trie or manifest based approaches. Ignoring all of Vessel’s other features for the moment, let’s focus on how it provides consistency.

Rather than generate some root identifier based on the file’s contents, it instead expects the root to be generated once, and remain static. This provides a stable reference to work with.

The first extent of data in Vessel is identified by this root identifier. Subsequent extents derive their identifiers from:

The previous extent’s identifier, and
the identifier of the extent’s author.

On the one hand, this provides for a definitive order of extents based on the preceding identifier, all the way to the root. This is, fundamentally, also how blockchain works (but without the proof), or how git works.

By including the author identifier as a second component, we also ensure that two authors creating an extent in parallel off the same parent will generate extents with non-conflicting identifiers.

Vessel then provides an algorithm for disambiguating the order of extents created in parallel. You can read more about that in the Vessel specifications.

The main gain for streaming is that there is no need to communicate new metadata when an extent is added to a resource. Since it refers to the last known extent, it can be established which resource it updates.
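The derivation can be illustrated with a short sketch. It assumes SHA-256 over the concatenation of the previous identifier and the author identifier; the actual derivation and identifier formats are defined in the Vessel specifications and may well differ. The function name `next_extent_id` and the sample identifiers are hypothetical.

```python
import hashlib


def next_extent_id(previous_id: bytes, author_id: bytes) -> bytes:
    """Derive an extent identifier from its predecessor and its author."""
    return hashlib.sha256(previous_id + author_id).digest()


# The root identifier is generated once and then stays fixed.
root = hashlib.sha256(b"example resource root").digest()

# Two authors appending in parallel to the same parent do not collide.
from_alice = next_extent_id(root, b"alice")
from_bob = next_extent_id(root, b"bob")
print(from_alice != from_bob)

# A recipient who knows the parent and the author can recompute the
# identifier, so no separate metadata exchange is needed to place the extent.
print(next_extent_id(root, b"alice") == from_alice)
```

Because each identifier depends on its predecessor, following the chain back always ends at the same stable root, no matter how many extents have been appended.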

Routing Layer: Hubur

The Hubur protocol, which takes the place of a more traditional ICN routing protocol, functions in a way relatively similar to ICN. Here too, Interests are generated and may be forwarded.

Hubur is currently undergoing development, so speaking about specific implementation details is a tad difficult. But we can discuss the abstract functionality well enough.

Hubur leverages the fact that applications are well aware of the context in which they request data. Do they want a single data extent? Or do they want to stream a resource? This information is recorded within the Interest packet alongside the resource identifier.

When a single data extent is requested, it makes sense to provide more or less the same functionality as the ICN approach outlined at the top. We still have more per-extent overhead to manage than per IP or TCP packet. But an Interest in a single data extent is also relatively short-lived.

When the application intends to stream a resource, however, routers now have that information available. The Interest thus serves to document an entire flow of data packets, and is not (conceptually) more of a management burden than a TCP flow.

To achieve this, Interests not only contain the data/resource identifier they are concerned with, but also a client-generated cookie that serves as a reference. Data packets then also contain this cookie, establishing the entire flow. This is not unlike (but also not quite the same as) a flow identifier, used already in existing networks to help optimize data flows.

The effect is that of a publish-subscribe mechanism, where an Interest’s cookie references a subscription to a resource.

If a stream stalls for long enough, intermediate routers may flush Interests from their tables. The data custodian(s), however, should not. If a resource does get updated after some time, it is up to the data custodians to advertise this fact to routers. And if they keep cookie information for long enough, that can even be used to re-establish a path back to the requester again.

Conversely, clients can unsubscribe from a resource with an Interest that, well, expresses disinterest.
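Since Hubur is still under development, the following is only a conceptual sketch of the cookie mechanism, not its wire format or API. The `Interest` fields, the `Mode` values and the `SubscriptionTable` class are all assumptions made for illustration.

```python
import secrets
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    SINGLE_EXTENT = "single"
    STREAM = "stream"
    UNSUBSCRIBE = "unsubscribe"


@dataclass(frozen=True)
class Interest:
    resource_id: str
    cookie: str  # client-generated reference for the whole flow
    mode: Mode


class SubscriptionTable:
    """Per-router state: one entry per flow rather than per packet."""

    def __init__(self):
        self.flows = {}  # cookie -> (resource id, requester, mode)

    def on_interest(self, interest: Interest, requester: str) -> None:
        if interest.mode is Mode.UNSUBSCRIBE:
            self.flows.pop(interest.cookie, None)
        else:
            self.flows[interest.cookie] = (
                interest.resource_id, requester, interest.mode)

    def on_data(self, cookie: str) -> Optional[str]:
        """Return where to forward a data packet carrying this cookie."""
        entry = self.flows.get(cookie)
        if entry is None:
            return None
        _, requester, mode = entry
        if mode is Mode.SINGLE_EXTENT:
            del self.flows[cookie]  # short-lived: done after one extent
        return requester


table = SubscriptionTable()
cookie = secrets.token_hex(8)
table.on_interest(Interest("resource-1", cookie, Mode.STREAM), "client-a")
print(table.on_data(cookie), table.on_data(cookie))  # forwarded both times
table.on_interest(Interest("resource-1", cookie, Mode.UNSUBSCRIBE), "client-a")
print(table.on_data(cookie))  # None: the subscription is gone
```

In streaming mode a single table entry serves any number of data packets, which is what keeps the bookkeeping comparable to that of a TCP flow.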

But what about the fragmentation issues?

Transport Layer: Channeler

This is where Channeler comes in. As it provides for various modes of machine-to-machine communication, it can also be used to provide (de-)fragmentation features.

Because routers have more information at hand on the purpose of data transfers, they can choose the Channeler mode best suited to each transfer: a streaming mode for subscriptions, or reliable retrieval for single data extents.

Channeler contains a kind of flow identifier itself. To abstract from the underlying protocols, and also to be compatible with multi-path extensions (themselves implemented elsewhere, to be ported over), Channeler records a source and destination peer identifier in its packet headers. The tuple of both acts as a uniflow (unidirectional flow, where typically flows are bidirectional) identifier.

Hubur can ask Channeler to produce suitable uniflows for the purpose, and map data flow cookies to uniflows as its main routing information. It is not all that important to keep an entire Interest around after that (though various redundancy and recovery scenarios make it a good idea).
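Conceptually, the resulting routing state is little more than a lookup table. The sketch below is an illustration under assumptions: the `Uniflow` and `FlowMap` names, the string peer identifiers and the methods are hypothetical and do not reflect the Channeler or Hubur APIs.

```python
from typing import NamedTuple, Optional


class Uniflow(NamedTuple):
    """A unidirectional flow, identified by (source peer, destination peer)."""
    source_peer: str
    destination_peer: str

    def reverse(self) -> "Uniflow":
        return Uniflow(self.destination_peer, self.source_peer)


class FlowMap:
    """Maps data flow cookies to the uniflow that carries their packets."""

    def __init__(self):
        self.by_cookie = {}

    def bind(self, cookie: str, uniflow: Uniflow) -> None:
        self.by_cookie[cookie] = uniflow

    def route(self, cookie: str) -> Optional[Uniflow]:
        return self.by_cookie.get(cookie)


flows = FlowMap()
# Data for this cookie travels from the custodian towards the client.
flows.bind("cookie-1", Uniflow("custodian-peer", "client-peer"))
print(flows.route("cookie-1"))
# The opposite direction is a distinct uniflow.
print(flows.route("cookie-1").reverse())
```

Because peer identifiers rather than IP addresses name the endpoints, such a mapping would stay valid when a peer's network attachment changes.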

How do APIs fit into ICN?

That is a very good question.

For our purposes, APIs are endpoints on a peer that provide some kind of bidirectional communication. Let’s not get bogged down in the specific IPC mechanism here.

Strictly speaking, ICN and this kind of bidirectional communications are complementary approaches.

But with Channeler, we have a protocol for connecting peers – technically in a way that is independent from their current IP-or-lower network attachment. And with Hubur’s cookies, we already have some identifier for a data flow. There is no need to limit this to a flow where Interests flow in one direction, and data flows in the other.

We therefore extend Hubur to also permit specifying an Interest in a bidirectional “virtual resource”, i.e. a resource that is not Vessel data. It’s just a name for a particular endpoint, in much the same way as a URL specifies an endpoint, or an IP:port combination does².

Due to the overall ICN approach, this virtual resource could conceivably also have multiple custodians – which is great for efficiency when requesting data. But is this a good idea for bidirectional API access?

One of the things we’re contributing to is the ROSA Internet-Draft. ROSA stands for “Routing on Service Addresses”, and addresses very much this question. It turns out that in a lot of situations, it is actually not a bad idea to resolve an abstract service address to a concrete service instance either per larger request, or indeed per IP packet. ROSA allows both modes.

On the one hand this means that, yes, it should be possible for multiple peers to respond to the same virtual resource aka service address. And that implies that an Interest should also be able to decide whether to have a sticky mapping to a service instance, or whether it’s best served by any instance, whichever is best available.
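The difference between a sticky mapping and best-available resolution can be sketched as follows. This is not how ROSA or Hubur implement it; the `ServiceResolver` class, the load metric and the instance names are assumptions made purely to illustrate the two behaviours.

```python
class ServiceResolver:
    """Resolves an abstract service address to one of several instances."""

    def __init__(self, load_by_instance: dict):
        self.load = dict(load_by_instance)  # instance -> current load
        self.pinned = {}                    # flow cookie -> pinned instance

    def resolve(self, cookie: str, sticky: bool) -> str:
        if sticky and cookie in self.pinned:
            return self.pinned[cookie]      # keep talking to the same instance
        best = min(self.load, key=self.load.get)
        if sticky:
            self.pinned[cookie] = best
        return best


resolver = ServiceResolver({"instance-a": 1, "instance-b": 5})
print(resolver.resolve("flow-1", sticky=True))   # instance-a
resolver.load["instance-a"] = 9                  # instance-a becomes busy
print(resolver.resolve("flow-1", sticky=True))   # still instance-a
print(resolver.resolve("flow-2", sticky=False))  # instance-b
```

A sticky mapping suits APIs that hold session state on one instance, while best-available resolution suits stateless requests.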

It is one research goal to provide a mapping between Hubur’s abstract service addressing and ROSA’s much lower level mechanisms.

What about Wyrd?

Wyrd, the conflict-free replicated data type, is not technically part of the ICN stack. But it is relevant to the bigger picture.

Most network communication is designed to work between two endpoints. As such, it is relatively clear that when one endpoint writes an update to a resource, this update is authoritative and can simply be appended to the resource.

Modern usages of the Internet assume, however, that multiple people collaboratively edit a single resource. And the technology of choice for this kind of operation, at this point in time, is a conflict-free replicated data type.

With Vessel, we provide a semblance of order for multi-authoring of a resource. But this really just disambiguates in which order parallel data extents are to be processed in a content agnostic way. When you wish to merge modifications from multiple authors, it is required that either the order of modifications is not important, or an absolute order can be established with more precision.

As a CRDT, Wyrd takes care of this. However, Wyrd and Vessel cooperate here: since Vessel provides ordering of parallel extents, and an overall mechanism that makes parallel extents relatively unlikely, it also follows that Vessel effectively provides a view of a resource in slices.

Here, each slice unambiguously follows its predecessor. Within each slice, all extents were created in parallel. Vessel provides an order in which to process these parallel extents, but only in a content-agnostic way.

This reduces the amount of effort a layer such as Wyrd has to take to produce a content-aware, absolute order. At the same time, Wyrd also does not have to take care of e.g. encryption or authorization, because Vessel takes care of much of that.

Finally, Vessel provides an abstraction from Wyrd’s detailed data structures to Hubur and Channeler, which just deal in resources and extents.

What about CAProck?

CAProck’s contribution to this is that it provides cryptographic capabilities that record provable authorization information. It’s much like a digital key card. The card will open a lock whether or not the lock currently has Internet access.

That mechanism makes it unnecessary (in most use cases) to require an authorization server that can be asked whether a specific kind of access is authorized. That’s great for a lot of reasons!

We use CAProck in two distinct ways:

Authors can embed CAProck capabilities in a Vessel resource, multiplexed with the remaining data. This allows them to tell recipients who may or may not do certain things with the resource along the same channel as the resource itself gets distributed. Such recipients may be routers, but also clients.
Clients can include capabilities they know apply to them in an Interest, which allows routers and data custodians alike to refuse or permit service based on authorization information. No round-trips to an authorization server are required.
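The key card idea can be illustrated with a deliberately simplified sketch. Important caveat: this is not CAProck's token format or cryptography. It uses an HMAC from the Python standard library as a stand-in for the signature, which means issuer and verifier share a secret key here; the claim fields and the function names are hypothetical.

```python
import hashlib
import hmac
import json


def _tag(key: bytes, claims: dict) -> str:
    payload = json.dumps(claims, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def issue(key: bytes, subject: str, action: str, resource: str,
          expires: int) -> dict:
    """Create a capability token stating who may do what, and until when."""
    claims = {"subject": subject, "action": action,
              "resource": resource, "expires": expires}
    return {"claims": claims, "tag": _tag(key, claims)}


def permits(key: bytes, token: dict, subject: str, action: str,
            resource: str, now: int) -> bool:
    """Check a token locally, without asking any authorization server."""
    if not hmac.compare_digest(_tag(key, token["claims"]), token["tag"]):
        return False  # forged or tampered with
    c = token["claims"]
    return ((c["subject"], c["action"], c["resource"])
            == (subject, action, resource) and now < c["expires"])


key = b"stand-in shared secret"
token = issue(key, "alice", "read", "resource-1", expires=100)
print(permits(key, token, "alice", "read", "resource-1", now=50))   # True
print(permits(key, token, "alice", "write", "resource-1", now=50))  # False
print(permits(key, token, "alice", "read", "resource-1", now=150))  # False
```

The point of the sketch is only that a router or custodian can make the decision from the token it was handed, like a lock checking a key card while offline.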

Relation to Local-First Software

A recent Wired article titled The Cloud Is a Prison. Can the Local-First Software Movement Set Us Free? argues that

To build products like this would require fundamentally different ways of structuring data. Different math. The result of that effort? Less shitty software. Freed from worrying about backends, servers, and extortionate cloud computing fees, startups and indie developers could skip strings-attached VC funding and pursue more interesting apps. What’s more, they could take advantage of hardware improvements that cloud developers often missed out on. When an app is cloud-based, its performance is limited by the speed of its connection to the central server and how quickly that server can reply. With a local-first app, the user’s device runs all the code. The better your laptop or smartphone gets, the more the app can do.

A CRDT does not really change how documents are represented in memory – often, a traditional in-memory representation remains the most efficient. But rather than serializing this representation into an on-disk document, CRDTs serialize a series of updates or changes, which can be replayed to reproduce the in-memory representation.
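A minimal sketch of that idea, using a last-writer-wins property map as the CRDT. This is one of the simplest CRDT designs and is not meant to describe Wyrd's actual data model; the `SetOp` fields and the `replay` function are assumptions for illustration.

```python
from typing import Any, NamedTuple


class SetOp(NamedTuple):
    """One serialized change: set a property to a value."""
    timestamp: int
    author: str
    path: str
    value: Any


def replay(ops) -> dict:
    """Rebuild the in-memory document from a log of changes.

    For each property, the operation with the highest (timestamp, author)
    wins, so the result does not depend on the order the log arrives in.
    """
    winners = {}
    for op in ops:
        current = winners.get(op.path)
        if current is None or (op.timestamp, op.author) > (
                current.timestamp, current.author):
            winners[op.path] = op
    return {path: op.value for path, op in winners.items()}


log = [
    SetOp(1, "alice", "title", "Draft"),
    SetOp(2, "bob", "title", "Final"),
    SetOp(1, "bob", "body", "Some text"),
]
print(replay(log))                  # {'title': 'Final', 'body': 'Some text'}
print(replay(log[:1]))              # a partial log still yields a usable view
print(replay(log) == replay(list(reversed(log))))  # True
```

The second call hints at why such logs remain useful when only partially synchronized: every prefix or subset of the log replays to a valid, if older, document.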

With Wyrd, we write such changes into Vessel extents, and extents can be addressed individually, or an entire resource subscribed to. ICN thus provides an ideal storage & distribution medium for logs of changes that can be used even when partially synchronized.

What’s missing from the data focused discussion of local-first software like this article is that, fundamentally, ICN is better suited to this kind of local-first approach to data. It’s particularly relevant that such logs can be streamed efficiently, and that they do not require awkward metadata renegotiation with every change. The Interpeer stack is being built as a foundation for local-first software.

Summary

If this post is supposed to do anything, it is to illustrate how a more co-operative layering approach in the Interpeer stack can lead to better streaming performance. This then also highlights the main distinguishing factor of the Interpeer approach over more traditional ICN: the layering of responsibilities is slightly different and more co-operative.

Vessel is optimized to not require metadata communications overhead. Hubur contains a publish-subscribe mode for streaming access. And Hubur can leverage Channeler’s transport modes to have as much or as little reliability as the access mode requires. Finally, Wyrd provides for an API to the application that is as simple as modifying properties in a document tree.

As to what’s next, Hubur is under active development. Wyrd is getting more development time, because it can certainly be better. But we also need a little more routing related work, which we’ve submitted additional grant proposals for. And then we’ll have to put it all together into a coherent whole. Integration is sure to reveal a raft of things to address.

I wrote in the beginning how our work is based on research grants. That is true, and undoubtedly a great source of most of the income required to make all of this happen.

Research grants tend to focus on results, however, be it code that demonstrates something works, or academic papers, etc. What they’re very bad at funding is simple code maintenance. Bug fixing. Performance tweaking. All the things that make something run well are not easily financed this way (neither are administrative costs).

Note that implementation-first, integration-later is not a common approach for modern software development. But that is in part because modern software development is incredibly strongly influenced by venture capital. VC demands that an MVP is produced, which is then iterated upon.

Unfortunately, such approaches are not easily compatible with research grants. Each grant needs to have some kind of measurable goal or impact, and nobody funds development of the entire stack. As such, we’re taking a lower risk approach and chunking up development project-by-project or even milestone-by-milestone, in order to integrate later. At minimum, this gives the public something should we fail in our mission.

For that reason, we do also rely on donations. Anything you can give is appreciated. Or contribute to source code and issues. And if neither of those works for you, sharing this on social media still helps. Thank you very much!

¹ The typical name for an extent is a “chunk”, or something similar. We use the term extent as it is used in filesystems. Here, extents denote multiple fixed-sized blocks that are to be treated as a unit. Vessel’s extents are strict multiples of a block size.
² The main distinguishing factor between a Vessel resource and a virtual resource is actually whether data can flow from a client producing an Interest to a custodian. Furthermore, caching may not be desirable for virtual resources. Hubur records these as option flags to an Interest rather than fully separate approaches, which permits later research into special treatment of data flows with specific option combinations.