Comparing Vessel to a Merkle DAG

One of the recurring conversations I’m having is on whether vessel is a Merkle DAG or Merkle Tree/Trie, and every time I have to start over with explaining that it is not. And this is a deliberate choice.

In this post, I’d like to explore the differences – and this post will also kick off a mini series on how vessel and it’s sibling project wyrd together form a DAG-based conflict-free, replicated data type (CRDT) akin to a Merkle CRDT.

tree

Figure: “tree” by Robert Couse-Baker is licensed under CC BY-SA 2.0

Merkle Trees

But first things first. What is a Merkle tree, or trie, or directed acyclic graph (DAG)?

It’s an elegant concept by which you can easily identify arbitrarily large data. It goes like this:

  1. Chop up the data into blocks, most commonly of equal size.
  2. Compute a cryptographic hash over each block, which so uniquely identifies the block (if the content changes, the hash would change).
  3. Concatenate two adjacent sibling hashes, and compute a hash over the result. Continue this for all siblings. These form a new layer of nodes in the tree.
  4. Continue concatenating and hashing hashes at each layer until a single root hash remains.

The root hash so computed would need to change if any of the content blocks were to change. In this way, it can serve as an identifier for the entire content.

Figure: Graphical representation of the Merkle Tree. Azaghal, CC0, via Wikimedia Commons

Merkle trees have some desirable properties; one is that each sub-tree is in itself a complete tree. This means that one can perform integrity checks on incomplete data: if in the above example, only L1 and L2 were known, it would still be possible to verify them via Hash 0.

Another property is immutability. As any modification of data blocks must necessarily lead to a new root hash, every root hash effectively describes an immutable data set.

Merkle Tree Complications

For the purposes of the Interpeer Project, Merkle trees pose two major issues that are related, and a number of smaller, additional issues.

The first is immutability. Merkle tree based approaches “solve” mutability by acknowledging that modifications result in a new root hash. Therefore a root hash can never be an identifier for a conceptual resource; it is only a suitable identifier for a particular version of a resource. However, the Interpeer architecture expressly revolves around mutable resources with stable identifiers.

If one were to use a Merkle tree, this would require some kind of mapping from a stable identifier to the current up-to-date root hash. This is, for example, how the Interplanetary File-System (IPFS) approaches this issue.

The downside here is that this means resources are no longer self-contained. They rely on this mapping. Furthermore, there is nothing intrinsic to Merkle trees that would allow you to verify the mapping is an accurate reflection of the resource author’s views. Finally, resolving such a mapping requires out-of-band communications, which may be costly or incur latency.

Immutability can also be a legal liability. The EU’s “right to be forgotten” would be violated if personally identifiable information could not be removed from records in a way that is both reliable and in depth.

Immutability is a very attractive feature, because it provides a solid and simple basis for other building blocks. Unfortunately, there are more issues with it than it solves (in the context of this project).

The second issue is not so much an issue with Merkle trees in themselves, but with what they’re lacking. A root hash is great at telling you whether a data sequence is consistent with some identifier. It cannot make any claim as to whether the data’s author agrees with this, or whether the sequence and hash have been subject to a man-in-the-middle attack.

A root hash is not enough. We also need a cryptographic signature (and perhaps content encryption, though that is easily performed before computing the tree). Which means there is need for additional out-of-band messaging, both of the signature itself (which can be over the root hash), as well as of the key or key identifiers with which it was created.

For the Interpeer architecture to have any sense, however, it must revolve around self-contained resources. So how can we create a different system?

Vessel DAG

The first thing we note is that immutability can only be solved if we do not use a hash over a content chunk as a leaf identifier. Instead, each leaf’s identifier must itself remain stable while the content can mutate. Let’s say we compute them at random.

Assuming such a stable identifier, we can still construct a Merkle-like tree with a root hash. It doesn’t really allow us to verify the content any longer, but it allows is that a particular sequence of data chunk identifiers is what makes up the current version of the resource. We can modify chunks, and the identifier remains stable.

What this doesn’t solve is addition or removal of chunks, either at the end of the resource or somewhere else in the tree. This would necessitate communicating not only a new tree root, but also the hashes on the path from the root to added leaf. The tree growths logarithmically with the content, and so then does the communications overhead. That’s a problem for streaming applications where the streams may be arbitrarily long.

The way to solve this is to create a link between a data chunk and the chunk following it, and the ways to do this boil down to two:

  1. Either each chunk declares the identifier of the chunk that follows it. This is relatively elegant, as we could avoid random identifiers and instead base them off some data in the prior chunk.
  2. Or each chunk identifies the chunk it is following. This is more or less how does things, and creates a causal relationship between changes.

Git also solves an adjacent issue, which is that in an environment in which multiple authors contribute to a shared resource, it allows each author to sign their change cryptographically.

In fact, vessel does it precisely in that way: it records the parent extent (chunk) identifier, and the author key (identifier), and then also a cryptographic signature of its contents with that key.

In fact, this provides for a stable algorithm for generating extent identifiers: we can concatenate the parent identifier and the key identifier, and compute a hash over the result. This makes the sequence of extents verifiable in much the same way that a Merkle tree would.

As a result, extents cannot be just raw application data any longer. They must contain some envelope information as well.

All that remains is an algorithm for generating the origin extent identifier of a resource. For now, vessel is just using a sufficiently large identifier space that random identifiers should suffice. This origin extent identifier also serves as the identifier for the overall resource.

The resulting DAG survives modification, but can be more deeply verified than a Merkle DAG. It provides for stable resource and sub-resource identifiers, and each sub-tree remains a complete (and so verifiable) structure.

Additional Vessel Features

You may wish to read the full vessel specification for details – suffice to say that having multiple authors requires some additional tie-breaker algorithm in forming the DAG.

Additionally, since vessel no longer describes raw content, we’re putting content into sections instead – and vessel can multiplex sections of various different types. This allows us to equally encapsulate arbitrary application data, as well as special CRDT sections.

But that’s the topic of the next post.


Published on January 24, 2024