Intro to JFDB

by Jared Flatow

❘ August 2016

Curiosity killed the cat by SuperJesusPump

In defense of curiosity

When I tell people about JFDB, their first question is usually What's wrong with all of the existing databases?. I originally felt pressure to tell a story along the lines of Here's a problem that I faced, here's why none of the existing solutions work, here's how my solution works. But in trying to explain what's wrong with everything else, I kept hitting a wall.

I've since come to the realization that telling the story that way is disingenuous. I didn't actually set about writing JFDB because other systems failed to do one particular thing or another. I didn't have to create JFDB at all. It's always easy to justify continuing with status quo.

The truth is I was driven by a nagging sense that there's got to be a better way and a compulsion to figure out what that could be. The pieces of the puzzle just never gave me that satisfying click, and if nothing else, I wanted to understand why. For me, curiosity is often reason enough to try something new.

Eye of the beholder

With that said, in my opinion, JFDB is an elegant solution to a problem I've been thinking about for many years. And the process of creating it has brought me much personal satisfaction. But as the saying goes, beauty is in the eye of the beholder, and I'll be the first to admit that I'm kind of weird.

My mom is a social worker. My dad was a photographer, now a beekeeper. My first computer was a Commodore 64. In college I studied electrical engineering (I was a terrible student, but it shaped the way I look at things).

I'm fascinated by the idea of perpetual learning machines. For the past decade, I've alternated working on infrastructure, (machine) intelligence, and (human) interaction. Its all pretty much motivated by my obsession with self-sustaining feedback systems, a.k.a. life.

JFDB falls into the infrastructure bucket.

Storage and retrieval

Suppose I'm collecting data over time. I want to remember the data, and to be able to search it by various properties.

For example, let's say I want to store data that each look something like this:

{
  "first_name": "Jared",
  "last_name": "Flatow",
  "email": "jared@convex.io",
  "interests": ["machine learning", "convex optimization", ...],
  "affiliations": [{name: "AdRoll", title: "Engineer", ...}, ...],
  ...
}

And then I want to retrieve the data in various ways:

search by first name or last name or email address
search by affiliation name or title
search by interest
search by a partial match of any of the text
any combination thereof, e.g. engineers interested in convex optimization

How do I do that?

Garden of Eden by Jan Brueghel the Elder and Pieter Paul Rubens

In the beginning, things were relational

Okay, so maybe not the very beginning, but pretty early on. I think it's fair to call the relational model the classic approach. At the least, relational databases have been the de facto off-the-shelf solution for information storage and retrieval for the past 40 years.

Most of us are familiar with the relational model, but let's quickly review:

relations are tuples, each field in the tuple is an attribute with a domain, as specified by a schema
relations can be retrieved by matching combinations of their attributes
relations can be joined via their common attributes

Here's what our schema and relations might look like in a relational database:

I can't help but notice that there are several awkward things about this model.

First of all, there is an impedance mismatch between the data we are collecting, and how it is handled by the database. Our variable length lists were transformed into separate relations, tied to their owner by an id. We had to define three types of relations initially, and must use those whenever we want to work with the data. This places a cognitive burden on not only the developer, but also on the machine. There is a ubiquitous layer of mapping between the natural form of the data, and the relational form, and that is not free.

Secondly, we are required to declare all the attributes our relations may have up front. During development especially, this can be an onerous burden. When multiple parts of the system are evolving in parallel, such a rigid structure can slow down iterations on the system itself. Of course in other cases, we may find such structure valuable, but in this particular case, it feels unnecessary.

Finally, in the real world, we need to give the database more information in order to feasibly handle the ways we are going to search. We generally declare indices along with our schema, so that the database knows to make retrieval by those attributes more efficient. Any property we want to search by must be an attribute within a relation, we cannot derive properties just for searching. The support for partial text matches are foreign to the relational model, and would need to be handled by custom extensions to the model, specific to the particular database we are using.

I can't help feeling that this is all more complicated than it needs to be.

Documents are the new tuples

Of course, I'm not the only one who feels that way. Those coming from the relational model might think what happens if we add support for dynamic and nested attributes to relations? That's essentially what document-oriented databases do.

Documents can be added to typical relational databases by adding a new type and a corresponding syntax to the query language. That's essentially what PostgreSQL does with JSONB, for instance.

Alternatively, we can start from scratch and abandon the relational model altogether. The past 15 years have seen the rise of document-oriented databases which do precisely that (e.g. MongoDB, CouchDB, etc.).

In any case, we can store our person data more or less directly under the document model. But in every DBMS I know of, we must separately describe which parts of a document we want to be indexed for fast retrieval. To do that, every database has its own language for annotating what should happen when the database indexes a document. And although the goal is to hide implementation details, in practice, we must be aware of how the indices are maintained, in order to understand system performance.

In other words, every database adds a specification layer and a runtime, which invariably increase complexity of the system.

This whole thing is something of black box. This is the point where you need to decide to take the red pill or the blue pill. Do you just want to use the black box, and never worry what happens inside it? Or do you want to open the box?

I think you know which I'm going to choose. I want to try and find a simpler model. And if I can't do that, then I want to understand why not.

Keys to Your Acadamic Success by University of Nebraska-Lincoln

Just give me the keys (and values)

If we open the box, what we find are key-value stores and a bunch of logic for controlling them. Most databases are built on top of some sort of B-tree. A primary key maps to a value in one of these trees, and indices are implemented as secondary trees whose value points to the primary value.

We could replace our favorite document-oriented database with our favorite key-value store, and things would be fine and dandy. We only start running into problems when we try to index our data.

First of all, we need a way of storing a pointer directly to a value in another tree. And we need a way of making sure that pointer is always valid. We don't want to start writing a primary key, and fail before the indices are written. We want to write the primary key, value, and indices together atomically. Well, no problem, our key-value store probably has transactions.

But now we have another problem. Let's say we have an index pointing to a key's value, and that key-value gets overwritten. We must somehow make the old indices obsolete, atomically together with the primary key update.

You not only need to figure out which index keys are affected, you need to obsolete them all atomically with the primary key-value.

This is an annoyingly complicated issue to think about every single time you want to index data. It seems to me that if indexing key-value data were easier, in many cases a DBMS would not be needed. I wonder what happens if we try to push some of this complexity down to the key-value storage layer?

Lego my eggo

I think what we actually need is a more powerful building block. Instead of just primary keys and values, I want to make indices a primitive. I have a dream that looks like:

The indices are written atomically together with the primary key and value. When a primary key is overwritten, the value and indices are completely replaced as well.

Under this model, our motivating example becomes trivial. We write something like this to our database:

"jflatow@gmail.com"
 => {"first_name": "Jared", ...}
 <= ["first_name:Jared",
     "last_name:Flatow",
     "interest:machine learning",
     "interest:convex optimization",
     "affiliation-name:AdRoll",
     "affiliation-title:Engineer", ...]

And then we could search using logical combinations of indices, for example:

"affiliation-title:Engineer" AND "interest:convex optimization"

Notice how these indices can easily be derived properties rather than explicitly stored attributes of the value. We can index anything we want.

Ideally, all the redundant prefixes would be compressed. Better yet, if we can search by prefix. While we're at it, lets have the keys ordered, so we can search for arbitrary ranges too. For instance, first_name:J to find anyone with a first name starting with J. Or interest: to find anyone with any interests at all. Or affilitation-name:AdRoll...affiliation-name:Apple to find anyone affiliated with an organization name between AdRoll and Apple.

Imagine the possibilities

We've only scratched the surface on how we can use this API.

Demo how ideal version works with additional features:

ordered keys / range queries
key compression
nested objects

This section is incomplete. For now, the README contains the clearest examples of more sophisticated usage.

Now the NP-hard part

I can't wait to use this API. But first, we need a database design that can implement it efficiently. This seems like a good time to take stock of what we want and what we know.

We know we want to atomically write (primary, value, indices) on updates. Given how we are using keys (ordered, prefix-compressed), we also know we probably want to use some kind of trie.

We're also going to want to persist this thing on disk. It should never possibly exist in a corrupted state (machines can crash at any time and databases are supposed to be reliable). Ideally we can work with the resource the same way we work with a normal file: open it, use it, close it.

So if we know roughly what the end result should look like, how do we build it? How do we begin to evaluate a design? Let's pause to take a moment and consider the next step.

A little perspective

At times like this, I find it useful to take a step back and imagine the realm of possibility, a.k.a. the design space. The goal is to find the most efficient design (in this space) that meets our criteria. I also want to find a simple solution, but we're in luck because simplicity is efficiency.

Our design space in this case is the set of all possible database designs. And roughly speaking, we can measure the efficiency of a database design in terms of reading and writing. To understand the solution space a little bit better, let's evaluate some well-known database designs according to a loose conception of read and write throughput.

Here we visualize the fact that LMDB favors read performance, while LevelDB favors writes. BerkeleyDB and SQLite are designed for roughly even amounts of both.

Almost all databases are based on B-trees. In order to get some more context, let's add a couple of more data points. Let's include log files, which are pretty much optimal for writing. And we'll also include DiscoDB, which is immutable and uses perfect hashing to optimize for reading.

Looking at it this way gives us a feel for the design space. The last two we added are especially interesting because they are at the extremes. We can use the extremes to help us understand the space in between. Let's think of these as our basis vectors.

We can try to understand all database designs as a combination of the basis vectors. The more a database favors writing, the more it is like a log file. The more a database favors reading, the more it is like an immutable perfect hash.

The write spectrum

To get some deeper insight, I want to look at how each of these examples behaves with respect to incremental updates. Let's consider what happens when we add N items to each database, one at a time.

(We'll ignore the key size for now since it doesn't change the relative analysis much. We're also mainly paying attention to disk operations, since this is generally the dominant cost in a database.)

On one end of the spectrum, we have the log:

On the other end of the spectrum is DiscoDB:

Somewhere in the middle we have the copy-on-write B-tree used by LMDB:

And a little bit closer to log files we find the variant of log-structured merge trees, which JFDB is based on:

The power of power laws is kind of remarkable. This is a simple idea which works well in practice, a.k.a. leverage. This strategy also has another property that I'd like to exploit: a chance to take any database and optimize if for reading. The cost of lookups is basically due to the number of layers we may need to search in order to find what we're looking for. So it should be possible to merge the layers at any time, and lookups would be O(1) instead of O(log N). This almost sounds too good to be true, the caveat is we pay O(N) to compress the database to a single level.

Putting it together

Now we have not only an API, but a strategy for writing to the database. We also need a data structure that can handle all of these requirements.

Consider a trie which looks something like this:

Now imagine that we have a way of constructing such a trie from a single (primary, value, indices). We'll call the result an atom:

Imagine also that we have a way of merging tries together to form other tries. By induction, we can create tries of any size. We can write these tries sequentially to form layers of information in our database.

Every time we write to the database, we add a layer which is a trie. Whenever we reach an even number of layers, we merge all the small layers into a larger trie. When we reach 2ⁿ layers (where n is the level of our database), we merge all the layers into a new file, and increment the level of the database.

It's not that complicated, we are basically just counting:

JFDB has just such a data structure called a JFT: an adaptive radix tree with an in-memory layout the same as on disk. And it can efficiently merge up to 64 tries at a time. JFDB separates the database into two files, a keys file which contains the tries and a vals file which contains the value data. The values in the tries are offsets into the vals file or NULL to represent a lack of value (i.e. for deletion).

When you merge JFTs into a new trie, a scan list or jump table (and range) is chosen based on the branching factor of each merged branch. Tries on disk can be directly mapped to data structures in memory for lightning fast load times. Sub-tries are automatically cached in virtual memory pages by the operating system. The merge operation also serves as a way of folding over many tries to produce an arbitrary accumulated value.

Primary keys and indices in a JFT are actually two (special) keyspaces in a single trie, distinguished by a 1-byte prefix. Each leaf node in the primary keyspace stores an offset into the vals file (or NULL). Each leaf node in the indices keyspace stores a list of offsets to the nodes in the primary sub-trie.

The views of a navigation interface by Apple

Combine the layers

We need to combine layers whenever we read from multiple tries, or merge and write a new one. Let's consider how exactly that's supposed to work.

In the primary keyspace, new paths simply clobber old paths. When a path is clobbered, any values under the path become obsolete. Normally, this happens implicitly, since when we search backwards for a path, we will stop after we find the most recent modification to it.

In the indices keyspace, new paths are merged with old paths. This is implemented by searching all levels for indices.

However, this introduces a problem, since values from the old layers may have become obsolete. This is actually the problem we noticed earlier when we looked at how to store indices in a plain key-value store. Fortunately we can solve this problem efficiently at the low-level, so users of JFDB never have to worry about it. If we set a dirty bit on the leaf node whenever the value is overwritten, we have an easy way to check whether we should skip it (i.e. when looking up indices or merging tries). Unfortunately, this generally means an extra page will be written to disk every time the database is modified, but that's a story for another time and place.

Mind the gaps

When we write a value, we may need space to store it. New space is allocated (in blocks) from the vals file whenever a non-nil value gets written. The keys file stores tries with values that are offsets into that file. Deletes are represented as a NULL value in a trie, and remain explicit until compacted out. A gap is created whenever a value gets overwritten.

To support reusing space in the vals file, we can track the largest gaps and try to allocate from those instead of appending to the file. But to be certain which gaps are the largest, we would need to keep track of all of them. In JFDB we keep only a fixed number of gaps, and thus may lose sight of where the biggest holes are. To mitigate this, whenever we compact the layers into a fresh keys file, we also recalculate the largest gaps in the vals file.

Moses views the promised land by Gerard Jollain

To be continued...

Until then, if you want to know more about the implementation, please check out the project on GitHub. Especially if you want to see how the search interface works.

As far as benchmarking, there's still a lot of work to be done. For a rough sense of throughput, my laptop can do:

~25K writes/second using a db size of 1M primary keys
~2M lookups/second on the same database
~0.7 seconds to crush the database to a single layer
~4.5M lookups/second on the crushed database

I get similar throughput with at least 10x the number of keys. I have yet to experiment with much larger databases, though in theory nothing should change. Keep in mind these lookups are not doing anything with the values, nor are they using any of the advanced querying features.

JFDB has become a great tool for me these past few months, I hope others may find it useful too!