The Computer Wants to Lose Your Data: Bonus Bits
Back in October, I gave a talk at SREcon EMEA called ‘The Computer Wants to Lose Your Data’ on the challenges of building crash-safe databases. It ended up being the most research-intensive one I’ve ever written[1] – something I should have anticipated when writing the proposal, but didn’t fully realise until I was a solid week into writing the talk.
More so than most talks I’ve given, I ended up leaving a lot of that research on the cutting-room floor. There were multiple points where I could have spent fifteen minutes diving into a tangent with enough detail to stand up as its own talk. In lieu of that, for anyone who found themselves wanting even more detail, I thought I’d turn the bookmark folder from my research into a blog post.
If you haven’t watched the talk, I highly recommend doing that first. Everything below this intro assumes you’ve seen it.
Differences I glossed over
I said up front in my talk that I was going to gloss over some topics for the sake of time. The biggest of those topics by far is the architectural differences between MySQL’s and Postgres’s storage engines.
MySQL doublewrite buffer vs Postgres full page writes
At the very end of the section on the doublewrite buffer, I mentioned that Postgres has its own way of solving the same torn-write problem – full page writes.
In contrast with MySQL’s separate doublewrite buffer that all writes have to go through, Postgres writes a full copy of a page the first time it gets written to after a checkpoint. They’re able to make use of the append-only nature of write-ahead logs (WALs) to easily detect an incomplete write and hold onto the good copy of each page. As WAL playback happens from the last successful checkpoint, the full page only needs to be written to WAL once per checkpoint. All later writes to the same page can use a much more compact logical change description until the next checkpoint happens.
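The once-per-checkpoint logic can be sketched in a few lines of Python. This is a toy model with invented names – not Postgres’s actual implementation – but it captures the decision described above:

```python
# Toy model of Postgres-style full page writes. All names here are
# invented for illustration; the real logic lives deep in Postgres's
# WAL insertion code.

class WalSketch:
    def __init__(self):
        self.records = []          # the append-only WAL
        self.fpw_done = set()      # pages written in full since the last checkpoint

    def checkpoint(self):
        # After a checkpoint, recovery starts from here, so every page
        # must be written in full once more before compact records resume.
        self.fpw_done.clear()

    def log_change(self, page_id, page_image, change_desc):
        if page_id not in self.fpw_done:
            # First touch since the checkpoint: log the whole page image
            # so recovery can restore it even if the on-disk copy is torn.
            self.records.append(("full_page", page_id, page_image))
            self.fpw_done.add(page_id)
        else:
            # Later touches can use a much more compact logical description.
            self.records.append(("change", page_id, change_desc))

wal = WalSketch()
wal.log_change(7, b"\x00" * 8192, "set x=1")   # logs a full page image
wal.log_change(7, b"\x00" * 8192, "set x=2")   # logs a compact record
wal.checkpoint()
wal.log_change(7, b"\x00" * 8192, "set x=3")   # full page image again
```

Running it, the WAL ends up containing a full page record, a compact change record, and – after the checkpoint – a second full page record for the same page.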
While the Postgres approach has the advantage of reducing the overall number of full pages written, it has its downsides too:
- It happens on the synchronous write path of a query, so it adds some latency to the queries themselves.
- It can lead to spikes of disk writes after checkpoints, which can make database performance less consistent[2].
- Because Postgres clusters use physical replication, those spikes in write activity translate into spikes of data sent to replicas over the network.
These different mechanisms evolved in the contexts of the databases they’re part of, and I would be reluctant to call one better than the other without comparing how they perform under a variety of workloads.
O_DIRECT vs the kernel pagecache
I ended up skipping over this because it wasn’t essential to any of the points I made[3], but it’s worth mentioning that MySQL and Postgres have radically different approaches to buffering page writes in memory and playing them out to persistent storage later. MySQL’s InnoDB has its own buffer pool and uses Direct I/O to bypass the Linux kernel pagecache, whereas Postgres leans heavily into the Linux kernel pagecache for I/O buffering.
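To make the distinction concrete, here’s a minimal sketch of opening a file with O_DIRECT from Python. Whether it succeeds depends on the platform and filesystem (tmpfs, for one, refuses Direct I/O, and os.O_DIRECT doesn’t exist on macOS), so the sketch just reports what happened:

```python
import mmap
import os
import tempfile

# Direct I/O requires the buffer, file offset, and length to all be
# aligned (typically to the logical block size). mmap hands back
# page-aligned memory, which satisfies the strictest common requirement.
buf = mmap.mmap(-1, 4096)
buf.write(b"x" * 4096)

tmp_fd, path = tempfile.mkstemp()
os.close(tmp_fd)
try:
    # O_DIRECT tells the kernel to skip the pagecache for this descriptor.
    fd = os.open(path, os.O_WRONLY | os.O_DIRECT)
    os.write(fd, buf)  # an unaligned plain bytes object here may fail with EINVAL
    os.close(fd)
    result = "wrote 4096 bytes, bypassing the pagecache"
except (AttributeError, OSError) as exc:
    result = f"O_DIRECT unavailable here: {exc}"
finally:
    os.unlink(path)
print(result)
```

The alignment requirement is part of why databases that choose O_DIRECT, like InnoDB, end up managing their own carefully aligned buffer pools.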
While there are plenty of high-level explanations of the two implementations, I haven’t been able to find much writing that directly compares their trade-offs. Uber’s 2016 article on why they switched from Postgres to MySQL has a short section titled ‘The Buffer Pool’ that explains some overheads they saw from the Postgres approach, but I suspect there’s plenty more to compare between them.
If you know of a more detailed teardown of the two approaches, please send it my way.
The Postgres mailing-list thread(s)
A smaller thing I glossed over is that while the first Postgres fsyncgate thread is the most famous, there were two other threads on pgsql-hackers that I learned a lot from too.
As you’d expect from discussion threads, you’re not going to get a neat summary when you open them. They’re long reads and they meander.
They’re also among a small handful of places that record the decisions that were reached at the time. I’d struggle to recommend them as a casual weekend read with a coffee, but I feel they’re worth sharing for anyone who wants to really dig deep into the subject.
In chronological order:
- There’s the infamous one that kicked off the whole discussion.
- There’s a second, much more solution-focused one which got split out a month later to focus on the specific changes Postgres and the Linux kernel would make.
- Six months later, a thread on refactoring the checkpointer’s fsync request queue ended up touching on the fsyncgate topics again. It’s one of the few places that talks about the inode cache eviction problem (see ‘An unanswered question’ below), though no conclusion is reached about how to solve it.
The Linux kernel perspective
While my talk looked at fsyncgate from the perspective of Postgres, the Linux kernel team was also heavily involved in solving the problem, and had their own detailed discussions about it.
Matthew Wilcox’s PGCon 2018 talk – “How Linux handles IO errors” also provided a tonne of useful information. It was by far the most detailed record of the Linux side of fsyncgate that I was able to find.
In addition, there were a few short LWN articles that covered the error handling changes in the kernel, how they affected Postgres, and Postgres’s subsequent visit to the 2018 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM).
Torn-write prevention in the Linux kernel
Preventing torn writes is a topic that’s had renewed interest in the Linux kernel community recently. There was a series of articles on LWN earlier this year covering progress being made on the issue.
The basic building blocks – like the RWF_ATOMIC flag on the pwritev2 system call – have started landing in the kernel, but there’s still a long way to go.
One of the biggest sticking points has been whether the kernel should fall back to a slower software implementation (like journalling or doublewrite) or return an error when a user has requested a size of atomic write that the underlying hardware can’t support natively.
My hunch is that people running larger fleets will likely want explicit errors, which provide an obvious failure signal to investigate. Having a handful of computers drop down to a less performant path because they’ve been misconfigured is much more tedious to debug[2]. If you’re only running one or two servers, you’d probably prefer they were slow but online.
Supercapacitors on newer RAID cards
There’s been a really neat evolution of the tech I talked about in the section on hardware disk caches. I was sad to cut this out of the talk, but it was tangential and was ultimately a distraction from the point I was making.
It turns out that some RAID cards with volatile write-back caches have moved away from using a battery backup unit (BBU) to keep that cache memory powered during a mains failure. Instead, they borrow the trick that enterprise NVMe drives use and have a large bank of capacitors that lets them write the volatile cache into non-volatile storage on a power failure[4].
Now, you’re probably thinking, “The NVMe drive is made of fast, non-volatile storage. How does a RAID card pull the same trick?” Well, they simply add a small amount of fast solid-state storage to the card that only gets used in this power failure scenario. Pretty cool, right?
Could we live without crash safety?
This was an interesting question someone asked after the talk:
In the crash safety and replication example, wouldn’t a checksum for integrity at the disk level/storage layer help?
It’s true that with checksums applied to every on-disk data structure, a database could detect some kinds of corruption – particularly the style caused by torn writes[5], where the individual page is internally inconsistent.
The ability to detect corruption is only part of the problem, though. The questions of when you do it and what you do when you find it are important too.
You could scan the whole database for corruption when starting after a crash or power failure, but for large databases that would add a long delay before the replica could rejoin the cluster. You could defer the work until query time, but then you either need to return an error to the user or make them wait while you ask other members of the cluster for a good copy of the data.
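As a toy illustration of the query-time option, here’s what lazy checksum verification might look like. Everything here is invented for the example; real page headers contain rather more than a CRC:

```python
import struct
import zlib

PAGE_SIZE = 8192
HEADER = struct.Struct("<I")  # 4-byte CRC stored at the front of each page

def make_page(body: bytes) -> bytes:
    assert len(body) == PAGE_SIZE - HEADER.size
    return HEADER.pack(zlib.crc32(body)) + body

def read_page(page: bytes) -> bytes:
    # Verification is deferred to read time, instead of scanning at startup.
    (stored,) = HEADER.unpack(page[:HEADER.size])
    body = page[HEADER.size:]
    if zlib.crc32(body) != stored:
        # A real system would return an error here, or fetch a
        # known-good copy of the page from another cluster member.
        raise ValueError("page checksum mismatch: possible torn write")
    return body

old = make_page(b"A" * (PAGE_SIZE - HEADER.size))
new = make_page(b"B" * (PAGE_SIZE - HEADER.size))

# Simulate a torn write: only the first 4 KiB of the new page hit the disk.
torn = new[:4096] + old[4096:]

assert read_page(old) == b"A" * (PAGE_SIZE - HEADER.size)
try:
    read_page(torn)
except ValueError as exc:
    print(exc)
```

The torn page carries the new page’s checksum but half of the old page’s body, so the mismatch is caught – but only once somebody actually reads that page.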
It’s an interesting idea, and I can see the value of implementing corruption repair as well as crash safety – after all, individual sectors of drives can fail unrecoverably – but I’d be hesitant to throw out crash safety mechanisms in favour of it. Riak’s Read Repair and Active Anti-Entropy are the only examples of database-level corruption repair that I know of, and I’d love to hear about any others out there!
An unanswered question
There was one strand of the fsyncgate work that I was never able to find a solid answer to.
There’s a good summary of it in the last of the three pgsql-hackers threads (point 4 in that message) and in Matthew Wilcox’s talk (at the linked timestamp).
The sequence of events they describe that leads to silent data corruption is:
- Process A opens a file and writes to it.
- The Linux kernel’s asynchronous write-back process happens, runs into a disk error, and records the error on the inode.
- Process A closes the file.
  Note: this is different from the scenarios in my talk because no process has an open handle to the file at this point. In the examples in my talk, there was always at least one open handle to the file with unsynced data.
- Due to memory pressure, the Linux kernel evicts that inode from the cache. As far as it’s concerned, there are no processes interested in the dirty cache pages or the error.
- Process B opens the file.
- Process B calls fsync on the file and gets a successful return value, as the error was lost when the inode was evicted.
At least at the time, this was a plausible scenario in Postgres’s approach to handling fsyncs in its multi-process model. Backend processes write to files and send a request to the checkpointer process for it to call fsync on them later. It’s entirely possible to have a window where the backend process has closed the file and the checkpointer hasn’t opened it yet, giving the kernel a chance to evict the inode from the cache.
I’m not sure if the Postgres team implemented a scheme[6] to guarantee that unsynced files will always have at least one file handle open to them, or if the Linux kernel team were able to come up with a compromise they found acceptable to keep inodes with writeback errors in memory indefinitely. You can just about hear an audience member in that talk responding to Matthew, saying that he’s able to reproduce the inode cache eviction scenario on-demand. It’s possible that it’s rare enough in real-world scenarios that fixing it hasn’t been a priority.
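For what it’s worth, the fd-passing idea is easy to demonstrate with Python’s socket module (send_fds and recv_fds, available since 3.9). This is only a toy sketch of the concept – a ‘backend’ handing its descriptor to a ‘checkpointer’ so the file never loses its last open handle – not anything from actual Postgres patches:

```python
import os
import socket
import tempfile

# Toy version of an fd-passing scheme: the "backend" hands its open file
# descriptor to the "checkpointer" over a Unix socket, so from the
# kernel's point of view the file always has at least one open handle.
backend, checkpointer = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

fd, path = tempfile.mkstemp()
os.write(fd, b"dirty data")

socket.send_fds(backend, [b"please fsync"], [fd])
os.close(fd)  # the backend can close its copy straight away...

msg, fds, _, _ = socket.recv_fds(checkpointer, 1024, 1)
os.fsync(fds[0])  # ...because the checkpointer now holds a duplicate
os.close(fds[0])
os.unlink(path)
print(msg.decode())
```

Because the descriptor is duplicated into the receiving process by the kernel, there’s no window where the inode has no open handles between the backend closing the file and the checkpointer syncing it.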
Once again, this is a topic I’d love to hear more about. If you know about work that happened between 2018 and now to fix the issue, please reach out and tell me. I’m on Mastodon and Bluesky regularly, and I’m happy to hear from you by email if you’re not there.
And a shout out
Lastly, I want to give a huge shout out to LWN.net. They’ve been doing high-quality reporting on the Linux kernel and the wider free software ecosystem for almost three decades now. I ran into their articles several times during my research, and each time I learned something and left with new avenues to explore.
Publications that maintain this level of quality over such a long stint are rare. If their style of covering deeply technical topics in a friendly, approachable way is up your alley, I think you’ll find the ~$100 a year very worthwhile. I reactivated my own subscription just before heading to SREcon.
Peace and love ✌🏻💖
1. The closest contender is my 2023 talk Deconstructing an Abstraction to Reconstruct an Outage. In this year’s talk, it was the variety of topics – particularly the choice to cover technically deep subjects in two different relational databases – which made writing it a step harder. The 2023 talk also benefited from a more spread-out research process, with much of it being done in the weeks after the outage and recorded in a public incident review.
2. And “it’s slow sometimes” is one of the worst issues to diagnose in computing.
3. MySQL’s InnoDB was affected by and had to fix the same EIO handling issue on fsync calls that Postgres did.
4. “Flash-backed write cache” seems to be the term of art if you want to go looking for these. Most of the results are product pages from vendors and I didn’t want to link to a specific one.
5. There are other forms of corruption that checksums wouldn’t save us from though. In the case of dependent writes – where a write to page B should only be present if an earlier write to page A was successful – checksums wouldn’t help us if we didn’t have write barriers enforcing the order that those pages were flushed down to disk.
6. The checkpointer redesign thread on pgsql-hackers does touch on the idea of an fd-passing scheme, where file descriptors aren’t closed by the backend and are instead passed directly to the checkpointer via Unix sockets. I wasn’t able to find any later discussions or patches related to that work, but they may well have slipped past my searches!