The attic, and the noise we found in it

Last week's post was about mytraces pushing on the shared bits and the shared bits pushing back. This week's is the same shape of story, a level lower down, and about a piece of the family that runs behind the apps rather than in front of them.

We've quietly been building a thing called attic. It's a small Go program that runs on a few hosts in different regions, takes copies of each app's data, and keeps those copies in sync. When a disk fails in one region, the copies elsewhere still know who we are.

Why a tool for this

Across the Pointegrity family — pouch, mytraces, the authentication piece that signs you in to either, and more in the workshop — each app runs as its own service with its own database. Some are growing. They share the same need: durable backups, in more than one region, on our terms.

Each app keeps its state in a single SQLite database — one file, on one host, no live replicas. That's a deliberate floor on operational cost. Until traffic demands otherwise, we'd rather spend complexity budget on the apps themselves than on running a database cluster.

Attic is what carries the backup story. The apps don't have to think about it; they just write to their own databases, and the copies appear elsewhere on schedule.

The apps don't have to think about backups. The attic does.

What we tried first

There's a well-known pattern in the indie-SaaS world for protecting a SQLite-backed app: stream every database change — every transaction, every write-ahead log frame — to an S3-compatible store in real time, replicated across regions. The pitch is recovery to any specific second. We picked it. It was the careful-looking option, and "careful" felt right while we were learning what we were doing.

It worked. For a few months, it just hummed along.

The polite email

Then last week, our cloud provider wrote to us. The message was polite: one of your servers has been sending several hours of high-rate outbound traffic. Just letting you know.

It had every right to ask. The traffic was big enough that, if it kept going, it would have eaten through the month's bandwidth allowance. We sat down to look.

The actual data we were protecting was small. A few hundred megabytes, all in. The traffic replicating copies of it was, by then, several hundred gigabytes a week — several thousand times more bandwidth than the data we were protecting could possibly explain.

What was happening

The "ship every change" approach we'd picked turns each change into its own object in the storage layer. Thousands of tiny objects per day, per app. The replication layer underneath — three regions, copies kept in sync — tracks every object across every region. Per-object overhead is small; multiply it by thousands of objects, then by three regions, then by the way the layer reconciles its replicas, and the tail wags the dog.
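For a feel of how that multiplication behaves, here is a back-of-envelope sketch in Go. The object counts are the monthly figures quoted later in this post. The per-object overhead (16 KiB) and the number of reconciliation passes (4) are invented for illustration and were not measured on attic. The point is only that overhead scales with the number of objects, not with the amount of data.

```go
package main

import "fmt"

// overheadBytes estimates replication overhead as a plain product:
// objects x regions x per-object cost x reconciliation passes.
// This is a toy model, not how any particular storage layer accounts for traffic.
func overheadBytes(objects, regions, perObjectBytes, reconcilePasses int64) int64 {
	return objects * regions * perObjectBytes * reconcilePasses
}

func main() {
	const (
		regions         = 3     // from the post: three regions
		perObjectBytes  = 16384 // assumption: 16 KiB of metadata traffic per object
		reconcilePasses = 4     // assumption: passes over each object per month
	)

	streamed := overheadBytes(13000, regions, perObjectBytes, reconcilePasses)
	snapshots := overheadBytes(30, regions, perObjectBytes, reconcilePasses)

	fmt.Println("streamed objects, overhead bytes per month:", streamed)
	fmt.Println("daily snapshots, overhead bytes per month: ", snapshots)
	fmt.Println("ratio:", streamed/snapshots)
}
```

Whatever the real per-object cost is, it cancels out of the ratio. Only the object count is left.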

Most teams running this pattern aim it at a big cloud provider's object-storage service — the provider handles the metadata layer for them, the per-object overhead is amortized inside their pricing, the bill comes out to pennies. We weren't doing that; we were running our own object-storage layer, replicated across regions on our own hosts, for the predictability and the not-leasing-our-durability angle. The combination — thousands of tiny files plus a storage layer optimized for fewer larger ones — was the actual mismatch. Each piece, on its own, behaves the way it's documented to.

Three small assumptions, each one reasonable in isolation, had stacked into a much louder result than any of them looked like alone. The thing wasn't broken. It was working exactly as designed. The design just didn't match the shape of our data — lots of small apps, slow writes, no need to recover to a particular second.

Anticlimactic, mostly.

What we changed

We stopped streaming every change. Each app's database now gets one full snapshot a day — one consistent copy taken at a known moment, uploaded as a single object, instead of thousands of tiny ones representing every intermediate state.

Thirty backups in a month per database, instead of thirteen thousand. The traffic dropped to the kind of trickle you'd expect for the actual amount of data we're protecting. The alert hasn't come back.

The recovery story changed in one specific way: if a host dies, the worst case is now "we lose up to a day of edits," not "we lose the last few seconds." For apps whose data is drops, trip plans, and saved places, that's the right trade. For apps that take credit cards or hospital readings, it wouldn't be, and when those appear in the family they'll bring their own machinery alongside attic, not on top of it.

What stays with us

Same shape as last week. The infrastructure we built was right for a workload we might have someday — bigger apps, always-on writes, a need to recover to the second. It was wrong for the workload we actually have today.

We sanded the seam. The attic is quieter, the apps don't know anything changed, and we wrote down the lesson so we don't walk into the same shape of mismatch the next time the family grows.

The attic is doing its job when nobody notices it. That it stopped being noticeable this week is the result we wanted.
