The biggest event your company will see for two years is fast approaching. Life or death for the startup.
Everyone is gearing up, finishing off new features, load testing like it’s 1999. It’s all going to be fine.
Except, you’ve got a problem.
That older system. The one with the MVP[1] you grafted on. It's not going to make it; it's not going to scale.
You don't have much choice. It's time to rip its database out and put in something completely different.
It’s not time for downtime. Or for fixing new bugs.
No sweat. 😥
This is the story of how I faced exactly this situation and turned it into success.
It’s also the story of how some good practices made it all possible. Practices I hope you can learn from and turn into your own success.
Nope.
This old system handled user accounts: login, settings and so forth. It was simple and stable. It was built on Riak: keys, values and nothing else. It was non-relational.
The MVP was to add 'social features'. Friends. How users relate to one another. In a non-relational database.
It was neither easy nor pretty, but I made it work, and the MVP stuck around.
But in the months that followed it became obvious that it needed to grow up, drop the Minimum and become an actual Product. The data was awkward and error-prone to work with, and projects were taking longer and longer as it all got more complex.
But most of all it wasn't scaling. As more users came, its database was put under more and more pressure. It wasn't going to survive the coming storm.
Riak, its database at the time, was very simple. All the data was 'unstructured'[2] JSON. But what we needed for our relational problem was a good old relational database.
Relational databases don’t play well with messy, unstructured JSON. Theirs is the world of neat, structured rows, columns and indexes.
So, if the data just wouldn’t fit, would I have to rebuild everything??
Maybe not. Rumor in the blogosphere was that Postgres had made the impossible happen. JSON working in the battle-tested, blazing-fast relational DB. Fully indexed and everything.
If this was true, it'd be a perfect match for our data model. It'd be orders of magnitude easier to migrate to and it'd be a solid bet to build on in the future.
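To make that concrete, here's roughly the shape of it. This is my own sketch in Python with psycopg2, not our actual schema; the table, columns and connection string are invented for illustration.

```python
# A rough sketch of JSON living in Postgres, fully indexed.
# The schema and DSN are hypothetical; the jsonb/GIN machinery is the point.
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=app")  # hypothetical connection string

with conn, conn.cursor() as cur:
    # JSON documents stored in an ordinary relational table.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS accounts (
            id  bigserial PRIMARY KEY,
            doc jsonb NOT NULL
        )
    """)
    # A GIN index over the whole document, built with the same index
    # machinery Postgres uses everywhere else.
    cur.execute("CREATE INDEX IF NOT EXISTS accounts_doc_idx "
                "ON accounts USING gin (doc)")
    # Inserting a document is just a parameterised INSERT.
    cur.execute("INSERT INTO accounts (doc) VALUES (%s)",
                [Json({"username": "alice", "settings": {"theme": "dark"}})])
```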
So, in theory, I could swap the databases. But everything works in theory. This had to work in practice, and soon.
The major risk with a change like this is the assumptions. As systems grow they tend to assume all sorts of things about the databases they're built upon. And when the database changes, these assumptions turn into bugs. Dangerous bugs. Ones that can hide almost anywhere and, if you're not lucky, everywhere.
I simply didn't have time for comprehensive bug-finding missions. Remember, if this wasn't done in time it was game over. I couldn't do a rush job either; there was serious risk of corrupting data and making a real mess out of the whole thing.
Rather than chance it, we probably would have just turned the MVP off. It wasn't a pleasant thought, but no MVP is better than a broken MVP[3].
So, I had to swap out the database for something completely different, with no downtime, no new bugs, no corrupted data, and all of it before the deadline.
Easy[4].
As much as I like moving quickly, this wasn't going anywhere without a plan. The overriding goal here was to make sure we didn't embark on something that couldn't be done on time.
It went somewhat like this:
Prove the theory: Prototype the DB
Go/No go
If this didn’t look like it’d solve our immediate needs it was a no go. We should stop wasting our time and find another route.
If it didn’t solve our future needs it would be a maybe. It’d be time to consider burning this MVP down and rebuilding it after the storm.
Hack it into the app
If the database proves itself, it's time to see how compatible our app is.
Take a fixed amount of time and do as much work as possible.
Our test suite would be the guiding light of how far we got.
Go/No go
If we got here we’d have a good estimate of how long it’d take to complete. It’d either fit, or not.
Spend 5 days and get 25% complete? You need another 15 days. (Not including ‘shit goes wrong’ tax). Only have 10 days? Plan B.
Time to commit. Get it working.
We couldn't afford downtime, and even with the best planning you can 'be surprised'[5]. Anything we can do here to mitigate data loss or corruption is well worth it.
We have to make the trigger safe to pull. That means developing good migration scripts and rollbacks in case anything goes wrong.
QA
Our tests would be a help, but there’s always something.
Pull the trigger
Deploy and watch it like a hawk.
There was only one way to find out if this would work. First up: Postgres.
Postgres quite simply blew me away. I wasn't sure in the beginning. The index syntax looked 'interesting'[6] and the docs looked long. But I quickly realised that the syntax was extremely flexible and the docs were both comprehensive and to the point.
That was nice, but what would make or break this test was performance under load.
Soon it became clear that the Postgres team hadn’t just wedged an extra feature into their product. They built it to use exactly the same indexing, querying and storage as the rest of the database.
The same indexing, querying and storage that has 29 years of optimisation behind it.
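To give a feel for what 'the same querying' means, here's a sketch of the kind of lookup that GIN index serves, reusing the hypothetical accounts table from the earlier sketch. Again, my illustration, not the real system.

```python
# Querying inside the JSON with the containment operator (@>), which the
# GIN index from the earlier sketch can serve. Table and data are hypothetical.
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=app")  # hypothetical connection string

with conn, conn.cursor() as cur:
    # Find every account whose document contains this fragment.
    cur.execute("SELECT id, doc->>'username' FROM accounts WHERE doc @> %s",
                [Json({"settings": {"theme": "dark"}})])
    for account_id, username in cur.fetchall():
        print(account_id, username)

    # EXPLAIN shows whether the planner picked the index
    # (on a tiny table it may well prefer a sequential scan).
    cur.execute("EXPLAIN SELECT id FROM accounts WHERE doc @> %s",
                [Json({"settings": {"theme": "dark"}})])
    print("\n".join(row[0] for row in cur.fetchall()))
```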
It was bombproof![7]
Go/No Go: GO!
Next step: see if our system has an allergic reaction or not.
Time to hack the database into the app. If I've written good, well-factored code and abstracted our data model well, it'll go smoothly. If not, it's time for plan B.
For this kind of thing the development cycle often looks like this:
1. Change the code.
2. Find out what you broke.
3. Fix it. Repeat.
Without tests, step 2 can take anywhere from days to months[8]. It either takes forever or you swallow your pride and accept that stuff will break.
The third way is to have a solid battery of automated tests do the checking for you. Thankfully, over the years before, I had built up exactly this: a test suite covering all of our customers' use cases in detail. Testing everything[9] took a minute or two, not days.
As a bonus, I immediately saw what was broken and what was working. No matter the size of the system I always knew exactly where to look for the next bug. No more hunting. Hunting is slow.
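To be concrete about what those use-case tests look like, here's a sketch of the flavour. The module and function names are invented stand-ins (ours were different); the point is that the tests exercise behaviour, not whichever database sits underneath, so the same suite keeps working when the storage swaps out. They'd run under something like pytest.

```python
# A flavour of use-case-level tests. The app functions (create_user,
# add_friend, friends_of) are hypothetical stand-ins; the real suite covered
# the product's actual behaviours.
from app.accounts import create_user, add_friend, friends_of  # hypothetical module


def test_new_users_start_with_no_friends():
    carol = create_user("carol")
    assert friends_of(carol) == []


def test_adding_a_friend_is_visible_to_both_users():
    alice = create_user("alice")
    bob = create_user("bob")

    add_friend(alice, bob)

    assert bob in friends_of(alice)
    assert alice in friends_of(bob)
```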
In almost no time my tests were far more green than red. The assumptions about our database had been well managed, a big win for the clean code around it.
It was all go now. We’d built up confidence much faster than we expected. It would work and we had time to work it.
Just the little task of getting it all working left.
With the test suite guiding the way and the lack of fat in our codebase I flew through the rest of the work. Very soon I had polished my hacks into solid code and had nothing but passing tests.
It was time to bring this home.
We had disarmed the risk of doing the wrong thing, but we could still make an awful mess of our data. The last piece of this puzzle was to build up the migration and rollback strategies and make damn well sure they worked.
I won't bore you with the detail (there is a lot of it), but suffice it to say it was time well spent. Our databases could be migrated back and forth with neither downtime nor an all-or-nothing leap of faith from one to the other[10].
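I'll spare you our exact scripts, but to show the shape of the idea: one pattern that fits 'back and forth, no downtime' is to write to both databases and let a flag decide where reads come from, so rolling back is a config flip rather than a data migration. The sketch below is my illustration of that pattern, not our production code.

```python
# One common shape for a reversible, no-downtime migration: dual-write to both
# stores, read from whichever one a flag names, so rollback is a config change.
# All the names here (riak_store, pg_store, settings) are hypothetical.

class DualWriteAccounts:
    def __init__(self, riak_store, pg_store, settings):
        self.riak = riak_store      # the old key/value store
        self.pg = pg_store          # the new Postgres-backed store
        self.settings = settings    # e.g. settings.read_from = "riak" or "postgres"

    def save(self, user_id, doc):
        # Every write goes to both databases so neither falls behind.
        self.riak.put(user_id, doc)
        self.pg.put(user_id, doc)

    def load(self, user_id):
        # Reads follow the flag; flipping read_from back to "riak"
        # is the rollback, with no data to un-migrate.
        if self.settings.read_from == "postgres":
            return self.pg.get(user_id)
        return self.riak.get(user_id)
```

The nice side effect of this shape is that the old database stays warm and correct until you're confident enough to stop writing to it.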
Nothing makes both sysadmins and those in charge happier than telling them that the big, risky change is going to be nice and smooth, completely reversible, and that even if there are issues you've stacked the deck so they'll be caught long before they get big.
I was ready. The tests passed. Our migration scripts were waiting. Time to deploy.
What!? Are you crazy? There’s always something else.
Time for a round of manual QA.
And sure enough, one or two things popped up. Nothing major. Nothing that slowed the project down.
Now it's time to be happy and deploy. You've done your work, you've been diligent. There's no point in fretting or worrying any more.
Be watchful and ready to revert.
Test it when it’s out, and again when it’s migrated.
And not on a Friday[11] >:-(
1. Minimum Viable Product.
2. As unstructured as any data can be. There is always a schema.
3. Even 'minimal' has a quality bar.
4. Possibly the single most dangerous word in tech.
5. A euphemism for fucking it up.
6. A euphemism for a complete pain in the ass.
7. As much as any data product can be.
8. I've interviewed at places with 6+ month QA passes; suffice it to say I didn't take the offers.
9. You can never test everything, but you can definitely get all the way to confidence.
10. Think Indiana Jones in "The Last Crusade".
11. This is very often bandied about, said with a knowing nod. But I've seen it happen far too many times. That being said, if your riskiest time of the week is a Tuesday, adjust accordingly.