Migrating Large PostgreSQL Columns to DynamoDB

Rohan Sahai
Published in Affinity · Oct 25, 2017 · 9 min read


UPDATE (about a year later): Looking back at this project, it was worthwhile for us to move email bodies out of our main RDS database, and it was worthwhile to test out DynamoDB in a real-world setting. That said, if I were to do it again, I would probably move email bodies to a smaller separate RDS instance to reduce the complexity of our infrastructure. Also, the cost savings did make sense at the time, but as we scaled up, DynamoDB became pricier, which offset the savings from delaying an RDS upgrade. Hopefully this article is still useful/an interesting read! There are definitely still valid use cases for Dynamo.

Before the data migration described in this post, more than half of our Postgres storage space was eaten up by ONE column from a single table. This blog post outlines how we moved all this large data into what we think is a more suitable datastore given our query patterns. Scroll to the bottom if you aren’t interested in the background or implementation and just want to see some tricks/tips/headaches related to DynamoDB.

Background/Motivation

At Affinity we sync with our clients’ email inboxes, which means we have an emails table in Postgres. This table has a body column which is responsible for storing the actual content of the various email records. Email bodies can be fairly large (not including attachments). A rough average we calculated at one point was ~8 KB per email body (after encryption). That might not sound like a lot, but if you have 60 million+ email records, that ends up being ~500 GB of storage.

So what’s the issue with these really large bodies in Postgres? Postgres is actually designed to handle large attributes fairly well: with a technique known as TOAST (The Oversized-Attribute Storage Technique), attribute values that would push a tuple past the fixed ~8 KB page size are compressed and/or split into chunks stored out of line in a separate TOAST table, so Postgres natively avoids bloated pages. But we were still compelled to test out storing this column in NoSQL for the following reasons:

  • We want to keep these potentially giant entries out of Postgres memory so other queries can remain as fast as possible.
  • Similar to the above point, but we’d had issues in the past with our DB working set size being larger than our DB memory capacity; hopefully this would alleviate that, or at least postpone the need for another memory upgrade to our master database instance.
  • We’d like queries hitting our emails table to exclude email bodies by default, so we’re not also clogging our application memory.
  • We have extremely high write volume on our emails table — and NoSQL is designed to horizontally scale more easily than most relational databases. Even though we are just moving one column for now, this can be a POC for potentially moving the entire table to NoSQL.
  • Storage is more expensive on Amazon RDS (where our Postgres instance is hosted) than most hosted NoSQL solutions.
  • The ‘body’ field would be fairly easy to port to another data store since we never filter queries based on this field, so our query patterns would be mostly unaffected.

Why DynamoDB?

As a small startup without a dedicated DevOps team, a fully managed solution is crucial for us. This was the real separating factor between Dynamo and other popular NoSQL solutions such as Cassandra. We also already use AWS for quite a bit of our architecture and we’ve had mostly positive experiences, so it made sense to first explore the Amazon option before looking into other fully managed solutions.

Implementation/Migration

Getting our Dynamo table up and running was incredibly simple. Since we were only moving a single column from our Postgres emails table, we didn’t have to worry about implementing a new PK strategy… we just needed to store a reference to the existing email record and the body. Creating the table was as simple as:
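A minimal sketch of that table creation with the Ruby aws-sdk-dynamodb gem (the email_bodies table name and email_id key below are illustrative, not necessarily our exact schema):

    require "aws-sdk-dynamodb"

    dynamo = Aws::DynamoDB::Client.new(region: "us-east-1")

    # One item per email: the Postgres email id as the partition key,
    # plus the (encrypted) body blob.
    dynamo.create_table(
      table_name: "email_bodies",
      attribute_definitions: [
        { attribute_name: "email_id", attribute_type: "N" }
      ],
      key_schema: [
        { attribute_name: "email_id", key_type: "HASH" }
      ],
      provisioned_throughput: {
        read_capacity_units: 100,
        write_capacity_units: 100
      }
    )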

The trickier parts of the problem were:

1- Figuring out how to alter as little of our application code as possible so developers could still interact with our ORM’s Email model the same way they did previously.

2- Coming up with a smooth migration strategy so we could drop our ‘body’ column in Postgres safely. This involved writing to both Dynamo and Postgres, and then gradually converting reads to Dynamo.

Application Code

Most of our backend at Affinity is in Ruby and we use an ORM called Sequel to interact with our database records. Our email records are referenced in lots of places throughout the codebase, and updating each of those places individually to account for the new Dynamo column would be a hassle. What we did to mitigate this was essentially write a wrapper around our ORM, so NoSQL columns were updated/fetched automatically. Well… we only fetch the body from Dynamo if the body method of an email record is specifically called, not whenever the email record is fetched; otherwise we’d be loading bodies into application memory unnecessarily. It wasn’t quite this simple, but the wrapper looked something like this:
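Here’s a stripped-down sketch of the idea (the names reuse the hypothetical email_bodies table from the create_table snippet above, and the real wrapper handled more update paths):

    require "sequel"
    require "aws-sdk-dynamodb"

    DB         = Sequel.connect(ENV.fetch("DATABASE_URL"))
    DYNAMO     = Aws::DynamoDB::Client.new(region: "us-east-1")
    BODY_TABLE = "email_bodies" # hypothetical table from the create_table sketch

    class Email < Sequel::Model(:emails)
      # Only hit Dynamo when the body is actually asked for.
      def body
        @body ||= begin
          resp = DYNAMO.get_item(table_name: BODY_TABLE, key: { "email_id" => id })
          resp.item && resp.item["body"]
        end
      end

      def body=(value)
        @body = value
      end

      # Keep Dynamo in sync whenever the record is saved or removed.
      def after_save
        super
        return unless @body
        DYNAMO.put_item(table_name: BODY_TABLE, item: { "email_id" => id, "body" => @body })
      end

      def after_destroy
        super
        DYNAMO.delete_item(table_name: BODY_TABLE, key: { "email_id" => id })
      end
    end

The key point is that Email#body goes to Dynamo lazily, so loading a page of email records doesn’t drag every body into application memory.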

Tapping into Sequel to do this properly with the various different update types could occupy its own small blog post so I won’t dive into more detail here.

Migration Strategy

Email syncing is at the core of our product at Affinity, so a smooth rollout of our Dynamo implementation was crucial. Instead of just switching over to reading and writing from Dynamo in one go, we broke up the feature into several steps that were rolled out over a span of roughly two weeks.

Step 1 — Update our worker responsible for writing emails to Postgres, to ALSO write to Dynamo.

Step 2 — Backfill all Postgres bodies from before step 1 into Dynamo. Fun fact: at about 5,000 WCU, migrating ~40 million emails took around 40 hours. (A rough sketch of such a backfill appears after the step list.)

Step 3 — Update codebase to always read email bodies from Dynamo and never from Postgres.

Step 4 — Stop writing email bodies to Postgres.

Step 5 — Drop the body column like it’s hot.
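For reference, a backfill like step 2 boils down to paging through the emails table and batch-writing bodies into Dynamo. A rough sketch, reusing the illustrative DB, DYNAMO, and BODY_TABLE names from the wrapper snippet (the real job also tracked progress and backed off more carefully):

    # Backfill existing Postgres bodies into Dynamo, 25 items per batch
    # (the BatchWriteItem maximum).
    DB[:emails].select(:id, :body).exclude(body: nil).each_slice(25) do |rows|
      requests = rows.map do |row|
        { put_request: { item: { "email_id" => row[:id], "body" => row[:body] } } }
      end

      resp = DYNAMO.batch_write_item(request_items: { BODY_TABLE => requests })

      # Anything Dynamo throttled comes back in unprocessed_items; retry it.
      until resp.unprocessed_items.nil? || resp.unprocessed_items.empty?
        sleep 1
        resp = DYNAMO.batch_write_item(request_items: resp.unprocessed_items)
      end
    end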

DynamoDB Gotchas/Tips

This is essentially a collection of things I would tell someone about to dive into Dynamo that aren’t immediately clear when you first start using it.

Pricing: Write Capacity and Read Capacity Units

Dynamo’s pricing model has nothing to do with how many machines you have or how powerful they are; it’s based on usage, more specifically on Write Capacity Units (WCU) and Read Capacity Units (RCU).

The WCU/RCU, which you have to set in advance for your tables (unless you use auto scaling… more on that below), are based not just on the number of reads/writes, but also on the size of the items you are reading or writing.

One read capacity unit represents one strongly consistent read per second, or two eventually consistent reads per second, for items up to 4 KB in size. If you need to read an item that is larger than 4 KB, DynamoDB will need to consume additional read capacity units. The total number of read capacity units required depends on the item size, and whether you want an eventually consistent or strongly consistent read.

One write capacity unit represents one write per second for items up to 1 KB in size. If you need to write an item that is larger than 1 KB, DynamoDB will need to consume additional write capacity units. The total number of write capacity units required depends on the item size.
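To make that concrete for our use case: item sizes round up to the next unit, so an ~8 KB email body costs 8 WCU to write (8 × 1 KB), 2 RCU to read with strong consistency (2 × 4 KB), and 1 RCU to read with eventual consistency (half the strongly consistent cost). Sustaining, say, 1,000 email writes per second at that size would therefore need roughly 8,000 provisioned WCU.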

Eventually Consistent/Strongly Consistent Reads

One pretty important detail that I breezed over when initially diving into the docs was that Dynamo reads are eventually consistent by default, not strongly consistent. What does that mean?

When you read data from a DynamoDB table, the response might not reflect the results of a recently completed write operation. The response might include some stale data. If you repeat your read request after a short time, the response should return the latest data.

And the time factor, which was surprisingly hard to hunt down in their docs:

Consistency across all copies of data is usually reached within a second.

Fortunately for us, eventually consistent reads are fine since we’re not reading that data within a second of writing it. Also keep in mind that strongly consistent reads require more RCU and so are more expensive (2x the price as of this writing). Relevant Docs.
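If a particular call does need a strongly consistent read, it’s just a per-request flag (a sketch against the hypothetical email_bodies table from earlier):

    # Eventually consistent: the default, and half the RCU cost.
    DYNAMO.get_item(table_name: BODY_TABLE, key: { "email_id" => 42 })

    # Strongly consistent: reflects all prior successful writes, costs 2x the RCU.
    DYNAMO.get_item(table_name: BODY_TABLE, key: { "email_id" => 42 }, consistent_read: true)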

Auto Scaling

It would be awesome if Amazon didn’t force us to pay for unused RCUs and WCUs; for instance, our application is less active at night, so we could probably lower the capacity units significantly during those hours. You could do this programmatically, which is a bit annoying, or you can use the new (as of this blog post) auto scaling feature!

A few gotchas with auto scaling:

1- The automatic updates to RCUs and WCUs don’t happen immediately. When there is a burst in traffic you should still expect throttling errors and handle them appropriately. This may be a deal breaker for the auto scaling feature in many applications, since it might not be worth the cost savings if some users have to deal with throttling. Luckily for us, most of our Dynamo writing/reading actually comes from background jobs, where a bit of throttling is fine and doesn’t affect our users. The following charts outline the issue more clearly.

Red: Provisioned Write Capacity / Blue: Write Requests per Second

You can see that around 21:00 we have a burst in writes, and about 20 minutes later the auto scaling kicks in and provisions more write capacity.

Throttled Write Requests

Even though auto scaling does kick in eventually and keeps the throttling from spiking further, we still had ~15 minutes of throttling against the lower write capacity.

2- There is no good documentation on how to programmatically configure auto scaling; the client libraries essentially have nothing auto-scaling-related in their documentation. If this were possible, the throttling issues brought up in point 1 could be partially alleviated: we could set the minimum capacities higher during the day to prevent throttling from typical morning spikes. Maybe the auto scaling will get smarter as it collects more data and this won’t be an issue!

3- If you use the Dynamo-to-S3 AWS Data Pipeline template for backups, auto scaling can give you inconsistent backup speeds. More on that in the section below.

Backups

As of this writing there isn’t great documentation on Dynamo backups. There seems to be one semi-standard solution, which involves using AWS Data Pipeline to export DynamoDB tables to S3. Unfortunately it’s a somewhat fragile system without great documentation or error handling. I’ve had some backups fail without much information, and backups often get stuck in the preparation phase for hours before actually running. Also, the default template leaves you with Data Pipeline warnings that are confusing to fix.

One nice thing about the template is that you can specify what percentage of the table’s RCU capacity you would like to use for the backup. For example, if your table is provisioned with 1,000 RCUs and your backup is set to use 1% of the capacity, the backup will consume 10 RCUs per second. However… if you use auto scaling, you’ll have no idea what your provisioned RCUs are when the backup runs. In our case, we want a really high read capacity for the backup so we can get it done in less than a day over the weekend. We hacked together a workaround in which a cron job updates the RCU before the backup runs on AWS. But this isn’t foolproof because, as I said before, sometimes the backups don’t run immediately, and auto scaling may scale our RCU down before the backup runs.
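The workaround itself is just a scheduled UpdateTable call before the backup window. A sketch of what such a cron job can run (the table name and capacity numbers are illustrative, and both throughput fields have to be supplied together):

    require "aws-sdk-dynamodb"

    dynamo = Aws::DynamoDB::Client.new(region: "us-east-1")

    # Bump read capacity ahead of the weekend backup so the Data Pipeline
    # export runs against a known, high RCU value.
    dynamo.update_table(
      table_name: "email_bodies",
      provisioned_throughput: {
        read_capacity_units:  5_000,
        write_capacity_units: 1_000
      }
    )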

Batched DynamoDB Requests

Dynamo supports batched reads and writes, but with limits: you can request at most 100 items to read at a time, and write (PUT/DELETE) at most 25 items at a time. One important thing that wasn’t immediately obvious to me at first is that some items in a batched request can get throttled while others don’t, so it’s important to handle this properly in application code. The most obvious solution is to recursively call your batch request function with the throttled keys until all the requests are complete.
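A sketch of that pattern for batched reads, again with the illustrative email_bodies table (it assumes at most 100 ids per call; production code would also cap the retries):

    # Fetch a batch of bodies, re-requesting whatever Dynamo throttles
    # (returned in unprocessed_keys) until everything has come back.
    def fetch_bodies(dynamo, email_ids)
      items   = []
      request = { "email_bodies" => { keys: email_ids.map { |id| { "email_id" => id } } } }

      loop do
        resp = dynamo.batch_get_item(request_items: request)
        items.concat(resp.responses["email_bodies"] || [])

        break if resp.unprocessed_keys.nil? || resp.unprocessed_keys.empty?
        request = resp.unprocessed_keys # only the throttled keys get retried
        sleep 0.5                       # crude backoff
      end

      items
    end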

Conclusion

DynamoDB has been fairly easy to use and learn, and we haven’t hit any major roadblocks or deal breakers with the technology. We haven’t even come close to any sort of upper bound on WCU/RCU, so it seems that any issues that arise with Dynamo in the future won’t be related to load. It hasn’t been too long since we started using Dynamo, but so far it’s been a great solution to our bloated column problem.
