A Blog

Where I write anything about everything every once in a while

20 May 2015

Maximize the Work Not Done

It’s fuzzy who I first heard say it, but I remember the lesson well. It was on a team years ago, and it’s a principle I have thoroughly internalized. I didn’t immediately wrap my head around it, but agile, or Agile, or not, it’s important.

Here is the simple explanation: I traded weeks of development for an afternoon of work. That isn’t just weeks of work; it’s weeks of opportunity cost, thousands of lines of code to test and maintain, and infrastructure that didn’t need to be built and deployed.

We need our PostgreSQL data in ElasticSearch

That about sums it up. The developer wheels were turning in my head. The other side of the coin: it was experimental, it might not work out, and the data wasn’t changing that often.

The solution? A cobbled-together set of shell scripts. All we need to do is pull the relevant data out of the database, turn it into JSON, and put it in ElasticSearch. If you aren’t familiar with ES, it has a fantastic HTTP API. My first problem: I had millions of rows to upload. The good news is that there is a bulk API, with an interesting twist. Let’s dive in and work through this step by step.

Get the data from PostgreSQL

If you aren’t familiar with Postgres, it has a wonderful command-line tool called psql. There are just a couple of details we need to take care of. First, we need JSON. We could do some silly string concatenation, like so:

SELECT '{ "id": ' || id || ', "name": "' || name || '" }' 
FROM my_data

But that’s just dirty.

SELECT row_to_json(t)
FROM (SELECT id, name FROM my_data) t

There, that’s better. The second problem is all of the extra output psql adds by default.

psql -q -t ...

Great! The -q flag suppresses informational messages, and -t prints just the tuples, with no headers or row counts. Now we have one JSON document per line.
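
Putting the pieces together, the extraction step ends up looking something like this (the connection details, column list, and output file name are placeholders, not the real setup):

psql -q -t -h db.example.com -U some_user -d some_db \
  -c "SELECT row_to_json(t) FROM (SELECT id, name, address FROM my_data) t" \
  > my_data.json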

ElasticSearch Bulk API

You can read about the bulk API in the ElasticSearch documentation. It’s very handy, but the interface is awkward. Each document requires two JSON lines: the first specifies metadata such as the index, document type, and document ID, and the second is the document itself.

This is when we turn to one of my best friends on the command line, JQ. If you work with JSON, it’s a must-have. We need to turn this:

{ "id": 1234, "name": "Bob", "address": "983 Mulberry" }

Into this:

{ "index": { "_id": 1234 } }
{ "id": 1234, "name": "Bob", "address": "983 Mulberry" }

You might think Python, you might think Perl, but just stop right there. All that pain, it’s just not necessary. JQ to the rescue.

jq -r -c '. | { index: { _id: .id } }, .'
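
To see it in action, feed the sample document through the filter:

echo '{ "id": 1234, "name": "Bob", "address": "983 Mulberry" }' | jq -r -c '. | { index: { _id: .id } }, .'

That prints the two compact lines the bulk API wants:

{"index":{"_id":1234}}
{"id":1234,"name":"Bob","address":"983 Mulberry"}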

I’ll be honest, it takes me time to work out how to do more advanced tweaks with JQ, but it’s worth it. At a minimum, you should always pipe JSON on the command line to “jq .” and enjoy serenity.

A Caveat

It didn’t take long to discover that 1GB bulk loads didn’t work. In fact, it seemed to choke around 50MB. No worries. We’ll chop up our output!

split -C 40M - bulk_upload_

Woot! This results in one or more files ready for upload. The first file went great, but the subsequent files failed. Remember how we had to turn each record into two lines? If you split those two lines across files, you will have pain. Be sure to split first, then run each chunk through JQ to produce the bulk format.
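
Concretely, that ordering looks something like this, picking up the my_data.json file from the earlier psql sketch (the raw_chunk_ prefix is just for illustration):

split -C 40M my_data.json raw_chunk_

for f in raw_chunk_*
do
  jq -r -c '. | { index: { _id: .id } }, .' < "$f" > "bulk_upload_${f#raw_chunk_}"
done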

A simple loop will push them all:

for f in bulk_upload_*
do
  # POST each chunk to the bulk endpoint, streaming the file as the request body
  cat "$f" | curl -XPOST "10.0.0.1:9200/index1/type1/_bulk" -T -
done
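
One refinement worth considering, and not part of the original script: the bulk response is itself JSON with an errors flag, so piping each response back through JQ makes partial failures easy to spot.

for f in bulk_upload_*
do
  # -s hides curl's progress meter; .errors is true if any item in the batch failed
  cat "$f" | curl -s -XPOST "10.0.0.1:9200/index1/type1/_bulk" -T - | jq '.errors'
done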

Wrapping Up

I’ve left out details, including wiping the existing index, creating the index again with the latest mappings, and so on. But this covers the tricky bits.
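
For what it’s worth, those omitted steps are just more HTTP calls. A minimal sketch, assuming the mappings live in a local mappings.json file:

curl -XDELETE "10.0.0.1:9200/index1"
curl -XPUT "10.0.0.1:9200/index1" -d @mappings.json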

It took about an afternoon to put together a bash script that I can run over and over to populate ES with data from our database. In the long run, perhaps this will be done by a process reading from a Kafka topic, or perhaps even a fancy ES river. But this solution works like a champ today. Maximize the work not done.
