DROP your DATABASE
SQL is bad news for news apps.
How many times in a newsroom have you heard (or said): “Let’s load these spreadsheets into a database and build a web app.”
Too many times.
I’ve weaned myself away from SQL over the past two years. Without SQL, I find it easier to find and tell a stories using data.
What’s wrong with a database?
If by “database” you mean “place to store data,” then nothing! Places to store data are great.
My beef is with SQL databases — in particular, this workflow that’s too common in newsrooms:
- Find data for your story: CSVs, Excel files, map data and so on.
- Load that data an SQL database.
- Use that SQL database to inform and present your story.
Since I’ve left this approach behind, my world has brightened.
If you use an SQL database, consider:
An SQL database hides the paper trail
You’re a journalist. You need to separate your data from sources’ data.
Sources’ data comes with oodles of typos and obvious errors. You can fix those in your database … and then when you SELECT a value, you can’t remember whether you edited it or not.
The solution: keep an exact log of your cleaning-up process. OpenRefine can do that. Or code a script. Or just keep two copies of their spreadsheet: one original, and one with your edits.
You’re a journalist. Don’t UPDATE or DELETE.
An SQL database is hard to install
You’re an honest journalist, so you share your code on GitHub. Good.
The most important file you publish to GitHub is your README file: the instructions someone should follow to confirm your findings.
Database systems are hard to install. The steps on Linux and Windows are different than on MacOS. And those installation steps change when new database versions come out.
You’re a journalist. Nix the database from your story.
An SQL database costs money
Are you publishing your story as a Rails or Django app, with an SQL database behind the scenes? A web server costs tens of dollars per month.
There’s an alternative: you can use S3 (or another platform) to host every file a user requests, for pennies a month.
You’re a journalist. Make your story easier to serve: years from now, it will be easier to read.
An SQL database is hard to comment
Did You Know: you can write comments in a PostgreSQL schema. Unfortunately, nobody else knows that, either. Nobody will ever read the comments you write there.
Did You Know: you can write comments in the one-liner “model” classes in popular Web frameworks like Rails? Maybe you did … but when’s the last time you kept them up to date? It’s hard to maintain comments there.
An SQL database system deters your readers — not to mention Future You — from understanding your methodology.
You’re a journalist. Show your work: make your comments current, visible and salient.
You usually don’t need multi-user edits
SQL databases tackle a hard problem: letting many users access data at the same time.
You don’t need that, because Google Sheets does even better. It’s free to use. It has a wondrous user interface, and it shows you what others are working on. Anybody in your newsroom can learn to use it within minutes.
Transactions have more cons than pros
SQL database systems have transactions. Transactions promise that if you unplug your computer, your data won’t disappear.
Surprisingly, you probably don’t want this.
Transactions makes database systems slow. They can lead to huge disk and memory usage. When somebody tries to ROLLBACK an enormous transaction, that can lock your entire database; it can be unreadable for hours or days, even after you reboot.
Transactions don’t back up your data: backups do. It’s easier to back up a folder (or use a Google Sheet, which backs up automatically) than it is to back up a database.
Transactions don’t make it “safe” for two users to edit data concurrently: nothing does. That’s a topic for a PhD dissertation, not a blog post. Suffice it to say: if you have many people editing your database at once, you’re probably making mistakes … because everybody makes mistakes.
There’s more than one kind of sort or index
My favorite thing to do with a database system is SELECT — it helps find great stories, especially using JOIN and ORDER BY.
But as fun as it is, SQL is a terrible language for this.
Behind the scenes, database systems have two key features that help with finding and joining data: sorts and indices.
Databases can sort big tables, but they do lots of arcane stuff in the process. A database sort involves locks, row visibility checks and other operations most of us haven’t imagined. In the end, the UNIX “sort” command-line utility is usually faster; and the “sort” function in your programming language is faster for all but the most enormous lists, too.
Most databases fit in memory
Database systems were built when hard drives were slow and memory was expensive. Indices and caches helped allay then-dreaded “table scans”: reading every byte of a table could take minutes.
That effort is usually wasted on journalists today. Our computing power is enormous and our databases are puny. Scanning 1GB of memory might take 50ms, and sorting a million integers might take 500ms.
Ironically, the auto-indexing and accompanying disk flushes that were built to save time can cost more time than they save. SQL is wasteful on databases smaller than a gigabyte.
SQL doesn’t scale to large databases, either. In the hundreds of gigabytes, engineers “shard” SQL databases — that is, they use lots of small SQL databases instead of one big one. Beyond a few terabytes, sharding SQL databases is impractical. Today’s “Big Data” systems — BigTable, Cassandra and others — don’t use SQL.
Most of your stories deal with too little data — or too much — for SQL.
SQL is complicated
Example: look up a value in a row.
SQL: SELECT name FROM customers WHERE id = 32
Example: look up values from two related tables
SQL: SELECT customers.name, latest_invoices.”date” FROM customers LEFT JOIN (SELECT customer_id, “date” FROM invoices WHERE id IN (SELECT MAX(id) FROM invoices GROUP BY customer_id)) latest_invoices ON customers.id = latest_invoices.customer_id ORDER BY customers.id
Example: find sums
SQL: SELECT customers.name, COALESCE(SUM(invoices.total), 0) AS all_invoices_total FROM customers LEFT JOIN invoices ON customers.id = invoices.customer_id GROUP BY customers.name;
Example: join two tables
(Psych! Joining tables is a means, not an end. Journo-coding is about ends.)
It’s scary to switch away from SQL. But it’s worth it.
Google Sheets (or a competitor) is a friend. Most journalists are totally okay with editing a spreadsheet, and they even understand the concept of joins.
Design a database in Google Sheets, one table per sheet, and share it such that journalists can edit it. Then download TSV (tab-separated value) files from Google Sheets to use as input into your program. (Parsing TSVs is a cinch.) We use google-drive-console-download to download our databases.
Next, you need to re-learn the word, “Database”. My latest “Databases” are entirely in-memory: read entire files and plop their contents into arrays and/or hash tables. For instance: “customers”, “idToCustomer”, “invoices”, “idToInvoice”.
This database-building step is where “joins” come in. For instance, each Customer might have many Invoices. So build “customers”, build “idToCustomer”, build “invoices”, and for each invoice, “idToCustomer[invoice.id].push(invoice)”. This pattern works in any language, and it’s very fast. (It replaces SQL JOIN.)
Finally, you get to “query” your “database”: invoices.length; invoices.customer; and so forth. This is when you realize how much simpler code can be when you omit SQL.
- primaries-2016: some Google Sheets and Associated Press Elections API responses. “Database” was too complicated, because we were trying to make it look like SQL; we did better for the general election. (Ruby)
For a sense of how fast things are: in decision-2016, our “load database” function took about 700ms. Our “render all 170 data-driven HTML, JSON and JPG files” function took about 1s. That’s an order of magnitude faster than Ruby on Rails and SQL.
Stay tuned for a future blog post: I’ll discuss an S3-centric web framework that’s free of SQL.