The easiest way to load a CSV into Google BigQuery
BigQuery, Google’s data warehouse as a service, is growing in popularity as an alternative to Amazon Redshift. If you’re considering working with BigQuery, you’ll find that accessing the data is quite straightforward. You can easily query huge amounts of data by running SQL queries in a number of ways: via BigQuery’s web UI or CLI, or by integrating with your favorite BI tool. But first you’ll need to move your data into BigQuery. How hard can that be?
Let’s take a simple case: a report in one of BigQuery’s recommended formats, CSV, generated by a backend service once per day. Looking at BigQuery’s documentation, loading the data appears to be quite easy. Using the CLI, all you need to do is:
- `bq load`: specify your local file and the name of the table you want to create in BigQuery.
- `bq ls`: confirm that the table now appears in the dataset.
- `bq show`: see the resulting schema in BigQuery.
But is that really all there is to it?
In our experience, no. “Naively” loading your data as per the above instructions might get you into serious trouble: don’t try it at home.
It’s not BigQuery’s fault: Google has done an amazing job of making it easy to load and analyze data on its platform. It’s the quirky nature of data flows that makes things much trickier, even in the simplest of cases.
A Very Simple Case and How Things Can Go Wrong
Let’s take the simple use case we outlined above. You work with a backend system which generates customer data in CSV files. Every day the system generates a CSV showing new customers added on that day. Your job is to load these CSVs into BigQuery to facilitate analysis.
Stage Zero: Loading CSVs Manually
At first, you might load the CSVs manually, as we showed in the documentation snippet above.
What can go wrong: The loaded data becomes outdated every few days as new CSVs keep arriving. You’ll want to automate this repetitive task.
Stage One: Automating Data Loading
To automate loading, you need a server that connects to the backend database, queries it for the latest customer data, creates a CSV, and loads it into BigQuery. An easy way is to send a POST request to BigQuery’s API, or to script the CLI. That will take a few days of a developer’s time.
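The daily job described above can be sketched in a few lines. This is a minimal sketch, not a production pipeline: it uses SQLite as a stand-in for the backend database, assumes the `bq` CLI is on the PATH, and the table and column names (`mydataset.customers`, `id`, `name`, `signup_date`) are hypothetical.

```python
import csv
import sqlite3
import subprocess

def export_new_customers(conn, day, csv_path):
    """Query customers added on `day` and write them to a CSV with a header row."""
    rows = conn.execute(
        "SELECT id, name, signup_date FROM customers WHERE signup_date = ?",
        (day,),
    ).fetchall()
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "name", "signup_date"])
        writer.writerows(rows)
    return len(rows)

def load_to_bigquery(csv_path, table="mydataset.customers"):
    """Hand the CSV to BigQuery via the bq CLI; --skip_leading_rows drops the header."""
    subprocess.run(
        ["bq", "load", "--source_format=CSV", "--skip_leading_rows=1",
         table, csv_path, "id:INTEGER,name:STRING,signup_date:DATE"],
        check=True,
    )
```

In practice you would schedule this (cron or a workflow tool) and add retries; the Google client libraries can replace the subprocess call.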
You’ll also need to guarantee data quality: as soon as things are running automatically, you’ll need monitoring and an automatic way to handle errors (or at least stop the process and allow an operator to fix what went wrong). In BigQuery, errors are returned for an entire CSV: BigQuery reports that the CSV failed to load, but typically the problem is with a specific data row. For example, a file might contain 999 rows with integers, which comply with the schema, and one row that contains strings. A smart automatic process should be able to remove the one row that caused the problem and load the rest of the CSV, which would then succeed.
There are three levels (at least) at which you could handle possible errors to ensure data quality:
- Level 1 — recognize that a CSV failed to load, get the error message and alert the operator. This allows the operator to manually examine the CSV, see what caused the failure, fix it and load it manually into BigQuery.
- Level 2 — try to separate the erroneous rows from the good rows in the same CSV. Allow the good data to flow into BigQuery, and leave the problematic rows on the side, allowing the operator to handle them manually. This prevents frequent halting of the data pipeline and also makes the manual work much easier.
- Level 3 — for certain errors, fix them automatically. For example, if the file failed because the timestamp had only hours and not minutes and seconds, it’s possible to append “:00:00” to the end and pass it through. Certain errors will repeat themselves in your data and over time you can predict and handle them without manual intervention.
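Levels 2 and 3 can be sketched together. This is an illustration under assumptions, not a complete validator: the schema here is hypothetical (`id` is an INTEGER, and there is a `last_purchase` timestamp column), and the only automatic fix shown is the “:00:00” repair described above.

```python
import re

def fix_row(row):
    """Level 3: repair known, recurring errors automatically."""
    # A timestamp with a date and hour but no minutes/seconds
    # (e.g. "2024-01-02 13") gets ":00:00" appended.
    if re.fullmatch(r"\d{4}-\d{2}-\d{2} \d{2}", row["last_purchase"]):
        row["last_purchase"] += ":00:00"
    return row

def split_rows(rows):
    """Level 2: separate rows that satisfy the schema from rows an operator must inspect."""
    good, bad = [], []
    for row in rows:
        row = fix_row(row)
        try:
            int(row["id"])  # the schema says id is an INTEGER
            good.append(row)
        except ValueError:
            bad.append(row)
    return good, bad
```

The good rows go on to `bq load`; the bad rows are set aside and surfaced to the operator, Level 1 style.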
What can go wrong: There are additional failure scenarios. What happens if the server fails to read the source data from the backend database; receives the wrong data or incomplete data; or encounters a failure reading the data from its local disk? These things will happen and the data pipeline will grind to a halt. Or worse, they can go unnoticed, and later on you’ll find gaps in your data.
Even a robust script will not handle all possible error scenarios, and it will often be necessary to go back and adapt the script to handle a new issue, or troubleshoot some issues manually. It’s a given that unexpected errors will occur and your script — now your data pipeline — will require ongoing monitoring and maintenance to ensure the data flows as expected.
Stage Two: Updating vs. Inserting
Let’s say our customer data includes purchases. Customers will come back from time to time to purchase more. This means the same customer row in your table will be updated over time with different numbers, such as total purchase amount or items purchased.
If so, the original query won’t be adequate. Instead of asking for new customers added today, you need to query for all customers modified today, including older customers you already saved to BigQuery. BigQuery does not allow modifying an existing table in place. Since BigQuery does not support deletes or upserts, you’ll need to do the following:
- Load all modified customer data to BigQuery in a new table.
- Rework your script to check which customers in the old table do not exist in the new table, and copy them over (so you don’t lose old customers which weren’t modified).
- If you made changes to the data within BigQuery, the reconciliation gets much more complex, as there could be multiple changes both to the existing data and the newly ingested data, with conflicts. You need logic to correctly merge all changes.
- Delete the old table.
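Assuming no changes were made to the data inside BigQuery itself (the simple case, without the conflict resolution described above), the reconciliation in the steps above reduces to a merge keyed on the customer id. A minimal sketch in plain Python; in practice this would run as a query between the two tables:

```python
def merge_tables(old_rows, new_rows, key="id"):
    """Build the replacement table: new versions of modified customers win,
    and unmodified customers are copied over from the old table."""
    merged = {row[key]: dict(row) for row in old_rows}        # start from the old table
    merged.update({row[key]: dict(row) for row in new_rows})  # newly loaded rows override
    return list(merged.values())
```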
What can go wrong: Anyone would be nervous copying and deleting entire data sets over and over again every day, as with any destructive operation. But this is commonly how things work in BigQuery and similar products. Any hiccup in the way the data is copied over or correlated with old records can lead to serious errors in the data, or even loss of the entire dataset.
Stage Three: Schema Change
Inevitably, one day or another, the schema will change. For example, a new data field might be added with the customer’s credit score. Or an existing field with the time of last purchase might be changed from just the date, to a full timestamp including minutes and seconds.
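One cheap safeguard, sketched here as a suggestion rather than a prescription, is to diff each day’s CSV header against the column list you last loaded, so a new or renamed field is caught before the load fails:

```python
def diff_headers(known_columns, csv_header):
    """Report columns that appeared or disappeared relative to the known schema."""
    added = [c for c in csv_header if c not in known_columns]
    removed = [c for c in known_columns if c not in csv_header]
    return added, removed
```

When a genuinely new column appears, BigQuery lets you add it to the table schema as a NULLABLE field before loading; a changed type, like date widening to timestamp, needs more care.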