BigQuery: Loading Data and Debugging Notes

Jae Huang
4 min read · Jul 22, 2021


In this post, I will introduce how I migrated a couple of large data files from my local machine to BigQuery.

Recently I had a project at work that required doing a data labeling job on a couple of large datasets (6~7GB per file). While I got the job done with the pandas chunk feature, I was still thinking about scalability. This triggered me to explore BigQuery.

To use BigQuery, we need to sign up for a GCP account, fill in the billing information, and create a project. This part is quite straightforward, so I will jump to the part about loading data into BigQuery.

For each project, we can create multiple datasets, and each dataset can have multiple tables. We can select whichever physical location we like for a specific dataset, such as California, Taiwan, or Tokyo, or just use the default location.
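As a quick sketch, the same dataset can also be created from the built-in Cloud Shell with the bq tool; the project ID and dataset name below are just placeholders.

bq --location=asia-east1 mk --dataset myproject:labeling_data # asia-east1 is the Taiwan region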

Load Data

We can upload files directly from the local machine via the Upload option. There are some other sources as well, such as Google Cloud Storage, Google Drive, and Cloud Bigtable.

We can select the file format and give it a table name (an existing table is fine, I will get to that later). Depending on the file format, there are different settings to fill in. I have a tab-separated text file, which should be chosen as CSV. Accordingly, we need to define the table schema; here I chose auto detect since I have a header in this file. For cost-sensitive projects, I suggest setting the partitioning wisely, which optimizes query performance and cost, while the clustering option focuses more on column-wise organization.
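To sketch the same settings in bq terms (the table name, file name, and column names here are placeholders, and created_at stands for whatever timestamp column the partition would be based on):

bq load --source_format=CSV \
    --autodetect \
    -F tab \
    --time_partitioning_field created_at \
    --clustering_fields label \
    myproject:labeling_data.mytable \
    ./mydata.tsv

Here -F tab sets the field delimiter for a tab-separated file, --autodetect infers the schema from the header row, and the last two flags correspond to the partitioning and clustering options in the console.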

Here come the advanced options for tweaking. Previously I said that loading data into an existing table is workable; this is what the write preference is for. We can load the data into a new table, append it to an existing table, or overwrite the table. Secondly, we can set how many errors we can tolerate: when it comes to big and raw data, it is highly possible that some rows will cause errors during loading, and we don’t want them to jam the whole loading job. Moving on, as I have a tab-separated file, I set the field delimiter accordingly.
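For reference, these three console settings map onto bq load flags, which will reappear in the full command later in this post:

--replace # overwrite the destination table; the default is to append
--max_bad_records 10 # tolerate up to 10 bad records before failing the job
-F tab # field delimiter for a tab-separated file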

Load Large Data

Back to the initial reason I wanted to use BigQuery: my file is actually too big to upload directly. So the way around it is to upload the file to Google Cloud Storage first and then load the data from the bucket. Other than the source, nothing is different from a direct upload.
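The bucket route is just two commands from Cloud Shell; this sketch reuses the same placeholder names as above plus a hypothetical bucket name:

gsutil cp ./mydata.tsv gs://my-labeling-bucket/mydata.tsv
bq load --source_format=CSV --autodetect -F tab \
    myproject:labeling_data.mytable \
    gs://my-labeling-bucket/mydata.tsv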

More Advanced Options

After setting this up, I gladly clicked the sweet Create Table button. No surprise, there was an error (or I should say, many errors for one reason). The error says: Error: Data between close double quote (") and field separator.

OK, I kind of already knew that some entries in my data contain only one double quote, which is what BigQuery is complaining about. I searched for a solution, and there are two ways to fix this.

The first solution I tried was specifying Allow Quoted Newlines. This works when you have quoted data that spans multiple lines, which was not my case, so it didn’t work for me.
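For completeness, the equivalent bq load flag, in case your data does have quoted fields spanning multiple lines:

--allow_quoted_newlines # allow quoted data sections that contain newline characters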

The other way is to tell BigQuery to treat the double quote as plain text by setting the quote character to an empty string. But I found nowhere in the console that shows that setting, so I decided to use the bq command instead. This can simply be done in the GCP built-in Cloud Shell with the following flag.

--quote "" # instead of the default quote character (")

So here is the full bq load command that sums up all of my previous settings.

bq load --source_format=CSV \
    --autodetect \
    -F tab \
    --replace \
    --quote "" \
    --max_bad_records 10 \
    -E UTF-8 \
    [projectID:dataset.table] \
    [gs://bucketname/objectname]

And hooray, it worked!

Now we can query the data from the console.
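A quick sanity check can also be run from Cloud Shell (same placeholder names as above):

bq query --use_legacy_sql=false \
    'SELECT COUNT(*) AS row_count FROM `myproject.labeling_data.mytable`'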

Lastly, BigQuery has a lot more features than I can cover in this post, so here is the link to the official documentation.

In the next post I will talk about how to set up the credentials to make queries in Python.
