BigQuery

3 minute read

Why BigQuery?

On-premise data analysis face query, infrastructure, and storage challenges.

Why GCP is used for data analysis:

Storage is cheap
Focus on queries and insights, not infrastructure
Massive scalability

BigQuery scales automatically and you only pay for what you use.

You are billed for the resources you use:

Bucket Storage
Query processing and Table Storage

5 steps of analyzing and gleaning insights from data:

Ingest: Get data in.
Transform: Prepare, clean, and transform data.
Store: Create, save, and store datasets.
Analyze: Derive insights from data.
Visualize: Explore and present data insights.

BigQuery Fundamentals

BigQuery is a petabyte-scale data analytics warehouse.

Two services in one:

Analysis
Managed Storage

Job

The Unit of work in BigQuery is called a Job.

Each Job:

Give a Unique ID by Web UI
Can run concurrently
Perform Tasks
History Stored for 6 Months

Pricing

3 categories of BigQuery pricing

Storage: amount of data in table
Processing: amount of data processed
Free: loading, exporting, queries with errors

Apply cost optimizing principles when writing your queries

Only include the columns and rows you need
Use cached results when possible
Limit the use of User-Defined Functions

Cleaning and Transforming Data

High quality datasets conform to strict integrity rules, here are 5 principles of Dataset Integrity:

Validity: range constraints
Accuracy: a known source of truth
Completeness
Consistency: ensures harmony across systems
Uniformity: measuring the same way

Dataset Shape:

Numbers of Columns
Numbers of Rows

Dataset Skew describes the distribution of data values.

Clean and Transform Data with SQL

Setup Field Data Type Constraints
Specify fields as NULLABLE or REQUIRED
Proactively check for NULL values
Check and Filter for allowable rang values: using SQL conditionals
Require primary keys/relational constraints in upstream source systems (BigQuery is an analytics warehouse not primary operational database)

Dataprep

Cloud Dataprep is a pre-processing data pipeline building web UI tool.

Creating New Datasets

Creating Permanent Tables

Write SQL Query
Click Show Options
Specify the Destination Table
Choose Write Preference(if table already exists)
Run Query

If the Destination Table exists:

Write if empty
Append Records
Overwrite table

Temporary Tables

Use Save as Table
All Query Results are stored in tables
If you don’t Save as a Permanent table, a Temporary one is automatically created and saved for 24 hours.
Re-running the same query will hit the cached temporary table.

Creating Logical Views

View = Saved SQL Query (a virtual table)
The underlying query is re-ran each time the view is queried

Advanced BigQuery Functions

Use the right function for the right job:

String Manipulation Functions - FORMAT()
Aggregation Functions - SUM() COUNT() AVG() MAX()
Data Type Conversion Functions - CAST()
Data Functions - PARSE_DATETIME()
Statistical Functions
Analytic Functions
User-defined Functions

Statistical and Approximations

APPROX_COUNT_DISTINCT()

Analytic Window Functions

Standard aggregations

SUM, AVG, MIN, MAX, COUNT, etc.

LEAD() - Returns the value of a row n rows ahead of the current row
LAG() - Returns the value of a row n rows behind the current row
NTH_VALUE() - Returns the value of the nth value in the window

Ranking and numbering functions

CUM_DIST() - Returns the cumulative distribution of a value in a group
DENSE_RANK() - Returns the integer rank of a value in a group
ROW_NUMBER() - Returns the current row number of the query result
RANK() - Returns the integer rank of a value in a group of values
PERCENT_RANK() - Returns the rank of the current row, relative to the other rows in the partition

User-Defined Functions (UDF)

CREATE TEMPORARY FUNCTION greeting
RETURN STRING
LANGUAGE js AS """return "HELLO, " + a + "!";""";

Pitfall:

User-defined functions hurt performance
Concurrent rate limits: for non-UDF queries:50, for UDF-queries:6

Sub-query and CTE design

WITH is simply a named subquery
Acts as a temporary table
Breaks up complex queries
Chain together multiple subqueries in a single WITH
You can reference other subqueries in future subqueries

Ivan Bu

BigQuery

Why BigQuery?

BigQuery Fundamentals

Job

Pricing

Cleaning and Transforming Data

Clean and Transform Data with SQL

Dataprep

Creating New Datasets

Creating Permanent Tables

Temporary Tables

Creating Logical Views

Advanced BigQuery Functions

Statistical and Approximations

Analytic Window Functions

Standard aggregations

Navigation functions

Ranking and numbering functions

User-Defined Functions (UDF)

Sub-query and CTE design

BigQuery Architecture

BigQuery Nested and Repeated Fields

Apply ML to Your Data

Creating ML Datasets in BigQuery

Share on

You may also enjoy

Python Basics

Dataengineeringgcp

Google Analytics: App + Web

Pass the Google Cloud Professional Data Engineer exam