BigQuery

3 minute read

Why BigQuery?

On-premise data analysis face query, infrastructure, and storage challenges.

Why GCP is used for data analysis:

  • Storage is cheap
  • Focus on queries and insights, not infrastructure
  • Massive scalability

BigQuery scales automatically and you only pay for what you use.

You are billed for the resources you use:

  • Bucket Storage
  • Query processing and Table Storage

5 steps of analyzing and gleaning insights from data:

  1. Ingest: Get data in.
  2. Transform: Prepare, clean, and transform data.
  3. Store: Create, save, and store datasets.
  4. Analyze: Derive insights from data.
  5. Visualize: Explore and present data insights.

BigQuery Fundamentals

BigQuery is a petabyte-scale data analytics warehouse.

Two services in one:

  1. Analysis
  2. Managed Storage

Job

The Unit of work in BigQuery is called a Job.

Each Job:

  • Give a Unique ID by Web UI
  • Can run concurrently
  • Perform Tasks
  • History Stored for 6 Months

Pricing

3 categories of BigQuery pricing

  1. Storage: amount of data in table
  2. Processing: amount of data processed
  3. Free: loading, exporting, queries with errors

Apply cost optimizing principles when writing your queries

  • Only include the columns and rows you need
  • Use cached results when possible
  • Limit the use of User-Defined Functions

Cleaning and Transforming Data

High quality datasets conform to strict integrity rules, here are 5 principles of Dataset Integrity:

  1. Validity: range constraints
  2. Accuracy: a known source of truth
  3. Completeness
  4. Consistency: ensures harmony across systems
  5. Uniformity: measuring the same way

Dataset Shape:

  • Numbers of Columns
  • Numbers of Rows

Dataset Skew describes the distribution of data values.

Clean and Transform Data with SQL

  • Setup Field Data Type Constraints
  • Specify fields as NULLABLE or REQUIRED
  • Proactively check for NULL values
  • Check and Filter for allowable rang values: using SQL conditionals
  • Require primary keys/relational constraints in upstream source systems (BigQuery is an analytics warehouse not primary operational database)

Dataprep

Cloud Dataprep is a pre-processing data pipeline building web UI tool.

Creating New Datasets

Creating Permanent Tables

  1. Write SQL Query
  2. Click Show Options
  3. Specify the Destination Table
  4. Choose Write Preference(if table already exists)
  5. Run Query

If the Destination Table exists:

  • Write if empty
  • Append Records
  • Overwrite table

Temporary Tables

  1. Use Save as Table
  2. All Query Results are stored in tables
  3. If you don’t Save as a Permanent table, a Temporary one is automatically created and saved for 24 hours.
  4. Re-running the same query will hit the cached temporary table.

Creating Logical Views

  • View = Saved SQL Query (a virtual table)
  • The underlying query is re-ran each time the view is queried

Advanced BigQuery Functions

Use the right function for the right job:

  • String Manipulation Functions - FORMAT()
  • Aggregation Functions - SUM() COUNT() AVG() MAX()
  • Data Type Conversion Functions - CAST()
  • Data Functions - PARSE_DATETIME()
  • Statistical Functions
  • Analytic Functions
  • User-defined Functions

Statistical and Approximations

APPROX_COUNT_DISTINCT()

Analytic Window Functions

Standard aggregations

  • SUM, AVG, MIN, MAX, COUNT, etc.
  • LEAD() - Returns the value of a row n rows ahead of the current row
  • LAG() - Returns the value of a row n rows behind the current row
  • NTH_VALUE() - Returns the value of the nth value in the window

Ranking and numbering functions

  • CUM_DIST() - Returns the cumulative distribution of a value in a group
  • DENSE_RANK() - Returns the integer rank of a value in a group
  • ROW_NUMBER() - Returns the current row number of the query result
  • RANK() - Returns the integer rank of a value in a group of values
  • PERCENT_RANK() - Returns the rank of the current row, relative to the other rows in the partition

User-Defined Functions (UDF)

CREATE TEMPORARY FUNCTION greeting
RETURN STRING
LANGUAGE js AS """return "HELLO, " + a + "!";""";

Pitfall:

  • User-defined functions hurt performance
  • Concurrent rate limits: for non-UDF queries:50, for UDF-queries:6

Sub-query and CTE design

  • WITH is simply a named subquery
  • Acts as a temporary table
  • Breaks up complex queries
  • Chain together multiple subqueries in a single WITH
  • You can reference other subqueries in future subqueries

BigQuery Architecture

BigQuery Architecture introduces 3 key innovations:

  1. Column-Based Data Storage
  2. Break Apart Tables into Pieces
  3. Store Nested Fields within a Table

BigQuery Nested and Repeated Fields

Apply ML to Your Data

Creating ML Datasets in BigQuery