BigQuery
Why BigQuery?
On-premises data analysis faces query, infrastructure, and storage challenges.
Why GCP is used for data analysis:
- Storage is cheap
- Focus on queries and insights, not infrastructure
- Massive scalability
BigQuery scales automatically and you only pay for what you use.
You are billed for the resources you use:
- Bucket Storage
- Table Storage
- Query Processing
5 steps of analyzing and gleaning insights from data:
- Ingest: Get data in.
- Transform: Prepare, clean, and transform data.
- Store: Create, save, and store datasets.
- Analyze: Derive insights from data.
- Visualize: Explore and present data insights.
BigQuery Fundamentals
BigQuery is a petabyte-scale data analytics warehouse.
Two services in one:
- Analysis
- Managed Storage
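For example, the analysis service can query managed storage directly; a minimal sketch against the public bigquery-public-data.samples.shakespeare table:

-- Count distinct words per Shakespeare corpus; column-based storage means
-- only the referenced columns are read.
SELECT
  corpus,
  COUNT(DISTINCT word) AS distinct_words
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus
ORDER BY distinct_words DESC
LIMIT 5;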
Job
The Unit of work in BigQuery is called a Job.
Each Job:
- Is given a unique ID (assigned automatically by the Web UI)
- Can run concurrently
- Performs a task: load, export, query, or copy data
- Has its history stored for 6 months
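You can inspect recent jobs from SQL via the INFORMATION_SCHEMA jobs views; a sketch, assuming your jobs run in the US multi-region (adjust the region qualifier for your project):

-- List your ten most recent jobs and the bytes each one processed.
SELECT
  job_id,
  job_type,
  state,
  creation_time,
  total_bytes_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
ORDER BY creation_time DESC
LIMIT 10;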
Pricing
3 categories of BigQuery pricing
- Storage: amount of data in table
- Processing: amount of data processed
- Free: loading, exporting, queries with errors
Apply cost-optimizing principles when writing your queries (see the query sketch after this list):
- Only include the columns and rows you need
- Use cached results when possible
- Limit the use of User-Defined Functions
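A sketch of the first principle, using the public bigquery-public-data.usa_names table purely for illustration:

-- Cheaper: only the needed columns are scanned and rows are filtered early.
SELECT
  name,
  number
FROM `bigquery-public-data.usa_names.usa_1910_2013`
WHERE state = 'TX'
  AND year >= 2000;
-- Anti-pattern: SELECT * forces BigQuery to scan every column in the table.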
Cleaning and Transforming Data
High-quality datasets conform to strict integrity rules. Here are 5 principles of dataset integrity:
- Validity: range constraints
- Accuracy: a known source of truth
- Completeness: no missing records
- Consistency: ensures harmony across systems
- Uniformity: measuring the same way
Dataset Shape:
- Number of Columns
- Number of Rows
Dataset Skew describes how data values are distributed (for example, values bunched at one end of the expected range).
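One way to inspect shape and skew with SQL, again using a public sample table for illustration:

-- Shape: how many rows and columns the table has.
SELECT COUNT(*) AS row_count
FROM `bigquery-public-data.usa_names.usa_1910_2013`;

SELECT COUNT(*) AS column_count
FROM `bigquery-public-data.usa_names.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'usa_1910_2013';

-- Skew: approximate quartiles show how a numeric column is distributed.
SELECT APPROX_QUANTILES(number, 4) AS quartiles
FROM `bigquery-public-data.usa_names.usa_1910_2013`;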
Clean and Transform Data with SQL
- Setup Field Data Type Constraints
- Specify fields as NULLABLE or REQUIRED
- Proactively check for NULL values
- Check and filter for allowable range values using SQL conditionals (see the sketch after this list)
- Require primary keys/relational constraints in upstream source systems (BigQuery is an analytics warehouse, not a primary operational database)
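A minimal sketch of NULL checks and range filtering, on the same illustrative public table:

-- Filter out rows that violate completeness and validity rules.
SELECT
  name,
  year,
  IF(number > 0, number, NULL) AS number_validated  -- range constraint via a SQL conditional
FROM `bigquery-public-data.usa_names.usa_1910_2013`
WHERE name IS NOT NULL                               -- proactively check for NULL values
  AND year BETWEEN 1910 AND 2013;                    -- keep only allowable range values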
Dataprep
Cloud Dataprep is a web UI tool for building data pre-processing pipelines.
Creating New Datasets
Creating Permanent Tables
- Write SQL Query
- Click Show Options
- Specify the Destination Table
- Choose Write Preference (if the table already exists)
- Run Query
If the Destination Table exists:
- Write if empty
- Append Records
- Overwrite table
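The same write preferences can also be expressed in Standard SQL; a sketch, where mydataset.top_names is a hypothetical destination table:

-- Overwrite table (or create it if it does not exist):
CREATE OR REPLACE TABLE mydataset.top_names AS
SELECT name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY name;

-- Append records to an existing table:
INSERT INTO mydataset.top_names
SELECT name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY name;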
Temporary Tables
- Use Save as Table
- All Query Results are stored in tables
- If you don’t Save as a Permanent table, a Temporary one is automatically created and saved for 24 hours.
- Re-running the same query will hit the cached temporary table.
Creating Logical Views
- View = Saved SQL Query (a virtual table)
- The underlying query is re-run each time the view is queried
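A sketch of creating and querying a view, with mydataset.texas_names as a hypothetical name:

-- A view stores only the SQL, not the results.
CREATE OR REPLACE VIEW mydataset.texas_names AS
SELECT name, year, number
FROM `bigquery-public-data.usa_names.usa_1910_2013`
WHERE state = 'TX';

-- Querying the view re-executes the underlying SELECT.
SELECT name, SUM(number) AS total
FROM mydataset.texas_names
GROUP BY name
ORDER BY total DESC
LIMIT 10;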
Advanced BigQuery Functions
Use the right function for the right job:
- String Manipulation Functions - FORMAT()
- Aggregation Functions - SUM() COUNT() AVG() MAX()
- Data Type Conversion Functions - CAST()
- Date Functions - PARSE_DATETIME()
- Statistical Functions
- Analytic Functions
- User-defined Functions
Statistical Approximations
APPROX_COUNT_DISTINCT()
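For example, an approximate distinct count trades a small amount of accuracy for speed on very large tables (illustrated on a public sample table):

-- Exact distinct count (more expensive at scale):
SELECT COUNT(DISTINCT name) AS exact_names
FROM `bigquery-public-data.usa_names.usa_1910_2013`;

-- Statistical approximation, typically much cheaper on huge tables:
SELECT APPROX_COUNT_DISTINCT(name) AS approx_names
FROM `bigquery-public-data.usa_names.usa_1910_2013`;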
Analytic Window Functions
Standard aggregations
- SUM, AVG, MIN, MAX, COUNT, etc.
Navigation functions
- LEAD() - Returns the value of a row n rows ahead of the current row
- LAG() - Returns the value of a row n rows behind the current row
- NTH_VALUE() - Returns the nth value in the window frame
Ranking and numbering functions
- CUM_DIST() - Returns the cumulative distribution of a value in a group
- DENSE_RANK() - Returns the integer rank of a value in a group
- ROW_NUMBER() - Returns the current row number of the query result
- RANK() - Returns the integer rank of a value in a group of values
- PERCENT_RANK() - Returns the rank of the current row, relative to the other rows in the partition
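The ranking and navigation functions above are used with an OVER clause; a sketch on the illustrative public table:

-- Rank each name within its year, and compare a name's count with the previous year.
SELECT
  year,
  name,
  number,
  RANK() OVER (PARTITION BY year ORDER BY number DESC) AS rank_in_year,
  LAG(number) OVER (PARTITION BY name ORDER BY year) AS previous_year_number
FROM `bigquery-public-data.usa_names.usa_1910_2013`
WHERE state = 'TX';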
User-Defined Functions (UDF)
CREATE TEMPORARY FUNCTION greeting(a STRING)
RETURNS STRING
LANGUAGE js AS """return "HELLO, " + a + "!";""";
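A temporary function is only visible within the same query job, so the SELECT that uses it follows the definition directly; for example (the names here are illustrative):

SELECT greeting(name) AS everyone
FROM UNNEST(['Hannah', 'Max', 'Jakob']) AS name;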
Pitfall:
- User-defined functions hurt performance
- Concurrent rate limits: 50 for non-UDF queries, 6 for UDF queries
Sub-query and CTE design
- WITH is simply a named subquery
- Acts as a temporary table
- Breaks up complex queries
- Chain together multiple subqueries in a single WITH
- You can reference other subqueries in future subqueries
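A sketch of chaining CTEs, where a later clause references an earlier one (public sample table again for illustration):

-- Each WITH clause acts like a named temporary table.
WITH texas AS (
  SELECT name, year, number
  FROM `bigquery-public-data.usa_names.usa_1910_2013`
  WHERE state = 'TX'
),
yearly_totals AS (
  SELECT year, SUM(number) AS total_births
  FROM texas
  GROUP BY year
)
SELECT year, total_births
FROM yearly_totals
ORDER BY year;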
BigQuery Architecture
BigQuery Architecture introduces 3 key innovations:
- Column-Based Data Storage
- Break Apart Tables into Pieces
- Store Nested Fields within a Table
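Nested and repeated fields keep related records together in a single row and are flattened at query time with UNNEST; a minimal sketch using an inline, hypothetical orders table:

-- line_items is an ARRAY of STRUCTs nested inside each order row.
WITH orders AS (
  SELECT 1 AS order_id,
         [STRUCT('pencil' AS item, 2 AS qty),
          STRUCT('eraser' AS item, 1 AS qty)] AS line_items
)
SELECT order_id, li.item, li.qty
FROM orders, UNNEST(line_items) AS li;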