BigQuery¶
- Data is encrypted with Data Encryption Keys (DEK); Key Encryption Keys (KEK) encrypt the DEKs
- Slot time is the total compute time summed across all parallel workers (not wall-clock elapsed time)
- Consists of two separate services: BigQuery Storage and BigQuery Query
- fully qualified table name: `project.dataset.table`
- Under the hood: uses Colossus storage, Jupiter network, Dremel SQL execution engine, Borg compute facility
- Can ingest both bulk and streaming data
- BigQuery Query service can query external (aka Federated) data such as CSV, Avro, Parquet, ORC
- supports partition elimination of hive-partitioned files
- natively supports STRUCT and ARRAY, preferred over normalized tables
- Query results are cached for ~24 hours, if:
  - the query is issued against native storage
  - the query has no non-deterministic functions (e.g. CURRENT_TIMESTAMP, RAND)
  - the table isn't associated with a streaming buffer
  - the new query text compares exactly the same as the saved query, including whitespace
- supports GeoSpatial or GIS data
- Pricing is similar to Cloud Storage, so there is no need to archive data out of BigQuery
  - each table, or each partition within a table, that is not updated for 90 days is billed at a rate similar to Nearline storage
- two pricing models: per-query (~$5/TB scanned) or flat-rate (500 slots ~ $10,000/month)
- cached results (stored as temporary tables) are free
- Limit access to data by creating Authorized Views
- can't export data out of a view
- cost of storage and queries are separate and charged to respective users
- Time-travel (temporal) queries can access data up to 7 days ago by default
- deleted tables are kept for 2 days if not recreated with the same name
- Data Transfer Service allows scheduling data transfers
- Uses Connectors to source data, 100+ data sources supported
- JavaScript UDFs are supported, but are not as optimized as SQL functions
- stored as dataset objects and can be shared
- Denormalized data containing nested, repeated columns improves performance
- `ARRAY` as mode (the `REPEATED` mode, in the schema slot where `NULLABLE` is otherwise specified) to implement repeated fields
- `STRUCT` as type to implement "pre-joined" data
- Nested data can be flattened using a correlated join and/or `UNNEST`
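A minimal sketch of the pattern above, using a hypothetical `mydataset.orders` table (the names and schema are illustrative, not from the source):

```sql
-- Denormalized table: each order row carries a REPEATED STRUCT of
-- line items instead of a separate child table.
CREATE TABLE mydataset.orders (
  order_id STRING,
  items ARRAY<STRUCT<sku STRING, qty INT64, price NUMERIC>>
);

-- Flatten the nested data with a correlated join on UNNEST:
SELECT o.order_id, i.sku, i.qty
FROM mydataset.orders AS o, UNNEST(o.items) AS i;
```

The comma between `o` and `UNNEST(o.items)` is an implicit correlated `CROSS JOIN`: each item row is paired only with its parent order row.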
- batch loading is not charged, but loading by streaming is charged
- streaming insert rate of ~100K rows/sec; if best-effort dedup is disabled, BQ in the US can support 1 million rows/sec inserts
- BQML supports only `linear_reg` and `logistic_reg` models
- BQML `ML.EVALUATE` measures the effectiveness of a model
- BQML prediction and evaluation:
  - `SELECT * FROM ML.PREDICT(MODEL <model>, (<query>))` for prediction
  - `SELECT * FROM ML.EVALUATE(MODEL <model>, (<query>))` for model evaluation
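An end-to-end BQML sketch, with hypothetical table and column names (`mydataset.customers`, `tenure`, `churned`, etc. are assumptions for illustration):

```sql
-- Train: the column aliased to `label` is what the model predicts.
CREATE OR REPLACE MODEL mydataset.churn_model
OPTIONS(model_type = 'logistic_reg') AS
SELECT tenure, monthly_charges, churned AS label
FROM mydataset.customers;

-- Evaluate: returns metrics such as precision, recall, log_loss, roc_auc.
SELECT *
FROM ML.EVALUATE(MODEL mydataset.churn_model,
  (SELECT tenure, monthly_charges, churned AS label
   FROM mydataset.customers_eval));

-- Predict: adds predicted_label (and probability) columns to the input rows.
SELECT *
FROM ML.PREDICT(MODEL mydataset.churn_model,
  (SELECT tenure, monthly_charges FROM mydataset.new_customers));
```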
- Data Transfer Service available with connectors for Teradata, Redshift, S3 and Google sources
- cost control: use GCP quotas to limit how many TB a user can process per day, etc.
- nested array of struct is like a sub-table
- more efficient than join
- a row can be up to 100 MB
- BP: best used for identifying, immutable parent-child relationships
- BP: splittable compressed row-oriented formats load faster (compressed Avro > compressed CSV)
- Data ingestion is free, except when loading using federated queries, which does use query capacity
- `INFORMATION_SCHEMA` contains metadata about jobs, users, datasets, tables, views, routines
- Queries can be run interactively or in batch; interactive queries count towards concurrency and daily limits
SQL¶
- `ARRAY_AGG([DISTINCT] <col> [ORDER BY <col>] [LIMIT <n>])` with `GROUP BY` turns the specified items from a group into an `ARRAY`
- `FROM <tbl>, UNNEST(<fld>) AS h` turns an array into rows
- `CROSS JOIN` with fields in arrays of nested structs is a correlated join
  - e.g. `SELECT race, participants.name FROM racing.race_results AS r, r.participants`
- `ANY_VALUE` selects any one value when used as a SQL expression: `SELECT A, (SELECT ANY_VALUE(B) FROM T2 WHERE T2.Y = T1.X) FROM T1`
- Use `OPTIONS(partition_expiration_days=60)` to auto-expire partitions
- `SELECT STRUCT("Rudisha" AS name, [12.8, 23.4] AS split) AS runner` creates `runner` as a `RECORD` type with `split` as an `ARRAY` type
- `SELECT date, ARRAY(SELECT AS STRUCT title, score FROM UNNEST(titles) ORDER BY score LIMIT 2) AS top_articles FROM TitlesAndScores`
  - above uses the `ARRAY` constructor with SQL on a repeated field to produce a new `ARRAY` field
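A small self-contained sketch of the `ARRAY_AGG` + `GROUP BY` pattern (the `sales` data is made up for illustration):

```sql
-- Collapse detail rows into one ARRAY per group, keeping only the
-- top 2 amounts per region.
WITH sales AS (
  SELECT 'us' AS region, 10 AS amount UNION ALL
  SELECT 'us', 30 UNION ALL
  SELECT 'eu', 20
)
SELECT region,
       ARRAY_AGG(amount ORDER BY amount DESC LIMIT 2) AS top_amounts
FROM sales
GROUP BY region;
```

Running the result back through `FROM <tbl>, UNNEST(top_amounts)` would expand the arrays into rows again, so the two constructs are inverses.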
Performance¶
- materialize query into a table
- enable query caching since it's free
- query results are cached per user, per project
- minimize the number of columns read
- minimize number of times expensive computations are done (eg. ST_Distance)
- Use BI Engine, which can cache some results in memory for applications like Data Studio (avoids BQ query charges)
- Denormalize instead of joins
- use fewer passes to read data, using expressions instead of joining large tables
- use GROUP BY first then JOIN instead of the other way around
- use window function instead of self join
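A sketch of the window-function-instead-of-self-join advice, assuming a hypothetical `events` table with `user_id` and `ts` columns:

```sql
-- Self-join version (two passes over events plus a join):
--   SELECT e.* FROM events e
--   JOIN (SELECT user_id, MAX(ts) AS ts FROM events GROUP BY user_id) m
--     USING (user_id, ts);

-- Window-function version: one pass, no join; PARTITION BY user_id
-- also spreads the work across many workers.
SELECT * EXCEPT(rn)
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) AS rn
  FROM events
)
WHERE rn = 1;
```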
- try not to overwhelm one worker; instead distribute work over many workers
  - e.g. `ROW_NUMBER() OVER(ORDER BY C1)` can use only 1 worker, whereas `ROW_NUMBER() OVER(PARTITION BY C2 ORDER BY C1)` will use many
- Reduce data skew; try to distribute data over more workers, e.g. by making GROUP BY more granular
  - e.g. `GROUP BY A, B` is faster than `GROUP BY A`
- For large data, use `APPROX_*` functions over their exact counterparts, e.g. `APPROX_COUNT_DISTINCT` over `COUNT(DISTINCT x)`
- `JOIN`: put the largest table on the left
- `GROUP BY`: low cardinality is better than high cardinality
- can require users to supply a partition filter in their queries
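Side by side, with a hypothetical `mydataset.page_views` table:

```sql
-- Exact: every distinct user_id must be shuffled and compared.
SELECT COUNT(DISTINCT user_id) AS exact_users
FROM mydataset.page_views;

-- Approximate (HyperLogLog++-based): small fixed-size sketch per worker,
-- typically ~1-2% error, far less memory and shuffle.
SELECT APPROX_COUNT_DISTINCT(user_id) AS approx_users
FROM mydataset.page_views;
```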
- avoid using UDFs written in JavaScript, which require a subprocess (concurrency limit drops from 50 to 6)
- clustering: column order is important; queries won't use clustering if the higher-order columns aren't supplied in filters
- Explain
- Each Stage shows: Wait, Read, Compute, Write times and Rows read, Rows written.
- each time bucket has a color, and each bucket has gradations
  - lighter shade: average time; darker shade: maximum time amongst all shards
- for skewed operation, max gradation will be much longer than average (long tail)
- split skewed queries into two: one for just the skewed value, the other for everything else
- shuffle (redistribution) happens when mapping from stage N to N+1 isn't statically determined
- is used when joining two large tables
- broadcast join is used when joining small table to a large table
- explain shows `each with all` for a broadcast join (vs `each with each` when the join is a shuffle join)
- Each user has shuffle area limit (~2TB)
- `ORDER BY` on large tables can run out of memory because the master node needs to sort the values
  - use `ORDER BY ... LIMIT <n>` to have BQ optimize the large sort into partial sorts
- prefer `GENERATE_UUID` or `SHA256` over `ROW_NUMBER` to generate surrogate keys
  - `ROW_NUMBER` puts every row in a single partition
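Both alternatives computed per row, with no single-worker bottleneck; the staging table and natural-key columns here are hypothetical:

```sql
-- Random surrogate key: independent per row, no global ordering needed.
SELECT GENERATE_UUID() AS surrogate_key, *
FROM mydataset.staging_table;

-- Deterministic alternative: hash of the natural key, so reloads
-- produce the same key for the same row.
SELECT TO_HEX(SHA256(CONCAT(customer_id, '|', CAST(order_date AS STRING)))) AS surrogate_key, *
FROM mydataset.staging_table;
```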
- BQ uses a fair scheduler, i.e. each stage can use all available slots if needed
- number of available slots are defined at project level
- hierarchical slot reservation allow per project to prioritize certain workloads
- unused slots by higher priority projects can be used by lower priority queries, but can be preempted
- BP: segment users into different projects based on roles, e.g. data science team can be in a project that has fewer slots
| Feature | Partitioning | Clustering |
|---|---|---|
| implementation | physical segments | data sorted within partition |
| usage | ingestion time, date-time or int column | 1+ columns |
| suitable cardinality | low (<10K) | high |
| query pricing | exact | best effort |
| data management | like table (auto-expire) | DML |
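A DDL sketch combining both, plus the partition auto-expiry option mentioned earlier (table and column names are illustrative):

```sql
-- Partition on a low-cardinality date, cluster within each partition on
-- higher-cardinality columns; clustering column order matters for pruning.
CREATE TABLE mydataset.events (
  event_ts TIMESTAMP,
  country  STRING,
  user_id  STRING,
  payload  STRING
)
PARTITION BY DATE(event_ts)
CLUSTER BY country, user_id
OPTIONS(partition_expiration_days = 60);
```

A filter on `country` can use the clustering; a filter on `user_id` alone cannot, because the higher-order clustering column is missing.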
Security¶
- IAM replaces SQL DCL
- Permissions:
  - `bigquery.jobs.{create,list}` to {run, list, cancel} jobs (including queries)
  - `bigquery.{datasets,tables,routines,models}.{create,list,delete}`
- Roles may apply to organization, project, dataset, table/view
- `roles/bigquery.dataEditor`:
  - on tables/views: update data, metadata, or delete the table/view
  - on dataset: above + create/list tables, read dataset metadata
- `roles/bigquery.dataOwner`: above +
  - on tables/views: + share
  - on dataset: + update/delete dataset
  - on project/org: create new datasets
- `roles/bigquery.dataViewer`:
  - on tables/views: read data and metadata
  - on dataset: read metadata, list dataset + read tables/views
- `roles/bigquery.jobUser`: minimum level project; can run queries
- Authorized Views: simulate row and column level security
  - on the source table's dataset, add an ACL entry authorizing the view
- supports secured VPC: e.g. can't export data out of VPC etc
- Encryption: the `AEAD.DECRYPT_STRING`/`AEAD.DECRYPT_BYTES` scalar functions allow each value to be encrypted with a different key
bq¶
- show table metadata: `bq show bigquery-public-data:samples.shakespeare`
- run query: `bq query --use_legacy_sql=false <file>`
- list your datasets: `bq ls`
- list datasets for a project: `bq ls <proj>:`
- list tables: `bq ls <dataset>`
- create a new dataset: `bq mk <dataset>`
- load and create table: `bq load babynames.names2010 yob2010.txt name:string,gender:string,count:integer`
Data Studio¶
- Has two levels of cache: query cache (from BigQuery) and prefetch cache (from Data Studio)