
Architecture

User Stories

  • describes a single piece of functionality of the system
  • has a title and a description
  • form: As a role, I want functionality because need
    • eg: Balance Inquiry: As an account holder, I want to check balance, because overdraft can be prevented
  • form: Given context, when action, then outcome
    • eg: Transfer: When I click transfer, then money should be moved between accounts
  • good stories meet the INVEST criteria
    • Independent: to allow prioritization and planning
    • Negotiable: Not a written contract and can be adjusted
    • Valuable: Users should see tangible benefits
    • Estimable: if not, either details are missing or the story is too large
    • Small: faster feedback cycle, less chance of ambiguity, cost overruns
    • Testable: so to know when it's complete (clearly defined deliverables)

KPI and SLI

  • quantitative requirements that are measurable given the constraints of: People, Time, Finance
  • they are not goals themselves, but make a goal measurable. e.g. if the goal is "increase turnover", a KPI could be "% of conversions on the website"
  • KPI can be business or technical
    • Business: ROI, Employee Turnover, Customer churn
    • Technical: Page views, response time, checkouts, errors
  • KPI should adhere to SMART
    • Specific: eg. User Friendly is not, but Section 508 Accessible is
    • Measurable: should be able to test if you are meeting your KPI
    • Achievable: 100% availability is good, but not realistic
    • Relevant: Does it really matter?
    • Time bound: period for measuring, eg 99% availability per year, month?
  • SLI is a measurable attribute of a service. e.g. latency
    • 3 types in stackdriver:
      • Availability: ratio of successful requests to all requests
      • Performance: successful requests that satisfy some parameter, eg latency for all requests
      • Other: Custom (e.g. number of transactions)
    • generally lower level than KPI, but may indicate what are the causes of missed KPI
  • SLO is: SLI + target + compliance period. eg. a 200ms target on the latency SLI, measured over a week
  • SLA is an agreement, a more restrictive version of SLO
    • not all services have SLAs, but all should have SLOs
    • SLO should be stricter than SLA, eg SLO 200ms response time v/s SLA 300ms response time
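
The SLI/SLO relationship above can be sketched in a few lines of Python; the request counts and the 99.5% target are illustrative, not from any real service:

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Availability SLI: ratio of successful requests to all requests."""
    return successful_requests / total_requests

def meets_slo(sli_value: float, target: float) -> bool:
    """An SLO is an SLI plus a target, checked over a compliance period."""
    return sli_value >= target

# Illustrative numbers for a one-week compliance period:
sli = availability_sli(successful_requests=997_200, total_requests=1_000_000)
print(f"availability this week: {sli:.2%}")      # 99.72%
print("99.5% SLO met:", meets_slo(sli, 0.995))   # True
```

The same shape works for a latency SLI: count requests under the 200ms threshold as "successful" and compare the ratio to the target.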

MicroServices v/s Monolith

  • a Microservice has its own code base and manages its own data
    • v/s Monolith has a single code base and a single data store
  • Domain Driven Design is important to quickly decompose monolith into smaller services
    • by feature/domain: Review service, Order Service, Product service
    • by architectural layer: Web/Mobile UI, Data Access service
    • shared functionality: Authentication, logging, reporting
  • stateful (eg database) services are harder to scale, upgrade and need backing up
    • isolate state into a minimal number of services
  • BP: 12 factors

Requests

  • http or gRPC (for internal, performance sensitive requests)
  • GET, POST (entity ID is generated), PUT (entity ID must be known, should be idempotent), DELETE
    • PATCH is used when only part of the resource is modified (PUT replaces the resource with new version)
  • Response codes: 200 Ok, 400 Client error, 500 server error
  • URIs:
    • plural for sets
    • don't use verb to identify resource (eg do not use /getpet)
    • use version information
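
A minimal sketch of why PUT is idempotent while POST is not; the in-memory store and the /v1/pets URIs (plural noun, versioned, no verbs) are illustrative:

```python
import itertools

store: dict[str, dict] = {}   # maps /v1/pets/{id} -> resource
_ids = itertools.count(1)

def post_pet(body: dict) -> str:
    """POST /v1/pets — the server generates the ID, so repeating the call creates duplicates."""
    pet_id = str(next(_ids))
    store[pet_id] = body
    return pet_id

def put_pet(pet_id: str, body: dict) -> None:
    """PUT /v1/pets/{id} — the client supplies the ID, so repeating the call has no extra effect."""
    store[pet_id] = body

post_pet({"name": "Rex"}); post_pet({"name": "Rex"})   # two resources created
put_pet("42", {"name": "Rex"}); put_pet("42", {"name": "Rex"})  # still one resource
print(len(store))  # 3
```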

API

  • OpenAPI for language agnostic, service description
  • two ways in GCP: Cloud Endpoints or Apigee API platform
  • gRPC is fast, binary protocol based on HTTP/2, supported by Global Load Balancer over HTTP/2
  • BP: use Service.Collection.Verb format
    • e.g. product.inventory.{add,get,update,delete,search}; compute.instances.{create,list,start}

DevOps

CI/CD

  • CI builds the application and produces an artifact
  • GKE supports Binary Authorization as a service that
    • scans images for vulnerabilities
    • attests that the image was built from trusted sources (correct source repositories)

Site Reliability Engineering SRE

  • Google's version of DevOps
  • layers
    • monitoring: e.g. Stackdriver monitors
    • incident response: e.g. alerts
    • postmortem/root cause analysis
    • testing and release procedures
    • capacity planning
    • development
    • product

Monitoring

  • Black box: monitoring only the user interactions
    • good for learning response time, user experience
    • examples: latency
  • white box: monitoring that is aware of the inner workings of the application
    • can measure unusual conditions such as behavior during critical resource shortage
    • examples: CPU usage, IO usage
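
Black-box monitoring can be sketched as a probe that only sees what a user would see — success or failure and latency, never the internals. This is a toy sketch; the probed callable stands in for a real HTTP request to a hypothetical endpoint:

```python
import time

def probe(endpoint_call) -> dict:
    """Time a single request the way an external (black-box) monitor would."""
    start = time.monotonic()
    try:
        endpoint_call()        # e.g. an HTTP GET against the service
        ok = True
    except Exception:
        ok = False
    return {"ok": ok, "latency_s": time.monotonic() - start}

# Stand-in for a real request that takes ~10ms:
result = probe(lambda: time.sleep(0.01))
print(result["ok"], round(result["latency_s"], 3))
```

A white-box check, by contrast, would read internal signals (CPU, IO, queue depths) that this probe cannot observe.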

Deploying Applications to Google Cloud

  • BP: Pick correct platform
    • Specific machine and OS?
      • Yes -> Compute Engine
      • Using Containers?
        • Yes, Want your own Kubernetes cluster?
          • Yes -> Kubernetes
          • No -> Cloud Run
        • No, Event driven?
          • Yes -> Cloud Function
          • No -> App Engine
  • A/B release: make different version of application available to different sets of users
    • generally used for measuring usability, popularity etc
  • Canary release: release the new version to a small share of traffic to test it and reduce risk
    • compute engine: create a new instance group and add it to the LB
    • kubernetes: create new pod with same label
    • App engine: use traffic splitting
  • pipeline deployment: test in successive environment (dev -> qa -> uat -> perf) to increase the chance of success
  • rolling updates: use when two versions at a time can be supported or new version is backward compatible
    • instance group: available by altering instance template
    • kubernetes: available by default. change the docker image
    • App engine: completely automated
  • blue/green: use when multiple versions can't be supported simultaneously; use different environments for old (blue) and new (green)
    • compute engine: keep separate LBs for the two environments and use DNS (or a reverse proxy/LB change) to switch requests over to the green environment
    • kubernetes: configure service to use the new pods using labels
    • App engine: use traffic splitting feature
    • since the cutover happens at the LB, it is not suitable for long-running transactions
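
The traffic-splitting mechanics behind canary (and A/B) releases can be sketched as weighted random routing; the version names and the 90/10 split are illustrative:

```python
import random

# Illustrative weights: 90% of traffic to the stable version, 10% to the canary.
WEIGHTS = {"v1-stable": 0.9, "v2-canary": 0.1}

def route(rng: random.Random) -> str:
    """Pick a backend version in proportion to its traffic weight."""
    r = rng.random()
    cumulative = 0.0
    for version, weight in WEIGHTS.items():
        cumulative += weight
        if r < cumulative:
            return version
    return version  # guard against floating-point rounding at the top end

rng = random.Random(0)
sample = [route(rng) for _ in range(10_000)]
print(sample.count("v2-canary") / len(sample))  # close to 0.1
```

Splitting by a stable key (cookie or IP hash) instead of per-request randomness keeps a given user on one version, which matters for A/B measurements.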

MLOps

  • Prep: Data Engineer, skills: build and maintain data pipelines, databases
  • Build: Data Scientist, skills: stats and algorithm expertise
  • Deploy: ML Engineer, skills: deploying and monitoring at large scale

Design and Process

Maintenance and Monitoring

Managing versions using any of the following strategies

  • Use rolling updates with multiple instances. It's a feature in Instance Group, default in Kubernetes, built-in in App Engine
  • Blue (current)/Green (new) deployment strategy. Test in Green and then switch clients to it. On failure, move everyone back to original blue
    • Compute engines: use DNS to switch LBs
    • Kubernetes: use labels to switch pods
    • App Engine: use traffic splitting
  • Canary release: the new release runs in parallel but receives a smaller share of traffic during testing, before the full switch happens
    • Compute engine: create a new instance group and add it as an additional backend to LB
    • Kubernetes: create new pod with same label as existing one, service will automatically route traffic to it
    • App Engine: Use traffic splitting
  • Strangler Pattern: replace legacy application piece-by-piece with a facade that increasingly directs users to new application, thus strangling the old application

Cost savings

  • use smaller machines with auto-scaling, use preemptible instance with auto healing
  • use Standard storage v/s SSD when IO needs aren't high
  • Networking costs: egress within same zone is free if using internal IP
  • choose the right service, eg. 1GB is free in Firestore, but ~$1,400/month in Bigtable
  • set up budgets and alerts and/or pubsub

Monitoring

  • profiler: https://github.com/GoogleCloudPlatform/training-data-analyst/tree/master/courses/design-process/deploying-apps-to-gcp
    # start the Cloud Profiler agent from application code
    import googlecloudprofiler
    googlecloudprofiler.start(verbose=3)  # verbose=3 is the most detailed agent logging

Designing Reliable Systems

  • 3 key performance metrics:
    • Availability: probability that the system is up and usable by users
    • Durability: probability that data survives h/w or other failures (i.e. is not lost)
    • Scalability: how much load a system can absorb (by adding capacity) before it fails
  • BP strive for N+2 (2 spare units) where N is the minimum needed
    • +1 additional system is for planned maintenance (eg upgrades) and +2 is spare that may be needed if outage during planned maintenance
    • allow scalability, but not too much extra capacity
  • Circuit Breaker pattern: If a service is down, clients retrying makes matters worse; instead use a proxy that monitors the service and stops further requests while the system is unhealthy. e.g. Istio
  • Lazy Deletion pattern: Move data to "Trash" for user to recall, then move to "Soft-delete" before permanently deleting
  • Disaster planning: HA by deploying services in multiple regions

    • Define RPO (how much data can be lost, eg 24 hours) and RTO (how long it'll take to recover, eg 1 hour)
  • Topologies for on-prem and cloud

    • mesh: all systems work together
    • mirror: on-prem and cloud mirror each other. typical use for DR
    • gated egress: on-prem service APIs are (securely) made available to applications running in Cloud
    • gated ingress: on-prem applications (securely) consume services hosted on Cloud
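
The Circuit Breaker pattern mentioned above can be sketched in a few lines; the failure threshold and the TimeoutError stand-in for a backend failure are illustrative:

```python
class CircuitBreaker:
    """Reject calls outright once the backend has failed too many times in a row."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        # An "open" circuit means requests are no longer let through.
        return self.failures >= self.failure_threshold

    def call(self, fn):
        if self.open:
            raise RuntimeError("circuit open: request rejected")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # a success closes the circuit again
        return result

def flaky():
    raise TimeoutError("backend overloaded")

cb = CircuitBreaker(failure_threshold=2)
for _ in range(2):
    try:
        cb.call(flaky)
    except TimeoutError:
        pass
print(cb.open)  # True — further calls fail fast instead of hitting the service
```

Production implementations (e.g. Istio's outlier detection) add what this sketch omits: a timeout after which the breaker goes "half-open" and probes whether the backend has recovered.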

Failure Types

  • single point of Failure: a configuration in which the whole system fails if one particular component fails. Solution
    • Use two spares. E.g. if N is the min required number of VM instances, using N+2 allows outages for upgrade/test and failure
  • correlated failures are failures due to systems sharing a single failure domain; solutions are: decoupling, micro-service architecture, using multiple failure domains
    • failure domain: a group of related items that could fail together. Eg. two nodes in the same rack are in the single rack failure domain
  • cascading failure happens when one failed system causes another to become unstable and subsequently fail. E.g. in a two-server LB configuration, if one server fails, the survivor cannot handle the combined load and fails too
  • Positive Feedback Cycle Overload occurs when the application's retry logic retries failed requests even though the failure was due to an overloaded system. Solutions:
    • Exponential Backoff pattern: double the wait time before each retry, up to a max number of retries
    • Circuit Breaker pattern: protect the service behind a proxy that monitors health limits number of requests. Eg Istio in GKE
  • User errors: data is deleted on user's request that can't be recovered
    • Lazy Deletion pattern: move data to trash (user recoverable) -> soft delete (admin recoverable) -> hard delete (recoverable only from backups)
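
The Exponential Backoff pattern above can be sketched as follows; the base delay and retry cap are illustrative, and the jitter source is injectable so the schedule is testable:

```python
import random

def backoff_delays(base: float = 1.0, max_retries: int = 5, rng=random.random):
    """Yield the wait (in seconds) before each retry: base * 2**attempt + jitter.

    The random jitter prevents many clients from retrying in lockstep,
    which would just recreate the overload spike.
    """
    for attempt in range(max_retries):
        yield base * (2 ** attempt) + rng()

# With jitter disabled, the schedule is the pure doubling sequence:
print(list(backoff_delays(rng=lambda: 0.0)))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```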

HA and DR

  • high availability: explicitly plan for components that aren't inherently highly available. examples,
    • Zonal resources such as Cloud SQL and VM instances should have redundant instances in multi zone/regions
    • Firestore, Spanner or BigQuery don't require any additional redundancies because they are multi-regional
    • Selecting regional Kubernetes cluster deployment replicates node pools in multiple zones
    • Compute Engine: Managed Instance Group MIG can provide regional HA, use global load balancers for multi-region or global HA
    • Kubernetes: In addition to underlying MIG that provides regional HA, Federated Kubernetes is available
    • App Engine, Cloud Function are fully managed and do not require HA consideration
    • Cloud Storage supports multi-region buckets
    • Persistent Disks can be created as regional, which will be replicated across zones
  • Disaster recovery: Cold Standby: instead of running redundant services, keep backups, snapshots and images in multi-region storage to spin up servers in a new region
    • Recovery Point Objective RPO: how much data can be lost (eg. last 1 day or 0 minutes)
    • Recovery Time Objective RTO: how long can the recovery take (eg 4 hours or 2 minutes)
    • Recovery procedure: documented or automated way to recover data
    • Typical solutions:
      • For RPO of 1 day and 4 hours RTO, use daily backups and a restore script
      • for RPO of 0 minutes and 2 min RTO, use failover replicas with daily backups and automatic failover

Security

  • Principle of least privilege
  • Separation of duties
    • BP use multiple projects to separate duties
  • Audit: use logs (Admin, Data Access, VPC Flow, Firewall, System)
  • Securing people:
    • Add people (aka members) to groups, group permissions into roles and assign roles to groups
      • use inherited policies
    • Use Identity Aware Proxy (IAP) for authenticating applications that are HTTPS based (LB, GKE or App Engine)
      • can tunnel TCP traffic to VM instances such as SSH and RDP via HTTPS tunnel
      • enables users to authenticate using Cloud Identity and be authorized using IAM
      • works with internal IP without the need of public IP or VPN (relies on Application level access control model instead of firewall based model)
  • Securing machines:
    • use Service Accounts. Can be assigned to VMs or GKE node pools
    • Can be used to let users access GCP without console access. gcloud auth activate-service-account --key-file=<key.json>
  • Securing network:
    • Remove external IPs from internal systems. Use a bastion host, SSH via IAP, or Cloud NAT
    • Use firewall rules for internal and Cloud Armor for layer 7 external applications
      • Cloud Armor can inspect HTTP headers/cookies to filter out traffic
    • Global LB and CDN offer protection from DDoS attacks
    • Allow private access to GCS using internal IP: gcloud compute networks subnets update subnet-b --enable-private-ip-google-access
    • Use Cloud Endpoints to control access to APIs: integrates with Identity Platform, uses JWTs, and allows access control
    • Data Loss Prevention API allows redacting PII and sensitive data

DLP

  • Data Loss Prevention
  • de-identification: make data unidentifiable
  • Crypto-based Tokenization: maintain referential integrity, i.e. only one encrypted value for the same input value
    • Deterministic Encryption: uses symmetric encryption key (reversible), best overall option when legacy support is not required
    • Format Preserving Encryption FPE: symmetric encryption key (reversible), useful for supporting legacy applications
    • Cryptographic hashing: can't be reversed
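
The cryptographic-hashing option above can be sketched with Python's stdlib (this is not the DLP API itself): a keyed hash gives tokens that preserve referential integrity — the same input always maps to the same token — but, unlike deterministic or format-preserving encryption, can never be reversed. The key and token length are illustrative:

```python
import hashlib
import hmac

SECRET_KEY = b"example-key"  # illustrative; a real key would come from a KMS

def tokenize(value: str) -> str:
    """Cryptographic hashing: a deterministic, irreversible pseudonym."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

# Same input, same token — joins across tables still work on the tokens:
print(tokenize("alice@example.com") == tokenize("alice@example.com"))  # True
```

Keying the hash (HMAC) rather than hashing bare values matters: an unkeyed hash of low-entropy data like email addresses can be reversed by brute-force guessing.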

Glossary:

  • Greenfield projects: Projects that aren't constrained by legacy/prior work
  • Brownfield projects: Projects that must take into account and co-exist with the existing systems they replace