
Architecture

User Stories

  • describes a single piece of functionality of the system
  • has a title and a description
  • form: As a role, I want functionality because need
    • eg: Balance Inquiry: As an account holder, I want to check balance, because overdraft can be prevented
  • form: Given context, when action, then outcome
    • eg: Transfer: When I click transfer, then money should be moved between accounts
  • good stories meet the INVEST criteria
    • Independent: to allow prioritization and planning
    • Negotiable: Not a written contract and can be adjusted
    • Valuable: Users should see tangible benefits
    • Estimable: if not, either details are missing or the story is too large
    • Small: faster feedback cycle, less chance of ambiguity, cost overruns
    • Testable: so to know when it's complete (clearly defined deliverables)

KPI and SLI

  • quantitative requirements that are measurable given the constraints of: People, Time, Finance
  • they are not goals themselves, but make a goal measurable. e.g. if the goal is "increase turnover", a KPI could be "% of conversions on the website"
  • KPI can be business or technical
    • Business: ROI, Employee Turnover, Customer churn
    • Technical: Page views, response time, checkouts, errors
  • KPI should adhere to SMART
    • Specific: eg. User Friendly is not, but Section 508 Accessible is
    • Measurable: should be able to test if you are meeting your KPI
    • Achievable: 100% availability is good, but not realistic
    • Relevant: Does it really matter?
    • Time bound: period for measuring, eg 99% availability per year, month?
  • SLI is a measurable attribute of a service. e.g. latency
    • 3 types in stackdriver:
      • Availability: ratio of successful requests to all requests
      • Performance: successful requests that satisfy some parameter, eg latency for all requests
      • Other: Custom (e.g. number of transactions)
    • generally lower level than KPI, but may indicate what are the causes of missed KPI
  • SLO is: SLI + target + compliance period. eg. a 200ms target on the latency SLI, measured over a week
  • SLA is an agreement, a more restrictive version of SLO
    • not all services have SLAs, but all should have SLOs
    • SLO should be stricter than SLA, eg SLO 200ms response time v/s SLA 300ms response time
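
The SLI/SLO relationship above can be sketched in a few lines of Python; the request counts and the 99.5% target are illustrative, not from any real service:

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Availability SLI: ratio of successful requests to all requests."""
    return successful_requests / total_requests

def meets_slo(sli_value: float, target: float) -> bool:
    """An SLO is an SLI plus a target, checked over a compliance period."""
    return sli_value >= target

# Illustrative numbers for a one-week compliance period:
sli = availability_sli(successful_requests=997_200, total_requests=1_000_000)
print(f"availability this week: {sli:.2%}")      # 99.72%
print("99.5% SLO met:", meets_slo(sli, 0.995))   # True
```

The same shape works for a latency SLI: count requests under the 200ms threshold as "successful" and compare the ratio to the target.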

MicroServices v/s Monolith

  • a Microservice has its own code base and manages its own data
    • v/s Monolith has a single code base and a single data store
  • Domain Driven Design is important to quickly decompose monolith into smaller services
    • by feature/domain: Review service, Order Service, Product service
    • by architectural layer: Web/Mobile UI, Data Access service
    • shared functionality: Authentication, logging, reporting
  • stateful (eg database) services are harder to scale, upgrade and need backing up
    • isolate state into a minimal number of services
  • BP: 12 factors

Requests

  • http or gRPC (for internal, performance sensitive requests)
  • GET, POST (entity ID is generated), PUT (entity ID must be known, should be idempotent), DELETE
    • PATCH is used when only part of the resource is modified (PUT replaces the resource with new version)
  • Response codes: 200 Ok, 400 Client error, 500 server error
  • URIs:
    • plural for sets
    • don't use verb to identify resource (eg do not use /getpet)
    • use version information
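
A minimal sketch of why PUT is idempotent while POST is not; the in-memory store and the /v1/pets URIs (plural noun, versioned, no verbs) are illustrative:

```python
import itertools

store: dict[str, dict] = {}   # maps /v1/pets/{id} -> resource
_ids = itertools.count(1)

def post_pet(body: dict) -> str:
    """POST /v1/pets — the server generates the ID, so repeating the call creates duplicates."""
    pet_id = str(next(_ids))
    store[pet_id] = body
    return pet_id

def put_pet(pet_id: str, body: dict) -> None:
    """PUT /v1/pets/{id} — the client supplies the ID, so repeating the call has no extra effect."""
    store[pet_id] = body

post_pet({"name": "Rex"}); post_pet({"name": "Rex"})   # two resources created
put_pet("42", {"name": "Rex"}); put_pet("42", {"name": "Rex"})  # still one resource
print(len(store))  # 3
```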

API

  • OpenAPI for language agnostic, service description
  • two ways in GCP: Cloud Endpoints or Apigee API platform
  • gRPC is fast, binary protocol based on HTTP/2, supported by Global Load Balancer over HTTP/2
  • BP: use Service.Collection.Verb format
    • e.g. product.inventory.{add,get,update,delete,search}; compute.instances.{create,list,start}

DevOps

CI/CD

  • CI builds the application and produces an artifact
  • GKE supports Binary Authorization as a service that
    • scans images for vulnerabilities
    • attests that the image was built from trusted sources (correct source repositories)

Site Reliability Engineering SRE

  • Google's version of DevOps
  • layers
    • monitoring: e.g. Stackdriver monitors
    • incident response: e.g. alerts
    • postmortem/root cause analysis
    • testing and release procedures
    • capacity planning
    • development
    • product

Monitoring

  • Black box: monitoring only the user interactions
    • good for learning response time, user experience
    • examples: latency
  • white box: monitoring that is aware of the inner workings of the application
    • can measure unusual conditions such as behavior during critical resource shortage
    • examples: CPU usage, IO usage
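
Black-box monitoring can be sketched as a probe that only sees what a user would see — success or failure and latency, never the internals. This is a toy sketch; the probed callable stands in for a real HTTP request to a hypothetical endpoint:

```python
import time

def probe(endpoint_call) -> dict:
    """Time a single request the way an external (black-box) monitor would."""
    start = time.monotonic()
    try:
        endpoint_call()        # e.g. an HTTP GET against the service
        ok = True
    except Exception:
        ok = False
    return {"ok": ok, "latency_s": time.monotonic() - start}

# Stand-in for a real request that takes ~10ms:
result = probe(lambda: time.sleep(0.01))
print(result["ok"], round(result["latency_s"], 3))
```

A white-box check, by contrast, would read internal signals (CPU, IO, queue depths) that this probe cannot observe.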

Deploying Applications to Google Cloud

  • BP: Pick correct platform
    • Specific machine and OS?
      • Yes -> Compute Engine
      • Using Containers?
        • Yes, Want your own Kubernetes cluster?
          • Yes -> Kubernetes
          • No -> Cloud Run
        • No, Event driven?
          • Yes -> Cloud Function
          • No -> App Engine
  • A/B release: make different version of application available to different sets of users
    • generally used for measuring usability, popularity etc
  • Canary release: release the new version to a small share of traffic to test it and reduce risk
    • compute engine: create a new instance group and add it to the LB
    • kubernetes: create new pod with same label
    • App engine: use traffic splitting
  • pipeline deployment: test in successive environment (dev -> qa -> uat -> perf) to increase the chance of success
  • rolling updates: use when two versions at a time can be supported or new version is backward compatible
    • instance group: available by altering instance template
    • kubernetes: available by default. change the docker image
    • App engine: completely automated
  • blue/green: use when multiple versions can't be supported simultaneously; use different environments for old (blue) and new (green)
    • compute engine: keep separate LBs for the two environments and use DNS (or a reverse proxy/LB change) to switch requests over to the green environment
    • kubernetes: configure service to use the new pods using labels
    • App engine: use traffic splitting feature
    • since the cutover happens at the LB, it is not suitable for long-running transactions
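
The traffic-splitting mechanics behind canary (and A/B) releases can be sketched as weighted random routing; the version names and the 90/10 split are illustrative:

```python
import random

# Illustrative weights: 90% of traffic to the stable version, 10% to the canary.
WEIGHTS = {"v1-stable": 0.9, "v2-canary": 0.1}

def route(rng: random.Random) -> str:
    """Pick a backend version in proportion to its traffic weight."""
    r = rng.random()
    cumulative = 0.0
    for version, weight in WEIGHTS.items():
        cumulative += weight
        if r < cumulative:
            return version
    return version  # guard against floating-point rounding at the top end

rng = random.Random(0)
sample = [route(rng) for _ in range(10_000)]
print(sample.count("v2-canary") / len(sample))  # close to 0.1
```

Splitting by a stable key (cookie or IP hash) instead of per-request randomness keeps a given user on one version, which matters for A/B measurements.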

MLOps

  • Prep: Data Engineer, skills: build and maintain data pipelines, databases
  • Build: Data Scientist, skills: stats and algorithm expertise
  • Deploy: ML Engineer, skills: deploying and monitoring at large scale

Design and Process

Maintenance and Monitoring

Managing versions using any of the following strategies

  • Use rolling updates with multiple instances. It's a feature in Instance Group, default in Kubernetes, built-in in App Engine
  • Blue (current)/Green (new) deployment strategy. Test in Green and then switch clients to it. On failure, move everyone back to original blue
    • Compute engines: use DNS to switch LBs
    • Kubernetes: use labels to switch pods
    • App Engine: use traffic splitting
  • Canary release: the new release runs in parallel but receives a smaller share of traffic during testing, before the full switch happens
    • Compute engine: create a new instance group and add it as an additional backend to LB
    • Kubernetes: create new pod with same label as existing one, service will automatically route traffic to it
    • App Engine: Use traffic splitting
  • Strangler Pattern: replace legacy application piece-by-piece with a facade that increasingly directs users to new application, thus strangling the old application

Cost savings

  • use smaller machines with auto-scaling, use preemptible instance with auto healing
  • use Standard storage v/s SSD when IO needs aren't high
  • Networking costs: egress within same zone is free if using internal IP
  • choose the right service, eg. 1GB is free in Firestore, but ~$1,400/month in Bigtable
  • set up budgets and alerts and/or pubsub

Monitoring

  • profiler: https://github.com/GoogleCloudPlatform/training-data-analyst/tree/master/courses/design-process/deploying-apps-to-gcp
    # start the Cloud Profiler agent from application code
    import googlecloudprofiler
    googlecloudprofiler.start(verbose=3)  # verbose=3 is the most detailed agent logging

Designing Reliable Systems

  • 3 key performance metrics:
    • Availability: probability that the system is up and usable by users
    • Durability: probability that data survives h/w or other failures (i.e. is not lost)
    • Scalability: how much load a system can absorb (by adding capacity) before it fails
  • BP strive for N+2 (2 spare units) where N is the minimum needed
    • +1 additional system is for planned maintenance (eg upgrades) and +2 is spare that may be needed if outage during planned maintenance
    • allow scalability, but not too much extra capacity
  • Circuit Breaker pattern: If a service is down, clients retrying makes matters worse; instead use a proxy that monitors the service and stops further requests while the system is unhealthy. e.g. Istio
  • Lazy Deletion pattern: Move data to "Trash" for user to recall, then move to "Soft-delete" before permanently deleting
  • Disaster planning: HA by deploying services in multiple regions

    • Define RPO (how much data can be lost, eg 24 hours) and RTO (how long it'll take to recover, eg 1 hour)
  • Topologies for on-prem and cloud

    • mesh: all systems work together
    • mirror: on-prem and cloud mirror each other. typical use for DR
    • gated egress: on-prem service APIs are (securely) made available to applications running in Cloud
    • gated ingress: on-prem applications (securely) consume services hosted on Cloud
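
The Circuit Breaker pattern mentioned above can be sketched in a few lines; the failure threshold and the TimeoutError stand-in for a backend failure are illustrative:

```python
class CircuitBreaker:
    """Reject calls outright once the backend has failed too many times in a row."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        # An "open" circuit means requests are no longer let through.
        return self.failures >= self.failure_threshold

    def call(self, fn):
        if self.open:
            raise RuntimeError("circuit open: request rejected")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # a success closes the circuit again
        return result

def flaky():
    raise TimeoutError("backend overloaded")

cb = CircuitBreaker(failure_threshold=2)
for _ in range(2):
    try:
        cb.call(flaky)
    except TimeoutError:
        pass
print(cb.open)  # True — further calls fail fast instead of hitting the service
```

Production implementations (e.g. Istio's outlier detection) add what this sketch omits: a timeout after which the breaker goes "half-open" and probes whether the backend has recovered.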

Failure Types

  • single point of Failure: a configuration in which the whole system fails if one particular component fails. Solution
    • Use two spares. E.g. if N is the min required number of VM instances, using N+2 allows outages for upgrade/test and failure
  • correlated failures are failures due to systems sharing a single failure domain; solutions are: decoupling, micro-service architecture, using multiple failure domains
    • failure domain: a group of related items that could fail together. Eg. two nodes in the same rack are in the single rack failure domain
  • cascading failure happens when one failed system causes another to become unstable and subsequently fail. E.g. in a two-server LB configuration, if one server fails, the survivor cannot handle the combined load and fails too
  • Positive Feedback Cycle Overload occurs when the application's retry logic retries failed requests even though the failure was due to an overloaded system. Solutions:
    • Exponential Backoff pattern: double the wait time before each retry, up to a max number of retries
    • Circuit Breaker pattern: protect the service behind a proxy that monitors health limits number of requests. Eg Istio in GKE
  • User errors: data is deleted on user's request that can't be recovered
    • Lazy Deletion pattern: move data to trash (user recoverable) -> soft delete (admin recoverable) -> hard delete (recoverable only from backups)
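
The Exponential Backoff pattern above can be sketched as follows; the base delay and retry cap are illustrative, and the jitter source is injectable so the schedule is testable:

```python
import random

def backoff_delays(base: float = 1.0, max_retries: int = 5, rng=random.random):
    """Yield the wait (in seconds) before each retry: base * 2**attempt + jitter.

    The random jitter prevents many clients from retrying in lockstep,
    which would just recreate the overload spike.
    """
    for attempt in range(max_retries):
        yield base * (2 ** attempt) + rng()

# With jitter disabled, the schedule is the pure doubling sequence:
print(list(backoff_delays(rng=lambda: 0.0)))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```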

HA and DR

  • high availability: explicitly plan for components that aren't inherently highly available. examples,
    • Zonal resources such as Cloud SQL and VM instances should have redundant instances in multi zone/regions
    • Firestore, Spanner or BigQuery don't require any additional redundancies because they are multi-regional
    • Selecting regional Kubernetes cluster deployment replicates node pools in multiple zones
    • Compute Engine: Managed Instance Group MIG can provide regional HA, use global load balancers for multi-region or global HA
    • Kubernetes: In addition to underlying MIG that provides regional HA, Federated Kubernetes is available
    • App Engine, Cloud Function are fully managed and do not require HA consideration
    • Cloud Storage supports multi-region buckets
    • Persistent Disks can be created as regional, which will be replicated across zones
  • Disaster recovery: Cold Standby: instead of running redundant services, keep backups, snapshots and images in multi-region storage to spin up servers in a new region
    • Recovery Point Objective RPO: how much data can be lost (eg. last 1 day or 0 minutes)
    • Recovery Time Objective RTO: how long can the recovery take (eg 4 hours or 2 minutes)
    • Recovery procedure: documented or automated way to recover data
    • Typical solutions:
      • For RPO of 1 day and 4 hours RTO, use daily backups and a restore script
      • for RPO of 0 minutes and 2 min RTO, use failover replicas with daily backups and automatic failover

Security

  • Principle of least privilege
  • Separation of duties
    • BP use multiple projects to separate duties
  • Audit: use logs (Admin, Data Access, VPC Flow, Firewall, System)
  • Securing people:
    • Add people (aka members) to groups, group permissions into roles and assign roles to groups
      • use inherited policies
    • Use Identity Aware Proxy (IAP) for authenticating applications that are HTTPS based (LB, GKE or App Engine)
      • can tunnel TCP traffic to VM instances such as SSH and RDP via HTTPS tunnel
      • enables users to authenticate using Cloud Identity and be authorized using IAM
      • works with internal IP without the need of public IP or VPN (relies on Application level access control model instead of firewall based model)
  • Securing machines:
    • use Service Accounts. Can be assigned to VMs or GKE node pools
    • Can be used to let users access GCP without console access. gcloud auth activate-service-account --key-file=<key.json>
  • Securing network:
    • Remove external IPs from internal systems. Use a bastion host, SSH via IAP, or Cloud NAT
    • Use firewall rules for internal and Cloud Armor for layer 7 external applications
      • Cloud Armor can inspect HTTP headers/cookies to filter out traffic
    • Global LB and CDN offer protection from DDoS attacks
    • Allow private access to GCS using internal IP: gcloud compute networks subnets update subnet-b --enable-private-ip-google-access
    • Use Cloud Endpoints to control access to APIs: integrates with Identity Platform, uses JWTs, and allows access control
    • Data Loss Prevention API allows redacting PII and sensitive data

DLP

  • Data Loss Prevention
  • de-identification: make data unidentifiable
  • Crypto-based Tokenization: maintain referential integrity, i.e. only one encrypted value for the same input value
    • Deterministic Encryption: uses symmetric encryption key (reversible), best overall option when legacy support is not required
    • Format Preserving Encryption FPE: symmetric encryption key (reversible), useful for supporting legacy applications
    • Cryptographic hashing: can't be reversed
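
The cryptographic-hashing option above can be sketched with Python's stdlib (this is not the DLP API itself): a keyed hash gives tokens that preserve referential integrity — the same input always maps to the same token — but, unlike deterministic or format-preserving encryption, can never be reversed. The key and token length are illustrative:

```python
import hashlib
import hmac

SECRET_KEY = b"example-key"  # illustrative; a real key would come from a KMS

def tokenize(value: str) -> str:
    """Cryptographic hashing: a deterministic, irreversible pseudonym."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

# Same input, same token — joins across tables still work on the tokens:
print(tokenize("alice@example.com") == tokenize("alice@example.com"))  # True
```

Keying the hash (HMAC) rather than hashing bare values matters: an unkeyed hash of low-entropy data like email addresses can be reversed by brute-force guessing.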

Glossary:

  • Greenfield projects: Projects that aren't constrained by legacy/prior work
  • Brownfield projects: Projects that must take into account and co-exist with the existing systems they replace