Kafka

  • Uses append-only logs to store messages
  • Logs are stored in log directories named using both the topic and partition
  • Each message can have a key that Kafka uses to distribute messages to partitions.
    • A key doesn't have to be unique
    • If the key is null, Kafka uses round-robin distribution
    • A custom partitioner can be supplied to distribute messages by some value other than the full key
  • An offset within partition is always unique
  • Messages are ordered by arrival time within a partition
    • if total message order across the topic is important, only 1 partition should be defined
  • partitions are the unit of parallelism. A partition is assigned to at most one consumer within a consumer group
    • having more partitions than consumers raises the chance that the workload is evenly distributed
  • at-least-once delivery guarantee; de-duplication is the application's responsibility
  • retention time can be set at topic level
  • only partition-level ordering guarantees.
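
The key-to-partition rules above can be sketched as a toy model (real Kafka uses murmur2 hashing inside the producer's partitioner; the names here are illustrative only):

```python
import itertools

NUM_PARTITIONS = 3

# round-robin counter used when the key is None (a toy stand-in for
# Kafka's behaviour for keyless messages)
_rr = itertools.count()

def choose_partition(key, num_partitions=NUM_PARTITIONS):
    """Pick a partition: hash the key if present, else round-robin."""
    if key is None:
        return next(_rr) % num_partitions
    # real Kafka computes murmur2(key_bytes) % num_partitions
    return hash(key) % num_partitions

# the same key always lands on the same partition -> per-key ordering
assert choose_partition("user-42") == choose_partition("user-42")
```

This is why per-key ordering holds even though there is no topic-wide ordering: all messages for one key go to one partition.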

Concepts

  • Topic is where producers write messages and consumers read messages from
    • each topic has a replication factor that determines how many replicas each partition has
  • A topic can be made up of multiple Partitions
  • Broker is a Kafka process that
    • hosts one or more partitions
    • runs on a physical machine/server/node. A machine can run more than one broker process
  • Cluster is a group of brokers that form a single Kafka cluster
  • Bootstrap servers are the servers used for the initial connection, through which all participating brokers are discovered. After discovery, all brokers can be used
  • one replica per partition acts as the leader, elected by the controller; the rest are followers
  • Leader serves all client requests.
  • ZooKeeper is used for maintaining cluster metadata, such as which broker is the leader of each partition
  • a message is exposed to clients only after it is committed, i.e. all in-sync replicas (ISR) have finished writing the message
  • the log rolls over to a new segment after the configured roll time or segment size is reached
  • An inactive segment is deleted or compacted after the configured retention time
    • Compacting retains only the most recent value for each key in a segment.
  • Producer.send() is asynchronous.
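
The asynchronous send can be mimicked with a future; a minimal sketch (`FakeProducer` is a stand-in for illustration, not the real client API):

```python
import time
from concurrent.futures import ThreadPoolExecutor

class FakeProducer:
    """Toy stand-in for a Kafka producer: send() hands the write to a
    background thread and returns a Future immediately."""
    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=1)

    def _write(self, topic, value):
        time.sleep(0.01)                     # pretend broker latency
        return {"topic": topic, "offset": 0}

    def send(self, topic, value):
        # non-blocking: the caller gets a Future, not the broker's reply
        return self._pool.submit(self._write, topic, value)

producer = FakeProducer()
future = producer.send("events", b"hello")   # returns immediately
metadata = future.result()                   # block only when needed
```

The real clients follow the same shape: batching happens on a background thread, and the caller either waits on the future or registers a callback.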

Consumers

  • read from all partitions of a topic
  • order is maintained only within a partition
  • a consumer group is multiple consumer instances running with the same group id; the group provides load balancing
    • rebalancing is automatic: Kafka reassigns partitions each time a consumer with an existing group-id joins or leaves
    • each partition is only consumed by a single consumer instance within a group
      • if there are more consumers in a group than partitions, some consumers will remain idle
      • a consumer can get assigned more than one partition if there are more partitions than consumers
  • offset management: by default consumers are set to auto-commit (interval based: auto.commit.interval.ms)
    • can cause duplicate messages if a message was already read and processed, but the consumer failed prior to the commit
    • a consumer can choose to disable auto commit and manually commit (.commitSync())
    • BP: don't commit too frequently for better performance
  • consumer records contain the fields: key, value, topic, partition, offset, timestamp, timestamp_type
    • timestamp_type: 0 => the timestamp refers to when the event was created, 1 => when the event was received by the kafka cluster
    • keys and values are received as binary by default
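
Since keys and values arrive as bytes, a consumer typically deserializes them explicitly; a sketch using a hypothetical record tuple (field names follow the list above):

```python
import json
from collections import namedtuple

# fields listed above; timestamp_type: 0 = create time, 1 = log-append time
ConsumerRecord = namedtuple(
    "ConsumerRecord",
    "key value topic partition offset timestamp timestamp_type")

def deserialize(record):
    """Decode the raw bytes into usable Python values."""
    key = record.key.decode("utf-8") if record.key else None
    value = json.loads(record.value)       # assumes JSON-encoded values
    return key, value

rec = ConsumerRecord(b"user-42", b'{"clicks": 3}', "events", 0, 17,
                     1700000000000, 0)
key, value = deserialize(rec)
assert key == "user-42" and value["clicks"] == 3
```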

Kafka Connect

  • plugins (connectors) that provide connections to other systems such as an RDBMS
  • connectors run on their own fault-tolerant cluster
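
A connector is typically configured as JSON submitted to Connect's REST API; a hypothetical JDBC source example (class and property names follow the Confluent JDBC source connector, shown purely as an illustration):

```json
{
  "name": "orders-jdbc-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://db:5432/shop",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "table.whitelist": "orders",
    "topic.prefix": "db-"
  }
}
```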

Kafka Streams

  • is a library that runs within the context of the application
  • provides functional APIs for stream processing, e.g. windowing, aggregation etc
  • manages state in internal Kafka topics for resiliency
  • two types of windows:
    • Fixed: non-overlapping windows; normally aligned to the clock (every 5 seconds etc.)
    • Sliding: windows can overlap
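
The two window types can be contrasted with a toy assignment function (timestamps in seconds; overlap is modelled here as fixed-size windows advancing by a smaller step, a simplification of what Kafka Streams does):

```python
def fixed_windows(ts, size):
    """Fixed (tumbling) windows: each timestamp falls in exactly one
    clock-aligned window."""
    start = (ts // size) * size
    return [(start, start + size)]

def sliding_windows(ts, size, advance):
    """Overlapping windows of length `size` advancing by `advance` < size;
    a single timestamp can fall into several windows."""
    windows = []
    start = (ts // advance) * advance
    while start + size > ts and start >= 0:
        windows.append((start, start + size))
        start -= advance
    return sorted(windows)

assert fixed_windows(12, 5) == [(10, 15)]          # exactly one window
assert sliding_windows(12, 10, 5) == [(5, 15), (10, 20)]  # two overlap
```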

ksqlDB

  • provides a REST API for users to submit jobs written in SQL to do stream processing
  • Runs on its own scalable, fault-tolerant cluster
  • provides integration with Kafka Connect
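
A hypothetical ksqlDB job illustrating the SQL-over-REST model (stream and column names are made up):

```sql
-- Declare a stream over an existing Kafka topic
CREATE STREAM pageviews (user_id VARCHAR, page VARCHAR)
  WITH (KAFKA_TOPIC = 'pageviews', VALUE_FORMAT = 'JSON');

-- Continuous query: count views per page, emitting updates as they happen
SELECT page, COUNT(*) AS views
FROM pageviews
GROUP BY page
EMIT CHANGES;
```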

Log Compaction

  1. is the process of retaining only the most recent value for key-based logs
  2. deletes are supported, but may be seen by readers as tombstones
    • tombstones are removed only after a certain period (delete.retention.ms)
  3. min.compaction.lag.ms can be used to guarantee a minimum duration before a message can be compacted
  4. messages are never re-ordered; partition offset for a message never changes
  5. ROT: an effective way to store data in Kafka, e.g. to achieve Event Sourcing
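
The compaction rules above (latest value per key wins, tombstones delete, offsets never change) can be simulated:

```python
def compact(segment):
    """Keep only the last (offset, key, value) entry per key; a value of
    None is a tombstone that eventually removes the key entirely."""
    latest = {}
    for offset, key, value in segment:   # offsets arrive in order
        latest[key] = (offset, value)
    # drop tombstoned keys; surviving messages keep their original
    # offsets and relative order (messages are never re-ordered)
    return sorted(
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None
    )

segment = [
    (0, "k1", "a"),
    (1, "k2", "b"),
    (2, "k1", "c"),    # supersedes offset 0
    (3, "k2", None),   # tombstone: deletes k2 after retention
]
assert compact(segment) == [(2, "k1", "c")]
```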