Kafka

  • Uses append-only logs to store messages
  • Logs are stored in log directories named using both the topic and partition
  • Each message can have a key that Kafka uses to distribute messages to partitions.
    • A key doesn't have to be unique
    • If the key is null, Kafka uses round-robin distribution
    • A custom partitioner can be supplied to distribute messages by some value other than the full key
  • An offset within partition is always unique
  • Messages are ordered by arrival time within a partition
    • if total message order across the topic is important, only 1 partition should be defined
  • partitions are the unit of parallelism. A partition is assigned to at most one consumer within a consumer group
    • having more partitions than consumers raises the chance that the workload is evenly distributed
  • at-least-once delivery guarantee; de-duplication is the application's responsibility
  • retention time can be set at topic level
  • only partition-level ordering guarantees.
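
The key-to-partition rules above can be sketched as a toy model (real Kafka uses murmur2 hashing inside the producer's partitioner; the names here are illustrative only):

```python
import itertools

NUM_PARTITIONS = 3

# round-robin counter used when the key is None (a toy stand-in for
# Kafka's behaviour for keyless messages)
_rr = itertools.count()

def choose_partition(key, num_partitions=NUM_PARTITIONS):
    """Pick a partition: hash the key if present, else round-robin."""
    if key is None:
        return next(_rr) % num_partitions
    # real Kafka computes murmur2(key_bytes) % num_partitions
    return hash(key) % num_partitions

# the same key always lands on the same partition -> per-key ordering
assert choose_partition("user-42") == choose_partition("user-42")
```

This is why per-key ordering holds even though there is no topic-wide ordering: all messages for one key go to one partition.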

Concepts

  • Topic is where producers write messages and consumers read messages from
    • each topic has a replication factor that determines how many replicas each partition has
  • A topic can be made up of multiple Partitions
  • Broker is a Kafka process that
    • hosts one or more partitions
    • runs on a physical machine/server/node. A machine can run more than one broker process
  • Cluster is a group of brokers that form a single Kafka cluster
  • Bootstrap servers are the servers used for the initial connection, through which all participating brokers are discovered. After discovery, all brokers can be used
  • one replica per partition acts as the leader, elected by the controller; the rest are followers
  • Leader serves all client requests.
  • ZooKeeper is used for maintaining cluster metadata, such as which broker is the leader of each partition
  • a message is exposed to clients only after it is committed, i.e. all in-sync replicas (ISR) have finished writing the message
  • the log rolls over to a new segment after the configured roll time or segment size is reached
  • An inactive segment is deleted or compacted after the configured retention time
    • Compacting retains only the most recent value for each key in a segment.
  • Producer.send() is asynchronous.
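
The asynchronous send can be mimicked with a future; a minimal sketch (`FakeProducer` is a stand-in for illustration, not the real client API):

```python
import time
from concurrent.futures import ThreadPoolExecutor

class FakeProducer:
    """Toy stand-in for a Kafka producer: send() hands the write to a
    background thread and returns a Future immediately."""
    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=1)

    def _write(self, topic, value):
        time.sleep(0.01)                     # pretend broker latency
        return {"topic": topic, "offset": 0}

    def send(self, topic, value):
        # non-blocking: the caller gets a Future, not the broker's reply
        return self._pool.submit(self._write, topic, value)

producer = FakeProducer()
future = producer.send("events", b"hello")   # returns immediately
metadata = future.result()                   # block only when needed
```

The real clients follow the same shape: batching happens on a background thread, and the caller either waits on the future or registers a callback.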

Consumers

  • read from all partitions of a topic
  • order is maintained only within a partition
  • a consumer group is multiple consumer instances running with the same group id; the group provides load balancing
    • rebalancing is automatic: Kafka reassigns partitions each time a consumer with an existing group-id joins or leaves
    • each partition is only consumed by a single consumer instance within a group
      • if there are more consumers in a group than partitions, some consumers will remain idle
      • a consumer can get assigned more than one partition if there are more partitions than consumers
  • offset management: by default consumers are set to auto-commit (interval based: auto.commit.interval.ms)
    • can cause duplicate messages if a message was already read and processed, but the consumer failed prior to the commit
    • a consumer can choose to disable auto commit and manually commit (.commitSync())
    • BP: don't commit too frequently for better performance
  • consumer records contain the fields: key, value, topic, partition, offset, timestamp, timestamp_type
    • timestamp_type: 0 => the timestamp refers to when the event was created, 1 => when the event was received by the kafka cluster
    • keys and values are received as binary by default
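
Since keys and values arrive as bytes, a consumer typically deserializes them explicitly; a sketch using a hypothetical record tuple (field names follow the list above):

```python
import json
from collections import namedtuple

# fields listed above; timestamp_type: 0 = create time, 1 = log-append time
ConsumerRecord = namedtuple(
    "ConsumerRecord",
    "key value topic partition offset timestamp timestamp_type")

def deserialize(record):
    """Decode the raw bytes into usable Python values."""
    key = record.key.decode("utf-8") if record.key else None
    value = json.loads(record.value)       # assumes JSON-encoded values
    return key, value

rec = ConsumerRecord(b"user-42", b'{"clicks": 3}', "events", 0, 17,
                     1700000000000, 0)
key, value = deserialize(rec)
assert key == "user-42" and value["clicks"] == 3
```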

Kafka Connect

  • plugins (connectors) that provide connections to other systems such as an RDBMS
  • connectors run on their own fault-tolerant cluster
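
A connector is typically configured as JSON submitted to Connect's REST API; a hypothetical JDBC source example (class and property names follow the Confluent JDBC source connector, shown purely as an illustration):

```json
{
  "name": "orders-jdbc-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://db:5432/shop",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "table.whitelist": "orders",
    "topic.prefix": "db-"
  }
}
```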

Kafka Streams

  • is a library that runs within the context of the application
  • provides functional APIs for stream processing, e.g. windowing, aggregation etc
  • manages state in internal Kafka topics for resiliency
  • two types of windows:
    • Fixed: non-overlapping windows; normally aligned to the clock (every 5 seconds etc.)
    • Sliding: windows can overlap
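
The two window types can be contrasted with a toy assignment function (timestamps in seconds; overlap is modelled here as fixed-size windows advancing by a smaller step, a simplification of what Kafka Streams does):

```python
def fixed_windows(ts, size):
    """Fixed (tumbling) windows: each timestamp falls in exactly one
    clock-aligned window."""
    start = (ts // size) * size
    return [(start, start + size)]

def sliding_windows(ts, size, advance):
    """Overlapping windows of length `size` advancing by `advance` < size;
    a single timestamp can fall into several windows."""
    windows = []
    start = (ts // advance) * advance
    while start + size > ts and start >= 0:
        windows.append((start, start + size))
        start -= advance
    return sorted(windows)

assert fixed_windows(12, 5) == [(10, 15)]          # exactly one window
assert sliding_windows(12, 10, 5) == [(5, 15), (10, 20)]  # two overlap
```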

ksqlDB

  • provides a REST API for users to submit jobs written in SQL to do stream processing
  • Runs on its own scalable, fault-tolerant cluster
  • provides integration with Kafka Connect
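
A hypothetical ksqlDB job illustrating the SQL-over-REST model (stream and column names are made up):

```sql
-- Declare a stream over an existing Kafka topic
CREATE STREAM pageviews (user_id VARCHAR, page VARCHAR)
  WITH (KAFKA_TOPIC = 'pageviews', VALUE_FORMAT = 'JSON');

-- Continuous query: count views per page, emitting updates as they happen
SELECT page, COUNT(*) AS views
FROM pageviews
GROUP BY page
EMIT CHANGES;
```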

Log Compaction

  1. is the process of retaining only the most recent value for key-based logs
  2. deletes are supported, but may be seen by readers as tombstones
    • tombstones are removed only after a certain period (delete.retention.ms)
  3. min.compaction.lag.ms can be used to guarantee a minimum duration before a message can be compacted
  4. messages are never re-ordered; partition offset for a message never changes
  5. ROT: an effective way to store data in Kafka, e.g. to achieve Event Sourcing
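
The compaction rules above (latest value per key wins, tombstones delete, offsets never change) can be simulated:

```python
def compact(segment):
    """Keep only the last (offset, key, value) entry per key; a value of
    None is a tombstone that eventually removes the key entirely."""
    latest = {}
    for offset, key, value in segment:   # offsets arrive in order
        latest[key] = (offset, value)
    # drop tombstoned keys; surviving messages keep their original
    # offsets and relative order (messages are never re-ordered)
    return sorted(
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None
    )

segment = [
    (0, "k1", "a"),
    (1, "k2", "b"),
    (2, "k1", "c"),    # supersedes offset 0
    (3, "k2", None),   # tombstone: deletes k2 after retention
]
assert compact(segment) == [(2, "k1", "c")]
```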