Your Kafka Lag Monitor Is Lying to You

A zero-lag graph feels safe. It isn't. Here's the quiet failure mode that will burn you at 3 a.m.

Weiting Liu · April 18, 2026 · 9 min read

For years I watched engineers celebrate green dashboards while messages quietly rotted inside transactions. The problem wasn't their monitoring stack. It was the number they chose to monitor.

Two watermarks walk into a broker

Every Kafka partition carries two markers that define what consumers can read. The high watermark (HW) is the offset up to which every in-sync replica has fully replicated the log. It is what most people picture as the 'end offset' of the partition.

The last stable offset (LSO) is something quieter. It is the highest offset below which all transactions are guaranteed to have been committed or aborted. A consumer running in `read_committed` isolation mode can only advance up to the LSO, never past it.

When there are no transactions on the topic, the LSO equals the HW. When a producer opens a transaction and sits idle — network blip, application deadlock, runaway GC pause — the LSO stops moving. The HW keeps advancing as the open transaction's own messages, and everything written after them, pile into the log. That gap is invisible to any lag monitor that uses the LSO as its end offset, and that is exactly what a `read_committed` consumer's own lag metric does.

Your consumer thinks it is current. Your dashboard agrees. Neither of them is reading the messages sitting inside the open transaction.
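
You can watch that gap directly. A minimal sketch, assuming a local broker at `localhost:9092` and a topic called `jobs`: the same `endOffsets()` call returns the HW from a `read_uncommitted` consumer and the LSO from a `read_committed` one, so the difference is the number of records fenced off behind open transactions.

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class WatermarkGap {

    // A throwaway consumer whose endOffsets() reflect the given isolation level:
    // read_uncommitted returns the high watermark, read_committed returns the LSO.
    static KafkaConsumer<byte[], byte[]> consumerWith(String isolationLevel) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");          // assumed broker address
        props.put("isolation.level", isolationLevel);
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());
        return new KafkaConsumer<>(props);
    }

    public static void main(String[] args) {
        TopicPartition tp = new TopicPartition("jobs", 0);          // assumed topic/partition
        try (KafkaConsumer<byte[], byte[]> uncommitted = consumerWith("read_uncommitted");
             KafkaConsumer<byte[], byte[]> committed = consumerWith("read_committed")) {
            long hw  = uncommitted.endOffsets(List.of(tp)).get(tp); // high watermark
            long lso = committed.endOffsets(List.of(tp)).get(tp);   // last stable offset
            // Anything above zero is a record fenced off behind an open transaction.
            System.out.printf("%s hw=%d lso=%d stuck=%d%n", tp, hw, lso, hw - lso);
        }
    }
}
```

If that `stuck` count stays above zero for more than a few seconds, a transaction is stalled on that partition, whatever the lag graph says.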

The exact failure mode

Here is the scenario I have seen unfold in production more than once:

  1. A producer opens a transaction and writes 50,000 messages.
  2. Before committing, the producer process is paused by a long GC cycle.
  3. Non-transactional producers keep writing to the same partition.
  4. Consumers in `read_committed` mode are blocked at the LSO. Their lag metric, computed against that same LSO, reads zero — all caught up.
  5. Ten minutes later, someone asks why job completions are 50,000 short.

The silence is the problem. There is no error, no timeout, no alert. The consumer is simply waiting at a fence it cannot see.
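
If you want to see the silence for yourself, a stalled transaction is easy to reproduce. A minimal sketch, again assuming a local broker and a `jobs` topic: the producer opens a transaction, writes its batch, and then never commits, which is exactly what a long GC pause or a deadlock looks like from the broker's side.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class StalledTransaction {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");          // assumed broker address
        props.put("transactional.id", "stalled-txn-demo");         // any unique transactional id
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            for (int i = 0; i < 50_000; i++) {
                producer.send(new ProducerRecord<>("jobs", "job-" + i, "payload"));
            }
            producer.flush();
            // Simulated GC pause / deadlock: the transaction stays open, the LSO freezes,
            // and read_committed consumers stop here until the broker aborts the transaction.
            Thread.sleep(Long.MAX_VALUE);
            // never reached: producer.commitTransaction();
        }
    }
}
```

While this sketch sleeps, a `read_uncommitted` watcher sees the partition's end offset jump by 50,000; a `read_committed` consumer sees nothing new at all.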

What to monitor instead

Swap the end offset. Calculate consumer lag as `HW - consumer_committed_offset`, with the end offset read from the broker, not `LSO - consumer_committed_offset`. Monitors that query offsets from the broker get this for free; the trap is any dashboard built on the consumer's own `records-lag` metric, or on `endOffsets()` called from a `read_committed` client, because both stop at the LSO and read zero while a stalled transaction holds everything back.
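
Here is one way to compute the broker-side number, sketched with the `Admin` client against a hypothetical group called `job-workers`: committed offsets come from the group coordinator, end offsets from `listOffsets`, and the difference keeps growing while a transaction is stalled.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class GroupLag {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");          // assumed broker address
        String groupId = "job-workers";                            // assumed consumer group

        try (Admin admin = Admin.create(props)) {
            // Committed offsets straight from the group coordinator.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets(groupId).partitionsToOffsetAndMetadata().get();

            // Log-end offsets from the broker (the read_uncommitted view), not the LSO
            // a read_committed consumer would report its own lag against.
            Map<TopicPartition, OffsetSpec> request = new HashMap<>();
            committed.keySet().forEach(tp -> request.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                    admin.listOffsets(request).all().get();

            committed.forEach((tp, meta) -> {
                if (meta == null) return;                          // no committed offset yet
                long lag = latest.get(tp).offset() - meta.offset();
                System.out.printf("%s lag=%d%n", tp, lag);         // keeps growing during a stall
            });
        }
    }
}
```

Run it during the reproduction above and the number climbs; run the same calculation against a `read_committed` client's `endOffsets()` and it stays flat.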

Beyond lag, add a dedicated alert for transaction duration. Any producer transaction that remains open beyond your 95th-percentile processing latency deserves a page. The broker's transaction coordinator will eventually abort whatever outlives `transaction.timeout.ms`, but 'eventually' is a long time at 3 a.m.; a client-side timer from `beginTransaction()` to commit, or a watch on the HW-LSO gap, surfaces the stall sooner.
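
A rough sketch of that client-side timer, with the threshold and the alert sink left as placeholders. Note that a full GC pause freezes this watchdog thread along with everything else, so it complements the broker-side gap check rather than replacing it; it does catch deadlocks and network stalls.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import org.apache.kafka.clients.producer.Producer;

// Hypothetical wrapper: remembers when the current transaction was opened and complains
// once it has been open longer than the threshold. The broker still enforces
// transaction.timeout.ms; this just fires earlier.
public class TransactionWatchdog {
    private final AtomicReference<Instant> openedAt = new AtomicReference<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public TransactionWatchdog(Duration threshold) {
        scheduler.scheduleAtFixedRate(() -> {
            Instant opened = openedAt.get();
            if (opened != null && Duration.between(opened, Instant.now()).compareTo(threshold) > 0) {
                // Replace with a real alert or counter; stderr is a placeholder.
                System.err.println("transaction open since " + opened + ", exceeds " + threshold);
            }
        }, 1, 1, TimeUnit.SECONDS);
    }

    public void begin(Producer<?, ?> producer) {
        producer.beginTransaction();
        openedAt.set(Instant.now());
    }

    public void commit(Producer<?, ?> producer) {
        producer.commitTransaction();
        openedAt.set(null);
    }
}
```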

A third signal worth wiring up: if you run Kafka Streams, watch state store restoration (the restore listener sketched below is the simplest hook). An instance that restarts without usable local state rebuilds it from the changelog topic, and the rebuild time is proportional to how much changelog it has to replay, which grows for as long as you were watching the wrong number.
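
A minimal sketch of that hook, registered with `setGlobalStateRestoreListener` before the Streams app starts; store and topic names come from your own topology.

```java
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.streams.processor.StateRestoreListener;

// Logs how much changelog each state store has to replay on restart. Long restores here
// are the downstream cost of the lag you did not see while the app was running.
public class RestoreLogger implements StateRestoreListener {

    @Override
    public void onRestoreStart(TopicPartition tp, String storeName,
                               long startingOffset, long endingOffset) {
        System.out.printf("restoring %s from %s: %d records to replay%n",
                storeName, tp, endingOffset - startingOffset);
    }

    @Override
    public void onBatchRestored(TopicPartition tp, String storeName,
                                long batchEndOffset, long numRestored) {
        // Hook a progress gauge here if you want live restoration metrics.
    }

    @Override
    public void onRestoreEnd(TopicPartition tp, String storeName, long totalRestored) {
        System.out.printf("finished restoring %s from %s: %d records%n", storeName, tp, totalRestored);
    }
}

// Registration, before streams.start():
//   streams.setGlobalStateRestoreListener(new RestoreLogger());
```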

Lag against the HW is quiet only when it should be. Lag against the LSO is quiet even when it shouldn't be.

The exception handler you forgot

One more failure mode that compounds badly with the above: `read_committed` consumers that hit a deserialization error and swallow the exception stop processing but never move their committed offset. Lag stays non-zero and keeps growing, against either watermark. But if the error handler is a no-op — or worse, a `log.error` followed by a `continue` — the consumer loop keeps spinning, burning CPU without making progress.

This is the poison-pill variant of the problem: not a transaction stall, but a message your consumer can see, can read the headers of, and simply cannot deserialize. Paired with a swallow-and-continue handler, it yields a consumer that looks alive to liveness probes while making no forward progress whatsoever.

The fix is deliberate: dead-letter the unparseable message, increment a counter metric with the topic and partition label, and keep the committed offset advancing. Never silently swallow and `continue`.
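
A minimal sketch of that loop, assuming a source topic `jobs` and a dead-letter topic `jobs.dlq` (both hypothetical names): the consumer catches the deserialization failure that `poll()` surfaces, records where the poison pill lives, and seeks past it so the offset keeps moving.

```java
import java.time.Duration;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.errors.RecordDeserializationException;

public class PoisonPillLoop {
    public static void run(KafkaConsumer<String, String> consumer,
                           KafkaProducer<String, String> dlqProducer) {
        consumer.subscribe(List.of("jobs"));                       // assumed source topic
        while (true) {
            try {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                records.forEach(record -> { /* real processing goes here */ });
                consumer.commitSync();
            } catch (RecordDeserializationException e) {
                TopicPartition tp = e.topicPartition();
                // Record where the poison pill lives, then step over it so the offset keeps moving.
                dlqProducer.send(new ProducerRecord<>("jobs.dlq",  // assumed dead-letter topic
                        tp.toString(), "undeserializable record at offset " + e.offset()));
                consumer.seek(tp, e.offset() + 1);
            }
        }
    }
}
```

The counter metric and any retry policy are left out for brevity; the essential part is that `seek()` moves the position past the bad record before the next poll.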

A checklist before you ship

Before a Kafka consumer goes to production, I run through four questions:

1. What isolation mode? If `read_committed`, your lag metric must use the broker's end offset, not the LSO the consumer itself reports against (see the config sketch after this list).

2. What happens on deserialization failure? If the answer is 'an exception is logged', ask what happens *next*. The offset must advance.

3. What is your transaction timeout? The default `transaction.timeout.ms` is 60 seconds. If your processing pipeline can legitimately run longer, raise it (it still has to stay under the broker's `transaction.max.timeout.ms`). If it can't, enforce the timeout explicitly and alert when it fires.

4. Where does state come from on restart? If your Streams app rebuilds from a changelog topic, your committed lag directly sets the recovery time window. Test this deliberately in staging, with real load, before you hit it at 3 a.m.
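
For items 1 and 3, the relevant knobs live in plain client configuration. A minimal sketch with placeholder broker, group, and transactional ids:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ExactlyOnceConfigs {

    // Consumer side: read_committed means both the consumer and its own lag metric stop at the LSO.
    static Properties consumerProps() {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");              // assumed broker address
        p.put("group.id", "job-workers");                          // assumed group id
        p.put("isolation.level", "read_committed");
        p.put("enable.auto.commit", "false");                      // commit only after processing succeeds
        p.put("key.deserializer", StringDeserializer.class.getName());
        p.put("value.deserializer", StringDeserializer.class.getName());
        return p;
    }

    // Producer side: the coordinator aborts any transaction open longer than this (default 60000 ms).
    static Properties producerProps() {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");
        p.put("transactional.id", "job-writer-1");                 // assumed transactional id
        p.put("transaction.timeout.ms", "120000");                 // raise only if the pipeline truly needs it
        p.put("key.serializer", StringSerializer.class.getName());
        p.put("value.serializer", StringSerializer.class.getName());
        return p;
    }
}
```

The broker-side ceiling, `transaction.max.timeout.ms`, defaults to 15 minutes; a producer that asks for more than that is rejected when it calls `initTransactions()`.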

✦ ✦ ✦

Most of the incidents I have worked on in large-scale event streaming systems were not caused by Kafka. They were caused by the assumptions engineers made about what Kafka's numbers meant. HW is an infrastructure metric. LSO is a consumer metric. The two are equal only in the absence of transactions — and the moment you adopt exactly-once semantics, you need to know which watermark you are watching.

Weiting Liu · Staff SRE · Tokyo