EM
Emmanuel Maneswa
All Posts
Event-Driven Systems

Event-Driven Architecture in Fintech

How Apache Kafka and RabbitMQ enable high-throughput, resilient financial event streams. A deep dive into consumer groups, dead-letter queues, schema evolution, and exactly-once semantics in production banking systems.

EM

Emmanuel Maneswa

Full Stack Software Engineer

1 April 2026·10 min read
#Kafka#RabbitMQ#Fintech#Microservices#Architecture

Event-Driven Architecture in Fintech

Designing a Fault-Tolerant Event-Driven Payment System (Kafka, Idempotency, and Failure Recovery)

Financial systems have always demanded reliability, consistency, and speed. For decades, the dominant model was synchronous: a client sends a request, a server processes it, and a response is returned in one blocking thread. That model served the industry when transaction volumes were modest and system boundaries were clean.

Today, the landscape is radically different. Banks process millions of transactions daily across interconnected systems: Core Banking Systems (CBS), Payment Rails such as Real-Time Gross Settlement (RTGS) and the Society for Worldwide Interbank Financial Telecommunication (SWIFT), Anti-Money Laundering (AML) and Know Your Customer (KYC) platforms, regulatory reporting engines, mobile applications, and third-party APIs. Attempting to orchestrate all of this synchronously is a recipe for cascading failures, brittle integrations, and unacceptable latency.

Event-Driven Architecture: Overview

LayerComponentsRole
ProducersMobile App, Web App, API GatewayPublish events in response to user actions
Event BusKafka / RabbitMQDurable, ordered storage and delivery of events
ConsumersValidation, CBS Debit, RTGS Gateway, Notifications, AuditEach subscribes independently; scales independently

Each consumer is independently deployable. A failed consumer does not affect others.


The Problem: Synchronous Systems at Scale

Consider a payment processing pipeline implemented synchronously. The API Gateway calls Validation, which calls CBS Debit, which calls the RTGS Gateway, which calls the Notification Service. Each call blocks the thread until a response returns.

This design has three fundamental failure modes:

Thread blocking. Every service in the chain holds an open connection and a blocked thread while it waits downstream. At high load, threads exhaust before the RTGS Gateway even responds.

Failure cascade. If the RTGS Gateway is slow (even momentarily) the entire chain backs up. Validation Service cannot free its thread. The API Gateway cannot free its thread. The client gets a 504. A single downstream hiccup brings the entire pipeline down.

No retry strategy. When a synchronous call fails, the caller receives an exception. Recovery logic is ad hoc, inconsistent, and often absent. There is no durable record of what was attempted and what failed.

The deeper problem is tight coupling. Every service must know every other service's address, contract, and availability. A schema change in CBS Debit requires simultaneous changes across every upstream caller. Independent deployment is impossible.


The Event-Driven Upgrade

Event-Driven Architecture (EDA) breaks this model. Instead of services calling each other directly, they publish and consume events: immutable facts that have already occurred. This shift from imperative thinking ("do this now") to declarative thinking ("this happened") fundamentally changes the failure and scalability properties of the system.

Three decisions drive this architecture:

Why Kafka? Kafka is a distributed, partitioned, append-only log. Events are written to disk and retained. Any consumer can read from any offset at any time. This means a consumer that crashed for 30 minutes can catch up without data loss which is impossible with synchronous RPC.

Why async? The API Gateway publishes payment.submitted and immediately returns 202 Accepted to the client. The client is not blocked waiting for RTGS. The RTGS gateway can be slow, restarted, or upgraded without affecting the client response time.

Why decoupling? Validation Service does not know CBS Debit exists. CBS Debit does not know RTGS Gateway exists. Each service only knows about the Kafka topic it subscribes to. You can replace the RTGS Gateway entirely without touching Validation. You can add a new downstream consumer (a fraud detection service, a reporting engine) without modifying any existing service.

The structural difference is stark. In the synchronous model, one slow service blocks the chain. In the event-driven model, Kafka decouples every service from every other. The Audit Log subscribes to all events independently and its failure is completely invisible to the payment pipeline.


What Is an Event?

In the context of financial systems, an event is an immutable record of something that happened. Consider these examples from a live payment system:

  • payment.initiated: a customer triggered a transfer
  • transaction.cleared: a payment cleared through the RTGS network
  • account.credited: a beneficiary account was credited
  • aml.alert.raised: the AML engine flagged a suspicious pattern

Each event carries three fundamental properties:

  1. Immutability: once published, an event cannot be changed or retracted.
  2. Ordering: events reflect a time-ordered sequence of facts about the system.
  3. Durability: events are persisted to disk and are replayable at any point in time.

This is fundamentally different from a method call or an HTTP request, both of which are ephemeral and produce side effects that cannot easily be undone.


The Core Components

Message Broker

The message broker is the backbone of any event-driven system. It receives events from producers, stores them durably, and delivers them to consumers. The two dominant technologies in fintech are Apache Kafka and RabbitMQ, each suited to distinct use cases.

Apache Kafka

Kafka is a distributed log: a partitioned, replicated, append-only store of events. Its architecture makes it ideal for high-throughput, ordered, replayable streams.

  • Topics: named channels for event streams, for example payments.rtgs.outbound.
  • Partitions: the unit of parallelism. Events within a single partition are strictly ordered.
  • Consumer Groups: multiple consumers can read the same topic independently, each tracking its own offset.
  • Retention: events are retained for a configurable period, ranging from hours to indefinitely.
  • Log Compaction: keeps only the latest record per key, useful for maintaining state snapshots.

In a core banking context, Kafka excels at high-volume transaction event streaming, audit log ingestion, real-time AML and fraud detection pipelines, and cross-system reconciliation feeds.

RabbitMQ

RabbitMQ is a traditional message broker implementing the Advanced Message Queuing Protocol (AMQP). Its model is queue-centric with richer routing semantics.

  • Exchanges: route messages to queues via binding rules (direct, topic, fanout, or headers).
  • Queues: message buffers consumed by one or more workers.
  • Acknowledgements: a consumer explicitly acknowledges (ACK) a message after successful processing. Unacknowledged messages are requeued automatically.
  • Dead Letter Queues (DLQ): failed or unprocessable messages are routed to a DLQ for investigation and replay.

RabbitMQ is best suited for task dispatch and worker queues, request-reply patterns over messaging, complex routing logic, and lower-throughput but high-reliability workflows.

Kafka vs RabbitMQ: Quick Reference

PropertyKafkaRabbitMQ
ModelDistributed log (partitioned, append-only)Message queue (AMQP, push-based)
ThroughputMillions of messages/secTens of thousands/sec
RetentionConfigurable (days to indefinitely)Until consumed or expired
ReplayYes: seek to any offsetNo: deleted on ACK
RoutingTopic + partition keyExchange bindings (direct/topic/fanout)
Best forHigh-volume streams, audit logs, AML pipelinesTask queues, RPC, complex routing

A Real-World Pattern: RTGS Payment Pipeline

The flow below shows a simplified end-to-end event sequence for an outbound RTGS payment. Each service only knows about the events it subscribes to and nothing else.

StepServiceConsumesProduces
1API GatewayCustomer HTTP requestpayment.submitted
2Validation Servicepayment.submittedpayment.validated or payment.rejected
3CBS Debit Servicepayment.validatedaccount.debited
4RTGS Gatewayaccount.debitedrtgs.submitted
5Notification Servicertgs.submitted, rtgs.settled, or rtgs.rejectedSMS or push notification
6Audit ServiceAll eventsImmutable audit log entry

Each service is independently deployable, independently scalable, and fully decoupled from all others. The failure of the notification service does not affect payment processing. The RTGS gateway can be replaced entirely without touching the validation service.


Payment Flow: End-to-End Sequence

The RTGS pipeline diagram shows what happens. The sequence diagram below shows when it happens and crucially, what the client experiences while it does.

The key insight is in step 3: the API Gateway returns 202 Accepted to the client immediately after publishing payment.submitted to Kafka. The client's HTTP connection is closed. From this point forward, all processing is asynchronous.

Validation, CBS Debit, and RTGS Gateway each consume and produce events on their own schedules. If RTGS takes 8 seconds, the client does not wait 8 seconds. If CBS Debit restarts mid-processing, Kafka holds the unconsumed event and delivers it when CBS comes back online.

This is the fundamental shift: the client's latency is decoupled from the system's processing latency.


The Hard Problems

Asynchronous systems introduce problems that synchronous systems avoid by construction. Ignoring them produces subtle, expensive bugs in production financial systems.

Idempotency

Kafka guarantees at-least-once delivery. In failure scenarios (consumer crash, network partition, rebalance) the same message may be delivered twice. For a payment system, processing a transaction twice means debiting an account twice. This is a critical correctness failure.

The solution is idempotent consumers using a persistent idempotency key (typically the event's unique paymentId):

  1. Before processing, query whether paymentId already exists in the payments store as PROCESSED.
  2. If yes: return immediately. This is a safe no-op.
  3. If no: begin a database transaction, apply all side effects (debit, credit, state update), mark the record as PROCESSED, and commit atomically.

The PROCESSED record is the idempotency proof. Any future duplicate delivery hits step 2 and exits cleanly.

Here is how this pattern is implemented across major backend stacks:

Idempotent Consumer: Prevent Duplicate Payments

✓ Production-ready pattern
1// C#: Confluent.Kafka + EF Core - idempotent consumer
2public async Task ProcessPaymentAsync(PaymentSubmittedEvent evt, CancellationToken ct)
3{
4    // 1. Idempotency guard, check BEFORE any side effect
5    var alreadyProcessed = await _db.Payments
6        .AnyAsync(p => p.PaymentId == evt.PaymentId
7                    && p.Status == PaymentStatus.Processed, ct);
8    if (alreadyProcessed) return; // safe no-op on duplicate delivery
9
10    // 2. Atomic transaction wraps ALL writes
11    await using var tx = await _db.Database.BeginTransactionAsync(ct);
12    try
13    {
14        var account = await _db.Accounts
15            .Where(a => a.Id == evt.AccountId)
16            .FirstOrDefaultAsync(ct)
17            ?? throw new AccountNotFoundException(evt.AccountId);
18
19        account.Debit(evt.Amount);
20
21        // 3. Write idempotency proof inside the same transaction
22        _db.Payments.Add(new Payment
23        {
24            PaymentId   = evt.PaymentId,
25            Status      = PaymentStatus.Processed,
26            ProcessedAt = DateTimeOffset.UtcNow,
27        });
28
29        await _db.SaveChangesAsync(ct);
30        await tx.CommitAsync(ct); // atomic: debit + PROCESSED together
31
32        // 4. Publish ONLY after successful DB commit
33        await _producer.ProduceAsync("account.debited", new AccountDebitedEvent
34        {
35            PaymentId     = evt.PaymentId,
36            Amount        = evt.Amount,
37            CorrelationId = evt.CorrelationId,
38        }, ct);
39    }
40    catch
41    {
42        await tx.RollbackAsync(ct);
43        throw; // rethrow, Kafka will retry this message
44    }
45}

Message Ordering

Kafka guarantees ordering within a partition but not across partitions. If two events for the same account land on different partitions, they may be processed out of order producing an incorrect balance.

The fix is deterministic partition assignment: always partition by account identifier. The Kafka producer computes hash(accountId) % numPartitions to select a partition. All events for a given account are always written to the same partition and therefore processed in strict sequence by the same consumer instance.


Consumer Groups and Partition Strategy

Kafka's consumer group model is elegant: multiple instances of a service form a consumer group, and Kafka assigns each partition to exactly one consumer in the group at a time. This design delivers three guarantees:

  • Parallelism: scale processing by adding consumers, up to the number of partitions available.
  • Ordering: events within a partition are always processed sequentially.
  • Exclusivity: no two consumers in the same group process the same partition simultaneously.

For payment systems, partition by account identifier or payment corridor so that all events for a given account are always routed to the same consumer instance.

PartitionAccount RangeAssigned Consumer
000000–33333Consumer A
133334–66666Consumer B
266667–99999Consumer C

This assignment guarantees sequential processing for any given account, eliminating race conditions on balance updates. Here is how to enforce the partition key at the producer level across common stacks:

Producer: Deterministic Partition Key by Account ID

✓ Production-ready pattern
1// C#: Confluent.Kafka - accountId as partition key
2using var producer = new ProducerBuilder<string, string>(
3    new ProducerConfig { BootstrapServers = "kafka:9092" }
4).Build();
5
6await producer.ProduceAsync(
7    "payment.submitted",
8    new Message<string, string>
9    {
10        // accountId as key: murmur2 hash → same partition for same account
11        Key   = payment.AccountId.ToString(),
12        Value = JsonSerializer.Serialize(new PaymentSubmittedEvent
13        {
14            PaymentId     = payment.Id,
15            AccountId     = payment.AccountId,
16            Amount        = payment.Amount,
17            CorrelationId = correlationId,
18        }),
19    }
20);

Designing for Exactly-Once Semantics

In financial systems, processing a payment twice or missing it entirely is unacceptable. Delivery guarantees are therefore one of the most critical architectural decisions.

GuaranteeDescriptionRecommended Use
At-most-onceMessage may be lost; never duplicatedMetrics, non-critical telemetry
At-least-onceMessage delivered one or more times; duplicates possibleMost financial events (pair with idempotency)
Exactly-onceMessage processed exactly once, guaranteedMonetary transactions

Why True Exactly-Once Is Harder Than It Sounds

Kafka's Transactions API (available since version 0.11) provides exactly-once semantics within Kafka itself meaning a message is written to a topic exactly once even if the producer retries. However, this only covers the Kafka-to-Kafka leg. The moment you write to a database, call an external API, or send an email, you step outside Kafka's transactional boundary.

The two-phase challenge: your database write and your Kafka offset commit are two separate operations. They can fail independently:

  1. Write succeeds, commit fails: the consumer crashes after writing to the DB but before committing the offset. Kafka redelivers the message. Without idempotency, the write runs again (double debit).
  2. Commit succeeds, write fails: the offset advances, but the DB write never completed. The message is silently skipped leading to missing transaction and reconciliation nightmare.

The Practical Solution: At-Least-Once Plus Idempotency

The most robust production pattern combines three techniques:

Step 1: Idempotency key check. Before any side effect, query whether this paymentId has been marked PROCESSED. Return immediately if it has.

Step 2: Atomic DB transaction. Wrap all writes (debit, credit, state update, idempotency record) in a single database transaction. Either all writes commit or none do.

Step 3: Transactional Outbox (for Kafka publish). Instead of publishing to Kafka inside the DB transaction (impossible for two different systems), write the outgoing event to an outbox table within the same DB transaction. A background relay reads the outbox and publishes to Kafka, deleting the row on confirmed delivery. This ensures the Kafka publish and the DB commit are never split across a crash.

BEGIN TRANSACTION
  UPDATE accounts SET balance = balance - 100 WHERE id = 'acc-123'
  INSERT INTO payments (payment_id, status) VALUES ('pay-xyz', 'PROCESSED')
  INSERT INTO outbox (topic, payload) VALUES ('account.debited', '{"paymentId":"pay-xyz",...}')
COMMIT                     <- all three writes are atomic
-> Outbox relay publishes to Kafka and deletes the row

The PROCESSED record and the outbox entry are atomic. A crash anywhere produces a safe, retryable state. This is the pattern that large-scale payment systems use in production.


Failure Recovery: Dead Letter Queues

Failures are inevitable in distributed systems. The question is not whether a message will fail to process, but whether you have a safe, observable place to put it when it does.

Even well-designed systems encounter poison messages: events that cannot be processed due to invalid data, downstream unavailability, or application bugs. Without a safety valve, a single bad message can halt an entire consumer indefinitely.

The standard pattern is retry with exponential back-off, followed by routing to a Dead Letter Queue.

AttemptDelayAction
1ImmediateFirst processing attempt
21 secondRetry
35 secondsRetry
430 secondsFinal retry
After all retriesall exhaustedRouted to payments.rtgs.inbound.dlq

In RabbitMQ, configure this via the x-dead-letter-exchange property on the queue. In Kafka, implement retry topics and a final DLQ topic in application code.

Treat every DLQ message as a Priority 1 (P1) incident. The DLQ is your audit and recovery mechanism. Operations teams inspect failed events, resolve the underlying cause, and replay them without data loss. A message in a financial system's DLQ represents a transaction that has not settled.

Here is how to implement the retry-then-DLQ pattern across backend stacks:

DLQ Handler: Retry with Exponential Back-off

✓ Production-ready pattern
1// C#: Confluent.Kafka - bounded retry + DLQ routing
2public class ResilientPaymentConsumer
3{
4    private static readonly int[] RetryDelaysMs = { 1_000, 5_000, 30_000 };
5
6    public async Task ConsumeAsync(ConsumeResult<string, string> result, CancellationToken ct)
7    {
8        var attempts = 0;
9        while (true)
10        {
11            try
12            {
13                await ProcessAsync(result.Message.Value, ct);
14                _consumer.Commit(result); // manual commit on success
15                return;
16            }
17            catch (Exception ex) when (attempts < RetryDelaysMs.Length)
18            {
19                _logger.LogWarning(ex, "Attempt {N} failed for {Key}", attempts + 1, result.Message.Key);
20                await Task.Delay(RetryDelaysMs[attempts++], ct);
21            }
22            catch (Exception ex)
23            {
24                _logger.LogError(ex, "All retries exhausted for {Key}", result.Message.Key);
25                await RouteToDlqAsync(result, ex.Message, ct);
26                _consumer.Commit(result); // advance offset, message is in DLQ
27                return;
28            }
29        }
30    }
31
32    private async Task RouteToDlqAsync(
33        ConsumeResult<string, string> result, string reason, CancellationToken ct)
34    {
35        await _dlqProducer.ProduceAsync(
36            "payments.rtgs.inbound.dlq",
37            new Message<string, string>
38            {
39                Key   = result.Message.Key,
40                Value = result.Message.Value,
41                Headers = new Headers
42                {
43                    { "error.reason",   Encoding.UTF8.GetBytes(reason) },
44                    { "original.topic", Encoding.UTF8.GetBytes(result.Topic) },
45                    { "failed.at",      Encoding.UTF8.GetBytes(DateTimeOffset.UtcNow.ToString("O")) },
46                }
47            },
48            ct
49        );
50        _alertService.TriggerP1Alert($"DLQ: {result.Message.Key}: {reason}");
51    }
52}

Compliance and AML Screening

Every payment event published to the message bus must first pass compliance screening. This is a legal obligation: regulatory frameworks across all jurisdictions require that both the sender and the recipient are screened against sanctions lists before a transfer is initiated.

Why Both Sides Must Be Screened

The compliance obligation is bilateral, covering three dimensions:

  • Sender Screening: verify the originator is not listed on sanctions lists, including the Office of Foreign Assets Control (OFAC), the United Nations (UN) Consolidated List, and local regulators. Also verify the originator is not a Politically Exposed Person (PEP).
  • Recipient Screening: verify the beneficiary is cleared. A payment to a sanctioned entity is a compliance violation even if the sender is clean.
  • Corridor Rules: the payment corridor, for example ZWE-ZAF for Zimbabwe to South Africa, may trigger additional jurisdiction-specific checks beyond the global sanctions lists.

Critical constraint: screening must happen before payment.initiated is published. Publishing first and screening later creates a complex reversal scenario once the event has already been consumed by downstream services.

AML Compliance Screening: Decision Flow

StageCheckClear outcomeFail outcome
1. IdempotencyPayment already processed?ContinueReturn early (no-op)
2. Sender screenOFAC SDN + jurisdiction list (e.g., RBZ)LOW risk: proceedHIGH risk: block, raise alert
3. Recipient screenOFAC SDN + UN Consolidated ListLOW risk: proceedMEDIUM+: escalate to compliance
4. PublishAll checks passedPublish `payment.initiated`—

Screening runs before the event is published. Never skip screening if the AML API is unavailable, fail the payment.

AML Screening Pattern

Regardless of language or framework, the pre-publish sequence is always the same:

  1. Idempotency Check: look up the idempotency key. If this payment was already processed, return early. This is a safe no-op.
  2. Sender Screening: screen the originator against the OFAC Specially Designated Nationals (SDN) List and the relevant jurisdiction list, for example the Reserve Bank of Zimbabwe. On failure: block the payment and raise aml.alert.raised.
  3. Recipient Screening: screen the beneficiary against the OFAC SDN List and the UN Consolidated List. On failure: block the payment or escalate to the compliance queue.
  4. Publish Event: publish payment.initiated with correlationId and corridorCode attached.

Key implementation rules:

  • Never skip screening if the AML API is unavailable. Fail the payment instead, using a circuit breaker pattern.
  • Always include corridorCode in the event, for example ZWE-ZAF, so downstream services apply the correct jurisdiction rules.
  • Retain all screening results for audit: store the risk score, the matched list name, and the timestamp of each check.
  • Instrument AML latency with OpenTelemetry spans. Alert on p99 latency exceeding 2 seconds to prevent cascading RTGS delays.

Regional Compliance Frameworks

Regional Compliance Frameworks

🇿🇼

Zimbabwe

RBZ - Reserve Bank of Zimbabwe

  • •Money Laundering and Proceeds of Crime Act (Chapter 9:24): primary AML legislation.
  • •RBZ AML Directives mandate screening for all cross-border payments (RTGS, SWIFT).
  • •Both sender (ZWE account) and recipient (ZAF bank) must be screened against the RBZ sanctions list and World-Check.
  • •Corridor example: ZWE-ZAF Harare to Johannesburg payments require RBZ + OFAC screening.
  • •AML check must complete before publishing the payment.initiated event.
🇪🇺

European Union

EBA - European Banking Authority

  • •6th Anti-Money Laundering Directive (AMLD6) expanded predicate offences, stricter liability.
  • •EBA Risk-Based Supervision Guidelines: proportionate controls based on customer/transaction risk profile.
  • •FATF Recommendation 16 (Travel Rule): originator and beneficiary info required for transfers ≥ €1,000.
  • •PSD2: Strong Customer Authentication (SCA) for payment initiation.
🇺🇸

United States

OFAC + FinCEN

  • •OFAC SDN List: automatic sanctions screening; transactions with listed entities are prohibited.
  • •FinCEN AML Program requirements under the Bank Secrecy Act (BSA).
  • •FATF Travel Rule applies to transfers ≥ $3,000, sender/recipient info must accompany the payment.
  • •SAR (Suspicious Activity Report) filing obligation within 30 days of detection.
🇸🇬

Singapore

MAS - Monetary Authority of Singapore

  • •MAS Notice 626: AML/CFT requirements for banks; risk-based customer due diligence.
  • •FATF member obligations: Singapore aligns with all FATF recommendations.
  • •Transfers ≥ SGD 1,500 require full originator information transmitted with the payment.
  • •Enhanced due diligence for correspondent banking relationships.

Industry Standards

ISO 20022

Global Standard

Structured message format for AML data in payment messages. pacs.008 (credit transfers) and pacs.009 (financial institution transfers) carry structured originator/beneficiary fields enabling automated AML enrichment. Foundation for SWIFT MX and SEPA integrations.

FATF Recommendation 16

Travel Rule

Originator and beneficiary information must travel with the payment, be available to the receiving institution, and be retained for five years. Thresholds vary by jurisdiction: $3,000 USD, €1,000 EUR, SGD 1,500. Design your payment.initiated event schema to include originatorInfo and beneficiaryInfo fields from day one.

SWIFT gpi

Global Payments Innovation

End-to-end payment tracking with a Unique End-to-End Transaction Reference (UETR). Every correspondent bank in the chain updates the gpi tracker. From an EDA perspective, gpi status updates are themselves events so model them as swift.gpi.updated with the UETR as the partition key.

OpenTelemetry

Observability

Instrument AML API calls with OpenTelemetry spans to trace screening latency, correlate events across the compliance pipeline, and alert on p99 latency exceeding 2s. Slow screening means blocked payments in RTGS systems this can cascade into settlement failures. Attach payment.id, jurisdiction, and risk_score as span attributes.


Schema Evolution and Schema Registry

As systems evolve, event schemas change. A new field added to payment.initiated must not break existing consumers still running the previous version. This is managed through a Schema Registry; the Confluent Schema Registry is the standard for Kafka deployments.

Three compatibility modes matter in production:

  1. Forward Compatibility: new producers, old consumers. New fields are silently ignored by older consumers.
  2. Backward Compatibility: old producers, new consumers. New fields carry sensible defaults so new consumers still function.
  3. Full Compatibility: both directions simultaneously. This is the safest mode for production financial systems.

Avro vs Protobuf: Which to Choose

Both formats are strongly typed and binary-efficient, but they make different trade-offs:

DimensionApache AvroProtocol Buffers (Protobuf)
Schema storageEmbedded in message (or Schema Registry)Generated code only; schema in .proto files
Field evolutionAdd with default or remove optional fieldsAdd new fields; never reuse field numbers
Language supportJava, Python, C#, Go (first-class)All major languages (first-class support everywhere)
Wire sizeCompact; schema overhead unless using registryVery compact; field numbers not names on wire
ToolingConfluent Schema Registry nativeAny gRPC/Protobuf toolchain
Best forKafka-first event systems with Schema RegistryCross-team APIs, gRPC services, polyglot teams

The golden rule for both: never remove a field and never change a field's type. Breaking these rules corrupts consumers silently.

Safe Schema Evolution Example

The table below illustrates a safe schema evolution for payment.initiated across two versions:

FieldVersion 1Version 2
paymentIduuid, requireduuid, required (unchanged)
amountdecimal, requireddecimal, required (unchanged)
currencystring, requiredstring, required (unchanged)
corridorCodeabsentstring, nullable, default null

Version 2 adds corridorCode as nullable with a default. Old consumers reading Version 2 messages simply receive null for this field, they do not crash. New consumers reading Version 1 messages also receive null which is backward compatible. This is Full Compatibility mode.


Observability in Event-Driven Systems

Distributed event flows are harder to trace than synchronous call chains. Investing in observability is not optional: it is the only way to diagnose and resolve production incidents in an event-driven system.

  1. Correlation Identifiers: attach a correlationId to every event and propagate it through the entire flow. This enables end-to-end tracing of a single payment across multiple services and log aggregators.

  2. Distributed Tracing: use OpenTelemetry with Jaeger or Zipkin to visualise event flows. Instrument AML screening calls as traced spans carrying payment.id, jurisdiction, and risk_score attributes.

  3. Consumer Lag Monitoring: Kafka consumer lag is the distance between the latest published event and the consumer's current read offset. Sustained lag signals slowness or downstream failure. Alert on it immediately.

  4. DLQ Monitoring: alert immediately when messages enter the DLQ. A DLQ entry in a financial system represents an unresolved transaction. Treat it as Priority 1 (P1).


Production Debugging Nightmares

Async systems fail in ways that synchronous systems cannot. Understanding the failure modes before they hit production is the difference between a 10-minute investigation and a 3-hour incident.

The Missing correlationId

The nightmare: A payment is reported as missing by a customer. You search the logs. The API Gateway log shows the request. Kafka shows the event was published. The CBS Debit service log shows nothing so you cannot find the payment because every log line uses a different internal ID. The correlationId was never propagated beyond the first service.

The fix: Treat the correlationId as a first-class citizen. Every event schema must include it as a required field. Every service that consumes an event must pass the correlationId forward to every event it produces. Every log line must include it. If a service does not have a correlationId, generate one and emit a warning, do not silently drop it.

// Bad: no correlationId means untraceable
{ "paymentId": "pay-123", "amount": 1000 }

// Good: correlationId threads the full flow
{ "paymentId": "pay-123", "amount": 1000, "correlationId": "corr-abc-789" }

Consumer Lag Accumulation

The nightmare: It is 14:30 on a Friday. Your consumer lag alert fires and the CBS Debit is 80,000 messages behind. Nobody knows when it started. Nobody knows which partitions are affected. The on-call engineer spends 40 minutes reading Kafka metrics before realizing the CBS database connection pool was exhausted at 09:00 and the service has been silently backing up ever since.

The fix: Alert on lag early (at 1,000 messages, not 80,000). Include the partition number and consumer group ID in every lag alert. Build a lag-over-time dashboard so you can see when the divergence started. Most importantly, ensure your consumer emits a heartbeat metric which means if you are not seeing processing, the service may have stalled silently.

Auto-Commit Offset Disaster

The nightmare: A Python consumer is configured with enable.auto.commit=true. It reads a batch of 50 messages, the Kafka client auto-commits the offsets at the next poll interval, and then the processing loop crashes on message 23. Messages 23–50 are now silently skipped. The offsets are committed. Kafka will never redeliver them. 27 transactions are missing.

The fix: Always disable auto-commit. Commit the offset manually, after processing is confirmed complete. This is the single most dangerous default in every Kafka client library.

# DANGER: offset committed before processing succeeds
consumer = Consumer({"enable.auto.commit": True})

# CORRECT: commit only after successful processing
consumer = Consumer({"enable.auto.commit": False})
# ... process message ...
consumer.commit(message=msg)  # only now

The Poison Pill That Halts the Queue

The nightmare: A malformed event enters the queue, perhaps a schema migration that was not backward compatible. The consumer fails, retries three times, and then... does nothing. The consumer is configured to pause on error. The partition is stuck. No other messages on that partition can move forward. Hours pass before anyone notices.

The fix: Implement a bounded retry count with DLQ routing (see the DLQ section above). A consumer must never pause indefinitely on a single message. Route unprocessable messages to the DLQ, advance the offset, and alert. Process the rest of the queue.


Failure Stories: When It Goes Wrong

These are patterns from real-world production systems. The details are anonymised but the failure modes are exact.

The Duplicate Transaction

Setting: A live RTGS integration. The RTGS gateway is an external system with a 30-second timeout. Under high load, the gateway accepts a payment but takes 32 seconds to respond, exceeding the consumer's timeout. The consumer throws a TimeoutException. The message is retried. The RTGS gateway processes the payment again.

Outcome: A customer's account is debited twice for the same payment. The first debit went through at second 32. The retry went through at second 64. The customer complains. Reconciliation begins. The investigation takes 6 hours. The refund takes 48 hours.

Root cause: No idempotency key was sent to the RTGS gateway. The gateway had no way to know it had already processed the payment.

The fix: Every request to the RTGS gateway must include a paymentId that the gateway deduplicates on. If the gateway does not support idempotency keys natively, maintain a local cache of submitted IDs with a TTL longer than the maximum retry window.

The Silent Missing Credit

Setting: A mobile banking platform. The notification service consumes account.credited events and sends a push notification. A developer, trying to reduce noise, sets enable.auto.commit=true and processes notifications in a background thread. One night, the process is OOM-killed mid-batch. Kafka has already committed the offsets. 43 customers never receive their credit notifications.

The worse problem: Because the notification service also triggers the internal balance refresh, those 43 customers see stale balances in the app until the next full sync runs approximately 6 hours later. They believe their credits have not arrived.

Root cause: Auto-commit decoupled the offset advance from actual processing success. This was indistinguishable from "processed" to Kafka but actually meant "read but not acted on."

The fix: Manual commit. Always. After every processing step that has external side effects. And decouple the balance refresh from the notification service so that balance state should come directly from a CBS event, not from a downstream notification trigger.

The Reconciliation Nightmare

Setting: Three teams working on a payments platform independently. The transaction team publishes transaction.amount. The AML team expects payment.amount. The CBS team stores debit.value. No Schema Registry. No shared event contract. Three months pass.

Outcome: End-of-month reconciliation runs. The total in the transaction log does not match the total in the AML screening log. It does not match the CBS ledger either. Nobody can tell which system has the correct figure because all three computed their totals from different field names with slightly different rounding rules. The reconciliation team spends 2 weeks manually correlating records.

Root cause: Loose schema governance. Each team evolved their event schema independently without a shared registry or compatibility checks. Small naming differences compounded over months of data.

The fix: A Schema Registry is not optional in multi-team environments. Every event schema must be registered, versioned, and compatibility-checked before deployment. A consumer that reads an unknown field must emit a warning, not silently ignore it.


Tradeoffs: The Honest Accounting

No architecture is free. Event-driven systems trade one set of problems for another. The table below is a starting point, but the deeper truth is in the nuance below it.

DimensionSynchronousEvent-Driven
LatencyLow for single calls, degrades under loadConstant: caller decoupled from processing time
ThroughputLimited by slowest synchronous callHigh: consumers scale independently
ResilienceOne failure cascadesIsolated: a failed consumer does not affect producers
CouplingTight: services know each other's contractsLoose: services know only the event schema
ConsistencyStrong: immediateEventual: consumers may lag
DebuggingSimple: request-response stack tracesComplex: traces span multiple services and offsets
Operational overheadLowHigh: Kafka cluster, Schema Registry, DLQ monitoring
TestingStraightforward integration testsHarder: must simulate event ordering, retries, and DLQs

Latency: It Is Complicated

"Event-driven is faster" is a dangerous oversimplification. The client-perceived latency is lower because the API Gateway returns 202 Accepted immediately. But the end-to-end processing latency (the time from payment submission to RTGS settlement) is often higher in an event-driven system because every service hop involves a write to Kafka, a consumer poll interval, and a network round-trip.

For use cases where a user needs to see the result immediately (a balance check before allowing a debit, for example), eventual consistency is not a tradeoff, it is a disqualifier. Synchronous RPC remains the right choice there.

Eventual Consistency Is Not "Eventually Correct"

Eventual consistency means different services will see different states of the same data at the same point in time. For most read operations in a banking app, this is acceptable, a balance display that is 500 milliseconds stale does not cause harm. For write operations with dependencies ("can I overdraft?") this is dangerous.

The pattern that works: read your own writes from the source of truth (the CBS ledger via a synchronous query) for any decision that has monetary consequences. Event-driven messaging is for propagating state changes, not for making credit decisions.

Debugging: The Invisible Call Stack

In a synchronous system, a failing request produces a stack trace in one place. In an event-driven system, a failing payment may involve log entries in six different services, written at different times, on different machines, with different timestamps. Without a correlationId threaded through every log line and every event, an investigation starts with no thread to pull.

The investment required: distributed tracing (OpenTelemetry + Jaeger/Zipkin), structured JSON logging with correlationId in every line, and Kafka consumer lag dashboards. These are not optional additions. They are the price of admission for operating an event-driven system responsibly.

The Operational Bar Is High

Running Kafka in production requires operational expertise that is not trivial to acquire. Consumer group rebalances, partition reassignments, broker leader elections, topic retention management, Schema Registry upgrades are each a potential incident if not handled correctly. Teams that underestimate this cost routinely find that their Kafka cluster becomes the single most fragile part of their infrastructure.

The core tradeoff: event-driven systems are harder to operate and harder to reason about locally, but they are dramatically more resilient and scalable under real-world load. In financial systems that process millions of transactions across dozens of services, the operational overhead is justified. In a simple CRUD application with two services, it is not.


Notifications and Industry Standards

Notification Patterns for Compliance Teams

Critical compliance events require immediate human notification, not just a log line. A payment blocked by AML screening, a DLQ message awaiting remediation, or a schema failure must reach operations teams without delay.

Standard notification patterns in financial event systems:

  • Webhook Callbacks: compliance platforms push screening results via webhook. Your consumer acknowledges the result and proceeds or blocks accordingly.
  • SMS and Email Alerts: AWS Simple Notification Service (SNS) or Twilio for Priority 0 (P0) DLQ and AML match notifications.
  • Slack and Microsoft Teams Integration: real-time dashboards surfacing consumer lag, DLQ depth, and AML flagging rates.

Key Industry Standards

StandardScopeRelevance to EDA
ISO 20022Payment message formatpacs.008 and pacs.009 carry structured originator and beneficiary fields. Mirror these in your event schema for automated AML enrichment.
FATF Recommendation 16 (Travel Rule)Wire transfer dataOriginator and beneficiary information must travel with the payment and be retained for five years. Include originatorInfo and beneficiaryInfo in payment.initiated from the start.
SWIFT gpiCross-border payment trackingEvery correspondent bank updates the gpi tracker with a Unique End-to-End Transaction Reference (UETR). Model updates as swift.gpi.updated events, using the UETR as the partition key.
OpenTelemetryObservabilityInstrument AML calls as traced spans, correlate across the compliance pipeline, and alert on latency regressions before they cause settlement failures.

Real-World Results

In the core banking and RTGS integration systems I have built, the move from synchronous RPC to event-driven messaging produced consistent, measurable improvements across three dimensions:

MetricSynchronous BaselineAfter Event-Driven Migration
Payment latency (p99)2,000 ms~200 ms (client-perceived)
Failure recoveryManual restart requiredAutomatic: Kafka holds events until consumer recovers
ThroughputCapped by slowest serviceScaled independently per consumer
Deployment couplingCoordinated releasesIndependent team delivery
DLQ visibilitySilent failuresImmediate P1 alert on first failure

The latency improvement is largely perceptual: the client receives 202 Accepted immediately rather than waiting for end-to-end processing. But for user experience in mobile banking applications, a 200 ms response versus a 2-second wait is the difference between a fluid experience and a broken-feeling one.


When NOT to Use Event-Driven Architecture

Maturity in system design includes knowing when not to apply a pattern. Event-driven architecture is not appropriate in every context.

Do not use EDA when:

  • Your system is simple CRUD. A content management tool, an internal admin panel, or a low-traffic API with a single database does not need a message broker. The operational overhead (Kafka cluster, consumer management, DLQ alerting) far exceeds any benefit.

  • Consistency must be immediate. EDA systems are eventually consistent by nature. If a business requirement states that a write must be confirmed before the next action (for example, a synchronous balance check before an overdraft decision), EDA introduces complexity without benefit.

  • Your team is not ready for the operational model. Without distributed tracing, consumer lag alerts, and DLQ monitoring in place, an event-driven system degrades into an invisible mess. The tooling investment is substantial and non-negotiable.

  • Your event volume is low and traffic is predictable. If you process 1,000 transactions per day on a predictable schedule, the concurrency and throughput benefits of Kafka do not materialise. A well-indexed relational database and a simple job queue will serve you better with far less complexity.

The right question is not "should we use event-driven architecture?" but "does our scale, failure tolerance, and team maturity justify the operational model?" In high-volume, multi-service financial platforms with strict resilience requirements, the answer is consistently yes. In everything else, default to simplicity first.


Key Takeaways

Event-Driven Architecture is not a silver bullet. For high-volume, multi-system financial platforms, however, it is consistently the right architectural choice when the team and tooling are ready to support it.

The pattern works because it resolves the three failures of synchronous design:

  • Tight coupling is broken by the event contracts which are services know only to the schema, not to each other.
  • Cascade failures are contained meaning a failing consumer does not affect producers or sibling consumers.
  • No retry strategy is replaced by durable storage, exponential back-off, and DLQ-based recovery.

The complexity cost is real. Distributed tracing, schema management, idempotency enforcement, and DLQ handling all require engineering discipline. The operational benefits at scale, however, far outweigh the overhead.

In the banking systems I have built and integrated, the shift from synchronous Remote Procedure Calls (RPC) to event-driven messaging was consistently the single most impactful architectural improvement: it reduced client-perceived latency, eliminated brittle point-to-point integrations, and enabled independent team delivery without coordination overhead.

The failure stories in this post are not edge cases, they are the rule. Duplicate transactions, missing credits, and reconciliation nightmares happen in every system that does not implement idempotency, manual offset management, and a Schema Registry from day one. The patterns in this post exist because someone, somewhere, learned them the expensive way.

EM

Emmanuel Maneswa

Full Stack Software Engineer

LinkedInGitHub
← Back to all posts