Online bank design
Let's talk about online banks. From what I've seen and researched, most of them run on good old SQL databases, and those do the job. This is the industry standard, and for good reasons. There's also the "nobody ever got fired for choosing Oracle" mindset, which tends to produce safe, defensive technology choices rather than exploration of alternatives.
Okay, that makes sense. But what if we change the mindset and try something that usually isn't chosen? What if we pick ZooKeeper as the main storage? Yes, it's a coordination service, but not only that - it has properties we can leverage for a general-purpose database: strong consistency guarantees, built-in distributed coordination, and natural handling of concurrent operations. Why ZooKeeper? It doesn't really matter - you can replace it with any other coordination service, such as etcd or Consul.
We'll build an online bank that handles thousands of concurrent transfers reliably. Maybe we'll discover something cool about distributed systems along the way, or see exactly why SQL dominates the space!
Let's limit the scope to the core functional part: moving funds between accounts. The tricky part is making sure all balances stay correct when lots of people are sending money at the same time. Looking at real banks gives us a useful insight - most transfers happen between accounts that are close to each other geographically, like within the same city or country. Cross-border transfers are much less common. This simple fact will help us make our system faster and simpler.
What about non-functional requirements? Treat these as ideal goals rather than guarantees:
Bank should be available 99.999% (down no more than 5 minutes per year)
Latency of transfer processing should be under 5 seconds
Throughput of the bank should be up to 3,000 transfers per second
Number of users can be scaled up to 100M
Here's a teaser of our performance results (note that these promising numbers are theoretical):
2PC (single client): ~30 tps
2PC (10 clients): ~173 tps (hot accounts limited)
Event Sourcing (single validator): ~125 tps
Multi-validator (5 validators with blocks): ~3400 tps
If you want to read something more serious, look at how Visa and Facebook (Libra, RIP) solve these problems, and at the rest of the BFT family of protocols.
Toolbox
Before diving into our banking system design, let's look at the basic tools we'll use throughout the article. While we use ZooKeeper as our primary storage in examples, we're really just using it as a distributed database with some helpful reliability features. Similar systems like etcd or Consul could work equally well - they provide the same core capabilities. Think of this section as a toolbox - we're just looking at what's inside for now. Like a hammer or screwdriver, each tool's real purpose will become clear when we actually start building.
Single Node Read/Write - The foundation of our account and transfer storage. We store account states and transfer records as individual nodes. While ZooKeeper uses znodes, etcd and Consul provide similar key-value storage with slight syntax differences. In our system, this is used for storing account balances, transfer records, and block data.
Optimistic locking (versioning) - Critical for our optimistic concurrency control when updating nodes. Each node maintains a version number that changes with updates. We use this to ensure atomic updates of individual account states - if two clients try to modify the same node simultaneously, only one succeeds. ZooKeeper provides this via version numbers in Stat structure, etcd uses ModRevision, and Consul uses ModifyIndex.
Ephemeral Nodes - a client can create an ephemeral node tied to its session. If the client crashes, these nodes automatically disappear, releasing the claimed transactions back to the pool. ZooKeeper directly supports ephemeral nodes, etcd achieves this through leases with TTL, and Consul uses sessions with TTL.
Sequential Nodes - We use these to order any entity we need, ensuring a consistent global sequence. ZooKeeper provides this natively (creating nodes like block_0000001), etcd uses revision numbers for ordering, while Consul requires custom implementation using atomic counters.
Directory Reading - Used when the client needs to scan the whole list of nodes. All three systems support listing keys with a common prefix, though with different performance characteristics. ZooKeeper provides direct child node listing, etcd and Consul support prefix-based key enumeration.
Watches - Clients can use watches to monitor their node values. ZooKeeper provides one-time triggers that must be reset, etcd offers continuous event streams, and Consul uses blocking queries.
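To make the toolbox concrete, here is a minimal sketch of these primitives using the Python kazoo client. All paths and values are illustrative, not part of the final design; etcd and Consul clients expose equivalent calls.

from kazoo.client import KazooClient
from kazoo.exceptions import BadVersionError

zk = KazooClient(hosts="127.0.0.1:2181")  # assumes a reachable ZooKeeper ensemble
zk.start()

# Single node read/write
zk.create("/accounts/account123", b'{"balance": 1000}', makepath=True)
data, stat = zk.get("/accounts/account123")

# Optimistic locking: set() succeeds only if the version still matches
try:
    zk.set("/accounts/account123", b'{"balance": 900}', version=stat.version)
except BadVersionError:
    pass  # someone else updated the node first - re-read and retry

# Ephemeral node: disappears automatically when this session ends
zk.create("/locks/transfer_001/owner", b"validator_1", ephemeral=True, makepath=True)

# Sequential node: ZooKeeper appends a monotonically increasing suffix
block_path = zk.create("/ledger/block_", b"...", sequence=True, makepath=True)  # e.g. /ledger/block_0000000001

# Directory reading: list all children under a path
accounts = zk.get_children("/accounts")

# Watch: a one-time trigger fired on the next change of the node
def on_change(event):
    print("node changed:", event.path)

zk.get("/accounts/account123", watch=on_change)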
Funds transfer (2PC)
As our first approach, let's explore a straightforward way of modeling accounts and transfers. We'll use two main tables: one to store account states and another for transfers. For accounts, we'll keep simple data like balance, while for transfers, we'll record from, to, and amount. Additional fields are included to support processing.
Data structures
Account State:
/accounts/{account_id}/state:
balance: 1000 # Current available balance
reserved_amount: 200 # Sum of pending outgoing transfers, prevents double-spending
version: 5 # Increments with every update
pending_transfers: { # Multiple parallel transfers possible
"transfer_001": {
amount: -100, # Negative for outgoing, positive for incoming
applied: false # Updated atomically with balance
},
"transfer_002": {
amount: -100,
applied: false
}
}
Transfer Record:
/transfers/{transfer_id}:
status: string # State machine: initial -> preparing -> committing -> completed
from: "account123" # Source account ID
to: "account789" # Destination account ID
amount: 100 # Transfer amount
High-level design
Let's go over the high-level design of our bank, focusing on how we transfer funds between accounts. The system uses three ZooKeeper clusters: ZK balances holds current account amounts, ZK transfers tracks ongoing transfers, and ZK ledger (sharded) keeps all past transactions:
Broadcast initializes account balances across the system. This ensures every node starts with correct account data before processing any transfers.
BankApp creates a transfer record in ZK transfers. This acts as a central queue for all money movements, helping track every transfer attempt in the system.
BankApp runs the 2PC transfer by checking account balance, setting aside money, updating both accounts, and marking it done. On recovery, system sees the waiting transfer and continues where it left off - 2PC protocol makes sure money moves correctly even if something breaks.
Broadcast pulls transactions from ZK transfers. This creates a reliable buffer between processing and storage, where transactions safely wait for their turn to be recorded.
Broadcast posts confirmed transactions to ZK ledger where they're stored permanently. The ledger becomes the source of truth for all successful transfers in the system.
Broadcast cleans up posted transfers from ZK transfers to keep the system clean. This maintenance step prevents the transfer queue from growing too large.
Pitfalls
How to update the balances of two accounts atomically? ZooKeeper doesn't support atomic updates across two znodes, so we solve this using two-phase commit (2PC) with pending transfers and version checking. First, transfers can prepare in parallel, reserving amounts, validating balances, and incrementing versions to prevent lost updates. For the final balance updates, we use ZooKeeper's ephemeral-sequential nodes to provide brief, atomic updates to each account, ensuring each update includes the correct version number.
How to prevent Double-Spending? Concurrent transfers could potentially overdraw an account, but we prevent this by tracking reserved amounts for pending outgoing transfers. During preparation phase, we validate that balance minus reserved amount is sufficient for the transfer. This ensures we never commit to transfers without adequate funds.
What if the client fails at any stage? If the client crashes, its ephemeral nodes are automatically cleaned up thanks to ZooKeeper's session management. Pending transfers show clear intent of what should happen, while applied flags prevent double-processing of updates. A background recovery job can safely complete or roll back any interrupted transfers by examining this state.
Versions vs Locking. In our bank transfer system, we could protect account updates either by locking accounts before modification or by tracking version numbers that increment with each change. We chose versioning since it provides atomic updates with minimal overhead - each operation simply checks and increments a version number, rather than requiring additional network calls to acquire and release locks. This optimistic approach is simpler and more efficient when most transfers don't conflict, requiring a retry only if the account state has changed since the initial read.
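To illustrate the versioning approach, here is a minimal sketch of a check-and-retry update of an account state with kazoo, assuming the JSON layout of the account node from the data structures above; the helper name and retry count are arbitrary:

import json
from kazoo.client import KazooClient
from kazoo.exceptions import BadVersionError

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

def reserve_amount(account_id, transfer_id, amount, retries=5):
    """Optimistically add a pending outgoing transfer to an account state node."""
    path = f"/accounts/{account_id}/state"
    for _ in range(retries):
        data, stat = zk.get(path)                      # read current state and its version
        state = json.loads(data)
        if state["balance"] - state["reserved_amount"] < amount:
            return False                               # not enough free funds
        state["reserved_amount"] += amount
        state["pending_transfers"][transfer_id] = {"amount": -amount, "applied": False}
        try:
            # succeeds only if nobody changed the node since our read
            zk.set(path, json.dumps(state).encode(), version=stat.version)
            return True
        except BadVersionError:
            continue                                   # concurrent update - re-read and retry
    return False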
2PC Implementation
Ready to dive deep into the 2PC implementation?
#1 TRANSFER INITIATION
CREATE /transfers/transfer_001
status: "initial"
from: "account123"
to: "account789"
amount: 100
# If the client crashes, a cleanup job removes stale "initial" transfers.
# No money has moved yet, so it's safe to clean up; the znode can even be ephemeral.
#2 PREPARATION PHASE
UPDATE /transfers/transfer_001/status: "preparing"
# Validate: balance - reserved_amount >= transfer_amount
UPDATE /accounts/account123/state:
balance: 1000
reserved_amount: 100 # Add to existing reserved amount
version: 5
pending_transfers: {
transfer_001: {
amount: -100,
applied: false
}
}
UPDATE /accounts/account789/state:
balance: 500
version: 11 # version 5 was for account123
pending_transfers: {
transfer_001: {
amount: 100,
applied: false
}
}
# Recovery job finds incomplete preparations
# and rolls back reserved amounts if the client failed along the way
#3 COMMIT PHASE
UPDATE /transfers/transfer_001/status: "committing"
UPDATE /accounts/account123/state: # Atomic update
balance: 900 # Decreased by 100
version: 6
reserved_amount: 0 # Decrease by completed amount
pending_transfers: {
transfer_001: {
amount: -100,
applied: true # Applied
}
}
UPDATE /accounts/account789/state: # Atomic update
balance: 600 # Increased by 100
version: 12
pending_transfers: {
transfer_001: {
amount: 100,
applied: true # Applied
}
}
# If client dies between updates: recovery job checks applied flags
# and completes missing updates.
# Atomic updates prevent double-processing.
#4 CLEAN-UP PHASE
UPDATE /transfers/transfer_001/status: "completed"
DELETE pending entries
# Safe if client dies during cleanup - completed status shows final state,
# background job can clean pending entries.
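As a sketch of the recovery job mentioned above, here is how the committing case could be resumed with kazoo, assuming the JSON layouts from the data structures section (the preparing-phase rollback would be analogous; version-conflict retries are omitted for brevity):

import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

def recover_committing(transfer_id):
    """Finish a transfer that crashed mid-commit by checking the applied flags."""
    t_data, _ = zk.get(f"/transfers/{transfer_id}")
    transfer = json.loads(t_data)
    if transfer["status"] != "committing":
        return
    for account_id, sign in ((transfer["from"], -1), (transfer["to"], +1)):
        path = f"/accounts/{account_id}/state"
        data, stat = zk.get(path)
        state = json.loads(data)
        pending = state["pending_transfers"].get(transfer_id)
        if pending and not pending["applied"]:
            state["balance"] += sign * transfer["amount"]
            if sign < 0:
                state["reserved_amount"] -= transfer["amount"]
            pending["applied"] = True
            # versioned set keeps the update atomic; a BadVersionError here would mean "retry"
            zk.set(path, json.dumps(state).encode(), version=stat.version)
    transfer["status"] = "completed"
    zk.set(f"/transfers/{transfer_id}", json.dumps(transfer).encode())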
Parallel transfers case
What if transfer_001 and transfer_002 start simultaneously, each incrementing the version? The second transfer does not wait until the first one is finished:
#2 PREPARATION PHASE
UPDATE /accounts/account123/state:
balance: 1000
reserved_amount: 200 # Both transfers reserve
version: 2
pending_transfers: {
transfer_001: { amount: -100, applied: false },
transfer_002: { amount: -100, applied: false }
}
#3 COMMIT PHASE
# transfer_001:
# Atomic update of entire state node:
UPDATE /accounts/account123/state:
balance: 900
version: 3
reserved_amount: 100 # Decrease by completed transfer
pending_transfers: {
transfer_001: { amount: -100, applied: true },
transfer_002: { amount: -100, applied: false }
}
# transfer_002 (in parallel):
# Waits for transfer_001, then atomic update:
UPDATE /accounts/account123/state:
balance: 800
version: 4
reserved_amount: 0 # No more pending
pending_transfers: {
transfer_001: { amount: -100, applied: true },
transfer_002: { amount: -100, applied: true }
}
Performance
Data Volume Per Operation:
Account State: ~140 bytes (balance, version, reserved amount, 2-3 pending transfers)
Transfer Record: ~80 bytes (status, from/to accounts, amount)
Operation Costs:
- 8 ZooKeeper operations per transfer (create record, 2 preps, 2 commits, status update, 2 cleanups)
Single BankApp Instance:
ZooKeeper limit: ~250 ops/second (3-4ms per operation)
Transfer throughput: 31 tps (250/8 ops)
10 BankApp Instances:
Raw throughput: 310 tps (31 × 10)
Effective: 248 tps (20% hot account conflicts)
Production: 173 tps (70% capacity)
Daily Volume at 173 TPS:
Transfers per day: 14,947,200 (173 × 86400 seconds)
Transfer records: 1.09 GB (78 bytes × transfers)
Account updates: 2.31 GB (166 bytes × transfers)
Total daily data: 3.40 GB
Storage
ZooKeeper as a storage engine has distinct limitations - a single instance can handle only about 10GB of data due to memory constraints and must keep the entire dataset in RAM, similar to Redis's fundamental design. With our system generating 30GB of validated transactions daily at 3,000 tps, we need an architecture that can scale beyond these boundaries. Traditional databases solved similar challenges by keeping only hot data in memory while leaving cold data on disk.
To get there, let's consider a single-node ZooKeeper setup for the cold tier: it eliminates network coordination and consensus overhead, allowing direct writes through a local write-ahead log, similar to RocksDB's approach. This provides higher throughput for historical data, where strong consistency guarantees are less critical.
Memory constraints of ZooKeeper require keeping the entire dataset in RAM, making it impractical to store all historical data in a single instance. Our solution implements tiered storage with multiple ZooKeeper clusters - active instances hold recent data while historical data is distributed across "cold" clusters that can be brought online on demand. This mirrors how MySQL manages hot/cold data through buffer pools and how BigTable employs tablet splitting. Time-based sharding allows efficient data organization where older validated transactions are moved to dedicated historical clusters, similar to how Cassandra handles data temperature through its tiered compaction strategy.
Data organization leverages ZooKeeper's hierarchical structure. Historical transactions are stored in time-based paths (/year/month/day/hour) with batch files containing multiple transactions, reminiscent of how HBase organizes data in regions and how MongoDB uses date-based chunking. This provides natural partitioning and efficient retrieval patterns while keeping individual znodes within size limits. Each batch includes metadata for quick scanning without loading full content, similar to how PostgreSQL handles TOAST tables for large values.
Query patterns take advantage of ZooKeeper's in-memory nature for fast lookups, achieving performance similar to Redis's sorted set operations for range queries. The hierarchical path structure enables efficient time-range queries by traversing only relevant paths, comparable to how etcd handles range scans. For specific transaction lookups, we maintain lightweight indices pointing to appropriate time-based shards, similar to MongoDB's index-driven queries.
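To make the time-based layout concrete, here is a sketch of a range scan over hourly batch nodes; the /ledger-archive root, the hostname, and the batch JSON format are assumptions for illustration:

import json
from datetime import datetime, timedelta
from kazoo.client import KazooClient
from kazoo.exceptions import NoNodeError

zk = KazooClient(hosts="cold-zk:2181")   # historical ("cold") cluster, hostname is illustrative
zk.start()

def read_archived_transfers(start, end):
    """Yield archived transactions between two timestamps by walking hourly paths."""
    hour = start.replace(minute=0, second=0, microsecond=0)
    while hour <= end:
        path = f"/ledger-archive/{hour:%Y/%m/%d/%H}"
        try:
            for batch in zk.get_children(path):        # e.g. batch_0000000042
                data, _ = zk.get(f"{path}/{batch}")
                for tx in json.loads(data)["transactions"]:
                    yield tx
        except NoNodeError:
            pass                                       # nothing archived for this hour
        hour += timedelta(hours=1)

# Example: read_archived_transfers(datetime(2024, 1, 1), datetime(2024, 1, 2))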
Implementation challenges focus on efficient data lifecycle management. With 294 million daily operations generating 33GB, key concerns include automated shard management, data migration between hot/cold clusters, and backup strategies - challenges that echo solutions from Spanner's data movement techniques and Cassandra's repair mechanisms. The solution maintains eventual consistency for historical data while providing strong guarantees for active transactions, similar to how DynamoDB handles different consistency levels for different access patterns.
Funds Transfer v2 (Event-sourcing)
Why focus on balance states when they're derived from transfers? A consistent balance state is only necessary when validating the next transfer to prevent double spending.
Each transfer can reference a previous validated transfer where the source account received funds (input link). New transfers are initially posted to the waiting chain. After validation, they move to a validated chain, forming an append-only log. A single validator process handles the movement of transfers between chains, ensuring a consistent order and preventing double-spending.
Data Structures
# Waiting Queue
/waiting/42:
from: "account_123"
to: "account_456"
amount: 100
input: "validated_41" # reference to previous validated transfer
# Validated Chain
/validated/41:
from: "account_123"
to: "account_456"
amount: 100
input: "validated_40"
# Account balance cache
/account/123:
balance: 100
High-level design
Let's go over the high-level design of our event-sourcing approach for handling money transfers. Unlike the 2PC approach, this design treats transfers as a stream of events that must be validated and recorded sequentially. Here's how transfers flow through the system:
BankApp adds a transfer to ZK waiting transfers. Each transfer references a previous validated transfer as its input source, creating a chain of money movement.
Validator reads waiting transfers from the queue, checking for transfers that have proper input references and are ready for processing.
Validator reads current balances, verifying the input transfer hasn't been spent and has enough funds available. If validator crashes here, the transfer stays in waiting queue - no changes were made yet.
Validator posts verified transfers to ZK validated cluster, marking them as legitimate parts of the transaction chain. On recovery, transfers in waiting queue show which ones need validation - same transfer will always validate to same result.
Validator updates account balances to reflect newly validated transfers, maintaining an accurate view of available funds. On recovery, waiting queue shows which balances to recalculate from validated history.
Validator cleans up processed transfers from the waiting queue, removing entries that have been either validated or rejected.
Broadcast reads validated transfers, preparing them for the immutable ledger while maintaining their sequential order. On failure, it's safe to restart since validated transfers create an ordered chain.
Broadcast posts transfers to the ledger, creating a permanent, append-only record of all successful transfers.
Broadcast cleans up processed transfers from the validated queue once they're safely recorded in the ledger.
Pitfalls
How to update balances of two accounts atomically? In this approach, there's no need to update both account balances atomically. Instead, balances can be evaluated using eventual consistency, where the system reaches a correct state over time rather than instantaneously.
How to prevent Double-Spending? Since each transfer must reference a previous validated transfer as its input, and the validator ensures this input hasn't been spent before, it's impossible to spend the same money twice. If two transfers try to spend the same input, only the first one validated will succeed; the second will stay in the waiting chain as invalid.
What if the client fails at any stage? If the client fails after posting to the waiting chain, the validator will still process the transfer. If it fails before posting, no state changes have happened. No intermediate states are possible because a transfer is either in the waiting chain or in the validated chain. The system always maintains consistency by following the links in the validated chain.
Event-sourcing implementation
## INITIAL STATE
/validated/41:
transfer_id: "transfer_41"
from: "account_789"
to: "account_123"
amount: 150
input: "transfer_40"
/validated/42:
transfer_id: "transfer_42"
from: "account_123"
to: "account_456"
amount: 50
input: "transfer_41"
Transfer process:
- Client (account_123) checks the validated chain, finding 100 unspent from transfer_41 after transfer_42 spent 50. Then the client creates an ephemeral node and sets a watch on it:
CREATE
/waiting/77:
transfer_id: "transfer_77"
from: "account_123"
to: "account_456"
amount: 70
input: "transfer_42"
status: "pending"
The validator verifies input transfer_42 and checks that no other validated transfer has spent the remaining amount.
If valid, it creates a new validated transfer and updates the waiting transfer:
CREATE /validated/43:
transfer_id: "transfer_43"
from: "account_123"
to: "account_456"
amount: 70
input: "transfer_41"
UPDATE /waiting/77:
status: "validated"
If invalid:
UPDATE /waiting/77:
status: "rejected"
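Putting the steps together, a single validator loop could be sketched like this; it assumes the JSON node layouts above and, for brevity, treats each input reference as spendable only once and keeps the spent set in memory:

import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

def validate_waiting(next_seq, spent_inputs):
    """Run one validation pass over the waiting queue; returns the next sequence number."""
    for node in sorted(zk.get_children("/waiting")):
        data, _ = zk.get(f"/waiting/{node}")
        transfer = json.loads(data)
        if transfer.get("status") != "pending":
            continue
        if transfer["input"] in spent_inputs:
            transfer["status"] = "rejected"            # the referenced input was already consumed
        else:
            # append to the validated chain and remember that this input is now spent
            zk.create(f"/validated/{next_seq}", json.dumps({
                "transfer_id": transfer["transfer_id"],
                "from": transfer["from"],
                "to": transfer["to"],
                "amount": transfer["amount"],
                "input": transfer["input"],
            }).encode(), makepath=True)
            spent_inputs.add(transfer["input"])
            transfer["status"] = "validated"
            next_seq += 1
        zk.set(f"/waiting/{node}", json.dumps(transfer).encode())
    return next_seq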
Performance
Data Volume Per Operation:
Waiting Transfer: ~100 bytes (transfer ID, from/to accounts, amount, input reference, status)
Validated Transfer: ~90 bytes (transfer ID, from/to accounts, amount, input reference)
In-memory account balance: ~30 bytes per account (ID + balance + last transfer ref)
Operation Costs:
Create waiting transfer: 1 write
Move to validated chain: 1 write
Total: 2 ZooKeeper writes per transfer
Throughput:
Single validator throughput: ~250 ops/second (network/client limited)
Maximum transfers: 250/2 = 125 tps from a single validator
Sustainable production throughput: ~125 tps
To achieve higher throughput we need multiple validators, but that requires a redesign to avoid races
Daily Volume (at 125 tps):
Transfers: 10.8M/day
Data: 2.27 GB/day (1.19 GB waiting + 1.08 GB validated)
Funds Transfer v2.1 (Multi-validators)
The main drawback of the previous approach was that a single validator could become a bottleneck for the entire system. Given that appending to the global append-only log can only happen in one place, we had to have only one validator. However, let's offload the validation work from it and leave it with only coordination responsibilities.
The multi-validator approach allows us to optimistically validate blocks of transfers in parallel. We keep track of which accounts are affected in each block. If two blocks involve different sets of accounts, the coordinator can assume there are no conflicting operations and trust the internal validation of each block.
This design excels in real-world scenarios because most transfers in a peer-to-peer system involve different pairs of accounts. When Alice pays Bob and Carol pays Dave, these transfers can be processed in parallel with no conflicts. Even when conflicts occur (like Alice making two payments in quick succession), the system handles them gracefully through the reject-and-retry mechanism. The first payment goes through immediately while the second payment may experience a small delay as it gets included in the next block.
Data Structures
The ledger is a sequence of blocks, where each block contains multiple transactions and metadata about affected accounts. Each block in the ledger is immutable and atomic, providing a reliable source of truth. For example, a block in the ledger might look like this:
/ledger/block_41:
block_id: "block_41"
validator: "validator_1"
transactions: [{
transfer_id: "transfer_41"
from: "account_789"
to: "account_123"
amount: 150
input: "transfer_40"
}, {
transfer_id: "transfer_42"
from: "account_123"
to: "account_456"
amount: 50
input: "transfer_41"
}]
affected_accounts: ["account_789", "account_123", "account_456"]
The waiting pool is where new transactions are posted by clients. Each transaction in the pool is identified by its transfer_id for efficient lookup and processing:
/waiting/77:
transfer_id: "transfer_77"
from: "account_123"
to: "account_456"
amount: 70
input: "transfer_42"
status: "pending"
High-level design
Let's go over the high-level design of our multi-validator approach. Unlike previous designs, this system achieves high throughput by processing multiple blocks of transfers in parallel, while using a coordinator cluster to ensure consistency between blocks.
BankApp adds transfers to their local ZK waiting transfers queue. The system is sharded by account ranges, so each BankApp instance handles specific accounts for better load distribution.
Validator reads waiting transfers from its queue. It groups multiple transfers into blocks that may be processed together, improving efficiency over single-transfer processing.
Validator reads current balances to verify transfers. If validator crashes here, transfers stay safe in waiting queue until recovery - no changes were made yet.
Multiple validators submit their blocks to the Coordinator Leader for comparison. The coordinator and its followers form a consensus group that checks for conflicts between account ranges in different blocks.
Validator updates account balances after its block is approved. On recovery, pending transfers in waiting queue show exactly which balances need updates.
Validator cleans up processed transfers from its waiting queue, removing transfers that made it into approved blocks.
Coordinator posts approved blocks to ZK ledger. Leader-follower setup provides quick failover if leader crashes, while account range checking prevents double-spending.
Broadcast nodes pull new blocks from the ledger. This distributes the transaction history across the system, preparing for account updates.
Broadcast propagates updated balances across all nodes. Fast propagation reduces chance of block rejection since validators work with more current account states.
Transfer Process
Step 1: Client Transaction Submission When a client wants to make a transfer, they first traverse the ledger to find their available funds. The client looks for transactions where they received money and tracks which of these inputs have already been spent in subsequent transfers. This creates a verifiable chain of ownership. After finding available funds, the client creates a new transaction in the waiting pool, specifying which previous transaction output they're spending (the input reference).
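A sketch of how a client might compute its unspent funds by scanning the ledger, assuming the block layout above; a real client would use an index or balance snapshot instead of a full traversal:

import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

def find_unspent(account_id):
    """Return {transfer_id: remaining_amount} for funds received and not yet fully spent."""
    received = {}                                  # transfer_id -> amount credited to this account
    spent = {}                                     # input transfer_id -> amount already spent from it
    for block in sorted(zk.get_children("/ledger")):
        if not block.startswith("block_"):
            continue                               # skip pending/rejected sub-trees
        data, _ = zk.get(f"/ledger/{block}")
        for tx in json.loads(data)["transactions"]:
            if tx["to"] == account_id:
                received[tx["transfer_id"]] = tx["amount"]
            if tx["from"] == account_id:
                spent[tx["input"]] = spent.get(tx["input"], 0) + tx["amount"]
    return {tid: amount - spent.get(tid, 0)
            for tid, amount in received.items()
            if amount - spent.get(tid, 0) > 0}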
Step 2: Parallel Block Formation & Validation Multiple validators work independently without coordination. Each validator reads transactions from the waiting pool and forms its own block. During validation, the validator verifies that each transaction's input exists in the ledger and hasn't been spent. Validators track which accounts are affected by their block's transactions, as this information is crucial for conflict detection later.
A validator's in-progress block looks like this:
/validation/in_progress/validator_1/block_42:
status: "validating"
start_time: 1699701400
transactions: [{
transfer_id: "transfer_77"
from: "account_123"
to: "account_456"
amount: 70
input: "transfer_42"
}]
affected_accounts: ["account_123", "account_456"]
To prevent multiple validators from processing the same transactions, validators create ephemeral ownership nodes when claiming transactions from the waiting pool. When a validator selects a transaction, it creates an ephemeral node at /waiting/pool/{transfer_id}/owner with its validator ID. Other validators skip transactions that already have ownership nodes. This exclusive locking mechanism reduces unnecessary validation work and minimizes conflicts during block merging.
If a validator fails during processing, its ZooKeeper session will end and all its ephemeral ownership nodes will be automatically removed. This releases locked transactions back to the waiting pool, allowing other validators to process them. This automatic cleanup ensures that no transaction remains permanently locked if a validator crashes. The in-progress block at /validation/in_progress/{validator_id}/ is also created as an ephemeral node, so it's automatically cleaned up when the validator fails.
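A sketch of the claim step with an ephemeral owner node, following the /waiting/pool/{transfer_id}/owner convention described above:

from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

def try_claim(transfer_id, validator_id):
    """Claim a waiting transfer with an ephemeral owner node; False if someone else owns it."""
    try:
        zk.create(f"/waiting/pool/{transfer_id}/owner",
                  validator_id.encode(),
                  ephemeral=True,      # released automatically if this validator's session dies
                  makepath=True)
        return True
    except NodeExistsError:
        return False                   # another validator already claimed this transfer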
Step 3: Block Competition When a validator completes validation of its block, it attempts to commit the block by creating a sequential node in the pending area. The sequential nature of ZooKeeper nodes naturally orders the blocks. The first validator to complete its block gets the lowest sequence number:
CREATE /ledger/pending/block_{seq}:
block_id: "block_42"
validator: "validator_1"
timestamp: 1699701450
transactions: [...]
affected_accounts: ["account_123", "account_456"]
# Other validators continue their work and submit their blocks as they complete
CREATE /ledger/pending/block_{seq+1}:
block_id: "block_43"
validator: "validator_2"
timestamp: 1699701451
transactions: [...]
affected_accounts: ["account_789", "account_234"]
Step 4: Block Resolution A background merger (Coordinator) process handles the sequential integration of blocks into the main ledger. It processes pending blocks in sequence order, checking for conflicts between blocks. A conflict occurs when two blocks affect the same accounts. The merger process moves conflict-free blocks to the ledger and rejects conflicting blocks.
For example, when block_42 is processed first:
CREATE /ledger/block_42:
block_id: "block_42"
validator: "validator_1"
transactions: [...]
affected_accounts: ["account_123", "account_456"]
# If block_43 had conflicted with block_42, it would be rejected:
CREATE /ledger/rejected/block_43:
block_id: "block_43"
reason: "account_conflict"
conflicting_blocks: ["block_42"]
When a block is rejected, its transactions return to the waiting pool for inclusion in future blocks. This automatic retry mechanism ensures that no valid transaction is permanently lost due to conflicts.
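The merger's conflict check boils down to a set intersection over affected accounts. Here is a sketch, assuming the pending/ledger/rejected paths above; for brevity it only checks conflicts between blocks within one merge pass, not against balance snapshots:

import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

def merge_pending_blocks():
    """Process pending blocks in sequence order, committing or rejecting each one."""
    committed_accounts = set()                     # accounts touched by blocks committed in this pass
    for node in sorted(zk.get_children("/ledger/pending")):   # sequence suffix gives the order
        data, _ = zk.get(f"/ledger/pending/{node}")
        block = json.loads(data)
        affected = set(block["affected_accounts"])
        if affected & committed_accounts:
            # overlap with an earlier block - reject, its transactions go back to the waiting pool
            zk.create(f"/ledger/rejected/{block['block_id']}",
                      json.dumps({"block_id": block["block_id"],
                                  "reason": "account_conflict"}).encode(),
                      makepath=True)
        else:
            zk.create(f"/ledger/{block['block_id']}", data, makepath=True)
            committed_accounts |= affected
        zk.delete(f"/ledger/pending/{node}")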
Step 5: Client Confirmation Clients maintain watches on their transactions in the waiting pool. When a transaction moves from the waiting pool to a block in the ledger, the client receives a notification. The client can then verify the transaction details in the ledger block and proceed with confidence that the transfer is now permanent and immutable.
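Setting such a watch with kazoo might look like this; the callback behavior and path naming are illustrative:

import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

def watch_transfer(transfer_id):
    """Print status transitions of a waiting transfer; DataWatch re-arms itself after each event."""
    @zk.DataWatch(f"/waiting/{transfer_id}")
    def on_update(data, stat):
        if data is None:
            print(f"{transfer_id}: node removed (picked up into a ledger block)")
            return False               # stop watching
        print(f"{transfer_id}: status={json.loads(data).get('status')}")

watch_transfer("transfer_77")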
Performance
Data Volume Per Operation:
Transfer Record: ~100 bytes (transfer ID, from/to accounts, amount, input reference)
Block Header: ~100 bytes per block (block ID, validator ID, timestamp, affected accounts list)
Block content: Transfer records + ~50 bytes overhead
With avg 100 transfers per block: ~10KB per block
Block Structure
Size: 100 transfers per block
Block processing time: 80ms
Read 100 transfers: 20ms
Validate transfers: 30ms
Check accounts: 20ms
Create block: 4ms
Coordination: 6ms
Single Validator:
Blocks per second: 12 (1000ms/80ms)
Transfer throughput: 1,200 tps (12 blocks × 100 transfers)
5 Validators:
Raw throughput: 6,000 tps (1,200 × 5)
Effective: 4,800 tps (20% block conflicts)
Production: 3,400 tps (70% capacity)
Funds Transfer v2.2 (Multi-DC)
While our previous design worked well within a single data center, real-world banking systems often operate across multiple geographical regions. The multi-validator approach naturally extends to a distributed setup, allowing us to process most transactions locally while maintaining global consistency.
The key insight is that banking transactions typically follow geographical patterns - most transfers occur between accounts in the same region. By sharding accounts based on geography, we can process the majority of transfers within a single DC, and each DC only needs to keep a snapshot of account balances relevant to its region. For example, when Alice in Tokyo transfers money to Bob in Tokyo, the transaction can be validated and committed locally by the Asia-Pacific DC coordinator without global consensus.
For cross-region transfers, like when Charlie in New York sends money to Diana in London, we use a two-tier approach. Each DC has a coordinator that participates in global consensus. These coordinators can either use a multi-leader consensus protocol within time windows, similar to HotStuff (which will be covered later), or rely on a primary ZooKeeper cluster for leader election.
Short algo
Clients submit waiting transfers to their nearest DC
Each DC runs multiple validators that form potential blocks
DCs maintain synchronized balance snapshots rather than full transaction history
Each DC has a coordinator that participates in global consensus
Consensus can be achieved through either multi-leader time-window proposals (like HotStuff) or leader election in primary ZooKeeper
Once coordinators confirm a block, balance updates propagate system-wide
Account sharding follows geographical boundaries since most transfers are local
Local coordinators can process intra-region transfers independently
International transfers, with relaxed performance requirements, process through primary ZooKeeper
Let's look at a simple example:
Tokyo DC processes a local transfer:
Alice initiates transfer to Bob (both Tokyo accounts)
Tokyo validators form a block
Tokyo coordinator recognizes local accounts, commits immediately
Balance updates propagate to other DCs asynchronously
New York to London transfer:
Charlie initiates transfer to Diana
NY coordinator detects international transfer
Coordinates with global consensus layer
All DC coordinators confirm
Balance updates apply globally
This approach provides fast local transfers while maintaining consistency for international transactions, where more relaxed latency requirements are typically acceptable.
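Finally, a toy sketch of the routing decision a DC coordinator makes; the region encoding and function names are assumptions, not part of any real account-ID scheme:

def region_of(account_id):
    """Toy geographic sharding: the account prefix encodes its home region."""
    return account_id.split("_", 1)[0]             # e.g. "tokyo_alice" -> "tokyo"

def route_transfer(from_account, to_account):
    """Decide whether a transfer commits locally or goes through global consensus."""
    if region_of(from_account) == region_of(to_account):
        return "local"      # validated and committed by the regional DC coordinator
    return "global"         # handled by the cross-DC consensus layer / primary ZooKeeper

# route_transfer("tokyo_alice", "tokyo_bob")     -> "local"
# route_transfer("ny_charlie", "london_diana")   -> "global"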