Distributed System Implementations: Building Key-Value Stores
Many modern applications need to be fast, scalable, and always available. To achieve this, they are often built as distributed systems: software that runs on many computers working together to store and process data.
One important type of distributed system is the key-value store, which saves data as key-value pairs. This model is simple but very powerful, and many popular tools (databases, caches, and service discovery systems) are built on it.
However, building a key-value store that works across multiple machines is not easy. We need to think about problems like:
- What happens if one machine fails?
- How do we make sure the data is correct and up to date?
- How do we spread data evenly across machines?
These questions lead back to core ideas in distributed systems. It is easy to focus on the latest tools and forget these fundamentals, but understanding them is what makes it possible to build reliable, scalable systems.
In this blog post, we will look at some popular design ideas used in distributed key-value stores:
- Raft – helps keep data consistent across machines.
- CRDTs – allow updates without conflicts and support offline use.
- Consistent Hashing – spreads data evenly and handles machine changes.
- Primary-Backup – makes the system simple and fault-tolerant.
- Gossip Protocols – let machines share information efficiently without central coordination.
Each method has pros and cons. Choosing the right one depends on what your system needs.
By the end of this post, you’ll understand the main ideas behind these systems and when to use them.
Understanding the Basic Trade-offs
When building a distributed system, you must balance three properties described by the CAP theorem:
- Consistency: All computers see the same data at the same time
- Availability: The system continues to work even when some parts fail
- Partition tolerance: The system works even when network connections between computers fail
Most distributed systems must handle network problems, so the real choice is usually between consistency and availability. We’ll look at both types of systems in this post.
Raft: A Popular Consensus Algorithm
Raft is one of the most widely used approaches for building distributed systems that prioritize consistency. It’s used in systems like etcd and Consul.
How Raft Works
Raft uses a leader-follower model:
- Leader Election: The system chooses one computer as the leader
- Log Replication: The leader receives all write requests and shares them with followers
- State Machine: Once a majority of computers have saved an entry, it is committed and applied to the key-value store
public class RaftDistributedStorage implements Storage {
    private final RaftNode raftNode;
    private final Storage localStore;

    @Override
    public boolean put(byte[] key, byte[] value, long ttlSeconds) {
        // Create command
        Command putCommand = new PutCommand(key, value, ttlSeconds);
        // Submit to Raft
        try {
            // Wait for the command to be saved on a majority of computers
            CompletableFuture<Boolean> future = raftNode.submit(putCommand);
            return future.get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            return false;
        }
    }

    @Override
    public byte[] get(byte[] key) {
        // Read locally if this node's state is known to be current.
        // (A production system would use a read index or leader lease
        // to make such reads linearizable.)
        if (raftNode.isLeader() || raftNode.isUpToDate()) {
            return localStore.get(key);
        } else {
            // Ask the leader
            try {
                return raftNode.forwardToLeader(new GetCommand(key))
                        .get(5, TimeUnit.SECONDS);
            } catch (Exception e) {
                return null;
            }
        }
    }
}
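The third step above, applying committed entries to the state machine, is not shown in the class. A minimal sketch of that step follows; the onCommitted hook and the getters on PutCommand are assumptions about how a RaftNode might hand committed entries to the storage layer.

// Hypothetical callback registered with the RaftNode: invoked only after
// an entry has been replicated on a majority of nodes and committed
private void onCommitted(Command command) {
    if (command instanceof PutCommand) {
        PutCommand put = (PutCommand) command;
        // Only committed writes reach the local key-value store
        localStore.put(put.getKey(), put.getValue(), put.getTtlSeconds());
    }
}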
Advantages and Disadvantages of Raft
Good things about Raft:
- Strong data consistency
- Handles computer failures well
- Easier to understand than older methods
- Well-tested in real-world systems
Problems with Raft:
- Can be slower because of the consensus process
- Less available during network problems
- Needs most computers to be working
- Writes get slower as the cluster grows, since a majority must confirm each one
You can find more details about Raft at https://raft.github.io/
CRDTs: Prioritizing Availability
Conflict-free Replicated Data Types (CRDTs) offer an alternative approach that prioritizes availability over strong consistency.
How CRDTs Work
CRDTs are special data structures that can handle conflicting updates without requiring strict coordination between computers. The key is that all operations must be:
- Order-independent: replicas can apply updates in any order and reach the same result
- Merge-friendly: concurrent updates can be combined deterministically
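The class below leans on a CRDTValue type that knows how to merge two concurrent versions of the same key. Here is a minimal sketch of what it might look like, implemented as a last-writer-wins register: the merge keeps the value with the higher timestamp and breaks ties by node ID, so every replica resolves the same conflict the same way. The field names and the use of wall-clock time are illustrative assumptions.

public class CRDTValue {
    private final String nodeId;   // Which computer wrote this value
    private final byte[] value;
    private final long ttlSeconds;
    private final long timestamp;  // Write time used for conflict resolution

    public CRDTValue(String nodeId, byte[] value, long ttlSeconds) {
        this(nodeId, value, ttlSeconds, System.currentTimeMillis());
    }

    public CRDTValue(String nodeId, byte[] value, long ttlSeconds, long timestamp) {
        this.nodeId = nodeId;
        this.value = value;
        this.ttlSeconds = ttlSeconds;
        this.timestamp = timestamp;
    }

    // Last-writer-wins merge: the result is the same no matter which
    // replica merges first, and ties are broken deterministically
    public CRDTValue merge(CRDTValue other) {
        if (other.timestamp != this.timestamp) {
            return other.timestamp > this.timestamp ? other : this;
        }
        return other.nodeId.compareTo(this.nodeId) > 0 ? other : this;
    }

    public byte[] getValue() { return value; }
    public long getTtlSeconds() { return ttlSeconds; }
}

Because this merge is commutative, associative, and idempotent, replicas can exchange updates in any order, any number of times, and still converge to the same value.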
public class CRDTDistributedStorage implements Storage {
    private final Storage localStore;
    private final Map<String, NetworkClient> peerClients;
    private final String nodeId;
    // Use a CRDT Map
    private final Map<KeyWrapper, CRDTValue> crdtStore = new ConcurrentHashMap<>();

    @Override
    public boolean put(byte[] key, byte[] value, long ttlSeconds) {
        // Update local store
        localStore.put(key, value, ttlSeconds);
        // Create CRDT update
        CRDTValue crdtValue = new CRDTValue(nodeId, value, ttlSeconds);
        crdtStore.put(new KeyWrapper(key), crdtValue);
        // Share with other computers
        for (NetworkClient client : peerClients.values()) {
            client.sendCRDTUpdate(key, crdtValue).exceptionally(ex -> {
                // We can try again later if this fails
                return null;
            });
        }
        return true;
    }

    // Combine updates from other computers
    public void mergePeerUpdate(byte[] key, CRDTValue peerValue) {
        KeyWrapper keyWrapper = new KeyWrapper(key);
        CRDTValue localValue = crdtStore.get(keyWrapper);
        if (localValue == null) {
            // Just use the peer's value if we don't have one
            crdtStore.put(keyWrapper, peerValue);
            localStore.put(key, peerValue.getValue(), peerValue.getTtlSeconds());
        } else {
            // Combine our value with the peer's value
            CRDTValue mergedValue = localValue.merge(peerValue);
            crdtStore.put(keyWrapper, mergedValue);
            localStore.put(key, mergedValue.getValue(), mergedValue.getTtlSeconds());
        }
    }
}
Advantages and Disadvantages of CRDTs
Good things about CRDTs:
- High availability (works during network problems)
- No leader needed
- Eventually all computers will have the same data
- Simpler than consensus algorithms
Problems with CRDTs:
- Data may be temporarily inconsistent
- May need complex rules for resolving conflicts
- Requires more metadata
Consistent Hashing: Distributing Data Evenly
Consistent hashing is a way to distribute data across computers that minimizes disruption when computers are added or removed.
How Consistent Hashing Works
Keys are hashed to positions on a circle, and each computer is responsible for a section of the circle. When a computer joins or leaves, only the keys in neighboring sections move, which is why disruption stays small. For reliability, each key is stored on several consecutive computers.
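Before the storage class itself, here is a minimal sketch of the ConsistentHash ring it uses. Each physical node is placed at several virtual positions on the circle (a common trick for smoothing the distribution), and replicas for a key are found by walking clockwise from the key's position. The class shape and the choice of CRC32 as a hash are illustrative assumptions.

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.zip.CRC32;

public class ConsistentHash<T> {
    private final SortedMap<Long, T> ring = new TreeMap<>();
    private final int virtualNodes;

    public ConsistentHash(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    public void addNode(T node) {
        // Several positions per physical node smooth out the distribution
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash((node + "#" + i).getBytes(StandardCharsets.UTF_8)), node);
        }
    }

    public void removeNode(T node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash((node + "#" + i).getBytes(StandardCharsets.UTF_8)));
        }
    }

    // Walk clockwise from the key's position, collecting distinct nodes
    public List<T> getNodes(byte[] key, int count) {
        List<T> result = new ArrayList<>();
        if (ring.isEmpty()) {
            return result;
        }
        long point = hash(key);
        for (T node : ring.tailMap(point).values()) {   // positions at or after the key
            if (!result.contains(node)) result.add(node);
            if (result.size() == count) return result;
        }
        for (T node : ring.headMap(point).values()) {   // wrap around the circle
            if (!result.contains(node)) result.add(node);
            if (result.size() == count) return result;
        }
        return result;
    }

    // Any well-distributed hash works; CRC32 keeps the example short
    private long hash(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes);
        return crc.getValue();
    }
}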
public class ConsistentHashDistributedStorage implements Storage {
    private final Storage localStore;
    private final ConsistentHash<String> ring;
    private final Map<String, NetworkClient> peerClients;
    private final String nodeId;
    private final int replicationFactor;

    @Override
    public boolean put(byte[] key, byte[] value, long ttlSeconds) {
        // Find which computers should store this key
        List<String> responsibleNodes = ring.getNodes(key, replicationFactor);
        // Store locally if this computer is responsible
        if (responsibleNodes.contains(nodeId)) {
            localStore.put(key, value, ttlSeconds);
        }
        // Send to the other responsible computers; building a list (rather
        // than a fixed-size array) avoids null entries when this node is
        // itself one of the replicas
        List<CompletableFuture<?>> futures = new ArrayList<>();
        for (String node : responsibleNodes) {
            if (!node.equals(nodeId)) {
                futures.add(peerClients.get(node).put(key, value, ttlSeconds));
            }
        }
        // Wait for the replicas to confirm
        try {
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
                    .get(5, TimeUnit.SECONDS);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    @Override
    public byte[] get(byte[] key) {
        // Check if we have it locally
        byte[] localValue = localStore.get(key);
        if (localValue != null) {
            return localValue;
        }
        // Otherwise, ask the other responsible computers in turn
        List<String> responsibleNodes = ring.getNodes(key, replicationFactor);
        for (String node : responsibleNodes) {
            if (!node.equals(nodeId)) {
                try {
                    return peerClients.get(node).get(key).get(5, TimeUnit.SECONDS);
                } catch (Exception e) {
                    // Try the next computer
                }
            }
        }
        return null;
    }
}
Advantages and Disadvantages of Consistent Hashing
Good things about Consistent Hashing:
- Scales well with more computers
- No central coordinator needed
- Predictable data distribution
- Can balance consistency and availability
Problems with Consistent Hashing:
- Rebalancing can be tricky
- No strong consistency across all keys
- Requires careful handling of computer failures
Primary-Backup: A Simple Approach
In this approach, one computer (the primary) handles all writes for a set of keys and copies the data to backup computers.
How Primary-Backup Works
Clients send write requests to the primary computer, which applies them locally and then forwards them to backups. Read requests can go to any computer.
public class PrimaryBackupStorage implements Storage {
    private final Storage localStore;
    private final Map<String, NetworkClient> peerClients;
    private final String nodeId;
    private final Map<Range, String> primaryMap;       // Which computer is primary for which keys
    private final Map<Range, List<String>> backupMap;  // Which computers are backups

    @Override
    public boolean put(byte[] key, byte[] value, long ttlSeconds) {
        Range range = findRange(key);
        String primaryNode = primaryMap.get(range);
        if (nodeId.equals(primaryNode)) {
            // This computer is primary for this key
            localStore.put(key, value, ttlSeconds);
            // Copy to backups
            List<String> backups = backupMap.get(range);
            CompletableFuture<?>[] futures = new CompletableFuture[backups.size()];
            for (int i = 0; i < backups.size(); i++) {
                futures[i] = peerClients.get(backups.get(i)).put(key, value, ttlSeconds);
            }
            try {
                // Wait for all backups to confirm
                CompletableFuture.allOf(futures).get(5, TimeUnit.SECONDS);
                return true;
            } catch (Exception e) {
                return false;
            }
        } else {
            // Forward to primary
            try {
                return peerClients.get(primaryNode).put(key, value, ttlSeconds)
                        .get(5, TimeUnit.SECONDS);
            } catch (Exception e) {
                // Primary might be down
                return false;
            }
        }
    }
}
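The text above says reads can go to any computer, but the class only shows the write path. A sketch of the matching get follows; it reuses the maps and the findRange helper from the class above, and a read served by a backup may briefly lag the primary.

@Override
public byte[] get(byte[] key) {
    Range range = findRange(key);
    // Any replica of this range (primary or backup) can serve the read
    if (nodeId.equals(primaryMap.get(range))
            || backupMap.get(range).contains(nodeId)) {
        return localStore.get(key);
    }
    // Otherwise forward the read to the primary for this range
    try {
        return peerClients.get(primaryMap.get(range)).get(key)
                .get(5, TimeUnit.SECONDS);
    } catch (Exception e) {
        return null;
    }
}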
Advantages and Disadvantages of Primary-Backup
Good things about Primary-Backup:
- Simpler than consensus algorithms
- Strong consistency for each key range
- Fast reads (any copy can serve them, though backups may briefly lag the primary)
- Easy to understand
Problems with Primary-Backup:
- Primary computer is a single point of failure
- Manual recovery needed when primary fails
- Load balancing is difficult
- Limited scalability
Gossip Protocols: Spreading Information Gradually
Gossip protocols work like rumors in a social network: each computer periodically shares information with a few others, and eventually every computer learns it.
How Gossip Protocols Work
Each computer tracks a version number for each piece of data it has. Periodically, it shares this information with other randomly selected computers to detect and fix inconsistencies.
public class GossipDistributedStorage implements Storage {
    private final Storage localStore;
    private final Map<String, NetworkClient> peerClients;
    private final String nodeId;
    private final ScheduledExecutorService gossipService;
    private final Map<KeyWrapper, Long> versionMap = new ConcurrentHashMap<>();

    public GossipDistributedStorage(Storage localStore,
                                    Map<String, NetworkClient> peerClients,
                                    String nodeId) {
        this.localStore = localStore;
        this.peerClients = peerClients;
        this.nodeId = nodeId;
        // Schedule regular gossip rounds
        gossipService = Executors.newScheduledThreadPool(1);
        gossipService.scheduleAtFixedRate(this::gossip, 1, 1, TimeUnit.SECONDS);
    }

    private void gossip() {
        // Pick a random computer to talk to
        String peer = selectRandomPeer();
        if (peer != null) {
            // Send a summary (digest) of our data versions
            Map<KeyWrapper, Long> digest = new HashMap<>(versionMap);
            peerClients.get(peer).sendDigest(digest)
                    .thenAccept(this::handleDigestResponse);
        }
    }

    @Override
    public boolean put(byte[] key, byte[] value, long ttlSeconds) {
        KeyWrapper keyWrapper = new KeyWrapper(key);
        // Update local version
        long version = versionMap.getOrDefault(keyWrapper, 0L) + 1;
        versionMap.put(keyWrapper, version);
        // Store locally
        localStore.put(key, value, ttlSeconds);
        // No immediate sharing - the update spreads during gossip rounds
        return true;
    }
}
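The class above calls selectRandomPeer and handleDigestResponse without defining them. A minimal sketch of both is below; the DigestResponse type and its accessors are assumptions about what a peer's reply to a digest might contain.

// Pick one peer at random for this gossip round
private String selectRandomPeer() {
    List<String> peers = new ArrayList<>(peerClients.keySet());
    if (peers.isEmpty()) {
        return null;
    }
    return peers.get(ThreadLocalRandom.current().nextInt(peers.size()));
}

// The peer replies with the keys where its version is newer than the
// digest we sent; for each one, we adopt its value and version
private void handleDigestResponse(DigestResponse response) {
    for (Map.Entry<KeyWrapper, Long> entry : response.getNewerKeys().entrySet()) {
        KeyWrapper key = entry.getKey();
        long peerVersion = entry.getValue();
        if (peerVersion > versionMap.getOrDefault(key, 0L)) {
            localStore.put(key.getBytes(), response.getValue(key),
                    response.getTtlSeconds(key));
            versionMap.put(key, peerVersion);
        }
    }
}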
Advantages and Disadvantages of Gossip Protocols
Good things about Gossip Protocols:
- High availability
- Works well when computers join and leave frequently
- Scales to many computers
- Self-healing
Problems with Gossip Protocols:
- Consistency is not immediate
- Updates take time to spread
- Uses more metadata
- Time to consistency can vary
Write-Ahead Log Shipping: Learning from Databases
Write-ahead logging (WAL) is a technique databases use for durability. By shipping these logs to other computers, we can replicate data across a cluster.
How WAL Shipping Works
Operations are first recorded in a local log, applied locally, and then sent to other computers to be applied in the same order.
public class WALShippingStorage implements Storage {
    private final Storage localStore;
    private final WriteAheadLog wal;
    private final Map<String, NetworkClient> peerClients;

    @Override
    public boolean put(byte[] key, byte[] value, long ttlSeconds) {
        // Create log entry
        WALEntry entry = new WALEntry(WALEntry.Type.PUT, key, value, ttlSeconds);
        // Add to local log
        wal.append(entry);
        // Apply to local store
        localStore.put(key, value, ttlSeconds);
        // Send log entry to other computers
        for (NetworkClient client : peerClients.values()) {
            client.shipWALEntry(entry).exceptionally(ex -> {
                // Handle failure
                return null;
            });
        }
        return true;
    }
}
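On the receiving side, shipped entries must be applied in the order they were written. A sketch of that path follows; the sequence number on WALEntry and the applyWALEntry hook are assumptions about how entries might be ordered and delivered.

// Hypothetical receive path: apply entries strictly in log order
private long lastAppliedSequence = 0;

public synchronized void applyWALEntry(WALEntry entry) {
    // Skip entries we have already applied (e.g. re-sent after a retry)
    if (entry.getSequence() <= lastAppliedSequence) {
        return;
    }
    // Append to our own log first for durability, then apply
    wal.append(entry);
    if (entry.getType() == WALEntry.Type.PUT) {
        localStore.put(entry.getKey(), entry.getValue(), entry.getTtlSeconds());
    }
    lastAppliedSequence = entry.getSequence();
}

A real implementation would also buffer or re-request entries that arrive out of order; this sketch assumes in-order delivery from each peer.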
Advantages and Disadvantages of WAL Shipping
Good things about WAL Shipping:
- Simple to implement
- Natural ordering of operations
- Good for disaster recovery
- Efficient for sharing multiple operations
Problems with WAL Shipping:
- No automatic recovery when computers fail
- Replicas can drift apart during network problems
- May require manual fixing
- No built-in consistency guarantees
Choosing the Right Approach for Your System
The best approach depends on what your system needs:
If data consistency is most important:
- Raft is the most reliable option
- Primary-backup is simpler but requires manual recovery
- Consistent hashing offers better scaling
If high availability is most important:
- CRDTs provide mathematically guaranteed eventual consistency
- Gossip protocols work well in changing environments
If simplicity is most important:
- WAL shipping is easy to implement
- Primary-backup is straightforward to understand
If you need to handle massive scale:
- Consistent hashing and CRDTs scale well
- Raft slows down as the cluster grows, since every write needs a majority
Important Implementation Considerations
Regardless of which approach you choose, you’ll need to address these aspects:
- Communication Protocol: How computers talk to each other
- Membership Management: How computers find each other and handle failures
- Failure Detection: How to detect when a computer is down (a minimal example follows this list)
- State Transfer: How a new or recovered computer catches up
- Client Interface: How applications interact with your distributed system
- Monitoring: How to observe system health and performance
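As one concrete example of the failure detection point, here is a minimal heartbeat-based detector: each node pings its peers on a schedule and marks a peer as suspected after several missed replies. The ping method on NetworkClient, the one-second interval, and the threshold of three are all illustrative assumptions.

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class HeartbeatFailureDetector {
    private final Map<String, NetworkClient> peerClients;
    private final Map<String, Integer> missedHeartbeats = new ConcurrentHashMap<>();
    private final Set<String> suspectedDown = ConcurrentHashMap.newKeySet();
    private static final int SUSPECT_THRESHOLD = 3; // missed pings before suspecting

    public HeartbeatFailureDetector(Map<String, NetworkClient> peerClients) {
        this.peerClients = peerClients;
        Executors.newSingleThreadScheduledExecutor()
                .scheduleAtFixedRate(this::pingAll, 1, 1, TimeUnit.SECONDS);
    }

    private void pingAll() {
        for (Map.Entry<String, NetworkClient> e : peerClients.entrySet()) {
            String peer = e.getKey();
            // Assumed API: ping() returns a CompletableFuture that fails on timeout
            e.getValue().ping().orTimeout(500, TimeUnit.MILLISECONDS)
                    .whenComplete((ok, err) -> {
                        if (err == null) {
                            // Peer answered: reset its counter
                            missedHeartbeats.put(peer, 0);
                            suspectedDown.remove(peer);
                        } else if (missedHeartbeats.merge(peer, 1, Integer::sum)
                                >= SUSPECT_THRESHOLD) {
                            suspectedDown.add(peer);
                        }
                    });
        }
    }

    public boolean isSuspected(String peer) {
        return suspectedDown.contains(peer);
    }
}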
Conclusion
Building a distributed key-value store involves balancing consistency, availability, and handling network problems. Raft provides strong consistency but may be less available during network issues. CRDTs and gossip protocols prioritize availability but provide weaker consistency guarantees.
There’s no perfect solution that works for every situation. The best approach depends on your specific needs. Sometimes, using different methods for different parts of your system makes sense. For example, you might use Raft for critical configuration data and CRDTs for user data.
By understanding the different approaches available and what trade-offs they make, you can choose the right design for your distributed system. Consider your requirements for consistency, availability, scalability, and simplicity to select the best architecture for your needs.