Distributed System Implementations: Building Key-Value Stores
Many modern applications need to be fast, scalable, and always available. To achieve this, they are often built as distributed systems: software that runs on many computers working together to store and process data.
One important type of distributed system is the key-value store, which saves data as key-value pairs. This model is simple but very powerful, and many popular tools (databases, caches, and service discovery systems) are built on it.
However, building a key-value store that works across multiple machines is not easy. We need to think about problems like:
- What happens if one machine fails?
- How do we make sure the data is correct and up to date?
- How do we spread data evenly across machines?
These questions lead back to core ideas in distributed systems. It is easy to focus on the latest tools and forget these fundamentals, but understanding them is what makes it possible to build reliable, scalable systems.
In this blog post, we will look at some popular design ideas used in distributed key-value stores:
- Raft – helps keep data consistent across machines.
- CRDTs – allow updates without conflicts and support offline use.
- Consistent Hashing – spreads data evenly and handles machine changes.
- Primary-Backup – makes the system simple and fault-tolerant.
- Gossip Protocols – let machines share information efficiently without central coordination.
Each method has pros and cons. Choosing the right one depends on what your system needs.
By the end of this post, you’ll understand the main ideas behind these systems and when to use them.
Understanding the Basic Trade-offs
When building a distributed system, you must balance three properties described by the CAP theorem:
- Consistency: All computers see the same data at the same time
- Availability: The system continues to work even when some parts fail
- Partition tolerance: The system works even when network connections between computers fail
Most distributed systems must handle network problems, so the real choice is usually between consistency and availability. We’ll look at both types of systems in this post.
Raft: A Popular Consensus Algorithm
Raft is one of the most widely used approaches for building distributed systems that prioritize consistency. It’s used in systems like etcd and Consul.
How Raft Works
Raft uses a leader-follower model:
- Leader Election: The system chooses one computer as the leader
- Log Replication: The leader receives all write requests and shares them with followers
- State Machine: Once a majority of computers have saved an entry, it is committed and applied to the key-value store
public class RaftDistributedStorage implements Storage {
    private final RaftNode raftNode;
    private final Storage localStore;

    @Override
    public boolean put(byte[] key, byte[] value, long ttlSeconds) {
        // Create command
        Command putCommand = new PutCommand(key, value, ttlSeconds);
        // Submit to Raft
        try {
            // Wait for the command to be saved on a majority of computers
            CompletableFuture<Boolean> future = raftNode.submit(putCommand);
            return future.get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            return false;
        }
    }

    @Override
    public byte[] get(byte[] key) {
        // Read locally if this node's state is known to be current.
        // (A production system would use a read index or leader lease
        // to make such reads linearizable.)
        if (raftNode.isLeader() || raftNode.isUpToDate()) {
            return localStore.get(key);
        } else {
            // Ask the leader
            try {
                return raftNode.forwardToLeader(new GetCommand(key))
                        .get(5, TimeUnit.SECONDS);
            } catch (Exception e) {
                return null;
            }
        }
    }
}
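The third step above, applying committed entries to the state machine, is not shown in the class. A minimal sketch of that step follows; the onCommitted hook and the getters on PutCommand are assumptions about how a RaftNode might hand committed entries to the storage layer.

// Hypothetical callback registered with the RaftNode: invoked only after
// an entry has been replicated on a majority of nodes and committed
private void onCommitted(Command command) {
    if (command instanceof PutCommand) {
        PutCommand put = (PutCommand) command;
        // Only committed writes reach the local key-value store
        localStore.put(put.getKey(), put.getValue(), put.getTtlSeconds());
    }
}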
Advantages and Disadvantages of Raft
Good things about Raft:
- Strong data consistency
- Handles computer failures well
- Easier to understand than older methods
- Well-tested in real-world systems
Problems with Raft:
- Can be slower because of the consensus process
- Less available during network problems
- Needs most computers to be working
- Writes get slower as the cluster grows, since a majority must confirm each one
You can find more details about Raft at https://raft.github.io/
CRDTs: Prioritizing Availability
Conflict-free Replicated Data Types (CRDTs) offer an alternative approach that prioritizes availability over strong consistency.
How CRDTs Work
CRDTs are special data structures that can handle conflicting updates without requiring strict coordination between computers. The key is that all operations must be:
- Order-independent: replicas can apply updates in any order and reach the same result
- Merge-friendly: concurrent updates can be combined deterministically
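The class below leans on a CRDTValue type that knows how to merge two concurrent versions of the same key. Here is a minimal sketch of what it might look like, implemented as a last-writer-wins register: the merge keeps the value with the higher timestamp and breaks ties by node ID, so every replica resolves the same conflict the same way. The field names and the use of wall-clock time are illustrative assumptions.

public class CRDTValue {
    private final String nodeId;   // Which computer wrote this value
    private final byte[] value;
    private final long ttlSeconds;
    private final long timestamp;  // Write time used for conflict resolution

    public CRDTValue(String nodeId, byte[] value, long ttlSeconds) {
        this(nodeId, value, ttlSeconds, System.currentTimeMillis());
    }

    public CRDTValue(String nodeId, byte[] value, long ttlSeconds, long timestamp) {
        this.nodeId = nodeId;
        this.value = value;
        this.ttlSeconds = ttlSeconds;
        this.timestamp = timestamp;
    }

    // Last-writer-wins merge: the result is the same no matter which
    // replica merges first, and ties are broken deterministically
    public CRDTValue merge(CRDTValue other) {
        if (other.timestamp != this.timestamp) {
            return other.timestamp > this.timestamp ? other : this;
        }
        return other.nodeId.compareTo(this.nodeId) > 0 ? other : this;
    }

    public byte[] getValue() { return value; }
    public long getTtlSeconds() { return ttlSeconds; }
}

Because this merge is commutative, associative, and idempotent, replicas can exchange updates in any order, any number of times, and still converge to the same value.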
public class CRDTDistributedStorage implements Storage {
    private final Storage localStore;
    private final Map<String, NetworkClient> peerClients;
    private final String nodeId;
    // Use a CRDT Map
    private final Map<KeyWrapper, CRDTValue> crdtStore = new ConcurrentHashMap<>();

    @Override
    public boolean put(byte[] key, byte[] value, long ttlSeconds) {
        // Update local store
        localStore.put(key, value, ttlSeconds);
        // Create CRDT update
        CRDTValue crdtValue = new CRDTValue(nodeId, value, ttlSeconds);
        crdtStore.put(new KeyWrapper(key), crdtValue);
        // Share with other computers
        for (NetworkClient client : peerClients.values()) {
            client.sendCRDTUpdate(key, crdtValue).exceptionally(ex -> {
                // We can try again later if this fails
                return null;
            });
        }
        return true;
    }

    // Combine updates from other computers
    public void mergePeerUpdate(byte[] key, CRDTValue peerValue) {
        KeyWrapper keyWrapper = new KeyWrapper(key);
        CRDTValue localValue = crdtStore.get(keyWrapper);
        if (localValue == null) {
            // Just use the peer's value if we don't have one
            crdtStore.put(keyWrapper, peerValue);
            localStore.put(key, peerValue.getValue(), peerValue.getTtlSeconds());
        } else {
            // Combine our value with the peer's value
            CRDTValue mergedValue = localValue.merge(peerValue);
            crdtStore.put(keyWrapper, mergedValue);
            localStore.put(key, mergedValue.getValue(), mergedValue.getTtlSeconds());
        }
    }
}
Advantages and Disadvantages of CRDTs
Good things about CRDTs:
- High availability (works during network problems)
- No leader needed
- Eventually all computers will have the same data
- Simpler than consensus algorithms
Problems with CRDTs:
- Data may be temporarily inconsistent
- May need complex rules for resolving conflicts
- Requires more metadata
Consistent Hashing: Distributing Data Evenly
Consistent hashing is a way to distribute data across computers that minimizes disruption when computers are added or removed.
How Consistent Hashing Works
Keys are hashed to positions on a circle, and each computer is responsible for a section of the circle. When a computer joins or leaves, only the keys in neighboring sections move, which is why disruption stays small. For reliability, each key is stored on several consecutive computers.
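Before the storage class itself, here is a minimal sketch of the ConsistentHash ring it uses. Each physical node is placed at several virtual positions on the circle (a common trick for smoothing the distribution), and replicas for a key are found by walking clockwise from the key's position. The class shape and the choice of CRC32 as a hash are illustrative assumptions.

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.zip.CRC32;

public class ConsistentHash<T> {
    private final SortedMap<Long, T> ring = new TreeMap<>();
    private final int virtualNodes;

    public ConsistentHash(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    public void addNode(T node) {
        // Several positions per physical node smooth out the distribution
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash((node + "#" + i).getBytes(StandardCharsets.UTF_8)), node);
        }
    }

    public void removeNode(T node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash((node + "#" + i).getBytes(StandardCharsets.UTF_8)));
        }
    }

    // Walk clockwise from the key's position, collecting distinct nodes
    public List<T> getNodes(byte[] key, int count) {
        List<T> result = new ArrayList<>();
        if (ring.isEmpty()) {
            return result;
        }
        long point = hash(key);
        for (T node : ring.tailMap(point).values()) {   // positions at or after the key
            if (!result.contains(node)) result.add(node);
            if (result.size() == count) return result;
        }
        for (T node : ring.headMap(point).values()) {   // wrap around the circle
            if (!result.contains(node)) result.add(node);
            if (result.size() == count) return result;
        }
        return result;
    }

    // Any well-distributed hash works; CRC32 keeps the example short
    private long hash(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes);
        return crc.getValue();
    }
}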
public class ConsistentHashDistributedStorage implements Storage {
    private final Storage localStore;
    private final ConsistentHash<String> ring;
    private final Map<String, NetworkClient> peerClients;
    private final String nodeId;
    private final int replicationFactor;

    @Override
    public boolean put(byte[] key, byte[] value, long ttlSeconds) {
        // Find which computers should store this key
        List<String> responsibleNodes = ring.getNodes(key, replicationFactor);
        // Store locally if this computer is responsible
        if (responsibleNodes.contains(nodeId)) {
            localStore.put(key, value, ttlSeconds);
        }
        // Send to the other responsible computers; building a list (rather
        // than a fixed-size array) avoids null entries when this node is
        // itself one of the replicas
        List<CompletableFuture<?>> futures = new ArrayList<>();
        for (String node : responsibleNodes) {
            if (!node.equals(nodeId)) {
                futures.add(peerClients.get(node).put(key, value, ttlSeconds));
            }
        }
        // Wait for the replicas to confirm
        try {
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
                    .get(5, TimeUnit.SECONDS);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    @Override
    public byte[] get(byte[] key) {
        // Check if we have it locally
        byte[] localValue = localStore.get(key);
        if (localValue != null) {
            return localValue;
        }
        // Otherwise, ask the other responsible computers in turn
        List<String> responsibleNodes = ring.getNodes(key, replicationFactor);
        for (String node : responsibleNodes) {
            if (!node.equals(nodeId)) {
                try {
                    return peerClients.get(node).get(key).get(5, TimeUnit.SECONDS);
                } catch (Exception e) {
                    // Try the next computer
                }
            }
        }
        return null;
    }
}
Advantages and Disadvantages of Consistent Hashing
Good things about Consistent Hashing:
- Scales well with more computers
- No central coordinator needed
- Predictable data distribution
- Can balance consistency and availability
Problems with Consistent Hashing:
- Rebalancing can be tricky
- No strong consistency across all keys
- Requires careful handling of computer failures
Primary-Backup: A Simple Approach
In this approach, one computer (the primary) handles all writes for a set of keys and copies the data to backup computers.
How Primary-Backup Works
Clients send write requests to the primary computer, which applies them locally and then forwards them to backups. Read requests can go to any computer.
public class PrimaryBackupStorage implements Storage {
    private final Storage localStore;
    private final Map<String, NetworkClient> peerClients;
    private final String nodeId;
    private final Map<Range, String> primaryMap;       // Which computer is primary for which keys
    private final Map<Range, List<String>> backupMap;  // Which computers are backups

    @Override
    public boolean put(byte[] key, byte[] value, long ttlSeconds) {
        Range range = findRange(key);
        String primaryNode = primaryMap.get(range);
        if (nodeId.equals(primaryNode)) {
            // This computer is primary for this key
            localStore.put(key, value, ttlSeconds);
            // Copy to backups
            List<String> backups = backupMap.get(range);
            CompletableFuture<?>[] futures = new CompletableFuture[backups.size()];
            for (int i = 0; i < backups.size(); i++) {
                futures[i] = peerClients.get(backups.get(i)).put(key, value, ttlSeconds);
            }
            try {
                // Wait for all backups to confirm
                CompletableFuture.allOf(futures).get(5, TimeUnit.SECONDS);
                return true;
            } catch (Exception e) {
                return false;
            }
        } else {
            // Forward to primary
            try {
                return peerClients.get(primaryNode).put(key, value, ttlSeconds)
                        .get(5, TimeUnit.SECONDS);
            } catch (Exception e) {
                // Primary might be down
                return false;
            }
        }
    }
}
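The text above says reads can go to any computer, but the class only shows the write path. A sketch of the matching get follows; it reuses the maps and the findRange helper from the class above, and a read served by a backup may briefly lag the primary.

@Override
public byte[] get(byte[] key) {
    Range range = findRange(key);
    // Any replica of this range (primary or backup) can serve the read
    if (nodeId.equals(primaryMap.get(range))
            || backupMap.get(range).contains(nodeId)) {
        return localStore.get(key);
    }
    // Otherwise forward the read to the primary for this range
    try {
        return peerClients.get(primaryMap.get(range)).get(key)
                .get(5, TimeUnit.SECONDS);
    } catch (Exception e) {
        return null;
    }
}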
Advantages and Disadvantages of Primary-Backup
Good things about Primary-Backup:
- Simpler than consensus algorithms
- Strong consistency for each key range
- Fast reads (any copy can serve them, though backups may briefly lag the primary)
- Easy to understand
Problems with Primary-Backup:
- Primary computer is a single point of failure
- Manual recovery needed when primary fails
- Load balancing is difficult
- Limited scalability
Gossip Protocols: Spreading Information Gradually
Gossip protocols work like rumors in a social network: each computer periodically shares information with a few others, and eventually every computer learns it.
How Gossip Protocols Work
Each computer tracks a version number for each piece of data it has. Periodically, it shares this information with other randomly selected computers to detect and fix inconsistencies.
public class GossipDistributedStorage implements Storage {
    private final Storage localStore;
    private final Map<String, NetworkClient> peerClients;
    private final String nodeId;
    private final ScheduledExecutorService gossipService;
    private final Map<KeyWrapper, Long> versionMap = new ConcurrentHashMap<>();

    public GossipDistributedStorage(Storage localStore,
                                    Map<String, NetworkClient> peerClients,
                                    String nodeId) {
        this.localStore = localStore;
        this.peerClients = peerClients;
        this.nodeId = nodeId;
        // Schedule regular gossip rounds
        gossipService = Executors.newScheduledThreadPool(1);
        gossipService.scheduleAtFixedRate(this::gossip, 1, 1, TimeUnit.SECONDS);
    }

    private void gossip() {
        // Pick a random computer to talk to
        String peer = selectRandomPeer();
        if (peer != null) {
            // Send a summary (digest) of our data versions
            Map<KeyWrapper, Long> digest = new HashMap<>(versionMap);
            peerClients.get(peer).sendDigest(digest)
                    .thenAccept(this::handleDigestResponse);
        }
    }

    @Override
    public boolean put(byte[] key, byte[] value, long ttlSeconds) {
        KeyWrapper keyWrapper = new KeyWrapper(key);
        // Update local version
        long version = versionMap.getOrDefault(keyWrapper, 0L) + 1;
        versionMap.put(keyWrapper, version);
        // Store locally
        localStore.put(key, value, ttlSeconds);
        // No immediate sharing - the update spreads during gossip rounds
        return true;
    }
}
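The class above calls selectRandomPeer and handleDigestResponse without defining them. A minimal sketch of both is below; the DigestResponse type and its accessors are assumptions about what a peer's reply to a digest might contain.

// Pick one peer at random for this gossip round
private String selectRandomPeer() {
    List<String> peers = new ArrayList<>(peerClients.keySet());
    if (peers.isEmpty()) {
        return null;
    }
    return peers.get(ThreadLocalRandom.current().nextInt(peers.size()));
}

// The peer replies with the keys where its version is newer than the
// digest we sent; for each one, we adopt its value and version
private void handleDigestResponse(DigestResponse response) {
    for (Map.Entry<KeyWrapper, Long> entry : response.getNewerKeys().entrySet()) {
        KeyWrapper key = entry.getKey();
        long peerVersion = entry.getValue();
        if (peerVersion > versionMap.getOrDefault(key, 0L)) {
            localStore.put(key.getBytes(), response.getValue(key),
                    response.getTtlSeconds(key));
            versionMap.put(key, peerVersion);
        }
    }
}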
Advantages and Disadvantages of Gossip Protocols
Good things about Gossip Protocols:
- High availability
- Works well when computers join and leave frequently
- Scales to many computers
- Self-healing
Problems with Gossip Protocols:
- Consistency is not immediate
- Updates take time to spread
- Uses more metadata
- Time to consistency can vary
Write-Ahead Log Shipping: Learning from Databases
Write-ahead logging (WAL) is a technique databases use for durability. By shipping these logs to other computers, we can replicate data across a cluster.
How WAL Shipping Works
Operations are first recorded in a local log, applied locally, and then sent to other computers to be applied in the same order.
public class WALShippingStorage implements Storage {
    private final Storage localStore;
    private final WriteAheadLog wal;
    private final Map<String, NetworkClient> peerClients;

    @Override
    public boolean put(byte[] key, byte[] value, long ttlSeconds) {
        // Create log entry
        WALEntry entry = new WALEntry(WALEntry.Type.PUT, key, value, ttlSeconds);
        // Add to local log
        wal.append(entry);
        // Apply to local store
        localStore.put(key, value, ttlSeconds);
        // Send log entry to other computers
        for (NetworkClient client : peerClients.values()) {
            client.shipWALEntry(entry).exceptionally(ex -> {
                // Handle failure
                return null;
            });
        }
        return true;
    }
}
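On the receiving side, shipped entries must be applied in the order they were written. A sketch of that path follows; the sequence number on WALEntry and the applyWALEntry hook are assumptions about how entries might be ordered and delivered.

// Hypothetical receive path: apply entries strictly in log order
private long lastAppliedSequence = 0;

public synchronized void applyWALEntry(WALEntry entry) {
    // Skip entries we have already applied (e.g. re-sent after a retry)
    if (entry.getSequence() <= lastAppliedSequence) {
        return;
    }
    // Append to our own log first for durability, then apply
    wal.append(entry);
    if (entry.getType() == WALEntry.Type.PUT) {
        localStore.put(entry.getKey(), entry.getValue(), entry.getTtlSeconds());
    }
    lastAppliedSequence = entry.getSequence();
}

A real implementation would also buffer or re-request entries that arrive out of order; this sketch assumes in-order delivery from each peer.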
Advantages and Disadvantages of WAL Shipping
Good things about WAL Shipping:
- Simple to implement
- Natural ordering of operations
- Good for disaster recovery
- Efficient for sharing multiple operations
Problems with WAL Shipping:
- No automatic recovery when computers fail
- Replicas can drift apart during network problems
- May require manual fixing
- No built-in consistency guarantees
Choosing the Right Approach for Your System
The best approach depends on what your system needs:
If data consistency is most important:
- Raft is the most reliable option
- Primary-backup is simpler but requires manual recovery
- Consistent hashing offers better scaling
If high availability is most important:
- CRDTs provide mathematically guaranteed eventual consistency
- Gossip protocols work well in changing environments
If simplicity is most important:
- WAL shipping is easy to implement
- Primary-backup is straightforward to understand
If you need to handle massive scale:
- Consistent hashing and CRDTs scale well
- Raft slows down as the cluster grows, since every write needs a majority
Important Implementation Considerations
Regardless of which approach you choose, you’ll need to address these aspects:
- Communication Protocol: How computers talk to each other
- Membership Management: How computers find each other and handle failures
- Failure Detection: How to detect when a computer is down (a minimal example follows this list)
- State Transfer: How a new or recovered computer catches up
- Client Interface: How applications interact with your distributed system
- Monitoring: How to observe system health and performance
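As one concrete example of the failure detection point, here is a minimal heartbeat-based detector: each node pings its peers on a schedule and marks a peer as suspected after several missed replies. The ping method on NetworkClient, the one-second interval, and the threshold of three are all illustrative assumptions.

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class HeartbeatFailureDetector {
    private final Map<String, NetworkClient> peerClients;
    private final Map<String, Integer> missedHeartbeats = new ConcurrentHashMap<>();
    private final Set<String> suspectedDown = ConcurrentHashMap.newKeySet();
    private static final int SUSPECT_THRESHOLD = 3; // missed pings before suspecting

    public HeartbeatFailureDetector(Map<String, NetworkClient> peerClients) {
        this.peerClients = peerClients;
        Executors.newSingleThreadScheduledExecutor()
                .scheduleAtFixedRate(this::pingAll, 1, 1, TimeUnit.SECONDS);
    }

    private void pingAll() {
        for (Map.Entry<String, NetworkClient> e : peerClients.entrySet()) {
            String peer = e.getKey();
            // Assumed API: ping() returns a CompletableFuture that fails on timeout
            e.getValue().ping().orTimeout(500, TimeUnit.MILLISECONDS)
                    .whenComplete((ok, err) -> {
                        if (err == null) {
                            // Peer answered: reset its counter
                            missedHeartbeats.put(peer, 0);
                            suspectedDown.remove(peer);
                        } else if (missedHeartbeats.merge(peer, 1, Integer::sum)
                                >= SUSPECT_THRESHOLD) {
                            suspectedDown.add(peer);
                        }
                    });
        }
    }

    public boolean isSuspected(String peer) {
        return suspectedDown.contains(peer);
    }
}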
Conclusion
Building a distributed key-value store involves balancing consistency, availability, and handling network problems. Raft provides strong consistency but may be less available during network issues. CRDTs and gossip protocols prioritize availability but provide weaker consistency guarantees.
There’s no perfect solution that works for every situation. The best approach depends on your specific needs. Sometimes, using different methods for different parts of your system makes sense. For example, you might use Raft for critical configuration data and CRDTs for user data.
By understanding the different approaches available and what trade-offs they make, you can choose the right design for your distributed system. Consider your requirements for consistency, availability, scalability, and simplicity to select the best architecture for your needs.