
Redis Sentinel vs Cluster for High Availability: A Production Engineer's Guide

Bekzod Erkinov · May 17, 2026 · Performance · 10 min read

Photo by Logan Voss on Unsplash

When a Redis instance dies at 3 AM, the answer to "what happens next?" should not be "the on-call engineer wakes up." That single sentence is what high availability is about, and it is also the cleanest way to frame the choice between Redis Sentinel and Redis Cluster. Both eliminate the single point of failure, but they solve different problems, fail in different ways, and impose different costs on your application code.

This tutorial walks through both topologies end-to-end: what they are, how they actually fail over, how to deploy them in production, the trade-offs that matter at scale, and a concrete decision framework you can apply to your own workload.


1. The 60-second comparison

Dimension | Sentinel | Cluster
Primary goal | Automatic failover for a single dataset | Failover plus horizontal sharding
Dataset size | Fits in one node's RAM | Larger than any one node can hold
Topology | 1 primary + N replicas, watched by ≥3 Sentinels | 16384 hash slots split across M primaries, each with ≥1 replica
Client awareness | Client asks Sentinel "who is primary?" | Client routes by slot, follows MOVED/ASK redirects
Multi-key ops | Fully supported | Only if keys share a hash tag ({user:42})
Min nodes (prod) | 3 Sentinels + 1 primary + 2 replicas = 6 processes | 3 primaries + 3 replicas = 6 processes
Read scaling | Replicas (with stale-read caveats) | Replicas per shard
Write scaling | None — single primary | Linear with shard count
Operational complexity | Low–Medium | Medium–High

Rule of thumb: if your working set fits on one machine and you just need failover, choose Sentinel. If you need to shard writes across nodes, choose Cluster. Don't pick Cluster "just in case" — its constraints leak into your application.


2. Redis Sentinel: failover for a single dataset

2.1 What Sentinel actually is

Sentinel is not a proxy. It is a separate process (redis-sentinel) that runs alongside Redis and does three jobs:

  1. Monitors the primary and its replicas, pinging every second.
  2. Notifies operators when something looks wrong.
  3. Orchestrates failover when a quorum of Sentinels agrees the primary is dead — promotes a replica, reconfigures the others to replicate from the new primary, and publishes the new topology.

Clients connect to Sentinels (typically with a Sentinel-aware Redis library), ask "who is the current primary for service mymaster?", get an address, then connect directly to that primary for reads and writes.
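
You can ask the same question from a shell; any of the three Sentinels will answer with the address it currently believes is the primary (addresses below match the layout in section 2.2):

redis-cli -h 10.0.1.10 -p 26379 sentinel get-master-addr-by-name mymaster
1) "10.0.1.10"
2) "6379"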

2.2 The minimum viable production layout

Three Sentinels is the floor. Two cannot survive a network partition without risking split-brain, and one is not a quorum at all. Spread them across three availability zones — putting them in the same AZ defeats the entire point.

AZ-a:  redis-primary       sentinel-1
AZ-b:  redis-replica-1     sentinel-2
AZ-c:  redis-replica-2     sentinel-3

2.3 Config: redis.conf for replicas

# /etc/redis/redis.conf (on each replica)
port 6379
bind 0.0.0.0 -::1
protected-mode yes
requirepass <strong-password>
masterauth <strong-password>
replicaof 10.0.1.10 6379
replica-read-only yes
repl-backlog-size 256mb           # survive 5+ min disconnects without full resync
min-replicas-to-write 1           # primary refuses writes if no replica is connected
min-replicas-max-lag 10
appendonly yes
appendfsync everysec

The two min-replicas-* directives are what protect you from a particularly nasty failure mode: a primary that has been silently partitioned away from all its replicas continuing to accept writes that will be lost when it is demoted. With these set, the primary becomes read-only when it loses its replicas — a behavior every production deployment should have.
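
You can verify the protection in staging: stop both replicas, and the next write on the primary is refused. The error text shown below is representative of recent Redis versions:

redis-cli -h 10.0.1.10 -a <strong-password> set user:42:session "..."
(error) NOREPLICAS Not enough good replicas to write.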

2.4 Config: sentinel.conf

# /etc/redis/sentinel.conf — identical on all three Sentinels
port 26379
sentinel monitor mymaster 10.0.1.10 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 30000
sentinel parallel-syncs mymaster 1
sentinel auth-pass mymaster <strong-password>
sentinel resolve-hostnames yes
sentinel announce-hostnames yes

The 2 in sentinel monitor is the quorum: how many Sentinels must agree the primary is unreachable before they start the failover vote. With three Sentinels, quorum of 2 is the standard choice. Quorum is not the number of votes needed to actually fail over — that is always a strict majority of all known Sentinels (so 2 of 3, 3 of 5, etc.). Quorum is only the trigger to begin the election.
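
Once all three Sentinels are up, you can confirm they can actually reach both the quorum and the failover-authorization majority with a single command (output shown is representative):

redis-cli -h 10.0.1.10 -p 26379 sentinel ckquorum mymaster
OK 3 usable Sentinels. Quorum and failover authorization can be reached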

2.5 Client code: don't connect to a primary by IP

This is the single most common mistake in Sentinel deployments — apps hard-coded to a primary IP. After failover, they happily keep writing to a now-replica that rejects writes, or worse, to a stale ex-primary.

Python with redis-py:

from redis.sentinel import Sentinel

sentinel = Sentinel(
    [("10.0.1.10", 26379), ("10.0.2.10", 26379), ("10.0.3.10", 26379)],
    socket_timeout=0.5,
    password="<strong-password>",
    sentinel_kwargs={"password": "<strong-password>"},
)

# Writes always go to the current primary
master = sentinel.master_for("mymaster", socket_timeout=0.5, password="<strong-password>")
master.set("user:42:session", "...", ex=3600)

# Reads can use a replica (tolerate eventual consistency)
replica = sentinel.slave_for("mymaster", socket_timeout=0.5, password="<strong-password>")
value = replica.get("user:42:session")

master_for returns a connection object that transparently re-resolves the primary on every connection failure. Your application code does not handle failover — the client library does.
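
Reconnection is transparent, but the command that was in flight when the primary died still fails with an exception. If a particular write must survive the failover window, wrap it in a small retry loop. A sketch (the helper name is made up, not part of redis-py):

import time
from redis.exceptions import ConnectionError, TimeoutError, ReadOnlyError

def set_with_retry(master, key, value, attempts=5, delay=1.0):
    # Retries across the 10-30 s failover window; master_for() re-resolves
    # the new primary on each reconnect attempt.
    for _ in range(attempts):
        try:
            return master.set(key, value, ex=3600)
        except (ConnectionError, TimeoutError, ReadOnlyError):
            time.sleep(delay)
    raise RuntimeError(f"write to {key} failed after {attempts} attempts")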

2.6 What a failover actually looks like

  1. Primary becomes unreachable. At t + 5000ms (the down-after-milliseconds value), one Sentinel marks it S_DOWN (subjectively down).
  2. That Sentinel asks the others "do you also see it down?" When quorum agree, it transitions to O_DOWN (objectively down).
  3. Sentinels run a Raft-like leader election. The winner picks the best replica — most recent replication offset, lowest replica-priority, smallest run-id as tiebreaker.
  4. The winner sends REPLICAOF NO ONE to the chosen replica, then REPLICAOF <new-primary> to all others.
  5. Sentinels publish the new topology on the +switch-master pub/sub channel. Clients reconnect.

End-to-end, expect 10–30 seconds of unavailability with default tuning. Aggressively low down-after-milliseconds will cut this but also increase false positives during network blips.
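
The +switch-master channel is also useful for your own alerting. A minimal sketch that tails failover events from one Sentinel (in production you would subscribe on all three):

import redis

# Connect to a Sentinel (port 26379), not to Redis itself
sentinel_conn = redis.Redis(host="10.0.1.10", port=26379, password="<strong-password>")
pubsub = sentinel_conn.pubsub()
pubsub.subscribe("+switch-master")

for message in pubsub.listen():
    if message["type"] == "message":
        # Payload format: "<master-name> <old-ip> <old-port> <new-ip> <new-port>"
        print("failover:", message["data"])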


3. Redis Cluster: sharding plus failover

3.1 What Cluster actually is

Cluster shards data across nodes using a fixed space of 16384 hash slots. Every key maps to a slot via CRC16(key) mod 16384. Each primary owns a contiguous range of slots; each primary has one or more replicas; the cluster gossips topology via a separate bus on port 6379 + 10000 = 16379.
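
You can inspect the mapping directly with CLUSTER KEYSLOT; the returned slot numbers vary by key, but they are stable and identical on every node:

# Ask any node which slot a key hashes to
redis-cli -h 10.0.1.10 -a <strong-password> cluster keyslot "user:42"
redis-cli -h 10.0.1.10 -a <strong-password> cluster keyslot "user:43"
# Different keys usually land in different slots, and therefore possibly on different primaries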

Crucially, the cluster is its own failure detector — there is no Sentinel. Primaries vote among themselves to promote replicas when one of them goes down. This is why you need at least three primaries: two cannot form a majority.

3.2 The 16384-slot model and why it matters for your code

Node A: slots 0     – 5460   + replica A'
Node B: slots 5461  – 10922  + replica B'
Node C: slots 10923 – 16383  + replica C'

The slot model has one consequence that affects every line of application code you write: multi-key operations only work if all keys land in the same slot.

# Works only by accident — two keys, possibly different slots
r.mget("user:42", "user:43")  # may raise CROSSSLOT

# Works always — hash tag forces both into the same slot
r.mget("user:{42}:name", "user:{42}:email")

The substring inside {...} is the hash tag. Only that substring is hashed, so user:{42}:name and user:{42}:email always co-locate. Use hash tags any time you need MGET, MSET, transactions, Lua scripts, or SUNIONSTORE across keys.
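
The same rule is what makes Lua scripts workable in Cluster, as long as every key the script touches shares the tag. A sketch, continuing with the same r connection as the snippet above (key names are illustrative):

# Both keys carry the {42} hash tag, so they hash to the same slot and a
# single Lua script can read them together
r.set("user:{42}:name", "Ada")
r.set("user:{42}:email", "ada@example.com")

script = "return {redis.call('GET', KEYS[1]), redis.call('GET', KEYS[2])}"
name, email = r.eval(script, 2, "user:{42}:name", "user:{42}:email")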

This is also why blindly migrating a Sentinel app to Cluster usually breaks: code that worked on a single primary suddenly throws CROSSSLOT errors. Audit your multi-key calls before you migrate.

3.3 Building a six-node cluster

Three primaries, three replicas, one per AZ:

# On each of the 6 nodes — note cluster-enabled yes
cat > /etc/redis/redis.conf <<'EOF'
port 6379
cluster-enabled yes
cluster-config-file nodes.conf
cluster-node-timeout 5000
cluster-require-full-coverage no
appendonly yes
appendfsync everysec
masterauth <strong-password>
requirepass <strong-password>
EOF

# Start all 6, then from any one node:
redis-cli --cluster create \
  10.0.1.10:6379 10.0.2.10:6379 10.0.3.10:6379 \
  10.0.1.11:6379 10.0.2.11:6379 10.0.3.11:6379 \
  --cluster-replicas 1 \
  -a <strong-password>

redis-cli will propose a slot assignment, try to place each replica on a different host than its primary (the anti-affinity check is based on IP, not AZ, so review the proposal before accepting), and ask for confirmation. Verify health:

redis-cli -a <pwd> -c -h 10.0.1.10 cluster info
redis-cli -a <pwd> -c -h 10.0.1.10 cluster nodes

The -c flag enables cluster-mode redirection in the CLI — without it, you'll get MOVED errors when you query the wrong shard.
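
For reference, this is roughly what the raw redirect looks like when the node you queried does not own the key's slot (slot number and target address are illustrative):

redis-cli -h 10.0.1.10 -a <strong-password> get "user:{42}:session"
(error) MOVED 8000 10.0.2.10:6379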

The cluster-require-full-coverage no setting is important: with the default yes, if any one shard becomes unavailable, the entire cluster stops accepting writes — even for keys that live on healthy shards. no lets healthy shards keep serving their slots, which is almost always what you want in production.

3.4 Client code: use a cluster-aware client

from redis.cluster import RedisCluster, ClusterNode

rc = RedisCluster(
    startup_nodes=[
        ClusterNode("10.0.1.10", 6379),
        ClusterNode("10.0.2.10", 6379),
        ClusterNode("10.0.3.10", 6379),
    ],
    password="<strong-password>",
    decode_responses=True,
    read_from_replicas=True,
    socket_timeout=0.5,
)

rc.set("user:{42}:session", "...", ex=3600)
rc.mget("user:{42}:name", "user:{42}:email")

The client caches the slot map on first connect and refreshes it on MOVED redirects. Resharding (slot migration) produces ASK redirects, which the client follows transparently for in-flight queries. You generally do not need to handle these manually.

3.5 What failover looks like in Cluster

  1. A primary stops responding. Other primaries notice via the gossip bus.
  2. After cluster-node-timeout (default 15s, set to 5s above), they mark it PFAIL (possibly failing), then FAIL once a majority agrees.
  3. The failed primary's replicas race to be elected. The one with the most recent replication offset wins a vote from a majority of primaries (not all nodes — replicas don't vote).
  4. The winning replica takes over the failed primary's slot range. The cluster continues.

Failover is generally faster than Sentinel — 5–15 seconds with the tuning above — because there's no separate quorum tier to coordinate.


4. The decision framework

Walk through these questions in order. The first "yes" decides for you.

  1. Does your working set exceed what one node can hold (in RAM, after replication overhead)? → Cluster. No way around it.
  2. Do you need to scale writes beyond what one primary can sustain? → Cluster.
  3. Are you operating across multiple regions with active-active needs? → Neither, natively. Look at Redis Enterprise CRDTs, or shard by region at the application layer.
  4. Is your data set small (< 50 GB), and is it acceptable to vertically scale the primary as you grow? → Sentinel. The operational simplicity is a real benefit.
  5. Are you heavily reliant on multi-key transactions, Lua scripts spanning many keys, or KEYS/SCAN across the whole dataset? → Sentinel, or be ready to redesign your key layout around hash tags.
  6. Is your team small and unfamiliar with Redis internals? → Sentinel. Cluster's failure modes (split brain across shards, slot migration mid-failover, MOVED storms after topology changes) require operators who know what they're looking at.

A useful sanity check: Cluster is sharding first, HA second. If you do not need sharding, you are paying complexity for nothing.


5. The traps both topologies share

A few failure modes catch teams regardless of which they pick.

Asynchronous replication means writes can be lost. Both Sentinel and Cluster replicate asynchronously by default. A primary that acknowledges a write can die before that write reaches any replica. If you cannot tolerate this, use WAIT n timeout after critical writes, but understand that it confirms replication, not durability: the acknowledging replicas hold the write in memory and can still lose it if they crash before fsyncing.
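
A sketch using the Sentinel master connection from section 2.5 (the key name and the fallback handler are illustrative): the write goes to the primary, then WAIT blocks until at least one replica has acknowledged it or the timeout expires.

master.set("order:9001:status", "paid")

# Block until 1 replica has acknowledged the write, or 1000 ms pass.
# WAIT returns how many replicas actually acknowledged.
acked = master.execute_command("WAIT", 1, 1000)
if acked < 1:
    handle_possible_data_loss()  # hypothetical handler: retry, alert, or compensate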

Persistence is not optional. AOF with appendfsync everysec is the production default. RDB-only deployments lose everything written since the last snapshot on a crash. Worse, an ex-primary that restarts empty and comes back online as a primary can poison its replicas with that empty dataset, so enable appendonly yes on every node, not just the primary, and consider replica-serve-stale-data no and repl-diskless-sync yes as well.

Monitor the right metrics. redis_master_link_status, replication offset lag, connected_slaves, cluster_state, cluster_slots_ok, evicted keys, command latency p99. Alert on lag, not just on outright failures — a replica that is 30 seconds behind will lose 30 seconds of writes when it gets promoted.
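
A minimal lag probe, assuming direct access to the primary and one replica (addresses match the earlier layout; the alert threshold is illustrative and should be tuned to your write rate):

import redis

primary = redis.Redis(host="10.0.1.10", port=6379, password="<strong-password>")
replica = redis.Redis(host="10.0.2.10", port=6379, password="<strong-password>")

# Byte offset the primary has written vs. the offset the replica has applied
primary_offset = primary.info("replication")["master_repl_offset"]
replica_offset = replica.info("replication")["slave_repl_offset"]

lag_bytes = primary_offset - replica_offset
if lag_bytes > 1_000_000:
    print(f"replica is {lag_bytes} bytes behind the primary")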

Test failover before you need it. Run redis-cli debug sleep 30 on a primary in staging. Run redis-cli -p 6379 shutdown nosave. Pull the network cable in a planned game-day. The first time you see a failover should not be in production.
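
In command form, a staging game-day might look like this (the last command assumes Cluster mode and is run against a replica of the shard you want to fail over):

# Simulate a hung primary for 30 seconds (Sentinel should fail over)
redis-cli -h 10.0.1.10 -a <strong-password> debug sleep 30

# Hard-kill a primary without saving (tests recovery from real loss)
redis-cli -h 10.0.1.10 -p 6379 -a <strong-password> shutdown nosave

# Cluster mode: trigger a controlled failover from one of the replicas
redis-cli -h 10.0.1.11 -a <strong-password> cluster failover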


6. Migration path: Sentinel → Cluster

If you started on Sentinel and are outgrowing it, the migration is doable but not trivial:

  1. Audit multi-key operations. Grep for MGET, MSET, MULTI/EXEC, EVAL/EVALSHA, SUNIONSTORE, ZINTERSTORE. Rewrite each to use hash tags or split into per-key calls.
  2. Switch the client library to a cluster-aware client behind a feature flag. Many libraries (redis-py, lettuce, ioredis) have separate RedisCluster/ClusterClient classes — they are not drop-in replacements.
  3. Stand up the cluster in parallel. Don't try to convert the Sentinel topology in place.
  4. Dual-write briefly, or use redis-cli --cluster import to copy the dataset from the Sentinel primary into the new cluster while it's live.
  5. Cut over reads first, then writes, behind a flag you can flip back instantly.

Plan for the migration to take longer than you think. The actual data move is fast; the application audit is what takes weeks.


7. Wrapping up

Sentinel and Cluster solve overlapping but different problems. Sentinel is the right answer when you need failover for a dataset that fits on one node — it's simpler, it works with every multi-key Redis feature, and its failure modes are easy to reason about. Cluster is the right answer when you need to shard, and you should commit to it knowing that the shard model will shape your code and your operations.

The worst choice is the one made by reflex. "We'll use Cluster because it's more modern" buys complexity you may never need. "We'll use Sentinel because it's simpler" caps your growth at one node's RAM. Pick based on the working set, write throughput, and multi-key patterns you actually have — not the ones you imagine you might have someday.

When in doubt: start on Sentinel, instrument everything, and migrate to Cluster the quarter your monitoring tells you the single primary is genuinely the bottleneck. Premature sharding is a tax you pay every day; necessary sharding pays for itself the day you add the fourth shard.

Bekzod Erkinov

Author

Founder of NextGenBeing. Software engineer working with Laravel, Python, and cloud infrastructure. Writes about patterns that actually hold up in production. Based in Tashkent, Uzbekistan.
