Redis 2.8 の Sentinel の動きを検証してみた

Redis 2.8 の redis-sentinel によるレプリケーションの自動フェイルオーバーについて、
比較的発生しそうな障害を想定して動作検証してみました。

結論から

redis-server の自動再起動を構成している場合は要注意。

daemontools とか。

  • Master が落ちた後すぐ(例えば数秒)に再起動してきた場合、
    再び Master としてレプリケーションに参加します。

  • よって、Master 再起動の前後でデータに差異があった場合でも、
    再起動後のデータをもとに同期される為、データが破壊される可能性があります。

  • これを回避する為には、Sentinel により
    sdown/odown として認識されるのを待ってからインスタンスを復帰させるようにします。

  • 復帰が早すぎると、障害(sdown/odown)ではなく再起動(reboot)と認識します。

レプリケーションの再構成は Sentinel に任せるか、Sentinel を止めた状態で行う。

  • Master を変更したい時は redis-sentinel に対して
    sentinel failover コマンドを実行するようにしたほうが良いでしょう。
    (redis-server 側から操作すると Master 不在の期間が発生したり、
    一時的にレプリケーションが切れます。また、Slave が2つ以上居る場合
    フェイルオーバー先は指定できません。(Sentinel が決めます) )

  • Slave に関する操作(追加や削除、停止)は、
    Sentinel を用いない構成と同様、サービス影響なく実施可能です。
    (Slave が増えた場合は自動的に Sentinel の監視下に入ります。)

  • どうしても Master を特定のインスタンスに変更したい場合は
    Sentinel を全て止めた状態でレプリケーションの操作を行い、
    設定ファイルを修正してから Sentinel を起動するようにします。

Sentinel の設定ファイルは、Sentinel によって動的に(!)書き換えられる。

  • 設定ファイルには、レプリケーションに参加中のインスタンス情報や、
    連携中の Sentinel の情報が書き込まれます。

  • この情報は、レプリケーションの構成に変化があったときや、
    Sentinel が増減した時に更新されます。

  • chef などにより設定ファイルをデプロイする場合は
    「ファイルが存在しない場合のみ」に限定した方が良いでしょう。

  • なお、Sentinel の起動"中"に限れば、設定ファイルを
    誤って書き換えてしまっても直ちに影響はありません。
    (Sentinel 起動時の内容が内部で保持されているので、
    上記の更新のタイミングで書き直されます)

以下に、実際に試験した事を記します。

環境

f:id:chrone3:20140228203140p:plain

  • Redis 2.8.6 (RPMを自作してインストール)
  • CentOS 6.4 / x86_64
  • ホストを3台用意し、2台は Server+Sentinel、1台は Sentinel のみを起動
    (Sentinel を3つ用意したかっただけなので、同居している事に特段の理由はない)
  • Sentinel の設定ファイルの初期状態 (3つとも共通)

    # port <sentinel-port>
    port 26379
    
    # sentinel monitor <master-name> <ip> <redis-port> <quorum>
    sentinel monitor mymaster balthasar 6379 2
    
    # sentinel down-after-milliseconds <master-name> <milliseconds>
    sentinel down-after-milliseconds mymaster 30000
    
    # sentinel parallel-syncs <master-name> <numslaves>
    sentinel parallel-syncs mymaster 1
    
    # sentinel failover-timeout <master-name> <milliseconds>
    sentinel failover-timeout mymaster 180000
    

詳細

シナリオと結果のまとめ

  1. Masterを強制終了
    → 生きている Slave にフェイルオーバーされます。

  2. Master と Sentinel 1つを強制終了 (図中のホストBalthasarが突然死)
    → Sentinel が1つ減っても、quorum 2 には足りるので、
    生きている Slave にフェイルオーバーされます。

  3. Master と Sentinel 2つを強制終了 (Balthasar, Melchiorが突然死)
    → Sentinel が残り1つになってしまい、quorum 2 に満たないので
    フェイルオーバーされません。(そもそも try-failover しない)

  4. Master を強制終了し、Sentinel の認知"後"に再起動 (この時 rdb ファイルを消す)
    → フェイルオーバーにより Master が切り替わっているので、
    旧Slave(現Master) のデータをもとにレプリケーションが再開されます。

  5. Master を強制終了し、Sentinel の認知"前"に再起動 (この時 rdb ファイルを消す)
    → Master が Master として復帰するので、再起動後のデータで同期されます。
    (今回のケースでは rdb を削除したので、全データをロストします。)

  6. Slave を手動で Master に昇格させる (slaveof no one を実行)
    →もとの状態に戻されます (Slave に降格)。

  7. Master を手動で Slave に降格させる
    → Master がダウンしたと見なされ、フェイルオーバーが発生します。
    但しその後、fix-slave-config のタイミングで一瞬レプリケーションが切れます。
    (当たり前ですが、ダウン検知されるまでは Master が居ない状態になります。)

  8. Sentinel 起動中に設定ファイルを間違った内容で上書き
    → いかなるイベントも発生しません。
    設定ファイルは、レプリケーションの構成、または Sentinel の構成に
    変化があった時に正しい内容で修正されます。

    但し、書き換えられた状態のまま何も起こらずに Sentinel が停止した場合
    設定ファイルは元の状態には戻りません。

以下に、それぞれの検証方法とログを掲載しています。 (とても長いです)

個々の詳細

1. Masterを強制終了

生きている Slave にフェイルオーバーされます。

f:id:chrone3:20140228163503p:plain

[Balthasar] # pkill -KILL redis-server

=== redis-sentinel log (leader) ===
[4080] 28 Feb 16:55:25.478 # +sdown master mymaster 192.168.0.101 6379
[4080] 28 Feb 16:55:25.542 # +odown master mymaster 192.168.0.101 6379 #quorum 2/2
( ↑Master 落ちたNE! )
[4080] 28 Feb 16:55:25.542 # +new-epoch 5
[4080] 28 Feb 16:55:25.542 # +try-failover master mymaster 192.168.0.101 6379
( ↑フェイルオーバー試すよ! )
[4080] 28 Feb 16:55:25.542 # +vote-for-leader 26b793c6aa26fbac29148f0271b4d513ff9931c4 5
[4080] 28 Feb 16:55:25.543 # 192.168.0.101:26379 voted for 26b793c6aa26fbac29148f0271b4d513ff9931c4 5
[4080] 28 Feb 16:55:25.544 # 192.168.0.103:26379 voted for 26b793c6aa26fbac29148f0271b4d513ff9931c4 5
[4080] 28 Feb 16:55:25.607 # +elected-leader master mymaster 192.168.0.101 6379
[4080] 28 Feb 16:55:25.607 # +failover-state-select-slave master mymaster 192.168.0.101 6379
[4080] 28 Feb 16:55:25.671 # +selected-slave slave 192.168.0.103:6379 192.168.0.103 6379 @ mymaster 192.168.0.101 6379
[4080] 28 Feb 16:55:25.671 * +failover-state-send-slaveof-noone slave 192.168.0.103:6379 192.168.0.103 6379 @ mymaster 192.168.0.101 6379
[4080] 28 Feb 16:55:25.749 * +failover-state-wait-promotion slave 192.168.0.103:6379 192.168.0.103 6379 @ mymaster 192.168.0.101 6379
[4080] 28 Feb 16:55:26.664 # +promoted-slave slave 192.168.0.103:6379 192.168.0.103 6379 @ mymaster 192.168.0.101 6379
[4080] 28 Feb 16:55:26.665 # +failover-state-reconf-slaves master mymaster 192.168.0.101 6379
[4080] 28 Feb 16:55:26.719 # +failover-end master mymaster 192.168.0.101 6379
[4080] 28 Feb 16:55:26.719 # +switch-master mymaster 192.168.0.101 6379 192.168.0.103 6379
( ↑フェイルオーバーうまくいったので Master 切り替えるよ! )
[4080] 28 Feb 16:55:26.720 * +slave slave 192.168.0.101:6379 192.168.0.101 6379 @ mymaster 192.168.0.103 6379
[4080] 28 Feb 16:55:56.745 # +sdown slave 192.168.0.101:6379 192.168.0.101 6379 @ mymaster 192.168.0.103 6379

=== Casper's redis-server log ===
[4124] 28 Feb 16:54:56.240 # Connection with master lost.
[4124] 28 Feb 16:54:56.240 * Caching the disconnected master state.
[4124] 28 Feb 16:54:57.061 * Connecting to MASTER 192.168.0.101:6379
[4124] 28 Feb 16:54:57.061 * MASTER <-> SLAVE sync started
[4124] 28 Feb 16:54:57.062 # Error condition on socket for SYNC: Connection refused
[4124] 28 Feb 16:54:58.070 * Connecting to MASTER 192.168.0.101:6379
[4124] 28 Feb 16:54:58.070 * MASTER <-> SLAVE sync started
[4124] 28 Feb 16:54:58.071 # Error condition on socket for SYNC: Connection refused
...
[4124] 28 Feb 16:55:26.191 * Discarding previously cached master state.
[4124] 28 Feb 16:55:26.191 * MASTER MODE enabled (user request)

2. Master と Sentinel 1つを強制終了

Sentinel が1つ減っても、quorum 2 には足りるので、
生きている Slave にフェイルオーバーされます。

f:id:chrone3:20140228215532p:plain

[Balthasar] # echo o > /proc/sysrq-trigger

=== redis-sentinel log ===
[4140] 28 Feb 17:26:27.630 # +sdown master mymaster 192.168.0.101 6379
[4140] 28 Feb 17:26:27.693 # +odown master mymaster 192.168.0.101 6379 #quorum 2/2
( ↑Master落ちたNE! )
[4140] 28 Feb 17:26:27.693 # +new-epoch 9
[4140] 28 Feb 17:26:27.693 # +new-epoch 9
[4140] 28 Feb 17:26:27.693 # +try-failover master mymaster 192.168.0.101 6379
( ↑フェイルオーバー試すよ! )
[4140] 28 Feb 17:26:27.694 # +vote-for-leader 88233d21e246e892122d3c6228ed91b0a89ef311 9
[4140] 28 Feb 17:26:27.695 # 192.168.0.102:26379 voted for 88233d21e246e892122d3c6228ed91b0a89ef311 9
[4140] 28 Feb 17:26:27.747 # +elected-leader master mymaster 192.168.0.101 6379
[4140] 28 Feb 17:26:27.747 # +failover-state-select-slave master mymaster 192.168.0.101 6379
[4140] 28 Feb 17:26:27.747 # +failover-state-select-slave master mymaster 192.168.0.101 6379
[4140] 28 Feb 17:26:27.819 # +selected-slave slave 192.168.0.103:6379 192.168.0.103 6379 @ mymaster 192.168.0.101 6379
[4140] 28 Feb 17:26:27.820 * +failover-state-send-slaveof-noone slave 192.168.0.103:6379 192.168.0.103 6379 @ mymaster 192.168.0.101 6379
[4140] 28 Feb 17:26:27.904 * +failover-state-wait-promotion slave 192.168.0.103:6379 192.168.0.103 6379 @ mymaster 192.168.0.101 6379
[4140] 28 Feb 17:26:27.988 # +sdown sentinel 192.168.0.101:26379 192.168.0.101 26379 @ mymaster 192.168.0.101 6379
( ↑Sentinel も落ちたNE! )
[4140] 28 Feb 17:26:28.758 # +promoted-slave slave 192.168.0.103:6379 192.168.0.103 6379 @ mymaster 192.168.0.101 6379
[4140] 28 Feb 17:26:28.758 # +failover-state-reconf-slaves master mymaster 192.168.0.101 6379
[4140] 28 Feb 17:26:28.795 # +failover-end master mymaster 192.168.0.101 6379
[4140] 28 Feb 17:26:28.795 # +switch-master mymaster 192.168.0.101 6379 192.168.0.103 6379
( ↑フェイルオーバーうまくいったので Master 切り替えるよ! )
[4140] 28 Feb 17:26:28.796 * +slave slave 192.168.0.101:6379 192.168.0.101 6379 @ mymaster 192.168.0.103 6379
[4140] 28 Feb 17:26:58.809 # +sdown slave 192.168.0.101:6379 192.168.0.101 6379 @ mymaster 192.168.0.103 6379

=== Casper's redis-server log ===
[4124] 28 Feb 17:26:27.904 # Connection with master lost.
[4124] 28 Feb 17:26:27.905 * Caching the disconnected master state.
[4124] 28 Feb 17:26:27.905 * Discarding previously cached master state.
[4124] 28 Feb 17:26:27.905 * MASTER MODE enabled (user request)
[4124] 28 Feb 17:26:27.906 # CONFIG REWRITE failed: Permission denied

3. Master と Sentinel 2つを強制終了

Sentinel が残り1つになってしまい、quorum 2 に満たないので
フェイルオーバーされません。(そもそも try-failover しない)

f:id:chrone3:20140228173019p:plain

[Balthasar] # echo o > /proc/sysrq-trigger
[Melchior] # echo o > /proc/sysrq-trigger

=== redis-sentinel log ===
[4140] 28 Feb 17:38:05.896 # +sdown master mymaster 192.168.0.101 6379
[4140] 28 Feb 17:38:05.896 # +sdown sentinel 192.168.0.102:26379 192.168.0.102 26379 @ mymaster 192.168.0.101 6379
[4140] 28 Feb 17:38:05.896 # +sdown sentinel 192.168.0.101:26379 192.168.0.101 26379 @ mymaster 192.168.0.101 6379

=== Casper's redis-server log ===
[4124] 28 Feb 17:38:37.841 # MASTER timeout: no data nor PING received...
[4124] 28 Feb 17:38:37.841 # Connection with master lost.
[4124] 28 Feb 17:38:37.841 * Caching the disconnected master state.
[4124] 28 Feb 17:38:37.841 * Connecting to MASTER 192.168.0.101:6379
[4124] 28 Feb 17:38:37.841 * MASTER <-> SLAVE sync started
[4124] 28 Feb 17:38:39.674 # Error condition on socket for SYNC: No route to host
[4124] 28 Feb 17:38:39.862 * Connecting to MASTER 192.168.0.101:6379
[4124] 28 Feb 17:38:39.862 * MASTER <-> SLAVE sync started
[4124] 28 Feb 17:38:42.721 # Error condition on socket for SYNC: No route to host
...
4. Master を強制終了し、Sentinel の認知"後"に再起動 (この時 rdb ファイルを消す)

フェイルオーバーにより Master が切り替わっているので、
旧Slave(現Master) のデータをもとにレプリケーションが再開されます。

### 1. データを入れる
[Balthasar] # redis-cli
balthasar:6379> set hoge fuga
OK
balthasar:6379> get hoge
"fuga"

### 2. 障害発生
[Balthasar] # pkill -KILL redis-server
(Sentinel のログに sdown/odown が出力されるのを待つ)
[Balthasar] # rm -f dump.rdb; service redis start

### 3. データの確認
(レプリケーションが再開されるのを待つ)
[Balthasar] # redis-cli
balthasar:6379> get hoge
"fuga"
[Casper   ] # redis-cli
casper:6379> get hoge
"fuga"

=== redis-sentinel log ===
[1434] 28 Feb 17:55:48.005 # +sdown master mymaster 192.168.0.101 6379
[1434] 28 Feb 17:55:48.078 # +odown master mymaster 192.168.0.101 6379 #quorum 2/2
( ↑Master落ちたNE! )
[1434] 28 Feb 17:55:48.078 # +new-epoch 13
[1434] 28 Feb 17:55:48.078 # +try-failover master mymaster 192.168.0.101 6379
( ↑フェイルオーバー試すよ! )
[1434] 28 Feb 17:55:48.078 # +vote-for-leader 1f4b26c71c732bd00e654bcbcb035cc8dd4d9172 13
[1434] 28 Feb 17:55:48.079 # 192.168.0.102:26379 voted for 1f4b26c71c732bd00e654bcbcb035cc8dd4d9172 13
[1434] 28 Feb 17:55:48.080 # 192.168.0.103:26379 voted for 1f4b26c71c732bd00e654bcbcb035cc8dd4d9172 13
[1434] 28 Feb 17:55:48.146 # +elected-leader master mymaster 192.168.0.101 6379
[1434] 28 Feb 17:55:48.146 # +failover-state-select-slave master mymaster 192.168.0.101 6379
[1434] 28 Feb 17:55:48.146 # +failover-state-select-slave master mymaster 192.168.0.101 6379
[1434] 28 Feb 17:55:48.214 # +selected-slave slave 192.168.0.103:6379 192.168.0.103 6379 @ mymaster 192.168.0.101 6379
[1434] 28 Feb 17:55:48.214 * +failover-state-send-slaveof-noone slave 192.168.0.103:6379 192.168.0.103 6379 @ mymaster 192.168.0.101 6379
[1434] 28 Feb 17:55:48.214 * +failover-state-send-slaveof-noone slave 192.168.0.103:6379 192.168.0.103 6379 @ mymaster 192.168.0.101 6379
[1434] 28 Feb 17:55:48.306 * +failover-state-wait-promotion slave 192.168.0.103:6379 192.168.0.103 6379 @ mymaster 192.168.0.101 6379
[1434] 28 Feb 17:55:49.121 # +promoted-slave slave 192.168.0.103:6379 192.168.0.103 6379 @ mymaster 192.168.0.101 6379
[1434] 28 Feb 17:55:49.121 # +failover-state-reconf-slaves master mymaster 192.168.0.101 6379
[1434] 28 Feb 17:55:49.177 # +failover-end master mymaster 192.168.0.101 6379
[1434] 28 Feb 17:55:49.177 # +switch-master mymaster 192.168.0.101 6379 192.168.0.103 6379
( ↑フェイルオーバーうまくいったので Master 切り替えるよ! )
[1434] 28 Feb 17:55:49.177 * +slave slave 192.168.0.101:6379 192.168.0.101 6379 @ mymaster 192.168.0.103 6379
[1434] 28 Feb 17:56:02.146 * +convert-to-slave slave 192.168.0.101:6379 192.168.0.101 6379 @ mymaster 192.168.0.103 6379
( ↑おかえり! Slave になってね! )

=== Casper's redis-server log ===
[4124] 28 Feb 17:55:18.371 # Connection with master lost.
[4124] 28 Feb 17:55:18.371 * Caching the disconnected master state.
[4124] 28 Feb 17:55:19.374 * Connecting to MASTER 192.168.0.101:6379
[4124] 28 Feb 17:55:19.374 * MASTER <-> SLAVE sync started
[4124] 28 Feb 17:55:19.374 # Error condition on socket for SYNC: Connection refused
...
[4124] 28 Feb 17:55:48.225 * Discarding previously cached master state.
[4124] 28 Feb 17:55:48.225 * MASTER MODE enabled (user request)
[4124] 28 Feb 17:56:02.147 * Slave asks for synchronization
[4124] 28 Feb 17:56:02.147 * Full resync requested by slave.
[4124] 28 Feb 17:56:02.147 * Starting BGSAVE for SYNC
[4124] 28 Feb 17:56:02.148 * Background saving started by pid 25402
[25402] 28 Feb 17:56:02.152 * DB saved on disk
[25402] 28 Feb 17:56:02.153 * RDB: 4 MB of memory used by copy-on-write
[4124] 28 Feb 17:56:02.180 * Background saving terminated with success
[4124] 28 Feb 17:56:02.181 * Synchronization with slave succeeded

=== Balthasar's redis-server log ===
[1563] 28 Feb 17:55:52.091 # Server started, Redis version 2.8.6
[1563] 28 Feb 17:55:52.092 * The server is now ready to accept connections on port 6379
[1563] 28 Feb 17:56:02.147 * SLAVE OF 192.168.0.103:6379 enabled (user request)
[1563] 28 Feb 17:56:02.227 * Connecting to MASTER 192.168.0.103:6379
[1563] 28 Feb 17:56:02.227 * MASTER <-> SLAVE sync started
[1563] 28 Feb 17:56:02.228 * Non blocking connect for SYNC fired the event.
[1563] 28 Feb 17:56:02.228 * Master replied to PING, replication can continue...
[1563] 28 Feb 17:56:02.229 * Partial resynchronization not possible (no cached master)
[1563] 28 Feb 17:56:02.229 * Full resync from master: 28295c6983d92ca18a3ea8d1a18c5dc718d2b293:1
[1563] 28 Feb 17:56:02.263 * MASTER <-> SLAVE sync: receiving 31 bytes from master
[1563] 28 Feb 17:56:02.263 * MASTER <-> SLAVE sync: Flushing old data
[1563] 28 Feb 17:56:02.263 * MASTER <-> SLAVE sync: Loading DB in memory
[1563] 28 Feb 17:56:02.263 * MASTER <-> SLAVE sync: Finished with success
5. Master を強制終了し、Sentinel の認知"前"に再起動 (この時 rdb ファイルを消す)

Master が Master として復帰するので、再起動後のデータで同期されます。
(今回のケースでは rdb を削除したので、全データをロストします。)

(14/03/03 追記: 下記テキスト中のコマンドラインに "-KILL" が抜けていたので修正しました。)

### 1. データを入れる
[Balthasar] # redis-cli
balthasar:6379> set hoge fuga
OK
balthasar:6379> get hoge
"fuga"

### 2. 障害発生
[Balthasar] # pkill -KILL redis-server; sleep 1; rm -f dump.rdb; service redis start

### 3. データの確認
(レプリケーションが再開されるのを待つ)
[Balthasar] # redis-cli
balthasar:6379> get hoge
(nil)
[Casper   ] # redis-cli
casper:6379> get hoge
(nil)

=== redis-sentinel log ===
[1399] 28 Feb 18:17:46.104 * +reboot master mymaster 192.168.0.101 6379

=== Casper's redis-server log ===
[4124] 28 Feb 18:17:44.720 # Connection with master lost.
[4124] 28 Feb 18:17:44.721 * Caching the disconnected master state.
[4124] 28 Feb 18:17:44.881 * Connecting to MASTER 192.168.0.101:6379
[4124] 28 Feb 18:17:44.881 * MASTER <-> SLAVE sync started
[4124] 28 Feb 18:17:44.882 # Error condition on socket for SYNC: Connection refused
[4124] 28 Feb 18:17:45.892 * Connecting to MASTER 192.168.0.101:6379
[4124] 28 Feb 18:17:45.892 * MASTER <-> SLAVE sync started
[4124] 28 Feb 18:17:45.893 * Non blocking connect for SYNC fired the event.
[4124] 28 Feb 18:17:45.893 * Master replied to PING, replication can continue...
[4124] 28 Feb 18:17:45.894 * Trying a partial resynchronization (request a53da87baa24c587580c93a385820d3604b4a0fa:23007).
[4124] 28 Feb 18:17:45.895 * Full resync from master: 47750624a99f11331d0b06043cea65957552e76c:1
[4124] 28 Feb 18:17:45.895 * Discarding previously cached master state.
[4124] 28 Feb 18:17:45.976 * MASTER <-> SLAVE sync: receiving 18 bytes from master
[4124] 28 Feb 18:17:45.977 * MASTER <-> SLAVE sync: Flushing old data
[4124] 28 Feb 18:17:45.977 * MASTER <-> SLAVE sync: Loading DB in memory
[4124] 28 Feb 18:17:45.977 * MASTER <-> SLAVE sync: Finished with success

=== Balthasar's redis-server log ===
[1698] 28 Feb 18:17:45.854 # Server started, Redis version 2.8.6
[1698] 28 Feb 18:17:45.854 * The server is now ready to accept connections on port 6379
[1698] 28 Feb 18:17:45.976 * Slave asks for synchronization
[1698] 28 Feb 18:17:45.976 * Partial resynchronization not accepted: Runid mismatch (Client asked for 'a53da87baa24c587580c93a385820d3604b4a0fa', I'm '47750624a99f11331d0b06043cea65957552e76c')
[1698] 28 Feb 18:17:45.976 * Starting BGSAVE for SYNC
[1698] 28 Feb 18:17:45.977 * Background saving started by pid 1702
[1702] 28 Feb 18:17:45.988 * DB saved on disk
[1702] 28 Feb 18:17:45.989 * RDB: 4 MB of memory used by copy-on-write
[1698] 28 Feb 18:17:46.057 * Background saving terminated with success
[1698] 28 Feb 18:17:46.058 * Synchronization with slave succeeded
6. Slave を手動で Master に昇格させる (slaveof no one を実行)

もとの状態に戻されます (Slave に降格)。

[Casper  ] # redis-cli
casper:6379> slaveof no one
OK

=== redis-sentinel log ===
[25707] 28 Feb 19:19:00.514 * +convert-to-slave slave 192.168.0.103:6379 192.168.0.103 6379 @ mymaster 192.168.0.101 6379

=== Baltasar's redis-server log ===
[1698] 28 Feb 19:18:49.785 # Connection with slave 192.168.0.103:6379 lost.
[1698] 28 Feb 19:19:00.663 * Slave asks for synchronization
[1698] 28 Feb 19:19:00.663 * Full resync requested by slave.
[1698] 28 Feb 19:19:00.663 * Starting BGSAVE for SYNC
[1698] 28 Feb 19:19:00.664 * Background saving started by pid 2043
[2043] 28 Feb 19:19:00.674 * DB saved on disk
[2043] 28 Feb 19:19:00.675 * RDB: 4 MB of memory used by copy-on-write
[1698] 28 Feb 19:19:00.676 * Background saving terminated with success
[1698] 28 Feb 19:19:00.677 * Synchronization with slave succeeded

=== Casper's redis-server log ===
[4124] 28 Feb 19:18:49.703 # Connection with master lost.
[4124] 28 Feb 19:18:49.703 * Caching the disconnected master state.
[4124] 28 Feb 19:18:49.703 * Discarding previously cached master state.
[4124] 28 Feb 19:18:49.703 * MASTER MODE enabled (user request)
[4124] 28 Feb 19:19:00.514 * SLAVE OF 192.168.0.101:6379 enabled (user request)
[4124] 28 Feb 19:19:00.579 * Connecting to MASTER 192.168.0.101:6379
[4124] 28 Feb 19:19:00.579 * MASTER <-> SLAVE sync started
[4124] 28 Feb 19:19:00.579 * Non blocking connect for SYNC fired the event.
[4124] 28 Feb 19:19:00.580 * Master replied to PING, replication can continue...
[4124] 28 Feb 19:19:00.581 * Partial resynchronization not possible (no cached master)
[4124] 28 Feb 19:19:00.581 * Full resync from master: 47750624a99f11331d0b06043cea65957552e76c:640962
[4124] 28 Feb 19:19:00.595 * MASTER <-> SLAVE sync: receiving 18 bytes from master
[4124] 28 Feb 19:19:00.595 * MASTER <-> SLAVE sync: Flushing old data
[4124] 28 Feb 19:19:00.596 * MASTER <-> SLAVE sync: Loading DB in memory
[4124] 28 Feb 19:19:00.596 * MASTER <-> SLAVE sync: Finished with success
7. Master を手動で Slave に降格させる

Master がダウンしたと見なされ、フェイルオーバーが発生します。
但しその後、fix-slave-config のタイミングで一瞬レプリケーションが切れます。
(当たり前ですが、ダウン検知されるまでは Master が居ない状態になります。)

[Balthasar] # redis-cli
balthasar:6379> slaveof casper 6379
OK

=== redis-sentinel log ===
[1728] 28 Feb 19:57:31.682 # +sdown master mymaster 192.168.0.101 6379
[1728] 28 Feb 19:57:31.766 # +odown master mymaster 192.168.0.101 6379 #quorum 2/2
[1728] 28 Feb 19:57:31.766 # +new-epoch 25
[1728] 28 Feb 19:57:31.766 # +new-epoch 25
[1728] 28 Feb 19:57:31.766 # +try-failover master mymaster 192.168.0.101 6379
[1728] 28 Feb 19:57:31.767 # +vote-for-leader 901119b19d0505e249572b7060a6bf7eb2011e57 25
[1728] 28 Feb 19:57:31.768 # 192.168.0.103:26379 voted for 901119b19d0505e249572b7060a6bf7eb2011e57 25
[1728] 28 Feb 19:57:31.768 # 192.168.0.101:26379 voted for 901119b19d0505e249572b7060a6bf7eb2011e57 25
[1728] 28 Feb 19:57:31.830 # +elected-leader master mymaster 192.168.0.101 6379
[1728] 28 Feb 19:57:31.830 # +failover-state-select-slave master mymaster 192.168.0.101 6379
[1728] 28 Feb 19:57:31.830 # +failover-state-select-slave master mymaster 192.168.0.101 6379
[1728] 28 Feb 19:57:31.908 # +selected-slave slave 192.168.0.103:6379 192.168.0.103 6379 @ mymaster 192.168.0.101 6379
[1728] 28 Feb 19:57:31.908 * +failover-state-send-slaveof-noone slave 192.168.0.103:6379 192.168.0.103 6379 @ mymaster 192.168.0.101 6379
[1728] 28 Feb 19:57:31.908 * +failover-state-send-slaveof-noone slave 192.168.0.103:6379 192.168.0.103 6379 @ mymaster 192.168.0.101 6379
[1728] 28 Feb 19:57:31.962 * +failover-state-wait-promotion slave 192.168.0.103:6379 192.168.0.103 6379 @ mymaster 192.168.0.101 6379
[1728] 28 Feb 19:57:32.855 # +promoted-slave slave 192.168.0.103:6379 192.168.0.103 6379 @ mymaster 192.168.0.101 6379
[1728] 28 Feb 19:57:32.855 # +failover-state-reconf-slaves master mymaster 192.168.0.101 6379
[1728] 28 Feb 19:57:32.855 # +failover-state-reconf-slaves master mymaster 192.168.0.101 6379
[1728] 28 Feb 19:57:32.919 # +failover-end master mymaster 192.168.0.101 6379
[1728] 28 Feb 19:57:32.919 # +switch-master mymaster 192.168.0.101 6379 192.168.0.103 6379
[1728] 28 Feb 19:57:32.920 * +slave slave 192.168.0.101:6379 192.168.0.101 6379 @ mymaster 192.168.0.103 6379
[1728] 28 Feb 20:00:33.627 * +fix-slave-config slave 192.168.0.101:6379 192.168.0.101 6379 @ mymaster 192.168.0.103 6379

=== Balthasar's redis-server log ===
[1698] 28 Feb 19:56:33.720 # Connection with slave 192.168.0.103:6379 lost.
[1698] 28 Feb 19:56:33.721 * SLAVE OF casper:6379 enabled (user request)
[1698] 28 Feb 19:56:34.304 * Connecting to MASTER casper:6379
[1698] 28 Feb 19:56:34.305 * MASTER <-> SLAVE sync started
[1698] 28 Feb 19:56:34.306 * Non blocking connect for SYNC fired the event.
[1698] 28 Feb 19:56:34.306 * Master replied to PING, replication can continue...
[1698] 28 Feb 19:56:34.307 * Partial resynchronization not possible (no cached master)
[1698] 28 Feb 19:56:34.307 * Master does not support PSYNC or is in error state (reply: -ERR Can't SYNC while not connected with my master)
[1698] 28 Feb 19:56:34.307 * Retrying with SYNC...
[1698] 28 Feb 19:56:34.308 # MASTER aborted replication with an error: ERR Can't SYNC while not connected with my master
...
[1698] 28 Feb 19:57:32.055 * Connecting to MASTER casper:6379
[1698] 28 Feb 19:57:32.056 * MASTER <-> SLAVE sync started
[1698] 28 Feb 19:57:32.056 * Non blocking connect for SYNC fired the event.
[1698] 28 Feb 19:57:32.057 * Master replied to PING, replication can continue...
[1698] 28 Feb 19:57:32.057 * Partial resynchronization not possible (no cached master)
[1698] 28 Feb 19:57:32.058 * Full resync from master: 28295c6983d92ca18a3ea8d1a18c5dc718d2b293:1
[1698] 28 Feb 19:57:32.093 * MASTER <-> SLAVE sync: receiving 18 bytes from master
[1698] 28 Feb 19:57:32.093 * MASTER <-> SLAVE sync: Flushing old data
[1698] 28 Feb 19:57:32.093 * MASTER <-> SLAVE sync: Loading DB in memory
[1698] 28 Feb 19:57:32.093 * MASTER <-> SLAVE sync: Finished with success
[1698] 28 Feb 20:00:33.469 # Connection with master lost.
[1698] 28 Feb 20:00:33.469 * Caching the disconnected master state.
[1698] 28 Feb 20:00:33.470 * Discarding previously cached master state.
[1698] 28 Feb 20:00:33.470 * SLAVE OF 192.168.0.103:6379 enabled (user request)
[1698] 28 Feb 20:00:34.221 * Connecting to MASTER 192.168.0.103:6379
[1698] 28 Feb 20:00:34.221 * MASTER <-> SLAVE sync started
[1698] 28 Feb 20:00:34.222 * Non blocking connect for SYNC fired the event.
[1698] 28 Feb 20:00:34.223 * Master replied to PING, replication can continue...
[1698] 28 Feb 20:00:34.223 * Partial resynchronization not possible (no cached master)
[1698] 28 Feb 20:00:34.224 * Full resync from master: 28295c6983d92ca18a3ea8d1a18c5dc718d2b293:37500
[1698] 28 Feb 20:00:34.296 * MASTER <-> SLAVE sync: receiving 18 bytes from master
[1698] 28 Feb 20:00:34.297 * MASTER <-> SLAVE sync: Flushing old data
[1698] 28 Feb 20:00:34.297 * MASTER <-> SLAVE sync: Loading DB in memory
[1698] 28 Feb 20:00:34.297 * MASTER <-> SLAVE sync: Finished with success

=== Casper's redis-server log ===
4124] 28 Feb 19:56:33.639 # Connection with master lost.
[4124] 28 Feb 19:56:33.639 * Caching the disconnected master state.
[4124] 28 Feb 19:56:34.602 * Connecting to MASTER 192.168.0.101:6379
[4124] 28 Feb 19:56:34.602 * MASTER <-> SLAVE sync started
[4124] 28 Feb 19:56:34.602 * Non blocking connect for SYNC fired the event.
[4124] 28 Feb 19:56:34.603 * Master replied to PING, replication can continue...
[4124] 28 Feb 19:56:34.604 * Trying a partial resynchronization (request 47750624a99f11331d0b06043cea65957552e76c:64637).
[4124] 28 Feb 19:56:34.604 * Master does not support PSYNC or is in error state (reply: -ERR Can't SYNC while not connected with my master)
[4124] 28 Feb 19:56:34.604 * Discarding previously cached master state.
[4124] 28 Feb 19:56:34.604 * Retrying with SYNC...
[4124] 28 Feb 19:56:34.605 # MASTER aborted replication with an error: ERR Can't SYNC while not connected with my master
...
[4124] 28 Feb 19:57:31.304 * Connecting to MASTER 192.168.0.101:6379
[4124] 28 Feb 19:57:31.304 * MASTER <-> SLAVE sync started
[4124] 28 Feb 19:57:31.305 * Non blocking connect for SYNC fired the event.
[4124] 28 Feb 19:57:31.306 * Master replied to PING, replication can continue...
[4124] 28 Feb 19:57:31.307 * Partial resynchronization not possible (no cached master)
[4124] 28 Feb 19:57:31.307 * Master does not support PSYNC or is in error state (reply: -ERR Can't SYNC while not connected with my master)
[4124] 28 Feb 19:57:31.307 * Retrying with SYNC...
[4124] 28 Feb 19:57:31.308 # MASTER aborted replication with an error: ERR Can't SYNC while not connected with my master
[4124] 28 Feb 19:57:31.723 * MASTER MODE enabled (user request)
[4124] 28 Feb 19:57:31.976 * Slave asks for synchronization
[4124] 28 Feb 19:57:31.976 * Full resync requested by slave.
[4124] 28 Feb 19:57:31.976 * Starting BGSAVE for SYNC
[4124] 28 Feb 19:57:31.978 * Background saving started by pid 25857
[25857] 28 Feb 19:57:31.989 * DB saved on disk
[25857] 28 Feb 19:57:31.990 * RDB: 4 MB of memory used by copy-on-write
[4124] 28 Feb 19:57:32.010 * Background saving terminated with success
[4124] 28 Feb 19:57:32.010 * Synchronization with slave succeeded
[4124] 28 Feb 20:00:33.388 # Connection with slave 192.168.0.101:6379 lost.
[4124] 28 Feb 20:00:34.142 * Slave asks for synchronization
[4124] 28 Feb 20:00:34.142 * Full resync requested by slave.
[4124] 28 Feb 20:00:34.143 * Starting BGSAVE for SYNC
[4124] 28 Feb 20:00:34.144 * Background saving started by pid 25873
[25873] 28 Feb 20:00:34.180 * DB saved on disk
[25873] 28 Feb 20:00:34.180 * RDB: 4 MB of memory used by copy-on-write
[4124] 28 Feb 20:00:34.214 * Background saving terminated with success
[4124] 28 Feb 20:00:34.214 * Synchronization with slave succeeded
8. Sentinel 起動中に設定ファイルを間違った内容で上書き

いかなるイベントも発生しません。
設定ファイルは、レプリケーションの構成、または Sentinel の構成に
変化があった時に正しい内容で修正されます。

但し、書き換えられた状態のまま何も起こらずに Sentinel が停止した場合
設定ファイルは元の状態には戻りません。

  • 内容を書き換える
    [Melchior] # echo > redis-sentinel.conf
    Master, Slave 及び どの Sentinel のどのインスタンスにも影響はなく、
    メッセージも出力されません。

  • Master を停止する
    [Melchior] # service redis stop
    フェイルオーバーが実行された際に設定ファイルが修正されます。

  • Slave を停止する
    [Casper ] # service redis stop
    Slave の停止が認識された際に設定ファイルが修正されます。

  • 書き換えた設定ファイルを使用している Sentinel を停止する
    [Melchior] # service redis-sentinel stop
    正常停止の際に設定ファイルを書き出すという動作はなく、
    設定ファイルは修正されない為、元に戻りません。

  • 他の Sentinel を停止する
    [Casper ] # service redis-sentinel stop
    Sentinel の停止が認識された際に、設定ファイルが修正されます。

まとめ

変な事さえしなければ良きに計らって(?)動いてくれたので、
Sentinel を使った構成に移行しても良いのかなという印象です。
(少なくとも keepalived を利用するよりはスマート?)