MIT 6.824 · Lecture 4: Primary-Backup Replication
Primary-Backup Replication for Fault Tolerance
Goal: achieve fault tolerance
- Provide availability
- Despite server and network failures
- using replication
Failures
- Fail-stop faults: can be handled by replication.
- Software bugs: not handled by replication.
Replication approaches:
- State Transfer: transfer the primary's state (memory) to the backups.
  - The primary replica executes the service.
  - The primary sends its entire state to the backups.
  - The state may be large, which makes it slow to transfer over the network.
- Replicated State Machine: send only the external events (operations), not the state.
  - If the replicas start in the same state and apply the same deterministic operations in the same order, they reach the same end state.
  - Generates less network traffic than state transfer.
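The difference between the two approaches can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical key-value service `KVStateMachine` that is not part of the lecture: two replicas start from the same state and apply the same operations in the same order, so they agree without any state being shipped.

```python
from dataclasses import dataclass, field


@dataclass
class KVStateMachine:
    """Hypothetical deterministic service: a tiny key-value store."""
    data: dict = field(default_factory=dict)

    def apply(self, op: tuple) -> None:
        # Each operation is a tuple: ("put", key, value) or ("append", key, value).
        kind, key, value = op
        if kind == "put":
            self.data[key] = value
        elif kind == "append":
            self.data[key] = self.data.get(key, "") + value
        else:
            raise ValueError(f"unknown operation: {kind}")


ops = [("put", "x", "1"), ("append", "x", "2"), ("put", "y", "3")]

primary = KVStateMachine()
backup = KVStateMachine()

# Replicated state machine: forward only the operations, in order.
for op in ops:
    primary.apply(op)
    backup.apply(op)

# State transfer would instead copy the primary's whole state to the backup.
state_transfer_copy = dict(primary.data)
```

The operation list stays small even when `data` grows large, which is why this approach tends to generate less traffic than copying the state.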
Replication level:
- Application state: for example, GFS.
  - Efficient: the primary sends only high-level operations to the backup.
  - The application must support fault tolerance itself, for example by forwarding its operation stream.
- Machine level: registers and RAM contents.
  - Forward machine events: interrupts, DMA, etc.
  - Requires modifications so that machines can send and receive the event stream.
What state (to replicate)?
- Primary-Backup sync
- Cut-over: when the primary fails, clients need a mechanism to switch their target from the primary to the backup.
- anomalies
- new replicas
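One way a client might handle cut-over is to retry against the next server when the current target fails. The sketch below is an assumption for illustration only; the `Server` and `Client` classes are hypothetical and the lecture does not prescribe this mechanism.

```python
class Server:
    """Hypothetical server that either answers or is down."""

    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive

    def call(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class Client:
    def __init__(self, servers):
        self.servers = servers  # ordered list: current target first
        self.target = 0

    def send(self, request):
        # Try the current target; on failure, cut over to the next server.
        for _ in range(len(self.servers)):
            try:
                return self.servers[self.target].call(request)
            except ConnectionError:
                self.target = (self.target + 1) % len(self.servers)
        raise ConnectionError("no server reachable")


failed_primary = Server("primary", alive=False)
live_backup = Server("backup")
client = Client([failed_primary, live_backup])
reply = client.send("get x")
```

After the first failure the client remembers the new target, so later requests go straight to the backup.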
Non-deterministic events:
- Inputs: a packet arrives as data plus an interrupt.
- Weird instructions (multicore)
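A common way to keep replicas identical despite non-determinism is to have the primary record the result and the backup replay it instead of recomputing. The sketch below assumes reading the wall clock as the example non-deterministic operation; the function names are hypothetical.

```python
import time


def primary_read_clock(log: list) -> float:
    # The primary performs the non-deterministic operation for real
    # and records the result so the backup can reuse it.
    value = time.time()
    log.append(("clock", value))
    return value


def backup_read_clock(log: list, position: int) -> float:
    # The backup does not consult its own clock; it replays the logged result.
    kind, value = log[position]
    if kind != "clock":
        raise ValueError(f"unexpected log entry type: {kind}")
    return value


log = []
seen_by_primary = primary_read_clock(log)
seen_by_backup = backup_read_clock(log, 0)
```

Both replicas observe the same value even though their own clocks would differ.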
Each log entry:
- instruction number (#)
- type
- data
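The three fields above map naturally onto a small record. The following is a sketch under assumptions: the type labels (`"interrupt"`, `"input"`) and the `replay` helper are hypothetical, and the instruction stream is modeled as a simple counter. It shows the backup delivering each event at the same instruction count at which the primary recorded it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    instruction_number: int  # point in the instruction stream where the event occurred
    type: str                # hypothetical labels, e.g. "interrupt" or "input"
    data: bytes              # payload, e.g. packet contents


def replay(entries, total_instructions):
    """Return (instruction count, type) pairs in the order the backup delivers them."""
    pending = sorted(entries, key=lambda e: e.instruction_number)
    delivered = []
    next_idx = 0
    for executed in range(1, total_instructions + 1):
        # Deliver every event recorded at exactly this instruction count.
        while (next_idx < len(pending)
               and pending[next_idx].instruction_number == executed):
            delivered.append((executed, pending[next_idx].type))
            next_idx += 1
    return delivered


entries = [
    LogEntry(7, "interrupt", b""),
    LogEntry(3, "input", b"packet-bytes"),
]
trace = replay(entries, total_instructions=10)
```

Here `trace` lists the input at instruction 3 before the interrupt at instruction 7, regardless of the order in which the entries arrived.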
Output rule: the primary may not send a response to the client until the log entry for the request has been sent to the backup's VMM and the backup has acknowledged it.
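The output rule can be sketched as a gate on the primary's reply path. This is a simplified, synchronous illustration with hypothetical `Primary` and `Backup` classes; a real system would send log entries over the network and wait for the acknowledgement asynchronously.

```python
class Backup:
    def __init__(self):
        self.log = []

    def receive(self, entry) -> bool:
        # Store the entry, then acknowledge it.
        self.log.append(entry)
        return True


class UnreachableBackup(Backup):
    def receive(self, entry) -> bool:
        return False  # models a lost message or a missing acknowledgement


class Primary:
    def __init__(self, backup):
        self.backup = backup
        self.sent_to_client = []

    def handle(self, request: str) -> bool:
        output = request.upper()  # stand-in for the real computation
        acked = self.backup.receive(("input", request))
        if not acked:
            # No acknowledgement yet: hold the output instead of releasing it.
            return False
        self.sent_to_client.append(output)
        return True


ok = Primary(Backup())
released = ok.handle("get x")

blocked = Primary(UnreachableBackup())
held = blocked.handle("get x")
```

With this gate in place, any output the client has seen is backed by a log entry the backup already holds, so a backup that takes over can reach a state consistent with that output.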