Lecture 4: Primary-Backup Replication

MIT 6.824 · Lecture 4: Primary-Backup Replication

Official lecture notes

Primary-Backup Replication for Fault Tolerance

Goal: achieve fault tolerance

  • Provide availability
  • Despite server and network failures
  • Using replication

Failures

  • Fail-stop faults: can be handled by replication.
  • Software bugs: not covered.

Replication approaches:

  • State Transfer: transfer memory.

    • The primary replica executes the service.
    • The primary sends its entire state to the backups.
    • The state may be too large, making it slow to transfer over the network.
  • Replicated State Machine: send only the external events (operations), not the state.

    • If replicas start in the same state and apply the same deterministic operations in the same order, they reach the same end state.
    • Generates less network traffic than state transfer.
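
The replicated state machine idea can be illustrated with a minimal sketch. The `KVStateMachine` class and the `(kind, key, value)` operation format are hypothetical, chosen only for illustration. The point is that two replicas applying the same operations in the same order to the same start state end in the same state, while a different order may not.

```python
from typing import Dict, List, Tuple

# Hypothetical operation format: (kind, key, value).
Op = Tuple[str, str, str]


class KVStateMachine:
    """A toy deterministic state machine: a string key-value store."""

    def __init__(self) -> None:
        self.state: Dict[str, str] = {}

    def apply(self, op: Op) -> None:
        kind, key, value = op
        if kind == "put":
            self.state[key] = value
        elif kind == "append":
            self.state[key] = self.state.get(key, "") + value
        else:
            raise ValueError(f"unknown operation: {kind}")


def replay(ops: List[Op]) -> Dict[str, str]:
    """Apply ops in order to a fresh state machine and return the end state."""
    sm = KVStateMachine()
    for op in ops:
        sm.apply(op)
    return sm.state


ops: List[Op] = [("put", "x", "1"), ("append", "x", "2"), ("put", "y", "a")]

primary_state = replay(ops)
backup_state = replay(ops)  # same start state, same operations, same order
reordered_state = replay([ops[1], ops[0], ops[2]])  # same operations, different order
```

Here `primary_state` and `backup_state` are equal, while `reordered_state` differs, which is why the order of operations has to be part of what is replicated.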

Replication level:

  • Application state: like GFS.
    • Efficient: the primary only sends high-level operations to the backup.
    • The application must implement fault tolerance itself, for example by forwarding its operation stream.
  • Machine level: registers and RAM content.
    • Forwards machine events: interrupts, DMA, etc.
    • Requires modifications so that machines can send and receive the event stream.

What state (to replicate)?

  • Primary-Backup sync
  • cut-over: when the primary fails, the client needs a mechanism to switch its target (primary -> backup).
  • anomalies
  • new replicas
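
One possible shape for the client side of cut-over is sketched below. This is an assumption for illustration, not a mechanism described in the notes: the servers are modeled as plain functions, and `ServerDown`, `failed_primary`, `backup`, and `call_with_failover` are hypothetical names. The client tries replicas in order and moves on when one has failed.

```python
from typing import Callable, List, Optional


class ServerDown(Exception):
    """Raised by a (simulated) server that has failed."""


def failed_primary(request: str) -> str:
    # Simulates a primary that has crashed.
    raise ServerDown("primary is down")


def backup(request: str) -> str:
    # Simulates a backup that has taken over.
    return f"backup handled {request}"


def call_with_failover(servers: List[Callable[[str], str]], request: str) -> str:
    """Try each server in order; return the first successful reply."""
    last_error: Optional[ServerDown] = None
    for server in servers:
        try:
            return server(request)
        except ServerDown as err:
            last_error = err  # remember the failure and try the next replica
    raise ServerDown("no replica reachable") from last_error


result = call_with_failover([failed_primary, backup], "read x")
```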

Non-deterministic events:

  • Inputs: a packet arrives as data plus an interrupt
  • Weird instructions (multicore)
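
A minimal sketch of why non-deterministic events have to be logged rather than re-executed: the primary records each non-deterministic value it observes, and the backup consumes the recorded values instead of generating its own. The random number here is only a stand-in for a non-deterministic input, and `run_primary` and `run_backup` are hypothetical names.

```python
import random
from typing import List, Tuple


def run_primary(steps: int, rng: random.Random) -> Tuple[int, List[int]]:
    """Run the computation, logging every non-deterministic value observed."""
    log: List[int] = []
    total = 0
    for _ in range(steps):
        value = rng.randrange(1000)  # stand-in for a non-deterministic input
        log.append(value)            # record it for the backup
        total += value
    return total, log


def run_backup(log: List[int]) -> int:
    """Replay the same computation using the logged values only."""
    total = 0
    for value in log:
        total += value
    return total


primary_total, event_log = run_primary(5, random.Random())
backup_total = run_backup(event_log)
```

Because the backup never draws its own random values, `backup_total` matches `primary_total` regardless of what the primary observed.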

Each log entry:

  • instruction number (#)
  • type
  • data
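
A log entry with these three fields might be represented as follows. This is only a sketch: the `LogEntry` class, the event type strings, and the `due_events` helper are assumptions for illustration, not the format used by any real VMM. The instruction number tells the backup at which point in the instruction stream the event occurred on the primary.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LogEntry:
    instruction_number: int  # instruction count at which the event occurred on the primary
    event_type: str          # hypothetical event kind, e.g. "network_input"
    data: bytes              # payload such as packet contents (may be empty)


def due_events(log: List[LogEntry], executed: int) -> List[LogEntry]:
    """Return the entries at or before the backup's current instruction count."""
    return [entry for entry in log if entry.instruction_number <= executed]


log = [
    LogEntry(100, "timer_interrupt", b""),
    LogEntry(250, "network_input", b"GET /"),
]

delivered = due_events(log, executed=120)
```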

Output rule: the primary may respond to the client only after the corresponding log entry has been sent to the backup's VMM and the backup has acknowledged it.
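
The output rule can be sketched as a toy simulation. The `Backup` and `Primary` classes, the `reachable` flag, and the string log format are hypothetical, and a real system would use network messages rather than a direct method call. What the sketch shows is the ordering: the reply is released to the client only once the backup has acknowledged the log entry.

```python
from typing import List


class Backup:
    """Toy backup that stores log entries and acknowledges each one."""

    def __init__(self) -> None:
        self.log: List[str] = []
        self.reachable = True

    def receive(self, entry: str) -> bool:
        if not self.reachable:
            return False  # no acknowledgement
        self.log.append(entry)
        return True  # acknowledgement


class Primary:
    def __init__(self, backup: Backup) -> None:
        self.backup = backup
        self.sent_to_client: List[str] = []

    def handle(self, request: str) -> str:
        output = request.upper()  # stand-in for executing the service
        acked = self.backup.receive(f"input:{request}")
        if not acked:
            # Output rule: without an acknowledgement, the output is withheld.
            raise RuntimeError("no acknowledgement from backup; output withheld")
        self.sent_to_client.append(output)
        return output


backup = Backup()
primary = Primary(backup)
reply = primary.handle("get x")

backup.reachable = False
try:
    primary.handle("get y")
    withheld = False
except RuntimeError:
    withheld = True
```

After this runs, only the first reply has reached the client; the second was withheld because the backup never acknowledged its log entry.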