Lecture 3: GFS

MIT 6.824 · Lecture 3: GFS #

由于 2021 的课程是网课形式,听了两节课下来,总体感觉老师讲授的内容没有线下这么多了(没有黑板可能不太方便),于是又看回了 Spring 2020 的课程。因此该次学习计划就调整为:2020 Lectures + 2021 Labs。

Big Storage #

Why it’s hard? #

  • Performance -> sharding
  • Faults -> tolerance
  • Tolerance -> replication
  • Replication -> inconsistency
  • Consistency -> Low performance

This is a dead loop.

Strong Consistency #

C1  Wx1
C2  Wx2
C3        Rx2
C4             Rx2

Client 3, 4 should read Wx2 results but not Wx1 results.


  • Big, fast
  • Global (allow for sharing)
  • Sharding
  • Automatic recovery

Singal data center #

  • Internal use
  • For big sequential access (not for random access)

Master Data #

  • Filename -> array of chunk handles (non-volatile)
  • Handle:
    • list of chunk servers (volatile)
    • Version # (non-volatile)
    • primary (volatile)
    • Lease expiration (volatile)
  • Log, checkpoint – DISK

Read #

  1. Client sends filename, offset to Master
  2. Master sends chunk handle, return list of Servers
    The client cached these results.
  3. Client sends request to chunk server, and chunk server returns the file data.

Writes #

  • No primary – on master:

    • Find up to date replicas
    • Picks primary and secondaries chunks
    • Increment the version number (v#)
    • Tells primary and secondaries this new version, and gives primary a lease
    • master writes its own version number to disk
    • Primary picks off all replicas(including itself) told to write at offset
    • If all secondaries say “yes”, then the primary reply “success” to client; else reply “no” to client
  • How to deal with scenarios like “split brain” or network partition that having multiple primaries?

    The answer is using “Lease”.

How to turn GFS support strong consistency? #

  • Duplicate detection (discard duplicate requests)
  • Secondary confirmation (Secondaries really have done something, with an confirmation mechanism)
  • two-phase commit (Can you do it? Do it after replying with a promise response)
  • primary crashed, new primary has to start by explicitly make synchronizing tail operations
  • if secondaries times differ or client has a slightly stale indication, the system should send all client reads to the primary

High-level summary #

  • tremendously successful: a number of google infrastructure built as a layer on top of the GFS
  • most serious limitation: single master (record all chunk records, maybe run out of RAM)
  • some applications found it its hard to deal with some odd semantics
  • master cannot automatically fail-over that requires human intervention (cost about 10 minutes+)