MIT 6.824 · Lecture 3: GFS #
The 2021 course was taught online, and after two lectures it felt like less material was being covered than in the in-person version (probably because there is no blackboard), so I went back to the Spring 2020 lectures. The study plan is therefore adjusted to: 2020 Lectures + 2021 Labs.
Big Storage #
Why is it hard? #
- Performance -> sharding
- Faults -> tolerance
- Tolerance -> replication
- Replication -> inconsistency
- Consistency -> Low performance
The chain comes full circle: the pursuit of performance ends up costing performance.
Strong Consistency #
- C1: Wx1 (write x = 1)
- C2: Wx2 (write x = 2)
- C3: Rx → 2 (read x)
- C4: Rx → 2 (read x)
Clients 3 and 4 should both see the result of Wx2 and not the stale value from Wx1 (and they must never see different values from each other), as if a single server had executed Wx1 and then Wx2.
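A minimal sketch of that expectation (my own illustration in Go, not GFS code): if the store behaves like one server applying writes in a single order, every later read returns the latest write.

```go
package main

import "fmt"

// A strongly consistent store behaves like a single server: writes are
// applied in one total order, and every read that starts after a write
// completes observes that write's effect.
type Store struct {
	data map[string]int
}

func (s *Store) Write(key string, val int) { s.data[key] = val }
func (s *Store) Read(key string) int       { return s.data[key] }

func main() {
	s := &Store{data: map[string]int{}}
	s.Write("x", 1)          // C1: Wx1
	s.Write("x", 2)          // C2: Wx2, ordered after Wx1
	fmt.Println(s.Read("x")) // C3: Rx -> 2
	fmt.Println(s.Read("x")) // C4: Rx -> 2, never a stale 1
}
```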
GFS #
- Big, fast
- Global (allows sharing across applications)
- Sharding
- Automatic recovery
Single data center #
- Internal use
- For big sequential access (not for random access)
Master Data #
- Filename -> array of chunk handles (non-volatile)
- Chunk handle ->
  - list of chunk servers (volatile)
  - version # (non-volatile)
  - primary (volatile)
  - lease expiration (volatile)
- Log and checkpoints -> on disk
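A rough Go sketch of these tables (field and type names are my own; the real GFS data structures are not public). Everything marked non-volatile must also be recorded in the operation log and checkpoints on disk so it survives a master reboot.

```go
package gfsmaster

import "time"

// ChunkInfo holds the per-chunk state the master keeps in memory.
type ChunkInfo struct {
	Servers  []string  // chunk servers holding a replica (volatile)
	Version  uint64    // version number (non-volatile: also logged to disk)
	Primary  string    // current primary, empty if none (volatile)
	LeaseExp time.Time // when the primary's lease expires (volatile)
}

// Master maps filenames to chunks and chunks to their state.
type Master struct {
	files  map[string][]uint64   // filename -> array of chunk handles (non-volatile)
	chunks map[uint64]*ChunkInfo // chunk handle -> chunk state
	// plus an operation log and periodic checkpoints on disk
}
```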
Read #
- Client sends the filename and offset to the master.
- Master returns the chunk handle and the list of chunk servers holding that chunk; the client caches this answer so it does not have to ask the master again.
- Client sends the request to one of the chunk servers, which returns the data.
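A sketch of the client-side read path in Go; the RPC names (LookupChunk, ReadChunk) and types are my own stand-ins, not the actual GFS API.

```go
package gfsclient

import "math/rand"

const chunkSize = 64 << 20 // GFS uses 64 MB chunks

// ChunkLocation is what the master returns for one chunk.
type ChunkLocation struct {
	Handle  uint64
	Servers []string
}

// Master and ChunkServer stand in for the RPC stubs.
type Master interface {
	LookupChunk(filename string, chunkIndex int64) (ChunkLocation, error)
}
type ChunkServer interface {
	ReadChunk(handle uint64, offset int64, n int) ([]byte, error)
}

type cacheKey struct {
	filename   string
	chunkIndex int64
}

type Client struct {
	master  Master
	servers map[string]ChunkServer
	cache   map[cacheKey]ChunkLocation // client caches the master's replies
}

// ReadAt sketches the read path: ask the master once, cache the answer,
// then fetch the bytes from any replica of the chunk.
func (c *Client) ReadAt(filename string, offset int64, n int) ([]byte, error) {
	idx := offset / chunkSize
	loc, ok := c.cache[cacheKey{filename, idx}]
	if !ok {
		var err error
		loc, err = c.master.LookupChunk(filename, idx)
		if err != nil {
			return nil, err
		}
		c.cache[cacheKey{filename, idx}] = loc
	}
	// Any replica will do for a read.
	addr := loc.Servers[rand.Intn(len(loc.Servers))]
	return c.servers[addr].ReadChunk(loc.Handle, offset%chunkSize, n)
}
```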
Writes #
- No primary yet? On the master:
  - Find the up-to-date replicas (those whose chunk version equals the version number the master has recorded)
  - Pick a primary and secondaries from them
  - Increment the version number (v#)
  - Tell the primary and secondaries the new version number, and grant the primary a lease
  - The master writes the new version number to disk
- The primary picks an offset, and all replicas (including the primary itself) are told to write the data at that offset
- If all secondaries say "yes", the primary replies "success" to the client; otherwise it replies "no" and the client must retry (see the sketch below)
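A Go sketch of the master's "no primary" path, continuing the Master/ChunkInfo types sketched earlier. replicaVersion, notifyReplicas, and persistVersion are hypothetical helpers standing in for RPCs and the disk write; they are not real GFS calls.

```go
package gfsmaster

import (
	"errors"
	"time"
)

// ensurePrimary returns the chunk's current primary, designating one
// first if no valid primary exists.
func (m *Master) ensurePrimary(handle uint64) (*ChunkInfo, error) {
	c := m.chunks[handle]
	if c.Primary != "" && time.Now().Before(c.LeaseExp) {
		return c, nil // a valid primary already exists
	}

	// 1. Find the up-to-date replicas: those whose chunk version
	//    equals the version number the master has recorded.
	var upToDate []string
	for _, s := range c.Servers {
		if m.replicaVersion(s, handle) == c.Version {
			upToDate = append(upToDate, s)
		}
	}
	if len(upToDate) == 0 {
		return nil, errors.New("no up-to-date replica reachable; must wait")
	}

	// 2. Pick a primary (the rest become secondaries) and bump the version.
	c.Primary = upToDate[0]
	c.Version++

	// 3. Tell the replicas the new version, grant the primary a lease,
	//    and record the new version on disk so it survives a reboot.
	c.LeaseExp = time.Now().Add(60 * time.Second)
	m.notifyReplicas(upToDate, handle, c.Version, c.Primary, c.LeaseExp)
	m.persistVersion(handle, c.Version)
	return c, nil
}
```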
How do we deal with scenarios like "split brain", where a network partition could leave the system with multiple primaries?
The answer is leases: the master grants the primary a lease (e.g. 60 seconds) and never designates a new primary until the old lease has expired, so at most one valid primary exists at any time.
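A tiny sketch of that rule, again continuing the Master/ChunkInfo types above (my own illustration):

```go
package gfsmaster

import "time"

// When the master loses contact with a primary, it cannot tell
// "crashed" apart from "partitioned". Appointing a replacement while
// the old lease is still valid could leave two primaries accepting
// writes at once (split brain), so the master waits the lease out.
func (m *Master) canAppointNewPrimary(handle uint64) bool {
	c := m.chunks[handle]
	return time.Now().After(c.LeaseExp)
}
```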
What would it take to make GFS support strong consistency? #
- Duplicate detection: the primary must detect and discard duplicate client requests instead of applying a retried request twice (see the sketch below).
- Secondary confirmation: secondaries must actually apply the write and confirm it, rather than silently dropping or failing a request.
- Two-phase commit: first ask every replica "can you do it?", and only after all of them have promised does the primary tell them to commit.
- When a primary crashes, the new primary has to start by explicitly resynchronizing the tail of recent operations with the secondaries.
- Because secondaries may differ slightly, or a client may hold slightly stale cached information, the system may need to send all client reads through the primary.
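A minimal sketch of the first item, duplicate detection (my own illustration, not part of GFS): the client tags each request with an ID, and the primary remembers the outcome so a retransmitted request is answered without being applied a second time.

```go
package dedup

// Request is a client operation tagged with a unique (ClientID, Seq) pair.
type Request struct {
	ClientID string
	Seq      int64         // per-client sequence number
	Apply    func() string // the operation itself; returns its result
}

type dedupKey struct {
	client string
	seq    int64
}

// Primary remembers the result of each request it has executed.
type Primary struct {
	seen map[dedupKey]string
}

func NewPrimary() *Primary {
	return &Primary{seen: make(map[dedupKey]string)}
}

// Handle executes a request at most once: a duplicate gets the saved
// result instead of being applied again.
func (p *Primary) Handle(r Request) string {
	k := dedupKey{r.ClientID, r.Seq}
	if res, ok := p.seen[k]; ok {
		return res
	}
	res := r.Apply()
	p.seen[k] = res
	return res
}
```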
High-level summary #
- Tremendously successful: a lot of Google's infrastructure was built as a layer on top of GFS.
- Most serious limitation: the single master, which keeps all file and chunk metadata in memory and can eventually run out of RAM.
- Some applications found it hard to deal with GFS's odd (weak) consistency semantics.
- The master could not fail over automatically; recovery required human intervention and took on the order of ten minutes or more.