MIT 6.824 · Lecture 1: Introduction #
MIT 6.824 (Distributed Systems Engineering), Lecture 1.
Lecture Notes: https://pdos.csail.mit.edu/6.824/notes/l01.txt
Lab 1: MapReduce (https://pdos.csail.mit.edu/6.824/labs/lab-mr.html)
About course #
Lectures: big ideas
Papers serve as case studies:
- replication using raft
- replicated key-value service
- sharded key-value service
Why are distributed systems popular? #
- connect physically separated machines -> sharing
- increase capacity -> parallelism
- tolerate faults -> replication
- achieve security -> isolation
Challenges in distributed systems #
- many concurrent parts (complexity)
- must deal with partial failure (complexity)
- tricky to realize the performance benefits
Focus in the class: Infrastructure #
- Storage (e.g. GFS)
- Computation (e.g. MapReduce)
- Communication (e.g. RPC; more in MIT 6.829 Network Systems)
Main topics #
- Fault tolerance
- availability (replication)
- recoverability (logging / transactions, durable storage)
- Performance (throughput, latency)
Achieving all three at the same time is extremely difficult; there is always a trade-off.
MapReduce context #
- Multi-hour computations over terabytes of data (e.g. web indexing)
- Goal: make it easy for non-experts to write distributed applications
- map + reduce -> sequential code
- MR deals with all the details of distribution
The library is not general purpose: your computation has to fit the particular computational model that MapReduce provides.
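To make the model concrete, here is a minimal sequential sketch of the classic word-count application in Go. The `KeyValue` type and function names mirror the shapes used in the MapReduce paper and Lab 1, but are assumptions here, not the lab's exact API; the driver in `main` simply groups map output by key and calls reduce on each group.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// KeyValue is the intermediate pair type; the name is an assumption
// modeled on the MapReduce paper, not a specific library's API.
type KeyValue struct {
	Key   string
	Value string
}

// Map emits one ("word", "1") pair per word in the input.
func Map(contents string) []KeyValue {
	var kvs []KeyValue
	for _, w := range strings.Fields(contents) {
		kvs = append(kvs, KeyValue{Key: w, Value: "1"})
	}
	return kvs
}

// Reduce counts how many values were gathered for one key.
func Reduce(key string, values []string) string {
	return fmt.Sprint(len(values))
}

func main() {
	// A tiny sequential driver: group map output by key, reduce each group.
	kvs := Map("the quick the lazy the dog")
	groups := map[string][]string{}
	for _, kv := range kvs {
		groups[kv.Key] = append(groups[kv.Key], kv.Value)
	}
	keys := make([]string, 0, len(groups))
	for k := range groups {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s %s\n", k, Reduce(k, groups[k]))
	}
}
```

The application author writes only `Map` and `Reduce` as ordinary sequential code; the real MapReduce library replaces the driver, running maps and reduces in parallel across machines.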
Abstract view #
MapReduce execution overview #
Fault tolerance #
Coordinator rerun map/reduce.
Can map function run twice? Yes.
The functional aspect is important: the map function should be functional / deterministic, so that any node running it produces exactly the same result.
Can reduce run twice? Yes. Correctness relies on atomic rename in GFS: each execution writes to a temporary file and renames it into place, so the final output is one complete copy.
Can the coordinator (master) fail? No, this design does not tolerate it: if the coordinator fails, the whole job must be rerun.
How to deal with slow workers (stragglers)? Run backup tasks: schedule the same map/reduce task on another worker as well. Performance is then limited not by the slowest worker but by the fastest of the replicated executions.
This is a common technique for dealing with stragglers and tail latency in general: replicate the task and take the first result that finishes.
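The backup-task idea fits naturally into Go's concurrency model; here is a minimal sketch. The `runWithBackup` helper is hypothetical, and the simulated worker stands in for what would really be an RPC to a remote machine.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// runWithBackup starts the same task twice and returns whichever result
// arrives first; the straggler's duplicate result is simply discarded.
// This assumes the task is deterministic and idempotent, as map/reduce
// tasks are required to be.
func runWithBackup(task func() string) string {
	ch := make(chan string, 2) // buffered so the loser never blocks
	for i := 0; i < 2; i++ {
		go func() { ch <- task() }()
	}
	return <-ch // first finisher wins
}

func main() {
	task := func() string {
		// Simulate a worker that is sometimes a straggler.
		time.Sleep(time.Duration(rand.Intn(100)) * time.Millisecond)
		return "done"
	}
	fmt.Println(runWithBackup(task))
}
```

The buffered channel is the key design detail: without it, the slower goroutine would block forever on its send and leak.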