# Podcast with Heidi Howard (Research Fellow in Computer Science at Cambridge)

## Sources

* _In Search of an Understandable Consensus Algorithm_ (the Raft paper)
  * https://raft.github.io/raft.pdf
  * a good first paper to introduce this area
* _Paxos Made Moderately Complex_
  * a good second paper, after the Raft paper
* _The Part-Time Parliament_ by Lamport, the original paper which introduced Paxos and pioneered this area
  * https://lamport.azurewebsites.net/pubs/lamport-paxos.pdf
  * fairly vague, so it's hard to be sure whether things (e.g. Raft) are or are not Paxos
* _Paxos Made Simple_ by Lamport
  * https://lamport.azurewebsites.net/pubs/paxos-simple.pdf
  * Lamport's attempt to make Paxos easier to understand

## Distributed consensus algorithms

A distributed system is:

1. concurrent: multiple things happen at the same time
1. asynchronous: we don't know how long messages will take to reach their destination
1. subject to failures: we don't know _if_ messages will reach their destination at all

Consensus = making a decision in a distributed system. Examples:

* deciding the value of a register
* deciding an ordering of operations to be applied to a database

Consensus has 3 components:

1. the value you agree on is a value that somebody proposed (the non-triviality component)
1. once you decide a value, the decision is final and everybody learns the same value (the safety component)
1. if everything is working OK in the system, the system will reach a decision (the progress component)
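
A minimal single-process sketch in Go of these three components as runnable checks (all names here are hypothetical; with a single process, progress is trivial, and the hard part of real consensus is providing the same guarantees across unreliable, networked nodes):

```go
package main

import "fmt"

// Toy single-process "decision" illustrating the three components. All
// names are made up for illustration only.
type Decision struct {
	proposed map[string]bool // values somebody actually proposed
	decided  bool
	value    string
}

func NewDecision() *Decision {
	return &Decision{proposed: make(map[string]bool)}
}

func (d *Decision) Propose(v string) { d.proposed[v] = true }

func (d *Decision) Decide(v string) error {
	// Non-triviality: only a value that somebody proposed may be decided.
	if !d.proposed[v] {
		return fmt.Errorf("non-triviality: %q was never proposed", v)
	}
	// Safety: once decided, the decision is final, so everybody who asks
	// learns the same value.
	if d.decided && d.value != v {
		return fmt.Errorf("safety: already decided %q", d.value)
	}
	d.decided, d.value = true, v
	return nil
}

func main() {
	d := NewDecision()
	d.Propose("x=1")
	fmt.Println(d.Decide("x=1")) // <nil>: decided
	fmt.Println(d.Decide("x=2")) // error: "x=2" was never proposed
}
```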


Luckily, we don't always need consensus in a distributed system,
because it is quite expensive in terms of the number of messages exchanged.

You need consensus when you need _strong consistency guarantees_,
e.g. in a key-value store where you absolutely have to have linearizability. You also need consensus when you can't rely on clocks being synchronized.

e.g. Google Spanner avoids the need for consensus by putting atomic clocks in its data centers:
if you can rely on timestamps being consistent within a very small margin, then you don't need consensus.
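
A sketch in Go of that idea, sometimes called "commit wait" (assumptions: the clock error is bounded by a known `eps`, which Spanner's TrueTime achieves with atomic clocks and GPS; the 7ms figure and the function names here are made up). If a writer waits out the uncertainty window before acknowledging, its commit timestamp is guaranteed to be in the past on every node's clock, so timestamps alone give a consistent ordering of writes:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical clock-uncertainty bound; real systems would measure this.
const eps = 7 * time.Millisecond

func commitWrite(commitTs time.Time) {
	// Block until commitTs + eps has passed on the local clock; only then
	// can commitTs safely be treated as "definitely in the past" everywhere.
	time.Sleep(time.Until(commitTs.Add(eps)))
	fmt.Println("write acknowledged; safe to order by timestamp")
}

func main() {
	commitWrite(time.Now())
}
```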


* Consensus is the method of achieving very strong consistency.
* In CAP terms you get strong consistency (C) but compromise availability (A), i.e. these are CP systems.
  * Many consensus systems rely on majorities; if a majority of nodes goes down, the system can't reach consensus:
    * a 3-node system can tolerate 1 failure
    * a 5-node system can tolerate 2 failures
    * a 7-node system can tolerate 3 failures
  * People don't tend to scale beyond 7 nodes because the latency and the number of messages needed to reach consensus go up a lot as you add nodes; a consensus system is the opposite of a "scalable system": as you add more nodes it gets slower!
  * examples:
    * etcd recommends no more than 7 nodes
    * Google Chubby recommends no more than 5 nodes
  * Most systems recommend an _odd_ number of nodes: because of the majority requirement, a system with an odd number of nodes N tolerates as many failures as a system with N+1 nodes (the next even number). See the quorum sketch after this list.
* Leadership election = deciding which node gets to make decisions
  * typically the leader backs up all its decisions to a majority of nodes
  * nodes vote for the leader (rounds of voting are numbered: views, ballots, or term numbers, depending on the algorithm)
  * nodes can vote multiple times (across successive rounds)
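
A small Go sketch of the majority arithmetic behind these recommendations (nothing protocol-specific; the function names are made up):

```go
package main

import "fmt"

// A cluster of n nodes needs n/2+1 nodes (integer division) for a
// majority, so it can tolerate (n-1)/2 failures. The output shows why
// odd sizes are recommended: n=4 tolerates no more failures than n=3,
// n=6 no more than n=5, and so on.
func quorum(n int) int    { return n/2 + 1 }
func tolerates(n int) int { return (n - 1) / 2 }

func main() {
	for n := 3; n <= 8; n++ {
		fmt.Printf("n=%d  majority=%d  tolerates=%d failure(s)\n",
			n, quorum(n), tolerates(n))
	}
	// n=3 majority=2 tolerates=1, n=4 majority=3 tolerates=1,
	// n=5 majority=3 tolerates=2, n=6 majority=4 tolerates=2, ...
	// Leader election needs the same majority, which is why losing a
	// majority of nodes halts the whole system.
}
```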


* Paxos algorithm
  * multi-Paxos = an elected leader makes multiple decisions ("Paxos" and "multi-Paxos" are often used synonymously)
  * single-decree Paxos / vanilla Paxos / classic Paxos = a new leader is elected for each decision
  * Paxos has 2 phases:
    1. a majority of nodes get together to elect a leader
    2. the leader chooses a value and backs it up to a majority of nodes
  * systems which use Paxos:
    * Chubby (Google)
      * https://ai.google/research/pubs/pub27897
    * Cassandra
  * how a write works (see the sketch after this list):
    1. you send your write request to the leader
    1. the leader decides what order to apply writes in
    1. the leader tells the other nodes what order to write in
    1. when a majority have answered that they have accepted that order, the leader commits the write and acknowledges to you that it was successful
  * theoretically you should go through consensus for reads too, but in practice this is often skipped
* Raft algorithm
  * a well-articulated, simplified version of Paxos, specifically designed for state machine replication (the most common use-case for Paxos)
  * https://raft.github.io/raft.pdf
  * a good starting point for understanding the basics of Paxos
  * used by:
    * etcd
    * Kubernetes (via etcd)
* ZAB (ZooKeeper Atomic Broadcast) algorithm
  * very similar to multi-Paxos
  * https://cwiki.apache.org/confluence/display/ZOOKEEPER/Zab+vs.+Paxos
  * used in:
    * ZooKeeper
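
To tie these together, here is a toy in-process Go sketch of the leader-based write path that multi-Paxos, Raft, and ZAB all share, following the "how a write works" steps above. All names are made up, and it deliberately omits what the real protocols exist to handle: lost messages, stale leaders, and ballot/term numbers.

```go
package main

import "fmt"

// Follower stores ops at the slots the leader dictates. A real follower
// would first check that the message comes from the current leader.
type Follower struct{ log []string }

func (f *Follower) Accept(slot int, op string) bool {
	for len(f.log) <= slot {
		f.log = append(f.log, "")
	}
	f.log[slot] = op
	return true
}

type Leader struct {
	followers []*Follower
	log       []string
}

// Write runs one round: order, replicate, commit on majority ack.
func (l *Leader) Write(op string) bool {
	slot := len(l.log) // step 2: the leader decides the order
	l.log = append(l.log, op)
	acks := 1 // the leader counts toward the majority
	for _, f := range l.followers {
		if f.Accept(slot, op) { // step 3: tell the other nodes the order
			acks++
		}
	}
	clusterSize := len(l.followers) + 1
	return acks >= clusterSize/2+1 // step 4: commit and acknowledge
}

func main() {
	l := &Leader{followers: []*Follower{{}, {}}} // a 3-node cluster
	fmt.Println(l.Write("SET x=1"))              // true: committed
	fmt.Println(l.Write("SET y=2"))              // true: next slot in order
}
```

Note that commit requires a majority of the whole cluster, leader included, matching the quorum arithmetic earlier.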