How machines get committed to each other(Part 1)?

You can think distributed systems as a new planet where machines are like humans which are working together to achieve some targeted output. It could be storing data forever, or it could be some messaging systems which passes messages across systems etc. Like in the real world where people don’t agree to each other because each one of us have a different perspective and opinion towards a subject. For example if my friend comes and say let’s go to vegas for vacation probably i might reject saying that i would like to go to Alaska. Similarly computer come with different CPUs, different memory, different hard-disks etc and most importantly they are connected with network connection which can go wrong at any point of time.

The act of getting all nodes in the distributed systems to agree onto something — which could be agreeing upon a master in master-slave architecture, agreeing upon a synchronized time among different systems, agreeing to move to next state in a replicated state machine, agreeing with each whether they can commit/abort a transaction. This is similar to what we plan for vacation among a set group of friends, where each one wants to go to different place based on their interests and finally we fixate to a place. Achieving consensus allows a distributed system to act as a single entity, with every individual node aware of and in agreement with the actions of the whole of the network.

Firstly achieving consensus in distributed systems is very hard it is as tough as getting all friends agreeing to go to GOA. Secondly implementing the consensus is pretty tough you need to handle several scenario like power getting off, fire in data centre, network cable issues, human errors while deploying etc .. trust me i have seen all these cases :D

Formal Definition

Typically in consensus settings nodes proposes values to other nodes to agree upon a value (value here can refer to a leader, generating a LSN etc) and agree upon a value based on different protocols. More formal definitions say that each node has an output register that, once the protocol has terminated, contains the value to agree upon. Writing to that register is the act of deciding.

Given a set N number of nodes in a distributed system, we’ll consider a consensus protocol correct if and only if:

  1. Agreement — all N nodes decide on the same value
  2. Validity — the value that is decided upon must have been proposed by some node in N
  3. Termination — all nodes eventually decide

Agreement is an easy one to understand: we can’t really call something consensus if no consensus has been achieved.

Validity is a bit less intuitive — and seems kind of obvious at the same time. The problem with the agreement property is that it trivially allows protocols that just instruct every node to decide a default value, no matter what the actual state of the network. For example a database commit protocol that always voted “don’t commit”, no matter what the transaction, satisfies the agreement condition. Validity ensures that these protocols don’t pass muster.

Termination is also a reasonable requirement, but it bears thinking about why it’s needed — agreement only constrains how nodes decide, not if they decide. We need termination to ensure that the protocol actually does something useful, otherwise again we’d be open to trivial solutions that just don’t do anything — satisfying agreement and validity.

Distributed transactions

  • Partition databases across multiple machines for scalability (A and B might not share a server)
  • A transaction might touch more than one systems like two different banks

So what are different algorithms that help to co-ordinate ?

Goal:

General purpose, distributed agreement on some action, with failures — Different entities play different roles in the action

Running example: Transfer money from Bank A to Bank B

  • Debit at Bank A, credit at Bank B, tell the client “okay”
  • Require both banks to do it, or neither ( else one bank might debit other bank might not credit, one bank might credit but other bank might not debit) i.e we is an all-or-nothing atomic commit protocol
  • Require that one bank never act alone

One-Phase Commit

Let us take an example where you are planning to transfer money from one back to another through some payment gateway. When you ask payment gateway to start transfer of 20$. Well one solution for the payment gateway to ask Bank A to debit and Bank B to credit simultaneously.

Cool ! Yeah isn’t approach work ? Well no, it fails under scenarios like

  • What if user doesn’t have money in Bank A (20$ free money in Bank B)
  • What if Bank B has run out of money ( 20$ is lost in Bank A)
  • What if Bank A never got the request to debit ? Bank B would have already credited money to user. User now has 40 $ (20$ in Bank A and 20$ in Bank B)
  • What if Bank B never received the request for credit ? (20$ is lost in Bank A)

We want two properties:

Safety

  • If one commits, no one aborts
  • If one aborts, no one commits

Liveness

  • If no failures and A and B can commit, action commits
  • If failures, reach a conclusion ASAP

Two-Phase Commit

So two phase commit what really happens is payment gateway will first make sure that the transaction is valid. If it receives success response then it will ask to debit / credit in second phase.

Phase 1 (Prepare Phase)

In Phase 1

  • Payment gateway will ask whether Bank A can debit 20$
  • Payment gateway will ask whether Bank B can credit 20$
  • If Bank A and B responds ‘yes’
  • Then payment gateway enters phase 2
Phase 2 (Commit Phase)

In Phase 2

  • After receiving success response in Phase 1 from both banks it asks Bank A to debit and Bank B to credit

What happens if either of Banks respond ‘No’

Phase 1

In this Phase 1 scenario

  • Payment gateway will ask whether Bank A can debit 20$
  • Payment gateway will ask whether Bank B can credit 20$
  • Bank A responds No
  • Bank B responds Yes
Phase 2 (Abort)

In this Phase 2 scenario

  • Since it received No from one bank it will initiate Abort transaction in both the banks

Why is this correct?

  • Neither can commit unless both agreed to commit

What about performance?

Performance is not so great when there is either timeouts because of network partitions or because of node reboots

How to handle crash and reboot?

If all nodes knew their state before crash, we could use the termination protocol — Use write-ahead log to record “commit!” and “yes” to disk

Let’s Check All Failure Scenarios

In this case

  • Payment gateway checks whether Bank A can debit
  • Bank A returns Yes
  • Payment gateway checks whether Bank B can credit
  • But Bank B is busy upgrading to new Java version
  • Payment gateway will timeout and as a preventive measure aborts the transaction in Bank A
  • After update and restart since Bank B didn’t receive any response it will ask the decision to payment gateway and it responds abort

Cons:

  • Because of intermittent failures all the work done is lost.

What if Bank B fails after sending yes

In this case

  • Payment gateway checks whether Bank A can debit
  • Bank A returns Yes
  • Payment gateway checks whether Bank A can credit
  • Bank A returns Yes
  • Payment gateway will respond debit to bank A
  • Bank B was updating java and didn’t receive credit response
  • After restart Bank B will ask payment gateway what to do and it will respond credit it

This is good !

What if Bank B lost a vote?

In this case

  • Payment gateway checks whether Bank A can debit
  • Bank A returns Yes
  • Payment gateway checks whether Bank B can credit
  • Bank B returns Yes but will not reach payment gateway
  • Payment gateway will timeout and as a preventive measure aborts the transaction in Bank A
  • It will abort transaction in Bank B

Cons:

  • Because of network failures all the work done is lost.

What if payment gateway fails before asking banks?

In this scenario payment gateway received the request from user but before it started processing it failed and restarted.

No, problem after restart it can continue with original protocol

What if payment gateway fails after sending for checks ?

In this scenario

  • Payment gateway asks Bank A whether it can debit
  • Payment gateway asks Bank B whether it can credit
  • Bank A responds Yes
  • Bank B responds Yes
  • But because of java update payment gateway didn’t process the response
  • As a result after the restart it will restart the entire protocol

What if payment gateway fails after sending decision to one bank A?

In this case

  • Payment gateway checks whether Bank A can debit
  • Bank A returns Yes
  • Payment gateway checks whether Bank B can credit
  • Bank B returns Yes
  • Payment gateway asks Bank A to debit and fails to submit credit to Bank B
  • Bank B will timeout and asks back payment gateway what it should do
  • Payment gateway will respond to credit

Cons:

  • Blocking on payment gateway. ( What if payment gateway is down for a day ??)

Can we prevent getting blocked because of payment gateway ?

In this case

  • Payment gateway checks whether Bank A can debit
  • Bank A returns Yes
  • Payment gateway checks whether Bank B can credit
  • Bank B returns Yes
  • Payment gateway asks Bank A to debit and fails to submit credit to Bank B
  • Bank B will timeout and asks Bank A what to do
  • Bank A will response i got debit response from payment gateway before it died so you can credit it
  • Bank B will credit it

Cons:

  • Bank B needs to initiate separate network call. What if Bank B can talk to 1000 banks ? it needs to keep track of everything

What happens if we don’t have a decision?

In this case

  • Payment gateway checks whether Bank A can debit
  • Bank A returns Yes
  • Payment gateway checks whether Bank B can credit
  • Bank B returns Yes
  • Bank B will timeout and asks Bank A what to do
  • Bank A will response i also don’t know what to do
  • Both Bank A and Bank B to come online

In summary

Two phase commit with with non-volatile logging ( and timeout ) is

Safe:

All hosts that decide reach the same decision

— No commit unless everyone says “yes” •

Liveness:

If no failures and all say “yes” then commit

— But if failures then 2PC might block

— Payment gateway ( Transaction Coordinator) must be up to decide

Doesn’t tolerate faults well: must wait for repair

Bye bye !! I will cover three phase protocol in another post :D Happy Coding !!!

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Bits And Bytes Of Avikodak

I spend most of the time on helping out zeros and ones to reach there destination in a orderly and reliable manner :)