Everyone wants a log in distributed systems !!
Can you guess what is behind every successful implementation of a distributed systems or relational databases like mysql ? Well it is a very small concept called as log. In relational database world it is sometimes called as transactional log, journal or write ahead log (there might be many more).
what is a log is it something what i write in programs like
log.info("hello world");
Well no these are human readable texts which are used by programmers to debug an issue/ bug. These can be called as application logs. The idea i want to cover in this post is something called as data log or journal which are built for programmatic access.
I’m not tech savvy can you example me logs in simple terms ?
Sure no problem. We all learnt basic math operations while we are in school like a+b, a*b, a/b, a-b etc. Let’s do a basic math exercise
- Can you integrate sin(x) + cos(x) from 0 to 1? (Just kidding .. you can start from step 2 :D )
- Can you add 10 to zero?
- Can you multiply the obtained result above by 10 ?
- Can you subtract the obtained result by 20?
- Can you divide the obtained result by 10 ?
- Can you add 40 to the obtained result ?
What result did you get ? Well you must have got 48 ? Yay ! I know how you think.
How did we both reach to same result ?
It is because we both started with same initial state 0 and applied the math operations sequentially. If either of them was different then we couldn’t have reached to a different result
Observations:
- Let’s suppose i started with one and you started with zero then the final value in my case will be 49 but in your case it will be 48.
- Similar if i had interchange the sequence of math operations then we would have different results.
- You could have computed each math operation pretty quickly and i might have gotten distracted and might have computed it much later but it doesn’t matter we both will reach to same output
So what is a learning ?
Given an initial state and same sequence of logs records (operations) we would reach to same result no matter what is the speed at which we process the records.
Can you now give me the actual definition of log ?
A log is perhaps the simplest possible storage abstraction. It is an append-only, totally-ordered sequence of records ordered by time.
Records will only be append only i.e they will be added only to the end. Once appended they become immutable. Each log entry will be assigned a unique sequential number. One idea is to number the log records based on the time it got generated i.e t(1) < t(2) < t(3). Using time in distributed systems is not a very great idea (pretty big topic will cover in another post… It is very similar to real life the time on your watch might not match with time on my watch)
How traditional databases like mysql get constructed ?
In very overly simplified view, every table in mysql will starts empty, apply the logs which are generated by sql constructs and as a consequences rows get generated in the database.
Remember #(logs) >>>> #(records) in table. Consider this scenario where i keep updating the same records again and again then number of logs will be very high but rows might be just one.
How is it helpful in distributed systems ?
One of the core features on why we use distributed systems is that we want our data to be as durable as possible i.e even if one machine goes down because of some issues it shouldn’t lose my data. So one easy solution is to replicate the data across multiple machines.
So if we can successfully replicate the logs among machines , with both machines starting with empty state( assuming we creating from scratch) we can reach to same state after successful processing of the replicated logs.
This is very similar to the math exercise we both started with same state, we both applied operations sequentially and we reached to same output. So now even if i get hit by a bus you will have the result and be my backup. ( Stressing this point because this is the most critical concept you need to understand )
So how can we replicate logs between machines ?
The distributed systems literature commonly distinguishes two broad approaches to processing and replication. The “state machine model” usually refers to an active-active model where we keep a log of the incoming requests and each replica processes each request. A slight modification of this, called the “primary-backup model”, is to elect one replica as the leader and allow this leader to process requests in the order they arrive and log out the changes to its state from processing the requests. The other replicas apply in order the state changes the leader makes so that they will be in sync and ready to take over as leader should the leader fail.
What are different replication control strategies ?
In order to maintain mutually consistent data in all sites, replication control techniques need to be adopted. There are two approaches for replication control, namely −
- Synchronous Replication Control
- Asynchronous Replication Control
Synchronous Replication Control
We programmatically update both databases present in different machines synchronously( either parallely or serially)
{Update table in machine (1)Update table in machine (2)}
Some of the obvious disadvantages with this approach are that even if one machine is not able to reply then we will timeout the request and customers receives 500 which is not a very good customer experience. Some machines might be inconsistent state and needs to roll back properly.
Asynchronous Replication Control
In asynchronous replication approach, the replicas do not always maintain the same value. One or more replicas may store an outdated value, and a transaction can see the different values. The process of bringing all the replicas to the current value is called synchronization.
Some of the replication control algorithms are −
- Master-slave replication control algorithm.
- Distributed voting algorithm.
- Majority consensus algorithm.
- Circulating token algorithm
Master-Slave Replication Control Algorithm
There is one master site and ’N’ slave sites. A master algorithm runs at the master site to detect conflicts. A copy of slave algorithm runs at each slave site. The overall algorithm executes in the following two phases −
- Transaction acceptance/rejection phase − When a transaction enters the transaction monitor of a slave site, the slave site sends a request to the master site. The master site checks for conflicts. If there aren’t any conflicts, the master sends an “ACK+” message to the slave site which then starts the transaction application phase. Otherwise, the master sends an “ACK-” message to the slave which then rejects the transaction.
- Transaction application phase − Upon entering this phase, the slave site where transaction has entered broadcasts a request to all slaves for executing the transaction. On receiving the requests, the peer slaves execute the transaction and send an “ACK” to the requesting slave on completion. After the requesting slave has received “ACK” messages from all its peers, it sends a “DONE” message to the master site. The master understands that the transaction has been completed and removes it from the pending queue.
Distributed voting algorithm
This comprises of ’N’ peer sites, all of whom must “OK” a transaction before it starts executing. Following are the two phases of this algorithm −
- Distributed transaction acceptance phase − When a transaction enters the transaction manager of a site, it sends a transaction request to all other sites. On receiving a request, a peer site resolves conflicts using priority based voting rules. If all the peer sites are “OK” with the transaction, the requesting site starts application phase. If any of the peer sites does not “OK” a transaction, the requesting site rejects the transaction.
- Distributed transaction application phase − Upon entering this phase, the site where the transaction has entered, broadcasts a request to all slaves for executing the transaction. On receiving the requests, the peer slaves execute the transaction and send an “ACK” message to the requesting slave on completion. After the requesting slave has received “ACK” messages from all its peers, it lets the transaction manager know that the transaction has been completed.
Majority Consensus Algorithm
This is a variation from the distributed voting algorithm, where a transaction is allowed to execute when a majority of the peers “OK” a transaction. This is divided into three phases −
- Voting phase − When a transaction enters the transaction manager of a site, it sends a transaction request to all other sites. On receiving a request, a peer site tests for conflicts using voting rules and keeps the conflicting transactions, if any, in pending queue. Then, it sends either an “OK” or a “NOT OK” message.
- Transaction acceptance/rejection phase − If the requesting site receives a majority “OK” on the transaction, it accepts the transaction and broadcasts “ACCEPT” to all the sites. Otherwise, it broadcasts “REJECT” to all the sites and rejects the transaction.
- Transaction application phase − When a peer site receives a “REJECT” message, it removes this transaction from its pending list and reconsiders all deferred transactions. When a peer site receives an “ACCEPT” message, it applies the transaction and rejects all the deferred transactions in the pending queue which are in conflict with this transaction. It sends an “ACK” to the requesting slave on completion.
Circulating Token Algorithm
In this approach the transactions in the system are serialized using a circulating token and executed accordingly against every replica of the database. Thus, all the transactions are accepted, i.e. none is rejected. This has two phases −
- Transaction serialization phase − In this phase, all transactions are scheduled to run in a serialization order. Each transaction in each site is assigned a unique ticket from a sequential series, indicating the order of transaction. Once a transaction has been assigned a ticket, it is broadcasted to all the sites.
- Transaction application phase − When a site receives a transaction along with its ticket, it places the transaction for execution according to its ticket. After the transaction has finished execution, this site broadcasts an appropriate message. A transaction ends when it has completed execution in all the sites.
Other advantage of the logs
Point in time view :- I can get into a any point of view of my database using the logs. For example you can get bank balance on March 25, 2015 by computing all the withdrawals and deposits from account creation time to March 25, 2015 by which can be obtained from the statements.
Disaster recovery:- If node is suffering from intermittent power failures it can reconstruct the state by applying the logs.
Support ACID:- Provide ACID properties even if there was an disaster.
In summary the two problems a log solves — ordering changes and distributing data — are even more important in distributed data systems. Agreeing upon an ordering for updates (or agreeing to disagree and coping with the side-effects) are among the core design problems for these systems
We will discuss more about consensus in the next article ….