Think Serverless! It is really awesome!

We ran into an interesting problem and want to share the thought process that went into solving it. Our customers provide us a file containing millions to billions of rows, and each row contains thousands of columns to process. We need to process these files and perform some business logic for each row. Being a customer-obsessed company, we want the processing to be completed within 15 minutes per file.

Problem Statement

Design a system which can process a CSV file with a billion rows, where each row undergoes some business logic that can take up to ~2 seconds to complete

Requirements

  1. Process a very large CSV file which has a billion rows with 1,000 columns
  2. Prevent processing of duplicate entries present across the file in a day
  3. Scale of uploads is not uniform
  4. Minimize the infrastructure maintenance cost
  5. Ability to minimize the processing time of each file
  6. Be frugal!
  7. Should be easily maintainable
  8. Should be easily scalable
  9. Don’t create a one-way door, i.e. the design should be extensible for a future where processing a file might take more time
  10. Approximate time for processing each row is ~2 seconds
  11. More network calls might be added to the processing of each file in the future
  12. Clients should receive an ACK for the finished work items
  13. The design needs to handle the case where we might get items from external triggers like SMS, emails, etc. (not mandatory, though)

High Level Diagram

Walkthrough

  • The client uploads a file to S3, either programmatically or manually
  • A PutObject / CopyObject event is emitted by S3
  • A Lambda is configured to listen to all PutObject (and CopyObject) events. This Lambda fetches the object as a file stream (rather than downloading the entire object), reads the file row by row from the byte stream, and creates smaller files based on a configured number of rows (a minimal sketch of this splitter follows this list)
  • The file-splitting Lambda then uploads the smaller files to a different location in S3
  • At this different location, we have configured another Lambda which processes these smaller files and publishes the relevant rows to a Kinesis stream (performing some validations, minor business logic, etc.)
  • Another set of Lambda functions is configured to listen to the stream and dedup the items using Amazon DynamoDB and a cache (see the dedup sketch after this list)
  • DynamoDB is also used to checkpoint the processed files so that the next worker can skip already-processed items
  • The stream listeners perform the required business logic and publish the processed work items to SNS for the clients to consume
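
To make the splitting step concrete, here is a minimal sketch of the file-splitting Lambda in Python with boto3. The environment variable names (ROWS_PER_CHUNK, CHUNK_PREFIX) are illustrative, not our actual configuration, and details like CSV header handling and retries are omitted.

```python
import os
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")

# Illustrative configuration names; real values are deployment-specific.
ROWS_PER_CHUNK = int(os.environ.get("ROWS_PER_CHUNK", "100000"))
CHUNK_PREFIX = os.environ.get("CHUNK_PREFIX", "chunks/")  # the "different location" in S3


def handler(event, context):
    """Triggered by S3 PutObject/CopyObject events; splits the uploaded CSV
    into smaller files without downloading the whole object into memory."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # event keys are URL-encoded

        # get_object returns a streaming body, so we iterate line by line
        # instead of pulling the entire multi-GB file into memory.
        body = s3.get_object(Bucket=bucket, Key=key)["Body"]

        buffer, chunk_index = [], 0
        for raw_line in body.iter_lines():
            buffer.append(raw_line.decode("utf-8"))
            if len(buffer) >= ROWS_PER_CHUNK:
                _flush(bucket, key, chunk_index, buffer)
                buffer, chunk_index = [], chunk_index + 1
        if buffer:  # trailing partial chunk
            _flush(bucket, key, chunk_index, buffer)


def _flush(bucket, source_key, index, rows):
    # Each smaller file lands under a separate prefix, which has its own
    # Lambda trigger configured downstream.
    chunk_key = f"{CHUNK_PREFIX}{source_key}/part-{index:06d}.csv"
    s3.put_object(Bucket=bucket, Key=chunk_key,
                  Body=("\n".join(rows) + "\n").encode("utf-8"))
```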
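
And here is a rough sketch of the stream-side Lambda: it consumes Kinesis records, uses a DynamoDB conditional write to drop duplicates for the day, runs the per-row business logic, and publishes the result to SNS. The table name, topic ARN, and the row_id field are assumptions for illustration, not our actual schema.

```python
import base64
import datetime
import json
import os

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")
sns = boto3.client("sns")

# Assumed names, supplied via environment variables in this sketch.
DEDUP_TABLE = os.environ.get("DEDUP_TABLE", "row-dedup")
RESULT_TOPIC_ARN = os.environ["RESULT_TOPIC_ARN"]


def handler(event, context):
    """Triggered by the Kinesis stream; skips duplicates seen today and
    publishes processed work items to SNS."""
    today = datetime.date.today().isoformat()
    for record in event["Records"]:
        item = json.loads(base64.b64decode(record["kinesis"]["data"]))

        # Conditional write: succeeds only if this row id has not been seen
        # today, so duplicates across the day's files are dropped.
        try:
            dynamodb.put_item(
                TableName=DEDUP_TABLE,
                Item={"row_id": {"S": f"{today}#{item['row_id']}"}},
                ConditionExpression="attribute_not_exists(row_id)",
            )
        except ClientError as err:
            if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
                continue  # duplicate: already processed today
            raise

        result = process_row(item)  # the ~2 s of business logic lives here

        # Publish the finished work item so clients receive their ACK.
        sns.publish(TopicArn=RESULT_TOPIC_ARN, Message=json.dumps(result))


def process_row(item):
    # Placeholder for the actual row-level business logic.
    return {"row_id": item["row_id"], "status": "PROCESSED"}
```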

Summary of the solution

We parallelize the processing of large files by breaking them into smaller files. In a way, we modelled EMR-style MapReduce using Lambdas, making it serverless.

Why we went with this approach

  • There is no dedicated fleet to maintain that handles the processing of the files
  • We only pay for what we use.
  • We have minimal maintenance even though we have irregular scale, because Lambda does the heavy lifting for us :D
  • If there are any other external sources, we can configure event listeners for them and publish the work items to the stream, i.e. we can re-use the entire functionality downstream of the Kinesis stream
  • More checks needed for each item before publishing to Kinesis? No worries, change the configuration for creating smaller files to have fewer rows
  • Want to support different file formats? No problem, write a small transformer which converts the data to the Kinesis stream data format (see the sketch after this list)
  • If we have multiple sources, all we need to do is write source listeners and publish the items to the Kinesis stream
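
As an example of the file-format point above, a transformer for a new format only has to map each source row to the record shape the stream listeners already expect. The sketch below assumes a JSON Lines input and an illustrative target schema (row_id, payload) with a hypothetical stream name; the real field names would follow whatever the consumers use.

```python
import json

import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "row-work-items"  # hypothetical stream name


def transform_jsonl_row(raw_line: str) -> dict:
    """Map one JSON Lines row into the record shape the stream listeners expect."""
    source = json.loads(raw_line)
    return {"row_id": source["id"], "payload": source}


def publish(raw_line: str) -> None:
    record = transform_jsonl_row(raw_line)
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=str(record["row_id"]),
    )
```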

Cons of this approach

  • The Lambda which splits the initial file becomes a bottleneck, but we did some scale testing and figured out that it would work for our use case. We split 25 million rows in ~50 seconds, and since the Lambda timeout is 15 minutes, it should be able to handle billions of rows

Other solutions which we explored were

  1. Use AWS EMR jobs to parse the file and process
  2. Use AWS data pipeline
  3. Export to MySQL using a dump and process the file from there
  4. Use AWS step functions
  5. Dedicated machines which scan the S3 buckets and process each file
  6. Use AWS batch and perform batch processing

Some of the reasons for which we didn’t use the above solutions

  • We didn’t use any big data solutions because we are not actually performing any complex computations. Though they could be used, it would be like killing a mosquito with a cannon
  • Some of the above approaches only allow strict file formats.
  • Processing of items becomes tightly coupled to a particular architecture, like AWS Step Functions, which makes decoupling harder
  • We would have to write our own logic to scale up and down based on the input traffic
