For the uninitiated, Airflow is a data processing framework, a glorified cron scheduler if you will, but as most Python developers might put it...
With batteries included
We have two instances of Airflow running at Lok Suvidha. This article covers the more interesting use case; I'll follow up with another post on the second instance.
Why? The prologue...
To understand the problem, we first need a bit of financial jargon. Every month, we collect loan EMIs from our customers. The way this works is that each customer has signed a mandate with us, giving us the authority to deduct the EMI from their bank account. Thankfully, NPCI and NACH allow for centralized processing with every bank.
Every month, we have an activity where we present EMI collection requests for our customers. This is done by generating a list of customers from whose bank accounts the EMI is to be collected, sending this list to our sponsor bank, and waiting for the bank to debit the customers' accounts and credit the amount to ours. This happens at a bulk level, i.e., we don't receive a separate credit for each customer; we receive one bulk amount for all of them. For example, if we are awaiting credit for 10 customers, each with an EMI of Rs 10, we will receive a single credit of Rs 100 and not Rs 10 ten times. Additionally, some customers may bounce, i.e., have insufficient funds in their bank accounts for us to debit, so out of the Rs 100 that was to be received we might end up getting only Rs 90. This is an important point, which will come up later. Now, when we receive the credit, there are several actions that need to be taken:
a. Mark received/bounced against each customer.
b. Apply bouncing charges, if any.
c. Reconcile the bulk payment received from the bank, i.e., check that the payment received in our account equals the sum total of the credits the bank has given us against each customer.
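To make the reconciliation step concrete, here is a minimal sketch. The field names (emi_amount, status) are illustrative, not our actual schema:

```python
# Each row of the bank's response file tells us whether a customer's EMI
# was credited or bounced; bulk_amount_received is what actually hit our account.
def reconcile(response_rows, bulk_amount_received):
    """Return True if the sum of individual credits matches the bulk credit."""
    total_credited = sum(
        row["emi_amount"] for row in response_rows if row["status"] == "CREDITED"
    )
    return total_credited == bulk_amount_received

# e.g. 10 customers with an EMI of Rs 10 each, one of whom bounced:
rows = [{"emi_amount": 10, "status": "CREDITED"}] * 9 + [{"emi_amount": 10, "status": "BOUNCED"}]
assert reconcile(rows, 90)
```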
Now, we come to the technical part. The entire activity is one huge SQL transaction. This all-or-nothing approach keeps the system in a consistent state. But with transactions, all sorts of things can go wrong, especially when several of them are happening at once: some will conflict, others will time out. We have faced this in the past, when the entire DB got locked and the team went on a scavenger hunt to find the transaction that was blocking everyone else.
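A rough sketch of what "one huge transaction" looks like, assuming a plain psycopg2-style connection (table names and values here are made up; the actual code is far bigger, but the shape is the same):

```python
import psycopg2

conn = psycopg2.connect("dbname=loans")  # hypothetical connection string
try:
    with conn:  # BEGIN ... COMMIT, or ROLLBACK if anything below raises
        with conn.cursor() as cur:
            cur.execute("UPDATE emi_schedule SET status = %s WHERE id = %s", ("RECEIVED", 42))
            cur.execute("INSERT INTO bounce_charges (loan_id, amount) VALUES (%s, %s)", (7, 500))
            # ... many more statements; if any one fails, everything rolls back
finally:
    conn.close()
```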
How? The date with Airflow...
So we want to solve the classic deadlock problem. Classic problems also have classic solutions, and the immediate one is a queue. So Kafka, right?
Turns out, the answer lies in the context and in our experience so far. Deadlocks come in all shapes and sizes, and so do bugs. No matter how resilient your systems are, you must prepare for eventual failures. And if you are failing fast, you should also recover fast.
The Kafka approach would be to keep two queues (topics): a main queue and a failure queue. If something goes awry in the main queue, the message is moved to the failure queue and requeued after a delay, once the bug is fixed or the deadlock is resolved. It is a relatively complex architecture that needs additional plumbing (logging, runtime tracking, etc.) to get right.
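For contrast, a bare-bones version of that dead-letter pattern with kafka-python might look like this (topic names and the process_record stub are purely illustrative):

```python
import json
from kafka import KafkaConsumer, KafkaProducer

def process_record(record):
    """Hypothetical stand-in for the actual accounting logic."""
    print("processing", record)

consumer = KafkaConsumer("emi-processing", bootstrap_servers="localhost:9092",
                         value_deserializer=lambda b: json.loads(b.decode()))
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda d: json.dumps(d).encode())

for message in consumer:
    try:
        process_record(message.value)
    except Exception:
        # Park the failed record on the failure topic, to be requeued later.
        producer.send("emi-processing-failed", message.value)
```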
In comes Airflow. The two features that mattered most were the Airflow REST API and the BashOperator. Now, akin to our earlier example, the flow is:
1. Upload the bank clearing file to storage.
2. Start the processing job via a POST call to the Airflow server, passing a JSON configuration object (a minimal sketch follows this list).
3. A bash job is started, with the configuration object passed as a parameter to the script.
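Step 2 is just an HTTP call. A minimal sketch against Airflow's stable REST API (the host, DAG id, auth, and conf fields here are illustrative, not our exact setup):

```python
import requests

payload = {
    "conf": {
        "task": "nach_clearing",                       # tells the router what to run
        "file_path": "s3://bucket/clearing/2021-06.csv",
    }
}
resp = requests.post(
    "http://airflow.internal:8080/api/v1/dags/accounting_queue/dagRuns",
    json=payload,
    auth=("user", "password"),
)
resp.raise_for_status()
```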
Yes, we are using Airflow as simply a task queue, with none of its scheduling capabilities. Why? Because the tooling around it is just that good! We get clear visibility into logging, alerting mechanisms are baked right in, and with a task parallelism of 1 we ensure no other transaction gets in between.
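On the DAG side, a sketch of the idea, assuming a recent Airflow 2.x (DAG id, script path, and conf keys are illustrative): one BashOperator hands the JSON conf to a script, and the run/task limits keep everything strictly serial.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="accounting_queue",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,   # never scheduled, only triggered via the REST API
    max_active_runs=1,        # one run at a time...
    max_active_tasks=1,       # ...and one task at a time within it
    catchup=False,
) as dag:
    run_job = BashOperator(
        task_id="run_accounting_job",
        # dag_run.conf is whatever JSON the REST call passed in
        bash_command="python /opt/scripts/router.py '{{ dag_run.conf | tojson }}'",
    )
```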
Soon, we extended this into a framework whereby everything related to accounting is pushed through this "queue". There's a router script at the very front which, depending on the configuration object, routes the code flow. Essential and long-running tasks, like the day-end process, have been moved here.
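A toy version of what such a router script could look like (task names and handlers are made up): the first argument is the JSON conf passed down by the BashOperator, and the "task" key decides which accounting flow runs.

```python
import json
import sys

def run_nach_clearing(conf):
    print("processing clearing file", conf["file_path"])

def run_day_end(conf):
    print("running day end for", conf["date"])

HANDLERS = {
    "nach_clearing": run_nach_clearing,
    "day_end": run_day_end,
}

if __name__ == "__main__":
    conf = json.loads(sys.argv[1])
    HANDLERS[conf["task"]](conf)
```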
Keep it simple, sweet
I had a Problem, so I decided to use Java. Now I have a ProblemFactory.
The key takeaway, and the north star guiding our choices, is the desire to keep the code simple, linear, and predictable. We have dealt with complex MVCs and ORMs that jump across 10 different files to perform a simple SELECT query. It is okay to have complex external tooling that keeps watch over the pipeline. However, introducing code that has nothing to do with business logic and exists only to squash bugs born of a really cool asynchronous, callback-driven architecture is not really cool.