Hitting a 20-Minute EOD SLA: Designing Reconciliation on a Java Monolith
Every loan management system I've worked on has the same quiet, non-negotiable contract with the business: at the end of the day, every transaction posted against a loan must be reconciled against the accounting ledger before the next business day starts. The numbers have to match. If they don't, you don't open in the morning.
In our case, the operations team gave us a hard window: the EOD process must finish in twenty minutes. Not ninety. Not "usually fast." Twenty.
This post is about how we designed our way into that window. It is deliberately not a deep-dive into thread internals — I was an associate engineer at the time, and the most useful thing I learned was how much you can do with boring tools if you partition the problem correctly.
The system, at a glance
The loan management system was a single Java monolith. There was nothing exotic about it.
Reconciliation was triggered by a Quartz cron job inside the same JVM that served traffic. The job pulled the day's transactions, validated each one against the corresponding ledger entries, posted any adjustments, and wrote a run summary. That's it.
The first version did all of this on a single thread.
It took roughly three hours on a busy day.
Why "just add threads" was not the answer
The temptation was obvious: wrap the transaction loop in an ExecutorService, hand each transaction to a worker, done. We tried it. It was worse.
Three things made naive parallelism a bad fit:
- Hot rows on the ledger. Many transactions on the same day touched the same handful of GL accounts. Parallel writers piled up on row locks and serialized themselves at the database, which is the worst of both worlds — all the coordination cost, none of the speedup.
- Connection pool starvation. Each worker held a JDBC connection. Spin up forty workers and the rest of the application — including the very UI that operations was watching — choked.
- Partial failure recovery. If a thread blew up halfway through, we had no clean way to resume. EOD that fails at minute 19 of 20 is worse than EOD that fails at minute 1.
The lesson, in retrospect, is that concurrency is a packaging problem, not a primitive problem. The question wasn't "how do I run threads?" It was "what's the unit of work that's actually independent?"
Partitioning before parallelizing
The reframing that unlocked everything was simple: transactions on different loans don't touch each other's ledger postings. The hot rows were summary GL accounts that we wrote to once at the end, not the per-loan postings that made up 95% of the work.
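So we split EOD into two phases: a parallel pass over per-loan partitions, followed by a single serial write to the summary GL accounts.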
Phase 1 is embarrassingly parallel because we made the unit of work a loan partition, not a transaction. A loan's transactions are processed in order, on a single thread, by a single worker — no intra-loan races, no cross-loan contention.
Phase 2 is small, fast, and intentionally serial. Doing one consolidated write to the hot GL rows beats forty workers fighting over them.
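To make the split concrete, here is a minimal Java sketch of what each phase takes as input. It is not our production code: the Txn class, its fields, the GL account names, and the idea of netting amounts per GL account before the serial write are all assumptions made for illustration.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical types for illustration; the real system's classes are not shown in this post.
final class LoanPartitioning {

    // A posted transaction: its loan, its order within the day, the summary GL account, the amount.
    static final class Txn {
        final long loanId;
        final long sequence;
        final String glAccount;
        final BigDecimal amount;

        Txn(long loanId, long sequence, String glAccount, BigDecimal amount) {
            this.loanId = loanId;
            this.sequence = sequence;
            this.glAccount = glAccount;
            this.amount = amount;
        }
    }

    // Phase 1 unit of work: all of one loan's transactions, sorted so one worker can replay them in order.
    static Map<Long, List<Txn>> partitionByLoan(List<Txn> dayTxns) {
        Map<Long, List<Txn>> byLoan = new TreeMap<>();
        for (Txn t : dayTxns) {
            byLoan.computeIfAbsent(t.loanId, id -> new ArrayList<>()).add(t);
        }
        for (List<Txn> txns : byLoan.values()) {
            txns.sort(Comparator.comparingLong((Txn t) -> t.sequence));
        }
        return byLoan;
    }

    // Phase 2 input: one net amount per hot GL account, so each summary row is written once, serially.
    static Map<String, BigDecimal> consolidateGl(List<Txn> dayTxns) {
        Map<String, BigDecimal> totals = new TreeMap<>();
        for (Txn t : dayTxns) {
            totals.merge(t.glAccount, t.amount, BigDecimal::add);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Txn> txns = Arrays.asList(
                new Txn(10002L, 2L, "GL-INTEREST", new BigDecimal("5.00")),
                new Txn(10001L, 1L, "GL-PRINCIPAL", new BigDecimal("100.00")),
                new Txn(10002L, 1L, "GL-PRINCIPAL", new BigDecimal("40.00")));
        System.out.println("partitions: " + partitionByLoan(txns).keySet());
        System.out.println("GL totals:  " + consolidateGl(txns));
    }
}
```

Each entry of the partition map can be handed to a worker as one task. The GL totals map is small enough to write from a single thread at the end.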
To make phase 1 resumable, every partition recorded its state in an eod_partition table:
| run_id | loan_id | status | started_at | finished_at | error |
|---|---|---|---|---|---|
| 482 | 10001 | DONE | 22:00:01 | 22:00:04 | |
| 482 | 10002 | IN_PROGRESS | 22:00:01 | | |
| 482 | 10003 | FAILED | 22:00:01 | 22:00:02 | … |
A re-run picked up only PENDING and FAILED rows. EOD became safe to retry, which mattered more than I appreciated at the time.
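The retry behaviour is easier to see in code. The sketch below keeps partition state in memory as a stand-in for the eod_partition table; the class and method names are hypothetical, and a real implementation would read and update rows over JDBC instead.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.LongConsumer;

// In-memory stand-in for the eod_partition table (illustrative only).
final class PartitionLedger {

    enum Status { PENDING, IN_PROGRESS, DONE, FAILED }

    private final Map<Long, Status> statusByLoan = new LinkedHashMap<>();
    private final Map<Long, String> errorByLoan = new LinkedHashMap<>();

    // Adds a partition in PENDING state unless it is already tracked.
    void register(long loanId) {
        statusByLoan.putIfAbsent(loanId, Status.PENDING);
    }

    Status statusOf(long loanId) {
        return statusByLoan.get(loanId);
    }

    String errorOf(long loanId) {
        return errorByLoan.get(loanId);
    }

    // Partitions a run or re-run should pick up: only PENDING and FAILED.
    List<Long> runnable() {
        List<Long> ids = new ArrayList<>();
        for (Map.Entry<Long, Status> e : statusByLoan.entrySet()) {
            if (e.getValue() == Status.PENDING || e.getValue() == Status.FAILED) {
                ids.add(e.getKey());
            }
        }
        return ids;
    }

    // Attempts every runnable partition once and records the outcome; returns how many were attempted.
    int runOnce(LongConsumer reconcileLoan) {
        List<Long> ids = runnable();
        for (long loanId : ids) {
            statusByLoan.put(loanId, Status.IN_PROGRESS);
            try {
                reconcileLoan.accept(loanId);
                statusByLoan.put(loanId, Status.DONE);
                errorByLoan.remove(loanId);
            } catch (RuntimeException ex) {
                statusByLoan.put(loanId, Status.FAILED);
                errorByLoan.put(loanId, String.valueOf(ex.getMessage()));
            }
        }
        return ids.size();
    }
}
```

Calling runOnce a second time touches only the partitions that failed the first time, which is what makes a retry cheap. This sketch does not cover rows left IN_PROGRESS by a JVM that died mid-run; that needs its own policy, which this post does not describe.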
Sizing the executor
With the partitioning sorted, the actual concurrency primitive was unglamorous: a bounded ThreadPoolExecutor with a fixed-size work queue, sitting inside the same Tomcat process as the web tier.
    // rough shape, not the real code
    ExecutorService eodPool = new ThreadPoolExecutor(
        workerCount,                       // core
        workerCount,                       // max: we wanted predictable, not elastic
        0L, TimeUnit.SECONDS,
        new ArrayBlockingQueue<>(queueDepth),
        new ThreadFactoryBuilder().setNameFormat("eod-worker-%d").build(),
        new ThreadPoolExecutor.CallerRunsPolicy()
    );
Two decisions mattered more than the code:
- workerCount was tied to the JDBC pool, not CPU count. We had a connection pool of 50, the web tier needed roughly 20 of them at peak, and so the EOD pool was capped at the count we could give it without starving the application, typically 16. The CPU was rarely the bottleneck; the database was.
- CallerRunsPolicy for backpressure. If the queue filled, the dispatcher thread itself ran the task it had just tried to submit. This naturally throttled enqueue speed to match drain speed, instead of letting the queue grow without bound.
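Both decisions can be sketched in a few lines of plain java.util.concurrent. The workerBudget formula, its safety margin and hard cap, and the sleep-based fake tasks are assumptions for illustration, not the numbers or code we actually ran.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class EodPoolSizing {

    // Workers the EOD job may use: what is left of the JDBC pool after the web tier's peak
    // need and a safety margin, never more than a hard cap and never fewer than one.
    static int workerBudget(int jdbcPoolSize, int webTierPeak, int safetyMargin, int hardCap) {
        int spare = jdbcPoolSize - webTierPeak - safetyMargin;
        return Math.max(1, Math.min(hardCap, spare));
    }

    // Pushes `tasks` short tasks through a bounded pool and returns how many the submitting
    // thread ran itself because the queue was full (CallerRunsPolicy in action).
    static int countCallerRuns(int workers, int queueDepth, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                workers, workers, 0L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(queueDepth),
                new ThreadPoolExecutor.CallerRunsPolicy());
        Thread dispatcher = Thread.currentThread();
        AtomicInteger ranOnDispatcher = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                if (Thread.currentThread() == dispatcher) {
                    ranOnDispatcher.incrementAndGet();
                }
                try {
                    Thread.sleep(10); // stand-in for one partition's database work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("pool did not drain in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while draining", e);
        }
        return ranOnDispatcher.get();
    }

    public static void main(String[] args) {
        System.out.println("worker budget: " + workerBudget(50, 20, 10, 16));
        System.out.println("tasks run by the dispatcher: " + countCallerRuns(2, 2, 20));
    }
}
```

While the dispatcher is busy running a rejected task it cannot submit more, and that pause is the whole backpressure mechanism. The trade-off is that the dispatcher is effectively one extra worker, so it needs its own connection in the budget.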
We didn't use ForkJoinPool. We didn't use parallel streams. Both were tempting and both hide the things you actually need to control on EOD night — pool size, queue depth, and what happens when one task fails.
Batch sizing: the goldilocks problem
Inside a partition, transactions still hit the database. We wrote them in batches via JDBC addBatch / executeBatch. The batch size was the single most-tuned number in the system.
| Batch size | Wall time | Notes |
|---|---|---|
| 1 | 38 min | One round-trip per row. DB is bored. |
| 50 | 14 min | Round-trips down 50×. |
| 500 | 9 min | Best throughput on our hardware. |
| 5,000 | 11 min | Long transactions, lock pressure, undo growth. |
500 was the sweet spot for our schema and hardware. The lesson wasn't the number; it was that "batch everything" without measuring is just as wrong as "no batching." We re-measured it once a quarter because data volumes moved.
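For readers who haven't used JDBC batching, the pattern looks roughly like this. The ledger_adjustment table, its columns, and committing once per batch are assumptions for the sketch; only addBatch and executeBatch come from what we actually did.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

final class BatchedPosting {

    // Splits rows into consecutive chunks of at most batchSize; the last chunk may be shorter.
    static <T> List<List<T>> chunk(List<T> rows, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int from = 0; from < rows.size(); from += batchSize) {
            chunks.add(rows.subList(from, Math.min(from + batchSize, rows.size())));
        }
        return chunks;
    }

    // Writes {loanId, amountCents} rows with one executeBatch round-trip and one commit per chunk.
    // Table and column names are hypothetical.
    static void postAdjustments(Connection conn, List<long[]> rows, int batchSize) throws SQLException {
        String sql = "INSERT INTO ledger_adjustment (loan_id, amount_cents) VALUES (?, ?)";
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (List<long[]> batch : chunk(rows, batchSize)) {
                for (long[] row : batch) {
                    ps.setLong(1, row[0]);
                    ps.setLong(2, row[1]);
                    ps.addBatch();
                }
                ps.executeBatch();
                conn.commit(); // assumption: one short transaction per batch
            }
        } catch (SQLException e) {
            conn.rollback(); // discard the partially written batch
            throw e;
        } finally {
            conn.setAutoCommit(previousAutoCommit);
        }
    }
}
```

Passing batchSize in as a parameter is deliberate: it is the number you will want to change after every measurement, so it should not be buried in the code.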
What we deliberately did not build
In 2018, the temptation list looked like this:
- A separate JVM dedicated to batch work, deployed on its own box.
- A JMS queue (ActiveMQ or IBM MQ) to dispatch partitions across multiple worker JVMs.
- Spring Batch for chunk-oriented processing with restart semantics built in.
- A move toward containerization (Docker was real, Kubernetes was just becoming "a thing you might do").
We considered all of them. We shipped none of them. The reasoning was:
- A second JVM would have doubled the operational surface for a job that already finished in well under twenty minutes.
- A message broker would have given us nothing the partition table didn't already give us, and would have added a piece of infrastructure that the bank's ops team had no runbook for.
- Spring Batch was a real option, and in hindsight I'd likely use it on a green-field rebuild. At the time, retrofitting it onto a five-year-old codebase wasn't worth the risk.
- Running containers in production for a regulated lender, in 2018, in our environment, was a battle for a different year.
The honest framing: we matched the architecture to the team and the operating environment, not to the conference talk circuit. That tradeoff aged well. The system ran EOD inside its window for years.
Where it landed
After the rewrite, EOD finished in eight to twelve minutes on normal days and never crossed sixteen on the heaviest end-of-month runs we saw. The actual headline number wasn't the speedup, though — it was that EOD became something operations stopped paying attention to. They got their morning. We got our sleep.
A few lessons I keep returning to:
- Find the unit of work that's truly independent before you reach for threads. Most "concurrency problems" are actually partitioning problems wearing a costume.
- Concurrency that shares a process with serving traffic needs a budget. Pool size, queue depth, backpressure policy — pick them deliberately.
- A boring schema (eod_run, eod_partition) gives you idempotency and observability for free. It is worth more than any clever framework.
- Match the architecture to the operating environment. A monolith with a well-designed scheduler beat a distributed system we couldn't operate.
If I were building this today I'd probably reach for a managed work queue and run the workers on something orchestrated. But the underlying design — partition the work, bound the parallelism, batch deliberately, make every step resumable — is the part I'd carry over unchanged.
That part isn't really about Java at all.