1. Underwriting and documents
A merchant applies for a cash advance and submits three months of bank statements, an ISO application, maybe a credit report through a broker portal. Or a borrower applies for an unsecured term loan and submits tax returns and a tri-merge credit pull. The broker receiving this package needs to underwrite it quickly, figure out which lenders will take it, price it, submit it. The lender receiving the same package needs to determine whether the business generates enough revenue to service the advance, how much existing debt is already pulling from the account, whether the borrower is stacking multiple positions, and whether the documents are authentic.

This entire industry runs on documents because the data that matters lives at the transaction level. An underwriter is not looking at a revenue number. They are reading the ledger: which deposits are real revenue versus internal transfers, which debits are MCA payments versus operating expenses, how many lenders are already in the account, whether the daily balance can absorb another position.

Plaid and DecisionLogic and bank portal exports give you structured access to some of this, but they compress it through their own schemas. They normalize descriptions, drop fields, aggregate where you need line items. The bank statement is what the bank actually said happened. It is the highest-resolution record of a business's financial activity, and for underwriting at this level of detail, anything coarser loses signal.

The input is not clean. Bank statements arrive as PDFs, photographed pages, DecisionLogic exports, CSV and Excel downloads. Some cover a full calendar month, some are month-to-date pulls. Credit reports come as multi-bureau exports from LexisNexis or MyScoreIQ, single-bureau PDFs, raw API output. Tax returns are scanned 1040s mixed with K-1s from different entities across different years. A single deal submission can contain all of these in the same email.
Even if every bank adopted a standard export format tomorrow, the hard problem would remain. Transactions still need to be tagged. Counterparties still need to be resolved across accounts and time periods. Debt positions still need to be clustered and attributed to specific lenders. Revenue still needs to be separated from noise. The rest of this paper focuses on the bank statement pipeline, which is where identity resolution, reconciliation, transaction compression, and position detection all live. Credit reports, tax forms, and loan applications run through their own parallel parsers.
2. Account resolution
The bank statement pipeline receives a set of documents. Different banks, different months, sometimes different accounts for the same business. Before extracting a single transaction, the system has to establish what it is looking at. All documents get read in a single pass. The output is one canonical object: the business entity, every distinct bank account visible across all statements (each assigned a sequential ID, e.g. account_id: 1), and the principals. The full set of IDs becomes the coordinate system for every downstream step. Extraction schemas constrain output to this set. If the universe contains accounts 1 and 2, the model cannot produce transactions for account 3. Duplicate IDs are a hard failure. The system assumes unique coordinates everywhere, so it enforces them at the source.
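As a minimal sketch of this coordinate system (class and method names here are hypothetical, not the production API), the account universe can both reject duplicate registrations and bound what extraction is allowed to emit:

```python
class DuplicateAccountError(ValueError):
    """Duplicate IDs are a hard failure: downstream steps assume unique coordinates."""


class AccountUniverse:
    def __init__(self):
        self._ids = set()

    def register(self, account_id: int) -> None:
        # Enforce uniqueness at the source.
        if account_id in self._ids:
            raise DuplicateAccountError(f"account_id {account_id} already registered")
        self._ids.add(account_id)

    def validate(self, account_id: int) -> None:
        # Extraction output is constrained to this set: if the universe contains
        # accounts 1 and 2, a transaction for account 3 is rejected outright.
        if account_id not in self._ids:
            raise ValueError(
                f"account_id {account_id} not in universe {sorted(self._ids)}"
            )
```

In the real pipeline this constraint lives in the extraction schema itself rather than a post-hoc check, but the effect is the same: the model cannot invent an account.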
TRANSFER TO CHK ****4521 checked against the account universe resolves to internal transfer. A Zelle payment matching a name in the principals list resolves to owner draw. These signals only exist because identity resolution ran first.
A dedup pass compares documents on period, account, and balances. Exact duplicates get excluded before extraction.
3. Parallel extraction
N documents spawn N extraction agents concurrently. Each receives the account universe and a single document. Output is transactions grouped by account ID, with amounts forced non-negative by an abs() validator; sign is carried in transaction_type. VLMs frequently confuse sign when debits and credits share a column, when negatives are parenthetical, or when the minus sign is a PDF rendering artifact. Forcing the schema to separate magnitude from direction eliminates this error class structurally. The model cannot produce a negative amount.
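A stdlib sketch of the magnitude/direction split (the production pipeline uses Pydantic field validators; the names below are illustrative):

```python
from dataclasses import dataclass
from enum import Enum


class TxnType(str, Enum):
    CREDIT = "credit"
    DEBIT = "debit"


@dataclass
class Txn:
    account_id: int
    date: str
    description: str
    amount: float            # magnitude only, always non-negative
    transaction_type: TxnType  # direction lives here, not in the sign

    def __post_init__(self):
        # Structural fix for sign confusion: any stray minus sign the model
        # emits is folded into magnitude, so a negative amount cannot survive.
        self.amount = abs(self.amount)
        self.transaction_type = TxnType(self.transaction_type)


def signed(txn: Txn) -> float:
    # Direction is re-applied only when computing balances.
    return txn.amount if txn.transaction_type is TxnType.CREDIT else -txn.amount
```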
After extraction, each ledger is sorted by date, walked forward from the starting balance to compute running daily balances, and assigned local transaction IDs. This is the first point where the data has traceable shape.
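The sort-walk-assign step could be sketched as follows (the tuple shape of the input is an assumption for the example):

```python
def build_ledger(starting_balance, txns):
    """Sort by date, walk forward from the starting balance, assign local IDs.

    txns: iterable of (date, signed_amount, description) tuples.
    Returns a list of dicts carrying a running daily balance.
    """
    ledger = []
    balance = starting_balance
    ordered = sorted(txns, key=lambda t: t[0])
    for local_id, (date, signed_amount, desc) in enumerate(ordered, start=1):
        balance = round(balance + signed_amount, 2)
        ledger.append({
            "id": local_id,            # first point where rows become traceable
            "date": date,
            "amount": signed_amount,
            "description": desc,
            "running_balance": balance,
        })
    return ledger
```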
When extraction exceeds the model’s output token limit, the system falls back to sequential chunks of ~100 transactions with deliberate overlap. The model cannot reliably resume from a position in a visual document, so the overlap region gets deduped on (date, amount, normalized_description, type).
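The overlap dedupe might look like this sketch (field names are assumed):

```python
def merge_chunks(chunks):
    """Concatenate overlapping extraction chunks, deduping on the composite
    key described in the text: (date, amount, normalized_description, type)."""
    seen, merged = set(), []
    for chunk in chunks:
        for txn in chunk:
            key = (txn["date"], txn["amount"],
                   txn["normalized_description"], txn["type"])
            if key not in seen:
                seen.add(key)
                merged.append(txn)
    return merged
```

One tradeoff worth noting: two genuinely identical transactions falling inside the overlap region would collapse to one, which is why the overlap is kept deliberately small relative to the ~100-transaction chunks.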
4. Reconciliation
Every bank statement prints a starting balance and an ending balance. Walk the starting balance forward through the extracted transactions and the result has to match. Most statements also print total credits, total debits, deposit count, withdrawal count. All checkable invariants.

This runs per ledger, not per document. A single PDF with two accounts produces two independent reconciliation problems. Each ledger has its own balance equation, its own credit and debit totals, its own transaction counts to verify against. This matters because extraction errors are not random. The most common failure mode is the model assigning a transaction to the wrong account in a multi-account document. A transaction that lands in account 1 instead of account 2 breaks both ledgers simultaneously: one is too high, the other too low, by the same amount. Per-ledger reconciliation catches this because the error shows up as a symmetric discrepancy across two ledgers in the same document.

The primary check is the balance equation: starting_balance + total_credits - total_debits = ending_balance. When it holds and the printed totals and counts line up, the ledger is verified. The balance equation produces a precise error correction signal: the magnitude and direction of any discrepancy tell the correction agent exactly what to look for. Each correction attempt is a siloed agent that receives only the current ledger state, the discrepancy, and the source document. It cannot see or trust prior attempts, so no single extraction error propagates unchecked.
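The per-ledger check and the symmetric-misassignment signature can be sketched as (the ledger dict shape is an assumption for the example):

```python
def reconcile(ledger):
    """Balance equation per ledger: start + credits - debits must equal end.

    Returns the signed discrepancy; 0.0 means the ledger is verified, and the
    sign and magnitude tell a correction agent what to look for.
    """
    computed = ledger["start"] + sum(ledger["credits"]) - sum(ledger["debits"])
    return round(ledger["end"] - computed, 2)


def symmetric_misassignment(disc_a, disc_b, tol=0.01):
    """A transaction extracted into the wrong account of the same document
    shows up as equal-and-opposite discrepancies across the two ledgers."""
    return abs(disc_a + disc_b) <= tol and abs(disc_a) > tol
```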
5. Transaction compression
A typical deal produces ~800 transactions. The classification models need to reason over all of them. The system compresses descriptions through normalization (strip confirmation codes, ACH metadata, mask tokens, numerics, single-letter fragments) and groups transactions with identical cleaned descriptions. 800 transactions collapse to ~120 groups. Two group indexes are built from this compressed space because the two classification tasks need different views of the same data. The loan index strips sponsor bank names (OptimumBank, Cross River Bank, Pathward, the originating banks MCA funders route ACH through, not the funders themselves), ACH rail metadata (ORIG CO NAME, CO ENTRY DESCR, PPD, CCD), routing stopwords. Card purchases excluded. The classification model sees “Greenline Funding” where the raw description reads “ACH DEBIT ORIG CO NAME GREENLINE FUNDING CO ENTRY DESCR PAYMENT PPD”.
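The normalization-and-group step might look like this sketch (the token lists and regexes are illustrative, not the production rules):

```python
import re

ACH_RAIL_METADATA = re.compile(r"\b(ORIG CO NAME|CO ENTRY DESCR|PPD|CCD)\b")


def normalize(desc):
    """Strip ACH rail metadata, mask numeric tokens (confirmation codes,
    account fragments), drop single-letter fragments, collapse whitespace."""
    d = desc.upper()
    d = ACH_RAIL_METADATA.sub(" ", d)
    d = re.sub(r"\d+", "#", d)        # mask numerics
    d = re.sub(r"\b[A-Z]\b", " ", d)  # single-letter fragments
    return re.sub(r"\s+", " ", d).strip()


def group(txns):
    """Group transactions with identical cleaned descriptions."""
    groups = {}
    for t in txns:
        groups.setdefault(normalize(t["description"]), []).append(t)
    return groups
```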
The core index preserves what the loan index strips. Account numbers stay because internal transfer detection depends on matching TRANSFER TO CHK ****4521 against the account universe from §2. Semantic words (“transfer”, “fee”, “wire”) stay.
Both serialize to markdown tables. Models return group IDs rather than transaction IDs. [1, 3] tags two groups covering potentially dozens of transactions. Inverted indexes expand group tags back to individual transactions.
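Expanding group tags back to transactions is a few lines (the shapes below are assumed):

```python
def expand(group_tags, inverted_index):
    """Models return group IDs; the inverted index maps each group ID back to
    its member transaction IDs. A tag like [1, 3] can fan out to dozens of
    individual transactions."""
    txn_ids = []
    for gid in group_tags:
        txn_ids.extend(inverted_index[gid])
    return sorted(set(txn_ids))
```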
6. Classification
Transaction groups get classified through three parallel engines: deterministic regex patterns, a core LLM, and a loan LLM. The regex layer handles the unambiguous stuff: checks, wires, P2P, stop payments. NSF and overdraft fees are amount-gated at $200, because a $3,000 debit with "NSF" in the description is a returned payment, not a fee. French-Canadian banking terminology is covered: Desjardins statements use chèque (cheque), dépôt chèque (cheque deposit), virement interbancaire (interbank transfer), and fonds manquants (insufficient funds).
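The amount gate can be sketched as below (the pattern list is abbreviated and the $200 ceiling parameterized; both are illustrative):

```python
import re

NSF_PATTERN = re.compile(r"\b(NSF|OVERDRAFT|FONDS MANQUANTS)\b", re.IGNORECASE)


def classify_nsf(description, amount, fee_ceiling=200.0):
    """Amount-gated NSF/overdraft rule: a small debit matching the pattern is
    a bank fee; a large one is a returned payment, not a fee."""
    if not NSF_PATTERN.search(description):
        return None
    return "nsf_fee" if amount <= fee_ceiling else "returned_payment"
```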
The core LLM reads the core group index with the full business context from §2. This is where the identity resolution pays off. The model can resolve internal transfers by matching masked account numbers against the account universe, and owner draws by matching names against the principals list. It also picks up payment processors, bank fees, interest, reversals, cash.
The loan LLM reads the loan group index with the organization’s funder registry injected into the prompt. The registry contains every lender the organization has encountered, with their known transaction description aliases. This is how the system catches MCA positions, bank loans, factoring lines, and the rest.
Tags merge in priority order: loan tags are non-stackable (one per transaction), while core and deterministic tags stack. A transaction can carry ["merchant_cash_advance", "wire"] but never two loan types. These tags drive every downstream metric: true revenue, DTI, loan summaries, NSF counts.
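The merge rule might look like this sketch (treating the first loan candidate as highest priority is an assumption):

```python
def merge_tags(loan_tags, core_tags, regex_tags):
    """Loan tags are non-stackable: at most one survives per transaction.
    Core and deterministic (regex) tags stack freely alongside it."""
    merged = []
    if loan_tags:
        merged.append(loan_tags[0])  # keep only the top loan candidate
    for tag in core_tags + regex_tags:
        if tag not in merged:
            merged.append(tag)
    return merged
```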
7. Position detection
Tags tell you what kind of activity a transaction is. Positions tell you who is on the other end. For MCAs, the LLM maps tagged groups to funder registry entities, and each position gets enriched with metadata from the registry. Positions surface in the debt board organized by loan type, with per-position financials computed from their transaction ID lists. The registry is stateful per organization: matches and aliases persist across parses, and underwriter corrections feed back into subsequent runs. The system gets better the more the organization uses it. For other loan types, character n-gram TF-IDF clusters counterparty names at τ=0.45. "GREENLINE CAPITAL", "GREENLINE CAP", and "ACH DEBIT GREENLINE" collapse into one position.
8. Tampering detection
Runs in parallel with classification. PyMuPDF extracts structural signals from each PDF: %%EOF count, creator/producer mismatch, creation vs. modification timestamps, font inventory. A scoring model weighs these alongside reconciliation outcomes. A reconciled statement with suspicious metadata is a different risk profile than a failed reconciliation with signs of editing.
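A toy version of the scoring step over already-extracted signals (signal names and weights here are illustrative, not the production model; in practice the signals come from PyMuPDF):

```python
def tampering_score(signals, reconciled):
    """Weigh structural PDF signals alongside the reconciliation outcome.

    signals: dict with keys like eof_count, creator, producer,
    created_at, modified_at (assumed shape for this sketch).
    """
    score = 0.0
    if signals.get("eof_count", 1) > 1:
        score += 0.3  # multiple %%EOF markers: incremental saves after creation
    if signals.get("creator") != signals.get("producer"):
        score += 0.2  # creator/producer mismatch
    if signals.get("created_at") != signals.get("modified_at"):
        score += 0.2  # document touched after creation
    if not reconciled:
        score += 0.3  # failed ledger reconciliation compounds the risk
    return round(score, 2)
```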
9. Structured generation and model routing
Every model call across the pipeline passes through a single function accepting a Pydantic response_schema. The LLM returns JSON, the system validates it, and nonconforming output is rejected. Field validators enforce constraints: abs() on amounts, date normalization, enum coercion. The schema bounds what the model can express.
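A stdlib stand-in for that single choke point (the real version validates against a Pydantic response_schema; call_model and validate are placeholders for this sketch):

```python
import json


def generate_structured(call_model, prompt, validate, max_attempts=3):
    """Single function every model call passes through.

    call_model: prompt -> raw text (the LLM).
    validate: parsed JSON -> conforming object, raising on nonconformance
    (the role a Pydantic response_schema plays in the real pipeline).
    Nonconforming output is rejected and the call retried.
    """
    last_error = None
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            return validate(json.loads(raw))
        except (ValueError, KeyError) as exc:
            last_error = exc  # reject; the schema bounds what the model can express
    raise ValueError(f"no conforming output after {max_attempts} attempts: {last_error}")
```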
We route primarily through Gemini Flash with thinking budgets tuned per task. Reconciliation gets a high thinking budget because the model needs to reason about discrepancies against the source document. Bulk extraction gets zero thinking budget because throughput matters more. VLM performance on financial documents has improved substantially over the past year, particularly on dense tabular layouts. The remaining failure modes are edge cases in visual parsing: merged cells, absent running balance columns, degraded OCR on fine print. Reconciliation catches most of these downstream.
10. Canonical output
We are looking forward to what comes next, and I am excited to continue building in the open.
— Armaan