By Armaan Kapoor

Financial data arrives as bank statements, credit reports, tax returns, loan applications, Plaid JSON, DecisionLogic exports, scanned PDFs, and phone photos. The formats are different, the schemas are different, and the level of structure varies wildly. Pathway takes all of it and maps it into one structured object per deal. That object is the thing that matters. Once it exists, everything else is a projection over it: analytics, screening, spreadsheets, CRM field sync, chat-based analysis, embed API responses. One canonical representation, regardless of which banks, which bureaus, or which formats the documents came from.

This paper covers the bank statement pipeline, because that is where the system does its most involved work. Bank statements have built-in reconciliation criteria. Every statement prints a starting balance and an ending balance. Most print deposit and withdrawal counts. Some include a running daily balance column. These are on the document itself, regardless of format or bank:

$$B_{\text{end}} = B_{\text{start}} + \sum_i c_i - \sum_j d_j$$

Extract every transaction and check the math. If it holds, the extraction is complete. If not, the discrepancy becomes an error signal: magnitude, direction, and, when a running balance column is present, the exact row where divergence begins. That signal drives a bounded correction loop over the candidate ledger.
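The invariant can be sketched in a few lines of Python. The transaction shape mirrors the JSON shown later in this piece; the function name and the 5-cent tolerance (stated in the reconciliation section) are illustrative, not the production code:

```python
def reconciles(beginning: float, ending: float, transactions: list[dict],
               tol: float = 0.05) -> bool:
    # Walk the candidate ledger: credits add, debits subtract. If the
    # computed ending balance matches the printed one within a small
    # tolerance, the extraction is complete.
    computed = beginning
    for t in transactions:
        signed = t["amount"] if t["transaction_type"] == "credit" else -t["amount"]
        computed += signed
    return abs(computed - ending) <= tol
```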

Account resolution

A deal might have 12 bank statement PDFs across two accounts from different months, plus a loan application and a credit report. Before extracting any transactions, the system needs to answer three questions: whose bank statements are these, how many distinct accounts exist, and who are the people tied to the business. The pipeline starts with a global pass across all bank statement documents at once. All PDFs are read together in a single call, and the output is one _AccountMetadata object that captures the business identity, the account universe, and the principals:
class _AccountMetadata(BaseModel):
    business: Business           # name, legal name, address, phone, tax ID
    account_ledgers: List[AccountLedger]  # every distinct bank account
    humans: Optional[List[Human]]         # owners / account holders
Each distinct bank account gets an integer ID. This becomes the coordinate system for everything downstream:
{
  "account_ledgers": [
    { "account_id": 1, "account_name": "Business Checking", "account_number": "****4521", "account_type": "checking", "bank_name": "Chase" },
    { "account_id": 2, "account_name": "Business Savings", "account_number": "****8903", "account_type": "savings", "bank_name": "Chase" }
  ]
}
This account universe is established once. Every subsequent extraction is constrained to it. The valid IDs are injected into the prompt (VALID ACCOUNT IDS: [1, 2]) and the model cannot invent new accounts. Each document is then processed individually. The model reads one PDF and returns the statement period and the reconciliation targets for each account visible in that document:
{
  "statement_start_date": "2024-11-01",
  "statement_end_date": "2024-11-30",
  "accounts_visible_in_statement": [
    { "account_id": 1, "beginning_balance": 24531.88, "ending_balance": 31204.55, "number_of_deposits": 47, "number_of_withdrawals": 112 }
  ]
}
The account_id here must reference an account from the universe established in the first pass. The beginning and ending balances become the reconciliation targets. The deposit and withdrawal counts, when the bank prints them, become additional cross-checks. Redundant documents (same period, same account, matching balances) are detected and excluded before extraction begins.
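That redundancy check amounts to building a key from the fields shown above and keeping the first statement seen per key. A sketch, assuming the dict shapes from the JSON examples (the helper names are illustrative):

```python
def statement_key(period: dict, account: dict) -> tuple:
    # Same period + same account + matching balances = same statement,
    # however many times it was uploaded.
    return (
        period["statement_start_date"],
        period["statement_end_date"],
        account["account_id"],
        round(account["beginning_balance"], 2),
        round(account["ending_balance"], 2),
    )

def drop_redundant(statements: list[dict]) -> list[dict]:
    seen, kept = set(), []
    for stmt in statements:
        keys = frozenset(
            statement_key(stmt, a) for a in stmt["accounts_visible_in_statement"]
        )
        if keys not in seen:
            seen.add(keys)
            kept.append(stmt)
    return kept
```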

Transaction extraction

Now the system has the frame: the business, the accounts, and for each document, the period and balances. Extraction fills in the transactions. Each document is processed in parallel: the system spawns concurrent subagents, one per document, that each read a full PDF and return transactions grouped by account. A deal with 12 statements means 12 parallel extraction calls:
{
  "transactions_by_account_number": [
    {
      "account_id": 1,
      "transactions": [
        { "transaction_date": "2024-11-01", "description": "ACH Credit - Stripe Transfer", "amount": 8412.33, "transaction_type": "credit" },
        { "transaction_date": "2024-11-01", "description": "ACH Debit - Greenline Funding", "amount": 2847.00, "transaction_type": "debit" }
      ]
    }
  ]
}
Amounts are always positive (enforced by a Pydantic abs() validator). The sign is carried by transaction_type. This eliminates sign-related extraction errors at the schema boundary.
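A pure-Python stand-in for that schema boundary (the real pipeline enforces this with a Pydantic field validator; the function here only illustrates the contract):

```python
VALID_TYPES = {"credit", "debit"}

def validate_transaction(txn: dict) -> dict:
    # Amounts are coerced to magnitude; direction lives only in
    # transaction_type, which must be one of the two enum values.
    if txn["transaction_type"] not in VALID_TYPES:
        raise ValueError(f"unknown transaction_type: {txn['transaction_type']!r}")
    out = dict(txn)
    out["amount"] = abs(float(out["amount"]))
    return out
```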

Chunked fallback

When output exceeds the model’s token limit, the system falls back to chunked extraction: roughly 100 transactions per chunk, sequential, with the last 10 transactions from the previous chunk passed as continuation context. Boundary deduplication compares the last 5 of the previous chunk against the first 5 of the new one using a composite key of (date, amount, normalized_description, type).
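The boundary stitch can be sketched like this, using the composite key and 5-row window from the description above (the function names are illustrative):

```python
def dedupe_key(t: dict) -> tuple:
    # Composite key: (date, amount, normalized_description, type).
    return (t["transaction_date"], round(t["amount"], 2),
            " ".join(t["description"].lower().split()), t["transaction_type"])

def stitch_chunks(prev: list[dict], new: list[dict], window: int = 5) -> list[dict]:
    # Compare the tail of the previous chunk against the head of the new
    # one and drop any overlap before concatenating.
    tail = {dedupe_key(t) for t in prev[-window:]}
    head_kept = [t for t in new[:window] if dedupe_key(t) not in tail]
    return prev + head_kept + new[window:]
```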

Reconciliation

Extraction produces a candidate ledger per account per document. Reconciliation runs at the ledger level: if a document contains two accounts, each is reconciled independently, again in parallel. The beginning and ending balances captured during statement metadata extraction become the test. For each ledger, the system computes:

$$B_{\text{computed}} = B_{\text{start}} + \sum_i \text{signed}(t_i)$$

If $|B_{\text{computed}} - B_{\text{end}}| \leq 0.05$, the ledger is reconciled. Otherwise, the system enters a correction loop of up to 3 attempts. The correction agent receives the current ledger as a CSV, the original PDF, and a BalanceFix response schema:
class BalanceFix(BaseModel):
    flip_indices: list[int]            # 1-based, flip debit↔credit
    remove_indices: list[int]          # 1-based, remove row
    add_transactions: list[Transaction]  # missing rows from the PDF
    give_up: bool
    explanation: str
These are the only legal moves. The agent cannot change amounts, rewrite descriptions, or restructure the ledger. The CSV includes running balances so the agent can locate exactly where divergence starts. The direction of the error and, for multi-account statements, the other accounts in the PDF are provided as context. After each correction, the system recomputes and re-checks. Up to 3 attempts. Never fabricates transactions.
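Applying a fix and re-checking can be sketched as follows; the dict-based shapes stand in for the Pydantic models above, and the 1-based indexing matches the schema:

```python
def apply_balance_fix(ledger: list[dict], fix: dict) -> list[dict]:
    # The three legal moves, nothing else: flip debit/credit, remove a
    # row, append transactions that were missed. Amounts and
    # descriptions are never rewritten.
    flips, removes = set(fix["flip_indices"]), set(fix["remove_indices"])
    out = []
    for i, txn in enumerate(ledger, start=1):
        if i in removes:
            continue
        txn = dict(txn)
        if i in flips:
            txn["transaction_type"] = (
                "credit" if txn["transaction_type"] == "debit" else "debit"
            )
        out.append(txn)
    return out + list(fix["add_transactions"])

def computed_ending(beginning: float, ledger: list[dict]) -> float:
    return beginning + sum(
        t["amount"] if t["transaction_type"] == "credit" else -t["amount"]
        for t in ledger
    )
```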

Transaction compression

At this point, every account in every document has a reconciled ledger of transactions. The system now merges all of them into a single flat transaction space, globally reindexed with unique integer IDs starting from 1. A deal with 12 statements across 2 accounts might produce 800 transactions in this flat space. Tagging needs to classify all of them. Sending 800 individual transactions to a model would be expensive and noisy. Instead, the system compresses them. Every transaction description is normalized through an aggressive cleaning function:
import re

def clean_for_clustering(desc: str) -> str:
    s = desc.lower()
    s = re.sub(r"[^a-z0-9 ]+", " ", s)  # keep alphanumeric + spaces
    s = re.sub(r"\d+", "", s)           # strip all numbers
    s = re.sub(r"\s+", " ", s).strip()  # collapse whitespace
    return s
Transactions with identical normalized descriptions merge into groups. A book with 800 transactions might compress to 120 groups. Each group carries an exemplar description, credit count and sum, debit count and sum. Two separate group indexes are built from this compressed space:
  1. Loan groups: card purchases filtered out, descriptions cleaned with clean_for_loan_display (strips ACH metadata, sponsor bank names, routing codes), sorted by cleaned counterparty name.
  2. Core groups: all transactions included, descriptions cleaned with clean_for_core_display (lighter cleaning, preserves semantic words like “transfer”, “fee”, “wire”).
Both are serialized into markdown tables for model consumption:
| Group ID | Description | Credit Count | Credit Sum | Debit Count | Debit Sum |
|---:|---|---:|---:|---:|---:|
| 1 | Greenline Funding | 1 | 15000.00 | 33 | 94251.00 |
| 2 | Stripe Transfer | 87 | 142033.41 | 0 | 0.00 |
| 3 | Zelle Transfer John Smith | 0 | 0.00 | 12 | 62400.00 |
The model operates on compressed group IDs. It returns [1, 3] to tag two groups instead of listing 45 individual transaction IDs. The system maintains inverted indexes (t2g_map, t2g_core_map) that map every transaction ID to its group, so group-level tags can be exploded back to individual transactions after the model responds. This is what makes the tagging step tractable: the model reasons over 120 compressed groups instead of 800 raw transactions, and the symbolic group IDs give it a compact vocabulary to express operations over large sets of transactions at once.
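The explosion from group-level answers back to transaction IDs is a single pass over the inverted index. A sketch, with the map shapes assumed from the description above:

```python
def explode_group_tags(tagged_group_ids: list[int],
                       t2g_map: dict[int, int]) -> list[int]:
    # t2g_map maps transaction ID -> group ID. A model answer like
    # [1, 3] expands to every transaction whose group is in that list.
    wanted = set(tagged_group_ids)
    return sorted(tid for tid, gid in t2g_map.items() if gid in wanted)
```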

Tagging

Three tagging passes run in parallel.

Deterministic patterns

Compiled regex patterns match transaction descriptions at the individual transaction level. These cover checks (including the French-Canadian chèque), wires (FEDWIRE, IMAD, virement interbancaire), peer-to-peer transfers (ZELLE, VENMO, CASH APP), NSF, overdraft, and stop payments. NSF and overdraft tags are gated on an amount range ($0.01 to $200.00) and restricted to debits:
def tag_nsf(description: str, amount: float = 0.0) -> Optional[str]:
    if not (0.01 <= amount <= 200.00):
        return None
    if _NSF_REGEX.search(re.sub(r"\s+", " ", description.strip())):
        return "nsf"
    return None

Core LLM tags

The compressed core group table is sent to a model with business context (merchant identity, owner names, account numbers). The model returns a CoreTagBreakdown: lists of group IDs for each category (internal transfer, owner transaction, payment processor, bank fee, bank interest, reversal, cash). Internal transfer detection uses the merchant’s account numbers as context (masked numbers in descriptions like TRANSFER TO CHK ****4521 signal same-bank movement). Owner transaction detection uses owner names from account metadata to identify draws and contributions.

Loan LLM tags

The compressed loan group table is sent to a model with the org’s funder registry injected. The model returns a LoanBreakdown: group IDs for each loan type (MCA, bank loan, factoring, auto, lease, mortgage, BNPL, debt collection). The system validates that no group appears in more than one loan type.
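That validation is a disjointness check over the returned group lists. A sketch, not the production code:

```python
def assert_disjoint_loan_types(breakdown: dict[str, list[int]]) -> None:
    # Each group ID may appear under at most one loan type; a group
    # that shows up under two means the model double-classified a
    # counterparty.
    seen: dict[int, str] = {}
    for loan_type, group_ids in breakdown.items():
        for gid in group_ids:
            if gid in seen and seen[gid] != loan_type:
                raise ValueError(
                    f"group {gid} tagged as both {seen[gid]} and {loan_type}"
                )
            seen[gid] = loan_type
```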

Merge

Tags merge in order: loan tags (one per transaction), core LLM tags (stackable, group-level), deterministic tags (stackable, transaction-level). A single transaction can carry ["merchant_cash_advance", "wire"] but not two loan types.
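The merge itself can be sketched per transaction; the map shapes here are illustrative stand-ins for the pipeline's internal structures:

```python
def merge_transaction_tags(
    txn_ids: list[int],
    loan_tags: dict[int, str],        # at most one loan type per transaction
    core_tags: dict[int, list[str]],  # stackable, exploded from groups
    det_tags: dict[int, list[str]],   # stackable, per-transaction regex hits
) -> dict[int, list[str]]:
    merged = {}
    for tid in txn_ids:
        tags = []
        if tid in loan_tags:
            tags.append(loan_tags[tid])  # exclusive loan tag first
        tags += core_tags.get(tid, [])
        tags += det_tags.get(tid, [])
        merged[tid] = tags
    return merged
```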

Position detection

After tagging, the system knows which transactions are loan activity and what type. The next step is grouping them into positions: named clusters of transactions that belong to the same lender. For MCA-tagged groups, this is LLM-based. The compressed MCA groups are sent to a model along with the org’s funder registry (names and known transaction description aliases). The model creates named positions, each containing a list of group IDs, and matches each to a funder when it recognizes one. The system explodes group IDs back to transaction IDs and enriches each position with funder metadata (name, link, contact email). For non-MCA loan types (bank loans, factoring, auto, lease), position detection is fully deterministic. The system extracts counterparty names from the tagged groups and clusters them using TF-IDF cosine similarity:
from typing import List

import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

def cluster_by_similarity(texts: List[str], tau: float = 0.45):
    cleaned = [clean_for_clustering(t) for t in texts]
    n = len(cleaned)
    vectorizer = TfidfVectorizer(
        analyzer="char_wb", ngram_range=(3, 6), min_df=2
    )
    X = vectorizer.fit_transform(cleaned)
    k = min(8, n - 1, max(2, n // 3))  # adaptive neighbor count
    nn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(X)
    distances, indices = nn.kneighbors(X)
    sims = 1 - distances

    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i in range(n):
        for j, sim in zip(indices[i], sims[i]):
            if i != j and sim >= tau:
                G.add_edge(i, j, weight=sim)

    return [list(cc) for cc in nx.connected_components(G)]
Character n-gram TF-IDF (3 to 6 characters) handles the high variance of bank transaction descriptions. The k-NN graph with adaptive k and a cosine similarity threshold of 0.45 produces connected components that each become a position. The exemplar name for each cluster is selected by frequency of the extracted counterparty name across its members.

Tampering detection

While tagging and position detection run, a separate background task analyzes the uploaded PDFs for signs of fabrication. This task runs in parallel and cannot fail the parse. If it errors, the result is null and the book completes normally. The analysis has two layers. The first is deterministic: the system opens each PDF with PyMuPDF and extracts structural metadata.
class PDFMetadataSignals(BaseModel):
    eof_count: int             # multiple %%EOF markers = re-saved
    created_at_raw: str | None
    modified_at_raw: str | None
    time_diff_seconds: float | None  # creation vs modification gap
    creator: str | None        # authoring tool
    producer: str | None       # PDF generator
    is_encrypted: bool
    xref_object_count: int     # object table size
    page_fonts: list           # embedded font inventory
Multiple %%EOF markers mean the file was re-saved. A creator / producer mismatch suggests re-export through an editing tool. A time_diff_seconds of zero on a statement dated months earlier is suspicious. None of these are conclusive alone, but they combine into a signal. The second layer sends the metadata alongside reconciliation results to a model that produces a tampering score from 0 (fresh from the bank) to 7 (definite tampering), a written summary, and a list of flagged documents.
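Two of the deterministic signals are easy to sketch without PyMuPDF: counting %%EOF markers in the raw bytes, and parsing the PDF date strings that feed time_diff_seconds. The D:YYYYMMDDHHMMSS prefix is the standard PDF date format; the helper names here are illustrative:

```python
import re
from datetime import datetime

def count_eof_markers(pdf_bytes: bytes) -> int:
    # A single %%EOF is normal; more than one means the file was
    # incrementally re-saved after its original write.
    return pdf_bytes.count(b"%%EOF")

def parse_pdf_date(raw: str):
    # PDF dates look like "D:20241130143022-05'00'". This keeps only
    # the timestamp fields present and ignores the timezone suffix.
    m = re.match(r"D:(\d{4})(\d{2})(\d{2})(\d{2})?(\d{2})?(\d{2})?", raw)
    if not m:
        return None
    parts = [int(p) for p in m.groups() if p is not None]
    return datetime(*parts)
```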

Structured generation

Everything described so far (classification, account resolution, extraction, reconciliation, tagging, position detection, tampering analysis) runs through a single function:
async def complete_with_retry(
    self,
    book_id: str,
    org_id: str,
    model: Model,
    contents: List[types.Content],
    response_schema: Optional[Type[T]] = None,  # T: BaseModel
    thinking_budget: int = 0,
    ...
) -> Optional[T]:
The caller passes a Pydantic model as response_schema. The LLM returns structured JSON. The system validates with isinstance(result.parsed, response_schema) and rejects anything that does not conform. Field validators (abs() on amounts, date coercion, enum normalization) run automatically on the parsed response. This is why a Transaction always has a positive amount, a BalanceFix can only contain flips, removes, and adds, and a CoreTagBreakdown can only return lists of group IDs. These are typed contracts, not prompt suggestions. The same function handles retry with exponential backoff, MaxTokensError detection, input-too-large rejection, usage tracking, and tracing. Every model interaction in the pipeline is this one path.
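Minus the provider client, error taxonomy, usage tracking, and tracing, the shape of that path looks roughly like this. Everything here is illustrative: call_model stands in for the actual API call, and the backoff constants are placeholders:

```python
import time

def complete_with_retry_sketch(call_model, response_schema,
                               max_attempts: int = 3, base_delay: float = 1.0):
    # Call the model, validate the parsed result against the expected
    # type (a Pydantic model in the real pipeline), and back off
    # exponentially on any failure. A non-conforming result is a miss.
    for attempt in range(max_attempts):
        try:
            result = call_model()
            if isinstance(result, response_schema):
                return result
        except Exception:
            pass  # transient error: fall through to backoff
        time.sleep(base_delay * (2 ** attempt))
    return None
```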

Canonical output

The full pipeline produces a single object: HolyMCAResult.
class HolyMCAResult(BaseModel):
    business: Optional[Business]
    humans: Optional[List[Human]]
    account_ledgers: Optional[List[AccountLedger]]
    bank_statements: Optional[List[HolyBankStatement]]
    merged_accounts: Dict[str, MergedMCAAccount]
    positions: Optional[List[StoredPosition]]
    transaction_count: int
    web_research: Optional[str]
    tampering_analysis: Optional[Any]
This object is stored as JSONB in the book’s metadata. It is the single source of truth for everything the system knows about a deal’s financial documents.

Downstream

Once HolyMCAResult exists, every other surface in the platform reads from it. Analytics (true revenue, average daily balance, DTI, negative days, counterparty clusters) are derived on the fly, not stored separately, so any edit to tags, exclusions, or positions is reflected immediately. Screening evaluates the analytics against org-configured rules and passes or flags the deal. Spreadsheets (MCA stack views, monthly columns, qualifying income) are generated as Excel workbooks from the same object. Salesforce sync maps a configurable subset of metrics to custom fields on an Opportunity. The chat engine hydrates the full analytics into a sandboxed Python environment with tool access (run_command, write_file, save_artifact), so a model can compute over the deal, generate charts, build spreadsheets, and search the user’s Gmail. The embed API serves the same data to external clients building their own CRM or white-labeled views. The point of all of this is that the hard work happens once, during the parse. Classification, account resolution, extraction, reconciliation, compression, tagging, position detection, tampering analysis. That pipeline produces one canonical object. Everything after that, every metric, every spreadsheet, every API response, every chat interaction, is a projection over it. One source of truth, many views, always consistent.