Account resolution
A deal might have 12 bank statement PDFs across two accounts from different months, plus a loan application and a credit report. Before extracting any transactions, the system needs to answer three questions: whose bank statements are these, how many distinct accounts exist, and who are the people tied to the business. The pipeline starts with a global pass across all bank statement documents at once. Every PDF is read together in a single call, and the output is one AccountMetadata object that captures the business identity, the account universe, and the principals:
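A minimal sketch of what that AccountMetadata object might look like. The field names here are assumptions for illustration, not the production schema; only the three facets named above (business identity, account universe, principals) come from the text.

```python
from dataclasses import dataclass, field

@dataclass
class BankAccount:
    account_id: int      # stable integer ID referenced by every later pass
    bank_name: str
    masked_number: str   # e.g. "****4521"

@dataclass
class Principal:
    name: str
    role: str            # e.g. "owner", "guarantor"

@dataclass
class AccountMetadata:
    business_name: str
    accounts: list[BankAccount] = field(default_factory=list)
    principals: list[Principal] = field(default_factory=list)

    def valid_account_ids(self) -> list[int]:
        # The account universe that downstream calls must stay within.
        return [a.account_id for a in self.accounts]
```

The integer account_id is the key design choice: every later pass refers to accounts by these IDs rather than by bank name or masked number.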
The account universe is then injected into every downstream prompt as a hard constraint (VALID ACCOUNT IDS: [1, 2]), and the model cannot invent new accounts.
Each document is then processed individually. The model reads one PDF and returns the statement period and the reconciliation targets for each account visible in that document:
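A sketch of the per-document statement metadata and its validation, assuming hypothetical names (AccountStatement, validate_statement); the invariant itself is from the text: account_id must come from the universe established in the global pass.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AccountStatement:
    account_id: int                      # must exist in the account universe
    period_start: str                    # ISO dates, e.g. "2024-03-01"
    period_end: str
    beginning_balance: float             # reconciliation targets
    ending_balance: float
    deposit_count: Optional[int] = None  # extra cross-checks, when the bank prints them
    withdrawal_count: Optional[int] = None

def validate_statement(stmt: AccountStatement, valid_ids: set[int]) -> AccountStatement:
    # Reject hallucinated accounts at the schema boundary.
    if stmt.account_id not in valid_ids:
        raise ValueError(
            f"unknown account_id {stmt.account_id}; valid: {sorted(valid_ids)}"
        )
    return stmt
```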
account_id here must reference an account from the universe established in the first pass. The beginning and ending balances become the reconciliation targets. The deposit and withdrawal counts, when the bank prints them, become additional cross-checks. Redundant documents (same period, same account, matching balances) are detected and excluded before extraction begins.
Transaction extraction
Now the system has the frame: the business, the accounts, and for each document, the period and balances. Extraction fills in the transactions. Each document is processed in parallel: the system spawns concurrent subagents, one per document, that each read a full PDF and return transactions grouped by account. A deal with 12 statements means 12 parallel extraction calls. Every amount is stored as a positive number (enforced by an abs() validator); the sign is carried by transaction_type. This eliminates sign-related extraction errors at the schema boundary.
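The abs()-at-the-boundary idea can be sketched like this; the field names are illustrative, but the behavior matches the text: the amount can never be negative, and sign lives only in transaction_type.

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    date: str
    description: str
    amount: float
    transaction_type: str  # "credit" or "debit"

    def __post_init__(self):
        # The abs() validator: a sign error in extraction cannot survive
        # past construction, because amount is coerced to its magnitude.
        self.amount = abs(self.amount)
        if self.transaction_type not in ("credit", "debit"):
            raise ValueError(f"bad transaction_type: {self.transaction_type}")

def signed_amount(t: Transaction) -> float:
    # Sign is reintroduced only when doing arithmetic, from the type.
    return t.amount if t.transaction_type == "credit" else -t.amount
```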
Chunked fallback
When output exceeds the model’s token limit, the system falls back to chunked extraction: roughly 100 transactions per chunk, sequential, with the last 10 transactions from the previous chunk passed as continuation context. Boundary deduplication compares the last 5 of the previous chunk against the first 5 of the new one using a composite key of (date, amount, normalized_description, type).
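The boundary deduplication might look like the following sketch (function names are assumed; the composite key and the 5-transaction window are from the text):

```python
def composite_key(t: dict) -> tuple:
    # (date, amount, normalized_description, type), as described above.
    return (
        t["date"],
        round(t["amount"], 2),
        " ".join(t["description"].upper().split()),  # whitespace/case normalization
        t["type"],
    )

def merge_chunks(prev: list[dict], new: list[dict], window: int = 5) -> list[dict]:
    # Compare the last `window` rows of the previous chunk against the
    # first `window` rows of the new chunk; drop duplicates from the new one.
    tail = {composite_key(t) for t in prev[-window:]}
    deduped = [
        t for i, t in enumerate(new)
        if not (i < window and composite_key(t) in tail)
    ]
    return prev + deduped
```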
Reconciliation
Extraction produces a candidate ledger per account per document. Reconciliation runs at the ledger level: if a document contains two accounts, each is reconciled independently, again in parallel. The beginning and ending balances captured during the statement metadata pass become the test. For each ledger, the system computes the implied ending balance: the beginning balance plus the sum of signed transaction amounts. If it matches the stated ending balance to the cent, the ledger is reconciled. Otherwise, the system enters a correction loop of up to 3 attempts. The correction agent receives the current ledger as a CSV, the original PDF, and a BalanceFix response schema:
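A sketch of the reconciliation test and the BalanceFix shape. The cent-level tolerance is an assumption; the three operations (flips, removes, adds) are the ones the typed-contracts discussion names.

```python
from dataclasses import dataclass, field

def is_reconciled(beginning: float, ending: float,
                  signed_amounts: list[float], tolerance: float = 0.01) -> bool:
    # Implied ending balance must match the stated one within tolerance.
    implied = beginning + sum(signed_amounts)
    return abs(implied - ending) <= tolerance

@dataclass
class BalanceFix:
    # The correction agent can only express three operations:
    flips: list[int] = field(default_factory=list)    # transaction IDs whose credit/debit sign to flip
    removes: list[int] = field(default_factory=list)  # IDs extracted in error
    adds: list[dict] = field(default_factory=list)    # rows the extractor missed
```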
Transaction compression
At this point, every account in every document has a reconciled ledger of transactions. The system now merges all of them into a single flat transaction space, globally reindexed with unique integer IDs starting from 1. A deal with 12 statements across 2 accounts might produce 800 transactions in this flat space. Tagging needs to classify all of them. Sending 800 individual transactions to a model would be expensive and noisy. Instead, the system compresses them. Every transaction description is normalized through an aggressive cleaning function, and transactions are grouped two ways:

- Loan groups: card purchases filtered out, descriptions cleaned with clean_for_loan_display (strips ACH metadata, sponsor bank names, routing codes), sorted by cleaned counterparty name.
- Core groups: all transactions included, descriptions cleaned with clean_for_core_display (lighter cleaning, preserves semantic words like “transfer”, “fee”, “wire”).
Each group carries a symbolic integer ID, so the model can return [1, 3] to tag two groups instead of listing 45 individual transaction IDs. The system maintains inverted indexes (t2g_map, t2g_core_map) that map every transaction ID to its group, so group-level tags can be exploded back to individual transactions after the model responds. This is what makes the tagging step tractable: the model reasons over 120 compressed groups instead of 800 raw transactions, and the symbolic group IDs give it a compact vocabulary to express operations over large sets of transactions at once.
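The compress-then-explode round trip can be sketched as follows. Function names (compress, explode) and the toy normalizer are assumptions; the inverted index mirrors the t2g_map described above.

```python
from collections import defaultdict

def compress(transactions: dict[int, str], normalize):
    # Group transaction IDs by normalized description and assign symbolic
    # group IDs starting from 1. Returns (groups, t2g inverted index).
    buckets: dict[str, list[int]] = defaultdict(list)
    for tid, desc in sorted(transactions.items()):
        buckets[normalize(desc)].append(tid)
    groups = {gid: tids for gid, tids in enumerate(buckets.values(), start=1)}
    t2g = {tid: gid for gid, tids in groups.items() for tid in tids}
    return groups, t2g

def explode(group_ids: list[int], groups: dict[int, list[int]]) -> list[int]:
    # Turn a model answer like [1, 3] back into individual transaction IDs.
    return [tid for gid in group_ids for tid in groups[gid]]
```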
Tagging
Three tagging passes run in parallel.

Deterministic patterns
Compiled regex patterns match transaction descriptions at the individual level. These cover checks (including French-Canadian chèque), wires (FEDWIRE, IMAD, virement interbancaire), peer-to-peer (ZELLE, VENMO, CASH APP), NSF, overdraft, and stop payments. NSF and overdraft are gated on an amount range (up to $200.00) and restricted to debits:
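A sketch of the deterministic pass with a few of the listed patterns; the regexes here are illustrative, not the production set, and the $200.00 gate is applied as described above.

```python
import re

PATTERNS = {
    "check": re.compile(r"\b(CHECK|CHEQUE|CHÈQUE)\b", re.IGNORECASE),
    "wire": re.compile(r"\b(FEDWIRE|IMAD|VIREMENT INTERBANCAIRE|WIRE)\b", re.IGNORECASE),
    "p2p": re.compile(r"\b(ZELLE|VENMO|CASH APP)\b", re.IGNORECASE),
    "nsf": re.compile(r"\b(NSF|INSUFFICIENT FUNDS)\b", re.IGNORECASE),
}

def deterministic_tags(description: str, amount: float, is_debit: bool) -> list[str]:
    tags = [name for name, pat in PATTERNS.items() if pat.search(description)]
    # NSF is gated: debits only, within the expected fee range.
    if "nsf" in tags and not (is_debit and 0 < amount <= 200.00):
        tags.remove("nsf")
    return tags
```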
Core LLM tags
The compressed core group table is sent to a model with business context (merchant identity, owner names, account numbers). The model returns a CoreTagBreakdown: lists of group IDs for each category (internal transfer, owner transaction, payment processor, bank fee, bank interest, reversal, cash).
Internal transfer detection uses the merchant’s account numbers as context (masked numbers in descriptions like TRANSFER TO CHK ****4521 signal same-bank movement). Owner transaction detection uses owner names from account metadata to identify draws and contributions.
Loan LLM tags
The compressed loan group table is sent to a model with the org’s funder registry injected. The model returns a LoanBreakdown: group IDs for each loan type (MCA, bank loan, factoring, auto, lease, mortgage, BNPL, debt collection). The system validates that no group appears in more than one loan type.
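That uniqueness check is straightforward to sketch (the function name is an assumption):

```python
from collections import Counter

def validate_loan_breakdown(breakdown: dict[str, list[int]]) -> dict[str, list[int]]:
    # Each group may appear under at most one loan type.
    counts = Counter(gid for gids in breakdown.values() for gid in gids)
    dupes = sorted(gid for gid, n in counts.items() if n > 1)
    if dupes:
        raise ValueError(f"groups tagged with multiple loan types: {dupes}")
    return breakdown
```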
Merge
Tags merge in order: loan tags (one per transaction), core LLM tags (stackable, group-level), deterministic tags (stackable, transaction-level). A single transaction can carry ["merchant_cash_advance", "wire"] but not two loan types.
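The merge order and the one-loan-tag rule can be sketched per transaction like this (a simplified stand-in, not the production merge):

```python
from typing import Optional

def merge_tags(loan_tag: Optional[str], core_tags: list[str], det_tags: list[str]) -> list[str]:
    # Loan tags are exclusive: at most one per transaction, applied first.
    # Core LLM tags and deterministic tags stack on top, in merge order.
    tags: list[str] = []
    if loan_tag is not None:
        tags.append(loan_tag)
    for t in core_tags + det_tags:
        if t not in tags:
            tags.append(t)
    return tags
```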
Position detection
After tagging, the system knows which transactions are loan activity and what type. The next step is grouping them into positions: named clusters of transactions that belong to the same lender. For MCA-tagged groups, this is LLM-based. The compressed MCA groups are sent to a model along with the org’s funder registry (names and known transaction description aliases). The model creates named positions, each containing a list of group IDs, and matches each to a funder when it recognizes one. The system explodes group IDs back to transaction IDs and enriches each position with funder metadata (name, link, contact email). For non-MCA loan types (bank loans, factoring, auto, lease), position detection is fully deterministic. The system extracts counterparty names from the tagged groups and clusters them using TF-IDF cosine similarity.

Tampering detection
While tagging and position detection run, a separate background task analyzes the uploaded PDFs for signs of fabrication. This task runs in parallel and cannot fail the parse. If it errors, the result is null and the book completes normally.
The analysis has two layers. The first is deterministic: the system opens each PDF with PyMuPDF and extracts structural metadata.
Multiple %%EOF markers mean the file was re-saved. A creator / producer mismatch suggests re-export through an editing tool. A time_diff_seconds of zero on a statement dated months earlier is suspicious. None of these are conclusive alone, but they combine into a signal.
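The deterministic layer might look like this sketch, operating on the raw bytes and a metadata dict of the kind PyMuPDF extracts (field names and thresholds are assumptions):

```python
def structural_flags(raw_pdf: bytes, meta: dict, statement_age_days: int) -> list[str]:
    # Heuristics from the text; none is conclusive on its own.
    flags = []
    if raw_pdf.count(b"%%EOF") > 1:
        # Incremental saves append a new xref table and %%EOF marker.
        flags.append("multiple_eof")
    creator, producer = meta.get("creator"), meta.get("producer")
    if creator and producer and creator != producer:
        # Possible re-export through an editing tool.
        flags.append("creator_producer_mismatch")
    if meta.get("time_diff_seconds") == 0 and statement_age_days > 30:
        # File timestamps look freshly minted for a months-old statement.
        flags.append("fresh_file_old_statement")
    return flags
```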
The second layer sends the metadata alongside reconciliation results to a model that produces a tampering score from 0 (fresh from the bank) to 7 (definite tampering), a written summary, and a list of flagged documents.
Structured generation
Everything described so far (classification, account resolution, extraction, reconciliation, tagging, position detection, tampering analysis) runs through a single function. Every call passes a response_schema. The LLM returns structured JSON. The system validates with isinstance(result.parsed, response_schema) and rejects anything that does not conform. Field validators (abs() on amounts, date coercion, enum normalization) run automatically on the parsed response.
This is why a Transaction always has a positive amount, a BalanceFix can only contain flips, removes, and adds, and a CoreTagBreakdown can only return lists of group IDs. These are typed contracts, not prompt suggestions.
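A minimal sketch of that schema gate. The model call itself is elided; `parsed` stands in for the SDK's parsed output, and the toy CoreTagBreakdown here is an illustrative stand-in, not the production class.

```python
from dataclasses import dataclass

def enforce_schema(parsed: object, response_schema: type):
    # Reject anything that is not an instance of the requested schema;
    # a dict that merely "looks right" does not pass.
    if not isinstance(parsed, response_schema):
        raise TypeError(
            f"model returned {type(parsed).__name__}, expected {response_schema.__name__}"
        )
    return parsed

@dataclass
class CoreTagBreakdown:  # toy stand-in for one of the response schemas
    internal_transfer: tuple = ()
    bank_fee: tuple = ()
```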
The same function handles retry with exponential backoff, MaxTokensError detection, input-too-large rejection, usage tracking, and tracing. Every model interaction in the pipeline is this one path.
Canonical output
The full pipeline produces a single object: HolyMCAResult.
Downstream
Once HolyMCAResult exists, every other surface in the platform reads from it. Analytics (true revenue, average daily balance, DTI, negative days, counterparty clusters) are derived on the fly, not stored separately, so any edit to tags, exclusions, or positions is reflected immediately. Screening evaluates the analytics against org-configured rules and passes or flags the deal. Spreadsheets (MCA stack views, monthly columns, qualifying income) are generated as Excel workbooks from the same object. Salesforce sync maps a configurable subset of metrics to custom fields on an Opportunity. The chat engine hydrates the full analytics into a sandboxed Python environment with tool access (run_command, write_file, save_artifact), so a model can compute over the deal, generate charts, build spreadsheets, and search the user’s Gmail. The embed API serves the same data to external clients building their own CRM or white-labeled views.
The point of all of this is that the hard work happens once, during the parse: classification, account resolution, extraction, reconciliation, compression, tagging, position detection, tampering analysis. That pipeline produces one canonical object. Everything after that (every metric, every spreadsheet, every API response, every chat interaction) is a projection over it. One source of truth, many views, always consistent.