By Armaan Kapoor

Financial data arrives as bank statements, credit reports, tax returns, loan applications, Plaid JSON, DecisionLogic exports, scanned PDFs, and phone photos. The formats are different, the schemas are different, and the level of structure varies wildly. Pathway takes all of it and maps it into one structured object per deal. That object is the thing that matters. Once it exists, everything else is a projection over it: analytics, screening, spreadsheets, CRM field sync, chat-based analysis, embed API responses. One canonical representation, regardless of which banks, which bureaus, or which formats the documents came from.

This paper covers the bank statement pipeline, because that is where the system does its most involved work. Bank statements have built-in reconciliation criteria. Every statement prints a starting balance and an ending balance. Most print deposit and withdrawal counts. Some include a running daily balance column. These are on the document itself, regardless of format or bank:

$$B_{\text{end}} = B_{\text{start}} + \sum_i c_i - \sum_j d_j$$

Extract every transaction and check the math. If it holds, the extraction is complete. If not, the discrepancy becomes an error signal: magnitude, direction, and, when a running balance column is present, the exact row where divergence begins. That signal drives a bounded correction loop over the candidate ledger.
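The invariant can be sketched in a few lines of Python. The transaction shape mirrors the JSON shown later in this piece; the function name and the 5-cent tolerance (stated in the reconciliation section) are illustrative, not the production code:

```python
def reconciles(beginning: float, ending: float, transactions: list[dict],
               tol: float = 0.05) -> bool:
    # Walk the candidate ledger: credits add, debits subtract. If the
    # computed ending balance matches the printed one within a small
    # tolerance, the extraction is complete.
    computed = beginning
    for t in transactions:
        signed = t["amount"] if t["transaction_type"] == "credit" else -t["amount"]
        computed += signed
    return abs(computed - ending) <= tol
```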

Account resolution

A deal might have 12 bank statement PDFs across two accounts from different months, plus a loan application and a credit report. Before extracting any transactions, the system needs to answer three questions: whose bank statements are these, how many distinct accounts exist, and who are the people tied to the business. The pipeline starts with a global pass across all bank statement documents at once. All PDFs are read together in a single call, and the output is one _AccountMetadata object that captures the business identity, the account universe, and the principals:
class _AccountMetadata(BaseModel):
    business: Business           # name, legal name, address, phone, tax ID
    account_ledgers: List[AccountLedger]  # every distinct bank account
    humans: Optional[List[Human]]         # owners / account holders
Each distinct bank account gets an integer ID. This becomes the coordinate system for everything downstream:
{
  "account_ledgers": [
    { "account_id": 1, "account_name": "Business Checking", "account_number": "****4521", "account_type": "checking", "bank_name": "Chase" },
    { "account_id": 2, "account_name": "Business Savings", "account_number": "****8903", "account_type": "savings", "bank_name": "Chase" }
  ]
}
This account universe is established once. Every subsequent extraction is constrained to it. The valid IDs are injected into the prompt (VALID ACCOUNT IDS: [1, 2]) and the model cannot invent new accounts. Each document is then processed individually. The model reads one PDF and returns the statement period and the reconciliation targets for each account visible in that document:
{
  "statement_start_date": "2024-11-01",
  "statement_end_date": "2024-11-30",
  "accounts_visible_in_statement": [
    { "account_id": 1, "beginning_balance": 24531.88, "ending_balance": 31204.55, "number_of_deposits": 47, "number_of_withdrawals": 112 }
  ]
}
The account_id here must reference an account from the universe established in the first pass. The beginning and ending balances become the reconciliation targets. The deposit and withdrawal counts, when the bank prints them, become additional cross-checks. Redundant documents (same period, same account, matching balances) are detected and excluded before extraction begins.
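That redundancy check amounts to building a key from the fields shown above and keeping the first statement seen per key. A sketch, assuming the dict shapes from the JSON examples (the helper names are illustrative):

```python
def statement_key(period: dict, account: dict) -> tuple:
    # Same period + same account + matching balances = same statement,
    # however many times it was uploaded.
    return (
        period["statement_start_date"],
        period["statement_end_date"],
        account["account_id"],
        round(account["beginning_balance"], 2),
        round(account["ending_balance"], 2),
    )

def drop_redundant(statements: list[dict]) -> list[dict]:
    seen, kept = set(), []
    for stmt in statements:
        keys = frozenset(
            statement_key(stmt, a) for a in stmt["accounts_visible_in_statement"]
        )
        if keys not in seen:
            seen.add(keys)
            kept.append(stmt)
    return kept
```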

Transaction extraction

Now the system has the frame: the business, the accounts, and for each document, the period and balances. Extraction fills in the transactions. Each document is processed in parallel: the system spawns concurrent subagents, one per document, that each read a full PDF and return transactions grouped by account. A deal with 12 statements means 12 parallel extraction calls:
{
  "transactions_by_account_number": [
    {
      "account_id": 1,
      "transactions": [
        { "transaction_date": "2024-11-01", "description": "ACH Credit - Stripe Transfer", "amount": 8412.33, "transaction_type": "credit" },
        { "transaction_date": "2024-11-01", "description": "ACH Debit - Greenline Funding", "amount": 2847.00, "transaction_type": "debit" }
      ]
    }
  ]
}
Amounts are always positive (enforced by a Pydantic abs() validator). The sign is carried by transaction_type. This eliminates sign-related extraction errors at the schema boundary.
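A pure-Python stand-in for that schema boundary (the real pipeline enforces this with a Pydantic field validator; the function here only illustrates the contract):

```python
VALID_TYPES = {"credit", "debit"}

def validate_transaction(txn: dict) -> dict:
    # Amounts are coerced to magnitude; direction lives only in
    # transaction_type, which must be one of the two enum values.
    if txn["transaction_type"] not in VALID_TYPES:
        raise ValueError(f"unknown transaction_type: {txn['transaction_type']!r}")
    out = dict(txn)
    out["amount"] = abs(float(out["amount"]))
    return out
```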

Chunked fallback

When output exceeds the model’s token limit, the system falls back to chunked extraction: roughly 100 transactions per chunk, sequential, with the last 10 transactions from the previous chunk passed as continuation context. Boundary deduplication compares the last 5 of the previous chunk against the first 5 of the new one using a composite key of (date, amount, normalized_description, type).
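The boundary stitch can be sketched like this, using the composite key and 5-row window from the description above (the function names are illustrative):

```python
def dedupe_key(t: dict) -> tuple:
    # Composite key: (date, amount, normalized_description, type).
    return (t["transaction_date"], round(t["amount"], 2),
            " ".join(t["description"].lower().split()), t["transaction_type"])

def stitch_chunks(prev: list[dict], new: list[dict], window: int = 5) -> list[dict]:
    # Compare the tail of the previous chunk against the head of the new
    # one and drop any overlap before concatenating.
    tail = {dedupe_key(t) for t in prev[-window:]}
    head_kept = [t for t in new[:window] if dedupe_key(t) not in tail]
    return prev + head_kept + new[window:]
```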

Reconciliation

Extraction produces a candidate ledger per account per document. Reconciliation runs at the ledger level: if a document contains two accounts, each is reconciled independently, again in parallel. The beginning and ending balances captured during statement metadata extraction become the test. For each ledger, the system computes:

$$B_{\text{computed}} = B_{\text{start}} + \sum_i \text{signed}(t_i)$$

If $|B_{\text{computed}} - B_{\text{end}}| \leq 0.05$, the ledger is reconciled. Otherwise, the system enters a correction loop of up to 3 attempts. The correction agent receives the current ledger as a CSV, the original PDF, and a BalanceFix response schema:
class BalanceFix(BaseModel):
    flip_indices: list[int]            # 1-based, flip debit↔credit
    remove_indices: list[int]          # 1-based, remove row
    add_transactions: list[Transaction]  # missing rows from the PDF
    give_up: bool
    explanation: str
These are the only legal moves. The agent cannot change amounts, rewrite descriptions, or restructure the ledger. The CSV includes running balances so the agent can locate exactly where divergence starts. The direction of the error and, for multi-account statements, the other accounts in the PDF are provided as context. After each correction, the system recomputes and re-checks. Up to 3 attempts. Never fabricates transactions.
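Applying a fix and re-checking can be sketched as follows; the dict-based shapes stand in for the Pydantic models above, and the 1-based indexing matches the schema:

```python
def apply_balance_fix(ledger: list[dict], fix: dict) -> list[dict]:
    # The three legal moves, nothing else: flip debit/credit, remove a
    # row, append transactions that were missed. Amounts and
    # descriptions are never rewritten.
    flips, removes = set(fix["flip_indices"]), set(fix["remove_indices"])
    out = []
    for i, txn in enumerate(ledger, start=1):
        if i in removes:
            continue
        txn = dict(txn)
        if i in flips:
            txn["transaction_type"] = (
                "credit" if txn["transaction_type"] == "debit" else "debit"
            )
        out.append(txn)
    return out + list(fix["add_transactions"])

def computed_ending(beginning: float, ledger: list[dict]) -> float:
    return beginning + sum(
        t["amount"] if t["transaction_type"] == "credit" else -t["amount"]
        for t in ledger
    )
```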

Transaction compression

At this point, every account in every document has a reconciled ledger of transactions. The system now merges all of them into a single flat transaction space, globally reindexed with unique integer IDs starting from 1. A deal with 12 statements across 2 accounts might produce 800 transactions in this flat space. Tagging needs to classify all of them. Sending 800 individual transactions to a model would be expensive and noisy. Instead, the system compresses them. Every transaction description is normalized through an aggressive cleaning function:
import re

def clean_for_clustering(desc: str) -> str:
    s = desc.lower()
    s = re.sub(r"[^a-z0-9 ]+", " ", s)  # keep alphanumeric + spaces
    s = re.sub(r"\d+", "", s)           # strip all numbers
    s = re.sub(r"\s+", " ", s).strip()  # collapse whitespace
    return s
Transactions with identical normalized descriptions merge into groups. A book with 800 transactions might compress to 120 groups. Each group carries an exemplar description, credit count and sum, debit count and sum. Two separate group indexes are built from this compressed space:
  1. Loan groups: card purchases filtered out, descriptions cleaned with clean_for_loan_display (strips ACH metadata, sponsor bank names, routing codes), sorted by cleaned counterparty name.
  2. Core groups: all transactions included, descriptions cleaned with clean_for_core_display (lighter cleaning, preserves semantic words like “transfer”, “fee”, “wire”).
Both are serialized into markdown tables for model consumption:
| Group ID | Description | Credit Count | Credit Sum | Debit Count | Debit Sum |
|---:|---|---:|---:|---:|---:|
| 1 | Greenline Funding | 1 | 15000.00 | 33 | 94251.00 |
| 2 | Stripe Transfer | 87 | 142033.41 | 0 | 0.00 |
| 3 | Zelle Transfer John Smith | 0 | 0.00 | 12 | 62400.00 |
The model operates on compressed group IDs. It returns [1, 3] to tag two groups instead of listing 45 individual transaction IDs. The system maintains inverted indexes (t2g_map, t2g_core_map) that map every transaction ID to its group, so group-level tags can be exploded back to individual transactions after the model responds. This is what makes the tagging step tractable: the model reasons over 120 compressed groups instead of 800 raw transactions, and the symbolic group IDs give it a compact vocabulary to express operations over large sets of transactions at once.
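The explosion from group-level answers back to transaction IDs is a single pass over the inverted index. A sketch, with the map shapes assumed from the description above:

```python
def explode_group_tags(tagged_group_ids: list[int],
                       t2g_map: dict[int, int]) -> list[int]:
    # t2g_map maps transaction ID -> group ID. A model answer like
    # [1, 3] expands to every transaction whose group is in that list.
    wanted = set(tagged_group_ids)
    return sorted(tid for tid, gid in t2g_map.items() if gid in wanted)
```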

Tagging

Three tagging passes run in parallel.

Deterministic patterns

Compiled regex patterns match transaction descriptions at the individual transaction level. These cover checks (including the French-Canadian chèque), wires (FEDWIRE, IMAD, virement interbancaire), peer-to-peer transfers (ZELLE, VENMO, CASH APP), NSF, overdraft, and stop payments. NSF and overdraft tags are gated on an amount range ($0.01 to $200.00) and restricted to debits:
def tag_nsf(description: str, amount: float = 0.0) -> Optional[str]:
    if not (0.01 <= amount <= 200.00):
        return None
    if _NSF_REGEX.search(re.sub(r"\s+", " ", description.strip())):
        return "nsf"
    return None

Core LLM tags

The compressed core group table is sent to a model with business context (merchant identity, owner names, account numbers). The model returns a CoreTagBreakdown: lists of group IDs for each category (internal transfer, owner transaction, payment processor, bank fee, bank interest, reversal, cash). Internal transfer detection uses the merchant’s account numbers as context (masked numbers in descriptions like TRANSFER TO CHK ****4521 signal same-bank movement). Owner transaction detection uses owner names from account metadata to identify draws and contributions.

Loan LLM tags

The compressed loan group table is sent to a model with the org’s funder registry injected. The model returns a LoanBreakdown: group IDs for each loan type (MCA, bank loan, factoring, auto, lease, mortgage, BNPL, debt collection). The system validates that no group appears in more than one loan type.
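That validation is a disjointness check over the returned group lists. A sketch, not the production code:

```python
def assert_disjoint_loan_types(breakdown: dict[str, list[int]]) -> None:
    # Each group ID may appear under at most one loan type; a group
    # that shows up under two means the model double-classified a
    # counterparty.
    seen: dict[int, str] = {}
    for loan_type, group_ids in breakdown.items():
        for gid in group_ids:
            if gid in seen and seen[gid] != loan_type:
                raise ValueError(
                    f"group {gid} tagged as both {seen[gid]} and {loan_type}"
                )
            seen[gid] = loan_type
```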

Merge

Tags merge in order: loan tags (one per transaction), core LLM tags (stackable, group-level), deterministic tags (stackable, transaction-level). A single transaction can carry ["merchant_cash_advance", "wire"] but not two loan types.
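The merge itself can be sketched per transaction; the map shapes here are illustrative stand-ins for the pipeline's internal structures:

```python
def merge_transaction_tags(
    txn_ids: list[int],
    loan_tags: dict[int, str],        # at most one loan type per transaction
    core_tags: dict[int, list[str]],  # stackable, exploded from groups
    det_tags: dict[int, list[str]],   # stackable, per-transaction regex hits
) -> dict[int, list[str]]:
    merged = {}
    for tid in txn_ids:
        tags = []
        if tid in loan_tags:
            tags.append(loan_tags[tid])  # exclusive loan tag first
        tags += core_tags.get(tid, [])
        tags += det_tags.get(tid, [])
        merged[tid] = tags
    return merged
```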

Position detection

After tagging, the system knows which transactions are loan activity and what type. The next step is grouping them into positions: named clusters of transactions that belong to the same lender. For MCA-tagged groups, this is LLM-based. The compressed MCA groups are sent to a model along with the org’s funder registry (names and known transaction description aliases). The model creates named positions, each containing a list of group IDs, and matches each to a funder when it recognizes one. The system explodes group IDs back to transaction IDs and enriches each position with funder metadata (name, link, contact email). For non-MCA loan types (bank loans, factoring, auto, lease), position detection is fully deterministic. The system extracts counterparty names from the tagged groups and clusters them using TF-IDF cosine similarity:
from typing import List

import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

def cluster_by_similarity(texts: List[str], tau: float = 0.45):
    cleaned = [clean_for_clustering(t) for t in texts]
    n = len(cleaned)
    vectorizer = TfidfVectorizer(
        analyzer="char_wb", ngram_range=(3, 6), min_df=2
    )
    X = vectorizer.fit_transform(cleaned)
    k = min(8, n - 1, max(2, n // 3))  # adaptive neighbor count
    nn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(X)
    distances, indices = nn.kneighbors(X)
    sims = 1 - distances

    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i in range(n):
        for j, sim in zip(indices[i], sims[i]):
            if i != j and sim >= tau:
                G.add_edge(i, j, weight=sim)

    return [list(cc) for cc in nx.connected_components(G)]
Character n-gram TF-IDF (3 to 6 characters) handles the high variance of bank transaction descriptions. The k-NN graph with adaptive k and a cosine similarity threshold of 0.45 produces connected components that each become a position. The exemplar name for each cluster is selected by frequency of the extracted counterparty name across its members.

Tampering detection

While tagging and position detection run, a separate background task analyzes the uploaded PDFs for signs of fabrication. This task runs in parallel and cannot fail the parse. If it errors, the result is null and the book completes normally. The analysis has two layers. The first is deterministic: the system opens each PDF with PyMuPDF and extracts structural metadata.
class PDFMetadataSignals(BaseModel):
    eof_count: int             # multiple %%EOF markers = re-saved
    created_at_raw: str | None
    modified_at_raw: str | None
    time_diff_seconds: float | None  # creation vs modification gap
    creator: str | None        # authoring tool
    producer: str | None       # PDF generator
    is_encrypted: bool
    xref_object_count: int     # object table size
    page_fonts: list           # embedded font inventory
Multiple %%EOF markers mean the file was re-saved. A creator / producer mismatch suggests re-export through an editing tool. A time_diff_seconds of zero on a statement dated months earlier is suspicious. None of these are conclusive alone, but they combine into a signal. The second layer sends the metadata alongside reconciliation results to a model that produces a tampering score from 0 (fresh from the bank) to 7 (definite tampering), a written summary, and a list of flagged documents.
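Two of the deterministic signals are easy to sketch without PyMuPDF: counting %%EOF markers in the raw bytes, and parsing the PDF date strings that feed time_diff_seconds. The D:YYYYMMDDHHMMSS prefix is the standard PDF date format; the helper names here are illustrative:

```python
import re
from datetime import datetime

def count_eof_markers(pdf_bytes: bytes) -> int:
    # A single %%EOF is normal; more than one means the file was
    # incrementally re-saved after its original write.
    return pdf_bytes.count(b"%%EOF")

def parse_pdf_date(raw: str):
    # PDF dates look like "D:20241130143022-05'00'". This keeps only
    # the timestamp fields present and ignores the timezone suffix.
    m = re.match(r"D:(\d{4})(\d{2})(\d{2})(\d{2})?(\d{2})?(\d{2})?", raw)
    if not m:
        return None
    parts = [int(p) for p in m.groups() if p is not None]
    return datetime(*parts)
```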

Structured generation

Everything described so far (classification, account resolution, extraction, reconciliation, tagging, position detection, tampering analysis) runs through a single function:
async def complete_with_retry(
    self,
    book_id: str,
    org_id: str,
    model: Model,
    contents: List[types.Content],
    response_schema: Optional[Type[T]] = None,  # T: BaseModel
    thinking_budget: int = 0,
    ...
) -> Optional[T]:
The caller passes a Pydantic model as response_schema. The LLM returns structured JSON. The system validates with isinstance(result.parsed, response_schema) and rejects anything that does not conform. Field validators (abs() on amounts, date coercion, enum normalization) run automatically on the parsed response. This is why a Transaction always has a positive amount, a BalanceFix can only contain flips, removes, and adds, and a CoreTagBreakdown can only return lists of group IDs. These are typed contracts, not prompt suggestions. The same function handles retry with exponential backoff, MaxTokensError detection, input-too-large rejection, usage tracking, and tracing. Every model interaction in the pipeline is this one path.
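Minus the provider client, error taxonomy, usage tracking, and tracing, the shape of that path looks roughly like this. Everything here is illustrative: call_model stands in for the actual API call, and the backoff constants are placeholders:

```python
import time

def complete_with_retry_sketch(call_model, response_schema,
                               max_attempts: int = 3, base_delay: float = 1.0):
    # Call the model, validate the parsed result against the expected
    # type (a Pydantic model in the real pipeline), and back off
    # exponentially on any failure. A non-conforming result is a miss.
    for attempt in range(max_attempts):
        try:
            result = call_model()
            if isinstance(result, response_schema):
                return result
        except Exception:
            pass  # transient error: fall through to backoff
        time.sleep(base_delay * (2 ** attempt))
    return None
```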

Canonical output

The full pipeline produces a single object: HolyMCAResult.
class HolyMCAResult(BaseModel):
    business: Optional[Business]
    humans: Optional[List[Human]]
    account_ledgers: Optional[List[AccountLedger]]
    bank_statements: Optional[List[HolyBankStatement]]
    merged_accounts: Dict[str, MergedMCAAccount]
    positions: Optional[List[StoredPosition]]
    transaction_count: int
    web_research: Optional[str]
    tampering_analysis: Optional[Any]
This object is stored as JSONB in the book’s metadata. It is the single source of truth for everything the system knows about a deal’s financial documents.

Downstream

Once HolyMCAResult exists, every other surface in the platform reads from it. Analytics (true revenue, average daily balance, DTI, negative days, counterparty clusters) are derived on the fly, not stored separately, so any edit to tags, exclusions, or positions is reflected immediately. Screening evaluates the analytics against org-configured rules and passes or flags the deal. Spreadsheets (MCA stack views, monthly columns, qualifying income) are generated as Excel workbooks from the same object. Salesforce sync maps a configurable subset of metrics to custom fields on an Opportunity. The chat engine hydrates the full analytics into a sandboxed Python environment with tool access (run_command, write_file, save_artifact), so a model can compute over the deal, generate charts, build spreadsheets, and search the user’s Gmail. The embed API serves the same data to external clients building their own CRM or white-labeled views. The point of all of this is that the hard work happens once, during the parse. Classification, account resolution, extraction, reconciliation, compression, tagging, position detection, tampering analysis. That pipeline produces one canonical object. Everything after that, every metric, every spreadsheet, every API response, every chat interaction, is a projection over it. One source of truth, many views, always consistent.