How Parsing Works

When a parse job is initiated, LendPathway runs a multi-step pipeline that turns raw PDFs into structured, analyzed financial data. This page explains what happens at each step.

The Pipeline

Every parse follows this sequence:

1. Load Documents

All uploaded files are pulled from storage. Only PDF files are accepted — any other file type (images, spreadsheets, Word docs, etc.) is immediately marked as failed.

2. Classify Each Document

Each PDF is sent to an AI model that reads the document and identifies what kind of financial document it is. Classification runs in parallel — all PDFs are classified at the same time. The classifier can identify these document types:

Type	What it looks for
Bank Statement	Transactions, deposits, withdrawals, account balances
Credit Report	Experian / Equifax / TransUnion, credit scores, tradelines
Tax Form	IRS forms (1040, 1065, 1120, 1120-S, Schedule C, Schedule E, K-1, etc.)
Loan Application	Application form for financing
Receipt	Purchase receipt, business expense
DecisionLogic	Report from decisionlogic.com
AR Report	Accounts receivable report
Photo ID	Government-issued ID (driver’s license, passport)
Voided Check	Bank check showing account details

Documents that don’t match any type are marked Unsupported.

Of these, four have full parsing pipelines: Bank Statement, Credit Report, Tax Form, and Loan Application. The others (Receipt, Photo ID, Voided Check, etc.) are classified and stored but not parsed further.

3. Run Type-Specific Pipelines

Based on classification, LendPathway runs the appropriate parser for each document type — in parallel. If you upload a mix of bank statements, a credit report, and a loan application, all three pipelines run simultaneously.
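The first three stages can be sketched as a small asyncio program. This is an illustration only, not LendPathway's implementation: `classify` and `run_pipeline` are hypothetical stand-ins (the real classifier is an AI model, not a filename check), and the type identifiers are invented for the example.

```python
import asyncio
from collections import defaultdict
from pathlib import Path

# Hypothetical identifiers for the four types with full parsing pipelines.
PARSED_TYPES = ("bank_statement", "credit_report", "tax_form", "loan_application")
OTHER_TYPES = ("receipt", "photo_id", "voided_check")


async def classify(filename: str) -> str:
    """Stand-in for the AI classifier: guesses the type from the filename."""
    await asyncio.sleep(0)  # placeholder for a model call
    for doc_type in PARSED_TYPES + OTHER_TYPES:
        if doc_type in filename:
            return doc_type
    return "unsupported"


async def run_pipeline(doc_type: str, files: list) -> tuple:
    """Stand-in for a type-specific parser: reports how many files it received."""
    await asyncio.sleep(0)
    return doc_type, len(files)


async def parse_book(filenames: list) -> dict:
    # 1. Load: only PDFs are accepted, everything else is marked failed.
    pdfs = [f for f in filenames if Path(f).suffix.lower() == ".pdf"]
    failed = [f for f in filenames if f not in pdfs]
    # 2. Classify every PDF concurrently.
    types = await asyncio.gather(*(classify(f) for f in pdfs))
    by_type = defaultdict(list)
    for filename, doc_type in zip(pdfs, types):
        by_type[doc_type].append(filename)
    # 3. Run the type-specific pipelines concurrently; other types are only stored.
    parsed = await asyncio.gather(
        *(run_pipeline(t, fs) for t, fs in by_type.items() if t in PARSED_TYPES)
    )
    return {"failed": failed, "classified": dict(by_type), "parsed": dict(parsed)}
```

Calling `asyncio.run(parse_book([...]))` with a mix of files returns the failed uploads, the classification map, and one result per pipeline that ran.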

Bank Statement Pipeline

The bank statement pipeline is the most complex. Here’s what happens inside it, step by step.

Step 1 — Account Metadata Extraction

The AI reads all bank statement PDFs together in a single call and extracts:

Business identity — business name, address, phone, tax ID
Principals — owner names, roles, addresses, phone numbers
Account ledgers — every distinct bank account across all documents (account name, account number, bank name, routing number)

Each account is assigned a unique ID. This step establishes the map of accounts that the rest of the pipeline uses. If no bank accounts are found at all, the parse fails here. After this step, two things happen immediately:

The book name is updated to the extracted business name
AI Deep Research is kicked off in the background (more on this below)

Step 2 — Statement Metadata Extraction

For each individual PDF (in parallel), the AI extracts statement-level metadata:

Statement start and end dates
Starting and ending balances, per account
Which accounts appear in this specific document

This is also where LendPathway figures out the date range for the book (e.g. “3 accounts, Jan 2024 to Dec 2024”).
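As a rough illustration of how a book-level summary could be derived from statement metadata, the sketch below takes the earliest start date, the latest end date, and the number of distinct accounts. The dictionary keys and the `book_summary` helper are assumptions made for this example, not LendPathway's data model.

```python
from datetime import date


def book_summary(statements):
    """Summarize distinct accounts and the overall date range across statements."""
    account_ids = set()
    for statement in statements:
        account_ids.update(statement["account_ids"])
    start = min(s["start"] for s in statements)
    end = max(s["end"] for s in statements)
    noun = "account" if len(account_ids) == 1 else "accounts"
    return f"{len(account_ids)} {noun}, {start:%b %Y} to {end:%b %Y}"


statements = [
    {"account_ids": ["chk-1", "sav-1"], "start": date(2024, 1, 1), "end": date(2024, 1, 31)},
    {"account_ids": ["chk-1", "loc-1"], "start": date(2024, 12, 1), "end": date(2024, 12, 31)},
]
# Month abbreviations are locale-dependent; under the default C locale this prints:
# 3 accounts, Jan 2024 to Dec 2024
print(book_summary(statements))
```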

Step 3 — Duplicate Detection

This step only runs when there are 2 or more bank statement PDFs. The AI compares all statements and identifies redundant documents: complete duplicates, or documents that are subsets of another (e.g. someone uploaded both a full 3-page statement and a 1-page summary of the same month). Redundant documents are marked as failed and removed before transaction extraction, which prevents double-counted data.

Step 4 — Transaction Extraction

For each remaining statement (in parallel), the AI extracts every individual transaction:

Date
Description
Amount
Type (credit or debit)

This is the most computationally intensive step. If a document is too large for a single extraction call (hits token limits), LendPathway automatically falls back to chunked extraction — pulling transactions in batches of ~100 at a time, up to 20 chunks, and merging the results. Each chunk receives context about where the previous chunk left off to avoid gaps.
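The chunked fallback can be pictured as a bounded loop that passes each call a resume point. The sketch below is a simplification under stated assumptions: `extract_chunk` is a hypothetical callable standing in for the AI extraction call, the batch size is fixed at exactly 100, and "context about where the previous chunk left off" is reduced to the last transaction returned.

```python
CHUNK_SIZE = 100  # the text says roughly 100 per batch
MAX_CHUNKS = 20   # hard cap on the number of batches


def extract_in_chunks(extract_chunk):
    """Pull transactions batch by batch and merge them in order.

    extract_chunk(resume_after, limit) is assumed to return the next `limit`
    transactions that follow `resume_after` (None means start of document).
    """
    transactions = []
    resume_after = None
    for _ in range(MAX_CHUNKS):
        batch = extract_chunk(resume_after, CHUNK_SIZE)
        if not batch:
            break  # nothing left in the document
        transactions.extend(batch)
        resume_after = batch[-1]  # tells the next call where to pick up
    return transactions


def fake_extractor(rows):
    """Toy extractor over an in-memory list, used only to exercise the loop."""
    def extract_chunk(resume_after, limit):
        start = 0 if resume_after is None else rows.index(resume_after) + 1
        return rows[start:start + limit]
    return extract_chunk
```

With these constants, a single document yields at most 20 × 100 = 2,000 transactions through the fallback path in this sketch.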

Step 5 — Assembly

The extracted metadata and transactions are merged together into ledgers. A ledger is one account’s data within one statement document — its starting balance, ending balance, and list of transactions. A single PDF can produce multiple ledgers if it contains multiple accounts.
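One way to picture a ledger is as a small record keyed by document and account. The dataclasses and the `assemble` helper below are hypothetical names chosen for this sketch; the actual schema may differ.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import Optional


@dataclass
class Transaction:
    date: str
    description: str
    amount: Decimal
    type: str  # "credit" or "debit"


@dataclass
class Ledger:
    document_id: str
    account_id: str
    starting_balance: Optional[Decimal]  # None if it could not be extracted
    ending_balance: Optional[Decimal]
    transactions: list = field(default_factory=list)


def assemble(document_id, balances, transactions):
    """Build one ledger per account found in a single statement document.

    balances: {account_id: (starting_balance, ending_balance)}
    transactions: iterable of (account_id, Transaction) pairs
    """
    ledgers = {
        account_id: Ledger(document_id, account_id, start, end)
        for account_id, (start, end) in balances.items()
    }
    for account_id, txn in transactions:
        ledgers[account_id].transactions.append(txn)
    return list(ledgers.values())
```

A PDF covering a checking and a savings account would pass two entries in `balances` and get two ledgers back, each with its own balances and transaction list.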

Step 6 — Reconciliation

After assembly, LendPathway reconciles each ledger independently (all ledgers in parallel). This is the mathematical verification step. The formula:

Starting Balance + Sum of All Credits − Sum of All Debits = Computed Ending Balance

The computed ending balance is compared against the ending balance printed on the statement. If they match, the ledger is reconciled, meaning the extracted transactions are a mathematically faithful representation of the bank’s own records.

If it doesn’t reconcile on the first check, LendPathway enters a retry loop (up to 3 attempts). On each attempt, the AI receives:

The original PDF (ground truth)
The current list of extracted transactions as a CSV
The current discrepancy amount and direction (too high or too low)
If the statement has multiple accounts, a note about which account is being reconciled

The AI compares the extracted transactions against the PDF and can make three types of corrections:

Flip a transaction’s type — if a credit was mistakenly extracted as a debit (or vice versa), flip it
Remove a transaction — if a duplicate or nonexistent transaction was extracted
Add a missing transaction — if a transaction visible in the PDF wasn’t extracted

The AI is instructed to only make corrections it can clearly verify in the PDF. It will never fabricate transactions to force the math to work. If the extraction looks correct but the math still doesn’t add up (e.g. the bank’s own statement has an internal discrepancy), the AI gives up and explains why.

After corrections are applied, the balance is rechecked. If it’s within $0.05, the ledger is reconciled. If not, the next attempt runs. After 3 failed attempts (or if the AI gives up), the ledger is marked not reconciled with an explanation of what went wrong.

Reconciliation is skipped entirely if the starting or ending balance couldn’t be extracted from the statement.
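The balance check and retry loop described in this step can be sketched as follows. This is a minimal sketch under stated assumptions: `propose_corrections` is a hypothetical callable standing in for the AI correction call (returning a corrected list, or None when it gives up), the $0.05 tolerance is treated as inclusive, and the same tolerance is applied to the first check.

```python
from decimal import Decimal

TOLERANCE = Decimal("0.05")
MAX_ATTEMPTS = 3


def discrepancy(start, end, transactions):
    """Computed ending balance minus the ending balance printed on the statement."""
    credits = sum((t["amount"] for t in transactions if t["type"] == "credit"), Decimal("0"))
    debits = sum((t["amount"] for t in transactions if t["type"] == "debit"), Decimal("0"))
    return start + credits - debits - end


def reconcile(start, end, transactions, propose_corrections):
    """Return (status, transactions) after the initial check and up to 3 retries."""
    if start is None or end is None:
        return "skipped", transactions  # a balance could not be extracted
    diff = discrepancy(start, end, transactions)
    if abs(diff) <= TOLERANCE:
        return "reconciled", transactions
    for _ in range(MAX_ATTEMPTS):
        corrected = propose_corrections(transactions, diff)
        if corrected is None:  # the corrector gave up
            break
        transactions = corrected
        diff = discrepancy(start, end, transactions)
        if abs(diff) <= TOLERANCE:
            return "reconciled", transactions
    return "not_reconciled", transactions
```

A positive `diff` means the extracted transactions produce a balance that is too high, and a negative one means too low, which corresponds to the direction hint the AI receives on each attempt.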

Step 7 — Tagging

After reconciliation, all transactions across all ledgers are assigned a global sequential ID (1, 2, 3, …) and then tagged. Tagging runs three processes in parallel.

AI Loan Tagging: The AI classifies transaction groups into debt/loan types:

Tag	Display Name
merchant_cash_advance	Merchant Cash Advance
bank_loan	Bank Loan
factoring	Factoring
credit	Credit Card
lease	Lease
auto	Auto Loan
mortgage	Mortgage
buy_now_pay_later	Buy Now Pay Later
debt_collection	Debt Collection

Each transaction can have at most one loan tag.

AI Core Tagging: The AI classifies transaction groups into activity categories. The AI receives business identity and account context to make accurate calls (e.g. knowing the business name helps distinguish internal transfers from external payments):

Tag	Display Name
internal_transfer	Internal Transfer
owner_transaction	Owner Transaction
payment_processor	Payment Processor
bank_fee	Bank Fee
bank_interest	Bank Interest
reversal	Reversal
cash	Cash

A transaction can have multiple core tags.

Deterministic Pattern Tagging: Rule-based regex matching (no AI involved) that identifies:

Tag	Display Name
check	Check
wire	Wire
peer_to_peer	P2P
stop_payment	Stop Payment
nsf	NSF
overdraft	Overdraft

NSF and overdraft tags are only applied to debits. A transaction can have multiple deterministic tags.

All three tag types are then merged onto each transaction: loan tag first (if any), then core tags, then deterministic tags.
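The deterministic tagger and the merge order can be illustrated with a few regular expressions. The patterns below are invented for this sketch and are not LendPathway's actual rules; only the tag names, the debit-only restriction, and the merge order come from the description above.

```python
import re

# Illustrative patterns only; the real rule set is not documented here.
PATTERNS = {
    "check": re.compile(r"\b(check|chk)\b", re.IGNORECASE),
    "wire": re.compile(r"\bwire\b", re.IGNORECASE),
    "peer_to_peer": re.compile(r"\b(zelle|venmo|cash app)\b", re.IGNORECASE),
    "stop_payment": re.compile(r"\bstop payment\b", re.IGNORECASE),
    "nsf": re.compile(r"\bnsf\b|insufficient funds", re.IGNORECASE),
    "overdraft": re.compile(r"\boverdraft\b", re.IGNORECASE),
}
DEBIT_ONLY = {"nsf", "overdraft"}


def deterministic_tags(description, txn_type):
    """Return every pattern tag that matches; NSF and overdraft only on debits."""
    tags = []
    for tag, pattern in PATTERNS.items():
        if tag in DEBIT_ONLY and txn_type != "debit":
            continue
        if pattern.search(description):
            tags.append(tag)
    return tags


def merge_tags(loan_tag, core_tags, pattern_tags):
    """Merge in the documented order: loan tag first, then core, then deterministic."""
    merged = [loan_tag] if loan_tag else []
    merged.extend(core_tags)
    merged.extend(pattern_tags)
    return merged


tags = merge_tags(
    "merchant_cash_advance",
    [],
    deterministic_tags("WIRE OUT PRIME FUNDING LLC", "debit"),
)
print(tags)  # ['merchant_cash_advance', 'wire']
```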

Step 8 — Position Detection

Positions are detected from the loan-tagged transactions. There are two methods depending on loan type:

MCA Positions (AI-based): For Merchant Cash Advance transactions, an AI model matches transaction groups to known funders from your org’s funder registry. Each position gets a funder name, loan type, and the set of transaction IDs that belong to it. Funders from your registry include metadata like favicon, contact info, and website.

Other Loan Positions (algorithmic): For all other loan types (Bank Loan, Factoring, Auto, Lease, Mortgage, Debt Collection, Buy Now Pay Later), positions are detected using text similarity clustering. Transaction descriptions are compared using TF-IDF (a term-weighting scheme used to measure text similarity) and grouped into clusters. Each cluster becomes a position.
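To make the clustering idea concrete, here is a small standard-library sketch of TF-IDF vectors with cosine similarity and a greedy threshold grouping. The tokenizer, the smoothed IDF variant, the 0.5 threshold, and the greedy grouping strategy are all assumptions for illustration; the page does not specify which clustering algorithm or parameters LendPathway uses.

```python
import math
import re
from collections import Counter


def tfidf_vectors(descriptions):
    """Return one L2-normalized TF-IDF vector (term -> weight) per description."""
    docs = [Counter(re.findall(r"[a-z]+", d.lower())) for d in descriptions]
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
        vec = {t: tf * (math.log((1 + n) / (1 + doc_freq[t])) + 1) for t, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two already-normalized sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def cluster(descriptions, threshold=0.5):
    """Greedily group descriptions whose similarity to a cluster's first member
    meets the threshold. Returns lists of indices into `descriptions`."""
    vectors = tfidf_vectors(descriptions)
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


descriptions = [
    "ACH DEBIT FIRST NATIONAL AUTO LOAN PMT 0012",
    "ACH DEBIT FIRST NATIONAL AUTO LOAN PMT 0013",
    "LEASECORP EQUIPMENT LEASE PAYMENT",
    "LEASECORP EQUIPMENT LEASE PAYMENT REF 88",
]
print(cluster(descriptions))  # [[0, 1], [2, 3]]
```

In this toy input each resulting cluster would correspond to one position: the auto loan payments and the equipment lease payments.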

Step 9 — Background Analysis

Two background tasks run during the pipeline and are collected at the end:

AI Deep Research: Started immediately after account metadata extraction (Step 1). Uses the extracted business name, address, phone, and principal names to search the web and verify the business’s legitimacy. Runs in the background during the entire rest of the pipeline. The result is the “AI Deep Research” card on the Synopsis page.

Tampering Analysis: Started after reconciliation (Step 6). Examines the PDF metadata of every uploaded document (producer, creator application, creation dates, modification dates) and looks for signs of fabrication or programmatic generation (e.g. all PDFs having identical metadata, timestamps that are impossibly close together, or creation tools not typically used by banks). Runs in the background during tagging and position detection. The result is the “Tampering Analysis” card on the Synopsis page.

Both tasks are best-effort. If either one fails, the parse still completes normally.
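The best-effort behavior can be modeled with asyncio tasks that are started mid-pipeline and collected at the end with exceptions captured rather than raised. The function names and the result shape below are hypothetical; the sketch only shows the pattern of starting work early and tolerating its failure.

```python
import asyncio


async def parse_with_background(deep_research, tampering_analysis):
    """Start both background coroutines at their documented points and collect
    them at the end. A failure in either one does not fail the parse."""
    # Kicked off right after account metadata extraction (Step 1).
    research_task = asyncio.create_task(deep_research())
    await asyncio.sleep(0)  # placeholder for Steps 2 through 6
    # Kicked off after reconciliation (Step 6).
    tampering_task = asyncio.create_task(tampering_analysis())
    await asyncio.sleep(0)  # placeholder for Steps 7 and 8
    # return_exceptions=True hands back exceptions as values instead of raising.
    research, tampering = await asyncio.gather(
        research_task, tampering_task, return_exceptions=True
    )
    return {
        "status": "complete",
        "deep_research": None if isinstance(research, Exception) else research,
        "tampering_analysis": None if isinstance(tampering, Exception) else tampering,
    }
```

If the research coroutine raises, the result still has status "complete" and simply carries no Deep Research payload.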

Key Concepts

Book: A container for one deal or submission. A book holds one or more uploaded documents and the parsed results. When you upload files and click Parse, you’re parsing a book.

Document: A single uploaded PDF file. Gets classified into a document type (bank statement, credit report, etc.) during parsing.

Ledger: One bank account within one statement period. A single PDF can produce multiple ledgers if it contains data for multiple accounts. Each ledger has a starting balance, ending balance, and a list of transactions. Reconciliation happens at the ledger level, so each ledger is independently verified.

Account: A bank account that spans across statement periods. After parsing, LendPathway merges all ledgers for the same account into a single unified transaction history. If you upload 12 monthly statements for the same checking account, you get 12 ledgers but 1 account.

Position: A detected debt relationship with a specific lender. For example, if the parser identifies regular payments to “Prime Funding LLC,” it creates a position grouping those transactions together with a funder name, loan type, total disbursed, and total paid.

Tag: A label applied to a transaction that identifies what type of activity it represents. Tags are applied automatically during parsing and can be manually edited afterward. A transaction can have multiple tags (e.g. a wire payment to a lender could be tagged both “Wire” and “Merchant Cash Advance”).

Reconciliation: The process of mathematically verifying that extracted transactions match the bank’s own records. Starting balance plus the sum of all credits minus the sum of all debits should equal the ending balance. A reconciled ledger means the computed ending balance is within $0.05 of what the bank reported.
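The relationship between ledgers and accounts can be shown with a short grouping function. The dictionary keys and the `merge_ledgers` helper are assumptions for this sketch; it orders ledgers by statement start date and concatenates their transactions per account.

```python
from collections import defaultdict


def merge_ledgers(ledgers):
    """Combine per-statement ledgers into one transaction history per account.

    Each ledger is assumed to be a dict with "account_id", "start" (ISO date
    string), and "transactions" (already in statement order).
    """
    accounts = defaultdict(list)
    for ledger in sorted(ledgers, key=lambda item: item["start"]):
        accounts[ledger["account_id"]].extend(ledger["transactions"])
    return dict(accounts)


# Twelve monthly ledgers for one checking account, supplied out of order.
ledgers = [
    {"account_id": "chk-1", "start": f"2024-{month:02d}-01", "transactions": [f"txn-{month}"]}
    for month in range(12, 0, -1)
]
history = merge_ledgers(ledgers)
print(len(ledgers), len(history))  # 12 1
```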
