System Architecture
BankVision processes unstructured financial documents by applying Optical Character Recognition (OCR) to extract text, generating embeddings for text and images, and using Convolutional Neural Networks (CNNs) for template recognition.
Large Language Models (LLMs)—advanced neural networks trained on extensive text corpora—and Vision Transformers interpret context and layout across pages.
By coordinating these components, BankVision maps the extracted content into consistent JSON structures, capturing everything from metadata and transaction details to risk metrics.
This multi-model pipeline adapts to varied statement formats, providing validated outputs for underwriting, compliance, and analytics.
The diagram gives a high-level view of how BankVision processes financial documents—extracting text, mapping structure, verifying entities, and analyzing risk. It simplifies some interactions but captures the core pipeline.
1. Extracting Metadata
Before analyzing transactions, BankVision identifies key metadata like business name, account number, and statement period.
How it works:
- Page embeddings identify document headers and structured fields.
- LLMs classify extracted text, distinguishing between similar terms (e.g., “Balance” vs. “Amount Due”).
- Cross-document validation ensures consistency across multiple statements.
Extracted Metadata
Key Processing Steps
- OCR extracts raw text from statement headers.
- Embeddings locate metadata fields across different formats.
- LLMs classify and validate metadata, ensuring accuracy.
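As a rough illustration of these steps, the sketch below takes OCR'd header text and pulls out candidate fields. It substitutes simple regular expressions for the embedding and LLM stages, and the header text, field names, and patterns are all hypothetical.

```python
import re

# Hypothetical header text as produced by the OCR step
header_text = (
    "ACME LLC\n"
    "Account Number: 000123456789\n"
    "Statement Period: 01/01/2024 - 01/31/2024"
)

# Pattern-based field location stands in for embedding lookup + LLM classification
patterns = {
    "account_number": r"Account Number:\s*(\d+)",
    "statement_period": r"Statement Period:\s*([\d/]+\s*-\s*[\d/]+)",
}

metadata = {"business_name": header_text.splitlines()[0]}
for field, pattern in patterns.items():
    match = re.search(pattern, header_text)
    if match:
        metadata[field] = match.group(1)

print(metadata)
```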
2. Parsing and Sorting Transactions
Once metadata is extracted, BankVision processes transaction tables by analyzing row structures, merchant names, and numerical patterns.
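A minimal sketch of that row-level parsing, assuming each OCR'd row carries a date, a free-text description, and a trailing signed amount; real statements require the layout-aware models described above.

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    date: str
    description: str
    amount: float

# Hypothetical rows recovered from a statement's transaction table
rows = [
    "01/07 CARD PURCHASE OFFICE SUPPLY -86.12",
    "01/05 ACH DEPOSIT PAYROLL 2,450.00",
]

def parse_row(row: str) -> Transaction:
    """Split a row into date, merchant description, and signed amount."""
    date, *description, amount = row.split()
    return Transaction(date, " ".join(description), float(amount.replace(",", "")))

# Sort chronologically; zero-padded MM/DD strings order lexicographically within a year
transactions = sorted((parse_row(r) for r in rows), key=lambda t: t.date)
```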
3. Risk Analysis and Financial Patterns
Beyond basic transaction parsing, BankVision analyzes financial behavior to detect risks and provide deeper insights.
Risk Factors
- Cash Flow Ratio Calculation
- Volatility Scoring
- Revenue Consistency Tracking
- Sector Performance Benchmarking
Extracted Risk Insights
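The document does not pin down the exact formulas, so the sketch below shows one plausible reading of the first two risk factors; the monthly inflow and outflow figures are invented for illustration.

```python
import statistics

# Hypothetical monthly totals derived from parsed transactions
inflows = [12500.0, 11800.0, 13200.0]
outflows = [9800.0, 10400.0, 9100.0]

# Cash flow ratio: total inflows relative to total outflows
cash_flow_ratio = sum(inflows) / sum(outflows)

# Volatility score: coefficient of variation of monthly net cash flow
net = [i - o for i, o in zip(inflows, outflows)]
volatility_score = statistics.stdev(net) / statistics.mean(net)

print(f"cash_flow_ratio={cash_flow_ratio:.2f}, volatility_score={volatility_score:.2f}")
```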
Under the Hood: Advanced Agent Workflow & Permutation-Based Chunk Extraction
Financial documents are inherently complex. They exhibit a wide range of layouts, formats, and multi-modal content—combining text, tables, and graphics. Traditional parsers fail when faced with such diversity. Our solution leverages a multi-agent framework that processes documents in parallel, extracting structured insights from both textual and visual data.
This approach is motivated by several key challenges:
- Robust Metadata Extraction: Capturing core information (e.g., business names, account numbers, statement periods) reliably from diverse document formats.
- Multi-Modal Fusion: Integrating OCR-derived text with object detection outputs to accurately interpret document layouts.
- Permutation-Based Chunk Analysis: Generating multiple interpretations of document segments to cover every structural nuance.
- Iterative Consensus: Using LLM prompts and mathematically driven updates to reconcile agent outputs into a unified data model.
By combining these methods, our system creates a highly accurate and production-ready data model suitable for applications such as underwriting, compliance, and risk analysis.
1. Stage 1: Initial Metadata Extraction on Page 1
The first stage focuses on establishing a reliable baseline from Page 1. Two dedicated agents work in tandem:
- Agent A (Text Extraction): Uses OCR to extract raw text from Page 1.
- Agent B (Visual Verification): Applies object detection to capture visual elements (tables, headers, logos) from Page 1.
These outputs are fused by an embedding function $E$ to form the metadata vector:

$$M_1 = E(T_1, V_1)$$

where $T_1$ is the OCR text and $V_1$ the set of detected visual elements. This vector becomes the initial state vector for Page 1:

$$S_1^{(0)} = M_1$$
Example Pseudocode:
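A minimal sketch of this stage, with stubs standing in for the OCR engine, the object detector, and the embedding function $E$; names and shapes are illustrative.

```python
import numpy as np

def run_ocr(page_image) -> str:
    """Agent A: extract raw text T1 from Page 1 (stub for the OCR engine)."""
    return "ACME LLC | Account 000123456789 | Statement Period 01/2024"

def detect_objects(page_image) -> list:
    """Agent B: locate tables, headers, and logos V1 (stub for the detector)."""
    return [{"type": "header", "bbox": (0, 0, 612, 80)}]

def embed(text: str, objects: list, dim: int = 128) -> np.ndarray:
    """E(T1, V1): fuse text and visual signals into the metadata vector M1."""
    seed = abs(hash((text, len(objects)))) % 2**32
    return np.random.default_rng(seed).standard_normal(dim)

page1 = None                 # placeholder for the Page 1 image
T1 = run_ocr(page1)          # Agent A output
V1 = detect_objects(page1)   # Agent B output
M1 = embed(T1, V1)           # fused metadata vector M1
S1 = M1.copy()               # initial state vector S1 for Page 1
```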
This baseline is crucial as it sets the context for the entire document.
2. Stage 2: Structure Agents and Permutation-Based Chunk Extraction
Once Page 1 establishes the metadata baseline, specialized Structure Agents process the full document. Their goal is to capture the detailed structure by breaking the document into manageable “chunks” and exploring multiple candidate interpretations.
Chunk Extraction
The document is segmented into chunks based on:
- Sequential Segmentation: Dividing the document into contiguous blocks.
- Thematic Segmentation: Grouping text blocks with similar content.
- Hybrid Methods: Combining both sequential and thematic strategies.
Permutation Generation
For each segmentation strategy, multiple permutations are generated. Each permutation is a candidate interpretation of the document’s structure.
For each permutation $p$, a candidate state vector is computed:

$$S_p = E_c(C_p)$$

where $E_c$ embeds the chunk information $C_p$ into the same vector space as the metadata vector.
These candidate vectors are later reconciled via a consensus mechanism to capture the best representation of the document’s structure.
The consensus of these candidates is computed using a function that takes all candidate vectors $S_1$ through $S_P$ as input to produce the final state $S^*$:

$$S^* = \mathrm{Consensus}(S_1, S_2, \dots, S_P)$$
Example Pseudocode:
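The sketch below generates a few candidate segmentations, embeds each with a stand-in for $E_c$, and averages the candidates as a simple consensus function; the segmentation strategies are deliberately toy versions of the sequential, thematic, and hybrid methods.

```python
import numpy as np

def embed_chunks(chunks: list, dim: int = 128) -> np.ndarray:
    """E_c: embed one candidate chunk ordering into the shared vector space (stub)."""
    seed = abs(hash(tuple(chunks))) % 2**32
    return np.random.default_rng(seed).standard_normal(dim)

def generate_permutations(blocks: list) -> list:
    """Candidate interpretations: toy sequential, thematic, and hybrid orderings."""
    sequential = list(blocks)
    thematic = sorted(blocks)        # stand-in for content-based grouping
    hybrid = list(reversed(blocks))  # stand-in for a mixed strategy
    return [sequential, thematic, hybrid]

def consensus(candidates: list) -> np.ndarray:
    """Reconcile candidate vectors S_1..S_P; here a simple element-wise mean."""
    return np.mean(candidates, axis=0)

blocks = ["header", "transactions", "fees", "summary"]
candidates = [embed_chunks(p) for p in generate_permutations(blocks)]
S_star = consensus(candidates)       # final structural state S*
```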
3. Global Context Sharing & State Update
To achieve a unified understanding, agents share their state vectors. The global state vector at iteration $t$ is defined as:

$$G^{(t)} = \frac{1}{N} \sum_{i=1}^{N} S_i^{(t)}$$

where $N$ is the total number of agents. Each agent updates its state vector by integrating its local state with the global state using a weighted update rule:

$$S_i^{(t+1)} = \alpha\, S_i^{(t)} + (1 - \alpha)\, G^{(t)}$$

with $\alpha \in [0, 1]$ determining the influence of local versus global context.
Python Example:
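A minimal sketch of one synchronization round; the agent count, vector dimension, and $\alpha = 0.7$ are assumed values.

```python
import numpy as np

ALPHA = 0.7  # weight on the local state (assumed value)

def global_state(states: list) -> np.ndarray:
    """G(t): average the state vectors of all N agents."""
    return np.mean(states, axis=0)

def update_state(local: np.ndarray, global_vec: np.ndarray, alpha: float = ALPHA) -> np.ndarray:
    """Weighted update: S_i(t+1) = alpha * S_i(t) + (1 - alpha) * G(t)."""
    return alpha * local + (1 - alpha) * global_vec

# One synchronization round across three agents
states = [np.random.default_rng(i).standard_normal(128) for i in range(3)]
G = global_state(states)
states = [update_state(S, G) for S in states]
```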
4. Iterative Consensus Loop with Agent Prompts and Internal Structure Parsing
A single pass rarely captures the full complexity of a document, so agents enter an iterative consensus loop in which they refine their state vectors over successive rounds.
Role of Prompts and Internal Parsing
Each agent constructs a prompt that combines:
- Local State $S_i^{(t)}$: The agent’s current vector.
- Global Context $G^{(t)}$: The aggregated state from all agents.
- Expert Corpus $K$: A collection of expert-defined schemas for transaction categorization and document structure.
The prompt is expressed as:

$$P_i^{(t)} = f\big(S_i^{(t)}, G^{(t)}, K\big)$$
A typical prompt might be:
“Using the current metadata vector $S_i^{(t)}$ and global context $G^{(t)}$, refine the metadata extraction for page $i$. Refer to the corpus $K$ for standardized rules (e.g., mapping ‘withdrawal’ to the correct credit/debit category).”
The LLM processes the prompt and returns a refined proposal. This is parsed into a delta state vector $\Delta S_i^{(t)}$, which captures adjustments based on expert rules. The agent then updates its state vector:

$$S_i^{(t+1)} = \alpha\, S_i^{(t)} + (1 - \alpha)\, G^{(t)} + \Delta S_i^{(t)}$$
Detailed Iterative Loop (Python Pseudocode)
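A sketch of the loop, assuming the hyperparameters shown and a stub in place of the real LLM call; build_prompt and llm_refine are hypothetical helpers.

```python
import numpy as np

ALPHA, MAX_ITERS, TOL = 0.7, 10, 1e-4  # assumed hyperparameters

def build_prompt(local: np.ndarray, global_vec: np.ndarray, corpus: list) -> str:
    """f(S_i, G, K): serialize local state, global context, and expert rules."""
    return f"local={local[:3]}, global={global_vec[:3]}, rules={corpus}"

def llm_refine(prompt: str, dim: int) -> np.ndarray:
    """Stand-in for the LLM call; returns a small delta state vector."""
    seed = abs(hash(prompt)) % 2**32
    return 0.01 * np.random.default_rng(seed).standard_normal(dim)

def consensus_loop(states: list, corpus: list) -> np.ndarray:
    for _ in range(MAX_ITERS):
        G = np.mean(states, axis=0)  # global state G(t)
        new_states = [
            ALPHA * S + (1 - ALPHA) * G + llm_refine(build_prompt(S, G, corpus), S.size)
            for S in states
        ]
        shift = max(np.linalg.norm(n - o) for n, o in zip(new_states, states))
        states = new_states
        if shift < TOL:
            break  # converged
    return np.mean(states, axis=0)  # unified state S*

corpus = ["map 'withdrawal' to the debit category"]
states = [np.random.default_rng(i).standard_normal(128) for i in range(4)]
S_star = consensus_loop(states, corpus)
```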
This iterative process ensures that each agent not only leverages its own extraction but also learns from the collective insight of all agents.
5. Unified Data Model Generation and the Intermediary JSON
Once the consensus loop converges, the final unified state vector $S^*$ represents the best approximation of the document’s overall structure and metadata. This state is then transformed into an intermediary JSON structure using a deterministic function $D$:

$$J = D(S^*)$$
Mapping to the Intermediary JSON
The transformation function $D$ decodes the high-dimensional state vector into a structured JSON format. The intermediary JSON schema is rigorously defined to include:
- Metadata:
  - business_name
  - statement_month
  - account_number
  - starting_balance
  - ending_balance
- Transactions: an array of objects, each containing:
  - date
  - description
  - amount
  - method
  - category
  - is_recurring
  - merchant_id
- Risk Analysis: fields such as:
  - cash_flow_ratio
  - volatility_score
  - additional risk factors (e.g., revenue_consistency, sector_performance, growth_trend)
This JSON serves as a robust, standardized data model ready for downstream applications.
Final Transformation Example:
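A sketch of the decoding step, assuming a hypothetical decode_state in place of the real $D$; the field values are placeholders keyed to the schema above, not actual decoder output.

```python
import json
import numpy as np

def decode_state(S_star: np.ndarray) -> dict:
    """D(S*): deterministically decode the unified state into the intermediary JSON.
    Values below are illustrative placeholders, not real decoder output."""
    return {
        "metadata": {
            "business_name": "ACME LLC",
            "statement_month": "2024-01",
            "account_number": "****6789",
            "starting_balance": 12000.00,
            "ending_balance": 13450.25,
        },
        "transactions": [
            {
                "date": "2024-01-05",
                "description": "ACH DEPOSIT PAYROLL",
                "amount": 2450.00,
                "method": "ach",
                "category": "revenue",
                "is_recurring": True,
                "merchant_id": "m-0042",
            }
        ],
        "risk_analysis": {
            "cash_flow_ratio": 1.28,
            "volatility_score": 0.49,
            "revenue_consistency": 0.92,
            "sector_performance": 0.75,
            "growth_trend": "stable",
        },
    }

print(json.dumps(decode_state(np.zeros(128)), indent=2))
```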
Deep Dive: Mathematical Underpinnings and Information Flow
Mathematical Motivation
The use of state vectors and iterative weighted updates allows us to treat the document as a dynamic system where each page contributes to a shared global representation. The equation:

$$S_i^{(t+1)} = \alpha\, S_i^{(t)} + (1 - \alpha)\, G^{(t)}$$

ensures that while each agent retains its unique local insights, it also converges toward a consensus that reflects the overall document context.
Information Flow via Prompts
Prompts serve as a vehicle for transferring nuanced information from the embedded state vectors to the LLM. Mathematically, each prompt is a function:

$$P_i^{(t)} = f\big(S_i^{(t)}, G^{(t)}, K\big)$$

where:

- $S_i^{(t)}$ provides the agent’s local perspective.
- $G^{(t)}$ conveys the aggregated global context.
- $K$ injects expert knowledge, guiding the LLM to produce refined outputs.
The LLM’s response, once parsed and embedded into $\Delta S_i^{(t)}$, directly influences the next state vector, ensuring that each iteration is informed by both raw data and expert validation.
Conclusion: The Power of Our Multi-Agent Framework
Our detailed multi-agent approach is designed to tackle the complexity of financial documents through:
- Robust Baseline Extraction: Two dedicated agents process Page 1 to establish a reliable metadata baseline.
- Permutation-Based Chunk Analysis: Structure agents generate multiple candidate interpretations by exploring various segmentation strategies, ensuring no nuance is overlooked.
- Global Consistency via Iterative Consensus: Sharing state vectors and refining them iteratively through LLM prompts leads to a consistent, unified representation.
- Mathematical Rigor: The weighted update mechanisms and consensus equations provide a quantitative foundation for merging diverse data sources.
- Intermediary JSON Transformation: The final unified state is mapped into a standardized JSON format, bridging the gap between raw extraction and actionable insights.
This comprehensive, mathematically grounded workflow ensures that BankVision delivers precise, consistent, and production-ready data models for even the most challenging financial documents.