BankVision processes unstructured financial documents by applying Optical Character Recognition (OCR) to extract text, generating embeddings for text and images, and using Convolutional Neural Networks (CNNs) for template recognition.
Large Language Models (LLMs)—advanced neural networks trained on extensive text corpora—and Vision Transformers interpret context and layout across pages.
By coordinating these components, BankVision maps the extracted content into consistent JSON structures, capturing everything from metadata and transaction details to risk metrics.
This multi-model pipeline adapts to varied statement formats, providing validated outputs for underwriting, compliance, and analytics.
The diagram gives a high-level view of how BankVision processes financial documents—extracting text, mapping structure, verifying entities, and analyzing risk. It simplifies some interactions but captures the core pipeline.
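As a rough illustration of how these components compose, the sketch below chains trivial stand-in functions for OCR, region detection, embedding, and layout interpretation into a single pass. Every name and return shape here is a placeholder invented for this example, not BankVision's actual interface.

```python
# Toy sketch of the stage composition described above. Each stage is a trivial
# stub so the pipeline shape runs end to end; none of this is the real API.

def run_ocr(page):           return page.get("text", "")        # OCR stand-in
def detect_regions(page):    return page.get("regions", [])     # CNN template/region stand-in
def embed_text(text):        return [float(len(text))]          # toy text embedding
def interpret_layout(t, r):  return {"n_regions": len(r)}       # LLM/ViT layout reasoning stand-in

def process_statement(pages):
    results = []
    for page in pages:
        text = run_ocr(page)
        regions = detect_regions(page)
        results.append({
            "text_vec": embed_text(text),
            "regions": regions,
            "context": interpret_layout(text, regions),
        })
    return {"pages": results}   # in the real pipeline this becomes the validated JSON output

print(process_statement([{"text": "ACME LLC Statement Jan 2024", "regions": ["header", "table"]}]))
```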
Under the Hood: Advanced Agent Workflow & Permutation-Based Chunk Extraction
Financial documents are inherently complex. They exhibit a wide range of layouts, formats, and multi-modal content—combining text, tables, and graphics. Traditional parsers fail when faced with such diversity. Our solution leverages a multi-agent framework that processes documents in parallel, extracting structured insights from both textual and visual data.
This approach is motivated by several key challenges:
Robust Metadata Extraction: Capturing core information (e.g., business names, account numbers, statement periods) reliably from diverse document formats.
Multi-Modal Fusion: Integrating OCR-derived text with object detection outputs to accurately interpret document layouts.
Permutation-Based Chunk Analysis: Generating multiple interpretations of document segments to cover every structural nuance.
Iterative Consensus: Using LLM prompts and mathematically driven updates to reconcile agent outputs into a unified data model.
By combining these methods, our system creates a highly accurate and production-ready data model suitable for applications such as underwriting, compliance, and risk analysis.
1. Stage 1: Baseline Metadata Extraction from Page 1
The first stage focuses on establishing a reliable baseline from Page 1. Two dedicated agents work in tandem:
Agent A (Text Extraction): Uses OCR to extract raw text t_1 from Page 1.
Agent B (Visual Verification): Applies object detection to capture visual elements o_1 (tables, headers, logos) from Page 1.
These outputs are fused by an embedding function E to form the metadata vector:
m_1 = E(t_1, o_1)
This vector becomes the initial state vector for Page 1:
s_1^(0) = m_1
Example Pseudocode:
```
for each page in pages:
    if page.id == 1:
        text = OCR(page)                              // Extract raw text from Page 1
        objects = ObjectDetect(page)                  // Detect visual elements on Page 1
        metadata = extract_metadata(text, objects)    // Combine text and visual cues
        state_vector = embed(metadata)                // Map metadata to a vector
        agent = {
            "page_number": page.id,
            "raw_text": text,
            "detected_objects": objects,
            "metadata": metadata,
            "state_vector": state_vector,             // s_1^(0)
            "history": [state_vector]                 // Record initial state
        }
        agents[page.id] = agent
```
This baseline is crucial as it sets the context for the entire document.
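The fusion step E is only described abstractly above. As a minimal sketch (assuming a hashing-based bag-of-tokens text embedding, a count vector over detected object types, and L2 normalization), one possible shape of E is:

```python
import numpy as np

OBJECT_TYPES = ["table", "header", "logo"]   # illustrative object classes

def embed_text(text, dim=16):
    """Toy text embedding: hash each token into a fixed-size bucket vector."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec

def embed_objects(objects):
    """Count detected objects per type (the o_1 signal above)."""
    counts = np.zeros(len(OBJECT_TYPES))
    for obj in objects:
        if obj in OBJECT_TYPES:
            counts[OBJECT_TYPES.index(obj)] += 1.0
    return counts

def fuse(text, objects):
    """E(t_1, o_1): concatenate text and visual features, then L2-normalize."""
    m = np.concatenate([embed_text(text), embed_objects(objects)])
    norm = np.linalg.norm(m)
    return m / norm if norm > 0 else m

# s_1^(0) = m_1: the fused vector seeds Page 1's initial state
s1_0 = fuse("ACME LLC Statement Period Jan 1 - Jan 31", ["header", "table", "logo"])
```

In production one would expect learned text and vision encoders in place of these toy functions; the point is simply that textual and visual cues land in a single vector m_1, which then seeds s_1^(0).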
2. Stage 2: Structure Agents and Permutation-Based Chunk Extraction
Once Page 1 establishes the metadata baseline, specialized Structure Agents process the full document. Their goal is to capture the detailed structure by breaking the document into manageable “chunks” and exploring multiple candidate interpretations.
For each segmentation strategy, multiple permutations P_j are generated. Each permutation is a candidate interpretation of the document’s structure.
For each permutation P_j, a candidate state vector is computed:
c_j = E(P_j)
where E embeds the chunk information into the same vector space.
These candidate vectors are later reconciled via a consensus mechanism to capture the best representation of the document’s structure.
The consensus of these candidates is computed using a function C that takes all candidate vectors as input and produces the final state:
S = C(c_1, c_2, …, c_M)
Example Pseudocode:
```
chunks = extract_chunks(document)
permutations = generate_permutations(chunks)      // Generate multiple candidate segmentations
candidate_states = []
for each permutation P_j in permutations:
    candidate_state = embed(P_j)                  // Embed each permutation into a vector
    candidate_states.append(candidate_state)
```
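The consensus function C above is left abstract. One plausible reading, sketched below, is a similarity-weighted average in which each candidate c_j is weighted by how strongly it agrees with the other candidates. This is an assumption made for illustration, not necessarily the exact rule used in production.

```python
import numpy as np

def consensus(candidate_states):
    """C(c_1, ..., c_M): weight each candidate by its mean cosine similarity
    to the other candidates, then return the weighted average as S."""
    C = np.vstack(candidate_states)                        # shape (M, d)
    normed = C / (np.linalg.norm(C, axis=1, keepdims=True) + 1e-9)
    sims = normed @ normed.T                               # pairwise cosine similarities
    np.fill_diagonal(sims, 0.0)
    weights = sims.mean(axis=1)                            # agreement score per candidate
    weights = weights / (weights.sum() + 1e-9)
    return weights @ C                                     # final state S

# Outlier interpretations (the third candidate) receive little weight in S
S = consensus([np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.0, 1.0])])
```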
4. Iterative Consensus Loop with Agent Prompts and Internal Structure Parsing
A single pass rarely captures the full complexity of a document. Therefore, agents enter an iterative consensus loop where they refine their state vectors through multiple iterations.
Each agent constructs a prompt P_i^(t) that combines:
Local State s_i^(t): The agent’s current vector.
Global Context S^(t): The aggregated state from all agents.
Expert Corpus C: A collection of expert-defined schemas for transaction categorization and document structure.
The prompt is expressed as:
P_i^(t) = f(s_i^(t), S^(t), C)
A typical prompt might be:
“Using the current metadata vector s_i^(t) and global context S^(t), refine the metadata extraction for page i. Refer to the corpus C for standardized rules (e.g., mapping ‘withdrawal’ to the correct credit/debit category).”
The LLM processes the prompt and returns a refined proposal. This is parsed into a delta state vector Δs_i^(t), which captures adjustments based on expert rules.
Example Pseudocode:
```
while not converged and iteration < MAX_ITER:
    global_state = compute_global_state(agents)
    for agent in agents:
        # Construct a prompt that includes local state, global context, and the expert corpus
        prompt = construct_prompt(
            agent.state_vector,
            global_state,
            agent.page_number,
            corpus=corpus
        )
        # LLM processes the prompt and returns a refined metadata proposal
        llm_response = LLM_generate(prompt)
        # Parse the LLM response using internal structure parsing against the corpus
        delta_state = embed(parse_with_corpus(llm_response, corpus=corpus))
        # Update the agent's state using a weighted combination of its current state and the delta
        new_state = alpha * agent.state_vector + (1 - alpha) * delta_state
        if validate(new_state):                   # Perform local sanity checks
            agent.history.append(new_state)
            agent.state_vector = new_state
    # Merge all updated state vectors to compute a consensus score
    final_model, consensus_score = consensus_merge(
        [agent.state_vector for agent in agents]
    )
    if consensus_score >= threshold:
        converged = True
    iteration += 1
```
This iterative process ensures that each agent not only leverages its own extraction but also learns from the collective insight of all agents.
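The loop above relies on construct_prompt and parse_with_corpus without showing them. A minimal sketch of what they might look like follows; the prompt wording and the line-filtering parse are illustrative assumptions, not the production implementations.

```python
def construct_prompt(local_state, global_state, page_number, corpus):
    """Render s_i^(t), S^(t), and the expert corpus C into a single LLM prompt."""
    rules = "\n".join(f"- {rule}" for rule in corpus)
    return (
        f"You are refining metadata for page {page_number} of a bank statement.\n"
        f"Current local state vector (s_i): {list(local_state)}\n"
        f"Aggregated global state vector (S): {list(global_state)}\n"
        f"Apply these standardized rules:\n{rules}\n"
        "Return the corrected metadata fields as JSON."
    )

def parse_with_corpus(llm_response, corpus):
    """Very rough stand-in: keep only response lines that mention a corpus term."""
    terms = {rule.split()[0].lower() for rule in corpus if rule.strip()}
    return [line for line in llm_response.splitlines()
            if any(term in line.lower() for term in terms)]
```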
5. Unified Data Model Generation and the Intermediary JSON
Once the consensus loop converges, the final unified state vector S* represents the best approximation of the document’s overall structure and metadata. This state is then transformed into an intermediary JSON structure using a deterministic function F:
JSON = F(S*)
The transformation function F decodes the high-dimensional state vector into a structured JSON format. The intermediary JSON schema is rigorously defined to include:
Metadata:
business_name
statement_month
account_number
starting_balance
ending_balance
Transactions:
An array of objects, one per transaction.
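To make the schema concrete, here is a hypothetical instance of the intermediary JSON, written as a Python dict. The metadata keys come from the list above; the per-transaction keys (date, description, amount, type) are illustrative placeholders, since the exact transaction fields are not enumerated here.

```python
# Hypothetical intermediary JSON instance. Metadata keys match the schema above;
# transaction keys and all values are illustrative placeholders only.
intermediary_json = {
    "metadata": {
        "business_name": "ACME LLC",
        "statement_month": "2024-01",
        "account_number": "****1234",
        "starting_balance": 12500.00,
        "ending_balance": 13975.42,
    },
    "transactions": [
        {
            "date": "2024-01-05",                            # illustrative field
            "description": "ACH DEPOSIT - CUSTOMER PAYMENT",
            "amount": 2500.00,
            "type": "credit",                                # e.g. credit/debit per corpus rules
        },
    ],
}
```

This is the structure that downstream underwriting, compliance, and analytics consumers receive.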
The use of state vectors and iterative weighted updates allows us to treat the document as a dynamic system where each page contributes to a shared global representation. The equation:
s_i^(t+1) = α s_i^(t) + (1 − α) S^(t)
ensures that while each agent retains its unique local insights, it also converges toward a consensus that reflects the overall document context.
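A quick numeric illustration of this update (assuming α = 0.7 and, for simplicity, a global state held fixed): at each step the local state moves a fraction 1 − α of the remaining distance toward S^(t), so repeated iterations pull every agent toward the shared context while damping any single noisy extraction.

```python
import numpy as np

alpha = 0.7                       # weight on the agent's own state (assumed value)
s_i = np.array([1.0, 0.0])        # agent's local state s_i^(0)
S = np.array([0.4, 0.6])          # global state S^(t), held fixed for illustration

for t in range(5):
    s_i = alpha * s_i + (1 - alpha) * S
    print(t, s_i.round(3))
# The local vector converges geometrically toward S, at rate alpha per step.
```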
Prompts serve as a vehicle for transferring nuanced information from the embedded state vectors to the LLM. Mathematically, each prompt P_i^(t) is a function:
P_i^(t) = f(s_i^(t), S^(t), C)
where:
s_i^(t) provides the agent’s local perspective.
S^(t) conveys the aggregated global context.
C injects expert knowledge, guiding the LLM to produce refined outputs.
The LLM’s response, once parsed and embedded into Δs_i^(t), directly influences the next state vector, ensuring that each iteration is informed by both raw data and expert validation.
Conclusion: The Power of Our Multi-Agent Framework
Our detailed multi-agent approach is designed to tackle the complexity of financial documents through:
Robust Baseline Extraction:
Two dedicated agents process Page 1 to establish a reliable metadata baseline.
Permutation-Based Chunk Analysis:
Structure agents generate multiple candidate interpretations by exploring various segmentation strategies, ensuring no nuance is overlooked.
Global Consistency via Iterative Consensus:
Sharing state vectors and refining them iteratively through LLM prompts leads to a consistent, unified representation.
Mathematical Rigor:
The weighted update mechanisms and consensus equations provide a quantitative foundation for merging diverse data sources.
Intermediary JSON Transformation:
The final unified state is mapped into a standardized JSON format, bridging the gap between raw extraction and actionable insights.
This comprehensive, mathematically grounded workflow ensures that BankVision delivers precise, consistent, and production-ready data models for even the most challenging financial documents.