BankVision processes unstructured financial documents by applying Optical Character Recognition (OCR) to extract text, generating embeddings for text and images, and using Convolutional Neural Networks (CNNs) for template recognition.

Large Language Models (LLMs)—advanced neural networks trained on extensive text corpora—and Vision Transformers interpret context and layout across pages.

By coordinating these components, BankVision maps the extracted content into consistent JSON structures, capturing everything from metadata and transaction details to risk metrics.

This multi-model pipeline adapts to varied statement formats, providing validated outputs for underwriting, compliance, and analytics.

The diagram gives a high-level view of how BankVision processes financial documents—extracting text, mapping structure, verifying entities, and analyzing risk. It simplifies some interactions but captures the core pipeline.


1. Extracting Metadata

Before analyzing transactions, BankVision identifies key metadata like business name, account number, and statement period.

How it works:

  • Page embeddings identify document headers and structured fields.
  • LLMs classify extracted text, distinguishing between similar terms (e.g., “Balance” vs. “Amount Due”).
  • Cross-document validation ensures consistency across multiple statements.

Extracted Metadata

{
  "business_name": "Acme Corporation",
  "statement_month": "February",
  "account_number": "XXXX4567",
  "starting_balance": 125000.00,
  "ending_balance": 167250.75
}

Key Processing Steps

  • OCR extracts raw text from statement headers.
  • Embeddings locate metadata fields across different formats.
  • LLMs classify and validate metadata, ensuring accuracy.
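As a rough illustration of these steps, the snippet below substitutes simple regular expressions for the embedding and LLM stages, mapping raw header text into the metadata keys shown above. It is a minimal sketch rather than the production extraction logic; the helper name and patterns are illustrative.

Python Example (illustrative):

import re

def extract_header_metadata(header_text: str) -> dict:
    """Simplified stand-in for the embedding + LLM pipeline: pull common
    header fields from raw OCR text with regular expressions."""
    patterns = {
        "business_name": r"^(?P<value>[A-Z][A-Za-z&.,' ]+)\s*$",
        "account_number": r"Account(?:\s+Number)?[:#]?\s*(?P<value>[X\d-]{4,})",
        "statement_month": r"Statement\s+Period[:]?\s*(?P<value>[A-Za-z]+)",
    }
    metadata = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, header_text, flags=re.MULTILINE)
        if match:
            metadata[field] = match.group("value").strip()
    return metadata

header = """Acme Corporation
Account Number: XXXX4567
Statement Period: February 2024"""
print(extract_header_metadata(header))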

2. Parsing and Sorting Transactions

Once metadata is extracted, BankVision processes transaction tables by analyzing row structures, merchant names, and numerical patterns.
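In production this step relies on layout analysis and LLM-based categorization; the sketch below only illustrates the shape of the problem, turning a raw table row into the transaction fields used later in the intermediary JSON. The delimiter, date format, and keyword rules are assumptions for the example.

Python Example (illustrative):

from datetime import datetime

# Placeholder keyword rules; the real system categorizes merchants with
# embeddings and LLM classification rather than a lookup table.
CATEGORY_KEYWORDS = {
    "payroll": "payroll",
    "stripe": "revenue",
    "aws": "software",
    "rent": "rent",
}

def parse_transaction_row(row: str) -> dict:
    """Split a pipe-delimited OCR row into structured transaction fields."""
    date_str, description, amount_str = [part.strip() for part in row.split("|")]
    description_lower = description.lower()
    category = next(
        (cat for keyword, cat in CATEGORY_KEYWORDS.items() if keyword in description_lower),
        "uncategorized",
    )
    return {
        "date": datetime.strptime(date_str, "%m/%d/%Y").date().isoformat(),
        "description": description,
        "amount": float(amount_str.replace(",", "").replace("$", "")),
        "category": category,
    }

row = "02/14/2025 | STRIPE PAYOUT | $12,450.00"
print(parse_transaction_row(row))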


3. Risk Analysis and Financial Patterns

Beyond basic transaction parsing, BankVision analyzes financial behavior to detect risks and provide deeper insights.

Risk Factors

  • Cash Flow Ratio Calculation
  • Volatility Scoring
  • Revenue Consistency Tracking
  • Sector Performance Benchmarking

Extracted Risk Insights

{
  "risk_analysis": {
    "cash_flow_ratio": 1.51,
    "volatility_score": 0.15,
    "risk_factors": {
      "revenue_consistency": "high",
      "sector_performance": "above_average",
      "growth_trend": "positive"
    }
  }
}
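The exact formulas behind these metrics are not spelled out here; a common reading is total inflows over total outflows for the cash flow ratio and a normalized dispersion of monthly net flow for the volatility score. The sketch below uses those assumed definitions.

Python Example (illustrative):

import statistics

def cash_flow_ratio(inflows: list[float], outflows: list[float]) -> float:
    """Total deposits divided by total withdrawals (assumed definition)."""
    return round(sum(inflows) / sum(outflows), 2)

def volatility_score(monthly_net_flows: list[float]) -> float:
    """Coefficient-of-variation style score: std dev of monthly net flow
    relative to its mean (assumed definition, clamped to [0, 1])."""
    mean_flow = statistics.mean(monthly_net_flows)
    if mean_flow == 0:
        return 1.0
    score = statistics.pstdev(monthly_net_flows) / abs(mean_flow)
    return round(min(score, 1.0), 2)

inflows = [52000.0, 48500.0, 55750.0]
outflows = [34000.0, 33200.0, 36300.0]
net_flows = [i - o for i, o in zip(inflows, outflows)]

print(cash_flow_ratio(inflows, outflows))   # 1.51 for these example figures
print(volatility_score(net_flows))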

Under the Hood: Advanced Agent Workflow & Permutation-Based Chunk Extraction

Financial documents are inherently complex. They exhibit a wide range of layouts, formats, and multi-modal content—combining text, tables, and graphics. Traditional parsers fail when faced with such diversity. Our solution leverages a multi-agent framework that processes documents in parallel, extracting structured insights from both textual and visual data.

This approach is motivated by several key challenges:

  • Robust Metadata Extraction: Capturing core information (e.g., business names, account numbers, statement periods) reliably from diverse document formats.
  • Multi-Modal Fusion: Integrating OCR-derived text with object detection outputs to accurately interpret document layouts.
  • Permutation-Based Chunk Analysis: Generating multiple interpretations of document segments to cover every structural nuance.
  • Iterative Consensus: Using LLM prompts and mathematically driven updates to reconcile agent outputs into a unified data model.

By combining these methods, our system creates a highly accurate and production-ready data model suitable for applications such as underwriting, compliance, and risk analysis.


1. Stage 1: Initial Metadata Extraction on Page 1

The first stage focuses on establishing a reliable baseline from Page 1. Two dedicated agents work in tandem:

  • Agent A (Text Extraction): Uses OCR to extract raw text t_1 from Page 1.
  • Agent B (Visual Verification): Applies object detection to capture visual elements o_1 (tables, headers, logos) from Page 1.

These outputs are fused by an embedding function E to form the metadata vector:

m_1 = E(t_1, o_1)

This vector becomes the initial state vector for Page 1:

s_1^{(0)} = m_1

Example Pseudocode:

for each page in pages:
    if page.id == 1:
        text = OCR(page)                        // Extract raw text from Page 1
        objects = ObjectDetect(page)            // Detect visual elements on Page 1
        metadata = extract_metadata(text, objects)  // Combine text and visual cues
        state_vector = embed(metadata)          // Map metadata to a vector

        agent = {
            "page_number": page.id,
            "raw_text": text,
            "detected_objects": objects,
            "metadata": metadata,
            "state_vector": state_vector,        // s_1^(0)
            "history": [state_vector]            // Record initial state
        }
        agents[page.id] = agent

This baseline is crucial as it sets the context for the entire document.


2. Stage 2: Structure Agents and Permutation-Based Chunk Extraction

Once Page 1 establishes the metadata baseline, specialized Structure Agents process the full document. Their goal is to capture the detailed structure by breaking the document into manageable “chunks” and exploring multiple candidate interpretations.

Chunk Extraction

The document is segmented into chunks based on:

  • Sequential Segmentation: Dividing the document into contiguous blocks.
  • Thematic Segmentation: Grouping text blocks with similar content.
  • Hybrid Methods: Combining both sequential and thematic strategies.
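The snippet below sketches the first two strategies, with a toy hashed bag-of-words embedding standing in for the real model: fixed-size sequential chunks, and thematic chunks that start a new group whenever adjacent blocks look dissimilar. Thresholds and helper names are illustrative.

Python Example (illustrative):

import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for the real embedding model: hashed bag of words."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[sum(ord(ch) for ch in token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def sequential_chunks(blocks: list[str], size: int = 2) -> list[list[str]]:
    """Contiguous fixed-size groups of text blocks."""
    return [blocks[i:i + size] for i in range(0, len(blocks), size)]

def thematic_chunks(blocks: list[str], threshold: float = 0.3) -> list[list[str]]:
    """Start a new chunk whenever an adjacent block is dissimilar to the previous one."""
    chunks = [[blocks[0]]]
    for prev, curr in zip(blocks, blocks[1:]):
        similarity = float(np.dot(embed(prev), embed(curr)))
        if similarity >= threshold:
            chunks[-1].append(curr)
        else:
            chunks.append([curr])
    return chunks

blocks = [
    "Date Description Amount",
    "02/01 Stripe payout 12,450.00",
    "Daily ending balance summary",
    "02/01 125,000.00",
]
print(sequential_chunks(blocks))
print(thematic_chunks(blocks))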

Permutation Generation

For each segmentation strategy, multiple permutations P_j are generated. Each permutation is a candidate interpretation of the document’s structure.

For each permutation P_j, a candidate state vector is computed:

c_j = E(P_j)

where E embeds the chunk information into the same vector space.

These candidate vectors are later reconciled via a consensus mechanism to capture the best representation of the document’s structure.

The consensus of these candidates is computed using a function C that takes all candidate vectors c_1 through c_M as input to produce the final state S.

Example Pseudocode:

chunks = extract_chunks(document)
permutations = generate_permutations(chunks)  // Generate multiple candidate segmentations

candidate_states = []
for each permutation P_j in permutations:
    candidate_state = embed(P_j)             // Embed each permutation into a vector
    candidate_states.append(candidate_state)

The consensus of these candidates is computed as:

S = C(c_1, c_2, ..., c_M)
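The form of C is left open above; one plausible choice, consistent with reconciling candidates, is a similarity-weighted average in which candidates that agree with the majority contribute more, and the average pairwise similarity doubles as a consensus score. A minimal sketch under that assumption:

Python Example (illustrative):

import numpy as np

def consensus_merge(candidates: list[np.ndarray]) -> tuple[np.ndarray, float]:
    """Combine candidate state vectors into a single consensus vector S.

    Each candidate is weighted by its mean cosine similarity to the others,
    so outlier interpretations contribute less. Returns (S, consensus_score).
    This is one plausible form of C, not necessarily the production one.
    """
    stacked = np.vstack(candidates)
    normed = stacked / np.linalg.norm(stacked, axis=1, keepdims=True)
    similarity = normed @ normed.T                 # pairwise cosine similarities
    weights = similarity.mean(axis=1)
    weights = weights / weights.sum()
    consensus = weights @ stacked                  # weighted average of candidates
    score = float(similarity[np.triu_indices(len(candidates), k=1)].mean())
    return consensus, score

candidate_states = [np.array([1.0, 0.0, 0.2]),
                    np.array([0.9, 0.1, 0.1]),
                    np.array([0.2, 0.9, 0.0])]
S, score = consensus_merge(candidate_states)
print(S, score)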

3. Global Context Sharing & State Update

To achieve a unified understanding, agents share their state vectors. The global state vector at iteration t is defined as:

S^{(t)} = \frac{1}{N} \sum_{i=1}^{N} s_i^{(t)}

where N is the total number of agents.

Each agent updates its state vector by integrating its local state with the global state using a weighted update rule:

s_i^{(t+1)} = \alpha\, s_i^{(t)} + (1 - \alpha)\, S^{(t)}

with 0 < \alpha < 1 determining the influence of local versus global context.

Python Example:

import numpy as np

def update_state(agent_state, global_state, alpha=0.7):
    return alpha * agent_state + (1 - alpha) * global_state

def compute_global_state(agents):
    state_vectors = [agent["state_vector"] for agent in agents.values()]
    return np.mean(state_vectors, axis=0)

global_state = compute_global_state(agents)
for agent in agents.values():
    new_state = update_state(agent["state_vector"], global_state)
    agent["state_vector"] = new_state
    agent["history"].append(new_state)

4. Iterative Consensus Loop with Agent Prompts and Internal Structure Parsing

A single pass rarely captures the full complexity of a document. Therefore, agents enter an iterative consensus loop where they refine their state vectors through multiple iterations.

Role of Prompts and Internal Parsing

Each agent constructs a prompt P_i^{(t)} that combines:

  • Local State s_i^{(t)}: The agent’s current vector.
  • Global Context S^{(t)}: The aggregated state from all agents.
  • Expert Corpus \mathcal{C}: A collection of expert-defined schemas for transaction categorization and document structure.

The prompt is expressed as:

P_i^{(t)} = f\big(s_i^{(t)}, S^{(t)}, \mathcal{C}\big)

A typical prompt might be:

“Using the current metadata vector s_i^{(t)} and global context S^{(t)}, refine the metadata extraction for page i. Refer to the corpus \mathcal{C} for standardized rules (e.g., mapping ‘withdrawal’ to the correct credit/debit category).”

The LLM processes the prompt and returns a refined proposal. This is parsed into a delta state vector \Delta s_i^{(t)}, which captures adjustments based on expert rules.

The agent then updates its state vector:

s_i^{(t+1)} = \alpha\, s_i^{(t)} + (1 - \alpha)\, \Delta s_i^{(t)}
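The iterative loop below relies on a construct_prompt helper; one simple way to realize f is to serialize a summary of the local and global state vectors together with the relevant corpus rules into the prompt text. A minimal sketch, with the wording and corpus format assumed:

Python Example (illustrative):

import numpy as np

def construct_prompt(state_vector: np.ndarray,
                     global_state: np.ndarray,
                     page_number: int,
                     corpus: dict[str, str]) -> str:
    """Assemble P_i^(t) = f(s_i^(t), S^(t), C) as plain text for the LLM.

    The state vectors are summarized (first few dimensions) rather than dumped
    verbatim; the corpus is assumed to be a dict of expert rules per field.
    """
    rules = "\n".join(f"- {field}: {rule}" for field, rule in corpus.items())
    return (
        f"You are refining metadata extraction for page {page_number}.\n"
        f"Local state summary: {state_vector[:5].round(3).tolist()}\n"
        f"Global context summary: {global_state[:5].round(3).tolist()}\n"
        f"Standardized rules to apply:\n{rules}\n"
        "Return the refined metadata as JSON."
    )

corpus = {
    "withdrawal": "categorize as a debit, never a credit",
    "statement_month": "use the full month name (e.g. 'February')",
}
print(construct_prompt(np.random.rand(16), np.random.rand(16), page_number=1, corpus=corpus))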

Detailed Iterative Loop (Python Pseudocode)

converged = False
iteration = 0

while not converged and iteration < MAX_ITER:
    global_state = compute_global_state(agents)

    for agent in agents.values():
        # Construct a prompt that includes local state, global context, and the expert corpus
        prompt = construct_prompt(
            agent["state_vector"],
            global_state,
            agent["page_number"],
            corpus=corpus
        )

        # LLM processes the prompt and returns a refined metadata proposal
        llm_response = LLM_generate(prompt)

        # Parse the LLM response using internal structure parsing against the corpus
        delta_state = embed(parse_with_corpus(llm_response, corpus=corpus))

        # Update the agent's state using a weighted combination of its current state and the delta
        new_state = alpha * agent["state_vector"] + (1 - alpha) * delta_state

        if validate(new_state):  # Perform local sanity checks
            agent["history"].append(new_state)
            agent["state_vector"] = new_state

    # Merge all updated state vectors to compute a consensus score
    final_model, consensus_score = consensus_merge(
        [agent["state_vector"] for agent in agents.values()]
    )

    if consensus_score >= threshold:
        converged = True

    iteration += 1

This iterative process ensures that each agent not only leverages its own extraction but also learns from the collective insight of all agents.


5. Unified Data Model Generation and the Intermediary JSON

Once the consensus loop converges, the final unified state vector S^* represents the best approximation of the document’s overall structure and metadata. This state is then transformed into an intermediary JSON structure using a deterministic function F:

F(S^*) \rightarrow JSON

Mapping to the Intermediary JSON

The transformation function F decodes the high-dimensional state vector into a structured JSON format. The intermediary JSON schema is rigorously defined to include:

  • Metadata:
    • business_name
    • statement_month
    • account_number
    • starting_balance
    • ending_balance
  • Transactions:
    An array of objects, each containing:
    • date
    • description
    • amount
    • method
    • category
    • is_recurring
    • merchant_id
  • Risk Analysis:
    Fields such as:
    • cash_flow_ratio
    • volatility_score
    • Additional risk factors (e.g., revenue_consistency, sector_performance, growth_trend)

This JSON serves as a robust, standardized data model ready for downstream applications.

Final Transformation Example:

final_model = F(S_star)  # S_star is the converged unified state vector
return final_model
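F is described only abstractly above; one plausible realization is to carry the structured fields parsed during the consensus loop alongside the state vectors and serialize them once convergence is reached. The sketch below assembles such fields into the schema above; the helper name and inputs are illustrative.

Python Example (illustrative):

import json

def build_intermediary_json(metadata: dict, transactions: list[dict], risk: dict) -> str:
    """Assemble the converged, structured fields into the intermediary JSON."""
    model = {
        "metadata": {
            key: metadata.get(key)
            for key in ("business_name", "statement_month", "account_number",
                        "starting_balance", "ending_balance")
        },
        "transactions": [
            {
                key: txn.get(key)
                for key in ("date", "description", "amount", "method",
                            "category", "is_recurring", "merchant_id")
            }
            for txn in transactions
        ],
        "risk_analysis": risk,
    }
    return json.dumps(model, indent=2)

print(build_intermediary_json(
    metadata={"business_name": "Acme Corporation", "statement_month": "February",
              "account_number": "XXXX4567", "starting_balance": 125000.00,
              "ending_balance": 167250.75},
    transactions=[],
    risk={"cash_flow_ratio": 1.51, "volatility_score": 0.15},
))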

Deep Dive: Mathematical Underpinnings and Information Flow

Mathematical Motivation

The use of state vectors and iterative weighted updates allows us to treat the document as a dynamic system where each page contributes to a shared global representation. The equation:

s_i^{(t+1)} = \alpha\, s_i^{(t)} + (1 - \alpha)\, S^{(t)}

ensures that while each agent retains its unique local insights, it also converges toward a consensus that reflects the overall document context.
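A quick numeric check makes the convergence behavior concrete: because the update preserves the global mean, each agent’s deviation from that mean shrinks by a factor of α per iteration, so all states contract toward the shared average.

Python Example (illustrative):

import numpy as np

# Three toy agent states; the update contracts each one toward the shared mean.
states = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
alpha = 0.7

for step in range(10):
    global_state = states.mean(axis=0)                     # S^(t)
    states = alpha * states + (1 - alpha) * global_state   # s_i^(t+1)

print(states.round(3))     # every row approaches the initial mean [0.5, 0.5]
print(states.std(axis=0))  # spread shrinks by a factor of alpha per iteration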

Information Flow via Prompts

Prompts serve as a vehicle for transferring nuanced information from the embedded state vectors to the LLM. Mathematically, each prompt P_i^{(t)} is a function:

P_i^{(t)} = f\big(s_i^{(t)}, S^{(t)}, \mathcal{C}\big)

where:

  • s_i^{(t)} provides the agent’s local perspective.
  • S^{(t)} conveys the aggregated global context.
  • \mathcal{C} injects expert knowledge, guiding the LLM to produce refined outputs.

The LLM’s response, once parsed and embedded into \Delta s_i^{(t)}, directly influences the next state vector, ensuring that each iteration is informed by both raw data and expert validation.


Conclusion: The Power of Our Multi-Agent Framework

Our detailed multi-agent approach is designed to tackle the complexity of financial documents through:

  1. Robust Baseline Extraction:
    Two dedicated agents process Page 1 to establish a reliable metadata baseline.

  2. Permutation-Based Chunk Analysis:
    Structure agents generate multiple candidate interpretations by exploring various segmentation strategies, reducing the chance that structural details are missed.

  3. Global Consistency via Iterative Consensus:
    Sharing state vectors and refining them iteratively through LLM prompts leads to a consistent, unified representation.

  4. Mathematical Rigor:
    The weighted update mechanisms and consensus equations provide a quantitative foundation for merging diverse data sources.

  5. Intermediary JSON Transformation:
    The final unified state is mapped into a standardized JSON format, bridging the gap between raw extraction and actionable insights.

This comprehensive, mathematically grounded workflow ensures that BankVision delivers precise, consistent, and production-ready data models for even the most challenging financial documents.