BankVision processes unstructured financial documents by applying Optical Character Recognition (OCR) to extract text, generating embeddings for text and images, and using Convolutional Neural Networks (CNNs) for template recognition.
Large Language Models (LLMs)—advanced neural networks trained on extensive text corpora—and Vision Transformers interpret context and layout across pages.
By coordinating these components, BankVision maps the extracted content into consistent JSON structures, capturing everything from metadata and transaction details to risk metrics.
This multi-model pipeline adapts to varied statement formats, providing validated outputs for underwriting, compliance, and analytics.
The diagram gives a high-level view of how BankVision processes financial documents—extracting text, mapping structure, verifying entities, and analyzing risk. It simplifies some interactions but captures the core pipeline.
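As a rough illustration of how these components compose, the sketch below chains trivial stand-in functions for OCR, region detection, embedding, and layout interpretation into a single pass. Every name and return shape here is a placeholder invented for this example, not BankVision's actual interface.

```python
# Toy sketch of the stage composition described above. Each stage is a trivial
# stub so the pipeline shape runs end to end; none of this is the real API.

def run_ocr(page):           return page.get("text", "")        # OCR stand-in
def detect_regions(page):    return page.get("regions", [])     # CNN template/region stand-in
def embed_text(text):        return [float(len(text))]          # toy text embedding
def interpret_layout(t, r):  return {"n_regions": len(r)}       # LLM/ViT layout reasoning stand-in

def process_statement(pages):
    results = []
    for page in pages:
        text = run_ocr(page)
        regions = detect_regions(page)
        results.append({
            "text_vec": embed_text(text),
            "regions": regions,
            "context": interpret_layout(text, regions),
        })
    return {"pages": results}   # in the real pipeline this becomes the validated JSON output

print(process_statement([{"text": "ACME LLC Statement Jan 2024", "regions": ["header", "table"]}]))
```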
Under the Hood: Advanced Agent Workflow & Permutation-Based Chunk Extraction
Financial documents are inherently complex. They exhibit a wide range of layouts, formats, and multi-modal content—combining text, tables, and graphics. Traditional parsers fail when faced with such diversity. Our solution leverages a multi-agent framework that processes documents in parallel, extracting structured insights from both textual and visual data.
This approach is motivated by several key challenges:
Robust Metadata Extraction: Capturing core information (e.g., business names, account numbers, statement periods) reliably from diverse document formats.
Multi-Modal Fusion: Integrating OCR-derived text with object detection outputs to accurately interpret document layouts.
Permutation-Based Chunk Analysis: Generating multiple interpretations of document segments to cover every structural nuance.
Iterative Consensus: Using LLM prompts and mathematically driven updates to reconcile agent outputs into a unified data model.
By combining these methods, our system creates a highly accurate and production-ready data model suitable for applications such as underwriting, compliance, and risk analysis.
1. Stage 1: Baseline Metadata Extraction from Page 1
The first stage focuses on establishing a reliable baseline from Page 1. Two dedicated agents work in tandem:
Agent A (Text Extraction): Uses OCR to extract raw text t_1 from Page 1.
Agent B (Visual Verification): Applies object detection to capture visual elements o_1 (tables, headers, logos) from Page 1.
These outputs are fused by an embedding function E to form the metadata vector:
m_1 = E(t_1, o_1)
This vector becomes the initial state vector for Page 1:
s_1^(0) = m_1
Example Pseudocode:
```
for each page in pages:
    if page.id == 1:
        text = OCR(page)                              // Extract raw text from Page 1
        objects = ObjectDetect(page)                  // Detect visual elements on Page 1
        metadata = extract_metadata(text, objects)    // Combine text and visual cues
        state_vector = embed(metadata)                // Map metadata to a vector
        agent = {
            "page_number": page.id,
            "raw_text": text,
            "detected_objects": objects,
            "metadata": metadata,
            "state_vector": state_vector,             // s_1^(0)
            "history": [state_vector]                 // Record initial state
        }
        agents[page.id] = agent
```
This baseline is crucial as it sets the context for the entire document.
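The fusion step E is only described abstractly above. As a minimal sketch (assuming a hashing-based bag-of-tokens text embedding, a count vector over detected object types, and L2 normalization), one possible shape of E is:

```python
import numpy as np

OBJECT_TYPES = ["table", "header", "logo"]   # illustrative object classes

def embed_text(text, dim=16):
    """Toy text embedding: hash each token into a fixed-size bucket vector."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec

def embed_objects(objects):
    """Count detected objects per type (the o_1 signal above)."""
    counts = np.zeros(len(OBJECT_TYPES))
    for obj in objects:
        if obj in OBJECT_TYPES:
            counts[OBJECT_TYPES.index(obj)] += 1.0
    return counts

def fuse(text, objects):
    """E(t_1, o_1): concatenate text and visual features, then L2-normalize."""
    m = np.concatenate([embed_text(text), embed_objects(objects)])
    norm = np.linalg.norm(m)
    return m / norm if norm > 0 else m

# s_1^(0) = m_1: the fused vector seeds Page 1's initial state
s1_0 = fuse("ACME LLC Statement Period Jan 1 - Jan 31", ["header", "table", "logo"])
```

In production one would expect learned text and vision encoders in place of these toy functions; the point is simply that textual and visual cues land in a single vector m_1, which then seeds s_1^(0).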
2. Stage 2: Structure Agents and Permutation-Based Chunk Extraction
Once Page 1 establishes the metadata baseline, specialized Structure Agents process the full document. Their goal is to capture the detailed structure by breaking the document into manageable “chunks” and exploring multiple candidate interpretations.
For each segmentation strategy, multiple permutations P_j are generated. Each permutation is a candidate interpretation of the document’s structure.
For each permutation P_j, a candidate state vector is computed:
c_j = E(P_j)
where E embeds the chunk information into the same vector space.
These candidate vectors are later reconciled via a consensus mechanism to capture the best representation of the document’s structure.
The consensus of these candidates is computed using a function C that takes all candidate vectors as input and produces the final state:
S = C(c_1, c_2, …, c_M)
Example Pseudocode:
```
chunks = extract_chunks(document)
permutations = generate_permutations(chunks)      // Generate multiple candidate segmentations
candidate_states = []
for each permutation P_j in permutations:
    candidate_state = embed(P_j)                  // Embed each permutation into a vector
    candidate_states.append(candidate_state)
```
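The consensus function C above is left abstract. One plausible reading, sketched below, is a similarity-weighted average in which each candidate c_j is weighted by how strongly it agrees with the other candidates. This is an assumption made for illustration, not necessarily the exact rule used in production.

```python
import numpy as np

def consensus(candidate_states):
    """C(c_1, ..., c_M): weight each candidate by its mean cosine similarity
    to the other candidates, then return the weighted average as S."""
    C = np.vstack(candidate_states)                        # shape (M, d)
    normed = C / (np.linalg.norm(C, axis=1, keepdims=True) + 1e-9)
    sims = normed @ normed.T                               # pairwise cosine similarities
    np.fill_diagonal(sims, 0.0)
    weights = sims.mean(axis=1)                            # agreement score per candidate
    weights = weights / (weights.sum() + 1e-9)
    return weights @ C                                     # final state S

# Outlier interpretations (the third candidate) receive little weight in S
S = consensus([np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.0, 1.0])])
```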
4. Iterative Consensus Loop with Agent Prompts and Internal Structure Parsing
A single pass rarely captures the full complexity of a document. Therefore, agents enter an iterative consensus loop where they refine their state vectors through multiple iterations.
Each agent constructs a prompt P_i^(t) that combines:
Local State s_i^(t): The agent’s current vector.
Global Context S^(t): The aggregated state from all agents.
Expert Corpus C: A collection of expert-defined schemas for transaction categorization and document structure.
The prompt is expressed as:
P_i^(t) = f(s_i^(t), S^(t), C)
A typical prompt might be:
“Using the current metadata vector s_i^(t) and global context S^(t), refine the metadata extraction for page i. Refer to the corpus C for standardized rules (e.g., mapping ‘withdrawal’ to the correct credit/debit category).”
The LLM processes the prompt and returns a refined proposal. This is parsed into a delta state vector Δs_i^(t), which captures adjustments based on expert rules.
Example Pseudocode:
```
while not converged and iteration < MAX_ITER:
    global_state = compute_global_state(agents)
    for agent in agents:
        # Construct a prompt that includes local state, global context, and the expert corpus
        prompt = construct_prompt(
            agent.state_vector,
            global_state,
            agent.page_number,
            corpus=corpus
        )
        # LLM processes the prompt and returns a refined metadata proposal
        llm_response = LLM_generate(prompt)
        # Parse the LLM response using internal structure parsing against the corpus
        delta_state = embed(parse_with_corpus(llm_response, corpus=corpus))
        # Update the agent's state using a weighted combination of its current state and the delta
        new_state = alpha * agent.state_vector + (1 - alpha) * delta_state
        if validate(new_state):                   # Perform local sanity checks
            agent.history.append(new_state)
            agent.state_vector = new_state
    # Merge all updated state vectors to compute a consensus score
    final_model, consensus_score = consensus_merge(
        [agent.state_vector for agent in agents]
    )
    if consensus_score >= threshold:
        converged = True
    iteration += 1
```
This iterative process ensures that each agent not only leverages its own extraction but also learns from the collective insight of all agents.
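The loop above relies on construct_prompt and parse_with_corpus without showing them. A minimal sketch of what they might look like follows; the prompt wording and the line-filtering parse are illustrative assumptions, not the production implementations.

```python
def construct_prompt(local_state, global_state, page_number, corpus):
    """Render s_i^(t), S^(t), and the expert corpus C into a single LLM prompt."""
    rules = "\n".join(f"- {rule}" for rule in corpus)
    return (
        f"You are refining metadata for page {page_number} of a bank statement.\n"
        f"Current local state vector (s_i): {list(local_state)}\n"
        f"Aggregated global state vector (S): {list(global_state)}\n"
        f"Apply these standardized rules:\n{rules}\n"
        "Return the corrected metadata fields as JSON."
    )

def parse_with_corpus(llm_response, corpus):
    """Very rough stand-in: keep only response lines that mention a corpus term."""
    terms = {rule.split()[0].lower() for rule in corpus if rule.strip()}
    return [line for line in llm_response.splitlines()
            if any(term in line.lower() for term in terms)]
```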
5. Unified Data Model Generation and the Intermediary JSON
Once the consensus loop converges, the final unified state vector S* represents the best approximation of the document’s overall structure and metadata. This state is then transformed into an intermediary JSON structure using a deterministic function F:
JSON = F(S*)
The transformation function F decodes the high-dimensional state vector into a structured JSON format. The intermediary JSON schema is rigorously defined to include:
Metadata:
business_name
statement_month
account_number
starting_balance
ending_balance
Transactions:
An array of objects, one per transaction.
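To make the schema concrete, here is a hypothetical instance of the intermediary JSON, written as a Python dict. The metadata keys come from the list above; the per-transaction keys (date, description, amount, type) are illustrative placeholders, since the exact transaction fields are not enumerated here.

```python
# Hypothetical intermediary JSON instance. Metadata keys match the schema above;
# transaction keys and all values are illustrative placeholders only.
intermediary_json = {
    "metadata": {
        "business_name": "ACME LLC",
        "statement_month": "2024-01",
        "account_number": "****1234",
        "starting_balance": 12500.00,
        "ending_balance": 13975.42,
    },
    "transactions": [
        {
            "date": "2024-01-05",                            # illustrative field
            "description": "ACH DEPOSIT - CUSTOMER PAYMENT",
            "amount": 2500.00,
            "type": "credit",                                # e.g. credit/debit per corpus rules
        },
    ],
}
```

This is the structure that downstream underwriting, compliance, and analytics consumers receive.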
The use of state vectors and iterative weighted updates allows us to treat the document as a dynamic system where each page contributes to a shared global representation. The equation:
s_i^(t+1) = α s_i^(t) + (1 − α) S^(t)
ensures that while each agent retains its unique local insights, it also converges toward a consensus that reflects the overall document context.
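A quick numeric illustration of this update (assuming α = 0.7 and, for simplicity, a global state held fixed): at each step the local state moves a fraction 1 − α of the remaining distance toward S^(t), so repeated iterations pull every agent toward the shared context while damping any single noisy extraction.

```python
import numpy as np

alpha = 0.7                       # weight on the agent's own state (assumed value)
s_i = np.array([1.0, 0.0])        # agent's local state s_i^(0)
S = np.array([0.4, 0.6])          # global state S^(t), held fixed for illustration

for t in range(5):
    s_i = alpha * s_i + (1 - alpha) * S
    print(t, s_i.round(3))
# The local vector converges geometrically toward S, at rate alpha per step.
```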
Prompts serve as a vehicle for transferring nuanced information from the embedded state vectors to the LLM. Mathematically, each prompt P_i^(t) is a function:
P_i^(t) = f(s_i^(t), S^(t), C)
where:
s_i^(t) provides the agent’s local perspective.
S^(t) conveys the aggregated global context.
C injects expert knowledge, guiding the LLM to produce refined outputs.
The LLM’s response, once parsed and embedded into Δs_i^(t), directly influences the next state vector, ensuring that each iteration is informed by both raw data and expert validation.
Conclusion: The Power of Our Multi-Agent Framework
Our detailed multi-agent approach is designed to tackle the complexity of financial documents through:
Robust Baseline Extraction:
Two dedicated agents process Page 1 to establish a reliable metadata baseline.
Permutation-Based Chunk Analysis:
Structure agents generate multiple candidate interpretations by exploring various segmentation strategies, ensuring no nuance is overlooked.
Global Consistency via Iterative Consensus:
Sharing state vectors and refining them iteratively through LLM prompts leads to a consistent, unified representation.
Mathematical Rigor:
The weighted update mechanisms and consensus equations provide a quantitative foundation for merging diverse data sources.
Intermediary JSON Transformation:
The final unified state is mapped into a standardized JSON format, bridging the gap between raw extraction and actionable insights.
This comprehensive, mathematically grounded workflow ensures that BankVision delivers precise, consistent, and production-ready data models for even the most challenging financial documents.