How Apoidea Group enhances visual information extraction from banking documents with multimodal models using LLaMA-Factory on Amazon SageMaker HyperPod

How Apoidea Group enhances visual information extraction from banking documents with multimodal models using LLaMA-Factory on Amazon SageMaker HyperPod

This post is co-written with Ken Tsui, Edward Tsoi and Mickey Yip from Apoidea Group.

The banking industry has long struggled with the inefficiencies associated with repetitive processes such as information extraction, document review, and auditing. These tasks, which require significant human resources, slow down critical operations such as Know Your Customer (KYC) procedures, loan applications, and credit analysis. As a result, banks face operational challenges, including limited scalability, slow processing speeds, and high costs associated with staff training and turnover.

To address these inefficiencies, the implementation of advanced information extraction systems is crucial. These systems enable the rapid extraction of data from various financial documents—including bank statements, KYC forms, and loan applications—reducing both manual errors and processing time. As such, information extraction technology is instrumental in accelerating customer onboarding, maintaining regulatory compliance, and driving the digital transformation of the banking sector, particularly in high-volume document processing tasks.

The challenges in document processing are compounded by the need for specialized solutions that maintain high accuracy while handling sensitive financial data such as banking statements, financial statements, and company annual reports. This is where Apoidea Group, a leading AI-focused FinTech independent software vendor (ISV) based in Hong Kong, has made a significant impact. By using cutting-edge generative AI and deep learning technologies, Apoidea has developed innovative AI-powered solutions that address the unique needs of multinational banks. Their flagship product, SuperAcc, is a sophisticated document processing service featuring a set of proprietary document understanding models capable of processing diverse document types such as bank statements, financial statements, and KYC documents.

SuperAcc has demonstrated significant improvements in the banking sector. For instance, the financial spreading process, which previously required 4–6 hours, can now be completed in just 10 minutes, with staff needing less than 30 minutes to review the results. Similarly, in small and medium-sized enterprise (SME) banking, the review process for multiple bank statements spanning 6 months—used to extract critical data such as sales turnover and interbank transactions—has been reduced to just 10 minutes. This substantial reduction in processing time not only accelerates workflows but also minimizes the risk of manual errors. By automating repetitive tasks, SuperAcc enhances both operational efficiency and accuracy, using Apoidea’s self-trained machine learning (ML) models to deliver consistent, high-accuracy results in live production environments. These advancements have led to an impressive return on investment (ROI) of over 80%, showcasing the tangible benefits of implementing SuperAcc in banking operations.

AI transformation in banking faces several challenges, primarily due to stringent security and regulatory requirements. Financial institutions demand banking-grade security, necessitating compliance with standards such as ISO 9001 and ISO 27001. Additionally, AI solutions must align with responsible AI principles to facilitate transparency and fairness. Integration with legacy banking systems further complicates adoption, because these infrastructures are often outdated compared to rapidly evolving tech landscapes. Despite these challenges, SuperAcc has been successfully deployed and trusted by over 10 financial services industry (FSI) clients, demonstrating its reliability, security, and compliance in real-world banking environments.

To further enhance the capabilities of specialized information extraction solutions, advanced ML infrastructure is essential. Amazon SageMaker HyperPod offers an effective solution for provisioning resilient clusters to run ML workloads and develop state-of-the-art models. SageMaker HyperPod accelerates the development of foundation models (FMs) by removing the undifferentiated heavy lifting involved in building and maintaining large-scale compute clusters powered by thousands of accelerators such as AWS Trainium and NVIDIA A100 and H100 GPUs. Its resiliency features automatically monitor cluster instances, detecting and replacing faulty hardware automatically, allowing developers to focus on running ML workloads without worrying about infrastructure management.

Building on this foundation of specialized information extraction solutions and using the capabilities of SageMaker HyperPod, we collaborate with APOIDEA Group to explore the use of large vision language models (LVLMs) to further improve table structure recognition performance on banking and financial documents. In this post, we present our work and step-by-step code on fine-tuning the Qwen2-VL-7B-Instruct model using LLaMA-Factory on SageMaker HyperPod. Our results demonstrate significant improvements in table structure recognition accuracy and efficiency compared to the original base model and traditional methods, with particular success in handling complex financial tables and multi-page documents. Following the steps described in this post, you can also fine-tune your own model with domain-specific data to solve your information extraction challenges using the open source implementation.

Challenges in banking information extraction systems with multimodal models

Developing information extraction systems for banks presents several challenges, primarily due to the sensitive nature of documents, their complexity, and variety. For example, bank statement formats vary significantly across financial institutions, with each bank using unique layouts, different columns, transaction descriptions, and ways of presenting financial information. In some cases, documents are scanned with low quality and are poorly aligned, blurry, or faded, creating challenges for Optical Character Recognition (OCR) systems attempting to convert them into machine-readable text. Creating robust ML models is challenging due to the scarcity of clean training data. Current solutions rely on orchestrating models for tasks such as layout analysis, entity extraction, and table structure recognition. Although this modular approach addresses the issue of limited resources for training end-to-end ML models, it significantly increases system complexity and fails to fully use available information.

Models developed based on specific document features are inherently limited in their scope, restricting access to diverse and rich training data. This limitation results in upstream models, particularly those responsible for visual representation, lacking robustness. Furthermore, single-modality models fail to use the multi-faceted nature of information, potentially leading to less precise and accurate predictions. For instance, in table structure recognition tasks, models often lack the capability to reason about textual content while inferring row and column structures. Consequently, a common error is the incorrect subdivision of single rows or columns into multiple instances. Additionally, downstream models that heavily depend on upstream model outputs are susceptible to error propagation, potentially compounding inaccuracies introduced in earlier stages of processing.

Moreover, the substantial computational requirements of these multimodal systems present scalability and efficiency challenges. The necessity to maintain and update multiple models increases the operational burden, rendering large-scale document processing both resource-intensive and difficult to manage effectively. This complexity impedes the seamless integration and deployment of such systems in banking environments, where efficiency and accuracy are paramount.

The recent advances in multimodal models have demonstrated remarkable capabilities in processing complex visual and textual information. LVLMs represent a paradigm shift in document understanding, combining the robust textual processing capabilities of traditional language models with advanced visual comprehension. These models excel at tasks requiring simultaneous interpretation of text, visual elements, and their spatial relationships, making them particularly effective for financial document processing. By integrating visual and textual understanding into a unified framework, multimodal models offer a transformative approach to document analysis. Unlike traditional information extraction systems that rely on fragmented processing pipelines, these models can simultaneously analyze document layouts, extract text content, and interpret visual elements. This integrated approach significantly improves accuracy by reducing error propagation between processing stages while maintaining computational efficiency.

Advanced vision language models are typically pre-trained on large-scale multimodal datasets that include both image and text data. The pre-training process typically involves training the model on diverse datasets containing millions of images and associated text descriptions, sourced from publicly available datasets such as image-text pairs LAION-5B, Visual Question Answering (VQAv2.0), DocVQA, and others. These datasets provide a rich variety of visual content paired with textual descriptions, enabling the model to learn meaningful representations of both modalities. During pre-training, these models are trained using auto-regressive loss, where the model predicts the next token in a sequence given the previous tokens and the visual input. This approach allows the model to effectively align visual and textual features and generate coherent text responses based on the visual context. For image data specifically, modern vision-language models use pre-trained vision encoders, such as vision transformers (ViTs), as their backbone to extract visual features. These features are then fused with textual embeddings in a multimodal transformer architecture, allowing the model to understand the relationships between images and text. By pre-training on such diverse and large-scale datasets, these models develop a strong foundational understanding of visual content, which can be fine-tuned for downstream tasks like OCR, image captioning, or visual question answering. This pre-training phase is critical for enabling the model to generalize well across a wide range of vision-language tasks. The model architecture is illustrated in the following diagram.

visual language model

Fine-tuning vision-language models for visual document understanding tasks offers significant advantages due to their advanced architecture and pre-trained capabilities. The model’s ability to understand and process both visual and textual data makes it inherently well-suited for extracting and interpreting text from images. Through fine-tuning on domain-specific datasets, the model can achieve superior performance in recognizing text across diverse fonts, styles, and backgrounds. This is particularly valuable in banking applications, where documents often contain specialized terminology, complex layouts, and varying quality scans.

Moreover, fine-tuning these models for visual document understanding tasks allows for domain-specific adaptation, which is crucial for achieving high precision in specialized applications. The model’s pre-trained knowledge provides a strong foundation, reducing the need for extensive training data and computational resources. Fine-tuning also enables the model to learn domain-specific nuances, such as unique terminologies or formatting conventions, further enhancing its performance. By combining a model’s general-purpose vision-language understanding with task-specific fine-tuning, you can create a highly efficient and accurate information extraction system that outperforms traditional methods, especially in challenging or niche use cases. This makes vision-language models powerful tools for advancing visual document understanding technology in both research and practical applications.

Solution overview

LLaMA-Factory is an open source framework designed for training and fine-tuning large language models (LLMs) efficiently. It supports over 100 popular models, including LLaMA, Mistral, Qwen, Baichuan, and ChatGLM, and integrates advanced techniques such as LoRA (Low-Rank Adaptation), QLoRA (Quantized LoRA), and full-parameter fine-tuning. The framework provides a user-friendly interface, including a web-based tool called LlamaBoard, which allows users to fine-tune models without writing code. LLaMA-Factory also supports various training methods like supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and direct preference optimization (DPO), making it versatile for different tasks and applications.

The advantage of LLaMA-Factory lies in its efficiency and flexibility. It significantly reduces the computational and memory requirements for fine-tuning large models by using techniques like LoRA and quantization, enabling users to fine-tune models even on hardware with limited resources. Additionally, its modular design and integration of cutting-edge algorithms, such as FlashAttention-2 and GaLore, facilitate high performance and scalability. The framework also simplifies the fine-tuning process, making it accessible to both beginners and experienced developers. This democratization of LLM fine-tuning allows users to adapt models to specific tasks quickly, fostering innovation and application across various domains. The solution architecture is presented in the following diagram.

sagemaker hyperpod architecture

For the training infrastructure, we use SageMaker HyperPod for distributed training. SageMaker HyperPod provides a scalable and flexible environment for training and fine-tuning large-scale models. SageMaker HyperPod offers a comprehensive set of features that significantly enhance the efficiency and effectiveness of ML workflows. Its purpose-built infrastructure simplifies distributed training setup and management, allowing flexible scaling from single-GPU experiments to multi-GPU data parallelism and large model parallelism. The service’s shared file system integration with Amazon FSx for Lustre enables seamless data synchronization across worker nodes and Amazon Simple Storage Service (Amazon S3) buckets, while customizable environments allow tailored installations of frameworks and tools.

SageMaker HyperPod integrates with Slurm, a popular open source cluster management and job scheduling system, to provide efficient job scheduling and resource management, enabling parallel experiments and distributed training. The service also enhances productivity through Visual Studio Code connectivity, offering a familiar development environment for code editing, script execution, and Jupyter notebook experimentation. These features collectively enable ML practitioners to focus on model development while using the power of distributed computing for faster training and innovation.

Refer to our GitHub repo for a step-by-step guide on fine-tuning Qwen2-VL-7B-Instruct on SageMaker HyperPod.

We start the data preprocessing using the image input and HTML output. We choose the HTML structure as the output format because it is the most common format for representing tabular data in web applications. It is straightforward to parse and visualize, and it is compatible with most web browsers for rendering on the website for manual review or modification if needed. The data preprocessing is critical for the model to learn the patterns of the expected output format and adapt the visual layout of the table. The following is one example of input image and output HTML as the ground truth.

sample financial statement table

<table>
  <tr>
    <td></td>
    <td colspan="5">Payments due by period</td>
  </tr>
  <tr>
    <td></td><td>Total</td><td>Less than 1 year</td><td>1-3 years</td><td>3-5 years</td><td>More than 5 years</td>
  </tr>
  <tr>
    <td>Operating Activities:</td><td></td><td></td><td></td><td></td><td></td>
  </tr>
… … …
… … …
 <tr>
    <td>Capital lease obligations<sup> (6)</sup></td><td>48,771</td><td>8,320</td><td>10,521</td><td>7,371</td><td>22,559</td>
  </tr>
  <tr>
    <td>Other<sup> (7) </sup></td><td>72,734</td><td>20,918</td><td>33,236</td><td>16,466</td><td>2,114</td>
  </tr>
  <tr>
    <td>Total</td><td>$16,516,866</td><td>$3,037,162</td><td>$5,706,285</td><td>$4,727,135</td><td>$3,046,284</td>
  </tr>
</table>

We then use LLaMA-Factory to fine-tune the Qwen2-VL-7B-Instruct model on the preprocessed data. We use Slurm sbatch to orchestrate the distributed training script. An example of the script would be submit_train_multinode.sh. The training script uses QLoRA and data parallel distributed training on SageMaker HyperPod. Following the guidance provided, you will see output similar to the following training log.

During inference, we use vLLM for hosting the quantized model, which provides efficient memory management and optimized attention mechanisms for high-throughput inference. vLLM natively supports the Qwen2-VL series model and continues to add support for newer models, making it particularly suitable for large-scale document processing tasks. The deployment process involves applying 4-bit quantization to reduce model size while maintaining accuracy, configuring the vLLM server with optimal parameters for batch processing and memory allocation, and exposing the model through RESTful APIs for quick integration with existing document processing pipelines. For details on model deployment configuration, refer to the hosting script.

Results

Our evaluation focused on the FinTabNet dataset, which contains complex tables from S&P 500 annual reports. This dataset presents unique challenges due to its diverse table structures, including merged cells, hierarchical headers, and varying layouts. The following example demonstrates a financial table and its corresponding model-generated HTML output, rendered in a browser for visual comparison.

result

For quantitative evaluation, we employed the Tree Edit Distance-based Similarity (TEDS) metric, which assesses both structural and content similarity between generated HTML tables and ground truth. TEDS measures the minimum number of edit operations required to transform one tree structure into another, and TEDS-S focuses specifically on structural similarity. The following table summarizes the output on different models.

Model TEDS TEDS-S
Anthropic’s Claude 3 Haiku 69.9 76.2
Anthropic’s Claude 3.5 Sonnet 86.4 87.1
Qwen2-VL-7B-Instruct (Base) 23.4 25.3
Qwen2-VL-7B-Instruct (Fine-tuned) 81.1 89.7

The evaluation results reveal significant advancements in our fine-tuned model’s performance. Most notably, the Qwen2-VL-7B-Instruct model demonstrated substantial improvements in both content recognition and structural understanding after fine-tuning. When compared to its base version, the model showed enhanced capabilities in accurately interpreting complex table structures and maintaining content fidelity. The fine-tuned version not only surpassed the performance of Anthropic’s Claude 3 Haiku, but also approached the accuracy levels of Anthropic’s Claude 3.5 Sonnet, while maintaining more efficient computational requirements. Particularly impressive was the model’s improved ability to handle intricate table layouts, suggesting a deeper understanding of document structure and organization. These improvements highlight the effectiveness of our fine-tuning approach in adapting the model to specialized financial document processing tasks.

Best practices

Based on our experiments, we identified several key insights and best practices for fine-tuning multimodal table structure recognition models:

  • Model performance is highly dependent on the quality of fine-tuning data. The closer the fine-tuning data resembles real-world datasets, the better the model performs. Using domain-specific data, we achieved a 5-point improvement in TEDS score with only 10% of the data compared to using general datasets. Notably, fine-tuning doesn’t require massive datasets; we achieved relatively good performance with just a few thousand samples. However, we observed that imbalanced datasets, particularly those lacking sufficient examples of complex elements like long tables and forms with merged cells, can lead to biased performance. Maintaining a balanced distribution of document types during fine-tuning facilitates consistent performance across various formats.
  • The choice of base model significantly impacts performance. More powerful base models yield better results. In our case, Qwen2-VL’s pre-trained visual and linguistic features provided a strong foundation. By freezing most parameters through QLoRA during the initial fine-tuning stages, we achieved faster convergence and better usage of pre-trained knowledge, especially with limited data. Interestingly, the model’s multilingual capabilities were preserved; fine-tuning on English datasets alone still yielded good performance on Chinese evaluation datasets. This highlights the importance of selecting a compatible base model for optimal performance.
  • When real-world annotated data is limited, synthetic data generation (using specific document data synthesizers) can be an effective solution. Combining real and synthetic data during fine-tuning helps mitigate out-of-domain issues, particularly for rare or domain-specific text types. This approach proved especially valuable for handling specialized financial terminology and complex document layouts.

Security

Another important aspect of our project involves addressing the security considerations essential when working with sensitive financial documents. As expected in the financial services industry, robust security measures must be incorporated throughout the ML lifecycle. These typically include data security through encryption at rest using AWS Key Management Service (AWS KMS) and in transit using TLS, implementing strict S3 bucket policies with virtual private cloud (VPC) endpoints, and following least-privilege access controls through AWS Identity and Access Management IAM roles. For training environments like SageMaker HyperPod, security considerations involve operating within private subnets in dedicated VPCs using the built-in encryption capabilities of SageMaker. Secure model hosting with vLLM requires deployment in private VPC subnets with proper Amazon API Gateway protections and token-based authentication. These security best practices for financial services make sure that sensitive financial information remains protected throughout the entire ML pipeline while enabling innovative document processing solutions in highly regulated environments.

Conclusion

Our exploration of multi-modality models for table structure recognition in banking documents has demonstrated significant improvements in both accuracy and efficiency. The fine-tuned Qwen2-VL-7B-Instruct model, trained using LLaMA-Factory on SageMaker HyperPod, has shown remarkable capabilities in handling complex financial tables and diverse document formats. These results highlight how multimodal approaches represent a transformative leap forward from traditional multistage and single modality methods, offering an end-to-end solution for modern document processing challenges.

Furthermore, using LLaMA-Factory on SageMaker HyperPod significantly streamlines the fine-tuning process, making it both more efficient and accessible. The scalable infrastructure of SageMaker HyperPod enables rapid experimentation by allowing seamless scaling of training resources. This capability facilitates faster iteration cycles, enabling researchers and developers to test multiple configurations and optimize model performance more effectively.

Explore our GitHub repository to access the implementation and step-by-step guidance, and begin customizing models for your specific requirements. Whether you’re processing financial statements, KYC documents, or complex reports, we encourage you to evaluate its potential for optimizing your document workflows.


About the Authors

Tony Wong is a Solutions Architect at AWS based in Hong Kong, specializing in financial services. He works with FSI customers, particularly in banking, on digital transformation journeys that address security and regulatory compliance. With entrepreneurial background and experience as a Solutions Architect Manager at a local System Integrator, Tony applies problem management skills in enterprise environments. He holds an M.Sc. from The Chinese University of Hong Kong and is passionate to leverage new technologies like Generative AI to help organizations enhance business capabilities.

Yanwei Cui, PhD, is a Senior Machine Learning Specialist Solutions Architect at AWS. He started machine learning research at IRISA (Research Institute of Computer Science and Random Systems), and has several years of experience building AI-powered industrial applications in computer vision, natural language processing, and online user behavior prediction. At AWS, he shares his domain expertise and helps customers unlock business potentials and drive actionable outcomes with machine learning at scale. Outside of work, he enjoys reading and traveling.

Zhihao Lin is a Deep Learning Architect at the AWS Generative AI Innovation Center. With a Master’s degree from Peking University and publications in top conferences like CVPR and IJCAI, he brings extensive AI/ML research experience to his role. At AWS, He focuses on developing generative AI solutions, leveraging cutting-edge technology for innovative applications. He specializes in solving complex computer vision and natural language processing challenges and advancing the practical use of generative AI in business.

Ken Tsui, VP of Machine Learning at Apoidea Group, is a seasoned machine learning engineer with over a decade of experience in applied research and B2B and B2C AI product development. Specializing in language models, computer vision, data curation, synthetic data generation, and distributed training, he also excels in credit scoring and stress-testing. As an active open-source researcher, he contributes to large language model and vision-language model pretraining and post-training datasets.

Edward Tsoi Po Wa is a Senior Data Scientist at Apoidea Group. Passionate about Artificial Intelligence, he specializes in Machine Learning, working on projects like Document Intelligence System, Large Language Models R&D and Retrieval-Augmented Generation Application. Edward drives impactful AI solutions, optimizing systems for industries like banking. Beside working, he holds a B.S. in Physics from Hong Kong University of Science and Technology. In his spare time, he loves to explore science, mathematics, and philosophy.

Mickey Yip is the Vice President of Product at Apoidea Group, where he utilizes his expertises to spearhead groundbreaking AI and digital transformation initiatives. With extensive experience, Mickey has successfully led complex projects for multinational banks, property management firms, and global corporations, delivering impactful and measurable outcomes. His expertise lies in designing and launching innovative AI SaaS products tailored for the banking sector, significantly improving operational efficiency and enhancing client success.

​ 

Leave a Comment

Your email address will not be published. Required fields are marked *

Sign In

Register

Reset Password

Please enter your username or email address, you will receive a link to create a new password via email.

Scroll to Top