
Grab's Vision LLM for Image Scanning: Architecture and Training

ByteByteGo Newsletter

Grab's engineering team detailed how they built a Vision LLM to improve information extraction from user-submitted documents. Faced with the limitations of traditional OCR systems and the poor performance of existing proprietary and open-source models on Southeast Asian languages, Grab opted to build their own. A custom model was essential for accurate electronic Know-Your-Customer (eKYC) verification across the region's diverse document formats.

To build their Vision LLM, Grab selected Qwen2-VL 2B as the base multimodal model. They chose it for its modest size, good support for Southeast Asian languages, and ability to process images at native resolution. They then fine-tuned the model to improve native-language understanding and to reduce latency while maintaining accuracy on their document-processing workload.
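
Grab did not publish their serving code, but for readers unfamiliar with the base model, here is a minimal sketch of how Qwen2-VL 2B can be loaded and prompted to extract fields from a document image, assuming the Hugging Face transformers implementation. The model ID, prompt, and file name are illustrative, not Grab's actual setup.

```python
# Minimal sketch: field extraction from a document image with Qwen2-VL 2B.
# Model ID, prompt, and image path are illustrative assumptions.
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_ID = "Qwen/Qwen2-VL-2B-Instruct"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)

# Qwen2-VL processes images at native resolution, so no fixed-size
# resize is required before tokenization.
image = Image.open("id_card.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text",
         "text": "Extract the name, ID number, and date of birth as JSON."},
    ],
}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(
    text=[prompt], images=[image], return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens so only the model's answer is decoded.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

Prompting the model to emit structured JSON rather than raw transcription is one way a Vision LLM can replace a multi-stage OCR-plus-parsing pipeline with a single extraction step.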

Grab employed two methods for training data generation. They created a synthetic OCR dataset by extracting text from Common Crawl and rendering it in various formats. They also developed Documint, an auto-labeling and preprocessing framework to extract training labels from real documents. This dual approach allowed Grab to overcome the challenges of language diversity and document variability.
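
The post does not include the rendering pipeline itself. The sketch below illustrates the synthetic-OCR idea under simple assumptions: crawled text is drawn onto a canvas with a randomized font and size, and paired with the source string as its ground-truth label. The font paths, parameter ranges, and `render_sample` helper are hypothetical.

```python
# Illustrative sketch of synthetic OCR data generation: render crawled
# text as an image and keep the source string as the label.
# Font paths and size ranges are placeholders, not Grab's pipeline.
import random
from PIL import Image, ImageDraw, ImageFont

FONTS = [
    "fonts/NotoSans-Regular.ttf",      # placeholder Latin-script font
    "fonts/NotoSansThai-Regular.ttf",  # placeholder Thai-script font
]

def render_sample(text: str) -> tuple[Image.Image, str]:
    """Render `text` with a randomized font and size; return (image, label)."""
    font = ImageFont.truetype(random.choice(FONTS), size=random.randint(18, 36))
    # Measure the rendered text so the canvas fits it with a small margin.
    left, top, right, bottom = font.getbbox(text)
    img = Image.new("RGB", (right - left + 20, bottom - top + 20), "white")
    ImageDraw.Draw(img).text((10, 10), text, font=font, fill="black")
    return img, text

image, label = render_sample("ซอยสุขุมวิท 55, Bangkok 10110")
image.save("sample_0001.png")  # train on (image, label) pairs
```

Because the label comes directly from the source text, rendered samples need no manual annotation, which is what makes this approach practical for covering many scripts and layouts at scale.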

This development highlights a broader trend: companies are increasingly building specialized AI models to meet their specific needs. Off-the-shelf solutions exist, but fine-tuning and custom model development are often necessary for accuracy in specialized contexts, and the approach is becoming more common as the cost of training and deploying smaller, more efficient models falls.