Exploration of Augmentation Strategies in Multi-modal Retrieval-Augmented Generation for the Biomedical Domain: A Case Study Evaluating Question Answering in Glycobiology
Primož Kocbek, Azra Frkatović-Hodžić, Dora Lalić, Vivian Hui, Gordan Lauc, Gregor Štiglic

TL;DR
This study evaluates different augmentation strategies for multi-modal retrieval-augmented generation in glycobiology, finding that text conversion and visual retrieval methods vary in effectiveness depending on model capacity and domain complexity.
Contribution
It provides a comprehensive benchmark and analysis of augmentation strategies in biomedical multi-modal QA, highlighting the trade-offs and performance of various retrieval methods across model sizes.
Findings
Text and multi-modal augmentation outperform OCR-free retrieval for mid-size models.
Visual retrieval methods such as ColPali and ColFlor are competitive when paired with larger models.
Pipeline choice depends on model capacity: text conversion suits mid-size models, while visual retrieval benefits larger models.
Abstract
Multi-modal retrieval-augmented generation (MM-RAG) promises grounded biomedical QA, but it is unclear when to (i) convert figures/tables into text versus (ii) use optical character recognition (OCR)-free visual retrieval that returns page images and leaves interpretation to the generator. We study this trade-off in glycobiology, a visually dense domain. We built a benchmark of 120 multiple-choice questions (MCQs) from 25 papers, stratified by retrieval difficulty (easy text, medium figures/tables, hard cross-evidence). We implemented four augmentations: None, Text RAG, Multi-modal conversion, and late-interaction visual retrieval (ColPali), using Docling parsing and Qdrant indexing. We evaluated mid-size open-source and frontier proprietary models (e.g., Gemma-3-27B-IT, GPT-4o family). Additional testing used the GPT-5 family and multiple visual retrievers (ColPali/ColQwen/ColFlor).…
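The abstract names late-interaction visual retrieval without spelling out the scoring step. The sketch below illustrates one common form of late interaction, the ColBERT-style MaxSim operator, under the assumption that each query is a list of token embeddings and each page image is a list of patch embeddings. It is not the authors' pipeline: the function names (`dot`, `maxsim_score`, `rank_pages`), the page ids, and the two-dimensional toy vectors are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], page_patches: List[Vector]) -> float:
    """Late-interaction score: for each query token, keep the similarity of
    its best-matching page patch, then sum those maxima over the query."""
    if not query_tokens or not page_patches:
        return 0.0
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)


def rank_pages(query_tokens: List[Vector], pages: Dict[str, List[Vector]]) -> List[str]:
    """Return page ids sorted by descending MaxSim score."""
    scores = {pid: maxsim_score(query_tokens, patches) for pid, patches in pages.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy, made-up embeddings (hypothetical; real retrievers emit far larger vectors).
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[0.9, 0.1], [0.2, 0.8]],
    "page_b": [[0.5, 0.5], [0.4, 0.4]],
}
ranking = rank_pages(query, pages)
```

In this toy case `page_a` ranks first because each query token finds a closely matching patch on it. In an OCR-free pipeline of the kind the abstract describes, the top-ranked page images would then be handed to the generator for interpretation.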
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Multimodal Machine Learning Applications · Topic Modeling
