Exploration of Augmentation Strategies in Multi-modal Retrieval-Augmented Generation for the Biomedical Domain: A Case Study Evaluating Question Answering in Glycobiology

Primož Kocbek; Azra Frkatović-Hodžić; Dora Lalić; Vivian Hui; Gordan Lauc; Gregor Štiglic

arXiv:2512.16802·cs.CL·December 19, 2025


Open Access

TL;DR

This study evaluates different augmentation strategies for multi-modal retrieval-augmented generation in glycobiology, finding that text conversion and visual retrieval methods vary in effectiveness depending on model capacity and domain complexity.

Contribution

It provides a comprehensive benchmark and analysis of augmentation strategies in biomedical multi-modal QA, highlighting the trade-offs and performance of various retrieval methods across model sizes.

Findings

01. Text and multi-modal augmentation outperform OCR-free retrieval for mid-size models.

02. Visual retrieval methods such as ColPali and ColFlor become competitive when paired with larger models.

03. Pipeline choice depends on model capacity: text conversion suits mid-size models, while visual retrieval benefits larger models.

Abstract

Multi-modal retrieval-augmented generation (MM-RAG) promises grounded biomedical QA, but it is unclear when to (i) convert figures/tables into text versus (ii) use optical character recognition (OCR)-free visual retrieval that returns page images and leaves interpretation to the generator. We study this trade-off in glycobiology, a visually dense domain. We built a benchmark of 120 multiple-choice questions (MCQs) from 25 papers, stratified by retrieval difficulty (easy: text; medium: figures/tables; hard: cross-evidence). We implemented four augmentations (None, Text RAG, Multi-modal conversion, and late-interaction visual retrieval with ColPali) using Docling parsing and Qdrant indexing. We evaluated mid-size open-source and frontier proprietary models (e.g., Gemma-3-27B-IT, GPT-4o family). Additional testing used the GPT-5 family and multiple visual retrievers (ColPali/ColQwen/ColFlor).…
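The late-interaction visual retrieval mentioned in the abstract can be illustrated with a minimal sketch of ColBERT-style MaxSim scoring, the scoring scheme that ColPali-type retrievers build on. This is not the authors' pipeline: the function names `maxsim_score` and `rank_pages` are hypothetical, the embeddings are toy NumPy arrays rather than real model outputs, and no Docling or Qdrant calls are shown.

```python
import numpy as np


def maxsim_score(query_emb: np.ndarray, page_emb: np.ndarray) -> float:
    """Late-interaction (MaxSim) score between one query and one page image.

    query_emb: array of shape (n_query_tokens, dim)
    page_emb:  array of shape (n_page_patches, dim)
    Rows are assumed to be L2-normalised, so dot products act as cosine similarities.
    """
    sim = query_emb @ page_emb.T  # (n_query_tokens, n_page_patches)
    # For each query token keep its best-matching patch, then sum over tokens.
    return float(sim.max(axis=1).sum())


def rank_pages(query_emb: np.ndarray, pages: list) -> list:
    """Return page indices ordered from highest to lowest MaxSim score."""
    scores = [maxsim_score(query_emb, p) for p in pages]
    return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)


# Toy data (assumption): 2 query tokens, 2-dimensional embeddings.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
page_a = np.array([[1.0, 0.0], [0.0, 1.0]])  # matches both query tokens
page_b = np.array([[1.0, 0.0]])              # matches only the first token
ranking = rank_pages(query, [page_a, page_b])
```

In an OCR-free pipeline of this kind, the top-ranked page images would be passed directly to the generator, which is why the reader model's capacity matters for interpreting figures and tables.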


Taxonomy

Topics: Biomedical Text Mining and Ontologies · Multimodal Machine Learning Applications · Topic Modeling