MCERF: Advancing Multimodal LLM Evaluation of Engineering Documentation with Enhanced Retrieval
Kiarash Naghavi Khanghah, Hoang Anh Nguyen, Anna C. Doris, Amir Mohammad Vahedi, Daniele Grandi, Faez Ahmed, Hongyi Xu

TL;DR
This paper introduces MCERF, a multimodal retrieval and reasoning framework that significantly improves question-answering accuracy on engineering documents by integrating visual and textual information with advanced retrieval strategies.
Contribution
The work presents a novel multimodal retrieval system with a modular design, multiple reasoning strategies, and dynamic query routing, advancing the state of the art in engineering document comprehension.
Findings
Achieved a 41.1% accuracy gain over baseline RAG on the DesignQA benchmark.
Demonstrated effective multimodal retrieval combining text, tables, and images.
Validated the system's scalability and adaptability across diverse engineering tasks.
Abstract
Engineering rulebooks and technical standards contain multimodal information, such as dense text, tables, and illustrations, that is challenging for retrieval-augmented generation (RAG) systems. Building upon the DesignQA framework [1], which relied on full-text ingestion and text-based retrieval, this work establishes a Multimodal ColPali Enhanced Retrieval and Reasoning Framework (MCERF), a system that couples a multimodal retriever with large language model reasoning for accurate and efficient question answering from engineering documents. The system employs ColPali, which retrieves both textual and visual information, and multiple retrieval and reasoning strategies: (i) Hybrid Lookup mode for explicit rule mentions, (ii) Vision to Text fusion for figure- and table-guided queries, (iii) High Reasoning LLM mode for complex multimodal questions, and (iv) SelfConsistency decision to…
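As a rough illustration of how a query might be routed among the first three modes named in the abstract, here is a minimal Python sketch. It is not the authors' implementation: the `route` function, the `Mode` enum, the rule-identifier pattern (e.g. `T.7.1`), the visual cue words, and the priority order are all hypothetical choices made for this example.

```python
import re
from enum import Enum


class Mode(Enum):
    """Mode names taken from the abstract; the enum itself is hypothetical."""
    HYBRID_LOOKUP = "hybrid_lookup"
    VISION_TO_TEXT = "vision_to_text"
    HIGH_REASONING = "high_reasoning"


# Assumed rule-identifier format: one or two capital letters followed by
# dotted numbers, e.g. "T.7.1" or "EV.5.2.3". Illustrative only.
RULE_ID = re.compile(r"\b[A-Z]{1,2}\.\d+(?:\.\d+)*\b")

# Assumed cue words suggesting the answer depends on a figure or table.
VISUAL_CUE = re.compile(r"\b(?:figure|table|image|drawing|diagram)s?\b",
                        re.IGNORECASE)


def route(query: str) -> Mode:
    """Pick a mode from surface features of the query (assumed heuristic)."""
    # An explicit rule mention takes priority in this sketch.
    if RULE_ID.search(query):
        return Mode.HYBRID_LOOKUP
    # Otherwise, a reference to visual content selects vision-to-text fusion.
    if VISUAL_CUE.search(query):
        return Mode.VISION_TO_TEXT
    # Everything else falls through to the high-reasoning mode.
    return Mode.HIGH_REASONING
```

For example, `route("What does rule T.7.1 require?")` selects `Mode.HYBRID_LOOKUP`, while a question about a dimension shown in a figure selects `Mode.VISION_TO_TEXT`. A real router would likely rely on richer signals than keyword matching.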
