OmniFusion Technical Report
Elizaveta Goncharova, Anton Razzhigaev, Matvey Mikhalchuk, Maxim Kurkin, Irina Abdullaeva, Matvey Skripkin, Ivan Oseledets, Denis Dimitrov, and Andrey Kuznetsov

TL;DR
OmniFusion is a multimodal AI model that combines large language models with visual adapters, achieving top performance on visual question answering benchmarks and supporting detailed domain-specific responses.
Contribution
The paper introduces OmniFusion, a multimodal architecture that couples a pretrained LLM with adapters for the visual modality, and reports that its best setup outscores open-source LLaVA-like solutions on 8 visual-language benchmarks.
Findings
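Achieved top scores among open-source LLaVA-like solutions on 8 visual-language benchmarks: VizWiz, Pope, MM-Vet, ScienceQA, MMBench, TextVQA, VQAv2, and MMMU.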
Demonstrated detailed responses in various domains like medicine and culture.
Open-source release of the OmniFusion model and training scripts.
Abstract
Last year, multimodal architectures served up a revolution in AI-based approaches and solutions, extending the capabilities of large language models (LLMs). We propose an OmniFusion model based on a pretrained LLM and adapters for visual modality. We evaluated and compared several architecture design principles for better text and visual data coupling: MLP and transformer adapters, various CLIP ViT-based encoders (SigLIP, InternViT, etc.) and their fusing approach, the image encoding method (whole image or tiles encoding), and two 7B LLMs (the proprietary one and open-source Mistral). Experiments on 8 visual-language benchmarks show the top score for the best OmniFusion setup in terms of different VQA tasks in comparison with open-source LLaVA-like solutions: VizWiz, Pope, MM-Vet, ScienceQA, MMBench, TextVQA, VQAv2, MMMU. We also propose a variety of situations, where OmniFusion…
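The adapter idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `MLPAdapter`, the helper `build_llm_input`, the two-layer GELU design, the toy dimensions, and the choice to prepend visual tokens to the text tokens are all assumptions made for illustration. The sketch only shows the general pattern of projecting visual encoder features into the LLM embedding space and joining them with text embeddings.

```python
import numpy as np


def gelu(x):
    # Tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


class MLPAdapter:
    """Hypothetical two-layer MLP mapping visual features to the LLM embedding size."""

    def __init__(self, d_vis, d_llm, seed=0):
        rng = np.random.default_rng(seed)
        # Randomly initialised weights; in a real model these would be trained.
        self.w1 = rng.normal(0.0, 0.02, (d_vis, d_llm))
        self.b1 = np.zeros(d_llm)
        self.w2 = rng.normal(0.0, 0.02, (d_llm, d_llm))
        self.b2 = np.zeros(d_llm)

    def __call__(self, visual_feats):
        # visual_feats: (num_patches, d_vis) -> (num_patches, d_llm)
        hidden = gelu(visual_feats @ self.w1 + self.b1)
        return hidden @ self.w2 + self.b2


def build_llm_input(visual_feats, text_embeds, adapter):
    # Project the visual features, then prepend them to the text embeddings
    # (the ordering is an assumption of this sketch).
    visual_tokens = adapter(visual_feats)
    return np.concatenate([visual_tokens, text_embeds], axis=0)


# Toy stand-ins for encoder outputs and text token embeddings.
rng = np.random.default_rng(1)
visual_feats = rng.normal(size=(16, 32))  # 16 image patches, 32-dim features
text_embeds = rng.normal(size=(5, 64))    # 5 text tokens, 64-dim embeddings

adapter = MLPAdapter(d_vis=32, d_llm=64)
sequence = build_llm_input(visual_feats, text_embeds, adapter)
print(sequence.shape)  # (21, 64)
```

The resulting sequence has one row per visual token followed by one row per text token, all in the LLM embedding dimension, which is the form a decoder-only LLM would consume in this kind of setup.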
Taxonomy
Topics: IoT and Edge/Fog Computing · Interactive and Immersive Displays · IoT-based Smart Home Systems
Methods: Contrastive Language-Image Pre-training
