OmniFusion Technical Report

Elizaveta Goncharova; Anton Razzhigaev; Matvey Mikhalchuk; Maxim Kurkin; Irina Abdullaeva; Matvey Skripkin; Ivan Oseledets; Denis Dimitrov; Andrey Kuznetsov

arXiv:2404.06212 · cs.CV · April 10, 2024 · 2 cites

TL;DR

OmniFusion is a multimodal model that couples a pretrained large language model with adapters for the visual modality. Its best setup reports the top score among open-source LLaVA-like solutions on 8 visual-language benchmarks, and it gives detailed answers in specific domains.

Contribution

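The paper introduces OmniFusion, a multimodal architecture that integrates a pretrained LLM with visual adapters, compares several design choices (adapter type, visual encoder, fusion approach, image encoding method, base LLM), and reports the top score for the best setup against open-source LLaVA-like solutions on 8 visual-language benchmarks.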

Findings

01. The best OmniFusion setup achieved top scores against open-source LLaVA-like solutions across 8 visual-language benchmarks.

02. Demonstrated detailed responses in domains such as medicine and culture.

03. Open-source release of the OmniFusion model and training scripts.

Abstract

Last year, multimodal architectures brought about a revolution in AI-based approaches and solutions, extending the capabilities of large language models (LLMs). We propose the OmniFusion model, based on a pretrained LLM and adapters for the visual modality. We evaluated and compared several architecture design principles for better coupling of text and visual data: MLP and transformer adapters, various CLIP ViT-based encoders (SigLIP, InternViT, etc.) and their fusion approach, the image encoding method (whole-image or tile encoding), and two 7B LLMs (a proprietary one and the open-source Mistral). Experiments on 8 visual-language benchmarks (VizWiz, POPE, MM-Vet, ScienceQA, MMBench, TextVQA, VQAv2, MMMU) show the top score for the best OmniFusion setup across different VQA tasks in comparison with open-source LLaVA-like solutions. We also propose a variety of situations, where OmniFusion…
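The abstract names tile encoding and an MLP adapter but this page gives no implementation details, so the following is only a minimal sketch of the general idea, not the authors' code. It assumes non-overlapping square tiles, a random linear projection standing in for the real visual encoder, a two-layer MLP with ReLU, and visual tokens placed before the text embeddings. All function names, sizes, and weights are hypothetical.

```python
import numpy as np


def split_into_tiles(image, tile):
    """Split an (H, W, C) image into non-overlapping (tile, tile, C) tiles, row-major."""
    h, w, c = image.shape
    assert h % tile == 0 and w % tile == 0, "image sides must be multiples of the tile size"
    rows, cols = h // tile, w // tile
    tiles = image.reshape(rows, tile, cols, tile, c).transpose(0, 2, 1, 3, 4)
    return tiles.reshape(rows * cols, tile, tile, c)


def mlp_adapter(features, w1, b1, w2, b2):
    """Map visual features (N, d_vis) to the LLM embedding size (N, d_llm)."""
    hidden = np.maximum(features @ w1 + b1, 0.0)  # ReLU is an assumption
    return hidden @ w2 + b2


rng = np.random.default_rng(0)
d_vis, d_hidden, d_llm = 16, 32, 24  # toy sizes, not the paper's dimensions

# Toy "image" and its tiles (the alternative is to encode the whole image at once).
image = rng.random((8, 8, 3))
tiles = split_into_tiles(image, 4)  # shape (4, 4, 4, 3)

# Stand-in for a visual encoder: flatten each tile and apply a fixed projection.
encoder = rng.standard_normal((4 * 4 * 3, d_vis))
tile_features = tiles.reshape(len(tiles), -1) @ encoder  # shape (4, d_vis)

# Adapter weights would be learned in practice; here they are random.
w1 = rng.standard_normal((d_vis, d_hidden))
b1 = np.zeros(d_hidden)
w2 = rng.standard_normal((d_hidden, d_llm))
b2 = np.zeros(d_llm)
visual_tokens = mlp_adapter(tile_features, w1, b1, w2, b2)  # shape (4, d_llm)

# Hypothetical text embeddings; the combined sequence is what the LLM would consume.
text_embeddings = rng.standard_normal((5, d_llm))
llm_input = np.concatenate([visual_tokens, text_embeddings], axis=0)  # shape (9, d_llm)
```

The point of the sketch is the shape contract: whatever encoder and adapter are chosen, the adapter output must match the LLM embedding width so that visual and text tokens can share one input sequence.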

Code & Models

Models

🤗 AIRI-Institute/OmniFusion (model, ♡ 59)

Taxonomy

Topics: IoT and Edge/Fog Computing · Interactive and Immersive Displays · IoT-based Smart Home Systems

Methods: Contrastive Language-Image Pre-training