Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks
Fawaz Sammani, Nikos Deligiannis

TL;DR
Uni-NLX is a single multi-task model that generates natural language explanations across vision and vision-language tasks. It is trained in part on datasets derived with large language models and uses 7X fewer parameters while maintaining or improving performance on several tasks.
Contribution
The paper introduces a unified framework for all NLE tasks, along with two new datasets (ImageNetX and VQA-ParaX). One multi-task model replaces separate task-specific models, using fewer parameters with comparable or better results.
Findings
Capable of performing 7 NLE tasks simultaneously
Uses 7X fewer parameters than task-specific models
Achieves comparable or superior performance on several tasks
Abstract
Natural Language Explanations (NLE) aim at supplementing the prediction of a model with human-friendly natural text. Existing NLE approaches involve training separate models for each downstream task. In this work, we propose Uni-NLX, a unified framework that consolidates all NLE tasks into a single and compact multi-task model using a unified training objective of text generation. Additionally, we introduce two new NLE datasets: 1) ImageNetX, a dataset of 144K samples for explaining ImageNet categories, and 2) VQA-ParaX, a dataset of 123K samples for explaining the task of Visual Question Answering (VQA). Both datasets are derived leveraging large language models (LLMs). By training on the 1M combined NLE samples, our single unified framework is capable of simultaneously performing seven NLE tasks including VQA, visual recognition and visual reasoning tasks with 7X fewer parameters,…
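The "unified training objective of text generation" mentioned in the abstract can be illustrated with a minimal sketch. The idea shown here is only that samples from different tasks are mapped to the same (prompt, target) text format, so one generative model can be trained on all of them. The class `NLESample`, the function `to_text_pair`, the task labels, and the prompt wording are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class NLESample:
    """One explanation sample; field names are illustrative assumptions."""
    task: str                       # hypothetical task label, e.g. "vqa"
    answer: str                     # the prediction to be explained
    explanation: str                # the human-friendly explanation text
    question: Optional[str] = None  # present only for question-based tasks


def to_text_pair(sample: NLESample) -> Tuple[str, str]:
    """Map a sample from any task to a (prompt, target) pair of strings.

    The templates below are assumed for illustration; the point is that
    every task ends up in the same text-to-text shape.
    """
    if sample.question is not None:
        prompt = f"{sample.question} the answer is"
    else:
        prompt = "the image shows"
    target = f"{sample.answer} because {sample.explanation}"
    return prompt, target


samples = [
    NLESample("vqa", "yes", "the man is holding a racket",
              question="is he playing tennis?"),
    NLESample("recognition", "zebra", "it has black and white stripes"),
]

# Every task now shares one format, so a single text-generation loss applies.
pairs = [to_text_pair(s) for s in samples]
for prompt, target in pairs:
    print(prompt, "->", target)
```

In a sketch like this, the image features would be supplied to the model alongside the prompt, and the model would be trained to generate the target string. That part is omitted here because the abstract does not describe it.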
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Explainable Artificial Intelligence (XAI)
