# MM-SeR: Multimodal Self-Refinement for Lightweight Image Captioning

**Authors:** Junha Song, Yongsik Jo, So Yeon Min, Quanting Xie, Taehwan Kim, Yonatan Bisk, Jaegul Choo

arXiv: 2508.21451 · 2025-12-15

## TL;DR

This paper introduces a lightweight image captioning model that swaps the large language model inside an MLLM for a compact 125M-parameter one and adds a multimodal self-refinement framework inspired by human visual perception, achieving performance comparable to large MLLMs at a lower computational cost.

## Contribution

The paper presents a multimodal self-refinement framework for lightweight image captioning and shows that a compact 125M-parameter language model, 93x smaller than the language component it replaces, can achieve performance comparable to large MLLMs on factual captioning.

## Key findings

- A compact 125M-parameter model achieves performance comparable to large MLLMs on factual image captioning.
- Self-refinement improves captioning reliability and detail.
- Model extends effectively to long-range video question answering.

## Abstract

Systems such as video chatbots and navigation robots often depend on streaming image captioning to interpret visual inputs. Existing approaches typically employ large multimodal language models (MLLMs) for this purpose, but their substantial computational cost hinders practical application. This limitation motivates our development of a lightweight captioning model. Our investigation begins by replacing the large-scale language component in MLLMs with a compact 125M-parameter model. Surprisingly, this compact model, despite a 93x reduction in size, achieves comparable performance to MLLMs, suggesting that factual image captioning does not significantly require the complex reasoning abilities of LLMs. Despite this promising result, our lightweight model still lacks reliability. To address this, we draw inspiration from the human visual process: perceiving a global and coarse understanding of the scene before attending to finer details. Accordingly, we propose a multimodal self-refinement framework that guides the model to utilize features from salient regions, identified by referencing the previous coarse caption, and to produce a refined description. Experimental results demonstrate the superiority of our model in both single-sentence and detailed captioning, extending even to long-range video QA tasks.
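The coarse-to-fine loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`select_salient_regions`, `self_refine`), the dot-product saliency score, the top-`k` selection, and the toy captioner and text embedder are all assumptions made for illustration, since this summary does not specify how salient regions are identified or how features are fed back to the model.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_salient_regions(
    region_feats: List[Vector], caption_emb: Vector, k: int
) -> List[int]:
    """Return indices of the k regions most similar to the caption embedding.

    Assumption: saliency is scored by a dot product; the paper may use a
    different mechanism.
    """
    scores = [dot(f, caption_emb) for f in region_feats]
    ranked = sorted(range(len(region_feats)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # keep original spatial order of the chosen regions


def self_refine(
    region_feats: List[Vector],
    captioner: Callable[[List[Vector], Optional[str]], str],
    embed_text: Callable[[str], Vector],
    k: int = 2,
) -> Tuple[str, List[int], str]:
    """Two-pass captioning: coarse caption first, then a refined one."""
    # Pass 1: global, coarse description from all region features.
    coarse = captioner(region_feats, None)
    # Use the coarse caption as a reference to pick salient regions.
    idx = select_salient_regions(region_feats, embed_text(coarse), k)
    salient = [region_feats[i] for i in idx]
    # Pass 2: refined description conditioned on salient features and the coarse caption.
    refined = captioner(salient, coarse)
    return coarse, idx, refined


# Hypothetical stand-ins so the sketch runs without a real model.
def toy_captioner(feats: List[Vector], prior: Optional[str]) -> str:
    if prior is None:
        return f"coarse caption from {len(feats)} regions"
    return f"refined caption from {len(feats)} salient regions (prior: {prior})"


def toy_embed(text: str) -> Vector:
    return [1.0, 0.0]  # fixed embedding, for illustration only


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, 0.2]]
    print(self_refine(feats, toy_captioner, toy_embed, k=2))
```

In a real system the captioner would be the compact vision-language model and the region features would come from its vision encoder; the sketch only shows the control flow of captioning, selecting regions by reference to the first caption, and captioning again.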

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.21451/full.md

## Figures

25 figures with captions in the complete paper: https://tomesphere.com/paper/2508.21451/full.md

---
Source: https://tomesphere.com/paper/2508.21451