Towards Visually Grounded Multimodal Summarization via Cross-Modal Transformer and Gated Attention
Abid Ali, Diego Molla-Aliod, Usman Naseem

TL;DR
The paper introduces SPeCTrA-Sum, a multimodal summarization framework that improves visual-textual integration through hierarchical, layer-wise fusion and improves image selection through relevance prediction based on Determinantal Point Processes (DPPs).
Contribution
It proposes a unified model with depth-aware visual-text fusion and a DPP-based image selection method, improving semantic coherence and image relevance in multimodal summaries.
Findings
Produces more accurate, visually grounded summaries.
Selects more representative and diverse images.
Demonstrates the effectiveness of depth-aware fusion and DPP-based selection.
Abstract
Multimodal summarization requires models to jointly understand textual and visual inputs to generate concise, semantically coherent summaries. Existing methods often inject shallow visual features into deep language models, leading to representational mismatches and weak cross-modal grounding. We propose a unified framework that jointly performs text summarization and representative image selection. Our system, SPeCTrA-Sum (Sampler Perceiver with Cross-modal Transformer and gated Attention for Summarization), introduces two key innovations. First, a Deep Visual Processor (DVP) aligns the visual encoder with the language model at corresponding depths, enabling hierarchical, layer-wise fusion that preserves semantic consistency. Second, a lightweight Visual Relevance Predictor (VRP) selects salient and diverse images by distilling soft labels from a Determinantal Point Process (DPP)…
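
To make the "salient and diverse" selection idea concrete, here is a minimal sketch of greedy MAP inference for a DPP over candidate images. It is not the authors' implementation: the abstract is truncated before it describes how the DPP is built or how soft labels are derived from it, so the quality-times-similarity kernel, the function names (cosine, build_kernel, greedy_dpp), and the toy features and scores below are all assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def build_kernel(features, quality):
    """Assumed DPP kernel: L[i][j] = q_i * cos(f_i, f_j) * q_j.

    quality[i] is a hypothetical per-image relevance score; the
    similarity term penalizes picking near-duplicate images.
    """
    n = len(features)
    return [[quality[i] * cosine(features[i], features[j]) * quality[j]
             for j in range(n)] for i in range(n)]

def greedy_dpp(kernel, k, eps=1e-10):
    """Greedily pick up to k items that maximize the kernel determinant.

    Uses an incremental Cholesky-style update: gains[i] tracks the
    determinant gain of adding item i to the current selection.
    """
    n = len(kernel)
    gains = [kernel[i][i] for i in range(n)]
    chol = [[] for _ in range(n)]  # partial Cholesky rows per item
    selected = []
    while len(selected) < min(k, n):
        candidates = [i for i in range(n) if i not in selected]
        j = max(candidates, key=lambda i: gains[i])
        if gains[j] < eps:  # remaining items add no new information
            break
        selected.append(j)
        root = math.sqrt(gains[j])
        for i in candidates:
            if i == j:
                continue
            overlap = sum(a * b for a, b in zip(chol[j], chol[i]))
            e = (kernel[j][i] - overlap) / root
            chol[i].append(e)
            gains[i] -= e * e
    return selected

# Toy data (hypothetical): image 1 is a near-duplicate of image 0.
features = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
quality = [1.0, 0.9, 0.8]
picked = greedy_dpp(build_kernel(features, quality), k=2)
```

On this toy input, ranking by relevance alone would return images 0 and 1, while the DPP sketch returns images 0 and 2, because the near-duplicate contributes almost nothing to the determinant once image 0 is chosen.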
