Multi-Image Summarization: Textual Summary from a Set of Cohesive Images
Nicholas Trieu, Sebastian Goodman, Pradyumna Narayana, Kazoo Sone, Radu Soricut

TL;DR
This paper introduces the task of multi-image summarization and develops a Transformer-based model that generates a concise textual summary from a set of cohesive images, with improved coherence and fewer hallucinations.
Contribution
It extends image captioning models to handle multiple images, proposing a dense feature aggregation method and demonstrating the benefits of pretraining on single-image captioning.
Findings
Aggregated image features outperform individual embeddings.
Pretraining reduces hallucinations in generated summaries.
The model produces more coherent multi-image summaries.
Abstract
Multi-sentence summarization is a well studied problem in NLP, while generating image descriptions for a single image is a well studied problem in Computer Vision. However, for applications such as image cluster labeling or web page summarization, summarizing a set of images is also a useful and challenging task. This paper proposes the new task of multi-image summarization, which aims to generate a concise and descriptive textual summary given a coherent set of input images. We propose a model that extends the image-captioning Transformer-based architecture for single image to multi-image. A dense average image feature aggregation network allows the model to focus on a coherent subset of attributes across the input images. We explore various input representations to the Transformer network and empirically show that aggregated image features are superior to individual image embeddings.…
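As a rough illustration of the aggregation idea mentioned in the abstract, the sketch below averages per-image feature vectors and passes the result through one dense layer. It is not the authors' implementation: the abstract does not specify the architecture, so the function names (`average_features`, `dense_relu`, `aggregate`), the ReLU activation, and the choice to average before the dense layer are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def average_features(image_features: List[Vector]) -> Vector:
    """Element-wise mean over per-image feature vectors (order-invariant)."""
    if not image_features:
        raise ValueError("need at least one image feature vector")
    dim = len(image_features[0])
    if any(len(f) != dim for f in image_features):
        raise ValueError("all feature vectors must have the same length")
    n = len(image_features)
    return [sum(f[i] for f in image_features) / n for i in range(dim)]


def dense_relu(x: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """One dense layer, relu(W x + b), with W given as a list of rows."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


def aggregate(image_features: List[Vector],
              weights: List[Vector],
              bias: Vector) -> Vector:
    """Hypothetical aggregation: average the image features, then project.

    Assumption: the averaging happens before the dense layer; the paper
    may order or parameterize these steps differently.
    """
    return dense_relu(average_features(image_features), weights, bias)


if __name__ == "__main__":
    # Two toy 2-dimensional "image embeddings" and a 2x2 projection.
    features = [[1.0, 3.0], [3.0, 5.0]]
    weights = [[1.0, 0.0], [0.0, -1.0]]
    bias = [0.0, 0.0]
    print(aggregate(features, weights, bias))
```

Because the mean ignores input order, the aggregated vector is the same for any permutation of the image set, which is one plausible reason to aggregate before decoding rather than feed each image embedding separately.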
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Multi-Head Attention · Adam · Dropout · Byte Pair Encoding
