Shifted Window Fourier Transform And Retention For Image Captioning

Jia Cheng Hu; Roberto Cavicchioli; Alessandro Capotondi

arXiv:2408.13963 · cs.CV · August 27, 2024

Open Access

TL;DR

This paper introduces SwiFTeR, a Fourier Transform-based architecture for image captioning that is lightweight, efficient, and scalable, aiming to improve performance on resource-constrained devices.

Contribution

SwiFTeR is a novel architecture based almost entirely on Fourier Transform and Retention. It targets the two main efficiency bottlenecks of light image captioning models: the cost of the visual backbone and the quadratic cost of the decoder.

Findings

01. SwiFTeR has only 20M parameters and requires 3.1 GFLOPs for a single forward pass.

02. It can generate 400 captions per second.

03. Caption quality is currently lower, which the paper attributes to incomplete training rather than to limitations of the architecture.

Abstract

Image Captioning is an important Language and Vision task that finds application in a variety of contexts, ranging from healthcare to autonomous vehicles. As many real-world applications rely on devices with limited resources, much effort in the field has been put into the development of lighter and faster models. However, most current optimizations focus on the Transformer architecture, despite the existence of more efficient methods. In this work, we introduce SwiFTeR, an architecture almost entirely based on Fourier Transform and Retention, to tackle the main efficiency bottlenecks of current light image captioning models, namely the computational burden of the visual backbone and the decoder's quadratic cost. SwiFTeR has only 20M parameters and requires 3.1 GFLOPs for a single forward pass. Additionally, it showcases superior scalability to the caption length and its small memory…
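
The abstract does not say how Retention is implemented in SwiFTeR. As a generic illustration of why a retention-style decoder can avoid the quadratic cost mentioned above, the sketch below assumes the standard single-head retention formulation with a scalar decay `gamma`. It compares a parallel form, which scores every pair of positions, with a recurrent form, which carries a fixed-size state from one token to the next. The function names, the shapes, and the value of `gamma` are illustrative assumptions, not taken from the paper.

```python
import random


def retention_parallel(Q, K, V, gamma):
    """Parallel form: out[n] = sum over m <= n of gamma**(n-m) * (Q[n] . K[m]) * V[m].

    Every pair of positions is scored, so the work grows quadratically
    with the sequence length.
    """
    d_v = len(V[0])
    out = []
    for n in range(len(Q)):
        row = [0.0] * d_v
        for m in range(n + 1):  # causal: only positions m <= n contribute
            score = gamma ** (n - m) * sum(q * k for q, k in zip(Q[n], K[m]))
            for j in range(d_v):
                row[j] += score * V[m][j]
        out.append(row)
    return out


def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + K[n]^T V[n], out[n] = Q[n] S_n.

    The state S has a fixed size (d_k x d_v), so each new token costs the
    same amount of work regardless of how many tokens came before it.
    """
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q, k, v in zip(Q, K, V):
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] = gamma * S[i][j] + k[i] * v[j]
        out.append([sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return out


def random_matrix(rows, cols, rng):
    """Helper for the demo only: a rows x cols matrix of values in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    Q, K, V = (random_matrix(6, 4, rng) for _ in range(3))
    a = retention_parallel(Q, K, V, gamma=0.9)
    b = retention_recurrent(Q, K, V, gamma=0.9)
    # Both forms compute the same outputs up to floating-point error.
    print(max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)))
```

Both functions compute the same result. The recurrent form is the one relevant to caption generation, since producing one more token only requires updating the state rather than revisiting all previous tokens.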

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques

Methods: Linear Layer · Adam · Layer Normalization · Attention Is All You Need · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Multi-Head Attention · Byte Pair Encoding · Absolute Position Encodings