Shifted Window Fourier Transform And Retention For Image Captioning
Jia Cheng Hu, Roberto Cavicchioli, Alessandro Capotondi

TL;DR
This paper introduces SwiFTeR, a Fourier Transform-based architecture for image captioning that is lightweight, efficient, and scalable, aiming to improve performance on resource-constrained devices.
Contribution
SwiFTeR is a novel Fourier Transform and Retention-based architecture that significantly reduces computational complexity and memory usage in image captioning models.
Findings
SwiFTeR has only 20M parameters and requires 3.1 GFLOPs per forward pass.
It can generate 400 captions per second, demonstrating high efficiency.
Caption quality is currently lower than that of comparable models; the authors attribute this to incomplete training rather than to limitations of the architecture.
Abstract
Image Captioning is an important Language and Vision task that finds application in a variety of contexts, ranging from healthcare to autonomous vehicles. As many real-world applications rely on devices with limited resources, much effort in the field has been put into the development of lighter and faster models. However, most current optimizations focus on the Transformer architecture, even though more efficient methods exist. In this work, we introduce SwiFTeR, an architecture almost entirely based on Fourier Transform and Retention, to tackle the main efficiency bottlenecks of current light image captioning models: the cost of the visual backbone and the quadratic cost of the decoder. SwiFTeR has only 20M parameters and requires 3.1 GFLOPs for a single forward pass. Additionally, it showcases superior scalability to the caption length and its small memory…
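The abstract's point about the decoder's quadratic cost can be illustrated with a minimal sketch. It assumes that "Retention" refers to the retention mechanism of Retentive Networks; the abstract does not spell out the formulation, so this is not the authors' implementation. The sketch omits positional rotation, multi-scale decay, and normalization, and all names (`retention_parallel`, `retention_recurrent`, `gamma`) are illustrative. It computes the same output two ways: a parallel form whose work grows quadratically with sequence length, and a recurrent form that keeps a fixed-size state and does constant work per generated token.

```python
import random

def retention_parallel(Q, K, V, gamma):
    """Quadratic form: step n looks back at every earlier step m <= n."""
    n_steps, d_v = len(Q), len(V[0])
    out = []
    for n in range(n_steps):
        o = [0.0] * d_v
        for m in range(n + 1):
            # Query-key similarity, decayed by the distance n - m.
            w = gamma ** (n - m) * sum(q * k for q, k in zip(Q[n], K[m]))
            for j in range(d_v):
                o[j] += w * V[m][j]
        out.append(o)
    return out

def retention_recurrent(Q, K, V, gamma):
    """Linear form: a fixed-size state S (d_k x d_v) is updated once per step."""
    d_k, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q, k, v in zip(Q, K, V):
        # State update: S <- gamma * S + outer(k, v)
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] = gamma * S[i][j] + k[i] * v[j]
        # Output for this step: o = q @ S
        out.append([sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)])
    return out

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Toy sizes, chosen only for illustration: 6 steps, dimension 4.
rng = random.Random(0)
Q, K, V = (random_matrix(6, 4, rng) for _ in range(3))
par = retention_parallel(Q, K, V, gamma=0.9)
rec = retention_recurrent(Q, K, V, gamma=0.9)

# Largest element-wise gap between the two forms (floating-point noise only).
max_diff = max(abs(a - b) for ra, rb in zip(par, rec) for a, b in zip(ra, rb))
```

Because the recurrent form carries only the state `S` from one step to the next, its memory use does not grow with the caption length, which is consistent with the scalability claim in the abstract.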
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
Methods: Linear Layer · Adam · Layer Normalization · Attention Is All You Need · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Multi-Head Attention · Byte Pair Encoding · Absolute Position Encodings
