Analyzing Transformer Models and Knowledge Distillation Approaches for Image Captioning on Edge AI

Wing Man Casca Kwok; Yip Chiu Tung; Kunal Bhagchandani

arXiv:2506.03607·cs.CV·June 5, 2025

Open Access

TL;DR

This paper evaluates transformer-based image captioning models for edge AI, demonstrating that knowledge distillation can enable efficient inference on resource-constrained devices without significant performance loss.

Contribution

The paper evaluates resource-efficient transformer models and applies knowledge distillation to speed up image captioning inference on edge devices.
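For readers unfamiliar with knowledge distillation, the following is a minimal sketch of the commonly used temperature-scaled formulation, in which a small student model is trained to match a larger teacher's softened output distribution in addition to the ground-truth token. It is not taken from the paper: the function names, the `temperature` and `alpha` parameters, and the toy logits are illustrative assumptions, and the authors' actual loss and training setup may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits (numerically stabilized)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, target_index,
                      temperature=2.0, alpha=0.5):
    """Illustrative distillation loss for one predicted token.

    Assumed formulation (not confirmed by the paper):
      alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy,
    where the KL term uses temperature-softened distributions and the
    cross-entropy term uses the student's unsoftened prediction.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)

    # Soft-target term: how far the student is from the teacher's distribution.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)

    # Hard-target term: cross-entropy against the ground-truth token index.
    hard = -math.log(softmax(student_logits)[target_index])

    return alpha * (temperature ** 2) * kl + (1 - alpha) * hard


if __name__ == "__main__":
    # Hypothetical logits over a three-word vocabulary.
    teacher = [3.0, 0.5, 0.1]
    student = [1.5, 1.0, 0.2]
    print(distillation_loss(student, teacher, target_index=0))
```

In this sketch, `alpha` trades off imitation of the teacher against fitting the reference caption, and a higher `temperature` exposes more of the teacher's relative preferences among less likely tokens.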

Findings

01. Knowledge distillation accelerates inference on edge devices.

02. Transformer models can be optimized for resource constraints.

03. Captioning performance is largely maintained while computational requirements are reduced.

Abstract

Edge computing decentralizes processing power to the network edge, enabling real-time AI-driven decision-making in IoT applications. In industrial automation settings such as robotics and rugged edge AI, real-time perception and intelligence are critical for autonomous operations. Deploying transformer-based image captioning models at the edge can enhance machine perception, improve scene understanding for autonomous robots, and aid in industrial inspection. However, these edge or IoT devices are often limited in computational resources to preserve physical agility, yet they have strict response time requirements. Traditional deep learning models can be too large and computationally demanding for these devices. In this research, we present findings on transformer-based models for image captioning that operate effectively on edge devices. By evaluating resource-effective transformer models and applying…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Visual Attention and Saliency Detection