Automated Image Captioning for Rapid Prototyping and Resource Constrained Environments
Karan Sharma, Arun CS Kumar, Suchendra Bhandarkar

TL;DR
This paper proposes a scalable, resource-efficient approach to automated image captioning: it detects the few most significant (top) objects in an image and uses word embeddings to infer the associated actions, which makes it suitable for resource-constrained environments.
Contribution
The paper's central insight is that detecting a few key objects in an image is enough to infer relevant actions (verbs) and produce a relevant caption, which favors simple, scalable image captioning systems.
Findings
Predicted actions effectively from detected objects and word embeddings
Achieved reasonable captioning performance with low complexity
Reduced system development time and resource requirements
Abstract
Significant performance gains in deep learning coupled with the exponential growth of image and video data on the Internet have resulted in the recent emergence of automated image captioning systems. Ensuring scalability of automated image captioning systems with respect to the ever-increasing volume of image and video data is a significant challenge. This paper provides a valuable insight in that the detection of a few significant (top) objects in an image allows one to extract other relevant information such as actions (verbs) in the image. We expect this insight to be useful in the design of scalable image captioning systems. We address two parameters by which the scalability of image captioning systems could be quantified, i.e., the traditional algorithmic time complexity, which is important given the resource limitations of the user device, and the system development time since the…
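The idea of inferring an action (verb) from a few detected objects can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the embeddings are used, so the scoring rule here (mean cosine similarity between each candidate verb and the detected object labels) is an assumption, and the function name `infer_action`, the toy 3-dimensional vectors, and the small vocabulary are all hypothetical. A real system would presumably load pretrained word embeddings and take object labels from a detector.

```python
import math

# Hypothetical toy embeddings, invented for illustration only.
# A real system would use pretrained word vectors instead.
TOY_EMBEDDINGS = {
    "horse":  [0.9, 0.1, 0.0],
    "person": [0.6, 0.4, 0.2],
    "ball":   [0.0, 0.2, 0.9],
    "ride":   [0.8, 0.3, 0.1],
    "kick":   [0.1, 0.3, 0.9],
    "eat":    [0.1, 0.9, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def infer_action(detected_objects, candidate_verbs, embeddings=TOY_EMBEDDINGS):
    """Return the candidate verb with the highest mean cosine similarity
    to the detected object labels (an assumed scoring rule)."""
    def score(verb):
        sims = [cosine(embeddings[verb], embeddings[obj]) for obj in detected_objects]
        return sum(sims) / len(sims)

    return max(candidate_verbs, key=score)


if __name__ == "__main__":
    verbs = ["ride", "kick", "eat"]
    print(infer_action(["person", "horse"], verbs))
    print(infer_action(["person", "ball"], verbs))
```

Because scoring is a handful of dot products per candidate verb, the cost grows linearly with the number of detected objects and candidate verbs, which fits the low-complexity framing of the abstract.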
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
