From the Laboratory to Real-World Application: Evaluating Zero-Shot Scene Interpretation on Edge Devices for Mobile Robotics
Nicolas Schuler, Lea Dewald, Nick Baldig, Jürgen Graf

TL;DR
This paper evaluates the effectiveness of small Visual Language Models for scene interpretation and action recognition on edge devices in mobile robotics, highlighting their potential and limitations in real-world scenarios.
Contribution
It introduces a pipeline for deploying small VLMs on edge devices for scene understanding in mobile robotics, addressing computational constraints and real-world applicability.
Findings
Small VLMs can perform scene interpretation on edge devices with acceptable accuracy.
Challenges include model biases and inference time constraints.
Potential for real-world mobile robotics applications is demonstrated.
Abstract
Video Understanding, Scene Interpretation and Commonsense Reasoning are highly challenging tasks that enable the interpretation of visual information, allowing agents to perceive, interact with and make rational decisions in their environment. Large Language Models (LLMs) and Visual Language Models (VLMs) have shown remarkable advancements in these areas in recent years, enabling domain-specific applications as well as zero-shot open-vocabulary tasks that combine multiple domains. However, their computational requirements pose challenges for their application on edge devices and in the context of Mobile Robotics, especially considering the trade-off between accuracy and inference time. In this paper, we investigate the capabilities of state-of-the-art VLMs for the task of Scene Interpretation and Action Recognition, with special regard to small VLMs capable of being deployed to edge…
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Domain Adaptation and Few-Shot Learning
