An LLM-Empowered Low-Resolution Vision System for On-Device Human Behavior Understanding

Siyang Jiang; Bufang Yang; Lilin Xu; Mu Yuan; Yeerzhati Abudunuer; Kaiwei Liu; Liekang Zeng; Hongkai Chen; Zhenyu Yan; Xiaofan Jiang; Guoliang Xing

arXiv:2505.01743 · cs.CV · May 6, 2025

Open Access

TL;DR

This paper introduces Llambda, a low-resource, on-device system that enhances human behavior understanding from low-resolution videos by leveraging limited labeled data, contrastive learning, and efficient fine-tuning of large vision language models.

Contribution

It proposes a novel system combining contrastive-oriented pseudo labeling, physical-knowledge guided captioning, and LoRA-based fine-tuning for low-resolution human behavior understanding.
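The summary names LoRA-based fine-tuning but gives no configuration, so the following is only a minimal sketch of the general LoRA idea, not Llambda's implementation. It assumes the standard formulation in which a frozen weight matrix W is adapted as W + (alpha / r) · B · A, with A of shape (r, d_in) and B of shape (d_out, r). All names, sizes, and the helper functions are hypothetical and chosen for illustration.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank."""
    r = len(a)                      # A has shape (r, d_in)
    scale = alpha / r
    delta = matmul(b, a)            # (d_out, r) @ (r, d_in) -> (d_out, d_in)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def trainable_params(d_out, d_in, r):
    """Return (full fine-tuning count, LoRA adapter count) for one layer."""
    return d_out * d_in, r * (d_out + d_in)

random.seed(0)
d_out, d_in, r, alpha = 4, 6, 2, 4.0    # hypothetical toy sizes
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]    # trainable
B = [[0.0] * r for _ in range(d_out)]   # trainable, zero-initialised

# With B at zero, the adapted layer starts out identical to the frozen one.
W_eff = lora_effective_weight(W, A, B, alpha)

# Parameter counts for a hypothetical 4096 x 4096 layer at rank 8.
full, lora = trainable_params(4096, 4096, 8)
```

Under these assumed sizes, the adapter trains 65,536 values instead of 16,777,216 for the full layer, which is the kind of reduction that makes fine-tuning on a resource-constrained device plausible.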

Findings

01

Llambda outperforms state-of-the-art LVLMs by up to 40.03% in BERTScore.

02

Effective pseudo labeling improves low-resolution video captioning.

03

Efficient on-device fine-tuning enables practical deployment.

Abstract

The rapid advancement of Large Vision Language Models (LVLMs) offers the potential to surpass conventional labeling by generating richer, more detailed descriptions for on-device human behavior understanding (HBU) in low-resolution vision systems, such as depth, thermal, and infrared. However, existing LVLM approaches do not understand low-resolution data well because they are primarily designed for high-resolution data, such as RGB images. A quick fix is to caption a large amount of low-resolution data, but this requires significant, labor-intensive annotation effort. In this paper, we propose a novel, labor-saving system, Llambda, designed to support low-resolution HBU. The core idea is to leverage limited labeled data and a large amount of unlabeled data to guide LLMs in generating informative captions, which can be combined with…

Taxonomy

Topics: Video Surveillance and Tracking Methods · Anomaly Detection Techniques and Applications · Currency Recognition and Detection