Temporal Visual Semantics-Induced Human Motion Understanding with Large Language Models

Zheng Xing; Weibing Zhao

arXiv:2512.22249 · cs.LG · December 30, 2025

TL;DR

This paper combines temporal vision semantics (TVS), extracted from human motion sequences by a large language model, with subspace clustering to improve unsupervised human motion segmentation, and reports outperforming state-of-the-art methods on four benchmark datasets.

Contribution

It proposes incorporating LLM-derived temporal semantic information into subspace clustering for human motion understanding, a combination that, according to the paper, had not been explored before.

Findings

01. Outperforms state-of-the-art methods on four datasets

02. Effective integration of LLM-derived semantics improves segmentation accuracy

03. Feedback mechanism enhances subspace embedding optimization

Abstract

Unsupervised human motion segmentation (HMS) can be effectively achieved using subspace clustering techniques. However, traditional methods overlook the role of temporal semantic exploration in HMS. This paper explores the use of temporal vision semantics (TVS) derived from human motion sequences, leveraging the image-to-text capabilities of a large language model (LLM) to enhance subspace clustering performance. The core idea is to extract textual motion information from consecutive frames via an LLM and incorporate this learned information into the subspace clustering framework. The primary challenge lies in learning TVS from human motion sequences using an LLM and integrating this information into subspace clustering. To address this, we determine whether consecutive frames depict the same motion by querying the LLM and subsequently learn temporal neighboring information based on its…
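The step described in the abstract, asking an LLM whether consecutive frames show the same motion and deriving temporal neighboring information from the answers, can be pictured with the minimal Python sketch below. It is an illustration under stated assumptions, not the authors' implementation: the LLM query is replaced by a hypothetical list of booleans (`same_motion`), and the function names `temporal_neighbor_matrix` and `segment_labels` are invented here. How the paper actually learns and uses this information is not visible in the truncated abstract.

```python
from typing import List


def temporal_neighbor_matrix(same_motion: List[bool]) -> List[List[float]]:
    """Build a symmetric 0/1 matrix linking consecutive frames judged to share a motion.

    same_motion[i] is the (hypothetical) LLM answer for the frame pair (i, i + 1).
    """
    n = len(same_motion) + 1  # one more frame than there are consecutive pairs
    weights = [[0.0] * n for _ in range(n)]
    for i, same in enumerate(same_motion):
        if same:
            weights[i][i + 1] = 1.0
            weights[i + 1][i] = 1.0
    return weights


def segment_labels(same_motion: List[bool]) -> List[int]:
    """Assign a segment index to each frame; a new segment starts at every 'different' answer."""
    labels = [0]
    for same in same_motion:
        labels.append(labels[-1] if same else labels[-1] + 1)
    return labels


# Hypothetical answers for 5 frames: the motion changes between frames 2 and 3.
answers = [True, True, False, True]
W = temporal_neighbor_matrix(answers)
labels = segment_labels(answers)
print(labels)  # [0, 0, 0, 1, 1]
```

A matrix like `W` could, for example, act as a temporal prior on the affinity used by a subspace clustering method, so that frames the LLM judged to share a motion are encouraged to fall in the same cluster. That use is an assumption made for illustration.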

Taxonomy

Topics: Human Pose and Action Recognition · Human Motion and Animation · Multimodal Machine Learning Applications