SKI Models: Skeleton Induced Vision-Language Embeddings for Understanding Activities of Daily Living

Arkaprava Sinha; Dominick Reilly; Francois Bremond; Pu Wang; Srijan Das

arXiv:2502.03459·cs.CV·February 6, 2025

Open Access

TL;DR

This paper introduces SKI models that integrate 3D skeleton data into vision-language embeddings, improving zero-shot activity recognition and captioning for daily living videos without needing skeleton data during inference.

Contribution

The paper presents SKI models that incorporate skeleton information into vision-language models via collaborative training, enabling better generalization to unseen activities.

Findings

01. Enhanced zero-shot activity recognition accuracy

02. Improved video captioning performance on ADL datasets

03. Skeleton integration improves model robustness without requiring skeleton data at inference

Abstract

The introduction of vision-language models like CLIP has enabled the development of foundational video models capable of generalizing to unseen videos and human actions. However, these models are typically trained on web videos, which often fail to capture the challenges present in Activities of Daily Living (ADL) videos. Existing works address ADL-specific challenges, such as similar appearances, subtle motion patterns, and multiple viewpoints, by combining 3D skeletons and RGB videos. However, these approaches are not integrated with language, limiting their ability to generalize to unseen action classes. In this paper, we introduce SKI models, which integrate 3D skeletons into the vision-language embedding space. SKI models leverage a skeleton-language model, SkeletonCLIP, to infuse skeleton information into Vision Language Models (VLMs) and Large Vision Language Models (LVLMs)…
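The abstract builds on CLIP-style generalization to unseen action classes. As a rough illustration of that general idea only, the sketch below scores a video embedding against text embeddings of candidate action labels by cosine similarity and picks the best match. It is not the SKI training procedure or SkeletonCLIP. The function names, the toy 3-dimensional embeddings, the label strings, and the temperature value are all assumptions made for this example; in practice the embeddings would come from trained video and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def zero_shot_classify(video_emb, text_embs, temperature=0.07):
    """Pick the label whose text embedding is closest to the video embedding.

    text_embs maps a label string to its embedding. The temperature is an
    assumed value for illustration, not one reported for SKI models.
    """
    labels = list(text_embs)
    logits = [cosine(video_emb, text_embs[l]) / temperature for l in labels]
    # Numerically stable softmax over the similarity logits.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = {l: e / total for l, e in zip(labels, exps)}
    return max(probs, key=probs.get), probs


# Hypothetical toy embeddings standing in for encoder outputs.
text_embs = {
    "drinking from a cup": [0.9, 0.1, 0.0],
    "reading a book": [0.0, 1.0, 0.1],
    "using a phone": [0.1, 0.0, 1.0],
}
video_emb = [0.8, 0.2, 0.1]

label, probs = zero_shot_classify(video_emb, text_embs)
print(label)
```

Because labels enter only as text embeddings, new action classes can be added at inference time by embedding their descriptions, with no retraining of the classifier head.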

Taxonomy

Topics: Human Pose and Action Recognition