LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living

Dominick Reilly; Rajatsubhra Chakraborty; Arkaprava Sinha; Manish Kumar Govind; Pu Wang; Francois Bremond; Le Xue; Srijan Das

arXiv:2406.09390 · cs.CV · March 27, 2025

TL;DR

This paper introduces LLAVIDAL, a multimodal large language vision model specialized for Activities of Daily Living (ADL). The model is trained on a new dataset, ADL-X, and outperforms prior models on fine-grained ADL understanding tasks.

Contribution

The paper presents a semi-automated framework for creating ADL datasets, a new multimodal dataset (ADL-X), and a novel training strategy (MMPro) for improved ADL activity modeling.

Findings

01. LLAVIDAL achieves state-of-the-art results on ADL benchmarks.

02. The multimodal curriculum training improves model performance.

03. The ADL-X dataset enhances fine-grained activity understanding.

Abstract

Current Large Language Vision Models (LLVMs) trained on web videos perform well in general video understanding but struggle with fine-grained details, complex human-object interactions (HOI), and view-invariant representation learning essential for Activities of Daily Living (ADL). This limitation stems from a lack of specialized ADL video instruction-tuning datasets and insufficient modality integration to capture discriminative action representations. To address this, we propose a semi-automated framework for curating ADL datasets, creating ADL-X, a multiview, multimodal RGBS instruction-tuning dataset. Additionally, we introduce LLAVIDAL, an LLVM integrating videos, 3D skeletons, and HOIs to model ADL's complex spatiotemporal relationships. For training LLAVIDAL, a simple joint alignment of all modalities yields suboptimal results; thus, we propose a Multimodal Progressive (MMPro)…

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Context-Aware Activity Recognition Systems