# Talking with Actionbits—A Part-Enhanced VLM for Action and Interaction Recognition in Animals

**Authors:** Yang Yang, Ren Nakagawa, Risa Shinoda, Hiroaki Santo, Kenji Oyama, Takenao Ohkawa, Fumio Okura

PMC · DOI: 10.3390/s26061969 · Sensors (Basel, Switzerland) · 2026-03-21

## TL;DR

This paper introduces AIRA, a framework built on a vision–language model (VLM) that recognizes animal actions and interactions from body-part and motion cues, improving recognition across multiple benchmarks and supporting cross-species generalization.

## Contribution

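The paper contributes Actionbit tokens, compact LLM-generated representations of which body parts move and how, and Part-Enhanced Prompt Fine-tuning (PEPF), which makes a vision–language model explicitly sensitive to part and pose cues for animal behavior analysis.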

## Key findings

- AIRA improves robustness to background noise and enables cross-species generalization through a unified mammal-centric part ontology.
- Experiments show consistent improvements in action and interaction recognition across multiple benchmarks.
- Action-centered adaptation and relational reasoning are highlighted as crucial for understanding animal behavior.

## Abstract

Understanding animal actions and interactions is essential for behavior analysis and ecological monitoring. Although large-scale in-the-wild datasets have advanced animal action recognition, existing methods still struggle with fine-grained motion, spatial relations, and multi-individual interactions. To address these challenges, we introduce AIRA, a unified framework for Action and Interaction Recognition in Animals. Built upon a vision–language model (VLM), AIRA learns in an action-centered representation space defined by body parts and their corresponding motions, thereby improving robustness to background noise and enabling cross-species generalization via a unified mammal-centric part ontology. To model actions, we treat body parts and motion as primary cues and introduce Actionbit tokens—compact representations for parts and motions generated by a large language model (LLM) that encode which parts move and how. We further propose Part-Enhanced Prompt Fine-tuning (PEPF) to make the VLM explicitly sensitive to part and pose cues. Within PEPF, the Action–actionbit Alignment (AbA) module enriches action representations with fine-grained part–motion semantics, and Part-Vision Prompting (PVP) extracts keyframes through action-aware prompting. Experiments across multiple benchmarks show consistent improvements in both action and interaction recognition, highlighting the importance of action-centered adaptation and relational reasoning for understanding animal behavior in the wild.
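The pipeline described in the abstract can be pictured with a toy sketch. It is an illustration under stated assumptions, not the authors' implementation: the `ACTIONBITS` entries, the trigram-based `embed` function (a stand-in for a real VLM text encoder), the averaging used to enrich an action embedding, and the cosine-based `select_keyframes` are all hypothetical choices made here to show the general idea of combining part–motion phrases with an action label and then picking the frames that best match it.

```python
from math import sqrt

# Hypothetical actionbits: (body part, motion) pairs per action.
# The paper generates these with an LLM; the entries below are invented examples.
ACTIONBITS = {
    "grazing": [("head", "lowered"), ("jaw", "chewing")],
    "running": [("legs", "fast alternating"), ("torso", "moving forward")],
}


def embed(text, dim=16):
    """Toy stand-in for a text encoder: deterministic hashed character trigrams."""
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        tri = text[i:i + 3]
        bucket = sum(ord(c) * (j + 1) for j, c in enumerate(tri)) % dim
        vec[bucket] += 1.0
    return vec


def normalize(v):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    n = sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else v


def cosine(a, b):
    """Cosine similarity; evaluates to 0.0 if either vector is all zeros."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def enriched_action_embedding(action):
    """Average the action-label embedding with its actionbit phrase embeddings.

    Averaging is an assumption made for this sketch, not the paper's AbA module.
    """
    phrases = [action] + [f"{part} {motion}" for part, motion in ACTIONBITS[action]]
    vecs = [normalize(embed(p)) for p in phrases]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def select_keyframes(frame_feats, action_vec, k=2):
    """Return the indices of the k frames most similar to action_vec, in temporal order.

    Top-k cosine ranking is an assumption made for this sketch, not the paper's PVP module.
    """
    ranked = sorted(
        range(len(frame_feats)),
        key=lambda i: cosine(frame_feats[i], action_vec),
        reverse=True,
    )
    return sorted(ranked[:k])
```

Calling `select_keyframes(frame_feats, enriched_action_embedding("grazing"))` with per-frame feature vectors of the same dimension returns the frame indices that best match the enriched action. In a real system the frame features and text embeddings would come from the VLM's own encoders.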

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13030406/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13030406/full.md

## References

48 references — full list in the complete paper: https://tomesphere.com/paper/PMC13030406/full.md

---
Source: https://tomesphere.com/paper/PMC13030406