CLHOP: Combined Audio-Video Learning for Horse 3D Pose and Shape Estimation
Ci Li, Elin Hernlund, Hedvig Kjellström, Silvia Zuffi

TL;DR
This paper introduces CLHOP, a novel method that combines audio and visual data to improve 3D horse pose and shape estimation from monocular videos, demonstrating enhanced accuracy and robustness.
Contribution
This is the first study to explore the use of audio in 3D animal motion recovery. It introduces a new outdoor dataset and reports improved results over visual-only methods.
Findings
Audio-visual integration improves 3D pose accuracy.
New outdoor horse movement dataset introduced.
Enhanced robustness in motion estimation.
Abstract
In the monocular setting, predicting 3D pose and shape of animals typically relies solely on visual information, which is highly under-constrained. In this work, we explore using audio to enhance 3D shape and motion recovery of horses from monocular video. We test our approach on two datasets: an indoor treadmill dataset for 3D evaluation and an outdoor dataset capturing diverse horse movements, the latter being a contribution to this study. Our results show that incorporating sound with visual data leads to more accurate and robust motion regression. This study is the first to investigate audio's role in 3D animal motion recovery.
Taxonomy
Topics: Human Motion and Animation · Music and Audio Processing
