# MV-SSM: Multi-View State Space Modeling for 3D Human Pose Estimation

**Authors:** Aviral Chharia, Wenbo Gou, Haoye Dong

arXiv: 2509.00649 · 2025-09-03

## TL;DR

MV-SSM is a multi-view state space framework for 3D human pose estimation that improves robustness under occlusion and generalization to unseen camera setups.

## Contribution

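The paper proposes MV-SSM, a Multi-View State Space Modeling framework that models the joint spatial sequence at both the multi-view image feature level and the person keypoint level. Its core component is a Projective State Space (PSS) block, which relies on Grid Token-guided Bidirectional Scanning (GTBS), a modification of Mamba's scanning, to improve generalization in multi-view 3D human pose estimation.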

## Key findings

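- Outperforms state-of-the-art methods in generalization experiments, including on CMU Panoptic and Campus A1
- Achieves +10.8 AP25 (+24%) in the challenging three-camera setting of CMU Panoptic
- Achieves +7.0 AP25 (+13%) under varying camera arrangements
- Improves cross-dataset generalization by +15.3 PCP (+38%) on Campus A1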
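The abstract pairs each absolute gain with a relative gain (+24%, +13%, +38%), so the score of the previous best method can be backed out approximately. The sketch below is a back-of-the-envelope check only: the helper name `implied_scores` is hypothetical, and because the published percentages are rounded, the recovered values are estimates, not figures from the paper.

```python
def implied_scores(abs_gain: float, rel_gain_pct: float) -> tuple:
    """Estimate (baseline, new score) from an absolute and a relative gain.

    Assumes rel_gain_pct = 100 * abs_gain / baseline, which is the usual
    meaning of a relative improvement; the paper may round differently.
    """
    if rel_gain_pct <= 0:
        raise ValueError("relative gain must be positive")
    baseline = abs_gain / (rel_gain_pct / 100.0)
    return baseline, baseline + abs_gain


# (absolute gain, relative gain in percent) as stated in the abstract.
reported = {
    "CMU Panoptic, three cameras (AP25)": (10.8, 24),
    "Varying camera arrangements (AP25)": (7.0, 13),
    "Campus A1, cross-dataset (PCP)": (15.3, 38),
}

for name, (gain, pct) in reported.items():
    baseline, new = implied_scores(gain, pct)
    print(f"{name}: baseline ~{baseline:.1f}, MV-SSM ~{new:.1f}")
```

Exact baselines should be taken from the tables in the full paper.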

## Abstract

While significant progress has been made in single-view 3D human pose estimation, multi-view 3D human pose estimation remains challenging, particularly in terms of generalizing to new camera configurations. Existing attention-based transformers often struggle to accurately model the spatial arrangement of keypoints, especially in occluded scenarios. Additionally, they tend to overfit specific camera arrangements and visual scenes from training data, resulting in substantial performance drops in new settings. In this study, we introduce a novel Multi-View State Space Modeling framework, named MV-SSM, for robustly estimating 3D human keypoints. We explicitly model the joint spatial sequence at two distinct levels: the feature level from multi-view images and the person keypoint level. We propose a Projective State Space (PSS) block to learn a generalized representation of joint spatial arrangements using state space modeling. Moreover, we modify Mamba's traditional scanning into an effective Grid Token-guided Bidirectional Scanning (GTBS), which is integral to the PSS block. Multiple experiments demonstrate that MV-SSM achieves strong generalization, outperforming state-of-the-art methods: +10.8 on AP25 (+24%) on the challenging three-camera setting in CMU Panoptic, +7.0 on AP25 (+13%) on varying camera arrangements, and +15.3 PCP (+38%) on Campus A1 in cross-dataset evaluations. Project Website: https://aviralchharia.github.io/MV-SSM
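To make the idea of bidirectional scanning concrete, here is a minimal sketch of a linear state space recurrence run over a token sequence in both directions. It is not the authors' GTBS or PSS block: the grid tokens, the projective component, and Mamba's input-dependent (selective) parameters are all omitted, and the function names and the scalar parameters `a`, `b`, `c` are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Params = Tuple[float, float, float]  # (a, b, c), hypothetical fixed parameters


def ssm_scan(tokens: Sequence[Sequence[float]], a: float, b: float, c: float) -> List[List[float]]:
    """Causal scan, applied per channel: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    if not tokens:
        return []
    state = [0.0] * len(tokens[0])  # one scalar state per channel
    outputs = []
    for token in tokens:
        state = [a * h + b * x for h, x in zip(state, token)]
        outputs.append([c * h for h in state])
    return outputs


def bidirectional_scan(tokens: Sequence[Sequence[float]], fwd: Params, bwd: Params) -> List[List[float]]:
    """Sum a forward scan and a backward scan so each token sees both sides."""
    forward = ssm_scan(tokens, *fwd)
    # Scan the reversed sequence, then flip the outputs back into original order.
    backward = ssm_scan(tokens[::-1], *bwd)[::-1]
    return [[f + g for f, g in zip(f_row, b_row)] for f_row, b_row in zip(forward, backward)]


# An impulse at the last token: the causal scan cannot carry it to earlier positions.
tokens = [[0.0], [0.0], [1.0]]
params = (0.5, 1.0, 1.0)
print(ssm_scan(tokens, *params))                    # [[0.0], [0.0], [1.0]]
print(bidirectional_scan(tokens, params, params))   # [[0.25], [0.5], [2.0]]
```

In the forward-only scan the first two positions receive nothing from the final token, while the bidirectional output propagates it backward with decay `a`. This is the general motivation for scanning in both directions when the token order (for example, a sequence of joints) has no natural causal direction.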

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2509.00649/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/2509.00649/full.md

## References

47 references — full list in the complete paper: https://tomesphere.com/paper/2509.00649/full.md

---
Source: https://tomesphere.com/paper/2509.00649