Multi-Modal Graph Convolutional Network with Sinusoidal Encoding for Robust Human Action Segmentation

Hao Xing; Kai Zhe Boey; Yuankai Wu; Darius Burschka; Gordon Cheng

arXiv:2507.00752·cs.CV·December 12, 2025

TL;DR

This paper introduces a multi-modal graph convolutional network with sinusoidal encoding and a novel data augmentation technique to improve the accuracy and temporal coherence of human action segmentation, especially under noisy conditions.

Contribution

The paper presents a novel multi-modal GCN framework with sinusoidal encoding, hierarchical feature fusion, and SmoothLabelMix augmentation for robust human action segmentation.

Findings

01. Achieves state-of-the-art segmentation accuracy on the Bimanual Actions Dataset.

02. Reduces over-segmentation errors on noisy pose and object-detection inputs.

03. Remains robust when the visual stream is sampled at a low frame rate.
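
To make "over-segmentation" concrete, the sketch below collapses a frame-wise label sequence into contiguous segments and compares segment counts. It is an illustration only, not the paper's evaluation protocol; the helper name `segments` and the labels `reach` and `hold` are hypothetical.

```python
from itertools import groupby


def segments(labels):
    """Collapse frame-wise labels into (label, start, end_exclusive) runs."""
    out = []
    start = 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        out.append((label, start, start + length))
        start += length
    return out


# Hypothetical 10-frame example: the true sequence has two actions.
ground_truth = ["reach"] * 5 + ["hold"] * 5
# A noisy prediction flickers between labels near the boundary.
noisy_pred = ["reach"] * 3 + ["hold"] + ["reach"] + ["hold"] * 5

gt_segments = segments(ground_truth)
pred_segments = segments(noisy_pred)
# More predicted segments than true segments indicates fragmentation.
extra_segments = len(pred_segments) - len(gt_segments)
```

Here most frames are labeled correctly, yet the prediction contains two spurious segments. This is why frame-wise accuracy alone can hide fragmentation.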

Abstract

Accurate temporal segmentation of human actions is critical for intelligent robots in collaborative settings, where a precise understanding of sub-activity labels and their temporal structure is essential. However, the inherent noise in both human pose estimation and object detection often leads to over-segmentation errors, disrupting the coherence of action sequences. To address this, we propose a Multi-Modal Graph Convolutional Network (MMGCN) that integrates low-frame-rate (e.g., 1 fps) visual data with high-frame-rate (e.g., 30 fps) motion data (skeleton and object detections) to mitigate fragmentation. Our framework introduces three key contributions. First, a sinusoidal encoding strategy that maps 3D skeleton coordinates into a continuous sin-cos space to enhance spatial representation robustness. Second, a temporal graph fusion module that aligns multi-modal inputs with differing…
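
The abstract describes mapping 3D skeleton coordinates into a continuous sin-cos space. A minimal sketch of one such mapping follows. The exact formulation is not given in the text above, so the function name `sinusoidal_encode`, the `num_freqs` parameter, and the power-of-two frequency bands are all assumptions made for illustration.

```python
import numpy as np


def sinusoidal_encode(joints, num_freqs=4):
    """Map (T, J, 3) joint coordinates to (T, J, 3 * 2 * num_freqs) features.

    Assumption: frequencies are pi * 2**k for k = 0..num_freqs-1; the paper
    may use a different schedule or scaling.
    """
    joints = np.asarray(joints, dtype=np.float64)
    freqs = np.pi * 2.0 ** np.arange(num_freqs)        # shape (F,)
    angles = joints[..., None] * freqs                  # shape (T, J, 3, F)
    # For each coordinate, store F sine values followed by F cosine values.
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*joints.shape[:-1], -1)


# Hypothetical input: 5 frames, 15 joints, xyz coordinates.
rng = np.random.default_rng(0)
skeleton = rng.normal(size=(5, 15, 3))
encoded = sinusoidal_encode(skeleton, num_freqs=4)
```

Every encoded value lies in [-1, 1] regardless of the raw coordinate scale, so a large outlier in a noisy pose estimate cannot blow up the feature magnitude. That is one plausible reason such an encoding could help robustness.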

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Robot Manipulation and Learning