Semantic2Graph: Graph-based Multi-modal Feature Fusion for Action Segmentation in Videos
Junbin Zhang, Pei-Hsuan Tsai, Meng-Hsun Tsai

TL;DR
Semantic2Graph introduces a graph-based multi-modal feature fusion method for video action segmentation that models long-term dependencies efficiently, reducing computational costs while improving accuracy over previous models.
Contribution
The paper presents Semantic2Graph, a novel graph-structured approach that effectively captures long-term semantic relationships in videos using multi-modal features and GNNs, outperforming prior methods.
Findings
Semantic2Graph outperforms state-of-the-art methods on the GTEA and 50Salads datasets.
Semantic edges improve long-term dependency modeling and accuracy.
The approach reduces computational costs compared to LSTM and Transformer-based models.
Abstract
Video action segmentation has been widely applied in many fields. Most previous studies employed video-based vision models for this purpose. However, they often rely on a large receptive field, LSTM, or Transformer methods to capture long-term dependencies within videos, leading to significant computational resource requirements. To address this challenge, graph-based models were proposed. However, previous graph-based models are less accurate. Hence, this study introduces a graph-structured approach named Semantic2Graph to model long-term dependencies in videos, thereby reducing computational costs and raising accuracy. We construct a graph structure of the video at the frame level. Temporal edges are utilized to model the temporal relations and action order within videos. Additionally, we have designed positive and negative semantic edges, accompanied by corresponding edge weights, to…
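The frame-level graph described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the function name `build_frame_graph`, the use of per-frame labels to decide edge polarity, the all-pairs linking, and the placeholder weights of +1.0 and -1.0 are all assumptions made for illustration, since the abstract does not specify how semantic edges or their weights are computed.

```python
from itertools import combinations


def build_frame_graph(labels):
    """Build a toy frame-level graph from per-frame action labels.

    Hypothetical helper, not taken from the paper. Each frame index is a
    node. Returns a list of (src, dst, kind, weight) tuples.
    """
    n = len(labels)
    edges = []

    # Temporal edges: link each frame to the next one, which encodes
    # local temporal relations and action order.
    for i in range(n - 1):
        edges.append((i, i + 1, "temporal", 1.0))

    # Semantic edges (assumed rule): link non-adjacent frames.
    # Same label -> positive edge, different label -> negative edge.
    # The weights +1.0 / -1.0 are placeholders, not the paper's values.
    for i, j in combinations(range(n), 2):
        if j - i <= 1:
            continue  # adjacent frames are already covered by temporal edges
        if labels[i] == labels[j]:
            edges.append((i, j, "semantic_pos", 1.0))
        else:
            edges.append((i, j, "semantic_neg", -1.0))

    return edges


if __name__ == "__main__":
    # Five frames: the action "pour" resumes after an interruption by "stir".
    frame_labels = ["pour", "pour", "stir", "stir", "pour"]
    for edge in build_frame_graph(frame_labels):
        print(edge)
```

Linking every non-adjacent pair grows quadratically with the number of frames, so a practical system would presumably sample or restrict the semantic edges. The sketch only shows how semantic edges let distant frames of the same action connect directly, without passing through every intermediate frame.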
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Surveillance and Tracking Methods
