Skeleton-to-Image Encoding: Enabling Skeleton Representation Learning via Vision-Pretrained Models
Siyuan Yang, Jun Liu, Hao Cheng, Chong Wang, Shijian Lu, Hedvig Kjellstrom, Weisi Lin, Alex C. Kot

TL;DR
This paper introduces Skeleton-to-Image Encoding (S2I), transforming skeleton sequences into image-like data to leverage vision-pretrained models for improved skeleton representation learning across diverse datasets.
Contribution
The paper proposes a novel S2I representation that converts skeleton data into images, enabling the use of vision-pretrained models and addressing heterogeneity in skeleton formats.
Findings
S2I supports effective self-supervised skeleton representation learning across multiple datasets.
S2I enables cross-format generalization across heterogeneous skeleton formats.
The method outperforms existing approaches in several evaluation settings.
Abstract
Recent advances in large-scale pretrained vision models have demonstrated impressive capabilities across a wide range of downstream tasks, including cross-modal and multi-modal scenarios. However, their direct application to 3D human skeleton data remains challenging due to fundamental differences in data format. Moreover, the scarcity of large-scale skeleton datasets and the need to incorporate skeleton data into multi-modal action recognition without introducing additional model branches present significant research opportunities. To address these challenges, we introduce Skeleton-to-Image Encoding (S2I), a novel representation that transforms skeleton sequences into image-like data by partitioning and arranging joints based on body-part semantics and resizing to standardized image dimensions. This encoding enables, for the first time, the use of powerful vision-pretrained models for…
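The encoding described in the abstract (group joints by body part, arrange them, resize to a fixed image size) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `BODY_PARTS` grouping for a 25-joint skeleton, the min-max normalization of coordinates into three channels, the linear resampling, the 224×224 target size, and the names `skeleton_to_image` and `_resize_axis` are all assumptions made for the example. Frames are mapped to image rows and joints to columns, following the reviewers' description of time being mapped to image height.

```python
import numpy as np

# Hypothetical body-part grouping for a 25-joint skeleton (0-based indices).
# The paper's actual partition and joint ordering are not given in this summary.
BODY_PARTS = {
    "torso": [0, 1, 20, 2, 3],
    "left_arm": [4, 5, 6, 7, 21, 22],
    "right_arm": [8, 9, 10, 11, 23, 24],
    "left_leg": [12, 13, 14, 15],
    "right_leg": [16, 17, 18, 19],
}


def _resize_axis(arr, new_len, axis):
    """Linearly resample `arr` along `axis` to `new_len` samples."""
    old_pos = np.linspace(0.0, 1.0, arr.shape[axis])
    new_pos = np.linspace(0.0, 1.0, new_len)
    return np.apply_along_axis(
        lambda v: np.interp(new_pos, old_pos, v), axis, arr
    )


def skeleton_to_image(seq, parts=BODY_PARTS, size=(224, 224)):
    """Turn a (T, J, 3) skeleton sequence into an (H, W, 3) image in [0, 1].

    Rows correspond to time, columns to joints ordered by body part,
    and the three channels to the (assumed) x, y, z coordinates.
    """
    # Concatenate joint indices part by part so related joints sit side by side.
    order = [j for joints in parts.values() for j in joints]
    arranged = seq[:, order, :]

    # Min-max normalize each coordinate channel over the whole sequence.
    lo = arranged.min(axis=(0, 1), keepdims=True)
    hi = arranged.max(axis=(0, 1), keepdims=True)
    norm = (arranged - lo) / np.maximum(hi - lo, 1e-8)

    # Resize time (rows) and joints (columns) to the target image size.
    img = _resize_axis(norm, size[0], axis=0)
    img = _resize_axis(img, size[1], axis=1)
    return img


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo = rng.normal(size=(40, 25, 3))  # 40 frames, 25 joints, xyz
    print(skeleton_to_image(demo).shape)
```

Because the output shape no longer depends on the number of joints or frames, a skeleton with a different joint count would only need its own part grouping, which is one way to read the cross-format claim in the findings.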
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Conceptual simplicity and reusability: The S2I design elegantly reuses pretrained vision transformers for skeleton data without requiring task-specific architectural redesigns. 2. Strong generalization: Demonstrates robustness in cross-format and transfer-learning scenarios, supporting heterogeneous skeleton datasets with a unified encoding. 3. Comprehensive empirical validation: Includes detailed ablations on masking strategies, modalities, and pretrained initialization across five dat
1. Overlap with prior art: **Many earlier works (e.g., Skepxels, Translation-Scale Invariant Image Mapping) have already transformed skeletons into image-like forms for CNN-based feature extraction using pretrained image models**. The paper should clearly articulate what S2I contributes beyond these established skeleton-to-image CNN paradigms. **Repeated or incorrect phrases like “for the first time”** should be revised in this regard. 2. Incremental performance gains: The improvements over strong recent bas
+ The core idea of converting sparse skeleton data into a dense image format is novel. The S2I representation demonstrates good adaptability to different skeleton definitions. + The experiments are thorough, covering five diverse datasets, including the challenging real-world Toyota dataset. The method is rigorously evaluated under multiple settings (self-supervised learning, linear evaluation, fine-tuning, semi-supervised learning, transfer learning, cross-format transfer), demonstrating robus
- The S2I transformation is heuristic. The paper provides little theoretical analysis on how this specific encoding preserves the spatial structure information of the skeleton or its impact on action semantics. The principles behind the body-part ordering and the interpolation method are not sufficiently justified. - The direct mapping of 3D coordinates to the RGB domain is a key design choice. The authors should investigate and discuss whether this is the optimal mapping or if alternative enco
S1: The proposed skeleton-to-image encoding exhibits strong generality. It reformats sparse 3D skeletal data into an image-like representation, providing a novel and useful form of data representation. S2: The paper is clearly written, and the descriptions of the experimental settings and implementation details are well-articulated. S3: The experimental section is relatively thorough and validates the effectiveness of the proposed method. It is worth noting that the method does not require any
W1: While the proposed skeleton-to-image encoding provides a useful and generalizable way to decouple skeletal representation from dataset-specific joint configurations, the contribution may be regarded as more of a technical refinement than a conceptual innovation. Nonetheless, it does offer practical value and may inspire further exploration in this direction. W2: Both MAE and DiffMAE used in this paper are based on the ViT-B architecture, which is relatively modest in scale. Considering the
1. The experiments show that fine-tuning pre-trained vision models on image-like skeleton data can achieve competitive results, suggesting that visual semantics may share correlations with skeleton motion semantics. 2. Converting skeleton sequences into image-like representations effectively addresses the structural discrepancy among different skeleton types.
1. The motivation for using pre-trained vision models in self-supervised skeleton action recognition is unclear and insufficiently explained. Moreover, the necessity of applying this paradigm to skeletal data is not well justified. From the performance improvement perspective, the transferred knowledge from vision models appears limited. For example, the performance under linear evaluation is worse than in prior works. This indicates that the learned skeleton representations from pre-trained vis
1. Converting the set of keypoints into an image representation is an interesting attempt; 2. The overall writing is clear and easy to follow; 3. The method outperforms state-of-the-art approaches on multiple benchmarks (e.g., NTU-60, NTU-120, PKU-MMD) and shows strong generalization in cross-format and few-shot scenarios;
1. Mapping the temporal dimension directly to the image height is a simple and straightforward stacking approach. However, this design disrupts temporal continuity. In an image, adjacent rows (i.e., consecutive time frames) are spatially continuous, whereas in skeleton sequences, the same joint across consecutive frames appears in different rows, and different joints within the same row may be far apart in physical space (e.g., left hand vs. right foot). Such a representation may violate the loc
Taxonomy
Topics: Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis · Face recognition and analysis
