PointDGRWKV: Generalizing RWKV-like Architecture to Unseen Domains for Point Cloud Classification

Hao Yang; Qianyu Zhou; Haijia Sun; Xiangtai Li; Xuequan Lu; Lizhuang Ma; Shuicheng Yan

arXiv:2508.20835·cs.CV·September 1, 2025

Open Access

TL;DR

PointDGRWKV introduces a novel RWKV-based framework for point cloud classification that enhances domain generalization by addressing spatial distortions and attention shifts, achieving state-of-the-art results.

Contribution

This work is the first to adapt RWKV architecture for domain generalization in point cloud classification, introducing modules for spatial modeling and distribution alignment.

Findings

01 Achieves state-of-the-art performance on multiple benchmarks.

02 Effectively models local geometric structures.

03 Mitigates cross-domain attention shifts.

Abstract

Domain Generalization (DG) has been recently explored to enhance the generalizability of Point Cloud Classification (PCC) models toward unseen domains. Prior works are based on convolutional networks, Transformer or Mamba architectures, either suffering from limited receptive fields or high computational cost, or insufficient long-range dependency modeling. RWKV, as an emerging architecture, possesses superior linear complexity, global receptive fields, and long-range dependency. In this paper, we present the first work that studies the generalizability of RWKV models in DG PCC. We find that directly applying RWKV to DG PCC encounters two significant challenges: RWKV's fixed direction token shift methods, like Q-Shift, introduce spatial distortions when applied to unstructured point clouds, weakening local geometric modeling and reducing robustness. In addition, the Bi-WKV attention in…

Tables (6)

Table 1. Performance comparison between the proposed method and state-of-the-art point cloud classification methods on the PointDA-10 and PointDG-3to1 benchmarks. The metric is overall classification accuracy (%), and Avg. indicates the mean accuracy across all target domain scenarios. The highest result in each benchmark is marked in bold.

Method	Setting	Venue	Backbone	PointDA-10 Benchmark				PointDG-3to1 Benchmark
Method	Setting	Venue	Backbone	M,S*→S	M,S→S*	S,S*→M	Avg.	ABC→D	ABD→C	ACD→B	BCD→A	Avg.
PointDAN [44]	DA	NeurIPS’2019	PointNet	77.38	40.32	78.69	65.46	58.85	81.66	48.86	79.95	67.33
DefRec [1]	DA	WACV’2021	DGCNN	77.23	44.28	84.77	68.76	72.76	79.97	43.29	87.94	70.99
GAST [85]	DA	ICCV’2021	DGCNN	79.43	47.69	81.72	69.61	71.78	86.43	52.31	86.21	74.18
MetaSets [21]	DG	CVPR’2021	PointNet	81.39	50.86	83.48	71.91	73.24	92.41	60.97	87.28	78.48
PDG [60]	DG	NeurIPS’2022	PointNet	79.82	51.73	83.51	71.69	73.38	92.98	60.57	89.90	79.21
PointNeXt [43]	DG	NeurIPS’2022	PointNet	77.31	43.32	78.16	66.26	71.47	91.70	46.39	88.95	74.63
X-3D [51]	DG	CVPR’2024	PointNet	78.06	46.91	79.69	68.22	71.58	91.89	48.34	88.45	75.07
PCT [16]	DG	CVM’2021	PointTrans	80.23	48.29	81.91	70.14	71.43	87.43	58.43	88.34	76.41
GBNet [46]	DG	TMM’2021	PointTrans	79.94	48.92	81.34	70.07	72.78	87.83	57.76	88.82	76.80
SUG [23]	DG	MM’2023	PointTrans	78.34	49.59	82.03	69.99	71.58	89.62	54.66	86.35	75.55
PCM [73]	DG	AAAI’2025	PCM	81.02	46.83	83.92	70.59	72.27	91.24	57.28	87.54	77.08
PointDGMamba [68]	DG	AAAI’2025	PCM	84.33	52.83	87.38	74.85	74.20	95.51	61.71	90.68	80.53
V-RWKV’ [9]	DG	ICLR’2025	V-RWKV	81.70	49.52	85.49	72.24	73.42	92.12	57.88	88.18	77.90
PointDGRWKV	DG	-	V-RWKV	84.39	54.10	88.49	75.66	76.37	95.99	63.92	91.38	81.92

Table 2. Ablation study on the AGT-Shift (AGTS) and CD-KDA (KDA) modules on the PointDA-10 benchmark.

AGTS	KDA	M,S*→S	M,S→S*	S,S*→M	Avg.	Gain
		81.70	49.52	85.49	72.24	-
✓		83.39	51.10	86.21	73.57	1.33
	✓	82.50	53.48	86.86	74.28	2.04
✓	✓	84.39	54.10	88.49	75.66	3.42

Table 3. Ablation on different shifting strategies on the PointDA-10 benchmark.

Shift	M,S*→S	M,S→S*	S,S*→M	Avg.
KNN-RandOne	83.39	52.85	87.38	74.54
KNN-Avg	83.87	53.36	85.51	74.25
KNN-WAvg	83.35	53.31	86.80	74.82
AGT-Shift (Ours)	84.39	54.10	88.49	75.66

Table 4. Ablation study comparing different alignment settings for key ($\mathbf{k}$) and value ($\mathbf{v}$) in the CD-KDA module.

Setting	M,S*→S	M,S→S*	S,S*→M	Avg.
None	83.39	51.10	86.21	73.57
Only $\mathbf{v}$	83.67	51.89	86.68	74.08
$\mathbf{k}$ and $\mathbf{v}$	84.63	54.07	88.34	75.68
Only $\mathbf{k}$ (Ours)	84.39	54.10	88.49	75.66

Table 5. Generalization results of PointDGRWKV across varying network scales.

Scale	M,S*→S	M,S→S*	S,S*→M	Avg.
Ours-Base	83.99	53.83	87.62	75.15
Ours-Standard	84.39	54.10	88.49	75.66
Ours-Large	84.63	54.61	89.14	76.13

Table 6. Analysis of computational efficiency on a single NVIDIA RTX 4090 GPU.

Method	Params (M)	FLOPs (G)	Time (ms)	Acc (%)
GAST [85]	75.36	2.17	23.13	69.61
PCT [16]	2.88	2.19	27.65	70.14
GBNet [46]	8.77	9.87	80.97	70.07
SUG [23]	19.17	18.4	5.42	69.99
PCM [73]	35.85	20.18	6.26	70.59
PointDGMamba [68]	13.09	6.08	3.35	74.85
Ours-Base	2.13	3.22	1.68	75.15
Ours-Standard	3.72	4.57	2.39	75.66
Ours-Large	10.40	7.60	2.92	76.13

Equations

$$\text{Q-Shift}_{S}(X)$$

$$X^{\star}$$

$$wkv_{t}=\frac{\sum_{i=0,\,i\neq t}^{T-1}e^{-\frac{|t-i|-1}{T}\cdot\mathbf{w}+\mathbf{k}_{i}}\cdot\mathbf{v}_{i}+e^{\mathbf{u}+\mathbf{k}_{t}}\cdot\mathbf{v}_{t}}{\sum_{i=0,\,i\neq t}^{T-1}e^{-\frac{|t-i|-1}{T}\cdot\mathbf{w}+\mathbf{k}_{i}}+e^{\mathbf{u}+\mathbf{k}_{t}}},$$

$$\hat{f}_{i}$$

$$f_{i}^{out}=\left[\lambda f_{i}^{(1)}+(1-\lambda)\hat{f}_{i}^{(1)}\,\|\,f_{i}^{(2)}\right],$$

$$\mathcal{L}_{\text{CD-KDA}}=\frac{1}{|\mathcal{P}|}\sum_{(i,j)\in\mathcal{P}}\left(\left\|\mu^{(i)}-\mu^{(j)}\right\|_{2}^{2}+\left\|\Sigma^{(i)}-\Sigma^{(j)}\right\|_{F}^{2}\right),$$

$$\mathcal{L}=\lambda_{1}\mathcal{L}_{\text{cls}}+\lambda_{2}\mathcal{L}_{\text{CD-KDA}},$$

Taxonomy

Topics: 3D Shape Modeling and Analysis · Gaussian Processes and Bayesian Inference · Face recognition and analysis

Full text

PointDGRWKV: Generalizing RWKV-like Architecture to Unseen Domains for Point Cloud Classification

Hao Yang1, Qianyu Zhou2, Haijia Sun3, Xiangtai Li4, Xuequan Lu5, Lizhuang Ma1, Shuicheng Yan6

1 Shanghai Jiao Tong University; 2 The University of Tokyo; 3 Nanjing University; 4 Nanyang Technological University; 5 The University of Western Australia; 6 National University of Singapore

Code: https://github.com/yxltya/PointDGRWKV

* The first two authors contributed equally to this work. * Corresponding author.

Abstract

Domain Generalization (DG) has been recently explored to enhance the generalizability of Point Cloud Classification (PCC) models toward unseen domains. Prior works are based on convolutional networks, Transformer or Mamba architectures, either suffering from limited receptive fields or high computational cost, or insufficient long-range dependency modeling. RWKV, as an emerging architecture, possesses superior linear complexity, global receptive fields, and long-range dependency. In this paper, we present the first work that studies the generalizability of RWKV models in DG PCC. We find that directly applying RWKV to DG PCC encounters two significant challenges: RWKV’s fixed direction token shift methods, like Q-Shift, introduce spatial distortions when applied to unstructured point clouds, weakening local geometric modeling and reducing robustness. In addition, the Bi-WKV attention in RWKV amplifies slight cross-domain differences in key distributions through exponential weighting, leading to attention shifts and degraded generalization. To this end, we propose PointDGRWKV, the first RWKV-based framework tailored for DG PCC. It introduces two key modules to enhance spatial modeling and cross-domain robustness, while maintaining RWKV’s linear efficiency. In particular, we present Adaptive Geometric Token Shift to model local neighborhood structures to improve geometric context awareness. In addition, Cross-Domain key feature Distribution Alignment is designed to mitigate attention drift by aligning key feature distributions across domains. Extensive experiments on multiple benchmarks demonstrate that PointDGRWKV achieves state-of-the-art performance on DG PCC.

1 Introduction

3D point cloud analysis [18, 53, 47, 2, 29, 74, 45, 46, 72, 71, 67] plays a crucial role in various applications, such as autonomous driving, augmented reality, and robotics [4, 3, 52]. Recently, point cloud classification (PCC) methods [41, 42, 40, 30, 59, 2, 74] have made significant progress in understanding local geometry and global shapes. However, most of them assume that the training and testing data share the same distribution. When a model is applied to unknown domains, its performance often drops significantly due to domain shifts induced by different sensors, environments, scanning angles, etc.

To address this issue, domain generalization (DG) has been introduced into point cloud analysis, aiming to train models solely on source domains and generalize well to unknown domains. Mainstream DG PCC methods learn domain-invariant features via data augmentation [63], adversarial training [28], and consistency learning [27]. Nonetheless, most of them are based on CNNs and suffer from a limited receptive field, which makes it challenging to capture global structural information and harms generalizability. Subsequently, Point Transformer [19] was introduced to enhance global modeling capabilities in DG PCC. However, its self-attention has quadratic computational complexity, which limits its efficiency in practical applications. Point Cloud Mamba has recently shown the potential of sequence modeling in DG PCC [68]. Nevertheless, its fixed state-space size makes it difficult to fully capture long-range dependencies, especially for long sequences, as shown in Figure 1.

Recently, Receptance Weighted Key Value (RWKV) [39, 5, 9, 7] has demonstrated excellent capabilities in modeling long-range dependencies and capturing global information in NLP and vision tasks. Moreover, its core WKV attention mechanism has linear computational complexity, which significantly reduces the computational overhead relative to traditional self-attention. RWKV-like models demonstrate strong scalability in various vision tasks and even in point cloud analysis. Despite this progress, enhancing the generalizability of RWKV-like models in unseen domains for point cloud analysis remains an open problem, as directly applying RWKV to DG PCC tasks is non-trivial.

In this paper, we aim to improve the generalizability of RWKV-like architectures toward unseen domains in point cloud classification. Our motivation is twofold. Firstly, RWKV’s fixed-direction token shift, e.g., Q-Shift, inevitably introduces spatial distortions to unstructured point clouds because the order of tokens does not match their spatial proximity. This weakens the model’s ability to model local geometry and thus affects robustness in unseen domains. Secondly, the Bi-WKV attention mechanism in RWKV is highly sensitive to slight discrepancies in key distribution between the source domains and the unseen domain. Its nonlinear weighting, i.e., the exponential function, can easily amplify shifts in the focus of attention, undermining the generalization performance of the model in unknown domains.

Motivated by the above analysis, we propose PointDGRWKV, a novel RWKV-based framework for domain-generalized point cloud classification. PointDGRWKV combines strong generalizability and linear complexity with the ability to model long-range dependencies and global structural information. The proposed method has two key modules. Firstly, we design a lightweight, parameter-free Adaptive Geometric Token Shift mechanism (AGT-Shift) based on the inherent spatial characteristics of point clouds. It constructs local neighborhoods through spatial partitioning and dynamically integrates structural features to enhance the model’s ability to model geometric contexts. Secondly, we propose a Cross-Domain Key feature Distribution Alignment module (CD-KDA) to address the nonlinear amplification effect of key vectors on weight calculation in the Bi-WKV attention mechanism. By aligning the key distributions between source domains at the mean and covariance levels, we explicitly alleviate the cross-domain shift of attention and improve the generalization performance of the model in unseen domains. As shown in Fig. 1, PointDGRWKV achieves superior performance with less computational overhead than existing Transformer-based and Mamba-based methods on multiple DG benchmarks. Our contributions are three-fold:

• We propose PointDGRWKV, a novel RWKV-based framework for domain generalizable point cloud classification that excels in strong generalizability toward unseen domains, global receptive fields, linear complexity, and long-range dependency.

• We design Adaptive Geometric Token Shift (AGT-Shift) and Cross-Domain key feature Distribution Alignment (CD-KDA) to enhance RWKV’s geometry perception ability and the generalizability toward unseen domains.

• Extensive experiments on multiple DG benchmarks verify the superiority and effectiveness of PointDGRWKV compared to state-of-the-art approaches.

2 Related Work

Point Cloud Classification (PCC) aims to accurately categorize 3D point cloud data. Early works such as PointNet [41] and PointNet++ [42] pioneered the use of MLP-based architectures to directly learn features from raw point clouds. Subsequent research expanded on this by incorporating Convolutional Neural Networks (CNNs) [30, 59] to better capture local geometric patterns. Nevertheless, CNN-based approaches often struggle with limited receptive fields, especially in deeper networks. To address this, Vision Transformers (ViTs) [75, 11, 8] have recently been adopted in PCC, offering enhanced global context modeling capabilities. Methods like PCT [16] and Point Transformer [75] leverage self-attention mechanisms to capture long-range dependencies across points. Recently, Point Mamba and Point Cloud Mamba [31, 73] have introduced Mamba-like models into point cloud analysis and achieved a global receptive field with linear complexity. While these models achieve impressive results on standard benchmarks, their generalization to novel or unseen domains remains a significant challenge.

Domain Generalized Point Cloud Classification (DG PCC). Although domain adaptation techniques [14, 56, 79, 81, 82, 15, 77, 65, 12, 80, 17] have been explored for point clouds [44, 85, 54, 48, 57, 32, 10, 22, 61, 26, 33, 25] to narrow domain shifts, target data is not always accessible in real scenarios, which can cause these methods to fail. Domain generalization [55, 76, 34, 24, 58, 35, 84, 83, 78, 50, 37, 49, 36] has recently been introduced into PCC [66, 27, 28, 63, 64, 23, 21, 60] to improve the generalizability toward unseen domains. Existing DG PCC methods primarily focus on learning domain-invariant representations through meta learning [21], adversarial learning [66], contrastive learning [60], consistency regularization [27], and data augmentation [28, 63, 64]. While these approaches have shown promising results, many are built on CNN-based backbones, whose inherently limited receptive fields constrain their ability to capture global structural information critical for robust generalization. Subsequently, Huang et al. [23] proposed Transformer-based subdomain alignment and domain-aware attention mechanisms, which suffer from quadratic computational cost. Recently, PointDGMamba [68] introduced Mamba-based architectures in DG PCC to improve generalization to unseen domains. Although Mamba offers linear inference efficiency and long-sequence modeling capabilities, its fixed-size state space constrains its ability to capture long-range spatial dependencies in point clouds. This highlights the need for novel architectures in DG PCC that can simultaneously support global context modeling and better preserve long-range dependencies over long sequences.

Receptance Weighted Key Value (RWKV). RWKV [70, 5, 69, 7] has garnered increasing attention due to its global receptive field, low computational complexity, and strength in long-sequence modeling. Its core innovations are the linear WKV attention mechanism together with spatial mixing and channel mixing, which balance local features and global dependencies through gating and recursion mechanisms and support parallel training and efficient inference. Recently, PointRWKV [20] introduced RWKV into point cloud analysis, but has not fully open-sourced its implementation. The unstructured and sparse nature of point clouds, as well as cross-domain differences such as sensor or scene changes, poses new challenges to the original RWKV. To our knowledge, this is the first work that studies the generalizability of RWKV-based models toward unseen domains in point cloud tasks. This paper uses the popular Vision-RWKV [70] as the baseline.

3 Method

3.1 Revisiting the RWKV

RWKV [9] incorporates a token shift function, e.g., Q-Shift, which introduces interactions among nearby positions along the channel dimension, enriching local context without increasing the computational cost:

[Equation: definition of $\text{Q-Shift}_{S}(X)$ in terms of $X^{\star}$]

where $X^{\star}$ denotes a sliced vector of $X$, capturing tokens from positions adjacent to the current location in the channel dimension. However, when directly applied to unstructured 3D point clouds, this operation may distort the underlying spatial structure (Fig. 2).
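To make the distortion concrete, the following is a minimal NumPy sketch, not the authors' code. The function `q_shift_grid` is a hypothetical illustration of a fixed-direction quarter shift on an image-like grid, written from the general description above (each channel quarter taken from one fixed neighbor, with zero padding at the borders as an assumption); the learnable interpolation with the unshifted input is omitted. The helper `index_vs_spatial_neighbor` then compares, for a randomly ordered point cloud, how far away the previous token in sequence order is versus the true nearest point.

```python
import numpy as np

def q_shift_grid(x):
    """Fixed-direction quarter shift on a (H, W, C) token grid, C divisible by 4.

    Each quarter of the channels is copied from one fixed neighbour
    (up, down, left, right). Border positions are zero-padded (an assumption).
    """
    h, w, c = x.shape
    q = c // 4
    out = np.zeros_like(x)
    out[1:, :, 0:q] = x[:-1, :, 0:q]                  # from the token above
    out[:-1, :, q:2 * q] = x[1:, :, q:2 * q]          # from the token below
    out[:, 1:, 2 * q:3 * q] = x[:, :-1, 2 * q:3 * q]  # from the token to the left
    out[:, :-1, 3 * q:] = x[:, 1:, 3 * q:]            # from the token to the right
    return out

def index_vs_spatial_neighbor(points):
    """Mean distance to the previous token in sequence order vs. the nearest point."""
    idx_dist = np.linalg.norm(points[1:] - points[:-1], axis=1).mean()
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # ignore self-distances
    nn_dist = d.min(axis=1).mean()
    return idx_dist, nn_dist

rng = np.random.default_rng(0)
grid = rng.normal(size=(4, 4, 8))
shifted = q_shift_grid(grid)

cloud = rng.uniform(-1.0, 1.0, size=(256, 3))  # unordered synthetic point cloud
idx_dist, nn_dist = index_vs_spatial_neighbor(cloud)
print(f"index neighbour: {idx_dist:.3f}, spatial neighbour: {nn_dist:.3f}")
```

On a grid, the shifted-in token is always a true spatial neighbor. On an unordered point sequence, the token adjacent in index is typically much farther away than the nearest point, which is the mismatch the text describes.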

Moreover, in the attention mechanism adopted by Bi-WKV [9], the attention weight of each token is formulated as follows:

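$$wkv_{t}=\frac{\sum_{i=0,\,i\neq t}^{T-1}e^{-\frac{|t-i|-1}{T}\cdot\mathbf{w}+\mathbf{k}_{i}}\cdot\mathbf{v}_{i}+e^{\mathbf{u}+\mathbf{k}_{t}}\cdot\mathbf{v}_{t}}{\sum_{i=0,\,i\neq t}^{T-1}e^{-\frac{|t-i|-1}{T}\cdot\mathbf{w}+\mathbf{k}_{i}}+e^{\mathbf{u}+\mathbf{k}_{t}}},$$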

where $\mathbf{k}_{i}$ and $\mathbf{v}_{i}$ represent the key and value of the $i$-th token, respectively, $\mathbf{w}$ is the learnable distance decay parameter, and $\mathbf{u}$ is the learnable bias term. Since the key $\mathbf{k}$ appears directly in the exponential function, its distribution has an exponential amplification effect on attention results (Fig. 3), which can lead to attention drift and degrade the model’s generalization performance.
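The exponential sensitivity to keys can be checked numerically. The sketch below is a direct, quadratic-time evaluation of the Bi-WKV weighting described above, written only for illustration (practical RWKV implementations use a linear-time recurrence instead); the function name `bi_wkv_naive` and the toy shapes are assumptions. It perturbs a single token's key by $\delta=1$ and measures how that token's weight changes relative to another token.

```python
import numpy as np

def bi_wkv_naive(k, v, w, u):
    """Direct O(T^2) evaluation of the Bi-WKV output.

    k, v: arrays of shape (T, C); w, u: arrays of shape (C,).
    Returns the (T, C) outputs and the normalised (T, T, C) weights.
    """
    T, C = k.shape
    t = np.arange(T)
    # Relative-position term -(|t - i| - 1) / T * w, shape (T, T, C).
    decay = -((np.abs(t[:, None] - t[None, :]) - 1) / T)[:, :, None] * w
    logits = decay + k[None, :, :]           # exponent for i != t
    logits[np.eye(T, dtype=bool)] = u + k    # exponent for i == t
    logits -= logits.max(axis=1, keepdims=True)  # numerical stabilisation
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return np.einsum("tic,ic->tc", weights, v), weights

rng = np.random.default_rng(0)
T, C = 6, 4
k = rng.normal(size=(T, C))
v = rng.normal(size=(T, C))
w = rng.uniform(0.1, 1.0, size=C)
u = rng.normal(size=C)

_, base = bi_wkv_naive(k, v, w, u)
k_shifted = k.copy()
k_shifted[2] += 1.0                          # perturb one key by delta = 1
_, pert = bi_wkv_naive(k_shifted, v, w, u)

# Weight of token 2 relative to token 4, as seen from token 0.
ratio = (pert[0, 2] / pert[0, 4]) / (base[0, 2] / base[0, 4])
print(ratio)  # each channel is multiplied by exp(delta)
```

An additive change of $\delta$ in one key multiplies that token's relative weight by $e^{\delta}$, so small domain-specific differences between keys translate into multiplicative changes in where attention concentrates.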

To address these limitations in the context of point cloud domain generalization, we propose two modules, as illustrated in Fig. 4: Adaptive Geometric Token Shift (AGT-Shift), which enhances local structure modeling via spatial partitioning, and Cross-Domain Key Distribution Alignment (CD-KDA), which improves the robustness of attention by aligning key feature distributions across domains.

3.2 Adaptive Geometric Token Shift

When adapting the token shift mechanism of RWKV to the point cloud domain, two key challenges arise. Firstly, point clouds inherently lack regular topological structures, making it difficult to establish consistent spatial directions such as “up,” “down,” “left,” and “right” as in image data. Secondly, point cloud datasets are typically large-scale, and conventional operations like KNN search or graph construction introduce substantial computational and memory overhead, thereby limiting scalability. Consequently, a central challenge lies in achieving an effective balance between computational efficiency and the ability to model spatial structures.

To address this issue, we propose Adaptive Geometric Token Shift (AGT-Shift). AGT-Shift efficiently constructs the nearest neighborhood through spatial partitioning and introduces a weighted feature aggregation scheme among neighboring points to enable token shifting and enhance structural awareness. By avoiding the explicit computation of pairwise distance matrices, the method circumvents the quadratic complexity typically found in KNN-based approaches.

Concretely, AGT-Shift partitions the 3D space into a set of spatial sub-regions using a spatial hashing technique with fixed step sizes. Points within the same sub-region are treated as forming a local context block, capturing localized geometric structures. For each point token, partial feature fusion is then conducted by computing a weighted average of the feature tokens within its corresponding region, facilitating efficient context-aware feature enhancement.

Let the point cloud features be denoted as $F\in\mathbb{R}^{B\times N\times C}$ and the corresponding point coordinates as $X\in\mathbb{R}^{B\times N\times 3}$, where $B$, $N$, and $C$ represent the batch size, number of points, and feature dimension, respectively. The 3D space is discretized into a set of spatial grids $\mathcal{G}_{i}$ and each point is assigned to a grid cell based on its coordinates. For a given point $i\in\mathcal{G}_{i}$, its token shift feature is defined as:

[Equation: definition of the token shift feature $\hat{f}_{i}$]

where $\mu_{i}=\frac{1}{|\mathcal{G}_{i}|}\sum_{j\in\mathcal{G}_{i}}x_{j}$ denotes the geometric center of subregion $\mathcal{G}_{i}$, and $w_{ij}$ is the contribution of point $j$ to the representation of point $i$, with higher weights assigned to closer points. To preserve the discriminability of the original features and avoid excessive perturbation, we selectively perturb only a subset of channels and introduce a residual fusion mechanism to ensure stable feature refinement:

$$f_{i}^{out}=\left[\lambda f_{i}^{(1)}+(1-\lambda)\hat{f}_{i}^{(1)}\,\|\,f_{i}^{(2)}\right],$$

where $f_{i}^{(1)}$ represents the first $C'$ channels (used for perturbation), $f_{i}^{(2)}$ denotes the remaining channels, $\|$ denotes channel-wise concatenation, and $\lambda\in(0,1)$ controls the degree of perturbation in the token shift.

Remark. Note that the AGT-Shift module does not rely on KNN or explicit adjacency graph construction, nor does it depend on additional parameter learning. All aggregation processes can be completed through tensor operations, and the overall computational complexity is $\mathcal{O}(N)$.
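The steps of AGT-Shift can be sketched for a single point cloud as follows. This is an illustrative NumPy sketch under stated assumptions, not the released implementation: the function name `agt_shift_sketch`, the cell size `step`, the default number of shifted channels, and the specific weight $w_{ij}\propto\exp(-\|x_{j}-\mu_{i}\|)$ are all hypothetical choices, since the text only states that closer points receive higher weights. For brevity the cell lookup uses the sorting-based `np.unique`, which is $\mathcal{O}(N\log N)$; a hash table would be needed to match the stated linear complexity.

```python
import numpy as np

def agt_shift_sketch(feats, xyz, step=0.2, shift_channels=None, lam=0.5):
    """Grid-based token shift for one point cloud (illustrative only).

    feats: (N, C) point features; xyz: (N, 3) coordinates.
    """
    feats = np.asarray(feats, dtype=np.float64)
    xyz = np.asarray(xyz, dtype=np.float64)
    n, c = feats.shape
    c1 = c // 4 if shift_channels is None else shift_channels

    # 1) Spatial partition with a fixed step: one integer id per occupied cell.
    cells = np.floor(xyz / step).astype(np.int64)
    _, cell_id = np.unique(cells, axis=0, return_inverse=True)
    cell_id = cell_id.reshape(-1)
    m = int(cell_id.max()) + 1

    # 2) Cell centroids mu via scatter-add.
    counts = np.bincount(cell_id, minlength=m)
    mu = np.zeros((m, 3))
    np.add.at(mu, cell_id, xyz)
    mu /= counts[:, None]

    # 3) Assumed weights: exp(-distance to centroid), normalised per cell.
    raw = np.exp(-np.linalg.norm(xyz - mu[cell_id], axis=1))
    wgt = raw / np.bincount(cell_id, weights=raw, minlength=m)[cell_id]

    # 4) Weighted aggregate of the first c1 channels, broadcast back to points.
    agg = np.zeros((m, c1))
    np.add.at(agg, cell_id, wgt[:, None] * feats[:, :c1])
    f_hat = agg[cell_id]

    # 5) Residual fusion on the shifted channels; the rest pass through unchanged.
    out = feats.copy()
    out[:, :c1] = lam * feats[:, :c1] + (1.0 - lam) * f_hat
    return out

rng = np.random.default_rng(0)
xyz = rng.uniform(0.0, 1.0, size=(64, 3))
feats = rng.normal(size=(64, 8))
out = agt_shift_sketch(feats, xyz, step=0.5, lam=0.7)
print(out.shape)
```

No pairwise distance matrix is formed: every step is a per-point operation or a scatter-add into per-cell buffers, which is what keeps the cost from growing quadratically with the number of points.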

3.3 Cross-Domain Key Distribution Alignment

We observe that there are significant differences in the distribution of $\mathbf{k}$ across different domains, such as high or low mean values of $\mathbf{k}$ features and large differences in variance in some domains. These distribution shifts cause significant bias at the $e^{k}$ level, leading to shifts in the position of attention focus within a domain and severely undermining the model’s generalizability to unseen domains.

To address this issue, we propose Cross-Domain Key Feature Distribution Alignment (CD-KDA) to enhance the modeling stability and structural generalization ability of attention mechanisms in cross-domain scenes. For point cloud data, the key vector $\mathbf{k}$ encodes the relative importance of each point within its local neighborhood or in relation to the global context. Specifically, $\mathbf{k}$ captures the spatial selection tendencies of the points: its mean $\mu$ reflects the global attention focus, while its covariance $\Sigma$ characterizes the semantic or geometric diversity among points. Consequently, if different source domains exhibit distinct distributions of $\mathbf{k}$ due to geometric discrepancies, the attention-governed aggregation of point cloud structures will shift across domains. In contrast, although the value vector $\mathbf{v}$ also contributes to attention computation, it only serves as the weighted content to be aggregated. It does not influence the generation of attention weights directly, nor does it appear within the exponential function of the attention formulation. As such, its effect on generalization is less immediate and critical than that of $\mathbf{k}$.

Additionally, the spatial decay parameter $\mathbf{w}$ and bias $\mathbf{u}$ in Bi-WKV [9] are shared model parameters that encode the model’s inherent sensitivity to spatial distance and positional priors. These parameters should be learned jointly across multiple source domains to capture a unified inductive bias, and therefore should not be forcibly aligned across domains.

Based on these observations, we argue that aligning the dynamic input key representations $\mathbf{k}$ across source domains is the most critical and effective strategy to enhance generalization. Let the set of source domains be denoted as $\{\mathcal{D}_{1},\mathcal{D}_{2},\dots,\mathcal{D}_{n}\}$ , with the corresponding key features extracted from each domain represented by $\mathbf{k}^{(i)}\in\mathbb{R}^{T\times C}$ . We define the following objective function for alignment:

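$$\mathcal{L}_{\text{CD-KDA}}=\frac{1}{|\mathcal{P}|}\sum_{(i,j)\in\mathcal{P}}\left(\left\|\mu^{(i)}-\mu^{(j)}\right\|_{2}^{2}+\left\|\Sigma^{(i)}-\Sigma^{(j)}\right\|_{F}^{2}\right),$$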

where $\mu^{(i)}=\frac{1}{T}\sum_{t}\mathbf{k}^{(i)}_{t}$ is the key mean of the $i$-th domain, $\Sigma^{(i)}$ is the corresponding covariance matrix, $\mathcal{P}$ is the set of unordered pairs of source domains, and $\|\cdot\|_{F}$ is the Frobenius norm. By minimizing the discrepancy between the distributions of $\mathbf{k}$ in different source domains, CD-KDA enhances the cross-domain stability of the attention mechanism and thereby improves the consistency of the model’s attention distribution on the unseen domain.
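The alignment objective described above (squared Euclidean distance between key means plus squared Frobenius distance between key covariances, averaged over unordered domain pairs) can be sketched in NumPy as follows. The function name `cd_kda_loss` is hypothetical, and the use of the biased $1/T$ covariance estimator is an assumption, since the text does not specify the normalization. A training implementation would compute the same quantity on differentiable tensors.

```python
import itertools
import numpy as np

def cd_kda_loss(keys_per_domain):
    """Mean/covariance alignment loss over unordered pairs of source domains.

    keys_per_domain: list of (T_i, C) arrays, one per source domain.
    """
    stats = []
    for k in keys_per_domain:
        mu = k.mean(axis=0)
        centered = k - mu
        sigma = centered.T @ centered / k.shape[0]  # biased 1/T estimator (assumed)
        stats.append((mu, sigma))

    pairs = list(itertools.combinations(range(len(stats)), 2))
    if not pairs:  # fewer than two domains: nothing to align
        return 0.0

    total = 0.0
    for i, j in pairs:
        mu_i, sigma_i = stats[i]
        mu_j, sigma_j = stats[j]
        total += np.sum((mu_i - mu_j) ** 2)         # squared L2 distance of means
        total += np.sum((sigma_i - sigma_j) ** 2)   # squared Frobenius distance
    return total / len(pairs)

rng = np.random.default_rng(0)
k_a = rng.normal(size=(128, 3))
k_b = k_a + np.array([1.0, 2.0, 0.0])  # same covariance, shifted mean
print(cd_kda_loss([k_a, k_a]), cd_kda_loss([k_a, k_b]))
```

In the toy example, a pure mean shift of $(1, 2, 0)$ leaves the covariance unchanged, so the loss reduces to the squared norm of the shift.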

3.4 Training and Inference

During the training phase, the total loss of the model includes the classification loss and the cross-domain feature alignment loss, as follows:

$$\mathcal{L}=\lambda_{1}\mathcal{L}_{\text{cls}}+\lambda_{2}\mathcal{L}_{\text{CD-KDA}},$$

where $\mathcal{L}_{\text{cls}}$ is the cross-entropy loss used to supervise the model’s ability to recognize point cloud classes. Hyperparameters $\lambda_{1}$ and $\lambda_{2}$ adjust the weights of the two losses during training. During the inference stage, only the trained feature extractor and classifier are used. The model no longer requires access to any source domain data and directly predicts the class labels of the target domain samples. This allows for efficient and scalable deployment in unseen domains without additional adaptation.

4 Experiments

4.1 Experiments Setup

Implementation Details. Our model was trained on an NVIDIA RTX 4090 GPU. We use the AdamW optimizer [38] with an initial learning rate of $1\times 10^{-4}$, a cosine annealing schedule, and a weight decay of $1\times 10^{-4}$. During training, preprocessing and augmentation operations such as scaling, normalization, and random jitter were applied to the input point cloud data. The model adopts a four-stage hierarchical structure for gradually extracting and aggregating multi-scale point cloud features. The number of RWKV blocks in each stage is 1, 1, 2, and 2, respectively. In all experiments, $\lambda_{1}=1$ and $\lambda_{2}=0.3$ by default.
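For reference, a cosine annealing schedule starting from the stated initial learning rate can be written in a few lines. This is a generic sketch of the standard schedule: the function name `cosine_annealed_lr`, the minimum learning rate of 0, and the total step count are assumptions, since the text does not report them.

```python
import math

def cosine_annealed_lr(step, total_steps, base_lr=1e-4, min_lr=0.0):
    """Standard cosine annealing from base_lr down to min_lr over total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical 100-step run: print the rate at the start, midpoint, and end.
for step in (0, 50, 100):
    print(step, cosine_annealed_lr(step, total_steps=100))
```

The rate starts at the base value, passes through half of it at the midpoint, and decays smoothly to the minimum at the end of training.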

Benchmarks. To evaluate the generalizability of our method in DG PCC, we conduct experiments on two benchmarks. The first is PointDA-10 [44, 6, 62], which includes ModelNet-10 (M), ShapeNet-10 (S), and ScanNet-10 (S*) with 10 shared categories. ModelNet and ShapeNet contain clean point clouds generated from synthetic 3D models, while ScanNet captures real-world scenes with frequent missing regions due to occlusion. It defines three cross-domain settings: M, S*→S; M, S→S*; and S, S*→M. The second benchmark is PointDG-3to1 [68], including ModelNet-5 (A), ScanNet-5 (B), ShapeNet-5 (C), and 3D-FUTURE-Completion (D) [33, 13], sharing five classes. It adopts a “leave-one-out” setting to form four settings: ABC→D, ABD→C, ACD→B, and BCD→A. Following common DG practices, the training uses only source samples, and the evaluation is performed on the target domain’s testing set.
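The leave-one-out protocol of PointDG-3to1 can be enumerated mechanically. The short sketch below uses the domain letters given above; the dictionary and function names are illustrative only.

```python
DOMAINS = {
    "A": "ModelNet-5",
    "B": "ScanNet-5",
    "C": "ShapeNet-5",
    "D": "3D-FUTURE-Completion",
}

def leave_one_out(domains):
    """Return (sources, target) pairs: train on all domains except the target."""
    keys = sorted(domains)
    splits = []
    for target in reversed(keys):
        sources = "".join(k for k in keys if k != target)
        splits.append((sources, target))
    return splits

labels = [f"{sources}→{target}" for sources, target in leave_one_out(DOMAINS)]
print(labels)
```

Each split trains on three domains and evaluates on the test set of the held-out fourth, which gives the four settings reported in the tables.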

4.2 Comparison Results

Comparison Methods. To comprehensively evaluate the proposed method, we compare it with representative point cloud classification models, including CNN-based methods such as PointDAN [44], DefRec [1], GAST [85], PDG [60], MetaSets [21], PointNeXt [43], and X-3D [51], as well as Transformer-based approaches like SUG [23], PCT [16], and GBNet [46]. We also include Mamba-based methods PCM [73] and PointDGMamba [68]. Additionally, we evaluate V-RWKV, a modified Vision-RWKV [9] variant with a different number of blocks. Due to the unavailability of training code, PointRWKV [20] is excluded.

Benchmark Results. We conduct a comprehensive evaluation of the proposed PointDGRWKV method on two widely used multi-domain point cloud generalization benchmarks: PointDA-10 and PointDG-3to1. The results are presented in Table 1. PointDGRWKV consistently outperforms existing methods in terms of average overall accuracy across both benchmarks. Specifically, on PointDA-10, PointDGRWKV outperforms the state-of-the-art PointDGMamba on all three DG tasks, with an average accuracy of 75.66%. On the PointDG-3to1 benchmark, PointDGRWKV achieves an average accuracy of 81.92% across four domain shifts, outperforming PointDGMamba by 1.39 percentage points. The improvement is particularly pronounced in the most challenging ACD→B setting.
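The PointDG-3to1 figures quoted above can be re-derived from the per-task accuracies in Table 1. The sketch below uses `decimal` with half-up rounding so that the two-decimal averages are reproduced exactly; the helper name `mean_2dp` is illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

def mean_2dp(values):
    """Mean of decimal strings, rounded half-up to two decimal places."""
    total = sum(Decimal(v) for v in values)
    return (total / len(values)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

tasks = ["ABC→D", "ABD→C", "ACD→B", "BCD→A"]
ours = ["76.37", "95.99", "63.92", "91.38"]    # PointDGRWKV row of Table 1
mamba = ["74.20", "95.51", "61.71", "90.68"]   # PointDGMamba row of Table 1

avg_ours = mean_2dp(ours)
avg_mamba = mean_2dp(mamba)
margin = avg_ours - avg_mamba

# Per-task gain over PointDGMamba and the task with the largest gain.
gains = {t: Decimal(o) - Decimal(m) for t, o, m in zip(tasks, ours, mamba)}
largest = max(gains, key=gains.get)
print(avg_ours, avg_mamba, margin, largest, gains[largest])
```

The averages come out to 81.92 and 80.53, a gap of 1.39 points, and the largest per-task gain (2.21 points) falls on ACD→B, closely followed by ABC→D (2.17 points).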

Analysis of Improvements. Overall, PointDGRWKV demonstrates high performance across different domains, indicating strong cross-domain stability. We attribute this consistent improvement to the AGT-Shift mechanism, which better captures the local geometric structures inherent to unstructured point cloud data, effectively mitigating the information mismatch caused by the “pseudo-local receptive field” issue in the original RWKV. Additionally, the CD-KDA module alleviates attention misalignment in Bi-WKV caused by domain-specific variations in key distributions, enabling the model to learn more consistent structural perception across source domains and thereby enhancing generalization to unseen domains. It is worth noting that while the proposed method shows clear improvements on average, its advantage is relatively modest in certain simpler settings, suggesting that the primary gains come from improved robustness and stability under more complex domain shifts.

4.3 Ablation Study

Effectiveness of AGT-Shift and CD-KDA. To further validate the specific roles of the proposed components in the model, we conducted ablation experiments on the PointDA-10 benchmark, and the results are presented in Table 2. Firstly, we constructed a basic version of V-RWKV’ without AGT-Shift and CD-KDA modules. Compared with the basic model, the introduction of AGT-Shift improved the overall performance in all three tasks, indicating that this module has a positive effect on modeling the local geometric structure of point clouds. Furthermore, we separately introduced the CD-KDA module for evaluation. We observed stable performance improvements, indicating that the cross-domain key feature alignment mechanism has a certain effect in alleviating attention bias and improving generalization ability. Finally, when both modules are introduced simultaneously, the model achieves better performance on all transfer tasks, indicating that AGT-Shift and CD-KDA are complementary, jointly promoting the overall performance.

Effects of Different Shifting Strategies. To evaluate the effectiveness of our proposed AGT-Shift module, we compare it with three different token shifting strategies: (1) KNN-Random Replacement (KNN-RandOne): For each point, its K nearest neighbors are first identified using KNN search. Then, one neighbor is randomly selected to replace the original point’s feature. (2) KNN-Mean Aggregation (KNN-Avg): After obtaining the K nearest neighbors, the output feature is computed as the average of all neighbor features, replacing the original. (3) KNN-Weighted Aggregation (KNN-WAvg): Different from KNN-Avg, a soft weighting scheme is applied based on spatial distance, where closer neighbors contribute more. Table 3 shows that the average accuracies of these strategies are 74.54%, 74.25%, and 74.82% for KNN-RandOne, KNN-Avg, and KNN-WAvg, respectively. Notably, KNN-RandOne performs competitively despite its simplicity, suggesting that introducing randomness can help alleviate overfitting to local patterns. However, all three variants suffer from quadratic computational complexity due to KNN. In contrast, our AGT-Shift achieves better performance while maintaining linear complexity and avoiding pairwise distance computation, highlighting its efficiency and robustness in large-scale domain generalization tasks.

Impact of Key and Value Alignment in CD-KDA. To further investigate the roles of different components in the attention mechanism, we conduct an ablation study isolating the effects of the key ($\mathbf{k}$) and value ($\mathbf{v}$) features in our proposed CD-KDA module. Specifically, we design the following variants: (1) None: no alignment is performed, serving as a baseline; (2) Only $\mathbf{v}$: alignment is applied solely on the $\mathbf{v}$ features; (3) $\mathbf{k}$ and $\mathbf{v}$: both $\mathbf{k}$ and $\mathbf{v}$ are aligned simultaneously; (4) Only $\mathbf{k}$ (Ours): alignment is applied solely on the key representations $\mathbf{k}$, as proposed in our method. The results in Table 4 show that aligning only the value vector $\mathbf{v}$ brings limited performance improvement, suggesting that despite contributing to feature aggregation, $\mathbf{v}$ has a relatively minor influence on cross-domain generalization. In contrast, aligning both the key $\mathbf{k}$ and value $\mathbf{v}$ vectors leads to a more noticeable performance gain, indicating that promoting feature consistency does benefit generalization. Interestingly, the best performance is achieved when alignment is applied solely to $\mathbf{k}$, confirming our hypothesis that the key vector, which directly influences attention weights via the exponential function, plays a more critical role in guiding spatial focus and structural understanding. Thus, aligning $\mathbf{k}$ across domains significantly stabilizes the attention mechanism and enhances generalization performance.
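The different sensitivity to $\mathbf{k}$ and $\mathbf{v}$ can be illustrated with a deliberately simplified toy example. The sketch below assumes scalar keys and values and weights of the form $\exp(k_i) / \sum_j \exp(k_j)$; it omits any position-dependent terms of the full Bi-WKV operator, and all names and numbers are hypothetical. It only shows that a shift in one key redistributes the weights over every token, whereas a shift of the same size in one value leaves the weights untouched.

```python
import math


def exp_weights(keys):
    """Normalised exponential weights w_i = exp(k_i) / sum_j exp(k_j)."""
    m = max(keys)  # subtract the max for numerical stability
    exps = [math.exp(k - m) for k in keys]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(keys, values):
    """Weighted sum of scalar values under the key-derived weights."""
    return sum(w * v for w, v in zip(exp_weights(keys), values))


source_keys = [0.0, 0.0, 0.0]
target_keys = [0.0, 0.0, 2.0]      # hypothetical domain shift on one key
values = [1.0, 2.0, 3.0]
shifted_values = [1.0, 2.0, 5.0]   # same-size shift applied to one value instead

# A key shift changes where the weight mass goes.
print(exp_weights(source_keys))
print(exp_weights(target_keys))

# A value shift changes the output only linearly, under unchanged weights.
print(aggregate(source_keys, values), aggregate(source_keys, shifted_values))
```

In this toy setting the shifted key draws most of the weight to a single token, which is consistent with the motivation for aligning key distributions across domains.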

4.4 Visualization and Analysis

t-SNE Feature Visualization. To investigate the effectiveness of each proposed module, we visualize the target-domain features under four configurations using t-SNE, as illustrated in Fig. 5. Specifically, (I) is the baseline without AGT-Shift and CD-KDA, (II) removes only the AGT-Shift module, (III) removes only the CD-KDA module, and (IV) is our complete model. The visualization is conducted on the test set of the ShapeNet-5 (C) dataset under the PointDG-3to1 benchmark, with different colors denoting different classes. Among the first three configurations, the baseline (I) exhibits the lowest intra-class compactness, indicating limited discriminability. The variants that retain a single module, (II) without AGT-Shift and (III) without CD-KDA, improve moderately over the baseline, but both still show less compact clusters and more ambiguous boundaries than the full model. These differences are especially clear in categories such as cabinet (blue) and table (purple). The complete model (IV), in contrast, achieves the most compact intra-class distributions and the clearest inter-class boundaries, demonstrating superior feature separability and confirming the essential roles of AGT-Shift and CD-KDA in DG.

Effect of Model Scale. To examine the impact of model capacity on generalization, we design three variants of our PointDGRWKV: Ours-Base, Ours-Standard, and Ours-Large. As summarized in Table 5, Ours-Standard follows the default configuration used in our main experiments. Ours-Base halves the number of network blocks, resulting in a shallower architecture. In contrast, Ours-Large increases the feature dimension and performs denser point sampling during processing. Among these, Ours-Large achieves the best average accuracy, indicating that increased representational power and finer geometric granularity benefit generalization. Meanwhile, Ours-Base still performs competitively with state-of-the-art methods, indicating that the proposed method is effective even at lower computational cost.

Analysis of Computational Efficiency. To assess the efficiency of our approach, we compare model parameters, GFLOPs, inference time, and accuracy, as summarized in Table 6. Our method delivers strong generalizability while maintaining lower computational cost than most existing methods. Notably, the Ours-Base variant achieves 75.15% accuracy with only 2.13M parameters, 3.22 GFLOPs, and 1.68 ms inference time, highlighting the effectiveness of our lightweight design. These results confirm that even the compact version of our model achieves competitive performance at minimal cost, demonstrating strong potential for deployment in resource-constrained scenarios.

5 Conclusion

We propose PointDGRWKV, the first RWKV-based framework for domain generalization in point cloud classification. It enhances spatial perception and generalization while retaining RWKV’s efficient sequence modeling and linear complexity. To address RWKV’s limitations on 3D data, we introduce AGT-Shift for improved local geometric modeling and CD-KDA to reduce attention drift by aligning key distributions across domains. Extensive experiments on PointDA-10 and PointDG-3to1 benchmarks confirm that our method achieves state-of-the-art performance with a strong balance between efficiency and robustness.
