Detecting Malicious Intents in Smart Contracts with Pre-trained Programming Language Models

Youwei Huang; Jianwen Li; Bin Hu; Sen Fang; Yao Li; and Peng Yang

arXiv:2508.20086·cs.SE·April 7, 2026

TL;DR

This paper introduces SmartIntentV2, a state-of-the-art model for detecting malicious developer intents in smart contracts, leveraging a domain-adapted BERT model and outperforming previous methods significantly.

Contribution

The paper presents SmartIntentV2, an improved smart contract intent detection model that integrates a domain-adapted BERT-based pre-trained programming language model.

Findings

01

SmartIntentV2 achieves an F1 score of 0.9279, outperforming previous models.

02

It delivers a 65.5% relative improvement in F1 score over GPT-4.1.

03

The model attains high accuracy, precision, and recall on real-world smart contract data.

Abstract

Malicious developer intents in smart contracts constitute significant security threats to decentralized applications, leading to substantial economic losses. Prior work introduced SmartIntentNN, a deep learning model for detecting unsafe developer intents. By combining the Universal Sentence Encoder, a K-means clustering-based intent highlighting mechanism, and a Bidirectional Long Short-Term Memory (BiLSTM) network, the model achieved an F1 score of 0.8633 on an evaluation set of 10,000 real-world smart contracts across ten distinct intent categories. This paper presents SmartIntentV2 (Smart Contract Intent Neural Network Version 2). The primary enhancement is the integration of a BERT-based pre-trained programming language model, which we domain-adaptively pre-train on a dataset of 16,000 real-world smart contracts using a Masked Language Modeling objective. SmartIntentV2 retains…

Tables

Table 1. Performance of SmartIntentV2 on ten intent detection categories.

Category	Accuracy	Precision	Recall	F1
Fee	0.9452	0.9117	0.9639	0.9371
DisableTrading	0.9753	0.8954	0.8425	0.8681
Blacklist	0.9813	0.8580	0.9190	0.8874
Reflect	0.9930	0.9806	0.9967	0.9886
MaxTX	0.9750	0.9610	0.9579	0.9595
Mint	0.9393	0.7232	0.8630	0.7869
Honeypot	0.9910	0.5833	0.6875	0.6311
Reward	0.9947	0.9445	0.9752	0.9596
Rebase	0.9958	0.7000	0.8953	0.7857
MaxSell	0.9987	0.7143	0.9677	0.8219
Macro-average	0.9789	0.8272	0.9069	0.8626
Micro-average	0.9789	0.9090	0.9476	0.9279

Table 2. SmartIntentV2 vs. baseline models.

Model	Accuracy	Precision	Recall	F1
SmartIntentNN
V2 (this work)	0.9789	0.9090	0.9476	0.9279
V1	0.9647	0.8873	0.8406	0.8633
Ablation with Original Pre-trained Models
RoBERTa	0.9693	0.8670	0.9274	0.8962
CodeBERT	0.9672	0.8516	0.9332	0.8906
Other Baselines
LSTM	0.9172	0.7725	0.5973	0.6737
BiLSTM	0.9320	0.7871	0.7200	0.7521
CNN	0.9093	0.6922	0.6596	0.6755
GPT-3.5-turbo	0.8375	0.4135	0.5447	0.4701
GPT-4o-mini	0.7821	0.3703	0.9240	0.5288
GPT-4.1	0.8651	0.4927	0.6501	0.5606

Table 3. Ablation study of different backbone models and training strategies for smart contract intent detection. Results are reported under both macro- and micro-averaged evaluation metrics.

Backbone Model	Training Strategy	Avg.	Acc.	Prec.	Rec.	F1
SmartBERT	Original Distribution	Macro	0.9751	0.8718	0.6744	0.7125
	Original Distribution	Micro	0.9751	0.9270	0.8962	0.9113
	Class-Balanced Training	Macro	0.9789	0.8272	0.9069	0.8626
	Class-Balanced Training	Micro	0.9789	0.9090	0.9476	0.9279
CodeBERT	Original Distribution	Macro	0.9697	0.7594	0.6545	0.6852
	Original Distribution	Micro	0.9697	0.9017	0.8847	0.8931
	Class-Balanced Training	Macro	0.9672	0.7135	0.9022	0.7735
	Class-Balanced Training	Micro	0.9672	0.8516	0.9332	0.8906
RoBERTa	Original Distribution	Macro	0.9679	0.7267	0.6547	0.6699
	Original Distribution	Micro	0.9679	0.8813	0.8960	0.8886
	Class-Balanced Training	Macro	0.9693	0.7492	0.8771	0.8026
	Class-Balanced Training	Micro	0.9693	0.8670	0.9274	0.8962

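Equations

\mathcal{L}_{\mathrm{MLM}} = -\frac{1}{|M|} \sum_{i \in M} \log P_{\theta}(x_{i} \mid \mathbf{x}_{\text{masked}})

\mathbf{f} = \frac{1}{T} \sum_{t=1}^{T} \mathbf{h}_{t}

\mathbf{X} = \begin{bmatrix} \mathbf{f}_{1} \\ \mathbf{f}_{2} \\ \vdots \\ \mathbf{f}_{N} \end{bmatrix} \in \mathbb{R}^{N \times d}

\mathbf{X} = \mathrm{pad}(\{\mathbf{f}_{i}\}_{i=1}^{N}) \in \mathbb{R}^{L \times d}

\mathbf{h} = \mathrm{BiLSTM}(\mathbf{X}) \in \mathbb{R}^{2U}

\mathbf{z} = \mathbf{W} \cdot \mathrm{Dropout}(\mathbf{h}) + \mathbf{b}, \quad \hat{\mathbf{y}} = \sigma(\mathbf{z}) \in (0, 1)^{C}

\mathrm{FL}(p, y) = -\alpha y (1 - p)^{\gamma} \log(p) - (1 - \alpha)(1 - y) p^{\gamma} \log(1 - p)

\mathcal{L} = \frac{1}{M} \sum_{i=1}^{M} \sum_{c=1}^{C} \mathrm{FL}(p_{i,c}, y_{i,c})

\mathrm{Accuracy}_{c} = \frac{\mathrm{TP}_{c} + \mathrm{TN}_{c}}{\mathrm{TP}_{c} + \mathrm{TN}_{c} + \mathrm{FP}_{c} + \mathrm{FN}_{c}}, \quad \mathrm{Precision}_{c} = \frac{\mathrm{TP}_{c}}{\mathrm{TP}_{c} + \mathrm{FP}_{c}}

\mathrm{Recall}_{c} = \frac{\mathrm{TP}_{c}}{\mathrm{TP}_{c} + \mathrm{FN}_{c}}, \quad \mathrm{F1}_{c} = \frac{2 \cdot \mathrm{Precision}_{c} \cdot \mathrm{Recall}_{c}}{\mathrm{Precision}_{c} + \mathrm{Recall}_{c}}

\text{Macro-F1} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{F1}_{c}

\text{Micro-F1} = \frac{2 \sum_{c=1}^{C} \mathrm{TP}_{c}}{2 \sum_{c=1}^{C} \mathrm{TP}_{c} + \sum_{c=1}^{C} \mathrm{FP}_{c} + \sum_{c=1}^{C} \mathrm{FN}_{c}}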

Full text

Detecting Malicious Intents in Smart Contracts with Pre-trained Programming Language Models

Youwei Huang, Independent Researcher, Suzhou, China

Jianwen Li, Carnegie Mellon University, Moffett Field, CA, USA

Bin Hu, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

Sen Fang, North Carolina State University, Raleigh, NC, USA

Yao Li, Macau University of Science and Technology, Macao, China

Peng Yang, Institute of Intelligent Computing Technology, Suzhou, CAS, Suzhou, China

(2026)

Abstract.

Malicious developer intents in smart contracts constitute significant security threats to decentralized applications, leading to substantial economic losses. Prior work introduced SmartIntentNN, a deep learning model for detecting unsafe developer intents. By combining the Universal Sentence Encoder, a K-means clustering-based intent highlighting mechanism, and a Bidirectional Long Short-Term Memory (BiLSTM) network, the model achieved an F1 score of 0.8633 on an evaluation set of 10,000 real-world smart contracts across ten distinct intent categories.

This paper presents SmartIntentV2 (Smart Contract Intent Neural Network Version 2). The primary enhancement is the integration of a BERT-based pre-trained programming language model, which we domain-adaptively pre-train on a dataset of 16,000 real-world smart contracts using a Masked Language Modeling objective. SmartIntentV2 retains the BiLSTM-based multi-label classification network for intent detection. On the same evaluation set of 10,000 smart contracts, it achieves superior performance with an accuracy of 0.9789, precision of 0.9090, recall of 0.9476, and an F1 score of 0.9279, substantially outperforming its predecessor and other baseline models. Notably, SmartIntentV2 also delivers a 65.5% relative improvement in F1 score over GPT-4.1 on this specialized task. These results establish SmartIntentV2 as a new state-of-the-art model for smart contract intent detection.

Smart Contract, Blockchain Security, Malicious Intent Detection, Pre-trained Language Model, Domain Adaptation

Journal year: 2026

Copyright: cc

Conference: 22nd International Conference on Predictive Models and Data Analytics in Software Engineering (PROMISE ’26), July 5, 2026, Montreal, QC, Canada

DOI: 10.1145/3803846.3807464

ISBN: 979-8-4007-2584-5/26/07

CCS Concepts: Security and privacy → Software security engineering; Computing methodologies → Neural networks; Software and its engineering → Software verification and validation

1. Introduction

Smart contracts serve as the foundational infrastructure for decentralized application (DApp) development (Szabo, 1996; Antonopoulos and Wood, 2018; Ethereum Foundation, 2025), operating on various blockchain platforms, e.g., Ethereum (Buterin and others, 2014; Wood and others, 2014) and Binance Smart Chain (BSC) (Cernera et al., 2023). These contracts enable decentralized financial services and automate on-chain transactions, fostering a trustless execution environment. However, the transparency and immutability of smart contracts also introduce significant risks. Both vulnerabilities and deliberately embedded malicious intents can lead to severe economic losses for DApp users. While extensive research has been conducted on vulnerability detection in smart contracts (He et al., 2020; Chu et al., 2023; Chen et al., 2025), detecting malicious developer intents remains understudied despite their significant security implications.

To bridge this gap, SmartIntentNN was introduced as a deep learning-based approach for identifying unsafe development intents in smart contracts (Huang et al., 2022, 2025). It comprises three core components: (1) the Universal Sentence Encoder (USE) (Cer et al., 2018) for generating contextual embeddings of source code, (2) a K-means clustering-based intent highlighting module to emphasize intent-related features, and (3) a Bidirectional Long Short-Term Memory (BiLSTM) network for multi-label classification across ten distinct categories of unsafe intents. Evaluations on $10,000$ smart contracts showed that SmartIntentNN achieved an F1 score of 0.8633, with an accuracy of 0.9647, precision of 0.8873, and recall of 0.8406.

In this paper, we introduce SmartIntentV2 (Smart Contract Intent Neural Network Version 2), an enhanced version of this model. The primary improvement in V2 is the integration of a BERT-based pre-trained language model (Devlin, 2018; Liu, 2019), specifically CodeBERT (Feng et al., 2020), which replaces USE for embedding generation. The pre-trained encoder undergoes domain-adaptive pre-training on a corpus of $16,000$ smart contracts using a masked language modeling (MLM) objective to better capture the contextual semantics of smart contract code. It is subsequently fine-tuned for intent detection via transfer learning, incorporating a BiLSTM-based (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005) multi-label classification network. SmartIntentV2 was trained on $16,000$ smart contracts and evaluated on an independent test set of $10,000$ contracts. The results demonstrate that our model outperforms both traditional baselines and its predecessor, achieving an accuracy of 0.9789, precision of 0.9090, recall of 0.9476, and an F1 score of 0.9279.

Furthermore, we compare our model with large language models (LLMs) on the same evaluation dataset, where GPT-4.1 achieves an F1 score of only 0.5606. In a small-scale test involving 100 smart contracts, GPT-4.1 incurred costs of $2.88 and a latency of 11,074ms, demonstrating significantly higher economic and temporal overhead compared to SmartIntentV2. These results conclusively establish our model as the state-of-the-art for smart contract intent detection.

Our contributions are as follows:

•

We present SmartIntentV2, achieving an F1 score of 0.9279 for the task of smart contract intent detection.

•

We introduce SmartBERT, a domain-adapted pre-trained language model specifically designed for smart contract code, publicly available at https://huggingface.co/web3se/SmartBERT-v2.

•

We have open-sourced the dataset, code, documentation, and models at https://github.com/web3se-lab/web3-sekit.

2. Background

This section provides the technical background for our work. We detail the categories of unsafe development intents in smart contracts with representative examples, discuss existing methods for detecting these intents, and review the use of pre-trained language models for code representation.

2.1. Unsafe Intents in Smart Contracts

Smart contracts, implemented as Turing-complete programs on blockchain systems, enable the development of DApps. Solidity is the predominant language for smart contract programming, particularly on platforms like Ethereum and BSC. While much attention has been given to vulnerabilities inadvertently introduced during development, we argue that intentionally embedded malicious code by developers also constitutes a significant class of contract flaws. Prior research has identified ten common categories of unsafe development intents in smart contracts: Fee, DisableTrading, Blacklist, Reflect, MaxTX, Mint, Honeypot, Reward, Rebase, and MaxSell (Huang et al., 2025).

Figure 1 presents several code snippets from a smart contract that exemplify unsafe intents. The setTxLimit function allows for unrestricted modification of transaction limits, embodying the MaxTX intent. This capability can be exploited to manipulate transaction volumes unfairly. The setFees function facilitates the adjustment of various fees, including liquidity, reflection, and marketing fees, corresponding to the Fee intent. This may lead to unjust transaction costs for users. Most critically, the tradingStatus function empowers the contract owner to enable or disable trading at will, representing the DisableTrading intent. This functionality poses a significant threat, as it permits the owner to arbitrarily halt trading operations. Collectively, these functions reflect unsafe developer intents that have the potential to cause economic harm to users.

2.2. Smart Contract Intent Detection

Detecting malicious intents within smart contracts is crucial for safeguarding against potential threats. SmartIntentNN (Huang et al., 2022) was the pioneering model developed to address this challenge. Built using the TensorFlow.js framework (Abadi et al., 2016; Smilkov et al., 2019), it employs a combination of advanced components to enhance detection accuracy: the Universal Sentence Encoder for capturing contextual code representations, a K-means clustering model for highlighting intent indicators, and a BiLSTM network for classifying intents. The model was trained on a dataset of 10,000 smart contracts and evaluated on a separate, unseen test set of 10,000 contracts. While it achieved a commendable overall F1 score of 0.8633, significantly surpassing traditional deep learning baselines, it exhibited clear performance bottlenecks. Specifically, its detection capability was limited for minority-class intents due to data imbalance in the training set. For instance, the F1 score for the MaxSell intent was only 0.5714, highlighting the need for a more robust model.

2.3. Pre-trained Language Models

Pre-trained language models (PLMs) have revolutionized natural language processing with models like BERT (Devlin, 2018) and RoBERTa (Liu, 2019). These models leverage large-scale self-supervised learning to generate robust contextual representations that are adaptable to various downstream tasks via transfer learning, markedly reducing the demand for labeled data. In the realm of code analysis, models like CodeBERT (Feng et al., 2020) extend this paradigm by training on both code and natural language prose. Our work builds upon this by pre-training CodeBERT specifically on a large dataset of smart contracts, enhancing its ability to discern the subtle semantic nuances pertinent to intent detection. The resulting model, SmartBERT, is not limited to intent detection. As a domain-specific encoder trained on real-world Solidity code, it produces high-quality function-level embeddings that can benefit a broad spectrum of smart contract analysis tasks, including vulnerability detection, code clone detection, and contract similarity search. The domain-adaptive pre-training of the encoder on large-scale smart contract corpora is instrumental in equipping SmartIntentV2 with the capacity to model domain-specific syntactic and semantic patterns.

3. Model

SmartIntentV2 comprises SmartBERT, which is a pre-trained model specifically designed for smart contracts, paired with a BiLSTM-based network for multi-label classification. Figure 2 illustrates the complete model architecture of SmartIntentV2. The following subsections detail the training process of this pre-trained model and its utilization through transfer learning in downstream tasks for smart contract intent classification.

3.1. Data Preparation

Our dataset originates from an industrial smart contract auditing pipeline, in which contracts deployed on Ethereum and BSC were collected through both automated scanning and manual auditing processes. Each contract was analyzed at the source code level and labeled for the presence or absence of the ten unsafe intent categories described in Section 2.1. A single contract may carry zero, one, or multiple labels.

Dataset snapshot and scope. The auditing pipeline operates continuously. As of December 2025, the corpus has grown to over 40,000 annotated contracts. All experiments in this paper were conducted on a fixed snapshot taken when the full corpus stood at approximately 40,000 entries. After rigorous quality control, including cross-verified annotations and removal of ambiguous or duplicated contracts, we retained 30,000 contracts and partitioned them as follows:

•

Training set (16,000 contracts): used for both SmartBERT domain-adaptive pre-training and SmartIntentV2 classification training. Using the same set for both stages ensures there is no overlap with the evaluation set and that the encoder’s pre-training domain is fully aligned with the downstream task.

•

Evaluation set (10,000 contracts): an independent held-out set used exclusively for all intent-detection experiments, consistent with the configuration in SmartIntentNN.

•

Validation set (4,000 contracts): disjoint from both the training and evaluation sets, used during SmartBERT pre-training to monitor MLM loss every 10,000 steps.

The remaining approximately 10,000 contracts in the corpus at the time of the snapshot were reserved for future studies and not included in the current experiments. Because the auditing pipeline is continuously growing, incorporating the full corpus into a single experimental cycle is impractical: newly collected contracts require time for annotation, cross-verification, and quality assurance before they can be reliably used for training or evaluation. We therefore froze the experimental dataset at 30,000 quality-controlled contracts to ensure reproducibility. The complete dataset, including all contracts collected up to December 2025, has been open-sourced as part of the project repository listed in Section 1.

Due to limitations imposed by maximum sequence length, it is impractical to input entire smart contract contexts into our model during both training and evaluation phases. Consistent with the approach in SmartIntentNN, we extract code at the function level, with subsequent data processing also conducted at this granularity. From the perspective of processing an entire smart contract, this approach effectively treats function-level code as sequential data for analysis.
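To make the function-level granularity concrete, the following is a minimal sketch of how function bodies could be pulled out of Solidity source by brace matching. The helper name `extract_functions` and the heuristic itself are illustrative assumptions, not the extraction tooling used in the paper; in particular, braces inside comments or string literals would confuse it.

```python
import re


def extract_functions(source: str) -> list[str]:
    """Naively split Solidity source into function-level snippets.

    Hypothetical heuristic: a function starts at the keyword `function`
    and ends at the brace that closes its body.
    """
    functions = []
    for match in re.finditer(r"\bfunction\b", source):
        start = match.start()
        open_idx = source.find("{", start)
        semi_idx = source.find(";", start)
        # Skip declarations without a body (e.g. interface functions).
        if open_idx == -1 or (semi_idx != -1 and semi_idx < open_idx):
            continue
        depth = 0
        for pos in range(open_idx, len(source)):
            if source[pos] == "{":
                depth += 1
            elif source[pos] == "}":
                depth -= 1
                if depth == 0:
                    functions.append(source[start:pos + 1])
                    break
    return functions
```

Each returned snippet would then be tokenized and encoded separately, so that a contract becomes an ordered sequence of function-level inputs.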

3.2. Pre-training for Smart Contracts

We trained SmartBERT, a pre-trained programming language model specifically tailored for smart contract analysis. The model is initialized from CodeBERT, leveraging its proven capability in code understanding, and adapted to capture the unique syntactic and semantic patterns of smart contract programming. The pre-training process is visually depicted in the upper part of Figure 2.

Training Data and Preprocessing: We curated 16,000 real-world smart contracts, covering a broad spectrum of functionalities and structures in decentralized applications. The dataset was tokenized at the function level, rather than at the contract level, to prevent exceeding the maximum sequence length of 512 tokens and to enable finer-grained semantic representation of code.

Pre-training Objective: We adopt the MLM objective to adapt the pre-trained CodeBERT-base-mlm model (https://huggingface.co/microsoft/codebert-base-mlm) to smart contract code. In each function-level sequence, 15% of tokens are randomly selected and replaced with the [MASK] token. Given an input sequence $\mathbf{x}=(x_{1},x_{2},\ldots,x_{T})$ of length $T$, let $M\subseteq\{1,2,\ldots,T\}$ denote the masked positions, where $|M|\approx 0.15T$. The model predicts the original tokens $\{x_{i}\mid i\in M\}$ from the masked context $\mathbf{x}_{\text{masked}}$. The MLM loss is defined as:

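$$\mathcal{L}_{\mathrm{MLM}}=-\frac{1}{|M|}\sum_{i\in M}\log P_{\theta}(x_{i}\mid\mathbf{x}_{\text{masked}})\tag{1}$$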

where $P_{\theta}(x_{i}\mid\mathbf{x}_{\text{masked}})$ is the predicted token probability at position $i$ , parameterized by model weights $\theta$ .
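As an illustration of the objective, the sketch below selects roughly 15% of the positions in a sequence and averages the negative log-probability over those positions only. It is a toy, framework-free rendering under stated assumptions: the function names are hypothetical, and `token_probs` stands in for the model output $P_{\theta}(x_{i}\mid\mathbf{x}_{\text{masked}})$ rather than coming from a real encoder.

```python
import math
import random


def select_mask_positions(num_tokens, mask_ratio=0.15, seed=0):
    """Pick about mask_ratio * num_tokens distinct positions to mask."""
    rng = random.Random(seed)
    k = max(1, round(mask_ratio * num_tokens))
    return sorted(rng.sample(range(num_tokens), k))


def mlm_loss(token_probs, masked_positions):
    """Mean negative log-likelihood over the masked positions M.

    token_probs[i] is the probability the model assigns to the original
    token at position i, given the masked sequence.
    """
    total = sum(math.log(token_probs[i]) for i in masked_positions)
    return -total / len(masked_positions)


positions = select_mask_positions(num_tokens=20)  # 3 of 20 positions
loss = mlm_loss([0.5] * 20, positions)            # equals log(2)
```

Unmasked positions do not enter the sum, which is why the loss is normalized by $|M|$ rather than by the sequence length $T$.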

Model Architecture and Training Setup: Our backbone model is CodeBERT-base-mlm, which is initialized from RoBERTa-base and trained with an MLM objective on source code. We retain its 12-layer encoder architecture with hidden dimension $d=768$ and 12 self-attention heads. Domain-adaptive pre-training is performed on smart contract code using the MLM objective. Training runs for 20 epochs with a per-device batch size of 64 on two NVIDIA A100 80GB GPUs (effective batch size 128), taking approximately 10 hours. We utilize the AdamW optimizer with a learning rate of $5\times 10^{-5}$ . Evaluation on an independent held-out set of 4,000 smart contracts, disjoint from both the training and evaluation splits described in Section 3.1, is performed every 10,000 steps during pre-training.

Function-level Representation: For each function with input sequence $\mathbf{x}=(x_{1},x_{2},\ldots,x_{T})$ , SmartBERT produces contextualized token embeddings $\mathbf{H}=(\mathbf{h}_{1},\mathbf{h}_{2},\ldots,\mathbf{h}_{T})\in\mathbb{R}^{T\times d}$ , where $T$ is the sequence length and $d=768$ is the hidden dimension. We obtain a function embedding $\mathbf{f}\in\mathbb{R}^{d}$ via mean pooling:

$$\mathbf{f}=\frac{1}{T}\sum_{t=1}^{T}\mathbf{h}_{t}\tag{2}$$

Empirically, mean pooling outperforms using the [CLS] token embedding for function representation.
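Mean pooling can be sketched in a few lines of plain Python. The helper `mean_pool` is a hypothetical name, and the toy vectors use $d=2$ instead of the paper's $d=768$ purely for readability.

```python
def mean_pool(token_embeddings):
    """Average T token vectors (each of length d) into one function embedding."""
    num_tokens = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [
        sum(vec[j] for vec in token_embeddings) / num_tokens
        for j in range(dim)
    ]


# Two token embeddings of dimension 2 collapse into a single vector.
function_embedding = mean_pool([[1.0, 2.0], [3.0, 4.0]])  # [2.0, 3.0]
```

Unlike the [CLS] embedding, this average draws on every token position in the function.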

Contract-level Representation: Given a smart contract containing $N$ functions, we obtain function embeddings $\{\mathbf{f}_{i}\}_{i=1}^{N}$ using Eq. 2 for each function. These embeddings are stacked to form the contract-level representation matrix:

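$$\mathbf{X}=\begin{bmatrix}\mathbf{f}_{1}\\ \mathbf{f}_{2}\\ \vdots\\ \mathbf{f}_{N}\end{bmatrix}\in\mathbb{R}^{N\times d}\tag{3}$$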

where each row $\mathbf{f}_{i}\in\mathbb{R}^{d}$ represents the $i$ -th function embedding. This matrix $\mathbf{X}$ serves as input to the downstream multi-label classifier for intent detection.

This two-level scheme, token-to-function pooling followed by function-to-contract stacking, gives the classifier a hierarchical view of each contract.

3.3. Multi-label Intent Classification

We formulate smart contract intent detection as a multi-label binary classification task with $C$ target intents ( $C=10$ in our dataset). Given a contract with $N$ functions, the function embeddings $\{\mathbf{f}_{i}\}_{i=1}^{N}$ obtained from SmartBERT are organized into a fixed-length matrix through padding or truncation:

$$\mathbf{X}=\mathrm{pad}\left(\{\mathbf{f}_{i}\}_{i=1}^{N}\right)\in\mathbb{R}^{L\times d}\tag{4}$$

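where $L=256$ is the maximum number of function embeddings per contract, i.e., the length of the BiLSTM input sequence. Contracts with fewer than $L$ functions are padded with zero vectors, while contracts with more than $L$ functions are truncated to the first $L$.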
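The padding and truncation step can be sketched as follows. The defaults `max_len=256` and `dim=768` mirror $L$ and $d$ from the text; the function name and the choice to keep the leading functions when truncating are assumptions made for illustration.

```python
def pad_or_truncate(function_embeddings, max_len=256, dim=768):
    """Return exactly max_len rows: keep the first max_len embeddings,
    then append zero vectors until the matrix is full."""
    rows = [list(f) for f in function_embeddings[:max_len]]
    rows += [[0.0] * dim for _ in range(max_len - len(rows))]
    return rows


# A contract with 2 functions is padded to 4 rows (toy sizes).
matrix = pad_or_truncate([[1.0, 2.0], [3.0, 4.0]], max_len=4, dim=2)
```

Because real function embeddings are almost never exactly zero in every dimension, all-zero rows can later be recognized as padding and masked out.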

The BiLSTM encoder processes the padded matrix to capture inter-function dependencies:

$$\mathbf{h}=\mathrm{BiLSTM}(\mathbf{X})\in\mathbb{R}^{2U}\tag{5}$$

where $U=128$ is the hidden size per direction, and $\mathbf{h}$ concatenates the final forward and backward hidden states. A masking mechanism ensures padded positions do not contribute to the computation.

The final classification layer applies dropout regularization and produces per-class probabilities:

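$$\mathbf{z}=\mathbf{W}\cdot\mathrm{Dropout}(\mathbf{h})+\mathbf{b},\quad\hat{\mathbf{y}}=\sigma(\mathbf{z})\in(0,1)^{C}\tag{6}$$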

where $\mathbf{W}\in\mathbb{R}^{C\times 2U}$ , $\mathbf{b}\in\mathbb{R}^{C}$ , dropout rate is 0.5, and $\sigma(\cdot)$ is the element-wise sigmoid function.
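The output layer can be illustrated with a small inference-time sketch. Dropout is inactive at inference, so only the affine map and the sigmoid remain. The function names and the 0.5 decision threshold are assumptions for illustration; the text does not state which threshold converts probabilities into labels.

```python
import math


def sigmoid(z):
    """Element-wise logistic function for a single logit."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(h, weights, bias, threshold=0.5):
    """Map a BiLSTM summary vector h to per-intent probabilities and labels.

    weights has one row per intent class (shape C x 2U), bias has length C.
    The threshold is an assumed value, not taken from the paper.
    """
    logits = [
        sum(w * x for w, x in zip(row, h)) + b
        for row, b in zip(weights, bias)
    ]
    probs = [sigmoid(z) for z in logits]
    labels = [int(p >= threshold) for p in probs]
    return probs, labels


# Toy example with 2 classes and a 2-dimensional h.
probs, labels = classify([1.0, -1.0], [[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0])
```

Each class gets its own independent sigmoid, so a contract can receive zero, one, or several intent labels at once.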

Padding and Masking: Each contract’s function sequence is padded to length $L$ by appending zero vectors in $\mathbb{R}^{d}$ . Denote this operation as $\mathrm{pad}\left(\{\mathbf{f}_{i}\}_{i=1}^{N}\right)$ . The Masking layer with $\texttt{mask\_value}=0.0$ ensures padded timesteps do not contribute to recurrent computations or loss.

Loss Function: To address class imbalance and emphasize hard examples, we use the binary focal loss (Lin et al., 2017). For a single binary label $y\in\{0,1\}$ with prediction $p=\hat{y}$ , the focal loss is:

$$\mathrm{FL}(p,y)=-\alpha\,y\,(1-p)^{\gamma}\log(p)-(1-\alpha)(1-y)\,p^{\gamma}\log(1-p)\tag{7}$$

where $\gamma\geq 0$ is the focusing parameter and $\alpha\in[0,1]$ balances the importance of positive vs. negative samples. We adopt the default values $\gamma=2$ and $\alpha=0.25$ , which are widely used in practice (Lin et al., 2017). The overall training loss over $M$ contracts is the mean per-sample, per-class loss:

$$\mathcal{L}=\frac{1}{M}\sum_{i=1}^{M}\sum_{c=1}^{C}\mathrm{FL}(p_{i,c},y_{i,c})\tag{8}$$

where $p_{i,c}$ and $y_{i,c}$ denote the predicted probability and ground-truth label for class $c$ on sample $i$ . We implement this loss via TensorFlow’s BinaryFocalCrossentropy (Abadi et al., 2016; Smilkov et al., 2019).
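For intuition, the focal loss can be written out directly. The sketch below is a plain-Python rendering of the formulas in this section with $\gamma=2$ and $\alpha=0.25$, not the TensorFlow implementation; the clipping constant `eps` is an added assumption to avoid taking the logarithm of zero.

```python
import math


def binary_focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Focal loss for one predicted probability p and binary label y."""
    p = min(max(p, eps), 1.0 - eps)  # assumed clipping for numerical safety
    positive_term = -alpha * y * (1.0 - p) ** gamma * math.log(p)
    negative_term = -(1.0 - alpha) * (1 - y) * p ** gamma * math.log(1.0 - p)
    return positive_term + negative_term


def training_loss(probs, labels, alpha=0.25, gamma=2.0):
    """Sum the focal loss over classes, then average over the M contracts."""
    total = sum(
        binary_focal_loss(p, y, alpha, gamma)
        for row_p, row_y in zip(probs, labels)
        for p, y in zip(row_p, row_y)
    )
    return total / len(probs)


easy = binary_focal_loss(0.9, 1)  # confident and correct: tiny loss
hard = binary_focal_loss(0.1, 1)  # confident and wrong: much larger loss
```

The factor $(1-p)^{\gamma}$ is what shrinks the contribution of well-classified positives, so gradient signal concentrates on the hard examples that dominate minority classes.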

Optimization and Training Regimen: The classifier is optimized using Adam with the following hyperparameters during the primary training phase (complete data training):

•

learning rate $1\times 10^{-3}$

•

batch size $S=200$

•

number of chunks $B=80$ , processing $B\times S=16,000$ training samples per epoch

•

epochs per chunk $E=100$

•

dropout rate $p=0.5$

A secondary class-balanced training phase uses a reduced learning rate of $1\times 10^{-4}$, smaller batch sizes, and a per-class balanced sampling strategy to further alleviate class imbalance. Specifically, we randomly sample 10 instances from each intent class to ensure balanced representation across all $C=10$ intent categories during training.
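One possible reading of the per-class sampling step is sketched below. The function name, the sampling without replacement, and the handling of classes with fewer than 10 positives are all assumptions; the text only states that 10 instances are drawn per intent class.

```python
import random


def class_balanced_sample(labels, num_classes=10, per_class=10, seed=0):
    """Draw up to per_class positive contracts for every intent class.

    labels[i] is the multi-hot label list of contract i. Because contracts
    are multi-label, the same index may be drawn for more than one class.
    """
    rng = random.Random(seed)
    sampled = []
    for c in range(num_classes):
        positives = [i for i, y in enumerate(labels) if y[c] == 1]
        k = min(per_class, len(positives))
        sampled.extend(rng.sample(positives, k))
    return sampled


# Toy data: 30 contracts, 3 classes, exactly one label per contract.
toy_labels = [[1 if i % 3 == c else 0 for c in range(3)] for i in range(30)]
batch = class_balanced_sample(toy_labels, num_classes=3, per_class=4)
```

Every class contributes the same number of positives to the batch, regardless of how rare it is in the full training set.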

3.4. Overall Model Summary

In summary, SmartIntentV2 integrates pre-training, hierarchical representation learning, and sequence modeling for robust intent detection in smart contracts. The pipeline operates in three stages: (i) function-level embeddings are derived from SmartBERT, a domain-adapted language model pre-trained with the MLM objective on 16,000 real-world contracts; (ii) these embeddings are aggregated into a contract-level matrix with fixed-length padding and masking, ensuring consistent batch processing while preserving semantic granularity; (iii) a BiLSTM-based encoder captures inter-function dependencies, followed by a sigmoid-based output layer that produces multi-label probabilities across all $C=10$ intent categories.

Key design choices include the use of mean pooling for function embeddings, masking-aware sequence encoding, and binary focal loss for handling class imbalance. Together, these components enhance robustness against sequence length variability and uneven label distribution. Building on this architecture, we next evaluate SmartIntentV2 on a held-out set of 10,000 smart contracts to demonstrate its effectiveness in intent detection.

4. Evaluation

We first formulate three research questions that guide our evaluation, then describe the metrics and baselines used, and finally present the experimental results.

4.1. Research Questions

We investigate the following research questions to evaluate the effectiveness of domain-specific pre-training and training strategies in smart contract intent detection.

RQ1: Effectiveness of Domain-Specific Pre-training

Does domain-specific pre-training on smart contracts improve intent detection performance compared to general-purpose pre-trained models?

RQ2: Impact of Class-Balanced Training

How does class-balanced training affect performance under severe intent class imbalance?

RQ3: Contributions of Architectural Enhancements

Which architectural changes in SmartIntentV2 contribute most to the observed performance improvements over its predecessor?

4.2. Evaluation Metrics

We evaluate SmartIntentV2 using four standard metrics: accuracy, precision, recall, and F1 score. For each intent category $c\in\{1,\ldots,C\}$ , let $\mathrm{TP}_{c}$ , $\mathrm{TN}_{c}$ , $\mathrm{FP}_{c}$ , $\mathrm{FN}_{c}$ denote the true positives, true negatives, false positives, and false negatives over all $N$ evaluation samples. The per-class metrics are defined as:

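$$\mathrm{Accuracy}_{c}=\frac{\mathrm{TP}_{c}+\mathrm{TN}_{c}}{\mathrm{TP}_{c}+\mathrm{TN}_{c}+\mathrm{FP}_{c}+\mathrm{FN}_{c}},\quad\mathrm{Precision}_{c}=\frac{\mathrm{TP}_{c}}{\mathrm{TP}_{c}+\mathrm{FP}_{c}},\quad\mathrm{Recall}_{c}=\frac{\mathrm{TP}_{c}}{\mathrm{TP}_{c}+\mathrm{FN}_{c}},\quad\mathrm{F1}_{c}=\frac{2\cdot\mathrm{Precision}_{c}\cdot\mathrm{Recall}_{c}}{\mathrm{Precision}_{c}+\mathrm{Recall}_{c}}$$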

Accuracy is the proportion of correct predictions for class $c$ . F1 provides a balanced measure of precision and recall, and is particularly informative under class imbalance.

We report both macro-averaged and micro-averaged results. Macro-averaging computes each metric independently per class and then takes the unweighted mean across all $C$ classes, treating every class equally regardless of its frequency. Micro-averaging pools the per-class $\mathrm{TP}$ , $\mathrm{FP}$ , and $\mathrm{FN}$ counts globally before computing the metric, thereby giving more weight to frequent classes. For instance:

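$$\text{Macro-F1}=\frac{1}{C}\sum_{c=1}^{C}\mathrm{F1}_{c},\qquad\text{Micro-F1}=\frac{2\sum_{c=1}^{C}\mathrm{TP}_{c}}{2\sum_{c=1}^{C}\mathrm{TP}_{c}+\sum_{c=1}^{C}\mathrm{FP}_{c}+\sum_{c=1}^{C}\mathrm{FN}_{c}}$$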

The same aggregation schemes apply to accuracy, precision, and recall.
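The difference between the two averaging schemes is easiest to see on toy counts. The sketch below uses hypothetical helper names and made-up confusion counts for one frequent class and one rare class; none of the numbers come from the paper's experiments.

```python
def f1_from_counts(tp, fp, fn):
    """F1 written in terms of counts; equals 2PR / (P + R)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_from_counts(*c) for c in counts) / len(counts)


def micro_f1(counts):
    """F1 computed after pooling TP, FP, FN over all classes."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1_from_counts(tp, fp, fn)


# (TP, FP, FN) for a frequent class and a rare, poorly detected class.
toy_counts = [(90, 10, 10), (1, 1, 3)]
macro = macro_f1(toy_counts)  # about 0.617, pulled down by the rare class
micro = micro_f1(toy_counts)  # about 0.883, dominated by the frequent class
```

This is why macro-averaged scores are the more sensitive indicator of minority-class behavior, while micro-averaged scores track overall prediction quality.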

4.3. Result Analysis

We evaluate SmartIntentV2 on detecting ten representative intent categories introduced in Section 2.1. The evaluation is conducted on a held-out dataset of $N=10{,}000$ real-world smart contracts, disjoint from the training corpus of $16{,}000$ contracts.

As illustrated in Figure 3 and detailed in Table 1, accuracy is at least 0.9393 in every category, while precision, recall, and F1 vary more widely with class frequency. The Fee, Reflect, Reward, and MaxTX categories achieve F1 above 0.9, whereas the more challenging and imbalanced Honeypot, Mint, and Rebase categories reach F1 between 0.63 and 0.79, with Honeypot the weakest (precision 0.5833, recall 0.6875). SmartIntentV2 therefore performs strongly on majority-class intents and retains useful detection capability on minority-class and semantically subtle categories. The improved balance across intent categories, compared to the predecessor V1, can be attributed to two enhancements. First, a secondary class-balanced training phase resamples the training data so that all intent categories are represented more evenly. Second, binary focal loss emphasizes hard examples and down-weights easy negatives during optimization, which improves performance on minority classes without sacrificing majority-class accuracy.

We further compare SmartIntentV2 against its first version and a set of baselines, including traditional deep learning models (e.g., LSTM, BiLSTM, CNN) and LLMs (e.g., GPT-3.5-turbo, GPT-4o-mini, GPT-4.1). As shown in Table 2, SmartIntentV2 achieves the highest overall accuracy (0.9789), precision (0.9090), recall (0.9476), and F1 (0.9279). Compared to V1, V2 achieves a 7.48% relative improvement in F1 score, validating the effectiveness of the architectural and training enhancements introduced in this work. Against conventional neural baselines, SmartIntentV2 delivers substantial relative gains. For instance, it achieves a 23.37% relative improvement over BiLSTM in F1 score. The performance gap is even more pronounced when compared to LLM-based baselines, where SmartIntentV2 demonstrates 75.5% and 65.5% relative improvements over GPT-4o-mini and GPT-4.1 respectively in F1, highlighting that task-specific architectures remain highly competitive for domain-specialized classification problems.
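The relative improvements quoted above follow from the micro-averaged F1 values in Table 2. The short sketch below reproduces the arithmetic; the function name is illustrative.

```python
def relative_improvement(new_score, baseline_score):
    """Relative gain of new_score over baseline_score, in percent."""
    return (new_score - baseline_score) / baseline_score * 100.0


V2_F1 = 0.9279  # SmartIntentV2, Table 2

# Baseline F1 scores taken from Table 2.
baselines = {
    "SmartIntentNN V1": 0.8633,  # about 7.48%
    "BiLSTM": 0.7521,            # about 23.37%
    "GPT-4o-mini": 0.5288,       # about 75.5%
    "GPT-4.1": 0.5606,           # about 65.5%
}

gains = {
    name: relative_improvement(V2_F1, score)
    for name, score in baselines.items()
}
```

Note that these are relative gains: the absolute F1 difference between V2 and V1, for example, is 0.0646.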

This performance gap is largely attributable to the fact that LLMs are pre-trained on general-domain corpora and have not been exposed to smart contract intent detection tasks. All LLM baselines in this study are evaluated under a zero-shot setting: the prompt contains the natural-language definitions of the ten intent categories and the output format specification, but no labeled input–output examples are provided. This setting reflects the out-of-the-box capability of general-purpose models on a highly specialized task and establishes a lower bound on LLM performance for smart contract intent detection. Exploring few-shot prompting with labeled demonstrations and parameter-efficient fine-tuning of LLMs is an interesting direction that we leave for future work (Section 7).

Beyond performance metrics, we conduct an additional analysis to quantify the economic and temporal costs of smart contract intent detection. We evaluate both SmartIntentV2 and GPT-4.1 on a dedicated test set of 100 smart contracts under concurrent processing conditions. SmartIntentV2 completes the entire task in 2,628ms (2.63s) on a standard PC with negligible computational cost. In contrast, GPT-4.1 requires 11,074ms (11.07s) with a total consumption of 960,622 tokens (959,801 input tokens and 821 output tokens), averaging 9,606 tokens per request, resulting in approximately $2.88 in API costs. This substantial disparity in both time and economic efficiency further reinforces the practical advantages of our specialized approach over general-purpose LLMs for smart contract analysis.

Furthermore, we conduct an ablation study in which SmartBERT is replaced by general-purpose pre-trained models, RoBERTa and CodeBERT, as the encoder for generating smart contract representations, while keeping all other architectural components and training configurations unchanged. As reported in Table 2, both ablation variants exhibit a noticeable decline in performance: the model with RoBERTa achieves a micro F1 of 0.8962, and with CodeBERT, 0.8906, both substantially lower than the 0.9279 attained by SmartIntentV2 with SmartBERT. This result underscores the critical role of domain-adaptive pre-training, as SmartBERT’s exposure to large-scale smart contract corpora enables it to capture domain-specific syntactic and semantic patterns that are not well represented in general-purpose models. SmartIntentV2 with SmartBERT consistently outperforms all ablation baselines across all metrics.

4.4. Answers to Research Questions

We now revisit the three research questions posed in Section 4.1 in light of the experimental results.

Answer to RQ1. Table 3 shows that SmartBERT consistently outperforms general-purpose pre-trained models across all evaluation settings. Under class-balanced training, SmartIntentV2 with SmartBERT achieves a micro-averaged F1 of 0.9279, exceeding CodeBERT (0.8906) and RoBERTa (0.8962). The advantage is more pronounced under macro-averaged metrics, where SmartBERT reaches an F1 of 0.8626, compared to 0.7735 and 0.8026 for CodeBERT and RoBERTa, respectively. These results indicate that domain-specific pre-training enables SmartBERT to better capture smart contract–specific syntax and semantics that are underrepresented in general programming language corpora.
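
The gap between micro- and macro-averaged F1 arises because micro-averaging pools true positives, false positives, and false negatives across categories, whereas macro-averaging gives every category equal weight. The toy example below illustrates this on synthetic multi-label data; the category names and predictions are invented for illustration and are not drawn from the paper's dataset.

```python
def counts(y_true, y_pred):
    """Return (tp, fp, fn) for one binary label column."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def f1(tp, fp, fn):
    """F1 from raw counts; 0.0 when the category has no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(true_by_class, pred_by_class):
    """Micro- and macro-averaged F1 over a dict of per-category label columns."""
    per_class = [counts(true_by_class[c], pred_by_class[c]) for c in true_by_class]
    macro = sum(f1(*c) for c in per_class) / len(per_class)
    micro = f1(*(sum(col) for col in zip(*per_class)))
    return micro, macro


# Synthetic data: one frequent category predicted perfectly,
# one rare category with half of its positives missed.
y_true = {
    "frequent_intent": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    "rare_intent":     [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
}
y_pred = {
    "frequent_intent": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    "rare_intent":     [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
}
micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro={micro:.4f} macro={macro:.4f}")  # micro=0.9474 macro=0.8333
```

A single missed positive in the rare category barely moves the micro score but lowers the macro score substantially, which is why macro-averaged metrics are more sensitive to minority-intent performance.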

Answer to RQ2. As shown in Table 3, class-balanced training substantially improves performance across all backbone models, particularly under macro-averaged metrics. For SmartIntentV2 with SmartBERT, the macro F1 score increases from 0.7125 to 0.8626, corresponding to a 21.07% relative improvement. Similar trends are observed for RoBERTa (0.6699 to 0.8026, +19.81%) and CodeBERT (0.6852 to 0.7735, +12.89%). These gains primarily stem from improved recall on minority intent categories, which are emphasized by macro-averaged evaluation. This confirms that class-balanced training is critical for mitigating real-world intent imbalance in smart contracts.
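
The relative improvements quoted above follow from (after − before) / before. A minimal sketch that recomputes them from the reported macro F1 scores, with `relative_improvement` being our own helper name:

```python
def relative_improvement(before: float, after: float) -> float:
    """Relative gain of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# Reported macro F1 without and with class-balanced training.
macro_f1 = {
    "SmartBERT": (0.7125, 0.8626),
    "RoBERTa": (0.6699, 0.8026),
    "CodeBERT": (0.6852, 0.7735),
}
for encoder, (before, after) in macro_f1.items():
    print(f"{encoder}: +{relative_improvement(before, after):.2f}%")
# SmartBERT: +21.07%
# RoBERTa: +19.81%
# CodeBERT: +12.89%
```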

Answer to RQ3. Compared to V1, SmartIntentV2 achieves a 7.48% relative improvement in F1 score (0.9279 vs. 0.8633). This gain can be attributed to three key enhancements: (1) Domain-Adaptive Encoder: Replacing USE with SmartBERT provides richer contextualized representations tailored to smart contract code; (2) Class-Balanced Training: The secondary training phase significantly improves minority intent detection without degrading majority-class performance; (3) Binary Focal Loss: This loss function further mitigates class imbalance by emphasizing hard-to-classify samples. Together, these improvements lead to more robust and balanced intent detection, establishing SmartIntentV2 as a new state-of-the-art approach.
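
For readers unfamiliar with the third enhancement, binary focal loss down-weights well-classified samples via a modulating factor (1 − p_t)^γ applied to the cross-entropy term. The pure-Python sketch below follows the standard formulation; the values γ = 2.0 and α = 0.25 are the common defaults from the original focal loss literature and are assumptions here, since the settings used for SmartIntentV2 are not stated in this section.

```python
import math


def binary_focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss over paired labels and predicted probabilities.

    FL = -alpha_t * (1 - p_t)**gamma * log(p_t), where p_t is the predicted
    probability of the true class, and alpha_t is `alpha` for positives and
    `1 - alpha` for negatives. `gamma` and `alpha` are assumed defaults.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        p_t = p if y == 1 else 1.0 - p
        alpha_t = alpha if y == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(y_true)


easy = binary_focal_loss([1], [0.95])  # confidently correct positive
hard = binary_focal_loss([1], [0.30])  # poorly classified positive
print(f"easy={easy:.6f} hard={hard:.6f} ratio={hard / easy:.1f}")
```

With γ = 0 the modulating factor disappears and the loss reduces to α-weighted binary cross-entropy; larger γ shifts more of the gradient toward hard samples, which in an imbalanced multi-label setting tend to come from minority categories.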

5. Threats to Validity

This section discusses various threats to the validity of our study, covering both internal and external factors that could affect our results. These include data labeling inaccuracies, model and hyperparameter optimization, data imbalance, dataset coverage, and adaptability to evolving attack patterns.

5.1. Internal Validity

Ground-Truth Accuracy: Although we employed both automated pattern matching and manual audits by domain auditors to label the dataset, inaccuracies in multi-label annotations may persist. We conducted iterative cross-verification to reconcile inconsistent labels and performed additional reviews on randomly sampled subsets to validate annotation quality. We have also open-sourced our dataset to facilitate further industrial and academic review. Contracts deemed ambiguous or highly disputed were excluded from both the training and evaluation sets. Despite these measures, minor imperfections in labeling may still exist, affecting the internal validity of our findings.

Model Selection and Hyperparameters: In our study, we selected CodeBERT-base-mlm as the initial checkpoint for training SmartBERT, acknowledging that the model’s architecture and parameter settings may impact subsequent outcomes. To ensure robust performance, we conducted extensive ablation testing, including experiments with BERT and RoBERTa variants to fine-tune different versions of SmartBERT. For the downstream classification task, we tested various models, such as LSTM, CNN, and basic feedforward neural networks, along with different hyperparameters, including LSTM units, dropout rates, and normalization layers. The current model configuration was selected from these experiments. However, this setup may not be the absolute optimum, which poses an internal threat, and it remains open to long-term optimization and refinement.

5.2. External Validity

Data Imbalance: As depicted in Figure 4, certain intent categories, such as Rebase and Honeypot, account for less than 1% of our dataset, reflecting their infrequent occurrence in real-world mainnet code. Even though a class-balanced second training phase was employed using randomized sampling, this threat persists due to the inherent scarcity of these data samples. This scarcity suggests that these intents are indeed less prevalent in actual development, thereby naturally reducing their risk profile. We are committed to continually collecting smart contract data from public blockchain mainnets to incrementally include more of these underrepresented samples. This ongoing effort will facilitate further training and optimization of our model to better account for these low-frequency intents.
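
One way to picture a class-balanced sampling phase over multi-label data is to draw the same number of contracts per intent category. The sketch below is an assumption about how such sampling could work, not the paper's procedure: it samples with replacement so that rare categories can reach the quota, and the function name, the `per_class` parameter, and the `CommonIntent` label are hypothetical.

```python
import random
from collections import Counter, defaultdict


def class_balanced_sample(labels, per_class, seed=0):
    """Draw `per_class` contract ids for every intent category.

    `labels` maps a contract id to the set of intents it carries. Sampling is
    with replacement (an assumption), so rare categories are oversampled.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for contract_id, intents in labels.items():
        for intent in intents:
            by_class[intent].append(contract_id)

    sample = []
    for intent in sorted(by_class):
        sample.extend(rng.choices(by_class[intent], k=per_class))
    rng.shuffle(sample)
    return sample


# Synthetic, heavily imbalanced label set.
labels = {f"c{i}": {"CommonIntent"} for i in range(98)}
labels["c98"] = {"Honeypot"}
labels["c99"] = {"Rebase", "CommonIntent"}

sample = class_balanced_sample(labels, per_class=5)
per_intent = Counter(intent for cid in sample for intent in labels[cid])
print(len(sample), dict(per_intent))
```

Because contracts can carry several intents, drawing for one category also brings along co-occurring labels, so the resulting balance is approximate. Oversampling also repeats the same few rare contracts rather than adding new information, which is consistent with the observation above that the threat persists despite balanced training.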

Dataset Coverage: Our dataset primarily comprises open-source smart contracts from public blockchains like Ethereum and BSC, mainly using Solidity code, although it includes a small number of Vyper contracts (Buterin, 2018; Vyper Team, 2025). There is a lack of data from private and consortium blockchains, which poses a threat to the effectiveness of our method. We are working on enhancing the dataset with more Vyper contract data. For non-public and non-open-source contracts, access authorization from the respective platforms is required before they can be incorporated into the training pipeline.

Emerging Attacks: As the development of smart contracts expands, new categories of malicious intents are likely to emerge, posing potential threats to our model’s detection capabilities. To address this challenge, our neural network model is designed to be both extensible and retrainable. We can swiftly update our dataset with new samples and conduct extensive training to adapt the model to these evolving threats. SmartIntentV2 also benefits from its modular architecture; SmartBERT, as a pre-trained model, remains unchanged and does not require retraining. By focusing on modifying the downstream neural network layers and utilizing a small amount of new data for training, we can effectively identify these unsafe intents, maintaining robustness against emerging attack vectors.

6. Related Work

We review prior studies related to our work along three directions: (i) pre-trained models for natural and programming languages, which provide foundations for smart contract analysis; (ii) smart contract vulnerability detection, focusing on external risks; and (iii) malicious behaviors and developer intents, reflecting internal risks.

6.1. Pre-trained Models for Natural Language and Code

Pre-trained language models have fundamentally advanced representation learning for both natural language and source code. In NLP, models such as BERT (Devlin, 2018) and RoBERTa (Liu, 2019) provide contextualized embeddings that have become standard across downstream tasks. In the code domain, models including CodeBERT (Feng et al., 2020), CodeT5 (Wang et al., 2021), CodeT5+ (Wang et al., 2023), and CuBERT (Kanade et al., 2020) adapt pre-training objectives to programming languages, enabling tasks such as code search, summarization, clone detection, and defect detection.

More recently, large language models (LLMs) have been explored for smart contract analysis. Smart-LLaMA-DPO (Yu et al., 2025) enhances explainability in vulnerability detection through reinforcement learning, while SCALM (Li et al., 2025) applies prompting and retrieval strategies to identify insecure coding practices.

Our work is orthogonal to these approaches. Unlike general-purpose language models and broadly trained code models, we adopt domain-specific pre-training tailored to smart contracts, enabling more precise modeling of contract-specific semantics. Moreover, in contrast to LLM-based methods that emphasize generative reasoning, our encoder-only architecture is lightweight, controllable, and readily integrable with task-specific neural components such as BiLSTMs.

6.2. Smart Contract Vulnerability Detection

Most existing research on smart contract security focuses on vulnerability detection, targeting unintentional defects introduced during development. Early approaches rely on formal verification, symbolic execution, and static analysis, exemplified by tools such as Oyente (Luu et al., 2016), Mythril (Mueller, 2017), ZEUS (Kalra et al., 2018), Securify (Tsankov et al., 2018), and SmartCheck (Tikhomirov et al., 2018). ÆGIS (Ferreira Torres et al., 2020) further extends this line by providing runtime shielding against malicious control and data flows.

With the adoption of machine learning, vulnerability detection has evolved toward learned representations. SaferSC (Tann et al., 2018) employs sequential neural models, while ContractWard (Wang et al., 2020), DR-GCN, and TMP (Zhuang et al., 2020) leverage graph-based learning. Recent work explores transfer and contrastive learning (Sendner et al., 2023; Chen et al., 2024), as well as LLM-based vulnerability analysis (Chen et al., 2025; Yu et al., 2025; Li et al., 2025).

Despite these advances, the dominant focus remains on bugs, vulnerabilities, and insecure coding practices. In contrast, our work addresses unsafe developer intents, capturing intentional and high-risk behaviors that cannot be fully characterized by vulnerability-level analysis alone.

6.3. Malicious Developer Intents and Behaviors

Internal security risks arise from malicious behaviors deliberately embedded by developers. Prior studies typically focus on specific threat categories. HoneyBadger (Torres et al., 2019) detects honeypots through symbolic execution, while “Trade or Trick” (Xia et al., 2021) and TrapdoorAnalyser (Huynh et al., 2025) identify scam tokens in decentralized exchanges (Lehar and Parlour, 2025). Other work addresses Ponzi schemes (Hu et al., 2022), backdoors (Ma et al., 2023), and rug-pull risks (Zhou et al., 2024; Kalacheva et al.; Huynh et al.), with some employing topological data analysis (Fan et al., 2022).

While effective for specific threats, these approaches are largely threat-centric and lack a unified abstraction of developer intent. Recent work (Huang et al., 2025) formalized unsafe developer intent categories and demonstrated intent detection using deep learning. Building on this foundation, the present work advances intent-level representation learning through domain-specific pre-training, enabling a more systematic and proactive form of smart contract security analysis.

7. Future Work

Based on our current study, we identify three directions for advancing smart contract malicious intent detection capabilities.

Few-shot and Fine-tuned LLM Comparison. Our current LLM baselines are evaluated in a zero-shot setting. A natural next step is to conduct a comprehensive few-shot evaluation (e.g., 5-shot and 10-shot with demonstrations sampled from the training set) and to explore parameter-efficient fine-tuning of open-weight LLMs on smart contract intent detection data. Such experiments would establish upper-bound LLM performance on this task and provide a more complete picture of the cost–accuracy trade-off between general-purpose and task-specific models.

Bytecode-based Intent Detection for Closed-source Contracts. A significant limitation of our current approach is its reliance on source code availability, restricting applicability to open-source smart contracts. Many deployed contracts on mainnet lack verified source code, creating blind spots in intent detection. Future work will explore training SmartBERT directly on smart contract bytecode to enable detection of malicious developer intents in closed-source contracts. This extension would substantially broaden our method’s practical deployment scope across the entire blockchain ecosystem.

Intent Code Localization Capabilities. While our model can successfully identify the presence of 10 malicious intent categories, it lacks precise code localization functionality. Detecting intent types alone is insufficient for practical security auditing—developers and auditors require exact code locations where malicious intents are implemented. Future research will investigate extending our model with code localization capabilities, potentially through fine-tuning specialized language models to pinpoint specific functions, statements, or code segments responsible for detected malicious intents.

8. Conclusion

We present SmartIntentV2, an enhanced deep learning model for smart contract intent detection. The primary enhancement is the incorporation of SmartBERT, a BERT-based programming language model pre-trained on 16,000 real smart contracts using MLM at the function level. This model generates contextual embeddings that are processed by a BiLSTM-based multi-label classifier to detect ten categories of unsafe developer intents. Comprehensive evaluation on 10,000 smart contracts demonstrates that SmartIntentV2 achieves an accuracy of 0.9789, a precision of 0.9090, a recall of 0.9476, and an F1 score of 0.9279, representing a 7.48% relative improvement over SmartIntentNN (V1) and substantially outperforming all baseline methods including LLMs. These results establish SmartIntentV2 as the new state-of-the-art model for smart contract intent detection.

Acknowledgements.

This work was supported by the National Key Research Program of China (Grant No. 2021YFF0703800).

Bibliography

1. M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16), pp. 265–283. Cited by: §2.2, §3.3.
2. A. M. Antonopoulos and G. Wood (2018) Mastering ethereum: building smart contracts and dapps. O’Reilly Media. Cited by: §1.
3. V. Buterin et al. (2014) A next-generation smart contract and decentralized application platform. white paper 3(37), pp. 2–1. Cited by: §1.
4. V. Buterin (2018) Vyper documentation. Vyper by Example, pp. 13. Cited by: §5.2.
5. D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, et al. (2018) Universal sentence encoder. arXiv preprint arXiv:1803.11175. Cited by: §1.
6. F. Cernera, M. La Morgia, A. Mei, and F. Sassi (2023) Token spammers, rug pulls, and sniper bots: an analysis of the ecosystem of tokens in ethereum and in the binance smart chain (BNB). In 32nd USENIX Security Symposium (USENIX Security 23), pp. 3349–3366. Cited by: §1.
7. C. Chen, J. Su, J. Chen, Y. Wang, T. Bi, J. Yu, Y. Wang, X. Lin, T. Chen, and Z. Zheng (2025) When chatgpt meets smart contract vulnerability detection: how far are we?. ACM Transactions on Software Engineering and Methodology 34(4), pp. 1–30. Cited by: §1, §6.2.
8. Y. Chen, Z. Sun, Z. Gong, and D. Hao (2024) Improving smart contract security with contrastive learning-based vulnerability detection. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pp. 1–11. Cited by: §6.2.