Exploiting Code Symmetries for Learning Program Semantics
Kexin Pei, Weichen Li, Qirui Jin, Shuyang Liu, Scott Geng, Lorenzo Cavallaro, Junfeng Yang, Suman Jana

TL;DR
This paper introduces SymC, a symmetry-aware model that incorporates code symmetries into the architecture, improving program analysis performance by leveraging group-theoretic principles.
Contribution
It presents a novel group-theoretic framework and a symmetry-equivariant self-attention mechanism for better encoding of code semantics in LLMs.
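For readers new to the terminology, the standard definitions of equivariance and invariance can be stated as below. This is a generic sketch: the symbols $G$ (the group of semantics-preserving transformations), $f$ (a representation layer), and $h$ (a prediction head) are notation chosen here for illustration and may not match the paper's.

```latex
% Assumed notation: G acts on programs x and on representations f(x).
\begin{align*}
  \text{$f$ is $G$-equivariant:} \quad & f(g \cdot x) = g \cdot f(x) && \forall g \in G, \\
  \text{$h$ is $G$-invariant:}   \quad & h(g \cdot z) = h(z)         && \forall g \in G, \\
  \text{hence:}                  \quad & h(f(g \cdot x)) = h(g \cdot f(x)) = h(f(x)).
\end{align*}
```

The last line shows why stacking equivariant layers under an invariant head yields a prediction that does not change when the input code is transformed by any element of $G$.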
Findings
SymC outperforms state-of-the-art models on five program analysis tasks.
Encoding code symmetries improves generalization and learning efficiency.
The approach requires no pre-training.
Abstract
This paper tackles the challenge of teaching code semantics to Large Language Models (LLMs) for program analysis by incorporating code symmetries into the model architecture. We introduce a group-theoretic framework that defines code symmetries as semantics-preserving transformations, where forming a code symmetry group enables precise and efficient reasoning of code semantics. Our solution, SymC, develops a novel variant of self-attention that is provably equivariant to code symmetries from the permutation group defined over the program dependence graph. SymC obtains superior performance on five program analysis tasks, outperforming state-of-the-art code models without any pre-training. Our results suggest that code LLMs that encode the code structural prior via the code symmetry group generalize better and faster.
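The equivariance property described in the abstract can be illustrated with a small NumPy sketch. It is not the paper's implementation: the function `graph_biased_attention`, the use of a raw adjacency matrix as an additive attention bias, and the four-statement toy graph are all assumptions made here for illustration. The sketch checks that attention biased by a dependence graph commutes with a permutation that is an automorphism of that graph, and fails to commute with one that is not.

```python
import numpy as np

def softmax(scores):
    # Numerically stable row-wise softmax.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def graph_biased_attention(X, A, Wq, Wk, Wv):
    # Single-head self-attention with no positional encoding. The adjacency
    # matrix A is added to the scores (hypothetical stand-in for a
    # PDG-derived bias; the paper's exact construction may differ).
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1]) + A
    return softmax(scores) @ V

def is_automorphism(A, perm):
    # A permutation is a graph automorphism when relabeling the nodes
    # maps the edge set onto itself.
    return np.array_equal(A[np.ix_(perm, perm)], A)

# Toy dependence graph (assumed) for the statements
#   s0: a = x;  s1: b = y;  s2: c = a + b;  s3: return c
A = np.zeros((4, 4))
for src, dst in [(0, 2), (1, 2), (2, 3)]:
    A[src, dst] = 1.0

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(4, d))  # one embedding per statement
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

out = graph_biased_attention(X, A, Wq, Wk, Wv)

swap_independent = np.array([1, 0, 2, 3])  # swap s0 and s1 (independent)
swap_dependent = np.array([3, 1, 2, 0])    # swap s0 and s3 (dependent)

# Equivariance: permuting the input should permute the output the same way.
holds_for_automorphism = np.allclose(
    graph_biased_attention(X[swap_independent], A, Wq, Wk, Wv),
    out[swap_independent],
)
holds_for_other_perm = np.allclose(
    graph_biased_attention(X[swap_dependent], A, Wq, Wk, Wv),
    out[swap_dependent],
)

print("s0<->s1 is an automorphism:", is_automorphism(A, swap_independent))
print("equivariance holds for s0<->s1:", holds_for_automorphism)
print("s0<->s3 is an automorphism:", is_automorphism(A, swap_dependent))
print("equivariance holds for s0<->s3:", holds_for_other_perm)
```

Swapping the two independent assignments leaves the graph unchanged, so the bias term is unaffected and the output rows are simply reordered. Swapping a statement with one that transitively depends on it changes which edges line up with the bias, so the outputs no longer match.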
Peer Reviews
Decision: ICML 2024 Spotlight
1. Formalization of code symmetries as automorphisms of graphs is nice and seems like the correct formalism.
2. SymC model achieves equivariance to the code symmetries under consideration in a natural way, which is not too different from existing Transformer-based models.
3. Empirical results show that SymC outperforms strong baselines, while being small and robust to code symmetries.
1. Hard to understand exactly what program interpretation graphs and program dependence graphs look like, which is crucial to the paper.
2. Experimental details are lacking. What is the training procedure for SymC, is it just supervised training on the downstream task? How about for the other models? For Function Name prediction, do the LLMs take in just the text as input, and what exactly does SymC take as input there?
3. Computation of graphs associated to code may be costly and restrictive.
The idea of defining code symmetries as semantics-preserving transformations, enabling precise reasoning within LLMs, is somewhat interesting. To evaluate the approach, the authors consider four analysis tasks that require a deep understanding of code behavior and are therefore expected to stay invariant to code symmetries. The experiments also use a set of real-world semantics-preserving transformations beyond PDG automorphisms to evaluate SymC's generalization.
The paper needs more evaluations, e.g., an evaluation of SymC's robustness against adversarial attack methods based on code transformations. Some content is not well presented or stated.
1. The paper presents a unique and innovative approach to harnessing code symmetry, grounded in group theory, which stands out from previous methods that rely on ad-hoc heuristics. Instead of using these transformations for data augmentation, as is common in prior work, SymC ingeniously incorporates them into the attention layers of Transformers, showcasing a novel application.
2. SymC's performance is noteworthy, as it surpasses the baselines across the various tasks presented in the paper, so
1. The paper could benefit from a more comprehensive comparison with related works, such as DOBF (https://arxiv.org/abs/2102.07492), which exploits code symmetry in pretraining through a deobfuscation objective, and CodeT5 (https://arxiv.org/abs/2109.00859), which leverages code symmetry in pretraining with identifier-aware data augmentation. These related works were not discussed or compared to the proposed method in the paper.
2. The evaluation framework relies heavily on four artificial task
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Software Reliability and Analysis Research
Methods: Linear Layer · Layer Normalization · Byte Pair Encoding · Dropout · Multi-Head Attention · Attention Is All You Need · Softmax · Dense Connections · Label Smoothing · Adam
