Coding Triangle: How Does Large Language Model Understand Code?
Taolin Zhang, Zihan Ma, Maosong Cao, Junnan Liu, Songyang Zhang, Kai Chen

TL;DR
This paper introduces the Code Triangle framework to systematically evaluate large language models' understanding of code across analysis, implementation, and testing, revealing their strengths and limitations in coding tasks.
Contribution
The study proposes the Code Triangle framework, which evaluates LLMs on editorial analysis, code implementation, and test case generation, and shows that human-generated data and model mixtures improve performance and robustness.
Findings
LLMs can form a self-consistent system across analysis, implementation, and testing.
Model errors tend to cluster due to training data biases.
Incorporating human data and model mixtures improves performance and robustness.
Abstract
Large language models (LLMs) have achieved remarkable progress in code generation, yet their true programming competence remains underexplored. We introduce the Code Triangle framework, which systematically evaluates LLMs across three fundamental dimensions: editorial analysis, code implementation, and test case generation. Through extensive experiments on competitive programming benchmarks, we reveal that while LLMs can form a self-consistent system across these dimensions, their solutions often lack the diversity and robustness of human programmers. We identify a significant distribution shift between model cognition and human expertise, with model errors tending to cluster due to training data biases and limited reasoning transfer. Our study demonstrates that incorporating human-generated editorials, solutions, and diverse test cases, as well as leveraging model mixtures, can…
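To make the idea of self-consistency between implementation and testing concrete, here is a minimal sketch, not the paper's actual evaluation harness. It assumes a solution is a Python callable and a test case is an (arguments, expected output) pair; `pass_rate`, `model_solution`, and both case lists are hypothetical names and data invented for illustration.

```python
from typing import Callable, Iterable, Tuple

# Assumed representation: (positional arguments, expected output)
TestCase = Tuple[tuple, object]


def pass_rate(solution: Callable, cases: Iterable[TestCase]) -> float:
    """Return the fraction of cases the solution passes; a crash counts as a failure."""
    cases = list(cases)
    if not cases:
        return 0.0
    passed = 0
    for args, expected in cases:
        try:
            if solution(*args) == expected:
                passed += 1
        except Exception:
            pass  # runtime errors are treated as failed cases
    return passed / len(cases)


# Hypothetical model-written solution: sums a list but silently drops negatives.
def model_solution(nums):
    return sum(x for x in nums if x > 0)


# Hypothetical model-written tests share the same blind spot (no negative inputs).
model_cases = [(([1, 2, 3],), 6), (([],), 0)]
# Hypothetical human-written tests add a case that exposes the bug.
human_cases = model_cases + [(([-1, 2],), 1)]

self_consistency = pass_rate(model_solution, model_cases)
human_agreement = pass_rate(model_solution, human_cases)
print(self_consistency, human_agreement)
```

In this toy setup the model's code passes all of its own tests yet fails one human-written case, which is the kind of gap between a self-consistent system and human expertise that the abstract describes.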
Peer Reviews
Decision · Submitted to ICLR 2026
* Framework: The primary strength is the proposal of the "Coding Triangle" (Editorial, Code, Cases). This is a novel, intuitive, and significant contribution. It provides a multi-dimensional, interpretable framework that moves beyond simple functional correctness to probe an LLM's analytical and validation capabilities.
* Insight: The paper clearly identifies and provides evidence for "self-consistency" and "distribution shift". The finding that LLM-generated solutions are highly similar (hig
* Methodological Opacity (Critical Weakness): As detailed in the "Soundness" section, the paper is missing the most crucial experimental details. The authors analyze solution diversity and self-consistency without specifying the decoding parameters (temperature, top-p, etc.) or the number of samples (k) used for the diversity analysis in Figure 3. These parameters are not minor details; they are the central variables that control the exploration and diversity the paper claims to measure. This om
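The reviewer's concern can be illustrated with a small sketch of one way to score diversity over k sampled solutions. This is an assumption-laden stand-in: the similarity metric the paper uses is not stated in this text, so the sketch substitutes `difflib.SequenceMatcher` from the Python standard library, and `mean_pairwise_similarity` is a hypothetical helper name.

```python
from difflib import SequenceMatcher
from itertools import combinations


def mean_pairwise_similarity(samples):
    """Average character-level similarity over all unordered pairs of samples.

    Values near 1.0 mean the samples are nearly identical (low diversity).
    """
    pairs = list(combinations(samples, 2))
    if not pairs:
        raise ValueError("need at least two samples")
    total = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)


# Hypothetical samples: in a real study these would be k generations drawn
# at a stated temperature and top-p.
low_diversity = ["print(sum(a))", "print(sum(a))", "print(sum(a))"]
mixed = ["print(sum(a))", "total = 0\nfor x in a:\n    total += x\nprint(total)"]

print(mean_pairwise_similarity(low_diversity))
print(mean_pairwise_similarity(mixed))
```

Because the score is computed over the sampled set, both the number of samples and the sampling settings that produced them shape the result, which is why the review treats them as essential details to report.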
No obvious grammar flaw in the paper.
1. Figure 1, the teaser, is difficult to follow. I can't understand the relationship between the green, blue, and orange arrows and blocks, and the figure does not make it easy to tell which dimensions are self-consistent and which are not.
2. The evaluation models: QwQ, Qwen Coder, and Qwen Instruct all come from the same company, so I am concerned that their pretraining data are similar. Using a more diverse set of models would make the paper's observations more convincing.
* The three-dimensional evaluation framework is innovative and addresses limitations of existing benchmarks.
* The analysis of self-consistency and self-inconsistency reveals important characteristics of model cognition.
* The discovery that model mixtures enhance diversity and robustness is practically valuable.
* Comprehensive experiments across multiple model types and problem difficulties.
* The evaluation is limited to competitive programming problems; generalization to real-world coding scenarios needs verification.
* The "self-consistency" and "self-inconsistency" concepts could be more precisely defined and quantified.
* Limited analysis of why reasoning models still exhibit self-inconsistency despite extended reasoning capabilities.
* No discussion of the computational cost of implementing the full Coding Triangle evaluation.
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Artificial Intelligence in Healthcare and Education
