COMEX: A Tool for Generating Customized Source Code Representations
Debeshee Das, Noble Saji Mathews, Alex Mathai, Srikanth Tamilselvam, Kranthi Sedamaki, Sridhar Chimalakonda, Atul Kumar

TL;DR
COMEX is a versatile tool that enables easy creation and combination of multiple structural code views from source code, facilitating advanced machine learning applications in software engineering.
Contribution
It introduces a flexible framework for generating customizable code-views directly from source code, supporting multiple languages and analysis levels, built on the tree-sitter parser.
Findings
Supports Java and C# source code analysis
Works on both method and program-level snippets
Easily extendable to other languages
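To make the idea of combining code-views concrete, here is a minimal sketch that merges an AST view with a naive data-flow view into one labelled edge list. It is not COMEX's API and does not use tree-sitter. As a stand-in, it relies on Python's standard `ast` module and analyzes a Python snippet rather than Java or C#. The function name `build_combined_view` and the edge labels are hypothetical, and the data-flow pass only handles straight-line, top-level statements.

```python
import ast


def build_combined_view(source: str):
    """Return (nodes, edges); each edge is (src_id, dst_id, kind) with kind 'AST' or 'DFG'.

    Illustrative only: a hypothetical helper, not part of COMEX.
    """
    tree = ast.parse(source)

    def keep(node):
        # Keep modules, statements, and expressions; drop shared helper
        # nodes such as Load/Store contexts and operator tokens.
        return isinstance(node, (ast.mod, ast.stmt, ast.expr))

    ids = {}    # id(ast node) -> small integer id
    nodes = {}  # integer id -> human-readable label
    for node in ast.walk(tree):
        if keep(node):
            ids[id(node)] = len(ids)
            label = type(node).__name__
            if isinstance(node, ast.Name):
                label += ":" + node.id
            nodes[ids[id(node)]] = label

    edges = []

    # AST view: one edge per parent-child pair among the kept nodes.
    for parent in ast.walk(tree):
        if not keep(parent):
            continue
        for child in ast.iter_child_nodes(parent):
            if keep(child):
                edges.append((ids[id(parent)], ids[id(child)], "AST"))

    # Naive data-flow view: within each top-level statement, link every
    # variable read to the most recent earlier write of the same name,
    # then record the statement's own writes.
    last_def = {}
    for stmt in tree.body:
        names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        for n in names:
            if isinstance(n.ctx, ast.Load) and n.id in last_def:
                edges.append((last_def[n.id], ids[id(n)], "DFG"))
        for n in names:
            if isinstance(n.ctx, ast.Store):
                last_def[n.id] = ids[id(n)]

    return nodes, edges


if __name__ == "__main__":
    demo_nodes, demo_edges = build_combined_view("x = 1\ny = x + 2\n")
    for src, dst, kind in demo_edges:
        print(kind, demo_nodes[src], "->", demo_nodes[dst])
```

Keeping both views in a single edge list with a `kind` tag is one simple way to let a downstream consumer select or combine views by filtering. A real tool would also need to handle branches, loops, and scopes, which this sketch ignores.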
Abstract
Learning effective representations of source code is critical for any Machine Learning for Software Engineering (ML4SE) system. Inspired by natural language processing, large language models (LLMs) like Codex and CodeGen treat code as generic sequences of text and are trained on huge corpora of code data, achieving state-of-the-art performance on several software engineering (SE) tasks. However, valid source code, unlike natural language, follows a strict structure and pattern governed by the underlying grammar of the programming language. Current LLMs do not exploit this property of the source code as they treat code like a sequence of tokens and overlook key structural and semantic properties of code that can be extracted from code-views like the Control Flow Graph (CFG), Data Flow Graph (DFG), Abstract Syntax Tree (AST), etc. Unfortunately, the process of generating and integrating…
Taxonomy
Topics: Software Engineering Research · Software Reliability and Analysis Research · Software System Performance and Reliability
Methods: CodeGen
