Using Small Language Models to Reverse-Engineer Machine Learning Pipelines Structures
Nicolas Lacroix, Mireille Blay-Fornarino, Sébastien Mosser, Frederic Precioso

TL;DR
This paper investigates the use of Small Language Models to automatically identify and understand the structure of machine learning pipelines from source code, aiming to improve flexibility and reliability over existing methods.
Contribution
It demonstrates that Small Language Models can effectively classify ML pipeline components, offering a scalable and adaptable approach compared to manual labeling and traditional classifiers.
Findings
SLMs outperform traditional classifiers in pipeline structure classification
Classification accuracy varies with how the taxonomy is defined
SLMs provide insights aligning with data science practices from prior studies
Abstract
Background: Extracting the stages that structure Machine Learning (ML) pipelines from source code is key for gaining a deeper understanding of data science practices. However, the diversity caused by the constant evolution of the ML ecosystem (e.g., algorithms, libraries, datasets) makes this task challenging. Existing approaches either depend on non-scalable, manual labeling, or on ML classifiers that do not properly support the diversity of the domain. These limitations highlight the need for more flexible and reliable solutions. Objective: We evaluate whether Small Language Models (SLMs) can leverage their code understanding and classification abilities to address these limitations, and subsequently how they can advance our understanding of data science practices. Method: We conduct a confirmatory study based on two reference works selected for their relevance regarding current…
Taxonomy
Topics: Machine Learning and Data Classification · Software Engineering Research · Machine Learning in Materials Science
