Do LLMs Provide Links to Code Similar to what they Generate? A Study with Gemini and Bing CoPilot

Daniele Bifolco; Pietro Cassieri; Giuseppe Scanniello; Massimiliano Di Penta; Fiorella Zampetti

arXiv:2501.12134 · cs.SE · January 22, 2025

TL;DR

This study empirically evaluates how often links provided by Bing CoPilot and Google Gemini are relevant to generated code snippets, revealing significant issues with provenance and trustworthiness in LLM-based code assistants.

Contribution

It provides the first systematic analysis of link relevance in LLM-generated code, highlighting the extent of provenance debt and its implications.

Findings

1. 66% of Bing CoPilot links are relevant
2. 28% of Google Gemini links are relevant
3. LLMs often provide irrelevant or misleading links

Abstract

Large Language Models (LLMs) are currently used for various software development tasks, including generating code snippets to solve specific problems. Unlike reuse from the Web, LLMs are limited in providing provenance information about the generated code, which may have important trustworthiness and legal consequences. While LLM-based assistants may provide external links that are "related" to the generated code, we do not know how relevant such links are. This paper presents the findings of an empirical study assessing the extent to which 243 and 194 code snippets, across six programming languages, generated by Bing CoPilot and Google Gemini, likely originate from the links provided by these two LLM-based assistants. The study leverages automated code similarity assessments with thorough manual analysis. The study's findings indicate that the LLM-based assistants provide a mix of…
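The automated similarity step mentioned in the abstract could look roughly like the following. This is a minimal sketch, not the authors' tooling: the excerpt does not name the similarity measure they used, so token-level matching with Python's standard `difflib` is an assumption, and the names `tokenize`, `similarity`, and `best_match` are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Assumption: a crude, language-agnostic tokenizer (identifiers, numbers,
# and single punctuation characters). Whitespace is discarded.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code: str) -> list[str]:
    """Split a code snippet into a flat list of tokens."""
    return TOKEN_RE.findall(code)


def similarity(generated: str, linked: str) -> float:
    """Return a token-level similarity ratio in [0.0, 1.0]."""
    matcher = SequenceMatcher(
        None, tokenize(generated), tokenize(linked), autojunk=False
    )
    return matcher.ratio()


def best_match(generated: str, candidates: list[str]) -> tuple[int, float]:
    """Return (index, score) of the linked snippet closest to the generated one.

    Returns (-1, 0.0) when the linked page yielded no code candidates.
    """
    if not candidates:
        return -1, 0.0
    scores = [similarity(generated, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


if __name__ == "__main__":
    generated = "def f(x): return x+1"
    from_links = ["print('hi')", "def f(x):\n    return x + 1"]
    print(best_match(generated, from_links))
```

A high score only flags a candidate for inspection. As the abstract notes, the study pairs automated similarity with manual analysis, so a score like this would be a filter rather than a verdict on provenance.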

Taxonomy

Topics: Digital Rights Management and Security · Library Science and Information Systems