HuggingGraph: Understanding the Supply Chain of LLM Ecosystem
Mohammad Shahedur Rahman, Peng Gao, Yuede Ji

TL;DR
This paper introduces HuggingGraph, a methodology and graph-based model to analyze the supply chain of large language models, revealing relationships and potential risks inherited from datasets and previous models.
Contribution
The paper presents a novel graph-based approach for systematically analyzing the LLM supply chain and builds a large heterogeneous graph with over 400,000 nodes.
Findings
Identified complex relationships between models and datasets.
Revealed potential vulnerabilities inherited from data sources.
Provided insights for improving model fairness and compliance.
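To make the idea of inherited risk concrete, here is a minimal sketch of a heterogeneous supply chain graph with two node types (models and datasets) and a traversal that collects everything upstream of a given model. It is an illustration only, not the paper's implementation: the class name, the relation labels ("fine_tuned_from", "trained_on"), and all node names are hypothetical assumptions.

```python
from collections import defaultdict, deque


class SupplyChainGraph:
    """Toy heterogeneous graph: nodes are models or datasets (hypothetical schema)."""

    def __init__(self):
        self.node_type = {}                # node name -> "model" or "dataset"
        self.upstream = defaultdict(list)  # node name -> [(parent, relation), ...]

    def add_node(self, name, kind):
        if kind not in ("model", "dataset"):
            raise ValueError(f"unknown node type: {kind}")
        self.node_type[name] = kind

    def add_dependency(self, child, parent, relation):
        # Record that `child` was built from `parent`; relation labels are assumed.
        for node in (child, parent):
            if node not in self.node_type:
                raise KeyError(f"unknown node: {node}")
        self.upstream[child].append((parent, relation))

    def ancestors(self, name):
        # Breadth-first walk over upstream edges; `seen` guards against cycles.
        seen = set()
        queue = deque([name])
        while queue:
            current = queue.popleft()
            for parent, _relation in self.upstream.get(current, []):
                if parent not in seen:
                    seen.add(parent)
                    queue.append(parent)
        return seen


def inherited_risks(graph, name, flagged):
    # Upstream nodes that appear in a caller-supplied set of flagged components.
    return graph.ancestors(name) & set(flagged)


# Hypothetical example data, not taken from the paper.
g = SupplyChainGraph()
g.add_node("base-model", "model")
g.add_node("chat-model", "model")
g.add_node("web-corpus", "dataset")
g.add_node("instruction-set", "dataset")
g.add_dependency("base-model", "web-corpus", "trained_on")
g.add_dependency("chat-model", "base-model", "fine_tuned_from")
g.add_dependency("chat-model", "instruction-set", "trained_on")

risks = inherited_risks(g, "chat-model", flagged={"web-corpus"})
print(sorted(risks))
```

In this sketch, "chat-model" never touches "web-corpus" directly, yet the traversal still reports it because the dependency is reached through "base-model". That transitive reach is the kind of relationship a supply chain graph is meant to expose.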
Abstract
Large language models (LLMs) leverage deep learning architectures to process and predict sequences of words, enabling them to perform a wide range of natural language processing tasks, such as translation, summarization, question answering, and content generation. As existing LLMs are often built from base models or other pre-trained models and use external datasets, they can inevitably inherit vulnerabilities, biases, or malicious components that exist in previous models or datasets. Therefore, it is critical to understand these components' origin and development process to detect potential risks, improve model fairness, and ensure compliance with regulatory frameworks. Motivated by that, this project aims to study such relationships between models and datasets, which are the central parts of the LLM supply chain. First, we design a methodology to systematically collect LLMs' supply…
Taxonomy
Topics: Sustainable Industrial Ecology · Scientific Computing and Data Management · Blockchain Technology Applications and Security
