DEP: A Decentralized Large Language Model Evaluation Protocol

Jianxiang Peng; Junhao Li; Hongxiang Wang; Haocheng Lyu; Hui Guo; Siyi Hao; Zhen Wang; Chuang Liu; Shaowei Zhang; Bojian Xiong; Yue Chen; Zhuowen Han; Ling Shi; Tianyu Dong; Juesi Xiao; Lei Yang; Yuqi Ren; Deyi Xiong

arXiv:2603.01167 · cs.CL · March 3, 2026

Open Access

TL;DR

DEP introduces a decentralized, standardized evaluation protocol for large language models that enhances reproducibility, data security, and modularity, enabling long-term, cost-effective benchmarking across diverse tasks.

Contribution

This work presents DEP, a novel decentralized evaluation framework and toolkit that unifies LLM benchmarking, ensuring consistency, data privacy, and ease of adaptation for multiple benchmarks.

Findings

01 DEP reduces the cost of deploying evaluations.

02 Over 60 benchmarks have been adapted to DEP.

03 DEP is effective at keeping evaluations secure and reproducible.

Abstract

With the rapid development of Large Language Models (LLMs), a large number of benchmarks have been proposed. However, most benchmarks lack a unified evaluation standard and require custom scripts to be implemented by hand, which makes it hard to ensure consistent and reproducible results. Furthermore, mainstream evaluation frameworks are centralized, holding datasets and answers together, which increases the risk of benchmark leakage. To address these issues, we propose the Decentralized Evaluation Protocol (DEP), a decentralized yet unified and standardized evaluation framework that works through a matching server without constraining benchmarks. The server can be mounted locally or deployed remotely, and once a benchmark has been adapted, it can be reused over the long term. By decoupling users, LLMs, and benchmarks, DEP enables modular, plug-and-play evaluation: benchmark files and evaluation logic stay exclusively on the server…
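The decoupling described in the abstract can be illustrated with a minimal sketch. This is not DEP's actual API: the names `Item`, `BenchmarkServer`, `get_prompts`, `score`, and `run_evaluation` are hypothetical, the exact-match metric is a placeholder, and the assumption that the server hands prompts to the client while withholding reference answers is one plausible reading of "benchmark files and evaluation logic stay exclusively on the server".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Item:
    """One benchmark example (hypothetical structure)."""
    item_id: str
    prompt: str
    answer: str  # reference answer; assumed never to leave the server


class BenchmarkServer:
    """Hypothetical server that owns the benchmark data and the scoring logic."""

    def __init__(self, items: List[Item]) -> None:
        self._items = {item.item_id: item for item in items}

    def get_prompts(self) -> Dict[str, str]:
        # Expose only ids and prompts, not the reference answers.
        return {item_id: item.prompt for item_id, item in self._items.items()}

    def score(self, responses: Dict[str, str]) -> float:
        # Placeholder metric: case-insensitive exact match, computed server-side.
        if not self._items:
            return 0.0
        correct = sum(
            1
            for item_id, item in self._items.items()
            if responses.get(item_id, "").strip().lower() == item.answer.lower()
        )
        return correct / len(self._items)


def run_evaluation(server: BenchmarkServer, model: Callable[[str], str]) -> float:
    """Client side: fetch prompts, query the model, send responses back for scoring."""
    prompts = server.get_prompts()
    responses = {item_id: model(prompt) for item_id, prompt in prompts.items()}
    return server.score(responses)
```

In this sketch the client never sees a reference answer and holds no benchmark-specific scoring code, so swapping the model or the benchmark changes only one side of the exchange.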

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Computational and Text Analysis Methods