TL;DR
This paper investigates the vulnerabilities of decentralized Group Relative Policy Optimization (GRPO) in training large language models, demonstrating effective adversarial attacks and proposing defenses, including logit-based filtering and LLM judging.
Contribution
It introduces the first adversarial attacks on decentralized GRPO and proposes two novel defense mechanisms to improve robustness.
Findings
Adversaries can achieve attack success rates of up to 100% in as few as 50 iterations.
The proposed defenses prevent most attacks; denial-of-service (DoS) attacks are the exception.
Code for attacks and defenses is publicly available at https://github.com/gensyn-ai/HTTT.
Abstract
Group Relative Policy Optimization (GRPO) has demonstrated wide adoption in the post-training of Large Language Models (LLMs). In GRPO, prompts are answered by the model and preferred behaviour is learnt via reinforcement learning. Owing to the small communication volume, GRPO is inherently suitable for decentralised training as the prompts can be concurrently answered by multiple nodes and these completions are exchanged in the form of strings. In this work, we explore the robustness of decentralised GRPO by presenting the first adversarial attacks and countermeasures. We present a diverse set of attacks where malicious nodes poison benign models by sharing their poisoned completions. We demonstrate these attacks on math and coding tasks and show that an adversary can achieve attack success rates of up to 100% in as few as 50 iterations. Moreover, to mitigate the attacks, we propose…
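As a rough illustration of why shared completions matter, the sketch below computes group-relative advantages, the normalization step GRPO is named after: each completion's reward is compared against the mean and spread of its own group. The function name, the reward values, and the choice of the population standard deviation are assumptions made for illustration; they are not taken from the paper or its code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and standard deviation.

    `eps` guards against division by zero when every reward in the group is equal.
    Using the population standard deviation is an assumption; implementations vary.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four completions for one prompt, pooled from several nodes.
# The last entry stands in for a completion shared by a malicious node that the
# reward function happens to score as correct.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

In this toy group the completion from the hypothetical malicious node receives a positive advantage, the same as the benign correct one, so a policy update would reinforce whatever else that string contains. This is only meant to convey the intuition behind poisoning through shared completions, not to reproduce the attacks described in the paper.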
