Attack on Unfair ToS Clause Detection: A Case Study using Universal Adversarial Triggers
Shanshan Xu, Irina Broda, Rashid Haddad, Marco Negrini, and Matthias Grabmair

TL;DR
This paper shows that transformer-based systems for detecting unfair ToS clauses are vulnerable to universal adversarial triggers: a minor perturbation of the text can considerably reduce detection performance, and how natural a trigger reads is key to whether it gets past human readers.
Contribution
It attacks a transformer-based unfair-clause detector with universal adversarial triggers and measures how detectable those triggers are to humans, using both answer accuracy and response time.
Findings
A minor perturbation of the text with universal triggers can considerably reduce detection performance.
Naturalness of triggers influences their success in fooling humans.
The human evaluation measures trigger detectability through participants' answer accuracy and response time.
Abstract
Recent work has demonstrated that natural language processing techniques can support consumer protection by automatically detecting unfair clauses in the Terms of Service (ToS) Agreement. This work demonstrates that transformer-based ToS analysis systems are vulnerable to adversarial attacks. We conduct experiments attacking an unfair-clause detector with universal adversarial triggers. Experiments show that a minor perturbation of the text can considerably reduce the detection performance. Moreover, to measure the detectability of the triggers, we conduct a detailed human evaluation study by collecting both answer accuracy and response time from the participants. The results show that the naturalness of the triggers remains key to tricking readers.
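To make the idea of a universal (input-agnostic) trigger concrete, here is a minimal Python sketch. It is not the authors' setup: the paper attacks a transformer-based detector, whereas this toy uses a hand-written bag-of-words scorer, and it swaps in a simple greedy search rather than the search procedure used in the paper. The weights, vocabulary, clauses, and function names are all hypothetical and chosen only for illustration.

```python
# Toy illustration only. WEIGHTS, BIAS, CLAUSES, and VOCAB are hypothetical;
# the paper's detector is a transformer, not this linear scorer.
WEIGHTS = {
    "terminate": 1.5, "without": 0.8, "notice": 0.6, "sole": 1.0,
    "discretion": 1.2, "liable": 0.9, "not": 0.4,
    "mutual": -1.4, "reasonable": -1.1, "consent": -0.9, "written": -0.5,
}
BIAS = -1.0

CLAUSES = [
    "we may terminate your account without notice".split(),
    "changes are at our sole discretion".split(),
    "we are not liable for any loss".split(),
]
VOCAB = ["mutual", "reasonable", "consent", "written"]


def score(tokens):
    """Linear 'unfairness' score; a positive value means flagged as unfair."""
    return BIAS + sum(WEIGHTS.get(tok, 0.0) for tok in tokens)


def detection_rate(clauses, trigger):
    """Fraction of clauses still flagged after prepending the same trigger."""
    flagged = sum(score(list(trigger) + clause) > 0 for clause in clauses)
    return flagged / len(clauses)


def greedy_trigger(clauses, vocab, length):
    """Greedily pick `length` distinct tokens that lower the total score
    across ALL clauses at once, which is what makes the trigger universal."""
    trigger = []
    for _ in range(length):
        candidates = [tok for tok in vocab if tok not in trigger]
        best = min(
            candidates,
            key=lambda tok: sum(score(trigger + [tok] + c) for c in clauses),
        )
        trigger.append(best)
    return trigger


if __name__ == "__main__":
    trigger = greedy_trigger(CLAUSES, VOCAB, length=2)
    print("trigger:", trigger)
    print("detection rate before:", detection_rate(CLAUSES, []))
    print("detection rate after: ", detection_rate(CLAUSES, trigger))
```

The point of the sketch is that one short token sequence is optimized over the whole set of clauses rather than per input, so the same small perturbation degrades detection everywhere it is inserted. Whether such a trigger would also read naturally to a human is a separate question, which the paper addresses with its human evaluation.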
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Cybercrime and Law Enforcement Studies · Spam and Phishing Detection
