Dependable Artificial Intelligence with Reliability and Security (DAIReS): A Unified Syndrome Decoding Approach for Hallucination and Backdoor Trigger Detection
Hema Karnam Surendrababu (1), Nithin Nagaraj (1) ((1) National Institute of Advanced Studies, Indian Institute of Science Campus, Bengaluru, India)

TL;DR
This paper introduces DAIReS, a unified syndrome decoding method that detects both backdoor triggers and hallucinations in large language models, enhancing system security and reliability.
Contribution
The work adapts syndrome decoding to NLP, providing a novel unified approach for detecting security and reliability issues in ML models.
Findings
Effective detection of poisoned samples in training data.
Identification of hallucinated content in LLMs.
Unified approach applicable to security and reliability violations.
Abstract
Machine Learning (ML) models, including Large Language Models (LLMs), are characterized by a range of system-level attributes such as security and reliability. Recent studies have demonstrated that ML models are vulnerable to multiple forms of security violations, among which backdoor data-poisoning attacks represent a particularly insidious threat, enabling unauthorized model behavior and systematic misclassification. In parallel, deficiencies in model reliability can manifest as hallucinations in LLMs, leading to unpredictable outputs and substantial risks for end users. In this work on Dependable Artificial Intelligence with Reliability and Security (DAIReS), we propose a novel unified approach based on Syndrome Decoding for the detection of both security and reliability violations in learning-based systems. Specifically, we adapt the syndrome decoding approach to the NLP…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Advanced Malware Detection Techniques
