OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Languages and Modalities

Sahil Verma; Keegan Hines; Jeff Bilmes; Charlotte Siska; Luke Zettlemoyer; Hila Gonen; Chandan Singh

arXiv:2505.23856 · cs.CL · December 10, 2025

TL;DR

Omniguard is an efficient method for detecting harmful prompts across languages and modalities (text, image, and audio), with higher classification accuracy than existing approaches.

Contribution

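The paper introduces a detection approach that identifies internal representations of an LLM/MLLM that are aligned across languages or modalities, then uses those representations to build a language-agnostic or modality-agnostic classifier for harmful prompts.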

Findings

01. 11.57% accuracy improvement in multilingual detection

02. 20.44% accuracy increase for image prompts

03. State-of-the-art performance for audio prompts

Abstract

The emerging capabilities of large language models (LLMs) have sparked concerns about their immediate potential for harmful misuse. The core approach to mitigate these concerns is the detection of harmful queries to the model. Current detection approaches are fallible, and are particularly susceptible to attacks that exploit mismatched generalization of model capabilities (e.g., prompts in low-resource languages or prompts provided in non-text modalities such as image and audio). To tackle this challenge, we propose Omniguard, an approach for detecting harmful prompts across languages and modalities. Our approach (i) identifies internal representations of an LLM/MLLM that are aligned across languages or modalities and then (ii) uses them to build a language-agnostic or modality-agnostic classifier for detecting harmful prompts. Omniguard improves harmful prompt classification accuracy…
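
The two steps named in the abstract, (i) finding aligned internal representations and (ii) training a classifier on them, can be illustrated with a toy sketch. This is not the authors' implementation. Everything here is an assumption made for illustration: the alignment score (mean cosine similarity of matched pairs minus that of mismatched pairs), the helper names `alignment_score`, `select_layer`, `train_logistic`, and `predict`, and the synthetic Gaussian vectors that stand in for per-layer model activations.

```python
import math
import random


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_score(reps_a, reps_b):
    """Assumed metric: matched-pair similarity minus mismatched-pair similarity."""
    n = len(reps_a)
    matched = sum(cosine(reps_a[i], reps_b[i]) for i in range(n)) / n
    mismatched = sum(
        cosine(reps_a[i], reps_b[j]) for i in range(n) for j in range(n) if i != j
    ) / (n * (n - 1))
    return matched - mismatched


def select_layer(layers_a, layers_b):
    """Step (i): pick the layer whose paired representations align best."""
    scores = [alignment_score(a, b) for a, b in zip(layers_a, layers_b)]
    return scores.index(max(scores)), scores


def train_logistic(xs, ys, lr=0.5, steps=300):
    """Step (ii): fit a logistic-regression probe with plain gradient descent."""
    dim = len(xs[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(xs)
    return w, b


def predict(x, w, b):
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)


# Synthetic stand-in for activations: 3 "layers", 100 prompts, 8 dimensions.
rng = random.Random(0)
N, DIM = 100, 8


def gauss_vec():
    return [rng.gauss(0, 1) for _ in range(DIM)]


# "source" plays the role of English prompts, "target" of their translations.
source = [[gauss_vec() for _ in range(N)] for _ in range(3)]
target = [
    [gauss_vec() for _ in range(N)],  # layer 0: unrelated
    [[v + 0.1 * rng.gauss(0, 1) for v in x] for x in source[1]],  # layer 1: aligned
    [gauss_vec() for _ in range(N)],  # layer 2: unrelated
]
# Hypothetical "harmful" label, defined by one direction of the aligned layer.
labels = [int(x[0] > 0) for x in source[1]]

best, scores = select_layer(source, target)
w, b = train_logistic(source[best], labels)  # train on source language only
accuracy = sum(predict(x, w, b) == y for x, y in zip(target[best], labels)) / N
print(f"selected layer: {best}, transfer accuracy: {accuracy:.2f}")
```

In this toy setup the probe is trained only on the source-side vectors and evaluated on the target-side vectors, which mirrors the idea that a classifier built on aligned representations can transfer across languages or modalities. With a real model, the per-layer vectors would come from its hidden states rather than a random generator.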

Code & Models

Repositories

vsahil/omniguard (PyTorch, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning