ChatGPT Doesn't Trust Chargers Fans: Guardrail Sensitivity in Context
Victoria R. Li, Yida Chen, Naomi Saphra

TL;DR
This study investigates how user context influences guardrail responses in GPT-3.5, revealing biases related to demographics and identity that affect the model's refusal behavior on sensitive requests.
Contribution
The paper shows that GPT-3.5's refusal guardrails vary with user demographics and contextual cues, an unintended behavior in response moderation.
Findings
Younger, female, and Asian-American personas trigger more refusals when requesting censored or illegal information.
Guardrails are influenced by seemingly innocuous identity cues.
ChatGPT infers political ideology from user context and refuses requests for political positions the user is likely to disagree with.
Abstract
While the biases of language models in production are extensively documented, the biases of their guardrails have been neglected. This paper studies how contextual information about the user influences the likelihood of an LLM to refuse to execute a request. By generating user biographies that offer ideological and demographic information, we find a number of biases in guardrail sensitivity on GPT-3.5. Younger, female, and Asian-American personas are more likely to trigger a refusal guardrail when requesting censored or illegal information. Guardrails are also sycophantic, refusing to comply with requests for a political position the user is likely to disagree with. We find that certain identity groups and seemingly innocuous information, e.g., sports fandom, can elicit changes in guardrail sensitivity similar to direct statements of political ideology. For each demographic category and…
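The measurement the abstract describes (prepend a user biography to a request, then compare refusal rates across personas) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `build_prompt`, `is_refusal`, `refusal_rates`, and the keyword list `REFUSAL_MARKERS` are hypothetical names, the keyword heuristic stands in for whatever refusal classifier the paper actually uses, and no model API is called.

```python
from collections import defaultdict

# Assumption: a crude keyword heuristic for spotting refusals.
# The paper's own refusal classifier is not specified in this summary.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "i am unable")


def build_prompt(persona_bio: str, request: str) -> str:
    """Prepend a persona biography to the request (hypothetical format)."""
    return f"{persona_bio}\n\n{request}"


def is_refusal(response: str) -> bool:
    """Return True if the response contains any refusal marker."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rates(records):
    """Compute the refusal rate per persona label.

    records: iterable of (persona_label, response_text) pairs.
    Returns a dict mapping each label to refusals / total responses.
    """
    counts = defaultdict(lambda: [0, 0])  # label -> [refusals, total]
    for label, response in records:
        counts[label][0] += int(is_refusal(response))
        counts[label][1] += 1
    return {label: refused / total for label, (refused, total) in counts.items()}
```

In use, each response would come from sending `build_prompt(bio, request)` to the model under test, and the per-persona rates from `refusal_rates` would then be compared against a no-persona baseline.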
Taxonomy
Topics: Traffic Prediction and Management Techniques · Traffic and Road Safety
Methods: Linear Layer · Adam · Dropout · Dense Connections · Weight Decay · Multi-Head Attention
