When LLMs Meet Cunning Texts: A Fallacy Understanding Benchmark for Large Language Models

Yinghui Li; Qingyu Zhou; Yuanzhen Luo; Shirong Ma; Yangning Li; Hai-Tao Zheng; Xuming Hu; Philip S. Yu

arXiv:2402.11100·cs.CL·June 11, 2024·2 cites

TL;DR

This paper introduces the FLUB benchmark to evaluate large language models' ability to understand cunning, tricky, and misleading texts, revealing current limitations and guiding future improvements in fallacy comprehension.

Contribution

The paper presents a novel benchmark with tasks designed to test LLMs on fallacy understanding using real-world cunning texts, which is a new challenge for the community.

Findings

01 FLUB is challenging for current LLMs

02 Advanced models show limited fallacy understanding

03 The benchmark encourages future research in fallacy comprehension

Abstract

Recently, Large Language Models (LLMs) have made remarkable advances in language understanding and generation. Following this, various benchmarks for measuring all kinds of LLM capabilities have sprung up. In this paper, we challenge the reasoning and understanding abilities of LLMs by proposing a FaLlacy Understanding Benchmark (FLUB) containing cunning texts that are easy for humans to understand but difficult for models to grasp. Specifically, the cunning texts that FLUB focuses on mainly consist of tricky, humorous, and misleading texts collected from the real internet environment. We design three tasks of increasing difficulty in FLUB to evaluate the fallacy understanding ability of LLMs. Based on FLUB, we investigate the performance of multiple representative and advanced LLMs; the results show that FLUB is challenging and worthy of further study. Interesting…

Code & Models

Repositories

thukelab/flub · Official

Videos

When LLMs Meet Cunning Texts: A Fallacy Understanding Benchmark for Large Language Models · slideslive

Taxonomy

Topics: Natural Language Processing Techniques · Artificial Intelligence in Law