Zero-Shot Commonsense Validation and Reasoning with Large Language Models: An Evaluation on SemEval-2020 Task 4 Dataset
Rawand Alfugaha, Mohammad AL-Smadi

TL;DR
This paper evaluates large language models, prompted zero-shot, on the SemEval-2020 Task 4 dataset for commonsense validation and explanation. Larger models perform well on validation but still struggle with explanation relevance and causal inference.
Contribution
It provides a comprehensive zero-shot evaluation of multiple LLMs on commonsense tasks, highlighting their strengths and limitations compared to fine-tuned models.
Findings
LLaMA3-70B achieves the highest accuracy on Task A (Commonsense Validation), at 98.40%.
Larger models outperform previous models on validation but lag behind them on explanation.
Challenges remain in selecting relevant explanations and causal reasoning.
Abstract
This study evaluates the performance of Large Language Models (LLMs) on the SemEval-2020 Task 4 dataset, focusing on commonsense validation and explanation. Our methodology involves evaluating multiple LLMs, including LLaMA3-70B, Gemma2-9B, and Mixtral-8x7B, using zero-shot prompting techniques. The models are tested on two tasks: Task A (Commonsense Validation), where models determine whether a statement aligns with commonsense knowledge, and Task B (Commonsense Explanation), where models identify the reasoning behind implausible statements. Performance is assessed based on accuracy, and results are compared to fine-tuned transformer-based models. The results indicate that larger models outperform previous models and perform close to human evaluation on Task A, with LLaMA3-70B achieving the highest accuracy of 98.40% in Task A, while lagging behind previous models with 93.40% in Task…
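The zero-shot setup and accuracy metric described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the prompt wording, the answer labels, the assumption that Task A presents a pair of statements and Task B a small set of lettered candidate reasons, and the `model` callable (a stand-in for whatever LLM client is used) are all hypothetical.

```python
import re
from typing import Callable, Optional, Sequence


def build_task_a_prompt(s0: str, s1: str) -> str:
    """Zero-shot prompt: an instruction and the input, with no worked examples.
    The wording is an assumption, not the paper's actual prompt."""
    return (
        "Which of the two statements is against common sense?\n"
        f"0: {s0}\n"
        f"1: {s1}\n"
        "Answer with 0 or 1 only."
    )


def build_task_b_prompt(statement: str, options: Sequence[str]) -> str:
    """Ask for the reason a statement is implausible (assumes up to three options)."""
    lines = [f"{label}: {opt}" for label, opt in zip("ABC", options)]
    return (
        f"The statement \"{statement}\" is against common sense.\n"
        "Which option best explains why?\n"
        + "\n".join(lines)
        + "\nAnswer with the letter only."
    )


def parse_label(reply: str, valid: Sequence[str]) -> Optional[str]:
    """Return the first standalone valid label found in the reply, or None."""
    pattern = r"\b(" + "|".join(re.escape(v) for v in valid) + r")\b"
    match = re.search(pattern, reply.strip())
    return match.group(1) if match else None


def accuracy(predictions: Sequence[Optional[str]], gold: Sequence[str]) -> float:
    """Fraction of predictions equal to the gold label; unparsed replies count as wrong."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def evaluate(
    model: Callable[[str], str],  # hypothetical: maps a prompt to the model's text reply
    prompts: Sequence[str],
    gold: Sequence[str],
    valid: Sequence[str],
) -> float:
    """Query the model once per prompt and score the parsed answers."""
    predictions = [parse_label(model(p), valid) for p in prompts]
    return accuracy(predictions, gold)
```

In use, `model` would wrap a call to one of the evaluated LLMs, and `evaluate` would be run separately over the Task A and Task B prompts. Counting unparseable replies as errors is one possible design choice; the source does not say how such replies were handled.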
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
