Assessing the Capability of LLMs in Solving POSCOMP Questions
Cayo Viegas, Rohit Gheyi, Márcio Ribeiro

TL;DR
This study evaluates the performance of various large language models on the POSCOMP computer science exam, demonstrating that recent models like ChatGPT-4 and Gemini 2.5 Pro outperform human participants in text-based questions.
Contribution
It provides a comprehensive assessment of LLM capabilities on a specialized, challenging computer science exam, highlighting recent models' superiority over humans.
Findings
ChatGPT-4 outperforms all human participants on the 2023 exam.
Recent models show continuous improvement across years.
LLMs excel in text-based questions but struggle with image interpretation.
Abstract
Recent advancements in Large Language Models (LLMs) have significantly expanded the capabilities of artificial intelligence in natural language processing tasks. Despite this progress, their performance in specialized domains such as computer science remains relatively unexplored. Understanding the proficiency of LLMs in these domains is critical for evaluating their practical utility and guiding future developments. The POSCOMP, a prestigious Brazilian examination promoted by the Brazilian Computer Society (SBC) and used for graduate admissions in computer science, provides a challenging benchmark. This study investigates whether LLMs can match or surpass human performance on the POSCOMP exam. Four LLMs - ChatGPT-4, Gemini 1.0 Advanced, Claude 3 Sonnet, and Le Chat Mistral Large - were initially evaluated on the 2022 and 2023 POSCOMP exams. The assessments measured the models' proficiency…
Taxonomy
Topics: Artificial Intelligence in Law
