Elsevier Arena: Human Evaluation of Chemistry/Biology/Health Foundational Large Language Models
Camilo Thorne, Christian Druckenbrodt, Kinga Szarkowska, Deepika Goyal, Pranita Marajan, Vijay Somanath, Corey Harper, Mao Yan, Tony Scerri

TL;DR
This paper evaluates the performance of large language models in chemistry, biology, and health domains through human assessments, highlighting their strengths and limitations in specialized scientific fields.
Contribution
It introduces a comprehensive human evaluation framework for large language models in scientific domains, providing insights into their capabilities and gaps.
Findings
Models perform well on general scientific questions.
Significant gaps remain in specialized domain knowledge.
Human evaluation reveals nuanced strengths and weaknesses.
Abstract
arXiv admin comment: This version has been removed by arXiv administrators as the submitter did not have the rights to agree to the license at the time of submission
Taxonomy
Topics: Genetics, Bioinformatics, and Biomedical Research · Health, Environment, Cognitive Aging · Biomedical Text Mining and Ontologies
Methods: Attention Is All You Need · Residual Connection · Attention Dropout · Linear Layer · Multi-Head Attention · Dense Connections · Cosine Annealing · Linear Warmup With Cosine Annealing
