A Test for Evaluating Performance in Human-Computer Systems

Andres Campero; Michelle Vaccaro; Jaeyoon Song; Haoran Wen; Abdullah Almaatouq; Thomas W. Malone

arXiv:2206.12390 · cs.HC · June 30, 2022 · 21 cites

TL;DR

This paper introduces a new test for evaluating performance improvements in human-computer systems using the ratio of means, demonstrating its application through analysis of existing studies and experiments with GPT-3, revealing insights into performance gains and human-computer synergy.

Contribution

The paper develops a novel statistical test for measuring performance improvements in human-computer systems and applies it to multiple scenarios, including analysis of prior studies and experiments with GPT-3.

Findings

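01. Over half of the 79 recently published experimental results analyzed show a decrease in performance; the mean and median improvement ratios are both approximately 1 (no improvement).

02. The maximum observed performance improvement ratio is 1.36 (a 36% improvement).

03. GPT-3 enables a 27% speed improvement when used by human programmers.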

Abstract

The Turing test for comparing computer performance to that of humans is well known, but, surprisingly, there is no widely used test for comparing how much better human-computer systems perform relative to humans alone, computers alone, or other baselines. Here, we show how to perform such a test using the ratio of means as a measure of effect size. Then we demonstrate the use of this test in three ways. First, in an analysis of 79 recently published experimental results, we find that, surprisingly, over half of the studies find a decrease in performance, the mean and median ratios of performance improvement are both approximately 1 (corresponding to no improvement at all), and the maximum ratio is 1.36 (a 36% improvement). Second, we experimentally investigate whether a higher performance improvement ratio is obtained when 100 human programmers generate software using GPT-3, a massive,…
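The ratio-of-means effect size described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' procedure: the function names, the bootstrap percentile interval, and the sample scores are all assumptions made for illustration. The sketch also assumes strictly positive scores where higher is better. For time-based measures, where lower is better, the two groups would swap roles.

```python
import random
from statistics import mean


def ratio_of_means(treatment, baseline):
    """Effect size: mean(treatment) / mean(baseline).

    A value of 1.0 means no improvement; 1.36 means a 36% improvement.
    """
    base = mean(baseline)
    if base == 0:
        raise ValueError("baseline mean is zero; ratio is undefined")
    return mean(treatment) / base


def bootstrap_ratio_ci(treatment, baseline, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the ratio of means.

    Assumption: the two groups are independent samples of positive scores.
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping group sizes fixed.
        t = rng.choices(treatment, k=len(treatment))
        b = rng.choices(baseline, k=len(baseline))
        ratios.append(mean(t) / mean(b))
    ratios.sort()
    lower = ratios[round((alpha / 2) * n_resamples)]
    upper = ratios[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical scores, not data from the paper.
    human_alone = [8, 10, 12]
    human_plus_computer = [10, 13, 15]
    print(ratio_of_means(human_plus_computer, human_alone))
    print(bootstrap_ratio_ci(human_plus_computer, human_alone))
```

If the interval excludes 1, the sample is consistent with a real change relative to the baseline. With groups as small as the hypothetical ones here, the interval will be wide.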

Taxonomy

Topics: IoT and Edge/Fog Computing · Context-Aware Activity Recognition Systems
