HARP: Hesitation-Aware Reframing in Transformer Inference Pass

Romain Storaï; Seung-won Hwang

arXiv:2412.07282 · cs.CL · May 27, 2025



TL;DR

HARP is a model-agnostic, training-free method that adaptively adds computation during Transformer inference by mimicking human hesitation, improving downstream performance while running about twice as fast as beam search.

Contribution

HARP introduces a novel hesitation-aware reframing technique that enhances Transformer inference without retraining or complex modifications.

Findings

01

Up to +5.16% performance improvement on downstream tasks

02

Inference is twice as fast as beam search

03

Applicable across various models and tasks

Abstract

This paper aims to improve the performance of large language models by addressing the variable computational demands of inference steps, where some tokens require more computational resources than others. We present HARP, a simple modification to the "off-the-shelf" Transformer forward pass. Drawing from hesitation and the framing effect in decision-making, HARP selectively applies additional computation when the model encounters uncertainty during token generation. Our method mimics human cognitive processes by pausing at difficult decision points and reframing inputs for a different perspective. Unlike other approaches, HARP is model-agnostic, training-free, and easy to implement. We evaluate our method across various downstream tasks and model sizes, demonstrating performance improvements of up to +5.16%. Notably, HARP achieves these gains while maintaining inference times twice as fast as…
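The hesitate-then-reframe idea from the abstract can be sketched for a single decoding step. This is a minimal, non-authoritative illustration in plain Python, not the code from romsto/harp. It assumes three things the abstract does not specify: Shannon entropy of the next-token distribution as the uncertainty signal, dropout on the input embeddings as the "reframing", and a weighted average of the original and reframed logits. The names `make_toy_forward`, `hesitation_aware_step`, `threshold`, `drop_rate`, and `alpha` are hypothetical.

```python
import math
import random
from typing import Callable, List, Optional

Vector = List[float]


def softmax(logits: Vector) -> Vector:
    # Numerically stable softmax: shift by the max logit before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Vector) -> float:
    # Shannon entropy in nats (assumed uncertainty measure).
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def dropout(vec: Vector, rate: float, rng: random.Random) -> Vector:
    # Inverted dropout: zero each entry with probability `rate`, rescale the rest.
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in vec]


def make_toy_forward(weights: List[Vector]) -> Callable[[List[Vector]], Vector]:
    # Hypothetical stand-in for a Transformer forward pass:
    # mean-pool the input embeddings, then project to vocabulary logits.
    def forward(embeddings: List[Vector]) -> Vector:
        d = len(embeddings[0])
        pooled = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(d)]
        return [sum(w * p for w, p in zip(row, pooled)) for row in weights]
    return forward


def hesitation_aware_step(
    forward: Callable[[List[Vector]], Vector],
    embeddings: List[Vector],
    threshold: float,
    drop_rate: float = 0.2,
    alpha: float = 0.5,
    rng: Optional[random.Random] = None,
) -> tuple:
    """Return (logits, hesitated) for one decoding step (sketch, assumptions above)."""
    rng = rng if rng is not None else random.Random(0)
    logits = forward(embeddings)
    if entropy(softmax(logits)) <= threshold:
        # Confident step: keep the single, ordinary forward pass.
        return logits, False
    # Uncertain step: "hesitate" and run one extra pass on a reframed input.
    reframed = [dropout(e, drop_rate, rng) for e in embeddings]
    alt_logits = forward(reframed)
    combined = [alpha * a + (1.0 - alpha) * b for a, b in zip(logits, alt_logits)]
    return combined, True


if __name__ == "__main__":
    fwd = make_toy_forward([[5.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
    print(hesitation_aware_step(fwd, [[1.0, 0.0]], threshold=0.5))  # confident step
    print(hesitation_aware_step(fwd, [[0.0, 1.0]], threshold=0.5))  # uncertain step
```

Only steps whose entropy exceeds `threshold` pay for a second forward pass, which is the intuition behind applying additional computation selectively. In a real model, `forward` would be the Transformer's forward pass over the prefix embeddings rather than this toy projection.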


Code & Models

Repositories

romsto/harp
PyTorch · Official

Videos

HARP: Hesitation-Aware Reframing in Transformer Inference Pass · Underline

Taxonomy

Topics: Smart Grid Security and Resilience · Parallel Computing and Optimization Techniques · Ferroelectric and Negative Capacitance Devices

Methods: Attention Is All You Need · Adam · Dropout · Position-Wise Feed-Forward Layer · Softmax · Dense Connections · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Label Smoothing