The role of auditory and haptic cues in object enumeration within containers

Ilja Frissen; Olga Sagou; Krista E. Overvliet

PMC · DOI:10.1007/s00221-025-07215-4·January 23, 2026

The role of auditory and haptic cues in object enumeration within containers

Ilja Frissen, Olga Sagou, Krista E. Overvliet

PDF

Open Access

TL;DR

This study explores how touch and sound help people estimate the number of objects inside a container, finding that touch is more effective when people can freely explore.

Contribution

The study reveals that haptic cues outperform auditory cues in unconstrained exploration for object enumeration.

Findings

01

Participants could reliably estimate the number of items under all sensory conditions.

02

Haptic-only estimates were significantly more accurate when time and movement constraints were removed.

03

Multisensory integration provided little benefit for enumeration accuracy.

Abstract

Everyday experience suggests that handheld containers convey auditory and haptic information about their contents. The primary objective of this study was to examine how these two sources of information interact. Across three experiments, participants estimated the number of beads in cardboard boxes under three sensory conditions: haptics-only, auditory-only, and auditory-haptic. We hypothesized that the auditory-haptic condition would yield more accurate and precise estimates. A secondary objective was to assess the effects of time and manipulation constraints. In Experiment 1, participants were limited to 5 s of exploration using a standardized movement; Experiment 2 removed the time constraint, and Experiment 3 additionally removed the movement constraint. Results indicated that participants could reliably estimate the number of items under all conditions but provided little evidence…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species1

Homo sapiens(human · species)

Chemicals1

water

Diseases1

AH

Figures4

Click any figure to enlarge with its caption.

Mean estimated number of objects for Experiments 1–3. Error bars represent 95% empirical bootstrapped (n = 10,000) confidence intervals using the bias corrected and accelerated percentile method (Haukoos and Lewis [2005](#CR16))

Tests for integration in terms of accuracy for Experiments 1–3. The top row panels show individual results; each line is one participant. The horizontal lines indicated RMS values ranging between the auditory-only (triangles) to haptic-only (squares) conditions. The circular marker shows RMS values when both cues were available. Only those markers to the left of the best condition are consistent with increased accuracy (shown in solid black). The middle row panels show the corresponding group level histograms based on empirical bootstraps of the individual mean RMS. The bottom row panels show

Tests for integration in terms of precision for Experiments 1–3. The top row panels show individual results; each line is one participant. The horizontal lines indicated RMS values for the best (leftward) and worst (rightward) cues. The circular marker shows RMS values when both cues were available. Only those markers to the left of the best cue are consistent with increased precision (shown in solid black). The labels on the right-hand side of each panel indicates whether the auditory (A) or haptic (H) condition produced the Best cue. The middle row panels show the corresponding group level h

Keywords

NumerosityHapticsAuditoryMultisensoryHuman

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMultisensory perception and integration · Embodied and Extended Cognition · Visual perception and processing mechanisms

Full text

Introduction

Humans have long relied on containers for storage and transport (Bevan 2014; Klose 2015), but little is known about how we infer their contents from sensory cues. Containers are likely to have been part of the human experience for as long as humans have been around (Bevan 2014; Suddendorf and Langley, 2020). They are thought to have been instrumental in hominin evolution (e.g., Suddendorf et al. 2020) and have long since been cultural universals (Goodenough 2003).

One general question that arises from the long common history of humans and containers is whether we can make reliable perceptual inferences about a container’s content. Daily experiences imply that we can: Shaking a milk carton or a box of chocolate sprinkles allows for clear impressions of how much content there is. As in this example, most common interactions with a container produce both auditory and haptic cues. Here we investigate how effective either of these two cues are, as well as how they interact in making inferences about the number of objects in a container (i.e., enumeration).

Background

Besides the daily experiences, prior research shows that auditory and haptic cues can each support judgments about container contents, yet their combined effect remains unclear. Auditory cues have been shown to afford inferences about the fullness of a vessel from the sound of poured liquids (Cabe and Pittinger, 2000), about whether contained water was hot or cold (Velasco et al. 2013), and how many objects are inside a box (i.e., enumeration; Hummel et al. 2022; Pittenger et al. 1997; Pittenger and Mincy 1999; Plaisier and Smeets 2017). Haptic cues afford accurate judgments about how many deciliters there are in a milk carton (Jansson 1993; Jansson et al. 2006) or milliliters in a bottle (Koshiyama et al. 2015), and tracking moving objects inside tubes (Frissen et al. 2022; Yao and Hayward 2006). There are also several haptic enumeration studies (Frissen and Chen 2024; Frissen et al. 2023, 2025; Hummel et al. 2022; Sekiguchi et al. 2005). For instance, Frissen et al. (2023) reported on a series of three experiments in which participants handled small cardboard jewelry boxes holding between one and five beads of various diameters and weights and made direct estimates of the number of objects. Overall, this, and similar enumeration studies document remarkably accurate performance although with a progressive tendency to underestimate as the number of objects increases.

Early work on combined auditory and haptic cues was conducted by Pittenger and colleagues (1997; Pittenger and Mincy 1999). These studies examined whether participants could reliably rank a series of containers based on the perceived size of balls inside of the containers. The containers were explored in one of two ways. In one condition, the containers (opaque plastic cylinders) were shaken and in a second condition, the content of containers (glass bowls) was stirred with a stick. Access to sensory information was controlled by who was allowed to interact with the container. For the haptic-only condition, participants handled the containers while sounds were masked. For the auditory-only condition, the experimenter handled the container while the participant listened. And for the haptic-auditory condition the participant handled the container without the sounds being masked. Participants performed above chance with either cue, but combining cues offered no advantage over auditory information alone. Similarly, Hummel et al. (2022) found that auditory cues yielded slightly more accurate enumeration than haptic cues, and that combined cues were comparable to auditory-only performance. Together, these findings suggest that haptic information adds little to the multisensory percept, implying a winner-take-all process that favors auditory cues.

One study, however, challenges this interpretation. Plaisier and Smeets (2017) asked participants to estimate the number of wooden spheres (one to five) in small cardboard boxes after five seconds of handling. The task included two conditions: in one, participants had access to both auditory and haptic cues; in the other, they heard audio recordings from the first condition. Results showed that the auditory–haptic condition produced significantly more accurate and less variable estimates than the auditory-only condition. The authors argued that this improvement reflects a multisensory integration process, whereby combining cues reduces sensory noise and enhances perceptual precision (Faisal et al. 2008; Ernst and Bülthoff 2004; Stein, 2012).

However, the previous auditory–haptic studies have notable limitations. They rely exclusively on group-level analyses, overlooking individual differences, and most are based on very small samples (e.g., Plaisier and Smeets 2017) or presented only as brief summary reports (Hummel et al. 2022; Pittenger et al. 1997; Pittenger and Mincy 1999). These limitations have two important implications. First, the absence of individual-level analyses undermines any conclusion that auditory cues are the sole “winner” in a winner-take-all mechanism (Pittenger et al. 1997; Pittenger and Mincy 1999). It is plausible that some participants relied more on haptic cues, but such patterns were obscured in group averages. Second, the small sample size in Plaisier and Smeets (2017)—only seven participants—and reliance on group-level variability estimates limit the reliability of their precision measure and, by extension, their claim of sensory integration.

The present study

The primary objective of the present study is to investigate the role of auditory and haptic cues in the estimation of the number of objects inside handheld boxes. We take an analytical approach that can show individual changes in precision (Scheller and Nardini 2024) as well as suggest individual “winners” for auditory or haptic cues. We adopt integration as the default position and hypothesize that the estimations of the number of objects in a container with both auditory and haptic information are more accurate (i.e., responses being closer to the actual number of objects) and more precise (i.e., having a lower variability in responses) in comparison to having only auditory or haptic information.

A secondary objective pertains to the effects of how participants manipulate the box. To date, enumeration studies with handheld boxes have constrained how long participants are allowed to explore a box (typically between 5 and 10 s) while at the same time they have not controlled how participants are allowed to manipulate the box. Seminal studies on form recognition show that the haptic system performs exceptionally well under ecologically valid conditions (Gibson 1962; Heller 1984). Gibson (1962) reported near-perfect recognition (95%) when participants actively explored objects compared to passive presentation (49%). Similarly, Heller (1984) found that allowing 30 s of exploration yielded better performance (95%) than 5 s (71%). Here we increasingly reduce the amount of constraint on how participants could explore the boxes across a series of three experiments. In Experiment 1 participants were instructed to manipulate the boxes for exactly 5 s using a standardized manipulation method. In Experiment 2 participants still used the standardized manipulation method but were given no time constraint. In Experiment 3 participants were free to manipulate the boxes as they wished without time constraints.

Experiment 1: Time limited, standardized movement

Method

Participants

Eighteen participants within the age range of 22–59 years took part in the study (8 females; 1 left-handed). All participants reported normal hearing and haptic perception. Informed consent was obtained from all participants before their participation. The study protocol received approval from the Utrecht University Faculty of Social Sciences ethics committee. Participants participated voluntarily and were paid 8 euros per hour or received course credits. No power analysis was conducted; sample sizes were based on previous work on container haptics showing robust findings at the individual level with sample sizes between 15 and 24 (Frissen and Chen 2024; Frissen et al. 2023, 2025).

Stimuli and apparatus

One to eight beads of were distributed among two sets of eight cardboard boxes, each weighting 17.5 g plus a 9.2 g lid, and measuring 7.5 (L) × 7.5 (W) × 4.5 cm (H), for a total of 16 boxes (see Fig. 1a and Table 1 for details). There were two different boxes of 1 bead, two different boxes of 2 beads, and so on. All boxes had lids so that the content was hidden from view. While previous studies have suggested that weight is only a minor cue in determining the number of objects in a box (Frissen et al. 2023, 2025), care was taken to minimize the weight differences between the boxes by using beads of different weights, sizes, and materials, resulting in weights ranging between 28.3 and 35.2 g. To aid the researcher, stickers were added at the bottom of the boxes indicating the number of items.

Fig. 1a Beads distribution inside the boxes with open lid. b Participants in Experiment 1 and 2 were instructed to move the box back and forth by rotating at the wrist (see also text)

Table 1. Distribution of objects and weights of items in the experimental containers#Objects (total)#Objects (per type)Weight (g)PinkBlueSilver1Silver2PearlGrayBeadsTotal(3.2 g)(1.6 g)(1.4 g)(0.9 g)(0.8 g)(0.4 g)10 | 11 | 00 | 00 | 00 | 00 | 01.6 | 3.228.3 | 29.921 | 10 | 00 | 11 | 00 | 00 | 04.1 | 4.630.8 | 31.331 | 10 | 00 | 11 | 01 | 00 | 14.9 | 5.031.6 | 31.741 | 00 | 20 | 10 | 12 | 01 | 05.2 | 5.531.9 | 32.251 | 10 | 00 | 01 | 11 | 22 | 15.7 | 6.132.4 | 32.861 | 10 | 01 | 01 | 11 | 32 | 17.1 | 6.933.8 | 33.670 | 12 | 02 | 10 | 01 | 22 | 37.6 | 7.434.3 | 34.180 | 11 | 13 | 01 | 11 | 22 | 38.3 | 8.535.0 | 35.2Information for the two sets of containers (see text) is separated by a pipe symbol (i.e., Set 1 | Set 2)

The randomization of trials, time keeping, and recording of answers, were controlled by a custom routine implemented in PsychoPy (version 2022.2.5), a software package for creating psychological experiments (Peirce et al. 2019). Four versions of the experiment were developed: a practice round and three experimental conditions.

Design and procedure

The experiment followed a 3 (Modality: Auditory-Only [A], Haptic-only [H], Auditory-Haptic [AH]) × 8 (Number of Objects: 1–8) × 4 (Repetitions) within-participants design for a total of 96 trials per participant. The levels of the Modality factor were counterbalanced, and Number of Objects was presented in a random order (in which each copy of a box was repeated twice). The experimental session lasted about 45 min.

The participant was seated at a desk next to the experimenter, so that the experimenter could easily hand over the boxes. A computer screen with trial information separated the experimenter from the participant, so that the participant could not see the labels on the boxes.

In the H and AH conditions, the participants manipulated the boxes themselves. In the A condition the experimenter manipulated the box in the standardized way (as described below) near the participant. In the haptic-only condition, participants wore noise-cancelling headphones (Bose NC700) playing white noise to block out any auditory information. Participants were not blind-folded.

In each trial the experimenter placed the box with the lid facing up in the participant’s hand (or the experimenter’s hand for the A condition). Subsequently, a beep noise was played to indicate to the participant (or the experimenter in the auditory condition) that they could start manipulating the box. After 5 s, another beep indicated the end of the manipulation time and the participant stated their numerosity estimation, which was recorded by the researcher on the computer. The box was returned, and the next trial began. There was neither mention of the maximum number of items to the participants, nor feedback on performance.

How participants manipulated the box was standardized based on a pilot study. Eight participants were instructed to manipulate, as they wished and without time limit, three different boxes containing 3, 5, or 8 items and to estimate the number of items inside the box. A number of manipulation strategies were observed, such as, shaking the box vigorously and moving the box horizontally in circles. However, the most common method involved carefully moving the box back and forth by rotating at the wrist (See Fig. 1b). It was therefore decided to instruct all participants in the experiment to use this particular manipulation technique. Before data collection started, participants were given a practice round to familiarize themselves with the task and the experimental setup. This consisted of three randomly selected trials from the AH condition.

Dependent variables and data analysis

We considered three dependent measures: 1) the estimated number of objects, and 2) how accurate and 3) precise those estimates are. Number estimates were calculated for each level of Number of Objects (1 – 8) by simply taking the mean across the four Repetitions. Both accuracy and precision were calculated for each participant individually using all 32 (8 × 4) number estimations collected for each of the three Modality conditions. Accuracy was operationalized in terms of the root-mean-square of errors (RMSE) in number estimations. Precision was operationalized in terms of the RMSE of the residuals from a second-order polynomial regression.1 Scheller and Nardini (2024) have pointed out that, in the multisensory perception literature, the most common approach to studying integration has been to compare precision when two senses are combined with precision when the two senses are on their own. That is, noise_1+2_ vs. noise_1_ and noise_1+2_ vs. noise_2_. However, they also demonstrate, using computational simulations, that this common approach is sensitive to high rates of false positives (concluding there was integration, when there was none). Here we therefore adopted their “individually-determined best cue analysis”, in which, for each participant it was determined which of the two senses produced the more precise (e.g., least noisy) estimates, which then served as the comparator; that is, noise_1+2_ vs. min(noise_1_, noise_2_).

Data processing and visualization were done with Matlab (The MathWorks Inc., 2022). Inferential statistical analyses, assuming a level of significance of 5%, were performed in JASP (JASP Team 2023). For repeated measures ANOVAs, violations of the assumption of sphericity were addressed using the Greenhouse–Geisser correction (in which case we report non-integer values for corrected degrees of freedom). All post-hoc tests were Holm-corrected for multiple comparison. For effect sizes, we report Generalized Eta-Squared values (η^2^G) for ANOVAs (Bakeman 2005) and Cohen’s d for any pairwise comparisons.

Results and discussion

The results show that with a time limit and a standardized movement, there is little evidence to support the idea that participants determined their estimates in the AH condition by integrating auditory and haptic information.

Number estimates

Figure 2a plots the group average estimates for Experiment 1, which allows two observations. The first observation is that, among the three conditions, A tended to produce the most accurate estimates on average, followed by AH, and then by H. This observation was supported by a 3 (Condition: A, H, AH) × 8 (Number of Objects: 1—8) repeated measures ANOVA (see Table 2 for details) that showed that the main effects of Condition and Number of Objects were significant. However, whereas the effect size for Number of Objects was, unsurprisingly, very large (η^2^G = 0.817), for Condition it could be considered medium (η^2^G = 0.062). The main effect of Condition was further explored with post-hoc comparisons, which showed significant differences for A vs. H and for H vs. AH, but not for A vs. AH. The interaction between Condition and Number of Objects was significant as well although its effect size was small to medium (η^2^G = 0.037). The interaction could be attributed to the fact that the differences between the three conditions tended to increase as the number of objects in the box increased.

Fig. 2. Mean estimated number of objects for Experiments 1–3. Error bars represent 95% empirical bootstrapped (n = 10,000) confidence intervals using the bias corrected and accelerated percentile method (Haukoos and Lewis 2005)

Table 2. Number estimatesANOVAPost-hoc testsExpEffectdf*FMSE p η2_G_Mean Difference [95% CI]Cohen’s d [95% CI] t

d 1Condition2, 349.126.94 < 0.0010.062A-H0.43 [ 0.17; 0.68]0.59 [ 0.15; 1.04]4.14 < 0.001number of objects1.8, 30.5250.11527.64 < 0.0010.817A-AH0.12 [− 0.14; 0.37]0.16 [− 0.21; 0.53]1.150.259interaction5.6, 94.92.971.470.0130.037H-AH− 0.31 [ 0.57; − 0.05]− 0.43 [− 0.83; − 0.02]2.990.0102Condition1.3, 22.29.230.130.0040.049A-H0.48 [ 0.04; 0.91]0.35 [− 0.01; 0.71]2.760.018number of objects1.4, 23.089.68793.03 < 0.0010.583A-AH− 0.25 [− 0.69; 0.18]− 0.18 [− 0.52; 0.15]1.460.153interaction4.1, 70.15.947.65 < 0.0010.040H-AH− 0.73 [− 1.16; 0.29]− 0.53 [− 0.93; − 0.13]4.22 < 0.0013Condition2, 343.584.810.0390.040A-H− 0.15 [− 0.50; 0.19]0.21 [− 0.67; 0.26]1.130.268number of objects7, 119479.94193.36 < 0.0010.856A-AH− 0.21 [− 0.55; 0.13]− 0.28 [− 0.75; 0.19]1.540.267interaction6.2, 103.11.140.730.350.020H-AH− 0.36 [− 0.71; 0.02]− 0.49 [− 0.99; − 0.02]2.670.035Details for the repeated measures ANOVAs and post-hoc comparisons for the main effect of Condition*non-integer values indicate Greenhouse–Geisser corrected degrees of freedom

The second observation is that the estimates for the AH condition were not better (i.e., more accurate) than those of the uni-modal conditions. In fact, the estimates for the AH condition tended to lie between those of A and H, which at face value could be taken as indicative of a form of integration based on a (weighted) average of the two uni-modal cues. However, the results for the Accuracy and Precision, to which we turn next, are not consistent with such an interpretation, since neither measure shows convincing improvement in performance in the AH condition relative to A or H.

Accuracy

The results for the accuracy analysis are shown in the left column panels of Fig. 3. At an individual level we saw that only 6 out of the 18 participants showed evidence of an increase in accuracy when both auditory and haptic cues were available (black markers). The majority of participants produced RMS values that were between the A and H conditions, and some produced RMS values that were even worse than the worst cues (e.g., participants 1 and 6). A 3 (Condition: A, H, AH) repeated measures ANOVA showed a significant main effect (F(2, 34) = 8.06, MSE = 0.636, p = 0.001, η^2^G = 0.177). Post-hoc comparisons (see Table 3) showed significant differences for A vs. H and for H vs. AH, but not for A vs. AH.

Fig. 3. Tests for integration in terms of accuracy for Experiments 1–3. The top row panels show individual results; each line is one participant. The horizontal lines indicated RMS values ranging between the auditory-only (triangles) to haptic-only (squares) conditions. The circular marker shows RMS values when both cues were available. Only those markers to the left of the best condition are consistent with increased accuracy (shown in solid black). The middle row panels show the corresponding group level histograms based on empirical bootstraps of the individual mean RMS. The bottom row panels show the group level mean RMS values. Error bars represent 95% empirical bootstrapped (n = 10,000) confidence intervals using the bias corrected and accelerated percentile method (Haukoos and Lewis 2005)

Table 3. Accuracy and precisionAccuracyPrecisionExpCompMean difference p Cohen’s dCompMean difference p Cohen’s d1A-H− 0.37 [− 0.61; − 0.14] < 0.001*− 1.09 [− 1.92; − 0.26]Best–Worst− 0.20 [− 0.30; − 0.10] < 0.001*− 0.86 [− 1.45; − 0.28]A-AH− 0.14 [− 0.38; 0.09]0.136− 0.42 [− 1.13; 0.29]Worst-Both0.10 [ 0.00; 0.20]0.0290.44 [− 0.04; 0.92]H-AH0.23 [− 0.01; 0.47]0.0390.67 [− 0.07; 1.42]Best-Both− 0.10 [− 0.20; 0.00]0.029*− 0.43 [− 0.90; 0.05]2A-H− 0.25 [− 0.59; 0.10]0.155− 0.42 [− 1.02; 0.19]Best–Worst− 0.25 [− 0.45; − 0.05]0.012*− 0.58 [− 1.12; − 0.04]A-AH0.13 [− 0.21; 0.47]0.3520.22 [− 0.37; 0.80]Worst-Both0.16 [− 0.04; 0.36]0.0940.39 [− 0.12; 0.89]H-AH0.38 [ 0.03; 0.72]0.0270.63 [− 0.01; 1.27]Best-Both− 0.08 [− 0.28; 0.12]0.312− 0.19 [− 0.68; 0.29]3A-H0.00 [− 0.23; 0.24]0.9770.01 [− 0.72; 0.74]Best–Worst− 0.18 [− 0.27; − 0.10] < 0.001− 1.04 [− 1.71; − 0.37]A-AH0.27 [ 0.04; 0.51]0.0170.87 [ 0.05; 1.67]Worst-Both0.22 [ 0.13; 0.30] < 0.0011.24 [ 0.51; 1.98]H-AH0.27 [ 0.04; 0.50]0.017*0.86 [ 0.04; 1.68]Best-Both0.04 [− 0.05; 0.12]0.3080.20 [− 0.30; 0.71]Details post-hoc comparisons. Values in square brackets are 95% CI

Precision

The results for the precision analysis are shown in the left column panels of Fig. 4. At an individual level we saw that only 7 out of the 18 participants showed evidence of an increase in precision when both auditory and haptic cues were available (black markers). And of those 7 there were 2 that were only marginally better than the best cue (participants 15 and 17). The majority of participants produced RMS values that were between the best and worst cues, and some produced RMS values that were even worse than the worst cues (e.g., participants 1 and 5).

Fig. 4. Tests for integration in terms of precision for Experiments 1–3. The top row panels show individual results; each line is one participant. The horizontal lines indicated RMS values for the best (leftward) and worst (rightward) cues. The circular marker shows RMS values when both cues were available. Only those markers to the left of the best cue are consistent with increased precision (shown in solid black). The labels on the right-hand side of each panel indicates whether the auditory (A) or haptic (H) condition produced the Best cue. The middle row panels show the corresponding group level histograms based on empirical bootstraps of the individual mean rms. The bottom row panels show the group level mean RMS values. Error bars represent 95% empirical bootstrapped (n = 10,000) confidence intervals using the bias corrected and accelerated percentile method (Haukoos and Lewis 2005)

A 3 (Cues: Best, Worst, Both) repeated measures ANOVA showed a significant main effect (F(1.45, 24.79) = 12.82, MSE = 0.250, p < 0.001, η^2^G = 0.116). More critically, the difference between the Best and Both cues conditions showed a decrease in precision when having both cues. However, while a post-hoc comparison suggested that the difference was statistically significant (p < 0.05) the corresponding confidence interval for Cohen d’s suggested that it was not so, since it included zero [− 0.90; 0.05])(See Table 3).

Experiment 2: No time limit, standardized movements

Method

Eighteen new participants within the age range of 19–44 years took part in the study (13 females; all right-handed). All participants reported normal hearing and haptic perception.

Experiment 2 had the same specifications as Experiment 1, except that there was no time limit on the (standardized) manipulation of the box. After the auditory starting signal, a timer started which was stopped when participants pressed a foot pedal to indicate that they were ready to give their numerosity estimation. Exploration times were saved to the computer alongside the numerosity estimation.

Results and discussion

The results for Experiment 2 show that despite the removal of the time limit, performance is similar to Experiment 1. There was again little evidence for integration.

Number estimates

Figure 2 (middle panel) plots the group average estimates for Experiment 2, which allows for two main observations. First, the A and AH conditions now produced virtually identical estimates. Second, the H condition was (again) the least accurate of the three.

A 3 (Condition: A, H, AH) × 8 (Number of Objects: 1—8) repeated measures ANOVA showed that the main effects of Condition (F(1.30, 22.18) = 9.20, MSE = 30.13, p = 0.004; η^2^G = 0.049) and Number of Objects (F(1.35, 22.99) = 89.68, MSE = 793.03, p < 0.001, η^2^G = 0.58) were significant. Post-hoc comparisons for Condition showed significant differences for A vs. H (mean difference = 0.48, 95% CI = [0.04; 0.91], Cohen’s d = 0.35 [− 0.012; 0.71], t = 2.76, p = 0.018), and for H vs. AH (− 0.73 [− 1.16; 0.29], d = − 0.53 [− 0.93; − 0.13], t = 4.22, p < 0.001), but not for A vs. AH (− 0.25 [− 0.69; 0.18], d = − 0.18 [− 0.52; 0.15], t = 1.46, p = 0.15). The interaction between Condition and Number of Objects was significant as well (F(4.12, 70.08) = 5.94, MSE = 7.65, p < 0.001), with a small to medium effect size (η^2^G = 0.040). Like in Experiment 1, the interaction could be attributed to the fact that some of differences among the three conditions tended to increase as the number of objects in the box increased as well.

Accuracy

The results for the accuracy analysis are shown in the center column panels of Fig. 3. At an individual level we saw that 8 out of the 18 participants showed evidence of an increase in accuracy when both auditory and haptic cues were available. The majority of participants produced RMS values that were between the A and H conditions, and some even produced RMS values that were even worse than the worst cues (e.g., participant 1). A 3 (Condition: A, H, AH) repeated measures ANOVA showed a significant effect (F(2, 34) = 4.95, MSE = 0.653, p = 0.029, η^2^G = 0.067). Post-hoc comparisons showed significant differences for H vs. AH, but not for A vs. H and A vs. AH (Table 3).

Precision

The results for the precision analysis are shown in the center column panels of Fig. 4. They showed again that only 7 out of the 18 participants produced evidence for an increase in precision when both auditory and haptic cues were available and that the majority of participants produced RMS values that were between the best and worst cues, and several produced RMS values that were even worse than the worst cues. A 3 (Cues: Best, Worst, Both) repeated measures ANOVA showed a significant effect (F(1.47, 24.98) = 4.95, MSE = 0.381, p = 0.023, η^2^G = 0.057). As in Experiment 1, the difference between the Best and Both cues showed a (non-significant) decrease in precision when having both cues and the confidence interval for Cohen d’s included zero [− 0.68; 0.29]) (See Table 3).

Experiment 3: No time limit, no standardized movements

Method

Eighteen new participants within the age range of 20–41 years took part in the study (10 females; 2 left-handed). All participants reported normal hearing and haptic perception.

Experiment 3 had the same specifications as Experiment 2, except that there was no restriction on the manipulation of the boxes; participants were free to move the box as they saw fit.

Results and discussion

The results for Experiment 3 show that removing both the time limit and the movement constraint does show a benefit for the AH condition. However, the benefit is limited to accuracy and does not appear for precision. Moreover, we observed that H was now on par with A, suggesting that having a movement constraint might be part of the reason H had performed worst in the previous two experiments.

Number estimates

Figure 2 (right panel) plots the group average estimates for Experiment 3, which show that there were only minor apparent differences between the three conditions. That is, similar to Experiment 2, there was no apparent difference between the A and AH conditions. It was interesting, however, to see that the H condition now seemingly performed as well as the other two conditions, closing the gap that was observed in the previous two experiments.

A 3 (Condition: A, H, AH) × 8 (Number of Objects: 1—8) repeated measures ANOVA showed that the main effects of Condition (F(2,34) = 3.58, MSE = 4.81, p = 0.039; η^2^G = 0.040) and Number of Objects (F(7, 119) = 479.94, MSE = 193.36, p < 0.001, η^2^G = 0.86) were significant. Post-hoc comparisons for Condition showed a significant difference only for H vs. AH (mean difference = − 0.36, 95% CI = [− 0.71; 0.02], Cohen’s d = − 0.49 [− 0.99; − 0.02], t = 2.67, p = 0.035), But not for A vs. H (− 0.15 [− 0.50; 0.19], d = 0.21 [− 0.67; 0.26], t = 1.13, p = 0.268) or A vs. AH (− 0.21 [− 0.55; 0.13], d = − 0.28 [− 0.75; 0.19], t = 1.54, p = 0.27). Contrary to Experiments 1 and 2, the interaction between Condition and Number of Objects was not significant (F(6.24, 103.1) = 1.14, MSE = 0.73, p = 0.35) with a small effect size (η^2^G = 0.020).

Accuracy

The results for the accuracy analysis are shown in the right column panels of Fig. 3. At an individual level we saw that 12 out of the 18 participants showed evidence of an increase in accuracy when both auditory and haptic cues were available, making up a majority. This time, only a minority of participants produced RMS values that were between the A and H conditions, and only one produced RMS values that were the worst (e.g., participant 1). A 3 (Condition: A, H, AH) repeated measures ANOVA showed a significant effect (F(1.47, 24.96) = 5.71, MSE = 0.602, p = 0.015, η^2^G = 0.149). Post-hoc comparisons showed significant differences for A vs. AH and H vs. AH, but not for A vs. H (Table 3).

Precision

This time, the results for the precision analysis showed that the majority (11 out of the 18 participants) produced evidence of an increase in precision (Fig. 4). A 3 (Cues: Best, Worst, Both) repeated measures ANOVA showed a significant effect (F(2, 34) = 23.42, MSE = 0.243, p < 0.001, η^2^G = 0.239). However, while the difference between the Best and Both cues did show an increase in precision for the latter, it was not statistically significant and the confidence interval for its corresponding Cohen d’s included zero [− 0.30; 0.71])(See Table 3).

Across experiment comparison: effects of box manipulation constraints

A final set of analyses considered all three experiments together to assess how the gradually released box manipulation constraints affected number estimates, accuracy, and precision.

Number estimates

A 3 (Experiment) × 3 (Condition: A, H, AH) × 8 (Number of Objects: 1—8) mixed ANOVA with Experiment as a between-subjects factor, showed that two of the three possible interactions involving the Experiment factor were significant; most important of which, the second-order interaction between all factors (F(15.26, 389.09) = 2.267, MSE = 1.20, p = 0.004; η^2^G = 0.015).

This interaction was followed up by three separate 3 (Experiment) × 8 (Number of Objects) ANOVAs; one for each of the three levels of Condition. For the auditory-only condition there were no significant effects of Experiment (all F’s < 1). For the haptic-only condition there was a significant interaction (F(7.42, 189.30) = 3.52, MSE = 2.67, p = 0.001; η^2^G = 0.055), which was attributed to the observation that estimates in Experiments 1 and 2 were virtually identical, while those in Experiment 3 were higher (i.e., more accurate) for larger numbers of objects (i.e., more than 5). For the auditory-haptic condition the interaction was not significant in the strict sense (F(7.42, 125.41) = 2.29, MSE = 3.29, p = 0.051); although it produced a small to medium effect size (η^2^G = 0.037).

Thus, it could be argued that the second-order interaction in the omnibus analysis is attributable to the effect of the different box manipulation constraints across the three experiments. This effect was apparently absent for the auditory-only condition (where, contrary to the other conditions, the experimenter manipulated the box), marginal for the auditory-haptic condition, but pronounced for the haptic-only condition.

Accuracy

A 3 (Experiment) × 3 (Condition: A, H, AH) mixed ANOVA with Experiment as a between-subjects factor, showed a significant main effect for Experiment (F(2, 51) = 10.90, MSE = 3.861, p < 0.001; η^2^G = 0.210). This was attributed to the fact that RMS values were overall different between the experiments, being the largest for Experiment 2 (M = 1.90), intermediate for Experiment 1 (M = 1.54), and the smallest for Experiment 3 (M = 1.38). Post-hoc comparisons showed significant differences for Experiment 1 vs. Experiment 2 (p = 0.006) and for Experiment 2 vs. Experiment 3 (p < 0.001), but not for Experiment 1 vs. Experiment 3 (p = 0.156).

A significant main effect for Condition (F(1.66, 84.82) = 11.30, MSE = 1.457, p < 0.001; η^2^G = 0.077) reflected the fact that H tended to produce less accurate responses compared to A and AH. Post-hoc comparisons showed significant differences for A vs. H (p = 0.003) and for H vs. AH (p < 0.001), but not for A vs. AH (p = 0.526). The interaction was not significant (F(6.33, 84.82) = 2.42, MSE = 0.312, p = 0.065; η^2^G = 0.035).

Precision

A 3 (Experiment) × 3 (Cues: Best, Worst, Both) mixed ANOVA with Experiment as a between-subjects factor, showed a significant effect for Cues only (F(1.61, 82.19) = 23.92, MSE = 0.647, p < 0.001; η^2^G = 0.087), but not for any of the effects involving the Experiment factor (main effect: p = 0.079, η^2^G = 0.077; interaction: p = 0.370, η^2^G = 0.007). The (non-significant) main effect of Experiment could be attributed to the fact that RMS values for Experiment 1 were overall smaller than those for Experiments 2 and 3, which were comparable. Thus, there was no statistical evidence for an effect of the different box manipulation constraints on precision.

Individual best cues

The auditory cue was Best for 9/18 (50.0%), 10/18 (55.6%), and 11/18 (61.1%) participants, for the three experiments respectively, and 30/54 (55.6%) across all experiments. In other words, there does not appear to be a systematic advantage for either the auditory or the haptic condition. Instead, there seem to be individual differences in which sense allows for the more precise performance.

General discussion

Our results show that participants could estimate the number of items in closed containers reasonably well, although they showed some general underestimation with higher numbers. The performance in terms of accuracy in the auditory-haptic condition and the auditory condition was quite similar, while in the haptic condition participants underestimated the number of items compared to the other two conditions. This difference became smaller when participants were completely free in their manipulation of the boxes (i.e., no time or strategy restrictions). When looking at performance in terms of precision, to investigate “winner-takes-all” or “integration” mechanisms, we found that this task yielded neither of these mechanisms: about half of the participants seemed to “favor” the auditory cue, while the other half “favored” the haptic cue.

The results of the auditory-only and haptic-only conditions were consistent with earlier work in that either of them was sufficient to make reasonably accurate inferences about the number of objects in a box. The results also reconfirmed a general underestimation of the number that increases with the number of items presented (Frissen et al. 2023; 2025; Hummel et al. 2022; Plaisier and Smeets 2017; Pittenger et al. 1997; Pittinger and Mincy, 1999).

Little evidence for integration

However, our results are not consistent with the assertions put forward in earlier studies concerning how auditory and haptic information interact when both are available. First, our hypothesis was that auditory and haptic information would be integrated and that the estimations of the number of objects in a container with both auditory and haptic information would be more accurate and more precise in comparison to having only auditory or haptic information. Overall, we did not find any evidence in favor of this hypothesis, as we observed neither better accuracy nor better precision.

These results are therefore inconsistent with Plaisier and Smeets’ (2017) conclusion that auditory and haptic information are integrated. One contributing factor explaining the difference in results we have already mentioned in the Introduction. The Plaisier and Smeets study featured a small group of participants (n = 7), which likely introduced a large measure of sampling variability. Moreover, estimates of precision were based on the variability between participants as opposed to being based on more instructive within-participant measures. One other difference that may be of interest to point out is the maximum number of objects in the container. Whereas in our study there could be up to eight objects, in the Plaisier and Smeets study there was a maximum of five. This small difference may nevertheless have significant consequences since humans have separate cognitive mechanisms for processing small and large numbers (Feigenson et al. 2004; Revkin et al. 2008). Subitizing processes small numerosities, up to about 4, rapidly and essentially without error. The Approximate Number System (ANS) processes numbers larger than that, using estimation to make approximate and imprecise judgments of numerical magnitudes (Dehaene 1997; Feigenson et al. 2004). In short, it is possible that the enumeration task in our study was in part more difficult because of the presumed involvement of the less precise ANS from about 4 items onwards (Frissen and Chen 2024; see also Fig. 2).

Little evidence for winner-take-all favoring auditory cues

We also did not find convincing evidence for a winner-take-all mechanism that favors auditory over haptic cues as proposed by Pittinger and colleagues 1997; Pittinger and Mincy 1999. For instance, at an individual level, measures of accuracy in the auditory-haptic condition did not tend to match with the auditory-only condition; nor, for that matter, did it tend to match with the haptic-only condition. In fact, there did not seem to be a winning cue as such.

If anything, the results suggest that the winning cue, or rather the more precise cue, is individually determined, with only roughly half of the participants favoring the auditory cue and the other half favoring the haptic cue. We are currently not able to know the reason for an individual’s leaning toward either the auditory or haptic sensory modality. One explanation comes from the modality appropriateness hypothesis (Welch 1999), which holds that the modality that is the “more appropriate” for the task will dominate. For instance, while both the auditory and visual sense can provide spatial information, the latter sense is known to be orders of magnitude more precise than the former and would therefore dominate in a spatial localization task. Contrary to this localization example, there is no apparent reason to assume that either auditory or haptic sense would be more appropriate in the current container enumeration task, which might explain why we found that either sense is favored more or less equally across participants. We note that the modality appropriateness hypothesis implies that the presumed dominant modality is fixed. However, such fixedness in modality dominance might not exists and it is rather the reliability of the perceptual estimate of a stimulus that is instrumental (e.g., Ernst and Bülthoff 2004). Thus, any apparent dominance of modality A over modality B might be reversable by experimentally adding noise to the stimulus in modality A, and vice versa. We return to this notion in the Conclusion.

Effect of box exploration constraints

Our secondary objective was to explore the potential effects of how participants manipulate the box on their ability to enumerate its content. The comparison across the three experiments appears to show some such procedural effects. Letting go of the time constraint (Experiment 2) did not have a noticeable effect on the results. However, letting go of both the time and movement constraints (Experiment 3) produced two effects. First, Experiment 3 was the only case in which there was some suggestion of integration. That is, for the majority of participants, the auditory-haptic condition tended to show evidence for improved accuracy and precision, although only the former was statistically significant.

Second, performance in the haptic-only condition improved considerably. Indeed, the differences in estimates between the auditory-only and haptic-only conditions, so obvious in Experiments 1 and 2, disappeared almost entirely. These findings are consistent with seminal studies investigating the role of active touch and time constraints on form recognition (Gibson 1962; Heller 1984). They are also reminiscent of studies on rod length perception (Burton and Turvey 1990; Chan 1996; Park and Rie 2014). For instance, Park and Rie (2014) investigated participants’ ability to estimate the length of a rod under various movement conditions. In one of these conditions (“self-determined”) participants moved a rod from one end while the experiment held it on the other. In another condition (“other-determined”) the situation was reversed. While in the other-determined condition participants showed sensitivity to rod length they underestimated it by about 80%. In the self-determined condition underestimation was dramatically reduced to 35%. Overall, the results reiterate the importance of letting the haptic system operate on its own terms with no artificial constraints (Klatzky et al. 1985).

Enumeration strategies

Container studies thus far have not explicitly tried to work out which strategies participants might adopt when enumerating content. The exception is Frissen and Chen (2024), who in a study using containers with large numbers of objects (between several dozen and several hundreds), debriefed participants about what they thought was their strategy. This particular case revealed three strategies, “positive space” (participants made reference to the container’s contents when constructing their quantity judgments), “negative space” (participants make reference to the excess space in the container), and “weight” (participant explicitly mentioned using the perceived weight). Incidentally, formal investigations of the effect of weight (in studies using small numbers of objects) showed only marginal importance of weight (Frissen et al. 2023; 2025).

For the present study, however, we can only speculate as to which strategies participants used in enumerating the contents of the containers, which we assume here could be perceptual and/or take place at higher-level cognitive levels. For instance, attention may have played a role. If attention is attracted to the one of the modalities just preceding a trial, that modality will be more important in the final estimation of that trial (e.g., because of keyboard sounds or touching of the hands preceding the trial). Attention mediated cue selection has been shown extensively in visual perception (e.g. Lee et al. 1999). And in multi-sensory perception it has been shown that attention can be drawn to separate modalities (for a review see Spence 2002). A higher-level strategy might be parallel item-by-item estimation, in which case some items may have been perceived auditorily and other items haptically to be added up at a higher cognitive level. However, this would more likely yield overestimation of numbers as opposed to the underestimation that we found here and in other studies.

Conclusion

The present work is one of only handful of studies looking at how inferences about the content of a container are affected by having concurrent access to auditory and haptic information. However, it is the first to present a comprehensive analysis, using multiple dependent variables at the level of individual participants. The results do not appear to be in line with previously proposed interpretations which suggested that auditory and haptic information would be either subject to a winner-take-all type of mechanism that favors auditory information or would be integrated. At best we found some partial and tentative evidence for an integrative mechanism when participants were released from both time and movement constraints on how to explore the container. The results therefore also bring into focus, for the first time, the potential role of the specific way people are allowed to interact with the container. Given the tentative evidence and the potential effect of how containers are explored, future work might want to readdress the integration hypothesis. One approach would be to allow for unconstrained container exploration while independently manipulating the reliability of the auditory and haptic information and assess whether this biases the number estimates in the direction of the more reliable cue.

Bibliography2

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1The Math Works, Inc.: MATLAB (Version 9.13.0, R 2022 b) [Computer software]. The Math Works, Inc. (2022) https://www.mathworks.com
2Suddendorf, T., and Langely, M.: Got your bag? The critical place of mobile containers in human evolution. The Conversation. (2020) https://theconversation.com/got-your-bag-the-critical-place-of-mobile-containers-in-human-evolution-142712