Learning from Silence and Noise for Visual Sound Source Localization

Xavier Juanola; Giovana Morais; Magdalena Fuentes; Gloria Haro

arXiv:2508.21761·cs.CV·September 1, 2025

TL;DR

This paper introduces a self-supervised model, SSL-SaN, for visual sound source localization that effectively handles silence, noise, and negative audio, improving robustness and evaluation in diverse scenarios.

Contribution

It presents a new training strategy incorporating silence and noise, a novel metric for feature alignment, and an extended dataset with negative audio for better evaluation.

Findings

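1. SSL-SaN achieves state-of-the-art performance among self-supervised models in sound localization and cross-modal retrieval.

2. The new metric quantifies the trade-off between alignment and separability of auditory and visual features.

3. The extended dataset IS3+ includes negative audio scenarios (silence, noise, and offscreen sounds).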

Abstract

Visual sound source localization is a fundamental perception task that aims to detect the location of sounding sources in a video given its audio. Despite recent progress, we identify two shortcomings in current methods: 1) most approaches perform poorly in cases with low audio-visual semantic correspondence such as silence, noise, and offscreen sounds, i.e. in the presence of negative audio; and 2) most prior evaluations are limited to positive cases, where both datasets and metrics convey scenarios with a single visible sound source in the scene. To address this, we introduce three key contributions. First, we propose a new training strategy that incorporates silence and noise, which improves performance in positive cases, while being more robust against negative sounds. Our resulting self-supervised model, SSL-SaN, achieves state-of-the-art performance compared to other…

Tables

Table 1: Results of the cross-modal alignment and separability analysis. Best results in italics. Best results of self-supervised models in bold and the second-best underlined.

Test set	Model	Self Supervised	Magnitude ↓	Alignment ↑	Separability_± ↑
VGG-SS	LVS [CVPR, 2021]	✘	0.92	0.19	-0.1124
VGG-SS	EZ-VSL [ECCV, 2022]	✘	1.14	0.20	-0.0339
VGG-SS	FNAC [CVPR, 2023]	✘	1.25	0.03	-0.0094
VGG-SS	SLAVC [NEURIPS, 2022]	✘	1.14	0.22	-0.0427
VGG-SS	SSL-Align [ICCV, 2023]	✘	0.94	0.41	0.1034
VGG-SS	ACL [WACV, 2025]	✘	1.31	0.14	-0.2551
VGG-SS	SSL-TIE [ACMMM, 2022]	✓	0.96	0.37	0.0606
VGG-SS	SSL-Align [ICCV, 2023]	✓	0.95	0.39	0.0428
VGG-SS	Ours $\to$ SSL-SaN	✓	0.93	0.40	0.0971
IS3	LVS [CVPR, 2021]	✘	0.95	0.11	-0.1113
IS3	EZ-VSL [ECCV, 2022]	✘	1.14	0.17	-0.0423
IS3	FNAC [CVPR, 2023]	✘	1.24	0.02	-0.0241
IS3	SLAVC [NEURIPS, 2022]	✘	1.17	0.15	-0.0733
IS3	SSL-Align [ICCV, 2023]	✘	0.98	0.32	-0.0608
IS3	ACL [WACV, 2025]	✘	1.29	0.17	0.1112
IS3	SSL-TIE [ACMMM, 2022]	✓	0.98	0.28	-0.0530
IS3	SSL-Align [ICCV, 2023]	✓	0.99	0.26	-0.1287
IS3	Ours $\to$ SSL-SaN	✓	0.97	0.31	-0.0327
IS3+	LVS [CVPR, 2021]	✘	0.96	0.10	-0.1148
IS3+	EZ-VSL [ECCV, 2022]	✘	1.14	0.17	-0.0423
IS3+	FNAC [CVPR, 2023]	✘	1.24	0.01	-0.0231
IS3+	SLAVC [NEURIPS, 2022]	✘	1.16	0.15	-0.0692
IS3+	SSL-Align [ICCV, 2023]	✘	0.98	0.33	-0.0480
IS3+	ACL [WACV, 2025]	✘	1.29	0.17	0.0717
IS3+	SSL-TIE [ACMMM, 2022]	✓	0.99	0.28	-0.0604
IS3+	SSL-Align [ICCV, 2023]	✓	0.99	0.26	-0.1279
IS3+	Ours $\to$ SSL-SaN	✓	0.97	0.31	-0.0327
AVS-Bench S4	LVS [CVPR, 2021]	✘	0.88	0.28	-0.1085
AVS-Bench S4	EZ-VSL [ECCV, 2022]	✘	1.12	0.18	-0.0180
AVS-Bench S4	FNAC [CVPR, 2023]	✘	1.23	0.01	-0.0017
AVS-Bench S4	SLAVC [NEURIPS, 2022]	✘	1.15	0.14	-0.0327
AVS-Bench S4	SSL-Align [ICCV, 2023]	✘	0.86	0.49	0.2161
AVS-Bench S4	ACL [WACV, 2025]	✘	1.28	0.18	0.0575
AVS-Bench S4	SSL-TIE [ACMMM, 2022]	✓	0.93	0.41	0.1442
AVS-Bench S4	SSL-Align [ICCV, 2023]	✓	0.88	0.48	0.1729
AVS-Bench S4	Ours $\to$ SSL-SaN	✓	0.86	0.47	0.2479
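
The separability values above follow the expression Sep = Q_1^{+} - Q_3^{-} listed among the paper's equations. A minimal sketch of that computation is given below, assuming that Q_1^{+} is the first quartile of similarity scores over positive audio-visual pairs and Q_3^{-} the third quartile over negative pairs; this reading, the function name `separability`, the quantile interpolation method, and the toy scores are assumptions for illustration, not the authors' code.

```python
from statistics import quantiles


def separability(pos_sims, neg_sims):
    """Sep = Q1 of positive-pair similarities minus Q3 of negative-pair similarities."""
    # quantiles(..., n=4) returns the three quartile cut points [Q1, Q2, Q3].
    q1_pos = quantiles(pos_sims, n=4, method="inclusive")[0]
    q3_neg = quantiles(neg_sims, n=4, method="inclusive")[2]
    return q1_pos - q3_neg


# Toy similarity scores (hypothetical values, not taken from the paper).
pos = [0.2, 0.4, 0.6, 0.8, 1.0]
neg = [-0.4, -0.2, 0.0, 0.2, 0.4]
sep = separability(pos, neg)

# Overlapping distributions give a negative value.
sep_overlap = separability([0.1, 0.1, 0.1, 0.1], [0.5, 0.5, 0.5, 0.5])
```

Under this reading, a positive value means that at least three quarters of the positive pairs score above three quarters of the negative pairs, while a negative value indicates overlap between the two distributions.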

Table 2: Localization results on the VGG-SS, IS3+ and S4 extended test sets. The best value across all models is shown in italics. Within the self-supervised subset, the best value is in bold and the second-best is underlined. The cIoU and AUC columns use positive audio input; the Silence, Noise and Offscreen columns use negative audio input; F_LOC and F_AUC are global metrics.

Test set	Model	Self Sup.	cIoU Uth ↑	cIoU Adap. ↑	AUC Uth ↑	AUC Adap. ↑	Silence pIA ↓	Silence AUC_N ↑	Noise pIA ↓	Noise AUC_N ↑	Offscreen pIA ↓	Offscreen AUC_N ↑	F_LOC ↑	F_AUC ↑
VGG-SS	RCGrad [INTERSPEECH, 2022]	✘	11.71	37.04	12.08	37.26	4.42	95.80	4.42	95.80	4.41	95.82	20.86	21.45
VGG-SS	LVS [CVPR, 2021]	✘	3.90	39.43	5.91	41.27	1.59	98.30	0.05	99.93	0.70	99.23	7.51	11.15
VGG-SS	EZ-VSL [ECCV, 2022]	✘	10.41	43.85	11.97	42.86	1.83	98.09	0.37	99.60	1.80	98.08	18.84	21.34
VGG-SS	FNAC [CVPR, 2023]	✘	17.45	47.14	18.69	44.27	4.09	95.88	4.09	95.88	3.58	96.38	29.53	31.28
VGG-SS	SLAVC [NEURIPS, 2022]	✘	9.60	49.62	11.40	45.55	5.53	94.43	0.59	99.39	1.54	98.45	17.48	20.40
VGG-SS	SSL-Align [ICCV, 2023]	✘	34.46	56.86	34.78	49.25	2.34	97.57	1.11	98.79	1.97	97.95	51.02	51.36
VGG-SS	ACL [WACV, 2025]	✘	14.31	39.92	15.99	40.30	0.29	99.60	0.40	99.47	0.91	99.01	25.02	27.55
VGG-SS	SSL-TIE [ACMMM, 2022]	✓	27.78	51.88	28.23	48.00	0.78	99.16	0.68	99.26	2.50	97.39	43.36	43.90
VGG-SS	SSL-Align [ICCV, 2023]	✓	25.40	53.92	25.96	47.84	1.18	98.72	0.28	99.70	0.63	99.33	40.45	41.16
VGG-SS	Ours $\to$ SSL-SaN	✓	29.61	52.74	29.97	48.66	0.01	99.99	0.00	100.00	1.72	98.17	45.63	46.05
IS3	RCGrad [INTERSPEECH, 2022]	✘	0.10	1.07	2.54	3.57	4.49	95.81	4.49	95.81	4.47	95.79	0.20	4.95
IS3	LVS [CVPR, 2021]	✘	2.07	14.10	4.31	26.37	1.08	98.80	0.03	99.95	0.22	99.74	4.05	8.27
IS3	EZ-VSL [ECCV, 2022]	✘	3.82	16.82	5.88	28.53	1.10	98.80	0.01	99.98	0.78	99.14	7.35	11.11
IS3	FNAC [CVPR, 2023]	✘	8.23	18.72	9.98	29.61	3.32	96.65	3.32	96.65	2.09	97.87	15.17	18.10
IS3	SLAVC [NEURIPS, 2022]	✘	5.78	18.92	7.74	28.46	18.57	81.38	0.49	99.49	2.49	97.49	10.89	14.28
IS3	SSL-Align [ICCV, 2023]	✘	25.25	34.61	25.96	39.23	4.33	95.49	1.17	98.67	1.95	97.94	40.12	41.00
IS3	ACL [WACV, 2025]	✘	49.07	67.80	49.95	67.87	0.02	99.86	1.37	98.78	0.50	99.40	65.70	66.48
IS3	SSL-TIE [ACMMM, 2022]	✓	17.32	18.43	18.29	29.69	0.49	99.42	0.40	99.53	3.19	96.71	29.47	30.85
IS3	SSL-Align [ICCV, 2023]	✓	17.37	29.68	18.54	37.12	3.33	96.45	0.55	99.42	0.67	99.26	29.54	31.20
IS3	Ours $\to$ SSL-SaN	✓	18.87	18.90	19.73	30.28	0.00	100.00	0.00	100.00	1.97	97.87	31.72	32.92
IS3+	RCGrad [INTERSPEECH, 2022]	✘	0.10	1.07	2.54	3.57	4.50	52.49	4.49	52.49	4.48	52.49	0.20	4.95
IS3+	LVS [CVPR, 2021]	✘	2.05	14.35	4.31	26.72	1.08	98.80	0.03	99.95	0.29	99.67	4.01	8.26
IS3+	EZ-VSL [ECCV, 2022]	✘	3.91	17.95	5.95	28.70	1.16	98.75	0.02	99.98	0.95	98.95	7.52	11.23
IS3+	FNAC [CVPR, 2023]	✘	7.62	16.84	9.37	28.53	3.43	96.54	3.43	96.54	3.22	96.73	14.13	17.08
IS3+	SLAVC [NEURIPS, 2022]	✘	6.43	18.23	8.31	28.33	18.57	81.38	0.49	99.49	3.21	96.76	12.02	15.25
IS3+	SSL-Align [ICCV, 2023]	✘	25.63	33.36	26.29	38.11	4.33	95.49	1.17	98.67	2.30	97.58	40.58	41.39
IS3+	ACL [WACV, 2025]	✘	44.75	63.93	45.79	64.16	0.02	99.86	1.37	98.79	0.39	99.64	61.72	62.70
IS3+	SSL-TIE [ACMMM, 2022]	✓	16.37	17.93	17.40	28.88	0.49	99.42	0.40	99.53	3.03	96.86	28.09	29.58
IS3+	SSL-Align [ICCV, 2023]	✓	17.11	28.90	18.30	35.99	3.33	96.45	0.55	99.42	0.70	99.23	29.15	30.86
IS3+	Ours $\to$ SSL-SaN	✓	19.13	18.90	19.94	30.28	0.00	100.00	0.00	100.00	2.09	97.73	32.08	33.21
AVS-Bench S4	RCGrad [INTERSPEECH, 2022]	✘	15.76	33.08	16.15	33.25	4.34	95.92	4.34	95.91	4.33	95.92	27.07	27.65
AVS-Bench S4	LVS [CVPR, 2021]	✘	6.82	21.30	8.54	30.68	2.16	97.73	0.07	99.91	0.65	99.28	12.76	15.72
AVS-Bench S4	EZ-VSL [ECCV, 2022]	✘	11.14	19.57	12.56	30.81	0.71	99.24	0.52	99.45	1.40	98.53	20.04	22.29
AVS-Bench S4	FNAC [CVPR, 2023]	✘	18.57	22.85	19.55	33.14	5.79	94.17	5.79	94.17	3.17	96.81	31.07	32.42
AVS-Bench S4	SLAVC [NEURIPS, 2022]	✘	9.07	22.58	10.80	33.13	3.72	96.24	1.17	98.81	1.42	98.55	16.60	19.45
AVS-Bench S4	SSL-Align [ICCV, 2023]	✘	33.04	30.73	33.09	39.32	4.81	95.13	1.04	98.88	1.48	98.45	49.36	49.41
AVS-Bench S4	ACL [WACV, 2025]	✘	51.34	65.75	49.57	64.11	0.00	99.99	0.03	99.98	0.32	99.72	67.82	66.26
AVS-Bench S4	SSL-TIE [ACMMM, 2022]	✓	28.40	31.24	28.64	38.60	2.90	97.08	2.69	97.25	1.37	98.56	44.00	44.29
AVS-Bench S4	SSL-Align [ICCV, 2023]	✓	31.75	32.30	31.94	39.35	0.87	99.03	0.16	99.81	0.48	99.49	48.14	48.35
AVS-Bench S4	Ours $\to$ SSL-SaN	✓	32.76	33.87	32.86	41.11	0.05	99.95	0.00	100.00	0.95	98.96	49.31	49.42

Table 3: Results of the cross-modal retrieval. Best results in italics. Best results of self-supervised models in bold and the second-best underlined.

Test set	Model	Self Supervised	I→A P@1 ↑	I→A P@5 ↑	I→A P@10 ↑	I→A A@1 ↑	I→A A@5 ↑	I→A A@10 ↑	A→I P@1 ↑	A→I P@5 ↑	A→I P@10 ↑	A→I A@1 ↑	A→I A@5 ↑	A→I A@10 ↑
VGG-SS	LVS [CVPR, 2021]	✘	2.96	2.84	2.83	2.96	10.37	16.47	4.02	3.92	3.54	4.02	13.71	20.22
VGG-SS	EZ-VSL [ECCV, 2022]	✘	2.11	2.27	2.16	2.11	8.21	13.06	3.94	3.51	3.24	3.94	14.19	22.55
VGG-SS	FNAC [CVPR, 2023]	✘	2.03	1.83	2.06	2.03	6.65	14.26	2.08	1.66	1.84	2.08	7.51	14.92
VGG-SS	SLAVC [NEURIPS, 2022]	✘	3.54	3.14	2.93	3.54	10.80	16.02	4.07	3.51	3.32	4.07	14.72	24.91
VGG-SS	SSL-Align [ICCV, 2023]	✘	27.95	25.38	23.13	27.95	49.00	58.64	32.04	29.15	26.26	32.04	56.93	67.08
VGG-SS	ACL [WACV, 2025]	✘	10.07	9.36	8.92	10.07	29.60	43.36	13.35	12.92	12.44	13.35	33.52	43.87
VGG-SS	SSL-TIE [ACMMM, 2022]	✓	16.25	14.87	13.83	16.25	35.81	47.51	15.12	14.82	13.81	15.12	35.91	47.41
VGG-SS	SSL-Align [ICCV, 2023]	✓	22.65	20.38	19.04	22.65	44.53	56.50	26.67	23.82	21.40	26.67	51.26	61.85
VGG-SS	Ours $\to$ SSL-SaN	✓	24.61	22.94	20.88	24.61	48.57	58.97	24.49	22.97	21.25	24.49	46.74	57.41
AVS-Bench S4	LVS [CVPR, 2021]	✘	20.81	23.89	25.15	20.81	56.49	70.81	33.38	30.81	30.26	33.38	54.59	66.22
AVS-Bench S4	EZ-VSL [ECCV, 2022]	✘	2.97	3.70	3.81	2.97	13.65	23.92	(A→I block incomplete in source: 5.14, 4.86, 5.14, 5.41, 9.19)
AVS-Bench S4	FNAC [CVPR, 2023]	✘	5.41	4.92	4.86	5.41	11.89	20.95	4.59	4.62	5.12	4.59	5.14	9.86
AVS-Bench S4	SLAVC [NEURIPS, 2022]	✘	5.95	5.22	5.01	5.95	15.68	24.32	6.76	6.62	6.47	6.76	7.16	12.70
AVS-Bench S4	SSL-Align [ICCV, 2023]	✘	85.27	84.08	82.38	85.27	91.76	93.24	87.16	86.00	84.70	87.16	94.86	96.76
AVS-Bench S4	ACL [WACV, 2025]	✘	75.14	73.95	71.68	75.14	91.08	94.05	(A→I block incomplete in source: 72.57, 72.22, 72.57, 80.14)
AVS-Bench S4	SSL-TIE [ACMMM, 2022]	✓	73.38	75.27	74.92	73.38	85.27	90.00	(A→I block incomplete in source: 66.76, 66.69, 66.76, 77.03)
AVS-Bench S4	SSL-Align [ICCV, 2023]	✓	80.00	78.54	77.43	80.00	91.76	94.59	81.89	80.65	79.28	81.89	91.08	94.73
AVS-Bench S4	Ours $\to$ SSL-SaN	✓	79.86	80.81	81.50	79.86	89.59	92.43	81.35	81.38	81.82	81.35	81.49	87.16

Table 4: Ablation table on the use of the silence (S) and noise (N) samples during training, and the new loss terms ($\mathcal{L}_{S}$ and $\mathcal{L}_{N}$). Best results in bold, second best underlined.

Negative audio input

Test set

S

N

ℒ_{S}

ℒ_{N}

cIoU_Uth

↑

pIA

_{S}

↓

pIA

_{N}

↓

pIA

_{O}

↓

F

_{LOC}

↑

Sep

↑

AVS-Bench S4

31.57

2.53

2.49

1.00

47.76

0.2384

✓

31.76

2.39

1.85

0.82

48.01

0.2350

✓

32.49

2.54

2.58

0.81

48.80

0.2458

✓

31.97

2.26

2.59

0.88

48.22

0.2558

✓

32.07

2.21

2.26

1.07

48.34

0.2375

✓

32.05

1.75

1.01

0.84

48.40

0.2393

✓

32.76

0.05

0.00

0.95

49.31

0.2479

Table A.1: Training cost comparison including number of training epochs

Method	Training Epoch	Validation Epoch	Retrieval Phase	Epochs	Total (Epoch)	Total
SSL-TIE	970.94 s. (16.18 min)	19.46 s. (0.32 min)	995.88 s. (16.60 min)	100	1986.28 s. (33.10 min)	55.17 h (2.30 day)
Ours $\to$ SSL-SaN	2093.79 s. (34.90 min)	22.14 s. (0.37 min)	1029.84 s. (17.16 min)	120	3145.77 s. (52.43 min)	104.86 h (4.37 day)
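
The totals in Table A.1 can be checked by summing the three per-epoch phases and multiplying by the number of epochs. The short sketch below reproduces that arithmetic from the table's own numbers; the function name `training_cost` is only an illustrative choice.

```python
def training_cost(train_s, val_s, retrieval_s, epochs):
    """Return (seconds per epoch, total hours, total days) for a training run."""
    per_epoch_s = train_s + val_s + retrieval_s
    total_h = per_epoch_s * epochs / 3600.0
    return per_epoch_s, total_h, total_h / 24.0


# Per-phase timings in seconds, copied from Table A.1.
tie_epoch_s, tie_hours, tie_days = training_cost(970.94, 19.46, 995.88, 100)
san_epoch_s, san_hours, san_days = training_cost(2093.79, 22.14, 1029.84, 120)

# Relative cost of SSL-SaN with respect to SSL-TIE.
cost_ratio = san_hours / tie_hours

print(f"SSL-TIE: {tie_epoch_s:.2f} s/epoch, {tie_hours:.2f} h, {tie_days:.2f} days")
print(f"SSL-SaN: {san_epoch_s:.2f} s/epoch, {san_hours:.2f} h, {san_days:.2f} days")
print(f"ratio: {cost_ratio:.2f}x")
```

With these figures, the full SSL-SaN run costs roughly 1.9 times the SSL-TIE run, which comes from a longer training phase per epoch and 20 additional epochs.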

Table A.2: Average inference time per sample

Method	Inference Time (seconds)
LVS	0.42
EZ-VSL	0.27
FNAC	0.31
SLAVC	0.32
SSL-TIE	0.83
SSL-Align	0.45
ACL	0.47
Ours $\to$ SSL-SaN	0.82

Table A.3: Resource requirements (number of parameters in millions)

Method	Image Encoder	Audio Encoder	Full Model
LVS	11.18 M	11.17 M	23.95 M
EZ-VSL	11.18 M	11.17 M	22.87 M
FNAC	11.18 M	11.17 M	22.87 M
SLAVC	11.18 M	11.17 M	46.79 M
SSL-TIE	11.18 M	11.17 M	23.95 M
SSL-Align	11.18 M	11.17 M	23.95 M
ACL	85.05 M	89.79 M	248.34 M
Ours $\to$ SSL-SaN	11.18 M	11.17 M	23.95 M

Table A.4: Full mapping from original IS3 class labels to simplified IS3+ labels.

playing accordion $\to$ accordion	car engine starting $\to$ vehicle	fox barking $\to$ fox
playing acoustic guitar $\to$ acoustic guitar	car passing by $\to$ vehicle	playing french horn $\to$ french horn
airplane $\to$ airplane	driving buses $\to$ vehicle	gibbon howling $\to$ gibbon
airplane, airplane flyby $\to$ airplane	opening or closing car electric windows $\to$ vehicle	goat bleating $\to$ goat
airplane flyby $\to$ airplane	race car, auto racing $\to$ vehicle	playing electric guitar $\to$ guitar
alarm clock ringing $\to$ alarm clock	cat caterwauling $\to$ cat	playing steel guitar, slide guitar $\to$ guitar
alligators, crocodiles hissing $\to$ alligator	cat growling $\to$ cat	cap gun shooting $\to$ gun
baby laughter $\to$ baby	cat hissing $\to$ cat	machine gun shooting $\to$ gun
baby crying $\to$ baby	cat meowing $\to$ cat	hair dryer drying $\to$ hair dryer
playing banjo $\to$ banjo	cat purring $\to$ cat	playing harpsichord $\to$ harp
playing bassoon $\to$ bassoon	playing cello $\to$ cello	playing harp $\to$ harp
barn swallow calling $\to$ bird	chainsawing trees $\to$ chainsaw	hedge trimmer running $\to$ hedge trimmer
bird chirping, tweeting $\to$ bird	cheetah chirrup $\to$ cheetah	helicopter $\to$ helicopter
bird wings flapping $\to$ bird	chicken clucking $\to$ chicken	horse clip-clop $\to$ horse
black capped chickadee calling $\to$ bird	chicken crowing $\to$ chicken	ice cream truck, ice cream van $\to$ ice cream truck
canary calling $\to$ bird	child singing $\to$ child	lathe spinning $\to$ lathe/engine
wood thrush calling $\to$ bird	child speech, kid speaking $\to$ child	lawn mowing $\to$ lawn mower
blowtorch igniting $\to$ blowtorch	chimpanzee pant-hooting $\to$ chimpanzee	lions growling $\to$ lion
typing on computer keyboard $\to$ computer keyboard	chinchilla barking $\to$ chinchilla	lions roaring $\to$ lion
playing cornet $\to$ cornet	chipmunk chirping $\to$ chipmunk	male speech, man speaking $\to$ male voice
bull bellowing $\to$ cow	church bell ringing $\to$ church bell	playing mandolin $\to$ mandolin
cattle mooing $\to$ cow	cricket chirping $\to$ cricket	missile launch $\to$ missile
cow lowing $\to$ cow	playing cymbal $\to$ cymbal	motorboat, speedboat acceleration $\to$ motorboat
dinosaurs bellowing $\to$ dinosaur	dog barking $\to$ dog	driving motorcycle $\to$ motorcycle
dog baying $\to$ dog	dog bow-wow $\to$ dog	mouse squeaking $\to$ mouse
dog growling $\to$ dog	dog howling $\to$ dog	playing oboe $\to$ oboe
dog whimpering $\to$ dog	donkey, ass braying $\to$ donkey	ocean burbling $\to$ ocean
playing drum kit $\to$ drums	eagle screaming $\to$ eagle	orchestra $\to$ orchestra
electric grinder grinding $\to$ electric grinder	playing electronic organ $\to$ electronic organ	owl hooting $\to$ owl
elephant trumpeting $\to$ elephant	eletric blender running $\to$ eletric blender	parrot talking $\to$ parrot
elk bugling $\to$ elk	female singing $\to$ female voice	penguins braying $\to$ penguin
female speech, woman speaking $\to$ female voice	fireworks banging $\to$ fireworks	people crowd $\to$ people crowd
people eating crisps $\to$ people eating crisps	people marching $\to$ people marching	playing piano $\to$ piano
pigeon, dove cooing $\to$ pigeon	popping popcorn $\to$ popcorn	playing saxophone $\to$ saxophone
sea lion barking $\to$ sea lion	sheep bleating $\to$ sheep	playing shofar $\to$ shofar
fire truck siren $\to$ siren	police car (siren) $\to$ siren	skateboarding $\to$ skateboarding
slot machine $\to$ slot machine	snake hissing $\to$ snake	snake rattling $\to$ snake
driving snowmobile $\to$ snowmobile	splashing water $\to$ stream	squishing water $\to$ stream
subway, metro, underground $\to$ subway	tap dancing $\to$ tap dance	telephone bell ringing $\to$ telephone bell
playing timbales $\to$ timbales	tractor digging $\to$ vehicle	train horning $\to$ train
train wheels squealing $\to$ train	train whistling $\to$ train	playing trumpet $\to$ trumpet
turkey gobbling $\to$ turkey	playing ukulele $\to$ ukulele	vacuum cleaner cleaning floors $\to$ vacuum cleaner
waterfall burbling $\to$ waterfall	whale calling $\to$ whale	wind chime $\to$ wind chime
woodpecker pecking tree $\to$ woodpecker

Table A.5: Results of the cross-modal retrieval analysis for same class in VGG-SS, IS3, IS3+ and AVS-Bench S4.

Test set	Model	Self Supervised	I→A P@1 ↑	I→A P@5 ↑	I→A P@10 ↑	I→A A@1 ↑	I→A A@5 ↑	I→A A@10 ↑	A→I P@1 ↑	A→I P@5 ↑	A→I P@10 ↑	A→I A@1 ↑	A→I A@5 ↑	A→I A@10 ↑
VGG-SS	LVS [CVPR, 2021]	✘	2.96	2.84	2.83	2.96	10.37	16.47	4.02	3.92	3.54	4.02	13.71	20.22
VGG-SS	EZ-VSL [ECCV, 2022]	✘	2.11	2.27	2.16	2.11	8.21	13.06	3.94	3.51	3.24	3.94	14.19	22.55
VGG-SS	FNAC [CVPR, 2023]	✘	2.03	1.83	2.06	2.03	6.65	14.26	2.08	1.66	1.84	2.08	7.51	14.92
VGG-SS	SLAVC [NEURIPS, 2022]	✘	3.54	3.14	2.93	3.54	10.80	16.02	4.07	3.51	3.32	4.07	14.72	24.91
VGG-SS	SSL-Align [ICCV, 2023]	✘	27.95	25.38	23.13	27.95	49.00	58.64	32.04	29.15	26.26	32.04	56.93	67.08
VGG-SS	ACL [WACV, 2025]	✘	10.07	9.36	8.92	10.07	29.60	43.36	13.35	12.92	12.44	13.35	33.52	43.87
VGG-SS	SSL-TIE [ACMMM, 2022]	✓	16.25	14.87	13.83	16.25	35.81	47.51	15.12	14.82	13.81	15.12	35.91	47.41
VGG-SS	SSL-Align [ICCV, 2023]	✓	22.65	20.38	19.04	22.65	44.53	56.50	26.67	23.82	21.40	26.67	51.26	61.85
VGG-SS	Ours $\to$ SSL-SaN	✓	24.61	22.94	20.88	24.61	48.57	58.97	24.49	22.97	21.25	24.49	46.74	57.41
IS3	LVS [CVPR, 2021]	✘	4.46	5.15	5.19	4.46	15.96	24.38	5.26	5.19	5.06	5.26	13.63	20.00
IS3	EZ-VSL [ECCV, 2022]	✘	2.25	2.80	2.82	2.25	10.85	17.16	3.36	3.14	3.19	3.36	12.45	21.36
IS3	FNAC [CVPR, 2023]	✘	4.14	2.96	2.85	4.14	12.25	21.25	2.04	2.21	2.17	2.04	9.85	16.93
IS3	SLAVC [NEURIPS, 2022]	✘	3.41	3.27	3.23	3.41	12.56	19.65	3.94	3.42	3.33	3.94	13.10	21.28
IS3	SSL-Align [ICCV, 2023]	✘	34.66	33.07	31.78	34.66	53.49	60.37	31.68	30.38	29.55	31.68	47.87	54.95
IS3	ACL [WACV, 2025]	✘	11.40	13.09	12.87	11.40	41.82	56.08	13.78	14.28	14.35	13.78	30.73	39.77
IS3	SSL-TIE [ACMMM, 2022]	✓	25.85	24.73	23.60	25.85	48.73	58.77	16.03	16.08	15.88	16.03	29.74	36.33
IS3	SSL-Align [ICCV, 2023]	✓	30.57	28.76	27.55	30.57	51.17	59.07	23.19	22.48	21.94	23.19	38.19	46.39
IS3	Ours $\to$ SSL-SaN	✓	33.78	32.04	30.97	33.78	55.91	65.06	21.64	21.91	21.62	21.64	36.00	42.69
IS3+	LVS [CVPR, 2021]	✘	4.55	4.76	5.37	4.55	16.73	24.98	5.62	5.92	6.19	5.62	6.64	8.47
IS3+	EZ-VSL [ECCV, 2022]	✘	4.46	3.36	3.26	4.46	12.89	19.89	3.23	3.61	3.75	3.23	7.10	10.14
IS3+	FNAC [CVPR, 2023]	✘	3.78	3.74	3.57	3.78	15.32	25.51	2.79	2.99	3.05	2.79	7.25	10.20
IS3+	SLAVC [NEURIPS, 2022]	✘	3.84	4.64	4.34	3.84	18.02	25.82	3.47	3.59	3.69	3.47	6.88	10.00
IS3+	SSL-Align [ICCV, 2023]	✘	43.52	42.30	40.33	43.52	63.86	70.29	30.29	30.65	30.52	30.29	32.18	34.85
IS3+	ACL [WACV, 2025]	✘	11.14	15.16	14.82	11.14	43.41	57.78	16.11	16.61	17.07	16.11	17.67	20.48
IS3+	SSL-TIE [ACMMM, 2022]	✓	33.92	30.57	28.58	33.92	52.85	62.21	17.11	17.41	17.29	17.11	18.33	20.37
IS3+	SSL-Align [ICCV, 2023]	✓	39.65	36.40	34.94	39.65	61.20	67.58	23.64	24.06	24.93	23.64	26.11	29.46
IS3+	Ours $\to$ SSL-SaN	✓	43.58	40.12	38.33	43.58	64.81	73.53	22.70	22.96	23.16	22.70	23.94	26.82
AVS-Bench S4	LVS [CVPR, 2021]	✘	20.81	23.89	25.15	20.81	56.49	70.81	33.38	30.81	30.26	33.38	54.59	66.22
AVS-Bench S4	EZ-VSL [ECCV, 2022]	✘	2.97	3.70	3.81	2.97	13.65	23.92	(A→I block incomplete in source: 5.14, 4.86, 5.14, 5.41, 9.19)
AVS-Bench S4	FNAC [CVPR, 2023]	✘	5.41	4.92	4.86	5.41	11.89	20.95	4.59	4.62	5.12	4.59	5.14	9.86
AVS-Bench S4	SLAVC [NEURIPS, 2022]	✘	5.95	5.22	5.01	5.95	15.68	24.32	6.76	6.62	6.47	6.76	7.16	12.70
AVS-Bench S4	SSL-Align [ICCV, 2023]	✘	85.27	84.08	82.38	85.27	91.76	93.24	87.16	86.00	84.70	87.16	94.86	96.76
AVS-Bench S4	ACL [WACV, 2025]	✘	75.14	73.95	71.68	75.14	91.08	94.05	(A→I block incomplete in source: 72.57, 72.22, 72.57, 80.14)
AVS-Bench S4	SSL-TIE [ACMMM, 2022]	✓	73.38	75.27	74.92	73.38	85.27	90.00	(A→I block incomplete in source: 66.76, 66.69, 66.76, 77.03)
AVS-Bench S4	SSL-Align [ICCV, 2023]	✓	80.00	78.54	77.43	80.00	91.76	94.59	81.89	80.65	79.28	81.89	91.08	94.73
AVS-Bench S4	Ours $\to$ SSL-SaN	✓	79.86	80.81	81.50	79.86	89.59	92.43	81.35	81.38	81.82	81.35	81.49	87.16

Table A.6: Results of the ablation of $\lambda_{SN}$, which multiplies the term $(\mathcal{L}_{S}+\mathcal{L}_{N})$.

	Positive audio input	Negative audio input			Global metric
$\lambda_{SN}$	cIoU_Uth	pIA_S	pIA_N	pIA_O	F_LOC
0	20.18	0.03	0.00	1.95	33.55
0.01	20.37	0.61	0.40	2.18	33.79
0.1	17.90	0.00	0.00	1.54	30.34
0.25	19.43	0.00	0.00	2.03	32.50
0.5	20.38	0.00	0.00	2.18	33.81
0.75	18.67	0.00	0.00	1.84	31.43
1	20.47	0.00	0.00	2.10	33.94
1.25	18.94	0.00	0.00	1.77	31.81
1.5	20.38	0.00	0.00	2.18	33.81
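
This excerpt does not define the global metric F_LOC. The rows of Table A.6 are, however, numerically consistent with a harmonic mean between cIoU_Uth and 100 minus the average of the three pIA values. The sketch below checks that reading against two rows of the table; the formula is an inference from the reported numbers, not a definition taken from the paper, and the function names are illustrative.

```python
def harmonic_mean(x, y):
    """Harmonic mean of two non-negative scores (0 if both are 0)."""
    return 2.0 * x * y / (x + y) if (x + y) > 0 else 0.0


def f_loc(ciou_uth, pia_values):
    """Assumed form: harmonic mean of cIoU_Uth and (100 - mean pIA)."""
    negative_score = 100.0 - sum(pia_values) / len(pia_values)
    return harmonic_mean(ciou_uth, negative_score)


# Rows copied from Table A.6: (cIoU_Uth, [pIA_S, pIA_N, pIA_O]).
f_lambda_0 = f_loc(20.18, [0.03, 0.00, 1.95])  # table reports 33.55
f_lambda_1 = f_loc(20.47, [0.00, 0.00, 2.10])  # table reports 33.94

print(round(f_lambda_0, 2), round(f_lambda_1, 2))
```

If this reading holds, F_LOC is dominated by the smaller of its two terms, so with pIA values close to zero it mostly tracks cIoU_Uth.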

Table A.7: Ablation table of the silence, noise, $\mathcal{L}_{S}$ and $\mathcal{L}_{N}$. Best results in bold, second best underlined.

Negative audio input

Test set

S

N

ℒ_{S}

ℒ_{N}

cIoU_Uth

↑

pIA

_{S}

↓

pIA

_{N}

↓

pIA

_{O}

↓

F

_{LOC}

↑

Sep

↑

VGG-SS

28.38

0.71

1.54

44.11

0.0896

✓

28.21

0.69

0.61

1.47

43.91

0.1019

✓

27.90

0.72

0.76

1.32

43.54

0.0982

✓

29.14

0.89

1.01

1.56

45.01

0.1047

✓

29.30

0.71

0.72

1.50

45.22

0.0975

✓

29.99

0.68

0.35

1.63

46.05

0.1059

✓

29.61

0.01

0.00

1.72

45.63

0.0971

IS3

18.83

0.46

0.47

2.23

31.63

-0.0324

✓

17.80

0.51

0.50

2.10

30.17

-0.0279

✓

17.96

0.41

0.45

1.74

30.40

-0.0303

✓

18.07

0.91

0.99

1.98

30.54

-0.0300

✓

19.06

0.34

0.35

2.30

31.96

-0.0301

✓

19.83

0.36

0.17

2.42

33.05

-0.0261

✓

18.87

0.00

1.97

31.72

-0.0327

IS3+

18.30

0.45

0.46

2.04

30.89

-0.0303

✓

17.52

0.50

0.49

1.99

29.77

-0.0301

✓

17.72

0.43

0.48

1.84

30.06

-0.0347

✓

17.34

0.91

0.99

1.92

29.50

-0.0360

✓

18.61

0.33

0.34

2.36

31.33

-0.0369

✓

19.67

0.36

0.18

2.23

32.82

-0.0209

✓

19.13

0.00

2.09

32.08

-0.0327

AVS-Bench S4

31.57

2.53

2.49

1.00

47.76

0.2384

✓

31.76

2.39

1.85

0.82

48.01

0.2350

✓

32.49

2.54

2.58

0.81

48.80

0.2458

✓

31.97

2.26

2.59

0.88

48.22

0.2558

✓

32.07

2.21

2.26

1.07

48.34

0.2375

✓

32.05

1.75

1.01

0.84

48.40

0.2393

✓

32.76

0.05

0.00

0.95

49.31

0.2479

Equations

S(a_{i}, v_{j}) = \frac{a_{i} \cdot v_{j}}{\|a_{i}\| \cdot \|v_{j}\|} \in [-1, 1]^{h \times w},

\mathcal{L}_{S} = \|S(a^{S}, v_{j})\|_{2}^{2},

\mathcal{L}_{N} = \|S(a_{j}^{N}, v_{j})\|_{2}^{2}.

Sep = Q_{1}^{+} - Q_{3}^{-}

\mathrm{Output} = \mathrm{ReLU}(\mathrm{BN}_{2}(\mathrm{Conv}_{2}(\mathrm{ReLU}(\mathrm{BN}_{1}(\mathrm{Conv}_{1}(x))))) + \mathrm{Residual}(x))

\mathcal{L}_{geo} = \|S(a_{j}, v_{j}^{T_{geo}}) - T_{geo}(S(a_{j}, v_{j}))\|_{2}^{2}

W_{f_{i}}^{SAM} = (1 - \alpha) W_{f_{i}} + \alpha W_{f_{i}}^{sim},
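
As a rough illustration of the similarity map and of the squared-norm form shared by the silence and noise terms, the sketch below computes a per-location cosine similarity between one audio embedding and an h × w grid of visual features, then sums the squared entries of the map. It is a minimal pure-Python sketch, not the authors' implementation: the function names, the toy feature values, the stand-in negative embedding, and the small epsilon in the denominator are all assumptions made for the example.

```python
import math


def cosine_similarity_map(audio_vec, visual_map, eps=1e-12):
    """Cosine similarity between one audio vector and every location of an h x w grid."""
    a_norm = math.sqrt(sum(x * x for x in audio_vec))
    sim_map = []
    for row in visual_map:
        sim_row = []
        for v in row:
            v_norm = math.sqrt(sum(x * x for x in v))
            dot = sum(x * y for x, y in zip(audio_vec, v))
            # eps (an assumption) guards against division by zero.
            sim_row.append(dot / (a_norm * v_norm + eps))
        sim_map.append(sim_row)
    return sim_map


def squared_l2(sim_map):
    """Squared L2 norm of a similarity map: the sum of its squared entries."""
    return sum(s * s for row in sim_map for s in row)


# Toy visual feature grid with h=1, w=2 and 2 channels (hypothetical values).
visual = [[[1.0, 0.0], [0.0, 1.0]]]
audio_pos = [1.0, 0.0]   # aligned with the first location
audio_neg = [0.3, -0.4]  # hypothetical stand-in for a silence or noise embedding

sim_pos = cosine_similarity_map(audio_pos, visual)
loss_neg = squared_l2(cosine_similarity_map(audio_neg, visual))
```

Minimizing a term of this form drives every entry of the similarity map toward zero for the negative audio, which matches the stated goal of having the network ignore non-informative sounds.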

Full text

Xavier Juanola, Giovana Morais, Magdalena Fuentes, Gloria Haro

Intelligent Multimodal Vision Analysis, Universitat Pompeu Fabra, Barcelona, Spain

MARL-IDM, New York University, New York, USA

Abstract

Visual sound source localization is a fundamental perception task that aims to detect the location of sounding sources in a video given its audio. Despite recent progress, we identify two shortcomings in current methods: 1) most approaches perform poorly in cases with low audio-visual semantic correspondence such as silence, noise, and offscreen sounds, i.e. in the presence of negative audio; and 2) most prior evaluations are limited to positive cases, where both datasets and metrics convey scenarios with a single visible sound source in the scene. To address this, we introduce three key contributions. First, we propose a new training strategy that incorporates silence and noise, which improves performance in positive cases, while being more robust against negative sounds. Our resulting self-supervised model, SSL-SaN, achieves state-of-the-art performance compared to other self-supervised models, both in sound localization and cross-modal retrieval. Second, we propose a new metric that quantifies the trade-off between alignment and separability of auditory and visual features across positive and negative audio-visual pairs. Third, we present IS3+, an extended and improved version of the IS3 synthetic dataset with negative audio. Our data, metrics and code are available on the Project page.

1 Introduction

Humans have an incredible capacity for multimodal integration, particularly in processing audio-visual information from the environment. While sound localization is primarily handled by the auditory system, visual data aids in disambiguation and accuracy [Risoud et al.(2018)]. Inspired by human perceptual abilities, multimodal approaches have demonstrated that combining auditory and visual information not only improves traditional perception tasks, such as object detection and activity recognition [Eliav and Gannot(2024), Cheng et al.(2024), Salas-Cáceres et al.(2024)], but also enables entirely new capabilities. For example, audio-visual scene synthesis allows systems to reconstruct missing modalities [Jamaludin et al.(2019), Montesinos et al.(2023), Sung-Bin et al.(2023)], predict soundscapes for silent videos [Du et al.(2023)], or even anticipate visual scenes based on sound [Liang et al.(2023), Sung-Bin et al.(2024), Shi(2021)]. Other examples that benefited from multimodality are audio-visual source separation [Afouras et al.(2020), Gao and Grauman(2021), Montesinos et al.(2022)], comprehensive scene understanding [Lei et al.(2023), Fichna et al.(2021)] and visual sound source localization.

Visual sound source localization (VSSL) aims to localize sounding sources within a video. Pioneering approaches used networks pretrained on ImageNet [Deng et al.(2009)] as backbones and finetuned them for audio-visual correspondence [Chen et al.(2021), Mo and Morgado(2022a), Mo and Morgado(2022b)], resulting in weakly-supervised models [Mo and Morgado(2022b)]. Recently, new methods, such as [Liu et al.(2022a), Senocak et al.(2024)], shifted from weakly-supervised learning to self-supervised learning, training models from scratch and using different data augmentation techniques, e.g. geometrical transformations, to avoid overfitting and improve robustness. However, these augmentations are mostly applied to positive cases (i.e. when there is a sounding source that is visible in the image), and thus these models still struggle when presented with negative cases [Chen et al.(2021), Juanola et al.(2025)].

Moreover, current datasets used to benchmark VSSL methods, such as VGGSound [Chen et al.(2020a)], VGG Sound Sources (VGG-SS) [Chen et al.(2021)], and Flickr [Arandjelovic and Zisserman(2017)], feature mostly sounding objects that are in the foreground and centered in the image frame [Wu et al.(2022)]. As a result, there is no strong incentive for models to use the audio content to localize the sound source in the image. Instead, models tend to identify the object regardless of sound [Oya et al.(2020)]. Recent efforts have been made to mitigate this by introducing synthetic evaluation sets with multiple sources [Senocak et al.(2024)], but these still suffer from inaccurate image-audio pairings, and model evaluation is mostly done on positive cases.

To address these shortcomings, we introduce three main contributions: (1) A novel training strategy that incorporates negative audio samples (silence and noise) during training, with two additional loss terms that ensure the network learns to ignore these non-informative sounds; (2) The IS3+ benchmark, an improved version of IS3 [Senocak et al.(2024)], in which we substitute incorrect image-audio pairs that were present in the original test set with correct audio from Adobe Sound Effects [Adobe(2023)] and a clean subset of IS3 audio samples. In addition, we introduce the evaluation of negative audio cases, namely in the presence of noise, silence, and offscreen sounds; and (3) A new metric that quantifies the discriminative power (or separability) of auditory and visual features, which correlates with performance in both sound localization and cross-modal retrieval tasks. Through a comprehensive evaluation of sound localization models in both positive and negative audio samples, we show that our model SSL-SaN (Sound Source Localization with Silence and Noise), trained with our new strategy, yields state-of-the-art results. Code and data are available at https://xavijuanola.github.io/SSL-SaN/.

2 Related Work

Pioneering works [Fisher III et al.(2000), Hershey and Movellan(1999), Kidron et al.(2005)] learn to capture correspondences between audio and visual features using classical machine learning methods, such as canonical correlation analysis [Hardoon et al.(2004)]. More recent methods adopted deep neural networks for representation learning by leveraging the synchronization between audio and video as a signal for self-supervised learning [Owens and Efros(2018), Korbar et al.(2018)]. Currently, the most widely used approach for sound source localization is cross-modal attention [Senocak et al.(2018), Senocak et al.(2019), Tian et al.(2018)] with contrastive loss [Chopra et al.(2005), Oord et al.(2018), Senocak et al.(2018), Owens and Efros(2018), Chen et al.(2021)]. These approaches seek to localize objects by aligning audio and visual representation spaces. Some methods use additional semantic labels to pretrain audio and vision with classification loss [Senocak et al.(2022b)] or refine audio-visual feature alignment [Qian et al.(2020)]. Knowledge distillation from pretrained object detection and sound classification models was used in [Yaghoubi et al.(2023)]. LVS [Chen et al.(2021)] used a contrastive loss with hard negative mining to learn the audio-visual co-occurrence map discriminatively. EZ-VSL [Mo and Morgado(2022a)] introduced a multiple instance contrastive learning framework that focuses only on the most aligned regions when matching the audio to the video by combining the attention-based localization output with a pretrained visual feature activation map. ACL [Park et al.(2024)] leverages the CLIPSeg model [Lüddecke and Ecker(2022)], which is an image segmentation model trained in a supervised way.

Most works in the literature were trained and evaluated with datasets that mostly feature a single sounding object in the scene at a given time [Arandjelovic and Zisserman(2017), Arandjelovic and Zisserman(2018), Chen et al.(2021)Chen, Xie, Afouras, Nagrani, Vedaldi, and Zisserman, Liu et al.(2022a)Liu, Ju, Xie, and Zhang, Mo and Morgado(2022a), Mo and Morgado(2022b), Owens and Efros(2018), Ramaswamy and Das(2020), Senocak et al.(2018)Senocak, Oh, Kim, Yang, and Kweon, Senocak et al.(2024)Senocak, Ryu, Kim, Oh, Pfister, and Chung, Sun et al.(2023)Sun, Zhang, Wang, Liu, Zhong, Feng, Guo, Zhang, and Barnes, Song et al.(2024)Song, Zhang, Wang, Fan, and Zhang, Wu et al.(2022)Wu, Fuentes, Seetharaman, and Bello]. This setting is rare in real-life scenarios, which often contain multiple objects sounding at the same time (i.e. a mixture of sounds), silent objects, sounds produced by objects that are not visually present in the scene or are occluded by other objects (offscreen sounds), noise, etc. There has been increasing interest in working with mixtures of sounds [Hu et al.(2022)Hu, Chen, and Owens, Kim et al.(2024)Kim, Um, Lee, and Kim, Mahmud et al.(2024)Mahmud, Tian, and Marculescu, Mo and Tian(2023), Qian et al.(2020)Qian, Hu, Dinkel, Wu, Xu, and Lin], but very few works address silent objects [Hu et al.(2020)Hu, Qian, Jiang, Tan, Wen, Ding, Lin, and Dou, Liu et al.(2022b)Liu, Qian, Zhou, Hu, Lin, Liu, Zhou, and Zhou, Mo and Morgado(2022b)] or offscreen sounds [Liu et al.(2022b)Liu, Qian, Zhou, Hu, Lin, Liu, Zhou, and Zhou]. DSOL [Hu et al.(2020)Hu, Qian, Jiang, Tan, Wen, Ding, Lin, and Dou] and IEr [Liu et al.(2022b)Liu, Qian, Zhou, Hu, Lin, Liu, Zhou, and Zhou] proposed methods to suppress the localization of silent objects by creating an Audio-Instance-Identifier module, which identifies the sounds present in the audio and filters out possible offscreen sounds and silent objects. 
These methods, however, rely on knowing the number of sound sources, which is not available in most large-scale datasets or “in-the-wild” data. Given the complexity of the problem and the limitations of the datasets available for training, many models in this domain are prone to overfitting. As a result, early stopping is commonly adopted as a practical regularization technique to prevent overfitting and improve generalization performance on unseen data. SLAVC [Mo and Morgado(2022b)], on the other hand, proposes a framework that solves overfitting and the need for early stopping by adding extreme visual dropout and momentum encoders. Another key strategy to overcome data limitations is data augmentation [Chen et al.(2020b)Chen, Kornblith, Norouzi, and Hinton, Chen and He(2021), Grill et al.(2020)Grill, Strub, Altché, Tallec, Richemond, Buchatskaya, Doersch, Avila Pires, Guo, Gheshlaghi Azar, et al., He et al.(2020)He, Fan, Wu, Xie, and Girshick]. Following this, SSL-TIE [Liu et al.(2022a)Liu, Ju, Xie, and Zhang] presents a neural network composed of an image and an audio encoder, trained with contrastive learning and geometrical consistency, ensuring that the audio-visual similarity maps undergo the same geometrical transformation as the input images. SSL-Align [Senocak et al.(2024)Senocak, Ryu, Kim, Oh, Pfister, and Chung] proposes a novel method that utilizes semantic alignment with multi-views and semantically similar samples. The authors present both a fully self-supervised variant of their model and a weakly-supervised one, which uses audio and image encoders pretrained in a supervised way.

Except for a few works [Mo and Morgado(2022b), Juanola et al.(2025)Juanola, Haro, and Fuentes, Hamilton et al.(2024)Hamilton, Zisserman, Hershey, and Freeman], most evaluations are done predominantly on positive cases (i.e. visible and audible sources). This not only results in incomplete assessments of model capabilities, but also reinforces biases in models, as they are implicitly optimized and judged under conditions that favor co-occurring signals. To address this, we combine image and audio augmentations with additional negative audio samples (silence and noise) and additional loss terms during training, resulting in a more robust and accurate self-supervised model. This fully self-supervised model outperforms previous self-supervised VSSL models, as we demonstrate in a comprehensive evaluation including both positive and negative cases. In a concurrent work [Li et al.(2025)Li, Zhao, Huang, Guo, and Tian], in a different context (supervised audio-visual segmentation), the inclusion of silence and noise during training has also been shown to be beneficial.

3 Method

Most VSSL models (e.g. [Arandjelovic and Zisserman(2018), Wu et al.(2022)Wu, Fuentes, Seetharaman, and Bello, Chen et al.(2021)Chen, Xie, Afouras, Nagrani, Vedaldi, and Zisserman, Mo and Morgado(2022a), Sun et al.(2023)Sun, Zhang, Wang, Liu, Zhong, Feng, Guo, Zhang, and Barnes, Liu et al.(2022a)Liu, Ju, Xie, and Zhang, Senocak et al.(2024)Senocak, Ryu, Kim, Oh, Pfister, and Chung, Senocak et al.(2022a)Senocak, Ryu, Kim, and Kweon, Park et al.(2023)Park, Senocak, and Chung, Senocak et al.(2022b)Senocak, Ryu, Kim, and Kweon, Hu et al.(2020)Hu, Qian, Jiang, Tan, Wen, Ding, Lin, and Dou]) use a two-stream network with an audio encoder that extracts auditory features, $\mathbf{a}_{i}\in\mathbb{R}^{c}$ , from the $i$ -th audio segment and a visual encoder that extracts visual features, $\mathbf{v}_{j}\in\mathbb{R}^{c\times h\times w}$ , from the $j$ -th image (usually the central frame of the $j$ -th audio segment). Then, an audio-visual similarity map is computed by cosine similarity:

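$$S(\mathbf{v}_{j},\mathbf{a}_{i})[x,y]=\frac{\langle\mathbf{a}_{i},\mathbf{v}_{j}[:,x,y]\rangle}{\|\mathbf{a}_{i}\|\,\|\mathbf{v}_{j}[:,x,y]\|},\qquad(1)$$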

and a global audio-visual correspondence value is computed from $S$ by a certain pooling operation [Arandjelovic and Zisserman(2018), Chen et al.(2021)Chen, Xie, Afouras, Nagrani, Vedaldi, and Zisserman, Park et al.(2023)Park, Senocak, and Chung, Mo and Morgado(2022a), Liu et al.(2022a)Liu, Ju, Xie, and Zhang, Hu et al.(2020)Hu, Qian, Jiang, Tan, Wen, Ding, Lin, and Dou] and, in some cases, semantic projection heads [Senocak et al.(2024)Senocak, Ryu, Kim, Oh, Pfister, and Chung]. The network is trained in a self-supervised way by contrastive learning, forming positive and negative audio-visual pairs simply by taking $i=j$ and $i\neq j$ , respectively. Typically (e.g. [Chen et al.(2021)Chen, Xie, Afouras, Nagrani, Vedaldi, and Zisserman, Mo and Morgado(2022a), Sun et al.(2023)Sun, Zhang, Wang, Liu, Zhong, Feng, Guo, Zhang, and Barnes, Mo and Morgado(2022b), Liu et al.(2022a)Liu, Ju, Xie, and Zhang, Hu et al.(2020)Hu, Qian, Jiang, Tan, Wen, Ding, Lin, and Dou, Senocak et al.(2022b)Senocak, Ryu, Kim, and Kweon, Park et al.(2023)Park, Senocak, and Chung, Senocak et al.(2022a)Senocak, Ryu, Kim, and Kweon]), the audio and visual encoders are ResNet18 networks. The audio encoder is trained from scratch and the visual encoder is initialized with ImageNet pretrained weights, resulting in weakly-supervised models [Mo and Morgado(2022b)]. Recent works [Liu et al.(2022a)Liu, Ju, Xie, and Zhang, Senocak et al.(2024)Senocak, Ryu, Kim, Oh, Pfister, and Chung] have shown that state-of-the-art results in VSSL can be achieved by training both encoders from scratch (i.e. in a fully self-supervised way). Both of them use (different) data augmentation techniques. Additionally, [Liu et al.(2022a)Liu, Ju, Xie, and Zhang] uses an equivariance loss term and [Senocak et al.(2024)Senocak, Ryu, Kim, Oh, Pfister, and Chung] uses a semantic projection head on top of the encoders. 
As our baseline model, we use SSL-TIE [Liu et al.(2022a)Liu, Ju, Xie, and Zhang], since it is a state-of-the-art self-supervised model [Juanola et al.(2025)Juanola, Haro, and Fuentes] whose code is open-source. However, our proposed training strategy can be applied to any localization model trained in a contrastive way.
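The similarity map described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `similarity_map`, the feature sizes (c = 512, h = w = 7), and the choice of max pooling as the global correspondence value are assumptions made only for illustration.

```python
import numpy as np

def similarity_map(a, v, eps=1e-8):
    """Cosine similarity between an audio vector a (c,) and visual features v (c, h, w)."""
    a_unit = a / (np.linalg.norm(a) + eps)
    v_unit = v / (np.linalg.norm(v, axis=0, keepdims=True) + eps)
    # Dot product over the channel axis at every spatial position.
    return np.einsum("c,chw->hw", a_unit, v_unit)

rng = np.random.default_rng(0)
c, h, w = 512, 7, 7                       # illustrative sizes, not taken from the paper
v = rng.standard_normal((c, h, w))        # stand-in for visual encoder output
a = v[:, 3, 4].copy()                     # stand-in audio feature aligned with one location
S = similarity_map(a, v)
score = S.max()                           # max pooling: one possible global correspondence value
```

In this toy example the audio feature is copied from one spatial position, so the map peaks there; in a trained model the two features would come from separate encoders.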

3.1 Learning from Silence and Noise

Our aim is to design a sound localization model that is robust to negative sounds such as silence and noise. To that end, we introduce two modifications during training. First, we pair each image in the batch with these two types of negative audio samples, thus creating two new negative audio-visual pairs. We define silence as an empty audio and noise as an audio with random values following a Gaussian distribution with zero mean and standard deviation $\sigma$ =1. Second, we add two new loss terms forcing an empty similarity map for these two negative audio-visual pairs. More concretely, for every $j$ -th image in the batch we define:

$$\mathcal{L}_{S}=\left\|S(\mathbf{v}_{j},\mathbf{a}^{S})\right\|_{2}^{2},$$

the square $L_{2}$ norm of the audio-visual similarity map between the visual features from the $j$ -th image, $\mathbf{v}_{j}$ , and the silence feature, $\mathbf{a}^{S}$ , and analogously, for the same visual features and the audio features extracted from a realization of the noise, $\mathbf{a}^{N}_{j}$ , that is:

$$\mathcal{L}_{N}=\left\|S(\mathbf{v}_{j},\mathbf{a}^{N}_{j})\right\|_{2}^{2}.$$

Intuitively, these terms penalize the model for localizing any sound in the presence of silence and noise, enforcing a predictable behavior in the presence of negative audio. Moreover, as shown in the experimental section, learning from silence and noise also improves the results on positive audio.
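A minimal sketch of the two negative inputs and the corresponding loss terms is given below, under stated assumptions: silence is modeled as an all-zero waveform, the audio features are stand-in vectors rather than outputs of a real audio encoder, and the names `similarity_map` and `silence_noise_losses` are hypothetical.

```python
import numpy as np

def similarity_map(a, v, eps=1e-8):
    """Cosine similarity map between audio vector a (c,) and visual features v (c, h, w)."""
    a_unit = a / (np.linalg.norm(a) + eps)
    v_unit = v / (np.linalg.norm(v, axis=0, keepdims=True) + eps)
    return np.einsum("c,chw->hw", a_unit, v_unit)

def silence_noise_losses(v_j, a_silence, a_noise_j):
    """Squared L2 norm of the similarity maps for the silence and noise pairs."""
    loss_s = float(np.sum(similarity_map(a_silence, v_j) ** 2))
    loss_n = float(np.sum(similarity_map(a_noise_j, v_j) ** 2))
    return loss_s, loss_n

rng = np.random.default_rng(0)
num_samples = 3 * 16000                        # 3 seconds at 16 kHz
silence = np.zeros(num_samples)                # assumption: "empty audio" as zeros
noise = rng.normal(0.0, 1.0, num_samples)      # Gaussian noise, zero mean, sigma = 1

# Stand-in features: the last channel of v_j is zero everywhere.
v_j = np.zeros((4, 2, 2))
v_j[:3] = rng.standard_normal((3, 2, 2))
a_orthogonal = np.array([0.0, 0.0, 0.0, 1.0])  # orthogonal to every visual vector
a_aligned = v_j[:, 0, 0].copy()                # perfectly aligned with one location
loss_s, loss_n = silence_noise_losses(v_j, a_orthogonal, a_aligned)
```

Both terms vanish only when the similarity map is zero everywhere, which is the "empty map" behavior the training strategy asks for on silence and noise.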

3.2 New evaluation set: IS3+

The current VSSL benchmarks, such as VGG-SS [Chen et al.(2021)Chen, Xie, Afouras, Nagrani, Vedaldi, and Zisserman], contain videos gathered from YouTube, where typically a single object dominates the scene, both in the auditory and visual modalities. Consequently, the task of sound localization can be solved by using objectness cues in the image, without the need for the audio characteristics [Oya et al.(2020)Oya, Iwase, Natsume, Itazuri, Yamaguchi, and Morishima]. This fact motivated the proposal, in [Senocak et al.(2024)Senocak, Ryu, Kim, Oh, Pfister, and Chung], of a new VSSL benchmark: Interactive-Synthetic Sound Source (IS3). IS3 spans 118 object categories and includes 3,240 images that have been synthetically generated by diffusion models [Rombach et al.(2022)Rombach, Blattmann, Lorenz, Esser, and Ommer], simulating scenes with multiple objects in diverse sizes. Each image in IS3 is paired with two audio samples from VGG-SS, corresponding to the two most visible objects present in the image, resulting in 6,480 audio-visual instances. Although IS3 improves upon the existing benchmarks in terms of visual data, it still inherits the weaknesses of VGG-SS in the audio modality. VGG-SS videos are in-the-wild videos from YouTube and in a significant number of videos the audio signal does not contain the sound of the object in the scene. Typical examples of wrong audio samples are music instead of the original audio, an offscreen sound that masks the sound of interest, or a mixture of different types of sounds (see Supplementary material for more details). Thus, we propose an improved version of IS3, named IS3+, where each IS3 image is paired with two clean audio samples corresponding to the two main objects in the image.

To do so, we use a combination of audio samples from Adobe Sound Effects (Adobe SFX) [Adobe(2023)] and clean samples from IS3. We manually selected samples from Adobe SFX through careful review and semantic matching of the audio content with IS3 categories. When we did not have enough Adobe SFX audio samples for a given class, we selected clean audio samples from IS3. The criteria for audio selection, in both Adobe SFX and IS3, were (1) to have audio that matches the necessary category, and (2) to avoid as much background noise as possible. We also simplified the dataset categories in which the image did not match the class. For example, the “playing harpsichord” category only had images of harps, therefore we replaced it with “harp”. Similarly, we simplified categories such as “cat growling” and “cat caterwauling” to “cat”, as the images did not reflect these actions (the full mapping is in the Supplementary material). Finally, we create IS3+ by taking the simplified IS3 annotations and replacing the audio with a randomly sampled item within the correct category from the clean audio pool. We normalize and downsample audio samples to 16kHz.
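The final assembly step can be sketched as follows. This is an illustrative outline only: the three category mappings are the examples given in the text, while the file names, the data layout, and the function `build_is3_plus` are hypothetical.

```python
import random

# Category simplifications mentioned in the text (the full mapping is in the supplement).
CATEGORY_MAP = {
    "playing harpsichord": "harp",
    "cat growling": "cat",
    "cat caterwauling": "cat",
}

def build_is3_plus(annotations, clean_pool, seed=0):
    """Pair each (image_id, category) with a random clean audio of the simplified category."""
    rng = random.Random(seed)
    pairs = []
    for image_id, category in annotations:
        simplified = CATEGORY_MAP.get(category, category)
        audio = rng.choice(clean_pool[simplified])
        pairs.append((image_id, simplified, audio))
    return pairs

# Hypothetical annotations and clean audio pool.
annotations = [("img_0001", "cat growling"), ("img_0002", "playing harpsichord")]
clean_pool = {"cat": ["cat_a.wav", "cat_b.wav"], "harp": ["harp_a.wav"]}
pairs = build_is3_plus(annotations, clean_pool)
```

Fixing the random seed keeps the sampled audio assignment reproducible across runs.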

4 Experiments and Results

4.1 Experimental Setup

Datasets. Our method is trained using the VGGSound-144K [Senocak et al.(2018)Senocak, Oh, Kim, Yang, and Kweon], a fixed subset of 144K videos randomly selected from VGGSound [Chen et al.(2020a)Chen, Xie, Vedaldi, and Zisserman]. We test models’ performance on VGG Sound Sources (VGG-SS) [Chen et al.(2021)Chen, Xie, Afouras, Nagrani, Vedaldi, and Zisserman] (a subset of VGG-Sound with annotated bounding boxes for sound sources), IS3 [Senocak et al.(2024)Senocak, Ryu, Kim, Oh, Pfister, and Chung], IS3+ (presented in Section 3.2) and AVS-Bench S4 test set [Zhou et al.(2024)Zhou, Shen, Wang, Zhang, Sun, Zhang, Birchfield, Guo, Kong, Wang, and Zhong]. We evaluate the VGG-SS test set using the bounding boxes from the annotations, while we use the segmentation masks present in IS3, IS3+ and AVS-Bench S4.

Evaluation Metrics. The current standard metric to evaluate sound source localization performance is the consensus Intersection over Union (cIoU) proposed by [Senocak et al.(2018)Senocak, Oh, Kim, Yang, and Kweon]. To evaluate the cIoU, a threshold is necessary to binarize the audio-visual similarity map (1) and convert it to a localization mask. Recently, [Senocak et al.(2024)Senocak, Ryu, Kim, Oh, Pfister, and Chung] proposed the Adaptive cIoU, which sets the top $B$ pixels to $1$ , where $B$ is the area of the ground truth bounding box or mask. This metric assumes that the object is always visible and that the object’s size is known, assumptions that do not hold in most cases. On the other hand, [Juanola et al.(2025)Juanola, Haro, and Fuentes] proposed a Universal threshold that does not assume a visible object in the scene, and is designed to be discriminative of positive vs. negative audio cases. This threshold is set to be the $3$ rd quartile of the maximum audio-visual similarity in negative cases, so that it filters false positives without any assumption on the object size. We report cIoU with both the Universal threshold (Uth) and the adaptive one (Adap.). We also report the Area Under the Curve (AUC), which measures the integral of the success ratio (proportion of samples with cIoU $>\tau$ ) as a function of the threshold $\tau$ varying from 0 to 1.
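The two thresholding schemes and the AUC can be sketched in NumPy as below. This is a simplified illustration, not the reference evaluation code: plain IoU on binary masks stands in for cIoU, the similarity map is assumed to be already resized to the ground-truth resolution, and all function names are hypothetical.

```python
import numpy as np

def universal_threshold(neg_max_sims):
    """3rd quartile of the maximum similarity over negative (silence/noise/offscreen) cases."""
    return float(np.quantile(neg_max_sims, 0.75))

def adaptive_mask(S, gt_mask):
    """Set the top-B pixels of S to True, where B is the ground-truth area."""
    B = int(gt_mask.sum())
    flat = np.zeros(S.size, dtype=bool)
    if B > 0:
        flat[np.argsort(S, axis=None)[-B:]] = True
    return flat.reshape(S.shape)

def iou(pred, gt):
    """Plain intersection over union of two boolean masks (stand-in for cIoU)."""
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 0.0

def success_auc(scores, num_thresholds=21):
    """Area under the success-ratio curve (fraction of scores > tau) for tau in [0, 1]."""
    scores = np.asarray(scores, dtype=float)
    taus = np.linspace(0.0, 1.0, num_thresholds)
    success = np.array([(scores > t).mean() for t in taus])
    return float(np.sum((success[1:] + success[:-1]) / 2 * np.diff(taus)))

S = np.arange(16, dtype=float).reshape(4, 4)
gt = np.zeros((4, 4), dtype=bool)
gt[3, :] = True                                  # ground truth covers the bottom row
pred_adaptive = adaptive_mask(S, gt)
pred_universal = S > universal_threshold([11.0, 11.0, 11.0])
```

The adaptive mask always activates exactly B pixels, even for negative audio, whereas the universal threshold can return an empty mask, which is the property exploited when evaluating negative cases.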

Besides these metrics, we report the percentage of Image Area (pIA) [Juanola et al.(2025)Juanola, Haro, and Fuentes], which is designed to assess models’ performance with negative audio by measuring the proportion of the image area that has been activated in the model’s localization mask in the presence of noise, silence and offscreen sounds separately. We also report AUCN which is analogous to AUC for the case of negative audio (in this case, the success ratio is defined as proportion of samples with pIA $\leq\tau$ ). Finally, we evaluate overall model performance across both positive and negative cases with FLOC and FAUC [Juanola et al.(2025)Juanola, Haro, and Fuentes]. They both compute the harmonic mean between a positive and a negative metric (FLOC uses the cIoU and pIA while FAUC uses AUC and AUCN).
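The negative-audio metrics can be sketched in the same spirit. The code below is a hedged illustration: pIA is expressed as a fraction in [0, 1] rather than a percentage, the function names are hypothetical, and only FAUC is shown because both of its inputs are "higher is better" quantities; how FLOC combines cIoU with pIA (where lower is better) is not specified here and is therefore left out.

```python
import numpy as np

def pia(S, threshold):
    """Fraction of the image area activated in the localization mask."""
    return float((S > threshold).mean())

def negative_auc(pia_values, num_thresholds=21):
    """AUC_N: area under the curve of the fraction of samples with pIA <= tau."""
    pia_values = np.asarray(pia_values, dtype=float)
    taus = np.linspace(0.0, 1.0, num_thresholds)
    success = np.array([(pia_values <= t).mean() for t in taus])
    return float(np.sum((success[1:] + success[:-1]) / 2 * np.diff(taus)))

def harmonic_mean(positive_metric, negative_metric):
    """Harmonic mean of a positive-case and a negative-case metric."""
    total = positive_metric + negative_metric
    return 0.0 if total == 0 else 2 * positive_metric * negative_metric / total

empty_map = np.zeros((4, 4))
pia_silence = pia(empty_map, threshold=0.5)      # nothing activated
auc_n = negative_auc([pia_silence, pia_silence])
f_auc = harmonic_mean(0.5, auc_n)                # 0.5 is a made-up positive AUC
```

The harmonic mean is dominated by the weaker of the two values, so a model cannot obtain a high global score by excelling only on positive or only on negative audio.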

Implementation. Following [Liu et al.(2022a)Liu, Ju, Xie, and Zhang], we resize input images to a resolution of $224\times 224$ , and we represent audio samples by log-Mel Spectrograms, extracted from 3 seconds of audio at a sample rate of 16 kHz. The log-Mel Spectrograms are computed using 512 as the size of the FFT windows, 239 for the hop length between STFT windows, and 257 mel filterbanks. We use ResNet18 [He et al.(2016)He, Zhang, Ren, and Sun] for the audio and image encoders. We train all models from scratch. The models are trained with a batch size of 32, the Adam optimizer [Kingma(2014)] with a learning rate of $10^{-4}$ and ReduceOnPlateau as the learning rate scheduler.
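As a quick sanity check on these settings, the arithmetic below derives the expected spectrogram size. The number of time frames depends on the padding convention of the STFT implementation, which is not stated here, so both the centered and the non-centered counts are shown as assumptions.

```python
sample_rate = 16_000      # Hz
duration_s = 3            # seconds of audio per sample
n_fft = 512               # FFT window size
hop_length = 239          # hop between STFT windows

num_samples = sample_rate * duration_s
freq_bins = n_fft // 2 + 1                                   # one-sided STFT bins
frames_centered = 1 + num_samples // hop_length              # frames with centered padding
frames_uncentered = 1 + (num_samples - n_fft) // hop_length  # frames without padding

print(num_samples, freq_bins, frames_centered, frames_uncentered)
```

Note that 257, the number of mel filterbanks quoted above, coincides with the number of one-sided STFT bins for a 512-point FFT.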

4.2 Comparison to prior work

We compare our model to different prior works: RCGrad [Wu et al.(2022)Wu, Fuentes, Seetharaman, and Bello], LVS [Chen et al.(2021)Chen, Xie, Afouras, Nagrani, Vedaldi, and Zisserman], EZ-VSL [Mo and Morgado(2022a)], FNAC [Sun et al.(2023)Sun, Zhang, Wang, Liu, Zhong, Feng, Guo, Zhang, and Barnes], SLAVC [Mo and Morgado(2022b)], ACL [Park et al.(2024)Park, Senocak, and Chung], SSL-Align [Senocak et al.(2024)Senocak, Ryu, Kim, Oh, Pfister, and Chung], and SSL-TIE [Liu et al.(2022a)Liu, Ju, Xie, and Zhang]. Besides ours, SSL-TIE and a version of SSL-Align are the only fully self-supervised models. The rest are weakly-supervised, since they leverage image encoders/decoders pretrained in a supervised way on annotated datasets such as ImageNet [Deng et al.(2009)Deng, Dong, Socher, Li, Li, and Fei-Fei] or PhraseCut [Wu et al.(2020)Wu, Lin, Cohen, Bui, and Maji] (in the case of ACL). We exclude object guided localization (OGL) postprocessing [Mo and Morgado(2022a)], as it is not meaningful for negative sounds.

Modality Alignment and Separability. A good VSSL model should align auditory and visual features corresponding to a positive audio-visual pair and separate them in case of a negative pair, giving rise to a high value of audio-visual similarity in the first case and a low similarity in the second. We adopt the visualization of Figure 1 from [Juanola et al.(2025)Juanola, Haro, and Fuentes], showing the distribution of the maximum values of the similarity maps of the positive and the different negative cases. As can be seen, SSL-SaN is the model that best learns to separate the positives from the negatives in general, considering the three datasets (the remaining models are shown in the Supplementary material). We propose a new metric, denoted as separability:

$$\mathcal{S}ep=Q_{1}^{+}-Q_{3}^{-},$$

where $Q_{1}^{+}$ denotes the 1st quartile of the maximum audio-visual similarities of the positive pairs and $Q_{3}^{-}$ the 3rd quartile of the maximum audio-visual similarities of the negative pairs. $\mathcal{S}ep$ is a real number and the larger the value, the better the VSSL model is at discriminating positive audio-visual pairs from negative ones. A negative value indicates that the interquartile ranges of audio-visual similarity values for positive pairs ( $Q_{3}^{+}-Q_{1}^{+}$ ) and negative pairs ( $Q_{3}^{-}-Q_{1}^{-}$ ) overlap. This last scenario is not desirable.
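A minimal sketch of this metric is shown below, assuming it is the difference between the two quartiles (which matches the description: larger is better, and a negative value means the interquartile ranges overlap). The function name and the example values are hypothetical, and NumPy's default linear interpolation is used for the quartiles.

```python
import numpy as np

def separability(pos_max_sims, neg_max_sims):
    """Q1 of positive-pair maxima minus Q3 of negative-pair maxima."""
    q1_pos = np.quantile(pos_max_sims, 0.25)
    q3_neg = np.quantile(neg_max_sims, 0.75)
    return float(q1_pos - q3_neg)

# Made-up maximum similarity values, for illustration only.
well_separated = separability([0.6, 0.7, 0.8, 0.9, 1.0], [0.1, 0.2, 0.3, 0.4, 0.5])
overlapping = separability([0.2, 0.3, 0.4, 0.5, 0.6], [0.1, 0.3, 0.5, 0.7, 0.9])
```

In the first example the positive and negative distributions are disjoint and the value is positive; in the second the interquartile ranges overlap and the value is negative.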

Table 1 reports the separability metric in VGG-SS, IS3, IS3+ and AVS-Bench S4 test sets. To better understand how models align both modalities in a positive pair, we also report the metrics of magnitude and alignment [Zhang et al.(2023)Zhang, HaoChen, Huang, Wang, Zou, and Yeung, Goel et al.(2022)Goel, Bansal, Bhatia, Rossi, Vinay, and Grover]. Alignment measures how close the audio and image embeddings are for positive pairs using cosine similarity, while magnitude quantifies the $L_{2}$ distance between both embeddings. As shown in Table 1, the LVS model achieves the best scores in both magnitude and alignment in VGG-SS, IS3 and IS3+, and second best in S4, which at first glance could suggest an excellent reduction of the modality gap. However, this result seems to indicate that the audio and image embeddings are collapsing into the same region of the embedding space, rather than aligning them in a semantically meaningful way. This collapse reduces their discriminative power and is reflected in LVS’s poor performance on the separability metric. A reversed pattern is observed with FNAC, which achieves a relatively good separability score, but its poor alignment and high magnitude suggest that the model struggles to structure the modality representations meaningfully. In contrast, SSL-SaN model achieves a better balance, obtaining the best results in magnitude and alignment among the self-supervised models in VGG-SS, IS3 and IS3+, and best magnitude and second best alignment in S4. Most importantly, it achieves the best score, among the self-supervised models, in the separability metric in all test sets, which highlights a strong distinction between positive and negative samples. These results demonstrate the strength of our fully self-supervised training approach, which encourages the learning of semantically rich and well-structured audio-visual representations.

Localization results. As reported in Table 2, our model outperforms all prior self-supervised works in the three test sets in terms of cIoU-Uth, metrics of silence and noise, and more importantly, the global metrics. It is second best in offscreen metrics. For cIoU-Adap. it is the best in S4 and second best in the rest of the datasets. Thanks to the addition of loss terms specifically addressing silence and noise during training, our model completely filters out silence and noise. Interestingly, the weakly-supervised version of SSL-Align achieves by far the best results on positive sounds, but not on the negative ones (across all datasets). In fact, it performs worse on negative sounds compared to its fully self-supervised version. We hypothesize that the weakly-supervised version, which uses a pretrained visual encoder from ImageNet, is more prone to leveraging objectness cues in the image [Oya et al.(2020)Oya, Iwase, Natsume, Itazuri, Yamaguchi, and Morishima]. This appears to be beneficial for positive sounds but detrimental for negative ones. ACL achieves the best results in positive, offscreen and global metrics in IS3, IS3+ and S4. It leverages CLIPSeg [Lüddecke and Ecker(2022)], an image segmentation model based on CLIP with a conditioned transformer decoder trained for object segmentation in a supervised way, with both positive and negative segmentation queries. Thus, it computes powerful and robust features for localization (highly related to segmentation) but less competitive for cross-modal retrieval, as shown in the next results.

Cross-modal retrieval. As in [Liu et al.(2022a)Liu, Ju, Xie, and Zhang, Senocak et al.(2024)Senocak, Ryu, Kim, Oh, Pfister, and Chung], we evaluate VSSL models using a cross-modal retrieval task, to better assess how they capture the semantic correspondence between the audio and visual modalities. For this evaluation, we use Precision and Accuracy at top- $K$ retrieved results, P@ $K$ and A@ $K$ , respectively. Precision refers to the proportion of the top- $K$ retrieved items that belong to the same category as the query, while Accuracy considers a retrieval successful if at least one item among the top- $K$ matches the query’s category. Table 3 reports results for VGG-SS and AVS-Bench S4 (those for IS3 and IS3+ are in the Supplementary material). Our model outperforms the self-supervised SSL-Align model at all $K$ values in VGG-SS, IS3 and IS3+ for both Precision and Accuracy in Image to Audio. Our model outperforms the self-supervised SSL-Align at high $K$ for Precision in both Image to Audio and Audio to Image. On the other hand, the self-supervised version of SSL-Align outperforms our model in Audio to Image for all test sets except at high $K$ values of Precision for the S4 test set. Interestingly, our model improves on its baseline method, SSL-TIE, by a large margin across all metrics and datasets, showing the benefits of including silence and noise in the learning stage.
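The two retrieval metrics defined above can be written for a single query as in the following sketch. The function names and the example labels are hypothetical; the retrieved list is assumed to be already sorted by decreasing cross-modal similarity and to contain at least K items.

```python
def precision_at_k(query_label, retrieved_labels, k):
    """Proportion of the top-k retrieved items sharing the query's category."""
    top_k = retrieved_labels[:k]
    return sum(label == query_label for label in top_k) / k

def accuracy_at_k(query_label, retrieved_labels, k):
    """1.0 if at least one of the top-k retrieved items matches the query's category."""
    return float(query_label in retrieved_labels[:k])

# Made-up ranking for an image query of category "dog".
ranked = ["cat", "dog", "dog", "bird"]
p_at_3 = precision_at_k("dog", ranked, 3)   # two of the top three match
a_at_1 = accuracy_at_k("dog", ranked, 1)    # the top item does not match
a_at_3 = accuracy_at_k("dog", ranked, 3)
```

Dataset-level P@K and A@K would then be obtained by averaging these per-query values over all queries.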

Qualitative results. We show in Fig. 2 the localization maps of the self-supervised models for an example of IS3+. The similarity maps are filtered using the universal threshold [Juanola et al.(2025)Juanola, Haro, and Fuentes]. Unlike previous models, which fail to correctly localize both sound sources (dinosaur and firetruck) while filtering out silence, noise, and offscreen sounds, our approach successfully achieves both. More qualitative results (including failure cases and cross-modal retrieval results) are in the Supplementary material.

4.3 Ablation

First, we study the contribution of the new negative pairs with silence and noise as well as the loss terms $\mathcal{L}_{S}$ and $\mathcal{L}_{N}$ . Table 4 shows this study on the AVS-Bench S4 test set. The best metrics, except for the case of offscreen sounds and separability, are obtained when considering silence, noise and $\mathcal{L}_{S}+\mathcal{L}_{N}$ during training. The best result for offscreen is achieved when considering only silence and $\mathcal{L}_{S}$ , but this comes at the cost of worsening the results on positive sounds, silence and noise. The best result for separability is obtained by the model trained with noise, but not with $\mathcal{L}_{N}$ . The second best result for separability is obtained with the model using silence, noise and $\mathcal{L}_{S}+\mathcal{L}_{N}$ . The results of this ablation on all test sets are presented in the Supplementary material. Finally, Table A.6 in the Supplementary material shows an ablation study of the weight $\lambda_{SN}$ that multiplies the new loss term $(\mathcal{L}_{S}+\mathcal{L}_{N})$ . The best results are achieved for $\lambda_{SN}=1$ .

5 Conclusion

We present SSL-SaN, a simple yet effective approach that leverages silence and noise audio samples to enhance cross-modal retrieval, sound localization and robustness to negative audio samples. Additionally, we present IS3+, an improved version of IS3 that corrects mismatched image-audio pairs, improving evaluation reliability. Our model is comprehensively evaluated on four benchmark datasets using positive, negative, and global metrics. We further propose a new metric to quantify cross-modal alignment and feature separability, which also predicts sound localization and retrieval performance by capturing how well positive audio-visual pairs are distinguished from negative ones. Our approach outperforms recent state-of-the-art self-supervised models on most metrics and datasets.

Acknowledgements

The authors acknowledge support by Maria de Maeztu project ref. CEX2021-001195-M/AEI /10.13039/501100011033. X. J. has received financial support through FPI scholarship PRE2022-101321. G. H. has received financial support through Fulbright Program and Ministerio de Universidades (Spain) funding for mobility stays of professors and researchers in foreign higher education and research centers. M. F. has received support from the National Science Foundation under Grant Number 2152119. This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.

Appendix A Supplementary material

A.1 Architecture

Our model, as well as all of the models presented and compared to ours (except for ACL [Park et al.(2024)Park, Senocak, and Chung] that uses transformers) use ResNet18 as the backbone for both the audio and image encoders. ResNet18 is composed of an initial convolutional layer followed by four residual stages, each made up of two BasicBlocks. Each BasicBlock introduces residual connections that help mitigate the vanishing gradient problem in deep neural networks, enabling more effective training of deeper models.

A.1.1 Input Branches.

In our implementation, we include two separate input branches to handle different modalities:

• conv1_a: used for audio inputs (1 channel)

• conv1: used for RGB images (3 channels)

Each of these branches consists of a $7{\times}7$ convolutional layer with stride 2 and padding 3, followed by batch normalization and ReLU activation. After this initial stage, all inputs share the same residual backbone.

A.1.2 ResNet18 Structure.

The full ResNet18 architecture consists of:

• Initial convolution + batch norm + ReLU

• $3{\times}3$ max pooling with stride 2

• Layer 1: Two BasicBlocks with 64 filters

• Layer 2: Two BasicBlocks with 128 filters (first block includes downsampling)

• Layer 3: Two BasicBlocks with 256 filters (first block includes downsampling)

• Layer 4: Two BasicBlocks with 512 filters (first block includes downsampling)

• Adaptive average pooling

• Fully connected classification layer (removed in our model)
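To make the structure above concrete, the following sketch traces the spatial size of the feature maps for a $224\times 224$ image through the stages listed. It assumes the standard ResNet18 settings (padding 1 in the max pooling and in the $3{\times}3$ convolutions, stride 2 in each downsampling block), which are not stated explicitly here; localization models sometimes change strides to obtain higher-resolution maps, so the numbers are indicative only.

```python
def conv_out(size, kernel, stride, padding):
    """Output spatial size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

size = 224
size = conv_out(size, kernel=7, stride=2, padding=3)   # initial 7x7 convolution
after_conv = size
size = conv_out(size, kernel=3, stride=2, padding=1)   # 3x3 max pooling (padding 1 assumed)
after_pool = size

stages = []
for channels, stride in [(64, 1), (128, 2), (256, 2), (512, 2)]:
    # Only the first block of each stage may downsample; later blocks keep the size.
    size = conv_out(size, kernel=3, stride=stride, padding=1)
    stages.append((channels, size))

print(after_conv, after_pool, stages)
```

Under these assumptions the last stage yields a 512-channel map with a coarse spatial grid, which is the tensor a similarity map would be computed on once the pooling and classification layers are dropped.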

A.1.3 BasicBlock.

The BasicBlock is the fundamental building unit of ResNet18. Each BasicBlock contains:

• A $3{\times}3$ convolution, followed by batch normalization and ReLU

• A second $3{\times}3$ convolution, followed by batch normalization

• A residual (skip) connection that adds the input of the block to its output

If the input and output dimensions differ (e.g., due to a change in the number of channels or stride), a parallel downsample path is introduced using a $1{\times}1$ convolution and batch normalization to match dimensions before the addition.

The output of the BasicBlock is computed as:

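$$\text{Output}=\text{ReLU}\left(\mathcal{F}(x)+\text{shortcut}(x)\right),$$

where $\mathcal{F}(x)$ denotes the two convolutional layers described above and $\text{shortcut}(x)$ is either the identity or the downsample path.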

This structure allows gradients to propagate more easily during training and facilitates the learning of identity mappings when needed.

A.1.4 Adaptation.

In our model variants (e.g., SSL-SaN), the final fully connected layer is removed, and the ResNet18 backbone serves as a feature extractor.

A.2 Computational Efficiency and Resource Requirements

A.2.1 Training Cost

We report training time statistics for our proposed method (SSL-SaN) and compare them with SSL-TIE (backbone of our model).

The increase in training time seen in Table A.1 is due to the two additional audio samples (silence and noise) that the model processes for each image-audio pair, as well as the extra loss computations involved.

A.2.2 Inference Time

The average inference time per sample is reported in Table A.2.

A.2.3 Resource Requirements

The number of parameters (in millions) for the image and audio encoders and for the full models is listed in Table A.3. Note that ACL, which is based on transformer encoders and a transformer decoder, has significantly more parameters than the rest of the models, which are based on ResNet18 encoders.

A.3 Data augmentations

Following [Liu et al.(2022a)Liu, Ju, Xie, and Zhang], we apply augmentations to both audio and image and an additional loss term, during the training process, namely:

Spectrogram masking. The Mel Spectrograms are randomly replaced with zeros along the two axes with random widths (time and frequency masking).
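A minimal NumPy sketch of time and frequency masking is given below. The maximum mask widths, the single band per axis, and the function name `mask_spectrogram` are illustrative assumptions, not values taken from the training setup.

```python
import numpy as np

def mask_spectrogram(spec, max_freq_width, max_time_width, rng):
    """Zero out one random frequency band and one random time band of a (freq, time) array."""
    out = spec.copy()
    n_freq, n_time = out.shape
    freq_width = int(rng.integers(0, max_freq_width + 1))
    freq_start = int(rng.integers(0, n_freq - freq_width + 1))
    out[freq_start:freq_start + freq_width, :] = 0.0      # frequency masking
    time_width = int(rng.integers(0, max_time_width + 1))
    time_start = int(rng.integers(0, n_time - time_width + 1))
    out[:, time_start:time_start + time_width] = 0.0      # time masking
    return out

rng = np.random.default_rng(0)
spec = np.ones((257, 201))                                # stand-in log-Mel spectrogram
masked = mask_spectrogram(spec, max_freq_width=30, max_time_width=30, rng=rng)
```

The input array is copied, so the original spectrogram stays available for the unaugmented view.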

Appearance transformations $(\mathcal{T}_{\text{app}})$ . The appearance of images is changed with some random transformations: color jittering, Gaussian blur, and grayscale.

Geometrical transformations $(\mathcal{T}_{\text{geo}})$ . Images are transformed with geometrical transformations such as crop, resize, rotation, and horizontal flip. The visual features extracted from the geometrically-transformed $j$ -th image are denoted as $\mathbf{v}^{\mathcal{T}_{geo}}_{j}$ .

Ideally, the similarity map of the geometrically-transformed image with its corresponding $j$ -th audio, should match the geometrically-transformed similarity map of the non-transformed image with the same audio. The loss term that expresses this geometrical equivariance is:

[TABLE]
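The equivariance constraint can be illustrated with a horizontal flip, as in the sketch below. It is an assumption-laden example: the mean squared difference is used as the discrepancy measure (the exact norm of the original loss is not reproduced here), and the similarity maps are stand-in arrays rather than network outputs.

```python
import numpy as np

def equivariance_loss(sim_of_transformed_image, sim_of_original_image, transform):
    """Mean squared difference between S(T(image), audio) and T(S(image, audio))."""
    diff = sim_of_transformed_image - transform(sim_of_original_image)
    return float(np.mean(diff ** 2))

def hflip(similarity_map):
    """Horizontal flip applied to a (h, w) map."""
    return similarity_map[:, ::-1]

S_original = np.arange(12, dtype=float).reshape(3, 4)   # stand-in similarity map
S_equivariant = hflip(S_original)      # what a perfectly equivariant model would output
S_invariant = S_original.copy()        # a model that ignores the flip

loss_ok = equivariance_loss(S_equivariant, S_original, hflip)
loss_bad = equivariance_loss(S_invariant, S_original, hflip)
```

The loss is zero only when the map of the transformed image equals the transformed map of the original image, which is the behavior the equivariance term encourages.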

Similar Audio. To identify audio samples similar to the target audio, the audio encoder of the VSSL model is used. At each training epoch, the similarity between the embeddings of every audio sample and all other audio samples is measured using a suitable similarity metric (e.g., cosine similarity). For each audio waveform $W_{f_{i}}$ , the audio with the highest similarity score, denoted as $W_{f_{i}}^{sim}$ , is selected as the most similar audio.
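The nearest-neighbor search described above can be sketched with cosine similarity as follows. The embeddings are toy vectors standing in for audio encoder outputs, and the function name `most_similar_indices` is hypothetical.

```python
import numpy as np

def most_similar_indices(embeddings, eps=1e-8):
    """For each row, index of the other row with the highest cosine similarity."""
    unit = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + eps)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)       # exclude each sample from its own candidates
    return sim.argmax(axis=1)

# Toy 2-D embeddings: samples 0 and 1 point in nearly the same direction.
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
nearest = most_similar_indices(embeddings)
```

For large datasets the full pairwise matrix may not fit in memory, in which case the search would have to be done in chunks or with an approximate nearest-neighbor index; this is a practical consideration rather than something described in the text.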

Similar Audio Mixing (SAM). After identifying the most similar audio, we apply Similar Audio Mixing (SAM). SAM combines the original audio waveform $W_{f_{i}}$ with its most similar audio waveform $W_{f_{i}}^{sim}$ using a mixing coefficient $\alpha$ . This mixing is formulated as follows:

$$W_{f_{i}}^{SAM}=(1-\alpha)\,W_{f_{i}}+\alpha\,W_{f_{i}}^{sim},$$

where the mixing coefficient $\alpha$ starts at 0, preserving the original audio characteristics. As training progresses, $\alpha$ gradually increases after each epoch, reaching a maximum value of 0.5 at epoch 50, thus increasingly emphasizing the contribution of the similar audio. After the first epoch, the original audio is completely replaced by the mixed audio $W_{f_{i}}^{SAM}$ for subsequent training, thereby progressively leveraging similar audio characteristics to enhance model learning.
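The mixing schedule can be sketched as below, under two explicit assumptions: $\alpha$ grows linearly from 0 to 0.5 over the first 50 epochs (only the start and end points are stated), and the mix is a convex combination of the two waveforms. The function names are hypothetical.

```python
import numpy as np

def sam_alpha(epoch, max_alpha=0.5, ramp_epochs=50):
    """Mixing coefficient: 0 at epoch 0, max_alpha from ramp_epochs on (linear ramp assumed)."""
    return max_alpha * min(epoch, ramp_epochs) / ramp_epochs

def similar_audio_mix(waveform, similar_waveform, alpha):
    """Convex combination of a waveform and its most similar waveform (assumed form)."""
    return (1.0 - alpha) * waveform + alpha * similar_waveform

original = np.ones(8)        # stand-in waveform W_{f_i}
similar = np.zeros(8)        # stand-in most similar waveform W_{f_i}^{sim}
mixed_start = similar_audio_mix(original, similar, sam_alpha(0))
mixed_end = similar_audio_mix(original, similar, sam_alpha(50))
```

With this form the original audio never contributes less than half of the mix, since $\alpha$ is capped at 0.5.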

A.4 IS3+

The IS3+ dataset is publicly available on the project page: https://xavijuanola.github.io/SSL-SaN/.

A.4.1 IS3+ data curation

Figure A.1 shows examples where the class labels in IS3 were incorrect, along with the new labels assigned to the corresponding images. These corrections were made through manual curation.

A.4.2 IS3 category simplification

Table A.4 shows the full mapping from the old class labels to the newly assigned ones. Some labels were impossible to distinguish from a single image, so we merged them into a single label. For example, “chicken clucking” and “chicken crowing” are simplified to “chicken”.

A.5 Modality Alignment and Separability

Figure A.2 shows the distribution of the maximum audio-visual similarity values for positive and negative audio-visual pairs across all models evaluated in the paper. We observe that in VGG-SS, the only models capable of clearly distinguishing between positives (blue) and negatives (silence, noise, and offscreen) are SSL-TIE, SSL-Align (both the fully self-supervised and the weakly-supervised versions), and SSL-SaN. On the other hand, LVS, EZ-VSL, FNAC, and SLAVC produce overlapping similarity values for positive and negative audio-visual pairs, both in VGG-SS and IS3+, making it difficult to establish a universal threshold that distinguishes the two cases. In IS3+, both versions of SSL-Align are not good at filtering silence samples, but are better than SSL-TIE and SSL-SaN at distinguishing offscreen sounds from positive ones. Although ACL does not provide good separability between positive and negative pairs in VGG-SS, it does a much better job in the other two datasets. In fact, in these two datasets, it is the best model at separating positive pairs from negative offscreen pairs. This is due to the supervised training of CLIPSeg [Lüddecke and Ecker(2022)] (the segmentation network used by ACL), which has been trained for object segmentation with both (annotated) positive and negative queries. On the other hand, our model is better than ACL at separating negative pairs with silence and noise. This suggests that ACL could benefit from our general strategy of introducing silence and noise in the contrastive learning.

Among all the models, SSL-SaN shows a better balance in separating positive sounds from the three types of negative ones across all datasets.
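
To make the notion of separability concrete, the sketch below searches for a single threshold on the maximum similarity values of positive and negative pairs. The selection criterion (balanced accuracy), the function name `best_threshold`, and the toy scores are assumptions for illustration only; this is not the procedure used in the paper to define its universal threshold.

```python
import numpy as np

def best_threshold(pos_scores, neg_scores):
    """Return (threshold, balanced accuracy) for the threshold that best
    separates positive scores (>= t) from negative scores (< t)."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    best_t, best_acc = None, -1.0
    # Only the observed scores need to be tried as candidate thresholds.
    for t in np.unique(np.concatenate([pos, neg])):
        tpr = np.mean(pos >= t)   # positives kept
        tnr = np.mean(neg < t)    # negatives filtered out
        acc = 0.5 * (tpr + tnr)
        if acc > best_acc:
            best_t, best_acc = float(t), float(acc)
    return best_t, best_acc

# Well-separated distributions: a perfect threshold exists.
separable = best_threshold([0.8, 0.9, 0.7], [0.1, 0.2, 0.3])
# Overlapping distributions: no threshold separates them fully.
overlapping = best_threshold([0.4, 0.6, 0.5], [0.45, 0.55, 0.3])
```

When the two distributions overlap, as reported above for several baselines, the best achievable balanced accuracy stays below 1 regardless of where the threshold is placed.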

A.6 Cross-modal Retrieval

In this section, we present the results of the cross-modal retrieval task on the same datasets as in Table 3 from the paper, but now including results from the IS3 dataset (omitted in the paper due to space constraints).

For IS3, we observe a similar trend as in IS3+, where our model achieves the best performance among the self-supervised models in the Image to Audio retrieval and SSL-Align is the best in Audio to Image.

After presenting the quantitative results, we now include some qualitative examples illustrating the cross-modal retrieval task. Specifically, given an audio input, we show the top 4 images retrieved by the model, and vice versa for image-to-audio retrieval. These results are shown in Figures A.3 and A.4. In the case of audio, we specify its corresponding class in words and show its corresponding image. The audio samples (and images) are available on the project page: https://xavijuanola.github.io/SSL-SaN/
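
The top-4 retrieval described here can be sketched as a ranking by cosine similarity. This assumes every audio clip and image is already represented by a single global embedding in a shared space; how the paper pools its features for retrieval is not stated in this section. The function name `retrieve_top_k` and the toy embeddings are hypothetical.

```python
import numpy as np

def retrieve_top_k(query_emb, gallery_embs, k=4):
    """Rank gallery items (N, D) by cosine similarity to a query (D,)
    and return the indices and scores of the k best matches."""
    q = query_emb / (np.linalg.norm(query_emb) + 1e-8)
    g = gallery_embs / (np.linalg.norm(gallery_embs, axis=1, keepdims=True) + 1e-8)
    scores = g @ q
    order = np.argsort(-scores, kind="stable")[:k]   # highest similarity first
    return order, scores[order]

# Toy example: one audio query against a gallery of five image embeddings.
audio_query = np.array([1.0, 0.0])
image_gallery = np.array([[0.0, 1.0],
                          [1.0, 0.1],
                          [0.5, 0.5],
                          [-1.0, 0.0],
                          [0.9, 0.0]])
top_idx, top_scores = retrieve_top_k(audio_query, image_gallery, k=4)
```

Image-to-audio retrieval follows by swapping the roles of the query and the gallery.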

Figure A.3 illustrates the good performance of our model on image-to-audio retrieval in IS3+. In all cases except one, the top four retrieved audio samples are perfect matches. It is interesting to note that in the failure case, where the target image contains a bus and a chicken, the top four retrieved audio samples correspond to a bus, except for one that contains a tractor, another vehicle that can produce a very similar sound.

A similar example can be seen in Figure A.4, where all retrieved images are correct except for one (third row), where the query audio is from a chicken, and the second retrieved image is of a turkey. These two animals produce similar sounds, which demonstrates how our model understands the sources of different sounds.

Figures A.5 and A.6 show qualitative results of cross-modal retrieval in the VGG-SS dataset. The labels in VGG-SS are highly specific, making it extremely challenging for models to differentiate accurately between certain classes using only single-query images instead of videos. For instance, in the last row of Figure A.5, the frame is labeled people eating apple, yet the depicted image shows a woman with her mouth closed, offering no visual indication of an apple being eaten. Consequently, the model struggles to retrieve appropriate audio clips corresponding to this precise class and instead returns audio associated with subtle mouth movements, typically with a closed mouth, such as people slurping or lip smacking. This issue is similarly evident in the Audio to Image retrieval scenario depicted in Figure A.6.

A.7 More ablation studies

As mentioned in the paper, we conducted an ablation study by training multiple models with different values of the $\lambda_{SN}$ parameter, which controls the relative importance of the $\mathcal{L}_{S}$ and $\mathcal{L}_{N}$ loss terms with respect to the contrastive loss. The results of this analysis are presented in Table A.6. They indicate that the optimal value is $\lambda_{SN}=1$, which yields the highest value for the positive cIoU metric, fully filters out silence and noise negatives, and maintains a relatively low offscreen pIA value.

In addition, Table A.7 shows the impact of training with samples of noise and silence, as well as with the new loss terms $\mathcal{L}_{S}$ and $\mathcal{L}_{N}$. Overall, the model trained with silence, noise, and both $\mathcal{L}_{S}$ and $\mathcal{L}_{N}$ delivers the best balance between negative audio filtering and localization. It is the only setup that drives negative leaks essentially to zero across datasets: in VGG-SS it achieves $pIA_{S}{=}0.01$ and $pIA_{N}{=}0.00$, and in IS3/IS3+ both are exactly $0.00$. At the same time, it ranks first on S4 with the best cIoU${}_{\text{Uth}}$ and F${}_{\text{LOC}}$, and remains very close to the best on VGG-SS and IS3/IS3+ (e.g., second-best cIoU and F${}_{\text{LOC}}$ in VGG-SS). Compared to the version trained with silence and noise but without the loss terms, the difference in negative suppression is large (e.g., in VGG-SS, $pIA_{S}$: $0.68\rightarrow 0.01$ and $pIA_{N}$: $0.35\rightarrow 0.00$), at the cost of only small differences in cIoU/F${}_{\text{LOC}}$.

Conceptually, combining both types of negative samples (silence and noise) with explicit supervision, $\mathcal{L}_{S}$ and $\mathcal{L}_{N}$ , teaches the model to filter negative audio samples, preventing hallucinations, tightening activation thresholds and separating the values of the similarity maps of the positives from the negatives (see Figure A.2). This regularization yields better generalization across datasets and tasks.

A.8 Qualitative results

A.8.1 VGG-SS

In Figure A.7, we present two examples illustrating how our model outperforms others on the VGG-SS test set. In the first example, SSL-TIE and our model are the only ones that correctly localize the sound coming from the piano, but SSL-TIE does not correctly filter the silence and noise. Moreover, our model highlights a larger part of the piano, giving a better localization.

The second example depicts multiple chickens. Similar to the previous case, SSL-TIE and SSL-SaN are the only models able to localize the correct region for the positive sound, but SSL-TIE highlights part of the chickens when an offscreen sound is played.

A.8.2 IS3+

In the first example of Figure A.8, the scene shows a cow near a river with additional cows in the background. When the sound corresponds to the cow, SSL-SaN, ACL, and SSL-TIE all correctly localize it. However, when the sound corresponds to the river, only SSL-TIE and SSL-SaN produce activations: SSL-TIE localizes closer to the cow, while SSL-SaN focuses on the rocks in the river. None of the models successfully highlight the background cows.

In the second example, the image contains a firetruck in the foreground and fireworks in the sky. ACL delivers nearly perfect localization, accurately identifying the firetruck while filtering out negative sounds. SSL-SaN also localizes both positive sounds and filters negatives, though with slightly less precision. Notice that ACL leverages a network trained with a supervised segmentation loss, while our model has been trained completely from scratch in a contrastive way. On the other hand, SSL-TIE and SSL-Align, while correctly detecting the positive sources, fail to fully suppress the negative sounds.

A.8.3 AVSBench S4

In Figure A.9, we present two examples from the AVS-Bench S4 test set that highlight how our model outperforms competing approaches.

In the first example, the scene features a race car. Similar to the IS3+ case, ACL achieves nearly perfect localization while filtering out all negative sounds. SSL-TIE, SSL-Align, and SSL-SaN also localize the positive sound correctly; however, only our model successfully suppresses all negatives. Specifically, SSL-TIE fails to filter the offscreen sound, while SSL-Align fails to filter the silence.

In the second example, showing a computer keyboard, SSL-SaN, ACL, SSL-Align, and SSL-TIE all produce correct localizations of the keyboard. Yet, SSL-SaN is the only model that does not produce false activations for the negative sounds. SSL-Align and ACL fail to filter the silence, while SSL-TIE fails across all three negative cases.

A.9 Failure cases

A.9.1 VGG-SS

In Figure A.10, we present six failure cases from the VGG-Sound Sources test set. In the first example, the video shows an air conditioner: although the sound is identifiable as the machine, it is weak, and the model fails to localize the source in the image. In the second and third examples, the model struggles to localize the different instruments. In the fourth example, the audio is the sound of the shoes against the floor (slap dancing) and our model is not able to identify the sound source in the image. In the last two examples, the sound is also clear, but the model appears to interpret the tree’s leaves as birds and fails to detect the owl in the final image.

A.9.2 IS3+

Figure A.11 shows six examples where our model fails to detect the sounding object in the IS3+ curated test set. In the first and fourth examples, the model correctly identifies the sound sources (accordion and waterfall), but produces very limited or sparse localization maps compared to the actual size of the objects. In the remaining cases, the model completely fails to detect the sounding source and does not localize any relevant region of the image. We also see that our model tends to fail more often on offscreen negative sounds. This is expected given Figure 1, where we can see that in IS3+ the universal threshold of SSL-SaN does not completely filter the interquartile range of offscreen sounds (the separability of positive and offscreen sounds is much better in the other two datasets).

A.9.3 AVSBench S4

We present several failure cases from the AVSBench S4 test set in Figure A.12. In the first two images, the target vehicles (a distant car and an ambulance) appear very small, and the model fails to extract sufficient detail for reliable localization. In the third and fifth columns, animals are clearly visible as foreground, yet the model does not localize them. In the third image, a pole occludes the scene and the cat’s face is only partially visible, which further hinders detection. The fourth and sixth images also contain vehicles that the model misses. Notably, in the first, second, third, and sixth images the model produces weak activations in approximately the correct region, but these responses are too diffuse or misplaced to demonstrate clear object-level localization.

A.9.4 Discussion

From the qualitative inspection of the failure cases across VGG-SS, IS3+, and AVS-Bench S4, several consistent trends can be identified. A common difficulty arises when the sounding object occupies only a small portion of the image, as with the distant vehicles in AVS-Bench S4, where the model fails to associate the weak visual evidence with the audio cue. Complex scenes also introduce confusion: for instance, musical instrument cases in VGG-SS often include people, sheet music, and microphones, all of which provide competing visual structures that the model mistakenly attends to. Similar effects are observed in nature scenes, such as the bird example where dense foliage is misinterpreted as additional animals. The model further struggles with monotonous sounds, where the acoustic signal provides little temporal variation or discriminative content, as seen with air conditioners or slap dance performances, leading to diffuse or misplaced activations. Partial occlusions represent another challenge, with examples such as cats or dogs hidden behind obstacles, or a helicopter where the blades are not visible, causing incomplete or absent localizations. These patterns suggest that limitations arise when the sound offers limited semantic cues, when the visual field is cluttered with distractors, or when the target object is visually ambiguous due to scale or occlusion.

Bibliography


1. [Adobe(2023)] Adobe. Adobe audition sound effects, 2023. URL https://www.adobe.com/products/audition/offers/adobeauditiondlcsfx.html . Accessed: [25-Jan-2025].
2. [Afouras et al.(2020)] Triantafyllos Afouras, Andrew Owens, Joon Son Chung, and Andrew Zisserman. Self-supervised learning of audio-visual objects from video. In European Conference on Computer Vision, pages 208–224. Springer, 2020.
3. [Arandjelovic and Zisserman(2017)] Relja Arandjelovic and Andrew Zisserman. Look, listen and learn. In IEEE International Conference on Computer Vision, pages 609–617, 2017.
4. [Arandjelovic and Zisserman(2018)] Relja Arandjelovic and Andrew Zisserman. Objects that sound. In European Conference on Computer Vision, pages 435–451, 2018.
5. [Chen et al.(2020a)] Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. Vggsound: A large-scale audio-visual dataset. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 721–725, 2020a.
6. [Chen et al.(2021)] Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, and Andrew Zisserman. Localizing visual sounds the hard way. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16867–16876, 2021.
7. [Chen et al.(2020b)] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020b.
8. [Chen and He(2021)] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15750–15758, 2021.