Remove Cosine Window from Correlation Filter-based Visual Trackers: When and How

Feng Li; Xiaohe Wu; Wangmeng Zuo; David Zhang; Lei Zhang

arXiv:1905.06648·cs.CV·July 19, 2023


TL;DR

This paper investigates removing the cosine window from correlation filter trackers by using spatial regularization and mask functions, improving boundary handling and contamination issues, leading to better tracking performance.

Contribution

It introduces a method to eliminate the cosine window in CF trackers using spatial regularization and mask functions, enhancing boundary and contamination handling.

Findings

01

Outperforms state-of-the-art trackers on benchmarks.

02

Effectively handles boundary discontinuity and sample contamination.

03

Compatible with handcrafted and deep features.

Abstract

Correlation filters (CFs) have been continuously advancing the state-of-the-art tracking performance and have been extensively studied in the recent few years. Most of the existing CF trackers adopt a cosine window to spatially reweight base image to alleviate boundary discontinuity. However, cosine window emphasizes more on the central region of base image and has the risk of contaminating negative training samples during model learning. On the other hand, spatial regularization deployed in many recent CF trackers plays a similar role as cosine window by enforcing spatial penalty on CF coefficients. Therefore, we in this paper investigate the feasibility to remove cosine window from CF trackers with spatial regularization. When simply removing cosine window, CF with spatial regularization still suffers from small degree of boundary discontinuity. To tackle this issue, binary and…

Tables

Table 1. TABLE I: The EAO, accuracy and robustness of two CF trackers (i.e., KCF and BACF) and their counterparts without using cosine window during training (i.e., KCF_RC and BACF_RC) on the VOT-2018 dataset. Here, $\uparrow$ ($\downarrow$) denotes higher (lower) is better.

Methods	KCF [2]	KCF_RC	BACF [13]	BACF_RC
EAO ( $↑$ )	0.106	0.069	0.137	0.124
Accuracy ( $↑$ )	0.327	0.374	0.432	0.466
Robustness ( $↓$ )	1.182	1.823	0.757	0.892

Table 2. TABLE II: The EAO, accuracy (Acc.) and robustness (RO.) results by progressively integrating our methods into the baseline CF trackers on the VOT-2018 dataset. Here, Baseline, RC, RCB, and RCG respectively represent the baseline CF tracker, that by removing cosine window, that by removing cosine window and incorporating with binary mask function $\mathbf{M}$, and that by removing cosine window and incorporating with Gaussian shaped mask function $\mathbf{M}_{G}$. ($\ast$) Note that the results of ECO and UPDT are reproduced from the released codes on the VOT-2018 challenge website, and we report the UPDT results as the average scores of 15 times running following the protocols in [18].

Methods	MOSSE [1]			KCF [2]			BACF [13]			STRCF [14]			ECOhc [12]			ECO^∗ [12]			UPDT^∗ [15]
	EAO	ACC.	RO.	EAO	ACC.	RO.	EAO	ACC.	RO.	EAO	ACC.	RO.	EAO	ACC.	RO.	EAO	ACC.	RO.	EAO	ACC.	RO.
Baseline	0.067	0.387	1.862	0.106	0.327	1.182	0.137	0.432	0.757	0.174	0.47	0.632	0.212	0.524	0.492	0.262	0.458	0.323	0.352	0.523	0.207
RC	0.033	0.403	2.438	0.069	0.374	1.823	0.124	0.466	0.892	0.166	0.486	0.683	0.194	0.532	0.521	0.251	0.476	0.334	0.343	0.529	0.221
RCB	0.038	0.408	2.392	0.075	0.377	1.788	0.158	0.462	0.723	0.187	0.478	0.604	0.225	0.528	0.476	0.279	0.464	0.272	0.384	0.524	0.174
RCG	0.042	0.396	2.375	0.081	0.372	1.774	0.165	0.458	0.692	0.192	0.474	0.595	0.231	0.524	0.464	0.287	0.462	0.258	0.391	0.528	0.168

Table 3. TABLE III: Comparison with the state-of-the-art trackers in terms of EAO, accuracy, and robustness on the VOT-2018 dataset. The first, second and third best results are highlighted in color. ($\ast$) Note that the results of ECO and UPDT are reproduced from the released codes on VOT-2018 challenge website, and we report the UPDT result as the average score of 15 times running following the protocols in [18].

Methods	ECO^∗ [12]	DLSTpp [33]	SA_Siam_R [34]	CPT [18]	DeepSTRCF [14]	UPDT^∗ [15]	DRT [35]	RCO [18]	SiamRPN [36]	MFT [18]	LADCF [19]	ECO_RCG (Ours)	UPDT_RCG (Ours)
EAO ( $↑$ )	0.262	0.325	0.337	0.339	0.345	0.352	0.356	0.376	0.383	0.385	0.389	0.287	0.391
Accuracy ( $↑$ )	0.458	0.543	0.566	0.506	0.523	0.523	0.519	0.507	0.586	0.505	0.503	0.462	0.528
Robustness ( $↓$ )	0.323	0.224	0.258	0.239	0.215	0.207	0.201	0.155	0.276	0.14	0.159	0.258	0.168

Table 4. TABLE IV: The mean OP results (%) of different trackers using handcrafted and CNN features on the OTB-2015 dataset. Note that the first two rows compare the methods with handcrafted features, while the last two rows correspond to the trackers with CNN features. The first , second and third best results are highlighted in color.

Methods	DSST [4]	SKSCF [7]	SAMF_AT [37]	Staple [38]	SRDCF [8]	TRACA [39]	SRDCFDecon [28]	BACF [13]	ECOhc [12]	STRCF [14]	BACF_RCG (Ours)	ECOhc_RCG (Ours)	STRCF_RCG (Ours)
Mean OP ( $↑$ )	62.2	66.5	68	71	72.8	74.7	76.6	76.7	77.2	80	77.8	78.5	82.3

Methods

CNN-SVM

[41]

HCF

[25]

HDT

[26]

FCNT

[42]

SiameseFC

[46]

CF-Net

[43]

DeepSRDCF

[45]

CCOT

[11]

ECO

[12]

DeepSTRCF

[14]

MDNet

[40]

VITAL

[44]

ECO_RCG

Ours

Mean OP (

↑

)

65.1

65.6

65.8

67.1

71

73

76.8

82.4

84.8

84.9

86.6

86.7

Table 5. TABLE V: The mean OP results (%) of different trackers using handcrafted features on each attribute of OTB-2015. The first , second and third best results are highlighted in color.

Methods	DSST [4]	SKSCF [7]	SAMF_AT [37]	Staple [38]	SRDCF [8]	TRACA [39]	SRDCFDecon [28]	BACF [13]	ECOhc [12]	STRCF [14]	BACF_RCG (Ours)	ECOhc_RCG (Ours)	STRCF_RCG (Ours)
MB	55.3	63.4	70.8	65	72.9	73.9	79.9	73.5	75.3	79.4	78.4	77.4	81.5
OCC	56.2	63.4	64.8	68	67.6	71.2	73.5	71.1	74.5	75.1	73.1	74.5	80.2
IV	65.8	68.6	62.9	72	74.2	76.9	79.3	78.5	76.1	78.3	78.8	81	80.9
BC	59.9	69	63	67.7	69.2	74.1	78	76	76.5	79.5	79.9	78.6	82.6
IPR	60.8	65.4	65.5	66.9	66.3	71.5	70	71.5	68.5	73.9	72.4	70.8	76
OPR	58.3	64.3	64.8	66.5	65.9	72.5	72.4	71.8	72.1	76.6	72.2	74.8	80.1
SV	55.8	56.3	58.8	61.5	67.1	68.6	74.4	70.2	71.9	76.4	73	73.6	78.1
FM	55	63.2	66.8	65.9	72.1	70.6	74.6	76	74.5	76	74.6	72.2	77
DEF	53.1	62.7	58.4	68.6	66.1	70	68.2	71.3	73.7	73.3	68.7	75.4	73.9
OV	45.5	45.8	60.3	52.3	52.7	67.8	61.8	67.1	63.5	70.9	63.8	67.8	70.6
LR	34.7	24	51.4	39.3	64.1	54.9	63.9	62.2	56	69.6	67	53.8	68.9

Equations

$$\mathcal{E}(\mathbf{f})=\frac{1}{2}\left\|\sum_{l=1}^{L}\mathbf{f}_{l}\star(\mathbf{x}_{t,l}\odot\mathbf{c})-\mathbf{y}_{t}\right\|^{2}+\lambda\mathcal{R}(\mathbf{f}),$$

$$\mathcal{E}(\mathbf{f})=\frac{1}{2}\sum_{k=1}^{K}\alpha_{k}\left\|\sum_{l=1}^{L}\mathbf{f}_{l}\star(\mathbf{x}_{k,l}\odot\mathbf{c})-\mathbf{y}_{k}\right\|^{2}+\lambda\mathcal{R}(\mathbf{f}),$$

$$M(x,y)=\begin{cases}1, & \text{if } |x|\leq\frac{H}{2}-\frac{h}{2} \text{ and } |y|\leq\frac{W}{2}-\frac{w}{2},\\ 0, & \text{otherwise.}\end{cases}$$

$$\mathcal{E}(\mathbf{f})=\frac{1}{2}\sum_{k=1}^{K}\alpha_{k}\left\|\mathbf{M}\odot\left(\sum_{l=1}^{L}\mathbf{f}_{l}\star\mathbf{x}_{k,l}-\mathbf{y}_{k}\right)\right\|^{2}+\lambda\mathcal{R}(\mathbf{f}).$$

$$M_{G}(x,y)=\begin{cases}e^{-\left(\frac{x}{h\delta}\right)^{2}-\left(\frac{y}{w\delta}\right)^{2}}, & \text{if } |x|\leq\frac{H}{2}-\frac{h}{2} \text{ and } |y|\leq\frac{W}{2}-\frac{w}{2},\\ 0, & \text{otherwise,}\end{cases}$$

$$\mathcal{E}(\mathbf{f})=\frac{1}{2}\left\|\mathbf{M}\odot\left(\sum_{l=1}^{L}\mathbf{f}_{l}\star\mathbf{x}_{l}-\mathbf{y}\right)\right\|^{2}+\lambda\mathcal{R}(\mathbf{f}).$$

$$\mathcal{L}(\mathbf{g})=\frac{1}{2}\left\|\sum_{l=1}^{L}(\mathbf{x}_{l}\odot\mathbf{c})\star(\mathbf{P}^{\top}\mathbf{g}_{l})-\mathbf{y}\right\|^{2}+\frac{\lambda}{2}\|\mathbf{g}\|^{2},$$

$$\mathcal{L}(\mathbf{f},\mathbf{g})=\frac{1}{2}\left\|\mathbf{M}\odot\left(\sum_{l=1}^{L}\mathbf{x}_{l}\star\mathbf{f}_{l}-\mathbf{y}\right)\right\|^{2}+\frac{\lambda}{2}\|\mathbf{g}\|^{2},\quad \text{s.t. } \mathbf{f}_{l}=\mathbf{P}^{\top}\mathbf{g}_{l}.$$

$$\mathcal{L}(\mathbf{f},\mathbf{g},\mathbf{z})=\frac{1}{2}\|\mathbf{M}\odot\mathbf{z}\|^{2}+\frac{\lambda}{2}\|\mathbf{g}\|^{2},\quad \text{s.t. } \mathbf{f}_{l}=\mathbf{P}^{\top}\mathbf{g}_{l},\ \mathbf{z}=\sum_{l=1}^{L}\mathbf{x}_{l}\star\mathbf{f}_{l}-\mathbf{y}.$$

$$\mathcal{L}(\mathbf{f},\mathbf{g},\mathbf{z},\boldsymbol{\zeta},\boldsymbol{\gamma})=\frac{1}{2}\|\mathbf{M}\odot\mathbf{z}\|^{2}+\frac{\lambda}{2}\|\mathbf{g}\|^{2}+\sum_{l=1}^{L}\boldsymbol{\zeta}_{l}^{\top}(\mathbf{f}_{l}-\mathbf{P}^{\top}\mathbf{g}_{l})+\frac{\mu}{2}\sum_{l=1}^{L}\left\|\mathbf{f}_{l}-\mathbf{P}^{\top}\mathbf{g}_{l}\right\|^{2}+\boldsymbol{\gamma}^{\top}\left(\sum_{l=1}^{L}\mathbf{x}_{l}\star\mathbf{f}_{l}-\mathbf{y}-\mathbf{z}\right)+\frac{\tau}{2}\left\|\sum_{l=1}^{L}\mathbf{x}_{l}\star\mathbf{f}_{l}-\mathbf{y}-\mathbf{z}\right\|^{2},$$

$$\mathbf{g}=\arg\min_{\mathbf{g}}\ \frac{\lambda}{2}\|\mathbf{g}\|^{2}+\sum_{l=1}^{L}\boldsymbol{\zeta}_{l}^{\top}(\mathbf{f}_{l}-\mathbf{P}^{\top}\mathbf{g}_{l})+\frac{\mu}{2}\sum_{l=1}^{L}\left\|\mathbf{f}_{l}-\mathbf{P}^{\top}\mathbf{g}_{l}\right\|^{2}.$$

$$\mathbf{g}_{l}=(\lambda\mathbf{I}+\mu\mathbf{P}\mathbf{P}^{\top})^{-1}(\mathbf{P}\boldsymbol{\zeta}_{l}+\mu\mathbf{P}\mathbf{f}_{l}),$$

$$\mathbf{f}=\arg\min_{\mathbf{f}}\ \sum_{l=1}^{L}\boldsymbol{\zeta}_{l}^{\top}(\mathbf{f}_{l}-\mathbf{P}^{\top}\mathbf{g}_{l})+\frac{\mu}{2}\sum_{l=1}^{L}\left\|\mathbf{f}_{l}-\mathbf{P}^{\top}\mathbf{g}_{l}\right\|^{2}+\boldsymbol{\gamma}^{\top}\left(\sum_{l=1}^{L}\mathbf{x}_{l}\star\mathbf{f}_{l}-\mathbf{y}-\mathbf{z}\right)+\frac{\tau}{2}\left\|\sum_{l=1}^{L}\mathbf{x}_{l}\star\mathbf{f}_{l}-\mathbf{y}-\mathbf{z}\right\|^{2}.$$

$$\hat{\mathbf{f}}=\arg\min_{\hat{\mathbf{f}}}\ \sum_{l=1}^{L}\hat{\boldsymbol{\zeta}}_{l}^{\top}(\hat{\mathbf{f}}_{l}-\hat{\mathbf{q}}_{l})+\frac{\mu}{2}\sum_{l=1}^{L}\left\|\hat{\mathbf{f}}_{l}-\hat{\mathbf{q}}_{l}\right\|^{2}+\hat{\boldsymbol{\gamma}}^{\top}\left(\sum_{l=1}^{L}\hat{\mathbf{x}}_{l}\odot\hat{\mathbf{f}}_{l}-\hat{\mathbf{y}}-\hat{\mathbf{z}}\right)+\frac{\tau}{2}\left\|\sum_{l=1}^{L}\hat{\mathbf{x}}_{l}\odot\hat{\mathbf{f}}_{l}-\hat{\mathbf{y}}-\hat{\mathbf{z}}\right\|^{2}.$$

$$\hat{\mathbf{f}}(t)=\left(\tau\hat{\mathbf{x}}(t)\hat{\mathbf{x}}(t)^{\top}+\mu\mathbf{I}\right)^{-1}\left(\tau\hat{\mathbf{x}}(t)\hat{y}(t)+\tau\hat{\mathbf{x}}(t)\hat{z}(t)-\hat{\mathbf{x}}(t)\hat{\gamma}(t)-\hat{\boldsymbol{\zeta}}(t)+\mu\hat{\mathbf{q}}(t)\right).$$

$$\hat{\mathbf{f}}(t)=\frac{1}{\mu}\left(\tau\hat{\mathbf{x}}(t)\hat{y}(t)+\tau\hat{\mathbf{x}}(t)\hat{z}(t)-\hat{\mathbf{x}}(t)\hat{\gamma}(t)-\hat{\boldsymbol{\zeta}}(t)+\mu\hat{\mathbf{q}}(t)\right)-\frac{\hat{\mathbf{x}}(t)}{\mu b}\left(\tau\hat{y}(t)\hat{s}_{\mathbf{x}}(t)+\tau\hat{z}(t)\hat{s}_{\mathbf{x}}(t)-\hat{\gamma}(t)\hat{s}_{\mathbf{x}}(t)-\hat{s}_{\boldsymbol{\zeta}}(t)+\mu\hat{s}_{\mathbf{q}}(t)\right).$$

$$\mathbf{z}=\arg\min_{\mathbf{z}}\ \frac{1}{2}\|\mathbf{M}\odot\mathbf{z}\|^{2}+\boldsymbol{\gamma}^{\top}\left(\sum_{l=1}^{L}\mathbf{x}_{l}\star\mathbf{f}_{l}-\mathbf{y}-\mathbf{z}\right)+\frac{\tau}{2}\left\|\sum_{l=1}^{L}\mathbf{x}_{l}\star\mathbf{f}_{l}-\mathbf{y}-\mathbf{z}\right\|^{2}.$$

$$\mathbf{z}=\left(\mathrm{Diag}(\mathbf{M}\odot\mathbf{M}+\tau\mathbf{1})\right)^{-1}\left(\tau\left(\sum_{l=1}^{L}\mathbf{x}_{l}\star\mathbf{f}_{l}-\mathbf{y}\right)+\boldsymbol{\gamma}\right),$$

$$\boldsymbol{\zeta}^{(t+1)}=\boldsymbol{\zeta}^{(t)}+\mu\left(\mathbf{f}^{(t+1)}-\mathbf{P}^{\top}\mathbf{g}^{(t+1)}\right),\qquad \boldsymbol{\gamma}^{(t+1)}=\boldsymbol{\gamma}^{(t)}+\tau\left(\sum_{l=1}^{L}\mathbf{x}_{l}\star\mathbf{f}_{l}^{(t+1)}-\mathbf{y}-\mathbf{z}^{(t+1)}\right).$$

$$J_{l}\{\mathbf{x}_{k,l}\}(t)=\sum_{n=0}^{N_{l}-1}\mathbf{x}_{k,l}[n]\,b_{l}\!\left(t-\frac{T}{N_{l}}n\right),$$

$$\mathcal{E}(\mathbf{f},\mathbf{Q})=\frac{1}{2}\sum_{k=1}^{K}\alpha_{k}\left\|\sum_{d=1}^{D}\sum_{l=1}^{L}q_{l,d}\,\mathbf{f}_{d}\star\left(J_{l}\{\mathbf{x}_{k,l}\}\odot\mathbf{c}\right)-\mathbf{y}_{k}\right\|^{2}+\frac{1}{2}\sum_{d=1}^{D}\|\mathbf{w}\odot\mathbf{f}_{d}\|^{2}+\frac{\lambda}{2}\|\mathbf{Q}\|^{2},$$

$$\mathcal{E}(\mathbf{f},\mathbf{Q})=\frac{1}{2}\sum_{k=1}^{K}\alpha_{k}\left\|\mathbf{M}\odot\left(\sum_{d=1}^{D}\sum_{l=1}^{L}q_{l,d}\,\mathbf{f}_{d}\star J_{l}\{\mathbf{x}_{k,l}\}-\mathbf{y}_{k}\right)\right\|^{2}+\frac{1}{2}\sum_{d=1}^{D}\|\mathbf{w}\odot\mathbf{f}_{d}\|^{2}+\frac{\lambda}{2}\|\mathbf{Q}\|^{2}.$$

$$\mathcal{E}(\mathbf{f},\mathbf{Q},\mathbf{z}_{k})=\frac{1}{2}\sum_{k=1}^{K}\|\mathbf{M}\odot\mathbf{z}_{k}\|^{2}+\frac{1}{2}\sum_{d=1}^{D}\|\mathbf{w}\odot\mathbf{f}_{d}\|^{2}+\frac{\lambda}{2}\|\mathbf{Q}\|^{2}+\frac{\tau}{2}\sum_{k=1}^{K}\alpha_{k}\left\|\sum_{d=1}^{D}\sum_{l=1}^{L}q_{l,d}\,\mathbf{f}_{d}\star J_{l}\{\mathbf{x}_{k,l}\}-\mathbf{y}_{k}-\frac{\mathbf{z}_{k}}{\alpha_{k}}\right\|^{2},$$

$$\mathbf{z}_{k}=\left(\mathrm{Diag}(\mathbf{M}\odot\mathbf{M}+\tau\mathbf{1})\right)^{-1}\tau\alpha_{k}\left(\sum_{d=1}^{D}\sum_{l=1}^{L}q_{l,d}\,\mathbf{f}_{d}\star J_{l}\{\mathbf{x}_{k,l}\}-\mathbf{y}_{k}\right).$$


Code & Models

Repositories

lifeng9472/Removing_cosine_window_from_CF_trackers (Official)


Full text


Feng Li, Xiaohe Wu, Wangmeng Zuo, David Zhang, and Lei Zhang

This work is partially supported by the National Natural Science Foundation of China (NSFC) under Grant No. 61671182 and 61872118, and the HK RGC GRF grant (under no. PolyU 152124/15E). F. Li, X. Wu, and W. Zuo are with the School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China. e-mail: (fengli_[email protected], [email protected], [email protected]). D. Zhang is with the School of Science and Engineering, The Chinese University of Hong Kong (Shenzhen), Shenzhen, China, e-mail: ([email protected]). L. Zhang is with the Department of Computing, The Hong Kong Polytechnic University, Hong Kong, e-mail: ([email protected]). (Corresponding author: Wangmeng Zuo) Manuscript received May xx, 2019.

Abstract

Correlation filters (CFs) have been continuously advancing the state-of-the-art tracking performance and have been extensively studied in the recent few years. Most of the existing CF trackers adopt a cosine window to spatially reweight base image to alleviate boundary discontinuity. However, cosine window emphasizes more on the central region of base image and has the risk of contaminating negative training samples during model learning. On the other hand, spatial regularization deployed in many recent CF trackers plays a similar role as cosine window by enforcing spatial penalty on CF coefficients. Therefore, we in this paper investigate the feasibility to remove cosine window from CF trackers with spatial regularization. When simply removing cosine window, CF with spatial regularization still suffers from small degree of boundary discontinuity. To tackle this issue, binary and Gaussian shaped mask functions are further introduced for eliminating boundary discontinuity while reweighting the estimation error of each training sample, and can be incorporated with multiple CF trackers with spatial regularization. In comparison to the counterparts with cosine window, our methods are effective in handling boundary discontinuity and sample contamination, thereby benefiting tracking performance. Extensive experiments on three benchmarks show that our methods perform favorably against the state-of-the-art trackers using either handcrafted or deep CNN features. The code is publicly available at https://github.com/lifeng9472/Removing_cosine_window_from_CF_trackers.

Index Terms:

Visual tracking, correlation filters, cosine window, spatial regularization

I Introduction

Correlation filter (CF) is a representative framework for visual tracking and has attracted great research interest. Since the pioneering work of MOSSE [1], extensive studies have been devoted to improving CF models by incorporating nonlinear kernels [2, 3], scale adaptivity [4, 5, 6], max-margin classification [7], spatial regularization [8, 9, 10], and continuous convolution [11, 12]. Moreover, the use of deep representations and their combination with handcrafted features also significantly boosts tracking performance. Benefiting from this progress in models and feature representation, CFs have continuously advanced the state-of-the-art tracking accuracy and robustness in recent years.

In the standard CF, the training set consists of all the cyclic shifts of a base image and can be represented as a circulant matrix, so that CFs can be learned efficiently via the fast Fourier transform (FFT). Although this circulant property greatly benefits learning efficiency, the negative samples (i.e., shifted images) suffer from the boundary discontinuity problem. As shown in Fig. 1(a), except for the base image in the green box, all the shifted images (e.g., the two patches in the cyan and blue boxes) are generated using the circulant property and are not true negative patches from real images.
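The link between the explicit set of cyclic shifts and a single FFT-domain product, together with the wrap-around seam behind boundary discontinuity, can be sketched in one dimension. This is a minimal illustration rather than any tracker's implementation; the 1-D signals, the variable names, and the random data are assumptions made for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)  # hypothetical 1-D stand-in for a base image
f = rng.standard_normal(8)  # hypothetical 1-D stand-in for a filter

# Explicit training set: row s is the base signal cyclically shifted by s.
shifts = np.stack([np.roll(x, s) for s in range(x.size)])

# Filter response on every shifted sample, computed one sample at a time.
direct = shifts @ f

# The same responses from one FFT-domain product (circular cross-correlation).
via_fft = np.real(np.fft.ifft(np.fft.fft(f) * np.conj(np.fft.fft(x))))

# Wrap-around seam: shifting a ramp places its last values next to its first.
ramp = np.arange(8)
wrapped = np.roll(ramp, 3)  # array([5, 6, 7, 0, 1, 2, 3, 4])
```

In `wrapped`, the values 7 and 0 become neighbours although they sit at opposite ends of the original signal, which is the 1-D analogue of the unreal shifted patches in Fig. 1(a).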

To alleviate boundary discontinuity, the cosine window was introduced in early CF trackers, e.g., MOSSE [1] and KCF [2], and has generally been inherited by subsequent improved models [4, 8, 13]. In particular, the cosine window is applied to the base image as a pre-processing step by multiplying it with a cosine shaped function (i.e., larger values for central regions and zeros for boundary pixels). Using KCF [2] as an example, it can be seen from Fig. 1(b) that boundary discontinuity is largely suppressed after deploying the cosine window (e.g., the patch in the cyan box). Nonetheless, the shifted images near the boundary are still plagued, as shown by the patch in the blue box. Moreover, because the cosine window is applied to the base image, it risks contaminating negative training samples, turning them into unreal image patches.
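A minimal sketch of the cosine-window pre-processing and of the contamination it can cause. It assumes a Hann window built with NumPy's `np.hanning` and a hypothetical constant 8×8 patch; real trackers apply the window to feature maps rather than to a constant image.

```python
import numpy as np

H, W = 8, 8
patch = np.ones((H, W))  # hypothetical base image with constant intensity

# 2-D cosine (Hann) window: outer product of two 1-D Hann windows.
c = np.outer(np.hanning(H), np.hanning(W))
windowed = patch * c  # large values near the centre, zeros on the border

# A cyclic shift of the windowed base image is what serves as a negative sample.
shifted = np.roll(windowed, shift=(H // 2, W // 2), axis=(0, 1))

# The zeroed border now sits in the middle of the shifted sample, whereas a
# real patch of this constant image would equal 1 everywhere.
centre_value = shifted[H // 2, W // 2]
```

Here `centre_value` is zero: the window removes the seam at the border, but the shifted sample no longer resembles any real image patch.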

Recently, spatial regularization has also been adopted in numerous CF trackers [8, 12, 9, 13, 14, 15] to alleviate boundary discontinuity; these methods can be roughly grouped into two categories. On the one hand, SRDCF [8] and its later works [12, 14, 15] penalize the filter coefficients near boundaries so that they approach zero. On the other hand, CFLB [9] and its multi-channel extension BACF [13] directly restrict the filter coefficients to be zero outside the target bounding box. In general, existing CF trackers with spatial regularization still adopt the cosine window, and are more effective in handling boundary discontinuity, as illustrated in Fig. 1(c). Even so, the contamination of negative samples remains inevitable and may degrade performance.

Comparing the filters in Fig. 1(a)(b)(c), one can see that the cosine window plays a similar role to spatial regularization in forcing the filter coefficients near the boundary to approach zero. This raises the first question addressed in this work: when can we remove the cosine window from CFs? Using BACF as an example, Fig. 1(d) shows the filters learned by simply removing the cosine window. It can be observed that the filters in Fig. 1(c)(d) are only moderately different in appearance. Our empirical study further shows that BACF without the cosine window performs only slightly worse than BACF. Thus, our answer is that when spatial regularization is deployed, it is possible to remove the cosine window from CF trackers.

The second question addressed in this work is how to remove the cosine window from CF trackers with spatial regularization. To begin with, Fig. 1(d) illustrates three representative samples used in BACF when the cosine window is simply removed. While most samples are real image patches (e.g., those in the green and cyan boxes), a small percentage of negative samples still suffer from boundary discontinuity (e.g., the patch near the boundary in the blue box). To address this issue, we introduce a binary mask function to eliminate the effect of boundary-discontinuous samples. In particular, we assign zero weight to negative samples with discontinuous boundaries, so that the cosine window can be safely removed. To further improve tracking performance, a Gaussian shaped mask function is also presented to place more emphasis on samples near the target center.

To evaluate the feasibility and effectiveness of removing the cosine window, we incorporate our methods into several representative CF trackers with spatial regularization, including BACF [13], STRCF [14], ECO [12], and UPDT [15]. Experiments are then conducted on three tracking benchmarks, i.e., OTB-2015 [16], Temple-Color [17] and VOT-2018 [18]. In comparison to the counterparts with the cosine window, our methods are effective in handling boundary discontinuity while avoiding sample contamination, and give rise to a more robust appearance model as well as better tracking performance. Moreover, when incorporated into UPDT [15], our methods achieve state-of-the-art tracking performance, attaining an EAO score of 0.391 on VOT-2018 and surpassing the rank-1 tracker (i.e., LADCF [19]) in the VOT-2018 challenge.

To sum up, the main contributions of this paper are:

• When to remove the cosine window from CF trackers? Our analysis and empirical study show that both spatial regularization and the cosine window can be used to alleviate boundary discontinuity. When spatial regularization is deployed, it is possible to remove the cosine window from CF trackers.

• How to remove the cosine window from CF trackers with spatial regularization? When the cosine window is removed from CF trackers with spatial regularization, a small percentage of negative samples still suffer from boundary discontinuity. To tackle this issue, two mask functions are introduced for reweighting the estimation error of each training sample. Our methods can be incorporated into multiple representative CF trackers with spatial regularization.

• Experiments on three tracking benchmarks indicate that our methods can eliminate boundary discontinuity while avoiding sample contamination, and perform favorably against the state-of-the-art trackers.

The remainder of this paper is organized as follows. Section II briefly reviews the CF trackers relevant to this work. Section III provides both qualitative and quantitative analysis to dissect the effect of removing cosine window from CF trackers. Section IV further describes our solutions to remove cosine window, which are then incorporated with multiple CF trackers with spatial regularization. Section V reports the experimental results. Finally, Section VI ends this work with several concluding remarks.

II Related Work

The core problem of CF trackers is to learn a discriminative filter for the next frame from the current frame and historical information. Early methods, e.g., MOSSE [1] and KCF [2], formulate the CF framework with a single base image from the current frame, and update the CFs using a linear interpolation strategy. Denote by $\left(\mathbf{x}_{t},\mathbf{y}_{t}\right)$ the sample pair in frame $t$, where each sample $\mathbf{x}_{t}$ consists of $L$ feature maps, $\mathbf{x}_{t}=[\mathbf{x}_{t,1},...,\mathbf{x}_{t,L}]$, and $\mathbf{y}_{t}$ represents the Gaussian shaped label. Then the correlation filter $\mathbf{f}$ is obtained by minimizing the following objective,

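$$\mathcal{E}(\mathbf{f})=\frac{1}{2}\left\|\sum_{l=1}^{L}\mathbf{f}_{l}\star(\mathbf{x}_{t,l}\odot\mathbf{c})-\mathbf{y}_{t}\right\|^{2}+\lambda\mathcal{R}(\mathbf{f}),\tag{1}$$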

where $\star$ and $\odot$ respectively stand for circular convolution and Hadamard product, $\mathbf{c}$ denotes cosine window, and $\lambda$ denotes the tradeoff parameter of the regularization term $\mathcal{R}(\mathbf{f})$ .
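For intuition, the single-base-image objective admits a per-frequency closed form in a simplified setting. The sketch below assumes a 1-D signal, a single channel ($L=1$), and the ridge regularizer $\mathcal{R}(\mathbf{f})=\|\mathbf{f}\|^{2}$ (the text leaves $\mathcal{R}$ generic); the function name `learn_filter_1d` and all data are hypothetical. It then checks the FFT solution against a direct solve with the explicit circulant matrix.

```python
import numpy as np

def learn_filter_1d(x, y, lam):
    """Minimise 0.5 * ||f (circ. conv.) x - y||^2 + lam * ||f||^2 via the FFT."""
    x_hat, y_hat = np.fft.fft(x), np.fft.fft(y)
    # Each frequency decouples into a scalar ridge regression.
    f_hat = np.conj(x_hat) * y_hat / (np.abs(x_hat) ** 2 + 2.0 * lam)
    return np.real(np.fft.ifft(f_hat))

rng = np.random.default_rng(0)
N = 16
x = rng.standard_normal(N) * np.hanning(N)    # windowed base signal, i.e. x ⊙ c
n = np.arange(N)
y = np.exp(-0.5 * ((n - N // 2) / 2.0) ** 2)  # Gaussian shaped label
lam = 0.1

f = learn_filter_1d(x, y, lam)

# Reference: explicit circulant matrix with X[n, m] = x[(n - m) mod N],
# so that X @ f is the circular convolution of f and x.
X = np.stack([np.roll(x, m) for m in range(N)], axis=1)
f_direct = np.linalg.solve(X.T @ X + 2.0 * lam * np.eye(N), X.T @ y)
```

Both routes give the same filter; the FFT route avoids forming and inverting the dense $N\times N$ system, which is the efficiency benefit of the circulant structure.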

Since the pioneering work of MOSSE [1], many improvements have been made to CF trackers. On the one hand, the CF models have been consistently improved with the introduction of non-linear kernel [2], scale adaptivity [4, 5, 6], long-term tracking [20], part-based CFs [21], particle filters [22], spatial regularization [9, 14], continuous convolution [11, 12], and formulation with multiple base images [8, 12, 15]. On the other hand, progress in feature engineering, e.g., HOG [23], ColorName [24] and deep CNN features [25, 26, 27], also greatly benefits the performance of CF trackers.

Among these improvements, we specifically mention a category of CF formulations with multiple base images [8, 12, 15]. Given a set of $K$ base images $\{\left(\mathbf{x}_{k},\mathbf{y}_{k}\right)\}_{k=1}^{K}$ , CF with multiple base images can then be expressed as,

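$$\mathcal{E}(\mathbf{f})=\frac{1}{2}\sum_{k=1}^{K}\alpha_{k}\left\|\sum_{l=1}^{L}\mathbf{f}_{l}\star(\mathbf{x}_{k,l}\odot\mathbf{c})-\mathbf{y}_{k}\right\|^{2}+\lambda\mathcal{R}(\mathbf{f}),\tag{2}$$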

where $\alpha_{k}$ represents the weight of the $k$ -th base image $\mathbf{x}_{k}$ . For example, SRDCF [8] and CCOT [11] simply adopt the latest $K$ frames as base images. In SRDCFdecon [28], an adaptive decontamination model is presented to downweight corrupted samples while up-weighting faithful ones. ECO [12] and UPDT [15] apply a Gaussian mixture model (GMM) to determine both the weights as well as base images. In general, CF trackers with multiple base images perform much better than those with single base image, and have achieved state-of-the-art tracking performance.

In contrast to CF with single base image in Eqn. (1), the introduction of multiple base images breaks the circulant structure, and generally requires iterative optimization algorithms to solve the resulting formulation in Eqn. (2). Therefore, in this work different solutions are respectively developed for removing cosine window from CF trackers with single and multiple base images.

III When to Remove Cosine Window

The cosine window was first introduced in the early MOSSE and KCF methods to alleviate the effect of boundary discontinuity, and was then adopted in all subsequent CF trackers. In recent years, spatial regularization has also been deployed in CF trackers for handling boundary discontinuity. Although the cosine window is still used in CFs with spatial regularization, the two play similar roles, so it is natural to ask whether the cosine window can be removed when spatial regularization is adopted.

In this section, we use KCF and BACF as two representative examples, and evaluate the performance of CF trackers with and without the cosine window on the VOT-2018 dataset. Here we denote KCF and BACF without the cosine window as KCF_RC and BACF_RC, respectively. Table I lists their EAO, accuracy and robustness on VOT-2018. For KCF, removing the cosine window is harmful to tracking performance and causes an obvious EAO drop from 0.106 to 0.069. From Fig. 1(a)(b), the filter learned by KCF differs considerably in appearance from that learned by KCF_RC. Moreover, the cosine window plays a similar role in forcing non-central filter coefficients to approach zero. In contrast to KCF, the cosine window plays only a minor role in improving tracking performance for BACF: the EAO of BACF_RC is only 0.013 lower than that of BACF. From Fig. 1(c)(d), the filters learned by BACF and BACF_RC are also similar in appearance. Similar results can be observed for STRCF [14], ECO [12] and UPDT [15], and on the OTB-2015 and Temple-Color benchmarks in Section V, indicating that it is possible to remove the cosine window from CF trackers when spatial regularization is introduced.
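The size of the two EAO drops can be made concrete with a few lines of arithmetic on the Table I values. The snippet is only a worked calculation; the dictionary layout and variable names are illustrative.

```python
# EAO with and without the cosine window, copied from Table I.
eao = {"KCF": (0.106, 0.069), "BACF": (0.137, 0.124)}

relative_drop = {}
for name, (with_cw, without_cw) in eao.items():
    absolute = with_cw - without_cw
    relative_drop[name] = absolute / with_cw
    print(f"{name}: EAO drop {absolute:.3f} ({relative_drop[name]:.1%} of the baseline)")
```

Under these numbers the relative loss is roughly 35% for KCF but under 10% for BACF, which is the gap the paragraph above describes.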

We also note that BACF still slightly outperforms BACF_RC, which can be explained by taking both boundary discontinuity and sample contamination into account. From Fig. 1(c), it can be seen that BACF handles boundary discontinuity well by combining the cosine window and spatial regularization. However, the cosine window is applied to the base image, which contaminates the shifted negative samples and may be harmful to tracking performance. In BACF_RC, one can see from Fig. 1(d) that most negative samples are real image patches (e.g., those in the green and cyan boxes). However, a small percentage of negative samples still suffer from boundary discontinuity (e.g., that in the blue box), which may explain the slight inferiority of BACF_RC in comparison to BACF. To sum up, when removing the cosine window, it is better both to avoid sample contamination and to eliminate boundary discontinuity for all negative samples. Thus, we turn to the second problem of this work, i.e., how to remove the cosine window from CF trackers with spatial regularization, and present our solutions in the next section.

IV How to Remove Cosine Window

Simply removing the cosine window from a CF tracker with spatial regularization generally cannot outperform its counterpart, because the negative samples near the boundary still suffer from boundary discontinuity. To address this issue, we modify the formulation of CF trackers by introducing a mask function to deactivate the boundary-discontinuous samples. Two mask functions are presented to eliminate boundary discontinuity and to place more emphasis on samples near the target center. Then, optimization algorithms are developed for removing the cosine window from CF trackers with single and multiple base images, respectively.

IV-A Problem formulation

Without loss of generality, we use BACF as an example to analyze the positions of negative samples suffering from boundary discontinuity. Suppose that the sizes of the target bounding box and base image are $h\times w$ and $H\times W$, respectively. For BACF, we have $H=W=5\sqrt{hw}$. From Fig. 2(a), it can be seen that a sample at position $(x,y)$ has discontinuous boundaries only when $\frac{H}{2}\geq|x|>\frac{H}{2}-\frac{h}{2}$ or $\frac{W}{2}\geq|y|>\frac{W}{2}-\frac{w}{2}$. In general, $H$ ($W$) is much larger than $h$ ($w$), and thus the majority of samples are real image patches (e.g., $64\%$ when $h=w$, since then $H=W=5h$ and the continuous region covers $(4h/5h)^{2}=0.64$ of the base image). In order to eliminate the effect of boundary discontinuity, we introduce a binary mask function $\mathbf{M}$, shown in Fig. 2(b), to indicate the samples that are real image patches. In particular, a sample at position $(x,y)$ is a real image patch when ${M}(x,y)=1$. Then the binary mask $\mathbf{M}$ can be defined as follows,

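$$M(x,y)=\begin{cases}1, & \text{if } |x|\leq\frac{H}{2}-\frac{h}{2} \text{ and } |y|\leq\frac{W}{2}-\frac{w}{2},\\ 0, & \text{otherwise.}\end{cases}\tag{3}$$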
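A minimal NumPy sketch of the binary mask, which equals 1 where $|x|\leq\frac{H}{2}-\frac{h}{2}$ and $|y|\leq\frac{W}{2}-\frac{w}{2}$ and 0 elsewhere. The function name `binary_mask` and the choice of a pixel grid centred on the base image are assumptions, not details taken from the paper or its released code.

```python
import numpy as np

def binary_mask(H, W, h, w):
    """Return an (H, W) array that is 1 for boundary-continuous shifts, else 0."""
    # Shift positions measured from the centre of the base image (assumed grid).
    x = np.arange(H) - (H - 1) / 2.0
    y = np.arange(W) - (W - 1) / 2.0
    keep_x = np.abs(x) <= H / 2.0 - h / 2.0
    keep_y = np.abs(y) <= W / 2.0 - w / 2.0
    # A shift is kept only if both coordinates satisfy the condition.
    return (keep_x[:, None] & keep_y[None, :]).astype(float)

# BACF-style sizes with h = w, so that H = W = 5 * sqrt(h * w) = 5 * h.
h = w = 20
H = W = 5 * h
M = binary_mask(H, W, h, w)
kept_fraction = M.mean()  # share of samples that are real image patches
```

With these sizes `kept_fraction` comes out at 0.64, matching the $64\%$ figure quoted for $h=w$.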

In the following, we use Eqn. (2) as a general form to illustrate how to eliminate boundary discontinuity while avoiding sample contamination for CF trackers with spatial regularization. In particular, we remove cosine window from Eqn. (2), and incorporate the binary mask $\mathbf{M}$ to deactivate the negative samples suffering from boundary discontinuity, resulting in the following model,

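$$\mathcal{E}(\mathbf{f})=\frac{1}{2}\sum_{k=1}^{K}\alpha_{k}\left\|\mathbf{M}\odot\left(\sum_{l=1}^{L}\mathbf{f}_{l}\star\mathbf{x}_{k,l}-\mathbf{y}_{k}\right)\right\|^{2}+\lambda\mathcal{R}(\mathbf{f}).\tag{4}$$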

With the introduction of $\mathbf{M}$ , the estimation error of the sample with discontinuous boundary can be safely excluded during training. In comparison to CF tracker with cosine window in Eqn. (2), the formulation in Eqn. (4) can circumvent both boundary discontinuity and sample contamination, thereby benefiting tracking performance.
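To make the role of the mask concrete, the data term of the masked model can be evaluated directly. The sketch below is illustrative only: the array shapes, the helper `circ_conv2`, and the function `masked_data_term` are assumptions, and the regularizer $\lambda\mathcal{R}(\mathbf{f})$ is omitted because the text leaves it generic.

```python
import numpy as np

def circ_conv2(a, b):
    """2-D circular convolution computed with the FFT."""
    return np.real(np.fft.ifft2(np.fft.fft2(a) * np.fft.fft2(b)))

def masked_data_term(filters, samples, labels, weights, mask):
    """0.5 * sum_k alpha_k * || M ⊙ (sum_l f_l ⋆ x_{k,l} - y_k) ||^2.

    filters: (L, H, W), samples: (K, L, H, W), labels: (K, H, W),
    weights: (K,), mask: (H, W).
    """
    total = 0.0
    for x_k, y_k, alpha_k in zip(samples, labels, weights):
        response = sum(circ_conv2(f_l, x_kl) for f_l, x_kl in zip(filters, x_k))
        # Errors at masked-out positions contribute nothing to the loss.
        total += alpha_k * np.sum((mask * (response - y_k)) ** 2)
    return 0.5 * total

rng = np.random.default_rng(0)
K, L, H, W = 2, 3, 8, 8
samples = rng.standard_normal((K, L, H, W))   # hypothetical feature maps
labels = rng.standard_normal((K, H, W))       # hypothetical labels
weights = np.array([0.7, 0.3])
filters = rng.standard_normal((L, H, W))

inner = np.zeros((H, W))
inner[2:-2, 2:-2] = 1.0                       # toy mask that drops the border
loss_full = masked_data_term(filters, samples, labels, weights, np.ones((H, W)))
loss_masked = masked_data_term(filters, samples, labels, weights, inner)
```

Because the mask only removes non-negative squared-error terms, `loss_masked` can never exceed `loss_full`; the positions set to zero simply stop influencing the learned filter.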

Furthermore, the CF model is usually learned from an unbalanced set containing few positive samples and a large number of negative samples. The binary mask $\mathbf{M}$ treats all boundary-continuous samples equally, and risks degrading tracking performance because of the vast number of negative samples. Considering that the samples near the target center are more important than those on image boundaries, we also present a Gaussian shaped mask function $\mathbf{M}_{G}$ defined as,

[TABLE]

where the parameter $\delta$ is introduced to control the weight decay speed of training samples. Empirical study also validates that Gaussian shaped mask function $\mathbf{M}_{G}$ generally performs moderately better than binary mask function $\mathbf{M}$ for CF trackers with spatial regularization.
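To make the role of $\delta$ concrete, the following sketch builds one plausible Gaussian shaped mask. The exact form of Eqn. (5) is not reproduced here: the sketch assumes that the weight is a Gaussian of the shift, with widths $\delta h$ and $\delta w$ , multiplied by the binary mask, which is only one choice consistent with the stated behavior (weights decay away from the center, and the mask approaches $\mathbf{M}$ as $\delta$ grows). The function name `gaussian_mask` is hypothetical.

```python
import math


def gaussian_mask(H, W, h, w, delta):
    """Assumed Gaussian shaped mask: binary validity times a Gaussian decay in the shift."""
    xs = range(-(H // 2), H - H // 2)
    ys = range(-(W // 2), W - W // 2)
    x_limit, y_limit = H / 2 - h / 2, W / 2 - w / 2
    mask = []
    for x in xs:
        row = []
        for y in ys:
            if abs(x) <= x_limit and abs(y) <= y_limit:
                # Assumption: the decay widths scale with the target size and delta.
                r2 = (x / (delta * h)) ** 2 + (y / (delta * w)) ** 2
                row.append(math.exp(-0.5 * r2))
            else:
                # Boundary-discontinuous samples stay fully deactivated.
                row.append(0.0)
        mask.append(row)
    return mask


H = W = 50
h = w = 10
narrow = gaussian_mask(H, W, h, w, delta=1.2)
wide = gaussian_mask(H, W, h, w, delta=2.0)
flat = gaussian_mask(H, W, h, w, delta=1e6)
```

Under this assumed form, a larger $\delta$ gives a slower decay, so more negative samples away from the center keep a non-negligible weight, and a very large $\delta$ recovers the binary mask up to rounding.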

Given a specific CF tracker, we denote the models by (i) removing cosine window, (ii) removing cosine window and incorporating binary mask function, (iii) removing cosine window and incorporating Gaussian mask function as CFRC, CFRCB, and CFRCG, respectively. In the following, we present the optimization algorithms to solve the model in Eqn. (4) for CF trackers with single and multiple base images, respectively.

IV-B Solution for CF trackers with single base image

For CFLB [29], BACF [13], CSR-DCF [10] and STRCF [14], the filter is updated by solving a CF model defined on a single base image (i.e., the current frame). In this case, the resulting constrained optimization problem can be efficiently solved via alternating minimization, in which each subproblem has a closed-form solution. When removing the cosine window from this category of CF trackers with spatial regularization, we rewrite the model in Eqn. (4) as

[TABLE]

Then, alternating minimization algorithms can also be extended to solve it. In the following, we take BACF as an example, and present an alternating direction method of multipliers (ADMM) to optimize the resulting formulation.

With simple algebra, the original formulation of BACF can be equivalently rewritten as,

[TABLE]

where $\mathbf{P}$ stands for the binary mask matrix which crops the central $D$ elements of $\mathbf{g}_{l}$ with the size of $T$ . After removing the cosine window and incorporating the mask function $\mathbf{M}$ , we further let $\mathbf{f}_{l}=\mathbf{P}^{\text{T}}\mathbf{g}_{l}$ , and the modified BACF model can be formulated as,

[TABLE]

The model in Eqn. (8) is still a convex optimization problem and can be solved with the ADMM algorithm. To begin with, we introduce another auxiliary variable $\mathbf{z}=\sum_{l=1}^{L}\mathbf{x}_{l}\star\mathbf{f}_{l}-\mathbf{y}$ , and reformulate Eqn. (8) as,

[TABLE]

Then the augmented Lagrangian function of Eqn. (9) can be expressed as,

[TABLE]

where $\boldsymbol{\zeta}$ , $\boldsymbol{\gamma}$ denote the Lagrangian multipliers, and $\mu$ , $\tau$ represent the penalty parameters, respectively. Eqn. (10) can be solved iteratively with the ADMM algorithm, in which all the subproblems, i.e., $\mathbf{f}$ , $\mathbf{g}$ and $\mathbf{z}$ , have their closed-form solutions. In the following, we present the solution of each subproblem.

Subproblem $\mathbf{g}$ :

[TABLE]

Note that each channel of $\mathbf{g}$ in Eqn. (11) can be computed independently, thus the closed-form solution of the $l$ -th channel of $\mathbf{g}$ can be expressed as,

[TABLE]

where $\mathbf{I}$ denotes an identity matrix. Note that $\lambda\mathbf{I}+\mu\mathbf{P}\mathbf{P}^{\text{T}}$ is a diagonal matrix and its inverse matrix can be efficiently computed via element-wise operation.

Subproblem $\mathbf{f}$ :

[TABLE]

Using Parseval’s theorem, Eqn. (13) can be equivalently expressed in the Fourier domain,

[TABLE]

Here $\hat{\mathbf{x}}=\sqrt{T}\mathbf{F}\mathbf{x}$ represents the FFT of sample $\mathbf{x}$ where $\mathbf{F}$ is the orthonormal Discrete Fourier Transform (DFT) matrix, and $\hat{\mathbf{q}}_{l}$ takes the form of $\hat{\mathbf{q}}_{l}=\sqrt{T}\mathbf{FP}^{\text{T}}\mathbf{g}_{l}$ . Analogous to BACF [13], the solution for $\hat{\mathbf{f}}$ can be divided into $T$ independent subproblems. Denote by $\mathbf{x}\left(t\right)\in\mathbb{R}^{L}$ the vector consisting of $t$ -th elements of sample $\mathbf{x}$ along all $L$ channels, then the $t$ -th elements $\hat{\mathbf{f}}(t)$ of $\hat{\mathbf{f}}$ can be computed by,

[TABLE]

Note that $\hat{\mathbf{x}}\left(t\right)\hat{\mathbf{x}}\left(t\right)^{\text{T}}$ is a rank-1 matrix, thus Eqn. (15) can be efficiently solved with the Sherman-Morrison formula [30],

[TABLE]

where $\hat{s}_{\mathbf{x}}\left(t\right)\!=\!\hat{\mathbf{x}}\!\left(t\right)^{\text{T}}\!\hat{\mathbf{x}}\!\left(t\right)$ , $\hat{s}_{\boldsymbol{\zeta}}\left(t\right)\!=\!\hat{\mathbf{x}}\!\left(t\right)^{\text{T}}\!\hat{\mathbf{\boldsymbol{\zeta}}}\!\left(t\right)$ , $\hat{s}_{\mathbf{q}}\!\left(t\right)\!=\!\hat{\mathbf{x}}\!\left(t\right)^{\text{T}}\!\hat{\mathbf{q}}\!\left(t\right)$ and $b\!=\!\frac{\mu}{\tau}+\hat{s}_{\mathbf{x}}\left(t\right)$ . The solution for $\mathbf{f}$ is then obtained with the inverse DFT.
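The reason the Sherman-Morrison formula makes each of the $T$ subproblems cheap can be shown with a small generic sketch. It solves $(\mathbf{x}\mathbf{x}^{\text{H}}+c\mathbf{I})\mathbf{u}=\mathbf{v}$ in $O(L)$ operations without forming the $L\times L$ matrix. Here $c$ plays the role of $\frac{\mu}{\tau}$ , so that $c+\mathbf{x}^{\text{H}}\mathbf{x}$ matches $b$ in the text; the right-hand side $\mathbf{v}$ is a placeholder rather than the exact combination of terms in Eqn. (15), and the use of the conjugate transpose for complex Fourier coefficients is an assumption. The function name is hypothetical.

```python
def solve_rank1_plus_identity(x, v, c):
    """Solve (x x^H + c I) u = v for u via the Sherman-Morrison formula, with c > 0."""
    s_x = sum(abs(xi) ** 2 for xi in x)                      # x^H x, a real scalar
    s_v = sum(xi.conjugate() * vi for xi, vi in zip(x, v))   # x^H v, a complex scalar
    b = c + s_x                                              # counterpart of b in the text
    # (x x^H + c I)^{-1} v = (v - x (x^H v) / b) / c
    return [(vi - xi * s_v / b) / c for xi, vi in zip(x, v)]


# One frequency position with L = 3 channels (illustrative values only).
x = [1 + 2j, -0.5j, 3.0 + 0j]
v = [2.0 + 0j, 1 - 1j, 0.5j]
c = 0.8
u = solve_rank1_plus_identity(x, v, c)

# Check by multiplying back: (x x^H + c I) u should reproduce v.
x_h_u = sum(xi.conjugate() * ui for xi, ui in zip(x, u))
reconstructed = [xi * x_h_u + c * ui for xi, ui in zip(x, u)]
residual = max(abs(r - vi) for r, vi in zip(reconstructed, v))
```

Since only inner products along the channel dimension are needed, the cost per frequency position is linear in the number of channels.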

Subproblem $\mathbf{z}$ :

[TABLE]

Analogous to Eqn. (11), each element in $\mathbf{z}$ can also be computed independently, and its solution is given as,

[TABLE]

where $\mathbf{1}$ denotes a vector whose elements all equal $1$ , and $\text{Diag}(\cdot)$ constructs a diagonal matrix from a vector.
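How the mask enters an element-wise update of this kind can be sketched under an assumed objective. The sketch below does not reproduce Eqn. (18); it assumes that each element of $\mathbf{z}$ minimizes $\frac{1}{2}m_{i}z_{i}^{2}+\frac{\tau}{2}(z_{i}-v_{i})^{2}$ , where $m_{i}\geq 0$ is the mask weight of sample $i$ and $v_{i}$ is a hypothetical stand-in for the coupling target formed from the current residual and the scaled multiplier. Setting the derivative to zero gives $z_{i}=\frac{\tau v_{i}}{m_{i}+\tau}$ , and the division by $m_{i}+\tau$ is the element-wise counterpart of inverting a diagonal matrix.

```python
def update_z(mask_weights, v, tau):
    """Element-wise minimizer of 0.5 * m * z**2 + 0.5 * tau * (z - v)**2 (assumed objective)."""
    return [tau * vi / (mi + tau) for mi, vi in zip(mask_weights, v)]


# Two samples with the same coupling target: one active (m = 1), one deactivated (m = 0).
z = update_z(mask_weights=[1.0, 0.0], v=[1.0, 1.0], tau=2.5)
```

Under this assumption, a deactivated sample ( $m_{i}=0$ ) simply copies its target, so its estimation error is not penalized, whereas an active sample is shrunk toward zero.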

Lagrangian Update: The Lagrangian multipliers $\boldsymbol{\zeta}$ , $\boldsymbol{\gamma}$ are updated as,

[TABLE]

where $\mathbf{f}^{(t+1)}$ , $\mathbf{g}^{(t+1)}$ and $\mathbf{z}^{(t+1)}$ are the current solutions to the above subproblems at iteration $t+1$ within the iterative ADMM algorithm.

Finally, we also note that the above solutions can be easily extended to remove cosine window from other CF trackers (e.g., STRCF) with a single base image.

IV-C Solution for CF trackers with multiple base images

Another category of CF trackers with spatial regularization is defined on multiple base images, which inevitably breaks the circulant structure and generally requires iterative optimization to solve some of the resulting subproblems. Representative trackers in this category include SRDCF [8], CCOT [11], ECO [12] and UPDT [15]. In this subsection, we use ECO as an example to suggest an iterative optimization method for removing the cosine window. Without loss of generality, our solution can be easily extended to remove the cosine window from other CF trackers based on multiple base images (e.g., UPDT [15]).

In general, the learning algorithm in ECO consists of two stages. (i) In the first frame, a sample projection matrix is learned with the CF to reduce the number of feature channels in training samples. (ii) In the subsequent frames, the projection matrix is fixed and the CFs are further updated with the reduced features. To be consistent with the ECO tracker [12], we also define the formulation for data on a one-dimensional domain. Denote by $\{(\mathbf{x}_{k},\mathbf{y}_{k})\}_{k=1}^{K}$ a collection of $K$ sample pairs, where the feature map size of the $l$ -th channel $\mathbf{x}_{k,l}$ is $N_{l}$ . The feature map $\mathbf{x}_{k,l}$ in the ECO tracker is first transformed into the continuous spatial domain $t\in[0,T)$ with an interpolation operator $J_{l}$ ,

[TABLE]

where $b_{l}$ is an interpolation kernel with the period $T>0$ . Suppose the reduced correlation filter $\mathbf{f}=[\mathbf{f}_{1},...,\mathbf{f}_{D}]$ consists of $D$ feature maps with $D<L$ , and the sample projection matrix $\mathbf{Q}\in\mathbb{R}^{L\times D}$ is represented with $\mathbf{Q}=\left(q_{l,d}\right)$ . Then the filter $\mathbf{f}$ and sample projection matrix $\mathbf{Q}$ can be computed by minimizing the following objective function,

[TABLE]

where $\mathbf{w}$ denotes the spatial regularization matrix.

When removing cosine window and incorporating the mask function $\mathbf{M}$ , the ECO model can be modified as,

[TABLE]

To solve Eqn. (22), we introduce a series of auxiliary variables [ $\mathbf{z}_{1}$ , …, $\mathbf{z}_{K}$ ] with $\mathbf{z}_{k}\!=\!\!\sqrt{\alpha_{k}}(\sum\limits_{d=1}^{D}\sum\limits_{l=1}^{L}\!q_{l,d}\mathbf{f}_{d}\star\!J_{l}\{\mathbf{x}_{k,l}\}\!-\mathbf{y}_{k})$ , then it can be relaxed as,

[TABLE]

where $\tau$ is a penalty parameter which is updated along with the iterations.

We suggest an iterative optimization algorithm for solving the problem in Eqn. (23). In particular, we minimize the objective in each iteration by alternating between updating the auxiliary variables $\mathbf{z}_{k}$ and the model parameters $\{\mathbf{f}$ , $\mathbf{Q}\}$ , which is further explained as follows.

Updating $\{\mathbf{f},\mathbf{Q}\}$ : Given the auxiliary variables [ $\mathbf{z}_{1}$ , …, $\mathbf{z}_{K}$ ], the subproblem shares a similar formulation with Eqn. (21), and thus can be minimized with the optimization method used in the ECO tracker.

Updating $\mathbf{z}$ : Analogous to Eqn. (17), the closed-form solution for $\mathbf{z}_{k}$ can be computed by,

[TABLE]

V Experiments

In this section, we evaluate the feasibility and effectiveness of removing cosine window by integrating it into five representative CF trackers with spatial regularization, i.e., BACF, STRCF, ECOhc, ECO and UPDT. Then, extensive experiments are conducted to compare our methods with the state-of-the-art methods on three popular tracking benchmarks, i.e., OTB-2015 [16], Temple-Color [17] and VOT-2018 [18] datasets.

V-A Baseline CF trackers

Our methods are generic and can be integrated into multiple CF trackers with spatial regularization, whether they use single or multiple base images and handcrafted or deep CNN features. In the experiments, we choose three baseline CF trackers using handcrafted features, i.e., BACF [13], ECOhc [12] and STRCF [14]. Moreover, we also consider two state-of-the-art baseline CF trackers using CNN features, i.e., ECO [12] and UPDT [15]. It is worth noting that we only incorporate our method with UPDT on the VOT-2018 dataset, because UPDT employs the difficult videos from OTB-2015 for parameter tuning and most of these videos also exist in Temple-Color. Besides, two CF trackers without spatial regularization, i.e., MOSSE and KCF, are also included to illustrate when to remove the cosine window from CF trackers.

V-B Implementation details

We employ the publicly available code provided by the authors to reproduce the results of the baseline CF trackers and competing methods. For our modified trackers without cosine window, we keep most of the parameters the same as in their counterparts, and mainly fine-tune the parameters added by our methods. In particular, we set the penalty parameters $\tau$ , $\mu$ , and the number of iterations in CFRCG to {2.5, 2.5, 3}, respectively. The penalty parameters $\tau$ , $\mu$ are updated along with iterations by $\tau^{(t+1)}=\min(p\tau^{(t)},\tau_{max})$ and $\mu^{(t+1)}=\min(p\mu^{(t)},\mu_{max})$ , where $\tau_{max}$ , $\mu_{max}$ and $p$ are set to $\{100,100,1.05\}$ , respectively. As for the ECO trackers, the parameter $\tau$ and the number of iterations are set to {2.2, 4} and {2.5, 5} for ECOhcRCG and ECORCG, respectively. In addition, we set the standard deviation parameter $\delta$ in Eqn. (5) to {1.2, 1.4, 2} for BACFRCG, ECOhcRCG and ECORCG, respectively. Note that we employ the same parameter settings for each tracker throughout the experiments on all datasets. Our method is implemented in MATLAB 2017b with the MatConvNet library [31], and all the experiments are run on a PC equipped with an Intel i7 7700 CPU, 32GB RAM and a single NVIDIA GTX 1070 GPU.
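The penalty update rule $\tau^{(t+1)}=\min(p\tau^{(t)},\tau_{max})$ can be traced with a short sketch. It assumes that the initial value is used in the first iteration and that the update is applied between iterations; the function name `penalty_schedule` is hypothetical.

```python
def penalty_schedule(initial, growth, maximum, iterations):
    """Penalty values used over the iterations under tau <- min(growth * tau, maximum)."""
    values = [initial]
    for _ in range(iterations - 1):
        values.append(min(growth * values[-1], maximum))
    return values


# Settings quoted above for the single-base-image trackers: tau = 2.5, p = 1.05, 3 iterations.
taus = penalty_schedule(initial=2.5, growth=1.05, maximum=100.0, iterations=3)
print(taus)

# The cap only matters when the penalty grows past the maximum.
capped = penalty_schedule(initial=90.0, growth=1.2, maximum=100.0, iterations=3)
```

With only three iterations and $p=1.05$ , the penalty stays far below $\tau_{max}=100$ , so under this reading the cap is not active for the quoted settings.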

V-C Internal Analysis of our methods

V-C1 Ablation study

In this section, we study the effect of removing cosine window, incorporating binary or Gaussian shaped mask functions into the baseline CF trackers using the VOT-2018 benchmark [18]. To this end, we implement four variants for each baseline CF tracker, i.e., the baseline CF (termed as Baseline), removing cosine window ( $\bf{{RC}}$ ), removing cosine window and incorporating binary mask function ( $\bf{{RCB}}$ ), and removing cosine window and incorporating Gaussian shaped mask function ( $\bf{{RCG}}$ ). In addition, we also include MOSSE and KCF as baseline trackers to show that their performance is degraded by removing cosine window and cannot be remedied by incorporating mask function. Following the protocols in [32], we evaluate the performance of each method using Expected Average Overlap (EAO), accuracy and robustness as performance measures.

Table II presents the results of all the variants on the VOT-2018 dataset. One can observe that when the cosine window is removed from MOSSE and KCF, the tracking performance degrades significantly. Even with the introduction of binary or Gaussian shaped mask functions, they remain inferior to their Baseline counterparts. Thus, the cosine window cannot be removed from CF trackers without spatial regularization.

As for the CF trackers with spatial regularization, we can make the following observations. (i) In comparison to Baseline, the performance of the RC variants degrades slightly, with a drop of $\sim$ 1.5 in terms of EAO. This degradation can be explained by the fact that a small percentage of negative samples still suffer from boundary discontinuity, which may be harmful to tracking performance. (ii) By integrating the binary mask function $\mathbf{M}$ into the CF trackers with spatial regularization, the RCB variants consistently outperform both RC and Baseline. In terms of EAO, the performance gain of RCB is about $0.02\sim 0.04$ against RC and about $0.015\sim 0.03$ against Baseline. The improvement can be ascribed to RCB being more effective than Baseline and RC in handling both boundary discontinuity and sample contamination. (iii) The introduction of the Gaussian shaped mask function $\mathbf{M}_{G}$ can further boost the performance of CF trackers with spatial regularization, indicating that the samples near the target center should be emphasized more in the modified CF models. (iv) Finally, RCB and RCG significantly improve robustness over the Baseline trackers, with a lower failure rate. In terms of accuracy, RCB and RCG perform on par with Baseline and RC, indicating that the gain of the mask function should be attributed to the improved robustness of appearance modeling.

To sum up, the results empirically validate our answers to the two problems concerned in this work. (i) It is feasible to remove the cosine window from CF trackers with spatial regularization. (ii) By incorporating a mask function, we can not only safely remove the cosine window from CF trackers with spatial regularization, but also bring moderate performance gains over their Baseline counterparts.

V-C2 Effect of hyper-parameter $\delta$ in $\mathbf{M}_{G}$

The hyper-parameter $\delta$ in the Gaussian shaped mask function $\mathbf{M}_{G}$ controls how fast the weights of training samples decay from the target center to the boundaries. In particular, a higher $\delta$ means a slower decay, so more negative samples near the boundary are considered during training. When $\delta\to+\infty$ , the Gaussian shaped mask function $\mathbf{M}_{G}$ degenerates to the binary mask function $\mathbf{M}$ . Using BACF, ECOhc and ECO, we analyze the effect of the hyper-parameter $\delta$ on tracking performance. Concretely, Fig. 3 shows the EAO plots of the three trackers with different $\delta$ values on the VOT-2018 dataset. It can be seen that the choice of $\delta$ has a significant effect on the EAO score for all three trackers. For BACF, ECOhc and ECO, the RCG methods achieve the best performance when $\delta=\{1.2,1.4,2\}$ , respectively.

V-D VOT-2018 benchmark

To further assess the proposed methods, we compare our best trackers (i.e., UPDTRCG and ECORCG) with the state-of-the-art trackers on the VOT-2018 dataset. VOT-2018 consists of 60 challenging videos collected from real-life datasets. In the benchmark, a tracker is re-initialized with the ground-truth bounding box whenever it significantly drifts from the target. The performance is evaluated with three measures: accuracy, robustness and EAO. The accuracy is the average overlap between estimated bounding boxes and ground-truth annotations. The robustness score counts the number of tracking failures. The EAO measure is a principled combination of the accuracy and robustness scores.

Table III lists the results of our UPDTRCG and ECORCG, ECO, and the ten best-performing trackers in the VOT-2018 challenge. For a fair comparison, we reproduce the results of ECO and UPDT with their publicly available code on the VOT-2018 challenge website, and the UPDT result is reported as the average score over 15 runs. We also note that the reported EAO score of UPDT in the VOT-2018 challenge is 0.378, while our reproduced result with the released code is 0.352. From Table III, we can observe that UPDTRCG slightly outperforms the VOT-2018 challenge winner LADCF and ranks first among all the competing trackers. UPDTRCG is also superior to its counterpart UPDT by an EAO gain of 3.9%, indicating the feasibility and benefit of removing the cosine window. Not surprisingly, ECORCG also shows its superiority over its ECO counterpart, with an improvement of 2.5% in EAO score.

V-E OTB-2015 dataset

The OTB-2015 dataset [16] consists of 100 fully annotated videos with 11 video attributes, including illumination variation (IV), scale variation (SV), occlusion (OCC), in-plane rotation (IPR), out-of-plane rotation (OPR), motion blur (MB), fast motion (FM), deformation (DEF), background clutter (BC), out of view (OV) and low resolution (LR). Following the settings given in [16], we evaluate the trackers based on the One Pass Evaluation (OPE) protocol, and adopt the overlap precision (OP) metric, i.e., the fraction of frames in a sequence whose bounding box overlap exceeds 0.5. Besides, we also present the overlap success plots with different overlap thresholds for detailed comparison.
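The OP metric and the area under the success plot can be sketched as follows. This is a minimal illustration rather than the benchmark toolkit: it assumes boxes given as (x, y, width, height) with a top-left origin, a strict inequality at the threshold, and an evenly spaced threshold grid from 0 to 1, and all function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def overlap_precision(predicted, ground_truth, threshold=0.5):
    """Fraction of frames whose overlap with the ground truth exceeds the threshold."""
    overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(o > threshold for o in overlaps) / len(overlaps)


def success_auc(predicted, ground_truth, num_thresholds=21):
    """Mean success rate over an evenly spaced threshold grid (assumed discretization)."""
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [overlap_precision(predicted, ground_truth, t) for t in thresholds]
    return sum(rates) / len(rates)


# Three frames: a perfect hit, a partial overlap (IoU = 1/3), and a miss.
pred = [(0, 0, 10, 10), (0, 0, 10, 10), (20, 20, 10, 10)]
gt = [(0, 0, 10, 10), (5, 0, 10, 10), (0, 0, 10, 10)]
op = overlap_precision(pred, gt)
auc = success_auc(pred, gt)
```

In this toy sequence only the first frame exceeds the 0.5 threshold, so the OP is one third, while the AUC also credits the partial overlap at lower thresholds.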

To assess our methods, we compare four of them (i.e., STRCFRCG, ECOhcRCG, BACFRCG and ECORCG) with 22 state-of-the-art trackers, which can be roughly grouped into two categories: (i) trackers using handcrafted features (i.e., STRCF [14], ECOhc [12], BACF [13], DSST [4], SAMFAT [37], Staple [38], TRACA [39], SRDCFDecon [28], SRDCF [8], SKSCF [7]), and (ii) trackers using deep CNN features (i.e., ECO [12], CCOT [11], MDNet [40], CNN-SVM [41], FCNT [42], CF-Net [43], DeepSTRCF [14], VITAL [44], DeepSRDCF [45], SiameseFC [46], HDT [26] and HCF [25]). In particular, STRCFRCG, ECOhcRCG, and BACFRCG are compared with the trackers using handcrafted features, while ECORCG is compared with the trackers using deep CNN features. For a fair comparison, UPDT and UPDTRCG are not included in the comparison because UPDT adopts the difficult videos from OTB-2015 for parameter tuning.

V-E1 Comparison with state-of-the-arts

We compare the proposed methods with the state-of-the-art trackers on OTB-2015. Table IV lists the mean OP results of all the competing methods. One can see that our methods are consistently superior to their baseline counterparts. Using handcrafted features, BACFRCG, ECOhcRCG and STRCFRCG outperform their counterparts with mean OP gains of 1.1%, 1.3% and 2.3%, respectively. Using deep CNN features, ECORCG also surpasses its ECO counterpart by 1.9% in terms of mean OP. Moreover, our STRCFRCG achieves the best mean OP among the trackers using handcrafted features, while our ECORCG performs the best among those using deep CNN features. Furthermore, Fig. 4 shows the overlap success curves of the competing methods, which are ranked with the Area-Under-the-Curve (AUC) score. Not surprisingly, our methods perform favorably against the competing trackers using handcrafted and deep CNN features.

V-E2 Attribute comparison

Using the handcrafted features, we further investigate the performance of our methods on all 11 video attributes. Table V gives the mean OP results of all the trackers. One can see that our ECORCG and STRCFRCG obtain the rank-1 performance on 9 of all 11 video attributes. For the attributes motion blur, background clutter, illumination variation and occlusion, significant improvement can be achieved by our methods. By removing cosine window and incorporating mask function, our methods are more effective in exploiting negative samples for model learning, and benefit the robustness of tracking performance. This may explain the better results of our methods when the target suffers from rapid appearance changes (e.g., motion blur, occlusion, and illumination variation) and background clutter. In addition, Fig. 5 provides the AUC plots of all competing trackers using handcrafted features on all video attributes. It can be seen that our ECORCG and STRCFRCG also perform favorably against the state-of-the-art methods on most attributes.

V-E3 Running time

Fig. 6 reports the tracking speed (FPS) of the four baseline trackers, i.e., BACF, STRCF, ECOhc and ECO, and their corresponding RCG methods on OTB-2015. It can be seen that BACFRCG and STRCFRCG achieve tracking speeds of 22.2 and 19.5 FPS, moderately slower than their CF counterparts BACF (26.7 FPS) and STRCF (24.3 FPS), respectively. Thus, while the introduction of the mask function increases the model complexity, the two trackers can still be efficiently solved with the ADMM algorithms, since each subproblem has a closed-form solution. As for the trackers with multiple base images, ECOhcRCG runs at approximately 70% of the speed of the baseline ECOhc (42 FPS), but still maintains real-time tracking performance at 28.9 FPS. When extended to deep CNN features, ECORCG (5.9 FPS) runs at approximately 60% of the speed of its baseline ECO (9.8 FPS).

V-E4 Qualitative evaluation

Fig. 7 shows the qualitative results of four baseline CF trackers, i.e., BACF, STRCF, ECOhc and ECO, as well as their RCG counterparts. It can be seen from the first row that the target suffers from background clutter and illumination variation. In comparison to baseline BACF, BACFRCG can take the benefit of removing cosine window, thus is able to exploit more useful and uncontaminated training samples for robust model learning, thereby significantly alleviating the tracker drift issue. In the second row, due to the effect of motion blur, fast motion and occlusion challenges, ECOhc cannot track the target throughout the whole sequence while ECOhcRCG still performs well.

In the last two rows, similar phenomena can also be observed when the STRCF and ECO trackers are applied to the coupon and freeman4 videos, respectively. In all these videos, the RCG method consistently outperforms its baseline CF counterpart, indicating the effectiveness of removing the cosine window and incorporating the mask function.

V-F Temple-Color dataset

To further evaluate our methods, comparative experiments are also conducted on the Temple-Color dataset, which contains 129 color video sequences in total. Fig. 8 shows the overlap success plots of the competing trackers using handcrafted and CNN features. It can be seen from Fig. 8(a) that our methods consistently improve the baseline CF trackers using handcrafted features. In particular, BACFRCG, ECOhcRCG and STRCFRCG outperform their baseline CF counterparts with AUC score gains of 2.4%, 0.8% and 1.3%, respectively. Moreover, as shown in Fig. 8(b), ECORCG also performs better than ECO by 0.8% when using deep CNN features, further demonstrating the effectiveness of removing the cosine window from CF trackers with spatial regularization.

VI Conclusion

In this paper, we investigated the problems of when and how to remove cosine window from CF trackers. Our empirical analysis showed that both spatial regularization and cosine window can be utilized to alleviate boundary discontinuity. However, cosine window may give rise to sample contamination, while for spatial regularization a small percentage of negative samples still suffer from boundary discontinuity. To remove cosine window from CF trackers with spatial regularization, we introduced a binary mask function to exclude the negative samples suffering from boundary discontinuity during training. Furthermore, another Gaussian shaped mask function was also introduced to downweight the negative samples far from target center. Then, optimization algorithms were respectively developed for removing cosine window from CF trackers with single and multiple base images.

Our experiments on OTB-2015, Temple-Color and VOT-2018 showed that our methods are effective in circumventing boundary discontinuity and sample contamination, and bring moderate performance gains over their CF counterparts with cosine window. Our methods also perform favorably against the state-of-the-art trackers using handcrafted and deep CNN features.

Bibliography

[1] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui, “Visual object tracking using adaptive correlation filters,” in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 2544–2550.
[2] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed tracking with kernelized correlation filters,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 583–596, March 2015.
[3] M. Tang and J. Feng, “Multi-kernel correlation filter for visual tracking,” in IEEE International Conference on Computer Vision, 2015, pp. 3038–3046.
[4] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg, “Discriminative scale space tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 8, pp. 1561–1575, Aug 2017.
[5] Y. Li and J. Zhu, “A scale adaptive kernel correlation filter tracker with feature integration,” in European Conference on Computer Vision Workshop, 2014, pp. 254–265.
[6] F. Li, Y. Yao, P. Li, D. Zhang, W. Zuo, and M. H. Yang, “Integrating boundary and center correlation filters for visual tracking with aspect ratio variation,” in IEEE International Conference on Computer Vision Workshop, 2017, pp. 2001–2009.
[7] W. Zuo, X. Wu, L. Lin, L. Zhang, and M.-H. Yang, “Learning support correlation filters for visual tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 5, pp. 1158–1172, May 2019.
[8] M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg, “Learning spatially regularized correlation filters for visual tracking,” in IEEE International Conference on Computer Vision, 2015, pp. 4310–4318.
