Single Object Tracking With Fuzzy Least Squares
Support Vector Machine
Shunli Zhang, Sicong Zhao, Yao Sui, and Li Zhang
Abstract— Single object tracking, in which a target is often
initialized manually in the first frame and is then tracked
and located automatically in the subsequent frames, is a hot
topic in computer vision. The traditional tracking-by-detection
framework, which often formulates tracking as a binary classi-
fication problem, has been widely applied and achieved great
success in single object tracking. However, there are some
potential issues in this formulation. For instance, the boundary
between the positive and negative training samples is fuzzy,
and the objectives of tracking and classification are inconsistent.
In this paper, we attempt to address the above issues from the
fuzzy system perspective and propose a novel tracking method
by formulating tracking as a fuzzy classification problem. First,
we introduce the fuzzy strategy into tracking and propose a novel
fuzzy tracking framework, which can measure the importance of
the training samples by assigning different memberships to them
and offer stricter spatial constraints. Second, we develop a
fuzzy least squares support vector machine (FLS-SVM) approach
and employ it to implement a concrete tracker. In particular,
the primal form, dual form, and kernel form of FLS-SVM are
analyzed and the corresponding closed-form solutions are derived
for efficient realizations. Besides, a least squares regression
model is built to control the update adaptively, retaining the
robustness of the appearance model. The experimental results
demonstrate that our method can achieve comparable or superior
performance to many state-of-the-art methods.
Index Terms— Object tracking, fuzzy least squares support
vector machine (FLS-SVM), least square regression (LSR).
I. INTRODUCTION
SINGLE object tracking is one of the fundamental problems
in computer vision. In single object tracking, a target
is often initialized manually in the first frame, and tracking
is defined as predicting the location of the target automat-
ically in the subsequent frames. Since it has a wide range
of applications in many domains, e.g., motion recognition,
surveillance, video editing, and human-computer interfaces (HCI),
it has attracted much attention from researchers. However,
tracking is still a challenging task because of various factors,
including heavy occlusion, deformation, illumination variation,
scale variation, complex background, etc. [1]–[3].
Manuscript received March 29, 2015; revised August 20, 2015; accepted
September 24, 2015. Date of publication September 29, 2015; date of current
version October 23, 2015. This work was supported by the National Natural
Science Foundation of China under Grant 61132007 and Grant 61172125. The
associate editor coordinating the review of this manuscript and approving it
for publication was Prof. Jean-Philippe Thiran.
The authors are with the Department of Electronic Engineering,
Color versions of one or more of the figures in this paper are available
online.
Digital Object Identifier 10.1109/TIP.2015.2484068
To address the above challenges, building an accurate and
robust appearance model is crucial. The existing appearance
models can be divided into two types. One is the generative
model [4]–[7], which represents the object with only the
information of itself. The other one is the discriminative
model [8]–[13], which makes use of the information of
both the target and background by formulating tracking as
a binary classification problem. The latter model, which is
often called the tracking-by-detection method, makes a more
comprehensive description of both the target and background.
Besides, many collaboration-based methods try to improve the
robustness by combining multiple sub-models with different
attributes [14]–[17].
However, there are some potential issues in the conventional
tracking-by-detection formulation [18]. First, the objectives
of tracking and classification are inconsistent, for the objective
of tracking is to predict the precise location of the object, while
the objective of classification is to label the samples correctly.
The inconsistency may result in the deviation between the best
classification result and the optimal tracking result. During
tracking, the accumulation of the deviation by on-line learning
may be an important factor causing the drift problem. Second,
the classification boundary between the selected positive and
negative samples is unclear. In tracking-by-detection methods,
the positive and negative training samples are often sampled
by the distance-based rule, where the samples close to
the target are taken as positive and the ones far away
from the target are considered as negative. Concretely, let ...
... we think
that the new tracking result cannot be well represented by the
samples in the target buffer. In this situation, we will update
the model to make the appearance subspace fit the changes.
Otherwise, we will not update the model. The LSR model can
effectively capture the drastic deformation and alleviate the
accumulation of the occlusion samples, controlling the update
adaptively.
In our method, the target buffer T corresponds to the target
samples with a unique translation t(0, 0) and a buffer depth
$B_t = 50$, while the fuzzy buffer F corresponds to a group of
specific translations {t(α,β)}, and the depth of each translation
is $B_f = 1$. Once an update is triggered, we update T and F
separately. T is updated according to the First In, First
Out (FIFO) rule, which means that the optimal candidate
sample in the current frame replaces the earliest
sample in T. F is updated by a probabilistic strategy.
Since the samples in F are selected based on dense sampling,
each sample in F corresponds to a unique translation t(α,β)
and specific memberships. In order to make use of the spatial
constraints in more frames and improve the efficiency, we
update each sample in F with probability $p_u = N_r/M$.
In practice, we first select $N_r$ translations randomly, and then
sample the corresponding samples around $x_{opt}$ in the current
frame, to replace the samples with the same translations
in F. Based on the updated samples, the FLS-SVM model
is retrained to continue the tracking in the next frame. The
complete tracking process is summarized in Algorithm 1.
Algorithm 1 The FST Tracking Algorithm
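The two buffer update rules described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the dictionary layout of F, and the `sample_at` callback (which stands in for feature extraction at an image location) are assumptions made for the example.

```python
import random
from collections import deque

B_T = 50  # depth of the target buffer T, as stated in the text


def make_target_buffer():
    # A deque with maxlen discards its oldest item on append (FIFO).
    return deque(maxlen=B_T)


def update_buffers(target_buf, fuzzy_buf, x_opt, sample_at, n_r, rng=random):
    """Update T by FIFO and refresh n_r randomly chosen entries of F.

    target_buf: deque(maxlen=B_T) holding target samples
    fuzzy_buf:  dict mapping a translation (alpha, beta) to one sample,
                which matches a per-translation depth of B_f = 1
    x_opt:      (x, y) location of the optimal candidate in this frame
    sample_at:  hypothetical callback that extracts a sample at a location
    n_r:        number of translations to refresh, so that each entry is
                refreshed with probability p_u = n_r / len(fuzzy_buf)
    """
    # FIFO: the newest target sample pushes out the earliest one.
    target_buf.append(sample_at(x_opt))
    # Pick n_r distinct translations uniformly at random.
    chosen = rng.sample(sorted(fuzzy_buf), n_r)
    for alpha, beta in chosen:
        # Resample around x_opt with the same translation.
        fuzzy_buf[(alpha, beta)] = sample_at((x_opt[0] + alpha, x_opt[1] + beta))
    return chosen
```

With M translations in F, choosing `n_r = round(0.05 * M)` would correspond to an update probability of about 0.05. Retraining the FLS-SVM model on the refreshed buffers is outside the scope of this sketch.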
V. EXPERIMENTS
A. Experimental Setup
1) Initialization: We initialize the FST tracker as follows.
HOG [40], with a 5-pixel window size and 9 orientations, is
extracted as the feature for all training and candidate samples.
The normalization size of the samples is 30 × 30. The sliding
radius $R_s$ for sampling the candidate samples is 26 pixels.
For FLS-SVM, the Gaussian kernel with its kernel parameter set
to 10 is adopted, and the trade-off parameter C is set to 0.1
empirically. The threshold for the update judgement depends on
the F-norm of the target buffer T, and is set as $0.05\,\|T\|_F$. The
update probability $p_u$ is set as 0.05. All the parameters are
fixed or adjusted adaptively on all the testing sequences. FST is
implemented in Matlab on a PC with a 3.3 GHz Intel i5 CPU.
The average running time is about 3 frames per second.
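For reference, the settings listed above can be collected in one place, together with the update threshold computed from the Frobenius norm of the target buffer. This is only a sketch: the class, field, and function names are hypothetical, and the target buffer is assumed to be stored as a list of rows of numbers.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FSTSettings:
    # Values taken from the text; the field names are illustrative only.
    hog_window_px: int = 5
    hog_orientations: int = 9
    sample_size: tuple = (30, 30)
    sliding_radius_px: int = 26
    kernel_param: float = 10.0       # Gaussian kernel parameter
    trade_off_c: float = 0.1         # trade-off parameter C
    threshold_factor: float = 0.05   # multiplies the F-norm of T
    update_probability: float = 0.05


def frobenius_norm(matrix):
    # Square root of the sum of squared entries.
    return math.sqrt(sum(v * v for row in matrix for v in row))


def update_threshold(target_buffer, settings=FSTSettings()):
    # Threshold for the update judgement: factor times the F-norm of T.
    return settings.threshold_factor * frobenius_norm(target_buffer)
```

Because the threshold scales with the norm of T, it adapts to the magnitude of the stored features instead of being a fixed constant.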
2) Dataset: For a fair evaluation of our method,
we run our tracker on all 51 fully annotated
sequences in the benchmark dataset [3]. These sequences
are captured in various conditions, including occlusion,
deformation, illumination variation, scale variation, out-of-plane
and in-plane rotation, fast motion, background clutter, etc.
3) Evaluation: We use four criteria to evaluate the performance
of the trackers. The first is the average center
location error (CLE), which is defined as the average
distance between the estimated center position of the target and the
ground truth. The second is the average Pascal VOC overlap
rate (VOR), which is defined as the average of
$\text{Score} = \frac{\text{area}(R_T \cap R_G)}{\text{area}(R_T \cup R_G)}$,
where $R_T$ and $R_G$ represent the bounding boxes
of the tracking result and ground truth, respectively. The third
is the precision, which is determined based on CLE. The
precision is the ratio of the number of frames
whose CLE is smaller than a predefined threshold Th to
the total number of frames. The fourth is the success
rate (SR), which depends on the VOR. If the VOR in one
frame is larger than a predefined threshold Th, the tracking
in that frame is considered successful. SR is defined as the
ratio of the number of successful frames to the total number of frames.
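The four criteria can be sketched in a few lines. The code below is an illustration under stated assumptions, not the benchmark's evaluation toolkit: boxes are assumed to be `(x, y, w, h)` tuples, the overlap is computed as intersection area over union area, and the function names and the threshold arguments `th_cle` and `th_vor` are hypothetical.

```python
import math


def center_error(box, gt):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    cx, cy = box[0] + box[2] / 2.0, box[1] + box[3] / 2.0
    gx, gy = gt[0] + gt[2] / 2.0, gt[1] + gt[3] / 2.0
    return math.hypot(cx - gx, cy - gy)


def overlap(box, gt):
    """Pascal VOC overlap: intersection area divided by union area."""
    iw = min(box[0] + box[2], gt[0] + gt[2]) - max(box[0], gt[0])
    ih = min(box[1] + box[3], gt[1] + gt[3]) - max(box[1], gt[1])
    inter = max(iw, 0.0) * max(ih, 0.0)
    union = box[2] * box[3] + gt[2] * gt[3] - inter
    return inter / union if union > 0 else 0.0


def evaluate(results, truths, th_cle, th_vor):
    """Return average CLE, average VOR, precision, and success rate."""
    cles = [center_error(r, g) for r, g in zip(results, truths)]
    vors = [overlap(r, g) for r, g in zip(results, truths)]
    n = len(cles)
    return {
        "CLE": sum(cles) / n,
        "VOR": sum(vors) / n,
        # Fraction of frames whose CLE is below the threshold.
        "precision": sum(c < th_cle for c in cles) / n,
        # Fraction of frames whose VOR is above the threshold.
        "SR": sum(v > th_vor for v in vors) / n,
    }
```

Calling `evaluate` repeatedly over a range of threshold values yields one precision value and one success rate per threshold; the specific thresholds are left to the caller.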
By varying Th