Look into Person: Self-supervised Structure-sensitive Learning and A New
Benchmark for Human Parsing
Human parsing has recently attracted substantial research interest due to its broad application potential. However, existing datasets contain limited numbers of images and annotations, and lack both variety in human appearance and coverage of the challenging cases that arise in unconstrained environments. In this paper, we introduce a new benchmark1, “Look into Person (LIP)”, that makes a significant advance in terms of scalability, diversity and difficulty, a contribution that we believe is crucial for future developments in human-centric analysis. This comprehensive dataset contains over 50,000 elaborately annotated images with 19 semantic part labels, captured from a wide range of viewpoints and exhibiting heavy occlusions and complex backgrounds. Given these rich annotations, we perform detailed analyses of the leading human parsing approaches, gaining insights into the successes and failures of these methods. Furthermore, in contrast to existing efforts that focus on improving feature discriminability, we address human parsing with a novel self-supervised structure-sensitive learning approach, which imposes human pose structure on the parsing results without resorting to extra supervision (i.e., without specifically labeling human joints for model training). Our self-supervised learning framework can be injected into any advanced neural network to help incorporate rich high-level knowledge about human joints from a global perspective and improve the parsing results. Extensive evaluations on our LIP dataset and the public PASCAL-Person-Part dataset demonstrate the superiority of our method.
1. Introduction
Human parsing aims to segment a human image into multiple parts with fine-grained semantics, providing a more detailed understanding of image content. It can stimulate many higher-level computer vision applications [35], such as person re-identification [36] and human behavior analysis [12, 17].

The first two authors contributed equally to this paper. The corresponding author is Dongyu Zhang. This work was supported by the National Natural Science Foundation of China under Grants 61401125 and 61671182.
1The dataset is available at http://hcp.sysu.edu.cn/lip
Recently, Convolutional Neural Networks (CNNs) have
achieved exciting success in human parsing [14, 18, 16].
Nevertheless, as demonstrated on many other problems such as object detection [15] and semantic segmentation [37], the performance of these CNN-based approaches relies heavily on the availability of annotated images for training. To train a human parsing network with practical value for real-world applications, it is highly desirable to have a large-scale dataset composed of representative instances with varied clothing appearance, strong articulation, partial (self-)occlusions, truncation at image borders, diverse viewpoints and background clutter. Although there
exist training sets for special scenarios such as fashion pic-
tures [32, 9, 14, 18] and people in constrained situations
(e.g., upright) [6], these datasets are limited in their cover-
age and scalability, as shown in Fig. 1. The largest public human parsing dataset to date [18] contains only 17,000 fashion images, while others include only thousands of images.
Moreover, to the best of our knowledge, no attempt has been made to establish a standard, representative benchmark covering a wide palette of challenges for the human parsing task. The existing datasets do not provide an evaluation server with a secret test set to avoid potential dataset over-fitting, which hinders further development on this topic. We therefore propose a new benchmark, “Look into Person (LIP)”, together with a public server that automatically reports evaluation results. Our benchmark, which includes 50,462 human images with pixel-wise annotations of 19 semantic parts, significantly advances the state of the art in terms of appearance variability and complexity.
The recent progress on human parsing [5, 30, 33, 32, 9,
26, 20, 18] has been achieved by improving the feature rep-
resentations using convolutional neural networks and recur-
rent neural networks. To capture rich structure information, these methods combine CNNs with graphical models, e.g., Conditional Random Fields (CRFs), similar to general object segmentation approaches [37, 4, 29].
Figure 1: Annotation examples from our “Look into Person (LIP)” dataset and existing datasets (label legend: Face, UpperClothes, Hair, RightArm, Pants, LeftArm, RightShoe, LeftShoe, Hat, Coat, RightLeg, LeftLeg, Gloves, Socks, Sunglasses, Dress, Skirt, Jumpsuits, Scarf). (a) Images in the ATR dataset, which are fixed in size and contain only standing person instances outdoors. (b) Images in the PASCAL-Person-Part dataset, which is also of limited scale and contains only 6 coarse labels. (c) Images in our LIP dataset, with high appearance variability and complexity.
However, evaluated on the new LIP dataset, the results of some existing methods [3, 22, 4, 5] are unsatisfactory. Without human body structure priors, these general approaches based on bottom-up appearance information sometimes produce unreasonable results (e.g., a right arm connected to a left shoulder), as shown in Fig. 2. The human body
structural information has previously been well explored in human pose estimation [34, 7], where dense joint annotations are provided. However, since human parsing requires
more extensive and detailed prediction than pose estima-
tion, it is difficult to directly utilize joint-based pose esti-
mation models in pixel-wise prediction to incorporate the
complex structure constraints. To explicitly enforce semantic consistency between the produced parsing results and the human pose/joint structures, we propose a novel
structure-sensitive learning approach for human parsing. In
addition to using the traditional pixel-wise part annotations
as the supervision, we introduce a structure-sensitive loss to
evaluate the quality of predicted parsing results from a joint
structure perspective. That means a satisfactory parsing re-
sult should be able to preserve a reasonable joint structure
(e.g., the spatial layout of human parts). Note that annotating both pixel-wise label maps and pose joints is expensive and may cause ambiguity. Therefore, in this work we
generate approximated human joints directly from the pars-
ing annotations and use them as the supervision signal for
the structure-sensitive loss; this strategy is hence called “self-supervised”, and we denote it Self-supervised Structure-sensitive Learning (SSL).

Dataset                  #Training  #Validation  #Test   Categories
Fashionista [33]             456        -           229      56
PASCAL-Person-Part [6]     1,716        -         1,817       7
ATR [18]                  16,000      700         1,000      18
LIP                       30,462   10,000        10,000      20

Table 1: Overview of the publicly available datasets for human parsing. For each dataset we report the number of annotated persons in the training, validation and test sets, as well as the number of categories, including background.
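To make the self-supervised strategy concrete, the following is a minimal sketch in Python of one plausible realization: pseudo-joints are computed as the centers of part regions in a parsing map, and the discrepancy between pseudo-joints derived from the predicted and ground-truth maps scales the pixel-wise loss. The part-to-joint grouping, label ids, helper names and weighting scheme below are illustrative assumptions, not the paper's exact formulation.

import numpy as np

# Hypothetical grouping of part labels into pseudo-joints; the actual
# grouping used by SSL may differ (label ids here are illustrative).
PART_GROUPS = {
    "head":      [1, 2, 13],   # e.g., hat, hair, face
    "upper":     [5, 7],       # e.g., upper-clothes, coat
    "left_arm":  [14],
    "right_arm": [15],
    "left_leg":  [16],
    "right_leg": [17],
}

def parsing_to_joints(label_map):
    """Approximate each joint as the center of mass of its part region."""
    joints = {}
    for joint, part_ids in PART_GROUPS.items():
        ys, xs = np.nonzero(np.isin(label_map, part_ids))
        joints[joint] = (ys.mean(), xs.mean()) if ys.size else None
    return joints

def structure_weight(pred_map, gt_map):
    """Mean distance between pseudo-joints of the predicted and
    ground-truth parsing maps (one plausible structure term)."""
    pred_j, gt_j = parsing_to_joints(pred_map), parsing_to_joints(gt_map)
    dists = [np.hypot(p[0] - g[0], p[1] - g[1])
             for p, g in ((pred_j[k], gt_j[k]) for k in PART_GROUPS)
             if p is not None and g is not None]
    return float(np.mean(dists)) if dists else 0.0

# One plausible combination: scale the pixel-wise segmentation loss by
# the structure term, so structurally implausible predictions are
# penalized more heavily, e.g. loss = structure_weight(pred, gt) * L_pixel.

Because the joints are derived from the parsing annotations themselves, no extra joint labeling is required, which is precisely what makes the strategy “self-supervised”.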
Our contributions are summarized in three aspects. 1) We propose a new large-scale benchmark and an evaluation server to advance human parsing research, providing 50,462 images with pixel-wise annotations of 19 semantic part labels. 2) Through experiments on our benchmark, we present detailed analyses of existing human parsing approaches, gaining insights into the successes and failures of these approaches. 3) We propose
a novel self-supervised structure-sensitive learning frame-
work for human parsing, which is capable of explicitly en-
forcing the consistency between the parsing results and the
human joint structures. Our proposed framework signifi-
cantly surpasses the previous methods on both the existing
PASCAL-Person-Part dataset [6] and our new LIP dataset.
1.1. Related Work
Human parsing datasets: The commonly used pub-
licly available datasets for human parsing are summarized
in Table 1.

Figure 2: An example showing that self-supervised structure-sensitive learning helps human parsing (label legend: Face, Hair, RightArm, LeftArm, Jumpsuits). (a): The original image. (b): The parsing result of Attention-to-scale [5], where the left arm is wrongly labeled as the right arm. (c): Our parsing result, which successfully incorporates structure information to generate a reasonable output.

The previous datasets contain a limited number of images or categories. Containing 50,462 images
annotated with 20 categories, our LIP dataset is the largest
and most comprehensive human parsing dataset to date.
Some other datasets in the vision community are dedicated to the tasks of clothes recognition and retrieval [21, 24] and human pose estimation [1, 13], while our LIP dataset focuses solely on human parsing.
Human parsing approaches: Recently, many research
efforts have been devoted to human parsing [18, 33, 32,
26, 20, 30, 5]. For example, Liang et al. [18] proposed a
novel Co-CNN architecture which integrates multiple lev-
els of image context into a unified network. Besides human parsing, there has also been increasing research interest in the part segmentation of other objects such as
animals or cars [27, 29, 23]. To capture the rich struc-
ture information on top of advanced CNN architectures, common solutions include the combination of CNNs and CRFs [4, 37] and the adoption of multi-scale feature representations [4, 5, 30]. Chen et al. [5] proposed an attention mechanism that learns to weight the multi-scale features at each pixel location (a simplified sketch is given after this paragraph). Some previous works [8, 31] explored
human pose information to guide human parsing by gen-
erating “pose-guided” part segment proposals. To leverage human joint structure more easily and efficiently, our approach instead centers on a new self-supervised structure-sensitive learning scheme, which can be embedded into any network.
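As referenced above, the following is a simplified sketch of the attention-to-scale idea: score maps computed at several input scales are fused with per-pixel softmax weights. The module structure, tensor shapes and names are our assumptions for illustration, not the implementation of [5].

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionToScale(nn.Module):
    """Sketch: fuse per-scale score maps with per-pixel softmax
    attention weights, loosely following Chen et al. [5]."""
    def __init__(self, channels, num_scales):
        super().__init__()
        # One attention logit per scale at every pixel location.
        self.attn = nn.Conv2d(channels * num_scales, num_scales, kernel_size=1)

    def forward(self, score_maps):
        # score_maps: list of (B, C, H, W) tensors, one per input
        # scale, already resized to a common resolution.
        stacked = torch.stack(score_maps, dim=1)          # (B, S, C, H, W)
        logits = self.attn(torch.cat(score_maps, dim=1))  # (B, S, H, W)
        weights = F.softmax(logits, dim=1).unsqueeze(2)   # (B, S, 1, H, W)
        return (stacked * weights).sum(dim=1)             # (B, C, H, W)

For instance, with parsing scores at three scales over 20 categories, AttentionToScale(channels=20, num_scales=3) would merge three 20-channel score maps into a single fused prediction.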
2. Look into Person Benchmark
In this section we introduce “Look into Person (LIP)”, a new large-scale dataset focusing on the semantic understanding of human bodies, which has several appealing properties. First, with 50,462 annotated images, LIP is an order of magnitude larger and more challenging than previous similar attempts [33, 6, 18]. Second, LIP is annotated with elaborate pixel-wise labels covering 19 semantic human part categories and one background category. Third, the images, collected from real-world scenarios, contain people in challenging poses and viewpoints, with heavy occlusions and varied appearances, at a wide range of resolutions. Furthermore, the background of images in the LIP dataset is also more complex and diverse than the one in