A Survey on Multi-output Learning

Donna Xu; Yaxin Shi; Ivor W. Tsang; Yew-Soon Ong; Chen Gong; Xiaobo Shen

arXiv:1901.00248·cs.LG·August 23, 2021

TL;DR

This paper provides a comprehensive survey of multi-output learning, discussing its challenges, output structures, evaluation metrics, and state-of-the-art methods to address the complexities introduced by multiple outputs.

Contribution

It offers a detailed overview of the challenges, techniques, and research directions in multi-output learning, filling a gap in the existing literature.

Findings

1. Categorizes multi-output learning methods based on challenges.

2. Summarizes output structures and evaluation metrics.

3. Highlights key datasets and future research directions.

Abstract

Multi-output learning aims to simultaneously predict multiple outputs given an input. It is an important learning problem due to the pressing need for sophisticated decision making in real-world applications. Inspired by big data, the 4 Vs characteristics of multiple outputs impose a set of challenges on multi-output learning, in terms of the volume, velocity, variety and veracity of the outputs. An increasing number of works in the literature have been devoted to the study of multi-output learning and the development of novel approaches for addressing the challenges encountered. However, the literature lacks a comprehensive overview of the different types of challenges brought by the characteristics of the multiple outputs and of the techniques proposed to overcome them. This paper thus attempts to fill this gap by providing a comprehensive review of this area. We first…

Tables

TABLE I: A summary of subfields of multi-output learning and their corresponding output structures, applications and disciplines.

Subfield	Output Structure	Application	Discipline
Multi-label Learning	Independent Binary Vector	Document Categorization [19]	Natural Language Processing
		Semantic Scene Classification [20]	Computer Vision
		Automatic Video Annotation [21]	Computer Vision
Multi-target Regression	Independent Real-valued Vector	River Quality Prediction [22]	Ecology
		Natural Gas Demand Forecasting [23]	Energy Meteorology
		Drug Efficacy Prediction [24]	Medicine
Label Distribution Learning	Distribution	Head Pose Estimation [25]	Computer Vision
		Facial Age Estimation [26]	Computer Vision
		Text Mining [27]	Data Mining
Label Ranking	Ranking	Text Categorization Ranking [28]	Information Retrieval
		Question Answering [29]	Information Retrieval
		Visual Object Recognition [30]	Computer Vision
Sequence Alignment Learning	Sequence	Protein Function Prediction [31]	Bioinformatics
		Language Translation [32]	Natural Language Processing
		Named Entity Recognition [33]	Natural Language Processing
Network Analysis	Graph	Scene Graph [34]	Computer Vision
	Tree	Natural Language Parsing [35]	Natural Language Processing
	Link	Link Prediction [36]	Data Mining
Data Generation	Image	Super-resolution Image Reconstruction [37]	Computer Vision
	Text	Language Generation	Natural Language Processing
	Audio	Music Generation [38]	Signal Processing
Semantic Retrieval	Independent Real-valued Vector	Content-based Image Retrieval [39]	Computer Vision
		Microblog Retrieval [40]	Data Mining
		News Retrieval [41]	Data Mining
Time-series Prediction	Time Series	DNA Microarray Data Analysis [42]	Bioinformatics
		Energy Consumption Forecasting [43]	Energy Meteorology
		Video Surveillance [44]	Computer Vision

TABLE II: Characteristics of the datasets of multi-output learning tasks.

Multi-output Characteristic	Challenge	Statistic	Application Domain	Dataset Name	Value	Source
Volume	Extreme Output Dimension¹	Output Dimension	Review Text	AmazonCat-13K	13,330	[76]
			Review Text	AmazonCat-14K	14,588	[77, 78]
			Text	Wiki10-31	30,938	[79, 80]
			Social Bookmarking	Delicious-200K	205,443	[79, 81]
			Text	WikiLSHTC-325K	325,056	[82, 83]
			Text	Wikipedia-500K	501,070	Wikipedia
			Product Network	Amazon-670K	670,091	[79, 76]
			Text	Ads-1M	1,082,898	[82]
			Product Network	Amazon-3M	2,812,281	[77, 78]
	Extreme Class Imbalance	Largest Class Imbalance Ratio	Scene Image	WIDER-Attribute	1:28	[84]
			Face Image	Celeb Faces Attributes	1:43	[85]
			Clothing Image	DeepFashion	1:733	[86]
			Clothing Image	X-Domain	1:4,162	[87]
	Unseen Outputs	Seen / Unseen Labels	Image	Attribute Pascal and Yahoo	20 / 12	[88]
			Animal Image	Animal with Attributes	40 / 10	[88]
			Scene Image	HSUN	80 / 27	[89]
			Music	MagTag5K	107 / 29	[90]
			Bird Image	Caltech-UCSD Birds 200	150 / 50	[91]
			Scene Image	SUN Attributes	645 / 72	[20]
			Health	MIMIC II	3,228 / 355	[92]
			Health	MIMIC III	4,403 / 178	[93]
Velocity	Change of Output Distribution	Time Periods	Text	Reuters	365 days	[94]
			Route	ECML/PKDD 15: Taxi Trajectory Prediction	365 days	[95]
			Route	epfl/mobility	30 days	[96]
			Electricity	Portuguese Electricity Consumption	365 days	[97]
			Traffic Video	MIT Traffic Data Set	90 minutes	[44]
			Surveillance Video	VIRAT Video	8.5 hours	[98]
Variety	Complex Structures	Output Structures	Image	LabelMe	Label, Bounding Box	[8]
			Image	ImageNet	Label, Bounding Box	[7]
			Image	PASCAL VOC	Label, Bounding Box	[99]
			Image	CIFAR100	Hierarchical Label	[100]
			Lexical Database	WordNet	Hierarchy	[101]
			Wikipedia Network	Wikipedia	Graph, Link	[102]
			Blog Network	BlogCatalog	Graph, Link	[103]
			Author Collaboration Network	arXiv-AstroPh	Link	[104]
			Author Collaboration Network	arXiv-GrQc	Link	[104]
			Text	CoNLL-2000 Shared Task	Text Chunks	[105]
			Text	Wall Street Journal (WSJ) corpus	POS Tags, Parsing Tree	-
			European Languages	Europarl corpus	Sequence	[32]
Veracity	Noisy Output Labels	Noisy Labeled Samples	Dog Image	AMT	7,354	[106]
			Food Image	Food101N	310K	[107]
			Clothing Image	Clothing1M	1M	[108]
			Web Image	WebVision	2.4M	[109]
			Image and Video	YFCC100M	100M	[110]

¹ http://manikvarma.org/downloads/XC/XMLRepository.html

Equations

\text{Hamming Loss} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{m} \left| Y_{i} \Delta \hat{Y}_{i} \right|

B_{macro} = \frac{1}{m} \sum_{l=1}^{m} B(TP_{l}, FP_{l}, TN_{l}, FN_{l}),

B_{micro} = B\left( \frac{1}{m} \sum_{l=1}^{m} TP_{l}, \frac{1}{m} \sum_{l=1}^{m} FP_{l}, \frac{1}{m} \sum_{l=1}^{m} TN_{l}, \frac{1}{m} \sum_{l=1}^{m} FN_{l} \right).

\text{One-error} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left( \arg\min_{\lambda \in \mathcal{L}} \pi_{i}(\lambda) \notin Y_{i} \right)

\text{Ranking Loss} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|Y_{i}| |\overline{Y_{i}}|} |E|, \text{ where } E = \{ (\lambda_{a}, \lambda_{b}) : \pi_{i}(\lambda_{a}) > \pi_{i}(\lambda_{b}), (\lambda_{a}, \lambda_{b}) \in Y_{i} \times \overline{Y_{i}} \}

AP = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|Y_{i}|} \sum_{\lambda \in Y_{i}} \frac{\left| \{ \lambda' \in Y_{i} \mid \pi_{i}(\lambda') \leq \pi_{i}(\lambda) \} \right|}{\pi_{i}(\lambda)}

MAE = \frac{1}{m} \frac{1}{N} \sum_{i=1}^{N} | y_{i} - \hat{y}_{i} |

MSE = \frac{1}{m} \frac{1}{N} \sum_{i=1}^{N} ( y_{i} - \hat{y}_{i} )^{2}

ACC = \frac{1}{m} \sum_{l=1}^{m} \frac{\sum_{i=1}^{N} ( y_{i}^{l} - \bar{y}^{l} ) ( \hat{y}_{i}^{l} - \bar{\hat{y}}^{l} )}{\sum_{i=1}^{N} ( y_{i}^{l} - \bar{y}^{l} )^{2} \sum_{i=1}^{N} ( \hat{y}_{i}^{l} - \bar{\hat{y}}^{l} )^{2}}

IoU = \frac{\text{Area of Overlap}}{\text{Area of Union}}

\min_{w \in \mathbb{R}^{q}, \{\xi_{i} \geq 0\}_{i=1}^{n}} \frac{\lambda}{2} \| w \|^{2} + \frac{C}{n} \sum_{i=1}^{n} \underbrace{\max_{y \in \mathcal{Y}} \{ \Delta(y_{i}, y) + w^{T} \Phi(x_{i}, y) \} - w^{T} \Phi(x_{i}, y_{i})}_{\text{structured hinge loss}}

\min_{w \in \mathbb{R}^{q}, \{\xi_{i} \geq 0\}_{i=1}^{n}} \frac{\lambda}{2} \| w \|^{2} + \frac{C}{n} \sum_{i=1}^{n} \xi_{i}^{2} \quad \text{s.t.} \quad w^{T} \Phi(x_{i}, y_{i}) - w^{T} \Phi(x_{i}, y) \geq \Delta(y_{i}, y) - \xi_{i}, \ \forall y \in \mathcal{Y} \setminus y_{i}, \ \forall i.

L_{0/1}(y, y') = \mathbb{I}(y \neq y')

L_{Hinge}(x, y, w) = \max_{y' \in \mathcal{Y}} [ \Delta(y, y') + w^{T} \Phi(x, y') ] - w^{T} \Phi(x, y)

L_{NegativeLog}(x, y, w) = \log \sum_{y' \in \mathcal{Y}} \exp [ w^{T} \Phi(x, y') ] - w^{T} \Phi(x, y)

L_{Perceptron}(x, y, w) = \max_{y' \in \mathcal{Y}} [ w^{T} \Phi(x, y') - w^{T} \Phi(x, y) ]

L_{SoftmaxMargin}(x, y, w) = \log \sum_{y' \in \mathcal{Y}} \exp [ \Delta(y, y') + w^{T} \Phi(x, y') ] - w^{T} \Phi(x, y)

L_{Squared}(y, y') = (y - y')^{2}
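As a rough illustration of the multi-label metrics in the equation list (Hamming loss, one-error, ranking loss and AP), the following Python sketch evaluates them on plain sets and rank dictionaries. It assumes that π_i(λ) is the rank position of label λ for sample i, with rank 1 being the most relevant, and that a sample with an empty relevant or irrelevant set contributes zero; both conventions, and all function names, are illustrative assumptions rather than definitions taken from the survey.

```python
from typing import Dict, List, Set

Labels = Set[str]
Ranks = Dict[str, int]  # label -> rank position; 1 = most relevant (assumed convention)


def hamming_loss(true_sets: List[Labels], pred_sets: List[Labels], m: int) -> float:
    """Mean size of the symmetric difference Y_i delta Y_hat_i, divided by the m labels."""
    per_sample = [len(y ^ y_hat) / m for y, y_hat in zip(true_sets, pred_sets)]
    return sum(per_sample) / len(per_sample)


def one_error(true_sets: List[Labels], ranks: List[Ranks]) -> float:
    """Fraction of samples whose top-ranked label is not a relevant label."""
    misses = [min(pi, key=pi.get) not in y for y, pi in zip(true_sets, ranks)]
    return sum(misses) / len(misses)


def ranking_loss(true_sets: List[Labels], ranks: List[Ranks]) -> float:
    """Mean fraction of (relevant, irrelevant) pairs that are ordered wrongly."""
    losses = []
    for y, pi in zip(true_sets, ranks):
        y_bar = set(pi) - y  # complement of Y_i within the ranked label set
        if not y or not y_bar:
            losses.append(0.0)  # assumption: undefined cases count as zero loss
            continue
        wrong = sum(1 for a in y for b in y_bar if pi[a] > pi[b])
        losses.append(wrong / (len(y) * len(y_bar)))
    return sum(losses) / len(losses)


def average_precision(true_sets: List[Labels], ranks: List[Ranks]) -> float:
    """Mean over samples of the precision measured at each relevant label's rank."""
    scores = []
    for y, pi in zip(true_sets, ranks):
        if not y:
            scores.append(0.0)  # assumption: no relevant labels counts as zero
            continue
        total = 0.0
        for lab in y:
            at_or_above = sum(1 for other in y if pi[other] <= pi[lab])
            total += at_or_above / pi[lab]
        scores.append(total / len(y))
    return sum(scores) / len(scores)
```

Hamming loss works on predicted label sets, whereas the other three metrics only need the predicted ranking, which is why the two kinds of prediction are passed separately.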

Full text

A Survey on Multi-output Learning

Donna Xu, Yaxin Shi, Ivor W. Tsang, Yew-Soon Ong, Chen Gong, and Xiaobo Shen

D. Xu, Y. Shi and I. W. Tsang are with the Centre for Artificial Intelligence, FEIT, University of Technology Sydney, Ultimo, NSW 2007, Australia (email: [email protected], [email protected], [email protected]). Y.-S. Ong is with the Data Science & Artificial Intelligence Research Centre, SCSE, Nanyang Technological University, Singapore 639798 (email: [email protected]). C. Gong and X. Shen are with the School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China (email: [email protected], [email protected]).

This research is supported by ARC grant LP150100671 and DP180100106, CSC (No: 201706330075), NRFS under its AI Singapore Programme (AISG-RP-2018-004), NSF of China (No: 61602246, 61973162), NSF of Jiangsu Province (No: BK20171430), the FRF for the Central Universities (No: 30918011319), the “Summit of the Six Top Talents” Program (No: DZXX-027), the “Young Elite Scientists Sponsorship Program” by Jiangsu Province, the “Young Elite Scientists Sponsorship Program” by CAST (No: 2018QNRC001), the NNSFC under Grant No. 61906091, the NSF of Jiangsu Province, China (Youth Fund Project) under Grant No. BK20190440, and the FRF for the Central Universities under Grant No. 30919011229.

Abstract

The aim of multi-output learning is to simultaneously predict multiple outputs given an input. It is an important learning problem for decision-making, since making decisions in the real world often involves multiple complex factors and criteria. In recent times, an increasing number of research studies have focused on ways to predict multiple outputs at once. Such efforts have transpired in different forms according to the particular multi-output learning problem under study. Classic cases of multi-output learning include multi-label learning, multi-dimensional learning, multi-target regression and others. From our survey of the topic, we were struck by a lack of studies that generalize the different forms of multi-output learning into a common framework. This paper fills that gap with a comprehensive review and analysis of the multi-output learning paradigm. In particular, we characterize the 4 Vs of multi-output learning, i.e., volume, velocity, variety, and veracity, and the ways in which the 4 Vs both benefit and bring challenges to multi-output learning by taking inspiration from big data. We analyze the life cycle of output labeling, present the main mathematical definitions of multi-output learning, and examine the field’s key challenges and corresponding solutions as found in the literature. Several model evaluation metrics and popular data repositories are also discussed. Last but not least, we highlight some emerging challenges with multi-output learning from the perspective of the 4 Vs as potential research directions worthy of further study.

Index Terms:

multi-output learning, structured output prediction, output label representation, crowdsourcing, label distribution, extreme classification.

I Introduction

Traditional supervised learning is one of the most well-established and widely adopted machine learning paradigms. It offers fast and accurate predictions for today’s real-world smart systems and applications. The goal of traditional supervised learning is to learn a function that maps each of the given inputs to a corresponding known output. For classification tasks, the output comes in the form of a single label. For regression tasks, it is a single value. Traditional supervised learning has been shown to be good at solving these simple single-output problems – classical examples being binary classification, such as filtering spam in an email system, or a regression problem where the daily energy consumption of a machine needs to be predicted based on temperature, wind speed, humidity levels, etc.

However, the traditional supervised learning paradigm is not coping well with the increasing needs of today’s complex decision making. As a result, there is a pressing need for new machine learning paradigms. Here, multi-output learning has emerged as a solution. The aim is to simultaneously predict multiple outputs given a single input, which means it is possible to solve far more complex decision-making problems. Compared to traditional single-output learning, multi-output learning is multi-variate in nature, and the outputs may have complex interactions that can only be handled by structured inference. Additionally, the potentially diverse data types of the outputs have led to various categories of machine learning problems and corresponding subfields of study. For example, binary output values relate to multi-label classification problems [1, 2]; nominal output values relate to multi-dimensional classification problems [3]; ordinal output values are studied in label ranking problems [4]; and real-valued outputs are considered in multi-target regression problems [5].

Together, all these problems constitute the multi-output paradigm, and the body of literature surrounding this field has grown rapidly. Several works have been presented that provide a comprehensive review of the emerging challenges and learning algorithms in each subfield. For instance, Zhang and Zhou [1] studied the emerging area of multi-label learning; Borchani et al. [5] summarized the increasing problems in multi-target regression; and Vembu and Gartner [4] presented a review on multi-label ranking. However, little attention has been paid to the global picture of multi-output learning and the importance of the output labels (Section II). In addition, although the problems in each subfield seem distinctive due to the differences in their output structures (Section III-A), they do share common traits (Section III-B) and encounter common challenges brought by the characteristics of the output labels. In this paper, we attempt to provide such a global view.

I-A The 4 Vs Challenges of Multiple Outputs

The popular 4 Vs, i.e., volume, velocity, variety and veracity, have been well established as the main characteristics of big data. When scholars discuss the 4 Vs in multi-output learning scenarios, they are usually referring to input data; however, the 4 Vs can also be used to describe output labels. Moreover, these 4 Vs bring with them a set of challenges to multi-output learning processes, explained as follows.

1. Volume refers to explosive growth in output labels, which poses many challenges to multi-output learning. First, output label spaces can grow extremely large, which causes scalability issues. Second, the burden for label annotators is significantly increased and still there are often insufficient annotations in a dataset to adequately train a model. In turn, this may lead to unseen outputs during testing. Third, volume may pose label imbalance issues, especially if not all the generated labels in a dataset have sufficient data instances (inputs).

2. Velocity refers to how rapidly output labels are acquired, which includes the phenomenon of concept drift [6]. Velocity can present challenges due to changes in output distributions, where the target outputs vary over time in unforeseen ways.

3. Variety refers to the heterogeneous nature of output labels. Output labels are gathered from multiple sources and are of various data formats with different structures. In particular, output labels with complex structures can create multiple challenges in multi-output learning, such as finding an appropriate method of modeling output dependencies, or how to design a multi-variate loss function, or how to design efficient algorithms.

4. Veracity refers to differences in the quality of the output labels. Issues such as noise, missing values, abnormalities, or incomplete data are all characteristics of veracity.

I-B Purpose and Organization of This Survey

The goal of this paper is to provide a comprehensive overview of the multi-output learning paradigm using the 4 Vs as a frame for the current and future challenges facing this field of study. Multi-output learning has attracted significant attention from many machine learning disciplines, with applications such as part-of-speech sequence tagging and language translation in natural language processing, motion tracking and optical character recognition in computer vision, document categorization and ranking in information retrieval, and so on. We expect this survey to deliver a complete picture of multi-output learning and a summation of the different problems being tackled across multiple communities. Ultimately, we hope to promote further development in multi-output learning, and inspire researchers to pursue worthy and needed future research directions.

The remainder of this survey is structured as follows. Section II illustrates the life cycle of output labels to help understand the challenges presented by the 4 Vs. Section III provides an overview of the myriad output structures along with definitions for the common subproblems addressed in multi-output learning. This section also includes some brief details on the common metrics and publicly-available data used when evaluating models. Section IV presents the challenges in multi-output learning presented by the 4 Vs and their corresponding representative works. Section V concludes the survey.

II Life Cycle of Output Labels

Output labels play an important role in multi-output learning tasks in that how well a model performs a task relies heavily on the quality of those labels. Fig. 1 depicts the three stages of a label’s life cycle: annotation, representation, and evaluation. A brief overview of each stage follows along with the underlying issues that could potentially harm the effectiveness of multi-output learning systems.

II-A How is Data Labeled

Label annotation requires a human to semantically annotate a piece of data and is a crucial step for training multi-output learning models. Data can be used directly with their basic annotations or, once labeled, aggregated into sets for further analysis. Depending on the application and the task, label annotations come in various types. For example, the images for an image classification task should be labeled with tags or keywords, whereas a segmentation task would require each object in the images to be localized with a mask. A captioning task would require the images to be labeled with some textual descriptions, and so on.

Typically, creating large annotated datasets from scratch is time-consuming and labor-intensive no matter the annotation requirement. There are multiple ways to acquire labeled data. Social media provides a platform for researchers to search for labeled datasets - for example, Facebook and Flickr, which allow users to post pictures and comments with tags. Open-source collections, such as WordNet and Wikipedia, can also be useful sources of labeled datasets.

Beyond directly obtaining labeled datasets, crowdsourcing platforms like Amazon Mechanical Turk help researchers solicit labels for unlabeled datasets by recruiting online workers. The annotation type depends on the modeling task and, due to the efficiency of crowdsourcing, this method has quickly become a popular way of obtaining labeled datasets. ImageNet [7] is a popular dataset that was labeled through a crowdsourcing platform. Its database of images is organized into a WordNet hierarchy, and it has been used to help researchers solve problems in a range of areas.

There are also many annotation tools that have been developed to annotate different types of data. LabelMe [8], a web-based tool, provides users with a convenient way to label every object in an image and also correct labels annotated by other users. BRAT [9] is also web-based but is specifically designed for natural language processing tasks, such as named-entity recognition and POS-tagging (part-of-speech tagging). TURKSENT [10] is an annotation tool to support sentiment analysis in social media posts.

II-B Forms of Label Representations

There are many different types of label annotations for different tasks, such as tags, captions, masks, etc., and each type of annotation may have several representations, which are frequently expressed as vectors. For example, the most common is the binary vector, whose size equals the vocabulary size of the tags. For each sample, the entries of the tags it is annotated with are set to 1 and the rest are set to 0. However, binary vectors are not optimal for more complex multi-output tasks because these representations do not preserve all useful information. Details like the semantics or the inherent structure are lost. To tackle this issue, alternative representation methods have been developed. For instance, real-valued vectors of tags [11] indicate the strength and degree of the annotated tags using real values. Binary vectors of the associations between a tag’s attributes have been used to convey the characteristics of tags. Hierarchical label embedding vectors [12] capture the structure information in tags. Semantic word vectors, such as Word2Vec [13], can be used to represent the semantics and/or context of tags and text descriptions. What is key in real-world multi-output applications is to select the label representation that is most appropriate for the given task.
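To make the difference between the first two representations concrete, here is a minimal Python sketch that encodes the same annotation as a binary vector and as a real-valued vector. The tag vocabulary, the weights, and the function names are hypothetical and serve only as an illustration.

```python
from typing import Dict, Iterable, List

# Hypothetical tag vocabulary; the vector length equals the vocabulary size.
VOCAB: List[str] = ["people", "dinner", "table", "wine", "beach"]


def binary_vector(tags: Iterable[str], vocab: List[str]) -> List[int]:
    """1 where the sample is annotated with the tag, 0 elsewhere."""
    present = set(tags)
    return [1 if tag in present else 0 for tag in vocab]


def real_valued_vector(weights: Dict[str, float], vocab: List[str]) -> List[float]:
    """Strength of each tag for the sample; tags without a weight get 0.0."""
    return [float(weights.get(tag, 0.0)) for tag in vocab]


if __name__ == "__main__":
    print(binary_vector(["people", "wine"], VOCAB))
    print(real_valued_vector({"people": 0.7, "wine": 0.3}, VOCAB))
```

The binary form records only which tags are present, while the real-valued form also keeps how strongly each tag applies, which is the information the binary encoding discards.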

II-C Label Evaluation and Challenges

Label evaluation is an essential step in guaranteeing the quality of labels and label representations. Thus, label evaluation plays a key role in the performance of multi-output tasks. Different models can be used to evaluate label quality: which to choose depends on the task. Generally, labels can be evaluated in three different respects:

whether the annotation is of good quality (Step A).
whether the chosen label representation represents the labels well (Step B).
whether the provided label set adequately covers the dataset (Label Set).

After evaluation, a human expert is generally required to explore any underlying issues and provide feedback to improve different aspects of the labels if needed.

II-C1 Issues of Label Annotation

The aforementioned annotation methods, e.g., crowdsourcing, annotation tools, and social media, help researchers collect annotated data efficiently. But, without experts, these annotation methods are highly likely to result in the so-called noisy label problem, which includes both missing annotations and incorrect annotations. There are various reasons for noisy labels – for example, using crowdsourced workers that lack the required domain knowledge, social media users that include irrelevant tags with their image or post, or ambiguous text in a caption.

II-C2 Issues of Label Representation

Output labels can also have internal structures and, often, this structure information is critical to the performance of the multi-output learning task at hand. Tag-based information retrieval [14] and image captioning [15] are two examples where structure is crucial. However, incorporating this information into a label representation is a non-trivial undertaking, as the data are usually large in volume and domain knowledge is required to define their structure. In addition, the output label space might contain ambiguity. For example, a bag-of-words (BOW) is traditionally used as a representation of a label space in natural language processing tasks, but BOW contains word sense ambiguity, as two different words may have the same meaning and one word might refer to multiple meanings.

II-C3 Issues of the Label Set

Constructing a label set for data annotation requires a human expert with domain knowledge. Plus, it is common that the provided label set does not contain sufficient labels for the data – perhaps due to fast data growth or the low occurrence of some labels. Therefore, there are likely to be unseen labels in the test data, which leads to open-set [16], zero-shot [17] or concept drift [18] problems.
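A quick way to check whether a dataset is affected by this issue is to compare the labels observed in training with those in the test data. The sketch below does this with plain Python sets; the function name and the example labels are hypothetical.

```python
from typing import Iterable, Set


def unseen_labels(train_label_sets: Iterable[Set[str]],
                  test_label_sets: Iterable[Set[str]]) -> Set[str]:
    """Labels that occur in the test annotations but never in training."""
    seen = set().union(*train_label_sets)
    tested = set().union(*test_label_sets)
    return tested - seen


if __name__ == "__main__":
    train = [{"cat", "dog"}, {"dog"}]
    test = [{"dog", "zebra"}, {"cat"}]
    print(unseen_labels(train, test))
```

A non-empty result signals that a model trained on the given label set cannot predict some test labels without open-set or zero-shot techniques.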

III Multi-output Learning

In contrast to traditional single-output learning, multi-output learning can concurrently predict multiple outputs. The outputs can be of various types and structures, and the problems that can be solved are diverse. A summary of the subfields that use multi-output learning along with their corresponding output types, structures, and applications is presented in Table I.

We begin the section with an introduction to some of the output structures in multi-output learning problems. The different problem definitions common to various subfields are provided next, along with the different constraints placed on the output space. We also discuss some special cases of these problems and give a brief overview of some of the evaluation metrics that are specific to multi-output learning. The section concludes with some insights into the evolution of output dimensions through an analysis of several commonly used datasets.

III-A Myriads of Output Structures

The increasing demand for sophisticated decision-making tasks has led to new kinds of outputs, some of which have complex structures. With social media, social networks, and various online services becoming ubiquitous, a wide range of output labels can be stored and then collected by researchers. Output labels can be anything; they could be text, images, audio, or video, or a combination as multimedia. For example, given a long document as input, the output might be a summary of the input in text format. Given some text fragments, the output might be an image with its contents described by the input text. Similarly, audio, such as music, and videos can be generated given different types of inputs. In addition to the different output types, there are also a number of different possible output structures. Here we present several typical output structures given an image as an input, using the example in Fig. 2 as a way to illustrate just how many output structures might be possible across all the different input types.

III-A1 Independent Vector

An independent vector is a vector with separate dimensions (features), where each dimension represents a particular label that does not necessarily depend on other labels. Binary vectors can be used to represent a given piece of data as tags, attributes, BOW, bag-of-visual-words, hash codes, etc. Real-valued vectors provide the weighted dimensions, where the real value represents the strength of the input data against the corresponding label. Applications include annotation or classification of text, images, or video with binary vectors [19, 20, 21], and demand or energy prediction with real-valued vectors [23]. An independent vector can be used to represent the tags of an image, as shown in Fig. 2 (1), where all the tags “people”, “dinner”, “table” and “wine” have equal weight.

III-A2 Distribution

Unlike independent vectors, distributions provide information about the probability that a particular dimension will be associated with a particular data sample. In Fig. 2 (2), the tag with the largest weight is “people” and is the main content of the image, while “dinner” and “table” have similar distributions. Applications for distribution outputs include head pose estimation [25], facial age estimation [26] and text mining [27].

III-A3 Ranking

Outputs might also be in the form of a ranking, which shows the tags ordered from the most to least important. The results from a distribution learning model can be converted into a ranking, but a ranking model is not restricted to only distribution learning models. Text categorization [28], question answering [29] and visual object recognition [30] are applications where rankings are often used.
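The conversion from a distribution to a ranking mentioned above amounts to sorting the labels by their weight. The following Python sketch shows one way to do it; the weights are made-up values that merely mimic the Fig. 2 description ("people" largest, "dinner" and "table" similar), and breaking ties alphabetically is an arbitrary assumption.

```python
from typing import Dict, List


def distribution_to_ranking(dist: Dict[str, float]) -> List[str]:
    """Order labels from most to least probable; ties are broken alphabetically."""
    return sorted(dist, key=lambda label: (-dist[label], label))


if __name__ == "__main__":
    # Hypothetical label distribution for a single image.
    tags = {"people": 0.5, "dinner": 0.2, "table": 0.2, "wine": 0.1}
    print(distribution_to_ranking(tags))
```

The reverse direction is not possible in general, since a ranking keeps only the order of the labels and discards their weights.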

III-A4 Text

Text can be in the form of keywords, sentences, paragraphs, or even documents. Fig. 2 (4) illustrates an example of text output as a caption of the image – “People are having dinner”. Other applications for text outputs are document summarization [45] and paragraph generation [46].

III-A5 Sequence

Sequence outputs refer to a series of elements selected from a label set. Each element is predicted depending on the input as well as the predicted output(s) from the preceding element. An output sequence often corresponds to an input sequence. For example, in speech recognition, we expect the output to be a sequence of text that corresponds to a given audio signal of speech [47]. In language translation, we expect the output to be a sentence transformed into the target language [32]. In the example shown in Fig. 2 (5), the input is an image caption, i.e., text, and the outputs are part-of-speech (POS) tags for each word in the sequence.

III-A6 Tree

Tree outputs are essentially outputs in the form of a hierarchy. The outputs, usually labels, have an internal structure where each output has a label that belongs to, or is connected to, its ancestors in the tree. For example, in syntactic parsing [35], as shown in Fig. 2 (6), each of the outputs for an input sentence is a POS tag and the entire output is a parsing tree. “people” is labeled as a noun N, but it is also a noun phrase NP as per the tree.
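One simple way to hold such a tree output in code is as nested (label, body) pairs, from which the chain of labels above any word can be read off. The sketch below is only an illustration: the parse of "People are having dinner" is a hypothetical, simplified tree, not the one drawn in Fig. 2.

```python
from typing import Any, List, Optional, Tuple

# A node is (label, word) for a leaf or (label, [child nodes]) otherwise.
Tree = Tuple[str, Any]

# Hypothetical, simplified parse of the caption "people are having dinner".
PARSE: Tree = ("S", [
    ("NP", [("N", "people")]),
    ("VP", [("V", "are"), ("V", "having"), ("NP", [("N", "dinner")])]),
])


def labels_above(tree: Tree, word: str) -> Optional[List[str]]:
    """Labels on the path from the root down to `word`, or None if absent."""
    label, body = tree
    if isinstance(body, str):
        return [label] if body == word else None
    for child in body:
        path = labels_above(child, word)
        if path is not None:
            return [label] + path
    return None


if __name__ == "__main__":
    print(labels_above(PARSE, "people"))
```

For "people" the path contains both N and NP, which reflects the point made above that an output carries its own label as well as those of its ancestors.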

III-A7 Image

Images are a special form of output that consists of multiple pixel values, where each pixel is predicted depending on the input and the pixels around it. Fig. 2 (7) shows super-resolution construction [37] as one popular application where images are common outputs. Super-resolution construction means constructing a high-resolution image from a low-resolution image. Other image output applications include text-to-image synthesis [48], which generates images from natural language descriptions, and face generation [49].

III-A8 Bounding Box

Bounding boxes as outputs are often used to find the exact locations of an object or objects appearing in an image. This is a common task in object recognition and object detection [30]. In Fig. 2 (8), each of the faces is localized by a bounding box so that each person can be identified.

III-A9 Link

Links as outputs usually represent the association between two nodes in a network [36]. Fig. 2 (9) illustrates a task to predict whether two currently unlinked users will be friends in the future given a partitioned social network where the edges represent friendships between users.

III-A10 Graph

Graphs are commonly used to model relationships between objects. They consist of a set of nodes and edges, where a node represents an object and an edge indicates a relationship between two objects. Scene graphs [50], for example, are often output as a way to describe the content of an image [34]. Fig. 2 (10) shows that, given an input image, the output is a graph definition where the nodes are the objects appearing in the image, i.e., “people”, “dinner”, “table”, and “wine”, and the edges are the relationships between these objects. Scene graphs are very useful as representations for tasks like image generation [51] and visual question answering [52].

III-A11 Other Outputs

Beyond these few types, there are still many other types of output structures. For example, contour and polygon outputs are similar to bounding boxes and can be used as labels for object localization. In information retrieval, the output(s) could be of the list type, say, of data objects that are similar to the given query. In image segmentation, the outputs are usually segmentation masks of different objects. In signal processing, outputs might be audio of speech or music. In addition, some real-world applications may require more sophisticated output structures relating to multiple tasks. For example, one may require that objects be recognized and localized at the same time, such as in co-saliency, i.e., discovering the common saliency of multiple images [53], simultaneously segmenting similar objects given multiple images in co-segmentation [54], or detecting and identifying objects in multiple images in object co-detection [55].

III-B Problem Definition of Multi-output Learning

Multi-output learning maps each input (instance) to multiple outputs. Assume $\mathcal{X}=\mathds{R}^{d}$ is a $d$ -dimensional input space, and $\mathcal{Y}=\mathds{R}^{m}$ is an $m$ -dimensional output label space. The aim of multi-output learning is to learn a function $f:\mathcal{X}\rightarrow\mathcal{Y}$ from the training set $\mathcal{D}=\{(\mathbf{x}_{i},\mathbf{y}_{i})|1\leq i\leq n\}$ . For each training example $(\mathbf{x}_{i},\mathbf{y}_{i})$ , $\mathbf{x}_{i}\in\mathcal{X}$ is a $d$ -dimensional feature vector, and $\mathbf{y}_{i}\in\mathcal{Y}$ is the corresponding output associated with $\mathbf{x}_{i}$ . The general definition of multi-output learning is given as: Finding a function $F:\mathcal{X}\times\mathcal{Y}\rightarrow\mathds{R}$ based on the training sample of input-output pairs, where $F(\mathbf{x},\mathbf{y})$ is a compatibility function that evaluates how compatible the input $\mathbf{x}$ and the output $\mathbf{y}$ are. Then, given an unseen instance $\mathbf{x}$ at the test stage, the output is predicted to be the one with the largest compatibility score, namely $f(\mathbf{x})=\widetilde{\mathbf{y}}=\arg\max_{\mathbf{y}\in\mathcal{Y}}F(\mathbf{x},\mathbf{y})$ [56].

This definition provides a general framework for multi-output learning problems. Although different multi-output learning subfields vary in their output structures, they can be defined within this framework given certain constraints on the output label space $\mathcal{Y}$ .

We selected several popular subfields and present the constraints of their output space in the following sections. Note that multi-output learning is not restricted to these particular scenarios; they are just examples for illustration.
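As a minimal sketch of the compatibility-based prediction rule $f(\mathbf{x})=\arg\max_{\mathbf{y}\in\mathcal{Y}}F(\mathbf{x},\mathbf{y})$ defined above, the following Python snippet enumerates a small output space $\{-1,+1\}^{m}$ and returns the highest-scoring output. The joint feature map (an outer product of input and output), the linear form of $F$, and the weight values are illustrative assumptions, not part of the definition; brute-force enumeration is only feasible for very small $m$.

```python
from itertools import product


def joint_features(x, y):
    """Hypothetical joint feature map Phi(x, y): flattened outer product of x and y."""
    return [xi * yj for xi in x for yj in y]


def compatibility(w, x, y):
    """Assumed linear compatibility score F(x, y) = w . Phi(x, y)."""
    return sum(wk * fk for wk, fk in zip(w, joint_features(x, y)))


def predict(w, x, m):
    """Return the y in {-1, +1}^m with the largest compatibility score (brute force)."""
    candidates = product((-1, +1), repeat=m)
    return max(candidates, key=lambda y: compatibility(w, x, y))


# Toy weights for d = 2 inputs and m = 2 outputs (illustrative values only).
w = [1.0, -1.0, 0.5, 2.0]
x = [1.0, 1.0]
y_hat = predict(w, x, m=2)
print(y_hat)
```

Restricting the candidate set passed to the maximization is what distinguishes the subfields below: each one corresponds to a different constraint on $\mathcal{Y}$.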

III-B1 Multi-label Learning

The task of multi-label learning is to learn a function $f(\cdot)$ that predicts the proper label sets for unseen instances [1]. In this task, each instance is associated with a set of class labels/tags and is represented by a sparse binary label vector. A value of $+1$ denotes that the instance has the label and $-1$ that it does not. Thus, $\mathbf{y}_{i}\in\mathcal{Y}=\{-1,+1\}^{m}$ . Given an unseen instance $\mathbf{x}\in\mathcal{X}$ , the learned multi-label classification function $f(\cdot)$ outputs $f(\mathbf{x})\in\mathcal{Y}$ , where the labels in the output vector with a value of $+1$ are used as the predicted labels for $\mathbf{x}$ .

III-B2 Multi-target Regression

The aim of multi-target regression is to simultaneously predict multiple real-valued output variables for one instance [5, 57]. Here, multiple labels are associated with each instance, represented by a real-valued vector, where the values represent how strongly the instance corresponds to a label. Therefore, we have the constraint of $\mathbf{y}_{i}\in\mathcal{Y}=\mathds{R}^{m}$ . Given an unseen instance $\mathbf{x}\in\mathcal{X}$ , the learned multi-target regression function $f(\cdot)$ predicts a real-valued vector $f(\mathbf{x})\in\mathcal{Y}$ as the output.

III-B3 Label Distribution Learning

Label distribution learning determines the relative importance of each label in the multi-label learning problem [58]. This is as opposed to multi-label learning, which simply learns to predict a set of labels. As illustrated in Fig. 2, the idea of label distribution learning is to predict multiple labels, each with a degree value that represents how well the label describes the instance, such that the degree values for each instance sum to 1. Thus, the output space for label distribution learning satisfies $\mathbf{y}_{i}=(y_{i}^{1},y_{i}^{2},...,y_{i}^{m})\in\mathcal{Y}=\mathds{R}^{m}$ with the constraints $y_{i}^{j}\in[0,1],1\leq j\leq m$ and $\sum_{j=1}^{m}y_{i}^{j}=1$ .

III-B4 Label Ranking

The goal of label ranking is to map instances to a total order over a finite set of predefined labels [4]. In label ranking, each instance is associated with the rankings of multiple labels. Therefore, the outputs of the problem are the total order of all the labels for each instance. Let $\mathcal{L}=\{\lambda_{1},\lambda_{2},...,\lambda_{m}\}$ denote the predefined label set. A ranking can be represented as a permutation $\pi$ on $\{1,2,...,m\}$ , such that $\pi(j)=\pi(\lambda_{j})$ is the position of the label $\lambda_{j}$ in the ranking. Therefore, given an unseen instance $\mathbf{x}\in\mathcal{X}$ , the learned label ranking function $f(\cdot)$ predicts a permutation $f(\mathbf{x})=(y_{i}^{\pi(1)},y_{i}^{\pi(2)},...,y_{i}^{\pi(m)})\in\mathcal{Y}$ as the output.
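To make the relationship between a label distribution and a ranking concrete, the sketch below normalizes a vector of non-negative scores into a distribution that sums to 1 and then derives a permutation $\pi$ from it, where `pi[j]` is the position of label $\lambda_{j}$ (1 = most important). The function names and the tie-breaking rule (lower label index first) are assumptions made for illustration.

```python
def normalize(scores):
    """Turn non-negative scores into a label distribution that sums to 1."""
    total = sum(scores)
    return [s / total for s in scores]


def ranking_from_distribution(dist):
    """Return pi, where pi[j] is the 1-based position of label j in the ranking.

    Labels are ordered by decreasing degree value; ties keep label-index order
    because Python's sort is stable.
    """
    order = sorted(range(len(dist)), key=lambda j: -dist[j])
    pi = [0] * len(dist)
    for position, j in enumerate(order, start=1):
        pi[j] = position
    return pi


dist = normalize([2.0, 5.0, 1.0, 2.0])
pi = ranking_from_distribution(dist)
print(dist, pi)
```

The reverse direction does not hold in general: a ranking discards the degree values, so a distribution cannot be recovered from it.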

III-B5 Sequence Alignment Learning

Sequence alignment learning aims to identify the regions of relationships between two or more sequences. The outputs in this task are a sequence of multiple labels for the input instance. The output vector has the constraint $\mathbf{y}_{i}\in\mathcal{Y}=\{0,1,...,c\}^{m}$ , where $c$ denotes the total number of labels. In sequence alignment learning, $m$ may vary depending on the input. Given an unseen instance $\mathbf{x}\in\mathcal{X}$ , the learned sequence alignment function $f(\cdot)$ outputs $f(\mathbf{x})\in\mathcal{Y}$ , where all of the predicted labels in the output vector form the predicted sequence for $\mathbf{x}$ .

III-B6 Network Analysis

Network analysis explores the relationships and interactions between objects and entities in a network structure, and link prediction is a common task within this subfield. Let $G=(V,E)$ denote the graph of a network. $V$ is the set of nodes, which represent objects, and $E$ is the set of edges, which represent the relationships between objects. Given a snapshot of a network, the goal of link prediction is to infer whether a connection exists between two nodes. The output vector $\mathbf{y}_{i}\in\mathcal{Y}=\{-1,+1\}^{m}$ is a binary vector whose values represent whether there will be an edge $e=(u,v)$ between any pair of nodes $u,v\in V$ with $e\notin E$ . Here, $m$ is the number of node pairs that are not connected in the current graph $G$ , and each dimension in $\mathbf{y}_{i}$ represents one such pair.
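The following sketch shows how the output vector for link prediction can be laid out: it enumerates the node pairs that are not connected in the current graph, so that each pair becomes one dimension of $\mathbf{y}_{i}$. The graph is assumed to be undirected, and the toy nodes, edges, and "future" link are hypothetical.

```python
from itertools import combinations


def candidate_pairs(nodes, edges):
    """List the node pairs (u, v) that are not connected in the current graph G = (V, E)."""
    existing = {frozenset(e) for e in edges}
    return [(u, v) for u, v in combinations(sorted(nodes), 2)
            if frozenset((u, v)) not in existing]


V = ["a", "b", "c", "d"]
E = [("a", "b"), ("b", "c")]
pairs = candidate_pairs(V, E)   # each pair is one output dimension
m = len(pairs)

# Hypothetical ground truth: suppose only the link (a, c) forms later.
future_edges = {frozenset(("a", "c"))}
y = [+1 if frozenset(p) in future_edges else -1 for p in pairs]
print(m, pairs, y)
```

Because $m$ grows quadratically with the number of nodes, sparse networks quickly lead to very high-dimensional outputs.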

III-B7 Data Generation

Data generation is a subfield of multi-output learning that aims to create and then output structured data of a certain distribution. Deep generative models are usually used to generate the data, which may be in the form of text, images, or audio. The multiple output labels in the problem become the different words in the vocabulary, the pixel values, the audio tones, etc. Take image generation as an example. The output vector has the constraint $\mathbf{y}_{i}\in\mathcal{Y}=\{0,1,...,255\}^{m_{w}\times m_{h}\times 3}$ , where $m_{w}$ and $m_{h}$ are the width and height of the image. Given an unseen instance $\mathbf{x}\in\mathcal{X}$ , which is usually random noise or an embedding vector with some constraints, the learned GAN-based network $f(\cdot)$ outputs $f(\mathbf{x})\in\mathcal{Y}$ , where all of the predicted pixel values in the output vector form the generated image for $\mathbf{x}$ .

III-B8 Semantic Retrieval

Semantic retrieval means finding the meanings within some given information. Here, we consider semantic retrieval in a setting where each input instance has semantic labels that can be used to help retrieval [59]. Thus, each instance representation comprises semantic labels as the output $\mathbf{y}_{i}\in\mathcal{Y}=\mathds{R}^{m}$ . Given an unseen instance $\mathbf{x}\in\mathcal{X}$ as the query, the learned retrieval function $f(\cdot)$ predicts a real-valued vector $f(\mathbf{x})\in\mathcal{Y}$ as the intermediate output result. The intermediate output vector can then be used to retrieve a list of similar data instances from the database by using a proper distance-based retrieval method.

III-B9 Time-series Prediction

The goal in time-series prediction is to predict the future values in a series based on previous observations [60]. The inputs are a series of data vectors for a period of time, and the output is a data vector for a future timestamp. Let $t$ denote the time index. The output vector at time $t$ is represented as $\mathbf{y}_{i}^{t}\in\mathcal{Y}=\mathds{R}^{m}$ . Therefore, the outputs within a period of time from $t=0$ to $t=T$ are $\mathbf{y}_{i}=(\mathbf{y}_{i}^{0},...,\mathbf{y}_{i}^{t},...,\mathbf{y}_{i}^{T})$ . Given previously observed values, the learned time-series function outputs predicted consecutive future values.

III-C Special Cases of Multi-output Learning

III-C1 Multi-class Classification

Multi-class classification can be categorized as a traditional single-output learning paradigm if the output class is represented as either an integer encoding or a one-hot vector.

III-C2 Fine-grained Classification

Fine-grained classification is a challenging multi-class classification task where the categories may only have subtle visual differences [61]. Although the output of fine-grained classification shares the same vector representation as multi-class classification, the vectors have different internal structures. Also, in its label hierarchy, labels with the same parents tend to be more closely related than labels with different parents.

III-C3 Multi-task Learning

Multi-task learning (MTL) is the subfield that aims to improve generalization performance by learning multiple related tasks simultaneously [62, 63]. Each task in the problem outputs one single label or value. This can be thought of as part of the multi-output learning paradigm in that learning multiple tasks is similar to learning multiple outputs. MTL leverages the relatedness between tasks to improve the performance of learning models. One major difference between multi-task learning and multi-output learning is that, in multi-task learning, different tasks might be trained on different training sets or features, while, in multi-output learning, the output variables usually share the same training data or features.

III-D Model Evaluation Metrics

In this section, we present the conventional evaluation metrics used to assess multi-output learning models with a test dataset. Let $\mathcal{T}=\{(\mathbf{x}_{i},\mathbf{y}_{i})|1\leq i\leq N\}$ be the test dataset, $f(\cdot)$ be the multi-output learning model, and $\hat{\mathbf{y}}_{i}=f(\mathbf{x}_{i})$ be the predicted output of $f(\cdot)$ for the testing example $\mathbf{x}_{i}$ . In addition, let $Y_{i}$ and $\hat{Y}_{i}$ denote the set of labels corresponding to $\mathbf{y}_{i}$ and $\hat{\mathbf{y}}_{i}$ , respectively. $\mathds{I}$ is an indicator function, where $\mathds{I}(g)=1$ if $g$ is true, and $\mathds{I}(g)=0$ otherwise.

III-D1 Classification-based Metrics

Classification-based metrics evaluate the performance of multi-output learning with respect to classification problems, such as multi-label classification, semantic retrieval, image annotation, label ranking, etc. The outputs are usually discrete values. The conventional classification metrics fall into three groups: example-based, label-based and ranking-based.

(a) Example-based Metrics: Example-based metrics [64] evaluate the performance of multi-output learning models with respect to each data instance. Performance is first evaluated on each test instance separately, and then the mean of all the individual results is used to reflect the overall performance of the model. Because the evaluation of multi-output classification tasks works under the same mechanism as that of binary classification (single-output) tasks, the classic metrics for binary classification can be extended to evaluate multi-output classification models [1]. The commonly used metrics are exact match ratio, accuracy, precision, recall and $F_{1}$ score.

Hamming loss

is an example-based metric specifically designed for multi-output classification tasks. It computes the average difference between the predicted and actual output, considering both prediction and omission errors, i.e., when the prediction is incorrect or a label is not predicted at all. The Hamming loss averaged over all data instances is defined as:

$$\text{HammingLoss}=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{m}\left|\hat{Y}_{i}\,\Delta\,Y_{i}\right|$$

where $m$ is the number of labels and $\Delta$ represents the symmetric difference between two sets. The lower the Hamming loss, the better the performance of the model.

(b) Label-based Metrics: Label-based metrics evaluate performance with respect to each output label. These metrics aggregate the contributions of all the labels to arrive at an averaged evaluation of the model. There are two techniques for obtaining label-based metrics: macro- and micro-averaging. Macro-based approaches compute the metrics for each label independently and then average over all the labels with equal weights. By contrast, micro-based approaches give equal weight to every data sample. Let $TP_{l}$ , $FP_{l}$ , $TN_{l}$ and $FN_{l}$ denote the number of true positives, false positives, true negatives, and false negatives for each label $l$ , respectively. Let $B$ be a binary evaluation metric (accuracy, precision, recall or $F_{1}$ score) for a particular label. The macro and micro approaches are therefore defined as follows.

macro-averaging:

$$B_{macro}=\frac{1}{m}\sum_{l=1}^{m}B\left(TP_{l},FP_{l},TN_{l},FN_{l}\right)$$

micro-averaging:

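$$B_{micro}=B\left(\sum_{l=1}^{m}TP_{l},\sum_{l=1}^{m}FP_{l},\sum_{l=1}^{m}TN_{l},\sum_{l=1}^{m}FN_{l}\right)$$

(c) Ranking-based Metrics: Ranking-based metrics evaluate the performance in terms of the ordering of the output labels.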

One-error

is the number of times the top-ranked label is not in the true label set. This approach only considers the most confident predicted label of the model. An averaged one-error over all data instances is computed as:

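$$\text{One-error}=\frac{1}{N}\sum_{i=1}^{N}\mathds{I}\left(\arg\min_{\lambda\in\mathcal{L}}\pi_{i}(\lambda)\notin Y_{i}\right)$$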

where $\mathds{I}$ is an indicator function, $\mathcal{L}$ denotes the label set, and $\pi_{i}(\lambda)$ is the predicted rank of label $\lambda$ for the test instance $\mathbf{x_{i}}$ . The smaller the one-error, the better the performance.

Ranking loss

indicates the average proportion of incorrectly ordered label pairs.

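$$\text{RankingLoss}=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{|Y_{i}||\overline{Y_{i}}|}\left|\left\{(\lambda_{a},\lambda_{b})\in Y_{i}\times\overline{Y_{i}}:\pi_{i}(\lambda_{a})>\pi_{i}(\lambda_{b})\right\}\right|$$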

where $\overline{Y_{i}}=\mathcal{L}\setminus Y_{i}$ . The smaller the ranking loss, the better the performance of the model.

Average Precision (AP)

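is the proportion of the labels ranked at or above a particular true label that are themselves in the true label set, averaged over all the true labels. The larger the value, the better the performance of the model. The averaged AP over all test data instances is defined as follows:

$$\text{AP}=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{|Y_{i}|}\sum_{\lambda\in Y_{i}}\frac{\left|\left\{\lambda^{\prime}\in Y_{i}:\pi_{i}(\lambda^{\prime})\leq\pi_{i}(\lambda)\right\}\right|}{\pi_{i}(\lambda)}$$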

Discussion: The metrics listed above are those commonly used with classification-based multi-output learning problems. But the choice of metrics varies according to the different considerations of each application. Take image annotation for example. If the aim of the task is to annotate each image correctly, example-based metrics are optimal for evaluating performance. However, if the objective is keyword-based image retrieval, the macro-averaging metric is preferable [64]. Further, some metrics are more suited to special cases of multi-output learning problems. For instance, for imbalanced learning tasks, the geometric mean [65] of some classification-based metrics, e.g., the errors, accuracy, and $F_{1}$ scores, is a more convincing choice for evaluation. The minimum sensitivity [66] can help determine the classes that hinder the performance in the imbalanced setting. We do not discuss these metrics in detail as they are not the focus here.
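As an illustration of how the classification-based metrics in this section can be computed, the sketch below implements Hamming loss, macro- and micro-averaged $F_{1}$, one-error, and ranking loss in plain Python. Label sets are represented as Python sets and each predicted ranking as a dictionary mapping a label to its rank position (1 = top); these representations, the function names, and the toy data are assumptions made for the example.

```python
def hamming_loss(true_sets, pred_sets, m):
    """Mean size of the symmetric difference between true and predicted sets, divided by m."""
    n = len(true_sets)
    return sum(len(t ^ p) for t, p in zip(true_sets, pred_sets)) / (n * m)


def one_error(true_sets, rankings):
    """Fraction of instances whose top-ranked label (rank 1) is not a true label."""
    misses = sum(1 for t, pi in zip(true_sets, rankings)
                 if min(pi, key=pi.get) not in t)
    return misses / len(true_sets)


def ranking_loss(true_sets, rankings):
    """Mean fraction of (true, non-true) label pairs where the true label is ranked lower.

    Assumes every instance has at least one true and one non-true label.
    """
    total = 0.0
    for t, pi in zip(true_sets, rankings):
        others = set(pi) - t
        wrong = sum(1 for a in t for b in others if pi[a] > pi[b])
        total += wrong / (len(t) * len(others))
    return total / len(true_sets)


def f1(tp, fp, fn):
    """Binary F1 from counts; taken to be 0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(true_sets, pred_sets, labels):
    """Return (macro-F1, micro-F1) from per-label TP/FP/FN counts."""
    counts = []
    for l in labels:
        tp = sum(1 for t, p in zip(true_sets, pred_sets) if l in t and l in p)
        fp = sum(1 for t, p in zip(true_sets, pred_sets) if l not in t and l in p)
        fn = sum(1 for t, p in zip(true_sets, pred_sets) if l in t and l not in p)
        counts.append((tp, fp, fn))
    macro = sum(f1(*c) for c in counts) / len(labels)   # average of per-label scores
    micro = f1(*(sum(col) for col in zip(*counts)))     # score of pooled counts
    return macro, micro


labels = ["a", "b", "c"]
true_sets = [{"a", "b"}, {"c"}]
pred_sets = [{"a"}, {"b", "c"}]
rankings = [{"a": 1, "b": 3, "c": 2}, {"a": 3, "b": 1, "c": 2}]

hl = hamming_loss(true_sets, pred_sets, m=len(labels))
oe = one_error(true_sets, rankings)
rl = ranking_loss(true_sets, rankings)
macro, micro = macro_micro_f1(true_sets, pred_sets, labels)
print(hl, oe, rl, macro, micro)
```

On this toy data the macro and micro scores happen to coincide; they generally differ when label frequencies are imbalanced, which is the situation the Discussion above refers to.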

III-D2 Regression-based Metrics

Unsurprisingly, regression-based metrics evaluate multi-output learning performance with regression problems, e.g., object localization or image generation. The outputs are usually real values.

Mean absolute error (MAE)

is a classic single-output regression metric that computes the average absolute difference between the predicted and the actual outputs. It can be extended to evaluate multi-output regression models by simply averaging the metric over all the outputs.

$$\text{MAE}=\frac{1}{m}\sum_{l=1}^{m}\frac{1}{N}\sum_{i=1}^{N}\left|y_{i}^{l}-\hat{y}_{i}^{l}\right|$$

Mean squared error (MSE)

is a regression metric that computes the average squared difference between the predicted and the actual outputs. Like MAE, it can also be extended to the multi-output setting. However, MSE is more sensitive to outliers, because squaring makes large errors contribute far more than they do under MAE.

$$\text{MSE}=\frac{1}{m}\sum_{l=1}^{m}\frac{1}{N}\sum_{i=1}^{N}\left(y_{i}^{l}-\hat{y}_{i}^{l}\right)^{2}$$

Average correlation coefficient (ACC)

measures the degree of association between the actual and the predicted outputs.

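$$\text{ACC}=\frac{1}{m}\sum_{l=1}^{m}\frac{\sum_{i=1}^{N}\left(y_{i}^{l}-\bar{y}^{l}\right)\left(\hat{y}_{i}^{l}-\bar{\hat{y}}^{l}\right)}{\sqrt{\sum_{i=1}^{N}\left(y_{i}^{l}-\bar{y}^{l}\right)^{2}\sum_{i=1}^{N}\left(\hat{y}_{i}^{l}-\bar{\hat{y}}^{l}\right)^{2}}}$$

where $y_{i}^{l}$ and $\hat{y}_{i}^{l}$ are the actual and predicted $l$ -th outputs of $\mathbf{x_{i}}$ , respectively, and $\bar{y}^{l}$ and $\bar{\hat{y}}^{l}$ are the averages of the actual and predicted outputs for a label $l$ over all samples.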

Intersection over union threshold (IoU)

is a specifically-designed metric for assessing object localization or segmentation. It is computed as:

$$\text{IoU}=\frac{\text{area of overlap}}{\text{area of union}}$$

where area of overlap is the area of intersection between the predicted and the actual bounding boxes/segmentation masks. Similarly, area of union is the union area between the actual and predicted boxes/masks.
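The regression-based metrics above can be sketched in a few lines of Python. The snippet assumes that the actual and predicted outputs are given as lists of rows (one row of $m$ values per test instance) and that bounding boxes are axis-aligned tuples `(x1, y1, x2, y2)`; both conventions and the toy numbers are illustrative assumptions.

```python
from math import sqrt


def mae(Y, Y_hat):
    """Mean absolute error averaged over all instances and all outputs."""
    n, m = len(Y), len(Y[0])
    return sum(abs(a - b) for ya, yb in zip(Y, Y_hat) for a, b in zip(ya, yb)) / (n * m)


def mse(Y, Y_hat):
    """Mean squared error averaged over all instances and all outputs."""
    n, m = len(Y), len(Y[0])
    return sum((a - b) ** 2 for ya, yb in zip(Y, Y_hat) for a, b in zip(ya, yb)) / (n * m)


def average_correlation(Y, Y_hat):
    """Pearson correlation per output dimension, averaged over the m outputs.

    Assumes no output column is constant (otherwise the denominator is zero).
    """
    n, m = len(Y), len(Y[0])
    total = 0.0
    for l in range(m):
        col = [row[l] for row in Y]
        col_hat = [row[l] for row in Y_hat]
        mean, mean_hat = sum(col) / n, sum(col_hat) / n
        cov = sum((a - mean) * (b - mean_hat) for a, b in zip(col, col_hat))
        var = sum((a - mean) ** 2 for a in col)
        var_hat = sum((b - mean_hat) ** 2 for b in col_hat)
        total += cov / sqrt(var * var_hat)
    return total / m


def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


Y = [[1.0, 2.0], [3.0, 4.0]]
Y_hat = [[1.0, 3.0], [2.0, 4.0]]
print(mae(Y, Y_hat), mse(Y, Y_hat), average_correlation(Y, Y_hat))
print(iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)))
```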

III-D3 New Metrics

Data generation is an emerging subfield of multi-output learning that uses generative models to output structured data with certain distributions. Based on the particulars of the task at hand, a model’s performance is usually evaluated in two respects: 1) whether the generated data actually follows the desired real data distribution; and 2) the quality of the generated samples. Metrics like average log-likelihood [67], coverage metric [68], maximum mean discrepancy (MMD) [69], and geometry score [70] are frequently used to assess the veracity of the distribution. Quantifying the quality of the generated data remains challenging. Commonly used metrics are the inception score (IS) [71], mode score (MS) [72], Fréchet inception distance (FID) [73] and kernel inception distance (KID) [74]. Precision, recall and $F_{1}$ score are also employed in GANs to quantify the degree of overfitting in the model [75].

III-E Multi-output Learning Datasets

Most of the datasets used to experiment with multi-output learning problems have either been constructed or become popular because they reflect, and therefore test, a challenge that needs to be overcome. We have presented these datasets according to the challenges reflected in the 4 Vs. Table II lists the datasets, including their multi-output characteristics, the challenge that can be tested, the application domain, plus the dataset name, source, and descriptive statistics.

The large-scale datasets, i.e., the datasets that can be used to test volume, are extremely large. The enormity of their corresponding statistics illustrates the pressing need to overcome the challenges caused by this particular V among the 4.

Many studies that have focused on change in output distribution, e.g., concept drift/velocity, rely on synthetic streaming data or static databases in their experiments. We have also included some of the more popular real-world and/or dynamic databases that are used to experiment with these tasks. As shown in the table, the datasets come from various application domains, demonstrating the importance of this challenge.

The datasets designed to test complex multi-output learning problems contain a mix of different output structures. For example, the image datasets listed in the table include both labels and bounding boxes for the objects. These datasets can be used to test the variety of data.

Lastly, we come to veracity. Many efforts to deal with noisy labels evaluate their methods by beginning with a clean dataset to which artificial noise is then added. This helps researchers control and test different levels of noise. We have also listed several popular real-world datasets with some unknown level of errors in annotation.

IV The Challenges of Multi-output Learning and Representative Works

The pressing need for the complex prediction output and the explosive growth of output labels pose several challenges to multi-output learning and have exposed the inadequacies of many learning models that exist to date. In this section, we discuss each of these challenges and review several representative works on how they cope with these emerging phenomena. Further, given the success of artificial neural networks (ANNs), we also present several state-of-the-art examples of multi-output learning using an ANN for each challenge.

IV-A Volume - Extreme Output Dimensions

Large-scale datasets are ubiquitous in real-world applications. A dataset is defined to be large-scale if it meets one of three criteria: it has a large number of data instances, the input feature space has high dimensionality, or the output space has high dimensionality. Many studies have sought to solve the scalability issues caused by a large number of data instances, e.g., the instance selection method in [212], or with high-dimensional feature spaces, such as the feature selection method in [213]. However, the issues associated with high output dimensions have received much less attention.

Consider, for example, that if the label for each dimension of an $m$ -dimensional output vector can be selected from a label set with $c$ different labels, then the number of output outcomes is $c^{m}$ . Hence, these ultra-high-output dimensions/labels result in an extremely large output space and, in turn, high computation costs. Therefore, it is crucial to design multi-output learning models that can handle the immense and ongoing growth in outputs.
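A short calculation shows how quickly $c^{m}$ grows. The sketch below evaluates the number of possible outputs for the binary multi-label case ($c=2$) at a few illustrative values of $m$; the chosen values are arbitrary examples.

```python
def output_space_size(c, m):
    """Number of distinct outputs when each of m dimensions takes one of c labels."""
    return c ** m


# Binary multi-label case (c = 2) for a few illustrative output dimensions.
sizes = {m: output_space_size(2, m) for m in (10, 20, 30)}
for m, size in sizes.items():
    print(f"m = {m:2d}: {size:,} possible label vectors")
```

Even at $m=30$ the output space already exceeds a billion combinations, so enumerating candidate outputs is out of the question at the dimensionalities discussed in this section.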

An analysis of the current state-of-the-art research on ultra-high-output dimensions revealed some interesting insights. Our analysis was based on the datasets used in studies of multiple disciplines, such as machine learning, computer vision, natural language processing, information retrieval, and data mining. We specifically focused on articles in three top journals and three top international conferences: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), IEEE Transactions on Neural Networks and Learning Systems (TNNLS), the Journal of Machine Learning Research (JMLR), the International Conference on Machine Learning (ICML), the Conference on Neural Information Processing Systems (NIPS), and the Conference on Knowledge Discovery and Data Mining (KDD). Fig. 3 and Fig. 4 summarize our review. From these two figures, it is evident that the output dimensionality of the algorithms under study has continued to increase over time. In addition, the latest papers to address this issue in all of the selected venues are now dealing with more than a million output dimensions and, in some cases, are approaching billions of outputs. Moreover, the statistics for the conferences with shorter time-lags to publication demonstrate just how rapidly output dimensionality is increasing. From this analysis, we conclude that the explosion in output dimensionality is driving many developments in multi-output learning algorithms.

The studies we reviewed tend to fall into two categories: qualitative and quantitative approaches. The qualitative approaches generally involve generative models, while the quantitative approaches generally involve discriminative models. The main difference between the two is that generative models focus on learning the joint probability $P(x,y)$ of the inputs $x$ and the label $y$ , while discriminative models focus on the posterior $P(y|x)$ . Note that in a generative model, $P(x,y)$ can be used to generate some data $x$ , in which case $x$ is the generated output.
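The relationship between the joint probability $P(x,y)$ and the posterior $P(y|x)$ can be illustrated on a small discrete example. The sketch below estimates the joint distribution from counts and derives the posterior from it by normalizing over $y$; the toy observations and variable names are hypothetical.

```python
from collections import Counter

# Hypothetical observations of (input, label) pairs.
pairs = [("sunny", "walk"), ("sunny", "walk"), ("sunny", "read"), ("rainy", "read")]

counts = Counter(pairs)
n = len(pairs)
joint = {xy: c / n for xy, c in counts.items()}   # empirical P(x, y)


def posterior(x):
    """P(y | x) = P(x, y) / sum over y' of P(x, y')."""
    px = sum(p for (xi, _), p in joint.items() if xi == x)
    return {y: p / px for (xi, y), p in joint.items() if xi == x}


post = posterior("sunny")
print(joint)
print(post)
```

A model of the joint distribution contains enough information to recover the posterior, but not the other way around, which is why only the generative view can also be used to produce new $x$.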

IV-A1 Qualitative Approaches/Generative Models

The aim of image synthesis [48, 214] is to synthesize new images from textual descriptions. Some pioneering researchers have synthesized images using a GAN with the image distribution as multiple outputs [215]. In practice, however, these early GANs could only generate low-resolution images. Since those first attempts, there has been progress in scaling up GANs to generate high-resolution images with sensible outputs. For example, Reed et al. [48] proposed a GAN architecture that generates visually plausible 64 x 64 pixel images given text descriptions. In a follow-up study, they presented GAWWN [214], which scales the synthesized image up to 128 x 128 resolution by leveraging additional annotations. Subsequently, StackGAN [216] was proposed, which is capable of generating photo-realistic images at a 256 x 256 resolution from text descriptions. HDGAN [217] is the current state-of-the-art in image synthesis. It models high-resolution images in an end-to-end fashion at 512 x 512 pixels. Inevitably, the future will see further increases in resolution.

MaskGAN [218] uses a GAN to generate text (i.e., meaningful word sequences). The label set size corresponds to the vocabulary size. The output dimension is the length of the word sequence that is generated, which, technically, can be unlimited. However, MaskGAN only handles sentence-level text generation. Document-level and book-level text generation is still challenging.

IV-A2 Quantitative Approaches/Discriminative Models

Like instance and feature selection methods, which reduce the number of input instances and, in turn, reduce input dimensionality, it is natural to design models that similarly reduce output dimensionality. Embedding methods can be used to compress a space by projecting the original space onto a lower-dimensional space, with the expected information preserved, such as label correlations and neighborhood structure. Popular methods, such as random projections or canonical correlation analysis projections [219, 220, 221, 222], can be adopted to reduce the dimensions of the output label space. As a result, these modeling tasks can be performed on a compressed output label space and then the predicted compressed label can be projected back onto the original high-dimensional label space. Recently, several embedding methods have been proposed for extreme output dimensions. Mineiro and Karampatziakis [223] proposed a novel randomized embedding for extremely large output spaces. AnnexML [169] is another novel embedding method for graphs that captures graph structures in the embedding space. The embeddings are constructed from the k-nearest neighbors of the label vectors, and the predictions are made efficiently through an approximate nearest neighbor search method. Two popular ANN methods for handling extreme output dimensions are fastText learn tree [224] and XML-CNN [225]. FastText learn tree [224] jointly learns the data representation and the tree structure, and the learned tree structure is then used for efficient hierarchical prediction. XML-CNN is a CNN-based model that incorporates a dynamic max pooling scheme to capture fine-grained features from regions of the input document. A hidden bottleneck layer is used to reduce the model size.
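The compress-then-decode idea behind output embedding methods can be sketched as follows. A Gaussian random projection maps an $m$-dimensional label vector to a $k$-dimensional code, and a predicted code is mapped back by choosing the nearest encoded label vector among a set of known candidates. The nearest-candidate decoder is one simple choice assumed here for illustration and is not the decoding procedure of any specific method cited above; all names and sizes are hypothetical.

```python
import random


def random_projection(m, k, seed=0):
    """Build a k x m Gaussian random projection matrix (fixed seed for repeatability)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(k)]


def encode(P, y):
    """Compress an m-dimensional label vector y into a k-dimensional code z = P y."""
    return [sum(p * yj for p, yj in zip(row, y)) for row in P]


def decode(P, z, candidates):
    """Map a code back to label space by picking the candidate whose code is closest to z."""
    def dist(y):
        return sum((a - b) ** 2 for a, b in zip(encode(P, y), z))
    return min(candidates, key=dist)


m, k = 8, 3
P = random_projection(m, k)
train_labels = [
    (1, 0, 0, 1, 0, 0, 0, 0),
    (0, 1, 0, 0, 0, 1, 0, 0),
    (0, 0, 1, 0, 0, 0, 1, 1),
]

# In a full pipeline, z would be predicted from x by a regressor trained on the
# compressed labels; here the exact code of a known label vector stands in for it.
z = encode(P, train_labels[1])
y_back = decode(P, z, train_labels)
print(len(z), y_back)
```

The learning problem then has only $k$ regression targets instead of $m$ labels, at the price of an extra decoding step at prediction time.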

IV-B Variety - Complex Structures

With the increasing abundance of labels, there is a pressing need to understand their inherent structures. Complex output structures can lead to multiple challenges in multi-output learning. For instance, it is common for strong correlations and complex dependencies to exist between labels. Therefore, appropriately modeling output dependencies in the label representation is critical but non-trivial in multi-output learning. In addition, designing a multivariate loss function and proposing an efficient algorithm to alleviate the high complexity caused by complex structures are also challenging.

IV-B1 Appropriate Modeling of Output Dependencies

The simplest method of multi-output learning is to decompose the learning problem into $m$ independent single-output problems with each corresponding to a single value in the output space. A representative approach is binary relevance (BR) [226], which independently learns binary classifiers for all the labels in the output space. Given an unseen instance $\mathbf{x}$ , BR predicts the output labels by applying each of the binary classifiers and then aggregating the predicted labels. However, such independent models do not consider the dependencies between outputs. A set of predicted output labels might be assigned to the testing instance even though these labels never co-occur in the training set. Hence, it is crucial to model the output dependencies appropriately to obtain better performance for multi-output tasks.

Many classic learning methods have been proposed to model multiple outputs with interdependencies. These include label powersets (LPs) [227], classifier chains (CC) [228, 229], structured SVMs (SSVM) [230], and conditional random fields (CRF) [231], among others. LPs model the output dependencies by treating each different combination of labels in the output space as a single label, which transforms the problem into one of learning multiple single-label classifiers. The number of single-label classifiers to be trained is the number of label combinations, which grows exponentially with the number of labels. Therefore, LP has the drawback of high computation cost when training with a large number of output labels. Random k-labelsets [232], an ensemble of LP classifiers, is a variant of LP that alleviates the computational complexity problem by training each LP classifier on a different random subset of labels.
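The transformation step of the label powerset approach can be sketched as follows: every distinct label combination observed in the training data is mapped to its own class identifier. The function name and the toy label vectors are assumptions for illustration, and the single-label learner that would be trained on the resulting classes is omitted.

```python
def label_powerset(Y):
    """Map each distinct label vector to an integer class id (in order of first appearance)."""
    class_of = {}
    classes = []
    for y in Y:
        key = tuple(y)
        if key not in class_of:
            class_of[key] = len(class_of)
        classes.append(class_of[key])
    return classes, class_of


Y = [(1, -1, 1), (1, 1, -1), (1, -1, 1), (-1, -1, -1)]
classes, class_of = label_powerset(Y)
upper_bound = 2 ** len(Y[0])   # number of possible combinations for m binary labels
print(classes, len(class_of), upper_bound)
```

Only combinations seen during training receive a class, so the number of classes is bounded both by $2^{m}$ and by the number of training instances; unseen combinations can never be predicted.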

CC improves BR by taking the output correlations into account. It links all the binary classifiers from BR into a chain via a modified feature space. Given the $j$ th label, the instance $\mathbf{x}_{i}$ is augmented with the 1st, 2nd, … $(j-1)$ th label, i.e., $(\mathbf{x}_{i},l_{1},l_{2},...,l_{j-1})$ , as the input, to train the $j$ th classifier. Given an unseen instance, CC predicts the output using the 1st classifier, and then augments the instance with the prediction from the 1st classifier as the input to the 2nd classifier for predicting the next output. CC processes values in this way from the 1st classifier to the last and so preserves the output correlations. However, a different order of chains leads to different results. ECC [228], an ensemble of CC, was proposed to solve this problem. It trains the classifiers over a set of random ordering chains and averages the results. Probabilistic classifier chains (PCCs) [233] provide a probabilistic interpretation of CC by estimating the joint distribution of the output labels to capture the output correlations. CCMC [114] is a classifier chain model that considers the order of label difficulties to reduce the degradation in performance caused by ambiguous labels. It is an easy-to-hard learning paradigm that identifies easy and hard labels and uses the predictions for easy labels to help solve the harder labels.
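The chaining mechanism described above can be sketched in plain Python. During training, classifier $j$ sees the input augmented with the true values of labels $1$ to $j-1$; at prediction time, the augmentation uses the chain's own earlier predictions. The 1-nearest-neighbour base learner is a stand-in chosen only to keep the example self-contained; any binary classifier could take its place, and the toy data are hypothetical.

```python
def fit_1nn(X, y):
    """Hypothetical base learner: 1-nearest-neighbour under squared Euclidean distance."""
    data = list(zip(X, y))

    def predict(x):
        nearest = min(data, key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)))
        return nearest[1]

    return predict


def fit_chain(X, Y, fit=fit_1nn):
    """Train classifier j on the inputs augmented with the true labels 1..j-1."""
    m = len(Y[0])
    chain = []
    for j in range(m):
        X_aug = [list(x) + list(y[:j]) for x, y in zip(X, Y)]
        chain.append(fit(X_aug, [y[j] for y in Y]))
    return chain


def predict_chain(chain, x):
    """Predict labels in chain order, feeding each prediction to the next classifier."""
    labels = []
    for clf in chain:
        labels.append(clf(list(x) + labels))
    return labels


X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
Y = [[-1, -1], [-1, 1], [1, 1], [1, -1]]
chain = fit_chain(X, Y)
preds = [predict_chain(chain, x) for x in X]
new_pred = predict_chain(chain, [0.9, 0.1])
print(preds, new_pred)
```

Changing the label order changes the augmented inputs seen by each classifier, which is the order sensitivity that ensembles over random chain orderings are meant to reduce.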

SSVM leverages the idea of large margins to deal with multiple interdependent outputs. The compatibility function is defined as $F(\mathbf{x},\mathbf{y})=\mathbf{w}^{T}\Phi(\mathbf{x},\mathbf{y})$, where $\mathbf{w}$ is the weight vector and $\Phi:\mathcal{X}\times\mathcal{Y}\rightarrow\mathds{R}^{q}$ is the joint feature map over input and output pairs. The SSVM aims to find the classifier $h_{\mathbf{w}}(\mathbf{x})=\arg\max\limits_{\mathbf{y}\in\mathcal{Y}}\mathbf{w}^{T}\Phi(\mathbf{x},\mathbf{y})$ with the following objective:

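$$\min_{\mathbf{w}}\ \frac{1}{2}\|\mathbf{w}\|^{2}+C\sum_{i=1}^{n}\max_{\mathbf{y}\in\mathcal{Y}}\left\{\Delta(\mathbf{y}_{i},\mathbf{y})+\mathbf{w}^{T}\Phi(\mathbf{x}_{i},\mathbf{y})-\mathbf{w}^{T}\Phi(\mathbf{x}_{i},\mathbf{y}_{i})\right\}$$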

Constraining the structured hinge loss with $\Delta(\mathbf{y}_{i},\mathbf{y})+\mathbf{w}^{T}\Phi(\mathbf{x}_{i},\mathbf{y})-\mathbf{w}^{T}\Phi(\mathbf{x}_{i},\mathbf{y}_{i})\leq\xi_{i}$, for all $\mathbf{y}\in\mathcal{Y}$, the objective can be reformulated as

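$$\min_{\mathbf{w},\boldsymbol{\xi}}\ \frac{1}{2}\|\mathbf{w}\|^{2}+C\sum_{i=1}^{n}\xi_{i}\quad\text{s.t.}\quad\Delta(\mathbf{y}_{i},\mathbf{y})+\mathbf{w}^{T}\Phi(\mathbf{x}_{i},\mathbf{y})-\mathbf{w}^{T}\Phi(\mathbf{x}_{i},\mathbf{y}_{i})\leq\xi_{i},\quad\forall\mathbf{y}\in\mathcal{Y},\ i=1,\ldots,n$$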

where $\Delta:\mathcal{Y}\times\mathcal{Y}\rightarrow\mathds{R}$ is a loss function, $C$ is a positive constant that controls the trade-off between the training error minimization and the margin maximization [56], $n$ is the number of training samples and $\xi_{i}$ is the slack variable. In practice, SSVM is solved with the cutting-plane algorithm [234].
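To make the structured hinge term concrete, the sketch below evaluates it for one training pair by brute-force enumeration of the output space. The outer-product feature map `joint_feature`, the choice of Hamming loss for $\Delta$, and the toy weights are assumptions for illustration; practical SSVM solvers replace the enumeration with a task-specific loss-augmented inference routine inside the cutting-plane loop.

```python
from itertools import product


def joint_feature(x, y):
    """Hypothetical joint feature map Phi(x, y): flattened outer product of x and y."""
    return [xi * yj for xi in x for yj in y]


def score(w, x, y):
    """Compatibility F(x, y) = w^T Phi(x, y)."""
    return sum(wi * fi for wi, fi in zip(w, joint_feature(x, y)))


def hamming(y, y_prime):
    """Delta(y, y'): number of positions where the two outputs differ."""
    return sum(a != b for a, b in zip(y, y_prime))


def most_violated(w, x, y_true):
    """Loss-augmented inference by enumerating all binary label vectors."""
    candidates = product([0, 1], repeat=len(y_true))
    return max(candidates, key=lambda y: hamming(y_true, y) + score(w, x, y))


def structured_hinge(w, x, y_true):
    """max_y [Delta(y_i, y) + w^T Phi(x_i, y)] - w^T Phi(x_i, y_i)."""
    y_hat = most_violated(w, x, y_true)
    return hamming(y_true, y_hat) + score(w, x, y_hat) - score(w, x, y_true)


x, y_true = [1.0, 2.0], (1, 0)
w = [0.5, -0.5, 0.25, 0.0]
print(structured_hinge(w, x, y_true))
```

The value is never negative because the true output is itself a candidate with zero $\Delta$; it equals the smallest slack $\xi_{i}$ that satisfies the constraint for this example.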

Apart from the classic models that learn the correlations between outputs, some of the state-of-the-art multi-output learning models are based on ANNs. For example, models based on convolutional neural networks typically focus on hierarchical multi-labels [235] or rankings [236]. Recurrent neural network (RNN) models generally focus on sequence-to-sequence learning [237] and time-series prediction [238]. Generative deep neural networks are used to generate output data, such as images, text, and audio [215].

IV-B2 Multivariate Loss Functions

Various loss functions have been defined to compute the difference between the ground truth and the predicted output. Different loss functions yield different errors on the same dataset, and they greatly affect the performance of the model.

0/1 loss is a standard loss function that is commonly used in classification [239]:

$$L_{0/1}(\mathbf{y},\mathbf{y}^{\prime})=\mathds{I}(\mathbf{y}\neq\mathbf{y}^{\prime})$$

where $\mathds{I}$ is the indicator function. In general, 0/1 loss refers to the number of misclassified training examples. However, it is very restrictive and does not consider label dependency. Therefore, it is not suitable for large numbers of outputs or for outputs with complex structures. In addition, it is non-convex and non-differentiable, so it is difficult to minimize using standard convex optimization methods. In practice, one typically uses a surrogate loss, which is a convex upper bound of the task loss. However, a surrogate loss in multi-output learning often loses consistency when single-output methods are generalized to deal with multiple outputs [240]. Several works on subfields of multi-output learning study the consistency of different surrogate functions and show that they are consistent under some sufficient conditions [241, 242]. Yet this is still a challenging aspect of multi-output learning, and more exploration of the theoretical consistency of different problems is required.
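The restrictiveness of 0/1 loss on multiple outputs can be seen in a small sketch that contrasts it with a per-output count of errors (the Hamming loss). The function names and label vectors below are illustrative assumptions.

```python
def zero_one_loss(y, y_pred):
    """1 if the predicted output differs from the ground truth anywhere, else 0."""
    return int(tuple(y) != tuple(y_pred))


def hamming_loss(y, y_pred):
    """Number of individual outputs that are predicted incorrectly."""
    return sum(a != b for a, b in zip(y, y_pred))


y_true = (1, 0, 1, 1)
almost_right = (1, 0, 1, 0)   # one output wrong
all_wrong = (0, 1, 0, 0)      # every output wrong

# 0/1 loss cannot tell the two predictions apart; Hamming loss can.
print(zero_one_loss(y_true, almost_right), zero_one_loss(y_true, all_wrong))
print(hamming_loss(y_true, almost_right), hamming_loss(y_true, all_wrong))
```

Because a single wrong output is penalized as heavily as a completely wrong prediction, 0/1 loss gives little guidance as the number of outputs grows.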

Below, we describe four popular surrogate losses: hinge loss, negative log loss, perceptron loss, and softmax-margin loss.

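Hinge loss is one of the most widely used surrogate losses and is usually used in structured SVMs [243]. It pushes the score of the correct outputs to be greater than that of the prediction:

$$L_{\mathrm{hinge}}(\mathbf{x},\mathbf{y},\mathbf{w})=\max_{\mathbf{y}^{\prime}\in\mathcal{Y}}\left[\Delta(\mathbf{y},\mathbf{y}^{\prime})+\mathbf{w}^{T}\Phi(\mathbf{x},\mathbf{y}^{\prime})\right]-\mathbf{w}^{T}\Phi(\mathbf{x},\mathbf{y})$$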

The margin, $\Delta(\mathbf{y},\mathbf{y}^{\prime})$, has different definitions based on the output structures and task. For example, for sequence learning or outputs with equal weights, $\Delta(\mathbf{y},\mathbf{y}^{\prime})$ can be simply defined as the Hamming loss $\sum_{j=1}^{m}\mathds{I}(\mathbf{y}_{(j)}\neq\mathbf{y}^{\prime}_{(j)})$. For taxonomic classification with a hierarchical output structure, $\Delta(\mathbf{y},\mathbf{y}^{\prime})$ can be defined as the tree distance between $\mathbf{y}$ and $\mathbf{y}^{\prime}$ [19]. For ranking, $\Delta(\mathbf{y},\mathbf{y}^{\prime})$ can be defined based on the mean average precision of a ranking $\mathbf{y}^{\prime}$ compared to the optimal $\mathbf{y}$ [244]. In syntactic parsing, $\Delta(\mathbf{y},\mathbf{y}^{\prime})$ is defined as the number of labeled spans where $\mathbf{y}$ and $\mathbf{y}^{\prime}$ do not agree [35]. Non-decomposable losses, such as those based on the $F_{1}$ measure, average precision (AP), or intersection over union (IOU), can also be used to define the margin.

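Negative log loss is commonly used in CRFs [231]. Note that minimizing negative log loss is the same as maximizing the log probability of the data:

$$L_{\mathrm{NL}}(\mathbf{x},\mathbf{y},\mathbf{w})=\log\sum_{\mathbf{y}^{\prime}\in\mathcal{Y}}\exp\left[\mathbf{w}^{T}\Phi(\mathbf{x},\mathbf{y}^{\prime})\right]-\mathbf{w}^{T}\Phi(\mathbf{x},\mathbf{y})$$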

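Perceptron loss is usually adopted in structured perceptron tasks [245] and is the same as hinge loss without the margin:

$$L_{\mathrm{perceptron}}(\mathbf{x},\mathbf{y},\mathbf{w})=\max_{\mathbf{y}^{\prime}\in\mathcal{Y}}\mathbf{w}^{T}\Phi(\mathbf{x},\mathbf{y}^{\prime})-\mathbf{w}^{T}\Phi(\mathbf{x},\mathbf{y})$$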

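Softmax-margin loss is one of the most popular loss functions in multi-output learning models such as SSVMs [246] and CRFs [247]:

$$L_{\mathrm{SM}}(\mathbf{x},\mathbf{y},\mathbf{w})=\log\sum_{\mathbf{y}^{\prime}\in\mathcal{Y}}\exp\left[\Delta(\mathbf{y},\mathbf{y}^{\prime})+\mathbf{w}^{T}\Phi(\mathbf{x},\mathbf{y}^{\prime})\right]-\mathbf{w}^{T}\Phi(\mathbf{x},\mathbf{y})$$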
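The four surrogate losses above can be compared on a small output space that is enumerable by hand. The sketch assumes the usual score-based definitions (a maximum or log-sum-exp over candidate outputs minus the score of the ground truth), takes $\Delta$ to be the Hamming loss, and uses made-up compatibility scores; the function and variable names are illustrative.

```python
import math


def hamming(y, y_prime):
    """Delta(y, y'): number of positions where the two outputs differ."""
    return sum(a != b for a, b in zip(y, y_prime))


def surrogate_losses(scores, y_true):
    """scores maps every candidate output y' to its compatibility score w^T Phi(x, y')."""
    s_true = scores[y_true]
    perceptron = max(scores.values()) - s_true
    hinge = max(hamming(y_true, y) + s for y, s in scores.items()) - s_true
    neg_log = math.log(sum(math.exp(s) for s in scores.values())) - s_true
    softmax_margin = math.log(
        sum(math.exp(hamming(y_true, y) + s) for y, s in scores.items())
    ) - s_true
    return perceptron, hinge, neg_log, softmax_margin


# Hypothetical scores for all four outputs of a two-label problem.
scores = {(0, 0): 0.0, (0, 1): -0.5, (1, 0): 1.0, (1, 1): 0.5}
p, h, nl, sm = surrogate_losses(scores, (1, 0))
print(p, h, nl, sm)
```

Under these definitions, adding the margin can only increase a loss, and log-sum-exp upper-bounds the maximum, so the perceptron loss is the smallest of the four and the softmax-margin loss the largest.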

Squared loss is a popular and convenient loss function that quadratically penalizes the difference between the ground truth and the prediction. It is commonly used in traditional single-output learning and can be easily extended to multi-output learning by summing the squared differences over all the outputs:

$$L_{\mathrm{squared}}(\mathbf{y},\mathbf{y}^{\prime})=\sum_{j=1}^{m}\left(\mathbf{y}_{(j)}-\mathbf{y}^{\prime}_{(j)}\right)^{2}$$

In multi-output learning, it is usually used with continuous valued outputs or continuous intermediate results before converting them into discrete valued outputs. It is also commonly used in neural networks and boosting.

IV-B3 Efficient Algorithms

Complex output structures significantly increase the burden on algorithms to formulate a model. Large-scale outputs, complex output dependencies, and/or complex loss functions can all be problematic. Therefore, several algorithms have been proposed specifically to tackle these challenges efficiently. Many leverage classic machine learning models so as to speed up the algorithms and alleviate the burden of complexity. The four most widely used classic models are based on $k$-nearest neighbor ($k$NN), decision trees, $k$-means, and hashing.

$k$NN-based methods are simple yet powerful machine learning models. Predictions are made based on the $k$ training instances closest to the test instance in terms of Euclidean distance. LMMO-$k$NN [248] is an SSVM-based model involving an exponential number of constraints w.r.t. the number of labels. This model imposes $k$NN constraints instantiated by the label vectors from neighboring examples to significantly reduce the training time and make rapid predictions.

Decision-tree-based methods [249, 250] learn a tree from the training data with a hierarchical output label space. They recursively partition the nodes until each leaf contains a small number of labels. Each novel data point is passed down the tree until it reaches a leaf. This usually achieves logarithmic-time prediction.

$k$-means-based methods such as SLEEC [79] cluster the training data using $k$-means clustering. SLEEC learns a separate embedding per cluster and performs classification for a novel instance within its cluster alone. This significantly reduces the prediction time.

Hashing methods, such as co-hashing [251, 252] and DBPC [253], reduce the prediction time by using hashing on the input or the intermediate embedding space. Co-hashing learns an embedding space to preserve semantic similarity structures between inputs and outputs. Compact binary representations are then generated for the learned embeddings for prediction efficiency. DBPC jointly learns a deep latent Hamming space and binary prototypes while capturing the latent nonlinear structures of the data with an ANN. The learned Hamming space and binary prototypes significantly decrease the prediction complexity and reduce memory/storage costs.
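As an illustration of the plain $k$NN prediction step mentioned above (not of LMMO-$k$NN itself), the sketch below averages the label vectors of the $k$ nearest training instances under Euclidean distance. The majority-vote threshold of 0.5, the function name, and the toy data are assumptions.

```python
def knn_multilabel(X_train, Y_train, x, k=3):
    """Average the label vectors of the k nearest training points and threshold at 0.5."""
    order = sorted(
        range(len(X_train)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(X_train[i], x)),
    )
    neighbours = order[:k]
    n_labels = len(Y_train[0])
    votes = [sum(Y_train[i][j] for i in neighbours) / k for j in range(n_labels)]
    return [int(v > 0.5) for v in votes]


# Toy data (hypothetical): two well-separated groups with different label patterns.
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
Y_train = [[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 1]]
print(knn_multilabel(X_train, Y_train, [0.2, 0.2]))
print(knn_multilabel(X_train, Y_train, [5.5, 5.5]))
```

The linear scan over all training instances is what makes naive $k$NN slow at scale; the clustering and hashing methods in this list reduce exactly this search cost.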

IV-C Volume - Extreme Class Imbalances

Real-world multi-output applications rarely provide data with an equal number of training instances for all labels/classes. When one class has far more instances than another, the data is imbalanced, and this is common in many applications. As a result, traditional models learned from such data tend to favor the majority classes. For example, in face generation, a trained model tends to generate the faces of famous people because there are so many more images of celebrities than of other people. Though class imbalance problems have been studied extensively in the context of binary classification, this issue remains a challenge in multi-output learning, especially with extreme imbalances.

Many studies on multi-output learning either create a balanced dataset or ignore the problems introduced by imbalanced data. A natural way to balance class distributions is to resample the dataset. There are two main resampling techniques: undersampling and oversampling [254]. Undersampling methods downsize the majority classes; the NearMiss family of methods [255] is representative of this category. Oversampling methods, such as SMOTE and its variants [256], instead add instances to the minority classes. However, all these resampling methods are mainly designed for single-output learning problems. Other techniques handle class imbalance in multi-output learning tasks with ANNs.
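The simplest form of oversampling, random duplication of minority-class instances, can be sketched as follows for the single-output case. This is plain random oversampling, not SMOTE (which synthesizes new points by interpolation); the function name and toy data are assumptions.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority instances until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, t in zip(X, y) if t == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


# Toy imbalanced dataset (hypothetical): four "a" instances, one "b" instance.
X = [[0.0], [0.1], [0.2], [0.3], [5.0]]
y = ["a", "a", "a", "a", "b"]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

With multiple outputs per instance, duplicating an instance to boost one rare label also duplicates every other label it carries, which is one reason single-output resampling does not transfer directly.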

For example, Dong et al. [257] combined incremental rectification of mini-batches with a deep neural network. Then a hard sample mining strategy minimizes the dominant effect of the majority classes by discovering the boundaries of sparsely-sampled minority classes. Both of the methods in [258] and [259] leveraged adversarial training to mitigate imbalance by using a re-weighting technique so that majority classes tend to have a similar impact as minority classes.

IV-D Volume - Unseen Outputs

Traditional multi-output learning assumes that the output set in testing is the same as the one in training, i.e., the output labels of a testing instance have already appeared during training. However, this may not be true in real-world applications. For example, a newly emerging species cannot be detected by a classifier learned from existing animals. Similarly, it is infeasible to recognize the actions or events in a real-time video if no actions or events with the same labels appeared in the training video set. Nor could a coarse animal classifier provide details of the species of a detected animal, such as whether a dog is a labrador or a shepherd.

Depending on the complexity of the learning task, label annotation is usually very costly. In addition, the enormous growth in the number of labels not only leads to a high-dimensional output space and, as a result, computational inefficiency, but also makes supervised learning tasks challenging due to unseen output labels during testing.

IV-D1 Zero-shot Multi-label Classification

Multi-label classification is a typical multi-output learning problem. Multi-label classification problems can have various inputs, such as text, images, and video, depending on the application. The output for each input instance is usually a binary label vector, indicating which labels are associated with the input. Multi-label classification problems learn a mapping from the input to the output. However, as the label space increases, it is common to find unseen output labels during testing, i.e., labels that never appeared in the training set. To study such cases, the zero-shot multi-class classification problem was first proposed in [17, 260]; most methods leverage predefined semantic information, such as attributes [11] and word representations [13]. This technique was then extended to zero-shot multi-label classification to assign multiple unseen labels to an instance. Similarly, zero-shot multi-label learning leverages the knowledge of the seen and unseen labels and models the relationships between the input features, label representations, and labels. For example, Gaure et al. [261] leverage the co-occurrence statistics of seen and unseen labels and model the label matrix and co-occurrence matrix jointly using a generative model. Rios and Kavuluru [262] and Lee et al. [263] incorporate knowledge graphs of the label relationships with neural networks.
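One generic way to use predefined semantic information is to compare an instance's predicted embedding with the embeddings of all labels, seen or unseen. The sketch below is only an illustration of that idea under stated assumptions: the label embeddings, the predicted embedding, the similarity threshold, and the function names are all hypothetical, and it is not the method of any particular cited work.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def predict_labels(pred_embedding, label_embeddings, threshold=0.6):
    """Return every label whose embedding is close enough to the predicted embedding."""
    return sorted(
        label for label, emb in label_embeddings.items()
        if cosine(pred_embedding, emb) >= threshold
    )


# Hypothetical semantic space; "zebra" is assumed to have no training examples.
label_embeddings = {
    "horse": [1.0, 0.0, 0.0],
    "striped": [0.0, 1.0, 0.0],
    "zebra": [1.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}
# Embedding that a model trained on seen labels is assumed to output for a test image.
pred_embedding = [0.9, 0.8, 0.1]
print(predict_labels(pred_embedding, label_embeddings))
```

Because prediction only requires a label's embedding, a label can be assigned at test time even if no training instance carried it.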

IV-D2 Zero-shot Action Localization

Similar to zero-shot classification problems, localizing human actions in videos without any training video examples is a challenging task. Inspired by zero-shot image classification, many studies on zero-shot action classification predict unseen actions from disjoint training actions based on prior knowledge of action-to-attribute mappings [264, 265, 266]. Such mappings are usually predefined, and the seen and unseen actions are linked through a description of the attributes. Thus, they can be used to generalize to undefined actions but are unable to localize actions. More recently, some works have been proposed to overcome this issue. Jain et al. [267] proposed Objects2action, which does not use any video data or action annotations. It leverages vast object annotations, images, and text descriptions that can be obtained from open-source collections such as WordNet and ImageNet. Mettes and Snoek [268] subsequently enhanced Objects2action by considering the relationships between actors and objects.

IV-D3 Open-set Recognition

Traditional multi-output learning problems, including zero-shot multi-output learning, operate under a closed-set assumption, i.e., all the testing classes are known at training time, either through the training samples or because they are predefined in a semantic label space. However, Scheirer et al. [16] proposed a concept called open-set recognition to describe a scenario where unknown classes appear in testing. Their work presents the 1-vs-set machine to classify the known classes as well as deal with the unknown classes. In later studies [269, 270], they extended this idea to multi-class settings by formulating a compact abating probability model. Bendale and Boult [271] adapted ANNs for open-set recognition by proposing a new model layer that estimates the probability of an input being an unknown class.
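The difference between closed-set and open-set prediction can be illustrated with a deliberately simple rejection rule: assign the nearest known class unless the instance is too far from all of them. This distance-threshold sketch is an assumption made for illustration and is neither the 1-vs-set machine nor the layer proposed in [271]; the centroids and threshold are hypothetical.

```python
import math


def open_set_predict(x, centroids, max_dist):
    """Return the nearest known class, or "unknown" if x is too far from every centroid."""
    def dist(mu):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, mu)))

    label = min(centroids, key=lambda c: dist(centroids[c]))
    return label if dist(centroids[label]) <= max_dist else "unknown"


# Hypothetical class centroids estimated from training data of the known classes.
centroids = {"cat": [0.0, 0.0], "dog": [4.0, 0.0]}
print(open_set_predict([0.5, 0.2], centroids, max_dist=1.5))
print(open_set_predict([2.0, 9.0], centroids, max_dist=1.5))
```

A closed-set classifier would be forced to label the second point as one of the known classes, whereas the open-set rule can abstain.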

Fig. 5 illustrates the relationships between different levels of unseen outputs in multi-output learning. Open-set recognition is the most general problem of all. Few-shot and zero-shot learning have been studied with different multi-output learning problems, such as multi-label learning and event localization. However, open-set recognition has only been studied in conjunction with multi-class classification. Other problems in the context of multi-output learning are still unexplored.

IV-E Veracity - Noisy Output Labels

Almost all methods of label annotation lead to some amount of noise for various reasons. Associations may be weak, the text may be ambiguous, or crowdsourced workers may not be domain experts, so labels may be incorrect [272]. Therefore, it is usually necessary to handle noisy outputs, such as missing, corrupt, incorrect, and/or partial labels, in real-world tasks.

IV-E1 Missing Labels

Human annotators often annotate an image or document with prominent labels but miss some of the less emphasized labels. Additionally, not all the objects in an image may be localized because there are, say, too many objects or the objects are too small. Social media platforms, such as Instagram, allow users to tag uploaded images, but the tags could relate to anything: the type of event, the person’s mood, the weather. Moreover, no user is likely to tag every object or every aspect of an image. Directly using such labeled datasets in traditional multi-output learning models cannot guarantee the performance of the given tasks. Therefore, handling missing labels is necessary in real-world applications.

In early studies, missing labels were handled by treating them as negative labels [273, 274, 275]. Then modeling tasks are performed based on a fully-labeled dataset. However, this approach can introduce undesirable bias into the learning problem. Therefore, a more widely-used method now is missing value imputation through matrix completion [276, 192, 186]. Most of these approaches are based on a low-rank assumption and, more recently, on label correlations, which improves learning performance [277, 278].

IV-E2 Incorrect Labels

Many labels in high-dimensional output space are non-informative or simply wrong [279]. This is especially common with annotations from crowdsourcing platforms that hire non-expert workers. Labeled datasets from social media networks are also often less than useful. A basic approach for handling incorrect labels is to simply remove those samples [280, 281]. That said, it is frequently difficult to detect which samples have been mislabeled. Therefore, designing multi-output learning algorithms that learn from noisy datasets is of great practical importance.

Existing multi-output learning methods handling noisy labels generally fall into two groups. The first group is based on building robust loss functions [282, 283, 284], which modify the labels in the loss function to alleviate the effect of noise. The second group models latent labels and learns the transition from the latent to the noisy labels [285, 286, 287].
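The second group can be illustrated with a known transition matrix between latent (clean) and observed (noisy) labels. In the sketch below the matrix `T` and the model's clean-class probabilities are made-up values, and the function names are assumptions; the cited methods learn the transition rather than assuming it is given.

```python
import math


def noisy_probs(clean_probs, T):
    """q_j = sum_i p_i * T[i][j], where T[i][j] = P(noisy label j | latent label i)."""
    n_classes = len(T[0])
    return [sum(p * row[j] for p, row in zip(clean_probs, T)) for j in range(n_classes)]


def corrected_nll(clean_probs, T, noisy_label):
    """Negative log-likelihood of the observed noisy label under the transition model."""
    return -math.log(noisy_probs(clean_probs, T)[noisy_label])


# Hypothetical transition matrix: class 0 is flipped to 1 with probability 0.2, etc.
T = [[0.8, 0.2],
     [0.3, 0.7]]
p = [0.9, 0.1]   # model's belief about the latent label
q = noisy_probs(p, T)
print(q)
print(corrected_nll(p, T, noisy_label=1), -math.log(p[1]))
```

Training against `q` instead of `p` penalizes the model less when a confident prediction disagrees with a label that is plausibly a flip.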

Partial Labels A special case of incorrect labels is partial labels [288, 289, 290], where each training instance is associated with a set of candidate labels but only one of them is correct. This is a common problem in real-world applications. For example, a photograph might contain many faces with a caption listing who is in the photo, but the names are not matched to the faces. Many methods for learning from partial labels have been developed to recover the ground-truth labels from a candidate set [291, 292]. However, most are based on the assumption of exactly one ground truth for each instance, which may not always hold under different label annotation methods. When multiple workers on a crowdsourcing platform annotate a dataset, the final annotations are usually gathered from the union of the annotations of all the workers, so each instance might be associated with multiple relevant as well as irrelevant labels. Hence, Xie and Huang [293] developed a new learning framework, partial multi-label learning (PML), that relaxes this assumption by leveraging the data structure information to optimize the confidence-weighted rank loss. Fig. 6 summarizes all the scenarios with noisy output labels, including multi-label learning, missing labels, incorrect labels, partial label learning, and partial multi-label learning.

IV-F Velocity - Changes in Output Distribution

Many real-world applications must deal with data streams, where data arrives continuously and possibly endlessly. In these cases, the output distributions can change over time, i.e., concept drift can occur. Streaming data is common in surveillance [98], driver route prediction [95], demand forecasting [97], and many other applications. Take visual tracking [294] in surveillance video as an example, where the video stream is potentially endless. The data arrives at high velocity as the video keeps generating consecutive frames. The goal is to detect, identify, and locate events or objects in the video. Therefore, the learning model must adapt to possible concept drift while working with limited memory.

Existing multi-output learning methods model changes in output distribution by updating the learning system each time new streaming data arrives. The update method might be ensemble-based [295, 296, 297, 298, 299] or ANN-based [300, 294]. Other strategies to handle concept drift include: assuming a fading effect on past data [298]; maintaining a change detector on predictive performance measurements and recalibrating models accordingly [297, 301]; and using stochastic gradient descent to update the network and accommodate new data streams with an ANN [294]. Notably, $k$-nearest neighbor ($k$NN) is one of the most classic frameworks for handling multi-output problems, but it cannot be readily adapted to changing output distributions because of its inefficiency. Many online hashing and online quantization based methods [302, 303] have been proposed to improve the efficiency of $k$NN while accommodating the changing output distribution.
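The stochastic-gradient strategy can be sketched with a linear multi-output model that is updated once per arriving sample and therefore tracks a change in the output distribution. The linear model (standing in for an ANN), the squared loss, the learning rate, and the synthetic stream with a drift at step 300 are all assumptions made for illustration.

```python
def sgd_step(W, x, y, lr=0.1):
    """One SGD step on the squared loss for a linear multi-output model y_hat = W x."""
    y_hat = [sum(w * xj for w, xj in zip(row, x)) for row in W]
    return [
        [w - lr * 2.0 * (yh - yt) * xj for w, xj in zip(row, x)]
        for row, yh, yt in zip(W, y_hat, y)
    ]


# Synthetic stream (hypothetical): the first output flips sign at t = 300.
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[0.0, 0.0], [0.0, 0.0]]
for t in range(600):
    x = inputs[t % 3]
    sign = 1.0 if t < 300 else -1.0   # concept drift
    y = [sign * x[0], x[1]]
    W = sgd_step(W, x, y)             # update as each sample arrives, then discard it

print(W)
```

No past samples are stored, so memory use is constant, and the weights for the first output move from the pre-drift solution to the post-drift one while the unaffected second output keeps its weights.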

IV-G Other Challenges

Any two of the aforementioned challenges can be combined to form a more complex challenge. For example, noisy labels and unseen outputs can be combined to form an open-set noisy label problem [304]. In addition, the combination of noisy labels and extreme output dimensions is also worthy of study and further exploration [206]. Changes in output distribution together with noisy labels result in online time-series prediction problems with missing values [305], while changes in distributions combined with dynamic label sets (unseen outputs) lead to open-world recognition problems with incremental labels [306]. Changing output distributions with extreme class imbalances create the common problem of streaming data with concept drift and class imbalances at the same time [18, 307]. Moreover, the combination of complex output structures with changing output distributions is also frequent in real-world applications [308].

IV-H Open Challenges

IV-H1 Output Label Interpretation

There are different ways to represent output labels and each expresses label information from a specific perspective. Taking label tags as an output for example, binary attributed output embeddings represent what attributes the input relates to. Hierarchical label output embedding conveys the hierarchical structure of the inputs. Semantic word output embeddings reflect the semantic relationships between the outputs. As one can see, each exhibits a certain level of human interpretability. Hence, an emerging approach to label embedding is to incorporate different label information from multiple perspectives and rich contexts to enhance interpretability [309]. This is a challenging undertaking because it is quite difficult to appropriately model the interdependencies between outputs in a way that humans can easily interpret and understand. For example, an image of a centaur is expected to be described with semantic labels like horse and person. Moreover, the image is expected to be described with attributes like head, arm, tail, etc. As such, appropriately modeling the relationships between input and outputs with rich interpretations of the labels is an open challenge that should be explored in future studies.

IV-H2 Output Heterogeneity

As the demand for sophisticated decision making increases, so does the demand for outputs with more complex structures. Returning to the example of surveillance, person re-identification in traditional approaches usually consists of two steps: person detection, followed by re-identification with the detected person as input. These steps are essentially two separate tasks that need to be learned together if performance is to be enhanced. Several researchers have recently attempted this demanding challenge, i.e., building a model that can simultaneously learn multiple tasks with different outputs. Mousavian et al. [310] undertook joint people detection in tandem with re-identification, while Van Ranst et al. [311] tackled image segmentation with depth estimation. However, more exploration and investigation are needed to overcome this challenge. As an example, one worthy undertaking would be to answer the question: Can we simultaneously learn the representation of a new user in a social network as well as their potential links to existing users?

V Conclusion

Multi-output learning has attracted significant attention over the last decade. This paper provides a comprehensive review of the study of multi-output learning using the 4 Vs as a frame. We explore the characteristics of the multi-output learning paradigm beginning with the life cycle of the output labels. We emphasize the issues associated with each step of the learning process. In addition, we provide an overview of the types of outputs, the structures, selected problem definitions, common model evaluation metrics, and the popular data repositories used in experiments, with representative works referenced throughout. The paper concludes with a discussion of the challenges caused by the 4 Vs and some future research directions that are worthy of further study.

Bibliography

[1] M. Zhang and Z. Zhou, “A review on multi-label learning algorithms,” TKDE, vol. 26, no. 8, pp. 1819–1837, 2014.
[2] C. Gong, D. Tao, J. Yang, and W. Liu, “Teaching-to-learn and learning-to-teach for multi-label propagation,” in AAAI, 2016, pp. 1610–1616.
[3] C. Bielza, G. Li, and P. Larrañaga, “Multi-dimensional classification with Bayesian networks,” Int. J. Approx. Reasoning, vol. 52, no. 6, pp. 705–727, 2011.
[4] S. Vembu and T. Gärtner, “Label ranking algorithms: A survey,” in Preference Learning, 2010, pp. 45–64. [Online]. Available: https://doi.org/10.1007/978-3-642-14125-6_3
[5] H. Borchani, G. Varando, C. Bielza, and P. Larrañaga, “A survey on multi-output regression,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 5, no. 5, pp. 216–233, 2015.
[6] G. Widmer and M. Kubat, “Learning in the presence of concept drift and hidden contexts,” Machine Learning, vol. 23, no. 1, pp. 69–101, 1996.
[7] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” IJCV, vol. 115, no. 3, pp. 211–252, 2015.
[8] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, “LabelMe: A database and web-based tool for image annotation,” IJCV, vol. 77, no. 1-3, pp. 157–173, 2008.