Prior-Knowledge and Attention-based Meta-Learning for Few-Shot Learning

Yunxiao Qin; Weiguo Zhang; Chenxu Zhao; Zezheng Wang; Xiangyu Zhu; Guojun Qi; Jingping Shi; Zhen Lei

arXiv:1812.04955·cs.CV·September 8, 2021

TL;DR

This paper introduces a novel meta-learning approach that incorporates prior-knowledge and attention mechanisms inspired by human cognition to improve few-shot learning performance and address generalization issues.

Contribution

It presents a new meta-learning paradigm integrating prior-knowledge and attention, along with a Cross-Entropy across Tasks metric to mitigate task-overfitting.

Findings

01

Achieves state-of-the-art results on few-shot learning benchmarks.

02

Effectively alleviates the task-overfitting problem.

03

Enhances meta-learner's generalization across different K-shot tasks.

Abstract

Recently, meta-learning has been shown to be a promising way to solve few-shot learning. In this paper, inspired by the human cognition process, which uses both prior-knowledge and visual attention when learning new knowledge, we present a novel meta-learning paradigm with three developments that introduce an attention mechanism and prior-knowledge into meta-learning. In our approach, prior-knowledge helps the meta-learner express the input data in a high-level representation space, and the attention mechanism enables the meta-learner to focus on key features of the data in that space. Compared with existing meta-learning approaches that pay little attention to prior-knowledge and visual attention, our approach alleviates the meta-learner's few-shot cognition burden. Furthermore, a Task-Over-Fitting (TOF) problem, which indicates that the meta-learner has…

Tables

Table 1: Few-shot learning performance on Omniglot. The method colored in blue uses a deep network (ResNet) to extract image features, while the others use a shallow network (four cascaded convolution layers). Accuracy is tested in the same way as in MAML[11].

Method	Venue	5-way 1-shot	5-way 5-shot	20-way 1-shot	20-way 5-shot
MAML[11]	ICML-17	98.70 $\pm$ 0.40%	99.90 $\pm$ 0.10%	95.80 $\pm$ 0.30%	98.90 $\pm$ 0.20%
Prototypical Nets[6]	NIPS-17	98.80%	99.70%	96.00%	98.90%
Meta-SGD[12]	/	99.53 $\pm$ 0.26%	99.93 $\pm$ 0.09%	95.93 $\pm$ 0.38%	98.97 $\pm$ 0.19%
Relation Net[54]	CVPR-18	99.60 $\pm$ 0.20%	99.80 $\pm$ 0.10%	97.60 $\pm$ 0.20%	99.10 $\pm$ 0.10%
GNN[55]	ICLR-18	99.20%	99.70%	97.40%	99.00%
Spot-Learn[56]	CVPR-19	97.56 $\pm$ 0.31%	99.65 $\pm$ 0.06%	/	/
iMAML HF[35]	NIPS-19	99.50 $\pm$ 0.26%	99.74 $\pm$ 0.11%	96.18 $\pm$ 0.36%	99.14 $\pm$ 0.10%
SNAIL[15]	ICLR-18	99.07 $\pm$ 0.16%	99.78 $\pm$ 0.09%	97.64 $\pm$ 0.30%	99.36 $\pm$ 0.18%
MetaGAN+RN[18]	NIPS-18	99.67 $\pm$ 0.18%	99.86 $\pm$ 0.11%	97.64 $\pm$ 0.17%	99.21 $\pm$ 0.10%
AML(ours)	/	99.65 $\pm$ 0.10%	99.85 $\pm$ 0.04%	98.48 $\pm$ 0.09%	99.55 $\pm$ 0.06%

Table 2: Few-shot learning performance on MiniImagenet. The methods colored in blue use a deep network to extract image features, while the others use a shallow network. For each task, we separately highlight the best result among the shallow-network methods and among the deep-network methods.

Method	Venue	5-way 1-shot	5-way 5-shot
MAML[11]	ICML-17	48.70 $\pm$ 1.84%	63.11 $\pm$ 0.92%
Prototypical Nets[6]	NIPS-17	49.42 $\pm$ 0.78%	68.20 $\pm$ 0.66%
Meta-SGD[12]	/	50.47 $\pm$ 1.87%	64.03 $\pm$ 0.94%
LLAMA[33]	ICLR-18	49.40 $\pm$ 1.83%	/
Relation Net[54]	CVPR-18	51.38 $\pm$ 0.82%	67.07 $\pm$ 0.69%
GNN[55]	ICLR-18	50.33 $\pm$ 0.36%	66.41 $\pm$ 0.63%
Spot-Learn[56]	CVPR-19	51.03 $\pm$ 0.78%	67.96 $\pm$ 0.71%
iMAML HF[35]	NIPS-19	49.30 $\pm$ 1.88%	/
Meta-MinibatchProx[57]	NIPS-19	50.77 $\pm$ 0.90%	67.43 $\pm$ 0.89%
AML(ours)	/	52.25 $\pm$ 0.85%	69.46 $\pm$ 0.68%
SNAIL[15]	ICLR-18	55.71 $\pm$ 0.99%	68.88 $\pm$ 0.92%
TADAM[58]	NIPS-18	58.50 $\pm$ 0.30%	76.70 $\pm$ 0.30%
MetaGAN+RN[18]	NIPS-18	52.71 $\pm$ 0.64%	68.63 $\pm$ 0.67%
AM3-TADAM[59]	ICLR-19	65.30 $\pm$ 0.49%	78.10 $\pm$ 0.36%
Incremental[60]	NIPS-19	54.95 $\pm$ 0.30%	63.04 $\pm$ 0.30%
RAML(ours)	/	63.66 $\pm$ 0.85%	80.49 $\pm$ 0.45%
URAML(ours)	/	49.56 $\pm$ 0.79%	63.42 $\pm$ 0.76%

Table 3: Ablation results for the attention mechanism on Omniglot.

Method	5-way 1-shot	5-way 5-shot	20-way 1-shot	20-way 5-shot
MAML*	97.40 $\pm$ 0.27%	99.71 $\pm$ 0.05%	93.37 $\pm$ 0.23%	97.46 $\pm$ 0.11%
MAML+attention	97.41 $\pm$ 0.28%	99.48 $\pm$ 0.12%	92.99 $\pm$ 0.25%	97.94 $\pm$ 0.10%
Meta-SGD*	98.94 $\pm$ 0.17%	99.51 $\pm$ 0.07%	95.82 $\pm$ 0.21%	98.40 $\pm$ 0.09%
Meta-SGD+attention	99.26 $\pm$ 0.15%	99.79 $\pm$ 0.04%	97.94 $\pm$ 0.14%	98.99 $\pm$ 0.10%

Table 4: Detailed structure of the decoder module in URAML.

Layers	Number of filters	Kernel
CONV	1024	5
DeCONV	512	3
DeCONV	256	3
CONV	1 or 2	1

Table 5: Ablation results for the attention mechanism on MiniImagenet.

Method	5-way 1-shot	5-way 5-shot
MAML*	48.03 $\pm$ 0.83%	64.11 $\pm$ 0.73%
MAML+attention	48.52 $\pm$ 0.85%	64.94 $\pm$ 0.69%
Reptile*	48.23 $\pm$ 0.43%	63.69 $\pm$ 0.49%
Reptile+attention	48.30 $\pm$ 0.45%	64.22 $\pm$ 0.39%
Meta-SGD*	48.15 $\pm$ 0.93%	63.73 $\pm$ 0.85%
Meta-SGD+attention	49.11 $\pm$ 0.94%	65.54 $\pm$ 0.84%

Table 6: Results of several ablation experiments.

Method	5-way 1-shot	5-way 5-shot
AML	52.25 $\pm$ 0.85%	69.46 $\pm$ 0.68%
AML-attention	51.27 $\pm$ 0.78%	67.73 $\pm$ 0.65%
RAML	63.66 $\pm$ 0.85%	80.49 $\pm$ 0.45%
RAML-Places2	58.82 $\pm$ 0.89%	74.09 $\pm$ 0.76%

Table 7: Ablation results for URAML.

Method	Dataset	Number of images	5-way 1-shot	5-way 5-shot
URAML-V1	MiniImagenet-900	1.15 million	45.91 $\pm$ 0.79%	61.04 $\pm$ 0.71%
URAML-V2	MiniImagenet-900, places365, COCO2017	4.10 million	48.82 $\pm$ 0.79%	62.84 $\pm$ 0.78%
URAML-AE	MiniImagenet-900, places365, COCO2017, OpenImages-300	7.10 million	33.29 $\pm$ 0.71%	43.60 $\pm$ 0.66%
URAML	MiniImagenet-900, places365, COCO2017, OpenImages-300	7.10 million	49.56 $\pm$ 0.79%	63.42 $\pm$ 0.76%

Table 8: Performance of different meta-learning methods on the CET metric.

Method	MAML	Meta-SGD	AML	RAML	URAML
CET	57.19	34.22	33.35	32.13	32.16

Equations

\left\{\begin{array}[]{lr}\gamma_{i}=\mathcal{F}(x_{i};\ \theta_{f})\\ m_{i}=\mathcal{A}(\gamma_{i};\ \theta_{a})\\ \gamma^{\alpha}_{i}=\gamma_{i}\odot m_{i}\\ \hat{y_{i}}=\mathcal{C}(\gamma^{\alpha}_{i};\ \theta_{c})\end{array}\right.

\left\{\begin{array}[]{lr}\gamma^{\prime}=\mathcal{P}_{a}(\gamma),\\ m=\sigma(\mathcal{F}_{a}(\gamma^{\prime};\ \theta_{a}))\end{array}\right.

\left\{\begin{array}[]{lr}\hat{y_{i}}=\mathbb{F}(x_{i};\theta_{f},\theta_{a},\theta_{c}),\\ \mathcal{L}_{i}(\theta_{f},\theta_{a},\theta_{c})=l(\hat{y_{i}},y_{i}),\\ \mathfrak{L}_{s}(\theta_{f},\theta_{a},\theta_{c})=\frac{1}{N_{s}}\displaystyle{\sum_{i=1}^{N_{s}}}\mathcal{L}_{i}(\theta_{f},\theta_{a},\theta_{c})\end{array}\right.

(\theta^{{}^{\prime}}_{f},\theta^{{}^{\prime}}_{a},\theta^{{}^{\prime}}_{c})=(\theta_{f},\theta_{a},\theta_{c})-\alpha\circ\nabla_{(\theta_{f},\theta_{a},\theta_{c})}\mathfrak{L}_{s}(\theta_{f},\theta_{a},\theta_{c})

\left\{\begin{array}[]{lr}\hat{y_{i}}=\mathbb{F}(x_{i};\theta^{{}^{\prime}}_{f},\theta^{{}^{\prime}}_{a},\theta^{{}^{\prime}}_{c}),\\ \mathcal{L}_{i}(\theta^{{}^{\prime}}_{f},\theta^{{}^{\prime}}_{a},\theta^{{}^{\prime}}_{c})=l(\hat{y_{i}},y_{i}),\\ \mathfrak{L}_{q}(\theta^{{}^{\prime}}_{f},\theta^{{}^{\prime}}_{a},\theta^{{}^{\prime}}_{c})=\frac{1}{N_{q}}\displaystyle{\sum_{i=1}^{N_{q}}}\mathcal{L}_{i}(\theta^{{}^{\prime}}_{f},\theta^{{}^{\prime}}_{a},\theta^{{}^{\prime}}_{c})\end{array}\right.

(\theta_{f},\theta_{a},\theta_{c},\alpha)=(\theta_{f},\theta_{a},\theta_{c},\alpha)-\beta\cdot\nabla_{(\theta_{f},\theta_{a},\theta_{c},\alpha)}\mathfrak{L}_{q}(\theta^{{}^{\prime}}_{f},\theta^{{}^{\prime}}_{a},\theta^{{}^{\prime}}_{c})

\left\{\begin{array}[]{lr}\gamma_{i}=\mathcal{F}_{r}(x_{i};\ \theta_{r})\\ \hat{y_{i}}=\mathcal{C}_{au}(\gamma_{i};\ \theta_{au})\\ L_{au}=\frac{1}{n}\displaystyle{\sum_{i=1}^{n}}l(\hat{y_{i}},y_{i})\\ \theta^{*}_{r},\theta^{*}_{au}=\mathop{argmin}\limits_{\theta_{r},\theta_{au}}L_{au}\end{array}\right.

\left\{\begin{array}[]{lr}\hat{y_{i}}=\mathbb{F}(x_{i};\theta_{r}^{*},\theta_{a},\theta_{c}),\\ \mathcal{L}_{i}(\theta_{r}^{*},\theta_{a},\theta_{c})=l(\hat{y_{i}},y_{i}),\\ \mathfrak{L}_{s}(\theta_{r}^{*},\theta_{a},\theta_{c})=\frac{1}{N_{s}}\displaystyle{\sum_{i=1}^{N_{s}}}\mathcal{L}_{i}(\theta_{r}^{*},\theta_{a},\theta_{c})\end{array}\right.

(\theta^{{}^{\prime}}_{a},\theta^{{}^{\prime}}_{c})=(\theta_{a},\theta_{c})-\alpha\circ\nabla_{(\theta_{a},\theta_{c})}\mathfrak{L}_{s}(\theta_{r}^{*},\theta_{a},\theta_{c})

\left\{\begin{array}[]{lr}\hat{y_{i}}=\mathbb{F}(x_{i};\theta_{r}^{*},\theta^{{}^{\prime}}_{a},\theta^{{}^{\prime}}_{c}),\\ \mathcal{L}_{i}(\theta_{r}^{*},\theta^{{}^{\prime}}_{a},\theta^{{}^{\prime}}_{c})=l(\hat{y_{i}},y_{i}),\\ \mathfrak{L}_{q}(\theta_{r}^{*},\theta^{{}^{\prime}}_{a},\theta^{{}^{\prime}}_{c})=\frac{1}{N_{q}}\displaystyle{\sum_{i=1}^{N_{q}}}\mathcal{L}_{i}(\theta_{r}^{*},\theta^{{}^{\prime}}_{a},\theta^{{}^{\prime}}_{c})\end{array}\right.

(\theta_{a},\theta_{c},\alpha)=(\theta_{a},\theta_{c},\alpha)-\beta\cdot\nabla_{(\theta_{a},\theta_{c},\alpha)}\mathfrak{L}_{q}(\theta_{r}^{*},\theta^{{}^{\prime}}_{a},\theta^{{}^{\prime}}_{c})

\left\{\begin{array}[]{lr}\gamma_{i}^{l}=\mathcal{F}_{l}(x_{i}^{l};\ \theta_{l})\\ \hat{x}^{ab}_{i}=\mathcal{D}_{l}(\gamma_{i}^{l};\ \omega_{l})\\ L_{l}(\theta_{l},\omega_{l})=\frac{1}{n}\displaystyle{\sum_{i=1}^{n}}l_{2}(x_{i}^{ab},\ \hat{x}_{i}^{ab})\\ \theta_{l}^{*},\omega_{l}^{*}=\mathop{argmin}\limits_{\theta_{l},\omega_{l}}L_{l}(\theta_{l},\omega_{l})\end{array}\right.

\left\{\begin{array}[]{lr}\gamma_{i}^{ab}=\mathcal{F}_{ab}(x_{i}^{ab};\ \theta_{ab})\\ \hat{x}_{i}^{l}=\mathcal{D}_{ab}(\gamma_{i}^{ab};\ \omega_{ab})\\ L_{ab}(\theta_{ab},\omega_{ab})=\frac{1}{n}\displaystyle{\sum_{i=1}^{n}}l_{2}(x_{i}^{l},\ \hat{x}_{i}^{l})\\ \theta_{ab}^{*},\omega_{ab}^{*}=\mathop{argmin}\limits_{\theta_{ab},\omega_{ab}}L_{ab}(\theta_{ab},\omega_{ab})\end{array}\right.

\left\{\begin{array}[]{lr}\bm{d}_{i}=\mathcal{S}(\bm{a}_{i}/\max(\bm{a}_{i}))\\ l_{ij}=\mathcal{D}(\bm{d}_{i},\bm{d}_{j})\\ L=\sum_{i,j\in\{1,3,5,7,9\},\ i\neq j}l_{ij}\end{array}\right.

Full text

Prior-Knowledge and Attention based Meta-Learning for Few-Shot Learning

Yunxiao Qin1,2, Weiguo Zhang1, Chenxu Zhao2, Zezheng Wang2, Xiangyu Zhu3

Guojun Qi4, Jingping Shi1, Zhen Lei3

1Northwestern Polytechnical University of China 2AIBEE,

3Institute of Automation, Chinese Academy of Science, 4Huawei Cloud

{qyxqyx, zhangwg, shijingping}@mail.nwpu.edu.cn, [email protected],

{xiangyu.zhu, zlei}@nlpr.ia.ac.cn, {cxzhao, zezhengwang}@jd.com

Abstract

Recently, meta-learning has been shown to be a promising way to solve few-shot learning. In this paper, inspired by the human cognition process, which uses both prior-knowledge and visual attention when learning new knowledge, we present a novel meta-learning paradigm with three developments that introduce an attention mechanism and prior-knowledge into meta-learning. In our approach, prior-knowledge helps the meta-learner express the input data in a high-level representation space, and the attention mechanism enables the meta-learner to focus on key features of the data in that space. Compared with existing meta-learning approaches, which pay little attention to prior-knowledge and visual attention, our approach alleviates the meta-learner's few-shot cognition burden. Furthermore, we discover a Task-Over-Fitting (TOF) problem, which indicates that the meta-learner generalizes poorly across different $K$-shot learning tasks: when tested on $J$-shot classification tasks, a meta-learner trained on $K$-shot tasks does not perform as well as one trained on $J$-shot tasks, where $K$ and $J$ are different unsigned integers denoting numbers of shots. We propose a Cross Entropy across Tasks (CET) metric, which quantifies how much a meta-learning method suffers from the TOF problem, to model and solve the TOF problem. Extensive experiments demonstrate that our approach brings the meta-learner to state-of-the-art performance on several few-shot learning benchmarks and, at the same time, greatly relieves the TOF problem.

1 Introduction

The development of deep learning has brought remarkable progress in many tasks[1, 2, 3, 4]. To achieve this, deep learning approaches require thousands or even millions of labeled examples to obtain satisfactory performance. However, collecting and annotating abundant data is notoriously expensive. Therefore, few-shot learning[5, 6, 7], which requires the model to learn from only a few examples, has attracted researchers' attention in recent years.

Learning from few data is challenging for computer vision. In comparison, human beings can rapidly learn new categories from very few examples. Recently, meta-learning[8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19] has shown promising performance in improving few-shot learning for computer vision. However, existing meta-learning methods commonly ignore prior-knowledge[20, 21, 22, 23, 24] and the attention mechanism[25, 26], both of which have been demonstrated to be important for human cognition and learning. We illustrate a few-shot classification problem in Fig.1 to clarify the role of prior-knowledge and the attention mechanism in human few-shot learning. In Fig.1, we unconsciously leverage our learned knowledge about the world to understand these images and express them as high-level compact representations, such as plant, animal, tree, and table. However, from the four training images, we discover that only the tree and table features are useful for recognizing these two classes of images. We then quickly adjust ourselves to pay attention to the critical features and make the decision based on the focused features.

Evidently, we can summarize two main modules in human few-shot learning: a stable Representation module that uses prior-knowledge to express the image as compact feature representations, and a smart attention-based decision logic module that adapts accurately and performs recognition based on those representations. In contrast, existing meta-learning approaches commonly train meta-learners to learn adaptive networks directly on the original input data, with neither an attention mechanism nor prior-knowledge.

In this paper, inspired by the human cognition process, we present a novel meta-learning paradigm with three developments that introduce the attention mechanism and prior-knowledge into meta-learning step by step. Here, we briefly introduce the proposed methods. 1) The first method is Attention based Meta-Learning (AML), which leverages the attention mechanism to enable the meta-learner to pay more attention to essential features. 2) To give the meta-learner not only attention but also prior-knowledge, we present a second method, Representation and Attention based Meta-Learning (RAML). Its network contains a Representation module and an attention-based prediction (ABP) module. The Representation module is similar to its counterpart in human vision: it learns the prior-knowledge in a supervised fashion and is responsible for understanding the input image and extracting stable, compact feature representations from it. The ABP module plays the same role as the smart attention-based decision logic module of human vision: it enables the meta-learner to precisely adjust, first, its attention to the most discriminative feature representations of the input images and, second, the corresponding predictions. 3) In the third method, to take full advantage of the abundant unlabeled data available, we design a novel method in which the Representation module learns the prior-knowledge in an unsupervised fashion[27, 28, 29, 30, 31, 32]. We call this method Unsupervised Representation and Attention based Meta-Learning (URAML). Our experiments show that both a larger amount of unlabeled data and better unsupervised learning clearly improve the performance of URAML.

In addition, we identify a Task-Over-Fitting (TOF) problem in existing meta-learning and present a Cross-Entropy across Tasks (CET) metric to evaluate how much a meta-learning method is affected by it. As an example of the TOF problem, a meta-learner trained on 5-way 1-shot tasks is not as capable as one trained on 5-way 5-shot tasks when both are tested on 5-way 5-shot tasks, and vice versa. However, in practical applications, it is uncertain how much data and how many shots will be available for the meta-learner to learn from. Therefore, we argue that a trained meta-learner should generalize well to different $K$-shot tasks. A possible reason behind the TOF problem is that existing meta-learners are vulnerable to features irrelevant to the presented tasks, since they ignore both prior-knowledge and the attention mechanism. Our experiments validate that, by incorporating prior-knowledge and the attention mechanism, our methods suffer less from the TOF problem than existing meta-learning methods.

We summarize the main contributions of our work as:

• We propose that both the attention mechanism and prior-knowledge are crucial for the meta-learner to reduce its cognition burden in few-shot learning, and we develop three methods, AML, RAML, and URAML, that leverage the attention mechanism and prior-knowledge in meta-learning step by step.

• We discover the TOF problem in meta-learning and design a novel metric, Cross-Entropy across Tasks (CET), to measure how much meta-learning approaches suffer from it.

• Through extensive experiments, we show that the proposed methods achieve state-of-the-art performance on several few-shot learning benchmarks while being less sensitive to the TOF problem, especially RAML and URAML.

2 Related Work

2.1 Meta-learning for Few-Shot Learning

An $N$-way $K$-shot learning task contains a support set and a query set, which contain $K$ and $L$ examples for each of the $N$ classes, respectively. Existing meta-learning approaches usually solve few-shot learning by training a meta-learner on $N$-way $K$-shot learning tasks in the following way. First, the meta-learner is required to inner-update itself on the support set. Second, after the inner update, the meta-learner is evaluated on the query set. Finally, by minimizing the loss on the query set, the meta-learner learns a base learner that has easy-to-fine-tune weights[11, 14], a skillful weight updater[13, 19], both[12], or the ability to memorize the support set[15]. Methods that train the meta-learner to learn an easy-to-fine-tune base learner are also called weight-initialization-based methods, as the meta-learner learns generalized initial weights for few-shot learning tasks. MAML, a classical weight-initialization-based method, has recently become popular, and many MAML-based methods have been proposed. For example, LLAMA[33] uses a local Laplace approximation to model the task parameters, and MTL[34] trains a meta-transfer to adapt a pre-trained deep network to few-shot learning tasks. In addition, MetaGAN[18] shows that coupling MAML with adversarial training lets the meta-learner learn better decision boundaries between classes in few-shot learning. To reduce the computation and memory cost of MAML, iMAML[35] leverages implicit differentiation to remove the need to differentiate through the inner-update path.
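The task structure described above can be sketched as a small episode sampler. This is only an illustration: the function name `sample_episode`, the dictionary layout of the dataset, and the placeholder image names are assumptions made here, not part of the paper.

```python
import random

def sample_episode(data_by_class, n_way, k_shot, l_query, rng):
    """Sample one N-way K-shot task: K support and L query examples per class."""
    # Pick N classes, then relabel them 0..N-1 for this task only.
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        # Draw K + L distinct examples so support and query never overlap.
        examples = rng.sample(data_by_class[cls], k_shot + l_query)
        support += [(x, label) for x in examples[:k_shot]]
        query += [(x, label) for x in examples[k_shot:]]
    return support, query

# Hypothetical toy dataset: 10 classes with 20 placeholder "images" each.
toy = {f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(10)}
support, query = sample_episode(toy, n_way=5, k_shot=1, l_query=15,
                                rng=random.Random(0))
print(len(support), len(query))
```

The meta-learner would inner-update on `support` and be evaluated on `query`; the value of `l_query` used here is arbitrary.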

Although existing meta-learning methods show promising performance, they seldom consider prior-knowledge and the attention mechanism. In this paper, we improve meta-learning for few-shot learning by introducing both into meta-learning.

2.2 Attention Mechanism

In recent years, the attention mechanism[36, 37, 38, 39] has been widely used in computer vision systems, machine translation, and other areas. Several forms of attention have been proposed, such as soft attention[36, 37], hard attention[38], and self-attention[39]. Soft attention can be seen as simulating attention by multiplying neural units by weights, so that the network pays more attention to the units multiplied by larger weights. SENet[37] takes advantage of the soft attention mechanism and won the image classification task of ILSVRC-2017[40]. Hard attention[38] can be seen as a module that selects a block region of the input image that is visible to the network, while the remaining region is invisible. Self-attention[39] improves the performance of machine translation systems by training a network to find the inner dependencies of the input and those of the output. In this paper, we use the soft attention mechanism as the meta-learner's attention mechanism.

2.3 Unsupervised Representation Learning

Supervised learning is a data-hungry way to train a deep network. Considering this, several unsupervised learning approaches[27, 28, 29, 30, 31, 32] have been proposed. A well-known approach is to train a neural network to reconstruct the original input through an Encoder-Decoder architecture, as in the Auto-Encoder[27] and the Variational Auto-Encoder (VAE)[28]. Given partially masked images, Context Auto-Encoder[29] trains a network to reconstruct not only the visible but also the masked region of the image. Colorization[30] uses Lab images to train a network to generate the unseen ab channels from the input L channel. Based on Colorization, Split-Brain[32] trains two separate networks, one generating the ab channels from the L channel and the other generating the L channel from the ab channels. Different from these methods, DeepCluster[31] couples deep learning with clustering algorithms[41, 42]. However, in the real world, many unlabeled images contain complex semantic information and are not suitable for being categorized into specific clusters. We therefore consider that DeepCluster may be limited in this setting, and we use Split-Brain as the unsupervised learning method in URAML.

3 Method

3.1 Problem of Learning from Few-Data

Learning from few data is extremely difficult for a deep learning model. One reason is that the original input data is commonly represented in a high-dimensional space, usually with tens or hundreds of thousands of dimensions. For example, in image classification, a 224×224 RGB image has $224\times 224\times 3=150528$ dimensions. In such a high-dimensional space, it is difficult for a few samples of one category to accurately reflect the character of that category.

Humans learn new categories efficiently because they use prior-knowledge and the attention mechanism in cognition[20, 21, 43, 44, 45, 46, 47, 24, 23]. Prior-knowledge helps humans express perceived images as high-level representations or descriptions, and the attention mechanism helps humans focus on the critical components of those representations. In this way, humans reduce the dimension of images while maintaining their discriminative components, which alleviates the human cognition load and lets humans learn new categories efficiently.

Existing meta-learning approaches greatly improve deep learning in few-shot learning. However, they train the meta-learner to quickly fit few-shot learning tasks directly on the few original high-dimensional inputs and pay little attention to the importance of prior-knowledge and the attention mechanism, which leads to unsatisfactory performance. In addition, as introduced above, we propose that ignoring prior-knowledge and the attention mechanism is also a possible reason why existing meta-learning approaches are vulnerable to the TOF problem.

In this paper, inspired by human cognition and to address the problems that existing meta-learning approaches expose, we propose three methods step by step: Attention based Meta-Learning (AML), Representation and Attention based Meta-Learning (RAML), and Unsupervised Representation and Attention based Meta-Learning (URAML).

3.2 AML

AML equips the meta-learner with the power of attention mechanism. We first introduce the network structure and then detail the training of AML.

AML Network

The network architecture of AML is shown in Fig.2. The network consists of a feature extractor and an attention-based prediction (ABP) module. The feature extractor is a CNN $\mathcal{F}$ composed of four stacked convolutional layers. The ABP module contains a convolution-based attention model $\mathcal{A}$ and a classifier $\mathcal{C}$ based on a fully-connected layer. Eq.1 shows the inference of the network. $\theta_{f}$, $\theta_{a}$, and $\theta_{c}$ are the weights of $\mathcal{F}$, $\mathcal{A}$, and $\mathcal{C}$, respectively. $\mathcal{F}$ extracts the features $\gamma_{i}$ of the input image $x_{i}$ and feeds $\gamma_{i}$ into the attention model $\mathcal{A}$. Then, $\mathcal{A}$ calculates the soft attention mask $m_{i}$ of the features $\gamma_{i}$. The focused features $\gamma^{\alpha}_{i}$ are calculated by channel-wise multiplication $\odot$ between $\gamma_{i}$ and $m_{i}$. Finally, the classifier $\mathcal{C}$ predicts the category of the input image, and $\hat{y_{i}}$ is the corresponding prediction for $x_{i}$. We simplify and integrate the inference in Eq.1 as $\hat{y_{i}}=\mathbb{F}(x_{i};\theta_{f},\theta_{a},\theta_{c})$.

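$$\left\{\begin{array}[]{lr}\gamma_{i}=\mathcal{F}(x_{i};\ \theta_{f})\\ m_{i}=\mathcal{A}(\gamma_{i};\ \theta_{a})\\ \gamma^{\alpha}_{i}=\gamma_{i}\odot m_{i}\\ \hat{y_{i}}=\mathcal{C}(\gamma^{\alpha}_{i};\ \theta_{c})\end{array}\right.\qquad(1)$$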

In this paper, we use the soft attention mechanism to build the attention model. Although the soft attention mechanism is not exactly the same as the attention mechanism in human vision, it plays a similar role and helps the meta-learner direct its attention to key features. Fig.4 illustrates the soft attention processing of the meta-learner.

Fig.3 shows the structure of the attention model and Eq.2 shows its inference. The input feature $\gamma$ is first global-average-pooled to obtain the feature $\gamma^{\prime}$, and then a convolution layer coupled with a sigmoid activation layer predicts the attention mask $m$ from $\gamma^{\prime}$.

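$$\left\{\begin{array}[]{lr}\gamma^{\prime}=\mathcal{P}_{a}(\gamma),\\ m=\sigma(\mathcal{F}_{a}(\gamma^{\prime};\ \theta_{a}))\end{array}\right.\qquad(2)$$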

$\mathcal{P}_{a}$ is the global-average-pooling operation, $\sigma$ is the sigmoid activation, and $\mathcal{F}_{a}$ is the convolution layer in the attention model.
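As a rough illustration of the attention model just described, the following sketch computes a channel mask and applies it to a tiny feature map. It assumes that the mask has one value per channel and that the convolution $\mathcal{F}_{a}$ acts on the 1 x 1 pooled map, where it reduces to a linear layer; the weight values and the helper names are hypothetical and chosen only for this example.

```python
import math

def global_average_pool(feature):
    """P_a: average each channel's H x W map down to a single number."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]

def attention_mask(feature, weight, bias):
    """m = sigmoid(F_a(P_a(gamma))).

    Assumption: on a 1 x 1 pooled map the convolution F_a is equivalent to a
    linear layer, so it is written here as a matrix-vector product.
    """
    pooled = global_average_pool(feature)
    mask = []
    for w_row, b in zip(weight, bias):
        z = sum(w * p for w, p in zip(w_row, pooled)) + b
        mask.append(1.0 / (1.0 + math.exp(-z)))
    return mask

def apply_attention(feature, mask):
    """Focused features: every channel of gamma is scaled by its mask value."""
    return [[[v * m for v in row] for row in ch] for ch, m in zip(feature, mask)]

# Toy feature map gamma with 2 channels of size 2 x 2 (made-up numbers).
gamma = [[[1.0, 3.0], [5.0, 7.0]], [[2.0, 2.0], [2.0, 2.0]]]
theta_a_weight = [[0.5, 0.0], [0.0, -1.0]]  # hypothetical attention weights
theta_a_bias = [0.0, 0.0]

m = attention_mask(gamma, theta_a_weight, theta_a_bias)
focused = apply_attention(gamma, m)
print(m)
```

With these made-up weights the first channel receives a mask value above 0.5 and the second a value below 0.5, so the first channel dominates the focused features passed to the classifier.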

AML Meta-Train Process

Given a few-shot classification task $\tau$, AML meta-trains the meta-learner to solve $\tau$ in two steps. First, AML requires the meta-learner to inner-update itself on the support set of $\tau$, as formulated in Eq.3 and Eq.4.

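$$\left\{\begin{array}[]{lr}\hat{y_{i}}=\mathbb{F}(x_{i};\theta_{f},\theta_{a},\theta_{c}),\\ \mathcal{L}_{i}(\theta_{f},\theta_{a},\theta_{c})=l(\hat{y_{i}},y_{i}),\\ \mathfrak{L}_{s}(\theta_{f},\theta_{a},\theta_{c})=\frac{1}{N_{s}}\displaystyle{\sum_{i=1}^{N_{s}}}\mathcal{L}_{i}(\theta_{f},\theta_{a},\theta_{c})\end{array}\right.\qquad(3)$$

$$(\theta^{{}^{\prime}}_{f},\theta^{{}^{\prime}}_{a},\theta^{{}^{\prime}}_{c})=(\theta_{f},\theta_{a},\theta_{c})-\alpha\circ\nabla_{(\theta_{f},\theta_{a},\theta_{c})}\mathfrak{L}_{s}(\theta_{f},\theta_{a},\theta_{c})\qquad(4)$$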

In Eq.3, $x_{i}$ is any image that belongs to the support set, $l$ is the cross-entropy loss function, $\mathcal{L}_{i}$ is the meta-learner's loss on the image $x_{i}$, $\mathfrak{L}_{s}$ is the meta-learner's loss on the whole support set, and $N_{s}$ is the number of images in the support set. In Eq.4, inspired by Meta-SGD[12], we set $\alpha$ as a trainable vector that adjusts the inner-update direction; $\alpha$ has the same shape as the weights $\theta_{f},\theta_{a}$, and $\theta_{c}$. $\alpha$ can also be written as $\alpha=[\alpha_{f},\alpha_{a},\alpha_{c}]$, and Eq.4 can be split into three equations, e.g. $\theta^{{}^{\prime}}_{f}=\theta_{f}-\alpha_{f}\circ\nabla_{\theta_{f}}\mathfrak{L}_{s}(\theta_{f},\theta_{a},\theta_{c})$, and likewise for $\theta_{a}$ and $\theta_{c}$. For simplicity, we merge these three equations into one, as Eq.4 shows. $\circ$ is element-wise multiplication. Supervised by the loss on the support set, the meta-learner inner-updates its weights $\theta_{f},\theta_{a},\theta_{c}$ to $\theta^{{}^{\prime}}_{f},\theta^{{}^{\prime}}_{a},\theta^{{}^{\prime}}_{c}$.
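The element-wise inner update of Eq.4 can be illustrated on a flattened weight vector. The sketch below is not the paper's network: it assumes a toy quadratic support loss so that the gradient has a closed form, and the numbers are made up.

```python
def support_loss_grad(theta, targets):
    """Gradient of a toy support loss L_s = 0.5 * sum((theta - t)^2).

    Assumption: this quadratic stands in for the cross-entropy loss of Eq.3
    only so that the gradient can be written down by hand.
    """
    return [w - t for w, t in zip(theta, targets)]

def inner_update(theta, alpha, grad):
    """Eq.4 on a flat vector: theta' = theta - alpha o grad (element-wise)."""
    return [w - a * g for w, a, g in zip(theta, alpha, grad)]

theta = [1.0, -2.0, 0.5]    # stands for the stacked (theta_f, theta_a, theta_c)
alpha = [0.1, 0.5, 1.0]     # one trainable step size per weight, as in Meta-SGD
targets = [0.0, 0.0, 0.0]   # minimiser of the toy support loss

theta_prime = inner_update(theta, alpha, support_loss_grad(theta, targets))
print(theta_prime)
```

Because each weight has its own entry in `alpha`, the update can move some weights all the way to the support-set minimiser while barely changing others, which a single scalar learning rate could not do.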

Second, as the inner-updated weights $\theta^{{}^{\prime}}_{f},\theta^{{}^{\prime}}_{a}$ , and $\theta^{{}^{\prime}}_{c}$ depend not only on the initial values of $\theta_{f},\theta_{a}$ , and $\theta_{c}$ but also on $\alpha$ , all of $\theta_{f},\theta_{a},\theta_{c}$ , and $\alpha$ can be meta-optimized. We formulate this process as Eq.5 and Eq.6.

[TABLE]

In Eq.5, $x_{i}$ is an image belonging to the query set, and $N_{q}$ denotes the number of images in the query set. $\mathfrak{L}_{q}$ is the inner-updated meta-learner’s loss on the query set. It should be noted that $\nabla_{(\theta_{f},\theta_{a},\theta_{c},\alpha)}\mathfrak{L}_{q}(\theta^{{}^{\prime}}_{f},\theta^{{}^{\prime}}_{a},\theta^{{}^{\prime}}_{c})$ is the gradient of $\mathfrak{L}_{q}$ with respect to $(\theta_{f},\theta_{a},\theta_{c},\alpha)$ , not with respect to $(\theta^{{}^{\prime}}_{f},\theta^{{}^{\prime}}_{a},\theta^{{}^{\prime}}_{c})$ . By optimizing $\mathfrak{L}_{q}$ , the meta-learner is forced to learn not only suitable initial weights $\theta_{f},\theta_{a},\theta_{c}$ but also a suitable $\alpha$ for task $\tau$ . With the learned initial weights and $\alpha$ , the meta-learner can inner-update itself precisely on the support set and then perform well on the query set.
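The distinction between differentiating with respect to the initial weights and with respect to the inner-updated weights can be made concrete with a scalar toy problem. The sketch below is not the paper's network: it assumes a single weight, quadratic support and query losses with hypothetical targets `s` and `q`, and a hypothetical meta learning rate `beta`. It derives the meta-gradients by the chain rule through the inner step and checks them against finite differences.

```python
def inner_step(theta, alpha, s):
    """theta' = theta - alpha * dL_s/dtheta for L_s = 0.5 * (theta - s)**2."""
    return theta - alpha * (theta - s)

def query_loss(theta, alpha, s, q):
    """L_q = 0.5 * (theta' - q)**2, evaluated at the inner-updated weight."""
    return 0.5 * (inner_step(theta, alpha, s) - q) ** 2

def meta_gradients(theta, alpha, s, q):
    """Gradients of L_q w.r.t. the initial theta and alpha (not theta')."""
    residual = inner_step(theta, alpha, s) - q            # dL_q/dtheta'
    return residual * (1.0 - alpha), -residual * (theta - s)

def numeric_gradients(theta, alpha, s, q, eps=1e-6):
    """Central finite differences, used only as a sanity check."""
    d_theta = (query_loss(theta + eps, alpha, s, q)
               - query_loss(theta - eps, alpha, s, q)) / (2 * eps)
    d_alpha = (query_loss(theta, alpha + eps, s, q)
               - query_loss(theta, alpha - eps, s, q)) / (2 * eps)
    return d_theta, d_alpha

def meta_step(theta, alpha, s, q, beta=0.1):
    """One gradient-descent meta-update of both theta and alpha."""
    g_theta, g_alpha = meta_gradients(theta, alpha, s, q)
    return theta - beta * g_theta, alpha - beta * g_alpha

analytic = meta_gradients(2.0, 0.3, 0.0, 1.0)
numeric = numeric_gradients(2.0, 0.3, 0.0, 1.0)
new_theta, new_alpha = meta_step(2.0, 0.3, 0.0, 1.0)
```

The factor `(1.0 - alpha)` in the first gradient is what separates the gradient with respect to $\theta$ from the gradient with respect to $\theta^{\prime}$; in a real network it becomes a second-order term involving the Hessian of the support loss.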

In AML, the meta-learner is trained on many few-shot learning tasks with these two steps, so that it learns generalizable initial weights not only for the feature extractor $\mathcal{F}$ and the classifier $\mathcal{C}$ , but also for the attention model $\mathcal{A}$ . Existing initialization-based meta-learning methods, in contrast, only train the meta-learner to learn initial weights for the feature extractor and the classifier. Therefore, compared with existing meta-learners, AML simplifies the few-shot problem and improves performance: its attention ability is meta-trained and can be easily adjusted to the features that are crucial for solving the few-shot task, which allows the classifier to make a precise prediction for the input. Our experiments show the positive effect of the attention mechanism.

3.3 RAML

RAML equips the meta-learner with not only the attention mechanism but also the ability to make good use of previously learned knowledge.

Fig.4 shows the meta-learner’s network structure, which consists of a Representation module and an ABP module. The Representation module differs from the feature extractor in AML: it is responsible for learning and leveraging prior-knowledge to understand the input image, whereas the feature extractor in AML is meta-trained to learn how to update itself for solving few-shot learning tasks. In our work, the Representation module is a ResNet-50 network. Similar to the ABP module in AML, the ABP module here also contains an attention model and a classifier. It is responsible for quickly adjusting the meta-learner’s attention and prediction based on the output feature of the Representation module. Besides, Fig.4 contains an Auxiliary module. The Auxiliary module does not belong to the meta-learner; it is only used to assist the meta-learner in learning prior-knowledge.

RAML Training Process

The training process of RAML can be separated into two stages: a prior-knowledge learning stage and a meta-training stage.

At the prior-knowledge learning stage, with the assistance of the Auxiliary module, the Representation module is trained to learn prior-knowledge about image classification in a supervised manner. The training process can be formulated as

[TABLE]

$\mathcal{F}_{r}$ and $\mathcal{C}_{au}$ denote the Representation and Auxiliary modules, respectively, and $\theta_{r}$ and $\theta_{au}$ are their weights. $x_{i}$ is an input image from which the Representation module learns prior-knowledge, and $n$ is the number of images. $\theta^{*}_{r}$ and $\theta^{*}_{au}$ are the learned values of $\theta_{r}$ and $\theta_{au}$ .

At the meta-training stage, the Representation module is not meta-trained, so that the meta-learner can use the learned knowledge to stably map the input image into high-level representations. Similar to AML, in RAML we write the prediction of the meta-learner as $\hat{y_{i}}=\mathbb{F}(x_{i};\theta_{r}^{*},\theta_{a},\theta_{c})$ , where all symbols have the same meanings as in AML. In RAML, the inner-update of the meta-learner on the support set can be formulated as Eq.8 and Eq.9. Unlike the inner-update of AML, which updates all weights of the network, the inner-update of RAML only updates the weights $\theta_{a}$ and $\theta_{c}$ of the ABP module. The weight $\theta_{r}^{*}$ of the Representation module is fixed to keep the learned prior-knowledge.

[TABLE]
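The difference from the AML inner-update can be sketched as follows. This is an illustration only: the weight groups are flat lists under the hypothetical keys `"r"`, `"a"`, and `"c"`, the gradients are toy numbers, and the function name is not from the paper.

```python
def raml_inner_update(theta, alpha, grads, trainable=("a", "c")):
    """Update only the ABP weights; the Representation weights stay fixed."""
    updated = {}
    for name, values in theta.items():
        if name in trainable:
            updated[name] = [t - a * g
                             for t, a, g in zip(values, alpha[name], grads[name])]
        else:
            updated[name] = list(values)  # theta_r^* is copied unchanged
    return updated

theta = {"r": [0.3, -0.7], "a": [1.0], "c": [2.0, 2.0]}
alpha = {"a": [0.5], "c": [0.1, 0.2]}            # no rates for "r"
grads = {"a": [2.0], "c": [10.0, 5.0]}           # no gradients needed for "r"
theta_prime = raml_inner_update(theta, alpha, grads)
```

Because no gradient is required for the `"r"` group, neither the inner-update nor the meta-optimization has to back-propagate through the Representation module.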

The meta-optimizing in RAML can be formulated as Eq.10 and Eq.11.

[TABLE]

The distinguishing characteristic of RAML is that the Representation module and the ABP module are trained separately. The Representation module is trained in a supervised manner to learn prior-knowledge about image classification, and the ABP module is meta-trained to learn how to adjust itself quickly to solve few-shot learning tasks in the representation space provided by the Representation module. Compared with AML, which meta-trains the meta-learner to adjust both the feature extractor and the ABP module, RAML simplifies the few-shot learning problem because the meta-learner only needs to adjust its ABP module in the representation space. This is possibly the reason why RAML outperforms AML in our experiment.

3.4 URAML

Prior-knowledge can be learned not only from labeled data but also from large-scale unlabeled data. Thus, we design the method URAML and show its network structure in Fig.5. Similar to RAML, the meta-learner is composed of a Representation module and an ABP module, and the Auxiliary module does not belong to the meta-learner. The training process of URAML can be separated into two stages: a prior-knowledge learning stage and a meta-training stage.

At the prior-knowledge learning stage, the Representation module learns the knowledge with an unsupervised learning algorithm: the Split-Brain auto-encoder[32]. The Split-Brain auto-encoder simultaneously trains two auto-encoders with Lab images. In the Lab color system, the L channel determines the brightness of the image, and the ab channels determine the color. One auto-encoder in Split-Brain is trained to predict the unseen ab channels of the input Lab image, given only the L channel. The other is trained to predict the unseen L channel, given the ab channels. As Fig.5 shows, the Representation module consists of two ResNet-50 based encoders and the Auxiliary module consists of two corresponding deconvolution[48] based decoders. We formulate the prior-knowledge learning process as Eq.12 and Eq.13.

[TABLE]

In Eq.12, $x^{l}_{i}$ and $x^{ab}_{i}$ are the L and ab channels of the input Lab image $x_{i}$ , respectively. $\mathcal{F}_{l}$ and $\mathcal{D}_{l}$ are the encoder and decoder that predict $x^{ab}_{i}$ based on $x^{l}_{i}$ , respectively, and $\hat{x}_{i}^{ab}$ is the prediction. $\theta_{l}$ and $\omega_{l}$ are the weights of $\mathcal{F}_{l}$ and $\mathcal{D}_{l}$ , respectively, and $\theta_{l}^{*}$ and $\omega_{l}^{*}$ are the optimized values of $\theta_{l}$ and $\omega_{l}$ . $\gamma_{i}^{l}$ is the feature of $x^{l}_{i}$ produced by the encoder $\mathcal{F}_{l}$ . $L_{l}$ is the loss of $\mathcal{F}_{l}$ and $\mathcal{D}_{l}$ , and $l_{2}$ is the MSE loss function. $n$ is the number of Lab images used to train $\mathcal{F}_{l}$ and $\mathcal{D}_{l}$ . In Eq.13, all symbols are defined in the same way as in Eq.12.

[TABLE]

After unsupervised learning, the representation $\gamma_{i}$ of a Lab image $x_{i}$ is calculated by first concatenating $\gamma_{i}^{l}$ with $\gamma_{i}^{ab}$ and then average-pooling the result, i.e. $\gamma_{i}=\mathcal{P}_{a}(\gamma_{i}^{l},\gamma_{i}^{ab})$ , where $\mathcal{P}_{a}$ is an average-pooling layer.

At the meta-training stage, the ABP module is trained in the same way as in RAML. Note that the learned weight of the Representation module in URAML is $\theta_{r}^{*}=[\theta_{l}^{*},\theta_{ab}^{*}]$ .
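The concatenate-then-pool step can be sketched as below. The sketch assumes that each encoder output is a nested list of shape C x H x W, that concatenation is along the channel axis, and that the pooling is a global spatial average; the section only says "average-pooling", so the global extent is an assumption, and the tiny channel counts are for illustration.

```python
def concat_channels(gamma_l, gamma_ab):
    """Channel-wise concatenation of two C x H x W feature maps."""
    return gamma_l + gamma_ab

def average_pool(feature):
    """ASSUMPTION: global spatial average, one value per channel."""
    return [sum(map(sum, channel)) / (len(channel) * len(channel[0]))
            for channel in feature]

def uraml_representation(gamma_l, gamma_ab):
    return average_pool(concat_channels(gamma_l, gamma_ab))

# Two channels per encoder here; any equal split works the same way.
gamma_l = [[[1.0, 3.0]], [[0.0, 0.0]]]
gamma_ab = [[[2.0, 2.0]], [[-1.0, 1.0]]]
gamma = uraml_representation(gamma_l, gamma_ab)
```

The length of the resulting vector is the sum of the two encoders' channel counts, so two encoders of equal width each contribute half of the representation.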

To close the methodology, we briefly summarize our three methods. Inspired by human cognition, which makes full use of attention and prior-knowledge to learn new knowledge efficiently, we design a novel paradigm with three methods that introduce the attention mechanism and prior-knowledge into meta-learning step by step. First, AML is designed to leverage the attention mechanism in meta-learning. Second, RAML is designed to use not only the attention mechanism but also prior-knowledge in meta-learning. Third, in contrast to RAML, URAML learns the prior-knowledge with unsupervised learning. This gives URAML the advantage that the performance of the meta-learner will improve as more unlabeled images become available for the prior-knowledge learning stage and as unsupervised learning algorithms progress.

4 Experiments

In this section, we firstly present the datasets we used in our experiments, and then the details and results of our experiments.

4.1 Dataset

We use several datasets in all our experiments: MiniImagenet[13], Omniglot[49], MiniImagenet-900, Places2[50], COCO[51], and OpenImages-300. Note that, we resize all the images in Omniglot into 28x28 resolution, and all the other images into 84x84.

4.1.1 MiniImagenet

MiniImagenet[13] is widely used for evaluating few-shot learning and meta-learning. It contains 100 image classes, including 64 training classes, 16 validation classes, and 20 testing classes. Each class contains 600 images sampled from the ImageNet dataset[52].

4.1.2 Omniglot

Omniglot[49] is another widely used dataset for meta-learning. It contains 50 different alphabets and 1623 characters from these alphabets, and each character has 20 images hand-drawn by 20 different people.

4.1.3 MiniImagenet-900

The MiniImagenet-900 dataset is designed for the Representation modules in RAML and URAML to learn prior-knowledge, and it is composed of 900 image classes. Each class contains 1300 images collected from the original ImageNet dataset. It is worth noting that no image class in MiniImagenet-900 coincides with any class in the MiniImagenet dataset.

4.1.4 Other Datasets

As the Representation module of URAML is trained by unsupervised learning, we take full advantage of this characteristic by training the Representation module of URAML on not only MiniImagenet-900 but also Places2[50], COCO2017[51], and OpenImages-300. The dataset OpenImages-300 is a subset of the OpenImages-V4 dataset[53]. The total OpenImages-V4 dataset contains 9 million images, and we randomly downloaded 3 million images from the OpenImages-V4 website to form the OpenImages-300 dataset.

4.2 Experiments on MiniImagenet

On MiniImagenet, we test all our methods on 5-way 1-shot and 5-way 5-shot classification tasks. The testing accuracy is averaged over 600 tasks randomly generated on the test set of MiniImagenet and is reported with 95% confidence intervals. The support and query sets of each $N$ -way $K$ -shot task contain $NK$ and $15N$ images, respectively.
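As an illustration of this protocol, the sketch below draws one $N$ -way $K$ -shot task with $NK$ support and $15N$ query images and computes a mean accuracy with a 95% interval. Several details are assumptions rather than statements from the text: the dataset is a hypothetical dictionary from class name to image identifiers, support and query images of a class are disjoint, and the interval uses the normal approximation $1.96\cdot s/\sqrt{n}$ .

```python
import math
import random

def sample_episode(images_by_class, n_way, k_shot, n_query=15, rng=random):
    """Return (support, query) lists of (image, label) pairs for one task."""
    classes = rng.sample(sorted(images_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        picked = rng.sample(images_by_class[cls], k_shot + n_query)
        support += [(img, label) for img in picked[:k_shot]]
        query += [(img, label) for img in picked[k_shot:]]
    return support, query

def mean_with_ci95(accuracies):
    """Mean and half-width of a normal-approximation 95% confidence interval."""
    n = len(accuracies)
    mean = sum(accuracies) / n
    variance = sum((a - mean) ** 2 for a in accuracies) / (n - 1)
    return mean, 1.96 * math.sqrt(variance / n)

# Hypothetical stand-in for the 20 test classes with 600 images each.
test_set = {f"class_{c}": [f"class_{c}/img_{i}" for i in range(600)]
            for c in range(20)}
rng = random.Random(0)
support, query = sample_episode(test_set, n_way=5, k_shot=5, rng=rng)
mean, half_width = mean_with_ci95([0.5, 0.6, 0.7])
```

In an evaluation loop, `sample_episode` would be called 600 times and the per-task query accuracies passed to `mean_with_ci95`.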

In AML, the network structure of the meta-learner is shown in Fig.2. The feature extractor is composed of 4 convolution layers, the classifier is a fully-connected layer, and the attention model structure is shown in Fig.3. Each convolution layer has 64 channels and is followed by a ReLU and a batch-normalization layer. We train the meta-learner on 200000 randomly generated tasks for 60000 iterations, set the learning rate to 0.001, and decay it to 0.0001 after 30000 iterations. Moreover, Dropout with a rate of 0.2 and L1 and L2 regularization with coefficients 0.001 and 0.00001, respectively, are used to prevent the meta-learner from over-fitting.

The experimental result of AML on MiniImagenet is shown in Tab.2. Note that in Tab.2, a method whose name is printed in black uses a shallow network consisting of 4 or 5 convolution layers and one or two fully-connected layers, and a method whose name is printed in blue uses a deep ResNet-based network. Among all the methods using a shallow network, AML attains state-of-the-art results on both the 5-way 1-shot and 5-way 5-shot image classification tasks.

In RAML, the Representation module is a ResNet-50[61] network, and the Auxiliary module is a fully-connected layer. The attention model is the same as that in AML, and the classifier is composed of two fully-connected layers.

At the prior-knowledge learning stage, we set the batch size to 256 and the learning rate to 0.001, decay the learning rate to 0.0001 after 30000 iterations, and use L2 regularization with coefficient 0.00001 and Dropout with a rate of 0.2 to prevent the Representation module from over-fitting. At the meta-training stage, the ABP module is meta-trained with the same settings as AML. The experimental result of RAML is shown in Tab.2. Compared with AML, RAML improves the meta-learner’s performance significantly: it raises the accuracy on 5-way 1-shot tasks from 52.25% to 63.66%, and the accuracy on 5-way 5-shot tasks from 69.46% to 80.49%.

The most likely reason why RAML performs well is that, before the meta-training stage, the Representation module has already learned knowledge that helps the meta-learner understand a new input image and provides high-level, meaningful representations of it. In the meta-training stage, the meta-learner’s job becomes easier because it only needs to learn how to quickly adjust its ABP module according to the compact features provided by the Representation module, and does not need to deal with the original high-dimensional input data. The meta-learner of AML, in contrast, has a harder job, as it has to adjust its whole network to fit new few-shot learning tasks directly from the original input data.

In URAML, the Representation module learns the prior-knowledge with an unsupervised learning algorithm: Split-Brain. As Fig.5 shows, the Representation module is composed of two independent ResNet-50 based encoders, and we halve the number of filters in each encoder so that the Representation module outputs a feature vector with a dimension of 2048, the same as in RAML. The Auxiliary module is composed of two deconvolution-based decoders, and Tab.4 shows the details of the decoder network structure. The number of filters in the last convolution layer is 1 or 2, depending on whether the decoder recovers the L channel or the ab channels of the Lab image.

At both the prior-knowledge learning and meta-training stages, we set all hyperparameters the same as in the RAML experiment. Note that, to save training computation, the decoders in the Auxiliary module recover the ab and L channels at 11x11 resolution rather than the original 84x84. When calculating the MSE losses $L_{l}(\theta_{l},\omega_{l})$ and $L_{ab}(\theta_{ab},\omega_{ab})$ shown in Eq.12 and Eq.13, we first resize the ab and L channels of the input Lab image to 11x11 resolution and then calculate $L_{l}(\theta_{l},\omega_{l})$ and $L_{ab}(\theta_{ab},\omega_{ab})$ . The experimental result of URAML is shown in Tab.2. We also highlight the result of URAML in Tab.2, even though it is not state-of-the-art. In our view, URAML lags behind RAML because the Representation module in URAML learns the prior-knowledge with unsupervised learning, while the Representation module in RAML learns it with supervised learning.

4.3 Experiments on Omniglot

As Omniglot is a much easier dataset than MiniImagenet, on which existing meta-learners can easily achieve more than 95% accuracy on most testing tasks, we only test AML on Omniglot.

As in the experiments on MiniImagenet, we train the meta-learner on 200000 randomly generated tasks for 60000 iterations and set the learning rate to 0.001. The experimental results are shown in Tab.1.

The proposed method AML attains state-of-the-art performance on 2 of the 4 kinds of few-shot image classification tasks. On the 5-way 1-shot task, although MetaGAN+RN performs slightly better than AML, we still highlight AML because MetaGAN+RN uses a deeper ResNet-based network while AML uses a shallower network. On the 20-way 1-shot task, AML surpasses the other methods by a large margin. For example, compared to IMAML HF, AML improves the meta-learner’s performance from 96.18% to 98.48%.

4.4 Ablation Study

4.4.1 Ablation Study about the Attention Mechanism

To confirm that the attention mechanism benefits meta-learning, we compare the performance of meta-learners equipped with the attention model against their counterparts without it. The experimental results are shown in Tab.5 and Tab.3. A compared meta-learner marked with * is re-implemented by ourselves. The performances of our re-implemented meta-learners differ slightly from those reported in their original papers, probably because of different hyper-parameters or experiment settings (all methods in this experiment use convolution layers with 32 filters). The comparisons in Tab.5 and Tab.3 reveal that, in most cases, the attention mechanism improves the meta-learner significantly, which supports the soundness of our idea.

As the attention mechanism adds weights and computation cost to the meta-learner, we conduct another experiment to validate that the improvement of AML comes from the attention mechanism and not from the larger number of weights or the extra computation. Since the attention model in AML is a convolution layer with a kernel size of 1x1, we remove the attention model and stack a convolution layer with the same kernel size on top of the CNN feature extractor. We name the meta-learner with this network AML-attention; its number of weights is the same as that of AML. The corresponding experimental result is shown in Tab.6. AML outperforms AML-attention, which further shows the benefit of the attention mechanism for meta-learning.

4.4.2 Prior-Knowledge Learning Dataset

We conduct experiments to test how the prior-knowledge learning dataset affects RAML and URAML.

a) Effect on RAML: In RAML, the default prior-knowledge learning dataset is our reorganized MiniImagenet-900 dataset. In this experiment, the Representation module learns the prior-knowledge on Places2[50] instead of MiniImagenet-900, and all the other experiment settings and hyper-parameters are kept the same as in the original RAML. We denote this meta-learner as RAML-Places2. The corresponding experimental result is shown in Tab.6. It is clear that the prior-knowledge learning dataset affects the meta-learner. The reason is that different prior-knowledge learning datasets lead the Representation module to learn different knowledge and to express image features differently. Places2 is a dataset commonly used for scene classification, so the Representation module learns knowledge about scene understanding rather than object classification.

b) Effect on URAML: In this experiment, we test how the quantity of unlabeled Lab images in the prior-knowledge learning dataset affects URAML. We design two new versions of URAML: URAML-V1 and URAML-V2. The Representation module of URAML-V1 learns prior-knowledge only on MiniImagenet-900, and that of URAML-V2 learns prior-knowledge on MiniImagenet-900 as well as Places2 and COCO2017. Compared with URAML-V1 and URAML-V2, the original URAML uses the largest quantity of unlabeled Lab images, as MiniImagenet-900, Places2, COCO2017, and OpenImages-300 are all used. Tab.7 shows the prior-knowledge learning datasets and the performances of URAML-V1, URAML-V2, and the original URAML. The original URAML performs the best: the more unlabeled Lab images are used for the meta-learner to learn prior-knowledge, the better the meta-learner performs. Besides, there remains considerable room for improvement, as even more unlabeled data can be used in URAML.

4.4.3 Unsupervised Learning for URAML

The choice of unsupervised learning algorithm also affects URAML considerably. To verify this, we conduct an experiment in which the Representation module in URAML learns the prior-knowledge with a basic unsupervised learning method, the Auto-Encoder[27]; we name this version of URAML URAML-AE. The experimental result of URAML-AE, shown in Tab.7, reveals that the unsupervised learning algorithm affects the meta-learner significantly. The most promising ways to improve the performance of URAML are probably to develop better unsupervised learning algorithms and to collect more unlabeled data.

4.5 Cross-Testing Experiment

We find that existing meta-learning methods generally suffer from a Task-Over-Fitting (TOF) problem, and this problem has seldom been studied. An example of the TOF problem is that a meta-learner to be tested on 5-way 1-shot classification tasks should be trained on 5-way 1-shot tasks rather than on other tasks, and similarly, a meta-learner to be tested on 5-way 5-shot tasks should be trained on 5-way 5-shot tasks. This is because the meta-learner trained on 5-shot tasks over-fits to 5-shot tasks, and when it is tested on 1-shot tasks, it performs noticeably worse than the meta-learner trained on 1-shot tasks.

We conduct many cross-testing experiments to test how much MAML, Meta-SGD, AML, RAML, and URAML suffer from the TOF problem. The experimental results show that, compared with the other methods, our methods suffer less from this problem, especially RAML and URAML.

For each tested meta-learning method, we do the cross-testing experiments in the following way:

train the meta-learner on 5-way $K$ -shot image classification tasks, where $K\in$ {1,3,5,7,9},
test the meta-learner on 5-way $J$ -shot tasks, where $J\in$ {1,3,5,7,9}.

For example, we train a meta-learner with MAML on 5-way 3-shot tasks and test its performance on all 5-way $J$ -shot tasks, $J\in$ {1,3,5,7,9}. The experimental results are shown in Fig.6.
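The two steps above amount to filling a 5 x 5 grid of accuracies. The sketch below shows only the loop structure: `train_fn` and `eval_fn` are hypothetical callables standing in for meta-training on $K$ -shot tasks and for testing on $J$ -shot tasks, and the toy stand-ins at the bottom merely mimic a learner that does best on the shot count it was trained on.

```python
SHOTS = (1, 3, 5, 7, 9)

def cross_test(train_fn, eval_fn, shots=SHOTS):
    """Train one meta-learner per K and test each on every J."""
    table = {}
    for k in shots:
        learner = train_fn(k)                 # meta-train on 5-way K-shot tasks
        for j in shots:
            table[(k, j)] = eval_fn(learner, j)  # accuracy on 5-way J-shot tasks
    return table

def best_train_shot(table, j, shots=SHOTS):
    """Which training shot count K gives the best accuracy on J-shot tasks."""
    return max(shots, key=lambda k: table[(k, j)])

# Toy stand-ins: accuracy drops as the train/test shot counts diverge.
table = cross_test(train_fn=lambda k: k,
                   eval_fn=lambda learner, j: 1.0 - 0.01 * abs(learner - j))
```

A method that suffers from task over-fitting shows a different `best_train_shot` for each $J$ , as in this toy table; a method that generalizes across shot counts shows the same winner in most columns.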

Fig.6 shows that MAML suffers seriously from the TOF problem, because its meta-learner that performs best on $K$ -shot tasks usually does not perform well on $J$ -shot tasks, where $K\neq J$ . For example, in MAML, the meta-learner trained on 1-shot tasks performs best on the 1-shot tasks, but it cannot perform as well as the other meta-learners on 3-, 5-, 7-, and 9-shot tasks, which means the meta-learner trained on 1-shot tasks over-fits to 1-shot tasks. The meta-learner trained by URAML is troubled little by the TOF problem, because the meta-learner that performs best on $K$ -shot tasks usually also performs best on $J$ -shot tasks, where $K$ , $J\in$ {1,5,7,9}. For example, in URAML, the meta-learner trained on 1-shot tasks performs best not only on the 1-shot tasks but also on 5-, 7-, and 9-shot tasks, which means the meta-learner trained on 1-shot tasks generalizes well to the other $J$ -shot tasks.

We design a metric, Cross-Entropy across Tasks (CET), to quantify how vulnerable a meta-learning approach is to the TOF problem. The evaluation process is shown in Eq.14, where $i,j\in$ {1,3,5,7,9} and boldface variables are vectors. $\mathcal{S}$ and $\mathcal{D}$ are the softmax and cross-entropy operations. $\bm{a}_{i}$ is the vector of testing accuracies of the five meta-learners trained on 1-, 3-, 5-, 7-, and 9-shot tasks when they are tested on $i$ -shot tasks. $\bm{d}_{i}$ is the meta-learners’ accuracy distribution on $i$ -shot tasks. $l_{i,j}$ represents the similarity between the accuracy distribution vectors $\bm{d}_{i}$ and $\bm{d}_{j}$ . $L$ represents the overall similarity over all $l_{i,j}$ for a specific approach.

[TABLE]

For example, the testing accuracies of the five Meta-SGD meta-learners when they are tested on 3-shot tasks are $\bm{a}_{3}$ = [58.24%, 59.18%, 58.90%, 58.75%, 59.15%]. So, $\bm{a}_{3}/max(\bm{a}_{3})$ = [58.24%, 59.18%, 58.90%, 58.75%, 59.15%] / 59.18%, and $\bm{d}_{3}$ = $\mathcal{S}(\bm{a}_{3}/max(\bm{a}_{3}))$ = [0.116, 0.255, 0.202, 0.178, 0.249]. Similarly, $\bm{d}_{7}$ = [0.122, 0.206, 0.255, 0.233, 0.184]. Then, $l_{3,7}$ = 1.603, and $L$ = 34.22.
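The worked example can be checked numerically with the sketch below, which comes with two caveats. First, a plain softmax of $\bm{a}_{3}/max(\bm{a}_{3})$ gives an almost uniform distribution, not the printed $\bm{d}_{3}$ ; a multiplier of about 50 inside the softmax comes close to the printed values, but that `scale` parameter is an assumption inferred from the numbers and is not stated in the text. Second, the cross-entropy is taken as $-\sum_{k}d_{3,k}\ln d_{7,k}$ with the natural logarithm, which reproduces the printed $l_{3,7}$ from the printed distributions. How the $l_{i,j}$ are combined into $L$ is left to Eq.14 and is not reproduced here.

```python
import math

def accuracy_distribution(accuracies, scale=1.0):
    """Softmax of scale * a / max(a); `scale` is a hypothetical parameter."""
    best = max(accuracies)
    # Subtracting 1.0 (the largest ratio) keeps exp() stable without
    # changing the softmax result.
    exps = [math.exp(scale * (a / best - 1.0)) for a in accuracies]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    """D(p, q) = -sum_k p_k * ln(q_k)."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q))

a3 = [58.24, 59.18, 58.90, 58.75, 59.15]
d3_printed = [0.116, 0.255, 0.202, 0.178, 0.249]
d7_printed = [0.122, 0.206, 0.255, 0.233, 0.184]

d3_plain = accuracy_distribution(a3)               # close to uniform
d3_scaled = accuracy_distribution(a3, scale=50.0)  # close to the printed d_3
l_3_7 = cross_entropy(d3_printed, d7_printed)      # close to the printed 1.603
```

With `scale=50.0` every entry of `d3_scaled` is within 0.002 of the printed vector, which suggests that Eq.14 sharpens the normalized accuracies before the softmax so that small accuracy gaps produce visibly different distributions.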

The smaller the total distance $L$ , the less the meta-learning approach suffers from the TOF problem. We show the performance of different meta-learning approaches on the CET metric in Tab.8. This experiment shows that the proposed AML, RAML, and URAML perform better than MAML and Meta-SGD on the CET metric, and that RAML and URAML perform best. A possible reason is that prior-knowledge and the attention mechanism both help the meta-learner reduce its few-shot cognitive load and avoid being affected by redundant, useless information.

An interesting phenomenon in Fig.6 is that the meta-learner trained by RAML on 5-way 9-shot tasks performs best on most of the test tasks, while for URAML it is the meta-learner trained on 5-way 1-shot tasks that performs best. A possible reason is that the Representation module of RAML learns knowledge by supervised learning, while the Representation module of URAML learns knowledge by unsupervised learning, so the output features of these two kinds of Representation module differ.

4.6 Feature Analysis

To understand the effect of the attention mechanism, we visualize the distributions of the features $\gamma$ and $\gamma^{\alpha}$ (shown in Fig.2, Fig.4 and Fig.5) in Fig.7 with t-SNE[62]. In Fig.7, the 500 feature points in each picture represent 500 $\gamma$ or $\gamma^{\alpha}$ vectors of the query set images of a 5-way 1-shot or 5-shot task randomly generated on the test set of MiniImagenet.

The average intra-class distance D1 of $\gamma^{\alpha}$ is smaller than that of $\gamma$ , and the average inter-class distance D2 of $\gamma^{\alpha}$ is larger than that of $\gamma$ . This result indicates that, across different image classes, the distribution of $\gamma^{\alpha}$ is more distinguishable than that of $\gamma$ . The reason is that the attention mechanism enables the meta-learner to adjust its attention quickly to critical image features, which makes $\gamma^{\alpha}$ more useful than $\gamma$ for differentiating images of different classes.
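One way to compute such distances is sketched below. The definitions are assumptions, since this section does not give formulas for D1 and D2: the intra-class distance is taken as the mean Euclidean distance from each point to its class centroid, and the inter-class distance as the mean Euclidean distance between pairs of class centroids. Whether the paper measures them in the t-SNE plane or in the original feature space is not stated either.

```python
import math
from itertools import combinations

def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    return [sum(axis) / len(points) for axis in zip(*points)]

def intra_class_distance(points_by_class):
    """Mean distance from each point to the centroid of its own class."""
    dists = [math.dist(p, centroid(pts))
             for pts in points_by_class.values() for p in pts]
    return sum(dists) / len(dists)

def inter_class_distance(points_by_class):
    """Mean distance between every pair of class centroids."""
    centers = [centroid(pts) for pts in points_by_class.values()]
    dists = [math.dist(a, b) for a, b in combinations(centers, 2)]
    return sum(dists) / len(dists)

# Toy 2-D embedding with two classes.
embedded = {"A": [(0.0, 0.0), (2.0, 0.0)],
            "B": [(0.0, 4.0), (2.0, 4.0)]}
d1 = intra_class_distance(embedded)
d2 = inter_class_distance(embedded)
```

Under these definitions, a smaller `d1` together with a larger `d2` means tighter and better separated class clusters, which is the pattern reported for $\gamma^{\alpha}$ relative to $\gamma$ .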

4.7 Heat-Map of $\gamma$ and $\gamma^{\alpha}$

To further analyze how the attention mechanism affects the meta-learner, we visualize the heat-maps of $\gamma$ and $\gamma^{\alpha}$ in Fig.8. To get the heat-map of $\gamma$ , we first inner-update the RAML meta-learner on the support set of a randomly generated 5-way 1-shot testing task on MiniImagenet. Then, we feed the meta-learner with the query set images and average the feature maps $\gamma$ across the channel axis to get the heat-maps of $\gamma$ . The heat-maps of $\gamma^{\alpha}$ are obtained in the same way.
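The channel-averaging step can be written out as follows, assuming a feature map stored as a nested list of shape C x H x W. Any resizing or color-mapping used to draw Fig.8 is outside this sketch.

```python
def channel_mean_heatmap(feature):
    """Average a C x H x W feature map over channels, giving an H x W map."""
    channels = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    return [[sum(feature[c][i][j] for c in range(channels)) / channels
             for j in range(width)]
            for i in range(height)]

feature = [[[1.0, 0.0], [0.0, 0.0]],   # channel 0 responds at the top-left
           [[3.0, 0.0], [0.0, 2.0]]]   # channel 1 responds at two corners
heatmap = channel_mean_heatmap(feature)
```

Locations where many channels respond strongly get high values, so the heat-map highlights the image regions that dominate the feature.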

From the heat-maps shown in Fig.8, we can see that compared with $\gamma$ , $\gamma^{\alpha}$ is more sensitive to the distinguishing part of the input image, revealing that the meta-learner shifts its attention to the most discriminative image feature. For example, the first column of Fig.8 is a fish. Besides the fish body, $\gamma$ is also sensitive to some background regions of the image. However, the meta-learner discovers that only the fish body is crucial for categorizing this image and shrinks its attention region, so that $\gamma^{\alpha}$ is sensitive only to the fish body.

Through the visualization and analysis of the heat-maps of $\gamma$ and $\gamma^{\alpha}$ , we can see that the attention mechanism helps the meta-learner focus on the most distinguishing image feature, and thereby helps the meta-learner perform better on few-shot learning tasks.

5 Conclusion and the Future Work

In this paper, inspired by human cognition and the human learning process, we highlight the importance of the attention mechanism and prior-knowledge for meta-learning based few-shot learning. To solve a few-shot learning task, the meta-learner should first make good use of stable prior-knowledge to understand images and extract compact feature representations, so that it can solve the task in the compact representation space rather than the original image space. Then, the meta-learner should adjust its attention to the crucial features of the extracted representations and make the final decision based on its attention. Therefore, we propose three methods step by step, AML, RAML, and URAML, to introduce the attention mechanism and prior-knowledge into meta-learning. All of them work successfully, with state-of-the-art performance on several few-shot learning benchmarks, which supports the soundness of our viewpoints and methods.

Besides, we find that existing meta-learning approaches suffer from the TOF problem, which is unfriendly to practical applications. We design a novel Cross-Entropy across Tasks (CET) metric to evaluate how much a meta-learning approach suffers from TOF. The experiments show that, compared to existing meta-learning methods, the proposed methods suffer less from the TOF problem, especially RAML and URAML.

Among the proposed methods, although URAML does not perform the best, we think it is the most promising one because there is considerable room to improve its performance, which will also be the direction of our future work. The ablation study suggests two ways to improve the performance of URAML significantly. One is to develop better unsupervised or self-supervised learning algorithms. The fact that RAML performs better than URAML reveals that the current unsupervised learning algorithm falls behind supervised learning. Bridging the gap between unsupervised and supervised learning algorithms will very likely boost the performance of URAML. The other way is to use more unlabeled data for URAML to learn prior-knowledge. Although 7.1 million unlabeled images are used in URAML, this still falls dramatically behind what humans have seen, in terms of both quantity and quality. As for quantity, if a person looks at 1 image per second for 15 hours per day, he or she sees about 100 million images in 5 years. As for quality, humans perceive the world in a multimodal way: a human can not only see an object but also touch it and move around it, which helps humans understand the world more accurately than computer vision systems do. In short, developing unsupervised or self-supervised learning algorithms and collecting more unlabeled images will both help URAML perform well.
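The quantity estimate can be checked with a few lines of arithmetic. The only added assumption is a 365-day year; the 7.1 million figure is the one quoted in the text.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_DAY = 15        # viewing time per day, from the text
DAYS_PER_YEAR = 365       # assumption: leap days ignored
YEARS = 5
IMAGES_PER_SECOND = 1

images_seen = (IMAGES_PER_SECOND * SECONDS_PER_HOUR * HOURS_PER_DAY
               * DAYS_PER_YEAR * YEARS)
ratio_to_uraml = images_seen / 7_100_000   # 7.1 million images used in URAML

print(images_seen)               # 98550000, i.e. roughly 100 million
print(round(ratio_to_uraml, 1))  # 13.9
```

Under this estimate, a person sees roughly fourteen times as many images in five years as URAML used for prior-knowledge learning.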

Acknowledgements

This work is supported by the National Natural Science Foundation of China (No. 61573286).

Bibliography


[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[2] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, Inception-ResNet and the impact of residual connections on learning,” in AAAI, vol. 4, 2017, p. 12.
[3] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in CVPR, vol. 1, no. 2, 2017, p. 3.
[4] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[5] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra et al., “Matching networks for one shot learning,” in Advances in Neural Information Processing Systems, 2016, pp. 3630–3638.
[6] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” in Advances in Neural Information Processing Systems, 2017, pp. 4077–4087.
[7] W. Li, L. Wang, J. Xu, J. Huo, Y. Gao, and J. Luo, “Revisiting local descriptor based image-to-class measure for few-shot learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7260–7268.
[8] Y. Bengio, S. Bengio, and J. Cloutier, Learning a synaptic learning rule. Université de Montréal, Département d’informatique et de recherche opérationnelle, 1990.
