Which One Are You Referring To? Multimodal Object Identification in Situated Dialogue

Holy Lovenia; Samuel Cahyawijaya; Pascale Fung

arXiv:2302.14680·cs.CL·March 16, 2023

TL;DR

This paper investigates multimodal object identification in situated dialogue, proposing three methods evaluated on the SIMMC 2.1 dataset, with the best method improving F1-score by approximately 20%.

Contribution

It introduces and evaluates three novel methods for multimodal object identification in situated dialogue, with scene-dialogue alignment being the most effective.

Findings

01. Scene-dialogue alignment improves performance by ~20% F1-score.

02. The methods are evaluated on the largest situated dialogue dataset, SIMMC 2.1.

03. Analysis highlights limitations and future directions for multimodal dialogue systems.

Abstract

The demand for multimodal dialogue systems has been rising in various domains, emphasizing the importance of interpreting multimodal inputs from conversational and situational contexts. We explore three methods to tackle this problem and evaluate them on the largest situated dialogue dataset, SIMMC 2.1. Our best method, scene-dialogue alignment, improves the performance by ~20% F1-score compared to the SIMMC 2.1 baselines. We provide analysis and discussion regarding the limitation of our methods and the potential directions for future works. Our code is publicly available at https://github.com/holylovenia/multimodal-object-identification.

Tables

Table 1: Statistics of the ambiguous candidate identification task of the SIMMC 2.1 dataset.

Split	# Samples	# Dialogues	$\frac{O^{match}}{O^{scene}}$
Train	4239	3983	28.74%
Validation	414	371	24.72%
Test	940	905	30.78%

Table 2: Experimental results of multimodal object identification on the SIMMC 2.1 dataset Kottur et al. (2021). Bold denotes the best performances of baselines and proposed methods. Underline denotes the best performances within a method type.

Method Type	Approach	Recall	Precision	F1-score
Baselines
Heuristic	No object	0.00%	0.00%	0.00%
Heuristic	Random	49.90%	22.43%	30.95%
Heuristic	All objects	100.00%	22.34%	36.52%
SIMMC 2.1	ResNet50-GPT2	36.40%	42.26%	39.11%
SIMMC 2.1	ResNet50-BERT	36.70%	43.39%	39.76%
Dialogue-Contextualized Object Detection	MDETR (zero-shot)	16.33%	29.70%	21.07%
Object-Dialogue Alignment	CLIP (zero-shot)	55.70%	26.39%	35.81%
Object-Dialogue Alignment	CLIP (fine-tuned)	73.00%	32.62%	45.09%
Proposed Methods
Dialogue-Contextualized Object Detection	SitCoM-DETR (aug)	47.82%	25.69%	33.42%
Dialogue-Contextualized Object Detection	SitCoM-DETR (no aug)	49.51%	25.81%	33.93%
Object-Dialogue Alignment	CLIPPER (v1)	73.41%	33.00%	45.53%
Object-Dialogue Alignment	CLIPPER (v2)	59.95%	25.60%	35.88%
Scene-Dialogue Alignment	DETR-BERT	65.47%	51.48%	57.64%
Scene-Dialogue Alignment	DETR-GPT2	63.81%	56.79%	60.10%

Table 3: Results for object-dialogue alignment models with different thresholding strategies.

Model	Objective	Thresholding	Rec.	Prec.	F1
CLIP	Cross-Entropy	Mean	73.00%	32.62%	45.09%
CLIP	Cross-Entropy	Oracle	74.99%	74.96%	74.98%
CLIPPER (v1)	Binary Cross-Entropy	Sigmoid	73.41%	33.00%	45.53%
CLIPPER (v1)	Binary Cross-Entropy	Mean	73.08%	31.97%	44.48%
CLIPPER (v1)	Binary Cross-Entropy	Oracle	73.37%	73.34%	73.36%
CLIPPER (v2)	Binary Cross-Entropy	Sigmoid	59.95%	25.60%	35.88%
CLIPPER (v2)	Binary Cross-Entropy	Mean	53.90%	23.42%	32.65%
CLIPPER (v2)	Binary Cross-Entropy	Oracle	54.92%	54.89%	54.91%

Equations

$$\mathrm{Recall} = \frac{N^{correct}}{\|L\|}$$

$$\mathrm{Precision} = \frac{N^{correct}}{\|P\|}$$

$$\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

Code & Models

Repositories

holylovenia/multimodal-object-identification
PyTorch · Official

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · Natural Language Processing Techniques

Full text

Which One Are You Referring To?

Multimodal Object Identification in Situated Dialogue

Holy Lovenia, Samuel Cahyawijaya∗, Pascale Fung

Center for Artificial Intelligence Research (CAiRE),

The Hong Kong University of Science and Technology

{hlovenia, scahyawijaya}@connect.ust.hk

∗Equal contribution.

Abstract

The demand for multimodal dialogue systems has been rising in various domains, emphasizing the importance of interpreting multimodal inputs from conversational and situational contexts. One main challenge in multimodal dialogue understanding is multimodal object identification, which constitutes the ability to identify objects relevant to a multimodal user-system conversation. We explore three methods to tackle this problem and evaluate them on the largest situated dialogue dataset, SIMMC 2.1. Our best method, scene-dialogue alignment, improves the performance by $\sim$ 20% F1-score compared to the SIMMC 2.1 baselines. We provide analysis and discussion regarding the limitation of our methods and the potential directions for future works. Our code is publicly available at https://github.com/holylovenia/multimodal-object-identification.

1 Introduction

Recent advancements in multimodal dialogue systems have gained more traction in various domains such as retail, travel, fashion, interior design, and many others. A real-world application of multimodal dialogue systems is situated dialogue, where a dialogue agent shares a co-observed vision or physical space with the user, and is responsible for handling user requests based on the situational context, which are often about the objects in their surroundings. This makes multimodal object identification from a dialogue (i.e., identifying objects that fit a dialogue context) an indispensable skill in multimodal dialogue understanding, built on cross-modal understanding to comprehend the relations between linguistic expressions and visual cues.

Various methods have been proposed to perform multimodal object identification through different paradigms Yu et al. (2016); Hu et al. (2016); Ilinykh et al. (2019); Kamath et al. (2021); Kuo and Kira (2022). These efforts have established remarkable progress in solving this problem. However, aside from an observed gap between the performance of the existing works and human-level performance in multimodal object identification, prior works also rely on a presumption that the information given by the textual context will only lead to specific (i.e., unambiguous) objects, which does not conform to real-world multimodal conversations where ambiguity exists.

Therefore, in this work, we explore three different solutions to enable multimodal object identification in situated dialogue systems, i.e., dialogue-contextualized object detection, object-dialogue alignment, and scene-dialogue alignment, without adopting the unambiguity assumption. Dialogue-contextualized object detection utilizes the spatial and object understanding capability of a pre-trained object detection model to generate a semantic representation containing both the visual cues and the spatial understanding of the object. Object-dialogue alignment incorporates the image-text alignment capability of CLIP Radford et al. (2021), which has been pre-trained on large image-text corpora, to perform multimodal object identification from the given dialogue context. Scene-dialogue alignment combines the spatial and object understanding capability of a pre-trained object detection model with a pre-trained textual understanding model to produce better semantic vision-language alignment.

Our contributions are three-fold:

• We introduce three different methods for handling multimodal object identification in situated dialogue, i.e., dialogue-contextualized object detection, object-dialogue alignment, and scene-dialogue alignment;

• We show the dialogue-contextualized object detection method fails to outperform even the heuristic baselines despite having an acceptable performance on the object detection task;

• We show the effectiveness of the other two methods, which significantly outperform the SIMMC 2.1 baselines by $\sim$ 5% F1-score for object-dialogue alignment and $\sim$ 20% F1-score for scene-dialogue alignment.

2 Related Work

Multimodal Dialogue System

Multiple studies have attempted to enable the skills required for multimodal dialogue systems, e.g., understanding visual Antol et al. (2015); Das et al. (2017); Kottur et al. (2019) or visual-temporal Alamri et al. (2019) content to answer users’ questions, grounding conversations to images Mostafazadeh et al. (2017); Shuster et al. (2020), interpreting multimodal inputs and responding with multimodal output to assist users with their goals Saha et al. (2018) or as a means to converse Sun et al. (2022), and perceiving the shared environment to grasp situational context to enable proper navigation, adaptation, and communication Lukin et al. (2018); Brawer et al. (2018); Kottur et al. (2021).

At the core of these efforts, the ability to understand language and vision, as well as integrate both representations to align the linguistic expressions in the dialogue with the relevant visual concepts or perceived objects, is the key to multimodal dialogue understanding Landragin (2006); Loáiciga et al. (2021b, a); Kottur et al. (2018); Utescher and Zarrieß (2021); Sundar and Heck (2022); Dai et al. (2021).

Multimodal Object Identification

Identifying objects or visual concepts related to a linguistic expression is an incremental exploration in vision-language research. It starts with identifying simple objects in a sanitized environment Mitchell et al. (2010) based on image descriptions or captions. Then, multimodal object identification has been gradually increasing in complexity and realism by involving visual contexts with cluttered and diverse scenes Kazemzadeh et al. (2014); Gkatzia et al. (2015); Yu et al. (2016); Mao et al. (2016); Hu et al. (2016); Ilinykh et al. (2019); Kamath et al. (2021); Kuo and Kira (2022).

While these works base their multimodal object identification on single-turn text contexts, another line of works explores the usage of multi-turn sequences as a textual context to enable identifying objects based on implicit constraints deduced through multi-round reasoning Seo et al. (2017); Johnson et al. (2017); Liu et al. (2019); Moon et al. (2020). However, they focus on identifying only the specific (i.e., unambiguous) objects, in which only a certain object in the scene fits the corresponding linguistic context. This is quite dissimilar from real-world multimodal object identification, where multiple objects could fit a given textual context and induce ambiguity into the conversation Kottur et al. (2021). For this reason, existing works are not equipped with the ability to identify all objects that plausibly fit those constraints although this skill is required to perform multimodal object identification in situated dialogue.

Multimodal and Cross-Modal Learning

Past works have studied multimodal and cross-modal alignment, grounding, and generation to solve various vision-language tasks, e.g., image captioning Hossain et al. (2019); Sharma et al. (2018), generating stories from images Min et al. (2021); Lovenia et al. (2022), as well as multimodal object identification Li et al. (2019); Wang et al. (2022). These attempts have become more substantial and extensive after the rise of pre-trained vision-language models such as CLIP Radford et al. (2021), ALIGN Jia et al. (2021), and FLAVA Singh et al. (2022), which allow the knowledge obtained from large-scale pre-training to be transferred to downstream tasks.

3 Methodology

In this section, we describe the preliminaries of our work (§3.1) and extensively elaborate on each of our approaches, i.e., dialogue-contextualized object detection (§3.2), object-dialogue alignment (§3.3), and scene-dialogue alignment (§3.4).

3.1 Preliminaries

The goal of multimodal object identification in situated dialogue is to identify objects from a given scene image that fulfill the user’s request gathered from the user-system interactions. To identify the object(s) that could satisfy a user’s request in a dialogue, it is crucial to match the objects and the implicit constraints interwoven in the dialogue, e.g., S: “I do! Take a look at these. I have a brown coat towards the far end on the left wall, another brown coat on the left side of the front floor rack, and a black coat on the front of the same rack.”, U: “Awesome! Tell me the cost and label on that one.”. Thus, it is essential for the system to understand the relation between the visual perception of the objects in the scenes and the natural language used to verbalize these constraints, which describe the target object(s) by visual attributes (e.g., color, object category or type, etc.), location (i.e., absolute or relative position), or the combination of both.

We define a dialogue between a user and a system as $D=\{u_{1},s_{1},u_{2},s_{2},\dots,u_{n},s_{n}\}$ , a scene consisting of images corresponding to multiple viewpoints of the scene as $\{I^{scene}_{1},I^{scene}_{2},\dots,I^{scene}_{n}\}$ , and a set of objects in the scene as $O^{scene}=\{(b_{1},c_{1}),(b_{2},c_{2}),\dots,(b_{m},c_{m})\}$ , where $u_{i}$ and $s_{i}$ respectively denote the user utterance and the system utterance, and $b_{i}$ and $c_{i}$ respectively denote the bounding box and the class category of an object. Given a user dialogue turn $D^{user}_{i}=\{u_{1},s_{1},u_{2},s_{2},\dots,u_{i}\}$ , $i\leq n$ , and a scene image $I^{scene}_{i}$ , the goal of the task is to select a subset of scene objects $O^{match}\subseteq O^{scene}$ that could satisfy the referred criteria in $D^{user}_{i}$ .

3.2 Approach 1: Dialogue-Contextualized Object Detection

For dialogue-contextualized object detection, we frame the task of multimodal object identification as a contextualized object detection task. In object detection, given a scene image $I^{scene}$ , we aim to detect all objects $O^{scene}$ in the scene by predicting their bounding box and class category. In contextualized object detection, by contrast, the aim is to select only the set of scene objects $O^{match}$ that satisfy a given context.

Our approach for dialogue-contextualized object detection extends a state-of-the-art object detection model, namely DETR Carion et al. (2020), by injecting dialogue information as the context to guide the detection model to filter out objects that are not identified by the dialogue. A similar solution has been proposed by Modulated DETR (MDETR) Kamath et al. (2021). Despite its strong performance on text-contextualized object detection, MDETR requires an aligned annotation between the text phrase and the visual object for training. Such annotation is not available in SIMMC 2.1; hence, we develop a new text-contextualized object detection model, namely Situational Context for Multimodal DETR (SitCoM-DETR). Unlike MDETR, which concatenates the textual representation with the visual representation before feeding them into the transformer encoder of DETR (shown in Figure 6, Appendix A), SitCoM-DETR injects a dialogue-level semantic representation vector into the input query of the transformer decoder of DETR in order to guide the model to select objects that match the dialogue context. We incorporate the same loss functions as the original DETR model. The depiction of our SitCoM-DETR model is shown in Figure 2.

3.3 Approach 2: Object-Dialogue Alignment

For object-dialogue alignment, we frame the task of multimodal object identification as the alignment between a target object $O^{match}_{i}$ and a user dialogue turn $D^{user}_{i}$ pair. Given a user dialogue turn $D^{user}_{i}$ and its corresponding scene image $I^{scene}_{i}$ , we first preprocess $I^{scene}_{i}$ to extract the object images of $O^{match}$ . Each of the object images is paired with $D^{user}_{i}$ as the positive pairs. We obtain the visual embeddings from the image by feeding it to an image encoder, and the textual embeddings from the dialogue turn by feeding it to a text encoder. After these embeddings pass through a linear projection, we calculate the similarity using the dot product between the two resulting vectors. Utilizing the contrastive learning objective, on a batch of object-dialogue pairs, this cross-modal alignment architecture learns by maximizing the similarity of the positive pairs and minimizing the similarity of the negative pairs (Figure 3).
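The contrastive objective described above can be illustrated with a minimal, dependency-free sketch. It assumes the symmetric cross-entropy form used by CLIP, takes already-projected embeddings as plain lists, and omits the learned temperature; the function names are hypothetical and are not taken from the authors' repository.

```python
import math


def dot(a, b):
    """Dot product between two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_entropy_row(logits, target_index):
    """Cross-entropy of one row of logits against a single target index."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target_index]


def contrastive_loss(object_embs, dialogue_embs):
    """Symmetric contrastive loss over a batch of (object, dialogue) pairs.

    Pair i is the positive pair; every other combination in the batch is
    treated as a negative (assumption: the one-to-one CLIP-style objective).
    """
    n = len(object_embs)
    sim = [[dot(o, d) for d in dialogue_embs] for o in object_embs]
    obj_to_dlg = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    dlg_to_obj = sum(
        cross_entropy_row([sim[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (obj_to_dlg + dlg_to_obj)


# Toy batch: two objects and two dialogue turns in a 2-d latent space.
aligned = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = contrastive_loss(aligned, aligned)
loss_swapped = contrastive_loss(aligned, aligned[::-1])
```

In this toy batch the loss is lower when each object is paired with its own dialogue turn than when the pairs are swapped, which is the behaviour the objective is meant to reward.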

Object-Dialogue Similarity Learning Strategy

The original contrastive learning approaches the object-dialogue alignment task as a one-to-one function, where the positive sample of $D_{i}$ is only $O_{i}$ in Figure 3. This is different from the actual nature of multimodal object identification, where more than one object could be relevant to a dialogue turn. For this reason, in addition to the original contrastive learning, we explore two modifications of the learning objective, where: 1) the positive samples of $D_{i}$ include $O_{i}$ (image pair) and objects similar to $O_{i}$ , where we define similar objects as any other objects in the corresponding scene that use the same prefabricated design as $O_{i}$ in the SIMMC 2.1 dataset; and 2) the positive samples of $D_{i}$ include $O_{i}$ and other supposedly identified objects in $D_{i}$ . For simplicity, we refer to these methods as CLIPPER (v1) and CLIPPER (v2).
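One way to picture the multi-positive modification is a binary cross-entropy over the dialogue-object similarity matrix with multi-hot targets. The sketch below is written under that assumption; how the actual CLIPPER variants assemble positives inside a batch is not specified here, so `positives` and the function names are hypothetical.

```python
import math


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def multi_positive_loss(sim, positives):
    """Mean BCE over a similarity matrix with several positives per dialogue.

    sim[i][j]    : similarity logit between dialogue turn i and object j
    positives[i] : set of object indices treated as positives for turn i
                   (v1: O_i plus similar objects; v2: O_i plus the other
                   identified objects, following the description above)
    """
    total, count = 0.0, 0
    for i, row in enumerate(sim):
        for j, logit in enumerate(row):
            target = 1.0 if j in positives[i] else 0.0
            total += bce_with_logits(logit, target)
            count += 1
    return total / count


# One dialogue turn, three candidate objects; objects 0 and 1 both score high.
sim = [[4.0, 4.0, -4.0]]
loss_multi = multi_positive_loss(sim, [{0, 1}])   # both high scorers are positives
loss_single = multi_positive_loss(sim, [{0}])     # one-to-one labelling
```

With one-to-one labelling the second high-scoring object is penalized as a negative, whereas the multi-hot target lets both relevant objects be scored highly at once.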

3.4 Approach 3: Scene-Dialogue Alignment

For scene-dialogue alignment, we aim to combine the spatial understanding learned from object detection training with the image-text matching for multimodal similarity learning to solve multimodal object identification. For this approach, we utilize a pre-trained object detection model, i.e., DETR, and two pre-trained language models, i.e., BERT and GPT2. The resulting models are referred to as DETR-BERT and DETR-GPT2, respectively. We illustrate the overview of this approach in Figure 4.

In this approach, we first frame our dataset as an object detection task, where a data instance consists of a scene image $I^{scene}_{i}$ and its object annotations $O^{scene}=\{(b_{1},c_{1}),(b_{2},c_{2}),...,(b_{m},c_{m})\}$ , and train an object detection model (DETR) on it. The resulting model is then used to extract the visual representations of all objects in the scene image $I^{scene}$ by matching the object queries with $O^{scene}$ using Hungarian matching Stewart et al. (2016).

For the next step, we frame our dataset as a binary classification task, where a data instance consists of a user dialogue turn $D^{user}_{i}$ , an object ${O^{scene}_{j}}$ in a corresponding scene $I^{scene}_{i}$ , and a binary label (i.e., whether the object is identified by the user dialogue turn or not). We utilize a dialogue encoder to extract textual representation from a user dialogue turn $D^{user}_{i}$ . The textual representation of ${D^{user}_{i}}$ and the visual representation of ${O^{scene}_{j}}$ are projected into a latent space. We compute the dot product of the two and use the resulting vector as the prediction logits for training and inference.
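The scoring step in this second stage can be sketched as two linear projections followed by a dot product per object. This is only an illustration: the weights, the dimensionality, and the decision rule (logit at or above 0, i.e., sigmoid at or above 0.5) are assumptions, and the representations stand in for the DETR and dialogue encoder outputs.

```python
def project(vector, weight):
    """Linear projection; weight has shape (out_dim, in_dim), no bias term."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


def match_logits(dialogue_repr, object_reprs, w_text, w_visual):
    """One logit per scene object: <proj(dialogue), proj(object)>."""
    d = project(dialogue_repr, w_text)
    return [
        sum(a * b for a, b in zip(d, project(o, w_visual)))
        for o in object_reprs
    ]


def identified_objects(logits):
    """Indices whose logit is non-negative (assumed decision rule)."""
    return [j for j, x in enumerate(logits) if x >= 0]


# Toy example with identity projections in a 2-d space.
identity = [[1, 0], [0, 1]]
logits = match_logits([1, 0], [[2, 0], [-1, 3]], identity, identity)
selected = identified_objects(logits)
```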

4 Experiment

4.1 Dataset

For all of our experiments, we utilize the ambiguous candidate identification task from the SIMMC 2.1 dataset Kottur et al. (2021). The dataset studies conversational scenarios where the system shares a co-observed vision (i.e., the same scene) with the user. The dataset focuses on improving the shopping experience in two domains: fashion and furniture. In the setting of SIMMC 2.1, the system is able to access the ground truth meta information of all objects (e.g., object price, size, material, brand, etc.) in the scene $O^{scene}$ , while the user observes objects only through the scene viewpoints $\{I^{scene}_{1},I^{scene}_{2},\dots,I^{scene}_{n}\}$ to describe a request.

Each dialogue in the dataset can utilize different scene viewpoints at different dialogue turns throughout the session. This represents scenarios where the user navigates the scene during the interaction in a real physical store. Therefore, the multimodal dialogue system needs to understand user requests using both the dialogue history and the scene image as a unified multimodal context. The statistics of the ambiguous candidate identification task of the SIMMC 2.1 dataset are presented in Table 1. We use the devtest split of the SIMMC 2.1 dataset as the test set in our experiments.

4.2 Baselines

We incorporate various baselines, including simple heuristics and deep-learning-based multimodal matching methods from SIMMC 2.1 (SIMMC 2.1 repository: https://github.com/facebookresearch/simmc2). For the heuristic methods, we incorporate uniform random prediction (Random), empty prediction (No object), and all objects prediction (All objects) as our baselines. For the deep learning approaches (ResNet50-BERT and ResNet50-GPT2), we apply cosine similarity between the features extracted from ResNet-50 He et al. (2016) (we use the pre-extracted visual features provided in the SIMMC 2.1 repository) and two widely used pre-trained LMs, i.e., BERT Devlin et al. (2019) (https://huggingface.co/bert-base-uncased) and GPT2 Radford et al. (2019) (https://huggingface.co/gpt2).

In addition to these baselines, we incorporate several additional baselines: 1) pre-trained CLIP Radford et al. (2021) (we use the checkpoint from https://huggingface.co/openai/clip-vit-base-patch32), which serves as a baseline for the object-dialogue alignment approach, and 2) pre-trained MDETR Kamath et al. (2021) (we use the EfficientNet B5 (ENB5) backbone checkpoint from https://github.com/ashkamath/mdetr), which represents a text-conditioned object detection baseline trained with an explicit alignment between phrases and objects. For CLIP, we report both zero-shot (CLIP (zero-shot)) and direct fine-tuning (CLIP (fine-tuned)) performances, while for MDETR, we only use the zero-shot performance (MDETR (zero-shot)) due to the unavailability of the explicit alignment between objects and dialogues in the dataset.
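The three heuristic baselines in this section are simple enough to sketch directly. The inclusion probability of 0.5 for Random is an assumption (it is in line with the roughly 50% recall reported for Random in Table 2), and the function names are hypothetical.

```python
import random


def predict_no_object(object_ids):
    """Empty prediction: no scene object is selected."""
    return set()


def predict_all_objects(object_ids):
    """All-objects prediction: every scene object is selected."""
    return set(object_ids)


def predict_random(object_ids, rng, p=0.5):
    """Uniform random prediction: keep each object with probability p (assumed 0.5)."""
    return {oid for oid in object_ids if rng.random() < p}


scene_object_ids = [11, 12, 13, 14]
rng = random.Random(0)  # fixed seed so the sketch is reproducible
none_pred = predict_no_object(scene_object_ids)
all_pred = predict_all_objects(scene_object_ids)
random_pred = predict_random(scene_object_ids, rng)
```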

4.3 Models

We propose three different approaches to solve the multimodal object identification task (§3). For the dialogue-contextualized object detection approach, we incorporate one model, namely SitCoM-DETR, which is compared to the MDETR baseline. For the object-dialogue alignment approach, we incorporate two model variants, i.e., CLIPPER (v1) and CLIPPER (v2). For the scene-dialogue alignment approach, we incorporate two model variants, i.e., DETR-BERT and DETR-GPT2.

4.4 Evaluation

Given a label set $L$ and a prediction set $P$ , we define the number of true positives $N^{correct}$ as the number of objects that appear in both the prediction and the label sets. Using this definition, we evaluate the models’ performance on the multimodal object identification task using three evaluation metrics, i.e., recall, precision, and F1-score. Each metric is defined as:

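$$\mathrm{Recall} = \frac{N^{correct}}{\|L\|}$$

$$\mathrm{Precision} = \frac{N^{correct}}{\|P\|}$$

$$\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$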
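As a small worked illustration of these metrics, the sketch below scores one label set against one prediction set. Returning zero when a denominator is empty is an assumed convention, and whether counts are pooled over the whole test set or averaged per sample is not stated in this section, so the function handles a single sample only.

```python
def identification_scores(labels, predictions):
    """Return (recall, precision, f1) for one label set L and prediction set P."""
    labels, predictions = set(labels), set(predictions)
    n_correct = len(labels & predictions)  # N^correct: objects in both sets
    recall = n_correct / len(labels) if labels else 0.0
    precision = n_correct / len(predictions) if predictions else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f1


# Three labelled objects, four predicted objects, two of them correct.
recall, precision, f1 = identification_scores(
    labels={3, 7, 9}, predictions={3, 9, 12, 15}
)
```

Here recall is 2/3, precision is 2/4, and the F1-score is 4/7. The same formula reproduces the All objects row of Table 2: with recall 100.00% and precision 22.34%, F1 is 2 × 0.2234 / 1.2234 ≈ 36.52%.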

4.5 Implementation Details

Dialogue Preprocessing

In all of our experiments, following prior works on end-to-end task-oriented dialogue systems, we encode the last three utterances from the dialogue into a single text. For example, a user dialogue turn $D^{user}_{i}=\{u_{1},s_{1},u_{2},s_{2},\dots,u_{i}\}$ is encoded into the text "U: < $u_{i-1}$ > S: < $s_{i-1}$ > U: < $u_{i}$ >" to be further processed by the dialogue encoder.
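The encoding of the last three utterances can be sketched as follows. The function name is hypothetical, and the handling of the first turn (where no earlier utterances exist) is an assumption, since the text only describes the general case.

```python
def encode_dialogue_turn(user_utterances, system_utterances):
    """Flatten the last three utterances into "U: ... S: ... U: ...".

    user_utterances   : [u_1, ..., u_i]
    system_utterances : [s_1, ..., s_{i-1}]
    Missing earlier utterances are simply skipped (assumption).
    """
    parts = []
    if len(user_utterances) >= 2:
        parts.append(f"U: {user_utterances[-2]}")
    if system_utterances:
        parts.append(f"S: {system_utterances[-1]}")
    parts.append(f"U: {user_utterances[-1]}")
    return " ".join(parts)


encoded = encode_dialogue_turn(
    ["I need a new bed too. Any suggestions?", "What's the rating on that chair?"],
    ["Both of these grey beds are in stock."],
)
```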

Inference strategy for object-dialogue alignment

For the proposed CLIPPER model in the object-dialogue alignment approach, we simply apply sigmoid to the logits and use a threshold value of 0.5 (denoted as Sigmoid), since the model has a built-in capability to perform multi-label classification. The CLIP model, which serves as a baseline, does not have the same capability; hence, we use the mean value of the logits as the threshold (denoted as Mean). Additionally, we evaluate the performance of the model if the top- $k$ objects with the highest logits are considered valid predictions, where $k$ denotes the number of objects in the ground-truth label (denoted as Oracle).
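The three thresholding strategies can be compared on a toy list of logits. The sketch below is illustrative only: the function names are hypothetical, and the treatment of ties (at-or-above for Sigmoid, strictly above for Mean) is an assumption.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_select(logits, threshold=0.5):
    """Sigmoid: keep objects whose sigmoid(logit) reaches the threshold."""
    return [i for i, x in enumerate(logits) if sigmoid(x) >= threshold]


def mean_select(logits):
    """Mean: keep objects whose logit exceeds the mean logit."""
    mean = sum(logits) / len(logits)
    return [i for i, x in enumerate(logits) if x > mean]


def oracle_select(logits, k):
    """Oracle: keep the top-k objects, with k taken from the ground truth."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return sorted(ranked[:k])


logits = [3.0, 1.0, 0.5, -0.5]
by_sigmoid = sigmoid_select(logits)      # [0, 1, 2]
by_mean = mean_select(logits)            # [0], since the mean logit is 1.0
by_oracle = oracle_select(logits, k=2)   # [0, 1]
```

Because Oracle predicts exactly as many objects as there are labels, the prediction and label sets have the same size, which is presumably why the Oracle recall and precision in Table 3 nearly coincide.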

Inference strategy for dialogue-contextualized object detection

For the dialogue-contextualized object detection, since the model is originally for the object detection task, we develop our own inference strategy to allow it to perform multi-label classification for object identification. This is done through several steps: 1) we perform Hungarian matching using all objects, 2) we compute the intersection over union (IoU) of all pairs of matched prediction and ground-truth bounding boxes (we do not consider the class label in the scoring, to have a fairer comparison with the zero-shot MDETR approach), and 3) we take all objects having an IoU score $\geq$ 10% (we align this threshold with MDETR’s class probability setting during inference).
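Steps 2 and 3 of this inference strategy can be sketched as below. The sketch assumes that Hungarian matching has already produced (object id, predicted box, ground-truth box) triples and that boxes are given as corner coordinates; both the box format and the function names are assumptions rather than details from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def select_identified(matched_pairs, threshold=0.10):
    """Keep object ids whose matched prediction overlaps the ground truth enough.

    matched_pairs: (object_id, predicted_box, ground_truth_box) triples,
    assumed to come from a prior Hungarian matching step. Class labels are
    ignored, as described above.
    """
    return [oid for oid, pred, gt in matched_pairs if iou(pred, gt) >= threshold]


pairs = [
    (5, (0, 0, 2, 2), (1, 1, 3, 3)),  # IoU = 1/7, above the 10% threshold
    (8, (0, 0, 1, 1), (5, 5, 6, 6)),  # no overlap
]
identified = select_identified(pairs)
```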

Hyperparameter Details

We use the same training setup for all three approaches. For the dialogue-contextualized object detection (SitCoM-DETR), the object-dialogue alignment (CLIP and CLIPPER), and the scene-dialogue alignment (DETR-BERT and DETR-GPT2), we fine-tune the models for a maximum of 200 epochs with the AdamW optimizer using a linear learning rate decay, a learning rate between [1e-4..1e-5], and an early stopping of 10 epochs.

5 Result and Analysis

5.1 Result Overview

The results of our experiments are shown in Table 2. The best baseline performance is achieved by CLIP (fine-tuned) with a 45.09% F1-score, outperforming the baselines provided by SIMMC 2.1 (i.e., ResNet50-GPT2 and ResNet50-BERT) and showing the superiority of image-text alignment pre-training over separate unimodal pre-training for multimodal object identification. For the dialogue-contextualized object detection methods, the proposed SitCoM-DETR outperforms MDETR (zero-shot). Nevertheless, its performance for multimodal object identification is low despite having an acceptable object detection quality. We conjecture that a better method for adapting an object detection model for multimodal object identification is required, which is also shown by our scene-dialogue alignment approach in §3.4.

For the object-dialogue alignment, our CLIPPER (v1) marginally outperforms the CLIP (fine-tuned) baseline. This shows the effectiveness of modifying the CLIP objective which is explained in more detail in §5.3. For the scene-dialogue alignment (i.e., DETR-BERT and DETR-GPT2), where we combine the object detection and the image-text contrastive objective, we show a significant improvement over CLIP (fine-tuned), which is the highest-performing baseline, by $\sim$ 10-15% F1-score. This suggests the importance of combining object detection representation and image-text contrastive learning to fulfill the need for both visual and spatial matching to solve multimodal object identification.

5.2 Pitfalls of the Best Performing Models

We manually analyze the incorrect predictions made by our scene-dialogue alignment approaches, i.e., DETR-BERT and DETR-GPT2. Based on our analysis in Table 5, our models encounter two main issues. First, our models have difficulties in identifying objects when faced with a sudden object shift in the dialogue, e.g., the sudden shift from beds to a chair in this user dialogue turn U: “I need a new bed too. Any suggestions?”, S: “Both of these grey beds are in stock.”, U: “What’s the rating on that chair?”.

The second issue is the ineffectiveness of handling textual coreferences. For instance, in the user dialogue turn U: “How about a hat, but cheap and in a small?”, S: “I have the black hat third from the front, the white hat at the front, and the black hat between them.”, U: “What’s the brand and reviews for the black hat?”, the models fail to recognize that “the black hat” in the user utterance is anaphoric to either “the black hat third from the front” or “the black hat between them” in the system utterance, which leads to the system’s failure to identify both black hats as $O^{match}$ . This shortcoming also becomes more pronounced if the coreference chains are longer.

These issues show the limitation of pre-trained LMs for discourse understanding and analysis, especially in terms of coreference and entity linking Jurafsky and Martin (2019); Pandia et al. (2021); Koto et al. (2021). Additionally, some other cases require the model to process long-term dialogue history dependency which existing LMs are not able to handle because of the quadratic cost bottleneck of the attention mechanism of the transformer architecture Vaswani et al. (2017). Adapting an efficient attention mechanism with linear complexity might be beneficial to mitigate this problem.

5.3 Impact of Changing CLIP Objective

As shown in Table 3, the CLIPPER models, which are trained with a binary cross-entropy objective, have a built-in capability for multi-label classification through Sigmoid thresholding, which consistently performs better than Mean thresholding. In addition, CLIPPER (v1) outperforms the original CLIP model, which is trained with the cross-entropy loss. These facts suggest that changing the CLIP objective is beneficial for performing multi-label classification tasks such as multimodal object identification.

When using Oracle, we can observe a significant improvement in F1-score, which mainly comes from the improvement in precision with only a minor degradation in recall. This suggests that there is a very sensitive range of logits that contains many negative samples and a few positive samples. To better segregate these few positive samples from the negative ones, hard negative mining techniques such as focal loss Lin et al. (2020) might be beneficial.
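To make the focal loss suggestion concrete, a minimal sketch of the binary focal loss is given below. It is not part of the experiments in this paper; the defaults `gamma=2.0` and `alpha=0.25` follow the commonly cited settings of Lin et al. and are assumptions here, as is the function name.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def binary_focal_loss(logit, target, gamma=2.0, alpha=0.25):
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).

    p_t is the predicted probability of the true class, so confidently
    correct (easy) samples are down-weighted by the (1 - p_t)^gamma factor.
    """
    p = sigmoid(logit)
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))


easy_negative = binary_focal_loss(-4.0, target=0)  # confidently rejected object
hard_negative = binary_focal_loss(2.0, target=0)   # negative scored like a positive
```

The easy negative contributes almost nothing to the loss, while the hard negative dominates it, which is the property that could help separate the few positives from the many negatives in the sensitive logit range.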

6 Discussion

Based on the results and analysis, we show that the scene-dialogue alignment approach is the best-performing approach, achieving $\sim$ 55-60% F1-score in the multimodal object identification task of SIMMC 2.1. We analyze the behavior of the model and conjecture that existing LMs have a limitation in understanding discourse. Additionally, we show the potential benefit of modeling the long-term dependency of dialogue history to further improve the quality of multimodal object identification (§5.2). Lastly, we analyze the limitation of the existing image-text contrastive approaches for multimodal object identification and propose an alternative objective to alleviate this limitation (§5.3).

For future work, we aim to focus on the scene-dialogue alignment methods to further improve the model performance on the multimodal object identification capability. We note five potential points of improvement that can be further explored to improve the model performance in multimodal object identification: 1) the incorporation of cross-object attention in the modality fusion phase to enable a better relative position understanding between objects, 2) the incorporation of linear attention mechanism to handle the long-term dependency of dialogue history, 3) the exploration on better contrastive objectives for multimodal object identification, 4) the exploration on improving discourse understanding for LMs to better handle coreference and sudden object shift, and 5) the synthetic scene-dialogue data augmentation through the utilization of other publicly available object detection datasets to handle the in-domain data scarcity problem.

7 Conclusion

In this paper, we explore three methods to tackle multimodal object identification and evaluate them on SIMMC 2.1. Our best method, scene-dialogue alignment, improves the performance by $\sim$ 20% F1-score compared to the SIMMC 2.1 baselines. We provide an analysis of incorrect predictions by our best approach and the impact of changing the CLIP learning objective. We further provide discussion regarding the limitation of our methods and the potential directions for future works.

Acknowledgement

We appreciate the guidance that Prof. Dan Xu has provided for this research. This work has been supported by the School of Engineering PhD Fellowship Award, the Hong Kong University of Science and Technology and PF20-43679 Hong Kong PhD Fellowship Scheme, Research Grant Council, Hong Kong.

Appendix A MDETR Architecture

We provide Figure 6 for illustrative comparison with our proposed SitCoM-DETR in §3.2.

Bibliography

1. Alamri et al. (2019) Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K Marks, Chiori Hori, Peter Anderson, et al. 2019. Audio visual scene-aware dialog. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7558–7567.
2. Antol et al. (2015) Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433.
3. Brawer et al. (2018) Jake Brawer, Olivier Mangin, Alessandro Roncone, Sarah Widder, and Brian Scassellati. 2018. Situated human–robot collaboration: predicting intent from grounded natural language. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 827–833. IEEE.
4. Carion et al. (2020) Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In Computer Vision – ECCV 2020, pages 213–229, Cham. Springer International Publishing.
5. Dai et al. (2021) Wenliang Dai, Samuel Cahyawijaya, Zihan Liu, and Pascale Fung. 2021. Multimodal end-to-end sparse model for emotion recognition. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5305–5316, Online. Association for Computational Linguistics.
6. Das et al. (2017) Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. 2017. Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 326–335.
7. Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
8. Gkatzia et al. (2015) Dimitra Gkatzia, Verena Rieser, Phil Bartie, and William Mackaness. 2015. From the virtual to the real world: Referring to objects in real-world spatial scenes. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1936–1942, Lisbon, Portugal. Association for Computational Linguistics.