A Framework for Decoding Event-Related Potentials from Text

Shaorong Yan; Aaron Steven White

arXiv:1902.10296·cs.CL·April 3, 2019

A Framework for Decoding Event-Related Potentials from Text

Shaorong Yan, Aaron Steven White

PDF

Open Access

TL;DR

This paper introduces a new framework combining convolutional decoders and language models to predict ERPs during reading, revealing that hybrid models outperform individual approaches in reconstructing neural responses.

Contribution

The paper presents a novel framework for decoding ERPs from text using combined neural and language models, advancing understanding of neural language processing.

Findings

01

Hybrid models outperform individual models in ERP reconstruction

02

Contextual embeddings underperform surprisal-based models alone

03

Combining models yields superior decoding accuracy

Abstract

We propose a novel framework for modeling event-related potentials (ERPs) collected during reading that couples pre-trained convolutional decoders with a language model. Using this framework, we compare the abilities of a variety of existing and novel sentence processing models to reconstruct ERPs. We find that modern contextual word embeddings underperform surprisal-based models but that, combined, the two outperform either on its own.

Tables3

Table 1. Table 1: Mean MSE and R 2 superscript 𝑅 2 R^{2} (in parentheses).

\hlineB4 Model

No Intercept

Intercept

\hlineB4 alpha

49.9 ​ (0.532)

49.7 ​ (0.533)

beta

33.5 ​ (0.686)

32.7 ​ (0.692)

\hlineB4

Table 2. Table 2: Proportion variance explained by each model ( × \times 100) and confidence interval across folds computed by a nonparametric bootstrap. F = frequency, S(urp) = surprisal, S(em)D(is) = semantic distance.

\hlineB4 Model	$R_{mod}^{2}$	$95$ % CI
\hlineB4 Frequency	$19.5$	$[18.5, 20.7]$
F + Surp	$37.4$	$[36.5, 38.3]$
F + SemDis	$36.1$	$[32.3, 38.4]$
F + GLoVE	$35.0$	$[31.8, 38.2]$
F + ELMo	$35.2$	$[34.3, 36.2]$
F + S + SD	$46.6$	$[43.5, 49.7]$
F + S + SD + GloVe	$47.1$	$[43.2, 49.4]$
F + S + SD + ELMo	$49.5$	$[48.9, 50.1]$
\hlineB4

Table 3. Table 3: Model estimates and t statistics from mixed-effects model. : ∗ ∗ p < 0.01 {}^{**}:p<0.01 ; : ∗ p < 0.05 {}^{*}:p<0.05 ; : + p < 0.1 {}^{+}:p<0.1

\hlineB4 Predictor	$\hat{β}$	t
\hlineB4 Intercept	$- 0.0013$	$- 0.225$
Word Type (Content)	$0.0030$	$2.01$	^∗
Frequency	$0.0110$	$21.2$	^∗∗
Surprisal	$0.0050$	$13.0$	^∗∗
Semantic Distance	$0.0040$	$11.60$	^∗∗
GloVe Embeddings	$0.0040$	$10.3$	^∗∗
ELMo Embeddings	$0.0040$	$10.1$	^∗∗
Freq : Word Type	$- 0.0010$	$- 2.43$	^∗
Surp : Word Type	$0.0001$	$0.24$
SemDis : Word Type	$- 0.0003$	$- 0.70$
GloVe : Word Type	$0.0002$	$0.55$
ELMo : Word Type	$0.0007$	$- 1.85$	⁺
\hlineB4

Equations2

R_{mod}^{2} = 1 - \frac{MSE _{model} - MSE _{autoencoder}}{MSE _{intercept} - MSE _{autoencoder}}

R_{mod}^{2} = 1 - \frac{MSE _{model} - MSE _{autoencoder}}{MSE _{intercept} - MSE _{autoencoder}}

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsTopic Modeling · Neurobiology of Language and Bilingualism · Natural Language Processing Techniques

Full text

A Framework for Decoding Event-Related Potentials from Text

Shaorong Yan

Department of Brain and Cognitive Sciences

University of Rochester

Rochester, NY 14627, USA

[email protected]

&Aaron Steven White

Deparment of Linguistics

University of Rochester

Rochester, NY 14627, USA

[email protected]

Abstract

We propose a novel framework for modeling event-related potentials (ERPs) collected during reading that couples pre-trained convolutional decoders with a language model. Using this framework, we compare the abilities of a variety of existing and novel sentence processing models to reconstruct ERPs. We find that modern contextual word embeddings underperform surprisal-based models but that, combined, the two outperform either on its own.

1 Introduction

Understanding the mechanisms by which comprehenders incrementally process linguistic input in real time has been a key endeavor of cognitive scientists and psycholinguists. Due to its fine time resolution, event-related potentials (ERPs) are an effective tool in probing the rapid, online cognitive processes underlying language comprehension. Traditionally, ERP research has focused on how the properties of the language input affect different ERP components (see Van Petten and Luka, 2012; Kuperberg, 2016, for reviews).111Examples of such components include the N1/P2 (Sereno et al., 1998; Dambacher et al., 2006); N250 (Grainger et al., 2006); N400 (Kutas and Hillyard, 1980; Hagoort et al., 2004; Lau et al., 2008); and P600 (Osterhout and Holcomb, 1992; Kuperberg et al., 2003; Kim and Osterhout, 2005)

While this approach has been fruitful, researchers have also long been aware of the potential drawbacks to this component-centric approach: a predictor’s effects can be too transient to detect when averaging ERP amplitudes over a wide time window—as is typical in component-based approaches (see Hauk et al., 2006, for discussion). Different predictors can affect ERP in the same time window as an established component but have slightly different temporal (Frank and Willems, 2017) or spatial (DeLong et al., 2005) profiles. This means that the definition of a component strongly affects interpretation.

There are two typical approaches to resolving these issues. The first is to plot the data and use visual inspection to select an analysis plan, introducing uncontrollable researcher degrees of freedom (Gelman and Loken, 2014). Another approach is to run separate models for each time point (or even each electrode) to look for the emergence of an effect. This necessitates complex statistical tests to monitor for inflated Type I error (see, e.g., Blair and Karniski, 1993; Laszlo and Federmeier, 2014, for discussion) and to control for autocorrelation across time points (Smith and Kutas, 2015a, b).

We explore an alternative approach to the analysis of ERP data in language studies that substantially reduces such researcher degrees of freedom: directly decoding the raw electroencephalography (EEG) measurements by which ERPs are collected. Inspired by multimodal tasks like image captioning (see Hossain et al., 2019, for a review) and visual question answering (Antol et al., 2015), we propose to model EEG using standard convolutional neural networks (CNNs) pre-trained under an autoencoding objective. The decoder CNN can then be decoupled from its encoder and recoupled with any language processing model, thus enabling explicit quantitative comparison of such models. We demonstrate the efficacy of this framework by using it to compare existing sentence processing models based on surprisal and/or static word embeddings with novel models based on contextual word embeddings. We find that surprisal-based models actually outperform contextual word embeddings on their own, but when combined, the two outperform either model alone.

2 Models

All of the models we present have two components: (i) a pre-trained CNN for decoding raw EEG measurements time-locked to each word in a sentence; and (ii) a language model from which features can be extracted for each word—e.g. the surprisal of that word given previous words or its contextual word embedding. An example model structure using ELMo embeddings (Peters et al., 2018) is illustrated in Figure 1.

Convolutional decoder

For all models, we use a convolutional decoder pre-trained as a component of an autoencoder. To reduce researcher degrees of freedom, the decoder architecture is selected from a set of possible architectures by cross-validation of the containing autoencoder.

The autoencoder consists of two parts: (a) a convolutional encoder that finds a way to best compress the ERP signals; and (b) a convolutional decoder with a homomorphic architecture that reconstructs the ERP data from the compressed representation. ERPs were organized into a 2D matrix (channel $\times$ time points). For the encoder, we pass the ERPs through multiple interleaved 1D convolutional and max pooling layers with receptive fields along the time dimension, shrinking the number of latent channels at each step. Correspondingly, for the decoder part, we use a homomorphic series of 1D transposed convolutional layers to reconstruct the ERP data.

At train time, the decoder weights are frozen, and the encoder is replaced by one of the language models described below. This entails fitting an interface mapping—a linear transformation for each channel produced by the encoder—from the features extracted from the language model into the representation space output by the encoder.

Language models

We consider a variety of features that can be extracted from a language model.

Surprisal

We use the lexical surprisal $-\log p(w_{i}\;|\;w_{1},\ldots,w_{i-1})$ obtained from a RNN trained by Frank et al. (2015).

Semantic distance

Following Frank and Willems (2017), we point-wise average the GloVe embedding (Pennington et al., 2014) of each word prior to a particular word to obtain a context embedding and then calculate the cosine distance between the context embedding and the word embedding for that word. We use the GloVe embeddings trained on Wikipedia 2014 and Gigaword 5 (6B tokens, 400K vocabulary size).

Static word embeddings

We also consider the GloVe embedding dimensions as features. We do not tune the GloVe embeddings using an additional recurrent neural network (RNN), instead just passing the them through a multi-layer perceptron with one hidden layer of tanh nonlinearities. The idea here is that the GloVe-only model tells us how much the distributional properties of a word, outside of the current context, contribute to ERPs.

Contextual word embeddings

We consider contextual word embeddings generated from ELMo (Peters et al., 2018) using the allennlp package Gardner et al. (2017). ELMo produces contextual word embeddings using a combination of character-level CNNs and bidirectional RNNs trained against a language modeling objective, and thus it is a useful contrast to GloVe, since it captures not only a word’s distributional properties, but how they interact with the current context.

We take all three layers of the hidden layer output in the ELMo model and concatenate them. To ensure a fair comparison with the surprisal- and GloVe-based models, we use the same tuning procedure employed for the static word embeddings. Further, because sentences are presented incrementally in ERP experiments and because ELMo is bidirectional and thus later words in the sentence will affect the word embeddings of previous words, we do not obtain an embedding for a particular word on the basis of the entire sentence, instead using only the portion of the sentence up to and including that word to obtain its embedding.

Combined models

We also consider models that combine either static or contextual word embedding features with frequency, surprisal, and semantic distance. The latter features were concatenated onto the tuned word embeddings before being passed to the interface mapping.

3 Experiments

We use the EEG recordings collected and modeled by Frank and Willems (2017). In their study, 24 subjects read sentences drawn from natural text. Sentences were presented word-by-word using a rapid serial visual presentation paradigm. We use the ERPs of each word epoched from -100 to 700ms and time-locked to word onset from all the 32 recorded scalp channels. After artifact rejection (provided by Frank and Willems with the data), this dataset contains 41,009 training instances.

Pre-training

To select which decoder to use, we compare the performance of two CNN architectures motivated by well-known properties of EEG. The first architecture has $5$ latent channels and $9$ time steps. Given the sampling rate and size of the input ( $250$ Hz, $200$ time steps), this roughly corresponds to filtering the EEG data with alpha band frequency ( $\sim 10$ Hz). The other has $10$ latent channels and $20$ time steps, thus lying within the range of beta band activity ( $\sim 25$ Hz). In addition to these two architectures, we also examine whether including subject- and electrode-specific random intercepts improves model performance.

We conduct a 5-fold cross-validation for each architecture to find the one that has the best performance in reconstructing ERP data. As shown in Table 1, the beta models perform better overall than alpha models, since they likely capture both alpha and beta band activities. Adding subject-specific intercept, on the other hand, did not greatly improve the model performance.

Figure 2 shows the reconstructed ERPs of the beta model on one trial. The autoencoder can reconstruct the ERP signal very well. The selected channels are illustrative of the reconstruction accuracy across all channels. We thus selected the beta model without subject-specific intercept as the decoder for our consequent models.

Training

The interface mapping and (where applicable) word embedding tuner are trained under an MSE loss using mini-batch gradient descent (batch size = 128) with the Adam optimizer (learning rate=0.001 and default settings for beta1, beta2, and epsilon) implemented in pytorch Paszke et al. (2017). Each model is trained for 200 epochs. Since we need at least one preceding word to compute contextual word embeddings, we do not include the first word of the sentence. This left ERPs for 1,618 word tokens per subject (638 word types). After excluding trials containing artifacts, a total of 37,112 training instances remain.

Development

To avoid overfitting, we use early stopping and report the models with the best performance on the development set. We did a parameter search over three different weight decays: 1e-5, 1e-3, 1e-1. For each model, we chose the weight decay that produced the best mean performance on held-out data in a 5-fold cross-validation.

Baselines

As a baseline we train an intercept-only model that passes a constant input (optimized to best predict the data) to the decoder. In addition, we fit a baseline model that only has word frequency as a feature. Frequency is also included as an additional feature in all models.

Metrics

To account for the fact that our model performance is bounded by the performance of the autoencoder, we report a modified form of $R^{2}$ to evaluate the overall model performance.

[TABLE]

4 Results

Table 2 shows the $R^{2}_{\text{mod}}$ metric for each model. We see that both surprisal and semantic distance outperform both types of word embedding features, all of which outperform frequency alone. When combined, surprisal and semantic distance outperform either alone, and further gains can be made with the addition of either static (GloVe) or contextual (ELMo) embedding features. The addition of contextual embedding features increases performance more than the addition of static word embedding features, such that there is some benefit to capturing context over and above that provided by surprisal and semantic distance.

Time course analysis

To understand where in time each predictor improved model performance, we examine the increase in correlation over the intercept model at each time point (Figure 3). There are roughly three regions where the language models outperform the intercept model. The first is right after 100ms post word onset: corresponding to the N1 component, which is typically considered to reflect perceptual processing; the second is between 200 and 350ms: corresponding to the N250 component, which correlates with lexical access Grainger et al. (2006); Laszlo and Federmeier (2014); and the third is between 300ms and 500ms: corresponding to the N400, which is typically associated with semantic processing.

Consistent with previous findings (Hauk et al., 2006; Laszlo and Federmeier, 2014; Yan and Jaeger, 2019), adding frequency into the model improved model performance in all three time windows. Also consistent with the literature, adding surprisal and semantic distance improved model performance in the N400 time window (Frank and Willems, 2017; Yan and Jaeger, 2019).

Models with word embeddings do not differ much from the models containing only frequency, surprisal, and semantic distance, with the biggest difference around 300ms post word onset. This might indicate an effect in the early N400 time window. This could also indicate that processes commonly associated with the N250 may be better captured by the models containing word embeddings. If so, it is less expected and potentially interesting, since most of our models have no access to perceptual properties of the input—with the possible exception of ELMo, whose charCNN may capture orthographic regularities. These effects could reflect our models’ ability to capture top-down lexical processing (see, e.g., Penolazzi et al., 2007; Yan and Jaeger, 2019) or possibly systematic correlations between higher-level features and perceptual features.

Part-of-speech analysis

Prior work on ERP during reading distinguishes function word—such as determiners, conjunctions, pronouns, prepositions, numerals, particles—from content words—such as (proper) nouns, verbs, adjectives, adverbs (Nobre and McCarthy, 1994; Frank et al., 2015). As such, we also examine whether each model’s performance differs for content words and function words. We calculate the Pearson’s correlation between the predicted and actual ERPs for each word of each model and used linear mixed-effects model to examine the influence on model fit with the inclusion of different information. If a model included a specific type of information, the corresponding predictor is coded as 1, otherwise it was coded as -1. For example, the surprisal model was trained with surprisal but not semantic distance, so the surprisal predictor is 1 for this model and the semantic distance predictor is -1. We further included the interaction between models and word types (function=-1, content=1).

Table 4 shows the resulting coefficients. Overall, models display better performance for content words than for function words ( $\hat{\beta}=0.003\text{, t}=2.01\text{, {p}}<0.05$ ), consistent with previous findings (Frank et al., 2015). Including each type of information also significantly increased model fit ( $\text{ts}>10.1\text{, {p}}<0.01$ ). There was a significant interaction between frequency and word type ( $\hat{\beta}=-0.001\text{, t}=-2.43\text{, {p}}<0.02$ ): including frequency increased model performance for function words more than for content words. There was also a marginally significant interaction between ELMo and word type ( $\hat{\beta}=0.0007\text{, t}=-1.85\text{, {p}}<0.064$ ), suggesting that including ELMo embeddings increased model performance for content words more than for function words.

Bibliography42

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Acharya et al. (2018) U Rajendra Acharya, Shu Lih Oh, Yuki Hagiwara, Jen Hong Tan, and Hojjat Adeli. 2018. Deep convolutional neural network for the automated detection and diagnosis of seizure using EEG signals. Computers in biology and medicine , 100:270–278.
2Antol et al. (2015) Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question Answering. In Proceedings of the IEEE international conference on computer vision , pages 2425–2433.
3Biemann et al. (2015) Chris Biemann, Steffen Remus, and Markus J Hofmann. 2015. Predicting word ‘predictability’ in cloze completion, electroencephalographic and eye movement data. In Proceedings of natural language processing and cognitive science , pages 83–93. Libreria Editrice Cafoscarina.
4Blair and Karniski (1993) R Clifford Blair and Walt Karniski. 1993. An alternative method for significance testing of waveform difference potentials. Psychophysiology , 30(5):518–524.
5Broderick et al. (2018) Michael P Broderick, Andrew J Anderson, Giovanni M Di Liberto, Michael J Crosse, and Edmund C Lalor. 2018. Electrophysiological correlates of semantic dissimilarity reflect the comprehension of natural, narrative speech. Current Biology , 28(5):803–809.
6Brouwer et al. (2017) Harm Brouwer, Matthew W Crocker, Noortje J Venhuizen, and John CJ Hoeks. 2017. A neurocomputational model of the N 400 and the P 600 in language processing. Cognitive Science , 41:1318–1352.
7Dambacher et al. (2006) Michael Dambacher, Reinhold Kliegl, Markus Hofmann, and Arthur M Jacobs. 2006. Frequency and predictability effects on event-related potentials during reading. Brain Research , 1084(1):89–103.
8Delaney-Busch et al. (2019) Nathaniel Delaney-Busch, Emily Morgan, Ellen Lau, and Gina R Kuperberg. 2019. Neural evidence for bayesian trial-by-trial adaptation on the N 400 during semantic priming. Cognition , 187:10–20.