Lock-Free Parallel Perceptron for Graph-based Dependency Parsing

Xu Sun; Shuming Ma

arXiv:1703.00782·cs.CL·March 3, 2017

TL;DR

This paper introduces a lock-free parallel perceptron algorithm for graph-based dependency parsing, significantly accelerating training on multi-core systems without sacrificing accuracy.

Contribution

It presents a novel parallel perceptron algorithm that reduces training time for dependency parsing by leveraging multi-core architectures.

Findings

1. 8-fold faster training with 10 threads
2. No loss in parsing accuracy
3. Effective utilization of multi-core systems

Abstract

Dependency parsing is an important NLP task, and the structured perceptron is a popular approach to it. However, graph-based dependency parsing has a time complexity of $O(n^{3})$ and suffers from slow training. To address this problem, we propose a parallel algorithm called the parallel perceptron, which makes full use of a multi-core computer and saves substantial training time. In our experiments, dependency parsing with the parallel perceptron trains 8 times faster than the traditional structured perceptron when using 10 threads, with no loss in accuracy.

Tables

Table 1: Accuracy of baselines and our method.

Models	1st-order	2nd-order
MST Parser	91.60	92.30
Locked Para-Perc	91.68	92.55
Lock-free Para-Perc 5-thread	91.70	92.55
Lock-free Para-Perc 10-thread	91.72	92.53

Table 2: Speed-up and time cost per pass of our algorithm.

Models	1st-order	2nd-order
Structured Perc	1.0x(449s)	1.0x(3044s)
Locked Para-Perc	5.1x(88s)	5.0x(609s)
Lock-free Para-Perc 5-thr.	4.3x(105s)	4.5x(672s)
Lock-free Para-Perc 10-thr.	8.1x(55.4s)	8.3x(367s)

Equations

$$s(x,y)=\Phi(x,y)\cdot\alpha$$

$$s(i,j)=\alpha\cdot f(i,j)$$

$$s(x,y)=\sum_{(i,j)\in y}s(i,j)=\sum_{(i,j)\in y}\alpha\cdot f(i,j)$$

$$\Phi(x,y)=\sum_{(i,j)\in y}f(i,j)$$

$$\forall z\in\overline{GEN(x)},\quad U\cdot\Phi(x,y)-U\cdot\Phi(x,z)\geq\delta$$

$$y_{j}^{\prime}=\operatorname*{argmax}_{z\in GEN(x)}\Phi_{j}(x,z)\cdot\alpha$$

$$\alpha^{i+1}=\alpha^{i}+\Phi_{j}(x,y)-\Phi_{j}(x,y_{j}^{\prime})$$

$$\alpha^{t+1}=\alpha^{t}+\sum_{j=1}^{k}\left(\Phi_{j}(x,y)-\Phi_{j}(x,y_{j}^{\prime})\right)$$

$$U\cdot\alpha^{t+1}=U\cdot\alpha^{t}+\sum_{j=1}^{k}U\cdot\left(\Phi_{j}(x,y)-\Phi_{j}(x,y_{j}^{\prime})\right)\geq U\cdot\alpha^{t}+k\delta$$

$$\lVert\alpha^{t+1}\rVert\geq tk\delta$$

$$\lVert\alpha^{t+1}\rVert^{2}=\lVert\alpha^{t}\rVert^{2}+\Big\lVert\sum_{j=1}^{k}\left(\Phi_{j}(x,y)-\Phi_{j}(x,y_{j}^{\prime})\right)\Big\rVert^{2}+2\alpha^{t}\cdot\Big(\sum_{j=1}^{k}\left(\Phi_{j}(x,y)-\Phi_{j}(x,y_{j}^{\prime})\right)\Big)\leq\lVert\alpha^{t}\rVert^{2}+k^{2}R^{2}$$

$$\lVert\alpha^{t+1}\rVert^{2}\leq tk^{2}R^{2}$$

$$t^{2}k^{2}\delta^{2}\leq\lVert\alpha^{t+1}\rVert^{2}\leq tk^{2}R^{2}$$

$$t\leq R^{2}/\delta^{2}$$

$$t\leq R^{2}/(k\delta^{2})$$

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning in Bioinformatics

Full text

Lock-Free Parallel Perceptron for Graph-based Dependency Parsing

Xu Sun

Shuming Ma

MOE Key Laboratory of Computational Linguistics, Peking University

School of Electronics Engineering and Computer Science, Peking University

{xusun, shumingma}@pku.edu.cn

1 Introduction

Dependency parsing is an important task in natural language processing. It assigns a head to each word in a sentence, and the resulting head-child pairs form a directed graph (a dependency tree). Previous work has proposed various models for this problem (Bohnet, 2010; McDonald and Pereira, 2006).

The structured perceptron is one of the most popular approaches to graph-based dependency parsing. It was first proposed by Collins (2002), and McDonald et al. (2005) first applied it to dependency parsing. The model of McDonald et al. is decoded with an efficient algorithm proposed by Eisner (1996), and they trained it with the structured perceptron as well as its variant, the Margin Infused Relaxed Algorithm (MIRA) (Crammer and Singer, 2002; Taskar et al., 2004). Their results show that MIRA and the structured perceptron are effective algorithms for graph-based dependency parsing. McDonald and Pereira (2006) extended the model to second order, while Koo and Collins (2010) developed a third-order model. All of these used perceptron-style methods to learn the parameters.

Recently, many models have applied deep learning to dependency parsing. Titov and Henderson (2007) first proposed a neural network model for transition-based dependency parsing. Chen and Manning (2014) improved the performance of neural network dependency parsing, while Le and Zuidema (2014) improved the parser with the Inside-Outside Recursive Neural Network. However, these deep learning methods are very slow to train (Sun, 2016).

To address these issues, we aim to build a simple and very fast dependency parser that at the same time achieves state-of-the-art accuracy. To this end, we propose a lock-free parallel algorithm, the lock-free parallel perceptron, and use it to train the parameters for dependency parsing. Although many studies have used the perceptron for dependency parsing, few have tried lock-free parallel algorithms. McDonald et al. (2010) proposed a distributed perceptron algorithm, but that parallel method is not lock-free on shared-memory systems. To the best of our knowledge, our proposal is the first lock-free parallel version of perceptron learning.

Our contributions are as follows:

• The proposed method trains 8 times faster than the baseline system when using 10 threads, without additional memory cost.

• We provide a theoretical analysis of the parallel perceptron and show that it converges even in the worst case of full delay. The analysis applies to the general lock-free parallel perceptron and is not limited to the specific task of dependency parsing.

2 Lock-Free Parallel Perceptron for Dependency Parsing

The dataset is denoted as $\{(x_{i},y_{i})\}_{i=1}^{n}$, where $x_{i}$ is an input and $y_{i}$ is the correct output. $GEN$ is a function that enumerates a set of candidates $GEN(x)$ for an input $x$. $\Phi(x,y)$ is the feature vector corresponding to the input-output pair $(x,y)$. Finally, the parameter vector is denoted as $\alpha$.

In the structured perceptron, the score of an input-output pair is calculated as follows:

$$s(x,y)=\Phi(x,y)\cdot\alpha$$

The structured perceptron outputs the structure $y^{\prime}$ with the highest score in the candidate set $GEN(x)$.

In dependency parsing, the input $x$ is a sentence and the output $y$ is a dependency tree. An edge is denoted as $(i,j)$, with head $i$ and child $j$. Each edge has a feature representation $f(i,j)$, and the score of an edge can be written as follows:

$$s(i,j)=\alpha\cdot f(i,j)$$

Since the dependency tree is composed of edges, the score of a tree is as follows:

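$$s(x,y)=\sum_{(i,j)\in y}s(i,j)=\sum_{(i,j)\in y}\alpha\cdot f(i,j)$$

$$\Phi(x,y)=\sum_{(i,j)\in y}f(i,j)$$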
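As a minimal sketch of this edge-factored scoring, the snippet below checks numerically that summing the edge scores $\alpha\cdot f(i,j)$ gives the same value as $\alpha\cdot\Phi(x,y)$ with $\Phi(x,y)=\sum_{(i,j)\in y}f(i,j)$. The feature function `edge_features` and the head-array encoding of the tree are hypothetical stand-ins chosen for illustration; they are not the features used in the paper.

```python
import numpy as np

DIM = 8  # size of the toy feature space (an assumption for this sketch)

def edge_features(head: int, child: int) -> np.ndarray:
    """Hypothetical stand-in for f(i, j): a small deterministic feature vector."""
    f = np.zeros(DIM)
    f[(head * 3 + child) % DIM] = 1.0      # toy "head-child identity" feature
    f[abs(head - child) % DIM] += 0.5      # toy "distance" feature
    return f

def tree_features(heads) -> np.ndarray:
    """Phi(x, y): the sum of f(i, j) over all edges (i, j) of the tree.

    heads[j] is the head of word j; index 0 is an artificial root, so heads[0] is ignored.
    """
    phi = np.zeros(DIM)
    for child in range(1, len(heads)):
        phi += edge_features(heads[child], child)
    return phi

def tree_score(alpha: np.ndarray, heads) -> float:
    """s(x, y) computed edge by edge as the sum of alpha . f(i, j)."""
    return sum(float(alpha @ edge_features(heads[c], c)) for c in range(1, len(heads)))

alpha = np.linspace(-1.0, 1.0, DIM)
heads = [0, 2, 0, 2]  # word 2 attaches to the root; words 1 and 3 attach to word 2
print(tree_score(alpha, heads), float(alpha @ tree_features(heads)))
```

Both printed values agree, which is what lets a decoder work with per-edge scores while the perceptron update works with whole-tree feature vectors.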

The proposed lock-free parallel perceptron is a variant of the structured perceptron (Sun et al., 2009, 2013; Sun, 2015). We parallelize the decoding of several examples and update the parameter vector on a shared-memory system. In each step, each thread finds the dependency tree $y^{\prime}$ with the highest score and then updates the parameter vector immediately, without locking the shared memory. In a typical parallel learning setting, by contrast, the shared memory is locked so that no other thread can modify the model parameters while one thread is computing its update term. Because the proposed method removes this lock, learning can be fully parallelized. This differs substantially from the setting of McDonald et al. (2010), which is not lock-free parallel learning.
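The update pattern described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: decoding is replaced by a toy multiclass argmax (so $GEN(x)$ is just a set of class labels), the data are synthetic and well separated, and all function names are hypothetical. Note also that under CPython's global interpreter lock the threads do not run truly in parallel, so the sketch shows the lock-free read-and-update pattern rather than the speed-up.

```python
import threading
import numpy as np

def make_toy_data(n_per_class=30, n_classes=3, seed=0):
    """Synthetic, well-separated data: class c is the unit vector e_c plus small noise."""
    rng = np.random.default_rng(seed)
    xs, ys = [], []
    for c in range(n_classes):
        for _ in range(n_per_class):
            x = rng.uniform(0.0, 0.1, size=n_classes)
            x[c] += 1.0
            xs.append(x)
            ys.append(c)
    order = rng.permutation(len(xs))
    return np.array(xs)[order], np.array(ys)[order]

def decode(W, x):
    """Toy stand-in for decoding: argmax over GEN(x), here the set of class labels."""
    return int(np.argmax(W @ x))

def worker(W, xs, ys):
    """One thread: decode each example and update the shared weights with no lock."""
    for x, y in zip(xs, ys):
        y_pred = decode(W, x)      # may read weights that another thread is modifying
        if y_pred != y:
            W[y] += x              # add the features of the correct output
            W[y_pred] -= x         # subtract the features of the predicted output

def train_lock_free(xs, ys, n_classes, n_threads=4, max_epochs=100):
    W = np.zeros((n_classes, xs.shape[1]))  # shared parameter vector
    for epoch in range(1, max_epochs + 1):
        shards = zip(np.array_split(xs, n_threads), np.array_split(ys, n_threads))
        threads = [threading.Thread(target=worker, args=(W, sx, sy)) for sx, sy in shards]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        mistakes = int(sum(decode(W, x) != y for x, y in zip(xs, ys)))
        if mistakes == 0:
            break
    return W, epoch, mistakes

xs, ys = make_toy_data()
W, epochs, mistakes = train_lock_free(xs, ys, n_classes=3)
print(f"epochs={epochs}, training mistakes={mistakes}")
```

A locked variant would wrap the body of the loop in `worker` in a `threading.Lock`, which serializes decoding and updating and is exactly the cost the lock-free setting avoids.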

3 Convergence Analysis of Lock-Free Parallel Perceptron

For lock-free parallel learning, it is very important to analyze the convergence properties, because in most cases lock-free learning leads to divergence of the training (i.e., the training fails). Here, we prove that the lock-free parallel perceptron converges even under the worst-case assumption. The challenge is that several threads may update and overwrite the parameter vector at the same time, so convergence has to be proved explicitly.

We follow the definitions in Collins (2002). We write $\overline{GEN(x)}$ for the set of all incorrect candidates generated for input $x$. A training example is separable with margin $\delta>0$ if there exists $U$ with $\lVert U\rVert=1$ such that

$$\forall z\in\overline{GEN(x)},\quad U\cdot\Phi(x,y)-U\cdot\Phi(x,z)\geq\delta$$

Since multiple threads run at the same time in lock-free parallel perceptron training, the convergence speed is closely related to the delay of updates. Lock-free learning incurs update delay: an update term may be applied to an “old” parameter vector, because the vector may already have been modified by other threads without the current thread knowing. Our analysis shows that perceptron learning still converges, even in the worst case where all $k$ threads are delayed. To our knowledge, this is the first convergence analysis for lock-free parallel learning of perceptrons.

We first analyze convergence in the worst case (full delay of updates). Then, we analyze convergence in the optimal case (minimal delay). In the experiments, we show that real-world behavior is close to the optimal case of minimal delay.

3.1 Worst Case Convergence

Suppose we have $k$ threads and use $j$ to index the $j$-th thread. Each thread decodes and updates the parameter vector as follows:

$$y_{j}^{\prime}=\operatorname*{argmax}_{z\in GEN(x)}\Phi_{j}(x,z)\cdot\alpha$$

Recall that the update is as follows:

$$\alpha^{i+1}=\alpha^{i}+\Phi_{j}(x,y)-\Phi_{j}(x,y_{j}^{\prime})$$

Here, $y_{j}^{\prime}$ and $\Phi_{j}(x,y)$ both correspond to the $j$-th thread, while $\alpha^{i}$ is the parameter vector after the $i$-th time step.

Since we adopt the lock-free parallel setting, we suppose that $k$ perceptron updates happen in parallel in each time step. Then, after one time step, the overall parameters are updated as follows:

$$\alpha^{t+1}=\alpha^{t}+\sum_{j=1}^{k}\left(\Phi_{j}(x,y)-\Phi_{j}(x,y_{j}^{\prime})\right)$$
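To make the full-delay setting concrete, the sketch below simulates one time step in which $k$ updates are all computed from the same old parameter vector $\alpha^{t}$ and then summed, i.e. $\alpha^{t+1}=\alpha^{t}+\sum_{j=1}^{k}(\Phi_{j}(x,y)-\Phi_{j}(x,y_{j}^{\prime}))$. It uses a toy multiclass problem in place of dependency trees, and the helper names are hypothetical.

```python
import numpy as np

def perceptron_delta(W, x, y):
    """Phi(x, y) - Phi(x, y') for a toy multiclass problem; zero if the prediction is correct."""
    delta = np.zeros_like(W)
    y_pred = int(np.argmax(W @ x))
    if y_pred != y:
        delta[y] += x
        delta[y_pred] -= x
    return delta

def full_delay_step(W, batch):
    """One time step with fully delayed threads: every update term is computed
    from the same old W, and the terms are then added together."""
    deltas = [perceptron_delta(W, x, y) for x, y in batch]  # all "threads" read alpha^t
    return W + sum(deltas)

W0 = np.zeros((3, 3))
batch = [
    (np.array([0.0, 1.0, 0.0]), 1),  # example handled by thread 1
    (np.array([0.0, 0.0, 1.0]), 2),  # example handled by thread 2
]
W1 = full_delay_step(W0, batch)
print(W1)
```

Both simulated threads mispredict class 0 from the all-zero weights, so class 0 is penalized twice in the same step; this double counting is the effect of delay that the worst-case analysis has to absorb.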

Taking the inner product with $U$ on both sides gives:

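$$U\cdot\alpha^{t+1}=U\cdot\alpha^{t}+\sum_{j=1}^{k}U\cdot\left(\Phi_{j}(x,y)-\Phi_{j}(x,y_{j}^{\prime})\right)\geq U\cdot\alpha^{t}+k\delta$$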

where $\delta$ is the separation margin of the data, following the definition of Collins (2002). Since the initial parameter vector is $\alpha=0$, we have $U\cdot\alpha^{t+1}\geq tk\delta$ after $t$ time steps. Because $U\cdot\alpha^{t+1}\leq\lVert U\rVert\lVert\alpha^{t+1}\rVert$ and $\lVert U\rVert=1$, we can see that

$$\lVert\alpha^{t+1}\rVert\geq tk\delta$$

On the other hand, $\lVert\alpha^{t+1}\rVert^{2}$ can be written as:

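$$\lVert\alpha^{t+1}\rVert^{2}=\lVert\alpha^{t}\rVert^{2}+\Big\lVert\sum_{j=1}^{k}\left(\Phi_{j}(x,y)-\Phi_{j}(x,y_{j}^{\prime})\right)\Big\rVert^{2}+2\alpha^{t}\cdot\Big(\sum_{j=1}^{k}\left(\Phi_{j}(x,y)-\Phi_{j}(x,y_{j}^{\prime})\right)\Big)\leq\lVert\alpha^{t}\rVert^{2}+k^{2}R^{2}$$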

where $R$ is defined following Collins (2002) such that $\lVert\Phi(x,y)-\Phi(x,y_{j}^{\prime})\rVert\leq R$. The last inequality relies on a property of the perceptron update: whenever an update happens, the incorrect structure scores at least as high as the correct one (the searched incorrect structure has the highest score), so the cross term is non-positive. In addition, by the triangle inequality, the norm of the sum of $k$ update terms is at most $kR$. Applying this bound over $t$ time steps gives:

$$\lVert\alpha^{t+1}\rVert^{2}\leq tk^{2}R^{2}$$

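Combining Eq.10 and Eq.9, i.e., the upper bound on $\lVert\alpha^{t+1}\rVert^{2}$ and the lower bound $\lVert\alpha^{t+1}\rVert\geq tk\delta$, we have: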

$$t^{2}k^{2}\delta^{2}\leq\lVert\alpha^{t+1}\rVert^{2}\leq tk^{2}R^{2}$$

Hence, we have:

$$t\leq R^{2}/\delta^{2}$$

This proves that the lock-free parallel perceptron converges within a bounded number of time steps even in the worst case of full delay, with the number of time steps bounded by $t\leq R^{2}/\delta^{2}$. The worst case means that the update is extremely delayed, such that all $k$ threads update based on the same old parameter vector.
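For readers who want the two closing steps spelled out, the following worked derivation uses only the quantities defined above ($k$, $R$, $\delta$, and $\alpha^{1}=0$); it adds no new assumptions beyond $t>0$.

```latex
% Step 1: unroll the per-step bound over t time steps, starting from \alpha^{1} = 0.
\lVert\alpha^{t+1}\rVert^{2}
  \le \lVert\alpha^{t}\rVert^{2} + k^{2}R^{2}
  \le \lVert\alpha^{t-1}\rVert^{2} + 2k^{2}R^{2}
  \le \dots
  \le \lVert\alpha^{1}\rVert^{2} + t k^{2}R^{2}
  = t k^{2}R^{2}

% Step 2: square the lower bound \lVert\alpha^{t+1}\rVert \ge t k \delta and chain it
% with Step 1, then divide by t k^{2}\delta^{2} > 0.
t^{2}k^{2}\delta^{2} \le \lVert\alpha^{t+1}\rVert^{2} \le t k^{2}R^{2}
  \;\Longrightarrow\;
t \le \frac{R^{2}}{\delta^{2}}
```

Note that $k$ cancels in the last step, which is why the worst-case bound on the number of time steps does not depend on the number of threads.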

3.2 Optimal Case Convergence

In practice, the worst case of extremely delayed updates is unlikely to happen, or at least does not happen all the time. Thus, we expect the real convergence speed to be much faster than this worst-case bound. The optimal bound is as follows:

$$t\leq R^{2}/(k\delta^{2})$$

This bound holds when the parallel updates are not delayed (i.e., each thread's update is based on the most recent parameter vector). As we can see, in the optimal case we get a $k$-fold speed-up by using $k$ threads for lock-free parallel perceptron training, that is, full acceleration from parallel learning.
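As a small numeric illustration of the two bounds, the sketch below evaluates $R^{2}/\delta^{2}$ and $R^{2}/(k\delta^{2})$ side by side. The values of `R`, `delta`, and `k` are made up for the example and do not come from the paper's experiments.

```python
def worst_case_steps(R: float, delta: float) -> float:
    """Worst-case (full delay) bound on the number of time steps: R^2 / delta^2."""
    return R**2 / delta**2

def optimal_case_steps(R: float, delta: float, k: int) -> float:
    """Optimal-case (no delay) bound on the number of time steps: R^2 / (k * delta^2)."""
    return R**2 / (k * delta**2)

# Illustrative values only (assumptions, not measurements).
R, delta, k = 2.0, 0.5, 10
worst = worst_case_steps(R, delta)
best = optimal_case_steps(R, delta, k)
print(f"worst case: t <= {worst}, optimal case: t <= {best}, ratio = {worst / best}")
```

The ratio of the two bounds is exactly $k$, which is the $k$-fold speed-up claimed for the optimal case.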

4 Experiments

4.1 Dataset

Following prior work, we use the English Penn Treebank (PTB) (Marcus et al., 1993) to evaluate the proposed approach. We follow the standard split of the corpus, using sections 2-21 as the training set, section 22 as the development set, and section 23 as the final test set. We implement two popular models of graph-based dependency parsing: a first-order model and a second-order model. We tune all hyperparameters on the development set. The features in our model can be found in McDonald et al. (2005) and McDonald and Pereira (2006). Our baselines are the traditional perceptron, MST-Parser (McDonald et al., 2005; www.seas.upenn.edu/~strctlrn/MSTParser/MSTParser.html), and the locked version of the parallel perceptron. All experiments are conducted on a computer with an Intel(R) Xeon(R) 3.0GHz CPU.

4.2 Results

Table 2 shows that our lock-free method with 10 threads trains about 8 times faster than the baseline system (8.1x for 1st-order and 8.3x for 2nd-order parsing), a better speed-up than that of the locked parallel perceptron (5.1x and 5.0x). The results are consistent across 1st-order and 2nd-order parsing: the proposed lock-free method achieves the best speed-up. They also show that the lock-free parallel perceptron in real-world applications is close to the optimal case of low delay in the theoretical analysis, rather than the worst case of high delay.
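The speed-up factors in Table 2 can be recomputed from the reported seconds per pass. The short script below does this as a sanity check; the dictionary simply transcribes the timings from Table 2, and the variable names are our own.

```python
# Seconds per training pass from Table 2: (1st-order, 2nd-order).
seconds_per_pass = {
    "Structured Perc": (449.0, 3044.0),
    "Locked Para-Perc": (88.0, 609.0),
    "Lock-free Para-Perc 5-thr.": (105.0, 672.0),
    "Lock-free Para-Perc 10-thr.": (55.4, 367.0),
}

baseline = seconds_per_pass["Structured Perc"]

# Speed-up = baseline time / model time, rounded to one decimal as in the table.
speedup = {
    model: tuple(round(b / t, 1) for b, t in zip(baseline, times))
    for model, times in seconds_per_pass.items()
}

for model, (first, second) in speedup.items():
    print(f"{model:30s} 1st-order {first}x   2nd-order {second}x")
```

With 10 threads, a speed-up of 8.1x to 8.3x corresponds to a parallel efficiency of roughly 81% to 83% of the ideal 10x.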

The accuracy results are shown in Table 1. The baseline MST-Parser (McDonald et al., 2005) is a popular system for dependency parsing. Table 1 shows that our method with 10 threads outperforms the single-thread system. Our locked system is slightly better than MST-Parser, mainly because we use more features, including distance-based features: our distance features are based on a larger contextual window.

Figure 1 shows that the lock-free parallel perceptron has no loss at all in parsing accuracy in both the 1st-order and 2nd-order parsing settings, in spite of the substantial speed-up of training.

Figure 2 shows that the method achieves near-linear speed-up with almost no extra memory cost.

5 Conclusions

We propose the lock-free parallel perceptron for graph-based dependency parsing. Our experiments show that it trains more than 8 times faster than the baseline when using 10 threads, with no loss in accuracy. We also provide a convergence analysis for the lock-free parallel perceptron and show that it converges in the lock-free learning setting. The lock-free parallel perceptron can be directly used for other structured prediction tasks in NLP.

6 Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (No. 61673028) and the National High Technology Research and Development Program of China (863 Program, No. 2015AA015404). Xu Sun is the corresponding author of this paper.

Bibliography

1. Bernd Bohnet. 2010. Very high accuracy and fast dependency parsing is not a contradiction. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 89–97.
2. Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750.
3. Michael Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Volume 10. Association for Computational Linguistics, pages 1–8.
4. Koby Crammer and Yoram Singer. 2002. On the algorithmic implementation of multiclass kernel-based vector machines. The Journal of Machine Learning Research 2:265–292.
5. Jason Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In Proceedings of the 16th Conference on Computational Linguistics, pages 340–345.
6. Terry Koo and Michael Collins. 2010. Efficient third-order dependency parsers. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1–11.
7. Phong Le and Willem Zuidema. 2014. The inside-outside recursive neural network model for dependency parsing. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 729–739.
8. Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19(2):313–330.