Predicting Algorithm Classes for Programming Word Problems

Vinayak Athavale; Aayush Naik; Rajas Vanjape; Manish Shrivastava

arXiv:1903.00830·cs.CL·April 5, 2019

TL;DR

This paper introduces the task of predicting algorithm classes for natural language programming problems, presents new datasets, and develops models that approach human-level accuracy in classification.

Contribution

It defines a novel task, creates four datasets, and trains neural and non-neural models achieving near-human accuracy in classifying programming word problems.

Findings

1. Best classifier achieves 62.7% accuracy on the five-class multiclass dataset (CFMC5)
2. Models are within 9% of human performance
3. First reported results on algorithm class prediction for word problems

Tables

Table 1: Dataset statistics for multiclass datasets. CFMC5 has 550 problems with a balanced class distribution. CFMC10 has 1159 problems and has a class imbalance. CFMC5 is a subset of CFMC10. Red classes belong to the solution category; blue classes belong to the problem category.

Dataset	Size	Vocab	classes	Avg. words	Class percentage
CFMC5	550	9326	5	504	greedy: 20%, implementation:20%, data structures: 20%, dp: 20%, math: 20%
CFMC10	1159	14691	10	485	implementation: 34.94%, dp: 12.42%, math: 11.38%, greedy: 10.44%, data structures: 9.49%, brute force: 5.60%, geometry: 4.22%, constructive algorithms: 5.52%, dfs and similar: 3.10%, strings: 2.84%

Table 2: Dataset statistics for multilabel datasets. The problems of the CFML10 dataset are a subset of those in the CFML20 dataset.

Dataset	Size	Vocab	N classes	Avg. len	Label card	Label den	Label subsets
CFML10	3737	28178	10	494	1.69	0.169	231
CFML20	3960	29433	20	495	2.1	0.105	808
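
The label statistics in Table 2 can be reproduced from a list of per-problem tag sets. The sketch below assumes the standard multilabel definitions (cardinality is the mean number of labels per problem, density is cardinality divided by the number of classes), which is consistent with the table: 1.69 / 10 = 0.169 and 2.1 / 20 = 0.105. The function name and the toy data are hypothetical and not taken from the released code.

```python
from typing import List, Set, Tuple


def label_stats(label_sets: List[Set[str]], n_classes: int) -> Tuple[float, float, int]:
    """Return (label cardinality, label density, number of distinct label subsets)."""
    n_problems = len(label_sets)
    # Cardinality: average number of tags per problem.
    cardinality = sum(len(tags) for tags in label_sets) / n_problems
    # Density: cardinality normalised by the number of classes.
    density = cardinality / n_classes
    # Distinct label subsets: how many different tag combinations occur.
    distinct_subsets = len({frozenset(tags) for tags in label_sets})
    return cardinality, density, distinct_subsets


# Toy data for illustration only (not from the Codeforces datasets).
toy_labels = [{"dp", "math"}, {"greedy"}, {"dp", "math"}, {"implementation", "greedy", "math"}]
card, dens, subsets = label_stats(toy_labels, n_classes=10)
```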

Table 3: Classification Accuracy for single label classification. Note that all results were obtained on 10-fold cross validation. CNN Random refers to a CNN trained on a random labelling of the dataset. F1 W stands for weighted macro F1-score.

Classifier	CFMC5 Acc	CFMC5 F1 W	CFMC10 Acc	CFMC10 F1 W
CNN Random	25.0	22.1	35.2	19.2
MNB	47.6	47.5	43.9	37.4
SVM BoW	49.3	49.1	47.9	43.2
SVM TFIDF	47.8	47.6	45.7	41.2
MLP	47.8	47.6	49.3	46.2
CNN	61.7	61.3	54.7	51.3
CNN Ensemble	62.7	62.2	53.5	50.5
CNN GloVe	62.2	61.3	54.5	51.4

Table 4: Classification Accuracy for multi-label classification. TWE stands for trainable word embeddings initialized with a normal distribution. Note that all results were obtained on 10-fold cross validation. CNN Random refers to a CNN trained on a random labelling of the dataset.

Classifier	CFML10 hamming loss	CFML10 F1 micro	CFML10 F1 macro	CFML20 hamming loss	CFML20 F1 micro	CFML20 F1 macro
CNN Random TWE	0.2158	15.98	9.39	0.1207	12.07	4.02
MNB BoW	0.1706	30.57	25.73	0.1067	29.67	23.41
SVM BoW	0.1713	36.08	31.09	0.1056	34.93	30.70
SVM BoW + TF-IDF	0.1723	38.20	33.68	0.1059	38.55	34.70
MLP BoW	0.1879	39.13	34.92	0.1167	38.12	31.37
CNN TWE	0.1671	39.20	32.59	0.1023	38.44	30.38
CNN Ensemble TWE	0.1703	45.32	38.93	0.1093	42.75	37.29
CNN GloVe	0.1676	39.22	33.77	0.1052	37.56	30.29
Human	-	-	-	-	51.8	42.7

Table 5: Performance on different categories of PWPs for different parts of the PWPs. The rows with "only statement" features use only the problem description part of the PWP, the rows with "only i/o" use only the I/O and constraint information, and "all prob" use the entire PWP. The results under the "Soln category" column are of those problems that belong to the solution category, those under "Prob category" belong to the problem category, and those under "all" are for all the PWPs. So, for example, the F1 Micro score using only I/O and constraint for solution category problems of CFML20 is 34.63. Note that for CFMC5, F1 Mi (F1 Micro) is the same as accuracy, and F1 Ma (F1 Macro) score is a weighted Macro F1-score.

Dataset	Features	Classifier	Soln. category F1 Mi	Soln. category F1 Ma	Prob. category F1 Mi	Prob. category F1 Ma	All F1 Mi	All F1 Ma
CFMC5	only statement	cnn	42.73	46.14	51.32	64.35	46.13	45.20
CFMC5	only i/o	cnn	44.24	51.73	74.73	81.31	56.42	55.41
CFMC5	all prob	cnn	54.24	59.91	71.36	78.32	61.71	61.32
CFML20	only statement	cnn	30.83	17.32	38.64	41.82	33.59	28.34
CFML20	only i/o	cnn	34.63	19.59	44.49	44.34	38.44	30.38
CFML20	all prob	cnn	34.39	19.23	45.36	44.02	39.20	32.59

Table 6: Human accuracy on a 100-problem subset of the CFML20 dataset. HL is the hamming loss.

Classifier	F1 micro (CFML20 subset)	F1 macro (CFML20 subset)
Human 1	56.3	42.3
Human 2	46.1	38.7
Human 3	51.1	40.6
Human 4	48.4	42.8
Human 5	57.3	49.1
Human Average	51.8	42.7

Full text

Vinayak Athavale*, Aayush Naik*, Rajas Vanjape, Manish Shrivastava

IIIT Hyderabad

* equal contribution

Abstract

We introduce the task of algorithm class prediction for programming word problems. A programming word problem is a problem written in natural language, which can be solved using an algorithm or a program. We define classes of various programming word problems which correspond to the class of algorithms required to solve the problem. We present four new datasets for this task, two multiclass datasets with 550 and 1159 problems each and two multilabel datasets having 3737 and 3960 problems each. We pose the problem as a text classification problem and train neural network and non-neural network based models on this task. Our best performing classifier gets an accuracy of 62.7 percent for the multiclass case on the five class classification dataset, Codeforces Multiclass-5 (CFMC5). We also do some human-level analysis and compare human performance with that of our text classification models. Our best classifier has an accuracy only 9 percent lower than that of a human on this task. To the best of our knowledge, these are the first reported results on such a task. We make our code and datasets publicly available.

1 Introduction

In this paper we introduce and work on the problem of predicting algorithm classes for programming word problems (PWPs). A PWP is a problem written in natural language which can be solved using a computer program. These problems generally map to one or more classes of algorithms, which are used to solve them. Binary search, disjoint-set union, and dynamic programming are some examples. In this paper, our aim is to automatically map programming word problems to the relevant classes of algorithms. We approach this problem by treating it as a classification task.

Programming word problems. A programming word problem (PWP) requires the solver to design correct and efficient programs. Correctness and efficiency are checked by test cases provided by the problem writer. A PWP usually consists of three parts: the problem statement, a well-defined input and output format, and time and memory constraints. An example PWP can be seen in Figure 1.

Solving PWPs is difficult for several reasons. One reason is that the problems are often embedded in a narrative, that is, they are described as quasi real-world situations in the form of short stories or riddles. The solver must first decode the intent of the problem, or understand what the problem is. Then the solver needs to apply their knowledge of algorithms to write a solution program. Another reason is that the solution programs must be efficient with respect to the given time and memory constraints. A consequence is that the algorithm required to solve a particular problem depends not only on the problem statement but also on the constraints. For example, two different algorithms, such as linear search and binary search, may both generate the correct output, but only one of them will satisfy the time and memory constraints. With the growing popularity of these problems, various competitions like ACM-ICPC and Google CodeJam have emerged. Additionally, several companies including Google, Facebook, and Amazon evaluate problem-solving skills of candidates for software-related jobs (McDowell, 2016) using PWPs. Consequently, as noted by Forišek (2010), programming problems have been becoming more difficult over time. To solve a PWP, humans get information from all its parts, not just the problem statement. Thus, we predict algorithms from the entire text of a PWP. We also try to identify which parts of a PWP contribute the most towards predicting algorithms.

Significance of the Problem. Many interesting real-world problems can be solved and optimised using standard algorithms. Time spent grocery shopping can be optimised by posing it as a graph traversal problem (Gertin, 2012). Arranging and retrieving items like mail, or books in a library, can be done more efficiently using sorting and searching algorithms. Solving problems using algorithms can be scaled by using computers, transforming the algorithms into programs. A program is an algorithm that has been customised to solve a specific task under a specific set of circumstances using a specific language. Converting textual descriptions of such real-world problems into algorithms, and then into programs, has largely been a human endeavour. An AI agent that could automatically generate programs from natural language problem descriptions could greatly increase the rate of technological advancement by quickly providing efficient solutions to such real-world problems. A subsystem that could identify algorithm classes from natural language would significantly narrow down the search space of possible programs. Consequently, such a subsystem would be a useful feature for, or likely even be part of, such an agent. Therefore, building a system to predict algorithms from programming word problems is potentially an important first step toward an automatic program generating AI. More immediately, such a system could serve as an application to help people improve their algorithmic problem-solving skills for software job interviews, competitive programming, and other uses.

To our knowledge, this task has not been addressed in the literature before. Hence, there is no standard dataset available for this task. We generate and introduce new datasets by extracting problems from Codeforces (codeforces.com), a sport programming platform. We release the datasets and our experiment code at $masked$ (hidden for the double blind review).

Contribution. The major contributions of this paper are: (1) Four datasets on programming word problems: two multiclass datasets (each problem belongs to only one class) having 5 and 10 classes, and two multilabel datasets (each problem belongs to one or more classes) having 10 and 20 classes. (2) An evaluation of various multiclass and multilabel classifiers that predict classes for programming word problems on our datasets, along with a human baseline.

We define our problem more clearly in section 2. Then we explain our datasets, their generation and format, along with human evaluation in section 3. We describe the models we use for multiclass and multilabel classification in section 4. We delineate our experiments, models, and evaluation metrics in section 5. We report our classification results in section 6. We analyse some dataset nuances in section 7. Finally, we discuss related work and the conclusion in sections 8 and 9 respectively.

2 Problem Definition

The focus of this paper is the problem of mapping a PWP to one or more classes of algorithms. A class of algorithms is a set containing more specific algorithms. For example, breadth-first search, and Dijkstra’s algorithm belong to the class of graph algorithms. A PWP can be solved using one of the algorithms in the class it is mapped to. Problems on the Codeforces platform have tags that correspond to the class of algorithms.

Thus, our aim is to find a tagging function, $f^{*}:\mathcal{S}\rightarrow\mathcal{P}(\mathcal{T})$, which maps a PWP string, $s\in\mathcal{S}$, to a set of tags, $\{t_{1},t_{2},\ldots\}\in\mathcal{P}(\mathcal{T})$. Here $\mathcal{S}$ is the set of PWP strings, $\mathcal{T}$ is the set of tags, and $\mathcal{P}(\mathcal{T})$ is the power set of $\mathcal{T}$. We also consider another variant of the problem. For the PWPs that only have one tag, we focus on finding a tagging function, $f_{1}^{*}:\mathcal{S}\rightarrow\mathcal{T}$, which maps a PWP string, $s\in\mathcal{S}$, to a tag, $t\in\mathcal{T}$. We approximate $f^{*}$ and $f_{1}^{*}$ by training models on data.

3 Dataset

3.1 Data Collection

We collected the data from a popular sport programming platform called Codeforces. Codeforces was founded in 2010, and now has over 43000 active registered participants (http://codeforces.com/ratings/page/219). We first collected a total of 4300 problems from this platform. Each problem has associated tags, with most of the problems having more than one tag. These tags correspond to the algorithm or class of algorithms that can be used to solve that particular problem. The tags for a problem are given by the problem writer and can be edited only by high-rated (expert) contestants who have solved the problem. Next, we performed basic filtering on the data: we removed the problems which had non-algorithmic tags, problems with no tags assigned to them, and problems whose statement was not extracted completely. After this filtering, we got 4019 problems with 35 different tags. This forms the Codeforces dataset. The label (tag) cardinality (average number of labels/tags per problem) was 2.24. Since the Codeforces dataset is the first dataset generated for a new problem, we select different subsets of this dataset with differing properties. This is to check if classification models are robust to different variations of the problem.

3.2 Multilabel Datasets

We found that a large number of tags had a very low frequency. Hence, we removed those problems and tags from the Codeforces dataset as follows. First, we got the list of 20 most frequently occurring tags, ordered by decreasing frequency. We observed that the $20^{th}$ tag in this list had a frequency of 98, in other words, 98 problems had this tag. Next, for each problem, we removed the tags that are not in this list. After that, all problems that did not have any tags left were removed.

This led to the formation of the Codeforces Multilabel-20 (CFML20) dataset, which has 20 tags. We used the same procedure for the 10 most frequently occurring tags to get the Codeforces Multilabel-10 (CFML10) dataset. CFML20 retains 98.53 percent (3960 problems) of the problems of the original dataset, and the label (tag) cardinality only reduces from 2.24 to 2.21. CFML10, on the other hand, retains 92.9 percent of the problems, with a label (tag) cardinality of 1.69. Statistics about both these multilabel datasets are given in Table 2.
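
The filtering procedure above (keep the k most frequent tags, strip all other tags, drop problems left without tags) can be sketched as follows. This is a minimal illustration under assumed data structures: the dictionary layout, the function name `keep_top_k_tags`, and the toy tags are hypothetical, and the released code may be organised differently.

```python
from collections import Counter
from typing import Dict, List


def keep_top_k_tags(problems: Dict[str, List[str]], k: int) -> Dict[str, List[str]]:
    """Restrict every problem to the k most frequent tags; drop problems with no tags left."""
    # Count how many problems carry each tag.
    tag_counts = Counter(tag for tags in problems.values() for tag in tags)
    top_tags = {tag for tag, _ in tag_counts.most_common(k)}
    filtered = {}
    for problem_id, tags in problems.items():
        kept = [tag for tag in tags if tag in top_tags]
        if kept:  # problems that lose all their tags are removed
            filtered[problem_id] = kept
    return filtered


# Toy data for illustration only.
toy_problems = {
    "p1": ["dp", "math"],
    "p2": ["greedy"],
    "p3": ["dp", "fft"],
    "p4": ["2-sat"],
    "p5": ["greedy", "dp"],
}
top2 = keep_top_k_tags(toy_problems, k=2)
```

With k = 20 this corresponds to the CFML20 construction and with k = 10 to CFML10, assuming the input holds the 4019 filtered Codeforces problems.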

3.3 Multiclass Datasets

To generate the multiclass datasets, first, we extracted the problems from the CFML20 dataset that only had one tag. There were about 1300 such problems. From those, we selected the problems whose tags occur in the list of 10 most common tags. These problems formed the Codeforces Multiclass-10 (CFMC10) dataset, which contains 1159 examples. We found that the CFMC10 dataset has a class (tag) imbalance. We also make a balanced dataset, Codeforces Multiclass-5 (CFMC5), in which the prior class (tag) distribution is uniform. The CFMC5 dataset has five tags, each having 110 problems. To make CFMC5, first we extracted the problems whose tags are among the five most common tags. The fifth most common tag occurs 110 times. We sampled 110 random problems for each of the other four tags to give a total of 550 problems. Statistics about both the multiclass datasets are given in Table 1.
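
A minimal sketch of the two steps described above: keeping single-tag problems whose tag is in an allowed list, then sampling the same number of problems per tag. The helper names, the fixed seed, and the toy data are assumptions made for illustration; the paper does not specify how the random sample was drawn.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Set


def single_tag_problems(problems: Dict[str, List[str]], allowed_tags: Set[str]) -> Dict[str, str]:
    """Keep problems with exactly one tag, provided that tag is in allowed_tags."""
    return {
        pid: tags[0]
        for pid, tags in problems.items()
        if len(tags) == 1 and tags[0] in allowed_tags
    }


def balanced_sample(labelled: Dict[str, str], per_class: int, seed: int = 0) -> Dict[str, str]:
    """Sample per_class problems for every tag so that the class prior is uniform."""
    rng = random.Random(seed)
    by_tag = defaultdict(list)
    for pid, tag in labelled.items():
        by_tag[tag].append(pid)
    sample = {}
    for tag, pids in by_tag.items():
        # Sorting first makes the sample reproducible for a given seed.
        for pid in rng.sample(sorted(pids), per_class):
            sample[pid] = tag
    return sample


# Toy data for illustration only.
toy_problems = {
    "a1": ["dp"], "a2": ["dp"], "a3": ["dp"], "a4": ["dp"],
    "b1": ["math"], "b2": ["math"], "b3": ["math"],
    "c1": ["greedy"], "c2": ["greedy"],
    "d1": ["dp", "math"],  # more than one tag: excluded
    "e1": ["fft"],         # tag not in the allowed list: excluded
}
single = single_tag_problems(toy_problems, {"dp", "math", "greedy"})
# Use the size of the rarest class, as CFMC5 uses 110 (the count of its fifth tag).
per_class = min(Counter(single.values()).values())
balanced = balanced_sample(single, per_class)
```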

3.4 Dataset Format

Each problem in the datasets follows the same format (refer to Figure 1 for an example problem). The header contains the problem title, and the time and memory constraints for a program running on the problem test cases. The problem statement is the natural language description of the problem framed as a real world scenario. The input and output format describe the input to, and the output from, a valid solution program. They also contain the constraints placed on the size of the inputs (for example, max size of input array, max size of 2 input values). The tags associated with the problem are the algorithm classes that we are trying to predict using the above information.

3.5 Class Categories in the Dataset

The classes for PWPs can be divided into two categories: Problem category classes tell us what kind of broad class of problem the PWP belongs to. For instance, math, and string are two such classes. Solution category classes tell us what kind of algorithm can solve a particular PWP. For example, a PWP of class dp or binary search would need a dynamic programming or binary search based algorithm to solve it.

Problem category PWPs are easier to classify because, in some cases, simple keyword mapping may lead to the classification (an equation in the problem is a strong indicator that a problem is of math type). For solution category PWPs, in contrast, a deeper understanding of the problem is required.

The classes belonging to the problem and solution categories for CFML20 are listed in the supplementary material.

3.6 Human Evaluation

In this section, we evaluate and analyze the performance of an average competitor on the task of predicting an algorithm for a PWP. The tags for a given PWP are added by its problem setter or other high-rated contestants who have solved it. Our test participants were recent computer science graduates with some experience in algorithms and competitive programming. We gave 5 participants the problem text along with all the constraints, and the input and output format. We also provided them with a list of all the tags and a few example problems for each tag. We randomly sampled 120 problems from the CFML20 dataset and split them into two parts containing 20 and 100 problems respectively. The 20 problems were given along with their tags to familiarize the participants with the task. For the remaining 100 problems, the participants were asked to predict the tags (one or more) for each problem. We chose to sample the problems from the CFML20 dataset as it is the closest to a real-world scenario of predicting algorithms for solving problems. We find that there is some variation in the scores obtained by different humans, with the highest F1 micro score being 11 percent greater than the lowest (see supplementary material for more details). The F1 micro score averaged over all 5 participants was 51.8 while the averaged F1 macro was 42.7. The variation is not surprising since this task is like any other problem solving task, and people would get different results depending on their proficiency. This shows us that the problem is hard even for humans with a computer science education.

4 Classification Models

To test the compatibility of our problem with the text classification paradigm, we apply some standard text classification models from recent literature to it.

4.1 Multiclass Classification

To approximate the optimal tagging function $f_{1}^{*}$ (see section 2) we use the following models.

Multinomial Naive Bayes (MNB) and Support Vector Machine (SVM). Wang and Manning (2012) proposed several simple and effective baselines for text classification. An MNB is a naive Bayes classifier for multinomial models. An SVM is a discriminative hyperplane-based classifier (Hearst et al., 1998). These baselines use unigrams and bigrams as features. We also try applying TF-IDF to these features.

Multi-layer Perceptron (MLP). An MLP is a class of artificial neural network that uses backpropagation for training in a supervised setting (Rumelhart et al., 1986). MLP-based models are standard for text classification baselines (Glorot et al., 2011).

Convolutional Neural Network (CNN). We also train a CNN-based model, similar to the one used by Kim (2014), to classify the problems. We use the model both with and without pre-trained GloVe word embeddings (Pennington et al., 2014).

CNN ensemble. Hansen and Salamon (1990) introduce neural network ensemble learning, in which many neural networks are trained and their predictions combined. These neural network systems show greater generalization ability and predictive power. We train five CNNs and combine their predictions by majority voting.
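
Majority voting over the class predictions of several models can be sketched as below. The tie-breaking rule (the class that reaches the top count first in model order wins) is an assumption of this sketch, since the text does not state how ties are resolved; the function name and toy predictions are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """predictions[m][i] is the class predicted by model m for example i."""
    voted = []
    # zip(*predictions) groups the votes of all models for one example.
    for votes in zip(*predictions):
        # most_common(1) returns the most frequent vote; ties fall back to
        # the order in which the classes were first seen (an assumed rule).
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted


# Toy predictions from five hypothetical models on three examples.
model_predictions = [
    ["dp", "math", "greedy"],
    ["dp", "greedy", "greedy"],
    ["math", "math", "dp"],
    ["dp", "math", "greedy"],
    ["greedy", "dp", "dp"],
]
ensemble_prediction = majority_vote(model_predictions)
```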

4.2 Multilabel Classifiers

To approximate $f^{*}$ (see section 2), we apply the following augmentations to the models described above.

Multinomial Naive Bayes (MNB) and Support Vector Machine (SVM). For applying these models to the multilabel case, we use the one-vs-rest (or one-vs-all) strategy. This strategy involves training a single classifier for each class, with the samples of that class as positive samples and all other samples as negatives (Bishop, 2006).
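
The one-vs-rest strategy turns a multilabel problem into one binary problem per class. The sketch below only builds the binary targets that each per-class classifier would be trained on; the classifier itself is left out, and the function name and toy data are hypothetical.

```python
from typing import Dict, List, Sequence, Set


def one_vs_rest_targets(label_sets: Sequence[Set[str]], classes: Sequence[str]) -> Dict[str, List[int]]:
    """For each class, mark problems carrying that tag as 1 and all others as 0."""
    return {
        cls: [1 if cls in tags else 0 for tags in label_sets]
        for cls in classes
    }


# Toy data for illustration only.
toy_label_sets = [{"dp", "math"}, {"greedy"}, {"math"}]
targets = one_vs_rest_targets(toy_label_sets, ["dp", "greedy", "math"])
```

One binary classifier is then fitted per entry of `targets`, so the number of classifiers grows linearly with the number of classes.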

Multi-layer Perceptron (MLP). Nam et al. (2014) use MLP-based models for multilabel text classification. We use similar models, but use the MSE loss instead of the cross-entropy loss.

Convolutional Neural Network (CNN). For multilabel classification we use a CNN-based feature extractor similar to the one used in Kim (2014). The output is passed through a sigmoid activation function, $\sigma(x)=\frac{1}{1+e^{-x}}$. The labels which have a corresponding activation greater than 0.5 are taken as the predicted labels (Liu et al., 2017). Similar to the multiclass case, we train the model both with and without pre-trained GloVe (Pennington et al., 2014) word embeddings.

CNN ensemble. We train five CNNs and add their output linear activation values. We pass this sum through a sigmoid function and consider the labels (tags) with activation greater than 0.5.
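
The thresholding rule and the ensemble rule described above can be illustrated without a neural network, by starting from the linear output activations (logits) that each model would produce. The logits, tag names, and function names below are made up for illustration; only the sigmoid, the 0.5 threshold, and the summation of linear activations come from the text.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable form of 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_tags(logits: Sequence[float], tags: Sequence[str], threshold: float = 0.5) -> List[str]:
    """Keep every tag whose sigmoid activation exceeds the threshold."""
    return [tag for tag, z in zip(tags, logits) if sigmoid(z) > threshold]


def ensemble_predict(all_logits: Sequence[Sequence[float]], tags: Sequence[str],
                     threshold: float = 0.5) -> List[str]:
    """Sum the linear activations of all models, then apply the sigmoid and threshold."""
    summed = [sum(per_tag) for per_tag in zip(*all_logits)]
    return predict_tags(summed, tags, threshold)


# Hypothetical logits from three models for one problem.
tags = ["dp", "greedy", "math"]
model_logits = [
    [2.0, -1.0, 0.5],
    [1.0, -0.5, -1.0],
    [-0.5, -2.0, 0.2],
]
single_model_tags = predict_tags(model_logits[0], tags)
ensemble_tags = ensemble_predict(model_logits, tags)
```

Since the sigmoid exceeds 0.5 exactly when its input is positive, the ensemble rule amounts to keeping the tags whose summed logit is greater than zero.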

5 Experiment setup

All hyperparameter tuning experiments were performed with 10-fold cross validation. For the non-neural network-based methods, we first vectorize each problem using a bag-of-words vectorizer, scikit-learn's (Pedregosa et al., 2011) CountVectorizer. We also experiment with TF-IDF features for each problem. In the multiclass case, we use the LIBSVM (Chang and Lin, 2001) implementation of the SVM classifier and we grid search over different kernels. However, the LIBSVM implementation is not compatible with the one-vs-rest strategy (complexity $\mathcal{O}(n)$, where $n$ is the number of classes), but only with the one-vs-one strategy (complexity $\mathcal{O}(n^{2})$). This becomes prohibitively slow, and thus we use the LIBLINEAR (Fan et al., 2008) implementation for the multilabel case. For hyperparameter tuning, we applied a grid search over the parameters of the vectorizers, classifiers, and other components. The exact parameters tuned can be seen in our code repository.

For the neural network-based methods, we tokenize each problem using the spaCy tokenizer (Honnibal and Montani, 2017). We only use words appearing 2 or more times in building the vocabulary and replace the words that appear fewer times with a special UNK token. Our CNN architecture is similar to that used by Kim (2014). The batch size used is 32. We apply 512 one-dimensional convolution filters of size 3, 4, and 5 on each problem. The rectifier, $R(x)=\max(x,0)$, is used as the activation function. We concatenate these filters, apply a global max-pooling followed by a fully-connected layer with output size equal to the number of classes. We use the PyTorch framework (Paszke et al., 2017) to build this model. For the word embedding we use two approaches: a vanilla PyTorch trainable embedding layer and a 300-dimensional GloVe embedding (Pennington et al., 2014). The networks were initialized using the Xavier method (Glorot and Bengio, 2010) at the beginning of each fold. We use the Adam optimization algorithm (Kingma and Ba, 2014) as we observe that it converges faster than vanilla stochastic gradient descent.
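
The vocabulary rule described above (keep words that appear at least twice, map everything else to an UNK token) can be sketched as follows. Whitespace splitting stands in for the spaCy tokenizer here, and the token spelling `<unk>`, the function names, and the toy sentences are assumptions of this sketch.

```python
from collections import Counter
from typing import Dict, List, Sequence

UNK = "<unk>"  # assumed spelling of the special unknown-word token


def build_vocab(tokenized_docs: Sequence[Sequence[str]], min_count: int = 2) -> Dict[str, int]:
    """Map every token seen at least min_count times to an integer id; id 0 is UNK."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    vocab = {UNK: 0}
    for token, count in counts.items():
        if count >= min_count:
            vocab[token] = len(vocab)
    return vocab


def encode(tokens: Sequence[str], vocab: Dict[str, int]) -> List[int]:
    """Replace each token by its id, using the UNK id for out-of-vocabulary tokens."""
    return [vocab.get(token, vocab[UNK]) for token in tokens]


# Toy corpus; str.split is a stand-in for a real tokenizer.
docs = [text.split() for text in ["find the shortest path", "find the minimum cost path"]]
vocab = build_vocab(docs)
encoded_first = encode(docs[0], vocab)
```

The resulting id sequences are what an embedding layer would consume, with rare and unseen words sharing the UNK embedding.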

6 Results

6.1 Multiclass Results

We see that the classification accuracy of the best performing classifier, CNN ensemble, for the CFMC5 dataset is 62.7%. The highest accuracy for the CFMC10 dataset was achieved by the CNN classifier which does not use any pretrained embeddings. For all the multiclass classification results refer to table 3. We observe that CNN-based classifiers perform better than the other classifiers (MLP, MNB, and SVM) for both CFMC5 and CFMC10 datasets. Since these are the first learning results on the task of algorithm prediction for PWPs, we also train a CNN classifier on a random labelling of the dataset as a reference point. The results are given in the row called CNN Random. To obtain this random labelling we randomly shuffle the current mapping from problem to tag. This ensures that the class distribution of the dataset remains the same. We see that all the classifiers significantly outperform the CNN trained on the random labelling. We also observe that the classification accuracy is not the same for every class. We get the highest accuracy (see Fig. 2) for the class data structures, at 90%, while the lowest accuracy is for the class greedy, at 40%. These results are on the CFMC5 dataset.
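
The random-labelling control can be sketched as a permutation of the label list: the problems stay in place while their tags are shuffled among them, so each class keeps exactly the same number of problems. The function name, seed, and toy labels are hypothetical.

```python
import random
from collections import Counter
from typing import List, Sequence


def shuffle_labels(labels: Sequence[str], seed: int = 0) -> List[str]:
    """Return a random permutation of the labels; class frequencies are unchanged."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Toy labels, aligned by position with a list of problems.
labels = ["dp", "dp", "dp", "math", "math", "greedy"]
random_labels = shuffle_labels(labels)
```

A model trained on `random_labels` can only exploit the class prior, which makes it a useful floor for comparison.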

6.2 Multilabel Results

We see that CNN-based classifiers give the best results for the CFML10 and CFML20 datasets. The best F1 micro and macro scores for the CFML10 dataset were 45.32 and 38.93 respectively. These were obtained by the CNN Ensemble model. For complete results see table 4. The best performing model on the CFML20 dataset was also the CNN ensemble. As we did in the multiclass case, we train a CNN model on the randomly shuffled labelling for both the CFML10 and CFML20 datasets. We find that all the classifiers significantly outperform the model trained on a shuffled labelling. The human-level F1 micro and macro scores on a subset of the CFML20 dataset were 51.8 and 42.7 (table 6). In comparison, our best performing classifier on the CFML20 dataset, CNN Ensemble, got F1 micro and macro scores of 42.75 and 37.29 respectively. The performance of our best classifier thus trails average human performance by about 9.05 and 5.41 points on F1 micro and F1 macro respectively.
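
Table 4 reports hamming loss and micro-averaged F1 for the multilabel case. The sketch below computes both from true and predicted tag sets, assuming the usual definitions (hamming loss is the fraction of problem-tag decisions that are wrong; micro F1 pools true positives, false positives, and false negatives over all tags). The function names and toy data are hypothetical.

```python
from typing import Sequence, Set


def hamming_loss(true_sets: Sequence[Set[str]], pred_sets: Sequence[Set[str]], n_classes: int) -> float:
    """Fraction of (problem, tag) decisions on which prediction and truth disagree."""
    # The symmetric difference holds the tags that are wrongly added or wrongly missed.
    wrong = sum(len(t ^ p) for t, p in zip(true_sets, pred_sets))
    return wrong / (len(true_sets) * n_classes)


def micro_f1(true_sets: Sequence[Set[str]], pred_sets: Sequence[Set[str]]) -> float:
    """F1 computed from counts pooled over all problems and tags."""
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    fp = sum(len(p - t) for t, p in zip(true_sets, pred_sets))
    fn = sum(len(t - p) for t, p in zip(true_sets, pred_sets))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data for illustration only.
true_tags = [{"dp", "math"}, {"greedy"}]
pred_tags = [{"dp"}, {"greedy", "math"}]
hl = hamming_loss(true_tags, pred_tags, n_classes=4)
f1 = micro_f1(true_tags, pred_tags)
```

For hamming loss lower is better, while for F1 higher is better, which is why the two columns in Table 4 can rank classifiers differently.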

7 Analysis

7.1 Experiments with various subsets of the problem

As described in section 1, a PWP consists of three components – the problem statement, input and output format, and time and memory constraints. We seek to answer the following questions. Does one component contribute to the accuracy more than any other? Does the contribution of different components vary over the problem class? We performed some experiments to address these questions. We split the problem into two parts – 1) the problem statement, and 2) the input and output format, and time and memory constraints. We train an SVM, and a CNN on these two components independently.

Multiclass PWP component analysis. We report classifier accuracies on the CFMC5 dataset. We choose the CFMC5 dataset out of the two multiclass datasets because it has a balanced class distribution. We find that the classifiers perform quite well on only the input and output format, and time and memory constraints, with the best classifier getting an accuracy of 56.4 percent (only 5.3 percent lower than the accuracy of a CNN trained on the whole problem). Classification using only the problem statement gives worse results than using the format and constraints, with a classification accuracy of 45.2 percent for the best classifier, CNN (16.5 percent lower than the accuracy of a CNN trained on the whole problem). Complete results are given in table 5. We also see that the performance across different classes varies when trained on different inputs. We find that the class dp performs better when trained on the problem statement, whereas the other classes perform much better on the format and constraints. For each class except greedy, we see an additive trend: the accuracy is improved by combining both these features. Refer to figure 2 for more details.

Multilabel partial problem results. We also tabulate the classifier accuracies on the CFML20 dataset when training only on the format and constraints, and only on the problem statement. Here too, we observe trends similar to those in the multiclass partial problem experiments. We find that classifiers are more accurate when trained only on the format and constraints than only on the problem statement. Again, the accuracy is improved by combining both these features. Refer to table 5 for more details.

7.2 Problem category and Solution category results

We find that correctly classifying PWPs of the solution category is harder than correctly classifying PWPs of the problem category (table 5). For instance, take a look at the row corresponding to the CFMC5 dataset and the "all prob" feature. The accuracy for the solution category is 54.24% as compared to 71.36% for the problem category. This trend is followed for both CFMC5 and CFML20 datasets and also when using different features of the PWPs. In spite of the difficulty, the classification scores for the solution category are significantly better than random.

8 Related Work

Our work is related to three major topics of research: math word problem solving, text document classification, and program synthesis.

Math word problem solving. In recent years, many models have been built to solve different kinds of math word problems. Some models solve only arithmetic problems Hosseini et al. (2014), while others solve algebra word problems Kushman et al. (2014). Some recent solvers handle a wide range of pre-university level math word problems Matsuzaki et al. (2017), Hopkins et al. (2017). Wang et al. (2017) and Mehta et al. (2017) have built deep neural network based solvers for math word problems.

Program synthesis. Work on converting natural language descriptions to code falls under the research areas of program synthesis and natural language understanding. This work is still at a nascent stage. Zhong et al. (2017) worked on generating SQL queries automatically from natural language descriptions. Lin et al. (2017) worked on automatically generating bash commands from natural language descriptions. Iyer et al. (2016) worked on summarizing source code. Sudha et al. (2017) use a CNN based model to classify the algorithm used in a programming problem from its C++ code. Our model tries to accomplish the same task using the natural language problem description instead. Gulwani et al. (2017) is a comprehensive treatise on program synthesis.

Document classification. The problem of classifying a programming word problem written in natural language is similar to the task of document classification. The current state-of-the-art approach for single label classification is a hierarchical attention network based model (Yang et al., 2016). This model is improved by using transfer learning Howard and Ruder (2018). Other approaches include a Recurrent Convolutional Neural Network based approach Lai et al. (2015) and the fasttext model Joulin et al. (2016), which uses bag-of-words features and a hierarchical softmax. Nam et al. (2014) use a feed-forward neural network with binary cross entropy per label to perform multilabel document classification. Kurata et al. (2016) leverage label co-occurrence to improve multilabel classification. Liu et al. (2017) use a CNN based architecture to perform extreme multilabel classification.
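To make the "binary cross entropy per label" objective mentioned for multilabel classification concrete, the following is a minimal sketch in plain Python. It is not the implementation of Nam et al. (2014); the function names are hypothetical, and averaging over labels (rather than summing) is an assumption made here for illustration.

```python
import math


def bce_with_logits(logit, target):
    """Binary cross entropy for one label, computed from a raw score (logit).

    Uses the numerically stable form max(z, 0) - z*y + log(1 + exp(-|z|)),
    which equals -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(z).
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def multilabel_bce(logits, targets):
    """Treat every label as an independent yes/no decision and average the losses."""
    losses = [bce_with_logits(z, y) for z, y in zip(logits, targets)]
    return sum(losses) / len(losses)


# Toy scores for three labels of one problem (illustrative values only).
logits = [2.0, -1.0, 0.0]
targets = [1, 0, 1]
print(multilabel_bce(logits, targets))
```

Because each label contributes its own term, a problem can carry several labels at once, which is what distinguishes this loss from the softmax cross entropy used in the multiclass setting.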

9 Conclusion

We introduced a new problem of predicting the algorithm classes for programming word problems. For this task we generated four datasets: two multiclass (CFMC5 and CFMC10), having five and ten classes respectively, and two multilabel (CFML10 and CFML20), having 10 and 20 classes respectively. Our classifiers fall short of the human score by only about 9 percent. We also ran experiments which show that increasing the size of the training dataset improves the accuracy (see supplementary material). These problems are much harder than high school math word problems, as they require a good knowledge of various computer science algorithms and an ability to reduce a problem to these known algorithms. Even our human analysis shows that trained computer science graduates only get an F1 of 51.8. Based on these results, we see that algorithm class prediction can be posed as, and addressed with, text classification.

Appendix A Appendix

A.1 Experiments with limited training data

We wanted to see how the dataset size affects the performance of the classifier. So, we train a CNN classifier on 25, 50, 75, and 100 percent of the CFML20 dataset. As expected, we find that the performance of the classifier improves as the size of the training data increases: the F1 micro and macro scores increase, and the hamming loss decreases. For the F1 scores higher is better, while for hamming loss lower is better. See Figure 3.
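The subsampling step of such an experiment can be sketched as follows. This is only an illustration: the helper name `training_subsets`, the fixed seed, the placeholder data, and the choice to make the subsets nested (each smaller subset contained in the larger ones) are assumptions, since the paper does not describe how the fractions were drawn.

```python
import random


def training_subsets(examples, fractions=(0.25, 0.50, 0.75, 1.00), seed=0):
    """Shuffle once, then take growing prefixes so the subsets are nested."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return {frac: shuffled[: int(frac * len(shuffled))] for frac in fractions}


# Placeholder training pool (hypothetical size, not the real CFML20 split).
problems = [f"problem_{i}" for i in range(1000)]
subsets = training_subsets(problems)

for frac, subset in subsets.items():
    # A classifier would be trained on `subset` here and scored on a fixed test set.
    print(f"{int(frac * 100)}% -> {len(subset)} training examples")
```

Keeping the test set fixed across all four runs is what makes the resulting scores comparable between training sizes.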

A.2 Evaluation Metrics

A.3 Multiclass: Accuracy

Accuracy is the percentage of labels correctly predicted. Note that for multiclass classification the micro-averaged F1 score is equal to the accuracy.
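The equality between accuracy and micro-averaged F1 in the single-label setting can be checked on a small example. The sketch below uses plain Python and toy labels that are illustrative only; the function names are hypothetical. Every misclassified example produces exactly one false positive (for the predicted class) and one false negative (for the true class), which is why the two quantities coincide.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted class equals the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def micro_f1_multiclass(y_true, y_pred):
    """Pool TP, FP and FN over all classes, then compute a single F1."""
    tp = fp = fn = 0
    for c in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == c and t == c:
                tp += 1
            elif p == c and t != c:
                fp += 1
            elif p != c and t == c:
                fn += 1
    return 2 * tp / (2 * tp + fp + fn)


# Toy labels (illustrative only, not taken from the datasets).
y_true = ["greedy", "dp", "math", "dp", "graphs"]
y_pred = ["greedy", "dp", "dp", "dp", "math"]

print(accuracy(y_true, y_pred))             # 0.6
print(micro_f1_multiclass(y_true, y_pred))  # 0.6
```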

A.4 Multiclass: Macro-averaged F1 score

Macro-averaged F1 score is computed by first computing the F1 score for each class independently and then taking the unweighted average of all the F1 scores. This metric treats all the classes as equal, independent of their frequency in the test set.

A.5 Multiclass: Weighted macro-averaged F1 score

Weighted macro-averaged F1 score is computed by first computing the F1 score for each class independently and then taking the average of all the F1 scores, weighted by their support (the number of test examples belonging to each class).
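The difference between the plain and the support-weighted macro average is easiest to see side by side. The following is a minimal sketch in plain Python with hypothetical function names and toy labels; treating a class with no true or predicted instances as having F1 of 0 is an assumption made here.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {class: F1}, computed one-vs-rest for every class that occurs."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_macro_f1(y_true, y_pred):
    """Mean of the per-class F1 scores, weighted by class support in y_true."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


# Toy labels (illustrative only, not taken from the datasets).
y_true = ["greedy", "dp", "math", "dp", "graphs"]
y_pred = ["greedy", "dp", "dp", "dp", "math"]

print(round(macro_f1(y_true, y_pred), 4))           # 0.45
print(round(weighted_macro_f1(y_true, y_pred), 4))  # 0.52
```

In this toy example the frequent class "dp" is predicted well, so the weighted score is higher than the unweighted one; on an imbalanced dataset such as CFMC10 the two metrics can diverge for the same reason.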

A.6 Multilabel: Hamming loss

Hamming loss is the fraction of incorrectly predicted labels, computed over all example-label pairs in the dataset.
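Under the standard definition, Hamming loss counts individual label decisions rather than whole examples. The sketch below, in plain Python, represents each example as a set of labels; the function name, the toy label space, and the toy predictions are illustrative assumptions.

```python
def hamming_loss(y_true, y_pred, num_labels):
    """Fraction of (example, label) decisions that are wrong.

    The symmetric difference t ^ p contains the labels that were either
    missed or predicted spuriously for one example.
    """
    wrong = sum(len(t ^ p) for t, p in zip(y_true, y_pred))
    return wrong / (len(y_true) * num_labels)


# Toy label space and predictions (illustrative only).
LABELS = ["dp", "math", "greedy", "graphs", "sortings"]
y_true = [{"dp", "math"}, {"greedy"}, {"graphs", "dp"}]
y_pred = [{"dp"}, {"greedy", "sortings"}, {"graphs", "dp"}]

print(round(hamming_loss(y_true, y_pred, len(LABELS)), 4))  # 0.1333
```

Here two of the three examples contain an error, yet only 2 of the 15 label decisions are wrong, so the Hamming loss is much smaller than the example-level error rate.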

A.7 Multilabel: Micro-averaged F1 score

Micro-averaged F1 score is the F-measure computed over the whole prediction matrix: the individual true positives, false positives, and false negatives are summed up across labels/classes, and the F-measure is then calculated from these totals.

A.8 Multilabel: Macro-averaged F1 score

Macro-averaged F1 score is calculated by computing the F1 score for each of the labels, then averaging the label-wise F1 scores.
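The two multilabel F1 variants differ only in where the averaging happens. A minimal sketch in plain Python, with hypothetical function names and toy data, and assuming a label with no true or predicted instances receives an F1 of 0:

```python
def label_counts(y_true, y_pred, label):
    """True positives, false positives and false negatives for one label."""
    tp = sum(label in t and label in p for t, p in zip(y_true, y_pred))
    fp = sum(label not in t and label in p for t, p in zip(y_true, y_pred))
    fn = sum(label in t and label not in p for t, p in zip(y_true, y_pred))
    return tp, fp, fn


def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(y_true, y_pred, labels):
    """Sum the counts over all labels first, then compute one F1."""
    tp = fp = fn = 0
    for label in labels:
        a, b, c = label_counts(y_true, y_pred, label)
        tp, fp, fn = tp + a, fp + b, fn + c
    return f1(tp, fp, fn)


def macro_f1(y_true, y_pred, labels):
    """Compute one F1 per label first, then average them."""
    return sum(f1(*label_counts(y_true, y_pred, l)) for l in labels) / len(labels)


# Toy label space and predictions (illustrative only).
LABELS = ["dp", "math", "greedy", "graphs", "sortings"]
y_true = [{"dp", "math"}, {"greedy"}, {"graphs", "dp"}]
y_pred = [{"dp"}, {"greedy", "sortings"}, {"graphs", "dp"}]

print(round(micro_f1(y_true, y_pred, LABELS), 4))  # 0.8
print(round(macro_f1(y_true, y_pred, LABELS), 4))  # 0.6
```

The micro average is dominated by frequent labels, whereas the macro average penalizes the two rare labels that are handled incorrectly just as heavily as the frequent ones.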

Appendix B Human accuracy

We did a human study with 5 participants on the CFML20 dataset (Table 6). Each participant is a recent graduate in computer science and a frequent competitive programmer. The results are reported in Table 6.

Appendix C Classes classification in CFML20

C.1 Problem category

The following classes belong to the problem category: probabilities, geometry, combinatorics, number theory, strings, trees, graphs, math, data structures.

C.2 Solution Category

The following classes belong to the solution category: dsu, binary search, dfs and similar, constructive algorithms, brute force, greedy, dp, bitmask, two pointers, sortings, implementation.
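For analyses that report results per category, the two lists above can be encoded as a simple lookup. This is an illustrative sketch; the names `PROBLEM_CATEGORY`, `SOLUTION_CATEGORY`, and `category_of` are hypothetical and not part of the released data.

```python
# Class names copied from the two category lists of CFML20.
PROBLEM_CATEGORY = {
    "probabilities", "geometry", "combinatorics", "number theory",
    "strings", "trees", "graphs", "math", "data structures",
}
SOLUTION_CATEGORY = {
    "dsu", "binary search", "dfs and similar", "constructive algorithms",
    "brute force", "greedy", "dp", "bitmask", "two pointers",
    "sortings", "implementation",
}


def category_of(label):
    """Return 'problem' or 'solution' for a CFML20 class name."""
    if label in PROBLEM_CATEGORY:
        return "problem"
    if label in SOLUTION_CATEGORY:
        return "solution"
    raise KeyError(f"unknown class: {label}")


print(category_of("graphs"))  # problem
print(category_of("greedy"))  # solution
print(len(PROBLEM_CATEGORY) + len(SOLUTION_CATEGORY))  # 20
```

The 9 problem classes and 11 solution classes are disjoint and together account for the 20 classes of CFML20.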

Bibliography


1Bishop (2006) Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg.
2Chang and Lin (2001) Chih-Chung Chang and Chih-Jen Lin. 2001. LIBSVM: a library for support vector machines.
3Fan et al. (2008) Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874.
4Forišek (2010) Michal Forišek. 2010. The difficulty of programming contests increases. In International Conference on Informatics in Secondary Schools-Evolution and Perspectives, pages 72–85. Springer.
5Gertin (2012) Thomas Gertin. 2012. Maximizing the cost of shortest paths between facilities through optimal product category locations. Ph.D. thesis.
6Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS’10). Society for Artificial Intelligence and Statistics.
7Glorot et al. (2011) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 513–520.
8Gulwani et al. (2017) Sumit Gulwani, Oleksandr Polozov, Rishabh Singh, et al. 2017. Program synthesis. Foundations and Trends® in Programming Languages, 4(1-2):1–119.