Hybrid NER System for Multi-Source Offer Feeds

Anusha Holla; Bharat Gaind; Vikas Reddy Katta; Abhishek Kundu; S Kamalesh

arXiv:1901.08406·cs.IR·June 12, 2019

TL;DR

This paper introduces a hybrid NER system combining multiple models to effectively extract key offer entities from diverse, unstructured web data, enhancing targeted advertising efforts.

Contribution

A novel hybrid NER model that stacks CRF, BiLSTM, and spaCy models under an SVM classifier, tailored for offer feed data from multiple sources.

Findings

01 Hybrid model outperforms existing NER models in offer domain

02 Effective extraction of offer entities from multi-source feeds

03 Improved accuracy in identifying offer-related information

Tables

TABLE I: Data Sources

Datasets	Dataset Source	Source URL	Number of Offers Scraped	Number of Templates Made	Number of Offers after Bloating
$D_{1}$	Axis Bank	https://www.axisbank.com/grab-deals/online-offers	91	35	651
$D_{2}$	ICICI Bank	https://www.icicibank.com/Personal-Banking/offers/offer-index.page	95	27	864
$D_{3}$	HDFC Bank	https://offers.smartbuy.hdfcbank.com/list_offer/credit_card/2	42	33	761
$D_{4}$	Grabon	https://www.grabon.in/paytm-coupons/	148	34	891
$D_{5}$	SBI Bank	https://www.sbicard.com/en/personal/offers.page	14	10	57

TABLE II: Comparison of various CRF Models

CRF Models	F1 score
$M_{CRF1}$	0.5125
$M_{CRF2}$	0.5497
$M_{CRF3}$	0.4618
$M_{CRF4}$	0.4044
$M_{CRF}$	0.6130

TABLE III: Overall F1 scores of the various models

Models	F1 score
$M_{CRF}$	0.6130
$M_{BLSTM}$	0.7761
$M_{spaCy}$	0.6870
$M_{Hybrid}$	0.8156

TABLE IV: Tag-wise F1 scores of the various models

Tag	$M_{CRF}$	$M_{BLSTM}$	$M_{spaCy}$	$M_{Hybrid}$
OAMT	0.7742	0.8110	0.6987	0.8366
OTYPE	0.6992	0.8571	0.7717	0.9714
MIN_AMT	0.4545	0.7397	0.6857	0.8750
MAX_AMT	0.1739	0.5945	0.0	0.7050
PRD	0.4706	0.8750	0.7407	0.8478
MERCH	0.5714	0.6560	0.5870	0.6458

Equations

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{F1\ score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

Taxonomy

Topics: Spam and Phishing Detection · Web Data Mining and Analysis · Topic Modeling

Methods: Sigmoid Activation · Tanh Activation · Support Vector Machine · Long Short-Term Memory

Full text

Anusha Holla, Bharat Gaind, Vikas Reddy Katta, Abhishek Kundu, S Kamalesh

Samsung Research Institute, Bangalore, India

Abstract

Data available across the web is largely unstructured. Offers published by multiple sources like banks, digital wallets, merchants, etc., are among the most accessed advertising data in today’s world. This data is accessed by millions of people every day and is easily interpreted by humans, but since it is largely unstructured and diverse, extracting meaningful information from these offers algorithmically is hard. Identifying the essential offer entities (for instance, the offer amount, the product on which the offer is applicable, the merchant providing the offer, etc.) plays a vital role in targeting the right customers to improve sales. This work presents and evaluates various existing Named Entity Recognizer (NER) models which can identify the required entities from offer feeds. We also propose a novel Hybrid NER model constructed by two-level stacking, with Conditional Random Field, Bidirectional LSTM and spaCy models at the first level and an SVM classifier at the second. The proposed hybrid model has been tested on offer feeds collected from multiple sources and has shown better performance in the offer domain when compared to the existing models.

Index Terms:

Named Entity Recognition, Data Mining, Machine Learning, Stanford NER, Bidirectional LSTM, Spacy, Support Vector Machines

I INTRODUCTION

Offers are one of the major sources of unstructured data in the marketing domain. They are also one of the most consumed datasets. Every single day, millions of customers read offer statements and extract meaning out of them, which they use to make their shopping more economical. It would be highly beneficial for the industry to use this wealth of data to enhance the existing customer shopping experience. If offers can be converted to a machine-readable format, algorithms could be developed to target the right customers, which can prove vital in improving sales. The motivation is to analyze marketing offers based on information extraction, in an industrial setting. One use-case where extracting the constituent entities/attributes of offers could be important is an organization/business trying to understand the offers made by its competitors in the market. The solutions proposed in this paper could be utilized by a third-party business to create a portal where the marketing offers of these competitors could be compared, which the business can then use to provide a better offer to its customers and thus improve sales. Another use-case could be to filter out the unnecessary offers received by users (as SMS messages on their phones), giving them personalized offers and avoiding clutter. Yet another use-case could be a continuation of the work done by Ujwal et al. [1], which proposes a method to scrape offers from offer-aggregator websites. The Hybrid Model we propose could be used to extract meaningful entities from these scraped offers. All this is only possible if the essential elements that make up the offers are correctly understood.

However, there are multiple challenges in doing this. One of these challenges is the problem of data variety. Offers come from numerous sources in various formats - all in natural language. It is difficult to convert these offers to a machine-readable format (like JSON). Also, the structure of the offers from a source is prone to vary. In this paper, we try to address these challenges and enhance the prediction accuracy by proposing a novel Hybrid Named Entity Recognition (NER) system, constructed by two-level stacking of Conditional Random Field (CRF), Bidirectional LSTM and spaCy [2] models in the first level and a Support Vector Machine (SVM) classifier in the second. These models have been implemented using some very popular Natural Language Processing (NLP) and Machine Learning (ML) libraries, such as Stanford NER [3], Keras [4], spaCy and scikit-learn [5]. We also evaluate and compare the independent NER models (the ones used at the first level: CRF, BLSTM, spaCy) and the Hybrid Model by training them on four known sources and subsequently testing them on an unknown fifth one. It is found that the proposed Hybrid Model has a significantly higher accuracy when compared to the other models. Therefore, it can be used to efficiently extract various important entities in offer feeds.

II LITERATURE REVIEW

Named Entity Recognition is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories [6]. There are a number of algorithms that can be used for Named Entity Recognition. Various Named Entity Recognition systems have been developed in the last two decades. But, there has not been a significant effort to analyze the complex marketing offers, which is a very important domain (as explained in the previous section). In the effort of building NERs in the offer domain, we have drawn inspiration from various previous works/literature.

Initially, statistical methods were commonly applied to build Named Entity Recognizers [7]. Recently, neural architectures have gained popularity for Named Entity Recognition. The work of Huang et al. [8] discusses the Bidirectional LSTM for sequence tagging. The work of Shriberg et al. [9] and Lafferty et al. [7] has shown that CRFs can produce higher tagging accuracy. Comparisons made by R. Jiang et al. [10] showed that spaCy performed best, next to Stanford NER. Another method is Stacking, which allows blended intelligence from many different approaches to be combined into one superior result. Stacked generalization was introduced by Wolpert [11]. We take inspiration from the concepts/works described above to build our proposed Hybrid system, which shows significantly better results than any of the existing/popular NER systems (also evaluated in this paper) in the marketing offers domain.

III DATASET

The offer-data is collected by scraping offers from five different sources. Four of these sources are banks, and the fifth is an offer-aggregator website. The offers contained in each of these sources are very diverse and different in structure from one another. Each offer contains some entities/attributes that constitute the offer. We call each such entity a tag. The following is the list of tags in an offer that we are interested in extracting:

• OAMT - Offer amount

• OTYPE - Offer Type (discount, cashback, voucher)

• MIN_AMT - Minimum purchase amount above which offer is valid

• MAX_AMT - Maximum offer amount

• PRD - Product on which the offer is valid

• MERCH - Name of the Merchant offering the Offer

• O - Any token we’re not interested in extracting as an offer-entity should be tagged as Other (O).

Since the number of offers obtainable from these sources is limited and not enough to train an NER model, we use offer-templates (generic structures that the maker of an offer follows while creating the offer) to generate a large number of offers. For example, the offer “Get 20% off on pizzas at Dominos” follows the generic offer-template “Get OAMT OTYPE on PRD at MERCH” (where OAMT, OTYPE, etc. are tags). We first convert the scraped offers from each source into its corresponding set of offer-templates. Five labeled datasets (each containing a large number of offers) are then created, one per set of offer-templates, by bloating the templates, that is, by randomly filling their constituent tags with appropriate values. Finally, we tokenize all these datasets. To tokenize the input uniformly for all our NER models, we use the spaCy tokenizer. The resultant labeled datasets are called the tokenized datasets, which are subsequently used for supervised learning. For simplicity, we refer to them as $D_{i}$ ($i = 1, 2, \ldots, 5$). Four of these datasets ( $D_{1}$ , $D_{2}$ , $D_{3}$ , $D_{4}$ ) are used for training and the fifth one ( $D_{5}$ or $D_{test}$ ) for testing. The details of these datasets are shown in Table I.
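The template-bloating step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the value pools in `TAG_VALUES`, the helper name `bloat`, and the use of whitespace splitting in place of the spaCy tokenizer are all assumptions made for the example.

```python
import random

# Hypothetical value pools; the paper does not list the values used for bloating.
TAG_VALUES = {
    "OAMT": ["20%", "Rs. 500", "10%"],
    "OTYPE": ["off", "cashback", "voucher"],
    "PRD": ["pizzas", "apparels", "flights"],
    "MERCH": ["Dominos", "Lifestyle", "Paytm"],
}


def bloat(template, rng):
    """Fill each tag placeholder in a template with a random value.

    Returns a list of (token, tag) pairs. Words that are not placeholders
    are labeled "O". Whitespace splitting stands in for the spaCy tokenizer.
    """
    pairs = []
    for word in template.split():
        if word in TAG_VALUES:
            value = rng.choice(TAG_VALUES[word])
            # A multi-word value contributes one labeled pair per word.
            pairs.extend((token, word) for token in value.split())
        else:
            pairs.append((word, "O"))
    return pairs


rng = random.Random(0)  # fixed seed so the generated offers are repeatable
template = "Get OAMT OTYPE on PRD at MERCH"
offers = [bloat(template, rng) for _ in range(3)]
for offer in offers:
    print(offer)
```

Each generated offer is already labeled, because the tag of every filled-in token is known from the template placeholder it replaced.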

IV SYSTEM ARCHITECTURE

In this paper, we use three independent models for the purpose of Named Entity Recognition (NER): CRF Model, BLSTM Model, and spaCy Model. Then, we use an SVM Classifier to combine these models and propose a Hybrid Model.

IV-A CRF Model

Conditional Random Field (CRF) is a probabilistic sequence model, mainly used for NER. It is a framework for building probabilistic models to segment and label sequential data. CRFs are preferred because they relax the independence assumptions made by models like HMMs (Hidden Markov Models) and stochastic grammars [7].

In this paper, we use Stanford NER to implement the CRF classifier, which has a Java-based implementation of the same. It expects its input (a tokenized dataset) as pairs of tab-separated tokens (words) and tags, in separate lines, where each offer-message is separated by two new lines. The following features are set to true in Stanford NER while training the CRF model:

• usePrev

• useNext

• useTags

• useWordPairs

• usePrevSequences

• useNextSequences

• useLemmas

• useLemmaAsWord

• normalizeTerms

• normalizeTimex

• usePosition

• useBeginSent

The output generated by this model is the probability of each tag for every token.
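As a rough sketch of the input format described above, the snippet below renders labeled offers as tab-separated token/tag lines with a blank line between offer-messages, and builds a matching properties text. The helper names and file names are hypothetical. `trainFile`, `serializeTo` and `map` are the property names commonly used in Stanford NER training configurations; the exact configuration used by the authors is not given in the paper, and the spelling `useNextSequences` follows Stanford NER's flag naming.

```python
# Feature flags listed in the text for training the CRF model.
CRF_FLAGS = [
    "usePrev", "useNext", "useTags", "useWordPairs",
    "usePrevSequences", "useNextSequences", "useLemmas", "useLemmaAsWord",
    "normalizeTerms", "normalizeTimex", "usePosition", "useBeginSent",
]


def to_stanford_tsv(offers):
    """Render offers as token<TAB>tag lines, with a blank line between offers."""
    blocks = [
        "\n".join(f"{token}\t{tag}" for token, tag in offer)
        for offer in offers
    ]
    return "\n\n".join(blocks) + "\n"


def to_properties(flags, train_file, model_file):
    """Build a properties text that sets every listed feature flag to true."""
    lines = [
        f"trainFile = {train_file}",
        f"serializeTo = {model_file}",
        "map = word=0,answer=1",  # column 0 is the token, column 1 is the tag
    ]
    lines.extend(f"{flag} = true" for flag in flags)
    return "\n".join(lines) + "\n"


offers = [
    [("Get", "O"), ("20%", "OAMT"), ("off", "OTYPE")],
    [("Flat", "O"), ("cashback", "OTYPE")],
]
print(to_stanford_tsv(offers))
# File names below are placeholders, not taken from the paper.
print(to_properties(CRF_FLAGS, "d_comb1.tsv", "m_crf.ser.gz"))
```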

In the future, offers may arrive from a new, unknown source. Also, the structure of offers coming from a particular source is prone to vary. Hence, there is a need for a system that is agnostic to the source of an offer. It is therefore better to combine all the tokenized training datasets ( $D_{1}$ , $D_{2}$ , $D_{3}$ , $D_{4}$ ) into a single combined dataset $D_{comb}$ , so that the final dataset used for training contains as many diverse offer-templates as possible. To further justify the need for a combined dataset, we experimented by training separate CRF models on the individual datasets ( $D_{1}$ , $D_{2}$ , $D_{3}$ , $D_{4}$ ) and another model on the combined dataset. It was found (see results in Section V) that the accuracy was higher for the combined dataset model than for the individual dataset models. $D_{comb}$ is further divided into two equal sets: $D_{comb1}$ and $D_{comb2}$ . $D_{comb1}$ is used to train the three independent models (CRF, BLSTM and spaCy) and $D_{comb2}$ is used to train the Hybrid model. The CRF model trained using the dataset $D_{comb1}$ is referred to as $M_{CRF}$ .
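The combine-and-split step might look like the sketch below. Shuffling before the split and the fixed seed are assumptions; the paper only states that the combined dataset is divided into two equal sets. The dataset sizes are the bloated offer counts from Table I, and each offer is replaced by a placeholder `(source, index)` tuple.

```python
import random


def combine_and_split(datasets, seed=0):
    """Merge the training datasets and split the result into two halves."""
    combined = [offer for dataset in datasets for offer in dataset]
    # Assumption: shuffle so that both halves mix offers from every source.
    random.Random(seed).shuffle(combined)
    half = len(combined) // 2
    return combined[:half], combined[half:]


# Placeholder offers sized like D1..D4 after bloating (see Table I).
sizes = {"D1": 651, "D2": 864, "D3": 761, "D4": 891}
datasets = [[(name, i) for i in range(n)] for name, n in sizes.items()]

d_comb1, d_comb2 = combine_and_split(datasets)
print(len(d_comb1), len(d_comb2))
```

With these sizes the combined dataset holds 3167 offers, so the two halves differ by one offer.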

IV-B BLSTM Model

In the last few years, Recurrent Neural Networks (RNNs) have shown significant results in a variety of tasks like speech recognition, language modeling, translation, and image captioning. The idea of RNNs is that they use previous information while predicting the tag for the current token (word). Consider the offer “Shop at Lifestyle and get flat 20% off on apparels” and the offer “Get instant 20% off on Lifestyle”. In the first example, the token following “on” (the last token of the sentence) should be tagged as PRD, whereas in the second example, the token following “on” should be tagged as MERCH. To predict what comes after “on”, we need a history of what has already been seen in the sentence. In practice, plain RNNs struggle to learn long-term dependencies [12], which is why Long Short Term Memory (LSTM) is needed. In the first example, the information that MERCH was already seen at the beginning of the sentence can be used by an LSTM model to predict what comes after “on” (PRD in this case). Also, since predicting the tag of a token accurately requires the long-term dependencies on both its left and its right side, we use a Bi-directional LSTM (BLSTM) [13] for the purpose of NER.

The BLSTM model is implemented using Keras. It is trained using the dataset $D_{comb1}$ (as explained in the previous section). The input to the model is a list, where each element is itself a list of pairs of tokens and tags of an offer-message. Each token in an offer-message is converted to a one-hot encoding, and a GloVe embedding [14] is applied to obtain a 300-dimensional vector for every token. Each offer-message is padded with zeroes so that all offer-messages have equal length. The output from the hidden states is a 64-dimensional vector, which is passed through a softmax activation function to obtain a 7-dimensional vector (because the number of tags is 7). This vector holds the probability scores of the tags for each token. The BLSTM model thus built is denoted $M_{BLSTM}$ .
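Two of the steps in this pipeline, zero-padding and the final softmax over the 7 tags, can be illustrated without any deep-learning library. The sketch below is not the Keras model itself: the tag order in `TAGS`, the token ids, and the raw scores are made-up values used only to show the shapes involved.

```python
import math

# Assumed tag order; the paper does not specify how tags are indexed.
TAGS = ["O", "OAMT", "OTYPE", "MIN_AMT", "MAX_AMT", "PRD", "MERCH"]


def pad(sequences, pad_value=0):
    """Right-pad every sequence with pad_value to the length of the longest."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]


def softmax(scores):
    """Turn raw per-tag scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Two offer-messages encoded as hypothetical token ids of different lengths.
padded = pad([[4, 17, 9], [4, 17, 9, 23, 8]])
print(padded)

# Hypothetical raw scores for one token, one score per tag.
probs = softmax([0.1, 2.3, 0.4, -1.0, -0.5, 0.2, 0.0])
print(TAGS[probs.index(max(probs))], round(sum(probs), 6))
```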

IV-C spaCy Model

spaCy is an open-source software library for advanced Natural Language Processing, written in Python and Cython. Ridong Jiang et al. [10] showed that spaCy performed best, next to Stanford NER.

The expected input for spaCy is a list, where every element is itself a list of the offer-message sentence, the start and end index in that sentence of the token that corresponds to a tag, and finally, the tag itself. For training, we used the default English model in spaCy. This model is also trained using the tokenized dataset $D_{comb1}$ . The tokens from $D_{comb1}$ are fed into spaCy’s EntityRecognizer. It generates docs (a sequence of tokens) for each offer-message, which when fed into the GoldParse, along with the tag offsets (a list of tag locations in the offer-message), produces gold-standard tokens. These tokens and their associated tags are then fed to spaCy’s EntityTagger to train the model. The model is updated (retrained) for every offer-message. The output of this model is the tag associated with each token, whereas the list of probabilities associated with the tokens is not given. The model built from spaCy is represented as $M_{spaCy}$ .
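The offset-based input described above can be derived from token/tag pairs with a short helper. This is a sketch under assumptions: the function name `to_offset_example` is hypothetical, tokens are joined with single spaces to rebuild the sentence, and one span is emitted per tagged token.

```python
def to_offset_example(pairs):
    """Convert (token, tag) pairs into (sentence, [(start, end, tag), ...]).

    Offsets are character positions in the rebuilt sentence, with `end`
    exclusive. Tokens tagged "O" produce no span.
    """
    tokens = []
    offsets = []
    position = 0
    for token, tag in pairs:
        if tokens:
            position += 1  # account for the single space before this token
        start = position
        position += len(token)
        tokens.append(token)
        if tag != "O":
            offsets.append((start, position, tag))
    return " ".join(tokens), offsets


pairs = [
    ("Get", "O"), ("20%", "OAMT"), ("off", "OTYPE"), ("on", "O"),
    ("pizzas", "PRD"), ("at", "O"), ("Dominos", "MERCH"),
]
sentence, offsets = to_offset_example(pairs)
print(sentence)
for start, end, tag in offsets:
    print(tag, repr(sentence[start:end]))
```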

IV-D The Hybrid Model

In each of the models explained above, we are relying on a single model for entity recognition. But, diversification of models provides a more robust prediction. Hence, ensembling is used. Ensembling is a technique of combining the individual predictions of multiple models to give superior results. The resulting model is often much more accurate than the constituent individual classifiers [15], [16].

There are three main methods of ensembling: Bagging, Boosting and Stacking. Bagging (short for Bootstrap Aggregation) improves classification by combining the classifications obtained from randomly generated training sets [17]; it aims to decrease variance. In Boosting, the data misclassified by the previous classifier is used to train the next classifier, and all the classifiers are aggregated using majority voting; it aims to decrease bias. In Stacking, we use a pool of base classifiers and then use another classifier to combine their predictions, with the aim of reducing the generalization error. Since our application needs to reduce both variance and bias, we make use of stacking. The stacked model is able to discern where each model performs well and where it performs poorly.

The Hybrid Model we propose is constructed using two-level stacking. Three models are used at the first level: $M_{CRF}$ , $M_{BLSTM}$ and $M_{spaCy}$ (as trained in the previous sections). A Linear SVM classifier is used at the second level. It is a standard method for large-scale classification tasks and is preferred because it is one of the best multi-class text classifiers. This classifier is implemented using scikit-learn’s linear SVM classifier with the hinge loss function. The two levels of the Hybrid model are depicted in Fig. 1.

The following steps are used for training the Hybrid Model:

• First, we feed the dataset $D_{comb2}$ as input to $M_{CRF}$ , $M_{BLSTM}$ and $M_{spaCy}$ .

• For every token, the output of $M_{CRF}$ (a 7-dimensional vector of the probabilities of all 7 tags for every token), $M_{BLSTM}$ (another 7-dimensional vector of the probabilities of all 7 tags for every token) and $M_{spaCy}$ (an integer in the range [0, 5] depicting the tag predicted for a token) is merged to form a 15-dimensional vector.

• A list ( $l_{X}$ ) of such 15-dimensional vectors (with each vector representing a token), created by merging all the tokens in all the offers in $D_{comb2}$ , is fed as input to train the SVM classifier. Another list ( $l_{Y}$ ) containing the correct tags (already present in the dataset) for each of the tokens is also fed as input to the classifier. For example, if there are 100 offers, and each offer has an average of 10 tokens, $l_{X}$ will have 1000 15-dimensional vectors, whereas $l_{Y}$ will contain 1000 correct tags, corresponding to each of the tokens.

The output of the model is the tag associated with each token (word) of an offer-message. The Hybrid model, thus formed, is represented as $M_{Hybrid}$ .
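The construction of the second-level training input can be sketched as below. The function names, the probability values and the integer encoding of the spaCy tag are illustrative assumptions; only the 7 + 7 + 1 = 15 layout comes from the training steps described above.

```python
def stack_features(crf_probs, blstm_probs, spacy_tag_id):
    """Merge the three first-level outputs for one token into a 15-dim vector."""
    if len(crf_probs) != 7 or len(blstm_probs) != 7:
        raise ValueError("expected 7 tag probabilities from CRF and BLSTM")
    return list(crf_probs) + list(blstm_probs) + [spacy_tag_id]


def build_training_lists(token_outputs, gold_tags):
    """Build l_X (one 15-dim vector per token) and l_Y (one gold tag per token)."""
    l_x = [stack_features(crf, blstm, spacy) for crf, blstm, spacy in token_outputs]
    l_y = list(gold_tags)
    if len(l_x) != len(l_y):
        raise ValueError("every token needs exactly one gold tag")
    return l_x, l_y


# Hypothetical first-level outputs for two tokens.
token_outputs = [
    ([0.90, 0.02, 0.02, 0.01, 0.01, 0.02, 0.02],
     [0.80, 0.05, 0.05, 0.02, 0.02, 0.03, 0.03], 0),
    ([0.10, 0.70, 0.05, 0.05, 0.05, 0.03, 0.02],
     [0.05, 0.85, 0.02, 0.02, 0.02, 0.02, 0.02], 1),
]
l_x, l_y = build_training_lists(token_outputs, ["O", "OAMT"])
print(len(l_x), len(l_x[0]), l_y)
```

Lists of this shape could then be passed to a linear SVM, for instance scikit-learn's `LinearSVC(loss="hinge")` through its `fit(l_x, l_y)` method. Which scikit-learn class the authors actually used is not clear from the text, so that choice is an assumption.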

V RESULTS AND DISCUSSION

In this section, we test the various models we trained in the previous sections: $M_{CRF}$ , $M_{BLSTM}$ , $M_{spaCy}$ and $M_{Hybrid}$ , using the metric F1 score/F Measure. But before that, we define the various metrics, needed to evaluate the F1 score of our models:

• True Positive (TP): The token is correctly classified as one of the six tags: OAMT, OTYPE, MIN_AMT, MAX_AMT, PRD and MERCH.

• True Negative (TN): The token is correctly classified as the tag O (which is not a tag we’re interested in extracting).

• False Positive (FP): The token is misclassified as one of the six tags: OAMT, OTYPE, MIN_AMT, MAX_AMT, PRD and MERCH.

• False Negative (FN): The token is misclassified as the tag O.

The precision, recall and finally the F1 score are calculated using the following formulas:

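$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{F1\ score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$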
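A token-level computation of these metrics, following the TP/FP/FN definitions given in this section, could look like the sketch below. The function name and the example tag sequences are hypothetical. One reading is assumed here: a token whose true tag is an entity but which is predicted as a different entity tag counts as a false positive, since it is misclassified as one of the six tags.

```python
def offer_f1(gold_tags, predicted_tags):
    """Return (precision, recall, f1) over aligned gold and predicted tags."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_tags, predicted_tags):
        if pred == "O":
            if gold != "O":
                fn += 1  # an entity token was misclassified as O
            # gold == "O" is a true negative, which the formulas do not use
        elif pred == gold:
            tp += 1  # correctly classified as one of the six tags
        else:
            fp += 1  # wrongly classified as one of the six tags
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = ["O", "OAMT", "OTYPE", "O", "PRD", "O", "MERCH"]
pred = ["O", "OAMT", "OTYPE", "O", "MERCH", "O", "O"]
precision, recall, f1 = offer_f1(gold, pred)
print(precision, recall, f1)
```

In this example there are two true positives, one false positive (PRD predicted as MERCH) and one false negative (MERCH predicted as O), so precision, recall and F1 all equal 2/3.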

Before proceeding with the testing of the various trained models, we first show that a model trained on the combined dataset ( $D_{comb1}$ ) gives better accuracy than models trained on the individual datasets $D_{1}$ , $D_{2}$ , $D_{3}$ , $D_{4}$ (as explained in Section IV-A). For this, we train four CRF models, ${M_{CRF1}}$ , ${M_{CRF2}}$ , ${M_{CRF3}}$ , ${M_{CRF4}}$ , corresponding to the datasets $D_{1}$ , $D_{2}$ , $D_{3}$ , $D_{4}$ , and use the already trained CRF model ${M_{CRF}}$ , corresponding to the dataset $D_{comb1}$ (trained in Section IV-A). We tested all five models on $D_{test}$ , as shown in Table II. It can be seen that the F1 score of ${M_{CRF}}$ (0.6130) is higher than that of every model trained on an individual dataset (at most 0.5497), which further justifies the need to diversify the datasets by combining them.

Now, we test the models $M_{CRF}$ , $M_{BLSTM}$ , $M_{spaCy}$ and $M_{Hybrid}$ on $D_{test}$ . The overall F1 scores (calculated using the total TPs, FNs and FPs across all tags) for all models are shown in Table III. The F1 scores of all 6 tags for each of the models are shown in Table IV.

The proposed Hybrid Model was tested on the same dataset as the rest of the models, and as we can see, its F1 score (the last row in Table III) is higher than that of the other models. The Hybrid Model's F1 score is 3.95 percentage points higher than that of the BLSTM Model (0.8156 versus 0.7761), which is the most accurate among the three independent models (CRF, BLSTM, spaCy). The reason for this is that while training, the hybrid model assigns different weights to different models, based on their performance on the various tags. In other words, an informed decision is made and more weight is assigned to the better performing models for a particular tag. The better performance of the proposed model is also evident from the tag-wise F1 scores reported in Table IV, where it scores highest on four of the six tags (OAMT, OTYPE, MIN_AMT and MAX_AMT), while the BLSTM model scores higher on PRD and MERCH. Another important point is that since the dataset $D_{test}$ is completely unknown to the hybrid model, it simulates the case in which the offer-structure has changed in a known source (one that was used to train the model). Therefore, the good performance of the hybrid model indicates that the problem of structure change in an offer-source has been addressed.

VI CONCLUSION

In this paper, we evaluate the various existing/popular NER models (CRF, BLSTM, spaCy) to analyze marketing offers, in an industrial setting. We also propose a Hybrid model, constructed by two-level stacking. Amongst all the models, the Hybrid Model gives the best results, when tested on an unknown source. We also try to solve the problem of data variety and structure-change, using this model. This work can be further extended by training on more than four sources, so as to get better accuracies. Furthermore, apart from the marketing offer domain, the proposed Hybrid Model can be extended to other domains of interest as well.

Bibliography

[1] B. Ujwal, B. Gaind, A. Kundu, A. Holla, and M. Rungta, “Classification-based adaptive web scraper,” in Machine Learning and Applications (ICMLA), 2017 16th IEEE International Conference on, pp. 125–132, IEEE, 2017.
[2] M. Honnibal, “spaCy (version 1.8), available from https://spacy.io/,” 2016.
[3] “Stanford NER.” https://nlp.stanford.edu/software/CRF-NER.html.
[4] F. Chollet, “Keras.” Keras (Version 2.0.2) https://keras.io.
[5] D. Cournapeau, “scikit-learn.” scikit-learn (Version 0.18.1) https://scikit-learn.org.
[6] “Named entity recognition.” https://en.wikipedia.org/wiki/Named-entity_recognition. Accessed: 2018-02-28.
[7] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, (San Francisco, CA, USA), pp. 282–289, Morgan Kaufmann Publishers Inc., 2001.
[8] Z. Huang, W. Xu, and K. Yu, “Bidirectional LSTM-CRF models for sequence tagging,” 2015.