TL;DR
This paper presents a method for extracting Vietnamese Facebook Messenger conversations and clustering them with PhoBERT embeddings, so that training data for chatbots can be built with less time and manual effort.
Contribution
The study introduces a novel approach combining Facebook data extraction, PhoBERT-based feature extraction, and clustering algorithms with parameter optimization for Vietnamese chatbot training data.
Findings
PhoBERT outperforms the other models compared for feature extraction on the Sample and wiki datasets.
K-Means and DBSCAN clustering on PhoBERT embeddings achieve high V-measure and Silhouette scores.
The method significantly reduces time and effort in dataset creation.
Abstract
The biggest challenge in building chatbots is training data. The data must be realistic and large enough to train a chatbot. We built a tool that collects real training data from the Facebook Messenger conversations of a Facebook page. After text preprocessing, the collected data yields two datasets, FVnC and Sample. We use the Retraining of BERT for Vietnamese (PhoBERT) to extract features from our text data. The K-Means and DBSCAN clustering algorithms are then applied to the output embeddings from PhoBERT. We use the V-measure score and the Silhouette score to evaluate clustering performance. We also demonstrate the efficiency of PhoBERT compared to other models for feature extraction on the Sample dataset and a wiki dataset. A GridSearch algorithm that combines both clustering evaluations is also proposed to find optimal parameters. Thanks to clustering such…
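The grid search over clustering parameters described in the abstract can be illustrated with a small sketch. The abstract does not say how the two evaluations are combined, so the plain average used here is an assumption, as are the function names, the toy 2-D points (standing in for PhoBERT embeddings), and the minimal K-Means routine. It is meant only to show how V-measure (which needs reference labels) and the Silhouette score (which does not) might be scored together when choosing a parameter such as the number of clusters.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(a, b):
    """H(a | b) for two aligned label sequences."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum(c / n * math.log(c / b_counts[bj]) for (_, bj), c in joint.items())


def v_measure(truth, pred):
    """Harmonic mean of homogeneity and completeness."""
    h_c, h_k = entropy(truth), entropy(pred)
    hom = 1.0 if h_c == 0 else 1.0 - conditional_entropy(truth, pred) / h_c
    com = 1.0 if h_k == 0 else 1.0 - conditional_entropy(pred, truth) / h_k
    return 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)


def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return -1.0  # assumption: treat the undefined one-cluster case as worst
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute 0
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(sum(math.dist(points[i], points[j]) for j in idx) / len(idx)
                for other, idx in clusters.items() if other != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def kmeans(points, k, iters=20):
    """Minimal K-Means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))

    def assign():
        return [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]

    for _ in range(iters):
        labels = assign()
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign()


def grid_search(points, truth, ks):
    """Pick the k whose clustering maximises the averaged score (assumed rule)."""
    best = None
    for k in ks:
        labels = kmeans(points, k)
        score = (v_measure(truth, labels) + silhouette(points, labels)) / 2
        if best is None or score > best[1]:
            best = (k, score)
    return best


# Toy stand-in for sentence embeddings: three well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
best_k, best_score = grid_search(points, truth, ks=[2, 3, 4])
print(best_k, round(best_score, 3))
```

In practice the points would be embedding vectors and the clustering step would come from a library implementation of K-Means or DBSCAN; the scoring loop stays the same, with the parameter grid swapped for whatever the chosen algorithm exposes.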
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Residual Connection · Softmax · WordPiece · Adam · Linear Warmup With Linear Decay
