Using Machine Learning to Perform Text Clustering

Introduction

An interesting branch of machine learning is Natural Language Processing (NLP). As the name suggests, it involves training machines to detect patterns in language using algorithms. NLP is often referred to as text analytics, but it is actually more impressive than that: it examines vectorised representations that capture not only the position of each element but also what that element means in the context of its neighbours within the vector. In a nutshell, the technique can be extended beyond text to patterns of language in general, and even to other contextual patterns. Nevertheless, its primary use in the machine learning world is to analyse text.

This article will focus on an interesting application of NLP: the clustering of text. Clustering is a popular unsupervised machine learning technique used for segmenting or grouping data, and it is a very powerful tool used across a variety of industries. However, it is rare to hear of clustering being applied to text. This can be achieved by combining NLP functions with clustering algorithms that can handle non-Euclidean distances.

Word Embeddings

Word embeddings are commonly used in NLP, due to their ability to account for context within a corpus (corpus — a fancy name for a large structured set of text). Routines such as TF-IDF (Term Frequency — Inverse Document Frequency) have the ability to weight ‘valuable’ words based on occurrence but lack the ability to take surrounding words into account. This is where Word2Vec routines shine. The two most common variations for these are CBOW (Continuous Bag of Words) and Skip-gram.

CBOW predicts the probability of a word given its context, while Skip-gram predicts the context given a single word [1]. This article is going to focus on the Word2Vec routine from the library Gensim (though there are many other libraries). Gensim uses either CBOW or Skip-gram to generate text embeddings.
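
As a quick illustration (not taken from the original article), Gensim's Word2Vec class exposes an sg flag that switches between the two training algorithms: sg=0 selects CBOW and sg=1 selects Skip-gram. A minimal sketch, using a hypothetical toy corpus:

```python
from gensim.models import Word2Vec

# A tiny toy corpus purely for illustration (hypothetical example data).
toy_corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]

# sg=0 (the default) trains with CBOW; sg=1 trains with Skip-gram.
cbow_model = Word2Vec(toy_corpus, sg=0, min_count=1)
skipgram_model = Word2Vec(toy_corpus, sg=1, min_count=1)
```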

Scenario

Consider these three sentences:

· I am testing this word2vec routine by seeing if sentences are related

· I am attempting to see if this sentence is similar to the previous one using the word2vec routine

· My morning routine consists of eating crayons and gluing carrots to my face

It is clear that the first two sentences are quite similar in what they’re saying. The third sentence is very different to the first two. How can we group these sentences together? The first step is to vectorise the sentences into a computer-readable form.

Below you can see the sentences assigned to variables as strings, which are then collected into a Python list.
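
The original code is not reproduced here, but a minimal sketch of that step might look like the following (the variable names are illustrative, and the sentences are lower-cased and split on whitespace so they are in the tokenised form Gensim expects):

```python
# The three example sentences as plain strings.
sentence_1 = "I am testing this word2vec routine by seeing if sentences are related"
sentence_2 = "I am attempting to see if this sentence is similar to the previous one using the word2vec routine"
sentence_3 = "My morning routine consists of eating crayons and gluing carrots to my face"

sentences = [sentence_1, sentence_2, sentence_3]

# Gensim's Word2Vec expects a tokenised corpus: a list of lists of words.
tokenised_corpus = [s.lower().split() for s in sentences]
```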

This list is passed to the Word2Vec object as an argument (see below). Note that two additional arguments, ‘vector_size’ and ‘min_count’, are used to control the output:

vector_size — controls the size of the output vectors. Note that the effectiveness of the model is very sensitive to this value, so multiple sizes should be tested before using the model. Determining the right vector size is not well defined; some useful techniques include testing statistical significance through forward feature augmentation [2]. For this example, a size of 10 has been chosen.

min_count — the minimum number of times a word must appear in the corpus for it to be included in the model's vocabulary. A value of 1 has been chosen, since we are only considering three short sentences in this example.

The code below initialises the model with these input parameters. The model is trained on the tokenised corpus, and the learned word vectors can then be retrieved from it.
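
A sketch of that step, assuming Gensim 4.x and the tokenised_corpus list from the previous snippet:

```python
from gensim.models import Word2Vec

# Initialise and train the model; vector_size=10 and min_count=1 match
# the parameter choices discussed above.
model = Word2Vec(tokenised_corpus, vector_size=10, min_count=1)

# Individual word vectors can now be looked up from the trained model.
print(model.wv["routine"])   # a vector of length 10
```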

This is illustrated below where the sentences are converted into vectors of size 10.
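
The article does not show exactly how the sentence vectors are built; a common approach, assumed here, is to average the vectors of the words in each sentence:

```python
import numpy as np

def sentence_vector(tokens, keyed_vectors):
    # Average the word vectors to obtain one fixed-length vector per sentence.
    return np.mean([keyed_vectors[token] for token in tokens], axis=0)

sentence_vectors = np.array(
    [sentence_vector(tokens, model.wv) for tokens in tokenised_corpus]
)
print(sentence_vectors[0])   # sentence 1 as a vector of size 10
```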

Printing sentence 1 will output a vector of size 10: [0.002, -0.002, 0.001, 0.0003, 0.016, 0.008, 0.031, 0.0175, -0.0230, 0.006]. These vectors can be compared against one another using cosine similarity. This is demonstrated below, where sentences one and two are found to be very similar, while sentence 3 is significantly different from the first two.

Notice that the cosine distance values have been subtracted from 1. This rescales the measure so that 1 means two sentences are identical and 0 means no resemblance at all. Looking at the outputs, the comparison between sentences 1 and 2 gives a value of 0.86, while sentences 1 and 3 give 0.29 and sentences 2 and 3 give 0.25.
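
A sketch of that comparison, assuming SciPy's cosine distance function (which is what gets subtracted from 1):

```python
from scipy.spatial.distance import cosine

# scipy's cosine() returns a distance, so 1 - distance gives a similarity
# where 1 means identical direction and 0 means no resemblance.
sim_1_2 = 1 - cosine(sentence_vectors[0], sentence_vectors[1])
sim_1_3 = 1 - cosine(sentence_vectors[0], sentence_vectors[2])
sim_2_3 = 1 - cosine(sentence_vectors[1], sentence_vectors[2])

print(sim_1_2, sim_1_3, sim_2_3)   # roughly 0.86, 0.29 and 0.25 in the article's run
```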

Now let's use this relationship to do some clustering. We will use DBSCAN, a density-based clustering method that allows a custom distance metric to be used within the calculation. There are many clustering algorithms that could be used, but DBSCAN was chosen for two reasons. Firstly, it does not require the user to specify the number of clusters as an input. Secondly, it adapts well to cluster shapes that are not convex or spherical (e.g. spiral shapes). An alternative to DBSCAN is OPTICS (Ordering Points To Identify the Clustering Structure), which achieves similar results.

The cosine distance function is passed to the DBSCAN model as its metric, along with the parameters ‘eps’ and ‘min_samples’. The parameter ‘eps’ sets the maximum distance between two points for one to be considered part of the other's neighbourhood. The parameter ‘min_samples’ controls how many points must lie within that neighbourhood for a point to be treated as a core point of a cluster.

The code below outputs the grouping vector to the variable ‘out_arr’.
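
A sketch of that step using scikit-learn's DBSCAN with SciPy's cosine distance as the metric; the eps and min_samples values here are illustrative choices, not necessarily those used in the original code:

```python
from scipy.spatial.distance import cosine
from sklearn.cluster import DBSCAN

# eps=0.5 sits inside the ~0.2-0.7 range mentioned later in the article;
# min_samples=1 lets a single sentence form its own cluster.
db = DBSCAN(eps=0.5, min_samples=1, metric=cosine)
out_arr = db.fit_predict(sentence_vectors)
print(out_arr)   # e.g. [0 0 1]
```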

This is now used to create an output array displaying the results.
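
One way to build that output array, pairing each sentence with the cluster label assigned by DBSCAN:

```python
# Pair each original sentence with its cluster label.
results = [[sentence, str(label)] for sentence, label in zip(sentences, out_arr)]
print(results)
```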

Here you will see the following output:

[['I am testing this word2vec routine by seeing if sentences are related', '0'],
['I am attempting to see if this sentence is similar to the previous one using the word2vec routine', '0'],
['My morning routine consists of eating crayons and gluing carrots to my face', '1']]

Clearly, the text has been grouped correctly. The grouping depends on the value of eps used for DBSCAN; in this example the clustering works correctly for values of roughly 0.2–0.7.

Conclusion

Clustering text is an extremely useful technique that is often overlooked. It can be applied across many industries, for tasks such as categorising emails and text alerts. The technique proposed in this article is sensitive to two main variables: the size of the output vector, and the eps parameter in DBSCAN. It is advisable to experiment with both parameters before drawing conclusions from this method. Enjoy!

The GitHub repo link to this work is here.

[1] — NLP 101: Word2Vec — Skip-gram and CBOW

[2] — Towards Lower Bounds on Number of Dimensions for Word Embeddings

Luke Menzies