
A few months ago, we talked about text clustering. From supervised to unsupervised approaches, we drew a global picture of what can be done to make structure emerge from your data.

Today, we will work together to cluster a set of tweets from scratch. To do this, we will be using the R language.  

Along with Python, R is one of the two main languages used for general-purpose data science. Widely used by statisticians, it is very popular for one-off analysis and reporting in academic and industrial research. While Python tutorials about text clustering keep multiplying, it is worth discovering the other face of hands-on data science.

To practice R, we highly recommend that you install RStudio and code in it: a complete R development environment, far better than the plain command-line interface. Thanks to the IDE, you can easily see everything you need at the same time: your variables, your script, the console output, your plots and even the documentation.

Goal of this tutorial  

We are going to cluster a dataset of Health News tweets. The idea is to learn the basic functionalities of R for data acquisition, data processing and data science. The goal is not to run a state-of-the-art technique for clustering short texts; we will take a few programming and runtime shortcuts to keep this tutorial simple.

Regardless of the results, this tutorial will give you a good idea of how to do basic data science in R and, hopefully, the will to go further!

A complete, comprehensive and detailed version of the following tutorial is available as an R notebook on this GitHub Gist.

Starting: data acquisition

As we said, the clustering target is a dataset of Health News tweets. To get it in one go, we will download the zip archive directly from the source, unzip it, build our data frame and delete the temporary files.

# Creating the empty dataset with the formatted columns 
dataframe <- data.frame(ID=character(), 
                      datetime=character(), 
                      content=character(), 
                      label=factor()) 
source.url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/00438/Health-News-Tweets.zip' 
target.directory <- '/tmp/clustering-r' 
temporary.file <- tempfile() 
download.file(source.url, temporary.file) 
unzip(temporary.file, exdir = target.directory) 

# Reading the files 
target.directory <- paste(target.directory, 'Health-Tweets', sep = '/') 
files <- list.files(path = target.directory, pattern='.txt$') 

# Filling the dataframe by reading the text content 
for (f in files) {
  news.filename <- paste(target.directory, f, sep = '/')
  news.label <- substr(f, 0, nchar(f) - 4) # Removing the last 4 characters => '.txt'
  news.data <- read.csv(news.filename,
                        encoding = 'UTF-8',
                        header = FALSE,
                        quote = "",
                        sep = '|',
                        col.names = c('ID', 'datetime', 'content'))

  # Trick to handle the native split problem (cf. notebook for details)
  news.data <- news.data[news.data$content != "", ]
  news.data['label'] <- news.label # Adding the label of the tweet

  # Memory problem with massive data loading: only keeping a few rows (cf. notebook for details)
  news.data <- head(news.data, floor(nrow(news.data) * 0.05))
  dataframe <- rbind(dataframe, news.data) # Row appending
}
unlink(target.directory, recursive = TRUE) # Deleting the temporary directory

The data frame is the main native class used in R to handle regular datasets. It is made of rows, which are the examples, and columns, which represent the features, each with its own type: strings (character()), labels (factor()), integers, and so on.
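
If you want to double-check what was just loaded, a few base R calls are enough; this step is optional and not part of the pipeline:

str(dataframe)          # column types and a preview of the values
head(dataframe)         # first rows of the freshly built data frame
table(dataframe$label)  # number of tweets kept per Twitter account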

Data processing: the manual part

Now comes the time to clean our sentences. After a quick look at the data, we notice that every tweet ends with a shortened URL linking to the news article.

This kind of information is garbage from an NLP point of view, even though it could make the classes easy to distinguish if a specific shortener were used by one of the Twitter accounts.

To get rid of these URLs, let’s use regex substitutions: 

sentences <- sub("http://([[:alnum:]|[:punct:]])+", '', dataframe$content) 
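
To make the effect of this pattern concrete, here is the same substitution applied to a made-up tweet (the text and the shortened URL below are purely illustrative):

example.tweet <- "Breast cancer risk linked to new gene http://bbc.in/1a2b3c"
sub("http://([[:alnum:]|[:punct:]])+", '', example.tweet)
# [1] "Breast cancer risk linked to new gene "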

Data processing: the automatic part

For the remaining, more standard preprocessing tasks, we will use a dedicated package. The main reason is that R was not built with NLP at the center of its architecture: text manipulation is costly in coding time, running time, or both, and as soon as the data is not purely numerical, R can become a pain for beginners.

The package that will save our life is tm (which stands for text mining). From our cleaned sentences, we will create a Corpus object on which we can call methods to perform stop-word removal, stemming, whitespace trimming, and so on.

corpus <- tm::Corpus(tm::VectorSource(sentences))

# Cleaning up
# Handling the UTF-8 encoding problem of the dataset
# (note: 'UTF-8-MAC' targets macOS; on other platforms, 'UTF-8' should be used instead)
corpus.cleaned <- tm::tm_map(corpus, tm::content_transformer(function(x) iconv(x, to = 'UTF-8-MAC', sub = 'byte')))
corpus.cleaned <- tm::tm_map(corpus.cleaned, tm::removeWords, tm::stopwords('english')) # Removing stop words
corpus.cleaned <- tm::tm_map(corpus.cleaned, tm::stemDocument, language = "english") # Stemming the words
corpus.cleaned <- tm::tm_map(corpus.cleaned, tm::stripWhitespace) # Trimming excessive whitespaces
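
If you are curious about what the cleaning actually changed, you can compare a document before and after the transformations; this is an optional check using tm's inspect():

tm::inspect(corpus[1])         # one raw sentence
tm::inspect(corpus.cleaned[1]) # the same sentence after encoding fix, stop-word removal, stemming and trimming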

Sentence representation: TF-IDF and pairwise distances

Now that we have a whole set of cleaned sentences, we need to represent them numerically. The most efficient of the simple approaches is TF-IDF weighting. Each sentence becomes a vector whose length is the size of the remaining vocabulary, and whose components are weighted according to the frequency of the word in the sentence and how discriminative that word is across the corpus.

tdm <- tm::DocumentTermMatrix(corpus.cleaned) 
tdm.tfidf <- tm::weightTfIdf(tdm)
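
Before going further, it can be worth having a quick look at what we obtained. The exact numbers depend on the 5% sample kept earlier, so take this as an optional check:

dim(tdm)                             # number of documents x size of the vocabulary
tm::findFreqTerms(tdm, lowfreq = 50) # terms appearing at least 50 times in the corpus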

Now that our sentences are numerically represented, we can compute a distance between them. In our situation, we have to apply an additional cut on the features for computational reasons: sparse matrices are not well handled everywhere in R, so before converting to a dense matrix we remove the terms that are too sparse across documents. This reduces the number of features and therefore the size of the data matrix.

This problem is explained in more detail in the notebook linked at the beginning.

tdm.tfidf <- tm::removeSparseTerms(tdm.tfidf, 0.999) 
tfidf.matrix <- as.matrix(tdm.tfidf) 

# Cosine distance matrix (useful for specific clustering algorithms) 
dist.matrix <- proxy::dist(tfidf.matrix, method = "cosine")
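
As a sanity check, the cosine distance between two documents can be recomputed by hand from their TF-IDF vectors and compared with the corresponding entry of dist.matrix. proxy turns the cosine similarity into a distance as 1 minus the similarity, so both values should match, as long as neither document ended up as an all-zero vector after the sparsity cut:

a <- tfidf.matrix[1, ]
b <- tfidf.matrix[2, ]
1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2))) # manual cosine distance
as.matrix(dist.matrix)[1, 2]                       # value computed by proxy::dist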

The core part: the dataset clustering  

We made it! The surviving part of the dataset is now in a shape our clustering algorithms can work with. But how do we cluster all of this?

We will run most of the techniques introduced in the article mentioned at the beginning: a partitioning one (K-Means), a hierarchical one (bottom-up merging) and a density-based one (HDBSCAN). Except for the last one, native R functions are enough to run the respective algorithms. K-Means and the hierarchical cut both need a number of clusters: since the dataset comes with its source accounts as labels, we simply reuse that number as a "ground truth" K, stored in truth.K.

truth.K <- length(unique(dataframe$label)) # 'ground truth' K: one cluster per Twitter account

clustering.kmeans <- kmeans(tfidf.matrix, truth.K)
clustering.hierarchical <- hclust(dist.matrix, method = "ward.D2")
clustering.dbscan <- dbscan::hdbscan(dist.matrix, minPts = 10)

With these three clusterings, we can even try a stacking method: merging the results with a simple hard-voting technique. Taking the K-Means result as the master clustering, each of its clusters is assigned to the majority cluster among its points in the slave clustering. We then repeat the operation, starting from that majority slave cluster, with the other clustering result (the second slave).

This is not the most efficient way to stack clusterings: at best it keeps the master clusters unchanged, and at worst it merges some of them. But it is easy to implement:

master.cluster <- clustering.kmeans$cluster 
slave.hierarchical <- cutree(clustering.hierarchical, k = truth.K) 
slave.dbscan <- clustering.dbscan$cluster 
stacked.clustering <- rep(NA, length(master.cluster))  
names(stacked.clustering) <- 1:length(master.cluster) 

for (cluster in unique(master.cluster)) {
  indexes <- which(master.cluster == cluster, arr.ind = TRUE)       # points of the current master (K-Means) cluster
  slave1.votes <- table(slave.hierarchical[indexes])                # hard vote among the hierarchical clusters
  slave1.maxcount <- names(slave1.votes)[which.max(slave1.votes)]   # majority hierarchical cluster
  slave1.indexes <- which(slave.hierarchical == slave1.maxcount, arr.ind = TRUE) # points of that hierarchical cluster
  slave2.votes <- table(slave.dbscan[slave1.indexes])               # hard vote among the density-based clusters
  slave2.maxcount <- names(slave2.votes)[which.max(slave2.votes)]   # majority density-based cluster
  stacked.clustering[indexes] <- slave2.maxcount                    # final label for the master cluster's points
}
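
Since the original account names were kept in the label column of our data frame, a quick cross-tabulation gives a rough feeling of how each clustering relates to the source accounts. This is only an informal sanity check, not a proper evaluation:

table(dataframe$label, master.cluster)     # K-Means clusters vs. Twitter accounts
table(dataframe$label, stacked.clustering) # stacked clusters vs. Twitter accounts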

Plotting the results 

Our four clusterings are done! We've made it! But how can we look at the results?

We need a way to visualise our high-dimensional sentence vectors. Since we already have a distance matrix (the one used for the density-based clustering), we can apply multidimensional scaling (MDS) to map our data into a two-dimensional space.

After that, R comes with easy native functions to plot these results:  

points <- cmdscale(dist.matrix, k = 2) 
palette <- colorspace::diverge_hcl(truth.K) # Creating a color palette 
previous.par <- par(mfrow=c(2,2), mar = rep(1.5, 4)) 
 
plot(points, main = 'K-Means clustering', col = as.factor(master.cluster),
     xaxt = 'n', yaxt = 'n', xlab = '', ylab = '')

plot(points, main = 'Hierarchical clustering', col = as.factor(slave.hierarchical),
     xaxt = 'n', yaxt = 'n', xlab = '', ylab = '')

plot(points, main = 'Density-based clustering', col = as.factor(slave.dbscan),
     xaxt = 'n', yaxt = 'n', xlab = '', ylab = '')

plot(points, main = 'Stacked clustering', col = as.factor(stacked.clustering),
     xaxt = 'n', yaxt = 'n', xlab = '', ylab = '')

par(previous.par) # recovering the original plot space parameters 

And … here are our results! 

This hands-on tutorial is now finished. As I said, a bit of cleaning (dropping sparse features, keeping only a small part of the data, …) still leads to decent results. We also see one of R's limits here: it needs dedicated packages to handle larger datasets, and package incompatibilities may occur, which can make the process quite heavy.

Python remains the most convenient language for industrial NLP needs in terms of code efficiency, ease of coding and software engineering.
But I hope this tutorial gave you a more accurate view of R's potential, and an interesting introduction to applied text clustering on real data.

Happy coding!

