Cluster Analysis & Unsupervised Machine Learning in R
Cluster analysis is one of the most widely used techniques for segmenting data in multivariate analysis. It is an example of unsupervised machine learning and has widespread application in business analytics. Cluster analysis groups a set of objects so that objects in the same group are more similar to each other than to objects in other groups. More precisely, it tries to identify homogeneous groups of cases, such as observations, participants, or respondents. In this post, I will take you through the two most important clustering techniques using R. These are:
Hierarchical Clustering: This method identifies clusters within clusters. It groups data over a variety of scales by creating a cluster tree, or dendrogram. For instance, wine can be subcategorized as Fortified Wine, Sparkling Wine, or Still Wine according to similarities in composition. Hierarchical clustering is further divided into agglomerative clustering, which works bottom-up by merging small clusters, and divisive clustering, which works top-down by splitting large ones.
Partitional Clustering: This method partitions n objects into a set of K clusters. The most popular partitional method is K-means. Here, each cluster is associated with a centroid, and each point is assigned to the cluster with the closest centroid. K-means requires the number of clusters to be specified in advance.
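To make the two techniques concrete, here is a minimal sketch of each in base R, run on the four numeric columns of the built-in iris data. The choice of Euclidean distance, complete linkage, three clusters, and the seed value are assumptions made for illustration, not recommendations.

```r
# Sketch only: both methods applied to the numeric columns of iris.
data(iris)
measurements <- scale(iris[, 1:4])    # standardise so no variable dominates the distances

# Hierarchical (agglomerative) clustering: start from single points, merge upward.
distances   <- dist(measurements, method = "euclidean")
tree        <- hclust(distances, method = "complete")
hier_groups <- cutree(tree, k = 3)    # cut the dendrogram into 3 clusters (assumed k)
# plot(tree) would draw the dendrogram

# Partitional clustering with K-means: the number of clusters is fixed up front.
set.seed(42)                          # K-means starts from random centroids
km <- kmeans(measurements, centers = 3, nstart = 25)

table(hier_groups)                    # cluster sizes from the hierarchical cut
table(km$cluster)                     # cluster sizes from K-means
km$centers                            # one centroid (row) per cluster
```

Note that `hclust()` builds the whole tree first and the number of clusters is only chosen afterwards with `cutree()`, whereas `kmeans()` needs `centers` before it can run.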
One practical difference between the two methods is that K-means scales to larger datasets than hierarchical clustering, which needs the distance between every pair of observations. So, let’s go ahead and use both of them, one by one. For the cluster analysis, I will use the “iris” dataset, available in R’s datasets package. There are other datasets in the package too, but this one is a famous dataset used to demonstrate many statistical classification techniques in machine learning. It consists of measurements, in centimeters, of 150 flowers: 50 from each of three species. These three species are setosa, versicolor, and virginica.
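Before clustering, it helps to look at the data. The following is a small sketch of how the dataset might be inspected in base R. The variable name `features` is just an illustrative choice.

```r
# Load the built-in iris data and take a first look at it.
data(iris)
str(iris)                  # column names and types
dim(iris)                  # number of rows and columns
table(iris$Species)        # how many flowers each species contributes

# Clustering is unsupervised, so keep only the numeric measurements
# and set the species label aside for later comparison.
features <- iris[, 1:4]
summary(features)          # range and quartiles of each measurement
```

Dropping the `Species` column matters here: the clustering methods should see only the measurements, and the known species can then be used afterwards to judge how well the clusters line up.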