8. Hierarchical Clustering

Aim

    To implement hierarchical clustering on the Iris dataset using the Agglomerative Clustering algorithm, visualize the resulting clusters through a dendrogram, and analyze the hierarchical structure to understand how different clusters are formed based on sample features.

Understand Hierarchical Clustering Before You Begin

Overview: Hierarchical Clustering is an unsupervised machine learning algorithm used to group similar data points into clusters based on their distance or similarity. Unlike algorithms such as K-Means, hierarchical clustering does not require specifying the number of clusters beforehand. Instead, it builds a tree-like hierarchy of clusters that shows how data points are grouped step by step.

There are two main approaches: Agglomerative (bottom-up) and Divisive (top-down). In agglomerative clustering, each data point starts as its own cluster, and at every step the two most similar clusters are merged until a single cluster remains. Which clusters count as most similar depends on the linkage criterion: single linkage uses the distance between the closest pair of points in the two clusters, complete linkage uses the farthest pair, and average linkage uses the mean distance over all pairs.
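To make the three linkage criteria concrete, the following is a minimal sketch that computes each of them by hand for two tiny clusters. The points and the variable names (`cluster_a`, `cluster_b`) are made up for illustration and are not part of the Iris experiment.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small, made-up clusters of 2-D points (illustrative data only).
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[4.0, 0.0], [6.0, 0.0]])

# All pairwise Euclidean distances between points of A and points of B.
pairwise = cdist(cluster_a, cluster_b, metric="euclidean")

single_link = pairwise.min()     # distance between the closest pair
complete_link = pairwise.max()   # distance between the farthest pair
average_link = pairwise.mean()   # mean distance over all pairs

print(single_link, complete_link, average_link)  # 3.0 6.0 4.5
```

The same two clusters are 3.0, 6.0, or 4.5 units apart depending on the criterion, which is why the choice of linkage can change the order of merges.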

The result of hierarchical clustering is typically visualized using a dendrogram, a tree diagram that shows the order in which clusters are merged. By cutting the dendrogram at a chosen height, different numbers of clusters can be obtained. Hierarchical clustering is widely used in bioinformatics, document clustering, gene expression analysis, and customer segmentation.
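Cutting the tree at different heights can be sketched with SciPy's `linkage` and `fcluster` functions. The six one-dimensional points below are made up for illustration, and the cut heights (1.0 and 10.0) are arbitrary choices for this toy data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six made-up 1-D points forming three tight pairs (illustrative data only).
points = np.array([[0.0], [0.5], [5.0], [5.5], [20.0], [20.5]])

# Build the merge tree with average linkage.
Z = linkage(points, method="average")

# Low cut: only merges at distance <= 1.0 survive, so each pair is a cluster.
labels_low = fcluster(Z, t=1.0, criterion="distance")

# Higher cut: the two nearby pairs (merged at distance 5.0) become one cluster.
labels_high = fcluster(Z, t=10.0, criterion="distance")

print(len(set(labels_low)), len(set(labels_high)))  # 3 2
```

The tree is built once; only the cut height changes the number of clusters.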

Further Understanding: Hierarchical Clustering

Algorithm

  1. Import Required Libraries: Import the necessary libraries for hierarchical clustering and data visualization, such as NumPy for data handling, Matplotlib for plotting, and clustering methods from Scikit-learn and SciPy.
  2. Load the Iris Dataset: Load the dataset and take the four numeric features as the input matrix for the clustering algorithm. The class labels are not used for clustering.
  3. Model Initialization: Define the hierarchical clustering model, setting the linkage criterion and the distance threshold. For agglomerative clustering, the number of clusters may be left undefined so that the full hierarchy is built (in Scikit-learn, n_clusters=None together with a distance threshold).
  4. Model Fitting: Fit the model on the input features to build the hierarchical structure from the distances between data points.
  5. Calculate Linkage Matrix: Build the linkage matrix, which records for each merge the two clusters joined, the distance between them, and the number of samples in the resulting cluster.
  6. Plot Dendrogram: Use the dendrogram plotting function to visualize the hierarchical structure of the data, with adjustable truncation levels for better readability.
  7. Analysis and Interpretation: Analyze the dendrogram to identify meaningful clusters and draw insights from the structure and relationships between clusters.
  8. Visualize Decision Boundaries: Plot the decision boundaries between the resulting clusters.
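Steps 1 to 6 above might be sketched as follows. This is one possible implementation under stated assumptions, not necessarily the one intended by the lab: the helper name `build_linkage_matrix`, the choice of Ward linkage, the truncation level `p=3`, and the non-interactive `Agg` backend are all assumptions made for illustration.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove to use an interactive backend
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris


def build_linkage_matrix(model):
    """Convert a fitted AgglomerativeClustering model to a SciPy linkage matrix."""
    n_samples = len(model.labels_)
    counts = np.zeros(model.children_.shape[0])
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node: one original sample
            else:
                current_count += counts[child_idx - n_samples]  # earlier merge
        counts[i] = current_count
    # Columns: child 1, child 2, merge distance, samples in the new cluster.
    return np.column_stack([model.children_, model.distances_, counts]).astype(float)


# Step 2: load the data (150 samples, 4 features); labels are not used.
X = load_iris().data

# Steps 3-4: distance_threshold=0 with n_clusters=None builds the full tree.
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None, linkage="ward")
model.fit(X)

# Step 5: one row per merge, so (n_samples - 1) rows and 4 columns.
linkage_matrix = build_linkage_matrix(model)

# Step 6: show only the top levels of the tree for readability.
fig, ax = plt.subplots(figsize=(8, 4))
dendrogram(linkage_matrix, truncate_mode="level", p=3, ax=ax)
ax.set_title("Hierarchical Clustering Dendrogram (Iris)")
ax.set_xlabel("Sample index, or (cluster size) for collapsed branches")
ax.set_ylabel("Merge distance")
fig.tight_layout()
# plt.show()  # uncomment in an interactive session
```

Setting `distance_threshold` is what makes Scikit-learn store the merge distances in `model.distances_`; without them the dendrogram heights cannot be drawn.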

About Iris Dataset

This is the famous Iris database, first used by Sir R.A. Fisher. The dataset is taken from Fisher’s paper. Note that it is the same as in R, but not as in the UCI Machine Learning Repository, which has two wrong data points.

This is perhaps the best-known database in the pattern recognition literature. Fisher’s paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other two; the latter are not linearly separable from each other.

Dataset Information

Number of Classes: 3
Number of Instances: 150 (50 in each of three classes)
Number of Attributes: 4 numeric, predictive attributes and the class
Attribute Information:
  • sepal length in cm
  • sepal width in cm
  • petal length in cm
  • petal width in cm
  • Class Label:
      • Iris-Setosa
      • Iris-Versicolour
      • Iris-Virginica

Samples per class: 50
Samples total: 150
Dimensionality: 4
Features: Real, positive

Source: Dataset Link
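The figures above can be checked directly against the copy of the dataset bundled with Scikit-learn. This is a small sketch that assumes `sklearn.datasets.load_iris` is the source used in the lab.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)                   # (150, 4): 150 instances, 4 attributes
print(list(iris.feature_names))  # the four measurements, in cm
print(list(iris.target_names))   # the three species
print(np.bincount(y))            # [50 50 50]: samples per class
print(bool((X > 0).all()))       # True: all features are real and positive
```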

Visualization

Interactive Visualization of Hierarchical Clustering.

Pre-Lab Questions

  1. What is Hierarchical Clustering?
  2. Discuss the different types of linkage methods (single, complete, average). How do they affect the clustering process?
  3. Describe the role of distance metrics (Euclidean, Manhattan) in clustering algorithms. How do they impact the formation of clusters?

Post-Lab Questions

  1. Test the model with different linkage methods (e.g., single, complete, average). How do these changes affect the hierarchical structure and cluster formation?
  2. Try using different distance metrics in the hierarchical clustering model. How does the choice of distance metric impact the clusters and dendrogram?
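A possible starting point for these experiments is sketched below. It uses SciPy's `linkage` function rather than the Scikit-learn model, cuts each tree into three clusters, and scores the result against the species labels with the adjusted Rand index. The choice of three clusters and of the adjusted Rand index as the comparison measure are assumptions made for illustration; the lab may expect a different comparison.

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
X, y = iris.data, iris.target

results = {}
for method in ("single", "complete", "average"):
    for metric in ("euclidean", "cityblock"):  # cityblock is the Manhattan distance
        # Build the full merge tree for this linkage/metric combination.
        Z = linkage(X, method=method, metric=metric)
        # Cut the tree so that at most three flat clusters remain.
        labels = fcluster(Z, t=3, criterion="maxclust")
        # Agreement with the known species (1.0 means identical grouping).
        results[(method, metric)] = adjusted_rand_score(y, labels)

for (method, metric), score in results.items():
    print(f"{method:>8} / {metric:<9}  ARI = {score:.3f}")
```

The species labels are used here only to evaluate the clusters after the fact; they play no part in building the tree.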

Result

The Hierarchical Clustering model using the Agglomerative Clustering algorithm was successfully implemented on the Iris dataset. The dendrogram visualized the hierarchical relationships among the samples and revealed a clear structure whose natural divisions align well with the known species (Setosa, Versicolour, Virginica). By analyzing the dendrogram, the optimal number of clusters and the relationships between them were identified, demonstrating the effectiveness of hierarchical clustering in unsupervised learning tasks.