4. K Nearest Neighbors Classifier

Aim

    To understand and implement the K-Nearest Neighbors (KNN) algorithm for classification and regression tasks, and to analyze how the choice of ‘k’ and of the distance metric (such as Euclidean or Manhattan distance) affects the model’s accuracy and performance.

Understand the K-Nearest Neighbors (KNN) Algorithm Before You Begin

Overview: The K-Nearest Neighbors (KNN) algorithm is a simple yet powerful supervised learning technique used for both classification and regression tasks. It works by finding the k closest data points (neighbors) to a new input and predicting its label based on the majority class (for classification) or average value (for regression) of those neighbors.
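The neighbor search and voting described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in this experiment: the function name knn_predict, its task parameter, and the toy data are all hypothetical, and Euclidean distance is assumed.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3, task="classification"):
    """Predict for one point from its k nearest training points (Euclidean).

    Hypothetical helper written for illustration only.
    """
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Distance from x_new to every training point.
    distances = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    neighbor_targets = y_train[nearest]
    if task == "classification":
        # Majority vote among the neighbors' labels.
        return Counter(neighbor_targets.tolist()).most_common(1)[0][0]
    # Regression: average of the neighbors' values.
    return float(np.mean(neighbor_targets))


# Toy data (made up for illustration): two small, well-separated clusters.
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y_toy = [0, 0, 0, 1, 1, 1]

label_near_first_cluster = knn_predict(X_toy, y_toy, [1.1, 1.0], k=3)
label_near_second_cluster = knn_predict(X_toy, y_toy, [5.1, 5.0], k=3)

# Same neighbors, but with numeric targets and averaging instead of voting.
mean_of_neighbors = knn_predict(
    X_toy, [10.0, 12.0, 14.0, 50.0, 52.0, 54.0], [1.1, 1.0], k=3, task="regression"
)
```

Note that there is no training step: all the work happens at prediction time, when distances to the stored training points are computed.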

KNN relies on a distance metric (such as Euclidean, Manhattan, or Minkowski distance) to measure how close data points are to each other. It is a non-parametric algorithm, meaning it makes no assumptions about the underlying data distribution. Because it simply stores the training data and compares new points against it, it is also easy to implement and interpret.
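To make the relationship between these metrics concrete: the Minkowski distance of order p between points a and b is (sum over i of |a_i - b_i|^p)^(1/p), which reduces to Manhattan distance for p = 1 and Euclidean distance for p = 2. The sketch below assumes NumPy; the function name minkowski_distance and the two sample points are illustrative choices.

```python
import numpy as np


def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p; p=1 is Manhattan, p=2 is Euclidean."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


# Two sample points chosen for illustration.
point_a = [0.0, 0.0]
point_b = [3.0, 4.0]

manhattan = minkowski_distance(point_a, point_b, p=1)  # |3| + |4|
euclidean = minkowski_distance(point_a, point_b, p=2)  # sqrt(3^2 + 4^2)
```

For the same pair of points the Manhattan distance is never smaller than the Euclidean distance, so switching metrics can change which training points count as the nearest neighbors.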

Further Understanding: K-Nearest Neighbors (KNN)

Algorithm

  1. Load the Dataset: Load the Iris dataset using load_iris.
  2. Select Features and Labels: Extract "sepal length" and "sepal width" as the feature set X, and the target labels as y.
  3. Split the Dataset: Split the dataset into training and testing sets using train_test_split, ensuring class distribution is maintained with stratify=y.
  4. Create a Pipeline: Create a pipeline with StandardScaler for feature scaling and KNeighborsClassifier for the KNN model.
  5. Initialize Plot: Create a figure with two subplots to visualize the KNN decision boundaries with different weight strategies.
  6. Iterate Over Weight Strategies: For each subplot, set the KNN weight strategy (uniform or distance) and fit the model to the training data.
  7. Visualize Decision Boundaries: Plot the decision boundary for each fitted model.
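One possible way to put these steps together with scikit-learn is sketched below. Several values are assumptions not stated in this experiment: n_neighbors=11, test_size=0.25, random_state=0, the 200x200 grid resolution, and the helper name plot_boundaries. For simplicity the sketch builds a separate pipeline per weight strategy rather than reconfiguring a single one.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load the data and keep only sepal length and sepal width (columns 0 and 1).
iris = load_iris()
X = iris.data[:, :2]
y = iris.target

# Stratified split; test_size and random_state are assumed values.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Grid over the feature plane on which the decision regions are evaluated.
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 200),
    np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 200),
)
grid = np.c_[xx.ravel(), yy.ravel()]

models, regions, scores = {}, {}, {}
for weights in ("uniform", "distance"):
    # Scaling and KNN in one pipeline; n_neighbors=11 is an assumed value.
    clf = Pipeline(
        steps=[
            ("scaler", StandardScaler()),
            ("knn", KNeighborsClassifier(n_neighbors=11, weights=weights)),
        ]
    )
    clf.fit(X_train, y_train)
    models[weights] = clf
    # Predicted class at every grid point, reshaped back to the grid.
    regions[weights] = clf.predict(grid).reshape(xx.shape)
    scores[weights] = clf.score(X_test, y_test)


def plot_boundaries():
    """Draw the two decision-region plots side by side (requires matplotlib)."""
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(ncols=2, figsize=(12, 5))
    for ax, weights in zip(axes, ("uniform", "distance")):
        ax.contourf(xx, yy, regions[weights], alpha=0.4)
        ax.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")
        ax.set_xlabel(iris.feature_names[0])
        ax.set_ylabel(iris.feature_names[1])
        ax.set_title(f"KNN (k=11, weights={weights!r})")
    plt.show()
```

Calling plot_boundaries() renders the two subplots, and scores holds the test accuracy for each weight strategy. Because the scaler sits inside the pipeline, it is fitted on the training split only and then applied to the test data and the grid.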

About Iris Dataset

The dataset consists of petal and sepal measurements (length and width) for 3 different types of irises (Setosa, Versicolour, and Virginica), stored in a 150x4 numpy.ndarray. The rows are the samples and the columns are: Sepal Length, Sepal Width, Petal Length, and Petal Width.

Iris Dataset Information

Number of Instances: 150 (50 in each of three classes)
Number of Attributes: 4 numeric, predictive attributes and the class
Attribute Information:
  • Sepal length in cm
  • Sepal width in cm
  • Petal length in cm
  • Petal width in cm
Classes: 3 (Iris-Setosa, Iris-Versicolour, Iris-Virginica)
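The figures above can be checked directly against the copy of the dataset that ships with scikit-learn. The short sketch below assumes scikit-learn and NumPy are installed; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Shape of the feature matrix: (samples, features).
data_shape = iris.data.shape
# Names of the four measurements, all in cm.
feature_names = list(iris.feature_names)
# Class names as plain strings.
class_names = [str(name) for name in iris.target_names]
# Number of samples in each class, indexed by class label.
samples_per_class = np.bincount(iris.target).tolist()
```

Note that scikit-learn spells the second class "versicolor" in target_names.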

Simulation

Interactive Visualization of KNN Classification on Iris Dataset.

Pre-Lab Questions

  1. How does the choice of 'k' (the number of neighbors) affect the performance of the KNN classifier?
  2. Why is it important to standardize features before applying KNN?

Post-Lab Questions

  1. How does varying the value of 'k' influence the accuracy of your KNN model? If you were to repeat the experiment, would you opt for a different 'k' value?
  2. Develop a program that compares the performance of a KNN model with and without feature scaling. Present the accuracy results side by side and discuss the impact of feature scaling on the model's performance.

Result

The K-Nearest Neighbors classifier was successfully implemented and visualized on the Iris dataset using both uniform and distance-based weighting. The decision boundary plots clearly illustrated how different weight strategies influence classification regions and accuracy.