K-Nearest Neighbors
The K-Nearest Neighbors (KNN) algorithm is a supervised learning algorithm used for both classification and regression. As the name implies, it uses the K nearest neighbors (data points) to predict the class or continuous value of a new, unseen data point.
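As a quick illustration, here is a minimal sketch using scikit-learn's KNeighborsClassifier and KNeighborsRegressor; the tiny datasets are made up purely for illustration.

```python
# Minimal sketch: KNN for classification and regression with scikit-learn.
# The toy datasets below are made up purely for illustration.
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Classification: predict a discrete class label for an unseen point.
X_cls = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]
y_cls = [0, 0, 1, 1]
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_cls, y_cls)
print(clf.predict([[1.2, 2.1]]))   # majority class among the 3 nearest points

# Regression: predict a continuous value by averaging the K neighbors' targets.
X_reg = [[1.0], [2.0], [3.0], [4.0]]
y_reg = [1.2, 1.9, 3.1, 4.0]
reg = KNeighborsRegressor(n_neighbors=2)
reg.fit(X_reg, y_reg)
print(reg.predict([[2.5]]))        # average of the 2 nearest targets
```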
Properties:
- Non-Parametric: KNN makes no assumptions about the distribution of the underlying data. In other words, the model structure is determined by the dataset itself. This is quite practical, as most real-world datasets do not conform to theoretical mathematical assumptions.
- Instance-Based and Lazy Learning: No explicit model is built during training; all of the training data is used in the prediction (testing) phase. In the worst case, KNN may therefore need more time to examine every data point, as well as more memory to store the training data.
How Does KNN Work?
K represents the number of nearest neighbors to consider, and the classes of those neighbors are what determine the prediction.
- Assume P1 is the point whose label we want to predict.
- First, find the K points closest to P1, then classify P1 by the majority vote of those K neighbors.
- Each neighbor votes for its own class, and the class with the most votes is declared the winner.
- The distance between two points can be measured with metrics such as Euclidean distance, Hamming distance, Manhattan distance, and Minkowski distance (see the sketch after this list).
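As a quick illustration, here is a sketch of those four distance metrics computed with plain NumPy on two made-up points.

```python
# Sketch of the four distance metrics mentioned above, using plain NumPy.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))        # straight-line distance
manhattan = np.sum(np.abs(a - b))                # sum of absolute differences
minkowski = np.sum(np.abs(a - b) ** 3) ** (1/3)  # generalization of both; p = 3 here
hamming   = np.mean(a != b)                      # fraction of positions that differ
                                                 # (typically used for categorical/binary features)
print(euclidean, manhattan, minkowski, hamming)
```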
Let's summarize how KNN works using the 4 steps below; a minimal code sketch of these steps follows the list.
- Take the initial (training) data
- Calculate the distance from the new point to each training point
- Find the K closest neighbors of the new point
- Vote for a label based on those neighbors
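Below is a minimal from-scratch sketch of these 4 steps using only NumPy; the training points, labels, and query point are toy values made up for illustration.

```python
# A minimal from-scratch sketch of the 4 steps above (NumPy only).
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    # Step 2: calculate the distance from the query to every training point.
    distances = np.sqrt(np.sum((train_X - query) ** 2, axis=1))
    # Step 3: find the indices of the k closest neighbors.
    nearest = np.argsort(distances)[:k]
    # Step 4: vote for a label based on those neighbors (majority vote).
    votes = Counter(train_y[nearest])
    return votes.most_common(1)[0][0]

# Step 1: take the initial (training) data -- made-up toy values.
train_X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [5.5, 7.5]])
train_y = np.array([0, 0, 1, 1, 1])
print(knn_predict(train_X, train_y, np.array([1.2, 2.1]), k=3))  # -> 0
```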
How to decide 'K' (number of neighbors) in KNN?
K is the number of nearest neighbors used in the vote, and choosing it well matters: too small a K makes predictions sensitive to noise, while too large a K blurs the boundaries between classes.
- When the number of classes is even (for example, binary classification), K is usually chosen to be odd to avoid ties, although this is not strictly required.
- Domain knowledge is also very helpful in determining the K value.
- Elbow Curve: We plot the possible values of K against the corresponding error on the data set. The K value at which the error stabilizes at a low level is taken as the optimal K; for example, if the error curve flattens out around K = 7, we would choose K = 7 for our KNN implementation. A sketch of how to produce such a curve is shown below.
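Here is a sketch of how such an error curve could be produced with scikit-learn and matplotlib; it assumes a feature matrix X and labels y are already loaded.

```python
# Sketch of an elbow curve: plot validation error for a range of K values
# and pick the K where the error stabilizes at a low level.
# Assumes X (features) and y (labels) are already loaded.
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

k_values = range(1, 21)
errors = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    accuracy = cross_val_score(model, X, y, cv=5).mean()  # 5-fold CV accuracy
    errors.append(1 - accuracy)                           # error = 1 - accuracy

plt.plot(k_values, errors, marker="o")
plt.xlabel("K (number of neighbors)")
plt.ylabel("Cross-validated error")
plt.show()
```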
Things to consider when using the KNN algorithm:
- Scaling of Data: Because KNN is based on distance metrics, we must ensure that all of the variables are on the same scale. Standardization and normalization approaches can help with this.
- Curse of Dimensionality: KNN becomes prone to overfitting as the number of dimensions increases, because the amount of data required to cover the feature space grows exponentially with the number of dimensions. This higher-dimensional difficulty is known as the Curse of Dimensionality. To deal with it, we can apply feature selection techniques or dimensionality reduction, such as principal component analysis, before applying KNN.
- Treating/Imputing Missing Values: If any of the N feature values is missing for a training point, we cannot compute the distance to that point. As a result, missing values must either be deleted or imputed. A sketch combining imputation, scaling, and dimensionality reduction in a single pipeline follows this list.
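As a sketch of how these three concerns can be handled together, the scikit-learn pipeline below chains imputation, scaling, and PCA before the KNN classifier; it assumes a feature matrix X (possibly containing missing values) and labels y are already loaded.

```python
# Sketch of a scikit-learn pipeline that addresses all three concerns above:
# impute missing values, scale features, reduce dimensionality, then fit KNN.
# Assumes X (features, possibly with NaNs) and y (labels) are already loaded.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),   # fill missing values
    ("scale", StandardScaler()),                  # put all features on the same scale
    ("pca", PCA(n_components=0.95)),              # keep components explaining 95% of variance
    ("knn", KNeighborsClassifier(n_neighbors=7)),
])
pipeline.fit(X, y)
```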
In the next post, we'll look at how to use Python's scikit-learn library to implement the K-Nearest Neighbors algorithm on a sample data set.