K-means Clustering in Python
The K-means clustering algorithm uses centroids to create K clusters, where K is the number of clusters. A centroid is simply the arithmetic mean position of all the points assigned to a cluster.
Clustering Steps
To start K-means clustering, the user first needs to decide how many clusters are required. The algorithm then mainly repeats two iterative steps.
Step 1: Assignment step
Step 2: Optimization step
Let’s walk through a small example to understand K-means clustering, where we want to divide the data points into two clusters.
Step 1: Assignment step
In the first step, start with a random placement of the K initial centroids. Once the centroids are placed, assign each data point to its nearest centroid, where the distance of each point from a centroid is calculated with the Euclidean distance formula.
Step 2: Optimization step
In the optimization step, recompute each centroid as the mean of the points assigned to it and move the centroid to this new position.
Step 3: Assignment step
In this step, re-assign all data points to their nearest centroid based on Euclidean distance.
Step 4: Optimization step
In this optimization step, update the centroids again.
Repeat these two steps until the centroids no longer change. At this point, we can say that the algorithm has converged and we have found our centroids.
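The two steps above can be summarized in a short NumPy sketch. This is only an illustration of the idea on made-up random data (the data, K, and the fixed number of iterations are arbitrary assumptions); the actual analysis later in this post uses scikit-learn's KMeans.

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 2))                              # toy 2-D data points
K = 2                                                 # number of clusters
centroids = X[rng.choice(len(X), K, replace=False)]   # random initial centroids

for _ in range(10):                                   # fixed number of iterations for simplicity
    # Assignment step: label every point with its nearest centroid (Euclidean distance)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Optimization step: move every centroid to the mean of its assigned points
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])

print(centroids)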
Practical considerations in K-means clustering:
- The initial placement of the centroids affects the final clusters.
- The user must decide the number of clusters in advance.
- The algorithm is sensitive to outliers, because an outlier's large distance from the rest of the points pulls the centroid towards it (see the short sketch after this list).
- Standard K-means does not work with categorical data.
- Sometimes the process converges to a local optimum rather than the optimal centroids.
- Because the Euclidean distance metric is used, the data should be standardized for better performance.
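To see why outliers are a problem, recall that a centroid is an arithmetic mean, so a single extreme point can drag it far away from the bulk of a cluster. A tiny sketch with made-up numbers:

import numpy as np

cluster = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 2.0]])
print(cluster.mean(axis=0))            # centroid near [1.5, 1.33]

with_outlier = np.vstack([cluster, [50.0, 50.0]])
print(with_outlier.mean(axis=0))       # centroid dragged to about [13.6, 13.5]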
K-means clustering implementation:
Let’s implement K-means clustering in Python on an online retail dataset that contains information about customers’ spending at a particular retail store over a period of time.
You can download the dataset from here.
The objective of our clustering analysis is to group customers into different categories. If you take a closer look at the dataset, you will notice that it is invoice-centric. To analyze customers, we have to transform it into customer-wise data.
Let’s use the RFM analysis concept to create a dataset.
RFM analysis
RFM stands for the recency, frequency, and monetary score computed for every customer. RFM analysis helps identify the most and least important customer groups (see the toy example after the list below).
- Recency: measures how recently a customer made a purchase.
- Frequency: measures how often a customer orders, i.e. the number of orders placed by the customer.
- Monetary: measures the total spending of a customer.
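As a quick illustration of how these three metrics are derived from transaction data, here is a toy pandas example (the customer IDs, dates, and amounts are made up); the real dataset is processed in essentially the same way in the steps below.

import pandas as pd

# Made-up transactions: one row per order line
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2],
    "InvoiceDate": pd.to_datetime(["2011-12-01", "2011-12-08", "2011-11-01"]),
    "Amount": [20.0, 35.0, 100.0],
})

# Snapshot date: one day after the last transaction
snapshot = toy.InvoiceDate.max() + pd.DateOffset(days=1)

rfm_toy = toy.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    Frequency=("InvoiceDate", "count"),
    Monetary=("Amount", "sum"),
)
print(rfm_toy)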
Step-1: Import libraries and load dataset.
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
from sklearn.cluster import KMeans
import seaborn as sns

retail = pd.read_csv('OnlineRetail.csv', sep = ',', encoding = 'ISO-8859-1', header = 0)
retail.head(2)
|   | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 01/12/10 8:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 01/12/10 8:26 | 3.39 | 17850.0 | United Kingdom |
Step-2: Convert the ‘InvoiceDate’ column to datetime format.
retail['InvoiceDate'] = pd.to_datetime(retail['InvoiceDate'], format = "%d/%m/%y %H:%M")
Step-3: Drop all empty records.
order_wise = retail.dropna()
order_wise.shape
Step-4: Create a new column called ‘Amount’, the total amount spent on each order line (Quantity × UnitPrice).
amount = pd.DataFrame(order_wise.Quantity * order_wise.UnitPrice, columns = ["Amount"])
order_wise = pd.concat(objs = [order_wise, amount], axis = 1, ignore_index = False)
Step-5: Generate Recency data from the customer dataset.
recency = order_wise[['CustomerID', 'InvoiceDate']].copy()
maxdate = max(recency.InvoiceDate)
maxdate = maxdate + pd.DateOffset(days=1)
recency['diff'] = maxdate - recency.InvoiceDate

# Recency per customer: the shortest gap between the snapshot date and any purchase
df = pd.DataFrame(recency.groupby('CustomerID')['diff'].min())
df = df.reset_index()
df.columns = ["CustomerID", "Recency"]
Output:
|   | CustomerID | Recency |
|---|---|---|
| 0 | 12346.0 | 326 days 02:33:00 |
| 1 | 12347.0 | 2 days 20:58:00 |
| 2 | 12348.0 | 75 days 23:37:00 |
| 3 | 12349.0 | 19 days 02:59:00 |
| 4 | 12350.0 | 310 days 20:49:00 |
Step-6: Generate Frequency data from the customer dataset.
frequency = order_wise[['CustomerID', 'InvoiceNo']]

# Count the invoice lines per customer as the frequency measure
invoice_count = frequency.groupby("CustomerID").InvoiceNo.count()
invoice_count = pd.DataFrame(invoice_count)
invoice_count = invoice_count.reset_index()
invoice_count.columns = ["CustomerID", "Frequency"]
Output:
|   | CustomerID | Frequency |
|---|---|---|
| 0 | 12346.0 | 2 |
| 1 | 12347.0 | 182 |
| 2 | 12348.0 | 31 |
| 3 | 12349.0 | 73 |
| 4 | 12350.0 | 17 |
Step-7: Generate Monetary data from the customer dataset.
monetary = order_wise.groupby("CustomerID").Amount.sum()
monetary = monetary.reset_index()
monetary.head()
Output:
|   | CustomerID | Amount |
|---|---|---|
| 0 | 12346.0 | 0.00 |
| 1 | 12347.0 | 4310.00 |
| 2 | 12348.0 | 1797.24 |
| 3 | 12349.0 | 1757.55 |
| 4 | 12350.0 | 334.40 |
Step-8: Combine Recency, Frequency, and Monetary dataframe.
RFM = invoice_count.merge(monetary, on = "CustomerID")
RFM = RFM.merge(df, on = "CustomerID")
RFM.head()
Output:
|   | CustomerID | Frequency | Amount | Recency |
|---|---|---|---|---|
| 0 | 12346.0 | 2 | 0.00 | 326 days 02:33:00 |
| 1 | 12347.0 | 182 | 4310.00 | 2 days 20:58:00 |
| 2 | 12348.0 | 31 | 1797.24 | 75 days 23:37:00 |
| 3 | 12349.0 | 73 | 1757.55 | 19 days 02:59:00 |
| 4 | 12350.0 | 17 | 334.40 | 310 days 20:49:00 |
Step-9: Convert the ‘Recency’ column from a timedelta to a number of days.
RFM_df = RFM.drop("CustomerID", axis=1)
RFM_df.Recency = RFM_df.Recency.dt.days
Step-10: Standardise all attributes.
from sklearn.preprocessing import StandardScaler

standard_scaler = StandardScaler()
# Keep the scaled values so that clustering runs on standardized features
RFM_norm = pd.DataFrame(standard_scaler.fit_transform(RFM_df), columns = RFM_df.columns)
K-means Clustering implementation
Step-11: Now that we have standardized data, let's apply the K-means clustering algorithm.
Take the number of clusters (K) as 5, which means the customers will be divided into 5 different groups. The main KMeans parameters used below are:
n_clusters: The number of clusters to be formed
max_iter: Maximum number of iterations of the k-means algorithm for a single run.
n_init: The number of times the k-means algorithm will be run with different centroid seeds. The default value is 10.
from sklearn.cluster import KMeans

model = KMeans(n_clusters = 5, max_iter = 50)
model.fit(RFM_norm)
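The choice of K = 5 above is an assumption rather than something derived from the data. A common sanity check (not shown in the original walkthrough) is the elbow method: fit K-means for a range of K values on the standardized data and look for the point where the within-cluster sum of squares (inertia_) stops dropping sharply.

# Elbow-method sketch: inspect inertia for K = 1..9
ssd = []
for k in range(1, 10):
    km = KMeans(n_clusters = k, max_iter = 50)
    km.fit(RFM_norm)
    ssd.append(km.inertia_)

plt.plot(range(1, 10), ssd, marker = 'o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Within-cluster sum of squares')
plt.show()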
Step-12: Now analyze the clusters formed.
RFM.index = pd.RangeIndex(len(RFM.index))
RFM_cluster_df = pd.concat([RFM, pd.Series(model.labels_)], axis=1)
RFM_cluster_df.columns = ['CustomerID', 'Frequency', 'Amount', 'Recency', 'ClusterID']
Step-13: Let’s understand the nature of the clusters by calculating the mean of each RFM attribute per cluster.
RFM_cluster_df.Recency = RFM_cluster_df.Recency.dt.days

RFM_clusters_amount = pd.DataFrame(RFM_cluster_df.groupby(["ClusterID"]).Amount.mean())
RFM_clusters_frequency = pd.DataFrame(RFM_cluster_df.groupby(["ClusterID"]).Frequency.mean())
RFM_clusters_recency = pd.DataFrame(RFM_cluster_df.groupby(["ClusterID"]).Recency.mean())

# Concatenate all the means into a single summary dataframe
df = pd.concat([pd.Series([0,1,2,3,4]), RFM_clusters_amount, RFM_clusters_frequency, RFM_clusters_recency], axis=1)
df.columns = ["ClusterID", "Amount_mean", "Frequency_mean", "Recency_mean"]
df.head()
Output:
|   | ClusterID | Amount_mean | Frequency_mean | Recency_mean |
|---|---|---|---|---|
| 0 | 0 | 389.050111 | 30.798467 | 107.575027 |
| 1 | 1 | 1150.137627 | 57.661017 | 61.675545 |
| 2 | 2 | 713.117873 | 45.030207 | 83.976153 |
| 3 | 3 | 154.465893 | 15.982231 | 169.190523 |
| 4 | 4 | 1677.971774 | 75.020161 | 58.270161 |
Clustering Result Analysis
Based on this table, you can analyze customer behavior. For example, customers belonging to the 5th cluster (ClusterID = 4) are valuable from the store’s point of view because:
- They spend the most money,
- They order most frequently, and
- They have purchased recently.
In contrast, customers belonging to the 4th cluster (ClusterID = 3) are the least profitable for the store based on their Recency, Frequency, and Monetary values.
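To make these differences easier to see, you can also plot the cluster-level means with seaborn, which was imported in Step-1. This is an optional visualization sketch using the summary dataframe df built in Step-13:

fig, axes = plt.subplots(1, 3, figsize = (15, 4))
sns.barplot(x = "ClusterID", y = "Amount_mean", data = df, ax = axes[0])
sns.barplot(x = "ClusterID", y = "Frequency_mean", data = df, ax = axes[1])
sns.barplot(x = "ClusterID", y = "Recency_mean", data = df, ax = axes[2])
plt.tight_layout()
plt.show()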
In this blog, you learned about the concept of K-means clustering and how to implement it in Python. Please comment in case of any suggestions.