K-means Clustering in Python

The K-means clustering algorithm uses centroids to partition data into K clusters. A centroid is the arithmetic mean position of all points in a cluster, and K is the number of clusters.

Clustering Steps

To start K-means clustering, the user must specify how many clusters are required. The algorithm then alternates between two iterative steps.
Step1: Assignment step
Step2: Optimization step

Let’s use the dataset below to understand K-means clustering. Here we want to divide the data points into two clusters.

[Figure: K-means clustering, example dataset]

Step1: Assignment step

In the first step, place K initial centroids at random. Once the centroids are placed, assign each data point to its nearest centroid. The distance from each point to each centroid is calculated with the Euclidean distance formula. The diagram below shows each point assigned to its nearest centroid.
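As a minimal sketch of the assignment step, the snippet below computes Euclidean distances with NumPy and picks the nearest centroid for each point. The six points and the two starting centroids are made-up values for illustration and are not taken from the figures.

```python
import numpy as np

# Hypothetical toy data: six 2-D points and two starting centroids.
points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 3.0],
                   [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])
centroids = np.array([[1.0, 1.0], [5.0, 7.0]])

# Pairwise Euclidean distances, shape (n_points, n_centroids).
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

# Each point gets the index of its nearest centroid.
labels = distances.argmin(axis=1)
print(labels)
```

Here the first three points land in cluster 0 and the last three in cluster 1.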

[Figure: each point assigned to its nearest centroid]

Step2: Optimization step

In the optimization step, recompute each centroid as the mean of the points assigned to it, and move the centroid to that new position.

[Figure: centroids moved to their new positions]

Step3: Assignment step

In this step, re-assign all data points to their nearest centroid based on Euclidean distance.

[Figure: points re-assigned to the nearest centroid]

Step4: Optimization step

In this optimization step, update the centroids again.

[Figure: centroids updated again]

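Repeat these two steps until the centroids no longer change. At that point the algorithm has converged, and the centroids are optimal for the chosen starting positions (a local optimum, which is not necessarily the best possible clustering).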
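The whole loop can be sketched in a few lines of NumPy. This is an illustrative implementation under simple assumptions, not the one scikit-learn uses: the function name `kmeans`, the choice of random data points as starting centroids, and the synthetic two-blob data are all hypothetical.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means: alternate assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Optimization step: mean of each cluster (keep the old centroid if a cluster is empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Synthetic data: two well-separated blobs around (0, 0) and (5, 5).
rng = np.random.default_rng(42)
data = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
                  rng.normal(5, 0.5, size=(50, 2))])
centroids, labels = kmeans(data, k=2)
print(centroids)
```

On data this cleanly separated, the two centroids should settle near the blob centers. On real data, different seeds can give different results, which is why libraries usually run several initializations.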

Practical considerations in K-means clustering:

  • The initial choice of centroids affects the final clusters.
  • Users must decide the number of clusters in advance.
  • Clusters are sensitive to outliers, because an outlier's large distance from the centroid pulls the mean toward it.
  • K-means does not work directly with categorical data, because means and Euclidean distances are not defined for categories.
  • The process may converge to a locally optimal set of centroids rather than the best possible one.
  • Because Euclidean distance is used, the data should be standardized so that no single attribute dominates the distance.
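To illustrate the last point, here is a small sketch of how an attribute on a large scale dominates Euclidean distance until the data is standardized. The three customers and their values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [amount spent, number of orders].
X = np.array([[1000.0, 2.0],
              [1050.0, 40.0],
              [3000.0, 3.0]])

def dist(a, b):
    """Euclidean distance between two points."""
    return np.linalg.norm(a - b)

# Raw data: the amount column dominates, so customer 1 looks very close
# to customer 0 even though their order counts differ a lot.
raw_near, raw_far = dist(X[0], X[1]), dist(X[0], X[2])
print(raw_near, raw_far)

# After standardization both attributes contribute on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)
scaled_near, scaled_far = dist(X_scaled[0], X_scaled[1]), dist(X_scaled[0], X_scaled[2])
print(scaled_near, scaled_far)
```

Before scaling, the second distance is far larger than the first. After scaling, the two distances are of similar size, because the difference in order count is no longer drowned out by the difference in amount.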

K-means clustering implementation:

Let’s implement k-means clustering in Python on an online retail dataset, which records customers’ spending at a particular retail store over a period of time.

You can download the dataset from here

The objective of our clustering analysis is to group customers into different categories. If you take a closer look at the dataset, you will notice that it is invoice-centric. To analyze customers, we have to transform it into customer-wise data.

Let’s use the RFM analysis concept to create a dataset.

RFM analysis

RFM stands for recency, frequency, and monetary value, computed for each customer. RFM analysis helps identify the most important and the least important customer groups.

  • Recency: how recently a customer made a purchase.
  • Frequency: the number of orders placed by a customer.
  • Monetary: the total spending of a customer.
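As a compact sketch of these three measures, the snippet below builds an RFM table from a tiny, made-up set of invoice lines. The data and the `DaysAgo` helper column are hypothetical, and Frequency here counts distinct invoices, which is one possible definition.

```python
import pandas as pd

# Hypothetical invoice lines for two customers.
orders = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1"],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-05", "2011-01-05", "2011-03-01", "2011-02-10"]),
    "Amount": [10.0, 5.0, 20.0, 7.5],
})

# Reference date: one day after the latest invoice in the data.
snapshot = orders["InvoiceDate"].max() + pd.Timedelta(days=1)
orders["DaysAgo"] = (snapshot - orders["InvoiceDate"]).dt.days

rfm = orders.groupby("CustomerID").agg(
    Recency=("DaysAgo", "min"),          # days since the latest purchase
    Frequency=("InvoiceNo", "nunique"),  # number of distinct invoices
    Monetary=("Amount", "sum"),          # total spending
)
print(rfm)
```

Customer 1 bought most recently, placed two orders, and spent 35.0 in total. Customer 2 placed a single order of 7.5 twenty days before the reference date.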

Step-1: Import libraries and load dataset.

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
from sklearn.cluster import KMeans
import seaborn as sns

retail = pd.read_csv('OnlineRetail.csv', sep = ',',encoding = 'ISO-8859-1', header= 0)
retail.head(2) 

Output:

  InvoiceNo StockCode                         Description  Quantity    InvoiceDate  UnitPrice  CustomerID         Country
0    536365    85123A  WHITE HANGING HEART T-LIGHT HOLDER         6  01/12/10 8:26       2.55     17850.0  United Kingdom
1    536365     71053                 WHITE METAL LANTERN         6  01/12/10 8:26       3.39     17850.0  United Kingdom

 

Step-2: Convert the ‘InvoiceDate’ column format.

retail['InvoiceDate'] = pd.to_datetime(retail['InvoiceDate'], format = "%d/%m/%y %H:%M")
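
The explicit format matters because the dates are written day first. As a small sketch, the first string below comes from the sample rows above, while the second is a hypothetical value added for illustration.

```python
import pandas as pd

# First value is from the sample rows; the second is a made-up example.
raw = pd.Series(["01/12/10 8:26", "09/12/11 12:50"])

# Day first, then month, two-digit year, 24-hour time.
parsed = pd.to_datetime(raw, format="%d/%m/%y %H:%M")
print(parsed)

# Reading the same string month first gives a different date.
month_first = pd.to_datetime("01/12/10 8:26", format="%m/%d/%y %H:%M")
print(month_first)
```

With the day-first format, "01/12/10" is read as 1 December 2010. Read month first, it would become 12 January 2010.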

Step-3: Drop all records with missing values.

order_wise = retail.dropna()
order_wise.shape 

Step-4: Create a new column called ‘Amount’, which is the amount spent on each invoice line (Quantity × UnitPrice).

# assign returns a new DataFrame with the extra column added
order_wise = order_wise.assign(Amount = order_wise.Quantity * order_wise.UnitPrice)


Step-5: Generate Recency data from the customer dataset.

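# Copy the slice so that adding a column does not raise SettingWithCopyWarning
recency = order_wise[['CustomerID','InvoiceDate']].copy()
# Reference date: one day after the most recent invoice in the dataset
maxdate = recency.InvoiceDate.max() + pd.DateOffset(days=1)
recency['diff'] = maxdate - recency.InvoiceDate

# Recency per customer = time since that customer's latest purchase.
# Select the column with ['diff']; .diff on a groupby is the diff() method.
df = recency.groupby('CustomerID')['diff'].min()
df = df.reset_index()
df.columns = ["CustomerID", "Recency"]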

Output:

   CustomerID            Recency
0     12346.0  326 days 02:33:00
1     12347.0    2 days 20:58:00
2     12348.0   75 days 23:37:00
3     12349.0   19 days 02:59:00
4     12350.0  310 days 20:49:00

 

Step-6: Generate Frequency data from the customer dataset.

frequency = order_wise[['CustomerID', 'InvoiceNo']]
# count() tallies invoice lines (rows) per customer; nunique() would count distinct invoices
invoice_count = frequency.groupby("CustomerID").InvoiceNo.count()
invoice_count = pd.DataFrame(invoice_count)
invoice_count = invoice_count.reset_index()
invoice_count.columns = ["CustomerID", "Frequency"]

Output:

   CustomerID  Frequency
0     12346.0          2
1     12347.0        182
2     12348.0         31
3     12349.0         73
4     12350.0         17

Step-7: Generate Monetary data from the customer dataset.

monetary = order_wise.groupby("CustomerID").Amount.sum()
monetary = monetary.reset_index()
monetary.head()

Output:

   CustomerID   Amount
0     12346.0     0.00
1     12347.0  4310.00
2     12348.0  1797.24
3     12349.0  1757.55
4     12350.0   334.40

Step-8: Combine the Recency, Frequency, and Monetary dataframes.

RFM = invoice_count.merge(monetary, on = "CustomerID")
RFM = RFM.merge(df, on = "CustomerID")
RFM.head()

Output:

   CustomerID  Frequency   Amount            Recency
0     12346.0          2     0.00  326 days 02:33:00
1     12347.0        182  4310.00    2 days 20:58:00
2     12348.0         31  1797.24   75 days 23:37:00
3     12349.0         73  1757.55   19 days 02:59:00
4     12350.0         17   334.40  310 days 20:49:00

Step-9: Convert the ‘Recency’ column to a numeric type (whole days).


RFM_df = RFM.drop("CustomerID", axis=1)
RFM_df.Recency = RFM_df.Recency.dt.days

Step-10: Standardise all attributes.


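from sklearn.preprocessing import StandardScaler
standard_scaler = StandardScaler()
# fit_transform returns a new array, so store it; otherwise RFM_df stays unscaled
RFM_df = pd.DataFrame(standard_scaler.fit_transform(RFM_df), columns = RFM_df.columns)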

K-means Clustering implementation

Step-11: Now we have standardized data. Let’s implement the K-means clustering algorithm.

Set the number of clusters (K) to 5, which divides the customers into 5 groups.

n_clusters: The number of clusters to form.
max_iter: The maximum number of iterations of the k-means algorithm for a single run.
n_init: The number of times the k-means algorithm is run with different centroid seeds; the best run is kept. The default was 10 in older scikit-learn releases and is 'auto' from version 1.4.

from sklearn.cluster import KMeans
model = KMeans(n_clusters = 5, max_iter=50) 
model.fit(RFM_df) 
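
To see what these parameters control, here is a small sketch on synthetic data. The `make_blobs` data is a stand-in for the RFM table, and `random_state=0` is an assumption added only to make the run repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data: 300 points around 3 centers in 2-D.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# n_init=10 runs k-means from 10 different starts and keeps the best run.
model = KMeans(n_clusters=3, n_init=10, max_iter=50, random_state=0)
model.fit(X)

print(model.cluster_centers_.shape)  # one row per cluster, one column per attribute
print(model.n_iter_)                 # iterations used by the best run (at most max_iter)
print(model.inertia_)                # sum of squared distances to the nearest centroid
```

If `n_iter_` equals `max_iter`, the run stopped at the iteration cap rather than on convergence, and a larger `max_iter` may be worth trying.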

Step-12: Now analyze the clusters formed.

RFM.index = pd.RangeIndex(len(RFM.index))
RFM_cluster_df = pd.concat([RFM, pd.Series(model.labels_)], axis=1)
RFM_cluster_df.columns = ['CustomerID', 'Frequency', 'Amount', 'Recency', 'ClusterID']

Step-13: Let’s understand the nature of the clusters by calculating the mean of each RFM attribute per cluster.

RFM_cluster_df.Recency = RFM_cluster_df.Recency.dt.days
RFM_clusters_amount = pd.DataFrame(RFM_cluster_df.groupby(["ClusterID"]).Amount.mean())
RFM_clusters_frequency = pd.DataFrame(RFM_cluster_df.groupby(["ClusterID"]).Frequency.mean())
RFM_clusters_recency = pd.DataFrame(RFM_cluster_df.groupby(["ClusterID"]).Recency.mean())
#concat all means
df = pd.concat([pd.Series([0,1,2,3,4]), RFM_clusters_amount, RFM_clusters_frequency, RFM_clusters_recency], axis=1)
df.columns = ["ClusterID", "Amount_mean", "Frequency_mean", "Recency_mean"]
df.head()

Output:

   ClusterID  Amount_mean  Frequency_mean  Recency_mean
0          0   389.050111       30.798467    107.575027
1          1  1150.137627       57.661017     61.675545
2          2   713.117873       45.030207     83.976153
3          3   154.465893       15.982231    169.190523
4          4  1677.971774       75.020161     58.270161
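
The same per-cluster summary can also be produced with a single groupby. This is a sketch on a tiny, hypothetical table with the same column names. The `Customers` column is an extra that shows how many customers fall into each cluster.

```python
import pandas as pd

# Hypothetical clustered customers with the same column names as above.
clustered = pd.DataFrame({
    "ClusterID": [0, 0, 1, 1, 1],
    "Amount": [100.0, 300.0, 1000.0, 2000.0, 3000.0],
    "Frequency": [5, 15, 50, 60, 70],
    "Recency": [200, 100, 10, 20, 30],
})

# Mean of every RFM attribute per cluster, computed in one pass.
profile = (
    clustered.groupby("ClusterID")[["Amount", "Frequency", "Recency"]]
    .mean()
    .add_suffix("_mean")
)
# Cluster sizes help judge how much weight each mean deserves.
profile["Customers"] = clustered.groupby("ClusterID").size()
print(profile)
```

Grouping once keeps the means aligned on ClusterID automatically, so no separate concatenation step is needed.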

Clustering Result Analysis

Based on this table you can analyze customer behavior. For example, customers in the 5th cluster (ClusterID = 4) are good from the store’s point of view because:

  • They spend more money on shopping.
  • They visit the store frequently.
  • They visited the store recently.

However, customers in the 4th cluster (ClusterID = 3) are not profitable for the store based on their Recency, Frequency, and Monetary values.

In this blog, you learned about the concept of K-means clustering and its implementation in Python. Please comment if you have any suggestions.
