# Case Study – Telecom Churn


In this section, you will learn about Logistic Regression in Python through a telecom churn case study, where the company wants to predict whether an existing customer will leave the network or not. The problem is solved using logistic regression.

You have three files that contain the required information.

1. Customer data (customer_data.csv)

| customerID | gender | SeniorCitizen | Partner | Dependents |
| --- | --- | --- | --- | --- |
| 7590-VHVEG | Female | 0 | Yes | No |
| 5575-GNVDE | Male | 0 | No | No |
| 3668-QPYBK | Male | 0 | No | No |
| 7795-CFOCW | Male | 0 | No | No |
| 9237-HQITU | Female | 0 | No | No |

2. Churn data (churn_data.csv)

| customerID | tenure | PhoneService | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7590-VHVEG | 1 | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 5575-GNVDE | 34 | Yes | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 3668-QPYBK | 2 | Yes | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 7795-CFOCW | 45 | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 9237-HQITU | 2 | Yes | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |

3. Internet data (internet_data.csv)

| customerID | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7590-VHVEG | No phone service | DSL | No | Yes | No | No | No | No |
| 5575-GNVDE | No | DSL | Yes | No | Yes | No | No | No |
| 3668-QPYBK | No | DSL | Yes | Yes | No | No | No | No |
| 7795-CFOCW | No phone service | DSL | Yes | No | Yes | Yes | No | No |
| 9237-HQITU | No | Fiber optic | No | No | No | No | No | No |

‘customerID’ is the unique identifier for each customer and can be used to map records across the three datasets.

The first step of any analysis is to understand the data, so for the telecom churn case study we will also begin by exploring the data.

#### Step-1: Import data:

Load all required files to analyze customer churn and understand Logistic Regression in Python.

```
# Importing Pandas and NumPy
import pandas as pd
import numpy as np

# Importing all datasets
customer_data = pd.read_csv('customer_data.csv')
churn_data = pd.read_csv('churn_data.csv')
internet_data = pd.read_csv('internet_data.csv')
```

#### Step-2: Verify dataframe:

```
internet_data.head()
```

Output:

```
  customerID MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies
0 7590-VHVEG No phone service DSL No Yes No No No No
1 5575-GNVDE No DSL Yes No Yes No No No
2 3668-QPYBK No DSL Yes Yes No No No No
3 7795-CFOCW No phone service DSL Yes No Yes Yes No No
4 9237-HQITU No Fiber optic No No No No No No
```

Similarly, you can verify the other dataframes.

#### Step-3: Merge all dataframes into one:

Once the data is loaded into data frames, merge all of them into one. Analysis and model building are more efficient with a single data frame. In this example, ‘customerID’ is the unique key, and we can use it to merge the three dataframes into one.

```
#Merging on 'customerID'
df_1 = pd.merge(churn_data, customer_data, how='inner', on='customerID')
telecom = pd.merge(df_1, internet_data, how='inner', on='customerID')
```
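pandas can also validate the merge keys while joining: `validate='one_to_one'` raises if `customerID` is duplicated on either side, catching bad keys before they silently multiply rows. A minimal sketch with toy two-row frames (the values are illustrative, not from the real files):

```python
import pandas as pd

# Toy frames standing in for customer_data / churn_data (hypothetical values)
customer_data = pd.DataFrame({'customerID': ['A1', 'B2'], 'gender': ['Female', 'Male']})
churn_data = pd.DataFrame({'customerID': ['A1', 'B2'], 'tenure': [1, 34]})

# validate='one_to_one' makes pandas raise MergeError if 'customerID'
# is duplicated on either side
df_1 = pd.merge(churn_data, customer_data, how='inner', on='customerID',
                validate='one_to_one')
print(df_1.shape)  # (2, 3)
```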

#### Step-4: Analyze the structure of the final dataframe:

Verify the statistical summary of the dataframe.

`telecom.describe()`

Output:

```
        tenure 	MonthlyCharges 	SeniorCitizen
count 	7043.000000 	7043.000000 	7043.000000
mean 	32.371149 	64.761692 	0.162147
std 	24.559481 	30.090047 	0.368612
min 	0.000000 	18.250000 	0.000000
25% 	9.000000 	35.500000 	0.000000
50% 	29.000000 	70.350000 	0.000000
75% 	55.000000 	89.850000 	0.000000
max 	72.000000 	118.750000 	1.000000
```
`telecom.info()`

Output:

```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 7042
Data columns (total 21 columns):
customerID          7043 non-null object
tenure              7043 non-null int64
PhoneService        7043 non-null object
Contract            7043 non-null object
PaperlessBilling    7043 non-null object
PaymentMethod       7043 non-null object
MonthlyCharges      7043 non-null float64
TotalCharges        7043 non-null object
Churn               7043 non-null object
gender              7043 non-null object
SeniorCitizen       7043 non-null int64
Partner             7043 non-null object
Dependents          7043 non-null object
MultipleLines       7043 non-null object
InternetService     7043 non-null object
OnlineSecurity      7043 non-null object
OnlineBackup        7043 non-null object
DeviceProtection    7043 non-null object
TechSupport         7043 non-null object
StreamingTV         7043 non-null object
StreamingMovies     7043 non-null object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.2+ MB
```

#### Step-5: Check the data size of the master dataframe.

`print("Shape of telecom data: ",telecom.shape)`

Output:

``Shape of telecom data:  (7043, 21)``

Verify the number of unique records in each dataframe.

`telecom.nunique()`

Output:

```
customerID          7043
tenure                73
PhoneService           2
Contract               3
PaperlessBilling       2
PaymentMethod          4
MonthlyCharges      1585
TotalCharges        6531
Churn                  2
gender                 2
SeniorCitizen          2
Partner                2
Dependents             2
MultipleLines          3
InternetService        3
OnlineSecurity         3
OnlineBackup           3
DeviceProtection       3
TechSupport            3
StreamingTV            3
StreamingMovies        3
```

### Data Preparation

#### Step-6: Converting Yes to 1 and No to 0

A few attributes contain the categorical values Yes and No. Let's map them to 1 and 0.

```
telecom['PhoneService'] = telecom['PhoneService'].map({'Yes': 1, 'No': 0})
telecom['PaperlessBilling'] = telecom['PaperlessBilling'].map({'Yes': 1, 'No': 0})
telecom['Churn'] = telecom['Churn'].map({'Yes': 1, 'No': 0})
telecom['Partner'] = telecom['Partner'].map({'Yes': 1, 'No': 0})
telecom['Dependents'] = telecom['Dependents'].map({'Yes': 1, 'No': 0})
```
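The five `map` calls above can be collapsed into one pass over a list of column names. A sketch on a toy frame with the same Yes/No coding (the values are hypothetical):

```python
import pandas as pd

# Toy frame with the same Yes/No coding as the real columns (sketch only)
telecom = pd.DataFrame({'PhoneService': ['Yes', 'No'],
                        'Churn': ['No', 'Yes']})

binary_cols = ['PhoneService', 'Churn']
# Apply the same Yes/No -> 1/0 mapping to every listed column at once
telecom[binary_cols] = telecom[binary_cols].apply(
    lambda s: s.map({'Yes': 1, 'No': 0}))
print(telecom['Churn'].tolist())  # [0, 1]
```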

#### Step-7: Dummy Variable Creation

Convert the remaining categorical attributes to dummy variables.

```
# Creating a dummy variable for 'Contract' and dropping the first level
cont = pd.get_dummies(telecom['Contract'], prefix='Contract', drop_first=True)
# Adding the results to the master dataframe
telecom = pd.concat([telecom, cont], axis=1)

# Creating a dummy variable for 'PaymentMethod' and dropping the first level
pm = pd.get_dummies(telecom['PaymentMethod'], prefix='PaymentMethod', drop_first=True)
telecom = pd.concat([telecom, pm], axis=1)

# Creating a dummy variable for 'gender' and dropping the first level
gen = pd.get_dummies(telecom['gender'], prefix='gender', drop_first=True)
telecom = pd.concat([telecom, gen], axis=1)

# Creating a dummy variable for 'MultipleLines'
ml = pd.get_dummies(telecom['MultipleLines'], prefix='MultipleLines')
# Dropping the 'MultipleLines_No phone service' column
ml1 = ml.drop(['MultipleLines_No phone service'], axis=1)
telecom = pd.concat([telecom, ml1], axis=1)

# Creating a dummy variable for 'InternetService' and dropping the first level
iser = pd.get_dummies(telecom['InternetService'], prefix='InternetService', drop_first=True)
telecom = pd.concat([telecom, iser], axis=1)

# Creating a dummy variable for 'OnlineSecurity'
os = pd.get_dummies(telecom['OnlineSecurity'], prefix='OnlineSecurity')
os1 = os.drop(['OnlineSecurity_No internet service'], axis=1)
telecom = pd.concat([telecom, os1], axis=1)

# Creating a dummy variable for 'OnlineBackup'
ob = pd.get_dummies(telecom['OnlineBackup'], prefix='OnlineBackup')
ob1 = ob.drop(['OnlineBackup_No internet service'], axis=1)
telecom = pd.concat([telecom, ob1], axis=1)

# Creating a dummy variable for 'DeviceProtection'
dp = pd.get_dummies(telecom['DeviceProtection'], prefix='DeviceProtection')
dp1 = dp.drop(['DeviceProtection_No internet service'], axis=1)
telecom = pd.concat([telecom, dp1], axis=1)

# Creating a dummy variable for 'TechSupport'
ts = pd.get_dummies(telecom['TechSupport'], prefix='TechSupport')
ts1 = ts.drop(['TechSupport_No internet service'], axis=1)
telecom = pd.concat([telecom, ts1], axis=1)

# Creating a dummy variable for 'StreamingTV'
st = pd.get_dummies(telecom['StreamingTV'], prefix='StreamingTV')
st1 = st.drop(['StreamingTV_No internet service'], axis=1)
telecom = pd.concat([telecom, st1], axis=1)

# Creating a dummy variable for 'StreamingMovies'
sm = pd.get_dummies(telecom['StreamingMovies'], prefix='StreamingMovies')
sm1 = sm.drop(['StreamingMovies_No internet service'], axis=1)
telecom = pd.concat([telecom, sm1], axis=1)
```
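`pd.get_dummies` can also encode several columns in one call via its `columns` argument, producing the same `prefix_level` names as the step-by-step version above. A sketch on a toy frame (the values are illustrative):

```python
import pandas as pd

# Toy frame with two of the categorical columns (sketch, not the real data)
telecom = pd.DataFrame({
    'Contract': ['Month-to-month', 'One year', 'Two year'],
    'InternetService': ['DSL', 'Fiber optic', 'No'],
})

# One call encodes both columns and drops the first level of each,
# so the original 'Contract'/'InternetService' columns are replaced
telecom = pd.get_dummies(telecom, columns=['Contract', 'InternetService'],
                         drop_first=True)
print(sorted(telecom.columns))
```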

#### Step-8: Dropping the repeated variables

As we have created dummies for the variables below, we can drop the originals.

```
telecom = telecom.drop(['Contract', 'PaymentMethod', 'gender', 'MultipleLines',
                        'InternetService', 'OnlineSecurity', 'OnlineBackup',
                        'DeviceProtection', 'TechSupport', 'StreamingTV',
                        'StreamingMovies'], axis=1)
```

#### Step-9: Convert ‘TotalCharges’ from string to numeric.

```
# convert_objects() was removed from pandas; to_numeric with
# errors='coerce' turns non-numeric entries into NaN instead of failing
telecom['TotalCharges'] = pd.to_numeric(telecom['TotalCharges'], errors='coerce')
```
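`TotalCharges` loads as `object` dtype because a handful of rows hold blank strings, and `errors='coerce'` turns those into NaN rather than raising. A tiny sketch with toy values:

```python
import pandas as pd

# Toy strings imitating the TotalCharges column; ' ' is a blank entry
s = pd.Series(['29.85', ' ', '1889.5'])

converted = pd.to_numeric(s, errors='coerce')  # blank -> NaN
print(converted.isna().sum())  # 1
```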

Verify data.

```
telecom.info()
```

Output:

```
Int64Index: 7043 entries, 0 to 7042
Data columns (total 32 columns):
customerID                               7043 non-null object
tenure                                   7043 non-null int64
PhoneService                             7043 non-null int64
PaperlessBilling                         7043 non-null int64
MonthlyCharges                           7043 non-null float64
TotalCharges                             7032 non-null float64
Churn                                    7043 non-null int64
SeniorCitizen                            7043 non-null int64
Partner                                  7043 non-null int64
Dependents                               7043 non-null int64
Contract_One year                        7043 non-null uint8
Contract_Two year                        7043 non-null uint8
PaymentMethod_Credit card (automatic)    7043 non-null uint8
PaymentMethod_Electronic check           7043 non-null uint8
PaymentMethod_Mailed check               7043 non-null uint8
gender_Male                              7043 non-null uint8
MultipleLines_No                         7043 non-null uint8
MultipleLines_Yes                        7043 non-null uint8
InternetService_Fiber optic              7043 non-null uint8
InternetService_No                       7043 non-null uint8
OnlineSecurity_No                        7043 non-null uint8
OnlineSecurity_Yes                       7043 non-null uint8
OnlineBackup_No                          7043 non-null uint8
OnlineBackup_Yes                         7043 non-null uint8
DeviceProtection_No                      7043 non-null uint8
DeviceProtection_Yes                     7043 non-null uint8
TechSupport_No                           7043 non-null uint8
TechSupport_Yes                          7043 non-null uint8
StreamingTV_No                           7043 non-null uint8
StreamingTV_Yes                          7043 non-null uint8
StreamingMovies_No                       7043 non-null uint8
StreamingMovies_Yes                      7043 non-null uint8
dtypes: float64(2), int64(7), object(1), uint8(22)
memory usage: 756.6+ KB
```

#### Step-10: Check outliers

Checking for outliers in the continuous variables.

```
num_telecom = telecom[['tenure', 'MonthlyCharges', 'SeniorCitizen', 'TotalCharges']]
num_telecom.describe(percentiles=[.25,.5,.75,.90,.95,.99])
```

Output:

```
    tenure 	MonthlyCharges 	SeniorCitizen 	TotalCharges
count 	7043.000000 	7043.000000 	7043.000000 	7032.000000
mean 	32.371149 	64.761692 	0.162147 	2283.300441
std 	24.559481 	30.090047 	0.368612 	2266.771362
min 	0.000000 	18.250000 	0.000000 	18.800000
25% 	9.000000 	35.500000 	0.000000 	401.450000
50% 	29.000000 	70.350000 	0.000000 	1397.475000
75% 	55.000000 	89.850000 	0.000000 	3794.737500
90% 	69.000000 	102.600000 	1.000000 	5976.640000
95% 	72.000000 	107.400000 	1.000000 	6923.590000
99% 	72.000000 	114.729000 	1.000000 	8039.883000
max 	72.000000 	118.750000 	1.000000 	8684.800000
```
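A common numeric check for outliers is the 1.5 × IQR rule on the quartiles shown above. A sketch on hypothetical tenure values with one injected outlier:

```python
import pandas as pd

# Hypothetical tenure values; 300 is an injected outlier for illustration
tenure = pd.Series([1, 2, 29, 45, 55, 72, 300])

q1, q3 = tenure.quantile(0.25), tenure.quantile(0.75)
iqr = q3 - q1
# Flag anything beyond 1.5 * IQR from the quartiles
outliers = tenure[(tenure < q1 - 1.5 * iqr) | (tenure > q3 + 1.5 * iqr)]
print(outliers.tolist())  # [300]
```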

#### Step-11: Check missing values and handle them.

`telecom.isnull().sum()`

Output:

```
customerID                                0
tenure                                    0
PhoneService                              0
PaperlessBilling                          0
MonthlyCharges                            0
TotalCharges                             11
Churn                                     0
SeniorCitizen                             0
Partner                                   0
Dependents                                0
Contract_One year                         0
Contract_Two year                         0
PaymentMethod_Credit card (automatic)     0
PaymentMethod_Electronic check            0
PaymentMethod_Mailed check                0
gender_Male                               0
MultipleLines_No                          0
MultipleLines_Yes                         0
InternetService_Fiber optic               0
InternetService_No                        0
OnlineSecurity_No                         0
OnlineSecurity_Yes                        0
OnlineBackup_No                           0
OnlineBackup_Yes                          0
DeviceProtection_No                       0
DeviceProtection_Yes                      0
TechSupport_No                            0
TechSupport_Yes                           0
StreamingTV_No                            0
StreamingTV_Yes                           0
StreamingMovies_No                        0
StreamingMovies_Yes                       0
dtype: int64
```

Remove missing TotalCharges rows.

` telecom = telecom[~np.isnan(telecom['TotalCharges'])]`

#### Step-12: Feature Standardisation

As the continuous variables have widely different ranges, let's standardize their values.

```
df = telecom[['tenure', 'MonthlyCharges', 'TotalCharges']]
normalized_df = (df - df.mean()) / df.std()
telecom = telecom.drop(['tenure', 'MonthlyCharges', 'TotalCharges'], axis=1)
telecom = pd.concat([telecom, normalized_df], axis=1)
```
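scikit-learn's `StandardScaler` achieves the same centring and scaling, and the fitted scaler can later be reused on new data via `transform`. A sketch on toy values (note it divides by the population standard deviation, `ddof=0`, so results differ very slightly from the `df.std()` formula above):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy continuous columns (values are illustrative)
df = pd.DataFrame({'tenure': [1.0, 34.0, 2.0, 45.0, 2.0],
                   'MonthlyCharges': [29.85, 56.95, 53.85, 42.30, 70.70]})

# Centre each column to mean 0 and scale to unit (population) variance
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(scaled.round(3))
```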

### Model Building

#### Step-13: Split data into Train and Test.

In the process of model building for the Telecom Churn Case Study, the first step is to split the data into training and testing sets.
Drop ‘customerID’ from the feature set, as it has no predictive power.

```
from sklearn.model_selection import train_test_split

# Putting feature variables to X
X = telecom.drop(['Churn', 'customerID'], axis=1)

# Putting response variable to y
y = telecom['Churn']

# Splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=10)
```
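Since only about a quarter of customers churn, `stratify=y` is worth considering so that train and test keep the same churn ratio. A sketch with a toy imbalanced target (hypothetical data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with a 27% positive class, mimicking the churn rate (sketch)
X = pd.DataFrame({'tenure': range(100)})
y = pd.Series([1] * 27 + [0] * 73)

# stratify=y preserves the class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=10, stratify=y)
print(y_test.mean())  # close to the overall positive rate
```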

#### Step-14: Run the Logistic regression model.

We have used the ‘statsmodels‘ library to run the logistic regression model for the Telecom Churn Case Study. You can observe the coefficients and p-values for each attribute.

```
import statsmodels.api as sm

# Fit a GLM with a Binomial family, i.e. logistic regression
logm1 = sm.GLM(y_train, sm.add_constant(X_train), family=sm.families.Binomial())
logm1.fit().summary()
```

Output:

```
Generalized Linear Model Regression Results
Dep. Variable: 	Churn 	No. Observations: 	4922
Model: 	GLM 	Df Residuals: 	4898
Model Family: 	Binomial 	Df Model: 	23
Method: 	IRLS 	Log-Likelihood: 	-2004.7
Date: 	Sat, 18 Apr 2020 	Deviance: 	4009.4
Time: 	23:03:45 	Pearson chi2: 	6.07e+03
No. Iterations: 	7
Covariance Type: 	nonrobust

coef 	std err 	z 	P>|z| 	[0.025 	0.975]
const 	-3.2783 	1.187 	-2.762 	0.006 	-5.605 	-0.952
PhoneService 	0.8213 	0.588 	1.396 	0.163 	-0.332 	1.974
PaperlessBilling 	0.3254 	0.090 	3.614 	0.000 	0.149 	0.502
SeniorCitizen 	0.3984 	0.102 	3.924 	0.000 	0.199 	0.597
Partner 	0.0374 	0.094 	0.399 	0.690 	-0.146 	0.221
Dependents 	-0.1430 	0.107 	-1.332 	0.183 	-0.353 	0.067
Contract_One year 	-0.6578 	0.129 	-5.106 	0.000 	-0.910 	-0.405
Contract_Two year 	-1.2455 	0.212 	-5.874 	0.000 	-1.661 	-0.830
PaymentMethod_Credit card (automatic) 	-0.2577 	0.137 	-1.883 	0.060 	-0.526 	0.011
PaymentMethod_Electronic check 	0.1615 	0.113 	1.434 	0.152 	-0.059 	0.382
PaymentMethod_Mailed check 	-0.2536 	0.137 	-1.845 	0.065 	-0.523 	0.016
gender_Male 	-0.0346 	0.078 	-0.442 	0.658 	-0.188 	0.119
MultipleLines_No 	0.1295 	0.205 	0.632 	0.527 	-0.272 	0.531
MultipleLines_Yes 	0.6918 	0.392 	1.763 	0.078 	-0.077 	1.461
InternetService_Fiber optic 	2.5124 	0.967 	2.599 	0.009 	0.618 	4.407
InternetService_No 	-3.4348 	1.324 	-2.594 	0.009 	-6.030 	-0.839
OnlineSecurity_No 	0.0905 	0.058 	1.558 	0.119 	-0.023 	0.204
OnlineSecurity_Yes 	0.0660 	0.174 	0.380 	0.704 	-0.275 	0.407
OnlineBackup_No 	-0.0088 	0.055 	-0.161 	0.872 	-0.116 	0.098
OnlineBackup_Yes 	0.1653 	0.172 	0.960 	0.337 	-0.172 	0.503
DeviceProtection_No 	-0.0832 	0.056 	-1.487 	0.137 	-0.193 	0.026
DeviceProtection_Yes 	0.2397 	0.174 	1.379 	0.168 	-0.101 	0.580
TechSupport_No 	0.0935 	0.058 	1.604 	0.109 	-0.021 	0.208
TechSupport_Yes 	0.0630 	0.174 	0.362 	0.717 	-0.278 	0.404
StreamingTV_No 	-0.4016 	0.133 	-3.027 	0.002 	-0.662 	-0.142
StreamingTV_Yes 	0.5581 	0.267 	2.094 	0.036 	0.036 	1.081
StreamingMovies_No 	-0.3459 	0.133 	-2.609 	0.009 	-0.606 	-0.086
StreamingMovies_Yes 	0.5024 	0.266 	1.886 	0.059 	-0.020 	1.025
tenure 	-1.5198 	0.190 	-8.015 	0.000 	-1.891 	-1.148
MonthlyCharges 	-2.1817 	1.160 	-1.880 	0.060 	-4.456 	0.092
TotalCharges 	0.7329 	0.198 	3.705 	0.000 	0.345 	1.121
```

#### Step-15: Feature Selection Using RFE

We can remove unnecessary variables from the list based on the P-value. However, here we will use the RFE method to reduce our variables.

```
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

logreg = LogisticRegression()
rfe = RFE(logreg, n_features_to_select=13)  # running RFE with 13 variables as output
rfe = rfe.fit(X, y)
print(rfe.support_)   # Printing the boolean results
print(rfe.ranking_)   # Printing the ranking
```

Output:

```
[ True  True False False False  True  True False  True False False  True
False  True  True False  True False False False False False  True False
False  True False  True False  True]
[ 1  1  2 18  6  1  1 11  1 12 14  1  8  1  1  4  1 15  5 13 10  7  1  3
16  1 17  1  9  1]
```
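Reading column names off the boolean mask is easier than counting positions in `rfe.support_` by hand. A sketch on synthetic data (the feature names `f0`…`f5` are hypothetical):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

# Synthetic stand-in for X and y (hypothetical feature names)
X_arr, y = make_classification(n_samples=200, n_features=6, random_state=0)
X = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(6)])

rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)

# Boolean mask -> actual column names of the selected features
selected = X.columns[rfe.support_].tolist()
print(selected)
```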

#### Step-16: Use variables selected by RFE.

Now use the selected variables as input to the model for the telecom churn case study.
```
col =['PhoneService', 'PaperlessBilling', 'Contract_One year', 'Contract_Two year', 'PaymentMethod_Electronic check', 'MultipleLines_No','InternetService_Fiber optic', 'InternetService_No', 'OnlineSecurity_Yes', 'TechSupport_Yes', 'StreamingMovies_No', 'tenure', 'TotalCharges']
```

#### Step-17: Run the model using the selected variables.

```
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

logsk = LogisticRegression()
logsk.fit(X_train[col], y_train)
```

#### Step-18: Use StatsModels to compare the model.

```
# Rebuild the GLM on the 13 RFE-selected variables
logm4 = sm.GLM(y_train, sm.add_constant(X_train[col]), family=sm.families.Binomial())
logm4.fit().summary()
```

Output:

```
Generalized Linear Model Regression Results
Dep. Variable: 	Churn 	No. Observations: 	4922
Model: 	GLM 	Df Residuals: 	4908
Model Family: 	Binomial 	Df Model: 	13
Method: 	IRLS 	Log-Likelihood: 	-2024.2
Date: 	Sat, 18 Apr 2020 	Deviance: 	4048.4
Time: 	23:39:24 	Pearson chi2: 	6.19e+03
No. Iterations: 	7
Covariance Type: 	nonrobust

coef 	std err 	z 	P>|z| 	[0.025 	0.975]
const 	-1.0162 	0.169 	-6.017 	0.000 	-1.347 	-0.685
PhoneService 	-0.3090 	0.173 	-1.784 	0.074 	-0.648 	0.030
PaperlessBilling 	0.3595 	0.089 	4.029 	0.000 	0.185 	0.534
Contract_One year 	-0.7012 	0.127 	-5.516 	0.000 	-0.950 	-0.452
Contract_Two year 	-1.3187 	0.210 	-6.271 	0.000 	-1.731 	-0.907
PaymentMethod_Electronic check 	0.3668 	0.083 	4.446 	0.000 	0.205 	0.529
MultipleLines_No 	-0.2311 	0.095 	-2.435 	0.015 	-0.417 	-0.045
InternetService_Fiber optic 	0.7937 	0.116 	6.836 	0.000 	0.566 	1.021
InternetService_No 	-1.1832 	0.182 	-6.484 	0.000 	-1.541 	-0.826
OnlineSecurity_Yes 	-0.4107 	0.102 	-4.031 	0.000 	-0.610 	-0.211
TechSupport_Yes 	-0.4181 	0.101 	-4.135 	0.000 	-0.616 	-0.220
StreamingMovies_No 	-0.2024 	0.094 	-2.160 	0.031 	-0.386 	-0.019
tenure 	-1.4974 	0.181 	-8.251 	0.000 	-1.853 	-1.142
TotalCharges 	0.7373 	0.186 	3.965 	0.000 	0.373 	1.102
```

#### Step-19: Verify VIF value.

```
def vif_cal(input_data, dependent_col):
    vif_df = pd.DataFrame(columns=['Var', 'Vif'])
    x_vars = input_data.drop([dependent_col], axis=1)
    xvar_names = x_vars.columns
    for i in range(0, xvar_names.shape[0]):
        y = x_vars[xvar_names[i]]
        x = x_vars[xvar_names.drop(xvar_names[i])]
        rsq = sm.OLS(y, x).fit().rsquared
        vif = round(1 / (1 - rsq), 2)
        vif_df.loc[i] = [xvar_names[i], vif]
    return vif_df.sort_values(by='Vif', axis=0, ascending=False, inplace=False)
```

#### Step-20: Calculate VIF values.

Keep the attributes selected in the RFE step and drop the other attributes.
```
vif_cal(input_data=telecom.drop(['customerID',
'SeniorCitizen',
'Partner',
'Dependents',
'PaymentMethod_Credit card (automatic)',
'PaymentMethod_Mailed check',
'gender_Male',
'MultipleLines_Yes',
'OnlineSecurity_No',
'OnlineBackup_No',
'OnlineBackup_Yes',
'DeviceProtection_No',
'DeviceProtection_Yes',
'TechSupport_No',
'StreamingTV_No',
'StreamingTV_Yes',
'StreamingMovies_Yes',
'MonthlyCharges'], axis=1), dependent_col='Churn')
```

Output:

```
	Var 	Vif
0 	PhoneService 	10.87
12 	TotalCharges 	8.58
11 	tenure 	6.80
1 	PaperlessBilling 	2.61
7 	InternetService_No 	0.65
3 	Contract_Two year 	0.28
2 	Contract_One year 	0.24
9 	TechSupport_Yes 	0.24
8 	OnlineSecurity_Yes 	0.21
10 	StreamingMovies_No 	0.19
4 	PaymentMethod_Electronic check 	0.05
5 	MultipleLines_No 	0.05
6 	InternetService_Fiber optic 	0.03
```

As you can see, ‘PhoneService’ has a huge VIF value. Let’s drop ‘PhoneService’ in the next iteration and run the logistic model again.

```
col = ['PaperlessBilling', 'Contract_One year', 'Contract_Two year',
       'PaymentMethod_Electronic check', 'MultipleLines_No',
       'InternetService_Fiber optic', 'InternetService_No',
       'OnlineSecurity_Yes', 'TechSupport_Yes', 'StreamingMovies_No',
       'tenure', 'TotalCharges']

# Rebuild the GLM on the reduced variable set
logm5 = sm.GLM(y_train, sm.add_constant(X_train[col]), family=sm.families.Binomial())
logm5.fit().summary()
```

Output:

```
Dep. Variable: 	Churn 	No. Observations: 	4922
Model: 	GLM 	Df Residuals: 	4909
Model Family: 	Binomial 	Df Model: 	12
Method: 	IRLS 	Log-Likelihood: 	-2025.8
Date: 	Sat, 18 Apr 2020 	Deviance: 	4051.5
Time: 	23:57:56 	Pearson chi2: 	6.00e+03
No. Iterations: 	7
Covariance Type: 	nonrobust

coef 	std err 	z 	P>|z| 	[0.025 	0.975]
const 	-1.1915 	0.138 	-8.607 	0.000 	-1.463 	-0.920
PaperlessBilling 	0.3563 	0.089 	3.998 	0.000 	0.182 	0.531
Contract_One year 	-0.6965 	0.127 	-5.483 	0.000 	-0.945 	-0.448
Contract_Two year 	-1.3078 	0.210 	-6.230 	0.000 	-1.719 	-0.896
PaymentMethod_Electronic check 	0.3700 	0.082 	4.487 	0.000 	0.208 	0.532
MultipleLines_No 	-0.2990 	0.087 	-3.442 	0.001 	-0.469 	-0.129
InternetService_Fiber optic 	0.7227 	0.108 	6.666 	0.000 	0.510 	0.935
InternetService_No 	-1.2732 	0.175 	-7.276 	0.000 	-1.616 	-0.930
OnlineSecurity_Yes 	-0.4100 	0.102 	-4.025 	0.000 	-0.610 	-0.210
TechSupport_Yes 	-0.4202 	0.101 	-4.157 	0.000 	-0.618 	-0.222
StreamingMovies_No 	-0.2205 	0.093 	-2.366 	0.018 	-0.403 	-0.038
tenure 	-1.4276 	0.177 	-8.066 	0.000 	-1.774 	-1.081
TotalCharges 	0.6495 	0.179 	3.622 	0.000 	0.298 	1.001
```

#### Step-21: Calculate VIF again.

```
vif_cal(input_data=telecom.drop(['customerID', 'PhoneService', 'SeniorCitizen',
                                 'Partner', 'Dependents',
                                 'PaymentMethod_Credit card (automatic)',
                                 'PaymentMethod_Mailed check', 'gender_Male',
                                 'MultipleLines_Yes', 'OnlineSecurity_No',
                                 'OnlineBackup_No', 'OnlineBackup_Yes',
                                 'DeviceProtection_No', 'DeviceProtection_Yes',
                                 'TechSupport_No', 'StreamingTV_No',
                                 'StreamingTV_Yes', 'StreamingMovies_Yes',
                                 'MonthlyCharges'], axis=1),
        dependent_col='Churn')
```

Output:

```
	Var 	Vif
11 	TotalCharges 	8.24
10 	tenure 	6.56
0 	PaperlessBilling 	2.44
6 	InternetService_No 	0.45
2 	Contract_Two year 	0.26
8 	TechSupport_Yes 	0.24
1 	Contract_One year 	0.23
7 	OnlineSecurity_Yes 	0.21
9 	StreamingMovies_No 	0.17
3 	PaymentMethod_Electronic check 	0.05
4 	MultipleLines_No 	0.04
5 	InternetService_Fiber optic 	0.02
```

At this point, the VIFs and p-values are in an acceptable range for the telecom churn case study, so we can retain these variables in the final model.

```
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

logsk = LogisticRegression()
logsk.fit(X_train[col], y_train)
```

### Model Predictions

Predict probabilities.

```
y_pred = logsk.predict_proba(X_test[col])
y_pred_df = pd.DataFrame(y_pred)
# Column 1 holds the predicted probability of Churn = 1
y_pred_1 = y_pred_df.iloc[:, [1]]
y_pred_1.head()
```

Output:

```
0 	0.499083
1 	0.372696
2 	0.006738
3 	0.635453
4 	0.007533
```

Consider the Probability cutoff point as 0.5.

```
y_test_df = pd.DataFrame(y_test)
y_test_df['CustID'] = y_test_df.index
y_pred_1.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)
y_pred_final = pd.concat([y_test_df, y_pred_1], axis=1)
y_pred_final = y_pred_final.rename(columns={1: 'Churn_Prob'})
# reindex_axis() was removed from pandas; reindex(columns=...) replaces it
y_pred_final = y_pred_final.reindex(columns=['CustID', 'Churn', 'Churn_Prob'])
y_pred_final['predicted'] = y_pred_final.Churn_Prob.map(lambda x: 1 if x > 0.5 else 0)
```

Output:

```
      CustID 	Churn 	Churn_Prob 	predicted
0 	942 	0 	0.499083 	0
1 	3730 	1 	0.372696 	0
2 	1761 	0 	0.006738 	0
3 	2283 	1 	0.635453 	1
4 	1872 	0 	0.007533 	0
```

### Model Evaluation

Verify the Confusion matrix.

```
from sklearn import metrics
# Confusion matrix
confusion =metrics.confusion_matrix( y_pred_final.Churn, y_pred_final.predicted )
confusion
```

Output:

```
array([[1362,  166],
       [ 249,  333]])
```

Check the overall accuracy.

```
metrics.accuracy_score( y_pred_final.Churn, y_pred_final.predicted)
```

Output:

``0.8033175355450237``

Verify sensitivity and specificity for the Telecom Churn Case Study Model.

```
# With Churn = 1 as the positive class: rows are actuals, columns are predictions
TN = confusion[0, 0]  # true negatives: actual No, predicted No
FP = confusion[0, 1]  # false positives: actual No, predicted Yes
FN = confusion[1, 0]  # false negatives: actual Yes, predicted No
TP = confusion[1, 1]  # true positives: actual Yes, predicted Yes
print("sensitivity:", TP / float(TP + FN))
print("specificity:", TN / float(TN + FP))
```

Output:

```
sensitivity: 0.5721649484536082
specificity: 0.8913612565445026
```

Find the optimal cutoff point. Let’s create columns with different probability cutoffs.

```
numbers = [float(x) / 10 for x in range(10)]
for i in numbers:
    y_pred_final[i] = y_pred_final.Churn_Prob.map(lambda x: 1 if x > i else 0)
```

Output:

```
   CustID  Churn  Churn_Prob  predicted  0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9
0     942      0    0.499083          0    1    1    1    1    1    0    0    0    0    0
1    3730      1    0.372696          0    1    1    1    1    0    0    0    0    0    0
2    1761      0    0.006738          0    1    0    0    0    0    0    0    0    0    0
3    2283      1    0.635453          1    1    1    1    1    1    1    1    0    0    0
4    1872      0    0.007533          0    1    0    0    0    0    0    0    0    0    0
```

Now let's calculate accuracy, sensitivity, and specificity for various probability cutoffs.

```
cutoff_df = pd.DataFrame(columns=['prob', 'accuracy', 'sensi', 'speci'])
num = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_pred_final.Churn, y_pred_final[i])
    total1 = sum(sum(cm1))
    accuracy = (cm1[0, 0] + cm1[1, 1]) / total1
    # With Churn = 1 as the positive class:
    sensi = cm1[1, 1] / (cm1[1, 0] + cm1[1, 1])   # recall of churners
    speci = cm1[0, 0] / (cm1[0, 0] + cm1[0, 1])   # recall of non-churners
    cutoff_df.loc[i] = [i, accuracy, sensi, speci]
print(cutoff_df)
```

Output:

```
     prob  accuracy     sensi     speci
0.0   0.0  0.275829  1.000000  0.000000
0.1   0.1  0.605687  0.943299  0.477094
0.2   0.2  0.695261  0.831615  0.643325
0.3   0.3  0.750237  0.743986  0.752618
0.4   0.4  0.783886  0.666667  0.828534
0.5   0.5  0.803318  0.572165  0.891361
0.6   0.6  0.795735  0.412371  0.941754
0.7   0.7  0.757820  0.178694  0.978403
0.8   0.8  0.727962  0.013746  1.000000
0.9   0.9  0.724171  0.000000  1.000000
```
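The cutoff is typically taken where the two per-class recall curves cross. A small sketch that finds it programmatically, re-using the per-class recall values from the cutoff table (labelled neutrally here as class-0 and class-1 recall, since the crossing point is the same either way):

```python
import pandas as pd

# Per-class recall values copied from the cutoff table
cutoff_df = pd.DataFrame({
    'prob':     [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    'recall_0': [0.000000, 0.477094, 0.643325, 0.752618, 0.828534,
                 0.891361, 0.941754, 0.978403, 1.000000, 1.000000],
    'recall_1': [1.000000, 0.943299, 0.831615, 0.743986, 0.666667,
                 0.572165, 0.412371, 0.178694, 0.013746, 0.000000],
})

# The cutoff where the two recalls are closest balances the two error types
gap = (cutoff_df['recall_0'] - cutoff_df['recall_1']).abs()
best = cutoff_df.loc[gap.idxmin(), 'prob']
print(best)  # 0.3
```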

From the table above, 0.3 is the optimal cutoff probability, as sensitivity and specificity are balanced there.

```
y_pred_final['final_predicted'] = y_pred_final.Churn_Prob.map( lambda x: 1 if x > 0.3 else 0)
metrics.accuracy_score( y_pred_final.Churn, y_pred_final.final_predicted)
```

Output:

``0.7502369668246446``

#### Verify the confusion matrix.

```
metrics.confusion_matrix( y_pred_final.Churn, y_pred_final.final_predicted )
```

Output:

```
array([[1150,  378],
       [ 149,  433]])
```

#### Conclusion:

In this module, you learned how to build a logistic regression model for the Telecom Churn Case Study. Please follow this link to find more Logistic Regression case studies. Happy learning!