Case Study – Telecom Churn

Telecom Churn Case Study

In this section, you will find details about Logistic Regression in Python and a discuss telecom churn case study where the company wants to predict an existing customer will leave the network or not. The telecom churn case study problem is solved using Logistic regression.

You have three files that contain the required information.

1-Customer data( customer_data.csv)

customerID gender SeniorCitizen Partner Dependents
7590-VHVEG Female 0 Yes No
5575-GNVDE Male 0 No No
3668-QPYBK Male 0 No No
7795-CFOCW Male 0 No No
9237-HQITU Female 0 No No

 

2-Churn data( churn_data.csv)

customerID tenure PhoneService Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
7590-VHVEG 1 No Month-to-month Yes Electronic check 29.85 29.85 No
5575-GNVDE 34 Yes One year No Mailed check 56.95 1889.5 No
3668-QPYBK 2 Yes Month-to-month Yes Mailed check 53.85 108.15 Yes
7795-CFOCW 45 No One year No Bank transfer (automatic) 42.30 1840.75 No
9237-HQITU 2 Yes Month-to-month Yes Electronic check 70.70 151.65 Yes

 

3-Internet data( internet_data.csv)

customerID MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies
7590-VHVEG No phone service DSL No Yes No No No No
5575-GNVDE No DSL Yes No Yes No No No
3668-QPYBK No DSL Yes Yes No No No No
7795-CFOCW No phone service DSL Yes No Yes Yes No No
9237-HQITU No Fiber optic No No No No No No

 

‘customerID’ is the unique identifier for each customer and can be used to map data across data sets.

Data Loading and Statistical Analysis:

The first step of any analysis is to understand and analyze the data. In the case of the telecom churn case study also, we will first understand data.

Step-1: Import data:

Load all required files to analyze customer churn and understand Logistic Regression in Python.

 
# Importing Pandas and NumPy
import pandas as pd
import numpy as np

# Importing all datasets
churn_data = pd.read_csv("churn_data.csv")
customer_data = pd.read_csv("customer_data.csv")
internet_data = pd.read_csv("internet_data.csv")

Step-2: Verify dataframe:

After loading data, verify and analyze each dataframe.

 
internet_data.head()

Output:

customerID MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies
0 7590-VHVEG No phone service DSL No Yes No No No No
1 5575-GNVDE No DSL Yes No Yes No No No
2 3668-QPYBK No DSL Yes Yes No No No No
3 7795-CFOCW No phone service DSL Yes No Yes Yes No No
4 9237-HQITU No Fiber optic No No No No No No

Similarly, you can verify other data frames.

Step-3: Merge all dataframe into one:

Once data is loaded into data frames, try merging all differents data frames into one. Analyzing and the model building will be efficient in the case of a single data frame. In this example, ‘customerID’ is the unique key and we can use this to merge all the three dataframes into one.

 
#Merging on 'customerID'
df_1 = pd.merge(churn_data, customer_data, how='inner', on='customerID')
telecom = pd.merge(df_1, internet_data, how='inner', on='customerID')

Step-4: Analyze the structure of final dataframe:

Verify statistical information of the dataframe.

 telecom.describe()

Output:

        tenure 	MonthlyCharges 	SeniorCitizen
count 	7043.000000 	7043.000000 	7043.000000
mean 	32.371149 	64.761692 	0.162147
std 	24.559481 	30.090047 	0.368612
min 	0.000000 	18.250000 	0.000000
25% 	9.000000 	35.500000 	0.000000
50% 	29.000000 	70.350000 	0.000000
75% 	55.000000 	89.850000 	0.000000
max 	72.000000 	118.750000 	1.000000
telecom.info()

Output:

class 'pandas.core.frame.DataFrame';
Int64Index: 7043 entries, 0 to 7042
Data columns (total 21 columns):
customerID          7043 non-null object
tenure              7043 non-null int64
PhoneService        7043 non-null object
Contract            7043 non-null object
PaperlessBilling    7043 non-null object
PaymentMethod       7043 non-null object
MonthlyCharges      7043 non-null float64
TotalCharges        7043 non-null object
Churn               7043 non-null object
gender              7043 non-null object
SeniorCitizen       7043 non-null int64
Partner             7043 non-null object
Dependents          7043 non-null object
MultipleLines       7043 non-null object
InternetService     7043 non-null object
OnlineSecurity      7043 non-null object
OnlineBackup        7043 non-null object
DeviceProtection    7043 non-null object
TechSupport         7043 non-null object
StreamingTV         7043 non-null object
StreamingMovies     7043 non-null object
dtypes: float64(1), int64(2), object(18)
memory usage: 1.2+ MB

Step-5: Check the data size of the master dataframe.

print("Shape of telecom data: ",telecom.shape)

Output:

Shape of telecom data:  (7043, 21)

Verify the number of unique records in each dataframe.

 telecom.nunique()

Output:

customerID          7043
tenure                73
PhoneService           2
Contract               3
PaperlessBilling       2
PaymentMethod          4
MonthlyCharges      1585
TotalCharges        6531
Churn                  2
gender                 2
SeniorCitizen          2
Partner                2
Dependents             2
MultipleLines          3
InternetService        3
OnlineSecurity         3
OnlineBackup           3
DeviceProtection       3
TechSupport            3
StreamingTV            3
StreamingMovies        3

Data Preparation

Step-6: Converting Yes to 1 and No to 0

As few attributes contain categorical variables as Yes or No, Let’s map then to 1 or 0.

 
telecom['PhoneService'] = telecom['PhoneService'].map({'Yes': 1, 'No': 0})
telecom['PaperlessBilling'] = telecom['PaperlessBilling'].map({'Yes': 1, 'No': 0})
telecom['Churn'] = telecom['Churn'].map({'Yes': 1, 'No': 0})
telecom['Partner'] = telecom['Partner'].map({'Yes': 1, 'No': 0})
telecom['Dependents'] = telecom['Dependents'].map({'Yes': 1, 'No': 0})

Step-7: Dummy Variable Creation

Convert non-integer attributes to the dummy variables.

 
# Creating a dummy variable for the variable 'Contract' and dropping the first one.
cont = pd.get_dummies(telecom['Contract'],prefix='Contract',drop_first=True)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,cont],axis=1)

# Creating a dummy variable for the variable 'PaymentMethod' and dropping the first one.
pm = pd.get_dummies(telecom['PaymentMethod'],prefix='PaymentMethod',drop_first=True)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,pm],axis=1)

# Creating a dummy variable for the variable 'gender' and dropping the first one.
gen = pd.get_dummies(telecom['gender'],prefix='gender',drop_first=True)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,gen],axis=1)

# Creating a dummy variable for the variable 'MultipleLines' and dropping the first one.
ml = pd.get_dummies(telecom['MultipleLines'],prefix='MultipleLines')
#  dropping MultipleLines_No phone service column
ml1 = ml.drop(['MultipleLines_No phone service'],1)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,ml1],axis=1)

# Creating a dummy variable for the variable 'InternetService' and dropping the first one.
iser = pd.get_dummies(telecom['InternetService'],prefix='InternetService',drop_first=True)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,iser],axis=1)

# Creating a dummy variable for the variable 'OnlineSecurity'.
os = pd.get_dummies(telecom['OnlineSecurity'],prefix='OnlineSecurity')
os1= os.drop(['OnlineSecurity_No internet service'],1)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,os1],axis=1)

# Creating a dummy variable for the variable 'OnlineBackup'.
ob =pd.get_dummies(telecom['OnlineBackup'],prefix='OnlineBackup')
ob1 =ob.drop(['OnlineBackup_No internet service'],1)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,ob1],axis=1)

# Creating a dummy variable for the variable 'DeviceProtection'. 
dp =pd.get_dummies(telecom['DeviceProtection'],prefix='DeviceProtection')
dp1 = dp.drop(['DeviceProtection_No internet service'],1)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,dp1],axis=1)

# Creating a dummy variable for the variable 'TechSupport'. 
ts =pd.get_dummies(telecom['TechSupport'],prefix='TechSupport')
ts1 = ts.drop(['TechSupport_No internet service'],1)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,ts1],axis=1)

# Creating a dummy variable for the variable 'StreamingTV'.
st =pd.get_dummies(telecom['StreamingTV'],prefix='StreamingTV')
st1 = st.drop(['StreamingTV_No internet service'],1)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,st1],axis=1)

# Creating a dummy variable for the variable 'StreamingMovies'. 
sm =pd.get_dummies(telecom['StreamingMovies'],prefix='StreamingMovies')
sm1 = sm.drop(['StreamingMovies_No internet service'],1)
#Adding the results to the master dataframe
telecom = pd.concat([telecom,sm1],axis=1)

Step-8: Dropping the repeated variables

As we have created dummies for the below variables, so we can drop them.

 telecom = telecom.drop([
'Contract',
'PaymentMethod',
'gender',
'MultipleLines',
'InternetService', 
'OnlineSecurity', 
'OnlineBackup', 
'DeviceProtection',
'TechSupport', 
'StreamingTV', 
'StreamingMovies'], 1)

Step-9: Convert ‘TotalCharges’ from String to Double.

telecom['TotalCharges']=telecom['TotalCharges'].convert_objects(convert_numeric=True)

Verify data.

telecom.info()

Output:

Int64Index: 7043 entries, 0 to 7042
Data columns (total 32 columns):
customerID                               7043 non-null object
tenure                                   7043 non-null int64
PhoneService                             7043 non-null int64
PaperlessBilling                         7043 non-null int64
MonthlyCharges                           7043 non-null float64
TotalCharges                             7032 non-null float64
Churn                                    7043 non-null int64
SeniorCitizen                            7043 non-null int64
Partner                                  7043 non-null int64
Dependents                               7043 non-null int64
Contract_One year                        7043 non-null uint8
Contract_Two year                        7043 non-null uint8
PaymentMethod_Credit card (automatic)    7043 non-null uint8
PaymentMethod_Electronic check           7043 non-null uint8
PaymentMethod_Mailed check               7043 non-null uint8
gender_Male                              7043 non-null uint8
MultipleLines_No                         7043 non-null uint8
MultipleLines_Yes                        7043 non-null uint8
InternetService_Fiber optic              7043 non-null uint8
InternetService_No                       7043 non-null uint8
OnlineSecurity_No                        7043 non-null uint8
OnlineSecurity_Yes                       7043 non-null uint8
OnlineBackup_No                          7043 non-null uint8
OnlineBackup_Yes                         7043 non-null uint8
DeviceProtection_No                      7043 non-null uint8
DeviceProtection_Yes                     7043 non-null uint8
TechSupport_No                           7043 non-null uint8
TechSupport_Yes                          7043 non-null uint8
StreamingTV_No                           7043 non-null uint8
StreamingTV_Yes                          7043 non-null uint8
StreamingMovies_No                       7043 non-null uint8
StreamingMovies_Yes                      7043 non-null uint8
dtypes: float64(2), int64(7), object(1), uint8(22)
memory usage: 756.6+ KB

Step-10: Check outliers

Checking for outliers in the continuous variables.

 
num_telecom =telecom[['tenure','MonthlyCharges','SeniorCitizen','TotalCharges']]
num_telecom.describe(percentiles=[.25,.5,.75,.90,.95,.99])

Output:

    tenure 	MonthlyCharges 	SeniorCitizen 	TotalCharges
count 	7043.000000 	7043.000000 	7043.000000 	7032.000000
mean 	32.371149 	64.761692 	0.162147 	2283.300441
std 	24.559481 	30.090047 	0.368612 	2266.771362
min 	0.000000 	18.250000 	0.000000 	18.800000
25% 	9.000000 	35.500000 	0.000000 	401.450000
50% 	29.000000 	70.350000 	0.000000 	1397.475000
75% 	55.000000 	89.850000 	0.000000 	3794.737500
90% 	69.000000 	102.600000 	1.000000 	5976.640000
95% 	72.000000 	107.400000 	1.000000 	6923.590000
99% 	72.000000 	114.729000 	1.000000 	8039.883000
max 	72.000000 	118.750000 	1.000000 	8684.800000

Step-11: Check missing values and Impute them.

 telecom.isnull().sum()

Output:

customerID               0
tenure                                    0
PhoneService                              0
PaperlessBilling                          0
MonthlyCharges                            0
TotalCharges                             11
Churn                                     0
SeniorCitizen                             0
Partner                                   0
Dependents                                0
Contract_One year                         0
Contract_Two year                         0
PaymentMethod_Credit card (automatic)     0
PaymentMethod_Electronic check            0
PaymentMethod_Mailed check                0
gender_Male                               0
MultipleLines_No                          0
MultipleLines_Yes                         0
InternetService_Fiber optic               0
InternetService_No                        0
OnlineSecurity_No                         0
OnlineSecurity_Yes                        0
OnlineBackup_No                           0
OnlineBackup_Yes                          0
DeviceProtection_No                       0
DeviceProtection_Yes                      0
TechSupport_No                            0
TechSupport_Yes                           0
StreamingTV_No                            0
StreamingTV_Yes                           0
StreamingMovies_No                        0
StreamingMovies_Yes                       0
dtype: int64

Remove missing TotalCharges rows.

 telecom = telecom[~np.isnan(telecom['TotalCharges'])]

Step-12: Feature Standardisation

As continuous variables have a wide range of variables, let’s standardize the values.

 
df = telecom[['tenure','MonthlyCharges','TotalCharges']]
normalized_df=(df-df.mean())/df.std()
telecom = telecom.drop(['tenure','MonthlyCharges','TotalCharges'], 1)
telecom = pd.concat([telecom,normalized_df],axis=1)

Model Building

Step-13: Split data into Train and Test.

In the process of model building for the Telecom Churn Case Study, the first step is to split the data into training and testing data.
Drop ‘customerID’ from the dataframe as this is not having any predictive power.

 from sklearn.model_selection import train_test_split
# Putting feature variable to X
X = telecom.drop(['Churn','customerID'],axis=1)

# Putting response variable to y
y = telecom['Churn']

# Splitting the data into train and test
X_train,X_test,y_train,y_test=train_test_split(X,y,train_size=0.7,test_size=0.3,random_state=10) 

Step-14: Run the Logistic regression model.

we have used the ‘stasmodel‘ library to run the logistic model in Telecom Churn Case Study. You can observe the details of coefficients and P-values for each attribute.

 
import statsmodels.api as sm
logm1 =sm.GLM(y_train,(sm.add_constant(X_train)), family = sm.families.Binomial())
logm1.fit().summary()

Output:

Generalized Linear Model Regression Results Dep. Variable: 	Churn 	No. Observations: 	4922
Model: 	GLM 	Df Residuals: 	4898
Model Family: 	Binomial 	Df Model: 	23
Link Function: 	logit 	Scale: 	1.0000
Method: 	IRLS 	Log-Likelihood: 	-2004.7
Date: 	Sat, 18 Apr 2020 	Deviance: 	4009.4
Time: 	23:03:45 	Pearson chi2: 	6.07e+03
No. Iterations: 	7 		
Covariance Type: 	nonrobust 		
	coef 	std err 	z 	P.      |z| 	[0.025 	0.975]
const 	-3.2783 	1.187 	-2.762 	0.006 	-5.605 	-0.952
PhoneService 	0.8213 	0.588 	1.396 	0.163 	-0.332 	1.974
PaperlessBilling 	0.3254 	0.090 	3.614 	0.000 	0.149 	0.502
SeniorCitizen 	0.3984 	0.102 	3.924 	0.000 	0.199 	0.597
Partner 	0.0374 	0.094 	0.399 	0.690 	-0.146 	0.221
Dependents 	-0.1430 	0.107 	-1.332 	0.183 	-0.353 	0.067
Contract_One year 	-0.6578 	0.129 	-5.106 	0.000 	-0.910 	-0.405
Contract_Two year 	-1.2455 	0.212 	-5.874 	0.000 	-1.661 	-0.830
PaymentMethod_Credit card (automatic) 	-0.2577 	0.137 	-1.883 	0.060 	-0.526 	0.011
PaymentMethod_Electronic check 	0.1615 	0.113 	1.434 	0.152 	-0.059 	0.382
PaymentMethod_Mailed check 	-0.2536 	0.137 	-1.845 	0.065 	-0.523 	0.016
gender_Male 	-0.0346 	0.078 	-0.442 	0.658 	-0.188 	0.119
MultipleLines_No 	0.1295 	0.205 	0.632 	0.527 	-0.272 	0.531
MultipleLines_Yes 	0.6918 	0.392 	1.763 	0.078 	-0.077 	1.461
InternetService_Fiber optic 	2.5124 	0.967 	2.599 	0.009 	0.618 	4.407
InternetService_No 	-3.4348 	1.324 	-2.594 	0.009 	-6.030 	-0.839
OnlineSecurity_No 	0.0905 	0.058 	1.558 	0.119 	-0.023 	0.204
OnlineSecurity_Yes 	0.0660 	0.174 	0.380 	0.704 	-0.275 	0.407
OnlineBackup_No 	-0.0088 	0.055 	-0.161 	0.872 	-0.116 	0.098
OnlineBackup_Yes 	0.1653 	0.172 	0.960 	0.337 	-0.172 	0.503
DeviceProtection_No 	-0.0832 	0.056 	-1.487 	0.137 	-0.193 	0.026
DeviceProtection_Yes 	0.2397 	0.174 	1.379 	0.168 	-0.101 	0.580
TechSupport_No 	0.0935 	0.058 	1.604 	0.109 	-0.021 	0.208
TechSupport_Yes 	0.0630 	0.174 	0.362 	0.717 	-0.278 	0.404
StreamingTV_No 	-0.4016 	0.133 	-3.027 	0.002 	-0.662 	-0.142
StreamingTV_Yes 	0.5581 	0.267 	2.094 	0.036 	0.036 	1.081
StreamingMovies_No 	-0.3459 	0.133 	-2.609 	0.009 	-0.606 	-0.086
StreamingMovies_Yes 	0.5024 	0.266 	1.886 	0.059 	-0.020 	1.025
tenure 	-1.5198 	0.190 	-8.015 	0.000 	-1.891 	-1.148
MonthlyCharges 	-2.1817 	1.160 	-1.880 	0.060 	-4.456 	0.092
TotalCharges 	0.7329 	0.198 	3.705 	0.000 	0.345 	1.121

Step-15: Feature Selection Using RFE

We can remove unnecessary variables from the list based on the P-value. However, here we will use the RFE method to reduce our variables.

 
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
logreg =LogisticRegression()
rfe =RFE(logreg, 13)# running RFE with 13 variables as output
rfe =rfe.fit(X,y)
print(rfe.support_)# Printing the boolean results
print(rfe.ranking_)# Printing the ranking 

Output:

[ True  True False False False  True  True False  True False False  True
 False  True  True False  True False False False False False  True False
 False  True False  True False  True]
[ 1  1  2 18  6  1  1 11  1 12 14  1  8  1  1  4  1 15  5 13 10  7  1  3
 16  1 17  1  9  1] 

Step-16: Use variables selected by RFE.

Now use the selected variables as input to the model for the telecom churn case study.
 
col =['PhoneService', 'PaperlessBilling', 'Contract_One year', 'Contract_Two year', 'PaymentMethod_Electronic check', 'MultipleLines_No','InternetService_Fiber optic', 'InternetService_No', 'OnlineSecurity_Yes', 'TechSupport_Yes', 'StreamingMovies_No', 'tenure', 'TotalCharges'] 

Step-17: Run the model using the selected variables.

 
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logsk =LogisticRegression()
logsk.fit(X_train[col], y_train) 

Step-18: Use StatsModels to compare the model.

 
logm4 =sm.GLM(y_train,(sm.add_constant(X_train[col])), family = sm.families.Binomial())
logm4.fit().summary()

Output:

 Generalized Linear Model Regression Results Dep. Variable: 	Churn 	No. Observations: 	4922
Model: 	GLM 	Df Residuals: 	4908
Model Family: 	Binomial 	Df Model: 	13
Link Function: 	logit 	Scale: 	1.0000
Method: 	IRLS 	Log-Likelihood: 	-2024.2
Date: 	Sat, 18 Apr 2020 	Deviance: 	4048.4
Time: 	23:39:24 	Pearson chi2: 	6.19e+03
No. Iterations: 	7 		
Covariance Type: 	nonrobust 		
	coef 	std err 	z 	P.      |z| 	[0.025 	0.975]
const 	-1.0162 	0.169 	-6.017 	0.000 	-1.347 	-0.685
PhoneService 	-0.3090 	0.173 	-1.784 	0.074 	-0.648 	0.030
PaperlessBilling 	0.3595 	0.089 	4.029 	0.000 	0.185 	0.534
Contract_One year 	-0.7012 	0.127 	-5.516 	0.000 	-0.950 	-0.452
Contract_Two year 	-1.3187 	0.210 	-6.271 	0.000 	-1.731 	-0.907
PaymentMethod_Electronic check 	0.3668 	0.083 	4.446 	0.000 	0.205 	0.529
MultipleLines_No 	-0.2311 	0.095 	-2.435 	0.015 	-0.417 	-0.045
InternetService_Fiber optic 	0.7937 	0.116 	6.836 	0.000 	0.566 	1.021
InternetService_No 	-1.1832 	0.182 	-6.484 	0.000 	-1.541 	-0.826
OnlineSecurity_Yes 	-0.4107 	0.102 	-4.031 	0.000 	-0.610 	-0.211
TechSupport_Yes 	-0.4181 	0.101 	-4.135 	0.000 	-0.616 	-0.220
StreamingMovies_No 	-0.2024 	0.094 	-2.160 	0.031 	-0.386 	-0.019
tenure 	-1.4974 	0.181 	-8.251 	0.000 	-1.853 	-1.142
TotalCharges 	0.7373 	0.186 	3.965 	0.000 	0.373 	1.102

Step-19: Verify VIF value.

 def vif_cal(input_data, dependent_col):
    vif_df = pd.DataFrame( columns = ['Var', 'Vif'])
    x_vars=input_data.drop([dependent_col], axis=1)
    xvar_names=x_vars.columns
    for i in range(0,xvar_names.shape[0]):
        y=x_vars[xvar_names[i]] 
        x=x_vars[xvar_names.drop(xvar_names[i])]
        rsq=sm.OLS(y,x).fit().rsquared  
        vif=round(1/(1-rsq),2)
        vif_df.loc[i] = [xvar_names[i], vif]
    return vif_df.sort_values(by = 'Vif', axis=0, ascending=False, inplace=False) 

Step-20: Calculate VIF values.

Keep attributes which are selected from RFE step and drop other attribute for Telecom churn case study.
 
vif_cal(input_data=telecom.drop(['customerID',
'SeniorCitizen', 
'Partner', 
'Dependents',
'PaymentMethod_Credit card (automatic)',
'PaymentMethod_Mailed check',
'gender_Male',
'MultipleLines_Yes',
'OnlineSecurity_No',
'OnlineBackup_No',
'OnlineBackup_Yes', 
'DeviceProtection_No', 
'DeviceProtection_Yes',
'TechSupport_No', 
'StreamingTV_No', 
'StreamingTV_Yes', 
'StreamingMovies_Yes',
'MonthlyCharges'], axis=1), dependent_col='Churn') 

Output:

	Var 	Vif
0 	PhoneService 	10.87
12 	TotalCharges 	8.58
11 	tenure 	6.80
1 	PaperlessBilling 	2.61
7 	InternetService_No 	0.65
3 	Contract_Two year 	0.28
2 	Contract_One year 	0.24
9 	TechSupport_Yes 	0.24
8 	OnlineSecurity_Yes 	0.21
10 	StreamingMovies_No 	0.19
4 	PaymentMethod_Electronic check 	0.05
5 	MultipleLines_No 	0.05
6 	InternetService_Fiber optic 	0.03

As you can see, ‘PhoneService’ has a huge VIF value. Let’s drop ‘PhoneService’ in the next iteration and run the logistic model again.

 
col =['PaperlessBilling', 'Contract_One year', 'Contract_Two year','PaymentMethod_Electronic check', 'MultipleLines_No','InternetService_Fiber optic', 'InternetService_No', 'OnlineSecurity_Yes','TechSupport_Yes','StreamingMovies_No','tenure','TotalCharges']
logm5 =sm.GLM(y_train,(sm.add_constant(X_train[col])), family = sm.families.Binomial())
logm5.fit().summary() 

Output:

Dep. Variable: 	Churn 	No. Observations: 	4922
Model: 	GLM 	Df Residuals: 	4909
Model Family: 	Binomial 	Df Model: 	12
Link Function: 	logit 	Scale: 	1.0000
Method: 	IRLS 	Log-Likelihood: 	-2025.8
Date: 	Sat, 18 Apr 2020 	Deviance: 	4051.5
Time: 	23:57:56 	Pearson chi2: 	6.00e+03
No. Iterations: 	7 		
Covariance Type: 	nonrobust 		
	coef 	std err 	z 	P.       |z| 	[0.025 	0.975]
const 	-1.1915 	0.138 	-8.607 	0.000 	-1.463 	-0.920
PaperlessBilling 	0.3563 	0.089 	3.998 	0.000 	0.182 	0.531
Contract_One year 	-0.6965 	0.127 	-5.483 	0.000 	-0.945 	-0.448
Contract_Two year 	-1.3078 	0.210 	-6.230 	0.000 	-1.719 	-0.896
PaymentMethod_Electronic check 	0.3700 	0.082 	4.487 	0.000 	0.208 	0.532
MultipleLines_No 	-0.2990 	0.087 	-3.442 	0.001 	-0.469 	-0.129
InternetService_Fiber optic 	0.7227 	0.108 	6.666 	0.000 	0.510 	0.935
InternetService_No 	-1.2732 	0.175 	-7.276 	0.000 	-1.616 	-0.930
OnlineSecurity_Yes 	-0.4100 	0.102 	-4.025 	0.000 	-0.610 	-0.210
TechSupport_Yes 	-0.4202 	0.101 	-4.157 	0.000 	-0.618 	-0.222
StreamingMovies_No 	-0.2205 	0.093 	-2.366 	0.018 	-0.403 	-0.038
tenure 	-1.4276 	0.177 	-8.066 	0.000 	-1.774 	-1.081
TotalCharges 	0.6495 	0.179 	3.622 	0.000 	0.298 	1.001

Step-21: Calculate VIF again.

 
vif_cal(input_data=telecom.drop(['customerID', 'PhoneService', 'SeniorCitizen', 'Partner', 'Dependents', 'PaymentMethod_Credit card (automatic)', 'PaymentMethod_Mailed check', 'gender_Male','MultipleLines_Yes','OnlineSecurity_No','OnlineBackup_No', 'OnlineBackup_Yes', 'DeviceProtection_No', 'DeviceProtection_Yes', 'TechSupport_No', 'StreamingTV_No', 'StreamingTV_Yes', 'StreamingMovies_Yes', 'MonthlyCharges'], axis=1), dependent_col='Churn')

Output:

	Var 	Vif
11 	TotalCharges 	8.24
10 	tenure 	6.56
0 	PaperlessBilling 	2.44
6 	InternetService_No 	0.45
2 	Contract_Two year 	0.26
8 	TechSupport_Yes 	0.24
1 	Contract_One year 	0.23
7 	OnlineSecurity_Yes 	0.21
9 	StreamingMovies_No 	0.17
3 	PaymentMethod_Electronic check 	0.05
4 	MultipleLines_No 	0.04
5 	InternetService_Fiber optic 	0.02

 

At this point, we can say VIF and P-values are in a good range for the telecom churn case study. We can treat these variables in our final model of the telecom churn case study.

 
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logsk = LogisticRegression()
logsk.fit(X_train[col], y_train)

Model Predictions

Predict probabilities.

 
y_pred =logsk.predict_proba(X_test[col])
y_pred_df = pd.DataFrame(y_pred)
y_pred_1 = y_pred_df.iloc[:,[1]]
y_pred_1.head() 

Output:

0 	0.499083
1 	0.372696
2 	0.006738
3 	0.635453
4 	0.007533

Consider the Probability cutoff point as 0.5.

 
y_test_df = pd.DataFrame(y_test)
y_test_df['CustID'] = y_test_df.index
y_pred_1.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)
y_pred_final = pd.concat([y_test_df,y_pred_1],axis=1)
y_pred_final= y_pred_final.rename(columns={ 1 : 'Churn_Prob'})
y_pred_final = y_pred_final.reindex_axis(['CustID','Churn','Churn_Prob'], axis=1)
y_pred_final['predicted'] = y_pred_final.Churn_Prob.map( lambda x: 1 if x > 0.5 else 0)
y_pred_final.head() 

Output:

      CustID 	Churn 	Churn_Prob 	predicted
0 	942 	0 	0.499083 	0
1 	3730 	1 	0.372696 	0
2 	1761 	0 	0.006738 	0
3 	2283 	1 	0.635453 	1
4 	1872 	0 	0.007533 	0

Model Evaluation

Verify the Confusion matrix.

 
from sklearn import metrics
# Confusion matrix
confusion =metrics.confusion_matrix( y_pred_final.Churn, y_pred_final.predicted )
confusion 

Output:

array([[1362,  166],
[ 249,  333]])

Check the overall accuracy.

 
metrics.accuracy_score( y_pred_final.Churn, y_pred_final.predicted) 

Output:

0.8033175355450237

Verify sensitivity and specificity for the Telecom Churn Case Study Model.

 
TP = confusion[0,0] # true positive 
TN = confusion[1,1] # true negatives
FP = confusion[0,1] # false positives
FN = confusion[1,0] # false negatives
print("sensitivity:",TP / float(TP+FN))
print("specificity:",TN / float(TN+FP)) 

Output:

sensitivity:0.6673346693386774
specificity:0.33266533066132264

Find the optimal cutoff point. Let’s create columns with different probability cutoffs.

 
numbers = [float(x) / 10 for x in range(10)]
for i in numbers:
    y_pred_final[i]= y_pred_final.Churn_Prob.map( lambda x: 1 if x > i else 0)
y_pred_final.head() 

Output:

CustID Churn Churn_Prob predicted 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
0 942 0 0.499083 0 1 1 1 1 1 0 0 0 0 0
1 3730 1 0.372696 0 1 1 1 1 0 0 0 0 0 0
2 1761 0 0.006738 0 1 0 0 0 0 0 0 0 0 0
3 2283 1 0.635453 1 1 1 1 1 1 1 1 0 0 0
4 1872 0 0.007533 0 1 0 0 0 0 0 0 0 0 0

Now let’s calculate accuracy sensitivity and specificity for various probability cutoffs.

 
cutoff_df = pd.DataFrame( columns = ['prob','accuracy','sensi','speci'])
from sklearn.metrics import confusion_matrix
num = [0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
for i in num:
    cm1 = metrics.confusion_matrix( y_pred_final.Churn, y_pred_final[i] )
    total1=sum(sum(cm1))
    accuracy = (cm1[0,0]+cm1[1,1])/total1
    sensi = cm1[0,0]/(cm1[0,0]+cm1[0,1])
    speci = cm1[1,1]/(cm1[1,0]+cm1[1,1])
    cutoff_df.loc[i] =[ i ,accuracy,sensi,speci]
print(cutoff_df)

Output:

     prob  accuracy     sensi     speci
0.0   0.0  0.275829  0.000000  1.000000
0.1   0.1  0.605687  0.477094  0.943299
0.2   0.2  0.695261  0.643325  0.831615
0.3   0.3  0.750237  0.752618  0.743986
0.4   0.4  0.783886  0.828534  0.666667
0.5   0.5  0.803318  0.891361  0.572165
0.6   0.6  0.795735  0.941754  0.412371
0.7   0.7  0.757820  0.978403  0.178694
0.8   0.8  0.727962  1.000000  0.013746
0.9   0.9  0.724171  1.000000  0.000000

accuracy

From the curve above, 0.3 is the optimum point to take it as a cutoff probability.

 
y_pred_final['final_predicted'] = y_pred_final.Churn_Prob.map( lambda x: 1 if x > 0.3 else 0)
metrics.accuracy_score( y_pred_final.Churn, y_pred_final.final_predicted) 

Output:

0.7502369668246446

Verify the confusion matrix.

 
metrics.confusion_matrix( y_pred_final.Churn, y_pred_final.final_predicted ) 

Output:

array([[1150,  378],
       [ 149,  433]]) 

Conclusion:

In this module, You learned about building a model for Telecom Churn Case Study. Please follow this link to find more about Logistic Regression case studies. Wish you a happy learning.

One thought on “Case Study – Telecom Churn

Leave a Reply