Evaluation: Sensitivity vs Specificity

This section explains the difference between two closely related evaluation criteria, sensitivity and specificity, using a diabetes-prediction example.

The table below compares the actual values with the model's predictions.

                     Predicted
Actual               No (Non-Diabetic)   Yes (Diabetic)
No (Non-Diabetic)    68                  12
Yes (Diabetic)       16                  22

This table shows that the actual count of non-diabetic patients is 80 (68 + 12), of which the model predicted 68 correctly. Similarly, the model predicted 22 correctly out of 38 (16 + 22) diabetic patients. The accuracy of the model is therefore calculated as below.

Accuracy = (68 + 22) / (68 + 12 + 16 + 22) = 90 / 118 ≈ 0.76

In the given example, accuracy = 76%. However, accuracy is not the only metric to evaluate a model. Now let's understand the difference between sensitivity and specificity.
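As a quick sanity check, that 76% can be reproduced directly from the four counts in the table above (a minimal sketch; the variable names are just illustrative):

# Cell counts from the confusion matrix above
correct_no = 68    # actual No, predicted No
wrong_yes = 12     # actual No, predicted Yes
wrong_no = 16      # actual Yes, predicted No
correct_yes = 22   # actual Yes, predicted Yes

total = correct_no + wrong_yes + wrong_no + correct_yes   # 118
accuracy = (correct_no + correct_yes) / total             # 90 / 118
print(round(accuracy, 2))                                 # 0.76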

Sensitivity

Sensitivity (also known as recall or the true positive rate) is the proportion of actual diabetic patients that the model predicts correctly. In other words, it is the accuracy on the 'Yes' class.

Sensitivity = 22 / (16 + 22) = 22 / 38 ≈ 0.58

In the given example, sensitivity ≈ 58%. In other words, the model correctly identifies about 58% of the diabetic patients.

Specificity

Specificity is the proportion of actual non-diabetic patients that the model predicts correctly, i.e., the accuracy on the 'No' class.

Specificity = 68 / (68 + 12) = 68 / 80 = 0.85

In the given example, specificity = 85%.
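Both numbers can be checked in a couple of lines of Python (a minimal sketch using the counts from the table above; tp, fn, tn and fp are just illustrative names):

# Counts from the table above, treating diabetic (Yes) as the positive class
tp, fn = 22, 16   # actual Yes: predicted Yes / predicted No
tn, fp = 68, 12   # actual No:  predicted No  / predicted Yes

sensitivity = tp / (tp + fn)   # 22 / 38 ≈ 0.58
specificity = tn / (tn + fp)   # 68 / 80 = 0.85
print(round(sensitivity, 2), round(specificity, 2))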

The same expressions can also be written in standard confusion-matrix terms. Treating the diabetic ('Yes') class as the positive class, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP), with the cells labelled as follows.

                     Predicted
Actual               No (Non-Diabetic)   Yes (Diabetic)
No (Non-Diabetic)    True Negative       False Positive
Yes (Diabetic)       False Negative      True Positive
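scikit-learn's confusion_matrix follows exactly this orientation: actual labels on the rows and predicted labels on the columns, ordered 0 then 1, so the four cells can be unpacked in one line. A small sketch with toy labels (not the Pima data):

from sklearn.metrics import confusion_matrix

# Toy labels, just to show the cell order returned by ravel()
y_true = [0, 0, 1, 1, 1, 0]
y_hat  = [0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)   # 2 1 1 2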

Now let’s find out sensitivity and specificity for the diabetes dataset.

 
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Load the data and standardise the numeric feature columns
pima = pd.read_csv('pima_indian_diabetes.csv')
df = pima[['No_Times_Pregnant', 'Plasma_Glucose', 'Diastolic_BP', 'Triceps', 'Insulin', 'BMI', 'Age']]
normalized_df = (df - df.mean()) / df.std()
pima = pima.drop(['No_Times_Pregnant', 'Plasma_Glucose', 'Diastolic_BP', 'Triceps', 'Insulin', 'BMI', 'Age'], axis=1)
pima = pd.concat([pima, normalized_df], axis=1)

# Split features and target, then create train and test sets
X = pima.drop(['Diabetes'], axis=1)
y = pima['Diabetes']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, test_size=0.3, random_state=100)

# Now fit the model
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

logsk = LogisticRegression()
logsk.fit(X_train, y_train)

Predict probabilities.

# Predicted probabilities; column 1 holds the probability of the positive (diabetic) class
y_pred = logsk.predict_proba(X_test)
y_pred_df = pd.DataFrame(y_pred)
y_pred_1 = y_pred_df.iloc[:, [1]]
y_pred_1.head()

Output:

          1
0  0.849983
1  0.156658
2  0.384572
3  0.350269
4  0.044309
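The column order returned by predict_proba follows logsk.classes_, which for 0/1 labels is [0, 1]; that is why column index 1 is taken as the probability of being diabetic. You can confirm this on the fitted model:

# Class order used by predict_proba; column 1 corresponds to the second entry
print(logsk.classes_)   # expected: [0 1]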

Mark each prediction as 0 or 1 based on its probability, using 0.5 as the cutoff.

 
# Combine actual values and predicted probabilities, then apply a 0.5 cutoff
y_test_df = pd.DataFrame(y_test)
y_test_df['CustID'] = y_test_df.index
y_pred_1.reset_index(drop=True, inplace=True)
y_test_df.reset_index(drop=True, inplace=True)
y_pred_final = pd.concat([y_test_df, y_pred_1], axis=1)
y_pred_final = y_pred_final.rename(columns={1: 'diabetes_Prob'})
y_pred_final = y_pred_final.reindex(columns=['CustID', 'diabetes_Prob'])
y_pred_final['predicted'] = y_pred_final.diabetes_Prob.map(lambda x: 1 if x > 0.5 else 0)
y_pred_final.head()
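If you prefer, the same 0.5 cutoff can be applied in a single vectorized step instead of map; this is only a stylistic alternative and produces the same 'predicted' column:

# Equivalent vectorized form of the 0.5 cutoff
y_pred_final['predicted'] = (y_pred_final.diabetes_Prob > 0.5).astype(int)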

Evaluate the model.

 
# Confusion matrix: rows are actual labels (0 = non-diabetic, 1 = diabetic), columns are predictions
from sklearn import metrics
confusion = metrics.confusion_matrix(y_test, y_pred_final.predicted)
confusion

Output: Confusion matrix

array([[68, 12],
       [16, 22]])

Find overall accuracy.

metrics.accuracy_score(y_test, y_pred_final.predicted)

Output:

0.7627118644067796

Calculate both metrics from the confusion matrix and verify the sensitivity and specificity values.

 
# scikit-learn's confusion matrix has actual labels on the rows and predicted
# labels on the columns, ordered 0 then 1. With diabetic (1) as the positive
# class, the cells map as follows.
TN = confusion[0, 0]  # true negatives:  actual 0, predicted 0
FP = confusion[0, 1]  # false positives: actual 0, predicted 1
FN = confusion[1, 0]  # false negatives: actual 1, predicted 0
TP = confusion[1, 1]  # true positives:  actual 1, predicted 1

# Let us calculate sensitivity and specificity
print("sensitivity ", TP / float(TP + FN))
print("specificity ", TN / float(TN + FP))

Output:

sensitivity 0.5789473684210527
specificity 0.85
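These two numbers can be cross-checked against scikit-learn's built-in scorers: recall of the positive class (1) is the sensitivity, and recall computed with pos_label=0 is the specificity.

# Cross-check: recall of class 1 = sensitivity, recall of class 0 = specificity
print(metrics.recall_score(y_test, y_pred_final.predicted))                  # should print ≈ 0.58
print(metrics.recall_score(y_test, y_pred_final.predicted, pos_label=0))     # should print 0.85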

The optimal cutoff between sensitivity and specificity

In the calculation above, you can notice that although the overall accuracy is reasonable, sensitivity is quite low, at around 58%. If the goal is to catch the people who actually have diabetes, the model performs poorly at the default 0.5 cutoff. Based on the business requirement, you may have to decide whether to favour high sensitivity or high specificity when choosing the cutoff. Most of the time, though, we want an optimal point where both sensitivity and specificity perform reasonably well.

Let’s create columns with different probability cutoffs.

 
# Add one 0/1 column per candidate probability cutoff (0.0, 0.1, ..., 0.9)
numbers = [float(x) / 10 for x in range(10)]
for i in numbers:
    y_pred_final[i] = y_pred_final.diabetes_Prob.map(lambda x: 1 if x > i else 0)
y_pred_final.head()

Output:

   CustID  diabetes_Prob  predicted  0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9
0     124       0.849983          1    1    1    1    1    1    1    1    1    1    0
1     140       0.156658          0    1    1    0    0    0    0    0    0    0    0
2     276       0.384572          0    1    1    1    1    0    0    0    0    0    0
3     252       0.350269          0    1    1    1    1    0    0    0    0    0    0
4     326       0.044309          0    1    0    0    0    0    0    0    0    0    0

Now let's calculate accuracy, sensitivity, and specificity for each probability cutoff.

# For each cutoff, compute accuracy, sensitivity, and specificity from the confusion matrix
cutoff_df = pd.DataFrame(columns=['prob', 'accuracy', 'sensi', 'speci'])
num = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
for i in num:
    cm1 = metrics.confusion_matrix(y_test, y_pred_final[i])
    total1 = cm1.sum()
    accuracy = (cm1[0, 0] + cm1[1, 1]) / total1
    sensi = cm1[1, 1] / (cm1[1, 0] + cm1[1, 1])   # TP / (FN + TP)
    speci = cm1[0, 0] / (cm1[0, 0] + cm1[0, 1])   # TN / (TN + FP)
    cutoff_df.loc[i] = [i, accuracy, sensi, speci]
print(cutoff_df)

Output:

     prob  accuracy     sensi     speci
0.0   0.0  0.322034  1.000000  0.000000
0.1   0.1  0.559322  1.000000  0.350000
0.2   0.2  0.669492  0.921053  0.550000
0.3   0.3  0.728814  0.763158  0.712500
0.4   0.4  0.745763  0.657895  0.787500
0.5   0.5  0.762712  0.578947  0.850000
0.6   0.6  0.796610  0.552632  0.912500
0.7   0.7  0.779661  0.473684  0.925000
0.8   0.8  0.737288  0.289474  0.950000
0.9   0.9  0.711864  0.105263  1.000000

Let's plot accuracy, sensitivity, and specificity against the probability cutoff.

cutoff_df.plot.line(x='prob', y=['accuracy', 'sensi', 'speci'])

[Plot: accuracy, sensitivity, and specificity versus probability cutoff. Sensitivity falls and specificity rises as the cutoff increases, and the two curves cross near a cutoff of 0.3 to 0.35.]
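Rather than only eyeballing the curves, you can also read an approximate crossing point off cutoff_df programmatically, for example by picking the cutoff where sensitivity and specificity are closest (a rough sketch; balancing the two this way is just one possible criterion):

# Cutoff (on the 0.1 grid) where sensitivity and specificity are closest
best_cutoff = (cutoff_df.sensi - cutoff_df.speci).abs().idxmin()
print(best_cutoff)   # 0.3 on this grid, consistent with a crossover just above 0.3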

As you can observe, the optimal cutoff lies at roughly 0.3 to 0.35, where the trade-off between sensitivity and specificity is balanced. Based on your requirements, you can instead choose a cutoff that favours sensitivity or specificity. Please share your thoughts in the comments below.
