Evaluation-Sensitivity vs Specificity
Sensitivity vs Specificity
This section explains the difference between two evaluation metrics: sensitivity and specificity.
The table below compares the actual values with the values predicted by the model.
Actual (rows) / Predicted (columns) | No (Non-Diabetic) | Yes (Diabetic) |
---|---|---|
No (Non-Diabetic) | 68 | 12 |
Yes (Diabetic) | 16 | 22 |
This table shows that 80 patients (68 + 12) are actually non-diabetic, and the model classified 68 of them correctly. Similarly, the model classified 22 of the 38 (16 + 22) diabetic patients correctly. The accuracy of the model is the share of all patients classified correctly:
Accuracy = (68 + 22) / (68 + 12 + 16 + 22) = 90 / 118 ≈ 76%
However, accuracy is not the only metric for evaluating a model. Now let’s look at the difference between sensitivity and specificity.
Sensitivity
Sensitivity is the proportion of actual diabetic patients that the model correctly predicts as diabetic. In other words, it is the accuracy on the Yes class.
Sensitivity = 22 / (16 + 22) = 22 / 38 ≈ 58%
In the given example, the model identifies only about 58% of the diabetic patients.
Specificity
Specificity is the proportion of actual non-diabetic patients (the Nos) that the model correctly predicts as non-diabetic.
Specificity = 68 / (68 + 12) = 68 / 80 = 85%
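The three calculations above can be sketched as one small function. The function name and the argument names (`tn`, `fp`, `fn`, `tp`) are choices made for this illustration and are not part of the original walkthrough; "positive" here means diabetic.

```python
def rates_from_counts(tn, fp, fn, tp):
    """Return (accuracy, sensitivity, specificity) from the four cells of a 2x2 table."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total      # share of all cases classified correctly
    sensitivity = tp / (tp + fn)      # share of actual Yes cases predicted as Yes
    specificity = tn / (tn + fp)      # share of actual No cases predicted as No
    return accuracy, sensitivity, specificity


# Counts from the table above: 68 and 12 in the actual-No row,
# 16 and 22 in the actual-Yes row.
accuracy, sensitivity, specificity = rates_from_counts(tn=68, fp=12, fn=16, tp=22)
print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
```

Keeping the denominators explicit makes it easy to see that sensitivity only looks at the actual-Yes row and specificity only at the actual-No row.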
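The above expressions can also be written in terms of the four cells of the confusion matrix, taking diabetic (Yes) as the positive class.
Actual (rows) / Predicted (columns) | No (Non-Diabetic) | Yes (Diabetic) |
---|---|---|
No (Non-Diabetic) | True Negative (TN) | False Positive (FP) |
Yes (Diabetic) | False Negative (FN) | True Positive (TP) |
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)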
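To make the four cell names concrete, here is a minimal sketch that tallies them directly from paired labels. The helper name `confusion_counts` and the eight example labels are hypothetical and are not drawn from the diabetes data.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TN, FP, FN, TP for binary labels, treating `positive` as the Yes class."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1      # actual Yes, predicted Yes
        elif a == positive:
            fn += 1      # actual Yes, predicted No
        elif p == positive:
            fp += 1      # actual No, predicted Yes
        else:
            tn += 1      # actual No, predicted No
    return tn, fp, fn, tp


# Hypothetical labels for eight patients (1 = diabetic, 0 = non-diabetic).
actual = [0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 1, 0, 1, 1, 1]
tn, fp, fn, tp = confusion_counts(actual, predicted)
print(tn, fp, fn, tp)
```

The word "true" or "false" says whether the prediction matched the actual value; "positive" or "negative" says what the model predicted.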
Now let’s find the sensitivity and specificity for the diabetes dataset.

    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn import metrics

    pima = pd.read_csv('pima_indian_diabetes.csv')

    # Standardize the numeric features (zero mean, unit standard deviation).
    feature_cols = ['No_Times_Pregnant', 'Plasma_Glucose', 'Diastolic_BP',
                    'Triceps', 'Insulin', 'BMI', 'Age']
    df = pima[feature_cols]
    normalized_df = (df - df.mean()) / df.std()

    # Replace the raw feature columns with their standardized versions.
    # (columns= is used because the positional axis argument to drop() is no longer accepted.)
    pima = pima.drop(columns=feature_cols)
    pima = pd.concat([pima, normalized_df], axis=1)

    X = pima.drop(['Diabetes'], axis=1)
    y = pima['Diabetes']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.7, test_size=0.3, random_state=100)

    # Now fit the model.
    logsk = LogisticRegression()
    logsk.fit(X_train, y_train)
Predict probabilities.

    # predict_proba returns one column per class; column 1 is the probability of class 1 (diabetic).
    y_pred = logsk.predict_proba(X_test)
    y_pred_df = pd.DataFrame(y_pred)
    y_pred_1 = y_pred_df.iloc[:, [1]]
    y_pred_1.head()

Output:
1
0 0.849983
1 0.156658
2 0.384572
3 0.350269
4 0.044309
Label each prediction as 0 or 1 based on its probability.

    y_test_df = pd.DataFrame(y_test)
    y_test_df['CustID'] = y_test_df.index

    # Drop the original indexes so the two frames align row by row.
    y_pred_1.reset_index(drop=True, inplace=True)
    y_test_df.reset_index(drop=True, inplace=True)

    y_pred_final = pd.concat([y_test_df, y_pred_1], axis=1)
    y_pred_final = y_pred_final.rename(columns={1: 'diabetes_Prob'})
    # reindex_axis has been removed from pandas; reindex(columns=...) keeps the same two columns.
    y_pred_final = y_pred_final.reindex(columns=['CustID', 'diabetes_Prob'])

    # Predict 1 (diabetic) when the probability exceeds 0.5.
    y_pred_final['predicted'] = y_pred_final.diabetes_Prob.map(lambda x: 1 if x > 0.5 else 0)
    y_pred_final.head()

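The thresholding step can be illustrated on its own with plain Python. This is a small sketch, with `apply_cutoff` as a hypothetical helper name; it reuses the five probabilities printed earlier and shows how lowering the cutoff turns more cases into predicted positives.

```python
def apply_cutoff(probabilities, cutoff=0.5):
    """Label a case 1 (diabetic) when its probability is strictly above the cutoff."""
    return [1 if p > cutoff else 0 for p in probabilities]


# The first five predicted probabilities shown in the output above.
probs = [0.849983, 0.156658, 0.384572, 0.350269, 0.044309]

labels_default = apply_cutoff(probs)        # cutoff 0.5
labels_lenient = apply_cutoff(probs, 0.3)   # a lower cutoff flags more cases
print(labels_default)
print(labels_lenient)
```

Note that the comparison is strict, so a probability exactly equal to the cutoff is labelled 0, matching the `x > 0.5` rule used in the walkthrough.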
Evaluate the model.

    from sklearn import metrics

    # Rows are the actual classes and columns the predicted classes, ordered 0 then 1.
    confusion = metrics.confusion_matrix(y_test, y_pred_final.predicted)
    confusion

Output: Confusion matrix
array([[68, 12],
[16, 22]])
Find overall accuracy.
metrics.accuracy_score( y_test, y_pred_final.predicted)
Output:
0.7627118644067796
Calculate both metrics and verify the sensitivity and specificity values.

    # confusion_matrix puts actual classes in rows and predicted classes in columns,
    # ordered 0 (non-diabetic) then 1 (diabetic).
    TN = confusion[0, 0]  # true negatives
    FP = confusion[0, 1]  # false positives
    FN = confusion[1, 0]  # false negatives
    TP = confusion[1, 1]  # true positives

    # Let us calculate sensitivity and specificity.
    print("sensitivity", round(TP / (TP + FN), 4))
    print("specificity", round(TN / (TN + FP), 4))

Output:
sensitivity 0.5789
specificity 0.85
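Because it is easy to mix up the cell indexes, a cross-check against scikit-learn's `recall_score` can help: sensitivity is the recall of the positive class, and specificity is the recall of the negative class. The sketch below uses hypothetical labels rather than the Pima data.

```python
from sklearn.metrics import recall_score

# Hypothetical labels (1 = diabetic, 0 = non-diabetic); not the Pima data.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 1, 0, 0, 1, 1]

# Recall of class 1: actual positives that were predicted positive.
sensitivity = recall_score(y_true, y_hat)
# Recall of class 0: actual negatives that were predicted negative.
specificity = recall_score(y_true, y_hat, pos_label=0)

print("sensitivity", sensitivity)
print("specificity", specificity)
```

If the values computed from the confusion matrix by hand disagree with these, the most likely cause is that the true positive and true negative cells have been swapped.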
The optimal cutoff between sensitivity and specificity
In the above calculation of sensitivity and specificity, you can see that although accuracy is fairly high (76%), sensitivity is low, at around 58%. If the goal is to identify the people who actually have diabetes, the model performs poorly at this cutoff. Depending on business requirements, you may have to choose a cutoff that favors either high sensitivity or high specificity. Most of the time, though, we look for a point where both sensitivity and specificity are reasonably good.
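One way to turn a business requirement into a cutoff is to set a minimum acceptable sensitivity and then take the cutoff with the best specificity among those that meet it. The sketch below assumes that rule; `pick_cutoff` is a hypothetical helper, and the numbers in `rows` are illustrative only, not results from the diabetes model.

```python
def pick_cutoff(rows, min_sensitivity):
    """Return the (cutoff, sensitivity, specificity) row with the highest
    specificity among rows whose sensitivity meets the floor.

    Returns None when no cutoff satisfies the requirement.
    """
    eligible = [r for r in rows if r[1] >= min_sensitivity]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r[2])


# Illustrative (cutoff, sensitivity, specificity) values only.
rows = [
    (0.2, 0.95, 0.40),
    (0.3, 0.85, 0.60),
    (0.4, 0.70, 0.75),
    (0.5, 0.55, 0.85),
]

choice = pick_cutoff(rows, min_sensitivity=0.80)
print(choice)
```

The same function can be flipped to favor specificity by filtering on the third field and maximizing the second.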
Let’s create columns with different probability cutoffs.

    # Add one 0/1 prediction column per cutoff from 0.0 to 0.9.
    numbers = [float(x) / 10 for x in range(10)]
    for i in numbers:
        y_pred_final[i] = y_pred_final.diabetes_Prob.map(lambda x: 1 if x > i else 0)
    y_pred_final.head()

Output:
| | CustID | diabetes_Prob | predicted | 0.0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 124 | 0.849983 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
1 | 140 | 0.156658 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 276 | 0.384572 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 252 | 0.350269 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 326 | 0.044309 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Now let’s calculate accuracy, sensitivity, and specificity for the various probability cutoffs.

    cutoff_df = pd.DataFrame(columns=['prob', 'accuracy', 'sensi', 'speci'])

    num = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
    for i in num:
        cm1 = metrics.confusion_matrix(y_test, y_pred_final[i])
        total1 = cm1.sum()
        accuracy = (cm1[0, 0] + cm1[1, 1]) / total1
        # Row 1 holds the actual diabetic cases, row 0 the actual non-diabetic cases.
        sensi = cm1[1, 1] / (cm1[1, 0] + cm1[1, 1])
        speci = cm1[0, 0] / (cm1[0, 0] + cm1[0, 1])
        cutoff_df.loc[i] = [i, accuracy, sensi, speci]
    print(cutoff_df)

Output:
prob accuracy sensi speci
0.0 0.0 0.322034 1.000000 0.0000
0.1 0.1 0.559322 1.000000 0.3500
0.2 0.2 0.669492 0.921053 0.5500
0.3 0.3 0.728814 0.763158 0.7125
0.4 0.4 0.745763 0.657895 0.7875
0.5 0.5 0.762712 0.578947 0.8500
0.6 0.6 0.796610 0.552632 0.9125
0.7 0.7 0.779661 0.473684 0.9250
0.8 0.8 0.737288 0.289474 0.9500
0.9 0.9 0.711864 0.105263 1.0000
Let’s plot accuracy, sensitivity, and specificity against the probability cutoff.
cutoff_df.plot.line(x='prob', y=['accuracy','sensi','speci'])
As you can observe, the optimal cutoff is around 0.35, where the sensitivity and specificity curves cross and neither metric is sacrificed for the other. Depending on your requirements, you can instead choose a cutoff that favors sensitivity or specificity.
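Reading the crossing point off a plot can also be done programmatically. The sketch below assumes "optimal" means the cutoff at which sensitivity and specificity are closest to each other; the helper names and the ten labelled scores are hypothetical and chosen only to illustrate the sweep.

```python
def sensitivity_specificity(actual, probabilities, cutoff):
    """Return (sensitivity, specificity) when cases above `cutoff` are labelled 1.

    Assumes both classes are present in `actual`.
    """
    tp = fn = tn = fp = 0
    for a, p in zip(actual, probabilities):
        predicted = 1 if p > cutoff else 0
        if a == 1 and predicted == 1:
            tp += 1
        elif a == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def balanced_cutoff(actual, probabilities, cutoffs):
    """Return the cutoff whose sensitivity and specificity are closest to each other."""
    def gap(cutoff):
        sens, spec = sensitivity_specificity(actual, probabilities, cutoff)
        return abs(sens - spec)
    return min(cutoffs, key=gap)


# Hypothetical labels and scores, not the diabetes data.
actual = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
probs = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.25, 0.40, 0.70, 0.90]
cutoffs = [i / 10 for i in range(1, 9)]   # 0.1 through 0.8

best = balanced_cutoff(actual, probs, cutoffs)
print(best)
```

A finer grid of cutoffs (for example, steps of 0.01) would locate the crossing point more precisely than the 0.1 steps used here.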