Diabetes Prediction

Introduction

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict based on diagnostic measurements whether a patient has diabetes.

Diabetes is a chronic disease that occurs either when the pancreas does not produce enough insulin or when the body cannot effectively use the insulin it produces.

Types of diabetes:

- Type 1 DM: caused by an autoimmune reaction. It can be diagnosed at any age but is more common in younger people. Symptoms develop rapidly.
- Type 2 DM: the body fails to produce enough insulin or cannot utilize the insulin produced. It develops gradually over time and can be prevented by lifestyle modification.
- Gestational DM: develops during pregnancy, mostly after 20 weeks of gestation. Most cases resolve after delivery, though there is an increased risk of developing DM later in life.
- Impaired glucose tolerance: an intermediate state between normal glucose levels and diabetes. People with impaired glucose tolerance have a high risk of progressing to type 2 DM.

Type 2 DM leads to multiple microvascular and macrovascular complications, which cause:
- Reduced life expectancy
- Premature mortality and increased morbidity
- Increased financial burden

Epidemiology

According to the WHO, noncommunicable diseases (NCDs) accounted for 74% of deaths globally, of which 1.6 million were diabetes related, making diabetes the 9th leading cause of death worldwide.

More than 37 million US adults have diabetes, and 1 in 5 do not know they have it. It is the 8th leading cause of death in the US. In the last twenty years, the number of adults diagnosed with diabetes has doubled.

The estimated DM population in India was 77 million, with an expected rise to over 134 million by 2045.

More than 50% of people are unaware of their diabetes status (WHO, 2019).

The risk of diabetes is mainly influenced by ethnicity, age, obesity, unhealthy diet and family history.

Data Dictionary

Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function, a function that scores the likelihood of diabetes based on family history
Age: Age (years)
Outcome: Class variable (0 or 1) indicating whether the patient had diabetes (1 = Yes, 0 = No)

Problem Statement

What is the distribution of the number of pregnancies among the female Pima Indian patients in the dataset?
How does the distribution of plasma glucose concentration vary among patients with and without diabetes?
Is there a correlation between diastolic blood pressure and the likelihood of diabetes in these patients?
How does the 2-hour serum insulin level differ between patients with and without diabetes?
What is the distribution of BMI (Body Mass Index) among these patients, and does it correlate with the presence of diabetes?
What is the distribution of the diabetes pedigree function scores in the dataset?
How does age vary among patients with diabetes and those without diabetes?
What is the overall prevalence of diabetes (Outcome = 1) among these Pima Indian female patients?
Are there any noticeable trends or patterns in the data that suggest certain factors are more strongly associated with diabetes among this population?
Can a predictive model be developed to estimate the likelihood of diabetes based on these features?
Are there any relationships or interactions between the variables that are worth exploring, such as age and BMI, or glucose levels and insulin levels?
Can insights be gained about the hereditary factors (DiabetesPedigreeFunction) and their impact on diabetes in this population?

Understand the data

The data was imported into R. head() was run to view the first 6 rows, and summary() to obtain descriptive statistics for each variable.

##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 2                    0.351  31       0
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0

##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000
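The inspection step above can be sketched as follows. This is a minimal example that rebuilds only the six rows printed by head(), so it runs without the original file; the object name `diabetes_head` and the file name in the comment are assumptions, not taken from the analysis.

```r
# Rebuild the six rows printed by head() so the example runs without the source file.
# With the full CSV one would instead call, e.g., read.csv("diabetes.csv") (file name assumed).
diabetes_head <- data.frame(
  Pregnancies = c(6, 1, 8, 1, 0, 5),
  Glucose = c(148, 85, 183, 89, 137, 116),
  BloodPressure = c(72, 66, 64, 66, 40, 74),
  SkinThickness = c(35, 29, 0, 23, 35, 0),
  Insulin = c(0, 0, 0, 94, 168, 0),
  BMI = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6),
  DiabetesPedigreeFunction = c(0.627, 0.351, 0.672, 0.167, 2.288, 0.201),
  Age = c(50, 31, 32, 21, 33, 30),
  Outcome = c(1, 0, 1, 0, 1, 0)
)

head(diabetes_head)     # first rows (all six here)
summary(diabetes_head)  # min, quartiles, mean, and max per column
```

On the full dataset the same two calls produce the outputs shown above.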

The dimensions of the data were also checked. The dataset has 768 rows and 9 columns.

dim(diabetes)

## [1] 768   9

colnames(diabetes)

## [1] "Pregnancies"              "Glucose"                 
## [3] "BloodPressure"            "SkinThickness"           
## [5] "Insulin"                  "BMI"                     
## [7] "DiabetesPedigreeFunction" "Age"                     
## [9] "Outcome"

The data was then checked for null values and duplicated values. No null values were present in any column, and the duplicate count was also 0.

## $Pregnancies
## [1] 0
## 
## $Glucose
## [1] 0
## 
## $BloodPressure
## [1] 0
## 
## $SkinThickness
## [1] 0
## 
## $Insulin
## [1] 0
## 
## $BMI
## [1] 0
## 
## $DiabetesPedigreeFunction
## [1] 0
## 
## $Age
## [1] 0
## 
## $Outcome
## [1] 0

## [1] 0
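The calls behind this check are not shown, but the output has the shape of a per-column count followed by a single total. A minimal sketch of one way to produce it, using a hypothetical toy data frame (`toy`) that deliberately contains one NA and one duplicated row:

```r
# Toy data frame (hypothetical values) with one NA and one duplicated row
toy <- data.frame(
  Glucose = c(148, 85, NA, 85),
  BMI     = c(33.6, 26.6, 23.3, 26.6)
)

# Count missing values per column; lapply() returns a named list
na_counts <- lapply(toy, function(x) sum(is.na(x)))
print(na_counts)

# Count rows that exactly repeat an earlier row
n_duplicates <- sum(duplicated(toy))
print(n_duplicates)
```

Note that is.na() does not flag zeros, so physiologically impossible zero readings pass this check and need separate handling.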

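Since blood pressure, BMI, skin thickness and glucose cannot be zero (0), all zero values in these columns were replaced with the mean or median, since dropping the affected rows would change the data. The summary below shows that zero values of Insulin were replaced as well, and that an age_cat column was added for the age grouping used in the insights that follow.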

##   Pregnancies        Glucose       BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   : 44.00   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 1.000   1st Qu.: 99.75   1st Qu.: 64.00   1st Qu.:20.54  
##  Median : 3.000   Median :117.00   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :121.68   Mean   : 72.25   Mean   :26.61  
##  3rd Qu.: 6.000   3rd Qu.:140.25   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.00   Max.   :122.00   Max.   :99.00  
##     Insulin            BMI        DiabetesPedigreeFunction      Age       
##  Min.   : 14.00   Min.   :18.20   Min.   :0.0780           Min.   :21.00  
##  1st Qu.: 30.50   1st Qu.:27.50   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 31.25   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 94.65   Mean   :32.45   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.25   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.00   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome        age_cat         
##  Min.   :0.000   Length:768        
##  1st Qu.:0.000   Class :character  
##  Median :0.000   Mode  :character  
##  Mean   :0.349                     
##  3rd Qu.:1.000                     
##  Max.   :1.000
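The replacement step could look roughly like the sketch below. The helper `replace_zeros` is hypothetical, and computing the fill value from the non-zero entries only is an assumption; the original code is not shown and may have used the mean or median of the whole column.

```r
# Replace zeros in a numeric vector with a statistic of the non-zero values.
# `replace_zeros` is a hypothetical helper, not a function from the original analysis.
replace_zeros <- function(x, stat = mean) {
  fill <- stat(x[x != 0])  # statistic computed on non-zero entries only (assumption)
  x[x == 0] <- fill
  x
}

skin <- c(35, 29, 0, 23, 35, 0)    # SkinThickness from the first six rows
insulin <- c(0, 0, 0, 94, 168, 0)  # Insulin from the first six rows

skin_filled <- replace_zeros(skin, mean)          # zeros -> mean of 35, 29, 23, 35
insulin_filled <- replace_zeros(insulin, median)  # zeros -> median of 94, 168
print(skin_filled)
print(insulin_filled)
```

The median is less sensitive than the mean to extreme values, which may matter for a column such as Insulin with a maximum of 846.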

Insights

How was the population spread?

age_cat	Percent
Below_30	51.56
30s	21.48
40s	15.36
50s	7.42
60s	3.78
Above_70	0.39
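A table like the one above can be produced by binning Age and tabulating the bins. In this sketch the cut points are assumed from the category labels (for example, that "Above_70" starts at 70), and only the six ages printed by head() are used, so the percentages differ from the full-data table.

```r
age <- c(50, 31, 32, 21, 33, 30)  # Age from the first six rows

# Bin ages into decades; the cut points are assumed from the category labels
age_cat <- cut(
  age,
  breaks = c(-Inf, 30, 40, 50, 60, 70, Inf),
  labels = c("Below_30", "30s", "40s", "50s", "60s", "Above_70"),
  right = FALSE  # intervals are [lower, upper), so 30 falls in "30s"
)

# Percentage of patients in each category, rounded to two decimals
percent <- round(100 * prop.table(table(age_cat)), 2)
print(percent)
```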

Number of Pregnancies

The median number of pregnancies was 3 (mean about 3.8), and the distribution was right-skewed, with most patients at low counts and a long tail of high counts. A few extreme cases with more than 11 pregnancies accounted for 4.43% of patients.

Comparison of pregnancies and outcome

Patients with a higher number of pregnancies had a higher chance of being diabetic.

BMI numbers

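About 61.98% of the population had a BMI above 30, which is considered obese. 1.04% had a BMI above 50, which, although possible, could be an outlier. 0.52% of the population were underweight, with a BMI below 18.5. The table below gives the number of patients in each BMI category; the column is labelled Percent, but the values are counts that sum to 768.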

## # A tibble: 9 × 2
##   BMI_cat           Percent
##   <chr>               <int>
## 1 Moderate Obese        235
## 2 Obese                 179
## 3 Severe Obese          150
## 4 Normal                102
## 5 Very severe Obese      62
## 6 Morbid Obese           27
## 7 Super Obese             8
## 8 Underweight             4
## 9 Hyper obese             1
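The percentages quoted for the BMI thresholds can be computed as the mean of a logical condition. A minimal sketch using the thresholds from the text (30, 50 and 18.5) and only the six BMI values printed by head(), so the results differ from the full-data figures:

```r
bmi <- c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6)  # BMI from the first six rows

# mean() of a logical vector is the proportion of TRUE values
pct_obese       <- round(100 * mean(bmi > 30), 2)    # BMI above 30
pct_above_50    <- round(100 * mean(bmi > 50), 2)    # BMI above 50
pct_underweight <- round(100 * mean(bmi < 18.5), 2)  # BMI below 18.5

print(c(obese = pct_obese, above_50 = pct_above_50, underweight = pct_underweight))
```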

Diabetes Pedigree Function and BMI

The Super Obese, Severe Obese and Very severe Obese categories had the highest diabetes pedigree function (DPF). On its own this is not a good measure, since even the underweight group still had a high DPF.

Glucose ranges

Those with glucose above 125, classified as hyperglycemia, accounted for 40.49%; impaired glucose accounted for 34.5%, while hypoglycemia accounted for 1%.

Glucose Ranges and Diabetes Pedigree Function

The hyperglycemia and impaired glucose groups had the highest diabetes pedigree function.

Patients with hyperglycemia (high glucose 2 hours after the oral glucose tolerance test) had a higher chance of a diabetes outcome, whereas none of the hypoglycemic patients had diabetes as the outcome.
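The comparison of glucose group against outcome can be sketched as a cross-tabulation. This example uses only the 125 cut-off stated above and the six rows printed by head(); the group labels are illustrative names, and the 0/1 coding follows the data dictionary.

```r
glucose <- c(148, 85, 183, 89, 137, 116)  # Glucose from the first six rows
outcome <- c(1, 0, 1, 0, 1, 0)            # Outcome (1 = diabetes, 0 = no diabetes)

# Flag hyperglycemia using the 125 cut-off stated above
glucose_group <- ifelse(glucose > 125, "Above_125", "At_or_below_125")

# Counts of each outcome within each glucose group
counts <- table(glucose_group, outcome)

# Row-wise proportions: share of each outcome within a glucose group
row_share <- prop.table(counts, margin = 1)
print(row_share)
```

Row-wise proportions make the groups comparable even when they contain different numbers of patients.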

How do the various parameters relate to each other?

Comparison of Age and Outcome

Blood Pressure Comparison

The study used diastolic blood pressure, for which 60-80 mmHg was treated as the normal range. The values were approximately normally distributed, though 11.2% of patients had a low diastolic BP (less than 60 mmHg), 62% had a normal BP, and 0.13% (1 person) had a reading in the hypertensive-crisis range (above 120 mmHg).

Diastolic blood pressure did not have a strong effect on the diabetes outcome.

Prediction Analysis

We aim to investigate the relationship between several medical predictor variables and a binary outcome variable, “Outcome.” The dataset contains information on pregnancies, BMI, insulin levels, age, blood pressure, glucose levels and diabetes pedigree function, together with the binary outcome variable. For this analysis the outcome takes the values 1 (indicating “Yes”) and 2 (indicating “No”), rather than the 0/1 coding listed in the data dictionary, so a negative coefficient means the predictor is associated with a higher likelihood of diabetes. The aim of this analysis is to understand how these predictor variables influence the outcome.

Summary Statistics

The dataset has 768 observations with the following variables: “Outcome” (response variable) and the predictor variables “Pregnancies”, “BMI” (Body Mass Index), “Insulin”, “Age”, “BloodPressure”, “Glucose” and “DiabetesPedigreeFunction”.

A linear regression model was used to predict the outcome variable using the others as predictor variables.

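# Convert Outcome to numeric (coded 1 = Yes, 2 = No in this section)
diabetes$Outcome <- as.numeric(diabetes$Outcome)
# Fit a linear regression of Outcome on the seven predictors
model <- lm(formula = Outcome ~ Pregnancies + BMI + Insulin + Age + BloodPressure + Glucose + DiabetesPedigreeFunction, data = diabetes)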

# Summary of the linear regression model
summary(model)

## 
## Call:
## lm(formula = Outcome ~ Pregnancies + BMI + Insulin + Age + BloodPressure + 
##     Glucose + DiabetesPedigreeFunction, data = diabetes)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.99115 -0.29540  0.07944  0.28584  1.06831 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               3.0253763  0.1035845  29.207  < 2e-16 ***
## Pregnancies              -0.0204037  0.0050543  -4.037 5.96e-05 ***
## BMI                      -0.0149618  0.0022473  -6.658 5.33e-11 ***
## Insulin                   0.0002412  0.0001486   1.623  0.10497    
## Age                      -0.0020047  0.0015349  -1.306  0.19192    
## BloodPressure             0.0015239  0.0013138   1.160  0.24647    
## Glucose                  -0.0066740  0.0005365 -12.441  < 2e-16 ***
## DiabetesPedigreeFunction -0.1368006  0.0442019  -3.095  0.00204 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3941 on 760 degrees of freedom
## Multiple R-squared:  0.3233, Adjusted R-squared:  0.3171 
## F-statistic: 51.88 on 7 and 760 DF,  p-value: < 2.2e-16

Coefficients

The coefficients obtained from the linear regression analysis are as follows:

Intercept (Intercept): The intercept of the model is approximately 3.0254. This represents the expected value of the "Outcome" variable when all predictor variables are zero.

Pregnancies: For each one-unit increase in the number of pregnancies, the "Outcome" variable is expected to decrease by approximately 0.0204, holding all other predictors constant.

BMI: For each one-unit increase in BMI, the "Outcome" variable is expected to decrease by approximately 0.0150, holding all other predictors constant.

Insulin: The coefficient for Insulin is approximately 0.0002, but it is not statistically significant (p-value > 0.05). This suggests that Insulin may not have a significant impact on the "Outcome" variable in this model.

Age: Similarly, Age has a coefficient of approximately -0.002, but it is not statistically significant.

BloodPressure: BloodPressure has a coefficient of approximately 0.00152, but it is also not statistically significant.

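Glucose: For each one-unit increase in Glucose, the "Outcome" variable is expected to decrease by approximately 0.0067, holding all other predictors constant.

DiabetesPedigreeFunction: For each one-unit increase in DiabetesPedigreeFunction, the "Outcome" variable is expected to decrease by approximately 0.1368, holding all other predictors constant.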
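As a worked example of reading these coefficients, the sketch below multiplies each significant coefficient by a chosen increase in its predictor. The coefficients are copied from the model summary; the step sizes are illustrative choices, not values from the analysis. Because Outcome is coded 1 for "Yes" and 2 for "No" in this section, a negative change moves the fitted value toward "Yes".

```r
# Coefficients copied from the summary(model) output above
coefs <- c(
  Pregnancies = -0.0204037,
  BMI = -0.0149618,
  Glucose = -0.0066740,
  DiabetesPedigreeFunction = -0.1368006
)

# Expected change in Outcome for a chosen increase in each predictor,
# holding the others constant (the step sizes are illustrative choices)
steps <- c(Pregnancies = 1, BMI = 5, Glucose = 10, DiabetesPedigreeFunction = 0.5)
change <- coefs * steps[names(coefs)]
print(round(change, 4))
```

For instance, a 10-unit increase in Glucose corresponds to a change of 10 × (-0.006674) = -0.06674 in the fitted Outcome.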

Statistical Significance

The predictor variables ‘Pregnancies,’ ‘BMI,’ ‘DiabetesPedigreeFunction’ and ‘Glucose’ are statistically significant (p < 0.05), indicating that they have a significant impact on the ‘Outcome’ variable.

On the other hand, ‘Insulin,’ ‘Age,’ and ‘BloodPressure’ are not statistically significant (p > 0.05) in this model.

The F-statistic is 51.88 on 7 and 760 degrees of freedom, with a p-value close to zero, indicating that the model as a whole is statistically significant.

Model Fit

The adjusted R-squared value of approximately 0.3171 (multiple R-squared: 0.3233) suggests that about 31.71% of the variance in the ‘Outcome’ variable is explained by the predictor variables.
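The fit statistics printed in the model summary are linked by standard formulas. The sketch below recomputes the adjusted R-squared and the overall F-statistic from the printed multiple R-squared (0.3233), the 768 observations and the 7 predictors. Because the input is rounded to four decimals, the results may differ from the printed 0.3171 and 51.88 in the last digit.

```r
r2 <- 0.3233  # Multiple R-squared from the summary output
n  <- 768     # observations
p  <- 7       # predictors (excluding the intercept)

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic on p and n - p - 1 degrees of freedom
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

print(round(adj_r2, 4))
print(round(f_stat, 2))
```

The residual degrees of freedom used here, n - p - 1 = 760, match the value reported in the summary.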

Conclusion

In summary, this linear regression analysis suggests that several predictor variables, including “Pregnancies,” “BMI,” “DiabetesPedigreeFunction” and “Glucose,” are statistically significant in predicting the “Outcome” variable. These variables have been found to have a significant impact on the likelihood of a positive outcome. However, variables such as “Insulin,” “Age,” and “BloodPressure” do not appear to play a significant role in this predictive model.

These findings provide valuable insights into the factors that may influence the outcome under investigation. Further research and analysis may help refine the model and provide additional insights.

Recommendation

Since more than 60% of the population were obese, education and lifestyle modification measures should be introduced to the community.

More screening measures should be introduced, and gym services and healthy eating should be adopted.

Other factors that may have been left out of this research should be explored, and the results compared with other regions.

More awareness should be created of diabetes, its complications and the need for screening.