This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict based on diagnostic measurements whether a patient has diabetes.
Diabetes is a chronic disease that occurs either when the pancreas does not produce enough insulin or when the body cannot use the insulin it produces.
Type 1 DM - Caused by an autoimmune reaction. It can be diagnosed at any age but is more common in younger people. Symptoms develop rapidly.
Type 2 DM - The body does not produce enough insulin or cannot use the insulin it produces. It develops gradually over time and can be prevented by lifestyle modification.
Gestational DM - Develops during pregnancy, mostly after 20 weeks of gestation. Most cases clear after delivery, though there is an increased risk of developing DM later in life.
Impaired glucose tolerance - An intermediate state between normal glucose levels and diabetes. People with impaired glucose tolerance have a high risk of progressing to type 2 DM.
Type 2 DM leads to multiple microvascular and macrovascular complications. These cause:
Reduced life expectancy.
Premature mortality and increased morbidity.
Increased financial burden.
According to the WHO, noncommunicable diseases (NCDs) accounted for 74% of deaths globally, of which 1.6 million were diabetes related, making diabetes the 9th leading cause of death worldwide.
More than 37 million US adults have diabetes, and 1 in 5 do not know they have it. It is the 8th leading cause of death in the US. In the last twenty years, the number of adults diagnosed with diabetes has doubled.
The estimated DM population in India was 77 million, with an expected rise to over 134 million by 2045.
More than 50% of people are unaware of their diabetes status (WHO, 2019).
The risk of diabetes is mainly influenced by ethnicity, age, obesity, unhealthy diet and family history.
Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function, a score of the likelihood of diabetes based on family history
Age: Age (years)
Outcome: Class variable (0 or 1). 1 = the patient has diabetes, 0 = the patient does not.
What is the distribution of the number of pregnancies among the female Pima Indian patients in the dataset?
How does the distribution of plasma glucose concentration vary among patients with and without diabetes?
Is there a correlation between diastolic blood pressure and the likelihood of diabetes in these patients?
How does the 2-hour serum insulin level differ between patients with and without diabetes?
What is the distribution of BMI (Body Mass Index) among these patients, and does it correlate with the presence of diabetes?
What is the distribution of the diabetes pedigree function scores in the dataset?
How does age vary among patients with diabetes and those without diabetes?
What is the overall prevalence of diabetes (Outcome = 1) among these Pima Indian female patients?
Are there any noticeable trends or patterns in the data that suggest certain factors are more strongly associated with diabetes among this population?
Can a predictive model be developed to estimate the likelihood of diabetes based on these features?
Are there any relationships or interactions between the variables that are worth exploring, such as age and BMI, or glucose levels and insulin levels?
Can insights be gained about the hereditary factors (DiabetesPedigreeFunction) and their impact on diabetes in this population?
The data was imported into R, and head() was run to view the first 6 rows. The summary statistics for each column follow.
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
The dimensions of the data were also checked. The dataset has 768 rows and 9 columns.
dim(diabetes)
## [1] 768 9
colnames(diabetes)
## [1] "Pregnancies" "Glucose"
## [3] "BloodPressure" "SkinThickness"
## [5] "Insulin" "BMI"
## [7] "DiabetesPedigreeFunction" "Age"
## [9] "Outcome"
The data was checked for null values and duplicated rows. No null values were present, and the duplicate count at the end of the output is also 0.
## $Pregnancies
## [1] 0
##
## $Glucose
## [1] 0
##
## $BloodPressure
## [1] 0
##
## $SkinThickness
## [1] 0
##
## $Insulin
## [1] 0
##
## $BMI
## [1] 0
##
## $DiabetesPedigreeFunction
## [1] 0
##
## $Age
## [1] 0
##
## $Outcome
## [1] 0
## [1] 0
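The output above has the shape of a per-column missing-value count followed by a single duplicate-row count. A minimal sketch of such a check, assuming base R only, is shown here on a small hypothetical data frame named `toy` (the values are illustrative and are not taken from the dataset):

```r
# Toy stand-in for the diabetes data frame (illustrative values only)
toy <- data.frame(
  Glucose = c(148, 85, NA, 89),
  BMI     = c(33.6, 26.6, 23.3, 28.1)
)
toy <- rbind(toy, toy[1, ])  # append a copy of row 1 to create one duplicate

# Number of missing values in each column, returned as a named list
na_counts <- lapply(toy, function(x) sum(is.na(x)))

# Number of rows that exactly repeat an earlier row
n_duplicates <- sum(duplicated(toy))

print(na_counts)
print(n_duplicates)
```

A check like this only detects values stored as NA. Where impossible zeros stand in for missing measurements, counting zeros per column, for example with `colSums(toy == 0, na.rm = TRUE)`, is a useful companion check.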
Since blood pressure, BMI, skin thickness, insulin and glucose cannot be zero, all zero values in these columns were replaced with the mean or median, since dropping the affected rows would change the data. An age category column (age_cat) was also added.
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 44.00 Min. : 24.00 Min. : 7.00
## 1st Qu.: 1.000 1st Qu.: 99.75 1st Qu.: 64.00 1st Qu.:20.54
## Median : 3.000 Median :117.00 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :121.68 Mean : 72.25 Mean :26.61
## 3rd Qu.: 6.000 3rd Qu.:140.25 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.00 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 14.00 Min. :18.20 Min. :0.0780 Min. :21.00
## 1st Qu.: 30.50 1st Qu.:27.50 1st Qu.:0.2437 1st Qu.:24.00
## Median : 31.25 Median :32.00 Median :0.3725 Median :29.00
## Mean : 94.65 Mean :32.45 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.25 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.00 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome age_cat
## Min. :0.000 Length:768
## 1st Qu.:0.000 Class :character
## Median :0.000 Mode :character
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
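One way to carry out this kind of replacement is sketched below. The helper `replace_zero_with_median` and the data frame `toy` are hypothetical names introduced for illustration. Note that this sketch computes the replacement value from the non-zero observations only, which is an assumption and may differ from what was done above: in the summary, the new first quartiles of SkinThickness (20.54) and Insulin (30.50) equal the original mean and median, which suggests the replacement values there were computed with the zeros still included.

```r
# Hypothetical helper: replace zeros with the median of the non-zero values
replace_zero_with_median <- function(x) {
  non_zero_median <- median(x[x != 0], na.rm = TRUE)
  zero_idx <- which(x == 0)        # which() ignores NA entries
  x[zero_idx] <- non_zero_median
  x
}

# Toy data (illustrative values only)
toy <- data.frame(
  Glucose       = c(148, 0, 183, 89, 137),
  BloodPressure = c(72, 66, 0, 66, 40),
  Pregnancies   = c(6, 1, 8, 1, 0)  # zero is a valid value here, so it is left alone
)

# Only columns where zero is physiologically impossible are modified
cols_to_fix <- c("Glucose", "BloodPressure")
toy[cols_to_fix] <- lapply(toy[cols_to_fix], replace_zero_with_median)

print(toy)
```

Restricting the replacement to a named set of columns keeps legitimate zeros, such as zero pregnancies, untouched.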
age_cat | Percent
---|---
Below_30 | 51.56
30s | 21.48
40s | 15.36
50s | 7.42
60s | 3.78
Above_70 | 0.39
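A table like this can be produced by binning ages and converting the counts to percentages. The sketch below uses base R `cut()` on a hypothetical vector `ages`; the category labels come from the table above, but the exact bin edges (for example, that age 70 falls in Above_70) are an assumption.

```r
# Illustrative ages only, not the real data
ages <- c(21, 29, 30, 45, 59, 60, 69, 70, 81)

age_cat <- cut(
  ages,
  breaks = c(-Inf, 30, 40, 50, 60, 70, Inf),
  labels = c("Below_30", "30s", "40s", "50s", "60s", "Above_70"),
  right  = FALSE   # intervals are [lower, upper), so 30 falls in "30s"
)

# Percentage of patients in each category, rounded to 2 decimals
percent <- round(100 * prop.table(table(age_cat)), 2)
print(percent)
```

`cut()` returns a factor, so the categories keep their natural order in tables and plots; a character column, as in the summary above, would sort alphabetically instead.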
The median number of pregnancies was 3 (mean 3.85), with the distribution skewed to the right: most patients had few pregnancies, and a long tail had many. A few extreme cases with more than 11 pregnancies accounted for 4.43%.
Those who had a higher number of pregnancies had a higher chance of being diabetic.
About 61.98% of the population had a BMI above 30, which is classified as obese. 1.04% had a BMI above 50, which, although possible, could be an outlier. 0.52% of the population were underweight, with a BMI < 18.5.
## # A tibble: 9 × 2
## BMI_cat Percent
## <chr> <int>
## 1 Moderate Obese 235
## 2 Obese 179
## 3 Severe Obese 150
## 4 Normal 102
## 5 Very severe Obese 62
## 6 Morbid Obese 27
## 7 Super Obese 8
## 8 Underweight 4
## 9 Hyper obese 1
The Percent column in the table above holds patient counts (they sum to 768), not percentages. The super obese, severe obese and very severe obese groups had the highest diabetes pedigree function. This is not a reliable indicator on its own, since even the underweight group still had a high DPF.
Patients with glucose above 125, classified as hyperglycemia, accounted for 40.49%; impaired glucose accounted for 34.5%, while hypoglycemia accounted for 1%.
The hyperglycemia and impaired glucose groups had the highest diabetes pedigree function.
Patients with hyperglycemia (high glucose 2 hours after the oral glucose tolerance test) had a higher chance of having diabetes as their outcome, whereas none of the hypoglycemic patients had diabetes.
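The comparison of glucose category against outcome can be sketched as a row-wise proportion table. Only the hyperglycemia cut-off (above 125) is stated in the text; the other cut-offs used below (70 and 100), the "Normal" label and the data frame `toy` are assumptions made for illustration.

```r
# Toy data (illustrative values only)
toy <- data.frame(
  Glucose = c(60, 85, 95, 110, 120, 130, 150, 180),
  Outcome = c(0, 0, 0, 0, 1, 1, 0, 1)
)

# Bin glucose; intervals are (lower, upper], so values above 125 are hyperglycemia
toy$glucose_cat <- cut(
  toy$Glucose,
  breaks = c(-Inf, 70, 100, 125, Inf),
  labels = c("Hypoglycemia", "Normal", "Impaired glucose", "Hyperglycemia"),
  right  = TRUE
)

# Row-wise proportions: share of each glucose category with Outcome 0 or 1
outcome_by_cat <- prop.table(table(toy$glucose_cat, toy$Outcome), margin = 1)
print(round(outcome_by_cat, 2))
```

Using `margin = 1` makes each row sum to 1, so the column for Outcome 1 reads directly as the share of patients with diabetes within each glucose category.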
The study used diastolic blood pressure, taking 60 - 80 mmHg as the normal range. The readings were approximately normally distributed, though 11.2% had a low diastolic BP (less than 60 mmHg), 62% had a normal BP, and 0.13% (1 person) had a reading in the hypertensive crisis range (BP > 120 mmHg).
Diastolic blood pressure did not have a strong effect on the diabetes outcome.
We aim to investigate the relationship between several medical predictor variables and a binary outcome variable, “Outcome.” The dataset contains information on pregnancies, BMI, insulin levels, age, blood pressure, glucose levels, and the binary outcome variable, which takes on the values 1 (indicating “Yes”) and 2 (indicating “No”). The analysis seeks to understand how these predictor variables influence the outcome.
The dataset has 768 observations with the following variables: “Outcome” (response variable) and the predictor variables “Pregnancies”, “BMI” (Body Mass Index), “Insulin”, “Age”, “BloodPressure”, “Glucose” and “DiabetesPedigreeFunction”.
A linear regression model was used to predict the outcome variable using the others as predictor variables.
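# Convert Outcome to numeric; as described above, the values used are 1 ("Yes") and 2 ("No")
diabetes$Outcome <- as.numeric(diabetes$Outcome)
# Fit a linear regression of Outcome on the predictor variables
model <- lm(formula = Outcome ~ Pregnancies + BMI + Insulin + Age + BloodPressure + Glucose + DiabetesPedigreeFunction, data = diabetes)
# Summary of the linear regression model
summary(model)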
##
## Call:
## lm(formula = Outcome ~ Pregnancies + BMI + Insulin + Age + BloodPressure +
## Glucose + DiabetesPedigreeFunction, data = diabetes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.99115 -0.29540 0.07944 0.28584 1.06831
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0253763 0.1035845 29.207 < 2e-16 ***
## Pregnancies -0.0204037 0.0050543 -4.037 5.96e-05 ***
## BMI -0.0149618 0.0022473 -6.658 5.33e-11 ***
## Insulin 0.0002412 0.0001486 1.623 0.10497
## Age -0.0020047 0.0015349 -1.306 0.19192
## BloodPressure 0.0015239 0.0013138 1.160 0.24647
## Glucose -0.0066740 0.0005365 -12.441 < 2e-16 ***
## DiabetesPedigreeFunction -0.1368006 0.0442019 -3.095 0.00204 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3941 on 760 degrees of freedom
## Multiple R-squared: 0.3233, Adjusted R-squared: 0.3171
## F-statistic: 51.88 on 7 and 760 DF, p-value: < 2.2e-16
The coefficients obtained from the linear regression analysis are as follows. Because "Outcome" is coded 1 for "Yes" and 2 for "No", a negative coefficient means that higher values of the predictor move the expected outcome toward 1, that is, toward diabetes.
Intercept (Intercept): The intercept of the model is approximately 3.0254. This represents the expected value of the "Outcome" variable when all predictor variables are zero.
Pregnancies: For each one-unit increase in the number of pregnancies, the "Outcome" variable is expected to decrease by approximately 0.0204, holding all other predictors constant.
BMI: For each one-unit increase in BMI, the "Outcome" variable is expected to decrease by approximately 0.0150, holding all other predictors constant.
Insulin: The coefficient for Insulin is approximately 0.0002, but it is not statistically significant (p-value > 0.05). This suggests that Insulin may not have a significant impact on the "Outcome" variable in this model.
Age: Similarly, Age has a coefficient of approximately -0.002, but it is not statistically significant.
BloodPressure: BloodPressure has a coefficient of approximately 0.0015, but it is also not statistically significant.
Glucose: For each one-unit increase in Glucose, the "Outcome" variable is expected to decrease by approximately 0.0067, holding all other predictors constant.
DiabetesPedigreeFunction: For each one-unit increase in DiabetesPedigreeFunction, the "Outcome" variable is expected to decrease by approximately 0.1368, holding all other predictors constant.
The predictor variables ‘Pregnancies,’ ‘BMI,’ ‘DiabetesPedigreeFunction’ and ‘Glucose’ are statistically significant (p < 0.05), indicating that they have a significant impact on the ‘Outcome’ variable.
On the other hand, ‘Insulin,’ ‘Age,’ and ‘BloodPressure’ are not statistically significant (p > 0.05) in this model.
The F-statistic is 51.88 (on 7 and 760 degrees of freedom) with a p-value close to zero, indicating that the model as a whole is statistically significant.
The multiple R-squared is 0.3233 and the adjusted R-squared is 0.3171, so roughly 32% of the variance in the ‘Outcome’ variable is explained by the predictor variables.
In summary, this linear regression analysis suggests that several predictor variables, including “Pregnancies,” “BMI,” “DiabetesPedigreeFunction” and “Glucose,” are statistically significant in predicting the “Outcome” variable. These variables have been found to have a significant impact on the likelihood of a positive outcome. However, variables such as “Insulin,” “Age,” and “BloodPressure” do not appear to play a significant role in this predictive model.
These findings provide valuable insights into the factors that may influence the outcome under investigation. Further research and analysis may help refine the model and provide additional insights.
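One possible refinement, offered here as a sketch and not as the method used in this analysis, is logistic regression. Because the outcome is binary, a linear model can produce fitted values outside the range of the two classes, whereas a logistic model returns probabilities between 0 and 1. The example below uses simulated data in a hypothetical data frame `toy` with a 0/1 outcome; the simulated effect sizes are arbitrary and are not estimates from the diabetes dataset.

```r
set.seed(42)
n <- 200

# Simulated predictors (illustrative only)
toy <- data.frame(
  Glucose = rnorm(n, mean = 120, sd = 30),
  BMI     = rnorm(n, mean = 32, sd = 7)
)

# Simulated 0/1 outcome whose log-odds rise with Glucose and BMI
log_odds <- -8 + 0.04 * toy$Glucose + 0.08 * toy$BMI
toy$Outcome <- rbinom(n, size = 1, prob = plogis(log_odds))

# Logistic regression: models the log-odds of Outcome = 1
logit_model <- glm(Outcome ~ Glucose + BMI, data = toy, family = binomial)
print(summary(logit_model))

# Odds ratios: multiplicative change in the odds of Outcome = 1
# for a one-unit increase in each predictor
odds_ratios <- exp(coef(logit_model))
print(odds_ratios)

# Predicted probabilities always lie between 0 and 1
pred <- predict(logit_model, type = "response")
print(range(pred))
```

With a 0/1 coding where 1 means diabetes, a positive coefficient (odds ratio above 1) indicates a higher likelihood of diabetes, which avoids the sign reversal that the 1/2 coding introduces in the linear model.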
Since more than 60% of the population were obese, education and lifestyle modification measures should be introduced to the community.
More screening measures should be introduced, and gym services and healthy eating should be adopted.
Other factors that may have been left out of this research should be explored, and the results should be compared with other regions.
More awareness should be created of diabetes, its complications and the need for screening.