UCI Heart Disease Logistic Regression Predictive Data Science Project

We will use a heart disease dataset for logistic regression. The "Heart Disease UCI" dataset, widely used on Kaggle and other data science platforms, typically contains several features (independent variables) and a response variable (dependent variable). Here are the commonly included features and the response variable:

Features (Independent Variables):

  1. Age: The age of the patient.
  2. Sex: The gender of the patient (0 for female, 1 for male).
  3. Chest Pain Type: The type of chest pain experienced by the patient (e.g., typical angina, atypical angina, non-anginal pain).
  4. Resting Blood Pressure: The patient's resting blood pressure.
  5. Cholesterol: The patient's serum cholesterol level in mg/dl.
  6. Fasting Blood Sugar: Whether the fasting blood sugar is greater than 120 mg/dl (1 for true, 0 for false).
  7. Resting Electrocardiographic Results: The results of the resting electrocardiogram (ECG) (e.g., normal, abnormal ST-T wave, hypertrophy).
  8. Maximum Heart Rate: The patient's maximum heart rate achieved during exercise.
  9. Exercise-Induced Angina: Whether angina was induced by exercise (1 for yes, 0 for no).
  10. ST Depression: ST depression induced by exercise relative to rest.
  11. Slope: The slope of the peak exercise ST segment (e.g., upsloping, flat, downsloping).
  12. Number of Major Vessels (Fluoroscopy): The number of major vessels colored by fluoroscopy (0-3).
  13. Thalassemia: A blood disorder (e.g., normal, fixed defect, reversible defect).

Response Variable (Dependent Variable):

  • The response variable is typically a binary classification label representing the presence or absence of heart disease. It is often represented as:
  • 0: No heart disease
  • 1: Heart disease is present

The goal of using this dataset is to build a predictive model that can accurately classify patients into these two categories (presence or absence of heart disease) based on the provided features. Data scientists and researchers use this dataset to train machine learning models to predict heart disease and to analyze the relationship between these features and the likelihood of heart disease.

As a demonstration, we will use logistic regression for this task, coding it in both R and Python.
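Before the code, the core idea: logistic regression passes a linear combination of the features through the sigmoid function to obtain a probability between 0 and 1, which is then thresholded (commonly at 0.5) to produce a class label. A minimal Python sketch of the sigmoid:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# The model computes z = b0 + b1*x1 + ... + bk*xk for a patient,
# then sigmoid(z) gives the estimated probability of heart disease.
print(sigmoid(0.0))  # 0.5: the decision boundary
print(sigmoid(2.0))  # ~0.88: classified as 1 under a 0.5 threshold
```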


R

Loading the Libraries

# Load necessary libraries
library(dplyr)
library(ggplot2)
library(caret)

Loading the Data Set

# Load the heart disease dataset (adjust the file path accordingly)
heart_data <- read.csv("heart.csv")

Training and Test Data Set

# Split the dataset into training and testing sets
set.seed(123)
train_indices <- createDataPartition(heart_data$target, p = 0.7, list = FALSE)
train_data <- heart_data[train_indices, ]
test_data <- heart_data[-train_indices, ]

Logistic Regression Model

# Build a logistic regression model
logistic_model <- glm(target ~ ., data = train_data, family = "binomial")

Predictions

# Make predictions on the test data
predictions <- predict(logistic_model, newdata = test_data, type = "response")

# Convert probabilities to binary predictions (0 or 1)
predicted_classes <- ifelse(predictions > 0.5, 1, 0)

Model Evaluation

# Evaluate the model
confusion_matrix <- table(predicted_classes, test_data$target)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)

# Print the confusion matrix and accuracy
print(confusion_matrix)
cat("Accuracy:", accuracy, "\n")

Model Performance

Actual \ Predicted      0      1
0                     110     16
1                      35    146

Accuracy: 0.8338762

In this code:

  1. We load the necessary libraries, including dplyr for data manipulation, ggplot2 for data visualization, and caret for data splitting.
  2. We load the heart disease dataset into R. Please adjust the file path accordingly to load your dataset.
  3. We split the dataset into a training set (70%) and a testing set (30%) using the createDataPartition function from the caret package.
  4. We build a logistic regression model using the glm function, specifying target ~ . to predict the target variable based on all other variables in the dataset.
  5. We make predictions on the test data using the predict function.
  6. We convert the predicted probabilities into binary predictions using a threshold of 0.5.
  7. We evaluate the model's performance by calculating the confusion matrix and accuracy.

You can run this code in your local R environment, on Kaggle, or in Google Colaboratory with the appropriate dataset file; it will build a logistic regression model for heart disease prediction and report an accuracy score.


Python

Loading the Libraries

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

Loading the Data Set

# Load the heart disease dataset (adjust the file path accordingly)
heart_data = pd.read_csv("heart.csv")

Training and Test Data Set

# Split the dataset into training and testing sets
X = heart_data.drop("target", axis=1)
y = heart_data["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
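Note that train_test_split does not preserve the class proportions by default; passing stratify=y keeps the 0/1 ratio the same in both splits, which matters more as the classes become imbalanced. A small sketch with synthetic labels (not the heart dataset) illustrates this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 80 zeros and 20 ones
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# stratify=y preserves the 80/20 class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)
print(y_train.mean(), y_test.mean())  # both 0.2
```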

Logistic Regression Model

# Create the logistic regression model
logistic_model = LogisticRegression(max_iter=1000)

# Fit the model on the training data
logistic_model.fit(X_train, y_train)

Predictions

# Make predictions on the test data
predictions = logistic_model.predict(X_test)
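Unlike the R code, which thresholds probabilities explicitly, scikit-learn's predict() applies a 0.5 cutoff internally. The equivalent explicit version uses predict_proba(), which is useful when you want a different threshold. A self-contained sketch on tiny synthetic data (not the heart dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic example: one feature, binary labels
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# predict() is equivalent to thresholding the class-1 probability at 0.5
probs = model.predict_proba(X)[:, 1]
manual = (probs > 0.5).astype(int)
print(np.array_equal(manual, model.predict(X)))  # True
```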

Model Evaluation

# Evaluate the model
confusion_matrix_result = confusion_matrix(y_test, predictions)
confusion_df = pd.crosstab(index=y_test, columns=predictions,
                           rownames=['Actual'], colnames=['Predicted'])
accuracy = accuracy_score(y_test, predictions)

# Print the confusion matrix and accuracy
print("Confusion Matrix:")
print(confusion_df)
print("Accuracy:", accuracy)

Model Performance

Actual \ Predicted      0      1
0                     130     24
1                      16    138

Accuracy: 0.870129

In this Python code:

  1. We import the necessary libraries, including Pandas for data handling, NumPy for numerical operations, Scikit-Learn for machine learning, and related modules.
  2. We load the heart disease dataset into Python. Make sure to adjust the file path according to the location of your dataset file.
  3. We split the dataset into a training set (70%) and a testing set (30%) using train_test_split from Scikit-Learn.
  4. We create a logistic regression model using LogisticRegression() from Scikit-Learn.
  5. We fit the model on the training data using fit().
  6. We make predictions on the test data using predict().
  7. We evaluate the model's performance by calculating the confusion matrix and accuracy using Scikit-Learn's functions.

You can run this Python code in your environment, and it will build a logistic regression model for heart disease prediction and provide an accuracy score, just like the R code provided earlier.


Note

  • This is a basic predictive data science project. A project is not just about computing accuracy on test data; the goal is to fundamentally understand the dataset and determine which model is likely to work well on future data.
  • We have used accuracy as the performance measure because the classes here are roughly balanced; for imbalanced data, measures such as precision, recall, F1-score, and AUC are more informative.
  • To understand a model deeply, one should also perform statistical tests on the logistic regression coefficients, along with data visualization and exploratory data analysis, to continually build intuition about both the data and the model's behaviour.
  • Treat this project as a first step: start coding, get your hands dirty, and study each model in detail.
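As a pointer toward other classification performance measures, scikit-learn provides precision, recall, and F1-score directly. A small sketch on hypothetical labels (purely illustrative, not results from the heart dataset):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical true labels and predictions, for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1])

# TP = 3, FP = 1, FN = 1 for the positive class
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP) = 0.75
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN) = 0.75
print("F1-score: ", f1_score(y_true, y_pred))         # harmonic mean = 0.75
```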

Learn. Code. Apply.
Statistics. Machine Learning.