We will use the heart disease data set for logistic regression; you can download it from Kaggle. The "Heart Disease UCI" dataset, widely used on Kaggle and other data science platforms, contains several features (independent variables) and a response variable (dependent variable). The commonly included features and the response variable are:
Features (Independent Variables):
- age: age in years
- sex: sex (1 = male, 0 = female)
- cp: chest pain type
- trestbps: resting blood pressure (mm Hg)
- chol: serum cholesterol (mg/dl)
- fbs: fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
- restecg: resting electrocardiographic results
- thalach: maximum heart rate achieved
- exang: exercise-induced angina (1 = yes, 0 = no)
- oldpeak: ST depression induced by exercise relative to rest
- slope: slope of the peak exercise ST segment
- ca: number of major vessels colored by fluoroscopy
- thal: thalassemia status

Response Variable (Dependent Variable):
- target: presence of heart disease (1 = disease, 0 = no disease)
The goal of using this dataset is to build a predictive model that can accurately classify patients into these two categories (presence or absence of heart disease) based on the provided features. Data scientists and researchers use this dataset to train machine learning models to predict heart disease and to analyze the relationship between these features and the likelihood of heart disease.
As a demonstration, we will build a logistic regression model for this task, coded in both R and Python.
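Before the code, it helps to recall what logistic regression actually computes: it passes a linear combination of the features through the sigmoid (logistic) function, producing a probability between 0 and 1, which is then thresholded (typically at 0.5) to get a class label. A minimal sketch of the sigmoid itself (illustrative only, not part of either script below):

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# A score of 0 sits exactly on the decision boundary (probability 0.5);
# large positive scores approach 1, large negative scores approach 0.
print(sigmoid(0.0))   # 0.5
print(sigmoid(4.0))   # ~0.982
print(sigmoid(-4.0))  # ~0.018
```

Both the R and Python models below fit the coefficients of that linear combination and apply exactly this 0.5 cutoff to turn probabilities into 0/1 predictions.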
# Load necessary libraries
library(dplyr)
library(ggplot2)
library(caret)
# Load the heart disease dataset (adjust the file path accordingly)
heart_data <- read.csv("heart.csv")
# Split the dataset into training and testing sets
set.seed(123)
train_indices <- createDataPartition(heart_data$target, p = 0.7, list = FALSE)
train_data <- heart_data[train_indices, ]
test_data <- heart_data[-train_indices, ]
# Build a logistic regression model
logistic_model <- glm(target ~ ., data = train_data, family = "binomial")
# Make predictions on the test data
predictions <- predict(logistic_model, newdata = test_data, type = "response")
# Convert probabilities to binary predictions (0 or 1)
predicted_classes <- ifelse(predictions > 0.5, 1, 0)
# Evaluate the model
confusion_matrix <- table(predicted_classes, test_data$target)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
# Print the confusion matrix and accuracy
print(confusion_matrix)
cat("Accuracy:", accuracy, "\n")
| Predicted Classes \ Actual Classes | 0 | 1 |
| --- | --- | --- |
| 0 | 110 | 16 |
| 1 | 35 | 146 |
Accuracy: 0.8338762
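As a quick sanity check, the reported accuracy can be reproduced from the confusion matrix itself: correct predictions (the diagonal cells) divided by all predictions.

```python
# Diagonal cells (110 and 146) are the correctly classified patients.
correct = 110 + 146
total = 110 + 16 + 35 + 146   # all four cells
accuracy = correct / total
print(round(accuracy, 7))  # 0.8338762
```

This matches the accuracy printed by the R script above.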
In this code:
- We load dplyr for data manipulation, ggplot2 for data visualization, and caret for data splitting.
- We split the data into training and testing sets with the createDataPartition function from the caret package.
- We fit a logistic regression model with the glm function, specifying target ~ . to predict the target variable from all other variables in the dataset.
- We make predictions on the test data with the predict function.

You can run this code in your R environment / Kaggle environment / Google Colaboratory with the appropriate dataset file, and it will build a logistic regression model for heart disease prediction and report an accuracy score.
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
# Load the heart disease dataset (adjust the file path accordingly)
heart_data = pd.read_csv("heart.csv")
# Split the dataset into training and testing sets
X = heart_data.drop("target", axis=1)
y = heart_data["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
# Create the model and fit it on the training data
logistic_model = LogisticRegression(max_iter=1000)
logistic_model.fit(X_train, y_train)
# Make predictions on the test data
predictions = logistic_model.predict(X_test)
# Evaluate the model
confusion_df = pd.crosstab(index=y_test, columns=predictions, rownames=['Actual'], colnames=['Predicted'])
accuracy = accuracy_score(y_test, predictions)
# Print the confusion matrix and accuracy
print("Confusion Matrix:")
print(confusion_df)
print("Accuracy:", accuracy)
| Actual Classes \ Predicted Classes | 0 | 1 |
| --- | --- | --- |
| 0 | 130 | 24 |
| 1 | 16 | 138 |
Accuracy: 0.870129
In this Python code:
- We split the data into training and testing sets with train_test_split from Scikit-Learn.
- We create a logistic regression model with LogisticRegression() from Scikit-Learn.
- We fit the model on the training data with fit().
- We make predictions on the test data with predict().

You can run this Python code in your environment, and it will build a logistic regression model for heart disease prediction and report an accuracy score, just like the R code provided earlier.
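Beyond accuracy, a fitted logistic regression model is also interpretable: exponentiating each coefficient gives an odds ratio, i.e. how the odds of heart disease change per unit increase in that feature. A self-contained sketch using synthetic data (since heart.csv may not be available in your environment; the feature names here are placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for heart.csv so this snippet runs on its own.
X, y = make_classification(n_samples=300, n_features=5, random_state=123)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Each coefficient is the change in log-odds per unit of its feature;
# exponentiating converts it to an odds ratio (>1 raises the odds, <1 lowers them).
odds_ratios = np.exp(model.coef_[0])
for i, ratio in enumerate(odds_ratios):
    print(f"feature_{i}: odds ratio = {ratio:.3f}")

# predict_proba returns [P(class 0), P(class 1)] for each row,
# which is what predict() thresholds at 0.5 under the hood.
probs = model.predict_proba(X[:3])
print(probs)
```

On the real dataset, the same pattern (np.exp(logistic_model.coef_[0]) alongside the column names of X) shows which clinical features most strongly shift the odds of heart disease.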