Diabetes Prediction Using Data Mining
Objective
To predict diabetes in the healthcare industry using data mining techniques.
Project Overview
Diabetes is one of the major international health problems. The World Health Organization reports that around 422 million people worldwide have diabetes. Data mining plays a major role in predicting diabetes in the healthcare industry. Many algorithms have been developed for diabetes prediction, but most of them fall short on accuracy. There is also a need to automate the overall prediction process. Automating the analysis of a diabetic database helps identify the impact of diabetes on various human organs. The more accurate the prediction, the better the chances of estimating severity accurately. This project therefore concentrates on providing different methods for predicting diabetes.
Proposed System
Dataset
Here, the PIMA Indian diabetes data set is considered. The data set is taken from the UCI Machine Learning Repository. It consists of 9 attributes: number of times pregnant, plasma glucose concentration, diastolic blood pressure, triceps skin fold thickness, serum insulin, body mass index, diabetes pedigree function, age, and class. The class label is binary and takes two values:
- Tested positive (1), which means diabetic
- Tested negative (0), which means non-diabetic
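As a rough illustration of the record layout described above, the following Python sketch parses rows with nine comma-separated fields into named records. The project itself uses Weka; Python is used here only for illustration. The column names, the `load_records` helper, and the assumption of a headerless comma-separated file are all hypothetical, and the two sample rows are made-up placeholders rather than real data.

```python
import csv
import io

# Column names are assumed for illustration; they follow the attribute
# order listed in the text, with the class label last.
COLUMNS = [
    "pregnancies", "glucose", "blood_pressure", "skin_thickness",
    "insulin", "bmi", "pedigree", "age", "outcome",
]

def load_records(text):
    """Parse comma-separated rows (no header assumed) into dicts of numbers."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        if len(row) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(row)}")
        record = {name: float(value) for name, value in zip(COLUMNS, row)}
        # Class label: 1 = tested positive (diabetic), 0 = tested negative
        record["outcome"] = int(record["outcome"])
        records.append(record)
    return records

# Made-up placeholder rows, NOT taken from the real data set.
SAMPLE = """\
2,120,70,25,80,28.5,0.45,35,1
1,90,65,20,60,22.1,0.20,24,0
"""

records = load_records(SAMPLE)
positives = sum(r["outcome"] for r in records)
print(len(records), "records,", positives, "tested positive")
```

In practice the same function could be given the contents of the downloaded data file instead of the placeholder string.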
Diabetes Prediction Using Data Mining Methodology
Data preprocessing and data mining algorithms are used in the rest of the project. A data transformation step is applied to the data set as preprocessing before the data mining algorithms are run. Decision tree and regression models are then built to predict the binary target variable. After the different models are run, they are compared to select the best one. The best model is the one with the highest accuracy rate.
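The model comparison step can be sketched in a few lines of Python. This is only an illustration of selecting by accuracy, not the project's Weka workflow: the helper names `accuracy` and `select_best_model` are hypothetical, and the labels and predictions are invented placeholders.

```python
def accuracy(actual, predicted):
    """Fraction of records whose predicted class matches the actual class."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

def select_best_model(actual, predictions_by_model):
    """Return (model_name, accuracy) for the model with the highest accuracy."""
    scores = {
        name: accuracy(actual, preds)
        for name, preds in predictions_by_model.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]

# Placeholder labels and predictions, invented for illustration only.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predictions_by_model = {
    "decision_tree": [1, 0, 1, 0, 0, 0, 1, 0],  # 7 of 8 correct
    "regression":    [1, 0, 0, 0, 0, 1, 1, 0],  # 5 of 8 correct
}

best_name, best_accuracy = select_best_model(actual, predictions_by_model)
print(best_name, best_accuracy)  # decision_tree 0.875
```

Selecting on accuracy alone follows the criterion stated above; on a data set with unbalanced classes, other metrics may also be worth checking before settling on a model.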
Performance Metrics
The following performance metrics are used to evaluate the performance of various algorithms.
- True positive (TP) – the person has the disease and the prediction is positive
- True negative (TN) – the person does not have the disease and the prediction is negative
- False positive (FP) – the person does not have the disease but the prediction is positive
- False negative (FN) – the person has the disease but the prediction is negative
- The accuracy rate can be calculated from TP and TN, and the error rate from FP and FN.
- True positive rate is TP divided by the total number of people who actually have the disease (TP + FN).
- False positive rate is FP divided by the total number of people who do not actually have the disease (FP + TN).
- Precision is TP divided by the total number of people predicted positive (TP + FP).
- Accuracy is the number of correctly classified records (TP + TN) divided by the total number of records.
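The metrics listed above can be computed directly from the four confusion-matrix counts. The Python sketch below assumes binary labels where 1 means diabetic and 0 means non-diabetic. The function names and the example labels are hypothetical and used only for illustration.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN for binary labels (1 = diabetic, 0 = non-diabetic)."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 0:
            tn += 1
        elif a == 0 and p == 1:
            fp += 1
        else:  # a == 1 and p == 0, given binary labels
            fn += 1
    return tp, tn, fp, fn

def compute_metrics(tp, tn, fp, fn):
    """Derive the rates from the counts, returning 0.0 for empty denominators."""
    total = tp + tn + fp + fn
    return {
        "true_positive_rate": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
        "error_rate": (fp + fn) / total if total else 0.0,
    }

# Invented example labels: 4 diabetic and 6 non-diabetic people.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)
results = compute_metrics(tp, tn, fp, fn)
print(tp, tn, fp, fn)  # 3 5 1 1
print(results)
```

With these example labels, the true positive rate and precision are both 3/4, the false positive rate is 1/6, and the accuracy is 8/10.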
Diabetes Prediction Using Data Mining Results
Finally, a decision tree is built using the C4.5 decision tree algorithm. All results are displayed to the end user using Weka data visualization. The regression model provides the predicted outcome to the end user.
Software Requirements
- Windows OS
- Weka
Hardware Requirements
- Hard Disk – 1 TB or Above
- RAM required – 8 GB or Above
- Processor – Core i3 or Above
Technology Used
- Data Mining
- Data Visualization