Research Article, J Womens Health Issues Care Vol: 8 Issue: 1
Analysis of Risk Factors of Gestational Diabetes Mellitus (GDM) Using Data Mining
*Corresponding Author : Prema NS
Department of Information Science and Engineering, Vidyavardhaka College of Engineering, Mysuru, India
E-mail: [email protected]
Received: October 11, 2018 Accepted: June 28, 2019 Published: June 30, 2019
Citation: Prema NS, Pushpalatha MP (2019) Analysis of Risk Factors of Gestational Diabetes Mellitus (GDM) Using Data Mining. J Womens Health, Issues Care 8:1. doi: 10.4172/2325-9795.1000327
Diabetes is a common chronic disease and a major health challenge in all populations. Gestational diabetes mellitus (GDM) is a type of diabetes that develops in women during pregnancy. We present a data mining (DM) approach to identify the risk factors of GDM using several data mining techniques. The dataset used for analysis contains the details of pregnant women admitted to the local hospital of Mysuru, India. The data mining techniques used are k-means clustering, J48 Decision Tree, Random-Forest and Naive-Bayes classifiers. Classification accuracy is enhanced by using a wrapper approach to feature subset selection. The data imbalance problem is handled by using the Synthetic Minority Over-sampling Technique (SMOTE). The performance of the algorithms has been measured and compared in terms of accuracy.
Keywords: J48 decision tree; Random forest; Naive-Bayes; Gestational diabetes mellitus; SMOTE; K-means
It is estimated that 1 out of every 200 pregnancies is complicated by diabetes mellitus and that, additionally, 5 in every 200 pregnant women will develop gestational diabetes mellitus. Globally, the prevalence of diabetes is increasing, and India is no exception. Four million women have GDM in India alone, so it is important to identify GDM at the earliest; otherwise it threatens the lives of mother and baby.
The factors behind the increasing prevalence of gestational diabetes in India are:
• The age of the women
• Lack of physical activity
• Modern lifestyles, smoking, alcohol consumption etc.
Diagnosing a pregnant woman with Gestational Diabetes Mellitus (GDM) is very important because diabetes mellitus is associated with significant metabolic alterations, increased perinatal mortality and morbidity, maternal morbidity and increased long-term illness among mothers and their offspring.
Since healthcare resources are scarce and the doctor-to-population ratio is very low, medical specialists remain overwhelmed by their workload; it therefore becomes a challenging task for them to provide quality care to patients. To meet the challenge of saving lives, we provide a combined e-health solution using a Clinical Decision Support System (CDSS) based on data mining, which analyzes the patient's data and classifies the patient as either normal or high risk.
Data mining is a method of extracting useful knowledge from large repositories of data. Medical data mining is an application of data mining in which data mining methods are used for the analysis of medical data. Medical data mining approaches are applied to the following tasks: diagnosis, treatment, prognosis, monitoring and management. The aim of medical data mining is to help and assist physicians, improve public health and support patients.
The two main approaches of data mining are prediction and description. Prediction includes classification and regression; description includes clustering and association analysis. Applications of both in the field of healthcare can be found in the literature. In most of the work referred to in this paper, the data set used is the Pima Indian diabetes data set from the UCI machine learning repository, which contains data about female patients [UCI]. Many classification algorithms have been applied to the Pima diabetes data set with the objective of classifying the data as either diabetic or non-diabetic. These studies consider only Type-I and Type-II diabetes; they do not take gestational diabetes into consideration.
To mention a few, Alexis et al. proposed the diagnosis of Type II diabetes by applying artificial metaplasticity to a multilayer perceptron; the data set used is the Pima Indians diabetes data set. Patil et al. proposed a hybrid prediction model for Type II diabetes which uses k-means clustering and the C4.5 classification algorithm. Similarly, Nahla H Barakat and his team have used support vector machines (SVMs) for the diagnosis of diabetes.
Many works in the literature relate to maternal healthcare data. To mention a few, M. Jamal Afridi and Muddassar Farooq presented a combination of data mining techniques for effective classification of high-risk pregnant women. The model classifies four major risk factors of mortality (hypertension, hemorrhage, septicemia and obstructed labour) in a reliable, autonomous and accurate fashion. Aparna Gorthi et al. proposed a machine learning approach for early determination of the risk category of a pregnancy based on patterns from profiles of known clinical parameters. Here classification techniques are applied just to identify the severity of risk as low, medium or high. Moreira et al. compared two Bayesian classifiers to classify hypertensive disorders in pregnancy care.
The first study on the application of machine learning techniques to electronic health records (EHRs) to predict GDM was proposed by Hang Qiu et al. In their work they developed prediction models capturing the future risk from the electronic medical records of women at West China Second Hospital; the average accuracy obtained is 62%.
The purpose of this work is to apply data mining techniques to explore the major risk factors of GDM, which can be used for early detection of GDM. The remainder of this paper is organized as follows: In Section II, the data mining algorithms are described. The results of the algorithms are discussed in Section III. The conclusions are given in Section IV.
Materials and Methods
No dataset of pregnant women with gestational diabetes exists; we have therefore attempted to create a new dataset containing information about diabetes in pregnancy. The data used in this experiment are collected from local hospitals of Mysuru, Karnataka state, India. The medical records are taken after masking the identity of patients in order to ensure confidentiality. The data set has been developed by keeping obstetrics and gynaecology consultants in a feedback loop. We have collected the details of about 1352 pregnant women. The GDM dataset is developed by removing less relevant and irrelevant features with the help of doctors, followed by data cleaning and transformation. Figure 1 shows the steps followed in the proposed model.
Removal of irrelevant features
The data taken from the hospitals contains more than 20 attributes. The features are reduced manually with the help of gynaecologists. As a result, only the 10 relevant features that consultants use to detect gestational diabetes are retained.
Data cleaning and transformation
This step is very important in developing a complete data set that can be used with any machine learning technique. Extracting useful information from manually entered medical records was very challenging: because the entries were made by hand, there was a lot of ambiguity in the values of some attributes. For example, for the attribute number of times pregnant (Gravida), some records give a number while others specify a term such as multigravida. In order to have a meaningful dataset, we applied a data cleansing and transformation cycle. Once the attributes were meaningful, the dataset was finalized on the basis of the 10 shortlisted risk factors. For most of the attributes the values are nominal, either yes or no.
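The kind of transformation described here can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the study: the function names, the mapping from labels to counts (for example coding multigravida as 2) and the sample values are all assumptions made for the example.

```python
import re

# Hypothetical mapping from free-text labels to a representative count.
# The paper does not say how such labels were coded; these values are assumptions.
LABEL_TO_COUNT = {"primigravida": 1, "primi": 1, "multigravida": 2, "multi": 2}


def normalise_gravida(raw):
    """Map a manually entered gravida value to an integer, or None if unreadable."""
    text = str(raw).strip().lower()
    if not text:
        return None
    match = re.search(r"\d+", text)  # handles "3", "G3", "gravida 2"
    if match:
        return int(match.group())
    return LABEL_TO_COUNT.get(text)  # handles "multigravida" and similar labels


def normalise_yes_no(raw):
    """Map assorted spellings of a nominal attribute to 'yes' or 'no'."""
    text = str(raw).strip().lower()
    if text in {"yes", "y", "present", "positive"}:
        return "yes"
    if text in {"no", "n", "absent", "negative", "nil"}:
        return "no"
    return None  # leave ambiguous entries for manual review


records = ["2", "G3", "Multigravida", " primi ", ""]
print([normalise_gravida(r) for r in records])
```

Entries that cannot be interpreted are returned as `None` rather than guessed, so that they can be checked against the original record.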
The GDM data set has 10 risk factors in total:
1. Age
2. Past history of fetal loss (abortion or IUD)
3. Congenital anomalies in previous pregnancy
4. Macrosomia in previous pregnancy
5. Family history of Diabetes mellitus
6. Obesity
7. Past history of Pre-eclampsia
8. Number of times pregnant
9. Unexplained neonatal loss
10. Previous history of GDM
Age is the major risk factor of GDM: the older the woman, the greater the chance of developing gestational diabetes. Figure 2 shows the association between age and GDM. It is found that women older than 25 have a greater chance of developing GDM.
Many attributes concern previous pregnancies, which might be a cause of GDM. The attributes selected for this study are history of fetal loss by abortion or IUD, congenital anomalies, GDM, unexplained neonatal loss, macrosomia and pre-eclampsia.
Macrosomia is the condition in which the birth weight is over 4,000 g, regardless of gestational age. Macrosomia affects about 3-15% of pregnancies.
A family history of diabetes can also be a reason for developing GDM. In the dataset used there are about 138 cases with a positive family history of diabetes, and GDM developed in more than 90% of them.
Obesity is a common risk factor for many diseases, and it can be considered a major risk factor for GDM as well. The dataset contains the details of about 95 obese women, 95% of whom have gestational diabetes.
Clustering is an unsupervised technique of grouping similar objects into disjoint groups using distance measures. k-means is a partition clustering technique that aims to partition the observations into k clusters, assigning each observation to the cluster with the nearest centroid.
The distance between an observation and a cluster centroid is measured using Euclidean distance.
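A minimal pure-Python sketch of k-means with Euclidean distance is given below to make the assign-and-update loop concrete. It is not the implementation used in the study; the function names, the random initialisation and the toy points (an age value paired with a 0/1 flag) are assumptions for illustration.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, k, iterations=100, seed=0):
    """Partition points into k clusters; returns (centroids, assignment)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed initialisation: k random points
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the partition is stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment


# Toy data (hypothetical): (age, obesity flag)
points = [(24, 0), (25, 0), (26, 0), (34, 1), (35, 1), (36, 1)]
centroids, assignment = k_means(points, k=2)
print(assignment)
```

Because age is on a much larger scale than a 0/1 flag, real use would normally rescale the attributes before computing distances.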
The aim of this part of the study is to apply classification techniques and find which gives better classification. We use the Decision Tree (J48), Random-Forest and Naive-Bayes classifiers.
Decision Tree (J48)
J48 is an implementation of the C4.5 decision tree algorithm. It builds a tree from a set of training data in much the same way as ID3, using information entropy.
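The entropy-based splitting criterion can be illustrated with a short sketch. The code computes entropy, information gain (the ID3 criterion) and gain ratio (the criterion C4.5 uses to choose a split attribute) for a nominal attribute. The function names and the four-row toy data are assumptions for illustration only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in class entropy obtained by splitting on a nominal attribute."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for x, lab in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """Information gain normalised by the entropy of the split itself."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0


# Toy data (hypothetical): an attribute that separates the classes perfectly
obesity = ["yes", "yes", "no", "no"]
outcome = ["gdm", "gdm", "normal", "normal"]
print(information_gain(obesity, outcome), gain_ratio(obesity, outcome))
```

At each node the tree builder evaluates every candidate attribute this way and splits on the one with the best score.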
Random-Forest is an ensemble of decision trees, each constructed using a random subset of the training data. It is a collection of tree-structured classifiers. The main principle behind ensemble methods is that a group of “weak learners” can come together to form a “strong learner”.
When the Random-Forest classifier is applied to the GDM data set, 10 trees are generated, each constructed while considering 4 random features, with an out-of-bag error of 0.1806.
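The two sources of randomness in a random forest, and the origin of the out-of-bag sample, can be sketched as follows. The sketch only draws the samples and feature subsets; it does not build trees or reproduce the reported error. The row count, tree count and feature counts are taken from the text, while the function names and the seed are assumptions.

```python
import random


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def out_of_bag(n_rows, in_bag):
    """Rows never drawn into the bootstrap sample; used to estimate the error."""
    drawn = set(in_bag)
    return [i for i in range(n_rows) if i not in drawn]


def candidate_features(n_features, n_candidates, rng):
    """Random subset of feature indices considered when choosing a split."""
    return sorted(rng.sample(range(n_features), n_candidates))


rng = random.Random(42)  # arbitrary seed, for repeatability only
N_ROWS, N_FEATURES, N_TREES, N_CANDIDATES = 1352, 10, 10, 4
for tree in range(N_TREES):
    bag = bootstrap_indices(N_ROWS, rng)
    oob = out_of_bag(N_ROWS, bag)
    feats = candidate_features(N_FEATURES, N_CANDIDATES, rng)
    print(tree, len(oob), feats)
```

Each tree is trained on its bootstrap sample and evaluated on the rows it never saw; averaging those errors over the forest gives the out-of-bag error.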
Naive-Bayes: This is a classifier based on Bayes' theorem, which uses the maximum likelihood method.
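For nominal yes/no attributes, the maximum likelihood estimates are simply relative frequencies counted from the training data. The sketch below shows this for a tiny hypothetical data set; the function names and the data are assumptions, and no smoothing is applied, so an attribute value never seen with a class gives that class a score of zero.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(label, j)][value] += 1
    return class_counts, value_counts


def predict(model, row):
    """Return the class maximising P(class) * product of P(value | class)."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # maximum likelihood estimate of the prior
        for j, value in enumerate(row):
            score *= value_counts[(label, j)][value] / n  # conditional estimate
        scores[label] = score
    return max(scores, key=scores.get)


# Toy data (hypothetical): (family history, obesity)
rows = [("yes", "yes"), ("yes", "no"), ("no", "no"), ("no", "no")]
labels = ["gdm", "gdm", "normal", "normal"]
model = train_naive_bayes(rows, labels)
print(predict(model, ("yes", "yes")), predict(model, ("no", "no")))
```

The "naive" part is the product in `predict`, which treats the attributes as independent given the class.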
Class imbalance problem
The data used is imbalanced. Imbalanced classification is a supervised learning problem in which one class outnumbers the other by a large proportion. In medical datasets high-risk patients tend to be the minority class, so the cost of mispredicting the minority class is higher. Therefore, a good sampling technique is needed for medical datasets.
The technique used for oversampling is SMOTE, in which the minority class is over-sampled by creating “synthetic” examples rather than by duplicating real data entries. SMOTE generates synthetic minority class samples blindly, without considering majority class samples, and may thus cause overgeneralization.
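The core step of SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched as below. This is a simplified illustration for numeric attributes only; nominal yes/no attributes would need a variant of the method. The function name, the parameter defaults and the toy points are assumptions.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points on segments between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the chosen base point
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority-class points (hypothetical numeric features)
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (2.5, 2.5)]
new_points = smote(minority, n_synthetic=6, k=2)
print(new_points)
```

Note that only minority-class points are consulted, which is why the synthetic samples can spill into regions occupied by the majority class.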
Results and Discussion
The k-means clustering technique is applied to the data set with k set to two. The two clusters formed are:
Cluster-1 with 90% of the data set.
Cluster-2 with 10% of the data set.
Cluster-2 contains the instances with GDM. In this cluster the attribute value is yes for family history of diabetes, number of times pregnant and obesity, and the average value of the attribute Age is 28.8. In Cluster-1 all the attribute values are no, and all instances belong to non-GDM cases.
Feature subset selection
The final 10 risk factors are given as input to the subset selection. A wrapper approach is used for feature subset selection: it uses the classification method itself to measure the importance of a feature set, so the features selected depend on the classifier model used. Attribute sets are evaluated by using a learning scheme, and cross validation is used to estimate the accuracy of the learning scheme for a set of attributes. The selection of attributes is done using a best fit approach, which is a hybrid of depth-first search (DFS) and breadth-first search (BFS).
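The wrapper idea can be sketched with a simpler search than the one used in the study: greedy forward selection, which repeatedly adds the single feature that most improves the score. The `score` argument stands in for the cross-validated accuracy of the chosen classifier on a feature subset; the toy scoring function, its weights and the feature names are hypothetical.

```python
def forward_selection(features, score):
    """Greedy wrapper search: keep adding the feature that most improves score."""
    selected, best = [], score([])
    improved = True
    while improved:
        improved = False
        candidate = None
        for f in features:
            if f in selected:
                continue
            s = score(selected + [f])
            if s > best:
                best, candidate, improved = s, f, True
        if improved:
            selected.append(candidate)
    return selected, best


def toy_score(subset):
    """Stand-in for cross-validated accuracy (hypothetical weights)."""
    gain = {"age": 0.10, "obesity": 0.08, "family_history": 0.05}
    # Useful features add accuracy; every feature carries a small cost.
    return 0.70 + sum(gain.get(f, 0.0) for f in subset) - 0.01 * len(subset)


features = ["age", "obesity", "family_history", "gravida", "macrosomia"]
selected, best = forward_selection(features, toy_score)
print(selected, round(best, 2))
```

Because the score comes from the classifier itself, running the same search with a different classifier can return a different subset.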
The Bayesian classifier Naive-Bayes and the tree-based classifiers J48 and Random-Forest are used for feature selection. Standard 10-fold cross validation is used in all the experiments. This process divides the data into 10 folds; in each round, 9 folds are used for training and the 1 remaining fold for testing.
GDM is predicted with the same classifiers, using both all the risk factors and the selected risk factors. Table 1 shows the accuracy of the classifiers with the selected features and with all features.
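The fold rotation can be sketched as follows. The split here is a plain shuffled split, not a stratified one, and the `fit` and `predict` arguments are placeholders for any classifier; all names and the toy data are assumptions for illustration.

```python
import random


def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train, test) index lists; each fold serves as the test set once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validated_accuracy(rows, labels, fit, predict, k=10):
    """Fraction of rows predicted correctly when each is held out exactly once."""
    correct = 0
    for train, test in k_fold_indices(len(rows), k):
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        correct += sum(predict(model, rows[i]) == labels[i] for i in test)
    return correct / len(rows)


# Toy check with a majority-class "classifier" (hypothetical data)
rows = list(range(20))
labels = ["no"] * 15 + ["yes"] * 5
fit = lambda r, l: max(set(l), key=l.count)  # remember the most common label
predict = lambda model, row: model           # always predict that label
print(cross_validated_accuracy(rows, labels, fit, predict))
```

A classifier that only ever predicts the majority class already scores well on imbalanced data, which is one reason accuracy alone should be read with care.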
| Number of features selected | 5 | 7 | 6 |
| Accuracy with selected features (%) | 86.7 | 86 | 85.7 |
| Accuracy with all features (%) | 86 | 87 | 84 |
Table 1: Classification accuracy using wrapper feature selection approach for imbalanced dataset.
The features selected by the classifiers are shown in Table 2.
| Number of features selected | 6 | 9 | 9 |
| Accuracy with selected features (%) | 93.8 | 95 | 90.7 |
| Accuracy with all features (%) | 93.5 | 95 | 89 |
Table 2: Classification accuracy using wrapper feature selection approach for balanced dataset.
In summary, we have applied different DM techniques to identify risk factors for predicting diabetes in pregnancy using 10 important attributes. In the complete data set 24% of the cases are GDM, but the k-means algorithm can group only 10%, as it considers limited attributes.
This study concludes that the classifiers achieve an accuracy of about 86% on the imbalanced data set and 93% on the balanced data. The classification accuracy increased by 1 to 2% after selecting the best attributes with the wrapper approach to feature subset selection. Except for the attributes Age and Obesity, no attributes are selected in common by the applied algorithms as major risk factors for GDM in this data set. Hence all 10 risk factors will be helpful in the prediction of GDM. Further, an application can be developed to help pregnant women in the primary diagnosis of gestational diabetes mellitus.
In future we plan to consider more, and different types of, risk factors for gestational diabetes prediction and to develop a model for a large data set.
- International Diabetes Federation (2009) International Diabetes federation, diabetes atlas. (4th edn), IDF, Brussels, Belgium.
- Marcano-Cedeño A, Andina D (2012) Data mining for the diagnosis of type 2 diabetes. World Automation Congress.
- Patil BM, Joshi RC, Toshniwal D (2010) Hybrid prediction model for type-2 diabetic patients. Expert Systems with Applications 12: 8102-8108.
- Barakat NH, Bradley AP, Barakat MNH (2010) Intelligible support vector machines for diagnosis of Diabetes Mellitus. IEEE Transactions on Information Technology in Biomedicine 14: 1089-7771.
- Afridi MJ, Farooq M (2011) OG-Miner: an intelligent health tool for achieving millennium development goals (MDGs) in m-health environments. 44th Hawaii International Conference on System Sciences.
- Gorthi A, Firtion C, Vepa J (2009) Automated risk assessment tool for pregnancy care. International Conference of the IEEE Engineering in Medicine and Biology Society.
- Moreira MWL, Rodrigues JJPC, Oliveira AMB, Saleem K, Neto AV (2016) An inference mechanism using bayes-based classifiers in pregnancy care. IEEE 18th International Conference on e-Health Networking, Applications and Services, Munich, Germany.
- Qiu H, Hai-Yan Y, Wang LY, Yao Q, Wu SN, et al. (2017) Electronic health record driven prediction for gestational diabetes mellitus in early pregnancy. Scientific Reports 7: 16417.
- Mohammadbeigi A, Farhadifar F, Soufizadeh N, Mohammadsalehi N, Rezaiee M, et al. (2013) Fetal macrosomia: risk factors, maternal, and perinatal outcome. Annals of Medical & Health Sciences Research 4: 546-550.
- Laza R, Pavan R, Reboiro-Jato M, Fdez-Riverola F (2011) Evaluating the effect of unbalanced data in biomedical document classification. Journal of Integrative Bioinformatics 3: 177.
- Yen SJ, Lee YS (2009) Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications 3: 5718-5727.