Predicting HIV Status among Women in South Africa using Machine Learning: Comparing Decision Tree Model and Logistic Regression
Researcher: Oluwabukola Oluwapelumi Oladokun, University of the Witwatersrand, Johannesburg
Supervisors: Prof. Rod Alence and Dr Sasha Frade, University of the Witwatersrand, Johannesburg
HIV has become a serious global public health problem. In 2017, 940,000 people died from HIV and approximately 1.8 million new infections were reported worldwide. Almost half of all new HIV infections occur in women aged 15-24 years in sub-Saharan Africa. In addition, South Africa has the largest HIV epidemic in the world, with an estimated 7.2 million people living with the virus. To manage this epidemic effectively, a better understanding of the sociodemographic factors that influence the risk of seroconversion is needed. Such understanding can be gained by modelling the HIV epidemic, especially among at-risk populations. Specifically, this study aims to predict an individual's HIV status from readily available demographic data using a decision tree model, and to compare the results with those of traditional logistic regression.
Individual recode data for women in South Africa were obtained from the 2016 Demographic and Health Survey (DHS). The study sample comprised 7808 women aged 15-49 years living in South Africa. The data were split into training (75%) and testing (25%) datasets. The logistic regression (LG) model had the highest accuracy on both the training (62.90%) and testing (68.04%) datasets. Accuracy for the decision tree (DT) model was 63.93%. The areas under the ROC curve (AUC) were 0.652 for the DT model and 0.682 for the LG model. This means that a randomly chosen HIV-positive woman would be assigned a higher predicted risk than a randomly chosen HIV-negative woman about 65.2% of the time by the DT model and about 68.2% of the time by the LG model.
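As an illustration of how accuracy and AUC measure different things, the sketch below computes both from predicted risk scores in plain Python. It is not the study's code: the function names `auc_pairwise` and `accuracy`, the 0.5 classification threshold, and the toy labels and scores are all assumptions made for this example. The AUC is computed directly as the share of (HIV-positive, HIV-negative) pairs in which the positive case receives the higher score, with ties counted as half.

```python
from itertools import product


def auc_pairwise(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


def accuracy(y_true, scores, threshold=0.5):
    """Share of cases classified correctly at the given score threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)


# Hypothetical toy data (not from the DHS sample): 1 = HIV positive, 0 = HIV negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.70, 0.40, 0.35, 0.45, 0.30, 0.20, 0.38, 0.10]

toy_auc = auc_pairwise(y_true, scores)
toy_acc = accuracy(y_true, scores)
print(f"AUC = {toy_auc:.3f}, accuracy = {toy_acc:.3f}")
# prints "AUC = 0.800, accuracy = 0.750"
```

In this toy example 12 of the 15 positive-negative pairs are ranked correctly (AUC 0.8), while only 6 of 8 cases are classified correctly at the assumed 0.5 threshold (accuracy 0.75). The two numbers need not agree, because accuracy depends on the threshold and on class balance, whereas AUC depends only on how the cases are ranked.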
Neither model achieved high accuracy, and logistic regression unexpectedly had the higher accuracy. The accuracy of the decision tree model may have been reduced by overfitting. In addition, demographic data alone might not be enough to predict HIV status accurately, especially at the medical classification level, and more variables may be needed to build the model. It is also recommended that different input features be tested, and that automatic relevance detection be used to assess which inputs contribute to the output of the model.