Research Article - (2021)
Objective: To improve the timeliness for the early COVID-19 infection diagnosis, it is essential to develop a decision- making tool to assist early diagnosis of COVID-19 patients in fever clinics.
Materials and methods: This paper aims at extracting risk factors from clinical data of 912 early COVID-19 infected patients and utilizing four types of traditional machine learning approaches including Logistic Regression (LR), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF) and a deep learning-based method for diagnosis of early COVID-19.
Results: The results show that the LR predictive model presents a higher specificity rate of 0.95, an Area Under the receiver operating Curve (AUC) of 0.971 and an improved sensitivity rate of 0.82, which makes it optimal for the screening of early COVID-19 infection. We also perform the verification for generality of the best model (LR predictive model) among Zhejiang population, and analyse the contribution of the factors to the predictive models.
Discussions: Under the background of COVID-19 pandemic, the early diagnosis of COVID-19 still face severe challenges, a decision-making tool assisting early diagnosis of COVID-19 patients is vital for fever clinics.
Conclusions: Our manuscript describes and highlights the ability of machine learning methods for improving the accuracy and timeliness of early COVID-19 infection diagnosis. The higher AUC of our LR-base predictive model makes it a more conducive method for assisting COVID-19 diagnosis. The optimal model has been encapsulated as a mobile application (APP) and implemented in some hospitals in Zhejiang Province.
Infectious diseases; Machine learning; Artificial intelligence; COVID-19; SARS-CoV-2
The coronavirus disease 2019 (COVID-19) cases were first reported in Outbreak area in December 2019. Soon after, Severe Acute Respiratory Syndrome Coronavirus (SARS-CoV-2), this new emerging virus has spread rapidly in over 200 countries and areas [1,2]. On March 11, 2020, the World Health Organization (WHO) declared that COVID-19 outbreaks a global pandemic. COVID-19 is a novel pathogen with characteristics of fast transmission and strong infectivity [3,4]. The early symptoms of COVID-19 are similar to other respiratory infectious diseases, which makes it difficult for early differential diagnosis [5-7]. So far, accurate RT-PCR test has been regarded as the gold standard for the diagnosis of COVID-19. However, RT-PCR tests are complicated in operation and it usually takes 5-6 hours or even longer to get the results . Additionally, due to the low virus loads in early infected COVID-19 patients, RT-PCR tests show false negative results in a number of cases [9,10]. It has greatly hindered the prevention and control of the global pandemic. Thus, it is dramatically essential to establish a rapid diagnostic model to screen high-risk patients with COVID-19 infection.
In recent years, machine learning solutions are widely used to predict diagnosis and individual risk factors for diseases, and support clinical decisions . Some machine learning methods have achieved remarkable results in medical filed, and a method with superior classification precision would provide better robustness for predicting unknown data [11-16]. Chhatwal et al. utilized LR to create a breast cancer risk estimation model based on the descriptors of National Mammography Database (NMD) format that can aid in decision-making for early detection of breast cancer . Pinto et al. used RF method to identify the patients who suffer from Alzheimer’s disease based on ADNI datasets, which shows better accuracy and can be used as clinical assistant diagnosis . Recently, the combination of machine learning approaches and epidemic infectious diseases has been emerged extensively. Hong et al. used SVM with double class analysis for MERS-COV epidemiological study and discovered the relevance between two sequences of MERS-COV . Jia et al. constructed predictive model with higher accuracy for antigen mutation of influenza virus subtype H1, which used CART DT algorithm combined with amino acid variation sites of viral proteins . The combination of machine learning and medical data has become the main development direction to meet the needs of early diagnosis and prognosis assessment.
In this study, we attempted to identify the best appropriative algorithm for early COVID-19 detection based on clinical big data. We analysed clinical data of 912 patients who were confirmed as COVID-19 or other respiratory infectious diseases from 18 hospitals in Zhejiang Province, focusing on extraction of risk factors and construction of five types of classification models: SVM, LR, DT, RF as well as Deep Neural Network (DNN). Four epidemiological factors and six clinical manifestations were selected by feature engineering approach as diagnostic models input, and they were much fewer than candidate features of medical records. Essentially, the diagnostic model constructed with fewer meaningful clinical factors is practical for outpatient service. Clinical symptoms, laboratory tests and imaging findings play significant roles in identification of COVID-19 infection . To evaluate the contributions of clinical symptoms, laboratory tests and imaging information for diagnostic models, we established predictive models based on the data excluding epidemiological information. It was found that the diagnostic models established with clinical symptoms, laboratory tests and imaging information only presented poorer performance. In other words, epidemiological information tremendously affects the performance of COVID-19 predictive models. Briefly, making full use of clinical manifestations and epidemiological characteristics integrated is essential for constructing the early diagnosis model of COVID-19.
The COVID-19 dataset contains clinical information of 914 suspected patients who were from 18 hospitals in Zhejiang between Jan 17 and Feb 19, 2020. Suspected cases were diagnosed according to the 5th edition of Chinese Clinical Guidance for COVID-19 Pneumonia Diagnosis and Treatment. COVID-19 should be suspected if subjects conform to any one of the criteria in the epidemiological history and any two of the standards in clinical presentations. If there is no epidemiological history, suspected cases should meet three of the criteria in clinical presentations .
Considering about the completeness of the clinical information, we firstly screened out the patients with complete clinical records, which results in total number of 912 eligible patients. We then split processed patients into training (80%) and validation (20%) partitions randomly to train our models. Subsequently, we collected 115 clinical dataset from other hospitals in Zhejiang as test partition to verify the universality of implemented models in Zhejiang population.
To obtain the datasets for early stage COVID-19 rapid diagnostic models, all selected suspected patients were categorized into positive or negative cases. The patients who met any one of the following criteria’s were considered to be positive cases.
Positive RT-PCR test results in throat swab, sputum, blood samples. The genetic sequences detected in the samples are highly homologous to the known SARS-CoV-2.
Positive cases are considered to be the patients confirmed as COVID-19 infection by RT-PCR. Conversely, negative cases are patients excluded as COVID-19 infection by RT-PCR for at least two times. The 912 eligible participants enrolled in this study, include 361 COVID-19 infected patients (positive cases) and 551 COVID-19 non-infected patients (negative cases). Each patient’s clinical record contains 31 factors including gender, age, coexisting diseases, epidemiological information, laboratory tests, clinical symptoms and imaging findings. Details of these 31 factors and their distribution characteristics on training and validation dataset are shown in Table 1.
|Factors||Training data set||Validation data set|
|Age(year)||47.39 ± 14.38||38.53 ± 18.14||46.18 ± 14.73||35.33 ± 16.85|
|Travel or residence history over the past 14 days|
|Neighboring areas of outbreak area||7||53||2||15|
|Other areas with persistent local transmission or community with definite cases||136||212||26||54|
|Exposure to patients with fever or respiratory symptoms over the past 14 days who had a travel or residence history|
|Neighboring areas of outbreak area||7||18||1||6|
|Other areas with persistent local transmission or community with definite cases||79||22||20||10|
|Relationship with a cluster outbreak||104||13||29||3|
|Exposure to wildlife||0||1||1||0|
|Contact with patients of influenza A||1||12||2||3|
|Contact with patients of influenza B||2||11||3||4|
|Body temperature||37.54 ± 0.852||37.48 ± 0.848||37.39 ± 0.69||37.46 ± 0.73|
|Factors||Training data set||Validation data set|
|Diarrhea or bellyache||31||17||6||4|
|Dizziness or headache||24||38||4||9|
|Nausea or vomiting||10||5||3||2|
|White blood cell count (× 109)||5.47 ± 2.63||7.36 ± 3.04||5.10 ± 1.88||7.08 ± 2.89|
|Lymphocyte count (× 109)||1.22 ± 0.88||1.68 ± 0.86||1.16 ± 0.54||1.81 ± 0.78|
|Neutrophil cell count (× 109)||4.01 ± 4.67||5.12 ± 3.47||3.47 ± 1.68||4.58 ± 2.40|
|C-reactive protein level (mg/L )||21.77 ± 27.79||18.72 ± 35.26||17.33 ± 20.97||17.15 ± 34.93|
|Imaging changes of Chest X-ray or CT|
|Unilateral local patchy shadowing||94||104||13||28|
|Bilateral multiple ground glass opacity||94||77||24||20|
|Bilateral with pulmonary consolidation||84||24||25||6|
|Other imaging alterations||11||37||0||13|
Table 1: The characteristics of positive and negative samples on the training set and validation set.
Feature selection is used to select effective factors from numerous features to reduce the feature space dimension and classification error rate. We leveraged embedded feature engineering approach based on logistic regression algorithm with L2 penalty to select COVID-19 risk factors from the 31 factors mentioned above. Finally, 10 factors were chosen for the early COVID-19 prediction task by setting the threshold as 0.85. The final selected factors include four epidemiological features (relationship with a cluster outbreak, travel or residence history over the past 14 days in Outbreak area, exposure to patients with fever or respiratory symptoms over the past 14 days who had a travel or residence history in outbreak area, exposure to patients with fever or respiratory symptoms over the past 14 days who had a travel or residence history in other areas with persistent local transmission, or community with definite cases) and six clinical manifestations(muscle soreness, dyspnea, fatigue, lymphocyte count (× 109/L), white blood cell count (× 109/L) and imaging changes of Chest X-ray or CT). In practice, the diagnostic model constructed with fewer incoherent factors is beneficial and practical for outpatient service. Details of selected risk factors and their related coefficients are shown in Table 2. The importance of the factors relies on absolute value of the coefficients. Table 2 suggests that imaging changes of Chest X-ray or CT is more vital than others. Table 2 also shows that the tolerance of these 10 factors is more than 0.1 and variance inflation factor of them is less than 10, which indicated that there was no collinearity among selected factors.
|Imaging changes of Chest X-ray or CT||3.15||0.42||2.38|
|Relationship with a cluster outbreak||2.44||0.42||2.38|
|Travel or residence history over the past 14 days in outbreak area||1.58||0.75||1.33|
|Exposure to patients with fever or respiratory symptoms over the past 14 days who had a travel or residence history in other areas with persistent local transmission, or community with definite cases||1.07||0.75||1.34|
|Exposure to patients with fever or respiratory symptoms over the past 14 days who had a travel or residence history in outbreak area||0.87||0.75||1.34|
|Lymphocyte count(× 109/L)||-1.36||0.25||4.02|
|White blood cell count(× 109/L)||-3.05||0.23||4.41|
Table 2: The coefficients, tolerance and variance inflation factor of factors selected by feature engineering.
In this study, we conduct four conventional types of machine learning algorithms and a deep learning solution to establish the early stage COVID-19 rapid diagnostic models. We implement LR model with L2 regularization penalty, and train other three models including SVM with kernel of rbf, ID3 DT and RF. The RF model is constructed by 50 decision trees with information gain algorithm. This study used deep learning-base method, namely DNN, which is a four-layer network with the hidden dimension of 64, 32, 16 and 20 respectively. A Softmax layer is added at the top of the network to output the probability of a patient infected with COVID-19.
We evaluate the performance of the early stage COVID-19 diagnostic models at the 20% validation using familiar assessment strategies, which include measuring accuracy and the AUC generated by plotting sensitivity vs. 1-specificity. Classification accuracy is obtained via an optimum cut-off point. AUC measures the overall performance of the recall concerning different false positive rate, which exhibits robustness for performance assessment of predictive models . Models with higher AUC will show more powerful identified and diagnostic capacities to assist health care workers. High-sensitivity (or recall rate of positive cases) and high-specificity (or recall rate of negative cases) play a vital role in screening the infectious patients . Essentially, a model with high sensitivity can correctly identify patients infected with COVID-19 for timely treatment, while a model with high specificity can excellently screen non-infective patients, thereby effectively avoid cross infection.
The experiments we conduct to evaluate the performance of the five types of predictive models are illustrated in this section. We evaluate the predictive models on validation set and compare the results of validation to obtain the best solution for identifying early COVID-19 infection. Ultimately, we test the best model based on test dataset to obtain general diagnostic model for Zhejiang population.
We implement multiple model structures as our constructed models and deploy different combinations of feature inputs (Table 3). Table 4 summarizes the performance of conventional solutions and deep learning-based methods. Table 3 part (a) reveals the performances of predictive models constructed based on the raw dataset including 31 factors (Table 1), and part (b) exhibits the performances of models established with ten factors selected by using feature selection approach. Feature selection is intended for data dimensionality reduction . In practice, the diagnostic models constructed with less meaningful clinical factors are more practical for outpatient services. The results in Table 3 demonstrate that the predictive models of part (b) perform slightly better than that of part (a) in terms of AUC. The sensitivity, specificity, as well as accuracy of these predictive models of part (b) are relatively approximate to those of part (a). Thus, feature selection partly improves the performances of COVID-19 diagnostic models, and the ROC curve of some selected high performing machine learning models are shown in Figure 1. Table 4 part (b) shows that LR combined with feature selection outperforms other four methods by reaching an AUC of 0.971, high-specificity of 0.95 and accuracy of 0.90 respectively. These results suggest that the combination of LR and feature selection approach presents the best AUC and specificity among five categories of classification methods. Higher specificity of model will facilitate the elimination of infected diseases such as COVID-19 infection. In addition, according to the clinical experience of experts, the AUC (0.86) calculated by the diagnostic scale is compared, and the LR diagnostic model shows better performance. Therefore, LR can be selected as the optimum classification model for the early-stage COVID-19 rapid screening.
|(a) Models constructed on raw dataset|
|(b) Models constructed combination with feature selection|
|(c) Models constructed on the dataset excluding epidemiological information|
|(d) Models combination with feature selection on the dataset excluding epidemiological information|
Note: FS: Feature Selections exclude epidemiology: models established on the data excluding epidemiology information.
Table 3: The performance comparison of various machine learning models on validation set with different sets of features.
Note: LR: Logistic Regression; FS: Feature Selection
Table 4: The sensitivity, specificity, accuracy and AUC for logistic regression on test dataset.
Figure 1: ROC curve of chosen high performing machine learning models. FS: Feature Selection.
Under the background of COVID-19 pandemic, clinical symptoms, laboratory tests and imaging findings are vital clinical criterion for the diagnosis of COVID-19 infection. In order to verify the contribution of above-mentioned three indicators to the COVID-19 diagnostic models, we establish predictive models based on the dataset excluding epidemiological information. The performances of various predictive models are shown in Table 3 (c) and (d). Results in part (a) and part (b) illustrates that epidemiological information is beneficial for early COVID-19 rapid diagnostic models construction. In the absence of epidemiological information, the sensitivity, specificity and accuracy of the predictive models (part (c)) exhibits sharp reduction compared with the models shown in part (a). In addition, the five types of machine learning approaches combining with feature selection are constructed based on the dataset excluding epidemiological information, as is shown in Table 4 part (d). Compared with part (c), AUC of part (d) is slightly improved. While due to the absence of epidemiological information, part (c) and part (d) show poorer performances compared with part (a) and part (b). In brief, it indicates laterally that epidemiological information is essential for constructing the early COVID-19 diagnostic models in Zhejiang population.
The above results clearly illustrate that the combination of traditional logistic regression method and feature selection has a great probability to predict early COVID-19 infection. And construction of highly precious diagnostic model relies on integrating and taking the most advantages of clinical symptoms, laboratory tests, imaging findings as well as epidemiological information.
Moreover, LR algorithm is proved as the most ideal method among the five classification solutions for the early COVID-19 rapid screening. The experiment performed in this study uses test dataset for verifying generality of the optimum diagnosis model. As is shown in Table 4, the sensitivity, specificity, accuracy and AUC of the LR+FS model on test dataset are 0.87, 0.95, 0.91 and 0.95, respectively. These results show that the predictive model constructed by combination of logistic regression and feature selection as early COVID-19 rapid diagnostic tool is universally applicable in Zhejiang Province.
Under the background of COVID-19 pandemic, the early prevention and control of COVID-19 still face severe challenges. According to the reports, the most common early symptoms of COVID-19 are fever, cough, fatigue, and myalgia, followed by diarrhoea, nausea, headache and sore throat [25,26]. As the disease goes on, some infected patients, especially those with low immune functions, gradually become dyspnea [21,27]. Additionally, complications such as acute arrhythmia and shock, Acute Respiratory Distress Syndrome (ARDS), are probably related to a poor prognosis [28,29]. Thus, early prediction of suspected patients and early aggressive treatment of confirmed patients is the key to reduce cross infection and mortality. CT scan has become the main auxiliary tool for screening of COVID-19 cases. However, CT scan cannot be used to identify specific viral infections . Moreover, some COVID-19 patients can also present with normal pulmonary imaging in early stage . Clinical symptoms and laboratory tests are sometimes non-specific for early COVID-19 infection [21,32]. At present, RT-PCR is still the accepted detection method for the diagnosis of COVID-19 infection. While the time consuming and instability of test results are still the most struggling problems . Therefore, to improve the timeliness for the early COVID-19 infection diagnosis, it is essential to develop a decision-making tool to assist early diagnosis of COVID-19 patients in fever clinics.
Current studies which analyse symptoms and laboratory examination results of COVID-19 patients mainly focus on predicting mortality risk and progression of the disease . Only few studies aim at COVID-19 early diagnosis. At present, Meng et al. selected nine representative variables (including age, activated partial thromboplastin time, red blood cell distribution width-sd, uric acid, triglyceride, serum potassium, albumin/globulin, 3-hydroxybutyrate, serum calcium) and constructed an optimized diagnostic model through Lasso regression screening and Multivariate logistic regression based on 431 samples . The AUC of their early COVID-19 screening model in the testing set and independent validation cohort were 0.890 and 0.872. Feng et al. used logistic regression with Lasso regression for features selection and screening model development based on clinical data of 132 recruited patients . The final chosen features include 1 demographic variable (age); 4 variables of vital signs (e.g., Temperature (TEM), Heart rate (HR), etc.); 5 variables of blood routine values (e.g., Platelet count (PLT), Monocyte ratio (MONO%), Eosinophil count (EO#), etc.); 7 variables of clinical signs and symptoms (e.g., Fever, Fever classification, Shiver, etc.); and 1 infection-related biomarker (Interleukin-6 (IL-6)). The performance of their model constructed based on the final selected features in held-out testing set and validation cohort resulted in AUCs of 0.841 and 0.938, and specificity of 0.727 and 0.778. In our study, we selected four epidemiological features and six clinical manifestations from the raw dataset including 31 factors, further developed multiple models with various machine learning algorithms and screened an optimum early COVID-19 diagnostic model with an AUC of 0.971. We tested the best model based on LR on the external test data set, and its AUC and specificity were 0.950 and 0.95, respectively. Compared to previous studies, we screened out fewer risk factors based on a larger clinical data set, and the early COVID-19 diagnostic model we established has better performance and is more suitable for clinical assisted diagnosis. Moreover, our study is based on a large clinical data set, including a total of 912 patients who were confirmed to have early COVID-19 infection or other respiratory infectious diseases, which may contribute to mining more potential clinical information and improve generalization ability of diagnostic models. Considering the indisputable role epidemiological features play in the diagnosis of infectious diseases in clinic Zhang et al. we specifically studied the role of epidemiological information in diagnostic models. We found that the lack of epidemiological information greatly affected the accuracy, specificity and sensitivity of the model. It means that epidemiological information is vital for building an accurate COVID-19 diagnostic tool, and makes the utility and reliability of the previously reported diagnostic models questioned .
Nevertheless, this study still has several limitations. First of all, the recruited participants are limited to Zhejiang Province, which causes certain regional restrictions in the application of the predictive models. Further extremely concerning about the epidemiological characteristics and nationwide studies are needed to access the generality of the suggested model. Secondly, there is a lack of information on the progression and prognosis of COVID-19 as well as asymptomatic infection cases. Finally, more information of infections should be recruited to improve the accurate of screening model.
In our study, ten representative factors with significant identification value were selected and constructed diagnostic models. The model established an algorithm based on logistic regression can be used as a simple, fast, and effective tool for diagnosing the early COVID-19 infection with significant clinical value.
This work was supported by the grant from the emergency project of key research and development plan in National Science and Technology Major Project (2017ZX10204401) and Zhejiang Province (2020C03123-1; 2020C03123-2).
The authors declare that they have no competing interests.
Bin Ju conceived and designed the experiments, analysed the data, authored or reviewed drafts of the paper, and approved the final draft.
Nannan Sun performed the experiments, analysed the data, prepared figures and or tables, and approved the final draft.
Ya Yang authored or reviewed drafts of the paper, and approved the final draft.
Lingling Tang revised drafts of the manuscript, and approved the final draft.
Zhen Li revised drafts of the manuscript, and approved the final draft.
Hainv Gao revised drafts of the manuscript, and approved the final draft.
Yining Dai provided samples dates, and approved the final draft.
Nannan Sun and Ya Yang contributed equally to this work.
This study was approved by the Ethics Committee of Zhejiang Provincial People’s Hospital (2020KY006). Exempt informed consent was approved because the subjects would not be exposed to any risk in this observational study, and the information of subjects was anonymized at collection and prior to analysis.
Citation: Sun N, Yang Y, Tang L, Li Z, Dai Y, Xu W, et al. (2021) A Prediction Model Based on Machine Learning for Diagnosing the Early COVID-19 patients. J Antivir Antiretrovir. S18:002.
Received: 01-Apr-2021 Accepted: 15-Apr-2021 Published: 22-Apr-2021 , DOI: 10.35248/1948-5964.21.s18.002
Copyright: © 2021 Sun N, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.