Reporting of coronavirus disease 2019 prognostic models: the transparent reporting of a multivariable prediction model for individual prognosis or diagnosis statement
Introduction
The novel coronavirus disease 2019 (COVID-19) poses an urgent threat to global health. As of August 28, 2020, 24,299,923 confirmed cases of COVID-19, including 827,730 deaths, had been reported to the World Health Organization (WHO) (1). The huge number of infected cases has placed tremendous pressure on medical facilities. In addition to the high risk of infection among medical staff, effectively allocating resources, such as intensive care unit (ICU) beds and other medical equipment, is also a challenge. According to existing reports, many infected patients show mild flu-like symptoms and recover quickly (2). However, some rapidly develop acute respiratory distress syndrome and multiple organ failure, and die (3-6). A current concern is therefore to determine a patient's prognosis at an early stage in order to reduce mortality. To provide patients with the most appropriate level of treatment and care, many studies have combined multiple predictors into models for predicting patient prognosis in clinical practice, but the reporting quality of these studies has not been evaluated (7-9). Complete reporting facilitates study replication and the assessment of a model's applicability to other individuals; high-quality reporting of prediction models is therefore essential. In 2015, multiple journals simultaneously published a guideline for improving the reporting of prediction model studies, namely the transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) statement (10). TRIPOD is a checklist of 22 items covering the title and abstract (items 1 and 2), background and objectives (item 3), methods (items 4 through 12), results (items 13 through 17), discussion (items 18 through 20), and other information (items 21 and 22). The TRIPOD statement covers the development and external validation of prediction models, as well as studies reporting only external validation (with or without model updating).
A previous systematic review showed that the reporting quality of prediction models across various clinical fields was unsatisfactory (11). Wynants et al. also conducted a systematic review of COVID-19 prediction models (12). However, their results were qualitative, and no unified indicator was provided for measuring and comparing reporting completeness across studies. Our study provides a new evaluation method for model reporting and summarizes the omissions common in current reports, so that future research can avoid these problems and improve the quality of model reporting.
Our research aimed to use the TRIPOD tool to systematically review and critically evaluate published models for predicting the prognosis or course of COVID-19 in patients. The results could serve as a basis for further improving the quality of COVID-19-related prognostic model reporting. We present the following article in accordance with the PRISMA reporting checklist (available at http://dx.doi.org/10.21037/atm-20-6933).
Methods
Search strategy
A search was conducted in the PubMed and Web of Science databases up to August 11, 2020, with no language restrictions, using terms related to COVID-19 (COVID-19, SARS-COV-2, novel corona, 2019-ncov) and prognostic models (prognostic, prediction model, regression). We also screened reviews in this field and the reference lists of the original articles to identify any missed studies. Only peer-reviewed studies on COVID-19 prognostic models were included; preprints were not considered.
Inclusion and exclusion criteria
We included articles on multivariable models or risk scores for predicting any prognostic outcome of COVID-19. The exclusion criteria were as follows: (I) non-human research; (II) studies on models predicting disease transmission; (III) diagnostic models of COVID-19; (IV) studies of predictive factors without an established prognostic model; and (V) studies of prediction models using non-regression techniques (e.g., machine learning, neural networks), since TRIPOD does not support the evaluation of such methods (13). Studies were screened against these criteria by two investigators (LQY and QW), and disagreements were resolved through discussion.
Data extraction
Two investigators (LQY and TTC) independently reviewed the titles and abstracts of all retrieved articles. Any discrepancies were resolved through discussion and, if necessary, by a consultant (HJ). The investigators used the standard TRIPOD data extraction form (www.tripod-statement.org) to determine the reporting completeness of the articles. Additionally, the publications were grouped into four types of prediction model study: development, external validation of existing models, incremental value, and development and validation of the same model. A publication could be classified into more than one type.
For development models, if different models were developed using the same data in one study, we extracted information from the primary model. For external validations of different existing models, information was extracted separately. Studies that reported both development and external validation of different models were classified as both development and external validation. The basic information of each study (study region, study design, sample size, and predicted outcomes) was extracted, along with information about the predictors addressed in the articles. Predictors are the variables included in the model at the time of construction that are statistically related to the predicted outcome. Previous researchers have encouraged the inclusion of age, sex, C-reactive protein, lactate dehydrogenase, lymphocyte count, and potentially features derived from CT scoring in COVID-19 prognostic models (12). We also extracted the prediction performance, including discrimination and calibration with their standard error (SE) or 95% confidence interval (CI), if provided. Discrimination was usually measured by the area under the receiver operating characteristic curve (AUROC) or the c-index, while calibration was usually quantified by the calibration intercept and calibration slope. The closer the AUROC or c-index and the calibration slope are to 1, the better the performance of the model. Performance data were extracted in the following order of preference: external validation, internal validation, and apparent (original) performance if neither of the above was reported.
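To make these measures concrete, the following minimal sketch (our illustration, not code from any included study) computes the AUROC and a logistic recalibration slope from hypothetical observed outcomes y and predicted probabilities p; note that the calibration-in-the-large intercept is conventionally estimated with the slope fixed at 1, a refinement this sketch omits.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

def discrimination_and_calibration(y, p):
    """y: observed 0/1 outcomes; p: model-predicted event probabilities (0 < p < 1)."""
    auroc = roc_auc_score(y, p)          # discrimination; equals the c-index for a binary outcome
    lp = np.log(p / (1.0 - p))           # linear predictor: logit of the predicted risk
    # Logistic recalibration: regress the observed outcome on the linear predictor.
    fit = sm.Logit(y, sm.add_constant(lp)).fit(disp=0)
    intercept, slope = fit.params        # calibration slope has an ideal value of 1
    return auroc, intercept, slope
```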
Analysis
To evaluate the completeness of each included model, the number of TRIPOD items completely reported in the article was divided by the total number of applicable TRIPOD items. Furthermore, to assess the overall reporting completeness of each item in the TRIPOD statement, we divided the number of models completely reporting a specific TRIPOD item by the total number of models to which that item applied. Five TRIPOD items contain "if done" or "if applicable" statements (items 5c, 10e, 11, 14b, and 17); when such an item did not apply to a study, it was excluded from both the numerator and the denominator.
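As an illustration of this calculation (a sketch over hypothetical extraction data, not our actual analysis code), completeness can be computed per model and per item while dropping inapplicable conditional items from both numerator and denominator:

```python
# TRIPOD items carrying "if done"/"if applicable" statements.
CONDITIONAL_ITEMS = {"5c", "10e", "11", "14b", "17"}

def per_model_completeness(reported, not_applicable):
    """Share of applicable TRIPOD items that one model reports completely.

    reported: dict mapping item -> bool (completely reported or not);
    not_applicable: set of conditional items that do not apply to this study.
    """
    applicable = [i for i in reported
                  if not (i in CONDITIONAL_ITEMS and i in not_applicable)]
    return sum(reported[i] for i in applicable) / len(applicable)

def per_item_completeness(models, item):
    """Share of applicable models that completely report one TRIPOD item."""
    eligible = [m for m in models
                if not (item in CONDITIONAL_ITEMS and item in m["not_applicable"])]
    return sum(m["reported"][item] for m in eligible) / len(eligible)
```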
For external validations, a meta-analysis was performed to pool the reported prediction performance with its 95% CI. The I2 statistic was used to assess heterogeneity among studies; when I2 was >50% (moderate heterogeneity), a random-effects model was used for the pooling.
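The article does not name the meta-analysis software used; a minimal sketch consistent with the description, pooling reported discriminations under DerSimonian-Laird random effects and computing the I2 statistic, might look as follows (the estimates and standard errors are hypothetical inputs):

```python
import numpy as np

def pool_random_effects(estimates, standard_errors):
    """DerSimonian-Laird random-effects pooling with a 95% CI and the I^2 statistic."""
    est = np.asarray(estimates, dtype=float)
    v = np.asarray(standard_errors, dtype=float) ** 2   # within-study variances
    w = 1.0 / v                                         # inverse-variance (fixed-effect) weights
    fixed = np.sum(w * est) / np.sum(w)
    q = np.sum(w * (est - fixed) ** 2)                  # Cochran's Q
    df = len(est) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)                       # between-study variance
    i2 = 100.0 * max(0.0, (q - df) / q) if q > 0 else 0.0
    w_re = 1.0 / (v + tau2)                             # random-effects weights
    pooled = np.sum(w_re * est) / np.sum(w_re)
    se = np.sqrt(1.0 / np.sum(w_re))
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se), i2
```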
Results
After screening, a total of 52 publications were included in our study (Figure 1). From these 52 publications, we scored 67 models using the TRIPOD tool, as follows: 37 (55%) development, 14 (21%) external validation of existing models, 3 (5%) incremental value, and 13 (19%) development and validation of the same model.
Primary information
Thirty-six studies used data from COVID-19 patients in China, four from Italy, and two from the United States; Britain, France, Norway, Turkey, Spain, and Mexico contributed one study each. Four studies did not specify the country or region of the data. Regarding study design, most (88%) were retrospective, while two were prospective. One study used retrospective data for model development but recruited patients prospectively for the validation cohort. One study identified the race of the participants as Caucasian (8). The follow-up period was mentioned in 23 studies. All studies reported their sample sizes (median, 220.5 [interquartile range (IQR): 109.25, 459.25]). Detailed information is shown in Table 1 and Appendix 1.
Prognostic predictors
Six studies used computed tomography (CT) or chest X-ray findings to establish scoring rules in their final models. The median number of prognostic predictors was five (IQR: 3, 6.25). The most frequently used predictors (>10 times) were age, disease history, lymphocyte count, history of hypertension and cardiovascular disease, C-reactive protein, lactate dehydrogenase, white blood cell count, and platelet count, reported 26 (50%), 17 (33%), 14 (27%), 12 (23%), 12 (23%), 11 (21%), 10 (19%), and 10 (19%) times, respectively. Other commonly used predictors (>5 times) were lymphocyte ratio, procalcitonin, aspartate aminotransferase, and dyspnea, reported 8 (15%), 5 (10%), 5 (10%), and 5 (10%) times, respectively (Appendix 2).
Prediction outcomes and performances
The prediction outcomes in 23, 17, 8, 2, and 2 studies were death, development of severe or critical disease, ICU admission/mechanical ventilation/death, survival time, and length of hospital stay, respectively (Table 1). For death, the reported discrimination ranged from 0.584 to 0.994; one study instead reported the weighted kappa (kw) with its 95% CI (14). The calibration of the mortality prediction model by Luo et al. showed good agreement between the predictions in the training cohort and the actual observations (15), and the models in two other studies also fitted well (16,17). When the outcome was progression to severe or critical disease, the discrimination ranged from 0.636 to 0.971. For ICU admission/mechanical ventilation/death, the discrimination varied between 0.712 and 0.900. For length of hospital stay, the reported discrimination ranged from 0.361 to 0.848, and for survival time, from 0.672 to 0.892.
Reporting completeness per model in TRIPOD
Figure 2 and the supplementary file (https://cdn.amegroups.cn/static/application/df0da0ff07a31a06aa1b1e1cf3b15d66/atm-20-6933-1.pdf) present the TRIPOD completeness of each model. Overall, reporting completeness ranged from 31% to 83%, with a median of 67% (IQR: 62%, 73%). Reporting completeness was highest for incremental value models (median, 83%), followed by validation (70%; IQR: 64%, 74%); development (66%; IQR: 62%, 70%) and development and validation of the same model (62%; IQR: 56%, 71%) showed similar completeness.
Reporting completeness per TRIPOD items
We found that the TRIPOD items in the discussion section (items 18-20) were well reported, at up to 100%, as were supplementary information (item 21) and research funding (item 22), both at 100%. A further 14 items were reported at ≥75% completeness across all model types (i.e., development, validation, development and validation of the same model, and incremental value), while four items were reported at <25% completeness.
The remaining TRIPOD items are described in detail below. Since only three models qualified as incremental value studies and this sample was too small to be representative, we did not include this model type in the following elaboration. All details are shown in Figure 3 and Appendix 3.
Items 1–3 (title/abstract/introduction)
Across all model types, reporting completeness for the title and abstract items was low, ranging from 5% to 36%. However, completion of the introduction item (item 3) was high, with studies specifying the objectives, presenting the background, and referencing existing models.
In development, 5 (11%) of the 37 models explicitly identified the study in the title as developing and/or validating a multivariable prediction model and reported the target population and the outcome to be predicted. The corresponding completeness was 36% for validation and 31% for development and validation of the same model. Four validation models satisfied all 12 elements of item 2; that is, the research objectives, study design, setting, participants, sample size, predictors, prediction outcomes, and statistical analyses, together with brief results and conclusions, were all provided in the abstract. The completeness of item 2 was 5% in development and 23% in development and validation of the same model.
Items 4–12 (methods)
Items 4–5, 6a, 8, 10c, and 11 were well reported across all model types, with values >80%, meaning that the sources of data, key study dates, and eligibility criteria for participants were well described. However, the completeness of reporting on how missing data were handled (item 9) and on the model-building procedures (item 10b) was low, at <15%.
The completeness of reporting on blinding of the outcome to be predicted was not high in development (57%) or in development and validation of the same model (46%). Reporting of how model performance was assessed (item 10d) was complete for 24% of development models, 43% of validation models, and 54% of development and validation of the same model, mainly owing to inadequate reporting of the calibration element. In validation, very few models (7%) compared the validation data with the development data (item 12); however, item 12 was well reported in development and validation of the same model, at up to 77%.
Items 13–17 (results)
All model types showed highly complete reporting of the number of participants and outcome events in the analysis and of the unadjusted associations between candidate predictors and outcomes (items 14a and 14b), reaching more than 90%. However, few models addressed all four elements of item 13b, and its reporting completeness was <5%, because researchers tended to omit the number of participants with missing data on predictors and prediction outcomes.
In development and in development and validation of the same model, few studies reported adequate information on the final model (item 15a), with completeness of 32% and 8%, respectively. Although most models presented regression coefficients for each predictor, the intercept or the cumulative baseline hazard (or baseline survival) for at least one time point was poorly reported.
In development, only 46% of models fully reported item 15b, with many researchers failing to explain how to use the newly established prediction model. Across development, validation, and development and validation of the same model, reporting of prediction model performance measures (item 16) was not ideal, at 24%, 43%, and 62%, respectively. This was largely because many models failed to report the calibration element, which also corresponds to the low reporting of item 10d in the methods section.
Meta-analysis
In the meta-analysis, we identified five of the included validation studies from which the discrimination of CURB-65 could be extracted. The CURB-65 score is a prediction model used to stratify patients with community-acquired pneumonia into different management groups (18). The pooled performance of CURB-65 in patients with COVID-19 was 0.768 (95% CI: 0.694, 0.841). The forest plot is shown in Appendix 4.
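For reference, the published CURB-65 rule (18) assigns one point per criterion; a straightforward sketch of the score, as originally defined for community-acquired pneumonia, is shown below.

```python
def curb65_score(confusion, urea_mmol_per_l, respiratory_rate,
                 systolic_bp, diastolic_bp, age_years):
    """CURB-65: one point per criterion, for a total of 0-5 (Lim et al., Thorax 2003)."""
    return (int(confusion)                                  # new-onset confusion
            + int(urea_mmol_per_l > 7.0)                    # blood urea > 7 mmol/L
            + int(respiratory_rate >= 30)                   # respiratory rate >= 30/min
            + int(systolic_bp < 90 or diastolic_bp <= 60)   # low blood pressure
            + int(age_years >= 65))                         # age >= 65 years
```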
Discussion
In this systematic review of prognostic models related to COVID-19, we included a total of 67 models from 52 studies. The main prediction outcomes were death, development of a severe/critical state, ICU admission/mechanical ventilation/death, survival time, and length of hospital stay. There was overlap between outcomes: the outcome predicted in some studies served as a component of the outcomes predicted in others. For example, Zeng et al. focused on identifying patients at high risk of progression who would require transfer to the ICU (19), whereas many other studies listed ICU admission as one component of their prediction outcomes (i.e., severe or critical progression and mortality) (20-22). Additionally, the same outcome was defined differently across studies; the definition of severe and critical cases was not uniform. Liu et al. assessed patient status according to the American Thoracic Society guidelines (23), and Liang et al. likewise defined severity based on the American Thoracic Society guidelines for community-acquired pneumonia, given their wide acceptance (24). However, Xiao et al. used the Diagnosis and Treatment Protocol for Novel Coronavirus Pneumonia (Trial Version 7) as the guideline for the severity spectrum (25). Blinded assessment of the predicted outcome and the predictors was ignored in the models. All-cause mortality is well defined and not affected by subjective factors, but for other outcomes, such as progression to a severe state, an explicit description of how the outcome was judged would be expected.
Potential for popularizing clinical practice
Optimistic discrimination performance was reported for all the models. However, the existing models risked overfitting, because the numbers of available participants and events used to develop the new prediction models were limited by the sample sizes. In addition, most studies simply excluded records with missing data, which greatly reduced the sample sizes; multiple imputation may be used to address this challenge. Overfitting can also be alleviated through calibration, which was rarely evaluated in these models. In future prediction model research, attention should be paid to the handling of missing values, with multiple imputation carried out where appropriate, and emphasis should be placed on calibration results when reporting model performance. Similarly, there were few (only 13) external validations of newly established models, which is insufficient for promoting the existing models directly in clinical practice. Internal validations of newly established models were also scarce, and random split-sample validation was used far more often than bootstrapping or k-fold cross-validation, compounding the limitation of small sample sizes. Based on our findings, we encourage researchers to include age, disease history, lymphocyte count, history of hypertension and cardiovascular disease, C-reactive protein, lactate dehydrogenase, white blood cell count, and platelet count in prediction models, rather than simply selecting predictors in a data-driven manner, which may put the model at risk of overfitting.
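As an illustration of the bootstrap alternative to random splitting recommended above, the following sketch (our example using a generic scikit-learn logistic model, assuming a numpy predictor matrix X and binary outcomes y in which every resample contains both classes) performs Harrell-style optimism correction of the apparent AUROC:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def optimism_corrected_auroc(X, y, n_boot=200, seed=0):
    """Bootstrap internal validation of a logistic model's AUROC."""
    rng = np.random.default_rng(seed)
    model = LogisticRegression(max_iter=1000).fit(X, y)
    apparent = roc_auc_score(y, model.predict_proba(X)[:, 1])   # apparent (optimistic) AUROC
    optimism = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))                   # resample with replacement
        m = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        auc_boot = roc_auc_score(y[idx], m.predict_proba(X[idx])[:, 1])
        auc_orig = roc_auc_score(y, m.predict_proba(X)[:, 1])   # bootstrap model on original data
        optimism.append(auc_boot - auc_orig)
    return apparent - float(np.mean(optimism))                  # optimism-corrected AUROC
```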
Research participants should be adequately described in the development data, which helps to generalize newly established models to the real world. Borghesi et al. identified their participants as Caucasian (8), Osborne et al. clarified that their model was aimed at veterans in the United States (26), and Pascual et al. specified that their study setting was the hospital emergency department (27). However, most studies paid little attention to the applicability of their models, although we recognize that, given the particular urgency of COVID-19, the time and settings available for completing these studies were limited.
Moreover, the reporting completeness of the final model presentation was poor. Although the regression coefficient (or a derivative such as the hazard ratio, odds ratio, or risk ratio) for each predictor was reported in a large number of models, the intercept or the cumulative baseline hazard for at least one time point was ignored, which will make it difficult for future research to revalidate and recalibrate the developed models. All of the above hindered the improvement of the prediction models and their promotion in clinical practice.
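A small hypothetical example shows why the intercept matters: with the coefficients alone, only the relative ordering of patients can be recovered, whereas computing an absolute risk requires the intercept (all names and numbers below are invented purely for illustration).

```python
import math

# Hypothetical coefficients for illustration only; a real report must supply
# the fitted intercept (or baseline hazard) alongside the predictor coefficients.
INTERCEPT = -4.2
COEFFICIENTS = {"age": 0.05, "crp": 0.01, "lymphocyte_count": -0.80}

def predicted_risk(patient):
    """Absolute risk from a logistic model; without INTERCEPT, only relative
    risk ordering, not the probability itself, can be recovered."""
    lp = INTERCEPT + sum(b * patient[k] for k, b in COEFFICIENTS.items())
    return 1.0 / (1.0 + math.exp(-lp))

# e.g., predicted_risk({"age": 70, "crp": 120, "lymphocyte_count": 0.6})
```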
In our study, a moderate or even excellent degree of discrimination was found when the existing CURB-65 model was used to predict the prognosis of COVID-19 patients. Future research may consider adding prediction variables or recalibrating the model to achieve better prediction results. Moreover, with vaccine trials under way worldwide, whether vaccination will affect prediction models, that is, whether vaccination status can become a new predictor, is another direction on which researchers need to focus.
Limitations
The number of included studies was relatively small; these evaluation results may improve as COVID-19 prognostic model research expands. In particular, there were few incremental value studies, so the quantitative method derived from the TRIPOD statement may not be appropriate for evaluating this model type. Secondly, owing to the limited applicability of TRIPOD, we were unable to evaluate models established with artificial intelligence techniques. Thirdly, some hospitals provided data to several studies simultaneously, so the extent of participant overlap among the included studies was unclear. Moreover, most of the included articles were from China, especially Wuhan, and demographic variables that might affect patient outcomes, such as race, economic status, and educational level, were not described. All of these factors may have had a potential impact on our results.
Conclusions
In the present study, the prognostic prediction models for COVID-19 were evaluated according to the TRIPOD statement, and their reporting completeness was found to be poor. The potential for clinical promotion of these models is low owing to overfitting and the lack of calibration and external validation. Overall, future research should focus on the validation and improvement of existing models; the premise for this is high-quality research that follows the TRIPOD reporting guidelines.
Acknowledgments
Funding: The work was supported by grants from the National Natural Science Foundation of China (81573258), the Jiangsu Provincial Major Science & Technology Demonstration Project (BE2017749), and the Southeast University COVID-19 Fund (3225002001C1).
Footnote
Reporting Checklist: The authors have completed the PRISMA reporting checklist (available at http://dx.doi.org/10.21037/atm-20-6933).
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at http://dx.doi.org/10.21037/atm-20-6933). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. No human or animal experiments were involved in this study.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- COVID19.WHO.int. WHO Coronavirus Disease (COVID-19) Dashboard; c2020. Available online: https://covid19.who.int/
- Chen N, Zhou M, Dong X, et al. Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study. Lancet 2020;395:507-13. [Crossref] [PubMed]
- Wang D, Hu B, Hu C, et al. Clinical Characteristics of 138 Hospitalized Patients With 2019 Novel Coronavirus-Infected Pneumonia in Wuhan, China. JAMA 2020;323:1061-9. [Crossref] [PubMed]
- Huang C, Wang Y, Li X, et al. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet 2020;395:497-506. [Crossref] [PubMed]
- Grasselli G, Zangrillo A, Zanella A, et al. Baseline Characteristics and Outcomes of 1591 Patients Infected With SARS-CoV-2 Admitted to ICUs of the Lombardy Region, Italy. JAMA 2020;323:1574-81. [Crossref] [PubMed]
- CDC COVID-19 Response Team. Severe Outcomes Among Patients with Coronavirus Disease 2019 (COVID-19) - United States, February 12-March 16, 2020. MMWR Morb Mortal Wkly Rep 2020;69:343-6. [Crossref] [PubMed]
- Bello-Chavolla OY, Bahena-López JP, Antonio-Villa NE, et al. Predicting Mortality Due to SARS-CoV-2: A Mechanistic Score Relating Obesity and Diabetes to COVID-19 Outcomes in Mexico. J Clin Endocrinol Metab 2020;105:dgaa346.
- Borghesi A, Zigliani A, Golemi S, et al. Chest X-ray severity index as a predictor of in-hospital mortality in coronavirus disease 2019: A study of 302 patients from Italy. Int J Infect Dis 2020;96:291-3. [Crossref] [PubMed]
- Dong YM, Sun J, Li YX, et al. Development and Validation of a Nomogram for Assessing Survival in Patients with COVID-19 Pneumonia. Clin Infect Dis 2021;72:652-60. [Crossref] [PubMed]
- Collins GS, Reitsma JB, Altman DG, et al. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): The TRIPOD Statement. BMC Med 2015;13:1. [Crossref] [PubMed]
- Heus P, Damen JAAG, Pajouheshnia R, et al. Poor reporting of multivariable prediction model studies: towards a targeted implementation strategy of the TRIPOD statement. BMC Med 2018;16:120. [Crossref] [PubMed]
- Wynants L, Van Calster B, Bonten MMJ, et al. Prediction models for diagnosis and prognosis of covid-19 infection: systematic review and critical appraisal. BMJ 2020;369:m1328. [Crossref] [PubMed]
- Moons KGM, Altman DG, Reitsma JB, et al. Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): Explanation and Elaboration. Ann Intern Med 2015;162:W1-73. [Crossref] [PubMed]
- Borghesi A, Maroldi R. COVID-19 outbreak in Italy: experimental chest X-ray scoring system for quantifying and monitoring disease progression. Radiol Med 2020;125:509-13. [Crossref] [PubMed]
- Luo M, Liu J, Jiang W, et al. IL-6 and CD8(+) T cell counts combined are an early predictor of in-hospital mortality of patients with COVID-19. JCI Insight 2020;5:e139024. [Crossref] [PubMed]
- Chen R, Liang W, Jiang M, et al. Risk Factors of Fatal Outcome in Hospitalized Subjects with Coronavirus Disease 2019 From a Nationwide Analysis in China. Chest 2020;158:97-105. [Crossref] [PubMed]
- Shang Y, Liu T, Wei Y, et al. Scoring systems for predicting mortality for severe patients with COVID-19. EClinicalMedicine 2020;24:100426. [Crossref] [PubMed]
- Lim WS, van der Eerden MM, Laing R, et al. Defining community acquired pneumonia severity on presentation to hospital: an international derivation and validation study. Thorax 2003;58:377-82. [Crossref] [PubMed]
- Zeng Z, Ma Y, Zeng H, et al. Simple nomogram based on initial laboratory data for predicting the probability of ICU transfer of COVID-19 patients: Multicenter retrospective study. J Med Virol 2021;93:434-40. [Crossref] [PubMed]
- Myrstad M, Ihle-Hansen H, Tveita AA, et al. National Early Warning Score 2 (NEWS2) on admission predicts severe disease and in-hospital mortality from Covid-19-a prospective cohort study. Scand J Trauma Resusc Emerg Med 2020;28:66. [Crossref] [PubMed]
- Liu J, Liu Y, Xiang P, et al. Neutrophil-to-lymphocyte ratio predicts critical illness patients with 2019 coronavirus disease in the early stage. J Transl Med 2020;18:206. [Crossref] [PubMed]
- Wang B, Zhong F, Zhang H, et al. Risk factors analysis and nomogram construction of non-survivors in critical patients with COVID-19. Jpn J Infect Dis 2020;73:452-8. [Crossref] [PubMed]
- Liu YP, Li GM, He J, et al. Combined use of the neutrophil-to-lymphocyte ratio and CRP to predict 7-day disease severity in 84 hospitalized patients with COVID-19 pneumonia: a retrospective cohort study. Ann Transl Med 2020;8:635. [Crossref] [PubMed]
- Liang W, Liang H, Ou L, et al. Development and Validation of a Clinical Risk Score to Predict the Occurrence of Critical Illness in Hospitalized Patients With COVID-19. JAMA Intern Med 2020;180:1081-9. [Crossref] [PubMed]
- Xiao LS, Zhang WF, Gong MC, et al. Development and validation of the HNC-LL score for predicting the severity of coronavirus disease 2019. EBioMedicine 2020;57:102880. [Crossref] [PubMed]
- Osborne TF, Veigulis ZP, Arreola DM, et al. Automated EHR score to predict COVID-19 outcomes at US Department of Veterans Affairs. PLoS One 2020;15:e0236554. [Crossref] [PubMed]
- Pascual Gómez NF, Monge Lobo I, Granero Cremades I, et al. Potential biomarkers predictors of mortality in COVID-19 patients in the Emergency Department. Rev Esp Quimioter 2020;33:267-73. [Crossref] [PubMed]