Combined electronic medical records and gene polymorphism characteristics to establish an anti-tuberculosis drug-induced hepatic injury (ATDH) prediction model and evaluate the prediction value
Introduction
The Global Tuberculosis Annual Report 2020 described the international and domestic situation of tuberculosis infection as critical, with China accounting for approximately 8.4% of the global tuberculosis population (1). While the 2021 report found the incidence of tuberculosis had fallen by 18% compared to the previous year (2), the coronavirus disease 2019 (COVID-19) pandemic has disrupted tuberculosis services and increased tuberculosis deaths, with close to half of those who fall ill undiagnosed and untreated (2). One of the side effects of the World Health Organization (WHO)-recommended first-line anti-tuberculosis regimen is anti-tuberculosis drug-induced hepatic injury (ATDH), which occurs at an incidence of 5.0–28.0% and can lead to discontinuation of first-line regimens, treatment failure, and the development and spread of multidrug-resistant tuberculosis (1,2). The mechanism of ATDH involves many complex links, such as drug metabolism and transport, oxidative stress, mitochondrial dysfunction, immune regulation, and inflammatory response, but it has not yet been clarified (1,3). While specific clinical symptoms and markers for the diagnosis of ATDH are lacking, single nucleotide polymorphisms (SNPs) have shown potential clinical applications as molecular markers of the disease (4). However, results for individual SNPs do not provide a complete and systematic picture of the relevance to ATDH of the signaling pathway in which they are located (5,6).
PXR/ALAS axis activation leading to abnormal metabolism of the hepatic heme pathway is one of the possible mechanisms by which combined rifampicin and isoniazid treatment leads to ATDH (7). PXR is an important factor in activating ALAS1 transcription (8), and the transcriptional balance of the ALAS1 gene is coordinated by a complex combination of signaling pathways. The insulin-sensitive FOXO1 pathway can synergize the transcriptional regulation of ALAS1 by PXR (9,10), and our previous studies have identified genetic polymorphisms in PXR and FOXO1 that correlate with ATDH susceptibility (6,11).
A clinical prediction model may assist in the diagnosis of ATDH by using multifactorial analysis to identify predictors that have a significant impact on the outcome. Different combinations of candidate predictors can then be used to assess the probability of the outcome and so assist clinical practice (12). Current ATDH prediction models, which use single sociodemographic factors or partial clinical indications as predictor variables, have not been validated for goodness of fit, and the robustness and accuracy of their predictive efficacy are yet to be verified. With the development of pharmacogenomics, SNPs, as important genetic features, are believed to improve model predictive power when used as predictors, and combining multi-gene feature variables can enhance predictive model efficacy (13-15). Machine learning algorithms can mine large datasets effectively and at high speed to discover potential clinically relevant factors. By integrating multidimensional information and extracting valid information from it for model fitting, these algorithms improve model stability and predictive efficacy in unknown datasets and increase the applicability of models to assist clinical diagnosis and treatment (16). Therefore, building predictive models based on general clinical features, laboratory indications, and multigene genetic features with machine learning algorithms can lay the foundation for clinical models that are both stable and individualized. At the same time, the probability of disease outcomes predicted by the model can be visualized by nomograms, which are more intuitive and easier to use than conventional prediction formulas (17).
In summary, in this study, for the first time, machine learning algorithms were used simultaneously to establish a visual prediction model of ATDH risk by combining the general clinical characteristics, laboratory indications, and genetic characteristics of multiple genes in the PXR/ALAS/FOXO1 axis in 746 patients with confirmed tuberculosis. The model was evaluated with respect to its predictive performance and clinical applicability. We present the following article in accordance with the STARD reporting checklist (available at https://atm.amegroups.com/article/view/10.21037/atm-22-4551/rc).
Methods
Study population
A total of 1,060 patients with suspected tuberculosis who consecutively attended the West China Hospital of Sichuan University from December 2016 to April 2018 were included retrospectively (18). The inclusion criteria were a clear tuberculosis diagnosis according to our tuberculosis diagnostic criteria and the use of an anti-tuberculosis first-line treatment regimen. The exclusion criteria were an unclear tuberculosis diagnosis; the use of other hepatotoxic drugs; concomitant human immunodeficiency virus (HIV), hepatitis B virus (HBV), hepatitis C virus (HCV), or other immune disease; and loss to follow-up or discontinuation of first-line treatment (19). The study conformed to the provisions of the Declaration of Helsinki (as revised in 2013) and was approved by the Ethics Committee of West China Hospital, Sichuan University (registration No. 2014198). All subjects were unrelated Han Chinese individuals from western China who participated voluntarily and signed the informed consent form.
All methods were performed in accordance with the relevant guidelines and regulations. Inclusion criteria for the ATDH group were: (I) receiving a first-line anti-tuberculosis drug regimen; (II) diagnosis of liver injury in accordance with CTCAE v5.0 criteria (20); and (III) no other hepatotoxic drugs used in the 14 days before the diagnosis of ATDH (11,18). The inclusion criterion for the non-ATDH group was that no liver injury had occurred during anti-tuberculosis treatment. The severity of hepatotoxicity was classified into three categories according to the WHO Toxicity Classification Standards: Grade 1 (mild), alanine aminotransferase (ALT) <5× the upper limit of normal (ULN) (200 IU/L); Grade 2 (moderate), ALT higher than 5× ULN but less than 10× ULN; and Grade 3 (severe), ALT ≥10× ULN (400 IU/L) (21).
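The three-grade severity rule above can be written as a small function. This is a minimal sketch, not the study's code: the function name is hypothetical, the default ULN of 40 IU/L is inferred from the stated cut-offs (5× ULN = 200 IU/L), and an ALT of exactly 5× ULN is assigned to the moderate grade because the text does not say where that boundary value belongs.

```python
def grade_hepatotoxicity(alt_iu_l: float, uln_iu_l: float = 40.0) -> str:
    """Grade ATDH severity from ALT expressed as a multiple of the ULN.

    uln_iu_l=40.0 is an assumption inferred from "5x ULN (200 IU/L)".
    ALT exactly equal to 5x ULN is treated as moderate (assumption).
    """
    if alt_iu_l < 0 or uln_iu_l <= 0:
        raise ValueError("ALT must be non-negative and ULN must be positive")
    ratio = alt_iu_l / uln_iu_l
    if ratio < 5:
        return "mild"      # Grade 1: ALT < 5x ULN
    if ratio < 10:
        return "moderate"  # Grade 2: 5x ULN up to, but not including, 10x ULN
    return "severe"        # Grade 3: ALT >= 10x ULN


# Usage: a patient already diagnosed with ATDH and an ALT of 260 IU/L
print(grade_hepatotoxicity(260.0))
```

The function only grades patients who already meet the ATDH definition; it does not decide whether liver injury is present.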
Selection of target SNPs loci and typing assay
The dbSNP database and the 1000 Genomes database were accessed, and Haploview software was used to screen target SNPs in the candidate genes PXR, FOXO1, and ALAS1, with the screening principles as follows (11,19): minor allele frequency ≥20% in the southern Chinese Han population; linkage disequilibrium value of tag SNPs r² ≥0.8; location within the candidate gene region, extended 200 bp upstream and 300 bp downstream; and, based on domestic and international literature, a possible association with the risk of ATDH development or potential functional significance. Based on these principles and pre-experimental results, SNP loci for which PCR primers and single-base amplification primers could be successfully designed, and which were successfully typed, were selected and typed using 48-Plex SNPscan® high-throughput SNP typing technology (18,19). Thirty samples were randomly selected for double-blind experiments to ensure the repeatability and stability of the genotyping results, and all genotype calling success rates were greater than 99.0% (6).
Data collection, preprocessing, and feature variable screening
The definitive diagnosis of ATDH and the basic medical history of subjects were exported from the hospital information system (HIS) by data collectors, and all relevant laboratory indications were exported from the laboratory information system (LIS). Variables with <10% missing data were filled with the median for continuous variables and the mode for categorical variables, while variables with >10% missing data were excluded. The exact diagnosis of ATDH was obtained from the medical records by data collectors and provided to the data analysts. If there was no clear diagnosis of ATDH in the medical records, the subject was excluded after confirmation by a consulting clinician. Genetic polymorphism testing staff and clinical data collectors worked independently and were unaware of each other’s data, while data analysts jointly used all data to build predictive models and perform performance validation. All study subjects were randomly divided into test and validation datasets according to a 3:2 ratio. Lasso regression was used to initially screen candidate variables: the penalty coefficient lambda (λ) giving the minimum cross-validation error was selected to construct a subset of candidate variables, which were then retained at P<0.2 (22-24).
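The imputation rule and the random 3:2 split described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names and the seed are hypothetical, a column with exactly 10% missing values is imputed rather than dropped (the text does not specify this case), and the split point is rounded to the nearest subject, so the resulting set sizes need not match those reported in the Results.

```python
import random
from statistics import median, mode


def impute_column(values, categorical, max_missing=0.10):
    """Fill None entries with the median (continuous) or mode (categorical).

    Returns None when more than `max_missing` of the entries are missing,
    signalling that the variable should be excluded.
    """
    observed = [v for v in values if v is not None]
    n_missing = len(values) - len(observed)
    if n_missing > max_missing * len(values):
        return None  # too much missing data: exclude the variable
    fill = mode(observed) if categorical else median(observed)
    return [fill if v is None else v for v in values]


def split_three_to_two(subject_ids, seed=0):
    """Randomly split subjects into two sets at a 3:2 ratio (seed is arbitrary)."""
    shuffled = list(subject_ids)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * 3 / 5)
    return shuffled[:cut], shuffled[cut:]


# Usage with made-up values
albumin = [38.0, 41.5, None, 36.2, 40.1, 39.0, 37.7, 42.3, 35.9, 38.8]
print(impute_column(albumin, categorical=False))
larger_set, smaller_set = split_three_to_two(range(20))
print(len(larger_set), len(smaller_set))
```

Fixing the seed makes the split reproducible, which matters when the same partition has to be reused for model building and later validation.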
Construction of prediction models and evaluation of predictive efficacy using test set data
The modeling of candidate variables using the test set was performed using STATA software v15.0, and the goodness of fit of the model was evaluated using Akaike’s Information Criterion (AIC) (12,22,25). The selection criteria were (I) AIC minimization and (II) minimization of the number of candidate variables without affecting predictive efficacy (23,26). Collinearity and interaction analyses were performed on the included predictors (26,27), while performance parameters such as sensitivity, specificity, positive predictive value, and negative predictive value were evaluated with a discrimination test (22). We used receiver operating characteristic (ROC) curves and the C-index to assess model discrimination (12,23), and calibration curve plots to assess consistency (23,28).
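The four performance parameters named above follow directly from a 2×2 confusion matrix. The sketch below shows the arithmetic; the function name is hypothetical. The counts in the usage line are not reported in the article: they were back-calculated for this illustration as one set of integers that reproduces the test-set percentages given in the Results, and should be read as illustrative only.

```python
def classification_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV and NPV from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives among all ATDH cases
        "specificity": tn / (tn + fp),  # true negatives among all non-ATDH cases
        "ppv": tp / (tp + fp),          # ATDH cases among predicted positives
        "npv": tn / (tn + fn),          # non-ATDH cases among predicted negatives
    }


# Illustrative counts only (back-calculated, not taken from the article)
metrics = classification_metrics(tp=25, fp=7, fn=48, tn=342)
for name, value in metrics.items():
    print(f"{name}: {value * 100:.2f}%")
```

With a low-prevalence outcome, a high specificity and negative predictive value can coexist with a low sensitivity, which is the pattern this model shows.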
Validation of prediction model efficacy using validation set data
Using the predictors obtained from the test set model and the corresponding coefficients, model reconstruction and validation of model fit were performed using the validation set data (12).
Model visualization and clinical application value assessment
A nomogram was developed to visualize the model. The clinical application value of the model was demonstrated using decision curve analysis, and the strategy with the highest net benefit for a specific threshold probability was considered the best strategy (26,29).
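Decision curve analysis compares strategies by their net benefit at a given threshold probability. A minimal sketch of the standard net-benefit formula, NB = TP/n − (FP/n) × pt/(1 − pt), is given below; the function names and example data are hypothetical and are not taken from the study.

```python
def net_benefit(probabilities, outcomes, threshold):
    """Net benefit of treating everyone whose predicted risk >= threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(outcomes)
    pairs = list(zip(probabilities, outcomes))
    tp = sum(1 for p, y in pairs if p >= threshold and y == 1)
    fp = sum(1 for p, y in pairs if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(outcomes, threshold):
    """Net benefit of the reference strategy that treats every patient."""
    prevalence = sum(outcomes) / len(outcomes)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Usage with made-up predictions (1 = ATDH, 0 = no ATDH)
risks = [0.9, 0.8, 0.3, 0.2, 0.1]
events = [1, 0, 1, 0, 0]
for pt in (0.25, 0.5):
    print(pt, net_benefit(risks, events, pt), net_benefit_treat_all(events, pt))
```

The "treat none" strategy has a net benefit of zero at every threshold, so a model is useful at a threshold only if its net benefit exceeds both zero and the treat-all value.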
Statistical analysis
SPSS software (version 23.0) was used to analyze clinical data and laboratory indications. Normally distributed quantitative data were compared with the t-test or ANOVA and expressed as mean ± standard deviation (SD). Quantitative data with a non-normal distribution were compared with Mann-Whitney or Kruskal-Wallis nonparametric tests and expressed as median (interquartile range), and the chi-square test or logistic regression was used for count data (19). R version 3.6.1 software was used to screen potential predictors through Lasso regression, and SPSS version 23.0 software was used for univariate (one-way) logistic regression analysis, with P<0.20 as the threshold for inclusion of predictors. Multi-factor analysis was performed with STATA version 14 software using a generalized linear model logistic regression stepwise selection method, and the model was constructed using the minimum value of AIC and the minimum number of predictors as criteria. ROC curve analysis was used to evaluate the discrimination of the predictive model, with the C-index as the criterion. The Hosmer-Lemeshow test was used to evaluate the consistency of the prediction model in the validation set data, with P>0.05 as the reference criterion, and the unreliability test (U test) calibration curve analysis was used to evaluate the consistency of the model fit, with P>0.05 as the reference criterion (30). The prediction model was visualized using a nomogram, and its clinical application value was analyzed using decision curves. The incidence of ATDH in the West China population was approximately 15% (31).
To examine the significant difference between these two groups, the bilateral significance level was established at 5%, and the power of the test was 80%. Considering a 10% loss to follow-up, the sample size of each group was estimated at approximately 100 cases (32).
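For a binary outcome, the C-index used above as the discrimination criterion equals the area under the ROC curve: the share of (ATDH, non-ATDH) patient pairs in which the ATDH patient received the higher predicted risk. The sketch below computes it by direct pair counting; the function name and data are hypothetical, and tied predictions are counted as half-concordant, a common convention assumed here.

```python
def c_index(probabilities, outcomes):
    """Concordance index for a binary outcome by exhaustive pair counting."""
    cases = [p for p, y in zip(probabilities, outcomes) if y == 1]
    controls = [p for p, y in zip(probabilities, outcomes) if y == 0]
    if not cases or not controls:
        raise ValueError("both outcome classes must be present")
    concordant = 0.0
    for case_risk in cases:
        for control_risk in controls:
            if case_risk > control_risk:
                concordant += 1.0
            elif case_risk == control_risk:
                concordant += 0.5  # ties count as half (assumed convention)
    return concordant / (len(cases) * len(controls))


# Usage with made-up predictions (1 = ATDH, 0 = no ATDH)
print(c_index([0.9, 0.8, 0.3, 0.2, 0.1], [1, 0, 1, 0, 0]))
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two groups.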
Results
Basic information about the study population
During treatment, biochemical and haematological analyses were performed twice each month during the first two months and monthly in the subsequent four months. At the same time, clinicians observed and recorded clinical symptoms and signs in accordance with the diagnosis and treatment standards. A total of 746 study subjects (118 in the ATDH group and 628 in the non-ATDH group) were included in this study, and the process of enrolment is shown in Figure S1. All patients in the ATDH group presented alterations in hepatic enzymes, and 32 individuals developed symptomatic hepatitis, which was characterized by jaundice, nausea, vomiting, and abdominal pain. While there was no statistical difference between the two groups in gender, age, and living habits, the proportion of patients presenting with febrile symptoms was significantly higher in the ATDH group (63/118) than in the non-ATDH group (247/628), as shown in Table 1.
Table 1
Characteristics | ATDH (n=118) | Non-ATDH (n=628) | P |
---|---|---|---|
Age, mean ± SD (years) | 40.92±15.72 | 42.85±18.44 | 0.285 |
Gender (male/female), n (%) | 69 (58.47)/49 (41.53) | 375 (59.71)/253 (40.29) | 0.801 |
Smoking (yes/no), n (%) | 35 (29.66)/83 (70.34) | 192 (30.57)/436 (69.43) | 0.843 |
Drinking alcohol (yes/no), n (%) | 32 (27.12)/86 (72.88) | 141 (22.45)/487 (77.55) | 0.270 |
General symptoms, n (%) | | | |
Fever | 63 (20.32) | 247 (79.68) | 0.004 |
Weight loss | 34 (13.08) | 226 (86.92) | 0.133 |
Night sweats | 29 (14.80) | 167 (85.20) | 0.648 |
Loss of appetite | 45 (17.11) | 218 (82.89) | 0.475 |
Fatigue | 31 (17.92) | 142 (82.08) | 0.387 |
Local infection symptoms, n (%) | | | |
Present | 88 (16.92) | 432 (83.08) | 0.209 |
Absent | 30 (13.27) | 196 (86.73) | |
ATDH, anti-tuberculosis drug-induced hepatic injury; SD, standard deviation. In the symptom rows, percentages are row percentages, that is, the share of patients with that finding who belong to each group.
Indications of clinical laboratory tests in the study population
Patients in the ATDH group had higher total bilirubin (TBIL), indirect bilirubin, aspartate aminotransferase (AST), ALT, alkaline phosphatase, and glutamyl transferase levels and lower uric acid levels than those in the non-ATDH group (all P<0.05), as shown in Table 2. ATDH cases were graded as mild (83/118, 70.34%), moderate (21/118, 17.80%), and severe (14/118, 11.86%). Age and gender were similar across the three severity groups (P=0.888 and P=0.117, respectively) (data available on request). Once a patient developed ATDH, the clinician, according to its severity, gave liver-protective treatment, temporarily discontinued the drug, or switched to a second-line treatment plan (2).
Table 2
Baseline values for laboratory test results | ATDH group (n=118) | Non-ATDH group (n=628) | P |
---|---|---|---|
Red blood cell count (×10¹²/L) | 4.31±0.74 | 4.28±0.68 | 0.481 |
Haemoglobin (g/L) | 122.87±22.11 | 122.06±20.58 | 0.717 |
Haematocrit (L/L) | 0.38±0.06 | 0.36±0.06 | 0.069 |
Platelet count (×10⁹/L) | 236.50 (184.00–321.75) | 232.50 (172.75–297.25) | 0.134 |
White blood cell count (×10⁹/L) | 6.57 (4.99–7.96) | 6.51 (5.17–8.44) | 0.761 |
Absolute value of neutrophils (×10⁹/L) | 5.23±2.89 | 5.10±2.73 | 0.631 |
Absolute value of lymphocytes (×10⁹/L) | 1.29±0.79 | 1.26±0.62 | 0.625 |
Absolute value of monocytes (×10⁹/L) | 0.55±0.29 | 0.50±0.25 | 0.099 |
Neutrophils (%) | 70.49±11.50 | 70.13±11.54 | 0.760 |
Lymphocytes (%) | 16.25 (12.58–25.58) | 17.5 (12.18–25.68) | 0.527 |
Monocytes (%) | 7.74±2.62 | 7.30±2.37 | 0.077 |
Total bilirubin (μmol/L) | 10.05 (7.50–14.13) | 8.70 (6.30–12.10) | 0.002 |
Direct bilirubin (μmol/L) | 3.55 (2.38–5.60) | 3.45 (2.50–5.40) | 0.126 |
Indirect bilirubin (μmol/L) | 5.70 (3.98–7.95) | 4.80 (3.40–7.03) | 0.049 |
ALT (IU/L) | 28.00 (15.75–38.00) | 15.00 (10.00–21.00) | <0.001 |
AST (IU/L) | 27.00 (20.00–34.00) | 19.50 (16.00–25.00) | <0.001 |
Total protein (g/L) | 69.42±8.42 | 68.82±9.15 | 0.508 |
Albumin (g/L) | 38.64±7.35 | 37.89±6.90 | 0.248 |
Globulin (g/L) | 30.78±6.65 | 30.93±7.02 | 0.829 |
Glucose (mmol/L) | 5.15 (4.64–5.95) | 5.14 (4.71–5.89) | 0.410 |
Urea (mmol/L) | 3.92 (2.90–5.24) | 4.05 (3.15–5.30) | 0.299 |
Creatinine (μmol/L) | 57.50 (47.78–67.00) | 60.45 (49.00–73.20) | 0.601 |
Serum cystatin C (mg/L) | 0.91 (0.81–1.04) | 0.92 (0.79–1.06) | 0.975 |
Uric acid (μmol/L) | 291.29±125.98 | 331.51±155.30 | 0.008 |
Triglycerides (mmol/L) | 0.99 (0.81–1.31) | 1.06 (0.80–1.43) | 0.469 |
Cholesterol (mmol/L) | 3.96±1.206 | 3.96±1.058 | 0.966 |
High-density lipoprotein (mmol/L) | 1.12 (0.85–1.48) | 1.08 (0.82–1.41) | 0.811 |
Low-density lipoprotein (mmol/L) | 2.20 (1.79–2.72) | 2.21 (1.69–2.77) | 0.575 |
Alkaline phosphatase (IU/L) | 85.50 (68.50–106.00) | 79.00 (64.00–98.00) | 0.021 |
Glutamyl transferase (IU/L) | 42.50 (26.00–78.00) | 29.00 (19.00–48.00) | <0.001 |
C-reactive protein (mg/L) | 9.74 (2.30–39.23) | 12.25 (2.67–37.43) | 0.961 |
Erythrocyte sedimentation rate (mm/h) | 38.50 (20.50–63.00) | 33.50 (14.75–64.00) | 0.173 |
Data are presented as mean ± standard deviation or median (interquartile range). ATDH, anti-tuberculosis drug-induced hepatic injury; ALT, alanine aminotransferase; AST, aspartate aminotransferase.
Loci typing results for the target SNPs in the study population
T allele carriers at rs3814055 of the PXR gene had a reduced relative risk of ATDH compared to C allele carriers (11). In the FOXO1 gene, carriers of the C allele at rs2755237 had a reduced relative risk of ATDH relative to carriers of the A allele, as did carriers of the T allele at rs4435111 relative to carriers of the C allele. The gene frequencies of candidate SNPs for the ALAS1 gene did not differ between the two groups.
Modeling of ATDH risk prediction
Model predictor screening
Lasso regression, a machine learning algorithm, was used to screen the 98 preprocessed characteristic variables. At the λ that minimized the 10-fold cross-validation error (λ=0.0074528), the optimal subset comprised 36 variables with non-zero coefficients, and the coefficients of the remaining variables were shrunk to zero, as shown in Figures 1,2.
Identification of candidate predictors using one-way logistic regression
As shown in Table 3, fourteen candidate feature variables were statistically different in the test set, while twelve were statistically different in the validation set. The characteristic variables that differed in both sets were fever, rs3814055, total bilirubin, ALT, AST, and uric acid.
Table 3
Candidate feature variables | Test set (n=490): non-ATDH group (n=409) | Test set: ATDH group (n=81) | Test set: P | Validation set (n=256): non-ATDH group (n=219) | Validation set: ATDH group (n=37) | Validation set: P |
---|---|---|---|---|---|---|
Gender (M/F), n (%) | 246 (60.1)/163 (39.9) | 45 (55.6)/36 (44.4) | 0.459 | 90 (41.1)/129 (58.9) | 13 (35.1)/24 (64.9) | 0.588 |
Alcohol consumption (yes/no), n (%) | 127 (31.1)/282 (68.9) | 25 (30.9)/56 (69.1) | 0.543 | 14 (6.4)/205 (93.6) | 7 (18.9)/30 (81.1) | 0.019 |
Fever (yes/no), n (%) | 177 (43.3)/232 (56.7) | 44 (54.3)/37 (45.7) | 0.087 | 70 (32.0)/149 (68.0) | 19 (51.4)/18 (48.6) | 0.026 |
Weight loss (yes/no), n (%) | 176 (43.0)/233 (57.0) | 25 (30.9)/56 (69.1) | 0.048 | 50 (22.8)/169 (77.2) | 9 (24.3)/28 (75.7) | 0.834 |
Decreased appetite (yes/no), n (%) | 166 (40.6)/243 (59.4) | 32 (39.5)/49 (60.5) | 0.902 | 52 (23.7)/167 (76.3) | 13 (35.1)/24 (64.9) | 0.155 |
Fatigue (yes/no), n (%) | 122 (29.8)/287 (70.2) | 23 (28.4)/58 (71.6) | 0.894 | 19 (9.1)/199 (90.9) | 8 (21.6)/29 (78.4) | 0.041 |
Genotype | | | | | | |
rs353556, n (%) | | | 0.886 | | | 0.544 |
11 | 116 (28.4) | 22 (27.2) | | 57 (26.0) | 9 (24.3) | |
22 | 207 (50.6) | 40 (49.4) | | 120 (54.8) | 18 (48.6) | |
33 | 86 (21.0) | 19 (23.5) | | 42 (19.2) | 10 (27.0) | |
rs3852071, n (%) | | | 0.267 | | | 0.180 |
11 | 13 (3.2) | 0 (0.0) | | 3 (1.4) | 1 (2.7) | |
22 | 111 (27.1) | 23 (28.4) | | 67 (30.6) | 6 (16.2) | |
33 | 285 (69.7) | 58 (71.6) | | 149 (68.0) | 30 (81.1) | |
rs352169, n (%) | | | 0.058 | | | 0.334 |
11 | 174 (42.6) | 27 (33.3) | | 80 (36.5) | 17 (45.9) | |
22 | 194 (47.4) | 39 (48.1) | | 103 (47.0) | 17 (45.9) | |
33 | 41 (10.0) | 15 (18.5) | | 36 (16.4) | 3 (8.1) | |
rs2755237, n (%) | | | 0.008 | | | 0.385 |
11 | 191 (46.7) | 53 (65.4) | | 101 (46.1) | 15 (40.5) | |
22 | 192 (46.9) | 24 (29.6) | | 102 (46.6) | 21 (56.8) | |
33 | 26 (6.4) | 4 (4.9) | | 16 (7.3) | 1 (2.7) | |
rs2701891, n (%) | | | 0.009 | | | 0.202 |
11 | 224 (54.8) | 39 (48.1) | | 126 (57.5) | 22 (59.5) | |
22 | 160 (39.1) | 29 (35.8) | | 76 (34.7) | 15 (40.5) | |
33 | 25 (6.1) | 13 (16.0) | | 17 (7.8) | 0 (0.0) | |
rs3751436, n (%) | | | 0.629 | | | 0.481 |
11 | 149 (36.4) | 28 (34.5) | | 80 (36.5) | 10 (27.0) | |
22 | 205 (50.1) | 39 (48.1) | | 108 (49.3) | 22 (59.5) | |
33 | 55 (13.5) | 14 (17.3) | | 31 (14.2) | 5 (13.5) | |
rs4435111, n (%) | | | 0.014 | | | 0.991 |
11 | 249 (60.9) | 63 (77.8) | | 117 (53.4) | 20 (54.1) | |
22 | 144 (35.2) | 17 (21.0) | | 89 (40.6) | 15 (40.5) | |
33 | 16 (3.9) | 1 (1.2) | | 13 (5.9) | 2 (5.4) | |
rs7325594, n (%) | | | 0.492 | | | 0.329 |
11 | 51 (12.5) | 13 (16.0) | | 31 (14.2) | 5 (13.5) | |
22 | 218 (53.3) | 45 (55.6) | | 109 (49.8) | 23 (62.2) | |
33 | 140 (34.2) | 23 (28.4) | | 79 (36.1) | 9 (24.3) | |
rs3814055, n (%) | | | 0.195 | | | 0.114 |
11 | 242 (59.2) | 56 (69.1) | | 131 (59.8) | 28 (75.7) | |
22 | 137 (33.5) | 22 (27.2) | | 76 (34.7) | 9 (24.3) | |
33 | 30 (7.3) | 3 (3.7) | | 12 (5.5) | 0 (0.0) | |
rs56967099, n (%) | | | 0.880 | | | 0.897 |
11 | 111 (27.1) | 24 (29.6) | | 60 (27.4) | 9 (24.3) | |
22 | 189 (46.2) | 37 (45.7) | | 104 (47.5) | 19 (51.4) | |
33 | 109 (26.7) | 20 (24.7) | | 55 (25.1) | 9 (24.3) | |
rs13059232, n (%) | | | 0.740 | | | 0.939 |
11 | 157 (38.4) | 34 (42.0) | | 80 (36.5) | 14 (37.8) | |
22 | 196 (47.9) | 35 (43.2) | | 107 (48.9) | 17 (45.9) | |
33 | 56 (13.7) | 12 (14.8) | | 32 (14.6) | 6 (16.2) | |
rs4688040, n (%) | | | 0.477 | | | 0.964 |
11 | 149 (36.4) | 34 (42.0) | | 82 (37.4) | 13 (35.1) | |
22 | 207 (50.6) | 35 (43.2) | | 103 (47.0) | 18 (48.6) | |
33 | 53 (13.0) | 12 (14.8) | | 34 (15.5) | 6 (16.2) | |
rs6785049, n (%) | | | 0.361 | | | 0.802 |
11 | 163 (39.9) | 39 (48.1) | | 94 (42.9) | 15 (40.5) | |
22 | 187 (45.7) | 33 (40.7) | | 95 (43.4) | 18 (48.6) | |
33 | 59 (14.4) | 9 (11.1) | | 30 (13.7) | 4 (10.8) | |
rs3732360, n (%) | | | 0.675 | | | 0.220 |
11 | 129 (31.5) | 25 (30.9) | | 76 (34.7) | 18 (48.6) | |
22 | 206 (50.4) | 38 (46.9) | | 108 (49.3) | 13 (35.1) | |
33 | 74 (18.1) | 18 (22.2) | | 35 (16.0) | 6 (16.2) | |
Platelets (×10⁹/L) | 231.0 (171.0–296.5) | 244.0 (185.5–322.5) | 0.0549 | 235.0 (173.0–293.0) | 221.0 (181.5–276.0) | 0.9589 |
Percentage of neutrophils (%) | 71.70 (62.00–78.30) | 70.60 (61.28–76.53) | 0.6908 | 70.00 (64.00–79.00) | 74.50 (65.35–82.65) | 0.2106 |
Percentage of monocytes (%) | 7.10 (5.75–8.90) | 8.30 (5.68–9.45) | 0.1136 | 6.90 (5.80–8.80) | 7.40 (5.05–8.84) | 0.5629 |
Absolute monocyte values (×10⁹/L) | 0.47 (0.35–0.64) | 0.49 (0.36–0.67) | 0.341 | 0.45 (0.32–0.60) | 0.46 (0.36–0.76) | 0.2338 |
Total bilirubin (μmol/L) | 8.70 (6.40–12.23) | 9.80 (7.60–14.55) | 0.0087 | 8.80 (6.40–12.20) | 10.40 (6.60–14.70) | 0.1414 |
Direct bilirubin (μmol/L) | 3.50 (2.50–5.40) | 3.60 (2.30–6.60) | 0.3726 | 3.50 (2.50–5.45) | 3.90 (2.90–7.55) | 0.1374 |
Alanine aminotransferase (IU/L) | 14.0 (10.0–20.0) | 27.0 (13.5–38.0) | <0.0001 | 16.0 (10.0–23.0) | 27.0 (17.0–38.0) | <0.0001 |
Aspartate aminotransferase (IU/L) | 19.00 (15.00–25.00) | 27.00 (19.00–34.00) | <0.0001 | 21.00 (17.00–26.25) | 25.00 (21.00–33.00) | 0.0012 |
Albumin (g/L) | 38.50 (33.20–43.10) | 38.70 (34.40–43.60) | 0.2744 | 38.95 (33.10–43.33) | 37.80 (30.40–46.85) | 0.866 |
Glucose (mmol/L) | 5.16 (4.72–5.97) | 4.93 (4.61–5.64) | 0.0735 | 5.05 (4.65–5.67) | 5.37 (4.57–6.24) | 0.2974 |
Creatinine (μmol/L) | 61.1 (49.0–74.0) | 57.4 (47.5–69.0) | 0.2144 | 59.0 (50.0–71.0) | 61.0 (53.0–73.0) | 0.3446 |
Uric acid (μmol/L) | 306.70 (225.50–403.00) | 273.00 (203.00–393.00) | 0.1429 | 292.00 (224.00–417.00) | 267.35 (185.25–350.25) | 0.0392 |
Triglycerides (mmol/L) | 1.04 (0.78–1.41) | 1.02 (0.81–1.47) | 0.9415 | 1.01 (0.78–1.39) | 0.94 (0.83–1.13) | 0.1614 |
Alkaline phosphatase (IU/L) | 78.00 (64.00–96.25) | 84.00 (72.50–104.00) | 0.0254 | 80.00 (63.75–99.00) | 80.00 (65.50–108.50) | 0.3916 |
LASSO, least absolute shrinkage and selection operator; ATDH, anti-tuberculosis drug-induced hepatic injury.
Adjustment for model confounders
There was moderate-strength collinearity between ALT and AST (P=0.616) and no multicollinearity between any pair of the remaining 15 candidate variables (maximum P value 0.26). Rs3814055 and rs4435111 had an interaction effect on the outcome variable, ATDH occurrence (P=0.001), while no interactions were detected among the other 15 variables (all P>0.05).
Test set model building and optimization
The 17 candidate predictors were modeled in different ways, and the screening P values, AIC, and BIC are shown in Table 4. Model 6 incorporated five variables with an AIC of 320.50, model 8 incorporated nine variables with an AIC of 312.44, and model 9 incorporated eight variables with an AIC of 312.68. A likelihood-ratio comparison (lrtest command in STATA software) showed that model 6 and model 8 differed (P<0.05): although model 6 incorporated fewer variables, its predictive efficacy was reduced. In contrast, there was no difference in predictive efficacy between model 8 and model 9 (lrtest, P>0.05), and because model 9 incorporated fewer variables, it was considered the best model; the characteristics of its variables are shown in Table 5.
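The likelihood-ratio comparisons above can be approximately re-derived from the AIC values alone, because AIC = 2k − 2lnL. The sketch below is a worked illustration under assumptions that the article does not confirm: that the "number of variables" reported for each model equals its number of estimated slope parameters, that the smaller model is nested in the larger one, and that each extra variable costs one degree of freedom (dummy-coded genotypes could cost more). The function names are hypothetical, and the chi-square tail probability is implemented only for df = 1 or even df.

```python
import math


def lr_statistic_from_aic(aic_small, k_small, aic_large, k_large):
    """Difference in -2lnL between nested models, using AIC = 2k - 2lnL."""
    return (aic_small - 2 * k_small) - (aic_large - 2 * k_large)


def chi2_sf(x, df):
    """Upper-tail chi-square probability (closed forms for df = 1 or even df)."""
    if x < 0:
        raise ValueError("the test statistic must be non-negative")
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df > 0 and df % 2 == 0:
        half = x / 2
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    raise ValueError("this sketch supports only df = 1 or even df")


# Model 9 (8 variables) versus model 8 (9 variables): one degree of freedom
lr_9_vs_8 = lr_statistic_from_aic(312.68, 8, 312.44, 9)
print(lr_9_vs_8, chi2_sf(lr_9_vs_8, df=1))

# Model 6 (5 variables) versus model 8 (9 variables): four degrees of freedom
lr_6_vs_8 = lr_statistic_from_aic(320.50, 5, 312.44, 9)
print(lr_6_vs_8, chi2_sf(lr_6_vs_8, df=4))
```

Under these assumptions the first comparison gives a P value above 0.05 and the second a P value below 0.05, which agrees in direction with the lrtest results reported in the text.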
Table 4
Models | Construction method | Inclusion of variables | Screening P value | Number of variables | AIC | BIC |
---|---|---|---|---|---|---|
Model 1 | Enter method | All variables | – | 17 | 325.71 | 406.32 |
Model 2 | Enter method (dummy variable) | All variables | – | 17 | 327.26 | 422.75 |
Model 3 | Enter method (dummy variable) | rs4435111, rs3814055, monocyte%, PLT, ALT, AST, UA, ALP, TBIL, DBIL, Alb, Glu, TG | – | 13 | 350.91 | 408.60 |
Model 4 | Enter method (dummy variable) | rs4435111, rs3814055, monocyte%, ALT, AST, UA, TBIL | – | 7 | 352.18 | 389.40 |
Model 5 | Stepwise method | All variables | 0.2 | 9 | 312.44 | 352.74 |
Model 6 | Stepwise method | All variables | 0.05 | 5 | 320.50 | 344.68 |
Model 7 | Stepwise method | All variables | 0.3 | 11 | 313.42 | 361.78 |
Model 8 | Stepwise method | All variables | 0.2 | 9 | 312.44 | 352.74 |
Model 9 | Stepwise method | All variables | 0.05 | 8 | 312.68 | 348.95 |
Model 10 | Stepwise method | All variables | 0.3 | 13 | 315.68 | 372.11 |
Model 11 | Enter method (dummy variable/interaction) | Fever, rs4435111, rs3814055, monocyte%, ALT, AST, UA, TBIL | – | 10 | 315.01 | 359.16 |
AIC, Akaike’s information criterion; BIC, Bayesian information criterion; PLT, platelet; ALT, alanine transaminase; AST, aspartate aminotransferase; ALP, alkaline phosphatase; TBIL, total bilirubin; DBIL, direct bilirubin; Glu, glucose; TG, triglyceride; UA, uric acid; Alb, albumin.
Table 5
Characteristic variable | β | OR | 95% CI of β, lower limit | 95% CI of β, upper limit | P |
---|---|---|---|---|---|
Fever | 0.7491207 | 2.115 | 0.148 | 1.349 | 0.015 |
rs4435111* | −1.373078 | 0.253 | −2.090 | −0.65 | <0.001 |
rs3814055* | −0.5692482 | 0.565 | −1.09 | −0.04 | 0.033 |
Albumin | 0.0676586 | 1.070 | 0.017 | 0.117 | 0.008 |
Alanine aminotransferase | 0.0662291 | 1.068 | 0.036 | 0.096 | <0.001 |
Aspartate aminotransferase | 0.0503438 | 1.051 | 0.004 | 0.096 | 0.032 |
Uric acid | −0.0023242 | 0.997 | −0.005 | 0.00015 | 0.036 |
Percentage of monocytes | 0.1458457 | 1.157 | 0.028 | 0.262 | 0.014 |
*, both were set up as dummy variables and modeled in layers according to their interaction. The confidence limits are given on the β (log-odds) scale. OR, odds ratio; CI, confidence interval.
Test set model predictive efficacy analysis
The model had a discrimination C-index of 0.816, sensitivity of 34.25%, specificity of 97.99%, positive predictive value of 78.13%, and negative predictive value of 87.69%, as shown in Figure 3. In the model consistency test, the Spiegelhalter Z-test gave S:P =0.896, with maximum absolute difference Emax =0.147 and average absolute difference Eave =0.017, as shown in Figure 4.
Validation set model building and effectiveness analysis
Logistic regression models were re-created in the validation set using the regression coefficients from the test set model, giving the predicted probability of ATDH: P(ATDH) = 1/{1+exp[−(−3.661122 + 0.7491207 × fever + 0.0676586 × Albumin − 0.0023242 × uric acid + 0.1458457 × monocyte% + 0.050343 × AST + 0.0662291 × ALT − 1.373078 × rs4435111 − 0.5698482 × rs3814055)]}.
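The logistic equation above can be evaluated with a few lines of code. This is a sketch for illustration only, not a validated clinical calculator. The coefficients are copied from the equation as printed (Table 5 lists marginally different trailing digits for two of them). The dictionary keys, the 0/1 coding of fever and of the two genotype terms, and the units (albumin in g/L, uric acid in μmol/L, monocytes in %, AST and ALT in IU/L) are assumptions, since the article does not state how the variables enter the equation, so absolute outputs should not be interpreted.

```python
import math

INTERCEPT = -3.661122
# Coefficients as printed in the equation; the keys are hypothetical names.
COEFFICIENTS = {
    "fever": 0.7491207,         # assumed coding: 1 = fever, 0 = no fever
    "albumin": 0.0676586,       # assumed unit: g/L
    "uric_acid": -0.0023242,    # assumed unit: umol/L
    "monocyte_pct": 0.1458457,  # assumed unit: %
    "ast": 0.050343,            # assumed unit: IU/L
    "alt": 0.0662291,           # assumed unit: IU/L
    "rs4435111": -1.373078,     # assumed coding: 0/1 dummy variable
    "rs3814055": -0.5698482,    # assumed coding: 0/1 dummy variable
}


def predicted_probability(patient):
    """Logistic prediction: 1 / (1 + exp(-(intercept + sum(beta * x))))."""
    linear_predictor = INTERCEPT + sum(
        beta * patient[name] for name, beta in COEFFICIENTS.items()
    )
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Usage with a made-up patient record
example = {
    "fever": 1, "albumin": 38.0, "uric_acid": 300.0, "monocyte_pct": 7.0,
    "ast": 20.0, "alt": 15.0, "rs4435111": 0, "rs3814055": 0,
}
print(predicted_probability(example))
```

Keeping the coefficients in one mapping makes it easy to check them against the regression table and to swap in a different variable coding once it is known.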
The fit of this model was consistent with that constructed from the test set data (Hosmer-Lemeshow test P=0.4636). In the validation set the model ROC curve analysis discrimination C-index was 0.7189, the specificity 97.77%, negative predictive value 86.21%, sensitivity 15.15%, and positive predictive value 55.56%, as shown in Figure 5, and the calibration curve validation maximum absolute difference Emax =0.101 and average absolute difference Eave =0.009, with the Spiegelhalter Z-test for calibration accuracy S:P =0.929, as shown in Figure 6.
Building the nomogram
The nomogram was established according to this prediction model. Because of the interaction between rs3814055 and rs4435111, the genotypes of rs3814055 were stratified, and because the different genotypes of rs3814055 and rs4435111 carried unequal predicted risks of ATDH, they were entered as dummy variables. The nomogram is shown in Figure 7, with predicted probabilities between 0.1–0.7 for total points between 170–210.
Decision curve analysis of the prediction model
The clinical decision curve for the ATDH prediction model is shown in Figure 8. The model has value for clinical use when the risk threshold ranges between 0.1 and 0.8.
Discussion
In this study, the ATDH prediction model was constructed using machine learning algorithms to screen eight predictors from general clinical characteristics, laboratory indications, and genetic characteristics. The construction process strictly followed the statement on clinical prediction models, in three steps: developing the prediction model, validating the prediction model, and studying the clinical significance of the model (30). Although the model had moderate specificity, discrimination, consistency, and clinical application, the sensitivity needs to improve.
In this study, Lasso regression, a machine learning algorithm, was used for the pre-screening of model feature variables. Lasso regression is beneficial in constructing prediction models because it balances the bias-variance trade-off while integrating a large amount of data in different dimensions. Compared to conventional stepwise logistic regression, it offers fast analysis, stability, and easy interpretation of results (33). Therefore, this study first used Lasso regression for data pre-screening and then used one-way logistic regression to filter out 17 candidate variables (24).
Multivariate logistic regression was applied to the test set data, with the lowest AIC value and the smallest number of predictors as the selection criteria for the optimal model (34). The included predictors were fever, rs3814055, rs4435111, albumin, ALT, AST, uric acid, and monocyte percentage.
Visualization of the optimized model revealed ALT, AST, albumin, monocyte percentage, and fever as independent predictors of ATDH, suggesting that basal liver function status and immune status were associated with ATDH susceptibility in TB patients. Meanwhile, the nomogram visualized the interaction between rs3814055 and rs4435111: when rs3814055 was the CC genotype, the risk of ATDH was significantly increased, while the TT genotype of rs4435111 could reduce the risk of ATDH. When rs3814055 was the TT genotype, the overall risk of ATDH decreased, while the TT genotype of rs4435111 increased the risk of ATDH. In the genetic analysis, the rs4435111 TT genotype was associated with a reduced risk of ATDH, although the nomogram showed that its score for predicting the probability of ATDH was influenced by the genotype at the rs3814055 locus. Possible reasons for this are as follows: (I) both rs3814055 and rs4435111 have relatively few TT genotypes, leading to bias in data from small samples; and (II) there is a complex higher-order and second-order multiplicative interaction of the PXR gene with the FOXO1 gene (6). Because the rs3814055 genotype carried greater weight in ATDH susceptibility, it interfered with the assessment of the effect of rs4435111.
The model had a C-index of 0.8164 for discrimination in the test set, and the consistency (calibration) test gave S:P = 0.896, Emax = 0.147, and Eave = 0.017, suggesting that both the discrimination and the consistency of the model were good. Because random and systematic errors in the cross-validation data from different training sets can cause overfitting and inflate variance, the model fit needs to be confirmed in the validation set data. The fit of the model constructed from the validation set data was consistent with that of the model constructed from the test set data, and its discrimination was of moderate strength, indicating that Lasso regression was effective in preventing overfitting from degrading the fit in a new sample set. Further clinical decision curve analysis revealed that the model was of good value for clinical use when the high-risk threshold was between 0.1 and 0.8.
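For readers less familiar with these measures, the following is a minimal sketch of how a C-index and the net benefit used in decision curve analysis are computed. The outcome labels and predicted probabilities are hypothetical toy values, not study data, and the net benefit formula is the standard one (true positives minus false positives weighted by the threshold odds, per patient).

```python
def c_index(y_true, y_prob):
    """Proportion of case/non-case pairs in which the case has the higher predicted risk.

    Ties in predicted risk count as half-concordant.
    """
    concordant = 0.0
    pairs = 0
    for yi, pi in zip(y_true, y_prob):
        for yj, pj in zip(y_true, y_prob):
            if yi == 1 and yj == 0:
                pairs += 1
                if pi > pj:
                    concordant += 1.0
                elif pi == pj:
                    concordant += 0.5
    return concordant / pairs


def net_benefit(y_true, y_prob, threshold):
    """Net benefit at a given risk threshold: TP/n - FP/n * pt/(1 - pt)."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


# Hypothetical toy data: 2 cases and 3 non-cases.
y_true = [1, 1, 0, 0, 0]
y_prob = [0.9, 0.4, 0.5, 0.2, 0.1]

c = c_index(y_true, y_prob)
nb = net_benefit(y_true, y_prob, threshold=0.3)
print("C-index:", c)
print("net benefit at threshold 0.3:", nb)
```

A decision curve repeats the net benefit calculation across a range of thresholds and compares the model with the treat-all and treat-none strategies; the threshold range over which the model's curve lies above both is the range in which it is clinically useful.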
However, the model’s sensitivity was 34.25% and 15.15% in the test and validation sets, respectively, and its specificity was 97.99% and 97.77%, with positive predictive values of 78.13% and 55.56% and negative predictive values of 87.69% and 86.21%, suggesting that its predictive sensitivity needs to be improved.
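The four measures above all derive from a 2x2 confusion matrix, as the sketch below shows. The cell counts are not reported in this section: they are our own back-calculation, chosen because they reproduce the reported percentages to two decimal places, and should be treated as an assumption rather than as study data.

```python
def diagnostic_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity, PPV, and NPV (as percentages) from confusion-matrix counts."""
    return {
        "sensitivity": 100.0 * tp / (tp + fn),
        "specificity": 100.0 * tn / (tn + fp),
        "ppv": 100.0 * tp / (tp + fp),
        "npv": 100.0 * tn / (tn + fn),
    }


# ASSUMPTION: counts inferred from the reported percentages, not taken from the paper.
test_set = diagnostic_metrics(tp=25, fn=48, fp=7, tn=342)
validation_set = diagnostic_metrics(tp=5, fn=28, fp=4, tn=175)

for name, metrics in (("test set", test_set), ("validation set", validation_set)):
    summary = ", ".join(f"{k}={v:.2f}%" for k, v in metrics.items())
    print(f"{name}: {summary}")
```

Under these assumed counts, most ATDH cases fall below the classification cut-off (false negatives far outnumber true positives), which is what a low sensitivity paired with a high specificity looks like in practice.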
The possible reasons for the good predictive specificity and poor sensitivity of the model are as follows: (I) Low incidence of ATDH. In this study, there were 118 cases in the ATDH group and 628 cases in the non-ATDH group, and the incidence of ATDH was 15.81%. We randomly divided all TB patients into the test set (81 cases in the ATDH group) and the validation set (37 cases in the ATDH group) in a 3:2 ratio; the ATDH incidence in the two sets was 18.57% and 14.45%, corresponding to sensitivities of 34.25% and 15.15%, respectively. The low incidence of ATDH in the data used to construct the model may be one of the important reasons for its poor sensitivity. (II) Lack of strong predictors. The predictors selected by Lasso regression and univariate logistic regression for inclusion in the model were general clinical features (fever), routine laboratory indicators (ALT, AST, albumin, monocyte percentage), and genetic indicators (genotypes of rs3814055 and rs4435111). Although these predictors are objective tests, they are all markers related to the mechanisms of ATDH occurrence rather than specific markers. Metabolomic and microbiome studies indicate that the metabolic and microbial profiles of ATDH also differ from those of non-ATDH (35). This study, in turn, found that gene polymorphisms were correlated with the occurrence of ATDH and that different genes interacted. Given that the mechanism of ATDH has not yet been elucidated, exploring the target molecules involved in its occurrence and development [such as N-acetyltransferase (NAT), glutathione S-transferase (GST), and CYP450] as predictors will help to improve predictive power.
Therefore, based on our established prediction model for ATDH, it can be concluded that (I) the machine learning algorithm Lasso regression helps to screen a large number of candidate variables simultaneously and meets the requirements of the variance trade-off through bootstrap resampling and cross-validation, thereby avoiding overfitting; (II) SNPs are promising predictors, and building prediction models from multi-gene SNP features rather than single-gene SNPs can improve predictive efficacy and clinical applicability; and (III) simultaneous modeling of multi-gene SNPs requires consideration of the impact of interactions on the model's predictive efficacy. Future research should validate the model in larger and different populations while adding as many key genes or clinical data as possible to increase the sensitivity of the model.
Acknowledgments
Funding: This study was supported by the Sichuan Medical Research Project (No. S21058), the Science and Technology Project of the Health Planning Committee of Sichuan (No. 19PJ163), and the Chengdu Medical Research Project (No. 2019067 and No. 2018059).
Footnote
Reporting Checklist: The authors have completed the STARD reporting checklist. Available at https://atm.amegroups.com/article/view/10.21037/atm-22-4551/rc
Data Sharing Statement: Available at https://atm.amegroups.com/article/view/10.21037/atm-22-4551/dss
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://atm.amegroups.com/article/view/10.21037/atm-22-4551/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013). The study was approved by the Ethics Committee of West China Hospital, Sichuan University (Registration No. 2014198). All experimental subjects have signed the informed consent form.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Chakaya J, Khan M, Ntoumi F, et al. Global Tuberculosis Report 2020 - Reflections on the Global TB burden, treatment and prevention efforts. Int J Infect Dis 2021;113:S7-S12.
- Jeremiah C, Petersen E, Nantanda R, et al. The WHO Global Tuberculosis 2021 Report - not so good news and turning the tide back to End TB. Int J Infect Dis 2022; Epub ahead of print.
- Bao Y, Ma X, Rasmussen TP, et al. Genetic Variations Associated with Anti-Tuberculosis Drug-Induced Liver Injury. Curr Pharmacol Rep 2018;4:171-81.
- Zhang J, Liu X, He H, et al. Influence of HNF4α and HNF4α-AS1 gene variants on the risk of anti-tuberculosis drugs-induced hepatotoxicity. Ann Palliat Med 2021;10:11733-44.
- Huang YS. Recent progress in genetic variation and risk of antituberculosis drug-induced liver injury. J Chin Med Assoc 2014;77:169-73.
- Zhang J, Jiao L, Song J, et al. Genetic and Functional Evaluation of the Role of FOXO1 in Antituberculosis Drug-Induced Hepatotoxicity. Evid Based Complement Alternat Med 2021;2021:3185874.
- Lyoumi S, Lefebvre T, Karim Z, et al. PXR-ALAS1: a key regulatory pathway in liver toxicity induced by isoniazid-rifampicin antituberculosis treatment. Clin Res Hepatol Gastroenterol 2013;37:439-41.
- Podvinec M, Handschin C, Looser R, et al. Identification of the xenosensors regulating human 5-aminolevulinate synthase. Proc Natl Acad Sci U S A 2004;101:9127-32.
- Fraser DJ, Zumsteg A, Meyer UA. Nuclear receptors constitutive androstane receptor and pregnane X receptor activate a drug-responsive enhancer of the murine 5-aminolevulinic acid synthase gene. J Biol Chem 2003;278:39392-401.
- Thunell S. (Far) Outside the box: genomic approach to acute porphyria. Physiol Res 2006;55:S43-66.
- Zhang J, Zhao Z, Bai H, et al. Genetic polymorphisms in PXR and NF-κB1 influence susceptibility to anti-tuberculosis drug-induced liver injury. PLoS One 2019;14:e0222033.
- Steyerberg EW, Vergouwe Y. Towards better clinical prediction models: seven steps for development and an ABCD for validation. Eur Heart J 2014;35:1925-31.
- Pontual Y, Pacheco VSS, Monteiro SP, et al. ABCB1 gene polymorphism associated with clinical factors can predict drug-resistant tuberculosis. Clin Sci (Lond) 2017;131:1831-40.
- Mushiroda T, Yanai H, Yoshiyama T, et al. Development of a prediction system for anti-tuberculosis drug-induced liver injury in Japanese patients. Hum Genome Var 2016;3:16014.
- Chamorro JG, Castagnino JP, Aidar O, et al. Effect of gene-gene and gene-environment interactions associated with antituberculosis drug-induced hepatotoxicity. Pharmacogenet Genomics 2017;27:363-71.
- Mahomed S, Padayatchi N, Singh J, et al. Precision medicine in resistant Tuberculosis: Treat the correct patient, at the correct time, with the correct drug. J Infect 2019;78:261-8.
- Guo BL, Ouyang FS, Yang SM, et al. Development of a preprocedure nomogram for predicting contrast-induced acute kidney injury after coronary angiography or percutaneous coronary intervention. Oncotarget 2017;8:75087-93.
- Zhang J, Zhao Z, Bai H, et al. The Variant at TGFBRAP1 but Not TGFBR2 Is Associated with Antituberculosis Drug-Induced Liver Injury. Evid Based Complement Alternat Med 2019;2019:1685128.
- Yang M, Qiu Y, Jin Y, et al. NR1I2 genetic polymorphisms and the risk of anti-tuberculosis drug-induced hepatotoxicity: A systematic review and meta-analysis. Pharmacol Res Perspect 2020;8:e00696.
- WHO. Common Terminology Criteria for Adverse Events (CTCAE) Version 5.0. U.S. Department of Health and Human Services, 2017:1-155.
- Tostmann A, Boeree MJ, Aarnoutse RE, et al. Antituberculosis drug-induced hepatotoxicity: concise up-to-date review. J Gastroenterol Hepatol 2008;23:192-202.
- Huang YQ, Liang CH, He L, et al. Development and Validation of a Radiomics Nomogram for Preoperative Prediction of Lymph Node Metastasis in Colorectal Cancer. J Clin Oncol 2016;34:2157-64.
- Stone GW, Maehara A, Lansky AJ, et al. A prospective natural-history study of coronary atherosclerosis. N Engl J Med 2011;364:226-35.
- Kang SJ, Cho YR, Park GM, et al. Predictors for functionally significant in-stent restenosis: an integrated analysis using coronary angiography, IVUS, and myocardial perfusion imaging. JACC Cardiovasc Imaging 2013;6:1183-90.
- Akaike H. Data analysis by statistical models. No To Hattatsu 1992;24:127-33.
- Jaddoe VW, de Jonge LL, Hofman A, et al. First trimester fetal growth restriction and cardiovascular risk factors in school age children: population based cohort study. BMJ 2014;348:g14.
- Qiao F, Fu K, Zhang Q, et al. The association between missing teeth and non-alcoholic fatty liver disease in adults. J Clin Periodontol 2018;45:941-51.
- Coutant C, Olivier C, Lambaudie E, et al. Comparison of models to predict nonsentinel lymph node status in breast cancer patients with metastatic sentinel lymph nodes: a prospective multicenter study. J Clin Oncol 2009;27:2800-8.
- Vickers AJ, Cronin AM, Elkin EB, et al. Extensions to decision curve analysis, a novel method for evaluating diagnostic tests, prediction models and molecular markers. BMC Med Inform Decis Mak 2008;8:53.
- Moons KG, Altman DG, Reitsma JB, et al. Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): explanation and elaboration. Ann Intern Med 2015;162:W1-73.
- Hu X, Zhang M, Bai H, et al. Antituberculosis Drug-Induced Adverse Events in the Liver, Kidneys, and Blood: Clinical Profiles and Pharmacogenetic Predictors. Clin Pharmacol Ther 2018;104:326-34.
- Lu T, He L, Zhang B, et al. Percutaneous mastoid electrical stimulator improves Poststroke depression and cognitive function in patients with Ischaemic stroke: a prospective, randomized, double-blind, and sham-controlled study. BMC Neurol 2020;20:217.
- Tibshirani R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B (Methodology) 1996;58:267-88.
- Vrieze SI. Model selection and psychological theory: a discussion of the differences between the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Psychol Methods 2012;17:228-43.
- Wu S, Wang M, Zhang M, et al. Metabolomics and microbiomes for discovering biomarkers of antituberculosis drugs-induced hepatotoxicity. Arch Biochem Biophys 2022;716:109118.
(English Language Editor: B. Draper)