Harmonizing across datasets to improve the transferability of drug combination prediction


A framework of intra- and inter-study machine learning prediction

Our goal is to test the capability of machine learning models to predict combination treatment response, not only within a single study but also across different studies and on unseen drug combinations. To achieve this goal, we first explore the publicly available high-throughput screening datasets for anti-cancer combination treatments to build a gold standard for our experiments. We use the latest version of the DrugComb portal, which contains the most comprehensive publicly available drug combination high-throughput screening datasets, covering 24 independent studies7,8. Among them, we select four major datasets: ALMANAC, O’Neil, FORCINA, and Mathews, as they are the largest and therefore commonly used in machine learning prediction of combination responses13,18,19,24,27,28,29. These four studies contain a total of 406,479 drug combination experiments, 9,163 drugs, and 92 cell lines, while their size, drug and cell line composition, and experimental settings vary significantly (Supplementary Table 1). Of the four datasets, ALMANAC is the largest, with the most drug-cell line combinations, and FORCINA has the largest number of drugs screened. O’Neil has the highest quality, as all of its combinations are tested with four replicates, whereas ALMANAC tests at most three replicates per combination and Mathews tests two. In contrast, the FORCINA dataset contains no replicates.

We carried out a two-step cross-study validation (Fig. 1a). First, we trained dataset-specific models and carried out intra-study cross-validation. The cross-validation was set up so that the training and testing sets in this step do not share the same treatment-cell line combinations. Therefore, we aimed to test the performance of machine learning models in predicting unseen combination treatments within the same study. Next, during the inter-study cross-validation step, we tested these dataset-specific models on new individual datasets, which are denoted as “1 vs 1” in Fig. 1a. Furthermore, to explore more versatile cross-study scenarios, we designed a “3 vs 1” cross-validation strategy by combining three of the four datasets as the training set and the remaining one as the test set.
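The intra-study constraint above (training and test folds never share a treatment-cell line combination) can be sketched as a grouped K-fold split. This is an illustrative sketch only: the column names (`drug1`, `drug2`, `cell_line`) and the toy data are assumptions, not the study's actual schema.

```python
# Hypothetical sketch: five-fold intra-study CV where no fold shares a
# treatment-cell line combination with another. Column names are illustrative.
import pandas as pd
from sklearn.model_selection import GroupKFold

def intra_study_folds(df: pd.DataFrame, n_splits: int = 5):
    """Yield (train_idx, test_idx) pairs grouped by combination identity."""
    # One group per (drug1, drug2, cell_line) triple, so every replicate of a
    # combination lands entirely in either the training or the test fold.
    groups = df["drug1"] + "|" + df["drug2"] + "|" + df["cell_line"]
    gkf = GroupKFold(n_splits=n_splits)
    yield from gkf.split(df, groups=groups)

toy = pd.DataFrame({
    "drug1": ["A", "A", "B", "B", "C", "C", "D", "D", "E", "E"],
    "drug2": ["X", "X", "Y", "Y", "Z", "Z", "X", "X", "Y", "Y"],
    "cell_line": ["c1"] * 10,
})
for train_idx, test_idx in intra_study_folds(toy):
    train_combos = set(zip(toy.loc[train_idx, "drug1"], toy.loc[train_idx, "drug2"]))
    test_combos = set(zip(toy.loc[test_idx, "drug1"], toy.loc[test_idx, "drug2"]))
    assert not train_combos & test_combos  # no shared combinations across folds
```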

Fig. 1: Overview of the framework on intra- and inter-study drug combination predictions.
figure 1

a The cross-validation strategy. We carry out the cross-validation in two steps: intra-study, a five-fold cross-validation carried out within a single dataset, where the training and test sets are split by drug combination and cell line, and inter-study, which is carried out between different datasets. The models used in the 1 vs. 1 inter-study cross-validations are the dataset-specific models generated in the intra-study training step. For the 3 vs. 1 inter-study cross-validation, three of the four datasets are combined and used as the training set to generate five models by five-fold cross-validation, which are then tested on the remaining dataset. b The overlapping information (drugs, cell lines, and treatment-cell line combinations) between the four datasets used in this study. c The schematic of model construction in this study. We use four different data sources to generate the machine learning models used in this study. For drug-related features, we use the chemical structure, the monotherapy efficacy score, and the corresponding dose–response relationship. For the treated cancer cell lines, we use the transcription levels of 293 cancer-related genes. The constructed features are input into a lightGBM learner to generate models predicting the six different response metrics of the combination treatment: CSS, a sensitivity score representing the efficacy of the combination, and five synergy scores (S, Bliss, HSA, Loewe, and ZIP) representing the degree of interaction between the two drugs.

To analyze the potential for transferability, we determine the overlap of drugs, cancer cell lines, and treatment-cell line combinations between the four studies (Fig. 1b). While drugs overlap between all the studies, FORCINA and Mathews share no cell lines with the other datasets, since each of them includes only one unique cancer cell line. Overall, only 612 treatment-cell line combinations are shared between ALMANAC and O’Neil, providing reference data for evaluating the performance of cross-dataset prediction.

Using the replicates within each dataset and the overlapping treatment-cell line combinations between the datasets, we analyzed the reproducibility of a drug combination sensitivity score called CSS23, as well as multiple drug combination synergy scores, including S, Bliss, HSA, Loewe, and ZIP23,30. The intra- and inter-study reproducibility can serve as a benchmark for the drug combination prediction models we build in the next step (Supplementary Fig. 1). While no replicates exist in the FORCINA dataset, the O’Neil dataset showed the best intra-study replicability (Pearson’s r of 0.93 for CSS, 0.929 for S, 0.778 for Bliss, 0.777 for HSA, 0.938 for Loewe, and 0.752 for ZIP), possibly due to the relatively more abundant replicates in this study. When testing the overlapping treatment-cell line combinations between ALMANAC and O’Neil, as expected, all the drug combination synergy scores showed significant drops in replicability (Pearson’s r of 0.2 for S, 0.12 for Bliss, 0.18 for HSA, 0.25 for Loewe, and 0.09 for ZIP), while the CSS score still maintained a higher correlation (Pearson’s r of 0.342). The higher reproducibility of the CSS score, both within and across the studies, suggests that drug combination sensitivity is more reproducible than synergy, which may explain why most of the clinically approved drug combinations rely on their combinatorial efficacy rather than synergy31,32.

The above results highlight the challenges of predicting cross-dataset drug combinations, including (1) the scarcity of overlapping compounds and cell lines between studies, and (2) the variability in assay and experimental settings, such as the total number and ranges of doses. To combat these challenges, we propose a machine learning model using the following features (Fig. 1c): (1) for both drugs, we use chemical structure-derived fingerprints, which can be transferred to chemicals that may not be present in the training set; (2) we use pharmacodynamic properties, such as monotherapy efficacy scores and dose–response curves of the drugs, where the dose–response curves are normalized; and (3) we use the expression of 273 essential cancer genes33 to represent the molecular states of the cell lines. These features are fed into a lightGBM boosting model, as it has shown higher efficiency than other tree-based algorithms such as XGBoost and Random Forest when training on large datasets34. We evaluate the accuracy of predicting the six types of drug combination response scores (i.e., CSS, S, Bliss, HSA, Loewe, and ZIP).

Combating inter-study variability by integrating monotherapy efficacy and imputation of dose–response curves

We observe that experimental settings differ not only between studies but also within the same study (Supplementary Table 1 and Supplementary Fig. 2). For example, the dose–response matrix ranges from 2 × 2 (FORCINA) to 10 × 10 (Mathews), and within the O’Neil dataset both 4 × 4 and 4 × 6 dose–response matrices are used. Meanwhile, the dose ranges differ significantly within and between studies (Supplementary Table 1). For example, within the ALMANAC study, more than 40 different doses have been used (Supplementary Fig. 2), and the maximum dose tested for each drug differs due to its distinctive pharmacodynamic properties4. Therefore, we precalculate the replicability of monotherapy efficacy scores, in terms of IC50, RI, and the distribution statistics (maximum, minimum, mean, and median of all inhibitions in the dose–response curves) within and between datasets (Supplementary Fig. 3). We notice that RI and IC50 show comparable reproducibility within datasets, with Pearson’s r of RI ranging from 0.363 (within Mathews) to 1 (within O’Neil), while Pearson’s r of IC50 ranges from 0.537 (within ALMANAC) to 1 (within O’Neil). However, the replicability of IC50 is much lower than that of RI in the cross-dataset analysis (Pearson’s r = 0.084 for IC50 versus r = 0.451 for RI between ALMANAC and O’Neil). Most dose–response curve shape statistics show Pearson’s r better than or comparable with IC50 and RI, both within and between studies, suggesting their potential in cross-study prediction (Supplementary Fig. 3).
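The curve-shape statistics and the replicability measure used above are simple to compute; a minimal sketch follows. The function names are illustrative, and the toy inhibition values are not from the study.

```python
# Hypothetical sketch: distribution statistics of a monotherapy dose-response
# curve (max, min, mean, median of all inhibitions), and replicability as
# Pearson's r of a statistic between two replicate screens.
import numpy as np

def curve_statistics(inhibitions):
    """Summarize the shape of one dose-response curve."""
    x = np.asarray(inhibitions, dtype=float)
    return {
        "max": float(x.max()),
        "min": float(x.min()),
        "mean": float(x.mean()),
        "median": float(np.median(x)),
    }

def replicability(stat_a, stat_b):
    """Pearson's r between the same statistic computed on two replicates."""
    return float(np.corrcoef(stat_a, stat_b)[0, 1])

stats = curve_statistics([5.0, 20.0, 55.0, 90.0])
# stats: {"max": 90.0, "min": 5.0, "mean": 42.5, "median": 37.5}
```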

We start exploring drug combination response prediction based on monotherapy responses such as efficacy and dose–response curves (M1–M12, Fig. 2, Supplementary Figs. 4–10, and Supplementary Tables 2–5). Three types of features based on monotherapy responses are constructed, denoted as “monotherapy_efficacy”, “drc_baseline”, and “drc_imputation”. “monotherapy_efficacy” is a summary score of the curve (IC50 or RI) for each of the two drugs used independently on the treated cell lines, and has often been used in previous benchmark models in drug combination prediction challenges11,35,36. The other two features, “drc_baseline” and “drc_imputation”, are based on the exact dose–response relationships (Fig. 2a). The dose–response curve baseline model (M1) using the “drc_baseline” feature is equivalent to the method previously described by Zheng et al.8, where the doses and corresponding responses are vectorized and concatenated directly. Since the total number of doses varies significantly between datasets, the “drc_baseline” features are padded to the same length. For the “drc_imputation” feature, we interpolate all the dose–response curves to the same length for all the datasets (Fig. 2b). We test linear, Lagrange, and four-parameter log-logistic (LL4) interpolation (M2–M4, Supplementary Figs. 5 and 6). Among the three interpolation methods, linear interpolation performs the best in the intra-study cross-validation, while LL4 performs the best in the inter-study cross-validation. Furthermore, combining all three methods shows better performance in both scenarios, and is thus used in the final “imputation” model (M5, Supplementary Fig. 5b, d). Also, since using IC50 and RI together is generally better than using either alone in the intra- and inter-study cross-validations, the final monotherapy efficacy feature contains both measurements (M7–M9, Supplementary Figs. 7, 8).
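The core of the "drc_imputation" idea, resampling curves of different lengths onto a common grid, can be sketched with linear interpolation alone. This is an assumption-laden sketch: the log-spaced grid, the function name, and the toy curves are illustrative, and the study additionally uses Lagrange and LL4 interpolation.

```python
# Hypothetical sketch of drc_imputation: resample every monotherapy
# dose-response curve onto a common grid of n_max points (log-spaced between
# each curve's own min and max dose), so curves measured under different
# experimental settings yield feature vectors of identical length.
import numpy as np

def impute_curve(doses, responses, n_max=10):
    """Linearly interpolate a dose-response curve onto n_max log-spaced doses."""
    doses = np.asarray(doses, dtype=float)
    responses = np.asarray(responses, dtype=float)
    order = np.argsort(doses)
    log_d = np.log10(doses[order])
    grid = np.linspace(log_d[0], log_d[-1], n_max)
    return np.interp(grid, log_d, responses[order])

# A 4-dose curve and a 6-dose curve both become length-10 feature vectors.
short = impute_curve([0.01, 0.1, 1.0, 10.0], [2.0, 15.0, 60.0, 95.0])
long_ = impute_curve([0.01, 0.03, 0.1, 0.3, 1.0, 10.0], [1, 5, 14, 30, 62, 96])
assert short.shape == long_.shape == (10,)
```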
Five models using different combinations of the monotherapy response-based features described above are shown in Fig. 2. We notice that M12, which combines all three types of monotherapy features, performs slightly better in the intra-study cross-validation (101–102% performance ratio compared to the other models), while M5, the pure imputation model, performs the best in the inter-study cross-validations (107–115% performance ratio compared to the other models); this advantage is especially significant in the prediction of Bliss (112–113% performance ratio) and Loewe scores (119–138% performance ratio) (Supplementary Fig. 4). It is expected that M12 performs the best in the intra-study validation, since the un-imputed dose–response baseline features contain the doses used for dose–response evaluation. The doses chosen for monotherapy response evaluation can differ significantly between studies (Supplementary Table 2), thus causing biases in cross-study prediction. However, the monotherapy doses can still be effective for within-study prediction, since they contain unique experimental information for each drug. The imputation method, on the other hand, alleviates the biases in the experimental settings and is more universally transferable between different experimental settings; thus M5, which uses only imputed dose–response information, outperforms all other monotherapy-based models.

Fig. 2: Strategy to normalize the differences in inter-study experimental settings.
figure 2

a Demonstration of the different dose–response curve (drc) feature construction schemes. The drc baseline feature is defined as the direct concatenation of doses and corresponding responses, where the total number of doses and responses is padded with “-1” to the same length across different experimental settings. The drc imputation feature is the concatenation of imputed responses from different interpolation methods (see Methods for details). The monotherapy efficacy feature is the IC50 and RI of both drugs on the same cell line. b Schematic of the inter-study interpolation normalization of experimental settings. For experimental settings A, B, and C, which are tested using different total numbers of doses, N1, N2, and N3, we take the largest number of doses across all the studies, denoted as Nmax. Then, the dose–response information of each setting is interpolated to the same size, Nmax. c Performances evaluated by Pearson’s r for all models, each using a different combination of the three features. The top performance in each training set (top) and testing set (right) is denoted by “*”. d Heatmap showing the results from paired t tests between the performances of five models in intra- and inter-study cross-validation. The color in the heatmap shows the performance ratios (PRs) of the average performances between each model pair.

When comparing the monotherapy efficacy model directly with the dose–response curve-based models, interestingly, the efficacy model shows the best performance in inter-study prediction but the worst in intra-study prediction (Supplementary Figs. 9, 10). We notice that the efficacy model performs especially well when trained or tested on the FORCINA dataset, which adopts a 2 × 2 dose–response matrix design (Supplementary Fig. 9a). We reason that such a coarse dose–response relationship is less informative than the overall efficacy in this case, as imputation becomes unreliable with only two doses.

The imputation methods improve the benchmark models’ performance in cross-study prediction

Previously, the DrugComb study provided a state-of-the-art model using the O’Neil dataset, by integrating one-hot encoding of drugs and cell lines as well as drug chemical fingerprints, drug doses, and cell line gene expressions in the model construction8. In this study, we construct a benchmark model based on their schemes, by encoding the chemical structure properties and molecular profiles of drugs and cell lines in the feature set, and explore if the imputation method of the dose–response curve can further improve the prediction accuracy across different individual datasets (Fig. 3 and Supplementary Fig. 11).

Fig. 3: Normalized dose–response information improves the intra- and inter-study prediction performances of benchmark models.
figure 3

a Schematic of the step-by-step feature construction strategy from the benchmark models (M13–M15) to the dose–response-curve-incorporated models (M16 and M20). b Performances in all the training and testing scenarios for M1–M5. The best-performing models are denoted by “*”. c Comparisons of the performances of M1–M5 from paired t tests. The performance ratios (PRs) between model pairs are shown in different colors.

We construct five models step-by-step: starting from the label information (categorical encoding of both drugs and cell lines), we add the chemical structures of both drugs encoded by molecular fingerprints and cell line cancer gene expression, then the monotherapy efficacy, and finally the dose–response curve baseline feature and imputation feature, respectively. The performances of all models (M13–M20) are listed in Supplementary Tables 2–5, and five of them (M13–M16 and M20) are shown for the main comparison (Fig. 3a).

We notice that, while the benchmark models with only information directly from drugs and cell lines (M13 and M14) still achieve decent performances, around the experimental reproducibility levels, in intra-study cross-validation (Supplementary Table 2), neither model achieves better-than-random performance in cross-study predictions, due to the lack of shared drugs and cell lines across studies (Fig. 3b and Supplementary Table 3). Incorporating pharmacological properties such as monotherapy activity on the same cell lines (M15) improves the intra-study and inter-study prediction performances to 178% and 1299%, respectively, compared to the reference model (M13), showing the robustness of monotherapy efficacy information between studies (Fig. 3c). Adding the monotherapy baseline information (M16) further improves the inter-study performance but not the intra-study performance, possibly for the reason mentioned in the previous section: the baseline information contains the dose settings, which are a dataset-exclusive artifact. Furthermore, adding the imputed information (M20) further improves the performances in both intra- and inter-study cross-validation, to 184% in the intra-study cross-validation and 1367% in the inter-study validation (Fig. 3c). This improvement is consistent across all the drug combination sensitivity and synergy scores, with 1187% in CSS, 2141% in Bliss, 949% in HSA, 2257% in Loewe, 723% in ZIP, and 2019% in S score, respectively (Supplementary Fig. 11b). Notably, the models achieve better performances than experimental replicates within and between studies (Supplementary Tables 2–5). We conclude that the imputed dose–response curve contains information orthogonal to the monotherapy efficacy, which can be effectively used to improve the prediction of combination treatment response by overcoming the variability between different experimental settings.

To understand which information plays the most important role in the inter-study prediction, we carry out SHAP (SHapley Additive exPlanations) analysis to visualize the contribution of all the features in the best-performing model (M20, Fig. 3). As expected, the dose–response curve-derived feature shows significant SHAP importance and remains the top feature for all the drug combination response score predictions, while the monotherapy efficacy score also shows significant importance in the S score prediction (Supplementary Fig. 12). We then analyze the contributions of the dose–response imputation features specifically and notice that the imputed responses at the beginning and end of the curve show significant importance in the prediction, suggesting that the minimum and maximum responses of the monotherapies are informative for predicting the drug combination response (Supplementary Fig. 13).

To demonstrate the robustness of our models in broader inter-dataset validation settings, we carry out 3 vs. 1 cross-validation experiments based on the four datasets used in this study (Fig. 4 and Supplementary Figs. 14–17). For each training and test setting, we combine three datasets as the training set and test the model on the remaining dataset. We expect that using a multi-sourced training set can improve model performance by including more types of drugs and cell lines in the training instances; thus, the training data can contain more information transferable to new datasets. As expected, M20, the optimal model in the 1 vs. 1 inter-study cross-validation settings (the baseline model plus the dose–response curve imputation feature), shows the same advantages over the other models, with 910% performance compared to M1 and 1544% performance compared to M2 (Fig. 4b).
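The 3 vs. 1 scheme is a leave-one-dataset-out loop, sketched below. The dataset names mirror the study, but the frame contents are toy placeholders, not the actual screening data.

```python
# Hypothetical sketch of the "3 vs 1" scheme: for each of the four datasets,
# the other three are pooled into a training set and the chosen one is held
# out as the test set.
import pandas as pd

datasets = {
    "ALMANAC": pd.DataFrame({"x": [1, 2], "y": [0.1, 0.2]}),
    "ONEIL":   pd.DataFrame({"x": [3, 4], "y": [0.3, 0.4]}),
    "FORCINA": pd.DataFrame({"x": [5, 6], "y": [0.5, 0.6]}),
    "MATHEWS": pd.DataFrame({"x": [7, 8], "y": [0.7, 0.8]}),
}

def three_vs_one(datasets):
    """Yield (held_out_name, train_df, test_df) for each leave-one-out split."""
    for held_out in datasets:
        train = pd.concat(
            [df for name, df in datasets.items() if name != held_out],
            ignore_index=True,
        )
        yield held_out, train, datasets[held_out]

splits = list(three_vs_one(datasets))
assert len(splits) == 4  # one split per held-out dataset
```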

Fig. 4: Comparison of performances before and after incorporating dose–response curve into the baseline model in inter-study predictions.
figure 4

Models were trained using three datasets and then tested on the remaining dataset. The models follow the same definitions as in Fig. 3. a Performances of the machine learning models in the 3 vs. 1 training-testing settings. For each comparison, the training set includes the three studies shown on the top, while the test set contains the one study shown on the right. Top performances are marked by “*”. b Pairwise comparison of the performances of the five models, showing the performance ratios (PRs) and their p-values (paired t test).

