Background: The coronavirus disease-2019 (COVID-19) pandemic, caused by severe acute respiratory syndrome-coronavirus-2 (SARS-CoV-2), has urgently necessitated effective therapeutic solutions, with a focus on rapidly identifying and classifying potential small-molecule drugs. Given traditional methods’ labor-intensive and time-consuming nature, deep learning has emerged as an essential tool for efficiently processing and extracting insights from complex biological data.
Aims: To apply deep learning techniques, particularly deep neural networks (DNN) combined with the synthetic minority oversampling technique (SMOTE), to improve the classification of binding activities in anti-SARS-CoV-2 molecules across various bioassays.
Methods: We used 11 bioassay datasets covering various SARS-CoV-2 interactions and inhibitory mechanisms. These assays ranged from spike-ACE2 protein-protein interaction to ACE2 enzymatic activity and 3CL enzymatic activity. To address the prevalent class imbalance in these datasets, SMOTE was employed to generate new samples for the minority class. In our model-building approach, we divided each dataset into 80% training and 20% test sets, reserving 10% of the training set for validation. We employed a DNN that integrates ReLU and sigmoid activation functions, incorporates batch normalization, and uses Adam optimization. The hyperparameters and architecture of the DNN were optimized by testing different numbers of layers, minibatch sizes, numbers of epochs, and learning rates. A 40% dropout rate was incorporated to mitigate overfitting. For model evaluation, we computed performance metrics, such as balanced accuracy (BACC), precision, recall, F1 score, Matthews’ correlation coefficient (MCC), and area under the curve (AUC).
Results: The performance of the DNN across 11 bioassay test sets revealed varying outcomes, significantly influenced by the ratios of active-to-inactive compounds. Assays, such as AlphaLISA and CoV-PPE, demonstrated robust performance across various metrics, including BACC, precision, recall, and AUC, when configured with more balanced ratios (1:3 and 1:1, respectively). This suggests the effective identification of active compounds in both cases. In contrast, assays with higher imbalance ratios, such as 3CL (1:38) and cytopathic effect (1:15), demonstrated higher recall but lower precision, highlighting challenges in accurately identifying active compounds among numerous inactive compounds. However, even in these challenging settings, the model achieved favorable BACC and recall scores. Overall, the DNN model generally performed well, as indicated by the BACC, MCC, and AUC values, especially when considering the degree of dataset imbalance in each assay.
Conclusion: This study demonstrates the significant impact of deep learning, particularly DNN models enhanced with SMOTE, in improving the identification of active compounds in bioassay datasets for COVID-19 drug discovery, outperforming traditional machine learning models. Furthermore, this study highlights the efficacy of advanced computational techniques in addressing high-throughput screening data imbalances.
The outbreak of the novel coronavirus, severe acute respiratory syndrome-coronavirus-2 (SARS-CoV-2), causing coronavirus disease-2019 (COVID-19), has posed unprecedented challenges to global health, triggering an urgent need for effective therapeutic interventions.1 This scenario underscores the importance of rapidly identifying and developing potential therapeutic agents, particularly small molecules with promising binding activity against various viral targets. These molecules are important in the fight against the pandemic. Accurate classification of these molecules is pivotal for accelerating drug discovery and development processes. In this critical quest, the role of computational tools, especially machine learning (ML) models, has become increasingly valuable. Offering the potential to significantly accelerate the identification and classification of promising drug candidates, these tools have emerged as key contributors in the battle against COVID-19, enabling researchers to efficiently analyze complex biological data and optimize experimental efforts through the predictive power of advanced algorithms.2,3
In antiviral research, traditional methods for classifying the binding activities of small molecules, while effective, often involve labor-intensive and time-consuming experimental procedures.4 The advent of quantitative high-throughput screening (qHTS) techniques has revolutionized this landscape by generating extensive datasets. However, this abundance of data presents a dual challenge: its complexity and its volume often exceed the processing capabilities of conventional analytical methods.5 ML methods, and more specifically deep learning, have emerged as a potent solution to these challenges.6,7 Deep learning offers a powerful approach to extracting meaningful insights from complex data by providing robust models that can efficiently handle large datasets and navigate vast chemical spaces and intricate biological interactions.8 This advance in computational analysis is both a response to the immediate challenges posed by the pandemic and a significant leap forward in the field of drug discovery.
Deep learning, implemented through deep neural networks (DNN), has demonstrated significant success in various fields, including drug discovery, by offering potent tools to model complex biological interactions.6 Its application in classifying the binding activities of molecules, especially in the context of antiviral research against SARS-CoV-2, represents a promising avenue for enhancing the screening and analysis of potential antiviral agents.9 However, the effectiveness of deep learning models is contingent upon their ability to generalize across a range of diverse experimental conditions and assays, a challenge frequently encountered in biological datasets. Our study tackles this challenge by concentrating on the classification of binding activity for anti-SARS-CoV-2 molecules, utilizing deep learning techniques across multiple assays. This methodology is influenced by recent strides in computational drug discovery, as showcased in projects like REDIAL-2020. In this initiative, multiple ML models were applied to predict various activities, encompassing live viral infectivity, viral entry, viral replication, and host cell toxicity.9
We used 11 SARS-CoV-2 bioassays from the National Center for Advancing Translational Sciences (NCATS) Open Data Portal for COVID-19 (https://opendata.ncats.nih.gov/covid19). These bioassays are derived from qHTS and frequently exhibit an imbalance, typically characterized by a notably larger number of inactive compounds than active ones. This imbalance poses a significant challenge for ML and deep learning algorithms because the limited data for the less represented (minority) class makes learning more difficult.10 Balancing the datasets through the application of oversampling techniques is one approach to address this challenge.6 In this study, we utilized the synthetic minority oversampling technique (SMOTE)11, a widely used method for oversampling, to establish a balance between active and inactive compounds in each bioassay. Furthermore, we generated molecular descriptors for every compound in each bioassay. Subsequently, we applied the data balancing method to each dataset and proceeded to train the DNN models on these adjusted datasets. The predictive performance of these models was assessed using a range of metrics.
Datasets
In our study, we used 11 experimental bioassay datasets from the NCATS Open Data Portal, including AlphaLISA and TruHit counter screens for spike-ACE2 protein-protein interaction12, ACE2 enzymatic activity13, and 3CL enzymatic activity14 assays targeting SARS-CoV-2 entry and replication. We also employed SARS-CoV-2 cytopathic effect (CPE) and host tox counter screen (cytotox) assays in Vero E6 cells15, along with SARS-CoV pseudotyped particle entry (CoV-PPE), its counter screen (CoV-PPE-cs), and similar assays for Middle East respiratory syndrome (MERS-PPE and MERS-PPE-cs).16 Additionally, a human fibroblast toxicity (hCYTOX) assay was included to evaluate compound cytotoxicity.
KC et al.9 previously employed these bioassays. We used PaDEL software to calculate 1878 molecular descriptors for each compound within each bioassay. Refer to Yap17 for more comprehensive details on these descriptors. We then preprocessed each dataset: empty columns, missing values, and variables with zero variance were eliminated, after which the datasets were centered and scaled through z-score transformation. Table 1 summarizes the number of compounds (both before and after preprocessing, categorized as active and inactive compounds), the active-to-inactive ratio, and the number of variables retained post-preprocessing for each bioassay dataset.
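The preprocessing steps described above can be illustrated with a short NumPy sketch. This is not the code used in the study: the function name `preprocess` is hypothetical, missing values are assumed to be handled by dropping the affected descriptor columns, and the population standard deviation is assumed for the z-score.

```python
import numpy as np

def preprocess(X):
    """Drop unusable descriptor columns, then z-score the remaining ones.

    X is a (compounds x descriptors) array in which NaN marks a missing value.
    Returns the scaled matrix and the indices of the retained columns.
    """
    X = np.asarray(X, dtype=float)
    # Assumption: empty columns and columns with any missing value are dropped.
    complete = ~np.isnan(X).any(axis=0)
    keep = np.flatnonzero(complete)
    # Zero-variance descriptors carry no information and cannot be scaled.
    keep = keep[X[:, keep].std(axis=0) > 0]
    kept_cols = X[:, keep]
    # Center and scale each retained descriptor (z-score transformation).
    Z = (kept_cols - kept_cols.mean(axis=0)) / kept_cols.std(axis=0)
    return Z, keep

# Toy matrix: column 1 is constant and column 2 has a missing value.
X = np.array([[1.0, 5.0, np.nan, 2.0],
              [2.0, 5.0, 0.3, 4.0],
              [3.0, 5.0, 0.1, 9.0]])
Z, kept = preprocess(X)
print(kept, Z.shape)
```

In this toy example only the first and last descriptors survive, and each retained column has mean 0 and unit standard deviation after scaling.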
Data balancing using SMOTE
SMOTE, a popular oversampling method, generates new synthetic samples for the minority class instead of replicating existing ones.11 It begins by identifying the k-nearest neighbors (k-NN) of each minority class sample and then synthesizes new samples in the direction of these neighbors. This procedure includes several steps: first, determining the difference between a sample from the minority class and its nearest neighbor; next, multiplying this difference by a randomly chosen number between 0 and 1; then, adding this product to the original sample, thereby generating a new synthetic sample along the line connecting the two samples; and finally, repeating this process for every sample in the minority class. Such a technique expands the decision-making region for the minority class, prompting classifiers to establish broader and more generalized decision regions. This stands in contrast to narrow and overly specific decision regions, ultimately enhancing the overall generalization of the classifier.
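The interpolation procedure described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the imbalanced-learn implementation used in the study: the function name `smote_sketch`, the Euclidean distance, the cycling order over minority samples, and the default `k=5` are choices made here for clarity.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by neighbor interpolation."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbors than other samples
    # Pairwise Euclidean distances between minority samples.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for s in range(n_new):
        i = s % n                            # cycle through minority samples
        j = neighbors[i, rng.integers(k)]    # pick one of its k neighbors
        gap = rng.random()                   # random number in [0, 1)
        # New point on the line segment between sample i and neighbor j.
        synthetic[s] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic

# Four toy "active" compounds described by two descriptors.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_samples = smote_sketch(X_min, n_new=6, k=2)
print(new_samples.shape)
```

Because every synthetic point lies on a segment joining two existing minority samples, the new points stay inside the region already occupied by the minority class rather than duplicating existing compounds.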
Deep neural networks
DNNs, a specialized area of ML, are distinguished by their architecture, which includes several non-linear hidden layers between the input and output layers.18 The ability of DNNs to decode complex relationships between inputs and outputs is enhanced by their multiple layers and adjustable weights.6 The training of a DNN begins with randomly assigned weights. Subsequently, a loss function is used to evaluate the model’s predictions against actual classes, resulting in a loss score. This score is crucial for the backpropagation algorithm, which calculates derivatives of the loss function with respect to the weights in each layer using the chain rule.19 An optimizer, such as a gradient descent algorithm, then adjusts the weights based on these derivatives to lower the loss. This iterative process of assessment, calculation, and refinement aims to minimize the loss function, leading to a DNN that is finely tuned to make predictions that closely align with true classifications.
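The training cycle described above (random initial weights, loss evaluation, backpropagation through the chain rule, and a gradient descent update) can be made concrete with a deliberately small NumPy sketch. It is only an illustration: it uses one hidden layer, plain gradient descent, and a binary cross-entropy loss, which are assumptions made here and not the architecture or optimizer of the study. The function name `train_tiny_dnn` and the toy data are hypothetical.

```python
import numpy as np

def train_tiny_dnn(X, y, hidden=8, lr=0.1, epochs=200, seed=0):
    """Train a one-hidden-layer network and return the loss per epoch."""
    rng = np.random.default_rng(seed)
    n, n_features = X.shape
    # Step 1: randomly assigned weights (biases start at zero).
    W1 = rng.normal(0.0, 0.5, (n_features, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, (hidden, 1))
    b2 = np.zeros(1)
    losses = []
    for _ in range(epochs):
        # Forward pass: ReLU hidden layer, sigmoid output.
        z1 = X @ W1 + b1
        h = np.maximum(z1, 0.0)
        logits = (h @ W2 + b2).ravel()
        prob = 1.0 / (1.0 + np.exp(-logits))
        # Step 2: loss score (binary cross-entropy, an assumption here).
        p = np.clip(prob, 1e-12, 1 - 1e-12)
        losses.append(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
        # Step 3: backpropagation, applying the chain rule layer by layer.
        d_logits = (prob - y) / n
        dW2 = h.T @ d_logits[:, None]
        db2 = d_logits.sum(keepdims=True)
        dh = d_logits[:, None] @ W2.T
        dz1 = dh * (z1 > 0)
        dW1 = X.T @ dz1
        db1 = dz1.sum(axis=0)
        # Step 4: gradient descent update to lower the loss.
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    return losses

# Toy, linearly separable data (hypothetical, for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
losses = train_tiny_dnn(X, y)
print(losses[0], losses[-1])
```

The loss recorded in the final epoch should be lower than in the first, which is the behavior the iterative refinement is meant to produce.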
Model building
For our model, we divided each dataset into an 80% training set and a 20% test set. Additionally, we randomly selected 10% of each training set to serve as a validation set. We used the training sets to train the DNN models and the test sets to assess their predictive performance. The validation sets served two main purposes: first, to fine-tune the hyperparameters of the network, and second, to identify the optimal cut-off points for the predicted probabilities generated by the DNN models. To select the optimal cut-off point, we plotted balanced accuracy (BACC) as a function of the cut-off applied to the predicted probabilities from the validation set and chose the cut-off at which the DNN attained the highest BACC. This strategy was proposed by Korkmaz6. We applied the SMOTE method for data balancing in each training set. The ReLU function was used to activate the input and hidden layers, and a sigmoid function was used for the output layer. Batch normalization was applied in each layer to improve network performance and stability. We used a binary loss function and the Adam optimization method, and tuned the hyperparameters through five-fold cross-validation. The model architecture was selected by minimizing the cross-validation loss, leading to a network with three hidden layers. After evaluating various minibatch sizes, we determined 64 to be optimal and set the number of epochs to 150. A 40% dropout rate was applied to each layer to prevent overfitting, and the learning rate was set at 0.001 because it most effectively minimized the cross-validation loss. We used Python v3.7.1 and R v4.3.2 for all analyses. We utilized the Keras (v3.0) and TensorFlow (v2.13) libraries for DNN model building and the imbalanced-learn library to apply the SMOTE method in Python. Receiver operating characteristic (ROC) curves were created using the pROC (v1.18.5) package in R.
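The cut-off selection on the validation set can be sketched as a simple grid search. This is a minimal illustration, not the study's code: the helper names `balanced_accuracy` and `best_cutoff`, the grid of 99 candidate cut-offs, and the toy labels and probabilities are all assumptions.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = active)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    sensitivity = np.mean(y_pred[y_true == 1] == 1)
    specificity = np.mean(y_pred[y_true == 0] == 0)
    return 0.5 * (sensitivity + specificity)

def best_cutoff(y_val, prob_val, grid=None):
    """Return the cut-off (and its BACC) that maximizes validation BACC."""
    prob_val = np.asarray(prob_val, dtype=float)
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)  # assumed grid of candidates
    scores = [balanced_accuracy(y_val, (prob_val >= c).astype(int))
              for c in grid]
    best = int(np.argmax(scores))  # first cut-off reaching the maximum
    return float(grid[best]), float(scores[best])

# Toy validation set: six inactive compounds followed by two active ones.
y_val = np.array([0, 0, 0, 0, 0, 0, 1, 1])
prob_val = np.array([0.05, 0.12, 0.22, 0.31, 0.335, 0.47, 0.425, 0.81])
cutoff, bacc = best_cutoff(y_val, prob_val)
print(cutoff, bacc)
```

Choosing the cut-off on the validation set rather than the test set keeps the test-set metrics an unbiased estimate of performance; with imbalanced data the selected value is often well away from the default of 0.5.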
Performance metrics
We computed various performance metrics, including BACC, precision, recall, F1-score, MCC, and AUC, to evaluate the validation and test set performances of the DNN models.
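In their standard form, which is assumed here to match the definitions cited below, these metrics can be written as follows. The symbol $s(\cdot)$ is a hypothetical notation introduced here for the predicted probability assigned by the model, and the AUC is written in its pairwise form, ignoring ties.

```latex
\begin{aligned}
\text{Recall} &= \frac{TP}{TP + FN}, \qquad
\text{Specificity} = \frac{TN}{TN + FP}, \\
\text{BACC} &= \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right), \\
\text{Precision} &= \frac{TP}{TP + FP}, \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \\
\text{MCC} &= \frac{TP \cdot TN - FP \cdot FN}
{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}, \\
\text{AUC} &= \frac{1}{n_0 n_1} \sum_{i=1}^{n_0} \sum_{j=1}^{n_1}
I\left[\, s(Y_{1i}) > s(Y_{0j}) \,\right].
\end{aligned}
```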
Here, TP represents true positives, TN denotes true negatives, FP signifies false positives, and FN indicates false negatives. The index i runs over the n0 data points whose true label is 1, while j runs over the n1 data points whose true label is 0. Y1i refers to the i-th data point with a true label of 1, Y0j to the j-th data point with a true label of 0, and I represents the indicator function.
These measures are defined in Korkmaz6.
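For readers who prefer code, the same metrics can be computed from confusion-matrix counts with a few lines of plain Python. This is a sketch using the standard textbook formulas; the function names and the example counts are hypothetical and are not taken from any of the 11 bioassays.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "bacc": 0.5 * (recall + specificity),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }

def auc_pairwise(scores_active, scores_inactive):
    """Fraction of (active, inactive) pairs ranked correctly; ties count as 0."""
    wins = sum(1 for a in scores_active for b in scores_inactive if a > b)
    return wins / (len(scores_active) * len(scores_inactive))

# Hypothetical counts: 10 active and 90 inactive compounds.
metrics = classification_metrics(tp=8, tn=70, fp=20, fn=2)
auc = auc_pairwise([0.9, 0.4], [0.1, 0.5, 0.3])
print(metrics, auc)
```

Unlike plain accuracy, BACC, F1, and MCC do not reward a classifier for simply predicting the majority (inactive) class, which is why they suit imbalanced screening data.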
Performance metrics were obtained for each bioassay. The results for the validation set are summarized in Table 2, while the results for the test set are presented in Table 3. Additionally, ROC curves are shown in Figures 1 and 2 for both validation and test sets to visualize the classification ability of the DNN model.
The performance of the DNN model across 11 bioassay test sets provides valuable insights, especially when considering the imbalance ratio of active-to-inactive compounds. The subsequent section provides a comprehensive analysis of each assay:
3CL assay: With an imbalance ratio of 1:38, this assay presented a substantial challenge. The model achieved a BACC of 0.731, a noteworthy accomplishment given the high level of imbalance. However, the precision was low at 0.075, indicating challenges in accurately identifying active compounds, a likely consequence of the high number of inactive compounds. The recall of 0.700 suggests that the model was sensitive to detecting active compounds. However, this sensitivity came at the cost of precision. The MCC of 0.172 and AUC of 0.783 further reflect these trends.
ACE2 assay: With a more manageable imbalance ratio of 1:15, the model showed improved performance with a BACC of 0.772. The higher precision of 0.175, compared to 3CL, indicates better identification of true positives, whereas the high recall of 0.846 suggests that the model effectively identified most of the active compounds. The F1-score of 0.289 and MCC of 0.294 align with these findings, while an AUC of 0.842 indicates good overall model performance.
AlphaLISA assay: This assay had a relatively balanced ratio of 1:3, reflected in strong performance across metrics, including a high BACC of 0.803, precision of 0.603, and recall of 0.813. The F1-score of 0.692 and MCC of 0.561 are particularly noteworthy, suggesting a well-balanced model. The high AUC of 0.868 indicates excellent overall model performance.
CoV-PPE assay: Exhibiting a balanced ratio of 1:1, this assay demonstrated robust performance metrics. It achieved the highest precision of 0.730 among all assays, indicating an effective identification of active compounds. However, the recall is slightly lower at 0.659, suggesting that some active compounds were missed. The F1-score of 0.693 and MCC of 0.417, along with an AUC of 0.755, indicate good overall performance.
CoV-PPE-cs assay: Similar to AlphaLISA in terms of imbalance (1:3), the model exhibited a balanced performance, with a precision of 0.556 and a recall of 0.675. The BACC of 0.750, F1-score of 0.610, MCC of 0.470, and AUC of 0.799 indicate a model that performs well in both identifying active compounds and avoiding false positives.
CPE assay: Despite a challenging imbalance ratio of 1:15, the model achieved a notable recall of 0.764 but struggled with precision (0.121), indicating difficulties in accurately identifying true active compounds among numerous inactive compounds. The BACC of 0.690, F1-score of 0.208, MCC of 0.190, and AUC of 0.753 suggest a model that is more sensitive than specific.
Cytotox assay: With an imbalance ratio of 1:5, the model demonstrated moderate precision (0.379) and high recall (0.699), suggesting an effective balance in identifying active compounds. The BACC of 0.730, F1-score of 0.491, MCC of 0.374, and AUC of 0.792 reflect a well-performing model in a moderately imbalanced setting.
hCYTOX assay: Here, the ratio of 1:9 posed a moderate difficulty, and the model managed a high recall (0.767) but a relatively low precision (0.212). The BACC of 0.734, F1-score of 0.332, MCC of 0.289, and AUC of 0.751 indicate a model that effectively identifies active compounds, although with some limitations in precision.
MERS-PPE assay: With an imbalance ratio of 1:3, the model performed well, achieving a recall of 0.764 and a precision of 0.426, suggesting an effective balance in identifying active compounds. The BACC of 0.723, F1-score of 0.547, MCC of 0.384, and AUC of 0.761 support the model’s good performance.
MERS-PPE-cs assay: An imbalance ratio of 1:11 presented challenges. The model exhibited a high recall (0.822) and relatively low precision (0.236), indicating sensitivity to active compounds. The BACC of 0.763, F1-score of 0.366, MCC of 0.331, and AUC of 0.803 indicate a model that is more sensitive than specific.
TruHit assay: With a favorable imbalance ratio of 1:2, the model delivered a robust performance, especially notable in recall (0.813) and precision (0.515), demonstrating its effectiveness in discriminating active compounds. The BACC of 0.765, F1-score of 0.630, MCC of 0.476, and AUC of 0.821 reflect a well-balanced and effective model.
Overall, the DNN model performance varied notably across different bioassays and was heavily influenced by the imbalance ratio of each dataset. Assays with more balanced ratios typically demonstrated better precision and overall metric performance. In contrast, assays with higher imbalances tended to have higher recall but struggled with precision. The BACC, MCC, and AUC values across all assays suggest that the model generally performed well, particularly when considering the varying degrees of dataset imbalance.
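The link between imbalance and precision can be made explicit with a short calculation. Given recall and BACC, specificity follows as 2 × BACC − recall, and precision then depends directly on how many inactive compounds there are per active one. The sketch below is an illustration only: the function name `implied_precision` is hypothetical, and because the ratios reported above are rounded, the result approximates rather than reproduces the reported precision.

```python
def implied_precision(recall, bacc, inactive_per_active):
    """Precision implied by recall, BACC, and the class ratio."""
    specificity = 2 * bacc - recall
    false_positive_rate = 1 - specificity
    # Expected counts per single active compound in the dataset.
    true_positives = recall
    false_positives = false_positive_rate * inactive_per_active
    return true_positives / (true_positives + false_positives)

# 3CL test-set values from the text: recall 0.700, BACC 0.731, ratio 1:38.
p_imbalanced = implied_precision(0.700, 0.731, 38)
# Same recall and specificity, but a hypothetical balanced 1:1 dataset.
p_balanced = implied_precision(0.700, 0.731, 1)
print(round(p_imbalanced, 3), round(p_balanced, 3))
```

With identical recall and specificity, the implied precision is roughly 0.07 at a 1:38 ratio (close to the reported 0.075) but about 0.75 at 1:1, which suggests that the low precision in highly imbalanced assays largely reflects the class ratio rather than a weaker classifier.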
In drug discovery research, a diverse range of ML algorithms has been effectively employed for the activity classification of compounds. This aims at extracting valuable insights from bioassay datasets. Algorithms, such as Naïve Bayes (NB)20,21, Random forests (RF)22,23, k-NN24, support vector machines (SVM)7,25, and DNN6, are commonly employed for these classification tasks. A significant challenge in handling bioassay datasets, especially those derived from qHTS, is their inherent class imbalance. This is typically characterized by a substantially higher number of inactive compounds compared to active compounds. This imbalance presents a challenge for ML models due to the limited data available for the minority class. To address this specific challenge, Korkmaz6 explored various data balancing methods during the training phase of DNNs. The results of this study suggest that the impact of imbalanced data on neural network performance can be alleviated by employing data balancing methods.
In our study, we used a deep learning approach to classify the binding activities of compounds in 11 anti-SARS-CoV-2 bioassays. We undertook a comprehensive trial-and-error process to optimize the model architecture, experimenting with various numbers of layers, minibatch sizes, epochs, and learning rates to identify the most effective hyperparameters. We evaluated the model performance by selecting performance metrics that reflect the imbalanced nature of our datasets, including BACC, F1-score, and MCC.
KC et al.9 analyzed these bioassays in their research using 24 ML models. These models included the following methods: NB, extreme gradient boosting, RF, logistic regression, multilayer perceptron, and SVM, alongside 22 types of molecular descriptors.9 In contrast, our study focused on developing a deep learning model, employing PaDEL software to generate a comprehensive set of 1878 molecular descriptors. A notable methodological divergence lies in our approach to class imbalance. KC et al.9 performed random undersampling to balance the active and inactive classes, whereas we implemented an advanced oversampling technique, SMOTE, which creates synthetic samples for the minority class. Our DNN model exhibited superior performance over the ML classifiers employed by KC et al.,9 particularly in terms of accuracy and AUC, underscoring its more robust overall capability.
It is crucial to emphasize the significance of discovering new and effective drugs to combat COVID-19, a global health challenge that has had profound impacts on society. The development of such drugs is not only a matter of immediate health concern but also of long-term preparedness for potential future outbreaks. In this context, deep learning methods have emerged as invaluable tools in the drug discovery process. Their ability to analyze complex datasets and uncover patterns that might elude traditional analytical methods offers a promising avenue for identifying novel active molecules. The capacity of deep learning to handle vast and varied data - a common scenario in drug discovery - makes it particularly suitable for rapidly evolving scenarios, such as the COVID-19 pandemic. By incorporating deep learning techniques, researchers can accelerate the identification and validation of potential therapeutic candidates, thereby shortening the time from laboratory research to clinical trials and eventual public availability. The use of these advanced computational approaches not only enhances the efficiency of the drug discovery process but also increases the likelihood of identifying effective compounds. This, in turn, could save lives and mitigate the impacts of the pandemic.
In conclusion, this study highlights the pivotal role of deep learning in revolutionizing the drug discovery process, especially within the scope of the COVID-19 outbreak. By employing sophisticated deep learning methods, notably the DNN model augmented with SMOTE to address class imbalances, our research demonstrates a marked improvement in identifying active compounds from bioassay datasets. This approach outperforms conventional ML models, as evidenced by comparisons with the work of KC et al.,9 and also plays a vital role in addressing the challenges arising from the inherently imbalanced characteristics of high-throughput screening data. Our results highlight the capability of deep learning to expedite the identification of effective therapeutic agents against COVID-19. Advocating for the broader implementation of advanced computational techniques in drug discovery, this study signals progress toward a more efficient, precise, and swift identification of potential drugs to address present and upcoming health crises.
Data Sharing Statement: The data that support the findings of this study are openly available in Kaggle at https://www.kaggle.com/datasets/selcukorkmaz/anti-sars-cov-2-assays
Authorship Contributions: Concept- B.E.Y., S.K., Design- B.E.Y., S.K., Supervision- S.K., Materials- B.E.Y., S.K., Data Collection or Processing- B.E.Y., S.K., Analysis or Interpretation- B.E.Y., S.K., Literature Search- B.E.Y., S.K., Writing- B.E.Y., S.K., Critical Review- B.E.Y., S.K.
Peer Review: Selçuk Korkmaz is a member of the Editorial Board of the Balkan Medical Journal. However, he was not involved in the editorial decision of the manuscript at any stage.
Conflict of Interest: No conflict of interest was declared by the authors.
Funding: The authors declared that this study received no financial support.