Geometry Optimization Using the Frozen Domain and Partial Dimer Approaches in the Fragment Molecular Orbital Method: Implementation, Benchmark, and Applications to Protein Ligand-Binding Sites
The frozen domain (FD) approximation with the fragment molecular orbital (FMO) method is efficient for partial geometry optimization of large systems. We implemented the FD formulation (FD and frozen domain dimer [FDD] methods) previously proposed by Fedorov, D. G. et al. ( , 2, 282-288); proposed a variation of it, namely, the frozen domain and partial dimer (FDPD) method; and applied it to several protein-ligand complexes. The computational time for geometry optimization at the FDPD/HF/6-31G* level for the active site (six fragments) of the largest β-adrenergic G-protein-coupled receptor (440 residues) was almost half that of the conventional partial geometry optimization method. In the human estrogen receptor, the crystal structure was refined by FDPD geometry optimization of estradiol, surrounding hydrogen-bonded residues and a water molecule. The rather polarized ligand binding site of influenza virus neuraminidase was also optimized by FDPD optimization, which relaxed steric repulsion around the ligand in the crystal structure and optimized hydrogen bonding. For Serine-Threonine Kinase Pim1 and six inhibitors, the structures of the ligand binding site, Lys67, Glu121, Arg122, and benzofuranone ring and indole/azaindole ring of the ligand, were optimized at FDPD/HF/6-31G* and the ligand binding energy was estimated at the FMO-MP2/6-31G* level. As a result of examining three different optimization regions, the correlation coefficient between pIC and ligand binding energy was considerably improved by expanding the optimized region; in other words, better structure-activity relationships were obtained. Thus, this approach is promising as a high-precision structure refinement method for structure-based drug discovery.
Design of Recyclable Plastics with Machine Learning and Genetic Algorithm
We present an artificial intelligence-guided approach to design durable and chemically recyclable ring-opening polymerization (ROP) class polymers. This approach employs a genetic algorithm (GA) that designs new monomers and then utilizes virtual forward synthesis (VFS) to generate almost a million ROP polymers. Machine learning models to predict thermal, thermodynamic, and mechanical properties─crucial for application-specific performance and recyclability─are used to guide the GA toward optimal polymers. We present potential substitute polymers for polystyrene (PS) that achieve all property targets with low estimated synthetic complexity.
Martignac: Computational Workflows for Reproducible, Traceable, and Composable Coarse-Grained Martini Simulations
Despite their wide use and far-reaching implications, molecular dynamics (MD) simulations suffer from a lack of both traceability and reproducibility. We introduce Martignac: computational workflows for the coarse-grained (CG) Martini force field. Martignac describes Martini CG MD simulations as an acyclic directed graph, providing the entire history of a simulation─from system preparation to property calculations. Martignac connects to NOMAD, such that all simulation data generated are automatically normalized and stored according to the FAIR principles. We present several prototypical Martini workflows, including system generation of simple liquids and bilayers, as well as free-energy calculations for solute solvation in homogeneous liquids and drug permeation in lipid bilayers. By connecting to the NOMAD database to automatically pull existing simulations and push any new simulation generated, Martignac contributes to improving the sustainability and reproducibility of molecular simulations.
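Describing a simulation as a directed acyclic graph means every step can be ordered after the steps it depends on, and the full upstream history of any result can be recovered. The following is a minimal sketch using Python's standard `graphlib` module; the step names and helper functions are hypothetical placeholders and are not taken from Martignac's actual API.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
workflow = {
    "generate_system": set(),
    "minimize": {"generate_system"},
    "equilibrate": {"minimize"},
    "production": {"equilibrate"},
    "compute_properties": {"production"},
}


def execution_order(graph):
    """Return one valid execution order; graphlib raises CycleError on a cycle."""
    return list(TopologicalSorter(graph).static_order())


def provenance(graph, step):
    """Return every upstream step that `step` depends on, directly or indirectly."""
    seen = set()
    stack = list(graph[step])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return seen


order = execution_order(workflow)
history = provenance(workflow, "production")
```

Here `order` lists the steps from system generation to property calculation, and `history` holds the three steps that precede the production run.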
All-Atom Simulations Reveal the Effect of Membrane Composition on the Signaling of the NKG2A/CD94/HLA-E Immune Receptor Complex
Understanding how membrane composition influences the dynamics and function of transmembrane proteins is crucial for the comprehensive elucidation of cellular signaling mechanisms and the development of targeted therapeutics. In this study, we employed all-atom molecular dynamics simulations to investigate the impact of different membrane compositions on the conformational dynamics of the NKG2A/CD94/HLA-E immune receptor complex, a key negative regulator of natural killer cell cytotoxic activity. Our results reveal significant variations in the behavior of the immune complex structure across five different membrane compositions, which include POPC, POPA, DPPC, and DLPC phospholipids, and a mixed POPC/cholesterol system. These variations are particularly evident in the intracellular domain of NKG2A, manifested as changes in mobility, tyrosine exposure, and interdomain communication. Additionally, we found that a large concentration of negative charge at the surface of the POPA-based membrane greatly increased the number of contacts with lipid molecules and significantly decreased the exposure of intracellular NKG2A ITIM regions to water molecules, thus likely halting the signal transduction process. Furthermore, the DPPC model with a membrane possessing a high transition temperature in a gel-like state became curved, affecting the exposure of one ITIM region. The decreased membrane thickness in the DLPC model caused a significant transmembrane domain tilt, altering the linker protrusion angle and potentially disrupting the hydrogen bonding network in the extracellular domain. Overall, our findings highlight the importance of considering membrane composition in the analysis of transmembrane protein dynamics and in the exploration of novel strategies for the external modulation of their signaling pathways.
Kinase-Bench: Comprehensive Benchmarking Tools and Guidance for Achieving Selectivity in Kinase Drug Discovery
Developing selective kinase inhibitors remains a formidable challenge in drug discovery because of the highly conserved structural information on adenosine triphosphate (ATP) binding sites across the kinase family. Tailoring docking protocols to identify promising kinase inhibitor candidates for optimization has long been a substantial obstacle to drug discovery. Therefore, we introduced "Kinase-Bench," a pioneering benchmark suite designed for an advanced virtual screen, to improve the selectivity and efficacy of kinase inhibitors. Our comprehensive data set includes 6875 selective ligands and 422,799 decoys for 75 kinases, using extensive bioactivity and structural data from the ChEMBL database and decoys generated by the Directory of Useful Decoys-Enhanced version. Our benchmarking sets and retrospective case studies were designed to provide useful guidance in discovering selective kinase inhibitors. We employed a Glide High-Throughput Virtual Screen and Standard Precision complemented by three scoring functions and customized protein-ligand interaction filters that target specific kinase residue interactions. These innovations were successfully implemented in our virtual screen efforts targeting JAK1 inhibitors, achieving selectivity against its family member, TYK2. Consequently, we identified novel potential hits: Compound (JAK1 IC: 980.5 nM, TYK2 IC: 4.5 μM) and the approved pan-AKT inhibitor Capivasertib (JAK1 IC: 275.9 nM, TYK2 IC: 10.9 μM). Using the Kinase-Bench protocol, both compounds demonstrated substantial JAK1 selectivity, making them strong candidates for further investigation. Our pharmaceutical results underscore the utility of tailored virtual screen protocols in identifying selective kinase inhibitors with substantial implications for rational drug design. Kinase-Bench offers a robust toolset for selective kinase drug discovery with the potential to guide future therapeutic strategies effectively.
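The selectivity implied by the reported potencies can be checked with simple arithmetic. The sketch below assumes the "IC" values are half-maximal inhibitory concentrations and defines fold selectivity as the off-target value divided by the on-target value, a common convention that the abstract does not state explicitly. The function and variable names are illustrative.

```python
def selectivity_fold(ic_target_nM, ic_offtarget_nM):
    """Fold selectivity: off-target potency value divided by on-target value."""
    if ic_target_nM <= 0 or ic_offtarget_nM <= 0:
        raise ValueError("potency values must be positive")
    return ic_offtarget_nM / ic_target_nM


# Values quoted in the abstract, converted to nM (1 uM = 1000 nM).
capivasertib_fold = selectivity_fold(275.9, 10.9 * 1000)  # JAK1 vs TYK2
unnamed_hit_fold = selectivity_fold(980.5, 4.5 * 1000)    # JAK1 vs TYK2
```

Under these assumptions, Capivasertib comes out at roughly 40-fold and the other hit at roughly 4.6-fold in favor of JAK1.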
Streamlining Linear Free Energy Relationships of Proteins through Dimensionality Analysis and Linear Modeling
Linear free energy relationships (LFERs) are pivotal in predicting protein-water partition coefficients, with traditional one-parameter LFER models often based on octanol. However, their limited scope has prompted a shift toward the more comprehensive but parameter-intensive Abraham solvation-based poly-parameter LFER approach. This study introduces a two-parameter LFER model, aiming to balance simplicity and predictive accuracy. We showed that the complex six-dimensional intermolecular interaction space, defined by the six Abraham solute descriptors, can be efficiently simplified into two key dimensions. These dimensions are effectively represented by the octanol-water (log ) and air-water (log ) partition coefficients. Our two-parameter LFER model, utilizing linear combinations of log and log , showed promising results. It accurately predicted structural protein-water (log ) and bovine serum albumin-water (log ) partition coefficients, with values of 0.878 and 0.760 and root mean squared errors (RMSEs) of 0.334 and 0.422, respectively. Additionally, the two-parameter LFER model compares favorably with poly-parameter LFER predictions for neutral per- and polyfluoroalkyl substances. In a multiphase partitioning model parametrized with coefficients derived from the two-parameter LFER, we observed close alignment with experimental and distribution data for diverse mammalian tissues/organs ( = 137, RMSE = 0.44 log units) and milk-water partitioning data ( = 108, RMSE = 0.29 log units). The performance of the two-parameter LFER is comparable to that of the poly-parameter LFER and significantly surpasses that of the one-parameter LFER. Our findings highlight the utility of the two-parameter LFER model in estimating chemical partitioning to proteins based on hydrophobicity, volatility, and solubility, offering a viable alternative in scenarios where poly-parameter LFER descriptors are unavailable.
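A two-parameter linear model of this kind can be fitted by ordinary least squares. The sketch below is an assumption-laden illustration, not the authors' procedure: the symbols in the abstract were lost, so `log_kow` and `log_kaw` are assumed names for the octanol-water and air-water partition coefficients, the intercept term is an assumption, and the data are synthetic values generated from known coefficients purely to show that the fit recovers them.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_two_parameter(log_kow, log_kaw, log_k):
    """Least-squares fit of log_k = a*log_kow + b*log_kaw + c via normal equations."""
    rows = [(x1, x2, 1.0) for x1, x2 in zip(log_kow, log_kaw)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, log_k)) for i in range(3)]
    return solve_linear_system(xtx, xty)


# Synthetic illustration only: targets generated from a=0.5, b=-0.2, c=0.3.
log_kow = [1.0, 0.0, 2.0, 3.0, 1.0]
log_kaw = [0.0, 1.0, 1.0, -1.0, 2.0]
log_k = [0.5 * x1 - 0.2 * x2 + 0.3 for x1, x2 in zip(log_kow, log_kaw)]
a, b, c = fit_two_parameter(log_kow, log_kaw, log_k)
```

With real measurements the residuals would be non-zero, and the RMSE of the fitted model would be the quantity to compare against the values reported in the abstract.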
DeepAIPs-Pred: Predicting Anti-Inflammatory Peptides Using Local Evolutionary Transformation Images and Structural Embedding-Based Optimal Descriptors with Self-Normalized BiTCNs
Inflammation is a biological response to harmful stimuli, playing a crucial role in facilitating tissue repair by eradicating pathogenic microorganisms. However, when inflammation becomes chronic, it leads to numerous serious disorders, particularly in autoimmune diseases. Anti-inflammatory peptides (AIPs) have emerged as promising therapeutic agents due to their high specificity, potency, and low toxicity. However, identifying AIPs using traditional in vivo methods is time-consuming and expensive. Recent advancements in computational-based intelligent models for peptides have offered a cost-effective alternative for identifying various inflammatory diseases, owing to their selectivity toward targeted cells with low side effects. In this paper, we propose a novel computational model, namely, DeepAIPs-Pred, for the accurate prediction of AIP sequences. The training samples are represented using LBP-PSSM- and LBP-SMR-based evolutionary image transformation methods. Additionally, to capture contextual semantic features, we employed attention-based ProtBERT-BFD embedding and QLC for structural features. Furthermore, differential evolution (DE)-based weighted feature integration is utilized to produce a multiview feature vector. The SMOTE-Tomek Links are introduced to address the class imbalance problem, and a two-layer feature selection technique is proposed to reduce and select the optimal features. Finally, the novel self-normalized bidirectional temporal convolutional networks (SnBiTCN) are trained using optimal features, achieving a significant predictive accuracy of 94.92% and an AUC of 0.97. The generalization of our proposed model is validated using two independent datasets, demonstrating higher performance, with accuracy improvements of ∼2% and ∼10% over the existing state-of-the-art model on Ind-I and Ind-II, respectively. The efficacy and reliability of DeepAIPs-Pred highlight its potential as a valuable and promising tool for drug development and academic research.
Improved and Interpretable Prediction of Cytochrome P450-Mediated Metabolism by Molecule-Level Graph Modeling and Subgraph Information Bottlenecks
Accurately identifying sites of metabolism (SoM) mediated by cytochrome P450 (CYP) enzymes, which are responsible for drug metabolism in the body, is critical in the early stage of drug discovery and development. Current computational methods for CYP-mediated SoM prediction face several challenges, including limitations to traditional machine learning models at the atomic level, heavy reliance on complex feature engineering, and the lack of interpretability relevant to medicinal chemistry. Here, we propose GraphCySoM, a novel molecule-level modeling approach based on graph neural networks, utilizing lightweight features and interpretable annotations on substructures, to effectively and interpretably predict CYP-mediated SoM. Unlike computationally expensive atomic descriptors derived from resource-intensive chemistry or even quantum chemistry calculations, we emphasize that graph-based molecular modeling initialized solely with lightweight features enables the adaptive learning of molecular topology through message-passing mechanisms combined with various aggregation kernels. Extensive ablation experiments demonstrate that GraphCySoM significantly outperforms baseline models and achieves superior performance compared with competing methods while exhibiting advantages in computational efficiency. Moreover, the attention mechanism and subgraph information bottlenecks are incorporated to analyze node importance and feature significance, resulting in mining substructures associated with the SoM. To the best of our knowledge, this is the first comprehensive study of CYP-mediated SoM using molecule-level modeling and interpretable technology. Our method achieves new state-of-the-art performance and provides potential insights into the molecular and pharmacological mechanisms underlying drug metabolism catalyzed by CYP enzymes. All source files and trained models are freely available at https://github.com/liyigerry/GraphCySoM.
XDock: A General Docking Method for Modeling Protein-Ligand and Nucleic Acid-Ligand Interactions
Molecular docking is an essential computational tool in structure-based drug discovery and the investigation of the molecular mechanisms underlying biological processes. Despite the development of many molecular docking programs for various systems, a universal tool that can accurately dock ligands across multiple system types remains elusive. To meet this need, we developed XDock, a versatile docking framework built for both protein-ligand and nucleic acid-ligand interactions. XDock efficiently accounts for ligand flexibility by docking multiple conformations of a ligand and flexibly refining the final binding poses. It utilizes a distance geometric method for ligand sampling and leverages our knowledge-based scoring functions for assessing protein-ligand and nucleic acid-ligand interactions. XDock has undergone extensive validations on diverse benchmarks of protein-ligand and nucleic acid-ligand complexes and was compared with six other docking methods, including DOCK 6, AutoDock Vina, PLANTS, LeDock, rDock, and RLDock. XDock is also computationally efficient and on average can dock a ligand within 1 min.
Machine Learning-Driven Discovery and Database of Cyanobacteria Bioactive Compounds: A Resource for Therapeutics and Bioremediation
Cyanobacteria strains have the potential to produce bioactive compounds that can be used in therapeutics and bioremediation. Therefore, compiling all information about these compounds to consider their value as bioresources for industrial and research applications is essential. In this study, a searchable, updated, curated, and downloadable database of cyanobacteria bioactive compounds was designed, along with a machine-learning model to predict the targets of newly discovered molecules. A Python programming protocol obtained 3431 cyanobacteria bioactive compounds, 373 unique protein targets, and 3027 molecular descriptors. PaDEL-descriptor, Mordred, and Drugtax software were used to calculate the chemical descriptors for each bioactive compound database record. The biochemical descriptors were then used to determine the most promising protein targets for human therapeutic approaches and environmental bioremediation using the best machine learning (ML) model. The creation of our database, coupled with the integration of computational docking protocols, represents an innovative approach to understanding the potential of cyanobacteria bioactive compounds. This resource, adhering to the findability, accessibility, interoperability, and reuse of digital assets (FAIR) principles, is an excellent tool for pharmaceutical and bioremediation researchers. Moreover, its capacity to facilitate the exploration of specific compounds' interactions with environmental pollutants is a significant advancement, aligning with the increasing reliance on data science and machine learning to address environmental challenges. This study is a notable step forward in leveraging cyanobacteria for both therapeutic and ecological sustainability.
Potency Prediction of Covalent Inhibitors against SARS-CoV-2 3CL-like Protease and Multiple Mutants by Multiscale Simulations
3-Chymotrypsin-like protease (3CL) is a prominent target against pathogenic coronaviruses. Expert knowledge of the cysteine-targeted covalent reaction mechanism is crucial to predict the inhibitory potency of approved inhibitors against 3CLs of SARS-CoV-2 variants and perform structure-based drug design against newly emerging coronaviruses. We carried out an extensive array of classical and hybrid QM/MM molecular dynamics simulations to explore covalent inhibition mechanisms of five well-characterized inhibitors toward SARS-CoV-2 3CL and its mutants. The calculated binding affinity and reactivity of the inhibitors are highly consistent with experimental data, and the predicted inhibitory potency of the inhibitors against 3CL with L167F, E166V, or T21I/E166V mutant is in full agreement with ICs determined by the accompanying enzymatic assays. The explored mechanisms unveil the impact of residue mutagenesis on structural dynamics that communicates to change not only noncovalent binding strength but also covalent reaction free energy. Such a change is inhibitor dependent, corresponding to varied levels of drug resistance of these 3CL mutants against nirmatrelvir and simnotrelvir and no resistance to the compound. These results together suggest that the present simulations with a suitable protocol can efficiently evaluate the reactivity and potency of covalent inhibitors along with the elucidated molecular mechanisms of covalent inhibition.
Effects of All-Atom and Coarse-Grained Molecular Mechanics Force Fields on Amyloid Peptide Assembly: The Case of a Tau K18 Monomer
To propose new mechanism-based therapeutics for Alzheimer's disease (AD), it is crucial to study the kinetics and oligomerization/aggregation mechanisms of the hallmark tau proteins, which have various isoforms and are intrinsically disordered. In this study, multiple all-atom (AA) and coarse-grained (CG) force fields (FFs) have been benchmarked on molecular dynamics (MD) simulations of K18 tau (M243-E372), which is a truncated form (130 residues) of full-length tau (441 residues). FF19SB is first excluded because the dynamics are too slow, and the conformations are too stable. All other benchmarked AAFFs (Charmm36m, FF14SB, Gromos54A7, and OPLS-AA) and CGFFs (Martini3 and Sirah2.0) exhibit a trend of shrinking K18 tau into compact structures with the radius of gyration (ROG) around 2.0 nm, which is much smaller than the experimental value of 3.8 nm, within 200 ns of AA-MD or 2000 ns of CG-MD. Gromos54A7, OPLS-AA, and Martini3 shrink much faster than the other FFs. To perform meaningful postanalysis of various properties, we propose a strategy of selecting snapshots with 2.5 < ROG < 4.5 nm, instead of using all sampled snapshots. The calculated chemical shifts of all C, CA, and CB atoms have very good and close root-mean-square error (RMSE) values, while Charmm36m and Sirah2.0 exhibit better chemical shifts of N than other FFs. Comparing the calculated distributions of the distance between the CA atoms of CYS291 and CYS322 with the results of the FRET experiment demonstrates that Charmm36m is a perfect match with the experiment while other FFs exhibit limitations. In summary, Charmm36m is recommended as the best AAFF, and Sirah2.0 is recommended as an excellent CGFF for simulating tau K18.
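The snapshot-selection strategy (keeping only frames with 2.5 < ROG < 4.5 nm) can be sketched as follows. This is a minimal, dependency-free illustration using the standard mass-weighted definition of the radius of gyration; the function names and the toy two-particle "trajectory" are assumptions made for demonstration and are not the authors' analysis code.

```python
import math


def radius_of_gyration(coords, masses):
    """Mass-weighted radius of gyration of one snapshot (coords in nm)."""
    total_mass = sum(masses)
    # Center of mass, one component at a time.
    com = [
        sum(m * c[k] for c, m in zip(coords, masses)) / total_mass
        for k in range(3)
    ]
    # Mass-weighted mean squared distance from the center of mass.
    weighted_sq = sum(
        m * sum((c[k] - com[k]) ** 2 for k in range(3))
        for c, m in zip(coords, masses)
    )
    return math.sqrt(weighted_sq / total_mass)


def select_snapshots(trajectory, masses, low=2.5, high=4.5):
    """Return indices of snapshots whose ROG lies strictly between low and high."""
    return [
        i
        for i, frame in enumerate(trajectory)
        if low < radius_of_gyration(frame, masses) < high
    ]


# Toy data: two equal masses placed at +/- d on the x axis give ROG = d.
masses = [12.0, 12.0]
trajectory = [
    [(-d, 0.0, 0.0), (d, 0.0, 0.0)] for d in (1.0, 3.0, 5.0)
]
selected = select_snapshots(trajectory, masses)
```

Only the middle frame (ROG = 3.0 nm) passes the filter; the collapsed and over-extended frames are discarded before any post-analysis.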
Molecular Design for Cardiac Cell Differentiation Using a Small Data Set and Decorated Shape Features
The discovery of small organic compounds for inducing stem cell differentiation is a time- and resource-intensive process. While data science could, in principle, streamline the discovery of these compounds, novel approaches are required due to the difficulty of acquiring training data from large numbers of example compounds. In this paper, we present the design of a new compound for inducing cardiomyocyte differentiation using simple regression models trained with a data set containing only 80 examples. We introduce decorated shape descriptors, an information-rich molecular feature representation that integrates both molecular shape and hydrophilicity information. These models demonstrate improved performance compared to ones using standard molecular descriptors based on shape alone. Model overtraining is diagnosed using a new type of sensitivity analysis. Our new compound is designed using a conservative molecular design strategy, and its effectiveness is confirmed through expression profiles of cardiomyocyte-related marker genes using real-time polymerase chain reaction experiments on human iPS cell lines. This work demonstrates a viable data-driven strategy for designing new compounds for stem cell differentiation protocols and will be useful in situations where training data is limited.
Modeling Heterogeneous Catalysis Using Quantum Computers: An Academic and Industry Perspective
Heterogeneous catalysis plays a critical role in many industrial processes, including the production of fuels, chemicals, and pharmaceuticals, and research to improve current catalytic processes is important to make the chemical industry more sustainable. Despite its importance, the challenge of identifying optimal catalysts with the required activity and selectivity persists, demanding a detailed understanding of the complex interactions between catalysts and reactants at various length and time scales. Density functional theory (DFT) has been the workhorse in modeling heterogeneous catalysis for more than three decades. While DFT has been instrumental, this review explores the application of quantum computing algorithms in modeling heterogeneous catalysis, which could bring a paradigm shift in our approach to understanding catalytic interfaces. Bridging academic and industrial perspectives by focusing on emerging materials, such as multicomponent alloys, single-atom catalysts, and magnetic catalysts, we delve into the limitations of DFT in capturing strong correlation effects and spin-related phenomena. The review also presents important algorithms and their applications relevant to heterogeneous catalysis modeling to showcase advancements in the field. Additionally, the review explores embedding strategies where quantum computing algorithms handle strongly correlated regions, while traditional quantum chemistry algorithms address the remainder, thereby offering a promising approach for large-scale heterogeneous catalysis modeling. Looking forward, ongoing investments by academia and industry reflect a growing enthusiasm for quantum computing's potential in heterogeneous catalysis research. The review concludes by envisioning a future where quantum computing algorithms seamlessly integrate into research workflows, propelling us into a new era of computational chemistry and thereby reshaping the landscape of modeling heterogeneous catalysis.
Correction to "Semisupervised Learning to Boost hERG, Nav1.5, and Cav1.2 Cardiac Ion Channel Toxicity Prediction by Mining a Large Unlabeled Small Molecule Data Set"
Kinetics-Based State Definitions for Discrete Binding Conformations of T4 L99A in MD via Markov State Modeling
As a model system, the binding pocket of the L99A mutant of T4 lysozyme has been the subject of numerous computational free energy studies. However, previous studies have failed to fully sample and account for the observed changes in the binding pocket of T4 L99A upon binding of a congeneric ligand series, limiting the accuracy of results. In this work, we resolve the closed, intermediate, and open states for T4 L99A previously reported in experiment in MD and establish definitions for these states based on the dynamics of the system. From this analysis, we arrive at two primary conclusions. First, assignment of simulation trajectories into discrete states should not be done simply based on RMSD to crystal structures as this can result in misassignment of states. Second, the different metastable conformations studied here need to be carefully treated, as we estimate the time scales for conformational interconversion to be on the order of 10 to 10 ns─far longer than time scales for typical binding calculations. We conclude with a discussion on the need to develop enhanced sampling methods to generally account for significant changes in protein conformation due to relatively small ligand perturbations.
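The link between a Markov state model and an interconversion time scale can be illustrated with a two-state toy example. The sketch below row-normalizes transition counts (a simple estimator that does not enforce detailed balance) and uses the fact that a 2x2 row-stochastic matrix has eigenvalues 1 and 1 - T01 - T10, so the implied time scale is -lag / ln(1 - T01 - T10). The discrete trajectory and function names are invented for illustration and are unrelated to the T4 L99A data.

```python
import math


def transition_matrix(dtraj, n_states, lag=1):
    """Row-normalized transition counts at the given lag (in frames, lag >= 1)."""
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(dtraj[:-lag], dtraj[lag:]):
        counts[a][b] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def two_state_timescale(T, lag_time):
    """Implied time scale of a two-state model: -lag_time / ln(lambda_2)."""
    lam2 = 1.0 - T[0][1] - T[1][0]  # second eigenvalue of a 2x2 stochastic matrix
    if not 0.0 < lam2 < 1.0:
        raise ValueError("second eigenvalue must lie in (0, 1)")
    return -lag_time / math.log(lam2)


# Invented state assignments per frame (0 = closed, 1 = open).
dtraj = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
T = transition_matrix(dtraj, n_states=2, lag=1)
timescale = two_state_timescale(T, lag_time=1.0)  # in units of the lag time
```

In a real analysis the time scale would be checked for convergence across several lag times before being interpreted.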
Advanced AI-Driven Prediction of Pregnancy-Related Adverse Drug Reactions
Ensuring drug safety during pregnancy is critical due to the potential risks to both the mother and fetus. However, the exclusion of pregnant women from clinical trials complicates the assessment of adverse drug reactions (ADRs) in this population. This study aimed to develop and validate risk prediction models for pregnancy-related ADRs of drugs using advanced Machine Learning (ML) and Deep Learning (DL) techniques, leveraging real-world data from the FDA Adverse Event Reporting System. We explored three methods─Information Component, Reporting Odds Ratio, and 95% confidence interval of ROR─for classifying drugs into high-risk and low-risk categories. DL models, including Directed Message Passing Neural Networks (DMPNN), Graph Neural Networks, and Graph Convolutional Networks, were developed and compared to traditional ML models like Random Forest, Support Vector Machines, and XGBoost. Among these, the DMPNN model, which integrated molecular graph information and molecular descriptors, exhibited the highest predictive performance, particularly at the preferred term level. The model was validated against external data sets from SIDER and DailyMed, demonstrating strong generalizability. Additionally, the model was applied to assess the risk of 22 oral hypoglycemic drugs, and potential substructure alerts for pregnancy-related ADRs were identified. These findings suggest that the DMPNN model is a valuable tool for predicting ADRs in pregnant women, offering significant advancement in drug safety assessment and providing crucial insights for safer medication use during pregnancy.
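The Reporting Odds Ratio and its 95% confidence interval follow from a 2x2 contingency table of spontaneous reports. The sketch below uses the standard formulas ROR = (a·d)/(b·c) and CI = exp(ln ROR ± 1.96·sqrt(1/a + 1/b + 1/c + 1/d)). The counts are hypothetical, and the rule "high risk when the lower CI bound exceeds 1" is a common convention assumed here, not a threshold stated in the abstract.

```python
import math


def reporting_odds_ratio(a, b, c, d, z=1.96):
    """Return (ROR, lower, upper) for a 2x2 report table.

    a: reports with the drug and the event
    b: reports with the drug, without the event
    c: reports with other drugs and the event
    d: reports with other drugs, without the event
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    ror = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of ln(ROR)
    lower = math.exp(math.log(ror) - z * se)
    upper = math.exp(math.log(ror) + z * se)
    return ror, lower, upper


def is_high_risk(a, b, c, d):
    """Assumed labeling rule: flag a signal when the lower 95% bound exceeds 1."""
    _, lower, _ = reporting_odds_ratio(a, b, c, d)
    return lower > 1.0


# Hypothetical counts for illustration only.
ror, lower, upper = reporting_odds_ratio(20, 80, 100, 1800)
flag = is_high_risk(20, 80, 100, 1800)
```

For these hypothetical counts the ROR is 4.5 and the whole interval lies above 1, so the drug-event pair would be labeled high risk under the assumed rule.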
AABBA Graph Kernel: Atom-Atom, Bond-Bond, and Bond-Atom Autocorrelations for Machine Learning
Graphs are one of the most natural and powerful representations available for molecules; natural because they have an intuitive correspondence to skeletal formulas, the language used by chemists worldwide, and powerful, because they are highly expressive both globally (molecular topology) and locally (atom and bond properties). Graph kernels are used to transform molecular graphs into fixed-length vectors, which, based on their capacity of measuring similarity, can be used as fingerprints for machine learning (ML). To date, graph kernels have mostly focused on the atomic nodes of the graph. In this work, we developed a graph kernel based on atom-atom, bond-bond, and bond-atom (AABBA) autocorrelations. The resulting vector representations were tested on regression ML tasks on a data set of transition metal complexes; a benchmark motivated by the higher complexity of these compounds relative to organic molecules. In particular, we tested different flavors of the AABBA kernel in the prediction of the energy barriers and bond distances of the Vaska's complex data set (Friederich et al., , 2020, 4584). For a variety of ML models, including neural networks, gradient boosting machines, and Gaussian processes, we showed that AABBA outperforms the baseline including only atom-atom autocorrelations. Dimensionality reduction studies also showed that the bond-bond and bond-atom autocorrelations yield many of the most relevant features. We believe that the AABBA graph kernel can accelerate the exploration of large chemical spaces and inspire novel molecular representations in which both atomic and bond properties play an important role.
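The idea of turning a molecular graph into a fixed-length vector by autocorrelation can be sketched with plain Python. The example below computes, for each topological distance d, the sum of property products over node pairs separated by d bonds; applying the same function to the line graph (bonds as nodes) gives a bond-bond analogue. This is a generic illustration under assumed definitions: the exact AABBA formulas, including the bond-atom term, may differ, and the three-atom O=C=O graph with atomic numbers and bond orders as properties is a toy input.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, start):
    """Shortest path length (in edges) from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def autocorrelation(adj, prop, max_depth):
    """Vector [AC_0 .. AC_max_depth]; AC_d sums prop[i]*prop[j] over unordered
    node pairs at graph distance d (d = 0 collects the self terms)."""
    vec = [0.0] * (max_depth + 1)
    nodes = list(adj)
    for i in nodes:
        vec[0] += prop[i] * prop[i]
    for idx, i in enumerate(nodes):
        dist = bfs_distances(adj, i)
        for j in nodes[idx + 1:]:
            d = dist.get(j)
            if d is not None and d <= max_depth:
                vec[d] += prop[i] * prop[j]
    return vec


def line_graph(bonds):
    """Adjacency between bond indices; two bonds are adjacent if they share an atom."""
    adj = {k: [] for k in range(len(bonds))}
    for a, b in combinations(range(len(bonds)), 2):
        if set(bonds[a]) & set(bonds[b]):
            adj[a].append(b)
            adj[b].append(a)
    return adj


# Toy molecule O=C=O: atoms 0 (O), 1 (C), 2 (O); two double bonds.
atom_adj = {0: [1], 1: [0, 2], 2: [1]}
atomic_number = {0: 8, 1: 6, 2: 8}
bonds = [(0, 1), (1, 2)]
bond_order = {0: 2, 1: 2}

atom_atom = autocorrelation(atom_adj, atomic_number, max_depth=2)
bond_bond = autocorrelation(line_graph(bonds), bond_order, max_depth=2)
```

Concatenating such vectors for several atom and bond properties yields a fixed-length fingerprint that a regression model can consume.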
GPTrans: A Biological Language Model-Based Approach for Predicting Disease-Associated Mutations in G Protein-Coupled Receptors
Accurately predicting mutations in G protein-coupled receptors (GPCRs) is critical for advancing disease diagnosis and drug discovery. In response to this imperative, GPTrans has emerged as a highly accurate predictor of disease-related mutations in GPCRs. The core innovation of GPTrans resides in the design of a novel feature extraction network that is capable of integrating features from both wildtype and mutant protein variant sites, utilizing multifeature connections within a transformer framework to ensure comprehensive feature extraction. A key aspect of GPTrans's effectiveness is our introduction of an innovative deep feature integration strategy, which merges embeddings and class tokens from multiple protein language models, including evolutionary scale modeling and ProtTrans, thus shedding light on the biochemical properties of proteins. Leveraging transformer components and a self-attention mechanism, GPTrans captures higher-level representations of protein features. Employing both wildtype and mutation site information for feature fusion not only enriches the predictive feature set but also avoids the common issue of overestimation associated with sequence-based predictions. This approach distinguishes GPTrans, enabling it to significantly outperform existing methods. Our evaluations across diverse GPCR data sets, including ClinVar and MutHTP, demonstrate GPTrans's superior performance, with average AUC values of 0.874 and 0.590 in 10-fold cross-validation. Notably, compared to the AlphaMissense method, GPTrans exhibited a remarkable 38.03% improvement in accuracy when predicting disease-associated mutations in the MutHTP data set. A thorough analysis of the predicted results further validates the model's effectiveness. The source code, data sets, and prediction results for GPTrans are available for academic use at https://github.com/EduardWang/GPTrans.
Improved Prediction of Ligand-Protein Binding Affinities by Meta-modeling
The accurate screening of candidate drug ligands against target proteins through computational approaches is of prime interest to drug development efforts. Such virtual screening depends in part on methods to predict the binding affinity between ligands and proteins. Many computational models for binding affinity prediction have been developed, but with varying results across targets. Given that ensembling or meta-modeling approaches have shown great promise in reducing model-specific biases, we develop a framework to integrate published force-field-based empirical docking and sequence-based deep learning models. In building this framework, we evaluate many combinations of individual base models, training databases, and several meta-modeling approaches. We show that many of our meta-models significantly improve affinity predictions over base models. Our best meta-models achieve comparable performance to state-of-the-art deep learning tools exclusively based on 3D structures while allowing for improved database scalability and flexibility through the explicit inclusion of features such as physicochemical properties or molecular descriptors. We further demonstrate improved generalization capability by our models using a large-scale benchmark of affinity prediction as well as a virtual screening application benchmark. Overall, we demonstrate that diverse modeling approaches can be ensembled together to gain meaningful improvement in binding affinity prediction.
Ligand Many-Body Expansion as a General Approach for Accelerating Transition Metal Complex Discovery
Methods that accelerate the evaluation of molecular properties are essential for chemical discovery. While some degree of ligand additivity has been established for transition metal complexes, it is underutilized in asymmetric complexes, such as the square pyramidal coordination geometries highly relevant to catalysis. To develop predictive methods beyond simple additivity, we apply a many-body expansion to octahedral and square pyramidal complexes and introduce a correction based on adjacent ligands (i.e., the interaction model). We first test the interaction model on adiabatic spin-splitting energies of octahedral Fe(II) complexes, predicting DFT-calculated values of unseen binary complexes to within an average error of 1.4 kcal/mol. Uncertainty analysis reveals the optimal basis, comprising the homoleptic and symmetric complexes. We next show that the model (i.e., the interaction model solved for the optimal basis) infers both DFT- and CCSD(T)-calculated model catalytic reaction energies to within 1 kcal/mol on average. The model predicts low-symmetry complexes with reaction energies outside the range of binary complex reaction energies. We observe that interactions are unnecessary for most monodentate systems but can be important for some combinations of ligands, such as complexes containing a mixture of bidentate and monodentate ligands. Finally, we demonstrate that the model may be combined with Δ-learning to predict CCSD(T) reaction energies from exhaustively calculated DFT reaction energies and the same fraction of CCSD(T) reaction energies needed for the model, achieving around 30% of the error from using the CCSD(T) reaction energies in the model alone.