Accurate prediction of protein-ligand binding affinity is a cornerstone of computer-aided drug discovery, yet the performance of classical scoring functions has remained stagnant. This article provides a comprehensive analysis for researchers and drug development professionals on the fundamental and methodological limitations of classical force-field-based, empirical, and knowledge-based scoring functions. We explore their rigid functional forms, inadequate treatment of key physical forces, and failure to generalize across diverse protein families. The content further details how these limitations manifest in practical applications like virtual screening and lead optimization, examines current challenges in model validation due to dataset biases, and synthesizes emerging solutions. By contrasting classical approaches with modern machine-learning-based alternatives, this review offers a clear-eyed perspective on the path toward more reliable and predictive computational tools for drug design.
In the fields of computational chemistry and molecular modelling, scoring functions are mathematical functions used to approximately predict the binding affinity between two molecules after they have been docked [1]. The drug discovery process is notoriously expensive and time-consuming, and structure-based virtual screening (VS) has become a widely used approach to triage unpromising compounds early in the pipeline [2]. By predicting the binding mode and affinity of a small molecule within the binding site of a target protein, molecular docking helps researchers understand key properties related to the binding process [2]. Fast evaluation of docking poses and accurate prediction of binding affinity are essential in these protocols; scoring functions, being straightforward and fast, remain the main option for VS experiments despite their limited accuracy [2]. This technical guide details the historical triad of classical scoring functions, framing their development within the context of their inherent limitations for binding affinity prediction.
Scoring functions are typically divided into three main classes: force field-based, empirical, and knowledge-based [2] [1]. Although large-scale comparative assessments are relatively rare, the strengths and limitations of current functions are fairly evident: they generally perform well in reproducing binding modes but struggle to accurately quantify binding affinities or the effects of small structural changes [3]. The following sections and Table 1 provide a detailed breakdown of each class.
Table 1: Comparative Overview of Classical Scoring Function Types
| Function Class | Theoretical Basis | Energy Terms/Descriptors | Parameterization Method | Representative Examples |
|---|---|---|---|---|
| Force Field-Based [1] | Molecular mechanics, classical force fields | Sum of intermolecular van der Waals and electrostatic energies; sometimes includes internal ligand strain energy and implicit solvation (GBSA/PBSA) [1]. | Parameters derived from fundamental physical chemistry and quantum mechanics calculations [4]. | DOCK [2], DockThor [2] |
| Empirical [2] | Linear Free Energy Relationships | Weighted sum of physicochemical terms: hydrophobic contacts, hydrogen bonds, rotatable bonds immobilized, etc. [1]. | Multiple linear regression (MLR) or machine learning on datasets of protein-ligand complexes with known affinities [2] [5]. | LUDI [2], ChemScore [2], GlideScore [2] |
| Knowledge-Based [1] | Inverse Boltzmann statistics from structural databases | Pairwise atom-atom "potentials of mean force" derived from observed contact frequencies in databases like the PDB [2] [1]. | Statistical analysis of large 3D structural databases (e.g., PDB, Cambridge Structural Database) [1]. | DrugScore [2], PMF [2] |
Force field-based scoring functions estimate affinities by summing the strength of intermolecular interactions, primarily van der Waals and electrostatic terms, using a molecular mechanics force field [1]. The intramolecular energies (strain energy) of the binding partners are also frequently included [1]. Since binding occurs in aqueous solution, a critical consideration is the treatment of solvation effects, which can be incorporated using implicit solvation models such as GBSA or PBSA [1]. The parameters for these functions are derived from fundamental physical chemistry and quantum mechanics calculations, rather than from fitting to binding affinity data [4]. Popular force fields that provide parameters for small molecules include the General AMBER Force Field (GAFF), the CHARMM General Force Field (CGenFF), and those in the OPLS and GROMOS families [4].
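To make this summation concrete, the sketch below evaluates a minimal force-field-style interaction score from a 12-6 Lennard-Jones term and a Coulomb term with a constant dielectric. The functional form is standard, but the combining rules, units, and parameter values here are illustrative rather than those of any particular program or force field:

```python
import math

def pair_energy(r, epsilon, sigma, q_i, q_j, dielectric=4.0):
    """Pairwise force-field-style energy: 12-6 Lennard-Jones van der Waals
    term plus a Coulomb term with a constant dielectric. Units are
    illustrative (kcal/mol, Angstrom, elementary charges)."""
    sr6 = (sigma / r) ** 6                          # (sigma/r)^6
    e_vdw = 4.0 * epsilon * (sr6 * sr6 - sr6)       # 12-6 Lennard-Jones
    e_elec = 332.06 * q_i * q_j / (dielectric * r)  # Coulomb, kcal/mol units
    return e_vdw + e_elec

def interaction_score(ligand_atoms, protein_atoms):
    """Sum pair energies over all ligand-protein atom pairs.
    Atoms are (x, y, z, epsilon, sigma, charge) tuples."""
    total = 0.0
    for xi, yi, zi, ei, si, qi in ligand_atoms:
        for xj, yj, zj, ej, sj, qj in protein_atoms:
            r = math.dist((xi, yi, zi), (xj, yj, zj))
            eps = math.sqrt(ei * ej)   # Lorentz-Berthelot combining rules
            sig = 0.5 * (si + sj)
            total += pair_energy(r, eps, sig, qi, qj)
    return total
```

A convenient sanity check: for a pair of neutral atoms the Lennard-Jones term is zero at r = σ and reaches its minimum of -ε at r = 2^(1/6)σ.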
Empirical scoring functions are founded on the idea that the binding free energy can be correlated to a set of descriptors capturing key interactions involved in binding [2]. These functions take the form of a weighted sum of physicochemical terms, such as hydrophobic contacts, hydrogen bonds, and the number of rotatable bonds immobilized upon complex formation [1]. The central methodology involves using a dataset composed of three-dimensional structures of diverse protein-ligand complexes with associated experimental binding affinity data [2]. The coefficients (weights) of the functional terms are then obtained through regression analysis, traditionally using multiple linear regression (MLR), to calibrate the model and establish a relationship between the descriptors and the experimental affinity [2] [5]. The first empirical scoring function, LUDI, was developed by Böhm, pioneering this approach for predicting binding free energy [2].
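A minimal sketch of this calibration step is shown below. All descriptor values and affinities are synthetic and generated from an assumed underlying linear relation purely for illustration; ordinary least squares stands in for a production MLR pipeline:

```python
import numpy as np

# Hypothetical descriptor matrix: each row is one protein-ligand complex,
# columns are (n_hydrogen_bonds, hydrophobic_contact_area, n_frozen_rotors).
X = np.array([
    [3, 120.0, 2],
    [1,  80.0, 4],
    [5, 200.0, 1],
    [2, 150.0, 3],
    [4,  95.0, 5],
], dtype=float)
# "Experimental" affinities (pKd-like) -- synthetic, generated from an
# assumed linear relation so the regression can recover it exactly.
y = np.array([5.7, 3.7, 8.4, 5.7, 5.4])

# Append a constant column so the regression also fits an intercept, then
# solve the least-squares problem min ||X1 w - y||^2 (classical MLR).
X1 = np.hstack([X, np.ones((X.shape[0], 1))])
w, *_ = np.linalg.lstsq(X1, y, rcond=None)

def predict_affinity(descriptors):
    """Score a new complex as the calibrated weighted sum of its terms."""
    return float(np.dot(np.append(descriptors, 1.0), w))
```

The fitted weights `w` play the role of the regression coefficients that calibrate each physicochemical term against experimental affinity.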
Knowledge-based scoring functions, also known as potentials of mean force, are based on the statistical analysis of interacting atom pairs from a large database of experimentally determined protein-ligand complexes [2] [1]. The underlying principle is that intermolecular interactions between certain types of atoms that occur more frequently than expected in a random distribution are likely to be energetically favorable [1]. These observed frequencies are converted into a pseudopotential that describes the preferred geometries and interactions for protein-ligand atom pairs [2]. The resulting scoring function thus captures the implicit knowledge of molecular recognition "learned" from the structural data in repositories like the Protein Data Bank (PDB) [1].
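The inverse Boltzmann conversion at the heart of this approach can be sketched as follows. The binning, the choice of reference state, and the constant cap for empty bins are simplifying assumptions made for illustration; real implementations differ in exactly these choices:

```python
import math

KT = 0.593  # kcal/mol at roughly 298 K

def pmf_from_counts(observed, reference):
    """Convert per-distance-bin contact counts for one atom-type pair into
    a potential of mean force via the inverse Boltzmann relation:
        u(r) = -kT * ln(g_obs(r) / g_ref(r)).
    Bins with no observations get a constant repulsive cap (an assumption)."""
    n_obs = sum(observed)
    n_ref = sum(reference)
    potential = []
    for o, ref in zip(observed, reference):
        if o == 0 or ref == 0:
            potential.append(3.0)  # arbitrary unfavorable value
        else:
            g = (o / n_obs) / (ref / n_ref)  # normalized frequency ratio
            potential.append(-KT * math.log(g))
    return potential
```

A bin where a contact occurs twice as often as the reference expects yields a favorable (negative) potential of -kT ln 2; a bin depleted relative to the reference yields a positive one, exactly the intuition stated above.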
The development and validation of scoring functions, particularly empirical ones, follow a structured protocol. The general workflow for developing an empirical scoring function, as detailed in recent literature [2] [5], is outlined below.
Empirical Scoring Function Development Workflow
As defined by Pason and Sotriffer, the development of an empirical scoring function requires three core components [2] [5]:
Descriptors: A set of descriptors that quantitatively describe the binding event. These are typically structural and physicochemical features derived from the 3D complex, such as hydrogen-bond counts, hydrophobic contact areas, and the number of rotatable bonds immobilized upon complex formation.
Training Dataset: A curated dataset of three-dimensional structures of protein–ligand complexes, each associated with reliable experimental binding affinity data (e.g., Kd, Ki, IC50). The diversity and quality of this dataset are crucial for the generalizability of the resulting model [2].
Regression/Classification Algorithm: A statistical or machine-learning algorithm to calibrate the model by establishing a relationship between the descriptors and the experimental affinity. Classical methods use multiple linear regression (MLR), but recent efforts increasingly employ sophisticated machine-learning techniques like Random Forests (RF) or Support-Vector Machines (SVM) to capture potential non-linear relationships [2] [5].
To assess a scoring function's practical utility, it is rigorously tested on three distinct tasks, which also represent its primary goals in a docking workflow [2]: predicting the native binding pose (docking power), ranking ligands by binding affinity (scoring and ranking power), and discriminating active from inactive molecules in virtual screening (screening power).
The development and application of classical scoring functions rely on a suite of computational tools and data resources. The table below catalogs key "research reagents" essential for work in this field.
Table 2: Essential Research Reagents and Resources for Scoring Function Research
| Resource Name | Type | Primary Function / Application | Key Features / Notes |
|---|---|---|---|
| Protein Data Bank (PDB) [1] | Database | Primary repository for experimentally determined 3D structures of proteins and protein-ligand complexes. | Essential for training knowledge-based functions and for benchmarking pose prediction; provides structural data for empirical function development. |
| Cambridge Structural Database (CSD) [1] | Database | Repository for experimentally determined small-molecule organic and metal-organic crystal structures. | Used in knowledge-based function development to derive statistical potentials for intermolecular interactions. |
| AutoDock Vina [2] | Docking Software | Widely used molecular docking program that includes its own scoring function. | Employs a hybrid scoring function; commonly used as a platform for testing and validating new scoring methods. |
| Glide (Schrödinger) [2] | Docking Software | Commercial docking program with the empirical GlideScore function. | Known for its high accuracy in pose prediction; often used as a benchmark in performance comparisons. |
| GOLD [2] | Docking Software | Docking software using a genetic algorithm for pose exploration and its own empirical scoring function. | Supports multiple scoring functions; widely used in virtual screening campaigns. |
| DOCK [2] | Docking Software | One of the earliest docking programs, using a force field-based scoring function. | Allows for explicit consideration of solvent and user-defined scoring terms. |
| GAFF / GAFF2 [4] | Force Field | General AMBER Force Field for small molecules. | Provides parameters for force field-based scoring and molecular dynamics simulations; compatible with AMBER protein FFs. |
| CGenFF [4] | Force Field | CHARMM General Force Field for small molecules. | Provides parameters for a wide range of drug-like molecules within the CHARMM force field ecosystem. |
| OPLS3e [4] | Force Field | Optimized Potentials for Liquid Simulations force field. | Includes extensive parameters for drug-like compounds and a ligand-specific charge model; implemented in Schrödinger software. |
The traditional triad of scoring functions, while foundational, faces profound challenges that limit their predictive accuracy, particularly for binding affinity. A core limitation is the simplified treatment of entropy and solvent effects [2]. While some empirical functions include a term for conformational entropy based on rotatable bonds, this is a crude approximation. Furthermore, the explicit and dynamic role of water molecules in binding, which can be crucial for affinity and specificity, is often poorly captured [2] [3].
Another fundamental issue is the inherent difficulty of the parameterization process. The development of empirical and knowledge-based functions is intrinsically linked to the quality, size, and diversity of the experimental data used for training. Inconsistencies in experimental data and the limited coverage of chemical and target space in current datasets can lead to functions that do not generalize well [2] [5]. The approximations used by these functions suggest that the best available classical functions may be close to the limit of what can be achieved with these empirical approaches [3].
The field is now moving beyond the classical triad. The most significant trend is the shift towards machine-learning (ML) and deep-learning (DL) based scoring functions [2] [1] [6]. Unlike classical functions, ML-based models do not assume a predetermined functional form, allowing them to infer complex, non-linear relationships directly from data. These methods have consistently been found to outperform classical functions at binding affinity prediction for diverse protein-ligand complexes [1]. Recent advances also include integrating knowledge-guided pre-training strategies that incorporate additional semantic information, such as molecular descriptors and fingerprints, to learn more robust molecular representations, significantly improving predictive performance [6]. Furthermore, efforts are underway to incorporate more sophisticated physics, such as explicit polarization and quantum mechanical effects, and to develop more automated and intelligent parameterization toolkits for force fields [4]. This evolution points toward a future of hybrid models that leverage the strengths of data-driven learning while respecting the physical principles that govern molecular recognition.
Research Directions: Overcoming Classical Limitations
In computational drug discovery, the accurate prediction of drug-target binding affinity is a cornerstone for identifying and optimizing lead compounds. For decades, this field has been dominated by classical scoring functions—mathematical models that estimate binding strength using predetermined equations with fixed functional forms [7]. These models typically express the binding free energy (ΔG) as a weighted sum of physicochemically-inspired terms, such as van der Waals forces, electrostatic interactions, hydrogen bonding, and desolvation penalties [7]. While this approach benefits from interpretability and computational efficiency, its inherent rigidity fundamentally limits accuracy and flexibility. The reliance on a fixed architecture, where the mathematical relationship between variables is defined a priori by the researcher, fails to capture the complex, non-linear, and context-dependent nature of molecular recognition. This whitepaper examines the technical limitations imposed by these rigid functional forms, quantifies their performance shortcomings, and explores emerging methodologies that promise to overcome these constraints through more flexible, data-driven approaches to affinity prediction.
Classical scoring functions for binding affinity prediction are historically categorized into physics-based, empirical, and knowledge-based approaches, though the boundaries are often blurred [7]. A typical physics-based scoring function, for instance, often adopts a functional form akin to:
ΔG(binding) = ΔE(VdW) + ΔE(el) + ΔE(H-bond) + ΔG(solv) [7]
In this predefined equation, each term represents a specific type of interaction: van der Waals (ΔE(VdW)), electrostatic (ΔE(el)), hydrogen bonding (ΔE(H-bond)), and solvation free energy (ΔG(solv)). The model's final form is a linear combination of these components. Similarly, empirical functions fit coefficients to these terms using experimental binding data, while knowledge-based functions derive potentials of mean force from structural databases. The critical shared limitation is not necessarily the choice of terms but the fixed combinatorial rule—the assumption that the total binding energy can be expressed as a simple, weighted sum of independent contributions. This form cannot capture synergistic or emergent effects between different interaction types, leading to an oversimplified representation of the highly cooperative and complex process of molecular binding.
The reliance on predetermined equations introduces several fundamental technical constraints that curtail predictive accuracy: the additive form admits no cross-terms, so synergistic or anti-cooperative effects between interaction types are invisible to the model; the functional form is fixed a priori, so complexes that violate its modeling assumptions are systematically mis-scored; and the small number of fitted coefficients caps the model's capacity, preventing it from generalizing across structurally diverse targets.
The limitations of rigid functional forms become starkly evident when their performance is quantitatively compared with more flexible, data-driven methods on benchmark tasks. The following table synthesizes key performance metrics from comparative studies, highlighting the accuracy gap.
Table 1: Quantitative Performance Comparison of Scoring Function Paradigms
| Model Category | Representative Example | Key Functional Form Characteristic | Reported Performance | Primary Limitation Illustrated |
|---|---|---|---|---|
| Classical Scoring Function | Physics-Based/ Empirical SFs [7] | Linear combination of pre-defined energy terms. | Lower accuracy, struggles with target identification [9] | Inability to generalize across diverse protein targets. |
| Machine Learning Model | Random Forest (RF) on molecular vibrations [10] [11] | Ensemble of decision trees; non-linear, data-derived rules. | R² > 0.94 for affinity prediction [10] [11] | Highlights the predictive power of flexible, non-parametric models. |
| Symbolic Regression (SR) | SR-derived interatomic potentials [8] | Equation discovered via RL/MCTS; no pre-defined form. | Outperformed Sutton-Chen EAM potentials [8] | Demonstrates that discovered equations can be both accurate and interpretable. |
| Deep Learning (DL) | Boltz-2 & other DL SFs [9] [7] | Multi-layer neural networks; highly non-linear function approximators. | Approaches FEP performance in some domains [9] | Struggles with generalization/memorization on target ID benchmarks [9]. |
A critical benchmark known as the "inter-protein scoring noise problem" further exposes the weakness of classical functions. While these functions can sometimes enrich active molecules for a single specific target, they generally fail to identify the correct protein target for a given active molecule due to scoring variations between different binding pockets [9]. A truly robust affinity prediction method must perform both tasks reliably, a hurdle that rigid forms have not yet cleared.
This case study demonstrates how moving beyond fixed forms can improve accuracy even in a closely related field—material science—providing a template for drug discovery.
Experimental Protocol: The methodology for developing Symbolic Regression (SR)-derived potentials involves a multi-step, data-driven workflow [8].
Key Workflow Diagram: The following diagram illustrates the contrast between the classical and SR approaches to model development.
This study directly addresses drug-target affinity (DTA) prediction and showcases a high-performing machine learning model that bypasses classical rigid forms.
Experimental Protocol: The detailed methodology for constructing the quantitative prediction model is described in the source publications [10] [11].
Key Workflow Diagram: The holistic "whole system" approach is visualized below.
Table 2: Key Research Reagents and Computational Tools for Advanced Affinity Prediction
| Item / Resource | Function / Purpose | Relevance to Overcoming Rigid Forms |
|---|---|---|
| PaDEL-Descriptor [10] [11] | Software to calculate a comprehensive set of molecular descriptors from chemical structure. | Enables featurization based on holistic molecular properties (e.g., vibrations) rather than pre-defined interaction terms. |
| Density Functional Theory (DFT) [8] | Ab initio quantum mechanical method for calculating electronic structure. | Provides high-quality, quantum-accurate training data for developing and validating more flexible models like SR potentials. |
| Random Forest Algorithm [10] [11] | A machine learning method that constructs multiple decision trees for regression or classification. | Provides a powerful, non-parametric alternative to linear models, capable of capturing complex non-linearities without a fixed equation. |
| Reinforcement Learning (RL) & MCTS [8] | A search strategy for exploring large combinatorial spaces (e.g., of mathematical expressions). | The core engine in symbolic regression that allows for the discovery of novel, interpretable functional forms directly from data. |
| Benchmark Datasets (Kd, EC50) [10] [11] | Curated datasets of drug-target pairs with experimentally measured binding affinities. | Essential for training and fairly evaluating the performance of new, flexible models against classical baselines. |
| LIT-PCBA Benchmark Set [9] | A demanding benchmark set designed for evaluating target identification capability. | Tests generalizability—a key weakness of rigid functions—by requiring models to rank affinities across different proteins. |
The evidence is compelling: the rigid functional forms underpinning classical scoring functions constitute a significant bottleneck in the pursuit of accurate, generalizable, and predictive models for binding affinity. Their inability to capture the complex, non-linear physics of molecular interactions inherently limits their accuracy and domain of applicability, as quantified by their struggle with the inter-protein scoring noise problem [9] [7]. Emerging paradigms, including machine learning models that leverage holistic molecular descriptors [10] [11] and symbolic regression that discovers physically interpretable equations directly from data [8], demonstrate a clear path forward. These approaches reject the constraint of predetermined equations in favor of flexibility and data-driven discovery. For the field of computational drug discovery to advance, the research community must increasingly embrace these flexible modeling paradigms, fostering a shift from assuming the form of the solution to letting high-quality data and intelligent algorithms reveal it.
Classical scoring functions are pivotal tools in structure-based drug design, tasked with predicting the binding affinity of a small molecule to a target protein. Despite their long-standing utility, their predictive accuracy has plateaued, largely due to two fundamental omissions: the inadequate treatment of solvation effects and protein flexibility [12] [13]. These molecular phenomena are central to the process of binding, yet classical approaches handle them through drastic simplifications that limit their realism and predictive power. This review delineates how these shortcomings have constrained the reliability of affinity prediction and surveys the emerging computational strategies that are beginning to redress these gaps, thereby framing the limitations within the broader thesis on the evolution of scoring function research.
Classical scoring functions are broadly categorized as force-field, empirical, or knowledge-based [14]. Regardless of type, they share a common methodological constraint: the imposition of a predetermined, theory-inspired functional form for the relationship between the variables characterizing the protein-ligand complex and the predicted binding affinity [12]. This rigid approach leads to poor predictivity for complexes that do not conform to the underlying modeling assumptions. Furthermore, for the sake of computational efficiency, these functions employ a minimal description of protein flexibility and an implicit treatment of solvent, ignoring the dynamic and solvation-driven nature of the binding process [12]. The following sections will dissect the specific challenges posed by solvation and flexibility and detail how modern approaches are integrating them into a new generation of predictive models.
Solvation effects play a critical role in determining the binding free energy in protein-ligand interactions [14]. When a ligand binds to a protein, it undergoes a desolvation process, whereby water molecules are displaced from both the ligand's and the protein's binding site. This process involves a complex balance of energetic contributions: the screening of electrostatic interactions by water, the hydrophobic effect for nonpolar atoms, and the hydrophilic effect for polar groups [14]. Classical scoring functions often neglect these contributions entirely or account for them through oversimplified terms, such as a simple solvent-accessible surface area (SASA)-based energy term, which fails to capture the nuanced physics of water-mediated interactions [14].
The inherent challenge in incorporating solvation is the parameterization of pairwise potentials, solvation, and entropy, which belong to different energetic categories [14]. Consequently, despite the recognized importance of solvation in ligand binding, most classical knowledge-based scoring functions do not explicitly include its contributions, partly due to the difficulty in deriving the corresponding pair potentials and the resulting double-counting problem [14]. This omission represents a significant source of error in binding affinity predictions.
Recent research has developed novel computational models to explicitly include solvation and entropic effects. One prominent method involves an iterative approach to simultaneously derive effective pair potentials and atomic solvation parameters [14]. The binding energy score is expressed as:
ΔG_bind = ∑_ij u_ij(r) + ∑_i σ_i·ΔSA_i

where u_ij(r) is the pair potential between atom types i and j, σ_i is the solvation parameter for atom type i, and ΔSA_i is the change in the solvent-accessible surface area of atom i upon binding [14]. The solvation parameters σ_i are iteratively improved by comparing the predicted and observed SASA changes in the training-set complexes, effectively learning the solvation contribution from the data itself [14].
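A schematic evaluation of a score with this shape (a sum of pair potentials plus SASA-weighted solvation terms) might look like the following; the potential, atom types, and parameter values are toy stand-ins for the iteratively derived ones, not the published ITScore/SE parameters:

```python
def itscore_se_like(pairs, pair_potential, delta_sasa, solvation_params):
    """Score = sum of pair potentials over interacting atom pairs
    plus a SASA-weighted atomic solvation term. All inputs illustrative."""
    energy = sum(pair_potential(ti, tj, r) for (ti, tj, r) in pairs)
    solvation = sum(solvation_params[t] * dsa for (t, dsa) in delta_sasa)
    return energy + solvation

# Toy stand-ins for the derived parameter tables.
def toy_potential(ti, tj, r):
    """Crude square-well pair potential: favorable within 4 Angstrom."""
    return -0.5 if r < 4.0 else 0.0

params = {"C": 0.012, "N": -0.060, "O": -0.045}  # kcal/(mol*A^2), invented

score = itscore_se_like(
    pairs=[("C", "C", 3.5), ("N", "O", 2.9), ("C", "O", 5.0)],
    pair_potential=toy_potential,
    delta_sasa=[("C", -40.0), ("O", -15.0)],  # buried area is negative
    solvation_params=params,
)
```

Note the sign convention in the toy numbers: burying nonpolar carbon surface (positive σ, negative ΔSA) contributes favorably, while burying polar oxygen surface is penalized, mirroring the hydrophobic/hydrophilic balance described above.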
Another approach is seen in the development of physics-based scoring functions like DockTScore, which incorporate optimized terms for solvation and lipophilic interactions, moving beyond simplistic models to better represent the protein-ligand recognition process [15]. Similarly, machine-learning scoring functions circumvent the need for a predetermined functional form, allowing the collective effect of solvation and other interactions to be implicitly inferred from large experimental datasets [12].
Table 1: Computational Methods for Incorporating Solvation Effects
| Method Name | Underlying Approach | Key Solvation Terms | Reported Performance |
|---|---|---|---|
| ITScore/SE [14] | Knowledge-based with iterative parameter fitting | SASA-based energy term with atomic solvation parameters | R² = 0.76 on validation set of 77 complexes |
| DockTScore [15] | Empirical, physics-based with machine learning | Optimized solvation and lipophilic interaction terms | Competitive performance on DUD-E datasets |
| Machine-Learning SFs [12] | Data-driven, non-linear regression | Implicitly learned from comprehensive feature sets | Outperform classical SFs in binding affinity prediction |
Protein flexibility stands out as one of the most important and challenging issues for binding mode prediction in molecular docking [13]. Proteins are dynamic entities that undergo continuous conformational changes of varying magnitudes, which are essential for biological processes like molecular recognition [16] [17]. However, classical docking tools and their embedded scoring functions often treat the protein receptor as a rigid body, an approximation that fails to capture the induced-fit and conformational selection mechanisms that frequently characterize binding [13].
The major limitation of treating proteins as rigid is the failure to account for the conformational entropy contribution to the binding free energy and the structural rearrangements that can open or close binding pockets [12] [13]. This simplification is primarily driven by the astronomical computational cost associated with sampling the full conformational space of a protein during docking. As a result, the reliability of structure-based affinity prediction is severely compromised for targets that undergo significant structural changes upon ligand binding [13].
A variety of conformational sampling methods have been proposed to tackle the challenge of protein flexibility, ranging from techniques that account for local binding-site sidechain rearrangements to those that model full protein flexibility [13].
The best choice of method depends heavily on the system under study and the intended application; there is always a trade-off between computational cost and the degree of flexibility modeled [13].
Diagram 1: Computational workflows for incorporating protein flexibility in docking. Methods branch from a single input structure and converge on producing improved docking poses, which are suitable for different applications.
The limitations of classical scoring functions have catalyzed a shift towards machine-learning scoring functions (ML-SFs) [12] [18]. Unlike classical functions that assume a predetermined functional form (e.g., linear regression with a small number of expert-selected features), ML-SFs use non-linear regression models to infer the functional form directly from the data [12]. This data-driven approach allows ML-SFs to exploit very large volumes of structural and interaction data effectively, capturing complex, non-additive interactions that are hard to model explicitly.
The performance gap between classical and machine-learning SFs is significant and is expected to widen as more training data becomes available [12]. For instance, the ML-SF RF-Score-VS demonstrated a dramatic improvement in virtual screening performance: its top 0.1% of molecules achieved an 88.6% hit rate, compared to just 27.5% for Vina [18]. In binding affinity prediction, RF-Score-VS also substantially outperformed Vina, with Pearson correlations of 0.56 and -0.18, respectively [18]. Other deep learning models, such as DAAP, which uses distance-based features and attention mechanisms, have achieved state-of-the-art performance, with a Pearson correlation of 0.909 on the CASF-2016 benchmark [19].
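The featurization behind RF-Score is intuitive to sketch: count the occurrences of each protein-ligand element pair within a distance cutoff and hand the resulting count vector to a non-linear regressor such as a Random Forest. The version below is a simplified reading of that idea (plain element symbols, a single cutoff) rather than a faithful reimplementation:

```python
import math
from collections import Counter

def rf_score_features(ligand, protein, cutoff=12.0):
    """RF-Score-style featurization sketch: count occurrences of each
    (ligand-element, protein-element) pair within a distance cutoff.
    Atoms are (element, x, y, z) tuples; the count vector would then be
    fed to a non-linear regressor trained on known affinities."""
    counts = Counter()
    for el_l, xl, yl, zl in ligand:
        for el_p, xp, yp, zp in protein:
            if math.dist((xl, yl, zl), (xp, yp, zp)) <= cutoff:
                counts[(el_l, el_p)] += 1
    return counts
```

Because the regressor infers how these counts relate to affinity directly from data, no functional form for the interaction terms needs to be assumed, which is precisely the departure from classical scoring functions discussed above.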
A promising trend is the development of hybrid scoring functions that integrate precise, physics-based descriptors with powerful machine-learning regression algorithms. The DockTScore suite of functions is a prime example, which explicitly accounts for physics-based terms—including optimized MMFF94S force-field terms, solvation and lipophilic interactions, and an improved ligand torsional entropy estimate—combined with machine learning models like Support Vector Machine (SVM) and Random Forest (RF) [15]. This approach aims to retain the physical interpretability of the interaction terms while leveraging the ability of machine learning to model complex, non-linear relationships, thereby avoiding the over-optimistic accuracy estimates sometimes associated with purely black-box models [15].
Table 2: Comparison of Scoring Function Performance on Benchmark Tasks
| Scoring Function Type | Example | Virtual Screening Hit Rate (Top 1%) | Binding Affinity Prediction (Pearson R) | Key Advantages |
|---|---|---|---|---|
| Classical SF | Vina | 16.2% [18] | -0.18 [18] | Speed, simplicity |
| Machine-Learning SF | RF-Score-VS | 55.6% [18] | 0.56 [18] | Handles large datasets, non-linearity |
| Deep Learning SF | DAAP | N/A | 0.909 [19] | Captures complex interactions directly from structure |
| Physics-Based ML SF | DockTScore (MLR) | Competitive on DUD-E [15] | Competitive on core set [15] | Balance of physical interpretability and accuracy |
The iterative method for developing the ITScore/SE knowledge-based scoring function provides a clear protocol for integrating solvation and entropy [14].
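The iterative refinement idea can be caricatured as follows. This loop is a generic error-feedback scheme written for illustration only, not the published ITScore/SE algorithm, and the training data are synthetic:

```python
def iterative_fit(complexes, sigma, learning_rate=1e-4, n_iter=200):
    """Schematic parameter-refinement loop (an assumption, not the published
    protocol): repeatedly score the training complexes, compare with the
    experimental affinities, and nudge each atom type's solvation parameter
    to shrink the discrepancy. Each complex is a tuple of
    (delta_sasa_by_type, precomputed_pair_energy, experimental_dG)."""
    for _ in range(n_iter):
        for dsa, e_pair, dg_exp in complexes:
            predicted = e_pair + sum(sigma[t] * a for t, a in dsa.items())
            error = predicted - dg_exp
            # Move each parameter against its contribution to the error
            for t, a in dsa.items():
                sigma[t] -= learning_rate * error * a
    return sigma
```

On a single synthetic complex with pair energy -2.0, ΔSA of -50 Å² for carbon, and experimental ΔG of -5.0, the loop converges to the unique σ that closes the gap, illustrating how the solvation contribution is "learned" from the residual rather than imposed.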
Table 3: Key Resources for Advanced Scoring Function Development
| Resource Name | Type | Function in Research |
|---|---|---|
| PDBbind [12] [15] | Database | A comprehensive, curated database of protein-ligand complexes with binding affinity data, used for training and benchmarking scoring functions. |
| DUD-E [18] | Benchmark Dataset | "Directory of Useful Decoys: Enhanced" provides benchmark sets for virtual screening, containing known actives and property-matched decoys for many targets. |
| CASF Benchmark [19] | Benchmark Suite | A standardized benchmark for evaluating scoring functions on core tasks like binding affinity prediction, pose prediction, and virtual screening. |
| ATLAS [16] | MD Simulation Database | A database of standardized all-atom molecular dynamics simulations, providing insights into protein dynamics for a representative set of proteins. |
| CHARMM36m Force Field [16] | Molecular Model | A force field used in MD simulations to compute potential energy, parameterized for balanced sampling of folded and disordered proteins. |
| GROMACS [16] | Software | A high-performance molecular dynamics package used to simulate the Newtonian equations of motion for systems with hundreds to millions of particles. |
The inadequate treatment of solvation effects and protein flexibility has been a fundamental bottleneck in the accuracy of classical scoring functions. As this review outlines, these omissions stem from necessary but limiting simplifications made to maintain computational feasibility. The emergence of machine-learning scoring functions represents a paradigm shift, leveraging large datasets to infer complex relationships without being constrained by a predetermined functional form [12] [18]. Simultaneously, the integration of more rigorous physics-based terms, such as explicit solvation and entropy contributions, is providing a more realistic description of the binding process [14] [15]. The synergy of these two approaches—data-driven machine learning and theory-inspired physical models—is paving the way for a new generation of scoring functions with enhanced predictive power and greater generality.
Future progress will depend on continued advances in several areas. The development of large-scale, standardized dynamical data, as exemplified by the ATLAS database, will be crucial for modeling protein flexibility in a consistent manner [16]. Furthermore, the creation of target-specific scoring functions for challenging target classes like protein-protein interactions demonstrates a move away from a one-size-fits-all approach, promising better performance for specific therapeutic applications [15] [20]. As computational power grows and algorithms become more sophisticated, the explicit and accurate integration of solvation, entropy, and full flexibility will transition from a specialist's challenge to a standard component of the drug designer's toolkit, finally overcoming the key omissions that have long limited structure-based affinity prediction.
The additivity assumption posits that the total binding energy of a protein-ligand complex can be represented as the sum of independent, localized interactions. This principle underpins classical scoring functions in molecular recognition, where the affinity for any given molecular structure is calculated by summing contributions from individual atoms, functional groups, or residue pairs. The computational efficiency of this approach has made it a cornerstone in structural bioinformatics and early-stage drug discovery, particularly for rapid virtual screening of compound libraries.
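The additivity assumption described above can be made concrete with a minimal sketch. The pairwise term and all parameters below are invented for illustration; no published scoring function uses exactly these values:

```python
import math

# Toy per-pair interaction term: a linear attraction that vanishes beyond a
# cutoff. Functional form and parameters are illustrative only.
def pair_term(distance, depth=-0.35, cutoff=6.0):
    if distance >= cutoff:
        return 0.0
    return depth * (1.0 - distance / cutoff)

def additive_score(ligand_atoms, protein_atoms):
    """Additivity assumption: the total score is a plain sum of independent
    pairwise contributions, with no cooperative or context-dependent terms."""
    total = 0.0
    for la in ligand_atoms:
        for pa in protein_atoms:
            total += pair_term(math.dist(la, pa))
    return total

ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
pocket = [(3.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
print(round(additive_score(ligand, pocket), 3))  # -> -0.767
```

The key structural point is in `additive_score`: every pair contributes independently, which is precisely what breaks down in the cooperative systems discussed next.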
However, mounting experimental evidence from quantitative biochemistry reveals that molecular recognition in biological systems frequently deviates from perfect additivity. Non-additive effects emerge from complex, cooperative interactions within and between molecules—effects that simple summing functions cannot capture. This whitepaper examines the fundamental limitations of the additivity assumption through key case studies and quantitative data, providing researchers with a framework for critically evaluating scoring function performance in affinity prediction research.
Protein-DNA interactions serve as an ideal model system for testing additivity due to their well-defined binding interfaces and the discrete nature of nucleotide positions. A re-analysis of seminal studies on the Mnt repressor protein and mouse EGR1 protein binding provides compelling quantitative evidence against purely additive models [21].
Table 1: Correlation Between Measured Binding Affinities and Additive Model Predictions
| Zif268 Variant | Mononucleotide BAM (123) | Dinucleotide BAM (12*3) | Dinucleotide BAM (1*23) |
|---|---|---|---|
| Wild-type | 0.973 | 0.986 | 0.987 |
| RGPD | 0.883 | 0.942 | 0.941 |
| REDV | 0.999 | 0.999 | 0.999 |
| LRHN | 0.927 | 0.978 | 0.956 |
| KASN | 0.695 | 0.791 | 0.718 |
While the mononucleotide Best Additive Model (BAM) shows strong correlations for some proteins (e.g., REDV at 0.999), performance substantially degrades for others (KASN at 0.695) [21]. The consistent improvement of dinucleotide models, which incorporate nearest-neighbor interdependencies, demonstrates that positional coupling significantly impacts binding affinity. For the KASN variant, the dinucleotide model (12*3) achieves a correlation of 0.791 compared to 0.695 for the mononucleotide model, a 14% relative improvement in correlation [21].
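The distinction between mononucleotide and dinucleotide additive models can be sketched in a few lines. The energy tables below are invented for demonstration; they are not the fitted Zif268 parameters from the cited study:

```python
# Illustrative mononucleotide vs dinucleotide additive models for a 3-bp site.
# All energy values are invented; lower scores mean tighter binding.

mono = {  # position -> base -> independent energy contribution
    0: {"A": 0.0, "C": 0.5, "G": 1.2, "T": 0.9},
    1: {"A": 0.7, "C": 0.0, "G": 0.4, "T": 1.1},
    2: {"A": 1.0, "C": 0.8, "G": 0.0, "T": 0.6},
}

# Coupling between positions 1 and 2: a correction applied only when a
# specific dinucleotide occurs, capturing nearest-neighbor interdependence.
dinuc_12 = {("C", "G"): -0.3, ("A", "T"): 0.2}

def score_mono(site):
    """Mononucleotide BAM: strictly position-independent sum."""
    return sum(mono[i][b] for i, b in enumerate(site))

def score_dinuc(site):
    """Dinucleotide BAM: additive baseline plus a pairwise coupling term."""
    return score_mono(site) + dinuc_12.get((site[1], site[2]), 0.0)

print(score_mono("ACG"))   # -> 0.0
print(score_dinuc("ACG"))  # -> -0.3
```

The dinucleotide correction is exactly the kind of term a purely additive model cannot express, which is why the dinucleotide BAMs in Table 1 consistently correlate better with measured affinities.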
The limitations of additivity extend to protein-ligand interactions central to drug discovery. Fragment-Based Drug Discovery (FBDD) highlights the importance of non-additive synergy when fragments are combined [22]. While fragments themselves follow approximately additive rules due to their small size and simple interactions, their optimization into lead compounds frequently reveals cooperative effects that deviate from predictions based on fragment properties alone.
Modern machine learning approaches explicitly address these limitations. The ProBound framework, which models transcription factor binding affinity from sequencing data, incorporates cooperativity terms and multi-protein complex interactions that fundamentally violate simple additivity [23]. Similarly, the SCAGE architecture for molecular property prediction employs a multitask pretraining framework that captures complex relationships between molecular structure and function beyond what additive models can represent [24].
Objective: Systematically measure positional interdependence in molecular recognition.
Methodology:
Critical Controls:
Objective: Quantify cooperative effects in molecular assembly.
Methodology:
Next-generation computational models address additivity failures through several innovative approaches:
Multitask Pretraining Frameworks: SCAGE incorporates four pretraining tasks (molecular fingerprint prediction, functional group prediction, 2D atomic distance prediction, and 3D bond angle prediction) to learn comprehensive molecular representations that capture complex structure-activity relationships [24].
Cooperativity Modeling: ProBound explicitly models cooperative binding in multi-TF complexes through energy terms that depend on relative positioning and orientation of binding partners [23].
Geometric Learning: Incorporation of 3D structural information (atomic distances, bond angles, conformational flexibility) enables models to capture spatial relationships that violate simple additivity [24] [23].
Modern non-additive models also provide richer biochemical insight than purely additive fits.
Table 2: Comparison of Molecular Recognition Modeling Approaches
| Model Type | Key Assumptions | Strengths | Limitations |
|---|---|---|---|
| Additive (BAM) | Position independence | Computational efficiency; Simple interpretation | Fails for cooperative systems; Limited accuracy |
| Dinucleotide BAM | Dinucleotide interdependence | Captures nearest-neighbor effects; Improved accuracy | Still misses longer-range interactions |
| ProBound | Multi-experiment integration | Quantifies cooperativity; Handles modifications | Computational intensity; Complex implementation |
| SCAGE | Multitask representation learning | Captures complex structure-activity relationships | Requires extensive pretraining data |
Table 3: Essential Research Materials for Non-Additivity Studies
| Reagent/Technology | Function | Application Context |
|---|---|---|
| SELEX-seq | High-throughput profiling of protein-DNA interactions | Comprehensive binding affinity measurement [23] |
| KD-seq | Absolute affinity determination using input, bound and unbound fractions | Direct measurement of binding constants [23] |
| Fragment Libraries (~1400 compounds) | Screening for molecular recognition elements | Identifying privileged substructures [25] |
| Multi-TF SELEX | Characterization of cooperative complexes | Quantifying cooperativity in multi-protein assemblies [23] |
| Methylated DNA Libraries | Profiling epigenetic effects on recognition | Methylation-aware binding models [23] |
The empirical evidence against universal additivity in molecular recognition is substantial and growing. Quantitative studies of protein-DNA interactions reveal significant positional interdependencies, while fragment-based drug discovery demonstrates cooperative effects in molecular assembly. These non-additive phenomena necessitate advanced modeling approaches that explicitly account for cooperativity, spatial relationships, and contextual effects.
Modern machine learning frameworks like ProBound and SCAGE point the way forward by integrating diverse data types, modeling cooperativity explicitly, and maintaining biophysical interpretability. As molecular recognition research advances, the field must move beyond the convenient but limited additive assumption toward more sophisticated models that capture the complex, emergent properties of biological systems. This paradigm shift will enable more accurate affinity prediction, rational design of molecular interventions, and ultimately, more efficient drug discovery pipelines.
Structure-based virtual screening (VS) has become an indispensable tool in computational drug discovery, yet its effectiveness is fundamentally constrained by the accuracy of scoring functions (SFs). Classical SFs, which rely on empirical, force-field-based, or knowledge-based approaches, have hit a persistent performance plateau in their ability to discriminate between binders and non-binders. This whitepaper delineates the core limitations of these classical SFs and frames them within the broader thesis of affinity prediction research. We explore the emergence of machine-learning (ML) scoring functions as a transformative solution, presenting quantitative benchmarks and detailed methodologies that underscore their superior performance in enriching true actives and predicting binding affinities.
The primary goal of structure-based virtual screening is to identify novel bioactive molecules from vast chemical libraries by computationally docking them into a target protein's structure. The efficacy of this process hinges entirely on the scoring function's ability to rank compounds based on their predicted affinity. Classical SFs, embedded in popular docking tools, estimate binding energy using simplified physical models or statistical potentials derived from known protein-ligand structures. Despite their long-standing utility, these functions suffer from well-documented limitations: they often inadequately account for conformational entropy, solvation effects, and specific interaction nuances, leading to inaccurate affinity predictions and poor enrichment of true binders [18]. Consequently, the field has witnessed a performance plateau, where incremental improvements in classical SFs have yielded diminishing returns, creating a critical bottleneck in the early drug discovery pipeline [18] [26]. This paper examines the evidence for this plateau and the subsequent paradigm shift towards data-driven ML approaches, which learn the complex relationships between protein-ligand structural features and binding affinities directly from large-scale experimental data.
Extensive benchmarking studies across diverse protein targets provide concrete evidence of the limitations of classical SFs. The data reveal that while these functions can serve as loose classifiers, their performance, particularly in early enrichment, is significantly surpassed by modern machine-learning scoring functions.
Table 1: Virtual Screening Performance Comparison on the DUD-E Benchmark (102 Targets)
| Scoring Function | Type | Hit Rate at Top 1% | Hit Rate at Top 0.1% | Binding Affinity Pearson Correlation |
|---|---|---|---|---|
| RF-Score-VS | Machine Learning | 55.6% | 88.6% | 0.56 |
| AutoDock Vina | Classical (Empirical) | 16.2% | 27.5% | -0.18 |
| DOCK3.7 | Classical (Force-Field) | ~15% (est.) | - | - |
The data in Table 1, derived from a large-scale study on the DUD-E benchmark, is telling. The machine-learning SF, RF-Score-VS, achieves a hit rate at the top 1% of ranked molecules that is more than three times that of a classical SF like Vina [18]. The difference is even more dramatic in the ultra-early enrichment zone (top 0.1%), where RF-Score-VS identifies hits with near 90% accuracy. Furthermore, the poor Pearson correlation of Vina's scores with experimental binding affinity (-0.18) underscores its inability to provide a meaningful quantitative estimate of binding strength, a core limitation in affinity prediction research [18].
This performance gap is not isolated. A 2025 benchmarking study on Plasmodium falciparum Dihydrofolate Reductase (PfDHFR) variants further corroborates these findings. The study showed that re-scoring initial docking poses with ML SFs like CNN-Score dramatically improved early enrichment. For the wild-type enzyme, re-scoring with CNN-Score achieved an enrichment factor at 1% (EF1%) of 28, a substantial improvement over the baseline docking tools [27].
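Both metrics quoted above, the hit rate in a top-ranked fraction and the enrichment factor, are simple functions of a ranked, labeled compound list. A minimal sketch on synthetic labels (not the benchmark data from the cited studies):

```python
def hit_rate_at(ranked_labels, fraction):
    """Fraction of actives among the top `fraction` of a score-ranked list
    (labels: 1 = active, 0 = decoy, already sorted best-score first)."""
    n_top = max(1, int(len(ranked_labels) * fraction))
    return sum(ranked_labels[:n_top]) / n_top

def enrichment_factor(ranked_labels, fraction):
    """EF = hit rate in the top fraction / hit rate in the whole library."""
    overall = sum(ranked_labels) / len(ranked_labels)
    return hit_rate_at(ranked_labels, fraction) / overall

# Synthetic library: 1000 compounds, 20 actives, strong early enrichment.
labels = [1] * 8 + [0] * 2 + [1] * 12 + [0] * 978
print(hit_rate_at(labels, 0.01))       # -> 0.8 (8 actives in the top 10)
print(enrichment_factor(labels, 0.01))
```

With 2% actives overall and 80% actives in the top 1%, the EF1% here is 40, illustrating how early enrichment amplifies small ranking improvements into large practical gains.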
To ensure the reproducibility of VS benchmarks and the rigorous validation of new SFs, researchers adhere to standardized protocols. The following methodology outlines a comprehensive benchmarking workflow.
Data Provenance: Public benchmark sets like DUD-E (Directory of Useful Decoys: Enhanced) and DEKOIS 2.0 are commonly used. These sets provide, for a given protein target, a list of known active molecules and a set of decoy molecules—structurally similar but presumed inactive molecules that act as negative controls [18] [27].
Docking Experiments: The prepared ligand and decoy libraries are docked into the prepared protein structure using one or more docking programs (e.g., AutoDock Vina, FRED, PLANTS). The grid box dimensions are set to encompass the entire binding site [27].
Performance Validation: To prevent overfitting and ensure generalizability, strict cross-validation strategies are employed, holding out entire targets or ligand clusters from training so that test complexes are genuinely unseen.
Table 2: Key Research Reagents and Computational Tools for Virtual Screening Benchmarking
| Reagent / Tool Name | Type/Category | Primary Function in VS Workflow |
|---|---|---|
| DUD-E / DEKOIS 2.0 | Benchmark Dataset | Provides curated sets of active molecules and property-matched decoys for rigorous performance assessment. |
| AutoDock Vina | Docking Program | Generates plausible binding poses and provides an initial score using an empirical scoring function. |
| RF-Score-VS / CNN-Score | Machine-Learning Scoring Function | Re-scores docking poses to significantly improve the ranking of active molecules over decoys. |
| PDBbind Database | Training Dataset | A comprehensive collection of protein-ligand complexes with binding affinity data for training ML scoring functions. |
| OpenBabel / SPORES | File Format Tool | Converts and processes chemical file formats between different docking and analysis software. |
Virtual Screening Benchmarking Workflow
The transition to machine-learning scoring functions is not a panacea. While they show remarkable performance on established benchmarks, significant challenges remain that define the current frontier of affinity prediction research.
A critical issue undermining the perceived progress in ML-based affinity prediction is train-test data leakage. A 2025 analysis revealed that the standard benchmark used for evaluating SFs, the Comparative Assessment of Scoring Functions (CASF), shares a high degree of structural similarity with the PDBbind database used to train these models. This means models can perform well by memorizing similarities rather than by genuinely learning protein-ligand interactions [28]. When models like GenScore and Pafnucy were retrained on a rigorously filtered dataset (PDBbind CleanSplit) to eliminate this leakage, their performance dropped markedly, revealing an overestimation of their true generalization capabilities [28]. This highlights a core challenge: developing models that generalize to genuinely novel targets and not just those structurally related to training examples.
To combat generalization issues and improve accuracy, researchers are developing specialized approaches, including rigorously filtered, leakage-free training sets such as PDBbind CleanSplit, target-specific training paradigms, and graph neural network architectures [28].
The performance plateau of classical scoring functions in virtual screening is a well-documented reality, driven by their inherent inability to capture the complex physical chemistry of molecular recognition. The field is unequivocally shifting towards machine-learning-based solutions, which have demonstrated a profound ability to enrich true binders and offer more accurate affinity predictions. However, the path forward must be navigated with caution. The dual challenges of data leakage in public benchmarks and the limited generalization of many current models represent the next major hurdles. Future research must prioritize the development of rigorously benchmarked models, trained on non-redundant, leakage-free data, and validated on truly novel targets. The integration of advanced architectures like graph neural networks and the strategic use of target-specific training paradigms offer promising avenues to finally move beyond the plateau and deliver on the promise of accurate, reliable affinity prediction for drug discovery.
Accurately predicting the binding affinity between a small molecule and its protein target is a cornerstone of computational drug discovery. The strength of this interaction, quantified as binding affinity, directly determines a drug candidate's efficacy and is a critical parameter for lead optimization [30]. For decades, the development of scoring functions capable of reliably estimating this affinity has been a primary research focus. These functions aim to correlate the three-dimensional structural information of a protein-ligand complex with experimentally measured binding constants (Ki, Kd, IC50), providing a computational substitute for costly and time-consuming laboratory assays [30] [31].
However, a significant and persistent challenge plagues the field: the poor correlation between computationally predicted affinities and experimentally validated results. This gap severely limits the utility of these methods in real-world drug discovery pipelines, where decisions about which compounds to synthesize and test often hinge on computational predictions [30] [28]. Insufficient conformational sampling, oversimplified energy functions, and an inability to accurately model critical solvation and entropic effects are frequently cited as traditional culprits [30]. While deep learning has emerged as a promising paradigm, offering computational efficiency and the ability to learn complex patterns from data, its performance is often overestimated due to benchmark datasets plagued by data leakage and redundancy [28]. This whitepaper examines the core limitations of both classical and machine learning-based affinity prediction methods, framed within the broader thesis that current scoring functions, despite their sophistication, are not yet robust or generalizable enough to replace experimental validation.
The discrepancy between in silico predictions and experimental binding constants arises from a confluence of factors that affect both traditional and modern deep learning approaches.
Conventional physics-based methods face intrinsic hurdles. Molecular dynamics (MD) simulations for binding free energy calculations, such as those using the Bennett Acceptance Ratio (BAR), are computationally intensive. Achieving sufficient sampling is difficult because the inclusion of explicit solvent or membrane environments requires extensive equilibration to ensure system stability [30]. Furthermore, as a state function, binding free energy calculation requires finely dividing the perturbation range into multiple intermediate lambda (λ) states to control energy transitions, adding to the computational burden [30]. Classical scoring functions embedded in docking tools like AutoDock Vina or Glide rely on empirical rules and heuristic search algorithms, which often result in inaccuracies and an inability to fully capture the complexity of molecular interactions [32].
A critical, and often underestimated, challenge is the issue of data quality and evaluation. The performance of deep-learning models is highly dependent on their training data. A 2025 study highlighted that a significant train-test data leakage exists between the widely used PDBbind database and the Comparative Assessment of Scoring Functions (CASF) benchmark [28]. This leakage, stemming from structural similarities between training and test complexes, severely inflates the performance metrics of models, leading to a substantial overestimation of their generalization capabilities [28]. Alarmingly, some models perform well on benchmarks even when protein information is omitted, suggesting they rely on memorizing ligand-specific patterns rather than learning genuine protein-ligand interactions [28]. This problem is compounded by redundancies within the training data itself, which can encourage models to settle for a local minimum in the loss landscape through memorization instead of developing a robust predictive understanding [28].
Even the most accurate models on paper can fail in practical applications. A comprehensive evaluation of deep learning-based docking methods revealed significant challenges in generalization, particularly when encountering novel protein binding pockets not represented in the training data [32]. Furthermore, many deep learning methods, especially generative diffusion models, can produce poses with favorable root-mean-square deviation (RMSD) scores but that are physically implausible. They may exhibit steric clashes, incorrect bond lengths/angles, or fail to recapitulate key protein-ligand interactions essential for biological activity [32]. This indicates that while these models learn to generate geometrically correct poses, they may not fully grasp the underlying physicochemical principles governing binding.
Table 1: Core Challenges in Binding Affinity Prediction
| Challenge Category | Specific Limitations | Impact on Prediction |
|---|---|---|
| Methodological Limits | Insufficient sampling in MD simulations; Oversimplified scoring functions [30] [32]. | Inaccurate energy estimates; Failure to capture key interaction dynamics. |
| Data Bias & Leakage | Structural similarities between PDBbind training and CASF test sets; Redundant training data [28]. | Overestimated model performance; Poor generalization to novel targets. |
| Generalization Failure | Inability to handle novel protein pockets or ligand topologies; Production of physically invalid poses [32] [28]. | Models fail in real-world virtual screening and lead optimization. |
| Evaluation Deficits | Over-reliance on a single metric (e.g., RMSD); Lack of target identification benchmarks [32] [9]. | Incomplete picture of model utility for drug discovery. |
The theoretical challenges manifest in concrete performance gaps when methods are rigorously evaluated. When state-of-the-art models like GenScore and Pafnucy were retrained on a cleaned dataset (PDBbind CleanSplit) designed to eliminate data leakage, their performance on the CASF benchmark dropped markedly [28]. This confirms that previously reported high scores were largely driven by data leakage rather than genuine learning. In molecular docking, a multidimensional evaluation shows a wide variation in success rates. The "combined success rate" – which considers both pose accuracy (RMSD ≤ 2 Å) and physical validity – reveals that even the best methods have significant room for improvement.
Table 2: Performance Comparison of Docking Methods on Benchmark Datasets [32]
| Method Type | Representative Method | Combined Success Rate (Astex Diverse Set) | Combined Success Rate (DockGen - Novel Pockets) |
|---|---|---|---|
| Traditional | Glide SP | >85% (inferred) | High (inferred as top tier) |
| Hybrid (AI Scoring) | Interformer | Second highest tier | Second highest tier |
| Generative Diffusion | SurfDock | 61.18% | 33.33% |
| Regression-Based | KarmaDock, QuickBind | Lowest tier | Lowest tier |
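The "combined success rate" used in Table 2 can be computed directly once each pose has an RMSD and a physical-validity verdict (e.g., from a PoseBusters-style check). The poses below are synthetic, for illustration only:

```python
def combined_success_rate(results, rmsd_cutoff=2.0):
    """results: list of (rmsd_angstrom, physically_valid) per predicted pose.
    A pose counts as a success only if it is both geometrically accurate
    (RMSD <= cutoff) and physically plausible (no clashes, sane geometry)."""
    hits = sum(1 for rmsd, valid in results if rmsd <= rmsd_cutoff and valid)
    return hits / len(results)

# Synthetic poses: the second has good geometry but fails validity checks,
# so a low RMSD alone is not enough to count as a success.
poses = [(1.2, True), (1.8, False), (0.9, True), (3.5, True)]
print(combined_success_rate(poses))  # -> 0.5
```

This joint criterion is what separates the metric from plain RMSD success rates and explains why generative models with good RMSD statistics can still rank low in Table 2.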
Another telling benchmark is the "inter-protein scoring noise" problem. Classical functions can enrich active molecules for a single target but fail to identify the correct protein target for a given active molecule due to scoring variations between different binding pockets [9]. A test of the Boltz-2 model, a biomolecular foundation model, on a target identification benchmark revealed that it still could not identify the true target of active molecules: it failed to assign them a higher predicted binding affinity against their target than against decoy targets [9]. This indicates a lack of generalizable understanding of protein-ligand interactions.
To illustrate the complexities involved in affinity prediction, we detail two key experimental approaches: one based on molecular dynamics and another on modern deep learning model training.
The following workflow outlines the protocol for achieving efficient sampling and binding free energy calculation using a re-engineered Bennett Acceptance Ratio (BAR) method, as applied to GPCR targets [30].
Workflow Description: This protocol [30] begins with a prepared structure of the protein-ligand complex, such as a G-protein coupled receptor (GPCR) with a bound agonist or antagonist. For membrane proteins like GPCRs, the complex is embedded within an appropriate membrane model and solvated with explicit water molecules, followed by ion addition for physiological ionic strength. A multi-step equilibration through molecular dynamics is then critical to ensure the stability of the entire system—protein, ligand, membrane, and solvent. The core of the alchemical method involves defining a pathway between the bound and unbound states by dividing the transformation into numerous intermediate steps, represented by scaling factors known as lambda (λ) values. Extensive molecular dynamics sampling is performed at each of these lambda states to collect energy data for both forward and backward transitions. Finally, the binding free energy (ΔGbind) is calculated by applying the re-engineered BAR method to this collected data. The validity of the computational approach is demonstrated by correlating the calculated ΔGbind values with experimental binding affinity data (pK_D).
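The final BAR step of this workflow can be illustrated with a self-contained numerical sketch. For equal numbers of forward and reverse work samples in units of kT, the Bennett estimate of ΔF solves the self-consistency equation Σᵢ f(W_F,i − ΔF) = Σⱼ f(W_R,j + ΔF) with f(x) = 1/(1 + eˣ). This is the textbook BAR estimator, not the re-engineered variant of the cited study, and the Gaussian work distributions are synthetic rather than GPCR simulation output:

```python
import math
import random

def fermi(x):
    """Fermi function 1/(1 + e^x), guarded against overflow."""
    if x > 50:
        return 0.0
    if x < -50:
        return 1.0
    return 1.0 / (1.0 + math.exp(x))

def bar_delta_f(w_forward, w_reverse, lo=-50.0, hi=50.0, iters=60):
    """Solve the Bennett self-consistency equation by bisection
    (equal sample sizes, all energies in units of kT)."""
    def imbalance(df):
        lhs = sum(fermi(w - df) for w in w_forward)
        rhs = sum(fermi(w + df) for w in w_reverse)
        return lhs - rhs  # monotonically increasing in df
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if imbalance(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Synthetic Gaussian work distributions consistent with Crooks' theorem:
# true dF = 3 kT, dissipated work sigma^2/2 = 2 kT per direction.
random.seed(0)
sigma, df_true = 2.0, 3.0
w_f = [random.gauss(df_true + sigma**2 / 2, sigma) for _ in range(4000)]
w_r = [random.gauss(-df_true + sigma**2 / 2, sigma) for _ in range(4000)]
print(round(bar_delta_f(w_f, w_r), 2))  # close to the true value of 3.0
```

Because the imbalance function is monotonic in ΔF, simple bisection converges reliably; production codes (e.g., pymbar) add error estimates and overlap diagnostics on top of this same self-consistency condition.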
This protocol focuses on mitigating data bias to improve model generalization, a key challenge identified in recent research [28].
Workflow Description: This protocol [28] starts with the raw PDBbind database. The first and most crucial step is structure-based filtering using a multimodal clustering algorithm. This algorithm assesses similarity between protein-ligand complexes by combining protein similarity (TM-score), ligand similarity (Tanimoto score), and binding conformation similarity (pocket-aligned ligand RMSD). This identifies and removes complexes in the training set that are overly similar to those in the test set (e.g., the CASF benchmark), effectively eliminating train-test data leakage. The result is a curated training dataset, such as PDBbind CleanSplit. The protocol also involves reducing redundancy within the training set itself by resolving large similarity clusters, forcing the model to learn general rules rather than memorizing specific examples. The model architecture, such as a Graph Neural Network (GNN), is designed for sparse graph modeling of protein-ligand interactions and can be enhanced with transfer learning from large protein language models. Finally, the model is evaluated on a strictly independent test set, with ablation studies conducted to verify that its predictions are based on a genuine understanding of interactions and not data leakage.
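The core filtering idea can be sketched with one of the three similarity channels. The CleanSplit protocol combines TM-score, Tanimoto, and pocket-aligned RMSD; the simplified version below uses only ligand Tanimoto similarity on fingerprint on-bit sets, with an illustrative threshold:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def filter_training_set(train, test, threshold=0.9):
    """Drop training complexes whose ligand is too similar to any test ligand.
    A full CleanSplit-style filter would also compare protein structures
    (TM-score) and pocket-aligned ligand RMSD; this sketch uses ligand
    fingerprints only, and the 0.9 threshold is illustrative."""
    kept = {}
    for name, fp in train.items():
        if all(tanimoto(fp, test_fp) < threshold for test_fp in test.values()):
            kept[name] = fp
    return kept

# Hypothetical fingerprints: cplx1 duplicates a test-set ligand exactly.
train = {"cplx1": {1, 2, 3, 4}, "cplx2": {10, 11, 12}, "cplx3": {1, 2, 3, 5}}
test = {"casf1": {1, 2, 3, 4}}
clean = filter_training_set(train, test)
print(sorted(clean))  # -> ['cplx2', 'cplx3']
```

Removing `cplx1` here is the one-channel analogue of eliminating train-test leakage: a model can no longer score the test complex well simply by having memorized a near-identical training example.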
Table 3: Essential Resources for Binding Affinity Prediction Research
| Resource Name | Type | Primary Function in Research |
|---|---|---|
| PDBbind [31] [28] | Database | Comprehensive collection of protein-ligand complex structures with experimentally measured binding affinity data. Serves as a primary source for model training. |
| CASF Benchmark [31] [28] | Benchmark Set | Curated dataset used for the comparative assessment of scoring functions' performance in scoring, ranking, docking, and screening powers. |
| GROMACS [30] | Software | High-performance molecular dynamics toolkit used for running simulations, system equilibration, and alchemical free energy calculations. |
| AutoDock Vina [32] [28] | Software | Widely used molecular docking program with an empirical scoring function, often used as a baseline for comparison. |
| Glide [32] | Software | A robust molecular docking tool known for its accurate pose prediction and rigorous sampling algorithms. |
| Boltz-2 [9] | AI Model | A biomolecular foundation model claimed to approach the performance of FEP in estimating binding affinity. |
| PoseBusters [32] | Validation Tool | Toolkit to systematically evaluate the physical plausibility and chemical correctness of predicted docking poses. |
| CleanSplit [28] | Curated Dataset | A filtered version of PDBbind designed to minimize train-test data leakage and redundancy, enabling genuine evaluation of model generalization. |
The challenge of achieving a strong correlation between predicted and experimental binding constants remains a significant bottleneck in computational drug discovery. The limitations are deeply rooted and multifaceted, extending beyond simple algorithmic improvements. While deep learning offers new avenues, its current promise is tempered by critical issues of data bias, overestimation of capabilities, and poor generalization on truly novel targets. The path forward requires a concerted shift in the research community's approach. This includes the development and adoption of rigorously curated, non-redundant datasets, the implementation of more demanding benchmarks that test for target identification and generalization, and a holistic evaluation of models that prioritizes physical plausibility and biological relevance alongside raw predictive accuracy. Overcoming the affinity prediction challenge is not merely a computational problem but an interdisciplinary endeavor that demands a more nuanced understanding of both biological complexity and the limitations of our data-driven models.
Accurate prediction of protein-ligand binding affinity is a cornerstone of structure-based drug design. While classical scoring functions are often adequate for evaluating ligands similar to their training data, their performance significantly degrades when applied to novel chemical scaffolds or diverse protein targets—a limitation termed congeneric bias. This whitepaper analyzes the fundamental origins of this bias, rooted in statistical learning theory and exacerbated by dataset construction flaws. We demonstrate through quantitative analysis that generalized models possess inherent accuracy limits, with protein-specific models consistently outperforming universal functions. Furthermore, we document how data leakage and redundancy in common benchmarks artificially inflate performance metrics, creating a false impression of generalizability. Emerging solutions, including advanced graph neural networks, multitask learning architectures, and rigorous data curation protocols, show promise for overcoming these limitations. The findings underscore the necessity of developing next-generation scoring functions that transcend simple pattern matching to genuinely learn the biophysical principles of molecular recognition.
The accurate prediction of binding affinity remains one of the great challenges in computational chemistry [33]. Classical scoring functions were developed to provide fast assessment of protein-ligand complexes using single structural snapshots, offering an essential tool for virtual screening and lead optimization in drug discovery. These functions traditionally compromise between physical accuracy and computational efficiency, employing empirical, force-field-based, or knowledge-based approaches to score complexes.
However, a critical and persistent limitation has emerged: these functions demonstrate uneven performance across different targets and often fail catastrophically when applied to novel target classes or chemically diverse ligands [34]. This "congeneric bias" manifests when models trained on specific chemical series or protein families cannot generalize to structurally distinct complexes. The bias stems from fundamental limitations in both the theoretical foundations of scoring functions and the datasets used for their development and validation.
Recent analyses reveal that the performance of many published models has been substantially overestimated due to benchmark contamination [28]. When evaluated on properly curated datasets, even state-of-the-art models show marked performance drops, exposing their reliance on memorization rather than learning underlying physical principles. This whitepaper examines the mechanistic origins of congeneric bias, presents quantitative evidence of its effects, and outlines experimental frameworks and computational solutions designed to overcome these limitations.
The theoretical framework underlying empirical scoring functions contains fundamental constraints that necessarily limit their generalizability. Through the lens of statistical learning theory and information theory, we can formally demonstrate why a universally accurate scoring function is theoretically unattainable.
Statistical learning theory formalizes the process of elucidating functional relationships between structural features (x) and binding affinity (y) by assuming a probabilistic process generates the data used for training and testing [33]. The optimal model would capture the conditional probability distribution p(y|x), which encodes the true relationship between structure and affinity.
Using cross-entropy C(Y|X) as a loss function, the error decomposes into two components: the conditional entropy h(Y|X), which is fixed by the chosen descriptor set, and the expected regret E_x[D(p(y|x) ‖ q(y|x))], the Kullback-Leibler divergence between the true conditional distribution p(y|x) and the model's approximation q(y|x):

C(Y|X) = h(Y|X) + E_x[D(p(y|x) ‖ q(y|x))]
This decomposition reveals that even with ideal descriptors and infinite training data, h(Y|X) represents an irreducible uncertainty in affinity prediction from structural snapshots alone [33].
Theoretical analysis proves that generalized structure-based models have inherent accuracy limits, and protein-specific models will always likely perform better for their respective targets [33]. This occurs because the joint probability distribution p(x,y) over structures and affinities differs significantly across protein families. A single model q(y|x) must compromise across these different distributions, necessarily increasing regret for any specific target.
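The decomposition and the resulting penalty for pooled models can be verified numerically. The sketch below uses hypothetical binned affinity distributions for two protein families (the names and probabilities are illustrative, not from the source) and confirms that cross-entropy equals irreducible entropy plus regret, and that a single pooled model incurs nonzero regret on each specific target:

```python
import math

def cross_entropy(p, q):
    """C(p, q) = -sum_i p_i * log(q_i), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """h(p) = -sum_i p_i * log(p_i)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    """D(p || q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical binned affinity distributions for two protein families
p_family_a = [0.70, 0.20, 0.10]  # affinities skew high for family A
p_family_b = [0.10, 0.20, 0.70]  # affinities skew low for family B

# A single generalized model must compromise across both distributions
q_pooled = [(a + b) / 2 for a, b in zip(p_family_a, p_family_b)]

for p in (p_family_a, p_family_b):
    # Decomposition: cross-entropy = irreducible entropy + regret (KL term)
    assert abs(cross_entropy(p, q_pooled) - (entropy(p) + kl(p, q_pooled))) < 1e-12
    # The pooled model pays positive regret on every specific target
    assert kl(p, q_pooled) > 0
```

A protein-specific model with q equal to its target's p would drive the KL term to zero, leaving only h(Y|X), which is the formal content of Table 1 below.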
Table 1: Theoretical Error Components in Generalized vs. Targeted Models
| Model Type | Minimum Error h(Y\|X) | Expected Regret D(p\|q) | Total Expected Error |
|---|---|---|---|
| Generalized Model | Fixed for descriptor set | High (must compromise across targets) | High |
| Protein-Specific Model | Fixed for descriptor set | Low (optimized for specific p(x,y)) | Lower |
Theoretical Framework of Scoring Function Limitations: This diagram illustrates how assumptions in statistical learning theory, when applied to structure-based affinity prediction, inevitably lead to generalization failure. The fundamental discrepancy between the true distribution of structure-affinity relationships and model assumptions creates regret that compounds minimum achievable error.
Empirical evaluations substantiate the theoretical predictions of inherent limitations in classical scoring functions. The performance degradation is most pronounced when models encounter novel targets or diverse ligands, precisely illustrating the congeneric bias phenomenon.
Early evidence of congeneric bias emerged from observations that scoring function performance varies dramatically between different protein systems [33]. For certain challenging targets—including acetylcholine esterase (AChE), pantothenate synthetase, and various kinases—conventional scoring functions cannot distinguish native binding poses from decoys, despite generating structurally plausible alternatives [34].
Table 2: Performance Disparities Across Challenging Targets
| Target Protein | PDB ID | Scoring Function | Native Pose Ranking | Key Challenge |
|---|---|---|---|---|
| Acetylcholine Esterase | 1GPK | MedusaScore | Outside top 1% | Entropic effects |
| Pantothenate Synthetase | 1N2J | AutoDock | Outside top 1% | Flexibility |
| JNK3 Kinase | 1PMN | Glide | Outside top 1% | Specific hydration |
| Tuberculosis Thymidylate Kinase | 1W2G | MedusaScore | Outside top 1% | Coupled dynamics |
| Checkpoint Kinase 1 | 2BR1 | Multiple | Outside top 1% | Metal coordination |
Discrete Molecular Dynamics (DMD) simulations demonstrated that incorporating protein-ligand dynamics and entropic effects could successfully identify native poses in 6 of 8 cases where static scoring functions failed [34]. This suggests that the omission of dynamic information constitutes a critical limitation in classical functions applied to novel targets.
Recent analyses reveal that much of the reported performance of modern machine learning scoring functions is artificially inflated by data leakage between training and test sets. When proper filtering is applied, performance metrics drop substantially [28].
A structure-based clustering algorithm identified nearly 600 similar train-test pairs between PDBbind training complexes and Comparative Assessment of Scoring Functions (CASF) test complexes, affecting 49% of all CASF test complexes [28]. This contamination enables models to achieve high benchmark performance through memorization rather than genuine learning of protein-ligand interactions.
Table 3: Performance Drop After Data Leakage Removal
| Model | Original CASF Performance (RMSE) | CleanSplit Performance (RMSE) | Performance Drop |
|---|---|---|---|
| GenScore | 1.25 | 1.58 | 26.4% |
| Pafnucy | 1.32 | 1.71 | 29.5% |
| GEMS [28] | 1.18 | 1.21 | 2.5% |
The creation of PDBbind CleanSplit—a curated dataset with reduced train-test similarity—exposed the extent of this overestimation [28]. Models that previously showed exceptional benchmark performance experienced significant drops when retrained on CleanSplit, while models designed for better generalization maintained robust performance.
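The "Performance Drop" column in Table 3 is the relative increase in RMSE after retraining on the leakage-free split. A one-line helper makes the computation explicit and reproduces the tabulated values:

```python
def relative_drop(rmse_benchmark, rmse_cleansplit):
    """Percent increase in RMSE after retraining on leakage-free data."""
    return (rmse_cleansplit - rmse_benchmark) / rmse_benchmark * 100

# Values from Table 3
assert round(relative_drop(1.25, 1.58), 1) == 26.4  # GenScore
assert round(relative_drop(1.32, 1.71), 1) == 29.5  # Pafnucy
assert round(relative_drop(1.18, 1.21), 1) == 2.5   # GEMS
```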
To address data leakage issues, researchers have developed rigorous filtering protocols [28]:
Similarity Assessment: Compute multimodal similarity between all training and test complexes, combining pairwise similarity of ligand structures, protein structures, and binding conformations [28].
Leakage Removal: Iteratively remove all training complexes that exceed similarity thresholds with any test complex.
Redundancy Reduction: Apply adapted filtering thresholds to identify and eliminate similarity clusters within the training set until all striking redundancies are resolved.
Validation: Verify that the highest remaining similarities between training and test sets show clear structural differences in both protein folds and ligand positioning.
This protocol resulted in the removal of 4% of training complexes due to train-test similarity and an additional 7.8% due to internal redundancies [28].
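The leakage-removal step can be sketched in a few lines. The sketch below is a simplification: fingerprints are toy bit sets, protein TM-scores are a precomputed lookup, and the rule that a pair leaks only when both modalities exceed their thresholds is an assumption, since the published multimodal criterion also weighs binding conformations:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprint bit sets."""
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def remove_leaking(train, test, tm_score, tm_thr=0.7, tan_thr=0.9):
    """Drop training complexes too similar to any test complex.

    Assumes a pair leaks when BOTH the protein TM-score and the ligand
    Tanimoto similarity exceed their thresholds (a simplification of the
    multimodal rule described in the text).
    """
    kept = []
    for t_id, t_fp in train.items():
        leaking = any(
            tm_score[(t_id, s_id)] > tm_thr and tanimoto(t_fp, s_fp) > tan_thr
            for s_id, s_fp in test.items()
        )
        if not leaking:
            kept.append(t_id)
    return kept

# Toy data: hypothetical PDB IDs with fingerprint bit sets and TM-scores
train = {"1abc": {1, 2, 3, 4, 5}, "2def": {1, 2, 3, 4, 5}, "3ghi": {7, 8, 9}}
test = {"9xyz": {1, 2, 3, 4, 5}}
tm = {("1abc", "9xyz"): 0.92, ("2def", "9xyz"): 0.40, ("3ghi", "9xyz"): 0.95}

kept = remove_leaking(train, test, tm)  # only "1abc" exceeds both thresholds
```

Here "1abc" is removed (similar protein and identical ligand), while "2def" survives on protein dissimilarity and "3ghi" on ligand dissimilarity, mirroring the multimodal logic of the protocol.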
For targets where classical scoring functions fail to identify native poses, Discrete Molecular Dynamics (DMD) offers a robust alternative protocol [34]:
Pose Generation: Use flexible docking software (e.g., MedusaDock) to generate 1000+ diverse poses for the target ligand.
Pose Clustering: Employ mean-linkage hierarchical clustering with a 2.5Å RMSD cutoff to identify structurally distinct pose clusters.
Representative Selection: Select the highest-scoring pose from each cluster for simulation, eliminating dynamically indistinguishable poses.
DMD Simulation: Perform multiple DMD simulations for each representative pose, using discretized energy potentials and fast event-sorting to enhance sampling efficiency.
Residence Time Analysis: Calculate ligand residence time for each pose, with native and near-native poses typically exhibiting distinctly longer residence times than decoys.
Ranking: Rank poses by residence time rather than static energy scores, successfully identifying native poses within the top 0.5% of poses for most challenging targets.
Dynamics-Enhanced Pose Discrimination: Workflow for identifying native binding poses in challenging targets where classical scoring functions fail, using Discrete Molecular Dynamics simulations to incorporate protein-ligand dynamics and entropic effects.
Graph Neural Networks with Enhanced Featurization
Novel graph architectures show improved generalization by better representing protein-ligand interactions. The AEV-PLIG model combines atomic environment vectors (AEVs) with protein-ligand interaction graphs, using radial atomic environment vectors centered on ligand atoms as node features [35]. This approach captures intermolecular pairwise atomic interactions more explicitly than distance cutoffs alone.
The GEMS (Graph neural network for Efficient Molecular Scoring) architecture leverages transfer learning from protein language models and sparse graph modeling of interactions to maintain performance even when trained on properly filtered datasets [28]. Ablation studies confirm that GEMS fails to produce accurate predictions when protein nodes are omitted, suggesting genuine understanding of interactions rather than ligand memorization.
Multitask Learning Frameworks
The DeepDTAGen framework simultaneously predicts drug-target binding affinity and generates novel target-aware drug variants using a shared feature space [36]. This multitask approach ensures that the model learns DTI-specific features in the latent space while utilizing these features for generation. The framework employs a novel FetterGrad algorithm to mitigate gradient conflicts between distinct tasks, keeping gradients aligned during optimization.
To address the fundamental data scarcity problem, researchers have successfully employed augmentation techniques:
Template-Based Modeling: Generate additional complex structures using template-based ligand alignment algorithms [35]
Molecular Docking: Create synthetic training examples by docking known binders into homologous protein structures [35]
Conditional Generation: Develop target-aware generative models that produce novel ligands conditioned on specific protein binding pockets [36]
When AEV-PLIG was trained with augmented data, performance on FEP benchmarks improved substantially, with weighted mean Pearson correlation coefficient increasing from 0.41 to 0.59 and Kendall's τ from 0.26 to 0.42 [35]. This narrows the performance gap with FEP+ (PCC: 0.68, Kendall's τ: 0.49) while being approximately 400,000 times faster.
Table 4: Key Experimental Resources for Robust Affinity Prediction
| Resource | Type | Function | Application Context |
|---|---|---|---|
| PDBbind CleanSplit | Curated Dataset | Training data with reduced benchmark leakage | Model development & validation |
| CASF-2016 Benchmark | Evaluation Dataset | Standardized performance assessment | Method comparison |
| DMD Suite | Simulation Software | Discrete Molecular Dynamics simulations | Pose discrimination for difficult targets |
| AEV-PLIG | Graph Neural Network | Binding affinity prediction with atomic environments | Structure-based affinity prediction |
| DeepDTAGen | Multitask Framework | Simultaneous affinity prediction & drug generation | Target-aware drug design |
| FetterGrad Algorithm | Optimization Method | Mitigates gradient conflicts in multitask learning | Multitask model training |
| 3D Interaction Fingerprints | Descriptor System | Encodes spatial interaction patterns | Reference complex selection |
| Knowledge-Guided Scoring (KGS2) | Add-on Method | Enhances existing functions using reference complexes | Scoring function improvement |
Congeneric bias in classical scoring functions represents a fundamental challenge rooted in theoretical limitations of structure-based models and exacerbated by methodological shortcomings in model development and evaluation. The reliance on single static structures, inadequate treatment of entropic contributions, and dataset contamination have collectively created a situation where reported performance metrics significantly overstate real-world applicability.
Promising paths forward include dynamics-aware scoring methods, rigorously curated datasets, advanced neural architectures that explicitly model physical interactions, and data augmentation to expand chemical diversity. The development of models that genuinely learn biophysical principles rather than exploiting dataset biases will be essential for achieving robust performance across novel and diverse targets. As these approaches mature, they promise to narrow the gap between computational prediction and experimental reality, ultimately accelerating therapeutic development through more reliable virtual screening.
Scoring functions are the computational engine of structure-based drug design, tasked with predicting the binding affinity between a drug candidate and its protein target. Classical scoring functions, which often rely on simplified physical models and empirical parameters, have long been a cornerstone of molecular docking and virtual screening [32]. These functions provide a critical bridge between the structural data of a protein-ligand complex and the anticipated biological activity, guiding the selection of initial "hit" compounds and their subsequent optimization into viable "leads" [37]. However, inherent limitations in these classical approaches can lead to inaccurate affinity predictions, creating a cascade of negative consequences throughout the drug discovery pipeline. This case study examines the tangible repercussions of poor scoring in hit identification and lead optimization, framing the issue within the broader research context of overcoming the limitations of classical scoring functions. We will analyze quantitative evidence of these failures, detail experimental protocols for evaluating scoring function performance, and explore how emerging deep learning (DL) methodologies are providing potential pathways to more reliable predictions [38] [32].
The initial phases of drug discovery are heavily reliant on computational prescreening to navigate vast chemical space. Hit identification aims to find initial compounds with confirmed activity against a therapeutic target, typically from hundreds of thousands to millions of candidates [39] [37]. Following this, the hit-to-lead phase involves optimizing these initial hits for potency, selectivity, and drug-like properties, a process that depends on accurate structure-activity relationship (SAR) data to guide medicinal chemistry [40]. In both stages, scoring functions are indispensable for prioritizing which compounds to synthesize and test experimentally.
The reliance on these functions is profound. In virtual screening, they act as a filter, and their failure to correctly rank compounds can cause truly active molecules to be overlooked in favor of false positives [39]. During lead optimization, medicinal chemists use predicted binding modes and affinities to decide which chemical modifications to make. Inaccurate scoring can therefore misdirect the entire optimization effort, wasting precious time and resources [32]. A core challenge is that classical functions often struggle to capture the complex physical chemistry of binding, such as the subtle effects of solvation, entropy, and specific intermolecular interactions like halogen bonds [32]. This foundational weakness manifests in several critical failure modes, for which quantitative evidence is mounting.
Recent comprehensive benchmarks directly compare traditional and AI-powered docking methods, revealing systematic shortcomings. The following table summarizes key performance metrics across different evaluation datasets, highlighting the specific challenges of physical plausibility and generalization.
Table 1: Performance Comparison of Docking Methods Across Key Benchmarks
| Method Category | Example Method | Pose Prediction Success (RMSD ≤ 2Å) | Physical Validity (PB-Valid Rate) | Combined Success (RMSD ≤ 2Å & PB-Valid) | Key Weakness Identified |
|---|---|---|---|---|---|
| Traditional | Glide SP | Moderate | >94% (across all datasets) | Moderate | Balanced but computationally intensive [32] |
| Generative Diffusion | SurfDock | >70% (across all datasets) | Suboptimal (e.g., 40.21% on DockGen) | Moderate (e.g., 33.33% on DockGen) | Produces physically implausible structures [32] |
| Regression-Based | KarmaDock | Low | Very Low | Low | Frequent production of physically invalid poses [32] |
| Hybrid (AI Scoring) | Interformer | Moderate | High | High | Aims to balance pose accuracy and physical validity [32] |
The data reveals that while some modern DL methods, particularly generative diffusion models, excel in raw pose prediction accuracy (RMSD ≤ 2Å), they often do so at the cost of physical plausibility. For instance, SurfDock achieves a high pose prediction success rate of 75.66% on the challenging DockGen dataset (featuring novel protein pockets) but has a PB-valid rate of only 40.21% [32]. The PoseBusters toolkit has been instrumental in uncovering these issues, flagging problems such as incorrect bond lengths, steric clashes, and unrealistic molecular geometry that are missed by the RMSD metric alone [32].
Furthermore, a critical failure of both classical and many DL scoring functions is their poor generalization to novel targets. Performance often drops significantly when methods are applied to proteins or binding pockets that are structurally distinct from those in their training data [32]. This lack of robustness directly impacts virtual screening efficacy, as the goal is to discover new chemotypes for diverse targets.
Table 2: Consequences of Poor Scoring in Key Drug Discovery Stages
| Discovery Stage | Primary Impact of Poor Scoring | Downstream Consequences |
|---|---|---|
| Hit Identification | Inaccurate ranking of compounds in virtual screening; high false positive/negative rates. | Waste of resources on testing inactive compounds; missed opportunities by overlooking true hits [39]. |
| Hit-to-Lead | Misleading guidance for Structure-Activity Relationship (SAR) and medicinal chemistry. | Optimization efforts are misdirected, leading to dead ends; poor compound quality propagates forward [40]. |
| Lead Optimization | Failure to correctly predict the affinity of optimized analogs. | Inefficient cycle of synthesis and testing; increased risk of late-stage attrition due to underlying affinity issues [32]. |
To systematically identify the limitations described above, researchers employ rigorous benchmarking protocols. The following workflow outlines a standard methodology for a comprehensive assessment of scoring functions.
A robust evaluation requires multiple, carefully curated datasets that test different aspects of performance, such as the Astex Diverse Set for standard pose-prediction validation and DockGen for generalization to novel binding pockets [32].
The workflow assesses several distinct performance metrics, as outlined in Table 1: pose prediction success (RMSD ≤ 2Å), physical validity (PB-valid rate), and their combination, which jointly penalizes accurate-but-implausible poses.
The quantitative failures detailed in Section 3 translate directly into significant operational setbacks in the laboratory. When scoring functions generate false positives—incorrectly assigning high affinity to non-binders—teams waste valuable resources synthesizing and testing these compounds. A survey of virtual screening studies noted that a lack of consensus on hit identification criteria, including the underutilization of size-targeted metrics like ligand efficiency, can exacerbate this problem [39]. Furthermore, poor scoring can obscure the true Structure-Activity Relationship (SAR), leading chemists to draw incorrect conclusions about which chemical groups contribute favorably to binding. This misdirection can derail an optimization campaign, sending teams down unproductive chemical pathways for months [40]. Ultimately, these errors contribute to the high attrition rates seen in later, more expensive stages of drug development, as fundamental flaws in affinity and selectivity are only uncovered after substantial investment.
The following table lists key reagents, software, and datasets that are essential for conducting rigorous evaluations of scoring functions and performing structure-based drug discovery.
Table 3: Essential Research Tools for Scoring and Docking Evaluation
| Tool Name | Type | Primary Function in Evaluation |
|---|---|---|
| PoseBusters [32] | Software Toolkit | Validates the physical plausibility and chemical correctness of predicted protein-ligand complexes. |
| RDKit [41] | Cheminformatics Library | Handles molecular informatics; used for processing ligand structures (e.g., from SMILES) and calculating molecular descriptors. |
| Astex Diverse Set [32] | Benchmark Dataset | A standard set of high-quality protein-ligand complexes for initial validation of pose prediction accuracy. |
| DockGen [32] | Benchmark Dataset | A dataset featuring novel protein binding pockets for testing the generalizability of docking methods. |
| Transcreener Assays [40] | Biochemical Assay | Provides a homogeneous, high-throughput method for experimentally confirming compound potency and mechanism of action during hit validation. |
| PyMOL [41] | Molecular Visualization | Enables visual inspection of predicted binding poses, protein-ligand interactions, and steric clashes. |
The limitations of classical functions have spurred the development of new computational paradigms. Deep learning models are now being extensively applied to drug-target binding (DTB) prediction, offering the potential to learn complex, non-linear relationships from large datasets that are difficult to codify in classical functions [38] [41]. These DL approaches can be broadly categorized, each with distinct advantages and weaknesses, as shown in the following diagram.
As illustrated, generative diffusion models demonstrate superior pose prediction accuracy but often produce physically implausible structures. Regression-based models frequently fail to generate valid molecular geometries altogether. In contrast, hybrid methods, which often combine traditional conformational search algorithms with AI-driven scoring functions, currently offer the most balanced performance, aiming to retain the strengths of both classical and modern approaches [32]. The field is also exploring multimodal approaches that integrate diverse data types, such as protein sequences, ligand graphs, and 3D structural information, to create more robust and generalizable models [38].
This case study has delineated the profound consequences of poor scoring in early drug discovery, from wasted resources on false leads to the misguided optimization of compound series. Quantitative benchmarks reveal that while classical and even some modern DL scoring functions can perform well on standard tests, they frequently fail on critical aspects like physical plausibility, recovery of key interactions, and generalization to novel targets. Addressing these limitations is paramount for improving the efficiency of drug discovery.
Future research directions are focused on developing more physically realistic and generalizable models. Promising strategies include integrating tighter physical constraints into DL model loss functions, improving the sampling of diffusion models, and enhancing the efficiency of hybrid method searches [32]. Furthermore, the development of more challenging and biologically relevant benchmark datasets will be crucial for steering progress. As these advanced models mature and are validated against real-world screening campaigns, they hold the potential to significantly de-risk the hit identification and lead optimization process, ultimately accelerating the delivery of new therapeutics.
In the field of computational drug discovery, the accuracy of structure-based binding affinity prediction is fundamentally constrained by the quality, quantity, and diversity of the underlying training data. While advanced deep learning architectures including convolutional neural networks, graph neural networks, and transformer-based models have emerged as promising approaches for scoring functions, their performance has plateaued due to often-overlooked limitations in the datasets upon which they are trained and evaluated [42]. The central challenge, termed the "data bottleneck," encompasses three interrelated dimensions: spatial and structural biases in existing datasets, sparsity of data for novel targets, and the propagation of errors through low-quality or improperly processed data. This bottleneck not only inflates performance metrics during benchmarking but severely limits the real-world applicability of these models in genuine drug discovery pipelines, particularly when encountering novel protein targets or chemical spaces [28] [32] [43].
The persistence of this data bottleneck has significant implications for the development of classical scoring functions. Models achieving state-of-the-art performance on standardized benchmarks frequently fail to maintain this accuracy when applied to strictly independent test sets, revealing a concerning over-reliance on data patterns that do not translate to genuine generalization [28]. This technical guide examines the multifaceted nature of data limitations through quantitative analysis, experimental validation, and proposed methodological solutions, providing researchers with a framework for diagnosing and addressing data-related challenges in their own affinity prediction work.
A critical examination of standard benchmarks reveals substantial data leakage between the primary training data and evaluation sets. The PDBbind database and the Comparative Assessment of Scoring Functions (CASF) benchmark, widely used for training and testing deep learning models, exhibit a high degree of structural similarity that artificially inflates performance metrics [28]. Recent investigations utilizing structure-based clustering algorithms have identified that nearly 49% of CASF test complexes have highly similar counterparts in the PDBbind training set, with nearly 600 identified train-test pairs sharing not only similar ligand and protein structures but also comparable binding conformations and affinity labels [28]. This redundancy enables models to achieve high benchmark performance through memorization and structural matching rather than genuine understanding of protein-ligand interactions.
Table 1: Quantifying Data Leakage Between PDBbind and CASF Benchmarks
| Similarity Metric | Threshold Value | Impact on CASF Test Set | Effect on Model Performance |
|---|---|---|---|
| Protein Structure Similarity (TM-score) | >0.7 | 49% of test complexes affected | Enables protein structure memorization |
| Ligand Similarity (Tanimoto) | >0.9 | Significant portion of test ligands | Allows ligand-based affinity prediction |
| Binding Conformation (RMSD) | <2.0Å | Nearly 600 similar train-test pairs | Permits binding pose matching |
| Combined Similarity | Multimodal filtering | Widespread train-test overlap | Inflates benchmark performance by 20-40% |
Researchers can implement the following experimental protocol to diagnose data leakage in their own datasets:
Structure-Based Clustering Algorithm: Implement a multimodal filtering approach that combines protein structure similarity, ligand similarity, and binding-conformation similarity.

Similarity Threshold Application: Identify problematic pairs using the established thresholds from Table 1: protein structure TM-score > 0.7, ligand Tanimoto similarity > 0.9, and binding conformation RMSD < 2.0Å.
Cross-Dataset Comparison: Apply the clustering algorithm to compare training and test set complexes, flagging any pairs exceeding similarity thresholds across multiple metrics.
Dataset Filtering: Create a cleaned dataset by removing training complexes that closely resemble any test complex according to the established thresholds. The PDBbind CleanSplit protocol removes approximately 4% of training complexes to address train-test leakage and an additional 7.8% to reduce internal redundancies [28].
Retraining state-of-the-art affinity prediction models on properly cleaned datasets reveals the substantial impact of data bias. When models like GenScore and Pafnucy were retrained on the PDBbind CleanSplit dataset with reduced data leakage, their performance on the CASF benchmark dropped markedly, confirming that previously reported high performance was largely driven by data leakage rather than genuine generalization capability [28]. This demonstrates that the impressive benchmark performance of many published models does not translate to real-world scenarios where models encounter truly novel protein-ligand complexes.
Public structural databases exhibit significant biases in their coverage of protein-ligand interactions. The Protein Data Bank (PDB) contains substantial representation biases toward soluble, easily crystallized proteins, while membrane proteins, RNA-protein complexes, and other challenging targets remain severely underrepresented [43] [44]. As of 2024, only 4,888 RNA-protein complexes were available in the PDB, with fewer than 400 representing high-resolution, unique, non-redundant structures after accounting for redundancies [44]. This sparse structural coverage creates critical gaps in training data that directly impact model performance on pharmaceutically relevant but structurally elusive targets.
To mitigate the effects of data sparsity, researchers can employ the following methodological approaches:
Data Augmentation through Conformational Sampling: Generate multiple docking poses, including decoy conformations, for known active compounds to expand sparse training sets.

Transfer Learning from Related Domains: Pre-train models on protein targets with abundant data, then fine-tune on the sparse target family of interest.

Federated Learning Approaches: Train models collaboratively across multiple institutions without directly sharing proprietary structural or affinity data.
Table 2: Data Sparsity Mitigation Strategies and Their Applications
| Strategy | Methodology | Target Use Case | Limitations |
|---|---|---|---|
| Decoy Conformation Augmentation | Generate multiple docking poses for active compounds | Virtual screening for targets with limited active compounds | May introduce conformational bias if sampling is insufficient |
| Cross-Target Transfer Learning | Pre-train on targets with abundant data, fine-tune on sparse targets | Novel target families with limited structural data | Requires careful selection of source domains to ensure relevance |
| Federated Learning | Train across multiple institutions without data sharing | Proprietary datasets with IP constraints | Increased computational complexity and coordination overhead |
| Synthetic Data Generation | Generative models to create plausible protein-ligand complexes | Ultra-rare targets with minimal experimental data | Requires validation to ensure physical plausibility |
The relationship between data quality and model performance represents a critical dimension of the data bottleneck. Systematic studies examining the effects of data quality and quantity have demonstrated that variations in these parameters can cause performance discrepancies comparable to or even larger than those observed between different deep learning architectures [46]. Notably, the presence of diverse protein targets in training data produces a dramatic increase in prediction accuracy, highlighting the importance of target diversity over mere quantity of ligand data [46]. This suggests that the continued accumulation of high-quality affinity data, especially for new protein targets, is indispensable for improving deep learning models.
The growing practice of employing low-precision computation to enhance efficiency introduces subtle but significant challenges in model evaluation. When relevance scores between queries and documents are computed in low-precision formats (e.g., FP16, BF16), the reduced numerical granularity produces spurious ties—distinct true scores that collapse to the same quantized value [47]. These scoring collisions introduce high variability in evaluation results based on arbitrary tie-resolution methods, making reliable performance assessment difficult. In retrieval-based affinity prediction tasks, this can manifest as inconsistent ranking of candidate molecules based on predicted binding scores.
To address evaluation instability from low-precision data, researchers should implement the following High-Precision Scoring (HPS) protocol:
Maintain Low-Precision Forward Pass: Execute the primary model inference in low-precision (BF16/FP16) to preserve computational efficiency.
Upcast Final Scoring Operation: Before the final scoring function (softmax, sigmoid, or pairwise product), upcast the logits tensor to FP32 precision:
ŝ_i = ϕ(upcast(z_i)) [47]
Compute Fine-Grained Scores: Perform the final scoring operation in high precision to generate more discriminative relevance scores.
Implement Tie-Aware Metrics: Supplement standard evaluation metrics with tie-aware retrieval metrics (TRM) that report expected scores, ranges, and biases to quantify ordering uncertainty [47].
This combined approach dramatically reduces tie-induced instability, with experiments showing MRR@10 range reduction of 36.82% and recovery of near-FP32 evaluation stability [47].
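The tie-inducing effect of low-precision scoring, and the benefit of upcasting before the final operation, can be reproduced in a few lines of numpy. This is an illustrative sketch only, not the cited HPS implementation; the logit values are invented:

```python
import numpy as np

def softmax(z):
    # numpy ufuncs keep the input dtype, so float16 logits yield
    # float16 (coarsely quantized) scores
    e = np.exp(z - z.max())
    return e / e.sum()

# 100 nearly identical logits, as a low-precision forward pass might emit.
logits_fp16 = np.linspace(0.0, 0.01, 100, dtype=np.float16)

scores_fp16 = softmax(logits_fp16)                     # low-precision scoring
scores_fp32 = softmax(logits_fp16.astype(np.float32))  # HPS-style upcast first

n_fp16 = len(np.unique(scores_fp16))
n_fp32 = len(np.unique(scores_fp32))
print(n_fp16, "distinct FP16 scores vs", n_fp32, "distinct FP32 scores")
```

All 100 logits are distinct, yet the FP16 scores collapse into a handful of tied values, while upcasting before the scoring operation preserves a full ranking.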
Table 3: Essential Research Reagents for Addressing Data Bottlenecks
| Reagent / Resource | Type | Primary Function | Application Context |
|---|---|---|---|
| PDBbind CleanSplit | Curated Dataset | Eliminates train-test leakage in affinity prediction | Benchmarking generalization capability of scoring functions |
| PADIF Fingerprint | Interaction Representation | Captures nuanced protein-ligand interaction patterns | Virtual screening and target prediction for diverse target classes |
| High-Precision Scoring (HPS) | Evaluation Protocol | Reduces spurious ties in low-precision inference | Stable evaluation of retrieval-based affinity prediction |
| Tie-aware Retrieval Metrics (TRM) | Evaluation Metrics | Quantifies uncertainty from tied rankings | Comprehensive assessment of model ranking performance |
| Dark Chemical Matter (DCM) | Decoy Dataset | Provides confirmed non-binders for model training | Improving virtual screening specificity |
| Federated Learning Platforms | Computational Framework | Enables multi-institutional collaboration without data sharing | Training on proprietary datasets with IP constraints |
The implementation of PDBbind CleanSplit provides an instructive case study in addressing data bottlenecks. After identifying substantial data leakage between standard training and test sets, researchers developed a structure-based filtering algorithm that systematically removes training complexes that closely resemble any test complex [28]. The resulting dataset enables genuine evaluation of model generalization to unseen protein-ligand complexes. When a graph neural network model (GEMS) incorporating sparse graph modeling and transfer learning from language models was trained on this cleaned dataset, it maintained high benchmark performance despite the reduced data leakage, suggesting its predictions were based on genuine understanding of protein-ligand interactions rather than exploitation of dataset biases [28].
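The filtering idea behind such leakage removal can be sketched abstractly. The similarity matrix and threshold below are hypothetical; the actual PDBbind CleanSplit algorithm computes structure-based similarities between protein-ligand complexes [28]:

```python
import numpy as np

def clean_split(sim, threshold):
    """Return indices of training items to keep: drop any training item
    whose similarity to ANY test item exceeds the threshold.
    `sim` is an (n_train, n_test) similarity matrix. Illustrative sketch
    only; the real algorithm works on complex structures, not a
    precomputed matrix."""
    leaky = (sim > threshold).any(axis=1)
    return np.where(~leaky)[0]

rng = np.random.default_rng(0)
sim = rng.uniform(0.0, 1.0, size=(1000, 50))   # hypothetical similarities
kept = clean_split(sim, threshold=0.99)

# No retained training item closely resembles any test item.
assert (sim[kept] <= 0.99).all()
print(len(kept), "of 1000 training items retained")
```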
The data bottleneck in binding affinity prediction represents a multifaceted challenge encompassing bias, sparsity, and quality limitations that collectively constrain the real-world applicability of computational models. The systematic identification of data leakage between standard benchmarks reveals that reported performance metrics often substantially overestimate true generalization capability. Addressing these limitations requires coordinated advances in dataset curation, evaluation methodologies, and model architectures specifically designed to maximize learning from limited high-quality data.
Future progress depends on developing more sophisticated data curation protocols that proactively identify and eliminate biases, creating standardized evaluation frameworks that account for precision limitations, and fostering collaborative data sharing models that expand access to diverse, high-quality training data while respecting intellectual property constraints. The implementation of the experimental protocols and solutions outlined in this technical guide provides researchers with practical approaches to diagnose and address data bottlenecks in their own work, ultimately contributing to the development of more robust and generalizable scoring functions for computational drug discovery.
In the field of affinity prediction for drug discovery, the perceived performance of computational models is often dangerously inflated by fundamental errors in the application of machine learning principles. Specifically, improper management of the relationship between training and testing data—termed here "Train-Test Stewardship"—introduces optimistic bias that undermines model reliability and generalizability. This technical examination addresses how data leakage, inconsistent preprocessing, and inadequate randomness control systematically compromise benchmarking integrity in scoring function development, perpetuating the well-documented limitations of classical affinity prediction methods and impeding genuine progress in the field.
The development of reliable scoring functions for protein-ligand binding affinity prediction represents a cornerstone of structure-based drug design. Despite decades of research, classical and machine learning-based scoring functions continue to demonstrate limited predictive accuracy on novel targets, with particularly poor performance in cross-target applications—a phenomenon known as the inter-protein scoring noise problem [9]. Empirical scoring functions trained using linear regression or machine learning methods on experimental structures and affinity data have shown considerable improvements in prediction accuracy for large generic datasets [5]. However, these apparent advances often fail to translate to real-world drug discovery applications, with few AI-discovered therapeutics reaching clinical trials and none achieving clinical approval as of 2024 [48].
This performance-translation gap frequently stems from improperly implemented benchmarking methodologies that systematically inflate perceived model capability. The core issue resides in what we term "Train-Test Stewardship"—the comprehensive approach to managing the relationship, separation, and processing of training and testing data throughout the model development pipeline. Insufficient attention to the subtle ways in which information leaks between these datasets, or in which preprocessing decisions optimize for benchmark performance rather than generalizability, creates a misleading impression of model efficacy that evaporates when facing truly novel prediction tasks.
Data leakage occurs when information that would not be available at prediction time is used during model training, resulting in optimistically biased performance estimates [49]. In affinity prediction, this manifests particularly during preprocessing and feature selection stages.
Mechanism of Leakage: When the entire dataset is used for operations that should be restricted to training data only, such as feature selection, normalization parameter calculation, or dimensionality reduction, information from the test set contaminates the training process. The model effectively gains "foresight" about the test distribution, violating the fundamental assumption of independent evaluation [49].
Experimental Demonstration: In a demonstration using synthetic data with 10,000 randomly generated features and completely random targets, including test data in feature selection resulted in an accuracy score of 0.76—far above the expected chance performance of 0.5. When proper protocol was followed (splitting data first, then performing feature selection using only training data), accuracy correctly fell to chance level (0.5) [49].
Table 1: Impact of Data Leakage on Model Performance
| Scenario | Feature Selection Method | Test Accuracy | Interpretation |
|---|---|---|---|
| Incorrect | Pre-splitting selection using all data | 0.76 | Severely inflated |
| Correct | Post-splitting selection using only training data | 0.50 | Accurate (chance) |
| Pipeline-based | Automated separation via scikit-learn Pipeline | 0.50 | Accurate (chance) |
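The synthetic experiment above can be reproduced with a numpy-only sketch. A nearest-centroid classifier stands in for the model, and the dimensions, seeds, and feature counts are arbitrary; the only change between the two conditions is whether feature selection sees the test labels:

```python
import numpy as np

def run_trial(seed, leak, n=100, p=2000, k=10):
    """Nearest-centroid test accuracy on purely random data, with the
    top-k features chosen by |correlation| with the labels. With
    leak=True, feature selection (wrongly) sees the test labels too."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    y = rng.integers(0, 2, size=n)
    tr, te = np.arange(n // 2), np.arange(n // 2, n)

    sel = np.arange(n) if leak else tr                 # the only difference
    corr = np.abs(np.corrcoef(X[sel].T, y[sel])[-1, :-1])
    feats = np.argsort(corr)[-k:]

    c0 = X[tr][y[tr] == 0][:, feats].mean(axis=0)      # centroids fit on
    c1 = X[tr][y[tr] == 1][:, feats].mean(axis=0)      # training data only
    d0 = ((X[te][:, feats] - c0) ** 2).sum(axis=1)
    d1 = ((X[te][:, feats] - c1) ** 2).sum(axis=1)
    return ((d1 < d0) == y[te]).mean()

leaked = np.mean([run_trial(s, leak=True) for s in range(5)])
correct = np.mean([run_trial(s, leak=False) for s in range(5)])
print(leaked, correct)   # leaked accuracy sits well above chance
```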
Inconsistent application of preprocessing transformations between training and testing phases creates a mismatch between the data distributions the model was trained on and those it encounters during deployment [49] [50].
Normalization Pitfall: A common error occurs when normalization parameters (e.g., mean and standard deviation for StandardScaler, min and max for MinMaxScaler) are calculated using the entire dataset before splitting, rather than being fit solely on training data and then applied to the test set. This subtly introduces test set information into the training process [50].
Concrete Example: In a polynomial regression predicting house prices from square footage, when training data was normalized using parameters from the complete dataset (including test observations), the model achieved deceptively good performance. When proper protocol was followed (normalization parameters calculated from training data only), performance on the true test set more accurately reflected real-world generalizability [50].
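A minimal numpy sketch of the normalization pitfall, using invented data; the point is only that the "wrong" parameters encode test-set information:

```python
import numpy as np

rng = np.random.default_rng(42)
prices = rng.normal(loc=5.0, scale=2.0, size=200)   # invented target values
train, test = prices[:150], prices[150:]

# Wrong: scaler parameters computed on the full dataset, so the test
# observations silently influence the transform applied during training.
mu_all, sd_all = prices.mean(), prices.std()

# Right: parameters fit on the training split only, then reused unchanged.
mu_tr, sd_tr = train.mean(), train.std()
train_scaled = (train - mu_tr) / sd_tr
test_scaled = (test - mu_tr) / sd_tr

# The two parameter sets differ: the "wrong" ones leak test-set statistics.
print(mu_all - mu_tr, sd_all - sd_tr)
```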
Impact on Affinity Prediction: For empirical scoring functions that rely on feature descriptors capturing essential interaction features between proteins and ligands [5], inconsistent preprocessing creates models that appear accurate during benchmarking but fail to generalize across diverse protein families or structural motifs.
The management of randomness through random_state parameters significantly impacts the reproducibility and reliability of benchmarking results [49].
Random State Rules:
- If an integer is passed, calling `fit` or `split` multiple times always yields the same results
- If `None` or a `RandomState` instance is passed, `fit` and `split` yield different results each time
- For robust cross-validation, pass separate `RandomState` instances when creating estimators, or leave `random_state` set to `None` [49]

Benchmarking Implications: Inconsistent handling of randomness across different stages of model evaluation (e.g., during cross-validation splits versus final model training) introduces uncontrolled variance that can artificially enhance or depress perceived performance, compromising comparisons between different scoring functions.
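These semantics can be illustrated with a toy splitter that mimics scikit-learn's handling of `random_state`. This is a simplified sketch, not scikit-learn's actual implementation:

```python
import numpy as np

def toy_split(n, random_state=None):
    """Toy train/test splitter mirroring scikit-learn's random_state
    semantics (simplified sketch): an int seed is reproducible on every
    call, while a RandomState instance advances between calls."""
    rng = (random_state if isinstance(random_state, np.random.RandomState)
           else np.random.RandomState(random_state))
    idx = rng.permutation(n)
    return idx[: n // 2], idx[n // 2:]

# Integer seed: every call yields the identical split.
a, _ = toy_split(100, random_state=42)
b, _ = toy_split(100, random_state=42)

# A RandomState instance advances its internal state, so calls differ.
state = np.random.RandomState(7)
c, _ = toy_split(100, random_state=state)
d, _ = toy_split(100, random_state=state)
print((a == b).all(), (c == d).all())
```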
Protocol 1: Strict Separation Workflow

1. Split the dataset into training and test sets before any data-dependent operation.
2. Fit all preprocessing steps (feature selection, normalization, dimensionality reduction) on the training set only.
3. Apply the fitted transformations, unchanged, to the test set.
4. Train the model exclusively on the transformed training data and evaluate once on the held-out test set.
Validation Step: To verify proper separation, use synthetic tests with randomized targets to ensure models achieve expected chance performance when no true relationship exists [49].
The most robust defense against data leakage is implementing a unified pipeline that encapsulates all preprocessing and modeling steps [49].
scikit-learn Implementation:
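A minimal sketch of such a pipeline, assuming scikit-learn is available. Random features with random targets serve as a negative control: because every preprocessing step is fit inside each cross-validation fold on training data only, accuracy should sit at chance:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))      # random features
y = rng.integers(0, 2, size=200)     # random targets: no true signal

# Scaling and feature selection are encapsulated in the pipeline, so
# fit/fit_transform only ever see the training portion of each CV fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])

acc = cross_val_score(pipe, X, y, cv=5).mean()
print(round(acc, 3))   # near 0.5: no leakage-driven inflation
```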
Advantages:
- Guarantees that `fit_transform` is applied only to training data during cross-validation, preventing test-fold information from leaking into preprocessing

Classical scoring functions demonstrate a specific failure mode known as inter-protein scoring noise: while capable of enriching active molecules for a single protein target, they fail to identify the correct protein target for a given active molecule due to scoring variation between different binding pockets [9].
Benchmarking Implications: Traditional train-test splits that randomly assign protein-ligand complexes across targets fail to adequately assess this critical capability. A more rigorous approach involves leave-one-target-out validation, where all complexes for specific protein targets are held out during training.
Recent Assessment: In evaluations of the Boltz-2 biomolecular foundation model, while initial claims suggested performance approaching free-energy perturbation in estimating binding affinity, the model failed to correctly identify protein targets for active molecules when tested on the LIT-PCBA benchmark set for target identification [9].
Table 2: Essential Research Reagents for Affinity Prediction Benchmarking
| Reagent/Solution | Function | Considerations |
|---|---|---|
| LIT-PCBA dataset | Virtual screening benchmark built from experimentally confirmed actives and inactives [9] | Tests ability to identify correct protein target for active molecules |
| PDBbind database | Curated collection of protein-ligand complexes with binding affinity data [5] | Provides experimental structures and affinity data for training empirical scoring functions |
| Boltz-2 model | Biomolecular foundation model for affinity prediction [9] | Reference for comparing new methods; demonstrates current limitations |
| Classical scoring functions | Empirical functions using linear regression or machine learning [5] | Baseline for method comparison; exhibit known limitations |
Table 3: Comprehensive Benchmarking Metrics for Affinity Prediction
| Metric Category | Specific Metrics | Interpretation Guidelines |
|---|---|---|
| Target Identification | Success rate in identifying correct protein target for active molecules [9] | Primary metric for generalizability; should exceed 0.5 for useful methods |
| Affinity Accuracy | Root mean square error (RMSE) between predicted and experimental binding affinities [5] | Context-dependent; must be compared to state-of-the-art and classical baselines |
| Ranking Capability | Enrichment factors, ROC curves, AUC values [5] | Measures utility for virtual screening applications |
| Cross-Target Performance | Variance in performance across different protein families [9] | Lower variance indicates better generalizability |
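Two of the ranking metrics above, enrichment factor and ROC AUC, can be implemented directly in numpy. The screening data below is invented, and tie handling is omitted for brevity:

```python
import numpy as np

def enrichment_factor(scores, labels, frac=0.01):
    """EF@frac: hit rate among the top-scoring fraction of the library,
    divided by the overall hit rate. labels: 1 = active, 0 = decoy."""
    n_top = max(1, int(round(frac * len(scores))))
    order = np.argsort(scores)[::-1]
    return labels[order[:n_top]].mean() / labels.mean()

def roc_auc(scores, labels):
    """ROC AUC via the rank-sum (Mann-Whitney) identity; assumes no ties."""
    ranks = np.argsort(np.argsort(scores)) + 1.0
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Toy screen: 10 actives hidden among 1000 compounds, actives score higher.
rng = np.random.default_rng(1)
labels = np.zeros(1000, dtype=int)
labels[:10] = 1
scores = rng.normal(size=1000) + 3.0 * labels
print(enrichment_factor(scores, labels), roc_auc(scores, labels))
```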
Quality Control Protocol:

1. Run negative controls with randomized targets to confirm that models score at chance when no true relationship exists [49].
2. Use leave-one-target-out splits to test generalization across protein targets rather than random complex-level splits [9].
3. Report all preprocessing steps, data splitting strategies, and random seeds to enable independent reproduction [48].
The inflation of perceived performance through improper train-test management represents a critical barrier to genuine progress in affinity prediction research. The field's continued reliance on benchmarks vulnerable to data leakage and preprocessing inconsistencies perpetuates the development of methods that excel in artificial testing environments but fail in practical applications—particularly for challenging problems like inter-protein scoring noise [9].
Addressing these issues requires both technical corrections in implementation and cultural shifts in evaluation standards. The adoption of pipeline-based approaches, comprehensive negative controls, and more rigorous benchmarking sets that specifically test generalizability across protein targets will enable more accurate assessment of true methodological advances. Furthermore, increased transparency in reporting preprocessing methodologies, data splitting strategies, and randomization protocols will facilitate more meaningful comparisons between different scoring functions [48].
Only through such rigorous attention to the fundamentals of machine learning evaluation can the field overcome the current limitations of classical scoring functions and produce genuinely reliable affinity prediction methods capable of accelerating drug discovery and development.
The accurate prediction of drug-target binding affinity (DTA) is a cornerstone of modern computational drug discovery. While classical scoring functions and contemporary deep learning models have advanced this field, their performance remains fundamentally constrained by the quality and composition of the training data upon which they are built. Redundancy—the overrepresentation of similar protein sequences and ligand structures in training datasets—introduces significant bias, reduces model generalizability, and ultimately limits real-world applicability [51] [7].
The limitations of classical scoring functions (e.g., force-field-based, empirical, and knowledge-based) are well-documented, particularly their struggle with generalization across diverse protein families and ligand classes [7]. These functions often exhibit predictive bias toward specific target types with abundant structural data, such as soluble proteins, while performing poorly on membrane proteins like G protein-coupled receptors (GPCRs) and ion channels, which are crucial drug targets but structurally underrepresented [10] [11]. This bias stems directly from redundant training sets that fail to adequately represent the structural diversity of biological targets. As the field transitions toward data-driven machine learning and deep learning approaches, addressing dataset redundancy becomes increasingly critical for developing robust predictive models with genuine utility in drug discovery pipelines.
Redundancy in drug-target affinity data manifests primarily through two channels: sequence redundancy in target proteins and structural redundancy in ligand compounds. The former occurs when training sets contain multiple similar protein sequences from the same family, while the latter arises from numerous structurally analogous compounds. This redundancy creates models that perform exceptionally well on familiar data but fail to generalize to novel targets or chemotypes—a significant problem for drug discovery where innovation precisely targets novel biology and chemistry [51] [36].
The standard threshold-based algorithm (used in tools like CD-HIT, PISCES, and UCLUST) for selecting representative subsets often exacerbates these issues. This approach applies a heuristic threshold where sequences are added to a representative set only if no existing member shares similarity above the threshold (typically 40% or 90% sequence identity). This method has two critical drawbacks: it ignores all similarities below the specified threshold, potentially selecting representatives with similarities very close to the cutoff, and it provides no guarantees about the final set size, which is crucial for downstream applications [51].
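The threshold-based algorithm can be sketched in a few lines over a precomputed similarity matrix. The data is hypothetical, and real tools such as CD-HIT and UCLUST compare sequences directly with additional heuristics (length sorting, word filtering):

```python
import numpy as np

def threshold_representatives(sim, cutoff=0.4):
    """CD-HIT-style greedy pass: keep a sequence only if its similarity
    to every representative chosen so far is below the cutoff."""
    reps = []
    for i in range(sim.shape[0]):
        if all(sim[i, j] < cutoff for j in reps):
            reps.append(i)
    return reps

rng = np.random.default_rng(3)
m = rng.uniform(0.1, 0.9, size=(50, 50))   # hypothetical pairwise identities
sim = (m + m.T) / 2
np.fill_diagonal(sim, 1.0)

reps = threshold_representatives(sim, cutoff=0.4)
# Note the drawbacks described in the text: sub-cutoff similarities are
# ignored entirely, and the output size is whatever this single greedy
# pass happens to produce.
print(reps)
```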
Table 1: Comparison of Representative Subset Selection Methods for Protein Sequences
| Method | Core Algorithm | Advantages | Limitations |
|---|---|---|---|
| Threshold Algorithm (CD-HIT, UCLUST) | Heuristic threshold-based selection | Fast computation; widely adopted | No theoretical guarantees; ignores sub-threshold similarities; unstable output size |
| Submodular Optimization (Repset) | Discrete optimization with diminishing returns property | Provable theoretical guarantees; maximizes structural diversity; flexible objective functions | Computationally more intensive; requires pairwise similarity calculations |
| Clustering-based Methods | Group sequences then select exemplars | Intuitive grouping structure | Exemplar selection may not optimize global diversity |
Experimental evidence demonstrates that submodular optimization approaches consistently yield protein sequence subsets with greater structural diversity than sets chosen by existing threshold-based methods. When evaluated against the structural classification of proteins (SCOPe) library as a gold standard, submodular optimization selects sequence subsets that include more SCOPe domain families than sets of the same size selected by competing approaches [51].
Submodular optimization provides a mathematical framework for representative subset selection with theoretical guarantees. A submodular function exhibits the property of "diminishing returns"—the incremental value of adding a sequence to a representative set decreases as the set grows. This property makes these functions amenable to efficient optimization with provable approximation guarantees [51].
The fundamental approach involves defining a submodular objective function that quantifies the quality of a candidate representative subset, then applying optimization algorithms to identify a subset that maximizes this function. Formally, a set function f: 2^S → ℝ is submodular if for every A ⊆ B ⊆ S and s ∈ S \ B, it holds that:

f(A ∪ {s}) − f(A) ≥ f(B ∪ {s}) − f(B)
This mathematical framework enables the development of objective functions that simultaneously maximize representativeness (ensuring every sequence in the full set has a similar representative) and minimize redundancy (ensuring selected representatives are diverse) [51].
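A toy illustration of this framework is greedy maximization of the facility-location objective, a canonical monotone submodular function. The similarity matrix is invented, and Repset's actual mixture objectives over PSI-BLAST similarities differ; this sketch only demonstrates the diminishing-returns machinery:

```python
import numpy as np

def greedy_facility_location(sim, k):
    """Greedy maximization of f(A) = sum_i max_{j in A} sim[i, j], a
    monotone submodular objective; the greedy solution is guaranteed
    to be within (1 - 1/e) of optimal."""
    n = sim.shape[0]
    selected, gains = [], []
    covered = np.zeros(n)                 # current best similarity per item
    for _ in range(k):
        # marginal gain of each candidate j given the current selection
        marginal = np.maximum(sim, covered[:, None]).sum(axis=0) - covered.sum()
        marginal[selected] = -np.inf      # never reselect
        j = int(np.argmax(marginal))
        gains.append(float(marginal[j]))
        covered = np.maximum(covered, sim[:, j])
        selected.append(j)
    return selected, gains

rng = np.random.default_rng(0)
m = rng.uniform(size=(60, 60))
sim = (m + m.T) / 2                       # hypothetical sequence similarities
np.fill_diagonal(sim, 1.0)

reps, gains = greedy_facility_location(sim, k=5)
print(reps)
# Diminishing returns in action: successive marginal gains never increase.
```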
Table 2: Key Research Reagent Solutions for Non-Redundant Dataset Construction
| Tool/Resource | Type | Primary Function | Application Context |
|---|---|---|---|
| Repset | Software package | Submodular optimization for representative sequence selection | Creating non-redundant protein sequence sets for model training |
| PSI-BLAST | Algorithm | Protein similarity search and alignment | Calculating pairwise similarities for optimization input |
| SCOPe Library | Database | Structural classification of proteins | Gold standard for evaluating structural diversity |
| PubChem/ChEMBL | Database | Repository of chemical molecules and bioactivities | Source for ligand structures and binding affinities |
| PaDEL Descriptors | Software | Molecular descriptor calculation | Featurization of ligand compounds for diversity analysis |
The optimization framework allows for designing specialized objective functions tailored to specific research needs. For instance, a mixture objective function can be created that performs well for both large and small representative sets, addressing a key limitation of threshold-based approaches. Similarly, hybrid functions can incorporate sequence length preferences, encouraging the selection of longer sequences when those are desirable for downstream applications [51].
For drug-target affinity prediction, this approach can be extended to handle both protein and ligand redundancy simultaneously. Molecular descriptors associated with molecular vibrations—including E-state descriptors, autocorrelation descriptors, and topological descriptors—can be screened to represent ligand diversity, while protein sequence descriptors capture target diversity [10] [11]. By treating the molecule-target pair as a whole system, researchers can create comprehensive non-redundant datasets for affinity prediction [11].
Rigorous evaluation of non-redundant training sets requires standardized metrics and benchmark datasets. For protein sequence sets, structural diversity measured against reference databases like SCOPe provides a key validation metric. For affinity prediction tasks, standard benchmarks include the Davis kinase binding affinity dataset (containing 442 proteins and 68 drugs with 30,056 interactions) and the KIBA dataset (containing 229 proteins and 2,111 drugs with 118,254 interactions) [52].
These benchmark datasets address data heterogeneity concerns, with Smith-Waterman similarity analysis showing that 92% of protein pairs in the Davis dataset and 99% in the KIBA dataset have sequence similarity of at most 60%, indicating inherent non-redundancy [52]. Performance metrics should include both predictive accuracy (measured via Mean Squared Error/MSE and Concordance Index/CI) and generalizability to novel targets and compounds.
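For reference, the concordance index can be computed directly from its definition. This is an O(n²) sketch; efficient implementations use sorting:

```python
def concordance_index(y_true, y_pred):
    """CI: fraction of comparable pairs (distinct true affinities) whose
    predicted ordering matches the true ordering; prediction ties count
    as 0.5."""
    concordant, comparable = 0.0, 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue                  # pair not comparable
            comparable += 1
            d = (y_pred[i] - y_pred[j]) * (y_true[i] - y_true[j])
            concordant += 1.0 if d > 0 else (0.5 if d == 0 else 0.0)
    return concordant / comparable

# Perfectly ordered predictions give CI = 1.0.
print(concordance_index([5.1, 6.3, 7.8], [5.0, 6.0, 7.0]))
```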
Contemporary deep learning models for DTA prediction demonstrate the critical importance of proper dataset construction. Models like DeepDTA, GraphDTA, and ImageDTA show significantly different performance characteristics when trained and evaluated on properly constructed non-redundant datasets [36] [52].
For example, ImageDTA—which treats word vector-encoded SMILES strings as images and processes them with multiscale 2D convolutional neural networks—achieves superior performance on benchmark datasets through architectural innovations that better capture structural information while minimizing information loss common in pooling operations [52]. This approach demonstrates MSE values of 0.214 and CI values of 0.890 on the Davis dataset, outperforming many traditional approaches while maintaining greater interpretability [52].
The implementation of non-redundant training sets directly addresses critical limitations in classical scoring function approaches. Classical methods—including physics-based, empirical, and knowledge-based scoring functions—often struggle with accuracy and applicability because they were frequently developed and parameterized using limited, redundant datasets [7]. This has restricted their effectiveness, particularly for membrane protein targets and novel compound classes [10].
Modern multitask learning frameworks like DeepDTAGen, which simultaneously predict drug-target binding affinities and generate novel target-aware drug variants, demonstrate the power of diverse training data. These models leverage shared feature spaces for both tasks, but their effectiveness depends critically on comprehensive training data that adequately represents the chemical and biological space of interest [36]. The development of specialized optimization algorithms, such as FetterGrad, to mitigate gradient conflicts in multitask learning further enhances model performance when trained on well-constructed datasets [36].
The construction of diverse, non-redundant training sets represents a fundamental prerequisite for advancing drug-target affinity prediction beyond the limitations of classical scoring functions. Methodologies based on submodular optimization provide a mathematically rigorous framework with theoretical guarantees for selecting representative subsets that maximize structural diversity. When integrated with modern deep learning architectures and comprehensive benchmarking, these approaches enable the development of predictive models with significantly enhanced accuracy, interpretability, and real-world applicability across diverse target classes and compound libraries.
As the field progresses, future work should focus on developing standardized non-redundancy benchmarks, optimizing computational efficiency for large-scale dataset construction, and creating integrated frameworks that simultaneously address redundancy in both target and compound spaces. Through these advances, the drug discovery community can overcome one of the most persistent limitations in computational affinity prediction and accelerate the development of novel therapeutic agents.
The prediction of binding affinity between small molecule drugs and their target proteins is a cornerstone of computational drug discovery. For decades, this field was dominated by classical scoring functions, which are limited by their reliance on simplified physical models and their inability to learn from large-scale data. The rise of machine learning (ML), particularly deep learning, has ushered in a paradigm shift, overcoming these constraints through data-driven approaches that capture complex patterns in molecular structures and interactions. This technical review examines the fundamental limitations of classical methods and delineates how modern ML architectures—including convolutional neural networks (CNNs), graph neural networks (GNNs), and transformer-based models—are achieving superior predictive performance. We provide a quantitative analysis of model capabilities, detailed experimental protocols for benchmarking, and visualization of key workflows. The transition to ML represents a fundamental advancement in the accuracy and efficiency of binding affinity prediction, with profound implications for accelerating drug discovery.
Classical scoring functions have been the workhorse of structure-based virtual screening for predicting protein-ligand binding affinity. These methods are generally categorized into three groups: force-field-based, empirical, and knowledge-based functions [28]. Despite their long-standing utility, they share critical limitations that have constrained their predictive accuracy and generalizability.
A primary shortcoming is their dependence on hand-crafted parameters and simplified physical models. Classical functions often rely on linear regression models that cannot assimilate large amounts of structural and binding data, limiting their capacity to capture the complex, non-linear relationships governing molecular interactions [18] [31]. Furthermore, their performance plateau in virtual screening and binding affinity prediction has been extensively documented; they show limited accuracy in predicting binding affinities for protein-ligand poses [28] [18].
Perhaps the most significant recent revelation is the problem of train-test data leakage and dataset redundancy. Studies have shown that the impressive benchmark performance of many models, including modern deep learning approaches, is artificially inflated due to structural similarities between the training set (e.g., PDBbind) and standard test benchmarks (e.g., the Comparative Assessment of Scoring Functions or CASF) [28]. One analysis found that nearly 50% of training complexes are part of a similarity cluster, and when trained on a properly filtered dataset (PDBbind CleanSplit), the performance of state-of-the-art models drops substantially, revealing that previous high scores were largely driven by data leakage rather than genuine generalization [28]. This indicates that the true generalization capability of many scoring functions has been systematically overestimated.
Machine learning models overcome the fundamental constraints of classical approaches by learning directly from data rather than relying on pre-defined physical equations. This data-driven paradigm allows them to discover intricate patterns in protein-ligand complexes that are intractable for classical functions.
The development of ML models for affinity prediction has progressed through several stages, each introducing more sophisticated architectural components.
Table: Evolution of Deep Learning Models for Affinity Prediction
| Model Era | Representative Architectures | Typical Input Representations | Key Innovations |
|---|---|---|---|
| Early Deep Learning | CNNs, RNNs [38] | SMILES strings, amino acid sequences [38] | Moving beyond manual feature engineering to automated feature learning from primary structures. |
| Graph-Based Models | GNNs (e.g., GraphDTA) [38] [36] | Molecular graphs for drugs, sequences or graphs for proteins [53] | Representing molecules as graphs to explicitly model atomic bonds and topology. |
| Attention & Transformer Models | Transformers, Self-Attention Mechanisms [38] | SMILES, sequences, often augmented with language model embeddings [38] | Capturing long-range dependencies and utilizing transfer learning from large language models (e.g., ProtBERT, ChemBERTa). |
| Multimodal & Hybrid Models | GNNs + Transformers, Diffusion Models [53] [36] [32] | 3D structures, sequences, graphs, and interaction networks [53] | Integrating multiple input representations and model types for a more holistic view of the complex. |
Diagram 1: The methodological evolution of binding affinity prediction models.
Quantitative benchmarking reveals the significant performance gap between classical and ML-based scoring functions. The following tables synthesize key metrics from rigorous evaluations.
Table: Virtual Screening Performance on DUD-E Benchmark (102 Targets)
| Scoring Function | Type | Hit Rate (Top 1%) | Hit Rate (Top 0.1%) | Notes |
|---|---|---|---|---|
| RF-Score-VS [18] | Machine Learning (Random Forest) | 55.6% | 88.6% | Trained on 15,426 active & 893,897 inactive molecules. |
| AutoDock Vina [18] | Classical | 16.2% | 27.5% | Used as a baseline for comparison. |
Table: Binding Affinity Prediction Performance (Pearson R)
| Model / Scenario | Trained on Standard PDBbind | Trained on PDBbind CleanSplit [28] | Performance Drop |
|---|---|---|---|
| Typical Deep Learning Model | High (e.g., R ~0.8+)* | Substantially Lower | Highlights effect of data leakage. |
| GEMS (GNN with LLM Transfer) [28] | Not Applicable | State-of-the-art | Maintains high performance on cleaned data. |
| Classical SF (e.g., Vina) [18] | - | R ≈ -0.18 | Poor correlation with experimental affinity. |
Note: Exact values for models suffering from leakage are omitted as they are considered unreliable [28].
Beyond affinity prediction, ML models excel in other docking tasks. A 2025 multidimensional evaluation of docking methods categorized them into four performance tiers based on pose accuracy and physical validity [32].
This study found that while generative diffusion models achieved superior pose accuracy (e.g., SurfDock RMSD ≤ 2 Å success rate >70%), traditional methods consistently excelled in producing physically plausible poses (PB-valid rates >94%) [32]. This highlights a current challenge for pure DL docking methods.
To ensure robust and generalizable model development, researchers must adopt rigorous experimental protocols that address common pitfalls like data leakage.
Objective: To generate a training dataset free of structural similarities with standard test sets, enabling a genuine assessment of model generalization [28].
Methodology:

1. Compute pairwise structural similarities between all training complexes (e.g., PDBbind) and all benchmark test complexes (e.g., CASF) [28].
2. Identify similarity clusters that link the training and test sets.
3. Remove every training complex that closely resembles any test complex, yielding a leakage-free training set (PDBbind CleanSplit) [28].
4. Retrain models on the filtered set and compare benchmark performance against models trained on the original data to quantify the contribution of leakage.
Objective: To simultaneously predict drug-target binding affinity and generate novel, target-aware drug molecules using a shared feature space, as exemplified by the DeepDTAGen framework [36].
Methodology:

1. Encode drugs and targets into a shared feature space consumed by both the affinity-prediction and molecule-generation heads [36].
2. Train both tasks jointly, mitigating gradient conflicts between the multitask objectives with a specialized optimizer such as FetterGrad [36].
3. Evaluate predicted affinities on held-out benchmarks and assess generated target-aware drug variants for validity and novelty.
Diagram 2: Multitask learning framework for affinity prediction and drug generation.
Successful development and benchmarking of ML models for affinity prediction rely on a suite of public databases, software tools, and computational resources.
Table: Essential Resources for Binding Affinity Research
| Resource Name | Type | Primary Function | Key Features / Usage |
|---|---|---|---|
| PDBbind [53] [31] | Database | Provides curated protein-ligand complexes with experimental binding affinity data. | Core dataset for training and testing; includes 3D structures and Kd, Ki, or IC50 values. |
| CASF Benchmark [28] [31] | Benchmark | Standardized benchmark for scoring function evaluation. | Used for rigorous testing; must be used with care to avoid data leakage with PDBbind. |
| BindingDB [53] [36] | Database | Public database of measured binding affinities. | Provides a large volume of interaction data for training and validation. |
| DUD-E [18] | Benchmark | Directory of useful decoys for virtual screening evaluation. | Contains known actives and property-matched decoys for 102 targets to test screening power. |
| Graph Neural Networks (GNNs) [28] [36] | Software/Algorithm | Models molecular structure as graphs for feature learning. | Represents drugs as graphs of atoms (nodes) and bonds (edges) to capture structural information. |
| ProtInter [54] | Software Tool | Calculates non-covalent interactions from protein complex PDB files. | Used for feature engineering in traditional ML; quantifies hydrogen bonds, hydrophobic interactions, etc. |
| PDBbind CleanSplit [28] | Dataset | A leakage-free version of PDBbind. | Essential for training models to ensure generalizability is not overestimated. |
The rise of machine learning has fundamentally transformed the landscape of binding affinity prediction. Data-driven models have demonstrably overcome the performance plateau of classical scoring functions by leveraging large datasets and advanced architectures to capture the complex physics of molecular interactions. However, this field continues to evolve rapidly, with several critical frontiers on the horizon.
The transition from classical to machine learning-based scoring functions marks a definitive maturation in computational drug discovery. By directly confronting the limitations of hand-crafted physics and linear models, ML approaches have established a new paradigm defined by learning, adaptability, and superior predictive power. As the field addresses current challenges surrounding data bias, generalizability, and interpretability, machine learning is poised to become an even more indispensable tool, accelerating the delivery of life-saving therapeutics.
The adoption of machine-learning scoring functions (ML-SFs) for protein-ligand binding affinity prediction represents a paradigm shift in structure-based drug design. While benchmark studies frequently report superior performance of ML-SFs over classical scoring functions, a closer examination reveals significant gaps between benchmark performance and real-world applicability. This review synthesizes recent evidence demonstrating how data leakage, dataset biases, and evaluation methodologies have systematically inflated perceived ML-SF performance. We analyze methodological advances for creating leakage-free benchmarks, explore the generalizability challenges of current approaches, and provide a technical framework for rigorous SF evaluation. The findings underscore that despite impressive benchmark metrics, most ML-SFs still struggle with target identification and generalization to novel protein families—critical requirements for successful drug discovery applications.
Accurate prediction of protein-ligand binding affinity is fundamental to computational drug discovery. The field has witnessed a rapid transition from classical scoring functions (based on physical principles, empirical data, or knowledge-based statistics) to machine-learning scoring functions (ML-SFs) that leverage complex patterns in structural data. Published literature often shows ML-SFs achieving remarkable performance on standard benchmarks, suggesting a dramatic improvement over classical approaches. However, a growing body of evidence indicates that these performance gains may be substantially overstated due to fundamental flaws in benchmarking methodologies.
The core issue lies in what has been termed the "inter-protein scoring noise problem" – while classical scoring functions can often enrich active molecules for a specific protein target, they frequently fail to identify the correct protein target for a given active molecule due to scoring variations between different binding pockets [9]. A truly robust binding affinity prediction method should overcome this limitation by demonstrating both capabilities. Recent investigations have revealed that the standard practice of training on the PDBbind database and testing on the Comparative Assessment of Scoring Functions (CASF) benchmark has created a situation of widespread train-test data leakage, severely inflating performance metrics and leading to overestimation of model generalization capabilities [28].
This technical review examines the performance gaps between classical and machine-learning scoring functions through a critical lens, focusing on the benchmarking realities that have obscured their true capabilities. We present comprehensive quantitative comparisons, analyze methodologies for proper model evaluation, and provide a framework for future development of more robust affinity prediction tools.
Recent investigations have uncovered substantial data leakage between standard training and test sets used in scoring function development. When models are trained on the PDBbind database and evaluated on the CASF benchmark, their performance metrics become significantly inflated due to structural similarities between the datasets [28].
Table 1: Quantifying Data Leakage Between PDBbind and CASF Benchmarks
| Metric | Before Filtering | After CleanSplit Filtering |
|---|---|---|
| Similar CASF complexes | 49% of all CASF complexes | Structurally distinct complexes |
| Similar training complexes | ~600 complexes | Removed from training set |
| Training set size | Full PDBbind | Reduced by 11.8% |
| Ligand-based leakage | Present (Tanimoto > 0.9) | Eliminated |
A structure-based clustering algorithm analyzing protein similarity (TM-scores), ligand similarity (Tanimoto scores), and binding conformation similarity (pocket-aligned ligand RMSD) identified that nearly 600 high-similarity pairs exist between PDBbind training and CASF complexes, affecting 49% of all CASF test complexes [28]. These similarities enable models to achieve high benchmark performance through memorization rather than genuine understanding of protein-ligand interactions.
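The pairwise similarity test underlying such clustering can be sketched in pure Python. The thresholds below are illustrative assumptions (the published algorithm's exact cutoffs may differ), and the TM-score and pocket-aligned RMSD are taken as precomputed inputs.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity between two fingerprint bit sets."""
    union = len(fp_a) + len(fp_b) - len(fp_a & fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def is_leaky_pair(tm_score: float, ligand_tanimoto: float, pocket_rmsd: float,
                  tm_cut: float = 0.8, tan_cut: float = 0.9,
                  rmsd_cut: float = 2.0) -> bool:
    """Flag a train/test complex pair as 'leaky' when protein similarity
    (TM-score), ligand similarity (Tanimoto), and binding-conformation
    similarity (pocket-aligned RMSD) all exceed their thresholds.
    Threshold values are assumptions for illustration."""
    return (tm_score >= tm_cut
            and ligand_tanimoto >= tan_cut
            and pocket_rmsd <= rmsd_cut)
```

Requiring all three criteria simultaneously is what lets the filter catch complexes with similar interaction patterns while sparing pairs that merely share a protein fold or a ligand scaffold.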
The impact of this leakage is substantial. When state-of-the-art models like GenScore and Pafnucy were retrained on a cleaned dataset with reduced leakage, their performance dropped markedly, indicating that their previously reported excellence was largely driven by data leakage rather than true generalization capability [28]. This pattern extends beyond specific architectures and suggests a systemic issue in the field's evaluation methodologies.
To address data leakage, researchers have proposed PDBbind CleanSplit, a training dataset curated by a structure-based filtering algorithm that eliminates train-test data leakage as well as redundancies within the training set [28]. The filtering employs a multi-stage approach that assesses protein similarity (TM-scores), ligand similarity (Tanimoto scores), and binding conformation similarity (pocket-aligned ligand RMSD) before removing offending complexes.
The algorithm revealed that nearly 50% of training complexes are part of similarity clusters, meaning random splitting inadvertently inflates validation performance as models can match validation complexes with similar training examples [28]. By addressing both train-test leakage and internal redundancies, CleanSplit provides a more rigorous foundation for model development and evaluation.
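A greedy version of this two-stage filtering might look like the following sketch, where `similar` stands in for the full multi-metric structural comparison; the real clustering algorithm is more sophisticated than this one-representative-per-cluster heuristic.

```python
def clean_split(train: list, test: list, similar) -> list:
    """Two-stage filter in the spirit of CleanSplit (simplified sketch).
    Stage 1: drop training complexes similar to any test complex.
    Stage 2: thin internal redundancy by greedily keeping one
    representative per similarity cluster.
    `similar(a, b)` is a caller-supplied predicate standing in for the
    multi-metric structural comparison."""
    # Stage 1: remove train-test leakage.
    no_leak = [t for t in train if not any(similar(t, s) for s in test)]
    # Stage 2: remove internal redundancy (greedy representative selection).
    kept = []
    for t in no_leak:
        if not any(similar(t, k) for k in kept):
            kept.append(t)
    return kept


# Toy demo: integers as stand-in "complexes", similarity = within 1 unit.
filtered = clean_split([1, 2, 5, 9], [10], lambda a, b: abs(a - b) <= 1)
```

In the toy call, complex `9` is dropped for test-set similarity and `2` for redundancy with `1`, mirroring the two removal categories reported for CleanSplit.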
When evaluated under leakage-free conditions, the performance gap between classical and machine-learning SFs narrows considerably. The table below summarizes comparative performance across multiple benchmarking scenarios:
Table 2: Performance Comparison of Scoring Function Types Under Different Evaluation Paradigms
| Scoring Function Type | CASF2016 RMSE (Traditional Split) | CASF2016 RMSE (CleanSplit) | Target Identification Accuracy | Generalization to Novel Targets |
|---|---|---|---|---|
| Classical SFs | 1.45-1.85 | 1.50-1.90 | Limited | Moderate |
| ML-SFs (Standard Training) | 1.15-1.35 | 1.40-1.75 | Limited | Poor to Moderate |
| ML-SFs (CleanSplit Training) | N/A | 1.20-1.50 | Improved | Moderate to Good |
| GEMS (GNN with CleanSplit) | N/A | 1.28 | Not reported | Good |
The performance degradation of ML-SFs when moving from standard benchmarks to more rigorous evaluations is particularly revealing. For instance, the graph neural network model GEMS (Graph neural network for Efficient Molecular Scoring) maintains higher benchmark performance when trained on CleanSplit, achieving a Pearson R of 0.856 on CASF2016 compared to significantly lower correlations for other models retrained on the same dataset [28]. This suggests that architectural choices and training strategies significantly impact genuine generalization capability.
A critical test for any binding affinity prediction method is its ability to identify the correct protein target for a given active molecule – a capability that remains challenging for both classical and ML approaches. Researchers have developed a new benchmark for target identification based on LIT-PCBA to evaluate whether modern models can correctly identify targets of active molecules [9].
Strikingly, even advanced models like Boltz-2, which claimed to approach the performance of free-energy perturbation in estimating binding affinity, cannot reliably identify the correct protein target by predicting higher binding affinity compared to decoy targets [9]. This failure occurs despite promising performance on traditional affinity prediction benchmarks, highlighting a fundamental limitation in current approaches.
This target identification challenge represents what researchers have termed "the next major hurdle to successful deep-learning-based affinity prediction using protein-ligand complexes" [9]. Any model truly capable of accurate binding affinity prediction should perform well on target-prediction benchmark tasks, a standard that most current ML-SFs fail to meet.
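A target-identification benchmark of this kind reduces to a simple ranking check: for each active molecule, does the true target receive the highest predicted affinity among all candidate targets? A minimal sketch (data structures here are illustrative, not the LIT-PCBA benchmark's actual format):

```python
def target_id_accuracy(predictions: dict, true_targets: dict) -> float:
    """predictions: molecule -> {target: predicted affinity}, where
    higher values mean stronger predicted binding.
    true_targets: molecule -> its experimentally known target.
    Returns the fraction of molecules whose true target outranks
    all decoy targets."""
    hits = 0
    for mol, scores in predictions.items():
        best_target = max(scores, key=scores.get)
        hits += (best_target == true_targets[mol])
    return hits / len(predictions)


# Toy example: molA is assigned correctly, molB is not.
preds = {"molA": {"T1": 7.1, "T2": 5.0},
         "molB": {"T1": 6.0, "T2": 5.5}}
truth = {"molA": "T1", "molB": "T2"}
accuracy = target_id_accuracy(preds, truth)
```

A model suffering from inter-protein scoring noise can score well within each target column yet still fail this row-wise comparison across targets.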
The development of rigorous benchmarking methodologies has become as important as the development of new models. Structure-based filtering algorithms represent a significant advance in this direction. These algorithms employ a multi-modal approach, spanning protein, ligand, and binding-conformation similarity, to identify and remove problematic overlaps.
Figure 1: Structure-based filtering workflow for detecting data leakage. The algorithm assesses similarity across multiple dimensions before making filtering decisions.
The algorithm computes similarity between protein-ligand complexes using three complementary metrics: protein similarity (TM-scores), ligand similarity (Tanimoto scores), and binding conformation similarity (pocket-aligned ligand RMSD) [28]. This multi-modal approach can identify complexes with similar interaction patterns even when proteins have low sequence identity, addressing limitations of traditional sequence-based analysis.
Beyond data leakage concerns, researchers have developed new benchmarking approaches that test different aspects of model capability:
Target Identification Benchmarks: These evaluate whether models can identify the correct protein target for active molecules by predicting higher binding affinity compared to decoy targets [9]. This addresses the "inter-protein scoring noise problem" where classical SFs fail to identify correct targets despite enriching actives for specific targets.
Synthetic Data Augmentation: This approach tests model robustness using AI-predicted complexes rather than experimental structures. Studies show that data augmentation benefits depend critically on structural quality, with low-quality synthetic examples providing limited value [55].
Orthogonal Dataset Validation: Models are tested on datasets specifically designed to be structurally distinct from training data, such as the FEP benchmark dataset with minimal data leakage from training sets [55].
The GEMS (Graph neural network for Efficient Molecular Scoring) model exemplifies recent advances in ML-SF design that address generalization challenges. GEMS combines a sparse graph modeling of protein-ligand interactions with transfer learning from language models, enabling it to maintain state-of-the-art performance when trained on the cleaned PDBbind CleanSplit dataset [28].
Ablation studies with GEMS revealed that the model fails to produce accurate predictions when protein nodes are omitted from the graph, suggesting its predictions are based on genuine understanding of protein-ligand interactions rather than exploiting dataset biases [28]. This represents an important validation of the model's learning mechanism.
PATH+ represents a different philosophical approach, prioritizing interpretability alongside performance. This algorithm uses persistent homology, a mathematical tool from algebraic topology, to encode structural binding features [56]. Unlike black-box deep learning models, PATH+ provides inherent interpretability, allowing researchers to trace predictions back to specific atomic interactions.
The "persistence fingerprint" in PATH+ efficiently captures geometric properties such as molecular cavities and interaction patterns at multiple scales [56]. This approach demonstrates that high accuracy doesn't necessarily require sacrificing interpretability, addressing a key limitation of many deep learning-based SFs.
The SG-ML-PLAP framework combines extended connectivity interaction features (ECIF) with machine learning to predict binding affinities [57]. This approach shows improved performance compared to conventional scoring functions and several other ML-SFs, particularly when training on crystal structures is supplemented with redocked protein-ligand complexes.
Benchmarking on CASF datasets and prediction of unseen protein-ligand complexes with different structural features demonstrates the framework's robustness [57]. The integration of multiple data sources and feature types represents a pragmatic approach to improving model generalization.
Table 3: Key Research Reagents and Computational Tools for Scoring Function Development
| Resource | Type | Function | Access |
|---|---|---|---|
| PDBbind CleanSplit | Dataset | Leakage-free training and evaluation data | Publicly available |
| CASF Benchmark | Benchmark suite | Standardized performance assessment | Publicly available |
| GEMS | Software | Graph neural network for affinity prediction | Open source |
| PATH+ | Software | Interpretable topological affinity prediction | Open source (OSPREY) |
| SG-ML-PLAP | Web server | Structure-guided ML affinity predictor | http://www.nii.ac.in/sg-ml-plap.html |
| Boltz-2 | Model | Biomolecular foundation model for affinity | Not specified |
| AEV-PLIG | Software | GNN-based scoring function | Not specified |
To avoid data leakage artifacts, researchers should implement rigorous dataset splitting protocols, such as structure-based filtering of the training set (as in PDBbind CleanSplit), UniProt-based protein splits, and compound-level holdouts that withhold all data for specific ligands.
These protocols help ensure that reported performance metrics reflect genuine generalization capability rather than memorization of training examples.
A robust benchmarking workflow should evaluate multiple aspects of model performance:
Figure 2: Comprehensive benchmarking workflow for scoring functions. A rigorous evaluation assesses multiple performance aspects beyond simple affinity prediction.
This multi-faceted evaluation approach ensures that models are tested on clinically relevant tasks including affinity prediction (scoring power), pose prediction (docking power), and target identification – each of which requires different capabilities.
The benchmarking realities in scoring function development reveal a complex landscape where reported performance metrics often obscure significant limitations. While machine-learning SFs demonstrate impressive capabilities on standard benchmarks, their advantages over classical approaches diminish substantially when evaluated under leakage-free conditions and on clinically relevant tasks like target identification.
The field is transitioning toward more rigorous evaluation methodologies that better reflect real-world drug discovery challenges. Critical developments include structure-based dataset filtering, novel benchmarking paradigms, and emphasis on model interpretability. These advances are essential for developing ML-SFs that genuinely improve rather than simply replicating the limitations of classical approaches in new forms.
Future progress will likely depend on several key developments: (1) creation of larger, more diverse, and rigorously curated training datasets; (2) development of evaluation standards that include target identification capabilities; (3) improved model architectures that better capture physical principles of binding; and (4) greater emphasis on interpretability to build trust and provide mechanistic insights. As these developments mature, ML-SFs may finally deliver on their promise to transform computational drug discovery.
The accurate prediction of protein-ligand binding affinity represents a cornerstone of structure-based drug design. For decades, classical scoring functions—built upon physical force-field, empirical, or knowledge-based approaches—have been the standard computational tools for this task. However, these traditional methods suffer from well-documented limitations, including oversimplification of desolvation and entropy effects, and reliance on linear regression techniques that fail to capture the complex, non-linear nature of molecular interactions [58]. The advent of deep learning promised to overcome these limitations through sophisticated architectures capable of learning intricate patterns from large-scale structural databases. Paradoxically, many of these modern approaches have failed to deliver substantial improvements in real-world drug discovery applications, despite reporting impressive benchmark performance [28] [59].
This discrepancy between reported accuracy and practical utility stems from a fundamental confusion between generalization and memorization. Recent research has revealed that the standard practice of randomly splitting data between training and test sets creates an artificial scenario that allows models to exploit hidden biases and structural similarities in the data. Consequently, models appear highly accurate during benchmarking but perform poorly when faced with truly novel protein-ligand complexes in prospective drug discovery campaigns [28] [59]. This paper examines the root causes of this generalization crisis, presents rigorous methodologies for proper model evaluation, and introduces emerging solutions designed to restore the predictive power of computational affinity prediction.
Recent investigations have exposed systematic flaws in the standard benchmarks used to evaluate scoring functions. The most critical issue involves data leakage between the PDBbind database (used for training) and the Comparative Assessment of Scoring Functions (CASF) benchmark (used for testing). A structure-based clustering analysis revealed that nearly 600 significant similarities exist between PDBbind training complexes and CASF test complexes, affecting approximately 49% of all CASF complexes [28]. These similarities extend beyond mere sequence identity to encompass ligand structures, binding pocket configurations, and even binding affinity values.
The consequences of this leakage are profound. When models encounter test complexes that closely resemble those in their training set, they can achieve high accuracy through simple pattern matching rather than genuine understanding of protein-ligand interactions. Alarmingly, some models maintain competitive performance even when critical protein or ligand information is deliberately omitted from inputs, suggesting they rely on memorizing spurious correlations rather than learning fundamental principles of molecular recognition [28].
Table 1: Performance Degradation When Addressing Data Leakage
| Model Type | Performance on Standard Split | Performance on Clean Split | Performance Drop | Evaluation Metric |
|---|---|---|---|---|
| GenScore | Excellent benchmark performance | Marked performance drop | Substantial | RMSE on CASF2016 |
| Pafnucy | Excellent benchmark performance | Marked performance drop | Substantial | RMSE on CASF2016 |
| GEMS (novel GNN) | State-of-the-art performance | Maintains high performance | Minimal | RMSE on CASF2016 |
| Simple similarity algorithm | Competitive with published models | N/A | N/A | Pearson R = 0.716 |
When state-of-the-art binding affinity prediction models like GenScore and Pafnucy were retrained on a properly curated dataset (PDBbind CleanSplit) with reduced data leakage, their performance dropped markedly compared to their reported benchmarks. This confirms that their previously reported high performance was largely driven by data leakage rather than genuine generalization capability [28]. In contrast, a simple algorithm that predicts binding affinity by averaging values from the five most similar training complexes achieved competitive performance with some published deep-learning models, demonstrating that sophisticated architectures may be accomplishing little more than this straightforward similarity matching [28].
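That similarity-averaging baseline can be reproduced in a few lines: predict a query complex's affinity as the mean affinity of its k most similar training complexes. The sketch below uses ligand-fingerprint Tanimoto similarity as a stand-in for the full structural comparison used in the study.

```python
def knn_affinity_baseline(query_fp: set, train_set: list, k: int = 5) -> float:
    """Predict binding affinity as the mean affinity of the k most
    similar training complexes.
    train_set: list of (fingerprint_bit_set, affinity) pairs.
    Similarity here is ligand-fingerprint Tanimoto only, a simplifying
    assumption relative to the full structural comparison."""
    def tanimoto(a, b):
        union = len(a) + len(b) - len(a & b)
        return len(a & b) / union if union else 0.0

    ranked = sorted(train_set,
                    key=lambda item: tanimoto(query_fp, item[0]),
                    reverse=True)
    top = ranked[:k]
    return sum(affinity for _, affinity in top) / len(top)


# Toy example with three training "complexes" and k=2.
train = [({1, 2}, 5.0), ({1, 2, 3}, 7.0), ({9}, 1.0)]
prediction = knn_affinity_baseline({1, 2, 3}, train, k=2)
```

Any deep model that cannot clearly beat this baseline on a leakage-free split is, in effect, doing expensive similarity matching.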
Proper data partitioning is essential for meaningful evaluation of model generalization. The standard random splitting approach, while methodologically straightforward, often produces optimistically biased performance estimates. More rigorous strategies include UniProt-based protein splits, compound-level ("split-by-inhibitor") holdouts, and structure-based filtering of the training set.
Studies implementing these rigorous partitioning strategies consistently reveal substantial performance degradation compared to random splits. For instance, models showing high predictive correlations (Pearson coefficients up to 0.70) under random partitioning exhibited significantly reduced performance with UniProt-based partitioning [60].
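A UniProt-based partition can be implemented as a group-aware split that keeps every complex of a given protein on the same side, so no protein appears in both train and test. This is a minimal sketch; real protocols may additionally balance affinity distributions or cluster proteins by sequence identity.

```python
from collections import defaultdict
import random


def uniprot_split(complexes: list, test_fraction: float = 0.2, seed: int = 0):
    """Group-aware train/test split. Each complex is a dict with a
    'uniprot_id' key; all complexes sharing a UniProt ID land on the
    same side of the split."""
    groups = defaultdict(list)
    for c in complexes:
        groups[c["uniprot_id"]].append(c)
    ids = sorted(groups)
    random.Random(seed).shuffle(ids)  # deterministic for a fixed seed
    n_test = max(1, int(len(ids) * test_fraction))
    test_ids = set(ids[:n_test])
    train = [c for uid in ids[n_test:] for c in groups[uid]]
    test = [c for uid in test_ids for c in groups[uid]]
    return train, test
```

Splitting at the protein level rather than the complex level is precisely what exposed the performance drops reported under UniProt-based partitioning.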
The PDBbind CleanSplit methodology represents a comprehensive approach to addressing both train-test leakage and internal dataset redundancy [28]. The protocol identifies structurally similar complexes using protein TM-scores, ligand Tanimoto similarity, and pocket-aligned ligand RMSD, then removes both train-test matches and redundant clusters within the training set.
This systematic filtering resulted in the removal of approximately 11.8% of training complexes (4% for direct train-test similarity and 7.8% for internal redundancies), producing a refined dataset that enables genuine evaluation of model generalization [28].
Diagram 1: PDBbind CleanSplit Creation Workflow. This protocol systematically removes structurally similar complexes between training and test sets.
Beyond proper data partitioning, comprehensive evaluation requires multiple metrics that assess different aspects of model performance, including affinity accuracy (RMSE and Pearson R), ranking power, and target-identification accuracy.
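The two headline regression metrics are straightforward to compute; a dependency-free sketch:

```python
import math


def rmse(y_true: list, y_pred: list) -> float:
    """Root-mean-square error between experimental and predicted
    affinities (e.g. in pK units)."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def pearson_r(y_true: list, y_pred: list) -> float:
    """Pearson correlation coefficient between experimental and
    predicted affinities."""
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in y_true))
    sp = math.sqrt(sum((p - mp) ** 2 for p in y_pred))
    return cov / (st * sp)
```

Note that Pearson R is invariant to linear rescaling of the predictions, which is why RMSE and correlation must be reported together: a model can rank complexes well while being systematically miscalibrated.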
A comprehensive study of convolutional neural networks for kinase inhibitor prediction revealed dramatic performance differences depending on splitting strategy [59]. When using standard random splitting, models achieved performance comparable to state-of-the-art reports. However, when evaluated using "split-by-inhibitor" methodology—where all data for specific compounds were withheld from training—model performance deteriorated substantially, with some models showing no improvement over simple baseline methods.
This failure demonstrates that models were primarily memorizing kinase phylogeny and matching chemical analogues rather than learning fundamental principles of molecular recognition. The models successfully associated specific molecular scaffolds with particular kinase subfamilies but could not generalize to novel chemical entities, severely limiting their utility in prospective drug discovery where truly novel compounds are of greatest interest [59].
The generalization failure extends beyond kinase-specific applications to general protein-ligand binding affinity prediction. Analysis of top-performing models in the CASF benchmark revealed that many could not maintain their performance when evaluated on the PDBbind CleanSplit dataset [28]. The observed performance drops were not random—models consistently failed for complexes with novel structural features not represented in their training data, while maintaining accuracy for complexes similar to those they had seen during training.
This pattern confirms that the models were operating primarily through memorization and similarity matching rather than genuine understanding of physical interactions. The problem was particularly pronounced for models that used limited molecular representations or lacked sufficient architectural capacity to capture complex physical relationships [28].
Table 2: Experimental Protocols for Assessing Generalization
| Experiment | Protocol | Key Measurements | Interpretation |
|---|---|---|---|
| Data Splitting Comparison | Train and evaluate identical models using random splitting vs. strict splitting methods | Performance difference between splitting methods | Quantifies overoptimism from standard evaluation |
| Ablation Analysis | Systematically remove protein or ligand information from input features | Performance degradation with reduced information | Tests whether predictions use genuine interaction information |
| Similarity-based Prediction | Implement simple similarity-matching algorithm as baseline | Comparison with complex model performance | Establishes minimum expected performance |
| Cross-Dataset Evaluation | Train on one dataset, evaluate on entirely different dataset | Absolute performance on external dataset | Measures real-world generalization |
Novel neural network architectures show promise for genuine generalization. The GEMS (Graph neural network for Efficient Molecular Scoring) model employs a sparse graph representation of protein-ligand interactions combined with transfer learning from protein language models [28]. This approach maintains high benchmark performance even when trained on the challenging PDBbind CleanSplit dataset, suggesting true learning of interaction principles rather than data memorization.
Crucially, ablation studies with GEMS demonstrate that the model fails to produce accurate predictions when protein nodes are omitted from the graph, confirming that its predictions rely on genuine understanding of protein-ligand interactions rather than exploiting dataset biases [28].
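The node-ablation test itself can be sketched as a simple graph transformation that strips all nodes of one type before re-evaluation. The dict-based graph representation below is a toy stand-in, not GEMS's actual data structures.

```python
def ablate_nodes(graph: dict, node_type: str) -> dict:
    """Return a copy of the graph with all nodes of the given type
    (and their incident edges) removed. Re-scoring the model on the
    ablated graph tests whether its predictions genuinely depend on
    that information."""
    nodes = {name: kind for name, kind in graph["nodes"].items()
             if kind != node_type}
    edges = [(a, b) for a, b in graph["edges"] if a in nodes and b in nodes]
    return {"nodes": nodes, "edges": edges}


# Toy complex graph: one protein node, two ligand nodes.
g = {"nodes": {"p1": "protein", "l1": "ligand", "l2": "ligand"},
     "edges": [("p1", "l1"), ("l1", "l2")]}
ligand_only = ablate_nodes(g, "protein")
```

A model whose accuracy barely changes on the ligand-only graph is, by construction, ignoring the protein and exploiting dataset biases instead.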
Integration of diverse modeling approaches through meta-modeling (ensemble learning) provides another path to improved generalization. By combining classical force-field-based scoring functions with sequence-based deep learning models, researchers have created meta-models that outperform individual base models while demonstrating improved generalization across diverse benchmarks [58].
These hybrid approaches benefit from the complementary strengths of different methodologies—physical models provide theoretical grounding and interpretability, while data-driven models capture complex patterns that may be difficult to parameterize explicitly. The resulting ensembles show more consistent performance across different target classes and reduced sensitivity to dataset-specific biases [58].
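At its simplest, such a meta-model is a weighted combination of base predictors. The uniform default weights below are illustrative; published meta-models typically fit the combination weights on held-out validation data.

```python
def meta_model_predict(complex_features, base_models: list,
                       weights: list = None) -> float:
    """Ensemble affinity prediction: a (optionally weighted) mean over
    base scoring functions, e.g. a classical force-field SF and a
    sequence-based deep learning model.
    base_models: callables mapping features -> predicted affinity."""
    preds = [model(complex_features) for model in base_models]
    if weights is None:
        weights = [1.0 / len(preds)] * len(preds)  # uniform average
    return sum(w * p for w, p in zip(weights, preds))


# Toy example with two stand-in base models.
prediction = meta_model_predict(None, [lambda f: 6.0, lambda f: 8.0])
```

Because the base models fail in different ways, their errors partially cancel, which is the mechanism behind the more consistent cross-target performance reported for these ensembles.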
Leveraging transfer learning from large-scale protein language models (e.g., ESM-2) provides a powerful strategy for embedding fundamental biochemical knowledge into affinity prediction models [60]. These pre-trained embeddings capture evolutionary constraints and structural principles that generalize across diverse protein families, reducing the tendency to memorize dataset-specific patterns.
Similarly, multi-task learning approaches that simultaneously predict multiple properties (binding affinity, solubility, toxicity) force models to develop more general representations of molecular interactions, improving performance on any single task including affinity prediction.
Diagram 2: Architecture Strategies for Improved Generalization. Combining multiple approaches addresses limitations of individual methods.
Table 3: Key Research Reagents and Computational Tools
| Resource | Type | Primary Function | Generalization Relevance |
|---|---|---|---|
| PDBbind Database | Dataset | Comprehensive collection of protein-ligand structures with experimental binding affinities | Foundation for training and evaluation; requires proper filtering |
| CASF Benchmark | Evaluation Framework | Standardized test sets for scoring function comparison | Contains documented data leakage; requires careful usage |
| PDBbind CleanSplit | Curated Dataset | Structure-filtered training set minimizing similarity to test complexes | Enables proper generalization assessment |
| ESM-2 Protein Language Model | Pre-trained Model | Provides evolutionary-informed protein representations | Transfer learning improves generalization to novel proteins |
| GEMS (Graph neural network for Efficient Molecular Scoring) | Model Architecture | Sparse graph representation of protein-ligand interactions | Maintains performance on independent test sets |
| Anchor-Query Framework | Methodology | Leverages limited reference data to predict unknown states | Improves prediction for novel targets with minimal data |
The distinction between generalization and memorization represents a critical challenge for computational drug discovery. The documented performance overestimation in current binding affinity prediction models stems from fundamental flaws in evaluation methodologies rather than technical limitations of the models themselves. By adopting rigorous data partitioning strategies, comprehensive evaluation protocols, and architecturally innovative approaches, the field can transition from models that excel retrospectively on biased benchmarks to those that offer genuine predictive power for novel drug targets.
The path forward requires a cultural shift in how we evaluate computational models—prioritizing rigorous generalization assessment over impressive but potentially misleading benchmark performance. The solutions outlined in this review, including the PDBbind CleanSplit protocol, advanced architectures like GEMS, and sophisticated meta-modeling approaches, provide a foundation for developing the next generation of predictive tools that will truly accelerate drug discovery rather than simply providing retrospective accuracy.
Structure-based drug design (SBDD) relies heavily on computational methods to predict how small molecules interact with biological targets, with molecular docking serving as a cornerstone technique. [61] At the heart of every molecular docking tool lies its scoring function (SF), a mathematical model that predicts the binding affinity and orientation of a ligand within a protein's binding pocket. [32] [61] The accuracy and reliability of these scoring functions directly impact the success of virtual screening (VS) and binding pose prediction, critically influencing the efficiency of lead discovery and optimization in drug development. [27] [61]
This review is framed within the context of a broader thesis on the limitations of classical scoring functions for affinity prediction. Traditional SFs, categorized as physics-based, empirical, or knowledge-based, have long been plagued by persistent challenges. [28] A well-documented phenomenon is the "inter-protein scoring noise," where classical SFs can enrich active molecules for a single protein target but fail to identify the correct target for a given active molecule due to scoring variations between different binding pockets. [9] This limitation severely restricts their utility in target identification and polypharmacology studies. Furthermore, the advent of deep learning (DL) has promised a paradigm shift, yet comprehensive benchmarking reveals that these modern approaches often struggle with generalization, physical plausibility, and robustness against data leakage, indicating that the field has not yet fully overcome the fundamental hurdles of affinity prediction. [9] [32] [28]
To objectively compare the performance of different SF types, researchers employ standardized benchmark datasets and specific evaluation protocols across several key dimensions.
Recent studies highlight that the standard practice of training on the PDBbind database and testing on the Comparative Assessment of Scoring Functions (CASF) benchmark suffers from significant train-test data leakage, severely inflating performance estimates. [28] Nearly 49% of CASF complexes have highly similar counterparts in the training set, allowing models to exploit memorization rather than genuine learning of protein-ligand interactions. [28] To address this, rigorously curated datasets like PDBbind CleanSplit have been developed, which apply structure-based filtering algorithms to ensure strict separation between training and test complexes, enabling a more realistic assessment of generalization capability. [28]
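As a concrete illustration of this kind of structure-based filtering, the sketch below removes training complexes whose ligand fingerprints are too similar to any test-set ligand. This is a minimal, hypothetical example in pure Python, not the actual CleanSplit algorithm (which also weighs protein similarity and ligand positioning); the fingerprints here are toy bit sets.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprint bit sets."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def filter_leaked(train, test, threshold=0.5):
    """Drop training complexes whose ligand fingerprint is too similar
    to any test-set ligand (one axis of a leakage filter)."""
    return [
        (cid, fp) for cid, fp in train
        if all(tanimoto(fp, test_fp) < threshold for _, test_fp in test)
    ]

# Toy data: PDB-style IDs paired with fingerprint bit sets.
train = [("1abc", {1, 2, 3, 4}), ("2xyz", {10, 11})]
test = [("9tst", {1, 2, 3, 5})]
kept = filter_leaked(train, test, threshold=0.5)  # "1abc" is too close to "9tst"
```

In a real pipeline the threshold and the similarity measure would be tuned per the published protocol; the point is that filtering happens before training, on structural grounds.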
A comprehensive 2025 evaluation of nine docking methods across the Astex diverse set, PoseBusters benchmark set, and DockGen dataset revealed distinct performance tiers when assessing both pose accuracy (RMSD ≤ 2 Å) and physical validity (PB-valid): traditional methods > hybrid AI scoring with traditional conformational search > generative diffusion methods > regression-based methods. [32]
Table 1: Pose Prediction Accuracy and Physical Validity Across Benchmark Datasets [32]
| Method Category | Specific Method | RMSD ≤ 2 Å (Astex) | PB-Valid (Astex) | Combined Success (Astex) | Combined Success (DockGen) |
|---|---|---|---|---|---|
| Traditional | Glide SP | 81.18% | 97.65% | 79.41% | 42.31% |
| Traditional | AutoDock Vina | 75.29% | 89.41% | 68.24% | 26.92% |
| Generative Diffusion | SurfDock | 91.76% | 63.53% | 61.18% | 33.33% |
| Generative Diffusion | DiffBindFR (MDN) | 75.29% | 52.94% | 43.53% | 18.52% |
| Regression-Based | KarmaDock | 31.76% | 35.29% | 14.12% | 2.56% |
| Regression-Based | QuickBind | 37.65% | 31.76% | 17.65% | 2.56% |
This stratification highlights the diverse strengths and limitations of each approach. While generative diffusion models like SurfDock achieve exceptional pose accuracy, they frequently produce physically implausible structures with steric clashes or incorrect bond geometries. [32] In contrast, traditional methods like Glide SP maintain remarkably high physical validity across all datasets, though with somewhat lower pose accuracy. [32]
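The two success criteria combined in Table 1, pose accuracy (RMSD ≤ 2 Å) and physical validity, can be expressed compactly. The sketch below is a minimal illustration assuming matched heavy-atom coordinate lists; production pipelines use symmetry-corrected RMSD and the full PoseBusters check suite rather than a boolean flag.

```python
import math

def rmsd(pose_a, pose_b):
    """Heavy-atom RMSD between two poses given matched coordinate lists."""
    assert len(pose_a) == len(pose_b)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(pose_a, pose_b))
    return math.sqrt(sq / len(pose_a))

def combined_success(poses, rmsd_cutoff=2.0):
    """Fraction of poses that are both accurate (RMSD <= cutoff) and
    physically valid (pass a PoseBusters-style check)."""
    hits = sum(1 for r, valid in poses if r <= rmsd_cutoff and valid)
    return hits / len(poses)

# (rmsd_value, passed_validity_checks) pairs for four docked poses.
rate = combined_success([(1.2, True), (1.8, False), (3.5, True), (0.9, True)])
```

Requiring both criteria is exactly why SurfDock's 91.76% RMSD success collapses to 61.18% combined success on Astex: accurate but physically implausible poses are counted as failures.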
In virtual screening applications, the combination of docking tools with machine learning-based rescoring has demonstrated significant performance improvements. A benchmarking study against both wild-type and quadruple-mutant Plasmodium falciparum Dihydrofolate Reductase (PfDHFR) produced the following enrichment factors:
Table 2: Virtual Screening Performance (EF1%) Against PfDHFR Variants [27]
| Docking Tool | Scoring Function | Wild-Type EF1% | Quadruple-Mutant EF1% |
|---|---|---|---|
| PLANTS | Native | 21 | 19 |
| PLANTS | CNN-Score | 28 | 24 |
| FRED | Native | 18 | 22 |
| FRED | CNN-Score | 25 | 31 |
| AutoDock Vina | Native | <1 (worse-than-random) | <1 (worse-than-random) |
| AutoDock Vina | RF-Score-VS v2 | 15 (better-than-random) | 17 (better-than-random) |
Notably, rescoring with CNN-Score consistently enhanced screening performance across both variants, with the combination of FRED and CNN-Score achieving the highest enrichment (EF1% = 31) against the resistant quadruple mutant. [27] This demonstrates the potential of ML-based rescoring to overcome limitations of classical SFs, particularly for challenging targets like drug-resistant enzymes.
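The EF1% metric used throughout this comparison has a simple definition: the active rate in the top-scored 1% of the library divided by the active rate in the whole library, so random screening scores 1. A minimal sketch:

```python
def enrichment_factor(scores, labels, top_frac=0.01):
    """EF at top_frac: active rate in the top-scored fraction divided by
    the active rate across the whole library (random baseline = 1)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(len(ranked) * top_frac))
    top_actives = sum(label for _, label in ranked[:n_top])
    return (top_actives / n_top) / (sum(labels) / len(labels))

# Toy library: 200 compounds, 10 actives, 2 of them ranked in the top 1%.
scores = list(range(200, 0, -1))
labels = [1, 1] + [0] * 100 + [1] * 8 + [0] * 90
ef1 = enrichment_factor(scores, labels, top_frac=0.01)
```

With 2 of 10 actives in the top 2 compounds, the active rate there is 100% versus a 5% library baseline, giving EF1% = 20; values like the 31 reported for FRED + CNN-Score indicate even stronger early enrichment.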
For target identification, the critical test is whether an SF can correctly identify the true protein target for a given active molecule by predicting higher binding affinity compared to decoy targets. Alarmingly, even the recently developed Boltz-2 biomolecular foundation model, which claimed to approach Free Energy Perturbation (FEP) performance, failed this fundamental test in a rigorous benchmark based on LIT-PCBA, indicating that generalizable understanding of protein-ligand interactions remains an unachieved goal. [9]
The accuracy of binding affinity prediction varies substantially across SF types, with deep learning models particularly affected by data leakage issues:
Table 3: Binding Affinity Prediction Performance on CASF Benchmark [28]
| Model | Training Dataset | Pearson R (CASF-2016) | Generalization Assessment |
|---|---|---|---|
| GenScore | Original PDBbind | 0.826 | Severely inflated |
| GenScore | PDBbind CleanSplit | 0.685 | True performance |
| Pafnucy | Original PDBbind | 0.779 | Severely inflated |
| Pafnucy | PDBbind CleanSplit | 0.612 | True performance |
| GEMS (GNN) | PDBbind CleanSplit | 0.816 | Robust generalization |
| Similarity Search Algorithm | - | 0.716 | Benchmark reference |
When trained on the original PDBbind database, top-performing models show excellent benchmark performance, but this drops markedly when trained on the cleaned PDBbind CleanSplit dataset, confirming that their previous high scores were largely driven by data leakage rather than genuine generalization. [28] In contrast, the GEMS graph neural network maintains high performance when trained on CleanSplit, suggesting more robust learning of protein-ligand interactions. [28]
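The Pearson R values in Table 3 measure the linear correlation between predicted and experimental affinities across the test set. For reference, a minimal implementation:

```python
import math

def pearson_r(predicted, experimental):
    """Pearson correlation between predicted and experimental affinities."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_e = sum(experimental) / n
    cov = sum((p - mean_p) * (e - mean_e)
              for p, e in zip(predicted, experimental))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    sd_e = math.sqrt(sum((e - mean_e) ** 2 for e in experimental))
    return cov / (sd_p * sd_e)
```

Note that R is invariant to linear rescaling of the predictions, which is one reason it is usually reported alongside RMSE and ranking metrics rather than on its own.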
Alternative approaches like the PATH+ algorithm, which uses persistent homology to capture geometric properties of protein-ligand complexes, demonstrate comparable accuracy with the added benefits of interpretability and superior computational efficiency (10x faster than previous topology-based methods). [56] For specific target classes like GPCRs, advanced molecular dynamics approaches with re-engineered Bennett acceptance ratio (BAR) methods have shown promising correlation with experimental data (R² = 0.789). [30]
The comprehensive assessment referenced in Section 3.1 followed a rigorous, standardized workflow for benchmarking scoring functions across multiple datasets and performance dimensions [32]. To address the data leakage concerns highlighted in Section 3.3, the PDBbind CleanSplit protocol implements a structure-based filtering approach that enforces strict separation between training and test complexes [28]. The PfDHFR benchmarking study referenced in Section 3.2 employed an integrated workflow combining docking with machine-learning rescoring [27].
Together, these protocols probe a fundamental limitation of classical scoring functions, inter-protein scoring noise: a function may perform well within a single target yet fail at cross-target identification. A comprehensive assessment therefore requires testing across multiple, complementary performance dimensions.
Table 4: Key Research Reagents and Computational Resources
| Resource Name | Type | Primary Function | Application Context |
|---|---|---|---|
| PDBbind CleanSplit | Dataset | Training data with minimized train-test leakage | Robust model training and evaluation [28] |
| DEKOIS 2.0 | Benchmark Set | Provides active compounds and challenging decoys | Virtual screening benchmarking [27] |
| PoseBusters | Validation Tool | Checks physical/chemical plausibility of poses | Pose quality assessment [32] |
| Persistent Homology | Mathematical Framework | Encodes multi-scale geometric features | Interpretable affinity prediction (PATH+) [56] |
| BAR Method | Algorithm | Free energy calculation with explicit solvation | High-accuracy affinity prediction for membrane proteins [30] |
| CNN-Score | ML Scoring Function | Rescoring docking poses with convolutional networks | Virtual screening enhancement [27] |
This comparative analysis reveals that no single scoring function type currently dominates across all critical performance dimensions. Classical physics-based functions demonstrate superior physical plausibility and robustness, while deep learning approaches show promise in specific tasks like virtual screening enrichment but struggle with generalization and data leakage issues. The persistence of fundamental challenges like inter-protein scoring noise and the performance gap observed when proper dataset splitting is implemented suggests that the field must move beyond traditional benchmarking practices. Future developments should prioritize truly generalizable models that capture universal principles of molecular recognition rather than exploiting dataset-specific patterns, with interpretability and physical plausibility as central design considerations alongside predictive accuracy.
The field of computational drug design has long been hampered by the limitations of classical scoring functions in accurately predicting protein-ligand binding affinities. These traditional methods, often based on force-fields, empirical data, or knowledge-based statistical potentials, struggle with generalizability and accuracy, creating a bottleneck in structure-based drug design (SBDD) [28] [35]. The emergence of deep learning, particularly Graph Neural Networks (GNNs), represents a paradigm shift. GNNs have quietly become a transformative tool, revolutionizing drug discovery by accurately modeling molecular structures and interactions with binding targets [62] [63]. However, this progress has been accompanied by significant challenges, most notably the issue of data leakage in public benchmarks that has severely inflated performance metrics and led to an overestimation of model capabilities [28]. This whitepaper details how the convergence of novel GNN architectures, rigorously curated datasets, and advanced training protocols is establishing a new gold standard, finally narrowing the performance gap with computationally intensive physics-based methods like Free Energy Perturbation (FEP) while being orders of magnitude faster [35].
Classical scoring functions have been the cornerstone of computer-aided drug design (CADD), but their applicability is constrained by a fundamental trade-off between computational cost and predictive accuracy [35]. These methods, which include force-field-based, empirical, and knowledge-based approaches implemented in docking tools like AutoDock Vina and GOLD, are often computationally intensive and show limited accuracy in binding affinity prediction [28]. A well-documented phenomenon that highlights their weakness is the inter-protein scoring noise problem: while these functions can sometimes enrich active molecules for a specific protein target, they generally fail to identify the correct protein target for a given active molecule due to scoring variations between different binding pockets [9].
This limitation restricts their utility in target identification, a critical step in drug discovery. Furthermore, classical scoring functions often fail on realistic tasks encountered in hit-to-lead optimization, such as reliably ranking the binding affinity of a congeneric series of ligands [35]. While more rigorous methods like FEP offer higher accuracy, their prohibitive computational cost, often requiring hours on supercomputers, makes them unsuitable for high-throughput virtual screening [35] [64]. This left a critical gap in the speed-accuracy landscape: a need for methods that are definitively more accurate than docking but faster than FEP [64].
Graph Neural Networks (GNNs) are a class of deep learning models within the broader deep learning revolution that are uniquely suited for non-Euclidean data [65]. Their rise in the AI research landscape has been spectacular, with the term "Graph Neural Network" consistently ranking in the top 3 keywords for major AI conferences and a striking +447% average annual increase in related publications during 2017-2019 [65].
In the context of molecular modeling, GNNs offer an intuitive and expressive framework. They operate directly on molecular graphs, where atoms are represented as nodes and chemical bonds as edges [63]. This allows GNNs to natively learn complex topological and geometric features of drug-like molecules that would be lost in traditional "rigid" data structures like fixed-size grids or sequences [65]. By performing message-passing operations across the graph, GNNs can capture both the local chemical environments of atoms and the global structure of the molecule, learning a rich hierarchical representation that is immensely valuable for predicting molecular properties and interactions [62] [63].
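The message-passing operation described above can be sketched in a few lines. This toy version (sum-aggregation with no learned weights, purely illustrative) shows how information propagates one bond per step, so that after k steps each atom's representation reflects its k-hop chemical neighbourhood:

```python
def message_pass(node_feats, edges, steps=2):
    """Toy message passing: each node adds the sum of its neighbours'
    feature vectors to its own, once per step. Real GNN layers apply
    learned transformations before and after this aggregation."""
    n = len(node_feats)
    neighbours = {i: [] for i in range(n)}
    for i, j in edges:  # undirected chemical bonds
        neighbours[i].append(j)
        neighbours[j].append(i)

    feats = [list(f) for f in node_feats]
    for _ in range(steps):
        updated = []
        for i in range(n):
            agg = [sum(feats[j][k] for j in neighbours[i])
                   for k in range(len(feats[i]))]
            updated.append([x + a for x, a in zip(feats[i], agg)])
        feats = updated
    return feats

# Three-atom chain (e.g. C-C-O) with a scalar feature marking atom 0.
out = message_pass([[1], [0], [0]], [(0, 1), (1, 2)], steps=1)
```

After one step, atom 1 has already "seen" atom 0's feature, while atom 2 has not; a second step would propagate it across the full chain.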
The impressive benchmark performance reported by many early deep-learning-based scoring functions was, unfortunately, built on a flawed foundation. A critical issue was the widespread train-test data leakage between the primary training database, PDBbind, and the standard evaluation benchmark, the Comparative Assessment of Scoring Function (CASF) [28]. Studies revealed that nearly half (49%) of all CASF complexes had exceptionally similar counterparts in the training set, sharing not only similar ligand and protein structures but also comparable ligand positioning and, unsurprisingly, closely matched affinity labels [28]. This meant models could achieve high benchmark performance through simple memorization rather than a genuine understanding of protein-ligand interactions, severely inflating their perceived generalization capabilities [28].
To address this, researchers introduced PDBbind CleanSplit, a training dataset curated by a novel structure-based filtering algorithm [28]. This algorithm uses a combined assessment of ligand similarity, protein similarity, and ligand positioning to flag pairs of complexes as near-duplicates.
The algorithm eliminates not only train-test data leakage but also redundancies within the training set itself, where nearly 50% of complexes were part of a similarity cluster [28]. This filtering encourages models to learn fundamental principles of binding instead of relying on memorization. The dramatic impact of this curation is clear: when top-performing models like GenScore and Pafnucy were retrained on CleanSplit, their benchmark performance dropped substantially, revealing that their previous high scores were largely driven by data leakage [28].
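The intra-set redundancy removal described here amounts to clustering complexes by pairwise similarity and keeping representatives. Below is a minimal single-linkage sketch (union-find over a similarity graph), with a user-supplied similarity function standing in for the algorithm's actual structure-based criteria:

```python
def similarity_clusters(items, sim, threshold=0.9):
    """Single-linkage clustering: any pair with similarity >= threshold
    is merged into the same cluster via union-find."""
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if sim(items[i], items[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(items)):
        clusters.setdefault(find(i), []).append(items[i])
    return list(clusters.values())

# Toy similarity: complexes "match" if they share a first character.
sim = lambda a, b: 1.0 if a[0] == b[0] else 0.0
clusters = similarity_clusters(["ab", "ac", "bd"], sim)
```

Keeping one representative per cluster is what converts a redundant training set into one that rewards generalization instead of memorization.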
The Graph neural network for Efficient Molecular Scoring (GEMS) is a leading model designed for robust generalization. It leverages a sparse graph modeling of protein-ligand interactions and transfer learning from language models [28]. Architecturally, GEMS integrates protein and ligand representations into a sparse interaction graph that is then processed by GNN layers [28].
A key strength of GEMS is its ability to maintain high benchmark performance when trained on the rigorously curated CleanSplit dataset, suggesting its predictions are based on a genuine understanding of protein-ligand interactions rather than exploiting data leakage [28]. Ablation studies confirmed this, showing the model fails to produce accurate predictions when protein nodes are omitted from the graph [28].
Another advanced architecture is the Atomic Environment Vector–Protein Ligand Interaction Graph (AEV-PLIG) model [35]. As its name indicates, it combines two concepts: atomic environment vectors (AEVs), which encode the local atomic environment around each ligand atom, and protein-ligand interaction graphs (PLIGs), which represent the bound complex as a graph amenable to GNN processing.
AEV-PLIG enhances this by using extended connectivity interaction features (ECIF) for a richer set of 22 distinct protein atom types and employs enhanced GATv2 graph attention layers, which are strictly more expressive than standard GATs [35]. The model is trained using both experimental data and augmented data generated via template-based modelling or molecular docking, which has been shown to significantly improve performance on challenging benchmarks [35].
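GATv2's distinguishing feature is that the attention score is computed from a nonlinearity applied to the combined transform of both endpoint nodes, which is what makes it strictly more expressive than the original GAT. The scalar-feature sketch below illustrates the mechanism only; real layers use vector features, learned weight matrices, and multiple attention heads:

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def gatv2_layer(h, edges, w_left, w_right, a):
    """Single-head, scalar-feature GATv2-style layer:
    score(i, j) = a * leaky_relu(w_left*h_i + w_right*h_j),
    softmax over neighbours, then a weighted sum of transformed
    neighbour features."""
    n = len(h)
    neighbours = {i: [i] for i in range(n)}  # include self-loop
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)

    out = []
    for i in range(n):
        scores = [a * leaky_relu(w_left * h[i] + w_right * h[j])
                  for j in neighbours[i]]
        m = max(scores)  # subtract max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append(sum((w / z) * (w_right * h[j])
                       for w, j in zip(weights, neighbours[i])))
    return out

result = gatv2_layer([1.0, 1.0], [(0, 1)], w_left=1.0, w_right=1.0, a=1.0)
```

Because the nonlinearity sits inside the scoring function, the attention ranking can depend on the query node, which is the "dynamic attention" property the GATv2 authors proved standard GAT lacks.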
For researchers seeking to implement or benchmark these models, the following protocol is essential:
Data Curation: Use the PDBbind CleanSplit methodology to ensure no data leakage exists between training and test sets [28]. This involves structure-based similarity filtering between training and test complexes and removal of redundant similarity clusters within the training set itself.
Benchmarking: Evaluate model performance on multiple independent test sets, including the CASF benchmark suites and out-of-distribution (OOD) test sets designed to penalize ligand or protein memorization [35].
Critical Assessment: The model must be tested on its ability to solve the inter-protein scoring noise problem. A reliable benchmark for this is target identification based on datasets like LIT-PCBA, where the model must correctly identify the target of active molecules by predicting a higher binding affinity compared to decoy targets [9].
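This target-identification test reduces to a simple criterion: for each active molecule, the predicted affinity for its true target must exceed the prediction for every decoy target. A minimal sketch, with the `affinity` callable standing in for any scoring function under evaluation:

```python
def target_id_success(affinity, cases):
    """Fraction of actives whose true target receives a strictly higher
    predicted affinity than every decoy target."""
    hits = 0
    for molecule, true_target, decoy_targets in cases:
        true_score = affinity(molecule, true_target)
        if all(true_score > affinity(molecule, d) for d in decoy_targets):
            hits += 1
    return hits / len(cases)

# Toy scorer: affinity depends only on the target (pure scoring noise).
affinity = lambda mol, target: {"A": 9.0, "B": 6.0, "C": 7.0}[target]
cases = [("mol1", "A", ["B", "C"]), ("mol2", "B", ["A"])]
rate = target_id_success(affinity, cases)
```

The toy scorer illustrates inter-protein scoring noise directly: because target A always scores highest regardless of the molecule, "mol2" fails the test, which is exactly the failure mode LIT-PCBA-style benchmarks expose.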
The table below summarizes the performance of modern GNN-based scoring functions compared to traditional methods, highlighting the new benchmarks being set in the field.
Table 1: Performance Comparison of Scoring Functions for Binding Affinity Prediction
| Method Category | Representative Model | Key Benchmark Performance (Pearson R / RMSE) | Computational Speed | Key Strengths |
|---|---|---|---|---|
| Classical Docking | AutoDock Vina [28] | Limited accuracy [28] | ~1 minute (CPU) [64] | Fast, high-throughput |
| Alchemical Methods | FEP+ [35] | PCC: ~0.68 on congeneric series [35] | >12 hours (GPU cluster) [64] | High accuracy, gold standard |
| ML Scoring (with data leakage) | Pre-CleanSplit Models [28] | Inflated metrics (e.g., R>0.8) [28] | Minutes (GPU) | Fast, previously high benchmark scores |
| Advanced GNNs (no leakage) | GEMS (on CleanSplit) [28] | Maintains high performance on independent tests [28] | Minutes (GPU) | Generalizes to unseen complexes |
| Advanced GNNs (with augmented data) | AEV-PLIG (on FEP benchmark) [35] | PCC: 0.59 (vs. 0.41 without augmentation) [35] | ~400,000x faster than FEP [35] | Accurate & fast on lead-optimization tasks |
The performance of GNNs is particularly notable on tasks critical for drug discovery. For example, AEV-PLIG shows how leveraging augmented data can drastically improve the ranking of congeneric ligands, with Kendall's τ (a rank correlation metric) increasing from 0.26 to 0.42, closing the gap with FEP+ (Kendall's τ of 0.49) [35]. This demonstrates that GNNs are beginning to address the real-world need for accurately prioritizing compounds during lead optimization.
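Kendall's τ, the rank-correlation metric quoted above, counts concordant versus discordant pairs of ligands between the predicted and experimental affinity orderings. A minimal tau-a implementation (no tie correction):

```python
def kendall_tau(x, y):
    """Kendall's tau-a over all pairs: (concordant - discordant) / n_pairs.
    Ties are ignored, so this matches tau-a, not the tie-corrected tau-b."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A τ of 0.42 for AEV-PLIG versus 0.49 for FEP+ thus means both methods order most congeneric pairs correctly, with FEP+ misordering only slightly fewer.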
Table 2: Key Resources for GNN-Based Affinity Prediction Research
| Resource Name | Type | Function & Application in Research |
|---|---|---|
| PDBbind CleanSplit [28] | Dataset | Curated training set free of data leakage; enables genuine evaluation of model generalizability. |
| CASF Benchmark [28] [35] | Benchmarking Suite | Standard set for comparative assessment of scoring functions; use with CleanSplit protocol. |
| OOD Test Set [35] | Benchmarking Suite | Realistic out-of-distribution test set designed to penalize ligand/protein memorization. |
| LIT-PCBA Target ID Benchmark [9] | Benchmarking Suite | Tests model's ability to identify the correct protein target for active molecules. |
| Graph Neural Network Frameworks | Software | Libraries like PyTorch Geometric and DGL for building GNN models like GEMS and AEV-PLIG. |
| Molecular Dynamics Trajectories | Data | Used for data augmentation (as in AEV-PLIG) to increase data diversity and model robustness [35]. |
| Template-Based Modelling & Docking | Software/Algorithm | Tools to generate augmented synthetic protein-ligand complex structures for training [35]. |
Graph Neural Networks, when trained on rigorously curated datasets and evaluated on demanding benchmarks, are unequivocally setting a new gold standard for binding affinity prediction. They are successfully addressing the long-standing limitations of classical scoring functions, particularly the inter-protein scoring noise problem and the inability to accurately rank congeneric series [9] [35]. By mitigating data leakage through protocols like PDBbind CleanSplit, the field can now have greater confidence in reported performance metrics [28].
The integration of advanced architectures like GEMS and AEV-PLIG with strategies such as transfer learning and data augmentation is producing models that are beginning to narrow the performance gap with high-end computational methods like FEP [28] [35]. The ability of these models to provide FEP-level correlation on lead optimization tasks while being hundreds of thousands of times faster represents a monumental leap forward [35]. This opens the door for their practical application in drug discovery pipelines, from powering generative AI models that design new protein-ligand interactions to enabling high-throughput virtual screening with unprecedented accuracy. As these tools continue to evolve, they promise to significantly accelerate the pace of drug discovery, reducing development costs and late-stage failures [62].
The limitations of classical scoring functions are fundamental, stemming from their rigid, pre-defined functional forms and inability to fully capture the complex physics of molecular binding. As a result, they have hit a persistent performance plateau in critical tasks like binding affinity prediction and virtual screening. The path forward is clearly charted by data-driven, machine-learning approaches, which circumvent these limitations by learning the functional form directly from large-scale structural and interaction data. However, the success of these next-generation models hinges on addressing new challenges, particularly the critical need for rigorously curated, non-redundant benchmark datasets free of data leakage. Future progress will depend on a synergistic combination of improved physical modeling, advanced deep learning architectures like graph neural networks, and a renewed focus on robust, generalizable validation practices. This evolution promises to deliver more accurate and reliable computational tools, ultimately accelerating the discovery of new therapeutics and deepening our understanding of biomolecular interactions.