Accurate prediction of protein-ligand binding affinity is a cornerstone of computer-aided drug discovery, yet the performance of classical scoring functions has remained stagnant. This article provides a comprehensive analysis for researchers and drug development professionals on the fundamental and methodological limitations of classical force-field-based, empirical, and knowledge-based scoring functions. We explore their rigid functional forms, inadequate treatment of key physical forces, and failure to generalize across diverse protein families. The content further details how these limitations manifest in practical applications like virtual screening and lead optimization, examines current challenges in model validation due to dataset biases, and synthesizes emerging solutions. By contrasting classical approaches with modern machine-learning-based alternatives, this review offers a clear-eyed perspective on the path toward more reliable and predictive computational tools for drug design.
In the fields of computational chemistry and molecular modelling, scoring functions are mathematical functions used to approximately predict the binding affinity between two molecules after they have been docked [1]. The drug discovery process is notoriously expensive and time-consuming, and structure-based virtual screening (VS) has become a widely used approach to triage unpromising compounds early in the pipeline [2]. By predicting the binding mode and affinity of a small molecule within the binding site of a target protein, molecular docking helps researchers understand key properties related to the binding process [2]. Fast evaluation of docking poses and accurate prediction of binding affinity are essential in these protocols; scoring functions, being straightforward and fast, remain the main option for VS experiments despite their limited accuracy [2]. This technical guide details the historical triad of classical scoring functions, framing their development within the context of their inherent limitations for binding affinity prediction.
Scoring functions are typically divided into three main classes: force field-based, empirical, and knowledge-based [2] [1]. Although large-scale comparative assessments are relatively rare, the strengths and limitations of current functions are fairly evident: they generally perform well in reproducing binding modes but struggle to accurately quantify binding affinities or the effects of small structural changes [3]. The following sections and Table 1 provide a detailed breakdown of each class.
Table 1: Comparative Overview of Classical Scoring Function Types
| Function Class | Theoretical Basis | Energy Terms/Descriptors | Parameterization Method | Representative Examples |
|---|---|---|---|---|
| Force Field-Based [1] | Molecular mechanics, classical force fields | Sum of intermolecular van der Waals and electrostatic energies; sometimes includes internal ligand strain energy and implicit solvation (GBSA/PBSA) [1]. | Parameters derived from fundamental physical chemistry and quantum mechanics calculations [4]. | DOCK [2], DockThor [2] |
| Empirical [2] | Linear Free Energy Relationships | Weighted sum of physicochemical terms: hydrophobic contacts, hydrogen bonds, rotatable bonds immobilized, etc. [1]. | Multiple linear regression (MLR) or machine learning on datasets of protein-ligand complexes with known affinities [2] [5]. | LUDI [2], ChemScore [2], GlideScore [2] |
| Knowledge-Based [1] | Inverse Boltzmann statistics from structural databases | Pairwise atom-atom "potentials of mean force" derived from observed contact frequencies in databases like the PDB [2] [1]. | Statistical analysis of large 3D structural databases (e.g., PDB, Cambridge Structural Database) [1]. | DrugScore [2], PMF [2] |
Force field-based scoring functions estimate affinities by summing the strength of intermolecular interactions, primarily van der Waals and electrostatic terms, using a molecular mechanics force field [1]. The intramolecular energies (strain energy) of the binding partners are also frequently included [1]. Since binding occurs in aqueous solution, a critical consideration is the treatment of solvation effects, which can be incorporated using implicit solvation models such as GBSA or PBSA [1]. The parameters for these functions are derived from fundamental physical chemistry and quantum mechanics calculations, rather than from fitting to binding affinity data [4]. Popular force fields that provide parameters for small molecules include the General AMBER Force Field (GAFF), the CHARMM General Force Field (CGenFF), and those in the OPLS and GROMOS families [4].
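To make this summation concrete, the sketch below evaluates a minimal force-field-style interaction score from a 12-6 Lennard-Jones term and a Coulomb term with a constant dielectric. The functional form is standard, but the combining rules, units, and parameter values here are illustrative rather than those of any particular program or force field:

```python
import math

def pair_energy(r, epsilon, sigma, q_i, q_j, dielectric=4.0):
    """Pairwise force-field-style energy: 12-6 Lennard-Jones van der Waals
    term plus a Coulomb term with a constant dielectric. Units are
    illustrative (kcal/mol, Angstrom, elementary charges)."""
    sr6 = (sigma / r) ** 6                          # (sigma/r)^6
    e_vdw = 4.0 * epsilon * (sr6 * sr6 - sr6)       # 12-6 Lennard-Jones
    e_elec = 332.06 * q_i * q_j / (dielectric * r)  # Coulomb, kcal/mol units
    return e_vdw + e_elec

def interaction_score(ligand_atoms, protein_atoms):
    """Sum pair energies over all ligand-protein atom pairs.
    Atoms are (x, y, z, epsilon, sigma, charge) tuples."""
    total = 0.0
    for xi, yi, zi, ei, si, qi in ligand_atoms:
        for xj, yj, zj, ej, sj, qj in protein_atoms:
            r = math.dist((xi, yi, zi), (xj, yj, zj))
            eps = math.sqrt(ei * ej)   # Lorentz-Berthelot combining rules
            sig = 0.5 * (si + sj)
            total += pair_energy(r, eps, sig, qi, qj)
    return total
```

A convenient sanity check: for a pair of neutral atoms the Lennard-Jones term is zero at r = σ and reaches its minimum of -ε at r = 2^(1/6)σ.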
Empirical scoring functions are founded on the idea that the binding free energy can be correlated to a set of descriptors capturing key interactions involved in binding [2]. These functions take the form of a weighted sum of physicochemical terms, such as hydrophobic contacts, hydrogen bonds, and the number of rotatable bonds immobilized upon complex formation [1]. The central methodology involves using a dataset composed of three-dimensional structures of diverse protein-ligand complexes with associated experimental binding affinity data [2]. The coefficients (weights) of the functional terms are then obtained through regression analysis, traditionally using multiple linear regression (MLR), to calibrate the model and establish a relationship between the descriptors and the experimental affinity [2] [5]. The first empirical scoring function, LUDI, was developed by Böhm, pioneering this approach for predicting binding free energy [2].
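A minimal sketch of this calibration step is shown below. All descriptor values and affinities are synthetic and generated from an assumed underlying linear relation purely for illustration; ordinary least squares stands in for a production MLR pipeline:

```python
import numpy as np

# Hypothetical descriptor matrix: each row is one protein-ligand complex,
# columns are (n_hydrogen_bonds, hydrophobic_contact_area, n_frozen_rotors).
X = np.array([
    [3, 120.0, 2],
    [1,  80.0, 4],
    [5, 200.0, 1],
    [2, 150.0, 3],
    [4,  95.0, 5],
], dtype=float)
# "Experimental" affinities (pKd-like) -- synthetic, generated from an
# assumed linear relation so the regression can recover it exactly.
y = np.array([5.7, 3.7, 8.4, 5.7, 5.4])

# Append a constant column so the regression also fits an intercept, then
# solve the least-squares problem min ||X1 w - y||^2 (classical MLR).
X1 = np.hstack([X, np.ones((X.shape[0], 1))])
w, *_ = np.linalg.lstsq(X1, y, rcond=None)

def predict_affinity(descriptors):
    """Score a new complex as the calibrated weighted sum of its terms."""
    return float(np.dot(np.append(descriptors, 1.0), w))
```

The fitted weights `w` play the role of the regression coefficients that calibrate each physicochemical term against experimental affinity.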
Knowledge-based scoring functions, also known as potentials of mean force, are based on the statistical analysis of interacting atom pairs from a large database of experimentally determined protein-ligand complexes [2] [1]. The underlying principle is that intermolecular interactions between certain types of atoms that occur more frequently than expected in a random distribution are likely to be energetically favorable [1]. These observed frequencies are converted into a pseudopotential that describes the preferred geometries and interactions for protein-ligand atom pairs [2]. The resulting scoring function thus captures the implicit knowledge of molecular recognition "learned" from the structural data in repositories like the Protein Data Bank (PDB) [1].
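The inverse Boltzmann conversion at the heart of this approach can be sketched as follows. The binning, the choice of reference state, and the constant cap for empty bins are simplifying assumptions made for illustration; real implementations differ in exactly these choices:

```python
import math

KT = 0.593  # kcal/mol at roughly 298 K

def pmf_from_counts(observed, reference):
    """Convert per-distance-bin contact counts for one atom-type pair into
    a potential of mean force via the inverse Boltzmann relation:
        u(r) = -kT * ln(g_obs(r) / g_ref(r)).
    Bins with no observations get a constant repulsive cap (an assumption)."""
    n_obs = sum(observed)
    n_ref = sum(reference)
    potential = []
    for o, ref in zip(observed, reference):
        if o == 0 or ref == 0:
            potential.append(3.0)  # arbitrary unfavorable value
        else:
            g = (o / n_obs) / (ref / n_ref)  # normalized frequency ratio
            potential.append(-KT * math.log(g))
    return potential
```

A bin where a contact occurs twice as often as the reference expects yields a favorable (negative) potential of -kT ln 2; a bin depleted relative to the reference yields a positive one, exactly the intuition stated above.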
The development and validation of scoring functions, particularly empirical ones, follow a structured protocol. The general workflow for developing an empirical scoring function, as detailed in recent literature [2] [5], is outlined below.
Empirical Scoring Function Development Workflow
As defined by Pason and Sotriffer, the development of an empirical scoring function requires three core components [2] [5]:
Descriptors: A set of descriptors that quantitatively describe the binding event. These are typically structural and physicochemical features derived from the 3D complex, such as hydrogen-bond counts, hydrophobic contact areas, and the number of rotatable bonds immobilized upon complex formation.
Training Dataset: A curated dataset of three-dimensional structures of protein–ligand complexes, each associated with reliable experimental binding affinity data (e.g., Kd, Ki, IC50). The diversity and quality of this dataset are crucial for the generalizability of the resulting model [2].
Regression/Classification Algorithm: A statistical or machine-learning algorithm to calibrate the model by establishing a relationship between the descriptors and the experimental affinity. Classical methods use multiple linear regression (MLR), but recent efforts increasingly employ sophisticated machine-learning techniques like Random Forests (RF) or Support-Vector Machines (SVM) to capture potential non-linear relationships [2] [5].
To assess a scoring function's practical utility, it is rigorously tested on three distinct tasks, which also represent its primary goals in a docking workflow [2]: predicting the native binding pose (docking power), ranking ligands by binding affinity (scoring and ranking power), and discriminating active from inactive molecules in virtual screening (screening power).
The development and application of classical scoring functions rely on a suite of computational tools and data resources. The table below catalogs key "research reagents" essential for work in this field.
Table 2: Essential Research Reagents and Resources for Scoring Function Research
| Resource Name | Type | Primary Function / Application | Key Features / Notes |
|---|---|---|---|
| Protein Data Bank (PDB) [1] | Database | Primary repository for experimentally determined 3D structures of proteins and protein-ligand complexes. | Essential for training knowledge-based functions and for benchmarking pose prediction; provides structural data for empirical function development. |
| Cambridge Structural Database (CSD) [1] | Database | Repository for experimentally determined small-molecule organic and metal-organic crystal structures. | Used in knowledge-based function development to derive statistical potentials for intermolecular interactions. |
| AutoDock Vina [2] | Docking Software | Widely used molecular docking program that includes its own scoring function. | Employs a hybrid scoring function; commonly used as a platform for testing and validating new scoring methods. |
| Glide (Schrödinger) [2] | Docking Software | Commercial docking program with the empirical GlideScore function. | Known for its high accuracy in pose prediction; often used as a benchmark in performance comparisons. |
| GOLD [2] | Docking Software | Docking software using a genetic algorithm for pose exploration and its own empirical scoring function. | Supports multiple scoring functions; widely used in virtual screening campaigns. |
| DOCK [2] | Docking Software | One of the earliest docking programs, using a force field-based scoring function. | Allows for explicit consideration of solvent and user-defined scoring terms. |
| GAFF / GAFF2 [4] | Force Field | General AMBER Force Field for small molecules. | Provides parameters for force field-based scoring and molecular dynamics simulations; compatible with AMBER protein FFs. |
| CGenFF [4] | Force Field | CHARMM General Force Field for small molecules. | Provides parameters for a wide range of drug-like molecules within the CHARMM force field ecosystem. |
| OPLS3e [4] | Force Field | Optimized Potentials for Liquid Simulations force field. | Includes extensive parameters for drug-like compounds and a ligand-specific charge model; implemented in Schrödinger software. |
The traditional triad of scoring functions, while foundational, faces profound challenges that limit their predictive accuracy, particularly for binding affinity. A core limitation is the simplified treatment of entropy and solvent effects [2]. While some empirical functions include a term for conformational entropy based on rotatable bonds, this is a crude approximation. Furthermore, the explicit and dynamic role of water molecules in binding, which can be crucial for affinity and specificity, is often poorly captured [2] [3].
Another fundamental issue is the inherent difficulty of the parameterization process. The development of empirical and knowledge-based functions is intrinsically linked to the quality, size, and diversity of the experimental data used for training. Inconsistencies in experimental data and the limited coverage of chemical and target space in current datasets can lead to functions that do not generalize well [2] [5]. The approximations used by these functions suggest that the best available classical functions may be close to the limit of what can be achieved with these empirical approaches [3].
The field is now moving beyond the classical triad. The most significant trend is the shift towards machine-learning (ML) and deep-learning (DL) based scoring functions [2] [1] [6]. Unlike classical functions, ML-based models do not assume a predetermined functional form, allowing them to infer complex, non-linear relationships directly from data. These methods have consistently been found to outperform classical functions at binding affinity prediction for diverse protein-ligand complexes [1]. Recent advances also include integrating knowledge-guided pre-training strategies that incorporate additional semantic information, such as molecular descriptors and fingerprints, to learn more robust molecular representations, significantly improving predictive performance [6]. Furthermore, efforts are underway to incorporate more sophisticated physics, such as explicit polarization and quantum mechanical effects, and to develop more automated and intelligent parameterization toolkits for force fields [4]. This evolution points toward a future of hybrid models that leverage the strengths of data-driven learning while respecting the physical principles that govern molecular recognition.
Research Directions: Overcoming Classical Limitations
In computational drug discovery, the accurate prediction of drug-target binding affinity is a cornerstone for identifying and optimizing lead compounds. For decades, this field has been dominated by classical scoring functions—mathematical models that estimate binding strength using predetermined equations with fixed functional forms [7]. These models typically express the binding free energy (ΔG) as a weighted sum of physicochemically-inspired terms, such as van der Waals forces, electrostatic interactions, hydrogen bonding, and desolvation penalties [7]. While this approach benefits from interpretability and computational efficiency, its inherent rigidity fundamentally limits accuracy and flexibility. The reliance on a fixed architecture, where the mathematical relationship between variables is defined a priori by the researcher, fails to capture the complex, non-linear, and context-dependent nature of molecular recognition. This whitepaper examines the technical limitations imposed by these rigid functional forms, quantifies their performance shortcomings, and explores emerging methodologies that promise to overcome these constraints through more flexible, data-driven approaches to affinity prediction.
Classical scoring functions for binding affinity prediction are historically categorized into physics-based, empirical, and knowledge-based approaches, though the boundaries are often blurred [7]. A typical physics-based scoring function, for instance, often adopts a functional form akin to:
ΔG(binding) = ΔE(VdW) + ΔE(el) + ΔE(H-bond) + ΔG(solv) [7]
In this predefined equation, each term represents a specific type of interaction: van der Waals (ΔE(VdW)), electrostatic (ΔE(el)), hydrogen bonding (ΔE(H-bond)), and solvation free energy (ΔG(solv)). The model's final form is a linear combination of these components. Similarly, empirical functions fit coefficients to these terms using experimental binding data, while knowledge-based functions derive potentials of mean force from structural databases. The critical shared limitation is not necessarily the choice of terms but the fixed combinatorial rule—the assumption that the total binding energy can be expressed as a simple, weighted sum of independent contributions. This form cannot capture synergistic or emergent effects between different interaction types, leading to an oversimplified representation of the highly cooperative and complex process of molecular binding.
The reliance on predetermined equations introduces several fundamental technical constraints that curtail predictive accuracy: the additive form admits no cross-terms, so synergistic or anti-cooperative effects between interaction types are invisible to the model; the functional form is fixed a priori, so complexes that violate its modeling assumptions are systematically mis-scored; and the small number of fitted coefficients caps the model's capacity, preventing it from generalizing across structurally diverse targets.
The limitations of rigid functional forms become starkly evident when their performance is quantitatively compared with more flexible, data-driven methods on benchmark tasks. The following table synthesizes key performance metrics from comparative studies, highlighting the accuracy gap.
Table 1: Quantitative Performance Comparison of Scoring Function Paradigms
| Model Category | Representative Example | Key Functional Form Characteristic | Reported Performance | Primary Limitation Illustrated |
|---|---|---|---|---|
| Classical Scoring Function | Physics-Based/ Empirical SFs [7] | Linear combination of pre-defined energy terms. | Lower accuracy, struggles with target identification [9] | Inability to generalize across diverse protein targets. |
| Machine Learning Model | Random Forest (RF) on molecular vibrations [10] [11] | Ensemble of decision trees; non-linear, data-derived rules. | R² > 0.94 for affinity prediction [10] [11] | Highlights the predictive power of flexible, non-parametric models. |
| Symbolic Regression (SR) | SR-derived interatomic potentials [8] | Equation discovered via RL/MCTS; no pre-defined form. | Outperformed Sutton-Chen EAM potentials [8] | Demonstrates that discovered equations can be both accurate and interpretable. |
| Deep Learning (DL) | Boltz-2 & other DL SFs [9] [7] | Multi-layer neural networks; highly non-linear function approximators. | Approaches FEP performance in some domains [9] | Struggles with generalization/memorization on target ID benchmarks [9]. |
A critical benchmark known as the "inter-protein scoring noise problem" further exposes the weakness of classical functions. While these functions can sometimes enrich active molecules for a single specific target, they generally fail to identify the correct protein target for a given active molecule due to scoring variations between different binding pockets [9]. A truly robust affinity prediction method must perform both tasks reliably, a hurdle that rigid forms have not yet cleared.
This case study demonstrates how moving beyond fixed forms can improve accuracy even in a closely related field—material science—providing a template for drug discovery.
Experimental Protocol: The methodology for developing Symbolic Regression (SR)-derived potentials involves a multi-step, data-driven workflow [8].
Key Workflow Diagram: The following diagram illustrates the contrast between the classical and SR approaches to model development.
This study directly addresses drug-target affinity (DTA) prediction and showcases a high-performing machine learning model that bypasses classical rigid forms.
Experimental Protocol: The detailed methodology for constructing the quantitative prediction model is described in the source publications [10] [11].
Key Workflow Diagram: The holistic "whole system" approach is visualized below.
Table 2: Key Research Reagents and Computational Tools for Advanced Affinity Prediction
| Item / Resource | Function / Purpose | Relevance to Overcoming Rigid Forms |
|---|---|---|
| PaDEL-Descriptor [10] [11] | Software to calculate a comprehensive set of molecular descriptors from chemical structure. | Enables featurization based on holistic molecular properties (e.g., vibrations) rather than pre-defined interaction terms. |
| Density Functional Theory (DFT) [8] | Ab initio quantum mechanical method for calculating electronic structure. | Provides high-quality, quantum-accurate training data for developing and validating more flexible models like SR potentials. |
| Random Forest Algorithm [10] [11] | A machine learning method that constructs multiple decision trees for regression or classification. | Provides a powerful, non-parametric alternative to linear models, capable of capturing complex non-linearities without a fixed equation. |
| Reinforcement Learning (RL) & MCTS [8] | A search strategy for exploring large combinatorial spaces (e.g., of mathematical expressions). | The core engine in symbolic regression that allows for the discovery of novel, interpretable functional forms directly from data. |
| Benchmark Datasets (Kd, EC50) [10] [11] | Curated datasets of drug-target pairs with experimentally measured binding affinities. | Essential for training and fairly evaluating the performance of new, flexible models against classical baselines. |
| LIT-PCBA Benchmark Set [9] | A demanding benchmark set designed for evaluating target identification capability. | Tests generalizability—a key weakness of rigid functions—by requiring models to rank affinities across different proteins. |
The evidence is compelling: the rigid functional forms underpinning classical scoring functions constitute a significant bottleneck in the pursuit of accurate, generalizable, and predictive models for binding affinity. Their inability to capture the complex, non-linear physics of molecular interactions inherently limits their accuracy and domain of applicability, as quantified by their struggle with the inter-protein scoring noise problem [9] [7]. Emerging paradigms, including machine learning models that leverage holistic molecular descriptors [10] [11] and symbolic regression that discovers physically interpretable equations directly from data [8], demonstrate a clear path forward. These approaches reject the constraint of predetermined equations in favor of flexibility and data-driven discovery. For the field of computational drug discovery to advance, the research community must increasingly embrace these flexible modeling paradigms, fostering a shift from assuming the form of the solution to letting high-quality data and intelligent algorithms reveal it.
Classical scoring functions are pivotal tools in structure-based drug design, tasked with predicting the binding affinity of a small molecule to a target protein. Despite their long-standing utility, their predictive accuracy has plateaued, largely due to two fundamental omissions: the inadequate treatment of solvation effects and protein flexibility [12] [13]. These molecular phenomena are central to the process of binding, yet classical approaches handle them through drastic simplifications that limit their realism and predictive power. This review delineates how these shortcomings have constrained the reliability of affinity prediction and surveys the emerging computational strategies that are beginning to redress these gaps, thereby framing the limitations within the broader thesis on the evolution of scoring function research.
Classical scoring functions are broadly categorized as force-field, empirical, or knowledge-based [14]. Regardless of type, they share a common methodological constraint: the imposition of a predetermined, theory-inspired functional form for the relationship between the variables characterizing the protein-ligand complex and the predicted binding affinity [12]. This rigid approach leads to poor predictivity for complexes that do not conform to the underlying modeling assumptions. Furthermore, for the sake of computational efficiency, these functions employ a minimal description of protein flexibility and an implicit treatment of solvent, ignoring the dynamic and solvation-driven nature of the binding process [12]. The following sections will dissect the specific challenges posed by solvation and flexibility and detail how modern approaches are integrating them into a new generation of predictive models.
Solvation effects play a critical role in determining the binding free energy in protein-ligand interactions [14]. When a ligand binds to a protein, it undergoes a desolvation process, whereby water molecules are displaced from both the ligand's and the protein's binding site. This process involves a complex balance of energetic contributions: the screening of electrostatic interactions by water, the hydrophobic effect for nonpolar atoms, and the hydrophilic effect for polar groups [14]. Classical scoring functions often neglect these contributions entirely or account for them through oversimplified terms, such as a simple solvent-accessible surface area (SASA)-based energy term, which fails to capture the nuanced physics of water-mediated interactions [14].
The inherent challenge in incorporating solvation is the parameterization of pairwise potentials, solvation, and entropy, which belong to different energetic categories [14]. Consequently, despite the recognized importance of solvation in ligand binding, most classical knowledge-based scoring functions do not explicitly include its contributions, partly due to the difficulty in deriving the corresponding pair potentials and the resulting double-counting problem [14]. This omission represents a significant source of error in binding affinity predictions.
Recent research has developed novel computational models to explicitly include solvation and entropic effects. One prominent method involves an iterative approach to simultaneously derive effective pair potentials and atomic solvation parameters [14]. The binding energy score is expressed as:
ΔG_bind = ∑_ij u_ij(r) + ∑_i σ_i·ΔSA_i

where u_ij(r) is the pair potential between atom types i and j, σ_i is the solvation parameter for atom type i, and ΔSA_i is the change in the solvent-accessible surface area of atom i upon binding [14]. The solvation parameters σ_i are iteratively improved by comparing the predicted and observed SASA changes in the training-set complexes, effectively learning the solvation contribution from the data itself [14].
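A schematic evaluation of a score with this shape (a sum of pair potentials plus SASA-weighted solvation terms) might look like the following; the potential, atom types, and parameter values are toy stand-ins for the iteratively derived ones, not the published ITScore/SE parameters:

```python
def itscore_se_like(pairs, pair_potential, delta_sasa, solvation_params):
    """Score = sum of pair potentials over interacting atom pairs
    plus a SASA-weighted atomic solvation term. All inputs illustrative."""
    energy = sum(pair_potential(ti, tj, r) for (ti, tj, r) in pairs)
    solvation = sum(solvation_params[t] * dsa for (t, dsa) in delta_sasa)
    return energy + solvation

# Toy stand-ins for the derived parameter tables.
def toy_potential(ti, tj, r):
    """Crude square-well pair potential: favorable within 4 Angstrom."""
    return -0.5 if r < 4.0 else 0.0

params = {"C": 0.012, "N": -0.060, "O": -0.045}  # kcal/(mol*A^2), invented

score = itscore_se_like(
    pairs=[("C", "C", 3.5), ("N", "O", 2.9), ("C", "O", 5.0)],
    pair_potential=toy_potential,
    delta_sasa=[("C", -40.0), ("O", -15.0)],  # buried area is negative
    solvation_params=params,
)
```

Note the sign convention in the toy numbers: burying nonpolar carbon surface (positive σ, negative ΔSA) contributes favorably, while burying polar oxygen surface is penalized, mirroring the hydrophobic/hydrophilic balance described above.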
Another approach is seen in the development of physics-based scoring functions like DockTScore, which incorporate optimized terms for solvation and lipophilic interactions, moving beyond simplistic models to better represent the protein-ligand recognition process [15]. Similarly, machine-learning scoring functions circumvent the need for a predetermined functional form, allowing the collective effect of solvation and other interactions to be implicitly inferred from large experimental datasets [12].
Table 1: Computational Methods for Incorporating Solvation Effects
| Method Name | Underlying Approach | Key Solvation Terms | Reported Performance |
|---|---|---|---|
| ITScore/SE [14] | Knowledge-based with iterative parameter fitting | SASA-based energy term with atomic solvation parameters | R² = 0.76 on validation set of 77 complexes |
| DockTScore [15] | Empirical, physics-based with machine learning | Optimized solvation and lipophilic interaction terms | Competitive performance on DUD-E datasets |
| Machine-Learning SFs [12] | Data-driven, non-linear regression | Implicitly learned from comprehensive feature sets | Outperform classical SFs in binding affinity prediction |
Protein flexibility stands out as one of the most important and challenging issues for binding mode prediction in molecular docking [13]. Proteins are dynamic entities that undergo continuous conformational changes of varying magnitudes, which are essential for biological processes like molecular recognition [16] [17]. However, classical docking tools and their embedded scoring functions often treat the protein receptor as a rigid body, an approximation that fails to capture the induced-fit and conformational selection mechanisms that frequently characterize binding [13].
The major limitation of treating proteins as rigid is the failure to account for the conformational entropy contribution to the binding free energy and the structural rearrangements that can open or close binding pockets [12] [13]. This simplification is primarily driven by the astronomical computational cost associated with sampling the full conformational space of a protein during docking. As a result, the reliability of structure-based affinity prediction is severely compromised for targets that undergo significant structural changes upon ligand binding [13].
A variety of conformational sampling methods have been proposed to tackle the challenge of protein flexibility, ranging from techniques that account for local binding-site sidechain rearrangements to those that model full protein flexibility [13].
The best choice of method depends heavily on the system under study and the intended application; there is always a trade-off between computational cost and the degree of flexibility modeled [13].
Diagram 1: Computational workflows for incorporating protein flexibility in docking. Methods branch from a single input structure and converge on producing improved docking poses, which are suitable for different applications.
The limitations of classical scoring functions have catalyzed a shift towards machine-learning scoring functions (ML-SFs) [12] [18]. Unlike classical functions that assume a predetermined functional form (e.g., linear regression with a small number of expert-selected features), ML-SFs use non-linear regression models to infer the functional form directly from the data [12]. This data-driven approach allows ML-SFs to exploit very large volumes of structural and interaction data effectively, capturing complex, non-additive interactions that are hard to model explicitly.
The performance gap between classical and machine-learning SFs is significant and is expected to widen as more training data becomes available [12]. For instance, the ML-SF RF-Score-VS demonstrated a dramatic improvement in virtual screening performance: its top 0.1% of molecules achieved an 88.6% hit rate, compared to just 27.5% for Vina [18]. In binding affinity prediction, RF-Score-VS also substantially outperformed Vina, with Pearson correlations of 0.56 and -0.18, respectively [18]. Other deep learning models, such as DAAP, which uses distance-based features and attention mechanisms, have achieved state-of-the-art performance, with a Pearson correlation of 0.909 on the CASF-2016 benchmark [19].
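The featurization behind RF-Score is intuitive to sketch: count the occurrences of each protein-ligand element pair within a distance cutoff and hand the resulting count vector to a non-linear regressor such as a Random Forest. The version below is a simplified reading of that idea (plain element symbols, a single cutoff) rather than a faithful reimplementation:

```python
import math
from collections import Counter

def rf_score_features(ligand, protein, cutoff=12.0):
    """RF-Score-style featurization sketch: count occurrences of each
    (ligand-element, protein-element) pair within a distance cutoff.
    Atoms are (element, x, y, z) tuples; the count vector would then be
    fed to a non-linear regressor trained on known affinities."""
    counts = Counter()
    for el_l, xl, yl, zl in ligand:
        for el_p, xp, yp, zp in protein:
            if math.dist((xl, yl, zl), (xp, yp, zp)) <= cutoff:
                counts[(el_l, el_p)] += 1
    return counts
```

Because the regressor infers how these counts relate to affinity directly from data, no functional form for the interaction terms needs to be assumed, which is precisely the departure from classical scoring functions discussed above.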
A promising trend is the development of hybrid scoring functions that integrate precise, physics-based descriptors with powerful machine-learning regression algorithms. The DockTScore suite of functions is a prime example, which explicitly accounts for physics-based terms—including optimized MMFF94S force-field terms, solvation and lipophilic interactions, and an improved ligand torsional entropy estimate—combined with machine learning models like Support Vector Machine (SVM) and Random Forest (RF) [15]. This approach aims to retain the physical interpretability of the interaction terms while leveraging the ability of machine learning to model complex, non-linear relationships, thereby avoiding the over-optimistic accuracy estimates sometimes associated with purely black-box models [15].
Table 2: Comparison of Scoring Function Performance on Benchmark Tasks
| Scoring Function Type | Example | Virtual Screening Hit Rate (Top 1%) | Binding Affinity Prediction (Pearson R) | Key Advantages |
|---|---|---|---|---|
| Classical SF | Vina | 16.2% [18] | -0.18 [18] | Speed, simplicity |
| Machine-Learning SF | RF-Score-VS | 55.6% [18] | 0.56 [18] | Handles large datasets, non-linearity |
| Deep Learning SF | DAAP | N/A | 0.909 [19] | Captures complex interactions directly from structure |
| Physics-Based ML SF | DockTScore (MLR) | Competitive on DUD-E [15] | Competitive on core set [15] | Balance of physical interpretability and accuracy |
The iterative method for developing the ITScore/SE knowledge-based scoring function provides a clear protocol for integrating solvation and entropy [14].
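The iterative refinement idea can be caricatured as follows. This loop is a generic error-feedback scheme written for illustration only, not the published ITScore/SE algorithm, and the training data are synthetic:

```python
def iterative_fit(complexes, sigma, learning_rate=1e-4, n_iter=200):
    """Schematic parameter-refinement loop (an assumption, not the published
    protocol): repeatedly score the training complexes, compare with the
    experimental affinities, and nudge each atom type's solvation parameter
    to shrink the discrepancy. Each complex is a tuple of
    (delta_sasa_by_type, precomputed_pair_energy, experimental_dG)."""
    for _ in range(n_iter):
        for dsa, e_pair, dg_exp in complexes:
            predicted = e_pair + sum(sigma[t] * a for t, a in dsa.items())
            error = predicted - dg_exp
            # Move each parameter against its contribution to the error
            for t, a in dsa.items():
                sigma[t] -= learning_rate * error * a
    return sigma
```

On a single synthetic complex with pair energy -2.0, ΔSA of -50 Å² for carbon, and experimental ΔG of -5.0, the loop converges to the unique σ that closes the gap, illustrating how the solvation contribution is "learned" from the residual rather than imposed.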
Table 3: Key Resources for Advanced Scoring Function Development
| Resource Name | Type | Function in Research |
|---|---|---|
| PDBbind [12] [15] | Database | A comprehensive, curated database of protein-ligand complexes with binding affinity data, used for training and benchmarking scoring functions. |
| DUD-E [18] | Benchmark Dataset | "Directory of Useful Decoys: Enhanced" provides benchmark sets for virtual screening, containing known actives and property-matched decoys for many targets. |
| CASF Benchmark [19] | Benchmark Suite | A standardized benchmark for evaluating scoring functions on core tasks like binding affinity prediction, pose prediction, and virtual screening. |
| ATLAS [16] | MD Simulation Database | A database of standardized all-atom molecular dynamics simulations, providing insights into protein dynamics for a representative set of proteins. |
| CHARMM36m Force Field [16] | Molecular Model | A force field used in MD simulations to compute potential energy, parameterized for balanced sampling of folded and disordered proteins. |
| GROMACS [16] | Software | A high-performance molecular dynamics package used to simulate the Newtonian equations of motion for systems with hundreds to millions of particles. |
The inadequate treatment of solvation effects and protein flexibility has been a fundamental bottleneck in the accuracy of classical scoring functions. As this review outlines, these omissions stem from necessary but limiting simplifications made to maintain computational feasibility. The emergence of machine-learning scoring functions represents a paradigm shift, leveraging large datasets to infer complex relationships without being constrained by a predetermined functional form [12] [18]. Simultaneously, the integration of more rigorous physics-based terms, such as explicit solvation and entropy contributions, is providing a more realistic description of the binding process [14] [15]. The synergy of these two approaches—data-driven machine learning and theory-inspired physical models—is paving the way for a new generation of scoring functions with enhanced predictive power and greater generality.
Future progress will depend on continued advances in several areas. The development of large-scale, standardized dynamical data, as exemplified by the ATLAS database, will be crucial for modeling protein flexibility in a consistent manner [16]. Furthermore, the creation of target-specific scoring functions for challenging target classes like protein-protein interactions demonstrates a move away from a one-size-fits-all approach, promising better performance for specific therapeutic applications [15] [20]. As computational power grows and algorithms become more sophisticated, the explicit and accurate integration of solvation, entropy, and full flexibility will transition from a specialist's challenge to a standard component of the drug designer's toolkit, finally overcoming the key omissions that have long limited structure-based affinity prediction.
The additivity assumption posits that the total binding energy of a protein-ligand complex can be represented as the sum of independent, localized interactions. This principle underpins classical scoring functions in molecular recognition, where the affinity for any given molecular structure is calculated by summing contributions from individual atoms, functional groups, or residue pairs. The computational efficiency of this approach has made it a cornerstone in structural bioinformatics and early-stage drug discovery, particularly for rapid virtual screening of compound libraries.
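The additivity assumption described above can be made concrete with a minimal sketch. The pairwise term and all parameters below are invented for illustration; no published scoring function uses exactly these values:

```python
import math

# Toy per-pair interaction term: a linear attraction that vanishes beyond a
# cutoff. Functional form and parameters are illustrative only.
def pair_term(distance, depth=-0.35, cutoff=6.0):
    if distance >= cutoff:
        return 0.0
    return depth * (1.0 - distance / cutoff)

def additive_score(ligand_atoms, protein_atoms):
    """Additivity assumption: the total score is a plain sum of independent
    pairwise contributions, with no cooperative or context-dependent terms."""
    total = 0.0
    for la in ligand_atoms:
        for pa in protein_atoms:
            total += pair_term(math.dist(la, pa))
    return total

ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
pocket = [(3.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
print(round(additive_score(ligand, pocket), 3))  # -> -0.767
```

The key structural point is in `additive_score`: every pair contributes independently, which is precisely what breaks down in the cooperative systems discussed next.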
However, mounting experimental evidence from quantitative biochemistry reveals that molecular recognition in biological systems frequently deviates from perfect additivity. Non-additive effects emerge from complex, cooperative interactions within and between molecules—effects that simple summing functions cannot capture. This whitepaper examines the fundamental limitations of the additivity assumption through key case studies and quantitative data, providing researchers with a framework for critically evaluating scoring function performance in affinity prediction research.
Protein-DNA interactions serve as an ideal model system for testing additivity due to their well-defined binding interfaces and the discrete nature of nucleotide positions. A re-analysis of seminal studies on the Mnt repressor protein and mouse EGR1 protein binding provides compelling quantitative evidence against purely additive models [21].
Table 1: Correlation Between Measured Binding Affinities and Additive Model Predictions
| Zif268 Variant | Mononucleotide BAM (123) | Dinucleotide BAM (12*3) | Dinucleotide BAM (1*23) |
|---|---|---|---|
| Wild-type | 0.973 | 0.986 | 0.987 |
| RGPD | 0.883 | 0.942 | 0.941 |
| REDV | 0.999 | 0.999 | 0.999 |
| LRHN | 0.927 | 0.978 | 0.956 |
| KASN | 0.695 | 0.791 | 0.718 |
While the mononucleotide Best Additive Model (BAM) shows strong correlations for some proteins (e.g., REDV at 0.999), performance substantially degrades for others (KASN at 0.695) [21]. The consistent improvement of dinucleotide models, which incorporate nearest-neighbor interdependencies, demonstrates that positional coupling significantly impacts binding affinity. For the KASN variant, the dinucleotide model (12*3) achieves a correlation of 0.791 compared to 0.695 for the mononucleotide model, a 14% relative improvement in correlation [21].
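The distinction between mononucleotide and dinucleotide additive models can be sketched in a few lines. The energy tables below are invented for demonstration; they are not the fitted Zif268 parameters from the cited study:

```python
# Illustrative mononucleotide vs dinucleotide additive models for a 3-bp site.
# All energy values are invented; lower scores mean tighter binding.

mono = {  # position -> base -> independent energy contribution
    0: {"A": 0.0, "C": 0.5, "G": 1.2, "T": 0.9},
    1: {"A": 0.7, "C": 0.0, "G": 0.4, "T": 1.1},
    2: {"A": 1.0, "C": 0.8, "G": 0.0, "T": 0.6},
}

# Coupling between positions 1 and 2: a correction applied only when a
# specific dinucleotide occurs, capturing nearest-neighbor interdependence.
dinuc_12 = {("C", "G"): -0.3, ("A", "T"): 0.2}

def score_mono(site):
    """Mononucleotide BAM: strictly position-independent sum."""
    return sum(mono[i][b] for i, b in enumerate(site))

def score_dinuc(site):
    """Dinucleotide BAM: additive baseline plus a pairwise coupling term."""
    return score_mono(site) + dinuc_12.get((site[1], site[2]), 0.0)

print(score_mono("ACG"))   # -> 0.0
print(score_dinuc("ACG"))  # -> -0.3
```

The dinucleotide correction is exactly the kind of term a purely additive model cannot express, which is why the dinucleotide BAMs in Table 1 consistently correlate better with measured affinities.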
The limitations of additivity extend to protein-ligand interactions central to drug discovery. Fragment-Based Drug Discovery (FBDD) highlights the importance of non-additive synergy when fragments are combined [22]. While fragments themselves follow approximately additive rules due to their small size and simple interactions, their optimization into lead compounds frequently reveals cooperative effects that deviate from predictions based on fragment properties alone.
Modern machine learning approaches explicitly address these limitations. The ProBound framework, which models transcription factor binding affinity from sequencing data, incorporates cooperativity terms and multi-protein complex interactions that fundamentally violate simple additivity [23]. Similarly, the SCAGE architecture for molecular property prediction employs a multitask pretraining framework that captures complex relationships between molecular structure and function beyond what additive models can represent [24].
Objective: Systematically measure positional interdependence in molecular recognition.
Methodology:
Critical Controls:
Objective: Quantify cooperative effects in molecular assembly.
Methodology:
Next-generation computational models address additivity failures through several innovative approaches:
Multitask Pretraining Frameworks: SCAGE incorporates four pretraining tasks (molecular fingerprint prediction, functional group prediction, 2D atomic distance prediction, and 3D bond angle prediction) to learn comprehensive molecular representations that capture complex structure-activity relationships [24].
Cooperativity Modeling: ProBound explicitly models cooperative binding in multi-TF complexes through energy terms that depend on relative positioning and orientation of binding partners [23].
Geometric Learning: Incorporation of 3D structural information (atomic distances, bond angles, conformational flexibility) enables models to capture spatial relationships that violate simple additivity [24] [23].
Modern non-additive models also provide richer biochemical insight than purely additive fits.
Table 2: Comparison of Molecular Recognition Modeling Approaches
| Model Type | Key Assumptions | Strengths | Limitations |
|---|---|---|---|
| Additive (BAM) | Position independence | Computational efficiency; Simple interpretation | Fails for cooperative systems; Limited accuracy |
| Dinucleotide BAM | Dinucleotide interdependence | Captures nearest-neighbor effects; Improved accuracy | Still misses longer-range interactions |
| ProBound | Multi-experiment integration | Quantifies cooperativity; Handles modifications | Computational intensity; Complex implementation |
| SCAGE | Multitask representation learning | Captures complex structure-activity relationships | Requires extensive pretraining data |
Table 3: Essential Research Materials for Non-Additivity Studies
| Reagent/Technology | Function | Application Context |
|---|---|---|
| SELEX-seq | High-throughput profiling of protein-DNA interactions | Comprehensive binding affinity measurement [23] |
| KD-seq | Absolute affinity determination using input, bound and unbound fractions | Direct measurement of binding constants [23] |
| Fragment Libraries (~1400 compounds) | Screening for molecular recognition elements | Identifying privileged substructures [25] |
| Multi-TF SELEX | Characterization of cooperative complexes | Quantifying cooperativity in multi-protein assemblies [23] |
| Methylated DNA Libraries | Profiling epigenetic effects on recognition | Methylation-aware binding models [23] |
The empirical evidence against universal additivity in molecular recognition is substantial and growing. Quantitative studies of protein-DNA interactions reveal significant positional interdependencies, while fragment-based drug discovery demonstrates cooperative effects in molecular assembly. These non-additive phenomena necessitate advanced modeling approaches that explicitly account for cooperativity, spatial relationships, and contextual effects.
Modern machine learning frameworks like ProBound and SCAGE point the way forward by integrating diverse data types, modeling cooperativity explicitly, and maintaining biophysical interpretability. As molecular recognition research advances, the field must move beyond the convenient but limited additive assumption toward more sophisticated models that capture the complex, emergent properties of biological systems. This paradigm shift will enable more accurate affinity prediction, rational design of molecular interventions, and ultimately, more efficient drug discovery pipelines.
Structure-based virtual screening (VS) has become an indispensable tool in computational drug discovery, yet its effectiveness is fundamentally constrained by the accuracy of scoring functions (SFs). Classical SFs, which rely on empirical, force-field-based, or knowledge-based approaches, have hit a persistent performance plateau in their ability to discriminate between binders and non-binders. This whitepaper delineates the core limitations of these classical SFs and frames them within the broader thesis of affinity prediction research. We explore the emergence of machine-learning (ML) scoring functions as a transformative solution, presenting quantitative benchmarks and detailed methodologies that underscore their superior performance in enriching true actives and predicting binding affinities.
The primary goal of structure-based virtual screening is to identify novel bioactive molecules from vast chemical libraries by computationally docking them into a target protein's structure. The efficacy of this process hinges entirely on the scoring function's ability to rank compounds based on their predicted affinity. Classical SFs, embedded in popular docking tools, estimate binding energy using simplified physical models or statistical potentials derived from known protein-ligand structures. Despite their long-standing utility, these functions suffer from well-documented limitations: they often inadequately account for conformational entropy, solvation effects, and specific interaction nuances, leading to inaccurate affinity predictions and poor enrichment of true binders [18]. Consequently, the field has witnessed a performance plateau, where incremental improvements in classical SFs have yielded diminishing returns, creating a critical bottleneck in the early drug discovery pipeline [18] [26]. This paper examines the evidence for this plateau and the subsequent paradigm shift towards data-driven ML approaches, which learn the complex relationships between protein-ligand structural features and binding affinities directly from large-scale experimental data.
Extensive benchmarking studies across diverse protein targets provide concrete evidence of the limitations of classical SFs. The data reveal that while these functions can serve as loose classifiers, their performance, particularly in early enrichment, is significantly surpassed by modern machine-learning scoring functions.
Table 1: Virtual Screening Performance Comparison on the DUD-E Benchmark (102 Targets)
| Scoring Function | Type | Hit Rate at Top 1% | Hit Rate at Top 0.1% | Binding Affinity Pearson Correlation |
|---|---|---|---|---|
| RF-Score-VS | Machine Learning | 55.6% | 88.6% | 0.56 |
| AutoDock Vina | Classical (Empirical) | 16.2% | 27.5% | -0.18 |
| DOCK3.7 | Classical (Force-Field) | ~15% (est.) | - | - |
The data in Table 1, derived from a large-scale study on the DUD-E benchmark, is telling. The machine-learning SF, RF-Score-VS, achieves a hit rate at the top 1% of ranked molecules that is more than three times that of a classical SF like Vina [18]. The difference is even more dramatic in the ultra-early enrichment zone (top 0.1%), where RF-Score-VS identifies hits with near 90% accuracy. Furthermore, the poor Pearson correlation of Vina's scores with experimental binding affinity (-0.18) underscores its inability to provide a meaningful quantitative estimate of binding strength, a core limitation in affinity prediction research [18].
This performance gap is not isolated. A 2025 benchmarking study on Plasmodium falciparum Dihydrofolate Reductase (PfDHFR) variants further corroborates these findings. The study showed that re-scoring initial docking poses with ML SFs like CNN-Score dramatically improved early enrichment. For the wild-type enzyme, re-scoring with CNN-Score achieved an enrichment factor at 1% (EF1%) of 28, a substantial improvement over the baseline docking tools [27].
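Both metrics quoted above, the hit rate in a top-ranked fraction and the enrichment factor, are simple functions of a ranked, labeled compound list. A minimal sketch on synthetic labels (not the benchmark data from the cited studies):

```python
def hit_rate_at(ranked_labels, fraction):
    """Fraction of actives among the top `fraction` of a score-ranked list
    (labels: 1 = active, 0 = decoy, already sorted best-score first)."""
    n_top = max(1, int(len(ranked_labels) * fraction))
    return sum(ranked_labels[:n_top]) / n_top

def enrichment_factor(ranked_labels, fraction):
    """EF = hit rate in the top fraction / hit rate in the whole library."""
    overall = sum(ranked_labels) / len(ranked_labels)
    return hit_rate_at(ranked_labels, fraction) / overall

# Synthetic library: 1000 compounds, 20 actives, strong early enrichment.
labels = [1] * 8 + [0] * 2 + [1] * 12 + [0] * 978
print(hit_rate_at(labels, 0.01))       # -> 0.8 (8 actives in the top 10)
print(enrichment_factor(labels, 0.01))
```

With 2% actives overall and 80% actives in the top 1%, the EF1% here is 40, illustrating how early enrichment amplifies small ranking improvements into large practical gains.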
To ensure the reproducibility of VS benchmarks and the rigorous validation of new SFs, researchers adhere to standardized protocols. The following methodology outlines a comprehensive benchmarking workflow.
Data Provenance: Public benchmark sets like DUD-E (Directory of Useful Decoys: Enhanced) and DEKOIS 2.0 are commonly used. These sets provide, for a given protein target, a list of known active molecules and a set of decoy molecules—structurally similar but presumed inactive molecules that act as negative controls [18] [27].
Docking Experiments: The prepared ligand and decoy libraries are docked into the prepared protein structure using one or more docking programs (e.g., AutoDock Vina, FRED, PLANTS). The grid box dimensions are set to encompass the entire binding site [27].
Performance Validation: To prevent overfitting and ensure generalizability, strict cross-validation strategies are employed, holding out entire targets or ligand clusters from training so that test complexes are genuinely unseen.
Table 2: Key Research Reagents and Computational Tools for Virtual Screening Benchmarking
| Reagent / Tool Name | Type/Category | Primary Function in VS Workflow |
|---|---|---|
| DUD-E / DEKOIS 2.0 | Benchmark Dataset | Provides curated sets of active molecules and property-matched decoys for rigorous performance assessment. |
| AutoDock Vina | Docking Program | Generates plausible binding poses and provides an initial score using an empirical scoring function. |
| RF-Score-VS / CNN-Score | Machine-Learning Scoring Function | Re-scores docking poses to significantly improve the ranking of active molecules over decoys. |
| PDBbind Database | Training Dataset | A comprehensive collection of protein-ligand complexes with binding affinity data for training ML scoring functions. |
| OpenBabel / SPORES | File Format Tool | Converts and processes chemical file formats between different docking and analysis software. |
Virtual Screening Benchmarking Workflow
The transition to machine-learning scoring functions is not a panacea. While they show remarkable performance on established benchmarks, significant challenges remain that define the current frontier of affinity prediction research.
A critical issue undermining the perceived progress in ML-based affinity prediction is train-test data leakage. A 2025 analysis revealed that the standard benchmark used for evaluating SFs, the Comparative Assessment of Scoring Functions (CASF), shares a high degree of structural similarity with the PDBbind database used to train these models. This means models can perform well by memorizing similarities rather than by genuinely learning protein-ligand interactions [28]. When models like GenScore and Pafnucy were retrained on a rigorously filtered dataset (PDBbind CleanSplit) to eliminate this leakage, their performance dropped markedly, revealing an overestimation of their true generalization capabilities [28]. This highlights a core challenge: developing models that generalize to genuinely novel targets and not just those structurally related to training examples.
To combat generalization issues and improve accuracy, researchers are developing specialized approaches, including rigorously filtered, leakage-free training sets such as PDBbind CleanSplit, target-specific training paradigms, and graph neural network architectures [28].
The performance plateau of classical scoring functions in virtual screening is a well-documented reality, driven by their inherent inability to capture the complex physical chemistry of molecular recognition. The field is unequivocally shifting towards machine-learning-based solutions, which have demonstrated a profound ability to enrich true binders and offer more accurate affinity predictions. However, the path forward must be navigated with caution. The dual challenges of data leakage in public benchmarks and the limited generalization of many current models represent the next major hurdles. Future research must prioritize the development of rigorously benchmarked models, trained on non-redundant, leakage-free data, and validated on truly novel targets. The integration of advanced architectures like graph neural networks and the strategic use of target-specific training paradigms offer promising avenues to finally move beyond the plateau and deliver on the promise of accurate, reliable affinity prediction for drug discovery.
Accurately predicting the binding affinity between a small molecule and its protein target is a cornerstone of computational drug discovery. The strength of this interaction, quantified as binding affinity, directly determines a drug candidate's efficacy and is a critical parameter for lead optimization [30]. For decades, the development of scoring functions capable of reliably estimating this affinity has been a primary research focus. These functions aim to correlate the three-dimensional structural information of a protein-ligand complex with experimentally measured binding constants (Ki, Kd, IC50), providing a computational substitute for costly and time-consuming laboratory assays [30] [31].
However, a significant and persistent challenge plagues the field: the poor correlation between computationally predicted affinities and experimentally validated results. This gap severely limits the utility of these methods in real-world drug discovery pipelines, where decisions about which compounds to synthesize and test often hinge on computational predictions [30] [28]. Insufficient conformational sampling, oversimplified energy functions, and an inability to accurately model critical solvation and entropic effects are frequently cited as traditional culprits [30]. While deep learning has emerged as a promising paradigm, offering computational efficiency and the ability to learn complex patterns from data, its performance is often overestimated due to benchmark datasets plagued by data leakage and redundancy [28]. This whitepaper examines the core limitations of both classical and machine learning-based affinity prediction methods, framed within the broader thesis that current scoring functions, despite their sophistication, are not yet robust or generalizable enough to replace experimental validation.
The discrepancy between in silico predictions and experimental binding constants arises from a confluence of factors that affect both traditional and modern deep learning approaches.
Conventional physics-based methods face intrinsic hurdles. Molecular dynamics (MD) simulations for binding free energy calculations, such as those using the Bennett Acceptance Ratio (BAR), are computationally intensive. Achieving sufficient sampling is difficult because the inclusion of explicit solvent or membrane environments requires extensive equilibration to ensure system stability [30]. Furthermore, as a state function, binding free energy calculation requires finely dividing the perturbation range into multiple intermediate lambda (λ) states to control energy transitions, adding to the computational burden [30]. Classical scoring functions embedded in docking tools like AutoDock Vina or Glide rely on empirical rules and heuristic search algorithms, which often result in inaccuracies and an inability to fully capture the complexity of molecular interactions [32].
A critical, and often underestimated, challenge is the issue of data quality and evaluation. The performance of deep-learning models is highly dependent on their training data. A 2025 study highlighted that a significant train-test data leakage exists between the widely used PDBbind database and the Comparative Assessment of Scoring Functions (CASF) benchmark [28]. This leakage, stemming from structural similarities between training and test complexes, severely inflates the performance metrics of models, leading to a substantial overestimation of their generalization capabilities [28]. Alarmingly, some models perform well on benchmarks even when protein information is omitted, suggesting they rely on memorizing ligand-specific patterns rather than learning genuine protein-ligand interactions [28]. This problem is compounded by redundancies within the training data itself, which can encourage models to settle for a local minimum in the loss landscape through memorization instead of developing a robust predictive understanding [28].
Even the most accurate models on paper can fail in practical applications. A comprehensive evaluation of deep learning-based docking methods revealed significant challenges in generalization, particularly when encountering novel protein binding pockets not represented in the training data [32]. Furthermore, many deep learning methods, especially generative diffusion models, can produce poses with favorable root-mean-square deviation (RMSD) scores but that are physically implausible. They may exhibit steric clashes, incorrect bond lengths/angles, or fail to recapitulate key protein-ligand interactions essential for biological activity [32]. This indicates that while these models learn to generate geometrically correct poses, they may not fully grasp the underlying physicochemical principles governing binding.
Table 1: Core Challenges in Binding Affinity Prediction
| Challenge Category | Specific Limitations | Impact on Prediction |
|---|---|---|
| Methodological Limits | Insufficient sampling in MD simulations; Oversimplified scoring functions [30] [32]. | Inaccurate energy estimates; Failure to capture key interaction dynamics. |
| Data Bias & Leakage | Structural similarities between PDBbind training and CASF test sets; Redundant training data [28]. | Overestimated model performance; Poor generalization to novel targets. |
| Generalization Failure | Inability to handle novel protein pockets or ligand topologies; Production of physically invalid poses [32] [28]. | Models fail in real-world virtual screening and lead optimization. |
| Evaluation Deficits | Over-reliance on a single metric (e.g., RMSD); Lack of target identification benchmarks [32] [9]. | Incomplete picture of model utility for drug discovery. |
The theoretical challenges manifest in concrete performance gaps when methods are rigorously evaluated. When state-of-the-art models like GenScore and Pafnucy were retrained on a cleaned dataset (PDBbind CleanSplit) designed to eliminate data leakage, their performance on the CASF benchmark dropped markedly [28]. This confirms that previously reported high scores were largely driven by data leakage rather than genuine learning. In molecular docking, a multidimensional evaluation shows a wide variation in success rates. The "combined success rate" – which considers both pose accuracy (RMSD ≤ 2 Å) and physical validity – reveals that even the best methods have significant room for improvement.
Table 2: Performance Comparison of Docking Methods on Benchmark Datasets [32]
| Method Type | Representative Method | Combined Success Rate (Astex Diverse Set) | Combined Success Rate (DockGen - Novel Pockets) |
|---|---|---|---|
| Traditional | Glide SP | >85% (inferred) | High (inferred as top tier) |
| Hybrid (AI Scoring) | Interformer | Second highest tier | Second highest tier |
| Generative Diffusion | SurfDock | 61.18% | 33.33% |
| Regression-Based | KarmaDock, QuickBind | Lowest tier | Lowest tier |
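The "combined success rate" used in Table 2 can be computed directly once each pose has an RMSD and a physical-validity verdict (e.g., from a PoseBusters-style check). The poses below are synthetic, for illustration only:

```python
def combined_success_rate(results, rmsd_cutoff=2.0):
    """results: list of (rmsd_angstrom, physically_valid) per predicted pose.
    A pose counts as a success only if it is both geometrically accurate
    (RMSD <= cutoff) and physically plausible (no clashes, sane geometry)."""
    hits = sum(1 for rmsd, valid in results if rmsd <= rmsd_cutoff and valid)
    return hits / len(results)

# Synthetic poses: the second has good geometry but fails validity checks,
# so a low RMSD alone is not enough to count as a success.
poses = [(1.2, True), (1.8, False), (0.9, True), (3.5, True)]
print(combined_success_rate(poses))  # -> 0.5
```

This joint criterion is what separates the metric from plain RMSD success rates and explains why generative models with good RMSD statistics can still rank low in Table 2.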
Another telling benchmark is the "inter-protein scoring noise" problem. Classical functions can enrich active molecules for a single target but fail to identify the correct protein target for a given active molecule due to scoring variations between different binding pockets [9]. A test of the Boltz-2 model, a biomolecular foundation model, on a target identification benchmark revealed that it still could not identify the true target of active molecules: it failed to assign them a higher predicted binding affinity against their target than against decoy targets [9]. This indicates a lack of generalizable understanding of protein-ligand interactions.
To illustrate the complexities involved in affinity prediction, we detail two key experimental approaches: one based on molecular dynamics and another on modern deep learning model training.
The following workflow outlines the protocol for achieving efficient sampling and binding free energy calculation using a re-engineered Bennett Acceptance Ratio (BAR) method, as applied to GPCR targets [30].
Workflow Description: This protocol [30] begins with a prepared structure of the protein-ligand complex, such as a G-protein coupled receptor (GPCR) with a bound agonist or antagonist. For membrane proteins like GPCRs, the complex is embedded within an appropriate membrane model and solvated with explicit water molecules, followed by ion addition for physiological ionic strength. A multi-step equilibration through molecular dynamics is then critical to ensure the stability of the entire system—protein, ligand, membrane, and solvent. The core of the alchemical method involves defining a pathway between the bound and unbound states by dividing the transformation into numerous intermediate steps, represented by scaling factors known as lambda (λ) values. Extensive molecular dynamics sampling is performed at each of these lambda states to collect energy data for both forward and backward transitions. Finally, the binding free energy (ΔGbind) is calculated by applying the re-engineered BAR method to this collected data. The validity of the computational approach is demonstrated by correlating the calculated ΔGbind values with experimental binding affinity data (pK_D).
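The final BAR step of this workflow can be illustrated with a self-contained numerical sketch. For equal numbers of forward and reverse work samples in units of kT, the Bennett estimate of ΔF solves the self-consistency equation Σᵢ f(W_F,i − ΔF) = Σⱼ f(W_R,j + ΔF) with f(x) = 1/(1 + eˣ). This is the textbook BAR estimator, not the re-engineered variant of the cited study, and the Gaussian work distributions are synthetic rather than GPCR simulation output:

```python
import math
import random

def fermi(x):
    """Fermi function 1/(1 + e^x), guarded against overflow."""
    if x > 50:
        return 0.0
    if x < -50:
        return 1.0
    return 1.0 / (1.0 + math.exp(x))

def bar_delta_f(w_forward, w_reverse, lo=-50.0, hi=50.0, iters=60):
    """Solve the Bennett self-consistency equation by bisection
    (equal sample sizes, all energies in units of kT)."""
    def imbalance(df):
        lhs = sum(fermi(w - df) for w in w_forward)
        rhs = sum(fermi(w + df) for w in w_reverse)
        return lhs - rhs  # monotonically increasing in df
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if imbalance(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Synthetic Gaussian work distributions consistent with Crooks' theorem:
# true dF = 3 kT, dissipated work sigma^2/2 = 2 kT per direction.
random.seed(0)
sigma, df_true = 2.0, 3.0
w_f = [random.gauss(df_true + sigma**2 / 2, sigma) for _ in range(4000)]
w_r = [random.gauss(-df_true + sigma**2 / 2, sigma) for _ in range(4000)]
print(round(bar_delta_f(w_f, w_r), 2))  # close to the true value of 3.0
```

Because the imbalance function is monotonic in ΔF, simple bisection converges reliably; production codes (e.g., pymbar) add error estimates and overlap diagnostics on top of this same self-consistency condition.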
This protocol focuses on mitigating data bias to improve model generalization, a key challenge identified in recent research [28].
Workflow Description: This protocol [28] starts with the raw PDBbind database. The first and most crucial step is structure-based filtering using a multimodal clustering algorithm. This algorithm assesses similarity between protein-ligand complexes by combining protein similarity (TM-score), ligand similarity (Tanimoto score), and binding conformation similarity (pocket-aligned ligand RMSD). This identifies and removes complexes in the training set that are overly similar to those in the test set (e.g., the CASF benchmark), effectively eliminating train-test data leakage. The result is a curated training dataset, such as PDBbind CleanSplit. The protocol also involves reducing redundancy within the training set itself by resolving large similarity clusters, forcing the model to learn general rules rather than memorizing specific examples. The model architecture, such as a Graph Neural Network (GNN), is designed for sparse graph modeling of protein-ligand interactions and can be enhanced with transfer learning from large protein language models. Finally, the model is evaluated on a strictly independent test set, with ablation studies conducted to verify that its predictions are based on a genuine understanding of interactions and not data leakage.
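The core filtering idea can be sketched with one of the three similarity channels. The CleanSplit protocol combines TM-score, Tanimoto, and pocket-aligned RMSD; the simplified version below uses only ligand Tanimoto similarity on fingerprint on-bit sets, with an illustrative threshold:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def filter_training_set(train, test, threshold=0.9):
    """Drop training complexes whose ligand is too similar to any test ligand.
    A full CleanSplit-style filter would also compare protein structures
    (TM-score) and pocket-aligned ligand RMSD; this sketch uses ligand
    fingerprints only, and the 0.9 threshold is illustrative."""
    kept = {}
    for name, fp in train.items():
        if all(tanimoto(fp, test_fp) < threshold for test_fp in test.values()):
            kept[name] = fp
    return kept

# Hypothetical fingerprints: cplx1 duplicates a test-set ligand exactly.
train = {"cplx1": {1, 2, 3, 4}, "cplx2": {10, 11, 12}, "cplx3": {1, 2, 3, 5}}
test = {"casf1": {1, 2, 3, 4}}
clean = filter_training_set(train, test)
print(sorted(clean))  # -> ['cplx2', 'cplx3']
```

Removing `cplx1` here is the one-channel analogue of eliminating train-test leakage: a model can no longer score the test complex well simply by having memorized a near-identical training example.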
Table 3: Essential Resources for Binding Affinity Prediction Research
| Resource Name | Type | Primary Function in Research |
|---|---|---|
| PDBbind [31] [28] | Database | Comprehensive collection of protein-ligand complex structures with experimentally measured binding affinity data. Serves as a primary source for model training. |
| CASF Benchmark [31] [28] | Benchmark Set | Curated dataset used for the comparative assessment of scoring functions' performance in scoring, ranking, docking, and screening powers. |
| GROMACS [30] | Software | High-performance molecular dynamics toolkit used for running simulations, system equilibration, and alchemical free energy calculations. |
| AutoDock Vina [32] [28] | Software | Widely used molecular docking program with an empirical scoring function, often used as a baseline for comparison. |
| Glide [32] | Software | A robust molecular docking tool known for its accurate pose prediction and rigorous sampling algorithms. |
| Boltz-2 [9] | AI Model | A biomolecular foundation model claimed to approach the performance of FEP in estimating binding affinity. |
| PoseBusters [32] | Validation Tool | Toolkit to systematically evaluate the physical plausibility and chemical correctness of predicted docking poses. |
| CleanSplit [28] | Curated Dataset | A filtered version of PDBbind designed to minimize train-test data leakage and redundancy, enabling genuine evaluation of model generalization. |
The challenge of achieving a strong correlation between predicted and experimental binding constants remains a significant bottleneck in computational drug discovery. The limitations are deeply rooted and multifaceted, extending beyond simple algorithmic improvements. While deep learning offers new avenues, its current promise is tempered by critical issues of data bias, overestimation of capabilities, and poor generalization on truly novel targets. The path forward requires a concerted shift in the research community's approach. This includes the development and adoption of rigorously curated, non-redundant datasets, the implementation of more demanding benchmarks that test for target identification and generalization, and a holistic evaluation of models that prioritizes physical plausibility and biological relevance alongside raw predictive accuracy. Overcoming the affinity prediction challenge is not merely a computational problem but an interdisciplinary endeavor that demands a more nuanced understanding of both biological complexity and the limitations of our data-driven models.
Accurate prediction of protein-ligand binding affinity is a cornerstone of structure-based drug design. While classical scoring functions are often adequate for evaluating ligands similar to their training data, their performance significantly degrades when applied to novel chemical scaffolds or diverse protein targets—a limitation termed congeneric bias. This whitepaper analyzes the fundamental origins of this bias, rooted in statistical learning theory and exacerbated by dataset construction flaws. We demonstrate through quantitative analysis that generalized models possess inherent accuracy limits, with protein-specific models consistently outperforming universal functions. Furthermore, we document how data leakage and redundancy in common benchmarks artificially inflate performance metrics, creating a false impression of generalizability. Emerging solutions, including advanced graph neural networks, multitask learning architectures, and rigorous data curation protocols, show promise for overcoming these limitations. The findings underscore the necessity of developing next-generation scoring functions that transcend simple pattern matching to genuinely learn the biophysical principles of molecular recognition.
The accurate prediction of binding affinity remains one of the great challenges in computational chemistry [33]. Classical scoring functions were developed to provide fast assessment of protein-ligand complexes using single structural snapshots, offering an essential tool for virtual screening and lead optimization in drug discovery. These functions traditionally compromise between physical accuracy and computational efficiency, employing empirical, force-field-based, or knowledge-based approaches to score complexes.
However, a critical and persistent limitation has emerged: these functions demonstrate uneven performance across different targets and often fail catastrophically when applied to novel target classes or chemically diverse ligands [34]. This "congeneric bias" manifests when models trained on specific chemical series or protein families cannot generalize to structurally distinct complexes. The bias stems from fundamental limitations in both the theoretical foundations of scoring functions and the datasets used for their development and validation.
Recent analyses reveal that the performance of many published models has been substantially overestimated due to benchmark contamination [28]. When evaluated on properly curated datasets, even state-of-the-art models show marked performance drops, exposing their reliance on memorization rather than learning underlying physical principles. This whitepaper examines the mechanistic origins of congeneric bias, presents quantitative evidence of its effects, and outlines experimental frameworks and computational solutions designed to overcome these limitations.
The theoretical framework underlying empirical scoring functions contains fundamental constraints that necessarily limit their generalizability. Through the lens of statistical learning theory and information theory, we can formally demonstrate why a universally accurate scoring function is theoretically unattainable.
Statistical learning theory formalizes the process of elucidating functional relationships between structural features (x) and binding affinity (y) by assuming a probabilistic process generates the data used for training and testing [33]. The optimal model would capture the conditional probability distribution p(y|x), which encodes the true relationship between structure and affinity.
Using cross-entropy C(Y|X) as a loss function, the error decomposes into two components: the conditional entropy h(Y|X), which is fixed by the chosen descriptor set, and the expected regret E_x[D(p(y|x) ‖ q(y|x))], the Kullback-Leibler divergence between the true conditional distribution p(y|x) and the model's approximation q(y|x):

C(Y|X) = h(Y|X) + E_x[D(p(y|x) ‖ q(y|x))]
This decomposition reveals that even with ideal descriptors and infinite training data, h(Y|X) represents an irreducible uncertainty in affinity prediction from structural snapshots alone [33].
Theoretical analysis proves that generalized structure-based models have inherent accuracy limits, and protein-specific models will always likely perform better for their respective targets [33]. This occurs because the joint probability distribution p(x,y) over structures and affinities differs significantly across protein families. A single model q(y|x) must compromise across these different distributions, necessarily increasing regret for any specific target.
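The decomposition and the resulting penalty for pooled models can be verified numerically. The sketch below uses hypothetical binned affinity distributions for two protein families (the names and probabilities are illustrative, not from the source) and confirms that cross-entropy equals irreducible entropy plus regret, and that a single pooled model incurs nonzero regret on each specific target:

```python
import math

def cross_entropy(p, q):
    """C(p, q) = -sum_i p_i * log(q_i), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """h(p) = -sum_i p_i * log(p_i)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl(p, q):
    """D(p || q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical binned affinity distributions for two protein families
p_family_a = [0.70, 0.20, 0.10]  # affinities skew high for family A
p_family_b = [0.10, 0.20, 0.70]  # affinities skew low for family B

# A single generalized model must compromise across both distributions
q_pooled = [(a + b) / 2 for a, b in zip(p_family_a, p_family_b)]

for p in (p_family_a, p_family_b):
    # Decomposition: cross-entropy = irreducible entropy + regret (KL term)
    assert abs(cross_entropy(p, q_pooled) - (entropy(p) + kl(p, q_pooled))) < 1e-12
    # The pooled model pays positive regret on every specific target
    assert kl(p, q_pooled) > 0
```

A protein-specific model with q equal to its target's p would drive the KL term to zero, leaving only h(Y|X), which is the formal content of Table 1 below.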
Table 1: Theoretical Error Components in Generalized vs. Targeted Models
| Model Type | Minimum Error h(Y\|X) | Expected Regret D(p\|q) | Total Expected Error |
|---|---|---|---|
| Generalized Model | Fixed for descriptor set | High (must compromise across targets) | High |
| Protein-Specific Model | Fixed for descriptor set | Low (optimized for specific p(x,y)) | Lower |
Theoretical Framework of Scoring Function Limitations: This diagram illustrates how assumptions in statistical learning theory, when applied to structure-based affinity prediction, inevitably lead to generalization failure. The fundamental discrepancy between the true distribution of structure-affinity relationships and model assumptions creates regret that compounds minimum achievable error.
Empirical evaluations substantiate the theoretical predictions of inherent limitations in classical scoring functions. The performance degradation is most pronounced when models encounter novel targets or diverse ligands, precisely illustrating the congeneric bias phenomenon.
Early evidence of congeneric bias emerged from observations that scoring function performance varies dramatically between different protein systems [33]. For certain challenging targets—including acetylcholine esterase (AChE), pantothenate synthetase, and various kinases—conventional scoring functions cannot distinguish native binding poses from decoys, despite generating structurally plausible alternatives [34].
Table 2: Performance Disparities Across Challenging Targets
| Target Protein | PDB ID | Scoring Function | Native Pose Ranking | Key Challenge |
|---|---|---|---|---|
| Acetylcholine Esterase | 1GPK | MedusaScore | Outside top 1% | Entropic effects |
| Pantothenate Synthetase | 1N2J | AutoDock | Outside top 1% | Flexibility |
| JNK3 Kinase | 1PMN | Glide | Outside top 1% | Specific hydration |
| Tuberculosis Thymidylate Kinase | 1W2G | MedusaScore | Outside top 1% | Coupled dynamics |
| Checkpoint Kinase 1 | 2BR1 | Multiple | Outside top 1% | Metal coordination |
Discrete Molecular Dynamics (DMD) simulations demonstrated that incorporating protein-ligand dynamics and entropic effects could successfully identify native poses in 6 of 8 cases where static scoring functions failed [34]. This suggests that the omission of dynamic information constitutes a critical limitation in classical functions applied to novel targets.
Recent analyses reveal that much of the reported performance of modern machine learning scoring functions is artificially inflated by data leakage between training and test sets. When proper filtering is applied, performance metrics drop substantially [28].
A structure-based clustering algorithm identified nearly 600 similar train-test pairs between PDBbind training complexes and Comparative Assessment of Scoring Functions (CASF) test complexes, affecting 49% of all CASF test complexes [28]. This contamination enables models to achieve high benchmark performance through memorization rather than genuine learning of protein-ligand interactions.
Table 3: Performance Drop After Data Leakage Removal
| Model | Original CASF Performance (RMSE) | CleanSplit Performance (RMSE) | Performance Drop |
|---|---|---|---|
| GenScore | 1.25 | 1.58 | 26.4% |
| Pafnucy | 1.32 | 1.71 | 29.5% |
| GEMS [28] | 1.18 | 1.21 | 2.5% |
The creation of PDBbind CleanSplit—a curated dataset with reduced train-test similarity—exposed the extent of this overestimation [28]. Models that previously showed exceptional benchmark performance experienced significant drops when retrained on CleanSplit, while models designed for better generalization maintained robust performance.
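The "Performance Drop" column in Table 3 is the relative increase in RMSE after retraining on the leakage-free split. A one-line helper makes the computation explicit and reproduces the tabulated values:

```python
def relative_drop(rmse_benchmark, rmse_cleansplit):
    """Percent increase in RMSE after retraining on leakage-free data."""
    return (rmse_cleansplit - rmse_benchmark) / rmse_benchmark * 100

# Values from Table 3
assert round(relative_drop(1.25, 1.58), 1) == 26.4  # GenScore
assert round(relative_drop(1.32, 1.71), 1) == 29.5  # Pafnucy
assert round(relative_drop(1.18, 1.21), 1) == 2.5   # GEMS
```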
To address data leakage issues, researchers have developed rigorous filtering protocols [28]:
Similarity Assessment: Compute multimodal similarity between all training and test complexes, combining pairwise similarity of ligand structures, protein structures, and binding conformations [28].
Leakage Removal: Iteratively remove all training complexes that exceed similarity thresholds with any test complex.
Redundancy Reduction: Apply adapted filtering thresholds to identify and eliminate similarity clusters within the training set until all striking redundancies are resolved.
Validation: Verify that the highest remaining similarities between training and test sets show clear structural differences in both protein folds and ligand positioning.
This protocol resulted in the removal of 4% of training complexes due to train-test similarity and an additional 7.8% due to internal redundancies [28].
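The leakage-removal step can be sketched in a few lines. The sketch below is a simplification: fingerprints are toy bit sets, protein TM-scores are a precomputed lookup, and the rule that a pair leaks only when both modalities exceed their thresholds is an assumption, since the published multimodal criterion also weighs binding conformations:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprint bit sets."""
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def remove_leaking(train, test, tm_score, tm_thr=0.7, tan_thr=0.9):
    """Drop training complexes too similar to any test complex.

    Assumes a pair leaks when BOTH the protein TM-score and the ligand
    Tanimoto similarity exceed their thresholds (a simplification of the
    multimodal rule described in the text).
    """
    kept = []
    for t_id, t_fp in train.items():
        leaking = any(
            tm_score[(t_id, s_id)] > tm_thr and tanimoto(t_fp, s_fp) > tan_thr
            for s_id, s_fp in test.items()
        )
        if not leaking:
            kept.append(t_id)
    return kept

# Toy data: hypothetical PDB IDs with fingerprint bit sets and TM-scores
train = {"1abc": {1, 2, 3, 4, 5}, "2def": {1, 2, 3, 4, 5}, "3ghi": {7, 8, 9}}
test = {"9xyz": {1, 2, 3, 4, 5}}
tm = {("1abc", "9xyz"): 0.92, ("2def", "9xyz"): 0.40, ("3ghi", "9xyz"): 0.95}

kept = remove_leaking(train, test, tm)  # only "1abc" exceeds both thresholds
```

Here "1abc" is removed (similar protein and identical ligand), while "2def" survives on protein dissimilarity and "3ghi" on ligand dissimilarity, mirroring the multimodal logic of the protocol.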
For targets where classical scoring functions fail to identify native poses, Discrete Molecular Dynamics (DMD) offers a robust alternative protocol [34]:
Pose Generation: Use flexible docking software (e.g., MedusaDock) to generate 1000+ diverse poses for the target ligand.
Pose Clustering: Employ mean-linkage hierarchical clustering with a 2.5Å RMSD cutoff to identify structurally distinct pose clusters.
Representative Selection: Select the highest-scoring pose from each cluster for simulation, eliminating dynamically indistinguishable poses.
DMD Simulation: Perform multiple DMD simulations for each representative pose, using discretized energy potentials and fast event-sorting to enhance sampling efficiency.
Residence Time Analysis: Calculate ligand residence time for each pose, with native and near-native poses typically exhibiting distinctly longer residence times than decoys.
Ranking: Rank poses by residence time rather than static energy scores, successfully identifying native poses within the top 0.5% of poses for most challenging targets.
Dynamics-Enhanced Pose Discrimination: Workflow for identifying native binding poses in challenging targets where classical scoring functions fail, using Discrete Molecular Dynamics simulations to incorporate protein-ligand dynamics and entropic effects.
Graph Neural Networks with Enhanced Featurization
Novel graph architectures show improved generalization by better representing protein-ligand interactions. The AEV-PLIG model combines atomic environment vectors (AEVs) with protein-ligand interaction graphs, using radial atomic environment vectors centered on ligand atoms as node features [35]. This approach captures intermolecular pairwise atomic interactions more explicitly than distance cutoffs alone.
The GEMS (Graph neural network for Efficient Molecular Scoring) architecture leverages transfer learning from protein language models and sparse graph modeling of interactions to maintain performance even when trained on properly filtered datasets [28]. Ablation studies confirm that GEMS fails to produce accurate predictions when protein nodes are omitted, suggesting genuine understanding of interactions rather than ligand memorization.
Multitask Learning Frameworks
The DeepDTAGen framework simultaneously predicts drug-target binding affinity and generates novel target-aware drug variants using a shared feature space [36]. This multitask approach ensures that the model learns DTI-specific features in the latent space while utilizing these features for generation. The framework employs a novel FetterGrad algorithm to mitigate gradient conflicts between distinct tasks, keeping gradients aligned during optimization.
To address the fundamental data scarcity problem, researchers have successfully employed augmentation techniques:
Template-Based Modeling: Generate additional complex structures using template-based ligand alignment algorithms [35]
Molecular Docking: Create synthetic training examples by docking known binders into homologous protein structures [35]
Conditional Generation: Develop target-aware generative models that produce novel ligands conditioned on specific protein binding pockets [36]
When AEV-PLIG was trained with augmented data, performance on FEP benchmarks improved substantially, with weighted mean Pearson correlation coefficient increasing from 0.41 to 0.59 and Kendall's τ from 0.26 to 0.42 [35]. This narrows the performance gap with FEP+ (PCC: 0.68, Kendall's τ: 0.49) while being approximately 400,000 times faster.
Table 4: Key Experimental Resources for Robust Affinity Prediction
| Resource | Type | Function | Application Context |
|---|---|---|---|
| PDBbind CleanSplit | Curated Dataset | Training data with reduced benchmark leakage | Model development & validation |
| CASF-2016 Benchmark | Evaluation Dataset | Standardized performance assessment | Method comparison |
| DMD Suite | Simulation Software | Discrete Molecular Dynamics simulations | Pose discrimination for difficult targets |
| AEV-PLIG | Graph Neural Network | Binding affinity prediction with atomic environments | Structure-based affinity prediction |
| DeepDTAGen | Multitask Framework | Simultaneous affinity prediction & drug generation | Target-aware drug design |
| FetterGrad Algorithm | Optimization Method | Mitigates gradient conflicts in multitask learning | Multitask model training |
| 3D Interaction Fingerprints | Descriptor System | Encodes spatial interaction patterns | Reference complex selection |
| Knowledge-Guided Scoring (KGS2) | Add-on Method | Enhances existing functions using reference complexes | Scoring function improvement |
Congeneric bias in classical scoring functions represents a fundamental challenge rooted in theoretical limitations of structure-based models and exacerbated by methodological shortcomings in model development and evaluation. The reliance on single static structures, inadequate treatment of entropic contributions, and dataset contamination have collectively created a situation where reported performance metrics significantly overstate real-world applicability.
Promising paths forward include dynamics-aware scoring methods, rigorously curated datasets, advanced neural architectures that explicitly model physical interactions, and data augmentation to expand chemical diversity. The development of models that genuinely learn biophysical principles rather than exploiting dataset biases will be essential for achieving robust performance across novel and diverse targets. As these approaches mature, they promise to narrow the gap between computational prediction and experimental reality, ultimately accelerating therapeutic development through more reliable virtual screening.
Scoring functions are the computational engine of structure-based drug design, tasked with predicting the binding affinity between a drug candidate and its protein target. Classical scoring functions, which often rely on simplified physical models and empirical parameters, have long been a cornerstone of molecular docking and virtual screening [32]. These functions provide a critical bridge between the structural data of a protein-ligand complex and the anticipated biological activity, guiding the selection of initial "hit" compounds and their subsequent optimization into viable "leads" [37]. However, inherent limitations in these classical approaches can lead to inaccurate affinity predictions, creating a cascade of negative consequences throughout the drug discovery pipeline. This case study examines the tangible repercussions of poor scoring in hit identification and lead optimization, framing the issue within the broader research context of overcoming the limitations of classical scoring functions. We will analyze quantitative evidence of these failures, detail experimental protocols for evaluating scoring function performance, and explore how emerging deep learning (DL) methodologies are providing potential pathways to more reliable predictions [38] [32].
The initial phases of drug discovery are heavily reliant on computational prescreening to navigate vast chemical space. Hit identification aims to find initial compounds with confirmed activity against a therapeutic target, typically from hundreds of thousands to millions of candidates [39] [37]. Following this, the hit-to-lead phase involves optimizing these initial hits for potency, selectivity, and drug-like properties, a process that depends on accurate structure-activity relationship (SAR) data to guide medicinal chemistry [40]. In both stages, scoring functions are indispensable for prioritizing which compounds to synthesize and test experimentally.
The reliance on these functions is profound. In virtual screening, they act as a filter, and their failure to correctly rank compounds can cause truly active molecules to be overlooked in favor of false positives [39]. During lead optimization, medicinal chemists use predicted binding modes and affinities to decide which chemical modifications to make. Inaccurate scoring can therefore misdirect the entire optimization effort, wasting precious time and resources [32]. A core challenge is that classical functions often struggle to capture the complex physical chemistry of binding, such as the subtle effects of solvation, entropy, and specific intermolecular interactions like halogen bonds [32]. This foundational weakness manifests in several critical failure modes, for which quantitative evidence is mounting.
Recent comprehensive benchmarks directly compare traditional and AI-powered docking methods, revealing systematic shortcomings. The following table summarizes key performance metrics across different evaluation datasets, highlighting the specific challenges of physical plausibility and generalization.
Table 1: Performance Comparison of Docking Methods Across Key Benchmarks
| Method Category | Example Method | Pose Prediction Success (RMSD ≤ 2Å) | Physical Validity (PB-Valid Rate) | Combined Success (RMSD ≤ 2Å & PB-Valid) | Key Weakness Identified |
|---|---|---|---|---|---|
| Traditional | Glide SP | Moderate | >94% (across all datasets) | Moderate | Balanced but computationally intensive [32] |
| Generative Diffusion | SurfDock | >70% (across all datasets) | Suboptimal (e.g., 40.21% on DockGen) | Moderate (e.g., 33.33% on DockGen) | Produces physically implausible structures [32] |
| Regression-Based | KarmaDock | Low | Very Low | Low | Frequent production of physically invalid poses [32] |
| Hybrid (AI Scoring) | Interformer | Moderate | High | High | Aims to balance pose accuracy and physical validity [32] |
The data reveals that while some modern DL methods, particularly generative diffusion models, excel in raw pose prediction accuracy (RMSD ≤ 2Å), they often do so at the cost of physical plausibility. For instance, SurfDock achieves a high pose prediction success rate of 75.66% on the challenging DockGen dataset (featuring novel protein pockets) but has a PB-valid rate of only 40.21% [32]. The PoseBusters toolkit has been instrumental in uncovering these issues, flagging problems such as incorrect bond lengths, steric clashes, and unrealistic molecular geometry that are missed by the RMSD metric alone [32].
Furthermore, a critical failure of both classical and many DL scoring functions is their poor generalization to novel targets. Performance often drops significantly when methods are applied to proteins or binding pockets that are structurally distinct from those in their training data [32]. This lack of robustness directly impacts virtual screening efficacy, as the goal is to discover new chemotypes for diverse targets.
Table 2: Consequences of Poor Scoring in Key Drug Discovery Stages
| Discovery Stage | Primary Impact of Poor Scoring | Downstream Consequences |
|---|---|---|
| Hit Identification | Inaccurate ranking of compounds in virtual screening; high false positive/negative rates. | Waste of resources on testing inactive compounds; missed opportunities by overlooking true hits [39]. |
| Hit-to-Lead | Misleading guidance for Structure-Activity Relationship (SAR) and medicinal chemistry. | Optimization efforts are misdirected, leading to dead ends; poor compound quality propagates forward [40]. |
| Lead Optimization | Failure to correctly predict the affinity of optimized analogs. | Inefficient cycle of synthesis and testing; increased risk of late-stage attrition due to underlying affinity issues [32]. |
To systematically identify the limitations described above, researchers employ rigorous benchmarking protocols. The following workflow outlines a standard methodology for a comprehensive assessment of scoring functions.
A robust evaluation requires multiple, carefully curated datasets that test different aspects of performance, such as the Astex Diverse Set for standard pose-prediction validation and DockGen for generalization to novel binding pockets [32].
The workflow assesses several distinct performance metrics, as outlined in Table 1: pose prediction success (RMSD ≤ 2Å), physical validity (PB-valid rate), and their combination, which jointly penalizes accurate-but-implausible poses.
The quantitative failures detailed in Section 3 translate directly into significant operational setbacks in the laboratory. When scoring functions generate false positives—incorrectly assigning high affinity to non-binders—teams waste valuable resources synthesizing and testing these compounds. A survey of virtual screening studies noted that a lack of consensus on hit identification criteria, including the underutilization of size-targeted metrics like ligand efficiency, can exacerbate this problem [39]. Furthermore, poor scoring can obscure the true Structure-Activity Relationship (SAR), leading chemists to draw incorrect conclusions about which chemical groups contribute favorably to binding. This misdirection can derail an optimization campaign, sending teams down unproductive chemical pathways for months [40]. Ultimately, these errors contribute to the high attrition rates seen in later, more expensive stages of drug development, as fundamental flaws in affinity and selectivity are only uncovered after substantial investment.
The following table lists key reagents, software, and datasets that are essential for conducting rigorous evaluations of scoring functions and performing structure-based drug discovery.
Table 3: Essential Research Tools for Scoring and Docking Evaluation
| Tool Name | Type | Primary Function in Evaluation |
|---|---|---|
| PoseBusters [32] | Software Toolkit | Validates the physical plausibility and chemical correctness of predicted protein-ligand complexes. |
| RDKit [41] | Cheminformatics Library | Handles molecular informatics; used for processing ligand structures (e.g., from SMILES) and calculating molecular descriptors. |
| Astex Diverse Set [32] | Benchmark Dataset | A standard set of high-quality protein-ligand complexes for initial validation of pose prediction accuracy. |
| DockGen [32] | Benchmark Dataset | A dataset featuring novel protein binding pockets for testing the generalizability of docking methods. |
| Transcreener Assays [40] | Biochemical Assay | Provides a homogeneous, high-throughput method for experimentally confirming compound potency and mechanism of action during hit validation. |
| PyMOL [41] | Molecular Visualization | Enables visual inspection of predicted binding poses, protein-ligand interactions, and steric clashes. |
The limitations of classical functions have spurred the development of new computational paradigms. Deep learning models are now being extensively applied to drug-target binding (DTB) prediction, offering the potential to learn complex, non-linear relationships from large datasets that are difficult to codify in classical functions [38] [41]. These DL approaches can be broadly categorized, each with distinct advantages and weaknesses, as shown in the following diagram.
As illustrated, generative diffusion models demonstrate superior pose prediction accuracy but often produce physically implausible structures. Regression-based models frequently fail to generate valid molecular geometries altogether. In contrast, hybrid methods, which often combine traditional conformational search algorithms with AI-driven scoring functions, currently offer the most balanced performance, aiming to retain the strengths of both classical and modern approaches [32]. The field is also exploring multimodal approaches that integrate diverse data types, such as protein sequences, ligand graphs, and 3D structural information, to create more robust and generalizable models [38].
This case study has delineated the profound consequences of poor scoring in early drug discovery, from wasted resources on false leads to the misguided optimization of compound series. Quantitative benchmarks reveal that while classical and even some modern DL scoring functions can perform well on standard tests, they frequently fail on critical aspects like physical plausibility, recovery of key interactions, and generalization to novel targets. Addressing these limitations is paramount for improving the efficiency of drug discovery.
Future research directions are focused on developing more physically realistic and generalizable models. Promising strategies include integrating tighter physical constraints into DL model loss functions, improving the sampling of diffusion models, and enhancing the efficiency of hybrid method searches [32]. Furthermore, the development of more challenging and biologically relevant benchmark datasets will be crucial for steering progress. As these advanced models mature and are validated against real-world screening campaigns, they hold the potential to significantly de-risk the hit identification and lead optimization process, ultimately accelerating the delivery of new therapeutics.
In the field of computational drug discovery, the accuracy of structure-based binding affinity prediction is fundamentally constrained by the quality, quantity, and diversity of the underlying training data. While advanced deep learning architectures including convolutional neural networks, graph neural networks, and transformer-based models have emerged as promising approaches for scoring functions, their performance has plateaued due to often-overlooked limitations in the datasets upon which they are trained and evaluated [42]. The central challenge, termed the "data bottleneck," encompasses three interrelated dimensions: spatial and structural biases in existing datasets, sparsity of data for novel targets, and the propagation of errors through low-quality or improperly processed data. This bottleneck not only inflates performance metrics during benchmarking but severely limits the real-world applicability of these models in genuine drug discovery pipelines, particularly when encountering novel protein targets or chemical spaces [28] [32] [43].
The persistence of this data bottleneck has significant implications for the development of classical scoring functions. Models achieving state-of-the-art performance on standardized benchmarks frequently fail to maintain this accuracy when applied to strictly independent test sets, revealing a concerning over-reliance on data patterns that do not translate to genuine generalization [28]. This technical guide examines the multifaceted nature of data limitations through quantitative analysis, experimental validation, and proposed methodological solutions, providing researchers with a framework for diagnosing and addressing data-related challenges in their own affinity prediction work.
A critical examination of standard benchmarks reveals substantial data leakage between the primary training data and evaluation sets. The PDBbind database and the Comparative Assessment of Scoring Functions (CASF) benchmark, widely used for training and testing deep learning models, exhibit a high degree of structural similarity that artificially inflates performance metrics [28]. Recent investigations utilizing structure-based clustering algorithms have identified that nearly 49% of CASF test complexes have highly similar counterparts in the PDBbind training set, with nearly 600 identified train-test pairs sharing not only similar ligand and protein structures but also comparable binding conformations and affinity labels [28]. This redundancy enables models to achieve high benchmark performance through memorization and structural matching rather than genuine understanding of protein-ligand interactions.
Table 1: Quantifying Data Leakage Between PDBbind and CASF Benchmarks
| Similarity Metric | Threshold Value | Impact on CASF Test Set | Effect on Model Performance |
|---|---|---|---|
| Protein Structure Similarity (TM-score) | >0.7 | 49% of test complexes affected | Enables protein structure memorization |
| Ligand Similarity (Tanimoto) | >0.9 | Significant portion of test ligands | Allows ligand-based affinity prediction |
| Binding Conformation (RMSD) | <2.0Å | Nearly 600 similar train-test pairs | Permits binding pose matching |
| Combined Similarity | Multimodal filtering | Widespread train-test overlap | Inflates benchmark performance by 20-40% |
Researchers can implement the following experimental protocol to diagnose data leakage in their own datasets:
Structure-Based Clustering Algorithm: Implement a multimodal filtering approach that combines protein structure similarity, ligand similarity, and binding-conformation similarity.

Similarity Threshold Application: Identify problematic pairs using the established thresholds from Table 1: protein structure TM-score > 0.7, ligand Tanimoto similarity > 0.9, and binding conformation RMSD < 2.0Å.
Cross-Dataset Comparison: Apply the clustering algorithm to compare training and test set complexes, flagging any pairs exceeding similarity thresholds across multiple metrics.
Dataset Filtering: Create a cleaned dataset by removing training complexes that closely resemble any test complex according to the established thresholds. The PDBbind CleanSplit protocol removes approximately 4% of training complexes to address train-test leakage and an additional 7.8% to reduce internal redundancies [28].
Retraining state-of-the-art affinity prediction models on properly cleaned datasets reveals the substantial impact of data bias. When models like GenScore and Pafnucy were retrained on the PDBbind CleanSplit dataset with reduced data leakage, their performance on the CASF benchmark dropped markedly, confirming that previously reported high performance was largely driven by data leakage rather than genuine generalization capability [28]. This demonstrates that the impressive benchmark performance of many published models does not translate to real-world scenarios where models encounter truly novel protein-ligand complexes.
Public structural databases exhibit significant biases in their coverage of protein-ligand interactions. The Protein Data Bank (PDB) contains substantial representation biases toward soluble, easily crystallized proteins, while membrane proteins, RNA-protein complexes, and other challenging targets remain severely underrepresented [43] [44]. As of 2024, only 4,888 RNA-protein complexes were available in the PDB, with fewer than 400 representing high-resolution, unique, non-redundant structures after accounting for redundancies [44]. This sparse structural coverage creates critical gaps in training data that directly impact model performance on pharmaceutically relevant but structurally elusive targets.
To mitigate the effects of data sparsity, researchers can employ the following methodological approaches:
Data Augmentation through Conformational Sampling: Generate multiple docking poses, including decoy conformations, for known active compounds to expand sparse training sets.

Transfer Learning from Related Domains: Pre-train models on protein targets with abundant data, then fine-tune on the sparse target family of interest.

Federated Learning Approaches: Train models collaboratively across multiple institutions without directly sharing proprietary structural or affinity data.
Table 2: Data Sparsity Mitigation Strategies and Their Applications
| Strategy | Methodology | Target Use Case | Limitations |
|---|---|---|---|
| Decoy Conformation Augmentation | Generate multiple docking poses for active compounds | Virtual screening for targets with limited active compounds | May introduce conformational bias if sampling is insufficient |
| Cross-Target Transfer Learning | Pre-train on targets with abundant data, fine-tune on sparse targets | Novel target families with limited structural data | Requires careful selection of source domains to ensure relevance |
| Federated Learning | Train across multiple institutions without data sharing | Proprietary datasets with IP constraints | Increased computational complexity and coordination overhead |
| Synthetic Data Generation | Generative models to create plausible protein-ligand complexes | Ultra-rare targets with minimal experimental data | Requires validation to ensure physical plausibility |
The relationship between data quality and model performance represents a critical dimension of the data bottleneck. Systematic studies examining the effects of data quality and quantity have demonstrated that variations in these parameters can cause performance discrepancies comparable to or even larger than those observed between different deep learning architectures [46]. Notably, the presence of diverse protein targets in training data produces a dramatic increase in prediction accuracy, highlighting the importance of target diversity over mere quantity of ligand data [46]. This suggests that the continued accumulation of high-quality affinity data, especially for new protein targets, is indispensable for improving deep learning models.
The growing practice of employing low-precision computation to enhance efficiency introduces subtle but significant challenges in model evaluation. When relevance scores between queries and documents are computed in low-precision formats (e.g., FP16, BF16), the reduced numerical granularity produces spurious ties—distinct true scores that collapse to the same quantized value [47]. These scoring collisions introduce high variability in evaluation results based on arbitrary tie-resolution methods, making reliable performance assessment difficult. In retrieval-based affinity prediction tasks, this can manifest as inconsistent ranking of candidate molecules based on predicted binding scores.
To address evaluation instability from low-precision data, researchers should implement the following High-Precision Scoring (HPS) protocol:
Maintain Low-Precision Forward Pass: Execute the primary model inference in low-precision (BF16/FP16) to preserve computational efficiency.
Upcast Final Scoring Operation: Before the final scoring function (softmax, sigmoid, or pairwise product), upcast the logits tensor to FP32 precision:
ŝ_i = ϕ(upcast(z_i)) [47]
Compute Fine-Grained Scores: Perform the final scoring operation in high precision to generate more discriminative relevance scores.
Implement Tie-Aware Metrics: Supplement standard evaluation metrics with tie-aware retrieval metrics (TRM) that report expected scores, ranges, and biases to quantify ordering uncertainty [47].
This combined approach dramatically reduces tie-induced instability, with experiments showing MRR@10 range reduction of 36.82% and recovery of near-FP32 evaluation stability [47].
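The tie-inducing effect of low-precision scoring, and the benefit of upcasting before the final operation, can be reproduced in a few lines of numpy. This is an illustrative sketch only, not the cited HPS implementation; the logit values are invented:

```python
import numpy as np

def softmax(z):
    # numpy ufuncs keep the input dtype, so float16 logits yield
    # float16 (coarsely quantized) scores
    e = np.exp(z - z.max())
    return e / e.sum()

# 100 nearly identical logits, as a low-precision forward pass might emit.
logits_fp16 = np.linspace(0.0, 0.01, 100, dtype=np.float16)

scores_fp16 = softmax(logits_fp16)                     # low-precision scoring
scores_fp32 = softmax(logits_fp16.astype(np.float32))  # HPS-style upcast first

n_fp16 = len(np.unique(scores_fp16))
n_fp32 = len(np.unique(scores_fp32))
print(n_fp16, "distinct FP16 scores vs", n_fp32, "distinct FP32 scores")
```

All 100 logits are distinct, yet the FP16 scores collapse into a handful of tied values, while upcasting before the scoring operation preserves a full ranking.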
Table 3: Essential Research Reagents for Addressing Data Bottlenecks
| Reagent / Resource | Type | Primary Function | Application Context |
|---|---|---|---|
| PDBbind CleanSplit | Curated Dataset | Eliminates train-test leakage in affinity prediction | Benchmarking generalization capability of scoring functions |
| PADIF Fingerprint | Interaction Representation | Captures nuanced protein-ligand interaction patterns | Virtual screening and target prediction for diverse target classes |
| High-Precision Scoring (HPS) | Evaluation Protocol | Reduces spurious ties in low-precision inference | Stable evaluation of retrieval-based affinity prediction |
| Tie-aware Retrieval Metrics (TRM) | Evaluation Metrics | Quantifies uncertainty from tied rankings | Comprehensive assessment of model ranking performance |
| Dark Chemical Matter (DCM) | Decoy Dataset | Provides confirmed non-binders for model training | Improving virtual screening specificity |
| Federated Learning Platforms | Computational Framework | Enables multi-institutional collaboration without data sharing | Training on proprietary datasets with IP constraints |
The implementation of PDBbind CleanSplit provides an instructive case study in addressing data bottlenecks. After identifying substantial data leakage between standard training and test sets, researchers developed a structure-based filtering algorithm that systematically removes training complexes that closely resemble any test complex [28]. The resulting dataset enables genuine evaluation of model generalization to unseen protein-ligand complexes. When a graph neural network model (GEMS) incorporating sparse graph modeling and transfer learning from language models was trained on this cleaned dataset, it maintained high benchmark performance despite the reduced data leakage, suggesting its predictions were based on genuine understanding of protein-ligand interactions rather than exploitation of dataset biases [28].
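The filtering idea behind such leakage removal can be sketched abstractly. The similarity matrix and threshold below are hypothetical; the actual PDBbind CleanSplit algorithm computes structure-based similarities between protein-ligand complexes [28]:

```python
import numpy as np

def clean_split(sim, threshold):
    """Return indices of training items to keep: drop any training item
    whose similarity to ANY test item exceeds the threshold.
    `sim` is an (n_train, n_test) similarity matrix. Illustrative sketch
    only; the real algorithm works on complex structures, not a
    precomputed matrix."""
    leaky = (sim > threshold).any(axis=1)
    return np.where(~leaky)[0]

rng = np.random.default_rng(0)
sim = rng.uniform(0.0, 1.0, size=(1000, 50))   # hypothetical similarities
kept = clean_split(sim, threshold=0.99)

# No retained training item closely resembles any test item.
assert (sim[kept] <= 0.99).all()
print(len(kept), "of 1000 training items retained")
```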
The data bottleneck in binding affinity prediction represents a multifaceted challenge encompassing bias, sparsity, and quality limitations that collectively constrain the real-world applicability of computational models. The systematic identification of data leakage between standard benchmarks reveals that reported performance metrics often substantially overestimate true generalization capability. Addressing these limitations requires coordinated advances in dataset curation, evaluation methodologies, and model architectures specifically designed to maximize learning from limited high-quality data.
Future progress depends on developing more sophisticated data curation protocols that proactively identify and eliminate biases, creating standardized evaluation frameworks that account for precision limitations, and fostering collaborative data sharing models that expand access to diverse, high-quality training data while respecting intellectual property constraints. The implementation of the experimental protocols and solutions outlined in this technical guide provides researchers with practical approaches to diagnose and address data bottlenecks in their own work, ultimately contributing to the development of more robust and generalizable scoring functions for computational drug discovery.
In the field of affinity prediction for drug discovery, the perceived performance of computational models is often dangerously inflated by fundamental errors in the application of machine learning principles. Specifically, improper management of the relationship between training and testing data—termed here "Train-Test Stewardship"—introduces optimistic bias that undermines model reliability and generalizability. This technical examination addresses how data leakage, inconsistent preprocessing, and inadequate randomness control systematically compromise benchmarking integrity in scoring function development, perpetuating the well-documented limitations of classical affinity prediction methods and impeding genuine progress in the field.
The development of reliable scoring functions for protein-ligand binding affinity prediction represents a cornerstone of structure-based drug design. Despite decades of research, classical and machine learning-based scoring functions continue to demonstrate limited predictive accuracy on novel targets, with particularly poor performance in cross-target applications—a phenomenon known as the inter-protein scoring noise problem [9]. Empirical scoring functions trained using linear regression or machine learning methods on experimental structures and affinity data have shown considerable improvements in prediction accuracy for large generic datasets [5]. However, these apparent advances often fail to translate to real-world drug discovery applications, with few AI-discovered therapeutics reaching clinical trials and none achieving clinical approval as of 2024 [48].
This performance-translation gap frequently stems from improperly implemented benchmarking methodologies that systematically inflate perceived model capability. The core issue resides in what we term "Train-Test Stewardship"—the comprehensive approach to managing the relationship, separation, and processing of training and testing data throughout the model development pipeline. Insufficient attention to the subtle ways in which information leaks between these datasets, or in which preprocessing decisions optimize for benchmark performance rather than generalizability, creates a misleading impression of model efficacy that evaporates when facing truly novel prediction tasks.
Data leakage occurs when information that would not be available at prediction time is used during model training, resulting in optimistically biased performance estimates [49]. In affinity prediction, this manifests particularly during preprocessing and feature selection stages.
Mechanism of Leakage: When the entire dataset is used for operations that should be restricted to training data only, such as feature selection, normalization parameter calculation, or dimensionality reduction, information from the test set contaminates the training process. The model effectively gains "foresight" about the test distribution, violating the fundamental assumption of independent evaluation [49].
Experimental Demonstration: In a demonstration using synthetic data with 10,000 randomly generated features and completely random targets, including test data in feature selection resulted in an accuracy score of 0.76—far above the expected chance performance of 0.5. When proper protocol was followed (splitting data first, then performing feature selection using only training data), accuracy correctly fell to chance level (0.5) [49].
Table 1: Impact of Data Leakage on Model Performance
| Scenario | Feature Selection Method | Test Accuracy | Interpretation |
|---|---|---|---|
| Incorrect | Pre-splitting selection using all data | 0.76 | Severely inflated |
| Correct | Post-splitting selection using only training data | 0.50 | Accurate (chance) |
| Pipeline-based | Automated separation via scikit-learn Pipeline | 0.50 | Accurate (chance) |
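The synthetic experiment above can be reproduced with a numpy-only sketch. A nearest-centroid classifier stands in for the model, and the dimensions, seeds, and feature counts are arbitrary; the only change between the two conditions is whether feature selection sees the test labels:

```python
import numpy as np

def run_trial(seed, leak, n=100, p=2000, k=10):
    """Nearest-centroid test accuracy on purely random data, with the
    top-k features chosen by |correlation| with the labels. With
    leak=True, feature selection (wrongly) sees the test labels too."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    y = rng.integers(0, 2, size=n)
    tr, te = np.arange(n // 2), np.arange(n // 2, n)

    sel = np.arange(n) if leak else tr                 # the only difference
    corr = np.abs(np.corrcoef(X[sel].T, y[sel])[-1, :-1])
    feats = np.argsort(corr)[-k:]

    c0 = X[tr][y[tr] == 0][:, feats].mean(axis=0)      # centroids fit on
    c1 = X[tr][y[tr] == 1][:, feats].mean(axis=0)      # training data only
    d0 = ((X[te][:, feats] - c0) ** 2).sum(axis=1)
    d1 = ((X[te][:, feats] - c1) ** 2).sum(axis=1)
    return ((d1 < d0) == y[te]).mean()

leaked = np.mean([run_trial(s, leak=True) for s in range(5)])
correct = np.mean([run_trial(s, leak=False) for s in range(5)])
print(leaked, correct)   # leaked accuracy sits well above chance
```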
Inconsistent application of preprocessing transformations between training and testing phases creates a mismatch between the data distributions the model was trained on and those it encounters during deployment [49] [50].
Normalization Pitfall: A common error occurs when normalization parameters (e.g., mean and standard deviation for StandardScaler, min and max for MinMaxScaler) are calculated using the entire dataset before splitting, rather than being fit solely on training data and then applied to the test set. This subtly introduces test set information into the training process [50].
Concrete Example: In a polynomial regression predicting house prices from square footage, when training data was normalized using parameters from the complete dataset (including test observations), the model achieved deceptively good performance. When proper protocol was followed (normalization parameters calculated from training data only), performance on the true test set more accurately reflected real-world generalizability [50].
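A minimal numpy sketch of the normalization pitfall, using invented data; the point is only that the "wrong" parameters encode test-set information:

```python
import numpy as np

rng = np.random.default_rng(42)
prices = rng.normal(loc=5.0, scale=2.0, size=200)   # invented target values
train, test = prices[:150], prices[150:]

# Wrong: scaler parameters computed on the full dataset, so the test
# observations silently influence the transform applied during training.
mu_all, sd_all = prices.mean(), prices.std()

# Right: parameters fit on the training split only, then reused unchanged.
mu_tr, sd_tr = train.mean(), train.std()
train_scaled = (train - mu_tr) / sd_tr
test_scaled = (test - mu_tr) / sd_tr

# The two parameter sets differ: the "wrong" ones leak test-set statistics.
print(mu_all - mu_tr, sd_all - sd_tr)
```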
Impact on Affinity Prediction: For empirical scoring functions that rely on feature descriptors capturing essential interaction features between proteins and ligands [5], inconsistent preprocessing creates models that appear accurate during benchmarking but fail to generalize across diverse protein families or structural motifs.
The management of randomness through random_state parameters significantly impacts the reproducibility and reliability of benchmarking results [49].
Random State Rules:
- If an integer is passed, calling `fit` or `split` multiple times always yields the same results
- If `None` or a `RandomState` instance is passed, `fit` and `split` yield different results each time
- For robust cross-validation, pass separate `RandomState` instances when creating estimators, or leave `random_state` set to `None` [49]

Benchmarking Implications: Inconsistent handling of randomness across different stages of model evaluation (e.g., during cross-validation splits versus final model training) introduces uncontrolled variance that can artificially enhance or depress perceived performance, compromising comparisons between different scoring functions.
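These semantics can be illustrated with a toy splitter that mimics scikit-learn's handling of `random_state`. This is a simplified sketch, not scikit-learn's actual implementation:

```python
import numpy as np

def toy_split(n, random_state=None):
    """Toy train/test splitter mirroring scikit-learn's random_state
    semantics (simplified sketch): an int seed is reproducible on every
    call, while a RandomState instance advances between calls."""
    rng = (random_state if isinstance(random_state, np.random.RandomState)
           else np.random.RandomState(random_state))
    idx = rng.permutation(n)
    return idx[: n // 2], idx[n // 2:]

# Integer seed: every call yields the identical split.
a, _ = toy_split(100, random_state=42)
b, _ = toy_split(100, random_state=42)

# A RandomState instance advances its internal state, so calls differ.
state = np.random.RandomState(7)
c, _ = toy_split(100, random_state=state)
d, _ = toy_split(100, random_state=state)
print((a == b).all(), (c == d).all())
```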
Protocol 1: Strict Separation Workflow

1. Split the dataset into training and test sets before any data-dependent operation.
2. Fit all preprocessing steps (feature selection, normalization, dimensionality reduction) on the training set only.
3. Apply the fitted transformations, unchanged, to the test set.
4. Train the model exclusively on the transformed training data and evaluate once on the held-out test set.
Validation Step: To verify proper separation, use synthetic tests with randomized targets to ensure models achieve expected chance performance when no true relationship exists [49].
The most robust defense against data leakage is implementing a unified pipeline that encapsulates all preprocessing and modeling steps [49].
scikit-learn Implementation:
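A minimal sketch of such a pipeline, assuming scikit-learn is available. Random features with random targets serve as a negative control: because every preprocessing step is fit inside each cross-validation fold on training data only, accuracy should sit at chance:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))      # random features
y = rng.integers(0, 2, size=200)     # random targets: no true signal

# Scaling and feature selection are encapsulated in the pipeline, so
# fit/fit_transform only ever see the training portion of each CV fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])

acc = cross_val_score(pipe, X, y, cv=5).mean()
print(round(acc, 3))   # near 0.5: no leakage-driven inflation
```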
Advantages:
- Guarantees that `fit_transform` is applied only to training data during cross-validation, preventing test-fold information from leaking into preprocessing

Classical scoring functions demonstrate a specific failure mode known as inter-protein scoring noise: while capable of enriching active molecules for a single protein target, they fail to identify the correct protein target for a given active molecule due to scoring variation between different binding pockets [9].
Benchmarking Implications: Traditional train-test splits that randomly assign protein-ligand complexes across targets fail to adequately assess this critical capability. A more rigorous approach involves leave-one-target-out validation, where all complexes for specific protein targets are held out during training.
Recent Assessment: In evaluations of the Boltz-2 biomolecular foundation model, while initial claims suggested performance approaching free-energy perturbation in estimating binding affinity, the model failed to correctly identify protein targets for active molecules when tested on the LIT-PCBA benchmark set for target identification [9].
Table 2: Essential Research Reagents for Affinity Prediction Benchmarking
| Reagent/Solution | Function | Considerations |
|---|---|---|
| LIT-PCBA dataset | Virtual screening benchmark built from experimentally confirmed actives and inactives [9] | Tests ability to identify correct protein target for active molecules |
| PDBbind database | Curated collection of protein-ligand complexes with binding affinity data [5] | Provides experimental structures and affinity data for training empirical scoring functions |
| Boltz-2 model | Biomolecular foundation model for affinity prediction [9] | Reference for comparing new methods; demonstrates current limitations |
| Classical scoring functions | Empirical functions using linear regression or machine learning [5] | Baseline for method comparison; exhibit known limitations |
Table 3: Comprehensive Benchmarking Metrics for Affinity Prediction
| Metric Category | Specific Metrics | Interpretation Guidelines |
|---|---|---|
| Target Identification | Success rate in identifying correct protein target for active molecules [9] | Primary metric for generalizability; should exceed 0.5 for useful methods |
| Affinity Accuracy | Root mean square error (RMSE) between predicted and experimental binding affinities [5] | Context-dependent; must be compared to state-of-the-art and classical baselines |
| Ranking Capability | Enrichment factors, ROC curves, AUC values [5] | Measures utility for virtual screening applications |
| Cross-Target Performance | Variance in performance across different protein families [9] | Lower variance indicates better generalizability |
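Two of the ranking metrics above, enrichment factor and ROC AUC, can be implemented directly in numpy. The screening data below is invented, and tie handling is omitted for brevity:

```python
import numpy as np

def enrichment_factor(scores, labels, frac=0.01):
    """EF@frac: hit rate among the top-scoring fraction of the library,
    divided by the overall hit rate. labels: 1 = active, 0 = decoy."""
    n_top = max(1, int(round(frac * len(scores))))
    order = np.argsort(scores)[::-1]
    return labels[order[:n_top]].mean() / labels.mean()

def roc_auc(scores, labels):
    """ROC AUC via the rank-sum (Mann-Whitney) identity; assumes no ties."""
    ranks = np.argsort(np.argsort(scores)) + 1.0
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Toy screen: 10 actives hidden among 1000 compounds, actives score higher.
rng = np.random.default_rng(1)
labels = np.zeros(1000, dtype=int)
labels[:10] = 1
scores = rng.normal(size=1000) + 3.0 * labels
print(enrichment_factor(scores, labels), roc_auc(scores, labels))
```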
Quality Control Protocol:

1. Run negative controls with randomized targets to confirm that models score at chance when no true relationship exists [49].
2. Use leave-one-target-out splits to test generalization across protein targets rather than random complex-level splits [9].
3. Report all preprocessing steps, data splitting strategies, and random seeds to enable independent reproduction [48].
The inflation of perceived performance through improper train-test management represents a critical barrier to genuine progress in affinity prediction research. The field's continued reliance on benchmarks vulnerable to data leakage and preprocessing inconsistencies perpetuates the development of methods that excel in artificial testing environments but fail in practical applications—particularly for challenging problems like inter-protein scoring noise [9].
Addressing these issues requires both technical corrections in implementation and cultural shifts in evaluation standards. The adoption of pipeline-based approaches, comprehensive negative controls, and more rigorous benchmarking sets that specifically test generalizability across protein targets will enable more accurate assessment of true methodological advances. Furthermore, increased transparency in reporting preprocessing methodologies, data splitting strategies, and randomization protocols will facilitate more meaningful comparisons between different scoring functions [48].
Only through such rigorous attention to the fundamentals of machine learning evaluation can the field overcome the current limitations of classical scoring functions and produce genuinely reliable affinity prediction methods capable of accelerating drug discovery and development.
The accurate prediction of drug-target binding affinity (DTA) is a cornerstone of modern computational drug discovery. While classical scoring functions and contemporary deep learning models have advanced this field, their performance remains fundamentally constrained by the quality and composition of the training data upon which they are built. Redundancy—the overrepresentation of similar protein sequences and ligand structures in training datasets—introduces significant bias, reduces model generalizability, and ultimately limits real-world applicability [51] [7].
The limitations of classical scoring functions (e.g., force-field-based, empirical, and knowledge-based) are well-documented, particularly their struggle with generalization across diverse protein families and ligand classes [7]. These functions often exhibit predictive bias toward specific target types with abundant structural data, such as soluble proteins, while performing poorly on membrane proteins like G protein-coupled receptors (GPCRs) and ion channels, which are crucial drug targets but structurally underrepresented [10] [11]. This bias stems directly from redundant training sets that fail to adequately represent the structural diversity of biological targets. As the field transitions toward data-driven machine learning and deep learning approaches, addressing dataset redundancy becomes increasingly critical for developing robust predictive models with genuine utility in drug discovery pipelines.
Redundancy in drug-target affinity data manifests primarily through two channels: sequence redundancy in target proteins and structural redundancy in ligand compounds. The former occurs when training sets contain multiple similar protein sequences from the same family, while the latter arises from numerous structurally analogous compounds. This redundancy creates models that perform exceptionally well on familiar data but fail to generalize to novel targets or chemotypes—a significant problem for drug discovery where innovation precisely targets novel biology and chemistry [51] [36].
The standard threshold-based algorithm (used in tools like CD-HIT, PISCES, and UCLUST) for selecting representative subsets often exacerbates these issues. This approach applies a heuristic threshold where sequences are added to a representative set only if no existing member shares similarity above the threshold (typically 40% or 90% sequence identity). This method has two critical drawbacks: it ignores all similarities below the specified threshold, potentially selecting representatives with similarities very close to the cutoff, and it provides no guarantees about the final set size, which is crucial for downstream applications [51].
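The threshold-based algorithm can be sketched in a few lines over a precomputed similarity matrix. The data is hypothetical, and real tools such as CD-HIT and UCLUST compare sequences directly with additional heuristics (length sorting, word filtering):

```python
import numpy as np

def threshold_representatives(sim, cutoff=0.4):
    """CD-HIT-style greedy pass: keep a sequence only if its similarity
    to every representative chosen so far is below the cutoff."""
    reps = []
    for i in range(sim.shape[0]):
        if all(sim[i, j] < cutoff for j in reps):
            reps.append(i)
    return reps

rng = np.random.default_rng(3)
m = rng.uniform(0.1, 0.9, size=(50, 50))   # hypothetical pairwise identities
sim = (m + m.T) / 2
np.fill_diagonal(sim, 1.0)

reps = threshold_representatives(sim, cutoff=0.4)
# Note the drawbacks described in the text: sub-cutoff similarities are
# ignored entirely, and the output size is whatever this single greedy
# pass happens to produce.
print(reps)
```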
Table 1: Comparison of Representative Subset Selection Methods for Protein Sequences
| Method | Core Algorithm | Advantages | Limitations |
|---|---|---|---|
| Threshold Algorithm (CD-HIT, UCLUST) | Heuristic threshold-based selection | Fast computation; widely adopted | No theoretical guarantees; ignores sub-threshold similarities; unstable output size |
| Submodular Optimization (Repset) | Discrete optimization with diminishing returns property | Provable theoretical guarantees; maximizes structural diversity; flexible objective functions | Computationally more intensive; requires pairwise similarity calculations |
| Clustering-based Methods | Group sequences then select exemplars | Intuitive grouping structure | Exemplar selection may not optimize global diversity |
Experimental evidence demonstrates that submodular optimization approaches consistently yield protein sequence subsets with greater structural diversity than sets chosen by existing threshold-based methods. When evaluated against the structural classification of proteins (SCOPe) library as a gold standard, submodular optimization selects sequence subsets that include more SCOPe domain families than sets of the same size selected by competing approaches [51].
Submodular optimization provides a mathematical framework for representative subset selection with theoretical guarantees. A submodular function exhibits the property of "diminishing returns"—the incremental value of adding a sequence to a representative set decreases as the set grows. This property makes these functions amenable to efficient optimization with provable approximation guarantees [51].
The fundamental approach involves defining a submodular objective function that quantifies the quality of a candidate representative subset, then applying optimization algorithms to identify a subset that maximizes this function. Formally, a set function f: 2^S → ℝ is submodular if for every A ⊆ B ⊆ S and s ∈ S \ B, it holds that:

f(A ∪ {s}) − f(A) ≥ f(B ∪ {s}) − f(B)
This mathematical framework enables the development of objective functions that simultaneously maximize representativeness (ensuring every sequence in the full set has a similar representative) and minimize redundancy (ensuring selected representatives are diverse) [51].
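A toy illustration of this framework is greedy maximization of the facility-location objective, a canonical monotone submodular function. The similarity matrix is invented, and Repset's actual mixture objectives over PSI-BLAST similarities differ; this sketch only demonstrates the diminishing-returns machinery:

```python
import numpy as np

def greedy_facility_location(sim, k):
    """Greedy maximization of f(A) = sum_i max_{j in A} sim[i, j], a
    monotone submodular objective; the greedy solution is guaranteed
    to be within (1 - 1/e) of optimal."""
    n = sim.shape[0]
    selected, gains = [], []
    covered = np.zeros(n)                 # current best similarity per item
    for _ in range(k):
        # marginal gain of each candidate j given the current selection
        marginal = np.maximum(sim, covered[:, None]).sum(axis=0) - covered.sum()
        marginal[selected] = -np.inf      # never reselect
        j = int(np.argmax(marginal))
        gains.append(float(marginal[j]))
        covered = np.maximum(covered, sim[:, j])
        selected.append(j)
    return selected, gains

rng = np.random.default_rng(0)
m = rng.uniform(size=(60, 60))
sim = (m + m.T) / 2                       # hypothetical sequence similarities
np.fill_diagonal(sim, 1.0)

reps, gains = greedy_facility_location(sim, k=5)
print(reps)
# Diminishing returns in action: successive marginal gains never increase.
```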
Table 2: Key Research Reagent Solutions for Non-Redundant Dataset Construction
| Tool/Resource | Type | Primary Function | Application Context |
|---|---|---|---|
| Repset | Software package | Submodular optimization for representative sequence selection | Creating non-redundant protein sequence sets for model training |
| PSI-BLAST | Algorithm | Protein similarity search and alignment | Calculating pairwise similarities for optimization input |
| SCOPe Library | Database | Structural classification of proteins | Gold standard for evaluating structural diversity |
| PubChem/ChEMBL | Database | Repository of chemical molecules and bioactivities | Source for ligand structures and binding affinities |
| PaDEL Descriptors | Software | Molecular descriptor calculation | Featurization of ligand compounds for diversity analysis |
The optimization framework allows for designing specialized objective functions tailored to specific research needs. For instance, a mixture objective function can be created that performs well for both large and small representative sets, addressing a key limitation of threshold-based approaches. Similarly, hybrid functions can incorporate sequence length preferences, encouraging the selection of longer sequences when those are desirable for downstream applications [51].
For drug-target affinity prediction, this approach can be extended to handle both protein and ligand redundancy simultaneously. Molecular descriptors associated with molecular vibrations—including E-state descriptors, autocorrelation descriptors, and topological descriptors—can be screened to represent ligand diversity, while protein sequence descriptors capture target diversity [10] [11]. By treating the molecule-target pair as a whole system, researchers can create comprehensive non-redundant datasets for affinity prediction [11].
Rigorous evaluation of non-redundant training sets requires standardized metrics and benchmark datasets. For protein sequence sets, structural diversity measured against reference databases like SCOPe provides a key validation metric. For affinity prediction tasks, standard benchmarks include the Davis kinase binding affinity dataset (containing 442 proteins and 68 drugs with 30,056 interactions) and the KIBA dataset (containing 229 proteins and 2,111 drugs with 118,254 interactions) [52].
These benchmark datasets address data heterogeneity concerns, with Smith-Waterman similarity analysis showing that 92% of protein pairs in the Davis dataset and 99% in the KIBA dataset have sequence similarity of at most 60%, indicating inherent non-redundancy [52]. Performance metrics should include both predictive accuracy (measured via Mean Squared Error/MSE and Concordance Index/CI) and generalizability to novel targets and compounds.
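For reference, the concordance index can be computed directly from its definition. This is an O(n²) sketch; efficient implementations use sorting:

```python
def concordance_index(y_true, y_pred):
    """CI: fraction of comparable pairs (distinct true affinities) whose
    predicted ordering matches the true ordering; prediction ties count
    as 0.5."""
    concordant, comparable = 0.0, 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue                  # pair not comparable
            comparable += 1
            d = (y_pred[i] - y_pred[j]) * (y_true[i] - y_true[j])
            concordant += 1.0 if d > 0 else (0.5 if d == 0 else 0.0)
    return concordant / comparable

# Perfectly ordered predictions give CI = 1.0.
print(concordance_index([5.1, 6.3, 7.8], [5.0, 6.0, 7.0]))
```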
Contemporary deep learning models for DTA prediction demonstrate the critical importance of proper dataset construction. Models like DeepDTA, GraphDTA, and ImageDTA show significantly different performance characteristics when trained and evaluated on properly constructed non-redundant datasets [36] [52].
For example, ImageDTA—which treats word vector-encoded SMILES strings as images and processes them with multiscale 2D convolutional neural networks—achieves superior performance on benchmark datasets through architectural innovations that better capture structural information while minimizing information loss common in pooling operations [52]. This approach demonstrates MSE values of 0.214 and CI values of 0.890 on the Davis dataset, outperforming many traditional approaches while maintaining greater interpretability [52].
The implementation of non-redundant training sets directly addresses critical limitations in classical scoring function approaches. Classical methods—including physics-based, empirical, and knowledge-based scoring functions—often struggle with accuracy and applicability because they were frequently developed and parameterized using limited, redundant datasets [7]. This has restricted their effectiveness, particularly for membrane protein targets and novel compound classes [10].
Modern multitask learning frameworks like DeepDTAGen, which simultaneously predict drug-target binding affinities and generate novel target-aware drug variants, demonstrate the power of diverse training data. These models leverage shared feature spaces for both tasks, but their effectiveness depends critically on comprehensive training data that adequately represents the chemical and biological space of interest [36]. The development of specialized optimization algorithms, such as FetterGrad, to mitigate gradient conflicts in multitask learning further enhances model performance when trained on well-constructed datasets [36].
The construction of diverse, non-redundant training sets represents a fundamental prerequisite for advancing drug-target affinity prediction beyond the limitations of classical scoring functions. Methodologies based on submodular optimization provide a mathematically rigorous framework with theoretical guarantees for selecting representative subsets that maximize structural diversity. When integrated with modern deep learning architectures and comprehensive benchmarking, these approaches enable the development of predictive models with significantly enhanced accuracy, interpretability, and real-world applicability across diverse target classes and compound libraries.
As the field progresses, future work should focus on developing standardized non-redundancy benchmarks, optimizing computational efficiency for large-scale dataset construction, and creating integrated frameworks that simultaneously address redundancy in both target and compound spaces. Through these advances, the drug discovery community can overcome one of the most persistent limitations in computational affinity prediction and accelerate the development of novel therapeutic agents.
The prediction of binding affinity between small molecule drugs and their target proteins is a cornerstone of computational drug discovery. For decades, this field was dominated by classical scoring functions, which are limited by their reliance on simplified physical models and their inability to learn from large-scale data. The rise of machine learning (ML), particularly deep learning, has ushered in a paradigm shift, overcoming these constraints through data-driven approaches that capture complex patterns in molecular structures and interactions. This technical review examines the fundamental limitations of classical methods and delineates how modern ML architectures—including convolutional neural networks (CNNs), graph neural networks (GNNs), and transformer-based models—are achieving superior predictive performance. We provide a quantitative analysis of model capabilities, detailed experimental protocols for benchmarking, and visualization of key workflows. The transition to ML represents a fundamental advancement in the accuracy and efficiency of binding affinity prediction, with profound implications for accelerating drug discovery.
Classical scoring functions have been the workhorse of structure-based virtual screening for predicting protein-ligand binding affinity. These methods are generally categorized into three groups: force-field-based, empirical, and knowledge-based functions [28]. Despite their long-standing utility, they share critical limitations that have constrained their predictive accuracy and generalizability.
A primary shortcoming is their dependence on hand-crafted parameters and simplified physical models. Classical functions often rely on linear regression models that cannot assimilate large amounts of structural and binding data, limiting their capacity to capture the complex, non-linear relationships governing molecular interactions [18] [31]. Furthermore, their performance plateau in virtual screening and binding affinity prediction has been extensively documented; they show limited accuracy in predicting binding affinities for protein-ligand poses [28] [18].
Perhaps the most significant recent revelation is the problem of train-test data leakage and dataset redundancy. Studies have shown that the impressive benchmark performance of many models, including modern deep learning approaches, is artificially inflated due to structural similarities between the training set (e.g., PDBbind) and standard test benchmarks (e.g., the Comparative Assessment of Scoring Functions or CASF) [28]. One analysis found that nearly 50% of training complexes are part of a similarity cluster, and when trained on a properly filtered dataset (PDBbind CleanSplit), the performance of state-of-the-art models drops substantially, revealing that previous high scores were largely driven by data leakage rather than genuine generalization [28]. This indicates that the true generalization capability of many scoring functions has been systematically overestimated.
Machine learning models overcome the fundamental constraints of classical approaches by learning directly from data rather than relying on pre-defined physical equations. This data-driven paradigm allows them to discover intricate patterns in protein-ligand complexes that are intractable for classical functions.
The development of ML models for affinity prediction has progressed through several stages, each introducing more sophisticated architectural components.
Table: Evolution of Deep Learning Models for Affinity Prediction
| Model Era | Representative Architectures | Typical Input Representations | Key Innovations |
|---|---|---|---|
| Early Deep Learning | CNNs, RNNs [38] | SMILES strings, amino acid sequences [38] | Moving beyond manual feature engineering to automated feature learning from primary structures. |
| Graph-Based Models | GNNs (e.g., GraphDTA) [38] [36] | Molecular graphs for drugs, sequences or graphs for proteins [53] | Representing molecules as graphs to explicitly model atomic bonds and topology. |
| Attention & Transformer Models | Transformers, Self-Attention Mechanisms [38] | SMILES, sequences, often augmented with language model embeddings [38] | Capturing long-range dependencies and utilizing transfer learning from large language models (e.g., ProtBERT, ChemBERTa). |
| Multimodal & Hybrid Models | GNNs + Transformers, Diffusion Models [53] [36] [32] | 3D structures, sequences, graphs, and interaction networks [53] | Integrating multiple input representations and model types for a more holistic view of the complex. |
Diagram 1: The methodological evolution of binding affinity prediction models.
Quantitative benchmarking reveals the significant performance gap between classical and ML-based scoring functions. The following tables synthesize key metrics from rigorous evaluations.
Table: Virtual Screening Performance on DUD-E Benchmark (102 Targets)
| Scoring Function | Type | Hit Rate (Top 1%) | Hit Rate (Top 0.1%) | Notes |
|---|---|---|---|---|
| RF-Score-VS [18] | Machine Learning (Random Forest) | 55.6% | 88.6% | Trained on 15,426 active & 893,897 inactive molecules. |
| AutoDock Vina [18] | Classical | 16.2% | 27.5% | Used as a baseline for comparison. |
Table: Binding Affinity Prediction Performance (Pearson R)
| Model / Scenario | Trained on Standard PDBbind | Trained on PDBbind CleanSplit [28] | Performance Drop |
|---|---|---|---|
| Typical Deep Learning Model | High (e.g., R ~0.8+)* | Substantially Lower | Highlights effect of data leakage. |
| GEMS (GNN with LLM Transfer) [28] | Not Applicable | State-of-the-art | Maintains high performance on cleaned data. |
| Classical SF (e.g., Vina) [18] | - | R ≈ -0.18 | Poor correlation with experimental affinity. |
Note: Exact values for models suffering from leakage are omitted as they are considered unreliable [28].
Beyond affinity prediction, ML models excel in other docking tasks. A 2025 multidimensional evaluation of docking methods categorized them into four performance tiers based on pose accuracy and physical validity [32].
This study found that while generative diffusion models achieved superior pose accuracy (e.g., SurfDock RMSD ≤ 2 Å success rate >70%), traditional methods consistently excelled in producing physically plausible poses (PB-valid rates >94%) [32]. This highlights a current challenge for pure DL docking methods.
To ensure robust and generalizable model development, researchers must adopt rigorous experimental protocols that address common pitfalls like data leakage.
Objective: To generate a training dataset free of structural similarities with standard test sets, enabling a genuine assessment of model generalization [28].
Methodology:

1. Compute pairwise structural similarities between all training complexes (e.g., PDBbind) and all benchmark test complexes (e.g., CASF) [28].
2. Identify similarity clusters that link the training and test sets.
3. Remove every training complex that closely resembles any test complex, yielding a leakage-free training set (PDBbind CleanSplit) [28].
4. Retrain models on the filtered set and compare benchmark performance against models trained on the original data to quantify the contribution of leakage.
Objective: To simultaneously predict drug-target binding affinity and generate novel, target-aware drug molecules using a shared feature space, as exemplified by the DeepDTAGen framework [36].
Methodology:

1. Encode drugs and targets into a shared feature space consumed by both the affinity-prediction and molecule-generation heads [36].
2. Train both tasks jointly, mitigating gradient conflicts between the multitask objectives with a specialized optimizer such as FetterGrad [36].
3. Evaluate predicted affinities on held-out benchmarks and assess generated target-aware drug variants for validity and novelty.
Diagram 2: Multitask learning framework for affinity prediction and drug generation.
Successful development and benchmarking of ML models for affinity prediction rely on a suite of public databases, software tools, and computational resources.
Table: Essential Resources for Binding Affinity Research
| Resource Name | Type | Primary Function | Key Features / Usage |
|---|---|---|---|
| PDBbind [53] [31] | Database | Provides curated protein-ligand complexes with experimental binding affinity data. | Core dataset for training and testing; includes 3D structures and Kd, Ki, or IC50 values. |
| CASF Benchmark [28] [31] | Benchmark | Standardized benchmark for scoring function evaluation. | Used for rigorous testing; must be used with care to avoid data leakage with PDBbind. |
| BindingDB [53] [36] | Database | Public database of measured binding affinities. | Provides a large volume of interaction data for training and validation. |
| DUD-E [18] | Benchmark | Directory of useful decoys for virtual screening evaluation. | Contains known actives and property-matched decoys for 102 targets to test screening power. |
| Graph Neural Networks (GNNs) [28] [36] | Software/Algorithm | Models molecular structure as graphs for feature learning. | Represents drugs as graphs of atoms (nodes) and bonds (edges) to capture structural information. |
| ProtInter [54] | Software Tool | Calculates non-covalent interactions from protein complex PDB files. | Used for feature engineering in traditional ML; quantifies hydrogen bonds, hydrophobic interactions, etc. |
| PDBbind CleanSplit [28] | Dataset | A leakage-free version of PDBbind. | Essential for training models to ensure generalizability is not overestimated. |
The rise of machine learning has fundamentally transformed the landscape of binding affinity prediction. Data-driven models have demonstrably overcome the performance plateau of classical scoring functions by leveraging large datasets and advanced architectures to capture the complex physics of molecular interactions. However, this field continues to evolve rapidly, with several critical frontiers on the horizon.
The transition from classical to machine learning-based scoring functions marks a definitive maturation in computational drug discovery. By directly confronting the limitations of hand-crafted physics and linear models, ML approaches have established a new paradigm defined by learning, adaptability, and superior predictive power. As the field addresses current challenges surrounding data bias, generalizability, and interpretability, machine learning is poised to become an even more indispensable tool, accelerating the delivery of life-saving therapeutics.
The adoption of machine-learning scoring functions (ML-SFs) for protein-ligand binding affinity prediction represents a paradigm shift in structure-based drug design. While benchmark studies frequently report superior performance of ML-SFs over classical scoring functions, a closer examination reveals significant gaps between benchmark performance and real-world applicability. This review synthesizes recent evidence demonstrating how data leakage, dataset biases, and evaluation methodologies have systematically inflated perceived ML-SF performance. We analyze methodological advances for creating leakage-free benchmarks, explore the generalizability challenges of current approaches, and provide a technical framework for rigorous SF evaluation. The findings underscore that despite impressive benchmark metrics, most ML-SFs still struggle with target identification and generalization to novel protein families—critical requirements for successful drug discovery applications.
Accurate prediction of protein-ligand binding affinity is fundamental to computational drug discovery. The field has witnessed a rapid transition from classical scoring functions (based on physical principles, empirical data, or knowledge-based statistics) to machine-learning scoring functions (ML-SFs) that leverage complex patterns in structural data. Published literature often shows ML-SFs achieving remarkable performance on standard benchmarks, suggesting a dramatic improvement over classical approaches. However, a growing body of evidence indicates that these performance gains may be substantially overstated due to fundamental flaws in benchmarking methodologies.
The core issue lies in what has been termed the "inter-protein scoring noise problem" – while classical scoring functions can often enrich active molecules for a specific protein target, they frequently fail to identify the correct protein target for a given active molecule due to scoring variations between different binding pockets [9]. A truly robust binding affinity prediction method should overcome this limitation by demonstrating both capabilities. Recent investigations have revealed that the standard practice of training on the PDBbind database and testing on the Comparative Assessment of Scoring Functions (CASF) benchmark has created a situation of widespread train-test data leakage, severely inflating performance metrics and leading to overestimation of model generalization capabilities [28].
This technical review examines the performance gaps between classical and machine-learning scoring functions through a critical lens, focusing on the benchmarking realities that have obscured their true capabilities. We present comprehensive quantitative comparisons, analyze methodologies for proper model evaluation, and provide a framework for future development of more robust affinity prediction tools.
Recent investigations have uncovered substantial data leakage between standard training and test sets used in scoring function development. When models are trained on the PDBbind database and evaluated on the CASF benchmark, their performance metrics become significantly inflated due to structural similarities between the datasets [28].
Table 1: Quantifying Data Leakage Between PDBbind and CASF Benchmarks
| Metric | Before Filtering | After CleanSplit Filtering |
|---|---|---|
| Similar CASF complexes | 49% of all CASF complexes | Structurally distinct complexes |
| Similar training complexes | ~600 complexes | Removed from training set |
| Training set size | Full PDBbind | Reduced by 11.8% |
| Ligand-based leakage | Present (Tanimoto > 0.9) | Eliminated |
A structure-based clustering algorithm analyzing protein similarity (TM-scores), ligand similarity (Tanimoto scores), and binding conformation similarity (pocket-aligned ligand RMSD) identified that nearly 600 high-similarity pairs exist between PDBbind training and CASF complexes, affecting 49% of all CASF test complexes [28]. These similarities enable models to achieve high benchmark performance through memorization rather than genuine understanding of protein-ligand interactions.
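The pairwise similarity test underlying such clustering can be sketched in pure Python. The thresholds below are illustrative assumptions (the published algorithm's exact cutoffs may differ), and the TM-score and pocket-aligned RMSD are taken as precomputed inputs.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity between two fingerprint bit sets."""
    union = len(fp_a) + len(fp_b) - len(fp_a & fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def is_leaky_pair(tm_score: float, ligand_tanimoto: float, pocket_rmsd: float,
                  tm_cut: float = 0.8, tan_cut: float = 0.9,
                  rmsd_cut: float = 2.0) -> bool:
    """Flag a train/test complex pair as 'leaky' when protein similarity
    (TM-score), ligand similarity (Tanimoto), and binding-conformation
    similarity (pocket-aligned RMSD) all exceed their thresholds.
    Threshold values are assumptions for illustration."""
    return (tm_score >= tm_cut
            and ligand_tanimoto >= tan_cut
            and pocket_rmsd <= rmsd_cut)
```

Requiring all three criteria simultaneously is what lets the filter catch complexes with similar interaction patterns while sparing pairs that merely share a protein fold or a ligand scaffold.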
The impact of this leakage is substantial. When state-of-the-art models like GenScore and Pafnucy were retrained on a cleaned dataset with reduced leakage, their performance dropped markedly, indicating that their previously reported excellence was largely driven by data leakage rather than true generalization capability [28]. This pattern extends beyond specific architectures and suggests a systemic issue in the field's evaluation methodologies.
To address data leakage, researchers have proposed PDBbind CleanSplit, a training dataset curated by a structure-based filtering algorithm that eliminates train-test data leakage as well as redundancies within the training set [28]. The filtering employs a multi-stage approach that assesses protein similarity (TM-scores), ligand similarity (Tanimoto scores), and binding conformation similarity (pocket-aligned ligand RMSD) before removing offending complexes.
The algorithm revealed that nearly 50% of training complexes are part of similarity clusters, meaning random splitting inadvertently inflates validation performance as models can match validation complexes with similar training examples [28]. By addressing both train-test leakage and internal redundancies, CleanSplit provides a more rigorous foundation for model development and evaluation.
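A greedy version of this two-stage filtering might look like the following sketch, where `similar` stands in for the full multi-metric structural comparison; the real clustering algorithm is more sophisticated than this one-representative-per-cluster heuristic.

```python
def clean_split(train: list, test: list, similar) -> list:
    """Two-stage filter in the spirit of CleanSplit (simplified sketch).
    Stage 1: drop training complexes similar to any test complex.
    Stage 2: thin internal redundancy by greedily keeping one
    representative per similarity cluster.
    `similar(a, b)` is a caller-supplied predicate standing in for the
    multi-metric structural comparison."""
    # Stage 1: remove train-test leakage.
    no_leak = [t for t in train if not any(similar(t, s) for s in test)]
    # Stage 2: remove internal redundancy (greedy representative selection).
    kept = []
    for t in no_leak:
        if not any(similar(t, k) for k in kept):
            kept.append(t)
    return kept


# Toy demo: integers as stand-in "complexes", similarity = within 1 unit.
filtered = clean_split([1, 2, 5, 9], [10], lambda a, b: abs(a - b) <= 1)
```

In the toy call, complex `9` is dropped for test-set similarity and `2` for redundancy with `1`, mirroring the two removal categories reported for CleanSplit.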
When evaluated under leakage-free conditions, the performance gap between classical and machine-learning SFs narrows considerably. The table below summarizes comparative performance across multiple benchmarking scenarios:
Table 2: Performance Comparison of Scoring Function Types Under Different Evaluation Paradigms
| Scoring Function Type | CASF2016 RMSE (Traditional Split) | CASF2016 RMSE (CleanSplit) | Target Identification Accuracy | Generalization to Novel Targets |
|---|---|---|---|---|
| Classical SFs | 1.45-1.85 | 1.50-1.90 | Limited | Moderate |
| ML-SFs (Standard Training) | 1.15-1.35 | 1.40-1.75 | Limited | Poor to Moderate |
| ML-SFs (CleanSplit Training) | N/A | 1.20-1.50 | Improved | Moderate to Good |
| GEMS (GNN with CleanSplit) | N/A | 1.28 | Not reported | Good |
The performance degradation of ML-SFs when moving from standard benchmarks to more rigorous evaluations is particularly revealing. For instance, the graph neural network model GEMS (Graph neural network for Efficient Molecular Scoring) maintains higher benchmark performance when trained on CleanSplit, achieving a Pearson R of 0.856 on CASF2016 compared to significantly lower correlations for other models retrained on the same dataset [28]. This suggests that architectural choices and training strategies significantly impact genuine generalization capability.
A critical test for any binding affinity prediction method is its ability to identify the correct protein target for a given active molecule – a capability that remains challenging for both classical and ML approaches. Researchers have developed a new benchmark for target identification based on LIT-PCBA to evaluate whether modern models can correctly identify targets of active molecules [9].
Strikingly, even advanced models like Boltz-2, which claimed to approach the performance of free-energy perturbation in estimating binding affinity, cannot reliably identify the correct protein target by predicting higher binding affinity compared to decoy targets [9]. This failure occurs despite promising performance on traditional affinity prediction benchmarks, highlighting a fundamental limitation in current approaches.
This target identification challenge represents what researchers have termed "the next major hurdle to successful deep-learning-based affinity prediction using protein-ligand complexes" [9]. Any model truly capable of accurate binding affinity prediction should perform well on target-prediction benchmark tasks, a standard that most current ML-SFs fail to meet.
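A target-identification benchmark of this kind reduces to a simple ranking check: for each active molecule, does the true target receive the highest predicted affinity among all candidate targets? A minimal sketch (data structures here are illustrative, not the LIT-PCBA benchmark's actual format):

```python
def target_id_accuracy(predictions: dict, true_targets: dict) -> float:
    """predictions: molecule -> {target: predicted affinity}, where
    higher values mean stronger predicted binding.
    true_targets: molecule -> its experimentally known target.
    Returns the fraction of molecules whose true target outranks
    all decoy targets."""
    hits = 0
    for mol, scores in predictions.items():
        best_target = max(scores, key=scores.get)
        hits += (best_target == true_targets[mol])
    return hits / len(predictions)


# Toy example: molA is assigned correctly, molB is not.
preds = {"molA": {"T1": 7.1, "T2": 5.0},
         "molB": {"T1": 6.0, "T2": 5.5}}
truth = {"molA": "T1", "molB": "T2"}
accuracy = target_id_accuracy(preds, truth)
```

A model suffering from inter-protein scoring noise can score well within each target column yet still fail this row-wise comparison across targets.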
The development of rigorous benchmarking methodologies has become as important as the development of new models. Structure-based filtering algorithms represent a significant advance in this direction. These algorithms employ a multi-modal approach, spanning protein, ligand, and binding-conformation similarity, to identify and remove problematic overlaps.
Figure 1: Structure-based filtering workflow for detecting data leakage. The algorithm assesses similarity across multiple dimensions before making filtering decisions.
The algorithm computes similarity between protein-ligand complexes using three complementary metrics: protein similarity (TM-scores), ligand similarity (Tanimoto scores), and binding conformation similarity (pocket-aligned ligand RMSD) [28]. This multi-modal approach can identify complexes with similar interaction patterns even when proteins have low sequence identity, addressing limitations of traditional sequence-based analysis.
Beyond data leakage concerns, researchers have developed new benchmarking approaches that test different aspects of model capability:
Target Identification Benchmarks: These evaluate whether models can identify the correct protein target for active molecules by predicting higher binding affinity compared to decoy targets [9]. This addresses the "inter-protein scoring noise problem" where classical SFs fail to identify correct targets despite enriching actives for specific targets.
Synthetic Data Augmentation: This approach tests model robustness using AI-predicted complexes rather than experimental structures. Studies show that data augmentation benefits depend critically on structural quality, with low-quality synthetic examples providing limited value [55].
Orthogonal Dataset Validation: Models are tested on datasets specifically designed to be structurally distinct from training data, such as the FEP benchmark dataset with minimal data leakage from training sets [55].
The GEMS (Graph neural network for Efficient Molecular Scoring) model exemplifies recent advances in ML-SF design that address generalization challenges. GEMS combines a sparse graph modeling of protein-ligand interactions with transfer learning from language models, enabling it to maintain state-of-the-art performance when trained on the cleaned PDBbind CleanSplit dataset [28].
Ablation studies with GEMS revealed that the model fails to produce accurate predictions when protein nodes are omitted from the graph, suggesting its predictions are based on genuine understanding of protein-ligand interactions rather than exploiting dataset biases [28]. This represents an important validation of the model's learning mechanism.
PATH+ represents a different philosophical approach, prioritizing interpretability alongside performance. This algorithm uses persistent homology, a mathematical tool from algebraic topology, to encode structural binding features [56]. Unlike black-box deep learning models, PATH+ provides inherent interpretability, allowing researchers to trace predictions back to specific atomic interactions.
The "persistence fingerprint" in PATH+ efficiently captures geometric properties such as molecular cavities and interaction patterns at multiple scales [56]. This approach demonstrates that high accuracy doesn't necessarily require sacrificing interpretability, addressing a key limitation of many deep learning-based SFs.
The SG-ML-PLAP framework combines extended connectivity interaction features (ECIF) with machine learning to predict binding affinities [57]. This approach shows improved performance compared to conventional scoring functions and several other ML-SFs, particularly when training on crystal structures is supplemented with redocked protein-ligand complexes.
Benchmarking on CASF datasets and prediction of unseen protein-ligand complexes with different structural features demonstrates the framework's robustness [57]. The integration of multiple data sources and feature types represents a pragmatic approach to improving model generalization.
Table 3: Key Research Reagents and Computational Tools for Scoring Function Development
| Resource | Type | Function | Access |
|---|---|---|---|
| PDBbind CleanSplit | Dataset | Leakage-free training and evaluation data | Publicly available |
| CASF Benchmark | Benchmark suite | Standardized performance assessment | Publicly available |
| GEMS | Software | Graph neural network for affinity prediction | Open source |
| PATH+ | Software | Interpretable topological affinity prediction | Open source (OSPREY) |
| SG-ML-PLAP | Web server | Structure-guided ML affinity predictor | http://www.nii.ac.in/sg-ml-plap.html |
| Boltz-2 | Model | Biomolecular foundation model for affinity | Not specified |
| AEV-PLIG | Software | GNN-based scoring function | Not specified |
To avoid data leakage artifacts, researchers should implement rigorous dataset splitting protocols, such as structure-based filtering of the training set (as in PDBbind CleanSplit), UniProt-based protein splits, and compound-level holdouts that withhold all data for specific ligands.
These protocols help ensure that reported performance metrics reflect genuine generalization capability rather than memorization of training examples.
A robust benchmarking workflow should evaluate multiple aspects of model performance:
Figure 2: Comprehensive benchmarking workflow for scoring functions. A rigorous evaluation assesses multiple performance aspects beyond simple affinity prediction.
This multi-faceted evaluation approach ensures that models are tested on clinically relevant tasks including affinity prediction (scoring power), pose prediction (docking power), and target identification – each of which requires different capabilities.
The benchmarking realities in scoring function development reveal a complex landscape where reported performance metrics often obscure significant limitations. While machine-learning SFs demonstrate impressive capabilities on standard benchmarks, their advantages over classical approaches diminish substantially when evaluated under leakage-free conditions and on clinically relevant tasks like target identification.
The field is transitioning toward more rigorous evaluation methodologies that better reflect real-world drug discovery challenges. Critical developments include structure-based dataset filtering, novel benchmarking paradigms, and emphasis on model interpretability. These advances are essential for developing ML-SFs that genuinely improve rather than simply replicating the limitations of classical approaches in new forms.
Future progress will likely depend on several key developments: (1) creation of larger, more diverse, and rigorously curated training datasets; (2) development of evaluation standards that include target identification capabilities; (3) improved model architectures that better capture physical principles of binding; and (4) greater emphasis on interpretability to build trust and provide mechanistic insights. As these developments mature, ML-SFs may finally deliver on their promise to transform computational drug discovery.
The accurate prediction of protein-ligand binding affinity represents a cornerstone of structure-based drug design. For decades, classical scoring functions—built upon physical force-field, empirical, or knowledge-based approaches—have been the standard computational tools for this task. However, these traditional methods suffer from well-documented limitations, including oversimplification of desolvation and entropy effects, and reliance on linear regression techniques that fail to capture the complex, non-linear nature of molecular interactions [58]. The advent of deep learning promised to overcome these limitations through sophisticated architectures capable of learning intricate patterns from large-scale structural databases. Paradoxically, many of these modern approaches have failed to deliver substantial improvements in real-world drug discovery applications, despite reporting impressive benchmark performance [28] [59].
This discrepancy between reported accuracy and practical utility stems from a fundamental confusion between generalization and memorization. Recent research has revealed that the standard practice of randomly splitting data between training and test sets creates an artificial scenario that allows models to exploit hidden biases and structural similarities in the data. Consequently, models appear highly accurate during benchmarking but perform poorly when faced with truly novel protein-ligand complexes in prospective drug discovery campaigns [28] [59]. This paper examines the root causes of this generalization crisis, presents rigorous methodologies for proper model evaluation, and introduces emerging solutions designed to restore the predictive power of computational affinity prediction.
Recent investigations have exposed systematic flaws in the standard benchmarks used to evaluate scoring functions. The most critical issue involves data leakage between the PDBbind database (used for training) and the Comparative Assessment of Scoring Functions (CASF) benchmark (used for testing). A structure-based clustering analysis revealed that nearly 600 significant similarities exist between PDBbind training complexes and CASF test complexes, affecting approximately 49% of all CASF complexes [28]. These similarities extend beyond mere sequence identity to encompass ligand structures, binding pocket configurations, and even binding affinity values.
The consequences of this leakage are profound. When models encounter test complexes that closely resemble those in their training set, they can achieve high accuracy through simple pattern matching rather than genuine understanding of protein-ligand interactions. Alarmingly, some models maintain competitive performance even when critical protein or ligand information is deliberately omitted from inputs, suggesting they rely on memorizing spurious correlations rather than learning fundamental principles of molecular recognition [28].
Table 1: Performance Degradation When Addressing Data Leakage
| Model Type | Performance on Standard Split | Performance on Clean Split | Performance Drop | Evaluation Metric |
|---|---|---|---|---|
| GenScore | Excellent benchmark performance | Marked performance drop | Substantial | RMSE on CASF2016 |
| Pafnucy | Excellent benchmark performance | Marked performance drop | Substantial | RMSE on CASF2016 |
| GEMS (novel GNN) | State-of-the-art performance | Maintains high performance | Minimal | RMSE on CASF2016 |
| Simple similarity algorithm | Competitive with published models | N/A | N/A | Pearson R = 0.716 |
When state-of-the-art binding affinity prediction models like GenScore and Pafnucy were retrained on a properly curated dataset (PDBbind CleanSplit) with reduced data leakage, their performance dropped markedly compared to their reported benchmarks. This confirms that their previously reported high performance was largely driven by data leakage rather than genuine generalization capability [28]. In contrast, a simple algorithm that predicts binding affinity by averaging values from the five most similar training complexes achieved competitive performance with some published deep-learning models, demonstrating that sophisticated architectures may be accomplishing little more than this straightforward similarity matching [28].
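That similarity-averaging baseline can be reproduced in a few lines: predict a query complex's affinity as the mean affinity of its k most similar training complexes. The sketch below uses ligand-fingerprint Tanimoto similarity as a stand-in for the full structural comparison used in the study.

```python
def knn_affinity_baseline(query_fp: set, train_set: list, k: int = 5) -> float:
    """Predict binding affinity as the mean affinity of the k most
    similar training complexes.
    train_set: list of (fingerprint_bit_set, affinity) pairs.
    Similarity here is ligand-fingerprint Tanimoto only, a simplifying
    assumption relative to the full structural comparison."""
    def tanimoto(a, b):
        union = len(a) + len(b) - len(a & b)
        return len(a & b) / union if union else 0.0

    ranked = sorted(train_set,
                    key=lambda item: tanimoto(query_fp, item[0]),
                    reverse=True)
    top = ranked[:k]
    return sum(affinity for _, affinity in top) / len(top)


# Toy example with three training "complexes" and k=2.
train = [({1, 2}, 5.0), ({1, 2, 3}, 7.0), ({9}, 1.0)]
prediction = knn_affinity_baseline({1, 2, 3}, train, k=2)
```

Any deep model that cannot clearly beat this baseline on a leakage-free split is, in effect, doing expensive similarity matching.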
Proper data partitioning is essential for meaningful evaluation of model generalization. The standard random splitting approach, while methodologically straightforward, often produces optimistically biased performance estimates. More rigorous strategies include UniProt-based protein splits, compound-level ("split-by-inhibitor") holdouts, and structure-based filtering of the training set.
Studies implementing these rigorous partitioning strategies consistently reveal substantial performance degradation compared to random splits. For instance, models showing high predictive correlations (Pearson coefficients up to 0.70) under random partitioning exhibited significantly reduced performance with UniProt-based partitioning [60].
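A UniProt-based partition can be implemented as a group-aware split that keeps every complex of a given protein on the same side, so no protein appears in both train and test. This is a minimal sketch; real protocols may additionally balance affinity distributions or cluster proteins by sequence identity.

```python
from collections import defaultdict
import random


def uniprot_split(complexes: list, test_fraction: float = 0.2, seed: int = 0):
    """Group-aware train/test split. Each complex is a dict with a
    'uniprot_id' key; all complexes sharing a UniProt ID land on the
    same side of the split."""
    groups = defaultdict(list)
    for c in complexes:
        groups[c["uniprot_id"]].append(c)
    ids = sorted(groups)
    random.Random(seed).shuffle(ids)  # deterministic for a fixed seed
    n_test = max(1, int(len(ids) * test_fraction))
    test_ids = set(ids[:n_test])
    train = [c for uid in ids[n_test:] for c in groups[uid]]
    test = [c for uid in test_ids for c in groups[uid]]
    return train, test
```

Splitting at the protein level rather than the complex level is precisely what exposed the performance drops reported under UniProt-based partitioning.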
The PDBbind CleanSplit methodology represents a comprehensive approach to addressing both train-test leakage and internal dataset redundancy [28]. The protocol identifies structurally similar complexes using protein TM-scores, ligand Tanimoto similarity, and pocket-aligned ligand RMSD, then removes both train-test matches and redundant clusters within the training set.
This systematic filtering resulted in the removal of approximately 11.8% of training complexes (4% for direct train-test similarity and 7.8% for internal redundancies), producing a refined dataset that enables genuine evaluation of model generalization [28].
Diagram 1: PDBbind CleanSplit Creation Workflow. This protocol systematically removes structurally similar complexes between training and test sets.
Beyond proper data partitioning, comprehensive evaluation requires multiple metrics that assess different aspects of model performance, including affinity accuracy (RMSE and Pearson R), ranking power, and target-identification accuracy.
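The two headline regression metrics are straightforward to compute; a dependency-free sketch:

```python
import math


def rmse(y_true: list, y_pred: list) -> float:
    """Root-mean-square error between experimental and predicted
    affinities (e.g. in pK units)."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def pearson_r(y_true: list, y_pred: list) -> float:
    """Pearson correlation coefficient between experimental and
    predicted affinities."""
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in y_true))
    sp = math.sqrt(sum((p - mp) ** 2 for p in y_pred))
    return cov / (st * sp)
```

Note that Pearson R is invariant to linear rescaling of the predictions, which is why RMSE and correlation must be reported together: a model can rank complexes well while being systematically miscalibrated.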
A comprehensive study of convolutional neural networks for kinase inhibitor prediction revealed dramatic performance differences depending on splitting strategy [59]. When using standard random splitting, models achieved performance comparable to state-of-the-art reports. However, when evaluated using "split-by-inhibitor" methodology—where all data for specific compounds were withheld from training—model performance deteriorated substantially, with some models showing no improvement over simple baseline methods.
This failure demonstrates that models were primarily memorizing kinase phylogeny and matching chemical analogues rather than learning fundamental principles of molecular recognition. The models successfully associated specific molecular scaffolds with particular kinase subfamilies but could not generalize to novel chemical entities, severely limiting their utility in prospective drug discovery where truly novel compounds are of greatest interest [59].
The generalization failure extends beyond kinase-specific applications to general protein-ligand binding affinity prediction. Analysis of top-performing models in the CASF benchmark revealed that many could not maintain their performance when evaluated on the PDBbind CleanSplit dataset [28]. The observed performance drops were not random—models consistently failed for complexes with novel structural features not represented in their training data, while maintaining accuracy for complexes similar to those they had seen during training.
This pattern confirms that the models were operating primarily through memorization and similarity matching rather than genuine understanding of physical interactions. The problem was particularly pronounced for models that used limited molecular representations or lacked sufficient architectural capacity to capture complex physical relationships [28].
Table 2: Experimental Protocols for Assessing Generalization
| Experiment | Protocol | Key Measurements | Interpretation |
|---|---|---|---|
| Data Splitting Comparison | Train and evaluate identical models using random splitting vs. strict splitting methods | Performance difference between splitting methods | Quantifies overoptimism from standard evaluation |
| Ablation Analysis | Systematically remove protein or ligand information from input features | Performance degradation with reduced information | Tests whether predictions use genuine interaction information |
| Similarity-based Prediction | Implement simple similarity-matching algorithm as baseline | Comparison with complex model performance | Establishes minimum expected performance |
| Cross-Dataset Evaluation | Train on one dataset, evaluate on entirely different dataset | Absolute performance on external dataset | Measures real-world generalization |
Novel neural network architectures show promise for genuine generalization. The GEMS (Graph neural network for Efficient Molecular Scoring) model employs a sparse graph representation of protein-ligand interactions combined with transfer learning from protein language models [28]. This approach maintains high benchmark performance even when trained on the challenging PDBbind CleanSplit dataset, suggesting true learning of interaction principles rather than data memorization.
Crucially, ablation studies with GEMS demonstrate that the model fails to produce accurate predictions when protein nodes are omitted from the graph, confirming that its predictions rely on genuine understanding of protein-ligand interactions rather than exploiting dataset biases [28].
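The node-ablation test itself can be sketched as a simple graph transformation that strips all nodes of one type before re-evaluation. The dict-based graph representation below is a toy stand-in, not GEMS's actual data structures.

```python
def ablate_nodes(graph: dict, node_type: str) -> dict:
    """Return a copy of the graph with all nodes of the given type
    (and their incident edges) removed. Re-scoring the model on the
    ablated graph tests whether its predictions genuinely depend on
    that information."""
    nodes = {name: kind for name, kind in graph["nodes"].items()
             if kind != node_type}
    edges = [(a, b) for a, b in graph["edges"] if a in nodes and b in nodes]
    return {"nodes": nodes, "edges": edges}


# Toy complex graph: one protein node, two ligand nodes.
g = {"nodes": {"p1": "protein", "l1": "ligand", "l2": "ligand"},
     "edges": [("p1", "l1"), ("l1", "l2")]}
ligand_only = ablate_nodes(g, "protein")
```

A model whose accuracy barely changes on the ligand-only graph is, by construction, ignoring the protein and exploiting dataset biases instead.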
Integration of diverse modeling approaches through meta-modeling (ensemble learning) provides another path to improved generalization. By combining classical force-field-based scoring functions with sequence-based deep learning models, researchers have created meta-models that outperform individual base models while demonstrating improved generalization across diverse benchmarks [58].
These hybrid approaches benefit from the complementary strengths of different methodologies—physical models provide theoretical grounding and interpretability, while data-driven models capture complex patterns that may be difficult to parameterize explicitly. The resulting ensembles show more consistent performance across different target classes and reduced sensitivity to dataset-specific biases [58].
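At its simplest, such a meta-model is a weighted combination of base predictors. The uniform default weights below are illustrative; published meta-models typically fit the combination weights on held-out validation data.

```python
def meta_model_predict(complex_features, base_models: list,
                       weights: list = None) -> float:
    """Ensemble affinity prediction: a (optionally weighted) mean over
    base scoring functions, e.g. a classical force-field SF and a
    sequence-based deep learning model.
    base_models: callables mapping features -> predicted affinity."""
    preds = [model(complex_features) for model in base_models]
    if weights is None:
        weights = [1.0 / len(preds)] * len(preds)  # uniform average
    return sum(w * p for w, p in zip(weights, preds))


# Toy example with two stand-in base models.
prediction = meta_model_predict(None, [lambda f: 6.0, lambda f: 8.0])
```

Because the base models fail in different ways, their errors partially cancel, which is the mechanism behind the more consistent cross-target performance reported for these ensembles.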
Leveraging transfer learning from large-scale protein language models (e.g., ESM-2) provides a powerful strategy for embedding fundamental biochemical knowledge into affinity prediction models [60]. These pre-trained embeddings capture evolutionary constraints and structural principles that generalize across diverse protein families, reducing the tendency to memorize dataset-specific patterns.
Similarly, multi-task learning approaches that simultaneously predict multiple properties (binding affinity, solubility, toxicity) force models to develop more general representations of molecular interactions, improving performance on any single task including affinity prediction.
Diagram 2: Architecture Strategies for Improved Generalization. Combining multiple approaches addresses limitations of individual methods.
Table 3: Key Research Reagents and Computational Tools
| Resource | Type | Primary Function | Generalization Relevance |
|---|---|---|---|
| PDBbind Database | Dataset | Comprehensive collection of protein-ligand structures with experimental binding affinities | Foundation for training and evaluation; requires proper filtering |
| CASF Benchmark | Evaluation Framework | Standardized test sets for scoring function comparison | Contains documented data leakage; requires careful usage |
| PDBbind CleanSplit | Curated Dataset | Structure-filtered training set minimizing similarity to test complexes | Enables proper generalization assessment |
| ESM-2 Protein Language Model | Pre-trained Model | Provides evolutionary-informed protein representations | Transfer learning improves generalization to novel proteins |
| GEMS (Graph neural network for Efficient Molecular Scoring) | Model Architecture | Sparse graph representation of protein-ligand interactions | Maintains performance on independent test sets |
| Anchor-Query Framework | Methodology | Leverages limited reference data to predict unknown states | Improves prediction for novel targets with minimal data |
The distinction between generalization and memorization represents a critical challenge for computational drug discovery. The documented performance overestimation in current binding affinity prediction models stems from fundamental flaws in evaluation methodologies rather than technical limitations of the models themselves. By adopting rigorous data partitioning strategies, comprehensive evaluation protocols, and architecturally innovative approaches, the field can transition from models that excel retrospectively on biased benchmarks to those that offer genuine predictive power for novel drug targets.
The path forward requires a cultural shift in how we evaluate computational models—prioritizing rigorous generalization assessment over impressive but potentially misleading benchmark performance. The solutions outlined in this review, including the PDBbind CleanSplit protocol, advanced architectures like GEMS, and sophisticated meta-modeling approaches, provide a foundation for developing the next generation of predictive tools that will truly accelerate drug discovery rather than simply providing retrospective accuracy.
Structure-based drug design (SBDD) relies heavily on computational methods to predict how small molecules interact with biological targets, with molecular docking serving as a cornerstone technique. [61] At the heart of every molecular docking tool lies its scoring function (SF), a mathematical model that predicts the binding affinity and orientation of a ligand within a protein's binding pocket. [32] [61] The accuracy and reliability of these scoring functions directly impact the success of virtual screening (VS) and binding pose prediction, critically influencing the efficiency of lead discovery and optimization in drug development. [27] [61]
This review is framed within the context of a broader thesis on the limitations of classical scoring functions for affinity prediction. Traditional SFs, categorized as physics-based, empirical, or knowledge-based, have long been plagued by persistent challenges. [28] A well-documented phenomenon is the "inter-protein scoring noise," where classical SFs can enrich active molecules for a single protein target but fail to identify the correct target for a given active molecule due to scoring variations between different binding pockets. [9] This limitation severely restricts their utility in target identification and polypharmacology studies. Furthermore, the advent of deep learning (DL) has promised a paradigm shift, yet comprehensive benchmarking reveals that these modern approaches often struggle with generalization, physical plausibility, and robustness against data leakage, indicating that the field has not yet fully overcome the fundamental hurdles of affinity prediction. [9] [32] [28]
To objectively compare the performance of different SF types, researchers employ standardized benchmark datasets and specific evaluation protocols across several key dimensions.
Recent studies highlight that the standard practice of training on the PDBbind database and testing on the Comparative Assessment of Scoring Functions (CASF) benchmark suffers from significant train-test data leakage, severely inflating performance estimates. [28] Nearly 49% of CASF complexes have highly similar counterparts in the training set, allowing models to exploit memorization rather than genuine learning of protein-ligand interactions. [28] To address this, rigorously curated datasets like PDBbind CleanSplit have been developed, which apply structure-based filtering algorithms to ensure strict separation between training and test complexes, enabling a more realistic assessment of generalization capability. [28]
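As a concrete illustration of this kind of structure-based filtering, the sketch below removes training complexes whose ligand fingerprints are too similar to any test-set ligand. This is a minimal, hypothetical example in pure Python, not the actual CleanSplit algorithm (which also weighs protein similarity and ligand positioning); the fingerprints here are toy bit sets.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprint bit sets."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def filter_leaked(train, test, threshold=0.5):
    """Drop training complexes whose ligand fingerprint is too similar
    to any test-set ligand (one axis of a leakage filter)."""
    return [
        (cid, fp) for cid, fp in train
        if all(tanimoto(fp, test_fp) < threshold for _, test_fp in test)
    ]

# Toy data: PDB-style IDs paired with fingerprint bit sets.
train = [("1abc", {1, 2, 3, 4}), ("2xyz", {10, 11})]
test = [("9tst", {1, 2, 3, 5})]
kept = filter_leaked(train, test, threshold=0.5)  # "1abc" is too close to "9tst"
```

In a real pipeline the threshold and the similarity measure would be tuned per the published protocol; the point is that filtering happens before training, on structural grounds.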
A comprehensive 2025 evaluation of nine docking methods across the Astex diverse set, PoseBusters benchmark set, and DockGen dataset revealed distinct performance tiers when assessing both pose accuracy (RMSD ≤ 2 Å) and physical validity (PB-valid): traditional methods > hybrid AI scoring with traditional conformational search > generative diffusion methods > regression-based methods. [32]
Table 1: Pose Prediction Accuracy and Physical Validity Across Benchmark Datasets [32]
| Method Category | Specific Method | RMSD ≤ 2 Å (Astex) | PB-Valid (Astex) | Combined Success (Astex) | Combined Success (DockGen) |
|---|---|---|---|---|---|
| Traditional | Glide SP | 81.18% | 97.65% | 79.41% | 42.31% |
| Traditional | AutoDock Vina | 75.29% | 89.41% | 68.24% | 26.92% |
| Generative Diffusion | SurfDock | 91.76% | 63.53% | 61.18% | 33.33% |
| Generative Diffusion | DiffBindFR (MDN) | 75.29% | 52.94% | 43.53% | 18.52% |
| Regression-Based | KarmaDock | 31.76% | 35.29% | 14.12% | 2.56% |
| Regression-Based | QuickBind | 37.65% | 31.76% | 17.65% | 2.56% |
This stratification highlights the diverse strengths and limitations of each approach. While generative diffusion models like SurfDock achieve exceptional pose accuracy, they frequently produce physically implausible structures with steric clashes or incorrect bond geometries. [32] In contrast, traditional methods like Glide SP maintain remarkably high physical validity across all datasets, though with somewhat lower pose accuracy. [32]
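The two success criteria combined in Table 1, pose accuracy (RMSD ≤ 2 Å) and physical validity, can be expressed compactly. The sketch below is a minimal illustration assuming matched heavy-atom coordinate lists; production pipelines use symmetry-corrected RMSD and the full PoseBusters check suite rather than a boolean flag.

```python
import math

def rmsd(pose_a, pose_b):
    """Heavy-atom RMSD between two poses given matched coordinate lists."""
    assert len(pose_a) == len(pose_b)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(pose_a, pose_b))
    return math.sqrt(sq / len(pose_a))

def combined_success(poses, rmsd_cutoff=2.0):
    """Fraction of poses that are both accurate (RMSD <= cutoff) and
    physically valid (pass a PoseBusters-style check)."""
    hits = sum(1 for r, valid in poses if r <= rmsd_cutoff and valid)
    return hits / len(poses)

# (rmsd_value, passed_validity_checks) pairs for four docked poses.
rate = combined_success([(1.2, True), (1.8, False), (3.5, True), (0.9, True)])
```

Requiring both criteria is exactly why SurfDock's 91.76% RMSD success collapses to 61.18% combined success on Astex: accurate but physically implausible poses are counted as failures.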
In virtual screening applications, the combination of docking tools with machine learning-based rescoring has demonstrated significant performance improvements. A benchmarking study against both wild-type and quadruple-mutant Plasmodium falciparum Dihydrofolate Reductase (PfDHFR) produced the following enrichment factors:
Table 2: Virtual Screening Performance (EF1%) Against PfDHFR Variants [27]
| Docking Tool | Scoring Function | Wild-Type EF1% | Quadruple-Mutant EF1% |
|---|---|---|---|
| PLANTS | Native | 21 | 19 |
| PLANTS | CNN-Score | 28 | 24 |
| FRED | Native | 18 | 22 |
| FRED | CNN-Score | 25 | 31 |
| AutoDock Vina | Native | <1 (worse-than-random) | <1 (worse-than-random) |
| AutoDock Vina | RF-Score-VS v2 | 15 (better-than-random) | 17 (better-than-random) |
Notably, rescoring with CNN-Score consistently enhanced screening performance across both variants, with the combination of FRED and CNN-Score achieving the highest enrichment (EF1% = 31) against the resistant quadruple mutant. [27] This demonstrates the potential of ML-based rescoring to overcome limitations of classical SFs, particularly for challenging targets like drug-resistant enzymes.
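The EF1% metric used throughout this comparison has a simple definition: the active rate in the top-scored 1% of the library divided by the active rate in the whole library, so random screening scores 1. A minimal sketch:

```python
def enrichment_factor(scores, labels, top_frac=0.01):
    """EF at top_frac: active rate in the top-scored fraction divided by
    the active rate across the whole library (random baseline = 1)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(len(ranked) * top_frac))
    top_actives = sum(label for _, label in ranked[:n_top])
    return (top_actives / n_top) / (sum(labels) / len(labels))

# Toy library: 200 compounds, 10 actives, 2 of them ranked in the top 1%.
scores = list(range(200, 0, -1))
labels = [1, 1] + [0] * 100 + [1] * 8 + [0] * 90
ef1 = enrichment_factor(scores, labels, top_frac=0.01)
```

With 2 of 10 actives in the top 2 compounds, the active rate there is 100% versus a 5% library baseline, giving EF1% = 20; values like the 31 reported for FRED + CNN-Score indicate even stronger early enrichment.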
For target identification, the critical test is whether an SF can correctly identify the true protein target for a given active molecule by predicting higher binding affinity compared to decoy targets. Alarmingly, even the recently developed Boltz-2 biomolecular foundation model, which claimed to approach Free Energy Perturbation (FEP) performance, failed this fundamental test in a rigorous benchmark based on LIT-PCBA, indicating that generalizable understanding of protein-ligand interactions remains an unachieved goal. [9]
The accuracy of binding affinity prediction varies substantially across SF types, with deep learning models particularly affected by data leakage issues:
Table 3: Binding Affinity Prediction Performance on CASF Benchmark [28]
| Model | Training Dataset | Pearson R (CASF-2016) | Generalization Assessment |
|---|---|---|---|
| GenScore | Original PDBbind | 0.826 | Severely inflated |
| GenScore | PDBbind CleanSplit | 0.685 | True performance |
| Pafnucy | Original PDBbind | 0.779 | Severely inflated |
| Pafnucy | PDBbind CleanSplit | 0.612 | True performance |
| GEMS (GNN) | PDBbind CleanSplit | 0.816 | Robust generalization |
| Similarity Search Algorithm | - | 0.716 | Benchmark reference |
When trained on the original PDBbind database, top-performing models show excellent benchmark performance, but this drops markedly when trained on the cleaned PDBbind CleanSplit dataset, confirming that their previous high scores were largely driven by data leakage rather than genuine generalization. [28] In contrast, the GEMS graph neural network maintains high performance when trained on CleanSplit, suggesting more robust learning of protein-ligand interactions. [28]
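The Pearson R values in Table 3 measure the linear correlation between predicted and experimental affinities across the test set. For reference, a minimal implementation:

```python
import math

def pearson_r(predicted, experimental):
    """Pearson correlation between predicted and experimental affinities."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_e = sum(experimental) / n
    cov = sum((p - mean_p) * (e - mean_e)
              for p, e in zip(predicted, experimental))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    sd_e = math.sqrt(sum((e - mean_e) ** 2 for e in experimental))
    return cov / (sd_p * sd_e)
```

Note that R is invariant to linear rescaling of the predictions, which is one reason it is usually reported alongside RMSE and ranking metrics rather than on its own.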
Alternative approaches like the PATH+ algorithm, which uses persistent homology to capture geometric properties of protein-ligand complexes, demonstrate comparable accuracy with the added benefits of interpretability and superior computational efficiency (10x faster than previous topology-based methods). [56] For specific target classes like GPCRs, advanced molecular dynamics approaches with re-engineered Bennett acceptance ratio (BAR) methods have shown promising correlation with experimental data (R² = 0.789). [30]
The comprehensive assessment referenced in Section 3.1 followed a rigorous, standardized workflow for benchmarking scoring functions across multiple datasets and performance dimensions [32]. To address the data leakage concerns highlighted in Section 3.3, the PDBbind CleanSplit protocol implements a structure-based filtering approach that enforces strict separation between training and test complexes [28]. The PfDHFR benchmarking study referenced in Section 3.2 employed an integrated workflow combining docking with machine-learning rescoring [27].
Together, these protocols probe a fundamental limitation of classical scoring functions, inter-protein scoring noise: a function may perform well within a single target yet fail at cross-target identification. A comprehensive assessment therefore requires testing across multiple, complementary performance dimensions.
Table 4: Key Research Reagents and Computational Resources
| Resource Name | Type | Primary Function | Application Context |
|---|---|---|---|
| PDBbind CleanSplit | Dataset | Training data with minimized train-test leakage | Robust model training and evaluation [28] |
| DEKOIS 2.0 | Benchmark Set | Provides active compounds and challenging decoys | Virtual screening benchmarking [27] |
| PoseBusters | Validation Tool | Checks physical/chemical plausibility of poses | Pose quality assessment [32] |
| Persistent Homology | Mathematical Framework | Encodes multi-scale geometric features | Interpretable affinity prediction (PATH+) [56] |
| BAR Method | Algorithm | Free energy calculation with explicit solvation | High-accuracy affinity prediction for membrane proteins [30] |
| CNN-Score | ML Scoring Function | Rescoring docking poses with convolutional networks | Virtual screening enhancement [27] |
This comparative analysis reveals that no single scoring function type currently dominates across all critical performance dimensions. Classical physics-based functions demonstrate superior physical plausibility and robustness, while deep learning approaches show promise in specific tasks like virtual screening enrichment but struggle with generalization and data leakage issues. The persistence of fundamental challenges like inter-protein scoring noise and the performance gap observed when proper dataset splitting is implemented suggests that the field must move beyond traditional benchmarking practices. Future developments should prioritize truly generalizable models that capture universal principles of molecular recognition rather than exploiting dataset-specific patterns, with interpretability and physical plausibility as central design considerations alongside predictive accuracy.
The field of computational drug design has long been hampered by the limitations of classical scoring functions in accurately predicting protein-ligand binding affinities. These traditional methods, often based on force-fields, empirical data, or knowledge-based statistical potentials, struggle with generalizability and accuracy, creating a bottleneck in structure-based drug design (SBDD) [28] [35]. The emergence of deep learning, particularly Graph Neural Networks (GNNs), represents a paradigm shift. GNNs have quietly become a transformative tool, revolutionizing drug discovery by accurately modeling molecular structures and interactions with binding targets [62] [63]. However, this progress has been accompanied by significant challenges, most notably the issue of data leakage in public benchmarks that has severely inflated performance metrics and led to an overestimation of model capabilities [28]. This whitepaper details how the convergence of novel GNN architectures, rigorously curated datasets, and advanced training protocols is establishing a new gold standard, finally narrowing the performance gap with computationally intensive physics-based methods like Free Energy Perturbation (FEP) while being orders of magnitude faster [35].
Classical scoring functions have been the cornerstone of computer-aided drug design (CADD), but their applicability is constrained by a fundamental trade-off between computational cost and predictive accuracy [35]. These methods, which include force-field-based, empirical, and knowledge-based approaches implemented in docking tools like AutoDock Vina and GOLD, are often computationally intensive and show limited accuracy in binding affinity prediction [28]. A well-documented phenomenon that highlights their weakness is the inter-protein scoring noise problem: while these functions can sometimes enrich active molecules for a specific protein target, they generally fail to identify the correct protein target for a given active molecule due to scoring variations between different binding pockets [9].
This limitation restricts their utility in target identification, a critical step in drug discovery. Furthermore, classical scoring functions often fail on realistic tasks encountered in hit-to-lead optimization, such as reliably ranking the binding affinity of a congeneric series of ligands [35]. While more rigorous methods like FEP offer higher accuracy, their prohibitive computational cost, often requiring hours on supercomputers, makes them unsuitable for high-throughput virtual screening [35] [64]. This left a critical gap in the speed-accuracy landscape: a need for methods that are definitively more accurate than docking but faster than FEP [64].
Graph Neural Networks (GNNs) are a class of deep learning models within the broader deep learning revolution that are uniquely suited for non-Euclidean data [65]. Their rise in the AI research landscape has been spectacular, with the term "Graph Neural Network" consistently ranking in the top 3 keywords for major AI conferences and a striking +447% average annual increase in related publications during 2017-2019 [65].
In the context of molecular modeling, GNNs offer an intuitive and expressive framework. They operate directly on molecular graphs, where atoms are represented as nodes and chemical bonds as edges [63]. This allows GNNs to natively learn complex topological and geometric features of drug-like molecules that would be lost in traditional "rigid" data structures like fixed-size grids or sequences [65]. By performing message-passing operations across the graph, GNNs can capture both the local chemical environments of atoms and the global structure of the molecule, learning a rich hierarchical representation that is immensely valuable for predicting molecular properties and interactions [62] [63].
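The message-passing operation described above can be sketched in a few lines. This toy version (sum-aggregation with no learned weights, purely illustrative) shows how information propagates one bond per step, so that after k steps each atom's representation reflects its k-hop chemical neighbourhood:

```python
def message_pass(node_feats, edges, steps=2):
    """Toy message passing: each node adds the sum of its neighbours'
    feature vectors to its own, once per step. Real GNN layers apply
    learned transformations before and after this aggregation."""
    n = len(node_feats)
    neighbours = {i: [] for i in range(n)}
    for i, j in edges:  # undirected chemical bonds
        neighbours[i].append(j)
        neighbours[j].append(i)

    feats = [list(f) for f in node_feats]
    for _ in range(steps):
        updated = []
        for i in range(n):
            agg = [sum(feats[j][k] for j in neighbours[i])
                   for k in range(len(feats[i]))]
            updated.append([x + a for x, a in zip(feats[i], agg)])
        feats = updated
    return feats

# Three-atom chain (e.g. C-C-O) with a scalar feature marking atom 0.
out = message_pass([[1], [0], [0]], [(0, 1), (1, 2)], steps=1)
```

After one step, atom 1 has already "seen" atom 0's feature, while atom 2 has not; a second step would propagate it across the full chain.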
The impressive benchmark performance reported by many early deep-learning-based scoring functions was, unfortunately, built on a flawed foundation. A critical issue was the widespread train-test data leakage between the primary training database, PDBbind, and the standard evaluation benchmark, the Comparative Assessment of Scoring Function (CASF) [28]. Studies revealed that nearly half (49%) of all CASF complexes had exceptionally similar counterparts in the training set, sharing not only similar ligand and protein structures but also comparable ligand positioning and, unsurprisingly, closely matched affinity labels [28]. This meant models could achieve high benchmark performance through simple memorization rather than a genuine understanding of protein-ligand interactions, severely inflating their perceived generalization capabilities [28].
To address this, researchers introduced PDBbind CleanSplit, a training dataset curated by a novel structure-based filtering algorithm [28]. This algorithm uses a combined assessment of ligand similarity, protein similarity, and ligand positioning to flag pairs of complexes as near-duplicates.
The algorithm eliminates not only train-test data leakage but also redundancies within the training set itself, where nearly 50% of complexes were part of a similarity cluster [28]. This filtering encourages models to learn fundamental principles of binding instead of relying on memorization. The dramatic impact of this curation is clear: when top-performing models like GenScore and Pafnucy were retrained on CleanSplit, their benchmark performance dropped substantially, revealing that their previous high scores were largely driven by data leakage [28].
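The intra-set redundancy removal described here amounts to clustering complexes by pairwise similarity and keeping representatives. Below is a minimal single-linkage sketch (union-find over a similarity graph), with a user-supplied similarity function standing in for the algorithm's actual structure-based criteria:

```python
def similarity_clusters(items, sim, threshold=0.9):
    """Single-linkage clustering: any pair with similarity >= threshold
    is merged into the same cluster via union-find."""
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if sim(items[i], items[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(items)):
        clusters.setdefault(find(i), []).append(items[i])
    return list(clusters.values())

# Toy similarity: complexes "match" if they share a first character.
sim = lambda a, b: 1.0 if a[0] == b[0] else 0.0
clusters = similarity_clusters(["ab", "ac", "bd"], sim)
```

Keeping one representative per cluster is what converts a redundant training set into one that rewards generalization instead of memorization.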
The Graph neural network for Efficient Molecular Scoring (GEMS) is a leading model designed for robust generalization. It leverages a sparse graph modeling of protein-ligand interactions and transfer learning from language models [28]. Architecturally, GEMS integrates protein and ligand representations into a sparse interaction graph that is then processed by GNN layers [28].
A key strength of GEMS is its ability to maintain high benchmark performance when trained on the rigorously curated CleanSplit dataset, suggesting its predictions are based on a genuine understanding of protein-ligand interactions rather than exploiting data leakage [28]. Ablation studies confirmed this, showing the model fails to produce accurate predictions when protein nodes are omitted from the graph [28].
Another advanced architecture is the Atomic Environment Vector–Protein Ligand Interaction Graph (AEV-PLIG) model [35]. As its name indicates, it combines two concepts: atomic environment vectors (AEVs), which encode the local atomic environment around each ligand atom, and protein-ligand interaction graphs (PLIGs), which represent the bound complex as a graph amenable to GNN processing.
AEV-PLIG enhances this by using extended connectivity interaction features (ECIF) for a richer set of 22 distinct protein atom types and employs enhanced GATv2 graph attention layers, which are strictly more expressive than standard GATs [35]. The model is trained using both experimental data and augmented data generated via template-based modelling or molecular docking, which has been shown to significantly improve performance on challenging benchmarks [35].
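GATv2's distinguishing feature is that the attention score is computed from a nonlinearity applied to the combined transform of both endpoint nodes, which is what makes it strictly more expressive than the original GAT. The scalar-feature sketch below illustrates the mechanism only; real layers use vector features, learned weight matrices, and multiple attention heads:

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def gatv2_layer(h, edges, w_left, w_right, a):
    """Single-head, scalar-feature GATv2-style layer:
    score(i, j) = a * leaky_relu(w_left*h_i + w_right*h_j),
    softmax over neighbours, then a weighted sum of transformed
    neighbour features."""
    n = len(h)
    neighbours = {i: [i] for i in range(n)}  # include self-loop
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)

    out = []
    for i in range(n):
        scores = [a * leaky_relu(w_left * h[i] + w_right * h[j])
                  for j in neighbours[i]]
        m = max(scores)  # subtract max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append(sum((w / z) * (w_right * h[j])
                       for w, j in zip(weights, neighbours[i])))
    return out

result = gatv2_layer([1.0, 1.0], [(0, 1)], w_left=1.0, w_right=1.0, a=1.0)
```

Because the nonlinearity sits inside the scoring function, the attention ranking can depend on the query node, which is the "dynamic attention" property the GATv2 authors proved standard GAT lacks.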
For researchers seeking to implement or benchmark these models, the following protocol is essential:
Data Curation: Use the PDBbind CleanSplit methodology to ensure no data leakage exists between training and test sets [28]. This involves structure-based similarity filtering between training and test complexes and removal of redundant similarity clusters within the training set itself.
Benchmarking: Evaluate model performance on multiple independent test sets, including the CASF benchmark suites and out-of-distribution (OOD) test sets designed to penalize ligand or protein memorization [35].
Critical Assessment: The model must be tested on its ability to solve the inter-protein scoring noise problem. A reliable benchmark for this is target identification based on datasets like LIT-PCBA, where the model must correctly identify the target of active molecules by predicting a higher binding affinity compared to decoy targets [9].
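This target-identification test reduces to a simple criterion: for each active molecule, the predicted affinity for its true target must exceed the prediction for every decoy target. A minimal sketch, with the `affinity` callable standing in for any scoring function under evaluation:

```python
def target_id_success(affinity, cases):
    """Fraction of actives whose true target receives a strictly higher
    predicted affinity than every decoy target."""
    hits = 0
    for molecule, true_target, decoy_targets in cases:
        true_score = affinity(molecule, true_target)
        if all(true_score > affinity(molecule, d) for d in decoy_targets):
            hits += 1
    return hits / len(cases)

# Toy scorer: affinity depends only on the target (pure scoring noise).
affinity = lambda mol, target: {"A": 9.0, "B": 6.0, "C": 7.0}[target]
cases = [("mol1", "A", ["B", "C"]), ("mol2", "B", ["A"])]
rate = target_id_success(affinity, cases)
```

The toy scorer illustrates inter-protein scoring noise directly: because target A always scores highest regardless of the molecule, "mol2" fails the test, which is exactly the failure mode LIT-PCBA-style benchmarks expose.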
The table below summarizes the performance of modern GNN-based scoring functions compared to traditional methods, highlighting the new benchmarks being set in the field.
Table 1: Performance Comparison of Scoring Functions for Binding Affinity Prediction
| Method Category | Representative Model | Key Benchmark Performance (Pearson R / RMSE) | Computational Speed | Key Strengths |
|---|---|---|---|---|
| Classical Docking | AutoDock Vina [28] | Limited accuracy [28] | ~1 minute (CPU) [64] | Fast, high-throughput |
| Alchemical Methods | FEP+ [35] | PCC: ~0.68 on congeneric series [35] | >12 hours (GPU cluster) [64] | High accuracy, gold standard |
| ML Scoring (with data leakage) | Pre-CleanSplit Models [28] | Inflated metrics (e.g., R>0.8) [28] | Minutes (GPU) | Fast, previously high benchmark scores |
| Advanced GNNs (no leakage) | GEMS (on CleanSplit) [28] | Maintains high performance on independent tests [28] | Minutes (GPU) | Generalizes to unseen complexes |
| Advanced GNNs (with augmented data) | AEV-PLIG (on FEP benchmark) [35] | PCC: 0.59 (vs. 0.41 without augmentation) [35] | ~400,000x faster than FEP [35] | Accurate & fast on lead-optimization tasks |
The performance of GNNs is particularly notable on tasks critical for drug discovery. For example, AEV-PLIG shows how leveraging augmented data can drastically improve the ranking of congeneric ligands, with Kendall's τ (a rank correlation metric) increasing from 0.26 to 0.42, closing the gap with FEP+ (Kendall's τ of 0.49) [35]. This demonstrates that GNNs are beginning to address the real-world need for accurately prioritizing compounds during lead optimization.
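Kendall's τ, the rank-correlation metric quoted above, counts concordant versus discordant pairs of ligands between the predicted and experimental affinity orderings. A minimal tau-a implementation (no tie correction):

```python
def kendall_tau(x, y):
    """Kendall's tau-a over all pairs: (concordant - discordant) / n_pairs.
    Ties are ignored, so this matches tau-a, not the tie-corrected tau-b."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A τ of 0.42 for AEV-PLIG versus 0.49 for FEP+ thus means both methods order most congeneric pairs correctly, with FEP+ misordering only slightly fewer.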
Table 2: Key Resources for GNN-Based Affinity Prediction Research
| Resource Name | Type | Function & Application in Research |
|---|---|---|
| PDBbind CleanSplit [28] | Dataset | Curated training set free of data leakage; enables genuine evaluation of model generalizability. |
| CASF Benchmark [28] [35] | Benchmarking Suite | Standard set for comparative assessment of scoring functions; use with CleanSplit protocol. |
| OOD Test Set [35] | Benchmarking Suite | Realistic out-of-distribution test set designed to penalize ligand/protein memorization. |
| LIT-PCBA Target ID Benchmark [9] | Benchmarking Suite | Tests model's ability to identify the correct protein target for active molecules. |
| Graph Neural Network Frameworks | Software | Libraries like PyTorch Geometric and DGL for building GNN models like GEMS and AEV-PLIG. |
| Molecular Dynamics Trajectories | Data | Used for data augmentation (as in AEV-PLIG) to increase data diversity and model robustness [35]. |
| Template-Based Modelling & Docking | Software/Algorithm | Tools to generate augmented synthetic protein-ligand complex structures for training [35]. |
Graph Neural Networks, when trained on rigorously curated datasets and evaluated on demanding benchmarks, are unequivocally setting a new gold standard for binding affinity prediction. They are successfully addressing the long-standing limitations of classical scoring functions, particularly the inter-protein scoring noise problem and the inability to accurately rank congeneric series [9] [35]. By mitigating data leakage through protocols like PDBbind CleanSplit, the field can now have greater confidence in reported performance metrics [28].
The integration of advanced architectures like GEMS and AEV-PLIG with strategies such as transfer learning and data augmentation is producing models that are beginning to narrow the performance gap with high-end computational methods like FEP [28] [35]. The ability of these models to provide FEP-level correlation on lead optimization tasks while being hundreds of thousands of times faster represents a monumental leap forward [35]. This opens the door for their practical application in drug discovery pipelines, from powering generative AI models that design new protein-ligand interactions to enabling high-throughput virtual screening with unprecedented accuracy. As these tools continue to evolve, they promise to significantly accelerate the pace of drug discovery, reducing development costs and late-stage failures [62].
The limitations of classical scoring functions are fundamental, stemming from their rigid, pre-defined functional forms and inability to fully capture the complex physics of molecular binding. As a result, they have hit a persistent performance plateau in critical tasks like binding affinity prediction and virtual screening. The path forward is clearly charted by data-driven, machine-learning approaches, which circumvent these limitations by learning the functional form directly from large-scale structural and interaction data. However, the success of these next-generation models hinges on addressing new challenges, particularly the critical need for rigorously curated, non-redundant benchmark datasets free of data leakage. Future progress will depend on a synergistic combination of improved physical modeling, advanced deep learning architectures like graph neural networks, and a renewed focus on robust, generalizable validation practices. This evolution promises to deliver more accurate and reliable computational tools, ultimately accelerating the discovery of new therapeutics and deepening our understanding of biomolecular interactions.