This article addresses the critical challenge of data leakage in PDBbind training datasets, which has been shown to severely inflate the performance metrics of machine learning models for protein-ligand binding affinity prediction. We explore the root causes of this leakage, including structural redundancies and similarities between standard training and test sets like CASF. The content provides a comprehensive overview of modern mitigation strategies, such as the PDBbind CleanSplit and LP-PDBBind protocols, which employ structure-based filtering to create truly independent training and test sets. Furthermore, we discuss the integration of these methods with broader data quality initiatives, such as HiQBind-WF, and evaluate the real-world performance of retrained models on independent benchmarks. This guide is essential for researchers and drug development professionals aiming to build predictive models with robust, generalizable capabilities for structure-based drug discovery.
1. What is data leakage in the context of PDBbind and the CASF benchmark? Data leakage occurs when information from the test dataset (in this case, the CASF core sets) inadvertently influences the training process of a model. For PDBbind, this is not typically a literal duplication of data points, but rather the presence of highly similar protein-ligand complexes in both the training (general/refined sets) and test (core sets) data. This similarity allows models to "cheat" by making predictions based on memorization of structural patterns, rather than learning generalizable principles of binding, leading to an overestimation of the model's true performance on novel complexes [1] [2].
2. Why is data leakage between PDBbind and CASF a problem? Data leakage creates an over-optimistic assessment of a model's "scoring power," which is its ability to predict binding affinity. When a model is evaluated on test complexes that are very similar to those it was trained on, its high performance does not translate to real-world drug discovery scenarios, where it must score entirely new protein targets and novel chemical compounds. This inflates benchmark results and masks the model's true generalization capability [1] [2] [3].
3. How can I detect potential data leakage in my dataset? You can analyze your dataset for these key risk factors: (1) test proteins with high sequence identity or structural similarity (TM-score) to training proteins; (2) test ligands with high chemical similarity (Tanimoto coefficient) to training ligands; and (3) near-identical binding conformations (low pocket-aligned ligand RMSD) across splits [1] [2].
4. What are the main solutions for mitigating data leakage? The research community has developed curated datasets and splits to address this issue, including PDBbind CleanSplit [1], LP-PDBBind [2], the HiQBind dataset and its curation workflow [7], and similarity-aware splitting tools such as DataSAIL [9].
Symptoms: Your model performs exceptionally well on the CASF benchmark (e.g., low RMSE, high Pearson R) but performs poorly when you test it on your own, truly independent data from other sources like BindingDB.
Diagnostic Steps: Evaluate the model on a truly independent benchmark such as BDB2020+; compute cross-set protein (TM-score) and ligand (Tanimoto) similarities; and run a naive similarity-lookup baseline to check whether memorization alone can reproduce your benchmark scores [1] [2].
Objective: To create a training and test split from PDBbind that ensures a rigorous evaluation of your model's generalization.
Methodology: The following workflow, based on the PDBbind CleanSplit protocol, outlines the key steps for creating a leakage-aware dataset [1].
Experimental Protocol: Compute protein (TM-score), ligand (Tanimoto), and binding-pose (pocket-aligned RMSD) similarities for all train-test pairs; remove training complexes that exceed the exclusion thresholds; then reduce internal redundancy within the remaining training set [1].
The table below summarizes the demonstrated effect of data leakage on model performance and the benefits of using leak-proof datasets.
Table 1: Performance Impact of Data Leakage and Mitigation Strategies
| Model / Scenario | Training Dataset | Test Dataset | Performance (Example) | Implication |
|---|---|---|---|---|
| State-of-the-Art Models (GenScore, Pafnucy) | Original PDBbind | CASF Benchmark | High Performance [1] | Performance is artificially inflated due to data leakage. |
| Same Models Retrained | PDBbind CleanSplit | CASF Benchmark | Substantial Performance Drop [1] | Confirms that original high scores were driven by leakage. |
| GEMS (Graph Neural Network) | PDBbind CleanSplit | CASF Benchmark | Maintains High Performance (RMSE ~1.22 pK) [1] | Demonstrates genuine generalization capability when trained on a clean dataset. |
| Various SFs (Vina, IGN, etc.) | LP-PDBBind | Independent BDB2020+ Set | Better Performance vs. models trained on standard PDBbind [2] | Leak-proof training leads to more reliable application on new data. |
Table 2: Key Resources for Leakage-Aware Binding Affinity Prediction
| Item | Type | Function & Relevance |
|---|---|---|
| PDBbind CleanSplit | Curated Dataset | A reorganized split of PDBbind designed to eliminate train-test leakage and reduce internal redundancy, enabling a true test of generalization [1]. |
| LP-PDBBind | Curated Dataset | A "Leak-Proof" version of PDBbind that controls for protein and ligand similarity across splits [2] [6]. |
| HiQBind & HiQBind-WF | Dataset & Workflow | A high-quality dataset and an open-source, semi-automated workflow for curating protein-ligand complexes by fixing structural errors, which improves data quality for training [7]. |
| BDB2020+ | Independent Benchmark | A rigorously compiled test set from BindingDB entries deposited after 2020, used for true external validation of model performance [2]. |
| Structure-Based Clustering Algorithm | Methodology | An algorithm that combines TM-score, Tanimoto score, and RMSD to identify overly similar complexes for filtering [1]. |
| Graph Neural Networks (e.g., GEMS, IGN) | Model Architecture | GNNs that use sparse graph modeling of protein-ligand interactions are showing promising generalization capabilities when trained on clean data [1] [2]. |
Problem: Your machine learning model for binding affinity prediction performs excellently on standard benchmarks (like CASF) but fails dramatically when applied to genuinely new protein-ligand complexes.
Root Cause: Data leakage due to high structural, sequence, and chemical similarities between the training data (PDBbind general/refined sets) and test data (CASF core set) [1] [2]. Nearly half (49%) of CASF test complexes have exceptionally similar counterparts in the training data, allowing models to "cheat" by memorization rather than learning generalizable principles [1].
Diagnosis Steps: Compute protein (TM-score), ligand (Tanimoto), and binding-pose (pocket-aligned RMSD) similarities between your training set and CASF, and evaluate the model on an independent benchmark such as BDB2020+ [1] [2].
Resolution Steps: Retrain the model on a leakage-free split such as PDBbind CleanSplit or LP-PDBBind and re-benchmark; the resulting (typically lower) scores reflect true generalization capability [1] [2].
Problem: After fixing data leakage, model performance on independent tests is lower than desired.
Root Cause: The model architecture itself may lack the inductive biases necessary to generalize to novel protein-ligand pairs that are structurally dissimilar to training examples.
Diagnosis Steps: Run ablation studies (e.g., omit protein nodes from the input graph) to determine whether predictions rest on ligand memorization rather than learned protein-ligand interactions [1].
Resolution Steps: Adopt architectures with stronger inductive biases, such as sparse graph representations of protein-ligand interactions, and initialize components via transfer learning from pre-trained protein or chemical language models [1].
Q1: What exactly is "data leakage" in the context of PDBbind and the CASF benchmark?
Data leakage here is not merely having identical complexes in both training and test sets. It refers to the presence of highly similar proteins (high sequence/TM-score) and/or ligands (high Tanimoto coefficient) in both the PDBbind training data and the CASF test set [1] [2]. This similarity allows models to achieve high benchmark performance by exploiting structural memorization rather than learning the underlying principles of binding, leading to an overestimation of their true generalization capability [1].
Q2: What quantitative evidence exists for this data leakage crisis?
Studies have rigorously quantified the extent of the problem. One analysis revealed that nearly 600 high-similarity pairs exist between the standard PDBbind training set and the CASF-2016 benchmark, involving 49% of all CASF test complexes [1]. A simple algorithm that merely found the 5 most similar training complexes for each test complex and averaged their affinities achieved a competitive Pearson R of 0.716 on CASF-2016, demonstrating that similarity-based lookup can mimic "intelligent" prediction [1]; a sketch of this baseline follows.
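The published lookup algorithm combined protein and ligand similarity; the sketch below is a deliberately simplified, ligand-only illustration of the same idea using RDKit Morgan fingerprints. The names train_smiles, train_pk, and test_smiles are hypothetical placeholders.

```python
# Simplified similarity-lookup baseline: predict each test complex's affinity
# as the mean affinity of its k most similar training ligands (Tanimoto on
# Morgan fingerprints). The published algorithm also used protein similarity.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint(smiles):
    return AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(smiles), radius=2, nBits=2048)

def lookup_predict(test_smiles, train_smiles, train_pk, k=5):
    train_fps = [fingerprint(s) for s in train_smiles]
    train_pk = np.asarray(train_pk)
    preds = []
    for s in test_smiles:
        sims = np.array(DataStructs.BulkTanimotoSimilarity(fingerprint(s), train_fps))
        preds.append(float(train_pk[np.argsort(sims)[-k:]].mean()))
    return preds
```

If this naive baseline approaches your model's benchmark correlation, leakage rather than learning is the likely driver.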
Q3: How much does data leakage inflate model performance?
The inflation is substantial. When top-performing models like GenScore and Pafnucy were retrained on a leakage-free split (PDBbind CleanSplit), their benchmark performance dropped markedly [1]. This confirms that the previously excellent performance was largely driven by data leakage and not model generalization.
Q4: Are certain model architectures more susceptible to data leakage?
All models trained on leaked data will show inflated performance. However, some architectures may be more prone to exploiting shortcuts. For instance, models that primarily rely on ligand information can accurately predict affinities for test ligands that are highly similar to those seen in training, even without protein context [1]. The solution is not just about architecture but about training data quality.
Q5: What is the practical impact of using a leak-proof dataset on real-world drug discovery?
Using leak-proof splits like LP-PDBBind for training leads to models that perform significantly better on truly independent test sets (e.g., BDB2020+) [2]. This translates to more reliable predictions for novel drug targets and compounds, which is the central goal of computational drug discovery. It prevents wasted resources based on over-optimistic in-silico results.
Q6: Beyond protein-ligand binding, is this a broader issue in biomedical machine learning?
Yes, data leakage due to similarity is a pervasive problem. It has been documented in other areas such as prediction of protein-protein interactions and missense variant deleteriousness, where standard random splits allow models to use protein-level shortcuts, leading to poor performance on out-of-distribution data [9].
| Metric / Finding | Value / Description | Implication |
|---|---|---|
| CASF Complexes with Highly Similar Training Counterparts | 49% | Nearly half the benchmark does not test generalization to new complexes. |
| Performance of Similarity-Based Lookup Algorithm | Pearson R = 0.716 (CASF-2016) | Simple memorization can achieve performance rivaling complex models. |
| Performance Drop of Top Models on CleanSplit | "Marked" and "Substantial" drop | Previous high performance was largely driven by data leakage. |
| Dataset / Split | Key Curation Methodology | Key Advantage |
|---|---|---|
| PDBbind CleanSplit [1] | Structure-based filtering removing complexes with high protein (TM-score), ligand (Tanimoto), and binding pose (RMSD) similarity to test set. | Creates a strictly separated training set, turning CASF into a true external test. |
| LP-PDBBind [2] | Minimizes sequence/chemical similarity of both proteins and ligands between splits. Removes covalent binders and clashes. | Provides a standardized, cleaned data split for robust model comparison. |
| HiQBind & HiQBind-WF [8] | Open-source workflow to correct structural artifacts (bonds, protonation, clashes) in PDB structures. | Improves structural quality and reliability of binding affinity annotations. |
| DataSAIL [9] | Algorithmic tool for similarity-aware data splitting, formulated as an optimization problem. | Generic tool for creating leakage-reduced splits for various biomedical data types. |
Objective: To generate a training dataset free of complexes that are highly similar to a designated test benchmark.
Materials: PDBbind general/refined sets (candidate training complexes), the designated test benchmark (e.g., the CASF core set), and tools for computing TM-score, Tanimoto similarity, and pocket-aligned RMSD [1].
Methodology: Follow the CleanSplit filtering protocol detailed under "Executing the CleanSplit Filtering Algorithm" below: compute all train-test similarities, then remove training complexes that exceed the exclusion thresholds [1].
Objective: To realistically assess the generalization capability of a scoring function.
Materials: A model retrained on a leakage-free split (e.g., PDBbind CleanSplit), the CASF benchmark, and an independent external test set such as BDB2020+ [1] [2].
Methodology: Benchmark the retrained model on CASF and on the independent set, and compare against the same architecture trained on the standard split to quantify how much of its performance was driven by leakage [1].
| Reagent / Resource | Type | Function / Purpose |
|---|---|---|
| PDBbind CleanSplit [1] | Curated Dataset | A training set filtered to remove structural similarities with CASF benchmarks, mitigating train-test leakage. |
| LP-PDBBind [2] | Curated Dataset | A leak-proof reorganization of PDBbind with minimized protein and ligand similarity between splits. |
| HiQBind & HiQBind-WF [8] | Data Curation Workflow | An open-source, semi-automated workflow to correct common structural artifacts in PDB complexes. |
| DataSAIL [9] | Software Tool | A Python package for performing similarity-aware data splits to minimize information leakage in biomedical ML. |
| BDB2020+ [2] | Independent Test Set | A high-quality benchmark compiled from post-2020 BindingDB and PDB data, useful for final model validation. |
| BindingNet v2 [10] | Augmented Dataset | A large set of modeled protein-ligand complexes to augment training data and improve model generalization. |
1. What is data leakage in the context of PDBBind, and why is it a problem? Data leakage occurs when highly similar protein or ligand complexes are present in both the training and testing datasets. Unlike exact duplicates, this often involves proteins with high sequence similarity or ligands with high chemical similarity. This inflates performance metrics during benchmarking because the model is tested on data that is not truly novel, giving a false impression of its ability to generalize to new, unseen complexes. Consequently, a model may perform poorly in real-world drug discovery applications where it encounters truly novel targets [6] [1] [2].
2. How can I detect if my model's performance is compromised by data leakage? A key red flag is a significant performance drop when evaluating your model on a carefully curated, leakage-proof test set compared to a standard benchmark like the CASF core set. For instance, when state-of-the-art models were retrained on a leakage-proof dataset, their performance on the CASF benchmark dropped markedly [1]. Another method is to use a simple similarity-search algorithm that predicts affinity by averaging labels from the most similar training complexes; competitive performance from this naive approach suggests that your model might be leveraging memorization rather than learning generalizable principles [1].
3. What are the main types of errors found in PDBBind that affect model training? Beyond data leakage, the database contains curation and structural errors. A manual analysis of a protein-protein subset found an ~19% error rate in curated equilibrium dissociation constants (KD). These errors were categorized as shown in the table below [11]. Furthermore, common structural artifacts include covalent binders incorrectly labeled as non-covalent, ligands with rare elements, and severe steric clashes between protein and ligand atoms, all of which can mislead the training of scoring functions [8].
4. What solutions and resources are available to mitigate these issues? Researchers have developed new dataset splits and cleaning workflows to address these problems, including leak-proof splits (LP-PDBBind [2]), filtered training sets (PDBbind CleanSplit [1]), and structure-correction workflows (HiQBind-WF [8]); see Table 3 for details.
Symptoms: Your model shows excellent performance on standard benchmarks (e.g., CASF core set) but fails to make accurate predictions for your own novel protein-ligand complexes.
Methodology: Compare your model's performance on the standard benchmark against its performance under a leakage-controlled split, and run a similarity-search baseline to test for memorization [1]. A large gap between the two evaluations indicates leakage.
Solution: Retrain your model on a leak-proof dataset. The table below summarizes the performance impact of retraining models on such datasets, demonstrating a more realistic assessment of generalization capability.
Table 1: Impact of Leak-Proof Training on Model Performance
| Model | Performance on CASF with Standard Training | Performance on CASF with Leak-Proof Training | Key Change |
|---|---|---|---|
| GenScore [1] | Excellent benchmark performance | Marked performance drop | Trained on PDBbind CleanSplit |
| Pafnucy [1] | Excellent benchmark performance | Marked performance drop | Trained on PDBbind CleanSplit |
| IGN [6] [2] | Good performance | Better generalizability on independent BDB2020+ set | Trained on LP-PDBBind |
Diagram: Troubleshooting workflow for data leakage.
Symptoms: Your model's predictions are inconsistent or show poor correlation with experimental results, even after accounting for data leakage.
Methodology: Spot-check curated KD values against their primary publications, and screen structures for artifacts such as covalent bonds, rare elements, and severe steric clashes [11] [8].
Solution: Correct the errors in your dataset or use a pre-corrected dataset. Research shows that correcting curation errors can improve the Pearson correlation between predicted and measured log10(KD) values by approximately 8 percentage points [11].
Table 2: Common Categories of Curation Errors in PDBBind
| Error Category | Description | Example |
|---|---|---|
| No KD | The protein complex in the PDB structure does not have a KD value reported in the primary publication. | KD is reported for a different protein construct than the one crystallized [11]. |
| Different Heterodimer | The KD value belongs to a different protein heterodimer than the one in the PDB structure. | KD is for full-length protein, but PDB structure is of a truncated variant [11]. |
| Units | The units of the KD value are incorrect (e.g., nM vs. µM). | PDBBind reports 1.5 × 10⁻⁷ M, but the primary paper reports 1.5 × 10⁻¹⁰ M [11]. |
| Approximate | PDBBind reports an approximate value, while the primary citation reports a more precise one. | Paper reports 7.4 × 10⁻⁷ M; PDBBind reports 8 × 10⁻⁷ M [11]. |
| Multisite KD | PDBBind provides a single KD, but the primary publication reports multiple values for a multi-site binding model. | Publication reports two KDs; PDBBind reports only one [11]. |
Diagram: Workflow for addressing data curation errors.
Table 3: Essential Research Reagents and Resources
| Item Name | Type | Function and Explanation |
|---|---|---|
| LP-PDBBind [6] [2] | Dataset | A leak-proof reorganization of PDBBind with minimized protein/ligand similarity between splits to train more generalizable models. |
| PDBbind CleanSplit [1] | Dataset | A filtered training dataset created via structure-based clustering to eliminate data leakage and redundancy within the training set. |
| BDB2020+ [6] [2] | Benchmark Dataset | An independent evaluation set compiled from BindingDB and PDB entries post-2020, used for true external validation of model generalizability. |
| HiQBind-WF [8] | Software Workflow | An open-source, semi-automated workflow that corrects common structural artifacts in PDB files (e.g., bond orders, steric clashes, protonation states). |
| Cluster-Based Cross-Validation [11] | Methodology | A validation technique that groups similar proteins into clusters, ensuring all members of a cluster are in the same data split to prevent over-optimistic performance estimates. |
| Structure-Based Clustering Algorithm [1] | Algorithm | A method to identify similar complexes using combined protein structure (TM-score), ligand chemistry (Tanimoto), and binding pose (RMSD) metrics. |
Problem: Your machine learning model for binding affinity prediction performs well on benchmark tests (like CASF) but fails dramatically in real-world drug discovery applications on novel protein targets.
Explanation: This performance gap often stems from data leakage, where models memorize similarities between training and test data instead of learning generalizable principles of protein-ligand interactions. The standard PDBbind dataset and CASF benchmark share significant structural similarities, inflating performance metrics [1] [2].
Diagnosis and Solutions:
| Symptom | Root Cause | Investigation Method | Solution |
|---|---|---|---|
| High benchmark performance but poor performance on novel targets | Protein Similarity: Highly similar protein sequences or folds between training and test sets [1] [12]. | Calculate TM-scores or sequence identity between training and test proteins [1]. | Use similarity-aware data splits (e.g., PDBbind CleanSplit, LP-PDBBind) [1] [2]. |
| Model accurately predicts affinity for known ligand scaffolds but fails on new chemotypes | Ligand Memorization: Same or highly similar ligands (Tanimoto score >0.9) in both training and test sets [1] [2]. | Compute Tanimoto coefficients between training and test ligands [1]. | Filter training set to remove ligands highly similar to those in the test set [1]. |
| Model performs well on specific binding conformations but poorly on novel poses | Binding Conformation Leakage: Nearly identical protein-ligand binding geometries (low pocket-aligned RMSD) in both datasets [1]. | Calculate pocket-aligned ligand RMSD between complexes [1]. | Implement structure-based filtering using combined protein, ligand, and conformation metrics [1]. |
Quantitative Impact of Data Leakage:
The table below summarizes the extent of data leakage identified in the standard PDBbind dataset and the performance drop observed when models are retrained on leakage-free splits [1].
| Metric | Standard PDBbind | After CleanSplit Filtering | Notes |
|---|---|---|---|
| Test Complexes Affected | ~49% of CASF complexes | Strictly independent | 49% of test complexes had highly similar counterparts in training [1]. |
| Training Complexes Removed | N/A | ~11.8% total removed | ~4% removed due to test similarity, ~7.8% for internal redundancy [1]. |
| Model Performance (RMSE) | Artificially low | Increases significantly | e.g., State-of-the-art model performance dropped on CASF-2016 after retraining on CleanSplit [1]. |
Problem: You need to create a robust training/test split for your proprietary protein-ligand dataset to ensure your model will generalize.
Explanation: Random splitting is insufficient for biomolecular data due to inherent structural and chemical similarities. Specialized algorithms and tools are required to minimize data leakage.
Workflow for Creating a Leakage-Free Split: (1) compute pairwise protein and ligand similarities across the dataset; (2) group similar complexes into clusters; (3) assign whole clusters to training, validation, or test splits; (4) validate that cross-split similarities stay below your chosen thresholds.
Implementation Methods:
| Method | Description | Tools | Applicability |
|---|---|---|---|
| Multi-Metric Filtering | Uses combined protein, ligand, and conformation similarity to identify and remove overly similar complexes [1]. | Custom scripts (e.g., PDBbind CleanSplit algorithm) [1]. | Best for structure-based affinity prediction models. |
| Optimization-Based Splitting | Formulates splitting as a combinatorial optimization problem to minimize inter-split similarity [12] [9]. | DataSAIL [12] [9] | General purpose; handles 1D (proteins or ligands) and 2D (protein-ligand pairs) data. |
| Cluster-Based Splitting | Clusters data by similarity, then assigns entire clusters to splits to ensure independence [2]. | LP-PDBBind protocol [2] | Good for controlling both protein and ligand leakage simultaneously. |
Validation Protocol: After creating your splits, validate them by (1) computing the maximum cross-split protein (TM-score) and ligand (Tanimoto) similarities and confirming that no pair exceeds your thresholds, and (2) running a similarity-lookup baseline to confirm that memorization alone cannot reproduce strong performance [1]. A minimal check for step (1) is sketched below.
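A minimal sketch of the ligand-side check, assuming ligands are available as SMILES strings; the protein-side TM-score check would be run analogously with a structural alignment program.

```python
# Validate a split by finding the worst-case (maximum) Tanimoto similarity
# between any test ligand and any training ligand.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def morgan_fp(smiles):
    return AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(smiles), radius=2, nBits=2048)

def max_cross_split_tanimoto(train_smiles, test_smiles):
    train_fps = [morgan_fp(s) for s in train_smiles]
    return max(
        max(DataStructs.BulkTanimotoSimilarity(morgan_fp(s), train_fps))
        for s in test_smiles)

# A leakage-controlled split should keep this value below ~0.9.
```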
Q1: What exactly is "data leakage" in the context of PDBbind and protein-ligand affinity prediction? Data leakage occurs when information from the test dataset inadvertently influences the training process, leading to overly optimistic performance estimates. In PDBbind, this is not usually exact duplicates but high structural and chemical similarities between complexes in the standard training set (e.g., PDBbind general/refined) and the test set (e.g., CASF core set). Models then exploit these similarities through "shortcut learning" rather than learning generalizable binding principles [1] [2].
Q2: My model uses a graph neural network (GNN). Why is it particularly vulnerable to ligand memorization? GNNs can exploit statistical shortcuts. Studies show that GNNs for binding affinity sometimes rely heavily on ligand features alone to make predictions, especially when the same or similar ligands appear in both training and test sets. When protein nodes are omitted from the graph, prediction accuracy often drops significantly, confirming that the model is memorizing ligands rather than learning protein-ligand interactions [1].
Q3: Are there any ready-to-use, leakage-free versions of PDBbind available? Yes, recent research has produced curated, leakage-reduced datasets: PDBbind CleanSplit, which filters out training complexes structurally similar to the CASF benchmarks [1], and LP-PDBBind, which controls protein and ligand similarity across all splits [2].
Q4: How does the DataSAIL tool help prevent data leakage, and when should I use it? DataSAIL is a Python package that formally treats data splitting as a combinatorial optimization problem. It is particularly valuable when: you need to split a custom or proprietary dataset for which no curated split exists, or you must control similarity in two dimensions at once (both the proteins and the ligands of protein-ligand pairs) [12] [9].
| Reagent / Resource | Type | Function in Mitigating Data Leakage |
|---|---|---|
| PDBbind CleanSplit | Curated Dataset | Provides a leakage-reduced version of PDBbind for training and evaluation, ensuring the test set (CASF) is structurally independent of the training data [1]. |
| LP-PDBBind | Curated Dataset | Offers a reorganized PDBbind with training/validation/test splits designed to minimize protein and ligand similarity, controlling for both dimensions of leakage [2]. |
| DataSAIL | Software Tool | A versatile Python package for performing similarity-aware data splits on biomolecular data, including complex protein-ligand pairs [12] [9]. |
| BDB2020+ | Independent Benchmark | An external test set compiled from BindingDB entries deposited after 2020, used for truly independent evaluation of model generalizability [2]. |
| TM-score Algorithm | Metric Algorithm | Quantifies protein structural similarity; used to identify and filter out proteins with high TM-score (>0.5) between splits [1]. |
| Tanimoto Coefficient | Metric Algorithm | Calculates ligand chemical similarity; used to filter out ligands with high Tanimoto score (>0.9) between splits [1]. |
Problem: Models exhibit high benchmark performance on CASF datasets but fail dramatically in real-world applications or on truly independent tests.
Root Cause: Significant data leakage exists between the standard PDBbind training set and the common CASF benchmark test sets [1] [13]. Nearly 49% of CASF complexes have exceptionally similar counterparts (in protein structure, ligand chemistry, and binding conformation) in the training data, allowing models to "memorize" rather than generalize [1]. This inflates performance metrics and creates over-optimistic expectations of model capability.
Solution: Implement the PDBbind CleanSplit protocol, which applies a structure-based filtering algorithm to remove problematic similarities [1] [13].
| Step | Action | Rationale |
|---|---|---|
| 1. Identify Leakage | Compare all training and test complexes using combined protein similarity (TM-score), ligand similarity (Tanimoto), and binding conformation similarity (pocket-aligned ligand RMSD) [1]. | A multi-faceted approach catches leaks that single-metric (e.g., sequence-based) checks miss. |
| 2. Remove Test Similarities | Exclude any training complex with TM-score > 0.8, Tanimoto > 0.9, or a combined (Tanimoto + (1 - RMSD)) score > 0.8 versus any test complex [1]. | Severs the direct structural shortcut between training and test examples. |
| 3. Prevent Ligand Memorization | Remove training complexes with ligands identical (Tanimoto > 0.9) to those in the test set [1]. | Stops the model from predicting affinity based solely on recognizing a known ligand. |
| 4. Reduce Internal Redundancy | Apply adapted thresholds to identify and break up large similarity clusters within the training set itself [1]. | Forces the model to learn generalizable rules instead of relying on numerous near-duplicates. |
Verification: After applying CleanSplit, retrain your model. A significant performance drop on the CASF benchmark indicates that the original model's performance was likely inflated by data leakage. A model with genuine generalization capability will maintain robust performance [1].
Problem: A model, trained on a leakage-free dataset like CleanSplit, still performs poorly on novel protein families or ligand scaffolds.
Root Cause: The model architecture itself may be prone to learning shortcuts or lacks the necessary inductive biases to capture genuine protein-ligand interactions [1] [13].
Solution: Adopt an architecture designed for generalization, such as the GEMS (Graph neural network for Efficient Molecular Scoring) model, and leverage transfer learning [1].
| Component | Implementation | Benefit |
|---|---|---|
| Sparse Graph Representation | Model the protein-ligand complex as a graph, with atoms as nodes and interactions as edges [1]. | Focuses the model on relevant local chemical environments and interactions, improving efficiency and generalization. |
| Ablation Study | Systematically remove parts of the input (e.g., protein nodes) during evaluation [1]. | Verifies that predictions are based on genuine protein-ligand interactions and not just ligand-based memorization. |
| Transfer Learning | Initialize model components using pre-trained language models on large corpora of protein sequences or chemical compounds [1]. | Provides the model with a strong foundational understanding of biochemistry and chemistry before learning the specific task of affinity prediction. |
Q1: What is the single most critical change I should make to my PDBbind training pipeline to improve model generalization?
A: The most critical change is to replace the standard PDBbind training split with a leakage-free version, such as PDBbind CleanSplit or LP-PDBBind [1] [2]. This ensures your model is evaluated on a test set that truly represents novel challenges, providing a realistic measure of its real-world applicability.
Q2: My model's performance dropped significantly after I switched to CleanSplit. Does this mean my model is bad?
A: Not necessarily. A performance drop is an expected and positive sign that you have successfully eliminated the data leakage that was artificially inflating your metrics [1]. It means you are now measuring your model's true generalization capability. This provides a more honest starting point for further model improvement.
Q3: Are there automated tools available to create my own leakage-free data splits for other biomolecular datasets?
A: Yes. Tools like DataSAIL are specifically designed for this purpose [12]. DataSAIL formulates leakage-reduced data splitting as a combinatorial optimization problem, handling complex scenarios involving one-dimensional (e.g., single molecules) and two-dimensional (e.g., drug-target pairs) data while controlling for similarity across splits.
Q4: Beyond data leakage, what other data quality issues should I be aware of in PDBbind?
A: Several other issues can compromise model training, which workflows like HiQBind-WF and PDBBind-Opt aim to fix [8] [14]. Key problems include: covalent binders mislabeled as non-covalent complexes, ligands containing rare elements, severe protein-ligand steric clashes, and incorrect bond orders or protonation states.
The following diagram illustrates the logical workflow of the structure-based filtering algorithm used to create PDBbind CleanSplit.
Protocol: Executing the CleanSplit Filtering Algorithm
Objective: To create a training dataset (CleanSplit) free of data leakage against a designated test set (e.g., CASF core set) by removing structurally similar complexes.
Inputs: Candidate training complexes (PDBbind general/refined sets), the designated test set (e.g., the CASF core set), and their protein and ligand structure files [1].
Methodology: For every train-test pair, compute protein structural similarity (TM-score), ligand chemical similarity (Tanimoto), and binding conformation similarity (pocket-aligned ligand RMSD) [1].
Application of Exclusion Criteria: A training complex is excluded if it meets ANY of the following conditions versus a test complex [1]: TM-score > 0.8; ligand Tanimoto similarity > 0.9; or a combined (Tanimoto + (1 - RMSD)) score > 0.8 (see the sketch after this protocol).
Redundancy Reduction (Optional but Recommended): Apply adapted versions of the above thresholds to identify and remove similar complexes within the training set, ensuring greater diversity and discouraging memorization [1].
Output: A filtered training dataset (PDBbind CleanSplit) rigorously separated from the test set.
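As a concrete illustration, here is a minimal sketch of the per-pair exclusion test using the thresholds quoted above. How the RMSD term is scaled into the combined score follows the original publication [1]; the sketch assumes a pre-normalized pocket_rmsd_norm value in [0, 1].

```python
# CleanSplit-style exclusion test for one train-test pair.
def exclude_training_complex(tm_score, tanimoto, pocket_rmsd_norm):
    """Return True if the training complex should be removed."""
    if tm_score > 0.8:        # near-identical protein structure
        return True
    if tanimoto > 0.9:        # near-identical ligand
        return True
    if tanimoto + (1.0 - pocket_rmsd_norm) > 0.8:  # combined pose criterion
        return True
    return False
```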
Objective: To quantify the impact of data leakage and validate the effectiveness of the CleanSplit dataset.
Method: Retrain the model under test on both the standard PDBbind split and PDBbind CleanSplit, then evaluate both versions on the CASF benchmark and compare their scores [1].
Expected Results:
| Model | Training Set | CASF Benchmark Performance (Pearson R / RMSE) | Interpretation |
|---|---|---|---|
| GenScore | Standard PDBbind | High (Inflated) | Performance likely driven by data leakage [1] |
| GenScore | PDBbind CleanSplit | Substantially Lower | Reveals the model's true generalization capability [1] |
| GEMS | PDBbind CleanSplit | Maintains High | Demonstrates genuine generalization, not reliant on leakage [1] |
| Tool / Resource | Type | Primary Function | Relevance to Mitigating Data Leakage |
|---|---|---|---|
| PDBbind CleanSplit | Curated Dataset | Provides a leakage-free training split for PDBbind. | The core solution; a benchmark-ready dataset for robust model training and evaluation [1] [13]. |
| LP-PDBBind | Curated Dataset | A reorganized PDBbind split controlling for protein and ligand similarity. | An alternative leakage-proof dataset, also used to retrain and re-evaluate scoring functions [2] [6]. |
| DataSAIL | Software Tool | Computes optimal data splits for biomedical ML to minimize information leakage. | Generalizes the splitting protocol; can be applied to create custom leakage-free splits for various datasets and problem types [12]. |
| HiQBind-WF / PDBBind-Opt | Workflow | An open-source, automated workflow for correcting structural artifacts in protein-ligand complexes. | Addresses data quality issues orthogonal to leakage, such as fixing incorrect bond orders, removing covalent binders, and resolving steric clashes [8] [14]. |
| GEMS Model | Machine Learning Model | A graph neural network for binding affinity prediction. | An example of a model architecture designed to achieve high performance without relying on data leakage, using sparse graphs and transfer learning [1]. |
What is the primary objective of the LP-PDBBind protocol? The primary objective of the LP-PDBBind (Leak-Proof PDBBind) protocol is to reorganize the popular PDBBind dataset into training, validation, and test sets that rigorously control for data leakage. Data leakage is defined as the presence of proteins and ligands with high sequence and structural similarity across different dataset splits, which can lead to artificially inflated performance metrics and poor generalizability of scoring functions to truly novel protein-ligand complexes [2].
How does "data leakage" specifically impact the development of scoring functions? When data leakage occurs, machine learning models or empirical scoring functions may achieve high performance on test sets by "memorizing" similarities to the training data, rather than by learning generalizable principles of binding. This creates an overoptimistic assessment of a model's capability. Consequently, a model that performs excellently on a contaminated test set may perform poorly in real-world drug discovery applications on novel targets or compounds [2] [3].
What are the key differences between LP-PDBBind and the standard PDBBind split? The standard PDBBind's "general," "refined," and "core" sets are known to be cross-contaminated with highly similar proteins and ligands. In contrast, LP-PDBBind introduces a new data splitting strategy that minimizes sequence and chemical similarity of both proteins and ligands between the training, validation, and test datasets. It also includes additional data cleaning steps to remove covalent binders and correct structural artifacts [2].
What are the specific similarity thresholds used to define data leakage in LP-PDBBind? The LP-PDBBind protocol defines and controls for similarity using pairwise comparisons. The specific thresholds are designed to ensure that proteins and ligands in the test set are not highly similar to those in the training set. The following table summarizes the key criteria:
Table: Key Similarity Control Criteria in LP-PDBBind
| Entity | Similarity Measure | Objective |
|---|---|---|
| Protein | Pairwise sequence similarity | Ensure test proteins have low sequence similarity to training proteins [2]. |
| Ligand | Chemical fingerprint similarity (e.g., Tanimoto similarity) | Ensure test ligands are chemically dissimilar to training ligands [2]. |
| Protein-Ligand Pair | Structural interaction patterns | Minimize similarity in protein-ligand interaction patterns between splits [2]. |
The dataset size after applying LP-PDBBind is smaller. Is this a problem? A reduction in dataset size is an expected and acceptable consequence of rigorous data curation. The primary goal of LP-PDBBind is not to maximize quantity, but to ensure quality and reliability for model evaluation. A smaller, "leak-proof" dataset provides a more realistic and trustworthy benchmark for assessing the true generalizability of your scoring function [2] [3].
How do I access and use the LP-PDBBind dataset?
The LP-PDBBind dataset is available via a GitHub repository. The repository contains meta-information files (e.g., LP_PDBBind.csv) that specify the new data splits, clean levels, and other annotations. You will need to cross-reference this with structure files downloaded from the PDBBind website [15].
Table 1: LP-PDBBind Dataset Structure
| Component | Description | File/Location |
|---|---|---|
| Meta-information | PDB IDs, splits, SMILES, sequences, affinity data | dataset/LP_PDBBind.csv |
| Structure Files | Protein (.pdb) and ligand (.sdf/.mol2) structures | To be downloaded from the official PDBBind website. |
| Clean Levels | Boolean flags (CL1, CL2, CL3) indicating data quality tiers | Specified in the meta-information file. |
My model, trained on LP-PDBBind, shows lower performance on the test set. What does this mean? A drop in performance when moving from a standard split to LP-PDBBind is not a failure of your model, but rather an indication that the previous evaluation was likely biased. LP-PDBBind provides a more rigorous and realistic assessment of your model's scoring power. This result underscores the importance of using a leakage-free benchmark to guide the development of generalizable models [2].
Table 2: Essential Materials for LP-PDBBind and Related Research
| Research Reagent / Tool | Type | Primary Function |
|---|---|---|
| LP-PDBBind Dataset | Curated Dataset | A leakage-proof benchmark for training and evaluating protein-ligand scoring functions [2] [15]. |
| BDB2020+ Dataset | Independent Test Set | An independent benchmark compiled from BindingDB entries deposited after 2020, used for final model validation [2] [15]. |
| DataSAIL | Software Tool | A Python package for performing similarity-aware data splitting to minimize information leakage in biomedical ML tasks [12]. |
| HiQBind-WF | Software Workflow | An open-source, semi-automated workflow for curating high-quality, non-covalent protein-ligand datasets and correcting structural artifacts [8]. |
The following diagram illustrates the workflow for generating the LP-PDBBind dataset, which involves data cleaning and similarity-based splitting.
LP-PDBBind Creation Workflow
Step-by-Step Methodology:
1. Data Cleaning and Curation: Remove covalent binders and correct structural artifacts in the remaining complexes [2].
2. Similarity Analysis: Compute pairwise protein sequence similarity and ligand chemical (fingerprint) similarity across all complexes [2].
3. Similarity-Aware Data Splitting: Assign complexes to training, validation, and test sets such that cross-split protein and ligand similarities remain below the defined thresholds [2]. A minimal sequence-similarity sketch follows this list.
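A minimal sketch of the similarity-analysis step for proteins, using Biopython's PairwiseAligner; the exact scoring scheme and thresholds used by LP-PDBBind may differ, so treat this as an illustrative identity estimate.

```python
# Estimate pairwise protein sequence identity for split control.
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "global"  # defaults: match = 1, mismatch = 0, gaps = 0

def sequence_identity(seq_a, seq_b):
    matches = aligner.score(seq_a, seq_b)  # optimal number of matching residues
    return matches / min(len(seq_a), len(seq_b))

# Place two proteins in different splits only if their identity is low.
```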
This filtering methodology aims to mitigate data leakage in protein-ligand binding affinity prediction models, particularly for datasets like PDBbind. Data leakage occurs when models are trained and tested on non-independent data, leading to overoptimistic performance that doesn't generalize to real-world applications. By employing three complementary metrics, the approach ensures training and test sets contain structurally distinct complexes [13] [1].
Each metric captures a different dimension of protein-ligand complex similarity, providing a more robust assessment than any single metric could achieve [13]:
This multimodal approach can identify complexes with similar interaction patterns even when proteins have low sequence identity, addressing limitations of traditional sequence-based filtering [13].
The table below summarizes the key filtering thresholds used to identify and remove overly similar protein-ligand complexes:
Table 1: Multimodal Filtering Thresholds for Identifying Data Leakage
| Metric | Measurement Focus | Similarity Threshold | Interpretation Guidelines |
|---|---|---|---|
| TM-score | Protein structure similarity | >0.5 | Generally indicates the same protein fold [16] |
| Tanimoto Coefficient | Ligand chemical similarity | >0.9 | Indicates highly similar or identical ligands [13] |
| Pocket-aligned RMSD | Binding conformation similarity | <2.0 Å | Suggests nearly identical ligand positioning [13] |
Application of these thresholds to the PDBbind-CASF benchmark relationship revealed nearly 600 high-similarity train-test pairs involving 49% of CASF test complexes, leading to the removal of roughly 12% of the standard training set [1].
The following diagram illustrates the complete multimodal filtering process:
Table 2: Essential Tools and Resources for Implementing Multimodal Filtering
| Tool/Resource | Type | Primary Function | Implementation Notes |
|---|---|---|---|
| TM-score | Software utility | Quantifies protein structural similarity | Available as C++ or Fortran source code; values >0.5 indicate same fold [16] |
| Tanimoto Coefficient | Mathematical metric | Calculates 2D molecular similarity based on chemical fingerprints | Typically implemented using RDKit or similar cheminformatics libraries [13] |
| Pocket-aligned RMSD | Geometric calculation | Measures binding mode similarity after structural alignment | Requires prior pocket alignment; values <2.0 Å indicate near-identical positioning [13] |
| PDBbind Database | Data resource | Source of protein-ligand complexes with binding affinities | General/refined sets for training; core set for testing [13] [2] |
| CASF Benchmark | Evaluation dataset | Standard benchmark for scoring functions | Must be separated from training data via filtering [13] |
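Pocket-aligned RMSD presupposes an optimal superposition. Below is a minimal numpy sketch of the standard Kabsch algorithm; in practice the rotation would be fitted on pocket atoms and then applied to the ligand coordinates before measuring RMSD.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate arrays after optimal superposition."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T     # optimal rotation
    return float(np.sqrt(np.mean(np.sum((P @ R.T - Q) ** 2, axis=1))))
```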
By ensuring strict separation between training and test complexes, models cannot rely on memorizing similar structures and must learn genuine protein-ligand interaction principles. When state-of-the-art models were retrained on the filtered PDBbind CleanSplit, their performance dropped substantially, indicating previous benchmark results were inflated by data leakage [13].
Time-based splitting (training on pre-2020 data, testing on post-2020 data) doesn't adequately address the issue because new drugs often target established proteins, and existing drugs are tested on new proteins. Structural similarities can still occur across time partitions, making multimodal filtering more reliable for ensuring true independence [2].
The all-against-all comparison of protein-ligand complexes is computationally demanding but crucial. For large datasets like PDBbind, this requires efficient implementation and potentially high-performance computing resources. The TM-score calculation, in particular, involves complex structural alignments that can be computationally expensive [16].
Yes, this is a key advantage. Unlike sequence-based methods, the multimodal approach can detect complexes with similar interaction patterns even when protein sequences show low identity. This makes it particularly valuable for identifying subtle data leakage that would escape traditional filtering methods [13].
Problem: TM-score values are inconsistent across runs or tools. Solution: Ensure you're normalizing by the same chain length when comparing scores. TM-score values depend on the normalization length, so consistent implementation is crucial for reproducible filtering [16]. A worked example of this dependence follows.
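A short sketch of this dependence, using Zhang and Skolnick's d0 formula: the same set of aligned-residue distances yields different TM-scores under different normalization lengths.

```python
import numpy as np

def tm_score(dists, l_norm):
    """TM-score from aligned-residue distances (in Å), normalized by l_norm."""
    d0 = 1.24 * (l_norm - 15) ** (1.0 / 3.0) - 1.8
    return float(np.sum(1.0 / (1.0 + (dists / d0) ** 2)) / l_norm)

d = np.random.default_rng(0).uniform(0.5, 6.0, size=120)
print(tm_score(d, 120), tm_score(d, 300))  # same alignment, two different scores
```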
Problem: All-against-all similarity computation is prohibitively slow for large datasets. Solution: Consider implementing a tiered approach where rapid fingerprint-based screening (Tanimoto) is performed first, followed by more computationally intensive structural comparisons (TM-score, pocket-RMSD) only for promising candidates.
Problem: The model still generalizes poorly after train-test filtering. Solution: Re-examine your similarity thresholds. You may need to tighten them for specific applications. Additionally, check for similarities within the training set itself, as internal redundancies can also hamper model generalization [13].
Problem: Covalent complexes are distorting affinity predictions. Solution: Exclude covalent protein-ligand complexes from your dataset before applying multimodal filtering, as they represent a different binding paradigm that requires specialized treatment in scoring functions [8].
In the field of computational drug design, accurately predicting protein-ligand binding affinity is crucial for structure-based drug discovery. While the issue of data leakage between training and test sets has gained significant attention, a more insidious problem often lurks within the training data itself: redundancy. This technical guide addresses strategies for identifying and mitigating redundancy within training sets, specifically focusing on PDBbind datasets, to build models that genuinely generalize to novel protein-ligand complexes rather than merely memorizing structural similarities.
Random splitting assumes data points are independent and identically distributed. However, biomolecular data, such as protein-ligand complexes, exhibit complex dependency structures. For example, multiple complexes might share nearly identical protein structures, highly similar ligands, or comparable binding conformations. A random split can easily place these highly similar complexes in both the training and validation sets, leading to overoptimistic validation metrics and masking poor true generalization [1] [12].
Redundancy can be quantified using a multimodal similarity approach that assesses several axes of similarity between data points. Key metrics include: protein structural similarity (TM-score), ligand chemical similarity (Tanimoto coefficient on molecular fingerprints), and binding conformation similarity (pocket-aligned ligand RMSD) [1].
Counterintuitively, removing redundant data can improve model generalization and final test performance on independent data. Training on a highly redundant set is like studying for an exam by reading the same paragraph repeatedly; you become an expert on that paragraph but fail to understand the chapter. Similarly, models trained on diverse, non-redundant sets are forced to learn broader, more generalizable patterns. Research on chest X-ray datasets showed that models trained on a redundancy-reduced "informative subset" of data significantly outperformed models trained on the full, redundant dataset during both internal and external testing [17].
Diagnosis: This classic sign suggests either train-test leakage or that your validation set is not truly independent due to underlying redundancy in the entire dataset.
Solution: Implement a similarity-clustered split.
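A minimal sketch of such a split, assuming a precomputed all-against-all similarity matrix (e.g., Tanimoto or TM-score); production protocols such as LP-PDBBind apply additional criteria.

```python
# Cluster by similarity, then assign whole clusters to splits so that highly
# similar complexes never straddle the train/test boundary.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def clustered_split(similarity, sim_cutoff=0.9, test_frac=0.2):
    dist = 1.0 - similarity
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(Z, t=1.0 - sim_cutoff, criterion="distance")
    n, test_idx = similarity.shape[0], []
    for c in np.unique(labels):                 # whole clusters, one split each
        if len(test_idx) < test_frac * n:
            test_idx.extend(np.where(labels == c)[0].tolist())
    train_idx = [i for i in range(n) if i not in set(test_idx)]
    return train_idx, test_idx
```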
Diagnosis: The concern is valid, but the goal is to remove redundant information, not unique information. The key is to prioritize quality and diversity over sheer quantity.
Solution: Use an entropy-based informative sample selection.
Diagnosis: In two-dimensional data, leakage can occur if similar proteins or similar ligands appear across different splits.
Solution: Use a specialized tool for two-dimensional splitting.
This protocol is based on the methodology established to address data leakage in the PDBbind database [1].
This protocol is adapted from methods successfully applied to medical imaging datasets to remove semantic redundancy [17].
Entropy = -Σ p_i * log(p_i), where p_i is the predicted probability for class i.

Table 1: Impact of Data Filtering as Reported in PDBbind CleanSplit Study [1]
| Filtering Type | Complexes Removed | Key Consequence |
|---|---|---|
| Train-Test Leakage Reduction | ~4% of PDBbind training set | Addressed similarity for 49% of CASF-2016 test complexes, turning them into genuine external tests. |
| Intra-Training Redundancy Reduction | ~7.8% of PDBbind training set | Broke up large similarity clusters within the training set, discouraging memorization. |
| Cumulative Filtering | ~11.8% of PDBbind training set | Created the PDBbind CleanSplit, a refined dataset for robust model evaluation. |
Table 2: Performance Comparison on Redundant vs. Non-Redundant Data
| Dataset / Strategy | Reported Performance Insight |
|---|---|
| Standard PDBbind Split | Top models (e.g., GenScore, Pafnucy) showed high CASF performance, which dropped substantially when retrained on CleanSplit, indicating performance was previously driven by data leakage [1]. |
| PDBbind CleanSplit | A GNN model (GEMS) maintained high CASF performance when trained on CleanSplit, demonstrating genuine generalization capability [1]. |
| Entropy-Based Subset (Medical Imaging) | Model trained on an informative subset achieved significantly higher recall (0.7164 vs 0.6597) on internal test and dramatically better generalization on external test (0.3185 vs 0.2589) compared to a model trained on the full, redundant dataset [17]. |
Multimodal Filtering Workflow - This diagram illustrates the two-stage process for creating a non-redundant training set, first by removing data points too similar to the test set, and then by reducing redundancy within the training data itself.
Entropy-Based Sample Selection - This diagram shows the process of using a baseline model to identify the most informative samples in a dataset based on prediction entropy, leading to a refined, non-redundant training subset.
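A minimal sketch of the entropy scoring itself, assuming a baseline classifier that outputs per-class probabilities; the keep fraction is a placeholder to be tuned.

```python
import numpy as np

def select_informative(probs, keep_frac=0.6):
    """probs: (N, C) predicted class probabilities from a baseline model."""
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)  # avoid log(0)
    k = int(keep_frac * len(probs))
    return np.argsort(entropy)[-k:]  # indices of the most uncertain samples

# subset_idx = select_informative(baseline_model.predict_proba(X))
```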
Table 3: Key Tools and Resources for Mitigating Data Redundancy
| Tool / Resource | Type | Function & Application |
|---|---|---|
| PDBbind CleanSplit [1] | Curated Dataset | A pre-filtered version of PDBbind with reduced train-test leakage and internal redundancy. Use as a benchmark training set for robust evaluation. |
| DataSAIL [12] | Python Package | Performs similarity-aware data splitting for 1D and 2D data. Ideal for creating splits that minimize leakage for protein, ligand, or protein-ligand pairs. |
| TM-score [1] | Algorithm/Metric | Measures protein structural similarity. A key metric for identifying redundant protein complexes in a dataset. |
| Tanimoto Coefficient [1] | Algorithm/Metric | Measures ligand similarity based on molecular fingerprints. Essential for identifying redundant ligands in a dataset. |
| Pocket-Aligned RMSD [1] | Algorithm/Metric | Measures the similarity of ligand binding conformations. Critical for assessing redundancy in the binding pose. |
| Entropy-Based Scoring [17] | Methodology | A strategy to score training samples by their informativeness, allowing for the creation of a potent, non-redundant subset without predefined similarity thresholds. |
1. What is HiQBind-WF and why was it developed? HiQBind-WF is an open-source, semi-automated workflow designed to create high-quality, non-covalent protein-ligand binding datasets. It was developed to address common structural artifacts and data quality issues found in widely used datasets like PDBbind, which can compromise the accuracy and generalizability of scoring functions used in drug discovery [18] [19] [20].
2. What are the main types of structural errors corrected by this workflow? The workflow specifically identifies and corrects several key issues [18] [19] [14]: covalent binders mislabeled as non-covalent complexes, ligands containing rare elements, severe protein-ligand steric clashes, and incorrect ligand bond orders or protonation states.
3. How does HiQBind-WF improve dataset reproducibility? HiQBind-WF is designed as a semi-automated, open-source workflow. This minimizes manual intervention and fosters transparency, ensuring that the data curation process is consistent and reproducible for the entire research community [18] [19].
4. What is the difference between the optimized PDBbind and the new HiQBind dataset? The workflow can be applied to optimize the existing PDBbind dataset (creating PDBbind-Opt). Furthermore, it was used to create a completely new dataset, HiQBind, by matching binding free energies from sources like BioLiP, Binding MOAD, and BindingDB with co-crystalized structures from the PDB. HiQBind serves as an independent benchmark for scoring functions [18] [21] [19].
5. Where can I access the HiQBind-WF tools and datasets? The code for the HiQBind workflow is available on GitHub under an MIT license [21]. The prepared HiQBind dataset is accessible via a Figshare repository [21].
Problem: Your dataset contains protein-ligand complexes with structural errors that negatively impact scoring function training.
| Symptoms | Root Cause | Solution with HiQBind-WF |
|---|---|---|
| Poor scoring function performance/ generalizability [18] | Underlying training data contains structural artifacts [18] [14] | Apply the full HiQBind-WF curation pipeline to fix ligand and protein structures [19]. |
| Physically impossible binding predictions | Non-covalent complexes mislabeled or containing severe steric clashes [19] | Use the Covalent Binder Filter and Steric Clashes Filter to remove non-physical complexes [19] [14]. |
| Model bias towards rare elements | Ligands with infrequent elements (e.g., Te, Se) create data sparsity [19] | Apply the Rare Element Filter to exclude ligands with elements beyond H, C, N, O, F, P, S, Cl, Br, I [19] [14]. |
Step-by-Step Protocol:
1. Clone the HiQBind-WF repository from GitHub and install its dependencies [21].
2. Run the curation modules (ligand fixing, protein fixing, and the covalent, rare-element, and steric-clash filters) on your complexes [19].
3. Confirm that each complex finished processing by checking for its done.tag file [21].

Problem: Your machine learning models for binding affinity prediction show inflated performance during benchmarking but fail to generalize to truly new protein-ligand complexes due to data leakage.
| Symptoms | Root Cause | Solution with HiQBind-WF & Data Splitting |
|---|---|---|
| High benchmark scores but poor real-world performance [1] | Train and test sets contain proteins/ligands with high sequence/structural similarity [2] [1] | Use similarity-controlled splits (like LP-PDBBind) to minimize data leakage [2]. |
| Model memorization instead of learning interactions [1] | Redundant complexes in training set [1] | Apply data clustering and filtering to reduce internal dataset redundancy [1]. |
Step-by-Step Protocol for Creating a Leak-Proof Split:
The following workflow diagram illustrates the integrated process of using HiQBind-WF for structural curation and data splitting to achieve generalizable models:
Problem: You need to create a new, high-quality protein-ligand binding dataset from various public sources to ensure independence and reliability.
Step-by-Step Protocol:
1. Collect binding affinity data from sources such as BioLiP, Binding MOAD, and BindingDB and match them to co-crystallized structures in the PDB [18] [21].
2. Run the HiQBind-WF curation pipeline on the matched complexes [19].
3. Navigate the final dataset via its metadata file (e.g., hiq_sm.csv) linking to individual structure folders [21].

Table: Key Resources for Protein-Ligand Dataset Curation and Model Training
| Item | Function / Description | Relevance to HiQBind-WF |
|---|---|---|
| HiQBind-WF GitHub Repo [21] | Contains all scripts for the semi-automated curation workflow. | Primary tool for reproducing the dataset creation and optimization process. |
| Figshare HiQBind Repository [21] | Hosts the final, prepared HiQBind dataset. | Provides direct access to the ready-to-use, high-quality dataset. |
| LP-PDBBind Dataset & Code [15] | Provides meta-information and scripts for creating leak-proof data splits. | Essential for mitigating data leakage when splitting datasets for machine learning. |
| BDB2020+ Dataset [2] [15] | An independent test set of protein-ligand complexes from BindingDB and PDB (post-2020). | Serves as a stringent external benchmark for evaluating model generalizability. |
| PDBFixer [14] | A tool for adding missing atoms and residues to protein structures. | Used within the HiQBind-WF's ProteinFixer module [14]. |
| RDKit [15] | A collection of cheminformatics and machine learning tools. | Used for processing ligand structures and calculating chemical similarities [15]. |
This guide addresses common structural artifacts in protein-ligand complexes and their critical connection to data leakage in machine learning model training, such as with PDBbind datasets. Proper identification and correction are essential for developing reliable predictive models in drug discovery.
Issue: Inaccurate sidechain conformations, particularly in binding pockets, create false structural patterns. Models trained on these artifacts learn to predict based on incorrect geometries, failing to generalize to real, flexible proteins [22].
Solution: Re-sample problematic sidechains using rotamer libraries and local energy minimization, and validate rotamer quality (e.g., with MolProbity scores) against the target metrics in Table 1 before training [22].
Issue: The positions of hydrogen atoms are often not determined in experimental methods like X-ray crystallography and are added computationally. Incorrect placement can skew calculations of hydrogen bonding and binding affinity, leading models to learn erroneous physico-chemical rules [22].
Solution: Re-add hydrogens with validated protonation tools at the relevant pH and verify the resulting hydrogen-bond network geometry against the criteria in Table 1 [22].
Issue: If a ligand's bond order (e.g., single vs. double) or stereochemistry (e.g., R vs. S) is misassigned in the training data, a model may "memorize" this incorrect feature. During evaluation on a test set containing the same error, performance seems high, but the model will fail on data with correct chemistry—a classic case of data leakage [12].
Solution: Standardize ligand bond orders and stereochemistry against canonical SMILES or reference chemical dictionaries before training, and verify conformity as described in Table 1 [12].
Issue: Most traditional docking methods treat the protein receptor as rigid, often using a single, ligand-bound (holo) conformation. Models trained exclusively on such data learn to recognize only one conformational state and perform poorly when presented with an unbound (apo) structure or a different conformation, as they are effectively "leaking" state-specific information [22].
Solution: Train on both apo and holo conformations where available, or use flexibility-aware docking tools (e.g., FlexPose, DynamicBind) so that models are not tied to a single conformational state [22].
The following workflows provide detailed methodologies for addressing structural artifacts.
This diagram outlines a general-purpose pipeline for structural quality control.
This diagram illustrates steps to prevent data leakage when splitting datasets for machine learning, crucial for PDBbind-based research [12].
The following table summarizes common artifacts, their impact on model training, and key metrics for validation.
Table 1: Summary of Common Structural Artifacts and Correction Metrics
| Artifact Category | Impact on ML Model Generalization | Key Diagnostic Metric(s) | Target Value for Correction |
|---|---|---|---|
| Protein Sidechain Rotamers | Model learns non-physical binding site geometries; fails on flexible targets [22]. | Rotamer outlier score (from MolProbity); RMSD of sidechain atoms. | >95% in favored rotamers; RMSD < 0.5 Å. |
| Ligand Bond Order/Stereochemistry | Data leakage via memorization of incorrect chemistry; poor prediction on novel scaffolds [12]. | Check against canonical SMILES; bond length and angle deviations. | 100% conformity with canonical structure; bond angle deviation < 5°. |
| Hydrogen Bonding Network | Skews prediction of binding affinity and specific interactions [22]. | Donor-acceptor distance; angle geometry; number of unsatisfied H-bond donors/acceptors. | Distance: 2.5-3.5 Å; Angle: >120°; No unsatisfied strong donors/acceptors. |
| Global Protein Conformation (Apo vs. Holo) | Inability to handle induced fit; poor cross-docking performance [22]. | RMSD of binding site residues between apo and holo forms; TM-score. | TM-score > 0.5 for similar folds; flexible docking required if RMSD > 2 Å. |
Table 2: Key Software Tools for Structural Artifact Correction and Analysis [23]
| Tool Name | Primary Function | Relevance to Artifact Correction |
|---|---|---|
| ChimeraX | Molecular Visualization and Analysis | Interactive visualization for identifying clashes, validating rotamers, and analyzing hydrogen bonds. |
| PyMOL | Molecular Visualization and Rendering | High-quality imaging and scripting for in-depth structural analysis and figure generation. |
| MOE (Molecular Operating Environment) | Integrated Drug Discovery Suite | Comprehensive tools for structure preparation, protonation, energy minimization, and rotamer sampling. |
| VMD | Visualization and Analysis of Biomolecular Systems | Powerful for analyzing large systems, molecular dynamics trajectories, and volumetric data. |
| Schrödinger Suites | Integrated Computational Drug Discovery Platform | Industry-standard tools for protein preparation, ligand docking, and advanced simulation. |
| Swiss PDB Viewer | Protein Structure Analysis and Modeling | User-friendly interface for comparative modeling, energy minimization, and rotamer libraries. |
| DataSAIL | Data Splitting for Machine Learning | Mitigates data leakage by ensuring similarity-reduced splits for training and test sets [12]. |
| FlexPose / DynamicBind | Flexible Protein-Ligand Docking | DL-based tools that model protein flexibility for more accurate docking to apo structures [22]. |
Q1: Why is it critical to filter out covalent binders from non-covalent training sets? Covalent binding involves the formation of chemical bonds, which is fundamentally different from the non-covalent interactions (e.g., hydrogen bonding, hydrophobic effects) that standard scoring functions are designed to model. Including covalent binders in a dataset for non-covalent interaction prediction can confuse the model, compromise the accuracy of the learned energy landscape, and reduce its generalizability. A dedicated filter should be used to exclude ligands covalently bound to the protein, as indicated by the "CONECT" record in the PDB file [8] [14].
Q2: How do ligands with rare elements negatively impact model training? Ligands containing elements other than the common set (H, C, N, O, F, P, S, Cl, Br, I) are problematic due to data sparsity. Their infrequent occurrence (e.g., containing Te or Se) makes it challenging for machine learning models to learn meaningful binding features associated with them, potentially leading to poor generalization. Filtering them out ensures the model focuses on robust, frequently observed chemical interactions [8] [14].
Q3: What are the consequences of not filtering steric clashes? Severe steric clashes (protein-ligand heavy atom pairs closer than 2 Å) often arise from electron density uncertainties or inaccurate structural reconstruction. These clashes are physically infeasible for non-covalent interactions. Including them in training can be detrimental, causing physics-based scoring functions to underestimate repulsion energy and teaching machine learning models incorrect structural priors [8] [14].
Q4: How do these data quality issues relate to the broader problem of data leakage? Data leakage artificially inflates performance metrics during benchmarking. While often discussed in the context of train-test similarity, underlying data quality issues are a subtler form of leakage. If a model learns from incorrect data (e.g., structures with clashes or misclassified covalent complexes), it memorizes artifacts rather than generalizable biological principles. This leads to over-optimistic benchmark performance and failure in real-world applications, such as virtual screening on meticulously prepared structures [1] [8].
Problem: Your model's predictions are inaccurately skewed for certain targets, potentially because it was trained on a mixture of covalent and non-covalent mechanisms.
Solution: Implement an automated filter based on PDB file annotations.
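A minimal sketch of such a filter is shown below. It assumes a standard single-model PDB file and treats any CONECT record linking a HETATM (ligand) atom to an ATOM (protein) atom as evidence of covalent attachment; the water skip-list and file handling are illustrative, not part of any published workflow.

```python
# Minimal sketch: flag likely covalent ligands by checking whether any
# CONECT record links a ligand (HETATM) atom to a protein (ATOM) atom.
# Assumes a standard single-model PDB file; waters are skipped.

def is_covalent_complex(pdb_path, skip_resnames=("HOH",)):
    atom_kind = {}      # atom serial number -> "protein" or "ligand"
    conect_pairs = []   # (serial, bonded serial) pairs from CONECT records
    with open(pdb_path) as fh:
        for line in fh:
            record = line[:6].strip()
            if record in ("ATOM", "HETATM"):
                serial = int(line[6:11])
                resname = line[17:20].strip()
                if resname in skip_resnames:
                    continue
                atom_kind[serial] = "protein" if record == "ATOM" else "ligand"
            elif record == "CONECT":
                fields = [line[i:i + 5].strip()
                          for i in range(6, len(line.rstrip()), 5)]
                serials = [int(f) for f in fields if f]
                if serials:
                    conect_pairs.extend((serials[0], s) for s in serials[1:])
    # A covalent attachment appears as a bond between a ligand atom
    # and a protein atom
    return any(
        {atom_kind.get(a), atom_kind.get(b)} == {"protein", "ligand"}
        for a, b in conect_pairs
    )
```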
Problem: Your model shows high prediction error for ligands containing low-frequency elements not well-represented in the training data.
Solution: Apply a chemical element filter to standardize the ligand chemistry in your dataset.
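A minimal sketch using RDKit, with the element whitelist from the Q&A above; the SDF path is a placeholder for your own ligand file.

```python
# Minimal sketch: keep only ligands composed of common organic elements.
from rdkit import Chem

ALLOWED = {"H", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I"}

def has_only_common_elements(mol):
    return all(atom.GetSymbol() in ALLOWED for atom in mol.GetAtoms())

# "ligands.sdf" is a placeholder path
kept = [
    mol for mol in Chem.SDMolSupplier("ligands.sdf", removeHs=False)
    if mol is not None and has_only_common_elements(mol)
]
print(f"{len(kept)} ligands pass the element filter")
```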
Problem: Your model generates poses with unrealistic atom-atom overlaps or fails to predict repulsive interactions correctly.
Solution: Implement a steric clash filter based on interatomic distances.
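A minimal sketch of the distance check, assuming heavy-atom coordinates are already loaded as NumPy arrays (parsing is left to your toolkit of choice); 2 Å is the clash criterion cited above.

```python
# Minimal sketch: flag complexes with protein-ligand heavy-atom pairs
# closer than 2 Å, the physically infeasible regime for non-covalent
# binding discussed in Q3 above.
import numpy as np

def has_steric_clash(protein_xyz, ligand_xyz, cutoff=2.0):
    """protein_xyz: (N, 3) and ligand_xyz: (M, 3) heavy-atom coordinates."""
    # Pairwise distances between every protein and ligand heavy atom
    diffs = protein_xyz[:, None, :] - ligand_xyz[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return bool((dists < cutoff).any())
```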
This protocol outlines the steps for creating a high-quality, non-covalent protein-ligand dataset, integrating the fixes for the key issues above [8] [7].
1. Data Retrieval and Splitting
2. Application of Content Filters
3. Structure Refinement
This protocol provides a framework for measuring the impact of your curation efforts.
1. Establish a Baseline
2. Apply Curation Workflow
3. Quantitative Analysis
Table 1: Example Filter Impact on a Dataset
| Filter Type | Complexes Removed | Common Rationale |
|---|---|---|
| Covalent Binders | 955 entries [14] | Fundamental mechanistic difference from non-covalent binding. |
| Rare Elements | 205 entries [14] | Prevents overfitting to rare, poorly sampled features. |
| Steric Clashes | 164 entries [14] | Removes physically unrealistic structures. |
| Redundancy/Similarity | ~50% of training complexes [1] | Reduces memorization and encourages generalization. |
Data Curation Workflow
Table 2: Essential Tools and Resources for Data Curation
| Resource Name | Type | Primary Function in Curation |
|---|---|---|
| RCSB Protein Data Bank [8] [14] | Database | Source for original PDB and mmCIF structure files. |
| HiQBind-WF / PDBBind-Opt | Workflow | An open-source, semi-automated workflow implementing the filters and refinement steps described above [8] [14]. |
| PDBFixer | Software Tool | Used in the ProteinFixer module to add missing atoms and residues to protein structures [14]. |
| RDKit | Cheminformatics Library | Used in the LigandFixer module to correct ligand chemistry (bond order, protonation, aromaticity) [8]. |
| DataSAIL | Python Package | Performs similarity-aware data splitting to minimize data leakage between training and test sets, complementing data curation [9]. |
| PDBbind CleanSplit | Dataset | A curated version of PDBbind with reduced train-test data leakage and redundancy, enabling more realistic model evaluation [1]. |
1. What is data leakage in the context of PDBbind, and why is it a problem?
Data leakage occurs when protein-ligand complexes with high structural or chemical similarity appear in both training and test datasets [1] [2]. This inflates performance metrics during benchmarking because models can "memorize" similar examples rather than learning to generalize, leading to over-optimistic results that don't hold up in real-world drug discovery applications [1]. One study found that nearly 600 similarities existed between PDBbind training and CASF benchmark complexes, affecting 49% of the test cases [1].
2. How can I check my dataset for data leakage issues?
You can use structure-based clustering algorithms that assess multimodal similarity [1]. Key metrics include (a ligand-level check is sketched below):
- Protein structure similarity (TM-score, computed with TM-align)
- Ligand chemical similarity (Tanimoto coefficient on ECFP4 fingerprints)
- Binding pose similarity (pocket-aligned ligand RMSD)
- Protein sequence identity (BLAST or MMseqs2)
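A minimal sketch of the ligand-level check, computing the maximum ECFP4 Tanimoto similarity of each test ligand against the training set; the SMILES lists are placeholders, and the 0.9 threshold follows the ligand-identity criterion used by CleanSplit [1].

```python
# Minimal sketch of a ligand-level leakage check with RDKit.
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import DataStructs

def ecfp4(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

train_fps = [ecfp4(s) for s in ["CCO", "c1ccccc1O"]]   # placeholder data
test_smiles = ["CCO", "CC(=O)Nc1ccc(O)cc1"]            # placeholder data

for smi in test_smiles:
    sims = DataStructs.BulkTanimotoSimilarity(ecfp4(smi), train_fps)
    if max(sims) > 0.9:  # identity-level threshold used by CleanSplit [1]
        print(f"Potential leakage: {smi} (max Tanimoto {max(sims):.2f})")
```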
3. My model performs well on the CASF benchmark but poorly on my own proprietary data. What's wrong?
This is a classic symptom of data leakage between PDBbind and the CASF benchmark [1] [2]. When models are retrained on properly split datasets with reduced leakage, their performance on CASF typically drops substantially [1]. This indicates that original high scores were artificially inflated and true generalization capability is lower than reported.
4. Are there publicly available datasets that mitigate data leakage?
Yes, researchers have developed several cleaned dataset versions:
- PDBbind CleanSplit, a structure-filtered training set strictly separated from the CASF benchmarks [1]
- LP-PDBBind, a leak-proof split with similarity control for both proteins and ligands [2]
- HiQBind, a high-quality curated set with corrected structural artifacts [7]
- MISATO, a quantum-chemically refined dataset with molecular dynamics trajectories [24]
5. What is the trade-off between using larger, augmented datasets versus smaller, high-quality ones?
Larger datasets like BindingNet v2 (with ~690,000 modeled complexes) can improve model generalization for novel ligands, with one study showing success rates increasing from 38.55% to 64.25% for binding pose prediction [10]. However, carefully curated smaller datasets with high structural accuracy (like HiQBind or cleaned PDBbind splits) provide more reliable affinity predictions by eliminating artifacts that compromise accuracy [7] [24]. The optimal choice depends on your specific application—pose generation may benefit from larger datasets, while affinity prediction requires higher quality data.
Symptoms: Excellent performance on CASF-style benchmarks, but accuracy collapses on independent or proprietary test data.
Solution Steps:
Perform Similarity Analysis: compute TM-scores, ligand Tanimoto similarities, and pocket-aligned RMSDs between all training and test complexes [1].
Implement Strict Data Splitting: remove or reassign training complexes that exceed the similarity thresholds (Table 1 below), or use a splitting tool such as DataSAIL [12].
Validate with Independent Benchmark: confirm performance on a temporally independent set such as BDB2020+ [2].
Symptoms: Reasonable accuracy on protein families seen in training, but high error on targets with low sequence or structural homology.
Solution Steps:
Architecture Improvements: favor models that explicitly encode protein-ligand interaction context (e.g., interaction-graph approaches such as GEMS) over architectures prone to shortcut learning [1].
Data Strategy Enhancement: broaden training data coverage, for example with large augmented sets such as BindingNet v2 where pose diversity matters [10].
Regularization Techniques: apply standard measures such as dropout, weight decay, and early stopping against a cluster-held-out validation set.
Symptoms: Unstable training, physically implausible predicted poses, or systematic errors on specific complexes despite a clean train-test split.
Solution Steps:
Data Quality Assessment: profile resolution, R-factors, steric clashes, and ligand chemistry across the dataset (see Table 2 below).
Data Cleaning Pipeline: implement a workflow like HiQBind-WF to correct ligand chemistry, add missing protein atoms and residues, and remove unfixable complexes [7].
Quality-Aware Training: filter or down-weight training examples that fail quality checks.
Objective: Generate training and test splits without data leakage for reliable model evaluation.
Materials: Protein-ligand complexes (e.g., PDBbind), TM-align for protein similarity, RDKit for ligand fingerprints, and a superposition tool for pocket-aligned RMSD.
Procedure:
Calculate Pairwise Similarities: compute TM-scores, ligand Tanimoto similarities, and pocket-aligned RMSDs for all train-test pairs.
Apply Filtering Thresholds: remove training complexes that exceed the thresholds in Table 1 below.
Cluster and Split: group the remaining complexes by similarity and assign whole clusters to either training or test.
Validation: confirm that no residual train-test pair violates the chosen thresholds.
Table 1: Similarity Thresholds for Data Leakage Prevention
| Similarity Type | Strict Threshold | Moderate Threshold | Measurement Tool |
|---|---|---|---|
| Protein Structure | TM-score < 0.5 | TM-score < 0.7 | TM-align |
| Ligand Chemistry | Tanimoto < 0.4 | Tanimoto < 0.7 | RDKit, ECFP4 |
| Binding Pose | RMSD > 2.5Å | RMSD > 2.0Å | Pocket-aligned RMSD |
| Sequence Identity | < 30% | < 50% | BLAST, MMseqs2 |
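Where a concrete decision rule is needed, the thresholds above can be combined into a single pair-level test. The sketch below is a minimal illustration, assuming the three metric values have already been computed (TM-align for TM-score, RDKit for Tanimoto, a pocket-aligned superposition for RMSD); requiring all three criteria simultaneously mirrors the combined assessment used by CleanSplit [1].

```python
# Minimal sketch: combine the similarity metrics from Table 1 into a
# single pair-level leakage test on precomputed values.
from dataclasses import dataclass

@dataclass
class PairSimilarity:
    tm_score: float    # protein structure similarity
    tanimoto: float    # ligand ECFP4 similarity
    pose_rmsd: float   # pocket-aligned ligand RMSD (Å)

def is_leaky(pair, strict=True):
    tm_limit = 0.5 if strict else 0.7        # TM-score must stay below this
    tanimoto_limit = 0.4 if strict else 0.7  # Tanimoto must stay below this
    rmsd_floor = 2.5 if strict else 2.0      # pose RMSD must stay above this
    return (pair.tm_score >= tm_limit
            and pair.tanimoto >= tanimoto_limit
            and pair.pose_rmsd <= rmsd_floor)

print(is_leaky(PairSimilarity(tm_score=0.92, tanimoto=0.85, pose_rmsd=1.1)))  # True
```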
Objective: Identify and correct common structural artifacts in protein-ligand complexes.
Materials: Structure files (PDB/mmCIF and SDF), RDKit, PDBFixer, and a validation tool such as MolProbity or PoseBusters.
Procedure:
Initial Assessment: review resolution, R-factor, and ligand B-factors from the PDB metadata.
Ligand Processing: correct bond orders, protonation states, and aromaticity against the canonical structure (e.g., with RDKit).
Protein Processing: add missing atoms, residues, and hydrogens (e.g., with PDBFixer) and validate sidechain rotamers.
Complex Refinement (Advanced): perform restrained energy minimization of the binding site to relieve minor clashes.
Quality Metrics: confirm the structure meets the targets in Table 2 below.
Table 2: Structural Quality Metrics and Target Values
| Quality Metric | High Quality | Acceptable | Assessment Tool |
|---|---|---|---|
| Resolution | < 2.0Å | < 2.8Å | PDB metadata |
| R-factor | < 0.20 | < 0.25 | PDB metadata |
| Ligand B-factor | < 60.0 | < 80.0 | PDB metadata |
| Steric clashes | None (overlap < 0.4Å) | Minor (overlap < 0.6Å) | MolProbity, PoseBusters |
| Bond length deviation | < 0.05Å from reference | < 0.10Å from reference | RDKit, CCDC data |
| Bond angle deviation | < 5° from reference | < 10° from reference | RDKit, CCDC data |
| Pass PoseBusters checks | All checks passed | >90% checks passed | PoseBusters toolkit |
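A minimal sketch of a metadata screen against the thresholds in Table 2; the `entries` records are placeholders for values parsed from PDB/mmCIF headers or the RCSB API.

```python
# Minimal sketch: assign a quality tier from PDB metadata using the
# "High Quality" and "Acceptable" thresholds in Table 2.
entries = [  # placeholder records
    {"pdb": "1abc", "resolution": 1.8, "r_factor": 0.18, "ligand_bfactor": 42.0},
    {"pdb": "2xyz", "resolution": 3.1, "r_factor": 0.27, "ligand_bfactor": 91.0},
]

def quality_tier(e):
    if e["resolution"] < 2.0 and e["r_factor"] < 0.20 and e["ligand_bfactor"] < 60.0:
        return "high"
    if e["resolution"] < 2.8 and e["r_factor"] < 0.25 and e["ligand_bfactor"] < 80.0:
        return "acceptable"
    return "reject"

for e in entries:
    print(e["pdb"], quality_tier(e))
```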
Table 3: Performance Impact of Data Leakage Mitigation
| Model | Original CASF2016 Performance (RMSE) | Performance on CleanSplit (RMSE) | Performance Drop | Independent Test (BDB2020+ RMSE) |
|---|---|---|---|---|
| GenScore | 1.42 | 1.68 | 18.3% | 1.75 |
| Pafnucy | 1.51 | 1.81 | 19.9% | 1.84 |
| GEMS (Ours) | 1.38 | 1.39 | 0.7% | 1.42 |
| RF-Score | 1.63 | 1.85 | 13.5% | 1.89 |
| AutoDock Vina | 1.79 | 1.82 | 1.7% | 1.87 |
Table 4: Dataset Comparison for Protein-Ligand Modeling
| Dataset | Size (Complexes) | Binding Affinities | Structural Quality | Data Leakage Control | Primary Use Case |
|---|---|---|---|---|---|
| PDBbind v2020 | ~19,500 | Yes | Variable | Poor | Baseline development |
| PDBbind CleanSplit | ~17,800 | Yes | Variable | Strict | Reliable benchmarking |
| LP-PDBBind | ~16,500 | Yes | Cleaned | Strict | Method evaluation |
| HiQBind | ~30,000 | Yes | High | Moderate | Production model training |
| BindingNet v2 | ~690,000 | Yes | Modeled (variable) | Configurable | Data augmentation |
| MISATO | ~20,000 | Yes (curated) | QM-refined | Moderate | High-accuracy prediction |
Table 5: Essential Tools and Datasets for Robust Protein-Ligand Modeling
| Resource Name | Type | Function | Access |
|---|---|---|---|
| PDBbind CleanSplit | Curated Dataset | Provides leakage-free training/test splits for reliable benchmarking | Available with the publication [1] |
| HiQBind-WF | Computational Tool | Semi-automated workflow for fixing structural artifacts in protein-ligand complexes | Open-source [7] |
| LP-PDBBind | Curated Dataset | Leak-proof dataset split with similarity control for both proteins and ligands | Available with paper [2] |
| BindingNet v2 | Augmented Dataset | Large collection of modeled complexes for data augmentation and improved generalization | Available [10] |
| MISATO | Enhanced Dataset | Quantum-chemically refined structures with molecular dynamics trajectories | Open access [24] |
| BDB2020+ | Benchmark Dataset | Temporal test set with complexes deposited after 2020 for independent validation | Available [2] |
| PoseBusters | Validation Tool | Checks structural validity of generated protein-ligand complexes | Open-source [10] |
| TM-align | Algorithm Tool | Computes protein structural similarity scores for leakage analysis | Open-source [1] |
Q1: What is data leakage in the context of PDBbind, and why is it a problem? Data leakage occurs when information from the test dataset unintentionally influences the training of a machine learning model. In PDBbind, this happens due to high structural similarities between protein-ligand complexes in the training and test sets (e.g., the CASF benchmark) [1]. Models can then "cheat" by memorizing these similarities rather than learning generalizable principles of binding, leading to severely inflated and unrealistic performance metrics that do not reflect true predictive power on novel targets [1] [3].
Q2: How significant is the performance drop when moving to a leakage-free split? The performance drop can be substantial, indicating that previously reported high accuracies were likely overstated. When state-of-the-art models like GenScore and Pafnucy were retrained on a leakage-free split (PDBbind CleanSplit), their performance "dropped markedly" [1]. One analysis showed that a simple search algorithm that just finds the most similar training complexes could achieve competitive performance with some deep learning models, highlighting that prior success was largely driven by data leakage rather than genuine learning [1].
Q3: What is the PDBbind CleanSplit dataset? PDBbind CleanSplit is a refined training dataset curated to eliminate data leakage and reduce internal redundancy [1]. It uses a structure-based filtering algorithm to ensure that training complexes are strictly separated from those in common test benchmarks like CASF. This is achieved by removing training complexes that are overly similar to any test complex, based on combined protein structure, ligand similarity, and binding conformation [1].
Q4: Are there other types of errors in PDBbind beyond data leakage? Yes, database curation errors are another significant issue. A manual analysis of the protein-protein subset of PDBbind found that approximately 19% of records had dissociation constant (KD) values that were not supported by their primary publications [11]. These errors included incorrect units, values belonging to different molecular constructs, and approximate instead of precise values [11]. Correcting these errors was shown to improve machine learning prediction accuracy [11].
Q5: What tools are available to create leakage-free data splits? DataSAIL is a specialized Python package designed to compute leakage-reduced data splits for biological data [12]. It formulates the splitting problem as a combinatorial optimization challenge, aiming to minimize similarity between training and test sets while preserving class distribution. This is particularly crucial for realistic performance estimation on out-of-distribution data [12].
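As an illustration of the underlying idea (not DataSAIL's actual API), the sketch below clusters items by similarity and assigns whole clusters to one side of the split; DataSAIL solves a more general combinatorial optimization over the same objective [12]. The feature matrix `X` and cluster count are placeholders for real protein/ligand similarity representations.

```python
# Illustration of similarity-aware splitting: no cluster straddles the
# train/test boundary, so near-duplicates cannot leak across the split.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))  # placeholder features

labels = AgglomerativeClustering(n_clusters=10).fit_predict(X)
test_clusters = set(rng.choice(10, size=2, replace=False).tolist())

train_idx = [i for i, c in enumerate(labels) if c not in test_clusters]
test_idx = [i for i, c in enumerate(labels) if c in test_clusters]
print(f"{len(train_idx)} train / {len(test_idx)} test complexes")
```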
Problem: Your model shows excellent performance on standard benchmarks (like CASF) but fails dramatically when tested on novel, proprietary targets.
Diagnosis: This is a classic symptom of data leakage. Your model is likely exploiting structural redundancies between the training and test sets instead of learning the underlying physics of binding.
Solution: Retrain on a leakage-controlled split such as PDBbind CleanSplit or LP-PDBBind, then re-evaluate on an independent benchmark such as BDB2020+ to obtain a realistic estimate of generalization [1] [2].
Problem: The model cannot accurately predict binding affinity for proteins with low sequence or structural homology to those in the training set.
Diagnosis: The training data may lack diversity, and the model has overfitted to overrepresented protein families.
Solution: Adopt clustering-based cross-validation so entire protein families are held out together, and diversify or rebalance the training data to reduce overrepresentation of common families [11] [12].
Problem: Model predictions consistently disagree with experimental values for specific complexes, even after verifying no structural leakage.
Diagnosis: The experimental binding affinity values (KD, Ki, IC50) in the database for those complexes may be incorrectly curated.
Solution: Verify the reported KD, Ki, or IC50 values against the primary publications and discard or correct unsupported records; manual re-curation of this kind improved prediction accuracy by roughly 8 percentage points in one study [11].
The following table summarizes the quantitative impact of using leakage-free splits and correcting data errors on model performance.
Table 1: Impact of Data Quality Improvements on Model Performance
| Model / Experiment | Training Data | Test Data | Key Metric | Performance with Standard Split | Performance with Leakage-Free Split | Source |
|---|---|---|---|---|---|---|
| GenScore & Pafnucy | Original PDBbind | CASF Benchmark | Binding Affinity Prediction | Excellent benchmark performance | Performance dropped markedly | [1] |
| Random Forest Model | Original PDBbind (Open Access subset) | Cross-validation | Pearson R (log10(KD)) | Baseline | ~8 percentage point increase (after correcting 19% curation errors) | [11] |
| Similarity Search Algorithm | Original PDBbind | CASF2016 | Pearson R | N/A | R = 0.716 (competitive with some DL models, highlighting leakage) | [1] |
Objective: To split a dataset of protein-ligand complexes into training and test sets while minimizing structural and ligand-based data leakage.
Materials: Dataset (e.g., PDBbind), DataSAIL tool [12].
Methodology: Compute protein and ligand similarity matrices for the dataset, run DataSAIL to assign complexes to training and test sets while minimizing cross-set similarity, and verify the resulting split against your chosen thresholds [12].
Objective: To realistically evaluate model performance and avoid over-optimism from testing on data similar to training data.
Materials: Dataset of protein complexes, clustering software, sequence or structure alignment tool.
Methodology: Cluster the complexes by sequence or structural similarity, assign entire clusters to cross-validation folds so that no cluster spans a fold boundary, and average performance over folds (see the sketch below) [11].
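A minimal sketch of this fold assignment using scikit-learn's `GroupKFold`, with hypothetical cluster IDs standing in for MMseqs2 or structure-based clusters; features, affinities, and groups are all placeholder data.

```python
# Minimal sketch of clustering-based cross-validation: complexes that
# share a cluster ID stay in the same fold, so no fold is evaluated on
# near-duplicates of its own training data.
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.random((200, 32))                 # placeholder features
y = rng.random(200)                       # placeholder affinities
groups = rng.integers(0, 40, size=200)    # placeholder cluster IDs

for fold, (tr, te) in enumerate(GroupKFold(n_splits=5).split(X, y, groups)):
    assert set(groups[tr]).isdisjoint(groups[te])  # no cluster straddles folds
    print(f"fold {fold}: {len(tr)} train / {len(te)} test")
```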
Table 2: Essential Research Reagents and Tools
| Tool / Resource | Type | Function | Relevance to Mitigating Data Leakage |
|---|---|---|---|
| PDBbind CleanSplit [1] | Curated Dataset | A leakage-reduced version of the PDBbind training set. | Provides a ready-to-use, strictly separated training set for reliable model development. |
| DataSAIL [12] | Software Tool | Splits biological datasets to minimize information leakage. | Enables creation of custom leakage-free splits for proprietary or specialized datasets. |
| HiQBind & HiQBind-WF [7] | Curated Dataset & Workflow | Provides high-quality protein-ligand structures with corrected structural artifacts. | Addresses data quality issues orthogonal to leakage, improving the foundational data. |
| TM-score [1] | Algorithm | Measures protein structural similarity. | A key metric for identifying and filtering out structurally similar proteins during splitting. |
| Tanimoto Coefficient [1] | Algorithm | Measures ligand chemical similarity based on molecular fingerprints. | A key metric for identifying and filtering out chemically similar ligands during splitting. |
| Clustering-Based Cross-Validation [11] | Methodology | A validation technique that groups similar data points together. | Prevents over-optimistic performance estimates by ensuring dissimilarity between training and test folds. |
Q1: What is the core issue with the standard PDBbind and CASF benchmark setup? The core issue is widespread data leakage. Research has revealed that nearly 50% of the complexes in the common CASF benchmark sets have highly similar counterparts in the standard PDBbind training set [1] [13]. This structural similarity extends to shared ligands and closely matched binding affinity labels. When a model is trained on PDBbind and evaluated on CASF, it is often being tested on data it has effectively already seen, leading to performance metrics that are severely inflated and do not reflect true generalization to novel complexes [1] [26].
Q2: What specific problem does the PDBbind CleanSplit dataset solve? PDBbind CleanSplit is a curated training dataset designed to eliminate this data leakage [1]. It uses a structure-based filtering algorithm to ensure the training set is strictly separated from the CASF test sets. It removes two types of data:
- Training complexes that are highly similar to any CASF test complex in protein structure, ligand, and binding conformation (train-test leakage)
- Internally redundant training complexes, reducing similarity clusters within the training set itself [1]
Q3: Why did the performance of models like GenScore and Pafnucy drop on CleanSplit? The performance drop indicates that these models' high scores on the original benchmark were largely driven by data leakage rather than a deep understanding of protein-ligand interactions [1] [27]. The models had learned to exploit the structural and ligand-based similarities between the training and test sets. When these shortcuts were removed by CleanSplit, the models' inability to generalize to truly novel complexes was exposed [1]. The drop in performance is thus a more honest reflection of their predictive power on unseen data.
Q4: Are there models that maintain performance when trained on CleanSplit? Yes, the GEMS (graph neural network for efficient molecular scoring) model was developed alongside CleanSplit and maintains high benchmark performance when trained on this cleaned data [1] [13] [28]. Its architecture leverages a sparse graph representation of interactions and transfer learning from language models, which appears to help it learn generalizable principles of binding instead of relying on memorization [1]. Ablation studies showed that GEMS's performance collapses if protein node information is removed, suggesting its predictions are based on a genuine understanding of the interaction context [27].
Problem: My model's performance dropped significantly after I switched to a leakage-free dataset split. A drop in performance after moving to a rigorously split dataset like CleanSplit is not a failure but an expected correction. It indicates that your previous evaluation was likely skewed by data leakage.
Solution: Treat the corrected metrics as the realistic baseline, report them alongside the legacy benchmark numbers for context, and direct further effort toward architectures and data strategies that improve genuine generalization, as GEMS demonstrates [1].
Table 1: Quantifying the Data Leakage in PDBbind and the CleanSplit Solution
| Metric | Standard PDBbind | PDBbind CleanSplit |
|---|---|---|
| Train-Test Leakage | ~600 similar pairs identified; affects 49% of CASF test complexes [1] | Strictly separated from CASF benchmarks [1] |
| Internal Redundancy | ~50% of training complexes part of a similarity cluster [1] | Redundancy minimized by removing an additional 7.8% of training complexes [1] |
| Ligand-Based Leakage | Not systematically addressed | All training complexes with ligands identical (Tanimoto > 0.9) to test ligands are removed [1] |
Table 2: Impact of PDBbind CleanSplit on Model Performance
| Model | Performance on Standard PDBbind (Inflated) | Performance on PDBbind CleanSplit (Realistic) | Key Performance Change |
|---|---|---|---|
| Pafnucy | Excellent benchmark performance [1] | Performance "dropped markedly" [1] | R² score dropped by up to 0.4 [27] |
| GenScore | Excellent benchmark performance [1] | Performance dropped substantially [1] | Demonstrated better robustness than Pafnucy, but still showed a significant drop [1] [26] |
| GEMS | N/A (Developed with CleanSplit) | Maintains state-of-the-art performance [1] [28] | Achieves high prediction accuracy on CASF benchmark without data leakage [1] |
Objective: To retrain an existing scoring function model (e.g., GenScore or Pafnucy) on both the standard PDBbind dataset and the PDBbind CleanSplit dataset, then evaluate its performance on the CASF benchmark to observe the effect of data leakage.
Materials:
Methodology:
The workflow for creating the CleanSplit dataset, which is central to this protocol, is based on a multi-stage filtering process as defined in the original research [1] and visualized below.
Diagram 1: Workflow for creating the PDBbind CleanSplit dataset.
Table 3: Essential Resources for Mitigating Data Leakage in Binding Affinity Prediction
| Resource Name | Type | Function/Benefit |
|---|---|---|
| PDBbind CleanSplit | Curated Dataset | The core solution for eliminating data leakage between PDBbind and CASF benchmarks, enabling realistic model evaluation [1] [27]. |
| DataSAIL | Software Tool (Python) | A versatile tool for performing leakage-reduced data splits for biological data, formulated as a combinatorial optimization problem [9]. |
| GEMS Model | Machine Learning Model | A graph neural network that demonstrates robust generalization on CleanSplit by learning protein-ligand interactions, not memorizing data [1] [28]. |
| TM-align | Algorithm/Tool | Used to compute TM-scores for quantifying protein structure similarity, a key metric in the CleanSplit filtering algorithm [1]. |
| Tanimoto Coefficient | Similarity Metric | Calculates ligand similarity based on molecular fingerprints, used to prevent ligand-based memorization [1]. |
| Pocket-aligned RMSD | Similarity Metric | Measures the similarity of ligand binding conformation within the protein pocket after structural alignment [1]. |
The field of computational drug discovery relies heavily on accurate protein-ligand binding affinity prediction. For years, models trained on the PDBbind database have reported impressive performance on standard benchmarks like the Comparative Assessment of Scoring Functions (CASF). However, recent research has exposed a "data leakage crisis" where this reported performance was severely inflated due to structural redundancies and similarities between training and test sets [3] [1]. Models were effectively memorizing training patterns rather than learning generalizable principles of molecular interactions [1]. This discovery necessitated the creation of rigorously filtered datasets, such as PDBbind CleanSplit, which removes these redundancies [1]. When retrained on these clean datasets, the performance of many state-of-the-art models dropped substantially, revealing their previously hidden generalization limitations [1]. This article highlights the models that have successfully weathered this paradigm shift and provides a technical toolkit for researchers navigating this new, more rigorous landscape.
Q1: What exactly is "data leakage" in the context of PDBbind and the CASF benchmark?
Data leakage occurs when models trained on PDBbind achieve high performance on the CASF benchmark not by learning generalizable protein-ligand interaction principles, but by exploiting structural redundancies. Nearly half (49%) of CASF complexes have a highly similar counterpart in the PDBbind training set, sharing comparable ligand and protein structures, ligand positioning, and affinity labels. This allows models to make accurate predictions through memorization rather than true understanding [1].
Q2: What is PDBbind CleanSplit and how does it solve the leakage problem?
PDBbind CleanSplit is a refined training dataset curated using a structure-based filtering algorithm that eliminates train-test data leakage and reduces internal redundancies [1]. The filtering is based on a combined assessment of:
- Protein structure similarity (TM-score)
- Ligand chemical similarity (Tanimoto coefficient)
- Ligand binding conformation (pocket-aligned RMSD) [1]
Q3: Which models have successfully maintained performance after being trained and evaluated on filtered datasets?
The GEMS (Graph neural network for Efficient Molecular Scoring) model is a prominent success story. When trained on the PDBbind CleanSplit dataset, it maintained high, state-of-the-art performance on the CASF benchmark, demonstrating robust generalization capabilities [1]. Comparable post-filtering results for IGN have not been reported, though it remains a notable Graph Neural Network (GNN) based approach for scoring functions [22].
Symptoms: Your model performs excellently on standard benchmarks but shows a significant performance decrease when evaluated on a rigorously filtered dataset like PDBbind CleanSplit.
Diagnosis: The model is overfitting to structural motifs and redundancies present in the original data split rather than learning the underlying physics of binding.
Solutions:
Symptoms: Inconsistent model performance and an inability to reproduce published results on public benchmarks.
Diagnosis: The underlying dataset may contain structural errors, statistical anomalies, or hidden redundancies that undermine model training and evaluation.
Solutions:
This protocol outlines the steps for creating a curated dataset of high-quality, non-covalent protein-ligand complex structures [7].
The following workflow diagram visualizes this multi-stage curation process:
This protocol describes the methodology for identifying and removing structural redundancies to create a leakage-free training set [1].
The table below summarizes the documented performance of the GEMS model and the general effect of re-training models on a cleaned dataset, demonstrating its robust generalization capability.
| Model / Scenario | Training Dataset | Test Benchmark | Key Performance Metric | Outcome and Interpretation |
|---|---|---|---|---|
| GenScore, Pafnucy (State-of-the-Art Models) | Original PDBbind | CASF | High Performance (e.g., Low RMSE) | Substantial Performance Drop when retrained on CleanSplit. Shows prior performance was inflated by data leakage [1]. |
| GEMS (Graph neural network for Efficient Molecular Scoring) | PDBbind CleanSplit | CASF | State-of-the-Art Prediction Accuracy | Maintained High Performance. Demonstrates genuine generalization to unseen complexes, as all similar training data was removed [1]. |
| Simple Search Algorithm (Averaging affinities of 5 most similar training complexes) | Original PDBbind | CASF2016 | Pearson R = 0.716, competitive RMSE | Competitive with early DL models. Proves that benchmark performance can be achieved through simple memorization, highlighting the leakage problem [1]. |
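To make the similarity-search baseline in the table concrete, the sketch below predicts a test complex's affinity as the mean over its k most similar training complexes. It simplifies the study's combined similarity to a single precomputed similarity vector, and all values shown are placeholders.

```python
# Minimal sketch of the memorization baseline: average the affinities
# of the k most similar training complexes. High performance of such a
# trivial predictor is itself evidence of train-test leakage [1].
import numpy as np

def knn_affinity(test_sim_row, train_affinities, k=5):
    """test_sim_row: similarity of one test complex to all training complexes."""
    top_k = np.argsort(test_sim_row)[-k:]  # indices of the k most similar
    return float(np.mean(np.asarray(train_affinities)[top_k]))

sims = np.array([0.2, 0.9, 0.85, 0.1, 0.7, 0.95])  # placeholder similarities
affs = [6.1, 7.8, 7.5, 5.2, 7.0, 8.0]              # placeholder pK values
print(knn_affinity(sims, affs))                    # mean over the top 5
```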
| Item Name | Type | Function and Key Features | Use Case in Research |
|---|---|---|---|
| PDBbind CleanSplit [1] | Curated Dataset | A leakage-free version of PDBbind. Uses structure-based filtering on protein, ligand, and pose similarity to ensure train/test separation. | The recommended dataset for training and fairly evaluating new scoring functions to ensure generalizable performance. |
| HiQBind-WF [7] | Data Processing Workflow | An open-source, semi-automated workflow to correct structural artifacts in protein-ligand complexes (e.g., in PDBbind). | Preparing high-quality input data for model training by fixing common errors in ligands and proteins from the PDB. |
| GEMS Model [1] | Software / Model | A Graph Neural Network that uses sparse graph modeling and transfer learning. Maintains performance on CleanSplit. | A state-of-the-art model for binding affinity prediction that genuinely generalizes to novel protein-ligand complexes. |
| Structure-Based Clustering Algorithm [1] | Algorithm / Methodology | A multi-modal filtering algorithm based on TM-score, Tanimoto score, and pocket-aligned RMSD. | The core method for creating clean data splits and for auditing existing datasets for hidden redundancies and data leakage. |
Q1: What is data leakage in the context of PDBBind, and why is it a crisis for drug discovery research?
Data leakage occurs when highly similar protein or ligand structures appear in both the training and test sets of a dataset like PDBBind. This allows machine learning models to "cheat" by memorizing these similarities rather than learning generalizable principles of binding affinity. This crisis has led to an overestimation of model performance, where models achieving impressive benchmark results fail dramatically when applied to genuinely new protein-ligand complexes in real-world drug discovery [3] [1].
Q2: How does the BDB2020+ benchmark address the problem of data leakage?
BDB2020+ is designed as a strictly independent test set. It was created by matching high-quality binding data from BindingDB with protein-ligand complex structures from the Protein Data Bank (PDB) that were deposited after 2020. Furthermore, it is filtered using similarity control criteria to ensure that its contents are not highly similar to the complexes in the training data, such as the Leak Proof PDBBind (LP-PDBBind) set. This makes it a robust benchmark for evaluating a model's true generalization capability [2] [15].
Q3: What is the goal of the Target2035 initiative, and how will it benefit computational researchers?
Target2035 is a global, open-science consortium with the ambitious goal of creating a pharmacological modulator (like a chemical probe) for every human protein by 2035. A key part of its roadmap is to generate massive, publicly available datasets of high-quality protein-small molecule binding data. For computational researchers, this will provide the large-scale, diverse, and leakage-aware data needed to train and validate robust machine learning models, ultimately enabling the prediction of hits for proteins with no existing experimental data [3] [29].
Q4: My model, trained on PDBBind, performs well on the standard CASF benchmark but poorly on my own experimental data. What is the likely cause?
This is a classic symptom of data leakage. The standard PDBBind training set and the CASF benchmark share a high degree of structural similarity. Your model's high performance on CASF is likely inflated because it is encountering highly similar complexes during testing. Your own experimental data, representing truly novel complexes, provides a more realistic assessment, revealing the model's lack of generalizability. Retraining your model on a leak-proof split like LP-PDBBind or PDBbind CleanSplit is recommended [1].
Q5: Are there automated tools available to help create data splits that minimize leakage?
Yes. Tools like DataSAIL are specifically designed for this purpose. DataSAIL formulates data splitting as a combinatorial optimization problem to minimize similarity between training and test sets. It can handle complex, heterogeneous data (like protein-ligand pairs) and supports both identity-based and similarity-based splitting strategies to prevent information leakage for a more realistic evaluation of model performance [12].
Protocol 1: Implementing a Leak-Proof Benchmarking Strategy Using BDB2020+ — train on a leak-proof split such as LP-PDBBind, then evaluate on BDB2020+, whose post-2020 structures are similarity-filtered against that training set [2] [15].
Protocol 2: Incorporating Target-Level Benchmarks (SARS-CoV-2 Mpro and EGFR)
To further test ranking power on specific, therapeutically relevant targets, evaluate the model on the curated SARS-CoV-2 Mpro and EGFR sets and report a per-target ranking correlation (see the sketch below) [2] [15].
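A minimal sketch of the per-target ranking evaluation, using Spearman correlation as one common ranking-power metric; the predicted and experimental affinities below are placeholders for one target's ligand series.

```python
# Minimal sketch: ranking power on a single target, reported as the
# Spearman correlation between predicted and experimental affinities.
from scipy.stats import spearmanr

predicted    = [7.2, 6.1, 8.0, 5.5, 6.9]  # model scores (placeholder)
experimental = [7.0, 5.8, 8.3, 5.9, 6.5]  # measured pK values (placeholder)

rho, pval = spearmanr(predicted, experimental)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
```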
The table below summarizes the key characteristics of the independent benchmarks discussed.
| Benchmark Name | Core Purpose | Key Feature | Temporal Independence | Accessibility |
|---|---|---|---|---|
| BDB2020+ [2] [15] | Evaluate generalizability to novel complexes | Matches BindingDB affinities with PDB structures deposited after 2020 | Yes (Post-2020 structures) | Publicly available via GitHub repository |
| PDBbind CleanSplit [1] | Train and evaluate models without leakage | Uses a structure-based filtering algorithm to remove similar complexes from training | Not primarily time-based | Methodology published; dataset likely available upon request |
| Target2035 [3] [29] | Provide a foundational dataset for future models | Large-scale, open-access data from high-throughput screening (AS-MS, DEL) | Future-oriented initiative | Data will be made publicly available as generated |
| SARS-CoV-2 Mpro/EGFR Sets [2] [15] | Evaluate target-specific ranking power | Curated sets for specific, therapeutically relevant proteins | Structures not in LP-PDBBind training | Publicly available via GitHub repository |
| Reagent / Resource | Type | Function in Research |
|---|---|---|
| LP-PDBBind Dataset [2] [15] | Curated Dataset | A leak-proof version of PDBBind for training generalizable scoring functions. |
| BDB2020+ Dataset [2] [15] | Independent Benchmark | A strictly independent test set for evaluating model performance on novel complexes. |
| DataSAIL [12] | Software Tool | A Python package for performing optimal data splitting to minimize information leakage. |
| Target2035 Data [3] [29] | Future Data Resource | Upcoming large-scale, open-access binding data to empower next-generation models. |
| CENsible [30] | Scoring Function | An interpretable, machine-learning scoring function that provides insight into affinity contributions. |
The following diagram illustrates the problem of data leakage and the pathway to creating a model that generalizes well using independent benchmarks.
A revealing study retrained top-performing affinity prediction models on the PDBbind CleanSplit dataset, which rigorously removes data leakage. The result was a substantial drop in their benchmark performance, proving that their previously high scores were largely driven by memorization rather than true learning [1]. This underscores that careful data curation is not just a theoretical exercise but a practical necessity for developing models that can reliably contribute to drug discovery efforts. By adopting the benchmarks and protocols outlined here, researchers can build models with robust and trustworthy predictive power.
The mitigation of data leakage is not merely a technical refinement but a fundamental prerequisite for developing reliable and generalizable AI models in drug discovery. The strategies outlined—from implementing rigorous, structure-based dataset splits like PDBbind CleanSplit and LP-PDBBind to addressing overarching data quality issues—collectively form a new foundation for the field. The evidence is clear: models trained on these cleaned datasets may show a performance drop on old, compromised benchmarks, but they achieve something far more valuable—robust predictive power on truly novel protein-ligand complexes. The future of computational drug discovery hinges on a commitment to data integrity, necessitating an industry-wide shift towards open, high-quality, and leakage-aware datasets, as championed by initiatives like Target2035. Embracing these practices will finally allow the promise of AI to be fully realized, accelerating the development of new therapeutics with greater confidence and accuracy.