insights from experiments on water maps in protein-ligand simulations
TLDR; receptor pruning substantially affects solvation in protein-ligand binding.

Adding a painting to the beginning of this blog is an idea entirely stolen from Rowan.
It is a well-known fact that solvation is among the most dominant, if not the most dominant, factor in protein-ligand (PL) binding affinity. Countless studies highlight this point; just to name a few:
Unfortunately, accurately capturing the effects of water on a PL complex in molecular dynamics (MD) simulations is notoriously difficult. Methods like PBSA and GBSA try to handle solvent via a continuum approximation, but that approximation is extremely crude. Approaches that attempt to stick to explicit waters require a ton of simulation time in addition to careful system preparation. It’s incredibly easy to make some seemingly negligible change to one’s simulation that ends up affecting the water dynamics drastically. And, in turn, the predicted solvation effects can take a huge hit.
I’ve dabbled in handling solvation in MD simulations primarily via implementing PBSA and GBSA, done in a work summarized in a previous blog. However, I recently had the urge to explore methods of studying solvent effects in binding dynamics more thoroughly. The goal was to get a good feel for how finicky water can be in PL MD simulations. I also suspected water maps would make for sexy visuals, but this definitely wasn’t my main motivation…

My procedure for constructing these water maps is straightforward. I constructed a fixed-size box with the ligand at its center such that the ligand was completely contained within the box with breathing room between the ligand and the box’s edges. Then, I chose a voxel size to partition the box. These ranged from 0.5Å (although there wasn’t great convergence of water maps for this voxel size, so I started most plots at a 1Å voxel size) up to 4Å. Finally, after running many MD replicas for a single PL complex, I tracked the presence of waters in each of the box’s voxels over each complete trajectory to construct a density map per replica.
Some ‘experimental design’ considerations I thought about:
- Density maps are only meaningful if the ligand retains roughly the same conformation during the entire trajectory; thus, I studied a pretty restrained ligand.
- The obvious variables to vary are voxel size and trajectory length per replica. We should observe good agreement between density maps of the same system with larger voxels, but it’s harder to make a similar statement for trajectory length as we may end up leaving the initial conformational basin in a way that nullifies the density map.
- A less intuitive, but likely still interesting variable to vary, is the pruning radius of the protein. Explicitly, we can preprocess our PL systems by excluding all protein residues that are beyond some cutoff distance from the ligand.
- It’s not unreasonable to expect something here because water density statistics can converge slower than simpler observables.
The pruning point is the focus. Loads of studies on mid-tier binding affinity prediction methods (e.g. MM/GBSA, MM/PBSA, etc.) prune receptors based on cutoffs in the 8Å-16Å range. For some examples, see this work on MM/GBSA applied to SARS-CoV-2, this assessment of MM/GBSA and MM/PBSA, and this study on the normal-mode entropy in MM/GBSA. Again, this is due to conventional MD wisdom indicating that long-range electrostatics are lost here, but these can be approximately added back by techniques like particle-mesh Ewald.
To probe these thoughts, I ran an eight nanosecond MD simulation of a single PL complex with different pruning radii—10Å, 15Å, and 20Å. I ran 14 replicas for each radius. As a part of initial data processing, though, I excluded runs that had significantly higher ligand RMSD compared to the majority of runs. This is because, as was previously mentioned, water maps really only make sense in a single conformational basin, and high RMSD indicates too much kinetic exploration for meaningful statistics1. Below are the main results.
Cross-System Agreement

The above plots capture a not-so-surprising but insightful trend: the density maps at “adjacent” pruning radii agree more strongly than the density maps at more spaced pruning radii. It’s interesting, though, that 15Å vs 20Å is worse than 10Å vs 15Å. Moreover, the fact that even at a 5Å voxel size, which is huge, 10Å vs 20Å still only gives roughly a 0.80 correlation coefficient could indicate that the water maps are converging to different density distributions here. That’s definitely important.
The monotonicity of the three plots is a good sanity check; after all, with increasing voxel size, we should expect the maps to look more and more like each other.
Cross-Replica Agreement

Another test I ran was the density map agreement between replicas at a given pruning radius. For this, I took all pairs of replicas within a fixed pruning radius and averaged the Pearson correlations at different voxel sizes. Note that the number of pairs is the most for 10Å because I kept 14 replicas after applying the ligand RMSD thresholding, whereas I discarded one of the 15Å replicas and two of the 20Å replicas. This plot clearly shows that the 20Å replicas are more stable and reproducible. Combined with the previous plots, one may suspect a couple things:
- If 20Å-pruned replicas are more reproducible and they have less than a 0.8 Pearson correlation coefficient with the 10Å-pruned replicas, even at large voxel sizes, the density maps at smaller pruning radii either require many more replicas or are simply incorrect.
- It could actually be more efficient to run replicas with a larger pruning radius as fewer replicas would likely be needed for density map convergence.

These heatmaps, with the explicit pairwise Pearson correlations between replicas stated, make the previous points abundantly clear.
Closing Thoughts
The takeaway from this, in my opinion, is that pruning radius is likely a first-order design parameter in running MD simulations. Given that I only ran experiments on a single system, it’s hard to make super concrete statements; I’d still be cautious about pruning systems too extremely. Moreover, as I didn’t directly probe the resulting effects on binding affinity explicitly, one can’t conclude anything about the nature of the water map error. For example, if the error is all systematic, then it’s not as substantial. But, at a minimum, this experiment should at least dissuade mixing simulations with different pruning radii.
More broadly, I feel that the field of MD simulation needs to be more pedantic about picking values for all the different hyperparameters that exist to be tuned: restraints on molecules, pruning radii, friction coefficients, etc. The effects of adjusting these parameters can show up in subtle ways, potentially tainting simulation data without detection. This is especially important for generating data to be used in machine learning models.
Miscellaneous Details
Last updated 04/13/2026.
The ligand studied has SMILES Nc1ncnc2c1ncn2[C@@H]1OC@HC@@H[C@H]1O, and the protein is ROCK1.
In particular, I used a pretty standard median + 3*MAD metric for the outlier threshold. This isn’t perfect but I think it’s probably alright. In the 15Å pruning radius runs, this gave a 1.82Å Kabsch RMSD threshold, with one outlier (1.99Å). In the 20Å pruning radius runs, this gave a 2.16Å Kabsch RMSD threshold, with two outliers (2.24Å, 2.63Å). The 10Å pruning radius runs had no outliers per this thresholding strategy. ↩
I used Axon, by Mirror Physics, for most code and MD. Thanks to Sam Norwood and the Mirror Physics team!
