James Proudfoot

Machine learning for virtual screening in chemical and biological systems

Project Summary

Screening projects in chemistry and biology often involve searching a space of candidates to optimise a numerical property, for example when finding or optimising drug molecules, experimental conditions, reactions, catalysts, materials, or proteins. Experimental screening always carries a cost proportional to the number of steps taken in the search, such as the time and labour needed to set up pharmacological assays, perform chemical reactions, or synthesise new chemical entities. Virtual screening (VS) reduces some of these costs by using computational techniques, such as simulation or statistical modelling (including machine learning, ML), as a surrogate for true laboratory experiments.

This project applies Bayesian optimisation (BO) to accelerate VS across a range of problems. BO works by iteratively re-training an ML surrogate model on the growing data set generated during an experimental or computational screening process. The surrogate model's predictions, together with estimates of their uncertainty, inform an acquisition function that ranks the candidates in a defined search space and selects new data points for labelling. Through repeated train-predict-select cycles, BO can rapidly seek out the tails of the distribution of a property of interest, such as drug potency, reaction rate, or enzyme efficiency.
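The train-predict-select cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's implementation: the objective `true_property` is a made-up one-dimensional toy function, the surrogate is a bootstrap ensemble of nearest-neighbour regressors (swapped in here for a Gaussian process so that the example needs only the standard library), and the acquisition function is an upper confidence bound with a hypothetical weight `beta`.

```python
import math
import random
import statistics

def true_property(x):
    """Hypothetical expensive 'experiment': a toy property peaking at x = 0.7."""
    return math.exp(-((x - 0.7) ** 2) / 0.02)

def knn_predict(train, x, k=3):
    """Average the labels of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def surrogate(train, x, rng, n_models=20):
    """Bootstrap ensemble: the mean is the prediction, the spread is a crude uncertainty."""
    preds = []
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in train]  # resample with replacement
        preds.append(knn_predict(sample, x))
    return statistics.mean(preds), statistics.pstdev(preds)

def bayes_opt(pool, n_init=5, n_rounds=15, beta=2.0, seed=0):
    """Run train-predict-select cycles over a finite candidate pool."""
    rng = random.Random(seed)
    unlabelled = list(pool)
    rng.shuffle(unlabelled)
    # Random initialisation: label a few candidates before the first cycle.
    labelled = [(x, true_property(x)) for x in unlabelled[:n_init]]
    unlabelled = unlabelled[n_init:]
    for _ in range(n_rounds):
        # Train and predict: score every unlabelled candidate with the surrogate.
        scored = []
        for x in unlabelled:
            mu, sigma = surrogate(labelled, x, rng)
            scored.append((mu + beta * sigma, x))  # upper confidence bound
        # Select: label the candidate with the highest acquisition value.
        _, best_x = max(scored)
        unlabelled.remove(best_x)
        labelled.append((best_x, true_property(best_x)))
    return labelled

pool = [i / 200 for i in range(201)]
labelled = bayes_opt(pool)
best_x, best_y = max(labelled, key=lambda p: p[1])
print(f"best candidate after {len(labelled)} labels: x={best_x:.3f}, y={best_y:.3f}")
```

The ensemble spread used here does not grow with distance from the labelled data in the way a Gaussian process variance does, so in practice a better-calibrated surrogate would likely be preferred.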

We have investigated BO for three problems: drug activity optimisation, molecular docking classification, and directed evolution of enzymes. Recent work has shown that BO can benefit from using an inexpensive but weakly correlated analogue (“low-level predictor”, or LLP) of the optimisation target. In drug activity search and optimisation, we have used the simple technique of molecular docking as an LLP, both as an ML feature and in the initialisation of the BO search, leading to reduced data requirements for optimisation and increased recall of ‘hit’ compounds.

We have found that the success of rigid-receptor molecular docking can be highly dependent on the 3D shape (conformation) of the protein target, as recorded in a PDB (Protein Data Bank) file. We have therefore investigated BO for the optimisation of docking classification accuracy (active vs. inactive compounds) using descriptors of the protein geometry, pocket volume, and surface area.

Enzymes are proteins that catalyse (accelerate) chemical reactions and are widely used in research and industrial labs, as well as in consumer products. All enzymes are composed of amino acids, and changing (mutating) the amino acids at specific sites in known enzymes can improve their catalytic efficiency, thermostability, and water solubility. This process is known as “directed evolution” because it mimics the natural evolutionary processes that have occurred over billions of years. We have begun investigating BO for purely ‘in-silico’ directed evolution of enzymes, targeting computational simulations of enzyme catalytic competence.
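The two uses of an LLP mentioned above, as an extra model feature and as a way to choose the initial training set, can be sketched as follows. Everything in this example is hypothetical: the candidate pool is synthetic, `llp` is a noisy stand-in for a docking score (with higher taken to mean better, which is a simplification), and the function names are invented for illustration.

```python
import random

def make_candidates(n=200, seed=0):
    """Synthetic pool: each candidate has a descriptor, an LLP score and a hidden activity."""
    rng = random.Random(seed)
    pool = []
    for i in range(n):
        descriptor = rng.random()
        activity = descriptor ** 2            # hidden optimisation target (toy)
        llp = activity + rng.gauss(0.0, 0.3)  # cheap, weakly correlated proxy
        pool.append({"id": i, "descriptor": descriptor,
                     "llp": llp, "activity": activity})
    return pool

def llp_initialise(pool, n_init):
    """Seed the search with the candidates that the LLP ranks highest."""
    return sorted(pool, key=lambda c: c["llp"], reverse=True)[:n_init]

def features(candidate):
    """Feature vector for the surrogate model, with the LLP score appended."""
    return [candidate["descriptor"], candidate["llp"]]

pool = make_candidates()
seed_set = llp_initialise(pool, 10)                  # LLP-guided initial set
random_set = random.Random(1).sample(pool, 10)       # random baseline for comparison

def mean_activity(cands):
    return sum(c["activity"] for c in cands) / len(cands)

print("mean activity, LLP-seeded:", round(mean_activity(seed_set), 3))
print("mean activity, random:    ", round(mean_activity(random_set), 3))
```

Comparing the two printed means gives a rough sense of how much a weakly correlated proxy can enrich the starting set on this toy data; no such comparison should be read as a result of the project.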

Research Interests

Chemistry. Machine learning. Transfer learning. Neural networks. Gaussian Process Regression.

Background

BA/MSc Natural Sciences, University of Cambridge

Supervisors

Dr Matthew Grayson

Dr Pranav Singh
