Designing and engineering molecules not only tests our understanding of nature but also plays an important role in improving both human health and industrial productivity. Two examples are drug discovery, which aims to design new molecules to treat diseases, and protein engineering, which develops useful proteins for medical purposes or for catalyzing industrial chemical reactions. Both are time-consuming and costly because they require multiple rounds of trial and error. One effective way to reduce these costs and accelerate these processes is to develop computational methods that rationalize the course of design and engineering. Facilitated by methodological developments and the increasing availability of computational resources, computational strategies are becoming effective aids to drug discovery and protein engineering. In this dissertation, I describe the development of novel computational methodologies for drug discovery and protein engineering that exploit evolving accelerated computing architectures and the intersection between statistical approaches and statistical mechanics.
Protein-ligand docking and free energy calculations are widely employed computational methods in drug discovery. In the dissertation, I first describe the development of an accelerated version of the protein-ligand docking method CDOCKER, which introduces two new features: fast Fourier transform based docking and parallel simulated annealing, both of which utilize the parallel computing power of graphics processing units (GPUs). These advances not only accelerate CDOCKER by more than an order of magnitude but also provide an approach to calculate an upper bound on the docking accuracy of current scoring functions. In the second project, which is directed toward a more rigorous assessment of a ligand's binding affinity for a receptor, I introduce two new methods for protein-ligand binding free energy calculations: the Gibbs Sampler λ-Dynamics (GSLD) methodology and Rao-Blackwell estimators (RBE) for improved analysis of the simulation results from GSLD. Compared with the original λ-dynamics approach, GSLD is more flexible, easier to implement, and retains the capacity to calculate free energies for multiple ligands in a single simulation. Compared with the empirical estimators previously used with λ-dynamics, RBE is unbiased, does not depend on ad hoc cutoff values, and has smaller variance.
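The difference between an empirical estimator and a Rao-Blackwell estimator can be illustrated with a minimal sketch on a toy problem. Everything in it is an assumption made for illustration and is not the GSLD implementation described in the dissertation: the two "ligand" end states are hypothetical one-dimensional harmonic wells (parameters `K`, `X0`, `C`, in units of kT), the alchemical variable is a discrete state index, and the sampler simply alternates between drawing the coordinate given the state and the state given the coordinate. The empirical estimate counts visits to each state, while the Rao-Blackwell estimate averages the conditional state probabilities over the sampled coordinates.

```python
import math
import random

# Hypothetical toy system: two end states, each a 1D harmonic well.
# U_i(x) = 0.5 * K[i] * (x - X0[i])**2 + C[i], with energies in units of kT.
K = (1.0, 4.0)
X0 = (0.0, 1.0)
C = (0.0, 0.5)


def energy(i, x):
    """Potential energy of coordinate x in end state i."""
    return 0.5 * K[i] * (x - X0[i]) ** 2 + C[i]


def conditional_state_probs(x):
    """P(state = i | x), proportional to exp(-U_i(x))."""
    energies = [energy(i, x) for i in range(len(K))]
    e_min = min(energies)  # subtract the minimum for numerical stability
    weights = [math.exp(-(e - e_min)) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]


def gibbs_free_energy(n_steps=50000, seed=0):
    """Return (empirical, Rao-Blackwell) estimates of F_1 - F_0 in kT."""
    rng = random.Random(seed)
    state = 0
    counts = [0, 0]
    prob_sums = [0.0, 0.0]
    for _ in range(n_steps):
        # Gibbs step 1: draw x | state (exactly Gaussian for a harmonic well).
        x = rng.gauss(X0[state], 1.0 / math.sqrt(K[state]))
        # Gibbs step 2: draw state | x from the conditional probabilities.
        probs = conditional_state_probs(x)
        state = 1 if rng.random() < probs[1] else 0
        counts[state] += 1          # empirical estimator: count visits
        prob_sums[0] += probs[0]    # Rao-Blackwell estimator: average
        prob_sums[1] += probs[1]    # the conditional probabilities
    empirical = -math.log(counts[1] / counts[0])
    rao_blackwell = -math.log(prob_sums[1] / prob_sums[0])
    return empirical, rao_blackwell


def exact_free_energy():
    """Analytic F_1 - F_0 for the harmonic toy system, in kT."""
    return (C[1] - C[0]) + 0.5 * math.log(K[1] / K[0])


if __name__ == "__main__":
    emp, rb = gibbs_free_energy()
    print("empirical:", emp, "Rao-Blackwell:", rb, "exact:", exact_free_energy())
```

Because the toy system has an analytic answer, both estimates can be compared against `exact_free_energy()`. Note that the Rao-Blackwell estimate uses every sampled coordinate, including those drawn while the chain sits in the other state, and needs no cutoff to decide which state a sample belongs to.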
In the realm of protein engineering, I investigated the development and application of variational auto-encoder (VAE) models to infer protein stability, evolution, and fitness landscapes based on alignments of protein sequences. VAE models are probabilistic generative models that embed discrete sequences in a lower-dimensional continuous latent space. Utilizing the multiple sequence alignment from a protein family as training data, VAE models learn a probability distribution of sequences for the protein family. The probability distribution may then be employed to predict protein stability changes upon mutation. The embedding of sequences in a low-dimensional latent space not only provides an approach to visualize a protein family's sequence space, but also captures evolutionary relationships between sequences. Together with experimental fitness data, the embedding enables the visualization and expression of the fitness landscape in a low-dimensional continuous space. Exploiting the rapidly increasing amount of protein sequence data resulting from advances in sequencing technology, I demonstrate that these features of the VAE models are of significance for studying protein properties and evolution as well as guiding protein engineering efforts.
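To make concrete how a VAE treats an aligned sequence, the following minimal sketch computes the pieces of a single-sample evidence lower bound (ELBO) for one sequence. It is written under stated assumptions and is not the model used in the dissertation: the 21-letter alphabet (20 amino acids plus a gap), the function names, and the use of plain lists for the encoder outputs (`mu`, `log_var`) and the decoder logits are all hypothetical stand-ins for the outputs of a trained network.

```python
import math

# Assumed alphabet: 20 standard amino acids plus the alignment gap character.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def one_hot(seq):
    """Encode an aligned sequence as one 21-dimensional indicator row per position."""
    return [[1.0 if aa == sym else 0.0 for sym in ALPHABET] for aa in seq]


def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, diag(exp(log_var))) to the N(0, I) prior."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(v - top) for v in logits))
    return [v - log_norm for v in logits]


def reconstruction_log_prob(seq, logits_per_position):
    """Log-probability of seq under independent per-position categoricals."""
    total = 0.0
    for aa, logits in zip(seq, logits_per_position):
        total += log_softmax(logits)[ALPHABET.index(aa)]
    return total


def elbo(seq, mu, log_var, logits_per_position):
    """Single-sample ELBO: reconstruction term minus the KL term."""
    recon = reconstruction_log_prob(seq, logits_per_position)
    return recon - kl_to_standard_normal(mu, log_var)
```

In a sketch like this, the vector `mu` plays the role of the sequence's coordinates in the latent space, and the ELBO serves as a lower bound on the log-probability the model assigns to the sequence. One possible way to score a mutation, stated here as an assumption rather than as the dissertation's procedure, is to compare such log-probability estimates for the mutant and the wild-type sequence.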