"Distillation of Protein Stability and Evolutionary Information From Multiple Sequence Alignments Using a Machine Learning Variational Auto-Encoder"
The volume of protein sequence data has grown rapidly owing to advances in sequencing technology. These data contain rich information about protein structure, function, and evolution, and methods that can distill this information are invaluable tools for both understanding and engineering proteins. In this paper, we investigate how variational auto-encoders, a class of probabilistic generative machine-learning models, can be used to extract protein stability and evolutionary information from multiple sequence alignments (MSAs). Trained on an MSA, the variational auto-encoder learns a probability distribution over protein sequence space, and we show that this distribution is useful for predicting changes in protein stability upon mutation. In addition, the variational auto-encoder can project high-dimensional protein sequence data into a low-dimensional (2D or 3D) continuous latent space, which provides a convenient representation for visualizing relationships within sequence space. Through simulation studies, we show that the latent space representation captures evolutionary relationships between sequences and provides an intuitive way to interpolate between sequences. Overall, our findings suggest that the variational auto-encoder is effective at inferring protein stability and evolutionary relationships from MSAs and should be a useful guide for protein engineering efforts.
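As a rough illustration of the two uses described in the abstract (scoring a mutation with the learned distribution, and projecting sequences into a 2D latent space), the following is a minimal NumPy sketch. Everything in it is an assumption made for illustration: the toy alignment, the `TinyVAE` class, its linear encoder and decoder with random untrained weights, and the use of a difference in the evidence lower bound (ELBO) as a proxy for a log-probability ratio. It is not the architecture or scoring procedure used in the paper.

```python
import numpy as np

AMINO_ACIDS = "-ACDEFGHIKLMNPQRSTVWY"  # gap symbol plus the 20 amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(seq):
    """Encode one aligned sequence as a flat one-hot vector of length L * 21."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    x[np.arange(len(seq)), [AA_INDEX[a] for a in seq]] = 1.0
    return x.reshape(-1)


class TinyVAE:
    """Hypothetical linear VAE with random, untrained weights (shapes only)."""

    def __init__(self, seq_len, latent_dim=2, seed=0):
        rng = np.random.default_rng(seed)
        d = seq_len * len(AMINO_ACIDS)
        self.seq_len = seq_len
        self.W_mu = rng.normal(0.0, 0.1, (latent_dim, d))
        self.W_logvar = rng.normal(0.0, 0.1, (latent_dim, d))
        self.W_dec = rng.normal(0.0, 0.1, (d, latent_dim))
        self.b_dec = np.zeros(d)

    def encode(self, x):
        """Return mean and log-variance of the approximate posterior q(z | x)."""
        return self.W_mu @ x, self.W_logvar @ x

    def decode_log_probs(self, z):
        """Return per-position log-probabilities over the 21 symbols, shape (L, 21)."""
        logits = (self.W_dec @ z + self.b_dec).reshape(self.seq_len, -1)
        logits -= logits.max(axis=1, keepdims=True)  # numerically stable log-softmax
        return logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    def elbo(self, x, n_samples=100, seed=0):
        """Monte Carlo estimate of the ELBO, a lower bound on log p(x)."""
        rng = np.random.default_rng(seed)
        mu, logvar = self.encode(x)
        std = np.exp(0.5 * logvar)
        x_mat = x.reshape(self.seq_len, -1)
        recon = 0.0
        for _ in range(n_samples):
            z = mu + std * rng.standard_normal(mu.shape)  # reparameterization trick
            recon += (x_mat * self.decode_log_probs(z)).sum()
        recon /= n_samples
        # KL divergence between a diagonal Gaussian q(z | x) and a standard normal prior
        kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
        return recon - kl


# Toy alignment (made-up sequences, not real data).
msa = ["MKV-LA", "MKVALA", "MRV-LA"]
vae = TinyVAE(seq_len=6)

# Assumed stability proxy: ELBO(mutant) - ELBO(wild type).
wild_type = one_hot(msa[0])
mutant = one_hot("MKVGLA")
score = vae.elbo(mutant) - vae.elbo(wild_type)

# 2D latent coordinates: the posterior mean of each aligned sequence.
latent_2d = np.array([vae.encode(one_hot(s))[0] for s in msa])
```

With a model actually trained on an MSA, `latent_2d` would be the coordinates to plot when visualizing relationships between sequences, and `score` would be the quantity to compare against measured stability changes. With the random weights used here, both values are meaningless beyond demonstrating the data flow.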