February 27, 2019

"Learning latent embeddings for genomic and proteomic data"

4:00 PM to 5:00 PM

Forum Hall, 4th Floor, Palmer Commons Building

CCMB Seminar Series – sponsored by DCMB
by Dr. William Stafford Noble (University of Washington)


Deep machine learning architectures capture nonlinear relationships in big data sets by mathematically embedding the input data into a series of latent spaces. Typically, this latent representation is a byproduct on the way to a prediction, but the latent representation itself can also be useful, particularly to transfer information between related machine learning tasks. I will discuss two recent projects in which we have used latent representations to encode information about, respectively, genomics and proteomics. In the genomic setting, we train a deep tensor factorization method, called Avocado, to impute missing epigenomic data sets. We then demonstrate that the latent genome representation is useful in several other predictive settings, including predicting gene expression and chromatin features. In the proteomic setting, a deep Siamese network called GLEAMS learns to embed tandem mass spectra in a latent space in such a way that spectra generated by the same peptide are close together. In subsequent exploration of that space, we detect groups of unidentified, proximal spectra representing the same peptide, and we show how to use spectral communities to reveal misidentified spectra and to characterize frequently observed but consistently unidentified molecular species.
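
As a rough illustration of the embedding objective described for the proteomic setting, the sketch below computes a standard pairwise contrastive loss of the kind often used to train Siamese networks. It is an assumption made for illustration only and is not taken from the GLEAMS implementation: the function names, the Euclidean distance, and the `margin` parameter are all hypothetical. The loss pulls embeddings of spectra from the same peptide together and pushes other pairs at least `margin` apart.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a, emb_b, same_peptide, margin=1.0):
    """Pairwise contrastive loss (illustrative; not the GLEAMS objective).

    emb_a, emb_b -- embeddings of two spectra (hypothetical inputs)
    same_peptide -- True if both spectra come from the same peptide
    margin       -- assumed minimum separation for different-peptide pairs
    """
    d = euclidean_distance(emb_a, emb_b)
    if same_peptide:
        # Same peptide: any distance between the embeddings is penalized.
        return d ** 2
    # Different peptides: penalized only when closer than the margin.
    return max(0.0, margin - d) ** 2


# Two nearby embeddings labeled as different peptides incur a penalty.
loss_near = contrastive_loss([0.0, 0.0], [0.5, 0.0], same_peptide=False)
# Two distant embeddings labeled as the same peptide incur a larger one.
loss_far = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_peptide=True)
```

In a real Siamese setup, both embeddings would come from the same network with shared weights, and this loss would be averaged over many spectrum pairs during training.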