"Deep Learning-based Ab Initio Protein Structure Prediction and Structure-based Protein Function Annotation"
Predicting protein structure from sequence (especially in the absence of structure templates) and deducing biological function from structure remain significant unsolved problems. Much of the recent progress in ab initio (i.e. template-free) modeling of protein structure is due to the introduction of inter-residue distances predicted by deep learning.
We present D-QUARK, an ab initio protein folding algorithm guided by residue-residue distances and orientations predicted by deep learning. The D-QUARK pipeline is distinct from existing protein folding programs in the following aspects. Firstly, for a target sequence, it generates a high-quality multiple sequence alignment (MSA) of deep and diverse sequence homologs using the in-house DeepMSA algorithm. Secondly, to generate input features for deep learning prediction of distances and orientations from the MSA, raw coevolution features are extracted in the form of a covariance matrix and pseudo-likelihood maximization parameters, rather than traditional post-processed coevolutionary features. Thirdly, the distance and orientation potentials are incorporated into a comprehensive replica-exchange Monte Carlo (REMC) simulation with a uniquely designed flat-well potential for ab initio protein folding. The high-quality MSA, accurate deep learning prediction, and REMC simulation with carefully designed energy terms all contribute to the high performance of D-QUARK, which outperforms our previous contact-based protein folding algorithm, C-QUARK, by 43.4%, and two state-of-the-art distance-based structure prediction programs, DMPfold and trRosetta, by 22.9% and 11.4%, respectively, in terms of first model TM-score on a set of 301 hard targets. In a post-CASP experiment, D-QUARK achieves an 8.1% higher first model TM-score on CASP13 FM target proteins than AlphaFold.
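As an illustration of the raw covariance features mentioned above, the following is a minimal sketch rather than the D-QUARK implementation. It assumes a 21-letter alphabet (20 amino acids plus gap), applies no sequence weighting or pseudocounts, and uses the hypothetical names `ALPHABET` and `covariance_features`. For each pair of alignment columns i, j and residue types a, b it computes f_ij(a, b) - f_i(a) f_j(b).

```python
from itertools import product

# Assumption: 20 standard amino acids plus a gap state; any other
# character (e.g. X or B) is treated as a gap in this sketch.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def covariance_features(msa):
    """Return cov[(i, a, j, b)] = f_ij(a, b) - f_i(a) * f_j(b) for an aligned MSA.

    `msa` is a list of equal-length aligned sequences. Every sequence
    counts equally (no reweighting), which is a simplification.
    """
    if not msa:
        raise ValueError("MSA must contain at least one sequence")
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all sequences must have the same aligned length")

    # Map unknown characters to the gap state.
    rows = [[c if c in ALPHABET else "-" for c in seq.upper()] for seq in msa]
    weight = 1.0 / len(rows)

    # Single-column frequencies f_i(a).
    single = [dict.fromkeys(ALPHABET, 0.0) for _ in range(length)]
    for row in rows:
        for i, a in enumerate(row):
            single[i][a] += weight

    cov = {}
    for i, j in product(range(length), repeat=2):
        # Pair frequencies f_ij(a, b) for this pair of columns.
        pair = {}
        for row in rows:
            key = (row[i], row[j])
            pair[key] = pair.get(key, 0.0) + weight
        for a, b in product(ALPHABET, repeat=2):
            cov[(i, a, j, b)] = pair.get((a, b), 0.0) - single[i][a] * single[j][b]
    return cov
```

The result has L x L x 21 x 21 entries for an alignment of length L, which is the kind of unprocessed pairwise signal a network can consume directly instead of a single post-processed coupling score per residue pair.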
To annotate protein functions, including Gene Ontology (GO) terms, Enzyme Commission (EC) numbers, and ligand binding sites, from predicted structure models, we developed COFACTOR. COFACTOR combines functional templates identified by structure alignment against the target structure model with sequence homologs and protein-protein interaction partners to derive consensus function annotations. COFACTOR was blindly tested in the community-wide CAFA3 function annotation challenge and was ranked among the top groups.
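The idea of deriving a consensus from several evidence sources can be sketched as follows. This is not COFACTOR's actual scoring scheme, which is not specified here. It assumes each source (structure templates, sequence homologs, interaction partners) yields a confidence score between 0 and 1 per candidate term, and it combines them with a noisy-OR rule chosen purely for illustration. The function name `consensus_scores` and the source labels are hypothetical.

```python
def consensus_scores(sources):
    """Combine per-source confidence scores for each term with a noisy-OR rule.

    `sources` maps a source name to a dict of {term: confidence in [0, 1]}.
    A term missing from a source contributes a confidence of 0 for that source.
    Noisy-OR is an illustrative assumption, not the published method.
    """
    terms = set()
    for scores in sources.values():
        terms.update(scores)

    combined = {}
    for term in terms:
        # Probability that every source fails to support the term.
        miss = 1.0
        for scores in sources.values():
            miss *= 1.0 - scores.get(term, 0.0)
        combined[term] = 1.0 - miss
    return combined


if __name__ == "__main__":
    example = {
        "structure": {"GO:0003824": 0.5},
        "sequence": {"GO:0003824": 0.5, "GO:0005515": 0.2},
        "ppi": {},
    }
    for term, score in sorted(consensus_scores(example).items()):
        print(term, round(score, 3))
```

Under this rule, a term supported by more than one source scores higher than it would from any single source, which captures the intuition behind consensus annotation.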
The structure and function prediction pipeline developed in this thesis was applied to proteome-wide annotation projects for several model organisms, including human and the JCVI-syn3.0 minimal bacterial genome, where it revealed previously uncharacterized proteins with important functions.