Massively parallel single-cell and single-nucleus RNA sequencing (sc/snRNA-seq) has opened the way to systematic tissue atlases in health and disease, but as the scale of data generation grows, so does the need for computational pipelines that can analyze data at that scale. We developed Cumulus, the first comprehensive cloud-based framework, to address the big data challenge arising from sc/snRNA-seq analysis. Cumulus combines the power of cloud computing with improvements in algorithms and implementation to achieve high scalability, low cost, user-friendliness and integrated support for a comprehensive set of features. We benchmark Cumulus on the Human Cell Atlas Census of Immune Cells dataset of bone marrow cells and show that it substantially improves efficiency over conventional frameworks while maintaining or improving the quality of results, thereby enabling large-scale studies.
In recent years, biologists have found that sc/snRNA-seq alone is not enough to reveal the full picture of how cells function and coordinate with each other in a complex tissue. They have begun to couple sc/snRNA-seq with other common data modalities, such as single-cell ATAC-seq (scATAC-seq), single-cell Immune Repertoire sequencing (scIR-seq), spatial transcriptomics and mass cytometry. This coupling of data is called single-cell multimodal omics. As it becomes common practice, new analysis needs are emerging along with two major computational challenges: the big data challenge and the integration challenge. The big data challenge requires us to develop scalable computational infrastructure and algorithms to handle the ever-growing datasets produced by the community. The integration challenge requires us to design new algorithms that enable holistic integration of heterogeneous data from different modalities. In the last part of my talk, I will discuss my team’s efforts and plans to develop Cumulus into an integrated data analysis framework for single-cell multimodal omics at scale.
Single-cell multimodal omics has the potential to provide a more comprehensive characterization of complex multicellular systems than the sum of its parts. As the datasets produced by the community continue to grow, the enhanced Cumulus will continue to play an important role in the effort to build atlases of complex tissues and organs at higher cellular resolution, and in leveraging them to understand the human body in health and disease.
Dr. Bo Li is an assistant professor of medicine at Harvard Medical School, the director of Bioinformatics and Computational Biology at the Center for Immunology and Inflammatory Diseases, Massachusetts General Hospital, and an associate member of the Broad Institute of MIT and Harvard. His research focuses on large-scale single-cell and single-nucleus genomics data analysis. He received his Ph.D. in computer science from UW-Madison and completed postdoctoral training with Dr. Lior Pachter at UC Berkeley and with Dr. Aviv Regev at the Broad Institute. He is best known for developing RSEM, an impactful software package for RNA-seq transcript quantification. RSEM has been cited 9,384 times (Google Scholar) and has been adopted by several large consortia, such as TCGA, ENCODE, GTEx and TOPMed.