Abstract
The availability of genomic data has increased dramatically in recent years. These datasets have great potential to contribute to our understanding of evolution at different scales, from changes in populations of cancer cells to the tree of life. One challenge in building species-scale phylogenies (i.e., evolutionary trees) is extracting informative data from large genomic datasets, particularly in non-model organisms. We have developed easy-to-use, open-source software (SISRS: Site Identification from Short Read Sequences) to rapidly identify potentially informative data directly from high-throughput sequencing reads without a reference genome. Currently, we are characterizing these potentially informative regions of the genome to understand the contributions of different types of data to phylogenetics, and the nature of non-phylogenetic signal. To characterize differences among populations, we are developing software that identifies populations based on samples’ k-mer frequencies and describes the nature of these differences. We hope this work will be applicable to identifying patterns in the evolution of tumor cells across individuals and cancer types, and to evolutionary changes in broader contexts.
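
As a rough illustration of what comparing samples by k-mer frequencies can involve, the sketch below counts k-mers in each sample's reads, normalizes the counts to frequencies, and measures how far apart two samples' profiles are. It is not the software described above: the function names (`kmer_frequencies`, `profile_distance`), the choice of k = 3, the use of Euclidean distance, and the toy reads are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_frequencies(reads, k):
    """Return the relative frequency of each k-mer across a sample's reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def profile_distance(p, q):
    """Euclidean distance between two k-mer frequency profiles (an assumed metric)."""
    keys = set(p) | set(q)
    return sqrt(sum((p.get(key, 0.0) - q.get(key, 0.0)) ** 2 for key in keys))


# Toy reads, invented for illustration only.
samples = {
    "sample_a": ["ACGTACGTAC", "CGTACGTACG"],
    "sample_b": ["ACGTACGTAC", "GTACGTACGT"],
    "sample_c": ["TTTTGGGGTT", "GGGTTTTGGG"],
}

# One frequency profile per sample; k = 3 is an arbitrary choice for the toy data.
profiles = {name: kmer_frequencies(reads, k=3) for name, reads in samples.items()}

d_ab = profile_distance(profiles["sample_a"], profiles["sample_b"])
d_ac = profile_distance(profiles["sample_a"], profiles["sample_c"])
print(f"a vs b: {d_ab:.3f}, a vs c: {d_ac:.3f}")
```

In a setting like this, samples with similar profiles (here `sample_a` and `sample_b`) would be candidates for the same population, and the pairwise distances could be passed to any standard clustering method. Real reads would call for a larger k and attention to sequencing error and coverage, which this sketch ignores.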