Demand from the research community for unstructured clinical data, such as clinical notes and reports, and for the computational infrastructure and tools needed to work with them, has grown in recent years, driven largely by advances in statistical and machine-learning approaches to extracting insight from data. We are meeting this demand with the Information Commons, a research data platform that hosts and provides direct access to de-identified data, advanced analytics tools, and computational environments for our research community.
While we are also enabling access to de-identified electronic health records, images, omics, and biobank data, this session highlights our progress in providing more than 110 million de-identified notes to the research community. We developed and operationalized a fully automatic de-identification algorithm and implemented EMERSE, a user-friendly tool for non-programmatic access to, and sophisticated text searches of, the de-identified clinical notes.
As of December 2021, our de-identification algorithm is certified, and our clinical notes are certified as de-identified and currently available to UCSF researchers with IRD. The presentation covers the entire pipeline, from data extraction to publication and data access, with a focus on the secure computational infrastructure. Furthermore, we discuss the rigorous evaluation techniques used to ensure the quality of the de-identification process and the resulting data in accordance with HIPAA and UCSF Security and Privacy protection requirements. Lastly, we showcase highlights from research collaborations enabled by this new resource of machine-redacted, unstructured clinical notes linked with de-identified structured EHR data using EMERSE, and their impact on the research community.