Research

Improving sequence alignment and reducing reference bias

Many genomic analyses start by aligning sequenced reads to a reference genome, such as GRCh38. Despite the popularity and efficiency of this approach, these analyses can suffer from "reference bias" caused by (a) assembly artifacts in the reference and (b) its lack of genetic diversity.

While new, complete references such as T2T-CHM13 resolve many assembly artifacts, migrating all genomic analyses to them is challenging because genomic annotations for them are still lacking. We developed the "levioSAM2" approach to take advantage of the improved quality of T2T-CHM13 while generating results compatible with the coordinates of older references.

To address the lack of genetic diversity in a linear haploid reference, we designed the "reference flow" method to efficiently align reads to multiple population references. We also studied how to prioritize which variants to include in a graph aligner ("FORGe") and developed fast software to lift over alignments ("levioSAM").

Accurate variant calling utilizing population data and deep learning

Variant calling identifies differences between a sample and the reference genome, usually with sophisticated processing and statistical methods. DeepVariant is a deep-learning-based variant caller developed by Google Health. While interning on the DeepVariant team, I led the development of a population-aware DeepVariant model that incorporates population information into variant calling. This model is more accurate than the default DeepVariant model.

Developing computational methods for biological and medical data

Many research areas have been adopting modern sequencing technology for faster and more accurate analysis. I collaborate with researchers at Johns Hopkins University to predict chromatin accessibility using machine learning methods. I also work with collaborators at National Taiwan University on immunogenomics, including the human leukocyte antigen (HLA) complex and the Adaptive Immune Receptor Repertoire (AIRR).

Designing efficient hardware architecture for computational genomics

Given the enormous volume of genomics data, processing it efficiently has become a major challenge. One solution is to use specialized hardware architectures based on FPGAs or ASICs to make computing more efficient. I design specialized algorithms that meet hardware constraints (usually tighter than those of software) and develop architectures for efficient computing. My research in this area covers short-read and long-read alignment, multiple sequence alignment, and assembly.