Detection and Modeling of Protein-Protein Interactions on a Proteome-wide Scale

Protein-protein interactions (PPIs) play critical roles in all aspects of biology. Despite decades of effort, the structures of many protein complexes remain unknown, and many PPIs have not yet been identified. We developed a deep-learning-based method to identify interacting proteins in silico through coevolution between positions of interacting partners. Application of this method to prokaryotic proteins demonstrated that its accuracy was comparable to that of commonly used experimental screens such as yeast two-hybrid or affinity purification mass spectrometry. Recent breakthroughs in Artificial Intelligence (AI) methods for protein structure modeling further empowered our coevolution-based PPI screen. Leveraging state-of-the-art AI tools, we recently carried out a project to identify and model core eukaryotic protein complexes. This project revealed functional insights for a wide range of eukaryotic cellular processes and provided new targets for therapeutic intervention. Our results herald a new era of structural biology in which computation plays a fundamental role in both interaction discovery and structure determination. Similarly, coevolution-based approaches help shed light on protein-protein interactions and their interfaces for human mitochondrial proteins.

Molecular Basis of Pathogenicity and Host Defense

A better understanding of the molecular machineries that pathogens and host immune systems use during their arms races holds promise for better disease prevention and treatment strategies. Several of our projects aim to characterize virulence factors of human pathogens and to understand human immune responses. Careful protein sequence analysis uncovered an important motif in innate immune adapter proteins that can induce interferon activation during host defense against pathogens. We also compared Ebolavirus species that differ in their pathogenicity to humans, suggesting features of Zaire ebolavirus that are responsible for its high pathogenicity during the 2014-2016 Ebola outbreak. Recently, by combining coevolution-based protein complex modeling with cryo-EM, we determined the 3D structure of the entire bacterial type IV secretion system, a machinery that allows bacteria to exchange genetic material and deliver DNA or proteins to host cells. Accurate 3D structures of protein complexes also allowed us to detect homology between fast-evolving virulence factors, which helped our collaborators discover a class of virulence regulators in pathogenic bacteria.

Interpretation of Disease-causing Genetic Variants and Somatic Mutations in Cancer

We are interested in understanding how genetic variants and somatic mutations affect the function of proteins and thus contribute to diseases. In the past, Dr. Cong's contributions in this direction were: (1) building a web server to help researchers gather information about genetic variants in a protein; (2) helping a physician interpret the molecular mechanism of the disease-causing variants she observed in her patients; (3) serving as an assessor for several rounds of the Critical Assessment of Genome Interpretation (CAGI). Recently, our advances in predicting and accurately modeling protein-protein interactions allowed us to take a significant step towards resolving the interactome of cancer-driving proteins. Integrating the cancer interactome with the somatic mutation landscape in cancer, we were able to identify possible cancer driver mutations on crucial protein-protein interfaces and explain their functional consequences.

Evolutionary Genomics of Butterflies

The success of next-generation sequencing (NGS) technologies enabled us to study biological questions by comparative genomic analysis of large groups of organisms. Dr. Cong has developed experimental and computational pipelines to sequence and comparatively analyze genomes of butterflies. We obtained the first reference genomes for all but one butterfly family, and we used genomic data to study the genetic basis of evolution, speciation and adaptation. We gathered whole-genome sequences of all 845 species of butterflies recorded from the United States and Canada. From this extensive collection of data, we uncovered the patterns of speciation and diversification, and suggested that frequent gene exchange between species through hybridization might drive rapid diversification and adaptation in animals.

Genetic Basis for Unique Phenotypic Traits in Animals

One of the most important goals of genomic studies is to correlate genotypes with phenotypes. In a number of projects, we were able to suggest the genetic basis for unique features of animals. We found a unique expansion of isoprenoid biosynthesis enzymes in the genomes of swallowtails, possibly enabling them to synthesize terpenes for chemical defense against birds. Comparison of genomes across populations of skipper butterflies revealed that paler coloration in some species arose from introgression of genes from a closely related paler species. Sequencing and analysis of the genome of the gypsy moth, a notorious pest accidentally introduced to America from Europe, suggested reasons behind the differences in flight capacity of different moth populations. Moving to mammals, we used jumping mice as models to uncover the molecular mechanisms and genetic adaptations behind hibernation behavior in animals.

Prediction of Protein Structures and its Quality Assessment

Protein structure prediction is the cornerstone of computational biophysics. Working as a member of the assessment team for the CASP9 (Critical Assessment of Structure Prediction) experiment, Dr. Cong established a new approach to evaluate predictions that are of poor quality and yet show promise in structure modeling, which was the most challenging part of the assessment. This method became one of the standard approaches used in the assessment of subsequent CASP experiments. Dr. Cong also developed MESSA, a comprehensive server to predict structural properties and produce models of protein structures. Dr. Cong contributed to the refinement of protein structure models, with the resulting method being one of the top performers in CASP13. Furthermore, Dr. Cong's work improved contact prediction using deeper alignments and more sophisticated statistical treatment of data. All these results contributed to the astounding progress in structure prediction in the last couple of years.