Sequence-Structure Relationships in Proteins
This event is part of the Biophysics/Condensed Matter Seminar Series.
The rapid growth in sequence data from genome projects presents the obvious challenge of accurately modeling the structure and function of newly discovered proteins. The most efficient modeling approach is based on homology. In homology modeling we first identify a protein of known structure that is similar to the target; this protein is the template. The template is then used to model the structure of the new sequence. I shall describe our machine learning efforts to design fully automated systems for better detection of templates, alignment of the new sequence against the template, and production and scoring of final models. I will present a few blind tests and an example of significant biological interest: a gene that controls the size of the tomato fruit.
Another challenge presented by the new data is the development of a global model of sequence-structure relationships in proteins. We simulate the global relationships between protein sequences and structures with randomized algorithms. Since structures are preserved much better than sequences, we ask: What is the sequence capacity of protein structures? That is, how many sequences fold into a particular protein shape? We also seek ways to characterize the evolutionary interplay of protein sequences and folds. We define the “temperature of evolution” to characterize mutation mechanisms, and present a directed graph of all known protein folds that describes the flow of sequences between known structures. Implications for experimentally observed evolutionary relationships and for protein design will be discussed.