2024

Pascal Notin, Nathan Rollins, Yarin Gal, Chris Sander, Debora Marks

Nature Biotechnology; Feb 2024

Machine Learning for Functional Protein Design

Abstract

Recent breakthroughs in AI coupled with the rapid accumulation of protein sequence and structure data have radically transformed computational protein design. New methods promise to escape the constraints of natural and laboratory evolution, accelerating the generation of proteins for applications in biotechnology and medicine. To make sense of the exploding diversity of machine learning approaches, we introduce a unifying framework that classifies models on the basis of their use of three core data modalities: sequences, structures and functional labels. We discuss the new capabilities and outstanding challenges for the practical design of enzymes, antibodies, vaccines, nanomachines and more. We then highlight trends shaping the future of this field, from large-scale assays to more robust benchmarks, multimodal foundation models, enhanced sampling strategies and laboratory automation.

Links

View on Journal Website

Applying machine learning to biological sequences - DNA, RNA and protein - has enormous potential to advance human health, environmental sustainability, and fundamental biological understanding. However, many existing machine learning methods are ineffective or unreliable in this problem domain. We study these challenges theoretically, through the lens of kernels. Methods based on kernels are ubiquitous: they are used to predict molecular phenotypes, design novel proteins, compare sequence distributions, and more. Many methods that do not use kernels explicitly still rely on them implicitly, including a wide variety of both deep learning and physics-based techniques. While kernels for other types of data are well-studied theoretically, the structure of biological sequence space (discrete, variable length sequences), as well as biological notions of sequence similarity, present unique mathematical challenges. We formally analyze how well kernels for biological sequences can approximate arbitrary functions on sequence space and how well they can distinguish different sequence distributions. In particular, we establish conditions under which biological sequence kernels are universal, characteristic and metrize the space of distributions. We show that a large number of existing kernel-based machine learning methods for biological sequences fail to meet our conditions and can as a consequence fail severely. We develop straightforward and computationally tractable ways of modifying existing kernels to satisfy our conditions, imbuing them with strong guarantees on accuracy and reliability. Our proof techniques build on and extend the theory of kernels with discrete masses. We illustrate our theoretical results in simulation and on real biological data sets.

Links

View on Journal Website


Mike T. Veling, Dan T. Nguyen, Nicole N. Thadani, Michela E. Oster, Nathan J. Rollins, Kelly P. Brock, Neville P. Bethel, David Baker, Jeffrey C. Way, Debora S. Marks, Roger L. Chang, Pamela A. Silver
bioRxiv; 01 Oct 2021

Natural and designed proteins inspired by extremotolerant organisms can form condensates and attenuate apoptosis in human cells

Generative probabilistic modeling of biological sequences has widespread existing and potential application across biology and biomedicine, from evolutionary biology to epidemiology to protein design. Many standard sequence analysis methods preprocess data using a multiple sequence alignment (MSA) algorithm, one of the most widely used computational methods in all of science. However, as we show in this article, training generative probabilistic models with MSA preprocessing leads to statistical pathologies in the context of sequence prediction and forecasting. To address these problems, we propose a principled drop-in alternative to MSA preprocessing in the form of a structured observation distribution (the "MuE" distribution). The MuE is a latent alignment model in which not only the alignment variable but also the regressor sequence can be latent. We prove theoretically that the MuE distribution comprehensively generalizes popular methods for inferring biological sequence alignments, and provide a precise characterization of how such biological models have differed from natural language latent alignment models. We show empirically that models that use the MuE as an observation distribution outperform comparable methods across a variety of datasets, and apply MuE models to a novel problem for generative probabilistic sequence models: forecasting pathogen evolution.

Links

View on Journal Website


2019

David K Yang, Samuel L Goldman, Eli Weinstein, Debora S Marks
Machine Learning in Computational Biology; Published 13 Dec 2019

Generative models for codon prediction and optimization

Abstract

Optimizing foreign DNA sequences for maximal protein production in a specified host organism is an important problem for synthetic biology and biomanufacturing. Experimental results have demonstrated that simply interchanging codons, triplets of DNA bases, with synonymous alternatives can amplify protein production several-fold while holding the produced protein constant. Previous methods for codon optimization are frequency-based and cannot consider factors such as RNA secondary structure that contribute to protein expression. Here, we apply a deep learning framework to model the distribution of codons in highly expressed bacterial and human transcripts. We show that our LSTM-Transducer model is able to predict the next codon of a genetic sequence with improved accuracy and lower perplexity on a held-out set of transcripts, outperforming the previous state-of-the-art frequency-based approach to modeling codon distribution.

Links

View on Journal Website

Anna G Green, Hadeer Elhabashy, Kelly P Brock, Rohan Maddamsetti, Oliver Kohlbacher, Debora S Marks
bioRxiv 2019; Preprint 02 Oct 2019

Proteome-scale discovery of protein interactions with residue-level resolution using sequence coevolution

Abstract

The majority of protein interactions in most organisms are unknown, and experimental methods for determining protein interactions can yield divergent results. Here we use an orthogonal, purely computational method based on sequence coevolution to discover protein interactions at large scale. In the model organism Escherichia coli, 53% of protein pairs in the proteome are eligible for our method given currently available sequenced genomes. When assaying the entire cell envelope proteome, which is understudied due to experimental challenges, we found 620 likely interactions and their predicted structures, increasing the space of known interactions by 529. Our results show that genomic sequencing data can be used to predict and resolve protein interactions to atomic resolution at large scale. Predictions and code are freely available at https://marks.hms.harvard.edu/ecolicomplex

Data availability https://marks.hms.harvard.edu/ecolicomplex

Abstract

Coevolutionary sequence analysis has become a commonly used technique for de novo prediction of the structure and function of proteins, RNA, and protein complexes. We present the EVcouplings framework, a fully integrated open-source application and Python package for coevolutionary analysis. The framework enables generation of sequence alignments, calculation and evaluation of evolutionary couplings (ECs), and de novo prediction of structure and mutation effects. The combination of an easy to use, flexible command line interface and an underlying modular Python package makes the full power of coevolutionary analyses available to entry-level and advanced users.

Links

View on Journal Website

Yuanpeng Janet Huang, Kelly P Brock, Yojiro Ishida, Gurla VT Swapna, Masayori Inouye, Debora S Marks, Chris Sander, Gaetano T Montelione
Academic Press Methods in Enzymology, pp. 363-392

Combining evolutionary covariance and NMR data for protein structure determination

Abstract

Accurate protein structure determination by solution-state NMR is challenging for proteins greater than about 20 kDa, for which extensive perdeuteration is generally required, providing experimental data that are incomplete (sparse) and ambiguous. However, the massive increase in evolutionary sequence information coupled with advances in methods for sequence covariance analysis can provide reliable residue-residue contact information for a protein from sequence data alone. These "evolutionary couplings (ECs)" can be combined with sparse NMR data to determine accurate 3D protein structures. This hybrid "EC-NMR" method has been developed using NMR data for several soluble proteins and validated by comparison with corresponding reference structures determined by X-ray crystallography and/or conventional NMR methods. For small proteins, only backbone resonance assignments are utilized, while for larger proteins both backbone and some sidechain methyl resonance assignments are generally required. ECs can be combined with sparse NMR data obtained on deuterated, selectively protonated protein samples to provide structures that are more accurate and complete than those obtained using such sparse NMR data alone. EC-NMR also has significant potential for analysis of protein structures from solid-state NMR data and for studies of integral membrane proteins. The requirement that ECs are consistent with NMR data recorded on a specific member of a protein family, under specific conditions, also allows identification of ECs that reflect alternative allosteric or excited states of the protein structure.

Links

View on Journal Website
PDF
DOI
PubMed

Yuanpeng Janet Huang, Kelly P Brock, Chris Sander, Debora S Marks, Gaetano T Montelione

Integrative Structural Biology with Hybrid Methods; Published 08 Jan 2019

Sequence Data

Abstract

While 3D structure determination of small (< 15 kDa) proteins by solution NMR is largely automated and routine, structural analysis of larger proteins is more challenging. An emerging hybrid strategy for modeling protein structures combines sparse NMR data that can be obtained for larger proteins with sequence co-variation data, called evolutionary couplings (ECs), obtained from multiple sequence alignments of protein families. This hybrid “EC-NMR” method can be used to accurately model larger (15–60 kDa) proteins, and more rapidly determine structures of smaller (5–15 kDa) proteins using only backbone NMR data. The resulting structures have accuracies relative to reference structures comparable to those obtained with full backbone and sidechain NMR resonance assignments. The requirement that evolutionary couplings (ECs) are consistent with NMR data recorded on a specific member of a protein family, under specific conditions, potentially also allows identification of ECs that reflect alternative allosteric or excited states of the protein structure.

Links

View on Journal Website

2018

Xuewu Sui, Henning Arlt, Kelly Brock, Zon Weng Lai, Frank DiMaio, Debora Marks, Maofu Liao, Robert V Farese Jr, Tobias C Walther

Journal of Cell Biology; Published 16 October 2018

Cryo–electron microscopy structure of the lipid droplet–formation protein seipin

Abstract

Metabolic energy is stored in cells primarily as triacylglycerols in lipid droplets (LDs), and LD dysregulation leads to metabolic diseases. The formation of monolayer-bound LDs from the endoplasmic reticulum (ER) bilayer is poorly understood, but the ER protein seipin is essential to this process. In this study, we report a cryo–electron microscopy structure and functional characterization of Drosophila melanogaster seipin. The structure reveals a ring-shaped dodecamer with the luminal domain of each monomer resolved at ∼4.0 Å. Each luminal domain monomer exhibits two distinctive features: a hydrophobic helix (HH) positioned toward the ER bilayer and a β-sandwich domain with structural similarity to lipid-binding proteins. This structure and our functional testing in cells suggest a model in which seipin oligomers initially detect forming LDs in the ER via HHs and subsequently act as membrane anchors to enable lipid transfer and LD growth.

Links

View on Journal Website

Benjamin Schubert, Rohan Maddamsetti, Jackson Nyman, Maha R Farhat, Debora S Marks

Nature Microbiology; Published 03 December 2018

Abstract

The analysis of whole genome sequencing data should, in theory, allow the discovery of interdependent loci that cause antibiotic resistance. In practice, however, identifying this epistasis remains a challenge as the vast number of possible interactions erodes statistical power. To solve this problem, we extend a method that has been successfully used to identify epistatic residues in proteins to infer loci strongly coupled and associated with antibiotic resistance from whole genomes. Our method reduces the number of tests required for an epistatic genome-wide association study and increases the likelihood of identifying causal epistasis. We discover 38 loci and 250 epistatic pairs that influence the dose needed to inhibit growth for five different antibiotics in 1102 isolates of Neisseria gonorrhoeae, which were confirmed in an independent dataset of 495 isolates. Many of the known resistance-affecting loci were recovered, along with additional sites within those genes; however, the majority of loci occurred in unreported genes, including murE, which was associated with cefixime. About half of the novel epistasis we report involves at least one locus previously associated with antibiotic resistance, including interactions between gyrA and parC associated with ciprofloxacin, leaving many combinations involving unreported loci and genes. Our work provides a systematic identification of epistatic pairs in N. gonorrhoeae resistance and a generalizable method for epistatic genome-wide association studies.

Links

PDF
DOI
bioRxiv
PubMed

2017

Adam J Riesselman ^, John B Ingraham ^, Debora S Marks
^ Joint first authors
arXiv preprint 2017; Available on arXiv 18 December 2017

Deep generative models of genetic variation capture mutation effects

Abstract

The functions of proteins and RNAs are determined by a myriad of interactions between their constituent residues, but most quantitative models of how molecular phenotype depends on genotype must approximate this by simple additive effects. While recent models have relaxed this constraint to also account for pairwise interactions, these approaches do not provide a tractable path towards modeling higher-order dependencies. Here, we show how latent variable models with nonlinear dependencies can be applied to capture beyond-pairwise constraints in biomolecules. We present a new probabilistic model for sequence families, DeepSequence, that can predict the effects of mutations across a variety of deep mutational scanning experiments significantly better than site-independent or pairwise models that are based on the same evolutionary data. The model, learned in an unsupervised manner solely from sequence information, is grounded with biologically motivated priors, reveals latent organization of sequence families, and can be used to extrapolate to new parts of sequence space.

Links

View on Journal Website
PDF
DOI
bioRxiv

Benjamin Schubert, Charlotta PI Schärfe, Pierre Dönnes, Thomas A Hopf, Debora S Marks, Oliver Kohlbacher
arXiv preprint 2017; Available on arXiv 28 June 2017

Population-specific design of de-immunized protein biotherapeutics

Abstract

Immunogenicity is a major problem during the development of biotherapeutics since it can lead to rapid clearance of the drug and adverse reactions. The challenge for biotherapeutic design is therefore to identify mutants of the protein sequence that minimize immunogenicity in a target population whilst retaining pharmaceutical activity and protein function. Current approaches are moderately successful in designing sequences with reduced immunogenicity, but do not account for the varying frequencies of different human leucocyte antigen alleles in a specific population and, since many designs are non-functional, require costly experimental post-screening. Here we report a new method for de-immunization design using multi-objective combinatorial optimization that simultaneously maximizes the likelihood of a functional protein sequence and minimizes its immunogenicity, tailored to a target population. We bypass the need for three-dimensional protein structure or molecular simulations to identify functional designs by automatically generating sequences using probabilistic models that have been used previously for mutation effect prediction and structure prediction. As proof-of-principle we designed sequences of the C2 domain of Factor VIII and tested them experimentally, resulting in a good correlation with the predicted immunogenicity of our model.

Links

View on Journal Website
PDF
DOI

John B Ingraham, Debora S Marks
ICML 2017; Available on arXiv 14 June 2017

Variational inference for sparse and undirected models

Abstract

Undirected graphical models are applied in genomics, protein structure prediction, and neuroscience to identify sparse interactions that underlie discrete data. Although Bayesian methods for inference would be favorable in these contexts, they are rarely used because they require doubly intractable Monte Carlo sampling. Here, we develop a framework for scalable Bayesian inference of discrete undirected models based on two new methods. The first is Persistent VI, an algorithm for variational inference of discrete undirected models that avoids doubly intractable MCMC and approximations of the partition function. The second is Fadeout, a reparameterization approach for variational inference under sparsity-inducing priors that captures a posteriori correlations between parameters and hyperparameters with noncentered parameterizations. We find that, together, these methods for variational inference substantially improve learning of sparse undirected graphical models in simulated and real problems from physics and biology.

Links

View on Journal Website
PDF

Charlotta PI Schärfe, Roman Tremmel, Matthias Schwab, Oliver Kohlbacher, Debora S Marks
Genome Medicine 2017; Published 22 December 2017

Abstract

Variability in drug efficacy and adverse effects is observed in clinical practice. While the extent of genetic variability in classic pharmacokinetic genes is rather well understood, the role of genetic variation in drug targets is typically less studied.

Links

View on Journal Website
PDF
DOI
bioRxiv
PubMed

Thomas A Hopf ^, John B Ingraham ^, Frank J Poelwijk, Charlotta PI Schärfe, Michael Springer, Chris Sander, Debora S Marks
^ Joint first authors
Nature Biotechnology; Web 16 Jan 2017

Mutation effects predicted from sequence co-variation

Abstract

Many high-throughput experimental technologies have been developed to assess the effects of large numbers of mutations (variation) on phenotypes. However, designing functional assays for these methods is challenging, and systematic testing of all combinations is impossible, so robust methods to predict the effects of genetic variation are needed. Most prediction methods exploit evolutionary sequence conservation but do not consider the interdependencies of residues or bases. We present EVmutation, an unsupervised statistical method for predicting the effects of mutations that explicitly captures residue dependencies between positions. We validate EVmutation by comparing its predictions with outcomes of high-throughput mutagenesis experiments and measurements of human disease mutations and show that it outperforms methods that do not account for epistasis. EVmutation can be used to assess the quantitative effects of mutations in genes of any organism. We provide pre-computed predictions for ∼7,000 human proteins at http://evmutation.org/.

Links

View on Journal Website
PDF
DOI
PubMed
EV Mutation Homepage

2014

Thomas A Hopf, Satoshi Morinaga, Sayoko Ihara, Kazushige Touhara, Debora S Marks, Richard Benton
Nature Communications 6, Article number: 6077

Amino acid coevolution reveals three-dimensional structure and functional domains of insect odorant receptors

Abstract

Insect odorant receptors (ORs) comprise an enormous protein family that translates environmental chemical signals into neuronal electrical activity. These heptahelical receptors are proposed to function as ligand-gated ion channels and/or to act metabotropically as G protein-coupled receptors (GPCRs). Resolving their signalling mechanism has been hampered by the lack of tertiary structural information and primary sequence similarity to other proteins. We use amino acid evolutionary covariation across these ORs to define restraints on structural proximity of residue pairs, which permit de novo generation of three-dimensional models. The validity of our analysis is supported by the location of functionally important residues in highly constrained regions of the protein. Importantly, insect OR models exhibit a distinct transmembrane domain packing arrangement to that of canonical GPCRs, establishing the structural unrelatedness of these receptor families. The evolutionary couplings and models predict odour binding and ion conduction domains, and provide a template for rational structure-activity dissection.

Links

View on Journal Website
PDF
DOI
PubMed

Thomas A Hopf ^, Charlotta PI Schärfe, João PGLM Rodrigues, Anna G Green, Oliver Kohlbacher, Chris Sander, Alexandre MJJ Bonvin, Debora S Marks
^ Joint first authors
eLife 2014;3:e03430

Sequence co-evolution gives 3D contacts and structures of protein complexes

Abstract

Protein–protein interactions are fundamental to many biological processes. Experimental screens have identified tens of thousands of interactions, and structural biology has provided detailed functional insight for select 3D protein complexes. An alternative rich source of information about protein interactions is the evolutionary sequence record. Building on earlier work, we show that analysis of correlated evolutionary sequence changes across proteins identifies residues that are close in space with sufficient accuracy to determine the three-dimensional structure of the protein complexes. We evaluate prediction performance in blinded tests on 76 complexes of known 3D structure, predict protein–protein contacts in 32 complexes of unknown structure, and demonstrate how evolutionary couplings can be used to distinguish between interacting and non-interacting protein pairs in a large complex. With the current growth of sequences, we expect that the method can be generalized to genome-wide elucidation of protein–protein interaction networks and used for interaction predictions at residue resolution.

Links

View on Journal Website
PDF
DOI
bioRxiv
PubMed

2012

Debora S Marks, Thomas A Hopf, Chris Sander
Nature Biotechnology 30, pp. 1072-1080 (2012)

Protein structure prediction from sequence variation

Abstract

Genomic sequences contain rich evolutionary information about functional constraints on macromolecules such as proteins. This information can be efficiently mined to detect evolutionary couplings between residues in proteins and address the long-standing challenge to compute protein three-dimensional structures from amino acid sequences. Substantial progress has recently been made on this problem owing to the explosive growth in available sequences and the application of global statistical methods. In addition to three-dimensional structure, the improved understanding of covariation may help identify functional residues involved in ligand binding, protein-complex formation and conformational changes. We expect computation of covariation patterns to complement experimental structural biology in elucidating the full spectrum of protein structures, their functional interactions and evolutionary dynamics.

Links

View on Journal Website
PDF
DOI
PubMed

Thomas A Hopf, Lucy J Colwell, Robert Sheridan, Burkhard Rost, Chris Sander, Debora S Marks
Cell, Vol. 149, Issue 7, pp. 1707-1721

Three-dimensional structure of membrane proteins from genomic sequencing

Abstract

We show that amino acid covariation in proteins, extracted from the evolutionary sequence record, can be used to fold transmembrane proteins. We use this technique to predict previously unknown 3D structures for 11 transmembrane proteins (with up to 14 helices) from their sequences alone. The prediction method (EVfold_membrane) applies a maximum entropy approach to infer evolutionary covariation in pairs of sequence positions within a protein family and then generates all-atom models with the derived pairwise distance constraints. We benchmark the approach with blinded de novo computation of known transmembrane protein structures from 23 families, demonstrating unprecedented accuracy of the method for large transmembrane proteins. We show how the method can predict oligomerization, functional sites, and conformational changes in transmembrane proteins. With the rapid rise in large-scale sequencing, more accurate and more comprehensive information on evolutionary constraints can be decoded from genetic variation, greatly expanding the repertoire of transmembrane proteins amenable to modeling by this method.

Links

View on Journal Website
PDF
DOI
PubMed


2011 and earlier

Debora S Marks ^, Lucy J Colwell ^, Robert Sheridan, Thomas A Hopf, Andrea Pagnani, Riccardo Zecchina, Chris Sander
^ Joint first authors
PLoS One 2011;6(12):e28766. Epub 2011 Dec 7

Protein 3D structure computed from evolutionary sequence variation

Abstract

The evolutionary trajectory of a protein through sequence space is constrained by its function. Collections of sequence homologs record the outcomes of millions of evolutionary experiments in which the protein evolves according to these constraints. Deciphering the evolutionary record held in these sequences and exploiting it for predictive and engineering purposes presents a formidable challenge. The potential benefit of solving this challenge is amplified by the advent of inexpensive high-throughput genomic sequencing.

Abstract

MicroRNAs (miRNAs) interact with target mRNAs at specific sites to induce cleavage of the message or inhibit translation. The specific function of most mammalian miRNAs is unknown. We have predicted target sites on the 3′ untranslated regions of human gene transcripts for all currently known 218 mammalian miRNAs to facilitate focused experiments. We report about 2,000 human genes with miRNA target sites conserved in mammals and about 250 human genes conserved as targets between mammals and fish. The prediction algorithm optimizes sequence complementarity using position-specific rules and relies on strict requirements of interspecies conservation. Experimental support for the validity of the method comes from known targets and from strong enrichment of predicted targets in mRNAs associated with the fragile X mental retardation protein in mammals. This is consistent with the hypothesis that miRNAs act as sequence-specific adaptors in the interaction of ribonuclear particles with translationally regulated messages. Overrepresented groups of targets include mRNAs coding for transcription factors, components of the miRNA machinery, and other proteins involved in translational regulation, as well as components of the ubiquitin machinery, representing novel feedback loops in gene regulation. Detailed information about target genes, target processes, and open-source software for target prediction (miRanda) is available at http://www.microrna.org. Our analysis suggests that miRNA genes, which are about 1% of all human genes, regulate protein production for 10% or more of all human genes.

Links

View on Journal Website
PDF
DOI
PubMed