# FUNCTION AND FLEXIBILITY: FRIEND OR FOE?

EDITED BY: Kris Pauwels and Peter Tompa PUBLISHED IN: Frontiers in Molecular Biosciences

#### *Frontiers Copyright Statement*

*© Copyright 2007-2016 Frontiers Media SA. All rights reserved. All content included on this site, such as text, graphics, logos, button icons, images, video/audio clips, downloads, data compilations and software, is the property of or is licensed to Frontiers Media SA ("Frontiers") or its licensees and/or subcontractors. The copyright in the text of individual articles is the property of their respective authors, subject to a license granted to Frontiers.*

*The compilation of articles constituting this e-book, wherever published, as well as the compilation of all other content on this site, is the exclusive property of Frontiers. For the conditions for downloading and copying of e-books from Frontiers' website, please see the Terms for Website Use. If purchasing Frontiers e-books from other websites or sources, the conditions of the website concerned apply.*

*Images and graphics not forming part of user-contributed materials may not be downloaded or copied without permission.*

*Individual articles may be downloaded and reproduced in accordance with the principles of the CC-BY licence subject to any copyright or other notices. They may not be re-sold as an e-book.*

*As author or other contributor you grant a CC-BY licence to others to reproduce your articles, including any graphics and third-party materials supplied by you, in accordance with the Conditions for Website Use and subject to any copyright notices which you include in connection with your articles and materials.*

> *All copyright, and all rights therein, are protected by national and international copyright laws.*

*The above represents a summary only. For the full conditions see the Conditions for Authors and the Conditions for Website Use.*

ISSN 1664-8714 ISBN 978-2-88919-972-3 DOI 10.3389/978-2-88919-972-3

#### About Frontiers

Frontiers is more than just an open-access publisher of scholarly articles: it is a pioneering approach to the world of academia, radically improving the way scholarly research is managed. The grand vision of Frontiers is a world where all people have an equal opportunity to seek, share and generate knowledge. Frontiers provides immediate and permanent online open access to all its publications, but this alone is not enough to realize our grand goals.

#### Frontiers Journal Series

The Frontiers Journal Series is a multi-tier and interdisciplinary set of open-access, online journals, promising a paradigm shift from the current review, selection and dissemination processes in academic publishing. All Frontiers journals are driven by researchers for researchers; therefore, they constitute a service to the scholarly community. At the same time, the Frontiers Journal Series operates on a revolutionary invention, the tiered publishing system, initially addressing specific communities of scholars, and gradually climbing up to broader public understanding, thus serving the interests of the lay society, too.

#### Dedication to Quality

Each Frontiers article is a landmark of the highest quality, thanks to genuinely collaborative interactions between authors and review editors, who include some of the world's best academicians. Research must be certified by peers before entering a stream of knowledge that may eventually reach the public - and shape society; therefore, Frontiers only applies the most rigorous and unbiased reviews.

Frontiers revolutionizes research publishing by freely delivering the most outstanding research, evaluated with no bias from both the academic and social point of view. By applying the most advanced information technologies, Frontiers is catapulting scholarly publishing into a new generation.

## What are Frontiers Research Topics?

Frontiers Research Topics are very popular trademarks of the Frontiers Journals Series: they are collections of at least ten articles, all centered on a particular subject. With their unique mix of varied contributions from Original Research to Review Articles, Frontiers Research Topics unify the most influential researchers, the latest key findings and historical advances in a hot research area! Find out more on how to host your own Frontiers Research Topic or contribute to one as an author by contacting the Frontiers Editorial Office: researchtopics@frontiersin.org

# **FUNCTION AND FLEXIBILITY: FRIEND OR FOE?**

Topic Editors:

**Kris Pauwels,** Vlaams Instituut voor Biotechnologie & Vrije Universiteit Brussel, Belgium **Peter Tompa,** Vlaams Instituut voor Biotechnologie & Vrije Universiteit Brussel, Belgium; Research Centre for Natural Sciences of the Hungarian Academy of Sciences, Hungary

Protein flexibility can be represented by a conformational ensemble. Sixteen different conformations of phosphorylated Sic1 are represented, which were generated based on NMR and SAXS data (entry 1AAA in the Protein Ensemble Database). Image by Luis Ferrer Campins, Sara Contreras Martos and Kris Pauwels.

Louis Sullivan (1856 - 1924) revolutionized architecture by designing the first skyscraper and he became famous by proclaiming that "form follows function". When x-ray crystallographers visualized the structures of proteins for the first time, the structural biology field embraced the view that "function follows form" as the 3D-architecture of proteins could unveil various aspects of their function. Despite the original "1 gene - 1 protein structure - 1 function" relationship, nowadays a far more complicated picture emerges where the flexibility and dynamics of a protein can play a central role in a multitude of functions. The ultimate form(s) that a protein adopt when interacting with (a) partner molecule(s) are the most biologically relevant and in this context Sullivan's quote is still appropriate: the conformation that the protein adopts follows from the function of that protein.

Despite the fact that many well-characterized proteins have a well-folded structure, there is a growing interest in the conformational flexibility within proteins. This flexibility is also a balanced phenomenon: excess of flexibility can be detrimental for protein behaviour, as well as the lack thereof. Notwithstanding its importance, studying intrinsically disordered protein regions or conformational rearrangements can be a very challenging. Therefore, flexibility can be perceived as a friend or a foe, depending on the context.

This e-book showcases the impact of the study of protein flexibility on the structural biology field and presents protein flexibility in the context of disease as well as its benign aspects. As detailed knowledge of the structural aspects of polypeptides remains essential to comprehend protein function, one of the future challenges for structural biology also lies with large macromolecular protein complexes. Also there the dynamics and flexibility are essential for proper functioning and molecular movement, which is an important aspect of living matter. This challenge stimulated the development of advanced techniques to study protein flexibility and the use of those techniques to address fundamental biological and biomedical problems. Those innovations should help us to unravel the intimate link between protein function and flexibility and explore new horizons.

**Citation:** Pauwels, K., Tompa, P., eds. (2016). Function and Flexibility: Friend or Foe? Lausanne: Frontiers Media. doi: 10.3389/978-2-88919-972-3

# Table of Contents


# Editorial: Function and Flexibility: Friend or Foe?

#### Kris Pauwels 1, 2 \* and Peter Tompa1, 2, 3

*<sup>1</sup> VIB Structural Biology Research Center, Vlaams Instituut voor Biotechnologie, Brussels, Belgium, <sup>2</sup> Structural Biology Brussels, Vrije Universiteit Brussel, Brussels, Belgium, <sup>3</sup> Research Centre for Natural Sciences of the Hungarian Academy of Sciences, Institute of Enzymology, Budapest, Hungary*

Keywords: intrinsically disordered proteins, protein dynamics, protein flexibility, conformational ensemble, protein function

**The Editorial on the Research Topic**

#### **Function and Flexibility: Friend or Foe?**

Protein structural biology aims to link snapshots of three-dimensional macromolecular structures to their biological function. The high-resolution information that is obtained traditionally by x-ray crystallography or nuclear magnetic resonance (NMR) experiments is instrumental for understanding their functional properties, their biological roles, and their potential roles in diseases ("function follows form"). Yet, proteins are not rigid and/or static entities: their dynamics and flexibility are essential for proper functioning and molecular movement, which is an important aspect of living matter. Many proteins even completely lack a well-defined 3D-structure under physiological conditions, the so-called intrinsically disordered proteins (IDPs). Up to 35% of human proteins are predicted to possess intrinsically disordered regions (IDRs) of at least 30 consecutive disordered residues, that play important roles in cell signaling and regulation (Guharoy et al., 2015) ("flexibility facilitates function"). Therefore, this research topic covers the impact of the study of protein flexibility on the structural biology field.

The articles in this e-book feature plenty of examples where protein flexibility controls protein functionality. In their fascinating Perspective, Kern and colleagues provide an excellent overview of our actual mechanistic insights of how the anticancer drug Gleevec selectively inhibits the Abl kinase (Agafonov et al.). Their work showcases how rigorous kinetic and structural analysis yields definitive conclusions that selectivity is a function of a conformational change after binding (induced-fit) and the resulting slow dissociation rate of Gleevec from the Abl kinase, whereby the flexibility in the famous and highly conserved DFG-loop plays an important role (Agafonov et al., 2014; Wilson et al., 2015). By reconstructing the evolution of the energy landscape of kinases through the synergy of "old-fashioned" stopped-flow kinetics and "modern" ancestral sequence reconstruction, they advocate for the combined use of experimental studies and molecular dynamics approaches to find effective and selective kinase inhibitors.

The benign role of protein flexibility is also nicely illustrated in the review by Gontero and colleagues who demonstrate the central and multiple functionality of C- and N-terminal intrinsically disordered tails of globular proteins in photosynthetic organisms (Thieulin-Pardo et al.). They exemplify that protein flexibility at the N- and C-terminal extremities accounts for an increased number of binding partners and how new roles may emerge by the evolutionary addition of an intrinsically disordered extension. Indeed, often IDRs play a role in molecular recognition and binding events, whereby they can undergo a folding transition induced by the partner protein ("form follows function"). By a large scale thermodynamic assessment of mostly binary protein-protein interactions of ordered-ordered and ordered-disordered protein complexes, Kragelund and colleagues help shedding light on the debated role of kinetics and thermodynamics

Edited and reviewed by: *Annalisa Pastore, King's College London, UK*

> \*Correspondence: *Kris Pauwels krpauwel@vub.ac.be*

#### Specialty section:

*This article was submitted to Structural Biology, a section of the journal Frontiers in Molecular Biosciences*

> Received: *08 June 2016* Accepted: *23 June 2016* Published: *07 July 2016*

#### Citation:

*Pauwels K and Tompa P (2016) Editorial: Function and Flexibility: Friend or Foe? Front. Mol. Biosci. 3:31. doi: 10.3389/fmolb.2016.00031* in the binding properties of IDPs (Teilum et al.). Through this capacity for interaction with other molecules, protein flexibility can also be linked to disease (Hubin et al., 2014; Uversky, 2014; Guharoy et al., 2015). Fraternalli and colleagues studied the localization of common and disease-related mutations within (dis)ordered protein regions (Lu et al.). They highlight that intra-domain ordered and intra-domain disordered regions show high propensity for disease-related mutations, while interdomain disordered regions are enriched in common variants. Their analysis offers interesting perspectives for the further development of the field of protein flexibility and disorder. It also supports the fact that, in the field of IDPs, computational approaches play a major role. As such, Craveur et al. show that the concept of structural alphabets is suitable to analyze the dynamics and flexibility of proteins. In their comprehensive review they advocate that structural alphabets are required to begin to understand the complexity of protein flexibility by discriminating flexibility from mobility and deformability.

The IDP field is also one of the few areas in structural and molecular biology where the experiments provide support to computations to achieve an accurate understanding of the conformational properties of these complex proteins. Varadi et al. review the current characterizations of IDPs by combining computations and experiments. The mini-review identifies key developments in the field, including the employment of experimental data into structural refinement in search of the functional repertoire of IDPs. With regard to wet-lab experimental approaches, several emerging techniques allow to overcome some of the technical problems of studying IDPs and to obtain essential information on protein dynamics. In their original research paper, Barran and collaborators exemplify the potential of ion-mobility mass spectrometry to track conformational changes in unstructured proteins on a millisecond timescale (Dickinson et al.). They characterize the effect of two small molecule compounds RITA and nutlin-3 on their IDP targets with a multi-technique approach. The minireview by Belle and coworkers showcases the power of site-directed spin labeling with electron paramagnetic resonance

#### REFERENCES


to investigate flexible regions and fuzziness in proteins (Le Breton et al.). The information obtained by NMR can generate conformational ensembles that visualize the conformations that IDPs sample under functional conditions. Because protein disorder can be evaluated at the residue level with NMR, Nielsen and Mulder compiled a small database of disorder-containing proteins using experimental NMR chemical shift data in their original research paper that is felicitously entitled "There is Diversity in Disorder – 'In all Chaos there is a Cosmos, in all Disorder a Secret Order"'. They demonstrate that those proteins span the full spectrum of disorder, yet segregate into two classes: proteins mostly disordered but with small segments of order scattered along the sequence, or structured proteins with small segments of disorder inserted between the different structured regions. This study is also illustrative for the concept of "form and function follow (NMR) frequency."

Recently the D<sup>3</sup> -concept was introduced for IDPs by revealing the interconnections between protein intrinsic Disorder and Degenerative Diseases (Uversky, 2014). In analogy, it is opportune to introduce the F<sup>3</sup> -concept for flexible proteins, since "Function Follows Flexibility." Whereas in the past intrinsic disorder could cause frustration because IDRs were considered frivolous and flamboyant, their flirtatious behavior flaunted formidable features. We hope this e-book can stimulate the research community to finally stop fumbling for the fugacious forms of flexible proteins and bring their functional framing to fruition.

### AUTHOR CONTRIBUTIONS

Both authors made substantial, direct and intellectual contribution to the work, and approved it for publication.

## ACKNOWLEDGMENTS

PT is supported by the Odysseus grant G.0029.12 from Research Foundation Flanders (FWO) and KP is the recipient of a FWO postdoctoral fellowship (#1218713).

Wilson, C., Agafonov, R. V., Hoemberger, M., Kutter, S., Zorba, A., Halpin, J., et al. (2015). Using ancient protein kinases to unravel a modern cancer drug's mechanism. Science 347, 882–886. doi: 10.1126/science.aaa1823

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Pauwels and Tompa. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Evolution and intelligent design in drug development

#### Roman V. Agafonov, Christopher Wilson and Dorothee Kern\*

*Howard Hughes Medical Institute and Department of Biochemistry, Brandeis University, Waltham, MA, USA*

Sophisticated protein kinase networks, empowering complexity in higher organisms, are also drivers of devastating diseases such as cancer. Accordingly, these enzymes have become major drug targets of the twenty-first century. However, the holy grail of designing specific kinase inhibitors aimed at specific cancers has not been found. Can new approaches in cancer drug design help win the battle with this multi-faced and quickly evolving enemy? In this perspective we discuss new strategies and ideas that were born out of a recent breakthrough in understanding the molecular basis underlying the clinical success of the cancer drug Gleevec. An "old" method, stopped-flow kinetics, combined with old enzymes, the ancestors dating back up to about billion years, provides an unexpected outlook for future intelligent design of drugs.

#### Edited by:

*Peter Tompa, Flanders Institute of Biotechnology (VIB), Belgium*

#### Reviewed by:

*Doriano Lamba, Consiglio Nazionale delle Ricerche, Italy Miquel Adrover, Universitat de les Illes Balears, Spain*

#### \*Correspondence:

*Dorothee Kern, Howard Hughes Medical Institute and Department of Biochemistry, Brandeis University, Waltham, MA 02452, USA dkern@brandeis.edu*

#### Specialty section:

*This article was submitted to Structural Biology, a section of the journal Frontiers in Molecular Biosciences*

> Received: *24 March 2015* Accepted: *08 May 2015* Published: *21 May 2015*

#### Citation:

*Agafonov RV, Wilson C and Kern D (2015) Evolution and intelligent design in drug development. Front. Mol. Biosci. 2:27. doi: 10.3389/fmolb.2015.00027* Keywords: drug design, evolution, cancer drugs, protein kinases, conformational selection and induced fit, Gleevec

# The Beauty and Curse of Protein Kinases

Why are we more sophisticated than a yeast cell? One of the reasons is protein kinases, that exploded both in numbers (more than 500 in humans compared to 130 in yeast) and sophistication with the development of multicellularity (Richter and King, 2013). The evolution of specialized kinases enabled complex regulatory networks in higher organisms thereby providing a huge evolutionary edge. However, a crack in this machinery as little as a single point mutation in a kinase can cause cancer—an Achilles heel that has elevated protein kinases into the number one drug target of the twenty-first century (Cohen, 2002; Cohen and Alessi, 2013; Wang et al., 2014). The stringent requirements for catalyzing a chemical reaction that uncatalyzed would take about 7000 years (Stockbridge and Wolfenden, 2009; Kerns et al., 2015) resulted in a strong conservation of the active sites, which have thus been extensively targeted in cancer drug development. Unfortunately, inhibitors targeting the ATP binding site tend to be unselective due to this active site conservation, leading to unwanted side effects. The popularity of the field of protein kinase inhibition as well as alternative strategies such as inhibition of substrate binding and protein interaction sites is best reflected by a number of recent reviews (Wang et al., 2014 and a special issue in ACS Chemical Biology, 2015). In addition, new high-throughput assays are being constantly developed to facilitate screening of the compounds (Acker and Auld, 2014), however, the major goal of the pharmaceutical industry to develop specific kinase inhibitors remains a daunting challenge.

# The Wonder Drug of the Century

Gleevec is an exception, as it has great specificity for the onco-protein BCR-Abl (Capdeville et al., 2002; Cohen et al., 2002). The BCR-Abl fusion protein results from reciprocal translocation between chromosome 9 and chromosome 22, widely known as the Philadelphia translocation, leading to a constitutively active kinase (Rowley, 1973; Daley and Baltimore, 1988). Gleevec was approved by the FDA for clinical use in 2001, and has proven to be remarkably successful in treating chronic myeloid leukemia (CML) and gastrointestinal stromal tumors. Its success generated tremendous enthusiasm in the scientific community and even general public, after the reports about "new ammunition in the war against cancer" and its outstanding effectiveness were picked up by the media (Lemonick and Park, 2001; Newsweek, 2001; Wade, 2001). Gleevec was viewed as a "proof of principle drug," which showed the possibility of rational design of an inhibitor that would specifically target a kinase of interest. Unfortunately, tireless efforts aimed at understanding the molecular mechanisms of Gleevec's selectivity over the last 20+ years were mostly unsuccessful, and the original expectations of a steady stream of new therapeutics emerging from basic research turned out to be overoptimistic. As reviewed recently (Cohen and Alessi, 2013; Wang et al., 2014) since Gleevec's triumph, approximately 20 new kinase inhibitors were developed and entered clinical use. This is a rather small number considering that there are more than 500 human kinases and multiple inhibitors are needed for each of them to combat the inevitable mutations that lead to drug resistance. A fundamental pitfall in drug development is a lack of understanding of the detailed biophysical mechanisms that make inhibitors successful.

# Conformational Selection and the Famous "DFG-Loop"

In the search for the physical determinants of Gleevec selectivity, the DFG – loop (Asp-Phe-Gly), a 100% conserved element in the kinome (**Figure 1A**), stood out as a structural feature that differs between kinases that bind Gleevec tightly or weakly. In the xray structure of Abl, this loop adopts an "out" conformation in both the apo and Gleevec-bound protein, while in the closest homolog and weak binder Src kinase it occupies a bindingincompetent "in" conformation in the apo protein that would have to move into the "out" position to accommodate the drug (Xu et al., 1997; Schindler et al., 2000; Nagar et al., 2003; Seeliger et al., 2007). These structures, together with the fact that the active conformations look too similar to provide selectivity, shifted attention toward structural determinants of inactive conformations.

It was hypothesized that the preferential occupancy of the DFG-out state by Abl but not Src is the primary source of Gleevec selectivity. This model of an equilibrium between bindingincompetent (DFG-in) and competent state (DFG-out) (KCS) being the source for differential drug affinities is a classical conformational selection mechanism (Cowan-Jacob et al., 2005; Dar et al., 2008; Shan et al., 2009; Aleksandrov and Simonson, 2010; Lovera et al., 2012; Lin and Roux, 2013; Lin et al., 2014) that has recently gained popularity in biology (see the special issue of Biophysical Chemistry and references within) (Biophysical Chemistry, 2014) (Scheme in **Figure 1E**). This hypothesis was further substantiated by the observation that less selective inhibitors such as Dasatinib do not differentiate between "in" and "out" conformations of the DFG-loop.

The elegance of this hypothesis, the direct observation of two different states of the DFG-loop in crystal structures and the excellent fit to the "expected model" of drug selectivity resulted in a wealth of literature focusing on this aspect of protein dynamics. A variety of approaches, both experimental (Vogtherr et al., 2006; Vajpai et al., 2008) and computational, were taken to quantify the free energy profile of the DFG-loop dynamics (Levinson et al., 2006; Aleksandrov and Simonson, 2010; Lovera et al., 2012; Lin and Roux, 2013; Lin et al., 2014; Meng et al., 2015). However, experimental studies of DFG-loop equilibrium in solution were complicated by high dynamics of this loop hampering quantification of this equilibrium. Some computational reports seemed to quite impressively quantitatively recapitulate the experimentally observed Gleevec affinities for the different kinases (Lin and Roux, 2013; Lin et al., 2014) despite the widely acknowledged current computational limitations for accurate energy calculations (Shaw et al., 2010; Piana et al., 2011; Lindorff-Larsen et al., 2012). Other computational studies were contradictory, and results varied depending on the methodology used. Despite the lack of direct experimental observation of the DFG-loop equilibrium, the DFG-loop hypothesis underlying selectivity became so popular that all active site kinase inhibitors were classified as class I (binding to both DFG-in and -out conformations) and class II (binding exclusively to the DFG-out state).

Although large screens hinted at a trend that class-II inhibitors may be more selective, many counterexamples of selective type I and promiscuous type II inhibitors were observed (Davis et al., 2011; Treiber and Shah, 2013). These data suggested that the DFG-loop may not be as essential for selectivity as initially thought. Paradoxically, despite its logical appeal, the DFG-loop conformational selection model did not lead to new highly selective kinase drugs. What is missing?

# Old Fashioned?

A surprising breakthrough came from an unexpected direction. A new method of molecular time-travel back to the origin of these kinases and resurrection of their evolutionary trajectories into the modern kinases delivered the mechanism of Gleevec selectivity. Ironically, not only the resurrected enzymes that provided the understanding were old, so was the technique that yielded the answer. Stopped-flow kinetics, first described in the 1940s (Chance, 1940; Gibson et al., 1955) and often perceived as old-fashioned, has enormous potential when it comes to characterizing enzyme–drug interactions.

However the first hint for a new and unanticipated model came from following Gleevec binding to human Abl and Src by NMR, which revealed a slow conformational transition after drug binding that was different for the two kinases. Moreover, binding was sensed by residues far from the binding pocket indicating propagated conformational changes (Agafonov et al., 2014). Stopped-flow fluorescence experiments with modern Abl and Src (Agafonov et al., 2014) delivered quantification of the steps observed in the NMR experiments.

Contrary to the previously explored models, the dominant role in Gleevec's selectivity belongs to the conformational transitions in the kinase-drug complex (induced fit, **Figure 1E**), and not to the DFG-loop conformational selection or the physical binding step. These induced fit transitions are the slowest steps with the forward rate (kconf+) roughly 10 times faster in Abl compared to Src (**Figure 1C**). The rate of the reverse step, kconf−, measured by dilution experiments, is 70-fold slower in Abl (**Figure 1D**), leading to a 700-fold difference in the overall equilibrium (KIF) (**Figure 1E**). Because of simple principles of coupled equilibria, this 700-fold shift of the induced fit step equilibrium results in a 700-fold increase in the overall affinity for Gleevec, therefore accounting for most of the observed 3000 fold difference [the remaining four-fold difference comes from the DFG-loop conformational selection (see below)]. The actual binding step to the two kinases is nearly identical highlighting the limited usefulness of docking studies that play a prominent role in the current computational efforts in drug design. This "numbers-game" from the stopped-flow experiments delivered a new mechanism that quantitatively accounts for the long-known

difference in kinase affinities for Gleevec and hence answers the long-standing question of specificity (Agafonov et al., 2014).

Inspired by the new findings for Gleevec we advocate that the full energy profiles need to be considered, since the differences between kinases are rooted in the differences of the free energies of all states along the binding trajectory. The role of induced fit in substrate binding to enzymes for better substrate positioning for catalysis has been appreciated, however its experimental quantification is still not a commonly applied practice. Possible roles of induced fit for drug binding was also nicely discussed (Copeland, 2011), but its role in inhibitor affinity and selectivity remains undervalued. Notably, only the local rearrangements around the drug-binding pocket instead of long-range conformational transitions are often considered in rational drug design. Such long-range dynamics is, in fact, in play for the Gleevec specificity, as exposed by the ancestor resurrection.

# The Devil is in the [Atomistic] Details

While a physical chemist might be satisfied having figured out the kinetic scheme with hard numbers that rationalize the different drug affinities, the structural biologist will ask: which residues are responsible for the different energy landscapes? This might appear easy—just start mutating residues in the weak binding Src to mimic Abl. However, in spite of a large number of tested substitutions, such efforts were not successful indicating that the underling mechanism for Gleevec selectivity is more complex than anticipated (Seeliger et al., 2007). This approach although tempting has the following unavoidable drawbacks. Many differences accumulated during divergent evolution result from neutral drift (substitutions that are neutral for function and thus are not under selective pressure), and basically represent noise, from which one needs to fish out the sequence changes linked to the property of interest. To make the mater worse, some amino acid changes only come into play in the background of other mutations – a phenomenon called epistasis (Depristo et al., 2005; Harms and Thornton, 2013; Boucher et al., 2014). As a consequence, simple sequence swaps between two modern enzymes don't work because they miss the effect of the corresponding evolution of the amino acid background.

As illustrated in Wilson et al. (2015), ancestral sequence reconstruction (ASR) can be a powerful tool to overcome this challenge. ASR is a rapidly developing method that allows the inference of now nonexistent ancestral sequences using the growing amount of sequence information available. This approach was already formulated more than 50 years ago by Pauling and Zuckerkandl (1963). Modern enzymes (even the ones close in structure) still often differ from each other by 100+ residues. Such divergence in combination with neutral drift and epistasis makes it virtually impossible to rationally analyze the sequence differences. Ancestral reconstruction kills two birds with one stone. First, the sequence differences between two ancestors (or an ancestor and a modern protein) are smaller than those between the two modern enzymes, which makes a productive analysis of sequences more probable. Second, swaps between ancestor and its "grand-grand-children" can indeed shed light into atomistic mechanisms since epistasis is naturally accounted for.

In the work of Wilson et al. (2015) a phylogenetic tree of 76 modern kinases from different families and organisms of non-receptor tyrosine kinases was reconstructed, and protein sequences corresponding to key evolutionary branching points were resurrected (**Figure 2A**). Remarkably, all reconstructed ancient enzymes, differing by up to 100 amino acids from anything you can find today in nature, are fully active! The common ancestor of Src and Abl (called ANC-AS) had an intermediate affinity for Gleevec that increased along the evolutionary branch leading to Abl and decreased along the Src branch (**Figure 2B**).

Combining ancestral reconstruction with their Gleevec binding kinetics and structure illustrates the evolution of divergent energy landscapes (**Figure 2C**). Of interest to drug designers, it indeed delivered the atomistic mechanism responsible for Gleevec selectivity. Fifteen amino acid differences (out of 146) were identified to encode Gleevec specificity for Abl (**Figure 2D**) (Wilson et al., 2015). Their role in the induced fit step can now be rationalized structurally including stabilizing effects on drug–protein interaction and tuning differential flexibility via H-bonds remote from the drug-binding site (**Figure 2D**) (Wilson et al., 2015). So indeed long-range dynamics and epistasis are in play for Gleevec binding as first seen in the NMR studies (Agafonov et al., 2014) and hinted by the unsuccessful early swop approach (Seeliger et al., 2007).

Interestingly, the same residues correlated well with several resistant mutations found in patients who developed Gleevec resistance (Wilson et al., 2015). In other words, current evolution appears in these "dynamic hotspots," and the rationalization of the underlying atomistic mechanism for Gleevec resistance might help in designing drugs that overcome this detrimental evolution of cancer cells.

# New Tool in Biophysics—Ancestral Sequence Reconstruction (ASR)

The reader should wonder why an evolutionary approach is useful to solve a mechanism of a modern-day, man-made molecule? Obviously Abl did not evolve to bind Gleevec and be "strangled" by it! Rather, Gleevec accidently took advantage of differences in kinase regulation created by divergent evolution. While kinases are similar in their turnover rates upon activation, they vary drastically in their regulatory mechanisms. Such evolution of regulation became necessary with the developing of multicellularity and increasingly complex signaling cascades. Although in the case of Gleevec phylogenetic considerations were not part of the design, and overlap between Gleevec's selectivity and evolution of regulation was coincidental, we propose that targeting the unique energy landscapes underlying the regulatory features of a kinase of interest can be a powerful strategy for developing new selective inhibitors.

Evolution is rooted in the most fundamental process of random mutations, and driven by selection for better fitness. In light of this, the weird link between Gleevec selectivity and

evolution is actually not so far-fetched. Evolution as a result of chance shows itself in this story as a friend and foe: it led to the development of humans, but also to cancer and drug resistance. Using ASR to solve a modern cancer drugs mechanism is unorthodox, since until recently this method has been applied to recapitulate nature's paths to modern proteins with differential functions. Arguably the most famous ASR story has come from the Thornton lab in their successful inference of ancient corticoid receptors (Thornton et al., 2003; Ortlund et al., 2007; Bridgham et al., 2009). A story spanning over a half a dozen research papers not only shed light on the understanding of the different selectivity of modern steroid receptors for their corresponding

Frontiers in Molecular Biosciences | www.frontiersin.org May 2015 | Volume 2 | Article 27 |

hormones, but also answered some long standing questions in the field, including the role of epistasis in macromolecular evolution. Recently, ASR has leaped over to successfully recreate ancestral enzymes with reaction efficiencies near that of modern day enzymes (Perez-Jimenez et al., 2011; Hobbs et al., 2012; Ingles-Prieto et al., 2013; Risso et al., 2013; Boucher et al., 2014). Resurrection of enzymes has been an important step in validating the accuracy of ASR because of the need to maintain enzymatic activity, which is extremely sensitive to mutational change. These studies have largely focused on understanding changes in the enzyme's melting temperatures or underlying structural changes. The Gleevec story takes it to the next step to characterize the evolution of energy landscapes that ultimately underlies function.

## "Tell Us What the Future Holds, So We May Know That You Are Gods" (Isaiah 41, 23)

How can understanding of Gleevec selectivity and the differential evolution of kinases guide the creation of better cancer drugs? We are not god, we do not have the ultimate answer. However the door to intelligent design of successful cancer therapeutics may have opened a little wider with recent advances in genome information including ASR and personal genomic profiling, characterization of free energy landscapes of the drug binding process to targets, advances in medicinal chemistry and computation.

The history of Gleevec research teaches us a number of lessons: First, the correct microscopic binding model (meaning the correct scheme), ideally with quantification of each step, is crucial. Slow progress in understanding Gleevec's selectivity was in large extent due to the overwhelming attention to the DFG-loop conformational selection model (Cowan-Jacob et al., 2005; Dar et al., 2008; Shan et al., 2009; Aleksandrov and Simonson, 2010; Lovera et al., 2012; Lin and Roux, 2013; Lin et al., 2014). Second, the physical binding step that has been the major focus in docking simulations is only one piece of the puzzle, and conformational changes are crucially linked to both affinity and selectivity (**Figure 2C**). Therefore, experimental and computational efforts should be more centered on the dynamics of the target and drug/target complex. Third, the

## References


trivial (simple laws of thermodynamics) but at the same time profound recognition that conformational change after binding (an induced fit step) delivers two essential components of a good drug: increased affinity and long drug residence times on the target (**Figure 2C**). In addition, it can provide excellent specificity particularly when such conformational changes involve elements remote from the binding site as seen in Abl-Gleevec. In contrast, conformational selection (ability of the apo protein to sample multiple conformations) by definition weakens the overall drug affinity by the fraction of the protein in the binding-incompetent states. While such a step can offer drug specificity, the new results suggest that DFG-loop conformational selection seems to play only a minor role for kinase selectivity due to the fact that the DFG-loop readily interconverts between states. We propose that induced fit steps are in play in many successful drugs leading to very tight binding and long on-target residence times. Finally, molecular dynamics simulations will play an increasing role in rational drug design, but such simulations need to be based on the solid foundation of biochemical research. In the case of Gleevec and other kinase inhibitors, future computational emphasize should be centered on dynamics of the enzyme/drug complex characterizing the induced fit step and not on the DFGloop dynamics. Having the correct binding scheme established with corresponding structural information available, MD can sample the conformational space identifying new local minima and potentially cryptic or allosteric sites that are hard to trap experimentally if they are low-populated. If such states are unique for a particular kinase, they can be excellent targets for new specific inhibitors.

We are excited about the future prospect of a happy marriage between experiments and computation, and between basic academic research and pharmaceutical industry to tackle the very challenging but rewarding goal of designing perfect weapons against deadly diseases.

## Acknowledgments

This work was supported by the Howard Hughes Medical Institute, the Office of Basic Energy Sciences, Catalysis Science Program, U.S. Dept. of Energy, award DE-FG02-05ER15699, and NIH (GM100966-01) to DK.


modern cancer drug's mechanism. Science 347, 882–886. doi: 10.1126/science. aaa1823

Xu, W., Harrison, S. C., and Eck, M. J. (1997). Three-dimensional structure of the tyrosine kinase c-Src. Nature 385, 595–602. doi: 10.1038/ 385595a0

**Conflict of Interest Statement:** DK is the inventor on a patent applied for by Brandeis University that describes a biophysical platform for drug development based on energy landscapes. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2015 Agafonov, Wilson and Kern. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Fairy "tails": flexibility and function of intrinsically disordered extensions in the photosynthetic world

Gabriel Thieulin-Pardo, Luisana Avilan, Mila Kojadinovic and Brigitte Gontero\*

UMR 7281, Centre National de la Recherche Scientifique, Aix-Marseille Université, Marseille, France

Intrinsically Disordered Proteins (IDPs), or protein fragments also called Intrinsically Disordered Regions (IDRs), display high flexibility as the result of their amino acid composition. They can adopt multiple roles. In globular proteins, IDRs are usually found as loops and linkers between secondary structure elements. However, not all disordered fragments are loops: some proteins bear an intrinsically disordered extension at their C- or N-terminus, and this flexibility can affect the protein as a whole. In this review, we focus on the disordered N- and C-terminal extensions of globular proteins from photosynthetic organisms. Using the examples of the A2B2-GAPDH and the α Rubisco activase isoform, we show that intrinsically disordered extensions can help regulate their "host" protein in response to changes in light, thereby participating in photosynthesis regulation. As IDPs are famous for their large number of protein partners, we used the examples of the NAC, bZIP, TCP, and GRAS transcription factor families to illustrate the fact that intrinsically disordered extremities can allow a protein to have an increased number of partners, which directly affects its regulation. Finally, for proteins from the cryptochrome light receptor family, we describe how a new role for the photolyase proteins may emerge by the addition of an intrinsically disordered extension, while still allowing the protein to absorb blue light. This review has highlighted the diverse repercussions of the disordered extension on the regulation and function of their host protein and outlined possible future research avenues.

#### *Edited by:*

Peter Tompa, Flanders Institute of Biotechnology (VIB), Belgium

#### *Reviewed by:*

Anastassios Papageorgiou, University of Turku, Finland Esko Oksanen, European Spallation Source, Sweden

#### *\*Correspondence:*

Brigitte Gontero, Centre National de la Recherche Scientifique-BIP, 31 Chemin Joseph Aiguier, 13 402 Marseille, France bmeunier@imm.cnrs.fr

#### *Specialty section:*

This article was submitted to Structural Biology, a section of the journal Frontiers in Molecular Biosciences

> *Received:* 25 February 2015 *Accepted:* 04 May 2015 *Published:* 19 May 2015

#### *Citation:*

Thieulin-Pardo G, Avilan L, Kojadinovic M and Gontero B (2015) Fairy "tails": flexibility and function of intrinsically disordered extensions in the photosynthetic world. Front. Mol. Biosci. 2:23. doi: 10.3389/fmolb.2015.00023 Keywords: intrinsically disordered proteins, GAPDH, CP12, Rubisco activase, cryptochromes, transcription factors

# Introduction

Proteins occupy a central position in the architecture and functioning of living matter. A major objective of protein biochemistry consists in explaining the physiological functions of these molecules by means of structural studies, also known as the "structure-function" relationship. Among others, X-ray crystallography is a powerful tool to solve macromolecular three-dimensional 3D structures. However, some proteins cannot be crystallized because they are fully disordered or possess disordered parts that are missing in the electron density map of the crystals. In the 1990s, Sedzik and Kirschner (1992) attempted to crystallize the myelin basic protein (MBP), the predominant extrinsic protein in both central and nervous system myelins. After several attempts, the authors concluded that MBP adopts a random coil conformation and that as long as its flexibility was not suppressed, it was not possible to obtain crystals (Sedzik and Kirschner, 1992). MBP was one of the first examples of many other "un-crystallizable" proteins. These proteins, originally named Intrinsically Unstructured Proteins (IUPs), are nowadays termed Intrinsically Disordered Proteins (IDPs) (Wright and Dyson, 1999; Dunker et al., 2001, 2008a,b). In 1998, Romero et al. showed that 15 000 proteins from the Swiss-Prot database contain one or more Intrinsically disordered regions (IDRs) comprising more than 40 amino acid residues (Romero et al., 1998). It was shown later that despite their lack of well-defined 3D structure, many partially or completely disordered proteins are functional (Wright and Dyson, 1999; Dunker et al., 2001, 2008a,b; Tompa, 2002). In the late 1990s, studies of disordered yet functional proteins emerged as a new research field, extending the traditional paradigm to include a more comprehensive view of protein structure-function (Wright and Dyson, 1999; Dunker et al., 2001; Tompa, 2002; Dunker et al., 2008a,b). In the past, different models have been proposed to explain protein functioning, and protein flexibility has appeared as a key point (Fersht, 1998). Among these models, the "inducedfit" model (Koshland et al., 1966) introduced the idea that protein conformational changes could be triggered upon ligand binding. These notions were applied to IDPs, and many of them were shown to undergo an "induced-folding" upon binding to their partners (Dunker et al., 2002). Short motifs called MoREs (Molecular Recognition Elements) are often involved in the interaction, involving disorder to order transitions (Fuxreiter et al., 2004, 2007; Oldfield et al., 2005; Mohan et al., 2006; Vacic et al., 2007; Hazy and Tompa, 2009). However, the idea that preformed binding elements exist before the binding, and even in the absence of a partner, led to the "conformational selection" model. In some cases, the IDP is not fully structured in the presence of its partner and the term "fuzziness" was coined by Fuxreiter and Tompa to describe such complexes (Tompa and Fuxreiter, 2008; Hazy and Tompa, 2009; Fuxreiter and Tompa, 2012). The flexibility of IDPs increases the chance of their polypeptide chains adopting the right conformations in the presence of their partners. Furthermore, the high ratio of hydrophilic residues in IDPs facilitates initial contacts with their partners. The interactions are also stronger with IDPs: their lack of structure or absence of rigidity increasing association constants (Dunker et al., 2001; Meszaros et al., 2007; Chouard, 2011). The ability of the IDPs to adopt multiple conformations allows the same region to adapt to different binding sites in many "induced-fits" and thus to have multiple partners (Uversky et al., 2000; Tompa, 2002; Uversky, 2002, 2011; Meszaros et al., 2007; Carmo-Silva and Salvucci, 2013). The discovery of IDPs and their singular lack of definite structure brought nuances to the "structure-function" dogma, showing that the same structure, or lack of one, could have multiple partners and thus multiple functions (Wright and Dyson, 1999; Dunker et al., 2001; Tompa, 2002; Uversky, 2002, 2011; Meszaros et al., 2007; Sun et al., 2013). In this regard, IDPs could be seen as the "master keys" of the protein-protein interaction network.

The ability of IDPs to bind to multiple partners makes them naturally good regulators, as they can modulate the activity of several proteins in a coordinated way (Dunker et al., 2001, 2002, 2005; Gavin et al., 2002, 2006; Haynes et al., 2006; Patil and Nakamura, 2006; Mittag et al., 2010; Uversky, 2010; Pancsa and Tompa, 2012). Therefore, multiple areas of the cellular response can be affected by a single signal allowing IDPs to play a major role in regulatory pathways (Tompa, 2002; Dunker et al., 2005; Haynes et al., 2006; Uversky, 2010; Pancsa and Tompa, 2012). The flexibility of IDPs can also be modified by the cellular environment or by posttranslational modifications. IDPs and IDRs are often targets of different post-translational modifications (the most common being phosphorylation, methylation and ubiquitination) which can radically affect their affinity for their partners and their stability, thus multiplying the possibilities for a fine-tuned regulation (Tompa, 2002; Dunker et al., 2005; Haynes et al., 2006; Uversky, 2010; Pancsa and Tompa, 2012). These particularities make IDPs the hubs in a vast net of protein-protein interactions (Gavin et al., 2002, 2006). They carry out basic functions such as regulation of metabolic pathways, transcription, translation or cellular signal transduction; they can act as scavengers of toxic molecules and they play a key role in the assembly of multiprotein complexes (Uversky, 2011). Moreover, their roles in several diseases of major medical interest, such as cancer (Castillo et al., 2014; Saha et al., 2014; Xue et al., 2014), Alzheimer's disease (Uversky, 2009; Kovacech and Novak, 2010; Salminen et al., 2011; Karagoz and Rudiger, 2015) prion disease (Tompa, 2009; Uversky, 2009; Breydo and Uversky, 2011) or Parkinson's disease (Uversky, 2009; Hazy et al., 2011; Breydo et al., 2012; Alderson and Markley, 2013) have been extensively studied (Babu et al., 2011).

While the discovery and characterization of IDPs and IDRs is a rapidly growing, and an increasingly recognized, area of protein science, (Tompa, 2002; Uversky, 2010; Uversky and Dunker, 2010; Chouard, 2011), little information is available photosynthetic organisms, where IDPs have been described as central players in many responses such as biotic and abiotic stress, development, metabolism regulation, or adaptation to oxic atmosphere (Kragelund et al., 2012; Pancsa and Tompa, 2012; Yruela and Contreras-Moreira, 2012, 2013; Pietrosemoli et al., 2013; Sun et al., 2013; Panda and Ghosh, 2014). Published data mainly concern Arabidopsis thaliana, a higher plant model with one of the best-annotated sequenced genomes (Arabidopsis Genome Initiative, 2000). Yet, the recent analysis of 12 plant genomes revealed that the occurrence of disorder in plants is similar to the one found in other eukaryotes (Bracken et al., 1999; Yruela and Contreras-Moreira, 2012, 2013; Sun et al., 2013). An in silico analysis of plant nuclear proteomes suggested a higher disorder in the internal part of nuclear-encoded plant proteins rather than at their extremities, in contrast to the chloroplastand mitochondrion-encoded proteomes (Yruela and Contreras-Moreira, 2012). This is also pointed by studies on prokaryotes showing that the IDRs may be more frequent at the extremities of the proteins that act as "molecular shields" such as chaperones (Krisko et al., 2010; Chakrabortee et al., 2012).

In this review, we describe several globular proteins with N- or C-terminal IDR extensions in photsynthetic organisms, as opposed to entirely disordered proteins or globular proteins containing one or more IDRs in the middle of their sequences. The aim of this work is not to give an exhaustive list of the roles undertaken by such disordered extensions, as this has recently been reviewed (Uversky, 2013). Instead, we focus on globular proteins or domains that acquired their disordered tails during evolution, using examples from photosynthetic organisms. The addition of a disordered extension to a globular protein created new regulation opportunities, making these proteins responsive to environmental factors through selfregulation, post-translational modifications or new proteinprotein interactions. We illustrate the impact of disordered extensions by describing proteins involved in photosynthetic metabolism and regulation of gene expression (**Table 1**).

# GAPDH and CP12

As knowledge about proteins progressed, new results appeared showing that some enzymes were able to carry out more than


\*All the extensions present features of "disorder": enrichment in hydrophylic charged amino acids and few hydrophobic residues.

one function in a single polypeptide chain and were classified as multifunctional proteins (Kirschner and Bisswanger, 1976). In many cases, the dual function resulted from the fusion of two genes that initially encoded different proteins. Later on, the term "moonlighting" (Jeffery, 1999) categorized proteins that have different functions. The glyceraldehyde-3-phosphate dehydrogenase (GAPDH) is a well-known moonlighting enzyme and has at least ten distinct, confirmed non-enzymatic activities apart from its enzymatic function (Sirover, 1999, 2011; Hildebrandt et al., 2015). The GAPDHs constitute a large and diverse family of dehydrogenases universally represented in living organisms. They catalyze the reductive dephosphorylation of 1, 3-bisphosphoglyceric acid (BPGA) producing glyceraldehyde-3-phosphate (GAP) and inorganic phosphate using NAD(P)H as a cofactor (Trost et al., 2006). Glycolytic GAPDHs(also named GapC) are NAD-specific and mainly found in the cytosol, but in land plants a second type of glycolytic GAPDH named GapCp is targeted to the chloroplast(Petersen et al., 2003; Marri et al., 2005; Munoz-Bertomeu et al., 2010). Both GapC and GapCp are NAD-specific and form homotetramers in vivo that are not subject to complex regulatory mechanisms. However, in photosynthetic organisms, another GAPDH catalyzes the unique reductive step of the Benson-Calvin cycle is present and uses both NADH and NADPH with a marked preference for NADPH (Falini et al., 2003).

Like all GAPDHs, the NADPH-dependent GAPDH is made up of two functional domains, one corresponding to the catalytic domain (residues 148–313 in spinach GAPDH) and the other one being the cofactor-binding domain, or Rossman fold (residues 1–147 and 313–334, respectively). The latter contains a flexible, arginine-rich region called the S-loop (Fermani et al., 2001). In higher plants, this GAPDH exists in different forms such as a heterotetramer made up of two GapA and two GapB subunits (A2B2), a homotetramer made up of four GapA subunits, and as a hexadecamer (A8B8) (Baalmann et al., 1994, 1995; Scheibe et al., 2002; Howard et al., 2011a,b). The GapB subunit is similar to the GapA subunit but bears a Cterminal extension which has a regulatory function (Cerff, 1979; Brinkmann et al., 1989; Baalmann et al., 1996; Li and Anderson, 1997; Scagliarini et al., 1998; Sparla et al., 2002) (**Figure 1A**). This subunit is thought to derive from a gene duplication

homologous to the C-terminal of CP12 and present two regulatory cysteine residues. (B) Schematic representation of the autonomous redox regulation of the A2B2 -GAPDH. When oxidized, the C-terminal extension of the GapB subunit presents a disulfide bridge, which places the C-terminal amino

regulation of the A4 of the CP12 protein presents a disulfide bridge, which places its C-terminal amino acids inside the active site of GAPDH, resulting in its inhibition. The disulfide bridge can be reduced by the thoioredoxin f (TRX) or DTT and the enzyme becomes active.

Thieulin-Pardo et al. Photosynthetic fairy tails

event which occurred near the origin of Streptophyta (which include charophytes and land plants) (Petersen et al., 2006) and might be a construct of a GapA moiety fused at the Cterminus with the C-terminal half of the CP12, a 8.2–8.5 kDa Chloroplast Protein (Pohlmeyer et al., 1996). The portion of CP12 acquired by GapB subunits confers redox properties to GapB-containing GAPDH (Pupillo and Piccari, 1973; Pupillo and Giuliani Piccari, 1975; Wolosiuk and Buchanan, 1976, 1978; Trost and Pupillo, 1993; Baalmann et al., 1994, 1995; Sparla et al., 2002). Cyanobacteria, primitive photosynthetic eukaryotes (including the glaucophyte Cyanophora paradoxa), and red and green algae (except charophytes and sister lineages) seem to contain exclusively the GapA subunit (Petersen et al., 2006; Trost et al., 2006). However, the small prasinophyte green alga Ostreococcus tauri has been shown to possess GapA and GapB (Robbens et al., 2007).

CP12 is a protein of about 80 amino acid residues that was originally described by Pohlmeyer et al. (1996) and has been found in most photosynthetic organisms (Groben et al., 2010; Gontero and Maberly, 2012; Gontero and Salvucci, 2014; Lopez-Calcagno et al., 2014). The CP12 proteins show high primary sequence variability, in particular at the N-terminus. However, they share some remarkable common features.CP12 proteins have an amino acid composition poor in orderpromoting residues although they contain cysteine residues (Groben et al., 2010), and behave abnormally under gel electrophoresis and size exclusion chromatographies (Gontero and Maberly, 2012) suggesting that they are IDPs. Moreover, recent data from fluorescence correlation spectroscopy (FCS) show that the hydrodynamic radius of CP12 from the green alga Chlamydomonas reinhardtii is large compared to that expected for globular proteins of this molecular mass (Moparthi et al., 2014). The cysteine residues are involved in the formation of disulfide bridges and peptide loops and are found as pairs at the C-terminus and/or at the N-terminus. When oxidized, CP12 proteins may present α helices maintained by the N-terminal disulfide bridge (Graciet et al., 2003a; Gardebien et al., 2006). The algal CP12 is a key component of a supra-molecular complex controlling the activity of the Benson-Calvin cycle by regrouping several key enzymes of the cycle, including GAPDH, phosphoribulokinase (PRK) and fructose 1,6-bisphosphate aldolase (FBP aldolase). Within the ternary GAPDH/CP12/PRK complex, both enzymes are strongly inhibited (Avilan et al., 1997; Lebreton et al., 1997; Graciet et al., 2003a; Erales et al., 2008; Marri et al., 2008). CP12 forms a fuzzy complex with the green alga C. reinhardtii A4- GAPDH, as revealed by EPR studies (Mileo et al., 2013) (see the Minireview by Lebreton et al. in this Topic Research). The ternary complex has also been found in the cyanobacterium Synechococcus elongatus and in the higher plant A. thaliana. The A4-GAPDH-CP12 sub-complex from these organisms have been crystallized but in both complexes, the first 50 amino acid residues were not visible in the density map, consistent with a high flexibility of this region in the crystal (Avilan et al., 2000; Matsumura et al., 2011; Fermani et al., 2012). Recently, it was also observed using FCS and FRET (Förster Resonance Energy Transfer) that the algal CP12 flexibility is not abolished by its interactions with either GAPDH or with PRK (Moparthi et al., 2014, 2015).

In the case of the GapB subunit of GAPDH from higher plants, the C-terminal end of the protein (ca 30 residues)is strongly homologous to the C-terminal part of CP12 (Pohlmeyer et al., 1996; Trost et al., 2006; Groben et al., 2010) (**Figure 1A**).The regulation of the A. thaliana A2B2-GAPDH activity by the C-terminal extension of the GapB subunit is now very well understood: upon oxidation (which happens during the daynight transition) the two cysteine residues of the CP12-like tail form a disulfide bridge that places the C-terminal ultimate glutamate residue (E362) inside the active site (stabilized by the electrostatic interactions with an arginine residue R77 involved in the NADP cofactor binding). Consequently, the NADPH cofactor is not able to enter the catalytic site and thus NAPDHdependant A2B2-GAPDH activity is inhibited (Sparla et al., 2005; Fermani et al., 2007) (**Figure 1B**). In contrast, during the night-day transition, the disulfide bridge maintaining the Cterminal extension into the active site is reduced by thioredoxin f, thereby releasing the CP12-like tail and resulting in A2B2- GAPDH activity (**Figure 1B**) (Sparla et al., 2002; Trost et al., 2006; Fermani et al., 2007). This mechanism is very similar to the one observed in C. reinhardtii between the homotetrameric A4-GAPDH and free CP12, where the penultimate glutamate (E79) of the CP12 interacts with the arginine residue R82 of A4- GAPDH (**Figure 1C**) (Trost et al., 2006; Erales et al., 2011; Avilan et al., 2012). The reduction of the GAPDH-CP12 by dithiothreitol (DTT) in the alga results in a more active NADPH-GAPDH as a consequence of the rupture of disulfide bridges on CP12. Of interest, DTT, in vitro mimicks thioredoxins in vivo and it has been shown that CP12 can be reduced by thioredoxin f in the light (Marri et al., 2009).

In the higher plant, A. thaliana and in the green alga, C. reinhardtii, the stoichiometry of the oxidized A4-GAPDH-CP12 sub-complex is two CP12 molecules for one A4-GAPDH (Marri et al., 2008; Kaaki et al., 2013), while four CP12 molecules interact with each GAPDH tetramer in the cyanobacterium, S. elongatus (Matsumura et al., 2011). When interacting with CP12, A4- GAPDH activity decreased by two-fold (in the case of the C. reinhardtii proteins, the catalytic constant kcat of the free enzyme was 430 ± 17 s−<sup>1</sup> , and became 251 ± 9 s−<sup>1</sup> in the presence of CP12), suggesting that only two of the four active sites were blocked (Graciet et al., 2003b) (**Figure 1C**). The same observation was made for the A. thaliana A2B2-GAPDH: upon oxidation, the A2B2-GAPDH activity decreased by 2-fold (its kcat changed from 59 ± 19 s−<sup>1</sup> to 27 ± 10 s−<sup>1</sup> ) although its catalytic constant in a reduced state (kcat = 59 ± 19 s−<sup>1</sup> ) was comparable to the one of the free A4-GAPDH (kcat = 61 ± 4 s−<sup>1</sup> ) (Sparla et al., 2004). The regulation of the plant A2B2-GAPDH and of the algal A4- GAPDH-CP12 complex is thus very similar. With the addition of the C-terminal extension within the GapB subunit, the A2B2- GAPDH has become autonomously redox-regulated, a property that was previously provided through interaction with CP12.

Although the appearance of the GapB subunit represents an important step in the evolution of the redox control of the Calvin-Benson cycle enzymes, this new autonomous regulation co-exists with the CP12-based one in higher plants (Scheibe et al., 2002), and a A2B2-GAPDH-PRK complex entirely devoid of CP12 has yet to be identified. The presence of CP12 is likely to be required for the assembly of larger supramolecular complex, and in C. reinhardtii, A4-GAPDH-CP12-PRK was shown to interact with the aldolase (Erales et al., 2008). In this regard, one may wonder how this system will continue to evolve, and if more enzymes of the Benson-Calvin cycle will also acquire similar CP12-like disordered extensions, possibly meaning that the CP12 protein will become redundant. However, CP12 seems to be a part of numerous other processes in photosynthetic organisms (Singh et al., 2008; Howard et al., 2011a,c; Stanley et al., 2013), so it is unlikely to disappear completely from higher plants genomes in the future.

# Rubisco Activase

Ribulose-1,5-bisphosphate carboxylase/oxygenase (Rubisco) is the enzyme that catalyzes the formation of two molecules of phosphoglyceric acid using one molecule of ribulose 1,5 bisphosphate (RuBP) and one of carbon dioxide (CO2). As the primary CO<sup>2</sup> acceptor of most photoautotrophic organisms, Rubisco can represent up to half of soluble proteins in higher plants, and is believed to be the most abundant protein on Earth (Ellis, 1979; Losh et al., 2013; Raven, 2013). Rubisco from most photosynthetic organisms, including plants and cyanobacteria, is a very large protein (550 kDa), composed of large (L, 52 kDa) and small (S, 12 kDa) subunits arranged as a L8S<sup>8</sup> hexadecamer. For this enzyme to be active, a lysine residue inside the Rubisco active site (K201 in Nicotiana tobacum) must be carbamylated and bind a Mg2<sup>+</sup> ion (Lorimer et al., 1976; Andersson and Backlund, 2008). The addition of the non-catalytic CO<sup>2</sup> molecule to the active site is a spontaneous process, but the presence of RuBP or other sugar-phosphate at the active site decreases the carbamylation efficiency of Rubisco and thus its activity (Lorimer et al., 1976; Cleland et al., 1998). The Rubisco activases (RCAs) exhibit ATPase activity and were first characterized for their ability to promote the carbamylation of such RuBP-inhibited Rubisco (Portis et al., 1986). With time, it became clear that the RCAs allowed the CO<sup>2</sup> to enter the active site of Rubisco by removing the hindering RuBP or its analog, carboxyarabinitol bisphosphate (CABP) (**Figure 2A**) (Portis et al., 2008). The presence of RCAs allows Rubisco to function at its maximal

capacity in sub-optimal CO<sup>2</sup> concentration that would normally not permit carbamylation in vivo (Portis et al., 1986). In higher plants, RCAs, as expected, are mostly present in the parts of plants involved in photosynthesis (Watillon et al., 1993; Liu et al., 1996), and their expression follows a daily cycle that is regulated by external factors like light and temperature (Martino-Catt and Ort, 1992; Watillon et al., 1993; Liu et al., 1996; To et al., 1999).

In most organisms from the green lineage, two isoforms of RCA are found: an α isoform of 45–46 kDa and a β isoform of 41–43 kDa (Werneke et al., 1989; Rundle and Zielinski, 1991; To et al., 1999; Gontero and Salvucci, 2014). The only difference between the two RCA isoforms is the presence of a short Cterminal extension (ca 30 amino acid residues depending on the species) on the α isoform (**Figure 2B**). Both the α and β isoforms were found in Arabidopsis thaliana, spinach, and rice, although only one RCA gene is present (Werneke et al., 1988; To et al., 1999); in these species, the presence of the two RCA isoforms is the result of alternate splicing (Werneke et al., 1989; Rundle and Zielinski, 1991; To et al., 1999). On the other hand, other species like barley, cotton, maize and tobacco have multiple RCA genes (Rundle and Zielinski, 1991; Qian and Rodermel, 1993; Salvucci et al., 2003; Yin et al., 2010). In most cases, these organisms have separate genes coding for α and β RCAs without alternative splicing (Rundle and Zielinski, 1991), although all the genes identified in tobacco and cucumber appear to only encode the β isoform (Portis, 2003). To the best of our knowledge, the C-terminus of the α and β RCA isoforms were never tested for intrinsic disorder. Using several disorder predictors, including MeDor (Lieutaud et al., 2008) and MFDp2 (Mizianty et al., 2014), we were able to determine that the end of both the Cterminal part of the α and β RCAs (ca 50 residues for the α isoform and 20 residues for the β isoform) seem to be intrinsically disordered, including the entire C-terminal extension of the α RCA (**Figure 3**). The most remarkable features of this disordered tail are the two cysteine residues (C392 and C411 in the A. thaliana protein), that are highly conserved among the α RCA isoforms (Zhang and Portis, 1999).

The crystal structure of the tobacco β RCA was recently solved at 3 Å (Stotz et al., 2011), showing that RCA proteins are functional doughnut-shaped hexamers displaying an AAA<sup>+</sup> fold, as was predicted using other AAA<sup>+</sup> proteins (ATPases involved in a multitude of processes, Neuwald et al., 1999 as templates Portis, 2003). Interestingly, the last 23 residues of the protein were absent from the structure, indicating that this part of the molecule is very flexible and can adopt different conformations. Moreover, the substrate recognition site of the RCA from the creosote bush, Larrea tridentata, was solved at the atomic level (Henderson et al., 2011). Unfortunately, the structural studies were performed only on the β RCA isoform. As the α RCA core is identical, its structure should not be different from the β RCA, but the α RCA was shown to form functional αnβ<sup>n</sup> heteromers rather than α<sup>n</sup> homomers (Crafts-Brandner et al., 1997; Zhang et al., 2001). In the light of these new structural data, we can suppose that α RCA can form heterohexamers α3β<sup>3</sup> (**Figure 2C**). These structural data also show the presence of three loops containing hydrophilic amino acid residues lining the surface of a central pore. Sitedirected mutations introduced in this part of the proteins severely diminished the Rubisco activation and the ATP hydrolysis by the RCA proteins (Stotz et al., 2011), confirming that this region is implicated in the binding of ATP (Salvucci et al., 1993; Li et al., 2006). Based on this information, the model that has been proposed for RCA interaction with Rubisco includes one face of the flat hexamer interacting with the surface of the Rubisco, while an exposed loop of the Rubisco protein could fit into the central hole. Minor conformational changes of this Rubisco loop, allowed by the ATP hydrolysis, would then be transmitted to Rubisco allowing the inhibiting RuBP

secondary structure elements and the HCA plot, are shown above and below the amino acid sequence, respectively. Arrows below the HCA plot correspond to regions of predicted disorder (Lieutaud et al., 2008).

to be released and Rubisco to be carbamylated (Stotz et al., 2011).

The activity of the α and β RCAs is classically described to be dependent on the ATP:ADP ratio (Zhang and Portis, 1999; Carmo-Silva and Salvucci, 2013), and is extremely sensitive to high temperatures (Portis, 2003; Salvucci, 2004). Moreover, the activity of the α RCA is regulated by light (Mächler and Nösberger, 1980; Perchorowicz et al., 1981). This observation is linked to the action of thioredoxin f on the two cysteine residues present on its C-terminal extension (Shen and Ogren, 1992; Zhang and Portis, 1999; Zhang et al., 2001, 2002; Portis, 2003; Wang and Portis, 2006).A site-directed mutagenesis study (Shen and Ogren, 1992) showed that the substitution of only one of the two cysteine residues was enough to abolish the light regulation of α RCA, implicating the involvement of a disulfide bridge. Several studies showed that the mechanism of inhibition involves the blocking of the ATP-binding region by the C-terminal extension upon oxidation. This self-inhibition would be stabilized by strong electrostatic forces between the negatively-charged tail and the positively charged nucleotide site (Shen and Ogren, 1992; Zhang and Portis, 1999; Zhang et al., 2001, 2002; Wang and Portis, 2006; Carmo-Silva and Salvucci, 2013) (**Figure 2C**). It was also observed that the β RCA, although devoid of regulatory cysteine residues, could be lightregulated in the presence of the α isoform (Zhang and Portis, 1999; Zhang et al., 2001). In the hypothesis that RCAs form α3β<sup>3</sup> heterohexamers, we can assume that the combined bulk of C-terminal extensions efficiently inhibit the whole complex in dark conditions (**Figure 2C**). The RCA activity can be restored consequently by the reduction of the C-terminal disulfide bridge by the thioredoxin f, which occurs upon dark-light transitions (Carmo-Silva and Salvucci, 2013). In this case, the acquisition of a C-terminal tail, originally by alternate splicing, has allowed the RCA protein to fine-tune the activity of Rubisco in function of the light availability in addition to the energetic state of the cell.

In other photosynthetic organisms, the Rubisco activase system is different or works in a different way. In β-cyanobacteria, the RCA protein has the same main domains as plant RCAs, but lacks the N-terminal domain necessary for Rubisco activation found in plants and green algae (Van De Loo and Salvucci, 1996; Li et al., 1999; Stotz et al., 2011; Gontero and Salvucci, 2014; Mueller-Cajar et al., 2014). This could explain why no Rubisco activation has been observed using cyanobacterial RCAs (Li et al., 1999; Pearce, 2006). The latter also possess a very long (180 residues) intrinsically disordered C-terminal extension that seems to target the protein to the carboxysomes (Zarzycki et al., 2013) (**Figure 2B**). Organisms from the red lineage (αproteobacteria, rhodophyta, heterokontophyta, etc.) do not have exactly the same Rubisco as the green lineage, and the socalled "Red Rubisco" has a slightly longer large subunit. These organisms do not have RCA genes, but the same Rubisco activase function is carried out by another protein, CbbX (Pearce, 2006; Gontero and Salvucci, 2014) (**Figure 2B**). The crystal structure of CbbX has recently been solved, showing that this protein is organized in hexamers arranged in a very comparable manner to green RCAs (Mueller-Cajar et al., 2011). It was also suggested that CbbX mechanisms are based on the same principles as the one of RCA, with the C-terminus of the large Rubisco subunit inserted into the central hole of CbbX (Mueller-Cajar et al., 2011). It should be noted that CbbX seems to have an IDR at its C-terminus, but its implications in CbbX activity is yet to be studied.

Rubisco activase is not the only "friendly" protein involved in the regulation of Rubisco, since other proteins are needed during its assembly and folding, including the cpn60 chaperone, which also has a disordered C-terminal tail (Goloubinoff et al., 1989; Cloney et al., 1992; Libich et al., 2013).

# Three Transcription Factor Families (NAC, bZIP and TCP)

Disordered regions are ideal for proteins coordinating regulatory events and as such, transcription factors participating in regulation and signaling functions are enriched in IDRs.

The NAC family (named after No Apical Meristem, ATAF, Cup-Shaped Cotyledon) is one of the largest families of plantspecific transcription factors (Ooka et al., 2003; Olsen et al., 2005; Rushton et al., 2008; Sun et al., 2013). These family members are involved in a very large variety of processes, including plant development (Olsen et al., 2005), biotic and abiotic stress responses (Jensen et al., 2010b; Seo and Park, 2010; Seo et al., 2010) and leaf senescence (Kjaersgaard et al., 2011). The NAC transcription factors usually contain two domains: the N-terminal NAC domain and the C-terminal extremity domain (**Figure 4A**). The NAC domain is mainly conserved and well-ordered, displaying a typical structure comprising α helices flanking one β strand (Ernst et al., 2004). This domain binds the consensus DNA sequence CGT(GA) (Olsen et al., 2005). The C-terminal domain of the NAC proteins is highly variable within the family; however, some motifs in the Cterminus may display a sub-family-specific conservation (Jensen et al., 2010a). The C-terminal domains composition reveals a very high percentage of hydrophilic (Asp, Glu, Ser, Thr) and proline (Pro) residues, whereas the proportion of hydrophobic and aromatic residues is very low (Olsen et al., 2005; Jensen et al., 2010a). These specificities are typical of IDRs, and the C-terminal domain of some NAC proteins was experimentally characterized as an IDR (Jensen et al., 2010a,b). Despite this IDR feature, some hydrophobic and/or aromatic residues are present in this domain; interestingly, these amino acid residues are often conserved among a subfamily (Jensen et al., 2010a,b; Kjaersgaard et al., 2011). The IDR C-terminal domains of the NAC proteins are predicted to contain MoREs that are conserved in sub-families (Jensen et al., 2010a). It has been experimentally confirmed that these particular residues are very important to the specific function of each sub-group in the NAC family, and are essential to activation mechanisms often involving many different partners (Ooka et al., 2003; Ernst et al., 2004; Taoka et al., 2004; Olsen et al., 2005; Ko et al., 2007; Jensen et al., 2010a) (**Figure 4B**).

The bZIP (basic Leucine Zipper) transcription factors family is ubiquitous and is one of the largest families of transcription factors in eukaryotes. bZIP transcription factors take part

in a multitude of regulatory pathways such as development, metabolism, circadian rhythm and response to stress (Sun et al., 2013). The bZIP proteins are composed of two domains: a C-terminal bZIP domain and a N-terminal activation domain (**Figure 4A**). The C-terminal bZIP domain gives its name to the family and displays large patches of basic residues and leucine zipper motifs (Ellenberger et al., 1992; Vinson et al., 1993). The leucine zipper regions are organized in α helices and are responsible for the dimerization of the proteins through the formation of a coiled-coil structure (Vinson et al., 1993; Yoon et al., 2006), while the basic regions bind to the DNA molecule (Ellenberger et al., 1992). Interestingly, the basic regions have been described either as fully ordered, very flexible or intrinsically disordered depending on the protein (Bracken et al., 1999; Podust et al., 2001; Moreau et al., 2004; Yoon et al., 2006). When bound to DNA, the basic regions have however been observed as α helices, suggesting that the interaction triggers folding in response to a specific DNA motif (Hollenbeck et al., 2002), illustrating once more the disorder to order transition (inducedfit). The N-terminal regions of bZIP proteins act as regulators (Ang et al., 1998; Sun et al., 2013), and are mostly intrinsically disordered (Campbell et al., 2000; Moreau et al., 2004; Yoon et al., 2006; Sun et al., 2013). These regions typically contain different MoREs, and their flexibility allows the interaction with multiple partners, again by adopting different secondary structures (Ang et al., 1998; Campbell et al., 2000; Oldfield et al., 2005; Yoon et al., 2006) (**Figure 4B**). Through these activating or inhibiting interactions, transcription of the genes targeted by bZIP proteins is effectively modulated in response to several signals. The Nterminal disordered domain also modulates the activity of bZIP transcription factors through post-translational modifications, and phosphorylation in particular. In plants, bZIP transcription factors can be phosphorylated in response to illumination, which disrupts the interactions between the bZIP proteins and their activating partners (Ciceri et al., 1997; Hardtke et al., 2000). The phosphorylated proteins also have lower affinity for their DNA targets, resulting in a decrease of gene activation (Ciceri et al., 1997; Hardtke et al., 2000). Interestingly, some bZIP proteins also display IDRs in their C-terminal domain. In the case of bZIP28 (initially a transmembrane protein), these IDRs are exposed to the lumen of the endoplasmic reticulum and allow the interaction, through MOREs with BIP, the majority reticulum chaperone. In response to stress, bZIP28 is relocated to the Golgi and the cytoplasmic domain is detached, allowing it to enter the nucleus and to control gene expression (Srivastava et al., 2013, 2014).

A recent study on TCP8, a transcription factor belonging to the TCP [Teosinte branched 1 (tb1), Cycloidea (cyc) and Proliferating Cell Factor (PCF)] family, showed the presence of three IDRs, two of them at the N- and C-terminal extremities (Valsecchi et al., 2013). While the N-terminus binds DNA in an induced-fit mechanism, the C-terminal region is involved in the TCP protein self-association in a coiled-coil structure (Valsecchi et al., 2013). Furthermore, it seems that different transcription factors from the TCP family can interact, modulating the response of different pathways to multiple stimuli (Baier and Latzko, 1975; Viola et al., 2011, 2012; Steiner et al., 2012; Valsecchi et al., 2013).

As illustrated in these examples, the disordered tails of transcription factors have an essential role in modulating their activities through protein-protein interactions with a wide range of activators and inhibitors. Moreover, these extensions are often prone to phosphorylation and constitute another level of regulation. Together, these IDRs form a complex signaling web, turning the transcription factors into hubs and allowing the genes involved in adaptive responses to be finely regulated.

# GRAS Family

The GRAS family comprises proteins involved in numerous aspects of plant development and growth. This large family is named after its first members, Gibberellic Acid Insensitive (GAI), Repressor of Gai (RGA) and Scarecrow (SCR), and its members are mostly related to signaling in response to phytohormones [gibberellic acid (GA), auxin, brassinosteroids] and biotic and abiotic stress (Bolle, 2004; Sun et al., 2011). The GRAS family proteins are composed of one variable N-terminal region and a commonly conserved C-terminal GRAS domain (**Figure 4A**), and are divided into ten subfamilies based on phylogeny (Bolle, 2004; Tian et al., 2004; Lim et al., 2005; Sanchez et al., 2007; Sun et al., 2011). The conserved GRAS domain (ca 380 residues depending on the subfamilies) acts as a transcriptional coactivator (Heery et al., 1997) through leucine-rich motifs. GRAS domains typically contain two leucine-rich motifs, which are needed for specific protein-protein interactions(Cui et al., 2007; Vacic et al., 2007; Fode et al., 2008; Hirsch and Oldroyd, 2009; Hirsch et al., 2009; Hou et al., 2010). The GRAS proteins interact with a large number of nuclear proteins, most of which are transcription factors, thereby modulating their target activity (Hirsch and Oldroyd, 2009; Hirsch et al., 2009; Hou et al., 2010; Sun et al., 2012).

In contrast to the highly conserved GRAS domain, the Nterminal domains of the GRAS family proteins display a rich diversity at the sequence level, although the N-terminus is conserved within subfamilies (Sun et al., 2010, 2011, 2012, 2013). Moreover, these N-terminal domains have recently been identified as intrinsically disordered (Sun et al., 2010, 2011, 2012, 2013). Interestingly, patches of repeated hydrophobic and/or aromatic residues are found in the N-terminal region (Triezenberg, 1995; Sun et al., 2011). These patches are arranged in conserved motifs within subfamilies (Triezenberg, 1995; Sun et al., 2011), and are involved in specific multiple protein-protein interactions (Sun et al., 2010, 2012). In the case of the DELLA subfamily which has been intensively studied, the N-terminal domain can interact with the gibberellic acid receptor GIB1, but only when GIB1 has bound its ligand (Murase et al., 2008; Hirano et al., 2010; Sun et al., 2010, 2012). Moreover, each DELLA protein domain (N-terminal and C-terminal domains) can interact with several partners, making these proteins a hub at the center of the gibberellic acid response pathway. Other examples of GRAS proteins are important in other regulatory pathways, although subfamilies are always specialized in a precise type of stimulus (phytohormones, biotic and abiotic stress, etc. . . ) (Sun et al., 2010, 2011, 2012, 2013).

A common feature of the GRAS proteins is their ability to acquire a structure when bound to a partner, unlike the fuzzy GAPDH/CP12 complex (Mileo et al., 2013). As mentioned above, MoREs are present in GRAS proteins; each one was predicted to occur within the N-terminal domains, and more specifically in the elements conserved within subfamilies, strengthening the idea that these motifs are the key to the specificity of GRAS proteins (Sun et al., 2011, 2012). In the case of the DELLA subfamily, the presence of the MoREs has been verified experimentally (Sun et al., 2010, 2011, 2012). Interestingly, the N-terminal domain of the GRAS proteins is also the target of phosphorylation, which again introduces another way to fine-tune the regulation of these proteins (Fu et al., 2002; Iakoucheva et al., 2004; Hussain et al., 2007; Mittag et al., 2010). Phosphorylation of the N-terminal domain is directly linked to the activity of the GRAS proteins, modulating the affinity of the N-terminus for its partners, and having a direct effect on the GRAS proteins stability through the control of their degradation (Day et al., 2004; Hussain et al., 2005; Itoh et al., 2005; Czikkel and Maxwell, 2007).

When considering the GRAS family as a whole, it is remarkable how conserved the GRAS domains and patterns are, while the N-terminal domains are highly variable. It seems that the addition of a disordered protein segment to the GRAS domain has increased its number of partners, and thus turned it into a signal-integration hub involved in many different pathways. On the other hand, one could consider that the addition of GRAS domains to pre-existing IDPs involved in the phytohormonal and/or stress responses has allowed these IDPs to control, even more directly, the cellular responses by acting on gene expression.

# Cryptochrome

Cryptochromes are a group of proteins in which most members have an intrinsically disordered C-terminal tail that can have a profound impact on their overall function. Together with the photolyases, these proteins belong to the photolyase/cryptochrome family (Lin and Shalitin, 2003; Sancar, 2004; Chaves et al., 2006; Ozturk et al., 2007; Fortunato et al., 2015).

Photolyases are ancient enzymes that use blue light to catalyze the repair of DNA lesions caused by ultraviolet light. Lesions such as cyclobutane pyrimidine dimers (CPD) and pyrimidinepyrimidone photoproducts are repaired by photolyases CPD and by photolyases 6–4, respectively. Photolyase capacity to use blue light is due to the presence of two chromophores: a photoantenna pterin (5,10-methenyltetrahydrofolateor ahydroxy-5-deazaflavin) and flavin adenine dinucleotide (FAD). During the DNA repair, the two chromophores cofactors absorb blue photons and initiate splitting of the cyclobutane ring by a mechanism involving reactive radicals (Liu et al., 2011b).

Cryptochromes, the other group of proteins in the photolyase/cryptochrome family, have a photolyase homologous region (called PHR) and a C-terminal tail (**Figure 5A**) (Yu et al., 2010). Cryptochromes are able to absorb blue light in a very similar way to the photolyases. Another group within this family includes DASH-type cryptochromes named after the Drosophila, Arabidopsis, Synechocystis and Human. Members of this group are closer to photolyases than to cryptochromes, and are able to repair single-stranded DNA (Chaves et al., 2011) and may also have N-terminal and C-terminal disordered extensions.

In contrast to photolyases, cryptochromes do not have the ability to repair DNA. However, in many organisms, the absorption of photons by the chromophores in the photolyase homologous region of these proteins, induces conformational change (through electron transfer and subsequent phosphorylation), which in turn trigger specialized signaling events through protein-protein interactions (Liu et al., 2011b). It has been shown that the function of cryptochromes resides mainly within their C-terminal tails (Yang et al., 2000; Green, 2004; Chaves et al., 2006, 2011; Yu et al., 2010). Interestingly, this tail is poorly conserved among groups of organisms. In Arabidopsis, two cryptochromes are present, CRY1 and CRY2, that have different C-terminal extensions although a DAS motif is found in both (Lin and Shalitin, 2003). The length of the C-terminal tail in cryptochromes of animals, plants and some unicellular organisms varies from 30 to 250 residues and, as mentioned above, is intrinsically disordered. This characteristic has been established by sequence analysis, biochemical methods such as analysis of the sensitivity to protease cleavage, and physical methods such as circular dichroism and nuclear magnetic resonance (NMR)on recombinant C-terminal extensions of both Arabidopsis and human cryptochromes (Partch et al., 2005). Comparison of the proteolysis susceptibility between full-length cryptochromes and their C-terminal tail showed that this tail interacts with the photolyase domain, causing it to adopt a tertiary structure. The susceptibility to proteolysis of the C-terminal tail of the CRY1

FIGURE 5 | Models of the function of the C-terminal tail of cryptochromes. (A) Representation of cryptochrome (CRY) and photolyase. Cryptochromes have a photolyase-homologous region (PHR) and a C-terminal tail. The chromophore molecules of the PHR are shown. (B) Model of the action mechanism of cryptochromes from Arabidopsis. After absorption of light, the C-terminal tail is phosphorylated and a change in conformation is triggered in the entire molecule. The C-terminal tail is exposed at the surface of the protein and as a consequence interactions with partner proteins such as COP1 and SPA are induced (Liu et al., 2011a,b). (C) In darkness, the C-terminal tail of the cryptochrome from Drosophila inhibits the binding of the proteins involved in the circadian rhythm. After illumination, the inhibition by the tail is released and the PHR domain interacts through electrostatic interaction with the protein partners TIM and JET (Green, 2004; Czarna et al., 2013). (D) In mammals, cryptochrome is necessary for the translocation of the protein into the nucleus in which it is part of the core of the transcription/translation feedback that controls the circadian clock together with the proteins PER, BMAL, and CLOCK.

from A. thaliana increases after illumination, which is consistent with a conformational change (Partch et al., 2005). Indeed, the crystal structure of the complete cryptochrome from Drosophila confirmed that the C-terminal tail stays in a groove of the photolyase domain and mimics the recognition of photolyases with DNA (Zoltowski et al., 2011; Czarna et al., 2013).

In plants, cryptochromes play a role, together with other photoreceptors, in a variety of functions. In general, the cryptochromes of plants are involved in mechanisms that respond to blue light and their action has been explored in the inhibition of the elongation of hypocotyls, in the photoperiodic induction of flowering, in the circadian clock as in animals, and in other functions (Yu et al., 2010; Chaves et al., 2011; Liu et al., 2011b).These studies have been mainly performed in the model plant A. thaliana. Studies using transgenic plants overexpressing the C-terminal tail of CRY1 or CRY2, fused with β-glucuronidase (GUS) showed a constitutive morphogenic phenotype similar to that produced by blue light (Yang et al., 2000), indicating that, in the cryptochrome molecule, the C-terminal tail is responsible for the light-induced function. Moreover, NC80, an 80-residues segment present in the Arabidopsis protein, is responsible for the function of the C-terminal tail of CRY2 (Yu et al., 2007). The C-terminal tail of these proteins interacts with other proteins such as COP1 (constitutive photomorphogenic 1) (Wang et al., 2001; Yang et al., 2001), a multifunctional E3 ubiquitin ligase, and SPA1 (suppressor of phytochrome A 1) (Zuo et al., 2011; Liu et al., 2011a). This interaction is part of the initial steps for the light signaling and mechanisms to modulate the developmental process in the plant either by: (1) modulation of gene transcription or (2) suppression of proteolysis of regulators involved in development (i.e., flowering) (Liu et al., 2011a,b). Models have been proposed to explain the mode of action of plant cryptochromes (Lin and Shalitin, 2003; Partch et al., 2005; Yu et al., 2007; Liu et al., 2011a). In general, in these models, the photolyase domain and the C-terminal tail form a closed conformation in the dark. Upon illumination, an open and active conformation is adopted and, in this new conformation, the C-terminal tail is exposed allowing its interaction with other proteins to initiate signaling (**Figure 5B**). A model of action that includes dimerization and light dependent-phosphorylation that explains the exposure of the C-terminal tail as a result of charge repelling has also been proposed (**Figure 5B**) (Lin and Shalitin, 2003; Yu et al., 2007).

Although cryptochromes of plants are involved as photoreceptors in the circadian cycle, the molecular role of cryptochromes in relation to this cycle has been more elucidated in Drosophila. In this organism, the cryptochrome modulates the central oscillator, or clock, through the light-dependent interaction with the protein Timeless (TIM) (Busza et al., 2004), one of the components of the clock core. This interaction favors the degradation of both TIM and the cryptochrome itself, thus triggering the light/dark cycle each day by synchronization of the clock with the environment. The protein Jetlag (JET), an E3 ligase, also binds to the cryptochrome in a light-dependent manner and is responsible for the ubiquitination and subsequent proteolysis of both the cryptochrome and TIM (Peschel et al., 2009). In this case, and in contrast with the cryptochromes in Arabidopsis, the binding of the cryptochrome from Drosophila to its partners is performed by the photolyase domain of the protein (**Figure 5C**), whereas in the dark, the C-terminal tail inhibits this binding determining thus the photosensitivity of the circadian clock (Busza et al., 2004; Green, 2004).

In contrast to their homologs from plants and Drosophila, where the disordered C-terminal tail is used for light signaling, mammalian cryptochromes are light–independent transcriptional repressors in the core of the circadian clock (**Figure 5D**). Mammalian cryptochromes repress transcription processes that are dependent on the protein complex BMAL/CLOCK (Sancar, 2004; Chaves et al., 2011). In the case of these cryptochromes, the function of the C-terminal tail is more complex: (i) it is involved in the nuclear localization of the protein and (ii) with the photolyase domain, it also has a role in the interaction with other components of the clock such as BMAL (Chaves et al., 2006). Interestingly, the C-terminal tail also contributes to the circadian period length, since its phosphorylation affects the level of the protein, either promoting its own degradation in the case of CRY 2 (Harada et al., 2005) or stabilizing the protein as for CRY 1 (Gao et al., 2013).

It has been proposed that cryptochromes have evolved several times independently as an example of convergent evolution (Green, 2004). Only small changes have occurred in the photolyase domain, this part of the protein being conserved among cryptochromes and photolyases. One possible mechanism to explain the acquisition of C-terminal extensions in existing proteins would be through gene fusion (Marsh and Teichmann, 2010). If this mechanism had taken place at the origin of cryptochromes, it would suggest that proteins related to the Cterminal tail of cryptochromes already existed independently and had a function of their own. These independent domains became later associated to a photolyase domain providing them with the capacity to detect light. As mentioned above, the plant cryptochromes C-terminal domain is active and has the information needed to achieve signaling (Yang et al., 2000). During evolution, this protein could have fused with a duplicate of photolyase. In this hypothesis, the addition of the lightdependent photolyase module might be a way to adjust the physiology of the organisms to their environment through light perception. This could therefore be seen as an IDP having acquired a globular extension. Since in plants, a motif (DAS) within the C-terminal tail is conserved, it has been proposed that the ancestral plant cryptochrome emerged from a fusion of a photolyase with a protein containing the DAS motif (Lin and Shalitin, 2003). Another hypothesis that could explain the acquisition of the C-terminal tail in cryptochromes is by gene extension into a non-coding region (Marsh and Teichmann, 2010). The photolyase gene could thus have been extended through junk DNA. Analysis of phylogenetic relationships of gene families in animals showed that extension of an existing gene by "exonization" of a previous non-coding region seems to be an important evolutionary strategy to add a C-terminal disordered extension to proteins (Buljan et al., 2010). The high variability and different functions of the C-terminal tail of cryptochromes among plants and animals are in accordance with this hypothesis. Studies on the origin and evolution of the C-terminal tail of cryptochromes will give insights into the adaptation of organisms to light.

## Conclusion

Within the present review, we tried to demonstrate the central and multiple roles of intrinsically disordered tails carried by

## References


certain globular proteins. Describing several examples of proteins displaying IDRs in photosynthetic organisms, we discussed how IDRs impact on both the functions and mechanisms of action of their "host" proteins. The examples of the A2B2-GAPDH and the α-Rubisco activase isoform show that their C-terminal disordered extensions participate in the light-dependent redox regulation of the photosynthetic metabolism. The cases of the multiple transcription factors with a disordered tail are very similar yet very different. In the few examples listed here, the disordered region plays a major role in the regulation of the DNA-binding domain through protein-protein interactions or post-translational modifications. Their sensitivity to a large number of signals allows the activity of the transcription factors to be modulated according to many factors (one to many), turning these proteins into hubs in a large signaling web. Lastly, the cryptochrome family is a prime example of a disordered extension changing the fundamental function of the initial photolyase into a light-dependent signaling protein, conserving the ability to absorb blue light and repurposing it.

The examples presented here are but a few of the multitude of proteins that have acquired a disordered extension (Uversky, 2013), although most examples do not usually come from the photosynthetic world. We can expect that in the years to come, an increasing number of these proteins will be identified. A great question that remains is how these proteins originated. While in some cases, the addition of an IDR seems to be quite recent like the GapB subunit. In other cases, this addition might be very ancient as in the NAC, bZIP, and GRAS families, in which there are multiple disordered extensions families that may derive from multiple fusion events, or a long succession of duplications followed by diverging evolution of the subfamilies. We hope that the expansion of the IDP field in general and specifically, the one involved in "green" biochemistry, will 1 day answer these questions.

#### Acknowledgments

Studentship (GT) was given by Ecole Normale Supérieure, Cachan. This work (BG) was supported by the Centre National de la Recherche Scientifique (CNRS) and Aix-Marseille Université. LA got a fellowship from ANR-13-BIME-0001-02. We would like to thank Pr. Carla Green for her kind help. We also thank Pr. Stephen Maberly for editing the english.


reinhardtii chloroplasts. J. Biol. Chem. 275, 9447–9451. doi: 10.1074/jbc.275.13.9447


through the C-terminus of CP12 in Chlamydomonas reinhardtii. Biochemistry 50, 2881–2888. doi: 10.1021/bi1020259


carboxylase/oxygenase activase-like function in heterocystous cyanobacteria. Plant Mol. Biol. 40, 467–478. doi: 10.1023/A:1006251808625


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2015 Thieulin-Pardo, Avilan, Kojadinovic and Gontero. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# **Globular and disordered—the non-identical twins in protein-protein interactions**

#### *Kaare Teilum\*, Johan G. Olsen and Birthe B. Kragelund\**

*Structural Biology and NMR Laboratory, Department of Biology, University of Copenhagen, Copenhagen, Denmark*

In biology proteins from different structural classes interact across and within classes in ways that are optimized to achieve balanced functional outputs. The interactions between intrinsically disordered proteins (IDPs) and other proteins rely on changes in flexibility and this is seen as a strong determinant for their function. This has fostered the notion that IDP's bind with low affinity but high specificity. Here we have analyzed available detailed thermodynamic data for protein-protein interactions to put to the test if the thermodynamic profiles of IDP interactions differ from those of other protein-protein interactions. We find that ordered proteins and the disordered ones act as non-identical twins operating by similar principles but where the disordered proteins complexes are on average less stable by 2.5 kcal mol−1.

#### *Edited by:*

*Kris Pauwels, Vrije Universiteit Brussel, Belgium*

#### *Reviewed by:*

*Alfonso De Simone, Imperial College London, UK Daniel E. Otzen, Aarhus University, Denmark*

#### *\*Correspondence:*

*Kaare Teilum and Birthe B. Kragelund, Department of Biology, Ole Maaløes Vej 5, 2200 Copenhagen, Denmark kaare.teilum@bio.ku.dk; bbk@bio.ku.dk*

#### *Specialty section:*

*This article was submitted to Structural Biology, a section of the journal Frontiers in Molecular Biosciences*

> *Received: 30 April 2015 Accepted: 29 June 2015 Published: 09 July 2015*

#### *Citation:*

*Teilum K, Olsen JG and Kragelund BB (2015) Globular and disordered—the non-identical twins in protein-protein interactions. Front. Mol. Biosci. 2:40. doi: 10.3389/fmolb.2015.00040* **Keywords: ITC, IDP, intrinsically disordered, entropy, enthalpy, stability**

## **Introduction**

Proteins function though the action and communication with other molecules and the intricate interplay among residues within every binding site results in diagnostic thermodynamic profiles implicit to the particular molecular pair. In protein-protein interaction the majority of the binding energy comes from a few critical hot-spot interactions (Clackson and Wells, 1995), but the binding energy also depends on other factors such as interface size, residue composition, flexibility of the interacting partners as well as on environmental cues. The discovery of a large fraction of the proteome being intrinsically disordered (ID) means that a substantial fraction of proteinprotein interactions involves proteins or parts of proteins, which do not adopt a well-defined three-dimensional structure in the unbound state. These proteins, or regions in proteins, originate from the class of *intrinsically disordered proteins* (IDPs) (Dunker et al., 2000; Tompa, 2002; Nilsson et al., 2011). They are central to a plethora of key biological processes, are multi-specific and possess a versatile interaction potential placing many of them centrally in cellular hubs (Han et al., 2004). The prevailing notion is that IDPs are able to bind with high specificity, but low affinity, although recent kinetic studies suggest that this concept may not be straightforward (Dogan et al., 2014; Iesmantavicius et al., 2014; Krieger et al., 2014). IDPs contain very few hydrophobic residues (Dunker et al., 2001), which suggests that their interaction energies may be comparatively low, substantiated by the entropy loss of ordered complex formation from a disordered peptide chain. Specificity, on the other hand, arises when the polypeptide chain adopts the correct conformation in which the distribution of side chains match electrostatic and hydrogen bonding donors and acceptors as well as hydrophobic patches on the target. This paradigm of lower affinity of IDPs compared to globular proteins has been suggested but never challenged by a large-scale thermodynamic assessment, which is the aim of the present paper.

# **Results and Discussion**

Based on previous collections of data (Stites, 1997; Huang and Liu, 2013) and including several additional data from the literature found by searching PubMed for "ITC protein-protein interactions," "ITC intrinsically disordered protein," "thermodynamics protein-protein interactions," and "thermodynamics intrinsically disordered protein," we have compiled thermodynamic parameters from close to 200 different protein-protein interaction studies (Supplementary Table 1). The data were standardized to 298 K assuming that -Cp = 0, as -Cp has only been estimated for very few of the complexes. We have estimated that the error introduced in -G<sup>0</sup> is less than 0.2 kcal mol−<sup>1</sup> in the most extreme cases where the data were measured at 281 K. For most cases where there is less than 5 K difference the error is less than 0.05 kcal mol−1. We subsequently compared and correlated the parameters for interactions that involve only globular proteins (91 complexes), to the parameters for interactions, where one partner is an IDP (106 complexes). To avoid over-representing a single protein-protein complex we exclusively compared wild-type proteins so that protein specific irregularities will be averaged out. In the cases where a structure of the complex has been determined, we have calculated the interaction surface area using PISA (Krissinel and Henrick, 2007) (Supplementary Table 1), and determined the amino acid composition of the interface using NCONT from the CCP4i suite (Winn et al., 2011). The amino acids were divided into four classes for analysis (FWY, CILMV, AGPST, and DEHKNQR) based on the BLOSSUM50 substitution matrix as defined by Weathers et al. (2004). The interfaces of the all ordered (ORD-ORD) complexes and the ordered-IDP (ORD-IDP) complexes were then compared in this context (**Figure 1**).

Complexes from the two groups were almost equally represented (ORD-ORD: 52 structures average interface size of 886 <sup>±</sup> 46 Å2; ORD-IDP: 41 structures, average interface size of 905 <sup>±</sup> 80 Å2) in the protein data bank (Berman et al., 2000). The sizes of the binding interface areas in the two groups of proteins were not significantly different (*t*-test, *P* > 0.05), (Supplementary Table 1). This is perhaps not unexpected, although one might have anticipated the IDPcomplexes to have—on average—smaller interfaces, as many of their interactions are mediated by small linear motifs (SLiMs) (Dinkel et al., 2014), and short molecular recognition motifs (MoRFs) (Mohan et al., 2006). These motifs are typically peptide regions that fold into regular secondary structure on binding. Thus, one conclusion is that in the globular complexes analyzed here, there are equally many small interfaces, matching those of SLiMs and MoRFs of IDPs.

The second result of the structural analysis is that the intermolecular interactions, as reflected in the distribution of the four groups of amino acids, is the same (**Figure 1**). This observation is perhaps more surprising since the amino acid composition of IDPs is very distinct and different from that of globular proteins (Weathers et al., 2004; Uversky et al., 2005; Han et al., 2006; Hansen et al., 2006) with the low content of hydrophobic residues as the underlying reason for IDPs not forming globular structures. However, the amino

acid composition on the surface of globular proteins seems to resemble that of IDPs more than the overall composition (Fukuchi and Nishikawa, 2001; Tompa, 2002; Levy, 2010). Moreover, it differs significantly from the composition of interfaces in obligate oligomers that are typically much more hydrophobic (Janin et al., 2008). In a previous study the residue composition of extended binding surfaces of IDPs bound to an ordered partner was investigated (Wong et al., 2013). Compared to interfaces between two ordered proteins, the IDPs in complex with an ordered partner had in that work an overrepresentation of hydrophobic residues as leucine and isoleucine in the core of the interface, and the ordered binding partner had an increased number of charged residues. Thus, this apparent counter balance is in full accordance with the overall sum of the interface we report here. A decomposition of the distribution into individual residues within the current set supports previous findings, although the effect is small (the largest difference is for Cys which is 41% less abundant in the ORD-IDP complexes) (**Figure 1A**). Therefore, if specificity is embedded in interactions between charged and polar side-chains in the interface (Eaton et al., 1995; Wong et al., 2013), we find no indication to suggest that the IDPs bind to globular proteins with higher specificity than globular proteins do.

Recall the basic thermodynamic relation, -<sup>G</sup><sup>0</sup> <sup>=</sup> -<sup>H</sup>0<sup>−</sup> T-S0 in which the entropy-enthalpy compensation infers that -H<sup>0</sup> and T-S0 are highly correlated (Brady and Sharp, 1997; Williams et al., 2004; Teilum et al., 2009). Thus, -G<sup>0</sup> for the complexes in the selected sets covers a narrow range from <sup>−</sup>19.8 kcal mol−<sup>1</sup> to <sup>−</sup>4.2 kcal mol−<sup>1</sup> (corresponding to *K*<sup>d</sup> from 3 fM to 830µM) compared to -H<sup>0</sup> and T-S0 that are found in the ranges from <sup>−</sup>66.7 to 19.9 kcal mol−<sup>1</sup> and from <sup>−</sup>56.1 to 28.5 kcal mol−1, respectively. The analysis of the thermodynamic parameters shows that the enthalpy (-H◦) and the entropy (-S◦) for binding are *not* significantly different between the two groups of proteins (*t*-test, *P* > 0.1). However, the average entropic contribution (−T-S◦) to the binding free energy for interactions between two ordered proteins is 2.5 ± 1.6 kcal mol−<sup>1</sup> smaller (more stabilizing) than for interactions between an ordered and a disordered protein. Within both groups there is a linear correlation between T-S◦ and -H◦ (ORD-ORD: slope = 1.09 ± 0.03, *r* = 0.97; ORD-IDP: slope = 1.06 ± 0.02, *r* = 0.98), which demonstrates a similar entropyenthalpy compensation (**Figure 2A**). Thus, the same underlying thermodynamic principles are true for both groups.

In contrast to -S◦ and -H◦, there is a significant difference in -G◦ between the groups (*t*-test, *P* < 0.0001). For the ORD-ORD complexes <-<sup>G</sup>◦<sup>&</sup>gt; = −11.1<sup>±</sup> 0.4 kcal mol−1, and for the ORD-IDP complexes <-<sup>G</sup>◦<sup>&</sup>gt; = −8.<sup>5</sup> <sup>±</sup> <sup>0</sup>.2 kcal mol−1. The difference in <-<sup>G</sup>◦<sup>&</sup>gt; is 2.5 <sup>±</sup> 0.4 kcal mol−1, which is primarily accounted for by the difference in T-S◦ (*vide supra*). This number is close to the 2.6 kcal mol−<sup>1</sup> recently published from a much smaller dataset based on mutation studies (Huang and Liu, 2013). Note that the distribution of -G◦ among the complexes for which a structure is available is similar to the distribution in the full dataset, and that this is true for the difference in <-G◦> too. As we see no differences in the sizes of the binding interfaces or the amount of hydrophobic residues in the interfaces, and since the disordered proteins in the ORD-IDP complexes rarely form extended hydrophobic cores in their folded conformations, the hydrophobic surface area buried in ligand binding process must be similar in the two classes of protein complexes. Consequently, the difference in T-S◦ is unlikely to arise from significant differences in the desolvation entropy contribution. This conclusion is in contrast to a computational study of complexes involving extended IDPs, which were selected based on a radius-of-gyration criterion of the

three-dimensional structure of the complex (Wong et al., 2013). However, in that work the energetic terms were not decomposed into enthalpic and entropic contributions. Nevertheless, the experimental data for the large group of complexes that we have compiled suggest to us that the less favorable entropic contribution for the ORD-IDP complexes primarily originates from loss in conformational entropy. Indeed, it agrees with the mechanistic difference between binding an ordered and a disordered ligand. The disordered polypeptide has to fold to form

code as in **(A)**. The solid lines represent the best linear fits to the data.

the final complex, which inherently will be associated with a relative large loss in conformational entropy. It is important to note that it is not possible to conclude from equilibrium ITC data *when* the folding of the ligand occurs during the binding process. It is highly likely that for some of the protein complexes the IDP folds and then binds in a conformer selection process while for others the IDP folds upon binding in an induced fit process. The difference in <-G◦> may still, however, be explained by the required folding of the disordered ligand in the ORD-IDP complexes.

We next analyzed the distribution of the -G◦-values for the ORD-ORD and ORD-IDP complexes (**Figure 2B**). Interestingly, the most stable complexes (-<sup>G</sup>◦ <sup>&</sup>lt; <sup>−</sup>15 kcal mol−1) are exclusively formed between two ordered proteins, and the least stable complexes (-<sup>G</sup>◦ ∼ −5 kcal mol−1) are exclusively formed between an ordered and a disordered protein. Among the most stable complexes we find several enzyme: inhibitor complexes, such as the bacterial DNAses in complex with bacterial immunity proteins (Keeble et al., 2006). These DNAses form both very stable cognate complexes and less stable non-cognate complexes with immunity proteins. All these complexes are formed with similar on-rates in the order of 107 M−<sup>1</sup> s−1, and the stronger binding is achieved by slower off rates (Keeble and Kleanthous, 2005; Keeble et al., 2006). Another strong binding complex is that of barnase and barstar which has become a classical example where the electrostatic surfaces of the proteins have evolved to enhance the on-rate (*k*ass <sup>=</sup> <sup>10</sup><sup>8</sup> <sup>−</sup> <sup>10</sup><sup>9</sup> <sup>M</sup>−<sup>1</sup> <sup>s</sup>−1) (Schreiber and Fersht, 1996). Similar fast on-rates are reported for IDPs (Arai et al., 2012; Rogers et al., 2013; Dogan et al., 2014) and similar on-rate dependence on electrostatics has been noted (Rogers et al., 2013). Consequently, the main difference between ORD-IDP and ORD-ORD complexes seems not to reside in on-rate differences, but may therefore reside in off-rates, noted earlier in comparative kinetic studies (Huang and Liu, 2009; Shammas et al., 2012; Dogan et al., 2014). It is still possible that the electrostatic influence from a globular binding partner will cause an induction of a binding-competent conformation within the ensemble distribution of the IDP. A result of this is that it can potentially influence the on-rate and subsequently the binding energy. Alternatively, the binding-competent conformation of the IDP may be required to guide it into the electrostatic field of the globular partner. We do not currently have any data to elaborate further on these scenarios.

One of the hall-marks of IDPs is their ability to interact with many different proteins, for instance in cellular hubs (Oldfield et al., 2008; Cumberworth et al., 2013). Based on computational analyses of structures from a large set of both ordered and disordered hub-complexes from yeast, it was suggested that the binding energies become weaker as the number of interacting proteins increases (Carbonell et al., 2009). Thus proteins with only one binding partner bound with higher affinity than promiscuous proteins with more than on binding partner. This difference may possibly be caused by a broader distribution of hot spots in the promiscuous proteins (Carbonell et al., 2009). It is possible that the difference in average binding affinity <-G◦> observed in the current set is related to an increased number of interacting partners. We have no data on the number of alternative binding partners for complexes in our analysis. Still, it is interesting to note that the difference in binding energy between specific-to-specific complexes and specific-topromiscuous was 0.08 <sup>±</sup> 0.01 kcal mol−<sup>1</sup> residue−<sup>1</sup> (Carbonell et al., 2009), which with an average of 47 residues in the interfaces provided in our data set, amounts to 3.8 <sup>±</sup> 0.5 kcal mol−1, close to the average difference of 2.5 <sup>±</sup> 0.4 kcal mol−<sup>1</sup> that we found between the ORD-ORD and the ORD-IDP complexes.

One alternative explanation for the lower average stability of IDP-ORD complexes may be purely technical and unrelated to any *de facto* differences between globular proteins and IDPs. The vast majority of the experimental studies in our set are conducted on recombinant proteins, typically expressed in *Escherichia coli*. Since phosphorylations and other post translational modifications are widespread in IDPs and is a way of regulating their activity (Iakoucheva et al., 2004) the 2.5 kcal mol−<sup>1</sup> displacement of the average -G◦ could reflect the fact that some of the IDPs examined lack certain posttranslational modifications that would be stabilizing to the interaction. However, the same argument may hold also for globular proteins and a phosphorylation may even destabilize a complex. The lack of other factors (chaperones, carrier proteins, methyl-groups, carbohydrates), which may alter the energy of binding *in vivo* cannot be excluded as origin for the displacement either, but again we see no reason why this should not be an even more pronounced effect for the ORD-ORD complexes.

Based on the data collected, we have reached the—perhaps counterintuitive conclusion that interfaces formed between globular proteins and IDPs are not overall significantly different from the interfaces between two globular proteins, although the contribution of residues within the binding interface is slightly skewed. We find instead that there is a small but significant difference in the average binding free energy in favor of the ORD-ORD complexes. We suggest that this difference is primarily caused by the loss of conformational free energy upon IDP binding, which affects the off-rate of the complex, although other reasons may exists such as an increased number of binding partners for the IDP.

Finally, we would like to add that the present analysis almost exclusively involves binary complexes. It has been suggested that IDPs are particularly well suited as scaffolds for large complexes or as hubs for signaling assemblies. Therefore, we may have missed thermodynamic fingerprints that stand out and reveal IDPs that diverge more from their globular twins than the ones analyzed in the present paper. Allostery in IDPinteractions where more binding sites are in play is an emerging subject (Ferreon et al., 2013; Shammas et al., 2014) and in the ensemble view of allostery, IDP-linked negative and positive allostery is possible (Motlagh et al., 2014). This aspect is not decomposed in the present set of data and allostery may be one underlying cause of the observed differences. Also, highly fuzzy complexes acting e.g. as electrostatic clouds (Mittag et al., 2010; Fuxreiter and Tompa, 2012) are most likely not captured by the methods available for measuring the thermodynamics of protein interactions and are most definitely not targets for structure determination and hence do not contribute to the current analyses. Although the concept of fuzziness has emerged from studies on IDPs we cannot exclude that they also exist for complexes of two ordered proteins.

The reason, if any, for the evolution of protein intrinsic disorder remains to be disclosed. The present paper hints strongly that the answer does not lie directly in differences in thermodynamic parameters or the energetic principles of ligand binding.

## **Author Contributions**

All authors contributed equally to the work and wrote the manuscript in collaboration.

# **References**


#### **Acknowledgments**

This work was supported by a grant from the Danish Natural Research councils (grant number 12-12880 3 to BK).

## **Supplementary Material**

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fmolb. 2015.00040


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2015 Teilum, Olsen and Kragelund. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Anatomy of protein disorder, flexibility and disease-related mutations

#### Hui-Chun Lu<sup>1</sup> , Sun Sook Chung1, <sup>2</sup> , Arianna Fornili 1, 3 and Franca Fraternali <sup>1</sup> \*

<sup>1</sup> Randall Division of Cell and Molecular Biophysics, King's College London, London, UK, <sup>2</sup> Department of Haematological Medicine, King's College London, London, UK, <sup>3</sup> School of Biological and Chemical Sciences, Queen Mary University of London, London, UK

Integration of protein structural information with human genetic variation and pathogenic mutations is essential to understand molecular mechanisms associated with the effects of polymorphisms on protein interactions and cellular processes. We investigate occurrences of non-synonymous SNPs in ordered and disordered protein regions by systematic mapping of common variants and disease-related SNPs onto these regions. We show that common variants accumulate in disordered regions; conversely pathogenic variants are significantly depleted in disordered regions. These different occurrences of pathogenic and common SNPs can be attributed to a negative selection on random mutations in structurally highly constrained regions. New approaches in the study of quantitative effects of pathogenic-related mutations should effectively account for all the possible contexts and relative functional constraints in which the sequence variation occurs.

Keywords: non-synonymous SNPs, protein disorder, order-disorder propensity, disease-related mutations, protein flexibility

## Introduction

Because of the intrinsic complexity of biological systems, reductionist approaches have traditionally been used that concentrate on carefully chosen sub-systems. The availability of complete genome sequences and large (but incomplete) collections of biomolecular structures at atomic resolution favors large-scale computational approaches to investigate multiple components and their interactions (Lu et al., 2013). The undisputed relationship between protein-coding elements and their protein products has dominated the field of genomics/proteomics research in the past and the relationship between structure and function has been widely investigated.

Large-scale studies have been performed on how disease-related mutations may disrupt protein functions and ultimately regulate the function of biological systems (Studer et al., 2013). Mutations are classified as "loss of function," "gain of function," or "neutral" according to their effect on protein function. These effects can be mediated by alterations of the protein stability induced by the mutation (Yue et al., 2005; Studer et al., 2013). The impact of SNPs on protein function and structural stability has been extensively studied at the level of the single protein (Yue et al., 2005; Schuster-Bockler and Bateman, 2008; Wang et al., 2012; Nishi et al., 2013; Studer et al., 2013; Yates and Sternberg, 2013; Scharner et al., 2014) and a number of predictors have been developed to evaluate the impact of SNPs on individual proteins (Thomas and Kejariwal, 2004; Capriotti et al., 2005; Bromberg and Rost, 2007; Adzhubei et al., 2010;

#### Edited by:

Kris Pauwels, Vrije Universiteit Brussel, Belgium

#### Reviewed by:

Elena Papaleo, University of Copenhagen, Denmark Sreenivas Chavali, MRC Laboratory of Molecular Biology, UK

#### \*Correspondence:

Franca Fraternali, Randall Division of Cell and Molecular Biophysics, King's College London, New Hunt's House, Guy's Campus, London SE1 1UL, UK franca.fraternali@kcl.ac.uk

#### Specialty section:

This article was submitted to Structural Biology, a section of the journal Frontiers in Molecular Biosciences

> Received: 31 May 2015 Accepted: 29 July 2015 Published: 12 August 2015

#### Citation:

Lu H-C, Chung SS, Fornili A and Fraternali F (2015) Anatomy of protein disorder, flexibility and disease-related mutations. Front. Mol. Biosci. 2:47. doi: 10.3389/fmolb.2015.00047 Reva et al., 2011; Al-Numair and Martin, 2013; Shihab et al., 2013; Pires et al., 2014; Yates et al., 2014). With the intention to expand the single protein structure-function paradigm, the interplay between Protein Protein Interactions (PPI) networks, structures, and disease mutations has been explored by several groups (see reviews Lu et al., 2013; Yates and Sternberg, 2013) and reference therein, Kelley et al., 2015; Mosca et al., 2015). Particularly the crucial role of interfaces in modulating the effects of pathogenic variation in binding and signaling (Stefl et al., 2013; Yates and Sternberg, 2013) has been generally accepted. In recent years additional findings have contributed to further expanding classical structure-function approaches: firstly, the widely recognized importance of non-coding elements (Necsulea and Kaessmann, 2014; Ling et al., 2015) (not discussed here) and the enrichment of SNPs in these (Consortium, 2012; Kircher et al., 2014); secondly, the role of unstructured regions, intrinsically disordered elements and flexibility in protein function versatility (Uversky, 2013; Dunker et al., 2015; Wright and Dyson, 2015). Even in the absence of intrinsic disorder, there is growing evidence that conformational flexibility is important in regulating protein-protein interactions (Dobbins et al., 2008; Stefl et al., 2013; Uversky, 2013). This effect has also been shown for proteins that have multiple partners (hubs) and are essential in protein-protein communication and signaling. Hubs' promiscuous binding sites have been demonstrated to display specific dynamical properties, pre-existing in the isolated state of the individual protein, allowing for polyvalent partner binding (Fornili et al., 2013). In any case, quantification of the occurrence of SNPs in disordered and flexible protein regions is a complex task, because different shades of disorder have been identified as playing a role in protein function stability and binding (Uversky et al., 2014; Wright and Dyson, 2015 and references therein). One particularly interesting case is represented by mutations related to disorder-to-order (D-O) transitions; there are often associated to post-translational modifications or with defense mechanisms to protect proteins from toxic aggregation and oxidative stress (Winter et al., 2008) and therefore may result in a stronger impact on the protein functional role. Consequently, order/disorder-sensitive descriptors of the specific chemico-physical environment in the vicinity of the observed variant are needed to evaluate rigorously the relationship between disorder and disease-related mutations.

We aim to contribute to this debate by exploring and quantifying in a systematic way the relationship between order/disorder and the occurrence of common variants (dbSNP: common variations from the 1000 Genomes project, Sherry et al., 2001), disease-related SNPs (OMIM: Mendelian genetic diseases, Hamosh et al., 2005) and cancer-related SNPs (COSMIC, Forbes et al., 2011). To this end we decompose the protein sequences anatomically in folded domain regions, unfolded-disordered (intra-domain) regions and inter-domain disordered regions and calculate the enrichment/depletion of SNPs in each of these regions. These comparisons based on mapping SNPs on static, crystallographic structures, represent a first step in quantifying the different roles played by the two (ordered vs. disordered) environments in which a common or pathogenic mutation may occur. We also explored scenarios of mutual effects of mutations in ordered regions on the disorder content within that domain. We discuss two cases of hubs that are strongly involved in cancer: BRAF (Haling et al., 2014; Thevakumaran et al., 2015) and JAK2 (Bandaranayake et al., 2012), both with phenotypic pathogenic mutations occurring in ordered regions and affecting the disorder content of distal sites in the domain.

#### Results

#### Dearth of Disease-Related SNPs in Inter-domain Disordered Regions

The relative enrichment of SNPs in the dissected disordered regions of the protein have been analyzed by comparing three different classes of SNPs: (a) the common variants from the 1000 Genomes project (dbSNP) (Sherry et al., 2001), (b) the geneticdisease variants from OMIM (Hamosh et al., 2005), and (c) the COSMIC cancer-related SNPs (Forbes et al., 2011). Details on the enrichment/depletion measures are given in the Section "Strategy for the investigation of disordered regions and SNPs occurrence."

The outcome of our analysis is presented **Figure 1B** and the barplots relative to each region are colored according to the scheme in **Figure 1A**. The results for dbSNP data are reported as a comparison of the observed human variation in the analyzed regions vs. the pathogenic mutations observed for OMIM and COSMIC data. To our knowledge, this is the first time that such a comparison is presented. The results have been statistically tested (see Strategy section) and the p-values of the comparison tests between the distributions are annotated with stars to show their significance (see **Figure 1** legend for clarification).

The most striking difference amongst all data lies in the opposite behavior observed for INTER-domain disordered regions (INTER-Dom DRs, red) in dbSNP vs. OMIM and COSMIC data (enrichment vs. depletion, respectively). The reasons for such trend can be ascribed to the fact that common variations usually do not occur in structurally and functionally constrained regions but rather accumulate in disordered regions, particularly inter-domain ones. These are usually more flexible to allow the orientation of protein domains and binding multiplicity (Fong and Panchenko, 2010). Conversely, an opposite trend is observed for the INTRA-domain ordered regions (INTRA-Dom OR, light green) of dbSNP vs. the disease-related INTRA-Dom OR plots. For both the disease-related OMIM (**Figure 1B**, center) and COSMIC (**Figure 1B**, right) datasets, there is clear evidence that pathogenic mutations are enriched in ordered domain regions. These are the fragile sites that once mutated can cause a functional impairment of the protein either by destabilizing the fold (Studer et al., 2013), or by affecting structurally important regions for partner binding and consequent signaling activity (Yates and Sternberg, 2013). The enrichment in INTER-Dom disordered regions vs. INTRA-Dom ordered regions is particularly pronounced for the OMIM dataset, but also significantly important for the COSMIC data. The difference in the relative order/disorder populations of the two datasets might be related to the fact that mutations with Mendelian inheritance

are potentially more harmful to the protein than some of the passenger mutations observed in cancer.

Our results support previous studies that compared differences in "natural" mutations from dbSNP and diseaseassociated OMIM data (De Beer et al., 2013). The difference in order vs. disorder propensities observed in our study is therefore an additional discriminant in evaluating the mutability of proteins.

#### Examples of Intra-domain Mutations and Effects on Disorder Occurrence

In a number of recent studies it has been reported that disordered regions harbor pathogenic mutations (Iakoucheva et al., 2002; Uversky et al., 2008; Babu et al., 2011; Hu et al., 2011; Pajkos et al., 2012; Vacic and Iakoucheva, 2012; Vacic et al., 2012). Some of these observations referred to SNPs in segments involved in D-O transitions, but as we observed a clear dearth of pathogenic mutations in INTER-domain disordered regions (INTER-Dom DRs), we decided to investigate the occurrence of SNPs in INTRA-Dom DRs in more detail. A particularly interesting case is the mutual effect of intradomain pathogenic mutations and disorder observed within the domain, even at sites distant from the original mutation. We found such examples in BRAF and JAK2 kinases, which are involved in cancer pathologies (Vogelstein and Kinzler, 2004).

We previously studied the BRAF V600E mutation that destabilizes the inactive conformation of the BRAF kinase and consequently induces ERK activation (Satoh et al., 2012; Lu et al., 2013). The V600 residue is in a cluster of hydrophobic residues with Phe468, therefore the presence of a negative charge (residue E) will be disruptive for this cluster, resulting in destabilization of the inactive conformation. Interestingly, introducing the V600E mutation in the BRAF protein kinase domain increases the INTRA-Dom DRs prediction, as shown in the table (**Figure 2A**) and the plot (**Figure 2B**). By running the DISOPRED2 predictor for the V600E mutant, one can observe an increase in the span of the predicted disordered region found in a distal site (607–611). Notably, the predicted disorder region span was not affected by mutations found within the INTRA-Dom DRs (yellow residues in **Figure 2A** for BRAF). These findings suggest that, besides destabilizing the hydrophobic cluster, the V-E substitution in the kinase domain (Pkinase\_Tyr(PF07714)) might also have an effect on the INTRA-Dom disorder content by unwinding the downstream loop, as shown in the wild type 3D structure (4MNE\_B) (Haling et al., 2014) (**Figure 2B** and Figure S1). This could in turn affect the ligand-binding region (structure 4WO5\_A) (Thevakumaran et al., 2015), with a possible impact on the binding affinity.

The mutation V600E has been studied in detail by sophisticated enhanced sampling methods (Marino et al., 2015) and one of the main consequences of the pathogenic variant highlighted in this study is reflected the enhancement of the active-to-inactive state barrier and the increased flexibility (disorder) of the activation loop (region 602–612). These combined effects result in keeping the kinase in an active state and therefore favor phosphorylation to occur. This study supports the idea that an accurate descriptions of the structure, dynamics, and energetics of the protein and its mutated states are necessary to extract molecular fingerprints that rationalize the impact of pathogenic vs. commonly occurring mutations. Interestingly, in recent times the tendency of BRAF in adopting permanently an active state not detectable by current structure has been highlighted as one of the paradigmatic cases for which the currently adopted strategies for structure-based drug discovery may be ineffective (Holderfield et al., 2014).

Long-range effects of mutations on domain-disorder content are partially observed also for the V617F SNP of the JAK2 kinase, a mutation mostly observed in leukaemias. Our predictions indicate that this mutation leads to an extension of the INTRA-Dom DRs, (Bandaranayake et al., 2012) as shown in the table (Figure S2D) and the plot (Figure S2E).

The changes of disorder probability between the wild type sequences (BRAF and JAK2) and those with the cancer driver SNPs (V600E and V617F, respectively) have been predicted by five different methods which include highly ranked methods in CASP10 (Monastyrskyy et al., 2014) such as DISOPRED3, PrDOS, Biomine\_MFDp and a recent method using backbone dynamics, DynaMine (Cilia et al., 2014) (Figures S2, S4). The results do not show a strict consensus in the boundaries and in the absolute differences of disorder content, this can be ascribed to the different algorithms used. However, most of the methods predict an increase of the disorder probability in the mutation distal regions we observe for BRAF that the cancer driver mutation is at the periphery of the kinase binding site and in an ordered region, while the non-driver mutations mostly occur in the disordered regions. The two locations seem to be correlated in the sense that the observed change in the driver mutation alters the disorder content of the other mutation loci. This may be ascribed to correlated dynamical couplings between disordered and ordered regions within the same protein domain that may lead to an enrichment of pathogenic variants in flexible and less structured regions. This long-range coupling is an indirect and probably down-tuned mutational effect on the protein function, which may result in a higher acceptance of the mutation in these regions.

# Strategy for the Investigation of Disordered Regions and SNP Frequencies

#### Data Set Preparation

A data set of human proteins, using UniProt accession identifiers as reference, was generated by mapping SNPs onto experimentally resolved 3D structures of proteins. Native and homologous structures were identified by running NCBI-BLAST (version 2.2.29+) (Camacho et al., 2009) against the PDB sequence library. Homologues were accepted above the 30% sequence identity threshold. Non-synonymous SNPs were retrieved from the dbSNP database (build 141) (Sherry et al., 2001), germ-line disease-related SNPs were extracted from the "Online Mendelian Inheritance in Man" (OMIM) database (version July 2014) (Hamosh et al., 2005) and somatic cancerrelated SNPs were taken from the "Catalog of Somatic Mutations in Cancer" (COSMIC) database (version July 2014) (Forbes et al., 2011). Only the proteins having a native/homologous structure and SNP information were selected, yielding a reference data set comprising 5587 proteins.

#### Definition of Protein Domains and Disordered Regions

The selected proteins were assigned with domain definitions and disordered region predictions. For each protein sequence of the reference data set as query, the HMM sequence aligner HMMER3 (Finn et al., 2011) was used to search against the Pfam domain sequence library (Pfam-A.hmm version 26.0) (Punta et al., 2012) and to assign the matched PFAM domain definition to the query protein, given the alignment E-value was smaller than 1e-3. Disordered regions of the selected proteins were predicted using the DISOPRED program (Ward et al., 2004). The combination of domain definitions and disordered region predictions leads to three distinct regional classes (**Figure 1A**): (1) intra-domain ordered region (INTRA-Dom OR), (2) intra-domain disordered region (INTRA-Dom DR), and (3) inter-domain disordered region (INTER-Dom DR).

#### SNPs Enrichment Analysis

We computed the regional enrichment/depletion of SNPs as propensities P(SNPregion) by normalizing the relative regional

frequency with the relative frequency over the total protein length (Equation 1).

$$P\left(\text{SNP}\_{\text{region}}\right) = \frac{\left(\text{N(SNPs)}\_{\text{region}} / \text{length}\_{\text{region}}\right)}{\left(\text{N(SNPs)}\_{\text{protein}} / \text{length}\_{\text{protein}}\right)}\tag{1}$$

These propensities are plotted in **Figure 1B** as relative entropies log(P(SNPregion)). A relative entropy of zero indicates a regional frequency equal to the background frequency (denominator), positive values indicate relative enrichment and negative values correspond to relative depletion. All SNPs (from dbSNP, OMIM, and COSMIC) were mapped onto the protein sequences: 5504 of 5587 proteins were mapped with SNPs from dbSNP, 700 of 5587 with SNPs from OMIM and 5422 of 5587 with SNPs from COSMIC. SNPs from each database were further classified into different classes by mapping their positions onto the corresponding protein regions (INTRA-Dom OR SNPs, INTRA-Dom DR SNPs, and INTER-Dom DR SNPs). The number of SNPs in the different classes and the mean lengths of the regions are given in **Figure 1C**.

#### Statistical Evaluation

To obtain an estimate of the uncertainty associated with the propensity calculations and to reduce biases incurred by the protein selection procedure, we used a bootstrapping method (R function boot()) to create random re-sampled subsets of the reference data set. 10,000 independent subsets were generated of the SNPs propensities within each pre-defined protein regional class (INTRA-Dom OR SNPs, INTRA-Dom DR SNPs, and INTER-Dom DR SNPs) and the mean of each subset was computed. The distributions of the resampled means are by normally distributed, as expected (Figure S4). Confidence intervals at the 95% level were calculated from the bootstrap distributions. The statistical significance of differences between propensity distributions was calculated by Student's t-test on the confidence intervals (Wolfe and Hanley, 2002). The statistical analyses were performed using R (R Core Team, 2014).

## Conclusions and Perspectives

We performed a large-scale statistical analysis of the relationship between protein disorder and disease-related mutations. We report that both genetic-disease variants from OMIM and cancerrelated SNPs from COSMIC are depleted in disordered regions compared to common human variation. This is in line with the fact that mutations in highly constrained regions of the protein are more likely to be disruptive or deleterious. This is why mutations in ordered states of proteins (domains, ligand-binding sites, PPI sites) have been investigated quite in detail in the last years.

We offer here a starting and objective point to discriminate between completely ordered regions, disordered regions occurring in ordered domains, and inter domain predicted disordered segments. We observe and quantify the result of the mapping of available SNPs data onto a large set of human proteins and their close homologs. From this study a number of interesting cases can be extracted for functional validation and close investigation of the dynamical role played by the disorder content.

New perspectives in the field can be explored from this starting point, as the more complicate cases in which flexibility and/or disorder play a direct role in the protein function have not yet been fully elucidated. Particularly complex are the cases where flexible residues modulate protein binding and promiscuity (Fornili et al., 2013) and disorder-to-order causing mutations (Vacic and Iakoucheva, 2012; Dunker et al., 2015). These more "dynamically" driven processes are difficult

## References


to parametrise and the restraints playing a role in selecting the actual functional states are not always quantifiable. At this purpose, systematic studies collecting critical examples of experimentally proved correlations between flexibility, presence of disordered states, D-O transitions, and functional studies are needed in the field for the benchmarking and validation of predictive tools for the impact of pathogenic variation on proteins and their partners. Most recent development in disorder prediction methods exploits successfully the mutual interplay between backbone and side-chain dynamics (Cilia et al., 2014; Kosciolek and Jones, 2015).

Nevertheless, more sophisticated methods are needed to quantify these observations, like large-scale molecular simulations, [so far performed for isolated cases Vacic et al., 2012; Marino et al., 2015] and measurements of conformational signal transduction within protein structures (Pandini et al., 2012). The correlated dynamical couplings between disordered and ordered regions may be exploited in the design of drugs targeting distal sites from the dominant mutation, and by fine-tuning the effects on the overall protein function. Additionally, the possibility to predict the "allosteric" modulation of mutations occurring in regions with a different level of order/disorder and possibly correlated with the same or different pathogenic manifestation can open new avenues to investigate the underlying molecular mechanisms and rectify current strategies for drug-discovery.

# Acknowledgments

We thank Dr. Jens Kleinjung and Dr. Nicholas Shaun B. Thomas for their critical reading of the manuscript. This research was supported by the Biotechnology and Biological Sciences Research Council (BB/H018409/1 to FF), the British Heart Foundation (FS/12/41/29724 to AF and FF) and the Leukaemia & Lymphoma Research (to FF). SSC is funded by a Leukaemia & Lymphoma Research Gordon Piller PhD Studentship.

#### Supplementary Material

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fmolb. 2015.00047


network features. J. Mol. Biol. 426, 2692–2701. doi: 10.1016/j.jmb.2014. 04.026


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2015 Lu, Chung, Fornili and Fraternali. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Protein flexibility in the light of structural alphabets

Pierrick Craveur 1, 2, 3, 4 , Agnel P. Joseph<sup>5</sup> , Jeremy Esque<sup>6</sup> , Tarun J. Narwani 1, 2, 3, 4 , Floriane Noël 1, 2, 3, 4, Nicolas Shinada1, 2, 3, 4, Matthieu Goguet 1, 2, 3, 4, Sylvain Leonard1, 2, 3, 4 , Pierre Poulain1, 2, 3, 4, 7, Olivier Bertrand1, 3, 4, Guilhem Faure<sup>8</sup> , Joseph Rebehmed<sup>9</sup> , Amine Ghozlane<sup>10</sup>, Lakshmipuram S. Swapna11, 12, Ramachandra M. Bhaskara11, 13 , Jonathan Barnoud1, 2, 3, 4, 14, Stéphane Téletchéa1, 2, 3, 4, 15, Vincent Jallu<sup>16</sup>, Jiri Cerny <sup>17</sup> , Bohdan Schneider <sup>17</sup>, Catherine Etchebest 1, 2, 3, 4, Narayanaswamy Srinivasan<sup>11</sup> , Jean-Christophe Gelly 1, 2, 3, 4 and Alexandre G. de Brevern1, 2, 3, 4 \*

#### Edited by:

Kris Pauwels, Vrije Universiteit Brussel, Belgium

#### Reviewed by:

Mark Pfuhl, King's College London, UK Elena Papaleo, University of Copenhagen, Denmark

#### \*Correspondence:

Alexandre G. de Brevern, Institut National de la Santé et de la Recherche Médicale U 1134, 6 Rue Alexandre Cabanel, 75015 Paris, France alexandre.debrevern@ univ-paris-diderot.fr

#### Specialty section:

This article was submitted to Structural Biology, a section of the journal Frontiers in Molecular Biosciences

> Received: 28 February 2015 Accepted: 30 April 2015 Published: 27 May 2015

#### Citation:

Craveur P, Joseph AP, Esque J, Narwani TJ, Noël F, Shinada N, Goguet M, Leonard S, Poulain P, Bertrand O, Faure G, Rebehmed J, Ghozlane A, Swapna LS, Bhaskara RM, Barnoud J, Téletchéa S, Jallu V, Cerny J, Schneider B, Etchebest C, Srinivasan N, Gelly J-C and de Brevern AG (2015) Protein flexibility in the light of structural alphabets. Front. Mol. Biosci. 2:20. doi: 10.3389/fmolb.2015.00020 1 Institut National de la Santé et de la Recherche Médicale U 1134, Paris, France, <sup>2</sup> UMR\_S 1134, DSIMB, Université Paris Diderot, Sorbonne Paris Cite, Paris, France, <sup>3</sup> Institut National de la Transfusion Sanguine, DSIMB, Paris, France, <sup>4</sup> UMR\_S 1134, DSIMB, Laboratory of Excellence GR-Ex, Paris, France, <sup>5</sup> Rutherford Appleton Laboratory, Science and Technology Facilities Council, Didcot, UK, <sup>6</sup> Institut National de la Santé et de la Recherche Médicale U964,7 UMR Centre National de la Recherche Scientifique 7104, IGBMC, Université de Strasbourg, Illkirch, France, <sup>7</sup> Ets Poulain, Pointe-Noire, Congo, <sup>8</sup> National Library of Medicine, National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD, USA, <sup>9</sup> Centre National de la Recherche Scientifique UMR7590, Sorbonne Universités, Université Pierre et Marie Curie – MNHN – IRD – IUC, Paris, France, <sup>10</sup> Metagenopolis, INRA, Jouy-en-Josas, France, <sup>11</sup> Molecular Biophysics Unit, Indian Institute of Science, Bangalore, Bangalore, India, <sup>12</sup> Hospital for Sick Children, and Departments of Biochemistry and Molecular Genetics, University of Toronto, Toronto, ON, Canada, <sup>13</sup> Department of Theoretical Biophysics, Max Planck Institute of Biophysics, Frankfurt, Germany, <sup>14</sup> Laboratoire de Physique, École Normale Supérieure de Lyon, Université de Lyon, Centre National de la Recherche Scientifique UMR 5672, Lyon, France, <sup>15</sup> Faculté des Sciences et Techniques, Université de Nantes, Unité Fonctionnalité et Ingénierie des Protéines, Centre National de la Recherche Scientifique UMR 6286, Université Nantes, Nantes, France, <sup>16</sup> Platelet Unit, Institut National de la Transfusion Sanguine, Paris, France, <sup>17</sup> Institute of Biotechnology, The Czech Academy of Sciences, Prague, Czech Republic

Protein structures are valuable tools to understand protein function. Nonetheless, proteins are often considered as rigid macromolecules while their structures exhibit specific flexibility, which is essential to complete their functions. Analyses of protein structures and dynamics are often performed with a simplified three-state description, i.e., the classical secondary structures. More precise and complete description of protein backbone conformation can be obtained using libraries of small protein fragments that are able to approximate every part of protein structures. These libraries, called structural alphabets (SAs), have been widely used in structure analysis field, from definition of ligand binding sites to superimposition of protein structures. SAs are also well suited to analyze the dynamics of protein structures. Here, we review innovative approaches that investigate protein flexibility based on SAs description. Coupled to various sources of experimental data (e.g., B-factor) and computational methodology (e.g., Molecular Dynamic simulation), SAs turn out to be powerful tools to analyze protein dynamics, e.g., to examine allosteric mechanisms in large set of structures in complexes, to identify order/disorder transition. SAs were also shown to be quite efficient to predict protein flexibility from amino-acid sequence. Finally, in this review, we exemplify the interest of SAs for studying flexibility with different cases of proteins implicated in pathologies and diseases.

Keywords: protein structures, disorder, secondary structure, structural alphabet, protein folding, allostery, protein complexes, protein—DNA interactions

# Introduction

Analysis of protein structures is crucial to understand protein dynamics and functions. X-ray crystallography, the goldstandard method for solving 3D structures at atomic resolution, is impeded by protein dynamics. Hence, tricks are frequently used to restrict motions. It is why proteins have been often considered as static macromolecules, composed ofrigid repetitive secondary structures and less rigid random coils. However, more and more emerging evidences show that protein structures are more complex with their internal dynamics being a key determinant of their function. Analyses of protein structures are often performed with a simplified three-state description known as α-helix, β-strand and coil which constitutes the classical secondary structures (Corey and Pauling, 1953; Kabsch and Sander, 1983). A more precise and complete description of protein backbone conformation exists based on the definition of libraries of small protein fragments, namely the structural alphabets (SAs) (Unger et al., 1989; Fetrow et al., 1997; Camproux et al., 1999; Offmann et al., 2007; Tyagi et al., 2007; Joseph et al., 2010a,b). SAs are designed to approximate every part of the local protein structures providing conformational detail. They have performed remarkably well spanning various problems in structural bioinformatics, from the characterization of ligand binding sites to the superimposition of protein structures (Joseph et al., 2010b). Furthermore, SAs are also very well suited to analyze the internal dynamics of protein structures. SAs have been used at three different levels to comprehend protein flexibility: (i) for studying specific fundamental biological and biomedical problems, (ii) to analyze changes associated with protein complexation and allostery, and (iii) to predict protein flexibility.

Here, we present state-of-the-art of developments in the study of protein flexibility using SAs based approximation. The backbone conformational variations can be described as changes in the pattern of SAs, which acts as fingerprints of the dynamics involved. These innovative approaches are useful, customizable, and deal with specific proteins involved in pathologies and diseases. They are also powerful to evaluate generalized principles from large biological complex structures. Thus, SAs provide new vision for detailed analysis and prediction flexibility of proteins.

# The Different Views of Protein Structures

The primary sequence of the protein—the succession of amino acids—is assumed to encompass all the information necessary for its function. The protein structures resolved from X-ray crystallography or Nuclear Magnetic Resonance (NMR) (see **Figures 1A,B**) can be obtained in the Protein DataBank format (PDB, Bernstein et al., 1977; Berman et al., 2000). From the very beginning, theoreticians or experimentalists have described local protein structures by using three states (see **Figure 1C**, Corey and Pauling, 1953; Kabsch and Sander, 1983; Eisenberg, 2003). Two of them are repetitive structures stabilized by hydrogen bond patterns, namely the α-helices and the β-sheets (composed of β-strands). These structures are connected with more variable structures, i.e., random coil or loops. Later studies have identified spotted small repetitive and regular structures such as the βhairpins or different kinds of turns in several protein structures (Richardson, 1981). These simplified descriptions were nicely represented with 3D visualization software (e.g., arrows for βsheets, springs for α-helix) and accompanying the emergence of macromolecular crystallography. However these simplistic representations also contributed to the static and rigid views of these structures (Chavent et al., 2011).

In fact, growing evidence shows that proteins are highly dynamic macromolecules and that this dynamics is crucial in many biological processes. Thus, recent studies have demonstrated that conformational transitions in folded states of many proteins are essential to accomplish their functions, e.g., enzyme catalysis, activity regulation (Goh et al., 2004; Grunberg et al., 2004; Lensink and Mendez, 2008). Flexibility also allows interactions with different partners, with ligands by inducedfit interaction, with other proteins, or nucleic acids to form complex structures. NMR based methods and computational experiments such as Molecular Dynamic (MD) simulations, have largely contributed to gain valuable insights into the observation, understanding, and analyses of flexibility (Hirst et al., 2014). Flexibility can be versatile and covers a large range of timescales and amplitudes of structural modifications. It encompasses different kinds of conformational changes corresponding to (i) mobility of rigid part of the protein, e.g., domain motions (ii) deformability of the protein backbone, e.g., crankshaft motions or (iii) both. These different transitions are shown by analyzing and comparing protein structures (see **Figure 1D**). At a local level, the flexibility can be identified by the information contained in diffraction images of X-ray crystallography experiments and quantified along the refinement process through the Debye-Waller factors (expressed as surface units) also known as "Bfactors" or temperature (displacement) factors. These so-called B-factors reflect atom mobility due to thermal vibration and measure the static disorder. They allow quantifying different levels of flexibility in proteins (see **Figure 1E**, Marsh, 2013). This criterion is also used by majority of flexibility prediction methods (from the sequence) (Schlessinger and Rost, 2005).

In this context, missing coordinates of whole residues in X-ray protein structures (usually labeled as missing residues, see **Figure 1F**) and several dedicated biochemical analyses have suggested these protein segments should be considered as disordered regions (see **Figure 1G**, Uversky et al., 2000; Dunker et al., 2001). From few years, beside the paradigm of a well-defined 3D folded state, new visions of protein structure and dynamics have emerged, namely the Intrinsically Disordered Proteins (IDP) or disordered regions. IDP may exhibit large structural rearrangements like the formation (then the loss) of secondary structures depending on the environment or the interacting partners. The impressive amount of research in this field is motivated by the implication of IDP in multiple crucial biological functions (Dunker et al., 2000; Dunker and Obradovic, 2001), for e.g., 14-3-3 proteins (Uhart and Bustos, 2014) or the Innate Antiviral Immunity (Xue and Uversky, 2014). Nevertheless, the regions with missing residues can be found resolved in other PDB structures of the same (or highly homologous) protein (see **Figure 1I**) (see

**Figure 1H**, Berman et al., 2000). These ambiguous regions, termed Dual Personality Fragments (DPFs, see Dunker, 2007; Zhang et al., 2007), complicate the distinction and per se the definition of disorder versus flexibility (see **Figure 1J**). In **Figure 1**, we show a protease (PDB code 1dbi chain A), the corresponding DPF found (with a good resolution) in another protease (PDB code 1wmd chain A). Correlation between B-factors (representing flexibility) and disorder predictor outputs has been explored and shows a good agreement (Jin and Dunbrack, 2005; Schlessinger et al., 2009).

In the light of the above observations, the classic representation of protein structure as a succession of repetitive ordered secondary structures and random coil does not allow understanding of the complexity associated with structural flexibility. Actually, the coarseness of the secondary structure assignment may prevent from identifying conformational changes. Therefore distinction between flexible loops and rigid loops, for example, cannot be made on the sole basis of a three-state secondary structure assignment. A more precise and local description of protein structure is needed. In this regard, Structural Alphabet (SAs), allow to investigate primarily the complexity of the protein conformations, and consequently of their associated dynamics.

A SA is a library of N structural prototypes (the letters). Each prototype is representative of a backbone local structure of lresidues length. The combination of those structural prototypes is assumed to approximate any given protein structure. Many different libraries have been developed, (e.g., Unger et al., 1989; Fetrow et al., 1997; Camproux et al., 1999; Tung et al., 2007). Depending on the targeted accuracy, the length l and the number N can vary significantly. The length l typically ranges between 4

and 9 while can vary, the most frequent value being close to 20 (see Offmann et al., 2007; Joseph et al., 2010a,b for more details). The various structural alphabets also differ by the description parameters of the protein backbone. The description can be based on Cα coordinates, Cα-Cα distances, α or dihedral angles. The classification and learning methods that were used, are also various, e.g., hierarchical clustering, empirical function, Kohonen Maps, neural network or Hidden Markov Model Besides their interest to provide a finer description, They SA have been also designed for prediction purpose, which requires to decipher the sequence—structure relationship.

As example, in their respective work, Park and Levitt (1995) and Kolodny et al. (2002) aimed at finding representations based on smallest libraries of protein fragments to accurately construct protein structures. Fragments of four to seven residues long were considered in a library of 25–300 fragments. Micheletti et al. (2000) did similar studies and constructed a library that encompassed from 28 to 2561 recurrent local structures.

To date, one of the most developed and comprehensive SA is the Protein Blocks approach (PBs, de Brevern et al., 2000). This SA is composed by 16 local structure prototypes of 5 residues fragments (see **Figure 2**). It was shown to efficiently approximate every part of the protein structure. The PBs m and d can be roughly described as prototypes for the central region of α-helix and β-strand, respectively. PBs a-c primarily represent the Ncap of β-strand while e and f correspond to C-caps; PBs g -j are specific to coils, PBs k and l correspond to N cap of α -helix while PBs n-p to C-caps. PBs have been used to address various problems, including protein superimposition (Gelly et al., 2011; Joseph et al., 2012), general analyses of flexibility (Dudev and Lim, 2007; Wu et al., 2010) or and prediction of structure and flexibility (Zimmermann and Hansmann, 2008; Rangwala et al., 2009; Suresh et al., 2013; Joseph and de Brevern, 2014).

The assignment algorithm (see **Figure 3A**, de Brevern et al., 2000) runs through the 3D structure of the target protein, from the N to the C-ter of the sequence. The algorithm is iterative and uses 5 residues long overlapping windows over the entire sequence to assign a PB to every position. For each "n th" position of the structure, 8 dihedrals ψ (n − 2), ϕ (n − 1), ψ (n − 1), ϕ (n), ψ (n), ϕ (n + 1), ψ (n + 1), ϕ (n + 2) are compared to each of the 16 PBs. The comparison is made by a least squares approach to match the RMSDA criteria (Root mean square Deviation on Angular Values) (Schuchhardt et al., 1996):

$$\text{RMSDA}\left(V\_1, V\_2\right) = \sqrt{\frac{1}{2(M-1)}\sum\_{i=1}^{i=M-1} \left[\psi\_i(V\_1) - \psi\_i(V\_2)\right]^2}$$

$$+ \left[\varphi\_{i+1}(V\_1) - \varphi\_{i+1}(V\_2)\right]^2\tag{1}$$
 $\text{DMCMA formula}$ 

#### RMSDA formula

where V<sup>1</sup> is the vector of 8 dihedral angles extracted from the 5 residues long window, and V<sup>2</sup> is the 8 vector of dihedral corresponding to the individual PB type. The PB with the lowest RMSDA, is assigned to the corresponding position for that window. This PB captures the overall local conformation and approximates the transition along the main-chain smoothly.

PB assignments can be done using the Python PBxplore tool (https://github.com/pierrepo/PBxplore, in preparation). The result is a translation of a 3D structure into a 1D sequence of PBs.

Interestingly, the subtle differences between protein conformations can be captured by the assignment of the PB sequences. By analyzing the variation of PBs assigned at a given position for multiple conformers, the local conformational properties and corresponding changes can be easily identified. Moreover, a quantification of the flexibility at a given position n can be obtained by calculating, the average number of PBs across a set of conformers in this position or the "equivalent number" of PBs (Neq). Neq is based on a statistical metric similar

to Shannon entropy (de Brevern et al., 2000) and is calculated as follows:

$$N\_{\text{eq}} = -\exp(-\sum\_{\mathbf{x}=1}^{16} f\_{\mathbf{x}} \ln \left( f\_{\mathbf{x}} \right)) \tag{2}$$

Neqformula

Delano, 2013).

where f<sup>x</sup> is the frequency of PB x (x takes values from a to p). A Neq value of 1 indicates that only one type of PB is observed, while a value of 16 is equivalent to a random distribution. For example Neq value equal to 6, could mean that 6 different PBs are observed in equal proportions (1/6), or that more than 6 PBs are observed in different proportions. By plotting the computed value for each residue position (see **Figure 3B**), it is possible to easily localize which protein regions present local conformation change, or in other words, which regions represent local flexibility.

This PB derived-entropy index is an interesting feature of PBs, which can be used to analyze PB prediction (de Brevern et al., 2000) or an ensemble of structures, corresponding to the same protein solved in different experiments, or to several structures extracted from MD simulation (Jallu et al., 2012). Note that PBxplore can be used to calculate Neq, and to visualize in various ways the PB variation for each position from a collection of models or through a MD trajectory (de Brevern et al., 2005).

Other interesting SAs used in the flexibility context. We have proposed an extension of our SA through a novel library consisting of 120 overlapping structural classes of 11-residues fragments, firstly defined as PBs series (Benros et al., 2006). This library was constructed with an original unsupervised structural clustering method called the Hybrid Protein Model (de Brevern and Hazout, 2003). For each class, a mean representative fragment, or "local structure prototype" (LSP), correctly approximate the local structures with an average Cα RMSD of 1.61 Å. LSPs capture both the continuity between the identified recurrent local structures and long-range interactions. From this description, two methodologies were developed to predict flexibility. The first one was based on simple logistic functions and supervised with a system of experts (Benros et al., 2006). The second one was a combination of Support Vector Machines (SVMs) and evolutionary information (Bornot et al., 2009).

Pandini and co-workers developed their own SA; it is derived from the notion of attractors in conformational space, a more complex approach than PBs (Pandini et al., 2010). Pandini and co-workers developed their own SA; it is derived from the notion of attractors in conformational space, a more complex approach than PBs (Pandini et al., 2010). They focused on four-residue long fragments, the conformation of each being defined by internal angles between Cα atoms, i.e., two pseudo-bond angles and one pseudo torsion angle. All protein fragments were mapped as points in a three-dimensional space of these internal angles. The optimal number of clusters, i.e., structural prototypes, was assessed by the quality of the reconstructed protein structures and by information content. They ended with an alphabet of 25 letters, called M32K25. The alphabet starts from extended structures (e.g., A letter) and ends with turns (e.g., Y letter), passing through loops (e.g., P letter) and helical structures (e.g., U letter). The authors compared their approach with other SAs of four-residue fragments and showed the superiority of their method (Camproux et al., 2004; Tung et al., 2007). An interesting point was the analysis of the correlation between local flexibility and variability in the assignment. Thereafter, they have developed GSATools, (http://mathbio.nimr.mrc.ac. uk/wiki/GSATools, Pandini et al., 2013), composed of a set of programs, that encode ensembles of protein conformations into alignments of structural strings using their Structural Alphabet. This software package is particularly well suited for the investigation of the conformational dynamics of local structures, the analysis of functional correlations between local and global motions, and the mechanisms of allosteric communication. It performs a wide range of statistical analyses using a various set of external tools, mainly from R (Ihaka and Gentleman, 1996) and Python (Python Software Foundation, 2015). The software has been integrated into the GROMACS environment (Lindahl et al., 2001; Van Der Spoel et al., 2005). The user must compile it specifically.

GSATools was used to finely analyse the NtrC receiver domain and its homologs CheY and FixJ. For this purpose, different conformations of the protein extracted from a MDs simulation were encoded. The distributions of SA strings were used to compute different mutual information matrices using information theory. Remarkably, they were able to detect allosteric signal transmission from protein dynamics (Pandini et al., 2012). They also applied this methodology to a larger set of related proteins to show how evolutionary conservation and binding promiscuity have opposite effects on intrinsic protein dynamics (Fornili et al., 2013). Other examples are provided in Section 4.

These innovative approaches have been useful to study specific proteins implicated in pathologies and diseases. They are also sufficiently powerful to analyze large datasets of protein structures using automated pipelines. To summarize, SAs provide new visions for the analyses and prediction of protein structure flexibility. Different examples will be detailed in the following sections.

# Duffy Antigen/Chemokine Receptor (DARC) Protein

Using the approaches described above, we analyzed conformations of different proteins implicated in pathologies. A very first study was done on predicting flexibility of loops in the Duffy antigen/receptor for chemokine (DARC) protein (Cutbush and Mollison, 1950; Compton and Haber, 1960). DARC is a transmembrane protein localized in the plasma membrane of red blood cells. It is a non-specific receptor for several chemokines (Allen et al., 2007); it is also named atypical chemokine receptor 1, Fy glycoprotein (FY), or CD234 (Cluster of Differentiation 234). The transmembrane chemokine receptors comprise two main families, defined by differences in their ligands. Indeed, chemokines can contain either two consecutive Cysteines (the CC chemokines) or two adjacent Cysteines with one amino acid in-between (the CXC chemokines). Furthermore, the two families of chemokine receptors have a specific linear sequence motif in their C-terminus region that enables signal transduction. In contrast, DARC lacks the specific motif, thus showing a specific difference coming probably from a distinct evolution.

This protein is also known as the receptor for the human malarial parasites Plasmodium vivax and Plasmodium knowlesi (Miller et al., 1975, 1976). Polymorphisms of DARC are the basis of the Duffy blood group system. While malaria is the most important sickness associated with DARC (Guerra et al., 2006; Cutts et al., 2014), DARC plays also a role in numerous other diseases, such as HIV and cancer, and risk factor associated with many other diseases is emerging (Liu et al., 1999; Horne and Woolley, 2009).

Like most transmembrane proteins, no experimental structure of DARC is currently available (de Brevern et al., 2005). We designed a structural model based on a comparative modeling approach. Using rhodopsin (the only available related structure at this time) as a structural template (a simple alignment showed a very low sequence identity value of 12%, e.g., close to a random value), we carefully built different structural models, based on a hierarchical and iterative procedure. A first step was to predict using more than 10 methods the positions of the 7 transmembrane helices along the sequence. From this initial and rough model, helices of DARC were aligned with rhodopsin helices assigned from the 3D structure. The same methodology was used for the loops, a complete alignment was generated using helices and connecting loops. A specific treatment was done for N- and C-termini region, combining Protein Blocks prediction

(de Brevern et al., 2004; Etchebest et al., 2005) with threading approaches.

Experimentally, 40 Alanine mutants had been produced and associations binding constants with CXC-L8 were evaluated (Tournamille et al., 2003, 2005). We used these experiments to assess the quality of our best refined models. From the results, we generated new models by manually changing the positions of helices (and the alignments). Building and refinements were done 10 times until a proper set of characteristics were obtained. In regards to these experiments, in silico analysis of protein flexibility has underlined specific characteristics of different epitopes and interaction regions.

Interestingly, we obtained two different conformations (see **Figure 4A**) that were both as compatible with experimental data and similarly scored by the few assessment approaches available for transmembrane structural models. Interestingly five years later, an attempt to generate better models with the best available methods was not crowned with success (de Brevern et al., 2009; Smolarek et al., 2010).

It took us one year to build such models (models are available at Model Archive website (http://modelarchive.org/, Schwede et al., 2009). The Nterminus is particularly important in the infection by Plasmodium vivax (Batchelor et al., 2014). It is nearly 55 residues long and different disorder prediction methods (i.e., DisEMBL, Linding et al., 2003 or PrDOS, Ishida and Kinoshita, 2007), predicted as partially disordered, with the beginning of the sequence as fully disordered.

To evaluate the different conformational states of the Duffy protein, we carried out numerous MDs simulated annealing simulations with the GROMACS software (Lindahl et al., 2001; Van Der Spoel et al., 2005). MD simulated annealing allows a harsh sampling of the conformational space by crossing energetic barriers in an efficient and fast way. Many runs were performed and the different conformations obtained at room temperature were analyzed using Protein Blocks. In practice, we encoded each 3D protein structural model conformation into a 1D string (the length of the protein sequence) using Protein Blocks. Then, we computed, the number of times each PB was observed for each position. Positions with a high frequency of a single PB exhibit no local change, while some others positions exhibit local deformations that require a more in-depth analyses. Few variations could be observed in the helical regions (PB m and encompassing PBs) that were weakly restrained with harmonic forces. Instead, loops sampled large regions of the conformational space. A very interesting result was observed for the Nterminus region, and especially the distal region. In contrast to what was suggested by disorder predictors, this region was not a random coil region, but in fact a small β-sheet composed of two β-strands (PBs d and encompassing PBs, seen in orange on **Figure 4B**), connected by a short turns (in yellow). In the β-sheets, some positions, e.g., 12 and 13, were invariant. Likewise, the closest region to the first helix was more constrained than expected and not disordered (in pink and violet). Even the central regions (in gray) showed some tendencies to be structured. It was a striking example of a complex series of conformations which cannot be analyzed for instance through classical secondary structure (Kabsch and Sander, 1983).

A second example on DARC loops was the last extra-cellular loops for which a specific and constrained loop conformation was observed. Remarkably, this unexpected conformation explains a "lethal" mutation for the binding of CXCL8. It was the first time a structural alphabet was used to analyze the dynamics of a protein structures or structural model.

## Human Integrin α2bβ3

In another project, we were interested in integrins, a large family of cell surface receptors involved in cell—cell or cell matrix adhesion. Integrins are type I membrane glycoproteins composed of two distinct α and β subunits. Each subunit has a large extracellular region (composed of multiple structural domains), a trans-membrane segment and a short intracellular domain. Integrins interact with cell cytoskeleton and mediate bidirectional trans-membrane signal transduction. These receptors are expressed in vertebrate, but also in lower metazoans including sponges, nematode Caeorhabditis elegans and fruitfly Drosophila Melanogaster. In mammals, 18 α and 8 β subunits assemble in 24 distinct integrin complexes. Integrins play critical roles in many physiological processes like hemostasis, immune response, leukocyte trafficking, development and angiogenesis or in pathology like cancer. In human, they are responsible for many diseases from genetic or immune origins. They also make effective targets for drug therapies in thrombosis and inflammation. Furthermore, integrins are binding sites for many viruses and bacteria (Hynes, 2002; Takada et al., 2007).

In regard to these various characteristics, integrins have been extensively studied over the past decades. Especially, structural analyses have provided substantial insights to explain functional mechanism(s). In 2004, the first structure of the extracellular domain of αVβ3 integrin, a vitronectin receptor found in platelets, was proposed (Xiao et al., 2004). Then, several structures of αVβ3 but also of αIIbβ3 integrin (Zhu et al., 2008), a fibrinogen receptor involved in platelet aggregation, were resolved in different activation states. Molecular models for both trans-membrane and cytoplasmic domains were also proposed. Thus, it opens the way to investigate impact of mutant using in silico mutagenesis.

Hence, we examined the effect of the β3-Leu253Met substitution of αIIbβ3 complex in patients with Glanzmann thrombasthenia (Jallu et al., 2010), a rare bleeding disorder characterized by an impaired platelet aggregation (George et al., 1990). For the first time, we showed that residue Leu253 localized at the interface of the complex—is playing a major role in the stability of αIIbβ3. Nonetheless, structural models reflecting static specific states do not depict structural dynamics accompanying the various aspects of integrin functions. For instance, when integrins are activated by substrates, large conformational changes are observed. Analyses of static structures (e.g., B-factor, electrostatics), give only a limited view of the protein complex behavior, contrary to MDs simulations which are able to some extent, to reproduce the inner dynamics of protein structures.

α and β subunits of integrins are associated to rigid, flexible and even disorder properties (such as Duffy protein presented in the section above). We ran independent MDs simulations on different systems, i.e., the wild type but also variants and mutants, using GROMACS MDs package (Van Der Spoel et al., 2005) to examine specific regions of αIIbβ3. We observed different opposite behaviors depending on the region and mutants studied.

Hence, we studied the Cab3a<sup>+</sup> alloantigen resulting from a Leu841Met substitution in the αIIb chain. This polymorphism might result in severe life-threatening thrombocytopenias. Cab3a<sup>+</sup> corresponds to a Leu841Met mutation. We evaluated the flexibility by using Neq index and found that this polymorphism locates in a very flexible sequence in the wild type (with a Neq > 4), but the mutation did not modify the Neq behaviors (Jallu et al., 2013). Moreover, no change in the secondary structure content, neither the PBs adopted by residues of encompassing sequences change. Hence, intriguingly, this substitution would have little effect, if any, on the backbone structure of the peptide 829– 853. It must be noticed that disorder prediction does not show this region has flexible property, i.e., prediction with IUPred (Dosztanyi et al., 2005) or DisEMBL (Linding et al., 2003).

In Caucasian population, the Human Platelet Alloantigenic (HPA) system 1 is involved in most neonatal thrombocytopenias (NAITP) and post-transfusion purpura (PTP) (Espinoza et al., 2013). The HPA-1 system results from a Leucine to Proline substitution in position 33 of the β3 chain (alleles HPA-1a and HPA-1b, respectively) in platelet αIIbβ3 integrin (Jallu et al., 2012). Alloantibodies to the HPA-1a variant can induce very severe immune thrombocytopenia (Espinoza et al., 2013). Furthermore, the Pro33 allelic variant of β3 is considered as a risk factor of thrombosis in patients with cardiovascular diseases.

To compare the HPA-1a and -1b variants, we have proposed for the first time to use a combination of standard analysis of flexibility (namely Root Mean Square Fluctuation, RMSF) and Protein Blocks analyses. MD simulations have revealed that (i) the Leu33Pro substitution of the β3 knee (a domain of β3 integrin chain) leads to adverse structural effects not highlighted by static models; and (ii) that these alterations can explain the increased adhesion potential of HPA-1b platelets to fibrinogen and the possible thrombotic risk associated with the HPA-1b phenotype (Jallu et al., 2012). These molecular simulations also support a novel structural explanation for the epitope complexity of the HPA-1 antigen (Jallu et al., 2012).

Although not yet known to be involved in an alloimmune response, a third variant discovered more recently and characterized by a Valine in position 33 of β3, was also examined. Analyses of the protein flexibility properties can mainly explain the variable reactivity of anti-HPA-1a alloantibodies. This result suggests that dynamics plays a key role in the binding of these alloantibodies. Unlike the L33P substitution which increases the local structure flexibility, the L33V transition would not affect the local structure flexibility, and consequently the functions of αIIbβ3 (Jallu et al., 2014). Although, this region is considered as rigid by disorder prediction, both RMSF and PBs analysis shows a high mobility. This behavior may be explained by a local rigidity, surrounded by deformable regions.

**Figure 5** represents another MDs simulation focusing here only on the Calf-1 domain (a domain of α2b integrin chain), using same parameters as before. Simulations were analyzed through PB approaches underlining its interest for flexibility studies using PBxplore. **Figures 5A,B** show the superimposition of two distinct snapshots (in red and in yellow) extracted from the MDs simulation. **Figure 5C** shows the frequency of PBs at each position, calculated along the MD trajectory, and represented as a WebLogo graphic (Crooks et al., 2004) obtained with PBxplore. WebLogo (Crooks et al., 2004) summarizes this information with an entropy of every PBs at each position. **Figure 5D** is the superimposition of Neq and RMSF. Interestingly, even though some regions show similar tendencies, namely large RMSF associated with large Neq, other regions exhibit different and even opposite tendencies. For example, focusing on the residues near position 66 of Calf-1, the RMSF given **Figure 5G**, is the highest one (in blue on **Figure 5E**) as highly flexible, but it is not the case as the Neq-values for this residue is not high. Therefore, these residues appear to be a mobile region between two deformable regions. This example confirms the interest to examine Neq index beside RMSF because each measure brings related but different information on flexibility.

**Figure 6** shows the structural alphabet distribution during the simulation obtained with GSATools (see Section The Different Views of Protein Structures). The most frequent letters seen (in black) are from the beginning of the alphabet, underlining its allβ composition (**Figure 6A**). The decomposition by this SA shows a large number of conformational changes at each position of the sequence. Only few positions, e.g., 10, 131, and 44 represented by B, H, and X letters, respectively, remained unchanged during the Calf-1 simulations. The transition probability matrix calculated between SA letters (**Figure 6B**) reflects how the local structure changes occur. Along the diagonal, high values are found, the highest ones being for letters N and X while the lowest ones being for letters U and Y The Mutual Information (MI) matrix presented in **Figure 6C** describes the correlation of local conformational changes among the protein fragments. Significant off-diagonal values are found but actually they correspond to strands forming β-sheets. Hence, in contrast to the examples detailed in (Pandini et al., 2012, 2013), the allβ conformation of the protein impedes to enlighten long-range correlations, except between β–strands close in 3D and found all along the protein sequences. The Shannon entropy per position shows quite similar profile between β –strands (mainly between 1.0 and 2.5 bits). All the lowest values correspond to residues inside loops. One of the most interesting features of GSATools is the graph representation of the correlated local motions from the MI matrix; it describes the relative importance of the nodes in the network useful to analyze allosteric behaviors. **Figure 6F** is a visualization of the two most important peaks underlined, they are found far away from the rigid β -sheet region.

# Protein Complexes and Allostery

It is well documented that protein–protein interactions are often guided by flexibility (Jones and Thornton, 1996; Salwinski et al., 2004) and that alternative conformations can have a significant influence on the binding process. It is why predicting the structure of a complex using the unbound structures of the partners remains highly challenging, despite a scrutinizing examination of the amino acid composition of the interface (Janin et al., 2008). Thus, in most cases, protein structures change during the formation of the complex. The changes can be limited to few side chains motions but can also correspond to major reorganization in the fold. Therefore, we undertook the analysis of the protein–protein complexes in the light of structural alphabet. We compared proteins 3D structures in free form, and as part of larger macromolecular complexes.

The building of the protein dataset was quite strict leaving only 76 high quality complexes representing very different configurations with free and bound forms (Swapna et al., 2012). Accordingly, structural changes occurring between the free and bound forms of the protein were analyzed using three different measures: the Cα root mean square deviation, the percentage of PB change and a specific PB substitution score. This last score relies on a PB structural substitution matrix that quantifies the cost to replace a given PB by another PB. The more similar the PBs, the more favorable the substitution score. Consequently, this score permits to quantify the conformational change by distinguishing similar PBs from to the most distinct ones. Comparison between unbound and bound forms shows that significant structural rearrangement occurs at the interface but also in regions away from the interface upon the formation of a highly specific, stable and functional complex. For 50% of them, which correspond to signaling proteins, the major changes correspond to allosteric ones, localized far away from the interface. These sites could be associated to mutations known to be involved in multiple diseases such as cancer. PB allows distinguishing here also between large movements, from mobility to deformability or flexibility. Normal Mode Analysis was also performed to gain deeper insights (Swapna et al.,

2012). The results obtained for signaling complexes underline the importance of allostery-like structural changes much more than appreciated before (see **Figure 7**).

Flexibility becomes a critical issue in complexes especially the ones involving intrinsically disordered protein. Fine analyses have shown that disordered proteins can also adopt well-defined conformations in their bound form; their inherently dynamic nature is cast into their complexes (Meszaros et al., 2011). Protein families with more diverse interactions exhibit less average disorder over all members of the family (Fong and Panchenko, 2010). Inter-domain linkers are evolutionarily well conserved and are constrained by the domain-domain interface interactions (Bhaskara et al., 2013). An interesting resource is the ComSin database which provides a collection of structures of proteins solved in unbound and bound form, targeted toward disorder–order transitions (Lobanov et al., 2010).

# Protein/DNA Interfaces

Beside protein-protein interactions, which govern many biological functions, fundamental biological processes like transcription also require complex formation, i.e., between protein and DNA. As for protein-protein interaction, complexation can change structures of both partners, but most studies focused on the protein side. Most of protein/DNA interfaces only extend the classical approaches to analyze protein/protein interfaces or protein/ligands interface. For instance, in PDIdb (Ferrada and Melo, 2009) or Biswas and coworkers studies, the interface is classified into core and rim regions, the first one being more sequentially conserved. Biswas and coworkers proposed a new classification scheme for the interfaces based on the composition of secondary structures (Biswas et al., 2009). Beyond this description in terms of

regular local structures, Sunami and Kono (2013) conducted a quantitative analysis to understand the conformational changes in proteins when they bind to DNA. They compared DNA-free and DNA-bound forms of proteins and used structural alphabets to describe conformational changes in 4-residue fragments. They found that (i) three specific alphabets appeared in the DNA interfaces, (ii) conformational changes in DNA interfaces are more frequent than in non-interfaces and importantly, (iii) regions involved in DNA interfaces have more conformational variations in the DNA-free form. This study underlines also the importance of intrinsic flexibility of interacting regions to fit into DNA structure.

Another recent analysis has explored an extensive set of protein/DNA complexes and looked at conformational changes occurring in proteins but also in DNA. Importantly, for both molecules, structural alphabets were used. The alphabet used for describing protein backbone is the Protein Blocks. For DNA, a structural alphabet was obtained using a new approach of registering torsion angles of a dinucleotide unit combined with Fourier averaging and clustering (http://www.dnatco.org/, Svozil et al., 2008; Cech et al., 2013). These structural alphabets describe biopolymer conformations at greater detail than the 3-state protein secondary structure and basic DNA structural types such as A, BI and BII. **Figure 8** shows an example of different conformations. This study compared structural features of the protein/DNA interface with the features of non-interacting parts of protein and DNA molecules. Clear differences in preferences for occurrences of local protein and DNA conformations were observed. Specific preferences were underlined between complexes containing various types of proteins such as transcription factors and nucleases. Minor DNA conformers are often significantly enriched at the interface so that

the ability of DNA to adopt non-canonical conformers, rare in naked DNA, is clearly essential for the recognition by proteins. Rare DNA conformations introduce significant deformations to the DNA regular structure. The occurrence of these rare forms was estimated and characterized enabling a better understanding of the role of non-B-DNA structures. A critical feature was the distinct interaction patterns for the DNA minor groove relative to the major groove and phosphate, and the importance of water-mediated contacts. Indeed, water molecules mediate a proportionally largest number of contacts in the minor groove and form the largest proportion of contacts in complexes of transcription factors (Schneider et al., 2014). It corroborates to previous researches on the importance of mobility of such water molecules (Luo et al., 2011; Russo et al., 2011).

whether they are intrinsically mobile or rigid. Regions identified to be

The above-discussed analyses pointed to some remarkable features about the protein/DNA interfaces, so that we performed a more specific analysis of the protein and DNA dynamics based on crystal structures. The analysis of B-factors (Schneider et al., 2014) showed that the dynamics of biopolymer residues, amino acids and nucleotides, as well as ordered water molecules is first of all a function of their neighborhood: amino acids in the interior of proteins have the tightest distribution of their displacements, residues forming the biopolymer interfaces (protein/protein or protein/DNA) intermediate, and residues exposed to the solvent the widest distribution (**Figure 9**). This general picture is best pronounced for structures with the highest crystallographic resolution since discrimination of different types of residues in structures becomes unclear with lower crystallographic resolution. Besides, amino acid residues in the protein core display a unique feature: their backbone and side chain atoms have virtually identical B-factor distributions. The protein core is therefore extremely well packed leaving minimum free space for atomic movements. B-factors of water molecules bridging protein and DNA molecules were surprisingly significantly lower than B-factors of DNA phosphates; in opposite, solventaccessible phosphates were extremely flexible. An unexpected conclusion of this analysis is that a part of the observed trends could be due to improper refinement protocols that may need slight modifications (Schneider et al., 2014). Hence, the B-factors of high-resolution structures reflect the expected dynamics of residues in protein–DNA complexes but the B factors of lower resolution structures should be treated cautiously. Based on such kinds of ideas, Vriend proposed a dedicated dataset of refined B-factors (http://www.cmbi.umcn.nl/bdb/, Touw and Vriend, 2014).

the program PyMOL (http://www.pymol.org, Delano, 2013).

## PTMs

As seen in the previous sections, protein flexibility is essential for interactions between proteins and ligand, nucleic acid, or protein partners. Apart from interaction with partners, chemical modifications like formation or breaking of covalent bonds, can impact structural and dynamics properties. One of the most spectacular examples is depicted by the serpin family members when they interact with the protease (see **Figure 10**, Huntington et al., 2000; Kim et al., 2001). An initial large conformational change, consecutive to the cleavage of the reactive center of the serpin by the protease, occurs. The loop involved in the cleavage moves, folds as a β-strand that inserts between the other strands of the β-sheet composing the serpin protein core. The two

FIGURE 8 | Examples of protein/DNA interactions. (A) Structure of human centromere protein B (CENP-B) binding to DNA CENP-B box (PDB code 1HLV, Tanaka et al., 2001). The image highlights contacts between arginine 125 (chain A, green) in PB m (regular helix) and cytosine 15 (chain B, red) in ntC 41. (B) Details of methionine

repressor protein (MetJ) binding to DNA metbox (PDB code 1MJQ, Garvie and Phillips, 2000). The same PB m and amino acid residue (arginine 40 in chain H, green) is in contact with guanine 2 (chain K, red) in NtC 13. Visualization was created by the program PyMOL (http://www.pymol.org, Delano, 2013).

proteins are tightly linked, which significantly affects the protease that looses more than 30% of its structure.

Among chemical modifications, post-translational modifications (PTMs), like phosphorylation, play a major role in many biology processes. Integrins, for example, can be activated consecutive to phosphorylation. The impact of these modifications on the structure and the dynamics of proteins is thus of particular interest.

Recent studies have shown that PTMs have significant effects on the protein conformations and on their flexibility. Hence Xin and Radivojac used 3D structures from the PDB and studied the conformational heterogeneity of protein structures corresponding to identical sequences in their unmodified and modified forms (Xin and Radivojac, 2012). They demonstrated that PTMs induce conformational changes at both local and global level, but with a limited impact. Accordingly PTMs would affect regulatory and signaling pathways (Nussinov et al., 2012; Xin and Radivojac, 2012) by subtle but common mechanisms of allostery. Some prediction approaches and are included into dedicated databases (Matlock et al., 2015), but few analyzed precisely the whole PTMome.

This led us to conduct a deep analysis of structures of the same protein with or without PTMs. As an example, we selected 157 PDB chains of the human Cyclin-dependent kinase 2 (UniProt AC: P24941) in complex form, and 222 PDB chains of unbound monomer. Based on data from PTM-SD (Craveur et al., 2014), a database of structurally solved and annotated post-translational modifications, 112 chains among the 157

complexes, present a phosphorylated threonine at position 160 in the structure of the kinase. As described in **Table 1**, we compared the backbone flexibility of three different cases: unbound kinase, kinase complex, and phospho-Thr160 kinase in complex.

Comparison of the three Neq profiles, shown in **Figure 11**, highlights significant differences in local flexibility of the kinase structures. **Figure 11A** shows that, when kinase is in unbound form, the polypeptide chain presents a flexible fragment (colored in green), which corresponds to a large loop. When complex is formed (**Figure 11B**), this loop is placed at the interface and leads to stiffening of its edges and higher flexibility in the neighborhood of Thr-160. This change is characterized by a diminution and an increase of Neq-values, respectively. Finally, when the Thr-160 is phosphorylated (**Figure 11C**), the green region becomes comparatively rigid, which results to limited flexibility (Neq ≤ 3.16). However, another region in kinase (position 8 to 18) is associated with increasing flexibility. When the complex is forming, the Neq range in this area increases from (1;2.77) to (1;3.76), and secondly, when the phosphorylation is in place, the range increases to (1;5.91). Interestingly, this region corresponds to the neighboring positions of two other phosphorylation sites, at Thr-14 and Tyr-15. It is important to note that these phosphorylations were absent in the structures used here for the Neq computation.

In a functional point of view, the phosphorylation in position 160 is known to promote the activation of the kinase, while the phosphorylation of position 14 and 15 slightly reduce its activity (Gu et al., 1992). Thereby, the changes in flexibility observed at these 3 phosphorylation sites, could reflect that the activity of the kinase is regulated by a mechanism of complementary rigidity/flexibility of local protein backbone, which could be related to allosteric effects.

The red line plotted in **Figure 11** represents the number of available structural data for each position. Interestingly, the green region in **Figure 11** is proportionally less resolved when kinase is in monomer than when it is in complex, and even more solved when the Thr-160 is phosphorylated. This observation emphasizes that the decrease of flexibility in this region facilitates the resolution of the structures. Several structures of the same protein present specific regions that are disordered in some crystals and ordered in others. These regions were defined by Zhang and collaborators as "Dual Personality Fragments" (Zhang et al., 2007), and the corresponding fragment of the green region in Cyclin dependent kinase was the emblematic example used by Zhang et al. (2007) to defined DPF. In the same way, the region between positions 35 to 45 were also identify as DP fragments.

## Prediction of Protein Flexibility

The growing gap between the number of protein sequences and the number of atomic structures imposes to resort to alternative approaches to gain structural and dynamics information. They are mainly based on crystallographic B-factor analyses. It is often seen that crystallographic B-factors are a mix of properties, dynamics being one of them. Recent approaches show that NMR spectroscopy provides an ever increasing amount of dynamics data, going well beyond the simple thermal vibrations (Powers et al., 1993; Palmer, 2001; Olsson et al., 2014). None of them can describe all the important flexible movement or even disorder. Hence, it must be taken into account that everything in protein dynamics cannot be assessed based on a single view Prediction methods are therefore of particular importance. Flexibility prediction from sequences started as a Boolean prediction, i.e., rigid or flexible, using simple statistical analyses of Bfactor values (Karplus and Schulz, 1985; Vihinen et al., 1994). Following developments combined evolutionary information to different machine learning methods, such as Artificial Neural Networks (Schlessinger et al., 2006), support vector regression coupled with random forest (Pan and Shen, 2009), and support vector machines (Kuznetsov, 2008; Kuznetsov and McDuffie, 2008). Additional sources of information were progressively take into account, rather than X-ray B-factors, as Nuclear magnetic resonance data (NMR) (Trott et al., 2008; Zhang et al., 2010), dihedral angles and accessibility (Hwang et al., 2011), or computational data from Normal Mode Analysis (Hirose et al., 2010). At last, some methodology, dedicated to predict protein disorder were also developed and designed to high flexibility prediction (Galzitskaya et al., 2006; Mamonova et al., 2010; Jones and Cozzetto, 2015). Recent approaches are quite complex like (i) the DynaMine webserver (http://dynamine.ibsquare.be/, Cilia et al., 2013, 2014), DynaMine predicts backbone flexibility at the residue-level in the form of backbone N-H S<sup>2</sup> order parameter values learnt from NMR data, or (ii) as a predictor which used relative solvent accessibility (RSA) and custom-derived amino acid (AA) alphabets. The prediction is done in two-stage linear regression model that uses RSA-based space in a local sequence window in the first stage and a reduced AA pair-based space in the second stage as the inputs (Zhang and Kurgan, 2014).


TABLE 1 | Composition of structural complexes involving the Cyclin-dependent kinase 2.

We also proposed prediction of protein flexibility of an amino acid sequence using the potentialities of SA prediction. The approach is not only innovative through the use of local protein conformations, but also with specific definition of flexibility. Flexibility is often defined based on α-carbon Bfactor values obtained from X-ray experiments. As mentioned above, these data reflect protein flexibility, but may also be prone to experimental and systematic biases. Hence, flexibility was considered with X-ray B-factor descriptors and the RMSF observed in MDs simulations, which is calculated from the amplitude of atom motions during simulation. Both descriptors were combined to define and to examine flexibility classes of SA.

This dedicated prediction method is divided in two steps: first an SA prediction from sequence, and second a flexibility prediction from the SA predicted. The SA used in this method is the LSP (see Section The Different Views of Protein Structures). They consist of 120 overlapping structural classes of 11-residue long fragments (Benros et al., 2006), which encompass all known local protein structures and ensure good quality 3D local approximation. The major advantage of this library is its capacity to capture the continuity between the identified recurrent local structures (Benros et al., 2009). We can notice that is quite difficult to have a good correlation between theoretical results to actual experiments. With LSPs, we have shown that they have on average a correlation > 0.9 with B-factors.

Relevant sequence–structure relationships were also observed and further used for prediction. Briefly, LSP prediction is based on SVM training. With the LSP prediction, a Confidence Index (CI) that is based on the discriminative power of the SVMs is provided. The higher CI, the better the prediction rate is. The prediction rate reaches 63.1%, a rather high value given the high number of structural classes (Bornot et al., 2009).

In a second step, we considered the two descriptors for quantifying protein dynamics, X-ray B-factors and RMSF. They were combined to define 3 flexibility classes of LSPs: rigid, intermediate and flexible. Then for each 11-residue long target sequences, the SA prediction provided a list of five possible LSP candidates. Based on the previously defined flexibility classes of these structural candidates, the prediction of target flexibility is made. Interestingly, the prediction rate is slightly better than the one of PROFbval (Schlessinger et al., 2006) that was optimized for only two classes.

Hence, the originality of the method lies (i) in the use of a combination of B-factors and RMSF for quantifying protein dynamics, (ii) in prediction of flexibility through SA prediction of LSPs, and (iii) in prediction of three classes of flexibility, which are usually limited to two. The method is implemented in a web server named PredyFlexy (http://www.dsimb.inserm.fr/dsimb\_ tools/predyflexy, de Brevern et al., 2012), in which the users have access to a confidence index (CI) for assessing the quality of the prediction rate.

## Conclusion

The protein structure organization is characterized by a conformational arrangement of repetitive structures (secondary structures, i.e., α-helices, β-sheets and coils/loops). Static observation of protein organization has revealed some of their essential properties, i.e., active sites are generally found at the protein core in which residues are well packed and mainly hydrophobic, while the surface residues, exposed to solvent or to another partner(s) (protein, DNA), are more flexible because less constrained than the core. The function of proteins and their interaction mechanism need some flexible properties that are considerably more complex than this simplistic binary view.

By exploiting various structural data sources and by developing different computational methods (B-factor, NMR data, MDs Simulation, NMA, . . . ) dynamics of proteins turn out to cover a large spectrum of conformational changes (combined by mobility of rigid fragment and deformability of backbone), by the existence of intrinsic disorder region, by allosteric effect. . . Some of these flexible mechanisms need structural reorganization at a local level. Thus investigation of protein flexibility requires a more local and complex description of protein structures than the classic representation.

In this review we have illustrated using numerous examples (DARC protein, Human integrins, Protein Complexes, Protein/DNA interfaces, Proteins with Post-Translational Modifications) how the approaches, based on Structural Alphabets, are a valuable tool to study flexibility at this level.

From our experiences with these examples, we can state that the use of SAs allows to tackle and address the important problem of the comparison of an ensemble of protein conformations. Indeed, in a recent paper, Scott and Strauss (Scott and Straus, 2015) underlinesthe bias related to the use of RMSD, which needs beforehand an optimally superimposed approach often remains as rigid bodies. They proposed an elegant method, fleximatch, of protein structure comparison that tries to take flexibility into account. As it was done for protein superimposition methods (Yang and Tung, 2006; Tung et al., 2007; Tung and Yang, 2007; Le et al., 2009; Budowski-Tal et al., 2010; Gelly et al., 2011; Leonard et al., 2014), SA is an efficient approach, not considering proteins as rigid bodies. We underline the interest of our approach based on Protein Blocks with the PBxplore tools (https://github.com/pierrepo/PBxplore, in preparation) or GSAtools (http://mathbio.nimr.mrc.ac.uk/wiki/ GSATools, Pandini et al., 2013) in other cases. The use of SAs and the development of associated metrics such as Neq is required to study the details and begin to understand the complexity of protein flexibility. It allows discriminating flexibility from mobility and deformability, which is not currently considered by other available methods. Nonetheless, it also had drawbacks as no simple threshold will guide the researcher to point out that certain segment is THE highly flexible part and not the other, same as for RMSF. In the same way, use of information theory with GSATools also requires expertise. Moreover, as SA represents a simplification of the 3D description, its results can be compared to the Normal Mode Analysis based on Elastic Network Model (Suhre and Sanejouand, 2004; Tiwari et al., 2014; Eyal et al., 2015) that are efficient to define large movement. However, changes at a finer level such as side chain rotameric states or minor changes in the backbone (but essential for the biological functions) are more difficult to handle. Here as always, a good knowledge of the biological system is essential as a correct definition of the scientific question and its scale (Buehler and Yung, 2009).

To conclude, we can find that all these approaches are suitable for highlighting both flexible and rigid parts of a protein from

# References


structures derived from NMR, X-ray diffraction or molecular simulation.

# Author Contributions

Section Introduction: PC, APJ, JE, TJN, FN, JR, CE, NS, JCG. Section The Different Views of Protein Structures: PC, APJ, JE, TJN, MG, SL, PP, GF, JR, AG, LSS, RMB, JB, ST, JC, BS, CE, NS, JCG. Section Duffy Antigen/Chemokine Receptor Protein: NS, OB, CE. Section Human Integrin α2bβ3: APJ, NS, MG, PP, JR, JB, ST, VJ. Section Protein Complexes and Allostery: PC, APJ, JE, LSS, RMB, NS. Section Protein/DNA Interfaces: JC, BS, JCG. Section PTMs: PC, JE, TJN, FN, SL, GF, JR. Section Prediction of Protein Flexibility: PC, APJ, JE, SL, GF, AG, CE, JCG. Section Conclusion: PC, APJ, GF, CE, NS. AdB conceived the review and participated in all the different sections.

# Acknowledgments

PC, APJ, JE, TJN, FL, NS, MG, SL, PP, OB, GF, JR, AG, JB, ST, CE, JCG, and AdB are supported by the Ministry of Research (France), University Paris Diderot, Sorbonne Paris Cité (France), National Institute of Blood Transfusion (INTS, France), National Institute of Health and Medical Research (INSERM, France), and labex GR-Ex (France). The labex GR-Ex, reference ANR-11- LABX-0051 is funded by the program "Investissements d'avenir" of the French National Research Agency, reference ANR-11- IDEX-0005-02. This work was supported by the Czech-France collaboration Partenariat Hubert Curien Barrande (MEB021032) for JCG, JC, BS, and ADB. This study was also supported by BIOCEV CZ.1.05/1.1.00/02.0109 from the ERDF and RVO: 86652036 for BS and JC, by University of Toronto (Canada) for LSS, by Max Planck (Germany) for BMR, by INTS (France) for VJ, by CNRS (France) and Univ Lyon (France) for JB, by CNRS (France), INSERM (France) and Univ Strasbourg (France) for JE, by CNRS (France) and Univ Nantes (France) for ST, by HFSP to APJ; by NIH (USA) for GF, by Department of Biotechnology (India) to NS, and by CNRS (France), Univ Pierre & Marie Curie (France) and ANR NaturaDyRe (France, ANR-2010-CD2I-014- 04) to JR. APJ was supported by CEFIPRA number 3903-E and TJN is supported by CEFIPRA number 5203-2. NS and AdB also acknowledge to CEFIPRA for collaborative grant (numbers 3903-E and 5203-2).


in multi-domain proteins. J. Biomol. Struct. Dyn. 31, 1467–1480. doi: 10.1080/07391102.2012.743438


receptor: the Duffy antigen/receptor for chemokine (DARC). Biochim. Biophys. Acta 1724, 288–306. doi: 10.1016/j.bbagen.2005.05.016


protein-protein interaction. BMC Struct. Biol. 10:20. doi: 10.1186/1472-6807- 10-20


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2015 Craveur, Joseph, Esque, Narwani, Noël, Shinada, Goguet, Leonard, Poulain, Bertrand, Faure, Rebehmed, Ghozlane, Swapna, Bhaskara, Barnoud, Téletchéa, Jallu, Cerny, Schneider, Etchebest, Srinivasan, Gelly and de Brevern. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Computational approaches for inferring the functions of intrinsically disordered proteins

Mihaly Varadi 1, 2, Wim Vranken1, 2, 3, Mainak Guharoy 1, 2 and Peter Tompa1, 2 \*

<sup>1</sup> Flemish Institute of Biotechnology, Brussels, Belgium, <sup>2</sup> Department of Structural Biology, VIB, Vrije Universiteit Brussels, Brussels, Belgium, <sup>3</sup> ULB-VUB - Interuniversity Institute of Bioinformatics in Brussels (IB)2 , Brussels, Belgium

Intrinsically disordered proteins (IDPs) are ubiquitously involved in cellular processes and often implicated in human pathological conditions. The critical biological roles of these proteins, despite not adopting a well-defined fold, encouraged structural biologists to revisit their views on the protein structure-function paradigm. Unfortunately, investigating the characteristics and describing the structural behavior of IDPs is far from trivial, and inferring the function(s) of a disordered protein region remains a major challenge. Computational methods have proven particularly relevant for studying IDPs: on the sequence level their dependence on distinct characteristics determined by the local amino acid context makes sequence-based prediction algorithms viable and reliable tools for large scale analyses, while on the structure level the in silico integration of fundamentally different experimental data types is essential to describe the behavior of a flexible protein chain. Here, we offer an overview of the latest developments and computational techniques that aim to uncover how protein function is connected to intrinsic disorder.

Keywords: intrinsically disordered proteins, IDP ensembles, IDP function, disorder prediction, protein ensemble database

# Introduction

The traditional goal of protein structural biology is to relate the well-defined three-dimensional structure(s) of a protein to its biological function. This structure-function paradigm continues to facilitate many important discoveries, but has largely ignored the possible roles of conformational flexibility on function (Forman-Kay and Mittag, 2013). Yet, in the recent years it became apparent that structural disorder is ubiquitously present in diverse cellular processes, and has a particularly prominent role in regulation and signaling events occurring in the complex cellular environment (Tompa et al., 2006; Dunker et al., 2015). Proteins or protein regions that are enriched in conformational flexibility are referred to as intrinsically disordered proteins (IDPs) or protein regions (IDRs) (Dyson and Wright, 2005; Fink, 2005; Tompa, 2005). IDPs and IDRs lack a well-defined, stable three-dimensional fold, and therefore they populate ensembles of dynamically exchanging conformations, separated by low energy barriers. This dynamic behavior challenges the traditional structure-function paradigm (Wright and Dyson, 1999; Chouard, 2011), since it is far from trivial to describe the structural behavior of proteins that adopt such an extensive range of conformations, let alone infer their biological role (Tompa, 2011). Even though intrinsic disorder is occurring ubiquitously—more than 30% of the proteins in the known eukaryotic proteomes have disordered segments of 30 or more consecutive disordered residues

#### Edited by:

Piero Andrea Temussi, Università di Napoli Federico II, Italy

#### Reviewed by:

Alfonso De Simone, Imperial College London, UK Henriette Molinari, Istituto di Chimica delle Macromolecole ISMAC CNR, Italy Tobias Madl, Medical University Graz, Austria

#### \*Correspondence:

Peter Tompa, Department of Structural Biology, Flemish Institute of Biotechnology (VIB), Structural Biology Research Center, Vrije Universiteit Brussels, Pleinlaan 2, Brussels 1050, Belgium ptompa@vub.ac.be

#### Specialty section:

This article was submitted to Structural Biology, a section of the journal Frontiers in Molecular Biosciences

> Received: 29 May 2015 Accepted: 21 July 2015 Published: 05 August 2015

#### Citation:

Varadi M, Vranken W, Guharoy M and Tompa P (2015) Computational approaches for inferring the functions of intrinsically disordered proteins. Front. Mol. Biosci. 2:45. doi: 10.3389/fmolb.2015.00045 (Dunker et al., 2000)—we are only beginning to understand how protein function arises from the disordered state (Tompa, 2011). Here, we provide an overview of the recent developments in terms of computational methods and data resources that facilitate the understanding of intrinsic disorder and its connection to protein function.

# Functional Consequences of Intrinsic Disorder

IDPs and IDRs can be viewed as having complementary functions to those of their folded counterparts. While the latter are often involved in enzymatic activities, molecular transportation or binding short peptides and small molecules, IDPs are mainly involved in signaling, regulation and enzymatic activity inhibition (Xie et al., 2007; Dunker et al., 2015), for example in cell cycle regulation (Yoon et al., 2012), cell division and differentiation (Ward et al., 2004; Xie et al., 2007).

There are a number of possible ways in which an IDP/IDR can realize its function. In perhaps the simplest scenarios, they serve as entropic chains, effectively influencing the orientation and distance between folded domains (Chong et al., 2010), and organizing the super-tertiary structure of the protein (Tompa, 2012). In some cases they are entropic springs or even timers, where the length and flexibility of the linker can determine stochastically how often two folded domains may encounter each other (Bentrop et al., 2001; Smagghe et al., 2010).

Another important role of conformational flexibility is in binding protein or nucleic acid partners. IDPs excel in establishing specific, but transient interactions (Dunker et al., 1998). From an energetic point of view, the reason behind their weaker binding affinities is that the entropic cost of stabilizing a single conformation from the dynamic ensemble that the IDP/IDR is sampling is relatively high (Dyson and Wright, 2005). However, in some cases the fine-tuning of favorable interactions is known to yield surprisingly strong affinities (Ferreon et al., 2013; Follis et al., 2013). An additional advantage of the high degree of conformational freedom is that an IDR can bind very diverse partners because it can easily adopt different conformations (Wang et al., 2011; Hsu et al., 2013) and is often enriched in short binding- and recognition motifs. It is therefore no surprise that IDPs are often hub- (Kim et al., 2008) or scaffold proteins (Dyson and Wright, 2005; Kim et al., 2008; Mittag et al., 2010a) that play essential roles in the cell by integrating signals (Lobley et al., 2007), so increasing the complexity of cellular networks (Dunker et al., 2005; Oldfield et al., 2008). Consequently, IDPs are often implicated in pathological conditions where loss of regulation is the major issue, such as different types of cancer (Andresen et al., 2012). Their involvement in diseases has recently turned IDPs into potential drug targets by either targeting the IDP, or its protein-protein interactions (Funk and Galloway, 1998; Metallo, 2010; Rezaei-Ghaleh et al., 2012).

Conformational flexibility implies high accessibility for potential binding partners and/or enzymes. Consequently, posttranslational modification (PTM) sites are often found to be enriched in intrinsically disordered regions (Iakoucheva et al., 2004), with especially phosphorylation sites being prevalent (Gao et al., 2010). While IDPs often go through disorder-to-order transitions upon binding to their partners (Mohan et al., 2006; Wright and Dyson, 2009), in many cases they remain partially or fully flexible in their bound state, forming fuzzy complexes (Tompa and Fuxreiter, 2008; Fuxreiter and Tompa, 2012). One of the advantages of this fuzziness is that PTM sites within the chain can remain relatively accessible, allowing easier regulation of the IDP by modification enzymes (Mittag et al., 2010a). Such regulation by PTM sites is not limited to activation/deactivation of the protein; the modification of the surface of the IDR may also be the prerequisite of binding to a different partner (Oldfield et al., 2008), or even to the same partner, but with increased affinity (Mittag et al., 2010b).

However, intrinsic disorder also has a dark side. In particular, the amino acid compositional bias of IDPs coupled with relatively high propensities to form β-sheets and turns leads to elevated aggregation potentials, and the formation of amyloidtype beta-structures (Levine et al., 2015). Indeed, IDPs have been implicated in aggregation-based diseases, such as Alzheimer's and Parkinson's (Huang and Stultz, 2009; Uversky, 2010).

# Sequence-based Investigation of IDPs

There are a number of experimental techniques currently available for identifying and characterizing intrinsic disorder, such as circular dichroism (CD) (Weinreb et al., 1996), protease digestion (Johnson et al., 2012), Förster resonance energy transfer (FRET) (Haas, 2012), Electron Paramagnetic Resonance (EPR) spectroscopy (Drescher, 2012), small-angle X-ray and neutron scattering (SAXS and SANS) (Bernado and Svergun, 2012; Gabel, 2012) and nuclear magnetic resonance spectroscopy (NMR) (Kosol et al., 2013; Konrat, 2014). For initial and for highthroughput investigations, computational methods are however a very popular choice (Ward et al., 2004; Ishida and Kinoshita, 2007). Intrinsic disorder is associated with distinct sequence characteristics; IDPs/IDRs are enriched in "disorder promoting" amino acids, such as charged or polar residues, glycines and prolines, while hydrophobic residues are underrepresented (Uversky et al., 2000).Their conformational flexibility also implies that the local sequence context predominantly dictates the amino acid interactions that can take place, making IDPs more amendable to prediction of their characteristics from sequence. Throughout the last decade many disorder prediction algorithms were designed to exploit the information contained within the amino acid sequence of an IDP; there are more than 50 disorder predictors worldwide (He et al., 2009). The first disorder predictors, such as DisEMBL (Linding et al., 2003) were primarily based on the distinct compositional bias of IDPs. They were followed by faster and more reliable algorithms, such as IUPred (Dosztanyi et al., 2005), RONN (Yang et al., 2005), and Espritz (Walsh et al., 2012). Some of these more advanced methods rely on machine learning techniques (Bellay et al., 2012), or combine the results of several algorithms, such as the metapredictor metaPrDOS (Ishida and Kinoshita, 2008). Overall, the accuracy of most predictors is consistently above 80%, with the best methods currently peaking around 85% (Monastyrskyy et al., 2014). Alternatively, the novel Dynamine approach predicts backbone dynamics, which correlates (negatively) with intrinsic disorder (Cilia et al., 2014); interestingly, this approach is trained on estimations directly from NMR data and avoids structurebased information, complex machine-learning and evolutionary information (Cilia et al., 2013). The distribution of charged amino acids in the sequence of an IDP can also offer information on whether the protein chain is extended or collapsed (Das and Pappu, 2013).

Often it is unnecessary even to predict disorder, since there are a number of openly accessible online resources that store information of the disorder content of specific proteins. The Disordered Protein Database (DisProt) is the primary one of these sequence-based resources (Sickmeier et al., 2007). DisProt is manually curated, and stores information on proteins for which intrinsic disorder was experimentally determined. Where available, the proteins are also annotated with their known functions. However, DisProt houses data for 694 disordered proteins, which is only a minor fraction of the expected number of IDPs. MobiDB (Potenza et al., 2015) and D2P2 (Oates et al., 2013) on the other hand are online resources that store IDPs identified using prediction algorithms from the whole UniProt in addition to experimentally determined ones.

Sequence information on intrinsic disorder can be exploited for more than merely the prediction of disordered residues. In fact, there are many recent algorithms that aim at predicting functional sites and/or the functional role of IDRs. For example, larger hydrophobic residues such as tryptophan and leucine are often found within peptide motifs that act as recognition units located within IDR segments, called molecular recognition features (MoRFs) (Mohan et al., 2006; Fuxreiter et al., 2007; Brown et al., 2010). Disordered motifs are generally short, 3- 15 residue long segments; therefore identifying them poses a computational challenge (Gould et al., 2010). Consequently, predicting functional sites in IDRs is not straightforward, and prone to high false positive rates (Tompa, 2011). Additional layers of information can enhance the performance, for example MoRFpred, which uses order/disorder patterns (Cheng et al., 2007), ANCHOR, which estimates the interaction of a segment with a general partner (Meszaros et al., 2009) or DisCons, which takes into consideration the evolutionary conservation of both the amino acid sequence and of the disorder as a feature (Varadi et al., 2015).

# Structural Representation of IDPs

Ideally it would be possible to describe the structure of an IDP/IDR in full atomic detail. Indeed, the set of conformations IDPs sample is often not completely random, for example both p21 and p27 are known to sample distinct secondary structural elements that are biologically relevant and involved in conformation selection (Kriwacki et al., 1996; Sivakolundu et al., 2005). Due to their inherent conformational flexibility, however, the structure of an IDP cannot be described with a single, static conformation (Tompa and Varadi, 2014). This conformational diversity of IDPs precludes crystallization and examination by X-ray crystallography is therefore not a viable option. NMR spectroscopy, while more attuned to conformational diversity, remains hindered by distinct difficulties such as peak overlap (Bellay et al., 2012), while traditional structure calculation protocols do not properly account for multiple conformations (Vranken, 2014).

In response to this challenge, a number of approaches were developed that combine experimental data with computational methodology with the aim to accurately describe the full conformational ensemble adopted by IDPs. Experimental data from techniques that rely on measurements performed in solution are particularly well suited for studying the dynamic structure of an IDP, even though they often represent an average over the different conformations that are adopted by the IDP. These experimental measurements predominantly include NMRderived parameters, such as chemical shifts (CSs) (Jensen et al., 2011), residual dipolar couplings (RDCs) (Mittag et al., 2010b), paramagnetic relaxation enhancements (PREs) (Mittag et al., 2010b), and J-couplings (Mittag et al., 2010b), as well as scattering intensities from small-angle X-ray scattering (SAXS) (Allison et al., 2009) and probe distances from Forster resonance energy transfer (FRET) (Haas, 2012). These experimental data are then combined with computational methods to determine an ensemble of conformations for an IDP, with two main approaches being used; the first approach is referred to as pool-based modeling, while the second one is based on molecular dynamics (MD) simulations (Tompa and Varadi, 2014) (**Figure 1**).

When using a pool-based approach, such as the ensemble optimization method (EOM) (Bernado et al., 2007) which was designed to model ensembles based on SAXS data, the initial step is to generate a very large random or semi-random pool of conformations based on the amino acid sequence of the IDP/IDR with algorithms such as Flexible-Meccano (Ozenne et al., 2012). The sampling of conformations can be biased using experimental data, such as secondary structure propensities derived from chemical shifts. Next, theoretical parameters are estimated for each conformer in the pool, for example the theoretical scattering intensities calculated by software such as CRYSOL (Bernado and Svergun, 2012). Selection algorithms are then deployed in order to select subsets of the pool in a way that when the theoretical parameters are averaged over a subset, they are in excellent agreement with the experimental data (Sibille and Bernado, 2012). Another example is the ENSEMBLE methodology, which uses as input a large set of conformations together with relevant experimental data, and then prunes the ensemble of conformations to a smaller subset. During the filtering step, conformations are assigned weights so that the resulting ensemble average values fit the experimental input values. Structures that do not contribute to this fitting are discarded (Krzeminski et al., 2013).

In contrast, ensemble modeling procedures are based on molecular dynamics (MD) simulations and begin with random conformations in parallel. Multiple "replica" simulations are initiated using these initial conformations, and constraints are applied over multiple models based on the experimental data, i.e., sets of conformations are required to satisfy experimentally determined constraints, such as pair-wise

distances or secondary structure propensities (Cavalli et al., 2013). Given the conformational heterogeneity of IDPs, which sample an extensive range of conformations during their biological lifetime, extensive simulations are required to ensure adequate sampling of relevant regions of the conformational space. This significantly increases the computational costs associated with IDP simulations and makes it difficult to achieve even for systems of modest size. Recent developments to address these issues include techniques such as multi-scale enhanced sampling (MSES) (Lee and Chen, 2015) and replica exchange with guided annealing (RE-GA) (Zhang and Chen, 2014). The MSES protocol combines coarse-grained, topology-based models with atomistic force fields to enhance sampling and was recently optimized for simulating IDP conformational ensembles, where it could capture reversible helix-coil transitions (Lee and Chen, 2015). RE-GA has been suggested to be suitable for systems with small conformational transition barriers (as is the case for IDPs), and helped the disordered kinase inducible domain (KID) protein to efficiently escape non-specific compact states while requiring less computation (Zhang and Chen, 2014).

Therefore, the main differences between these two classes of modeling techniques are that the pool-based approach is much faster; conformations can be easily generated, but the final results strongly depend on the quality and diversity of this initial pool of conformations. The ensemble modeling MD approaches are in contrast very slow; conformational sampling depends on the MD simulation, but they have the advantage that the experimental data are continuously applied, a timeline of conformational changes is available, and due to their more rigorous simulation of physical reality, they should give a better representation of the thermally accessible structural ensemble.

These two modeling approaches (i.e., pool and MD-based) are currently the state-of-the-art and have been applied to generate the structural ensemble of many IDPs (**Table 1**). These ensemble models are, however, not straightforward to interpret (Tompa and Varadi, 2014). The key issue is that the experimental information that is available to either filter or constrain during the calculation is very sparse compared to the immense degree of conformational freedom the IDP experiences in solution, resulting in a hugely underdetermined problem. As a direct consequence, many ensembles of models can describe the experimental data equally well, allowing multiple, ambiguous solutions that strongly depend on the calculation approach and the amount and type of experimental data available. In fact, one can model the ensemble of an IDP with an excellent fit to the data, then discard the ensemble and remodel another, unique and different ensemble with an equally good fit. In this sense, ensembles should be considered as a whole and their structural characteristics analyzed as the average over the ensemble, and over-interpretation of single conformations should be avoided. However, if certain characteristics or preformed secondary structural elements are consistently modeled in multiple ensembles, then such structural features might be functionally relevant.

Varadi et al. Computationally inferring the function of IDPs

TABLE 1 | Recently published ensemble models from the Protein Ensemble Database.


The field of ensemble modeling therefore still presents exciting opportunities for further development, and several important issues will have to be addressed before the techniques become more reliable. In our view, the first is increasing the number of experimentally derived constraints, which will lead to higher quality models, or cross-validation with new types of experimental data, which will also increase the power of ensembles. The second is the further incorporation of knowledge based information into the calculations, such as the results from reliable predictions or improved force fields. The third is that specific validation and evaluation approaches are required for these ensembles, likely starting from the currently welldeveloped NMR validation field (Rosato et al., 2013; Vuister et al., 2014) but with better accounting for multiple conformations (Vranken, 2014). Especially NMR CS values, whenever available, are very useful for the estimation of residue-level backbone and side-chain dynamics (Berjanskii and Wishart, 2005, 2013) as well as secondary structure populations (Shen and Sali, 2006; Camilloni et al., 2012), with new methods providing reference chemical shift values for IDPs (Tamiola et al., 2010). They are already effectively used to generate pools with predetermined conformations and as restraints in molecular dynamics simulations (Krieger et al., 2014), but have immense potential for the validation and evaluation of ensembles. Finally, the overwhelming majority of the already generated ensemble models were previously unavailable to the scientific community, impeding the establishment of standardized validation and evaluation protocol. The Protein Ensemble Database (PED) is an international initiative launched to address this issue, effectively making the experimental data and the ensemble models available to the scientific community (Varadi et al., 2014). This is expected to facilitate the development of the next generation of ensemble modeling techniques, and should provide a basis for defining standards of validation and evaluation.

# Toward the Functional Interpretation of IDP Ensembles

While ensemble models do not yet possess a predictive power comparable to that of the structure of folded proteins/domains, these models can already offer insights regarding the function of an IDP. Through the integration of experimental data into an ensemble model, functionally important segments might be inferred. For example, transient secondary structural elements in the ensemble of an IDP are often important in terms of function. Such pre-formed elements are often molecular recognition units, playing major roles in binding to various partners. For example, thep27 protein samples transient helices are consistent with the secondary structure of the bound state p27-Cdk2-cyclin (Sivakolundu et al., 2005). Therefore, in accord with the notion of conformational selection, if a certain secondary structural element is sampled in the ensemble, it might be functionally relevant in the bound form as well (Yoon et al., 2012). Again, such interpretations have to be treated with caution given that the ensembles are based on lower resolution, averaged experimental observations, and the ensemble models should therefore be accurate on average, not by single conformations.

# Conclusions

IDPs are involved ubiquitously in biological processes, and play essential roles in the regulation of complex cellular systems (Tompa et al., 2006; Dunker et al., 2015). These multi-purpose proteins combine conformational flexibility with an enrichment of binding motifs and post-translational modification sites (Iakoucheva et al., 2004; Wang et al., 2011; Hsu et al., 2013). Due to their biological importance, it is imperative to characterize them and attempt to relate their sequence and structure with their physiological roles (Tompa, 2011). Such an endeavor has the potential to offer valuable insights that can be translated into new drugs and therapies (Funk and Galloway, 1998; Metallo, 2010; Rezaei-Ghaleh et al., 2012). Sequence-based in silico techniques such as disorder prediction algorithms are already comparable in terms of accuracy to that of the secondary structure prediction algorithms of folded proteins (Monastyrskyy et al., 2014), and functional prediction algorithms are also widely available (Cheng et al., 2007; Meszaros et al., 2009; Varadi et al., 2015). Yet, a major breakthrough is expected when ensemble models based on diverse experimental data prove to be biologically relevant, so that we can confidently infer specific protein function from the structural representation of an IDP (Tompa and Varadi, 2014). However, in order to realize this goal a number of challenges need to be tackled first (Tompa, 2011). Chief among these issues are improving the amount and available types of experimental data and establishing standardized protocols for the validation and evaluation of the ensemble modeling procedures. Only then can the field advance in terms of increasing the predictive power of ensemble models (Tompa and Varadi, 2014).

## Acknowledgments

This work was supported by the Odysseus grant G.0029.12 (FWO, Research Foundation Flanders) to PT and by a VIB international postdoctoral (omics@VIB) Marie-Curie COFUND fellowship for MG. WV acknowledges the Brussels Institute for Research and Innovation (Innoviris) grant BB2B 2010-1-12.

## References


14-3-3 with their partners. BMC Genomics 9(Suppl. 1):S1. doi: 10.1186/1471- 2164-9-S1-S1


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2015 Varadi, Vranken, Guharoy and Tompa. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The use of ion mobility mass spectrometry to probe modulation of the structure of p53 and of MDM2 by small molecule inhibitors

#### Edited by:

Kris Pauwels, Vrije Universiteit Brussel, Belgium

#### Reviewed by:

Tom Edwards, University of Leeds, UK Manuel M. Sánchez Del Pino, Universidad de Valencia, Spain

#### \*Correspondence:

Perdita E. Barran, The Michael Barber Centre for Collaborative Mass Spectrometry, Manchester Institute of Biotechnology, School of Chemistry, University of Manchester, 131 Princess Street, Manchester M1 7DN, UK perdita.barran@manchester.ac.uk

#### Specialty section:

This article was submitted to Structural Biology, a section of the journal Frontiers in Molecular Biosciences

> Received: 27 April 2015 Accepted: 22 June 2015 Published: 10 July 2015

#### Citation:

Dickinson ER, Jurneczko E, Nicholson J, Hupp TR, Zawacka-Pankau J, Selivanova G and Barran PE (2015) The use of ion mobility mass spectrometry to probe modulation of the structure of p53 and of MDM2 by small molecule inhibitors. Front. Mol. Biosci. 2:39. doi: 10.3389/fmolb.2015.00039 Eleanor R. Dickinson<sup>1</sup> , Ewa Jurneczko<sup>2</sup> , Judith Nicholson2, 3, Ted R. Hupp<sup>3</sup> , Joanna Zawacka-Pankau<sup>4</sup> , Galina Selivanova<sup>4</sup> and Perdita E. Barran<sup>1</sup> \*

<sup>1</sup> The Michael Barber Centre for Collaborative Mass Spectrometry, Manchester Institute of Biotechnology, School of Chemistry, University of Manchester, Manchester, UK, <sup>2</sup> School of Chemistry, University of Edinburgh, Edinburgh, UK, 3 Institute of Genetics and Molecular Medicine, CRUK Cancer Research Centre, University of Edinburgh, Edinburgh UK, <sup>4</sup> Department of Microbiology, Tumor and Cell Biology, Karolinska Institutet, Stockholm, Sweden

Developing drug-like molecules to inhibit the interactions formed by disordered proteins is desirable due to the high correlation of disorder with protein implicated in disease, but is challenging due in part to the lack of atomistically resolved and resolvable structures from conformationally dynamic systems. Ion mobility mass spectrometry (IM-MS) is well-positioned to assess protein ligand interactions along with the effect of a given inhibitor on conformation. Here we demonstrate the use of IM-MS to characterize the effect of two inhibitors RITA and Nutlin-3 on their respective binding partners: p53 and MDM2. RITA binds N-terminal transactivation domain of p53 (Np53) weakly, preventing direct observation of the complex in the gas phase. Nonetheless, upon incubation with RITA, we observe an alteration in the charge state distribution and in the conformational distributions adopted by Np53 in the gas phase. This finding supports the hypothesis that RITAs mode of action proceeds via a conformational change in p53. Circular dichroism corroborates our gas phase findings, showing a slight increase in secondary structure content on ligand incubation, and HDX-MS experiments also highlight the dynamic properties of this protein. Using the same approach we present data to show the effect of Nutlin-3 binding to the N-terminal domain of MDM2 (N-MDM2), N-MDM2 presents as at least two conformational families in the absence of Nutlin-3. Upon Nutlin-3 binding, the protein undergoes a compaction event similar to that exhibited by RITA on Np53. This multi-technique approach highlights the inherent disorder in these systems; and in particular exemplifies the power of IM-MS as a technique to study transient interactions between small molecule inhibitors and intrinsically disordered proteins.

Keywords: conformational dynamics, ion mobility mass spectrometry, p53, MDM2, small molecule modulation

# Introduction

The transcription factor p53, dubbed the Death Star (Vousden, 2000), is a multi-domain, intrinsically disordered protein (IDP) (Bell et al., 2002; Dawson et al., 2003). The protein comprises the disordered N-terminal domain (Np53) (Joerger and Fersht, 2008) containing the transactivation domain (residues 1–61) and the proline-rich domain (residues 62–94), the central DNA binding domain (residues 94–292), the tetramerization domain (residues 325–355) and the C-terminal regulatory domain (residues 363– 393). It is strongly implicated in tumor suppression pathways, where it functions to block tumor development by triggering cellular senescence or apoptosis upon signals indicating DNA damage, oncogene activation, or telomere erosion (Vousden and Prives, 2009). Under non-stressed conditions, low p53 levels are tightly maintained by MDM2 (murine double minute 2). MDM2 is a ∼55 KDa IDP with roles as an Ubiquitin E3 ligase, as a molecular chaperone and also in translational control. MDM2 comprises the disordered "lid" mini-domain (residues 1–24), (Uhrinova et al., 2005) the N-terminal domain (residues 25– 109), the disordered central acidic domain (residues 221–276), the zinc finger domain (residues 299–331), and the C-terminal RING (really interesting new gene) domain (residues 430–480). MDM2 down regulates p53 activity in a negative autoregulatory feedback loop via three mechanisms; firstly, MDM2 blocks the transcription ability of p53 by direct binding through their respective N-terminal domains (Wu et al., 1993; Haupt et al., 1997). Secondly, MDM2 exports p53 from the nucleus and thirdly, targets p53 by Ubiquitination for degradation via the proteasome (Freedman and Levine, 1998; Tao and Levine, 1999). p53 N-terminal domain binds into the MDM2 N-terminal domain hydrophobic pocket as an amphipathic helix, with residues Ph19, Trp23, and Leu26 comprising a triad of required contacts which insert into the MDM2 binding cleft (Kussie et al., 1996).

Alteration of the p53 pathway is an almost universal hallmark of human cancers, with 22 million cancer patients living with abrogation of the p53 pathway, half of which display suppressed p53 function (Brown et al., 2009) and half of which exhibit p53 mutations. Cellular overexpression of MDM2 effectively abolishes p53 function, allowing unregulated cell cycle events in tumor cells. Inhibition of the p53:MDM2 complex is therefore a highly desirable therapeutic strategy; releasing, reactivating and stabilizing p53 levels, thus providing an attractive cancer therapy drug target. To date, numerous p53:MDM2 protein–protein interaction (PPI) antagonists have been identified, including cisimidazolines (Vassilev et al., 2004; Vu et al., 2013), "stapled" peptides (Brown et al., 2012; Chang et al., 2013), terphenyls (Yin et al., 2005), oligobenzamides (Lu et al., 2006), spirooxindoles (Ding et al., 2006), chromenotriazolopyrimidine (Rew et al., 2012), Benzodiazepinedione (Grasberger et al., 2005), and Chromenotriazolopyrimidines (Allen et al., 2009). The cisimidazoline Nutlin-3 is composed of enantiomers a and b, of which enantiomer a is 150 times more potent, and binds MDM2 in the p53 peptide groove, mimicking the three p53 residues responsible for the bulk of binding interactions (Vassilev et al., 2004). Nutlin-3 is effective in numerous cell lines, and is able to arrest or induce apoptosis in proliferating cancer cells with micromolar concentrations (Tovar et al., 2006). The drug candidate RITA (reactivation of p53 and induction of tumor cell apoptosis, NSC 652287) has been shown to restore wild-type p53 function in tumor cells by preventing the p53:MDM2 interaction (Issaeva et al., 2004). In contrast to the Nutlins, which bind MDM2 in its N-terminal hydrophobic pocket (Vassilev et al., 2004), RITA binds to p53 N-terminal domain with estimated K<sup>D</sup> = 1.5 nM. It is hypothesized that RITA binds outside of the p53/MDM2 binding cleft, allosterically exerting its effect via a conformational change in the highly disordered N-terminus of p53 (Np53) (Issaeva et al., 2004).

Since its advent in the 1970's (Hogg and Kebarle, 1965; Kebarle and Hogg, 1965), the hybrid gas phase technique Ion Mobility-Mass Spectrometry (IM-MS) has gained credibility as a tool to study the conformations adopted by proteins and peptides in the gas phase. IM-MS is especially effective in its use for studying IDPs (Bernstein et al., 2004; Harvey et al., 2012; Pagel et al., 2013) due to its ability to observe conformations adopted by analytes on a millisecond time scale (Wyttenbach et al., 2001; McCullough et al., 2008). IM-MS provides information regarding charge, mass and shape of an analyte. The simplest setup of IM-MS is that of drift time IM-MS (DT IM-MS) (McAfee and Edelson, 1963). Ions are separated by their mobility (K) as they traverse a drift cell of known length filled with buffer gas to a known pressure and temperature. Ions travel down a weak electric field (5–50 V cm−<sup>1</sup> ) colliding with buffer gas molecules which counter their progress until an equilibrium drift velocity, proportional to the electric field, is reached. The mobility (K) of an ion is the ratio between the drift velocity (vd) and the applied electric field (E). The mobility of an ion can be used to calculate the rotationally averaged collision cross section (CCS, , Å<sup>2</sup> ) using Equation (1) (Mason and McDaniel, 2005):

$$K\_0 = \frac{3ze}{16N} \left(\frac{2\pi}{\mu k\_B T}\right)^{0.5} \frac{1}{\Omega} \tag{1}$$

Where K<sup>0</sup> is the reduced mobility; z is the ion charge state; e is the elementary charge; N is the gas number density; µ is the reduced mass of the ion-neutral pair; k<sup>B</sup> is the Boltzmann constant, and T is the gas temperature.

Here we employ native mass spectrometry, DT IM-MS, circular dichroism (CD) and hydrogen-deuterium exchange coupled to mass spectrometry (HDX-MS) to observe the conformations of N-terminal p53 domain (Np53) and the Nterminal domain of MDM2 (N-MDM2) both in the gas phase and in solution. We also probe the binding and conformational changes conferred by small molecule inhibitors; Nutlin-3 for N-MDM2, and RITA for Np53. Further information about DT IM-MS, CD and HDX-MS methodology can be found in the Supporting Information.

#### Materials and Methods

Expression and purification of both Np53 (residues 1–100) (Szekely et al., 1993; Bakalkin et al., 1995) and N-MDM2 (residues 1–126) (Worrall et al., 2010) have been previously described. Before the analysis reported here, the protein samples were thawed and dialysed in 50 mM ammonium acetate using Bio-RAD micro bio-spin chromatography columns (Bio-Rad Laboratories, Inc.). Concentrations of purified proteins were measured by the Thermo Scientific NanoDrop Spectrophotometer ND 1000 (Thermo Scientific, USA). Small molecule RITA [2,5-bis(5-hydroxymethyl-2-thienyl) furan, NSC 652287] was reconstituted in 100% IPA and stored at −20◦C. Before analysis, RITA was thawed and diluted to 100µM and an IPA concentration of 5% using 50 mM ammonium acetate. Nutlin-3 was reconstituted in 100% DMSO and stored at −80◦C. Before analysis, Nutlin-3 was thawed and diluted to 500µM and a DMSO concentration of 1% using 50 mM ammonium acetate.

MS and IM-MS experiments were performed on Np53 and N-MDM2 from solutions buffered with ammonium acetate (pH 6.8). Np53 samples were incubated with 5% IPA for 30 min at 37◦C to account for the solvent present in the RITA sample. N-MDM2 samples were incubated with 0.5% DMSO for 30 min at room temperature to account for the solvent present in the Nutlin-3 sample. Binding experiments were performed on Np53 with RITA in a 1:2 protein:ligand ratio, samples were incubated for 30 min at 37◦C. Binding experiments were performed on N-MDM2 and Nutlin-3 in a 1:10 protein:ligand ratio, samples were incubated for 30 min at room temperature. All MS and DT IM-MS data were acquired on an in-house modified quadropole time-of-flight mass spectrometer (Waters, Manchester, UK) (McCullough et al., 2008) containing a copper drift cell of length 5.1 cm. Ions were produced by positive nano-electrospray ionization (nESI) with a spray voltage of 1.3–1.62 kV. Helium was used as the buffer gas, its pressure measured using a baratron (MKS Instruments, UK). Buffer gas temperature and pressure readings (294.31–303.69 K and 3.518–3.898 Torr, respectively) were taken at each drift voltage and used in the analysis of drift time measurements. The drift voltage across the cell was varied by decreasing the cell body potential from 60 to 15 V, with arrival time measurements taken at a minimum of five distinct voltages. Instrument parameters were kept as constant as possible and are as follows: cone voltage: 114–119 V, source temperature: 80◦C.

nESI tips were prepared in-house using a micropipette puller (Fleming/Brown model P-97, Sutter Instruments Co., USA) using 4′′ 1.2 mm thin wall glass capillaries (World Precision Instruments, Inc., USA) and filled with 10–20µL of sample.

Data was analyzed using MassLynx v4.1 software (Waters, Manchester, UK), Origin v9.0 (OriginLab Corporation, USA) and Microsoft Excel. Experiments were carried out in triplicate. Ion arrival time distributions were recorded by synchronization of the release of ions into the drift cell with mass spectral acquisition. The collision cross section distributions (CCSD) are derived from arrival time data using Equation (2) (Mason and McDaniel, 2005):

$$\Omega\_{\text{avg}} = \frac{(18\pi)^{1/2}}{16} \left[ \frac{1}{m\_b} + \frac{1}{m} \right]^{1/2} \frac{ze}{(\mathcal{K}\_B T)^{1/2}} \frac{1}{\rho} \frac{t\_d V}{L^2} \tag{2}$$

Where is the collision cross section (Å<sup>2</sup> ); m and m<sup>b</sup> are the masses of the ion and buffer gas, respectively; z is the ion charge state; e is the elementary charge; k<sup>B</sup> is the Boltzmann constant; T is the gas temperature; ρ is the buffer gas density; L is the drift tube length; V is the voltage across the drift tube; and t<sup>d</sup> is the drift time. For these experiments where the CCS has been evaluated experimentally with helium as the buffer gas and using a drift tube with a linear field we use the convention DTCCSHe to report our collision cross section values in the context of the mobility technique employed as well as the buffer gas used.

HDX-MS experiments were carried out using a fully automated LEAP autosampler system (HTS PAL, Leap Technologies, Carrboro, NC, USA) previously described (Chalmers et al., 2006; Zhang et al., 2009) and an online Acquity UPLC M-class HDX System (Waters Inc., Manchester, UK). Np53 and RITA were mixed at a 1:2 protein:ligand ratio and incubated for 30 min at 37◦C before analysis. Stock protein solutions (50µM Np53 ± 100µM RITA, with 5% IPA) were diluted to 10µM with equilibration buffer. 3.8µl protein solution was incubated with D2O (54.2µl labeling buffer) and incubated at 18◦C for 15, 30, 60, or 120 s. Following deuterium on-exchange, 50µl of the labeled protein solution was quenched by adding 50µl of quench buffer at 1◦C, and samples were passed across an immobilized pepsin column (enzymate BEH pepsin column, Waters Inc., Manchester, UK) at 100µL min−<sup>1</sup> (H2O + 0.1% formic acid, 20◦C). The resulting peptides were trapped on a UPLC BEH C<sup>18</sup> Van-Guard Pre-column (Waters Inc., Manchester, UK) and then gradient eluted (1 min loading time, 8–85% ACN + 0.1% formic acid gradient, 40µl min−<sup>1</sup> , 1◦C) across a UPLC BEH C<sup>18</sup> column (Waters Inc., Manchester, UK) before undergoing electrospray ionization and analysis using a Synapt G2Si mass spectrometer (Waters Inc., Manchester, UK). Data was analyzed using ProteinLynx Global Server (PLGS) (Waters, Manchester, UK), Dynamx v1.0 (Waters, Manchester, UK) and Origin v9.0 (OriginLab Corporation, USA).

Hundred percentage of sequence coverage was obtained for Np53 ± RITA. Selected peptides were restricted to be present in all three repeats of 0 s incubation time experiments.

## Results

#### Modulation of N-terminal p53 by RITA

In the absence of RITA and under near neutral solution pH conditions, the mass spectra of Np53 (**Figure 1A**) presents a broad monomeric charge state distribution (CSD) range 5 ≤ z ≤ 12, with three major signals corresponding to the ions [M+6H]6+, [M+7H]7+, and [M+8H]8+, of which the [M+6H]6<sup>+</sup> species is most intense. Upon incubation of Np53 with RITA we observe a shift in the CSD toward lower charge states. Specifically, Np53 in the presence of RITA (**Figure 1B**) exhibits a significant decrease in intensity of the [M+7H]7<sup>+</sup> and [M+8H]8<sup>+</sup> species, along with an increase in the intensity of the [M+5H]5<sup>+</sup> species, an appearance of the [M+4H]4<sup>+</sup> species and a loss of the high charge states z >10. Although source conditions were carefully controlled to give gentle ionization of the sample, the Np53:RITA complex was not strong enough to be retained during desolvation at any protein:ligand ratio (data not shown).

DT IM-MS was performed on Np53 both in the absence and presence of RITA. The collision cross section distribution

( DTCCSDHe) (**Figure 2A** top panel) shows the Np53 [M+6H]6<sup>+</sup> charge state presents as two conformational families; a more populated compact form (denoted C1, blue Gaussian distribution) centered at ∼1250 Å<sup>2</sup> and a low intensity extended form (denoted X, green Gaussian distribution) centered at ∼1500 Å<sup>2</sup> . Two conformations are also observed for [M+7H]7<sup>+</sup> (**Figure 2B**), which are assigned to X and a more intense larger distribution, centered at ∼1750 Å<sup>2</sup> , which is assigned to an unfolded form of the protein (U, purple Gaussian distribution). [M+8H]8<sup>+</sup> (**Figure 2C**), is also made up of U, along with low intensity signal from a still more extended form (U2, gold Gaussian distribution), although this latter distribution is poorly resolved. [M+9H]9<sup>+</sup> (Figure S2) presents in three conformational families; X, U, and U2, of which the most extended U2 is most populated.

Upon incubation with RITA the DTCCSDHe for [M+6H]6<sup>+</sup> is significantly altered (**Figure 2A** bottom panel); we no longer observe the extended conformer X, observe a reduction in the population of compact conformer C1, and the induction of a highly populated novel conformational family centered at ∼950 Å<sup>2</sup> , C<sup>0</sup> (red Gaussian distribution). The [M+7H]7<sup>+</sup> CCSD (**Figure 2B**) is also altered by the presence of RITA, with loss of conformer U, and induction of both conformers C<sup>1</sup> and C0. This change is accompanied by a decrease in intensity of this charge state. [M+8H]8<sup>+</sup> (**Figure 2C**) behaves similarly to [M+7H]7+, with loss of conformer U2, and induction of conformers X and C1. We observe an increase in the intensity of the [M+5H]5<sup>+</sup> species (Figure S1) along with the appearance of a highly compact form of the protein C0. This compaction is evident in all charge states, for example [M+9H]9<sup>+</sup> (Figure S2) has lost the population of the unfolded conformer U<sup>2</sup> upon incubation with RITA, alongside a reduction in intensity of conformers X and U and induction of highly populated C<sup>1</sup> conformational family. This alteration of the conformational spread as shown by the DTCCSDHe is supported in solution by CD. Figure S3 (Supporting Information) shows the secondary structure content of Np53 increases upon incubation with RITA, supporting the hypothesis that RITA induces a novel conformer of Np53. Structural analysis using DiChroWeb (Lobley et al., 2002) using CONTILL algorithm (Provencher and Gloeckner, 1981; Sreerama and Woody, 2000) predicted that Np53 is 32% disordered, and upon incubation with RITA the level of disorder reduced to 28%.

Hydrogen–deuterium exchange coupled to mass spectrometry (HDX-MS) was used to ascertain if the conformational changes induced by RITA could be mapped in the solution phase. Np53 was incubated for varying time points in deuterated buffer, and the mass shift of peptides was determined. Np53 shows a significant uptake of deuterium at the shortest experimental time point of 15 s for a large proportion of peptides detected (**Figures 3A–D**). From the mas spectrometry data of each peptide, we observe no significant difference between deuterium uptake in the absence or presence of RITA, as shown by the deuterium uptake curves for selected representative peptides residues 23–30 and 53–63 (**Figures 3E,F**, respectively). This indicates that we cannot sample the interconverting solution conformations for this highly dynamic protein over the longer timescale of the HDX-MS experiment. The butterfly plot in **Figure 3G** depicts the overall deuterium uptake differences between Np53 in the absence and presence of RITA. Each set of points along the x-axis represent a peptide, with time points denoted by different colored points and lines [15 (yellow), 30 (red), 60 (blue), and 120 (black) s incubation time]. Gray bands indicate the error in the uptake level and vertical lines indicate the sum of uptake differences for each time point. Several peptides show deuterium uptake differences slightly above the error, but all at <1 Da, indicating that this protein is highly dynamic with or without RITA, for example, the greatest deuterium uptake difference in Np53 in the absence and presence of RITA being 0.271 Da, for a peptide spanning residues 23–39 with a [M+2H]2<sup>+</sup> charge state.

#### Modulation of N-terminal MDM2 by Nutlin-3

Mass Spectra for MDM2 (**Figure 4A**) sprayed from native conditions with 50 mM ammonium acetate and 0.5% DMSO show a broad bimodal CSD spanning charge states 5 ≤ z ≤ 14. The most intense species is [M+10H]10<sup>+</sup> with significant intensity also in [M+7H]7<sup>+</sup> and [M+6H]6+. We observe low intensity [D+11H]11+, [D+13H]13+, and [D+15H]15<sup>+</sup> dimers,

which means that the species attributed to [M+5H]5<sup>+</sup> will also contain some [D+10H]10<sup>+</sup> (and the [M+6H]6<sup>+</sup> some [D+12H]12<sup>+</sup> etc.) but since the flanking unique m/z dimers are of an intensity of <5% we ignore this contribution. Upon incubation with Nutlin-3 (**Figure 4B**) we see a CSD shift toward the lower charge states, with the [M+6H]6<sup>+</sup> species most intense, although the CSD range is retained. Binding of one Nutlin-3 molecule to MDM2 is observed at the [M+5H]5+, [M+6H]6+, and [M+7H]7<sup>+</sup> charge states. The shift in the N-MDM2 CSD upon incubation with Nutlin-3 is substantially greater than that caused by DMSO alone (Figure S4, Supporting Information) suggesting that Nutlin-3 confers a structural or conformational change in N-MDM2.

DT IM-MS analysis reveals that N-MDM2 in the absence of Nutlin-3 presents as at least two conformational families at all charge states (Figure S5, Supporting Information). The [M+5H]5<sup>+</sup> charge state (**Figure 5A**) presents as two conformers centered at ∼1000 and ∼1250 Å<sup>2</sup> , referred to as C<sup>1</sup> (black Gaussian curve) and C<sup>2</sup> (red Gaussian curve), respectively. The [M+6H]6<sup>+</sup> charge state (**Figure 5B**) presents as three conformers, the compact C<sup>1</sup> and C<sup>2</sup> families and a more extended family, X (blue Gaussian curve) centered at ∼1400 Å<sup>2</sup> . The [M+7H]7<sup>+</sup> charge state (**Figure 5C**) exhibits conformational family C<sup>1</sup> and X and also presents as a large conformer, U (green Gaussian curve) centered at ∼1700 Å<sup>2</sup> . Upon binding to Nutlin-3, we see a change in the DTCCSDHe of N-MDM2 for each charge state. The [M+5H]5<sup>+</sup> charge state (**Figure 5A**, middle panel) shows retention of the compact conformer C1, but a significant decrease in the intensity of C2. The [M+6H]6<sup>+</sup> charge state, when bound to Nutlin-3, undergoes a compaction event to produce a single conformational family centered at ∼1250 Å<sup>2</sup> , corresponding to conformer C<sup>2</sup> (**Figure 5B**, middle panel). This effect is again seen for the [M+7H]7<sup>+</sup> charge state (**Figure 5C**, middle panel), which presents as a single conformer corresponding to conformer X when bound to Nutlin-3. These altered conformations remain, even when Nutlin-3 is not bound to N-MDM2. **Figure 5** bottom panels show the DTCCSDHe of N-MDM2 in the presence of Nutlin-3, but not bound in a complex. The [M+5H]5<sup>+</sup> species undergoes a minor change in the DTCCSDHe with an increase in conformational family C<sup>2</sup> compared with the bound complex. Charge states [M+6H]6<sup>+</sup> and [M+7H]7<sup>+</sup> remain in the single conformational families C<sup>2</sup> and X, respectively, even after the ligand is no longer bound.

## Discussion

The MS spectra for Np53 in the absence of RITA (**Figure 1A**) corroborates previous reports of disorder for the N-terminus of p53 (Bell et al., 2002; Dawson et al., 2003), a broad charge state range 5 ≤ z ≤ 12, indicative of a disordered system (Testa et al., 2011; Beveridge et al., 2014) with numerous residues available for protonation in solution. We are unable to preserve the binding of small molecule RITA to Np53, suggesting the binding is lower affinity than that reported previously (K<sup>D</sup> = 1.5 nM, Issaeva et al., 2004) or that it proceeds principally by hydrophobic interactions that are significantly diminished in the absence of solvent, resulting in loss of ligand during desolvation. We observe a narrowing of the CSD for Np53 on ligand incubation, we also observe the isolated ligand (data not shown), which also supports our assertion of ligand dissociation during desolvation. This CSD shift toward lower charge states suggests conformational tightening induced by RITA, a hypothesis that is supported

by DT IM-MS data. The DTCCSDHe for Np53 is significantly altered in the presence of RITA at all charge states present, with loss of larger conformational families and induction of more compact conformers. We observe a compact conformer C<sup>0</sup> for [M+5H]5+, [M+6H]6+, and [M+7H]7<sup>+</sup> charge states, which is not present in the absence of RITA. Whilst the [M+8H]8+species does not contain any of the C0, it no longer contains conformer U, rather is populated by the more compact conformers C<sup>1</sup> and X, although conformer X is poorly resolved. [M+9H]9<sup>+</sup> undergoes loss of conformer U<sup>2</sup> with induction of compact conformer C<sup>1</sup> at a much lower DTCCSHe. The use of IM-MS to discern conformational tightening due to ligand binding has been previously reported, (Harvey et al., 2012) and along with these findings provides an exciting prospect as a method for screening inhibitors to conformationally dynamic systems. As RITA is predicted to bind outside of the p53:MDM2 hydrophobic binding pocket (Issaeva et al., 2004), it has been asserted that the observed inhibition proceeds via a conformational change, which in turn will allosterically prevent the binding of MDM2. Our IM-MS data is evidence for the conformational modulation of Np53 by RITA. The induction of a smaller conformation is corroborated by CD results, which show an increase in secondary structure, and a decrease in the disordered content when analyzed using the CONTILL algorithm. We do note, however, that the calculated differences in structural content predictions for Np53 are minimal, with only a 4% decrease in disordered content, this is less informative than the clear conformational change provided by IM-MS.

The use of HDX-MS reinforces the view that Np53 is conformationally dynamic in solution; high levels of deuterium uptake are observed after 15 s incubation, with very little further uptake at longer incubation times. This suggests that backbone amides are solvent exposed and free to exchange with deuterium. When uptake was compared after RITA incubation we observe no significant changes in deuterium uptake for Np53 (**Figure 3G**). While there are several peptides which exhibit deuterium uptake differences outside of the error, the greatest difference is 0.271 Da for a [M+2H]2<sup>+</sup> peptide. As we see no difference greater than 1 Da, the mass difference between a hydrogen and deuterium atom, we can infer that there is no

significant structural difference between Np53 in the absence and presence of RITA on the timescale of these experiments. Our shortest time step (15 s) is insufficient to observe the conformational changes occurring as the protein has enough time to rearrange back to its original conformations. In contrast, the isolated gas phase conformers exiting the electrosprayed droplets appear trapped in distinct conformers at least over the time scale of our IM-MS experiments. We estimate this time to be ∼15 ms including the transmission of ions to our drift cell (McCullough et al., 2008), which is short enough to retain the conformational changes induced by RITA such that they can be observed. Both of the solution approaches indicate conformational flexibility and some slight change in structural content in the presence of the ligand, IM-MS provides a more definitive readout of the modulation of conformation to Np53 in the presence of RITA.

We can contrast the results observed for the RITA interaction with Np53 with that for the well-studied drug candidate Nutlin-3 with MDM2. N-MDM2 presents with a wide CSD (5 ≤ z ≤ 14), again suggesting a disordered protein. DT IM-MS shows that N-MDM2 presents in the gas-phase in at least two conformational families, potentially assignable to the previously reported "open" and "closed" position of the lid mini-domain (Uhrinova et al., 2005; Worrall et al., 2009, 2010). When incubated with Nutlin-3, we observe a substantial CSD shift toward the lower charge states which cannot be attributed to the effect of DMSO alone (Figure S4), again suggesting some conformational effect conferred by Nutlin-3 binding. We observe binding of Nutlin-3 to N-MDM2 over three charge states, [M+5H]5+, [M+6H]6+, and [M+7H]7+. As there is no binding to the more extended high charge states, Nutlin-3 may only be able to bind N-MDM2 in a compact conformation, which is transferred to the gas phase as low charge state complex. DT IM-MS analysis showed that Nutlin-3 configures N-MDM2 into a more compact and inflexible conformer. The [M+5H]5<sup>+</sup> charge state retains both conformational families upon Nutlin-3 binding, however the larger conformer at DTCCSHe ∼1250 Å<sup>2</sup> was greatly reduced. For the [M+6H]6<sup>+</sup> and [M+7H]7<sup>+</sup> ions, Nutlin-3 binding configures the protein into a single conformer with a narrow DTCCSDHe, indicating less dynamics. This single conformational family was centered at a DTCCSHe ∼1250 Å<sup>2</sup> for [M+6H]6+, corresponding to conformational family C2, and ∼1400 Å<sup>2</sup> for [M+7H]7<sup>+</sup> corresponding to family X. We see loss of both the C<sup>1</sup> and X families for [M+6H]6<sup>+</sup> and loss of C<sup>2</sup> and U for [M+7H]7+, suggesting much lower flexibility of the protein when bound to Nutlin-3.

Interestingly, it appears as for Np53, that the ligand free N-MDM2 in the IM-MS experiments also retains a "memory" of its in solution Nutlin-3 bound state. DTCCSDHe of N-MDM2, incubated with Nutlin-3 but in its apo-form, show similar conformers than those which retain binding of Nutlin-3 (**Figure 5**, bottom panels). This suggests that Nutlin-3 binds a higher proportion of analyte molecules than we observe, but is not retained fully during desolvation. The apo [M+5H]5<sup>+</sup> species is not only compact, suggesting that it rearranges back to the free N-MDM2 conformer, or that some of it arises from a conformer in solution that is incapable of binding Nutlin-3. The apo [M+6H]6<sup>+</sup> and [M+7H]7<sup>+</sup> remain in tight, single conformational families, and a much lower proportion of the Nutlin-bound N-MDM2 presents in the [M+5H]5<sup>+</sup> charge state (**Figure 4**) supporting our hypothesis that Nutlin-3 is unable to bind as well to the very compact conformer C1. For the larger conformational families, N-MDM2 seems unable to rearrange back to its original conformations within the timescale of desolvation and analysis.

# Conclusions

Multiple techniques have been used to probe the binding of small molecule inhibitors RITA and Nutlin-3 to N-terminal p53 (Np53) and N-terminal MDM2 (N-MDM2), respectively. Native mass spectrometry of Np53 shows a shift in the CSD toward the lower charge states and loss of the more extended charge states upon incubation with RITA. IM-MS of Np53 reveals two conformational families in the absence of RITA. Upon incubation with RITA, Np53 is configured into a novel, more compact conformer C<sup>0</sup> with loss of the more extended conformational family. We are able to retain this conformational tightening in the gas-phase on the time scale of our DT IM-MS experiments, even though we are unable to preserve the RITA:Np53 complex in the gas phase. HDX-MS data highlights the disordered nature of Np53, with no discernible conformations visible on a longer timescale. Very little differences are noted between the deuterium on-exchange of Np53 in the absence and presence of RITA, and we are unable to locate RITA induced conformational changes.

The nESI mass spectrum of N-terminal MDM2 shows a wide range of charge states (5 ≤ z ≤ 14) indicative of a disordered protein (Testa et al., 2011; Beveridge et al., 2014). The bimodal distribution suggests the protein may possess a more compact and more extended conformer. Indeed, DT IM-MS results show the protein presents as at least two conformational families at all

charge states. Upon incubation with Nutlin-3, we observe ligand binding to the forms of the protein that present to the gas phase with low charge states [M+5H]5+, [M+6H]6+, and [M+7H]7+, suggesting selective binding to a compact conformer of MDM2, or possibly that more extended forms lose Nutlin-3 on transfer to the gas phase. The bound species of MDM2 are compact at all three charge states, with [M+6H]6<sup>+</sup> and [M+7H]7<sup>+</sup> forming a single conformational family centered at a DTCCSHe of the middle conformational family exhibited by apo-N-MDM2. These conformational changes are likely retained by ions which lose the bound Nutlin-3 molecule during desolvation, indicating that the protein is unable to rearrange during the experiment. IM-MS is presented as a promising technique able to track conformational changes in unstructured proteins on a millisecond timescale.

# References


# Acknowledgments

This work has been funded by the award of a BBSRC (Biological and Biotechnological Science Research Council) Industrial Case Studentship to EJ in collaboration with Waters Corp. and by the BBSRC in awards BB/L015048/1 and BB/H013636/1. We are also grateful for the support of the Universities of Edinburgh and Manchester.

## Supplementary Material

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fmolb. 2015.00039


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2015 Dickinson, Jurneczko, Nicholson, Hupp, Zawacka-Pankau, Selivanova and Barran. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Exploring intrinsically disordered proteins using site-directed spin labeling electron paramagnetic resonance spectroscopy

Nolwenn Le Breton, Marlène Martinho, Elisabetta Mileo, Emilien Etienne, Guillaume Gerbaud, Bruno Guigliarelli and Valérie Belle\*

Bioénergétique et Ingénierie des Protéines Laboratory, UMR 7281, Aix-Marseille Université and Centre National de la Recherche Scientifique, Marseille, France

#### Edited by:

Peter Tompa, Flanders Institute of Biotechnology (VIB), Belgium

#### Reviewed by:

Patrick Senet, Université de Bourgogne, France Steffen P. Graether, University of Guelph, Canada

#### \*Correspondence:

Valérie Belle, Bioénergétique et Ingénierie des Protéines Laboratory UMR 7281, Aix-Marseille Université and Centre National de la Recherche Scientifique, 31 chemin J. Aiguier, 13402 Marseille, France belle@imm.cnrs.fr

#### Specialty section:

This article was submitted to Structural Biology, a section of the journal Frontiers in Molecular Biosciences

> Received: 26 March 2015 Accepted: 03 May 2015 Published: 19 May 2015

#### Citation:

Le Breton N, Martinho M, Mileo E, Etienne E, Gerbaud G, Guigliarelli B and Belle V (2015) Exploring intrinsically disordered proteins using site-directed spin labeling electron paramagnetic resonance spectroscopy. Front. Mol. Biosci. 2:21. doi: 10.3389/fmolb.2015.00021 Proteins are highly variable biological systems, not only in their structures but also in their dynamics. The most extreme example of dynamics is encountered within the family of Intrinsically Disordered Proteins (IDPs), which are proteins lacking a well-defined 3D structure under physiological conditions. Among the biophysical techniques well-suited to study such highly flexible proteins, Site-Directed Spin Labeling combined with EPR spectroscopy (SDSL-EPR) is one of the most powerful, being able to reveal, at the residue level, structural transitions such as folding events. SDSL-EPR is based on selective grafting of a paramagnetic label on the protein under study and is limited neither by the size nor by the complexity of the system. The objective of this mini-review is to describe the basic strategy of SDSL-EPR and to illustrate how it can be successfully applied to characterize the structural behavior of IDPs. Recent developments aimed at enlarging the panoply of SDSL-EPR approaches are presented in particular newly synthesized spin labels that allow the limitations of the classical ones to be overcome. The potentialities of these new spin labels will be demonstrated on different examples of IDPs.

Keywords: intrinsically disordered proteins, structural transitions, induced folding, fuzzy complex, nitroxide spin labels, site-directed spin labeling EPR spectroscopy

# Introduction

Characterizing dynamics of macromolecules is a complex task requiring the use of appropriate techniques. In parallel to the development of methods leading to structure resolution of proteins, there is also an increasing need to develop biophysical techniques able to describe structural flexibility, as this dynamical aspect is closely related to protein function (Dyson and Wright, 2002; Salmon et al., 2010). The most highly dynamic biological systems are the socalled Intrinsically Disordered Proteins (IDPs) or Regions (IDRs), which lack a well-defined 3D structure under physiological conditions while being associated to key functions such as regulation, molecular assembly, signaling. . . (for recent reviews see Uversky and Dunker, 2010; Babu et al., 2011; Chouard, 2011; Habchi et al., 2014). Among the various techniques able to give access to dynamic properties of biomolecules is Site-Directed Spin Labeling combined with Electron Paramagnetic Resonance (SDSL-EPR), a technique that was pioneered about 20 years ago by (Hubbell et al., 1996). SDSL-EPR is very sensitive for probing flexible regions of proteins and revealing dynamic changes or interaction sites in protein-protein interactions (Belle et al., 2007; Lorenzi et al., 2012). It can also be used to determine accessibility profiles, a powerful approach to determine the topology of membrane proteins (Gross et al., 1999; Kaplan et al., 2000). More recently, the development of pulse EPR, in particular Double Electron Electron Resonance (DEER) techniques allowed the measurement of distances between spin labeled sites in the range of 1.8–6.0 nm (Pannier et al., 2000), thus covering a wide range of interest for the study of large conformational transitions and biomolecule associations. Excellent reviews describing all of these approaches, as well as applications of SDSL-EPR on proteins, have been recently published (Klare and Steinhoff, 2009; Bordignon, 2011; Mchaourab et al., 2011; Drescher, 2012; Hubbell et al., 2013). The present contribution will focus on dynamic analyses of extremely flexible biological systems and recent synthesis of new spin labels designed to enlarge the potentialities of the technique.

# General Principles and Development of New Spin Labels

SDSL-EPR relies on selective grafting of a paramagnetic species, usually a nitroxide derivative, at a desired position of a protein and subsequent analysis by EPR spectroscopy. Nitroxides are stable radicals in which the unpaired electron is delocalized on the N-O group, leading to a 3-line spectrum arising from the hyperfine interaction between the electron spin and the nuclear spin of the <sup>14</sup>N atom. Nitroxide spin labels are classically functionalized to react specifically with cysteine residues. The technique thus requires the construction of cysteine mutants in order to target specific sites. The most frequently used nitroxide spin label is the MTSL (1-oxyl-2,2,5,5-tetramethylpyrroline-3-methyl) methanethiosulfonate leading to the formation of a disulphide bridge between the side-chain of the cysteine and the label (**Figure 1A**, side-chain R1). Its relatively small size minimizes the potential perturbation of the biological system. Other commercial spin labels are available, such as the maleimide-functionalized ones having the advantage of forming a thio-ether bond with the protein side-chain preventing the label release in the presence of reducing agents. One example of such spin label is the 3-maleimido-2,2,5,5-tetramethyl-pyrrolidinyloxy referred to as Proxyl or P depicted in **Figure 1A** (spin label N2). These labels are, however, more sterically demanding than MTSL, and they can react with amines at high pH (Hideg et al., 2005). As for all labeling techniques, control experiments are essential for comparing wild-type protein and labeled mutants to check the non-invasiveness of the label with respect to protein structural and functional properties. By investigating the relationship between the mobility of the MTSL nitroxide side-chain and the structural elements of the T4 lysozyme taken as a model protein, Mchaourab and co-workers established the basis for the interpretation of EPR spectral shapes of spin labels (Mchaourab et al., 1996). The power of the technique relies on the sensitivity of the EPR spectral shapes to the mobility of the label in the nanosecond time window described by the rotational correlation time τ<sup>c</sup> (**Figure 1B**), a parameter that can be obtained by spectral simulation. Indeed the magnetic hyperfine interaction between the electron spin and the <sup>14</sup>N nucleus is highly anisotropic. This anisotropy is fully averaged when the radical is highly mobile and the spectrum displays three narrow lines. When the mobility decreases, the averaging becomes partial and the lines broaden progressively until reaching a limit of a fully anisotropic spectrum corresponding to an immobilized spin label. A spectral modification thus represents a change in the environment of the label affecting its mobility and thus indicates a structural transition (**Figure 1C**).

The technique has still some limitations, in particular due to the chemical nature of the spin labels, which our recent works aim to overcome. One limitation comes from the poor diversity of EPR spectral shapes of nitroxide labels. The similarity of the 3-line spectral shapes precludes the simultaneous investigation of two different regions of a protein or two interacting proteins, a situation that can be encountered in allosteric mechanism in which ligand binding at one site influences binding at another site through a propagated structural change within the protein. To overcome this limitation, we designed and synthesized a new spin label based on a maleimido-functionalized β-phosphorylated nitroxide: the {2-(diethoxyphosphoryl)- 5-[(2,5-dioxo-2,5-dihydro-1H-pyrrol-1-yl)methyl]-2,5 dimeth ylpyrrolidin-1-yl}oxidanyl, referred to as Phosphorylated Proxyl or PP (**Figure 1A**, N3). Thanks to the supplementary high magnetic coupling between the electron spin and the nuclear spin of <sup>31</sup>P, this label gives a well-resolved 6-line spectrum composed of a doublet of triplets (Le Breton et al., 2014). Another limitation of conventional SDSL concerns the fact that only cysteines are specifically targeted by commercial nitroxides spin labels which is unsuitable when cysteines play roles either in the function and/or in structural elements (active sites, disulfide bridges) of the protein under study. We designed and synthesized various nitroxides to react with the phenol group of the tyrosine via a three-components Mannich type reaction (Lorenzi et al., 2011; Mileo et al., 2013a). The best results were obtained with an isoindoline-based nitroxide: 5-amino-1,1,3,3-tetramethyl-isoindolin-2-yloxyl, referred to as Nox (**Figure 1A**, N4), which has proved to be a good reporter of structural modifications (Mileo et al., 2013a). The successful applications of these two new nitroxides to report on structural transitions on IDPs will be presented in the next section.

# Applications

Several IDPs have already been investigated by SDSL-EPR. This approach has been successfully used to reveal various structural states and higher-order organizations of flexible proteins involved in neurodegenerative diseases such as αsynuclein (Chen et al., 2007), amyloid-β peptide (Torok et al., 2002) and tau (Siddiqua and Margittai, 2010). Another example of IDP studied by SDSL-EPR is the small acid protein IA3 that acts as an inhibitor of the yeast proteinase A (Casey et al., 2014). In the following, results with two highly flexible proteins are reviewed to illustrate the potential of SDSL-EPR in the field of IDPs.

represents the spin label.

#### Cartography of Induced Folding

Most IDPs undergo an induced folding in presence of a partner protein i.e., a disorder-to-order transition that can be limited to a particular region. An illustrative example concerns the cartography of induced folding of nucleoproteins (N) from three viruses belonging to the Paramyxoviridae family, namely Measles (MeV), Nipah (NiV) and Hendra (HeV) viruses. Multiple copies of N are structurally organized to encapsidate the viral genome. The particular case of MeV nucleoprotein has been the most extensively studied by complementary biophysical approaches

spin label side-chains. (B) EPR spectral shape modifications as a function of

(Habchi and Longhi, 2012). MeV N consists of two regions: a N-terminal globular one (aa 1–400) and a C-terminal domain NTAIL (aa 401–525) that is fully disordered. This disordered part is essential for transcription and replication of the virus via interaction with the phosphoprotein of the viral polymerase complex. Our study was focused on the interaction between MeV NTAIL and the C-terminal part (X Domain) of the phosphoprotein PXD (aa 459–507 of P), which is constituted of 3 α-helices (Longhi et al., 2003). In order to precisely localize the regional folding that MeV NTAIL undergoes in the

presence of PXD, we targeted 14 sites for spin-labeling (with MTSL) within NTAIL, 12 of which being concentrated in the Cterminal region (aa 488–525) that was known to be specifically involved in the interaction with MeV PXD (Longhi et al., 2003) (**Figure 2A**, upper panel). For all labeled variants of NTAIL, room temperature EPR spectra were recorded in the absence and presence of PXD (Belle et al., 2008). An important decrease of the mobility of the spin labels was observed in the 488–502 region that we have been able to attribute to the formation of an α-helix using complementary circular dichroism analyses. At one particular position (aa 491) a broad shape component was detected, indicating a very restricted environment of the spin label (**Figure 2A**, left panel). This observation allowed us to confirm the structural model of a chimera construct between MeV PXD and a small NTAIL region (aa 486–504) in which the amino acid side-chain 491 points toward PXD whereas the other

position 496 in the absence and in the presence of PXD. Variation

ones are solvent-exposed (Kingston et al., 2004). Interestingly, the mobility of the induced folding region was found to be slightly but significantly restrained even in the absence of the partner protein, a behavior that could indicate the existence of a pre-structuration of this region. This observation has been further confirmed combining SDSL-EPR data and modeling of local rotation conformational space (Kavalenka et al., 2010) and also by NMR studies and modeling of this region as a dynamic equilibrium between a completely unfolded state and different partially helical conformations (Jensen et al., 2011). Concerning the C-terminal end of NTAIL, EPR spectral shapes of the labels grafted in the 505–525 region showed a moderate decrease of mobility that has been attributed to a gain of rigidity arising from α-helical folding of the neighboring 488–502 region (Belle et al., 2008). This observation was consistent with further analyses where the 505–525 region was found to

ratio (red). In inset: a zoom on the low-field region.

conserve a significant degree of freedom even in the bound form (Kavalenka et al., 2010). Using the same strategy based on multiple individual labeling sites, we also mapped the induced folding of NTAIL in association with PXD for HeV and NiV viruses (Martinho et al., 2013) and validated previously proposed structural models obtained by homology modeling (Habchi et al., 2011).

Thanks to the well-characterized MeV NTAIL-PXD interaction by conventional spin labels, this biological system was used as a model to characterize the new label PP, having a <sup>31</sup>P atom in the vicinity of the nitroxide leading to a 6-line spectrum (**Figure 1A**, N3) and to probe its ability to report structural transitions. Four grafting sites on NTAIL were judiciously chosen (within and outside the induced folding region) (**Figure 2A**, upper panel). For comparative purposes, proxyl P (**Figure 1A**, side-chain R2) was grafted at the same positions. The ability of PP to report on structural transitions was evaluated by analyzing the spectral shape modifications induced either by a secondary structure stabilizer (trifluoroethanol) or by the presence of the partner protein PXD (**Figure 2A**, right panel). All the spectra were simulated to extract the dynamic parameter τ<sup>c</sup> that represents the mobility of the spin labels. The modification of this parameter according to the position of the spin label and the condition (free or bound to PXD) was very similar for both the classical proxyl and the new phosphorylated spin labels (**Figure 2B**, right panel). Taken together the results demonstrated that PP is able to monitor from subtle to larger structural transitions, as efficiently as the classical spin label. Molecular dynamics (MD) calculations were performed to gain further insights into the binding process between the labeled NTAIL and PXD. MD calculations revealed that the new phosphorylated label does not perturb the interaction between the two partner proteins and reinforced the conclusion on its ability to probe different local environments in a protein (Le Breton et al., 2014). Thanks to its new EPR spectral signature, the combination of PP with classical spin labels opens the way to study two protein sites simultaneously.

#### Revealing Fuzziness in a Supramolecular Complex

In this section we will show how SDSL-EPR provides a unique tool to reveal regions of an IDP remaining highly flexible after complex formation, information that often escapes to detection by other biostructural techniques. This study concerns the structural flexibility of a small chloroplastic protein called CP12 (80 aa) and its association with the glyceraldehyde-3-phosphate dehydrogenase (GAPDH) from the green alga Chlamydomonas reinhardtii. This association is a key step in the formation of a ternary supramolecular complex involved in the regulation of the Calvin cycle in many photosynthetic organisms (for details see the review Tieulin-Pardo et al in the present journal). In its oxidized state the green alga CP12 contains four cysteine residues engaged in two disulfide bridges (C23–C31 and C66–C75) and presents some α-helical structural elements modeled from each side of the N-terminal disulfide bridge, whereas the C-terminal part appears mainly disordered (**Figure 2B**) (Gardebien et al., 2006). In contrast, in its reduced state the algal CP12 is fully disordered (Graciet et al., 2003). In a first study, we used MTSL (**Figure 1A**, N1), which led to the labeling of the only C-terminal cysteine residues leaving the N-terminal unmodified as revealed by mass spectrometry analyses. Surprisingly, the partner protein GAPDH induced the cleavage of the disulfide bridge between the cysteine and the label, resulting in the full release of the label. This result showed the existence of a transitory interaction between both proteins and we proposed a mechanism based on a thioldisulfide exchange reaction involving cysteines C21 and C291 of the C. reinhardtii GAPDH (Erales et al., 2009). Even if this observation led us to propose a new role for the algal GAPDH, it however doesn't allow us to study the GAPDH-CP12 complex. As the central region of CP12 has been proposed to be involved in the interaction with GAPDH, we used two cysteine-to-serine mutants (C23S and C31S) that were known to be fully disordered, due to the loss of the N-terminal disulfide bridge, but still able to interact with GAPDH (Lebreton et al., 2006). In order to individually target the two N-terminal cysteine residues, proxyl P was chosen as spin label (**Figure 1A**, N2) to prevent label release induced by GAPDH. EPR analyses revealed that this part of the protein remains fully disordered after association with GAPDH as no spectral modification was detected (**Figure 2B**) (Mileo et al., 2013b). The association between the two partner proteins was checked by probing the accessibility of the labels using a reducing agent. Indeed, in the absence of GAPDH, spin labels grafted on CP12 are reduced by an excess of DTT and become EPR-silent with a characteristic time of about 25 min. On the contrary, if GAPDH is present in the sample, the same amount of DTT has almost no effect as the EPR signal remains stable for at least 2 h confirming that the association really takes place and that CP12 remains highly dynamic in its bound form (Mileo et al., 2013b).

The C-terminal region has been demonstrated to be mainly responsible for the redox regulation of GAPDH (Lebreton et al., 2006). To gain further insights into the dynamics of this region of CP12 while keeping the 2 disulfide bridges intact, tyrosine 78 was chosen as an alternative labeling site. The location of tyrosine 78, highly conserved in CP12s from different organisms, is particularly interesting as it is close to residues that play a crucial role in the activity modulation of GAPDH either in the binary complex (Erales et al., 2011) or in the ternary complex with PRK (Avilan et al., 2012). The new isoindoline-based nitroxide Nox (**Figure 1A**, N4) was used to selectively target this unique tyrosine. The presence of GAPDH led to a slight modification of the EPR spectral shape (**Figure 2B**) indicating that the interaction of the C-tail with the partner protein is not tight. This result showed however that this region is close to the interaction site without being directly involved in it (Mileo et al., 2013b). All together the analyses of the three labeling sites were in good agreement with the partial view of the CP12-GAPDH complexes from other organisms given by recent crystallographic and NMR studies where only the 20 last amino acids of CP12 were detected (Matsumura et al., 2011; Fermani et al., 2012). Finally, the use of different spin labels is this peculiar biological system allowed us to conclude that GAPDH-CP12 from C. reinhardtii is a new example of a fuzzy complex, a concept introduced by Tompa and Fuxreiter (2008), in which the IDP keeps most of its disorder and dynamics upon complex formation. This fuzziness could be one of the keys to facilitate the formation of a ternary complex (CP12, GAPDH and phosphoribulokinase, PRK) as well as functional actions of the machinery which necessitate dynamic assembly and disassembly processes controlled by dark/light transitions.

#### Conclusion

The examples described above, along with numerous other studies, illustrate the power of SDSL-EPR to access information on protein dynamics. Characterized by a remarkable conformational flexibility, IDPs form a unique protein category that is particularly suited for SDSL-EPR applications. SDSL-EPR is a rapidly growing field, in particular with recent developments focused on the detection of labeled protein in intact cells, an area that promises to be very interesting for IDPs applications. We can anticipate that these recent developments will create further need for the design of new labels with greater stability

#### References


toward bio-reduction, an environment that is encountered inside the cell.

#### Acknowledgments

This work was supported by the Agence Nationale de la Recherche (ANR SPINFOLD n◦ 09-BLAN-0100), the Centre National de la Recherche Scientifique (CNRS) and Aix-Marseille Université. The authors are grateful to the EPR facilities available at the national TGE RPE facilities (IR 3443). We thank our collaborators and their teams: R. Lebrun (Proteomic center of IMM, Marseille, France), S. Longhi (AFMB, CNRS-AMU UMR 7257, Marseille, France), J. Golebiowsky (ICN, CNRS Université de Nice UMR 7272, Nice, France), B. Gontero (BIP, CNRS-AMU UMR 7281, Marseille, France), S.R.A Marque (ICR, CNRS-AMU UMR 7273, Marseille, France) and A. Rockenbauer (Research Centre of Natural Sciences, Budapest, Hungary). NL is grateful to Aix-Marseille Université and PACA Region for her PhD fellowship.

regulation through the C-Terminus of CP12 in Chlamydomonas reinhardtii. Biochemistry 50, 2881–2888. doi: 10.1021/bi1020259


from spin-labeling EPR spectroscopy. Structure 19, 1549–1561. doi: 10.1016/j.str.2011.10.009


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2015 Le Breton, Martinho, Mileo, Etienne, Gerbaud, Guigliarelli and Belle. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# There is Diversity in Disorder—"In all Chaos there is a Cosmos, in all Disorder a Secret Order"

Jakob T. Nielsen\* and Frans A. A. Mulder

Department of Chemistry and Interdisciplinary Nanoscience Center, University of Aarhus, Aarhus, Denmark

The protein universe consists of a continuum of structures ranging from full order to complete disorder. As the structured part of the proteome has been intensively studied, stably folded proteins are increasingly well documented and understood. However, proteins that are fully, or in large part, disordered are much less well characterized. Here we collected NMR chemical shifts in a small database for 117 protein sequences that are known to contain disorder. We demonstrate that NMR chemical shift data can be brought to bear as an exquisite judge of protein disorder at the residue level, and help in validation. With the help of secondary chemical shift analysis we demonstrate that the proteins in the database span the full spectrum of disorder, but still, largely segregate into two classes; disordered with small segments of order scattered along the sequence, and structured with small segments of disorder inserted between the different structured regions. A detailed analysis reveals that the distribution of order/disorder along the sequence shows a complex and asymmetric distribution, that is highly protein-dependent. Access to ratified training data further suggests an avenue to improving prediction of disorder from sequence.

Keywords: intrinsically disordered proteins, NMR spectroscopy, data interpretation, statistical, chemical shift, protein conformation

# INTRODUCTION

A systematic re-examination of the protein structure–function paradigm is required to accommodate intrinsically unfolded/disordered proteins (IDPs). There are two major reasons for this reappraisal: (1) the results of bioinformatics analyses of the genomic codes for protein amino acid sequences, and (2) the accumulation of experimental evidence for the existence of a rather large number of protein domains and even entire proteins, lacking ordered structure under physiological conditions (Dyson and Wright, 2001, 2004; Uversky, 2003; Vucetic et al., 2003). Analysis of sequence data for complete genomes indicates that intrinsically disordered proteins are highly prevalent, and that the proportion of proteins that contain such segments increases with the increasing complexity of an organism; Putative long (>30 residue) disordered segments are found to occur in 2.0% of archaean, 4.2% of eubacterial, and 33% of eukaryotic proteins (Ward et al., 2004). Protein disorder roughly segregates into three major classes, depending on whether disorder serves a primary functional role, or serves permanent or transient interactions (van der Lee et al., 2014). Moreover, disorder maps to proteins with important functions, such as signal transduction and control of transcription, and IDPs are involved in all major classes of disease (Gregersen et al., 2006; Uversky et al., 2008). Although proteins may adopt different structures inside a cellular milieu, this paper is concerned solely with proteins studied under in vitro conditions. For the

#### Edited by:

Kris Pauwels, Vrije Universiteit Brussel, Belgium

#### Reviewed by:

Wim Vranken, Vrije Universiteit Brussel, Belgium Piero Andrea Temussi, Università di Napoli Federico II, Italy Miquel Adrover, Universitat de les Illes Balears, Spain

#### \*Correspondence:

Jakob T. Nielsen jtn@chem.au.dk Quote from: Carl G. Jung in "Die Archetypen und das kollektive Unbewusste", 4th ed., Walter, 1980

#### Specialty section:

This article was submitted to Structural Biology, a section of the journal Frontiers in Molecular Biosciences

Received: 27 November 2015 Accepted: 25 January 2016 Published: 11 February 2016

#### Citation:

Nielsen JT and Mulder FAA (2016) There is Diversity in Disorder—"In all Chaos there is a Cosmos, in all Disorder a Secret Order". Front. Mol. Biosci. 3:4. doi: 10.3389/fmolb.2016.00004 biological relevance of disorder for the selected proteins, the interested reader is referred to the original papers that described the proteins considered herein.

This paper draws on methodological advances in NMR spectroscopy to study IDPs, and systematic analysis of chemical shift data for the prediction of disorder from sequence. The aim of this work is to develop an experimentally calibrated "ruler" to detect and quantify sequence-specific protein disorder. NMR chemical shifts offer highly reliable and redundant residuespecific information on positional disorder, and this information is easy and unambiguous to get, using recently developed approaches (Jensen et al., 2013; Kragelj et al., 2013; Felli and Pierattelli, 2014; Konrat, 2014). In addition, the growing amount of NMR chemical shift assignment data now allows for rigorous and comprehensive analysis of protein disorder, and to employ this ruler to gauge the types of variation of protein disorder.

In this paper, we search for any potential trends or variations in order/disorder in an assorted set of proteins. To this end, we constructed a comprehensive collection of proteins with varying degrees of (partial) disorder, for which assigned NMR chemical shifts are available. We subsequently asked: "Is disorder similar in proteins, or are there different patterns to be discerned?," "What is a typical variation between order/disorder?," and "Are there proteins that deserve the label super unfolded, and are they representative of the general class of IDPs?." We demonstrate that it is possible to answer these questions, with the methods discussed herein. It is hoped that this initial experimental evaluation of residue-specific protein positional disorder will spark the further evolution of assessment tools for predicting disorder with greater detail and accuracy. In addition, our analysis paints a validated picture of protein disorder for a diverse subset of 117 example proteins that are either classified as IDPs, or possess long intrinsically disordered regions (IDRs), showing a highly abounding sequence context dependence.

#### METHODS

# Generation of a Curated Database with Available NMR Chemical Shift Data

A set of proteins with different degrees of structural disorder, for which assigned chemical shifts were available, was generated in two steps. First, a set of proteins was generated from a keyword search in the BioMagResBank (BMRB) database (http://www.bmrb.wisc.edu; Ulrich et al., 2008). Second, we augmented this database with sequences from the Database of Protein Disorder (DisProt) (http://www.disprot.org; Vucetic et al., 2005; Sickmeier et al., 2007) of disordered regions, for which data were also present in the BMRB. In the first step, the BMRB database was searched for entries according to the BMRB database tag for the physical state of the protein; Entries were selected where \_Entity\_assembly.Physical\_state = "denatured," "intrinsically disordered," "molten globule," "partially disordered," and "unfolded." In addition, entries with the words "disordered" or "unstructured" in the entry title were also included. In the second case, the BMRB database was searched for matches of all SwissProt identifiers present in the DisProt database. The sequences from DisProt and BMRB were aligned using the EMBOSS (http://www.ebi.ac.uk/ Tools/psa/emboss\_needle/) implementation of the Needleman– Wunsch alignment algorithm (Needleman and Wunsch, 1970) and BMRB entries with >20% of the aligned residues classified as disordered by DisProt were retained. In addition, an entry with in-house assigned chemical shift for the N-terminal heavy metal binding domain, residues 1–84, of a cupper-binding ATPase (Gourdon et al., 2011) in an unstructured state was added to the database.

Subsequently, the database was curated. First, conditions that deliberately destabilize folded proteins, such as extremes of pH, extremes of temperature, and denaturants were selected against, in order to avoid the artificial inclusion of disorder under non-native conditions. This step also removes any digressions that such conditions have on the chemical shifts: Only proteins with near-neutral pH (4 < pH < 9), in weak salt buffer, not in complex with other molecules, and without modified amino acids were kept. Typical conditions of excluded entries were: pH < 4.0, bicelle/micelle/SDS present in buffer, TFE/DMSO/GuHCl or other denaturants added, presence of cofactors, and phosphorylation. Next, entries having fewer than 40 amino acid residues or fewer than 50 assigned chemical shifts were removed. Third, the collection of entries was culled in two steps to remove redundant data. To find highly similar entries, the EMBOSS Needle program was used to calculate pairwise sequence identity between the sequences. In the first step, sets of proteins families, defined as sets of chains with >90% mutual sequence identity, were identified and only the entry with the most native-like sequence/condition from each set was kept in the database. For example, wild type sequences were kept and mutant sequences were discarded when both were present, and in case data were available at different acidities, the pH closest to neutral was preferred. In the second step, groups of homologous chains, defined by having >50% sequence identity between at least one other sequence in the set, were identified. In each set, the subset of sequences that yielded the maximum total number of chemical shifts, and where all pairs had < 50% sequence identity, were kept. Finally, it was required that at least one residue could be defined as disordered based on NMR chemical shift data (see below, Equation 5). This procedure resulted in the construction of a set of 117 proteins with assigned chemical shifts, with varying degrees of structure/disorder. The list of 117 BMRB identifiers is provided as Table S1.

# Calculation of Order/Disorder Metrics from Experimental NMR Data

Weighted sum of squared differences between observed and predicted shifts were calculated as follows

$$\chi^{\,^2}(i) = \sum\_{n} \sum\_{j=i-1, i, i+1} \min(\Delta\_{jn}^{\,^2}, 16) \tag{1}$$

where the difference was truncated to four standard deviations using:

$$
\Delta\_{jn} = \frac{\delta\_{obs}\left(j, n\right) - \delta\_{pred}\left(j, n\right)}{\sigma\left(n\right)}\tag{2}
$$

Here δobs j, n is the observed (offset corrected) chemical shift of atom type n for residue j, and δpred j, n is the neighbor corrected random coil chemical shift predicted based on the tripeptide centered at residue j (Tamiola et al., 2010). The chemical shift difference is scaled with the expected difference for residue j in an IDP using σ(n) = 0.627, 0.310, 0.219, 0.281, 0.102, 0.0566, 0.0546 ppm for n = N, C′ , Cβ, Cα, HN, Hα, and Hβ, respectively. The supposed chi-square distributed number χ 2 is transformed to an approximately normal distributed number L, by using linear combinations of fractional powers of χ 2 (Canal, 2005).

$$L = \rho^{1/6} - \frac{1}{2}\rho^{1/3} + \frac{1}{3}\rho^{1/2}, \ \rho = \frac{\chi^2}{N} \tag{3}$$

where N is the number of assigned chemical shifts for the triplet. L is converted to a standard normal distributed number, ZIDR, by correcting with the known mean and standard deviation for L (Canal, 2005).

$$Z\_{IDR} = \frac{L - \mu\_L \text{ (N)}}{\sigma\_L \text{ (N)}} \tag{4}$$

We assign local sequence specific residue states as either disordered (D) according to the definition:

$$Z\_{\text{IDR}} < 3 \tag{5}$$

or alternatively as ordered (O) if ZIDR ≥ 3. A protein sequence can be thought of as consisting of alternating segments of disordered and ordered residues with lengths s<sup>i</sup> . The sequence disorder complexity, CSD, is now defined as

$$C\_{SD} = \frac{e^H - 1}{N}, \ H = -\sum\_{i} \frac{s\_i}{N} \ln\left(\frac{s\_i}{N}\right),\tag{6}$$

where N is the number of residues in the protein. Note that H is closely related to the Shannon entropy of a statistical distribution. If a protein is built of n segments of equal length then e <sup>H</sup> = n and if the segments have different lengths e <sup>H</sup> < n. In particular, a protein with exclusively disordered/ordered residues (pure state) will have H = 0, e <sup>H</sup> = 1, and CSD = 0 whereas a protein where order and disorder continuously alternate along the sequence has a maximal value of CSD = 1. If order and disorder are independently randomly distributed with a probability of 0.5, we simulated with a random number generator that the sequence disorder complexity would be ca. 0.41 on average. For a general probability, p, and random outcome: CSD ≈ p. Therefore, it make sense to compare the sequence disorder complexity to the fraction of disordered residues f<sup>D</sup> = ND/N where N<sup>D</sup> is the number of disordered residues (ZIDR< 3). As such, the relative sequence disorder complexity, CSD/f<sup>D</sup> is expected to have a smaller variation.

## Procedure for Re-Referencing Assigned Chemical Shifts

Chemical shifts are deposited using different referencing procedures at different conditions such as temperature, added salt, and pH, and hence, it is likely that in some cases the observed chemical shift would be slightly, yet systematically, offset from the random chemical shift derived from the sequence. However, since even small deviations from random coil shifts are indicative of structure ordering, we estimated an offset correction for each entry in our database. The chemical shifts were re-referenced for each atom type independently using the following procedure: First, the neighbor corrected random coil chemical shifts were calculated for all residues following the procedure of Tamiola et al. as implemented in the program ncIDP (http://www.proteinnmr.org/; Tamiola et al., 2010), and the deviations from random coil chemical shifts, 1, were calculated using Equations (1) and (2) above. Assuming that the NMR data is correctly referenced, this procedure identifies small deviations due to deviations in pH and temperature of the experimental data relative to the reference database. Next, the standard deviation of 1 was calculated for nine consecutive residues, and the sequence position with the smallest standard deviation was identified. The average of 1 for the nine residues was then used as candidate offset correction. The average value of ZIDR was evaluated using (i) the candidate offset correction as described above and (ii) no offset correction. The scenario leading to the smallest average ZIDR was chosen as the initial offset estimation (i.e., either using the candidate offset or no offset correction). because chemical shift distributions in IDRs show a distribution (Tamiola et al., 2010), much narrower than those in structured parts (Wang and Jardetzky, 2002; Zhang et al., 2003), owing to the many additional contributions to the chemical shifts in structured proteins (Wishart and Case, 2001; Shen and Bax, 2010; Nielsen et al., 2012). Finally, using the chosen initial offset estimation, the average 1 using all the chemical shifts for the particular atom type in disordered residues only (Equation 5) was calculated and this number was used as the final offset correction. To avoid eliminating true deviations originating from structure, in both cases, the revised offset was only used if there was a significant reduction in the chemical shift standard deviation, σ(δoff), relative to the uncorrected equivalent, σ(δ0), considering the number of chemical shift observations, N, by the application of Akaikes Information Criterion (Akaike, 1974, 1985) using a variance inflation factor of 10.0. i.e., the offset correction was only accepted if: N ∗ ln(σ(δ0)/σ(δoff)) > 20.0 and N > 3.

#### RESULTS

## Analysis of Chemical Shift Dispersion for Seven Case Story Proteins

The database of proteins containing disordered regions was constructed as described in Methods. This carefully curated database contains 117 unique proteins chains and 13,069 residues with 65,574 assigned backbone and Cβ/Hβ chemical shifts in total (excluding terminal residues). The database contains many well-characterized IDPs such as α-synuclein (αSyn, bmrbID = 6968) (Bermel et al., 2006), small heat shock protein (Hsp12, bmrbID = 17483; Singarapu et al., 2011), the CD79a cytosolic domain (bmrbID 18867; Isaksson et al., 2013), apo-IscU (bmrbID = 17836; Kim et al., 2012), and the cytoplasmic domain of human neuroligin-3 (bmrbID = 17290; Wood et al., 2012). For each protein in the database we calculated the scaled difference, 1, from random coil shifts (Equation 2). This difference is plotted as a function of residue number for seven representative proteins from the database in **Figure 1**. It is seen that for the two proteins, known to be intrinsically unstructured, αSyn (bmrbID = 6968) and Hsp12 (bmrbID = 17483), fluctuations away from random coil shifts are very small throughout the sequence. This is also borne out by the fraction of disordered residues being 97% in both cases (see Methods Equations 1–5 and below). Conversely, two other proteins reported to be IDPs, the 18.5 kDa isoform of murine myelin basic protein (bmrbID = 15131; Libich et al., 2007), and the Cholera Toxin Enzymatic Domain (1– 167) (bmrbID = 15162; Ampapathi et al., 2008) display larger scatter. Values for the fraction of disordered residues f<sup>D</sup> are 0.45 and 0.28, respectively, which indicates that these order/disorder transitions of the cholera toxin enzymatic domain are essential for function, the protein is, in fact, mostly folded. For three further proteins, we note yet another pattern in their chemical shift residuals along the sequence, containing distinct separate segments with larger local spread of the chemical shifts for all nuclei. The examples chosen here are human cardiac troponin I, residues 1–73 (Hwang et al., 2014), inhibitor-2 involved in protein phosphatase 1 regulation (Kelker et al., 2007), and the small VCP/p97-interacting protein (Wu et al., 2013; BMRB ids: 25118, 15179, and 19485, respectively). In these examples, the larger local scatter is biased in the direction of downfield shifted C ′ and Cα chemical shifts and upfield shifted Cβ, Hα, HN, and N chemical shifts (**Figures 1C–E**). This observation is consistent with the presence of helical structure formation. In addition, in these three examples, we see a variation in helix size within the same protein, as well as amongst different proteins. A variation in the amplitude of the chemical shift residuals is also observed. Since chemical shifts are time-averaged observables, smaller amplitude of the chemical shift residuals correspond to fractional occupancies of helical states as discussed previously (Marsh et al., 2006).

#### Local Disordered Residues

Inspired by the observations of local disorder/order in small segments in a protein, we develop here a formalism, where we state that a residue can be in one of two situations: either an (intrinsically) disordered state (D) corresponding to a non-biased mixture of conformations dictated by the Boltzmann distribution with resulting population-averaged chemical shifts, or in a completely ordered state (O) with a fixed structure. Here we use the simplest probabilistic model, a normal distribution, for the chemical shift in a disordered state. The mean can be estimated from the primary structure as described in Tamiola et al. (2010) and the standard deviation can be derived from statistics (see Methods). Conversely, for an ordered residue, the chemical shifts would have much larger deviation from the mean, corresponding to a bias in the Boltzmann distribution of conformations. Outliers from the normal distribution are indicative of residual order, and can be identified and analyzed for each atom specific chemical shift, but when analyzed in combination the evidence is much more reliable. Therefore, we introduce the Chemical shift Z-score for assessing Order/Disorder (the CheZOD score), ZIDR, derived from the rmsd of all chemical shift residuals within a residue triplet and linear combinations of fractional powers of this rmsd as described in Methods (Equations 1–4). Assuming that the chemical shifts are normally distributed in disordered residues, ZIDR is standard normal distributed. This is valid independent of the number of backbone chemical shifts available. Hence, we define the distinction between disordered and ordered residues based on outliers from the normal distribution for ZIDR; a residue is said to be disordered if the CheZOD score, ZIDR < 3, and protein is said to have local disorder at this position in the sequence. Furthermore, the CheZOD score is not only a binary classifier of order/disorder, but provides a scale for the "degree of disorder," which can both classify partially formed/fractionally occupied structure (ca. 3 < ZIDR < 8) and fully formed structures (ZIDR > 8).

# Disorder Profiles of All Proteins in the Database

The CheZOD score, ZIDR, was calculated for all residues in all the 117 proteins in the database, henceforth the CheZOD database, and the fraction of the disordered residues was calculated for each protein. We observe that no protein has 100% disordered residues according to our definition. However, five proteins have 95% disordered residues or more, among these are αSyn and Hsp12 (discussed above) and also p15(PAF) (bmrbID = 19332) (De Biasio et al., 2014), the CD3e cytosolic domain of Eukaryota Metazoa Homo sapiens T-cells (bmrbID = 18889; Isaksson et al., 2013), and recombinant murine BG21 isoform of Golli myelin basic protein (bmrbID = 7358; Ahmed et al., 2007). Furthermore, the CheZOD database contains 15, 34, and 70 proteins that have >90, 80, and 50% disordered residues, respectively, for which NMR chemical shifts are available, spanning the full range from disordered to ordered.

The CheZOD score along the sequence (disorder profile) is visualized in **Figure 2** for all proteins in the CheZOD database. It is seen that the database roughly separates into proteins that are mostly disordered, with small segments of order scattered along the sequence, and mostly structured proteins, with small segments of disorder bridging between ordered domains, and in particular at the termini. Conversely, cases with roughly equal amounts of disordered and ordered residues are relatively rare (see also **Figures 3A, 4**). A second conclusion from our analysis is, that we did not identify proteins where the fluctuations between ordered and disordered residues were completely random. Rather, all proteins have distinct mediumlength segments of ordered/disordered residues (with a typical length of 10–30 residues) indicative of short-range correlated behavior, irrespective of the average state of the protein. Careful inspection of **Figure 2** reveals that most residues are either fully disordered or fully ordered, whereas fewer are partly ordered (ca. 3 < ZIDR < 8; see also **Figure 3B**). Hence, proteins in "the Twilight zone" which are not completely ordered or disordered,

FIGURE 1 | Examples of IDP disorder profiles. (A–G) weighted difference between observed and predicted shifts, 1jn (Equation 2) shown with blue, red, black, green, cyan, magenta, and yellow dots for C′ , Cα, Cβ, Hα, HN, N, and Hβ, respectively. The Z-score (Equation 4) is shown as a black line, showing the lines, Z = 0 and Z = 3 for reference with black broken lines. The name of the protein analyzed is indicated at the top of each panel to the left. Three numbers are provided on top of each panel (middle) referring to, the BMRB id, the fraction of disordered residues, and the sequence disorder complexity, respectively.

FIGURE 2 | Visualization of disorder profiles for the 117 proteins sequences in the CheZOD database. Each row represents a single protein where disordered residues (ZIDR < 3) are shown in blue, ordered residues (ZIDR > 3) are shown in red, and residues with average order shown in green/yellow. The proteins are depicted from top to bottom sorted according to the average of ZIDR for the protein.

both at the local and global level, appear to be under-represented. An example of such a rare protein is the 18.5 kDa isoform of murine myelin basic protein, shown in **Figure 1F**.

# The Sequence Disorder Complexity of a Protein

We now define a measure of the extent of alternations between ordered and disordered segments, the sequence disorder complexity, CSD, which is defined as the Shannon entropy of ZIDR < 3 (along the sequence), scaled by the length of the sequence (see Equation 6). CSD is 0 for a 100% disordered or ordered protein, and largest for a protein with many alternations between ordered/disordered segments (showing the protein with the maximum sequence disorder complexity in the CheZOD database in **Figure 1F**). In **Figure 4** the sequence disorder complexity, CSD, is shown as a function of the fraction of disordered residues, fD. For comparison, the relative complexities CSD/f<sup>D</sup> and CSD/(1−fD), which have a more even variation, are shown in **Figure 5**. It is seen that (i) the distribution of order/disorder along the sequence is not completely random (i.e., the complexity is always much less than the maximum value for CSD) (ii) The distribution is also not close to minimum complexity, which would correspond to two separate regions of order/disorder. (iii) The relative complexities CSD/f<sup>D</sup> and CSD/(1−fD), as viewed in **Figure 5**, are asymmetric around f<sup>D</sup> = 0.5. It is seen that mostly disordered proteins (f<sup>D</sup> > 0.5) are more complex compared to their mostly structured counterparts, i.e., they appear to have a relative larger number of scattered smaller segments with local order. However, this apparent asymmetry is mostly due to the specific definition of complexity; i.e., if we choose a different cut-off for a unstructured residue, say ZIDR > 6 (rather than ZIDR < 3, Equation 5), the corresponding correlation becomes more symmetric (data not shown) and barely reflects that structured regions are formed internally in the sequence, whereas disordered regions are more typically formed at the termini of the sequence.

# Data Extraction and Fraction of Disordered Residues: BMRB vs. Disprot

In **Figures 4**, **5** there are some trends visible in the fraction of disordered residues, fD, related to the procedure for data

with a different color according to the "physical\_state" tag provided in the BMRB database. The proteins shown in blue were found in the DisProt database search, where proteins shown in blue had "physical\_state" = "native" and cyan points refer to as "unknown" or had missing specification of "physical\_state" in the BMRB file. The proteins shown with other colors than blue and gray were found in the BMRB database key word search.

extraction and the physical\_state tag. In particular, the proteins that have physical\_state = "intrinsically disordered" are indeed mostly disordered (f<sup>D</sup> ≥ 0.5) in 15 cases, except one. Likewise, proteins corresponding to physical state tags "denatured" and "unfolded" describe only 1 of 6 and 1 of 7 proteins, respectively, that are mostly ordered. This is in some contrast to the group of proteins found from text searches of "disordered" or "unstructured" (see Methods), which were also labeled as "native" (green points in **Figures 4**, **5**). For this group, 10 out of 42 are actually mostly ordered, yet all except one of these proteins still have some degree of disorder with f<sup>D</sup> > 0.1. This bias was even more pronounced when considering the group of proteins identified by searching the DisProt database for corresponding entries in the BMRB database (see Methods). The proteins in this group were labeled with a "native" physical state (blue points in **Figures 4**, **5**) or had no label (2 cases, cyan). The search in the

DisProt database also resulted in re-identification of some of the proteins already present in the database from the BMRB physical state tag and text search (shown as colored squares in **Figure 5**, and not included in the counts below).

Since the DisProt database contains sequences with various contents of disorder, we contend that a protein from DisProt/BMRB is validated as disordered if at least 90% of the residues (aligned between the BMRB and DisProt entries) were defined as disordered in the DisProt database. Only the validated Disprot entries are shown in **Figure 4** (together with all the BMRB entries), whereas all DisProt entries are shown in **Figure 5** highlighting the validated ones by circles. It is seen in **Figure 4** that the validated group of DisProt entries contained 11 of 14 mostly ordered proteins, and seven of these were structured proteins, with f<sup>D</sup> < 0.1. It appears that the classification "mostly disordered" in DisProt does not assert that a protein, under the conditions corresponding to the BRMB entry, will be disordered, when judged by chemical shift dispersion. However, if the validated entries from DisProt—which correspond to entries already found by the BMRB physical state searches (**Figure 5**, square points surrounded by circles)—are included in this analysis, the percentage increases, as now 18 of 31 proteins are mostly disordered. Next, we inspected the methods that were used for the disorder classification for the 31 validated DisProt entries (as provided in the DisProt database). A significant correlation is revealed regarding the use of NMR, which was only used in two of the eight cases with almost completely structured proteins, in 3/12 cases for mostly structured, and in 11/19 cases for the mostly disordered proteins (see annotations in **Figure 5**) suggesting that NMR is likely one of the most accurate methods for assessing disorder in proteins.

The above observations together show that the physical\_state tag corresponds well to the content of structure in the protein, and, in particular, using this tag to search for intrinsically disordered proteins is a reliable procedure for identifying IDPs. Although the tags "intrinsically disordered," "unfolded," and "denatured" all consistently yield disordered proteins, one could suspect that in the latter two cases the protein might be in a somewhat non-native state, biased by conditions that could induce unfolding/denaturation of the protein (although it was Nielsen and Mulder Diversity in Disorder

specifically tried to avoid this, by excluding entries where it was specified that denaturants or co-factors were added, see Methods). This impression is supported by the fact that all six "denatured" proteins in the CheZOD database, and four of seven of the "unfolded" proteins, also have a structure available in the PDB database (as defined by the cross-ref PDB match in the BMRB entry), indicating that a folded version of the protein exists under some condition. For comparison, only 2 of 15 proteins with "intrinsically disordered" physical state have a folded structure in the PDB. An alternative procedure to identify "truly disordered" proteins, as demonstrated here, would be to search in the BMRB title for the words "disordered" or "unstructured" and analyze the proteins with a "native" physical state (green points in **Figures 4**, **5**). In this case, only 15 out of 42 had a folded structure available in the PDB database. In general, there is a close relationship between the fraction of disordered residues and the presence of a folded structure in the PDB; 26 of 70 of the mostly disordered proteins have a determined structure, whereas a vast excess of 44 of 47 from the mostly ordered proteins have a structure present in the PDB.

# DISCUSSION

We have constructed here the CheZOD database that contains detailed information about the extent of disorder in 117 carefully curated protein data entries. Following our bottom-up approach to include all available data for potentially disordered proteins in the BMRB database, it was intended that the CheZOD database should be representative for the full range of intrinsically disordered proteins. In the quest for this we used both tag searches in BMRB for the physical state, as well as searches to find chemical shifts for entries in the DisProt database. The CheZOD database is relatively small compared to other databases with 117 proteins, but one can imagine that the CheZOD database could be expanded even further in the future by including searches for entries in other databases such as MobiDB (Potenza et al., 2014), IDEAL (Fukuchi et al., 2014), and D2P 2 (Oates et al., 2013). We also expect further entries with chemical shifts to be available for inclusion in the CheZOD database in the near future following the recent growing focus on IDP research and advances in NMR analysis of IDPs. Despite the efforts made here, we do not claim here that our database is complete in the sense that it would cover all possible classes of proteins or types of disorder. We also concede that our database is slightly "biased" as a whole, in the sense that all entries correspond to proteins amenable to NMR spectroscopy, such as soluble proteins with small/medium size at relatively high concentrations. Notwithstanding these subtle objections, we still argue that the CheZOD database is representative for protein disorder. All entries in the CheZOD database is summarized in Table S1, and the full database including all the CheZOD Z-scores and backbone chemical shifts for each residues is available from www.protein-nmr.org.

We have used here our method for chemical shift referencing based on recalibration of the distribution of chemical shift for the disordered residues, and the random neighbor corrected chemical shifts of Tamiola et al. (2010). We now compare with the LACS method for re-referencing chemical shifts (Wang et al., 2005), Kjaergaard et al. random coil chemical shifts (Kjaergaard and Poulsen, 2011), and the RCI method for estimating local dynamics based on the chemical shifts (Berjanskii and Wishart, 2005).

Our method for re-referencing the chemical shifts identifies the disordered residues and calculates the average secondary chemical shifts, 1ave, for these residues. In theory, 1ave must be very close to zero, but it will deviate from zero for several possible reasons: (i) incorrect referencing by the authors of the entry, (ii) influence by sample conditions such as isotope effects, solvent, buffer, temperature, and pH, (iii) systematic bias in the estimation of neighbor corrected random coil shifts, (iv) small "true deviation" due to residual structure bias. Our method aims at addressing (i–iii) with a phenomenological offset correction by subtracting the average value, 1ave, from the chemical shifts. While some of the effects might be rather small, they could still have a dramatic impact on the interpretation of disorder due to the small variation in secondary shifts for IDPs. Unfortunately, it is of course not possible to separate effects of (i–iii) from "true deviation" which could mask structural signatures and thereby the disordered residues would appear slightly more disordered. We note, however, that this re-referencing does not lead to a larger number of residues being classified as disordered but only effects the amplitude of the disorder, merely leading to a slight skewing of the Z-score scale in the low-value end of the scale. Following the rationale that the effect of (i–iii) would often be much larger than the "true deviation" correction, this is why we applied the offset correction exclusively in cases where it was significantly different from zero as judged by Akaikes information criterion.

Analysis of all the applied offset corrections for the full CheZOD database by each atom type reveals that the offset correction was used in 20.6% of the chemical shift sets when analyzing the individual atom types separately (see Figure S1 in the Supplementary Material). In most cases the offset corrections were small (<0.5 ppm for <sup>13</sup>C and <sup>15</sup>N and <0.15 ppm for <sup>1</sup>H) but for a few entries, a large offset correction (>2 ppm) was needed for the carbon atom types. In these cases with large corrections, the offset for the different carbon atom types were correlated, underscoring the credibility of our method for offset corrections although these chemical shifts are sometimes assigned from different spectra. Some systematic trends in the sign of the offset correction can be learned, in particular the Hα and H<sup>N</sup> correction were often negative and positive, respectively, by ca. 0.1 ppm (discussed in more detail below). Furthermore, it is seen that offsets were more frequently used for more disordered proteins, which was to be expected, since the offset is calculated using only the disordered residues and hence, a deviation from the expected mean would be more statistically significant according to Akaikes information criterion, which is proportional to the number of data points.

In order to provide independent validation of our method, we compared our offset corrections to offset corrections estimated by LACS (Wang et al., 2005), and found that the two methods agree exceedingly well (see Figure S2 in the Supplementary Material), with the exception of the Hα shift, for which there was a small systematic deviation, which is, however, accounted for when using our phenomenological offset correction method. These Hα offsets predicted by our method were ca. −0.1 ppm on average whereas the offsets for the same entries were ca. +0.1 ppm although still apparently linearly related. This observation suggests small differences in the reference values used for Hα random coil chemical shifts for the two methods. There were some notable differences between our method and LACS that implicate that it cannot be applied in all cases; in particular LACS requires assigned Cα and Cβ chemical shifts and only provides offset correction estimates for Cα, Cβ, Hα, and C′ .

To address the possibility that observed chemical shifts could be affected by sample conditions, we compared our secondary chemical shifts and Z-scores with the corresponding results obtained from derivations using the random coil chemical shifts from Kjaergaard et al. (Kjaergaard and Poulsen, 2011), which includes an estimate of the effect due to variations in temperature and pH. Comparing the results for our seven test case entries in **Figure 1**, there were only some small differences for the disordered residues where the secondary chemical shifts were small compared to the difference between the random coil shift estimates, and elsewhere largely very similar results for the two methods (see Figure S3 in the Supplementary Material). This observation suggests that our method already includes the effects on chemical shifts from pH and temperature implicitly. However, since it was not possible to account for all effects of the sample conditions on the chemical shifts or for cooperative/nonlinear effects of the different parameters, we still argue that our phenomenological correction is the most appropriate.

NMR is very sensitive to ensemble averaging of local conformations, which is measured very accurately from the chemical shifts. Therefore, chemical-shift based methods for assessing order/disorder in proteins are both position-specific, and also provide a scale for the extent of disorder/order. These properties contrast with other biophysical techniques such as circular dichroism (CD), and infra-red (IR) spectroscopies, aimed at estimating the content of secondary structure, or techniques for estimating the protein size, density, hydrodynamic drag, or diffusion properties, using for example, size exclusion chromatography, ultracentrifugation, SAXS, or limited proteolysis (Vucetic et al., 2005; Tompa, 2009; Uversky and Longi, 2010). Missing density in X-ray crystallography data may also be indicative of local disorder (though it doesn't provide a scale for the disorder), but X-ray diffraction cannot be applied to proteins that do not contain any structured elements. Hence, with the chemical shift analysis proposed here, it is possible to "zoom-in" to look for finer details, and get a more detailed picture of the diversity in disorder.

Some other methods also provide residue based estimates for the local dynamics including δ2D (Camilloni et al., 2012) and the Random Coil Index (RCI) method (Berjanskii and Wishart, 2005). We have compared our Z-score for our seven test case proteins to the RCI estimates of the S<sup>2</sup> local order parameter in Figure S3. There was qualitatively agreement between the RCI estimated S<sup>2</sup> and our Z-score, i.e., low order for the disordered residue and high order (S<sup>2</sup> close to 1) for the ordered residues. However, there were still some more subtle differences between the two scores. Firstly, the largest differences were for the disordered residues where the secondary chemical shifts were small compared to the difference between the random coil shift estimates. This can be understood, since RCI uses a floor value for the absolute chemical shifts meaning that very small deviations from the chemical shifts are not captured and hence also very disordered residues are not distinguished. Secondly, the full range of orders appears to be less well described by the RCI estimates yielding only small differences for fully and partly formed structure (see e.g., Figure S3d). Thirdly, the RCI method gives a smoother trajectory along the sequence, which is because it uses both a three-point averaging of the absolute secondary chemical shifts as well as for the actual RCI index values. Finally, we note that whereas the RCI method is heuristic, based on a weighted sum of the absolute value of secondary chemical shifts for all atom types, our method is statistical, based on the chi square distribution of squared secondary chemical shifts, and due to this formulation it provides adequate estimates for the local dynamics also in cases were only a subset of the chemical shifts are available such as only proton or carbon chemical shifts.

Inspection of our CheZOD database of IDPs reveals that the proteins span a broad range of fractions of disorder and extent of disorder. Proteins are seen with both many fluctuations between segments of order/disorder, characterized by a high sequence disorder entropy, and with separation into larger completely ordered/disordered domains. This great diversity in disorder reflects the broad class of IDPs known to form various functions or interaction with different targets and malleability at different conditions.

Proteins that are either almost completely disordered, or completely structured, are most abundant in the CheZOD database, whereas cases with roughly equally mixed states are rare. This observation likely reflects the cooperative behavior in order/disorder transitions, where formation of ordered segments promotes the formation of other ordered segments and vice versa. Following this line of thought, proteins in the "Twilight Zone" might reflect folding intermediates on a transition path from unfolded to folded. The seemingly cooperative nature of disorder also provides a clue to why it has been so difficult to construct predictive models for disorder from local amino acid composition alone. It is at current difficult to address whether the formation of medium length segments of alternating order/disorder are due to a cooperative transition together with nearest neighbors or due to properties of the amino acids in the segment.

The analysis of the CheZOD database revealed that entries found in the DisProt database were often mostly structured. The higher degree of structure for these protein entries could be due to differences between the sample conditions during analysis corresponding to the BMRB and DisProt entries. For example, the NMR analysis corresponding to the BMRB entry could be performed under conditions that favor structure determination. Alternatively, one could speculate that the identification of disordered regions, leading to the inclusion in the DisProt database, could be based on methods that are less strict compared to NMR or more "coarse-grained," in the sense that they only provide a classification for the full protein and not at a local level. In support of this, we found that DisProt entries, which used NMR to assess disorder, were also more often confirmed to be disordered in our analysis. Hence, a more critical assessment of disorder based on a more reliable and uniform criterion, such as can be derived with NMR, is recommended.

# Implications for the Bioinformatics Analysis of Protein Disorder

The last years have seen a tremendous increase in the number of bioinformatics tools that try to predict protein (dis)order (Schlessinger et al., 2009; Dosztányi et al., 2010) and several prediction methods participated in recent rounds of the Critical Assessment of Structure Predictions (CASP; Noivirt-Brik et al., 2009; Monastyrskyy et al., 2011). Unfortunately, current datasets of experimentally classified ordered and disordered regions (Sickmeier et al., 2007) contain many misclassified segments: In X-ray crystal structures regions may appear ordered due to binding partners or crystal packing forces, and could be disordered in isolation. In disorder databases segments may even be more prone to misclassification, since many longer disordered regions are characterized by semiquantitative experiments that lack position specific information. This suspicion of misclassification was confirmed in our analysis of the validated entries from the DisProt database where several entries where almost completely structured in the CheZOD database. Furthermore, the order/disorder status is also sensitive to environmental conditions, and this fact is not considered. The lack of sufficiently reliable datasets and the noise in the assignment of order and disorder represent a serious limitation in developing accurate prediction methods for protein disorder (Dosztányi et al., 2010). The final, serious shortcoming, of current prediction methods is their inaccuracy when going down to shorter stretches (preliminary analysis). This study reveals a preponderance of proteins with mixed ordered and disordered segments and high sequence disorder complexity—typically, for proteins with mixed order, CSD ≈ 0.1 corresponding to an average segment length of ca. 10 residues.

The CheZOD database presented here was carefully manually curated to exclude any entries with biasing conditions and strictly contains proteins under native conditions. Here the analysis and processing of chemical shifts provides a unique experimentally validated local and quantitative measure of order/disorder. Furthermore, our database covers a broad range of proteins ranging from completely ordered to almost completely structureless. Since predictions in CASP9 were no better than those in CASP8 (Monastyrskyy et al., 2011), and only a small improvement was noted for CASP10 (Monastyrskyy et al., 2014), we hope that our curated CheZOD database can help form the basis for the development of even more accurate

#### REFERENCES

Ahmed, M. A., Bamm, V. V., Harauz, G., and Ladizhansky, V. (2007). The BG21 isoform of Golli myelin basic protein is intrinsically disordered with a highly flexible amino-terminal domain. Biochemistry 46, 9700–9712. doi: 10.1021/bi700632x

and sophisticated predictive models of order/disorder. This will enable us to ask more detailed questions and provide answers to complex biologically relevant problems related to intrinsically disordered proteins.

## CONCLUSIONS

We have used a systematic analysis of NMR chemical shifts to build a database of experimentally validated disordered proteins, the CheZOD database. In contrast to other methods for order classification, our procedure provides a reliable position-specific quantitative measure of order/disorder through our Chemical shift Z-score for assessing Order/Disorder (the CheZOD score). Examples were observed of both maximum CheZOD score in completely ordered segments, intermediate values in loops and fractionally populated structure, and small values in completely disordered regions. Careful inspection and systematic analysis of the entries in the CheZOD database revealed interesting trends and variations. In particular, it was discussed here how proteins can be completely disordered, partially disordered, or only disordered in a small segment. Some proteins can indeed be classified as super unfolded, like human α-synuclein, indicating that this protein is not an archetypal IDP. Through the introduction of the sequence disorder complexity we found diverse patterns of disorder, e.g., that proteins can be segregated into two distinct parts of an ordered and a disordered domain, but also be composed of smaller segments varying alternatingly between order and disorder. A typical segment length of ordered/disordered residues was estimated to be ca. 10 residues. We foresee that further systematic analysis of our CheZOD database will contribute to a more detailed understanding of the relationship between primary sequence and disorder/structure and function.

## AUTHOR CONTRIBUTIONS

Both authors JTN and FM contributed to designing and developing the project and writing the paper. JTN performed the data collection, processing, and analysis by building the database of (intrinsically disordered) proteins with assigned chemical shifts, calculating the chemical shift Z-score indicative of disorder, and analyzing the trends in the database.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fmolb. 2016.00004

Akaike, H. (1974). A new look at the statistical model identification. Automat. Control IEEE Trans. 19, 716–723. doi: 10.1109/TAC.1974.1100705


activation of cholera toxin. J. Mol. Biol. 377, 748–760. doi: 10.1016/j.jmb.2007. 12.075


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer WV and Handling Editor declared their shared affiliation, and the Handling Editor states that the process nevertheless met the standards of a fair and objective review.

Copyright © 2016 Nielsen and Mulder. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.