A structural perspective on protein-protein interactions and complexes

Genome sequencing has provided nearly complete lists of macromolecules present in an organism [1,2]. However, the component lists alone reveal comparatively little about the function of the biological systems because the functional units in cells often correspond to macromolecular complexes [3]. These complexes vary widely in their activity and size [3-7]. They play crucial roles in most cellular processes, and are often depicted as molecular machines [3]. This metaphor accurately captures many of their characteristic features, such as modularity, complexity, cyclic functions, and energy consumption [8]. For instance, the nuclear pore complex, a 50-100 MDa protein assembly, regulates and controls the traffic of macromolecules through the nuclear envelope [9]; the ribosome is responsible for protein biosynthesis; the RNA polymerase catalyzes the formation of RNA [10]; and the ATP synthase catalyzes the formation of ATP [7]. Macromolecular assemblies are also involved in transcription control (eg, the IFNβ enhanceosome) [6,11] and regulation of cellular transport (eg, microtubules in complex with molecular motors myosin or kinesin) [12-14], and are crucial components in neuronal signaling (eg, the postsynaptic density complexes) [15]. A structural description of the protein interactions is an important step toward a mechanistic understanding of biochemical, cellular, and higher order biological processes [16-19].

A comprehensive collection of the known structures of protein complexes is provided by the Protein Quaternary Structure (PQS) database, which currently contains ~12,000 assemblies of presumed biological significance that are derived from a variety of organisms (http://pqs.ebi.ac.uk/pqs-doc.shtml) (April 2004) [20]. The PQS database attempts to provide the best possible biological unit for all proteins, a complex task hampered by crystal packing and other problems. Each assembly consists of at least two protein chains. These assemblies can be organized into ~3,500 groups that contain chains with more than 30% sequence identity to at least one other member of the group [19].
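The grouping rule described above (a chain belongs to a group if it shares more than 30% sequence identity with at least one other member) can be read as single-linkage clustering. The following is a minimal sketch under that reading; the function name, the chain labels, and the pairwise identity values are hypothetical and are not taken from PQS.

```python
def group_by_identity(chains, identity, threshold=30.0):
    """Single-linkage grouping: two chains end up in the same group if they
    are connected by a path of pairs whose identity exceeds the threshold."""
    parent = {c: c for c in chains}

    def find(c):
        # Follow parent links to the group representative (with path halving).
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for (a, b), pct in identity.items():
        if pct > threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for c in chains:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical pairwise sequence identities (percent).
chains = ["A", "B", "C", "D"]
identity = {("A", "B"): 45.0, ("B", "C"): 32.0, ("A", "C"): 18.0, ("C", "D"): 12.0}
groups = group_by_identity(chains, identity)
print(groups)
```

In this toy input, chain C joins the group through B even though its identity to A is below the cutoff, which is the behavior implied by "at least one other member of the group".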

The estimation of the total number of macromolecular complexes in a proteome is a non-trivial task. This difficulty can be partly ascribed to the multitude of component types (eg, proteins, nucleic acids, nucleotides, metal ions), and the varying lifespan of the complexes (eg, transient complexes such as those involved in signaling, and stable complexes such as the ribosome). The most comprehensive information about protein-protein interactions is available for the S. cerevisiae proteome, consisting of ~6,200 proteins. This information has been provided by methods such as the yeast two-hybrid system and affinity purifications followed by mass spectrometry [21-29]. The lower bound on binary protein interactions and functional links in yeast has been estimated at ~30,000 [30,31]; this number corresponds to ~9 protein partners per protein, though not necessarily all at the same time. The human proteome may have an order of magnitude more complexes than the yeast cell, and the number of different complexes across all relevant genomes may be several times larger still. Therefore, there may be thousands of biologically relevant macromolecular complexes whose structures are yet to be characterized [32].
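The per-protein figure can be checked with a short calculation. The sketch below assumes that the estimate is a simple average over a network in which every interaction is binary and undirected; this reading is an assumption, not something stated in the cited estimates.

```python
# Figures quoted in the text; both are approximate.
n_proteins = 6200        # size of the S. cerevisiae proteome
n_interactions = 30000   # estimated lower bound on binary interactions

# Each binary interaction gives one partner to each of its two proteins.
mean_partners = 2 * n_interactions / n_proteins
print(f"mean partners per protein: {mean_partners:.1f}")
```

Under these assumptions the average falls between 9 and 10, in line with the approximate value quoted above.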

We review here recent developments in the experimental and computational techniques that have allowed structural biology to shift its focus from the structures of individual proteins to the structures of large assemblies [19,33,34]. We also illustrate these developments by listing their applications to structure determination of specific assemblies of biological importance. In contrast to structure determination of the individual proteins, structural characterization of macromolecular assemblies usually poses a more difficult challenge. We stress that a comprehensive structural description of large complexes generally requires the use of several experimental methods, underpinned by a variety of theoretical approaches to maximize efficiency, completeness, accuracy, and resolution [19,35].

X-ray crystallography and NMR spectroscopy


X-ray crystallography has been the most prolific technique for the structural analysis of proteins and protein complexes, and is still the ‘gold standard’ in terms of accuracy and resolution (Figure 1a). Structures of several macromolecular assemblies have recently been solved by X-ray crystallography: the RNA polymerase [36], the ribosomal subunits [37-41], the complete ribosome and its functional complexes [42], the proteasome [43], the GroEL chaperonin [44], various complexes involved in the cellular transport machinery [12,13], the Arp2/3 complex [45], photosystem I and the light-harvesting complex of photosystem II [46,47], the SRP complex involved in nascent protein targeting [48], and various viral capsid and virion structures [49-51]. However, the number of structures of macromolecular assemblies solved by X-ray crystallography is still quite small compared to that of the individual proteins, and it will likely be many years before we have a complete repertoire of high-resolution structures for the hundreds of complexes in a typical cell. This discrepancy is due mainly to the difficulty of producing sufficient quantities of the sample and of crystallizing it.

NMR spectroscopy allows determination of atomic structures of ever larger subunits and even their complexes [52-54]. Although NMR spectroscopy is generally not applicable to protein structures with more than 300 residues, it can be applied to molecules in solution. It is increasingly used to determine the residues involved in protein-protein interactions (Figure 1b) [55-58]. For instance, it was recently utilized to describe structural differences among interactions between different LIM and SH3 domains [59].

Electron microscopy and electron tomography


There are several variants of electron microscopy, including single-particle EM (Figure 1c) [60], electron tomography (Figure 1d) [61] and electron crystallography of regular two-dimensional arrays of the sample [62].

For particles with molecular weights larger than 200 to 500 kDa, single-particle cryo-EM can determine the electron density of an assembly at resolutions as high as 5 Å [63-70]. The full 3D structure of the particle is reconstructed from many 2D projections of the specimen, each showing the object from a different angle. Imaging by cryo-EM requires neither large quantities of the sample nor the sample in a crystalline form. Therefore, single-particle cryo-EM is a powerful tool to investigate the structure and dynamics of macromolecular assemblies for which X-ray structure determination is very difficult. Although it is generally impossible to build atomic models solely from cryo-EM density maps, the maps give valuable insights into the structure and mechanism of large complexes. They are particularly useful when combined with atomic-resolution structures of the subunits, as reviewed in the section on hybrid methods below.

One of the most exciting developments in structural biology is the new generation of tomography methods that are based on multiple tilted views of the same object [33,71]. While electron tomography can be used to study the structures of isolated macromolecular assemblies at a relatively low resolution of a few nanometers, its true potential lies in visualizing the assemblies in an unperturbed cellular context [72]. These datasets provide fascinating 3D images of entities as large as a small cell at approximately 5 nm resolution [73]. To widen the scope of cellular tomography, it is necessary to improve both the resolution of the tomographic images and the identification of the structures in these images [73-75]. Theoretical considerations [76] and ongoing improvements in the instrumentation make a resolution as high as 2 nm a realistic goal [77].

Low-resolution experimental methods


A number of experimental techniques can provide structural information about protein interactions at low resolution (Figure 1e). This information may be used to infer the configuration of the proteins in a complex. Methods for mapping of protein interactions may provide contact or proximity restraints on pairs of proteins that are useful in the modelling of higher order complexes. Such methods include new implementations of the two-hybrid system [78-81], tagged affinity chromatography [82,83], and a combination of phage display with other techniques [84] such as synthesis of peptides on cellulose membranes (SPOT) [85]. Because of the low-resolution nature of these biochemical characterizations, care is needed in their interpretation. For example, gauging the biochemically derived interaction sets against known 3D structures of complexes identified potential sources of systematic errors in interaction discovery, such as indirect interactions in two-hybrid systems, obstruction of interfaces by molecular labels, and artificial promiscuity in the detected interactions (Figure 2) [86].

Biochemical and biophysical methods can also be used to derive low-resolution information about the relative position and orientation of the domains in a larger complex. These methods include site-directed mutagenesis that can identify residues mediating the interaction [87], various forms of footprinting such as hydrogen-deuterium exchange [88,89] and OH radical footprinting [90] that can identify surfaces buried upon complex formation, chemical cross-linking [91-93] that can identify interacting residues, fluorescence resonance energy transfer (FRET) [94,95] that can determine the distance between the labelled groups on the interacting proteins, and Fourier Transform Infrared Spectroscopy (FTIR) that describes structural changes upon complex formation [96]. Small angle X-ray scattering (SAXS) is another biophysical method that can provide low-resolution information about the shape of a complex. Recently, SAXS has also been used to study the dynamics of conformational changes in Bruton tyrosine kinase [97,98].
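As an illustration of how one of these measurements becomes a distance restraint, the sketch below converts a FRET efficiency into a donor-acceptor distance using the standard Förster relation E = 1 / (1 + (r/R0)^6). The function name and the numerical values are hypothetical; real analyses must also account for dye orientation and linker flexibility, which are ignored here.

```python
def fret_distance(efficiency, r0):
    """Donor-acceptor distance from FRET efficiency via the Forster relation.

    efficiency: transfer efficiency E, strictly between 0 and 1
    r0: Forster radius (distance at which E = 0.5), in the same unit
        as the returned distance
    """
    if not 0.0 < efficiency < 1.0:
        raise ValueError("efficiency must lie strictly between 0 and 1")
    # Invert E = 1 / (1 + (r / r0)**6) for r.
    return r0 * (1.0 / efficiency - 1.0) ** (1.0 / 6.0)


# Hypothetical example: a Forster radius of 50 A and three measured efficiencies.
for e in (0.2, 0.5, 0.8):
    print(f"E = {e:.1f} -> r = {fret_distance(e, 50.0):.1f} A")
```

Because of the sixth-power dependence, the measurement is most sensitive to distances close to the Förster radius.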

Computational protein-protein docking

When atomic structures of the individual proteins involved in an interaction are known, either by experiment or by modeling, there are a number of computational methods available to suggest the structure of the interaction [99]. Most of these docking methods aim to predict an atomic model of a complex by maximizing the shape and chemical complementarities between a given pair of interacting proteins [99-102]. Docking strategies usually rely on a two-stage approach: they first generate a set of possible orientations of the two docked proteins and then score them in the hope that the native complex will be ranked highly. The searches may be restrained by other considerations, such as the known binding site location. The methods differ in protein representation, scoring of different configurations, and searching for best solutions. Some methods boldly model the actual diffusion/collision trajectories involved in the docking process [103,104].
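The two-stage strategy can be illustrated with a deliberately simplified toy. The sketch below is not any of the cited docking programs: it assumes a 2D grid representation, rigid bodies, rotations in 90-degree steps, and a hypothetical score that counts surface contacts and penalizes overlapping cells.

```python
from itertools import product


def rotate(cells, quarter_turns):
    """Rotate 2D grid cells by multiples of 90 degrees about the origin."""
    out = set(cells)
    for _ in range(quarter_turns % 4):
        out = {(-y, x) for x, y in out}
    return out


def generate_poses(ligand, shifts):
    """Stage 1: enumerate rigid-body poses (rotation plus translation)."""
    for turns, dx, dy in product(range(4), shifts, shifts):
        placed = frozenset((x + dx, y + dy) for x, y in rotate(ligand, turns))
        yield (turns, dx, dy), placed


def score(receptor, placed, clash_penalty=10):
    """Stage 2: reward surface contacts, penalize overlapping cells."""
    clashes = len(receptor & placed)
    contacts = sum(
        1
        for x, y in placed
        for neighbour in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1))
        if neighbour in receptor
    )
    return contacts - clash_penalty * clashes


def dock(receptor, ligand, shifts=range(-3, 4)):
    """Return all poses as (score, pose, placed cells), best score first."""
    scored = [(score(receptor, placed), pose, placed)
              for pose, placed in generate_poses(ligand, shifts)]
    return sorted(scored, key=lambda item: item[0], reverse=True)


# Hypothetical shapes: a U-shaped receptor with a pocket at (1, 1)
# and a two-cell bar as the ligand.
receptor = {(0, 0), (1, 0), (2, 0), (0, 1), (2, 1)}
ligand = {(0, 0), (0, 1)}

ranking = dock(receptor, ligand)
best_score, best_pose, best_cells = ranking[0]
print(best_score, best_pose, sorted(best_cells))
```

In this toy, the top-ranked pose places one end of the bar in the pocket. Real methods differ mainly in how each stage is implemented: the representation of the proteins, the search over orientations, and the scoring function.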

While the docking methods are not sufficiently accurate to predict whether or not two proteins actually interact with each other, they can sometimes correctly identify the interacting surfaces between two structurally defined subunits [105]. Docking methods are systematically assessed through blind trials in the Critical Assessment of PRediction of Interactions (CAPRI), a community-wide experiment that occurs every two years [101,106]. Predictions are made just before the structures are solved experimentally, followed by assessing the models at the CAPRI meetings. None of the methods assessed in the last CAPRI experiment correctly predicted more than 3 of the 7 target complexes [106].

Methods that are able to work with comparative protein structure models [107] instead of experimentally determined subunit structures would extend the applicability of docking to many more biological problems, but would likely have poorer performance. Currently, docking is often applied in concert with experimental techniques, including site-directed mutagenesis [108], amide hydrogen/deuterium exchange [89], NMR spectroscopy [109,110], as well as solid-state binding and surface plasmon resonance [111].

 

Inferring interactions from homology


Protein interactions can also be modeled by similarity [112-114]. If there is a complex of known structure involving homologs of a pair of interacting proteins, it is usually possible to build a model by comparative modeling using the known complex structure as the template. There are now ~2,000 distinct interactions of known structure (Aloy & Russell, unpublished data) that can be used as templates, stored in the PQS database [20].

Building a model of an interacting pair of proteins based on the known structure of interacting homologs raises some questions. The first one is whether or not homology implies a similarity in interaction. It was found that interactions between proteins of the same fold tend to be similar when the sequence identity is above ~30% [115].  Below this cutoff, there is a twilight zone where interactions may or may not be similar geometrically.

Given a template, it is possible to model an interaction using standard comparative modelling techniques [116]. However, frequently there are multiple templates for the same interaction type. In addition, a single interaction template can be used to model many interactions in a single organism. Therefore, it is important to assess the likelihood of these potential interactions, particularly in the absence of experimental validation [117]. For example, each of the dozens of fibroblast growth factors (FGFs) interacts with one or more of seven receptors with different affinities [118]. Two approaches have been developed recently that attempt to predict specificity by modelling interactions. The first approach, implemented by InterPReTS [112,119] and ModBase [114], uses empirical pair-potentials derived from interfaces of known structure to score how well a pair of homologous proteins fits a known complex structure. The second approach, MULTIPROSPECTOR, is similar, although it attempts to study more distantly related protein sequences by threading sequences onto a library of interacting templates, followed by scoring how well the individual sequences fit their proposed folds as well as the interface between them [120]. Both approaches have since been applied to study large collections of sequences and interactions [113,114,121].

For some large complexes, the specificity of interactions within a family of homologous subunits is an important determinant of assembling the complex. For instance, the chaperonin CCT consists of eight homologous subunits that are all similar to the single subunit type comprising the thermosome [122]. Thus, to build CCT using the thermosome requires the conversion of a seven-subunit ring into an eight-subunit ring, and then a choice of the correct arrangement out of the 5,040 (8!/8) possibilities. It is possible to guide this process by experiments, such as the detection of sub-complexes that reveal preferred interacting pairs [123] or application of the two-hybrid system [124]. InterPReTS was also applied, with mixed results, to select one of the 120 possible arrangements of six exosome subunits (Figure 4) [125].
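The two counts quoted above follow from the number of distinct orders of n different subunits around a ring when arrangements that differ only by rotation are treated as identical, which gives n!/n = (n-1)!. A minimal sketch follows; it assumes that mirror-image orders are counted separately, and the function names and subunit labels are hypothetical.

```python
from itertools import permutations
from math import factorial


def ring_arrangements(n):
    """Number of distinct cyclic orders of n different subunits,
    with rotations of the ring identified: n!/n = (n-1)!."""
    return factorial(n) // n


def enumerate_ring_arrangements(subunits):
    """List the cyclic orders explicitly by fixing the first subunit in
    place, which removes the rotational redundancy."""
    first, rest = subunits[0], subunits[1:]
    return [(first,) + order for order in permutations(rest)]


print(ring_arrangements(8))  # eight-subunit ring, as for CCT
print(ring_arrangements(6))  # six-subunit ring, as for the exosome

# Hypothetical labels for six subunits; each tuple is one candidate ring.
candidates = enumerate_ring_arrangements(("s1", "s2", "s3", "s4", "s5", "s6"))
print(len(candidates))
```

Each enumerated candidate could then be scored, for example by summing an interface score over neighbouring pairs in the ring, to rank the possible arrangements.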

Low-resolution computational methods


Even when docking or modelling is not feasible, it may still be possible to get some structural insights into a protein-protein interaction using other computational approaches. Various methods combine structures with sequence alignments and phylogenetic trees to identify sites on the surface that are likely to be involved in function or specificity [126-133]. Other computational methods perform alanine scanning to identify hot spots in structures that may correspond to binding sites for both small ligands and proteins [134]. There are also many computational methods for prediction of protein-protein interactions when no structural information is available (P. Bork and E. Marcotte, this issue).

Hybrid methods


In the absence of atomic-resolution assembly structures, approximate atomic models of assemblies can be derived by combining low-resolution cryo-EM data of whole protein assemblies with computational docking of atomic-resolution structures of their subunits [135-143]. It has been estimated that such fitting techniques can improve the accuracy to as much as one tenth of the resolution of the original EM reconstruction.

Hybrid approaches involving the fitting of subunits into the EM maps are illustrated by pseudo-atomic models of the actin-myosin complex [144], the yeast ribosome [145,146] (Figure 3), the bacteriophage T4 baseplate [147], the pre-mRNA splicing complex SF3b [148], the Rad51 system involved in homologous recombination and DNA repair [149], and complex virus structures [150,151].

Unfortunately, experimentally determined atomic-resolution structures of the isolated subunits are frequently not available. In addition, even if they are available, the induced fit may severely limit their utility in the reconstruction of the whole assembly. In such cases, it might be possible to get useful models of the subunits by comparative protein structure modeling [116,152-155]. The number of models that can be constructed with useful accuracy is already two orders of magnitude higher than the number of available experimentally determined structures. Models with at least the correct fold can be constructed for domains in approximately 58% of the known protein sequences [114]. Comparative modeling will be increasingly more applicable and accurate because of the structural genomics initiative [156]. One of the main goals of structural genomics is to determine a sufficient number of appropriately selected structures from each domain family, such that all sequences are within modeling distance of at least one known protein structure [157,158].

Structural genomics may in fact contribute to a comprehensive and efficient structural description of complexes in an additional way. While structural genomics currently focuses on single proteins or their domains, it could be expanded to the sampling of domain-domain interactions [115,159,160]. Such an effort would provide a repertoire of templates for binary interactions, which would facilitate building of higher-order complexes.

Although X-ray crystallography and EM in combination with atomic structure docking have been successfully employed to solve structures of protein assemblies, they are not capable of efficiently characterizing the myriad complexes that exist in a cell. For example, most of the transient complexes cannot be addressed at all with these approaches. Therefore, there is a great need for hybrid methods where accuracy, high throughput, completeness, and resolution are improved by integrating information from all available sources [19,125,161].

The dynamics of complexes

By trapping the complexes in different conformations and configurations, hybrid methods can be used to study the functional role of assembly dynamics. For instance, models of the two different functional states of the E. coli 70S ribosome demonstrated that the complex changes from a compact to a looser conformation, and showed rearrangements of many of the ribosomal proteins [63]. Similarly, the T antigen double hexamers (a replicative helicase of simian virus 40) were assembled at the origin of replication using 27.5 Å cryo-EM maps at different degrees of bending along the DNA axis [162]. Fitting the crystal structure of the T antigen helicase domain [163] into the 3D cryo-EM density map ascertained that the C-terminal domains are rotated relative to each other in the complex. The results were combined with the available biochemical data to propose an integrated model for the initiation of viral DNA replication. Comparison also revealed details that are key to understanding filament function. Fitting of atomic models of actin and the myosin cross-bridge into 14 Å cryo-EM maps showed that the closing of the actin-binding cleft upon actin binding is structurally coupled to the opening of the nucleotide-binding pocket [67].

The dynamics of assembly models can also be studied by theoretical calculations [164-167]. A vibrational analysis of elastic models was employed to capture the essential motions in clamp closure in bacterial RNA polymerase, the ratcheting of the 30S and 50S subunits of the ribosome, and the dynamic flexibility of chaperonin CCT [168]. In addition, a quantized elastic deformational model provided a basis to simulate conformational fluctuations related to expansion and contraction of the truncated E2 core from the pyruvate dehydrogenase complex [169].
