Today, DeepMind announced that it has seemingly solved one of biology's outstanding problems: how the string of amino acids in a protein folds up into the three-dimensional shape that enables its complex functions. It's a computational challenge that has resisted the efforts of many very smart biologists for decades, despite the application of supercomputer-level hardware to the calculations. DeepMind instead trained its system using 128 specialized processors for a couple of weeks; it now returns potential structures within a couple of days.
The limitations of the system aren't yet clear; DeepMind says a peer-reviewed paper is in the works and has so far made only a blog post and some press releases available. But the system clearly performs better than anything that's come before it, having more than doubled the performance of the best prior system in just four years. Even if it's not useful in every circumstance, the advance likely means that the structure of many proteins can now be predicted from nothing more than the DNA sequence of the gene that encodes them, which would mark a major change for biology.
Between the folds
To make proteins, our cells (and those of every other organism) chemically link amino acids to form a chain. This works because every amino acid shares a backbone that can be chemically connected to form a polymer. But each of the 20 amino acids used by life has a distinct set of atoms attached to that backbone. These can be charged or neutral, acidic or basic, etc., and these properties determine how each amino acid interacts with its neighbors and the environment.
The interactions of these amino acids determine the three-dimensional structure that the chain adopts after it's produced. Hydrophobic amino acids end up on the interior of the structure in order to avoid the watery environment. Positively and negatively charged amino acids attract each other. Hydrogen bonds enable the formation of regular spirals or parallel sheets. Collectively, these shape what might otherwise be a disordered chain, enabling it to fold up into an ordered structure. And that ordered structure in turn defines the behavior of the protein, allowing it to act like a catalyst, bind to DNA, or drive the contraction of muscles.
Determining the order of amino acids in the chain of a protein is relatively easy: it's defined by the order of DNA bases within the gene that encodes the protein. And as we've gotten very good at sequencing entire genomes, we now have a superabundance of gene sequences, and thus of protein sequences. For many of them, though, we have no idea what the folded protein looks like, which makes it difficult to determine how they function.
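To make that first step concrete, here's a minimal Python sketch (an illustration, not anything from DeepMind's work) of how a DNA coding sequence maps onto an amino acid chain. The codon table is truncated to just the codons used in the example; real code would cover all 64, or lean on a library like Biopython.

```python
# Illustrative only: translating a DNA coding sequence into an amino
# acid chain. The codon table is truncated to the codons used in the
# example below; a full implementation would cover all 64 codons.
CODON_TABLE = {
    "ATG": "M",  # methionine (start)
    "GCT": "A",  # alanine
    "GAA": "E",  # glutamate
    "AAG": "K",  # lysine
    "TGG": "W",  # tryptophan
    "TAA": "*",  # stop
}

def translate(dna: str) -> str:
    """Read the sequence three bases at a time and map each codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # a stop codon ends the chain
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGGCTGAAAAGTGGTAA"))  # -> "MAEKW"
```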
Given that the backbone of a protein is very flexible, nearly any two amino acids in a protein could potentially interact with each other. So figuring out which ones actually do interact in the folded protein, and how that interaction minimizes the free energy of the final configuration, becomes an intractable computational challenge once the number of amino acids gets too large. Essentially, when any amino acid could occupy nearly any point in 3D space, the number of arrangements that have to be evaluated explodes.
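A back-of-the-envelope calculation, a variant of Levinthal's classic paradox, shows the scale of the problem. The sketch below assumes, purely for illustration, three backbone conformations per residue and an absurdly generous sampling rate:

```python
# Rough estimate of the conformational search space (Levinthal-style).
# Assumptions (illustrative): 3 backbone conformations per residue, a
# 150-residue protein, and an implausibly fast sampling rate of one
# conformation per picosecond.
residues = 150
conformations = 3 ** residues          # ~10^71 possible chains
rate_per_second = 1e12                 # one conformation per picosecond
seconds_per_year = 3.15e7

years = conformations / rate_per_second / seconds_per_year
print(f"Conformations: {conformations:.2e}")
print(f"Exhaustive search time: {years:.2e} years")
# Vastly longer than the age of the universe (~1.4e10 years), which is
# why structures must be predicted, not enumerated.
```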
Despite the difficulties, there has been some progress, including through distributed computing and gamification of folding. But an ongoing, biennial event called the Critical Assessment of protein Structure Prediction (CASP) has seen pretty irregular progress throughout its existence. And in the absence of a successful algorithm, people are left with the arduous task of purifying the protein and then using X-ray diffraction or cryo-electron microscopy to figure out the structure of the purified form, endeavors that can often take years.
DeepMind enters the fray
DeepMind is an AI company that was acquired by Google in 2014. Since then, it’s made a number of splashes, developing systems that have successfully taken on humans at Go, chess, and even StarCraft. In several of its notable successes, the system was trained simply by providing it a game’s rules before setting it loose to play itself.
The system is incredibly powerful, but it wasn't clear that it would work for protein folding. For one thing, there's no obvious external standard for a "win": if you get a structure with a very low free energy, that doesn't guarantee there isn't something slightly lower out there. There's also not much in the way of rules. Yes, amino acids with opposite charges will lower the free energy if they're next to each other. But that won't happen if it comes at the cost of dozens of hydrogen bonds and hydrophobic amino acids sticking out into water.
So how do you adapt an AI to work under these conditions? For their new algorithm, called AlphaFold, the DeepMind team treated the protein as a spatial network graph, with each amino acid as a node and the connections between them mediated by their proximity in the folded protein. The AI itself is then trained on the task of figuring out the configuration and strength of these connections by feeding it the previously determined structures of over 170,000 proteins obtained from a public database.
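To picture that representation, here's a minimal NumPy sketch of a contact graph built from 3D coordinates. The random coordinates and the 8 Å cutoff are assumptions for illustration; this is the general idea, not DeepMind's actual code.

```python
import numpy as np

# Illustrative sketch: a protein as a graph. Nodes are residues; edges
# connect residues whose (toy) 3D coordinates fall within a cutoff.
rng = np.random.default_rng(0)
n_residues = 10
coords = rng.uniform(0, 20, size=(n_residues, 3))  # fake Cα positions

# Pairwise distance matrix: dist[i, j] = |coords[i] - coords[j]|
diff = coords[:, None, :] - coords[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Adjacency matrix of the contact graph (exclude self-contacts).
cutoff = 8.0  # assumed threshold, in angstroms
adjacency = (dist < cutoff) & ~np.eye(n_residues, dtype=bool)

for i, j in zip(*np.nonzero(np.triu(adjacency))):
    print(f"residue {i} -- residue {j}: {dist[i, j]:.1f} Å")
```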
When given a new protein, AlphaFold searches for any proteins with a related sequence and aligns the related portions of the sequences. It also searches for proteins with known structures that have regions of similarity. Typically, these approaches are great at optimizing local features of the structure but not so great at predicting the overall protein structure; smooshing a bunch of highly optimized pieces together doesn't necessarily produce an optimal whole. This is where an attention-based deep-learning portion of the algorithm comes in, ensuring that the overall structure is coherent.
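Attention itself is standard deep-learning machinery. The sketch below is generic scaled dot-product self-attention in NumPy, not AlphaFold's actual architecture; it shows how every position in a sequence can weigh information from every other position, which is what lets a model reason about global structure rather than just local fragments.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Generic attention: each query position mixes values from all
    key positions, weighted by query-key similarity."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                            # weighted sum of values

# Toy example: 5 residue embeddings of dimension 4 (random, illustrative).
rng = np.random.default_rng(1)
x = rng.normal(size=(5, 4))
out = scaled_dot_product_attention(x, x, x)       # self-attention
print(out.shape)  # (5, 4): every position now "sees" every other one
```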
A clear success, but with limits
For this year’s CASP, AlphaFold and algorithms from other entrants were set loose on a series of proteins that were either not yet solved (and solved as the challenge went on) or were solved but not yet published. So there was no way for the algorithms’ creators to prep the systems with real-world information, and the algorithms’ output could be compared to the best real-world data as part of the challenge.
AlphaFold did quite well; in fact, it did far better than any other entry. For about two-thirds of the proteins it predicted a structure for, it was within the experimental error that you'd get if you tried to replicate the structural studies in a lab. Overall, on an evaluation of accuracy that runs from zero to 100 (CASP's Global Distance Test), it averaged a score of 92, again the sort of range that you'd see if you tried to obtain the structure twice under two different conditions.
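That zero-to-100 measure, the Global Distance Test, averages the percentage of amino acid positions predicted to within 1, 2, 4, and 8 Å of the experimental structure. Here's a simplified sketch that assumes the predicted and experimental coordinates have already been superimposed (the real metric also searches over superpositions):

```python
import numpy as np

def gdt_ts(pred, ref):
    """Simplified GDT_TS: average, over four distance thresholds, of the
    percentage of residues whose predicted position lies within that
    threshold of the experimental one. Assumes pred and ref are already
    superimposed (N, 3) arrays of Cα coordinates."""
    dist = np.linalg.norm(pred - ref, axis=1)
    return np.mean([(dist <= t).mean() * 100 for t in (1.0, 2.0, 4.0, 8.0)])

# Toy example: a "prediction" that deviates from the reference by noise.
rng = np.random.default_rng(2)
ref = rng.uniform(0, 50, size=(100, 3))
pred = ref + rng.normal(scale=0.8, size=ref.shape)
print(f"GDT_TS: {gdt_ts(pred, ref):.1f}")  # high score for small errors
```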
By any reasonable standard, the computational challenge of figuring out a protein’s structure has been solved.
Unfortunately, there are a lot of unreasonable proteins out there. Some immediately get stuck in the membrane; others quickly pick up chemical modifications. Still others require extensive interactions with specialized enzymes that burn energy in order to force other proteins to refold. In all likelihood, AlphaFold will not be able to handle all of these edge cases, and without an academic paper describing the system, it will take a little while, and some real-world use, to figure out its limitations. That's not to take away from an incredible achievement, just to warn against unreasonable expectations.
The key question now is how quickly the system will be made available to the biological research community, so that its limitations can be defined and we can start putting it to use on cases where it's likely to work well and have real value: the structures of proteins from pathogens, say, or the mutated forms found in cancerous cells.