Just as ChatGPT generates text by predicting the word most likely to follow in a sequence, a new artificial intelligence (AI) model can write, from scratch, new proteins that do not occur in nature.
Scientists used the new model, ESM3, to create a new fluorescent protein that shares only 58 percent of its sequence with naturally occurring fluorescent proteins, they said in a study posted July 2 to the preprint database bioRxiv. Representatives from EvolutionaryScale, a company founded by former Meta researchers, also outlined details on June 25 in a statement.
The research team released a small version of the model under a non-commercial license and will make the large version of the model available to commercial researchers. According to EvolutionaryScale, the technology could be useful in areas ranging from drug discovery to designing new chemicals to break down plastic.
ESM3 is a large language model (LLM) similar to OpenAI’s GPT-4, which powers the ChatGPT chatbot, and the scientists trained its largest version on 2.78 billion proteins. For each protein, they extracted information about sequence (the order of the amino acid building blocks that make up the protein), structure (the three-dimensional folded shape of the protein), and function (what the protein does). They then randomly masked pieces of this information and asked ESM3 to predict the missing pieces.
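The masking step described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not EvolutionaryScale's training code: the `mask_sequence` helper, the `_` placeholder, the fixed 15 percent masking rate and the example string are all assumptions made for this sketch, and it covers only the sequence, not structure or function.

```python
import random

MASK = "_"  # hypothetical placeholder for a hidden position (assumption)


def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Hide a random subset of residues.

    Returns the masked string and a dict mapping each hidden
    position to the residue a model would be asked to predict.
    """
    rng = random.Random(seed)
    n_masked = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_masked))
    masked = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]  # remember the true residue
        masked[pos] = MASK          # hide it from the model
    return "".join(masked), targets


# Arbitrary string of amino-acid letters, used only for illustration
example = "MKTAYIAKQRQISFVKSHFS"
masked, targets = mask_sequence(example)
print(masked)   # the input a model would see
print(targets)  # the hidden residues it would be trained to recover
```

During training, a model sees only the masked input and is scored on how well it recovers the hidden residues; repeated across billions of proteins, this is how a model of this kind learns which pieces tend to go together.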
They scaled up this model from research the same team had been conducting while still at Meta. In 2022, they announced ESMFold, a precursor to ESM3 that predicts the structures of unknown microbial proteins. That year, Alphabet's DeepMind also predicted the structures of 200 million proteins.
Scientists subsequently pointed out that the predictions of these AI models have limitations and that predicted protein structures need to be verified. But the methods can still greatly speed up the search for protein structures, because the alternative is to use X-rays to map protein structures one by one, which is slow and expensive.
However, ESM3 goes beyond simply predicting the structures of existing proteins. Drawing on 771 billion unique pieces of information about structure, function and sequence, the model can generate new proteins with specific functions. One of EvolutionaryScale's backers described it as a “ChatGPT moment for biology.”
In the new study, the researchers asked the model to generate a new fluorescent protein, a type of protein that absorbs light and re-emits it at a longer wavelength, causing it to glow green. These proteins are important to biological researchers, who attach them to molecules they are interested in studying in order to track and image them; their discovery and development earned the Nobel Prize in Chemistry in 2008.
The model generated 96 proteins with sequences and structures likely to produce fluorescence. The researchers then selected the one whose sequence had the least in common with naturally occurring fluorescent proteins. Although this protein was 50 times less bright than natural green fluorescent proteins, ESM3 ran another iteration that produced new sequences with increased brightness, and the result was a green fluorescent protein unlike any found in nature, called “esmGFP”. The EvolutionaryScale team calculated that these iterations, done in moments by the AI, would have taken 500 million years of evolution to achieve.
“Currently, we still lack a fundamental understanding of how proteins, especially those that are ‘new to science’, behave when introduced into a living system, but this is a great new step that allows us to approach synthetic biology in a new way. AI modeling like ESM3 will enable the discovery of new proteins that the constraints of natural selection would never allow, creating innovations in protein engineering that evolution cannot. However, the claim of simulating 500 million years of evolution focuses only on individual proteins and doesn’t account for the many stages of natural selection that created the diversity of life we know today. The approach is intriguing, but I can’t help but feel that we might be overconfident in assuming that we can outsmart complex processes perfected by millions of years of natural selection.”