Protein Language Model Hits Undruggable Targets, No Structure Required

2025-08-14 06:50:58

By Fay Lin, PhD

Pranam Chatterjee, PhD, assistant professor of bioengineering at the University of Pennsylvania (UPenn), emphasizes that text is all you need for artificial intelligence (AI) models to effectively learn the basis of language. 

“I give ChatGPT a sentence, but I don’t say ‘this is a noun’ and ‘that’s a verb.’ The model is not trained on the metadata of the text,” explained Chatterjee in an interview with GEN. “The same analogy goes for proteins. You don’t need to annotate sequence with structure for the model to pick up on that information.” 

While drug developers have historically focused on targeting the precise and fine-tuned molecular structures underlying protein function, this approach can miss the majority of the human proteome, where structural disorder drives a wealth of disease-related pathways, including cancer and neurological disease. Chatterjee argues that protein language models, leveraging only amino acid sequences with no structural information, are the key to drugging these historically “undruggable” targets that lack discernible pockets for drug design.

In a new study published in Nature Biotechnology titled “Target sequence-conditioned design of peptide binders using masked language modeling,” Chatterjee and colleagues from UPenn, Cornell University, and McMaster University have now developed a generalizable AI method, named PepMLM, that can design peptides, up to 40–50 amino acids in length and optimally small for drug development, to bind challenging therapeutic targets relevant to Huntington’s, viral infections, leukemia, and more without structural input.

According to Chatterjee, PepMLM has seen wide uptake from the biology community, averaging approximately 600 downloads per month since its public release last year. The accessible interface only requires researchers to input a target protein sequence to produce a binder. 

Mask up 

PepMLM is trained on approximately 10,000 peptide-protein sequence pairs sourced from PepNN and Propedia, and is a fine-tuned version of ESM-2, whose training set is composed of approximately 65 million unique protein sequences. In a simple approach, the model attaches a masked peptide sequence to the C-terminus of the target protein. The model is then tasked with “unmasking” the hidden sequence by generating a new peptide binder. 
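The masking step described above can be illustrated with a minimal Python sketch. It is not the PepMLM implementation: the `<mask>` token string, the greedy left-to-right filling order, and the `toy_score` function (a meaningless stand-in for the fine-tuned ESM-2 model) are all assumptions made for illustration, as is the made-up target sequence.

```python
from typing import Callable, Dict, List

MASK = "<mask>"  # assumed mask token string; illustrative only
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def build_masked_input(target: str, peptide_len: int) -> List[str]:
    """Append peptide_len mask tokens to the C-terminal end of the target."""
    if peptide_len <= 0:
        raise ValueError("peptide_len must be positive")
    return list(target) + [MASK] * peptide_len


def unmask(tokens: List[str],
           score: Callable[[List[str], int], Dict[str, float]]) -> str:
    """Fill each mask, left to right, with the top-scoring residue.

    Greedy decoding is an assumption of this sketch, not a claim about
    how PepMLM samples its peptides.
    """
    tokens = list(tokens)  # work on a copy so the caller's list is untouched
    peptide = []
    for i, tok in enumerate(tokens):
        if tok == MASK:
            probs = score(tokens, i)
            best = max(probs, key=probs.get)
            tokens[i] = best  # later positions see the residues chosen so far
            peptide.append(best)
    return "".join(peptide)


def toy_score(tokens: List[str], position: int) -> Dict[str, float]:
    """Hypothetical stand-in for a language model: deterministic dummy scores."""
    return {aa: ((position * 7 + k * 3) % 20) / 20.0
            for k, aa in enumerate(AMINO_ACIDS)}


target = "MKTAYIAKQR"  # made-up target sequence, not from the study
tokens = build_masked_input(target, 5)
peptide = unmask(tokens, toy_score)
print(peptide)
```

In a real setting, `toy_score` would be replaced by a call to a trained masked language model that returns residue probabilities at the masked position. The only user-supplied inputs remain the target sequence and the desired peptide length.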

PepMLM outperformed RFdiffusion, the current gold standard model for de novo protein design, with a higher hit rate of 38% compared to 29%. RFdiffusion is trained on structural information from the Protein Data Bank (PDB) and was developed by the lab of David Baker, PhD, 2024 Nobel Laureate in Chemistry, director of the Institute for Protein Design at the University of Washington (UW), and Howard Hughes Medical Institute (HHMI) investigator.

PepMLM peptides achieved nanomolar binding affinity on disease-related receptor targets that could not be hit by RFdiffusion, including NCAM1, a key marker of acute myeloid leukemia, and AMHR2, a regulator of polycystic ovarian syndrome.  

While Chatterjee recognizes that his lab’s approach to “dogmatically not touch structure” is a minority position in the field, he emphasizes that a sequence-only approach allows models to effectively expand to targets for which known structures do not exist. Notably, an RFdiffusion-based approach for hitting “undruggable” intrinsically disordered proteins showed a propensity toward targeting secondary structure, such as alpha helices and beta sheets, given the model’s training on the PDB.

In addition, experimentally-derived structural data often does not represent biologically relevant conditions. 

“If you want to bind to a disease-causing protein in a cell, do you think that the structure is the same one in the PDB?” posed Chatterjee. “Probably not. That structure may have been solved in a frozen environment with a bunch of salts so that it could be crystallized. That’s not how the protein looks in a cancer cell.” 

Target Huntington’s

In contrast to idiopathic neurological disorders such as Alzheimer’s, amyotrophic lateral sclerosis (ALS), and Parkinson’s, Huntington’s is a monogenic disease affecting more than 1 in 10,000 adults and is primarily caused by an expanded CAG repeat in exon 1 of the HTT gene. Chatterjee had his sights on showing PepMLM proof-of-concept in the Huntington’s protein, given its extensive documentation as a therapeutic target. 

Ray Truant, PhD, a professor in the department of biochemistry and biomedical sciences at McMaster University, a Huntington’s disease expert, and co-author of the PepMLM study, began his collaboration with Chatterjee in 2018. The two originally connected when Chatterjee was a graduate student at the Massachusetts Institute of Technology (MIT), designing Cas9 enzymes to base edit the repeat region of the HTT gene.

As the release of AlphaFold and the wide adoption of machine learning for protein design were still a few years away, Truant recalls feeling skeptical that sequence alone was sufficient for generating therapeutic peptides. 

“I did not believe it would work,” recalled Truant in an interview with GEN. “This was not how I was taught, nor how I teach my undergrads. We need to know structure in order to design a protein.” 

The results quickly shifted his opinion. PepMLM peptides fused to E3 ubiquitin ligases were shown to completely degrade Huntington’s disease-driving proteins in vitro. The peptides also demonstrated the ability to tune degradation efficacy, an important feature for drug development, as Huntington’s protein also holds important biological functions in axonal trafficking, regulation of gene transcription, and cell survival. 

“If you have a pathology mechanism where you either want to increase protective protein or decrease toxic protein, PepMLM can generate a peptide that can do that for you,” Truant told GEN. 

Notably, PepMLM allows researchers to modulate protein levels without impacting mRNA, offering a powerful tool to investigate diseases with RNA pathologies, such as Huntington’s disease-like (HDL) syndromes. As a novel treatment approach, Truant’s team is also interested in tethering kinase activity to PepMLM peptides to address hypophosphorylated sites and restore function in dysregulated Huntington’s protein. 

Chatterjee said the next steps of the work aim to adapt the model to account for post-translational modifications and motif-specific binding, and to tailor specificity to avoid off-target effects, all to improve the therapeutic potential of PepMLM peptides.

Summary

Pranam Chatterjee, a bioengineering professor at UPenn, highlights that AI models can learn language and protein function from text alone without additional metadata or structural information. His study in Nature Biotechnology introduces PepMLM, an AI method that designs peptides to bind therapeutic targets for diseases like Huntington’s disease, viral infections, and leukemia, achieving higher success rates than existing methods without requiring structural data. PepMLM has seen significant adoption within the biology community due to its effectiveness and user-friendly interface. Chatterjee emphasizes the model's potential in targeting "undruggable" proteins and highlights its applications in diseases with RNA pathologies.