A trimodal protein language model enables advanced protein searches

2025-10-02 09:55:25 English original

Author: Yuan, Fajie

References

  1. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).

  2. Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).

  3. Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33, 2302–2309 (2005).

  4. van Kempen, M. et al. Fast and accurate protein structure search with Foldseek. Nat. Biotechnol. 42, 243–246 (2024).

  5. Suzek, B. E. et al. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 926–932 (2015).

  6. Bileschi, M. L. et al. Using deep learning to annotate the protein universe. Nat. Biotechnol. 40, 932–937 (2022).

  7. Gane, A. et al. ProtNLM: model-based natural language protein annotation. Preprint at https://storage.googleapis.com/brain-genomics-public/research/proteins/protnlm/uniprot_2022_04/protnlm_preprint_draft.pdf (2022).

  8. Gligorijević, V. et al. Structure-based protein function prediction using graph convolutional networks. Nat. Commun. 12, 3168 (2021).

  9. Zhou, N. et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol. 20, 1–23 (2019).

  10. Radivojac, P. et al. A large-scale evaluation of computational protein function prediction. Nat. Methods 10, 221–227 (2013).

  11. Liu, W. et al. PLMSearch: protein language model powers accurate and fast sequence search for remote homology. Nat. Commun. 15, 2775 (2024).

  12. Hong, L. et al. Fast, sensitive detection of protein homologs using deep dense retrieval. Nat. Biotechnol. 43, 983–995 (2025).

  13. Achiam, J. et al. GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2023).

  14. Touvron, H. et al. LLaMA: open and efficient foundation language models. Preprint at https://arxiv.org/abs/2302.13971 (2023).

  15. Touvron, H. et al. LLaMA 2: open foundation and fine-tuned chat models. Preprint at https://arxiv.org/abs/2307.09288 (2023).

  16. Guo, D. et al. DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. Preprint at https://arxiv.org/abs/2501.12948 (2025).

  17. Ferruz, N., Schmidt, S. & Höcker, B. ProtGPT2 is a deep unsupervised language model for protein design. Nat. Commun. 13, 4348 (2022).

  18. Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).

  19. Elnaggar, A. et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2021).

  20. Zhou, X. et al. Decoding the molecular language of proteins with Evolla. Preprint at bioRxiv https://doi.org/10.1101/2025.01.05.630192 (2025).

  21. Peng, F. Z. et al. PTM-Mamba: a PTM-aware protein language model with bidirectional gated Mamba blocks. Nat. Methods 22, 945–949 (2025).

  22. Su, J. et al. SaProt: protein language modeling with structure-aware vocabulary. In Proc. 12th International Conference on Learning Representations (ICLR, 2024); https://openreview.net/forum?id=6MRm3G4NiU

  23. Su, J. et al. SaprotHub: making protein modeling accessible to all biologists. Preprint at bioRxiv https://doi.org/10.1101/2024.05.24.595648 (2024).

  24. Radford, A. et al. Learning transferable visual models from natural language supervision. In Proc. 38th International Conference on Machine Learning (eds Meila, M. & Zhang, T.) 8748–8763 (PMLR, 2021).

  25. Gu, Y. et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Healthc. 3, 1–23 (2021).

  26. Boeckmann, B. et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 31, 365–370 (2003).

  27. UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 43, D204–D212 (2015).

  28. Koehler Leman, J. et al. Sequence–structure–function relationships in the microbial protein universe. Nat. Commun. 14, 2351 (2023).

  29. Todd, A. E., Orengo, C. A. & Thornton, J. M. Evolution of protein function, from a structural perspective. Curr. Opin. Chem. Biol. 3, 548–556 (1999).

  30. Douze, M. et al. The Faiss library. Preprint at https://arxiv.org/abs/2401.08281 (2024).

  31. Liu, S. et al. A text-guided protein design framework. Nat. Mach. Intell. 7, 580–591 (2025).

  32. Xu, M., Yuan, X., Miret, S. & Tang, J. ProtST: multi-modality learning of protein sequences and biomedical texts. In Proc. 40th International Conference on Machine Learning (eds Krause, A. et al.) 38749–38767 (PMLR, 2023).

  33. Chen, J. et al. Global marine microbial diversity and its potential in bioprospecting. Nature 633, 371–379 (2024).

  34. Hu, Z. et al. Discovery and engineering of small SlugCas9 with broad targeting range and high specificity and activity. Nucleic Acids Res. 49, 4008–4019 (2021).

  35. Yu, T. et al. Enzyme function prediction using contrastive learning. Science 379, 1358–1363 (2023).

  36. Kweon, J. et al. Efficient DNA base editing via an optimized DYW-like deaminase. Preprint at bioRxiv https://doi.org/10.1101/2024.05.15.594452 (2024).

  37. Gherardini, P. F., Wass, M. N., Helmer-Citterich, M. & Sternberg, M. J. E. Convergent evolution of enzyme active sites is not a rare phenomenon. J. Mol. Biol. 372, 817–845 (2007).

  38. Doolittle, R. F. Convergent evolution: the need to be explicit. Trends Biochem. Sci. 19, 15–18 (1994).

  39. Buchfink, B., Reuter, K. & Drost, H.-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat. Methods 18, 366–368 (2021).

  40. Pomaznoy, M., Ha, B. & Peters, B. GOnet: a tool for interactive Gene Ontology analysis. BMC Bioinformatics 19, 470 (2018).

  41. Dauparas, J. et al. Robust deep learning–based protein sequence design using ProteinMPNN. Science 378, 49–56 (2022).

  42. He, Y. et al. Protein language models-assisted optimization of a uracil-N-glycosylase variant enables programmable T-to-G and T-to-C base editing. Mol. Cell 84, 1257–1270 (2024).

  43. Tong, H. et al. Development of deaminase-free T-to-S base editor and C-to-G base editor by engineered human uracil DNA glycosylase. Nat. Commun. 15, 4897 (2024).

  44. Ye, L. et al. Glycosylase-based base editors for efficient T-to-G and C-to-G editing in mammalian cells. Nat. Biotechnol. 42, 1538–1547 (2024).

  45. Cornman, A. et al. The OMG dataset: an Open MetaGenomic corpus for mixed-modality genomic language modeling. In Proc. 13th International Conference on Learning Representations (ICLR, 2025); https://openreview.net/forum?id=jlzNb1iWs3

  46. Kavli, B. et al. Excision of cytosine and thymine from DNA by mutants of human uracil-DNA glycosylase. EMBO J. 15, 3442–3447 (1996).

  47. Hayes, T. et al. Simulating 500 million years of evolution with a language model. Science 387, 850–858 (2025).

  48. Burley, S. K. et al. RCSB Protein Data Bank: biological macromolecular structures enabling research and education in fundamental biology, biomedicine, biotechnology and energy. Nucleic Acids Res. 47, D464–D474 (2019).

  49. Richardson, L. et al. MGnify: the microbiome sequence data analysis resource in 2023. Nucleic Acids Res. 51, D753–D759 (2023).

  50. Pruitt, K. D., Tatusova, T., Brown, G. R. & Maglott, D. R. NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res. 40, D130–D135 (2012).

  51. Dai, F. et al. Toward de novo protein design from natural language. Preprint at bioRxiv https://doi.org/10.1101/2024.08.01.606258 (2024).

  52. Liu, N. et al. Protein design with dynamic protein vocabulary. Preprint at https://arxiv.org/abs/2505.18966 (2025).

  53. Kuang, J., Liu, N., Sun, C., Ji, T. & Wu, Y. PDFBench: a benchmark for de novo protein design from function. Preprint at https://arxiv.org/abs/2505.20346 (2025).

  54. Ko, Y. S. Using ProTrek for protein binder design. Twitter https://x.com/youngsuko9/status/1865845977673834595 (2024).

  55. Gitter, A. Using ProTrek to retrieve proteins with desired function. Twitter https://x.com/anthonygitter/status/1827760237194920435 (2024).

  56. Gitter, A. Using ProTrek to retrieve proteins with desired function. Twitter https://x.com/anthonygitter/status/1813427191000035330 (2024).

  57. Gitter, A. Using ProTrek to retrieve proteins with desired function. Twitter https://x.com/anthonygitter/status/1882642214624678193 (2025).

  58. Varadi, M. et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50, D439–D444 (2022).

  59. van den Oord, A., Li, Y. & Vinyals, O. Representation learning with contrastive predictive coding. Preprint at https://arxiv.org/abs/1807.03748 (2018).

  60. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) 4171–4186 (Association for Computational Linguistics, 2019).

  61. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

  62. Rasley, J., Rajbhandari, S., Ruwase, O. & He, Y. DeepSpeed: system optimizations enable training deep learning models with over 100 billion parameters. In Proc. 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (eds Gupta, R. & Liu, Y.) 3505–3506 (Association for Computing Machinery, 2020).

  63. Loshchilov, I. & Hutter, F. Fixing weight decay regularization in Adam. OpenReview.net https://openreview.net/forum?id=rk6qdGgCZ (2018).

  64. Loshchilov, I. & Hutter, F. SGDR: Stochastic gradient descent with warm restarts. In Proc. International Conference on Learning Representations (ICLR, 2017); https://openreview.net/forum?id=Skq89Scxx

  65. Xu, J. et al. Protein inverse folding from structure feedback. Preprint at https://arxiv.org/abs/2506.03028 (2025).

  66. Enzyme Nomenclature (Nomenclature Committee of the International Union of Biochemistry and Molecular Biology, 2024); https://iubmb.qmul.ac.uk/enzyme/

  67. Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y. & Morishima, K. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45, D353–D361 (2017).

  68. Kucera, T., Oliver, C., Chen, D. & Borgwardt, K. ProteinShake: building datasets and benchmarks for deep learning on protein structures. In Advances in Neural Information Processing Systems 36 (eds Oh, A. et al.) (NeurIPS, 2023).

Summary

The text above is the reference list of an academic paper on the advances and applications of language models in protein research, with a particular focus on de novo protein design, sequence alignment, function prediction, and base editing. The references cover the following topics:

1. **Language Models for Protein Research**:
   - Several papers describe generative protein language models, such as ProtGPT2, that produce sequences for protein design.
   - "Protein inverse folding from structure feedback" explores using structural information to improve protein sequence design.

2. **Sequence Alignment and Comparative Genomics**:
   - Tools such as DIAMOND are cited for fast, sensitive sequence alignment at large scale.
   - Metagenomics resources are referenced, including the OMG dataset, an open metagenomic corpus for mixed-modality genomic language modeling.

3. **Protein Structure Prediction and Databases**:
   - The AlphaFold Protein Structure Database and RCSB PDB are cited as large-scale structure resources, and MGnify as a microbiome sequence resource.
   - Jumper et al. is cited for highly accurate protein structure prediction with AlphaFold.

4. **Protein Function Prediction and Ontology Analysis**:
   - GOnet is cited as a tool for interactive Gene Ontology (GO) analysis.
   - Convergent evolution of enzyme active sites is referenced in the context of function prediction.

5. **Protein Design and Engineering Applications**:
   - Papers such as "Development of deaminase-free T-to-S base editor and C-to-G base editor by engineered human uracil DNA glycosylase" show applications of protein engineering to genome editing.
   - Work on designing proteins from natural language descriptions is cited, including "Toward de novo protein design from natural language."

6. **Machine Learning Techniques**:
   - Contrastive predictive coding (CPC) and self-supervised learning are referenced in the context of representation learning.
   - Training optimizations are cited, including DeepSpeed and stochastic gradient descent with warm restarts (SGDR).

7. **Evaluation Metrics and Benchmarks**:
   - Benchmarks such as PDFBench are cited for assessing de novo protein design from function.

8. **Enzyme Nomenclature and Pathway Databases**:
   - KEGG and the IUBMB Enzyme Nomenclature are cited as standard resources for representing biological data.

Together, the references span bioinformatics, computational biology, machine learning, and protein engineering. They range from foundational methodological studies to applied research, showing how AI methods are being integrated into the biological sciences for problems such as de novo protein design and functional annotation.