Download the book Large Language Models (LLMs) in Protein Bioinformatics

Book title: Large Language Models (LLMs) in Protein Bioinformatics
Author: Dukka B. KC
Field: Large language models
Year of publication: 2025
Number of pages: 360
Original language: English
File type: pdf
File size: 6.69 MB

The recent wave in AI is to replace task-specific models with foundation models, which are trained on a broad set of unlabeled data and can be used for different tasks with minimal fine-tuning. They are called foundation models because they serve as the foundation for many AI applications. Large Language Models (LLMs) are a class of foundation models that are (pre)trained on enormous amounts of data to provide the foundational capabilities needed to drive multiple use cases and applications. LLMs are typically based on a transformer architecture and involve training on a massive corpus of data (e.g., text). The transformer architecture allows LLMs to effectively handle long context and sequential information. LLMs represent a significant breakthrough in natural language processing (NLP) and are designed to understand and generate text and other content. LLMs have found application in text/content generation, content summarization, AI assistants, code generation, and language translation, among others.
LLMs have shown significant promise in various research fields, including protein bioinformatics. Thanks to advances in LLMs, the field of protein bioinformatics has also witnessed many advances in areas including, but not limited to, protein structure prediction and protein function prediction. Starting with the training of Protein Language Models (PLMs, LLMs that are trained on protein sequence/structure) and the subsequent application of these PLMs, the field has seen a plethora of approaches for various protein bioinformatics tasks.
The application of LLMs in protein bioinformatics can be broadly classified into two categories: representation (understanding) and protein design/engineering (generation). In the representation category, the protein language model is used to extract embeddings, typically from the last layer, and these embeddings are then used for downstream prediction/classification tasks. Most people refer to these as static, pretrained embeddings, and this has been one of the most common approaches in protein bioinformatics. More recently, as in NLP, a few works have applied task-specific supervised fine-tuning, training both the PLM encoder and the prediction head. The field of LLMs is developing at a rapid pace, and new topics such as AI agents are already becoming more popular. We hope to see the application of AI agents and other emerging themes in protein bioinformatics as well. Additionally, there are newer concepts like large context models (LCMs), and we also expect to see their application in the field.
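The static-embedding workflow described above can be sketched in a few lines. This is only an illustrative sketch under loud assumptions: `toy_residue_embeddings` is a hypothetical stand-in for a real protein language model (it returns two hand-picked physicochemical values per residue so the example runs without downloading any weights), and the sequences and labels are invented toy data. What it is meant to show is the pattern: per-residue vectors are mean-pooled into one fixed-length embedding, the embedding function stays frozen, and only a small prediction head is trained.

```python
import math

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
# Formal side-chain charge near neutral pH (histidine treated as neutral here).
CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}


def toy_residue_embeddings(sequence):
    """Hypothetical stand-in for a PLM's last-layer output: one vector per residue."""
    return [[HYDROPATHY[aa] / 4.5, CHARGE.get(aa, 0.0)] for aa in sequence]


def embed(sequence):
    """Mean-pool per-residue vectors into one fixed-length sequence embedding."""
    per_residue = toy_residue_embeddings(sequence)
    return [sum(column) / len(per_residue) for column in zip(*per_residue)]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_head(features, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression head by SGD; the embeddings are never updated."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss with respect to the logit
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict(head, sequence):
    """Probability of the positive class for a new sequence."""
    weights, bias = head
    x = embed(sequence)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Invented toy task: label 1 = hydrophobic-rich peptide, 0 = charged-rich peptide.
TRAIN = [
    ("AVLIAVLI", 1), ("LLIVAVIL", 1), ("IVLAVLLA", 1),
    ("KDEKRDEK", 0), ("EEKDRKDE", 0), ("DKERKEDR", 0),
]

head = train_head([embed(seq) for seq, _ in TRAIN], [label for _, label in TRAIN])
print(predict(head, "VVAILLIA"), predict(head, "RKRKDEDE"))
```

In practice, `toy_residue_embeddings` would be replaced by a call to a pretrained PLM, and the fine-tuning alternative mentioned above would update the encoder weights together with the head rather than keeping them frozen.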
The following chapters are included in this volume of Methods in Molecular Biology. The book begins (Chap. 1) by setting up the premise for the development of LLM-based approaches in protein bioinformatics, specifically by surveying recent pretrained protein language models. This chapter gives an excellent overview of the types of protein language models by architecture, namely encoder-only, encoder-decoder, and decoder-only. The chapter also briefly summarizes the most recent trends in the field: fine-tuning protein language models and multimodal protein language models, among other things.
In Chap. 2, Dong Xu’s group from the University of Missouri-Columbia describes S-PLM, a 3D structure-aware protein language model. Motivated by the recent advances in protein structure prediction approaches, S-PLM allows one to obtain structural embeddings by leveraging a Geometric Vector Perceptron (GVP) model to process the 3D coordinates of a protein. Protein language models place high demands on computational resources. In Chap. 3, Yan Wang, Zhidong Xue, and colleagues describe their lightweight protein language model called ProtFlash. ProtFlash uses several key technological breakthroughs, including mixed-chunk attention. The authors also give step-by-step instructions for using the ProtFlash library.
In Chap. 4, Bonvin’s group from Utrecht University describes DeepRank-GNN-esm, an approach based on protein language models for ranking protein-protein models (the scoring problem) in protein-protein interaction prediction. As ranking the good models within the large pool of generated models is an important step in protein-protein interaction prediction, this chapter describes the use of protein language model (ESM-2) features to improve that prediction.
In Chap. 5, Daisuke Kihara from Purdue University and colleagues describe GO2Sum, a Gene Ontology (GO) term summarizer that uses a protein language model. Essentially, GO2Sum takes a list of GO terms as input and converts them into a summary that describes various GO aspects of a protein. The chapter also describes the GO2Sum web server.
Jianlin Cheng’s group from the University of Missouri-Columbia describes their recently developed protein function annotation tool called TransFun in Chap. 6. Recognizing the lack of functional annotations for many proteins, TransFun leverages embeddings from ESM-1b and predicted structures from AlphaFold to predict function for a given protein. The authors also describe in detail how to get started with TransFun.
In Chap. 7, Lydia Freddolino’s group at the University of Michigan describes InterLabelGO+, a top-performing model for predicting GO terms in CAFA5. InterLabelGO+ is an approach for predicting protein function in the form of Gene Ontology terms that uses the ESM-2 protein language model to extract sequence features. The group also describes in detail the procedure for performing protein GO term prediction with the InterLabelGO+ web server and the standalone package.
In Chap. 8, Ana Rojas and collaborators from Centro Andaluz De Biologia Del Desarrollo discuss the application of a protein language model (ProtTrans) for protein function annotation. They also describe the FANTASIA tool for large-scale annotation of uncharacterized proteomes.
Debswapna Bhattacharya from Virginia Tech and collaborators summarize the recent advances in protein-nucleic acid binding site prediction approaches that harness protein language models in Chap. 9. The chapter also presents their own approach, called EquiPNAS, which integrates protein language models with equivariant deep graph neural networks for protein-DNA and protein-RNA binding site prediction.
In Chap. 10, Henrik Nielsen from the Technical University of Denmark describes three important tools that his group recently developed for predicting which proteins belong to which cellular compartments, making use of protein language models. Specifically, the chapter describes SignalP6.0 for prediction of signal peptides, DeepLoc2.1 for prediction of subcellular location and membrane association in eukaryotes, and DeepLocPro1.0 for prediction of subcellular location in prokaryotes.
Iman Dehzangi and colleagues from Rutgers University-Camden discuss their tool CNN-Meth for predicting lysine methylation sites that uses evolutionary information and structural features in Chap. 11. Although their method does not directly use a protein language model, the Position-Specific Scoring Matrix (PSSM) features in their approach can readily be replaced by protein language model-based embeddings.
Pier Luigi Martelli and colleagues from the University of Bologna describe their approach for characterizing proteins and for predicting the pathogenicity of human protein variants. Their approach uses embeddings from protein language models. Additionally, they describe their Bioinformatics Sweeties, a web portal, that has a list of bioinformatics tools for characterizing proteins and different aspects of pathogenic variants with examples in Chap. 12.
In Chap. 13, Shandar Ahmad’s group from Jawaharlal Nehru University discusses various existing approaches for prediction of biological function that leverage protein language models and NLP-based techniques. The survey also highlights the major advances in the field and possible future research directions.
Shanfeng Zhu and Jianyi Yang’s group discusses their recent approach, in Chap. 14, for homologous protein search and sequence alignment that uses protein language models. PLMSearch is their protein language model-based tool for searching homologous sequences, and PLMAlign is their tool for aligning remote homologous sequences. The chapter also describes in detail how to use these tools.
In Chap. 15, Siwei Chen’s group at the Broad Institute of MIT and Harvard summarizes the recent advances in protein-protein interaction analysis that leverage protein language models. Essentially, the computational tools for predicting protein-protein interactions and protein-protein interaction sites are discussed in detail. The chapter also highlights some of the other promising areas of PPI prediction, including PPI hotspots.
Identifying protein-peptide binding residues is important for understanding the mechanisms of protein function and for drug discovery. In Chap. 16, Leyi Wei and colleagues describe their PepBCL tool for predicting protein-peptide binding sites. PepBCL uses a pretrained BERT model called ProtBert-BFD to generate the encoding vector.
Bioactive peptide discovery is another important field across food, nutraceuticals, cosmetics, and pharmaceuticals. In Chap. 17, Yonghui Li and colleagues from Kansas State University describe their tool that leverages a protein language model for predicting peptide bioactivity. Their approach, called UniDL4BioPep, uses the ESM protein language model.
In Chap. 18, Boxue Tian’s group at Tsinghua University discusses a new tool called CLAPE for the prediction of protein-ligand binding sites. CLAPE uses contrastive learning and the pretrained protein language model ProtBERT. The authors describe in detail the architecture, model performance, and datasets used in training CLAPE, as well as how to use CLAPE.
Finally, in Chap. 19, my group from the Rochester Institute of Technology surveys recent advances in the prediction of post-translational modification sites in proteins that leverage large language models. We also identify emerging trends and outline some of the challenges and future research directions in the field.
I hope readers receive this book as a comprehensive collection of methods, resources, and studies that use LLMs in protein bioinformatics. In addition to describing these approaches, I believe the book will also serve as a practical guide for using these LLM-based tools in various protein bioinformatics tasks. I am hopeful that this book presents the state of the art of current research and points to future trends in the use of LLMs for protein bioinformatics.

You can download this book for free from the link below:

Download: Large Language Models (LLMs) in Protein Bioinformatics
