Large models can now identify cells! From a Tsinghua team, accepted to ICML 2024 | Open source
Submitted by Mizuki Molecule
Qubit | WeChat official account QbitAI
A large-model breakthrough in the life sciences has just been announced.
Developed at Tsinghua University, LangCell uses a large model to achieve single-cell identity recognition, and the model is now officially open source.
It can not only accurately identify cell identities but also has strong zero-shot analysis capabilities. The paper has been accepted by ICML 2024.
LangCell's dataset contains approximately 27.5 million entries, covering eight dimensions of cell-identity information including cell type, developmental stage, tissue and organ, and disease; it could be called an "encyclopedia of cells."
In evaluations, LangCell surpassed the previous SOTA on multiple cell recognition and understanding tasks, and also performed well on new tasks specially designed by the researchers.
Moreover, even without using text information, its cell encoder module alone achieves top performance across a range of tasks.
Production team: the Tsinghua-incubated startup Shuimu Molecule (PharMolix) and Professor Nie Zaiqing's team at Tsinghua University's Institute for AI Industry Research (AIR).
Large models: a "new weapon" for cell identification
Cells are the starting point for exploring the mysteries of life. The identification of cell identity is a hot topic in the field of biological sciences.
This is not only a "census" of cells, but also concerns their "social relationships" within tissues and their sensitive responses to biological signals and environmental changes. The key way to obtain this information is to analyze single-cell sequencing data.
However, single-cell sequencing data analysis is like a scientific "treasure hunt": it may require an interdisciplinary team of several to a dozen people and take anywhere from several weeks to several months, or even longer, to complete.
Now, the LangCell model has become a "new weapon" for cell identity recognition.
LangCell is the first pre-trained single-cell representation model to combine single-cell RNA sequencing data with natural language, which not only improves recognition accuracy but also reduces reliance on large amounts of labeled data.
Traditional single-cell RNA sequencing analysis is like hunting for treasure without a map: some clues can be found, but progress is always slow.
By constructing a unified representation of single-cell data and natural language, LangCell effectively hands the model a "treasure map," allowing it to locate cell-identity information far more directly.
Specifically, LangCell consists of two main parts: a cell encoder (CE) and a text encoder.
The cell encoder is initialized from the pre-trained Geneformer. It converts the input, a gene sequence sorted by expression level, into a sequence of embedding vectors, prepends a [CLS] token at the beginning of the sequence, and applies a linear transformation to the [CLS] embedding to obtain the representation vector of the whole cell.
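The input preparation described above can be sketched roughly as follows. This is a minimal illustration of Geneformer-style rank-value encoding, where genes are ordered by expression; the gene names and expression values are hypothetical, not from the paper:

```python
# Hypothetical single-cell expression values for a handful of marker genes
expression = {"CD3D": 52.0, "MS4A1": 3.0, "NKG7": 17.0, "LYZ": 0.0}

def to_rank_tokens(expr, cls_token="[CLS]"):
    """Build the encoder's input token sequence: drop unexpressed genes,
    sort the rest by expression level (descending), and prepend [CLS]."""
    expressed = [g for g, v in expr.items() if v > 0]
    ranked = sorted(expressed, key=lambda g: expr[g], reverse=True)
    return [cls_token] + ranked

tokens = to_rank_tokens(expression)
# → ['[CLS]', 'CD3D', 'NKG7', 'MS4A1']
```

Each token would then be mapped to an embedding vector, with the [CLS] position ultimately summarizing the whole cell.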
The text encoder has two encoding modes: unimodal and multimodal.
In unimodal mode, it is equivalent to a BERT model and converts text into embedding vectors;
In multimodal mode, a cross-attention module is added after self-attention to fuse in the cell embedding vectors and compute a joint representation, and a linear layer then predicts the cell-text matching probability.
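The multimodal matching step can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the dimensions, random embeddings, and linear head are stand-ins that only show the data flow (text tokens attend to cell embeddings, then a linear layer on the fused [CLS] position yields a match probability):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_h, cell_h, d):
    # Queries come from the text tokens; keys/values from the cell embeddings
    scores = text_h @ cell_h.T / np.sqrt(d)
    return softmax(scores) @ cell_h

rng = np.random.default_rng(0)
d = 16
text_h = rng.normal(size=(5, d))  # 5 text-token embeddings (placeholders)
cell_h = rng.normal(size=(8, d))  # 8 gene-token embeddings (placeholders)
fused = cross_attention(text_h, cell_h, d)  # joint representation, shape (5, d)

# Linear head on the fused [CLS] position -> sigmoid -> matching probability
w, b = rng.normal(size=d), 0.0
p_match = 1.0 / (1.0 + np.exp(-(fused[0] @ w + b)))
```

In the real model this head is trained so that matched cell-text pairs score higher than mismatched ones.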
To train LangCell, the researchers also built a dataset called scLibrary, containing 27.5 million scRNA-seq entries paired with multi-view text descriptions of cell identity obtained from the OBO Foundry, an "encyclopedia" for cell research.
This data set not only contains a large amount of raw data, but also contains multi-view text descriptions of cell identities, providing rich learning materials for the model.
In addition, in the zero-shot scenario, the scRNA-seq data of a cell of unknown type is simply fed into the CE to obtain its embedding vector; the similarity between this embedding and the text embeddings of the candidate types is then computed, and the highest-scoring type is predicted as the unknown cell's type.
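The zero-shot prediction step above amounts to a nearest-neighbor search in the shared embedding space. A minimal sketch with cosine similarity, where the type names and all embedding vectors are synthetic placeholders rather than real model outputs:

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_classify(cell_emb, type_names, type_embs):
    """Return the candidate type whose text embedding is most
    cosine-similar to the cell embedding."""
    sims = l2norm(type_embs) @ l2norm(cell_emb)
    return type_names[int(np.argmax(sims))]

rng = np.random.default_rng(1)
d = 32
type_names = ["T cell", "B cell", "NK cell"]
type_embs = rng.normal(size=(3, d))  # stand-ins for text-encoder outputs
# Construct a cell embedding deliberately close to the "B cell" description
cell_emb = type_embs[1] + 0.1 * rng.normal(size=d)

pred = zero_shot_classify(cell_emb, type_names, type_embs)
```

Because no labeled cells are needed, adding a new candidate type only requires writing its text description and encoding it.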
As a result, LangCell performed well in zero-shot cell-identity understanding scenarios and was able to annotate new cell types directly, even without fine-tuning.
On the PBMC dataset, zero-shot LangCell reached 86.5% classification accuracy, with an F1 score exceeding the 9-shot performance of the previous SOTA model.
In the more challenging cross-dataset cell-text retrieval task, LangCell's zero-shot recall at R@1, R@5, and R@10 all exceeded those of the BioTranslator model trained with 30% of the annotated data.
In addition, the researchers constructed two new, biologically important benchmark tasks: "non-small cell lung cancer subtype classification" and "cell pathway identification."
Results: in the non-small cell lung cancer subtype classification task, LangCell's zero-shot classification accuracy and F1 score reached 93.5% and 93.2% respectively, about 20 points higher than 10-shot Geneformer.
In the cell batch integration task, LangCell's AvgBIO, ASW_batch, and S_final metrics all reached the best level on both the PBMC10K and Perirhinal Cortex datasets.
It is not only the full LangCell model that performs well: even without using text information, the CE module by itself achieves top performance on each task.
On the datasets for multiple cell-type annotation tasks, the CE module surpassed the previous SOTA, and it also performed strongly on cell pathway identification.
According to the authors, these capabilities of LangCell are particularly important for studying new diseases or cell subtypes, since they reduce reliance on large amounts of labeled data and accelerate the discovery of disease mechanisms.
Team Profile
Shuimu Molecule (PharMolix) is incubated by Tsinghua University's Institute for AI Industry Research (AIR). Its key research directions are foundation models for the biopharmaceutical industry and a new generation of conversational biopharmaceutical R&D assistants.
Shuimu Molecule and Tsinghua University also have two results jointly developed with Peking University and Nanjing University that were selected for ICML 2024, making progress in small-molecule 3D representation learning and protein macromolecule representation learning, respectively.
GitHub:
https://github.com/PharMolix/OpenBioMed
Paper address:
https://arxiv.org/abs/2405.06708