LLMs trained on plant genomes and proteomes. For context or questions, see the GitHub repo.
Skip to: Plant-Based | Plant-Inclusive | Other Notable
## Plant-Based

| Name | Source | Tokenization | HF Link | GH Link |
|---|---|---|---|---|
| AgroNT | InstaDeep | 6bp/token, 1,000-token context window | InstaDeepAI/agro-nucleotide-transformer-1b | instadeepai/nucleotide-transformer |
| Inference notebook - Inference on segment - For non-plant DNA, see: nucleotide-transformer-2.5b-multi-species - embedding sketch after this table | ||||
| PlantCaduceus | Kuleshov Group, Cornell University, USA | 1bp/token, 512-token bidirectional context | kuleshov-group/PlantCaduceus_l32 | kuleshov-group/PlantCaduceus |
| Examples notebook | ||||
| Plant-*-BPE / PDLLMs | Zhang Tao Lab, Yangzhou University, China | nucleotide / BPE | collections/zhangtaolab/plant-foundation-models (multiple architectures: Mamba, BERT, GPT) | zhangtaolab/plant_DNA_LLMs |
| finetuning shell script - inference shell script | ||||
| PlantRNA-FM | ColaLAB @ University of Exeter and the John Innes Centre @ Norwich Research Park, UK | EsmTokenizer | yangheng/PlantRNA-FM | yangheng95/PlantRNA-FM |
| region classification in Python - also see OmniGenome-52M model and OmniGenBench benchmark for foundation models | ||||
| PlantBERT | University of Potsdam, Germany | multiple nucleotides | nigelhartm/PlantBERT | nigelhartm/PlantBERT |
| finetuning Python script | ||||
| FloraBERT | Institute for Applied Computational Sciences, Harvard University, USA | multiple nucleotides | Gurveer05/FloraBERT | gurveervirk/florabert |
| finetuning Python script | ||||
| Farmer.chat | Digital Green and CGIAR, India | natural language LLMs | HF blog | GooeyAI |
| 1001G+ project | multiple, EU | pangenomic data | -- | 1001genomes.org (Arabidopsis thaliana data) |
| MoBiPlant | Universidad de Buenos Aires, Argentina | - | manufernandezbur/MoBiPlant | manoloFer10/mobiplant |
| This is a Q&A dataset for plant molecular biology. Claude and DeepSeek scored highly. | ||||
| protein-matryoshka-embeddings | it me, USA | amino acids | monsoon-nlp/protein-matryoshka-embeddings | -- |
| 2024 project on more efficient embeddings for plant proteins - dataset of paired plant proteins and text descriptions - plant protein descriptions converted to binary classification tasks | ||||
| tomatotomato-gLM2 | it me, USA | custom pangenome tokens | monsoon-nlp/tomatotomato-gLM2-150M-v0.1 | -- |
| 2025 project on representing two genomes with a single sequence encoding variation between tomato accessions | ||||
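Most of the plant DNA models above follow the standard Hugging Face Transformers workflow. Below is a minimal sketch for pulling mean-pooled sequence embeddings out of AgroNT (linked in the table); it assumes the checkpoint loads as a masked-LM like the other Nucleotide Transformer models, so defer to the model card and the linked inference notebook for the exact API.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# AgroNT from the table above; trust_remote_code only matters if the repo
# ships custom modeling code, and is harmless otherwise
model_id = "InstaDeepAI/agro-nucleotide-transformer-1b"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)
model.eval()

# toy DNA input; AgroNT tokenizes roughly 6bp per token
sequences = ["ATGGCGTACGTAGCTAGCTAGGATCGATCGATCGATTTGACGTACGT"]
inputs = tokenizer(sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# mean-pool the last hidden layer over non-padding tokens
hidden = outputs.hidden_states[-1]             # (batch, tokens, dim)
mask = inputs["attention_mask"].unsqueeze(-1)  # (batch, tokens, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)
```

PlantCaduceus and the Plant-*-BPE / PDLLMs checkpoints follow a similar pattern but have their own tokenizers and pooling conventions; the notebooks and scripts linked in the table are the authoritative versions.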
## Plant-Inclusive

| Name | Source | Tokenization | HF Link | GH Link |
|---|---|---|---|---|
| gLM2 | Tatta Bio | hybrid 1bp or 1 amino acid/token, 4,096-token context window | tattabio/gLM2_650M | TattaBio/gLM2 |
| example notebook - gLM2's embeddings model | ||||
| Evo 2 | Arc Institute and Laboratory of Evolutionary Design, Stanford University, USA | 1bp/token, 1 million token context, StripedHyena 2 architecture | arcinstitute/evo2_40b | arcinstitute/evo2 and Zymrael/savanna for StripedHyena2 architecture |
| Trained on the opengenome2 dataset, which includes multiple kingdoms of life while avoiding viruses that infect eukaryotes. The previous Evo model was trained on prokaryote genomes only. Independent researchers wrote a paper about deploying Evo 2; also read about generative design of bacteriophages with Evo 2. | ||||
| ZymCTRL | AI for Protein Design, Centre for Genomic Regulation, Spain | amino acids | AI4PD/ZymCTRL | -- |
| Enzyme design model - improved on with reinforcement learning / DPO: code | ||||
| ProtNote | Microsoft | amino acids + annotation codes | -- | microsoft/protnote |
| Multiple notebooks | ||||
| ESM-2 | Facebook / Meta | ESMTokenizer / amino acids | facebook/esm2_t48_15B_UR50D | facebookresearch/esm (archived) |
| Many finetuned models on HF, such as AmelieSchreiber/esmbind, and a version with "Flash Attention" - see the pooled-embedding sketch after this table | ||||
| Bio Embeddings | Technical University of Munich, Germany | amino acids (?) | supports multiple models | sacdallago/bio_embeddings |
| data.bioembeddings.com | ||||
| ProtTrans | multiple | amino acids | Rostlab | agemagician/ProtTrans |
| GPN | Song Lab, U.C. Berkeley, USA | nucleotide | collections/songlab/gpn (model, tokenizer, dataset) | songlab-cal/gpn |
| Embedding notebook - Multiple Sequence Alignment (MSA) | ||||
| BioM3 | U. Chicago, USA | amino acids | -- | -- |
| protein design from natural language text - BioM3 paper | ||||
| ProteinBERT | Hebrew University of Jerusalem, Israel | 1 token/amino acid | GrimSqueaker/proteinBERT | nadavbra/protein_bert |
| demo notebook | ||||
| Mistral-DNA | Paul Sabatier University, France | nucleotide | RaphaelMourad/Mistral-DNA-v1-417M-Athaliana | raphaelmourad/Mistral-DNA |
| finetuning notebook | ||||
| PAIR | Vector Institute / U. Toronto, Canada | amino acid | h4duan (finetunes of ESM, ProtT5) | -- |
| Sample use in HF repo | ||||
| ProtBFN | InstaDeep | amino acids | InstaDeepAI/protein-sequence-bfn | instadeepai/protein-sequence-bfn |
| Sample Python script | ||||
| CHEAP Embeddings | multiple, USA | amino acids | amyxlu/cheap-proteins, based on ESMFold | amyxlu/cheap-proteins |
| ProTrek | Westlake Representation Learning Lab, Westlake University, China | EsmTokenizer | westlake-repl/ProTrek_650M_UniRef50 and see SaProt_650M_AF2 | westlake-repl/ProTrek |
| CoLab notebook | ||||
| Prot-xLSTM | JKU Linz, Austria | amino acids | ml.jku.at | ml-jku/Prot-xLSTM |
| evaluation examples - generation notebook - variant fitness notebook | ||||
| ProstT5 | multiple | amino acid and 3Di structure tokens | Rostlab/ProstT5 | mheinzinger/ProstT5 |
| InterProt | multiple | ESM2 tokenizer | liambai/InterProt-ESM2-SAEs | etowahadams/interprot |
| RNA-FM / RhoFold | multiple | nucleotides | (official on Google Drive) (unofficial on HF/multimolecule) | ml4bio/RNA-FM |
| Protein-Vec | Gleghorn Lab, U. Delaware, USA | -- | lhallee/ProteinVec | lhallee/ProteinVecHuggingface |
| MAMMAL / biomed.omics / biomed-multi-alignment | IBM, USA | amino acids, gene names | ibm/biomed.omics.bl.sm.ma-ted-458m.protein_solubility | BiomedSciAI/biomed-multi-alignment |
| tutorial notebook creating a new task | ||||
| ProteinGLM | Tsinghua University, China | amino acids | Bo1015/proteinglm-100b-int4 | allanchen95/xTrimoPGLM |
| contact prediction Python script | ||||
| PSALM | Harvard University, USA | ESMTokenizer | ProteinSequenceAnnotation | Protein-Sequence-Annotation/PSALM |
| Protein-Llama-3-8B and Protein-Phi-3-mini | Esperanto Technologies, USA | -- | collections/Esperanto/esperanto-fine-tuned-llm-based-models | -- |
| ProteinDT | multiple | amino acids | chao1224/ProteinDT | chao1224/ProteinDT |
| SeqDance | Shen Lab, Columbia University, USA | EsmTokenizer | on Zenodo | ShenLab/SeqDance |
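Many of the protein models in this table are drop-in Hugging Face checkpoints. Here is a minimal sketch of the pooled-embedding recipe referenced in the ESM-2 row, using the small public `facebook/esm2_t6_8M_UR50D` checkpoint rather than the 15B model linked above; the mean-pooling choice is a common convention, not an official API.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# small ESM-2 checkpoint; the table links the much larger facebook/esm2_t48_15B_UR50D
model_id = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

proteins = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ",
    "MSILVTRPSPAGEELVSRLRTLGQVAWHFPLIEF",
]
inputs = tokenizer(proteins, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**inputs)

# mean-pool per-residue embeddings, ignoring padding via the attention mask
# (special tokens are left in for simplicity)
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (2, hidden_size)
```

ProtTrans and ProstT5 expect spaces between residues (and, for ProstT5, prefix tokens), and several other entries (ProTrek, ProteinGLM, Prot-xLSTM) have their own loading paths, so check each linked card or repo before reusing this recipe.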
## Other Notable

Not trained on any plants, but these projects seem important to the bio research community. This section also includes datasets and tool-use work.
| Name | Source | HF Link | GH Link |
|---|---|---|---|
| AlphaGenome | Google DeepMind | - | google-deepmind/alphagenome |
| blog | |||
| HyenaDNA | Hazy Research, Stanford University, USA | LongSafari/hyenadna-large-1m-seqlen | HazyResearch/hyena-dna |
| evaluation scripts | |||
| Open MetaGenomic Dataset OMG | Tatta Bio | datasets/tattabio/OMG | TattaBio/OMG |
| scGPT | WangLab, U. Toronto, Canada | -- | bowang-lab/scGPT |
| finetuning Python script | |||
| AllTheBacteria | European Molecular Biology Laboratory, UK | -- | AllTheBacteria/AllTheBacteria dataset |
| GENA-LM | AIRI Institute, Russia | AIRI-Institute/dna-language-models | AIRI-Institute/GENA_LM (human DNA LLM) |
| TYMEFLIES | Multiple, USA | -- | Nature paper "Two decades of bacterial ecology and evolution in a freshwater lake", genomics.wisc.edu article, Bsky post |
| METAGENE | Multiple, USA | metagene-ai/METAGENE-1 pretrained on wastewater sample genomics | metagene-ai |
| Decoder-only, unidirectional - also check out the Microbial General Model (MGM) | |||
| Genomic Benchmarks | CEITEC, Czech Republic | katarinagresova | ML-Bioinfo-CEITEC/genomic_benchmarks benchmark |
| Notebook loading benchmarks from HF | |||
| DNABERT-2 | MAGICS Labs, Northwestern University and Stony Brook University, USA | zhihan1996/DNABERT-2-117M | MAGICS-LAB/DNABERT_2 |
| see the loading sketch after this table | |||
| nucleotide-transformer | InstaDeep | InstaDeepAI/nucleotide-transformer | InstaDeepAI/nucleotide-transformer |
| Best finetuned performance on DART-Eval | |||
| GeneGPT | NCBI, USA | -- | ncbi/GeneGPT |
| Natural language text / amino acids tool-use dataset | |||
| ChatCell / Mol-Instructions | ZJUKG, Zhejiang University, China | zjunlp/chatcell-large | -- |
| ChatCell instruction dataset - Mol-Instructions dataset | |||
| affinity_pred | Oak Ridge National Laboratory, USA | jglaser | ORNL/affinity_pred |
| ceLLama | CelVox, Netherlands | -- (standard LLM with R packages) | CelVoxes/ceLLama |
| Tutorial notebook | |||
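Several of the DNA models in this last table ship custom modeling code on the Hub, so loading them means opting into `trust_remote_code=True`. A rough sketch for DNABERT-2 (see the note in its row above), following its model card; the mean-pooling at the end is an assumption for illustration, not part of the official API.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# DNABERT-2 from the table above; trust_remote_code runs the repo's own BERT variant
model_id = "zhihan1996/DNABERT-2-117M"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
model.eval()

dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tokenizer(dna, return_tensors="pt")

with torch.no_grad():
    hidden_states = model(inputs["input_ids"])[0]  # (1, tokens, hidden)

# mean-pool over BPE tokens for a single per-sequence vector
embedding = hidden_states.mean(dim=1)
print(embedding.shape)  # (1, hidden_size)
```

HyenaDNA and GENA-LM typically need the same `trust_remote_code` opt-in, while scGPT and METAGENE-1 document their own loading paths in their repos; if DNABERT-2's custom attention code errors, check the pinned dependency versions in its GitHub README.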