LLMs trained on plant genomes and proteomes. For context or questions, see the GitHub repo.


Plant-Based LLMs / Research

Name | Source | Tokenization | HF Link | GH Link
AgroNT | InstaDeep | 6bp/token, 1000x window | InstaDeepAI/agro-nucleotide-transformer-1b | instadeepai/nucleotide-transformer
  Inference notebook - Inference on segment - For non-plant DNA, see: nucleotide-transformer-2.5b-multi-species
PlantCaduceus | Kuleshov Group, Cornell University, USA | 1bp/token, 512 tokens x bidirectional | kuleshov-group/PlantCaduceus_l32 | kuleshov-group/PlantCaduceus
  Examples notebook
Evo 2 | Arc Institute and Laboratory of Evolutionary Design, Stanford University, USA | 1bp/token, 1 million token context, StripedHyena bidirectional | arcinstitute/evo2_40b | arcinstitute/evo2 and Zymrael/savanna for StripedHyena2 architecture
  Trained on the opengenome2 dataset, which includes multiple kingdoms of life but excludes viruses that infect eukaryotes.
  The previous Evo model was trained on prokaryote genomes only.
Plant-*-BPE | Zhang Tao Lab, Yangzhou University, China | nucleotide / BPE | collections/zhangtaolab/plant-foundation-models (multiple architectures: Mamba, BERT, GPT) | zhangtaolab/plant_DNA_LLMs
  finetuning shell script - inference shell script
PlantRNA-FM | ColaLAB @ University of Exeter and the John Innes Centre @ Norwich Research Park, UK | EsmTokenizer | yangheng/PlantRNA-FM | yangheng95/PlantRNA-FM
  region classification in Python - also see OmniGenome-52M model and OmniGenBench benchmark for foundation models
PlantBERT | University of Potsdam, Germany | multiple nucleotides | nigelhartm/PlantBERT | nigelhartm/PlantBERT
  finetuning Python script
FloraBERT | Institute for Applied Computational Sciences, Harvard University, USA | multiple nucleotides | Gurveer05/FloraBERT | gurveervirk/florabert
  finetuning Python script
Farmer.chat | Digital Green and CGIAR, India | natural language LLMs | HF blog | GooeyAI
1001G+ project | multiple, EU | pangenomic data | -- | 1001genomes.org from Arabidopsis thaliana
protein-matryoshka-embeddings | it me, USA | amino acids | monsoon-nlp/protein-matryoshka-embeddings | --
  dataset of pairs
  plant protein and text descriptions
  plant protein descriptions converted to binary classification tasks
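
The Tokenization column above mixes a few schemes: fixed k-mers (AgroNT at 6bp/token), single bases (PlantCaduceus and Evo 2 at 1bp/token), and learned BPE vocabularies. As a rough illustration of what the first two mean for sequence length, here is a minimal Python sketch. The function names and the example sequence are made up for this page, and the handling of leftover bases is an assumption; this is not the tokenizer code of any model listed here.

```python
def kmer_tokenize(seq: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into non-overlapping k-mers.

    Assumption for illustration: bases left over at the end (fewer than k)
    are emitted as single-base tokens. Real tokenizers may differ.
    """
    seq = seq.upper()
    cut = len(seq) - len(seq) % k          # last index covered by full k-mers
    tokens = [seq[i:i + k] for i in range(0, cut, k)]
    tokens.extend(seq[cut:])               # each leftover base becomes a token
    return tokens


def base_tokenize(seq: str) -> list[str]:
    """One token per base, as in the 1bp/token entries."""
    return list(seq.upper())


if __name__ == "__main__":
    dna = "ATGGCGTACGTTAGC"  # hypothetical 15-base example
    print(len(kmer_tokenize(dna)), kmer_tokenize(dna))
    print(len(base_tokenize(dna)), base_tokenize(dna))
```

With 6-mers the same stretch of DNA uses roughly one sixth as many tokens, so a fixed-size context window covers a longer region than it would at 1bp/token.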

"Plant-Inclusive" LLMs / Research

Trained on cells and proteins in general, with plants included in the training data.
Name | Source | Tokenization | HF Link | GH Link
gLM2 | Tatta Bio | hybrid 1bp or 1 amino acid/token, 4096x window | tattabio/gLM2_650M | TattaBio/gLM2
  example notebook - gLM2's embeddings model
ZymCTRL | AI for Protein Design, Centre for Genomic Regulation, Spain | amino acids | AI4PD/ZymCTRL | --
  Enzyme design model
  Improved on with reinforcement learning / DPO: code
ProtNote | Microsoft | amino acids + annotation codes | -- | microsoft/protnote
  Multiple notebooks
ESM-2 | Facebook / Meta | ESMTokenizer / amino acids | facebook/esm2_t48_15B_UR50D | facebookresearch/esm (archived)
  Many finetuned models on HF, such as AmelieSchreiber/esmbind
  With "Flash Attention"
Bio Embeddings | Technical University of Munich, Germany | amino acids (?) | supports multiple models | sacdallago/bio_embeddings
  data.bioembeddings.com
ProtTrans | multiple | amino acids | Rostlab | agemagician/ProtTrans
GPN | Song Lab, U.C. Berkeley, USA | nucleotide | collections/songlab/gpn (model, tokenizer, dataset) | songlab-cal/gpn
  Embedding notebook - Multiple Sequence Alignment (MSA)
BioM3 | U. Chicago, USA | amino acid, protein design from natural text | -- | --
  BioM3 paper
ProteinBERT | Hebrew University of Jerusalem, Israel | 1/amino acid | GrimSqueaker/proteinBERT | nadavbra/protein_bert
  demo notebook
Mistral-DNA | Paul Sabatier University, France | nucleotide | RaphaelMourad/Mistral-DNA-v1-417M-Athaliana | raphaelmourad/Mistral-DNA
  finetuning notebook
PAIR | Vector Institute / U. Toronto, Canada | amino acid | h4duan (finetunes of ESM, ProtT5) | --
  Sample use in HF repo
ProtBFN | InstaDeep | amino acids | InstaDeepAI/protein-sequence-bfn | instadeepai/protein-sequence-bfn
  Sample Python script
CHEAP Embeddings | multiple, USA | amino acids | amyxlu/cheap-proteins, based on ESMFold | amyxlu/cheap-proteins
ProTrek | Westlake Representation Learning Lab, Westlake University, China | EsmTokenizer | westlake-repl/ProTrek_650M_UniRef50 and see SaProt_650M_AF2 | westlake-repl/ProTrek
  CoLab notebook
Prot-xLSTM | JKU Linz, Austria | amino acids | ml.jku.at | ml-jku/Prot-xLSTM
  evaluation examples - generation notebook - variant fitness notebook
ProstT5 | multiple | amino acid and 3Di structure tokens | Rostlab/ProstT5 | mheinzinger/ProstT5
InterProt | multiple | ESM2 tokenizer | liambai/InterProt-ESM2-SAEs | etowahadams/interprot
RNA-FM / RhoFold | multiple | nucleotides | (official on Google Drive) (unofficial on HF/multimolecule) | ml4bio/RNA-FM
Protein-Vec | Gleghorn Lab, U. Delaware, USA | | lhallee/ProteinVec | lhallee/ProteinVecHuggingface
MAMMAL / biomed.omics / biomed-multi-alignment | IBM, USA | amino acids, gene names | ibm/biomed.omics.bl.sm.ma-ted-458m.protein_solubility | BiomedSciAI/biomed-multi-alignment
  tutorial notebook creating a new task
ProteinGLM | Tsinghua University, China | amino acids | Bo1015/proteinglm-100b-int4 | allanchen95/xTrimoPGLM
  contact prediction Python script
PSALM | Harvard University, USA | ESMTokenizer | ProteinSequenceAnnotation | Protein-Sequence-Annotation/PSALM
Protein-Llama-3-8B and Protein-Phi-3-mini | Esperanto Technologies, USA | | collections/Esperanto/esperanto-fine-tuned-llm-based-models | --
ProteinDT | multiple | amino acids | chao1224/ProteinDT | chao1224/ProteinDT
SeqDance | Shen Lab, Columbia University, USA | EsmTokenizer | on Zenodo | ShenLab/SeqDance
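
Some entries in these tables list a fixed context window (for example, 4096 tokens for gLM2), so a longer sequence generally has to be cut into pieces before it can be fed to the model. Below is a minimal Python sketch of one common way to do that, with an optional overlap between neighboring windows. The function name, the default sizes, and the overlap strategy are assumptions for illustration, not something any of the listed projects prescribe.

```python
def split_into_windows(tokens, window=4096, overlap=0):
    """Cut a token list into chunks of at most `window` tokens.

    Consecutive chunks share `overlap` tokens. The final chunk may be
    shorter than `window`. Hypothetical helper, not part of any library.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if not tokens:
        return []

    step = window - overlap
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):  # this chunk reached the end
            break
        start += step
    return windows


if __name__ == "__main__":
    toy = list(range(10))  # stand-in for token ids
    print(split_into_windows(toy, window=4))
    print(split_into_windows(toy, window=4, overlap=2))
```

Overlapping windows cost extra compute but reduce the chance that a feature of interest is split across a chunk boundary.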

Other notable biology + LLM work

Not trained on any plants, but these seem important to the bio research community. Includes datasets and tool-use work.

Name | Source | HF Link | GH Link
HyenaDNA | Hazy Research, Stanford University, USA | LongSafari/hyenadna-large-1m-seqlen | HazyResearch/hyena-dna
  evaluation scripts
Open MetaGenomic Dataset OMG | Tatta Bio | datasets/tattabio/OMG | TattaBio/OMG
scGPT | WangLab, U. Toronto, Canada | -- | bowang-lab/scGPT
  finetuning Python script
AllTheBacteria | European Molecular Biology Laboratory, UK | -- | AllTheBacteria/AllTheBacteria (dataset)
GENA-LM | AIRI Institute, Russia | AIRI-Institute/dna-language-models | AIRI-Institute/GENA_LM (human DNA LLM)
TYMEFLIES | Multiple, USA | -- | Nature paper, "Two decades of bacterial ecology and evolution in a freshwater lake" (genomics), wisc.edu article, Bsky Post
METAGENE | Multiple, USA | metagene-ai/METAGENE-1 (pretrained on wastewater sample genomics) | metagene-ai
Genomic Benchmarks | CEITEC, Czech Republic | katarinagresova | ML-Bioinfo-CEITEC/genomic_benchmarks (benchmark)
  Notebook loading benchmarks from HF
DNABERT-2 | MAGICS Labs, Northwestern University and Stony Brook University, USA | zhihan1996/DNABERT-2-117M | MAGICS-LAB/DNABERT_2
nucleotide-transformer | InstaDeep | InstaDeepAI/nucleotide-transformer | InstaDeepAI/nucleotide-transformer
  Best finetuned performance on DART-Eval
GeneGPT | NCBI, USA | -- | ncbi/GeneGPT
  Natural language text / amino acids tool-use dataset
ChatCell / Mol-Instructions | ZJUKG, Zhejiang University, China | zjunlp/chatcell-large | --
  ChatCell instruction dataset
  Mol-Instructions dataset
affinity_pred | Oak Ridge National Laboratory, USA | jglaser | ORNL/affinity_pred
ceLLama | CelVox, Netherlands | -- (standard LLM with R packages) | CelVoxes/ceLLama
  Tutorial notebook

To be sorted

Vine border image produced by valadzionak_volha on Freepik