A list of LLMs trained on plant genomes and proteomes. For context or questions, see the GitHub repo.

Plant-Based LLMs / Research

Name | Source | Tokenization | HF Link | GH Link
AgroNT | InstaDeep | 6bp/token, 1000x window | InstaDeepAI/agro-nucleotide-transformer-1b | instadeepai/nucleotide-transformer
Inference notebook - Inference on segment - For non-plant DNA, see: nucleotide-transformer-2.5b-multi-species
PlantCaduceus | Kuleshov Group, Cornell University, USA | 1bp/token, 512 tokens x bidirectional | kuleshov-group/PlantCaduceus_l32 | kuleshov-group/PlantCaduceus
Examples notebook
Plant-*-BPE | Zhang Tao Lab, Yangzhou University, China | nucleotide / BPE | collections/zhangtaolab/plant-foundation-models (multiple architectures: Mamba, BERT, GPT) | zhangtaolab/plant_DNA_LLMs
finetuning shell script - inference shell script
PlantRNA-FM | ColaLAB @ University of Exeter and the John Innes Centre @ Norwich Research Park, UK | EsmTokenizer | yangheng/PlantRNA-FM | yangheng95/PlantRNA-FM
region classification in Python - also see OmniGenome-52M model and OmniGenBench benchmark for foundation models
PlantBERT | University of Potsdam, Germany | multiple nucleotides | nigelhartm/PlantBERT | nigelhartm/PlantBERT
finetuning Python script
FloraBERT | Institute for Applied Computational Sciences, Harvard University, USA | multiple nucleotides | Gurveer05/FloraBERT | gurveervirk/florabert
finetuning Python script
Farmer.chat | Digital Green and CGIAR, India | natural language LLMs | HF blog | GooeyAI
1001G+ project | multiple, EU | pangenomic data | -- | 1001genomes.org from Arabidopsis thaliana
protein-matryoshka-embeddings | it me, USA | amino acids | monsoon-nlp/protein-matryoshka-embeddings | --
dataset of pairs
plant protein and text descriptions
plant protein descriptions converted to binary classification tasks
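
The Tokenization column above mixes two granularities for DNA: several bases per token (AgroNT's 6bp/token) and one base per token (PlantCaduceus's 1bp/token). The sketch below illustrates the difference in plain Python. It is not the tokenizer code of either model: the function names are hypothetical, the handling of leftover bases is an assumption, and reading "1000x window" as roughly 1000 tokens is also an assumption.

```python
def kmer_tokenize(seq: str, k: int = 6) -> list[str]:
    """Split DNA into non-overlapping k-mers, left to right.

    Assumption: bases left over at the end (fewer than k) become
    single-base tokens. Real tokenizers may handle this differently.
    """
    seq = seq.upper()
    cut = len(seq) - len(seq) % k          # last index covered by full k-mers
    tokens = [seq[i:i + k] for i in range(0, cut, k)]
    tokens.extend(seq[cut:])               # one token per leftover base
    return tokens


def single_base_tokenize(seq: str) -> list[str]:
    """One token per nucleotide (the 1bp/token scheme)."""
    return list(seq.upper())


def bases_covered(n_tokens: int, bases_per_token: int) -> int:
    """Approximate number of bases that fit in a window of n_tokens."""
    return n_tokens * bases_per_token


if __name__ == "__main__":
    dna = "ATGGCGTACGTTAGC"  # 15 bases, made-up example sequence
    print(kmer_tokenize(dna))
    print(single_base_tokenize(dna))
    # Assumed window sizes taken from the table: ~1000 tokens and 512 tokens
    print(bases_covered(1000, 6), bases_covered(512, 1))
```

Under these assumptions, a 6bp/token model sees several thousand bases per window while a 1bp/token model sees a few hundred, which is worth keeping in mind when choosing a model for long genomic regions.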

"Plant-Inclusive" LLMs / Research

Trained on cells and proteins in general, with plant data included in training.
Name | Source | Tokenization | HF Link | GH Link
gLM2 | Tatta Bio | hybrid 1bp or 1 amino acid/token, 4096x window | tattabio/gLM2_650M | TattaBio/gLM2
example notebook - gLM2's embeddings model
ZymCTRL | AI for Protein Design, Centre for Genomic Regulation, Spain | amino acids | AI4PD/ZymCTRL | --
Enzyme design model
Improved on with reinforcement learning / DPO: code
ProtNote | Microsoft | amino acids + annotation codes | -- | microsoft/protnote
Multiple notebooks
ESM-2 | Facebook / Meta | EsmTokenizer / amino acids | facebook/esm2_t48_15B_UR50D | facebookresearch/esm (archived)
Many finetuned models on HF, such as AmelieSchreiber/esmbind
With "Flash Attention"
Bio Embeddings | Technical University of Munich, Germany | amino acids (?) | supports multiple models | sacdallago/bio_embeddings
ProtTrans | multiple | amino acids | Rostlab | agemagician/ProtTrans
GPN | Song Lab, U.C. Berkeley, USA | nucleotide | collections/songlab/gpn (model, tokenizer, dataset) | songlab-cal/gpn
Embedding notebook - Multiple Sequence Alignment (MSA)
BioM3 | U. Chicago, USA | amino acid, protein design from natural text | -- | --
BioM3 paper
ProteinBERT | Hebrew University of Jerusalem, Israel | 1/amino acid | GrimSqueaker/proteinBERT | nadavbra/protein_bert
demo notebook
Mistral-DNA | Paul Sabatier University, France | nucleotide | RaphaelMourad/Mistral-DNA-v1-417M-Athaliana | raphaelmourad/Mistral-DNA
finetuning notebook
PAIR | Vector Institute / U. Toronto, Canada | amino acid | h4duan (finetunes of ESM, ProtT5) | --
Sample use in HF repo
ProtBFN | InstaDeep | amino acids | InstaDeepAI/protein-sequence-bfn | instadeepai/protein-sequence-bfn
Sample Python script
CHEAP Embeddings | multiple, USA | amino acids | amyxlu/cheap-proteins, based on ESMFold | amyxlu/cheap-proteins
ProTrek | Westlake Representation Learning Lab, Westlake University, China | EsmTokenizer | westlake-repl/ProTrek_650M_UniRef50 and see SaProt_650M_AF2 | westlake-repl/ProTrek
CoLab notebook
Prot-xLSTM | JKU Linz, Austria | amino acids | ml.jku.at | ml-jku/Prot-xLSTM
evaluation examples - generation notebook - variant fitness notebook
ProstT5 | multiple | amino acid and 3Di structure tokens | Rostlab/ProstT5 | mheinzinger/ProstT5
InterProt | multiple | ESM2 tokenizer | liambai/InterProt-ESM2-SAEs | etowahadams/interprot
RNA-FM / RhoFold | multiple | nucleotides | (official on Google Drive) (unofficial on HF/multimolecule) |
Protein-Vec | Gleghorn Lab, U. Delaware, USA | | lhallee/ProteinVec | lhallee/ProteinVecHuggingface
MAMMAL / biomed.omics / biomed-multi-alignment | IBM, USA | amino acids, gene names | ibm/biomed.omics.bl.sm.ma-ted-458m.protein_solubility | BiomedSciAI/biomed-multi-alignment
tutorial notebook creating a new task
ProteinGLM | Tsinghua University, China | amino acids | Bo1015/proteinglm-100b-int4 | allanchen95/xTrimoPGLM
contact prediction Python script
PSALM | Harvard University, USA | EsmTokenizer | ProteinSequenceAnnotation | Protein-Sequence-Annotation/PSALM
Protein-Llama-3-8B and Protein-Phi-3-mini | Esperanto Technologies, USA | | collections/Esperanto/esperanto-fine-tuned-llm-based-models | --
ProteinDT | multiple | amino acids | chao1224/ProteinDT | chao1224/ProteinDT
SeqDance | Shen Lab, Columbia University, USA | EsmTokenizer | on Zenodo | ShenLab/SeqDance

Other notable biology + LLM work

Not trained on any plant data, but seemingly important to the bio research community. Includes datasets and tool-use work.

Name | Source | HF Link | GH Link
Evo | Laboratory of Evolutionary Design, Stanford University, USA | togethercomputer/evo-1-131k-base | evo-design/evo and togethercomputer/stripedhyena
inference example Python script
HyenaDNA | Hazy Research, Stanford University, USA | LongSafari/hyenadna-large-1m-seqlen | HazyResearch/hyena-dna
evaluation scripts
Open MetaGenomic Dataset OMG | Tatta Bio | datasets/tattabio/OMG | TattaBio/OMG
scGPT | WangLab, U. Toronto, Canada | -- | bowang-lab/scGPT
finetuning Python script
AllTheBacteria | European Molecular Biology Laboratory, UK | -- | AllTheBacteria/AllTheBacteria dataset
GENA-LM | AIRI Institute, Russia | AIRI-Institute/dna-language-models | AIRI-Institute/GENA_LM human DNA LLM
TYMEFLIES | Multiple, USA | -- | Nature paper, Two decades of bacterial ecology and evolution in a freshwater lake genomics, wisc.edu article, Bsky Post
METAGENE | Multiple, USA | metagene-ai/METAGENE-1 pretrained on wastewater sample genomics | metagene-ai
Genomic Benchmarks | CEITEC, Czech Republic | katarinagresova | ML-Bioinfo-CEITEC/genomic_benchmarks benchmark
Notebook loading benchmarks from HF
DNABERT-2 | MAGICS Labs, Northwestern University and Stony Brook University, USA | zhihan1996/DNABERT-2-117M | MAGICS-LAB/DNABERT_2
nucleotide-transformer | InstaDeep | InstaDeepAI/nucleotide-transformer | InstaDeepAI/nucleotide-transformer
Best finetuned performance on DART-Eval
GeneGPT | NCBI, USA | -- | ncbi/GeneGPT
Natural language text / amino acids tool-use dataset
ChatCell / Mol-Instructions | ZJUKG, Zhejiang University, China | zjunlp/chatcell-large | --
ChatCell instruction dataset
Mol-Instructions dataset
affinity_pred | Oak Ridge National Laboratory, USA | jglaser | ORNL/affinity_pred
ceLLama | CelVox, Netherlands | -- (standard LLM with R packages) | CelVoxes/ceLLama
Tutorial notebook

To be sorted

Vine border image produced by valadzionak_volha on Freepik