LLMs trained on plant genomes and proteomes. For context or questions, see the GitHub repo.
Skip to: Plant-Based | Plant-Inclusive | Other Notable
## Plant-Based

| Name | Source | Tokenization | HF Link | GH Link |
|---|---|---|---|---|
| AgroNT | InstaDeep | 6bp/token, 1,000-token context window | InstaDeepAI/agro-nucleotide-transformer-1b | instadeepai/nucleotide-transformer |
| Inference notebook - Inference on segment - For non-plant DNA, see: nucleotide-transformer-2.5b-multi-species - embedding sketch after this table | ||||
| PlantCaduceus | Kuleshov Group, Cornell University, USA | 1bp/token, 512-token bidirectional context | kuleshov-group/PlantCaduceus_l32 | kuleshov-group/PlantCaduceus |
| Examples notebook | ||||
| Plant-*-BPE / PDLLMs | Zhang Tao Lab, Yangzhou University, China | nucleotide / BPE | collections/zhangtaolab/plant-foundation-models (multiple architectures: Mamba, BERT, GPT) | zhangtaolab/plant_DNA_LLMs |
| finetuning shell script - inference shell script | ||||
| PlantRNA-FM | ColaLAB @ University of Exeter and the John Innes Centre @ Norwich Research Park, UK | EsmTokenizer | yangheng/PlantRNA-FM | yangheng95/PlantRNA-FM |
| region classification in Python - also see OmniGenome-52M model and OmniGenBench benchmark for foundation models | ||||
| PlantBERT | University of Potsdam, Germany | multiple nucleotides | nigelhartm/PlantBERT | nigelhartm/PlantBERT |
| finetuning Python script | ||||
| FloraBERT | Institute for Applied Computational Sciences, Harvard University, USA | multiple nucleotides | Gurveer05/FloraBERT | gurveervirk/florabert |
| finetuning Python script | ||||
| Farmer.chat | Digital Green and CGIAR, India | natural language LLMs | HF blog | GooeyAI |
| 1001G+ project | multiple, EU | pangenomic data | -- | 1001genomes.org (Arabidopsis thaliana data) |
| MoBiPlant | Universidad de Buenos Aires, Argentina | - | manufernandezbur/MoBiPlant | manoloFer10/mobiplant |
| This is a Q&A dataset for plant molecular biology. Claude and DeepSeek scored highly. | ||||
| protein-matryoshka-embeddings | it me, USA | amino acids | monsoon-nlp/protein-matryoshka-embeddings | -- |
| 2024 project on more efficient embeddings for plant proteins - dataset of paired plant proteins and text descriptions - plant protein descriptions converted to binary classification tasks | ||||
| tomatotomato-gLM2 | it me, USA | custom pangenome tokens | monsoon-nlp/tomatotomato-gLM2-150M-v0.1 | -- |
| 2025 project on representing two genomes with a single sequence encoding variation between tomato accessions | ||||
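Most of the plant DNA models above follow the standard Hugging Face Transformers workflow. Below is a minimal sketch for pulling mean-pooled sequence embeddings out of AgroNT (linked in the table); it assumes the checkpoint loads as a masked-LM like the other Nucleotide Transformer models, so defer to the model card and the linked inference notebook for the exact API.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# AgroNT from the table above; trust_remote_code only matters if the repo
# ships custom modeling code, and is harmless otherwise
model_id = "InstaDeepAI/agro-nucleotide-transformer-1b"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)
model.eval()

# toy DNA input; AgroNT tokenizes roughly 6bp per token
sequences = ["ATGGCGTACGTAGCTAGCTAGGATCGATCGATCGATTTGACGTACGT"]
inputs = tokenizer(sequences, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# mean-pool the last hidden layer over non-padding tokens
hidden = outputs.hidden_states[-1]             # (batch, tokens, dim)
mask = inputs["attention_mask"].unsqueeze(-1)  # (batch, tokens, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)
```

PlantCaduceus and the Plant-*-BPE / PDLLMs checkpoints follow a similar pattern but have their own tokenizers and pooling conventions; the notebooks and scripts linked in the table are the authoritative versions.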
## Plant-Inclusive

| Name | Source | Tokenization | HF Link | GH Link |
|---|---|---|---|---|
| gLM2 | Tatta Bio | hybrid 1bp or 1 amino acid/token, 4,096-token context window | tattabio/gLM2_650M | TattaBio/gLM2 |
| example notebook - gLM2's embeddings model | ||||
| Evo 2 | Arc Institute and Laboratory of Evolutionary Design, Stanford University, USA | 1bp/token, 1 million token context, StripedHyena 2 architecture | arcinstitute/evo2_40b | arcinstitute/evo2 and Zymrael/savanna for StripedHyena2 architecture |
| Trained on the opengenome2 dataset, which includes multiple kingdoms of life while avoiding viruses that infect eukaryotes. The previous Evo model was trained on prokaryote genomes only. Independent researchers wrote a paper about deploying Evo 2; also read about generative design of bacteriophages with Evo 2. | ||||
| ZymCTRL | AI for Protein Design, Centre for Genomic Regulation, Spain | amino acids | AI4PD/ZymCTRL | -- |
| Enzyme design model - improved on with reinforcement learning / DPO: code | ||||
| ProtNote | Microsoft | amino acids + annotation codes | -- | microsoft/protnote |
| Multiple notebooks | ||||
| ESM-2 | Facebook / Meta | ESMTokenizer / amino acids | facebook/esm2_t48_15B_UR50D | facebookresearch/esm (archived) |
| Many finetuned models on HF, such as AmelieSchreiber/esmbind, and a version with "Flash Attention" - see the pooled-embedding sketch after this table | ||||
| Bio Embeddings | Technical University of Munich, Germany | amino acids (?) | supports multiple models | sacdallago/bio_embeddings |
| data.bioembeddings.com | ||||
| ProtTrans | multiple | amino acids | Rostlab | agemagician/ProtTrans |
| GPN | Song Lab, U.C. Berkeley, USA | nucleotide | collections/songlab/gpn (model, tokenizer, dataset) | songlab-cal/gpn |
| Embedding notebook - Multiple Sequence Alignment (MSA) | ||||
| BioM3 | U. Chicago, USA | amino acids | -- | -- |
| protein design from natural language text - BioM3 paper | ||||
| ProteinBERT | Hebrew University of Jerusalem, Israel | 1 token/amino acid | GrimSqueaker/proteinBERT | nadavbra/protein_bert |
| demo notebook | ||||
| Mistral-DNA | Paul Sabatier University, France | nucleotide | RaphaelMourad/Mistral-DNA-v1-417M-Athaliana | raphaelmourad/Mistral-DNA |
| finetuning notebook | ||||
| PAIR | Vector Institute / U. Toronto, Canada | amino acid | h4duan (finetunes of ESM, ProtT5) | -- |
| Sample use in HF repo | ||||
| ProtBFN | InstaDeep | amino acids | InstaDeepAI/protein-sequence-bfn | instadeepai/protein-sequence-bfn |
| Sample Python script | ||||
| CHEAP Embeddings | multiple, USA | amino acids | amyxlu/cheap-proteins, based on ESMFold | amyxlu/cheap-proteins |
| ProTrek | Westlake Representation Learning Lab, Westlake University, China | EsmTokenizer | westlake-repl/ProTrek_650M_UniRef50 and see SaProt_650M_AF2 | westlake-repl/ProTrek |
| CoLab notebook | ||||
| Prot-xLSTM | JKU Linz, Austria | amino acids | ml.jku.at | ml-jku/Prot-xLSTM |
| evaluation examples - generation notebook - variant fitness notebook | ||||
| ProstT5 | multiple | amino acid and 3Di structure tokens | Rostlab/ProstT5 | mheinzinger/ProstT5 |
| InterProt | multiple | ESM2 tokenizer | liambai/InterProt-ESM2-SAEs | etowahadams/interprot |
| RNA-FM / RhoFold | multiple | nucleotides | (official on Google Drive) (unofficial on HF/multimolecule) | ml4bio/RNA-FM |
| Protein-Vec | Gleghorn Lab, U. Delaware, USA | -- | lhallee/ProteinVec | lhallee/ProteinVecHuggingface |
| MAMMAL / biomed.omics / biomed-multi-alignment | IBM, USA | amino acids, gene names | ibm/biomed.omics.bl.sm.ma-ted-458m.protein_solubility | BiomedSciAI/biomed-multi-alignment |
| tutorial notebook creating a new task | ||||
| ProteinGLM | Tsinghua University, China | amino acids | Bo1015/proteinglm-100b-int4 | allanchen95/xTrimoPGLM |
| contact prediction Python script | ||||
| PSALM | Harvard University, USA | ESMTokenizer | ProteinSequenceAnnotation | Protein-Sequence-Annotation/PSALM |
| Protein-Llama-3-8B and Protein-Phi-3-mini | Esperanto Technologies, USA | -- | collections/Esperanto/esperanto-fine-tuned-llm-based-models | -- |
| ProteinDT | multiple | amino acids | chao1224/ProteinDT | chao1224/ProteinDT |
| SeqDance | Shen Lab, Columbia University, USA | EsmTokenizer | on Zenodo | ShenLab/SeqDance |
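Many of the protein models in this table are drop-in Hugging Face checkpoints. Here is a minimal sketch of the pooled-embedding recipe referenced in the ESM-2 row, using the small public `facebook/esm2_t6_8M_UR50D` checkpoint rather than the 15B model linked above; the mean-pooling choice is a common convention, not an official API.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# small ESM-2 checkpoint; the table links the much larger facebook/esm2_t48_15B_UR50D
model_id = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

proteins = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ",
    "MSILVTRPSPAGEELVSRLRTLGQVAWHFPLIEF",
]
inputs = tokenizer(proteins, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**inputs)

# mean-pool per-residue embeddings, ignoring padding via the attention mask
# (special tokens are left in for simplicity)
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (2, hidden_size)
```

ProtTrans and ProstT5 expect spaces between residues (and, for ProstT5, prefix tokens), and several other entries (ProTrek, ProteinGLM, Prot-xLSTM) have their own loading paths, so check each linked card or repo before reusing this recipe.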
## Other Notable

Not trained on any plants, but these projects seem important to the bio research community. This section also includes datasets and tool-use work.
| Name | Source | HF Link | GH Link |
|---|---|---|---|
| AlphaGenome | Google DeepMind | - | google-deepmind/alphagenome |
| blog | |||
| HyenaDNA | Hazy Research, Stanford University, USA | LongSafari/hyenadna-large-1m-seqlen | HazyResearch/hyena-dna |
| evaluation scripts | |||
| Open MetaGenomic Dataset OMG | Tatta Bio | datasets/tattabio/OMG | TattaBio/OMG |
| scGPT | WangLab, U. Toronto, Canada | -- | bowang-lab/scGPT |
| finetuning Python script | |||
| AllTheBacteria | European Molecular Biology Laboratory, UK | -- | AllTheBacteria/AllTheBacteria dataset |
| GENA-LM | AIRI Institute, Russia | AIRI-Institute/dna-language-models | AIRI-Institute/GENA_LM (human DNA LLM) |
| TYMEFLIES | Multiple, USA | -- | Nature paper "Two decades of bacterial ecology and evolution in a freshwater lake", genomics.wisc.edu article, Bsky post |
| METAGENE | Multiple, USA | metagene-ai/METAGENE-1 pretrained on wastewater sample genomics | metagene-ai |
| Decoder-only, unidirectional - also check out the Microbial General Model (MGM) | |||
| Genomic Benchmarks | CEITEC, Czech Republic | katarinagresova | ML-Bioinfo-CEITEC/genomic_benchmarks benchmark |
| Notebook loading benchmarks from HF | |||
| DNABERT-2 | MAGICS Labs, Northwestern University and Stony Brook University, USA | zhihan1996/DNABERT-2-117M | MAGICS-LAB/DNABERT_2 |
| see the loading sketch after this table | |||
| nucleotide-transformer | InstaDeep | InstaDeepAI/nucleotide-transformer | InstaDeepAI/nucleotide-transformer |
| Best finetuned performance on DART-Eval | |||
| GeneGPT | NCBI, USA | -- | ncbi/GeneGPT |
| Natural language text / amino acids tool-use dataset | |||
| ChatCell / Mol-Instructions | ZJUKG, Zhejiang University, China | zjunlp/chatcell-large | -- |
| ChatCell instruction dataset - Mol-Instructions dataset | |||
| affinity_pred | Oak Ridge National Laboratory, USA | jglaser | ORNL/affinity_pred |
| ceLLama | CelVox, Netherlands | -- (standard LLM with R packages) | CelVoxes/ceLLama |
| Tutorial notebook | |||
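Several of the DNA models in this last table ship custom modeling code on the Hub, so loading them means opting into `trust_remote_code=True`. A rough sketch for DNABERT-2 (see the note in its row above), following its model card; the mean-pooling at the end is an assumption for illustration, not part of the official API.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# DNABERT-2 from the table above; trust_remote_code runs the repo's own BERT variant
model_id = "zhihan1996/DNABERT-2-117M"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
model.eval()

dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tokenizer(dna, return_tensors="pt")

with torch.no_grad():
    hidden_states = model(inputs["input_ids"])[0]  # (1, tokens, hidden)

# mean-pool over BPE tokens for a single per-sequence vector
embedding = hidden_states.mean(dim=1)
print(embedding.shape)  # (1, hidden_size)
```

HyenaDNA and GENA-LM typically need the same `trust_remote_code` opt-in, while scGPT and METAGENE-1 document their own loading paths in their repos; if DNABERT-2's custom attention code errors, check the pinned dependency versions in its GitHub README.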