LLMs trained on plant genomes and proteomes. For context or questions, see the GitHub repo.
Skip to: Plant-Based | Plant-Inclusive | Other Notable
Name | Source | Tokenization | HF Link | GH Link |
---|---|---|---|---|
gLM2 | Tatta Bio | hybrid: 1 bp or 1 amino acid per token, 4096-token window | tattabio/gLM2_650M | TattaBio/gLM2 |
example notebook - gLM2's embeddings model | ||||
ZymCTRL | AI for Protein Design, Centre for Genomic Regulation, Spain | amino acids | AI4PD/ZymCTRL | -- |
Enzyme design model. Improved with reinforcement learning / DPO: code | ||||
ProtNote | Microsoft | amino acids + annotation codes | -- | microsoft/protnote |
Multiple notebooks | ||||
ESM-2 | Facebook / Meta | ESMTokenizer / amino acids | facebook/esm2_t48_15B_UR50D | facebookresearch/esm (archived) |
Many finetuned models on HF, such as AmelieSchreiber/esmbind - With "Flash Attention" | ||||
Bio Embeddings | Technical University of Munich, Germany | amino acids (?) | supports multiple models | sacdallago/bio_embeddings |
data.bioembeddings.com | ||||
ProtTrans | multiple | amino acids | Rostlab | agemagician/ProtTrans |
GPN | Song Lab, U.C. Berkeley, USA | nucleotide | collections/songlab/gpn (model, tokenizer, dataset) | songlab-cal/gpn |
Embedding notebook - Multiple Sequence Alignment (MSA) | ||||
BioM3 | U. Chicago, USA | amino acid, protein design from natural text | -- | -- |
BioM3 paper | ||||
ProteinBERT | Hebrew University of Jerusalem, Israel | 1 token per amino acid | GrimSqueaker/proteinBERT | nadavbra/protein_bert |
demo notebook | ||||
Mistral-DNA | Paul Sabatier University, France | nucleotide | RaphaelMourad/Mistral-DNA-v1-417M-Athaliana | raphaelmourad/Mistral-DNA |
finetuning notebook | ||||
PAIR | Vector Institute / U. Toronto, Canada | amino acid | h4duan (finetunes of ESM, ProtT5) | -- |
Sample use in HF repo | ||||
ProtBFN | InstaDeep | amino acids | InstaDeepAI/protein-sequence-bfn | instadeepai/protein-sequence-bfn |
Sample Python script | ||||
CHEAP Embeddings | multiple, USA | amino acids | amyxlu/cheap-proteins, based on ESMFold | amyxlu/cheap-proteins |
ProTrek | Westlake Representation Learning Lab, Westlake University, China | EsmTokenizer | westlake-repl/ProTrek_650M_UniRef50 (see also SaProt_650M_AF2) | westlake-repl/ProTrek |
CoLab notebook | ||||
Prot-xLSTM | JKU Linz, Austria | amino acids | ml.jku.at | ml-jku/Prot-xLSTM |
evaluation examples - generation notebook - variant fitness notebook | ||||
ProstT5 | multiple | amino acid and 3Di structure tokens | Rostlab/ProstT5 | mheinzinger/ProstT5 |
InterProt | multiple | ESM2 tokenizer | liambai/InterProt-ESM2-SAEs | etowahadams/interprot |
RNA-FM / RhoFold | multiple | nucleotides | (official on Google Drive) (unofficial on HF/multimolecule) | ml4bio/RNA-FM |
Protein-Vec | Gleghorn Lab, U. Delaware, USA | -- | lhallee/ProteinVec | lhallee/ProteinVecHuggingface |
MAMMAL / biomed.omics / biomed-multi-alignment | IBM, USA | amino acids, gene names | ibm/biomed.omics.bl.sm.ma-ted-458m.protein_solubility | BiomedSciAI/biomed-multi-alignment |
tutorial notebook creating a new task | ||||
ProteinGLM | Tsinghua University, China | amino acids | Bo1015/proteinglm-100b-int4 | allanchen95/xTrimoPGLM |
contact prediction Python script | ||||
PSALM | Harvard University, USA | ESMTokenizer | ProteinSequenceAnnotation | Protein-Sequence-Annotation/PSALM |
Protein-Llama-3-8B and Protein-Phi-3-mini | Esperanto Technologies, USA | -- | collections/Esperanto/esperanto-fine-tuned-llm-based-models | -- |
ProteinDT | multiple | amino acids | chao1224/ProteinDT | chao1224/ProteinDT |
SeqDance | Shen Lab, Columbia University, USA | EsmTokenizer | on Zenodo | ShenLab/SeqDance |
Not trained on any plants, but seems important to the bio research community. Includes datasets and tool use.