bio_embeddings.embed

Language models to translate amino acid sequences into vector representations

All language models implement the EmbedderInterface. You can embed a single sequence with EmbedderInterface.embed() or a list of sequences with EmbedderInterface.embed_many(). All except CPCProt and UniRep generate per-residue embeddings, which you can summarize into a fixed-size per-protein embedding by calling EmbedderInterface.reduce_per_protein(). CPCProt only generates a per-protein embedding (reduce_per_protein does nothing), while UniRep includes a start token, so its embedding is one position longer than the protein. UniRep, GloVe, fastText and word2vec only support the CPU.
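As a rough illustration of the workflow described above, the helper below turns a list of sequences into per-protein embeddings using only the documented embed_many() and reduce_per_protein() methods. ToyEmbedder is a hypothetical stand-in, not part of bio_embeddings: it flags alanine and glycine instead of running a language model, uses plain lists in place of numpy arrays, and assumes that reducing means averaging over residues. With the package installed, a real embedder instance should be usable in its place.

```python
from typing import Iterable, Iterator, List


def per_protein_embeddings(embedder, sequences: Iterable[str]) -> list:
    """Embed every sequence, then reduce each result to a fixed-size vector.

    Relies only on embed_many() and reduce_per_protein() as documented above.
    """
    return [embedder.reduce_per_protein(e) for e in embedder.embed_many(sequences)]


class ToyEmbedder:
    """Hypothetical stand-in, NOT part of bio_embeddings.

    Each residue becomes a 2-dimensional vector: [is_alanine, is_glycine].
    """

    def embed(self, sequence: str) -> List[List[float]]:
        return [[float(aa == "A"), float(aa == "G")] for aa in sequence]

    def embed_many(self, sequences: Iterable[str]) -> Iterator[List[List[float]]]:
        for sequence in sequences:
            yield self.embed(sequence)

    @staticmethod
    def reduce_per_protein(embedding: List[List[float]]) -> List[float]:
        # Assumption: average over residues; real embedders may reduce differently.
        length = len(embedding)
        return [sum(column) / length for column in zip(*embedding)]


vectors = per_protein_embeddings(ToyEmbedder(), ["AAGG", "AGTTTTTT"])
print(vectors)  # [[0.5, 0.5], [0.125, 0.125]]
```

Both proteins end up with a vector of the same length even though the sequences differ in length, which is the point of the per-protein reduction.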

OneHotEncodingEmbedder offers a naive baseline to compare the language model embeddings against, with one-hot encoding as the per-residue embedding and amino acid composition as the per-protein embedding. It accepts keyword arguments but ignores them, since it does not do any notable computation.

Instead of using bio_embeddings[all], it’s possible to only install some embedders by selecting specific extras:

allennlp: seqvec
transformers: prottrans_albert_bfd, prottrans_bert_bfd, prottrans_xlnet_uniref100, prottrans_t5_bfd, prottrans_t5_uniref50, prottrans_t5_xl_u50
jax-unirep: unirep
esm: esm
cpcprot: cpcprot
plus: plus_rnn
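
The mapping above can be turned into a small lookup that works out which extras a given set of embedders needs. This is only a sketch: the dictionary is transcribed from the list above (using the prottrans_xlnet_uniref100 spelling from the model table), requirement_for is a hypothetical helper, and the resulting string simply follows the bio_embeddings[...] notation used in the text.

```python
from typing import Iterable

# Extras and the embedders they enable, transcribed from the list above.
EXTRA_TO_EMBEDDERS = {
    "allennlp": ["seqvec"],
    "transformers": [
        "prottrans_albert_bfd",
        "prottrans_bert_bfd",
        "prottrans_xlnet_uniref100",
        "prottrans_t5_bfd",
        "prottrans_t5_uniref50",
        "prottrans_t5_xl_u50",
    ],
    "jax-unirep": ["unirep"],
    "esm": ["esm"],
    "cpcprot": ["cpcprot"],
    "plus": ["plus_rnn"],
}


def requirement_for(embedders: Iterable[str]) -> str:
    """Build a requirement string such as 'bio_embeddings[allennlp,esm]'."""
    wanted = set(embedders)
    known = {name for names in EXTRA_TO_EMBEDDERS.values() for name in names}
    unknown = wanted - known
    if unknown:
        raise ValueError(f"No extra listed for: {sorted(unknown)}")
    extras = sorted(
        extra
        for extra, provided in EXTRA_TO_EMBEDDERS.items()
        if wanted & set(provided)
    )
    return f"bio_embeddings[{','.join(extras)}]"


print(requirement_for(["seqvec", "esm"]))  # bio_embeddings[allennlp,esm]
```

Embedders that the list does not mention (for example one_hot_encoding) raise a ValueError here, because the text does not say which extra, if any, they need.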

Model sizes

The disk size represents the size of the unzipped models or the combination of all files necessary for a particular embedder. The GPU and CPU sizes cover only loading the model into GPU memory (VRAM) or main RAM; they exclude the memory required to do any computation. They were measured for one specific set of hardware and software (Quadro RTX 8000, CUDA 11.1, torch 1.7.1, x86 64-bit Ubuntu 18.04) and will vary for different setups.

Model	Disk size (GB)	GPU size (GB)	CPU size (GB)
bepler	0.1	1.4	0.2
bert_from_publication	0.008	1.1	0.006
cpcprot	0.007	1.1	0.01
deepblast	0.4	1.4	0.26
esm	6.3	3.9	2.7
esm1b	7.3	3.8	2.6
esm1v	7.3	3.9	2.6
fasttext	0.05	n/a	0.03
glove	0.06	n/a	0.03
one_hot_encoding	n/a	n/a	n/a
pb_tucker	0.009	1.0	0.02
plus_rnn	0.06	1.2	0.1
prottrans_albert_bfd	0.9	2.0	1.8
prottrans_bert_bfd	1.6	2.8	3.4
prottrans_t5_bfd	7.2	5.9	16.1
prottrans_t5_uniref50	7.2	5.9	16.1
prottrans_t5_xl_u50	7.2	5.9	16.1
prottrans_xlnet_uniref100	1.6	2.7	3.3
seqvec	0.4	1.6	0.5
seqvec_from_publication	0.004	1.1	0.006
unirep	n/a	n/a	0.2
word2vec	0.07	n/a	0.06
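
One way to use the table is to shortlist the models whose weights fit into the available VRAM. The sketch below copies a handful of GPU sizes from the table; models_that_load and its headroom_gb parameter are hypothetical names introduced here. Because the listed sizes exclude the memory needed for computation, the headroom is a placeholder that you would have to choose yourself.

```python
from typing import Dict, List, Optional

# GPU load sizes in GB for a few models, copied from the table above.
# None marks models listed as "n/a" in the GPU column.
GPU_SIZE_GB: Dict[str, Optional[float]] = {
    "bepler": 1.4,
    "esm1b": 3.8,
    "prottrans_bert_bfd": 2.8,
    "prottrans_t5_xl_u50": 5.9,
    "seqvec": 1.6,
    "word2vec": None,
}


def models_that_load(free_vram_gb: float, headroom_gb: float = 0.0) -> List[str]:
    """Return models whose weights fit into the given VRAM minus headroom."""
    budget = free_vram_gb - headroom_gb
    return sorted(
        name
        for name, size in GPU_SIZE_GB.items()
        if size is not None and size <= budget
    )


print(models_that_load(4.0, headroom_gb=1.0))
# ['bepler', 'prottrans_bert_bfd', 'seqvec']
```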

class bio_embeddings.embed.BeplerEmbedder(device: Union[None, str, torch.device] = None, **kwargs)[source]¶

Bepler Embedder

Bepler, Tristan, and Bonnie Berger. “Learning protein sequence embeddings using information from structure.” arXiv preprint arXiv:1902.08661 (2019).

__init__(device: Union[None, str, torch.device] = None, **kwargs)[source]¶

Initializer accepts location of a pre-trained model and options

embed(sequence: str) → numpy.ndarray[source]¶

Returns embedding for one sequence.

Parameters: sequence – Valid amino acid sequence as String
Returns: An embedding of the sequence.

embedding_dimension: ClassVar[int] = 121¶

name: ClassVar[str] = 'bepler'¶

necessary_files: ClassVar[List[str]] = ['model_file']¶

number_of_layers: ClassVar[int] = 1¶

static reduce_per_protein(embedding: numpy.ndarray) → numpy.ndarray[source]¶

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters: embedding – the embedding
Returns: A fixed size embedding (a vector of size N, where N is fixed)

class bio_embeddings.embed.CPCProtEmbedder(device: Union[None, str, torch.device] = None, **kwargs)[source]¶

CPCProt Embedder

Lu, Amy X., et al. “Self-supervised contrastive learning of protein representations by mutual information maximization.” bioRxiv (2020). https://doi.org/10.1101/2020.09.04.283929

__init__(device: Union[None, str, torch.device] = None, **kwargs)[source]¶

Initializer accepts location of a pre-trained model and options

embed(sequence: str) → numpy.ndarray[source]¶

Returns embedding for one sequence.

Parameters: sequence – Valid amino acid sequence as String
Returns: An embedding of the sequence.

embed_batch(batch: List[str]) → Generator[numpy.ndarray, None, None][source]¶

See https://github.com/amyxlu/CPCProt/blob/df1ad1118544ed349b5e711207660a7c205b3128/embed_fasta.py

embedding_dimension: ClassVar[int] = 512¶

name: ClassVar[str] = 'cpcprot'¶

necessary_files: ClassVar[List[str]] = ['model_file']¶

number_of_layers: ClassVar[int] = 1¶

static reduce_per_protein(embedding: numpy.ndarray) → numpy.ndarray[source]¶

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters: embedding – the embedding
Returns: A fixed size embedding (a vector of size N, where N is fixed)

class bio_embeddings.embed.ESM1bEmbedder(device: Union[None, str, torch.device] = None, **kwargs)[source]¶

ESM-1b Embedder (Note: This is not the original ESM)

Rives, Alexander, et al. “Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences.” Proceedings of the National Academy of Sciences 118.15 (2021). https://doi.org/10.1073/pnas.2016239118

name: ClassVar[str] = 'esm1b'¶

class bio_embeddings.embed.ESM1vEmbedder(ensemble_id: int, device: Union[None, str, torch.device] = None, **kwargs)[source]¶

ESM-1v Embedder (one of five)

ESM1v uses an ensemble of five models, called esm1v_t33_650M_UR90S_[1-5]. An instance of this class is one of the five, specified by ensemble_id.

Meier, Joshua, et al. “Language models enable zero-shot prediction of the effects of mutations on protein function.” bioRxiv (2021). https://doi.org/10.1101/2021.07.09.450648

__init__(ensemble_id: int, device: Union[None, str, torch.device] = None, **kwargs)[source]¶

You must pass the number of the model (1-5) as the first parameter, though you can override the weights file with model_file.

ensemble_id: int¶

name: ClassVar[str] = 'esm1v'¶
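
Since each instance covers only one ensemble member, using the full ensemble means creating five embedders. The sketch below shows the bookkeeping under that reading: esm1v_model_name and build_ensemble are hypothetical helpers, and factory stands for any callable that accepts ensemble_id as a keyword argument, which the ESM1vEmbedder signature above suggests the class itself does.

```python
from typing import Callable, List

ENSEMBLE_IDS = range(1, 6)  # the five ensemble members, numbered 1 to 5


def esm1v_model_name(ensemble_id: int) -> str:
    """Return the model name for one ensemble member."""
    if ensemble_id not in ENSEMBLE_IDS:
        raise ValueError(f"ensemble_id must be between 1 and 5, got {ensemble_id}")
    return f"esm1v_t33_650M_UR90S_{ensemble_id}"


def build_ensemble(factory: Callable[..., object]) -> List[object]:
    """Create one embedder per ensemble member.

    `factory` is expected to accept ensemble_id as a keyword argument.
    """
    return [factory(ensemble_id=i) for i in ENSEMBLE_IDS]


print([esm1v_model_name(i) for i in ENSEMBLE_IDS])
```

With the package installed, build_ensemble(ESM1vEmbedder) would be the intended call. Going by the model size table, each member is listed at several GB, so holding all five in memory at once may not be practical on smaller GPUs.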

class bio_embeddings.embed.ESMEmbedder(device: Union[None, str, torch.device] = None, **kwargs)[source]¶

ESM Embedder (Note: This is not ESM-1b)

name: ClassVar[str] = 'esm'¶

class bio_embeddings.embed.EmbedderInterface(device: Union[None, str, torch.device] = None, **kwargs)[source]¶

__init__(device: Union[None, str, torch.device] = None, **kwargs)[source]¶

Initializer accepts location of a pre-trained model and options

abstract embed(sequence: str) → numpy.ndarray[source]¶

Returns embedding for one sequence.

Parameters: sequence – Valid amino acid sequence as String
Returns: An embedding of the sequence.

embed_batch(batch: List[str]) → Generator[numpy.ndarray, None, None][source]¶

Computes the embeddings from all sequences in the batch

The provided implementation is a dummy implementation that should be overridden with the appropriate batching method for the model.
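
The relationship between the abstract embed() and the overridable embed_batch() can be sketched as follows. SketchEmbedder is a hypothetical, simplified mirror of the interface rather than the library class, and the one-sequence-at-a-time fallback is an assumption about what the dummy implementation does. PositionEmbedder is a toy subclass whose padding step stands in for what a real model would do before a single forward pass.

```python
from abc import ABC, abstractmethod
from typing import Generator, List

Embedding = List[List[float]]  # stand-in for a numpy array of shape (L, D)


class SketchEmbedder(ABC):
    """Hypothetical mirror of the interface; not the library class."""

    @abstractmethod
    def embed(self, sequence: str) -> Embedding:
        """Embed a single sequence."""

    def embed_batch(self, batch: List[str]) -> Generator[Embedding, None, None]:
        # Assumed fallback: no real batching, one embed() call per sequence.
        for sequence in batch:
            yield self.embed(sequence)


class PositionEmbedder(SketchEmbedder):
    """Toy subclass: each residue becomes [position, sequence length]."""

    def embed(self, sequence: str) -> Embedding:
        return [[float(i), float(len(sequence))] for i in range(len(sequence))]

    def embed_batch(self, batch: List[str]) -> Generator[Embedding, None, None]:
        # Pad to a common length, as a batched model would, then trim again
        # so that each result matches what embed() returns.
        longest = max(len(s) for s in batch)
        padded = [s.ljust(longest, "-") for s in batch]
        for original, row in zip(batch, padded):
            full = [[float(i), float(len(original))] for i in range(len(row))]
            yield full[: len(original)]


batch = ["MK", "MKT"]
batched = list(PositionEmbedder().embed_batch(batch))
single = [PositionEmbedder().embed(s) for s in batch]
print(batched == single)  # True
```

Whatever batching strategy a subclass uses, the results should match what embed() returns for the same sequences; the final comparison checks exactly that for the toy class.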

embed_many(sequences: Iterable[str], batch_size: Optional[int] = None) → Generator[numpy.ndarray, None, None][source]¶

Returns the embeddings for multiple sequences.

Parameters

sequences – Iterable of proteins as amino acid (AA) strings
batch_size – For embedders that profit from batching, the maximum number of amino acids per batch

Returns

A generator yielding one embedding per sequence.
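
Note that batch_size counts amino acids, not sequences. A minimal sketch of grouping by residue budget is shown below; batches_by_residues is a hypothetical helper and not necessarily how the library forms its batches. In particular, emitting an oversized sequence alone in its own batch is a choice made for this example.

```python
from typing import Iterable, Iterator, List


def batches_by_residues(sequences: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group sequences so that each batch holds at most batch_size residues.

    A sequence longer than batch_size is emitted alone in its own batch.
    """
    batch: List[str] = []
    residues = 0
    for sequence in sequences:
        # Close the current batch if adding this sequence would exceed the budget.
        if batch and residues + len(sequence) > batch_size:
            yield batch
            batch, residues = [], 0
        batch.append(sequence)
        residues += len(sequence)
    if batch:
        yield batch


print(list(batches_by_residues(["MKT", "AG", "MKTAYIAK", "G"], batch_size=6)))
# [['MKT', 'AG'], ['MKTAYIAK'], ['G']]
```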

embedding_dimension: ClassVar[int]¶

name: ClassVar[str]¶

necessary_directories: ClassVar[List[str]] = []¶

necessary_files: ClassVar[List[str]] = []¶

number_of_layers: ClassVar[int]¶

abstract static reduce_per_protein(embedding: numpy.ndarray) → numpy.ndarray[source]¶

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters: embedding – the embedding
Returns: A fixed size embedding (a vector of size N, where N is fixed)

class bio_embeddings.embed.FastTextEmbedder(**kwargs)[source]¶

__init__(**kwargs)[source]¶

Parameters: model_file – path of model file. If not supplied, will be downloaded.

embed(sequence: str) → numpy.ndarray[source]¶

Returns embedding for one sequence.

Parameters: sequence – Valid amino acid sequence as String
Returns: An embedding of the sequence.

embedding_dimension: ClassVar[int] = 512¶

name: ClassVar[str] = 'fasttext'¶

necessary_files: ClassVar[List[str]] = ['model_file']¶

number_of_layers: ClassVar[int] = 1¶

static reduce_per_protein(embedding: numpy.ndarray) → numpy.ndarray[source]¶

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters: embedding – the embedding
Returns: A fixed size embedding (a vector of size N, where N is fixed)

class bio_embeddings.embed.GloveEmbedder(**kwargs)[source]¶

__init__(**kwargs)[source]¶

Parameters: model_file – path of model file. If not supplied, will be downloaded.

embed(sequence: str) → numpy.ndarray[source]¶

Returns embedding for one sequence.

Parameters: sequence – Valid amino acid sequence as String
Returns: An embedding of the sequence.

embedding_dimension: ClassVar[int] = 512¶

name: ClassVar[str] = 'glove'¶

necessary_files: ClassVar[List[str]] = ['model_file']¶

number_of_layers: ClassVar[int] = 1¶

static reduce_per_protein(embedding: numpy.ndarray) → numpy.ndarray[source]¶

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters: embedding – the embedding
Returns: A fixed size embedding (a vector of size N, where N is fixed)

class bio_embeddings.embed.OneHotEncodingEmbedder(device: Union[None, str, torch.device] = None, **kwargs)[source]¶

Baseline embedder: One hot encoding as per-residue embedding, amino acid composition for per-protein

This embedder is meant to be used as a naive baseline for comparing different types of inputs or training methods.

While options such as device aren’t used, you may still pass them for consistency.

embed(sequence: str) → numpy.ndarray[source]¶

Returns embedding for one sequence.

Parameters: sequence – Valid amino acid sequence as String
Returns: An embedding of the sequence.

embedding_dimension: ClassVar[int] = 21¶

name: ClassVar[str] = 'one_hot_encoding'¶

number_of_layers: ClassVar[int] = 1¶

static reduce_per_protein(embedding: numpy.ndarray) → numpy.ndarray[source]¶

This returns the amino acid composition of the sequence as a vector.
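
To make the baseline concrete, here is a minimal sketch of one-hot encoding and amino acid composition in plain Python. The 21-letter alphabet (20 standard amino acids plus X for everything else) and its order are assumptions chosen to match embedding_dimension = 21; the library may order or handle unknown residues differently. one_hot and composition are hypothetical names, and lists stand in for numpy arrays.

```python
from typing import List

# Assumed alphabet: the 20 standard amino acids plus "X" for anything else.
# The order used by the library may differ.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"


def one_hot(sequence: str) -> List[List[int]]:
    """Per-residue embedding: one row per residue, one column per letter."""
    rows = []
    for residue in sequence:
        index = ALPHABET.find(residue)
        if index == -1:
            index = ALPHABET.index("X")  # unknown residues share the last column
        rows.append([int(i == index) for i in range(len(ALPHABET))])
    return rows


def composition(embedding: List[List[int]]) -> List[float]:
    """Per-protein embedding: fraction of residues of each type."""
    length = len(embedding)
    return [sum(column) / length for column in zip(*embedding)]


per_residue = one_hot("AAGB")  # "B" is not in the alphabet and maps to "X"
per_protein = composition(per_residue)
print(len(per_residue), len(per_residue[0]))  # 4 21
print(per_protein[0], per_protein[5], per_protein[20])  # 0.5 0.25 0.25
```

Averaging the one-hot rows yields the fraction of each residue type, which is one natural reading of "amino acid composition".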

class bio_embeddings.embed.PLUSRNNEmbedder(device: Union[None, str, torch.device] = None, **kwargs)[source]¶

PLUS RNN Embedder

Min, Seonwoo, et al. “Pre-Training of Deep Bidirectional Protein Sequence Representations with Structural Information.” arXiv preprint arXiv:1912.05625 (2019). https://arxiv.org/abs/1912.05625

__init__(device: Union[None, str, torch.device] = None, **kwargs)[source]¶

Initializer accepts location of a pre-trained model and options

embed(sequence: str) → numpy.ndarray[source]¶

Returns embedding for one sequence.

Parameters: sequence – Valid amino acid sequence as String
Returns: An embedding of the sequence.

embed_batch(batch: List[str]) → Generator[numpy.ndarray, None, None][source]¶

Computes the embeddings from all sequences in the batch

The provided implementation is a dummy implementation that should be overridden with the appropriate batching method for the model.

embedding_dimension: ClassVar[int] = 1024¶

name: ClassVar[str] = 'plus_rnn'¶

necessary_files: ClassVar[List[str]] = ['model_file']¶

number_of_layers: ClassVar[int] = 1¶

static reduce_per_protein(embedding: numpy.ndarray) → numpy.ndarray[source]¶

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters: embedding – the embedding
Returns: A fixed size embedding (a vector of size N, where N is fixed)

class bio_embeddings.embed.ProtTransAlbertBFDEmbedder(**kwargs)[source]¶

ProtTrans-Albert-BFD Embedder (ProtAlbert-BFD)

Elnaggar, Ahmed, et al. “ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing.” arXiv preprint arXiv:2007.06225 (2020). https://arxiv.org/abs/2007.06225

__init__(**kwargs)[source]¶

Initialize Albert embedder.

Parameters

model_directory –
half_precision_model –

embedding_dimension: ClassVar[int] = 4096¶

name: ClassVar[str] = 'prottrans_albert_bfd'¶

number_of_layers: ClassVar[int] = 1¶

class bio_embeddings.embed.ProtTransBertBFDEmbedder(**kwargs)[source]¶

ProtTrans-Bert-BFD Embedder (ProtBert-BFD)

__init__(**kwargs)[source]¶

Initialize Bert embedder.

Parameters

model_directory –
half_precision_model –

embedding_dimension: ClassVar[int] = 1024¶

name: ClassVar[str] = 'prottrans_bert_bfd'¶

number_of_layers: ClassVar[int] = 1¶

class bio_embeddings.embed.ProtTransT5BFDEmbedder(**kwargs)[source]¶

Encoder of the ProtTrans T5 model trained on BFD. Consider using ProtTransT5XLU50Embedder instead.

We recommend setting half_model=True, which on the tested GPU (Quadro RTX 3000) reduces memory consumption from 12GB to 7GB while the effect in benchmarks is negligible (±0.1 percentage points in different sets, generally below standard error).

name: ClassVar[str] = 'prottrans_t5_bfd'¶
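
The memory saving from half precision comes from storing each weight in 2 bytes instead of 4. The arithmetic below is a back-of-the-envelope sketch with a hypothetical parameter count chosen for round numbers; it is not a measurement of this model, and weights_size_gb is a name introduced here.

```python
def weights_size_gb(parameter_count: int, bytes_per_parameter: int) -> float:
    """Memory needed for the weights alone, in GB (10**9 bytes)."""
    return parameter_count * bytes_per_parameter / 10**9


# Hypothetical parameter count, chosen for round numbers; not a measured value.
parameters = 1_000_000_000

full = weights_size_gb(parameters, 4)  # 32-bit floats: 4 bytes each
half = weights_size_gb(parameters, 2)  # 16-bit floats: 2 bytes each
print(full, half)  # 4.0 2.0
```

The reported drop from 12GB to 7GB is smaller than a clean halving. That would be consistent with part of the memory going to things other than the weights, though this reading is a guess rather than something the benchmark states.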

class bio_embeddings.embed.ProtTransT5UniRef50Embedder(**kwargs)[source]¶

Encoder of the ProtTrans T5 model trained on BFD and finetuned on UniRef 50. Consider using ProtTransT5XLU50Embedder instead.

name: ClassVar[str] = 'prottrans_t5_uniref50'¶

class bio_embeddings.embed.ProtTransT5XLU50Embedder(**kwargs)[source]¶

Encoder of the ProtTrans T5 model trained on BFD and finetuned on UniRef 50.

name: ClassVar[str] = 'prottrans_t5_xl_u50'¶

class bio_embeddings.embed.ProtTransXLNetUniRef100Embedder(**kwargs)[source]¶

ProtTrans-XLNet-UniRef100 Embedder (ProtXLNet)

__init__(**kwargs)[source]¶

Initialize XLNet embedder.

Parameters: model_directory –

embed(sequence: str) → numpy.ndarray[source]¶

Returns embedding for one sequence.

Parameters: sequence – Valid amino acid sequence as String
Returns: An embedding of the sequence.

embed_batch(batch: List[str]) → Generator[numpy.ndarray, None, None][source]¶

Computes the embeddings from all sequences in the batch

The provided implementation is a dummy implementation that should be overridden with the appropriate batching method for the model.

embedding_dimension: ClassVar[int] = 1024¶

name: ClassVar[str] = 'prottrans_xlnet_uniref100'¶

necessary_directories: ClassVar[List[str]] = ['model_directory']¶

number_of_layers: ClassVar[int] = 1¶

static reduce_per_protein(embedding)[source]¶

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters: embedding – the embedding
Returns: A fixed size embedding (a vector of size N, where N is fixed)

class bio_embeddings.embed.SeqVecEmbedder(warmup_rounds: int = 4, **kwargs)[source]¶

SeqVec Embedder

Heinzinger, Michael, et al. “Modeling aspects of the language of life through transfer-learning protein sequences.” BMC bioinformatics 20.1 (2019): 723. https://doi.org/10.1186/s12859-019-3220-8

__init__(warmup_rounds: int = 4, **kwargs)[source]¶

Initialize the ELMo embedder. Paths of files and other settings can be passed as keyword arguments.

Parameters

warmup_rounds – A sample sequence will be embedded this many times to work around ELMo’s non-determinism (https://github.com/allenai/allennlp/blob/v0.9.0/tutorials/how_to/elmo.md#notes-on-statefulness-and-non-determinism)
weights_file – path of weights file
options_file – path of options file
model_directory – Alternative of weights_file/options_file
max_amino_acids – Maximum number of amino acids to include in embed_many batches. Default: 15k AA

embed(sequence: str) → numpy.ndarray[source]¶

Returns embedding for one sequence.

Parameters: sequence – Valid amino acid sequence as String
Returns: An embedding of the sequence.

embedding_dimension: ClassVar[int] = 1024¶

name: ClassVar[str] = 'seqvec'¶

necessary_files: ClassVar[List[str]] = ['weights_file', 'options_file']¶

number_of_layers: ClassVar[int] = 3¶

static reduce_per_protein(embedding)[source]¶

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters: embedding – the embedding
Returns: A fixed size embedding (a vector of size N, where N is fixed)
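
With number_of_layers = 3, the SeqVec embedding carries a layer axis in addition to residues and features, so a per-protein reduction has to collapse two axes. The sketch below sums over layers and then averages over residues. That pooling order is an assumption based on common practice for ELMo-style outputs, not something this page confirms, and reduce_layers_then_residues is a hypothetical name. Nested lists stand in for a numpy array of shape (3, L, 1024).

```python
from typing import List

Layer = List[List[float]]  # one layer: L residues x D features


def reduce_layers_then_residues(embedding: List[Layer]) -> List[float]:
    """Sum over layers, then average over residues.

    Assumption: the library's reduce_per_protein may pool differently.
    """
    length = len(embedding[0])
    dimension = len(embedding[0][0])
    pooled = []
    for d in range(dimension):
        total = sum(layer[r][d] for layer in embedding for r in range(length))
        pooled.append(total / length)
    return pooled


# Toy input: 3 layers, 2 residues, 2 features (the real model uses 1024 features).
toy = [
    [[1.0, 0.0], [3.0, 0.0]],
    [[0.0, 2.0], [0.0, 2.0]],
    [[1.0, 1.0], [1.0, 1.0]],
]
print(reduce_layers_then_residues(toy))  # [3.0, 3.0]
```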

class bio_embeddings.embed.UniRepEmbedder(device: Union[None, str, torch.device] = None, **kwargs)[source]¶

UniRep Embedder

Alley, E.C., Khimulya, G., Biswas, S. et al. Unified rational protein engineering with sequence-based deep representation learning. Nat Methods 16, 1315–1322 (2019). https://doi.org/10.1038/s41592-019-0598-1

We use a reimplementation of unirep:

Ma, Eric, and Arkadij Kummer. “Reimplementing Unirep in JAX.” bioRxiv (2020). https://doi.org/10.1101/2020.05.11.088344

__init__(device: Union[None, str, torch.device] = None, **kwargs)[source]¶

Initializer accepts location of a pre-trained model and options

embed(sequence: str) → numpy.ndarray[source]¶

Returns embedding for one sequence.

Parameters: sequence – Valid amino acid sequence as String
Returns: An embedding of the sequence.

embedding_dimension: ClassVar[int] = 1900¶

name: ClassVar[str] = 'unirep'¶

number_of_layers: ClassVar[int] = 1¶

static reduce_per_protein(embedding: numpy.ndarray) → numpy.ndarray[source]¶

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters: embedding – the embedding
Returns: A fixed size embedding (a vector of size N, where N is fixed)

class bio_embeddings.embed.Word2VecEmbedder(**kwargs)[source]¶

__init__(**kwargs)[source]¶

Parameters: model_file – path of model file. If not supplied, will be downloaded.

embed(sequence: str) → numpy.ndarray[source]¶

Returns embedding for one sequence.

Parameters: sequence – Valid amino acid sequence as String
Returns: An embedding of the sequence.

embedding_dimension: ClassVar[int] = 512¶

name: ClassVar[str] = 'word2vec'¶

necessary_files: ClassVar[List[str]] = ['model_file']¶

number_of_layers: ClassVar[int] = 1¶

static reduce_per_protein(embedding: numpy.ndarray) → numpy.ndarray[source]¶

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters: embedding – the embedding
Returns: A fixed size embedding (a vector of size N, where N is fixed)