bio_embeddings.embed

Language models to translate amino acid sequences into vector representations

All language models implement the EmbedderInterface. You can embed a single sequence with EmbedderInterface.embed() or a list of sequences with EmbedderInterface.embed_many(). All models except CPCProt and UniRep generate per-residue embeddings, which you can summarize into a fixed-size per-protein embedding by calling EmbedderInterface.reduce_per_protein(). CPCProt only generates a per-protein embedding (reduce_per_protein does nothing), while UniRep includes a start token, so its embedding is one position longer than the protein. UniRep, GloVe, fastText and word2vec only support the CPU.
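To make the per-residue versus per-protein distinction concrete, here is a minimal sketch that reduces a variable-length per-residue embedding to a fixed-size vector. It assumes mean pooling over the residue axis, which is a common choice but not necessarily what every embedder's reduce_per_protein() does. The function mean_pool and the toy data are hypothetical and use plain lists instead of numpy arrays.

```python
from typing import List

Matrix = List[List[float]]  # per-residue embedding: one row per residue


def mean_pool(per_residue: Matrix) -> List[float]:
    """Average over the residue axis to get one fixed-size vector.

    Assumption: mean pooling; a real embedder may reduce differently.
    """
    if not per_residue:
        raise ValueError("embedding must contain at least one residue")
    length = len(per_residue)
    dim = len(per_residue[0])
    return [sum(row[d] for row in per_residue) / length for d in range(dim)]


# Toy per-residue embeddings with 3 features for proteins of length 2 and 4.
short = [[1.0, 0.0, 2.0], [3.0, 0.0, 4.0]]
longer = [[1.0, 1.0, 1.0]] * 4

print(mean_pool(short))   # [2.0, 0.0, 3.0]
print(mean_pool(longer))  # [1.0, 1.0, 1.0]
```

Both results have three entries even though the input proteins differ in length, which is the property a per-protein embedding needs.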

OneHotEncodingEmbedder offers a naive baseline to compare the language model embeddings against, with one hot encoding as per-residue and amino acid composition as per-protein embedding. It accepts keyword arguments but ignores them since it does not do any notable computation.
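The baseline can be sketched in a few lines. The alphabet below (20 standard amino acids plus "X" for everything else) is an assumption chosen to match the documented embedding dimension of 21; the actual symbol set and ordering used by OneHotEncodingEmbedder may differ. The helper names one_hot and composition are hypothetical.

```python
from typing import List

# Assumed alphabet: 20 standard amino acids plus "X" for anything else.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"


def one_hot(sequence: str) -> List[List[int]]:
    """Per-residue encoding: one row per residue, one column per symbol."""
    rows = []
    for residue in sequence.upper():
        index = ALPHABET.find(residue)
        if index == -1:
            index = ALPHABET.index("X")  # unknown residues map to "X"
        rows.append([1 if i == index else 0 for i in range(len(ALPHABET))])
    return rows


def composition(encoding: List[List[int]]) -> List[float]:
    """Per-protein summary: fraction of residues of each type."""
    length = len(encoding)
    return [sum(column) / length for column in zip(*encoding)]


encoded = one_hot("AACB")
print(len(encoded), len(encoded[0]))  # 4 21
print(composition(encoded)[:3])  # [0.5, 0.25, 0.0]
```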

Instead of using bio_embeddings[all], it’s possible to only install some embedders by selecting specific extras:

  • allennlp: seqvec

  • transformers: prottrans_albert_bfd, prottrans_bert_bfd, prottrans_xlnet_uniref100, prottrans_t5_bfd, prottrans_t5_uniref50, prottrans_t5_xl_u50

  • jax-unirep: unirep

  • esm: esm

  • cpcprot: cpcprot

  • plus: plus_rnn

Model sizes

The disk size represents the size of the unzipped models or the combination of all files necessary for a particular embedder. The GPU and CPU sizes only cover loading the model into GPU memory (VRAM) or main RAM; they do not include the memory required to do any computation. They were measured for one specific set of hardware and software (Quadro RTX 8000, CUDA 11.1, torch 1.7.1, x86 64-bit Ubuntu 18.04) and will vary for different setups.

Model                       Disk size (GB)   GPU size (GB)   CPU size (GB)
bepler                      0.1              1.4             0.2
bert_from_publication       0.008            1.1             0.006
cpcprot                     0.007            1.1             0.01
deepblast                   0.4              1.4             0.26
esm                         6.3              3.9             2.7
esm1b                       7.3              3.8             2.6
fasttext                    0.05             n/a             0.03
glove                       0.06             n/a             0.03
one_hot_encoding            n/a              n/a             n/a
pb_tucker                   0.009            1.0             0.02
plus_rnn                    0.06             1.2             0.1
prottrans_albert_bfd        0.9              2.0             1.8
prottrans_bert_bfd          1.6              2.8             3.4
prottrans_t5_bfd            7.2              5.9             16.1
prottrans_t5_uniref50       7.2              5.9             16.1
prottrans_t5_xl_u50         7.2              5.9             16.1
prottrans_xlnet_uniref100   1.6              2.7             3.3
seqvec                      0.4              1.6             0.5
seqvec_from_publication     0.004            1.1             0.006
unirep                      n/a              n/a             0.2
word2vec                    0.07             n/a             0.06
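As a rough illustration of how these numbers might be used, the sketch below filters a few models by their listed GPU size. The GPU sizes are copied from the table above; the helper models_that_load and the headroom_gb parameter are hypothetical. Because the sizes only cover loading the model, real workloads need extra memory for computation, so treat the result as a lower bound rather than a guarantee.

```python
from typing import Dict, List, Optional

# GPU sizes in GB, copied from the table above (None where the table says n/a).
GPU_SIZE_GB: Dict[str, Optional[float]] = {
    "bepler": 1.4,
    "esm1b": 3.8,
    "prottrans_bert_bfd": 2.8,
    "prottrans_t5_xl_u50": 5.9,
    "seqvec": 1.6,
    "unirep": None,
}


def models_that_load(vram_gb: float, headroom_gb: float = 0.0) -> List[str]:
    """Names of models whose listed GPU size fits within the budget.

    headroom_gb is a hypothetical reserve for the computation itself.
    """
    budget = vram_gb - headroom_gb
    return sorted(
        name for name, size in GPU_SIZE_GB.items()
        if size is not None and size <= budget
    )


print(models_that_load(4.0))
# ['bepler', 'esm1b', 'prottrans_bert_bfd', 'seqvec']
print(models_that_load(4.0, headroom_gb=2.0))
# ['bepler', 'seqvec']
```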

class bio_embeddings.embed.BeplerEmbedder(device: Union[None, str, torch.device] = None, **kwargs)

Bepler Embedder

Bepler, Tristan, and Bonnie Berger. “Learning protein sequence embeddings using information from structure.” arXiv preprint arXiv:1902.08661 (2019).

__init__(device: Union[None, str, torch.device] = None, **kwargs)

Initializer accepts location of a pre-trained model and options

embed(sequence: str) -> numpy.ndarray

Returns embedding for one sequence.

Parameters

sequence – Valid amino acid sequence as String

Returns

An embedding of the sequence.

embedding_dimension: ClassVar[int] = 121
name: ClassVar[str] = 'bepler'
necessary_files: ClassVar[List[str]] = ['model_file']
number_of_layers: ClassVar[int] = 1
static reduce_per_protein(embedding: numpy.ndarray) -> numpy.ndarray

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters

embedding – the embedding

Returns

A fixed size embedding (a vector of size N, where N is fixed)

class bio_embeddings.embed.CPCProtEmbedder(device: Union[None, str, torch.device] = None, **kwargs)

CPCProt Embedder

Lu, Amy X., et al. “Self-supervised contrastive learning of protein representations by mutual information maximization.” bioRxiv (2020). https://doi.org/10.1101/2020.09.04.283929

__init__(device: Union[None, str, torch.device] = None, **kwargs)

Initializer accepts location of a pre-trained model and options

embed(sequence: str) -> numpy.ndarray

Returns embedding for one sequence.

Parameters

sequence – Valid amino acid sequence as String

Returns

An embedding of the sequence.

embed_batch(batch: List[str]) -> Generator[numpy.ndarray, None, None]

See https://github.com/amyxlu/CPCProt/blob/df1ad1118544ed349b5e711207660a7c205b3128/embed_fasta.py

embedding_dimension: ClassVar[int] = 512
name: ClassVar[str] = 'cpcprot'
necessary_files: ClassVar[List[str]] = ['model_file']
number_of_layers: ClassVar[int] = 1
static reduce_per_protein(embedding: numpy.ndarray) -> numpy.ndarray

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters

embedding – the embedding

Returns

A fixed size embedding (a vector of size N, where N is fixed)

class bio_embeddings.embed.ESM1bEmbedder(device: Union[None, str, torch.device] = None, **kwargs)

ESM-1b Embedder (Note: This is not the original ESM)

Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences

Rives, Alexander, et al. “Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences.” bioRxiv (2019): 622803. https://doi.org/10.1101/622803

name: ClassVar[str] = 'esm1b'

class bio_embeddings.embed.ESMEmbedder(device: Union[None, str, torch.device] = None, **kwargs)

ESM Embedder (Note: This is not ESM-1b)

Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences

Rives, Alexander, et al. “Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences.” bioRxiv (2019): 622803. https://doi.org/10.1101/622803

name: ClassVar[str] = 'esm'

class bio_embeddings.embed.EmbedderInterface(device: Union[None, str, torch.device] = None, **kwargs)

__init__(device: Union[None, str, torch.device] = None, **kwargs)

Initializer accepts location of a pre-trained model and options

abstract embed(sequence: str) -> numpy.ndarray

Returns embedding for one sequence.

Parameters

sequence – Valid amino acid sequence as String

Returns

An embedding of the sequence.

embed_batch(batch: List[str]) -> Generator[numpy.ndarray, None, None]

Computes the embeddings from all sequences in the batch

The provided implementation is a dummy implementation that should be overridden with the appropriate batching method for the model.
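The override pattern described here can be sketched with a toy class hierarchy. This is not the library's code: ToyEmbedderBase and LengthEmbedder are hypothetical stand-ins, embeddings are plain lists rather than numpy arrays, and the fallback shown (embedding one sequence at a time) is only an assumption about what a dummy implementation could look like.

```python
from abc import ABC, abstractmethod
from typing import Iterator, List


class ToyEmbedderBase(ABC):
    """Hypothetical stand-in for an embedder base class."""

    @abstractmethod
    def embed(self, sequence: str) -> List[float]:
        """Embed one sequence."""

    def embed_batch(self, batch: List[str]) -> Iterator[List[float]]:
        # Fallback: no real batching, just one sequence at a time.
        for sequence in batch:
            yield self.embed(sequence)


class LengthEmbedder(ToyEmbedderBase):
    """Toy model whose 'embedding' is just the sequence length."""

    def __init__(self) -> None:
        self.batch_calls = 0

    def embed(self, sequence: str) -> List[float]:
        return [float(len(sequence))]

    def embed_batch(self, batch: List[str]) -> Iterator[List[float]]:
        # Override: handle the whole batch in one pass.
        self.batch_calls += 1
        lengths = [float(len(sequence)) for sequence in batch]
        for value in lengths:
            yield [value]


embedder = LengthEmbedder()
print(list(embedder.embed_batch(["MKT", "SEQVENCE"])))  # [[3.0], [8.0]]
```

A real model would replace the body of the overriding embed_batch with padded, vectorized inference; the point of the sketch is only that callers keep using the same method either way.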

embed_many(sequences: Iterable[str], batch_size: Optional[int] = None) -> Generator[numpy.ndarray, None, None]

Returns embeddings for multiple sequences.

Parameters
  • sequences – List of proteins as AA strings

  • batch_size – For embedders that profit from batching, this is the maximum number of amino acids (AA) per batch

Returns

A generator yielding the embeddings of the sequences.
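Since batch_size counts amino acids rather than sequences, batches can hold very different numbers of proteins. The following sketch shows one way such grouping could work. It is an assumption for illustration, not the library's implementation; batch_by_residues is a hypothetical helper, and the choice to emit over-long sequences alone is likewise assumed.

```python
from typing import Iterable, Iterator, List


def batch_by_residues(sequences: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group sequences so each batch holds at most batch_size residues.

    A sequence longer than batch_size is emitted alone in its own batch.
    """
    batch: List[str] = []
    residues = 0
    for sequence in sequences:
        # Close the current batch if adding this sequence would exceed the limit.
        if batch and residues + len(sequence) > batch_size:
            yield batch
            batch, residues = [], 0
        batch.append(sequence)
        residues += len(sequence)
    if batch:
        yield batch


proteins = ["MKT", "SEQWENCE", "AA", "PROTEINPROTEIN", "G"]
for group in batch_by_residues(proteins, batch_size=10):
    print(group)
# ['MKT']
# ['SEQWENCE', 'AA']
# ['PROTEINPROTEIN']
# ['G']
```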

embedding_dimension: ClassVar[int]
name: ClassVar[str]
necessary_directories: ClassVar[List[str]] = []
necessary_files: ClassVar[List[str]] = []
number_of_layers: ClassVar[int]
abstract static reduce_per_protein(embedding: numpy.ndarray) -> numpy.ndarray

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters

embedding – the embedding

Returns

A fixed size embedding (a vector of size N, where N is fixed)

class bio_embeddings.embed.FastTextEmbedder(**kwargs)

__init__(**kwargs)

Parameters

model_file – path of model file. If not supplied, will be downloaded.

embed(sequence: str) -> numpy.ndarray

Returns embedding for one sequence.

Parameters

sequence – Valid amino acid sequence as String

Returns

An embedding of the sequence.

embedding_dimension: ClassVar[int] = 512
name: ClassVar[str] = 'fasttext'
necessary_files: ClassVar[List[str]] = ['model_file']
number_of_layers: ClassVar[int] = 1
static reduce_per_protein(embedding: numpy.ndarray) -> numpy.ndarray

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters

embedding – the embedding

Returns

A fixed size embedding (a vector of size N, where N is fixed)

class bio_embeddings.embed.GloveEmbedder(**kwargs)

__init__(**kwargs)

Parameters

model_file – path of model file. If not supplied, will be downloaded.

embed(sequence: str) -> numpy.ndarray

Returns embedding for one sequence.

Parameters

sequence – Valid amino acid sequence as String

Returns

An embedding of the sequence.

embedding_dimension: ClassVar[int] = 512
name: ClassVar[str] = 'glove'
necessary_files: ClassVar[List[str]] = ['model_file']
number_of_layers: ClassVar[int] = 1
static reduce_per_protein(embedding: numpy.ndarray) -> numpy.ndarray

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters

embedding – the embedding

Returns

A fixed size embedding (a vector of size N, where N is fixed)

class bio_embeddings.embed.OneHotEncodingEmbedder(device: Union[None, str, torch.device] = None, **kwargs)

Baseline embedder: One hot encoding as per-residue embedding, amino acid composition for per-protein

This embedder is meant to be used as a naive baseline for comparing different types of inputs or training methods.

While options such as device aren’t used, you may still pass them for consistency.

embed(sequence: str) -> numpy.ndarray

Returns embedding for one sequence.

Parameters

sequence – Valid amino acid sequence as String

Returns

An embedding of the sequence.

embedding_dimension: ClassVar[int] = 21
name: ClassVar[str] = 'one_hot_encoding'
number_of_layers: ClassVar[int] = 1
static reduce_per_protein(embedding: numpy.ndarray) -> numpy.ndarray

This returns the amino acid composition of the sequence as vector

class bio_embeddings.embed.PLUSRNNEmbedder(device: Union[None, str, torch.device] = None, **kwargs)

PLUS RNN Embedder

Min, Seonwoo, Seunghyun Park, Siwon Kim, Hyun-Soo Choi, and Sungroh Yoon. “Pre-Training of Deep Bidirectional Protein Sequence Representations with Structural Information.” arXiv preprint arXiv:1912.05625 (2019). https://arxiv.org/abs/1912.05625

__init__(device: Union[None, str, torch.device] = None, **kwargs)

Initializer accepts location of a pre-trained model and options

embed(sequence: str) -> numpy.ndarray

Returns embedding for one sequence.

Parameters

sequence – Valid amino acid sequence as String

Returns

An embedding of the sequence.

embed_batch(batch: List[str]) -> Generator[numpy.ndarray, None, None]

Computes the embeddings from all sequences in the batch

The provided implementation is a dummy implementation that should be overridden with the appropriate batching method for the model.

embedding_dimension: ClassVar[int] = 1024
name: ClassVar[str] = 'plus_rnn'
necessary_files: ClassVar[List[str]] = ['model_file']
number_of_layers: ClassVar[int] = 1
static reduce_per_protein(embedding: numpy.ndarray) -> numpy.ndarray

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters

embedding – the embedding

Returns

A fixed size embedding (a vector of size N, where N is fixed)

class bio_embeddings.embed.ProtTransAlbertBFDEmbedder(**kwargs)

ProtTrans-Albert-BFD Embedder (ProtAlbert-BFD)

Elnaggar, Ahmed, et al. “ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing.” arXiv preprint arXiv:2007.06225 (2020). https://arxiv.org/abs/2007.06225

__init__(**kwargs)

Initialize Albert embedder.

Parameters
  • model_directory

  • half_precision_model

embedding_dimension: ClassVar[int] = 4096
name: ClassVar[str] = 'prottrans_albert_bfd'
number_of_layers: ClassVar[int] = 1

class bio_embeddings.embed.ProtTransBertBFDEmbedder(**kwargs)

ProtTrans-Bert-BFD Embedder (ProtBert-BFD)

Elnaggar, Ahmed, et al. “ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing.” arXiv preprint arXiv:2007.06225 (2020). https://arxiv.org/abs/2007.06225

__init__(**kwargs)

Initialize Bert embedder.

Parameters
  • model_directory

  • half_precision_model

embedding_dimension: ClassVar[int] = 1024
name: ClassVar[str] = 'prottrans_bert_bfd'
number_of_layers: ClassVar[int] = 1

class bio_embeddings.embed.ProtTransT5BFDEmbedder(**kwargs)

Encoder of the ProtTrans T5 model trained on BFD. Consider using ProtTransT5XLU50Embedder instead.

We recommend setting half_model=True, which on the tested GPU (Quadro RTX 3000) reduces memory consumption from 12GB to 7GB while the effect in benchmarks is negligible (±0.1 percentage points in different sets, generally below standard error).

name: ClassVar[str] = 'prottrans_t5_bfd'

class bio_embeddings.embed.ProtTransT5UniRef50Embedder(**kwargs)

Encoder of the ProtTrans T5 model trained on BFD and finetuned on UniRef50. Consider using ProtTransT5XLU50Embedder instead.

We recommend setting half_model=True, which on the tested GPU (Quadro RTX 3000) reduces memory consumption from 12GB to 7GB while the effect in benchmarks is negligible (±0.1 percentage points in different sets, generally below standard error).

name: ClassVar[str] = 'prottrans_t5_uniref50'

class bio_embeddings.embed.ProtTransT5XLU50Embedder(**kwargs)

Encoder of the ProtTrans T5 model trained on BFD and finetuned on UniRef50.

We recommend setting half_model=True, which on the tested GPU (Quadro RTX 3000) reduces memory consumption from 12GB to 7GB while the effect in benchmarks is negligible (±0.1 percentage points in different sets, generally below standard error).
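The trade-off behind half precision can be illustrated with the standard library alone. The sketch below compares the storage size of single- and half-precision floats and shows the rounding error for one arbitrary example value. It says nothing about how the embedder applies half precision internally; to_half and the value 0.1234 are hypothetical.

```python
import struct

# Bytes needed to store one weight in each format.
single_bytes = struct.calcsize("f")  # IEEE 754 single precision
half_bytes = struct.calcsize("e")    # IEEE 754 half precision
print(single_bytes, half_bytes)  # 4 2


def to_half(value: float) -> float:
    """Round a Python float to the nearest half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


weight = 0.1234  # arbitrary example value, not a real model weight
rounded = to_half(weight)
print(abs(rounded - weight) < 1e-3)  # True
```

Each weight takes half the bytes, while the rounding error for values of this magnitude stays small.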

name: ClassVar[str] = 'prottrans_t5_xl_u50'

class bio_embeddings.embed.ProtTransXLNetUniRef100Embedder(**kwargs)

ProtTrans-XLNet-UniRef100 Embedder (ProtXLNet)

Elnaggar, Ahmed, et al. “ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing.” arXiv preprint arXiv:2007.06225 (2020). https://arxiv.org/abs/2007.06225

__init__(**kwargs)

Initialize XLNet embedder.

Parameters

model_directory

embed(sequence: str) -> numpy.ndarray

Returns embedding for one sequence.

Parameters

sequence – Valid amino acid sequence as String

Returns

An embedding of the sequence.

embed_batch(batch: List[str]) -> Generator[numpy.ndarray, None, None]

Computes the embeddings from all sequences in the batch

The provided implementation is a dummy implementation that should be overridden with the appropriate batching method for the model.

embedding_dimension: ClassVar[int] = 1024
name: ClassVar[str] = 'prottrans_xlnet_uniref100'
necessary_directories: ClassVar[List[str]] = ['model_directory']
number_of_layers: ClassVar[int] = 1
static reduce_per_protein(embedding)

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters

embedding – the embedding

Returns

A fixed size embedding (a vector of size N, where N is fixed)

class bio_embeddings.embed.SeqVecEmbedder(warmup_rounds: int = 4, **kwargs)

SeqVec Embedder

Heinzinger, Michael, et al. “Modeling aspects of the language of life through transfer-learning protein sequences.” BMC bioinformatics 20.1 (2019): 723. https://doi.org/10.1186/s12859-019-3220-8

__init__(warmup_rounds: int = 4, **kwargs)

Initialize Elmo embedder. Can define non-positional arguments for paths of files and other settings.

Parameters
embed(sequence: str) -> numpy.ndarray

Returns embedding for one sequence.

Parameters

sequence – Valid amino acid sequence as String

Returns

An embedding of the sequence.

embedding_dimension: ClassVar[int] = 1024
name: ClassVar[str] = 'seqvec'
necessary_files: ClassVar[List[str]] = ['weights_file', 'options_file']
number_of_layers: ClassVar[int] = 3
static reduce_per_protein(embedding)

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters

embedding – the embedding

Returns

A fixed size embedding (a vector of size N, where N is fixed)
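Because SeqVec reports three layers, its per-residue output has an extra layer axis that a reduction must also collapse. The sketch below assumes an input of shape (layers, residues, features) and assumes the reduction sums over layers before averaging over residues; neither the shape nor the reduction is confirmed by this page, so check the source before relying on either. reduce_layers is a hypothetical helper operating on nested lists.

```python
from typing import List

Layer = List[List[float]]  # one row per residue


def reduce_layers(embedding: List[Layer]) -> List[float]:
    """Sum the layers element-wise, then average over residues.

    Assumption: this mirrors one plausible reduction, not a confirmed one.
    """
    n_residues = len(embedding[0])
    dim = len(embedding[0][0])
    summed = [
        [sum(layer[r][d] for layer in embedding) for d in range(dim)]
        for r in range(n_residues)
    ]
    return [sum(row[d] for row in summed) / n_residues for d in range(dim)]


# Toy input: 3 layers, 2 residues, 2 features (the real width would be 1024).
toy = [
    [[1.0, 0.0], [3.0, 0.0]],
    [[1.0, 1.0], [1.0, 1.0]],
    [[0.0, 2.0], [0.0, 4.0]],
]
print(reduce_layers(toy))  # [3.0, 4.0]
```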

class bio_embeddings.embed.UniRepEmbedder(device: Union[None, str, torch.device] = None, **kwargs)

UniRep Embedder

Alley, E.C., Khimulya, G., Biswas, S. et al. Unified rational protein engineering with sequence-based deep representation learning. Nat Methods 16, 1315–1322 (2019). https://doi.org/10.1038/s41592-019-0598-1

We use a reimplementation of unirep:

Ma, Eric, and Arkadij Kummer. “Reimplementing Unirep in JAX.” bioRxiv (2020). https://doi.org/10.1101/2020.05.11.088344

__init__(device: Union[None, str, torch.device] = None, **kwargs)

Initializer accepts location of a pre-trained model and options

embed(sequence: str) -> numpy.ndarray

Returns embedding for one sequence.

Parameters

sequence – Valid amino acid sequence as String

Returns

An embedding of the sequence.

embedding_dimension: ClassVar[int] = 1900
name: ClassVar[str] = 'unirep'
number_of_layers: ClassVar[int] = 1
static reduce_per_protein(embedding: numpy.ndarray) -> numpy.ndarray

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters

embedding – the embedding

Returns

A fixed size embedding (a vector of size N, where N is fixed)

class bio_embeddings.embed.Word2VecEmbedder(**kwargs)

__init__(**kwargs)

Parameters

model_file – path of model file. If not supplied, will be downloaded.

embed(sequence: str) -> numpy.ndarray

Returns embedding for one sequence.

Parameters

sequence – Valid amino acid sequence as String

Returns

An embedding of the sequence.

embedding_dimension: ClassVar[int] = 512
name: ClassVar[str] = 'word2vec'
necessary_files: ClassVar[List[str]] = ['model_file']
number_of_layers: ClassVar[int] = 1
static reduce_per_protein(embedding: numpy.ndarray) -> numpy.ndarray

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters

embedding – the embedding

Returns

A fixed size embedding (a vector of size N, where N is fixed)