bio_embeddings.embed¶

Language models to translate amino acid sequences into vector representations

All language models implement the EmbedderInterface. You can embed a single sequence with EmbedderInterface.embed() or a list of sequences with EmbedderInterface.embed_many(). All embedders except CPCProt and UniRep generate per-residue embeddings, which you can summarize into a fixed-size per-protein embedding by calling EmbedderInterface.reduce_per_protein(). CPCProt generates only a per-protein embedding (reduce_per_protein does nothing), while UniRep includes a start token, so its embedding is one longer than the protein.
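As a rough illustration of what summarizing a per-residue embedding into a fixed-size per-protein embedding can look like, the sketch below averages over the residue axis. This is an assumption made for illustration: this page does not state which reduction each embedder applies, and the real methods work on numpy.ndarray rather than the plain lists used here. mean_pool is a hypothetical helper, not part of bio_embeddings.

```python
from typing import List


def mean_pool(per_residue: List[List[float]]) -> List[float]:
    """Average a (length x dimension) embedding over the residue axis."""
    if not per_residue:
        raise ValueError("embedding must contain at least one residue")
    length = len(per_residue)
    dimension = len(per_residue[0])
    return [sum(row[d] for row in per_residue) / length for d in range(dimension)]


# Toy per-residue embeddings for two proteins of different length
# (2 and 3 residues, 2 features each). The values are made up.
short = [[1.0, 2.0], [3.0, 4.0]]
long = [[0.0, 0.0], [3.0, 6.0], [6.0, 0.0]]

# Both reduce to a vector whose size depends only on the feature dimension.
short_vector = mean_pool(short)
long_vector = mean_pool(long)
```

The point of the sketch is only that the output size no longer depends on the sequence length, which is what makes per-protein embeddings comparable across proteins.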

Instead of using bio_embeddings[all], it’s possible to only install some embedders by selecting specific extras:

allennlp: seqvec
transformers: prottrans_albert_bfd, prottrans_bert_bfd, prottrans_xlnet_uniref100, prottrans_t5_bfd
jax-unirep: unirep
esm: esm
cpcprot: cpcprot
plus: plus_rnn

class bio_embeddings.embed.BeplerEmbedder(device: Union[None, str, torch.device] = None, **kwargs)[source]¶

Bepler Embedder

Bepler, Tristan, and Bonnie Berger. “Learning protein sequence embeddings using information from structure.” arXiv preprint arXiv:1902.08661 (2019).

embed(sequence: str) → numpy.ndarray[source]¶

Returns embedding for one sequence.

Parameters: sequence – Valid amino acid sequence as String
Returns: An embedding of the sequence.

embedding_dimension: ClassVar[int] = 121¶

name: ClassVar[str] = 'bepler'¶

number_of_layers: ClassVar[int] = 1¶

static reduce_per_protein(embedding: numpy.ndarray) → numpy.ndarray[source]¶

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters: embedding – the embedding
Returns: A fixed size embedding (a vector of size N, where N is fixed)

class bio_embeddings.embed.CPCProtEmbedder(device: Union[None, str, torch.device] = None, **kwargs)[source]¶

CPCProt Embedder

Lu, Amy X., et al. “Self-supervised contrastive learning of protein representations by mutual information maximization.” bioRxiv (2020). https://doi.org/10.1101/2020.09.04.283929

embed(sequence: str) → numpy.ndarray[source]¶

Returns embedding for one sequence.

Parameters: sequence – Valid amino acid sequence as String
Returns: An embedding of the sequence.

embed_batch(batch: List[str]) → Generator[numpy.ndarray, None, None][source]¶

See https://github.com/amyxlu/CPCProt/blob/df1ad1118544ed349b5e711207660a7c205b3128/embed_fasta.py

embedding_dimension: ClassVar[int] = 512¶

name: ClassVar[str] = 'cpcprot'¶

number_of_layers: ClassVar[int] = 1¶

static reduce_per_protein(embedding: numpy.ndarray) → numpy.ndarray[source]¶

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters: embedding – the embedding
Returns: A fixed size embedding (a vector of size N, where N is fixed)

class bio_embeddings.embed.ESM1bEmbedder(device: Union[None, str, torch.device] = None, **kwargs)[source]¶

ESM-1b Embedder (Note: This is not the original ESM)

Rives, Alexander, et al. “Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences.” bioRxiv (2019): 622803. https://doi.org/10.1101/622803

name: ClassVar[str] = 'esm1b'¶

class bio_embeddings.embed.ESMEmbedder(device: Union[None, str, torch.device] = None, **kwargs)[source]¶

ESM Embedder (Note: This is not ESM-1b)

Rives, Alexander, et al. “Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences.” bioRxiv (2019): 622803. https://doi.org/10.1101/622803

name: ClassVar[str] = 'esm'¶

class bio_embeddings.embed.EmbedderInterface(device: Union[None, str, torch.device] = None, **kwargs)[source]¶

abstract embed(sequence: str) → numpy.ndarray[source]¶

Returns embedding for one sequence.

Parameters: sequence – Valid amino acid sequence as String
Returns: An embedding of the sequence.

embed_batch(batch: List[str]) → Generator[numpy.ndarray, None, None][source]¶

Computes the embeddings for all sequences in the batch.

The provided implementation is a dummy implementation that should be overridden with the appropriate batching method for the model.
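A dummy batching implementation of the kind described here could be sketched as a generator that embeds each sequence on its own. ToyEmbedder and its made-up single-feature embed() are hypothetical stand-ins, not bio_embeddings classes, and plain lists replace numpy.ndarray to keep the sketch self-contained.

```python
from typing import Generator, List


class ToyEmbedder:
    """Hypothetical stand-in for an embedder; not a bio_embeddings class."""

    def embed(self, sequence: str) -> List[List[float]]:
        # One row per residue with a single made-up feature (the character code).
        return [[float(ord(residue))] for residue in sequence]

    def embed_batch(
        self, batch: List[str]
    ) -> Generator[List[List[float]], None, None]:
        # Fallback without real batching: embed the sequences one at a time.
        for sequence in batch:
            yield self.embed(sequence)


embeddings = list(ToyEmbedder().embed_batch(["MK", "SEQ"]))
```

A model-specific subclass would replace the loop with a single padded forward pass over the whole batch.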

embed_many(sequences: Iterable[str], batch_size: Optional[int] = None) → Generator[numpy.ndarray, None, None][source]¶

Returns embeddings for multiple sequences.

Parameters

sequences – Iterable of proteins as amino acid (AA) strings
batch_size – For embedders that profit from batching, this is the maximum number of AA per batch

Returns

A generator yielding the embeddings of the sequences.
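Because batch_size counts amino acids rather than sequences, batches are filled by total residue count. The sketch below shows one plausible way to group sequences under such a budget. It is an assumption for illustration: this page does not describe how embed_many actually forms its batches, and batch_by_residues is a hypothetical helper.

```python
from typing import Iterable, Iterator, List


def batch_by_residues(sequences: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group sequences so that each batch holds at most batch_size residues.

    A sequence longer than batch_size is emitted as a batch of its own.
    """
    batch: List[str] = []
    residues = 0
    for sequence in sequences:
        # Close the current batch if adding this sequence would exceed the budget.
        if batch and residues + len(sequence) > batch_size:
            yield batch
            batch, residues = [], 0
        batch.append(sequence)
        residues += len(sequence)
    if batch:
        yield batch


batches = list(batch_by_residues(["MKT", "SE", "QVENCE", "A"], batch_size=6))
```

With a budget of 6 residues, the first two sequences (5 residues in total) share a batch, while the 6-residue sequence fills one by itself.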

embedding_dimension: ClassVar[int]¶

name: ClassVar[str]¶

number_of_layers: ClassVar[int]¶

abstract static reduce_per_protein(embedding: numpy.ndarray) → numpy.ndarray[source]¶

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters: embedding – the embedding
Returns: A fixed size embedding (a vector of size N, where N is fixed)

class bio_embeddings.embed.PLUSRNNEmbedder(device: Union[None, str, torch.device] = None, **kwargs)[source]¶

PLUS RNN Embedder

Min, Seonwoo, Seunghyun Park, Siwon Kim, Hyun-Soo Choi, and Sungroh Yoon. “Pre-Training of Deep Bidirectional Protein Sequence Representations with Structural Information.” arXiv preprint arXiv:1912.05625 (2019). https://arxiv.org/abs/1912.05625

embed(sequence: str) → numpy.ndarray[source]¶

Returns embedding for one sequence.

Parameters: sequence – Valid amino acid sequence as String
Returns: An embedding of the sequence.

embed_batch(batch: List[str]) → Generator[numpy.ndarray, None, None][source]¶

Computes the embeddings for all sequences in the batch.

The provided implementation is a dummy implementation that should be overridden with the appropriate batching method for the model.

embedding_dimension: ClassVar[int] = 1024¶

name: ClassVar[str] = 'plus_rnn'¶

number_of_layers: ClassVar[int] = 1¶

static reduce_per_protein(embedding: numpy.ndarray) → numpy.ndarray[source]¶

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters: embedding – the embedding
Returns: A fixed size embedding (a vector of size N, where N is fixed)

class bio_embeddings.embed.ProtTransAlbertBFDEmbedder(**kwargs)[source]¶

ProtTrans-Albert-BFD Embedder (ProtAlbert-BFD)

Elnaggar, Ahmed, et al. “ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing.” arXiv preprint arXiv:2007.06225 (2020). https://arxiv.org/abs/2007.06225

embedding_dimension: ClassVar[int] = 4096¶

name: ClassVar[str] = 'prottrans_albert_bfd'¶

number_of_layers: ClassVar[int] = 1¶

class bio_embeddings.embed.ProtTransBertBFDEmbedder(**kwargs)[source]¶

ProtTrans-Bert-BFD Embedder (ProtBert-BFD)

embedding_dimension: ClassVar[int] = 1024¶

name: ClassVar[str] = 'prottrans_bert_bfd'¶

number_of_layers: ClassVar[int] = 1¶

class bio_embeddings.embed.ProtTransT5BFDEmbedder(**kwargs)[source]¶

Encoder of the ProtTrans T5 BFD model

Note that this model alone takes 13GB, so you need a GPU with a lot of memory.

embed(sequence: str) → numpy.ndarray[source]¶

Returns embedding for one sequence.

Parameters: sequence – Valid amino acid sequence as String
Returns: An embedding of the sequence.

embed_many(sequences: Iterable[str], batch_size: Optional[int] = None) → Generator[numpy.ndarray, None, None][source]¶

Returns embeddings for multiple sequences.

Parameters

sequences – Iterable of proteins as amino acid (AA) strings
batch_size – For embedders that profit from batching, this is the maximum number of AA per batch

Returns

A generator yielding the embeddings of the sequences.

embedding_dimension: ClassVar[int] = 1024¶

name: ClassVar[str] = 'prottrans_t5_bfd'¶

number_of_layers: ClassVar[int] = 1¶

static reduce_per_protein(embedding)[source]¶

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters: embedding – the embedding
Returns: A fixed size embedding (a vector of size N, where N is fixed)

class bio_embeddings.embed.ProtTransXLNetUniRef100Embedder(**kwargs)[source]¶

ProtTrans-XLNet-UniRef100 Embedder (ProtXLNet)

embed(sequence: str) → numpy.ndarray[source]¶

Returns embedding for one sequence.

Parameters: sequence – Valid amino acid sequence as String
Returns: An embedding of the sequence.

embed_batch(batch: List[str]) → Generator[numpy.ndarray, None, None][source]¶

Computes the embeddings for all sequences in the batch.

The provided implementation is a dummy implementation that should be overridden with the appropriate batching method for the model.

embedding_dimension: ClassVar[int] = 1024¶

name: ClassVar[str] = 'prottrans_xlnet_uniref100'¶

number_of_layers: ClassVar[int] = 1¶

static reduce_per_protein(embedding)[source]¶

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters: embedding – the embedding
Returns: A fixed size embedding (a vector of size N, where N is fixed)

class bio_embeddings.embed.SeqVecEmbedder(warmup_rounds: int = 4, **kwargs)[source]¶

SeqVec Embedder

Heinzinger, Michael, et al. “Modeling aspects of the language of life through transfer-learning protein sequences.” BMC bioinformatics 20.1 (2019): 723. https://doi.org/10.1186/s12859-019-3220-8

embed(sequence: str) → numpy.ndarray[source]¶

Returns embedding for one sequence.

Parameters: sequence – Valid amino acid sequence as String
Returns: An embedding of the sequence.

embedding_dimension: ClassVar[int] = 1024¶

name: ClassVar[str] = 'seqvec'¶

number_of_layers: ClassVar[int] = 3¶

static reduce_per_protein(embedding)[source]¶

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters: embedding – the embedding
Returns: A fixed size embedding (a vector of size N, where N is fixed)

class bio_embeddings.embed.UniRepEmbedder(device: Union[None, str, torch.device] = None, **kwargs)[source]¶

UniRep Embedder

Alley, E.C., Khimulya, G., Biswas, S. et al. Unified rational protein engineering with sequence-based deep representation learning. Nat Methods 16, 1315–1322 (2019). https://doi.org/10.1038/s41592-019-0598-1

We use a reimplementation of UniRep:

Ma, Eric, and Arkadij Kummer. “Reimplementing Unirep in JAX.” bioRxiv (2020). https://doi.org/10.1101/2020.05.11.088344

embed(sequence: str) → numpy.ndarray[source]¶

Returns embedding for one sequence.

Parameters: sequence – Valid amino acid sequence as String
Returns: An embedding of the sequence.

embedding_dimension: ClassVar[int] = 1900¶

name: ClassVar[str] = 'unirep'¶

number_of_layers: ClassVar[int] = 1¶

params: Dict[str, Any]¶

static reduce_per_protein(embedding: numpy.ndarray) → numpy.ndarray[source]¶

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters: embedding – the embedding
Returns: A fixed size embedding (a vector of size N, where N is fixed)
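The module overview notes that the UniRep embedding includes a start token and is therefore one longer than the protein. If rows need to line up with residues, one option is to drop the extra row, as sketched below. This assumes the start token occupies the first row, which this page does not confirm; strip_start_token is a hypothetical helper, and plain lists stand in for numpy.ndarray.

```python
from typing import List


def strip_start_token(embedding: List[List[float]], sequence: str) -> List[List[float]]:
    """Drop the leading start-token row so that rows line up with residues."""
    if len(embedding) != len(sequence) + 1:
        raise ValueError("expected exactly one more row than residues")
    return embedding[1:]


sequence = "MKT"
# Toy embedding: one start-token row followed by one row per residue.
toy = [[0.0], [1.0], [2.0], [3.0]]
aligned = strip_start_token(toy, sequence)
```

The length check guards against applying the helper to an embedding that has no extra row.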