bio_embeddings.embed

Language models to translate amino acid sequences into vector representations

All language models implement the EmbedderInterface. You can embed a single sequence with EmbedderInterface.embed() or a list of sequences with EmbedderInterface.embed_many(). All embedders except CPCProt and UniRep generate per-residue embeddings, which you can summarize into a fixed-size per-protein embedding by calling EmbedderInterface.reduce_per_protein(). CPCProt only generates a per-protein embedding (reduce_per_protein does nothing), while UniRep includes a start token, so its embedding is one position longer than the protein.
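The shapes described above can be illustrated without loading any model. The following is a minimal sketch using random stand-in arrays, not output from an actual embedder. It assumes the per-protein reduction is a mean over the residue axis, which this page does not state, and the example sequence is arbitrary.

```python
import numpy as np

# Stand-in arrays only; no model is loaded. Shapes follow the description above.
sequence = "SEQWENCE"       # arbitrary example with 8 residues
embedding_dimension = 1024  # value documented below for prottrans_bert_bfd

rng = np.random.default_rng(0)

# Per-residue embedding: one row per amino acid.
per_residue = rng.normal(size=(len(sequence), embedding_dimension))

# Assumption: the per-protein reduction is a mean over the residue axis.
per_protein = per_residue.mean(axis=0)

# UniRep-style output: one extra position for the start token (1900 dimensions).
unirep_like = rng.normal(size=(len(sequence) + 1, 1900))

print(per_residue.shape)  # (8, 1024)
print(per_protein.shape)  # (1024,)
print(unirep_like.shape)  # (9, 1900)
```

The per-protein vector has the same length for every protein, whereas the per-residue array grows with the sequence.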

Instead of installing bio_embeddings[all], you can install only some embedders by selecting specific extras:

  • allennlp: seqvec

  • transformers: prottrans_albert_bfd, prottrans_bert_bfd, prottrans_xlnet_uniref100, prottrans_t5_bfd

  • jax-unirep: unirep

  • esm: esm

  • cpcprot: cpcprot

  • plus: plus_rnn
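As a small illustration of how the extras relate to embedder names, the sketch below transcribes the list above into a dictionary, using the spelling of the embedders' `name` attributes, and builds the matching requirement string. The helpers `extra_for` and `requirement_for` are hypothetical and not part of bio_embeddings.

```python
# Mapping transcribed from the list above: extra name -> embedder names it enables.
EXTRAS = {
    "allennlp": ["seqvec"],
    "transformers": [
        "prottrans_albert_bfd",
        "prottrans_bert_bfd",
        "prottrans_xlnet_uniref100",
        "prottrans_t5_bfd",
    ],
    "jax-unirep": ["unirep"],
    "esm": ["esm"],
    "cpcprot": ["cpcprot"],
    "plus": ["plus_rnn"],
}


def extra_for(embedder_name: str) -> str:
    """Return the extra that provides the given embedder (hypothetical helper)."""
    for extra, embedders in EXTRAS.items():
        if embedder_name in embedders:
            return extra
    raise KeyError(f"no extra listed for embedder {embedder_name!r}")


def requirement_for(embedder_name: str) -> str:
    """Build a requirement string in the bio_embeddings[extra] form used above."""
    return f"bio_embeddings[{extra_for(embedder_name)}]"


print(requirement_for("seqvec"))  # bio_embeddings[allennlp]
```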

class bio_embeddings.embed.BeplerEmbedder(device: Union[None, str, torch.device] = None, **kwargs)[source]

Bepler Embedder

Bepler, Tristan, and Bonnie Berger. “Learning protein sequence embeddings using information from structure.” arXiv preprint arXiv:1902.08661 (2019).

embed(sequence: str) → numpy.ndarray[source]

Returns embedding for one sequence.

Parameters

sequence – Valid amino acid sequence as String

Returns

An embedding of the sequence.

embedding_dimension: ClassVar[int] = 121
name: ClassVar[str] = 'bepler'
number_of_layers: ClassVar[int] = 1
static reduce_per_protein(embedding: numpy.ndarray) → numpy.ndarray[source]

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters

embedding – the embedding

Returns

A fixed size embedding (a vector of size N, where N is fixed)

class bio_embeddings.embed.CPCProtEmbedder(device: Union[None, str, torch.device] = None, **kwargs)[source]

CPCProt Embedder

Lu, Amy X., et al. “Self-supervised contrastive learning of protein representations by mutual information maximization.” bioRxiv (2020). https://doi.org/10.1101/2020.09.04.283929

embed(sequence: str) → numpy.ndarray[source]

Returns embedding for one sequence.

Parameters

sequence – Valid amino acid sequence as String

Returns

An embedding of the sequence.

embed_batch(batch: List[str]) → Generator[numpy.ndarray, None, None][source]

See https://github.com/amyxlu/CPCProt/blob/df1ad1118544ed349b5e711207660a7c205b3128/embed_fasta.py

embedding_dimension: ClassVar[int] = 512
name: ClassVar[str] = 'cpcprot'
number_of_layers: ClassVar[int] = 1
static reduce_per_protein(embedding: numpy.ndarray) → numpy.ndarray[source]

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters

embedding – the embedding

Returns

A fixed size embedding (a vector of size N, where N is fixed)

class bio_embeddings.embed.ESM1bEmbedder(device: Union[None, str, torch.device] = None, **kwargs)[source]

ESM-1b Embedder (Note: This is not the original ESM)

Rives, Alexander, et al. “Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences.” bioRxiv (2019): 622803. https://doi.org/10.1101/622803

name: ClassVar[str] = 'esm1b'
class bio_embeddings.embed.ESMEmbedder(device: Union[None, str, torch.device] = None, **kwargs)[source]

ESM Embedder (Note: This is not ESM-1b)

Rives, Alexander, et al. “Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences.” bioRxiv (2019): 622803. https://doi.org/10.1101/622803

name: ClassVar[str] = 'esm'
class bio_embeddings.embed.EmbedderInterface(device: Union[None, str, torch.device] = None, **kwargs)[source]
abstract embed(sequence: str) → numpy.ndarray[source]

Returns embedding for one sequence.

Parameters

sequence – Valid amino acid sequence as String

Returns

An embedding of the sequence.

embed_batch(batch: List[str]) → Generator[numpy.ndarray, None, None][source]

Computes the embeddings from all sequences in the batch

The provided implementation is a dummy implementation that should be overridden with the appropriate batching method for the model.
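To make the idea of a fallback embed_batch concrete, here is a minimal sketch. It assumes the fallback simply calls embed() once per sequence, which this page does not spell out. `ToyEmbedderInterface` and `ToyEmbedder` are hypothetical stand-ins that use plain lists instead of numpy arrays; they are not the library's classes.

```python
from abc import ABC, abstractmethod
from typing import Generator, List

Embedding = List[List[float]]  # one vector per residue


class ToyEmbedderInterface(ABC):
    """Hypothetical stand-in for an embedder interface."""

    @abstractmethod
    def embed(self, sequence: str) -> Embedding:
        """Return an embedding for one sequence."""

    def embed_batch(self, batch: List[str]) -> Generator[Embedding, None, None]:
        # Assumed fallback: no real batching, one embed() call per sequence.
        for sequence in batch:
            yield self.embed(sequence)


class ToyEmbedder(ToyEmbedderInterface):
    def embed(self, sequence: str) -> Embedding:
        # Toy one-dimensional "embedding": the character code of each residue.
        return [[float(ord(amino_acid))] for amino_acid in sequence]


embeddings = list(ToyEmbedder().embed_batch(["AC", "G"]))
print(embeddings)  # [[[65.0], [67.0]], [[71.0]]]
```

A model that can process several sequences in one forward pass would override embed_batch instead of relying on this loop.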

embed_many(sequences: Iterable[str], batch_size: Optional[int] = None) → Generator[numpy.ndarray, None, None][source]

Returns embeddings for multiple sequences.

Parameters
  • sequences – Iterable of proteins as amino acid (AA) strings

  • batch_size – For embedders that profit from batching, this is the maximum number of AA per batch

Returns

A generator yielding one embedding per sequence.
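Because batch_size counts amino acids rather than sequences, batches can hold different numbers of sequences. The following sketch shows one possible greedy grouping under that constraint. It is an illustration only, not the library's implementation, and `batch_by_residue_count` is a hypothetical helper.

```python
from typing import Iterable, Iterator, List


def batch_by_residue_count(sequences: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Greedily group sequences so each batch holds at most batch_size amino acids.

    A sequence longer than batch_size is yielded alone in its own batch.
    """
    batch: List[str] = []
    residues_in_batch = 0
    for sequence in sequences:
        # Close the current batch if adding this sequence would exceed the limit.
        if batch and residues_in_batch + len(sequence) > batch_size:
            yield batch
            batch, residues_in_batch = [], 0
        batch.append(sequence)
        residues_in_batch += len(sequence)
    if batch:
        yield batch


long_sequence = "W" * 12
batches = list(batch_by_residue_count(["MKT", "AAAAA", "GG", long_sequence], batch_size=8))
print([len(batch) for batch in batches])  # [2, 1, 1]
```

Here the first two sequences total exactly 8 residues and share a batch, while the 12-residue sequence exceeds the limit on its own and is processed alone.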

embedding_dimension: ClassVar[int]
name: ClassVar[str]
number_of_layers: ClassVar[int]
abstract static reduce_per_protein(embedding: numpy.ndarray) → numpy.ndarray[source]

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters

embedding – the embedding

Returns

A fixed size embedding (a vector of size N, where N is fixed)

class bio_embeddings.embed.PLUSRNNEmbedder(device: Union[None, str, torch.device] = None, **kwargs)[source]

PLUS RNN Embedder

Min, Seonwoo, et al. “Pre-Training of Deep Bidirectional Protein Sequence Representations with Structural Information.” arXiv preprint arXiv:1912.05625 (2019). https://arxiv.org/abs/1912.05625

embed(sequence: str) → numpy.ndarray[source]

Returns embedding for one sequence.

Parameters

sequence – Valid amino acid sequence as String

Returns

An embedding of the sequence.

embed_batch(batch: List[str]) → Generator[numpy.ndarray, None, None][source]

Computes the embeddings from all sequences in the batch

The provided implementation is a dummy implementation that should be overridden with the appropriate batching method for the model.

embedding_dimension: ClassVar[int] = 1024
name: ClassVar[str] = 'plus_rnn'
number_of_layers: ClassVar[int] = 1
static reduce_per_protein(embedding: numpy.ndarray) → numpy.ndarray[source]

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters

embedding – the embedding

Returns

A fixed size embedding (a vector of size N, where N is fixed)

class bio_embeddings.embed.ProtTransAlbertBFDEmbedder(**kwargs)[source]

ProtTrans-Albert-BFD Embedder (ProtAlbert-BFD)

Elnaggar, Ahmed, et al. “ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing.” arXiv preprint arXiv:2007.06225 (2020). https://arxiv.org/abs/2007.06225

embedding_dimension: ClassVar[int] = 4096
name: ClassVar[str] = 'prottrans_albert_bfd'
number_of_layers: ClassVar[int] = 1
class bio_embeddings.embed.ProtTransBertBFDEmbedder(**kwargs)[source]

ProtTrans-Bert-BFD Embedder (ProtBert-BFD)

Elnaggar, Ahmed, et al. “ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing.” arXiv preprint arXiv:2007.06225 (2020). https://arxiv.org/abs/2007.06225

embedding_dimension: ClassVar[int] = 1024
name: ClassVar[str] = 'prottrans_bert_bfd'
number_of_layers: ClassVar[int] = 1
class bio_embeddings.embed.ProtTransT5BFDEmbedder(**kwargs)[source]

Encoder of the ProtTrans T5 BFD model

Note that this model alone takes 13GB, so you need a GPU with a lot of memory.

embed(sequence: str) → numpy.ndarray[source]

Returns embedding for one sequence.

Parameters

sequence – Valid amino acid sequence as String

Returns

An embedding of the sequence.

embed_many(sequences: Iterable[str], batch_size: Optional[int] = None) → Generator[numpy.ndarray, None, None][source]

Returns embeddings for multiple sequences.

Parameters
  • sequences – Iterable of proteins as amino acid (AA) strings

  • batch_size – For embedders that profit from batching, this is the maximum number of AA per batch

Returns

A generator yielding one embedding per sequence.

embedding_dimension: ClassVar[int] = 1024
name: ClassVar[str] = 'prottrans_t5_bfd'
number_of_layers: ClassVar[int] = 1
static reduce_per_protein(embedding)[source]

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters

embedding – the embedding

Returns

A fixed size embedding (a vector of size N, where N is fixed)

class bio_embeddings.embed.ProtTransXLNetUniRef100Embedder(**kwargs)[source]

ProtTrans-XLNet-UniRef100 Embedder (ProtXLNet)

Elnaggar, Ahmed, et al. “ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing.” arXiv preprint arXiv:2007.06225 (2020). https://arxiv.org/abs/2007.06225

embed(sequence: str) → numpy.ndarray[source]

Returns embedding for one sequence.

Parameters

sequence – Valid amino acid sequence as String

Returns

An embedding of the sequence.

embed_batch(batch: List[str]) → Generator[numpy.ndarray, None, None][source]

Computes the embeddings from all sequences in the batch

The provided implementation is a dummy implementation that should be overridden with the appropriate batching method for the model.

embedding_dimension: ClassVar[int] = 1024
name: ClassVar[str] = 'prottrans_xlnet_uniref100'
number_of_layers: ClassVar[int] = 1
static reduce_per_protein(embedding)[source]

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters

embedding – the embedding

Returns

A fixed size embedding (a vector of size N, where N is fixed)

class bio_embeddings.embed.SeqVecEmbedder(warmup_rounds: int = 4, **kwargs)[source]

SeqVec Embedder

Heinzinger, Michael, et al. “Modeling aspects of the language of life through transfer-learning protein sequences.” BMC bioinformatics 20.1 (2019): 723. https://doi.org/10.1186/s12859-019-3220-8

embed(sequence: str) → numpy.ndarray[source]

Returns embedding for one sequence.

Parameters

sequence – Valid amino acid sequence as String

Returns

An embedding of the sequence.

embedding_dimension: ClassVar[int] = 1024
name: ClassVar[str] = 'seqvec'
number_of_layers: ClassVar[int] = 3
static reduce_per_protein(embedding)[source]

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters

embedding – the embedding

Returns

A fixed size embedding (a vector of size N, where N is fixed)
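SeqVec is the only embedder on this page with number_of_layers = 3, so its embedding carries an extra layer axis. The sketch below uses a random stand-in array and rests on two assumptions not stated here: that the layout is (layers, residues, dimension), and that the reduction sums over layers before averaging over residues.

```python
import numpy as np

number_of_layers = 3        # documented above for seqvec
embedding_dimension = 1024  # documented above for seqvec
sequence_length = 5         # arbitrary example length

# Assumed layout: (layers, residues, dimension); random stand-in values.
rng = np.random.default_rng(0)
embedding = rng.normal(size=(number_of_layers, sequence_length, embedding_dimension))

# Assumed reduction: sum over the layer axis, then mean over the residue axis.
per_protein = embedding.sum(axis=0).mean(axis=0)

print(embedding.shape)    # (3, 5, 1024)
print(per_protein.shape)  # (1024,)
```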

class bio_embeddings.embed.UniRepEmbedder(device: Union[None, str, torch.device] = None, **kwargs)[source]

UniRep Embedder

Alley, E.C., Khimulya, G., Biswas, S. et al. Unified rational protein engineering with sequence-based deep representation learning. Nat Methods 16, 1315–1322 (2019). https://doi.org/10.1038/s41592-019-0598-1

We use a reimplementation of UniRep:

Ma, Eric, and Arkadij Kummer. “Reimplementing Unirep in JAX.” bioRxiv (2020). https://doi.org/10.1101/2020.05.11.088344

embed(sequence: str) → numpy.ndarray[source]

Returns embedding for one sequence.

Parameters

sequence – Valid amino acid sequence as String

Returns

An embedding of the sequence.

embedding_dimension: ClassVar[int] = 1900
name: ClassVar[str] = 'unirep'
number_of_layers: ClassVar[int] = 1
params: Dict[str, Any]
static reduce_per_protein(embedding: numpy.ndarray) → numpy.ndarray[source]

For a variable size embedding, returns a fixed size embedding encoding all information of a sequence.

Parameters

embedding – the embedding

Returns

A fixed size embedding (a vector of size N, where N is fixed)