bio_embeddings.embed
Language models that translate amino acid sequences into vector representations.
All language models implement the EmbedderInterface. You can embed a single sequence with EmbedderInterface.embed() or a list of sequences with EmbedderInterface.embed_many(). All embedders except CPCProt and UniRep generate per-residue embeddings, which you can summarize into a fixed-size per-protein embedding by calling EmbedderInterface.reduce_per_protein(). CPCProt only generates a per-protein embedding (reduce_per_protein() does nothing), while UniRep includes a start token, so its embedding is one position longer than the protein.
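The relationship between per-residue and per-protein embeddings can be sketched without loading any model. The block below is a toy illustration only: `toy_embed` and `mean_pool` are hypothetical helpers, not part of bio_embeddings, and averaging over residues is an assumed reduction (each real embedder defines its own `reduce_per_protein()`). Plain lists stand in for the numpy arrays the real embedders return.

```python
from typing import List

# Toy stand-in for a per-residue embedding: one vector per amino acid.
# Real embedders return numpy arrays; lists keep this sketch dependency-free.
PerResidue = List[List[float]]


def toy_embed(sequence: str, dim: int = 4) -> PerResidue:
    """Hypothetical embedder: one deterministic dim-sized vector per residue."""
    return [[float((ord(aa) + i) % 7) for i in range(dim)] for aa in sequence]


def mean_pool(embedding: PerResidue) -> List[float]:
    """Average over the residue axis to get one fixed-size vector (assumed reduction)."""
    length = len(embedding)
    return [sum(column) / length for column in zip(*embedding)]


per_residue = toy_embed("SEQVENCE")
per_protein = mean_pool(per_residue)
print(len(per_residue), len(per_residue[0]))  # number of residues, embedding size
print(len(per_protein))  # embedding size only, independent of sequence length
```

Whatever the sequence length, the pooled vector has the same size, which is what makes per-protein embeddings directly comparable across proteins.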
Instead of using bio_embeddings[all], it’s possible to install only some embedders by selecting specific extras:
- allennlp: seqvec
- transformers: prottrans_albert_bfd, prottrans_bert_bfd, prottrans_xlnet_uniref100, prottrans_t5_bfd
- jax-unirep: unirep
- esm: esm
- cpcprot: cpcprot
- plus: plus_rnn
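As a rough sketch of how the extras above translate into an install requirement, the helper below builds a requirement string for a handful of the listed embedders. It assumes that the name before each colon is the extra and the names after it are the embedders that extra enables; `EXTRA_FOR_EMBEDDER` and `requirement_for` are hypothetical names introduced here for illustration only.

```python
from typing import Dict, Iterable

# Embedder name -> extra assumed to provide it (subset of the list above).
EXTRA_FOR_EMBEDDER: Dict[str, str] = {
    "seqvec": "allennlp",
    "prottrans_albert_bfd": "transformers",
    "prottrans_bert_bfd": "transformers",
    "prottrans_t5_bfd": "transformers",
    "unirep": "jax-unirep",
    "esm": "esm",
    "cpcprot": "cpcprot",
    "plus_rnn": "plus",
}


def requirement_for(embedders: Iterable[str]) -> str:
    """Build a requirement string with the de-duplicated, sorted extras."""
    extras = sorted({EXTRA_FOR_EMBEDDER[name] for name in embedders})
    return f"bio_embeddings[{','.join(extras)}]"


print(requirement_for(["seqvec", "esm"]))  # bio_embeddings[allennlp,esm]
```

The resulting string is the kind of argument one would pass to a package installer in place of bio_embeddings[all].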
-
class bio_embeddings.embed.BeplerEmbedder(device: Union[None, str, torch.device] = None, **kwargs)
Bepler Embedder
Bepler, Tristan, and Bonnie Berger. “Learning protein sequence embeddings using information from structure.” arXiv preprint arXiv:1902.08661 (2019).
-
class bio_embeddings.embed.CPCProtEmbedder(device: Union[None, str, torch.device] = None, **kwargs)
CPCProt Embedder
Lu, Amy X., et al. “Self-supervised contrastive learning of protein representations by mutual information maximization.” bioRxiv (2020). https://doi.org/10.1101/2020.09.04.283929
-
embed(sequence: str) → numpy.ndarray
Returns embedding for one sequence.
- Parameters
sequence – Valid amino acid sequence as String
- Returns
An embedding of the sequence.
-
-
class bio_embeddings.embed.ESM1bEmbedder(device: Union[None, str, torch.device] = None, **kwargs)
ESM-1b Embedder (Note: This is not the original ESM)
Rives, Alexander, et al. “Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences.” bioRxiv (2019): 622803. https://doi.org/10.1101/622803
-
class bio_embeddings.embed.ESMEmbedder(device: Union[None, str, torch.device] = None, **kwargs)
ESM Embedder (Note: This is not ESM-1b)
Rives, Alexander, et al. “Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences.” bioRxiv (2019): 622803. https://doi.org/10.1101/622803
-
class bio_embeddings.embed.EmbedderInterface(device: Union[None, str, torch.device] = None, **kwargs)
-
abstract embed(sequence: str) → numpy.ndarray
Returns embedding for one sequence.
- Parameters
sequence – Valid amino acid sequence as String
- Returns
An embedding of the sequence.
-
embed_batch(batch: List[str]) → Generator[numpy.ndarray, None, None]
Computes the embeddings for all sequences in the batch.
The provided implementation is a dummy implementation that should be overridden with the appropriate batching method for the model.
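The idea of an abstract embed() combined with a fallback embed_batch() can be sketched with Python's `abc` module. This is a simplified, hypothetical stand-in (`ToyEmbedderInterface` and `LengthEmbedder` are not library classes), and the fallback shown here, embedding sequences one at a time, is only an assumption about what a "dummy" batching method might look like.

```python
from abc import ABC, abstractmethod
from typing import Generator, List


class ToyEmbedderInterface(ABC):
    """Hypothetical, simplified stand-in for the interface described above."""

    @abstractmethod
    def embed(self, sequence: str) -> List[float]:
        """Return an embedding for one sequence."""

    def embed_batch(self, batch: List[str]) -> Generator[List[float], None, None]:
        # Fallback without real batching: embed the sequences one at a time.
        # A model with native batching support would override this method.
        for sequence in batch:
            yield self.embed(sequence)


class LengthEmbedder(ToyEmbedderInterface):
    """Toy subclass whose "embedding" is simply the sequence length."""

    def embed(self, sequence: str) -> List[float]:
        return [float(len(sequence))]


embeddings = list(LengthEmbedder().embed_batch(["SEQ", "PROTEIN"]))
print(embeddings)  # [[3.0], [7.0]]
```

A subclass only has to implement embed() to be usable, and can override embed_batch() later if the underlying model benefits from it.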
-
embed_many(sequences: Iterable[str], batch_size: Optional[int] = None) → Generator[numpy.ndarray, None, None]
Returns embeddings for the given sequences.
- Parameters
sequences – List of proteins as amino acid (AA) strings
batch_size – For embedders that profit from batching, this is the maximum number of amino acids per batch
- Returns
A generator yielding the embeddings of the sequences.
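Because batch_size counts amino acids rather than sequences, batches hold a varying number of sequences. The sketch below shows one possible way to form such batches; `batch_by_residues` is a hypothetical helper and not necessarily how the library groups sequences internally.

```python
from typing import Iterable, Iterator, List


def batch_by_residues(sequences: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group sequences so each batch holds at most batch_size amino acids.

    A single sequence longer than batch_size is emitted alone in its own batch.
    """
    batch: List[str] = []
    residues = 0
    for sequence in sequences:
        # Close the current batch if adding this sequence would exceed the limit.
        if batch and residues + len(sequence) > batch_size:
            yield batch
            batch, residues = [], 0
        batch.append(sequence)
        residues += len(sequence)
    if batch:
        yield batch


batches = list(
    batch_by_residues(["AAAA", "CC", "DDDDDD", "E", "FFFFFFFFFFFF"], batch_size=7)
)
print(batches)
# [['AAAA', 'CC'], ['DDDDDD', 'E'], ['FFFFFFFFFFFF']]
```

Limiting the residue count instead of the sequence count keeps memory use per batch roughly bounded even when sequence lengths vary widely.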
-
abstract
-
class bio_embeddings.embed.PLUSRNNEmbedder(device: Union[None, str, torch.device] = None, **kwargs)
PLUS RNN Embedder
Min, Seonwoo, Seunghyun Park, Siwon Kim, Hyun-Soo Choi, and Sungroh Yoon. “Pre-Training of Deep Bidirectional Protein Sequence Representations with Structural Information.” arXiv preprint arXiv:1912.05625 (2019). https://arxiv.org/abs/1912.05625
-
embed(sequence: str) → numpy.ndarray
Returns embedding for one sequence.
- Parameters
sequence – Valid amino acid sequence as String
- Returns
An embedding of the sequence.
-
-
class bio_embeddings.embed.ProtTransAlbertBFDEmbedder(**kwargs)
ProtTrans-Albert-BFD Embedder (ProtAlbert-BFD)
Elnaggar, Ahmed, et al. “ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing.” arXiv preprint arXiv:2007.06225 (2020). https://arxiv.org/abs/2007.06225
-
class bio_embeddings.embed.ProtTransBertBFDEmbedder(**kwargs)
ProtTrans-Bert-BFD Embedder (ProtBert-BFD)
Elnaggar, Ahmed, et al. “ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing.” arXiv preprint arXiv:2007.06225 (2020). https://arxiv.org/abs/2007.06225
-
class bio_embeddings.embed.ProtTransT5BFDEmbedder(**kwargs)
Encoder of the ProtTrans T5 BFD model
Note that this model alone takes 13GB, so you need a GPU with a lot of memory.
-
embed(sequence: str) → numpy.ndarray
Returns embedding for one sequence.
- Parameters
sequence – Valid amino acid sequence as String
- Returns
An embedding of the sequence.
-
embed_many(sequences: Iterable[str], batch_size: Optional[int] = None) → Generator[numpy.ndarray, None, None]
Returns embeddings for the given sequences.
- Parameters
sequences – List of proteins as amino acid (AA) strings
batch_size – For embedders that profit from batching, this is the maximum number of amino acids per batch
- Returns
A generator yielding the embeddings of the sequences.
-
-
class bio_embeddings.embed.ProtTransXLNetUniRef100Embedder(**kwargs)
ProtTrans-XLNet-UniRef100 Embedder (ProtXLNet)
Elnaggar, Ahmed, et al. “ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing.” arXiv preprint arXiv:2007.06225 (2020). https://arxiv.org/abs/2007.06225
-
embed(sequence: str) → numpy.ndarray
Returns embedding for one sequence.
- Parameters
sequence – Valid amino acid sequence as String
- Returns
An embedding of the sequence.
-
-
class bio_embeddings.embed.SeqVecEmbedder(warmup_rounds: int = 4, **kwargs)
SeqVec Embedder
Heinzinger, Michael, et al. “Modeling aspects of the language of life through transfer-learning protein sequences.” BMC bioinformatics 20.1 (2019): 723. https://doi.org/10.1186/s12859-019-3220-8
-
class bio_embeddings.embed.UniRepEmbedder(device: Union[None, str, torch.device] = None, **kwargs)
UniRep Embedder
Alley, E.C., Khimulya, G., Biswas, S. et al. Unified rational protein engineering with sequence-based deep representation learning. Nat Methods 16, 1315–1322 (2019). https://doi.org/10.1038/s41592-019-0598-1
We use a reimplementation of UniRep:
Ma, Eric, and Arkadij Kummer. “Reimplementing Unirep in JAX.” bioRxiv (2020). https://doi.org/10.1101/2020.05.11.088344