bio_embeddings.embed
Language models to translate amino acid sequences into vector representations
All language models implement the EmbedderInterface. You can embed a single sequence with EmbedderInterface.embed() or a list of sequences with EmbedderInterface.embed_many(). All embedders except CPCProt and UniRep generate per-residue embeddings, which you can summarize into a fixed-size per-protein embedding by calling EmbedderInterface.reduce_per_protein(). CPCProt only generates a per-protein embedding (reduce_per_protein does nothing), while UniRep includes a start token, so the embedding is one longer than the protein.
UniRep, GloVe, fastText and word2vec only support the CPU.
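The relationship between the three calls can be sketched with a toy class. This is a minimal illustration, not the library's code: ToyEmbedder, its two made-up features, and the use of plain lists in place of numpy arrays are all assumptions, and mean pooling over residues is assumed as the reduction.

```python
from typing import Iterable, Iterator, List

Embedding = List[List[float]]  # one row per residue, one column per feature


class ToyEmbedder:
    """Hypothetical stand-in that mimics embed / embed_many / reduce_per_protein."""

    ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

    def embed(self, sequence: str) -> Embedding:
        # Toy per-residue embedding: alphabet index and position as two features.
        return [[float(self.ALPHABET.index(aa)), float(i)] for i, aa in enumerate(sequence)]

    def embed_many(self, sequences: Iterable[str]) -> Iterator[Embedding]:
        # Lazily yields one embedding per sequence, in input order.
        for sequence in sequences:
            yield self.embed(sequence)

    @staticmethod
    def reduce_per_protein(embedding: Embedding) -> List[float]:
        # Assumption: average over the residue axis to get a fixed-size vector.
        length = len(embedding)
        return [sum(column) / length for column in zip(*embedding)]


embedder = ToyEmbedder()
per_residue = embedder.embed("ACD")
per_protein = embedder.reduce_per_protein(per_residue)
all_per_protein = [
    embedder.reduce_per_protein(e) for e in embedder.embed_many(["ACD", "ACDEF"])
]
```

Here per_residue has one row per amino acid, while per_protein has the same length regardless of how long the sequence is.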
OneHotEncodingEmbedder offers a naive baseline to compare the language model embeddings against, with one-hot encoding as the per-residue embedding and amino acid composition as the per-protein embedding. It accepts keyword arguments but ignores them since it does not do any notable computation.
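The two baseline representations can be sketched in a few lines. This is an illustration under assumptions, not the embedder's implementation: the 20-letter alphabet, the function names, and the lack of any handling for unknown residues are all choices made for this example.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet: the 20 standard amino acids


def one_hot_per_residue(sequence: str) -> List[List[int]]:
    """One row per residue with a single 1 at the residue's alphabet index."""
    return [[1 if aa == letter else 0 for letter in AMINO_ACIDS] for aa in sequence]


def composition_per_protein(sequence: str) -> List[float]:
    """Fraction of each amino acid in the sequence (always 20 values)."""
    return [sequence.count(letter) / len(sequence) for letter in AMINO_ACIDS]


rows = one_hot_per_residue("ACCA")
composition = composition_per_protein("ACCA")
```

For "ACCA", rows has four rows of 20 entries each, and composition assigns 0.5 to A and 0.5 to C.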
Instead of using bio_embeddings[all], it’s possible to install only some embedders by selecting specific extras:
- allennlp: seqvec
- transformers: prottrans_albert_bfd, prottrans_bert_bfd, prottrans_xlnet_uniref100, prottrans_t5_bfd, prottrans_t5_uniref50, prottrans_t5_xl_u50
- jax-unirep: unirep
- esm: esm
- cpcprot: cpcprot
- plus: plus_rnn
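As a small sketch of how the extras above map to embedders, the helper below builds a requirement string for a chosen set of embedders. The function name requirement_for is hypothetical, and the package name is taken from the bio_embeddings[all] form used above.

```python
from typing import Dict, Iterable, List

# Extras and the embedders they enable, as listed above.
EXTRAS: Dict[str, List[str]] = {
    "allennlp": ["seqvec"],
    "transformers": [
        "prottrans_albert_bfd",
        "prottrans_bert_bfd",
        "prottrans_xlnet_uniref100",
        "prottrans_t5_bfd",
        "prottrans_t5_uniref50",
        "prottrans_t5_xl_u50",
    ],
    "jax-unirep": ["unirep"],
    "esm": ["esm"],
    "cpcprot": ["cpcprot"],
    "plus": ["plus_rnn"],
}


def requirement_for(embedders: Iterable[str]) -> str:
    """Build a requirement string whose extras cover the requested embedders."""
    wanted = set(embedders)
    known = {name for names in EXTRAS.values() for name in names}
    unknown = wanted - known
    if unknown:
        raise ValueError(f"No extra listed for: {sorted(unknown)}")
    extras = sorted(extra for extra, names in EXTRAS.items() if wanted & set(names))
    return f"bio_embeddings[{','.join(extras)}]"


requirement = requirement_for(["seqvec", "prottrans_bert_bfd"])
```

The resulting string, here bio_embeddings[allennlp,transformers], can be passed in quotes to pip install.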
Model sizes
The disk size represents the size of the unzipped models or the combination of all files necessary for a particular embedder. The GPU and CPU sizes cover only loading the model into GPU memory (VRAM) or main RAM; they exclude the memory required to do any computation. They were measured for one specific set of hardware and software (Quadro RTX 8000, CUDA 11.1, torch 1.7.1, x86 64-bit Ubuntu 18.04) and will vary for different setups.
| Model | Disk size (GB) | GPU size (GB) | CPU size (GB) |
|---|---|---|---|
| bepler | 0.1 | 1.4 | 0.2 |
| bert_from_publication | 0.008 | 1.1 | 0.006 |
| cpcprot | 0.007 | 1.1 | 0.01 |
| deepblast | 0.4 | 1.4 | 0.26 |
| esm | 6.3 | 3.9 | 2.7 |
| esm1b | 7.3 | 3.8 | 2.6 |
| esm1v | 7.3 | 3.9 | 2.6 |
| fasttext | 0.05 | n/a | 0.03 |
| glove | 0.06 | n/a | 0.03 |
| one_hot_encoding | n/a | n/a | n/a |
| pb_tucker | 0.009 | 1.0 | 0.02 |
| plus_rnn | 0.06 | 1.2 | 0.1 |
| prottrans_albert_bfd | 0.9 | 2.0 | 1.8 |
| prottrans_bert_bfd | 1.6 | 2.8 | 3.4 |
| prottrans_t5_bfd | 7.2 | 5.9 | 16.1 |
| prottrans_t5_uniref50 | 7.2 | 5.9 | 16.1 |
| prottrans_t5_xl_u50 | 7.2 | 5.9 | 16.1 |
| prottrans_xlnet_uniref100 | 1.6 | 2.7 | 3.3 |
| seqvec | 0.4 | 1.6 | 0.5 |
| seqvec_from_publication | 0.004 | 1.1 | 0.006 |
| unirep | n/a | n/a | 0.2 |
| word2vec | 0.07 | n/a | 0.06 |
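Because the GPU sizes cover loading only, a rough pre-selection needs extra headroom for computation. The sketch below uses a handful of GPU values copied from the table; the function name models_that_load and the headroom_gb parameter are hypothetical, and a suitable headroom depends on sequence lengths and batch sizes that the table does not cover.

```python
from typing import Dict, List, Optional

# GPU load sizes in GB copied from the table above; None marks "n/a".
GPU_SIZE_GB: Dict[str, Optional[float]] = {
    "bepler": 1.4,
    "esm1b": 3.8,
    "prottrans_bert_bfd": 2.8,
    "prottrans_t5_xl_u50": 5.9,
    "seqvec": 1.6,
    "word2vec": None,
}


def models_that_load(free_vram_gb: float, headroom_gb: float = 0.0) -> List[str]:
    """Names of models whose load size fits, leaving headroom_gb for computation."""
    budget = free_vram_gb - headroom_gb
    return sorted(
        name
        for name, size in GPU_SIZE_GB.items()
        if size is not None and size <= budget
    )


fits = models_that_load(free_vram_gb=4.0)
fits_with_headroom = models_that_load(free_vram_gb=4.0, headroom_gb=2.0)
```

With 4 GB free and no headroom, four of the listed models fit; reserving 2 GB for computation leaves only bepler and seqvec.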
- class bio_embeddings.embed.BeplerEmbedder(device: Union[None, str, torch.device] = None, **kwargs)
Bepler Embedder
Bepler, Tristan, and Bonnie Berger. “Learning protein sequence embeddings using information from structure.” arXiv preprint arXiv:1902.08661 (2019).
- __init__(device: Union[None, str, torch.device] = None, **kwargs)
Initializer accepts location of a pre-trained model and options
- class bio_embeddings.embed.CPCProtEmbedder(device: Union[None, str, torch.device] = None, **kwargs)
CPCProt Embedder
Lu, Amy X., et al. “Self-supervised contrastive learning of protein representations by mutual information maximization.” bioRxiv (2020). https://doi.org/10.1101/2020.09.04.283929
- __init__(device: Union[None, str, torch.device] = None, **kwargs)
Initializer accepts location of a pre-trained model and options
- embed(sequence: str) → numpy.ndarray
Returns embedding for one sequence.
- Parameters
sequence – Valid amino acid sequence as String
- Returns
An embedding of the sequence.
- class bio_embeddings.embed.ESM1bEmbedder(device: Union[None, str, torch.device] = None, **kwargs)
ESM-1b Embedder (Note: This is not the original ESM)
Rives, Alexander, et al. “Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences.” Proceedings of the National Academy of Sciences 118.15 (2021). https://doi.org/10.1073/pnas.2016239118
- class bio_embeddings.embed.ESM1vEmbedder(ensemble_id: int, device: Union[None, str, torch.device] = None, **kwargs)
ESM-1v Embedder (one of five)
ESM1v uses an ensemble of five models, called esm1v_t33_650M_UR90S_[1-5]. An instance of this class is one of the five, specified by ensemble_id.
Meier, Joshua, et al. “Language models enable zero-shot prediction of the effects of mutations on protein function.” bioRxiv (2021). https://doi.org/10.1101/2021.07.09.450648
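The mapping from ensemble_id to the five model names can be written out as follows. This is only a sketch of the naming scheme described above: the helper esm1v_model_name is hypothetical, and whether the class validates ensemble_id in this way is an assumption.

```python
from typing import List


def esm1v_model_name(ensemble_id: int) -> str:
    """Map an ensemble_id in 1..5 to the corresponding ESM-1v model name."""
    if ensemble_id not in range(1, 6):
        raise ValueError(f"ensemble_id must be between 1 and 5, got {ensemble_id}")
    return f"esm1v_t33_650M_UR90S_{ensemble_id}"


names: List[str] = [esm1v_model_name(i) for i in range(1, 6)]
```

Covering the whole ensemble therefore means creating five separate instances, one per ensemble_id.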
- class bio_embeddings.embed.ESMEmbedder(device: Union[None, str, torch.device] = None, **kwargs)
ESM Embedder (Note: This is not ESM-1b)
Rives, Alexander, et al. “Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences.” Proceedings of the National Academy of Sciences 118.15 (2021). https://doi.org/10.1073/pnas.2016239118
- class bio_embeddings.embed.EmbedderInterface(device: Union[None, str, torch.device] = None, **kwargs)
- __init__(device: Union[None, str, torch.device] = None, **kwargs)
Initializer accepts location of a pre-trained model and options
- abstract embed(sequence: str) → numpy.ndarray
Returns embedding for one sequence.
- Parameters
sequence – Valid amino acid sequence as String
- Returns
An embedding of the sequence.
- embed_batch(batch: List[str]) → Generator[numpy.ndarray, None, None]
Computes the embeddings from all sequences in the batch
The provided implementation is a dummy implementation that should be overridden with the appropriate batching method for the model.
- embed_many(sequences: Iterable[str], batch_size: Optional[int] = None) → Generator[numpy.ndarray, None, None]
Returns embeddings for multiple sequences.
- Parameters
sequences – List of proteins as AA strings
batch_size – For embedders that profit from batching, this is the maximum number of AA per batch
- Returns
A generator that yields the embeddings of the sequences.
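Since batch_size counts amino acids rather than sequences, batching amounts to filling each batch up to a residue budget. The following is a minimal sketch of that idea, not the library's implementation; the function name batch_by_amino_acids and the choice to place an over-long sequence alone in its own batch are assumptions.

```python
from typing import Iterable, Iterator, List


def batch_by_amino_acids(sequences: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group sequences so each batch holds at most batch_size amino acids.

    A sequence longer than batch_size is yielded alone in its own batch.
    """
    batch: List[str] = []
    amino_acids = 0
    for sequence in sequences:
        # Close the current batch if adding this sequence would exceed the budget.
        if batch and amino_acids + len(sequence) > batch_size:
            yield batch
            batch, amino_acids = [], 0
        batch.append(sequence)
        amino_acids += len(sequence)
    if batch:
        yield batch


batches = list(
    batch_by_amino_acids(["AAAA", "CC", "DDDDD", "E", "FFFFFFFFFFFF"], batch_size=6)
)
```

With a budget of 6 amino acids, the five sequences end up in three batches, and the 12-residue sequence forms the last batch by itself.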
- class bio_embeddings.embed.FastTextEmbedder(**kwargs)
- __init__(**kwargs)
- Parameters
model_file – path of model file. If not supplied, will be downloaded.
- class bio_embeddings.embed.GloveEmbedder(**kwargs)
- __init__(**kwargs)
- Parameters
model_file – path of model file. If not supplied, will be downloaded.
- class bio_embeddings.embed.OneHotEncodingEmbedder(device: Union[None, str, torch.device] = None, **kwargs)
Baseline embedder: One hot encoding as per-residue embedding, amino acid composition for per-protein
This embedder is meant to be used as a naive baseline for comparing different types of inputs or training methods.
While options such as device aren’t used, you may still pass them for consistency.
- class bio_embeddings.embed.PLUSRNNEmbedder(device: Union[None, str, torch.device] = None, **kwargs)
PLUS RNN Embedder
Pre-Training of Deep Bidirectional Protein Sequence Representations with Structural Information Seonwoo Min, Seunghyun Park, Siwon Kim, Hyun-Soo Choi, Sungroh Yoon https://arxiv.org/abs/1912.05625
- __init__(device: Union[None, str, torch.device] = None, **kwargs)
Initializer accepts location of a pre-trained model and options
- embed(sequence: str) → numpy.ndarray
Returns embedding for one sequence.
- Parameters
sequence – Valid amino acid sequence as String
- Returns
An embedding of the sequence.
- class bio_embeddings.embed.ProtTransAlbertBFDEmbedder(**kwargs)
ProtTrans-Albert-BFD Embedder (ProtAlbert-BFD)
Elnaggar, Ahmed, et al. “ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing.” arXiv preprint arXiv:2007.06225 (2020). https://arxiv.org/abs/2007.06225
- class bio_embeddings.embed.ProtTransBertBFDEmbedder(**kwargs)
ProtTrans-Bert-BFD Embedder (ProtBert-BFD)
Elnaggar, Ahmed, et al. “ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing.” arXiv preprint arXiv:2007.06225 (2020). https://arxiv.org/abs/2007.06225
- class bio_embeddings.embed.ProtTransT5BFDEmbedder(**kwargs)
Encoder of the ProtTrans T5 model trained on BFD. Consider using ProtTransT5XLU50Embedder instead of this.
We recommend setting half_model=True, which on the tested GPU (Quadro RTX 3000) reduces memory consumption from 12GB to 7GB while the effect in benchmarks is negligible (±0.1 percentage points in different sets, generally below standard error).
- class bio_embeddings.embed.ProtTransT5UniRef50Embedder(**kwargs)
Encoder of the ProtTrans T5 model trained on BFD and finetuned on UniRef 50. Consider using ProtTransT5XLU50Embedder instead of this.
We recommend setting half_model=True, which on the tested GPU (Quadro RTX 3000) reduces memory consumption from 12GB to 7GB while the effect in benchmarks is negligible (±0.1 percentage points in different sets, generally below standard error).
- class bio_embeddings.embed.ProtTransT5XLU50Embedder(**kwargs)
Encoder of the ProtTrans T5 model trained on BFD and finetuned on UniRef 50.
We recommend setting half_model=True, which on the tested GPU (Quadro RTX 3000) reduces memory consumption from 12GB to 7GB while the effect in benchmarks is negligible (±0.1 percentage points in different sets, generally below standard error).
- class bio_embeddings.embed.ProtTransXLNetUniRef100Embedder(**kwargs)
ProtTrans-XLNet-UniRef100 Embedder (ProtXLNet)
Elnaggar, Ahmed, et al. “ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Deep Learning and High Performance Computing.” arXiv preprint arXiv:2007.06225 (2020). https://arxiv.org/abs/2007.06225
- embed(sequence: str) → numpy.ndarray
Returns embedding for one sequence.
- Parameters
sequence – Valid amino acid sequence as String
- Returns
An embedding of the sequence.
- class bio_embeddings.embed.SeqVecEmbedder(warmup_rounds: int = 4, **kwargs)
SeqVec Embedder
Heinzinger, Michael, et al. “Modeling aspects of the language of life through transfer-learning protein sequences.” BMC bioinformatics 20.1 (2019): 723. https://doi.org/10.1186/s12859-019-3220-8
- __init__(warmup_rounds: int = 4, **kwargs)
Initialize Elmo embedder. Can define non-positional arguments for paths of files and other settings.
- Parameters
warmup_rounds – A sample sequence will be embedded this often to work around elmo’s non-determinism (https://github.com/allenai/allennlp/blob/v0.9.0/tutorials/how_to/elmo.md#notes-on-statefulness-and-non-determinism)
weights_file – path of weights file
options_file – path of options file
model_directory – Alternative of weights_file/options_file
max_amino_acids – max # of amino acids to include in embed_many batches. Default: 15k AA
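The purpose of warmup_rounds can be illustrated with a toy stateful model. Everything in this sketch is hypothetical (the StatefulToyModel class, its drift term, and the sample sequence); it only shows why embedding a sample sequence a few times before real use can make later results reproducible, and says nothing about how ELMo's internal state actually behaves.

```python
class StatefulToyModel:
    """Hypothetical model whose output drifts until its internal state settles."""

    def __init__(self) -> None:
        self.calls = 0

    def embed(self, sequence: str) -> int:
        # The drift term shrinks with every call and reaches 0 after four calls.
        drift = max(0, 4 - self.calls)
        self.calls += 1
        return len(sequence) + drift


def warm_up(model: StatefulToyModel, warmup_rounds: int = 4, sample: str = "SEQWENCE") -> None:
    """Embed a sample sequence warmup_rounds times and discard the results."""
    for _ in range(warmup_rounds):
        model.embed(sample)


cold = StatefulToyModel()
cold_first = cold.embed("ACD")  # still affected by the initial state

warm = StatefulToyModel()
warm_up(warm)
warm_first = warm.embed("ACD")
warm_second = warm.embed("ACD")
```

After the warm-up, repeated calls on the same sequence agree with each other, whereas the first call on a cold model does not match them.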
- class bio_embeddings.embed.UniRepEmbedder(device: Union[None, str, torch.device] = None, **kwargs)
UniRep Embedder
Alley, E.C., Khimulya, G., Biswas, S. et al. Unified rational protein engineering with sequence-based deep representation learning. Nat Methods 16, 1315–1322 (2019). https://doi.org/10.1038/s41592-019-0598-1
We use a reimplementation of unirep:
Ma, Eric, and Arkadij Kummer. “Reimplementing Unirep in JAX.” bioRxiv (2020). https://doi.org/10.1101/2020.05.11.088344
- __init__(device: Union[None, str, torch.device] = None, **kwargs)
Initializer accepts location of a pre-trained model and options
- class bio_embeddings.embed.Word2VecEmbedder(**kwargs)
- __init__(**kwargs)
- Parameters
model_file – path of model file. If not supplied, will be downloaded.