Colab initialization

  • Install the pipeline in the Colab runtime.

  • Download the files necessary for this example.

!pip3 install -U pip > /dev/null
!pip3 install -U "bio_embeddings[all]" > /dev/null
!wget http://data.bioembeddings.com/public/embeddings/notebooks/custom_data/tiny_sampled.fasta --output-document tiny_sampled.fasta

Embed sequences in a FASTA file

from bio_embeddings.embed import ProtTransBertBFDEmbedder
from Bio import SeqIO
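# Read every record from the FASTA file into a list of SeqRecord objects.
sequences = list(SeqIO.parse("tiny_sampled.fasta", "fasta"))
# Instantiate the embedder (this loads the model).
embedder = ProtTransBertBFDEmbedder()
# Pass the plain amino-acid strings to the embedder.
embeddings = embedder.embed_many([str(s.seq) for s in sequences])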

# `embed_many` returns a generator, so nothing has been computed yet.
# We want to keep both the raw (per-amino-acid) embeddings and the reduced
# (per-protein) embeddings in memory, so we turn the generator into a list.
# Note: this is the step that actually embeds the sequences.
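
The lazy behaviour described in the comments above can be illustrated without the pipeline. The snippet below is a minimal sketch in plain Python: `toy_embed_many` and `computed` are hypothetical stand-ins invented for this illustration, not part of bio_embeddings, and the "embedding" is just the sequence length.

```python
from typing import Iterable, Iterator, List

# Records which sequences have actually been processed (illustration only).
computed: List[str] = []


def toy_embed_many(sequences: Iterable[str]) -> Iterator[List[float]]:
    """Hypothetical stand-in for an embedder: yields one fake vector per sequence."""
    for seq in sequences:
        computed.append(seq)
        yield [float(len(seq))]


lazy = toy_embed_many(["MKV", "AC"])
work_before = len(computed)  # creating the generator runs no work
results = list(lazy)         # consuming the generator does the work
work_after = len(computed)
leftover = list(lazy)        # a generator can only be consumed once

print(work_before, work_after, results, leftover)
```

Because a generator is exhausted after one pass, storing the results in a list is what makes it possible to iterate over the embeddings more than once.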

embeddings = list(embeddings)
reduced_embeddings = [ProtTransBertBFDEmbedder.reduce_per_protein(e) for e in embeddings]
for (per_amino_acid, per_protein) in zip(embeddings, reduced_embeddings):
    print(per_amino_acid.shape, per_protein.shape)
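
To make the two printed shapes easier to interpret, the following sketch shows one common way to reduce a per-amino-acid embedding to a per-protein embedding: averaging over the sequence axis. It assumes that `reduce_per_protein` behaves like this mean pooling and that the embedding dimension is 1024; the function `mean_pool` and the random toy data are hypothetical and only serve the illustration.

```python
import numpy as np


def mean_pool(per_amino_acid: np.ndarray) -> np.ndarray:
    """Average over the sequence axis: (length, dim) -> (dim,).

    Assumption: this mirrors what `reduce_per_protein` does.
    """
    return per_amino_acid.mean(axis=0)


# Toy stand-in for a real embedding: a 7-residue protein, 1024 features each.
rng = np.random.default_rng(0)
toy_embedding = rng.normal(size=(7, 1024))

pooled = mean_pool(toy_embedding)
print(toy_embedding.shape, pooled.shape)
```

Under this assumption, the first shape printed by the loop above varies with the sequence length, while the second is the same fixed-size vector for every protein.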