Colab initialization
Install the pipeline in the Colab runtime
Download the files necessary for this example
!pip3 install -U pip > /dev/null
!pip3 install -U "bio-embeddings[all] @ git+https://github.com/sacdallago/bio_embeddings.git" > /dev/null
!wget http://data.bioembeddings.com/public/embeddings/notebooks/custom_data/tiny_sampled.fasta --output-document tiny_sampled.fasta
Embed sequences in a FASTA file
from bio_embeddings.embed import ProtTransBertBFDEmbedder
from Bio import SeqIO
sequences = []
for record in SeqIO.parse("tiny_sampled.fasta", "fasta"):
    sequences.append(record)
embedder = ProtTransBertBFDEmbedder()
embeddings = embedder.embed_many([str(s.seq) for s in sequences])
# `embed_many` returns a generator.
# We want to keep both RAW embeddings and reduced embeddings in memory.
# To do so, we simply turn the generator into a list!
# (this will start embedding the sequences!)
embeddings = list(embeddings)
reduced_embeddings = [ProtTransBertBFDEmbedder.reduce_per_protein(e) for e in embeddings]
for per_amino_acid, per_protein in zip(embeddings, reduced_embeddings):
    print(per_amino_acid.shape, per_protein.shape)
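
The two ideas in the cell above, a lazy generator and a per-protein reduction, can be tried without downloading a model. The following is a minimal sketch using only the standard library. `toy_embed_many` and `mean_pool` are hypothetical stand-ins, not part of `bio_embeddings`, and the sketch assumes that the per-protein reduction is a plain average over the residue axis; check the embedder's source to confirm what `reduce_per_protein` actually does.

```python
from typing import Iterable, Iterator, List

calls = []  # records which sequences have actually been "embedded"


def toy_embed_many(seqs: Iterable[str], dim: int = 4) -> Iterator[List[List[float]]]:
    """Hypothetical stand-in for `embed_many`: yields one L x dim matrix per sequence."""
    for seq in seqs:
        calls.append(seq)
        # One row per residue; the values are arbitrary but deterministic.
        yield [[float(ord(aa) % 7 + j) for j in range(dim)] for aa in seq]


def mean_pool(per_residue: List[List[float]]) -> List[float]:
    """Assumed reduction: average over the residue axis (L x dim -> dim)."""
    n = len(per_residue)
    return [sum(col) / n for col in zip(*per_residue)]


gen = toy_embed_many(["MKT", "GAVLI"])
work_before = len(calls)    # creating the generator has not embedded anything yet
toy_embeddings = list(gen)  # consuming the generator triggers the work
work_after = len(calls)     # both sequences have now been processed

toy_reduced = [mean_pool(e) for e in toy_embeddings]

for per_residue, per_protein in zip(toy_embeddings, toy_reduced):
    # Mimic the `.shape` printout: (residues, dim) next to (dim,)
    print((len(per_residue), len(per_residue[0])), (len(per_protein),))
```

The per-residue matrix grows with sequence length while the reduced vector keeps a fixed size, which is why the reduced form is convenient when comparing proteins of different lengths.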