Colab initialization
Install the pipeline in the Colab runtime
Download files necessary for this example
!pip3 install -U pip > /dev/null
!pip3 install -U bio_embeddings[all] > /dev/null
Extract secondary structure and subcellular localization predictions from SeqVec
In this notebook we extract annotations from SeqVec embeddings using trained models that predict secondary structure and subcellular localization.
from bio_embeddings.embed import SeqVecEmbedder
We initialize the SeqVec embedder.
embedder = SeqVecEmbedder()
We select an amino acid (AA) sequence. In this case, it is the sequence of mitochondrial aspartate aminotransferase.
target_sequence = "MALLHSARVLSGVASAFHPGLAAAASARASSWWAHVEMGPPDPILGVTEAYKRDTNSKKMNLGVGAYRDDNGKPYVLPSVRKAEAQIAAKGLDKEYLPIGGLAEFCRASAELALGENSEVVKSGRFVTVQTISGTGALRIGASFLQRFFKFSRDVFLPKPSWGNHTPIFRDAGMQLQSYRYYDPKTCGFDFTGALEDISKIPEQSVLLLHACAHNPTGVDPRPEQWKEIATVVKKRNLFAFFDMAYQGFASGDGDKDAWAVRHFIEQGINVCLCQSYAKNMGLYGERVGAFTVICKDADEAKRVESQLKILIRPMYSNPPIHGARIASTILTSPDLRKQWLQEVKGMADRIIGMRTQLVSNLKKEGSTHSWQHITDQIGMFCFTGLKPEQVERLTKEFSIYMTKDGRISVAGVTSGNVGYLAHAIHQVTK"
We produce the embedding of the above sequence. Since we only have one sequence, we use the simple embed function rather than embed_many or embed_batch, which we would use instead if we had multiple sequences to embed.
embedding = embedder.embed(target_sequence)
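To make the result of the embed call more concrete, here is a minimal sketch of how a per-residue SeqVec embedding could be reduced. It does not call bio_embeddings: the array is random stand-in data, and both the assumed shape (3 layers, one row per residue, 1024 features) and the reduction (sum over layers, then mean over residues) are assumptions for illustration rather than output of the real embedder.

```python
import numpy as np

# Stand-in for the array returned by embedder.embed(...). The values are random;
# only the ASSUMED shape (layers, residues, features) = (3, L, 1024) matters here.
sequence_length = 12  # hypothetical toy length, not the length of target_sequence
rng = np.random.default_rng(seed=0)
fake_embedding = rng.normal(size=(3, sequence_length, 1024)).astype(np.float32)

# Per-residue representation: sum the three layers -> one vector per residue.
per_residue = fake_embedding.sum(axis=0)

# Per-protein representation: average the residue vectors -> one fixed-size vector.
per_protein = per_residue.mean(axis=0)

print("per-residue shape:", per_residue.shape)
print("per-protein shape:", per_protein.shape)
```

Per-residue vectors suit predictions made for every amino acid, such as secondary structure, while a single per-protein vector suits predictions made once per sequence, such as subcellular localization.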
The bio_embeddings pipeline includes some models trained on embeddings for the prediction of secondary structure and subcellular localization. In the following, we make use of these models.
To speed up processing, we have downloaded the model weights of the supervised subcellular localization and secondary structure prediction models from here.
from bio_embeddings.extract.basic import BasicAnnotationExtractor
annotations_extractor = BasicAnnotationExtractor("seqvec_from_publication")
annotations = annotations_extractor.get_annotations(embedding)
Let’s see which annotations are available from SeqVec:
annotations._fields
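The _fields attribute is the standard way to list the fields of a Python namedtuple. The sketch below uses a hypothetical stand-in built with collections.namedtuple. The class name FakeAnnotations, its field names, and its values are assumptions for illustration, not the actual object returned by get_annotations.

```python
from collections import namedtuple

# Hypothetical stand-in for the annotations object; field names and values
# are made up for illustration only.
FakeAnnotations = namedtuple(
    "FakeAnnotations",
    ["DSSP3", "DSSP8", "disorder", "localization", "membrane"],
)

fake = FakeAnnotations(
    DSSP3=["C", "H", "H"],
    DSSP8=["C", "H", "H"],
    disorder=["-", "-", "-"],
    localization="Mitochondrion",
    membrane="Soluble",
)

# _fields lists the available field names as a tuple of strings.
print(fake._fields)

# Fields can be read by attribute, or generically via getattr.
for name in fake._fields:
    print(name, "->", getattr(fake, name))

# _asdict() converts the namedtuple into a regular mapping.
as_dict = fake._asdict()
```

Iterating over _fields is convenient when you want to inspect or export every annotation without hard-coding the names.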
Let’s print the subcellular localization predicted from the SeqVec embedding:
print(f"The subcellular localization predicted from the embedding is: {annotations.localization.value}")
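The .value access above suggests that the localization is an Enum member. As a minimal sketch of that pattern, the hypothetical FakeLocation enum below (its name and members are assumptions, not the library's classes) shows the difference between a member, its name, and its value.

```python
from enum import Enum


class FakeLocation(Enum):
    """Hypothetical localization classes, for illustration only."""

    MITOCHONDRION = "Mitochondrion"
    CYTOPLASM = "Cytoplasm"


predicted = FakeLocation.MITOCHONDRION

# The member itself is an object; .name is the Python identifier and
# .value is the human-readable label stored in the enum.
print(predicted.name)
print(predicted.value)

# A member can also be looked up from its value.
same_member = FakeLocation("Cytoplasm")
print(same_member is FakeLocation.CYTOPLASM)
```

Printing .value rather than the member keeps the output free of the enum class prefix.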
For per-residue (amino acid level) annotations, e.g. secondary structure, we can use a helper function to format the extracted annotations as a single string:
from bio_embeddings.utilities.helpers import convert_list_of_enum_to_string
print("The predicted secondary structure (red) of the sequence is:")
for (AA, DSSP3) in zip(target_sequence, convert_list_of_enum_to_string(annotations.DSSP3)):
    print(f"\x1B[30m{AA}\x1b[31m{DSSP3}")
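As an alternative to colored output, the sequence and its predicted structure can be printed as aligned rows, which also works where ANSI colors are not rendered. The following is a minimal sketch: FakeDSSP3, enums_to_string, and format_aligned are hypothetical names written for this example, and the toy prediction is made up. enums_to_string only illustrates what a helper that joins enum values into a string might do, and is not the library's implementation.

```python
from enum import Enum


class FakeDSSP3(Enum):
    """Hypothetical three-state secondary structure classes."""

    HELIX = "H"
    STRAND = "E"
    COIL = "C"


def enums_to_string(members):
    # Join the one-letter value of each enum member into a single string.
    return "".join(member.value for member in members)


def format_aligned(sequence, structure, width=60):
    # Split both strings into blocks of `width` characters and stack each
    # sequence block above its matching structure block.
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    blocks = []
    for start in range(0, len(sequence), width):
        end = start + width
        blocks.append(sequence[start:end] + "\n" + structure[start:end])
    return "\n\n".join(blocks)


# Toy data: the first ten residues of the sequence and a made-up prediction.
toy_sequence = "MALLHSARVL"
toy_prediction = (
    [FakeDSSP3.COIL] * 2 + [FakeDSSP3.HELIX] * 5
    + [FakeDSSP3.COIL] + [FakeDSSP3.STRAND] * 2
)

toy_structure = enums_to_string(toy_prediction)
aligned = format_aligned(toy_sequence, toy_structure, width=5)
print(aligned)
```

The length check guards against silently misaligned rows, which zip would otherwise hide by truncating to the shorter input.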