Colab initialization

  • install the pipeline in the Colab runtime

  • download files necessary for this example

!pip3 install -U pip > /dev/null
!pip3 install -U "bio_embeddings[all]" > /dev/null

Extract secondary structure and subcellular localization predictions from SeqVec

In this notebook we extract annotations from SeqVec embeddings using trained models that predict secondary structure and subcellular localization.

from bio_embeddings.embed import SeqVecEmbedder

We initialize the SeqVec embedder.

embedder = SeqVecEmbedder()

We select an amino acid (AA) sequence. In this case, it is the sequence of mitochondrial aspartate aminotransferase.

target_sequence = "MALLHSARVLSGVASAFHPGLAAAASARASSWWAHVEMGPPDPILGVTEAYKRDTNSKKMNLGVGAYRDDNGKPYVLPSVRKAEAQIAAKGLDKEYLPIGGLAEFCRASAELALGENSEVVKSGRFVTVQTISGTGALRIGASFLQRFFKFSRDVFLPKPSWGNHTPIFRDAGMQLQSYRYYDPKTCGFDFTGALEDISKIPEQSVLLLHACAHNPTGVDPRPEQWKEIATVVKKRNLFAFFDMAYQGFASGDGDKDAWAVRHFIEQGINVCLCQSYAKNMGLYGERVGAFTVICKDADEAKRVESQLKILIRPMYSNPPIHGARIASTILTSPDLRKQWLQEVKGMADRIIGMRTQLVSNLKKEGSTHSWQHITDQIGMFCFTGLKPEQVERLTKEFSIYMTKDGRISVAGVTSGNVGYLAHAIHQVTK"
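Before embedding a sequence, it can help to confirm that it contains only the 20 standard amino acid letters. The check below is a minimal sketch: `find_nonstandard_residues` is a hypothetical helper written for this example, not part of bio_embeddings, and how the embedder treats other letters is not covered here.

```python
# Hypothetical helper for illustration; not part of bio_embeddings.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def find_nonstandard_residues(sequence):
    """Return (position, letter) pairs for letters outside the 20 standard amino acids.

    Positions are 1-based, as is usual for protein sequences.
    """
    return [
        (position, letter)
        for position, letter in enumerate(sequence, start=1)
        if letter.upper() not in STANDARD_AMINO_ACIDS
    ]


# First 20 residues of the sequence above, used here as a short example.
example_sequence = "MALLHSARVLSGVASAFHPG"
problems = find_nonstandard_residues(example_sequence)
print(f"{len(example_sequence)} residues checked, {len(problems)} non-standard")
```

In the notebook, the same call could be made on `target_sequence` directly.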

We produce the embedding of the above sequence. Since we only have one sequence, we use the simple embed function rather than embed_many or embed_batch, which we would use if we had multiple sequences to embed.

embedding = embedder.embed(target_sequence)
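To get a feel for what `embed` returns, the sketch below uses a zero-filled NumPy array as a stand-in. It assumes SeqVec yields one 1024-dimensional vector per residue for each of its three layers, i.e. an array of shape `(3, L, 1024)`. The sum-then-mean reduction shown is an assumption about how a fixed-size per-protein vector is usually derived from such an array, and the length of 12 is arbitrary.

```python
import numpy as np

# Stand-in for the array returned by embedder.embed(...).
# Assumed layout: (layers, residues, features) = (3, L, 1024).
sequence_length = 12  # arbitrary length chosen for this example
toy_embedding = np.zeros((3, sequence_length, 1024), dtype=np.float32)

# Collapse the layer axis: one vector per residue.
per_residue = toy_embedding.sum(axis=0)

# Average over residues: one fixed-size vector for the whole protein.
per_protein = per_residue.mean(axis=0)

print(per_residue.shape)
print(per_protein.shape)
```

With the real embedding, `embedding.shape` is a quick way to check whether this layout assumption holds. The embedder classes also provide a `reduce_per_protein` method intended for the per-protein case.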

The bio_embeddings pipeline includes models trained on embeddings to predict secondary structure and subcellular localization. In the following, we make use of these models.

To speed up processing, we have downloaded the model weights of the supervised subcellular localization and secondary structure prediction models from here.

from bio_embeddings.extract.basic import BasicAnnotationExtractor
annotations_extractor = BasicAnnotationExtractor("seqvec_from_publication")
annotations = annotations_extractor.get_annotations(embedding)

Let’s see which annotations are available from SeqVec:

annotations._fields

Let’s print the subcellular localization predicted from the SeqVec embedding:

print(f"The subcellular localization predicted from the embedding is: {annotations.localization.value}")
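The use of `_fields` and `.value` above suggests that the annotations behave like a named tuple whose entries are enum members. The following is a minimal sketch of that pattern using only the standard library. The classes `ToyLocation`, `ToyStructure` and `ToyAnnotations`, and all their member names and values, are hypothetical stand-ins invented for this example, not the library's own definitions.

```python
from collections import namedtuple
from enum import Enum


# Hypothetical stand-ins; the real classes are defined in bio_embeddings.
class ToyLocation(Enum):
    MITOCHONDRION = "Mitochondrion"
    CYTOPLASM = "Cytoplasm"


class ToyStructure(Enum):
    HELIX = "H"
    STRAND = "E"
    COIL = "C"


ToyAnnotations = namedtuple("ToyAnnotations", ["DSSP3", "localization"])

toy_annotations = ToyAnnotations(
    DSSP3=[ToyStructure.COIL, ToyStructure.HELIX, ToyStructure.HELIX],
    localization=ToyLocation.MITOCHONDRION,
)

# A named tuple lists its field names in `_fields`.
print(toy_annotations._fields)

# A per-protein annotation is a single enum member; `.value` is its label.
print(toy_annotations.localization.value)

# A per-residue annotation is a list of enum members, one per residue.
print("".join(state.value for state in toy_annotations.DSSP3))
```

The last line joins the per-residue values into one string, which is presumably what a helper for formatting residue-level annotations needs to do.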

For per-residue (AA) annotations, e.g. secondary structure, we can use a helper function to format the extracted annotations as a single string:

from bio_embeddings.utilities.helpers import convert_list_of_enum_to_string
print("The predicted secondary structure (red) of the sequence is:")

for (AA, DSSP3) in zip(target_sequence, convert_list_of_enum_to_string(annotations.DSSP3)):
    # Residue in black, predicted state in red; reset the colour afterwards
    print(f"\x1b[30m{AA}\x1b[31m{DSSP3}\x1b[0m")
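Printing one residue per line gets long for a protein of this size. As an alternative, the sketch below wraps the sequence and its predicted states into aligned blocks and reports the fraction of each state. Both functions are hypothetical helpers written for this example. They assume the secondary structure is available as a string with one letter per residue (for three-state predictions, typically `H`, `E` and `C`).

```python
from collections import Counter


def summarize_states(states):
    """Return the fraction of residues assigned to each state letter."""
    counts = Counter(states)
    total = len(states)
    return {state: count / total for state, count in counts.items()}


def format_aligned(sequence, states, width=60):
    """Return lines alternating residues and states, wrapped at `width` columns."""
    if len(sequence) != len(states):
        raise ValueError("sequence and states must have the same length")
    lines = []
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
        lines.append(states[start:start + width])
    return lines


# Toy input: six residues with made-up state letters.
toy_sequence = "MALLHS"
toy_states = "CCHHHH"

for line in format_aligned(toy_sequence, toy_states, width=4):
    print(line)
print(summarize_states(toy_states))
```

In the notebook, `toy_sequence` and `toy_states` would be replaced by `target_sequence` and the string returned by `convert_list_of_enum_to_string(annotations.DSSP3)`, assuming that string has one character per residue.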