{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Colab initialization\n",
    "- install the pipeline in the colab runtime\n",
    "- download files neccessary for this example"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip3 install -U pip > /dev/null\n",
    "!pip3 install -U bio_embeddings[all] > /dev/null"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "pycharm": {
     "is_executing": false
    }
   },
   "source": [
    "# Extract secondary structure and subcellular localization predictions from SeqVec\n",
    "In this notebook we will extract annotations from SeqVec embeddings via trained models that can predict secondary structure and subcellular localization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bio_embeddings.embed import SeqVecEmbedder"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We initialize the SeqVec embedder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "embedder = SeqVecEmbedder()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We select an AA sequence. In this case, the sequence is that of [Aspartate aminotransferase, mitochondrial](https://www.uniprot.org/uniprot/P12345)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target_sequence = \"MALLHSARVLSGVASAFHPGLAAAASARASSWWAHVEMGPPDPILGVTEAYKRDTNSKKMNLGVGAYRDDNGKPYVLPSVRKAEAQIAAKGLDKEYLPIGGLAEFCRASAELALGENSEVVKSGRFVTVQTISGTGALRIGASFLQRFFKFSRDVFLPKPSWGNHTPIFRDAGMQLQSYRYYDPKTCGFDFTGALEDISKIPEQSVLLLHACAHNPTGVDPRPEQWKEIATVVKKRNLFAFFDMAYQGFASGDGDKDAWAVRHFIEQGINVCLCQSYAKNMGLYGERVGAFTVICKDADEAKRVESQLKILIRPMYSNPPIHGARIASTILTSPDLRKQWLQEVKGMADRIIGMRTQLVSNLKKEGSTHSWQHITDQIGMFCFTGLKPEQVERLTKEFSIYMTKDGRISVAGVTSGNVGYLAHAIHQVTK\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We produce the embeddings of the above sequence. Since we only have one sequence, we use the simple `embed` function, rather than the `embed_many` or `embed_batch`, which we would instead use if we had multiple sequences to embed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "embedding = embedder.embed(target_sequence)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `bio_embeddings` pipeline includes some models trained on embeddings for the prediction of Secondary Structure and Subcellular Localization. In the following we make use of these models.\n",
    "\n",
    "To speed up processing, we have downloaded the model weights of the supervised subcellular localization and secondary structure prediction models from [here](http://maintenance.dallago.us/public/embeddings/feature_models/seqvec/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bio_embeddings.extract.basic import BasicAnnotationExtractor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "annotations_extractor = BasicAnnotationExtractor(\"seqvec_from_publication\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "annotations = annotations_extractor.get_annotations(embedding)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see what annotations are available from SeqVec"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "annotations._fields"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's print the subcellular localization predicted via the SeqVec embeddings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"The subcellular localization predicted from the embedding is: {annotations.localization.value}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For AA-annotations, e.g. secondary structure, we can use a helper function to format the extracted annotations as a single string:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bio_embeddings.utilities.helpers import convert_list_of_enum_to_string"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"The predicted secondary structure (red) of the sequence is:\")\n",
    "\n",
    "for (AA, DSSP3) in zip(target_sequence, convert_list_of_enum_to_string(annotations.DSSP3)):\n",
    "    print(f\"\\x1B[30m{AA}\\x1b[31m{DSSP3}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}