Colab initialization¶

install the pipeline in the colab runtime
download files neccessary for this example

!pip3 install -U pip > /dev/null
!pip3 install -U bio_embeddings[all] > /dev/null

!wget http://data.bioembeddings.com/public/embeddings/reference/deeploc/reduced_embeddings_file.h5 --output-document reduced_embeddings_file.h5
!wget http://data.bioembeddings.com/public/embeddings/reference/deeploc/annotations.csv --output-document annotations.csv

Train a simple Neural Network to predict subcellular localization¶

In this notebook we use the embeddings and annotations of the DeepLoc dataset to train a supervised classifier to predict subcellular localization based on embedding.

You can obtain these embeddings yourself by running the deeploc example.

We use popular libraries: sci-kit learn for the machine learning part, pandas for data reading and management, and Plotly for visualizations.

What you will need to perform something similar on your custom dataset are:

embeddings for your dataset (using the embedder of your choice)
an annotation file with labels you would like to predict

WARNING¶

This is a simplified notebook with pre-processed data. When training on protein data, you need special attention to what you are doing, e.g. redunancy reduce training and testing sets. Additionally, it’s good practice to have three sets: training, testing and validation. In this case we only use two (training and testing). In general: you should use training and validation for parameter optimization (e.g. selecting how many layers produce the best prediction) and then test your final model on the test set. For best practices in machine learning with protein sequences check out Bioinformatics and the chapter Predictive methods using protein sequences.

# Basic Protocol 3 — Step 4

import h5py
import numpy as np

# Data utilities
from pandas import read_csv
from bio_embeddings.utilities import QueryEmbeddingsFile

# Machine learning
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

# Basic Protocol 3 — Step 5
annotations = read_csv('annotations.csv')

If we print the first few lines in our annotations CSV file we notice a “set” column. This column tells us whether that sample should be used for testing

annotations[:3]

We devide our data based on samples labeled as for testing and for training (aka: not for testing).

# Basic Protocol 3 — Step 6
train_set = annotations[annotations.set == "train"]
test_set = annotations[annotations.set == "test"]

print(f"The train set contains {len(train_set)} samples, and we will test on {len(test_set)} samples.")

We create two sets:

one for training (containing training identifiers, embeddings and labels)
one for testing (containing testing identifiers, embeddings and labels)

Embeddings are the “features” of our samples, so if you are used to thinking about X and y embeddings are our X, and labels are our y.

# Basic Protocol 3 — Step 7

training_embeddings = list()
training_identifiers = train_set.identifier.values
training_labels = train_set.label.values

testing_embeddings = list()
testing_identifiers = test_set.identifier.values
testing_labels = test_set.label.values

with h5py.File('reduced_embeddings_file.h5', 'r') as embeddings_file:
    embedding_querier = QueryEmbeddingsFile(embeddings_file)
    
    for identifier in training_identifiers:
        training_embeddings.append(embedding_querier.query_original_id(identifier))
    
    for identifier in testing_identifiers:
        testing_embeddings.append(embedding_querier.query_original_id(identifier))
        
# A sanity check: make sure that the numbers are equal!
assert(len(training_identifiers) == len(training_embeddings))
assert(len(testing_identifiers) == len(testing_embeddings))

In the following we define the basic neural network architecture (multilayerperceptron) and a set of parameters that we want to test during parameter optimization. The basic architecture uses the “Limited-memory Broyden–Fletcher–Goldfarb–Shanno Algorithm” solver and a maximum of 1000 training iterations.

The parameter which we want to optimize is the amount of hidden layers, as well as its size. In one case, we will try a network with one hidden layer containing 30 nodes. In another case, we will test a network with two hidden layers and 10 nodes.

You can find more information about the MLPClassifier here: https://scikit-learn.org/stable/modules/neural_networks_supervised.html

# Basic Protocol 3 — Step 8

multilayerperceptron = MLPClassifier(solver='lbfgs', random_state=10, max_iter=1000)

parameters = {
    'hidden_layer_sizes': [(30,), (10,2)]
}

Warning: this steps takes some time! We are now going to train several neural networks using cross validation!¶

We now use the training_* data to train 6 Multi-Layer Perceptrons. Three neural networks will be trained using a neural network architecture with one hidden layer with 30 nodes, while the other three will be trained on a neural network architecture with two hidden layers each with 10 nodes. Each of the three networks will be trained on different validation splits. The various neural networks will be scored according to accuracy.

# Basic Protocol 3 — Step 9

classifiers = GridSearchCV(multilayerperceptron, parameters, cv=3, scoring="accuracy")
classifiers.fit(training_embeddings, training_labels)
classifier = classifiers.best_estimator_

We can now use the classifier to predict labels based on embeddings. For evaluation purposes, we predict the labels of our testing set.

Finally, we evaluate the accuracy of our predictions using “accuracy” from sci-kit learn.

Read more about accuracy: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score

# Basic Protocol 3 — Step 10

predicted_testing_labels = classifier.predict(testing_embeddings)
accuracy = accuracy_score(testing_labels, predicted_testing_labels)

print(f"Our model has an accuracy of {accuracy:.2}")

Predict subcellular localization of arbitrary sequence¶

What if we want to predict the subcellular localization of an arbitrary sequence using our newly built classifier?

# Basic Protocol 3 — Step 11

from bio_embeddings.embed import ProtTransBertBFDEmbedder

embedder = ProtTransBertBFDEmbedder()

# Sequence from: https://www.uniprot.org/uniprot/P58426
sequence = "DDCGKLFSGCDTNADCCEGYVCRLWCKLDW"
per_reisdue_embedding = embedder.embed(sequence)
per_protein_embedding = embedder.reduce_per_protein(per_reisdue_embedding)
sequence_subcellular_prediction = classifier.predict([per_protein_embedding])[0]

print("The arbitrary sequence is predicted to be located in: "
      f"{sequence_subcellular_prediction}")

Going one step back, we can inspect how well our 6 models performed (on their respective validation sets)

from pandas import DataFrame
cv_results = DataFrame(classifiers.cv_results_)

cv_results[['param_hidden_layer_sizes','split0_test_score', 'split1_test_score', 'split2_test_score']]

Another interesting evaluation when dealing with locaization prediction is the visualization of the “confusion matrix”. What this matrix tells us is how many samples have been predicted of a certain class when in reality they were of another class. Ideally, the diagonal of this matrix should be

# Further metrics
from sklearn.metrics import confusion_matrix

# Data visualization
import plotly.express as px

classes = np.unique(testing_labels)

confusion_matrix_data = confusion_matrix(testing_labels, predicted_testing_labels, labels=classes)

confusion_matrix_figure = px.imshow(
    confusion_matrix_data,
    labels=dict(x="True label", y="Predicted label", color="# of samples"),
    x=classes,
    y=classes,
    color_continuous_scale='Gray_r'
)
confusion_matrix_figure.show()