Colab initialization

  • install the pipeline in the colab runtime

  • download files neccessary for this example

!pip3 install -U pip > /dev/null
!pip3 install -U "bio-embeddings[all] @ git+https://github.com/sacdallago/bio_embeddings.git" > /dev/null
!wget http://data.bioembeddings.com/public/embeddings/notebooks/custom_data/disprot_2019_09_labelled_0.2_0.8.csv --output-document disprot_2019_09_labelled_0.2_0.8.csv

Remove some annotations from an annotation file

In order to make sure that you are only transfering annotations for embeddings in the reference embedding file, the pipeline checks that all identifiers in the reference annotations are present in the reference embeddings. If there is a mismatch (you have annotations for sequences/embeddings not present in the reference embeddings), the pipeline will ask you to remove those annotations from your annotation file. This is done to make sure that you don’t believe a certain sequence/embedding is in your reference set, if it in fact is not!

from bio_embeddings.utilities import remove_identifiers_from_annotations_file
faulty_identifiers = ['Q12905','Q98XH7','P61244-2','Q62627','O14936','P19619','Q86UX7','B7T1D9','Q8K4J6','Q10Q08','Q99ML1','O76070','Q00987','P45481','Q9Y6Q9','O95718','P10587','Q9UM11','P02313','P03347','Q9UGL1','P05107','Q9NR61','Q13698','P00533','O60829','P00514','Q71UI9','Q13541','Q9Z0P7','P02511','P17677','P12493','P38919','P22303','P29990','Q9NHC3','O00204-2','O88597','Q548Y4','P04486','P52564','P24928','Q2YHF0','P26554','Q15796','Q9BRG1','Q04207','P63165','Q13573','P14921','Q6P8Z1','P27577','Q9JK11','P62152','P16535','Q868N5','P45561','P14340','P17763','P62326','Q92731','P06935','P03254','P61925','Q09472','P10636-8','P30281','P96884','P13861','Q63450','P17870','O43236-6','Q9Q6P4','P42763','Q9NQA5','P04052','P42759','Q9FG31','P23443','P04370-5','Q03519','P20810','P27958','P02628','C4PB33','Q96270','P41351','P0CE48','P05106','K7J0R2','B9UCQ5','P06239','P27782','O92972','P10923','P84051','P08592','P19599','P11632','P12506','P02686-5','P81558','P84022','Q13469','M5BF30','Q15797','Q9WMX2','A0A068MVV3','C4M0U8','P02619','P27695','Q1K7R9','P42224','Q60795','Q32ZE1','A0A1Z3GD08','O94444','E2IHW6','P02259','D2JX42','P31751','P03129','P12823','P35222','P10114','Q96RI1','P03404','P16220','P0A7L8','P03259','P05318','P68363','P49723','P02687','Q61548','P18212','P60604','P14335','P07746','P27285','Q16143','Q9YGY0','E1UJ20','Q16222-2','P17639','P46108-2','Q5UB51']
filtered_annotations_file = remove_identifiers_from_annotations_file(
    faulty_identifiers,
    "disprot_2019_09_labelled_0.2_0.8.csv")
filtered_annotations_file[:10]
filtered_annotations_file.to_csv("disprot_2019_09_labelled_0.2_0.8_filtered.csv", index=False)