Add a new language model/embedder

  • Pick a name, which should be the one you use in the publication, and a lowercase version with underscores (snake_case), e.g. for one-hot encoding we use one_hot_encoding. The class name is the CamelCase version, in this case OneHotEncodingEmbedder. Be consistent about where you place the underscores.

  • Add all new dependencies to pyproject.toml in a new extra (see the sketch after this list)

  • Add an entry to bio_embeddings/utilities/defaults.yml with a link to the weights (see the sketch after this list)

  • Create a new class in bio_embeddings/embed that at least implements EmbedderInterface, or, even better (for GPU-based models), EmbedderWithFallback. The simplest example is OneHotEncodingEmbedder; a more realistic example is ProtTransT5Embedder and its subclasses (a minimal sketch follows this list). If you add any new options, add them to KNOWN_EMBED_OPTIONS in bio_embeddings/embed/pipeline.py

  • Add the class to bio_embeddings/embed/__init__.py

  • The following two steps are checked by running SKIP_SLOW_TESTS=1 pytest:

    • Add the model size to the docs in bio_embeddings/embed/__init__.py

    • Add it to DEFAULT_MAX_AMINO_ACIDS

  • Add it to the tests following the instructions in tests/test_embedder_embedding.py

  • Write a pipeline config that uses your embedder and check that it runs (see the example after this list)

  • Send a pull request 🚀
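
The sketches below illustrate the steps above; every name, version, URL, and path in them is a placeholder, not the project's actual content.

Dependency extra (pyproject.toml): assuming a Poetry-managed pyproject.toml (check the existing file for the actual layout), a new extra could look like this, with my_embedder and some-model-package as hypothetical names:

```toml
# Hypothetical extra for a new embedder; all names and versions are placeholders.
[tool.poetry.dependencies]
# Mark the new dependency as optional so it is only installed with the extra.
some-model-package = { version = "^1.0", optional = true }

[tool.poetry.extras]
my_embedder = ["some-model-package"]
```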
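
Weights entry (defaults.yml): the key is the snake_case embedder name; the file key below (model_file) and the URL are assumptions, so mirror the keys that the existing entries in bio_embeddings/utilities/defaults.yml use for comparable models:

```yaml
# Hypothetical entry; key names must match the files your embedder downloads.
my_new_embedder:
  model_file: https://example.com/models/my_new_embedder/weights.pt
```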
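
Embedder class: a minimal sketch modeled on OneHotEncodingEmbedder. The class name MyNewEmbedder, the module path, the embedding dimension, and the stubbed model call are all placeholders; check EmbedderInterface (and, for GPU models, EmbedderWithFallback and ProtTransT5Embedder) for the authoritative set of required members.

```python
# bio_embeddings/embed/my_new_embedder.py -- hypothetical module
import numpy
from numpy import ndarray

# Adjust the import to wherever EmbedderInterface actually lives in the package.
from bio_embeddings.embed.embedder_interfaces import EmbedderInterface


class MyNewEmbedder(EmbedderInterface):
    """Sketch of a per-residue embedder; all values are placeholders."""

    # The snake_case name, also used as `protocol` in pipeline configs.
    name = "my_new_embedder"
    # Placeholder per-residue embedding size and layer count.
    embedding_dimension = 1024
    number_of_layers = 1

    def embed(self, sequence: str) -> ndarray:
        # Return one vector per residue: shape (len(sequence), embedding_dimension).
        # A real embedder would run the language model here; this stub returns zeros.
        return numpy.zeros((len(sequence), self.embedding_dimension), dtype=numpy.float32)

    @staticmethod
    def reduce_per_protein(embedding: ndarray) -> ndarray:
        # Collapse the per-residue embedding into one per-protein vector.
        return embedding.mean(axis=0)
```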
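
Pipeline run: a minimal config sketch. The stage name, file names, and prefix are placeholders; the protocol must be the snake_case name of the new embedder, and any options you use must be in KNOWN_EMBED_OPTIONS. Compare against one of the shipped example configs before relying on the exact keys shown here:

```yaml
# config.yml -- hypothetical minimal run with the new embedder
global:
  sequences_file: sequences.fasta   # input FASTA (placeholder path)
  prefix: my_new_embedder_run       # output directory (placeholder)

embeddings:                         # stage name (arbitrary)
  type: embed
  protocol: my_new_embedder         # snake_case name of the new embedder
  reduce: True                      # also write per-protein embeddings
```

Run it with bio_embeddings config.yml and check that the embedding files show up under the prefix directory.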