Add a new language model/embedder
- Pick a name, which should be the one you're using in the publication, and a lowercase version with underscores (snake_case). E.g. for one hot encoding, we use one_hot_encoding. The class name is the CamelCase version with an Embedder suffix, in this case OneHotEncodingEmbedder. Stay consistent where you place the underscores.
- Add all new dependencies in pyproject.toml in a new extra.
- Add an entry to bio_embeddings/utilities/defaults.yml with a link to the weights.
- Create a new class in bio_embeddings/embed that at least implements EmbedderInterface, or even better (for GPU-based models) EmbedderWithFallback. The simplest example is OneHotEncodingEmbedder; a more realistic example is ProtTransT5Embedder and its subclasses. If you add any new options, add them to KNOWN_EMBED_OPTIONS in bio_embeddings/embed/pipeline.py.
- Add the class in bio_embeddings/embed/__init__.py.
- The following two are checked by SKIP_SLOW_TESTS=1 pytest:
  - Add the model size in the docs of bio_embeddings/embed/__init__.py.
  - Add it to DEFAULT_MAX_AMINO_ACIDS.
- Add it to the tests following the instructions in tests/test_embedder_embedding.py.
- Write a pipeline with your embedder and check that it works.
- Send a pull request 🚀
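
The naming rule and the "create a new class" step can be sketched roughly as below. This is a self-contained illustration, not the project's code: SketchEmbedderInterface is a hypothetical stand-in for EmbedderInterface, and its attributes (name, embedding_dimension), the reduce_per_protein helper, the class_name_for function and ToyOneHotEmbedder are all assumptions made for this example. Check bio_embeddings/embed for the attributes and methods the real interface requires.

```python
from abc import ABC, abstractmethod
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def class_name_for(snake_name: str) -> str:
    """Derive a CamelCase class name from a snake_case embedder name.

    Hypothetical helper: names containing acronyms may need manual casing.
    """
    return "".join(part.capitalize() for part in snake_name.split("_")) + "Embedder"


class SketchEmbedderInterface(ABC):
    """Hypothetical stand-in for the project's EmbedderInterface."""

    name: str  # snake_case name (assumed attribute)
    embedding_dimension: int  # length of each per-residue vector (assumed attribute)

    @abstractmethod
    def embed(self, sequence: str) -> List[List[float]]:
        """Return one vector per residue of the sequence."""

    @staticmethod
    def reduce_per_protein(embedding: List[List[float]]) -> List[float]:
        """Average the per-residue vectors into a single per-protein vector."""
        length = len(embedding)
        return [sum(column) / length for column in zip(*embedding)]


class ToyOneHotEmbedder(SketchEmbedderInterface):
    """Toy embedder: one-hot vector per residue, with an extra slot for unknowns."""

    name = "toy_one_hot"
    embedding_dimension = len(AMINO_ACIDS) + 1

    def embed(self, sequence: str) -> List[List[float]]:
        rows = []
        for residue in sequence:
            index = AMINO_ACIDS.find(residue)
            if index == -1:  # unknown residue goes to the last slot
                index = len(AMINO_ACIDS)
            row = [0.0] * self.embedding_dimension
            row[index] = 1.0
            rows.append(row)
        return rows
```

In this sketch the snake_case name and the class name stay in sync: class_name_for(ToyOneHotEmbedder.name) gives the class's own name, which is the kind of consistency the first step asks for.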