Researchers have created a system that can build AI models for biological research. BioAutoMATED is an open-source automated machine-learning platform that aims to democratise AI in research laboratories.

The availability of large, high-dimensional biological datasets in recent years has facilitated the widespread use of machine learning (ML) methods to investigate and predict biological phenomena. This has produced exciting breakthroughs in genomics and promises future advances in systems biology, synthetic biology, and structural biology. Medium- to large-scale datasets of biological sequences, such as nucleic acid, peptide, and glycan sequences, are now common. Applying ML to these datasets could help researchers extract biological insights and speed up the design of sequences with desirable features.

Online courses, open-source code, interactive notebooks, and software packages have made computational analyses and ML approaches more accessible to scientists. However, ML expertise is still frequently required to create, train, and deploy ML models, and the many decisions a user makes along the way can significantly affect a model's quality and performance. Understanding which design decisions matter, and how to make the best choices for a given dataset, remains a significant challenge for life science researchers with limited ML knowledge. Even for experienced ML practitioners, choosing the right algorithmic strategies and tuning model parameters is difficult.

Automated machine learning

Automated machine learning (AutoML) is a promising route for making ML easier to apply to biological datasets. AutoML refers to strategies for automating the design and deployment of ML pipelines with minimal human intervention.

Little user involvement is required. End-to-end AutoML would make data pre-processing, feature extraction, model selection and optimisation, and performance evaluation easier for life scientists. AutoML approaches can automatically identify suitable model architectures and hyperparameters. AutoML may also help more experienced ML practitioners, either as a quick way to build baseline models to compare against or as a way to identify broad classes of models with promising performance.
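To make the idea of automated hyperparameter selection concrete, here is a minimal sketch in Python. It is not BioAutoMATED's method or any AutoML library's API: the toy data, the nearest-neighbour classifier, and all function names are assumptions chosen for illustration. The loop simply tries a few values of one hyperparameter (k) and keeps the one that scores best on held-out data, which is the basic pattern AutoML tools automate at much larger scale.

```python
import random
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train, query, k):
    """Majority vote among the k training sequences closest to the query."""
    nearest = sorted(train, key=lambda item: hamming(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train, valid, k):
    """Fraction of validation sequences whose label is predicted correctly."""
    hits = sum(knn_predict(train, seq, k) == label for seq, label in valid)
    return hits / len(valid)

def search_k(train, valid, candidates=(1, 3, 5)):
    """Score every candidate k on the validation set; return the best one."""
    scores = {k: accuracy(train, valid, k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores

def make_toy_data(n, seed=0):
    """Hypothetical toy data: a 12-base sequence is labelled 1 if GC-rich."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        seq = "".join(rng.choice("ACGT") for _ in range(12))
        gc = sum(base in "GC" for base in seq)
        data.append((seq, int(gc >= 6)))
    return data

data = make_toy_data(60)
train, valid = data[:40], data[40:]
best_k, scores = search_k(train, valid)
print("validation accuracy per k:", scores)
print("selected k:", best_k)
```

A real AutoML system searches over whole model families and many hyperparameters at once, and typically uses cross-validation rather than a single hold-out split.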

AutoML tools

Currently, a wide range of AutoML tools is available. Many well-known AutoML tools search only within classes of neural network models. However, some of the most interesting AutoML tools use tree-based optimisation methods to search among "shallow" or simpler scikit-learn-based models, such as random forest classifiers. These techniques, which may be better suited than neural networks to smaller, sparser biological datasets, have yet to be combined with neural architecture search methods to speed up biological sequence analysis.

Indeed, architecture selection is critical for model performance, and recent studies indicate that there is no single "optimal" AutoML tool, which underlines the importance of evaluating multiple classes of models on a single platform. What is needed, therefore, is a scalable system that integrates AutoML tools and can also manage data pre-processing, model deployment, and reporting.

BioAutoMATED

The researchers present BioAutoMATED, an AutoML platform for biological sequence analysis that incorporates several AutoML algorithms into one cohesive framework. It gives users practical ways to analyse, interpret, and design biological sequences automatically. BioAutoMATED predicts gene regulation, peptide-drug interactions, and glycan annotation, designs optimised synthetic biology components, and highlights prominent sequence features. By automating sequence modelling, it makes it easier for life scientists to apply machine learning to their work.
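Before any model can be trained on sequences, the letters have to be turned into numbers. The sketch below shows one common pre-processing step, one-hot encoding, for nucleotide sequences. It is an illustrative assumption about what such a step can look like, not BioAutoMATED's actual code; the function name `one_hot` and the choice to encode unknown characters (such as N) as all-zero rows are hypothetical.

```python
NUCLEOTIDES = "ACGT"

def one_hot(seq, alphabet=NUCLEOTIDES):
    """Encode a sequence as a list of rows, one row per position.

    Each row has one entry per alphabet letter, with a 1 marking the
    letter found at that position. Characters outside the alphabet
    (for example the ambiguity code N) become all-zero rows.
    """
    index = {letter: i for i, letter in enumerate(alphabet)}
    rows = []
    for char in seq.upper():
        row = [0] * len(alphabet)
        if char in index:
            row[index[char]] = 1
        rows.append(row)
    return rows

encoded = one_hot("acgn")
for base, row in zip("ACGN", encoded):
    print(base, row)
```

Passing a different `alphabet` (for example the 20 standard amino acid letters) would apply the same scheme to peptide sequences.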

Supervised ML models

The supervised ML models in BioAutoMATED's repertoire are divided into three types:

  • Binary classification models (which divide data into two classes).
  • Multi-class classification models (which divide data into multiple classes).
  • Regression models (which predict continuous numerical values).
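The three task types differ only in what the labels look like, so a simple heuristic can often guess the task from the label column. The sketch below is an assumption made for illustration, not how BioAutoMATED decides: the function `infer_task` is hypothetical, and it would misread a regression target that happens to contain only whole numbers.

```python
def infer_task(labels):
    """Guess the supervised task type from a list of labels.

    Heuristic (an assumption for this sketch): any non-integer float
    label means regression; otherwise the number of distinct labels
    decides between binary and multi-class classification.
    """
    if any(isinstance(y, float) and not y.is_integer() for y in labels):
        return "regression"
    distinct = set(labels)
    if len(distinct) == 2:
        return "binary_classification"
    if len(distinct) > 2:
        return "multiclass_classification"
    raise ValueError("need at least two distinct label values")

print(infer_task([0, 1, 1, 0]))
print(infer_task(["promoter", "enhancer", "silencer"]))
print(infer_task([0.12, 3.4, 2.75]))
```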

BioAutoMATED can even assist in determining how much data is required to train the selected model properly.

Conclusion

In this work, the researchers present BioAutoMATED, a platform that integrates and deploys AutoML tools for studying biological sequences, and assess its performance.

Many biologists who want to use ML in their research face significant barriers to entry because of the many design choices involved in building ML models. AutoML techniques can address many of the issues associated with bringing ML to the biological sciences. However, because existing AutoML tools do not explicitly handle biological sequences (e.g., nucleotide, amino acid, or glycan sequences) and cannot be easily compared with one another, they are rarely employed in systems and synthetic biology studies.

