New materials and pharmaceuticals are often discovered through a laborious, trial-and-error process that might take decades and cost millions of dollars. 

To speed up this process, scientists frequently use machine learning to predict chemical properties and narrow down the molecules they need to synthesise and test in the lab.

Researchers have created a unified framework that predicts molecular properties while generating new molecules more quickly than existing deep-learning algorithms.

Existing Approach

To train a machine-learning model to predict a molecule's biological or mechanical properties, researchers must typically expose it to millions of labelled chemical structures. Because discovering compounds is expensive and hand-labelling millions of structures is impractical, large training datasets are generally difficult to come by, which limits the effectiveness of machine-learning approaches.

In contrast, the new method can predict chemical properties effectively with limited data. The approach is built on a fundamental understanding of the rules that govern how building blocks combine to form valid molecules. These rules capture similarities between chemical structures, allowing the system to generate new molecules and forecast their properties in a data-efficient way. The method outperformed other machine-learning approaches on both small and large datasets, and it could accurately predict chemical properties and generate viable compounds even when given fewer than 100 samples.

Language of Molecules

To get the best results from machine-learning models, scientists need training datasets containing millions of molecules with properties similar to those they hope to discover. In practice, datasets from a single domain are relatively small, so models are often pretrained on vast datasets of broadly varied molecules and then applied to a much smaller, domain-specific dataset. Because these models lack domain-specific knowledge, they tend to perform poorly.

In language theory, words, phrases, and entire paragraphs are created by following a predetermined set of rules. Molecular grammar is similar in concept: it is a set of production rules for stringing together atoms and substructures to form molecules or polymers. Just as a language grammar can generate many sentences from the same rules, one molecular grammar can represent a vast number of molecules. Structurally similar molecules share production rules, and the system learns to exploit this, drawing on molecular similarities to improve the accuracy of its property predictions for novel molecules.
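The grammar analogy can be made concrete with a toy example. The sketch below defines a hypothetical set of production rules (the symbols, fragments, and rule choices are invented for illustration, not the grammar learned by the actual system) and expands a start symbol into a SMILES-like string, just as a language grammar expands a sentence symbol into words:

```python
import random

# A toy "molecular grammar": each non-terminal maps to possible expansions.
# These rules are illustrative only, not those learned by the real system.
GRAMMAR = {
    "MOL":  [["FRAG"], ["FRAG", "MOL"]],     # a molecule is one or more fragments
    "FRAG": [["C"], ["O"], ["C", "RING"]],   # a fragment is an atom, or an atom plus a ring
    "RING": [["c1ccccc1"]],                  # a benzene ring as a terminal substructure
}

def generate(symbol="MOL", rng=None, depth=0, max_depth=6):
    """Expand a non-terminal by repeatedly applying production rules."""
    rng = rng or random.Random(0)
    if symbol not in GRAMMAR:                # terminal symbol: emit as-is
        return symbol
    rules = GRAMMAR[symbol]
    # Force termination at max depth by always picking the shortest rule.
    rule = min(rules, key=len) if depth >= max_depth else rng.choice(rules)
    return "".join(generate(s, rng, depth + 1, max_depth) for s in rule)

print(generate())
```

Because every string the grammar emits is built only from the allowed fragments, the same rule set both generates candidate molecules and encodes which structures count as similar.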

Learning

The system learns the production rules for a molecular grammar through trial and error, with the model rewarded for behaviour that brings it closer to accomplishing its goal. Because there are billions of possible combinations of atoms and substructures, learning grammar production rules this way would be computationally prohibitive for anything but the smallest datasets.
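A minimal sketch of this trial-and-error idea, with invented rules and a stand-in reward (the real system scores generated molecules against chemical objectives, not a carbon count): the learner keeps a weight per candidate rule, samples rules to assemble a string, and reinforces the rules it used in proportion to the reward.

```python
import random

# Hypothetical rule choices (terminal fragments) and one weight per rule.
RULES = ["C", "CC", "O", "N"]
weights = {r: 1.0 for r in RULES}

def sample_fragments(rng, steps=4):
    """Pick rules with probability proportional to their current weights."""
    return [rng.choices(RULES, weights=[weights[r] for r in RULES])[0]
            for _ in range(steps)]

def reward(fragments):
    # Stand-in objective: fraction of carbon atoms in the assembled string.
    s = "".join(fragments)
    return s.count("C") / len(s)

rng = random.Random(42)
for _ in range(500):                     # trial-and-error episodes
    frags = sample_fragments(rng)
    r = reward(frags)
    for f in frags:                      # reinforce the rules that were used
        weights[f] += r

# Carbon-producing rules should accumulate the largest weights under this reward.
```

The combinatorial problem the article mentions shows up here too: with realistic rule sets the search space explodes, which is what motivates the hierarchical split described next.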

The scientists therefore split the molecular grammar into two parts. The first, a metagrammar, is a manually designed, broadly applicable grammar provided to the system at the outset. The system then learns a much more compact, molecule-specific grammar from the domain dataset. This hierarchical structure accelerates the learning process.
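One way to picture the two-level idea is pruning: a broad, hand-designed metagrammar is specialised down to a compact domain grammar using a small dataset. The fragment names, categories, and dataset below are invented for illustration, and real specialisation is learned rather than a simple substring filter; this sketch only shows why the learned grammar ends up much smaller than the metagrammar.

```python
# Broad, hand-designed metagrammar: categories of candidate fragments.
METAGRAMMAR = {
    "chain":  ["C", "CC", "CCC"],
    "ring":   ["c1ccccc1", "C1CCCCC1"],
    "hetero": ["O", "N", "S"],
}

# Tiny "domain dataset": SMILES-like strings from the target molecule family.
domain_data = ["CCOc1ccccc1", "CCC", "CCO"]

def specialise(metagrammar, data):
    """Keep only the metagrammar fragments that actually occur in the data."""
    grammar = {}
    for category, fragments in metagrammar.items():
        kept = [f for f in fragments if any(f in mol for mol in data)]
        if kept:
            grammar[category] = kept
    return grammar

domain_grammar = specialise(METAGRAMMAR, domain_data)
```

Because the learner only has to choose among fragments that survive the specialisation step, the search space shrinks from the full metagrammar to the handful of rules the domain actually uses.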

Results

Experiments showed that, even though domain-specific datasets often contain only a few hundred samples, the system could simultaneously generate viable molecules and polymers and accurately predict their properties. The new technique also does away with pretraining, an expensive extra step.

One physical property the method predicted accurately was the glass transition temperature of polymers, the point at which a material changes from a hard, glassy state to a soft, rubbery one. Obtaining this data manually is typically expensive because of the high temperatures and pressures the tests require. To probe the limits of their method, the researchers halved the size of one training set, leaving only 94 samples. Despite this limitation, their model performed comparably to approaches trained on the entire dataset.

Conclusion

Understanding the interactions between polymer chains will require expanding the molecular grammar to encompass the 3D geometry of molecules and polymers. To make the system's behaviour easier to inspect, the researchers are also designing a user interface that will display the learned grammar production rules.

