Chapter 15: Representation Learning in Chemistry
Published: 15 Jul 2020
J. Staker, G. Marques, and J. Dakka, in Machine Learning in Chemistry
The past few years have seen significantly increased interest in applying contemporary machine learning methods to drug discovery, materials science, and other applications in chemistry. Recent advances in deep learning, coupled with the ever-expanding volume of publicly available data, have opened a breadth of new directions to explore, both in accelerating commercial applications and in enabling new research. Many machine learning methods cannot directly utilize molecular data stored in common formats, such as SMILES strings or connection tables; molecules must first be descriptorized and processed into representations amenable to machine learning. Historically, molecular featurization has been performed through non-learned transformations that are usually coarse-grained and highly lossy, such as molecular fingerprints, which suffer bit collisions and discard the overall molecular topology. By contrast, learned featurization may provide richer, more descriptive representations of molecules, leading to more powerful and accurate models. We compare common non-learned featurization methods with learned ones and explore the different families of deep neural architectures used to obtain learned representations. We also discuss recent work on adding constraints to models that induce stronger physical priors in deep neural network architectures. Imposing physical constraints in neural models can lead to more robust featurizations and improved transfer learning.
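To make the lossiness of hashed fingerprints concrete, the toy sketch below folds molecular fragments into a fixed-size bit vector. It is a deliberately simplified stand-in, not a real cheminformatics fingerprint (e.g., Morgan/ECFP as implemented in RDKit): it hashes raw SMILES substrings rather than enumerated substructures, and the tiny bit-vector size is chosen to make bit collisions easy to observe. The function name and parameters are illustrative assumptions, not from the chapter.

```python
import hashlib

def hashed_fingerprint(smiles, n_bits=64, max_len=3):
    """Toy fixed-size bit-vector 'fingerprint'.

    Hashes every SMILES substring up to max_len characters (a crude
    stand-in for real substructure enumeration) into n_bits buckets.
    A small n_bits makes bit collisions likely: distinct fragments can
    set the same bit, so information is irreversibly discarded.
    """
    bits = [0] * n_bits
    for size in range(1, max_len + 1):
        for i in range(len(smiles) - size + 1):
            fragment = smiles[i:i + size]
            digest = hashlib.sha256(fragment.encode("utf-8")).hexdigest()
            # Distinct fragments may map to the same index (bit collision).
            bits[int(digest, 16) % n_bits] = 1
    return bits

fp_ethanol = hashed_fingerprint("CCO")    # ethanol
fp_propanol = hashed_fingerprint("CCCO")  # 1-propanol

# Shared fragments ("C", "CC", "CO", ...) hash to the same bits, so the
# two vectors overlap; but the bit vector alone cannot recover the
# molecular topology, illustrating why such featurizations are lossy.
overlap = sum(a & b for a, b in zip(fp_ethanol, fp_propanol))
```

Learned featurizers (e.g., graph neural networks operating on the molecular graph) avoid this fixed hashing step entirely, which is the contrast the chapter develops.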