Conclusion - Main Text, Thesis, SOKENDAI (The Graduate University for Advanced Studies) Academic Information Repository, A1918 Main Text

Fig. 3.13 Venn diagram for total numbers of significantly annotated motifs over all the 228 datasets, reported by RPMCMC, Hegma and DREME.

all-at-once sampling, we could discover diverse motifs by which the parallel samplers divide their responsibility in the overall search region.

As another contribution, we provided a list of predicted cofactor motifs that were overrepresented in the 228 ENCODE ChIP-seq datasets. RPMCMC can potentially mine promising annotated motifs that other word-count methods fail to find. To narrow these candidates down to truly functional cofactor sets, further validation experiments are necessary.

Molecular design problem

4.1 Introduction

Computational molecular design has great potential to yield enormous savings in time and cost in the discovery and development of functional molecules and assemblies, including drugs, dyes, solvents, polymers, and catalysts. The objective is to computationally create novel promising molecules that simultaneously exhibit desired properties of various kinds.

For instance, the chemical space of small organic molecules is known to consist of more than 10⁶⁰ candidates. The problem entails a considerably complicated multi-objective optimization where it is impractical to fully explore the vast landscape of structure-property relationships.

In general, the molecular design process involves two different types of prediction: the forward prediction aims to predict the physical, chemical, and electrical properties of a given molecular structure, and the backward prediction aims to inversely identify appropriate molecular structures with the given desired properties. While the former is referred to as quantitative structure-property relationship (QSPR) analysis, the latter is known as inverse-QSPR analysis [14, 90, 68, 122, 123, 81, 66, 129]. In this study, a Bayesian perspective is employed to unify the forward and backward prediction processes. The present method is therefore called Bayesian molecular design.
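As a sketch of what this unification can look like, the backward prediction may be written as a posterior over structures. The symbols S (a molecular structure), Y (its properties), and U (the desired property region) are notation assumed here for illustration; the forward model and the prior are the ones described later in this section.

```latex
% S: molecular structure, Y: its properties, U: desired property region
% p(Y | S): forward (QSPR) model, p(S): prior over structures
p(S \mid Y \in U) \;\propto\; p(Y \in U \mid S)\, p(S),
\qquad
p(Y \in U \mid S) = \int_{U} p(Y \mid S)\, \mathrm{d}Y .
```

Under this reading, structures receive high posterior probability only when the forward model places their properties inside U and the prior regards them as plausible.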

In the relevant areas called chemical or materials informatics, there have been extensive studies on the forward prediction; however, considerably less progress has been made in the backward prediction. An obvious approach to the inverse problem is the use of combinatorial optimization techniques. The objective is to minimize the difference between given desired properties and those attained by the designed molecules. Some previous studies tackled this issue with genetic algorithms (GAs) [90, 122, 123, 81, 66, 32, 87, 71] and molecular graph enumeration [129, 3]. Graph enumeration is generally less effective due to the combinatorial complexity of the design space. To narrow down the candidates, several ways of introducing a restricted class of molecular graphs have been investigated [129, 3]. GAs [125], which have been studied more intensively, search for optimal or suboptimal designs by successively modifying chemical structures with genetic operators consisting of mutation, crossover, and selection.
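The GA loop described above can be sketched on a toy problem. This is a minimal illustration and not the implementation used in any of the cited studies: the candidates are fixed-length strings over a three-letter alphabet rather than molecules, and `toy_property` is a hypothetical stand-in for a forward property model.

```python
import random

ALPHABET = "CNO"  # toy alphabet; these strings are NOT valid molecules


def toy_property(s):
    """Hypothetical stand-in for a forward model: fraction of 'O' characters."""
    return s.count("O") / len(s)


def fitness(s, target):
    """Difference between the attained and desired property (lower is better)."""
    return abs(toy_property(s) - target)


def mutate(s, rng):
    """Replace one randomly chosen character."""
    i = rng.randrange(len(s))
    return s[:i] + rng.choice(ALPHABET) + s[i + 1:]


def crossover(a, b, rng):
    """Single-point crossover of two equal-length strings."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def evolve(population, target, generations=30, rng=None):
    """Run selection, crossover, and mutation; return the best candidate."""
    if rng is None:
        rng = random.Random(0)
    size = len(population)
    for _ in range(generations):
        # Selection: keep the better half, so the best candidate never gets worse.
        ranked = sorted(population, key=lambda s: fitness(s, target))
        survivors = ranked[: size // 2]
        children = []
        while len(survivors) + len(children) < size:
            a, b = rng.sample(survivors, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = survivors + children
    return min(population, key=lambda s: fitness(s, target))


if __name__ == "__main__":
    seed_rng = random.Random(1)
    start = ["".join(seed_rng.choice(ALPHABET) for _ in range(8)) for _ in range(10)]
    best = evolve(start, target=0.25)
    print(best, fitness(best, 0.25))
```

Note that `mutate` and `crossover` here change characters blindly. On real molecular representations, the same blind edits would readily produce chemically unfavorable structures.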

The major difficulty of using a GA lies in mutating molecules such that unfavorable structures, for instance unfavorable and/or unrealistic chemical bonds such as F-N and C=O=C, are successfully excluded. This issue is shared by graph enumeration.

To avoid the emergence of unfavorable structures, exclusion rules were employed in some studies, particularly those aimed at the design of drug-like molecules [56, 67]. However, such rules might be incomplete, and it is impractical to establish a general rule for chemically favorable structures. A promising alternative is fragment assembly methods [122, 123, 81, 66, 30, 38]. In the structure manipulation step of these methods, randomly chosen substructures are replaced by fragments of existing compounds. While the fragment assembly methods have a certain appeal, as is evident from their widespread use, they suffer from critical disadvantages: (i) the design space is restricted to possible combinations of collected fragments, (ii) the use of a vast number of fragments entails unacceptably large computational loads for the homology search in the fragment exchange operation, and (iii) mutation and crossover operations require computationally intractable graph manipulations. The proposed method circumvents all these issues.

Bayesian molecular design begins by obtaining a set of machine learning models that forwardly predict the properties of a given molecule for multiple design objectives. These forward QSPR models are inverted into the backward model through Bayes' law, combined with a certain prior distribution. This gives a posterior probability distribution for the inverse-QSPR analysis, which is conditioned on a desired property region. By exploring high-probability regions of the posterior with the sequential Monte Carlo (SMC) method [28], molecules that exhibit the desired properties are computationally created. The most distinctive feature of this workflow lies in the backward prediction algorithm. In this study, a molecule is described by an ASCII string according to the well-known SMILES chemical language notation. To reduce the occurrence of chemically unfavorable structures, a chemical language model is trained, which acquires commonly occurring patterns of chemical substructures through natural language processing of the SMILES strings of existing compounds. The trained model is used to recursively refine the SMILES strings of seed molecules such that the properties of the resulting molecules fall in the desired property region while avoiding the creation of unfavorable chemical structures. The key contributions of the newly proposed method are summarized below.

• String-based structure refinement. The string representation of molecules enables much faster structure refinements in the backward prediction than those based on graph representation.

• Generator for chemically favorable structures. The method is designed according to a fragment-free strategy. Structural patterns of known compounds or implied contexts of ‘chemically favorable structures’ are captured by the probabilistic model.

Afterward, the resulting SMILES generator will be shown to be very effective in creating chemically plausible hypothetical molecules. The trained model serves as a substitute for a fragment library, and also forms the prior distribution in the Bayesian analysis.
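The idea of a language model that learns substructure patterns from existing SMILES strings can be illustrated with a character-level n-gram sampler. This is a simplified sketch under stated assumptions and not the model implemented in iqspr: the function names, the five-molecule training list, and the character-level tokenization (which would split two-letter atoms such as Cl) are all choices made here for illustration, and the sampled strings are not guaranteed to be valid SMILES.

```python
import random
from collections import Counter, defaultdict

START, END = "^", "$"  # padding symbols assumed for this sketch

# Tiny illustrative training set: ethanol, acetic acid, benzene,
# ethylamine, isopropanol.
TRAINING_SMILES = ["CCO", "CC(=O)O", "c1ccccc1", "CCN", "CC(C)O"]


def train_ngram(smiles_list, n=3):
    """Count which character follows each (n-1)-character context (n >= 2)."""
    assert n >= 2
    counts = defaultdict(Counter)
    for s in smiles_list:
        padded = START * (n - 1) + s + END
        for i in range(n - 1, len(padded)):
            context = padded[i - (n - 1):i]
            counts[context][padded[i]] += 1
    return counts


def sample_string(counts, n=3, max_len=40, rng=random):
    """Sample characters one at a time, in proportion to the training counts."""
    context = START * (n - 1)
    out = []
    while len(out) < max_len:
        options = counts.get(context)
        if not options:
            break  # context never seen in training
        chars = list(options.keys())
        weights = list(options.values())
        nxt = rng.choices(chars, weights=weights, k=1)[0]
        if nxt == END:
            break
        out.append(nxt)
        context = (context + nxt)[-(n - 1):]
    return "".join(out)


if __name__ == "__main__":
    model = train_ngram(TRAINING_SMILES, n=3)
    rng = random.Random(0)
    for _ in range(5):
        print(sample_string(model, n=3, rng=rng))
```

Because every sampled character must have followed its context somewhere in the training set, the generator reuses only local patterns seen in known compounds, which is the sense in which such a model can stand in for a fragment library.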

The forward and backward predictions are implemented as a pipeline in the R package iqspr, which is available from the CRAN repository [100]. The proposed method is illustrated through the design of small organic molecules exhibiting properties within prescribed ranges of HOMO-LUMO gap and internal energy. The properties of the created molecules are verified with a quantum chemistry calculation based on density functional theory (DFT), as was done for the design of organic photovoltaics in [51]. Finally, through case studies, we highlight a significant issue and the ultimate goal of machine learning-driven materials discovery, to which little attention has so far been paid.
