Zhao, Simiao (2021) Prediction of Protein Expression and Growth Rates by Supervised Machine Learning. Natural Science, 13 (08). pp. 301-330. ISSN 2150-4091
ns_2021080214163103.pdf - Published Version
Abstract
An organism's DNA sequences strongly influence its transcription and translation processes, and thereby its protein production and growth rate. Because of the complexity of DNA, it has been extremely difficult to predict the macroscopic characteristics of organisms. However, with the rapid development of machine learning in recent years, it has become possible to use powerful machine learning algorithms to process and analyze biological data. Based on synthetic DNA sequences of a specific microbe, E. coli, I designed a process to predict its protein production and growth rate. After examining the properties of a data set constructed in previous work, I chose supervised learning regressors with encoded DNA sequences as input features to perform the predictions. After comparing different encoders and algorithms, I selected three encoders to encode the DNA sequences as inputs and trained seven different regressors to predict the outputs. The hyperparameters were optimized for the three regressors with the best potential prediction performance. Finally, I predicted protein production and growth rate, with best R² scores of 0.55 and 0.77, respectively, by using encoders to capture latent features in the DNA sequences.
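To make the encode-then-regress pipeline described in the abstract concrete, the following is a minimal sketch and not the author's implementation. The abstract does not name the three encoders or the seven regressors, so one-hot encoding and closed-form ridge regression are assumptions chosen for illustration. The sequences and the target (GC fraction plus noise) are synthetic stand-ins, and all function names are hypothetical.

```python
import numpy as np

BASES = "ACGT"

def one_hot_encode(seq):
    """Flatten a DNA string into a vector of 4 * len(seq) 0/1 indicators."""
    index = {base: i for i, base in enumerate(BASES)}
    out = np.zeros((len(seq), len(BASES)))
    for pos, base in enumerate(seq.upper()):
        out[pos, index[base]] = 1.0
    return out.ravel()

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression on centered data; returns (weights, intercept)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic data (assumption): random 30-base sequences whose target is
# the GC fraction plus a little Gaussian noise. Not the paper's data set.
rng = np.random.default_rng(0)
seqs = ["".join(rng.choice(list(BASES), size=30)) for _ in range(400)]
y = np.array([(s.count("G") + s.count("C")) / len(s) for s in seqs])
y = y + rng.normal(0.0, 0.02, size=len(seqs))

X = np.stack([one_hot_encode(s) for s in seqs])

# Simple hold-out split: first 300 sequences for training, last 100 for testing.
w, b = fit_ridge(X[:300], y[:300], alpha=1.0)
pred = X[300:] @ w + b
score = r2_score(y[300:], pred)
print(f"held-out R^2 on synthetic data: {score:.3f}")
```

The R² printed here reflects only the toy target and says nothing about the 0.55 and 0.77 scores reported in the abstract. The regularization strength `alpha` is one example of a hyperparameter that would be tuned in the optimization step the abstract mentions.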
Item Type: | Article |
---|---|
Subjects: | STM Repository > Medical Science |
Depositing User: | Managing Editor |
Date Deposited: | 08 Nov 2023 08:55 |
Last Modified: | 08 Nov 2023 08:55 |
URI: | http://classical.goforpromo.com/id/eprint/4572 |