CMStatistics 2022
B1677
Title: Estimation of tissue profiles from blood RNA-seq based on latent Dirichlet allocation
Authors: Shintaro Yuki - Doshisha University (Japan) [presenting]
Yusuke Matsui - Nagoya University Graduate School of Medicine (Japan)
Yoshikazu Terada - Osaka University; RIKEN (Japan)
Hiroshi Yadohisa - Doshisha University (Japan)
Abstract: Disease prediction from gene expression data in blood samples is clinically important. However, because expression measured in blood is a mixture of molecules originating from multiple tissues, tissue-specific profiles must be estimated to identify the source of a disease. To address this problem, we consider an estimation method based on latent Dirichlet allocation (LDA), assuming that tissue-specific molecular markers are available as prior information. Specifically, we encode this marker information as a topic-specific prior distribution on the word frequencies of each topic in the LDA. A related method is penalized LDA, which accounts for the effect of housekeeping genes, the analogue of "stop words". RNA-seq gene expression data are, in particular, high-dimensional and contain an excess of zeros. We will discuss methods that achieve accuracy and robustness in such settings and that extract interpretable biological information.
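The idea of a topic-specific prior can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the inference scheme, so a plain collapsed Gibbs sampler is assumed here, with samples playing the role of documents, genes the role of words, and tissues the role of topics. The function names (`build_topic_word_prior`, `gibbs_lda`) and the hyperparameters (`base`, `boost`, `alpha`) are hypothetical, and neither the penalized LDA variant nor any zero-inflation handling is covered.

```python
import numpy as np


def build_topic_word_prior(n_topics, n_genes, markers, base=0.1, boost=5.0):
    """Asymmetric Dirichlet prior over genes, one row per tissue (topic).

    markers maps a topic index to the gene indices assumed to be its markers
    (hypothetical input format). Marker genes get extra prior mass `boost`.
    """
    eta = np.full((n_topics, n_genes), base)
    for k, genes in markers.items():
        eta[k, genes] += boost
    return eta


def gibbs_lda(counts, eta, alpha=0.5, n_iter=50, seed=0):
    """Collapsed Gibbs sampling for LDA with a topic-specific word prior eta.

    counts: (n_samples, n_genes) matrix of non-negative integer read counts.
    Returns (theta, phi): sample-by-tissue proportions and tissue-by-gene
    profiles, both computed from the final Gibbs state.
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=int)
    n_docs, n_genes = counts.shape
    n_topics = eta.shape[0]

    # Expand the count matrix into one (sample, gene) token per read.
    doc_ids, gene_ids = np.nonzero(counts)
    reps = counts[doc_ids, gene_ids]
    doc_ids = np.repeat(doc_ids, reps)
    gene_ids = np.repeat(gene_ids, reps)

    # Random initial topic assignments and the matching count tables.
    z = rng.integers(n_topics, size=doc_ids.size)
    n_dk = np.zeros((n_docs, n_topics))
    n_kv = np.zeros((n_topics, n_genes))
    np.add.at(n_dk, (doc_ids, z), 1)
    np.add.at(n_kv, (z, gene_ids), 1)
    n_k = n_kv.sum(axis=1)
    eta_sum = eta.sum(axis=1)

    for _ in range(n_iter):
        for i in range(doc_ids.size):
            d, v, k = doc_ids[i], gene_ids[i], z[i]
            # Remove the current token from the count tables.
            n_dk[d, k] -= 1
            n_kv[k, v] -= 1
            n_k[k] -= 1
            # Full conditional; eta[:, v] is where the marker prior enters.
            p = (n_dk[d] + alpha) * (n_kv[:, v] + eta[:, v]) / (n_k + eta_sum)
            k = rng.choice(n_topics, p=p / p.sum())
            z[i] = k
            n_dk[d, k] += 1
            n_kv[k, v] += 1
            n_k[k] += 1

    theta = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
    phi = (n_kv + eta) / (n_kv + eta).sum(axis=1, keepdims=True)
    return theta, phi
```

Under these assumptions, calling `build_topic_word_prior(2, 6, {0: [0, 1], 1: [2, 3]})` and passing the result to `gibbs_lda` together with a small integer count matrix returns one row of tissue proportions per sample and one gene profile per tissue. Because the sampler loops over individual reads, it is only practical for toy data; real RNA-seq counts would need a count-level or variational formulation.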