Replicability of Classification Procedures for Gene Expression Data

Date

2017-05

Publisher

The Ohio State University

Abstract

Pattern classification is a branch of statistics and machine learning that uses labeled samples to predict information about unlabeled ones. A common application of this theory in medicine is to classify cancer patients into subtypes based on the patterns in their gene expression profiles. What determines the validity of the procedure is not whether one can find these patterns in observed data, but whether these patterns generalize to unobserved data from the same population. In this regard, the error of the classification rule over the population determines its validity, and a key issue is how to estimate that error.
In small-sample settings, where few observations are available, estimating the classification error becomes problematic because most error estimators have high variance. This raises doubts about the replicability of small-sample studies. In this thesis, I will use a replicability index to assess multiple classification and error estimation procedures that are commonly used in the medical community, in particular on RNA-seq and microarray gene expression data. I will also provide suggestions on the sample size needed to ensure that a procedure applied to a small preliminary study will generalize to a large follow-on study with an acceptable margin of error.
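
The claim that error estimators have high variance in small samples can be illustrated with a toy simulation. The sketch below is not the thesis's replicability index or any of its procedures; it assumes a hypothetical one-dimensional problem with two Gaussian classes, a nearest-mean classifier, and a leave-one-out error estimate, and all function names and parameter values are illustrative choices. It repeats the same small study many times and reports how much the estimated error varies from one repetition to the next.

```python
import random
import statistics


def draw_sample(n, rng, shift=1.0):
    """Draw n labeled points: class 0 ~ N(0, 1), class 1 ~ N(shift, 1).

    The distributions and the shift are assumptions made for illustration.
    """
    sample = []
    for i in range(n):
        label = i % 2  # alternate labels so the classes stay balanced
        sample.append((rng.gauss(label * shift, 1.0), label))
    return sample


def nearest_mean_predict(train, x):
    """Assign x to the class whose training mean is closer."""
    mean0 = statistics.fmean(v for v, y in train if y == 0)
    mean1 = statistics.fmean(v for v, y in train if y == 1)
    return 0 if abs(x - mean0) <= abs(x - mean1) else 1


def loo_error(sample):
    """Leave-one-out estimate of the classification error."""
    mistakes = 0
    for i, (x, y) in enumerate(sample):
        train = sample[:i] + sample[i + 1:]  # hold out point i
        if nearest_mean_predict(train, x) != y:
            mistakes += 1
    return mistakes / len(sample)


def estimator_spread(n, repeats=100, seed=0):
    """Repeat a size-n study and return (mean, std dev) of the error estimates."""
    rng = random.Random(seed)
    estimates = [loo_error(draw_sample(n, rng)) for _ in range(repeats)]
    return statistics.fmean(estimates), statistics.stdev(estimates)


if __name__ == "__main__":
    for n in (20, 100):
        mean_est, sd_est = estimator_spread(n)
        print(f"n={n}: mean estimate={mean_est:.3f}, std dev={sd_est:.3f}")
```

Under these assumptions, the standard deviation of the estimate for n=20 should come out noticeably larger than for n=100, which is the sense in which a small preliminary study may not replicate in a larger follow-on study.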

Description

The Ohio State University Undergraduate Student Scholar Award

Keywords

Pattern Recognition, Data Analysis, Gene expression classification, replicability, small sample, error estimation
