Cyber Forensics Intelligence Center
Permanent URI for this collectionhttps://hdl.handle.net/20.500.11875/2423
Browse
Browsing Cyber Forensics Intelligence Center by Author "Chen, Zhongxue"
Now showing 1 - 9 of 9
- Results Per Page
- Sort Options
Item A gene selection method for GeneChip array data with small sample sizes(2010-07) Chen, Zhongxue; Liu, Qingzhong; McGee, Monnie; Kong, Megan; Huang, Xudong; Deng, Youpin; Scheuermann, Richard H.In microarray experiments with small sample sizes, it is a challenge to estimate p-values accurately and decide cutoff p-values for gene selection appropriately. Although permutation-based methods have proved to have greater sensitivity and specificity than the regular t-test, their p-values are highly discrete due to the limited number of permutations available in very small sample sizes. Furthermore, estimated permutation-based p-values for true nulls are highly correlated and not uniformly distributed between zero and one, making it difficult to use current false discovery rate (FDR)-controlling methods. Results: We propose a model-based information sharing method (MBIS) that, after an appropriate data transformation, utilizes information shared among genes. We use a normal distribution to model the mean differences of true nulls across two experimental conditions. The parameters of the model are then estimated using all data in hand. Based on this model, p-values, which are uniformly distributed from true nulls, are calculated. Then, since FDR-controlling methods are generally not well suited to microarray data with very small sample sizes, we select genes for a given cutoff p-value and then estimate the false discovery rate. Conclusion: Simulation studies and analysis using real microarray data show that the proposed method, MBIS, is more powerful and reliable than current methods. It has wide application to a variety of situations.Item A new statistical approach to combining p-values using gamma distribution and its application to genome-wide association study(BMC Bioinformatics, 2014-12-16) Liu, Qingzhong; Chen, Zhongxue; Yang, William; Yang, Jack Y; Li, Jing; Yang, Mary QuBackground: Combining information from different studies is an important and useful practice in bioinformatics, including genome-wide association study, rare variant data analysis and other set-based analyses. Many statistical methods have been proposed to combine p-values from independent studies. However, it is known that there is no uniformly most powerful test under all conditions; therefore, finding a powerful test in specific situation is important and desirable. Results: In this paper, we propose a new statistical approach to combining p-values based on gamma distribution, which uses the inverse of the p-value as the shape parameter in the gamma distribution. Conclusions: Simulation study and real data application demonstrate that the proposed method has good performance under some situations.Item Assessment of gene order computing methods for Alzheimer’s disease(BMC Medical Genomics, 2013-01-23) Liu, Qingzhong; Hu, Benqiong; Pang, Chaoyang; Wang, Shipend; Chen, Zhongxue; Vanderburg, Charles R; Rogers, Jack T.; Deng, Youping; Huang, Xudong; Jiang, GangBackground: Computational genomics of Alzheimer disease (AD), the most common form of senile dementia, is a nascent field in AD research. The field includes AD gene clustering by computing gene order which generates higher quality gene clustering patterns than most other clustering methods. However, there are few available gene order computing methods such as Genetic Algorithm (GA) and Ant Colony Optimization (ACO). Further, their performance in gene order computation using AD microarray data is not known. We thus set forth to evaluate the performances of current gene order computing methods with different distance formulas, and to identify additional features associated with gene order computation. Methods: Using different distance formulas- Pearson distance and Euclidean distance, the squared Euclidean distance, and other conditions, gene orders were calculated by ACO and GA (including standard GA and improved GA) methods, respectively. The qualities of the gene orders were compared, and new features from the calculated gene orders were identified. Results: Compared to the GA methods tested in this study, ACO fits the AD microarray data the best when calculating gene order. In addition, the following features were revealed: different distance formulas generated a different quality of gene order, and the commonly used Pearson distance was not the best distance formula when used with both GA and ACO methods for AD microarray data. Conclusion: Compared with Pearson distance and Euclidean distance, the squared Euclidean distance generated the best quality gene order computed by GA and ACO methods.Item Comparison of feature selection and classification for MALDI-MS data(BMC Genomics, 2009-07-07) Liu, Qingzhong; Sung, Andrew H.; Chen, Zhongxue; Yang, Jack Y; Qiao, Mengyu; Yang, Mary Qu; Huang, Xudong; Deng, YoupingIntroduction: In the classification of Mass Spectrometry (MS) proteomics data, peak detection, feature selection, and learning classifiers are critical to classification accuracy. To better understand which methods are more accurate when classifying data, some publicly available peak detection algorithms for Matrix assisted Laser Desorption Ionization Mass Spectrometry (MALDI-MS) data were recently compared; however, the issue of different feature selection methods and different classification models as they relate to classification performance has not been addressed. With the application of intelligent computing, much progress has been made in the development of feature selection methods and learning classifiers for the analysis of high-throughput biological data. The main objective of this paper is to compare the methods of feature selection and different learning classifiers when applied to MALDI-MS data and to provide a subsequent reference for the analysis of MS proteomics data. Results: We compared a well-known method of feature selection, Support Vector Machine Recursive Feature Elimination (SVMRFE), and a recently developed method, Gradient based Leaveone-out Gene Selection (GLGS) that effectively performs microarray data analysis. We also compared several learning classifiers including K-Nearest Neighbor Classifier (KNNC), Naive Bayes Classifier (NBC), Nearest Mean Scaled Classifier (NMSC), uncorrelated normal based quadratic Bayes Classifier recorded as UDC, Support Vector Machines, and a distance metric learning for Large Margin Nearest Neighbor classifier (LMNN) based on Mahanalobis distance. To compare, we conducted a comprehensive experimental study using three types of MALDI-MS data. Conclusion: Regarding feature selection, SVMRFE outperformed GLGS in classification. As for the learning classifiers, when classification models derived from the best training were compared, SVMs performed the best with respect to the expected testing accuracy. However, the distance metric learning LMNN outperformed SVMs and other classifiers on evaluating the best testing. In such cases, the optimum classification model based on LMNN is worth investigating for future study.Item Feature Selection and Classification of MAQC-II Breast Cancer and Multiple Myeloma Microarray Gene Expression Data(PLoS ONE, 2009-12-11) Liu, Qingzhong; Sung, Andrew H.; Chen, Zhongxue; Huang, Xudong; Deng, Youpin; Liu, JianzhongMicroarray data has a high dimension of variables but available datasets usually have only a small number of samples, thereby making the study of such datasets interesting and challenging. In the task of analyzing microarray data for the purpose of, e.g., predicting gene-disease association, feature selection is very important because it provides a way to handle the high dimensionality by exploiting information redundancy induced by associations among genetic markers. Judicious feature selection in microarray data analysis can result in significant reduction of cost while maintaining or improving the classification or prediction accuracy of learning machines that are employed to sort out the datasets. In this paper, we propose a gene selection method called Recursive Feature Addition (RFA), which combines supervised learning and statistical similarity measures. We compare our method with the following gene selection methods: Support Vector Machine Recursive Feature Elimination (SVMRFE) N Leave-One-Out Calculation Sequential Forward Selection (LOOCSFS) N Gradient based Leave-one-out Gene Selection (GLGS) To evaluate the performance of these gene selection methods, we employ several popular learning classifiers on the MicroArray Quality Control phase II on predictive modeling (MAQC-II) breast cancer dataset and the MAQC-II multiple myeloma dataset. Experimental results show that gene selection is strictly paired with learning classifier. Overall, our approach outperforms other compared methods. The biological functional analysis based on the MAQC-II breast cancer dataset convinced us to apply our method for phenotype prediction. Additionally, learning classifiers also play important roles in the classification of microarray data and our experimental results indicate that the Nearest Mean Scale Classifier (NMSC) is a good choice due to its prediction reliability and its stability across the three performance measurements: Testing accuracy, MCC values, and AUC errors.Item Gene selection and classification for cancer microarray data based on machine learning and similarity measures(BMC Genomics, 2011) Liu, Qingzhong; Sung, Andrew H.; Chen, Zhongxue; Liu, Jianzhong; Chen, Lei; Deng, Youpin; Wang, Zhaohui; Huang, Xudong; Qiao, MengyuBackground: Microarray data have a high dimension of variables and a small sample size. In microarray data analyses, two important issues are how to choose genes, which provide reliable and good prediction for disease status, and how to determine the final gene set that is best for classification. Associations among genetic markers mean one can exploit information redundancy to potentially reduce classification cost in terms of time and money. Results: To deal with redundant information and improve classification, we propose a gene selection method, Recursive Feature Addition, which combines supervised learning and statistical similarity measures. To determine the final optimal gene set for prediction and classification, we propose an algorithm, Lagging Prediction Peephole Optimization. By using six benchmark microarray gene expression data sets, we compared Recursive Feature Addition with recently developed gene selection methods: Support Vector Machine Recursive Feature Elimination, Leave-One-Out Calculation Sequential Forward Selection and several others. Conclusions: On average, with the use of popular learning machines including Nearest Mean Scaled Classifier, Support Vector Machine, Naive Bayes Classifier and Random Forest, Recursive Feature Addition outperformed other methods. Our studies also showed that Lagging Prediction Peephole Optimization is superior to random strategy; Recursive Feature Addition with Lagging Prediction Peephole Optimization obtained better testing accuracies than the gene selection method varSelRF.Item Independent component analysis of Alzheimer's DNA microarray gene expression data(Molecular Neurodegeneration, 2009-01-28) Liu, Qingzhong; Chen, Zhongxue; Vanderburg, Charles R; Rogers, Jack T.; Huang, Xudong; Mou, Xiaoyang; Wei, KongBackground: Gene microarray technology is an effective tool to investigate the simultaneous activity of multiple cellular pathways from hundreds to thousands of genes. However, because data in the colossal amounts generated by DNA microarray technology are usually complex, noisy, highdimensional, and often hindered by low statistical power, their exploitation is difficult. To overcome these problems, two kinds of unsupervised analysis methods for microarray data: principal component analysis (PCA) and independent component analysis (ICA) have been developed to accomplish the task. PCA projects the data into a new space spanned by the principal components that are mutually orthonormal to each other. The constraint of mutual orthogonality and second-order statistics technique within PCA algorithms, however, may not be applied to the biological systems studied. Extracting and characterizing the most informative features of the biological signals, however, require higher-order statistics. Results: ICA is one of the unsupervised algorithms that can extract higher-order statistical structures from data and has been applied to DNA microarray gene expression data analysis. We performed FastICA method on DNA microarray gene expression data from Alzheimer's disease (AD) hippocampal tissue samples and consequential gene clustering. Experimental results showed that the ICA method can improve the clustering results of AD samples and identify significant genes. More than 50 significant genes with high expression levels in severe AD were extracted, representing immunity-related protein, metal-related protein, membrane protein, lipoprotein, neuropeptide, cytoskeleton protein, cellular binding protein, and ribosomal protein. Within the aforementioned categories, our method also found 37 significant genes with low expression levels. Moreover, it is worth noting that some oncogenes and phosphorylation-related proteins are expressed in low levels. In comparison to the PCA and support vector machine recursive feature elimination (SVM-RFE) methods, which are widely used in microarray data analysis, ICA can identify more AD-related genes. Furthermore, we have validated and identified many genes that are associated with AD pathogenesis. Conclusion: We demonstrated that ICA exploits higher-order statistics to identify gene expression profiles as linear combinations of elementary expression patterns that lead to the construction of potential AD-related pathogenic pathways. Our computing results also validated that the ICA model outperformed PCA and the SVM-RFE method. This report shows that ICA as a microarray data analysis tool can help us to elucidate the molecular taxonomy of AD and other multifactorial and polygenic complex diseases.Item A Method to Detect AAC Audio Forgery(Endorsed Transactions on Security and Safety, 2015-05-25) Zhang, Jing; Liu, Qingzhong; Yang, Ming; Sung, Andrew H.; Liu, Yanxin; Chen, Lei; Chen, ZhongxueAdvanced Audio Coding (AAC), a standardized lossy compression scheme for digital audio, which was designed to be the successor of the MP3 format, generally achieves better sound quality than MP3 at similar bit rates. While AAC is also the default or standard audio format for many devices and AAC audio files may be presented as important digital evidences, the authentication of the audio files is highly needed but relatively missing. In this paper, we propose a scheme to expose tampered AAC audio streams that are encoded at the same encoding bit rate. Specifically, we design a shift-recompression based method to retrieve the differential features between the re-encoded audio stream at each shifting and original audio stream, learning classifier is employed to recognize different patterns of differential features of the doctored forgery files and original (untouched) audio files. Experimental results show that our approach is very promising and effective to detect the forgery of the same encoding bit-rate on AAC audio streams. Our study also shows that shift recompression-based differential analysis is very effective for detection of the MP3 forgery at the same bit rateItem Supervised learning-based tagSNP selection for genome-wide disease classifications(BIomed Central, 2007-07-25) Liu, Qingzhong; Sung, Andrew H.; Chen, Zhongxue; Yang, Mary Qu; Huang, Xudong; Yang, JackComprehensive evaluation of common genetic variations through association of single nucleotide polymorphisms (SNPs) with complex human diseases on the genome-wide scale is an active area in human genome research. One of the fundamental questions in a SNP-disease association study is to find an optimal subset of SNPs with predicting power for disease status. To find that subset while reducing study burden in terms of time and costs, one can potentially reconcile information redundancy from associations between SNP markers