Department of Computer Science

Permanent URI for this collection: https://hdl.handle.net/20.500.11875/2419

A Distribution-free Convolution Model for Background Correction of Oligonucleotide Microarray Data
(BMC Genomics, 2009-07-07) Chen, Zhongxue; McGee, Monnie; Liu, Qingzhong; Kong, Megan; Deng, Youping; Scheuermann, Richard H
Introduction: Affymetrix GeneChip® high-density oligonucleotide arrays are widely used in biological and medical research because of production reproducibility, which facilitates the comparison of results between experiment runs. In order to obtain high-level classification and cluster analysis that can be trusted, it is important to perform various pre-processing steps on the probe-level data to control for variability in sample processing and array hybridization. Many proposed preprocessing methods are parametric, in that they assume that the background noise generated by microarray data is a random sample from a statistical distribution, typically a normal distribution. The quality of the final results depends on the validity of such assumptions. Results: We propose a Distribution Free Convolution Model (DFCM) to circumvent observed deficiencies in meeting and validating distribution assumptions of parametric methods. Knowledge of array structure and the biological function of the probes indicates that the intensities of mismatched (MM) probes that correspond to the smallest perfect match (PM) intensities can be used to estimate the background noise. Specifically, we obtain the smallest q2 percent of the MM intensities that are associated with the lowest q1 percent PM intensities, and use these intensities to estimate background. Conclusion: Using the Affymetrix Latin Square spike-in experiments, we show that the background noise generated by microarray experiments typically is not well modeled by a single overall normal distribution. We further show that the signal is not exponentially distributed, as is also commonly assumed. We also show that DFCM has better sensitivity and specificity, as measured by ROC curves and area under the curve (AUC), than MAS 5.0, RMA, RMA with no background correction (RMA-noBG), GCRMA, PLIER, and dChip (MBEI) for preprocessing of Affymetrix microarray data. These results hold for the two spike-in data sets and the one real data set that were analyzed, and they show that our nonparametric method is a superior alternative for background correction of Affymetrix data.
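The probe selection step described in this abstract can be sketched as follows. This is a minimal illustration only: the function name, the list-based inputs, and the rounding rule for the percent cutoffs are assumptions, and the abstract does not say how the selected intensities are turned into a background estimate, so the sketch stops at returning them.

```python
import math


def dfcm_background_candidates(pm, mm, q1, q2):
    """Return the smallest q2 percent of MM intensities among the probe
    pairs whose PM intensity is in the lowest q1 percent.

    pm, mm: equal-length sequences of paired PM and MM intensities.
    q1, q2: percentages in (0, 100]. The ceil-based cutoff is an assumption.
    """
    if len(pm) != len(mm):
        raise ValueError("pm and mm must have the same length")
    if not (0 < q1 <= 100 and 0 < q2 <= 100):
        raise ValueError("q1 and q2 must be percentages in (0, 100]")

    # Indices of probe pairs ordered by increasing PM intensity.
    order = sorted(range(len(pm)), key=lambda i: pm[i])
    n_low_pm = max(1, math.ceil(len(pm) * q1 / 100))
    low_pm_idx = order[:n_low_pm]

    # MM intensities paired with those low-PM probes, in increasing order.
    mm_subset = sorted(mm[i] for i in low_pm_idx)
    n_low_mm = max(1, math.ceil(len(mm_subset) * q2 / 100))
    return mm_subset[:n_low_mm]


pm = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
mm = [5, 9, 3, 7, 1, 2, 4, 6, 8, 0]
candidates = dfcm_background_candidates(pm, mm, q1=40, q2=50)
```

With these toy values, the four lowest PM probes contribute MM intensities 5, 9, 3 and 7, and the lower half of those is kept.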
Detecting Differentially Methylated Loci for Multiple Treatments Based on High-Throughput Methylation Data
(BMC Bioinformatics, 2014-05-15) Chen, Zhongxue; Huang, Hanwen; Liu, Qingzhong
Background: Because of its important effects, as an epigenetic factor, on gene expression and disease development, DNA methylation has drawn much attention from researchers. Detecting differentially methylated loci is an important but challenging step in studying the regulatory roles of DNA methylation in a broad range of biological processes and diseases. Several statistical approaches have been proposed to detect significant methylated loci; however, most of them were designed specifically for case-control studies. Results: Noting that age is associated with methylation level and that methylation data are not normally distributed, in this paper we propose a nonparametric method to detect differentially methylated loci under multiple conditions with trend for Illumina Array Methylation data. The nonparametric Cuzick test is used to detect the differences among treatment groups with trend for each age group; then an overall p-value is calculated by combining those independent p-values, one from each age group. Conclusions: We compare the new approach with other methods using simulated and real data. Our study shows that the proposed method outperforms the other methods considered in this paper in terms of power: it detected more biologically meaningful differentially methylated loci than the others.
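The abstract does not name the rule used to combine the per-age-group p-values. As one possibility, the sketch below uses Fisher's method, a standard way to combine independent p-values; this choice is an assumption and may differ from the paper's procedure. The Cuzick trend test that would produce each input p-value is not implemented here, and the function name is hypothetical.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null. For even degrees of freedom
    the upper tail has the closed form
        P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    so no statistics library is needed.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")

    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is x / 2

    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# One p-value per age group (illustrative numbers, not from the paper).
overall_p = fisher_combine([0.05, 0.05])
```

A single p-value is returned unchanged, which is a quick sanity check on the closed form.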
A Logic Approach to Granular Computing
(International Journal of Cognitive Informatics and Natural Intelligence, 2008-04) Bing, Zhou; Yiyu, Yao
Granular computing is an emerging field of study that attempts to formalize and explore methods and heuristics of human problem solving with multiple levels of granularity and abstraction. A fundamental issue of granular computing is the representation and utilization of granular structures. The main objective of this article is to examine a logic approach to address this issue. Following the classical interpretation of a concept as a pair of intension and extension, we interpret a granule as a pair of a set of objects and a logic formula describing the granule. The building blocks of granular structures are basic granules representing an elementary concept or a piece of knowledge. They are treated as atomic formulas of a logic language. Different types of granular structures can be constructed by using logic connectives. Within this logic framework, we show that rough set analysis (RSA) and formal concept analysis (FCA) can be interpreted uniformly. The two theories use multilevel granular structures but differ in their choices of definable granules and granular structures.
In Search of Effective Granulization with DTRS for Ternary Classification
(International Journal of Cognitive Informatics and Natural Intelligence, 2011) Bing, Zhou; Yiyu, Yao
The Decision-Theoretic Rough Set (DTRS) model provides a three-way decision approach to classification problems, which allows a classifier to defer its decision on suspicious examples rather than being forced to make an immediate determination. The deferred cases must be reexamined by collecting further information. Although the formulation of DTRS is intuitively appealing, a fundamental question remains: how to determine the class of the deferred examples. In this paper, the authors introduce an adaptive learning method that automatically deals with the deferred examples by searching for effective granulization. A decision tree is constructed for classification. At each level, the authors sequentially choose the attributes that provide the most effective granulization. A subtree is added recursively if the conditional probability lies between the two given thresholds. A branch reaches its leaf node when the conditional probability is above or equal to the first threshold, or is below or equal to the second threshold, or the granule meets certain conditions. This learning process is illustrated by an example.
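The threshold rule in this abstract can be illustrated with a small sketch. The names `alpha` and `beta` for the first and second thresholds and the labels "accept", "reject" and "defer" are assumptions chosen for illustration. The tree construction and the additional granule conditions mentioned in the abstract are not modeled.

```python
def three_way_decision(cond_prob, alpha, beta):
    """Classify a granule from its conditional probability.

    alpha: first (upper) threshold, beta: second (lower) threshold,
    with 0 <= beta < alpha <= 1. A "defer" result corresponds to the
    case where a subtree would be grown to refine the granule.
    """
    if not (0.0 <= beta < alpha <= 1.0):
        raise ValueError("thresholds must satisfy 0 <= beta < alpha <= 1")
    if not (0.0 <= cond_prob <= 1.0):
        raise ValueError("cond_prob must be a probability")

    if cond_prob >= alpha:
        return "accept"   # leaf node: probability at or above the first threshold
    if cond_prob <= beta:
        return "reject"   # leaf node: probability at or below the second threshold
    return "defer"        # strictly between the thresholds: refine further


decisions = [three_way_decision(p, alpha=0.8, beta=0.3) for p in (0.9, 0.5, 0.3)]
```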
Supervised learning-based tagSNP selection for genome-wide disease classifications
(BioMed Central, 2007-07-25) Liu, Qingzhong; Sung, Andrew H.; Chen, Zhongxue; Yang, Mary Qu; Huang, Xudong; Yang, Jack
Comprehensive evaluation of common genetic variations through association of single nucleotide polymorphisms (SNPs) with complex human diseases on the genome-wide scale is an active area in human genome research. One of the fundamental questions in a SNP-disease association study is to find an optimal subset of SNPs with predicting power for disease status. To find that subset while reducing study burden in terms of time and costs, one can potentially reconcile information redundancy from associations between SNP markers
Independent component analysis of Alzheimer's DNA microarray gene expression data
(Molecular Neurodegeneration, 2009-01-28) Liu, Qingzhong; Chen, Zhongxue; Vanderburg, Charles R; Rogers, Jack T.; Huang, Xudong; Mou, Xiaoyang; Wei, Kong
Background: Gene microarray technology is an effective tool to investigate the simultaneous activity of multiple cellular pathways from hundreds to thousands of genes. However, because data in the colossal amounts generated by DNA microarray technology are usually complex, noisy, high-dimensional, and often hindered by low statistical power, their exploitation is difficult. To overcome these problems, two kinds of unsupervised analysis methods for microarray data have been developed: principal component analysis (PCA) and independent component analysis (ICA). PCA projects the data into a new space spanned by mutually orthonormal principal components. The constraint of mutual orthogonality and the reliance on second-order statistics within PCA algorithms, however, may not be applicable to the biological systems studied. Moreover, extracting and characterizing the most informative features of the biological signals requires higher-order statistics. Results: ICA is one of the unsupervised algorithms that can extract higher-order statistical structures from data and has been applied to DNA microarray gene expression data analysis. We applied the FastICA method to DNA microarray gene expression data from Alzheimer's disease (AD) hippocampal tissue samples and performed subsequent gene clustering. Experimental results showed that the ICA method can improve the clustering results of AD samples and identify significant genes. More than 50 significant genes with high expression levels in severe AD were extracted, representing immunity-related protein, metal-related protein, membrane protein, lipoprotein, neuropeptide, cytoskeleton protein, cellular binding protein, and ribosomal protein. Within the aforementioned categories, our method also found 37 significant genes with low expression levels. Moreover, it is worth noting that some oncogenes and phosphorylation-related proteins are expressed at low levels.
In comparison to the PCA and support vector machine recursive feature elimination (SVM-RFE) methods, which are widely used in microarray data analysis, ICA can identify more AD-related genes. Furthermore, we have validated and identified many genes that are associated with AD pathogenesis. Conclusion: We demonstrated that ICA exploits higher-order statistics to identify gene expression profiles as linear combinations of elementary expression patterns that lead to the construction of potential AD-related pathogenic pathways. Our computing results also validated that the ICA model outperformed PCA and the SVM-RFE method. This report shows that ICA as a microarray data analysis tool can help us to elucidate the molecular taxonomy of AD and other multifactorial and polygenic complex diseases.
Comparison of feature selection and classification for MALDI-MS data
(BMC Genomics, 2009-07-07) Liu, Qingzhong; Sung, Andrew H.; Chen, Zhongxue; Yang, Jack Y; Qiao, Mengyu; Yang, Mary Qu; Huang, Xudong; Deng, Youping
Introduction: In the classification of Mass Spectrometry (MS) proteomics data, peak detection, feature selection, and learning classifiers are critical to classification accuracy. To better understand which methods are more accurate when classifying data, some publicly available peak detection algorithms for Matrix-assisted Laser Desorption Ionization Mass Spectrometry (MALDI-MS) data were recently compared; however, the issue of different feature selection methods and different classification models as they relate to classification performance has not been addressed. With the application of intelligent computing, much progress has been made in the development of feature selection methods and learning classifiers for the analysis of high-throughput biological data. The main objective of this paper is to compare the methods of feature selection and different learning classifiers when applied to MALDI-MS data and to provide a subsequent reference for the analysis of MS proteomics data. Results: We compared a well-known method of feature selection, Support Vector Machine Recursive Feature Elimination (SVMRFE), and a recently developed method, Gradient based Leave-one-out Gene Selection (GLGS), that effectively performs microarray data analysis. We also compared several learning classifiers including K-Nearest Neighbor Classifier (KNNC), Naive Bayes Classifier (NBC), Nearest Mean Scaled Classifier (NMSC), uncorrelated normal based quadratic Bayes Classifier recorded as UDC, Support Vector Machines (SVMs), and a distance metric learning for Large Margin Nearest Neighbor classifier (LMNN) based on Mahalanobis distance. To compare, we conducted a comprehensive experimental study using three types of MALDI-MS data. Conclusion: Regarding feature selection, SVMRFE outperformed GLGS in classification. As for the learning classifiers, when classification models derived from the best training were compared, SVMs performed the best with respect to the expected testing accuracy.
However, the distance metric learning LMNN outperformed SVMs and other classifiers on evaluating the best testing. In such cases, the optimum classification model based on LMNN is worth investigating for future study.
Feature Selection and Classification of MAQC-II Breast Cancer and Multiple Myeloma Microarray Gene Expression Data
(PLoS ONE, 2009-12-11) Liu, Qingzhong; Sung, Andrew H.; Chen, Zhongxue; Huang, Xudong; Deng, Youping; Liu, Jianzhong
Microarray data has a high dimension of variables but available datasets usually have only a small number of samples, thereby making the study of such datasets interesting and challenging. In the task of analyzing microarray data for the purpose of, e.g., predicting gene-disease association, feature selection is very important because it provides a way to handle the high dimensionality by exploiting information redundancy induced by associations among genetic markers. Judicious feature selection in microarray data analysis can result in significant reduction of cost while maintaining or improving the classification or prediction accuracy of learning machines that are employed to sort out the datasets. In this paper, we propose a gene selection method called Recursive Feature Addition (RFA), which combines supervised learning and statistical similarity measures. We compare our method with the following gene selection methods: Support Vector Machine Recursive Feature Elimination (SVMRFE); Leave-One-Out Calculation Sequential Forward Selection (LOOCSFS); and Gradient based Leave-one-out Gene Selection (GLGS). To evaluate the performance of these gene selection methods, we employ several popular learning classifiers on the MicroArray Quality Control phase II on predictive modeling (MAQC-II) breast cancer dataset and the MAQC-II multiple myeloma dataset. Experimental results show that gene selection is strictly paired with learning classifier. Overall, our approach outperforms other compared methods. The biological functional analysis based on the MAQC-II breast cancer dataset convinced us to apply our method for phenotype prediction. Additionally, learning classifiers also play important roles in the classification of microarray data, and our experimental results indicate that the Nearest Mean Scaled Classifier (NMSC) is a good choice due to its prediction reliability and its stability across the three performance measurements: testing accuracy, MCC values, and AUC errors.
A gene selection method for GeneChip array data with small sample sizes
(2010-07) Chen, Zhongxue; Liu, Qingzhong; McGee, Monnie; Kong, Megan; Huang, Xudong; Deng, Youping; Scheuermann, Richard H.
Background: In microarray experiments with small sample sizes, it is a challenge to estimate p-values accurately and decide cutoff p-values for gene selection appropriately. Although permutation-based methods have proved to have greater sensitivity and specificity than the regular t-test, their p-values are highly discrete due to the limited number of permutations available in very small sample sizes. Furthermore, estimated permutation-based p-values for true nulls are highly correlated and not uniformly distributed between zero and one, making it difficult to use current false discovery rate (FDR)-controlling methods. Results: We propose a model-based information sharing method (MBIS) that, after an appropriate data transformation, utilizes information shared among genes. We use a normal distribution to model the mean differences of true nulls across two experimental conditions. The parameters of the model are then estimated using all data in hand. Based on this model, p-values, which are uniformly distributed from true nulls, are calculated. Then, since FDR-controlling methods are generally not well suited to microarray data with very small sample sizes, we select genes for a given cutoff p-value and then estimate the false discovery rate. Conclusion: Simulation studies and analysis using real microarray data show that the proposed method, MBIS, is more powerful and reliable than current methods. It has wide application to a variety of situations.
Gene selection and classification for cancer microarray data based on machine learning and similarity measures
(BMC Genomics, 2011) Liu, Qingzhong; Sung, Andrew H.; Chen, Zhongxue; Liu, Jianzhong; Chen, Lei; Deng, Youping; Wang, Zhaohui; Huang, Xudong; Qiao, Mengyu
Background: Microarray data have a high dimension of variables and a small sample size. In microarray data analyses, two important issues are how to choose genes, which provide reliable and good prediction for disease status, and how to determine the final gene set that is best for classification. Associations among genetic markers mean one can exploit information redundancy to potentially reduce classification cost in terms of time and money. Results: To deal with redundant information and improve classification, we propose a gene selection method, Recursive Feature Addition, which combines supervised learning and statistical similarity measures. To determine the final optimal gene set for prediction and classification, we propose an algorithm, Lagging Prediction Peephole Optimization. By using six benchmark microarray gene expression data sets, we compared Recursive Feature Addition with recently developed gene selection methods: Support Vector Machine Recursive Feature Elimination, Leave-One-Out Calculation Sequential Forward Selection and several others. Conclusions: On average, with the use of popular learning machines including Nearest Mean Scaled Classifier, Support Vector Machine, Naive Bayes Classifier and Random Forest, Recursive Feature Addition outperformed other methods. Our studies also showed that Lagging Prediction Peephole Optimization is superior to random strategy; Recursive Feature Addition with Lagging Prediction Peephole Optimization obtained better testing accuracies than the gene selection method varSelRF.
Assessment of gene order computing methods for Alzheimer’s disease
(BMC Medical Genomics, 2013-01-23) Liu, Qingzhong; Hu, Benqiong; Pang, Chaoyang; Wang, Shipend; Chen, Zhongxue; Vanderburg, Charles R; Rogers, Jack T.; Deng, Youping; Huang, Xudong; Jiang, Gang
Background: Computational genomics of Alzheimer's disease (AD), the most common form of senile dementia, is a nascent field in AD research. The field includes AD gene clustering by computing gene order, which generates higher quality gene clustering patterns than most other clustering methods. However, few gene order computing methods are available, such as Genetic Algorithm (GA) and Ant Colony Optimization (ACO). Further, their performance in gene order computation using AD microarray data is not known. We thus set out to evaluate the performance of current gene order computing methods with different distance formulas, and to identify additional features associated with gene order computation. Methods: Using different distance formulas (Pearson distance, Euclidean distance, and squared Euclidean distance) and other conditions, gene orders were calculated by ACO and GA (including standard GA and improved GA) methods, respectively. The qualities of the gene orders were compared, and new features from the calculated gene orders were identified. Results: Compared to the GA methods tested in this study, ACO fits the AD microarray data the best when calculating gene order. In addition, the following features were revealed: different distance formulas generated a different quality of gene order, and the commonly used Pearson distance was not the best distance formula when used with both GA and ACO methods for AD microarray data. Conclusion: Compared with Pearson distance and Euclidean distance, the squared Euclidean distance generated the best quality gene order computed by GA and ACO methods.
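The three distance formulas compared in this abstract can be written out as a short sketch. The Pearson distance is taken here as one minus the Pearson correlation coefficient, which is a common convention but an assumption about the paper's exact definition. The function names and the toy expression profiles are illustrative.

```python
import math


def squared_euclidean_distance(x, y):
    """Sum of squared coordinate differences between two profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, y))


def euclidean_distance(x, y):
    """Square root of the squared Euclidean distance."""
    return math.sqrt(squared_euclidean_distance(x, y))


def pearson_distance(x, y):
    """1 - r, where r is the Pearson correlation (assumed definition)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1.0 - sxy / math.sqrt(sxx * syy)


gene_a = [1.0, 2.0, 3.0]
gene_b = [2.0, 4.0, 6.0]
d_pearson = pearson_distance(gene_a, gene_b)
d_euclid = euclidean_distance(gene_a, gene_b)
d_sq = squared_euclidean_distance(gene_a, gene_b)
```

The example shows why the choice matters: the two profiles are perfectly correlated, so their Pearson distance is zero, while both Euclidean variants still separate them.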
Influence of Machine Learning vs. Ranking Algorithm on the Critical Dimension
(International Journal of Future Computer and Communication, 2013-06) Sung, Andrew H.; Liu, Qingzhong; Suryakumar, Divya
The critical dimension is the minimum number of features required for a learning machine to perform with “high” accuracy, which for a specific dataset is dependent upon the learning machine and the ranking algorithm. Discovering the critical dimension, if one exists for a dataset, can help to reduce the feature size while maintaining the learning machine’s performance. It is important to understand the influence of learning machines and ranking algorithms on critical dimension to reduce the feature size effectively. In this paper we experiment with three ranking algorithms and three learning machines on several datasets to study their combined effect on the critical dimension. Results show the ranking algorithm has greater influence on the critical dimension than the learning machine.
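As a rough illustration of the definition above, the sketch below scans accuracies measured with the top-k ranked features and reports the smallest k that reaches a chosen "high" accuracy. The threshold value, the input layout and the function name are assumptions. The paper's operational definition may be stricter, for example by requiring accuracy to stay high for all larger feature counts.

```python
def critical_dimension(accuracy_by_k, threshold):
    """Return the smallest feature count k whose accuracy meets threshold.

    accuracy_by_k[k - 1] is the accuracy of one learning machine trained on
    the top k features under one ranking algorithm. Returns None when no
    feature count reaches the threshold, i.e. no critical dimension is found.
    """
    for k, accuracy in enumerate(accuracy_by_k, start=1):
        if accuracy >= threshold:
            return k
    return None


# Illustrative accuracies for the top 1..5 ranked features (made-up numbers).
accuracies = [0.60, 0.70, 0.91, 0.90, 0.95]
k_star = critical_dimension(accuracies, threshold=0.90)
```

Repeating this scan for each pairing of ranking algorithm and learning machine gives one value per pairing, which is the kind of comparison the abstract describes.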
“Meta Cloud Discovery” Model: An Approach to Integrity Monitoring for Cloud-Based Disaster Recovery Planning
(International Journal of Information and Education Technology, 2013-10) Wilbert, Brittany M.; Liu, Qingzhong
A structure is required to prevent malicious code from leaking onto the system. The use of sandboxes has become more advanced, allowing investigators to access malicious code while minimizing the risk of infecting their own machine. This technology is also used to prevent malicious code from compromising vulnerable machines. The use of sandbox technology and techniques can potentially be extended to cloud infrastructures to prevent malicious content from compromising specialized infrastructure such as backups that are used for disaster recovery and business continuity planning. This paper will discuss existing algorithms related to current sandbox technology, and extend the work into the “Meta Cloud Discovery” model, a sandbox integrity-monitoring proposal for disaster recovery. Finally, implementation examples will be discussed as well as future research that would need to be performed to improve the model.
A new statistical approach to combining p-values using gamma distribution and its application to genome-wide association study
(BMC Bioinformatics, 2014-12-16) Liu, Qingzhong; Chen, Zhongxue; Yang, William; Yang, Jack Y; Li, Jing; Yang, Mary Qu
Background: Combining information from different studies is an important and useful practice in bioinformatics, including genome-wide association study, rare variant data analysis and other set-based analyses. Many statistical methods have been proposed to combine p-values from independent studies. However, it is known that there is no uniformly most powerful test under all conditions; therefore, finding a powerful test for a specific situation is important and desirable. Results: In this paper, we propose a new statistical approach to combining p-values based on gamma distribution, which uses the inverse of the p-value as the shape parameter in the gamma distribution. Conclusions: Simulation study and real data application demonstrate that the proposed method has good performance in some situations.
A Method to Detect AAC Audio Forgery
(EAI Endorsed Transactions on Security and Safety, 2015-05-25) Zhang, Jing; Liu, Qingzhong; Yang, Ming; Sung, Andrew H.; Liu, Yanxin; Chen, Lei; Chen, Zhongxue
Advanced Audio Coding (AAC), a standardized lossy compression scheme for digital audio that was designed to be the successor of the MP3 format, generally achieves better sound quality than MP3 at similar bit rates. Because AAC is the default or standard audio format for many devices and AAC audio files may be presented as important digital evidence, authentication of these audio files is much needed but largely lacking. In this paper, we propose a scheme to expose tampered AAC audio streams that are encoded at the same encoding bit rate. Specifically, we design a shift-recompression based method to retrieve the differential features between the re-encoded audio stream at each shift and the original audio stream; a learning classifier is then employed to recognize the different patterns of differential features of the doctored forgery files and the original (untouched) audio files. Experimental results show that our approach is very promising and effective in detecting same bit-rate forgery of AAC audio streams. Our study also shows that shift recompression-based differential analysis is very effective for detecting MP3 forgery at the same bit rate.
Mobile Watermarking against Geometrical Distortions
(EAI Endorsed Transactions on Security and Safety, 2015-05-25) Liu, Qingzhong; Zhang, Jing; Su, Yuting; Zhi, Meili
Mobile watermarking robust to geometrical distortions is still a great challenge. In mobile watermarking, efficient computation is necessary because mobile devices have very limited resources due to power consumption. In this paper, we propose a low complexity geometrically resilient watermarking approach based on the optimal tradeoff circular harmonic function (OTCHF) correlation filter and the minimum average correlation energy Mellin radial harmonic (MACE-MRH) correlation filter. By the rotation, translation and scale tolerance properties of the two kinds of filter, the proposed watermark detector can be robust to geometrical attacks. The embedded watermark is weighted by a perceptual mask which matches very well with the properties of the human visual system. Before correlation, a whitening process is utilized to improve watermark detection reliability. Experimental results demonstrate that the proposed watermarking approach is computationally efficient and robust to geometrical distortions.
A Comparison Study using StegExpose for Steganalysis
(International Journal of Knowledge Engineering, 2017-06) Olson, Eric; Carter, Larry; Liu, Qingzhong
Steganography is the art of hiding secret messages in innocent digital data files. Steganalysis aims to expose the existence of steganograms. While internet applications and social media have grown tremendously in recent years, social media is increasingly being used by cybercriminals and terrorists as a means of command and control communication, including taking advantage of steganography for covert communication. In this paper, we investigate open source steganography/steganalysis software and test StegExpose for steganalysis. Our experimental results show that the capability of StegExpose is very limited.
Smartphone Forensic Challenges
(International Journal of Computer Science and Security, 2019) Krishnan, Sundar; Zhou, Bing; An, Min Kyung
Globally, the extensive use of smartphone devices has led to an increase in storage and transmission of enormous volumes of data that could potentially be used as digital evidence in a forensic investigation. Digital evidence can sometimes be difficult to extract from these devices given the various versions and models of smartphone devices in the market. Forensic analysis of smartphones to extract digital evidence can be carried out in many ways; however, prior knowledge of smartphone forensic tools is paramount to a successful forensic investigation. In this paper, the authors outline challenges, limitations and reliability issues faced when using smartphone device forensic tools and accompanying forensic techniques. The main objective of this paper is to raise awareness of these forensic challenges rather than to suggest best practices for addressing them.
Android System Partition to Traffic Data?
(International Journal of Knowledge Engineering, 2017-12) Bing, Zhou; Liu, Qingzhong; Byrd, Brittany
The familiarity and prevalence of mobile devices inflates their use as instruments of crime. Law enforcement personnel and mobile forensics investigators are constantly battling to gain the upper hand in developing a standardized system able to comprehensively identify and resolve the vulnerabilities present within the mobile device platform. The Android mobile platform can be perceived as an antagonist to this objective, as its open nature provides attackers direct insight into the internalization and security features of the most popular platform presently in the consumer market. This paper identifies and demonstrates the system partition in an Android smartphone as a viable attack vector for covert data trafficking. An implementation strategy (comprising four experimental phases) is developed to exploit the internal memory of a non-activated rooted Android HTC Desire 510 4G smartphone. A set of mobile forensics tools, namely AccessData Mobile Phone Examiner Plus (MPE+ v5.5.6), Oxygen Forensic Suite 2015 Standard, and Google Android Debug Bridge (adb), was used for the extraction and analysis process. The data analysis found the proposed approach to be a persistent and minimally detectable method to exchange data.
Smartphone Sensor-Based Activity Recognition by Using Machine Learning and Deep Learning Algorithms
(International Journal of Machine Learning and Computing, 2018) Liu, Qingzhong; Zhaoxian, Zhou; Sarbagya, Shakya Ratna; Prathyusha, Uduthalapally; Mengyu, Qiao; Andrew, Sung H.
Smartphones are widely used today, and it has become possible to detect changes in the user's environment by using smartphone sensors. In this paper, we propose a method to identify human activities with reasonably high accuracy by using smartphone sensor data. The raw smartphone sensor data are collected from two categories of human activity: motion-based, e.g., walking and running; and phone movement-based, e.g., left-right, up-down, clockwise and counterclockwise movement. First, two types of feature extraction are designed for the raw sensor data, and activity recognition is analyzed using machine learning classification models based on these features. Second, the activity recognition performance is analyzed through the Convolutional Neural Network (CNN) model using only the raw data. Our experiments show substantial improvement in the results with the addition of features and the use of the CNN model based on smartphone sensor data with judicious learning techniques and good feature designs.