Browsing Cyber Forensics Intelligence Center by Issue Date
Now showing 1 - 20 of 39
Supervised learning-based tagSNP selection for genome-wide disease classifications (BioMed Central, 2007-07-25)
Liu, Qingzhong; Sung, Andrew H.; Chen, Zhongxue; Yang, Mary Qu; Huang, Xudong; Yang, Jack
Comprehensive evaluation of common genetic variations through association of single nucleotide polymorphisms (SNPs) with complex human diseases on the genome-wide scale is an active area in human genome research. One of the fundamental questions in a SNP-disease association study is to find an optimal subset of SNPs with predictive power for disease status. To find that subset while reducing study burden in terms of time and costs, one can potentially reconcile information redundancy from associations between SNP markers.

A Logic Approach to Granular Computing (International Journal of Cognitive Informatics and Natural Intelligence, 2008-04)
Bing, Zhou; Yiyu, Yao
Granular computing is an emerging field of study that attempts to formalize and explore methods and heuristics of human problem solving with multiple levels of granularity and abstraction. A fundamental issue of granular computing is the representation and utilization of granular structures. The main objective of this article is to examine a logic approach to address this issue. Following the classical interpretation of a concept as a pair of intension and extension, we interpret a granule as a pair of a set of objects and a logic formula describing the granule. The building blocks of granular structures are basic granules representing an elementary concept or a piece of knowledge. They are treated as atomic formulas of a logic language. Different types of granular structures can be constructed by using logic connectives. Within this logic framework, we show that rough set analysis (RSA) and formal concept analysis (FCA) can be interpreted uniformly. The two theories use multilevel granular structures but differ in their choices of definable granules and granular structures.

Independent component analysis of Alzheimer's DNA microarray gene expression data (Molecular Neurodegeneration, 2009-01-28)
Liu, Qingzhong; Chen, Zhongxue; Vanderburg, Charles R.; Rogers, Jack T.; Huang, Xudong; Mou, Xiaoyang; Wei, Kong
Background: Gene microarray technology is an effective tool to investigate the simultaneous activity of multiple cellular pathways from hundreds to thousands of genes. However, because the colossal amounts of data generated by DNA microarray technology are usually complex, noisy, high-dimensional, and often hindered by low statistical power, their exploitation is difficult. To overcome these problems, two kinds of unsupervised analysis methods for microarray data, principal component analysis (PCA) and independent component analysis (ICA), have been developed to accomplish the task. PCA projects the data into a new space spanned by the principal components, which are mutually orthonormal to each other. The constraint of mutual orthogonality and the second-order statistics used within PCA algorithms, however, may not apply to the biological systems studied. Extracting and characterizing the most informative features of the biological signals, however, requires higher-order statistics. Results: ICA is one of the unsupervised algorithms that can extract higher-order statistical structures from data and has been applied to DNA microarray gene expression data analysis. We applied the FastICA method to DNA microarray gene expression data from Alzheimer's disease (AD) hippocampal tissue samples and performed subsequent gene clustering. Experimental results showed that the ICA method can improve the clustering of AD samples and identify significant genes. More than 50 significant genes with high expression levels in severe AD were extracted, representing immunity-related proteins, metal-related proteins, membrane proteins, lipoproteins, neuropeptides, cytoskeleton proteins, cellular binding proteins, and ribosomal proteins. Within the aforementioned categories, our method also found 37 significant genes with low expression levels. Moreover, it is worth noting that some oncogenes and phosphorylation-related proteins are expressed at low levels. In comparison to the PCA and support vector machine recursive feature elimination (SVM-RFE) methods, which are widely used in microarray data analysis, ICA can identify more AD-related genes. Furthermore, we have validated and identified many genes that are associated with AD pathogenesis. Conclusion: We demonstrated that ICA exploits higher-order statistics to identify gene expression profiles as linear combinations of elementary expression patterns that lead to the construction of potential AD-related pathogenic pathways. Our computing results also validated that the ICA model outperformed PCA and the SVM-RFE method. This report shows that ICA as a microarray data analysis tool can help us to elucidate the molecular taxonomy of AD and other multifactorial and polygenic complex diseases.
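The ICA entry above applies FastICA to AD microarray data to recover expression patterns beyond what PCA's second-order, orthogonality-constrained decomposition can capture. The sketch below illustrates the general idea with scikit-learn's FastICA on a synthetic genes-by-samples matrix; the matrix, the number of components, and the thresholding rule for flagging "significant" genes are assumptions made for demonstration, not the paper's actual pipeline.

```python
# Minimal sketch: FastICA on a gene expression matrix (illustrative only).
# Assumptions: rows are genes, columns are samples, data already preprocessed;
# n_components and the z-score cutoff are arbitrary choices for demonstration.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n_genes, n_samples = 500, 20
X = rng.laplace(size=(n_genes, n_samples))          # stand-in for microarray data

ica = FastICA(n_components=5, random_state=0)
S = ica.fit_transform(X)                            # gene loadings on independent components
A = ica.mixing_                                     # sample mixing matrix (samples x components)

# Flag genes that load strongly on any component (a simple, hypothetical rule).
z = (S - S.mean(axis=0)) / S.std(axis=0)
candidate_genes = np.where(np.abs(z).max(axis=1) > 3.0)[0]
print(f"{len(candidate_genes)} candidate genes flagged across {S.shape[1]} components")
```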
Comparison of feature selection and classification for MALDI-MS data (BMC Genomics, 2009-07-07)
Liu, Qingzhong; Sung, Andrew H.; Chen, Zhongxue; Yang, Jack Y.; Qiao, Mengyu; Yang, Mary Qu; Huang, Xudong; Deng, Youping
Introduction: In the classification of Mass Spectrometry (MS) proteomics data, peak detection, feature selection, and learning classifiers are critical to classification accuracy. To better understand which methods are more accurate when classifying data, some publicly available peak detection algorithms for Matrix-Assisted Laser Desorption Ionization Mass Spectrometry (MALDI-MS) data were recently compared; however, the issue of different feature selection methods and different classification models as they relate to classification performance has not been addressed. With the application of intelligent computing, much progress has been made in the development of feature selection methods and learning classifiers for the analysis of high-throughput biological data. The main objective of this paper is to compare methods of feature selection and different learning classifiers when applied to MALDI-MS data and to provide a subsequent reference for the analysis of MS proteomics data. Results: We compared a well-known method of feature selection, Support Vector Machine Recursive Feature Elimination (SVMRFE), and a recently developed method, Gradient based Leave-one-out Gene Selection (GLGS), that effectively performs microarray data analysis. We also compared several learning classifiers, including the K-Nearest Neighbor Classifier (KNNC), Naive Bayes Classifier (NBC), Nearest Mean Scaled Classifier (NMSC), the uncorrelated normal based quadratic Bayes Classifier (UDC), Support Vector Machines, and a distance metric learning Large Margin Nearest Neighbor classifier (LMNN) based on Mahalanobis distance. For the comparison, we conducted a comprehensive experimental study using three types of MALDI-MS data. Conclusion: Regarding feature selection, SVMRFE outperformed GLGS in classification. As for the learning classifiers, when classification models derived from the best training were compared, SVMs performed the best with respect to the expected testing accuracy. However, the distance metric learning LMNN outperformed SVMs and other classifiers when evaluating the best testing. In such cases, the optimum classification model based on LMNN is worth investigating in future studies.
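The MALDI-MS entry above compares feature selectors such as SVM-RFE against several learning classifiers. As a rough illustration of what SVM-RFE does (recursively discarding the features with the smallest SVM weights), here is a scikit-learn sketch on synthetic data; the dataset, kernel, and number of retained features are assumptions for demonstration, not the paper's experimental setup.

```python
# Minimal SVM-RFE sketch on synthetic "spectra" (illustrative only).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Stand-in for peak-intensity features extracted from MALDI-MS spectra.
X, y = make_classification(n_samples=100, n_features=200, n_informative=10,
                           random_state=0)

# RFE with a linear SVM: repeatedly remove the features with the smallest |weights|.
selector = RFE(estimator=SVC(kernel="linear"), n_features_to_select=20, step=10)
X_reduced = selector.fit_transform(X, y)

# Compare a classifier on all features vs. the selected subset.
full = cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean()
reduced = cross_val_score(SVC(kernel="linear"), X_reduced, y, cv=5).mean()
print(f"CV accuracy, all features: {full:.3f}; selected 20 features: {reduced:.3f}")
```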
Feature Selection and Classification of MAQC-II Breast Cancer and Multiple Myeloma Microarray Gene Expression Data (PLoS ONE, 2009-12-11)
Liu, Qingzhong; Sung, Andrew H.; Chen, Zhongxue; Huang, Xudong; Deng, Youpin; Liu, Jianzhong
Microarray data have a high dimension of variables, but available datasets usually have only a small number of samples, thereby making the study of such datasets interesting and challenging. In the task of analyzing microarray data for the purpose of, e.g., predicting gene-disease associations, feature selection is very important because it provides a way to handle the high dimensionality by exploiting information redundancy induced by associations among genetic markers. Judicious feature selection in microarray data analysis can result in significant reduction of cost while maintaining or improving the classification or prediction accuracy of the learning machines that are employed to sort out the datasets. In this paper, we propose a gene selection method called Recursive Feature Addition (RFA), which combines supervised learning and statistical similarity measures. We compare our method with the following gene selection methods: Support Vector Machine Recursive Feature Elimination (SVMRFE), Leave-One-Out Calculation Sequential Forward Selection (LOOCSFS), and Gradient based Leave-one-out Gene Selection (GLGS). To evaluate the performance of these gene selection methods, we employ several popular learning classifiers on the MicroArray Quality Control phase II on predictive modeling (MAQC-II) breast cancer dataset and the MAQC-II multiple myeloma dataset. Experimental results show that gene selection is strictly paired with the learning classifier. Overall, our approach outperforms the other compared methods. The biological functional analysis based on the MAQC-II breast cancer dataset convinced us to apply our method to phenotype prediction. Additionally, learning classifiers also play important roles in the classification of microarray data, and our experimental results indicate that the Nearest Mean Scaled Classifier (NMSC) is a good choice due to its prediction reliability and its stability across the three performance measurements: testing accuracy, MCC values, and AUC errors.

A gene selection method for GeneChip array data with small sample sizes (2010-07)
Chen, Zhongxue; Liu, Qingzhong; McGee, Monnie; Kong, Megan; Huang, Xudong; Deng, Youpin; Scheuermann, Richard H.
In microarray experiments with small sample sizes, it is a challenge to estimate p-values accurately and to choose cutoff p-values for gene selection appropriately. Although permutation-based methods have proved to have greater sensitivity and specificity than the regular t-test, their p-values are highly discrete due to the limited number of permutations available in very small sample sizes. Furthermore, estimated permutation-based p-values for true nulls are highly correlated and not uniformly distributed between zero and one, making it difficult to use current false discovery rate (FDR)-controlling methods. Results: We propose a model-based information sharing method (MBIS) that, after an appropriate data transformation, utilizes information shared among genes. We use a normal distribution to model the mean differences of true nulls across two experimental conditions. The parameters of the model are then estimated using all data in hand. Based on this model, p-values, which are uniformly distributed for true nulls, are calculated. Then, since FDR-controlling methods are generally not well suited to microarray data with very small sample sizes, we select genes for a given cutoff p-value and then estimate the false discovery rate. Conclusion: Simulation studies and analysis using real microarray data show that the proposed method, MBIS, is more powerful and reliable than current methods. It has wide application to a variety of situations.
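The GeneChip entry above proposes MBIS, which models the mean differences of true-null genes with a normal distribution whose parameters are estimated from all genes, then derives p-values from that fitted model. The sketch below shows the flavor of such information sharing across genes; the synthetic data, the simple z-test against the fitted null, and the robust location/scale estimates are assumptions made for illustration, not the published MBIS procedure.

```python
# Illustrative sketch of information sharing across genes (not the published MBIS method).
# Idea: estimate a common null distribution for per-gene mean differences using all genes,
# then score each gene against that shared null instead of its own tiny-sample variance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_genes, n_per_group = 1000, 3                       # very small sample size per condition
group_a = rng.normal(0, 1, size=(n_genes, n_per_group))
group_b = rng.normal(0, 1, size=(n_genes, n_per_group))
group_b[:50] += 2.0                                  # a few truly differential genes

diff = group_b.mean(axis=1) - group_a.mean(axis=1)   # per-gene mean differences

# Fit a normal null from all genes, using robust location/scale so the differential
# minority does not dominate the estimates (an assumption of this sketch).
mu0 = np.median(diff)
sigma0 = 1.4826 * np.median(np.abs(diff - mu0))      # MAD scaled to a normal SD

pvals = 2 * stats.norm.sf(np.abs(diff - mu0) / sigma0)   # two-sided p-values
print("genes below p = 0.001:", int((pvals < 0.001).sum()))
```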
Gene selection and classification for cancer microarray data based on machine learning and similarity measures (BMC Genomics, 2011)
Liu, Qingzhong; Sung, Andrew H.; Chen, Zhongxue; Liu, Jianzhong; Chen, Lei; Deng, Youpin; Wang, Zhaohui; Huang, Xudong; Qiao, Mengyu
Background: Microarray data have a high dimension of variables and a small sample size. In microarray data analyses, two important issues are how to choose genes that provide reliable and good prediction for disease status, and how to determine the final gene set that is best for classification. Associations among genetic markers mean one can exploit information redundancy to potentially reduce classification cost in terms of time and money. Results: To deal with redundant information and improve classification, we propose a gene selection method, Recursive Feature Addition, which combines supervised learning and statistical similarity measures. To determine the final optimal gene set for prediction and classification, we propose an algorithm, Lagging Prediction Peephole Optimization. Using six benchmark microarray gene expression data sets, we compared Recursive Feature Addition with recently developed gene selection methods: Support Vector Machine Recursive Feature Elimination, Leave-One-Out Calculation Sequential Forward Selection, and several others. Conclusions: On average, with the use of popular learning machines including the Nearest Mean Scaled Classifier, Support Vector Machine, Naive Bayes Classifier, and Random Forest, Recursive Feature Addition outperformed the other methods. Our studies also showed that Lagging Prediction Peephole Optimization is superior to a random strategy; Recursive Feature Addition with Lagging Prediction Peephole Optimization obtained better testing accuracies than the gene selection method varSelRF.

In Search of Effective Granulization with DTRS for Ternary Classification (International Journal of Cognitive Informatics and Natural Intelligence, 2011)
Bing, Zhou; Yiyu, Yao
The Decision-Theoretic Rough Set (DTRS) model provides a three-way decision approach to classification problems, which allows a classifier to make a deferment decision on suspicious examples rather than being forced to make an immediate determination. The deferred cases must be reexamined by collecting further information. Although the formulation of DTRS is intuitively appealing, a fundamental question that remains is how to determine the class of the deferred examples. In this paper, the authors introduce an adaptive learning method that automatically deals with the deferred examples by searching for an effective granulization. A decision tree is constructed for classification. At each level, the authors sequentially choose the attributes that provide the most effective granulization. A subtree is added recursively if the conditional probability lies between the two given thresholds. A branch reaches its leaf node when the conditional probability is above or equal to the first threshold, below or equal to the second threshold, or the granule meets certain conditions. This learning process is illustrated by an example.
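The DTRS entry above describes three-way classification: accept when the conditional probability reaches an upper threshold, reject when it falls to a lower threshold, and defer in between, refining the granulization for deferred cases. A minimal sketch of that threshold logic follows; the threshold values and the placeholder refinement step are assumptions for illustration rather than the authors' adaptive granulization algorithm.

```python
# Three-way decision sketch in the spirit of DTRS (illustrative thresholds).
ALPHA = 0.8   # accept threshold (assumed value)
BETA = 0.2    # reject threshold (assumed value)

def three_way_decision(p_positive: float) -> str:
    """Return 'accept', 'reject', or 'defer' for a granule's conditional probability."""
    if p_positive >= ALPHA:
        return "accept"          # leaf node: classify as positive
    if p_positive <= BETA:
        return "reject"          # leaf node: classify as negative
    return "defer"               # boundary region: refine the granule further

def classify_granule(p_positive: float, depth: int, max_depth: int = 3) -> str:
    """Defer by 'growing a subtree'; here the refinement is only a placeholder."""
    decision = three_way_decision(p_positive)
    if decision != "defer" or depth >= max_depth:
        return decision
    # In the paper a finer granulization would be chosen at this point; this
    # sketch simply reports that the example remains deferred.
    return "defer (needs finer granulization)"

for p in (0.95, 0.5, 0.1):
    print(p, "->", classify_granule(p, depth=0))
```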
Calm Before the Storm: The Challenges of Cloud Computing in Digital Forensics (International Journal of Digital Crime and Forensics, 2012-04)
Grispos, George; Storer, Tim; Glisson, William Bradley
Cloud computing is a rapidly evolving information technology (IT) phenomenon. Rather than procure, deploy, and manage a physical IT infrastructure to host their software applications, organizations are increasingly deploying their infrastructure into remote, virtualized environments, often hosted and managed by third parties. This development has significant implications for digital forensic investigators, equipment vendors, law enforcement, and corporate compliance and audit departments, amongst other organizations. Much of digital forensic practice assumes careful control and management of IT assets (particularly data storage) during the conduct of an investigation. This paper summarises the key aspects of cloud computing and analyses how established digital forensic procedures will be invalidated in this new environment, as well as discussing and identifying several new research challenges addressing this changing context.

Assessment of gene order computing methods for Alzheimer's disease (BMC Medical Genomics, 2013-01-23)
Liu, Qingzhong; Hu, Benqiong; Pang, Chaoyang; Wang, Shipend; Chen, Zhongxue; Vanderburg, Charles R.; Rogers, Jack T.; Deng, Youping; Huang, Xudong; Jiang, Gang
Background: Computational genomics of Alzheimer disease (AD), the most common form of senile dementia, is a nascent field in AD research. The field includes AD gene clustering by computing gene order, which generates higher-quality gene clustering patterns than most other clustering methods. However, there are few available gene order computing methods, such as the Genetic Algorithm (GA) and Ant Colony Optimization (ACO). Further, their performance in gene order computation using AD microarray data is not known. We thus set forth to evaluate the performance of current gene order computing methods with different distance formulas, and to identify additional features associated with gene order computation. Methods: Using different distance formulas (Pearson distance, Euclidean distance, and squared Euclidean distance) and other conditions, gene orders were calculated by the ACO and GA (including standard GA and improved GA) methods, respectively. The qualities of the gene orders were compared, and new features from the calculated gene orders were identified. Results: Compared to the GA methods tested in this study, ACO fits the AD microarray data best when calculating gene order. In addition, the following features were revealed: different distance formulas generated gene orders of different quality, and the commonly used Pearson distance was not the best distance formula when used with both the GA and ACO methods for AD microarray data. Conclusion: Compared with Pearson distance and Euclidean distance, the squared Euclidean distance generated the best-quality gene order computed by the GA and ACO methods.
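The gene-order entry above reports that the choice of distance formula (Pearson, Euclidean, or squared Euclidean) changes the quality of the computed gene order. The sketch below builds the three pairwise distance matrices that a GA or ACO ordering method would consume; the random expression matrix is a stand-in, and no actual gene ordering is performed.

```python
# Distance formulas commonly fed to gene-order computation (illustrative data only).
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(2)
expr = rng.normal(size=(30, 12))                          # 30 genes x 12 samples (stand-in)

euclidean = squareform(pdist(expr, metric="euclidean"))
sq_euclidean = squareform(pdist(expr, metric="sqeuclidean"))
pearson = squareform(pdist(expr, metric="correlation"))   # 1 - Pearson correlation

# A GA or ACO gene-ordering method would search for a permutation of the 30 genes
# that minimizes the summed distance between adjacent genes under one of these matrices.
for name, d in [("Euclidean", euclidean), ("squared Euclidean", sq_euclidean),
                ("Pearson", pearson)]:
    print(f"{name}: mean pairwise distance = {d[np.triu_indices(30, k=1)].mean():.3f}")
```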
Influence of Machine Learning vs. Ranking Algorithm on the Critical Dimension (International Journal of Future Computer and Communication, 2013-06)
Sung, Andrew H.; Liu, Qingzhong; Suryakumar, Divya
The critical dimension is the minimum number of features required for a learning machine to perform with "high" accuracy, which for a specific dataset depends on the learning machine and the ranking algorithm. Discovering the critical dimension, if one exists for a dataset, can help to reduce the feature size while maintaining the learning machine's performance. It is important to understand the influence of learning machines and ranking algorithms on the critical dimension in order to reduce the feature size effectively. In this paper we experiment with three ranking algorithms and three learning machines on several datasets to study their combined effect on the critical dimension. Results show the ranking algorithm has a greater influence on the critical dimension than the learning machine.

"Meta Cloud Discovery" Model: An Approach to Integrity Monitoring for Cloud-Based Disaster Recovery Planning (International Journal of Information and Education Technology, 2013-10)
Wilbert, Brittany M.; Liu, Qingzhong
A structure is required to prevent malicious code from leaking onto the system. The use of sandboxes has become more advanced, allowing investigators to access malicious code while minimizing the risk of infecting their own machines. This technology is also used to prevent malicious code from compromising vulnerable machines. The use of sandbox technology and techniques can potentially be extended to cloud infrastructures to prevent malicious content from compromising specialized infrastructure, such as backups that are used for disaster recovery and business continuity planning. This paper discusses existing algorithms related to current sandbox technology and extends the work into the "Meta Cloud Discovery" model, a sandbox integrity-monitoring proposal for disaster recovery. Finally, implementation examples are discussed, as well as future research that would need to be performed to improve the model.

Web Engineering Security (WES) Methodology (Communications of the Association for Information Systems, 2014)
Glisson, William Bradley; Welland, Ray
The impact of the World Wide Web on basic operational economical components in global information-rich civilizations is significant. The repercussions force organizations to provide justification for security from a business-case perspective and to focus on security from a Web application development environment standpoint. The need for clarity promoted an investigation through the acquisition of empirical evidence from a high-level Web survey and a more detailed industry survey to analyze security in the Web application development environment, ultimately contributing to the proposal of the Essential Elements (EE) and the Security Criteria for Web Application Development (SCWAD). The synthesis of the information provided was used to develop the Web Engineering Security (WES) methodology. WES is a proactive, flexible, process-neutral security methodology with customizable components that is based on empirical evidence and used to explicitly integrate security throughout an organization's chosen application development process.
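Relating to the "Influence of Machine Learning vs. Ranking Algorithm on the Critical Dimension" entry above (the minimum number of top-ranked features a learning machine needs to reach a target accuracy), the sketch below scans feature-subset sizes for one ranking algorithm and one learning machine; the synthetic data, the F-score ranking, the classifier, and the "high accuracy" cutoff are all assumptions made for illustration.

```python
# Sketch: estimate a "critical dimension" for one ranker/learner pair (illustrative).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_features=100, n_informative=8,
                           random_state=0)
TARGET_ACCURACY = 0.90   # assumed definition of "high" accuracy

critical_dimension = None
for k in range(1, X.shape[1] + 1):
    X_k = SelectKBest(f_classif, k=k).fit_transform(X, y)   # top-k ranked features
    acc = cross_val_score(GaussianNB(), X_k, y, cv=5).mean()
    if acc >= TARGET_ACCURACY:
        critical_dimension = k
        break

print("critical dimension:", critical_dimension)
```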
A new statistical approach to combining p-values using gamma distribution and its application to genome-wide association study (BMC Bioinformatics, 2014-12-16)
Liu, Qingzhong; Chen, Zhongxue; Yang, William; Yang, Jack Y.; Li, Jing; Yang, Mary Qu
Background: Combining information from different studies is an important and useful practice in bioinformatics, including genome-wide association studies, rare variant data analysis, and other set-based analyses. Many statistical methods have been proposed to combine p-values from independent studies. However, it is known that there is no uniformly most powerful test under all conditions; therefore, finding a powerful test for a specific situation is important and desirable. Results: In this paper, we propose a new statistical approach to combining p-values based on the gamma distribution, which uses the inverse of the p-value as the shape parameter in the gamma distribution. Conclusions: Simulation studies and real data applications demonstrate that the proposed method has good performance under some situations.
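The entry above describes a gamma-based combination statistic in which the inverse of each p-value serves as the shape parameter of a gamma distribution. Because the abstract does not spell out the exact statistic or its analytic null, the sketch below uses a gamma transform merely inspired by that description and calibrates the combined statistic by Monte Carlo simulation under the global null; it is not the authors' procedure.

```python
# Sketch: combining p-values via a gamma transform, calibrated by simulation.
# The transform (shape = 1/p for each p-value) is only inspired by the abstract;
# the authors' exact statistic and analytic null distribution may differ.
import numpy as np
from scipy import stats

def gamma_combine(pvals: np.ndarray) -> float:
    """Sum of gamma quantiles, using the inverse p-value as the shape parameter."""
    shapes = 1.0 / pvals
    return float(stats.gamma.ppf(1.0 - pvals, a=shapes).sum())

def combined_pvalue(pvals: np.ndarray, n_sim: int = 10000, seed: int = 0) -> float:
    """Monte Carlo p-value for the combined statistic under the global null."""
    rng = np.random.default_rng(seed)
    observed = gamma_combine(pvals)
    null = np.array([gamma_combine(rng.uniform(size=len(pvals)))
                     for _ in range(n_sim)])
    return float((null >= observed).mean())

pvals = np.array([0.01, 0.20, 0.03, 0.50, 0.04])
print("combined p-value (simulated null):", combined_pvalue(pvals))
```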
A Method to Detect AAC Audio Forgery (EAI Endorsed Transactions on Security and Safety, 2015-05-25)
Zhang, Jing; Liu, Qingzhong; Yang, Ming; Sung, Andrew H.; Liu, Yanxin; Chen, Lei; Chen, Zhongxue
Advanced Audio Coding (AAC), a standardized lossy compression scheme for digital audio designed to be the successor of the MP3 format, generally achieves better sound quality than MP3 at similar bit rates. While AAC is the default or standard audio format for many devices, and AAC audio files may be presented as important digital evidence, authentication of such audio files is highly needed but relatively missing. In this paper, we propose a scheme to expose tampered AAC audio streams that are encoded at the same encoding bit rate. Specifically, we design a shift-recompression based method to retrieve the differential features between the re-encoded audio stream at each shift and the original audio stream; a learning classifier is then employed to recognize the different patterns of differential features in doctored forgery files and original (untouched) audio files. Experimental results show that our approach is very promising and effective in detecting forgery at the same encoding bit rate on AAC audio streams. Our study also shows that shift-recompression based differential analysis is very effective for detection of MP3 forgery at the same bit rate.

Mobile Watermarking against Geometrical Distortions (EAI Endorsed Transactions on Security and Safety, 2015-05-25)
Liu, Qingzhong; Zhang, Jing; Su, Yuting; Zhi, Meili
Mobile watermarking robust to geometrical distortions is still a great challenge. In mobile watermarking, efficient computation is necessary because mobile devices have very limited resources due to power consumption. In this paper, we propose a low-complexity geometrically resilient watermarking approach based on the optimal tradeoff circular harmonic function (OTCHF) correlation filter and the minimum average correlation energy Mellin radial harmonic (MACE-MRH) correlation filter. Owing to the rotation, translation, and scale tolerance properties of these two kinds of filters, the proposed watermark detector can be robust to geometrical attacks. The embedded watermark is weighted by a perceptual mask which matches very well with the properties of the human visual system. Before correlation, a whitening process is utilized to improve watermark detection reliability. Experimental results demonstrate that the proposed watermarking approach is computationally efficient and robust to geometrical distortions.

In-The-Wild Residual Data Research and Privacy (Journal of Digital Forensics, Security and Law, 2016)
Glisson, William Bradley; Storer, Tim; Blyth, Andrew; Grispos, George; Campbell, Matt
As the world becomes increasingly dependent on technology, researchers in industry and academia endeavor to understand how technology is used, the impact it has on everyday life, the artifact life-cycle, and overall integrations of digital information. In doing so, researchers are increasingly gathering 'real-world' or 'in-the-wild' residual data, obtained from a variety of sources, without the explicit consent of the original owners. This data gathering raises significant concerns regarding privacy, ethics, and legislation, as well as practical considerations concerning investigator training, data storage, overall security, and data disposal. This research surveys recent studies of residual data gathered in-the-wild and analyzes the challenges that were confronted. Amalgamating these insights, the research presents a compendium of practices for addressing the issues that can arise in-the-wild when conducting residual data research. The practices identified in this research can be used to critique current projects and assess the feasibility of proposed future research.

Attack-Graph Threat Modeling Assessment of Ambulatory Medical Devices (Proceedings of the 50th Hawaii International Conference on System Sciences, 2017-01)
Luckett, Patrick; McDonald, J. Todd; Glisson, William Bradley
The continued integration of technology into all aspects of society stresses the need to identify and understand the risk associated with assimilating new technologies. This necessity is heightened when technology is used for medical purposes, such as ambulatory devices that monitor a patient's vital signs. This integration creates environments that are conducive to malicious activities. The potential impact presents new challenges for the medical community. Hence, this research presents attack graph modeling as a viable solution to identifying vulnerabilities, assessing risk, and forming mitigation strategies to defend ambulatory medical devices from attackers. Common and frequent vulnerabilities and attack strategies related to the various aspects of ambulatory devices, including Bluetooth enabled sensors and Android applications, are identified in the literature. Based on this analysis, this research presents an attack graph modeling example on a theoretical device that highlights vulnerabilities and mitigation strategies to consider when designing ambulatory devices with similar components.
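The attack-graph entry above models the paths an attacker could take through an ambulatory device's components (Bluetooth sensor, mobile application, monitored data). As a simple illustration of attack graph modeling in general, the sketch below enumerates attack paths through a small hypothetical device graph with networkx; the nodes, edges, and difficulty scores are invented for the example and are not taken from the paper.

```python
# Tiny attack-graph sketch for a hypothetical ambulatory device (illustrative only).
import networkx as nx

g = nx.DiGraph()
# Hypothetical attack steps; edge weights loosely represent attack difficulty.
g.add_edge("attacker", "bluetooth_sensor", weight=2)      # e.g., pairing weakness
g.add_edge("attacker", "android_app", weight=3)           # e.g., repackaged app
g.add_edge("bluetooth_sensor", "vitals_data", weight=1)   # sensor feeds vitals
g.add_edge("android_app", "vitals_data", weight=1)        # app caches vitals
g.add_edge("android_app", "clinician_portal", weight=4)   # e.g., stolen credentials

target = "vitals_data"
for path in nx.all_simple_paths(g, "attacker", target):
    cost = sum(g[u][v]["weight"] for u, v in zip(path, path[1:]))
    print(" -> ".join(path), f"(difficulty {cost})")

# The least-difficult path suggests where mitigations would yield the most benefit.
print("cheapest path:", nx.shortest_path(g, "attacker", target, weight="weight"))
```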
Implications of Malicious 3D Printer Firmware (Proceedings of the 50th Hawaii International Conference on System Sciences, 2017-01)
Moore, Samuel Bennett; Glisson, William Bradley; Yampolskiy, Mark
The utilization of 3D printing technology within the manufacturing process creates an environment that is potentially conducive to malicious activity. Previous research in 3D printing focused on attack vector identification and intellectual property protection. This research develops and implements malicious code using Printrbot's branch of the open source Marlin 3D printer firmware. Implementations of the malicious code were activated based on a specified printer command sent from a desktop application. The malicious firmware successfully ignored incoming print commands for a printed 3D model, substituted malicious print commands for an alternate 3D model, and manipulated extruder feed rates. The research contribution is three-fold. First, this research provides an initial assessment of the potential effects malicious firmware can have on a 3D printed object. Second, it documents a potential vulnerability that impacts 3D product output using 3D printer firmware. Third, it provides foundational grounding for future research into malicious 3D printing process activities.

Exploitation and Detection of a Malicious Mobile Application (Proceedings of the 50th Hawaii International Conference on System Sciences, 2017-01)
Nguyen, Thanh; McDonald, J. Todd; Glisson, William Bradley
Mobile devices are increasingly being embraced by both organizations and individuals in today's society. Specifically, Android devices have been the prominent mobile device OS for several years. This continued amalgamation creates an environment that is an attractive attack target. The heightened integration of these devices prompts an investigation into the viability of maintaining non-compromised devices. Hence, this research presents a preliminary investigation into the effectiveness of current commercial anti-virus, static code analysis, and dynamic code analysis engines in detecting unknown repackaged malware piggybacking on popular applications with excessive permissions. The contribution of this paper is two-fold. First, it provides an initial assessment of the effectiveness of anti-virus and analysis tools in detecting malicious applications and behavior on Android devices. Second, it provides a process for inserting code injection attacks to simulate a zero-day repackaged malware that can be used in future research efforts.
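The "Implications of Malicious 3D Printer Firmware" entry above demonstrates firmware that ignores or substitutes incoming print commands and manipulates extruder feed rates when triggered by a specific command. The sketch below mimics that class of manipulation at the G-code level in plain Python, purely for illustration; the trigger command, the filtering rule, and the feed-rate scaling are invented for this example and are unrelated to the Marlin/Printrbot implementation in the paper.

```python
# Illustrative sketch of malicious print-command manipulation (not the paper's firmware).
import re

TRIGGER = "M117 ACTIVATE"        # hypothetical command that switches on the malicious mode
FEED_SCALE = 0.5                 # hypothetical feed-rate manipulation factor

def process_gcode(lines):
    """Pass G-code through; after the trigger, silently rescale feed rates on moves."""
    malicious = False
    for line in lines:
        if line.strip() == TRIGGER:
            malicious = True      # the trigger itself is swallowed
            continue
        if malicious and line.startswith("G1"):
            # Rewrite the F (feed rate) parameter of movement commands.
            line = re.sub(r"F(\d+(?:\.\d+)?)",
                          lambda m: f"F{float(m.group(1)) * FEED_SCALE:g}",
                          line)
        yield line

job = ["G28", "M117 ACTIVATE", "G1 X10 Y10 F1500", "G1 X20 Y20 F1500"]
print("\n".join(process_gcode(job)))
```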