QUALITATIVE CLEANING METHODS ON DISTRIBUTED IOT DATASETS
Data analysis encompasses a set of individual steps that allow a typically large data set to be remodeled so that actionable information can be extracted from it and used to support decision-making. Data generated from multiple distributed sources is usually dirty by default, and dirty data will often lead to inaccurate or incomplete analysis. As a result, without first performing data cleaning, wrong or fatally flawed business decisions are inevitable.

IoT describes a network of physical and virtual objects containing software, electrical components, and sensors that exchange data with other connected devices over the internet. The data generated by these sensors is distributed by design, and my aim for this thesis is to explore qualitative data cleaning methods, such as integrity constraints and functional dependency violations, to perform error detection and in-place error repair on the distributed data sets generated by these devices. This approach is relatively new, since most prior data cleaning research in this domain has focused on quantitative techniques such as outlier detection.

The next goal of my thesis is to perform exploratory data analysis on the data sets from these IoT sources using data wrangling tools on open-source frameworks such as Optimus under Apache Spark, which can handle the unstructured and semi-structured formats of the data these sources generate. The end goal is to produce clean data from these sources so that insights can be gained to support decision-making for the purpose of product improvement.
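To illustrate the qualitative approach described above, the following is a minimal sketch of detecting functional dependency (FD) violations and repairing them in place. The FD `device_id -> location` and the sample sensor readings are hypothetical, chosen only to show the technique; a real pipeline would run such checks at scale (e.g. over Spark DataFrames).

```python
from collections import Counter, defaultdict

def fd_violations(rows, lhs, rhs):
    """Return the lhs values that map to more than one rhs value,
    i.e. the groups of rows violating the FD lhs -> rhs."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[lhs]].add(row[rhs])
    return {key: vals for key, vals in groups.items() if len(vals) > 1}

def repair_in_place(rows, lhs, rhs):
    """Repair violations by rewriting rhs to each group's majority value.
    Majority voting is one simple repair policy among several."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[lhs]].append(row[rhs])
    majority = {key: Counter(vals).most_common(1)[0][0]
                for key, vals in groups.items()}
    for row in rows:
        row[rhs] = majority[row[lhs]]

# Hypothetical distributed sensor readings; one entry is dirty.
readings = [
    {"device_id": "s1", "location": "lab",  "temp": 21.5},
    {"device_id": "s2", "location": "roof", "temp": 18.0},
    {"device_id": "s2", "location": "roof", "temp": 18.2},
    {"device_id": "s2", "location": "Roof", "temp": 18.4},  # violates the FD
]

print(fd_violations(readings, "device_id", "location"))
repair_in_place(readings, "device_id", "location")
print(fd_violations(readings, "device_id", "location"))  # no violations remain
```

The same detect-then-repair pattern extends to other integrity constraints (e.g. denial constraints), with the repair policy swapped out as needed.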