NNMF IN GOOGLE TENSORFLOW AND APACHE SPARK: A COMPARISON STUDY
Data mining is no longer a new term; it has become pervasive in all aspects of our lives. New computing platforms tailored to specific workloads are proposed continually. Awareness of the characteristics and capacity of existing and newly proposed platforms has therefore become a critical task for researchers and practitioners who want to run existing algorithms, and to develop new ones, on these platforms. In particular, this thesis implements and compares a set of popular matrix factorization algorithms on recent computing platforms. Specifically, three matrix factorization algorithms, namely classic Non-negative Matrix Factorization (NNMF), CUR Matrix Decomposition (CUR), and Compact Matrix Decomposition (CMD), are implemented on two computing platforms: Apache Spark and Google TensorFlow. Because the rank-k approximation given by Singular Value Decomposition (SVD) is optimal, it serves as the baseline, and both the CUR and CMD approximations are less accurate than the SVD approximation. The experimental results show that, in the same experimental setup, CMD in TensorFlow achieves a better matrix approximation than the other two algorithms (NNMF and CUR). Moreover, as the number of rows or columns selected for CUR and CMD increases, the approximation error decreases.
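The two empirical observations above, that the rank-k SVD is the optimal baseline and that sampling more rows or columns shrinks the CUR-style approximation error, can be sketched numerically. The snippet below is a minimal illustration in NumPy, not the thesis's Spark or TensorFlow implementation: it uses a synthetic low-rank matrix, uniform row/column sampling, and pseudoinverses to form a basic CUR approximation (the thesis's CUR and CMD use more refined sampling schemes).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic test matrix: rank ~5 structure plus small noise.
A = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 80))
A += 0.01 * rng.standard_normal(A.shape)

def svd_rank_k_error(A, k):
    """Frobenius error of the optimal rank-k approximation (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Ak = (U[:, :k] * s[:k]) @ Vt[:k]
    return np.linalg.norm(A - Ak)

def cur_error(A, c, r, rng):
    """Frobenius error of a basic CUR sketch with uniform sampling.

    C holds c sampled columns, R holds r sampled rows, and
    U = C^+ A R^+ is the middle matrix built from pseudoinverses.
    """
    cols = rng.choice(A.shape[1], size=c, replace=False)
    rows = rng.choice(A.shape[0], size=r, replace=False)
    C, R = A[:, cols], A[rows, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return np.linalg.norm(A - C @ U @ R)

k = 5
e_svd = svd_rank_k_error(A, k)          # optimal rank-5 baseline
e_cur_small = cur_error(A, 5, 5, rng)   # few rows/columns sampled
e_cur_large = cur_error(A, 20, 20, rng) # more rows/columns sampled
```

With c = r = 5 the CUR approximation has rank at most 5, so by the Eckart-Young theorem its error can never fall below the rank-5 SVD error; increasing c and r typically drives the CUR error down, mirroring the trend reported in the experiments.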