NNMF IN GOOGLE TENSORFLOW AND APACHE SPARK: A COMPARISON STUDY
Abstract
Data mining is no longer a new term; it has become pervasive in all aspects of our
lives. New computing platforms for specific uses are proposed continuously. Understanding
the characteristics and capacity of existing and newly proposed platforms has therefore
become critical for researchers and practitioners who want to use existing algorithms
and develop new ones on recent platforms.
This thesis implements and compares a set of popular matrix factorization algorithms
on recent computing platforms. Specifically, three matrix factorization algorithms,
namely classic Non-negative Matrix Factorization (NNMF), CUR Matrix Decomposition, and
Compact Matrix Decomposition (CMD), are implemented on two computing platforms, Apache
Spark and Google TensorFlow.
Because the rank-k approximation obtained with Singular Value Decomposition (SVD) is
the optimal rank-k approximation, it serves as the baseline, and both the CUR and CMD
approximations are less accurate than the SVD approximation. The experimental results
show that CMD in TensorFlow achieves a better matrix approximation than the other two
matrix factorization algorithms (NNMF and CUR) in the same experimental setup. In
addition, as the number of rows or columns selected for CUR and CMD increases, the
approximation error decreases.
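
The relationship between the SVD baseline and a CUR approximation can be illustrated with a minimal NumPy sketch. This is not the thesis implementation, which targets Apache Spark and Google TensorFlow. The function names are hypothetical, and the CUR variant shown is a simplified assumption: rows and columns are sampled without replacement with probability proportional to their squared norms, and the middle matrix is taken as U = C^+ A R^+, the least-squares choice for fixed C and R.

```python
import numpy as np

def svd_rank_k(A, k):
    """Best rank-k approximation of A via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def cur_approx(A, c, r, rng):
    """Simplified CUR sketch (assumed variant, not the thesis code).

    Samples c columns and r rows with probability proportional to their
    squared Euclidean norms, then uses U = pinv(C) @ A @ pinv(R).
    """
    sq = A ** 2
    col_p = sq.sum(axis=0) / sq.sum()
    row_p = sq.sum(axis=1) / sq.sum()
    cols = rng.choice(A.shape[1], size=c, replace=False, p=col_p)
    rows = rng.choice(A.shape[0], size=r, replace=False, p=row_p)
    C = A[:, cols]
    R = A[rows, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C @ U @ R

def rel_error(A, A_hat):
    """Relative Frobenius-norm approximation error."""
    return np.linalg.norm(A - A_hat) / np.linalg.norm(A)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.random((60, 40))   # synthetic non-negative test matrix
    k = 5
    print("SVD rank-k error:", rel_error(A, svd_rank_k(A, k)))
    for n in (5, 10, 20, 40):
        # Selecting more rows and columns enlarges the subspaces CUR can use.
        print(f"CUR error with c = r = {n}:", rel_error(A, cur_approx(A, n, n, rng)))
```

With c = r = k the CUR product has rank at most k, so its error cannot fall below that of the truncated SVD. Because the sampling is random, the printed CUR errors are not guaranteed to decrease at every step, although they reach zero (up to rounding) once all rows and columns are selected.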