Dependency Management System with Hadoop Streaming for Data-analytic Projects

Reviewed, Featured
Lin Li, Sozo Inoue
Korea-Japan Joint Workshop on ICT
4 pages
2012-09-21
Pohang, Korea
http://www.f.ait.kyushu-u.ac.jp/kjjwonict/
In this paper, we propose a distributed parallel processing system for data-analytic projects that manages dependencies among data and analytic programs and re-executes updated programs, and the programs that depend on them, whenever data or programs are updated. In the system, a data analyst can specify the dependencies and the parts that require distributed parallel processing using Hadoop Streaming; only the updated and dependent parts are then processed, with a flexible choice between parallel and sequential execution. The specification can also express multiple executions of the same program on different data as a single simple statement, while the dependencies of each execution are checked separately.
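
The update rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `stale_targets`, the `rules` and `mtimes` dictionaries, and all file names are hypothetical, and modification times are plain numbers rather than real file timestamps. The sketch only decides which outputs need re-execution; how each step would be run (sequentially or through Hadoop Streaming) is left out.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def stale_targets(rules, mtimes):
    """Return the outputs that need re-execution, in dependency order.

    rules  -- maps each output to the data files and programs it depends on
    mtimes -- maps each existing file to its last-modified time
    (Both structures are assumptions made for this illustration.)
    """
    stale = []
    stale_set = set()
    # static_order() yields every file after all of its dependencies.
    for name in TopologicalSorter(rules).static_order():
        inputs = rules.get(name)
        if inputs is None:
            continue  # raw data or a program: nothing to re-execute
        out_time = mtimes.get(name)
        if (
            out_time is None                                # never produced
            or any(dep in stale_set for dep in inputs)      # upstream re-runs
            or any(mtimes[dep] > out_time for dep in inputs)  # input is newer
        ):
            stale.append(name)
            stale_set.add(name)
    return stale


# Hypothetical project: clean.py is applied to two different data files,
# and each application is checked separately.
rules = {
    "clean.csv": ["raw.csv", "clean.py"],
    "other.csv": ["raw2.csv", "clean.py"],
    "stats.csv": ["clean.csv", "stats.py"],
    "plot.png": ["stats.csv", "plot.py"],
}
mtimes = {
    "raw.csv": 1, "raw2.csv": 1, "clean.py": 5, "stats.py": 1, "plot.py": 1,
    "clean.csv": 3, "other.csv": 7, "stats.csv": 4, "plot.png": 6,
}

print(stale_targets(rules, mtimes))
```

In this example `clean.py` is newer than `clean.csv`, so `clean.csv` and everything downstream of it (`stats.csv`, `plot.png`) are selected, while `other.csv`, produced by the same program from different data at a later time, is left untouched.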
