Goal
To understand exactly how Datameer jobs are executed and where the actual data is stored.
Learn
Datameer acts as a job compiler: it compiles a MapReduce job and sends it to a cluster (Hadoop/EMR) for processing.
- If Datameer is not connected to a cluster (local mode), all calculations take place in the Datameer JVM (pseudo cluster). All data, permanent and temporary, is stored in the Datameer installation folder, unless the tmp/ or cache/ directories are configured to point elsewhere.
- When Datameer is connected to a cluster, all data is stored in an HDFS directory called the Datameer private folder. By default this is located at hdfs://user/datameer (see the sketch below).
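As a quick way to inspect what ends up in the private folder, the sketch below lists its top-level contents with Hadoop's standard FileSystem API. It is only a sketch: it assumes a Hadoop client configuration on the classpath and the default /user/datameer location mentioned above; adjust the path if your Datameer private folder is configured differently.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListPrivateFolder {
    public static void main(String[] args) throws Exception {
        // Assumes core-site.xml / hdfs-site.xml are on the classpath so that
        // fs.defaultFS points at the cluster Datameer is connected to.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Assumed default location of the Datameer private folder.
        Path privateFolder = new Path("/user/datameer");

        // ImportJob data, DataLink preview samples, and interim/temp artifacts
        // all live somewhere under this tree.
        for (FileStatus status : fs.listStatus(privateFolder)) {
            System.out.printf("%s\t%d bytes%n", status.getPath(), status.getLen());
        }
        fs.close();
    }
}
```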
Differences between DataLinks and ImportJobs:
- For ImportJobs, the dataset is copied from the source to the Datameer private folder in HDFS. When a Workbook is built on this ImportJob, Datameer processes the copy of the data in the HDFS private folder and does not connect to the data source again.
- A DataLink is just a pointer to a dataset. When a DataLink is run, Datameer validates the data path and the schema and copies a small data sample for preview purposes. When a Workbook is built on top of this DataLink, Datameer connects to the source and copies and processes the data during the Workbook's execution (see the sketch after this list).
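The following sketch is purely conceptual; the class and method names are made up for illustration and are not Datameer's actual API. It only contrasts the two behaviors described above: an ImportJob copies the full dataset into the private folder once, while a DataLink keeps a pointer plus a preview sample and reads the source again whenever a Workbook runs.

```java
// Conceptual sketch only; names are illustrative, not Datameer's real classes.
import java.net.URI;

class ImportJobSketch {
    // ImportJob: the full dataset is copied into the private folder once.
    void run(URI source, URI privateFolder) {
        copyAll(source, privateFolder);        // full copy at import time
        // Workbooks built on this ImportJob read only from privateFolder.
    }
    void copyAll(URI from, URI to) { /* connector-specific full copy */ }
}

class DataLinkSketch {
    URI source;                                // just a pointer to the dataset

    // DataLink run: validate path/schema and keep a small preview sample.
    void run(URI privateFolder) {
        validate(source);
        copySample(source, privateFolder);     // preview sample only
    }

    // Workbook execution: the data is read from the source at run time.
    void buildWorkbook() { readAndProcess(source); }

    void validate(URI u) { /* check path and schema */ }
    void copySample(URI from, URI to) { /* small sample for preview */ }
    void readAndProcess(URI u) { /* full read during Workbook execution */ }
}
```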
When Datameer submits a job to the cluster, the Resource Manager creates an ApplicationMaster container on a DataNode. After this, all actions are handled by the ApplicationMaster according to the job definition provided by Datameer.
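One way to observe this from the cluster side is to list the YARN applications the Resource Manager is tracking; each Datameer-submitted job appears there with its own ApplicationMaster. The sketch below uses the standard YarnClient API and assumes a YARN client configuration (yarn-site.xml) is available on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApplications {
    public static void main(String[] args) throws Exception {
        // Assumes yarn-site.xml is on the classpath so the client can
        // reach the Resource Manager.
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration(new Configuration()));
        yarn.start();

        // Each Datameer-submitted job shows up as a YARN application; the
        // Resource Manager allocates its ApplicationMaster container on a
        // cluster node, which then drives the rest of the job.
        for (ApplicationReport app : yarn.getApplications()) {
            System.out.printf("%s\t%s\t%s%n",
                app.getApplicationId(), app.getName(), app.getYarnApplicationState());
        }
        yarn.stop();
    }
}
```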
Summary
- When an ImportJob is executed, DataNodes make connections to the datasource and copy data into the Datameer private folder in HDFS.
- When a DataLink is executed, DataNodes make a connection to the datasource and create a preview sample in the Datameer private folder in HDFS.
- When executing a Workbook, DataNodes will pick up data from the Datameer private folder in HDFS if the Workbook is based on an ImportJob.
- When executing a Workbook, DataNodes will make a connection to the datasource to read the data if the Workbook is based on a DataLink.
- Interim calculations are stored in the temp folder within the Datameer private folder while the job is running and get cleaned up after the job completes. If a job fails, interim data will be cleaned up by the Datameer Housekeeping service.
- During MapReduce job execution (e.g. shuffle phase), cluster services use temporary space locations as defined in the cluster configuration.
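To see which scratch locations a given cluster uses for this temporary data, you can read the relevant keys from the cluster configuration. The sketch below assumes the standard Hadoop/YARN property names and that the cluster's *-site.xml files are on the classpath; the exact keys in use may differ per distribution.

```java
import org.apache.hadoop.conf.Configuration;

public class ShowTempDirs {
    public static void main(String[] args) {
        // Assumes core-site.xml is on the classpath; add the YARN and
        // MapReduce site files explicitly so their keys are visible too.
        Configuration conf = new Configuration();
        conf.addResource("yarn-site.xml");
        conf.addResource("mapred-site.xml");

        // Local scratch space used by NodeManagers, e.g. for shuffle spills.
        System.out.println("yarn.nodemanager.local-dirs = "
            + conf.get("yarn.nodemanager.local-dirs"));
        // Legacy MapReduce local directory setting, still honored on some clusters.
        System.out.println("mapreduce.cluster.local.dir = "
            + conf.get("mapreduce.cluster.local.dir"));
        // HDFS staging directory for MapReduce job submission artifacts.
        System.out.println("yarn.app.mapreduce.am.staging-dir = "
            + conf.get("yarn.app.mapreduce.am.staging-dir"));
    }
}
```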