This article explains exactly how Datameer jobs are executed and where the actual data is stored.
Datameer acts as a job compiler: it compiles a MapReduce job and sends it to a cluster (Hadoop/EMR) for processing.
- If Datameer is not connected to a cluster (Local mode), all calculations take place in the Datameer JVM (pseudo cluster). All data, permanent and temporary, is stored in the Datameer installation folder, unless cache/directories are configured to point elsewhere.
- When Datameer is connected to a cluster, all data is stored in an HDFS directory called the Datameer private folder. By default this is located at
Differences between DataLinks and ImportJobs:
- For ImportJobs, the dataset is copied from the source to the Datameer private folder in HDFS. When a Workbook is built on this ImportJob, Datameer processes the copy of the data in the HDFS private folder and does not connect to the data source again.
- A DataLink is just a pointer to a dataset. When a DataLink is run, Datameer validates the data path and the schema and copies a small data sample for preview purposes. When a Workbook is built on top of this DataLink, Datameer reads the data from the source and processes it during the Workbook's execution.
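The difference between the two can be illustrated with a small toy model. This is only a sketch of the behavior described above, not Datameer code: the class names `Source`, `ImportJob` and `DataLink`, the `reads` counter and the `sample_size` parameter are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """Hypothetical external data source that counts how often it is read."""
    rows: list
    reads: int = 0

    def read(self) -> list:
        self.reads += 1
        return list(self.rows)


@dataclass
class ImportJob:
    """Toy model: copies the full dataset once, then serves the copy."""
    source: Source
    private_copy: list = field(default_factory=list)

    def run(self) -> None:
        # The dataset is copied from the source into the private folder.
        self.private_copy = self.source.read()

    def data_for_workbook(self) -> list:
        # A Workbook reads the stored copy; the source is not contacted again.
        return list(self.private_copy)


@dataclass
class DataLink:
    """Toy model: stores only a preview sample, reads the source on demand."""
    source: Source
    sample_size: int = 2  # assumed value, for illustration only
    preview: list = field(default_factory=list)

    def run(self) -> None:
        # Only a small sample is kept for preview purposes.
        self.preview = self.source.read()[: self.sample_size]

    def data_for_workbook(self) -> list:
        # A Workbook reads from the source each time it is executed.
        return self.source.read()


source = Source(rows=["a", "b", "c", "d"])

import_job = ImportJob(source)
import_job.run()
import_job.data_for_workbook()
reads_after_import_workbook = source.reads

data_link = DataLink(source)
data_link.run()
data_link.data_for_workbook()
reads_after_link_workbook = source.reads
```

In this model the Workbook built on the ImportJob adds no further reads of the source, while the Workbook built on the DataLink adds one read per execution.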
When Datameer submits a job to the cluster, the Resource Manager creates an ApplicationMaster container on a DataNode. After this, all actions are handled by the ApplicationMaster, according to the job definition provided by Datameer.
- When an ImportJob is executed, DataNodes connect to the data source and copy data into the Datameer private folder in HDFS.
- When a DataLink is executed, DataNodes connect to the data source and create a preview sample in the Datameer private folder in HDFS.
- When a Workbook based on an ImportJob is executed, DataNodes pick up data from the Datameer private folder in HDFS.
- When a Workbook based on a DataLink is executed, DataNodes connect to the data source to read the data.
- Interim calculations are stored in the temp folder within the Datameer private folder while the job is running and are cleaned up after the job completes. If a job fails, the interim data is cleaned up by the Datameer Housekeeping service.
- During MapReduce job execution (e.g. the shuffle phase), cluster services use temporary space locations as defined in the cluster configuration.
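The lifecycle of interim data described above can be pictured with a toy simulation on the local filesystem. This is a sketch under stated assumptions and does not reflect how Datameer or its Housekeeping service is implemented: the functions `run_job` and `housekeeping`, the `temp/<job_id>` layout and the file name `interim.part` are hypothetical, and a local temporary directory stands in for the private folder in HDFS.

```python
import shutil
import tempfile
from pathlib import Path


def run_job(private_folder: Path, job_id: str, fail: bool = False) -> None:
    """Toy job: writes interim data under <private_folder>/temp/<job_id>."""
    job_temp = private_folder / "temp" / job_id
    job_temp.mkdir(parents=True)
    (job_temp / "interim.part").write_text("intermediate result")
    if fail:
        # A failed job leaves its interim data behind.
        raise RuntimeError(f"job {job_id} failed")
    # A completed job cleans up its own interim data.
    shutil.rmtree(job_temp)


def housekeeping(private_folder: Path, running_jobs: set) -> list:
    """Toy housekeeping pass: remove temp folders of jobs no longer running."""
    removed = []
    temp_root = private_folder / "temp"
    for job_temp in sorted(temp_root.iterdir()):
        if job_temp.name not in running_jobs:
            shutil.rmtree(job_temp)
            removed.append(job_temp.name)
    return removed


# Local stand-in for the private folder (an assumption for this sketch).
private = Path(tempfile.mkdtemp(prefix="private-folder-"))

run_job(private, "job-1")  # completes and cleans up after itself
try:
    run_job(private, "job-2", fail=True)  # fails and leaves data behind
except RuntimeError:
    pass

leftover = sorted(p.name for p in (private / "temp").iterdir())
removed = housekeeping(private, running_jobs=set())
remaining = sorted(p.name for p in (private / "temp").iterdir())

shutil.rmtree(private)  # remove the stand-in folder itself
```

Only the failed job's folder is left over after both runs, and the housekeeping pass is what removes it.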