Goal
Better understand the retention policy options for Datameer artifacts.
Learn
Data replacement
ImportJobs and DataLinks give a user two different methods of bringing data into Datameer.
ImportJob - literally copies a source dataset to the Datameer private folder in HDFS/S3 and stores it in Parquet format. The following data retention policies are available for ImportJobs.
- File-based Connections (HDFS, S3, SFTP, etc).
- Replace - all data ingested during the previous execution of this ImportJob is completely removed and replaced by the new dataset.
- Append - all data ingested during previous executions remains intact, and a new portion of data is added. Keep in mind that this policy is useful when you need to accumulate historical data in Datameer and the source data is replaced between runs, so every time the ImportJob runs, it reads only new files from the source location. Otherwise, data will be duplicated.
- Append with sliding time window - allows accumulating data ingested during a certain period of time (e.g., 10 days) or a certain number of job executions.
- Database Connections
- Replace - all data ingested during the previous execution of this ImportJob is completely removed and replaced by the new dataset.
- Append - all data ingested during previous executions remains intact, and a new portion of data is added.
- Append with sliding time window - allows accumulating data ingested during a certain period of time (e.g., 10 days) or a certain number of job executions.
- Append with incremental mode - imports only data added to the source tables since the last ImportJob execution.
Therefore, if you want to avoid situations where an ImportJob copies the same data on every execution, use Append mode. Please note that for file-based Connections, already ingested source files should be removed from the source location before the next ImportJob execution to avoid data duplication. For Database Connections, use the Append with incremental mode option to ingest only new records.
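Conceptually, incremental mode behaves like remembering the highest value of a monotonically increasing key from the previous run and selecting only newer rows. The sketch below merely illustrates this idea; the table orders, the column id, and the placeholder :last_max_id are hypothetical and do not reflect Datameer's internal implementation.

-- Illustration only: incremental ingestion from a database source.
-- Assumes a monotonically increasing key column (here: id); names are placeholders.

-- First execution: import the full table and remember MAX(id).
SELECT * FROM orders;

-- Subsequent executions: import only rows added since the last run,
-- where :last_max_id is the highest id seen previously.
SELECT * FROM orders WHERE id > :last_max_id;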
DataLink - just a pointer to a dataset. It does not copy anything to the Datameer private folder in HDFS/S3 except for a short preview. Data is accessed when a Workbook based on this DataLink is executed. As no data is ingested via a DataLink, there is no retention policy for this artifact type. Data accumulates on the source side, and thus a Workbook is always executed over the complete dataset.
If there is a need to process only a certain part of the data accessed via a DataLink, the following options are available.
- File-based Connections (HDFS, S3, SFTP, etc).
- Set a time-based filter for source files via the path, file names, or file modification date. In this case, the Workbook based on such a DataLink will work only with the files that match the filter.
- Set time-based partitions (if the file structure allows it). In this case, one could choose the partition(s) the Workbook will process.
- HiveServer2 Connections (given that the source table is partitioned).
- Introduce a partition filter and choose which partitions the DataLink should point to. In this case, the Workbook based on such a DataLink will work only with the data that matches the filter.
- Set time-based partitions. In this case, one could choose the partition(s) the Workbook will process.
- Database Connections (MySQL, PostgreSQL, etc).
- Use custom SQL queries with defined filter conditions (WHERE clause), as in the sketch below.
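For example, a custom SQL query behind a database DataLink could restrict the data reaching the Workbook to a given time range. This is only a sketch: the table transactions and the column created_at are hypothetical, and the date condition should use the date syntax of your source database.

-- Hypothetical custom SQL for a database DataLink: only rows from 2024 onward
-- reach the Workbook. Table and column names are placeholders.
SELECT *
FROM transactions
WHERE created_at >= DATE '2024-01-01';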
Retention policy
The retention policy in Datameer allows configuring the following parameters:
- Keep the last N results (regardless of their age).
- Purge results older than N days.
- Purge results older than N days, but keep the last N results.
- Never delete historical data.
- ExportOnly (for Workbooks).
The corresponding configuration is stored in the dap_job_configuration table under the columns min_keep_count and expire_time_days.
The possible combinations of values in these columns (min_keep_count / expire_time_days, respectively) are:
- N / NULL - Keep the last N results (Purge results older than N days is empty).
- NULL / N - Purge results older than N days (Keep the last N results is empty).
- N / N - Purge results older than N days, but keep the last N results.
- NULL / NULL - Never delete historical data.
- 0 / 1 - Export Only.
The following query allows viewing all artifacts together with their retention policies; a WHERE clause can be added to filter the desired results.
SELECT dap_job_configuration.id ConfID,
       dap_file.name Name,
       CASE dap_file.extension
            WHEN 'IMPORT_LINK_JOB_EXTENSION' THEN 'Data Link'
            WHEN 'IMPORT_JOB_EXTENSION' THEN 'Import Job'
            WHEN 'WORKBOOK_EXTENSION' THEN 'Workbook'
            WHEN 'EXPORT_JOB_EXTENSION' THEN 'Export Job'
       END Type,
       permission.owner Owner,
       dap_file.creation_date CreationTime,
       dap_job_configuration.min_keep_count KeptResults,
       dap_job_configuration.expire_time_days PurgeResultsAfter,
       dap_file.id FileID
FROM dap_job_configuration
JOIN dap_file ON dap_job_configuration.dap_file__id = dap_file.id
JOIN permission ON dap_file.permission_fk = permission.id;
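For example, to list only the artifacts that never delete historical data (both columns are NULL), the same joins can be reused with a WHERE clause such as the following:

-- Artifacts with the 'Never delete historical data' retention policy (NULL / NULL).
SELECT dap_file.name Name,
       permission.owner Owner,
       dap_file.id FileID
FROM dap_job_configuration
JOIN dap_file ON dap_job_configuration.dap_file__id = dap_file.id
JOIN permission ON dap_file.permission_fk = permission.id
WHERE dap_job_configuration.min_keep_count IS NULL
  AND dap_job_configuration.expire_time_days IS NULL;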
With the option Append with sliding time window, you could set the following retention policies:
- Use only the Expire after field. Datameer will keep records ingested by an ImportJob during the last N days/weeks/months; older records will be removed. For example, if you set Expire after to 5 days and run the ImportJob daily, ingesting 10 records every day, Datameer will keep only the 50 most recent records at any time.
- Use only the Keep last N results field. Datameer will keep records ingested by the ImportJob during the last N executions, regardless of the time. For example, if you set Keep last N results to 5, Datameer will keep data imported during the previous 5 job executions, irrespective of whether the job has run 5 times in 1 hour or in 1 week. Please note that executions that do not import any records are also counted: if no new data is added to the source table and the ImportJob is executed 5 times, no records will be stored at that point.
- Use both Expire after and Keep last N results parameters at the same time. This gives additional flexibility and allows you to ensure that N results are still stored even if some of them have expired, e.g., if you pause the ingestion but still want to use previously ingested data.
When a Workbook is configured with the ExportOnly retention policy, the data object it creates immediately gets status 1 (marked_for_deletion) in the data table. Thereby it becomes subject to the Housekeeping service right away and will be removed during the next Housekeeping round, after the subsequent ExportJob completes.
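As a quick check, the data objects already flagged for Housekeeping can be listed from the data table mentioned above. This is only a sketch: it relies on the status column described in this article, while the remaining column name is an assumption about the table's schema.

-- Sketch: data objects with status 1 (marked_for_deletion), i.e., subject to
-- the next Housekeeping round. The id column is an assumed column name.
SELECT id, status
FROM data
WHERE status = 1;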