Optimize a partitioned Import Job to decrease its execution time.
When creating a partitioned Import Job, it's possible that the data may be skewed such that most records belong to a single partition. For example, when there is a daily import and the job is partitioned by day.
During job execution all records that belong to a single partition will be sent to the same reducer. So if an import contains 50M records and 49M belong to the same partition - 98% of work will end up on a single reducer. This significantly impacts job's execution time and may cause OOM exceptions if a single container can't handle the load.
To avoid this situation, it's possible to move partition handling to map side of the job using the custom property: das.partitioning.use-map-only=true
Please sign in to leave a comment.