If a file-based data source has many partitions and a large amount of data, running a data link against it can perform poorly. The YARN application logs show that five splits are always used instead of the optimally calculated number of splits:
---------- Split Settings ----------
min/max split size: 16.0 MB (8.0 MB) / 5.0 GB
min/max split count: 0 / 5
total input size: 599.4 GB
slot count: 6980
number of desired tasks: 5
optimal split size: 119.9 GB
optimal split count: 5
Changing the min/max split size, min/max split count, or wave count has no effect: the 'optimal' split count always remains 5, with only 5 tasks.
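The numbers in the log excerpt can be reproduced with a little arithmetic. The sketch below is only an illustration of how a maximum split count of 5 dominates the result. The function name and the capping rule are assumptions made for this example, not the product's actual split algorithm; the input values are taken from the log above.

```python
import math

GB = 1024 ** 3

def capped_split_count(total_bytes, max_split_bytes, max_split_count):
    """Hypothetical model: use enough splits to respect the max split size,
    then cap the result at max_split_count (0 means 'no cap')."""
    needed = math.ceil(total_bytes / max_split_bytes)
    if max_split_count > 0:
        return min(needed, max_split_count)
    return needed

total = 599.4 * GB        # "total input size" from the log
max_split = 5.0 * GB      # "max split size" from the log

splits = capped_split_count(total, max_split, 5)     # cap of 5, as in the log
uncapped = capped_split_count(total, max_split, 0)   # same input without a cap
split_size_gb = total / splits / GB

print(f"capped split count:   {splits}")
print(f"resulting split size: {split_size_gb:.1f} GB")
print(f"uncapped split count: {uncapped}")
```

Under these assumptions, the cap forces each of the five splits to cover roughly 119.9 GB, which matches the "optimal split size" in the log, whereas honoring the 5.0 GB maximum split size alone would call for about 120 splits.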
This results in poor performance when the data link is run to generate the sample. In this instance there were almost 20,000 partitions, which works out to about 100,000 sample records, all generated by only five tasks running at a time.
This was an intentional design decision made by the engineering team. The reasoning is that data link samples do no analytical work, so they should be allocated fewer resources on the cluster. Jobs that do perform analytics then receive a proportionally larger share of resources and run faster.
This behavior can be overridden by explicitly setting the number of splits for data link sample generation. The property that controls this behavior is:
This property has a default value of 5, which explains the behavior described in the problem section. Add this property to the Custom Properties of the data link job and set it to a value higher than 5 to get more splits and therefore more parallel tasks.
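When choosing a new value, it can help to see how much data each split would cover. The following sketch is an assumption-based estimate only: it uses the total input size and slot count from the log above, the candidate values are arbitrary examples, and the helper names are hypothetical. It does not read or set the actual property.

```python
TOTAL_INPUT_GB = 599.4   # "total input size" from the log
SLOT_COUNT = 6980        # "slot count" from the log

def per_split_gb(split_count):
    """Approximate data volume each split covers for a given split count."""
    return TOTAL_INPUT_GB / split_count

def waves(split_count, slots=SLOT_COUNT):
    """Number of task waves needed if every slot were available (ceiling division)."""
    return -(-split_count // slots)

# Candidate split counts are illustrative, not recommendations.
for n in (5, 50, 120, 500):
    print(f"{n:>4} splits -> {per_split_gb(n):6.1f} GB per split, {waves(n)} wave(s)")
```

With these figures, around 120 splits would bring each split just under the 5.0 GB maximum split size shown in the log. Because data link samples are deliberately given few resources, a moderate increase is likely a better starting point than a very large one.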