Problem
Jobs occasionally take a long time to start, or they start and then sit at 0% progress.
Cause
When a Hadoop cluster is under heavy utilization, YARN and the Tez framework may wait for resources on the nodes where the data resides before allocating containers and tasks. This can delay job startup, or delay the allocation of tasks after a job has started.
Solution
Often when a cluster is heavily used, processing resources are unavailable on the nodes where the data resides. YARN waits for resources on those nodes to become free in order to achieve the best wall-clock time for the job in question. However, when that wait exceeds the time saved by running data-local, it becomes more advantageous not to wait for data locality.
In such a situation, you can change the following three parameters to allow YARN and Tez to operate more efficiently:
tez.am.container.reuse.enabled=true
Allows Tez containers to persist on the cluster for re-use after a job has completed.
tez.am.container.reuse.non-local-fallback.enabled=true
With this enabled, Tez not only re-uses containers, but will also assign running containers to tasks whose data is not local to the container.
Note: This can adversely affect run time in clusters that are not facing over-utilization.
tez.am.container.reuse.locality.delay-allocation-millis=120000
Allows idle containers to persist for two minutes (120,000 milliseconds) for re-use.
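As a sketch, these settings can be applied cluster-wide in tez-site.xml (the exact file location, and whether it is managed through a tool such as Ambari, depends on your distribution):

```xml
<!-- tez-site.xml: container re-use tuning for heavily utilized clusters -->
<property>
  <name>tez.am.container.reuse.enabled</name>
  <value>true</value>
</property>
<property>
  <name>tez.am.container.reuse.non-local-fallback.enabled</name>
  <value>true</value>
</property>
<property>
  <name>tez.am.container.reuse.locality.delay-allocation-millis</name>
  <value>120000</value>
</property>
```

Alternatively, Tez properties can usually be overridden per session; for example, in a Hive session you can issue `set tez.am.container.reuse.non-local-fallback.enabled=true;` to test the effect on a single workload before changing the cluster-wide defaults.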