Problem
Jobs occasionally take a long time to start, or they start and then sit at 0% progress.
Cause
When a Hadoop cluster is under heavy utilization, YARN and the Tez framework may wait for resources on the nodes where the data resides before allocating containers and tasks. This can delay job startup, or delay the allocation of tasks after a job has started.
Solution
Often when a cluster is heavily used, processing resources are unavailable on the nodes where the data resides. YARN waits for resources on those nodes to become free in order to achieve the best wall-clock time for the job in question. However, when that wait exceeds the time saved by running data-local, it becomes more advantageous not to wait for data locality.
In such a situation, you can change the following three parameters to allow YARN and Tez to operate more efficiently:
tez.am.container.reuse.enabled=true
Allows Tez containers to persist on the cluster for re-use after a job has completed.
tez.am.container.reuse.non-local-fallback.enabled=true
With this enabled, Tez not only re-uses containers, but will also assign running containers to tasks whose data is not local to the container.
Note: This can adversely affect run time in clusters that are not facing over-utilization.
tez.am.container.reuse.locality.delay-allocation-millis=120000
Allows idle containers to persist for two minutes (120,000 milliseconds) for re-use.
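As a sketch, these settings can be applied cluster-wide in tez-site.xml (the exact file location, and whether it is managed through a tool such as Ambari, depends on your distribution):

```xml
<!-- tez-site.xml: container re-use tuning for heavily utilized clusters -->
<property>
  <name>tez.am.container.reuse.enabled</name>
  <value>true</value>
</property>
<property>
  <name>tez.am.container.reuse.non-local-fallback.enabled</name>
  <value>true</value>
</property>
<property>
  <name>tez.am.container.reuse.locality.delay-allocation-millis</name>
  <value>120000</value>
</property>
```

Alternatively, Tez properties can usually be overridden per session; for example, in a Hive session you can issue `set tez.am.container.reuse.non-local-fallback.enabled=true;` to test the effect on a single workload before changing the cluster-wide defaults.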