Problem
During cluster job execution, some tasks fail with the following message in the YARN application log:
{"entity":"attempt_111111111_222222_1_01_000001_0","entitytype":"TEZ_TASK_ATTEMPT_ID",
"events":[{"ts":1566540243508,"eventtype":"TASK_ATTEMPT_FINISHED"}],
"otherinfo":{"creationTime":1566540195458,"allocationTime":1566540197777,"startTime":1566540230285,"endTime":1566540243508,"timeTaken":13223,
"status":"FAILED","taskAttemptErrorEnum":"CONTAINER_EXITED","taskFailureType":"NON_FATAL","diagnostics":"Container container_111111111_222222_1_01_000001 finished with diagnostics set to [Container failed, exitCode=-100. Container released on a *lost* node]",
"counters":{"counterGroups":"[{counterGroupName=org.apache.tez.common.counters.DAGCounter, counters=[{counterName=RACK_LOCAL_TASKS, counterValue=1}]}]"},"lastDataEvents":{"lastDataEvents":"[{TEZ_TASK_ATTEMPT_ID=, ts=1566540195231}]"},"nodeHttpAddress":"DataNodeHostName:port"}}
Occasionally this causes the whole job to fail, but usually it only degrades performance, because a failed task is normally rerun.
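To find these failures in a large log, the entries can be filtered programmatically. The following is a minimal sketch, assuming each task attempt entry is available as a single line of JSON with the fields shown in the message above. The function name summarize_attempt and the returned keys are illustrative and not part of YARN or Tez. The sample is a shortened copy of the message above.

```python
import json
import re

# Matches the exit code embedded in the diagnostics text, e.g. "exitCode=-100".
EXIT_CODE_RE = re.compile(r"exitCode=(-?\d+)")


def summarize_attempt(line):
    """Summarize one task attempt log entry (hypothetical helper).

    Assumes the entry is one line of JSON shaped like the message above.
    """
    entry = json.loads(line)
    info = entry.get("otherinfo", {})
    diagnostics = info.get("diagnostics", "")
    match = EXIT_CODE_RE.search(diagnostics)
    exit_code = int(match.group(1)) if match else None
    return {
        "attempt": entry.get("entity"),
        "status": info.get("status"),
        "exit_code": exit_code,
        "node": info.get("nodeHttpAddress"),
        # Flag the case described here: exit code -100 on a lost node.
        "lost_node": exit_code == -100 and "lost" in diagnostics.lower(),
    }


# Shortened copy of the log entry shown above.
sample = (
    '{"entity":"attempt_111111111_222222_1_01_000001_0",'
    '"entitytype":"TEZ_TASK_ATTEMPT_ID",'
    '"otherinfo":{"status":"FAILED",'
    '"diagnostics":"Container container_111111111_222222_1_01_000001 '
    'finished with diagnostics set to [Container failed, exitCode=-100. '
    'Container released on a *lost* node]",'
    '"nodeHttpAddress":"DataNodeHostName:port"}}'
)

print(summarize_attempt(sample))
```

Grouping the results by the node value can show whether the failures concentrate on a few hosts.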
Cause
This failure is a known issue with YARN (YARN-8671) that may occur if a node is overly busy (e.g., another container is using too much CPU, or the NodeManager is too busy to respond). The failure indicates a busy cluster or nodes that are having problems for some other reason.
Solution
Because this exception points to a problem with cluster services, review the cluster's configuration and performance, and perform a general health check.
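One possible starting point for the health check is the node list exposed by the ResourceManager REST API (/ws/v1/cluster/nodes). The sketch below assumes an unsecured HTTP endpoint. The address passed to fetch_nodes, the helper names, and the sample response are hypothetical and only illustrate the shape of the data. Clusters that use Kerberos or HTTPS need a different client setup.

```python
import json
from urllib.request import urlopen


def fetch_nodes(rm_address):
    """Fetch node reports from the ResourceManager.

    rm_address is a hypothetical "host:port" value for the RM web address.
    """
    url = f"http://{rm_address}/ws/v1/cluster/nodes"
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def nodes_needing_attention(response):
    """Return (id, state, healthReport) for every node not in RUNNING state."""
    nodes = (response.get("nodes") or {}).get("node") or []
    return [
        (n.get("id"), n.get("state"), n.get("healthReport", ""))
        for n in nodes
        if n.get("state") != "RUNNING"
    ]


# Made-up response for illustration only; real responses carry more fields.
sample_response = {
    "nodes": {
        "node": [
            {"id": "worker1:45454", "state": "RUNNING", "healthReport": ""},
            {"id": "worker2:45454", "state": "LOST", "healthReport": ""},
        ]
    }
}

for node_id, state, report in nodes_needing_attention(sample_response):
    print(node_id, state, report)
```

Nodes that repeatedly show up as LOST or UNHEALTHY are reasonable candidates for a closer look at CPU load and NodeManager logs.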