Problem
No jobs are being processed because of a closed filesystem, and the affected filesystem cannot be identified.
Error message
...
[anonymous] INFO [LeaseRenewer:user@host:port] (Client.java:713) - Retrying connect to server: <hostname>/<ip>:<port>. Already tried 4 time(s); maxRetries=5
[anonymous] WARN [LeaseRenewer:user@host:port] (LeaseRenewer.java:449) - Failed to renew lease for [DFSClient_NONMAPREDUCE_-<id>] for 30 seconds. Aborting ...
org.apache.hadoop.net.ConnectTimeoutException: Call From FQDN/<ip> to <hostname>:<port> failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=<hostname>/<ip>:<port>]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
...
[anonymous] INFO [ConcurrentJobExecutor-2] (MrJobInputFormat.java:186) - Releasing splits (UUID: <id>) from cache, still cached split-arrays: 3
[anonymous] ERROR [ConcurrentJobExecutor-2] (HadoopMrJobClient.java:274) - Failed to cleanup job Workbook <name> / <name>
java.io.IOException: Filesystem closed
...
[system] ERROR [JobScheduler thread-1] (BasicDasStorageProvider.java:24) - Storage not available, Filesystem closed
...
[system] ERROR [JobScheduler thread-1] (JobScheduler.java:538) - Failed to start job, filesystem is not available.
...
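To confirm that a log shows this specific pattern, the messages above can be counted programmatically. The following is a minimal sketch, not a Datameer tool: the function name `count_symptoms` and the symptom labels are hypothetical, and the search strings are copied from the log excerpt above, so they may need adjusting if your log wording differs.

```python
# Substrings copied from the log excerpt in this article (case-sensitive).
# The dictionary keys are hypothetical labels chosen for this sketch.
SYMPTOMS = {
    "lease_renewal_failed": "Failed to renew lease",
    "connect_timeout": "ConnectTimeoutException",
    "filesystem_closed": "Filesystem closed",
    "scheduler_blocked": "filesystem is not available",
}


def count_symptoms(lines):
    """Count how many lines contain each symptom substring."""
    counts = {name: 0 for name in SYMPTOMS}
    for line in lines:
        for name, needle in SYMPTOMS.items():
            if needle in line:
                counts[name] += 1
    return counts


# Example with three lines shaped like the excerpt above.
sample = [
    "[anonymous] WARN [LeaseRenewer:user@host:port] (LeaseRenewer.java:449) - Failed to renew lease for [DFSClient_NONMAPREDUCE_-<id>] for 30 seconds. Aborting ...",
    "java.io.IOException: Filesystem closed",
    "[system] ERROR [JobScheduler thread-1] (JobScheduler.java:538) - Failed to start job, filesystem is not available.",
]
print(count_symptoms(sample))
```

To scan a real log, pass an open file object instead of the sample list, since iterating over a file yields its lines. A failed lease renewal followed by "Filesystem closed" entries matches the situation described in this article.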
Troubleshooting steps
Check the Hadoop Cluster settings.
Try to deploy the job "Cluster Health Check". It will probably report the same error.
Check ulimit -n and cat /proc/sys/fs/file-nr to see whether enough file descriptors are available.
Check /var/log/messages and collect the file.
Do this on both endpoints, that is, on the Datameer host as well as on the cluster nodes.
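The file descriptor check can also be scripted so that the same numbers are collected on every host. The sketch below assumes a Linux host. The helper names `parse_file_nr`, `usage_ratio`, and `report` are hypothetical. The three fields of /proc/sys/fs/file-nr are the allocated file handles, the allocated but unused handles, and the system-wide maximum. No threshold for "enough" is assumed here; interpreting the numbers is left to the reader.

```python
def parse_file_nr(text):
    """Parse the three whitespace-separated fields of /proc/sys/fs/file-nr."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return {"allocated": allocated, "unused": unused, "max": maximum}


def usage_ratio(file_nr):
    """Fraction of the system-wide file handle limit currently in use."""
    return (file_nr["allocated"] - file_nr["unused"]) / file_nr["max"]


def report():
    """Print the per-process limit (what ulimit -n shows) and system-wide usage."""
    import resource  # Unix only; imported here so the parsers work anywhere

    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"per-process open files: soft={soft} hard={hard}")
    with open("/proc/sys/fs/file-nr") as f:
        stats = parse_file_nr(f.read())
    print(f"system-wide: {stats} in use: {usage_ratio(stats):.1%}")
```

Calling `report()` on the Datameer host and on each cluster node gives comparable output. Note that the soft limit applies to the process that runs the script, so run it as the same user that runs the Datameer service.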
Cause
The network connection to the cluster, and with it to the remote storage (HDFS), was lost. This can be caused by network issues, a reboot of the cluster, and so on. The Datameer service then closes the filesystem because the storage (HDFS) is not available. The Datameer service stays in this state even after the remote storage becomes available again. See HDFS-5028 for more information.
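Before applying the solution, it can help to verify that the remote storage is reachable again from the Datameer host. The following is a rough sketch using a plain TCP connect. The function name `can_connect` is hypothetical, and the default timeout of 20 seconds simply mirrors the 20000 millis timeout in the error message. A successful connect only shows that the port accepts connections; it does not prove that HDFS is healthy, and it does not reopen the closed filesystem in the Datameer service.

```python
import socket


def can_connect(host, port, timeout_s=20.0):
    """Return True if a TCP connection to host:port opens within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Call it with the `<hostname>` and `<port>` from the "Retrying connect to server" log line, for example `can_connect("<hostname>", <port>)` with your own values filled in.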
Solution
In this case, restarting conductor will solve the issue.