No jobs are being processed because of a closed filesystem, and we are not able to identify which filesystem is affected.
...
[anonymous] INFO [LeaseRenewer:user@host:port] (Client.java:713) - Retrying connect to server: <hostname>/<ip>:<port>. Already tried 4 time(s); maxRetries=5
[anonymous] WARN [LeaseRenewer:user@host:port] (LeaseRenewer.java:449) - Failed to renew lease for [DFSClient_NONMAPREDUCE_-<id>] for 30 seconds. Aborting ...
org.apache.hadoop.net.ConnectTimeoutException: Call From FQDN/<ip> to <hostname>:<port> failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=<hostname>/<ip>:<port>]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
...
[anonymous] INFO [ConcurrentJobExecutor-2] (MrJobInputFormat.java:186) - Releasing splits (UUID: <id>) from cache, still cached split-arrays: 3
[anonymous] ERROR [ConcurrentJobExecutor-2] (HadoopMrJobClient.java:274) - Failed to cleanup job Workbook <name> / <name>
java.io.IOException: Filesystem closed
...
[system] ERROR [JobScheduler thread-1] (BasicDasStorageProvider.java:24) - Storage not available, Filesystem closed
...
[system] ERROR [JobScheduler thread-1] (JobScheduler.java:538) - Failed to start job, filesystem is not available.
...
Check the Hadoop Cluster settings.
Try to deploy the "Cluster Health Check" job. It will probably report the same error.
Check with ulimit -n and cat /proc/sys/fs/file-nr whether enough file descriptors are available.
Check /var/log/messages and collect the file.
Do this on both endpoints, that is, on the Datameer host as well as on the cluster nodes.
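The file descriptor check above can be scripted so that it runs the same way on the Datameer host and on each cluster node. The following is a minimal sketch in Python, assuming a Linux host. The function names and the 80% warning threshold are illustrative assumptions, not Datameer or Hadoop defaults.

```python
import resource

FILE_NR = "/proc/sys/fs/file-nr"


def parse_file_nr(text):
    """Parse the content of /proc/sys/fs/file-nr.

    The file holds three numbers: allocated file handles,
    allocated-but-unused handles, and the system-wide maximum.
    """
    allocated, unused, maximum = (int(field) for field in text.split())
    return allocated, unused, maximum


def handle_usage(allocated, unused, maximum):
    """Return the fraction of system-wide file handles currently in use."""
    return (allocated - unused) / maximum


def is_near_limit(usage, threshold=0.8):
    """True when usage reaches the threshold (0.8 is an assumed value)."""
    return usage >= threshold


if __name__ == "__main__":
    # Per-process limits; the soft value is what `ulimit -n` prints.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"open files limit: soft={soft} hard={hard}")

    try:
        with open(FILE_NR) as f:
            allocated, unused, maximum = parse_file_nr(f.read())
    except OSError as exc:
        print(f"cannot read {FILE_NR}: {exc}")
    else:
        usage = handle_usage(allocated, unused, maximum)
        in_use = allocated - unused
        print(f"file handles: {in_use} of {maximum} in use ({usage:.1%})")
        if is_near_limit(usage):
            print("WARNING: the system is close to its file handle limit")
```

Because the open files limit applies per process, run the script as the same user that runs the Datameer service so that the reported limit matches the one the service sees.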
The (network) connection to the cluster, and with it the connection to the remote storage (HDFS), was lost. This can be caused by network issues, a reboot of the cluster, and so on. The Datameer service then closes the filesystem because the storage (HDFS) is not available. The Datameer service stays in this state even after the remote storage becomes available again. See HDFS-5028 for more information.
In this case, restarting conductor resolves the issue.
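Because the service does not recover on its own, it can be useful to confirm that the storage endpoint accepts connections again before restarting. The sketch below only tests whether a TCP connection to a given host and port can be opened; it does not verify HDFS itself. The function name, the script name in the usage comment, and the 20 second default timeout (chosen to mirror the 20000 millis timeout in the log above) are assumptions for illustration.

```python
import socket
import sys


def can_connect(host, port, timeout=20.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name resolution failures.
        return False


# Hypothetical usage: python check_storage.py <hostname> <port>
if __name__ == "__main__" and len(sys.argv) == 3:
    host, port = sys.argv[1], int(sys.argv[2])
    reachable = can_connect(host, port)
    print(f"{host}:{port} reachable: {reachable}")
    sys.exit(0 if reachable else 1)
```

Use the hostname and port from the "Retrying connect to server" log line as arguments. A result of False suggests that the network or cluster issue still needs to be resolved before a restart can help.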