RuntimeException: Filesystem closed

Problem

No jobs are being processed because the filesystem is closed, and it is not possible to identify which file system is affected.

Error message

...
[anonymous] INFO [LeaseRenewer:user@host:port] (Client.java:713) - Retrying connect to server: <hostname>/<ip>:<port>. Already tried 4 time(s); maxRetries=5
[anonymous] WARN [LeaseRenewer:user@host:port] (LeaseRenewer.java:449) - Failed to renew lease for [DFSClient_NONMAPREDUCE_-<id>] for 30 seconds. Aborting ...
org.apache.hadoop.net.ConnectTimeoutException: Call From FQDN/<ip> to <hostname>:<port> failed on socket timeout exception: 
org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch :
java.nio.channels.SocketChannel[connection-pending remote=<hostname>/<ip>:<port>]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
...
[anonymous] INFO [ConcurrentJobExecutor-2] (MrJobInputFormat.java:186) - Releasing splits (UUID: <id>) from cache, still cached split-arrays: 3
[anonymous] ERROR [ConcurrentJobExecutor-2] (HadoopMrJobClient.java:274) - Failed to cleanup job Workbook <name> / <name>
java.io.IOException: Filesystem closed
...
[system] ERROR [JobScheduler thread-1] (BasicDasStorageProvider.java:24) - Storage not available, Filesystem closed
...
[system] ERROR [JobScheduler thread-1] (JobScheduler.java:538) - Failed to start job, filesystem is not available.
...

Troubleshooting steps

Check the Hadoop cluster settings.
Try to run the "Cluster Health Check" job. It will probably report the same error.

Check whether enough file descriptors are available with ulimit -n and cat /proc/sys/fs/file-nr.
Check /var/log/messages and save a copy of the file.
Do this on both endpoints, that is, on the Datameer host and on the cluster nodes.
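
The file descriptor check can also be scripted so that it is easy to repeat on every host. The following is a minimal sketch, not part of Datameer: the helper names parse_file_nr and usage_ratio are hypothetical, and it assumes a Linux host where /proc/sys/fs/file-nr contains three numbers (allocated handles, allocated but unused handles, system-wide maximum).

```python
import resource


def parse_file_nr(text):
    """Parse the content of /proc/sys/fs/file-nr into its three fields."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return {"allocated": allocated, "unused": unused, "max": maximum}


def usage_ratio(stats):
    """Fraction of the system-wide file handle limit currently in use."""
    return (stats["allocated"] - stats["unused"]) / stats["max"]


if __name__ == "__main__":
    # Per-process limit for the current user, the same value ulimit -n reports.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"open files limit: soft={soft} hard={hard}")

    # System-wide file handle usage.
    with open("/proc/sys/fs/file-nr") as handle:
        stats = parse_file_nr(handle.read())
    print(f"file handles in use: {stats['allocated'] - stats['unused']} of {stats['max']}")
    print(f"usage: {usage_ratio(stats):.1%}")
```

Run it as the user that owns the Datameer process, since the per-process limit depends on the account. A usage value close to 100% suggests that file descriptors are running out.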

Cause

The network connection to the cluster, and with it the connection to the remote storage (HDFS), was lost. This can be caused by network issues, a cluster reboot, and so on. The Datameer service then closes the filesystem because the storage (HDFS) is not available. The Datameer service stays in this state even after the remote storage becomes available again. See HDFS-5028 for more information.
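
To confirm whether the remote storage is reachable again before restarting, a plain TCP connection test from the Datameer host can help. This is a rough sketch under assumptions: the function name can_connect is hypothetical, the host and port must be taken from your own cluster configuration (the values from the log line "Retrying connect to server"), and the default timeout of 20 seconds only mirrors the 20000 millis shown in the error message. A successful TCP connection does not prove that HDFS itself is healthy.

```python
import socket
import sys


def can_connect(host, port, timeout=20.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Usage: python check_connect.py <hostname> <port>
    host, port = sys.argv[1], int(sys.argv[2])
    if can_connect(host, port):
        print(f"{host}:{port} is reachable")
    else:
        print(f"{host}:{port} is NOT reachable")
```

If the test succeeds while jobs still fail with "Filesystem closed", this matches the state described above, in which the service has not recovered even though the storage is available.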

Solution

In this case, restarting conductor resolves the issue.
