Problem
Running an import job results in records being dropped due to errors such as:
message: ErrorLogException: Invalid record source: org.apache.hadoop.io.ArrayWritable@68ebb8dd
Error Repeated: >100 times
Source Record: org.apache.hadoop.io.ArrayWritable@68ebb8dd
Reason: UnsupportedOperationException: Cannot inspect org.apache.hadoop.hive.serde2.io.DateWritable
Cause
Datameer's Hive Import functionality reads the table's metadata from the Hive Metastore, then connects directly to HDFS to retrieve the data rather than retrieving it from Hive itself. This is done for performance reasons and yields significantly better throughput when reading records. However, it requires that the Hive metadata match exactly what is stored in the data files themselves.
Hive provides a mechanism for reordering columns. When it is used, Hive updates the column order in the Metastore and translates column positions at query time instead of rewriting the data files. Unfortunately, this means the column order Hive reports when the Metastore is queried can differ from the order actually present in the data files themselves.
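For illustration, a statement like the one below (table and column names are hypothetical) reorders a column in the Metastore without rewriting any data files on HDFS, which is exactly the kind of change that produces this mismatch:
-- Metadata-only change: moves order_date after customer_id in the Metastore schema
ALTER TABLE sales_data CHANGE COLUMN order_date order_date DATE AFTER customer_id;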
Solution
There are two ways to resolve this issue:
1. Using Hive, create a copy of the problematic table. The data for the copy is written out in the column order currently specified in the Hive Metastore, so the new table has no mismatch between its metadata and what is actually stored in its data files (see the worked example after this list).
CREATE TABLE new_table_name AS SELECT <* or columns> FROM original_table_name WHERE <expression>;
2. Instead of using one of the Hive connectors, use a direct HDFS connection. This allows Datameer to detect the schema when reading the data files rather than attempting to use the definition from the Hive Metastore. Column names will need to be specified manually, and types will need to be checked to ensure they match what is expected.
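As a minimal sketch of option 1, using hypothetical table names (omit the WHERE clause to copy every row):
CREATE TABLE sales_data_fixed AS SELECT * FROM sales_data;
Once the copy has been written, point the Datameer import job at the new table.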