Problem
When the GROUPCOUNTDISTINCT function runs on a very large data set (more than 10,000 distinct elements in a group), the following error message appears in the Hadoop syslogs for the failing job:
!message:ComputationException: DistinctSubIds: =GROUPCOUNTDISTINCT(#SheetName!ColumnName) failed with NullPointerException:
!Record (current sheet):ColumnA: "(null)", ColumnB: 12345678, ColumnC: (null), ColumnD: "(null)", ColumnE: "(null)"
!stack:datameer.dap.common.exception.ComputationException: DistinctSubIds: =GROUPCOUNTDISTINCT(#SheetName!ColumnName) failed with NullPointerException:
    at datameer.dap.common.formula.RecordContext.createComputationException(RecordContext.java:128)
    at datameer.dap.common.formula.lazy.RecordEvalSequence.toComputationException(RecordEvalSequence.java:123)
    at datameer.dap.common.formula.lazy.RecordEvalSequence.moveToNext(RecordEvalSequence.java:135)
    at datameer.dap.common.formula.lazy.ExpressionEvaluator2$2.computeNext(ExpressionEvaluator2.java:114)
    at datameer.dap.common.formula.lazy.ExpressionEvaluator2$2.computeNext(ExpressionEvaluator2.java:111)
    at datameer.dap.sdk.sequence.Sequence$Simple.moveToNext(Sequence.java:157)
    at datameer.dap.sdk.sequence.Sequence$23.moveToNext(Sequence.java:1140)
    at datameer.dap.sdk.sequence.Sequence$27.moveToNext(Sequence.java:1240)
    at datameer.dap.sdk.sequence.Sequence$13.moveToNext(Sequence.java:602)
    at datameer.dap.sdk.sequence.Sequence$14.computeNext(Sequence.java:647)
    at datameer.dap.sdk.sequence.Sequence$Simple.moveToNext(Sequence.java:157)
    at datameer.plugin.tez.processing.AggregationVertexRecordProcessor.run(AggregationVertexRecordProcessor.java:161)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:324)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:180)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:172)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.call(TezTaskRunner.java:167)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:268)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:344)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
    at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
    at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:3402)
    at org.apache.hadoop.io.SequenceFile$Sorter.merge(SequenceFile.java:3196)
    at org.apache.hadoop.io.SequenceFile$Sorter.mergePass(SequenceFile.java:3172)
    at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:2735)
    at org.apache.hadoop.io.SequenceFile$Sorter.sort(SequenceFile.java:2774)
    at datameer.das.functions.grouping.spills.SpillFile.eachSorted(SpillFile.java:98)
    at datameer.das.functions.grouping.spills.SpillingSet.aggregateMergedValues(SpillingSet.java:142)
    at datameer.das.functions.grouping.spills.SpillingSet.aggregateValues(SpillingSet.java:149)
    at datameer.das.functions.grouping.GroupCountDistinctFunction$GroupCountDistinctAggregator.computeAggregationResult(GroupCountDistinctFunction.java:69)
    at datameer.dap.common.formula.lazy.EvalSequence$6.computeValue(EvalSequence.java:122)
    at datameer.dap.common.formula.lazy.SingleEvalSequence.currentValue(SingleEvalSequence.java:31)
    at datameer.dap.common.formula.lazy.EvalSequence.currentIsError(EvalSequence.java:47)
    at datameer.dap.common.formula.lazy.RecordEvalSequence.moveToNext(RecordEvalSequence.java:134)
    ... 21 more
Cause
This is a limitation of the GROUPCOUNTDISTINCT function. For large data sets, the function spills values to files in temporary storage. That storage can be exhausted, which leads to a NullPointerException.
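To illustrate the general idea of spilling during a distinct count, here is a minimal sketch in Python. It is not Datameer's implementation: the stack trace shows that Datameer relies on Hadoop's SequenceFile sorter, whereas this sketch uses plain text files. The function name `count_distinct_with_spill` and the parameters `max_in_memory` and `spill_dir` are hypothetical, and values are assumed to be strings without newline characters.

```python
import heapq
import os
import tempfile


def count_distinct_with_spill(values, max_in_memory=10_000, spill_dir=None):
    """Count distinct strings while keeping at most max_in_memory of them in RAM.

    Illustrative only: hypothetical names, plain text spill files.
    Assumes values are strings that contain no newline characters.
    """
    buffer = set()
    run_paths = []

    def spill():
        # Write the buffered values as one sorted run, one value per line.
        # Sorting the finished lines keeps each run ordered the same way
        # the merge step compares them.
        fd, path = tempfile.mkstemp(suffix=".spill", dir=spill_dir)
        run_paths.append(path)
        with os.fdopen(fd, "w", encoding="utf-8") as run:
            run.writelines(sorted(v + "\n" for v in buffer))
        buffer.clear()

    try:
        for value in values:
            buffer.add(value)
            if len(buffer) >= max_in_memory:
                spill()

        if not run_paths:
            # Everything fit in memory, so no merge is needed.
            return len(buffer)
        if buffer:
            spill()

        # Merge the sorted runs; equal values arrive next to each other,
        # so counting the changes between neighbours counts distinct values.
        files = [open(path, encoding="utf-8") for path in run_paths]
        try:
            count = 0
            previous = None
            for line in heapq.merge(*files):
                if line != previous:
                    count += 1
                    previous = line
            return count
        finally:
            for f in files:
                f.close()
    finally:
        # Remove the temporary spill files whether or not counting succeeded.
        for path in run_paths:
            os.remove(path)
```

In this sketch, every spill and the final merge depend on the temporary directory being writable and having free space. If it is not, the file operations raise an `OSError`, which is the plain-Python analogue of the spill step failing on temporary storage.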
Solution
As of Datameer versions 4.5.10, 5.0.5, 5.1.1, and 5.2+, the handling of spill files has been improved, so the GROUPCOUNTDISTINCT function can now better handle large data sets.
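The version list above can be turned into a quick check. This is a minimal sketch: the helper `has_spill_fix` is hypothetical and not part of any Datameer tooling. It assumes plain numeric version strings such as "5.0.4", and it treats any release line that the article does not list (and that is below 5.2) as not containing the fix.

```python
# Minimum patch release that contains the fix, per release line,
# taken from the versions listed in this article.
FIXED_PATCH = {(4, 5): 10, (5, 0): 5, (5, 1): 1}


def has_spill_fix(version: str) -> bool:
    """Return True if the given version includes the spill-file enhancement.

    Hypothetical helper; assumes a plain numeric version such as "5.0.4".
    """
    parts = [int(p) for p in version.split(".")[:3]]
    while len(parts) < 3:
        parts.append(0)  # treat "5.2" as "5.2.0"
    major, minor, patch = parts

    if (major, minor) >= (5, 2):
        return True  # 5.2 and all later versions include the fix

    min_patch = FIXED_PATCH.get((major, minor))
    # Assumption: release lines not listed in the article do not have the fix.
    return min_patch is not None and patch >= min_patch
```

For example, `has_spill_fix("4.5.9")` is false and `has_spill_fix("4.5.10")` is true.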
The enhancements are tracked internally under DAP-21628.