Problem
When working with the GROUPCOUNTDISTINCT function on a dataset with a very high number of records in a group, out of memory exceptions may be encountered.
The following illustrates the operation:
Example
Sheet_1: #user_id, #device Sheet_2: GROUPBY(#Sheet_1.device), GROUPCOUNTDISTINCT(#Sheet_2.user_id)
Running this job may result in the following error:
job.log
ERROR [ConcurrentJobExecutor-0] (ClusterSession.java:239) - Failed to generate file for 'GroupByOperation[sheetName=Sheet_2]' [14 mins, 36 sec] java.lang.RuntimeException: Job job_ failed! Failure info: NA at datameer.dap.sdk.util.ExceptionUtil.convertToRuntimeException(ExceptionUtil.java:38) at datameer.dap.sdk.util.ExceptionUtil.convertToRuntimeException(ExceptionUtil.java:30) at datameer.dap.common.graphv2.hadoop.MrJob.runImpl(MrJob.java:193) at datameer.dap.common.graphv2.ClusterJob.run(ClusterJob.java:134) at datameer.dap.common.graphv2.ClusterSession.execute(ClusterSession.java:227) at datameer.dap.common.graphv2.ConcurrentClusterSession$1.run(ConcurrentClusterSession.java:53) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: java.io.IOException: Job job_ failed! Failure info: NA at datameer.dap.common.job.mr.HadoopMrJobClient.waitUntilJobCompletion(HadoopMrJobClient.java:189) at datameer.dap.common.job.mr.HadoopMrJobClient.runJobImpl(HadoopMrJobClient.java:81) at datameer.dap.common.job.mr.MrJobClient.runJob(MrJobClient.java:32) at datameer.dap.common.graphv2.hadoop.MrJob.runImpl(MrJob.java:188) ... 9 more Caused by: java.lang.RuntimeException: Task: Error: Direct buffer memory
Cause
This issue is caused by the fact that in order to determine the true distinct value, GROUPCOUNTDISTINCT need to shift all mapped information to disk on one reducer. This is a limitation and can cause disk space and performance/memory issues on large datasets.
Solution
In order to work around this issue, an intermediary sheet along with a combination of GROUPBY and GROUPCOUNT functions is suggested as described below.
Original
Sheet_1: #user_id, #device Sheet_2: GROUPBY(#Sheet_1.device), GROUPCOUNTDISTINCT(#Sheet_2.user_id)
Workaround
Sheet_1: #user_id, #device Sheet_2: GROUPBY(#Sheet_1.device), GROUPBY(#Sheet_1.user_id) Sheet_3: GROUPBY(#Sheet_2.device), GROUPCOUNT
This will give the same desired output of the GROUPCOUNTDISTINCT function.
Comments
0 comments
Please sign in to leave a comment.