group_previous
I used group_previous on smaller datasets and worked fine. However, on a 10GB dataset (204M rows), it did not work. Is there a size restriction?
-
I used 8 patients data, groupby(A), groupby(B), group_sort_asc(C), group_sort_asc(D), group_previous(E). It worked. When I tried to use the same fields and the same syntax for 204M rows, the group_previous(E) did not show up in the preview. It also seems to slow the Datameer processing.
-
Hello YM Wang,
>When I tried to use the same fields and the same syntax for 204M rows, the group_previous(E) did not show up in the preview.
Do you mean there is no data in Column E? Have you executed the Workbook after you applied formulas to larger dataset, whether there are still no results? -
The fields are string and date types. I think there is no error in the full run. As I mentioned, the weird part is the same setup with <200 rows worked.
In addition, if I change the first field to be group_sort_asc(A), second to groupby(B), the group_previous(E) field will show, even in the preview. However, it did not sort right and so no right previous values attained.
-
It is notable that the GROUP_PREVIOUS function does in fact display a null record type when a group's size is 1 because there is no previous value to pull data from. It would be valuable to check the cardinality of the data groups to ensure that there is more than one entry there.
Specifically, if you replace the GROUP_PREVIOUS(E) function with a GROUP_COUNT() function. Do you receive counts other than 1 in the resulting column?
-
I was wrong that after running, it did get values. And there were errors: Errors
failed with NullPointerException
I found a thread with the same error message for GROUPCOUNTDISTINCT function. The cause was noted as "This is a limitation for the GROUPCOUNTDISTINCT function. For large datasets, the file spilling to temporary storage can get exhausted and lead to a NullPointerException." I think it is the same situation. I hope to know how big the data could be before the temp storage gets exhausted.
In addition, it is interesting when I change the first field to group_sort_asc() instead if groupby(), the preview shows values. However, they are not the values that I need...
-
NullPointerException is a very generic error. Are you able to share more context on the stacktrace for the error itself? For the test I posed in my last update, I was specifically referring to the GROUPCOUNT() function and not the GROUPCOUNTDISTINCT() function. For GROUPCOUNT() temporary space is not a concern.
For clarity, here is the test that I am curious to know about:
- GROUPBY(A)
- GROUPBY(B)
- GROUP_SORT_ASC(C)
- GROUP_SORT_ASC(D)
- GROUPCOUNT()
Then are there values in the GROUPCOUNT column that are greater than 1. If not, it sounds like the GROUP_SORT_ASC results are leading to groupings that are of cardinality (think of group "size") 1. If groups of cardinality 1 exist, the GROUP_PREVIOUS function will return a null value by design.
Please sign in to leave a comment.
Comments
11 comments