How to do group B - group A?
Use case: For column A, I want to know how many new strings that have never been seen before were added in the past week, and I want to be able to schedule this job to run weekly.
Creating group A (every unique string before last week) is easy; GROUPBY can do it. The same goes for group B (every unique string from last week). But how do I do the set complement, i.e. Group B - Group A?
-
Official comment
To get the set complement, you can utilize the Join functionality along with a filter:
- Perform a Right Outer Join of Group A to Group B (this will preserve the entire set of Group B).
- Filter the joined sheet to remove entries where a join key from Group A was found for Group B.
This will leave you with the set complement, Group B - Group A.
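The same join-then-filter logic can be sketched outside Datameer in plain Python. This is only an illustration of the idea, not how Datameer implements joins; the sample values and the names `group_a`, `group_b`, `right_outer_join`, and `set_complement` are made up for the example.

```python
def right_outer_join(group_a, group_b):
    """Join A to B, keeping every key of B.

    Returns (a_key, b_key) pairs; a_key is None when B's key has no match in A.
    """
    a_keys = set(group_a)
    return [(key if key in a_keys else None, key) for key in group_b]


def set_complement(group_a, group_b):
    """Return the strings in B that never appear in A (B - A)."""
    joined = right_outer_join(group_a, group_b)
    # Filter step: keep only rows where no key from A was joined.
    return [b_key for a_key, b_key in joined if a_key is None]


# Hypothetical sample data: unique strings before last week (A) and last week (B).
group_a = ["alpha", "beta", "gamma"]
group_b = ["beta", "delta", "epsilon"]

new_strings = set_complement(group_a, group_b)
print(new_strings)       # ['delta', 'epsilon']
print(len(new_strings))  # 2
```

In plain Python, `set(group_b) - set(group_a)` gives the same members directly; the join-and-filter form is shown only because it mirrors the two workbook steps.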
Hope this helps!
Thanks Joel.
I followed the steps and I do see a smart sample of the result (pretty excited so far!), but when I run the job, I get the following exception:
ERROR [2016-03-25 22:06:15.952] [JobScheduler worker1-thread-5106] (DasJobCallable.java:135) - Job failed! Execution plan: null
java.lang.NullPointerException: No sheet with ID 'ca30b021-a050-4ee3-8ea4-562ad3e4fb08' found.
    at datameer.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:229)
    at datameer.dap.common.entity.WorkbookConfigurationImpl.getSheet(WorkbookConfigurationImpl.java:372)
    at datameer.dap.common.job.WorkbookJob.exchangeWithSnapshotSheet(WorkbookJob.java:192)
    at datameer.dap.common.job.WorkbookJob.compileWorkbook(WorkbookJob.java:162)
    at datameer.dap.common.job.WorkbookJob.registerJobOperations(WorkbookJob.java:264)
    at datameer.dap.common.job.DatameerJob.createExecutionPlan(DatameerJob.java:99)
    at datameer.dap.common.job.DasJobCallable.call(DasJobCallable.java:95)
    at datameer.dap.common.job.DasJobCallable.call(DasJobCallable.java:50)
    at datameer.dap.conductor.job.JobSchedulerJob$2.call(JobSchedulerJob.java:124)
    at datameer.dap.conductor.job.JobSchedulerJob$2.call(JobSchedulerJob.java:106)
    at datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:100)
    at datameer.dap.conductor.job.JobSchedulerJob.call(JobSchedulerJob.java:106)
    at datameer.dap.conductor.job.JobSchedulerJob.call(JobSchedulerJob.java:40)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:662)
-
The ID ca30b021-a050-4ee3-8ea4-562ad3e4fb08 belongs to a sheet that the job references but that does not exist with its full data set. To look up the sheet ID directly, download the Job Trace for that particular job and review the job-definition.json file inside the downloaded zip file. This will show which sheet name corresponds to this ID.
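If you would rather not read the file by hand, a small script can search it for you. The sketch below is a generic search, not a Datameer tool: the layout of job-definition.json is not described in this thread, so it makes no assumption about field names and simply reports every JSON object that holds the ID as a direct value. The archive name `job-trace.zip` and both function names are hypothetical.

```python
import json
import zipfile


def find_objects_with_value(node, target, path="$"):
    """Collect (path, object) for every JSON object that holds `target` as a direct value."""
    hits = []
    if isinstance(node, dict):
        if target in node.values():
            hits.append((path, node))
        for key, value in node.items():
            hits.extend(find_objects_with_value(value, target, f"{path}.{key}"))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            hits.extend(find_objects_with_value(value, target, f"{path}[{index}]"))
    return hits


def find_sheet_in_job_trace(zip_path, sheet_id):
    """Open the Job Trace zip and search job-definition.json for the sheet ID."""
    with zipfile.ZipFile(zip_path) as archive:
        # Match on the file name so it is found wherever it sits inside the zip.
        name = next(n for n in archive.namelist() if n.endswith("job-definition.json"))
        with archive.open(name) as handle:
            definition = json.load(handle)
    return find_objects_with_value(definition, sheet_id)


# Example call (hypothetical archive name):
# for path, obj in find_sheet_in_job_trace("job-trace.zip", "ca30b021-a050-4ee3-8ea4-562ad3e4fb08"):
#     print(path, obj)
```

Each hit prints the location of the matching object and the object itself, which should include whatever name field sits next to the ID.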
Is this workbook linked to other workbooks as the source? If so, do the parent sheets still exist in full? This error commonly indicates that a sheet did not exist or was not saved in full for reference.
-
I wanted to make sure that the following situation is not affecting your environment:
- ParentWorkbook has sheets: ParentSheet1 and ParentSheet2
- ParentSheet1 and ParentSheet2 are "Saved" sheets in ParentWorkbook
- ChildWorkbook uses ParentSheet1 as a source sheet
- ParentWorkbook's configuration changes and ParentSheet1 is no longer a "Saved" sheet.
In this circumstance, ChildWorkbook was set up to use ParentSheet1 while it was a saved sheet, but the configuration has since changed and ParentSheet1 is no longer saved.
Does that clarify what I meant by "still exist in full" before?