How to do group B - group A?

March 25, 2016 01:13

Use case: For column A, I want to know how many new strings that have never been seen before were added in the past week, and I want to be able to schedule this job to run weekly.

Creating group A (every unique string before last week) is easy; GROUPBY can do it. The same goes for group B (every unique string from last week). But how do I do the set complement, i.e. Group B - Group A?

Comments


Official comment
Joel Stewart

March 25, 2016 15:47
To get the set complement, you can use the Join functionality along with a filter:
1. Perform a Right Outer Join of Group A to Group B (this preserves the entire set of Group B).
2. Filter the joined sheet to remove rows where a matching key from Group A was found for the Group B entry.
This will leave you with the set complement Group B - Group A.
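
For readers who want to see the logic outside of a workbook, here is a minimal Python sketch of the same two steps. It is not Datameer code; the function names `right_outer_join` and `b_minus_a` and the sample strings are made up for illustration, and both groups are assumed to already contain unique, non-empty strings (as they would after a GROUPBY).

```python
def right_outer_join(group_a, group_b):
    """Step 1: keep every key of group_b, paired with its match in group_a.

    Rows with no match in group_a get None in the A column, which is how
    a right outer join marks a missing partner.
    """
    a_keys = set(group_a)
    return [(key if key in a_keys else None, key) for key in group_b]


def b_minus_a(group_a, group_b):
    """Step 2: drop joined rows where a key from group_a was found."""
    joined = right_outer_join(group_a, group_b)
    return [b_key for a_key, b_key in joined if a_key is None]


# Hypothetical sample data: strings seen before last week vs. last week.
group_a = ["alpha", "beta", "gamma"]
group_b = ["beta", "delta", "gamma", "epsilon"]

new_strings = b_minus_a(group_a, group_b)
print(new_strings)       # ['delta', 'epsilon']
print(len(new_strings))  # 2
```

The count of the remaining rows is the number of strings that appeared for the first time last week.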

Hope this helps!

Simon Gao

March 25, 2016 22:10

Thanks Joel.

I followed the steps and I do see a smart sample of the result (pretty excited so far!)

But when I run the job, I get the following exception:

ERROR [2016-03-25 22:06:15.952] [JobScheduler worker1-thread-5106] (DasJobCallable.java:135) - Job failed! Execution plan: null
java.lang.NullPointerException: No sheet with ID 'ca30b021-a050-4ee3-8ea4-562ad3e4fb08' found.
	at datameer.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:229)
	at datameer.dap.common.entity.WorkbookConfigurationImpl.getSheet(WorkbookConfigurationImpl.java:372)
	at datameer.dap.common.job.WorkbookJob.exchangeWithSnapshotSheet(WorkbookJob.java:192)
	at datameer.dap.common.job.WorkbookJob.compileWorkbook(WorkbookJob.java:162)
	at datameer.dap.common.job.WorkbookJob.registerJobOperations(WorkbookJob.java:264)
	at datameer.dap.common.job.DatameerJob.createExecutionPlan(DatameerJob.java:99)
	at datameer.dap.common.job.DasJobCallable.call(DasJobCallable.java:95)
	at datameer.dap.common.job.DasJobCallable.call(DasJobCallable.java:50)
	at datameer.dap.conductor.job.JobSchedulerJob$2.call(JobSchedulerJob.java:124)
	at datameer.dap.conductor.job.JobSchedulerJob$2.call(JobSchedulerJob.java:106)
	at datameer.dap.common.security.DatameerSecurityService.runAsUser(DatameerSecurityService.java:100)
	at datameer.dap.conductor.job.JobSchedulerJob.call(JobSchedulerJob.java:106)
	at datameer.dap.conductor.job.JobSchedulerJob.call(JobSchedulerJob.java:40)
	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
	at java.lang.Thread.run(Thread.java:662)

Joel Stewart

March 26, 2016 00:01
The ID ca30b021-a050-4ee3-8ea4-562ad3e4fb08 is the ID of a sheet that this job references but that does not exist with its full data set. To check the sheet ID directly, you can download the Job Trace for that particular job and review the job-definition.json file inside the downloaded zip file. This will show which sheet name corresponds to this ID.
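
If the file is large, a small script can do the search. The sketch below is only an illustration: the layout of job-definition.json is not described in this thread, so the code assumes nothing about field names and simply reports every JSON object that holds the ID as one of its direct values. The helper names and the zip file name in the usage comment are hypothetical.

```python
import json
import zipfile


def find_objects_with_value(node, target, path="$"):
    """Yield (path, obj) for every JSON object holding `target` as a direct value."""
    if isinstance(node, dict):
        if any(value == target for value in node.values()):
            yield path, node
        for key, value in node.items():
            yield from find_objects_with_value(value, target, f"{path}.{key}")
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_objects_with_value(item, target, f"{path}[{index}]")


def lookup_sheet(trace_zip_path, sheet_id):
    """Search every job-definition.json inside the job trace zip for sheet_id."""
    matches = []
    with zipfile.ZipFile(trace_zip_path) as archive:
        for name in archive.namelist():
            # Match on the file name only, in case it sits in a subfolder.
            if name.endswith("job-definition.json"):
                with archive.open(name) as handle:
                    definition = json.load(handle)
                matches.extend(find_objects_with_value(definition, sheet_id))
    return matches


# Usage (the zip file name is an assumption):
# for path, obj in lookup_sheet("job-trace.zip",
#                               "ca30b021-a050-4ee3-8ea4-562ad3e4fb08"):
#     print(path, obj)
```

Each match is printed together with its position in the file, so the sheet name should be visible among the neighbouring fields of the matching object.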

Is this workbook linked to other workbooks as the source? If so, do the parent sheets still exist in full? This error commonly indicates that a sheet did not exist or was not saved in full for reference.

Simon Gao

March 26, 2016 00:04
Thanks, Joel. I was able to get around that problem by moving all the logic into one workbook. Just for future reference, may I ask what you mean by "exist in full"?

Joel Stewart

March 26, 2016 00:09
I wanted to make sure that the following situation was not affecting your environment:
1. ParentWorkbook has sheets: ParentSheet1 and ParentSheet2
2. ParentSheet1 and ParentSheet2 are "Saved" sheets in ParentWorkbook
3. ChildWorkbook uses ParentSheet1 as a source sheet
4. ParentWorkbook's configuration changes and ParentSheet1 is no longer a "Saved" sheet.
In this situation, ChildWorkbook was set up to use ParentSheet1 while it was a saved sheet, but the configuration has since changed and ParentSheet1 is no longer saved.

Does that clarify what I meant by "still exist in full" before?

Simon Gao

March 26, 2016 00:15
Thanks. That is a very detailed answer. Thanks a lot.

