How to eliminate duplicate rows?

June 16, 2015 16:36

I have rows like
- 1, abc, xyz, 123, 567
- 2, def, str, 123, 789
- 1, abc, xyz, 123, 567
- 3, abc, xyz, 123, 567

Whats the best way to eliminate duplicate rows and get the result as

1, abc, xyz, 123, 567
2, def, str, 123, 789
3, abc, xyz, 123, 567

Comments

9 comments

Official comment
Joel Stewart

June 16, 2015 19:07
To combine duplicated rows, you may use a grouping function.

In the example above, a GROUPBY(#FirstColumn) would result in only three rows. The other values could be copied over from the original sheet.
Comment actions Permalink
John Mulhall

August 10, 2015 13:20
So to clarify, if you use GROUPBY(#FirstColumn), Datameer will match rows by first column, and then check every column on the sheet to make sure the role is truly duplicated?

e.g.
abc, xyz, 123, 567
abc, xyz, 123, 567
abc, xyz, 123, 568

GROUPBY (FirstColumn)
Result:
abc, xyz, 123, 567
abc, xyz, 123, 568

Is this correct?
0

Comment actions Permalink
Joel Stewart

August 10, 2015 15:13
John, the GROUPBY(#FirstColumn) would only output a single value. So for this example data, you'd get the following results:
e.g.
abc, xyz, 123, 567
abc, xyz, 123, 567
abc, xyz, 123, 568

GROUPBY(#FirstColumn) -- Result:
abc

However, if you used consecutive GROUPBY statements for all four columns, you'd get the result that you reported:
GROUPBY(#FirstColumn), GROUPBY(#SecondColumn), GROUPBY(#ThirdColumn), GROUPBY(#FourthColumn) -- Result:
abc, xyz, 123, 567
abc, xyz, 123, 568
1

Comment actions Permalink
John Mulhall

August 10, 2015 15:37
Joel, thanks for this. Much appreciated! It seems like a cumbersome way to deduplicate records in a process. Is there any specific function to do de-deduplicate records like the "UNIQUE" function for lists?
0

Comment actions Permalink
Joel Stewart

August 10, 2015 16:34
The number of GROUPBY functions used does increase with the number of columns that need to be deduplicated together. There is not another function that will automatically deduplicate records based on all columns.

The GROUPBY approach is very flexible. If there is only a handful of columns that need to be deduplicated, these columns may be used specifically in the Workbook and the others may be omitted.
0

Comment actions Permalink
John Mulhall

August 10, 2015 16:44
Thanks Joel for your continuing engagement on this. So, if I GROUPBY (#Column A) and row 1 and 3 (col A) have identical identifiers, will the dulicate cell in Col A be removed or will be row/record be removed? If the cell is removed, will the other columnner information in the row/record be out of what with Col A? I presume the duplicate row goes right?
0

Comment actions Permalink
Joel Stewart

August 10, 2015 18:04
Datameer processes functions on an entire column, not at a cell level. When thinking about the data set in this way, the functions are creating a new manipulated version of the data on this new worksheet.

The operation of GROUPBY lists out all unique values for the requested column. If multiple GROUPBY statements are included on a worksheet, they unique groupings are nested.

I'll work through the latest example you commented about. In this example, you have a set of data in which Column A has the same value for Row 1 and Row 3. Here is an example that meets this requirement:
ColumnA, ColumnB, ColumnC
abc, 123, apple
xyz, 987, apple
abc, 456, banana

If I create a new worksheet with the function "GROUPBY(#ColumnA)", I would receive these results:
GROUPBY(#ColumnA)
abc
xyz
Of note, this is only 2 records because the unique number of values in ColumnA is just 2.

If I add a second function "GROUPBY(#ColumnB)" then I will see these results:
GROUPBY(#ColumnA), GROUPBY(#ColumnB)
abc, 123
abc, 456
xzy, 987
Of note, the output has increased from 2 records to 3 records because the unique pairings of ColumnA and ColumnB are 3.

(The order of the displayed records was changed here for clarity of the grouping concept. I wanted to keep the "abc" records adjacent. This re-order is intended for the demonstration and is not related to this discussion.)
0

Comment actions Permalink
John Mulhall

August 11, 2015 11:18
Thanks Joel, so it basically works like a manual version of the DISTINCT command in SQL, right?
0

Comment actions Permalink
Joel Stewart

August 11, 2015 15:18
Yes, the SQL DISTINCT function is a good parallel to draw to the GROUPBY function.
0

Comment actions Permalink

Please sign in to leave a comment.

Comments

Didn't find what you were looking for?