How to eliminate duplicate rows?
I have rows like
- 1, abc, xyz, 123, 567
- 2, def, str, 123, 789
- 1, abc, xyz, 123, 567
- 3, abc, xyz, 123, 567
Whats the best way to eliminate duplicate rows and get the result as
- 1, abc, xyz, 123, 567
- 2, def, str, 123, 789
- 3, abc, xyz, 123, 567
-
Official comment
To combine duplicated rows, you may use a grouping function.
In the example above, a GROUPBY(#FirstColumn) would result in only three rows. The other values could be copied over from the original sheet.
Comment actions -
So to clarify, if you use GROUPBY(#FirstColumn), Datameer will match rows by first column, and then check every column on the sheet to make sure the role is truly duplicated?
e.g.
abc, xyz, 123, 567
abc, xyz, 123, 567
abc, xyz, 123, 568GROUPBY (FirstColumn)
Result:
abc, xyz, 123, 567
abc, xyz, 123, 568Is this correct?
-
John, the GROUPBY(#FirstColumn) would only output a single value. So for this example data, you'd get the following results:
e.g.
abc, xyz, 123, 567
abc, xyz, 123, 567
abc, xyz, 123, 568GROUPBY(#FirstColumn) -- Result:
abcHowever, if you used consecutive GROUPBY statements for all four columns, you'd get the result that you reported:
GROUPBY(#FirstColumn), GROUPBY(#SecondColumn), GROUPBY(#ThirdColumn), GROUPBY(#FourthColumn) -- Result:
abc, xyz, 123, 567
abc, xyz, 123, 568 -
The number of GROUPBY functions used does increase with the number of columns that need to be deduplicated together. There is not another function that will automatically deduplicate records based on all columns.
The GROUPBY approach is very flexible. If there is only a handful of columns that need to be deduplicated, these columns may be used specifically in the Workbook and the others may be omitted.
-
Thanks Joel for your continuing engagement on this. So, if I GROUPBY (#Column A) and row 1 and 3 (col A) have identical identifiers, will the dulicate cell in Col A be removed or will be row/record be removed? If the cell is removed, will the other columnner information in the row/record be out of what with Col A? I presume the duplicate row goes right?
-
Datameer processes functions on an entire column, not at a cell level. When thinking about the data set in this way, the functions are creating a new manipulated version of the data on this new worksheet.
The operation of GROUPBY lists out all unique values for the requested column. If multiple GROUPBY statements are included on a worksheet, they unique groupings are nested.
I'll work through the latest example you commented about. In this example, you have a set of data in which Column A has the same value for Row 1 and Row 3. Here is an example that meets this requirement:
ColumnA, ColumnB, ColumnC
abc, 123, apple
xyz, 987, apple
abc, 456, bananaIf I create a new worksheet with the function "GROUPBY(#ColumnA)", I would receive these results:
GROUPBY(#ColumnA)
abc
xyz
Of note, this is only 2 records because the unique number of values in ColumnA is just 2.If I add a second function "GROUPBY(#ColumnB)" then I will see these results:
GROUPBY(#ColumnA), GROUPBY(#ColumnB)
abc, 123
abc, 456
xzy, 987
Of note, the output has increased from 2 records to 3 records because the unique pairings of ColumnA and ColumnB are 3.(The order of the displayed records was changed here for clarity of the grouping concept. I wanted to keep the "abc" records adjacent. This re-order is intended for the demonstration and is not related to this discussion.)
Please sign in to leave a comment.
Comments
9 comments