How to eliminate duplicate rows?

Comments

9 comments

  • Official comment
    Joel Stewart

    To combine duplicated rows, you may use a grouping function.

    In the example above, a GROUPBY(#FirstColumn) would result in only three rows. The other values could be copied over from the original sheet.

    Comment actions Permalink
  • John Mulhall

    So to clarify, if you use GROUPBY(#FirstColumn), Datameer will match rows by first column, and then check every column on the sheet to make sure the role is truly duplicated?

    e.g.
    abc, xyz, 123, 567
    abc, xyz, 123, 567
    abc, xyz, 123, 568

    GROUPBY (FirstColumn)
    Result:
    abc, xyz, 123, 567
    abc, xyz, 123, 568

    Is this correct?

    0
    Comment actions Permalink
  • Joel Stewart

    John, the GROUPBY(#FirstColumn) would only output a single value. So for this example data, you'd get the following results:
    e.g.
    abc, xyz, 123, 567
    abc, xyz, 123, 567
    abc, xyz, 123, 568

    GROUPBY(#FirstColumn) -- Result:
    abc

    However, if you used consecutive GROUPBY statements for all four columns, you'd get the result that you reported:
    GROUPBY(#FirstColumn), GROUPBY(#SecondColumn), GROUPBY(#ThirdColumn), GROUPBY(#FourthColumn) -- Result:
    abc, xyz, 123, 567
    abc, xyz, 123, 568

    1
    Comment actions Permalink
  • John Mulhall

    Joel, thanks for this. Much appreciated! It seems like a cumbersome way to deduplicate records in a process. Is there any specific function to do de-deduplicate records like the "UNIQUE" function for lists?

    0
    Comment actions Permalink
  • Joel Stewart

    The number of GROUPBY functions used does increase with the number of columns that need to be deduplicated together. There is not another function that will automatically deduplicate records based on all columns.

    The GROUPBY approach is very flexible. If there is only a handful of columns that need to be deduplicated, these columns may be used specifically in the Workbook and the others may be omitted.

    0
    Comment actions Permalink
  • John Mulhall

    Thanks Joel for your continuing engagement on this. So, if I GROUPBY (#Column A) and row 1 and 3 (col A) have identical identifiers, will the dulicate cell in Col A be removed or will be row/record be removed? If the cell is removed, will the other columnner information in the row/record be out of what with Col A? I presume the duplicate row goes right?

    0
    Comment actions Permalink
  • Joel Stewart

    Datameer processes functions on an entire column, not at a cell level. When thinking about the data set in this way, the functions are creating a new manipulated version of the data on this new worksheet.

    The operation of GROUPBY lists out all unique values for the requested column. If multiple GROUPBY statements are included on a worksheet, they unique groupings are nested.

    I'll work through the latest example you commented about. In this example, you have a set of data in which Column A has the same value for Row 1 and Row 3. Here is an example that meets this requirement:
    ColumnA, ColumnB, ColumnC
    abc, 123, apple
    xyz, 987, apple
    abc, 456, banana

    If I create a new worksheet with the function "GROUPBY(#ColumnA)", I would receive these results:
    GROUPBY(#ColumnA)
    abc
    xyz
    Of note, this is only 2 records because the unique number of values in ColumnA is just 2.

    If I add a second function "GROUPBY(#ColumnB)" then I will see these results:
    GROUPBY(#ColumnA), GROUPBY(#ColumnB)
    abc, 123
    abc, 456
    xzy, 987
    Of note, the output has increased from 2 records to 3 records because the unique pairings of ColumnA and ColumnB are 3.

    (The order of the displayed records was changed here for clarity of the grouping concept. I wanted to keep the "abc" records adjacent. This re-order is intended for the demonstration and is not related to this discussion.)

    0
    Comment actions Permalink
  • John Mulhall

    Thanks Joel, so it basically works like a manual version of the DISTINCT command in SQL, right?

    0
    Comment actions Permalink
  • Joel Stewart

    Yes, the SQL DISTINCT function is a good parallel to draw to the GROUPBY function.

    0
    Comment actions Permalink

Please sign in to leave a comment.