Removing Duplicate Rows
Hi,
Is there any other efficient way to remove duplicate rows instead of applying "GROUP BY" to all the columns? Kindly suggest.
Thank You in Advance,
Swetha Telu
-
Official comment
Hello Swetha,
I could suggest the following approach to remove duplicated records from a dataset.
Let's say you have a dataset and need to remove duplicated records based on data in first 3 columns - this means that, if there are any other records that contain the same values in first 3 columns (let's call them Key Columns), they will be considered as duplicates.
You could change number of key columns according to your requirements.
Please try the following steps:
Prepare the data
- Add a column to your sheet and apply function
GROUPBY(1)
to it. This will fill whole column with the same value ("1" in this case) in each row. - Add one more column (lets name it RowNumber) and apply function
GROUPROWNUMBER()
to it. This will assign unique number to every row across the sheet (as all of them are at the same group we've created at the step one). - At new column (lets call it KeyColumnsHash). Apply
CONCUT
function to your 3 key columns, then hash them in order to make result string shorter - you could use formulasha_256(CONCAT(sha_256(#KeyCol1); sha_256(#KeyCol2); sha_256(#KeyCol3)))
Remove Duplicates
- At the new sheet execute
GROUPBY
function under KeyColumnsHash column. It returns Hash values for rows with unique data in key columns. - At the same sheet apply
GROUPFIRST
function with RowNumber column as both arguments. It will display corresponding row number for values from KeyColumnsHash column.
After all these steps you should have two sheets. The first contains whole your data set and additional columns with row number and hash string of your key columns. The second - rows with unique key columns' hash string and their numbers. You could perform Inner Join over these two sheets by RowNumber and exclude unnecessary columns. This will give you your dataset without duplicated rows.
Hope you will find this helpful.
Best regards,
Konsta.Comment actions - Add a column to your sheet and apply function
-
Hi Konsta,
I have tried the way you suggested. It is working fine as expected. Thank You so much for the quick response.
But here if we import the data into a workbook, we need to create a duplicate data source sheet as we cannot apply any modifications to the original data source sheet. So in this case we will have three sheets : 1) Original Data Source sheet 2) Duplicate of the Data Source sheet 3) Required Final Output.
Please correct me if I am wrong.
Thanks in Advance,
Swetha
-
Hello Swetha,
Yes, you are right, you need to duplicate source sheet to be able to perform action of the dataset.
In total you should have the following sheets at your workbook:
1. Original data source.
2. Duplicate of Original data source (at this sheet you will add GroupRowNumber and Hash or your key columns).
3. Data preparation sheet (here you should applyGROUPBY
function under KeyColumnsHash column andGROUPFIRST
function for RowNumber column).
4. Result sheet, where you will get results of your Join operation.Best regards,
Konsta.
Please sign in to leave a comment.
Comments
4 comments