Approach to unpivot a workbook
Hello Datameer,
We have the following technical question; in which way in Datameer can we convert the columns header (months) in rows. The workbook is already uploaded.
Example
|
Jan |
Feb |
Mar |
INICIO |
126.0 |
40.0 |
176.0 |
DTU/DTI |
49.0 |
141.0 |
204.0 |
Desired result
INICIO |
JAN |
126.0 |
INICIO |
FEB |
40.0 |
INCIO |
MAR |
176 |
DTU/DTI |
JAN |
49.0 |
...
-
Official comment
I can't think of a function that can be used to perform this directly, however it can be done in a couple of steps.
In a sheet, create a keyed list of the values that you'd like to split later. Here is an example function to do that:
LIST("Jan: "+T(#Jan); "Feb: "+T(#Feb); "Mar: " + T(#Mar))
After that, you can create a second sheet and use the EXPAND(#Sheet1!MyList) function to separate the list into one row per list. You may also reference #Sheet1!column1 to copy the original values over in this data too.
Then, you may split the contents of the EXPAND data from "Jan: 126.0" into the different columns using the SUBSTR function.
Lastly, be sure to convert "126.0" from a STRING back to a FLOAT using the FLOAT() function.
Comment actions -
Joel - thanks for the suggestion. We are running into the following error with the EXPAND function. Any ideas how to workaround this.
Sheet 1, the expand function it´s displaying the values in right order
Sheet 2, the copy function takes the values for the expand function in the sheet 1 in wrong order
-
The row order is not expected to be maintained between different sheets. Hadoop may process the records on different distributed nodes and the order of the output records may differ from the input records. This occurs during a "Shuffle" phase.
All the rows should still be consistent with respect to their relative column data.
If the order of the results is critical to the analysis, you may implement a sort on the sheet to force the order to be preserved. This is generally not recommended because it significantly reduces the performance of the job by reducing the distribution of processing.
-
Joel -
Thanks again! So, I just implemented your technique and Im going to suggest your approach which is LIST("Jan: "+T(#Jan); "Feb: "+T(#Feb); "Mar: " + T(#Mar)). I think the issue with the original approach is that he is doing a group rownumber and that requires a shuffle I'm guessing. With your approach, there should be no shuffle correct?
-
The GROUPROWNUMBER function would be tying this solution too closely to the order of the rows. I advise taking a critical look at the expected outcome to verify if the row number itself is required.
Without full scope of the workbook that the user is developing, it's difficult to say if there would be any shuffling or not. The best way I can think to answer this question is to recommend building analytics where the order of the rows is not critical.
-
Thanks Joel. I have suggested your approach. I agree that we should divorce ourselves from having t do a group number and also not worry about the order of records. But I have a feeling that they are looking to display the records in the Jan, Feb .... Dec order. I will confirm and get back to you.
Please sign in to leave a comment.
Comments
6 comments