Approach to unpivot a workbook

January 19, 2016 17:03

Hello Datameer,

We have the following technical question; in which way in Datameer can we convert the columns header (months) in rows. The workbook is already uploaded.

Example

	Jan	Feb	Mar
INICIO	126.0	40.0	176.0
DTU/DTI	49.0	141.0	204.0

Desired result

INICIO	JAN	126.0
INICIO	FEB	40.0
INCIO	MAR	176
DTU/DTI	JAN	49.0

...

Comments

6 comments

Official comment
Joel Stewart

January 19, 2016 19:48
I can't think of a function that can be used to perform this directly, however it can be done in a couple of steps.

In a sheet, create a keyed list of the values that you'd like to split later. Here is an example function to do that:

LIST("Jan: "+T(#Jan); "Feb: "+T(#Feb); "Mar: " + T(#Mar))

After that, you can create a second sheet and use the EXPAND(#Sheet1!MyList) function to separate the list into one row per list. You may also reference #Sheet1!column1 to copy the original values over in this data too.

Then, you may split the contents of the EXPAND data from "Jan: 126.0" into the different columns using the SUBSTR function.

Lastly, be sure to convert "126.0" from a STRING back to a FLOAT using the FLOAT() function.
Comment actions Permalink
Nikhil Kumar

February 04, 2016 18:58
Joel - thanks for the suggestion. We are running into the following error with the EXPAND function. Any ideas how to workaround this.

Sheet 1, the expand function it´s displaying the values in right order

Sheet 2, the copy function takes the values for the expand function in the sheet 1 in wrong order
0

Comment actions Permalink
Joel Stewart

February 04, 2016 19:47
The row order is not expected to be maintained between different sheets. Hadoop may process the records on different distributed nodes and the order of the output records may differ from the input records. This occurs during a "Shuffle" phase.

All the rows should still be consistent with respect to their relative column data.

If the order of the results is critical to the analysis, you may implement a sort on the sheet to force the order to be preserved. This is generally not recommended because it significantly reduces the performance of the job by reducing the distribution of processing.
0

Comment actions Permalink
Nikhil Kumar

February 04, 2016 20:44
Joel -

Thanks again! So, I just implemented your technique and Im going to suggest your approach which is LIST("Jan: "+T(#Jan); "Feb: "+T(#Feb); "Mar: " + T(#Mar)). I think the issue with the original approach is that he is doing a group rownumber and that requires a shuffle I'm guessing. With your approach, there should be no shuffle correct?
0

Comment actions Permalink
Joel Stewart

February 04, 2016 21:07
The GROUPROWNUMBER function would be tying this solution too closely to the order of the rows. I advise taking a critical look at the expected outcome to verify if the row number itself is required.

Without full scope of the workbook that the user is developing, it's difficult to say if there would be any shuffling or not. The best way I can think to answer this question is to recommend building analytics where the order of the rows is not critical.
0

Comment actions Permalink
Nikhil Kumar

February 04, 2016 21:37
Thanks Joel. I have suggested your approach. I agree that we should divorce ourselves from having t do a group number and also not worry about the order of records. But I have a feeling that they are looking to display the records in the Jan, Feb .... Dec order. I will confirm and get back to you.
0

Comment actions Permalink

Please sign in to leave a comment.

Comments

Didn't find what you were looking for?