How can I 'pipe' or 'join' multiple formulas so I can do several IF statements?
I've tried a lot of different formula and operator combinations but can't seem to nail this down. Here's the problem: I have a field that is cast as a STRING because 90% of it are numbers but the rest contain letters at the end of the numbers. This is due to some numbers having " MB", " GB", and " TB" appended to them. That's why Datameer casts it as a STRING. I'm cleaning up the data in a workbook to trim off the letters, leave only numbers, cast the numbers as FLOAT, and then execute the correct math to normalize the field to MB.
The formula below works BUT I can only catch two of the four cases (MB and GB). Hence my question... how can I put together multiple IF statements to take care of all 4 (or possibly more) cases? Also how do I put in the function to cast it as a FLOAT and then use mathematical operators on it?
IF(CONTAINS(#TSM_Filespace_Ent_Strg_DataLink!Capacity_Raw ; "MB"); REGEX(#TSM_Filespace_Ent_Strg_DataLink!Capacity_Raw;"\\b(.+)\\s(MB)";"$1"); IF(CONTAINS(#TSM_Filespace_Ent_Strg_DataLink!Capacity_Raw ; "GB"); REGEX(#TSM_Filespace_Ent_Strg_DataLink!Capacity_Raw;"\\b(.+)\\s(GB)";$1"); #TSM_Filespace_Ent_Strg_DataLink!Capacity_Raw))
Here's some pseudo code of what I really want to do:
IF #field CONTAINS " KB"
THEN "\\b(.+)\\s([M,G,T,P]B)"; FLOAT($1)/1024
ELSE IF #field CONTAINS " MB"
THEN "\\b(.+)\\s([M,G,T,P]B)"; FLOAT($1)
ELSE IF #field CONTAINS " GB"
THEN "\\b(.+)\\s([M,G,T,P]B)"; FLOAT($1)*1024
ELSE IF #field CONTAINS " TB"
THEN"\\b(.+)\\s([M,G,T,P]B)"; FLOAT($1)*1024*1024
ELSE FLOAT(#field)
Please advise; thanks!
I tried doing it like this:
IF(CONTAINS(#TSM_Filespace_Ent_Strg_DataLink!Capacity_Raw ; REGEX(#TSM_Filespace_Ent_Strg_DataLink!Capacity_Raw;"\\b(.+)\\s([M,G,T,P]B)"; "$1")); REGEX(#TSM_Filespace_Ent_Strg_DataLink!Capacity_Raw;"\\b(.+)\\s([M,G,T,P]B)"; "$1"); #TSM_Filespace_Ent_Strg_DataLink!Capacity_Raw)
I got this to work but had even worse problems. Yeah it strips off all the letters but now I have no idea which number is MB, GB, or TB? Stripping off the designation is meaningless... it's now just a number... I need them to all be MB. So I was able to work within the 3 argument limit of the IF function but still won't get me what I want.
-
Official comment
There are a few ways to do this in Datameer.
The most direct method that I can think of is to create the IF statements within each other and create one very long function:
IF(CONTAINS(#TSM_Filespace_Ent_Strg_DataLink!Capacity_Raw;"KB");FLOAT(SUBSTR(#Data;0;LEN(#Data)-2))*1024;IF(CONTAINS(#TSM_Filespace_Ent_Strg_DataLink!Capacity_Raw;"MB");FLOAT(SUBSTR(#Data;0;LEN(#Data)-2))*1024*1024;IF(CONTAINS(#TSM_Filespace_Ent_Strg_DataLink!Capacity_Raw;"GB");FLOAT(SUBSTR(#Data;0;LEN(#Data)-2))*1024*1024*1024;IF(CONTAINS(#TSM_Filespace_Ent_Strg_DataLink!Capacity_Raw;"TB");FLOAT(SUBSTR(#Data;0;LEN(#Data)-2))*1024*1024*1024*1024;null))))
The most readable solution that I can think how splits the logic into different columns and then joins the result at the end:
- New Column Titled "KB":
-
IF(CONTAINS(#TSM_Filespace_Ent_Strg_DataLink!Capacity_Raw;"KB");FLOAT(SUBSTR(#Data;0;LEN(#Data)-2))*1024;null)
- New Column Titled "MB":
-
IF(CONTAINS(#TSM_Filespace_Ent_Strg_DataLink!Capacity_Raw;"MB");FLOAT(SUBSTR(#Data;0;LEN(#Data)-2))*1024*1024;null)
- New Column Titled "GB":
-
IF(CONTAINS(#TSM_Filespace_Ent_Strg_DataLink!Capacity_Raw;"GB");FLOAT(SUBSTR(#Data;0;LEN(#Data)-2))*1024*1024*1024;null)
- New Column Titled "TB":
-
IF(CONTAINS(##TSM_Filespace_Ent_Strg_DataLink!Capacity_Raw;"TB");FLOAT(SUBSTR(#Data;0;LEN(#Data)-2))*1024*1024*1024*1024;null)
- New Column Titled "Final Result":
-
COALESCE(#KB;#MB;#GB;#TB)
Personally, I prefer the more readable method. It's easier to manipulate when editing or troubleshooting. Ultimately, when Datameer compiles the job for execution on the full data set, there should not be a detectable difference in performance for designing the query to be in multiple parts.
Comment actions -
Hi Morgan,
Correct, IF only takes three arguments.
I see two approaches:
1. Create a second column where you populate records with MB/GB/TB depending on what the corresponding entry has. Then strip the letters and you can use your new column with the size abbreviation to do your multiplication
2. Generate three new columns, one for MB, one for GB, and one for TB. Sanitize them individually, then recombine them into a single column after they've been corrected.
-
Joel Stewart,
Thanks for helping with this.
I get the premise of nesting the IF statements and I've tried it several ways but I always get syntax errors. Even if I simplify your example it comes back with syntax errors.
IF(CONTAINS(#TSM_Filespace_Ent_Strg_DataLink!Capacity_Raw;"KB");"KB FOUND";IF(CONTAINS(#TSM_Filespace_Ent_Strg_DataLink!Capacity_Raw;"MB");"MB FOUND";IF(CONTAINS(#TSM_Filespace_Ent_Strg_DataLink!Capacity_Raw;"GB");"GB FOUND";IF(CONTAINS(##TSM_Filespace_Ent_Strg_DataLink!Capacity_Raw;"TB");"TB FOUND";null)I tried plugging in your formula and it returned syntax errors (I changed #Data to my field names). Help me understand what's going on here. I get that you're nesting an IF statement as the third argument for each IF statement. What I don't understand is how and why you're using LEN and SUBSTR together. I get how they work individually but not grasping how you're using them together to make the FLOAT work.
Also what are the tags used to mark text as code? It's not listed in the WYSIWYG editor. Thx!
-
Morgan,
Is there a technical reason why you are wanting to nest everything into a single IF statement? It seems that your current approach is a much, much more difficult way to achieve what you want. Please help us understand why you don't want to split the conversions into smaller chunks with multiple columns. You can always create a separate worksheet just to do these manipulations and then use the final column in another sheet.
As far as your syntax issue, the formula you built has a mismatch in the number of open parenthesis and closed parenthesis. You've got 8 open parenthesis with only 5 closing parenthesis. This is one of the troubles of taking the nested approach. You have to be extremely meticulous with your syntax.
Also, in order to put in a code block within the WSYIWYG, click the Paragraph button on the top left and select code.
-
Morgan, with respect to the LEN and SUBSTR functions, I'm nesting these to perform the conversion in-line.
For example, I'm converting a string that looks like "14KB" to a Float with value "14366".
- Create a substring of "14KB" that removes the "KB" delimiter: To do this, I take the length of the original string (4 in this example) and then I take a substring of the original example that stops 2 characters before the end (omitting the "KB" characters).
- Convert the string value of "14" to a Float value of 14.
- Multiply the Float value of 14 by 1024 to give me the Float value of 14366 (the bytes count).
The key here is that I'm explicitly removing exactly the last two characters of the original string to get a new string with just number characters.
PS I see my example also contains the parentheses mismatch issue, I suspect these were truncated mistakenly when I copy-pasted the example. Sorry for any confusion this may have caused.
-
Alan,
Thanks for catching the parenthesis. I'll see if I can get it working. I had two reasons for trying to get it all done in one formula. First being sprawl... widening my 500,000 row dataset by 10 fields adds up. Is it enough to worry about? I'm not sure as I don't really deal with any of the underlying infrastructure of our environment or even administration or performance of Datameer. So I wanted to be prudent. Secondly... I figured there was a way to do it and I was curious =) I did end up doing multiple fields and separate formulas and it works.
Also thanks for the tip on where to find the code tags!Joel,
Thank you for explaining how you're using LEN and SUBSTR. Makes sense now. Using that with FLOAT helped me cut down the amount of fields I added in half. I was originally doing all of the string manipulation in one formula then converting them to floats in another formula then finally merging it all together into one field. Now that I can do the string manipulation, data type casting, and math in one field I just have one field for KB, MB, GB, TB, and PB.
I'll still try playing with getting it all done in one formula... just because it sounds cool =) but also what do you and Alan think about my data sprawl concerns? Is it really enough to worry about for performance or capacity? It's quite possible I'm worrying about nothing in this regard.
Thanks guys!
-
The recommendation that I have if you're concerned about the sprawl of the data is to use an intermediate sheet and not mark it to be kept. This would allow you to have the flexibility and ease of designing the solution that is easy to understand and edit.
At the very end, simply only the required columns to a duplicate sheet. This allows Datameer to optimize the code at run time and not be forced to write the intermediate results to disk -- they will only be kept in memory and can be compiled optimally as well. This is one of the advantages of using Datameer -- you can work in the way that you think and then Datameer can handle the compilation and optimization at run time.
There is merit to your concern if you plan to save all sheets, but this is not a best practice as it leads to overhead that is often unutilized.
Please sign in to leave a comment.
Comments
7 comments