Extract values from a string
We have a column for comments. People can tag important info in this string field using a # and code for the Start date, end date, 3rd party system, Work effort in hours. So a string value is inconsistent and can look like any of these four examples:
#Start-2021/05/01 #3RD-Yes #HRS-5 #End-2021-05-12
#Start-2021/06/01 #3RD-N
#End-20210514 #HRS-9
#Start-20210601 #End-2021-06-30 #Hrs-3
How do I parse it out if the code #<whatever> is consistent, but the position, length and existence are not?
I could do a long winded way in SQL but how could this be achieved in Datameer. Any alternatives to REGEX?
-
Hello Elwyn.
May I ask you to provide the following details, please.- What Datameer version do you use?
- If the # character represents the start of the desired part of a sting, is there anything for the end of this part? Or the next # character is the end of a current segment and the start of the next one?
-
Hello Elwyn.
Perhaps you could leverage the Datameer REGEX function.I've tested the following approach using the records you've shared
In case the only indication of the end of the tag is a start of a new tag, we could assume that we need to extract the part of the string that starts with the #<Tag name> and ends with the next # character.
I've used the *#TAG([^#]+).* expression for this.
For example, to get the Start values.
Or to get the End values.
I hope this helps.
-
Hello Elwyn.
May I ask you to:
- Create a new Workbook without any sources (cancel the choose source dialog)
- Create a single record via T("#IMPACT-2 #EFFORT-2 #SD-20210407 #BFY #CD-20210421")
- At the next column apply the formula REGEX(#A;".*#SD-([^#]+).*"; "$1")
- At the next column apply the formula LEFT(#A;10)
Let me know if any of those formulas return anything.
-
So yes I did that and it does work. The regex did return the right value.
Which means I went back to the original workbook, and wondered if it was something wrong with the column values itself so I wrapped them in a T function but no result.
So I took the 20 left character of the column that was had the original column data in a T function and it returned 20 characters. I then applied the Regex on that new column with the LEFT function.... no result.
Now that we know the function works successfully on my side, I am left wondering what is wrong with the data that could cause the REGEX to fail in this case? -
Hi Elwyn,
If you wrapped the column data in a T() function, it would be simple string data in the new column.
First I have to ask what's probably a silly question: You're applying the REGEX function directly on the T() column output, before using the LEFT function, correct? If you were using LEFT first, your string doesn't appear in the first 20 characters. This would result in no output.
Are you nesting the functions?
LEFT(REGEX(T(#A);".*#SD-([^#]+).*"; "$1");10)
If so, I'd suggest trying to simplify things one step at a time to see if that's a factor.
If not, it might be worth a try doing the full evaluation in a single column.It also might be worthwhile to run the workbook while it's returning no values for this column to see if there are any exceptions thrown.
-
OK I have gotten somewhere or maybe a clue....but to answer your questions, no not nesting the functions and the LEFT function was wrapped around the T. So REGEX( LEFT( T( the thinking that, convert the original to text, take the first 10 chars and then regex that.
I used LEFT 20 to solve for additional fields with "#" and would it know where to stop. Using 20 characters was just getting part of the string for testing, in reality I wont use LEFT function at all, the RegEx would (hopefully) know where to get the start and end of the "#" code provided.
Here is what I have...
OrigText : The unmodified text
ColA : T(#OrigText)
ColB : LEFT(#ColA; 10) *** more on this later
Result : REGEX(#ColB;".*#SD-([^#]+).*"; "$1")
And AWESOME it works!
Now, I will change the number of characters in the LEFT function to include up to the next code.
Brilliant it works!
But heres where it goes wrong... I will change it to include 19 characters, and it fails (does not work)
It will work with 18 characters, which would be right up to the "Y" at the end. So what I suspect is causing this and not sure how to fix it, but it must be carriage returns or maybe tab characters. Here are 25 charcters and so its stopping on the 19th character (right after the "Y" and before the next #)
Hope that sheds some light and maybe there is a resolution. Ideally the end state should be no need for ColA or ColB and the RegEx is done on the OrigText value. -
Ok solved it. It was as suspected a line feed (carriage return)...although I thought the // in the regex was supposed to handle that. When I was testing, I would copy the text into a new workbook which i guess was removing the linefeed(?)
I had to use the CODE function to find out what was in the blank spaces at the end of the string I wanted. And turns out it was Ascii code 10. So then I used the REPLACEALL function to remove this and put a blank space instead.
So in the end, this would be the final function(s) to get the data separated based on the # position.
REGEX(REPLACEALL(#OrigText;CHAR(10);"");".*#CD-([^#]+).*"; "$1")
Thank you both so much for your time and knowledge. Sorry it was just something as simple as a character type - but then again, glad its a simple solution.
Please sign in to leave a comment.
Comments
18 comments