Extract values from a string

Comments

18 comments

  • Konsta Danyliuk

    Hello Elwyn.
    May I ask you to provide the following details, please.

    • What Datameer version do you use?
    • If the # character represents the start of the desired part of a sting, is there anything for the end of this part? Or the next # character is the end of a current segment and the start of the next one? 
    0
    Comment actions Permalink
  • Elwyn Roberts

    Version  11.1.3.1

    No there is nothing that determines the end of the string other than the possibility of the next #


    0
    Comment actions Permalink
  • Konsta Danyliuk

    Hello Elwyn.
    Perhaps you could leverage the Datameer REGEX function.

    I've tested the following approach using the records you've shared 

    In case the only indication of the end of the tag is a start of a new tag, we could assume that we need to extract the part of the string that starts with the #<Tag name> and ends with the next # character. 

    I've used the *#TAG([^#]+).* expression for this.

    For example, to get the Start values.

    Or to get the End values.

    I hope this helps.

    1
    Comment actions Permalink
  • Elwyn Roberts

    Thanks I will give it a go and let you know

    0
    Comment actions Permalink
  • Elwyn Roberts

    Didnt work for me.  

    For example, this is the string value
    #IMPACT-2 #EFFORT-2 #SD-20210407 #BFY #CD-20210421

    I tried the following code:
    REGEX(#Notes;".*#SD-([^#]+).*"; "$1")

    Not sure if I am missing something in the syntax.

    0
    Comment actions Permalink
  • Konsta Danyliuk

    Hello Elwyn.

    I can get the SD tag value in my lab (version 11.1.3.1) using the same formula you do. 

    Have you received an error or just an empty string?

     

    0
    Comment actions Permalink
  • Elwyn Roberts

    Hmmm - No error, just an empty string.... 

    0
    Comment actions Permalink
  • Konsta Danyliuk

    Hello Elwyn.
    Are you able to extract any other tag from this string using the same approach, e.g., EFFORT or CD?

    0
    Comment actions Permalink
  • Elwyn Roberts

    Sadly no.... I even tried just taking the first 20 characters with the LEFT function, where I know the the code #Start appears in the first 20 characters of the string but it still did not return anything.

    0
    Comment actions Permalink
  • Konsta Danyliuk

    Hello Elwyn.

    May I ask you to:

    • Create a new Workbook without any sources (cancel the choose source dialog)
    • Create a single record via T("#IMPACT-2 #EFFORT-2 #SD-20210407 #BFY #CD-20210421")
    • At the next column apply the formula REGEX(#A;".*#SD-([^#]+).*"; "$1")
    • At the next column apply the formula LEFT(#A;10)

    Let me know if any of those formulas return anything.

    0
    Comment actions Permalink
  • Elwyn Roberts

    So yes I did that and it does work.  The regex did return the right value.

    Which means I went back to the original workbook, and wondered if it was something wrong with the column values itself so I wrapped them in a T function but no result.

    So I took the 20 left character of the column that was had the original column data in a T function and it returned 20 characters.  I then applied the Regex on that new column with the LEFT function.... no result.

    Now that we know the function works successfully on my side, I am left wondering what is wrong with the data that could cause the REGEX to fail in this case? 

    0
    Comment actions Permalink
  • Alan Mark

    Hi Elwyn,

    If you wrapped the column data in a T() function, it would be simple string data in the new column. 

    First I have to ask what's probably a silly question: You're applying the REGEX function directly on the T() column output, before using the LEFT function, correct?  If you were using LEFT first, your string doesn't appear in the first 20 characters.  This would result in no output.

    Are you nesting the functions?  
    LEFT(REGEX(T(#A);".*#SD-([^#]+).*"; "$1");10)

    If so, I'd suggest trying to simplify things one step at a time to see if that's a factor.  
    If not, it might be worth a try doing the full evaluation in a single column.

    It also might be worthwhile to run the workbook while it's returning no values for this column to see if there are any exceptions thrown.

    1
    Comment actions Permalink
  • Elwyn Roberts

    OK I have gotten somewhere or maybe a clue....but to answer your questions, no not nesting the functions and the LEFT function was wrapped around the T.  So REGEXLEFTT(   the thinking that, convert the original to text, take the first 10 chars and then regex that.

    I used LEFT 20 to solve for additional fields with "#" and would it know where to stop.  Using 20 characters was just getting part of the string for testing, in reality I wont use LEFT function at all, the RegEx would (hopefully) know where to get the start and end of the "#" code provided.

    Here is what I have...
    OrigText   : The unmodified text
    ColA        :  T(#OrigText)
    ColB        :  LEFT(#ColA; 10)   *** more on this later
    Result      :  REGEX(#ColB;".*#SD-([^#]+).*"; "$1")

    And AWESOME it works!

    Now, I will change the number of characters in the LEFT function to include up to the next code.

    Brilliant it works!


    But heres where it goes wrong... I will change it to include 19 characters, and it fails (does not work)

    It will work with 18 characters, which would be right up to the "Y" at the end.  So what I suspect is causing this and not sure how to fix it, but it must be carriage returns or maybe tab characters.  Here are 25 charcters and so its stopping on the 19th character (right after the "Y" and before the next #)


    Hope that sheds some light and maybe there is a resolution.  Ideally the end state should be no need for ColA or ColB and the RegEx is done on the OrigText value.




    0
    Comment actions Permalink
  • Alan Mark

    Hi Elwyn,

    I've simplified and changed the REGEX expression slightly to terminate on a whitespace character rather than #.  Can you please see if this helps?

    REGEX(#B;"#SD-([^\\s]+).*";"$1")

    0
    Comment actions Permalink
  • Elwyn Roberts

    Only works up to 12 characters, so the space after the 311 causes it to fail.  I have to assume this is also a special character maybe?

    0
    Comment actions Permalink
  • Alan Mark

    Hi Elwyn,

    Both Konsta and I have been unable to reproduce this behavior.  Could you please open a support ticket for this?  Also if you could provide a CSV with the actual data you're working with when  you open the ticket - that might help us figure out what's going on.

    0
    Comment actions Permalink
  • Elwyn Roberts

    Ok solved it.  It was as suspected a line feed (carriage return)...although I thought the // in the regex was supposed to handle that.  When I was testing, I would copy the text into a new workbook which i guess was removing the linefeed(?)

    I had to use the CODE function to find out what was in the blank spaces at the end of the string I wanted.  And turns out it was Ascii code 10.  So then I used the REPLACEALL function to remove this and put a blank space instead.

    So in the end, this would be the final function(s) to get the data separated based on the # position.

    REGEX(REPLACEALL(#OrigText;CHAR(10);"");".*#CD-([^#]+).*"; "$1")

    Thank you both so much for your time and knowledge.  Sorry it was just something as simple as a character type - but then again, glad its a simple solution.



    0
    Comment actions Permalink
  • Alan Mark

    Glad you got it sorted out and your hunch about the space not being a space was spot on!  Great data sleuthing.

    0
    Comment actions Permalink

Please sign in to leave a comment.