Parse Message Body from Email
I am importing emails from an Outlook inbox and can't find a useful way to extract the message body from the XML. Below is a snippet of what I am getting:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<meta name="Generator" content="Microsoft Exchange Server">
<!-- converted from rtf -->
<style><!-- .EmailQuote { margin-left: 1pt; padding-left: 4pt; border-left: #800000 2px solid; } --></style>
</head>
<body>
<font face="Arial" size="2"><span style="font-size:10.5pt;">
<div>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec scelerisque, ante eget imperdiet pretium, enim sapien imperdiet enim, id sollicitudin elit justo in lectus. Vivamus nec erat id nibh gravida accumsan. Nullam vel nulla in libero luctus rhoncus
non vitae nibh. Aenean eget est pretium, viverra mauris quis, tristique dui.</div>
<div><font face="Calibri" size="2"><span style="font-size:11pt;"> </span></font></div>
<div>id sollicitudin elit justo in lectus. Vivamus nec erat id nibh gravida accumsan. Nullam vel nulla in libero luctus rhoncus non vitae nibh. Aenean eget est pretium, viverra mauris quis, tristique dui.</div>
<div><font face="Calibri" size="2"><span style="font-size:11pt;"> </span></font></div>
<div>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec scelerisque, ante eget imperdiet pretium, enim sapien imperdiet enim, id sollicitudin elit justo in lectus. Vivamus nec erat id nibh gravida accumsan. Nullam vel nulla in libero luctus </div>
<div><font face="Calibri" size="2"><span style="font-size:11pt;"> </span></font></div>
</span></font>
</body>
</html>
I want to extract the bold and italic text, and none of the associated XML.
Any help would be greatly appreciated!
-
This is easiest to handle when the data is being brought into Datameer using the HTML Files type: https://documentation.datameer.com/documentation/display/DAS70/Importing+HTML+Files
Specifically, if you select the "Remove meta information and HTML tags" option from the HTML Parse Options on the Data Details page, you're left with just this data:
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec scelerisque, ante eget imperdiet pretium, enim sapien imperdiet enim, id sollicitudin elit justo in lectus. Vivamus nec erat id nibh gravida accumsan. Nullam vel nulla in libero luctus rhoncus non vitae nibh. Aenean eget est pretium, viverra mauris quis, tristique dui. id sollicitudin elit justo in lectus. Vivamus nec erat id nibh gravida accumsan. Nullam vel nulla in libero luctus rhoncus non vitae nibh. Aenean eget est pretium, viverra mauris quis, tristique dui. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec scelerisque, ante eget imperdiet pretium, enim sapien imperdiet enim, id sollicitudin elit justo in lectus. Vivamus nec erat id nibh gravida accumsan. Nullam vel nulla in libero luctus
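For anyone who wants to see what this kind of tag stripping amounts to outside Datameer, here is a minimal Python sketch using only the standard library. It is not Datameer's implementation; the class name `BodyTextExtractor` and the helper `strip_html` are made up for illustration, and the choice to skip `<head>`, `<style>`, and `<script>` content is an assumption about what "meta information" covers.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside head, style, or script."""

    # Assumption: content of these elements counts as "meta information".
    SKIPPED = {"head", "style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space so adjacent <div>s do not run together,
        # then collapse all whitespace runs (newlines, non-breaking spaces).
        # Simplification: a word split across inline tags gains a space.
        return " ".join(" ".join(self._chunks).split())


def strip_html(markup: str) -> str:
    """Return the visible text of an HTML document as a single line."""
    parser = BodyTextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style><!-- .EmailQuote { margin-left: 1pt; } --></style>"
    "</head><body><div>First paragraph.</div>"
    "<div><font size='2'> </font></div>"
    "<div>Second paragraph.</div></body></html>"
)
print(strip_html(sample))  # First paragraph. Second paragraph.
```

As in the Datameer output above, the result is one run of plain text with the empty spacer `<div>` elements collapsed away.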
-
In that case, the data can be extracted directly with the REMOVE_HTML_TAGS function: https://documentation.datameer.com/documentation/display/DAS70/REMOVE_HTML_TAGS. This is the same process that I recommended during ingestion, but it is available within the workbook.