Datameer Workbook Best Practices
Hello,
I have a question about best practices for creating workbooks. Most of our team members come from an RDBMS background. In our organization, we normally create workbooks in Datameer that are processed in a Hadoop environment and exported to Hive tables for further use in Tableau. A typical workbook pulls data from other workbooks or data links; we perform joins, add calculations, and so on, and end up with a final sheet containing all the data required for export, all within one single workbook. Is there a concept of normalization in Datameer? Do you recommend breaking these workbooks down into smaller workbooks and joining them together to create a final workbook? What are the pros and cons of each approach?
I'd appreciate any suggestions, and if someone can share links to best-practice documents, that would be really great.
Thanks.
-
Official comment
Hi Jyoti, I recommend starting with this article: How to Optimize a Workbook
If you and your team have other specific questions, let us know and we'll gladly assist.
-
Joel,
Thanks for the link to the workbook optimization article. My question, though, is this: when creating workbooks that have many data sources and lots of complex joins and calculations, is it advisable to break them into smaller workbooks, each with its own joins and calculations, and then join those smaller workbooks to create a final workbook used for reporting? Would that add any memory or performance overhead on the environment? Does anyone apply data modeling or normalization concepts, as in an RDBMS, here too?
Thanks
-
Thank you for clarifying, Jyoti. There's no single answer I can recommend based on your additional context.
From a teamwork perspective, it may be best to split up the content into smaller workbooks so that intermediate data results are available for other teams to access.
From a purely performance perspective, Datameer will optimize the job best if it knows the full pipeline in a single workbook. This can be improved even further by not saving intermediate sheets.
Splitting the workbooks does add some overhead to the overall calculations, and it forces some intermediate results (the results from each workbook) to be saved to HDFS. That overhead may be worth it, though, if it improves reusability for your team.
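To make the trade-off concrete, here is a toy Python sketch. It is not Datameer's API and does not model how Datameer actually schedules or optimizes jobs; `Workbook`, `run`, and the sample steps are hypothetical names, and the `saved` dictionary simply stands in for results persisted to HDFS. It only illustrates the point above: the same steps give the same final data either way, but every workbook boundary adds one more saved intermediate result.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]
Step = Callable[[List[Row]], List[Row]]


@dataclass
class Workbook:
    """Hypothetical stand-in for a workbook: a name plus ordered steps."""
    name: str
    steps: List[Step]


def run(workbooks: List[Workbook], rows: List[Row]):
    """Run workbooks in sequence, 'saving' one result per workbook."""
    saved: Dict[str, List[Row]] = {}  # stands in for results written to storage
    data = rows
    for wb in workbooks:
        for step in wb.steps:
            data = step(data)
        saved[wb.name] = list(data)  # each workbook boundary persists its output
    return data, saved


def keep_positive(rows: List[Row]) -> List[Row]:
    # Filter step: drop rows with a non-positive amount.
    return [r for r in rows if r["amount"] > 0]


def add_doubled(rows: List[Row]) -> List[Row]:
    # Calculation step: add a derived column.
    return [{**r, "doubled": r["amount"] * 2} for r in rows]


rows = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": -5},
    {"region": "east", "amount": 7},
]

# Option A: the whole pipeline in one workbook.
single = [Workbook("final", [keep_positive, add_doubled])]
# Option B: the same steps split across two workbooks.
split = [Workbook("cleaned", [keep_positive]), Workbook("final", [add_doubled])]

result_single, saved_single = run(single, rows)
result_split, saved_split = run(split, rows)

print(result_single == result_split)        # same final data
print(len(saved_single), len(saved_split))  # saved results per layout
```

In this sketch the split layout keeps an extra `cleaned` result that another team could reuse, which is the reusability benefit mentioned above, at the cost of one more write.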
-
Thanks for this thread. I am a new user and had exactly the same questions. As I sandbox my first solution, I'm noticing that the thought process I use to develop my answer is reflected in the sheets of the book, and my book has a lot of sheets. My first approach usually isn't the best, so I tend to go back and optimize for performance once I know it works. This thread helps me see that I probably shouldn't split my process apart into separate books, but should instead maintain a linear progression through the sheets.