Smart Execution, In-Memory Jobs, Workbook Size
Dear,
We are testing/measuring performance in a new environment: Datameer on a Hadoop cluster versus Datameer in local mode.
1) Workbooks that use data links are considerably faster on the Hadoop cluster, which is understandable, but in-memory jobs are considerably slower. Can you please advise which setting(s) to change in order to improve this?
2) Smart Execution is not used on the Hadoop cluster, although the Spark and Tez plug-ins are enabled in Datameer, and all jobs are executed as "Standard MR job". How can we enable Smart Execution, and what are the prerequisites? (Maybe this is connected to issue no. 1.)
3) Workbooks are considerably larger on the Hadoop cluster. This is not due to replication, and we are using Gzip compression. Any advice on that (maybe LZO is used by default in local mode)?
Looking forward to your answers, and thanks in advance.
Best Regards,
Aleksandar Razmovski
-
Hello Aleksandar.
- Without information about your exact cluster settings, it is hard to say what might be causing the slowness in your case. Perhaps it is related to the execution engine used or to memory settings. We could investigate further if you provide more information about your cluster and Datameer configuration (version, memory allocation, etc.).
- In order to use Smart Execution, you need an appropriate license that enables this feature.
Meanwhile, you could try different execution frameworks (e.g. Tez) by setting the custom property das.execution-framework=Tez, either globally on the Hadoop Cluster page or individually for a job.
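For reference, custom properties are plain key=value lines. A minimal sketch follows; the Tez value comes from the answer above, while the commented Spark line is purely illustrative and assumes the Spark plug-in is licensed and enabled:

```properties
# Global default on the Hadoop Cluster page,
# or per job in the job's custom properties field:
das.execution-framework=Tez
# Illustrative alternative, if the Spark plug-in should be exercised instead:
# das.execution-framework=Spark
```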
- The LZO codec is not used as the default compression in Datameer.
To make a fair comparison of workbook sizes between local and cluster mode, I would suggest creating a baseline test workbook in each mode and checking its size. You could use the instructions from the article How to Generate Normal Distributed Random Values to build the workbook for a clean test.
-
Dear,
About 1):
I've noticed one odd thing in the job log:
[system] INFO [2016-11-11 09:56:38.877] [ConcurrentJobExecutor-0] (JdbcSplitter.java:71) - SplitHint{numMapTasks=149, minSplitSize=0, maxSplitSize=9223372036854775807, minSplitCount=0, maxSplitCount=4}
while the settings in mapred-site.xml are:
<property>
<name>mapreduce.input.fileinputformat.split.maxsize</name>
<value>536870912</value>
</property>
<property>
<name>mapreduce.input.fileinputformat.split.minsize</name>
<value>134217728</value>
</property>
<property>
<name>mapreduce.job.max.split.locations</name>
<value>5</value>
</property>Where does Datameer get the above numbers (in the Split Hint)?
Thanks in advance.
Best Regards,
Aleksandar Razmovski
-
Datameer tries to find an optimal split size and split count based on the configured values for minimum/maximum split size and minimum/maximum split count.
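Note that the SplitHint in your log comes from Datameer's JdbcSplitter, so for database-backed data links the bounds appear to be Datameer's own defaults rather than the mapreduce.input.fileinputformat.* values, which govern file-based splits. As an illustration of the general clamping idea (a hypothetical sketch, not Datameer's actual code):

```python
import math

def split_count_hint(total_size, max_split_size, min_split_count, max_split_count):
    """Hypothetical sketch: derive a split count that honors both bounds.

    Not Datameer's actual implementation; shows only the generic
    clamping logic implied by min/max split size and count limits.
    """
    # Fewest splits that keep each split at or below the maximum size.
    needed = math.ceil(total_size / max_split_size) if max_split_size > 0 else 1
    # Clamp the result into the configured count bounds.
    return max(min_split_count, min(needed, max_split_count))
```

With maxSplitSize effectively unbounded (9223372036854775807, i.e. Long.MAX_VALUE) and maxSplitCount=4, as in your log line, this kind of logic would cap the number of splits at 4 regardless of input size.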
-
Dear,
Also, the RAM allocated to containers never exceeds 2 GB, although the settings suggest otherwise:
dfs.datanode.data.dir = /localservices/hdfs_data
yarn.scheduler.capacity.maximum-am-resource-percent = 20
yarn.nodemanager.vmem-check-enabled = false
dfs.permissions.superusergroup = hdfs
mapreduce.map.cpu.vcores = 5
mapreduce.map.speculative = false
mapreduce.output.fileoutputformat.compress = true
mapreduce.task.io.sort.mb = 1024
mapreduce.reduce.cpu.vcores = 5
yarn.scheduler.minimum-allocation-vcores = 1
mapreduce.reduce.memory.mb = 4096
dfs.namenode.checkpoint.dir = /localservices/secondary1,/localservices/secondary2
yarn.nodemanager.local-dirs = /localservices/yarn-local
mapreduce.job.max.split.locations = 5
yarn.resourcemanager.address = ${yarn.resourcemanager.hostname}:8032
yarn.scheduler.increment-allocation-mb = 512
mapreduce.map.output.compress = true
hadoop.tmp.dir = /localservices/tmp
yarn.nodemanager.vmem-pmem-ratio = 2.1
yarn.application.classpath = $HADOOP_CONF_DIR,/usr/lib/hadoop-yarn/*,/usr/lib/hadoop-mapreduce/*,/usr/lib/hadoop-hdfs/*,/usr/lib/*,/usr/lib/hadoop-hdfs/lib/*,/usr/lib/hadoop-mapreduce/lib/*,/usr/lib/hadoop-yarn/lib/*,/usr/lib/hadoop/*,/usr/lib/hadoop/lib/*
dfs.namenode.name.dir = /localservices/hadoopname1,/localservices/hadoopname2
fs.tachyon.impl = tachyon.hadoop.TFS
yarn.nodemanager.resource.memory-mb = 12288
yarn.nodemanager.resource.cpu-vcores = 7
mapred.child.java.opts = -Xmx3277m
yarn.scheduler.minimum-allocation-mb = 512
mapreduce.framework.name = yarn
dfs.blocksize = 134217728
mapreduce.input.fileinputformat.split.minsize = 134217728
mapreduce.reduce.java.opts = -Xmx3277m
mapreduce.map.java.opts = -Xmx3277m
mapreduce.input.fileinputformat.split.maxsize = 536870912
yarn.resourcemanager.hostname = hdp2.carrierzone.com
yarn.scheduler.maximum-allocation-mb = 4096
io.compression.codecs = org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec
yarn.scheduler.maximum-allocation-vcores = 5
yarn.nodemanager.aux-services = mapreduce_shuffle
mapreduce.map.memory.mb = 4096
yarn.app.mapreduce.am.command-opts = -Xmx922m
yarn.nodemanager.aux-services.mapreduce_shuffle.class = org.apache.hadoop.mapred.ShuffleHandler
yarn.app.mapreduce.am.resource.mb = 1024
mapreduce.job.ubertask.enable = true
dfs.replication = 2
yarn.app.mapreduce.am.resource.cpu-vcores = 1
yarn.resourcemanager.scheduler.class = org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
fs.defaultFS = hdfs://hdp1:8020/
mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.GzipCodec
mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.GzipCodec
mapreduce.reduce.speculative = false
JVM Version = Java HotSpot(TM) 64-Bit Server VM, 1.7 (Oracle Corporation)
JVM Opts = -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx1638m -Djava.io.tmpdir=/localservices/yarn-local/usercache/datameer/appcache/application_1479209793677_0005/container_1479209793677_0005_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/userlogs/application_1479209793677_0005/container_1479209793677_0005_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog
Best Regards,
Aleksandar Razmovski
-
Hi Aleksandar,
As of Datameer version 5.6, we use new custom properties as an abstraction layer over the various execution frameworks.
If you would like to configure memory-related properties, you can change the default settings, which are currently:
das.job.map-task.memory=2048
das.job.reduce-task.memory=2048
das.job.application-manager.memory=2048
e.g. to:
das.job.map-task.memory=4096
das.job.reduce-task.memory=4096
das.job.application-manager.memory=4096
To determine suitable memory configuration settings, you may also have a look at distribution-specific documentation, e.g. from Cloudera or Hortonworks.
Additionally, remove execution-framework-specific parameters from Datameer's configuration, e.g.:
mapred.map.child.java.opts=-Xmx<value>m
mapred.reduce.child.java.opts=-Xmx<value>m
mapred.job.map.memory.mb=<value>
mapred.job.reduce.memory.mb=<value>
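To illustrate how these numbers relate to the 2 GB cap you observed, here is a minimal sketch. Assumptions: YARN rounds container requests up to the next multiple of yarn.scheduler.increment-allocation-mb and clamps them between the scheduler's minimum/maximum allocation, and the 0.8 heap fraction is a common rule of thumb, not a Datameer setting:

```python
import math

def container_allocation(requested_mb, min_alloc_mb=512, increment_mb=512,
                         max_alloc_mb=4096):
    """Sketch of how a YARN scheduler might round a container request.

    Defaults mirror the yarn.scheduler.* values from the config dump above.
    """
    # Round the request up to the next multiple of the increment,
    # then clamp it between the scheduler's minimum and maximum.
    mb = max(requested_mb, min_alloc_mb)
    mb = math.ceil(mb / increment_mb) * increment_mb
    return min(mb, max_alloc_mb)

def heap_for_container(container_mb, fraction=0.8):
    # Rule of thumb: leave ~20% of the container for non-heap JVM memory.
    return int(container_mb * fraction)
```

With the default das.job.map-task.memory=2048, the container request stays at 2048 MB, which matches the observed 2 GB cap; raising it to 4096 yields 4 GB containers, and heap_for_container(4096) gives 3276 MB, close to the -Xmx3277m already present in the configuration dump above.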