Smart Execution, In-Memory Jobs, Workbook Size

Comments

7 comments

  • Konsta Danyliuk

    Hello Aleksandar.

    1. Without information about your exact cluster settings, it is hard to say what might be causing the slowness in your case. It could be related to the execution engine used or to memory settings. We could investigate further if you provide more details about the cluster and the Datameer configuration (version, memory allocation, etc.).

    2. To use Smart Execution, you need an appropriate license that enables this feature.

      Meanwhile, you could try different execution frameworks (e.g. Tez) by setting a custom property such as das.execution-framework=Tez, either globally on the Hadoop Cluster page or individually for a job.

    3. The LZO codec isn't used as the default compression in Datameer.

      To make a fair comparison of a workbook's size in local and cluster mode, I would suggest creating a baseline test workbook in the different modes and checking its size. You could use the instructions from the article How to Generate Normal Distributed Random Values to build the workbook for a clean test.

  • Aleksandar Razmovski

    Dear,

    About 1)

    I've noticed one weird thing:

    [system]  INFO [2016-11-11 09:56:38.877] [ConcurrentJobExecutor-0] (JdbcSplitter.java:71) - SplitHint{numMapTasks=149, minSplitSize=0, maxSplitSize=9223372036854775807, minSplitCount=0, maxSplitCount=4}

    While the settings in mapred-site are:

    <property>
    <name>mapreduce.input.fileinputformat.split.maxsize</name>
    <value>536870912</value>
    </property>
    <property>
    <name>mapreduce.input.fileinputformat.split.minsize</name>
    <value>134217728</value>
    </property>
    <property>
    <name>mapreduce.job.max.split.locations</name>
    <value>5</value>
    </property>

    Where does Datameer get the above numbers (in the Split Hint)?

    Thanks in advance.

    Best Regards,

    Aleksandar Razmovski

  • Gido

    Datameer will try to find the optimal split size and split count based on the values for the max/min split size and the max/min split count.
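    Datameer's actual JdbcSplitter logic isn't shown here, but the interplay between the size bounds and the count bounds can be sketched as follows. This is a hypothetical illustration (the function name and exact clamping order are assumptions, not Datameer's implementation); it shows why a SplitHint with maxSplitCount=4 caps the job at 4 map tasks regardless of the size-based estimate.

    ```python
    def effective_split_count(total_size, min_split_size, max_split_size,
                              min_split_count, max_split_count):
        """Illustrative only: derive a split count honoring size and count bounds.

        A value of 0 for min_split_size or min_split_count means "no lower
        bound"; max_split_count caps the number of splits regardless of what
        the size-based estimate suggests.
        """
        # Size-based estimate: one split per max_split_size bytes.
        if max_split_size > 0 and total_size > max_split_size:
            count = -(-total_size // max_split_size)  # ceiling division
        else:
            count = 1
        # Clamp into the [min_split_count, max_split_count] window.
        if min_split_count > 0:
            count = max(count, min_split_count)
        if max_split_count > 0:
            count = min(count, max_split_count)
        return count

    # A 10 GiB input with a 512 MiB max split size suggests 20 splits,
    # but a maxSplitCount of 4 (as in the logged SplitHint) wins:
    print(effective_split_count(10 * 1024**3, 134217728, 536870912, 0, 4))  # → 4
    ```

    Under this reading, the numbers in the SplitHint come from Datameer's own split-planning properties rather than directly from mapred-site.xml, which is why they can differ from the Hadoop-level split settings.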

  • Aleksandar Razmovski

    Dear,

    Also, the RAM allocated to containers never surpasses 2 GB, although the settings indicate otherwise:

        dfs.datanode.data.dir = /localservices/hdfs_data
        yarn.scheduler.capacity.maximum-am-resource-percent = 20
        yarn.nodemanager.vmem-check-enabled = false
        dfs.permissions.superusergroup = hdfs
        mapreduce.map.cpu.vcores = 5
        mapreduce.map.speculative = false
        mapreduce.output.fileoutputformat.compress = true
        mapreduce.task.io.sort.mb = 1024
        mapreduce.reduce.cpu.vcores = 5
        yarn.scheduler.minimum-allocation-vcores = 1
        mapreduce.reduce.memory.mb = 4096
        dfs.namenode.checkpoint.dir = /localservices/secondary1,/localservices/secondary2
        yarn.nodemanager.local-dirs = /localservices/yarn-local
        mapreduce.job.max.split.locations = 5
        yarn.resourcemanager.address = ${yarn.resourcemanager.hostname}:8032
        yarn.scheduler.increment-allocation-mb = 512
        mapreduce.map.output.compress = true
        hadoop.tmp.dir = /localservices/tmp
        yarn.nodemanager.vmem-pmem-ratio = 2.1
        yarn.application.classpath =  $HADOOP_CONF_DIR,/usr/lib/hadoop-yarn/*,/usr/lib/hadoop-mapreduce/*,/usr/lib/hadoop-hdfs/*,/usr/lib/*,/usr/lib/hadoop-hdfs/lib/*,/usr/lib/hadoop-mapreduce/lib/*,/usr/lib/hadoop-yarn/lib/*,/usr/lib/hadoop/*,/usr/lib/hadoop/lib/*
        dfs.namenode.name.dir = /localservices/hadoopname1,/localservices/hadoopname2
        fs.tachyon.impl = tachyon.hadoop.TFS
        yarn.nodemanager.resource.memory-mb = 12288
        yarn.nodemanager.resource.cpu-vcores = 7
        mapred.child.java.opts = -Xmx3277m
        yarn.scheduler.minimum-allocation-mb = 512
        mapreduce.framework.name = yarn
        dfs.blocksize = 134217728
        mapreduce.input.fileinputformat.split.minsize = 134217728
        mapreduce.reduce.java.opts = -Xmx3277m
        mapreduce.map.java.opts = -Xmx3277m
        mapreduce.input.fileinputformat.split.maxsize = 536870912
        yarn.resourcemanager.hostname = hdp2.carrierzone.com
        yarn.scheduler.maximum-allocation-mb = 4096
        io.compression.codecs = org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec
        yarn.scheduler.maximum-allocation-vcores = 5
        yarn.nodemanager.aux-services = mapreduce_shuffle
        mapreduce.map.memory.mb = 4096
        yarn.app.mapreduce.am.command-opts = -Xmx922m
        yarn.nodemanager.aux-services.mapreduce_shuffle.class = org.apache.hadoop.mapred.ShuffleHandler
        yarn.app.mapreduce.am.resource.mb = 1024
        mapreduce.job.ubertask.enable = true
        dfs.replication = 2
        yarn.app.mapreduce.am.resource.cpu-vcores = 1
        yarn.resourcemanager.scheduler.class = org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
        fs.defaultFS = hdfs://hdp1:8020/
        mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.GzipCodec
        mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.GzipCodec
        mapreduce.reduce.speculative = false
        JVM Version = Java HotSpot(TM) 64-Bit Server VM, 1.7 (Oracle Corporation)
        JVM Opts = -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx1638m -Djava.io.tmpdir=/localservices/yarn-local/usercache/datameer/appcache/application_1479209793677_0005/container_1479209793677_0005_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/userlogs/application_1479209793677_0005/container_1479209793677_0005_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog

     

    Best Regards,

    Aleksandar Razmovski

  • Gido

    Hi Aleksandar,

    As of Datameer version 5.6, we use new custom properties as an abstraction layer over the various execution frameworks.

    If you would like to configure memory-related properties, you can change the default settings, which are currently:

    das.job.map-task.memory=2048 
    das.job.reduce-task.memory=2048
    das.job.application-manager.memory=2048

     

    e.g. to 

    das.job.map-task.memory=4096
    das.job.reduce-task.memory=4096
    das.job.application-manager.memory=4096

    To determine suitable memory configuration settings, you may also want to consult distribution-specific documentation, e.g. from Cloudera or Hortonworks.

    Additionally, remove execution-framework-specific parameters from Datameer's configuration, e.g.:

    mapred.map.child.java.opts=-Xmx<value>m 
    mapred.reduce.child.java.opts=-Xmx<value>m
    mapred.job.map.memory.mb=<value>
    mapred.job.reduce.memory.mb=<value>

     

  • Aleksandar Razmovski

    Dear,

    That explains a lot, thanks.

    By the way, can I set these custom properties at the job/task level (Workbook, ImportJob, ...)?

     

    Best Regards,

    Aleksandar Razmovski

  • Konsta Danyliuk

    Hello Aleksandar.

    You can set memory settings at the job level: just add the required parameters to the artifact's Custom Properties field (usually located under the Advanced section of the artifact's configuration).

