Incremental load issue when using DATE column type - Teradata Timezone Timestamp

Comments

6 comments

  • Joel Stewart

    It would be helpful to see the job logs from consecutive runs so we can see how the split column value is calculated at run time from one run to the next. Are you able to share excerpts from these as well?

  • Tiago Rubio

    I forgot to mention that the partitioning is done by month, while the load happens daily.

    The number of mappers is set to 4.

    I'm not sure what to look for in the logs.

     

    But this is the start of the log of the first load. There was no data in Datameer at this time. In the database the last date was 2018-09-27.

    INFO [2018-09-28 13:42:14.454] [JobScheduler thread-1] (JobScheduler.java:433) - Starting job 1601910 (DAS Version: 6.4.11, Revision: 3bf0fb1d8e2afeffd142deeb97111ce45a0123f7, Hadoop-Distribution: 2.7.3.2.6.3.0-235 (hdp-2.6.3), JVM: 1.8)
    INFO [2018-09-28 13:42:14.460] [JobScheduler thread-1] (NormalJobDriver.java:120) - Checking if JobExecutionValueObject{_id=1601910} can be started
    INFO [2018-09-28 13:42:14.525] [JobScheduler thread-1] (JobScheduler.java:471) - [Job 1601910] Preparing job in job scheduler thread for DataSourceConfigurationImpl{id=1}...
    INFO [2018-09-28 13:42:14.525] [JobScheduler thread-1] (JobScheduler.java:474) - [Job 1601910] Preparing job in job scheduler thread for DataSourceConfigurationImpl{id=1}... done (0 sec)
    INFO [2018-09-28 13:42:14.532] [JobScheduler worker1-thread-13908] (JobSchedulerJob.java:89) - [Job 1601910] Preparing job for DataSourceConfigurationImpl{id=1}...
    INFO [2018-09-28 13:42:14.542] [JobScheduler worker1-thread-13908] (ImportJobCreator.java:106) - Previous max split value for '/Job1.imp' is null

    ..... and this is the end of it.

    INFO [2018-09-28 13:44:36.043] [JobScheduler worker1-thread-13908] (DapJobCounter.java:175) - Job SUCCESS with 1 cluster jobs and following counters:
    INFO [2018-09-28 13:44:36.043] [JobScheduler worker1-thread-13908] (DapJobCounter.java:178) - IMPORT_RECORDS: 47687
    INFO [2018-09-28 13:44:36.044] [JobScheduler worker1-thread-13908] (DapJobCounter.java:178) - IMPORT_DROPPED_RECORDS: 0
    INFO [2018-09-28 13:44:36.044] [JobScheduler worker1-thread-13908] (DapJobCounter.java:178) - IMPORT_PREVIEW_RECORDS: 10500
    INFO [2018-09-28 13:44:36.044] [JobScheduler worker1-thread-13908] (DapJobCounter.java:178) - IMPORT_OUTPUT_BYTES: 1219572
    INFO [2018-09-28 13:44:36.044] [JobScheduler worker1-thread-13908] (DapJobCounter.java:178) - IMPORT_DROPPED_SPLITS: 0
    INFO [2018-09-28 13:44:36.044] [JobScheduler worker1-thread-13908] (DapJobCounter.java:178) - IMPORT_OUTPUT_PARTITIONS: 21
    INFO [2018-09-28 13:44:36.044] [JobScheduler worker1-thread-13908] (DasClassLoader.java:146) - Uninstalling classloader DelegatingClassLoader([Ljava.lang.ClassLoader;@aacb734,1736481196)
    INFO [2018-09-28 13:46:58.197] [JobScheduler thread-1] (JobScheduler.java:1018) - Computing after job completion operations for execution 1601910 (type=NORMAL)
    INFO [2018-09-28 13:46:58.198] [JobScheduler thread-1] (JobScheduler.java:1022) - Finished computing after job completion operations for execution 1601910 (type=NORMAL) [0 sec]
    INFO [2018-09-28 13:46:58.252] [JobScheduler thread-1] (JobScheduler.java:830) - Job DapJobExecution{id=1601910, type=NORMAL, status=COMPLETED} completed with status COMPLETED.

     

    ----------------- now the same sections of the job on the second day. Now the database has 2018-09-28 as the latest date, but the 27th was imported again.

     INFO [2018-09-29 08:01:46.963] [JobScheduler thread-1] (JobScheduler.java:433) - Starting job 1603960 (DAS Version: 6.4.11, Revision: 3bf0fb1d8e2afeffd142deeb97111ce45a0123f7, Hadoop-Distribution: 2.7.3.2.6.3.0-235 (hdp-2.6.3), JVM: 1.8)
     INFO [2018-09-29 08:01:46.968] [JobScheduler thread-1] (NormalJobDriver.java:120) - Checking if JobExecutionValueObject{_id=1603960} can be started
     INFO [2018-09-29 08:01:47.033] [JobScheduler thread-1] (JobScheduler.java:471) - [Job 1603960] Preparing job in job scheduler thread for DataSourceConfigurationImpl{id=2}... 
     INFO [2018-09-29 08:01:47.033] [JobScheduler thread-1] (JobScheduler.java:474) - [Job 1603960] Preparing job in job scheduler thread for DataSourceConfigurationImpl{id=2}... done (0 sec)
     INFO [2018-09-29 08:01:47.041] [JobScheduler worker1-thread-14637] (JobSchedulerJob.java:89) - [Job 1603960] Preparing job for DataSourceConfigurationImpl{id=2}...
     INFO [2018-09-29 08:01:47.052] [JobScheduler worker1-thread-14637] (ImportJobCreator.java:106) - Previous max split value for 'Job1.imp' is 1538020800000
    ....
    INFO [2018-09-29 08:02:47.732] [JobScheduler worker1-thread-14637] (DapJobCounter.java:175) - Job SUCCESS with 1 cluster jobs and following counters:
    INFO [2018-09-29 08:02:47.732] [JobScheduler worker1-thread-14637] (DapJobCounter.java:178) - IMPORT_RECORDS: 230
    INFO [2018-09-29 08:02:47.732] [JobScheduler worker1-thread-14637] (DapJobCounter.java:178) - IMPORT_DROPPED_RECORDS: 0
    INFO [2018-09-29 08:02:47.732] [JobScheduler worker1-thread-14637] (DapJobCounter.java:178) - IMPORT_PREVIEW_RECORDS: 230
    INFO [2018-09-29 08:02:47.732] [JobScheduler worker1-thread-14637] (DapJobCounter.java:178) - IMPORT_OUTPUT_BYTES: 10103
    INFO [2018-09-29 08:02:47.732] [JobScheduler worker1-thread-14637] (DapJobCounter.java:178) - IMPORT_DROPPED_SPLITS: 0
    INFO [2018-09-29 08:02:47.732] [JobScheduler worker1-thread-14637] (DapJobCounter.java:178) - IMPORT_OUTPUT_PARTITIONS: 1
    INFO [2018-09-29 08:02:47.732] [JobScheduler worker1-thread-14637] (DasClassLoader.java:146) - Uninstalling classloader DelegatingClassLoader([Ljava.lang.ClassLoader;@453fc40c,838038466)
    INFO [2018-09-29 08:02:50.645] [JobScheduler thread-1] (JobScheduler.java:1018) - Computing after job completion operations for execution 1603960 (type=NORMAL)
    INFO [2018-09-29 08:02:50.646] [JobScheduler thread-1] (JobScheduler.java:1022) - Finished computing after job completion operations for execution 1603960 (type=NORMAL) [0 sec]
    INFO [2018-09-29 08:02:50.666] [JobScheduler thread-1] (JobScheduler.java:830) - Job DapJobExecution{id=1603960, type=NORMAL, status=COMPLETED} completed with status COMPLETED.
  • Joel Stewart

    Thanks for sharing the logs. In the second execution, this section stands out to me: 

    Previous max split value for 'Job1.imp' is 1538020800000

    This indicates that the initial import job brought in all data up to the UNIX timestamp 1538020800000 [ms]. This equates to "September 27, 2018 4:00:00 AM" (UTC), so it does make sense that some data from September 27 is imported on the second run: everything after 04:00 should be brought in by the second execution based on the split column and the historical run.
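    For reference, this conversion can be double-checked with a couple of lines of java.time (just a sanity check, not Datameer code):

    import java.time.Instant;
    import java.time.ZoneOffset;

    public class SplitValueCheck {
        public static void main(String[] args) {
            // Previous max split value from the job log, in epoch milliseconds
            long maxSplit = 1538020800000L;

            // Rendered in UTC this prints 2018-09-27T04:00Z
            System.out.println(Instant.ofEpochMilli(maxSplit).atZone(ZoneOffset.UTC));
        }
    }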

     

    Does this match your observations with the data brought in: data before 04:00 was imported in the first run and data after 04:00 was imported in the subsequent run? 

  • Tiago Rubio

    Thank you for the help, Joel.

    But I don't understand how the max value for the split column is set to 4 AM; if it has any time at all, it should be 12:00 AM.

    The column SNAP_DT is of type DATE. When I view the data within Datameer's "browse data" (on the import job details), that column shows as a datetime with the time for all rows set to 12:00:00 AM.

     

    On the second run all the records from the 27th were re-imported.

    Is there a way to see the actual queries sent to the database?

  • Joel Stewart

    It's possible that the 04:00 AM is just an offset for your timezone. Are you in Eastern Daylight Time (GMT-4) by chance? If so, the 04:00 AM offset is explained by that, and it makes sense that the records from Sept. 27 are included in the second execution. It is unexpected, however, that those records were also included in the initial execution, since the max split is only midnight in Eastern Daylight Time.

    Related to the timezone as well, what has your Datameer Administrator set the following server property to in the configuration files:

    # Define which time zone to use for displaying dates in the UI and for parsing date strings that do not contain 
    # a defined time zone. Use "default" if you want to use the time zone of the local system.
    system.property.das.default-timezone=default
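    If your server does run in Eastern time, midnight on the DATE values lines up exactly with the split value from your log. Here is a quick check (assuming the America/New_York zone):

    import java.time.LocalDate;
    import java.time.ZoneId;

    public class OffsetCheck {
        public static void main(String[] args) {
            // Midnight of 2018-09-27 in Eastern Daylight Time (GMT-4)
            long epochMillis = LocalDate.of(2018, 9, 27)
                    .atStartOfDay(ZoneId.of("America/New_York"))
                    .toInstant()
                    .toEpochMilli();

            // Prints 1538020800000, the split value from the job log
            System.out.println(epochMillis);
        }
    }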

    The actual queries are not logged in the job log, so we cannot see them directly.

     

  • Tiago Rubio

    Hi Joel,

     

    So based on your help and comments, here is what I did, and it seems to have worked.

     

    select cast(t1.SNAP_DT as timestamp) as SNAP_DT,
    ....
    from table t1
    where 1=1
    and $t1.SNAP_DT$

     

    Now Datameer stores the epoch time for the "max split value" as 1538107200000, which is 12:00 AM in my timezone. I think that even though the column was displayed as 12 AM, because the datatype in the result set was DATE, the function that calculates the "max split value" was applying the Datameer time zone during the conversion. To me that seems like a bug.
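    Here is a minimal sketch of what I suspect is happening (the class is mine for illustration, and I'm assuming a server in Eastern time): the same DATE value produces different epoch values depending on which timezone is applied during the conversion.

    import java.time.LocalDate;
    import java.time.ZoneId;
    import java.time.ZoneOffset;

    public class DateToEpoch {
        public static void main(String[] args) {
            // A DATE value has no time part, so a timezone must be assumed
            LocalDate snapDt = LocalDate.of(2018, 9, 28);

            // Interpreted as midnight UTC: prints 1538092800000
            System.out.println(snapDt.atStartOfDay(ZoneOffset.UTC)
                    .toInstant().toEpochMilli());

            // Interpreted as midnight Eastern: prints 1538107200000,
            // the max split value Datameer stored after the cast
            System.out.println(snapDt.atStartOfDay(ZoneId.of("America/New_York"))
                    .toInstant().toEpochMilli());
        }
    }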

    But without being able to see the actual query sent to the DB, and without any access to the Datameer admin configs (my case), I just couldn't pinpoint the exact problem.

    There are a few layers of configuration regarding timezones: the Teradata session parameters vs. the default parameters, the JDBC driver that Datameer uses to connect, and Datameer's own configuration.

    I'll try to find the suggestions page and add a request to "log the actual query sent to the DB in the job log".

