Incremental load issue when using DATE column type - Teradata Timezone Timestamp
Hi, I recently created a Teradata import, configured the split column on a date column, enabled partitioning on that date, and also enabled "incremental load".
Now each day the job runs, the import brings in data that is equal to the "previous max split value" along with the data that is greater than it.
Below is an example of my WHERE clause. The first filter sets the start date for the first run; the split column then enables the incremental load.
WHERE t1.SNAP_DT > '2016-12-31'
and $t1.SNAP_DT$ --Datameer split column
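To illustrate what I assume happens (the generated SQL is not visible to me, so this is just my mental model): on each run Datameer should replace the $t1.SNAP_DT$ variable with a predicate against the stored max split value, so the effective filter on the second day would look roughly like this:
WHERE t1.SNAP_DT > '2016-12-31'
and t1.SNAP_DT > <previous max split value>  -- placeholder; the real predicate Datameer generates is unknown to me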
Am I missing something, or is this just a bug?
-
I forgot to mention that the partitioning is done by month, while the load happens daily.
The number of mappers is set to 4.
I'm not sure what to look for in the logs.
But here is the start of the log from the first load. There was no data in Datameer at this time; in the database, the last date was 2018-09-27.
INFO [2018-09-28 13:42:14.454] [JobScheduler thread-1] (JobScheduler.java:433) - Starting job 1601910 (DAS Version: 6.4.11, Revision: 3bf0fb1d8e2afeffd142deeb97111ce45a0123f7, Hadoop-Distribution: 2.7.3.2.6.3.0-235 (hdp-2.6.3), JVM: 1.8)
INFO [2018-09-28 13:42:14.460] [JobScheduler thread-1] (NormalJobDriver.java:120) - Checking if JobExecutionValueObject{_id=1601910} can be started
INFO [2018-09-28 13:42:14.525] [JobScheduler thread-1] (JobScheduler.java:471) - [Job 1601910] Preparing job in job scheduler thread for DataSourceConfigurationImpl{id=1}...
INFO [2018-09-28 13:42:14.525] [JobScheduler thread-1] (JobScheduler.java:474) - [Job 1601910] Preparing job in job scheduler thread for DataSourceConfigurationImpl{id=1}... done (0 sec)
INFO [2018-09-28 13:42:14.532] [JobScheduler worker1-thread-13908] (JobSchedulerJob.java:89) - [Job 1601910] Preparing job for DataSourceConfigurationImpl{id=1}...
INFO [2018-09-28 13:42:14.542] [JobScheduler worker1-thread-13908] (ImportJobCreator.java:106) - Previous max split value for '/Job1.imp' is null
....
And this is the end of it:
INFO [2018-09-28 13:44:36.043] [JobScheduler worker1-thread-13908] (DapJobCounter.java:175) - Job SUCCESS with 1 cluster jobs and following counters:
INFO [2018-09-28 13:44:36.043] [JobScheduler worker1-thread-13908] (DapJobCounter.java:178) - IMPORT_RECORDS: 47687
INFO [2018-09-28 13:44:36.044] [JobScheduler worker1-thread-13908] (DapJobCounter.java:178) - IMPORT_DROPPED_RECORDS: 0
INFO [2018-09-28 13:44:36.044] [JobScheduler worker1-thread-13908] (DapJobCounter.java:178) - IMPORT_PREVIEW_RECORDS: 10500
INFO [2018-09-28 13:44:36.044] [JobScheduler worker1-thread-13908] (DapJobCounter.java:178) - IMPORT_OUTPUT_BYTES: 1219572
INFO [2018-09-28 13:44:36.044] [JobScheduler worker1-thread-13908] (DapJobCounter.java:178) - IMPORT_DROPPED_SPLITS: 0
INFO [2018-09-28 13:44:36.044] [JobScheduler worker1-thread-13908] (DapJobCounter.java:178) - IMPORT_OUTPUT_PARTITIONS: 21
INFO [2018-09-28 13:44:36.044] [JobScheduler worker1-thread-13908] (DasClassLoader.java:146) - Uninstalling classloader DelegatingClassLoader([Ljava.lang.ClassLoader;@aacb734,1736481196)
INFO [2018-09-28 13:46:58.197] [JobScheduler thread-1] (JobScheduler.java:1018) - Computing after job completion operations for execution 1601910 (type=NORMAL)
INFO [2018-09-28 13:46:58.198] [JobScheduler thread-1] (JobScheduler.java:1022) - Finished computing after job completion operations for execution 1601910 (type=NORMAL) [0 sec]
INFO [2018-09-28 13:46:58.252] [JobScheduler thread-1] (JobScheduler.java:830) - Job DapJobExecution{id=1601910, type=NORMAL, status=COMPLETED} completed with status COMPLETED.
Now the same sections of the log from the second day's job. The database now has 2018-09-28 as the latest date, but the 27th was imported again.
INFO [2018-09-29 08:01:46.963] [JobScheduler thread-1] (JobScheduler.java:433) - Starting job 1603960 (DAS Version: 6.4.11, Revision: 3bf0fb1d8e2afeffd142deeb97111ce45a0123f7, Hadoop-Distribution: 2.7.3.2.6.3.0-235 (hdp-2.6.3), JVM: 1.8)
INFO [2018-09-29 08:01:46.968] [JobScheduler thread-1] (NormalJobDriver.java:120) - Checking if JobExecutionValueObject{_id=1603960} can be started
INFO [2018-09-29 08:01:47.033] [JobScheduler thread-1] (JobScheduler.java:471) - [Job 1603960] Preparing job in job scheduler thread for DataSourceConfigurationImpl{id=2}...
INFO [2018-09-29 08:01:47.033] [JobScheduler thread-1] (JobScheduler.java:474) - [Job 1603960] Preparing job in job scheduler thread for DataSourceConfigurationImpl{id=2}... done (0 sec)
INFO [2018-09-29 08:01:47.041] [JobScheduler worker1-thread-14637] (JobSchedulerJob.java:89) - [Job 1603960] Preparing job for DataSourceConfigurationImpl{id=2}...
INFO [2018-09-29 08:01:47.052] [JobScheduler worker1-thread-14637] (ImportJobCreator.java:106) - Previous max split value for 'Job1.imp' is 1538020800000
....
INFO [2018-09-29 08:02:47.732] [JobScheduler worker1-thread-14637] (DapJobCounter.java:175) - Job SUCCESS with 1 cluster jobs and following counters:
INFO [2018-09-29 08:02:47.732] [JobScheduler worker1-thread-14637] (DapJobCounter.java:178) - IMPORT_RECORDS: 230
INFO [2018-09-29 08:02:47.732] [JobScheduler worker1-thread-14637] (DapJobCounter.java:178) - IMPORT_DROPPED_RECORDS: 0
INFO [2018-09-29 08:02:47.732] [JobScheduler worker1-thread-14637] (DapJobCounter.java:178) - IMPORT_PREVIEW_RECORDS: 230
INFO [2018-09-29 08:02:47.732] [JobScheduler worker1-thread-14637] (DapJobCounter.java:178) - IMPORT_OUTPUT_BYTES: 10103
INFO [2018-09-29 08:02:47.732] [JobScheduler worker1-thread-14637] (DapJobCounter.java:178) - IMPORT_DROPPED_SPLITS: 0
INFO [2018-09-29 08:02:47.732] [JobScheduler worker1-thread-14637] (DapJobCounter.java:178) - IMPORT_OUTPUT_PARTITIONS: 1
INFO [2018-09-29 08:02:47.732] [JobScheduler worker1-thread-14637] (DasClassLoader.java:146) - Uninstalling classloader DelegatingClassLoader([Ljava.lang.ClassLoader;@453fc40c,838038466)
INFO [2018-09-29 08:02:50.645] [JobScheduler thread-1] (JobScheduler.java:1018) - Computing after job completion operations for execution 1603960 (type=NORMAL)
INFO [2018-09-29 08:02:50.646] [JobScheduler thread-1] (JobScheduler.java:1022) - Finished computing after job completion operations for execution 1603960 (type=NORMAL) [0 sec]
INFO [2018-09-29 08:02:50.666] [JobScheduler thread-1] (JobScheduler.java:830) - Job DapJobExecution{id=1603960, type=NORMAL, status=COMPLETED} completed with status COMPLETED.
-
Thanks for sharing the logs. In the second execution, this section stands out to me:
Previous max split value for 'Job1.imp' is 1538020800000
This indicates that the initial import job brought in all data up to the UNIX timestamp 1538020800000 ms, which equates to September 27, 2018 04:00:00 AM. So it does make sense that some data from September 27 appears in the second run: based on the split column and the first run's max value, everything after 04:00 should be imported in the second execution.
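If you want to double-check that conversion yourself, here is an illustrative Teradata-style calculation on the epoch value (1538020800000 ms is 1538020800 seconds):
SELECT DATE '1970-01-01' + 1538020800 / 86400 AS split_date,   -- integer division: 17801 days -> 2018-09-27
       (1538020800 MOD 86400) / 3600 AS split_hour_utc;        -- 14400 s -> 4, i.e. 04:00 UTC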
Does this match your observations with the data brought in: data before 04:00 was imported in the first run and data after 04:00 was imported in the subsequent run?
-
Thank you for the help, Joel.
But I don't understand how the max value for the split column is set to 04:00 AM; if it carries any time at all, it should be 12:00 AM.
The column SNAP_DT is of type DATE. When I browse the data within Datameer ("browse data" on the import job details), that column shows as datetime with the time for all rows set to 12:00:00 AM.
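(For anyone reading along, the column type can be confirmed on the Teradata side with something like the following; the database and table names here are placeholders:)
HELP COLUMN mydb.mytable.SNAP_DT;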
On the second run all the records from the 27th were re-imported.
Is there a way to see the actual queries sent to the database?
-
It's possible that the 04:00 AM is just an offset for your timezone. Are you in Eastern Daylight Time (GMT-4) by chance? If so, the offset to 04:00 AM is explained by that, and it makes sense that the second execution includes the records from Sept. 27; however, it is then unexpected that these records were also included in the initial execution, since the max split value is only midnight in Eastern Daylight Time.
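To spell out the arithmetic behind that guess: midnight Eastern Daylight Time is four hours behind UTC, and 2018-09-27 is 17,801 days after 1970-01-01, so
2018-09-27 00:00:00 EDT = 2018-09-27 04:00:00 UTC
(17801 * 86400 + 4 * 3600) * 1000 = 1538020800000 ms
which is exactly the value from your log.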
Related to the timezone as well: what is the following server property set to by your Datameer administrator in the configuration files?
# Define which time zone to use for displaying dates in the UI and for parsing date strings that do not contain
# a defined time zone. Use "default" if you want to use the time zone of the local system.
system.property.das.default-timezone=default
The actual queries are not logged directly into the job log, so we cannot see them there.
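For example, if the server runs in Eastern time, an administrator could pin the timezone explicitly (a hypothetical value on my part; presumably any Java timezone ID is accepted here):
system.property.das.default-timezone=America/New_York
On the Teradata side, if query logging (DBQL) is enabled for the account Datameer connects with, the database itself records the incoming SQL, so something along these lines should surface the actual queries ('DATAMEER_USER' is a placeholder):
SELECT StartTime, QueryText
FROM DBC.QryLogV
WHERE UserName = 'DATAMEER_USER'
ORDER BY StartTime DESC;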
-
Hi Joel,
So based on your help and comments, here is what I did, and it seems to have worked:
SELECT CAST(t1.SNAP_DT AS TIMESTAMP) AS SNAP_DT,
....
FROM table t1
WHERE 1=1
AND $t1.SNAP_DT$
Now Datameer stores the epoch time for the "max split value" as 1538107200000, which is 12:00 AM. I think that even though the column displayed as 12:00 AM, because the datatype in the result set was DATE, the function that calculates the max split value was applying the Datameer timezone during the conversion, which to me seems like a bug.
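As a sanity check, the same kind of arithmetic as above decodes the new value (illustrative Teradata-style SQL):
SELECT DATE '1970-01-01' + 1538107200 / 86400 AS split_date,   -- 17802 days -> 2018-09-28
       (1538107200 MOD 86400) / 3600 AS split_hour_utc;        -- 4 -> 04:00 UTC = 00:00 EDT
So the stored max split value is now exactly midnight local time on the latest snapshot date.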
But without being able to see the actual query sent to the DB, and without any access to the Datameer admin configs (my case), I just couldn't pinpoint the exact problem.
There are a few layers of configuration involved in the timezone handling: the Teradata session parameters vs. the system defaults, the JDBC driver that Datameer uses to connect, and Datameer's own configuration.
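(For the first layer, Teradata can report what the session is actually using; if I remember correctly, the output of the following includes the session timezone:)
HELP SESSION;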
I'll try to find the suggestions page and add a request to "log the actual query sent to the DB in the job log".