Importing and filtering by dynamic dates

Comments (6)

  • Joel Stewart

    Hi Amin, there's a parameter called $latestpartition that you can use. There are more details about this in our documentation here: Partitioning Data in Datameer

    For example, you may use the following statement in the advanced partition filter: 

    $partition == $latestpartition

    I hope this helps! 
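For illustration, here is a rough sketch (in Python, outside Datameer) of what that filter effectively does: out of all partition values, keep only the most recent one. The partition names and the date format are assumptions for the example, not something Datameer prescribes.

```python
from datetime import datetime

def latest_partition(partitions):
    """Return the partition value with the most recent date.

    Assumes partitions are named with an ISO-style date string;
    adjust the format to match your own partitioning scheme.
    """
    return max(partitions, key=lambda p: datetime.strptime(p, "%Y-%m-%d"))

partitions = ["2016-01-01", "2016-02-01", "2016-03-01"]

# $partition == $latestpartition would keep only this partition:
print(latest_partition(partitions))  # -> 2016-03-01
```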

  • Amin Torabi

    Thanks for your note, Joel.

I would like to filter when creating the import job (see attached); I am looking for parameters to include in the "Start Expression" and "End Expression" fields below that pick up the latest partition in the Hive table I am trying to import.

I guess what you are suggesting works only when I include everything in my import job, and then filter for specific partitions when adding data to workbooks. Am I right?

  • Joel Stewart

    Amin, that's correct. I thought you were seeking this parameter within a Workbook to select a partition. This variable is unavailable in an Import Job. 

Given that this data resides in Hive, however, it seems unusual to me that you're working with an Import Job. I'd generally recommend a Data Link for accessing data from Hive. If you do proceed with a Data Link, then after Datameer's initial sampling job you can use the recommended variable within the downstream workbook(s) to select only the latest partition.

  • Amin Torabi

Thanks again, Joel. I was able to bring only the latest partition of my Hive table into Datameer using a Data Link.

I always thought an Import Job would serve me better, since I'd take the time to bring the data in once and then all my subsequent jobs would run faster because the data had already been imported; but based on our chat, I guess I need to change my mindset.

Just wondering, when would you recommend using an Import Job over a Data Link?

  • Joel Stewart

An Import Job is a great way to bring data into Hadoop that currently resides in another system (for example, a MySQL database). This stores the data on a distributed file system and allows workbooks to access it more efficiently later.

Data Links are recommended for accessing data that already resides on a distributed file system, since it can most likely be split and processed in parallel already. Hive data is already stored this way, so a Data Link is a great option rather than creating a duplicate copy of the same data.

  • Amin Torabi

    That makes sense. Thanks Joel :)
