Retain data in workbook

Comments

4 comments

  • Konsta Danyliuk

    Hello Qin.
    How are you?

    Would you be able to provide more details on this use case, please?

    • How does the source Workbook you are referencing receive its data: directly via an ImportJob or a DataLink, or from other external Sheet(s)?
    • How big is the dataset in question?
    • What Datameer version are you currently running?
  • qin ouyang

    Hi Konsta, 

    Thank you for replying to my question, I appreciate it!

    1. The source workbook uses a File Upload job to manually upload a daily file. We are going to use a DataLink to load the daily file automatically, but unlike an import job, a DataLink has no 'Append' option.

    2. Both the source workbook and the workbook I'm working on are around 100M, and the daily input file is less than 1M.

    3.  Datameer Version 7.4.6.

    Thanks again!

  • Konsta Danyliuk

    Hello Qin.
    An ImportJob or a FileUpload copies the target dataset into the Datameer private folder, which is located in HDFS or, if the application is not connected to a cluster, on the Datameer application server.

    A FileUpload was never intended for enterprise usage; it serves for a quick demo or a one-time ingestion, so its functionality is quite limited. An ImportJob is a more robust way to get data into Datameer and gives you flexibility in how exactly the data is ingested. If the data grows daily, an ImportJob can import only the files/records added since its last execution (append mode).

    A DataLink, in turn, is just a pointer to the data. No data is copied into Datameer when the DataLink is executed (except a short preview). The data is accessed only when one executes a Workbook that uses this DataLink. The append concept is therefore not applicable to DataLinks.

    A Workbook is executed against all the data its sources hold; it is not possible to run the Workbook on a portion of the data and keep the historical dataset at the same time.
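The difference between copying data in append mode and merely pointing at it can be illustrated with a toy model. This is only a sketch in plain Python: the class names `AppendImport` and `Link` and the dictionary standing in for the source location are hypothetical and are not part of any Datameer API.

```python
from dataclasses import dataclass, field


@dataclass
class AppendImport:
    """Toy model of an import job in append mode: it keeps its own copy."""
    stored: list = field(default_factory=list)
    seen_files: set = field(default_factory=set)

    def run(self, source: dict) -> None:
        # Copy only the files that were not imported by a previous run.
        for name in sorted(source):
            if name not in self.seen_files:
                self.stored.extend(source[name])
                self.seen_files.add(name)

    def read(self) -> list:
        return list(self.stored)


@dataclass
class Link:
    """Toy model of a data link: a pointer to the source, nothing is copied."""
    source: dict

    def read(self) -> list:
        # Rows are read from the source only at the moment the consumer runs.
        return [row for name in sorted(self.source) for row in self.source[name]]


# Hypothetical source location where each new daily file replaces the old one.
source = {"day1.csv": ["a", "b"]}
job, link = AppendImport(), Link(source)
job.run(source)

source.clear()
source["day2.csv"] = ["c"]
job.run(source)

print(job.read())   # ['a', 'b', 'c'] - history is retained in the copy
print(link.read())  # ['c'] - only what is at the source right now
```

In this model the copy held by the import keeps growing, while the link can never return more than the source currently contains.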

    ___________________________________________________________________

    Considering all the above, I could offer the following options for your use-case.

    • If the source data files are stored permanently at the target location, you could set up a DataLink with a FileFilter (by File Modification Day). This points the Workbook to the recently added data. When you need to execute the Workbook against the historical data, you could change the FileFilter in the DataLink -> rerun the DataLink -> rerun the Workbook. Alternatively, you could create two DataLink - Workbook chains: one for daily data (the DataLink points to the last day's files) and another for historical data (the DataLink points to all files).

    • If the source data files are not kept at the target location, e.g. each new file replaces the previous one, you would need to maintain the historical data on the Datameer side. For that you need an ImportJob in append mode that runs every time a new file is added and accumulates the data in the Datameer private location. With this configuration, a Workbook built on such an ImportJob will always be executed against the complete dataset.
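To make the first option more concrete, here is a rough sketch of what filtering by file modification day amounts to. It is plain Python against a local folder, not Datameer's FileFilter itself; the function name `files_modified_since` and its parameters are made up for illustration.

```python
import os
import time
from pathlib import Path


def files_modified_since(folder: Path, max_age_days: float) -> list[Path]:
    """Return the files in `folder` modified within the last `max_age_days` days."""
    cutoff = time.time() - max_age_days * 86400  # 86400 seconds per day
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.stat().st_mtime >= cutoff
    )
```

Calling it with `max_age_days=1` would correspond to the "daily" chain, and a very large value to the "historical" chain that sees every file still present in the folder.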

    I hope this information helps with your project. Please let me know in case of any further questions.

    Here are also sections of Datameer documentation you might want to review

     

  • qin ouyang

    Hi Konsta, 

    Thank you for your detailed explanation, really appreciate it.

    I guess I will have to use an import job to solve the issue; it seems that's the only way I can retain the data.

    Thanks again.

    Qin

