I have a bunch of clickstream data on HDFS in the form of dated .tsv.gz files. I am not interested in some of the rows and columns, so I would like to know how Datameer handles dropping them.
Do I gain anything time-wise by:
Including fewer of the possible columns in a data link?
Setting some fields to not accept empty values in order to remove rows that are not of interest?
Is the data processed file by file, or are all of the columns/rows from all the files loaded first and then dropped afterwards?
And with the above in mind:
When dealing with such a large dataset, would it be more efficient to create all the aggregated data I need from a single data link that gets split up with filters, or would it be better to create smaller separate data links for subsets of the data? My thinking is that while each small separate link might be faster individually, pulling the data in each time is probably the biggest time cost, so I might be better off consolidating.
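To make the two options concrete, here is a rough Python sketch of what I mean. It is not Datameer code and says nothing about how Datameer works internally. The column positions, the "GB" filter and the function names are all made up for illustration.

```python
import csv
import gzip
from collections import Counter

# Hypothetical column positions; the real files have many more columns.
KEEP = {"page": 1, "country": 2}


def stream_rows(paths):
    """Read one .tsv.gz at a time, keeping only the wanted columns and
    skipping rows where a wanted field is empty."""
    for path in paths:
        with gzip.open(path, mode="rt", newline="") as fh:
            for rec in csv.reader(fh, delimiter="\t"):
                if len(rec) <= max(KEEP.values()):
                    continue  # short or malformed row
                row = {name: rec[i] for name, i in KEEP.items()}
                if all(row.values()):
                    yield row


def consolidated(paths):
    """Option A: one read of the data, split into aggregates by filters."""
    all_pages, gb_pages = Counter(), Counter()
    for row in stream_rows(paths):
        all_pages[row["page"]] += 1
        if row["country"] == "GB":  # hypothetical filter
            gb_pages[row["page"]] += 1
    return all_pages, gb_pages


def separate(paths):
    """Option B: one read of the data per aggregate (two reads here)."""
    all_pages = Counter(r["page"] for r in stream_rows(paths))
    gb_pages = Counter(
        r["page"] for r in stream_rows(paths) if r["country"] == "GB"
    )
    return all_pages, gb_pages
```

Both functions give the same results. The difference is that option B decompresses and parses every file once per aggregate, which is the cost I suspect dominates.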
Any advice appreciated, cheers