Deliberately dropping columns/rows & 1 big data link vs. many small
I have a bunch of clickstream data on HDFS in the form of dated .tsv.gz files. I am not interested in some of the rows/columns, so I was interested in how Datameer handles dropping these.
Do I gain anything time-wise by:
Including fewer of the possible columns in a data link?
Setting some fields to not accept empty values in order to remove rows that are not of interest?
Is it processed file by file, or do all of the columns/rows from all the files get loaded and then dropped afterwards?
& with the above in mind:
When dealing with such a large dataset, is it likely to be more efficient to create all the aggregated data I need from a single data link that gets split up with filters, or to create smaller separate data links for subsets of the data? I'm thinking that while the small separate jobs might be faster individually, pulling the data in each time is probably a big time consumer, so I might be better off consolidating.
Any advice appreciated, cheers
-
With data stored in *.tsv.gz files, all columns will be read when the files are initially accessed. This is forced by the storage format, which doesn't allow direct columnar access to the data. Reducing the rows included after the initial read will help reduce processing time.
If possible, I'd recommend evaluating whether the raw data could be converted to a format that supports columnar access, such as Parquet. This would have the biggest impact, since it allows selecting only the required columns directly from the source data.
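For what it's worth, here is a rough sketch of what such a conversion might look like outside Datameer, using PySpark; the paths, column names, and the assumption of a header row are just placeholders for your own data.

# Minimal sketch: convert gzipped TSV clickstream files to Parquet,
# keeping only the columns and rows of interest. Paths and column
# names ("user_id", "url", "ts") are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tsv-gz-to-parquet").getOrCreate()

# Spark reads .gz files transparently, but each gzip file is decompressed
# as a whole, so every column is read regardless of what is selected later.
clicks = (
    spark.read
    .option("sep", "\t")
    .option("header", "true")   # assumes the .tsv files carry a header row
    .csv("hdfs:///data/clickstream/*.tsv.gz")
)

# Keep only the columns of interest and drop rows with an empty user_id.
trimmed = (
    clicks.select("user_id", "url", "ts")
          .filter(F.col("user_id").isNotNull() & (F.col("user_id") != ""))
)

# Writing to Parquet means later jobs can read just the columns they need.
trimmed.write.mode("overwrite").parquet("hdfs:///data/clickstream_parquet")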
-
Thanks, Joel.
With that in mind, although it wouldn't be smart in terms of storage/duplication of data, I guess using an import job and then referencing that would be better than using a data link (on the basis that reading the full set of columns and dealing with the .gz decompression would only have to happen once per file rather than once per run)?