Deliberately dropping columns/rows & 1 big datalink vs many small

Comments (3)

  • Joel Stewart

With data stored in *.tsv.gz files, all columns must be read on the initial access. This is forced by storing the actual data in a format that doesn't allow direct columnar access. Filtering the rows after that initial read will still help reduce the processing time.

    If possible, I'd recommend evaluating if the raw data could be converted to a format that supports columnar access such as Parquet. This would have the biggest impact by allowing direct selection from the source data.

  • Stephen Waters

    Thanks Joel

With that in mind, although it wouldn't be a smart thing to do in terms of storage and duplication of data, I guess using an import job and then referencing the imported copy would be better than using a data link, on the idea that reading the full set of columns and dealing with the .gz decompression would only have to happen once per file rather than once per run?

  • Joel Stewart

    Yes, I think that's a good summary. You'd trade extra storage space up front to reorganize the data, but then have better performance on every future run.

