Smart sampling
Hi,
I have read the algorithm for distributed reservoir sampling here: https://www.datameer.com/documentation/current/Smart+Sampling. I think it is not correct, because it does not select data items from the partitions fairly. For example, suppose you want to select 2 items from input data with 2 partitions. According to the algorithm described there, you would select 2/2 = 1 data item from each partition. However, a truly random sample could also contain 2 items from only 1 partition.
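To make my point concrete, here is a rough Python sketch (the function and names are my own, not from the linked page) of a merge that keeps the global sample uniform. The number of items taken from each partition is drawn in proportion to the partition sizes, so with 2 partitions nothing forces exactly 1 item from each:

```python
import random

def merge_uniform(partition_samples, partition_sizes, k, rng=random):
    """Merge per-partition uniform samples into one uniform sample of size k.

    partition_samples: per-partition random samples (lists of records)
    partition_sizes:   total record counts of the original partitions

    Hypothetical sketch: how many items come from each partition is drawn in
    proportion to the partition sizes (a multivariate hypergeometric draw),
    so nothing guarantees exactly k / num_partitions items per partition.
    """
    remaining = list(partition_sizes)
    take = [0] * len(partition_sizes)
    for _ in range(k):
        # Choose the source partition with probability proportional to the
        # records not yet drawn from it (sampling without replacement).
        i = rng.choices(range(len(remaining)), weights=remaining)[0]
        take[i] += 1
        remaining[i] -= 1
    merged = []
    for sample, n in zip(partition_samples, take):
        # Assumes each per-partition sample holds at least n records.
        merged.extend(rng.sample(sample, n))
    return merged

# With 2 equal-size partitions and k = 2, a run can return both items from
# the same partition, which a fixed 2/2 = 1 split per partition never allows.
print(merge_uniform([["a1", "a2", "a3"], ["b1", "b2", "b3"]], [3, 3], 2))
```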
Am I correct?
Thanks,
Do
-
Hi Do,
If you partition a data set, we create the configured number of sample records for each partition; the default is 5,000.
If you link a single partition into a workbook, we can still show 5k sample records for that partition.
However, if you select multiple partitions (or the entire data set), we re-create a new 5k sample across all partitions.
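For context, each per-partition sample is essentially a single-pass reservoir sample. Here is a minimal Python sketch of that idea (illustrative only, not our actual implementation; the names are placeholders):

```python
import random

def reservoir_sample(records, sample_size=5000, rng=random):
    """Single-pass reservoir sample (Algorithm R) over one partition.

    Illustrative sketch only, not Datameer's actual code.
    """
    reservoir = []
    for seen, record in enumerate(records, start=1):
        if len(reservoir) < sample_size:
            reservoir.append(record)
        else:
            # Keep the new record with probability sample_size / seen,
            # replacing a uniformly chosen existing entry.
            j = rng.randrange(seen)
            if j < sample_size:
                reservoir[j] = record
    return reservoir
```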
If you see any problem with this approach, please let me know.
Thanks,
Frank
-
Hi Do,
Using the distributed reservoir algorithm, we select sample data from all partitions. In the case of 2 partitions with the default sample size, we select 2 sets of 5,000 sample records.
When a workbook is created and we display sample data, we merge the sample data from all the partitions used in that particular workbook. The merging behavior prefers to include items from all partitions if possible; this is the intended design.
It sounds like this preference for selecting data from all partitions is what concerns you. From a set theory perspective, yes, it is possible for a complete sample to come from a single partition. However, I find that when users work with data from multiple partitions, they prefer to see sample records from all partitions while drafting workbooks, even if the cardinality ratio between the partitions' contributions does not match the cardinality ratio of the full data set.
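To illustrate that preference, here is a hypothetical Python sketch (not our actual code) in which the merge takes an equal share from each partition's sample and then tops up from whatever records are left over:

```python
import random

def merge_prefer_all_partitions(partition_samples, target_size=5000, rng=random):
    """Merge per-partition samples so every partition is represented if possible.

    Hypothetical sketch of the preference described above, not Datameer's code:
    each partition contributes roughly target_size / num_partitions records,
    and any shortfall is topped up from the records that were left over.
    """
    share = target_size // len(partition_samples)
    merged, leftovers = [], []
    for sample in partition_samples:
        shuffled = list(sample)
        rng.shuffle(shuffled)
        merged.extend(shuffled[:share])
        leftovers.extend(shuffled[share:])
    # Top up if some partitions had fewer records than their equal share.
    shortfall = target_size - len(merged)
    if shortfall > 0 and leftovers:
        merged.extend(rng.sample(leftovers, min(shortfall, len(leftovers))))
    return merged
```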
Please let us know if this clarifies your question or if you have any follow-up questions.
-
Hi Do,
What is your definition of "correct"? This is how Datameer does it, and it works quite well for us. The ultimate goal is to provide a representative sample that allows users to design their analytics with an interactive experience.
It looks like the alternatives are both worse. We could create the total sample across all partitions, but this would leave only a few sample records, or even none at all, when loading data from just one or a few partitions.
Alternatively, we could sample n records from each partition (as we do) but not re-create a smaller sample when many partitions are loaded. This would seriously impact performance, as you would potentially work with a very large sample.
So my question again is: where exactly do you face a problem? Is it just that you think the sample isn't representative?
Best,
Frank