I have read the algorithm for distributed reservoir sampling here: https://www.datameer.com/documentation/current/Smart+Sampling. I think it's not correct, because it does not select data items fairly across partitions. For example, suppose you want to select 2 items from input data with 2 partitions. According to the algorithm you describe there, you will select 2/2 = 1 data item from each partition. However, in a uniform random sample of the whole input it's possible for both items to come from the same partition, and this algorithm can never produce that outcome.
Am I correct?
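To make the concern concrete, here is a minimal sketch in Python. It assumes a uniform sample of 2 items drawn without replacement from the whole input, and it assumes I have understood the documented algorithm correctly (exactly 1 item per partition). The function name and the partition sizes are my own, chosen only for illustration.

```python
from fractions import Fraction
from math import comb


def prob_both_from_one_partition(n1: int, n2: int) -> Fraction:
    """Probability that a uniform 2-item sample (without replacement)
    from n1 + n2 items falls entirely inside a single partition."""
    total_pairs = comb(n1 + n2, 2)                 # all possible 2-item samples
    same_partition = comb(n1, 2) + comb(n2, 2)     # pairs within one partition
    return Fraction(same_partition, total_pairs)


# Two partitions with 2 items each
print(prob_both_from_one_partition(2, 2))    # 1/3
# Two partitions with 50 items each
print(prob_both_from_one_partition(50, 50))  # 49/99
```

If the algorithm always takes one item from each partition, the probability of this outcome is 0 instead of 1/3 (or 49/99), so the result would not be a uniform sample of the full input.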