Goal
Get a better understanding of how Datameer generates values for Flip Side sheets.
Learn
Flip Sheet is meant for data exploration: it gives you a quick, rough understanding of your data. Not all of its values are exact; some are estimates. When exact values are needed, you can compute them with Datameer spreadsheets and functions.
For performance reasons, Datameer uses algorithms that read a dataset only once to generate the values for the Flip Side sheet. A value is therefore exact only if it can be computed exactly in a single pass over the data.
For example, Count (displayed as a rounded count of records for easy visibility), Min, Max and Mean are easy to compute with One-View algorithms, therefore these values should always be accurate.
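To illustrate why these values can be exact, here is a minimal Python sketch of a single-pass computation. It is not Datameer's implementation; the function name `single_pass_stats` and the returned dictionary keys are chosen for illustration only.

```python
def single_pass_stats(values):
    """Compute count, min, max and mean while reading the values only once."""
    count = 0
    minimum = None
    maximum = None
    total = 0.0
    for v in values:  # the only pass over the data
        count += 1
        if minimum is None or v < minimum:
            minimum = v
        if maximum is None or v > maximum:
            maximum = v
        total += v
    mean = total / count if count else None
    return {"count": count, "min": minimum, "max": maximum, "mean": mean}


# A generator can be consumed only once, so this really is a single pass.
print(single_pass_stats(x for x in [3, 1, 4, 1, 5]))
```

Each statistic needs only a constant amount of state (a counter, two extremes and a running sum), which is why one read of the data is enough.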
However, algorithms that count unique values can't provide exact results when reading the data only once, unless they keep every distinct value in memory. Therefore the Unique count on Flip Side displays an estimated count of unique records. The precision of these estimates decreases when data volume increases.
To calculate the number of unique values in a column, Datameer uses the HyperLogLog (HLL) algorithm with a precision of 14 bits (b = 14), which corresponds to 2^14 = 16384 registers.
The expected relative error (standard error) of this algorithm can be calculated with the formula:
error = 1.04/sqrt(m), where m = 2^b and b = 14, so error = 1.04/sqrt(16384) = 1.04/128 = 0.0081, or 0.81%
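The arithmetic in the formula above can be checked with a few lines of Python. The function name `hll_standard_error` is hypothetical and used only for this example.

```python
import math


def hll_standard_error(b):
    """Expected relative error of HyperLogLog with m = 2**b registers."""
    m = 2 ** b
    return 1.04 / math.sqrt(m)


error = hll_standard_error(14)
print(f"m = {2 ** 14}, error = {error:.4f} ({error:.2%})")
```

Since the error shrinks with the square root of m, each additional bit of precision doubles the number of registers but reduces the error only by a factor of about 1.41.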
The observed error rate varies slightly with the number of records, e.g. it's ~0.4% for 10k-100k records and ~0.7% for 1M records.
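To see how such an estimate behaves in practice, the following is a simplified HyperLogLog sketch in Python. It is an illustration under stated assumptions, not Datameer's code: the class name `HyperLogLog`, the use of SHA-1 truncated to 64 bits as the hash function, and the record naming in the demo are all choices made for this example. It uses the standard raw estimator with the small-range (linear counting) correction.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog with m = 2**b registers (illustrative only)."""

    def __init__(self, b=14):
        self.b = b
        self.m = 1 << b
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value):
        # Assumption: a 64-bit hash taken from the first 8 bytes of SHA-1.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        rest_bits = 64 - self.b
        index = h >> rest_bits                # first b bits select a register
        rest = h & ((1 << rest_bits) - 1)     # remaining bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = rest_bits - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def estimate(self):
        harmonic_sum = sum(2.0 ** -r for r in self.registers)
        raw = self.alpha * self.m * self.m / harmonic_sum
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog(b=14)
true_count = 100_000
for i in range(true_count):
    hll.add(f"record-{i}")
relative_error = abs(hll.estimate() - true_count) / true_count
print(f"estimate = {hll.estimate():.0f}, relative error = {relative_error:.2%}")
```

The memory used depends only on the number of registers, not on the number of records, which is what makes the estimate possible in a single pass. The error measured by this sketch depends on the hash function and the input, so it will not necessarily match the figures quoted above.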
For more details on how HLL works, see the article Understanding the HyperLogLog.