I am applying the clustering feature of smart analytics to data from production lines, and I tried different numbers of clusters. Normally I would expect a lower cost (sum of squared distances from the centroids) for a higher number of clusters. I could observe this in some cases, but not always. In the following case it looks good: the cost decreases with more clusters. I also calculated the silhouette index, which grows to nearly 0.9. Here I get good clusterings.
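For reference, this is roughly how I understand the two figures. It is only a minimal sketch in plain Python, assuming Euclidean distance and points given as tuples; the helper names `cost` and `silhouette` are my own and are not part of the product.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def cost(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    return sum(
        sq_dist(p, centroid(members))
        for members in clusters.values()
        for p in members
    )


def silhouette(points, labels):
    """Mean silhouette index over all points (O(n^2), small data only)."""
    clusters = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the own cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i)
        a /= len(own) - 1
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in idx) / len(idx)
            for other, idx in clusters.items()
            if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Toy data: two tight pairs that are far apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = [0, 0, 1, 1]
bad = [0, 1, 0, 1]
print(cost(points, good), silhouette(points, good))
print(cost(points, bad), silhouette(points, bad))
```

A labeling that matches the real groups gives a low cost and a silhouette close to 1, while a labeling that cuts across them gives a high cost and a silhouette near or below 0.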
In this case it does not work well. When I cluster multiple times, even with the same number of clusters, I sometimes get very different results:
The low cost and high silhouette value with 8 clusters could only be produced with multiple attempts. I always reported the best value; 8 clusters sometimes also produced a cost of 4 billion. But with 9 and 10 clusters I could never get a good value.
I know that k-means depends very much on the initial values, which is what k-means++ addresses. Therefore I am a little surprised. The best cost for 8 clusters is below 8 million. With k-means++, the expected cost is at most 8(ln k + 2) times the optimum. In my case that factor is about 34, but the figures vary by a factor of 400. And for 9 and 10 clusters I could not produce good results even with many attempts.
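To make clear what I mean by "multiple attempts" and by the k-means++ factor, here is a minimal sketch in plain Python. It is my own illustration of k-means++ seeding followed by Lloyd iterations, not the product's implementation; the function names, the restart count and the toy data are all assumptions. Note that the factor 8(ln k + 2) bounds the expected cost, so a single run may still exceed it.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seeds(points, k, rng):
    """Pick k initial centers, each with probability proportional to D(x)^2."""
    centers = [rng.choice(points)]
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # every point already coincides with a center
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=d2, k=1)[0])
        d2 = [min(d, sq_dist(p, centers[-1])) for d, p in zip(d2, points)]
    return centers


def lloyd(points, centers, iters=50):
    """Standard k-means iterations; returns the final centers and cost."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            groups[j].append(p)
        new = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if new == centers:
            break
        centers = new
    total = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, total


def best_and_worst_cost(points, k, restarts=20, seed=0):
    """Run several seeded restarts and report the lowest and highest cost."""
    rng = random.Random(seed)
    costs = [
        lloyd(points, kmeanspp_seeds(points, k, rng))[1]
        for _ in range(restarts)
    ]
    return min(costs), max(costs)


def kmeanspp_bound_factor(k):
    """Factor 8 * (ln k + 2) from the k-means++ expected-cost bound."""
    return 8 * (math.log(k) + 2)


# Toy data: three small clumps (hypothetical values, not my production data).
points = [(x + dx, y + dy)
          for x, y in [(0, 0), (50, 0), (0, 50)]
          for dx in (0, 1) for dy in (0, 1)]
best, worst = best_and_worst_cost(points, k=3, restarts=10, seed=1)
print(best, worst, kmeanspp_bound_factor(3))
```

If the ratio between the worst and the best cost over many restarts is far above the factor, as in my case, I would suspect that the seeding is not k-means++ or that something else differs between runs.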
My questions: Do you use k-means++ internally? Do you have an explanation? Do you have any hints for improving it? This use case could become very valuable for manufacturers if it works.