Mining Itemsets that Compress
with Matthijs van Leeuwen & Arno Siebes

Abstract. The Krimp algorithm is our answer to the pattern explosion in data mining: the best set of patterns is the set that compresses the data best. Using the Minimum Description Length (MDL) principle, Krimp achieves reductions of up to 7 orders of magnitude in the number of frequent itemsets. The selected patterns are highly characteristic of the data, as indicated by good compression ratios and high classification accuracies.

Krimp was first published as Siebes et al. (2006), although not yet under that name. Since then, we have extended the Krimp foundation to data mining tasks such as characterising differences between databases, generating data, completing missing data, detecting changes in streams, identifying the components of a database, and more. For more details, see the publication list below.

Public release: source code and binaries. Our implementation of Krimp is freely available for research purposes; we provide both the C++ source code and binaries for Windows (x86 and x64) and Unix (x64 only; tested under Ubuntu, Fedora, OSX). In addition to the pattern selection algorithm, it contains the Krimp classifier and the StreamKrimp algorithm. For your convenience, the package includes some example UCI datasets taken from the LUCS-KDD data library. Please refer to the documentation in the package for installation/compilation details and usage hints.


Slim, beyond Krimp. Recently, we introduced the Slim algorithm for directly mining good code tables from data. Krimp first mines a (large) candidate collection, orders it, and then greedily filters it; Slim instead iteratively generates the candidates that are most likely to improve the current code table. You can find more information here.
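
The mine, order, and filter scheme described above can be illustrated with a small Python sketch. This is a heavily simplified illustration written for this page, not the released C++ implementation: candidates are enumerated by brute force instead of a frequent itemset miner, there is no pruning step, and all names (`cover`, `total_size`, `krimp_sketch`, `min_support`) are hypothetical. A candidate itemset is kept only if adding it to the code table makes the total encoded size (code table plus data) strictly smaller.

```python
from itertools import combinations
from math import log2


def support(itemset, db):
    """Number of transactions that contain the itemset."""
    return sum(1 for t in db if itemset <= t)


def cover(transaction, code_table):
    """Cover a transaction with non-overlapping itemsets, in code table order."""
    remaining = set(transaction)
    used = []
    for itemset in code_table:
        if itemset <= remaining:
            used.append(itemset)
            remaining -= itemset
    return used


def total_size(code_table, db, standard_len):
    """Encoded size in bits of the code table plus the data covered with it."""
    usage = {itemset: 0 for itemset in code_table}
    for t in db:
        for itemset in cover(t, code_table):
            usage[itemset] += 1
    total_usage = sum(usage.values())
    size = 0.0
    for itemset, count in usage.items():
        if count == 0:
            continue  # assumption: unused itemsets get no code and cost nothing
        code_len = -log2(count / total_usage)
        size += count * code_len  # the data, encoded with this code
        # The code table entry: the code itself plus the itemset spelled out
        # in singleton-only ("standard") codes.
        size += code_len + sum(standard_len[i] for i in itemset)
    return size


def krimp_sketch(db, min_support=2):
    """Greedy filtering: keep a candidate only if it improves compression."""
    items = sorted({i for t in db for i in t})
    n_occ = sum(len(t) for t in db)
    standard_len = {
        i: -log2(sum(1 for t in db if i in t) / n_occ) for i in items
    }

    def cover_order(s):
        # Longer itemsets first, then more frequent ones.
        return (-len(s), -support(s, db), sorted(s))

    # Brute-force candidate generation; exponential, for tiny examples only.
    candidates = [
        frozenset(c)
        for k in range(2, len(items) + 1)
        for c in combinations(items, k)
        if support(frozenset(c), db) >= min_support
    ]
    # Candidate order: more frequent first, then longer ones.
    candidates.sort(key=lambda s: (-support(s, db), -len(s), sorted(s)))

    # Start from singletons only, so every transaction can always be covered.
    code_table = sorted((frozenset([i]) for i in items), key=cover_order)
    best = total_size(code_table, db, standard_len)
    for cand in candidates:
        trial = sorted(code_table + [cand], key=cover_order)
        size = total_size(trial, db, standard_len)
        if size < best:  # accept only strict improvements
            code_table, best = trial, size
    return code_table, best


if __name__ == "__main__":
    db = [frozenset("abc")] * 8 + [frozenset("d")] * 2
    table, bits = krimp_sketch(db)
    print([sorted(s) for s in table], round(bits, 2))
```

In this toy database the itemset {a, b, c} replaces three singleton codes in eight transactions, so it is accepted; its subsets are then never used in any cover and are rejected because they do not reduce the encoded size further.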


Implementation

Krimp source code & binaries (1st February 2013) by Jilles Vreeken, Matthijs van Leeuwen, and Koen Smets.

Related Publications

Vreeken, J, van Leeuwen, M & Siebes, A Krimp: Mining Itemsets that Compress. Data Mining and Knowledge Discovery vol.23(1), pp 169-214, Springer, 2011. (IF 2.950)
Smets, K & Vreeken, J The Odd One Out: Identifying and Characterising Anomalies. In: Proceedings of the SIAM International Conference on Data Mining (SDM), pp 804-815, SIAM, 2011.
van Leeuwen, M, Vreeken, J & Siebes, A Identifying the Components. Data Mining and Knowledge Discovery vol.19(2), pp 176-193, Springer, 2009. (IF 2.950)
Vreeken, J, van Leeuwen, M & Siebes, A Preserving Privacy through Data Generation. In: Proceedings of the IEEE International Conference on Data Mining (ICDM), pp 685-690, IEEE, 2007.
Vreeken, J, van Leeuwen, M & Siebes, A Characterising the Difference. In: Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pp 765-774, ACM, 2007.
van Leeuwen, M, Vreeken, J & Siebes, A Compression Picks the Item Sets that Matter. In: Proceedings of European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), pp 585-592, Springer, 2006.
Siebes, A, Vreeken, J & van Leeuwen, M Item Sets That Compress. In: Proceedings of the SIAM International Conference on Data Mining (SDM), pp 393-404, SIAM, 2006.