The deadline is May 26th at 10:00 Saarbrücken standard time. You are free to hand in earlier. You will have to choose one topic from the list below, read the articles, and hand in a report that critically discusses this material and answers the assignment questions. Reports should summarise the key aspects but, more importantly, should include original and critical thought that shows you have acquired a meta-level understanding of the topic – plain summaries will not suffice. All sources you use should be appropriately referenced, and any text you quote should be clearly identified as such. The expected length of a report is between 3 and 5 pages, but there is no limit.
For the topic of your assignment, choose one of the following:
One of the advantages of using Minimum Description Length (MDL) is that it allows us to define parameter-free methods. Some people go as far as claiming that this is the future of data mining, as it allows for true exploratory data mining. Is this so? Where did the parameters go, and what about any assumptions? Is MDL a magic wand? What if we do not like what MDL says is the best? Discuss critically.
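As a starting point for thinking about where the parameters go, the following is a minimal sketch of a crude two-part code: the total cost is the cost of the model plus the cost of the data given the model. The function names are hypothetical, and the model cost of `0.5 * log2(n)` bits for one real-valued parameter is an assumed, commonly used approximation rather than something prescribed by the assigned reading.

```python
import math


def bernoulli_code_length(bits):
    """Crude two-part code length (in bits) of a non-empty 0/1 list
    under a Bernoulli model with maximum-likelihood parameter."""
    n = len(bits)
    k = sum(bits)
    p = k / n
    # L(D | M): Shannon code length of the data given the fitted parameter.
    data_cost = 0.0
    for count, prob in ((k, p), (n - k, 1 - p)):
        if count > 0:
            data_cost -= count * math.log2(prob)
    # L(M): ASSUMPTION, cost of stating one parameter at precision 1/sqrt(n).
    model_cost = 0.5 * math.log2(n)
    return model_cost + data_cost


def fair_coin_code_length(bits):
    """Code length under a fixed fair coin: one bit per symbol, no parameter."""
    return float(len(bits))


skewed = [1] * 90 + [0] * 10
balanced = [1, 0] * 50

for name, data in (("skewed", skewed), ("balanced", balanced)):
    print(name, fair_coin_code_length(data), round(bernoulli_code_length(data), 2))
```

On the skewed data the Bernoulli model pays for its parameter and still compresses better; on the balanced data the extra parameter only adds cost. Note that the choice of model class and of the encoding for the parameter are themselves design decisions.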
Initially, frequency was thought to be a good measure for the interestingness of a pattern – you want to know what has been sold often, after all. After realising that the resulting set of patterns was more often than not enormous in size, as well as extremely redundant, research attention shifted to finding better measures of interestingness. A natural approach then seemed to be to report only those patterns for which the frequency is significant with regard to some background model. Unsurprisingly, this turned out to be much harder than expected.
Read both Brin et al. and Webb. Both give examples of how to identify patterns that somehow deviate from an expectation, and both consider a lot more information than simply the expected frequency under the marginals, yet the two approaches are otherwise quite different. Analyse the core ideas of both approaches, give a succinct summary, and discuss in detail how they differ and what, in your view, the strong and weak points of each approach are. Is either of these measures the ultimate measure of interestingness for itemsets, or are further improvements possible?
(Yes, Brin is Sergey Brin from Google.) (Although Webb does not give an algorithm for mining self-sufficient itemsets, in a recent paper we showed how these can be mined efficiently.)
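To make "expected frequency under the marginals" concrete, here is a minimal sketch that compares the observed support of an itemset with its expectation under full independence of the items, and computes a chi-squared statistic for a pair of items from its 2x2 contingency table. The function names and the toy transactions are hypothetical, and this is only the simplest baseline, not the full method of either paper.

```python
from itertools import product


def support(transactions, itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def expected_support(transactions, itemset):
    """Expected support if all items occurred independently."""
    n = len(transactions)
    expected = float(n)
    for item in itemset:
        expected *= support(transactions, {item}) / n
    return expected


def chi_squared_pair(transactions, a, b):
    """Chi-squared statistic of the 2x2 contingency table of items a and b."""
    n = len(transactions)
    pa = support(transactions, {a}) / n
    pb = support(transactions, {b}) / n
    stat = 0.0
    for has_a, has_b in product((True, False), repeat=2):
        observed = sum(
            1 for t in transactions if (a in t) == has_a and (b in t) == has_b
        )
        expected = n * (pa if has_a else 1 - pa) * (pb if has_b else 1 - pb)
        if expected > 0:
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical toy data: 10 transactions over two items.
transactions = (
    [{"bread", "butter"}] * 4 + [{"bread"}] + [{"butter"}] + [set()] * 4
)
pair = {"bread", "butter"}
print(support(transactions, pair), expected_support(transactions, pair))
print(chi_squared_pair(transactions, "bread", "butter"))
```

Here the pair occurs in 4 transactions against an expectation of 2.5 under independence. Whether such a deviation is interesting, and against which richer expectation it should be tested, is exactly what the two papers disagree on.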
In data mining the goal is to find interesting structure in your data, things that somehow stand out. There are many ways to define 'standing out'. Significance testing based on background knowledge is one of them, but can, again, be done in different ways. There are two main schools. Gionis et al. propose to measure significance using (swap) randomization, whereas De Bie argues for using Maximum Entropy (MaxEnt) modelling. (Variants exist for all sorts of data types.) What are the key differences between the two approaches? What are the differences in the background knowledge that can be incorporated? What are the key differences in the assumptions about the background knowledge? What will the effect be in practice? Which do you think is more practical? Why?
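The basic idea behind swap randomization can be sketched as follows: repeatedly pick two rows and two columns of a binary matrix and, if the selected 2x2 submatrix forms a "checkerboard", flip it. Every such swap leaves all row and column sums unchanged. The function name, the `attempts` and `seed` parameters, and the toy matrix are assumptions made for illustration; this is a simplified sketch, not the exact sampling procedure of the paper.

```python
import random


def swap_randomize(matrix, attempts, seed=0):
    """Return a randomized copy of a 0/1 matrix with identical margins."""
    rng = random.Random(seed)
    data = [row[:] for row in matrix]
    n_rows, n_cols = len(data), len(data[0])
    for _ in range(attempts):
        r1, r2 = rng.randrange(n_rows), rng.randrange(n_rows)
        c1, c2 = rng.randrange(n_cols), rng.randrange(n_cols)
        # A swap is only possible on a checkerboard: ones on one diagonal,
        # zeros on the other. Otherwise the attempt is skipped.
        if (
            data[r1][c1] == 1 and data[r2][c2] == 1
            and data[r1][c2] == 0 and data[r2][c1] == 0
        ):
            data[r1][c1] = data[r2][c2] = 0
            data[r1][c2] = data[r2][c1] = 1
    return data


# Hypothetical toy data.
original = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
]
randomized = swap_randomize(original, attempts=1000)
print([sum(row) for row in randomized], [sum(col) for col in zip(*randomized)])
```

A statistic of interest (for example the support of an itemset) can then be computed on many such randomized copies to obtain an empirical distribution against which the value on the original data is compared.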
Within the MaxEnt school, there exist two sub-schools. In addition to , read . What are the key differences between the models? What are the differences in the type of knowledge they can incorporate? Can we use both models to test the significance of the same types of structures/patterns? Can the two approaches be unified? Does it make any sense to have the type of background information of  incorporated into , and how about vice versa? If it does, sketch how you think this would work.
One often sought-after feature of matrix factorizations in data analysis is sparsity: if the data is sparse (i.e. has relatively few non-zero elements), one often hopes that the factor matrices stay sparse as well. Reasons often given for preferring sparsity include that sparse factor matrices are easier to interpret and that they save space when storing just the factorization instead of the full matrix. What are your opinions on sparsity? Is it preferable to have a sparse factorization for sparse data? Are the above reasons valid? Can you think of any other reason why sparsity would be preferable, or any reasons why sparsity would not be preferable?
One method that is often claimed to produce sparse factor matrices is Nonnegative Matrix Factorization (NMF), but for the general case, this cannot be proved. In addition to standard regularizers, researchers have also developed methods to directly find sparse NMF. Read  and  for two examples. Why do people think that vanilla NMF can produce sparse factors, and why might it not? How do Hoyer, on the one hand, and Gillis & Glineur, on the other, tackle that problem? Do they actually produce sparse factorizations in the traditional sense, or are they redefining the problem to make it easier?
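When comparing notions of sparsity, it can help to compute them on small examples. The sketch below contrasts the "traditional" notion (fraction of zero entries) with the norm-based sparseness measure used by Hoyer, which relates the L1 and L2 norms of a vector and ranges from 0 (all entries equal) to 1 (a single non-zero entry). The function names are hypothetical, and the code assumes a non-zero vector with at least two entries.

```python
import math


def zero_fraction(x):
    """Traditional sparsity: fraction of entries that are exactly zero."""
    return sum(1 for v in x if v == 0) / len(x)


def hoyer_sparseness(x):
    """Norm-based sparseness: (sqrt(n) - L1/L2) / (sqrt(n) - 1).

    Assumes len(x) >= 2 and that x is not the all-zero vector.
    """
    n = len(x)
    l1 = sum(abs(v) for v in x)
    l2 = math.sqrt(sum(v * v for v in x))
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)


for vec in ([0, 0, 3, 0], [1, 1, 1, 1], [5, 0.01, 0.01, 0.01]):
    print(vec, zero_fraction(vec), round(hoyer_sparseness(vec), 3))
```

The last vector has no zero entries at all, yet scores high on the norm-based measure, which is one way to see that the two notions of sparsity need not agree.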
The Boolean matrix factorization (BMF) can also yield sparse factorizations . How do the sparsity results of  compare to those of ? Compare in particular Theorem 1 of  to Theorem 1 of . Do you think one of these results is somehow better than the other?
Optional: Prove that the bound of Theorem 1 of  also holds for dominated Boolean matrix factorizations (i.e. Boolean underapproximations), or provide a counterexample where it does not hold.
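For experimenting with small cases, the terms used above can be sketched in code: the Boolean matrix product replaces the sum of ordinary matrix multiplication with a logical OR, and a factorization is dominated when the product never has a 1 where the data has a 0. The function names and the toy matrices are hypothetical and serve only as an illustration of the definitions.

```python
def boolean_product(b, c):
    """Boolean matrix product: entry (i, j) is OR over k of b[i][k] AND c[k][j]."""
    inner, cols = len(c), len(c[0])
    return [
        [int(any(row[k] and c[k][j] for k in range(inner))) for j in range(cols)]
        for row in b
    ]


def is_dominated(a, b, c):
    """True if the Boolean product of b and c covers no zero of a."""
    approx = boolean_product(b, c)
    return all(
        p <= x for p_row, a_row in zip(approx, a) for p, x in zip(p_row, a_row)
    )


def count_nonzeros(m):
    """Number of non-zero entries, a simple measure of sparsity."""
    return sum(1 for row in m for v in row if v)


# Hypothetical toy data: a 3x3 matrix with an exact rank-2 Boolean factorization.
a = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
b = [[1, 0], [1, 1], [0, 1]]
c = [[1, 1, 0], [0, 1, 1]]
print(boolean_product(b, c) == a, is_dominated(a, b, c))
print(count_nonzeros(a), count_nonzeros(b) + count_nonzeros(c))
```

Changing `c` to `[[1, 1, 1], [0, 1, 1]]` makes the product cover the zero in the first row of `a`, so the factorization is no longer dominated.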
Return the assignment by email to firstname.lastname@example.org by 26 May, 10:00. The subject of the email must start with [TADA]. The assignment must be returned as a PDF, and it must contain your name, matriculation number, and e-mail address together with the exact topic of the assignment.
Grading will take into account both the Hardness of the questions and whether you answer the Bonus questions.
You will need a username and password to access the papers outside the MPI network. Contact the lecturer if you don't know the username or password.
Minimum Description Length Tutorial (shortened version). In Advances in Minimum Description Length, MIT Press, 2005.
Summarising Data by Clustering Items. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), Barcelona, Spain, pages 321-336, Springer, 2010.
Towards parameter-free data mining. In Proceedings of the 10th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Seattle, WA, pages 206-215, 2004.
Beyond Market Baskets: Generalizing Association Rules to Correlations. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), Tucson, AZ, pages 265-276, ACM, 1997.
Self-sufficient itemsets: An approach to screening potentially interesting associations between items. ACM Transactions on Knowledge Discovery from Data, 4(1):1-20, 2010.
Efficient Discovery of the Most Interesting Associations. ACM Transactions on Knowledge Discovery from Data, 8(3):1-31, ACM, 2014.
Assessing Data Mining Results Via Swap Randomization. In Proceedings of the 12th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Philadelphia, PA, pages 167-176, ACM, 2006.
Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Mining and Knowledge Discovery, 23(3):407-446, Springer, 2011.
Tell Me What I Need To Know: Succinctly Summarizing Data with Itemsets. In Proceedings of the 17th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), San Diego, CA, pages 573-581, ACM, 2011.
Non-negative Matrix Factorization with Sparseness Constraints. Journal of Machine Learning Research, 5:1457-1469, 2004.
Using underapproximations for sparse nonnegative matrix factorization. Pattern Recognition, 43(4):1676-1687, 2010.
Sparse Boolean Matrix Factorizations. In Proceedings of the IEEE International Conference on Data Mining (ICDM), pages 935-940, 2010.