Topics in Algorithmic Data Analysis SS'16


News

more ▾

Course Information

Type Advanced Lecture (6 ECTS)
Lecturers Dr. Jilles Vreeken and Dr. Pauli Miettinen
Email tada-staff (at) mpi-inf.mpg.de
Lectures Thursdays, 10–12 o'clock in Room E1.3 0.14.
Summary In this advanced course we'll be investigating hot topics in data mining that the lecturers think are cool. This course is for those of you who are interested in Big Data Analytics, Data Science, Data Mining, Machine Learning – or, as the lecturers prefer to call it – Algorithmic Data Analysis. We'll be looking into how to discover significant and useful patterns from data, efficiently measure non-linear correlations and determine causal directions, as well as how to analyse large graphs and multi-linear structures.

Schedule

Month Day Topic Who Slides Assignment Req. Reading Opt. Reading
April 21 Introduction, Practicalities * PDF 1st handed out
28 Useful Patterns JV PDF [1] [8,9,10]
May 5 yay, holiday – no lecture deadline 1st, 2nd out
12 B for Boolean PM PDF [2] [11,12,13]
19 Significant Results PM PDF [14,15,16,17,18,19]
26 yay, holiday – no lecture deadline 2nd
June 2 planned absence – no lecture 3rd handed out
7 Complex Correlation JV PDF [3] [20,21,22]
9 Causal Inference JV PDF
16 Graph Summarisation JV PDF [4] [23,24,25]
23 Rumours in Graphs JV PDF [5] [26,27,28]
30 dot.product.graphs PM PDF deadline 3rd, 4th out [6] [29]
July 7 One, Two, Tensor PM PDF [7] [30]
14 Three, Four, Tensor PM PDF [31,32,33,34,35]
21 Wrap-Up * PDF
PDF
deadline 4th
25 Oral exams
26 Oral exams

Materials

All required and optional reading will be made available. You will need a username and password to access the papers outside the MPI network. Contact the lecturers if you don't know the username or password.

Required Reading

[1] van Leeuwen, M. & Vreeken, J. Mining and Using Sets of Patterns through Compression. In Frequent Pattern Mining, Aggarwal, C. & Han, J., pages 165-198, Springer, 2014.
[2] Miettinen, P., Mielikäinen, T., Gionis, A., Das, G. & Mannila, H. The Discrete Basis Problem. TKDE, 20(10):1348-1362, IEEE, 2008.
[3] Hoefer, T., Przyrembel, H. & Verleger, S. New Evidence for the Theory of the Stork. Paediatric and Perinatal Epidemiology, 18(1):88-92, 2004.
[4] Koutra, D., Kang, U., Vreeken, J. & Faloutsos, C. VoG: Summarizing Graphs using Rich Vocabularies. In Proceedings of the SIAM International Conference on Data Mining (SDM), Philadelphia, PA, pages 91-99, SIAM, 2014.
[5] Prakash, B.A., Vreeken, J. & Faloutsos, C. Spotting Culprits in Epidemics: How many and Which ones?. In Proceedings of the 12th IEEE International Conference on Data Mining (ICDM), Brussels, Belgium, IEEE, 2012.
[6] Fiduccia, C.M., Scheinerman, E.R., Trenk, A. & Zito, J.S. Dot product representations of graphs. Discrete Math., 181(1-3):113-138, 1998.
[7] Kolda, T.G. & Bader, B.W. Tensor decompositions and applications. SIAM Rev., 51(3):455-500, 2009.

Optional Reading

[8] Vreeken, J., van Leeuwen, M. & Siebes, A. Krimp: Mining Itemsets that Compress. Data Mining and Knowledge Discovery, 23(1):169-214, Springer, 2011.
[9] van Leeuwen, M., Vreeken, J. & Siebes, A. Identifying the Components. Data Mining and Knowledge Discovery, 19(2):173-292, 2009.
[10] Smets, K. & Vreeken, J. The Odd One Out: Identifying and Characterising Anomalies. In Proceedings of the 11th SIAM International Conference on Data Mining (SDM), Mesa, AZ, pages 804-815, Society for Industrial and Applied Mathematics (SIAM), 2011.
[11] Lucchese, C., Orlando, S. & Perego, R. A Unifying Framework for Mining Approximate Top-k Binary Patterns. IEEE Trans. Knowl. Data Eng., 26(12):2900-2913, 2013.
[12] Miettinen, P. & Vreeken, J. MDL4BMF: Minimum Description Length for Boolean Matrix Factorization. ACM Trans. Knowl. Discov. Data, 8(4), 2014.
[13] Karaev, S., Miettinen, P. & Vreeken, J. Getting to Know the Unknown Unknowns: Destructive-Noise Resistant Boolean Matrix Factorization. In SDM '15, 2015.
[14] Gionis, A., Mannila, H., Mielikäinen, T. & Tsaparas, P. Assessing data mining results via swap randomization. ACM Transactions on Knowledge Discovery from Data, 1(3), ACM, 2007.
[15] Hanhijärvi, S., Ojala, M., Vuokko, N., Puolamäki, K., Tatti, N. & Mannila, H. Tell me something I don't know: randomization strategies for iterative data mining. In Proceedings of the 15th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Paris, France, pages 379-388, ACM, 2009.
[16] Ojala, M., Vuokko, N., Kallio, A., Haiminen, N. & Mannila, H. Randomization methods for assessing data analysis results on real-valued matrices. Statistical Analysis and Data Mining, 2(4):209-230, 2009.
[17] Jaynes, E. On the rationale of maximum-entropy methods. Proceedings of the IEEE, 70(9):939-952, IEEE, 1982.
[18] De Bie, T. Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Mining and Knowledge Discovery, 23(3):407-446, Springer, 2011.
[19] Kontonasios, K.N., Vreeken, J. & De Bie, T. Maximum Entropy Models for Iteratively Identifying Subjectively Interesting Structure in Real-Valued Data. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), Prague, Czech Republic, pages 256-271, Springer, 2013.
[20] Reshef, D.N., Reshef, Y.A., Finucane, H.K., Grossman, S.R., McVean, G., Turnbaugh, P.J., Lander, E.S., Mitzenmacher, M. & Sabeti, P.C. Detecting Novel Associations in Large Data Sets. Science, 334(6062):1518-1524, 2011.
[21] Crescenzo, A.D. & Longobardi, M. On cumulative entropies. Journal of Statistical Planning and Inference, 139(2009):4072-4087, 2009.
[22] Nguyen, H.V., Müller, E., Vreeken, J. & Böhm, K. Multivariate Maximal Correlation Analysis. In Proceedings of the 31st International Conference on Machine Learning (ICML), Beijing, China, pages 775-783, JMLR, 2014.
[23] Chakrabarti, D., Papadimitriou, S., Modha, D.S. & Faloutsos, C. Fully automatic cross-associations. In Proceedings of the 10th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Seattle, WA, pages 79-88, 2004.
[24] Kang, U. & Faloutsos, C. Beyond Caveman Communities: Hubs and Spokes for Graph Compression and Mining. In Proceedings of the 11th IEEE International Conference on Data Mining (ICDM), Vancouver, Canada, pages 300-309, IEEE, 2011.
[25] Tsourakakis, C.E., Kang, U., Miller, G.L. & Faloutsos, C. Doulion: Counting Triangles in Massive Graphs with a Coin. In Proceedings of the 15th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Paris, France, ACM, 2009.
[26] Lappas, T., Terzi, E., Gunopulos, D. & Mannila, H. Finding effectors in social networks. In Proceedings of the 16th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Washington, DC, pages 1059-1068, ACM, 2010.
[27] Shah, D. & Zaman, T. Rumors in a Network: Who's the Culprit?. IEEE Transactions on Information Technology, 57(8):5163-5181, 2011.
[28] Sundareisan, S., Vreeken, J. & Prakash, B.A. Hidden Hazards: Finding Missing Nodes in Large Graph Epidemics. In Proceedings of the SIAM International Conference on Data Mining (SDM'15), SIAM, 2015.
[29] Kang, R.J. & Müller, T. Sphere and Dot Product Representations of Graphs. Discrete Comput Geom, 47(3):548-568, 2012.
[30] Skillicorn, D. Understanding Complex Datasets: Data Mining with Matrix Decompositions. Chapman \& Hall/CRC, 2007.
[31] Acar, E. & Yener, B. Unsupervised Multiway Data Analysis: A Literature Survey. IEEE Trans. Knowl. Data En., 21(1):6-20, 2009.
[32] Nickel, M., Tresp, V. & Kriegel, H.P. A Three-Way Model for Collective Learning on Multi-Relational Data. In ICML '11, pages 809-816, 2011.
[33] Nickel, M., Tresp, V. & Kriegel, H.P. Factorizing YAGO: Scalable Machine Learning for Linked Data. In WWW '12, pages 271-280, 2012.
[34] Chi, E.C. & Kolda, T.G. On Tensors, Sparsity, and Nonnegative Factorizations. SIAM J. Matrix Anal. Appl., 33(4):1272-1299, 2012.
[35] Erdős, D. & Miettinen, P. Discovering Facts with Boolean Tensor Tucker Decomposition. In CIKM '13, pages 1569-1572, 2013.

Course format

The course has two hours of lectures per week. There are no weekly tutorial group meetings. Instead, the students have to write essays based on the material covered on the lectures and scientific articles assigned to them by the lecturers.

Structure and Content

In general terms, the course will consist of

  1. lectures, and
  2. assignments that include critically reading scientific articles
At a high level, the topics we will cover will include
  1. Mining Significant Patterns
  2. Mining Correlation and Causation
  3. Mining Graphs and Tensors
Loosely speaking, students will learn about current hot topics in exploratory data analysis, with an emphasis on theoretically well-founded approaches, including those based on information theoretic principles.

Assignments

Students will individually do one assignment per topic – four in total. For every assignment, you will have to read one or more research papers and hand in a report that critically discusses this material and answers the assignment questions. Reports should summarise the key aspects, but more importantly, should include original and critical thought that show you have acquired a meta level understanding of the topic – plain summaries will not suffice. All sources you've drawn from should be referenced. The expected length of a report is 3 pages, but there is no limit.

The deadlines for the reports are at 10:00 Saarbrücken standard-time. You are free to hand in earlier.

Grading and Exam

The assignments will be graded in scale of Fail, Pass, Good, and Excellent. Any assignment not handed in by the deadline is automatically considered failed, and cannot be re-done. You are allowed to re-do a Failed assignment once: you have to hand in the improved assignment within two weeks. Two failures mean you fail the course.

You can earn up to three bonus points by obtaining Excellent or Good grades for the assignments. An Excellent grade gives you one bonus point, as do every two Good grades, up to a maximum of three bonus points. Each bonus point improves your final grade by 1/3 assuming you pass the final exam. For example, if you have two bonus points and you receive 2.0 from the final exam, your final grade will be 1.3. You fail the course if you fail the final exam, irrespective of your possible bonus points. Failed assignments do not reduce your final grade, provided you are eligible to sit the final exam.

The final exams will be oral. The final exam will cover all the material discussed in the lectures and the topics on which you did your assignments. The main exam will be on July 25th and 26th. The re-exam will be on September 30th. The exact time slot per student will be announced later. Inform the lecturers of any potential clashes as soon as you know them.

Prerequisites

Students should have basic working knowledge of data analysis and statistics, e.g. by successfully having taken courses related to data mining, machine learning, and/or statistics, such as Information Retrieval and Data Mining, Machine Learning, Probabilistic Graphical Models, Statistical Learning, etc.