Topics in Algorithmic Data Analysis SS'21


News

more ▾

Course Information

Type Advanced Lecture (6 ECTS)
Lecturer Prof. Dr. Jilles Vreeken
Email tada-staff (at) mpi-inf.mpg.de
Lectures Thursdays, 10–12 o'clock (sharp) via Zoom and YouTube
Summary In this advanced course we'll be investigating hot topics in data mining and machine learning that the lecturer thinks are cool. This course is for those of you who are interested in Data Mining, Machine Learning, Data Science, Big Data Analytics – or, as the lecturer prefers to call it – Algorithmic Data Analysis. We'll be looking into how to discover significant and useful patterns from data, efficiently measure non-linear correlations and determine causal relations, as well as how to analyse structured data such as time series and graphs.

Preliminary Schedule

Month Day Topic Slides Assignment Req.
Reading
Opt.
Reading
Apr 15 Introduction and Practicalities 1st assignment out
22 Useful Patterns [1] [7,8,9]
29 Actionable Patterns deadline 1st, 2nd out [2] [10,11,12]
May 6 Significance [3] Sec 4.4, 5 [13,14,15,16]
13 yay, holiday — no class
20 Dependence [4] [17,18]
27 It's not Fair! deadline 2nd, 3rd out [19]
Jun 3 yay, holiday — no class
10 Causality [20] Ch 1, 6.1, 6.3 [21,20]
17 Causal Discovery
24 Causal Inference deadline 3rd, 4th out [20] Ch 2.1, 4 [22,23,24,25]
Jul 1 Epidemics in Graphs [5] [26,27,28,29]
8 Graph Summarization [6] [30,31,32,33]
15 Moar Graphs
22 Wrap-Up deadline 4th
(28), 29, 30 oral exams
September 30 re-exams

All report deadlines are on the indicated day at 10:00. In doubt, the assignment web page determines the exact date and time.

Materials

All required and optional reading will be made available. You will need a username and password to access the papers outside the MMCI/MPI network. Contact the lecturer if you don't know the username or password.

In case you do not have a strong enough background in data mining, machine learning, or statistics, these books may help to get you on your way [34,20,35] The university library kindly keeps hard copies of these books available in a so-called Semesteraparat.

Required Reading

[1] van Leeuwen, M. & Vreeken, J. Mining and Using Sets of Patterns through Compression. In Frequent Pattern Mining, Aggarwal, C. & Han, J., pages 165-198, Springer, 2014.
[2] Atzmueller, M. Subgroup Discovery. WIRE's Data Mining and Knowledge Discovery, 5:35-49, Wiley, 2015.
[3] Vreeken, J. & Tatti, N. Interesting Patterns. In Frequent Pattern Mining, Aggarwal, C. & Han, J., pages 105-134, Springer, 2014.
[4] Hoefer, T., Przyrembel, H. & Verleger, S. New Evidence for the Theory of the Stork. Paediatric and Perinatal Epidemiology, 18(1):88-92, 2004.
[5] Prakash, B.A., Vreeken, J. & Faloutsos, C. Spotting Culprits in Epidemics: How many and Which ones?. In Proceedings of the 12th IEEE International Conference on Data Mining (ICDM), Brussels, Belgium, IEEE, 2012.
[6] Koutra, D., Kang, U., Vreeken, J. & Faloutsos, C. VoG: Summarizing Graphs using Rich Vocabularies. In Proceedings of the 14th SIAM International Conference on Data Mining (SDM), Philadelphia, PA, pages 91-99, SIAM, 2014.

Optional Reading

[7] Vreeken, J., van Leeuwen, M. & Siebes, A. Krimp: Mining Itemsets that Compress. Data Mining and Knowledge Discovery, 23(1):169-214, Springer, 2011.
[8] Dalleiger, S. & Vreeken, J. Explainable Data Decompositions. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2020.
[9] Fischer, J. & Vreeken, J. Sets of Robust Rules, and How to Find Them. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Data (ECMLPKDD), Springer, 2019.
[10] Friedman, J.H. & Fisher, N.I. Bump Hunting in High-Dimensional Data. Statistics and Computing, 9:123-143, Springer, 1999.
[11] Boley, M., Goldsmith, B.R., Ghiringhelli, L. & Vreeken, J. Identifying Consistent Statements about Numerical Data with Dispersion-Corrected Subgroup Discovery. Data Mining and Knowledge Discovery, 31(5):1391-1418, Springer, 2017.
[12] Kalofolias, J., Boley, M. & Vreeken, J. Efficiently Discovering Locally Exceptional yet Globally Representative Subgroups. In Proceedings of the 17th IEEE International Conference on Data Mining (ICDM), New Orleans, LA, IEEE, 2017.
[13] Gionis, A., Mannila, H., Mielikäinen, T. & Tsaparas, P. Assessing data mining results via swap randomization. ACM Transactions on Knowledge Discovery from Data, 1(3), ACM, 2007.
[14] Hanhijärvi, S., Ojala, M., Vuokko, N., Puolamäki, K., Tatti, N. & Mannila, H. Tell me something I don't know: randomization strategies for iterative data mining. In Proceedings of the 15th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Paris, France, pages 379-388, ACM, 2009.
[15] Jaynes, E. On the rationale of maximum-entropy methods. Proceedings of the IEEE, 70(9):939-952, IEEE, 1982.
[16] De Bie, T. Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Mining and Knowledge Discovery, 23(3):407-446, Springer, 2011.
[17] Mandros, P., Boley, M. & Vreeken, J. Discovering Reliable Dependencies from Data: Hardness and Improved Algorithms. In Proceedings of the IEEE International Conference on Data Mining (ICDM'18), IEEE, 2018.
[18] Mandros, P., Boley, M. & Vreeken, J. Discovering Reliable Correlations in Categorical Data. In Proceedings of the IEEE International Conference on Data Mining (ICDM), 2019.
[19] Dwork, C., Hardt, M., Pitassi, T., Reingold, O. & Zemel, R. Fairness through Awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (ITCS), ACM, 2012.
[20] Peters, J., Janzing, D. & Schölkopf, B. Elements of Causal Inference. MIT Press, 2017.
[21] Pearl, J. The do-calculules revisited. In Proceedings of Uncertainty in AI, 2012.
[22] Janzing, D. & Schölkopf, B. Causal Inference Using the Algorithmic Markov Condition. IEEE Transactions on Information Technology, 56(10):5168-5194, 2010.
[23] Janzing, D., Mooij, J., Zhang, K., hang, , Lemeire, J., Zscheischler, J., Daniusis, P., Steudel, B. & Schölkopf, B. Information-geometric Approach to Inferring Causal Directions. , 182-183:1-31, 2012.
[24] Budhathoki, K. & Vreeken, J. Accurate Causal Inference on Discrete Data. In Proceedings of the IEEE International Conference on Data Mining (ICDM'18), IEEE, 2018.
[25] Marx, A. & Vreeken, J. Identifiability of Cause and Effect using Regularized Regression. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2019.
[26] Lappas, T., Terzi, E., Gunopulos, D. & Mannila, H. Finding effectors in social networks. In Proceedings of the 16th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Washington, DC, pages 1059-1068, ACM, 2010.
[27] Shah, D. & Zaman, T. Rumors in a Network: Who's the Culprit?. IEEE Transactions on Information Technology, 57(8):5163-5181, 2011.
[28] Sundareisan, S., Vreeken, J. & Prakash, B.A. Hidden Hazards: Finding Missing Nodes in Large Graph Epidemics. In Proceedings of the SIAM International Conference on Data Mining (SDM'15), SIAM, 2015.
[29] Rozenshtein, P., Gionis, A., Prakash, B.A. & Vreeken, J. Reconstructing an Epidemic over Time. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'16), pages 1835-1844, ACM, 2016.
[30] Chakrabarti, D., Papadimitriou, S., Modha, D.S. & Faloutsos, C. Fully automatic cross-associations. In Proceedings of the 10th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Seattle, WA, pages 79-88, 2004.
[31] Kang, U. & Faloutsos, C. Beyond Caveman Communities: Hubs and Spokes for Graph Compression and Mining. In Proceedings of the 11th IEEE International Conference on Data Mining (ICDM), Vancouver, Canada, pages 300-309, IEEE, 2011.
[32] Shah, N., Koutra, D., Zou, T., Gallagher, B. & Faloutsos, C. TimeCrunch: Interpretable Dynamic Graph Summarization. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 1055-1064, 2015.
[33] Goeble, S., Tonch, A., Böhm, C. & Plant, C. MeGS: Partitioning Meaningful Subgraph Structures Using Minimum Description Length. In Proceedings of the IEEE International Conference on Data Mining (ICDM), pages 889-894, IEEE, 2016.
[34] Aggarwal, C.C. Data Mining - The Textbook. Springer, 2015.
[35] Wasserman, L. All of Statistics. Springer, 2005.

Prerequisites

Students should have basic working knowledge of data analysis and statistics, e.g. by successfully having taken courses related to data mining, machine learning, and/or statistics, such as Information Retrieval and Data Mining, Machine Learning, Probabilistic Graphical Models, Probabilistic Machine Learning, Elements of Machine Learning, Elements of Statistical Learning, etc.

Registration

To register for the course, and to receive the credentials for the Zoom meetings, YouTube stream, and necessary materials, send an email from your university address to tada-staff (at) mpi-inf.mpg.de with the subject 'TADA Registration' by noon April 14th 2021, be sure include your name, matriculation number, and study subject in the email.

I will send out the credentials to registered students in the course of the afternoon of April 14th.

Late registrations will be accepted, but I cannot guarantee you will receive the credentials on time.

Lectures

The lectures start at 10:00. All lectures will be held online via Zoom and streamed live to YouTube. The recorded videos will be available until the end of the semester. The Zoom meetings, YouTube streams, and edited video will be linked from the schedule above. The credentials to access these will be distributed by email to all registered students. Questions can be asked during and right after the lecture. Please alert me to a question via the 'raise hand' option, and/or write it out via the Zoom chat option.

Assignments

Students will individually do one assignment per topic – four in total. For every assignment, you will have to read one or more research papers and hand in a report that critically discusses this material and answers the assignment questions. Reports should summarise the key aspects, but more importantly, should include original and critical thought that show you have acquired a meta level understanding of the topic – plain summaries will not suffice. All sources you've drawn from should be referenced. The expected length of a report is 3 pages, but there is no limit.

A sample assignment from 2015, together with a well-graded report can be found here.

The deadlines for the reports are on the day indicated in the schedule at 10:00 Saarbrücken standard-time. You are free to hand in earlier.

Grading and Exam

The assignments will be graded in scale of Fail, Pass, Good, and Excellent. Any assignment not handed in by the deadline is automatically considered failed, and cannot be re-done. You are allowed to re-do one Failed assignment: you have to hand in the improved assignment within two weeks. Two failures mean you are not eligible for the exam, and hence failed the course.

You can earn up to three bonus points by obtaining Excellent or Good grades for the assignments. An Excellent grade gives you one bonus point, as do every two Good grades, up to a maximum of three bonus points. Each bonus point improves your final grade by 1/3 assuming you pass the final exam. For example, if you have two bonus points and you receive 2.0 from the final exam, your final grade will be 1.3. You fail the course if you fail the final exam, irrespective of your possible bonus points. Failed assignments do not reduce your final grade, provided you are eligible to sit the final exam.

The final exams will be oral. The final exam will cover all the material discussed in the lectures and the topics on which you did your assignments. The main exam will be on July 29th. The re-exam will be on September 30th. The exact time slot per student will be announced per email. Inform the lecturer of any potential clashes as soon as you know them.

Online Lectures, Zoom, and Privacy

I would be happy if despite the current situation we can create a pleasant lecture environment. Personal interactions, with your camera switched on, strongly contribute to such an environment -- it's not much fun to hold a monologue to a screen full of black squares, nor do I then have any impression of whether you get the main points. I hence also strongly encourage you to ask questions verbally. Note that this is voluntary. You may switch off both your camera and your microphone, and register under a pseudonym. Questions are still possible, in particular using the chat function.

I decided to use Zoom as a videoconferencing service. Note that this provider (Zoom Video Communications, Inc., 55 Almaden Blvd, Suite 600, San Jose, CA 95113, USA) can access all data that you provide when registering for the video conference. If you do not provide personal data during the registration, there is still a possibility that Zoom identifies you using your IP address. Neither CISPA, nor the computer science department would have decided to use Zoom if we considered this as a significant risk. As an additional precaution, we have opted to use European computing centers. Should you still have privacy concerns (and are using an Internet Service Provider that can map IP addresses to your name), we suggest using an anonymization service such as Tor.

You can find Zoom's complete privacy policy here.