Subgroup Discovery SS'17


News

more ▾

Course Information

Type Seminar (7 ECTS)
Lecturer Dr. Mario Boley
Email mboley+sgd (at) mpi-inf.mpg.de
Lectures Wednesday, 10:15–11:45 in Room E1.7 3.23.
Max Capacity Participants are by now fixed. Guests are always welcome.
Summary In this seminar we'll be investigating the following questions: How can we identify whether there exist any sub-populations in your data that stand out and are easy to describe? How can we efficiently discover such subgroups from data? How can we do so with optimality guarantees? How can we define 'standing out' in a meaningful manner? and, what does easy to describe mean? We'll be exploring this from a local pattern mining, or better, subgroup discovery perspective.

Introduction

Areas in the east of Germany with very low average income have the highest fraction of non-voters. First-class passengers with a low number of siblings on board had the highest chance to survive the sinking of the Titanic. Subgroup Discovery is a Data Mining technique for the automatic discovery of such and similar patterns from data. In this seminar we will investigate together the following questions: "What are important applications of subgroup discovery and how can we translate their requirements into a formal discovery task?" and "How can we efficiently solve the resulting computational problems?"

Schedule

Month Day What Details Slides
May 24 Lecture Introduction to seminar and subgroup discovery PDF
May 31 Lecture Topic intro 1: Branch-&-bound (A), Optimistic estimators (B), Redundancy (C) PDF
June 7 Lecture Topic intro 2: Heuristic solvers (D), Statistical considerations (E), Applications (F) PDF
June 14 Lecture Report and presentation instructions, Q&A PDF
June 30 Deadline Extended abstract
July 18 Deadline Paper draft
July 26 Deadline Paper
August 2 Talks Student presentations (9-12 am)
August 4 Talks Student presentations (9-12 am)

Please submit required documents on day of deadline via mail to lecturer (and mentor if different from lecturer).

Materials

All required reading will be made available here. You will need a username and password to access the papers outside the MPI network. Contact the lecturer if you don't know the username or password. Latex templates for the paper and the extended abstract are here.

[1] Atzmueller, M. Subgroup discovery. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 5(1):35-49, Wiley Online Library, 2015.
[2] Boley, M., Goldsmith, B.R., Ghiringhelli, L.M. & Vreeken, J. Identifying Consistent Statements about Numerical Data with Dispersion-Corrected Subgroup Discovery. arXiv preprint arXiv:1701.07696, 2017.
[3] Boley, M., Lucchese, C., Paurat, D. & Gärtner, T. Direct local pattern sampling by efficient two-step random procedures. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 582-590, 2011.
[4] Boley, M. & Grosskreutz, H. Non-redundant subgroup discovery using a closure system. Machine Learning and Knowledge Discovery in Databases, pages 179-194, Springer, 2009.
[5] Duivesteijn, W. & Knobbe, A. Exploiting False Discoveries--Statistical Validation of Patterns and Quality Measures in Subgroup Discovery. In Data Mining (ICDM), 2011 IEEE 11th International Conference on, pages 151-160, 2011.
[6] Duivesteijn, W., Feelders, A. & Knobbe, A. Different slopes for different folks: mining for exceptional regression models with cook's distance. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 868-876, 2012.
[7] Goldsmith, B.R., Boley, M., Vreeken, J., Scheffler, M. & Ghiringhelli, L.M. Uncovering structure-property relationships of materials by subgroup discovery. New Journal of Physics, 19(1):013031, IOP Publishing Ltd and Deutsche Physikalische Gesellschaft, 2017.
[8] Grosskreutz, H., Rüping, S. & Wrobel, S. Tight optimistic estimates for fast subgroup discovery. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 440-456, 2008.
[9] Grosskreutz, H., Boley, M. & Krause-Traudes, M. Subgroup discovery for election analysis: a case study in descriptive data mining. In International Conference on Discovery Science, pages 57-71, 2010.
[10] Grosskreutz, H. & Paurat, D. Fast and memory-efficient discovery of the top-k relevant subgroups in a reduced candidate space. Machine Learning and Knowledge Discovery in Databases, pages 533-548, Springer, 2011.
[11] Lavra\vc, N., Kav\vsek, B., Flach, P. & Todorovski, L. Subgroup discovery with CN2-SD. Journal of Machine Learning Research, 5(Feb):153-188, 2004.
[12] Lemmerich, F., Atzmueller, M. & Puppe, F. Fast exhaustive subgroup discovery with numerical target concepts. Data Mining and Knowledge Discovery, 30(3):711-762, Springer, 2016.
[13] Li, J., Liu, J., Toivonen, H., Satou, K., Sun, Y. & Sun, B. Discovering statistically non-redundant subgroups. Knowledge-Based Systems, 67:315-327, Elsevier, 2014.
[14] Moens, S. & Boley, M. Instant exceptional model mining using weighted controlled pattern sampling. In International Symposium on Intelligent Data Analysis, pages 203-214, 2014.
[15] Schmidt, J., Hapfelmeier, A., Mueller, M., Perneczky, R., Kurz, A., Drzezga, A. & Kramer, S. Interpreting PET scans by structured patient data: a data mining case study in dementia research. Knowledge and Information Systems, 24(1):149-170, Springer, 2010.
[16] Song, H., Kull, M., Flach, P. & Kalogridis, G. Subgroup Discovery with Proper Scoring Rules. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 492-510, 2016.
[17] Webb, G.I. OPUS: An efficient admissible algorithm for unordered search. Journal of Artificial Intelligence Research, 3:431-465, 1995.
[18] Webb, G.I. Discovering associations with numeric variables. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 383-388, 2001.

Structure and Format

In general terms, the course will consist of

  1. introductory lectures, and
  2. reading, discussing and presenting scientific articles

The seminar is split into two parts.

The first phase will take place in four consecutive weeks during the semester (May 22 to June 16---one meeting per week, on Wednesday between 10:00 and 12:00). In this phase there will be two to three introductory lectures on subgroup discovery basics and some initial reading assignments to all participants. The phase is concluded by one discussion meeting where we identify key scientific questions and problems.

In the second phase, students prepare one detailed report (8 pages) and one short lecture (40m) on one assigned topic as well as one short report (1 page) on a secondary topic. Finally, we will conclude the seminar with the student lectures, which will be held in block-form in the first week after the end of the course period (July 31 to August 4), during which there most likely will be two to four presentations per day, most likely between 9:00 and 13:00. The exact times, and whether we will use all days, are to be decided.

Prerequisites

Students should have basic working knowledge of data analysis and statistics, e.g. by successfully having taken courses related to data mining, machine learning, and/or statistics, such as Topics in Algorithmic Data Analysis, Information Retrieval and Data Mining, Machine Learning, Probabilistic Graphical Models, Statistical Learning, etc.

Student Presentations

Every student will have to give a presentation and write a paper on an assigned main topic as well as handing on one short extended abstract on a secondary topic.

The presentation should be approximately 30 minutes long, and will be followed by a 10 minute discussion. The style of the presentation should be like a lecture; your fellow students can follow it with only the previous lectures of the course as background material, you should only discuss those details that are necessary, and make use of examples where possible. Topic will be assigned by the lecturer.

In addition to the presentation, students will have to hand in a paper that discusses the assigned topic. Unlike the presentation, which has to be sufficiently high-level, the paper is where you are allowed to show what you've learned in more detail. Summarize the material you read as clearly as you can in your own words, identifying the key contributions/most interesting or important aspects, relating the topic to any/all previous lectures in the course and/or papers read for the course, all the while correctly referring to the sources you've drawn from. The expected length of the report is 8 pages.

Finally, the extended abstract about the a secondary topic is a short document of at most two pages. Please use to the provided Latex templates for both documents. The templates also contain further information on required content and style.

Students will be graded on the slides, the presentation, your answers in the discussion, and the two written doscuments. Students are allowed to ask for feedback on their slides once a week, up till latest 1 week before the presentation. The provided reading material are provided as starting points. For some topics the provided materials will be enough, for others you may want to gather more material. When in doubt, contact the lecturer.