## Assignment 3: Correlation, Causation and Graphs

The deadline is June 30th at 10:00 Saarbrücken standard-time. You are free to hand in earlier. You will have to choose one topic from the list below, read the articles, and hand in a report that critically discusses this material and answers the assignment questions. Reports should summarise the key aspects, but more importantly, should include original and critical thought that show you have acquired a meta level understanding of the topic – plain summaries will not suffice. All sources you use should be appropriately referenced, any text you quote should be clearly identified as such. The expected length of a report is between 3 to 5 pages, but there is no limit.

For the topic of your assignment, choose one of the following:

1. #### Growing a Graph

Generating synthetic graphs that exhibit realistic properties is a difficult problem with many applications. In the last few years the stochastic Kronecker approach [1,2] got very popular, receiving a lot of praise left and right. Recently, however, there have been a few papers that are more critical about the model [3].

Read [1] and [3] and form your own opinion. Optionally, read (the relevant parts of) [2] for a more in-depth discussion.

2. #### Featuring Recursion and Aggregation

In the last twenty years a small group of European researchers has been working on multi-relational data mining: mining data over multiple relations. One key problem is how to generate informative local 'multi relational features'. One of the main solutions is called 'propositionalization'. Read Knobbe et al. [4] as an example.

In recent years mining graph-structured data has become very popular, in particular in the USA. Initially only simple undirected unweighted and unlabelled graphs were considered. These days more and more work appears on more complex graphs, considering more interesting problem settings. Read Henderson et al. [5] as an example.

Discuss the relation between graph mining and multi-relational data mining. Are they identical, similar, or ultimately different? How similar are the problems and the solutions these two papers propose? Is graph mining reinventing propositionalization? Why (not)? Is graph mining reinventing multi-relational data mining? Why (not)?

3. #### The Root of All Evil — (Hard)

Discovering sources of viral infections is a hot area of research. Recently, within three months three groups of authors published results on how to discover culprits from incomplete data. Read these three papers [6,7,8] and discuss how (dis)similar the explored ideas and proposed solutions are. Be critical, try to see through the hype, not everything that is said to be radically different has to be such.

(Bonus) Do (any of) the proposed methods have any weaknesses? Do you see more general, more realistic problem settings can be solved? (For example, how would you extend [6] to the SIR model?) Discuss. Warning: this part is open ended, it considers a research question that we are currently looking into ourselves. If you do well, you are invited to join our efforts and share in the eternal glory that awaits us.

4. #### Caveat Lector — (Hard)

Determining causal relations from data is perhaps the holiest of grails in data mining. Not surprisingly, it is not an easy task. We recently proposed a new approach to inferring whether $$X$$ causes $$Y$$, or vice versa. The main principle is simple; if describing $$X$$ is easier when given $$Y$$, than vice versa, we say that $$Y$$ is more likely to be caused $$X$$ than vice versa. To use this principle to use, we define a practical score based on cumulative entropy. Experiments show it works rather well. At the same time, it's a first attempt. And first attempts always mean there are some points of improvement.

In this assignment your task is to carefully read [9] and critically discuss it. That is, identify the most important choices the author makes, and evaluate whether you agree with him, or whether alternate solutions (also) make (more) sense. In other words, find as many points of improvement of the principle, as well as of the practical instantiation, as possible. That is, you should consider every aspect critically. Does considering $$\Delta_{X' \rightarrow Y}$$ instead of $$\Delta_{X \rightarrow Y}$$ really make more sense? Further, instead of calculating it directly, we estimate conditional and multivariate cumulative entropy. What are the effects of the way we estimate it on the causality score? Would it be possible to improve in theory? Would it be possible to improve in practice? How?

Return the assignment by email to tada-staff@mpi-inf.mpg.de by 30 June, 1000 hours. The subject of the email must start with [TADA]. The assignment must be returned as a PDF and it must contain your name, matriculation number, and e-mail address together with the exact topic of the assignment.

Grading will take into account both Hardness of questions, as well as whether you answer the Bonus questions.

## References

You will need a username and password to access the papers outside the MPI network. Contact the lecturer if you don't know the username or password.

 [1] Leskovec, J. & Faloutsos, C. Realistic, Mathematically Tractable Graph Generation and Evolution, Using Kronecker Multiplication. In Proceedings of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), Porto, Portugal, 2005. [2] Leskovec, J., Chakrabarti, D., Kleinberg, J., Faloutsos, C. & Ghahramani, Z. Kronecker Graphs: An Approach to Modeling Networks. Journal of Machine Learning Research, 11(Feb):985-1042, 2010. [3] Seshadhri, C., Pinar, A. & Kolda, T. An in-depth analysis of stochastic Kronecker graphs. , 60(2):13-32, 2013. [4] Knobbe, A., de Haas, M. & Siebes, A. Propositionalisation and Aggregates. In Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discoverys (PKDD), Freiburg, Germany, 2001. [5] Henderson, K., Gallagher, B., Li, L., Akoglu, L., Eliassi-Rad, T., Tong, H. & Faloutsos, C. It's who you know: graph mining using recursive structural features. In Proceedings of the 17th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), San Diego, CA, 2011. [6] Sundareisan, S., Vreeken, J. & Prakash, B.A. Hidden Hazards: Finding Missing Nodes in Large Graph Epidemics. In Proceedings of the SIAM International Conference on Data Mining (SDM'15), SIAM, 2015. [7] Sefer, E. & Kingsford, C. Diffusion Archaeology for Diffusion Progression History Reconstruction. In Proceedings of the 14th IEEE International Conference on Data Mining (ICDM), 2014. [8] Farajtabar, M., Gomez-Rodriguez, M., Du, N., Zamani, M., Zha, H. & Song, L. Back to the Past: Source Identification in Diffusion Networks from Partially Observed Cascades. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2015. [9] Vreeken, J. Causal Inference by Direction of Information. In Proceedings of the SIAM International Conference on Data Mining (SDM'15), SIAM, 2015.