## Assignment 4: Graphs

The deadline is July 27th at 18:00 Saarbrücken standard-time. You are free to hand in earlier. You will have to choose one topic from the list below, read the articles, and hand in a report that critically discusses this material and answers the assignment questions. Reports should summarise the key aspects, but more importantly, should include original and critical thought that show you have acquired a meta level understanding of the topic – plain summaries will not suffice. All sources you use should be appropriately referenced, any text you quote should be clearly identified as such. The expected length of a report is between 3 to 5 pages, but there is no limit.

For the topic of your assignment, choose one of the following:

1. #### Growing a Graph

Generating synthetic graphs that exhibit realistic properties is a difficult problem with many applications. In the last few years the stochastic Kronecker approach [1] got very popular, receiving a lot of praise left and right. Recently, however, there have been a few papers that are more critical about the model [2].

Read [1], [2] and form your own opinion. Optionally, read (the relevant parts of) [3] for a more in-depth discussion.

(Bonus) Read also [4] and discuss the implications. Does this approach (ignoring attributes) improve over the Kronecker approach, or does it have additional flaws? Taking the attributes into account, is it convincing, are there any obvious or subtle flaws in the scheme, and how would you go about and extend it to generate even more realistic graphs?

2. #### Oddities

A large part of data mining is focused on discovering regularities in data, while oftentimes the exceptions to these underlying regularities are as, or perhaps even more informative to the domain expert. Detecintg anomalies, or outliers, in graphs is particularly challenging as unlike for a row in a table—where we only regard that row and that's it—to determine whether a node (or an edge) in a graph is odd, we may have to consider the whole graph.

Read [5], [6], [7], each of which look differently at graphs and identify different types of anomalies. Or, do they? Carefully examine what types of anomalies each method can catch, whether there are conceptual overlaps, which makes it possible (or even easy) to adapt them to discover (roughly) the same things, or whether they really do discover different things in your data. If they do, is that bad? If they don't, is that bad? Can you say what is the best method for anomaly detection? Most importantly, reason about how these methods determine what is an anomaly, and critically discuss whether this makes sense, or not, and why.

3. #### Hyper! Hyper!

A lot of research attention in graph mining is focused on the discovery of communities, groups of nodes that strongly interact. Many papers assume highly simplistic models, such as that communities strongly interact within, but only loosely to the outside, or, that communities do not overlap. Recently, different groups of researchers started considering that, just like graphs themselves, communities within these graphs may also show powerlaw-like edge distributions.

Read [8], [9], and [10], and critically discuss these approaches. Example questions you can address include the following. What are their (hidden or not) assumptions, what are the strengths, and what are the weaknesses? Consider also the evaluation, what are these methods trying to reconstruct? Is that a meaningful goal, or is it a self-fulfilling prophecy? Does any of these papers convince that actual communities follow their assumptions, or they only postulate so? The first paper considers overlapping communities, whereas the two others can not, which sounds like an important downside. Is it? Really? Why (not)? And so on.

4. #### The Root of All Evil

Discovering sources of viral infections is a hot area of research. Recently, within three months three groups of authors published results on how to discover culprits from incomplete data. Read these three papers [11,12,13] and discuss how (dis)similar the explored ideas and proposed solutions are. Be critical, try to see through the hype, not everything that is said to be radically different has to be such.

(Bonus) Do (any of) the proposed methods have any weaknesses? Do you see more general, more realistic problem settings can be solved? (For example, how would you extend [11] to the SIR model?) Read also [14], who consider identifying sources on dynamic, interaction graphs, without having to assume any normal infection model. This sounds very tasty, yet there is no such thing as a free lunch. Discuss the advantages and the disadvantages. In particular, draw the connections to the methods above, and investigate whether the shortest path idea can be back-ported into those settings. Warning: this part is open ended, it considers a research question that we are currently looking into ourselves.

Return the assignment by email to tada-staff@mpi-inf.mpg.de by 27 July, 1800 hours. The subject of the email must start with [TADA]. The assignment must be returned as a PDF and it must contain your name, matriculation number, and e-mail address together with the exact topic of the assignment.

Grading will take into account both Hardness of questions, as well as whether you answer the Bonus questions.

## References

You will need a username and password to access the papers outside the MPI network. Contact the lecturer if you don't know the username or password.

 [1] Leskovec, J. & Faloutsos, C. Realistic, Mathematically Tractable Graph Generation and Evolution, Using Kronecker Multiplication. In Proceedings of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), Porto, Portugal, 2005. [2] Seshadhri, C., Pinar, A. & Kolda, T. An in-depth analysis of stochastic Kronecker graphs. , 60(2):13-32, 2013. [3] Leskovec, J., Chakrabarti, D., Kleinberg, J., Faloutsos, C. & Ghahramani, Z. Kronecker Graphs: An Approach to Modeling Networks. Journal of Machine Learning Research, 11(Feb):985-1042, 2010. [4] Robles, P., Moreno, S. & Neville, J. Sampling of Attributed Networks from Hierarchical Generative Models. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'16), ACM, 2017. [5] Akoglu, L., McGlohon, M. & Faloutsos, C. oddball: Spotting Anomalies in Weighted Graphs. In Proceedings of Advances in Knowledge Discovery and Data Mining, 14th Pacific-Asia Conference (PAKDD'10), pages 410-421, 2010. [6] Perozzi, B. & Akoglu, L. Scalable Anomaly Ranking of Attributed Neighborhoods. In Proceedings of the SIAM International Conference on Data Mining (SDM'16), pages 207-215, 2016. [7] Shin, K., Eliassi-Rad, T. & Faloutsos, C. CoreScope: Graph Mining Using k-Core Analysis - Patterns, Anomalies and Algorithms. In Proceedings of the IEEE 16th International Conference on Data Mining (ICDM'16), pages 469-478, 2016. [8] Yang, J. & Leskovec, J. Overlapping Communities Explain Core-Periphery Organization of Networks. Proceedings of the IEEE, 102(12), IEEE, 2014. [9] Araujo, M., Günnemann, S., Mateos, G. & Faloutsos, C. Beyond Blocks: Hyperbolic Community Detection. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Data (ECMLPKDD'14), Springer, 2014. [10] Metzler, S., Günnemann, S. & Miettinen, P. Hyperbolae are no Hyperbole: Modelling Communities that are not Cliques. In Proceedings of the IEEE International Conference on Data Mining (ICDM'16), IEEE, 2016. [11] Sundareisan, S., Vreeken, J. & Prakash, B.A. Hidden Hazards: Finding Missing Nodes in Large Graph Epidemics. In Proceedings of the SIAM International Conference on Data Mining (SDM'15), SIAM, 2015. [12] Sefer, E. & Kingsford, C. Diffusion Archaeology for Diffusion Progression History Reconstruction. In Proceedings of the 14th IEEE International Conference on Data Mining (ICDM), 2014. [13] Farajtabar, M., Gomez-Rodriguez, M., Du, N., Zamani, M., Zha, H. & Song, L. Back to the Past: Source Identification in Diffusion Networks from Partially Observed Cascades. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2015. [14] Rozenshtein, P., Gionis, A., Prakash, B.A. & Vreeken, J. Reconstructing an Epidemic over Time. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'16), pages 1835-1844, ACM, 2016.