Data is big. We dig it.

How can we discover novel and interesting things from data?

How can we explore the data, and not our assumptions?

That's what we develop theory and algorithms for.

The EDA group on June 8th 2017

The IEEE ICDM Tao Li Award recognizes excellent early-career researchers for their research contributions, impact, and service within the first ten years of obtaining their PhD. In this inaugural year, the award committee selected **Jilles Vreeken** for this honour — who is both deeply honoured, and uncharacteristically speechless.

Out of 948 submissions, the award committee of IEEE ICDM 2018 selected our paper *'Discovering Reliable Dependencies from Data: Hardness and Improved Algorithms'* by **Panagiotis Mandros**, **Mario Boley**, and **Jilles Vreeken** for the IEEE ICDM 2018 Best Paper Award! We will receive the award in Singapore on November 19th. Hurray!

**Kailash Budhathoki** and **Panagiotis Mandros** will present two papers at IEEE ICDM 2018 in Singapore. **Kailash** will present his work on accurate causal inference on discrete data, in which he shows that by simply optimising the residual entropy we can accurately identify the most likely causal direction—with guarantees. **Panagiotis** will present his work on discovering reliable approximate functional dependencies, in which he shows that although this problem is NP-hard, using his optimistic estimator we can solve it exactly in reasonable time, and also obtain extremely good solutions using a greedy strategy.
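The residual-entropy idea admits a compact sketch. The version below is a toy illustration only, not the method from the paper: it fits the function from cause to effect as the per-value conditional mode, a simple heuristic without the paper's guarantees, and all function names here are ours.

```python
from collections import Counter
import math

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def residual_entropy(x, y):
    """Entropy of the residuals y - f(x), where f maps each x-value to
    the conditional mode of y (a heuristic stand-in for the optimiser)."""
    by_x = {}
    for xi, yi in zip(x, y):
        by_x.setdefault(xi, []).append(yi)
    f = {xi: Counter(ys).most_common(1)[0][0] for xi, ys in by_x.items()}
    return entropy([yi - f[xi] for xi, yi in zip(x, y)])

def infer_direction(x, y):
    """Prefer the direction whose residuals have lower entropy."""
    hxy = residual_entropy(x, y)  # score for X -> Y
    hyx = residual_entropy(y, x)  # score for Y -> X
    if hxy < hyx:
        return "X -> Y"
    if hyx < hxy:
        return "Y -> X"
    return "undecided"
```

On data where Y is a deterministic, non-invertible function of X, the causal direction yields zero residual entropy while the anticausal one does not, so the sketch recovers the true direction.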

**Iva Farag** was unhappy with the fact that Slim was restricted to using patterns without overlap, and looked into both the theoretical details and the practical algorithmics of how to alleviate this restriction. In her Master thesis, she shows that the problem is related to weighted set cover, and based on this proposes three cover algorithms that do allow overlap, two of which come with guarantees on the quality of the solution. Experiments show that with GreCo we find more succinct, more insightful patterns that are less prone to fitting noise. Congratulations, Iva!
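The connection to weighted set cover is also what makes a greedy strategy attractive here. As a generic illustration (this is the textbook greedy algorithm, not Iva's GreCo), in each round we pick the set with the lowest weight per newly covered element:

```python
def greedy_weighted_set_cover(universe, sets, weights):
    """Classic greedy for weighted set cover: repeatedly pick the set
    with the lowest cost per newly covered element. This strategy is
    known to give a logarithmic-factor approximation guarantee."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        candidates = [i for i, s in enumerate(sets) if s & uncovered]
        if not candidates:
            raise ValueError("sets do not cover the universe")
        best = min(candidates,
                   key=lambda i: weights[i] / len(sets[i] & uncovered))
        chosen.append(best)
        uncovered -= sets[best]
    return chosen
```

For example, with two cheap disjoint sets and one expensive set covering everything, the greedy picks the cheap pair; flip the weights and it picks the single large set instead.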

With Smoothie, **Maha Aburahma** proposes a parameter-free algorithm for smoothing discrete data. In short, given a noisy transaction database, the algorithm makes local adjustments such that the overall MDL-complexity of the data and model is minimised. It does so step by step, providing a continuum of increasingly smoothed data. The MDL optimum coincides with the optimally denoised data, which lends itself to pattern mining and knowledge discovery. Congratulations, Maha!

For her Master thesis, **Yuliia Brendel** studied how we can recover the dependency network over a multivariate continuous-valued data set, without having to assume anything about the data distribution. She did so using the notion of cumulative entropy, and proposes the Grip algorithm to robustly estimate it for the multivariate case. Experiments show that Grip performs very well even for highly non-linear, highly noisy, and high-dimensional data and dependencies.
Congratulations, Yuliia!
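In one dimension, the cumulative entropy \(h(X) = -\int P(X \le x) \log P(X \le x)\,dx\) has a simple plug-in estimator based on the order statistics of the sample. Below is a minimal sketch; estimating it robustly for the multivariate case, as Grip does, is considerably more involved.

```python
import math

def cumulative_entropy(sample):
    """Empirical cumulative entropy of a one-dimensional sample:
    h(X) = -integral of P(X <= x) * log P(X <= x) dx,
    with the cdf replaced by its empirical step function."""
    xs = sorted(sample)
    n = len(xs)
    # between xs[i] and xs[i+1] the empirical cdf equals (i + 1) / n
    return -sum((xs[i + 1] - xs[i]) * (i + 1) / n * math.log((i + 1) / n)
                for i in range(n - 1))
```

Unlike differential entropy, this quantity is always non-negative, is zero exactly for constant data, and scales linearly with the data: \(h(aX) = a\,h(X)\) for \(a > 0\).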

In her Master thesis, **Maike Eissfeller** considered the problem of how to identify which nodes were most likely responsible for starting an *epidemic* in a large, weighted graph. She built upon the NetSleuth algorithm, and showed how to extend the theory to weighted graphs, how to make it more robust against the non-convex score, and how to improve its results by local re-optimization.
Congratulations, Maike!

Given two discrete-valued time series, can we tell whether they are causally related? That is, can we tell whether \(x\) causes \(y\), or whether \(y\) causes \(x\)? In the paper he presented on May 3rd at the SIAM Data Mining Conference, Kailash shows we can do so accurately, efficiently, and without having to make assumptions on the distribution of these time series, or about the lag of the causal effect. You can find the paper and implementation here.

**Tatiana Dembelova** received her Master of Science degree for her thesis on how to discretize multivariate data such that we maintain the most important interactions between the attributes. In particular, she showed that existing work based on interaction distances performs less well than desired, and proposed a new approach based on footprint interactions that is highly robust against noise and the curse of dimensionality, both in theory and in practice.
Congratulations, Tatiana!

**Robin Burghartz** received his Master of Science degree for his thesis on how to identify interesting non-redundant pattern sets through the use of adaptive codes. Loosely speaking, he showed that when describing a row of data, if we adaptively consider only those patterns we know we can possibly use, instead of all, we can identify those patterns that stand out strongly from those already selected, leading to much smaller and much less redundant pattern sets.
Congratulations, Robin!

**Henrik Jilke** presented his Master thesis on the efficient discovery of powerlaw-distributed communities in large graphs. He proposed a lossless score based on the Minimum Description Length principle to identify whether a subgraph stands out sufficiently to be considered a community, and gave the efficient Explore algorithm to heuristically discover the best set of such communities. Experiments validate that his method is able to discover large, powerlaw-distributed communities that other methods miss.
Congratulations, Henrik!

**Benjamin Hättasch** finished his Master of Science by handing in his thesis on the automatic refinement of ontologies using compression-based learning. In a nutshell, Benjamin shows how we can efficiently describe a given text using an ontology. His main result is the Refine algorithm, which iteratively refines the ontology such that we maximize the compression. The resulting ontologies are a much better representation of the text distribution, and allow him to identify the key topics of the text without supervision.
Congratulations, Benjamin!

We are happy and proud to announce that **Jonas Fischer** got accepted as a PhD student in the International Max Planck Research School for Computer Science (IMPRS-CS) to pursue a PhD on the topic of algorithmic data analysis.
He was already a student in the Saarbrücken Graduate School of Computer Science, and recently finished his Master thesis in Bioinformatics on the topic of highly efficient methylation calling.

We are excited to announce that **David Kaltenpoth** got accepted as a PhD student in the International Max Planck Research School for Computer Science (IMPRS-CS). He was already a member of the Saarbrücken Graduate School of Computer Science.
He will work on the topic of information theoretic causal inference, in particular the theory and practice of determining whether potential causal dependencies are confounded.

We warmly welcome **Sebastian Dalleiger** as a PhD student in the Exploratory Data Analysis group.
Sebastian finished his Master's in Informatics at Saarland University in 2016, and will now join our group to work on information theoretic approaches to mining interpretable and useful structure from data.

We warmly welcome **Janis Kalofolias** as a PhD student in the Exploratory Data Analysis group.
Janis recently finished his Master's in Informatics at Saarland University, and will now join our group to work on the theoretical foundations of mining interesting patterns from data.

We are happy to announce that **Alexander Marx** got accepted as a PhD student in the International Max Planck Research School for Computer Science (IMPRS-CS) and the Saarbrücken Graduate School of Computer Science!
He will work on the efficient discovery and interpretable description of interesting sub-populations in data, with the grand goal of discovering causal dependencies that lead to novel materials.

**Amirhossein Baradaranshahroudi** finished his Master of Science by handing in his thesis on the fast discovery of non-linearly correlated segments in multivariate time series. In his thesis, Amir shows that through the fast Fourier transform, convolution, and pre-computation, we can bring the computational complexity of computing the distance correlation between all pairwise windows down to \(O(n^4 \log n)\) from \(O(n^5 d)\). For discovery in long time series, he proposes an effective and efficient heuristic that only takes \(O(nwd)\) time.
Congratulations, Amir!
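For reference, the quantity at the heart of the thesis, the distance correlation between two windows, can be computed naively in quadratic time per pair. The sketch below is a plain textbook implementation for univariate samples, without any of the speed-ups from the thesis.

```python
import math

def distance_correlation(x, y):
    """Naive O(n^2) distance correlation between two equal-length
    univariate samples: double-center the pairwise distance matrices,
    then normalise their inner product."""
    n = len(x)

    def centered(v):
        # pairwise distance matrix, double-centered
        d = [[abs(v[i] - v[j]) for j in range(n)] for i in range(n)]
        row = [sum(r) / n for r in d]
        grand = sum(row) / n
        return [[d[i][j] - row[i] - row[j] + grand for j in range(n)]
                for i in range(n)]

    A, B = centered(x), centered(y)
    dcov2 = max(sum(A[i][j] * B[i][j]
                    for i in range(n) for j in range(n)) / n**2, 0.0)
    dvarx = sum(a * a for r in A for a in r) / n**2
    dvary = sum(b * b for r in B for b in r) / n**2
    if dvarx == 0 or dvary == 0:
        return 0.0  # a constant window carries no dependence to measure
    return math.sqrt(dcov2 / math.sqrt(dvarx * dvary))
```

Distance correlation equals 1 for any exact linear relationship and, unlike Pearson correlation, is zero only under independence, which is what makes it suitable for detecting non-linear dependence between windows.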

**Apratim Bhattacharyya** finished his Master of Science by handing in his thesis *'Squish: Efficiently Summarising Event Sequences with Rich and Interleaving Patterns'*. Squish improves over the state of the art by considering a much richer description language, allowing both nesting and interleaving of patterns, as well as variants and partial occurrences of patterns. Moreover, Squish is not only orders of magnitude faster than the state of the art, experiments show it also discovers much better and more easily interpretable models. Congratulations, Apratim!

**Beata Wójciak** handed in her thesis *'Spaghetti: Finding Storylines in Large Collections of Documents'* on the 29th of September, and so fulfilled the requirements to become a Master of Science in Informatics. In her thesis, Beata studied the problem of making sense of large, time-stamped collections of documents, and proposed the efficient Spaghetti algorithm to discover the storylines in a corpus. This allows us to draw a map showing which documents are connected, as well as to easily interpret the storylines.
Congratulations, Beata!

For his Bachelor thesis, **Magnus Halbe** studied whether sketching can speed up Slim. In particular, he investigated whether DHP and min-hashing can be used to reliably and efficiently identify co-occurring patterns. In his thesis, titled *'Skim: Alternative Candidate Selections for Slim through Sketching'*, Magnus shows that the answer is *'not really'*: whereas the sketches ably identify heavy hitters, they are less efficient at identifying more subtle patterns. He therefore proposes the Skim algorithm, which combines the best of both worlds.
Congratulations, Magnus!

Last summer, **Polina Rozenshtein** did an internship in our group. She presented the resulting paper *'Reconstructing an Epidemic over Time'* at ACM SIGKDD 2016. Together with B. Aditya Prakash and Aris Gionis, we studied the problem of finding the seed nodes of an epidemic, given an interaction graph and a sparse and noisy sample of node states over time. We propose the CulT (Culprits in Time) algorithm, which reliably, efficiently, and without making any assumptions on the viral process recovers both the number and the location of the original seed nodes.
We give a short explanation, with kittens, on YouTube here.

During the summer of 2014, **Roel Bertens** did an internship in our group. He presented the outcome, *'Keeping it Short and Simple: Summarising Complex Event Sequences with Multivariate Patterns'*, at ACM SIGKDD 2016. In this paper we propose the Ditto algorithm, an efficient heuristic for finding succinct summaries of complex event sequences, with patterns that can span multiple attributes and may include gaps. You can find the paper and implementation here, and for a short introduction, see our YouTube video here.

**Kailash Budhathoki** has been invited to attend the 4th Heidelberg Laureate Forum. Between the 18th and 23rd of September 2016, he will get to meet laureates of the most prestigious awards in Mathematics and Computer Science, such as Turing Award winners Manuel Blum, Vinton Cerf, Richard Karp, and John Hopcroft, as well as 199 other highly talented young scientists.

At the SIAM International Conference on Data Mining 2016, **Panagiotis Mandros** presented uds, which allows for *Universal Dependency Analysis*. That is, it is a robust and efficient measure for non-linear and multivariate correlations, which does *not* require any prior assumptions, yet *does* allow for meaningful comparison, no matter the cardinality or distribution of the subspace.

At the same venue, **Jilles Vreeken** presented both light, a linear-time method for detecting non-linear change points in massively high-dimensional time series, as well as flexi, a highly flexible method for mining high-quality subgroups through optimal discretisation that works with virtually *any* quality measure.