Data is big. We dig it.

How can we discover novel and interesting things from data?

How can we obtain inherently interpretable models?

How can we draw reliable causal conclusions?

That's exactly what we develop theory and algorithms for.

The EDA group on June 8th 2017

On Thursday March 4th, **Panagiotis Mandros** successfully defended his Ph.D. thesis titled 'Discovering Robust Dependencies from Data'. The promotion committee, consisting of Profs. Dietrich Klakow, Gerhard Weikum, Geoff Webb, and Jilles Vreeken, decided that he not only passed the requirements for a degree of Doctor of Engineering but also awarded his thesis the distinction **Summa Cum Laude**. Congratulations, Dr.-Ing. Mandros!

Great success for the EDA group — four papers accepted for presentation at the 2021 SIAM International Conference on Data Mining (SDM).
**Alex** will present his joint work with Lincen Yang on estimating conditional mutual information for discrete-continuous mixtures.
**Boris** will present ProSeqo for mining concise yet powerful models from event sequence data.
**Janis** will present Susan, the structural similarity random walk kernel that he developed together with Pascal Welke.
Last, but not least, **Kailash** will present Dice for mining reliable causal rules. Congratulations to all!

In her Master thesis, **Edith Heiter** studies the problem of how to factor out prior knowledge from low-dimensional embeddings. In other words, how can we visualise a high-dimensional dataset such that we reveal structure that goes beyond what we already knew? In her thesis, Edith proposes not one, but two methods to factor out arbitrary distance matrices. With Jedi she proposes to adapt the objective function of *t-SNE* in a well-founded manner, while with Confetti she proposes a method that allows us to factor out knowledge from *arbitrary* embedding algorithms. Through many experiments, she shows that both work well in practice, earning her the title Master of Science. Congratulations, Edith!

On Monday July 3rd, **Kailash Budhathoki** successfully defended his Ph.D. thesis titled 'Causal Inference on Discrete Data'. The promotion committee, consisting of Profs. Dietrich Klakow, Gerhard Weikum, Tom Heskes, and Jilles Vreeken, decided that he not only passed the requirements for a degree of Doctor of Philosophy of the Natural Sciences (Dr.rer.nat.) but also awarded his thesis the distinction **Summa Cum Laude**. Congratulations, Dr. Budhathoki!

**EDA** will present three papers at ACM SIGKDD 2020, the flagship conference in data mining. **Jonas** will present his work on discovering patterns of mutual exclusivity, in which he proposed the Mexican algorithm. **Panagiotis** will present his work together with Frederic Penerath on how to use smoothing to measure and mine reliable functional dependencies, as well as work together with **David** on how to discover functional dependencies from mixed-type data.

How can we detect review spam campaigns, the groups of users that participate in these, as well as determine the spamicity of individual reviewers that actively try to hide their spamming behaviour? In her Master thesis, **Sandra Sukarieh** answers all three questions. The main premise is that a campaign requires multiple users and abnormal scores. Sprap identifies users that *surprisingly* often review products together with other users that *surprisingly* often score differently from the norm. Experiments show that Sprap works remarkably well in practice, without even having to consider the content of the reviews. In other words, Sandra's Master thesis campaign was a great success. Congratulations!

**Alexander Marx** has been invited to attend the Heidelberg Laureate Forum. While the actual event is postponed to next year due to Corona, he will then get to meet laureates of the most prestigious awards in Mathematics and Computer Science, such as Turing Award winners Manuel Blum, Vinton Cerf, Richard Karp, and Judea Pearl, as well as 199 other highly talented young scientists.

Warm welcome to **Joscha Cueppers** as a PhD student in the EDA group! Joscha recently finished his MSc thesis with us, and now joins to pursue his PhD. He'll be working on statistically well-founded pattern discovery from structured data, such as sequences and graphs, to gain insight in the causal processes that generated this data. Welcome, Joscha!

We warmly welcome **Corinna Coupette** as a PhD student in the EDA group! Corinna already holds a PhD in Law, as well as a Master's degree in Informatics. She will be working both on the theory of, as well as on methods for, meaningful analysis of complex graphs. She will work on the theory aspects with Christoph Lenzen of the Max Planck Institute for Informatics, and on methods for graph mining with Jilles Vreeken. Welcome, Corinna!

In his Master thesis, **Joscha Cueppers** considers the problem of discovering patterns that reliably predict future events. That is, he is interested in discovering sequential patterns from an event sequence \(X\) for which we can tell with high accuracy how long it will take until we see an interesting event happening in event sequence \(Y\). He models the problem using MDL, and proposes the Cake algorithm to discover a small set of non-redundant patterns that together predict \(Y\) as well as possible given \(X\). The experiments show the results are very tasty. Congratulations, Joscha!

Warm welcome to **Osman Ali Mian** as a PhD student in the EDA group! Osman recently finished his MSc thesis with us, and now joins to pursue his PhD. He'll be working on theory and methods for doing causal inference in realistic settings – e.g. methods that scale, can deal with data from multiple sources, can deal with missing data, and so on. Welcome, Osman!

Suppose we are given multiple snapshots of a graph over time. How can we discover patterns of change and similarity between them? **Divyam Saran** studied this problem for his Master thesis, and proposed the MDL-based Mango algorithm to discover succinct and non-redundant summaries that give clear insight into what is happening between the graphs. In a nutshell, he discovers significant structure per graph, and then uses the structures from adjacent graphs to refine the overall temporal summary – identifying growing, shrinking, and changing structures such as cliques, stars, and bipartite subgraphs. Congratulations, Divyam!

How can we discover fully oriented causal networks from observational data? In his Master thesis, **Osman Ali Mian** shows how we can use the Algorithmic Markov Condition to not only discover high quality causal skeletons, but at the same time orient all the edges from cause to effect. To find such networks from data, he proposes Globe, which instantiates the ideal using MDL and non-parametric multivariate regression splines. The experiments show that his proposal outperforms state-of-the-art constraint-based as well as score-based methods. Congratulations, Osman!

We warmly welcome **Boris Wiegand** as a PhD student in the EDA group! Boris is employed by the Dillinger steel works, and will work on topics related to extracting high-quality models from production logs – for example, to gain insight in patterns, bottlenecks, as well as to optimize both planning and production. He will be co-supervised by Jilles Vreeken and Dietrich Klakow. Welcome, Boris!

**Panagiotis** has been invited to attend the Heidelberg Laureate Forum. During the 3rd week of September he will get to meet laureates of the most prestigious awards in Mathematics and Computer Science, such as Turing Award winners Manuel Blum, Vinton Cerf, Richard Karp, and Yoshua Bengio, as well as 199 other highly talented young scientists.

While almost all data analysis methods produce only one single model, reality is much more complex, much more layered than that. How can we discover not one, but *multiple* high-quality explanations for a dataset, each of which shows increasingly yet significantly more detail than the others? This is exactly the question that **Simina Ana Cotop** answers in her Master thesis, in which she proposes the Grim algorithm that instantiates Kolmogorov's structure function for pattern-based summarization. Through many experiments she shows that Grim indeed returns insightful high-level as well as detailed in-depth summaries. Congratulations, Simina!

Out of 948 submissions, the award committee of IEEE ICDM 2018 selected our paper Discovering Reliable Dependencies from Data: Hardness and Improved Algorithms
by **Panagiotis Mandros**, **Mario Boley**, and **Jilles Vreeken**
for the IEEE ICDM 2018 Best Paper Award! We will receive the award in Singapore on November 19th. Hurray!

The IEEE ICDM Tao Li Award recognizes excellent early career researchers for their research contributions, impact, and services within the first ten years of obtaining their PhD. This inaugural year, the award committee selected **Jilles Vreeken** for this honour; he is both deeply honoured and uncharacteristically speechless.

While we're very sad that **Mario Boley** will leave us, we are very happy that on October 1st 2018 he will take the next step in his career and join Monash University in Melbourne, Australia as Tenure Track faculty. We wish Mario all the best, and are looking forward to continuing to work together on topics such as subgroup and functional dependency discovery. Congratulations, Mario!

**Kailash Budhathoki** and **Panagiotis Mandros** will present two papers at IEEE ICDM 2018 in Singapore. **Kailash** will present his work on accurate causal inference on discrete data, in which he shows that by simply optimising the residual entropy we can accurately identify the most likely causal direction—with guarantees. **Panagiotis** will present his work on discovering reliable approximate functional dependencies, in which he shows that although this problem is NP-hard, using his optimistic estimator we can solve it exactly in reasonable time, as well as get extremely good solutions using a greedy strategy too.

**Iva Farag** was unhappy with the fact that Slim was restricted to using patterns without overlap, and looked into the theoretical details as well as the practical algorithmics for how to alleviate this. In her Master thesis, she shows that the problem is related to weighted set cover, and based on this proposes three cover algorithms that do allow overlap, two of which give guarantees on the quality of the solution. Experiments show that with GreCo we find more succinct, more insightful patterns that are less prone to fitting noise. Congratulations, Iva!
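
For readers unfamiliar with weighted set cover, the classic greedy heuristic can be sketched as follows. This is a generic textbook illustration, not one of the cover algorithms from Iva's thesis; the function name, the toy patterns, and their costs are all hypothetical. In each round it picks the candidate with the lowest cost per newly covered item, and the chosen sets are allowed to overlap.

```python
def greedy_weighted_set_cover(universe, candidates):
    """Greedy heuristic for weighted set cover (generic sketch).

    universe:   set of items that must be covered
    candidates: dict mapping a name to (set of items, positive cost)
    Returns the names of the chosen candidates, in order of selection.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best_name, best_ratio = None, float("inf")
        for name, (items, cost) in candidates.items():
            gain = len(items & uncovered)  # items this candidate newly covers
            if gain == 0:
                continue
            ratio = cost / gain  # cost per newly covered item
            if ratio < best_ratio:
                best_name, best_ratio = name, ratio
        if best_name is None:
            raise ValueError("the candidates cannot cover the universe")
        chosen.append(best_name)
        uncovered -= candidates[best_name][0]
    return chosen


# Hypothetical toy input: item sets with made-up costs.
patterns = {
    "abc": ({1, 2, 3}, 3.0),
    "cd": ({3, 4}, 1.0),
    "de": ({4, 5}, 1.0),
    "all": ({1, 2, 3, 4, 5}, 6.0),
}
cover = greedy_weighted_set_cover({1, 2, 3, 4, 5}, patterns)
print(cover)
```

In this toy run the selected sets share items 3 and 4, which is exactly the kind of overlap a cover without overlap would forbid.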

With Smoothie, **Maha Aburahma** proposes a parameter-free algorithm for smoothing discrete data. In short, given a noisy transaction database, the algorithm makes local adjustments such that the overall MDL-complexity of the data and model is minimised. It does so step by step, providing a continuum of increasingly smoothed data. The MDL-optimum coincides with the optimal denoised data, which lends itself to pattern mining and knowledge discovery. Congratulations, Maha!

For her Master thesis, **Yuliia Brendel** studied how we can recover the dependency network over a multivariate continuous-valued data set, without having to assume anything about the data distribution. She did so using the notion of cumulative entropy, and proposes the Grip algorithm to robustly estimate it for the multivariate case. Experiments show that Grip performs very well even for highly non-linear, highly noisy, and high-dimensional data and dependencies.
Congratulations, Yuliia!

During his studies, **Boris Wiegand** worked at the Dillinger steel plant, where, among other things, they use specialized rolling mills to turn chunks of red-hot steel with high precision into plates of specified thickness. The rolls in these mills undergo incredible temperature and pressure, and hence need to be replaced every so often. The question is, when? In his Master thesis, Boris proposed a data-driven model that outperforms the industry-standard physics-based model, and showed how we can use it to optimize the milling schedule. Congratulations, Boris!

In her Master thesis, **Maike Eissfeller** considered the problem of how to identify which nodes were most likely responsible for starting an *epidemic* in a large, weighted graph. She built upon the NetSleuth algorithm, and showed how to extend the theory to weighted graphs, how to make it more robust against the non-convex score, and how to improve its results by local re-optimization.
Congratulations, Maike!

Given two discrete-valued time series, can we tell whether they are causally related? That is, can we tell whether \(x\) causes \(y\), or whether \(y\) causes \(x\)? In the paper he presented on May 3rd at the SIAM Data Mining Conference, Kailash shows we can do so accurately, efficiently, and without having to make assumptions on the distribution of these time series, or about the lag of the causal effect. You can find the paper and implementation here.

**Tatiana Dembelova** received her Master of Science degree for her thesis on how to discretize multivariate data such that we maintain the most important interactions between the attributes. In particular, she showed that existing work based on interaction distances performs less well than desired, and proposed a new approach based on footprint interactions that is highly robust against noise and the curse of dimensionality both in theory and in practice.
Congratulations, Tatiana!

**Robin Burghartz** received his Master of Science degree for his thesis on how to identify interesting non-redundant pattern sets through the use of adaptive codes. Loosely speaking, he showed that when describing a row of data, if we adaptively consider only those patterns that we know we can possibly use, instead of all of them, then only those patterns that stand out strongly from the ones already selected are chosen, leading to much smaller and much less redundant pattern sets.
Congratulations, Robin!

**Henrik Jilke** presented his Master thesis on the efficient discovery of powerlaw-distributed communities in large graphs. He proposed a lossless score based on the Minimum Description Length principle to identify whether a subgraph stands out sufficiently to be considered a community, and gave the efficient Explore algorithm to heuristically discover the best set of such communities. Experiments validate that his method is able to discover large, powerlaw-distributed communities that other methods miss.
Congratulations, Henrik!