Exploratory Data Analysis.
Data is big. We dig it.

How can we discover novel and interesting things from data?
How can we explore the data, and not our assumptions?
That's what we develop theory and algorithms for.

The EDA group at work on June 24th 2016

News

Janis Kalofolias

Janis joins EDA as a PhD student

We warmly welcome Janis Kalofolias as a PhD student in the Exploratory Data Analysis group. Janis recently finished his Master's in Informatics at Saarland University, and will now join our group to work on the theoretical foundations of mining interesting patterns from data.

(7 November 2016)

Alexander Marx

Alex receives IMPRS-CS PhD Fellowship

We are happy to announce that Alexander Marx has been accepted as a PhD student in the International Max Planck Research School for Computer Science (IMPRS-CS) and the Saarbrücken Graduate School of Computer Science! He will work on the efficient discovery and interpretable description of interesting sub-populations in data, with the grand goal of discovering causal dependencies that can lead to novel materials.

(1 November 2016)

Amirhossein Baradaranshahroudi

Amir proposes BVCorr to discover non-linearly correlated segments

Amirhossein Baradaranshahroudi finished his Master of Science by handing in his thesis on the fast discovery of non-linearly correlated segments in multivariate time series. In his thesis, Amir shows that through the fast Fourier transform, convolution, and pre-computation we can bring down the computational complexity of computing the distance correlation between all pairwise windows to \(O(n^4 \log n)\) from \(O(n^5 d)\). For discovery in long time series, he proposes an effective and efficient heuristic that takes only \(O(nwd)\) time. Congratulations, Amir!
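For readers unfamiliar with the measure at the heart of Amir's thesis: distance correlation detects non-linear dependence that Pearson correlation misses entirely. Below is a minimal, plain \(O(n^2)\) sketch of distance correlation for two univariate samples — a baseline illustration only, not Amir's FFT-accelerated method; the function name is ours.

```python
import numpy as np

def distance_correlation(x, y):
    """Plain O(n^2) distance correlation between two 1-d samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # pairwise distance matrices
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # double-centre: subtract row and column means, add back the grand mean
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    dcov2 = (A * B).mean()        # squared distance covariance
    dvar_x = (A * A).mean()       # squared distance variances
    dvar_y = (B * B).mean()
    if dvar_x * dvar_y == 0:
        return 0.0
    return float(np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y)))
```

For example, with \(y = x^2\) on a symmetric range, Pearson correlation is zero while distance correlation is clearly positive — exactly the kind of non-linear dependence Amir's windowed search targets.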

(14 October 2016)

Apratim Bhattacharyya

Apratim shows how to Squish event sequences

Apratim Bhattacharyya finished his Master of Science by handing in his thesis 'Squish: Efficiently Summarising Event Sequences with Rich and Interleaving Patterns'. Squish improves over the state of the art by considering a much richer description language, allowing both nesting and interleaving of patterns, as well as variants and partial occurrences of patterns. Moreover, Squish is not only orders of magnitude faster than the state of the art; experiments show it also discovers much better and more easily interpretable models. Congratulations, Apratim!

(30 September 2016)

Beata Wojciak

Beata discovers the Spaghetti in document collections

Beata Wójciak handed in her thesis 'Spaghetti: Finding Storylines in Large Collections of Documents' on the 29th of September, and so fulfilled the requirements to become a Master of Science in Informatics. In her thesis, Beata studied the problem of making sense of large, time-stamped collections of documents, and proposed the efficient Spaghetti algorithm to discover the storylines in a corpus. This allows us to draw a map showing which documents are connected, as well as to easily interpret the storylines. Congratulations, Beata!

(29 September 2016)

Magnus Halbe

Magnus combines sketching and Slim into Skim

For his Bachelor thesis, Magnus Halbe studied whether sketching can speed up Slim. In particular, he investigated whether DHP and min-hashing can be used to reliably and efficiently identify co-occurring patterns. In his thesis, titled 'Skim: Alternative Candidate Selections for Slim through Sketching', Magnus shows that the answer is 'not really'. Whereas the sketches ably identify heavy hitters, they are less efficient at identifying more subtle patterns. He therefore proposes the Skim algorithm, which combines the best of both worlds. Congratulations, Magnus!
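To give a feel for the min-hashing idea Magnus investigated: two items co-occur often when their transaction-id lists have high Jaccard similarity, and MinHash signatures let us estimate that similarity cheaply. The following is a generic sketch of MinHash with universal hashing — an illustration of the technique, not Magnus's Skim implementation; all names and parameters here are ours.

```python
import random

P = 2_147_483_647  # large prime for universal hashing

def make_hashes(k, seed=0):
    """Draw k random universal hash functions h(x) = (a*x + b) mod P."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(k)]

def minhash(tidlist, hashes):
    """MinHash signature: the minimum hash value per function over the set."""
    return [min((a * t + b) % P for t in tidlist) for a, b in hashes]

def est_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature components estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

With, say, 512 hash functions the estimate is typically within a few percent of the true Jaccard similarity — good enough to spot heavy hitters, which matches Magnus's finding that subtler patterns are harder for sketches to pick out.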

(28 September 2016)

Culprits in Time

Polina presents CulT at KDD

Last summer, Polina Rozenshtein did an internship in our group. She presented the resulting paper 'Reconstructing an Epidemic over Time' at ACM SIGKDD 2016. Together with B. Aditya Prakash and Aris Gionis, we studied the problem of finding the seed nodes of an epidemic, given an interaction graph and a sparse and noisy sample of node states over time. We propose the CulT (Culprits in Time) algorithm, which reliably and efficiently recovers both the number and the locations of the original seed nodes, without making any assumptions on the viral process. We give a short explanation, with kittens, on YouTube here.

(24 July 2016)

Keeping it Short and Simple

Roel presents Ditto at KDD

During the summer of 2014, Roel Bertens did an internship in our group. He presented the outcome, 'Keeping it Short and Simple: Summarising Complex Event Sequences with Multivariate Patterns', at ACM SIGKDD 2016. In this paper we propose the Ditto algorithm, an efficient heuristic for finding succinct summaries of complex event sequences with patterns that can span multiple attributes and may include gaps. You can find the paper and implementation here, and for a short introduction, see our YouTube video here.

(24 July 2016)

Heidelberg Laureate Forum, September 18-23, 2016

Kailash invited to the Heidelberg Laureate Forum

Kailash Budhathoki has been invited to attend the 4th Heidelberg Laureate Forum. From the 18th to the 23rd of September 2016, he will get to meet laureates of the most prestigious awards in Mathematics and Computer Science, such as Turing Award winners Manuel Blum, Vinton Cerf, Richard Karp, and John Hopcroft, as well as 199 other highly talented young scientists.

(24 April 2016)

Non-linear correlation discovered using UDS

Panos and Jilles present Flexi, Light, and UDS at SIAM SDM

At the SIAM International Conference on Data Mining 2016, Panagiotis Mandros presented UDS, which allows for Universal Dependency Analysis. That is, it is a robust and efficient measure for non-linear and multivariate correlations, which does not require any prior assumptions, yet does allow for meaningful comparison, no matter the cardinality or distribution of the subspace.

At the same venue, Jilles Vreeken presented both Light, a linear-time method for detecting non-linear change points in massively high-dimensional time series, and Flexi, a highly flexible method for mining high-quality subgroups through optimal discretisation that works with virtually any quality measure.