With CoCa, we can tell whether two continuous variables are causally related, or jointly caused by a hidden confounder. More information here.
Given a database and a target attribute, we want to tell whether there exists a functional, or approximately functional, dependency of the target on any set of other attributes in the data, and to do so efficiently and reliably, without bias to sample size or dimensionality. To this end we propose the Fedora algorithm. More information here.
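To make the notion of an approximate functional dependency concrete, here is a minimal sketch that scores how much of the target's entropy a set of attributes explains. Note this is the naive plug-in estimator, which suffers exactly the sample-size and dimensionality bias that Fedora is designed to correct for (adding columns can only increase this score); the function names are illustrative, not Fedora's API.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def dependency_score(rows, x_cols, y_col):
    """Fraction of H(Y) explained by X: 1 - H(Y | X) / H(Y).

    1.0 means an exact functional dependency Y = f(X); 0.0 means X tells us
    nothing about Y. This is the uncorrected plug-in version of the score.
    """
    y = [r[y_col] for r in rows]
    h_y = entropy(y)
    if h_y == 0:
        return 1.0
    # H(Y | X) = sum over x-groups of p(x) * H(Y | X = x)
    groups = {}
    for r in rows:
        groups.setdefault(tuple(r[c] for c in x_cols), []).append(r[y_col])
    n = len(rows)
    h_y_given_x = sum(len(g) / n * entropy(g) for g in groups.values())
    return 1.0 - h_y_given_x / h_y

# toy data: the third column is the XOR of the first two
rows = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)] * 5
print(dependency_score(rows, (0, 1), 2))  # both columns together: exact dependency, 1.0
print(dependency_score(rows, (0,), 2))    # either column alone explains nothing: 0.0
```

The XOR example also shows why the search must consider *sets* of attributes: no single column has any dependency with the target, yet the pair determines it exactly.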
Slope infers, with very high accuracy, the most likely direction of causation between two numeric univariate variables based on local and global regression. More information here.
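The core intuition can be sketched in a few lines: fit a regression in each direction and prefer the direction with the cheaper MDL-style description. This simplified version uses only a global polynomial fit with a Gaussian residual encoding and a BIC-style parameter cost; Slope's actual score combines local and global regression with a refined encoding.

```python
import numpy as np

def direction_score(x, y, max_degree=3):
    """MDL-style cost of describing y as a polynomial function of x:
    model cost (BIC-like penalty per parameter) plus data cost (Gaussian
    code length of the residuals). Lower is better."""
    n = len(x)
    best = float("inf")
    for d in range(1, max_degree + 1):
        coeffs = np.polyfit(x, y, d)
        resid = y - np.polyval(coeffs, x)
        var = max(resid.var(), 1e-12)
        data_cost = 0.5 * n * np.log2(2 * np.pi * np.e * var)  # residual code length
        model_cost = 0.5 * (d + 1) * np.log2(n)                # parameter cost
        best = min(best, data_cost + model_cost)
    return best

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = x ** 3 + 0.1 * rng.normal(size=2000)   # ground truth: X causes Y

# infer the direction that admits the cheaper description
print("X -> Y" if direction_score(x, y) < direction_score(y, x) else "Y -> X")  # X -> Y
```

Describing Y from X only needs a cubic plus small Gaussian noise, while describing X from Y must approximate a cube root, which polynomials fit poorly; the asymmetry in description cost reveals the causal direction.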
We propose the Crack algorithm for identifying the most likely direction of causation between two univariate or multivariate variables of single or mixed-type data. More information here.
Causal Inference on Multivariate and Mixed-Type Data. In: Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), Springer, 2018.
Suppose we are given a set of databases, such as sales records over different branches. How can we characterise the differences and the norm between these datasets? What are the patterns that characterise the overall distribution, and what are those that are important for the individual datasets? That is exactly what the DiffNorm algorithm reveals. More information here.
The Difference and the Norm – Characterising Similarities and Differences between Databases. In: Proceedings of European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), pp 206-223, Springer, 2015.
Slim mines high-quality Krimp code tables directly from data, as opposed to filtering a candidate collection. By doing so, Slim obtains smaller code tables that provide better compression ratios, while also improving classification accuracy and runtime, and reducing the memory complexity by orders of magnitude. More information here.
mtv is a well-founded approach for summarizing data with itemsets; using a probabilistic maximum entropy model, we iteratively find the itemset that provides the most new information, and update our model accordingly. We can either mine top-k patterns, or identify the best summarisation by MDL or BIC. More information here.
Self-sufficient itemsets are an effective way to summarise the key associations in data. However, their computation appears demanding, as assessing whether an itemset is self-sufficient requires consideration of all pairwise partitions of an itemset, as well as all its supersets. We propose a branch-and-bound algorithm that employs two powerful pruning techniques to extract them efficiently. More information here.
We consider mining informative serial episodes — subsequences allowing for gaps — from event sequence data. We formalize the problem by the Minimum Description Length principle, and give algorithms for selecting good pattern sets from candidate collections as well as for parameter free mining of such models directly from data. More information here.
We study how to obtain concise descriptions of discrete multivariate sequential data in terms of rich multivariate sequential patterns. We introduce Ditto, and show it discovers succinct pattern sets that capture highly interesting associations within and between sequences. More information here.
Keeping it Short and Simple: Summarising Complex Event Sequences with Multivariate Patterns. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'16), pp 735-744, ACM, 2016.
CompreX discovers anomalies in data using pattern-based compression. Informally, it finds a collection of dictionaries that together describe the norm of a database succinctly, and subsequently flags points dissimilar to the norm – those with high compression cost – as anomalies. More information here.
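The underlying principle – whatever compresses badly under a model of the norm is anomalous – can be shown with a drastically simplified model. Where CompreX learns pattern dictionaries over groups of correlated features, this sketch scores each row with the simplest possible model, independent per-attribute frequencies; all names here are illustrative.

```python
import math
from collections import Counter

def compression_costs(rows):
    """Per-row code length (in bits) under a per-attribute frequency model.

    Each attribute value v in column j costs -log2 p_j(v) bits, where p_j is
    the empirical frequency of v in that column. Rows that are expensive to
    encode are dissimilar to the norm, i.e. anomalous.
    """
    n = len(rows)
    freqs = [Counter(col) for col in zip(*rows)]
    def cost(row):
        return sum(-math.log2(f[v] / n) for f, v in zip(freqs, row))
    return [cost(r) for r in rows]

# 100 ordinary baskets plus one containing a value seen nowhere else
rows = [("coke", "chips")] * 50 + [("pepsi", "pretzels")] * 50 + [("fanta", "chips")]
costs = compression_costs(rows)
print(max(range(len(rows)), key=costs.__getitem__))  # index of the odd row out: 100
```

The rare value "fanta" gets a long code and dominates the cost of the last row, so it stands out clearly against the ~2-bit cost of the ordinary rows.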
Anomalies are often characterised as the absence of patterns. We observe that the co-occurrence of patterns can also be anomalous – many people prefer Coca Cola, while others prefer Pepsi Cola, and hence anyone who buys both stands out. We formally introduce this new class of anomalies, and propose UpC, an efficient algorithm to discover these anomalies in transaction data. More information here.
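To see why this class of anomaly needs its own treatment, note that in the cola example every single item is perfectly common; only the *combination* is rare. A pairwise-lift heuristic captures the intuition: flag transactions containing item pairs that are individually frequent but almost never bought together. This is only an intuition-level sketch, not UpC's formalisation, and the thresholds are arbitrary.

```python
from itertools import combinations
from collections import Counter

def cooccurrence_anomalies(transactions, min_support=0.2, max_lift=0.2):
    """Return indices of transactions containing a pair of items that are
    each individually frequent, yet occur together far less often than
    independence would predict (lift well below 1)."""
    n = len(transactions)
    item_freq = Counter(i for t in transactions for i in set(t))
    pair_freq = Counter(p for t in transactions
                        for p in combinations(sorted(set(t)), 2))
    suspicious = {
        p for p in pair_freq
        if item_freq[p[0]] / n >= min_support
        and item_freq[p[1]] / n >= min_support
        and (pair_freq[p] * n) / (item_freq[p[0]] * item_freq[p[1]]) <= max_lift
    }
    return [idx for idx, t in enumerate(transactions)
            if any(p in suspicious for p in combinations(sorted(set(t)), 2))]

baskets = [["coke", "chips"]] * 50 + [["pepsi", "chips"]] * 50 + [["coke", "pepsi"]]
print(cooccurrence_anomalies(baskets))  # only the both-colas basket: [100]
```

A frequency-based detector would pass the last basket without complaint, since both of its items are bought by half the customers; only the surprising co-occurrence gives it away.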
Stijl mines descriptions of ordered binary data. We model data hierarchically with noisy tiles - rectangles with significantly different density than their parent tile. To identify good trees, we employ the Minimum Description Length principle, and give an algorithm for mining optimal sub-tiles in just O(nm min(n,m)) time. More information here.
Discovering Descriptive Tile Trees by Fast Mining of Optimal Geometric Subtiles. In: Proceedings of European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), pp 9-24, Springer, 2012.
Boolean Matrix Factorization has many desirable properties, such as high interpretability and natural sparsity. However, no method for selecting the correct model order has been available. We propose to use the Minimum Description Length principle, and show that besides solving the problem, this well-founded approach has numerous benefits, e.g., it is automatic, does not require a likelihood function, and, as experiments show, is highly accurate. More information here.
Visualizing large graphs often results in excessive edge crossings and overlapping nodes. We propose Facets, a new scalable approach for adaptively exploring large million-node graphs from a local perspective, guiding users to focus on the nodes and neighborhoods that are most subjectively interesting. More information here.
With CulT, we propose a method to reconstruct an epidemic over time, or, more generally, the propagation of an activity through a network. More information here.
Measuring the difference between data mining results is an important open problem in exploratory data mining. We discuss an information theoretic approach for measuring how much information is shared between results, and give a proof of concept for binary data. More information here.
Suppose we are given a large graph in which, by some external process, a handful of nodes are marked. What can we say about these nodes? Are they close together in the graph? Or, if segregated, how many groups do they form? We approach this problem by trying to find simple connection pathways between sets of marked nodes, using MDL to identify the optimal result. We propose the efficient dot2dot algorithm to approximate this goal. More information here.
Given a database and a target attribute, we want to tell whether there exists a functional, or approximately functional, dependency of the target on any set of other attributes in the data, and to do so efficiently and reliably, without bias to sample size or dimensionality. To this end we propose the Dora algorithm. More information here.
We provide a general theory of correlation that makes no assumptions about the form of the dependency or the distribution of the data. Simply put, we propose correlation by compression. To this end, we propose two correlation measures based on solid information theoretic foundations, i.e. Kolmogorov complexity, and show how to instantiate these for discrete data. More information here.
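Kolmogorov complexity itself is uncomputable, so in practice one substitutes a real compressor. The classic off-the-shelf illustration of this principle is the Normalized Compression Distance; the sketch below uses zlib and is meant only to convey the idea, not the specific measures proposed in the paper.

```python
import random
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: a practical stand-in for the
    (uncomputable) Kolmogorov-complexity-based distance. Close to 0 for
    strongly related sequences, close to 1 for unrelated ones."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

a = bytes(i % 7 for i in range(4000))             # periodic sequence
b = bytes((i + 3) % 7 for i in range(4000))       # shifted copy: strongly related
c = bytes(random.Random(0).randrange(256) for _ in range(4000))  # incompressible noise

# the related pair compresses well together, so it scores as closer
print(ncd(a, b) < ncd(a, c))  # True
```

Intuitively, if y is largely predictable from x, then compressing their concatenation costs little more than compressing x alone, and the distance is small.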
Discovering whether a subspace is correlated is a core task in data mining. For practical use, such a measure should be universal in the sense that it captures correlation in subspaces of any dimensionality and allows us to meaningfully compare correlation scores across different subspaces. To this end we propose UdS, an efficient non-parametric multivariate non-linear dependency measure. More information here.
Correlation analysis is one of the key elements of data mining, but many methods are limited to either linear or pairwise correlations. We propose mac, a non-parametric, multivariate and non-linear correlation measure based on maximal correlation analysis. More information here.
How can we quantify the difference between two datasets, without having to assume a distribution? We formalise the well-known Jensen-Shannon divergence using cumulative distribution functions. cjs allows us to calculate divergences directly and efficiently from data, without the need for density estimation. More information here.
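The key trick is that empirical CDFs are available directly from the samples, whereas densities would have to be estimated. The following sketch computes a JS-style divergence between two samples from their empirical CDFs on the merged grid of observed values; it illustrates the idea only, as the paper's estimator differs in detail.

```python
import bisect
import math

def ecdf(sample, grid):
    """Empirical CDF of the sample, evaluated at each grid point."""
    s = sorted(sample)
    return [bisect.bisect_right(s, g) / len(s) for g in grid]

def cjs(xs, ys):
    """JS-style divergence computed from empirical CDFs -- no binning or
    density estimation, just the raw samples. Zero iff the empirical CDFs
    coincide; grows with the separation between the two samples."""
    grid = sorted(set(xs) | set(ys))
    P, Q = ecdf(xs, grid), ecdf(ys, grid)
    div = 0.0
    for i in range(len(grid) - 1):
        dx = grid[i + 1] - grid[i]
        m = 0.5 * (P[i] + Q[i])
        if P[i] > 0:
            div += 0.5 * P[i] * math.log2(P[i] / m) * dx
        if Q[i] > 0:
            div += 0.5 * Q[i] * math.log2(Q[i] / m) * dx
    return div

same = [i / 100 for i in range(100)]
shifted = [i / 100 + 0.5 for i in range(100)]
print(cjs(same, same))         # identical samples: 0.0
print(cjs(same, shifted) > 0)  # shifted distribution: positive divergence, True
```

By the log-sum inequality each grid term is non-negative, so the score is zero exactly when the two empirical CDFs agree on the grid.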
Non-Parametric Jensen-Shannon Divergence. In: Proceedings of European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), pp 173-189, Springer, 2015.
Discretization is the transformation of continuous data into discrete bins. The key question is how to discretize data such that multivariate associations and patterns are preserved. That is exactly what ipd does. More information here.
A good first impression of a dataset is paramount to how we proceed with our analysis. We discuss mining high-quality, high-level descriptive summaries for binary and categorical data. Our approach builds summaries by clustering attributes that strongly correlate, and uses the Minimum Description Length principle to identify the best clustering. More information here.