Causal Inference

With Dice, we can efficiently mine reliable causal rules from observational data. More information here.

Budhathoki, K, Boley, M & Vreeken, J Discovering Reliable Causal Rules. In: Proceedings of the SIAM International Conference on Data Mining (SDM), SIAM, 2021.

Based on the Algorithmic Markov Condition, Globe discovers fully oriented causal networks from observational data. More information here.

Mian, OA, Marx, A & Vreeken, J Discovering Fully Oriented Causal Networks. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), AAAI, 2021.

We show under which conditions regularized regression can be used to identify cause from effect between pairs of univariate continuous-valued random variables. More information here.

Marx, A & Vreeken, J Identifiability of Cause and Effect using Regularized Regression. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'19), ACM, 2019.

With CoCa, we can tell whether two continuous variables are causally related, or jointly caused by a hidden confounder. More information here.

Kaltenpoth, D & Vreeken, J We Are Not Your Real Parents: Telling Causal From Confounded by MDL. In: Proceedings of the SIAM International Conference on Data Mining (SDM), SIAM, 2019.

With Acid, we can very robustly infer the correct causal direction between two univariate discrete variables using stochastic complexity. More information here.

Budhathoki, K & Vreeken, J Accurate Causal Inference on Discrete Data. In: Proceedings of the IEEE International Conference on Data Mining (ICDM'18), IEEE, 2018.

Slope infers, with very high accuracy, the most likely direction of causation between two numeric univariate variables based on local and global regression. More information here.

Marx, A & Vreeken, J Telling Cause from Effect by Local and Global Regression. Knowledge and Information Systems vol.60(3), pp 1277-1305, Springer, 2019.

We propose the Crack algorithm for identifying the most likely direction of causation between two univariate or multivariate variables of single or mixed-type data. More information here.

Marx, A & Vreeken, J Causal Inference on Multivariate and Mixed Type Data. In: Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Data (ECMLPKDD), Springer, 2018.

With CuTe, we can robustly infer the correct causal direction between two event sequences using sequential normalized maximum likelihood. More information here.

Budhathoki, K & Vreeken, J Causal Inference on Event Sequences. In: Proceedings of the SIAM Conference on Data Mining (SDM), pp 55-63, SIAM, 2018.

With CiSC, we can very robustly infer the correct causal direction between two univariate discrete variables using stochastic complexity. More information here.

Budhathoki, K & Vreeken, J MDL for Causal Inference on Discrete Data. In: Proceedings of the IEEE International Conference on Data Mining (ICDM'17), pp 751-756, IEEE, 2017.

We propose the Origo algorithm for identifying the most likely direction of causation between two univariate or multivariate discrete nominal or binary variables. More information here.

Budhathoki, K & Vreeken, J Origo: Causal Inference by Compression. Knowledge and Information Systems vol.56(2), pp 285-307, Springer, 2018.

We consider non-parametric causal inference. That is, given two variables that we know to be correlated, Ergo can efficiently and reliably infer their causal direction – without having to assume a distribution. More information here.

Vreeken, J Causal Inference by Direction of Information. In: Proceedings of the SIAM International Conference on Data Mining (SDM), pp 909-917, SIAM, 2015.

Pattern Sets

With Reaper, we can highly efficiently discover high quality pattern sets. More information here.

Dalleiger, S & Vreeken, J The Relaxed Maximum Entropy Distribution and its Application to Pattern Discovery. In: Proceedings of the IEEE International Conference on Data Mining (ICDM'20), IEEE, 2020.

With Mexican, we can efficiently discover pattern sets expressing co-occurrence and mutual exclusivity from discrete data. More information here.

Fischer, J & Vreeken, J Discovering Succinct Pattern Sets Expressing Co-Occurrence and Mutual Exclusivity. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'20), ACM, 2020.

With Disc, we can efficiently discover the pattern composition of a binary dataset. More information here.

Dalleiger, S & Vreeken, J Explainable Data Decompositions. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI'20), AAAI, 2020.

Grab discovers succinct, non-redundant and highly characteristic sets of rules and patterns from binary data. More information here.

Fischer, J & Vreeken, J Sets of Robust Rules, and How to Find Them. In: Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Data (ECMLPKDD), Springer, 2019.

Suppose we are given a set of databases, such as sales records over different branches. How can we characterise the differences and the norm between these datasets? What are the patterns that characterise the overall distribution, and what are those that are important for the individual datasets? That is exactly what the DiffNorm algorithm reveals. More information here.

Budhathoki, K & Vreeken, J The Difference and the Norm – Characterising Similarities and Differences between Databases. In: Proceedings of European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), pp 206-223, Springer, 2015.

Self-sufficient itemsets are an effective way to summarise the key associations in data. However, their computation appears demanding, as assessing whether an itemset is self-sufficient requires consideration of all pairwise partitions of an itemset, as well as all its supersets. We propose a branch-and-bound algorithm that employs two powerful pruning techniques to extract them efficiently. More information here.

Webb, G & Vreeken, J Efficient Discovery of the Most Interesting Associations. Transactions on Knowledge Discovery from Data vol.8(3), pp 1-31, ACM, 2014.

Measuring the difference between data mining results is an important open problem in exploratory data mining. We discuss an information theoretic approach for measuring how much information is shared between results, and give a proof of concept for binary data. More information here.

Tatti, N & Vreeken, J Comparing Apples and Oranges – Measuring Differences between Exploratory Data Mining Results. Data Mining and Knowledge Discovery vol.25(2), pp 173-207, Springer, 2012.

Stijl mines descriptions of ordered binary data. We model data hierarchically with noisy tiles - rectangles with significantly different density than their parent tile. To identify good trees, we employ the Minimum Description Length principle, and give an algorithm for mining optimal sub-tiles in just O(nm min(n, m)) time. More information here.

Tatti, N & Vreeken, J Discovering Descriptive Tile Trees by Fast Mining of Optimal Geometric Subtiles. In: Proceedings of European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), pp 9-24, Springer, 2012.

mtv is a well-founded approach for summarizing data with itemsets; using a probabilistic maximum entropy model, we iteratively find the itemset that provides the most new information, and update our model accordingly. We can either mine top-k patterns, or identify the best summarisation by MDL or BIC. More information here.

Mampaey, M, Vreeken, J & Tatti, N Summarizing Data Succinctly with the Most Informative Itemsets. Transactions on Knowledge Discovery from Data vol.6(4), pp 1-44, ACM, 2012.

Boolean Matrix Factorization has many desirable properties, such as high interpretability and natural sparsity. However, no method for selecting the correct model order has been available. We propose to use the Minimum Description Length principle, and show that besides solving the problem, this well-founded approach has numerous benefits, e.g., it is automatic, does not require a likelihood function, and, as experiments show, is highly accurate. More information here.

Miettinen, P & Vreeken, J mdl4bmf: Minimum Description Length for Boolean Matrix Factorization. Transactions on Knowledge Discovery from Data vol.8(4), pp 1-30, ACM, 2014.

Slim mines high-quality Krimp code tables directly from data, as opposed to filtering a candidate collection. By doing so, Slim obtains smaller code tables that provide better compression ratios, while also improving classification accuracy and runtime, and reducing memory complexity by orders of magnitude. More information here.

Smets, K & Vreeken, J Slim: Directly Mining Descriptive Patterns. In: Proceedings of the SIAM International Conference on Data Mining (SDM), pp 236-247, SIAM, 2012.

We aim at finding itemsets that characterise the data well. To this end, we construct decision trees by which we can pack the data succinctly, and from which we can subsequently identify the most important itemsets. The Pack algorithm can either filter a candidate collection or mine its models directly from data. More information here.

Tatti, N & Vreeken, J Finding Good Itemsets by Packing Data. In: Proceedings of the IEEE International Conference on Data Mining (ICDM), pp 588-597, IEEE, 2008.

The Krimp algorithm mines sets of itemsets by the MDL principle, defining the best set of patterns as the set that compresses the data best. The resulting code tables are orders of magnitude smaller than the number of (closed) frequent itemsets. They are highly characteristic for the data, and obtain high accuracy on many data mining tasks. More information here.

Vreeken, J, van Leeuwen, M & Siebes, A Krimp: Mining Itemsets that Compress. Data Mining and Knowledge Discovery vol.23(1), pp 169-214, Springer, 2011.
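
To make the idea of "the set that compresses the data best" concrete, here is a minimal sketch, not the Krimp implementation. It assumes a greedy, non-overlapping cover in code table order and Shannon-optimal code lengths derived from usage counts, and it ignores the cost of encoding the code table itself. The function names cover and encoded_data_size are hypothetical.

```python
from collections import Counter
from math import log2


def cover(transaction, code_table):
    """Greedily cover a transaction with non-overlapping itemsets, in table order."""
    remaining = set(transaction)
    used = []
    for itemset in code_table:
        if itemset <= remaining:
            used.append(itemset)
            remaining -= itemset
    if remaining:
        # Assumption: a valid code table always contains every singleton.
        raise ValueError("code table must contain all singletons")
    return used


def encoded_data_size(transactions, code_table):
    """Bits needed for the data when each used itemset gets a Shannon-optimal code."""
    usage = Counter()
    for t in transactions:
        usage.update(cover(t, code_table))
    total = sum(usage.values())
    return sum(-count * log2(count / total) for count in usage.values())


# Toy data: the pattern {a, b} occurs three times, {c} once.
transactions = [frozenset("ab"), frozenset("ab"), frozenset("ab"), frozenset("c")]
singletons = [frozenset("a"), frozenset("b"), frozenset("c")]
with_pattern = [frozenset("ab")] + singletons

bits_singletons = encoded_data_size(transactions, singletons)
bits_with_pattern = encoded_data_size(transactions, with_pattern)
```

On this toy data, adding the itemset {a, b} to the code table lowers the encoded size of the data, which is the signal an MDL-based miner would use to accept the pattern.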

Subgroup Discovery

We consider the problem of discovering robustly connected subgraphs that have simple descriptions. Our aim is, hence, to discover vertex sets which not only a) induce a subgraph that is difficult to fragment into disconnected components, but also b) can be selected from the entire graph using just a simple conjunctive query on their vertex attributes. More information here.

Kalofolias, J, Boley, M & Vreeken, J Discovering Robustly Connected Subgraphs with Simple Descriptions. In: Proceedings of the IEEE International Conference on Data Mining (ICDM), IEEE, 2019.

We argue that in many applications, such as scientific discovery, subgroups are only useful if they are additionally representative of the global distribution with regard to a control variable: when the distribution of this control variable is the same, or almost the same, as over the whole data. We give an efficient algorithm to find such subgroups in the case of a numeric target and binary control variable. More information here.

Kalofolias, J, Boley, M & Vreeken, J Efficiently Discovering Locally Exceptional yet Globally Representative Subgroups. In: Proceedings of the IEEE International Conference on Data Mining (ICDM'17), IEEE, 2017.

In subgroup discovery, discovering high quality one-dimensional subgroups as well as high quality refinements is a crucial task. For nominal attributes this is easy, but for numerical attributes this is more challenging. We propose to use optimal binning to find high quality binary features for numeric and ordinal attributes. More information here.

Nguyen, H-V & Vreeken, J Flexibly Mining Better Subgroups. In: Proceedings of the SIAM International Conference on Data Mining (SDM), pp 585-593, SIAM, 2016.

Functional Dependency Discovery

Given a database and a target attribute, we want to tell whether there exists a functional, or approximately functional, dependency of the target on any set of other attributes in the data, regardless of whether these are nominal or continuous valued. We want to do so efficiently as well as reliably, without bias to sample size or dimensionality. To this end we propose the MixDora algorithm. More information here.

Mandros, P, Kaltenpoth, D, Boley, M & Vreeken, J Discovering Functional Dependencies from Mixed-Type Data. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'20), ACM, 2020.

In this paper we propose a corrected-for-chance, consistent, and efficient estimator for normalized total correlation, by which we obtain a reliable, naturally interpretable, non-parametric measure for correlation over multivariate sets of categorical variables. We also propose an efficient algorithm for discovering reliable correlations. More information here.

Mandros, P, Boley, M & Vreeken, J Discovering Reliable Correlations in Categorical Data. In: Proceedings of the IEEE International Conference on Data Mining (ICDM'19), IEEE, 2019.

Given a database and a target attribute, we want to tell whether there exists a functional, or approximately functional, dependency of the target on any set of other attributes in the data. We want to do so efficiently as well as reliably, without bias to sample size or dimensionality. To this end we propose the Fedora algorithm. More information here.

Mandros, P, Boley, M & Vreeken, J Discovering Dependencies with Reliable Mutual Information. Knowledge and Information Systems vol.62, pp 4223-4253, Springer, 2020.
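
As an illustration of what "without bias to sample size or dimensionality" can mean in practice, the sketch below scores a dependency by the plug-in fraction of information I(X;Y)/H(Y) and subtracts the value expected by chance. This is a sketch under stated assumptions, not the Fedora algorithm: the chance term is estimated here by Monte Carlo permutations of the target, and the names fraction_of_information and corrected_fraction are hypothetical.

```python
import random
from collections import Counter
from math import log2


def entropy(values):
    """Plug-in (empirical) Shannon entropy in bits."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())


def fraction_of_information(x, y):
    """Plug-in estimate of I(X;Y) / H(Y); assumes y is not constant."""
    h_y = entropy(y)
    mutual_information = entropy(x) + h_y - entropy(list(zip(x, y)))
    return mutual_information / h_y


def corrected_fraction(x, y, n_permutations=200, seed=0):
    """Subtract the score expected when y is shuffled independently of x.

    Assumption: the chance level is approximated with random permutations.
    """
    rng = random.Random(seed)
    shuffled = list(y)
    expected_by_chance = 0.0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        expected_by_chance += fraction_of_information(x, shuffled)
    return fraction_of_information(x, y) - expected_by_chance / n_permutations


y = [0, 1] * 20                 # binary target, 40 rows
x_functional = list(y)          # determines y and has low cardinality
x_unique_id = list(range(40))   # also "determines" y, but only by memorising rows

score_functional = corrected_fraction(x_functional, y)
score_unique_id = corrected_fraction(x_unique_id, y)
```

Both attributes reach an uncorrected score of 1, but the row identifier does so under every permutation of the target, so its corrected score collapses to zero while the genuine dependency keeps a high score.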

Given a database and a target attribute, we want to tell whether there exists a functional, or approximately functional, dependency of the target on any set of other attributes in the data. We want to do so efficiently as well as reliably, without bias to sample size or dimensionality. To this end we propose the Dora algorithm. More information here.

Mandros, P, Boley, M & Vreeken, J Discovering Reliable Dependencies from Data: Hardness and Improved Algorithms. In: Proceedings of the IEEE International Conference on Data Mining (ICDM'18), IEEE, 2018.

Sequence Mining

We propose Proseqo for discovering accurate, yet easily understandable models from complex event sequence data. More information here.

Wiegand, B, Klakow, D & Vreeken, J Mining Easily Understandable Models from Complex Event Data. In: Proceedings of the SIAM International Conference on Data Mining (SDM), SIAM, 2021.

How can we discover patterns that are not just reliable in that they accurately predict that something of interest will happen, but also reliable in that they can tell us when this will happen? With Omen we can. More information here.

Cueppers, J & Vreeken, J Just Wait For It... Mining Sequential Patterns with Reliable Prediction Delays. In: Proceedings of the IEEE International Conference on Data Mining (ICDM'20), IEEE, 2020.

We consider mining informative serial episodes — subsequences allowing for gaps — from event sequence data. We formalize the problem by the Minimum Description Length principle, and give algorithms for selecting good pattern sets from candidate collections as well as for parameter free mining of such models directly from data. More information here.

Bhattacharyya, A & Vreeken, J Efficiently Summarising Event Sequences with Rich Interleaving Patterns. In: Proceedings of the SIAM Conference on Data Mining (SDM), pp 795-803, SIAM, 2017.

We study how to obtain concise descriptions of discrete multivariate sequential data in terms of rich multivariate sequential patterns. We introduce Ditto, and show it discovers succinct pattern sets that capture highly interesting associations within and between sequences. More information here.

Bertens, R, Vreeken, J & Siebes, A Keeping it Short and Simple: Summarising Complex Event Sequences with Multivariate Patterns. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'16), pp 735-744, ACM, 2016.

We consider mining informative serial episodes — subsequences allowing for gaps — from event sequence data. We formalize the problem by the Minimum Description Length principle, and give algorithms for selecting good pattern sets from candidate collections as well as for parameter free mining of such models directly from data. More information here.

Tatti, N & Vreeken, J The Long and the Short of It: Summarising Event Sequences with Serial Episodes. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp 462-470, ACM, 2012.

Graph Mining

We propose Susan, an efficiently computable random walk graph kernel that picks up structural similarity. More information here.

Kalofolias, J, Welke, P & Vreeken, J SUSAN: The Structural Similarity Random Walk Kernel. In: Proceedings of the SIAM International Conference on Data Mining (SDM), SIAM, 2021.

We introduce a unified solution to knowledge graph characterization by formulating the problem as unsupervised summarization with a set of inductive, soft rules, which describe what is normal, and thus can be used to identify what is abnormal, whether it be strange or missing. More information here.

Belth, C, Zheng, X, Vreeken, J & Koutra, D What is Normal, What is Strange, and What is Missing in a Knowledge Graph. In: Proceedings of the Web Conference (WWW), ACM, 2020.

With CulT, we propose a method to reconstruct an epidemic over time, or, more generally, to reconstruct the propagation of an activity in a network. More information here.

Rozenshtein, P, Gionis, A, Prakash, BA & Vreeken, J Reconstructing an Epidemic over Time. In: Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pp 1835-1844, ACM, 2016.

Visualizing large graphs often results in excessive edge crossings and overlapping nodes. We propose Facets, a new scalable approach for adaptively exploring large million-node graphs from a local perspective, guiding users to focus on nodes and neighborhoods that are most subjectively interesting. More information here.

Pienta, R, Lin, Z, Kahng, M, Vreeken, J, Talukdar, PP, Abello, J, Parameswaran, G & Chau, DH AdaptiveNav: Adaptive Discovery of Interesting and Surprising Nodes in Large Graphs. In: Proceedings of the IEEE Conference on Visualization (VIS), IEEE, 2015.

Koutra, D, Kang, U, Vreeken, J & Faloutsos, C VoG: Summarizing and Understanding Large Graphs. In: Proceedings of the SIAM International Conference on Data Mining (SDM), pp 91-99, SIAM, 2014.

Suppose we are given a large graph in which, by some external process, a handful of nodes are marked. What can we say about these nodes? Are they close together in the graph? Or, if segregated, how many groups do they form? We approach this problem by trying to find simple connection pathways between sets of marked nodes, using MDL to identify the optimal result. We propose the efficient dot2dot algorithm for approximating this goal. More information here.

Akoglu, L, Vreeken, J, Tong, H, Chau, DH, Tatti, N & Faloutsos, C Mining Connection Pathways for Marked Nodes in Large Graphs. In: Proceedings of the SIAM International Conference on Data Mining (SDM), pp 37-45, SIAM, 2013.

Given a snapshot of a large graph, in which an infection has been spreading for some time, can we identify those nodes from which the infection started to spread? In other words, can we reliably tell who the culprits are? With NetSleuth, we answer this question affirmatively for the Susceptible-Infected virus propagation model. More information here.

Prakash, BA, Vreeken, J & Faloutsos, C Spotting Culprits in Epidemics: How many and Which ones? In: Proceedings of the IEEE International Conference on Data Mining (ICDM), pp 11-20, IEEE, 2012.

Outlier Analysis

Anomalies are often characterised as the absence of patterns. We observe that the co-occurrence of patterns can also be anomalous – many people prefer Coca Cola, while others prefer Pepsi Cola, and hence anyone who buys both stands out. We formally introduce this new class of anomalies, and propose UpC, an efficient algorithm to discover these anomalies in transaction data. More information here.

Bertens, R, Vreeken, J & Siebes, A Efficiently Discovering Unexpected Pattern-Co-Occurrences. In: Proceedings of the SIAM International Conference on Data Mining (SDM), pp 126-134, SIAM, 2017.

Detecting whether any important statistics over your time series changed is an important aspect of time series analysis. With Light, we tackle the problem of efficiently and effectively detecting non-linear changes over very high dimensional time series. More information here.

Nguyen, H-V & Vreeken, J Linear-time Detection of Non-Linear Changes in Massively High Dimensional Time Series. In: Proceedings of the SIAM International Conference on Data Mining (SDM), pp 828-836, SIAM, 2016.

CompreX discovers anomalies in data using pattern-based compression. Informally, it finds a collection of dictionaries that describe the norm of a database succinctly, and subsequently flags points dissimilar to the norm – those with high compression cost – as anomalies. More information here.

Akoglu, L, Tong, H, Vreeken, J & Faloutsos, C Fast and Reliable Anomaly Detection in Categorical Data. In: Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), pp 415-424, ACM, 2012.
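
The notion of flagging points with high compression cost can be sketched as follows. This is a deliberately simplified stand-in, not CompreX itself: instead of a collection of pattern dictionaries, it assumes one independent code per attribute, with code lengths taken from value frequencies. The function name compression_costs and the toy records are hypothetical.

```python
from collections import Counter
from math import log2


def compression_costs(records):
    """Bits needed per record when every attribute value gets a Shannon-optimal code.

    Assumption: attributes are encoded independently of each other.
    """
    n = len(records)
    n_attrs = len(records[0])
    freqs = [Counter(r[j] for r in records) for j in range(n_attrs)]
    return [
        sum(-log2(freqs[j][r[j]] / n) for j in range(n_attrs))
        for r in records
    ]


# Toy categorical data: frequent combinations describe the norm.
records = (
    [("red", "small")] * 6
    + [("blue", "small")] * 3
    + [("green", "large")]
)

costs = compression_costs(records)
# The record that is most expensive to encode is the strongest anomaly candidate.
most_anomalous = max(range(len(records)), key=costs.__getitem__)
```

Records that follow the norm receive short codes, while the single rare combination is expensive to encode and therefore ranks first.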