Abstract:
The paper presents a method of interactive construction of global Hidden Markov Models based on local patterns discovered in sequence data. The method works by finding interesting sequences whose probability in data differs from that predicted by the model. The patterns are then presented to the user who updates the model using their understanding of the modelled domain. It is argued that such an approach leads to more understandable models than automated approaches. An application to modelling webpage visitors behavior is presented, showing the strengths of the proposed approach.
Abstract:
We study a discovery framework in which background knowledge on variables and their relations within a discourse area is available in the form of a graphical model. Starting from an initial, hand-crafted or possibly empty graphical model, the network evolves in an interactive process of discovery. We focus on the central step of this process: given a graphical model and a database, we address the problem of finding the most interesting attribute sets. We formalize the concept of interestingness of attribute sets as the divergence between their behavior as observed in the data, and the behavior that can be explained given the current model. We derive an exact algorithm that finds all attribute sets whose interestingness exceeds a given threshold. We then consider the case of a very large network that renders exact inference unfeasible, and a very large database or data stream. We devise an algorithm that efficiently finds the most interesting attribute sets with prescribed approximation bound and confidence probability, even for very large networks and infinite streams. We study the scalability of the methods in controlled experiments; a case-study sheds light on the practical usefulness of the approach.
Abstract:
The paper presents minimum variance patterns: a new class of itemsets and rules for numerical data, which capture arbitrary continuous relationships between numerical attributes without the need for discretization. The approach is based on finding polynomials over sets of attributes whose variance, in a given dataset, is close to zero. Sets of attributes for which such functions exist are considered interesting. Further, two types of rules are introduced, which help extract understandable relationships from such itemsets. Efficient algorithms for mining minimum variance patterns are presented and verified experimentally.
Abstract:
In this paper we show an efficient method for inducing classifiers that directly optimize the area under the ROC curve. Recently, AUC gained importance in the classification community as a mean to compare the performance of classifiers. Because most classification methods do not optimize this measure directly, several classification learning methods are emerging that directly optimize the AUC. These methods, however, require many costly computations of the AUC, and hence, do not scale well to large datasets. In this paper, we develop a method to increase the efficiency of computing AUC based on a polynomial approximation of the AUC. As a proof of concept, the approximation is plugged into the construction of a scalable linear classifier that directly optimizes AUC using a gradient descent method. Experiments on real-life datasets show a high accuracy and efficiency of the polynomial approximation.
Abstract:
We address the problem of matching imperfectly documented schemas of data streams and large databases. Instance-level schema matching algorithms identify likely correspondences between attributes by quantifying the similarity of their corresponding values. However, exact calculation of these similarities requires processing of all database records--which is infeasible for data streams. We devise a fast matching algorithm that uses only a small sample of records, and is yet guaranteed to find a matching that is a close approximation of the matching that would be obtained if the entire stream were processed. The method can be applied to any given (combination of) similarity metrics that can be estimated from a sample with bounded error; we apply the algorithm to several metrics. We give a rigorous proof of the method's correctness and report on experiments using large databases.
Abstract:
The paper introduces a notion of support for realvalued functions. It is shown how to approximate supports of a large class of functions based on supports of so called polynomial itemsets, which can efficiently be mined using an Apriori-style algorithm. An upper bound for the error of such an approximation can be reliably computed. The concept of an approximating representation was introduced, which extends the idea of concise representations to numerical data. It has been shown that many standard statistical modelling tasks such as nonlinear regression and least squares curve fitting can efficiently be solved using only the approximating representation, without accessing the original data at all. Since many of those methods traditionally require several passes over the data, our approach makes it possible to use such methods with huge datasets and data streams where several repeated scans are very costly or outright impossible.
Abstract:
We address the problem of matching imperfectly documented schemas of data streams and large databases. Instance-level schema matching algorithms identify likely correspondences between attributes by quantifying the similarity of their corresponding values. However, exact calculation of these similarities requires processing of all database records--which is infeasible for data streams. We devise a fast matching algorithm that uses only a small sample of records, and is yet guaranteed to match the most similar attributes with high probability. The method can be applied to any given (combination of) similarity metrics that can be estimated from a sample with bounded error; we apply the algorithm to several metrics. We give a rigorous proof of the method's correctness and report on experiments using large databases.
Abstract:
We study the mining of interesting patterns in the presence of numerical attributes. Instead of the usual discretization methods, we propose the use of rank based measures to score the similarity of sets of numerical attributes. New support measures for numerical data are introduced, based on extensions of Kendall's tau, and Spearman's Footrule and rho. We show how these support measures are related. Furthermore, we introduce a novel type of pattern combining numerical and categorical attributes. We give efficient algorithms to find all frequent patterns for the proposed support measures, and evaluate their performance on real-life datasets.
Abstract:
A new class of associations (polynomial itemsets and polynomial association rules) is presented which allows for discovering nonlinear relationships between numeric attributes without discretization. For binary attributes, proposed associations reduce to classic itemsets and association rules. Many standard association rule mining algorithms can be adapted to finding polynomial itemsets and association rules. We applied polynomial associations to add non-linear terms to logistic regression models. Significant performance improvement was achieved over stepwise methods, traditionally used in statistics, with comparable accuracy.
Abstract:
The paper discusses the properties of an attribute selection criterion for building rough set reducts based on discernibilty matrix and compares it with Shannon entropy and Gini index used for building decision trees. It has been shown theoretically and experimentally that entropy and Gini index tend to work better if the reduct is later used for prediction of previously unseen cases, and the criterion based on the discernibility matrix tends to work better for learning functional relationships where generalization is not an issue.
Abstract:
We examine a new approach to building decision tree by introducing a geometric splitting criterion, based on the properties of a family of metrics on the space of partitions of a finite set. This criterion can be adapted to the characteristics of the data sets and the needs of the users and yields decision trees that have smaller sizes and fewer leaves than the trees built with standard methods and have comparable or better accuracy.
Abstract:
We examine a new approach to building decision tree by introducing a geometric splitting criterion, based on the properties of a family of metrics on the space of partitions of a finite set. This criterion can be adapted to the characteristics of the data sets and the needs of the users and yields decision trees that have smaller sizes and fewer leaves than the trees built with standard methods and have comparable or better accuracy.
Abstract:
We consider a model in which background knowledge on a given domain of interest is available in terms of a Bayesian network, in addition to a large database. The mining problem is to discover \em unexpected patterns: our goal is to find the strongest discrepancies between network and database. This problem is intrinsically difficult because it requires inference in a Bayesian network and processing the entire, potentially very large, database. A sampling-based method that we introduce is efficient and yet provably finds the approximatelymost interesting unexpected patterns. We give a rigorous proof of the method's correctness. Experiments shed light on its efficiency and practicality for large-scale Bayesian networks and databases.
Abstract:
We examine a new approach to building decision tree by introducing a geometric splitting criterion, based on the properties of a family of metrics on the space of partitions of a finite set. This criterion can be adapted to the characteristics of the data sets and the needs of the users and yields decision trees that have smaller sizes and fewer leaves than the trees built with standard methods and have comparable or better accuracy.
Abstract:
In this paper we present an overview of machine learning and data mining methods which we believe are useful for researchers working in the area of speech recognition and analysis. We present selected algorithms for classification, clustering, sequence mining and rough sets based methods. Potential applications in speech recognition and analysis are also discussed. We hope that this paper will encourage speech recognition researchers to more frequently use machine learning and data mining algorithms.
Abstract:
We present algorithms for fast generation of short reducts which avoid building the discernibility matrix explicitly. We show how information obtained from this matrix can be obtained based only on the distributions of attribute values. Since the size of discernibility matrix is quadratic in the number of data records, not building the matrix explicitly gives a very significant speedup and makes it possible to find reducts even in very large databases. Algorithms are given for both absolute and relative reducts. Experiments show that our approach outperforms other reduct finding algorithms. Furthermore it is shown that many heuristic reduct finding algorithms using the discernibility matrix in fact select attributes based on their Gini index. A new definition of conditional Gini index is presented, motivated by reduct finding heuristics.
Abstract:
We characterize measures on free Boolean algebras and we examine the relationships that exist between measures and binary tables in relational databases. It is shown that these measures are completely defined by their values on positive conjunctions, and a formula that yields this value is obtained using the method of indicators. An extension of the notion of support that is well suited for tables with missing values is presented. Finally, we obtain Bonferroni-type inequalities that allow for approximative evaluations of these measures for several types of queries. An approximation algorithm and an analysis of the results produced is also included.
Abstract:
The paper presents a method for pruning frequent itemsets based on background knowledge represented by a Bayesian network. The interestingness of an itemset is defined as the absolute difference between its support estimated from data and from the Bayesian network. Efficient algorithms are presented for finding interestingness of a collection of frequent itemsets, and for finding all attribute sets with a given minimum interestingness. Practical usefulness of the algorithms and their efficiency have been verified experimentally.
Abstract:
Increasing interest in new pattern recognition methods has been motivated by bioinformatics research. The analysis of gene expression data originated from microarrays constitutes an important application area for classification algorithms and illustrates the need for identifying important predictors. We show that the Goodman-Kruskal coefficient can be used for constructing minimal classifiers for tabular data, and we give an algorithm that can construct such classifiers.
Abstract:
We introduce an extension of the notion of Shannon conditional entropy to a more general form of conditional entropy that captures both the conditional Shannon entropy and a similar notion related to the Gini index. The proposed family of conditional entropies generates a collection of metrics over the set of partitions of finite sets, which can be used to construct decision trees. Experimental results suggest that by varying the parameter that defines the entropy it is possible to obtain smaller decision trees for certain databases without sacrificing accurracy.
Abstract:
The purpose of this paper is to examine the usability of Bonferroni-type combinatorial inequalities to estimation of support of itemsets as well as general Boolean expressions. Families of inequalities for various types of Boolean expressions are presented and evaluated experimentally.
Abstract:
Data mining algorithms produce huge sets of rules, practically impossible to analyze manually. It is thus important to develop methods for removing redundant rules from those sets. We present a solution to the problem using the Maximum Entropy approach. The problem of efficiency of Maximum Entropy computations is addressed by using closed form solutions for the most frequent cases. Analytical and experimental evaluation of the proposed technique indicates that it efficiently produces small sets of interesting association rules.
Abstract:
We characterize measures on free Boolean algebras and we examine the relationships that exists between measures and binary tables in relational databases. It is shown that these measures are completely defined by their values on positive conjunctions and an algorithm that leads to the construction of measures starting from its values on positive conjunction is also given, including a formula that allows the evaluation of measures for arbitrary polynomials. Finally, we study pairs of measures generated by ternary tables, that is, by tables that contain missing or unknown values.
Abstract:
We characterize measures on free Boolean algebras and we examine the relationships that exist between measures and binary tables in relational databases. It is shown that these measures are completely defined by their values on positive conjunctions, and a formula that yields this value is obtained using the method of indicators. We also obtain Bonferroni-type inequalities that allow for approximative evaluations of these measures. Finally we present a measure extending the notion of support that is well suited for tables with missing values.
Abstract:
The aim of this paper is to present an axiomatization of a generalization of Shannon's entropy starting from partitions of finite sets. The proposed axiomatization yields as special cases the Havrda-Charvat entropy, and thus, provides axiomatizations for the Shannon entropy, the Gini index, and for other types of entropy used in classification and data mining.
Abstract:
The paper presents a new general measure of rule interestingness. Many known measures such as chi-square, gini gain or entropy gain can be obtained from this measure by setting some numerical parameters, including the amount of trust we have in the estimation of the probability distribution of the data. Moreover, we show that there is a continuum of measures having chi-square, Gini gain and entropy gain as boundary cases. Therefore our measure generalizes both conditional and unconditional classical measures of interestingness. Properties and experimental evaluation of the new measure are also presented.
Abstract:
We introduce the notion of entropy for a set of attributes of a table in a relational database starting from the notion of entropy for finite functions. We examine the connections that exist between conditional entropies of attribute sets and lossy decompositions of tables and explore the properties of the entropy of sets of attributes regarded as an outer measure on the set of subsets of a heading of a table. Finally, we suggest a generalization of functional dependencies based on conditional entropy.
Abstract:
In this paper we present a new axiomatization of the notion of entropy of functions between finite sets and we introduce and axiomatize the notion of conditional entropy between functions. The results can be directly applied to logic functions, which can be regarded as functions between finite sets. Our axiomatizations are based on properties of entropy with regard to operations commonly applied to discrete functions and are related to the usage of entropy as a measure of the energy dissipated by circuits that implement discrete functions.