S. Jaroszewicz. Interactive HMM Construction Based on Interesting Sequences. In Proc. of Local Patterns to Global Models (LeGo'08) Workshop at the 12th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'08), pages 82-91, Antwerp, Belgium, 2008.

Abstract:
The paper presents a method of interactive construction of global Hidden Markov Models based on local patterns discovered in sequence data. The method works by finding interesting sequences whose probability in data differs from that predicted by the model. The patterns are then presented to the user who updates the model using their understanding of the modelled domain. It is argued that such an approach leads to more understandable models than automated approaches. An application to modelling webpage visitors behavior is presented, showing the strengths of the proposed approach.




S. Jaroszewicz, T. Scheffer, D.A. Simovici. Scalable pattern mining with Bayesian networks as background knowledge. Data Mining and Knowledge Discovery, 18(1), pages 56-100, 2009.

Abstract:
We study a discovery framework in which background knowledge on variables and their relations within a discourse area is available in the form of a graphical model. Starting from an initial, hand-crafted or possibly empty graphical model, the network evolves in an interactive process of discovery. We focus on the central step of this process: given a graphical model and a database, we address the problem of finding the most interesting attribute sets. We formalize the concept of interestingness of attribute sets as the divergence between their behavior as observed in the data, and the behavior that can be explained given the current model. We derive an exact algorithm that finds all attribute sets whose interestingness exceeds a given threshold. We then consider the case of a very large network that renders exact inference unfeasible, and a very large database or data stream. We devise an algorithm that efficiently finds the most interesting attribute sets with prescribed approximation bound and confidence probability, even for very large networks and infinite streams. We study the scalability of the methods in controlled experiments; a case-study sheds light on the practical usefulness of the approach.




S. Jaroszewicz. Minimum Variance Associations -- Discovering Relationships in Numerical Data. In The Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pages 172-183, Osaka, Japan, 2008.

Abstract:
The paper presents minimum variance patterns: a new class of itemsets and rules for numerical data, which capture arbitrary continuous relationships between numerical attributes without the need for discretization. The approach is based on finding polynomials over sets of attributes whose variance, in a given dataset, is close to zero. Sets of attributes for which such functions exist are considered interesting. Further, two types of rules are introduced, which help extract understandable relationships from such itemsets. Efficient algorithms for mining minimum variance patterns are presented and verified experimentally.




T. Calders, S. Jaroszewicz. Efficient AUC Optimization for Classification. In 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'07), pages 42-53, Warsaw, Poland, 2007, Best paper award.

Abstract:
In this paper we show an efficient method for inducing classifiers that directly optimize the area under the ROC curve. Recently, AUC gained importance in the classification community as a mean to compare the performance of classifiers. Because most classification methods do not optimize this measure directly, several classification learning methods are emerging that directly optimize the AUC. These methods, however, require many costly computations of the AUC, and hence, do not scale well to large datasets. In this paper, we develop a method to increase the efficiency of computing AUC based on a polynomial approximation of the AUC. As a proof of concept, the approximation is plugged into the construction of a scalable linear classifier that directly optimizes AUC using a gradient descent method. Experiments on real-life datasets show a high accuracy and efficiency of the polynomial approximation.




S. Jaroszewicz, L. Ivantysynova, T. Scheffer. Schema Matching on Streams with Accuracy Guarantees. Intelligent Data Analysis, 12(3), pages 253-270, 2008.

Abstract:
We address the problem of matching imperfectly documented schemas of data streams and large databases. Instance-level schema matching algorithms identify likely correspondences between attributes by quantifying the similarity of their corresponding values. However, exact calculation of these similarities requires processing of all database records--which is infeasible for data streams. We devise a fast matching algorithm that uses only a small sample of records, and is yet guaranteed to find a matching that is a close approximation of the matching that would be obtained if the entire stream were processed. The method can be applied to any given (combination of) similarity metrics that can be estimated from a sample with bounded error; we apply the algorithm to several metrics. We give a rigorous proof of the method's correctness and report on experiments using large databases.




S. Jaroszewicz, M. Korzeń. Approximating Representations for Large Numerical Databases. In 7th SIAM International Conference on Data Mining (SDM'07), pages 521-526, Minneapolis, MN, 2007.

Abstract:
The paper introduces a notion of support for realvalued functions. It is shown how to approximate supports of a large class of functions based on supports of so called polynomial itemsets, which can efficiently be mined using an Apriori-style algorithm. An upper bound for the error of such an approximation can be reliably computed. The concept of an approximating representation was introduced, which extends the idea of concise representations to numerical data. It has been shown that many standard statistical modelling tasks such as nonlinear regression and least squares curve fitting can efficiently be solved using only the approximating representation, without accessing the original data at all. Since many of those methods traditionally require several passes over the data, our approach makes it possible to use such methods with huge datasets and data streams where several repeated scans are very costly or outright impossible.




S. Jaroszewicz, L. Ivantysynova, T. Scheffer. Accurate Schema Matching on Streams. In 4th International Workshop on Knowledge Discovery from Data Streams at the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'06), pages 3-12, 2006.

Abstract:
We address the problem of matching imperfectly documented schemas of data streams and large databases. Instance-level schema matching algorithms identify likely correspondences between attributes by quantifying the similarity of their corresponding values. However, exact calculation of these similarities requires processing of all database records--which is infeasible for data streams. We devise a fast matching algorithm that uses only a small sample of records, and is yet guaranteed to match the most similar attributes with high probability. The method can be applied to any given (combination of) similarity metrics that can be estimated from a sample with bounded error; we apply the algorithm to several metrics. We give a rigorous proof of the method's correctness and report on experiments using large databases.




T. Calders, B. Goethals, S. Jaroszewicz. Mining Rank-Correlated Sets of Numerical Attributes. In Proc. of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06), 2006.

Abstract:
We study the mining of interesting patterns in the presence of numerical attributes. Instead of the usual discretization methods, we propose the use of rank based measures to score the similarity of sets of numerical attributes. New support measures for numerical data are introduced, based on extensions of Kendall's tau, and Spearman's Footrule and rho. We show how these support measures are related. Furthermore, we introduce a novel type of pattern combining numerical and categorical attributes. We give efficient algorithms to find all frequent patterns for the proposed support measures, and evaluate their performance on real-life datasets.




S. Jaroszewicz. Polynomial Association Rules with Applications to Logistic Regression. In Proc. of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06), 2006.

Abstract:
A new class of associations (polynomial itemsets and polynomial association rules) is presented which allows for discovering nonlinear relationships between numeric attributes without discretization. For binary attributes, proposed associations reduce to classic itemsets and association rules. Many standard association rule mining algorithms can be adapted to finding polynomial itemsets and association rules. We applied polynomial associations to add non-linear terms to logistic regression models. Significant performance improvement was achieved over stepwise methods, traditionally used in statistics, with comparable accuracy.




S. Jaroszewicz, M. Korzeń. Comparison of Information Theoretical Measures for Reduct Finding. In Proc. of the 8th International Conference on Artificial Intelligence and Soft Computing (ICAISC'06), pages 518-527, Zakopane, June, 2006.

Abstract:
The paper discusses the properties of an attribute selection criterion for building rough set reducts based on discernibilty matrix and compares it with Shannon entropy and Gini index used for building decision trees. It has been shown theoretically and experimentally that entropy and Gini index tend to work better if the reduct is later used for prediction of previously unseen cases, and the criterion based on the discernibility matrix tends to work better for learning functional relationships where generalization is not an issue.




D. A. Simovici, S. Jaroszewicz. A new metric splitting criterion for decision trees. Parallel Algorithms Appl., 21(4), pages 239-256, 2006.

Abstract:
We examine a new approach to building decision tree by introducing a geometric splitting criterion, based on the properties of a family of metrics on the space of partitions of a finite set. This criterion can be adapted to the characteristics of the data sets and the needs of the users and yields decision trees that have smaller sizes and fewer leaves than the trees built with standard methods and have comparable or better accuracy.




D. A. Simovici, S. Jaroszewicz. Generalized Conditional Entropy and a Metric Splitting Criterion for Decision Trees. In 10th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD'06), pages 35-44, Singapore, 2006.

Abstract:
We examine a new approach to building decision tree by introducing a geometric splitting criterion, based on the properties of a family of metrics on the space of partitions of a finite set. This criterion can be adapted to the characteristics of the data sets and the needs of the users and yields decision trees that have smaller sizes and fewer leaves than the trees built with standard methods and have comparable or better accuracy.




S. Jaroszewicz, T. Scheffer. Fast Discovery of Unexpected Patterns in Data, Relative to a Bayesian Network. In 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2005), pages 118-127, Chicago, IL, August, 2005.

Abstract:
We consider a model in which background knowledge on a given domain of interest is available in terms of a Bayesian network, in addition to a large database. The mining problem is to discover \em unexpected patterns: our goal is to find the strongest discrepancies between network and database. This problem is intrinsically difficult because it requires inference in a Bayesian network and processing the entire, potentially very large, database. A sampling-based method that we introduce is efficient and yet provably finds the approximatelymost interesting unexpected patterns. We give a rigorous proof of the method's correctness. Experiments shed light on its efficiency and practicality for large-scale Bayesian networks and databases.




D. Simovici, S. Jaroszewicz. A New Metric Splitting Criterion for Decision Trees. International Journal of Parallel, Emergent and Distributed Systems, 21(4), pages 239-256, August, 2006.

Abstract:
We examine a new approach to building decision tree by introducing a geometric splitting criterion, based on the properties of a family of metrics on the space of partitions of a finite set. This criterion can be adapted to the characteristics of the data sets and the needs of the users and yields decision trees that have smaller sizes and fewer leaves than the trees built with standard methods and have comparable or better accuracy.




S. Jaroszewicz, W. Kosiński. Machine Learning for Speech Recognition Researchers. In Proceedings of the Speech Analysis, Synthesis and Recognition Applications of Phonetics Conference, Kraków, Poland, September, 2005, Publication on CD-ROM.

Abstract:
In this paper we present an overview of machine learning and data mining methods which we believe are useful for researchers working in the area of speech recognition and analysis. We present selected algorithms for classification, clustering, sequence mining and rough sets based methods. Potential applications in speech recognition and analysis are also discussed. We hope that this paper will encourage speech recognition researchers to more frequently use machine learning and data mining algorithms.




M. Korzeń, S. Jaroszewicz. Finding Reducts Without Building the Discernibility Matrix.. In Proceedings of the Fifth International Conference on Intelligent Systems Design and Applications (ISDA'05), pages 450-455, Wroc³aw, Poland, 2005.

Abstract:
We present algorithms for fast generation of short reducts which avoid building the discernibility matrix explicitly. We show how information obtained from this matrix can be obtained based only on the distributions of attribute values. Since the size of discernibility matrix is quadratic in the number of data records, not building the matrix explicitly gives a very significant speedup and makes it possible to find reducts even in very large databases. Algorithms are given for both absolute and relative reducts. Experiments show that our approach outperforms other reduct finding algorithms. Furthermore it is shown that many heuristic reduct finding algorithms using the discernibility matrix in fact select attributes based on their Gini index. A new definition of conditional Gini index is presented, motivated by reduct finding heuristics.




S. Jaroszewicz, D. A. Simovici, I. Rosenberg. Measures on Boolean Polynomials and Their Applications in Data Mining. Discrete Applied Mathematics, 144(1-2), pages 123-139, November, 2004.

Abstract:
We characterize measures on free Boolean algebras and we examine the relationships that exist between measures and binary tables in relational databases. It is shown that these measures are completely defined by their values on positive conjunctions, and a formula that yields this value is obtained using the method of indicators. An extension of the notion of support that is well suited for tables with missing values is presented. Finally, we obtain Bonferroni-type inequalities that allow for approximative evaluations of these measures for several types of queries. An approximation algorithm and an analysis of the results produced is also included.




Szymon Jaroszewicz, Dan Simovici. Interestingness of Frequent Itemsets Using Bayesian Networks as Background Knowledge. In 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2004), pages 178-186, Seattle, WA, August, 2004.

Abstract:
The paper presents a method for pruning frequent itemsets based on background knowledge represented by a Bayesian network. The interestingness of an itemset is defined as the absolute difference between its support estimated from data and from the Bayesian network. Efficient algorithms are presented for finding interestingness of a collection of frequent itemsets, and for finding all attribute sets with a given minimum interestingness. Practical usefulness of the algorithms and their efficiency have been verified experimentally.




S. Jaroszewicz, D. A. Simovici, W. P. Kuo, L. Ohno-Machado. The Goodman-Kruskal Ceofficient and its Applications in Genetic Diagnosis of Cancer. IEEE Transactions on Biomedical Engineering, 51(7), pages 1095-1102, July, 2004.

Abstract:
Increasing interest in new pattern recognition methods has been motivated by bioinformatics research. The analysis of gene expression data originated from microarrays constitutes an important application area for classification algorithms and illustrates the need for identifying important predictors. We show that the Goodman-Kruskal coefficient can be used for constructing minimal classifiers for tabular data, and we give an algorithm that can construct such classifiers.




Dan A. Simovici, Szymon Jaroszewicz. Generalized Conditional Entropy and Decision Trees. In Journees francophones d'Extraction et de Gestion de Connaissances (EGC 2003), pages 369-380, Lyon, France, January, 2003.

Abstract:
We introduce an extension of the notion of Shannon conditional entropy to a more general form of conditional entropy that captures both the conditional Shannon entropy and a similar notion related to the Gini index. The proposed family of conditional entropies generates a collection of metrics over the set of partitions of finite sets, which can be used to construct decision trees. Experimental results suggest that by varying the parameter that defines the entropy it is possible to obtain smaller decision trees for certain databases without sacrificing accurracy.




S. Jaroszewicz, D. A. Simovici. Support Approximations Using Bonferroni-type Inequalities. In 6th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD 2002), pages 212-223, Helsinki, Finland, August, 2002.

Abstract:
The purpose of this paper is to examine the usability of Bonferroni-type combinatorial inequalities to estimation of support of itemsets as well as general Boolean expressions. Families of inequalities for various types of Boolean expressions are presented and evaluated experimentally.




S. Jaroszewicz, D. A. Simovici. Pruning Redundant Association Rules Using Maximum Entropy Principle. In Advances in Knowledge Discovery and Data Mining, 6th Pacific-Asia Conference, PAKDD'02, pages 135-147, Taipei, Taiwan, May, 2002.

Abstract:
Data mining algorithms produce huge sets of rules, practically impossible to analyze manually. It is thus important to develop methods for removing redundant rules from those sets. We present a solution to the problem using the Maximum Entropy approach. The problem of efficiency of Maximum Entropy computations is addressed by using closed form solutions for the most frequent cases. Analytical and experimental evaluation of the proposed technique indicates that it efficiently produces small sets of interesting association rules.




I. Rosenberg, D. A. Simovici, S. Jaroszewicz. On Functions Defined on Free Boolean Algebras. In IEEE Intenrational Symposiun on Multiple-Valued Logic ISMVL'02, Boston, MA, May, 2002.

Abstract:
We characterize measures on free Boolean algebras and we examine the relationships that exists between measures and binary tables in relational databases. It is shown that these measures are completely defined by their values on positive conjunctions and an algorithm that leads to the construction of measures starting from its values on positive conjunction is also given, including a formula that allows the evaluation of measures for arbitrary polynomials. Finally, we study pairs of measures generated by ternary tables, that is, by tables that contain missing or unknown values.




S. Jaroszewicz, D. A. Simovici, I. Rosenberg. An Inclusion-Exclusion Result for Boolean Polynomials and Its Applications in Data Mining. In Workshop on Discrete Mathematics and Data Mining (DM&DM), Second SIAM International Conference on Data Mining, Arlington, VA, April, 2002.

Abstract:
We characterize measures on free Boolean algebras and we examine the relationships that exist between measures and binary tables in relational databases. It is shown that these measures are completely defined by their values on positive conjunctions, and a formula that yields this value is obtained using the method of indicators. We also obtain Bonferroni-type inequalities that allow for approximative evaluations of these measures. Finally we present a measure extending the notion of support that is well suited for tables with missing values.




D. A. Simovici, S. Jaroszewicz. An Axiomatization of Partition Entropy. IEEE Transactions on Information Theory, 48(7), pages 2138-2142, July, 2002.

Abstract:
The aim of this paper is to present an axiomatization of a generalization of Shannon's entropy starting from partitions of finite sets. The proposed axiomatization yields as special cases the Havrda-Charvat entropy, and thus, provides axiomatizations for the Shannon entropy, the Gini index, and for other types of entropy used in classification and data mining.




S. Jaroszewicz, D. A. Simovici. A General Measure of Rule Interestingness. In 5th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD 2001), pages 253-265, 2001.

Abstract:
The paper presents a new general measure of rule interestingness. Many known measures such as chi-square, gini gain or entropy gain can be obtained from this measure by setting some numerical parameters, including the amount of trust we have in the estimation of the probability distribution of the data. Moreover, we show that there is a continuum of measures having chi-square, Gini gain and entropy gain as boundary cases. Therefore our measure generalizes both conditional and unconditional classical measures of interestingness. Properties and experimental evaluation of the new measure are also presented.




D. A. Simovici, S. Jaroszewicz. On Information-Theoretical Aspects of Relational Databases. In Finite versus Infinite, Springer Verlag, London, 2000.

Abstract:
We introduce the notion of entropy for a set of attributes of a table in a relational database starting from the notion of entropy for finite functions. We examine the connections that exist between conditional entropies of attribute sets and lossy decompositions of tables and explore the properties of the entropy of sets of attributes regarded as an outer measure on the set of subsets of a heading of a table. Finally, we suggest a generalization of functional dependencies based on conditional entropy.




S. Jaroszewicz, D. A. Simovici. On Axiomatization of Conditional Entropy of Functions Between Finite Sets. In Proceedings of the 29th International Symposium on Multiple-Valued Logic, pages 24-28, Freiburg, Germany, 1999.

Abstract:
In this paper we present a new axiomatization of the notion of entropy of functions between finite sets and we introduce and axiomatize the notion of conditional entropy between functions. The results can be directly applied to logic functions, which can be regarded as functions between finite sets. Our axiomatizations are based on properties of entropy with regard to operations commonly applied to discrete functions and are related to the usage of entropy as a measure of the energy dissipated by circuits that implement discrete functions.