ARMiner Frequently Asked Questions
Version 1.02 (last updated on December 5th 2001 by Laurentiu Cristofor)
_____________________________________________________________________

Q: What is the purpose of this document?
A: After getting several emails that asked similar questions, I
realized that it would be better to collect answers to "frequently
asked questions" in one document, in the hope that visitors of the
ARMiner page may get answers to their questions quickly (and so that
I don't have to repeat explanations often :)). So here are the
questions that were asked most often and the answers I provided for
them:
_____________________________________________________________________

Q: What are the ARMiner requirements?
A: You need a Java 1.2 distribution in order to run and/or compile
ARMiner. It does not run or compile with Java 1.1 or earlier versions.
_____________________________________________________________________

Q: How do I set up ARMiner?
A: Just uncompress the tar.gz file in some directory and then launch
the server and the client as described in the readme.txt file. The
files needed by the server are the ones in bin/Server/ and the files
needed by the client are in bin/Client/.
_____________________________________________________________________

Q: How can I make my own data available to ARMiner?
A: Get the db_asc_tools.tar.gz archive from the add-ons section of
this site. The tutorial included in the archive should help you. The
basic procedure is to set up your data in an ASCII file as described
in the tutorial and then to use the asc2db program to create a .db
file. Then you can upload the .db file to the ARMiner server by
selecting Database/Add in ARMiner's menu.
_____________________________________________________________________

Q: Isn't there a simpler way of importing data into ARMiner's format?
A: Unfortunately, no. Although we thought of writing a more complex
tool, in the end we did not have enough time for it.
Therefore you will need to either use the asc2db tool or write your
own Java conversion program.
_____________________________________________________________________

Q: I'd like to use my own data with ARMiner, can you help me out?
A: If I could use your data in the research that I am doing then I
might help you convert it to ARMiner's format, but otherwise you will
have to rely only on the db_asc_tools.tar.gz archive and your own Java
skills.
_____________________________________________________________________

Q: How do I recompile ARMiner?
A: To recompile ARMiner you need a Java 1.2 distribution and,
optionally, a make utility. The following explanations assume that you
have a make program; if you do not, you can read the contents of the
makefile to see what commands you need to issue to compile the sources
and build the jar files.

To compile the server:

1. Copy src/server/* and src/common/* to some directory, let's say
   buildServer/.
2. Copy the makefile from the add-ons section of the ARMiner website
   to directory buildServer/.
3. Run 'make' or 'make allServer' while in directory buildServer/.
4. You should now have two jar files: Server.jar and DBConfig.jar.
   These, together with the contents of the DB/ directory, are what
   you need to run the server.
5. You can optionally type 'make clean' to delete all compiled Java
   files.

To compile the client:

1. Copy src/client/* and src/common/* to some directory, let's say
   buildClient/.
2. Copy the makefile from the add-ons section of the ARMiner website
   to directory buildClient/.
3. Run 'make allClient' while in directory buildClient/.
4. You should now have one jar file: Client.jar. This file and the
   files first.gif and last.gif are what you need to run the client.
5. You can optionally type 'make clean' to delete all compiled Java
   files.
_____________________________________________________________________

Q: What are known problems with ARMiner? Can you add feature X?
A: Since ARMiner was released I have focused on eliminating all major
errors, and I think I am pretty much done with this stage. There are
many things, however, that are lacking and that could be frustrating
for someone who expects a complete product. But ARMiner was never
intended to be equivalent to a commercial association rule mining
application. Its intent was to provide people with a tool for
experimenting with and exploring association rules.

As of December 5th 2001, I am redirecting my efforts to the completion
of ARtool (www.cs.umb.edu/~laur/ARtool/). ARtool will contain updated
versions of the core files of ARMiner and newer algorithms. I also
intend to write a better interface that offers more functionality. I
will still fix outstanding errors in ARMiner if they are reported to
me, but I do not intend to add any new features in the near future.
_____________________________________________________________________

Q: What are the usual times it takes to mine a database?
A: The time taken to mine a database depends on four factors:

a) the size, that is the number of rows (tuples), of the database. All
algorithms included with ARMiner scale linearly with the size of the
database. This means that if you double the size of the database, the
time taken by the algorithms will also double.

b) the number of attributes of the database. In the worst case the
time taken by the algorithms will increase exponentially with respect
to the number of attributes.

c) the minimum support specified for the mining. The lower this value,
the longer the algorithm will take to execute. Here again the time can
increase dramatically as the support decreases.

d) the density of the database. The databases used by ARMiner
represent binary data, that is, a row either has or does not have a
given item/attribute. A database like this could be represented as a
matrix of 0s and 1s, with rows corresponding to the rows of the
database and columns corresponding to the attributes/items.
A 1 would indicate the presence of an attribute in a row, a 0 would
indicate its absence. The density of the database refers to the
density of this matrix, that is, the fraction of its entries that are
1. The denser the database is, the longer it will take to mine, other
factors being constant. The algorithms that come with ARMiner were
applied mainly to sparse (low density) databases, like supermarket
data, and they do not perform very well on dense data, i.e. they are
slow on dense data.

As you can see, there are plenty of factors that influence the time
taken by a mining process, so it is not easy to predict how long a
mining operation will take. You have to know both the data and the
parameters that you are using to get a rough idea of how much time it
will take for the results to be computed.
_____________________________________________________________________

Q: Why is ARMiner so slow for some mining operations? Is it because
the algorithms are inefficient?
A: No. Unfortunately, the problem of finding all association rules is
a complex one and there is no known efficient solution for it. The
problem is basically NP-complete, which means that the worst-case
running time of every known algorithm is exponential, which is really
bad. The best you can hope for is to improve the efficiency of the
algorithms by constant factors. For example, you may notice that most
of the time the Closure algorithm is about twice as fast as Apriori.
You cannot get rid of the exponential worst case unless someone proves
some day that P=NP (whether P is equal to NP or not is a great open
problem in computer science theory, P and NP being classes of
problems; for more information see a book on computer science theory).

So the short answer is: the algorithms are among the most efficient
currently known, but the problem is difficult and, no matter what
algorithm or implementation you use, there will be databases on which
it can take days (or less if you run out of memory :)) to get a
result.
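
To make the quantities in the last two answers more concrete, here is
a minimal Java sketch. It is not part of ARMiner: the class name
MiningCostSketch and its methods are made up for illustration. It
computes the density of a small binary database, the support of an
itemset, and the number of non-empty itemsets over n attributes
(2^n - 1), which is the source of the exponential worst case. It uses
for-each loops, so it needs a newer JDK than the Java 1.2 that ARMiner
itself requires.

```java
// Hypothetical illustration only; none of these names come from ARMiner.
class MiningCostSketch {

    // Density: fraction of 1s in the binary matrix
    // (rows = database rows, columns = items/attributes).
    static double density(boolean[][] db) {
        int ones = 0;
        int cells = 0;
        for (boolean[] row : db) {
            for (boolean present : row) {
                if (present) {
                    ones++;
                }
                cells++;
            }
        }
        return cells == 0 ? 0.0 : (double) ones / cells;
    }

    // Support: fraction of rows that contain every item of the itemset.
    // Items are identified by their column index.
    static double support(boolean[][] db, int[] itemset) {
        if (db.length == 0) {
            return 0.0;
        }
        int hits = 0;
        for (boolean[] row : db) {
            boolean containsAll = true;
            for (int item : itemset) {
                if (!row[item]) {
                    containsAll = false;
                    break;
                }
            }
            if (containsAll) {
                hits++;
            }
        }
        return (double) hits / db.length;
    }

    // Number of non-empty itemsets over n attributes: 2^n - 1.
    static long candidateItemsets(int n) {
        if (n < 0 || n > 62) {
            throw new IllegalArgumentException("n must be between 0 and 62");
        }
        return (1L << n) - 1;
    }

    public static void main(String[] args) {
        boolean[][] db = {
            {true,  true,  false, false},
            {true,  false, false, false},
            {true,  true,  true,  false},
            {false, false, false, true}
        };
        System.out.println("density = " + density(db));                      // 0.4375
        System.out.println("support({0,1}) = " + support(db, new int[] {0, 1})); // 0.5
        System.out.println("itemsets over 20 attributes = "
                + candidateItemsets(20));                                    // 1048575
    }
}
```

With only 20 attributes there are already more than a million possible
itemsets, which is why a low minimum support on a dense database can
make the mining time explode.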
_____________________________________________________________________

Q: How should I start mining a database?
A: Here are a few pieces of advice to help you start mining a
database. First you should get to know the characteristics of your
data: how many rows and attributes it has, what its density is, etc.
Start mining using a high minimum support value, let's say 0.9 (if you
know your data is not dense you can start with a lower value, 0.2 for
example). If you get no rules or very few rules then decrease this to
0.8, and so on. At some point (it will happen sooner for denser
databases) you will start getting plenty of rules to satisfy your
search. Personally, I have very rarely mined databases for supports
smaller than 1% (0.01 as expected by ARMiner). Playing a little with
the supports will give you a feeling for what will take time and what
will work fast.

Limiting the number of attributes you are interested in will not speed
up the mining much. This is due to the fact that ARMiner caches the
frequent itemsets and therefore searches for all of them anyway. The
speed-up obtained in the rule generation procedure will be hard to
notice since this part is already quite fast. The caching mechanism
does provide an advantage, however: once you have mined a database for
minimum support x, you will never need to repeat the process if your
mining uses minimum supports higher than or equal to x.

The last piece of advice is to use FPgrowth, which is the fastest
algorithm currently implemented for ARMiner.
_____________________________________________________________________

Q: Can I contribute to ARMiner?
A: Actually nobody has asked this question yet. But if you are
interested in contributing to ARMiner then you are welcome. Get in
touch with me and we can discuss the improvements you would like to
add.
_____________________________________________________________________

Thank you for reading this document. If you have any questions do not
hesitate to send me email at laur@cs.umb.edu.

Laurentiu Cristofor