Publications

2009

Controlling Patterns of Geospatial Phenomena

T. Stepinski, W. Ding, C. Eick, submitted to a journal, 2009


Large-scale Dependency Knowledge Acquisition and its Extrinsic Evaluation through Word Sense Disambiguation

P. Chen, W. Ding, D. Brown, C. Bowes, International Conference on Tools with Artificial Intelligence (ICTAI 2009), to appear, New Jersey, USA, November, 2009


A Fully Unsupervised Word Sense Disambiguation Method and Its Evaluation on Coarse-grained All-words Task

P. Chen, W. Ding, C. Bowes, D. Brown, North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL HLT 2009), Boulder, Colorado, May 2009


A Gaze-Controlled Interface to Virtual Reality Applications for Motor- and Speech-Impaired Users

W. Ding, P. Chen, H. Al-Mubaid, M. Pomplun, “A Gaze-Controlled Interface to Virtual Reality Applications for Motor- and Speech-Impaired Users,” HCI International 2009, San Diego, CA, July 2009.

This project aims to overcome the access barriers to virtual worlds for motor- and speech-impaired users by building a gaze-controlled interface for Second Life that will enable them to interact with the virtual world by just moving their eyes. We have conducted a study to assess (1) how much a word prediction technique facilitates gaze-controlled, chat-style text input in virtual worlds, (2) the influence of screen layout on the efficiency of text input, (3) the effect of the maximum number of suggested words on typing efficiency, and (4) the performance of non-disabled vs. motor-impaired users. Non-disabled subjects and Amyotrophic Lateral Sclerosis (ALS) patients participated in our experiment. Experimental results show that, on average, the patients took less time and made fewer corrections per letter than the non-disabled subjects did. This finding suggests that our interface design is suitable for motor-impaired users.


A Lexical Knowledge Representation Model for Natural Language Understanding

P. Chen, W. Ding, C. Ding, "A Connectionist-based Lexical Knowledge Model", the International Journal of Cognitive Informatics and Natural Intelligence (IJCiNi), to appear, 2009.

Knowledge representation is essential for semantics modeling and intelligent information processing. For decades researchers have proposed many knowledge representation techniques. However, capturing deep semantic information effectively while supporting the efficient construction of a large-scale knowledge base remains a daunting problem. This paper describes a new knowledge representation model, SenseNet, which provides semantic support for commonsense reasoning and natural language processing. SenseNet is formalized with a Hidden Markov Model. An inference algorithm is proposed to simulate a human-like natural language understanding procedure. A new measurement, confidence, is introduced to facilitate natural language understanding. We present a detailed case study of applying SenseNet to retrieving compensation information from company proxy filings.


Word Classification: An Experimental Approach with Naive Bayes

W. Ding, H. Al-Mubaid, S. Kotagiri, the ISCA 24th International Conference on Computers and Their Applications (CATA-2009), to appear, New Orleans, Louisiana, April, 2009

Word classification is of significant interest in the domain of natural language processing, and it has direct applications in information retrieval and knowledge discovery. This paper presents an experimental method using Naïve Bayes for word classification. The method is based on combining successful feature selection techniques, namely Mutual Information and Chi-Square, with Naïve Bayes for word classification. We utilize the advances in feature-selection techniques in information retrieval and propose an efficient method to select key features for term identification and classification. We evaluate the method using real-world texts taken from Wall Street Journal news articles. The experimental results show that the method is fairly effective and competitive for word classification.
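The two feature-selection scores named in this abstract can be illustrated with a minimal sketch. It assumes the textbook 2x2 contingency-table formulation of Chi-Square and pointwise Mutual Information from the text categorization literature; the paper's exact formulation may differ, and the function names and the toy counts are hypothetical.

```python
import math

def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 feature/class contingency table.

    a: samples in the class that contain the feature
    b: samples outside the class that contain the feature
    c: samples in the class without the feature
    d: samples outside the class without the feature
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: feature or class never varies
    return n * (a * d - b * c) ** 2 / denom

def pointwise_mi(a, b, c, d):
    """Pointwise mutual information (in bits) between feature and class."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # feature never co-occurs with the class
    return math.log2(a * n / ((a + b) * (a + c)))

def top_k_features(tables, k, score=chi_square):
    """Rank features by score and keep the k best (hypothetical helper)."""
    ranked = sorted(tables, key=lambda f: score(*tables[f]), reverse=True)
    return ranked[:k]

# Toy counts, invented for illustration only.
tables = {"stock": (10, 0, 0, 10), "the": (5, 5, 5, 5)}
selected = top_k_features(tables, 1)
```

The selected features would then serve as the inputs to a Naïve Bayes classifier; that step is omitted here.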


Discovery of Geospatial Discriminating Patterns from Remote Sensing Datasets

W. Ding, T. Stepinski, J. Salazar, "Discovery of Geospatial Discriminating Patterns from Remote Sensing Datasets", in SIAM International Conference on Data Mining (SDM), Nevada, April 2009.

Large amounts of remotely sensed data call for data mining techniques to fully utilize their rich information content. In this paper, we study new means of discovery and summarization of knowledge contained in the spatial patterns of remote sensing datasets. Several geospatial feature variables are fused together, and the vector of their values at each spatial cell is considered as a transaction to be used in association analysis. The concept of emerging patterns is applied to ascertain the variables that exert dominant influence on the distribution of a selected class variable. A new value-iteration method is introduced to optimally split the spatial domain of the selected variable into two classes. This division is used to calculate the set of patterns that are emerging with respect to the two classes; these patterns are the controlling factors, that is, they are responsible for the spatial distribution of the class variable. A method for a concise summarization of controlling factors is introduced using a similarity measure that is custom-made for the type of patterns stemming from remote sensing measurements. Using such a similarity measure, controlling factors are clustered, providing a brief description of the different ways in which the class variable is constrained by the explanatory variables. We evaluate our method in a real-world application pertaining to the density of vegetation within the continental United States. Examination of patterns related to high vegetation cover provides a summary of data dependencies that helps to develop a better empirical model of vegetation growth.
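As a rough illustration of the emerging-pattern idea, the sketch below computes the growth rate of a pattern between two classes of transactions, using the standard definition (ratio of supports). It is not the paper's implementation: the item labels, the two toy classes, and the function names are all hypothetical.

```python
def support(pattern, transactions):
    """Fraction of transactions that contain every item in the pattern."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if pattern <= t)
    return hits / len(transactions)

def growth_rate(pattern, target, background):
    """Support in the target class divided by support in the background class."""
    s_target = support(pattern, target)
    s_background = support(pattern, background)
    if s_background == 0:
        return float("inf") if s_target > 0 else 0.0
    return s_target / s_background

# Each transaction stands for one spatial cell; items are invented labels.
high_cover = [
    frozenset({"rain=high", "temp=mid"}),
    frozenset({"rain=high", "temp=low"}),
    frozenset({"rain=high", "temp=mid"}),
    frozenset({"rain=low", "temp=mid"}),
]
low_cover = [
    frozenset({"rain=low", "temp=mid"}),
    frozenset({"rain=low", "temp=high"}),
    frozenset({"rain=high", "temp=high"}),
    frozenset({"rain=low", "temp=low"}),
]

rate = growth_rate(frozenset({"rain=high"}), high_cover, low_cover)
```

A pattern whose growth rate exceeds a chosen threshold would count as emerging for the target class.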


Discovery of Feature-Based Hot Spots Using Supervised Clustering

W. Ding, T. Stepinski, R. Parmar, D. Jiang, C. F. Eick, "Discovery of Feature-Based Hot Spots Using Supervised Clustering", in the International Journal of Computers and Geosciences, Elsevier, March 2009.

Feature-based hot spots are localized regions where the attributes of objects attain high values. There is considerable interest in automatic identification of feature-based hot spots. This paper approaches the problem of finding feature-based hot spots from a data mining perspective, and describes a method that relies on supervised clustering to produce a list of hot spot regions. Supervised clustering uses a fitness function rewarding isolation of the hot spots to optimally subdivide the dataset. The clusters in the optimal division are ranked by their interestingness, which encapsulates their utility as hot spots. Hot spots are associated with the top-ranked clusters. The effectiveness of supervised clustering as a hot spot identification method is evaluated for four conceptually different clustering algorithms using a dataset describing the spatial distribution of ground ice on Mars. Clustering solutions are visualized by specially developed raster approximations. Further assessment of the ability of different algorithms to yield hot spots is performed using raster approximations. The density-based clustering algorithm is found to be the most effective for hot spot identification. The results of the hot spot discovery by supervised clustering are comparable to those obtained using the G* statistic, but the new method offers a high degree of automation, making it an ideal tool for mining large datasets for the existence of potential hot spots.

2008

A Framework for Regional Association Rule Mining and Scoping in Spatial Datasets

W. Ding, C.F. Eick, X. Yuan, J. Wang, J.P. Nicot, submitted to a journal, 2008


Parsing Tree Matching Based Question Answering

P. Chen, W. Ding, T. Simmons, C. Lacayo, “Parsing Tree Matching Based Question Answering”, Text Analysis Conference (TAC) Workshop, Gaithersburg, Maryland USA, November, 2008.

This paper describes the question answering system that participated in the Question Answering track of the Text Analysis Conference 2008, organized by the National Institute of Standards and Technology. Our question answering system attempts to use a human style of logic to search its document sources and return possible answers to a question.


Discovering Controlling Factors of Geospatial Variables

T.F. Stepinski, W. Ding and C.F. Eick, "Discovering Controlling Factors of Geospatial Variables", in Proc. of the 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM GIS 2008), Irvine, CA, USA, November, 2008.

Efficient means of determining factors controlling spatial distribution of an environmental class variable are of significant interest in Earth science. In this paper, we present a method for automated discovery of controlling factors by mining for emerging patterns in a database constructed from the fusion of several explanatory datasets. We introduce a new definition of pattern support to account for spatial character of the data and systematically evaluate the effectiveness of our technique using a real-world application pertaining to density of vegetation cover. Experimental results show that our method can successfully identify controlling factors for the presence of high vegetation cover.


Finding Regional Co-Location Patterns for Sets of Continuous Variables

C. F. Eick, R. Parmar, W. Ding, T. F. Stepinski, J. P. Nicot, "Finding Regional Co-Location Patterns for Sets of Continuous Variables", in Proc. of the 16th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM GIS 2008), Irvine, CA, USA, November, 2008.

This paper proposes a novel framework for mining regional co-location patterns with respect to sets of continuous variables in spatial datasets. The goal is to identify regions in which multiple continuous variables with values from the wings of their statistical distribution are co-located. A co-location mining framework is introduced that operates in the continuous domain and views regional co-location mining as a clustering problem in which an externally given fitness function has to be maximized. Interestingness of co-location patterns is assessed using products of z-scores of the relevant continuous variables. The proposed framework is evaluated by a domain expert in a case study that analyzes arsenic contamination in Texas water wells, centering on regional co-location patterns. Our approach is able to identify known and unknown regional co-location patterns, and different sets of algorithm parameters lead to the characterization of arsenic distribution at different scales. Moreover, inconsistent co-location sets are found for regions in South Texas and West Texas that can be clearly attributed to geological differences in the two regions, emphasizing the need for regional co-location mining techniques. In addition, a novel, prototype-based region discovery algorithm named CLEVER is introduced that uses randomized hill climbing and searches over a variable number of clusters and larger neighborhood sizes.
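The z-score product mentioned in this abstract can be sketched as follows. This is a simplified, per-object illustration only; the paper's actual interestingness function is defined over regions and is not reproduced here. The variable names and values are hypothetical, and the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize values using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant variable carries no signal
    return [(v - mu) / sigma for v in values]

def z_product(columns):
    """Per-object product of z-scores across the given variables."""
    standardized = [z_scores(col) for col in columns]
    products = []
    for row in zip(*standardized):
        p = 1.0
        for z in row:
            p *= z
        products.append(p)
    return products

# Two hypothetical continuous variables measured at three locations.
var_a = [1.0, 2.0, 3.0]
var_b = [1.0, 2.0, 3.0]
products = z_product([var_a, var_b])
```

With two variables, the product is large and positive when both sit in the same wing of their distributions (both high or both low), and close to zero when either is near its mean.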


Towards Regional Knowledge Discovery in Spatial Datasets

W. Ding, R. Jiamthapthaksin, R. Parmar, D. Jiang, T. F. Stepinski, and C. F. Eick, "Towards Regional Knowledge Discovery in Spatial Datasets", the Pacific-Asia Conf. on Knowledge Discovery and Data Mining, Osaka, Japan, May, 2008.

This paper presents a novel region discovery framework geared towards finding scientifically interesting places in spatial datasets. We view region discovery as a clustering problem in which an externally given fitness function has to be maximized. The framework adapts four representative clustering algorithms, exemplifying prototype-based, grid-based, density-based, and agglomerative clustering, and we systematically evaluate the four algorithms in a real-world case study. The task is to find feature-based hotspots where extreme densities of deep ice and shallow ice co-locate on Mars. The results reveal that the density-based algorithm outperforms the other algorithms inasmuch as it discovers more regions with higher interestingness, that the grid-based algorithm can provide acceptable solutions quickly, and that the agglomerative clustering algorithm performs best at identifying larger regions of arbitrary shape. Moreover, the results indicate that there are only a few regions on Mars where shallow and deep ground ice co-locate, suggesting that they have been deposited at different geological times.


An Interactive Visualization Model for Large High-Dimensional Datasets

W. Ding, Ping Chen, "An Interactive Visualization Model for Large High-Dimensional Datasets: A Case Study", Data Engineering: Mining, Information, and Intelligence. Editors: Yupo Chan, John Talburt, Terry Talley, Springer, 2008.

Data visualization gives a direct view of complex data, which is especially helpful for the analysis of large high-dimensional datasets. However, existing methods often lose simplicity and clarity when rendering large amounts of complex data. In this paper, we discuss some essential properties that a data visualization system should have. We also present an interactive data visualization model that can effectively and efficiently visualize large high-dimensional datasets. We evaluate our system with an oil exploration dataset.

2007

On Regional Association Rule Scoping

W. Ding, C. Eick, X. Yuan, J. Wang, and J.P. Nicot, "On Regional Association Rule Scoping", in Proc. of the International Workshop on Spatial and Spatio-temporal Data Mining in Cooperation with IEEE ICDM 2007, Omaha, NE, USA, October, 2007

A special challenge for spatial data mining is that information is not distributed uniformly in spatial data sets. Consequently, the discovery of regional knowledge is of fundamental importance. Unfortunately, regional patterns frequently fail to be discovered due to insufficient global confidence and/or support in traditional association rule mining. Regional association rules, by definition, only hold in a subspace but not in the global space. One novel challenge is how to evaluate the impact of regional association rules. This paper centers on regional association rule scoping. We introduce a reward-based region discovery framework that employs clustering to find places where regional association rules are valid. We evaluate our approach in a real-world case study to discover arsenic risk zones in the Texas water supply. The experimental results are validated by domain experts and compared with published results on arsenic contamination.
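The gap between global and regional rule quality described in this abstract can be illustrated with the standard definitions of support and confidence. The sketch below is not the paper's reward-based framework; the items, the two toy regions, and the function name are hypothetical.

```python
def rule_stats(antecedent, consequent, transactions):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    n = len(transactions)
    if n == 0:
        return 0.0, 0.0
    with_antecedent = sum(1 for t in transactions if antecedent <= t)
    with_both = sum(
        1 for t in transactions if antecedent <= t and consequent <= t
    )
    support = with_both / n
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    return support, confidence

# Invented data: each transaction stands for one water well.
region_a = [frozenset({"depth=shallow", "arsenic=high"})] * 4
region_b = (
    [frozenset({"depth=shallow", "arsenic=low"})] * 4
    + [frozenset({"depth=deep", "arsenic=low"})] * 8
)

antecedent = frozenset({"depth=shallow"})
consequent = frozenset({"arsenic=high"})

regional = rule_stats(antecedent, consequent, region_a)
overall = rule_stats(antecedent, consequent, region_a + region_b)
```

In this toy data the rule holds perfectly inside region A but has only 0.25 support and 0.5 confidence over the whole dataset, so a global miner with higher thresholds would miss it.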


Mining Regional Knowledge in Spatial Datasets

W. Ding, C. Eick, "Mining Regional Knowledge in Spatial Datasets", in Proc. of Grace Hopper Celebration of Women in Computing, Orlando, FL, October 2007.

My research interests lie in the field of spatial data mining and its applications in geosciences and planetary sciences. Spatial data mining has been identified as a key technology to automate the extraction of interesting, useful, but implicit patterns in large spatial datasets. Firstly, I work on finding feature-based hot spots in multivariate, real-valued datasets. The method is empirically evaluated on a real-world database of ground ice on Mars. Secondly, I am interested in regional association rule mining and scoping. My current project is to identify hot spots of arsenic in the Texas water supply and to discover what causes high arsenic concentrations in Texas. In summary, my PhD research centers on constructing a region discovery framework to systematically discover regional patterns and apply it to real-world applications in planetary and earth sciences.

2006

A Framework for Regional Association Rule Mining in Spatial Datasets

W. Ding, C. Eick, J. Wang, and X. Yuan, "A Framework for Regional Association Rule Mining in Spatial Datasets", in Proc. of the 6th IEEE International Conference on Data Mining (IEEE-ICDM'06), Hong Kong, China, December, 2006.

The immense explosion of geographically referenced data calls for efficient discovery of spatial knowledge. One of the special challenges for spatial data mining is that information is usually not uniformly distributed in spatial datasets. Consequently, the discovery of regional knowledge is of fundamental importance for spatial data mining. This paper centers on discovering regional association rules in spatial datasets. In particular, we introduce a novel framework to mine regional association rules relying on a given class structure. A reward-based regional discovery methodology is introduced, and a divisive, grid-based supervised clustering algorithm is presented that identifies interesting subregions in spatial datasets. Then, an integrated approach is discussed to systematically mine regional rules. The proposed framework is evaluated in a real-world case study that identifies spatial risk patterns from arsenic in the Texas water supply.


SenseNet: A Knowledge Representation Model for Computational Semantics

P. Chen, W. Ding, C. Ding, "SenseNet: A Knowledge Representation Model for Computational Semantics", in Proc. of the 5th IEEE International Conference on Cognitive Informatics, Beijing, China, July, 2006.

Knowledge representation is essential for semantics modeling and intelligent information processing. For decades researchers have proposed many knowledge representation techniques. However, capturing deep semantic information effectively while supporting the efficient construction of a large-scale knowledge base remains a daunting problem. This paper describes a new knowledge representation model, SenseNet, which provides semantic support for commonsense reasoning and natural language processing. SenseNet is formalized with a Hidden Markov Model. An inference algorithm is proposed to simulate a human-like text analysis procedure. A new measurement, confidence, is introduced to facilitate the text analysis. We present a detailed case study of applying SenseNet to retrieving compensation information from company proxy filings.

2005

Parametric Surface Denoising

I.A. Kakadiaris, I. Konstantinidis, E. Papadakis, W. Ding, D.J. Kouri, and D.K. Hoffman, "Parametric Surface Denoising", in Proc. of SPIE Wavelets XI, E. Papadakis, A. Laine, M. Unser (Eds), San Diego, CA, USA, July, 2005.

Three-dimensional (3D) surfaces can be sampled parametrically in the form of range image data. Smoothing/denoising of such raw data is usually accomplished by adapting techniques developed for intensity image processing, since both range and intensity images comprise parametrically sampled geometry and appearance measurements, respectively. We present a transform-based algorithm for surface denoising, motivated by our previous work on intensity image denoising, which utilizes a non-separable Parseval frame and an ensemble thresholding scheme. The frame is constructed from separable (tensor) products of a piecewise linear spline tight frame and incorporates the weighted average operator and the Sobel operators in directions that are integer multiples of 45°. We compare the performance of this algorithm with other transform-based methods from the recent literature. Our results indicate that such transform methods are suited to the task of smoothing range images.


Web-based Interactive Visualization of Data Cubes

X. Wang, P. Chen, and W. Ding, "Web-based Interactive Visualization of Data Cubes", in Proc. the 2005 International Conference on Modeling, Simulation and Visualization Methods (MSV'05), Las Vegas, USA, June, 2005.

The data cube is an effective technique for data mining. Because of the complex relationships among the aggregation values of a data cube, designing an efficient method or tool to visualize these relationships is a challenging task. Information visualization with computer graphics can help improve this process. Recently, we developed a Web-based interactive data cube visualization system that can be applied to visualize a single data cube or parallel data cubes conveniently on the Web. This paper presents the basic principle, structure and features of the system.


Using a Pre-Assessment Exam to Construct an Effective Concept-based Genetic Program for Predicting Course Success

G. Boetticher, W. Ding, C. Moen, and K. Yue, "Using a Pre-Assessment Exam to Construct an Effective Concept-based Genetic Program for Predicting Course Success", In Proc. of the 36th SIGCSE Technical Symposium on Computer Science Education (ACM SIGCSE'05), pp. 500-504, St. Louis, Missouri, USA, Feb. 2005.

There is a limit on the amount of time a faculty member may devote to each student. As a consequence, a faculty member must quickly determine which students need more attention than others throughout a semester. One of the most demanding courses in the CS curriculum is the data structures course. This course tends to have high drop rates at our university. A pre-assessment exam is developed for the data structures class in order to provide feedback to both faculty and students. This exam helps students determine how well prepared they are for the course. In order to determine a student's chance of success in this course, a Genetic Program-based experiment is constructed based upon the pre-assessment exam. The result is a model that produces an average accuracy of 79 percent.

2004

Design and Evolution of an Undergraduate Course on Web Application Development

K. Yue, W. Ding, "Design and Evolution of an Undergraduate Course on Web Application Development", in Proc. of the 9th Annual SIGCSE Conference on Innovation and Technology in Computer Science Education (ACM ITiCSE'04), pp. 22-26, Leeds, UK, June, 2004

Web technologies have become essential in the computing curricula. However, teaching a Web development course to computing students is challenging because of large bodies of knowledge, rapidly changing technologies, demanding support infrastructures and the diverse backgrounds of audiences. This paper presents the evolution of a Web development course and the experience we have gained in teaching it for the past seven years. We incorporate selected leading-edge Web technologies as soon as they become mature and stable. The course covers a broad spectrum of Internet technologies to provide a solid conceptual framework. It also includes an in-depth study of a selected technology to provide the necessary depth and knowledge to build realistic Web applications. This paper describes the course design, our choice of topics, programming assignments, course delivery and our experience in coping with the rapidly changing Web technologies.


A Model for Open Content Communities to Support Effective Learning and Teaching

K. Yue, A. Yang, W. Ding, and P. Chen, "A Model for Open Content Communities to Support Effective Learning and Teaching", in Proc. of the IADIS International Conference on Web-based Communities, pp. 533-536, Lisbon, Portugal, April 2004.

Open Source Software (OSS) has provided a successful model for community-based collaborative development of software. The success of OSS has triggered interest in applying similar approaches to other areas besides software development, such as open courseware development and open content projects. However, there are nearly no projects on building a highly collaborative Open Content Community (OCC) for developing high-quality, comprehensive, rich and freely distributable educational materials on specific subjects. Learners can directly use these educational materials to effectively learn the respective subjects, and instructors can use them to construct courses. This paper presents an OSS-based model for building an OCC that supports volunteers to effectively develop, evaluate and use open content educational materials. The model is composed of fine-grained knowledge units to encourage a high degree of collaboration. It also has a hierarchical module-based framework for structuring projects. The community Website provides tools and services for content development, project management and project navigation. It is designed to provide high flexibility to cater to the varying requirements of different projects, which may evolve in a way similar to OSS projects. An initial prototype has been developed and the authors are in the process of fine-tuning the prototype for experimentation with sample projects.


Knowledge Management for Agent-based Tutoring Systems

P. Chen, W. Ding, "Knowledge Management for Agent-based Tutoring Systems", Designing Distributed Learning Environments: With Intelligent Software Agents, pp. 146-161, Ed. F. Lin, Idea Group, Inc., 2004.

As the educational field is becoming increasingly technology-heavy, more and more educational systems involve on-line or interactive training and tutoring techniques, and a great deal of educational information is becoming available via intranets and the World Wide Web. Managing large volumes of learning information and knowledge is one of the crucial issues for these educational systems, as appropriate knowledge management is the key to more effective and efficient learning. The chapter discusses how an intelligent agent system could be successfully applied to the educational field and how knowledge management techniques play a very important role.


Open Courseware and Computer Science Education

K. Yue, A. Yang, W. Ding, and P. Chen, "Open Courseware and Computer Science Education", ACM Journal of Computing Sciences in Colleges, Volume 20, Issue 1, Utah, USA, October, 2004.

The recent enthusiastic reception of the MIT OpenCourseWare (OCW) project has significantly improved the general awareness of Open Courseware (OC). However, many other lesser known projects and resources can also be classified as OC. The OC movement can potentially provide a vast pool of resources to satisfy diverse needs of Computer Science (CS) educators. However, there are only limited discussions on the possible meanings of OC to CS education. This paper elaborates several important facets of OC. It describes how CS educators can utilize raw educational materials from OC and how OC can support a continuum of approaches on constructing courseware. The impact of OC on CS educators will likely be greater than that of Open Source Software (OSS), since CS educators are more likely developers of course contents but only users of OSS. Thus, this paper suggests deeper and broader studies on the opportunities and challenges of OC provided to CS education.

2003

Icon-based Visualization of Large High-Dimensional Datasets

P. Chen, C. Hu, W. Ding, and H. Lynn, "Icon-based Visualization of Large High-Dimensional Datasets", in Proc. of the 3rd IEEE International Conference on Data Mining (ICDM'03), pp. 505-508, Melbourne, Florida, Nov. 2003.

High-dimensional data visualization is critical to data analysts since it gives a direct view of the original data. We present a method to visualize large amounts of high-dimensional data. We divide the dimensions of the data into several groups. Then, we use one icon to represent each group, and associate visual properties of each icon with the dimensions in each group. A high-dimensional data record is represented by multiple different types of icons located in the same position. Furthermore, we use summary icons to display local details of interest to the viewer and the whole data set at the same time. We show the method's effectiveness and efficiency through a case study on a large real-world data set.

2001

Using a Model Checker to Test Safety Properties

P. Ammann, W. Ding, and D. Xu, "Using a Model Checker to Test Safety Properties", in Proc. of the 7th IEEE International Conference on Engineering of Complex Computer Systems, pp. 212-221, Skovde, Sweden, June 2001.

In addition to providing a sound basis for analysis, formal methods can support other development activities; in our case the target is specification-based testing at the system level. We use the formal method of model checking to either generate new test sets or analyze existing test sets with respect to safety properties expressed in a temporal logic. We consider two types of tests: failing tests, in which a system must reject (fail) a specific dangerous action, and passing tests, in which a system must accept (pass) a safe action in a context that also includes a plausible dangerous action. We formalize our notion of dangerous actions with a mutation model for model checking specifications, and we develop coverage criteria to assess test sets. The coverage criteria are based on the logic operators from the Computation Tree Logic (CTL) and encompass the idea of scenarios where a dangerous action is either inevitable (A) or possible (E) as of the next state (X) or at some point in the future (F). We demonstrate the feasibility of our approach with an example.
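The four CTL operator combinations named in this abstract (AX, EX, AF, EF) can be illustrated on a tiny explicit-state transition system. This is a minimal sketch using the standard fixpoint characterizations, not the model checker or coverage criteria used in the paper. It assumes a total transition relation (every state has a successor), and the state names are hypothetical.

```python
def ex(states, trans, target):
    """EX target: states with at least one successor in target."""
    return {s for s in states if trans[s] & target}

def ax(states, trans, target):
    """AX target: states whose successors all lie in target."""
    return {s for s in states if trans[s] <= target}

def least_fixpoint(states, trans, target, step):
    """Iterate Z = target | step(Z) until the set stops growing."""
    reached = set(target)
    while True:
        grown = reached | step(states, trans, reached)
        if grown == reached:
            return reached
        reached = grown

def ef(states, trans, target):
    """EF target: target is reachable along some path."""
    return least_fixpoint(states, trans, target, ex)

def af(states, trans, target):
    """AF target: target is eventually reached along every path."""
    return least_fixpoint(states, trans, target, ax)

# Hypothetical system: "fire" plays the role of the dangerous action.
states = {"idle", "armed", "fire", "safe"}
trans = {
    "idle": {"armed", "safe"},
    "armed": {"fire"},
    "fire": {"fire"},
    "safe": {"safe"},
}
danger = {"fire"}

possible = ef(states, trans, danger)    # danger is possible (E) in the future (F)
inevitable = af(states, trans, danger)  # danger is inevitable (A) in the future (F)
```

Here the dangerous state is possible from "idle" but inevitable only once the system is "armed", which mirrors the distinction between the E and A scenarios in the abstract.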

2000

Evaluation of Three Specification-based Testing Criteria

A. Abdurazik, P. Ammann, W. Ding, and J. Offutt, "Evaluation of Three Specification-based Testing Criteria", in Proc. of the 6th IEEE International Conference on Engineering of Complex Computer Systems, pp. 179-187, Tokyo, Japan, Sept. 2000.

This paper compares three specification-based testing criteria using Mathur and Wong's PROBSUBSUMES measure. The three criteria are specification-mutation coverage, full predicate coverage, and transition-pair coverage. A novel aspect of the work is that each criterion is encoded in a model checker, and the model checker is used first to generate test sets for each criterion and then to evaluate test sets against alternate criteria. Significantly, the use of the model checker for generation of test sets eliminates human bias from this phase of the experiment. The strengths and weaknesses of the criteria are discussed.


Copyright Note: The electronic versions of the published papers are made available to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and conditions invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.