This volume contains papers presented at the 2013 International Conference on Data Mining (DMIN’13). Data mining is a relatively young discipline that is critically important if we want to learn effectively from the tremendous amounts of data that are routinely generated in science, engineering, medicine, business, and other areas. Data mining attracts innovative and influential contributions to both research and practice, across a wide range of academic disciplines and application domains. DMIN conferences seek to acknowledge and facilitate excellence in research and applications in the area of data mining. DMIN conferences are held annually within WORLDCOMP, the largest annual gathering of researchers in computer science, computer engineering and applied computing. WORLDCOMP'13 assembles a spectrum of 22 affiliated research conferences, workshops, and symposiums into a coordinated research meeting. Each conference has its own program committee and referees, as well as its own indexed proceedings. Attendees have full access to all 22 conferences' sessions, tracks, and tutorials. DMIN seeks to reflect the multi- and interdisciplinary nature of data mining and to facilitate the exchange and development of novel ideas, open communication and networking among researchers and practitioners in different research domains. As in 2012 and the years before, we hope that the 2013 International Conference on Data Mining will provide a forum for you to present your research in a professional environment, exchange ideas, and network and interact across research areas. DMIN actively supports students and beginning researchers from less developed countries by funding registration and accommodation, in order to allow for truly international networking and understanding. The 2013 conference has provided an international and multicultural experience with contributions from 26 different countries.
Additional Info
  • Publisher: Laxmi Publications
  • Language: English
  • Chapter 1

    Maintenance Knowledge Management with Fusion of CMMS and CM Price 2.99  |  2.99 Rewards Points

    Maintenance can be considered an information and knowledge processing and management system. The management of knowledge resources in maintenance is a relatively new issue compared to Computerized Maintenance Management Systems (CMMS) and Condition Monitoring (CM) approaches and systems. Information and Communication Technology (ICT) systems, including CMMS, CM and enterprise administrative systems, among others, are effective in supplying data and, in some cases, information. Effective analysis and decision-making based on the supplied information and data, however, require high-quality knowledge, skills and expertise. Information and data are not enough by themselves; knowledge, experience and skills are the key factors in maximizing the usability of the collected data and information.

  • Chapter 2

    Sentimental Analysis on Turkish Blogs via Ensemble Classifier Price 2.99  |  2.99 Rewards Points

    Sentimental analysis on web-mined data has a growing impact on many studies. The sentimental influence of content on the web is one of the questions of greatest interest to content creators and publishers. In this study, we investigate the impact of comments collected from five different Turkish-language web sites, with more than 2 million comments in total. The web sites comprise newspapers, movie reviews, an e-marketing web site and a literature web site. We mix all the comments into a single file. Each comment also has a like or dislike count, which we use as the ground truth for the impact of the comment, that is, for its sentiment.
     

  • Chapter 3

    Reliable Probabilistic Classification of Mammographic Masses using Random Forests Price 2.99  |  2.99 Rewards Points

    Mammography is the most effective method for identifying breast cancer in its earliest stages. Random forests (RF) have been successfully used for classification with good performance, but without information about the reliability of the classifications. In this paper, we present a novel reliable probabilistic approach to classify mammographic masses as benign, malignant and normal tissues. The main aim of this paper is to improve the performance of random forests by introducing a recently developed algorithmic framework, namely the Venn Probability Machine, for making reliable decisions in the face of uncertainty.

  • Chapter 4

    Identifying Patterns and Anomalies in Delayed Neutron Monitor Data of Nuclear Power Plant Price 2.99  |  2.99 Rewards Points

    In nuclear fission, a delayed neutron is a neutron emitted by one of the fission products any time from a few milliseconds to a few minutes after the fission event. The counts of delayed neutrons constitute a time series. The analysis of such time series can prove very significant for the purpose of predictive maintenance in nuclear power plants. In this paper we aim to identify anomalies in neutron counts, which may be generated by possible leaks in the nuclear reactor channel. Real-world case data comprising readings from Delayed Neutron Monitors (DNM) has been analyzed. The time sequences formed by the delayed neutrons have first been symbolically represented using the Symbolic Aggregate approXimation (SAX) algorithm; anomaly detection and pattern detection algorithms have then been applied to them.
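
    The symbolic representation step mentioned above can be illustrated with a minimal sketch of SAX. This is not the authors' implementation: the function name `sax`, the four-letter alphabet, and the sample counts are assumptions made for illustration only. The sketch z-normalises the series, averages it over equal-width frames, and maps each frame mean to a letter using breakpoints that split a standard normal distribution into equiprobable regions.

```python
from statistics import NormalDist, mean, pstdev


def sax(series, n_segments, alphabet="abcd"):
    """Convert a numeric series to a SAX word (illustrative sketch).

    Assumes len(series) >= n_segments so that no frame is empty.
    """
    mu, sigma = mean(series), pstdev(series)
    # z-normalise; a constant series is mapped to all zeros
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]

    # Piecewise aggregate approximation: mean of each frame
    n = len(z)
    paa = []
    for i in range(n_segments):
        start = i * n // n_segments
        end = (i + 1) * n // n_segments
        paa.append(mean(z[start:end]))

    # Breakpoints cutting N(0, 1) into len(alphabet) equiprobable regions
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(j / k) for j in range(1, k)]

    # Each frame mean becomes the letter of the region it falls into
    word = ""
    for value in paa:
        index = sum(value > b for b in breakpoints)
        word += alphabet[index]
    return word


# Hypothetical neutron counts: a low plateau followed by a high plateau
counts = [1, 1, 1, 1, 10, 10, 10, 10]
print(sax(counts, n_segments=2))
```

    Once a series is reduced to a short word, unusual subsequences can be searched for with string-based techniques, which is what makes the representation convenient for anomaly and pattern detection.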

  • Chapter 5

    Alleviating the Class Imbalance problem in Data Mining Price 2.99  |  2.99 Rewards Points

    The class imbalance problem in two-class data sets is one of the most important problems in data mining. When examples of one class in a training data set vastly outnumber examples of the other class, standard machine learning algorithms tend to be overwhelmed by the majority class and ignore the minority class. There are several algorithms in the literature to alleviate the problem of class imbalance. In this paper the existing RUSBoost, EasyEnsemble and BalanceCascade algorithms have been compared with each other using different classifiers, such as C4.5, SVM, and KNN, as the base learners.
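
    The three algorithms compared above all build on undersampling the majority class. The sketch below shows only that basic resampling step in isolation; it is not RUSBoost, EasyEnsemble or BalanceCascade, and the function name `random_undersample` and the toy data are assumptions made for illustration.

```python
import random
from collections import defaultdict


def random_undersample(X, y, seed=0):
    """Randomly drop examples until every class has as many
    examples as the smallest class (illustrative sketch)."""
    rng = random.Random(seed)

    # Group feature vectors by class label
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(features)

    # Size of the smallest (minority) class
    n_min = min(len(rows) for rows in by_class.values())

    # Sample n_min examples without replacement from each class;
    # the minority class is therefore kept in full
    X_out, y_out = [], []
    for label, rows in by_class.items():
        for features in rng.sample(rows, n_min):
            X_out.append(features)
            y_out.append(label)
    return X_out, y_out


# Toy data: eight majority examples (label 0), two minority (label 1)
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
X_bal, y_bal = random_undersample(X, y)
print(y_bal.count(0), y_bal.count(1))
```

    A base learner such as a decision tree would then be trained on the balanced sample; ensemble methods repeat the sampling several times so that less of the majority class is discarded overall.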

  • Chapter 6

    Efficiency of crop yield forecasting depending on the moment of prediction based on large remote sensing data set Price 2.99  |  2.99 Rewards Points

    Agricultural yields can be predicted from detailed multi-year remote sensing image sequences using measured features of vegetation conditions. In this paper, the dependency between the moment of prediction and the accuracy of the forecast is studied. A linear model is selected as the basic approach to yield forecasting. The model is then extended with non-linear components (factors) in order to improve the accuracy of the forecasts. The extensions take into consideration long-term technological advances in agricultural productivity as well as regional variations in yields (fertility of the land). The accuracy of the model has been estimated as a function of the time period between the moment the forecast is made and the harvest time.
     

  • Chapter 7

    Neural Network Forecasting with the S&P 500 Index Across Decades Price 2.99  |  2.99 Rewards Points

    The purpose of this paper is to track the effectiveness of a neural network as a forecasting tool across six decades, using only information derived from closing prices. From 1950 through 2010, a neural network for each decade was trained on ten years of S&P 500 data and used to forecast the S&P 500’s direction each day of the following year. The set of inputs and structure of the networks remained constant across time. Only the data sets used for training and forecasting changed. The results show that, with one exception over 60 years, the neural networks remained robust from training to validation sets and were correct more than 50% of the time.

  • Chapter 8

    Data Uncertainty Handling in High Level Information Fusion Price 2.99  |  2.99 Rewards Points

    Situation/threat modeling and threat prediction require higher levels of data fusion to provide actionable information to the warfighter. A significant challenge to the fusion of information into higher levels of knowledge is the uncertainty in the underlying data. This uncertainty may be in the form of trust pedigree, sensor noise, and data relevancy. Handling these elements within the fusion structure is vital in order to develop high level information fusion (HLIF) systems for multi-sensory, multi-use applications.

  • Chapter 9

    A Preliminary Approach to Study the Causality of Freezing of Gait for Parkinson's: Bayesian Belief Network Approach Price 2.99  |  2.99 Rewards Points

    Parkinson's disease patients suffer from a disabling phenomenon called freezing of gait (FoG), in which their feet feel "frozen" or stuck while the top half of the body is still able to move. In this paper, we present a graphical probabilistic modeling study, using a Bayesian Belief Network (BBN) approach, of a previously collected dataset of measurements from acceleration sensors placed on the ankle, knee and hip of PD patients while walking. To determine whether this is a traditional BBN model or a causal one, we built a FoG model and tested it, first by forming an epidemiological approach, then by inferring causal relations based on Additive Noise Models (ANM). Consequently, we built a naive Bayes classifier model related to FoG.

  • Chapter 10

    Evaluation of Monte Carlo Subspace Clustering with OpenSubspace Price 2.99  |  2.99 Rewards Points

    We present the results of a thorough evaluation of the subspace clustering algorithm SEPC using the OpenSubspace framework. We show that SEPC outperforms competing projected and subspace clustering algorithms on synthetic and some real world data sets. We also show that SEPC can be used to effectively discover clusters with overlapping objects (i.e., subspace clustering).

  • Chapter 11

    MineTool-3DM2: An Algorithm for Data Mining of 3D Simulation Data Price 2.99  |  2.99 Rewards Points

    Scientific simulations are a valuable discovery tool in a variety of sciences, especially in space physics, where scientific observation and in situ measurements are not always possible. Recent advances in kinetic simulations running on petascale computers have enabled 3D simulations of a variety of important scientific processes. However, knowledge extraction from the massive and complex data sets generated by petascale simulations still poses a major obstacle to scientific progress. We propose a new approach to solving this problem that combines an innovative feature extraction technique with a specialized classification algorithm and can be applied to 3D simulation datasets. In our previous work [12] we showed how data from 2D simulations, as well as many other real-life examples, can be represented in the form of multivariate time series. In this work, we have adapted our multivariate time series analysis data mining technique to handle 3D simulation data.

  • Chapter 12

    Actions Ontology System for Action Rules Discovery in Mammographic Mass Data Price 2.99  |  2.99 Rewards Points

    Actionable knowledge is a golden nugget within the data mining research field. Action rules describe possible transitions of objects in an information system from one state to another, more desirable state, with respect to a distinguished attribute. In this paper we propose an improved method for generating action rules by incorporating an additional ontology layer on top of the information system. It contains nodes of higher-level action knowledge, which are linked with individual terms at the lower levels. The system shows the likely changes within classification attributes, with respect to a decision attribute of interest. We experiment with the Mammographic Mass data set in an attempt to re-classify tumors from malignant to benign. In addition to the medical domain, application areas include the financial and industrial domains.

  • Chapter 13

    GDP Forecasting through Data Mining of Seaport Export-Import Records Price 2.99  |  2.99 Rewards Points

    With the ever-increasing spread of globalization through international trade, carried principally by sea, trade volume appears to correlate directly with a nation’s Gross Domestic Product (GDP). Traditionally, in the literature, structural models have related GDP to export-import tonnage across a cross-section of commodities. In this paper, machine learning and data mining techniques are applied to publicly available export and import tonnage of commodities at the sea ports of the nation in question. Algorithms are then considered that output real GDP forecasts for the fiscal year. The dataset for the exercise consists of daily export and import tonnage at a given port. Several ports in the country of interest are then considered. With data for several years and the accompanying GDP forecast on a daily basis, the task poses a challenging supervised learning problem, with an appropriately sized data set, that is expected to generalize.

  • Chapter 14

    Association Rule Mining for finding correlations among people Price 2.99  |  2.99 Rewards Points

    Data mining is the process of extracting interesting, non-trivial, implicit, previously unknown and potentially useful information or patterns from large information repositories. This paper focuses on Association Rule Mining (ARM) on large image datasets. ARM is largely applied to datasets containing text, but we exploit its capabilities to mine images in order to find interesting and useful correlations and to determine the degree of togetherness among faces in a video. Video processing generates a very large dataset, which is difficult to analyze manually. The research model presented in this paper combines two of the most actively researched areas of computer science: Computer Vision and Data Mining.
     

  • Chapter 15

    Toward Sustainable High-Yield Agriculture via Intelligent Control Systems Price 2.99  |  2.99 Rewards Points

    Hunger ranks as the number one health risk facing the world today, with scarcity of natural resources playing a key part in the problem. Aquaponics has the potential for high-yield plant and animal production but has parameters that are substantially more difficult to maintain. To prevent failure and ensure maximum yields for minimal outside input, this paper proposes AI-based data mining to learn and maintain proper environmental conditions. Experiments are conducted that determine the appropriateness of various AI techniques for this project. These AI techniques are being applied in a real-world aquaponics farm.

  • Chapter 16

    Extending Local Similarity Indexes with KNN for Link Prediction Price 2.99  |  2.99 Rewards Points

    One of the challenges in big data analytics is discovering previously unknown relationships between objects. Two common examples are suggesting friends in social media networks and predicting interactions between biological proteins. Both of these cases are examples of link prediction. Link prediction algorithms accept a graph and a pair of nodes and predict whether or not there should be an edge between those nodes. Local similarity indices are link prediction algorithms based on the assumption that if two nodes are structurally similar, there should be an edge between them. This concept can be extended by using the machine learning notion of k-nearest neighbors (k-NN), so that an edge from u to v is predicted if nodes similar to u have an edge to v, or nodes similar to v have an edge to u. It is straightforward to extend local similarity indices to k-NN versions of the algorithms, and with a suitable selection of k, accuracy is improved. Although there is additional computational cost, it can be amortized so that operations such as finding all predictions have similar computation time.
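
    The k-nearest-neighbor idea described above can be sketched as follows. This is a hedged illustration, not the chapter's algorithm: it assumes the common-neighbors count as the local similarity index, a simple vote count as the score, and hypothetical function names (`common_neighbors`, `k_most_similar`, `knn_link_score`) and toy graph.

```python
def common_neighbors(graph, a, b):
    """Local similarity index: number of neighbors shared by a and b."""
    return len(graph[a] & graph[b])


def k_most_similar(graph, u, k, exclude):
    """The (at most) k nodes most similar to u, skipping `exclude`
    and any node that shares no neighbor with u."""
    others = [n for n in graph if n != u and n != exclude]
    # Highest similarity first; node name breaks ties deterministically
    others.sort(key=lambda n: (-common_neighbors(graph, u, n), str(n)))
    return [n for n in others[:k] if common_neighbors(graph, u, n) > 0]


def knn_link_score(graph, u, v, k=2):
    """Count nodes similar to u that link to v, plus nodes similar
    to v that link to u. A higher score suggests a likelier edge."""
    near_u = k_most_similar(graph, u, k, exclude=v)
    near_v = k_most_similar(graph, v, k, exclude=u)
    votes_for_v = sum(1 for n in near_u if v in graph[n])
    votes_for_u = sum(1 for n in near_v if u in graph[n])
    return votes_for_v + votes_for_u


# Toy undirected graph stored as adjacency sets
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c"},
    "e": {"f"},
    "f": {"e"},
}
print(knn_link_score(graph, "a", "d"))  # a and d sit in the same cluster
print(knn_link_score(graph, "a", "e"))  # a and e are in separate components
```

    In practice the score would be compared against a threshold, or candidate pairs would be ranked by it; the choice of k controls how many similar nodes are allowed to vote.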

  • Chapter 17

    A New Simple Classification Algorithm enabling a New Approach for Identification of Virtual Bullying Price 2.99  |  2.99 Rewards Points

    In this paper we present a new, simple classification algorithm that can be used to identify a change in virtual behaviour between a sender and a recipient, which could serve as an early indicator of virtual bullying or harassment. This is not only a novel application of data mining techniques but also a new approach to identifying virtual bullying, by detecting a change in behaviour.

  • Chapter 18

    Using Data Mining to Analyze Donation Data for a Local Food Bank Price 2.99  |  2.99 Rewards Points

    Food insecurity is one of the difficult situations that many American communities face today. Hunger, particularly when experienced by children, has serious impacts on society. Fighting hunger cannot depend solely on government assistance programs. Non-profit organizations such as Feeding America play a very important role in this effort. These organizations rely heavily on food donations. However, donation patterns are not easy to understand, which makes it challenging for those organizations to plan and manage their resources. In this research, data mining techniques were applied to analyze donation data from a local food bank, and useful information was generated to help the food bank manage its resources.
     

  • Chapter 19

    Flash reactivity : adaptative models in recommender systems Price 2.99  |  2.99 Rewards Points

    Recommendation systems take advantage of product and user information in order to propose items to targeted consumers. Collaborative recommendation systems, content-based recommendation systems and a few hybrid systems have been developed. We propose a dynamic and adaptive framework to overcome the usual issues of current systems. We present a method based on adaptation over time in order to provide recommendations in phase with the present instant. The system includes a dynamic adaptation to enhance the accuracy of rating predictions by applying a new similarity measure. We ran several experiments on film data from Vodkaster, showing that systems incorporating dynamic adaptation significantly improve the quality of recommendations compared to static ones.

  • Chapter 20

    Analysis of Truck Compressor Failures Based on Logged Vehicle Data Price 2.99  |  2.99 Rewards Points

    In multiple industries, including the automotive one, predictive maintenance is becoming more and more important, especially as the focus shifts from product to service-based operation. It requires, among other things, being able to provide customers with uptime guarantees. It is natural to investigate the use of data mining techniques, especially since the same shift of focus, together with technological advancements in telecommunication solutions, makes long-term data collection more widespread. In this paper we describe our experiences in predicting compressor faults using data that is logged on board Volvo trucks. We discuss unique challenges that are posed by the specifics of the automotive domain. We show that predictive maintenance is possible and can result in significant cost savings, despite the relatively low amount of data available. We also discuss some of the problems we have encountered when employing out-of-the-box machine learning solutions, and identify areas where our task diverges from common assumptions underlying the majority of data mining research.
     

  • Chapter 21

    Proposed Business Intelligence Models for Medical Risk Assessment Case study of Venous Thrombosis Disease in Egypt Price 2.99  |  2.99 Rewards Points

    Risk assessment tools have been widely used in various fields such as Information Technology, Environmental Studies and Healthcare. This paper explores the use of Business Intelligence tools in the healthcare industry in developing countries. In doing so, three different models using the SQL Server 2008 Business Intelligence Tool were explored. These models are Naïve Bayes, Decision Trees and Neural Networks. Hence, a prototype Intelligent Risk Assessment Model, DVTRAM (Deep Vein Thrombosis Risk Assessment Model), is proposed. It applies different data mining techniques in order to uncover hidden patterns that may lead to medical complications such as Pulmonary Embolism (PE). Results showed that all three models were able to extract patterns in response to the predictable state. The performance of the models varied depending on the class value. In the future, the outcomes may constitute a good background for the development of a Medical Expert System in the domain of Internal Medicine.

  • Chapter 22

    Improve the Quality of Product Recommendation based on Multi-channel CRM for E-commerce Price 2.99  |  2.99 Rewards Points

    In the Internet age, more and more Web applications and services are developed for electronic commerce (EC). However, the quality of product recommendations in electronic commerce is still poor. Hundreds of thousands of products are placed on EC websites, but only a low percentage of them are purchased by customers, even though customers still purchase many products. Because the products that customers purchase are scattered, the customer-product matrix is very sparse. It is therefore difficult to find customers with similar product preferences, and the quality of the traditional product recommendation method, collaborative filtering, is poor. In this paper, we propose a multi-channel customer relationship management (CRM) approach to address the sparsity of the customer-product matrix. We consider not only the similar users of the Web channel, but also the similar users of other channels (e.g. television and catalog) in a multi-channel retailer. Using these similar users from multiple channels, the recommended products are ordered by the weighted frequency counts of the items most frequently purchased by the similar users, with hybrid weights for the Web target user.
     

  • Chapter 23

    Using Recursive Sorting to Improve Accuracy of Memory-based Collaborative Filtering Recommendations Price 2.99  |  2.99 Rewards Points

    Modern user behavior datasets contain millions of records, so quickly combining all potentially relevant ratings is often not feasible. Instead, we make suggestions from a small set of the most relevant ratings, so that memory-based recommender systems can produce simple and accurate results. We propose a new instance selection algorithm that removes irrelevant data after sorting it twice, unlike the traditional approach, in which the data is sorted only once. The accuracy of the resulting recommendations on the Netflix dataset is considerably better than that of the standard approach.

  • Chapter 24

    Mining for Hydrologic Features in LiDAR Data Price 2.99  |  2.99 Rewards Points

    Light Detection and Ranging (LiDAR) can generate 3D point data of terrains with high resolution and accuracy, enabling automated detection of important hydrologic features. This paper describes a method for detecting sinkholes in LiDAR data. Current methods of sinkhole detection are lengthy and labor intensive, requiring hours or days of manual work. The method demonstrated in this study can locate sinkholes in the same LiDAR data within minutes with no need for human intervention.