International Science Index
Hybrid Reliability-Similarity-Based Approach for Supervised Machine Learning
Data mining has seen major advances in recent years, driven by the spread of the internet, which generates a tremendous volume of data every day, and by substantial advances in the technologies that facilitate the analysis of these data. Classification techniques, in particular, are a subdomain of data mining that determines the group to which each data instance in a given dataset belongs. They are used to sort data into classes according to desired criteria. Generally, a classification technique is either statistical or based on machine learning, and each type has its own limits. Data are becoming increasingly heterogeneous; consequently, current classification techniques are encountering many difficulties. This paper defines new measure functions to quantify the resemblance between instances and then combines them in a new approach that differs from existing algorithms in its reliability computations. The proposed approach outperformed the most common classification techniques, with an F-measure exceeding 97% on the Iris dataset.
Comparative Evaluation of Accuracy of Selected Machine Learning Classification Techniques for Diagnosis of Cancer: A Data Mining Approach
With recent trends in Big Data and advancements in Information and Communication Technologies, the healthcare industry is in transition from clinician oriented to technology oriented. Many people around the world die of cancer because the disease was not diagnosed at an early stage. Nowadays, computational methods in the form of Machine Learning (ML) are used to develop automated decision support systems that can diagnose cancer with high confidence in a timely manner. This paper carries out a comparative evaluation of a selected set of ML classifiers on two existing datasets: breast cancer and cervical cancer. The ML classifiers compared in this study are Decision Tree (DT), Support Vector Machine (SVM), k-Nearest Neighbor (k-NN), Logistic Regression, Ensemble (Bagged Tree) and Artificial Neural Networks (ANN). The evaluation is based on the standard metrics Precision (P), Recall (R), F1-score and Accuracy. The experimental results show that ANN achieved the highest accuracy (99.4%) on the breast cancer dataset. On the cervical cancer dataset, the Ensemble (Bagged Tree) technique gave better accuracy (93.1%) than the other classifiers.
Keywords: Artificial neural networks, breast cancer, cervical cancer, logistic regression, machine learning, support vector machine.
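The four evaluation metrics named in the abstract above (Precision, Recall, F1-score and Accuracy) all follow from the counts of a binary confusion matrix. The following is a minimal sketch of those standard definitions, not the authors' evaluation code; the function name `classification_metrics` and the toy labels are assumptions made for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, F1 and accuracy for a binary problem.

    `positive` is the label treated as the positive class (an assumption
    of this sketch; the abstract does not say how labels are encoded).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard against empty denominators, returning 0.0 in that case.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


if __name__ == "__main__":
    # Hypothetical labels: 1 = malignant, 0 = benign.
    truth = [1, 1, 1, 0, 0, 0, 1, 0]
    guess = [1, 1, 0, 0, 0, 1, 1, 0]
    print(classification_metrics(truth, guess))
```

In a diagnostic setting, recall is often watched most closely, since a false negative corresponds to a missed case.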
Implementation of an IoT Sensor Data Collection and Analysis Library
Due to the development of information technology and wireless Internet technology, a wide variety of data are being generated in many fields. These data are valuable because they provide real-time information to the users themselves. When the data are accumulated and analyzed, however, a wider range of information can be extracted. In addition, the development and spread of boards such as Arduino and Raspberry Pi have made it easy to test various sensors, and sensor data can be collected directly by using database tools such as MySQL. These directly collected data can be used for a variety of research and can serve as input for data mining. However, collecting data with such boards is difficult for users who are not computer programmers or who are using the boards for the first time. Even if data are collected, a lack of expert knowledge or experience may cause difficulties in data analysis and visualization. In this paper, we aim to construct a library for sensor data collection and analysis to overcome these problems.
Application of Granular Computing Paradigm in Knowledge Induction
This paper illustrates an application of granular computing approach, namely rough set theory in data mining. The paper outlines the formalism of granular computing and elucidates the mathematical underpinning of rough set theory, which has been widely used by the data mining and the machine learning community. A real-world application is illustrated, and the classification performance is compared with other contending machine learning algorithms. The predictive performance of the rough set rule induction model shows comparative success with respect to other contending algorithms.
Linguistic Summarization of Structured Patent Data
Patent data play an increasingly important role in economic growth, innovation, technical advantage and business strategy, and even in competition between countries. Analyzing patent data is crucial, since patents cover a large part of the world's technological information. In this paper, we use the linguistic summarization technique to verify the validity of hypotheses about patent data stated in the literature.
Discovering User Behaviour Patterns from Web Log Analysis to Enhance the Accessibility and Usability of Website
Finding relevant information on the World Wide Web is becoming more challenging day by day. Web usage mining is used to extract relevant and useful knowledge, such as user behaviour patterns, from web access log records. The web access log records all the requests for individual files that users have made to the website. Web usage mining is important for Customer Relationship Management (CRM), as it can help ensure customer satisfaction as far as the interaction between the customer and the organization is concerned. Web usage mining also helps improve website structure or design as per the user’s requirement by analyzing the access log file of a website with a log analyzer tool. The focus of this paper is to enhance the accessibility and usability of a guitar-selling web site by analyzing its access log with the Deep Log Analyzer tool. The results show that the largest number of users is from the United States and that they use the Opera 9.8 web browser and the Windows XP operating system.
FCNN-MR: A Parallel Instance Selection Method Based on Fast Condensed Nearest Neighbor Rule
Instance selection (IS) techniques are used to reduce the data size in order to improve the performance of data mining methods. Recently, to process very large data sets, several proposed methods divide the training set into disjoint subsets and apply IS algorithms independently to each subset. In this paper, we analyze the limitations of these methods and give our viewpoint on how to divide and conquer in the IS procedure. Then, based on the fast condensed nearest neighbor (FCNN) rule, we propose an instance selection method for large data sets that uses the MapReduce framework. Besides ensuring prediction accuracy and reduction rate, it has two desirable properties: first, it reduces the workload on the aggregation node; second, and most important, it produces the same result as the sequential version, which other parallel methods cannot achieve. We evaluate the performance of FCNN-MR on one small data set and two large data sets. The experimental results show that it is effective and practical.
Knowledge Discovery and Data Mining Techniques in Textile Industry
This paper addresses issues in the textile industry using data mining techniques. Data mining was applied to data on the stitching of garment products obtained from a textile company. The techniques applied were the CHAID algorithm, the CART algorithm, Regression Analysis and Artificial Neural Networks. Classification-based analyses were used, and a decision model of the production per person and the variables affecting production was obtained. The results show that as the daily working time increases, the production per person decreases. In addition, the relationship between total daily working time and production per person is negative, and it is the strongest relationship found for production per person.
Summarizing Data Sets for Data Mining by Using Statistical Methods in Coastal Engineering
Coastal regions are among the most heavily used areas, both in terms of the natural balance and by the growing population. In coastal engineering, the most valuable data concern wave behavior. The volume of these data becomes very large because observations run for periods of hours, days and months. In this study, statistical methods such as wave spectrum analysis and standard statistical methods are used. The goal is to discover profiles of different coastal areas by using these statistical methods and thus to obtain an instance-based data set from the big data for analysis with data mining algorithms. In the experimental studies, six sample data sets of wave behavior, obtained from 20 minutes of observations in Mersin Bay, Turkey, were converted to an instance-based form, and different clustering techniques were used to discover similar coastal places. Moreover, this study argues that this summarization approach can be used in other fields that collect big data, such as medicine.
Clustering Categorical Data Using the K-Means Algorithm and the Attribute’s Relative Frequency
Clustering is a well-known data mining technique used in pattern recognition and information retrieval. The initial dataset to be clustered can contain either categorical or numeric data, and each type of data has its own specific clustering algorithm. In this context, two algorithms have been proposed: k-means for clustering numeric datasets and k-modes for categorical datasets. A main problem encountered in data mining applications is clustering categorical data, which are common in real datasets. One way to cluster categorical values is to transform the categorical attributes into numeric measures and apply the k-means algorithm directly instead of k-modes. In this paper, we experiment with an approach of this kind, transforming the categorical values into numeric ones using the relative frequency of each modality in the attributes. The proposed approach is compared with a previous method based on transforming the categorical datasets into binary values. The scalability and accuracy of the two methods are evaluated. The results show that our proposed method outperforms the binary method in all cases.
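The transformation described in the abstract above, replacing each categorical value by the relative frequency of its modality within its attribute, can be sketched in a few lines. This is an illustrative reading of that one sentence, not the authors' implementation; the function name `relative_frequency_encode` and the sample rows are assumptions.

```python
from collections import Counter


def relative_frequency_encode(rows):
    """Replace each categorical value with the relative frequency of
    that value (modality) within its own column (attribute).

    `rows` is a list of equal-length lists of hashable values.
    """
    if not rows:
        return []
    n = len(rows)
    n_cols = len(rows[0])
    # One frequency table per attribute.
    freqs = [Counter(row[j] for row in rows) for j in range(n_cols)]
    return [[freqs[j][row[j]] / n for j in range(n_cols)] for row in rows]


if __name__ == "__main__":
    # Hypothetical two-attribute dataset (colour, size).
    data = [["red", "S"], ["red", "M"], ["blue", "S"], ["red", "S"]]
    for original, encoded in zip(data, relative_frequency_encode(data)):
        print(original, "->", encoded)
```

The encoded rows are purely numeric, so they can be passed to any k-means implementation. One property worth noting is that two different modalities with the same frequency in an attribute receive the same numeric value.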
CompPSA: A Component-Based Pairwise RNA Secondary Structure Alignment Algorithm
The biological function of an RNA molecule depends on its structure. The objective of alignment is to find the homology between two or more RNA secondary structures. Knowing the common functionalities between two RNA structures allows a better understanding and the discovery of other relationships between them. In addition, identifying non-coding RNAs (those that are not translated into a protein) is a popular application in which RNA structural alignment is the first step. A few methods for RNA structure-to-structure alignment have been developed. Most of these methods perform partial structure-to-structure, sequence-to-structure, or structure-to-sequence alignment. Less attention has been given in the literature to efficient RNA structure representations, and structure-to-structure alignment methods are lacking. In this paper, we introduce an O(N^2) Component-based Pairwise RNA Structure Alignment (CompPSA) algorithm, where structures are given in a component-based representation and N is the maximum number of components in the two structures. The proposed algorithm compares the two RNA secondary structures based on their weighted component features rather than on their base-pair details. Extensive experiments illustrate the efficiency of the CompPSA algorithm compared to other approaches on different real and simulated datasets. The CompPSA algorithm provides an accurate similarity measure between components. It gives the user the flexibility to align the two RNA structures based on their weighted features (position, full length, and/or stem length). Moreover, the algorithm proves scalability and efficiency in time and
Performance Analysis of Proprietary and Non-Proprietary Tools for Regression Testing Using Genetic Algorithm
This paper addresses research in the area of regression testing, with emphasis on automated tools and on the prioritization of test cases. The uniqueness of regression testing and its cyclic nature are pointed out. The difference in approach between industry, with the business model as its basis, and academia, with its focus on data mining, is highlighted. Test metrics are discussed as a prelude to our formula for prioritization; a case study is then discussed to illustrate this methodology. An industrial case study is also described, in which the number of test cases is so large that they have to be grouped into Test Suites. In such situations, a genetic algorithm that we propose can be used to reconfigure these Test Suites in each cycle of regression testing. A proprietary tool and an open source tool are compared using the above-mentioned metrics. Our approach is clarified through several tables.
A Study on the Nostalgia Contents Analysis of Hometown Alumni in the Online Community
This study aims to analyze the terms in texts posted on an online community of people from the same hometown and to understand the topics and trends of nostalgia expressed online. For this purpose, the study collected 144 writings that natives of Yeongjong Island, Incheon, South Korea, posted on an online community, and analyzed the association relations among their terms. The results show, first, that defining nostalgia merely as ‘a mind longing for hometown’ is not a sufficient explanation of the online community texts. Second, texts composed online are abstract rather than accounts of individual personal stories. Through literature research and association rule analysis concerning nostalgia, the study identified the relationships with the most critical and closest mutual association among the terms that constitute nostalgia. The contribution of this study is that it summarizes the core terms and emotions related to nostalgia.
Exploring Influence Range of Tainan City Using Electronic Toll Collection Big Data
Big Data has attracted a lot of attention in many fields as a way of analyzing research issues based on large volumes of data. Electronic Toll Collection (ETC) is one of the Intelligent Transportation System (ITS) applications in Taiwan, used to record the starting point, end point, distance and travel time of vehicles on the national freeway. This study takes advantage of ETC big data, combined with urban planning theory, to explore various phenomena of inter-city transportation activities. ETC data, part of the government's open data, are abundant, complete and quickly updated. Living areas have traditionally been delimited by location, population, area and subjective consciousness. However, these factors cannot appropriately reflect people’s movement paths in daily life. In this study, the concept of "Living Area" is replaced by "Influence Range" to show dynamics and variation with time and with the purposes of activities. This study uses data mining with Python and Excel, and visualizes the number of trips with GIS, to explore the influence range of Tainan city and the purposes of trips, and to discuss the living area as currently delimited. It sets up a dialogue between the concepts of "Central Place Theory" and "Living Area", presents a new point of view, and integrates the application of big data, urban planning and transportation. The findings will be valuable for resource allocation and land apportionment in spatial planning.
Development of Prediction Models of Day-Ahead Hourly Building Electricity Consumption and Peak Power Demand Using the Machine Learning Method
To encourage building owners to purchase electricity at the wholesale market and reduce building peak demand, this study aims to develop models that predict day-ahead hourly electricity consumption and demand using artificial neural networks (ANN) and support vector machines (SVM). All prediction models are built in Python with the Scikit-learn and Pybrain tools. The input data for both consumption and demand prediction are time stamp, outdoor dry bulb temperature, relative humidity, air handling unit (AHU), supply air temperature and solar radiation. Solar radiation, which is unavailable a day ahead, is predicted first, and this estimate is then used as an input to predict consumption and demand. Models to predict consumption and demand are trained with both SVM and ANN, and depend on the season (cooling or heating) and the day type (weekdays or weekends). The results show that ANN is the better option for both consumption and demand prediction. It achieves a coefficient of variation of the root mean square error (CVRMSE) of 15.50% to 20.03% for consumption prediction and 22.89% to 32.42% for demand prediction. To conclude, the presented models have the potential to help building owners purchase electricity at the wholesale market, but they are not robust when used in demand response control.
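The CVRMSE figures quoted in the abstract above express the root mean square error as a percentage of the mean observed value. Below is a minimal sketch of that calculation. It assumes the plain form that divides the squared error sum by n; some guidelines use a degrees-of-freedom correction instead, and the abstract does not say which form the authors used. The function name `cvrmse` and the sample readings are hypothetical.

```python
import math


def cvrmse(actual, predicted):
    """Coefficient of variation of the RMSE, in percent.

    Assumption of this sketch: RMSE uses n in the denominator
    (no degrees-of-freedom adjustment).
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(actual)
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)
    mean_actual = sum(actual) / n
    return 100.0 * rmse / mean_actual


if __name__ == "__main__":
    # Hypothetical hourly consumption readings (kWh) and model predictions.
    measured = [100.0, 100.0, 100.0, 100.0]
    forecast = [90.0, 110.0, 90.0, 110.0]
    print(f"CVRMSE = {cvrmse(measured, forecast):.2f}%")
```

Because the error is normalized by the mean load, CVRMSE values can be compared between buildings or periods with different consumption levels.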
An Improvement of Multi-Label Image Classification Method Based on Histogram of Oriented Gradient
Image Multi-label Classification (IMC) assigns a label or a set of labels to an image. The high demand for image annotation and archiving on the web has led researchers to develop many algorithms for this application domain. Existing techniques for IMC have two drawbacks: they take into account neither the description of the elementary characteristics of the image nor the correlation between labels. In this paper, we present an algorithm (MIML-HOGLPP) that handles both limitations simultaneously. The algorithm uses the histogram of oriented gradients as a feature descriptor and applies the Label Priority Power-set as a multi-label transformation to solve the problem of label correlation. The experiments show that MIML-HOGLPP gives better results than two existing techniques on some of the evaluation metrics.
Application of Data Mining Techniques for Tourism Knowledge Discovery
Five implementations of three data mining classification techniques were tested for extracting important insights from tourism data. The aim was to find the best-performing algorithm among those compared for tourism knowledge discovery. The knowledge discovery from data process was used as the process model, and 10-fold cross-validation was used for testing. Various data preprocessing activities were performed to obtain the final dataset for model building. Classification models of the selected algorithms were built under different scenarios on the preprocessed dataset. The best-performing algorithm on the tourism dataset was Random Forest (76%) before applying information gain based attribute selection, and J48 (C4.5) (75%) after selecting the attributes most relevant to the class (target) attribute. In terms of model building time, attribute selection improves the efficiency of all algorithms; Artificial Neural Network (multilayer perceptron) showed the highest improvement (90%). The rules extracted from the decision tree model are presented. They show intricate, non-trivial knowledge that would not otherwise be discovered by simple statistical analysis, although the accuracy of the classification algorithms is mediocre.
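Information gain based attribute selection, mentioned in the abstract above, ranks each attribute by how much knowing its value reduces the entropy of the class attribute. The sketch below shows the textbook calculation for categorical attributes; it is not the tool the authors used, and the function names and the toy tourism-style data are assumptions.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the class minus the weighted entropy after splitting
    the instances by the categorical attribute `values`."""
    n = len(labels)
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_attributes(columns, labels):
    """Return attribute names sorted by decreasing information gain.

    `columns` maps an attribute name to its list of values.
    """
    gains = {name: information_gain(vals, labels)
             for name, vals in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical attributes and target for illustration only.
    cols = {
        "season": ["high", "high", "low", "low"],
        "transport": ["bus", "car", "bus", "car"],
    }
    target = ["satisfied", "satisfied", "unsatisfied", "unsatisfied"]
    print(rank_attributes(cols, target))
```

Keeping only the top-ranked attributes shrinks the input to each classifier, which is consistent with the shorter model building times the abstract reports after selection.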
Arabic Light Stemmer for Better Search Accuracy
Arabic is one of the most ancient and critical languages in the world. It has more than 250 million native speakers, and more than twenty countries have Arabic as one of their official languages. In the past decade, we have witnessed a rapid evolution in smart devices, social networks and the technology sector, which has led to a need for tools and libraries that properly handle the Arabic language in different domains. Stemming is one of the most crucial linguistic fundamentals. It is used in many applications, especially in the information extraction and text mining fields. The motivation behind this work is to enhance the Arabic light stemmer to serve the data mining industry and to leverage it in an open source community. The presented implementation enhances the Arabic light stemmer with an algorithm that adds a new set of rules and patterns, accompanied by an adjusted procedure. This study demonstrates a significant enhancement in search accuracy, with an average improvement of 10% in comparison with previous works.
Using Multi-Arm Bandits to Optimize Game Play Metrics and Effective Game Design
Game designers have the challenging task of building games that engage players to spend their time and money on the game. There are an infinite number of game variations and design choices, and it is hard to systematically determine game design choices that will have positive experiences for players. In this work, we demonstrate how multi-arm bandits can be used to automatically explore game design variations to achieve improved player metrics. The advantage of multi-arm bandits is that they allow for continuous experimentation and variation, intrinsically converge to the best solution, and require no special infrastructure to use beyond allowing minor game variations to be deployed to users for evaluation. A user study confirms that applying multi-arm bandits was successful in determining the preferred game variation with highest play time metrics and can be a useful technique in a game designer's toolkit.
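The idea in the abstract above, continuously trying game variations while steering most players toward the one with the best metric, can be illustrated with a simple bandit strategy. The abstract does not name the bandit algorithm used, so the sketch below uses epsilon-greedy purely as an assumption; the class name, the simulated play times and the `simulate` helper are all hypothetical.

```python
import random


class EpsilonGreedyBandit:
    """Epsilon-greedy multi-armed bandit (an assumed strategy, not
    necessarily the one used in the paper). Each arm is a game variation."""

    def __init__(self, n_arms, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.counts = [0] * n_arms      # times each variation was served
        self.values = [0.0] * n_arms    # running mean reward per variation
        self.rng = random.Random(seed)

    def select_arm(self):
        # Explore a random variation with probability epsilon,
        # otherwise exploit the variation with the best mean so far.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))
        return max(range(len(self.values)), key=lambda i: self.values[i])

    def update(self, arm, reward):
        # Incremental update of the running mean for the chosen arm.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

    def best_arm(self):
        return max(range(len(self.values)), key=lambda i: self.values[i])


def simulate(mean_play_times, rounds=2000, epsilon=0.1, seed=0):
    """Simulate players whose play time (minutes) is Gaussian around a
    hypothetical mean for each variation."""
    env = random.Random(seed)
    bandit = EpsilonGreedyBandit(len(mean_play_times), epsilon, seed=seed + 1)
    for _ in range(rounds):
        arm = bandit.select_arm()
        bandit.update(arm, env.gauss(mean_play_times[arm], 1.0))
    return bandit


if __name__ == "__main__":
    result = simulate([5.0, 8.0, 6.0])
    print("plays per variation:", result.counts)
    print("preferred variation:", result.best_arm())
```

The only infrastructure this needs is a way to serve a chosen variation to a player and report back the observed metric, which matches the abstract's point that no special infrastructure is required.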
Predicting Groundwater Areas Using Data Mining Techniques: Groundwater in Jordan as Case Study
Data mining is the process of extracting useful or hidden information from a large database. Extracted information can be used to discover relationships among features, where data objects are grouped according to logical relationships, or to assign unseen objects to one of the predefined groups. In this paper, we investigate four well-known data mining algorithms in order to predict groundwater areas in Jordan. These algorithms are Support Vector Machines (SVMs), Naïve Bayes (NB), K-Nearest Neighbor (kNN) and Classification Based on Association Rule (CBA). The experimental results indicate that the SVMs algorithm outperformed the other algorithms in terms of classification accuracy, precision and F1 evaluation measures on datasets of groundwater areas collected from the Jordanian Ministry of Water and Irrigation.
Development of Innovative Islamic Web Applications
Rich Islamic resources related to religious texts, Islamic sciences and history are widely available in print and in electronic format online. However, most of these works are available only in Arabic. In this research, an attempt is made to utilize these resources to create interactive web applications in Arabic, English and other languages. The system uses pattern recognition, knowledge management, data mining, information retrieval and management, indexing, storage and data-analysis techniques to parse, store, convert and manage the information from authentic Arabic resources. These interactive web apps provide smart multi-lingual search, tree-based search, and on-demand information matching and linking. In this paper, we provide details of the application architecture, design, implementation and technologies employed. We also present a summary of the web applications already developed and include some screenshots from the corresponding web sites. These web applications provide innovative online learning systems (eLearning and computer-based education).
Road Accidents Bigdata Mining and Visualization Using Support Vector Machines
Useful information has been extracted from road accident data in the United Kingdom (UK), using data analytics methods, with the aim of avoiding possible accidents in rural and urban areas. This analysis makes use of several methodologies such as data integration, support vector machines (SVM), correlation machines and multinomial goodness. The entire datasets were imported from the traffic department of the UK with due permission. The information extracted from these huge datasets forms a basis for several predictions, which in turn avoid unnecessary memory lapses. Since the data are expected to grow continuously over time, this work primarily proposes a new framework model that can be trained, adapt itself to new data and make accurate predictions. This work also throws some light on the use of the SVM methodology for text classifiers built from the obtained traffic data. Finally, it emphasizes the uniqueness and adaptability of the SVM methodology for this kind of research work.
Evaluation of Ensemble Classifiers for Intrusion Detection
One of the major developments in machine learning in the past decade is the ensemble method, which builds a highly accurate classifier by combining many moderately accurate component classifiers. In this research work, new ensemble classification methods are proposed, a homogeneous ensemble classifier using bagging and a heterogeneous ensemble classifier using arcing, and their performance is analyzed in terms of accuracy. A classifier ensemble is designed using Radial Basis Function (RBF) and Support Vector Machine (SVM) as base classifiers. The feasibility and benefits of the proposed approaches are demonstrated on standard intrusion detection datasets. The proposed approach consists of three main parts: a preprocessing phase, a classification phase, and a combining phase. A wide range of comparative experiments is conducted on standard intrusion detection datasets. The performance of the proposed homogeneous and heterogeneous ensemble classifiers is compared to that of other standard homogeneous and heterogeneous ensemble methods. The standard homogeneous ensemble methods include error-correcting output codes (ECOC) and Dagging; the heterogeneous ensemble methods include majority voting and stacking. The proposed ensemble methods provide a significant improvement in accuracy compared to individual classifiers. The proposed bagged RBF and SVM perform significantly better than ECOC and Dagging, and the proposed hybrid RBF-SVM performs significantly better than voting and stacking. In addition, heterogeneous models exhibit better results than homogeneous models on standard intrusion detection datasets.
Business-Intelligence Mining of Large Decentralized Multimedia Datasets with a Distributed Multi-Agent System
The rapid generation of a high volume and broad variety of data from the application of new technologies poses challenges for the generation of business intelligence. Most organizations and business owners need to extract data from multiple sources and apply analytical methods in order to develop their business. Therefore, the recently decentralized data management environment relies on a distributed computing paradigm. While data are stored in highly distributed systems, implementing distributed data-mining techniques is a challenge. The aim of such techniques is to gather knowledge from every domain and from all the datasets stemming from distributed resources. As agent technologies offer significant contributions to managing the complexity of distributed systems, we consider them for next-generation data-mining processes. To demonstrate agent-based business intelligence operations, we use agent-oriented modeling techniques to develop a new artifact for mining massive datasets.
Case-Based Reasoning: A Hybrid Classification Model Improved with an Expert's Knowledge for High-Dimensional Problems
Data mining and classification of objects is the process of data analysis, using various machine learning techniques, which is used today in various fields of research. This paper presents a concept of hybrid classification model improved with the expert knowledge. The hybrid model in its algorithm has integrated several machine learning techniques (Information Gain, K-means, and Case-Based Reasoning) and the expert’s knowledge into one. The knowledge of experts is used to determine the importance of features. The paper presents the model algorithm and the results of the case study in which the emphasis was put on achieving the maximum classification accuracy without reducing the number of features.
Determination of the Bank's Customer Risk Profile: Data Mining Applications
In this study, clients who applied to a bank branch for a loan were analyzed through data mining. The data comprised information such as the amounts of loans received by personal and SME clients working with the bank branch, the number of installments, the number of delays in loan installments, payments available in other banks and the number of banks to which the clients were in debt between 2010 and 2013. The client risk profile was examined through Classification and Regression Tree (CART) analysis, one of the decision tree classification methods. At the end of the study, five different types of customers were identified on the decision tree. These customer types were rated according to the risk they pose for the bank branch, and the customers were classified according to these risk ratings.
A Case-Based Reasoning-Decision Tree Hybrid System for Stock Selection
Stock selection is an important decision-making problem. Many machine learning and data mining technologies are employed to build automatic stock-selection systems. A profitable stock-selection system should consider both the stock’s investment value and the market timing. In this paper, we present a hybrid system that takes both into account for stock selection. This system uses a case-based reasoning (CBR) model to perform the stock classification and a decision-tree model to help with market timing and stock selection. The experiments show that this hybrid system performs better than other techniques with regard to classification accuracy, average return and the Sharpe ratio.
Predication Model for Leukemia Diseases Based on Data Mining Classification Algorithms with Best Accuracy
In recent years, there has been an explosion in the use of technologies that help detect diseases. For example, DNA microarrays allow us for the first time to obtain a "global" view of the cell. They have great potential to provide accurate medical diagnosis and to help find the right treatment and cure for many diseases. Various classification algorithms can be applied to such microarray datasets to devise methods that can predict the occurrence of leukemia. In this study, we compared the classification accuracy and response time of eleven decision tree methods and six rule classifier methods using five performance criteria. The experimental results show that Random Tree produces better results than the other tree classifiers and also takes the least time to build a model. Among the rule classifiers, the nearest-neighbor-like algorithm (NNge) is the best, owing to its high accuracy and because it takes the least time to build a model.
Performance Comparison of ADTree and Naive Bayes Algorithms for Spam Filtering
Classification is an important data mining technique and can be used for data filtering in artificial intelligence. Its broad applicability to all kinds of data has led to its use in nearly every field of modern life. Classification helps us group different items according to features chosen as interesting and useful. In this paper, we compare two classification methods, Naïve Bayes and ADTree, used to detect spam e-mail. This choice is motivated by the fact that the Naïve Bayes algorithm is based on probability calculus while the ADTree algorithm is based on decision trees. The parameter settings of these classifiers are chosen to maximize the true positive rate and minimize the false positive rate. The experimental results present classification accuracy and cost analysis with a view to choosing the optimal classifier for spam detection. The number of attributes is pointed out in order to obtain a tradeoff between their number and the classification
Intelligent Recognition of Diabetes Disease via FCM Based Attribute Weighting
In this paper, an attribute weighting method called fuzzy C-means clustering based attribute weighting (FCMAW) is used for classification of a diabetes disease dataset. The aims of this study are to reduce the variance within the attributes of the diabetes dataset and to improve the classification accuracy of the classifier algorithms by transforming the dataset from a non-linearly separable one into a linearly separable one. The Pima Indians Diabetes dataset has two classes: normal subjects (500 instances) and diabetes subjects (268 instances). Fuzzy C-means clustering is an improved version of the K-means clustering method and is one of the most used clustering methods in data mining and machine learning applications. In this study, as the first stage, fuzzy C-means clustering is used to find the centers of the attributes in the Pima Indians Diabetes dataset, and the dataset is then weighted according to the ratios of the means of the attributes to their centers. In the second stage, after the weighting process, support vector machine (SVM) and k-nearest neighbor (k-NN) classifiers are used to classify the weighted Pima Indians Diabetes dataset. Experimental results show that the proposed attribute weighting method (FCMAW) obtains very promising results in the classification of the Pima Indians Diabetes dataset.