Bioassay is the measurement of the potency of a chemical substance by its effect on living animal or plant tissue. Bioassay data and chemical structures from pharmacokinetic and drug metabolism screening are mined from and housed in multiple databases. Bioassay predictions are calculated accordingly to determine further advancement. This paper proposes a four-step preprocessing of datasets to improve bioassay predictions. The first step is instance selection, in which the dataset is divided into training, testing, and validation sets. The second step is discretization, which partitions the data in consideration of accuracy vs. precision. The third step is normalization, where data are scaled to between 0 and 1 for subsequent machine learning processing. The fourth step is feature selection, where key chemical properties and attributes are generated. The streamlined results are then analyzed for prediction effectiveness by various machine learning algorithms in tools including Pipeline Pilot, R, Weka, and Excel. Experiments and evaluations reveal the effectiveness of various combinations of preprocessing steps and machine learning algorithms in producing more consistent and accurate predictions.
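As an illustration of the normalization step described in the abstract above, the following is a minimal sketch of min-max scaling to the range 0 to 1. The function name and the sample values are hypothetical and are not taken from the paper; the abstract does not state which scaling formula was used, so linear min-max scaling is an assumption.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly onto [0, 1] (assumed min-max scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: there is no spread to rescale, so map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical descriptor column; illustrative values only, not data from the paper.
descriptor = [2.0, 4.0, 6.0]
scaled = min_max_normalize(descriptor)
constant = min_max_normalize([5.0, 5.0])
```

In practice the minimum and maximum would be computed on the training set only and reused for the testing and validation sets, so that no information leaks across the instance-selection split.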
DNA data have been used in forensics for decades. Current research, however, looks at using DNA as a biometric modality for identity verification, with the goal of improving the speed of identification. We take gene data originally collected for autism detection and examine whether, and how accurately, these data can be used for identification. Our main goal is to determine whether our data preprocessing technique yields data useful as a biometric identification tool. We experiment with a nearest neighbor classifier to identify subjects. Results show that the optimal classification rate is achieved when the test set is corrupted by normally distributed noise with zero mean and a standard deviation of 1. The classification rate remains close to optimal at higher noise standard deviations, up to 3. This shows that the data can be used for identity verification with high accuracy using a simple classifier such as k-nearest neighbor (k-NN).
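The evaluation protocol in the abstract above (nearest neighbor identification of subjects from noise-corrupted test vectors) can be sketched roughly as follows. The gallery here is synthetic random data standing in for the gene data, and the function names, vector length, and number of subjects are all assumptions made for illustration.

```python
import math
import random


def nearest_neighbor(query, gallery):
    """Return the label whose gallery vector is closest to query (Euclidean distance)."""
    return min(gallery, key=lambda label: math.dist(query, gallery[label]))


def identification_rate(gallery, sigma, rng):
    """Fraction of subjects still identified after adding N(0, sigma) noise to each vector."""
    hits = 0
    for label, vector in gallery.items():
        noisy = [x + rng.gauss(0.0, sigma) for x in vector]
        if nearest_neighbor(noisy, gallery) == label:
            hits += 1
    return hits / len(gallery)


rng = random.Random(0)
# Synthetic stand-in for preprocessed gene data: 20 subjects, 50 features each.
gallery = {f"subject_{i}": [rng.uniform(0.0, 10.0) for _ in range(50)] for i in range(20)}

clean_rate = identification_rate(gallery, 0.0, rng)
noisy_rate = identification_rate(gallery, 3.0, rng)
```

With zero noise every probe coincides with its own gallery entry, so the rate is 1.0 by construction; the rate at other noise levels depends on the data and is not meant to reproduce the paper's figures.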
The purpose of this study is to provide an improved mode choice model that considers parameters including age groups covering prime-aged and older people. The study used 2010 Household Travel Survey data, and improper samples were removed during the analysis. The chosen alternative, date of birth, mode, origin code, destination code, departure time, and arrival time were taken from the Household Travel Survey. Through data preprocessing, travel time, travel cost, mode, and the ratios of people aged 45 to 55 years, 55 to 65 years, and over 65 years were calculated. The mode choice model was then constructed in LIMDEP by maximum likelihood estimation. A significance test was conducted for nine parameters (three age groups for each of three modes). The test was then repeated for the mode choice model with the significant parameters together with the travel cost and travel time variables. The model estimation shows that as age increases, the preference for the car decreases and the preference for the bus increases. This study is meaningful in that individual and household characteristics are applied to an aggregate model.
Five implementations of three data mining classification techniques were applied experimentally to extract important insights from tourism data. The aim was to find the best-performing algorithm among those compared for tourism knowledge discovery. The knowledge discovery from data process was used as the process model, and 10-fold cross-validation was used for testing. Various data preprocessing activities were performed to obtain the final dataset for model building. Classification models of the selected algorithms were built under different scenarios on the preprocessed dataset. The best-performing algorithm on the tourism dataset was Random Forest (76%) before applying information-gain-based attribute selection, and J48 (C4.5) (75%) after selecting the attributes most relevant to the class (target) attribute. In terms of model-building time, attribute selection improved the efficiency of all algorithms; the Artificial Neural Network (multilayer perceptron) showed the largest improvement (90%). The rules extracted from the decision tree model are presented; they reveal intricate, non-trivial knowledge that would not be discovered by simple statistical analysis, although the accuracy of the classification algorithms was only moderate.
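The information-gain-based attribute selection mentioned in the abstract above ranks each attribute by how much it reduces the entropy of the class attribute. The sketch below shows the standard calculation on a tiny hypothetical table; the attribute names and rows are invented for illustration and are not the paper's tourism data.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr, target):
    """Entropy of the target minus the weighted entropy after splitting on attr."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attr] for row in rows}:
        subset = [row[target] for row in rows if row[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical toy records: "season" determines the class, "transport" does not.
rows = [
    {"season": "summer", "transport": "bus", "visit": "yes"},
    {"season": "summer", "transport": "car", "visit": "yes"},
    {"season": "winter", "transport": "bus", "visit": "no"},
    {"season": "winter", "transport": "car", "visit": "no"},
]
gains = {attr: information_gain(rows, attr, "visit") for attr in ("season", "transport")}
ranked = sorted(gains, key=gains.get, reverse=True)
```

Keeping only the top-ranked attributes shrinks the input to each classifier, which is consistent with the reduction in model-building time reported in the abstract.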
Remotely sensed data are a significant source for monitoring and updating land use/cover databases. Change detection in urban areas has become a subject of intensive research, and timely and accurate data on spatio-temporal changes of urban areas are therefore required. The data extracted from multi-temporal satellite images are usually non-stationary, since the changes evolve in time and space. This paper proposes a methodology for change detection in urban areas that combines a non-stationary decomposition method with stochastic modeling. The input to our methodology is a sequence of satellite images I1, I2, ..., In acquired at different periods (t = 1, 2, ..., n). First, the multi-temporal satellite images are preprocessed (e.g. radiometric, atmospheric and geometric corrections). The systematic study of global urban expansion in our methodology can be approached in two ways. The first treats the urban area as a single object, as opposed to non-urban areas (e.g. vegetation, bare soil and water); the objective is to extract the urban mask. The second aims to gain more detailed knowledge of the urban area by distinguishing different types of urban fabric within it. To validate our approach, we used a database of Tres Cantos, Madrid, Spain, derived from Landsat imagery for the period from January 2004 to July 2013, with two frames collected per year at a spatial resolution of 25 meters. The results obtained show the effectiveness of our method.
Hepatitis is one of the most common and dangerous diseases affecting humankind, exposing millions of people to serious health risks every year. Diagnosis of hepatitis has always been a challenge for physicians. This paper presents an effective method for the diagnosis of hepatitis based on interval Type-II fuzzy logic. The proposed system includes three steps: preprocessing (feature selection), Type-I and Type-II fuzzy classification, and system evaluation. KNN-FD feature selection is used as the preprocessing step in order to exclude irrelevant features and to improve classification performance and efficiency in generating the classification model. In the fuzzy classification step, an "indirect approach" is used for fuzzy system modeling, implementing the exponential compactness and separation index to determine the number of rules in the fuzzy clustering approach. We first proposed a Type-I fuzzy system that had an accuracy of approximately 90.9%. In the proposed system, the process of diagnosis faces vagueness and uncertainty in the final decision; this imprecise knowledge was therefore managed using interval Type-II fuzzy logic. The results show that the interval Type-II fuzzy system is able to diagnose hepatitis with an average accuracy of 93.94%. The classification accuracy obtained is the highest one reached thus far. This accuracy demonstrates that the Type-II fuzzy system performs better than the Type-I system and indicates a higher capability of the Type-II fuzzy system for modeling uncertainty.
Music has always been an integral part of people's daily lives. For most people, however, reading a musical score and turning it into melody is not easy. This study aims to develop an automatic music score recognition system using digital image processing, which can read and analyze musical score images automatically. The technical approach included: (1) staff region segmentation; (2) image preprocessing; (3) note recognition; and (4) accidental and rest recognition. Digital image processing techniques (e.g., horizontal/vertical projections, connected component labeling, morphological processing, template matching, etc.) were applied according to the musical notes, accidentals, and rests in staff notation. Preliminary results showed that our system could achieve detection and recognition rates of 96.3% and 91.7%, respectively. In conclusion, we presented an effective automated musical score recognition system that could be integrated with a media player to play music/songs given input images of a musical score. Ultimately, this system could also be incorporated into applications for mobile devices as a learning tool with which a musician could learn to play music/songs.
Advances in image and video processing techniques have enabled the development of intelligent video surveillance systems. This study aimed to automatically detect moving human objects and to analyze events of dual human interaction in a surveillance scene. Our system was developed in four major steps: image preprocessing, human object detection, human object tracking, and motion trajectory analysis. Adaptive background subtraction and image processing techniques were used to detect and track moving human objects. To solve the occlusion problem during interaction, the Kalman filter was used to retain a complete trajectory for each human object. Finally, motion trajectory analysis was developed to distinguish between interaction and non-interaction events based on derivatives of the trajectories related to the speed of the moving objects. Using a database of 60 video sequences, our system achieved a classification accuracy of 80% for interaction events and 95% for non-interaction events. In summary, we have explored a system for the automatic classification of interaction and non-interaction events using surveillance cameras. Ultimately, this system could be incorporated into an intelligent surveillance system for the detection and/or classification of abnormal or criminal events (e.g., theft, snatching, fighting, etc.).
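To illustrate how a Kalman filter can retain a trajectory through occlusion, as described in the abstract above, here is a minimal one-dimensional constant-velocity sketch. When a measurement is missing (represented by None), only the prediction step runs, so the track continues along the estimated velocity. The state model, noise values, and measurement sequence are assumptions for illustration; the paper's actual filter design is not given in the abstract.

```python
def kalman_track(measurements, q=0.01, r=1.0):
    """Track 1-D position with a constant-velocity Kalman filter.

    measurements: list of positions, with None where the object is occluded.
    q, r: assumed process and measurement noise variances (hypothetical values).
    """
    pos, vel = measurements[0], 0.0          # initial state from the first detection
    p00, p01, p10, p11 = 100.0, 0.0, 0.0, 100.0  # large initial uncertainty
    track = [pos]
    for z in measurements[1:]:
        # Predict: x = F x and P = F P F^T + Q, with F = [[1, 1], [0, 1]].
        pos = pos + vel
        p00, p01, p10, p11 = (p00 + p10 + p01 + p11 + q, p01 + p11,
                              p10 + p11, p11 + q)
        if z is not None:
            # Update with the position measurement, H = [1, 0].
            s = p00 + r
            k0, k1 = p00 / s, p10 / s
            residual = z - pos
            pos, vel = pos + k0 * residual, vel + k1 * residual
            p00, p01, p10, p11 = ((1 - k0) * p00, (1 - k0) * p01,
                                  p10 - k1 * p00, p11 - k1 * p01)
        track.append(pos)
    return track


# Object moving 2 units per frame, occluded for three frames in the middle.
observed = [0.0, 2.0, 4.0, 6.0, None, None, None, 14.0, 16.0]
track = kalman_track(observed)
```

A real tracker would use a two-dimensional state (image x and y with velocities), but the predict-only behavior during occlusion is the same.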
The edges of low-contrast images are not clearly distinguishable to the human eye, and it is difficult to find the edges and boundaries in them. The present work describes a new approach for low-contrast images. A Chebyshev polynomial based fractional order filter is used to preprocess the input image. The Laplacian of Gaussian method is then applied to the preprocessed image for edge detection. The algorithm has been tested on two test images.
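The edge detection stage named in the abstract above, the Laplacian of Gaussian (LoG), can be sketched as follows. This example covers only the LoG step; the Chebyshev polynomial based fractional order filter is not reproduced here. The kernel size, sigma, and the synthetic step-edge image are assumptions for illustration. Edges show up as sign changes (zero crossings) in the filter response.

```python
import math


def log_kernel(size, sigma):
    """Sampled Laplacian-of-Gaussian kernel, shifted so its entries sum to zero."""
    half = size // 2
    two_sigma_sq = 2.0 * sigma * sigma
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            r2 = x * x + y * y
            row.append(-(1.0 / (math.pi * sigma ** 4))
                       * (1.0 - r2 / two_sigma_sq)
                       * math.exp(-r2 / two_sigma_sq))
        kernel.append(row)
    mean = sum(map(sum, kernel)) / (size * size)
    return [[v - mean for v in row] for row in kernel]


def filter_valid(image, kernel):
    """Apply the (symmetric) kernel over every fully covered position of the image."""
    k = len(kernel)
    rows, cols = len(image), len(image[0])
    return [[sum(kernel[a][b] * image[i + a][j + b] for a in range(k) for b in range(k))
             for j in range(cols - k + 1)]
            for i in range(rows - k + 1)]


kernel = log_kernel(5, 1.0)
flat = [[1.0] * 9 for _ in range(9)]                  # no edges anywhere
step = [[0.0] * 4 + [1.0] * 5 for _ in range(9)]      # vertical edge between columns 3 and 4
flat_response = filter_valid(flat, kernel)
step_response = filter_valid(step, kernel)
```

Because the kernel sums to zero, a uniform region gives no response, while the two output positions on either side of the step take opposite signs.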
The color histogram is considered the oldest method used by CBIR systems for indexing images. Global histograms, however, do not include spatial information; this is why later techniques have attempted to overcome this limitation by involving segmentation as a preprocessing step. Weak segmentation is employed by local histograms, while other methods such as CCV (Color Coherence Vector) are based on strong segmentation. Indexing based on local histograms consists of splitting the image into N overlapping blocks or sub-regions and then computing the histogram of each block. The dissimilarity between two images is consequently reduced to computing the distances between the N local histograms of both images, resulting in N*N values; generally, the lowest value is used to rank images, which means that the lowest value designates which sub-region is used to index the images of the collection being queried. In this paper, we shed light on the local histogram indexing method in order to compare its results against those given by the global histogram. We also address another noteworthy issue when relying on local histograms, namely which value among the N*N values to trust when comparing images; in other words, which of the sub-regions should serve as the basis for indexing images. Based on the results achieved here, it seems that relying on local histograms, which imposes an extra overhead on the system by involving segmentation as another preprocessing step, does not necessarily produce better results. In addition, we propose some ideas for selecting the local histogram used to encode the image, rather than relying on the local histogram with the lowest distance to the query histograms.
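The local-histogram comparison described in the abstract above (N block histograms per image, N*N pairwise distances, lowest value used for ranking) can be sketched as follows. For simplicity this sketch uses grayscale images, a regular grid of non-overlapping blocks, and the L1 distance; these choices, along with the function names, are assumptions and may differ from the paper's setup.

```python
def block_histograms(image, grid, bins, max_value=256):
    """Split a grayscale image into grid x grid blocks and return one normalized histogram per block."""
    rows, cols = len(image), len(image[0])
    block_h, block_w = rows // grid, cols // grid
    histograms = []
    for gy in range(grid):
        for gx in range(grid):
            hist = [0] * bins
            for y in range(gy * block_h, (gy + 1) * block_h):
                for x in range(gx * block_w, (gx + 1) * block_w):
                    hist[image[y][x] * bins // max_value] += 1
            histograms.append([count / (block_h * block_w) for count in hist])
    return histograms


def l1_distance(h1, h2):
    """Sum of absolute bin differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def min_block_distance(image_a, image_b, grid=2, bins=4):
    """Lowest of the N*N distances between the local histograms of two images."""
    hists_a = block_histograms(image_a, grid, bins)
    hists_b = block_histograms(image_b, grid, bins)
    return min(l1_distance(p, q) for p in hists_a for q in hists_b)


dark = [[0] * 4 for _ in range(4)]
bright = [[255] * 4 for _ in range(4)]
same = min_block_distance(dark, dark)
different = min_block_distance(dark, bright)
```

Replacing `min` with another aggregate (mean, median, or a distance from a chosen block) is one way to experiment with the question the abstract raises about which of the N*N values to trust.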
A simulation-based VLSI (Very Large Scale Integration) implementation of the FELICS (Fast Efficient Lossless Image Compression System) algorithm is proposed to provide lossless image compression. The aim is to analyze the performance of lossless image compression and to reduce image size without losing image quality, using a VLSI-based FELICS algorithm. The FELICS algorithm uses a simplified adjusted binary code for image compression; the compressed image is converted into pixels and then implemented in the VLSI domain. This is done to achieve high processing speed and to minimize area and power. The simplified adjusted binary code reduces the number of arithmetic operations and achieves high processing speed. Color difference preprocessing is also proposed to improve coding efficiency with simple arithmetic operations. The VLSI-based FELICS algorithm provides an effective solution for hardware architecture design, with regular pipelined data flow parallelism in four stages. With two-level parallelism, consecutive pixels can be classified into even and odd samples, and an individual hardware engine is dedicated to each. This method can be further enhanced by multilevel parallelism.
Natural resources management, including water resources, requires reliable estimates of time-variant environmental parameters. Small improvements in the estimation of environmental parameters can have great effects on management decisions. Noise reduction using wavelet techniques is an effective approach for preprocessing practical data sets. The predictability enhancement of river flow time series is assessed using fractal approaches before and after applying wavelet-based preprocessing. Time series correlation and persistence, the minimum sufficient length for training the predictive model, and the maximum valid length of predictions were also investigated through a fractal assessment.
Advances in the field of image processing open a new era of evaluation techniques and procedures applicable in many different fields. One such field is the biomedical field, for both prognosis and diagnosis of diseases. Although this plethora of methods provides a wide range of options, it also causes confusion in selecting the most suitable process. Our objective is to apply a series of techniques to bone scans so as to detect the occurrence of rheumatoid arthritis (RA) as accurately as possible. Compared with other existing techniques, our proposed system tends to be more effective, as it relies on newer methodologies that have been shown to be better and more consistent than others. Computer-aided diagnosis is expected to provide more accurate and consistent results, which will help to improve the efficiency of the system. The image first undergoes histogram smoothing and specification, a morphing operation, boundary detection by an edge-following algorithm, and finally image subtraction to determine the presence of rheumatoid arthritis. Preprocessing removes noise from the images, segmentation finds the region of interest, and histogram smoothing is applied to a specific portion of the images. Gray level co-occurrence matrix (GLCM) features such as Mean, Median, Energy, and Correlation, along with Bone Mineral Density (BMD) and others, are then computed. Once all the features have been found, they are stored in the database. This dataset is used for training with inflamed and non-inflamed values; with the help of a neural network, all new images are checked for their status, and a rough set is implemented for further reduction.
Real-time or in-line process monitoring frameworks are designed to give early warning of a fault along with meaningful identification of its assignable causes. In the pattern recognition fields of artificial intelligence and machine learning, various promising approaches have been proposed, such as kernel-based nonlinear machine learning techniques. This work presents a kernel-based empirical monitoring scheme for batch-type production processes with the small-sample-size problem of partially unbalanced data. Measurement data from normal operations are easy to collect, whilst data on special events or faults are difficult to collect. In such situations, noise filtering techniques can help enhance process monitoring performance. Furthermore, raw process data are preprocessed to remove unwanted variation. The performance of the monitoring scheme was demonstrated using three-dimensional batch data. The results showed that monitoring performance improved significantly in terms of the detection success rate for process faults.
This paper discusses the design of knowledge integration of clinical information extracted from distributed medical ontologies in order to improve a machine learning-based multilabel coding assignment system. The proposed approach is implemented using a decision tree machine learning technique on university hospital data for patients with Coronary Heart Disease (CHD). The preliminary results show that the use of medical ontologies improves overall system performance.
Hand vein recognition has recently attracted increasing attention in biometric identification systems. Hand vein images are generally acquired with low contrast and irregular illumination. Accordingly, with good preprocessing of the hand vein image, features can be extracted easily, even with simple binarization. This paper proposes an approach to improve the quality of hand vein images. First, existing enhancement methods are briefly surveyed. Then a Radon-like features method is applied to preprocess the hand vein image. Finally, experimental results show that the proposed method is more effective and reliable in improving hand vein images.
Brain ArterioVenous Malformation (BAVM) is an abnormal tangle of brain blood vessels in which arteries shunt directly into veins with no intervening capillary bed, causing high pressure and a risk of hemorrhage. The success of treatment by embolization in interventional neuroradiology is highly dependent on the accuracy of vessel visualization. In this paper, the performance of clustering techniques for vessel segmentation from 3-D rotational angiography (3DRA) images is investigated and a new segmentation technique is proposed. The method consists of a preprocessing step of image enhancement, followed by K-Means (KM), Fuzzy C-Means (FCM) and Expectation Maximization (EM) clustering to separate vessel pixels from background and, when possible, artery pixels from vein pixels. A post-processing step that removes false-alarm components is applied before constructing a three-dimensional volume of the vessels. The proposed method was tested on six datasets and assessed by a medical expert. The results showed encouraging segmentations.
Predicting yield in the semiconductor test process is important for increasing yield. In this study, yield prediction means effectively finding defective dies, wafers or lots. The semiconductor test process consists of several test steps, and each test includes various test items. In other words, the test data are large and complex. They are also disproportionately distributed, as the number of data points belonging to the FAIL class is extremely low. Because of these inherent properties of test data, general data mining techniques are limited for yield prediction without data preprocessing. Therefore, this study proposes an under-sampling method using a support vector machine (SVM) to eliminate the class imbalance. To evaluate performance, a random under-sampling method is compared with the proposed method using actual semiconductor test data. The results show that the sampling method using SVM is effective in generating a robust model for yield prediction.
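The abstract above compares its SVM-based method against random under-sampling. The baseline is simple enough to sketch: keep every minority (FAIL) sample and draw an equally sized random subset of the majority (PASS) class. The sketch below shows only this baseline, not the proposed SVM-based selection, whose details are not given in the abstract; the labels, feature values, and function name are hypothetical.

```python
import random
from collections import Counter


def random_undersample(samples, labels, majority_label, rng):
    """Balance a two-class dataset by randomly discarding majority-class samples."""
    minority = [(s, l) for s, l in zip(samples, labels) if l != majority_label]
    majority = [(s, l) for s, l in zip(samples, labels) if l == majority_label]
    kept = rng.sample(majority, k=min(len(majority), len(minority)))
    balanced = minority + kept
    rng.shuffle(balanced)
    return balanced


rng = random.Random(42)
# Hypothetical imbalanced test data: 95 PASS dies and 5 FAIL dies, one feature each.
samples = [[float(i)] for i in range(100)]
labels = ["PASS"] * 95 + ["FAIL"] * 5
balanced = random_undersample(samples, labels, "PASS", rng)
counts = Counter(label for _, label in balanced)
```

An SVM-guided variant would replace the `rng.sample` call with a rule that chooses which majority samples to keep, for example based on their position relative to the decision boundary; that rule is an assumption here and is left out.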
Heavy rainfall greatly affects the aerodynamic performance of aircraft, and many aircraft accidents have been caused by the degradation of aerodynamic efficiency in heavy rain. In this paper we study the effects of heavy rain on the aerodynamic efficiency of NACA 64-210 and NACA 0012 airfoils. For our analysis, a CFD method and a preprocessing grid generator are used as the main analytical tools, and rain is simulated via the Discrete Phase Model (DPM) of the two-phase flow approach. Raindrops are assumed to be non-interacting, non-deforming, non-evaporating and non-spinning spheres. Both airfoil sections exhibited a significant reduction in lift and an increase in drag for a given lift condition in simulated rain. The most significant difference between the two airfoils was the sensitivity of the NACA 64-210 to liquid water content (LWC), while the performance losses of the NACA 0012 in the rain environment are not a function of LWC. It is expected that the quantitative information gained in this paper will be useful to the operational airline industry, and greater effort, such as small-scale and full-scale flight tests, should be put in this direction to further improve aviation safety.
Heavy rainfall greatly affects the aerodynamic performance of aircraft, and many aircraft accidents have been caused by the degradation of aerodynamic efficiency in heavy rain. In this paper we study the effects of heavy rain on the aerodynamic efficiency of the cambered NACA 64-210 and symmetric NACA 0012 airfoils. Our results show a significant increase in drag and decrease in lift. We used the preprocessing software Gridgen to create the geometry and mesh, Fluent as the solver, and Tecplot as the postprocessor. Discrete phase modeling (DPM) is used to model the rain particles with a two-phase flow approach. The rain particles are assumed to be inert. Both airfoils showed a significant decrease in lift and increase in drag in the simulated rain environment. The most significant difference between the two airfoils was that the NACA 64-210 is more sensitive than the NACA 0012 to liquid water content (LWC). We believe that the results shown in this paper will be useful to designers of commercial aircraft and UAVs, and will be helpful for training pilots to control airplanes in heavy rain.