A Two-Step Method for Clustering Mixed Categorical and Numeric Data


Abstract:
Various clustering algorithms have been developed to group data into clusters in diverse domains. However, these algorithms work effectively on either purely numeric or purely categorical data; most of them perform poorly on data that mixes categorical and numeric types. This paper presents a new two-step clustering method for finding clusters in such data. First, the items in the categorical attributes are processed to construct the similarities or relationships among them, based on the idea of co-occurrence. All categorical attributes are then converted into numeric attributes using these relationships. Once all categorical data have been converted to numeric form, existing clustering algorithms can be applied to the data set directly. Because existing clustering algorithms each have weaknesses of their own, the proposed two-step method integrates hierarchical and partitioning clustering algorithms and adds attributes to the objects being clustered. The method defines the relationships among items and mitigates the weaknesses of applying a single clustering algorithm. Experimental evidence shows that robust results can be achieved by applying this method to mixed numeric and categorical data.

Existing System:
Various clustering applications have emerged in diverse domains. However, most traditional clustering algorithms are designed for either numeric data or categorical data. Data collected in the real world often contain both numeric and categorical attributes, and it is difficult to apply a traditional clustering algorithm directly to such data. Typically, when a traditional distance-based clustering algorithm (e.g., k-means) is applied to this type of data, a numeric value is assigned to each category of a categorical attribute. Some categorical values, such as “low”, “medium” and “high”, can easily be mapped to numeric values. But values such as “red”, “white” and “blue” have no natural order, so assigning numeric values to these kinds of categorical attributes is a challenging task.
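The encoding problem described above can be illustrated with a minimal sketch. The integer codes below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
# Hypothetical integer codes; neither mapping comes from the paper.
ordinal_codes = {"low": 0, "medium": 1, "high": 2}
nominal_codes = {"red": 0, "white": 1, "blue": 2}


def code_distance(codes, a, b):
    """Absolute difference between the integer codes of two category values."""
    return abs(codes[a] - codes[b])


# For an ordered attribute the distances agree with intuition.
print(code_distance(ordinal_codes, "low", "high"))    # 2
print(code_distance(ordinal_codes, "low", "medium"))  # 1

# For colours the same scheme makes "red" twice as far from "blue" as from
# "white", which is purely an artifact of the arbitrary code order.
print(code_distance(nominal_codes, "red", "blue"))    # 2
print(code_distance(nominal_codes, "red", "white"))   # 1
```

A distance-based algorithm such as k-means would treat these code differences as real distances, which is why an arbitrary numbering of unordered categories can distort the clusters.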

Disadvantages:
1. The major problem of existing clustering algorithms is that most of them treat every attribute as a single entity and ignore the relationships among attributes, even though such relationships may exist.
2. Traditional clustering algorithms cannot handle mixed categorical and numeric data effectively.
3. By contrast, the experimental results show that the proposed approach can achieve high-quality clustering results.

Proposed System:
The TMCM algorithm builds on the co-occurrence idea to produce purely numeric attributes. The paper presents the algorithm alongside a sample data set that is used to illustrate the proposed ideas. The first step of the proposed approach is to read the input data and normalize the values of the numeric attributes into the range from zero to one. This prevents attributes with a large range of values from dominating the clustering results. Next, the categorical attribute A with the largest number of items is selected as the base attribute, and the items appearing in the base attribute are defined as base items. This choice ensures that a non-base item can map to multiple base items. If an attribute with fewer items were selected as the base attribute, the probability of mapping several non-base items to the same base items would be higher, in which case different categorical items could receive the same numeric value.
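The preprocessing steps described above might be sketched as follows. This is only an illustration under stated assumptions: the records, the attribute names (`dept`, `drink`, `amount`) and the helper function names are all hypothetical, and the sketch stops at counting co-occurrences; it does not reproduce the paper's formula for turning those counts into numeric values.

```python
from collections import Counter, defaultdict

# Hypothetical sample records (not the paper's data set):
# two categorical attributes and one numeric attribute.
records = [
    {"dept": "EE", "drink": "coke", "amount": 70.0},
    {"dept": "EE", "drink": "pepsi", "amount": 60.0},
    {"dept": "ME", "drink": "coke", "amount": 70.0},
    {"dept": "ME", "drink": "tea", "amount": 100.0},
    {"dept": "IM", "drink": "sprite", "amount": 50.0},
    {"dept": "IM", "drink": "pepsi", "amount": 40.0},
]


def min_max_normalize(values):
    """Rescale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_base_attribute(records, categorical_attrs):
    """Pick the categorical attribute with the most distinct items."""
    return max(categorical_attrs, key=lambda a: len({r[a] for r in records}))


def co_occurrence_with_base(records, base_attr, other_attr):
    """Count how often each non-base item appears together with each base item."""
    counts = defaultdict(Counter)
    for r in records:
        counts[r[other_attr]][r[base_attr]] += 1
    return counts


amounts = min_max_normalize([r["amount"] for r in records])
base_attr = select_base_attribute(records, ["dept", "drink"])
counts = co_occurrence_with_base(records, base_attr, "dept")

print(amounts)
print(base_attr)           # drink: 4 distinct items versus 3 for dept
print(dict(counts["EE"]))  # EE co-occurs with two base items, coke and pepsi
```

In this toy data the non-base item `EE` maps to two different base items, which is the situation the base-attribute choice is meant to encourage.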

Advantages:
1. Clustering is considered an important tool for data mining. The goal of data clustering is to divide a data set into several groups such that objects in the same group are highly similar to each other and highly dissimilar to objects in other groups.
2. The TMCM algorithm builds on the co-occurrence idea to produce purely numeric attributes.
3. The TMCM algorithm integrates the HAC (hierarchical agglomerative clustering) and k-means algorithms to cluster mixed-type data. Applying other algorithms or more sophisticated similarity measures within TMCM may yield better results.

Software Requirements Specification:

Software Requirements:
Front End         : Java Swing
Back End          : No database
IDE               : MyEclipse 8.0
Language          : Java (JDK 1.6.0)
Operating System  : Windows XP

Hardware Requirements:
System        : Pentium IV, 2.4 GHz
Hard Disk     : 80 GB
Floppy Drive  : 1.44 MB
Monitor       : 14" colour monitor
Mouse         : Optical mouse
RAM           : 512 MB
Keyboard      : 101-key keyboard

Module Description:
  • In the first clustering step, similar objects are grouped into subsets, and these subsets are treated as the objects input to the second clustering step. Noise and outliers are thereby smoothed in the k-means clustering process.
  • The added attributes not only offer useful information for clustering but also reduce the influence of noise and outliers.
  • In the second clustering step, the initial centroids are groups of similar objects. This strategy is expected to work better than the random selection used in most applications.
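
The two clustering steps described above can be sketched roughly as follows, using only the Python standard library. This is not the authors' implementation: the function names are hypothetical, the first step uses centroid distance as the merge criterion, the seeds for k-means are taken from the largest subsets, and the added attributes are omitted. All of these are assumptions made for illustration.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def agglomerate(points, n_subsets):
    """Step 1: merge the two subsets with the closest centroids
    until only n_subsets remain."""
    subsets = [[p] for p in points]
    while len(subsets) > n_subsets:
        best = None
        for i in range(len(subsets)):
            for j in range(i + 1, len(subsets)):
                d = math.dist(centroid(subsets[i]), centroid(subsets[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        subsets[i] = subsets[i] + subsets[j]
        del subsets[j]
    return subsets


def kmeans(objects, initial_centers, max_iter=100):
    """Step 2: plain k-means (Lloyd's algorithm) from the given centers."""
    centers = [list(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(o, centers[k]))
            for o in objects
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for k in range(len(centers)):
            members = [o for o, lab in zip(objects, labels) if lab == k]
            if members:  # keep the old center if a cluster becomes empty
                centers[k] = centroid(members)
    return labels, centers


def two_step_cluster(points, n_subsets, k):
    """Cluster subset centroids with k-means, then label the original points."""
    subsets = agglomerate(points, n_subsets)
    subset_centroids = [centroid(s) for s in subsets]
    # Assumption: seed k-means with the centroids of the k largest subsets.
    order = sorted(range(len(subsets)), key=lambda i: len(subsets[i]), reverse=True)
    seeds = [subset_centroids[i] for i in order[:k]]
    subset_labels, _ = kmeans(subset_centroids, seeds)
    # Each original point inherits the label of the subset it belongs to.
    return [(p, subset_labels[i]) for i, s in enumerate(subsets) for p in s]


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
for point, label in two_step_cluster(points, n_subsets=3, k=2):
    print(point, label)
```

Because k-means here operates on subset centroids rather than individual objects, a single outlier has less influence on the final centers, and the seeds are already groups of similar objects rather than random picks.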

Algorithm:
1. K-means algorithm
2. K-medoids algorithm
3. Agglomerative algorithm

