Record matching over query results from multiple web databases



Abstract:
Record matching, which identifies the records that represent the same real-world entity, is an important step for data integration. Most state-of-the-art record matching methods are supervised and require the user to provide training data. These methods are not applicable to the Web database scenario, where the records to match are query results dynamically generated on the fly. Such records are query-dependent, and a pre-learned method using training examples from previous query results may fail on the results of a new query. To address the problem of record matching in the Web database scenario, we present an unsupervised, online record matching method, UDD, which, for a given query, can effectively identify duplicates from the query result records of multiple Web databases. After removal of the same-source duplicates, the “presumed” non-duplicate records from the same source can be used as training examples, alleviating the burden of users having to manually label training examples. Starting from the non-duplicate set, we use two cooperating classifiers, a weighted component similarity summing (WCSS) classifier and an SVM classifier, to iteratively identify duplicates in the query results from multiple Web databases. Experimental results show that UDD works well for the Web database scenario where existing supervised methods do not apply.

Existing System:
In designing a system that helps users integrate and, more importantly, compare the query results returned from multiple Web databases, a crucial task is to match the records from different sources that refer to the same real-world entity. For example, consider the query results returned by two online bookstores, booksamillion.com and abebooks.com, in response to the same query “Harry Potter” over the Title field. Before comparing the result records, we have to determine which attributes drive the matching decision, that is, the weights of the attributes. Until now, this comparison has relied mainly on supervised record matching methods, which require the user to provide training data. Such work is based on predefined matching rules hand-coded by domain experts, or on matching rules learned offline by some learning method from a set of training examples. These approaches work well in a traditional database environment, where all instances of the target databases can be readily accessed, as long as a set of high-quality representative records can be examined by experts or selected for the user to label.

Disadvantages:
- The main challenge for such a system is to eliminate duplicate records that come from different sources (URLs).
- Existing record matching methods are supervised and require the user to provide training data. They are not applicable to the Web database scenario, where the records to match are query results dynamically generated on the fly.
- Most existing work requires human-labeled training data (positive, negative, or both), which places a heavy burden on users.
- Web database records (results) are query-dependent, so a pre-learned (supervised) method using training examples from previous query results may fail on the results of a new query.

Proposed System:
In the Web database scenario, the records to match are highly query-dependent, since they can only be obtained through online queries. Moreover, they are only a partial and biased portion of all the data in the source Web databases. Consequently, hand-coding or offline-learning approaches (supervised techniques) are not appropriate for two reasons. First, the full data set is not available beforehand, and therefore, good representative data for training are hard to obtain. Second, and most importantly, even if good representative data are found and labeled for learning, the rules learned on the representatives of a full data set may not work well on a partial and biased part of that data set. To illustrate this problem, consider a query for books of a specific author, such as “J. K. Rowling.” Depending on how the Web databases process such a query, all the result records for this query may well have only “J. K. Rowling” as the value for the Author field. In this case, the Author field of these records is ineffective for distinguishing the records that should be matched and those that should not. To reduce the influence of such fields in determining which records should match, their weighting should be adjusted to be much lower than the weighting of other fields or even be zero. Moreover, for each new query, depending on the results returned, the field weights should probably change too, which is not possible in the supervised-learning based methods.
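The effect of a query-dependent field can be illustrated with a small sketch. This is not UDD's actual weight assignment; it assumes a simple stand-in heuristic in which a field's weight grows with the number of distinct values it takes among the result records, so a field that is constant for the current query gets weight zero. The class name `QueryFieldWeights`, the method `compute`, and the sample records are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class QueryFieldWeights {

    /**
     * Assumed heuristic (not the UDD algorithm): score each field by how many
     * distinct values it takes in the current query results, then normalize
     * the scores so they sum to 1. If every field is constant, all weights are 0.
     */
    static double[] compute(List<String[]> records, int fieldCount) {
        double[] raw = new double[fieldCount];
        double total = 0.0;
        for (int f = 0; f < fieldCount; f++) {
            Set<String> distinct = new HashSet<String>();
            for (String[] record : records) {
                distinct.add(record[f].trim().toLowerCase(Locale.ROOT));
            }
            // A field with a single distinct value cannot separate records: score 0.
            raw[f] = records.size() > 1
                    ? (distinct.size() - 1) / (double) (records.size() - 1)
                    : 0.0;
            total += raw[f];
        }
        double[] weights = new double[fieldCount];
        for (int f = 0; f < fieldCount; f++) {
            weights[f] = total > 0.0 ? raw[f] / total : 0.0;
        }
        return weights;
    }

    public static void main(String[] args) {
        // Made-up results for an author query; field 0 = Title, field 1 = Author.
        List<String[]> results = Arrays.asList(
                new String[] {"Harry Potter and the Goblet of Fire", "J. K. Rowling"},
                new String[] {"Harry Potter and the Half-Blood Prince", "J. K. Rowling"},
                new String[] {"Harry Potter and the Deathly Hallows", "J. K. Rowling"});
        double[] w = compute(results, 2);
        System.out.println("title=" + w[0] + " author=" + w[1]);
    }
}
```

Because the weights are recomputed from whatever records the current query returns, the same Author field could receive a non-zero weight for a title query, which is the behaviour a fixed, pre-learned weighting cannot provide.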
  
Advantages:
- By using an unsupervised method, we overcome the training data problem that makes supervised methods inapplicable to the Web database scenario.
- We propose a new record matching method, Unsupervised Duplicate Detection (UDD), for the specific record matching problem of identifying duplicates among records in query results from multiple Web databases.
- Two classifiers, WCSS and SVM, are used cooperatively in record matching to iteratively identify the duplicate pairs among all potential duplicate pairs.
- Experimental results show that our approach achieves high performance compared with previous work that requires training examples.

Architecture:

[Figure: System Architecture]


Software Requirements Specification:
Software Requirements:
Front End         : JSP, Servlet
Back End          : Oracle 10g
IDE               : MyEclipse 8.0
Language          : Java (JDK 1.6.0)
Operating System  : Windows XP

Hardware Requirements:
System        : Pentium IV 2.4 GHz
Hard Disk     : 80 GB
Floppy Drive  : 1.44 MB
Monitor       : 14" Colour Monitor
Mouse         : Optical Mouse
RAM           : 512 MB
Keyboard      : 101-key keyboard

Modules Description:
    1. Get the records from multiple web databases
    2. Identifying the similarity function
    3. Unsupervised Duplicate Detection

Get the records from multiple web databases:
This project focuses on retrieving records from multiple Web databases of the same domain, i.e., Web databases that provide the same type of records in response to user queries. Suppose there are s records in data source A and t records in data source B, with each record having a set of fields/attributes. Each of the t records in data source B can potentially be a duplicate of each of the s records in data source A, which gives s × t candidate record pairs.
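The size of the comparison space described above can be made concrete with a minimal sketch. It only enumerates index pairs between two sources; the class name `CandidatePairs` and the method `crossSource` are hypothetical, and how the records themselves are fetched from each Web database is left out.

```java
import java.util.ArrayList;
import java.util.List;

final class CandidatePairs {

    /**
     * Returns every index pair {i, j} where i is a record of source A
     * (0..s-1) and j is a record of source B (0..t-1).
     */
    static List<int[]> crossSource(int s, int t) {
        List<int[]> pairs = new ArrayList<int[]>();
        for (int i = 0; i < s; i++) {
            for (int j = 0; j < t; j++) {
                pairs.add(new int[] {i, j});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Hypothetical sizes: 3 records from source A, 4 from source B.
        System.out.println(crossSource(3, 4).size() + " candidate pairs");
    }
}
```

The pair count grows as the product of the two result sizes, which is why reducing the number of comparisons matters once the result lists become long.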

Identifying the similarity function:

In the Web database scenario, the records to match are highly query-dependent, since they can only be obtained through online queries. Consequently, hand-coding or offline-learning approaches are not applicable. To measure the similarity between two records, we assign a weight to each attribute of the records. The similarity of a record pair is then computed from the weighted attribute similarities, and duplicates are determined from that similarity.
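A weighted similarity of this kind can be sketched as follows. The sketch makes several assumptions that do not come from the project itself: field similarity is measured with token-level Jaccard overlap, the weights are supplied by the caller, and a pair is called a duplicate when the weighted sum reaches a caller-chosen threshold. The names `WeightedSimilarity`, `fieldSimilarity`, `recordSimilarity`, and `isDuplicate` are hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class WeightedSimilarity {

    /** Jaccard overlap of lower-cased whitespace tokens; 1.0 when both values are empty. */
    static double fieldSimilarity(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() && tb.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<String>(ta);
        intersection.retainAll(tb);
        Set<String> union = new HashSet<String>(ta);
        union.addAll(tb);
        return intersection.size() / (double) union.size();
    }

    private static Set<String> tokens(String value) {
        Set<String> out = new HashSet<String>();
        for (String token : value.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    /** Sum over fields of weight[f] * similarity of field f in the two records. */
    static double recordSimilarity(String[] r1, String[] r2, double[] weights) {
        double sum = 0.0;
        for (int f = 0; f < weights.length; f++) {
            sum += weights[f] * fieldSimilarity(r1[f], r2[f]);
        }
        return sum;
    }

    /** The threshold is an assumed tuning parameter, not a value from the source. */
    static boolean isDuplicate(String[] r1, String[] r2, double[] weights, double threshold) {
        return recordSimilarity(r1, r2, weights) >= threshold;
    }

    public static void main(String[] args) {
        // Made-up records; field 0 = Title, field 1 = Author.
        String[] a = {"Harry Potter and the Goblet of Fire", "J. K. Rowling"};
        String[] b = {"harry potter and the goblet of fire", "Rowling, J. K."};
        double[] weights = {0.8, 0.2};
        System.out.println("similarity = " + recordSimilarity(a, b, weights));
        System.out.println("duplicate  = " + isDuplicate(a, b, weights, 0.85));
    }
}
```

With weights that sum to 1, the record similarity stays in the range 0 to 1, so a single threshold can be applied regardless of how many fields a record has.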


Unsupervised Duplicate Detection:
An important aspect of duplicate detection is reducing the number of record pair comparisons. Several methods have been proposed for this purpose, including standard blocking, the sorted neighborhood method, bigram indexing, and record clustering. Even though these methods differ in how they partition the data set into blocks, they all considerably reduce the number of comparisons by only comparing records from the same block. Since any of these methods can be incorporated into UDD to reduce the number of record pair comparisons, we do not consider this issue further.
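As an illustration of the first of these methods, the sketch below applies standard blocking to two lists of titles. The blocking key used here, the first three letters or digits of the lower-cased value, is an assumption chosen for brevity, and the names `StandardBlocking`, `key`, and `candidatePairs` are hypothetical; a real deployment would pick a key suited to its domain.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class StandardBlocking {

    /** Assumed blocking key: first three letters or digits of the lower-cased value. */
    static String key(String value) {
        String normalized = value.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
        return normalized.length() <= 3 ? normalized : normalized.substring(0, 3);
    }

    /** Returns index pairs {i, j} such that a.get(i) and b.get(j) fall into the same block. */
    static List<int[]> candidatePairs(List<String> a, List<String> b) {
        // Group the records of source B by blocking key.
        Map<String, List<Integer>> blocksOfB = new HashMap<String, List<Integer>>();
        for (int j = 0; j < b.size(); j++) {
            String k = key(b.get(j));
            List<Integer> block = blocksOfB.get(k);
            if (block == null) {
                block = new ArrayList<Integer>();
                blocksOfB.put(k, block);
            }
            block.add(j);
        }
        // Pair each record of source A only with the B records in its block.
        List<int[]> pairs = new ArrayList<int[]>();
        for (int i = 0; i < a.size(); i++) {
            List<Integer> block = blocksOfB.get(key(a.get(i)));
            if (block == null) {
                continue;
            }
            for (int j : block) {
                pairs.add(new int[] {i, j});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Made-up titles from two hypothetical sources.
        List<String> sourceA = new ArrayList<String>();
        sourceA.add("Harry Potter");
        sourceA.add("The Hobbit");
        List<String> sourceB = new ArrayList<String>();
        sourceB.add("harry potter (hardcover)");
        sourceB.add("Dune");
        sourceB.add("The Hobbit, or There and Back Again");
        System.out.println(candidatePairs(sourceA, sourceB).size() + " pairs to compare");
    }
}
```

The trade-off is that two true duplicates whose keys differ, for example because of a typo in the first characters, are never compared, so the choice of key affects recall as well as speed.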


Algorithm:
- Duplicate vector identification algorithm
- Component weight assignment algorithm
