FiVaTech: Page-Level Web Data Extraction From Template Pages

Posted by hari, Saturday 31 March 2012

Abstract

Web data extraction has been an important part of many Web data analysis applications. In this paper, we formulate the data extraction problem as the decoding process of page generation based on structured data and tree templates. We propose an unsupervised, page-level data extraction approach to deduce the schema and templates for each individual Deep Website, which contains either singleton or multiple data records in one Webpage. FiVaTech applies tree matching, tree alignment, and mining techniques to accomplish this challenging task. In experiments, FiVaTech has much higher precision than EXALG and is comparable with other record-level extraction systems like ViPER and MSE. The experiments show encouraging results for the test pages used in many state-of-the-art Web data extraction works.

Existing System:

Generally speaking, templates, as a common model for all pages, stay largely fixed, whereas data values vary across pages. Finding such a common template requires either multiple pages or a single page containing multiple records as input. When multiple pages are given, the extraction target is page-wide information (e.g., RoadRunner and EXALG). When single pages are given, the extraction target is usually constrained to record-wide information, which involves the additional issue of record-boundary detection. Page-level extraction tasks, although they do not involve the additional problem of boundary detection, are much more complicated than record-level extraction tasks since more data are concerned. A common technique used to find the template is alignment: either string alignment (e.g., IEPAD, RoadRunner) or tree alignment (e.g., DEPTA). As for the problem of distinguishing template from data, most approaches assume that HTML tags are part of the template, while EXALG considers a general model where word tokens can also be part of the template and tag tokens can also be data. However, EXALG's approach, without explicit use of alignment, produces many accidental equivalence classes, which leaves the reconstructed schema incomplete.

Disadvantages:

• Complex Schema: The "schema" of the information encoded in the web pages could be very complex, with arbitrary levels of nesting. For instance, each book page can contain a set of authors, with each author having a set of addresses, and so on.

• Template vs. Data: Syntactically, nothing distinguishes the text that is part of the template from the text that is part of the data.

Proposed System:

In this paper, we focus on page-level extraction tasks and propose a new approach, called FiVaTech, to automatically detect the schema of a Website. The proposed technique presents a new structure, called the fixed/variant pattern tree, a tree that carries all of the information needed to identify the template and detect the data schema. We combine several techniques (alignment, pattern mining, and the idea of tree templates) to solve the much more difficult problem of page-level template construction. In experiments, FiVaTech has much higher precision than EXALG, one of the few page-level extraction systems, and is comparable with other record-level extraction systems like ViPER and MSE.

Advantages:

· We focus on page-level extraction tasks and propose a new approach, called FiVaTech, to automatically detect the schema of a Website.

· The proposed technique presents a new structure, called the fixed/variant pattern tree, a tree that carries all of the information needed to identify the template and detect the data schema.

Architecture:

General Description of EXALG

HARDWARE & SOFTWARE REQUIREMENTS:

HARDWARE REQUIREMENTS:

· System : Pentium IV 2.4 GHz.

· Hard Disk : 40 GB.

· Floppy Drive : 1.44 MB.

· Monitor : 15-inch VGA Color.

· Mouse : Logitech.

· RAM : 512 MB.

SOFTWARE REQUIREMENTS:

· Operating system : Windows XP Professional.

· Coding Language : Java (JDK 1.6.0)

· Front End : Struts Framework

· Back End : Oracle 10g

· IDE : MyEclipse 8.0

Modules Description:

User Registration:

A user registers by submitting personal details, which are stored in the database. On registration, the user obtains login credentials such as a username and password.

Input Web pages:

Template pages are generated by embedding a data instance in a predefined template via a CGI program. Thus, the reverse engineering task of finding the template and the data schema from the input Web pages should be based on some page generation model, which we describe next. In this paper, we propose a tree-based page generation model, which encodes data by subtree concatenation instead of string concatenation. This is because both data schemas and Web pages are tree-like structures. Thus, we also consider templates as tree structures. The advantage of the tree-based page generation model is that it does not include ending tags (e.g., </html>, </body>, etc.) in its templates, as the string-based page generation model applied in EXALG does.
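The idea of encoding data by subtree concatenation can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `Node` class, the `render_book` template, and the book/author data instance are hypothetical. The template is itself a tree, set-type data (the authors) becomes a run of sibling subtrees, and closing tags appear only when the finished tree is serialized.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A DOM-like node; for text leaves, `tag` holds the text itself."""
    tag: str
    children: List["Node"] = field(default_factory=list)
    is_text: bool = False


def text(value: str) -> Node:
    return Node(value, is_text=True)


def render_book(book: dict) -> Node:
    """Hypothetical tree template: embeds one data instance as subtrees."""
    # Set-type data (authors) is encoded as repeated sibling subtrees.
    author_items = [Node("li", [text(name)]) for name in book["authors"]]
    return Node("html", [
        Node("body", [
            Node("h1", [text(book["title"])]),
            Node("ul", author_items),
        ])
    ])


def serialize(node: Node) -> str:
    """Closing tags are produced here, not stored in the template."""
    if node.is_text:
        return node.tag
    inner = "".join(serialize(child) for child in node.children)
    return f"<{node.tag}>{inner}</{node.tag}>"


page = render_book({"title": "Web Mining", "authors": ["A. Author", "B. Writer"]})
html = serialize(page)
print(html)
```

Extraction in this model is the reverse direction: given several such pages, recover the fixed tree (the template) and the positions where subtrees vary (the data).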

DOM Trees Creation:

The first module merges all input DOM trees at the same time into a structure called fixed/variant pattern tree, which can then be used to detect the template and the schema of the Website in the second module. In this section, we will introduce how input DOM trees can be recognized and merged into the pattern tree for schema detection.

Tree Merging:

Tree merging proceeds in four steps:

• In the peer node recognition step, two nodes with the same tag name are compared to check whether they are peer subtrees. All peer subtrees are denoted by the same symbol.

• In the matrix alignment step, the system tries to align nodes (symbols) in the peer matrix to get a list of aligned nodes, childList. In addition to alignment, the other important task is to recognize variant leaf nodes that correspond to basic-typed data.

• In the pattern mining step, the system takes the aligned childList as input and detects every repetitive pattern in this list, starting with length 1. For each detected repetitive pattern, all occurrences of the pattern except the first are deleted before mining longer repeats. The result of this mining step is a modified list of nodes without any repetitive patterns.

• In the last step (line 12), the system recognizes optional nodes, that is, nodes that are missing in some columns of the matrix, and groups nodes according to their occurrence vectors.

After the above four steps, the system inserts the nodes in the modified childList as children of P. For each non-leaf child node c, if c is not a fixed template tree (as defined in the next section), the algorithm recursively calls the tree merging algorithm with the peer subtrees of c (by calling the procedure peerNode(c, M), which returns the nodes in M having the same symbol as c) to build the pattern tree.
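The pattern mining step can be sketched as follows. This is a simplified illustration and not the paper's algorithm: it assumes that repeats are strictly adjacent (tandem) and that symbols are plain strings, and the function name `collapse_repeats` is hypothetical. It scans for patterns of length 1, then 2, and so on, keeping only the first occurrence of each adjacent repeat.

```python
from typing import List


def collapse_repeats(symbols: List[str]) -> List[str]:
    """Delete adjacent repeats of every pattern length, shortest first.

    Assumption: only back-to-back repetitions are treated as repeats.
    """
    result = list(symbols)
    length = 1
    while length <= len(result) // 2:
        i = 0
        while i + 2 * length <= len(result):
            first = result[i:i + length]
            second = result[i + length:i + 2 * length]
            if first == second:
                # Keep the first occurrence and drop the repeated copy;
                # stay at i so that further copies are also removed.
                del result[i + length:i + 2 * length]
            else:
                i += 1
        length += 1
    return result


# A child list in which the pattern (img, a, br) occurs twice in a row.
child_list = ["h1", "img", "a", "br", "img", "a", "br", "p"]
mined = collapse_repeats(child_list)
print(mined)
```

In this example the two consecutive `img, a, br` groups collapse into one, which is the kind of list the step is described as producing: a node list without repetitive patterns, where the surviving group stands for a repeated (set-type) structure.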

Peer Matrix Alignment:

After peer node recognition, all peer subtrees are given the same symbol. For leaf nodes, two text nodes take the same symbol when they have the same text value, and two <img> tag nodes take the same symbol when they have the same SRC attribute value. To convert M into an aligned peer matrix, we work row by row so that each row (ignoring empty columns) either has the same symbol in every column or consists of text nodes with varying text values (or <img> nodes with varying SRC attribute values). In the latter case, the row is marked as basic-typed, indicating variant data. From the aligned matrix M, we get a list of nodes, where each node corresponds to a row in the aligned matrix.
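The row condition for an aligned peer matrix can be sketched for leaf nodes only. This is an illustrative simplification, not the paper's procedure: leaves are modeled as hypothetical `(kind, value)` pairs, empty columns as `None`, and the labels `"fixed"`, `"basic-typed"`, and `"unaligned"` are names chosen here. The shifting that actually aligns an unaligned row is not shown.

```python
from typing import List, Optional, Tuple

Leaf = Tuple[str, str]  # ("text", text value) or ("img", SRC attribute value)


def leaf_symbol(node: Leaf) -> str:
    """Leaves share a symbol only if both kind and value are equal."""
    kind, value = node
    return f"{kind}:{value}"


def classify_row(row: List[Optional[Leaf]]) -> str:
    """Classify one row of a peer matrix, ignoring empty columns."""
    cells = [cell for cell in row if cell is not None]
    symbols = {leaf_symbol(cell) for cell in cells}
    if len(symbols) <= 1:
        return "fixed"          # same symbol in every non-empty column
    kinds = {kind for kind, _ in cells}
    if kinds == {"text"} or kinds == {"img"}:
        return "basic-typed"    # same kind of leaf, varying values
    return "unaligned"          # mixed nodes: the row still needs alignment


label_row = [("text", "Price:"), ("text", "Price:"), None]
price_row = [("text", "$10"), ("text", "$12"), ("text", "$9")]
mixed_row = [("text", "New"), ("img", "cover.png")]

print(classify_row(label_row), classify_row(price_row), classify_row(mixed_row))
```

Rows labeled `"fixed"` would contribute template nodes, and rows labeled `"basic-typed"` mark positions where data values are extracted.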

Algorithms:

Multiple trees merging algorithm