Web Warehousing Report Transcript
• How persistent is information on the web?
• How do web characteristics affect the design of Web Warehouses?
I believe that the task of modelling the web must be part of the process of web data integration, because accurate models are crucial in making important design decisions at an early WWh development stage. Web models also enable the tuning of a WWh to reflect the evolution of the web.
The methodology used in this research was mainly experimental. I derived a model of a portion of the web and, based on it, I developed Webhouse, a WWh for investigating the influence of web characteristics on the design of Web Warehouses.
This development was performed in collaboration with other members of my research group. Figure 1.2 presents an overview of the components of Webhouse. Each one addresses one stage of the integration process: modelling, extraction, transformation and loading. Although the integration process is decomposed into several steps, they are not independent of each other.
• Viúva Negra (VN): extracts information from the web by iteratively following the linked URLs embedded in web pages. Systems of this kind are broadly known as crawlers;
• WebCat: transforms web documents into a uniform data model (Martins & Silva, 2005b). This component was designed and developed by Bruno Martins;
• Versus: loads and stores web data;
• Webstats: models the web, generating statistics on the documents and corresponding meta-data stored in Versus.
The influence of web characteristics was studied during the design of each of these components. Extraction is the most sensitive stage of the integration process, because the software component interacts directly with the web and must address unpredictable situations. This thesis focuses mainly on the aspects of extracting information from the web and loading it into the WWh. The transformation of web data is not thoroughly discussed in this work. The efficiency of Webhouse as a complete system was validated through its application in several real usage scenarios.
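To make the division of labour between the four stages more concrete, the sketch below shows a toy version of such a pipeline in Python. It is a minimal illustration only: the function names, the record layout and the statistics are my own assumptions and do not reproduce the actual interfaces of VN, WebCat, Versus or Webstats.

import hashlib
import urllib.parse
import urllib.request
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found in an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urllib.parse.urljoin(self.base_url, value))


def crawl(seeds, max_contents=100):
    """Extraction stage (crawler-like): iteratively follow linked URLs from seed pages."""
    frontier, seen, contents = list(seeds), set(seeds), {}
    while frontier and len(contents) < max_contents:
        url = frontier.pop(0)
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                body = response.read()
        except Exception:
            continue  # invalid URL or unreachable server: skip it
        contents[url] = body
        parser = LinkExtractor(url)
        parser.feed(body.decode("utf-8", errors="replace"))
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return contents


def transform(contents):
    """Transformation stage: map raw downloads into a uniform record per content."""
    return [
        {"url": url, "digest": hashlib.sha1(body).hexdigest(), "size": len(body)}
        for url, body in contents.items()
    ]


def load(records, store):
    """Loading stage: store the uniform records, keyed by URL."""
    for record in records:
        store[record["url"]] = record


def model(store):
    """Modelling stage: derive simple statistics over the stored meta-data."""
    sizes = [record["size"] for record in store.values()]
    return {"contents": len(sizes), "total_bytes": sum(sizes)}


if __name__ == "__main__":
    store = {}
    crawled = crawl(["http://www.example.org/"])  # illustrative seed URL
    load(transform(crawled), store)
    print(model(store))

As in Webhouse, each stage consumes the output of the previous one, which is why the stages cannot be designed in isolation even though they are implemented as separate components.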
This research was validated by applying the Engineering Method (Zelkowitz & Wallace, 1998). Several versions of Webhouse were iteratively developed and tested until the design could not be significantly improved. The Portuguese Web was chosen as a case study to analyze the impact of web characteristics on the design of a WWh. Models of the web were extracted through the analysis of the information integrated in the WWh. In turn, a WWh requires models of the web in order to be designed. This circular dependency allowed the performance of each version of the WWh to be evaluated and gradually improved. So, although this thesis presents a sequential structure, the actual research was conducted as an iterative process.
1.2 Contributions
Designing Web Warehouses is complex and requires combining knowledge from different domains. This thesis provides contributions in multiple aspects of web data integration research:
• Web Characterization: concerns the monitoring and modelling of the web;
• Web Crawling: investigates the automatic extraction of contents from the web;
• Web Warehousing: studies the integration of web data.
My specific contributions in each field are:
Web Characterization:
• A thorough characterization of the structural properties of the Portuguese Web (Gomes & Silva, 2005);
• New models for estimating URL and content persistence on the web;
• Despite the ephemeral nature of the web, there is persistent information, and this thesis presents a characterization of it (Gomes & Silva, 2006a);
• A detailed description of hazardous situations on the web that make it difficult to automate the processing of web data.
Web Crawling:
• A novel architecture for a scalable, robust and distributed crawler (Gomes & Silva, 2006b);
• An analysis of techniques to partition the URL space among the processes of a distributed crawler;
• A study of bandwidth and storage saving techniques that avoid the download of duplicates and invalid URLs.
Web Warehousing:
• A new architecture for a WWh that addresses all the stages of web data integration, from its extraction from the web to its processing by mining applications;
• An analysis of the impact of web characteristics on the design and performance of a Web Warehouse;
• An algorithm that eliminates duplicates at the storage level in a distributed system (Gomes et al., 2006b).
Chapter 2 Web characterization
The design of efficient Web Warehouses requires combining knowledge from Web Characterization and Crawling. Web Characterization concerns the analysis of data samples to model characteristics of the web.
Crawling studies the automatic harvesting of web data. Crawlers are frequently used to gather samples of web data in order to characterize it. Web warehouses are commonly populated with crawled data. Research in crawling contributes to optimizing the extraction stage of the web data integration process.
2.1 Web characterization
A characterization of the web is of great importance. It reflects technological and sociological aspects and enables the study of the evolution of the web. An accurate characterization of the web improves the design and performance of applications that use it as a source of information (Cho & Garcia-Molina, 2000a). This section introduces the terminology adopted to clarify web characterization concepts. It discusses sampling methodologies and the identification of contents belonging to web communities. Finally, it presents previous work on the characterization of the structural properties and information persistence on the web.
2.2 Terminology
As the web evolves, new concepts emerge and existing terms gain new meanings. Studies in web characterization are meant to be used as historical documents that enable the analysis of the evolution of the web. However, there is no standard terminology, and the current meaning of the terms may become obscure in the future. Between 1997 and 1999, the World-Wide Web Consortium (W3C) promoted the Web Characterization Activity with the purpose of defining and implementing mechanisms to support web characterization initiatives (W3C, 1999a).
The scope of this activity was to characterize the web as a general distributed system, not focusing on specific users or sites. In 1999, the W3C released a working draft defining a web characterization terminology (W3C, 1999b). The definitions used in this thesis were derived from that draft:
Content: file resulting from a successful HTTP download;
Media type: identification of the format of a content through a Multipurpose Internet Mail Extension (MIME) type (Freed & Borenstein, 1996a);
Meta-data: information that describes the content. Meta-data can be generated during the download of a content (e.g. time spent to be downloaded), gathered from HTTP header fields (e.g. date of last modification) or extracted from a content (e.g. HTML meta-tags);
Page: content with the media type text/html (Connolly & Masinter, 2000);
Home page: content identified by a URL where the file path component is empty or a '/' only;
Link: hypertextual reference from one content to another;
Site: collection of contents referenced by URLs that share the same host name (Fielding et al., 1999);
Invalid URL: a URL that references a content that cannot be downloaded;
Web server: a machine connected to the Internet that provides access to contents through the HTTP protocol;
Duplicates: a set of contents that are bytewise equal;
Partial duplicates: a set of contents that replicate a part of another content;
Duplicate hosts (duphosts): sites with different names that simultaneously serve the same content (Henzinger, 2003);
Subsite: cluster of contents within a site, maintained by a different publisher than that of the parent site;
Virtual hosts: sites that have different names but are hosted on the same IP address and web server;
Publisher or author: entity responsible for publishing information on the web.
Some of the definitions originally proposed in the draft are controversial and had to be adapted to become more explicit.
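To illustrate how some of these definitions can be operationalized, the sketch below shows one possible way to classify a URL as a home page and to group downloaded contents into duplicate sets. It is an illustration of the definitions above, not code from Webhouse; the helper names and the sample data are hypothetical.

import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def is_home_page(url):
    """A content is a home page when the path component of its URL is empty or '/'."""
    return urlsplit(url).path in ("", "/")


def site_of(url):
    """The site of a URL is given by its host name."""
    return urlsplit(url).hostname


def duplicate_sets(contents):
    """Group contents that are bytewise equal (duplicates) by hashing their bytes.

    `contents` maps each URL to its downloaded bytes; the result maps a digest to
    the URLs whose contents are identical. Equal digests identify candidate
    duplicates; a final byte-by-byte comparison can confirm exact equality.
    """
    groups = defaultdict(list)
    for url, body in contents.items():
        groups[hashlib.sha1(body).hexdigest()].append(url)
    return {digest: urls for digest, urls in groups.items() if len(urls) > 1}


# The two home pages below are served under different host names but with the
# same bytes, so they are reported as duplicates (and hint at duphosts).
sample = {
    "http://www.example.org/": b"<html>hello</html>",
    "http://example.org/": b"<html>hello</html>",
    "http://www.example.org/about.html": b"<html>about</html>",
}
print(is_home_page("http://www.example.org/"))  # True
print(duplicate_sets(sample))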
The W3C draft defined that a page was a collection of information, consisting of one or more web resources, intended to be rendered simultaneously, and identified by a single URL. According to this definition, it is unclear which contents should be considered part of a page. For instance, consider an HTML content and its embedded images.
This information is meant to be rendered simultaneously, but the images are referenced by URLs different from the URL of the HTML content. Researchers commonly describe their experimental data sets by providing the number of pages (Cho & Garcia-Molina, 2000a; Fetterly et al., 2003; Lavoie et al., 1997). According to the W3C definition, a data set containing one million pages should include embedded images. However, most researchers considered that a page was a single HTML document. A set of bytewise equal contents are duplicates.
However, there are also similar contents that replicate a part of another content (partial duplicates). Defining a criterion that identifies contents as being similar enough to be considered the same is highly subjective. If multiple contents differ only in the value of a visit counter that changes on every download, they could reasonably be considered the same. However, when the difference between them is as small as a digit in the date of a historical event, this small difference could be very significant.
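One common way to quantify how similar two contents are is to compare their sets of word shingles using Jaccard similarity. The sketch below uses this standard technique purely as an illustration of the point above; it is not the measure adopted in this thesis, and the example texts are invented.

def shingles(text, n=3):
    """Return the set of overlapping n-word sequences (shingles) in a text."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def jaccard_similarity(a, b, n=3):
    """Jaccard similarity between the shingle sets of two contents (1.0 = identical)."""
    sa, sb = shingles(a, n), shingles(b, n)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


# A visit-counter change versus a change in the date of a historical event:
v1 = "Welcome to our site. You are visitor number 41. The battle took place in 1385."
v2 = "Welcome to our site. You are visitor number 42. The battle took place in 1385."
v3 = "Welcome to our site. You are visitor number 41. The battle took place in 1485."
# The measure rates the date change (v3) at least as similar to v1 as the
# counter change (v2), even though only the date change alters the meaning.
print(round(jaccard_similarity(v1, v2), 2))
print(round(jaccard_similarity(v1, v3), 2))

Whatever measure is used, deciding where to set the "same content" threshold remains a judgement about meaning rather than about bytes.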