Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗
- Author: Mehmed Kantardzic
Book online «Data Mining by Mehmed Kantardzic (inspirational novels TXT) 📗». Author Mehmed Kantardzic
In the weight-propagation phase, we extract good hubs and authorities from the base set V by giving a concrete numeric interpretation to all of them. We associate a nonnegative authority weight ap and a nonnegative hub weight hp with each page p ∈ V. We are interested only in the relative values of these weights; therefore, normalization is applied so that their total sum remains bounded. Since we do not impose any prior estimates, we set all a and h values to a uniform constant initially. The final weights are unaffected by this initialization.
We now update the authority and hub weights as follows. If a page is pointed to by many good hubs, we would like to increase its authority weight. Thus, we update the value of ap for the page p to be the sum of hq over all pages q that link to p:
where the notation q→p indicates that page q links to page p. In a strictly dual fashion, if a page points to many good authorities, we increase its hub weight
There is a more compact way to write these updates. Let us number the pages {1,2, … , n}, and define their adjacency matrix A to be n × n matrix whose (i, j)th element is equal to 1 if page i links to page j, and 0 otherwise. All pages at the beginning of the computation are both hubs and authorities, and, therefore, we can represent them as vectors
Our update rules for authorities and hubs can be written as
or, substituting one into another relation,
These are relations for iterative computation of vectors a and h. Linear algebra tells us that this sequence of iterations, when normalized, converges to the principal eigenvector of ATA. This says that the hub and authority weights we compute are truly an intrinsic feature of the linked pages collected, not an artifact of our choice of initial weights. Intuitively, the pages with large weights represent a very dense pattern of linkage, from pages of large hub weights to pages of large authority weights. Finally, HITS produces a short list consisting of the pages with the largest hub weights and the pages with the largest authority weights for the given search topic. Several extensions and improvements of the HITS algorithm are available in the literature. Here we will illustrate the basic steps of the algorithm using a simple example.
Suppose that a search engine has selected six relevant documents based on our query, and we want to select the most important authority and hub in the available set. The selected documents are linked into a directed subgraph, and the structure is given in Figure 11.2a, while corresponding adjacency matrix A and initial weight vectors a and h are given in Figure 11.2b.
Figure 11.2. Initialization of the HITS algorithm. (a) Subgraph of the linked pages; (b) adjacency matrix A and weight vectors for the given graph.
The first iteration of the HITS algorithm will give the changes in the a and h vectors:
Even this single iteration of the HITS algorithm shows that, in the given set of documents, document-5 has the most authority and document-1 is the best hub. Additional iterations will correct the weight factors for both vectors, but the obtained ranking of authorities and hubs will stay unchanged for this example.
The continuous growth in the size and use of the Internet creates difficulties in the search for information. Resource discovery is frustrating and inefficient when simple keyword searches can convey hundreds of thousands of documents as results. Many of them are irrelevant pages, some of them may have been moved, and some abandoned. While the first Web-mining algorithm HITS is primarily based on static information describing the Web-site structure, the second one, LOGSOM, uses dynamic information about a user’s behavior. LOGSOM is a sophisticated method that organizes the layout of the information in a user-interpretable graphic form. The LOGSOM system uses self-organizing maps (SOM) to organize Web pages into a two-dimensional (2-D) table, according to users’ navigation patterns. The system organizes Web pages according to the interest of Web users by keeping track of their navigation paths.
The SOM technique is used as the most appropriate technique for the problem of Web-page organization because of its strength not only in grouping data points into clusters, but also in graphically representing the relationship among clusters. The system starts with a Web-log file indicating the date, time, and address of the requested Web pages as well as the IP address of the user’s machine. The data are grouped into meaningful transactions or sessions, where a transaction is defined by a set of user-requested Web pages. We assume that there is a finite set of unique URLs:
and a finite set of m user transactions:
Transactions are represented as a vector with binary values ui:
where
Preprocessed log files can be represented as a binary matrix. One example is given in Table 11.1.
TABLE 11.1. Transactions Described by a Set of URLs
Since the dimensions of a table (n × m) for real-world applications would be very large, especially as input data to SOM, a reduction is necessary. By using the K-means clustering algorithm, it is possible to cluster transactions into prespecified number k (k m) of transaction groups. An example of a transformed table with a new, reduced data set is represented in Table 11.2, where the elements in the rows represent the total number of times a group accessed a particular URL (the form of the table and values are only one illustration, and
Comments (0)