A structural, content‐similarity measure for detecting spam documents on the web

Maria Soledad Pera (Computer Science Department, Brigham Young University, Provo, Utah, USA)

Yiu‐Kai Ng (Computer Science Department, Brigham Young University, Provo, Utah, USA)

International Journal of Web Information Systems

ISSN: 1744-0084

Article publication date: 20 November 2009

Abstract

Purpose

The web provides its users with abundant information. Unfortunately, when a web search is performed, both users and search engines must deal with an annoying problem: the presence of spam documents that are ranked among legitimate ones. The mixed results degrade the performance of search engines and frustrate users, who must filter out useless information. To improve the quality of web searches, the number of spam documents on the web must be reduced, if they cannot be eradicated entirely. This paper aims to present a novel approach for identifying spam web documents, which have mismatched titles and bodies and/or a low percentage of hidden content in their markup data structure.

Design/methodology/approach

The paper shows that it is possible to determine with high accuracy whether a web document D is spam by: considering the degree of similarity among the words in the title and body of D, computed by using their word‐correlation factors; using the percentage of hidden content in the markup data structure within D; and/or considering the bigram or trigram phrase‐similarity values of D.
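The title and body comparison described above can be illustrated with a minimal sketch. It is not the authors' implementation: the correlation table `TOY_CORRELATION`, the helper names `correlation` and `title_body_similarity`, and the choice to average each title word's best match against the body are all assumptions made for illustration. The paper's pre-computed word‐correlation factors and its exact scoring formula are not given in this abstract.

```python
# Hypothetical, hand-made correlation factors in [0, 1]. These values are
# illustrative only and are NOT the pre-computed factors used in the paper.
TOY_CORRELATION = {
    frozenset(["cheap", "discount"]): 0.8,
    frozenset(["flight", "airline"]): 0.7,
    frozenset(["flight", "ticket"]): 0.6,
}


def correlation(w1, w2):
    """Look up a symmetric correlation factor; identical words score 1.0."""
    if w1 == w2:
        return 1.0
    return TOY_CORRELATION.get(frozenset([w1, w2]), 0.0)


def title_body_similarity(title_words, body_words):
    """Assumed aggregation: average, over the title words, of the highest
    correlation each title word has with any body word."""
    if not title_words or not body_words:
        return 0.0
    best = [max(correlation(t, b) for b in body_words) for t in title_words]
    return sum(best) / len(best)


if __name__ == "__main__":
    title = ["cheap", "flight"]
    related_body = ["discount", "airline", "ticket"]
    unrelated_body = ["casino", "poker"]
    print(title_body_similarity(title, related_body))
    print(title_body_similarity(title, unrelated_body))
```

Under these assumptions, a document whose body shares no correlated vocabulary with its title scores near zero, which is the kind of title/body mismatch the approach treats as a spam signal. A threshold on this score would have to be chosen empirically.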

Findings

By considering the content and markup of web documents, this paper develops a spam‐detection tool that is reliable, since it accurately classifies 84.5 percent of spam/legitimate web documents, and computationally inexpensive, since the word‐correlation factors used for content analysis are pre‐computed.

Research limitations/implications

Since the bigram‐correlation values employed in the spam‐detection approach are computed by using the unigram‐correlation factors, this computation adds time to the spam‐detection process and could generate a higher number of misclassified spam web documents.

Originality/value

The paper verifies that the spam‐detection approach outperforms existing anti‐spam methods by at least 3 percent in terms of F‐measure.
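For readers unfamiliar with the metric, F‐measure is conventionally the harmonic mean of precision and recall. The sketch below computes it from confusion counts. The function name `f_measure` and the counts in the usage example are hypothetical and are not results from the paper.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall from confusion counts.

    tp: spam documents correctly flagged as spam
    fp: legitimate documents wrongly flagged as spam
    fn: spam documents that were missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the function.
    print(f_measure(tp=80, fp=20, fn=20))
```

Because the harmonic mean penalizes an imbalance between precision and recall, a detector cannot score well simply by flagging everything, or almost nothing, as spam.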

Keywords

Citation

Soledad Pera, M. and Ng, Y. (2009), "A structural, content‐similarity measure for detecting spam documents on the web", International Journal of Web Information Systems, Vol. 5 No. 4, pp. 431-464. https://doi.org/10.1108/17440080911006207

Publisher

Emerald Group Publishing Limited
