Combating Comment Spam with Machine Learning Approaches
Abstract
Comment sections allow visitors to websites such as YouTube and Amazon to interact with and contribute to the posted content. Because comments become part of the website content, are read by many visitors, and are usually unvetted, they are attractive to spammers for advertising, spreading malware, phishing, or promoting political or religious views. Given the large volume of comment spam, manual filtering and vetting is impractical, and automatic detection techniques have therefore become the de facto means of fighting spam content. In this paper, we propose and develop a comment spam detection mechanism that can be deployed as a browser plugin, which inspects the Document Object Model (DOM) of the web page in question and removes comments with spam content. We examine most of the detection features proposed in the literature and propose new features to build a comment spam classifier. To test the accuracy of our classifier, we manually label a new corpus of blog comments. We encourage other researchers to build upon our work, and we hope that our corpus will benefit the research community in this area.
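As a rough illustration of the pipeline described above (extract features from each comment, classify it, and drop the comments flagged as spam), the following minimal Python sketch may help. It is not the classifier or the feature set proposed in this paper: the three features, the function names, and the hand-written rule in `toy_is_spam` are all assumptions made for the example, and a real deployment would read the comments from the page DOM and use a trained model in place of the rule.

```python
import re
from typing import Callable, Dict, List

# Matches http(s) URLs and bare "www." links; an illustrative pattern only.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)


def extract_features(comment: str) -> Dict[str, float]:
    """Compute a few illustrative features (NOT the paper's feature set)."""
    letters = [c for c in comment if c.isalpha()]
    upper = sum(1 for c in letters if c.isupper())
    return {
        "num_links": float(len(URL_PATTERN.findall(comment))),
        "num_words": float(len(comment.split())),
        "upper_ratio": upper / len(letters) if letters else 0.0,
    }


def toy_is_spam(features: Dict[str, float]) -> bool:
    """Hypothetical hand-written rule standing in for a trained classifier."""
    return features["num_links"] >= 2 or features["upper_ratio"] > 0.7


def filter_comments(
    comments: List[str],
    is_spam: Callable[[Dict[str, float]], bool] = toy_is_spam,
) -> List[str]:
    """Return only the comments that the classifier does not flag as spam."""
    return [c for c in comments if not is_spam(extract_features(c))]


if __name__ == "__main__":
    page_comments = [
        "Great article, thanks for sharing!",
        "BUY NOW http://a.example www.b.example",
    ]
    for kept in filter_comments(page_comments):
        print(kept)
```

Passing the classifier in as a callable keeps the filtering step independent of how the decision is made, so the placeholder rule could be swapped for a learned model without changing the rest of the sketch.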