Building a Better Spam Detector

Search Engine Optimization No Comments »

Posted by Nick Gerner

A couple of weeks ago AIRWeb held its 2008 conference. After seeing Dr. Garcia's post on the conference I was going to read the papers and provide a high-level overview of some of them. However, after I saw that they were holding a web spam competition, my interests headed in a different direction. At the risk of raising Dr. Garcia's ire (a mistake I've made in the past), I have, with tongue placed thoroughly in cheek, developed my own spam detection algorithm. And that algorithm has performed surprisingly well!

I turned my project into a tool you can use to check if a domain name looks spammy. I'm not making any guarantees with it though. It could be a nifty tool if you want to have a high-quality domain name. Or at least, one which is not obviously spammy.

You probably understand the basic idea of spam detection:

  • The engines (and surfers) don't like the spam pages. 
  • Enter the PhD-types with their fancy models, support from the engines with their massive data-centers, funding for advanced research, and a whole lot more smarts than I've got.

To measure these things it's interesting to look at the "true positive rate" and the "false positive rate".  The perfect algorithm has a true positive rate of 100% (or 1.0) (all spam is identified) and a false positive rate of 0% (no non-spam pages are marked as spam).  However, just as there's no such thing as both calorie-free and delicious chocolate (PLEASE tell me I'm wrong!), there is rarely a perfect algorithm.  So we are faced with trade-offs. 

On the one hand you could label everything as non-spam and you would never have a false positive.  This is the point on the graph where x=0 and y=0.  On the other hand, you could just label everything as spam and no SERPs would contain any spam pages.  Of course, there would be no pages in any SERP at all, and this whole industry would have to go to back-up plans.  I'd be a roadie for my wife's Rock Band tours. 
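The two rates, and the two lazy extremes just described, can be sketched in a few lines of Python. This is only an illustration: the function name `rates` and the five labels are made up, not taken from my data or the competition's.

```python
def rates(labels, predictions):
    """Return (true_positive_rate, false_positive_rate).

    labels and predictions are equal-length sequences of booleans,
    where True means "spam".
    """
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return tpr, fpr


# Made-up ground truth: two spam pages, three clean ones.
labels = [True, True, False, False, False]

nothing_is_spam = [False] * len(labels)
everything_is_spam = [True] * len(labels)

print(rates(labels, nothing_is_spam))     # (0.0, 0.0): no false positives, no spam caught
print(rates(labels, everything_is_spam))  # (1.0, 1.0): all spam caught, everything else too
```

Every useful classifier lives somewhere between those two corners.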

Here is a plot illustrating the trade-offs, as adapted by me from the AIRWeb 2008 conference, including my own fancy algorithm. "y=x" is the base-line random classifier (you better be smarter than this one!) 
see the actual results

Clearly, the graph shows that I totally rocked :) Ignore for a minute the line labeled "SEOmoz - fair"; we'll come back to this. As you can see, at a false positive rate of 10% (0.10), I was able to successfully label over 50% of the spam pages, outperforming the worst algorithm from the workshop (Skvortsov at ~39%), and performing nearly as well as the best (Geng, et al. at ~55%). My own algorithm, SpamDilettante® (patent pending!), developed in just two days with only the help of our very own totally amazing intern Danny Dover and secret dev weapon Ben Hendrickson, has outperformed some of the best and brightest researchers in the field of adversarial information retrieval.

Well, graphs lie.  And so do I.  Let me explain what's going on here.  First of all, my algorithm really does classify spam.  And I really did develop it in just two days without using a link-graph, extracting complex custom web features, or racking up many days or months of compute cluster time.  But there are some important caveats I'll get to, and these are illustrated by the much worse line called "fair" above.

I began with a blog post of Rand's that is not one of his most popular. That post, however, is actually filled with excellent content (see the graph above). Most of the spam signals it lists I couldn't actually compute very easily:

  • High ratio of ad blocks to content
  • Small amounts of unique content
  • Very few direct visits
  • Less likely to have links from trusted sources
  • Unlikely to register with Google/Yahoo!/MSN Local Services
  • Many levels of links away from highly trusted websites
  • Cloaking based on user-agent or IP address is common

The list goes on. These are not things that even some of the researchers backed by the engines could get a hold of. So I ignored these. After all, how important could it be to know whether there's traffic to a site or cloaking going on? The big guys don't care about that, right?

However, some of the things I could get just from the domain name:

  • Long domain names
  • .info, .cc, .us and other cheap, easy to grab TLDs
  • Use of common, high-commercial value spam keywords in the domain name
  • More likely to contain multiple hyphens in the domain name
  • Less likely to have .com or .org extensions
  • Almost never have .mil, .edu or .gov extensions
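The domain-name signals above can be turned into numbers with very little code. What follows is a sketch, not my actual script: the function name `domain_features`, the feature order, and especially the `SPAM_KEYWORDS` list are assumptions made up for illustration.

```python
# TLD groups taken from the list above.
CHEAP_TLDS = {"info", "cc", "us"}
COMMON_TLDS = {"com", "org"}
TRUSTED_TLDS = {"mil", "edu", "gov"}

# Hypothetical keyword list, purely illustrative.
SPAM_KEYWORDS = ("casino", "pills", "loans", "cheap")


def domain_features(domain):
    """Map a domain name to a list of numeric features."""
    domain = domain.lower().strip(".")
    # Everything before the last dot is the name, the rest is the TLD.
    name, _, tld = domain.rpartition(".")
    return [
        float(len(domain)),                              # long domain names
        float(domain.count("-")),                        # number of hyphens
        1.0 if tld in CHEAP_TLDS else 0.0,               # cheap TLD
        1.0 if tld in COMMON_TLDS else 0.0,              # .com or .org
        1.0 if tld in TRUSTED_TLDS else 0.0,             # .mil, .edu or .gov
        float(sum(kw in name for kw in SPAM_KEYWORDS)),  # spam keywords present
    ]


print(domain_features("cheap-pills-online.info"))
print(domain_features("seomoz.org"))
```

Each domain becomes one row of numbers, which is all a simple statistical model needs.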

So I figured a linear regression (a very simple statistical model) based on these factors would be a pretty neat brief project. So I grabbed Danny and got some of the other mozzers to put together a rather short list of somewhat random (but valid) domain names. Danny spent an afternoon browsing all kinds of filth that I'm sure his professors (and family) are pleased he's seeing during his employment at SEOmoz. In the end we had about 1/3 of our set labeled as spam, 1/3 labeled as non-spam, and 1/3 labeled as "unknown" (mostly non-English sites which probably have great content).

With the hard work done for me, I wrote a script to extract the above features (in just a few lines of Python code). I took the 2/3 of the data which was labeled as spam/non-spam and divided it into an 80% "training set" and 20% "test set". This is important because if you can see the labels for all your data you might as well just hard-code what's spam and what's not. And then you win (remember that "perfect" algorithm?). Anyway, I just did a linear regression on this 80% set and got my classifier.
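For the curious, here is a minimal sketch of an 80/20 split followed by a least-squares fit, using only the standard library. It is not the code I ran: the helper names (`split_train_test`, `solve`, `fit_linear`, `predict`) are hypothetical, and the tiny `ridge` term is an added assumption that keeps the system solvable when a feature never varies (say, no .gov domains in the sample).

```python
import random


def split_train_test(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear(features, targets, ridge=1e-6):
    """Least squares with an intercept: solve (X^T X + ridge*I) w = X^T y."""
    xs = [[1.0] + list(f) for f in features]  # prepend the intercept column
    n = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) + (ridge if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    xty = [sum(x[i] * y for x, y in zip(xs, targets)) for i in range(n)]
    return solve(xtx, xty)


def predict(weights, feature_row):
    """Score one feature row; weights[0] is the intercept."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], feature_row))


# Made-up one-feature data where the target is exactly 0.25 * x.
weights = fit_linear([[0.0], [1.0], [2.0], [3.0]], [0.0, 0.25, 0.5, 0.75])
print(weights)
print(predict(weights, [2.0]))
```

In practice each row would pair a domain's feature list with 1.0 for spam or 0.0 for non-spam, the fit would use only the training rows, and the test rows would stay untouched until scoring time.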

To get performance numbers I used my classifier on my reserved 20% test set. Basically it spewed a bunch of numbers like "0.87655" which you could think of as probabilities. To get the above curve, I tried a series of thresholds (e.g. IF prob > 0.7 THEN spam ELSE not spam). Each threshold trades false positives against false negatives, and plotting the results gives me the above curves.
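The threshold sweep can be sketched like this. Apart from 0.87655, the scores and all the labels below are made up, and `roc_points` is a hypothetical helper name; it also assumes the test set contains at least one spam and one non-spam example.

```python
def roc_points(scores, labels, thresholds):
    """Return one (false_positive_rate, true_positive_rate) pair per threshold.

    A page is flagged as spam when its score is strictly above the threshold.
    Assumes labels contains at least one True and at least one False.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        flagged = [s > t for s in scores]
        tp = sum(1 for f, y in zip(flagged, labels) if f and y)
        fp = sum(1 for f, y in zip(flagged, labels) if f and not y)
        points.append((fp / negatives, tp / positives))
    return points


# Classifier outputs for five test domains and their true labels (True = spam).
scores = [0.87655, 0.72, 0.55, 0.31, 0.12]
labels = [True, True, False, True, False]
thresholds = [0.9, 0.7, 0.5, 0.2, 0.0]

points = roc_points(scores, labels, thresholds)
for t, (fpr, tpr) in zip(thresholds, points):
    print(f"threshold {t:.1f}: false positive rate {fpr:.2f}, true positive rate {tpr:.2f}")
```

Plotting those pairs with the false positive rate on the x-axis gives a curve of the same kind as the one in the graph.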

And that's the story of how I beat the academicians.

O.K., back to reality for a moment; on to the caveats. 

  • As I pointed out in the introduction to the competition, my data set is a much simpler classification problem (complete label coverage and almost no class imbalance).
  • As Rebecca says, "it's just one of those 'common seo knowledge' things--.info, .biz, a lot of .net [are spam]," and my dataset includes a lot of these "easy targets". The competition is all .uk (and mostly co.uk).
  • My dataset is awfully small and likely has all kinds of sampling problems.  My results probably do not generalize.

But these are just guesses. Can we support the hypothesis that I do not, in fact, rock? Well, that's what the "fair" line above is all about. I actually downloaded the data set from the challenge (just the URLs and labels) and ran my classifier on it. Suddenly my competitive system doesn't look so good. I had a professor who had a rubric for these things, and according to his rubric I'm just one notch above "terrible", stuck squarely at "poor". The next closest guy in the competition is "good" (skipping "mediocre", of course) and the best system is "excellent".

As E. Garcia said in his original post which started me on this, "it is time to revisit [the] drawing board."

More: continued here

Microsoft Yanks Its Offer For Yahoo

Microsoft has decided to withdraw its offer for Yahoo and not pursue a hostile takeover bid, according to a letter from Microsoft CEO Steve Ballmer, just released: "Clearly a deal is not to be." This marks an amazing turn in the MicroHoo drama. When the acquisition bid was first announced in February, it was discussed by Microsoft's executives as though concluding the deal was only a matter of formality. (Postscript: See also Yahoo's response and related discussion to both on Techmeme here and here).

It's not entirely clear if this marks the end of the whole episode but it certainly appears to (at least on the surface.) A skeptic might see this as a power play to deflate Yahoo's stock and then come back with a tender offer or another try to buy the company in a weakened state.

SenseBot Releases Firefox Plugin for Google Result Summaries

SenseBot is a search engine that produces summaries of the results that appear when you perform a search. It's a tool that is very useful when you want to make sense of a topic and there’s a deluge of content from the web to choose from. Now, SenseBot has provided a Firefox plug-in that integrates [...]

Microsoft Board Meets, Indicates Higher Bid for Yahoo

Late yesterday afternoon, the Wall Street Journal got word of a Microsoft board meeting. And ever since they reported the news, the speculation and rumor mills have been working overtime.

Henry Blodget over at the Silicon Valley Insider got a glimpse of a WSJ story suggesting that MSFT would raise the bid to $32-$33 a share. The story is no longer to be found on the interwebs, which is likely Microsoft's strategy, according to Blodget. The apparent strategy is to get comment out of Yahoo CEO Jerry Yang on whether or not the upped offer would be accepted.

Earlier reports have both shareholders and Yahoo execs saying "I see your $32-33 and raise you a $35-37." This is not likely to please the bigwigs from Redmond.

But they may have forced their own hand in the matter when they didn't offer a higher bid sooner. It's the Yahoo-Google deal that likely tipped the scales in favor of Yahoo in the negotiating process.

Google’s CEO Discusses DoubleClick, MicroHoo, Don’t Be Evil, Googlephobia, And Mobile Search

CNBC news anchor Maria Bartiromo interviewed Google CEO Eric Schmidt on Tuesday and discussed a wide range of topics, from MicroHoo to advertising on YouTube, mobile, and Googlephobia. There are no real surprises and nothing especially new (other than the statement about new ad products for YouTube this year). However, if you have time it makes for interesting reading.

Below are my edited excerpts of the lengthy transcript of the CNBC interview, which you can find in full, together with a short, edited clip of the interview, here.

Pimp My Site: Tweaking High Traffic Landing Pages

When you have a page that is bringing in a lot of eyeballs, it may be tempting to just leave it alone. Cliches become mantras. "There's no need to stir the pot." "Let sleeping cats lie." "If it ain't broke, don't fix it."

The problem is that if your traffic isn't converting, then your landing page is, in fact, broken.

Thankfully, the Google Website Optimizer team is serving up tips on knowing which pages to tweak via a post on their blog.

First up are Landing Pages. Go to the "Content" section of Google Analytics and check out the "Top Landing Pages." The pages you need to focus on have high bounce rates and high entrance rates. The blog recounted a scenario Avinash Kaushik spoke about at SES. He said he was searching for faucets and the top sponsored result led him to a sinks page. Perhaps the site is experiencing large numbers of site visitors, but they're just throwing money away if they're not giving the people what they want.

Secondly, check out funnel pages. These are the pages a visitor passes through on the way to completing an action such as a purchase, registration or download. In Google Analytics, you can set up a funnel containing up to 10 pages leading to a goal. Then, you'll be able to view "funnel visualization" reports that can show you where your site visitors get stuck in the process.

Remember, it's the conversions that matter the most. Not the clicks or the referrals or the number of eyeballs. If your site visitors are not engaging in the actionable goal you've set for them, then it's time to tweak.

Related Reading:
Your Baby's Ugly - Why You Need Landing Page Optimization Now
PPC Triage Now! Emergency Action Steps for Dying AdWords

Google AdWords: TV Ads for Everyone!

Google TV Ads have been in private beta since last summer, but now they're groomed and prepped and available for U.S. customers. Why in the world should a search marketer such as yourself care about TELEVISION? I'm so glad you asked.

Offline advertising prompts online searches. Last summer, iProspect released study data suggesting that 37% of TV watchers are prompted to conduct a search based on a TV ad. I can attest to the validity of this statement as I've seen these results in the marketing analytics of a Fortune 500 company I previously worked with.

Because of this, integration is crucial. If your TV people aren't talking to your Web Marketing people, then you're not maximizing your marketing, plain and simple.

And then there's the future. Expect TV advertising to become more and more interactive, so that it's not just marketing campaigns that are integrated - but the actual ad is a mashup of TV + Web.

Whether you go with Google, engage in a local TV campaign or hire an ad agency, Google's announcement today is a great reminder to keep our marketing eyes on the big picture.

Copyright © 2006-2007 SEOsk.com | articliex - articles