[wp-hackers] New(?) anti-spam technique

Thu Oct 21 18:25:01 UTC 2004

First of all, sorry for the long post, especially if it turns out this
has all been refuted before.

I've seen an increase in comment spam over the last couple of days.
This particular species looks a lot like legitimate comments: it isn't
riddled with links, the comment text is not advertising, and it isn't
anonymous. Of course, the email address is fake and the spam comes
from dozens, probably hundreds, of different IP addresses. Flood
control was the only thing keeping it in check. The blacklist worked
for a little while, but then the spam would change and I'd have to
update the blacklist again.

The first thing I did was check my logs, and I noticed that the bot
was posting directly to wp-comments-post.php with no referrer. So this
took care of the problem immediately (but this isn't the new
technique!):

In wp-comments-post.php:

// Anti-spam
// Prevent posting directly to this page without having come from another page
// on this site.
$siteurl = strtolower(get_settings('siteurl'));
// HTTP_REFERER is not set at all when the client sends no referrer.
$referer = isset($_SERVER['HTTP_REFERER']) ? strtolower($_SERVER['HTTP_REFERER']) : '';
if ( strpos($referer, $siteurl) === false ) {
    // I chose this action because I don't want to give a spammer any clues.
    die();
}

This works FOR NOW but since HTTP_REFERER is ridiculously easy to
spoof it probably won't work for long. I'm a little surprised that
spam bots aren't sophisticated enough to do this already.

So, that took care of the immediate problem, but I wanted a more
long-term solution. Captchas came to mind immediately but I dislike
captchas because they require legitimate commenters to jump through
hoops. I remembered the discussion Kitty started last month about
putting a unique hash in the page that must be posted back when the
form is submitted. This is a good idea, and I implemented it, but it
is also easy for a bot programmer to circumvent since all of the
information you need to circumvent it is right there in the form. I
made it a little more complicated by requiring the client to compute
an md5 hash of the comment + post title and post that back as well
(Javascript can do this nicely). But, while that makes a spammer jump
through more hoops, once they've done that we're back where we started
again.
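
A minimal sketch of that token-plus-hash check, written for Node.js
rather than PHP. The names issueToken, checkSubmission and SECRET are
hypothetical, and the HMAC-based token is an assumption; the post does
not say how the unique hash was generated.

```javascript
const crypto = require('crypto');

// Hypothetical server-side secret; in practice it would live in configuration.
const SECRET = 'change-me';

function md5Hex(text) {
    return crypto.createHash('md5').update(text, 'utf8').digest('hex');
}

// Token embedded in the comment form as a hidden field (assumed scheme:
// an HMAC over the post id, so it cannot be guessed without the secret).
function issueToken(postId) {
    return crypto.createHmac('sha256', SECRET).update(String(postId)).digest('hex');
}

// Server-side check: the token must match, and the client must have posted
// back md5(comment + post title), which the form's Javascript would compute.
function checkSubmission(postId, postTitle, form) {
    return form.token === issueToken(postId)
        && form.hash === md5Hex(form.comment + postTitle);
}
```

A bot that reads the form and runs the same computation passes this
check, which is the weakness described above.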

Unsatisfied, I tried "hashcash" (http://www.hashcash.org/). Hashcash
is a clever technique that requires the client to expend a certain
amount of work computing a "stamp". It requires 2-3 seconds to compute
a hashcash stamp, it can't be faked, and it takes a trivial amount of
time to verify. This delay is no problem for a reader submitting a
single comment but is a huge drain on a spammer who wants to post
hundreds of comments a second. And if you use date, post_id, server
address, and remote address as the stamp seed then the spammer has to
compute a new stamp for every post on every server every day. I
implemented this as well, in Javascript, but unfortunately Javascript
is too slow to compute stamps in a reasonable amount of time. I found
that a 16-bit collision takes much too long to compute in Javascript
but it only takes a fraction of a second in C which is probably more
like what spammers would use (even 12-bit collisions are barely
possible with Javascript). I chose Javascript because I wanted the
requirements burden on the client to be as low as possible (Java,
Flash, ActiveX are all unacceptable requirements for posting
comments). Here's something like the algorithm I implemented:

The form onsubmit action would call something like this:

// hex_md5() is not built in; it has to come from a separate Javascript md5 library.
function hashcash() {
    var stampseed1 = 'some combination of date + post id + server address + remote address';
    var stampseed2 = 0;
    var hash;
    while (true) {
        hash = hex_md5(stampseed1 + stampseed2);
        if (hash.substr(0, 4) == "0000") { // Look for a 16-bit collision
            document.getElementById('commentform').stamp.value = stampseed1 + stampseed2;
            break;
        }
        stampseed2++;
    }
}

And then wp-comments-post.php would just need to check that
substr(md5($_POST['stamp']), 0, 4) == "0000" to verify the stamp.
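
The mint and verify steps can be sketched together as follows. This is
an illustration for Node.js under some assumptions, not the code from
the post: mintStamp, verifyStamp and hasLeadingZeroBits are
hypothetical names, the ':' separator between seed and counter is
added here, and the number of bits is a parameter so that 12-bit and
16-bit collisions can be compared. The seed check in verifyStamp is
also an addition; without it a single stamp could be reused anywhere.

```javascript
const crypto = require('crypto');

function md5Hex(text) {
    return crypto.createHash('md5').update(text, 'utf8').digest('hex');
}

// True if the hex digest starts with at least `bits` zero bits.
function hasLeadingZeroBits(hexDigest, bits) {
    const fullNibbles = Math.floor(bits / 4);
    const remainder = bits % 4;
    if (!/^0*$/.test(hexDigest.slice(0, fullNibbles))) return false;
    if (remainder === 0) return true;
    const nextNibble = parseInt(hexDigest[fullNibbles], 16);
    return nextNibble < (1 << (4 - remainder));
}

// Client side: try counters until the digest has enough leading zero bits.
function mintStamp(seed, bits) {
    for (let counter = 0; ; counter++) {
        const stamp = seed + ':' + counter;
        if (hasLeadingZeroBits(md5Hex(stamp), bits)) return stamp;
    }
}

// Server side: one hash to verify, plus a check that the stamp was minted
// for the seed this server expects.
function verifyStamp(stamp, expectedSeed, bits) {
    return stamp.startsWith(expectedSeed + ':')
        && hasLeadingZeroBits(md5Hex(stamp), bits);
}
```

Minting takes about 2^bits hash attempts on average while verifying
takes one, which is the asymmetry hashcash relies on.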

I still like this idea the best, even though I abandoned it. If
Javascript could compute these faster then it would be a viable
solution.

Then it occurred to me that one way to tell humans from bots is that
humans are much slower. This is what I came up with:

- When a client hits the comment form, start a timer. Start a unique
timer for each client (by IP address or some other method).
- When the client posts the form, check the timer. If the time elapsed
is less than 3 seconds or more than 5 minutes (the "window"), moderate
the comment.
- Delete the timer.

Here's how I implemented it:

[1] At the top of wp-comments.php:
    $_SESSION['CommentTimer_'.$_SERVER['REMOTE_ADDR']] = time();

[2] In wp-comments-post.php after check_comment():
    $timer_key = 'CommentTimer_'.$_SERVER['REMOTE_ADDR'];
    // -1 means no timer was ever started for this client.
    $elapsed = isset($_SESSION[$timer_key]) ? time() - $_SESSION[$timer_key] : -1;
    // Moderate if there is no timer or the post fell outside the window.
    if ($elapsed < 3 || $elapsed > 300) {
        $approved = 0;
        $comment = "** COMMENT TIMER SPAM **\n".$comment;
    }
    unset($_SESSION[$timer_key]);

[3] In wp-config.php:
    session_start();

This is simple, completely server-side, and transparent to readers. It
has zero client-side requirements other than that you take a couple of
seconds to enter your comment. It busts spambots that spoof IPs on
every request, that don't use the form, and that don't wait. I used
$_SESSION for this quick implementation but there is no reason why you
couldn't store the timers in a database table. You'd just need to
clean out old timers occasionally (perhaps automatically each time
the admin went to comment moderation). Also, these could probably be
wrapped in existing function calls so templates wouldn't have to be
modified (for example, embedding [2] in check_comment()).
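
The timer logic, including the cleanup of old timers, can be sketched
independently of $_SESSION. This is a rough illustration in
Javascript, not a WordPress patch: the CommentTimers class and its
method names are hypothetical, an in-memory Map stands in for the
session or database table, and the clock is injectable so the window
can be exercised without waiting.

```javascript
const MIN_SECONDS = 3;    // faster than this looks like a bot
const MAX_SECONDS = 300;  // the 5-minute end of the window

class CommentTimers {
    constructor(now = () => Date.now() / 1000) {
        this.now = now;            // injectable clock, in seconds
        this.started = new Map();  // client key -> time the form was served
    }

    // Call when the comment form is served.
    start(clientKey) {
        this.started.set(clientKey, this.now());
    }

    // Call when the form is posted. Returns true if the comment should be
    // held for moderation. The timer is deleted either way.
    shouldModerate(clientKey) {
        const startedAt = this.started.get(clientKey);
        this.started.delete(clientKey);
        if (startedAt === undefined) return true; // never loaded the form
        const elapsed = this.now() - startedAt;
        return elapsed < MIN_SECONDS || elapsed > MAX_SECONDS;
    }

    // Drop timers that can no longer produce an in-window post.
    purgeExpired() {
        const cutoff = this.now() - MAX_SECONDS;
        for (const [key, startedAt] of this.started) {
            if (startedAt < cutoff) this.started.delete(key);
        }
    }
}
```

The client key here could be the remote address, as in the PHP above,
or any other identifier.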

So far it has stopped my spam completely. The main drawback that comes
to mind is that it requires the client to have a stable IP address
during a session, so it might be a problem for people with certain
ISPs (AOL?). This can be worked around by using cookie-based
identification, but then it doesn't work for people who won't accept
cookies. This method can also be partially defeated. Defeating it
requires a bot to:

Hit/Wait/Post cycle:

- Hit hundreds of comment forms on different sites.
- Hit them all again within the window to post the comment.
- Repeat for each comment.

I say partially because, while the above spamming technique will post
the spam, it should drastically slow down the spammer and reduce the
number of sites he can cover, which in turn reduces the spread of
spam. If this is widely implemented, you will be less likely to even
encounter spam because fewer spammers will be able to get as far as
your site in their lists.

Assume:
    - The spambot has 100,000 wp-comments-post.php entries in its index
    - Each of those blogs implements a 10-second flood control
    - The spammer can currently post to them all in 10 seconds

A naive spam bot would Hit/Wait/Post each entry in turn, reducing its
output from 100,000 posts/10 seconds to 1 post/3 seconds. Even if he
has a network of zombie machines doing his bidding, each zombie doing
naive Hit/Wait/Post would be reduced to 1 post/3 seconds. 100 zombies
in aggregate would be able to do 100 posts/3 seconds.

A more sophisticated circumvention would be to hit them all once (10
seconds), then hit them all again (10 seconds), reducing the output to
100,000 spams every 20 seconds. But using this technique, if a
spambot has millions of posts in its index, then by the time it gets
around to the second hit it might have fallen outside of the
window.

Another technique would be to start multiple threads on the same
spamming machine that do Hit/Wait/Post. A machine that could start
100,000 threads in 10 seconds would be reduced to 100,000 posts/13
seconds. But I don't think 100,000 threads on a single machine in 10
seconds is feasible (is it?). If he could start 100 threads then he
would be reduced to 100 posts/3 seconds. If a zombie network using
this technique split up the 100,000 posts among 100 machines and each
machine could do 100,000 posts/10 seconds then they could do all
100,000 posts in just over 3 seconds.
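
The arithmetic above can be reproduced with a few lines of Javascript.
This is only a back-of-the-envelope model under the stated
assumptions (a raw speed of 100,000 requests per 10 seconds and a
window of 3 seconds to 5 minutes); the function names are made up for
this illustration.

```javascript
const WAIT = 3;            // minimum seconds between hitting the form and posting
const WINDOW = 300;        // maximum seconds before the timer expires
const RATE = 100000 / 10;  // assumed raw speed, in requests per second

// Naive Hit/Wait/Post, one target at a time: posts per second overall.
function naiveRate(machines) {
    return machines / WAIT;
}

// Hit every target, then hit every target again to post. Each target's
// second hit comes one full pass after its first. Returns total seconds,
// or null if that gap is longer than the window.
function twoPassSeconds(targets) {
    const onePass = targets / RATE;
    if (onePass > WINDOW) return null;
    return Math.max(onePass, WAIT) + onePass;
}

// Largest index a two-pass bot can cover before falling outside the window.
function maxTwoPassTargets() {
    return RATE * WINDOW;
}

// One thread per target, each doing Hit/Wait/Post: the last thread starts
// after targets / (RATE * machines) seconds and posts WAIT seconds later.
function threadedSeconds(targets, machines) {
    return targets / (RATE * machines) + WAIT;
}
```

With these numbers, threadedSeconds(100000, 1) gives 13 and
threadedSeconds(100000, 100) gives about 3.1, matching the figures
above; twoPassSeconds(100000) gives 20, and a two-pass bot at this
rate runs out of window once its index passes 3,000,000 targets.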

Someone check my math?  :-)

But is this the general case? This spammer appeared to be coming from
hundreds of different IP addresses, but I wonder if he really had
access to a zombie network or if he was just using IP spoofing from a
single machine.

What do you think?

--
John
http://flagrantdisregard.com/