Internet and e-mail policy and practice
including Notes on Internet E-mail



26 Dec 2011

Filtering spam at the transport level

An interesting new paper from the Naval Postgraduate School (paper here, conference slides here) describes what appears to be a new twist on spam filtering: looking at the characteristics of the TCP session through which the mail is delivered.

They observe that bots typically live on cable or DSL connections with slow congested upstreams. TCP sessions from bots turn out to be fairly easy to recognize by RTT, window, and retransmits, something that people have known at least since a paper at the 2008 CEAS conference on the topic.
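To make those three signals concrete, here is a rough sketch of how one might pull them out of a packet trace as seen from the mail server's side. This is my own illustration, not code from the paper: the Packet record, the flow_features function, and the field names are all hypothetical, and it ignores sequence number wraparound and window scaling.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Packet:
    """Hypothetical, simplified record of one captured TCP packet."""
    ts: float            # capture timestamp, in seconds
    from_client: bool    # True if sent by the connecting (possibly botted) host
    seq: int             # TCP sequence number
    payload_len: int     # bytes of data carried
    window: int          # advertised receive window (unscaled)
    syn: bool = False
    ack: bool = False


def flow_features(packets: Iterable[Packet]) -> dict:
    """Estimate RTT, smallest window, and retransmit count for one connection."""
    synack_ts: Optional[float] = None
    rtt: Optional[float] = None
    min_window: Optional[int] = None
    retransmits = 0
    next_seq: Optional[int] = None  # end of the highest client byte seen so far

    for p in packets:
        if not p.from_client:
            # Remember when the server sent its SYN-ACK.
            if p.syn and p.ack and synack_ts is None:
                synack_ts = p.ts
            continue

        if min_window is None or p.window < min_window:
            min_window = p.window

        # The first client ACK after our SYN-ACK gives a handshake RTT estimate.
        if rtt is None and synack_ts is not None and p.ack and not p.syn:
            rtt = p.ts - synack_ts

        if p.payload_len > 0:
            # A data segment starting below bytes already seen is counted as a
            # retransmit. This crude rule also counts reordered segments.
            if next_seq is not None and p.seq < next_seq:
                retransmits += 1
            else:
                next_seq = p.seq + p.payload_len

    return {
        "rtt_ms": None if rtt is None else rtt * 1000.0,
        "min_window": min_window,
        "retransmits": retransmits,
    }
```

A real analyzer would have to work from live captures and handle many connections at once, but the per-connection bookkeeping is about this simple.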

This paper tries to see whether it would be practical to use that info to manage spam in real time. They have a network analyzer called SpamFlow that figures out per-connection characteristics. Then as a proof of concept they wrote a SpamAssassin plugin to train on the data from SpamFlow and try to do filtering. They do some sort of hand-wavy load testing to see whether SpamFlow can keep up with a realistic mail load, and whether it trains fast enough to provide useful data in real time. They claim that their results show that it does both.
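To give a feel for what training on per-connection data could look like, here is a minimal sketch of an online classifier that updates as each labeled message arrives. I'm assuming a Gaussian naive Bayes model purely for illustration; I don't claim it's what SpamFlow or the plugin actually uses, and the class names and the feature order (RTT in ms, smallest window, retransmit count) are made up.

```python
import math


class RunningStat:
    """Welford's online mean and variance for a single feature."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self) -> float:
        # Floor the variance so a constant feature cannot divide by zero.
        return max(self.m2 / self.n, 1e-6) if self.n else 1.0


class FlowClassifier:
    """Online Gaussian naive Bayes over (rtt_ms, min_window, retransmits)."""

    LABELS = ("spam", "ham")
    NUM_FEATURES = 3

    def __init__(self) -> None:
        self.counts = {label: 0 for label in self.LABELS}
        self.stats = {
            label: [RunningStat() for _ in range(self.NUM_FEATURES)]
            for label in self.LABELS
        }

    def train(self, features, label: str) -> None:
        """Fold one labeled connection into the model; cost is O(features)."""
        self.counts[label] += 1
        for stat, x in zip(self.stats[label], features):
            stat.update(x)

    def _log_score(self, features, label: str) -> float:
        total = sum(self.counts.values())
        score = math.log(self.counts[label] / total)  # class prior
        for stat, x in zip(self.stats[label], features):
            var = stat.variance()
            score -= 0.5 * math.log(2 * math.pi * var)
            score -= (x - stat.mean) ** 2 / (2 * var)
        return score

    def spam_probability(self, features) -> float:
        """Return P(spam | features), or 0.5 until both classes have data."""
        if min(self.counts.values()) == 0:
            return 0.5
        diff = self._log_score(features, "ham") - self._log_score(features, "spam")
        if diff > 700:      # avoid overflow in exp()
            return 0.0
        if diff < -700:
            return 1.0
        return 1.0 / (1.0 + math.exp(diff))
```

Each update touches only a handful of counters, which is the sort of property you'd want if the model has to keep up with a live mail stream.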

It's not obvious how best to use this in combination with all of the other anti-spam tools we have, most notably blacklists like the CBL that very accurately identify IPs of botted hosts by looking at the characteristics of mail received at large spamtraps. One thing that occurs to me is that this sort of thing might be useful if mail moves to IPv6, since building v6 blacklists will be hard due to the size of the address space, while this lets you estimate the bottiness of each connection directly. Also, rather than accepting or rejecting mail, you might slow down mail reception from hosts that seem to be bots, both to give preference to non-bot senders, and because bots tend to be impatient, so if you slow down a dubious connection and it gives up, it was probably a bot. The Turntide appliance did something similar five years ago, although it used different heuristics for deciding what to slow down.
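The slow-down idea is easy to sketch. Assuming you already have some per-connection bot score between 0 and 1, you could delay each SMTP reply in proportion to how suspicious the connection looks. The function names, the threshold, and the 20 second ceiling below are all assumptions of mine, not anything from the paper or from Turntide.

```python
import asyncio


def reply_delay(bot_score: float, threshold: float = 0.5,
                max_delay: float = 20.0) -> float:
    """Seconds to wait before each SMTP reply; zero for clean-looking hosts."""
    score = min(max(bot_score, 0.0), 1.0)  # clamp to [0, 1]
    if score <= threshold:
        return 0.0
    # Scale linearly from 0 at the threshold up to max_delay at a score of 1.
    return max_delay * (score - threshold) / (1.0 - threshold)


async def send_reply(writer: asyncio.StreamWriter, line: str,
                     bot_score: float) -> None:
    """Send one SMTP reply line, stalling first if the peer looks like a bot."""
    await asyncio.sleep(reply_delay(bot_score))
    writer.write(line.encode("ascii") + b"\r\n")
    await writer.drain()
```

Since the delay is spent in asyncio.sleep, stalled connections cost the server very little, and a client that hangs up during the stall has more or less identified itself. The ceiling needs to stay well under the timeouts that legitimate SMTP clients use, or you'll drive away real mail too.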

This technique looks only at the characteristics of the TCP session, and not at the contents of the session, which means it also doesn't look at the contents of the messages. It might be useful in contexts where for legal or political reasons the spam filter isn't allowed to look at the messages, but users want spam filtering anyway. The authors point out that it is in principle applicable to any TCP transaction, so it might be useful against web queries from bots, too.

It's hardly a FUSSP, but it's an interesting paper.


posted at: 23:40 :: permanent link to this entry :: 3 comments

comments...


Don't some well-known spam filters (I'm not 100% sure whether this is publicly available information, so I decided not to mention their names) do something like this? I haven't had the time to read the paper, so there may be a subtle difference in the way it is implemented, but I'm pretty sure the idea isn't new.

As for web queries from bots, I would think it is significantly more difficult to distinguish a bot's request to a website from a request made by a human using a real browser to the same website, potentially from the same computer.

(by Martijn Grooten 27 Dec 2011 12:37)



The only other TCP-level filter I know is Turntide, and as I said, it uses different criteria.

Looking at the paper again, they referred not to web queries from bots, which I agree are too small to get good stats, but bots used as sleazy CDNs, which would probably work.

(by John L 27 Dec 2011 13:05)



Funny, this. About five years ago we did some in-depth work on using TCP-headers to identify operating systems and use machine-learning on that as an indicator of high likelihood of spam. Not surprisingly, this yielded very good results.

This paper looks like it is arriving at the same conclusion in a slightly more roundabout way (FIN, OOO packets, RSTs, TCP connection setup packets and similar are usually the indicators used to differentiate operating systems with TCP packet info). I am personally quite sceptical that RTTs, latency and other network characteristics are good indicators.

Nevertheless I am happy that someone is resurrecting this research topic.

(by Marc Sheldon 23 Jan 2012 04:10)



© 2005-2020 John R. Levine.
CAN SPAM address harvesting notice: the operator of this website will not give, sell, or otherwise transfer addresses maintained by this website to any other party for the purposes of initiating, or enabling others to initiate, electronic mail messages.