Spam

Anti-spam techniques. To prevent email spam (a.k.a. unsolicited bulk email), both end users and administrators of email systems use various anti-spam techniques.

Anti-spam techniques

Some of these techniques may be embedded in products, services and software to ease the burden on users and administrators. No technique is a complete solution to the spam problem, and each has trade-offs between incorrectly rejecting legitimate email (false positives) and not rejecting all spam (false negatives), and the associated costs in time and effort. Anti-spam techniques can be broken into four broad categories: those that require actions by individuals, those that can be automated by email administrators, those that can be automated by email senders, and those employed by researchers and law enforcement officials. Naive Bayes spam filtering. Naive Bayes classifiers work by correlating the use of tokens (typically words, or sometimes other things) with spam and non-spam e-mails and then using Bayesian inference to calculate a probability that an email is or is not spam.
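A minimal sketch of that token-counting idea in Python (the training counts, corpus sizes, and smoothing constant below are illustrative assumptions, not from the article):

```python
from collections import Counter

# Illustrative training data: token counts from labeled spam and non-spam (ham) mail.
spam_tokens = Counter({"viagra": 40, "offer": 25, "meeting": 1})
ham_tokens  = Counter({"meeting": 30, "lunch": 20, "offer": 5})
n_spam, n_ham = 100, 100   # number of training messages of each class

def token_spam_probability(token, k=1.0):
    """Estimate P(spam | token) from training counts, with add-k smoothing."""
    p_token_given_spam = (spam_tokens[token] + k) / (n_spam + 2 * k)
    p_token_given_ham  = (ham_tokens[token] + k) / (n_ham + 2 * k)
    return p_token_given_spam / (p_token_given_spam + p_token_given_ham)

def message_spam_probability(text):
    """Combine per-token probabilities under the naive independence assumption."""
    p_spam, p_ham = 1.0, 1.0
    for token in set(text.lower().split()):
        p = token_spam_probability(token)
        p_spam *= p
        p_ham *= (1.0 - p)
    return p_spam / (p_spam + p_ham)

print(message_spam_probability("limited offer viagra"))   # close to 1
print(message_spam_probability("lunch meeting today"))    # close to 0
```

In a real filter the counts would come from the user's own mail, which is how the technique tailors itself to individual users.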

Naive Bayes spam filtering

Naive Bayes spam filtering is a baseline technique for dealing with spam that can tailor itself to the email needs of individual users and give low false positive spam detection rates that are generally acceptable to users. It is one of the oldest ways of doing spam filtering, with roots in the 1990s. Naive Bayes classifier. A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions.

Naive Bayes classifier

A more descriptive term for the underlying probability model would be "independent feature model". An overview of statistical classifiers is given in the article on pattern recognition. Introduction: In simple terms, a naive Bayes classifier assumes that the value of a particular feature is unrelated to the presence or absence of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter; a naive Bayes classifier treats each of these features as contributing independently to the probability that the fruit is an apple, regardless of any correlations between them. For some types of probability models, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. Despite their naive design and apparently oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations.
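A minimal sketch of that independence assumption in code (the classes, feature likelihoods, and priors below are made-up illustrative numbers, not from the article):

```python
# Made-up conditional probabilities p(feature | class) for two fruit classes.
likelihoods = {
    "apple":  {"red": 0.7, "round": 0.9, "about_3_inches": 0.8},
    "cherry": {"red": 0.9, "round": 0.9, "about_3_inches": 0.01},
}
priors = {"apple": 0.5, "cherry": 0.5}

def posterior(features):
    """p(class | features), treating every feature as independent given the class."""
    scores = {}
    for cls in priors:
        score = priors[cls]
        for f in features:
            score *= likelihoods[cls][f]   # naive independence assumption
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}

print(posterior(["red", "round", "about_3_inches"]))  # "apple" gets almost all the mass
```

Multiplying the per-feature likelihoods is exactly where the "naive" assumption enters: any correlation between redness, roundness and size is ignored.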

Document classification. The documents to be classified may be texts, images, music, etc.

Document classification

Each kind of document possesses its own special classification problems. When not otherwise specified, text classification is implied. "Content based" versus "request based" classification: Content based classification is classification in which the weight given to particular subjects in a document determines the class to which the document is assigned. Bag-of-words model. The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR).

Bag-of-words model

In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. Recently, the bag-of-words model has also been used for computer vision.[1] An early reference to "bag of words" in a linguistic context can be found in Zellig Harris's 1954 article on Distributional Structure.[2] Example implementation: The following models a text document using bag-of-words. Here are two simple text documents: (1) "John likes to watch movies." (2) "John also likes to watch football games."
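A minimal Python sketch of the construction described next (the tokenizer and function names here are ours, not from the article):

```python
from collections import Counter

docs = ["John likes to watch movies.",
        "John also likes to watch football games."]

def tokenize(text):
    return text.lower().replace(".", "").split()

# Dictionary: every distinct word across the documents, in order of first appearance.
vocabulary = list(dict.fromkeys(w for d in docs for w in tokenize(d)))

def bag_of_words(doc):
    """Count vector over the vocabulary; word order is discarded, counts are kept."""
    counts = Counter(tokenize(doc))
    return [counts[w] for w in vocabulary]

print(vocabulary)                        # 8 distinct words
print([bag_of_words(d) for d in docs])   # one count vector per document
```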

Based on these two text documents, a dictionary is constructed as: ["John", "likes", "to", "watch", "movies", "also", "football", "games"], which has 8 distinct words. Each document is then represented by an 8-entry vector: [1, 1, 1, 1, 1, 0, 0, 0] and [1, 1, 1, 1, 0, 1, 1, 1], where each entry of the vector is the count of the corresponding dictionary word in that document (this is also the histogram representation). Bayesian poisoning. Bayesian poisoning is a technique used by spammers to try to degrade the effectiveness of Bayesian spam filters. Spammers also hope to cause the spam filter to have a higher false positive rate by turning previously innocent words into spammy words in the Bayesian database (statistical type I errors), because a user who trains their spam filter on a poisoned message will be indicating to the filter that the words added by the spammer are a good indication of spam.

Bayesian poisoning

A Plan for Spam. August 2002 (This article describes the spam-filtering techniques used in the spamproof web-based mail reader we built to exercise Arc.

A Plan for Spam

An improved algorithm is described in Better Bayesian Filtering.) I think it's possible to stop spam, and that content-based filters are the way to do it. The Achilles heel of the spammers is their message. They can circumvent any other barrier you set up. To the recipient, spam is easily recognizable. I think we will be able to solve the problem with fairly simple algorithms. The statistical approach is not usually the first one people try when they write spam filters. Most hackers' first instinct is to write software that recognizes individual properties of spam, say a subject line that's all uppercase, and filter on that. And so you do, and in the beginning it works. I spent about six months writing software that looked for individual spam features before I tried the statistical approach.

False positives are innocent emails that get mistakenly identified as spams. The more spam a user gets, the less likely he'll be to notice one innocent mail sitting in his spam folder. Better Bayesian Filtering. January 2003 (This article was given as a talk at the 2003 Spam Conference.

Better Bayesian Filtering

It describes the work I've done to improve the performance of the algorithm described in A Plan for Spam, and what I plan to do in the future.) The first discovery I'd like to present here is an algorithm for lazy evaluation of research papers. Probability. Suppose that being over 7 feet tall indicates with 60% probability that someone is a basketball player, and carrying a basketball indicates this with 72% probability.

Probability

If you see someone who is over 7 feet tall and carrying a basketball, what is the probability that they're a basketball player? If a and b are the probabilities associated with two independent pieces of evidence, then combined they indicate a probability of:

          ab
  -------------------
  ab + (1 - a)(1 - b)

So in this case our answer is:

       (.60)(.72)              .432
  -----------------------  =  ------
  (.60)(.72) + (.40)(.28)      .544

which is .794. When there are more than two pieces of evidence, the formula expands as you might expect:

             abc...
  ---------------------------------
  abc... + (1 - a)(1 - b)(1 - c)...

Combining Probabilities.
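A small Python check of these formulas (our own sketch, not from the article):

```python
from math import prod

def combine(probabilities):
    """Combine independent evidence: ab... / (ab... + (1-a)(1-b)...)."""
    p = prod(probabilities)
    q = prod(1 - x for x in probabilities)
    return p / (p + q)

print(round(combine([0.60, 0.72]), 3))        # 0.794, the basketball example
print(round(combine([0.60, 0.72, 0.90]), 3))  # adding a third piece of evidence
```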