Bayesian Noise Reduction: Contextual Symmetry Logic Utilizing Pattern Consistency Analysis
Last Update: January 14, 2004
Jonathan A. Zdziarski <jonathan@nuclearelephant.com>
Modern language classification relies on machine learning, which in turn depends heavily on the training input it is given. Most of today's algorithms (Bayes, Chi-Square, et cetera) are inherently sound and accurate; regardless of which one is used, however, much of its accuracy depends directly on the quality of the data provided: the Garbage In, Garbage Out rule. Bayesian Noise Reduction (BNR) is a statistical approach to evaluating coherence by instantiating a series of machine-generated contexts to serve as a means of contrast. This makes it possible to identify text that is out of context using a form of pattern consistency checking. BNR attempts to solve the problem commonly referred to as "Bayesian Noise," which, in its simplest definition, refers to irrelevant data present in a message being classified. BNR dubs irrelevant text in order to provide cleaner classification and is implemented as a pre-filter to existing language classification functions.
libbnr is a GPL implementation of the Bayesian Noise Reduction algorithm as outlined in the white paper. It may be compiled and linked into your existing applications (in accordance with the GPL) with little effort to provide noise reduction pre-filtering.
Q. When do you normally begin training for contexts?
A. BNR's contexts are based on token values, so it's more effective when the token values are fairly stable. On the first day of training, your filter knows nothing and has to learn these probabilities. As your filter learns, tokens begin to migrate towards a particular value, so it's not very beneficial to train contexts while these tokens are fluctuating so dramatically. Ideally, you should start training the contexts after a certain threshold of learning, when you can be confident that many of the values in the database have settled on a fairly stable value. The database doesn't have to be completely trained, but it should at least be mature enough to deliver a reasonable level of accuracy (roughly 95-99% or better). If you've performed corpus feeding to train your database, this can shorten the time you'll need to achieve stable values.

Q. What training modes (for the tokens) generally work best?
A. Since BNR's contexts are based on token values, how you train the tokens in the database can impact BNR's efficiency. The best training modes to work with are those that leave the database fairly nonvolatile. Train-on-Error and Train-on-Uncertain generally work the best. You can still use training modes such as Train-Everything, but you might find over time that the contexts become somewhat watered down. In this event, it's a good idea to divide all of the contexts' spam and nonspam counters by some standard value to restore the freshness of the contexts.

Q. What training modes (for the contexts) generally work best?
A. The BNR algorithm functions quite well when training contexts on every message, but I've also found that training contexts only on hard-to-classify messages seems to make BNR more sensitive to noise.

Q. When can I start using the contexts my filter has learned?
A. You'll want to wait until the contexts have had a reasonable number of messages to train from before using them.
A good 250 to 500 spam and nonspam messages should be enough to start using BNR.

Q. You mentioned at the Spam Conference this could possibly be circumvented. So why should I use it?
A. Unfortunately, I didn't have the time to elaborate on this within my 20-minute window. The BNR algorithm is pretty rock solid and resilient to all kinds of attacks. However, many people misunderstood John Graham-Cumming's example of circumventing statistical spam filters in 2004, and so now there is a concern about circumventing any filtering technology. Graham-Cumming's example was intended to illustrate a vulnerability in theory, but many people labeled it as doable and easy! The BNR algorithm, just like statistical filtering, can be circumvented in theory, but it's extremely unlikely that it ever will be, at least without massive security holes left open in mail clients, gobs of Internet bandwidth, and hours and hours of processing time. It requires an enormous level of resources and dependence on specific operating system vulnerabilities just to mine a user's hot token list alone (an attack which BNR is designed to fight off). Circumventing BNR would require exponentially more resources, since the attacker would first have to mine the exact value patterns for a particular user (which are very specific to that user), and then not only mine a hot token list but also calculate the values of cold tokens (in order to form a pattern). All of this work is only theoretically feasible, and should the exploit succeed, it succeeds only on the one target user; then the filter adapts, so you have to start the mining process all over again! This may be a fun challenge for the test lab, but spammers work in volume. And should a spammer invest the time and money to find a way around the specific BNR contexts in my database, then I've cost them money, which is, of course, counterproductive for spammers.
It's safe to say that while circumventing this approach may be possible in theory, it's highly unlikely due to the massive complexity required to do such a thing.
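
Two of the maintenance steps mentioned in the answers above can be sketched in a few lines of Python: dividing every context's spam and nonspam counters by a common value to freshen watered-down contexts, and holding off on using the contexts until enough messages have trained them. This is only an illustrative sketch, not libbnr's API. The `Context` record, both function names, the use of integer division, and the reading of "250 to 500" as a per-class minimum are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Context:
    """Hypothetical record for one learned context (not libbnr's actual layout)."""
    spam_count: int
    nonspam_count: int


def rescale_contexts(contexts, divisor=2):
    """Divide every context's counters by a common value, in place.

    Integer division is an assumption; the FAQ only says "some standard value".
    """
    if divisor < 1:
        raise ValueError("divisor must be at least 1")
    for ctx in contexts.values():
        ctx.spam_count //= divisor
        ctx.nonspam_count //= divisor


def contexts_ready(spam_trained, nonspam_trained, minimum=250):
    """Return True once enough messages of each class have trained the contexts.

    Treating the 250-message figure as a per-class minimum is an assumption.
    """
    return spam_trained >= minimum and nonspam_trained >= minimum


if __name__ == "__main__":
    contexts = {
        "pattern-a": Context(spam_count=40, nonspam_count=3),
        "pattern-b": Context(spam_count=5, nonspam_count=90),
    }
    rescale_contexts(contexts, divisor=2)
    print(contexts["pattern-a"])
    print(contexts_ready(spam_trained=300, nonspam_trained=120))
```

Dividing both counters by the same value keeps each context's spam-to-nonspam ratio roughly intact while shrinking its accumulated weight, so newer training has more influence. A real implementation would also need to decide what to do with contexts whose counters drop to zero, which the FAQ does not address.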
All Website Content © 2004 Jonathan A. Zdziarski. All Rights Reserved.
Reproduction prohibited without permission.