[geeklog-devel] Google Summer of Code 2009: Geeklog

saurabh gupta saurabhgupta1403 at gmail.com
Sun Mar 15 14:39:17 EDT 2009

On Fri, Mar 13, 2009 at 12:31 AM, Dirk Haun <dirk at haun-online.de> wrote:
> saurabh gupta wrote:
>>> My gut feeling is that our users won't be willing to spend a lot of time
>>> training a spam filter. I may be wrong, though
>>Well, what I thought in this part is that the spam filter will work in its
>>own way initially, but in case if sometimes  a post is made *spam* by
>>mistake, then user can mark it as *not spam* (similar to what we have in
>>gmail) and the spam filter should be intelligent enough to adapt to this and
>>vice versa. Training will be done automatically.
> You would still have to save all the posts, at least for a while, to be
> able to correct any false negatives. So apart from the technical issues
> (you currently can't save a post marked as spam such that it could be
> posted properly again later), there's also the issue of having a
> (possibly) really long list of spam posts in a queue.
> Of course, you could purge that queue on a regular basis, e.g. delete
> all posts older than 24 or 48 hours. We would have to test how that
> works in real life.

What we can do is mark the caught comments as spam and group them in
a spam category (a spam queue). To avoid the space issue and a long
list of spam comments, we can limit the size of the queue. For example,
if the maximum number of spam comments is 30, then whenever a new spam
comment is added to a full queue, the oldest one is deleted (first in,
first out). The maximum number of spam comments can also be offered as
an option for the site admin to set. The benefits of this approach
would be:

1. False positives get a second chance.

2. Test Mode ( http://wiki.geeklog.net/index.php/SoC_spam-x_overhaul#Test_Mode
) can be implemented in a better way. For example, when a user adds a
regular expression as a filter rule, there will be a button to check
the validity of that expression. It will then check all (or some) of
the comments against the latest rule and add the matching comments to
the spam queue. The user can then see whether the expression they
entered behaves as intended.

3. Similarly, *Mass Delete Spam Comments* can be modified so that, in a
similar way, the spam it catches is placed in the spam queue.
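
To make the queue idea a bit more concrete, here is a rough sketch in
Python (not Geeklog code, just an illustration). The names SpamQueue,
release and try_rule are made up for this example, and comments are
plain strings; a real version would sit on top of the Spam-X tables.

```python
import re
from collections import deque


class SpamQueue:
    """Bounded FIFO queue of comments that were flagged as spam."""

    def __init__(self, max_size=30):
        # A deque with maxlen silently drops the oldest entry when a
        # new one is appended to a full queue (first in, first out).
        self._items = deque(maxlen=max_size)

    def add(self, comment):
        self._items.append(comment)

    def release(self, comment):
        """Take a false positive out of the queue so it can be posted.

        Raises ValueError if the comment is not in the queue.
        """
        self._items.remove(comment)
        return comment

    def __len__(self):
        return len(self._items)

    def __iter__(self):
        return iter(self._items)


def try_rule(pattern, comments, queue):
    """Test-mode helper: try a regex rule against existing comments.

    Returns None if the expression does not compile, otherwise the
    list of comments it caught (these are also added to the queue).
    """
    try:
        rule = re.compile(pattern, re.IGNORECASE)
    except re.error:
        return None
    caught = [c for c in comments if rule.search(c)]
    for c in caught:
        queue.add(c)
    return caught
```

With max_size set to 3, adding five comments leaves only the last three
in the queue, and release() is what the *not spam* button would call.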

Other modifications which can be implemented in the Spam-X plugin are:

1. Adding an entry to the *Spam-X Personal Blacklist* doesn't check
for duplicate entries. So, if a user presses the *Add Censor list*
button multiple times, all the entries are added multiple times. The
same happens with the other blacklist entries. So, an API that checks
for duplicate entries could be added for this.

2. If the spam queue is implemented, then whenever a comment is caught
as spam, it will be entered in the spam queue, and the words which
provided the basis for the spam detection will be highlighted or
underlined (marked). For example, if a comment is caught as spam
because its content has the word *xyz*, then this word will be
highlighted and the comment will be sent to the spam queue. This gives
the site user an overview of which word triggered the filter, and it is
helpful for the test mode case of Spam-X training. Another advantage is
that it allows a use_counter (
http://wiki.geeklog.net/index.php/SoC_spam-x_overhaul#Use_Counter ) to
be implemented in a better way.
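
The two points above could look roughly like this. Again, this is only
a Python sketch under my own assumptions: the Blacklist class, its
methods and the *...* marking are invented for illustration, entries
are treated as plain case-insensitive words, and the counter is bumped
once per caught comment for each entry that matched.

```python
import re
from collections import Counter


class Blacklist:
    """Personal blacklist with a duplicate check and a use counter."""

    def __init__(self):
        self._entries = []
        self.use_counter = Counter()

    def add(self, entry):
        """Add an entry unless an equal one exists; True if added."""
        key = entry.strip().lower()
        if not key or key in self._entries:
            return False
        self._entries.append(key)
        return True

    def entries(self):
        return list(self._entries)

    def mark(self, text):
        """Wrap each blacklisted word found in text in *...*.

        Returns (marked_text, matched_entries) and increments the use
        counter once for every entry that matched this text.
        """
        if not self._entries:
            return text, []
        # Longest entries first, so a longer entry wins over a
        # shorter one that it contains.
        ordered = sorted(self._entries, key=len, reverse=True)
        pattern = re.compile(
            "|".join(re.escape(e) for e in ordered), re.IGNORECASE
        )
        matched = []

        def _mark(match):
            key = match.group(0).lower()
            if key not in matched:
                matched.append(key)
            return "*" + match.group(0) + "*"

        marked = pattern.sub(_mark, text)
        for key in matched:
            self.use_counter[key] += 1
        return marked, matched
```

Pressing the add button twice then calls add() twice, but the second
call returns False and nothing is stored again. The marked text is what
would be shown in the spam queue.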

Comments and suggestions are welcome.

Saurabh Gupta
NSIT, New Delhi, India
