I enjoyed the blog post by taoeffect/itistoday/greg at Tao Effect Blog; a good story, well told, and full of enthusiasm (increasingly a scarce commodity in online communities). I noticed that there was a slight increase in activity in the newLISP corner of the Twitterverse as a result: up from one or two tweets a week to 12 in one day.
I thought it would be nice to graph the tweet frequency. I added a short function to the Dragonfly Twitter module to draw a simple bar chart - see it here. However, I'm not too impressed with the effect. I'm not convinced that the choice of bar graph is correct either.
What I'm looking for is some kind of 'bar code' type of graph, where each vertical line represents a point in time to mark each tweet, and any increase in frequency shows up as a cluster of lines closer together. I don't know what this type of graph is called, or how to produce it using Google's chart API, though. If it's not possible, I'll think about drawing one using the HTML5 Canvas element. Help wanted!
Update: I wrote a new graph plug-in, using the HTML canvas. It looks more like the bar-code thing now. Opera doesn't draw the text, but it's OK in Firefox, Safari, and Chrome.
Seasonal greetings from Unbalanced Parentheses Headquarters!
This post uses the HTML 5 Canvas, and should work properly on recent standards-compliant browsers such as Safari, Firefox, and Google Chrome. The Opera browser can't handle this, which surprised me. As for Internet Explorer ... I suspect you won't see anything. Also, even if the canvas works well, there's still the problem of all those Unicode fonts. We haven't completely left behind the early days of the web, when every other page had a "Best viewed in browser X" banner.
The image is generated afresh each time you load the page, so the colours and positions of the various greetings are different each time. This is because the image is generated by embedded newLISP code in the HTML database which is evaluated only at browse time. The only tricky part of the operation is to make sure the code survives being translated by Markdown into HTML, then being uploaded via xmlrpc to be stored in the newLISP database ready for being processed by Dragonfly.
If you want a challenge, see how many different languages you can identify (without cheating)!
This post describes the newLISP Bayesian Comment Spam Killer. It won't kill Bayesian comments - although it might - but it tries to kill spam comments on blogs, using Bayesian analysis.
The story starts after the aspiring commenter clicks the Submit button on the comment form, and after the CGI script or web framework has extracted the information from the commenter's posted submission. To make things easy, here are some declarations that get me quickly to the same position:
(set 'comment-date "20091114T163223Z")
(set 'storyid "projectnestorpart1")
(set 'comment "Very nice site!")
(set 'commentator "svQrVW a href=\"http://asdfhh.com/")
(set 'commentator-uri "svQrVW a href=\"http://asdfhh.com/")
(set 'ip-address '("94.102.60.174"))
The first thing to do is to save this information in a file. There are many ways to do this, but I like to save data in newLISP format wherever possible, because it saves time and effort when reading it back in:
; make a suitable path name
(set 'path (string {/Users/me/blog/comments/}
                   storyid "-" comment-date ".txt"))
; save as association list
(set 'comment-list
  (list
    (list 'comment-date comment-date)
    (list 'storyid storyid)
    (list 'commentator commentator)
    (list 'comment comment)
    (list 'ip-address ip-address)
    (list 'status "spam")
    (list 'commentator-uri commentator-uri)))
(save path 'comment-list)
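To show why the newLISP format saves effort on the way back in, here is a minimal round-trip sketch. The scratch path and the name demo-list are made up for the example; only save, load, and lookup are the same functions used in the rest of this post.

```newlisp
; Hypothetical round trip through a scratch file (path is an assumption).
(set 'demo-path "/tmp/comment-demo.txt")
(set 'demo-list
  (list
    (list 'storyid "projectnestorpart1")
    (list 'status "spam")))
(save demo-path 'demo-list)

; forget the in-memory copy, then restore it by loading the file
(set 'demo-list nil)
(load demo-path)

; the association list is back, so fields can be looked up by key
(println (lookup 'status demo-list))
(delete-file demo-path)
```

No parsing code is needed: loading the file simply evaluates the saved set expression.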
A few weeks after opening a comments form to the intelligent citizens of cyberspace, there will be hundreds of little newLISP files in the directory, containing all kinds of comment. Each file looks something like this:
(set 'Comments:comment-list '(
(Comments:comment-date "20091114T163223Z")
(Comments:storyid "projectnestorpart1")
(Comments:comment "svQrVW a href=\"http://asdfhh.com/ etc etc ")
(Comments:commentator "svQrVW a href=\"http://asdfhh.com/")
(Comments:commentator-uri "svQrVW a href=\"http://asdfhh.com/")
(Comments:ip-address ("94.102.60.174"))
(Comments:status "spam")
))
I've added a status tag to each one, with the default value of "spam". That means that every comment so far is considered spam. That's not good (although very close to the actual truth), so I must also manually alter any genuine comments and tag them as "approved". That's a vital task, and for a while I did it by hand, until the collection of comments was large enough for me to trust the Bayesian analysis to do it automatically.
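Re-tagging a comment amounts to replacing one pair in the association list. The sketch below works on an invented in-memory list (demo-comment and its contents are assumptions, not data from the blog); a real comment would then be written back to its file with save.

```newlisp
; Invented comment data, for illustration only.
(set 'demo-comment
  '((commentator "A. Reader")
    (comment "println works well here")
    (status "spam")))

; replace the whole (status ...) pair in place
(setf (assoc 'status demo-comment) '(status "approved"))

(println (lookup 'status demo-comment))
```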
Once I've got a reasonable collection of comments, I'm ready to start building the Comment Spam Killer.
(context 'Comments)
A little macro I've been using recently provides a modified append:
(define-macro (extend)
(setf (eval (args 0)) (append (eval (args 0)) (eval (args 1)))))
This accepts a symbol holding a list, and a list, and adds the elements in the list at the end of the symbol's current elements.
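Here is a small usage sketch. It repeats the macro body under the hypothetical name extend-list, because later newLISP releases ship a built-in extend and would refuse to redefine that name; the word lists are invented.

```newlisp
; Same body as the macro above, under a hypothetical name so that it
; cannot clash with a built-in extend in later newLISP releases.
(define-macro (extend-list)
  (setf (eval (args 0)) (append (eval (args 0)) (eval (args 1)))))

(set 'words '("very" "nice"))
; the symbol is passed unquoted; the macro updates it in place
(extend-list words '("site"))
(println words)
```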
I want somewhere to store the analysis:
(define MAIN:spam-corpus)
This function extracts a list of the words used in all the comments:
(define (build-word-lists dir)
  (dolist (nde (directory dir {^[^.].*txt}))
    (if (directory? (append dir nde))
      ; directory, recurse
      (build-word-lists (append dir nde "/"))
      ; file: read info and make a list of its contents
      (letn ((file (string dir nde))
             (comment-list (load file))
             (commentator (lookup 'commentator comment-list))
             (comment (lookup 'comment comment-list))
             (comment-status (lookup 'status comment-list))
             (commentator-ip (lookup 'ip-address comment-list))
             (commentator-uri (lookup 'commentator-uri comment-list))
             (word-list '()))
        (extend word-list (parse commentator "[^A-Za-z]" 0))
        (extend word-list (parse comment "[^A-Za-z]" 0))
        (extend word-list (parse commentator-uri "[^A-Za-z]" 0))
        ; sometimes ip addresses are stored in a list...
        (if (list? commentator-ip)
          (dolist (i commentator-ip) (extend word-list (list i))))
        (cond
          ((= comment-status "approved")
           (extend genuine-comments (clean empty? word-list)))
          ((= comment-status "spam")
           (extend spam-comments (clean empty? word-list))))))))
And the two lists can be turned into a Bayesian-ready dictionary with:
(bayes-train spam-comments genuine-comments 'MAIN:spam-corpus)
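To see the training step in isolation, here is a toy run with two tiny invented word lists. The names toy-spam, toy-genuine, and MAIN:Toy are assumptions for the sketch; the real lists come from build-word-lists.

```newlisp
; Toy word lists, invented purely for illustration.
(set 'toy-spam    '("cheap" "pills" "cheap" "prescription"))
(set 'toy-genuine '("println" "newlisp" "cheap"))

; first list = category 1 (spam), second list = category 2 (genuine)
(set 'totals (bayes-train toy-spam toy-genuine 'MAIN:Toy))

; bayes-train returns the total number of words counted per category
(println totals)
```

The order of the lists matters: it fixes which position in every count pair means spam and which means genuine.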
The resulting spam-corpus is a context that provides two numbers for each word in the comments. Here's an informative extract:
;...
("prepended" (0 2))
("prescription" (36 0))
("present" (0 1))
("presepe" (3 0))
("pretty" (0 1))
("price" (2 0))
("primari" (3 0))
("primaria" (6 0))
("primitive" (0 1))
("princessdc" (2 0))
("print" (0 2))
("printing" (4 0))
("println" (0 5))
("prior" (0 1))
("priors" (0 3))
;...
The spam-corpus context holds a list of words and the number of times that each word occurs in the first category, the spam comments, or the second category, the genuine comments. The apparent discrepancy between print and printing is easily resolved once you look at the original comments - something to do with custom T-shirt printing, whereas print was twice mentioned in a piece of newLISP code in a comment.
A new comment can then be scored against the corpus. This function builds a word list from a comment file in the same way as before and queries the corpus with it:
(define (analyse-comment file)
  (letn ((comment-list (load file))
         (commentator (lookup 'commentator comment-list))
         (comment (lookup 'comment comment-list))
         (comment-status (lookup 'status comment-list))
         (commentator-ip (lookup 'ip-address comment-list))
         (commentator-uri (lookup 'commentator-uri comment-list))
         (word-list '())
         (spam-comments '())
         (genuine-comments '()))
    (extend word-list (parse commentator "[^A-Za-z]" 0))
    (extend word-list (parse comment "[^A-Za-z]" 0))
    (extend word-list (parse commentator-uri "[^A-Za-z]" 0))
    (if (list? commentator-ip)
      (dolist (i commentator-ip) (extend word-list (list i))))
    ; clean is non-destructive, so keep its result
    (set 'word-list (clean empty? word-list))
    (set 'spam-score (bayes-query word-list 'MAIN:spam-corpus))))
This function returns a pair of numbers as the spam score for a comment: the probabilities that the comment belongs in the first (spam) or second (genuine) category.
It's now easy to decide whether to reject a comment based on the two numbers returned by this function. The example I started with manages to score (1 0), a clear indication that this apparently harmless phrase is, when considered as part of a comment as a whole, usually a comment from a spammer.
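One way to turn the pair of numbers into a decision is a simple cut-off on the first value. This is only a sketch: the helper name spam? and the 0.9 default are my assumptions, and the right threshold depends on how much genuine mail you can afford to lose.

```newlisp
; Hypothetical decision helper: treat a comment as spam when the first
; probability (the spam category) exceeds a cut-off. The 0.9 default
; is an arbitrary assumption, not a value from this post.
(define (spam? score (cutoff 0.9))
  (> (first score) cutoff))

(println (spam? '(1 0)))       ; the score of the example comment
(println (spam? '(0.2 0.8)))   ; an invented score leaning genuine
```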
If you're wondering where the comments form is on this site - well, there isn't one; I decided against using up disk space storing hundreds of unwanted comments!