This Blog Has a Video In It.
Onda: Tutorial & Demo from Daniel Hough on Vimeo.
It's beginning to feel pretty complete
The Website
So for the past few weeks the website has been up. I'm not going to rant and rave about the URL yet because I'm not entirely convinced it's ready.
The clustering is reasonably accurate, and the interface is coming along very well. There are a few issues with efficiency though: some of the pages are slow, particularly the ones which involve deeply-joined queries (finding all the clusters used by a given source requires the sources table linked to the articles table linked to the clusters table, for example). I have come up with better ways to do it, and I've implemented a little "be patient, loading!" screen for particular pages using JavaScript.
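To make the sources-to-articles-to-clusters join concrete, here is a minimal sketch using Python's built-in sqlite3 module. The schema, column names (`source_id`, `cluster_id`, `label`) and sample rows are assumptions made up for illustration, not the site's real tables, and indexing the join columns is just one common way of speeding up this kind of query, not necessarily the approach used on the site.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sources  (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE clusters (id INTEGER PRIMARY KEY, label TEXT);
        CREATE TABLE articles (
            id         INTEGER PRIMARY KEY,
            source_id  INTEGER REFERENCES sources(id),
            cluster_id INTEGER REFERENCES clusters(id),
            title      TEXT
        );
        -- Indexes on the join columns help the database avoid full table scans.
        CREATE INDEX idx_articles_source  ON articles(source_id);
        CREATE INDEX idx_articles_cluster ON articles(cluster_id);
    """)
    conn.executemany("INSERT INTO sources VALUES (?, ?)",
                     [(1, "Source A"), (2, "Source B")])
    conn.executemany("INSERT INTO clusters VALUES (?, ?)",
                     [(10, "Election"), (11, "Budget"), (12, "Weather")])
    conn.executemany("INSERT INTO articles VALUES (?, ?, ?, ?)", [
        (100, 1, 10, "Poll results"),
        (101, 1, 10, "Candidate profile"),
        (102, 1, 11, "Tax plans"),
        (103, 2, 12, "Storm warning"),
    ])
    return conn


def clusters_for_source(conn, source_id):
    """Return the distinct (id, label) clusters containing articles from one source."""
    query = """
        SELECT DISTINCT c.id, c.label
        FROM sources AS s
        JOIN articles AS a ON a.source_id = s.id
        JOIN clusters AS c ON c.id = a.cluster_id
        WHERE s.id = ?
        ORDER BY c.id
    """
    return conn.execute(query, (source_id,)).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(clusters_for_source(conn, 1))
```

The `DISTINCT` matters here because one source usually has several articles in the same cluster, and each of those would otherwise produce its own row.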
All of the compulsory requirements for the project have been fulfilled to a decent extent, though there are a few more insights I wish to develop for looking at diversity. However, the optional requirement (a bit of an oxymoron, you'd think) of sentiment detection is unfortunately not yet fulfilled. It probably would not be particularly difficult to do, but since there are more pressing issues I thought I'd leave it until the report is further along.
There is one persistent problem which bothers me: sometimes (at least twice a week) there seem to be duplicate clusters. One article will be just below the similarity threshold for an existing cluster, so it'll create a new cluster instead of joining it.
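A small sketch of how this kind of duplicate can arise with threshold-based clustering. The post does not say which similarity measure or threshold the site uses, so the cosine similarity over term counts, the `assign` function and the threshold of 0.5 are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign(article_terms, clusters, threshold):
    """Add an article to its most similar cluster, or start a new one.

    `clusters` is a list of Counters holding summed term counts.
    Returns the index of the cluster the article ended up in.
    """
    vec = Counter(article_terms)
    best_index, best_score = None, 0.0
    for i, centroid in enumerate(clusters):
        score = cosine(vec, centroid)
        if score > best_score:
            best_index, best_score = i, score
    if best_index is not None and best_score >= threshold:
        clusters[best_index].update(vec)  # join the existing cluster
        return best_index
    clusters.append(vec)  # below the threshold: a new cluster is created
    return len(clusters) - 1


if __name__ == "__main__":
    clusters = []
    assign("budget vote parliament tax".split(), clusters, 0.5)
    assign("budget vote parliament tax cuts".split(), clusters, 0.5)
    # Same story, different wording: this falls below the threshold.
    assign("tax cuts row minister speech debate".split(), clusters, 0.5)
    print(len(clusters))
```

In the demo the third article is arguably about the same story, but its wording overlaps too little with the existing cluster, so it starts a second cluster on the same topic.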
As I see it there are at least two solutions to this problem:
- 'Candidate clusters': new clusters are created by the normal method as candidates, and any candidate which is particularly similar to an existing cluster (judged against a higher threshold than the standard one) is merged with that cluster.
- A supervised method, making use of collective intelligence. Users can specify when two clusters are, as they see it, on the same topic. The two clusters will then either be merged after a number of votes, or be flagged for an administrator (me) to merge.
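Both ideas can be sketched in a few lines. This is only an illustration under assumptions: clusters are represented as term-count Counters compared with cosine similarity, and the names `merge_candidate`, `MergeVotes`, `merge_threshold` and `votes_needed` are hypothetical, not part of the existing system.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def merge_candidate(candidate, clusters, merge_threshold):
    """Merge a candidate cluster into its most similar existing cluster.

    If no existing cluster reaches `merge_threshold`, the candidate is
    kept as a cluster in its own right. Returns the resulting index.
    """
    best_index, best_score = None, 0.0
    for i, existing in enumerate(clusters):
        score = cosine(candidate, existing)
        if score > best_score:
            best_index, best_score = i, score
    if best_index is not None and best_score >= merge_threshold:
        clusters[best_index].update(candidate)
        return best_index
    clusters.append(candidate)
    return len(clusters) - 1


class MergeVotes:
    """Count distinct users who say two clusters cover the same topic."""

    def __init__(self, votes_needed):
        self.votes_needed = votes_needed
        self._voters = {}  # frozenset({id_a, id_b}) -> set of user names

    def vote(self, user, cluster_a, cluster_b):
        """Record a vote; return True once enough distinct users agree."""
        pair = frozenset((cluster_a, cluster_b))
        voters = self._voters.setdefault(pair, set())
        voters.add(user)  # a repeat vote from the same user is not counted twice
        return len(voters) >= self.votes_needed


if __name__ == "__main__":
    clusters = [Counter({"budget": 2, "vote": 2, "tax": 2})]
    print(merge_candidate(Counter({"budget": 1, "vote": 1, "tax": 1}), clusters, 0.8))
    votes = MergeVotes(votes_needed=2)
    print(votes.vote("alice", 1, 2), votes.vote("bob", 2, 1))
```

When `vote` returns True the pair could either be merged automatically or put in a queue for the administrator, depending on how much the votes are trusted.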
However, I'm not entirely sure I'll get any of this done before the report is written. If not, it'll make good stuff to discuss in the report.