A Blog Named Dan | I'd like a constant adventure, please.

Mar/10

2

Success!

So the training corpus has been counted and the clustering functions are in place. There are a few options:

  • Cluster method: single link, group average or complete link
  • Cluster type: agglomerative hierarchical or flat
  • Threshold: between 0 and 1, for the level of similarity required for two vector models to be considered on potentially the same topic
  • Normalize? Should term frequencies be normalized or not?
  • Title weight: the amount of weight the terms in the title are given for the counts and normalized term frequencies
  • Leading section percentage: the percentage of the start of the article which is considered the “leading section”
  • Leading section weight: same as title weight, but applied to the terms in the leading section. This hasn’t proved to have much of an effect, actually. In fact, at some levels it totally screws up classification
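To make the weighting options above a bit more concrete, here is a minimal sketch of how a weighted term-frequency vector could be built and compared. Everything in it is an assumption for illustration: the names `term_vector` and `cosine`, the whitespace tokenizer, and the use of cosine similarity as the 0-to-1 similarity measure are not taken from the actual system.

```python
import math
from collections import Counter


def term_vector(title, body, title_weight=1.0, lead_pct=0.0,
                lead_weight=1.0, normalize=True):
    """Hypothetical weighted term-frequency vector for one article."""
    body_terms = body.lower().split()
    # Number of body terms that count as the "leading section".
    lead_len = int(len(body_terms) * lead_pct)

    counts = Counter()
    for term in title.lower().split():
        counts[term] += title_weight
    for i, term in enumerate(body_terms):
        counts[term] += lead_weight if i < lead_len else 1.0

    if normalize:
        total = sum(counts.values())
        if total:
            counts = Counter({t: c / total for t, c in counts.items()})
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse vectors (assumed measure)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = term_vector("cat news", "the cat sat", title_weight=2.0)
doc_b = term_vector("dog news", "the dog ran", title_weight=2.0)
print(cosine(doc_a, doc_b))
```

With non-negative term weights the cosine always falls between 0 and 1, which is why it fits the threshold option described above; raising `title_weight` makes articles with overlapping titles score as more similar.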

Group average clustering works well and is more efficient than both single link and complete link. Whichever method is eventually chosen, an ‘average’ model will be formed for each cluster anyway, since it determines how a cluster is represented in human-readable terms: the article in the cluster which is closest to the average will be used to describe the cluster.
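As a rough illustration of picking the article closest to the average, the sketch below averages the term vectors in a cluster and returns the member most similar to that average. The function names, the dict-based vectors and the cosine measure are all assumptions, not the real code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse dict vectors (assumed measure)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Term-by-term average of a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def representative(cluster):
    """Return the id of the article closest to the cluster average.

    `cluster` is a hypothetical dict mapping article id -> term vector.
    """
    average = centroid(list(cluster.values()))
    return max(cluster, key=lambda article_id: cosine(cluster[article_id], average))


cluster = {
    "a": {"election": 1.0},
    "b": {"election": 1.0, "vote": 1.0},
    "c": {"vote": 1.0},
}
print(representative(cluster))
```

In this toy cluster the average has equal weight on both terms, so article "b", which mentions both, is the one chosen to describe the cluster.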

So I ran 144 tests, all using flat clustering. Originally there was no variation in the leading section weight or percentage; once I’d determined the best settings for everything else, I tried some slight variation on those two, but unfortunately it didn’t particularly help results. Thus, I doubt I’ll be putting any weight on the leading terms.

Conversely, putting a weight on the title is very helpful, and works especially well with a high threshold of around 0.60. There were 415 articles being evaluated in a number of ways described briefly below. Anyway, the full results are here.

The main things to look at are purity and f-measure. Purity is a measure of the crossover between classes (training clusters) and the generated clusters: each generated cluster is matched to the class most of its articles come from, and purity is the fraction of articles that end up in a matching cluster. F-measure combines precision and recall and can be weighted to place more importance on one of them, so the higher it is the better. I placed importance on the true positive side (precision), i.e., I wanted the system to generate as many true positives as possible and as few false positives as possible. I did this because I figure that it’s less detrimental to the accuracy of the system if articles are accidentally given their own cluster (a false negative) than it is if they are accidentally clustered with unrelated articles (a false positive). Rand is the Rand Measure, which is essentially the fraction of pairwise decisions in the confusion matrix that are correct: true positives plus true negatives, divided by all pairs. NMI is the normalized mutual information. It tells us how our information about the pre-determined classes improves as we are told what the clusters are, thus a high NMI is better. Across all the tests, NMI tends to stay around the 0.27 mark. For more information on all of these measures see this page.
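For anyone who wants to see these measures spelled out, here is a small sketch that computes them from two parallel label lists. The function names are made up, the labels are toy data rather than my results, `beta=0.5` is just an assumed weighting that favours precision, and the NMI normalization (dividing by the mean of the two entropies) is one common convention that may differ from the one used in my evaluation.

```python
import math
from collections import Counter
from itertools import combinations


def purity(classes, clusters):
    """Fraction of items that fall in the majority class of their cluster."""
    by_cluster = {}
    for cls, clu in zip(classes, clusters):
        by_cluster.setdefault(clu, Counter())[cls] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(classes)


def pair_counts(classes, clusters):
    """Pairwise confusion matrix (tp, fp, fn, tn) over all item pairs."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(classes)), 2):
        same_class = classes[i] == classes[j]
        same_cluster = clusters[i] == clusters[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1  # clustered together but unrelated
        elif same_class:
            fn += 1  # related but split apart
        else:
            tn += 1
    return tp, fp, fn, tn


def rand_index(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def f_measure(tp, fp, fn, beta=0.5):
    """F-beta score; beta < 1 weights precision more than recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(classes, clusters):
    """Mutual information divided by the mean of the two entropies."""
    n = len(classes)
    class_n, cluster_n = Counter(classes), Counter(clusters)
    joint = Counter(zip(classes, clusters))
    mi = sum((nij / n) * math.log(n * nij / (class_n[c] * cluster_n[k]))
             for (c, k), nij in joint.items())
    denom = (entropy(classes) + entropy(clusters)) / 2
    return mi / denom if denom else 0.0  # degenerate case: 0.0 in this sketch


# Toy data: 17 items, three true classes, three generated clusters.
classes = ["x"] * 5 + ["o"] + ["x"] + ["o"] * 4 + ["d"] + ["x"] * 2 + ["d"] * 3
clusters = [1] * 6 + [2] * 6 + [3] * 5
tp, fp, fn, tn = pair_counts(classes, clusters)
print(purity(classes, clusters), rand_index(tp, fp, fn, tn),
      f_measure(tp, fp, fn), nmi(classes, clusters))
```

Note that the pairwise loop looks at every pair of articles, which is fine for a few hundred articles but grows quadratically with corpus size.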

My ‘favourite’ configuration is number 108 – Flat, Complete Link clustering with a 0.6 similarity threshold and a title weight of x19, and no weight on the leading text. This gives only 6 false positives and has an excellent purity score – NMI is around 0.28, one of the higher values.
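To show how the cluster method and threshold interact, here is one plausible single-pass reading of flat clustering. It is a sketch under assumptions, not necessarily how my clusterer works: each article joins the best existing cluster whose linkage similarity meets the threshold, otherwise it starts its own. The names `flat_cluster` and `linkage_similarity`, and the cosine measure, are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse dict vectors (assumed measure)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def linkage_similarity(vec, members, method):
    """Similarity between one article and an existing cluster."""
    sims = [cosine(vec, member) for member in members]
    if method == "single":      # closest member decides
        return max(sims)
    if method == "complete":    # furthest member decides
        return min(sims)
    if method == "average":     # mean over all members
        return sum(sims) / len(sims)
    raise ValueError(f"unknown method: {method}")


def flat_cluster(vectors, threshold=0.6, method="complete"):
    """Single-pass flat clustering; returns lists of indices into `vectors`."""
    clusters = []
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for cluster in clusters:
            members = [vectors[j] for j in cluster]
            sim = linkage_similarity(vec, members, method)
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([i])  # nothing similar enough: new cluster
        else:
            best.append(i)
    return clusters


articles = [{"a": 1.0}, {"a": 1.0, "b": 1.0}, {"b": 1.0}]
print(flat_cluster(articles, threshold=0.6, method="complete"))
print(flat_cluster(articles, threshold=0.6, method="single"))
```

In this toy example complete link keeps the third article apart because it shares nothing with the first, while single link chains all three together. That is the kind of false positive a high threshold with complete link is meant to avoid.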

You may notice that none of the configurations use agglomerative clustering. This is because in the time it took to run 16 tests using flat clustering, one full agglomerative test over the 415 articles hadn’t even finished. In other words, it’s incredibly slow, so slow that I certainly will not be using it in the final system, since a requirement is to update frequently. If this is how long it takes over 415 articles, what about 1200 (the local corpus) or 7000 (the corpus on the server)?
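A quick back-of-the-envelope sketch of why the larger corpora look so worrying. It assumes a naive agglomerative implementation that starts from a full pairwise similarity matrix and rescans it after every merge, so the work grows roughly with the cube of the corpus size; I haven't measured this, and a smarter implementation would scale better.

```python
def pair_count(n):
    """Number of distinct article pairs in the initial similarity matrix."""
    return n * (n - 1) // 2


for n in (415, 1200, 7000):
    # Rough cost relative to the 415-article test, assuming cubic growth.
    relative_cost = (n / 415) ** 3
    print(n, pair_count(n), round(relative_cost, 1))
```

The pair counts alone go from 85,905 for 415 articles to 719,400 for 1200 and 24,496,500 for 7000. Under the cubic assumption the 1200-article run would take roughly 24 times as long as the test that already failed to finish, and the 7000-article run thousands of times as long.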

Later on I will probably run an agglomerative test for measurable proof, but only if there’s time.

In the meantime, I’ve run the clusterer over the 1200 locally-stored articles and will soon work on some pretty graphs and measures for them. Until then, adios!
