I’ve finally finished the RSS parsing and HTML parsing section of the project. Since about midnight this morning (26/01/2010), the system has collected 180 unique articles from the Daily Mail, the Guardian, the Telegraph and the Express.
I’m going to self-cluster these articles as they come in, and soon enough I’ll begin developing the modular system that represents articles as vectors (just vectors, for the time being), along with the methods needed to compare and cluster them. Then accuracy can be measured, settings tweaked and algorithms debugged until I find the best configuration.
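To make the "articles as vectors" idea a bit more concrete, here is a minimal sketch in Python. It assumes a plain bag-of-words representation (word counts) and cosine similarity as the comparison method; neither choice is settled for the real system, and the function names `to_vector` and `cosine_similarity` and the sample headlines are made up for illustration.

```python
import math
import re
from collections import Counter


def to_vector(text):
    """Turn an article's text into a sparse term-frequency vector.

    Assumption: a simple lowercase word split; no stemming or stop-word removal.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors.

    Returns 0.0 when either vector is empty, so empty articles never match.
    """
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Invented headlines, not articles from the collected data.
    first = to_vector("Heavy snow closes schools across the north")
    second = to_vector("Schools closed as heavy snow hits the north")
    third = to_vector("Bank shares rise after strong profit report")

    print("same story:", cosine_similarity(first, second))
    print("different story:", cosine_similarity(first, third))
```

Articles about the same story should score closer to 1 and unrelated ones closer to 0, which gives a clustering method something to threshold on. Swapping in a different weighting (TF-IDF, say) would only mean changing `to_vector`, which is the sort of modularity the plan is after.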
After that, the mammoth task of just letting it run for ages begins, while in the meantime I a) begin a report about this crazy adventure and b) work on some rad visualisations for the data collected.
That’s the plan at least. Wish me luck!

