Archive for November 2009
Since my last post I’ve been doing a lot of reading on the background of the topic my project has become. The structure of the project has changed slightly, with much less emphasis on the Text Reuse aspect and more focus on Diversity, which means that the system will be heavily focussed on measuring diversity between and within sources in online news.
I have been writing up my Survey & Analysis report, as well as doing some experimental Python programming: reading RSS feeds and parsing HTML pages from a few newspapers’ websites.
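As a rough illustration of the RSS side, here is a minimal sketch that uses only Python's standard library. The feed content, the `SAMPLE_RSS` name and the `parse_rss_items` helper are all made up for this example; a real feed would be downloaded first and may need more careful handling.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content, standing in for a downloaded RSS 2.0 document.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First headline</title>
      <link>http://example.com/news/1</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/news/2</link>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return a list of {'title', 'link'} dicts, one per <item> in the feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns None when the child element is missing.
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
        })
    return items


if __name__ == "__main__":
    for entry in parse_rss_items(SAMPLE_RSS):
        print(entry["title"], "->", entry["link"])
```

The links collected this way are the article pages that would then be fetched and parsed as HTML.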
Irritatingly, some newspapers insist on serving faulty HTML. But if Firefox can render their pages then I should be able to parse them, so I’ll have to work on that. It is not a high-priority task at the moment though. For now I simply want to look at a snapshot of the news over a window of a couple of days and see what the differences are. Perhaps I can then use my own measurements as training data for the system to learn from.
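One possible way around faulty markup is a lenient, event-based parser that does not insist on well-formed input. The sketch below uses the standard library's `html.parser.HTMLParser`; the `TextExtractor` class, the `extract_text` helper and the `BROKEN_HTML` sample are invented for illustration, and real newspaper pages would need extra logic to separate article text from navigation and adverts.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text chunks, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        # Stray or unmatched end tags are simply ignored.
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the list of text chunks found in a (possibly malformed) page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical broken page: unclosed <p> and <b>, no </body>, a stray </div>.
BROKEN_HTML = (
    "<html><body><p>First paragraph<p>Second <b>bold paragraph</p>"
    "<script>var x = 1;</script></div>"
)

if __name__ == "__main__":
    print(extract_text(BROKEN_HTML))
```

Because the parser only reports tags and text as it meets them, missing or mismatched closing tags do not stop it, which is the property that matters for this kind of page.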
Research-wise, I have been delving into the ACM digital library and downloading many, many papers about story link detection (SLD) and article clustering. Hopefully, if I can get SLD working well, clustering will be a simple task. Clustering is a big part of the system because once models (either language models or vector space models, or something else) have been constructed, perhaps I can move on to more complex tasks such as topic detection, or more fun and visually satisfying tasks such as sentiment detection and generating understandable graphs. We’ll see!
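To make the vector space idea a little more concrete, here is a minimal sketch of one common approach: treat each article as a bag of word counts and call two articles linked when the cosine similarity of their vectors passes a threshold. The function names and the 0.5 threshold are assumptions chosen for illustration, and a serious system would probably use TF-IDF weighting and stop-word removal rather than raw counts.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Build a raw term-frequency vector from lowercased word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def same_story(text_a, text_b, threshold=0.5):
    """Hypothetical link decision; the threshold is an arbitrary assumption."""
    return cosine_similarity(term_vector(text_a), term_vector(text_b)) >= threshold


if __name__ == "__main__":
    a = "Bank raises interest rates"
    b = "Bank raises interest rates again"
    c = "Football team wins cup final"
    print(same_story(a, b), same_story(a, c))
```

Pairwise link decisions like this could then feed a clustering step, for example by grouping articles that are linked to one another.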
