Illustrations
Before I went on my travels to Korea and then Europe (which I will finish writing about soon, I promise!) I made a couple of little illustrations to improve my Illustrator skills, and just for a bit of fun, I guess. They're potentially part of a series, but I'm not sure about the details yet.
The first one I called Fancy a Fly? (click for the full version)
And the second is called, quite simply and unimaginatively, Para-Whale.
I hope you like 'em
I'm working on some more stuff at the moment - one or two things in the same vein and something else totally unrelated.
Soju, Maekju, Galbi and Kimchi (S. Korea Day One)
From now on, I'll be updating from this blog with brief descriptions and photos of what I've been up to on my travels this summer - finally something more interesting than my dissertation!
My sister, Natalie, and her boyfriend, Nick, have been staying in Dangjin, Chungcheongnam-do in South Korea, working as English language teaching assistants at a Korean school. After finishing all my Uni work, I took a couple of flights over to see them for a couple of weeks, so that's what I'll be blogging about for a while. I'll write a post for each day of my trip, as if from the end of that day, and post-date them to catch up.
You know in films when they play stereotypical Asian music to set the scene when our protagonist arrives in an Asian country? That's what Incheon airport is like. It looks, feels and sounds like a film, and probably even smells like one.
Outside of the airport isn't much different, but not as beautiful today as I'd been hoping for. The weather isn't great. But the few people I've spoken to so far have been extremely helpful and friendly: the lady at the information desk, the bus driver, and the person on the bus who stopped the driver from driving on, saving me from going too far!
So on my first night, I met N&N's friends, Steph, Danny and Dave. They're all teachers, from the US and South Africa. All lovely people, very welcoming and very good at breaking the ice with a bit of Soju (a vodka-like Korean liqueur) in my Maekju (beer). Very effective. We hit a restaurant called Don S Top (or Don's Top? or Don Stop? who knows) and ate Baechu Kimchi (spicy marinated cabbage) and Galbi (marinated beef you cook on a grill at your own table!). Pretty awesome, tasted amazing and a really fun way to eat: just stuff it all into your mouth at once.
One of the many odd things (at least to a Korea newbie like me) about Korea is its obsession with things like "wellbeing" - to symbolize this, they have a tree - a plastic tree - in bars, ironically. A plastic tree, symbolizing wellbeing, in a bar full of booze. Oh, and a lady with a baby! Go figure.
Oh, and by the way, I saw a Starcraft game on TV. What.
This Blog Has a Video In It.
Onda: Tutorial & Demo from Daniel Hough on Vimeo.
It's beginning to feel pretty complete
The Website
So for the past few weeks the website has been up. I'm not going to rant and rave about the URL yet because I'm not entirely convinced it's ready.
The clustering is reasonably accurate, and the interface is coming along very well. There are a few issues with efficiency though. Some of the pages are slow, particularly the ones which involve deeply-joined queries (finding all the clusters used by a given source requires the sources table linked to the articles table linked to the clusters table, for example). I have come up with better ways to do it though, and I've implemented a little "be patient, loading!" screen for particular pages using JavaScript.
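To make the three-table join concrete, here is a minimal sketch using Python's built-in sqlite3 module. The schema, column names and the helper names `clusters_for_source` and `demo_connection` are all hypothetical (the post does not say which database or schema the site uses), and the index on the join column is just one generic way to speed up this kind of query, not necessarily the fix used on the site.

```python
import sqlite3


def clusters_for_source(conn, source_name):
    """Return the distinct cluster ids containing articles from one source."""
    rows = conn.execute(
        """
        SELECT DISTINCT c.id
        FROM sources AS s
        JOIN articles AS a ON a.source_id = s.id
        JOIN clusters AS c ON c.id = a.cluster_id
        WHERE s.name = ?
        ORDER BY c.id
        """,
        (source_name,),
    ).fetchall()
    return [row[0] for row in rows]


def demo_connection():
    """Build a tiny in-memory database with the hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sources (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE clusters (id INTEGER PRIMARY KEY, label TEXT);
        CREATE TABLE articles (
            id INTEGER PRIMARY KEY,
            source_id INTEGER REFERENCES sources(id),
            cluster_id INTEGER REFERENCES clusters(id),
            title TEXT
        );
        -- Indexing the join column avoids scanning every article.
        CREATE INDEX idx_articles_source ON articles(source_id);
        INSERT INTO sources VALUES (1, 'Guardian'), (2, 'Telegraph');
        INSERT INTO clusters VALUES (10, 'election'), (11, 'football');
        INSERT INTO articles VALUES
            (100, 1, 10, 'Poll results'),
            (101, 1, 11, 'Cup final'),
            (102, 2, 10, 'Vote count');
        """
    )
    return conn
```

Calling `clusters_for_source(demo_connection(), "Guardian")` walks sources to articles to clusters in a single query rather than three separate lookups.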
All of the compulsory requirements for the project have been fulfilled to a decent extent, though there are a few more insights I wish to develop for looking at diversity. However, the optional requirement (a bit of an oxymoron you'd think) of sentiment detection is not yet fulfilled, unfortunately. It probably would not be a particularly difficult thing to do, but since there are more pressing issues I thought I'd leave it until I've made more progress on the report.
There is one persistent problem which bothers me: sometimes (at least twice a week) there seem to be duplicate clusters. One article will be just below the similarity threshold for a cluster, so it'll create a new one instead of joining with it.
As I see it there are at least two solutions to this problem:
- 'Candidate clusters': new clusters are created by the normal method, and any new cluster which is particularly similar (using a higher threshold than the standard one) to an existing cluster can be merged with that cluster.
- A supervised method, making use of collective intelligence. Users can specify when two clusters are, as they see it, on the same topic. Then, they will be either merged together after a number of votes or they will be flagged for an administrator (me) to merge them.
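The supervised option could be sketched along these lines. This is only an illustration under stated assumptions: the class name `MergeVotes`, the `votes_needed` default of 3 and the returned status strings are hypothetical, and each user is assumed to count once per pair of clusters.

```python
class MergeVotes:
    """Tally user votes saying that two clusters cover the same topic."""

    def __init__(self, votes_needed=3, auto_merge=False):
        self.votes_needed = votes_needed  # hypothetical default
        self.auto_merge = auto_merge      # False: leave the decision to an admin
        self.voters = {}                  # (cluster_a, cluster_b) -> set of user ids

    def vote(self, user_id, cluster_a, cluster_b):
        # Sort the ids so that (3, 5) and (5, 3) are treated as the same pair.
        pair = tuple(sorted((cluster_a, cluster_b)))
        # A set ignores repeat votes from the same user.
        self.voters.setdefault(pair, set()).add(user_id)
        if len(self.voters[pair]) < self.votes_needed:
            return "pending"
        return "merge" if self.auto_merge else "flag_for_admin"
```

With `auto_merge=False` the pair is only flagged once enough distinct users agree, which matches the more cautious of the two behaviours described in the list.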
However, I'm not entirely sure I'll get any of this done before the report is written. If not, it'll make good stuff to discuss in the report.
Success!
So the training corpus has been counted and the clustering functions are in place. There are a few options:
- Cluster method: single link, group average or complete link
- Cluster type: agglomerative hierarchical or flat
- Threshold: between 0 and 1, for the level of similarity required for two vector models to be considered on potentially the same topic
- Normalize? Should term frequencies be normalized or not?
- Title weight: the amount of weight the terms in the title are given for the counts and normalized term frequencies
- Leading section percentage: the percentage of the start of the article which is considered the "leading section"
- Leading section weight: same as title weight. This hasn't proved to have much of an effect, actually - in fact, at some levels it totally screws up classification
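As a rough illustration of how the weighting options might combine into one term vector, here is a minimal sketch. It is not the project's code: the function name `article_vector`, whitespace tokenisation, treating the weights as simple multipliers on term counts, and normalising by the total count are all assumptions.

```python
from collections import Counter


def article_vector(title, body, title_weight=1.0, lead_percent=0.0,
                   lead_weight=1.0, normalize=True):
    """Build a sparse term vector (dict of term -> weight) for one article."""
    body_terms = body.lower().split()
    # Number of body terms that fall inside the "leading section".
    lead_count = int(len(body_terms) * lead_percent / 100)

    counts = Counter()
    for term in title.lower().split():
        counts[term] += title_weight          # title terms count extra
    for position, term in enumerate(body_terms):
        counts[term] += lead_weight if position < lead_count else 1.0

    if normalize:
        total = sum(counts.values())
        if total:
            # Assumed normalisation: divide each count by the total.
            return {term: count / total for term, count in counts.items()}
    return dict(counts)
```

Two such vectors can then be compared with a similarity measure and checked against the threshold option.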
Group average clustering works well and is more efficient than both single link and complete link. Whichever method is chosen in the end, an 'average' model will be formed for each cluster, since it determines how a cluster is represented in human-readable terms: the article in the cluster which is closest to the average will be used to describe the cluster.
So I ran 144 tests which all used flat clustering, originally with no variation on leading section weight or percentage. Once I'd determined the best settings to use, I tried some slight variation, but unfortunately it didn't particularly help the results. Thus, I doubt I'll be putting any weight on the leading terms.
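Picking the article closest to a cluster's average could look something like the sketch below. It assumes articles are sparse dicts of term weights compared by cosine similarity; the post does not name the similarity measure, and the helper names `cosine`, `centroid` and `representative` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Average of a non-empty list of sparse vectors."""
    total = {}
    for vector in vectors:
        for term, weight in vector.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def representative(cluster):
    """cluster maps article title -> vector; return the title nearest the average."""
    average = centroid(list(cluster.values()))
    return max(cluster, key=lambda title: cosine(cluster[title], average))
```

For a cluster with one article about "election", one about "vote" and one mentioning both, the article mentioning both sits closest to the average and would be used as the description.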
Conversely, putting a weight on the title is very helpful, and works especially well with a high threshold of around 0.60. There were 415 articles being evaluated in a number of ways described briefly below. Anyway, the full results are here.
The main things to look at are purity and F-measure. Purity is a measure of the crossover between classes (training clusters) and the generated clusters. F-measure combines precision and recall, and can be weighted to place more importance on one of them; the higher it is the better. I weighted it towards precision - i.e., I wanted the system to generate as many true positives as possible and as few false positives as possible. I did this because I figure that it's less detrimental to the accuracy of the system if articles are accidentally given their own cluster than it is if they are accidentally clustered with unrelated articles, which is the false positive case. Rand is the Rand measure, which is the proportion of pairwise decisions (true and false positives and negatives) that are correct. NMI is the normalized mutual information. It tells us how our information about the pre-determined classes improves as we are told what the clusters are - thus a high NMI is better. Across all the tests, NMI tends to stay around the 0.27 mark. For more information on all of these measures see this page.
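For readers who want to see how these measures are computed, here is a minimal sketch using their standard textbook definitions. It is not the evaluation code behind the results above: the function names are hypothetical, and `beta=0.5` is an assumed value chosen only because any beta below 1 weights precision more heavily than recall.

```python
import math
from collections import Counter
from itertools import combinations


def purity(clusters, classes):
    """Fraction of items that belong to the majority class of their cluster."""
    by_cluster = {}
    for cluster_id, class_id in zip(clusters, classes):
        by_cluster.setdefault(cluster_id, Counter())[class_id] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(clusters)


def pair_counts(clusters, classes):
    """Count true/false positives/negatives over all pairs of items."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = classes[i] == classes[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1   # clustered together but different topics
        elif same_class:
            fn += 1   # same topic but split apart
        else:
            tn += 1
    return tp, fp, fn, tn


def rand_index(tp, fp, fn, tn):
    """Proportion of pairwise decisions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def f_measure(tp, fp, fn, beta=0.5):
    """Weighted harmonic mean of precision and recall (beta < 1 favours precision)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def nmi(clusters, classes):
    """Mutual information divided by the mean of the two entropies."""
    n = len(clusters)
    cluster_sizes = Counter(clusters)
    class_sizes = Counter(classes)
    joint = Counter(zip(clusters, classes))
    mutual = sum(
        (count / n) * math.log(n * count / (cluster_sizes[k] * class_sizes[j]))
        for (k, j), count in joint.items()
    )

    def entropy(sizes):
        return -sum((s / n) * math.log(s / n) for s in sizes.values())

    denominator = (entropy(cluster_sizes) + entropy(class_sizes)) / 2
    return mutual / denominator if denominator else 0.0
```

Here `clusters` and `classes` are parallel lists giving, for each article, the generated cluster id and the hand-assigned class.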
My 'favourite' configuration is number 108 - Flat, Complete Link clustering with a 0.6 similarity threshold and a title weight of x19, and no weight on the leading text. This gives only 6 false positives and has an excellent purity score - NMI is around 0.28, one of the higher values.
You may notice that none of the configurations use agglomerative clustering. This is because in the time it took for 16 tests using flat clustering, one full agglomerative test over the 415 articles wasn't even finished. In other words, it's incredibly slow - so slow that I certainly will not be using it in the final system, since a requirement is to update frequently. If this is how long it takes over 415 articles, what about 1200 (the local corpus) or 7000 (the corpus on the server)?
Later on I will probably run some agglomerative tests for measurable proof, but only if there's time.
In the meantime, I've run the clusterer over the 1200 locally-stored articles and will soon work on some pretty graphs and measures for them. Until then, adios!
Monitoring the Feeds
I've finally finished the RSS Parsing & HTML Parsing section of the project. Since about 0:00 this morning (26/01/2010) the system has collected 180 unique articles from the Daily Mail, the Guardian, the Telegraph and the Express.
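For a flavour of what the RSS side involves, here is a minimal sketch using only Python's standard library. It is not the project's parser: the function names `parse_rss` and `collect_unique` are hypothetical, it assumes well-formed RSS 2.0, and treating the link as the key for "unique" is an assumption. Fetching the feed text (for example with `urllib.request`) is left out.

```python
import xml.etree.ElementTree as ET


def parse_rss(xml_text):
    """Return (title, link) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        items.append((title, link))
    return items


def collect_unique(feeds, seen=None):
    """Gather items from several feed documents, skipping links already seen."""
    seen = set() if seen is None else seen
    fresh = []
    for xml_text in feeds:
        for title, link in parse_rss(xml_text):
            if link and link not in seen:
                seen.add(link)
                fresh.append((title, link))
    return fresh
```

Passing the same `seen` set on every polling run means an article that stays in a feed for several hours is only collected once.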
I'm going to self-cluster these articles as they come in and soon enough will begin developing the modular system which represents articles as (for the time being, just) vectors, and the methods needed to compare and cluster them. Then accuracy can be measured, settings tweaked and algorithms debugged until I find the best configuration.
After that, the mammoth task of just letting it run for ages begins, while in the meantime I a) begin a report about this crazy adventure and b) work on some rad visualisations for the data collected.
That's the plan at least. Wish me luck!
The wait has begun
The survey & analysis section of my dissertation is complete, and I'm really seeing the project as a big picture now as opposed to a jumble of ideas floating around in my head. The more I got through the (25-page-limit) report, the more ideas I wanted to incorporate, but unfortunately the limit isn't only page count but also time, so some of my favourite aspects of the system (query engine, interactive graphs and visualisations, sentiment classification) may end up falling by the wayside if other University modules, or the more crucial aspects of the system, become more time-consuming than anticipated.
However, I am particularly excited about the idea of all the results being visualised in interesting and interactive ways. I have recently been reading David McCandless's Information is Beautiful and been inspired by the different types of charts he uses, such as Treemaps or Bubble Charts. These, and some others, should be able to be procedurally generated, though some of the more interesting-looking and themed charts would not be so easy.
Anyway, now I'm in Australia for Christmas and looking forward to starting revision, whoo!
Learning
Since my last post I've been doing a lot of reading up on the background to what my project has become. The structure of the project has changed slightly, with a lot less emphasis on the Text Reuse aspect and more focus on Diversity, which means that the system will be heavily focussed on measuring diversity between and within sources in online news.
I have been writing up my Survey & Analysis report, as well as doing some Python programming experimentation, reading RSS feeds and parsing HTML pages from a few newspapers' websites.
Irritatingly, some newspapers insist on having faulty HTML websites. But if Firefox can render them then I should be able to parse them, so I'll have to work on that. It is not a high-priority task at the moment though, as I simply want to be looking at a snapshot of the news in a window of a couple of days, to see what the differences are so that perhaps I can use what I have measured myself as training data, and use my own measurement as something for the system to learn by.
Research-wise, I have been delving into the ACM digital library and downloading many, many papers about story link detection and article clustering - hopefully if I can get SLD working well, clustering will be a simple task. Clustering is a huge section of the system because once models (either language models or vector space models, or something else) have been constructed, perhaps I can move onto more complex tasks such as topic detection, or more fun and visually satisfying tasks of sentiment detection, and generating understandable graphs. We'll see!
Progress so far
Last week I chose the topic of my dissertation, so immediately I began to do some research into the techniques I'll need to understand in order to complete it.
Essentially, I need to design and implement a system which can investigate diversity in a number of ways between and within sources of news, including (but not necessarily limited to) language, topics, attention to detail and reuse of text. It will do so by crawling through the RSS feeds and possibly corpora of older news material, detecting topics and comparing articles within topics.
There is a wide variety of techniques and technologies required for this, most of which I have not fully investigated yet, but naturally I will be soon. I will be using Python, since it seems very suitable for the project. Python has a comprehensive standard library, including modules for reading RSS files and a number of string functions which should be useful for text processing, and it is widely used for text processing.
Furthermore, it is cross-platform, and although development will be mostly done on a Windows PC, it will actually be running on a Linux-based server.
Experiments with Python
I'm quite new to Python, but so far I've found it simple to pick up, and most things that I've needed so far have been built-in. I've read about reading RSS files with Python and created a very simple RSS reader, and covered Dot plot, a technique for comparing DNA sequences which has been used in the past to compare text by Ken Church and Jonathan Helfman. I have begun work on a text-based Dotplot program too, which was also very simple. Eventually both these things will make their way into the final product, but for now it's just learning and getting into the mindset of text processing.
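The dotplot idea is simple enough to sketch in a few lines. This is a minimal illustration rather than the program mentioned above: it compares whole words split on whitespace, and the function names `dotplot` and `render` are hypothetical.

```python
def dotplot(tokens_a, tokens_b):
    """Return a grid where cell [i][j] is True when tokens_a[i] == tokens_b[j]."""
    return [[a == b for b in tokens_b] for a in tokens_a]


def render(grid):
    """Draw the grid as text: '#' for a match, '.' for no match."""
    return "\n".join(
        "".join("#" if cell else "." for cell in row) for row in grid
    )


if __name__ == "__main__":
    first = "the cat sat on the mat".split()
    second = "the cat lay on the mat".split()
    print(render(dotplot(first, second)))
```

Runs of shared text show up as diagonal lines of matches, which is what makes the technique useful for spotting reused passages between two articles.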
Diversity in Online News
Welcome to my blog. For a while, the only topic is likely to be my 3rd Year Project at the University of Sheffield, which is on Diversity in Online News. After that, I may just leave it or I may branch out, we'll see how it goes.


