Last week I chose the topic of my dissertation, so immediately I began to do some research into the techniques I’ll need to understand in order to complete it.
Essentially, I need to design and implement a system which can investigate diversity between and within news sources in a number of ways, including (but not necessarily limited to) language, topics, attention to detail and reuse of text. It will do so by crawling RSS feeds, and possibly corpora of older news material, detecting topics and comparing articles within each topic.
A wide variety of techniques and technologies is required for this, most of which I have not fully investigated yet, but naturally I will soon. I will be using Python, since it seems very well suited to the project. Python has a comprehensive standard library, including modules for parsing XML formats such as RSS and a number of string functions which should be useful here, and it is widely used for text processing.
Furthermore, it is cross-platform: although development will mostly be done on a Windows PC, the finished system will actually run on a Linux-based server.
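To give an idea of how little code reading a feed takes, here is a minimal sketch using only the standard library's xml.etree.ElementTree module. The sample feed, the read_items function and the choice of fields are all made up for illustration; a real crawler would download the feed first and would need to cope with malformed or non-RSS input.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, invented for this example.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First story</title>
      <link>http://example.com/1</link>
      <description>Something happened.</description>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/2</link>
      <description>Something else happened.</description>
    </item>
  </channel>
</rss>"""


def read_items(rss_text):
    """Return a list of dicts, one per <item> element in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns the default when the child element is missing
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


for entry in read_items(SAMPLE_RSS):
    print(entry["title"], "-", entry["link"])
```

Returning plain dictionaries keeps the sketch independent of any particular feed library, so the same structure could later be filled from a different parser if the standard library turns out not to be enough.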
Experiments with Python
I’m quite new to Python, but so far I’ve found it simple to pick up, and most of what I’ve needed has been built in. I’ve read about reading RSS files with Python and written a very simple RSS reader. I’ve also read up on the dotplot, a technique for comparing DNA sequences which Ken Church and Jonathan Helfman have applied to comparing text. I have begun work on a text-based dotplot program too, which has also been very simple so far. Eventually both of these will make their way into the final product, but for now I’m just learning and getting into the mindset of text processing.
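For anyone who hasn't met the dotplot idea, here is a rough sketch of a text-based version. It is not the program I am working on, nor Church and Helfman's implementation, just the basic principle under simple assumptions: split each text into lowercase words, then mark every position where a word from the first text equals a word from the second. The function names (tokenize, dotplot, render) are my own for this example.

```python
def tokenize(text):
    """Split text into lowercase, whitespace-separated words."""
    return text.lower().split()


def dotplot(tokens_a, tokens_b):
    """Build a matrix where cell [i][j] is True if tokens_a[i] == tokens_b[j]."""
    return [[a == b for b in tokens_b] for a in tokens_a]


def render(matrix):
    """Draw the matrix as text: '*' for a match, '.' for no match."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )


a = tokenize("the cat sat on the mat")
b = tokenize("the dog sat on the log")
print(render(dotplot(a, b)))
```

Each row is a word from the first text and each column a word from the second. Runs of matches along a diagonal show where the same sequence of words appears in both texts, which is what makes the technique interesting for spotting reused text. Comparing every word with every other word takes time and memory proportional to the product of the two lengths, so this naive version would only suit short articles.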

