<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Basically, Dan &#187; Experiments</title>
	<atom:link href="http://danielhough.co.uk/blog/tag/experiments/feed/" rel="self" type="application/rss+xml" />
	<link>http://danielhough.co.uk/blog</link>
	<description>One long adventure.</description>
	<lastBuildDate>Sun, 01 Aug 2010 14:47:00 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Learning</title>
		<link>http://danielhough.co.uk/blog/2009/11/learning/</link>
		<comments>http://danielhough.co.uk/blog/2009/11/learning/#comments</comments>
		<pubDate>Mon, 23 Nov 2009 12:41:08 +0000</pubDate>
		<dc:creator>Dan</dc:creator>
				<category><![CDATA[Dissertation]]></category>
		<category><![CDATA[Experiments]]></category>

		<guid isPermaLink="false">http://danielhough.co.uk/blog/?p=13</guid>
		<description><![CDATA[Since my last post I've been doing a lot of reading up on the background topic my project has become. The structure of the project has changed slightly, with a lot less emphasis on the Text Reuse aspect and more focus on Diversity, which means that the system will be heavily focussed on measuring diversity [...]]]></description>
			<content:encoded><![CDATA[<p>Since my last post I've been doing a lot of reading up on the background topic my project has become. The structure of the project has changed slightly, with a lot less emphasis on the <em>Text Reuse</em> aspect and more focus on <em>Diversity</em>, which means that the system will be heavily focussed on measuring diversity between and within sources in online news.</p>
<p>I have been writing up my Survey &amp; Analysis report, as well as doing some Python programming experimentation, reading RSS feeds and parsing HTML pages from a few newspapers' websites.</p>
<p>Irritatingly, some newspapers insist on having websites with faulty HTML. But if Firefox can render them then I should be able to parse them, so I'll have to work on that. It is not a high-priority task at the moment, though. For now I simply want to look at a snapshot of the news over a window of a couple of days and see what the differences are, so that perhaps I can use my own measurements as training data for the system to learn from.</p>
<p>Research-wise, I have been delving into the ACM digital library and downloading many, many papers about story link detection (SLD) and article clustering. Hopefully, if I can get SLD working well, clustering will be a simple task. Clustering is a huge part of the system because, once models (either language models, vector space models, or something else) have been constructed, perhaps I can move on to more complex tasks such as topic detection, or more fun and visually satisfying tasks such as sentiment detection and generating understandable graphs. We'll see!</p>
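<p>As a rough illustration of the vector space idea, the sketch below turns two pieces of text into word-count vectors and compares them with cosine similarity. The function names (<code>tokenize</code>, <code>cosine_similarity</code>) and the example headlines are made up for illustration. It is only one simple way such a comparison could work, not the model the system will necessarily use.</p>

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # Dot product: only words present in both texts contribute.
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a if word in counts_b)
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty text has no direction, so treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical headlines, purely for illustration.
    print(cosine_similarity("Bank cuts interest rates", "Interest rates cut by bank"))
    print(cosine_similarity("Bank cuts interest rates", "Local team wins cup final"))
```

<p>A pair of articles scoring above some threshold could then be treated as linked. What that threshold should be is an open question that would have to be settled by experiment.</p>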
]]></content:encoded>
			<wfw:commentRss>http://danielhough.co.uk/blog/2009/11/learning/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Progress so far</title>
		<link>http://danielhough.co.uk/blog/2009/10/progress-so-far/</link>
		<comments>http://danielhough.co.uk/blog/2009/10/progress-so-far/#comments</comments>
		<pubDate>Mon, 05 Oct 2009 17:06:41 +0000</pubDate>
		<dc:creator>Dan</dc:creator>
				<category><![CDATA[Dissertation]]></category>
		<category><![CDATA[Dot plot]]></category>
		<category><![CDATA[Experiments]]></category>

		<guid isPermaLink="false">http://danielhough.co.uk/blog/?p=5</guid>
		<description><![CDATA[Last week I chose the topic of my dissertation, so immediately I began to do some research into the techniques I'll need to understand in order to complete it. Essentially, I need to design and implement a system which can investigate diversity in a number of ways between and within sources of news, including (but [...]]]></description>
			<content:encoded><![CDATA[<p>Last week I chose the topic of my dissertation, so immediately I began to do some research into the techniques I'll need to understand in order to complete it.</p>
<p>Essentially, I need to design and implement a system which can investigate diversity in a number of ways between and within sources of news, including (but not necessarily limited to) <strong>language</strong>, <strong>topics</strong>, <strong>attention to detail</strong> and <strong>reuse of text</strong>. It will do so by crawling through RSS feeds and possibly corpora of older news material, detecting topics and comparing articles within topics.</p>
<p>There is a wide variety of techniques and technologies required for this, most of which I have not fully investigated yet, but naturally I will be soon. I will be using Python for the project, since it seems very well suited to the job. Python has a comprehensive standard library, including XML modules that can be used for reading RSS files and a number of string functions which should be useful for text processing, and the language is widely used for that kind of work.</p>
<p>Furthermore, Python is cross-platform: although development will mostly be done on a Windows PC, the system will actually run on a Linux-based server.</p>
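<p>To give an idea of how little code reading a feed takes, here is a minimal sketch using the standard library's <code>xml.etree.ElementTree</code>. The feed is built in memory rather than downloaded, so the example is self-contained. The helper names (<code>build_sample_feed</code>, <code>read_items</code>) and the example.com URLs are hypothetical.</p>

```python
import xml.etree.ElementTree as ET


def build_sample_feed():
    """Build a tiny RSS 2.0 document in memory (a stand-in for a downloaded feed)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Example News"
    for number in (1, 2):
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = "Story %d" % number
        ET.SubElement(item, "link").text = "http://example.com/story-%d" % number
    return ET.tostring(rss, encoding="unicode")


def read_items(feed_xml):
    """Return a list of (title, link) pairs, one per item in the feed."""
    root = ET.fromstring(feed_xml)
    return [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in root.iter("item")
    ]


if __name__ == "__main__":
    for title, link in read_items(build_sample_feed()):
        print(title, link)
```

<p>For a real feed, the string passed to <code>read_items</code> would come from an HTTP download instead of <code>build_sample_feed</code>. Elements in other namespaces, such as <code>content:encoded</code>, would need namespace-aware lookups that this sketch leaves out.</p>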
<h2>Experiments with Python</h2>
<div id="attachment_10" class="wp-caption alignright" style="width: 310px"><a href="http://danielhough.co.uk/blog/wp-content/uploads/2009/10/20091005-dotplotprogress.PNG"><img class="size-medium wp-image-10" title="Dot Plot Progress" src="http://danielhough.co.uk/blog/wp-content/uploads/2009/10/20091005-dotplotprogress-300x202.PNG" alt="A screenshot of my progress with a simple dotplot program" width="300" height="202" /></a><p class="wp-caption-text">A screenshot of my progress with a simple dotplot program</p></div>
<p>I'm quite new to Python, but so far I've found it simple to pick up, and most things that I've needed have been built in. I've read about reading RSS files with Python and created a very simple RSS reader, and I've read up on <a title="Dot plot on Wikipedia" href="http://en.wikipedia.org/wiki/Dot_plot_%28bioinformatics%29" target="_blank">Dot plot</a>, a technique for comparing DNA sequences which Ken Church and Jonathan Helfman have used in the past to compare text. I have begun work on a text-based Dotplot program too, which was also very simple. Eventually both of these things will make their way into the final product, but for now it's just learning and getting into the mindset of text processing.</p>
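<p>The core of a text-based dot plot can be sketched in a few lines: put a mark at position (i, j) whenever token i of the first text equals token j of the second. The code below is a minimal illustration under that assumption, not the program shown in the screenshot, and the names <code>dot_plot</code> and <code>render</code> are chosen purely for the example.</p>

```python
def dot_plot(tokens_a, tokens_b):
    """Return a grid where grid[i][j] is True when tokens_a[i] equals tokens_b[j]."""
    return [[a == b for b in tokens_b] for a in tokens_a]


def render(grid):
    """Draw the grid as text: '*' for a match, '.' for no match."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in grid)


if __name__ == "__main__":
    # Compare a short phrase with itself, one word per row and column.
    words = "to be or not to be".split()
    print(render(dot_plot(words, words)))
```

<p>Comparing a text with itself always fills the main diagonal. Repeated phrases show up as shorter diagonal runs away from it, which is what makes reused text easy to spot.</p>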
]]></content:encoded>
			<wfw:commentRss>http://danielhough.co.uk/blog/2009/10/progress-so-far/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
