<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>masochismtango &#187; Dev</title>
	<atom:link href="http://masochismtango.com/development/feed/" rel="self" type="application/rss+xml" />
	<link>http://masochismtango.com</link>
	<description></description>
	<lastBuildDate>Sat, 17 Mar 2012 15:41:43 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.2</generator>
		<item>
		<title>MEX Conference 2011</title>
		<link>http://masochismtango.com/2011/05/22/mex-conference-2011/</link>
		<comments>http://masochismtango.com/2011/05/22/mex-conference-2011/#comments</comments>
		<pubDate>Sun, 22 May 2011 17:44:52 +0000</pubDate>
		<dc:creator>Tom Godber</dc:creator>
				<category><![CDATA[Dev]]></category>
		<category><![CDATA[Mobile]]></category>
		<category><![CDATA[Usability]]></category>
		<category><![CDATA[austerity]]></category>
		<category><![CDATA[conference]]></category>
		<category><![CDATA[MEX]]></category>
		<category><![CDATA[network]]></category>
		<category><![CDATA[slides]]></category>
		<category><![CDATA[usability]]></category>

		<guid isPermaLink="false">http://masochismtango.com/?p=935</guid>
		<description><![CDATA[Slides from my talk at this May&#8217;s MEX (Mobile User Experience) conference in London, where I gave the first presentation on the &#8220;Efficient UX Techniques for an Age of Network Austerity&#8221; pathway: Mex 2011 &#8211; Efficient UX Techniques for an Age of Network Austerity The slides walk through steps Masabi has taken to minimise dependency [...]]]></description>
			<content:encoded><![CDATA[<p><img src='http://masochismtango.com/wp-content/plugins/simple-post-thumbnails/timthumb.php?src=/wp-content/thumbnails/935.jpg&amp;w=48&amp;h=48&amp;zc=1&amp;ft=jpg' alt='post thumbnail' /></p>
<p>Slides from my talk at this May&#8217;s <a href="http://www.pmn.co.uk/">MEX</a> (Mobile User Experience) conference in London, where I gave the first presentation on the &#8220;Efficient UX Techniques for an Age of Network Austerity&#8221; pathway:</p>
<div style="width:425px" id="__ss_7953775"> <strong style="display:block;margin:12px 0 4px"><a href="http://www.slideshare.net/masabi/mex-2011-efficient-ux-techniques-for-an-age-of-network-austerity" title="Mex 2011 - Efficient UX Techniques for an Age of Network Austerity">Mex 2011 &#8211; Efficient UX Techniques for an Age of Network Austerity</a></strong> <iframe src="http://www.slideshare.net/slideshow/embed_code/7953775" width="425" height="355" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe></div>
<p>The slides walk through steps <a href="http://www.masabi.com/">Masabi</a> has taken to minimise dependency on network uptime in our travel apps, and why that matters.</p>
<p>The whole conference was incredibly well put together &#8211; props to Marek for that &#8211; and encouraged some stimulating debate through its unique interactive workshops.  Nice food too!  Highly recommended to anyone interested in mobile&#8230;</p>
<p style="text-align:center"><a href="https://picasaweb.google.com/marekp/MEXMay2011#"><img src="https://lh4.googleusercontent.com/-qK5Hbl_nRPM/TcPwGqVWLiI/AAAAAAAAE0Y/Lf5Ex8Mjf_I/s912/DSC_0676.JPG" title="Presenting at MEX 2011" width="500"/></a></p>
]]></content:encoded>
			<wfw:commentRss>http://masochismtango.com/2011/05/22/mex-conference-2011/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Fixing Eclipse Update Issues</title>
		<link>http://masochismtango.com/2010/03/02/fixing-eclipse-update-issues/</link>
		<comments>http://masochismtango.com/2010/03/02/fixing-eclipse-update-issues/#comments</comments>
		<pubDate>Tue, 02 Mar 2010 11:38:02 +0000</pubDate>
		<dc:creator>Tom Godber</dc:creator>
				<category><![CDATA[Dev]]></category>
		<category><![CDATA[Mobile]]></category>
		<category><![CDATA[Eclipse]]></category>

		<guid isPermaLink="false">http://masochismtango.com/?p=529</guid>
		<description><![CDATA[Trials, tribulations and fixes to handle an update to the latest Eclipse IDE.]]></description>
			<content:encoded><![CDATA[<p><img src='http://masochismtango.com/wp-content/plugins/simple-post-thumbnails/timthumb.php?src=/wp-content/thumbnails/529.jpg&amp;w=48&amp;h=48&amp;zc=1&amp;ft=jpg' alt='post thumbnail' /></p>
<p>After a bit of a break, I&#8217;m about to start a stint of Blackberry development and really wanted to try out the new Blackberry JDE integration with Eclipse &#8211; something that promises to reduce the immense tedium of running Blackberry simulators somewhat.  Anyone who has ever tried to do that will understand how valuable this could be, both financially (time is money after all) and to your sanity.</p>
<p>The plugin requires at least Eclipse 3.4, though, and I was stuck way back on 3.3.  Eclipse was reluctant to update itself to any new version from any of the obvious &#8220;update&#8221; menu items, so I went for the simple brute force method:</p>
<ol>
<li>Zip the old Eclipse app folder, then delete it</li>
<li>Download the latest Eclipse, and add the latest version of whatever plugins are needed</li>
<li>Reattach to the old workspace folder.</li>
</ol>
<p>This initially appeared to work, but didn&#8217;t.</p>
<h3>Ant Integration</h3>
<p>The most visible problem was that Ant builds would no longer run.  They&#8217;d start, and the red &#8216;stop&#8217; button on the console would light up (indicating I could stop the running Ant process, not that it was stopped) but no logging at all reached the console.  No dialogues appeared explaining the problem.</p>
<p>The clue lay in the workspace&#8217;s <code>.metadata/.log</code> file &#8211; there were two exceptions, at least one of which was being thrown every time I tried to run Ant:</p>
<pre><code>!ENTRY org.eclipse.core.resources 4 75 2010-03-01 21:17:55.921
!MESSAGE Errors occurred during the build.
!SUBENTRY 1 org.eclipse.mtj.core 2 75 2010-03-01 21:17:55.921
!MESSAGE Errors running builder 'Preverification Builder' on project 'Framework'.
!STACK 1
<strong>org.eclipse.core.runtime.CoreException: Build state machine has not been initialized.</strong></code></pre>
<p>or</p>
<pre><code>!ENTRY org.eclipse.ant.ui 4 120 2010-03-01 21:21:16.468
!MESSAGE Error logged from Ant UI:
!STACK 0
<strong>java.net.SocketTimeoutException: Accept timed out</strong></code></pre>
<p>Not, admittedly, much of a clue but enough to eventually track down the problem.  Ant&#8217;s configuration &#8211; in particular, the locations of its jars &#8211; is stored in your workspace, despite it being a plugin integrated into Eclipse.  If the location of Ant&#8217;s plugin folder changes, Ant stops working with this workspace.</p>
<p>To fix the problem, go to <i>Preferences</i> &gt; <i>Ant</i> &gt; <i>Runtime</i>.  Remove all jars under <i>Ant Home Entries</i>, and then add the new versions from the new Eclipse plugin folder (as an External Jar Location).  Apply the changes, and your builds should run again.</p>
<h3>JavaME Emulation</h3>
<p>The JavaME plugin is notorious for introducing breaking changes whenever it updates.  This time was no exception &#8211; my JavaME projects appeared fine in the IDE, but produced the following exception (to the console, at least) whenever a WTK emulator was run:</p>
<pre><code>Running with storage root C:\Documents and Settings\Tom\j2mewtk\2.5.2\appdb\rms
Running with locale: English_United Kingdom.1252
Running in the identified_third_party security domain
<strong>java.lang.ClassNotFoundException: framework/midp/Application</strong>
	at com.sun.midp.midlet.MIDletState.createMIDlet(+29)
	at com.sun.midp.midlet.Scheduler.schedule(+52)
	at com.sun.midp.main.Main.runLocalClass(+28)
	at com.sun.midp.main.Main.main(+80)
Execution completed.</code></pre>
<p>The fix turned out to be simple &#8211; delete the project, and check it out again.  The new version will start with fresh metadata that works with the new plugin.  Not very nice, but hardly fatal (if you&#8217;re using version control).</p>
<h3>Incompatible Plugins</h3>
<p>At the end of this, I discovered that the Blackberry JDE plugin does not support the very latest Galileo, so it was all a bit of a pointless exercise.  Such is life in mobile development&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://masochismtango.com/2010/03/02/fixing-eclipse-update-issues/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>WebHarvest: Easy Web Scraping from Java</title>
		<link>http://masochismtango.com/2010/02/15/webharvest-web-scraping-from-java/</link>
		<comments>http://masochismtango.com/2010/02/15/webharvest-web-scraping-from-java/#comments</comments>
		<pubDate>Mon, 15 Feb 2010 10:06:23 +0000</pubDate>
		<dc:creator>Tom Godber</dc:creator>
				<category><![CDATA[Dev]]></category>
		<category><![CDATA[Web]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Maven]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[scrape]]></category>
		<category><![CDATA[Tableau]]></category>
		<category><![CDATA[WebHarvest]]></category>
		<category><![CDATA[XPath]]></category>

		<guid isPermaLink="false">http://masochismtango.com/?p=478</guid>
		<description><![CDATA[A quick guide to WebHarvest's easy and powerful web scraping Java API, including code samples.]]></description>
			<content:encoded><![CDATA[<p><img src='http://masochismtango.com/wp-content/plugins/simple-post-thumbnails/timthumb.php?src=/wp-content/thumbnails/478.jpg&amp;w=48&amp;h=48&amp;zc=1&amp;ft=jpg' alt='post thumbnail' /></p>
<p>I&#8217;ve been experimenting with data visualisation for a while now, most of which is for <a href="http://www.masabi.com/">Masabi</a>&#8217;s business plan though I hope to share some offshoots soon.</p>
<p>I often have a need to quickly scrape some data out of a web page (or list of web pages), which can then be fed into Excel and on to specialist data visualisation tools like Tableau (available in a <a href="http://www.tableausoftware.com/public/">free public edition</a> here &#8211; my initial impressions are positive but it&#8217;s early days yet).</p>
<p>To this end I have turned to <a href="http://web-harvest.sourceforge.net/">WebHarvest</a>, an excellent scriptable open source API for web scraping in Java.  I really really like it, but there are some quirks and setup issues that have cost me hours so I thought I&#8217;d roll together a tutorial with the fixes.</p>
<h3>WebHarvest Config for Maven</h3>
<p>When it works, <a href="http://maven.apache.org/">Maven</a> is a lovely tool to hide dependency management for Java projects, but WebHarvest is not configured quite right out of the box to work transparently with it.  (Describing Maven is beyond the scope of this post, but if you don&#8217;t know it, it&#8217;s easy to set up with the <a href="http://m2eclipse.sonatype.org/">M2 plugin for Eclipse</a>.)</p>
<p>This is the Maven POM I ended up with to use WebHarvest in a new JavaSE project:</p>
<pre name="code" class="xml">&lt;project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd"&gt;
 &lt;modelVersion&gt;4.0.0&lt;/modelVersion&gt;
 &lt;groupId&gt;WebScraping&lt;/groupId&gt;
 &lt;artifactId&gt;WebScraping&lt;/artifactId&gt;
 &lt;packaging&gt;jar&lt;/packaging&gt;
 &lt;version&gt;0.00.01&lt;/version&gt;
 &lt;properties&gt;
 &lt;project.build.sourceEncoding&gt;UTF-8&lt;/project.build.sourceEncoding&gt;
 &lt;/properties&gt;

 &lt;build&gt;
 &lt;plugins&gt;
 &lt;plugin&gt;
 &lt;artifactId&gt;maven-compiler-plugin&lt;/artifactId&gt;
 &lt;configuration&gt;
 &lt;source&gt;1.6&lt;/source&gt;
 &lt;target&gt;1.6&lt;/target&gt;
 &lt;/configuration&gt;
 &lt;/plugin&gt;
 &lt;/plugins&gt;
 &lt;/build&gt;

 &lt;repositories&gt;
 &lt;repository&gt;
 &lt;id&gt;wso2&lt;/id&gt;
 &lt;url&gt;http://dist.wso2.org/maven2/&lt;/url&gt;
 &lt;/repository&gt;
 &lt;repository&gt;
 &lt;id&gt;maven-repository-1&lt;/id&gt;
 &lt;url&gt;http://repo1.maven.org/maven2/&lt;/url&gt;
 &lt;/repository&gt;
 &lt;/repositories&gt;
 &lt;dependencies&gt;
 &lt;dependency&gt;
 &lt;groupId&gt;commons-logging&lt;/groupId&gt;
 &lt;artifactId&gt;commons-logging&lt;/artifactId&gt;
 &lt;version&gt;1.1&lt;/version&gt;
 &lt;type&gt;jar&lt;/type&gt;
 &lt;scope&gt;compile&lt;/scope&gt;
 &lt;/dependency&gt;
 &lt;dependency&gt;
 &lt;groupId&gt;log4j&lt;/groupId&gt;
 &lt;artifactId&gt;log4j&lt;/artifactId&gt;
 &lt;version&gt;1.2.12&lt;/version&gt;
 &lt;type&gt;jar&lt;/type&gt;
 &lt;scope&gt;compile&lt;/scope&gt;
 &lt;/dependency&gt;
 &lt;dependency&gt;
 &lt;groupId&gt;org.webharvest.wso2&lt;/groupId&gt;
 &lt;artifactId&gt;webharvest-core&lt;/artifactId&gt;
 &lt;version&gt;1.0.0.wso2v1&lt;/version&gt;
 &lt;type&gt;jar&lt;/type&gt;
 &lt;scope&gt;compile&lt;/scope&gt;
 &lt;/dependency&gt;
 &lt;!-- web harvest pom doesn't track dependencies well --&gt;
 &lt;dependency&gt;
 &lt;groupId&gt;net.sf.saxon&lt;/groupId&gt;
 &lt;artifactId&gt;saxon-xom&lt;/artifactId&gt;
 &lt;version&gt;8.7&lt;/version&gt;
 &lt;/dependency&gt;
 &lt;dependency&gt;
 &lt;groupId&gt;org.htmlcleaner&lt;/groupId&gt;
 &lt;artifactId&gt;htmlcleaner&lt;/artifactId&gt;
 &lt;version&gt;1.55&lt;/version&gt;
 &lt;/dependency&gt;
 &lt;dependency&gt;
 &lt;groupId&gt;bsh&lt;/groupId&gt;
 &lt;artifactId&gt;bsh&lt;/artifactId&gt;
 &lt;version&gt;1.3.0&lt;/version&gt;
 &lt;/dependency&gt;
 &lt;dependency&gt;
 &lt;groupId&gt;commons-httpclient&lt;/groupId&gt;
 &lt;artifactId&gt;commons-httpclient&lt;/artifactId&gt;
 &lt;version&gt;3.1&lt;/version&gt;
 &lt;/dependency&gt;
 &lt;/dependencies&gt;
&lt;/project&gt;</pre>
<p>You&#8217;ll note that the WebHarvest dependencies had to be added explicitly, because the jar does not come with a working pom listing them.</p>
<h3>Writing A Scraping Script</h3>
<p>WebHarvest uses XML configuration files to describe how to scrape a site &#8211; and with a few lines of Java code you can run any XML configuration and have access to any properties that the script identified from the page.  This is definitely the safest way to scrape data, as it decouples the code from the web page markup &#8211; so if the site you are scraping goes through a redesign, you can quickly adjust the config files without recompiling the code they pass data to.</p>
<p>The site has some good <a href="http://web-harvest.sourceforge.net/samples.php">example scripts</a> to show you how to get started, so I won&#8217;t repeat them here.  The easiest way to create your own is to run the WebHarvest GUI from the command line, start with a sample script, and then hack it around to get what you want &#8211; it&#8217;s an easy iterative process with good feedback in the UI.</p>
<p>As a simple example, this is a script to go to the <a href="http://developer.sonyericsson.com/device/searchDevice.do?restart=true">Sony-Ericsson developer site&#8217;s handset gallery</a> at <code>http://developer.sonyericsson.com/device/searchDevice.do?restart=true</code>, and rip each handset&#8217;s individual spec page URI:</p>
<pre name="code" class="xml">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;config&gt;
	&lt;!-- indicates we want a loop, through the list defined in &lt;list&gt;, doing &lt;body&gt; for each item where the variables uid and i are defined as the value and index of the relevant item --&gt;
	&lt;loop item="uid" index="i"&gt;
		&lt;!-- the list section defines what we will loop over - here, it pulls out the value attribute of all option tags --&gt;
		&lt;list&gt;
			&lt;xpath expression="//option/@value"&gt;
				&lt;html-to-xml&gt;
					&lt;http url="http://developer.sonyericsson.com/device/searchDevice.do?restart=true"/&gt;
				&lt;/html-to-xml&gt;
			&lt;/xpath&gt;
		&lt;/list&gt;
		&lt;!-- the body section lists instructions which are run for every iteration of the loop --&gt;
		&lt;body&gt;
			&lt;!-- we define a new variable for every iteration, using the iteration count as a suffix  --&gt;
			&lt;var-def name="uri.${i}"&gt;
				&lt;!-- template tag is important, else the $ var syntax will be ignored and won't do any value substitutions --&gt;
				&lt;template&gt;device/loadDevice.do?id=${uid}&lt;/template&gt;
			&lt;/var-def&gt;
		&lt;/body&gt;
	&lt;/loop&gt;
&lt;/config&gt;</pre>
<p>The handset URIs will end up in a list of variables, from <code>uri.1</code> to <code>uri.N</code>.</p>
<p>The XML configuration&#8217;s syntax can take a little getting used to &#8211; it appeared quite backwards to me at first, but by messing around in the GUI you can experiment and learn pretty fast.  With a basic understanding of <a href="http://www.w3schools.com/XPath/default.asp">XPath</a> to identify parts of the web page, and perhaps a little <a href="http://www.regular-expressions.info/tutorial.html">regular expression knowledge</a> to get at information surrounded by plain text, you can perform some very powerful scraping.</p>
<p>We can then define another script which will take this URI, and pull out pieces of information from the page &#8211; in this example, the handset&#8217;s name and screen resolution:</p>
<pre name="code" class="xml">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;config&gt;
	&lt;!-- get the entire page --&gt;
	&lt;var-def name="wholepage"&gt;
		&lt;html-to-xml&gt;
			&lt;!-- NEVER try and pass in the entire URL as a single variable here! --&gt;
			&lt;http url="http://developer.sonyericsson.com/${uri}"/&gt;
		&lt;/html-to-xml&gt;
	&lt;/var-def&gt;
	&lt;!-- rip out the block with the specifications --&gt;
	&lt;var-def name="specsheet"&gt;
		&lt;xpath expression="//div[@class='phone-specs']"&gt;
			&lt;var name="wholepage"/&gt;
		&lt;/xpath&gt;
	&lt;/var-def&gt;
	&lt;!-- find the handset's name --&gt;
	&lt;var-def name="name"&gt;
		&lt;xpath expression="//h5[contains(text(),'Phone Model')]/following-sibling::p[1]/text()"&gt;
			&lt;var name="specsheet"/&gt;
		&lt;/xpath&gt;
	&lt;/var-def&gt;
	&lt;!-- identify the screen resolution --&gt;
	&lt;regexp&gt;
		&lt;regexp-pattern&gt;([\d]*)x([\d]*)&lt;/regexp-pattern&gt;
		&lt;regexp-source&gt;
			&lt;xpath expression="//h5[contains(text(),'Screen Sizes')]/following-sibling::p[1]/text()"&gt;
				&lt;var name="specsheet"/&gt;
			&lt;/xpath&gt;
		&lt;/regexp-source&gt;
		&lt;regexp-result&gt;
			&lt;var-def name="screen.width"&gt;&lt;template&gt;${_1}&lt;/template&gt;&lt;/var-def&gt;
			&lt;var-def name="screen.height"&gt;&lt;template&gt;${_2}&lt;/template&gt;&lt;/var-def&gt;
		&lt;/regexp-result&gt;
	&lt;/regexp&gt;
&lt;/config&gt;</pre>
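<p>The regular expression step in the script above can also be tried out in isolation before it goes into a config.  This is only a rough sketch in plain Java using <code>java.util.regex</code>, not WebHarvest itself; the class name <code>ScreenSizeRegex</code> and the sample input are made up for illustration, and the pattern is deliberately tightened from <code>\d*</code> to <code>\d+</code> so that a stray &#8220;x&#8221; with no digits around it is not treated as a match:</p>
<pre name="code" class="java">import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Hypothetical helper, for illustration only: mirrors the regexp step of the page script.
 */
class ScreenSizeRegex
{
	/** Same shape as the script's pattern, but requiring at least one digit on each side of the x. */
	private static final Pattern SIZE = Pattern.compile("(\\d+)x(\\d+)");

	/**
	 * @param text Free text containing a resolution such as "240x320".
	 * @return An array holding width then height, or null if no resolution was found.
	 */
	static int[] parse(String text)
	{
		Matcher m = SIZE.matcher(text);
		if (!m.find())	return null;
		// group 1 and group 2 correspond to ${_1} and ${_2} in the WebHarvest script
		return new int[] { Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)) };
	}

	public static void main(String[] args)
	{
		int[] size = parse("240x320 pixels");
		System.out.println(size[0] + " wide, " + size[1] + " high");
	}
}
</pre>
<p>Once the pattern behaves as expected on a few strings copied from the real page, it can be pasted into the <code>regexp-pattern</code> tag.</p>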
<p>At this point I should note the biggest gotcha with WebHarvest, that just caused me 3 hours of hair tearing.  In the script, this line defines the page to scrape: <code>&lt;http url="http://developer.sonyericsson.com/<strong>${uri}</strong>"/&gt;</code>, where <code>${uri}</code> is a variable specified at runtime to define a URI.  This works.</p>
<p>If you were to substitute in this perfectly sensible alternative: <code>&lt;http url="<strong>${url}</strong>"/&gt;</code>, you would end up with a completely obscure runtime exception a little like this:</p>
<pre><code>Exception in thread "main" org.webharvest.exception.ScriptException: Cannot set variable in scripter: Field access: bsh.ReflectError: No such field: 1
	at org.webharvest.runtime.scripting.BeanShellScriptEngine.setVariable(Unknown Source)
	at org.webharvest.runtime.scripting.ScriptEngine.pushAllVariablesFromContextToScriptEngine(Unknown Source)
	at org.webharvest.runtime.scripting.BeanShellScriptEngine.eval(Unknown Source)
	at org.webharvest.runtime.templaters.BaseTemplater.execute(Unknown Source)
	at org.webharvest.runtime.processors.TemplateProcessor.execute(Unknown Source)
	at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
	at org.webharvest.runtime.processors.BodyProcessor.execute(Unknown Source)
	at org.webharvest.runtime.processors.VarDefProcessor.execute(Unknown Source)
	at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
	at org.webharvest.runtime.processors.BodyProcessor.execute(Unknown Source)
	at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
	at org.webharvest.runtime.processors.LoopProcessor.execute(Unknown Source)
	at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
	at org.webharvest.runtime.Scraper.execute(Unknown Source)
	at org.webharvest.runtime.Scraper.execute(Unknown Source)
	at scrape.QuickScraper.scrapeUrlList(QuickScraper.java:82)
	at scrape.QuickScraper.scrapeUrlList(QuickScraper.java:49)
	at scrape.ActualScraper.main(DhfScraper.java:37)
Caused by: Field access: bsh.ReflectError: No such field: 1 : at Line: -1 : in file: &lt;Called from Java Code&gt; : &lt;Compiled Java Code&gt;

	at bsh.UtilEvalError.toEvalError(Unknown Source)
	at bsh.UtilEvalError.toEvalError(Unknown Source)
	at bsh.Interpreter.set(Unknown Source)
	... 18 more
</code></pre>
<p>You have been warned!</p>
<h3>Running The Scripts From Java</h3>
<p>WebHarvest requires very little code to run.  I created this little reusable harness class to quickly run the two types of script &#8211; one to pull information from a page, and one to farm URLs from which to scrape data.  You can use the first without the second, of course.</p>
<pre name="code" class="java">package scrape;

import java.io.*;
import java.util.*;

import org.apache.commons.logging.*;
import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.*;
import org.webharvest.runtime.variables.Variable;

/**
 * Quick hackable web scraping class.
 * @author Tom Godber
 */
public abstract class QuickScraper
{
	/** Logging object. */
	protected final Log LOG = LogFactory.getLog(getClass());
	/** Prefix for any variable scraped which defines a URL. It will be followed by a counter. Must match the var-def name used in the URL list script. */
	public static final String SCRAPED_URL_VARIABLE_PREFIX = "uri.";
	/** A variable name which holds the initial URL to scrape. Must match the variable read by the page script. */
	public static final String START_URL_VARIABLE = "uri";

	/** A temporary working folder. */
	private File working = new File("temp");

	/** Ensures temp folder exists. */
	public QuickScraper()
	{
		working.mkdirs();
	}

	/**
	 * Scrapes a list of URLs which are automatically derived from a page.
	 * The initial URL must be set in the actual URL list config XML.
	 * @param urlConfigXml Path of an XML describing how to scrape the URL list.
	 * @param pageConfigXml Path of an XML describing how to scrape the individual pages found.
	 * @return The number of URLs processed, or -1 if the config could not be loaded.
	 */
	protected int scrapeUrlList(String urlConfigXml, String pageConfigXml)
	{
		return scrapeUrlList(new HashMap(), urlConfigXml, pageConfigXml);
	}

	/**
	 * Scrapes a list of URLs which are automatically derived from a page.
	 * @param setup Optional configuration for the script
	 * @param urlConfigXml Path of an XML describing how to scrape the URL list.
	 * @param pageConfigXml Path of an XML describing how to scrape the individual pages found.
	 * @return The number of URLs processed, or -1 if the config could not be loaded.
	 */
	protected int scrapeUrlList(Map setup, String urlConfigXml, String pageConfigXml)
	{
		return scrapeUrlList(setup, new File(urlConfigXml), new File(pageConfigXml));
	}

	/**
	 * Scrapes a list of URLs which are automatically derived from a page.
	 * The initial URL must be set in the actual URL list config XML.
	 * @param urlConfigXml XML describing how to scrape the URL list.
	 * @param pageConfigXml XML describing how to scrape the individual pages found.
	 * @return The number of URLs processed, or -1 if the config could not be loaded.
	 */
	protected int scrapeUrlList(File urlConfigXml, File pageConfigXml)
	{
		return scrapeUrlList(new HashMap(), urlConfigXml, pageConfigXml);
	}

	/**
	 * Scrapes a list of URLs which are automatically derived from a page.
	 * @param setup Optional configuration for the script
	 * @param urlConfigXml XML describing how to scrape the URL list.
	 * @param pageConfigXml XML describing how to scrape the individual pages found.
	 * @return The number of URLs processed, or -1 if the config could not be loaded.
	 * @throws NullPointerException If the setup map is null.
	 */
	protected int scrapeUrlList(Map setup, File urlConfigXml, File pageConfigXml)
	{
		try
		{
			if (LOG.isDebugEnabled())	LOG.debug("Starting scrape with temp folder "+working.getAbsolutePath()+"...");
			// generate a one-off scraper based on preloaded configuration
			ScraperConfiguration config = new ScraperConfiguration(urlConfigXml);
			Scraper scraper = new Scraper(config, working.getAbsolutePath());
			// initialise any config
			setupScraperContext(setup, scraper);
			// run the script
			scraper.execute();

			// rip the URL list out of the scraped content
			ScraperContext context = scraper.getContext();
			int i=1;
			Variable scrapedUrl;
			if (LOG.isDebugEnabled())	LOG.debug("Scraping performed, pulling URLs '"+SCRAPED_URL_VARIABLE_PREFIX+"n' from "+context.size()+" variables, starting with "+i+"...");
			while ((scrapedUrl = (Variable) context.get(SCRAPED_URL_VARIABLE_PREFIX+i))  != null)
			{
				if (LOG.isTraceEnabled())	LOG.trace("Found "+SCRAPED_URL_VARIABLE_PREFIX+i+": "+scrapedUrl.toString());
				// parse this URL
				setup.put(START_URL_VARIABLE, scrapedUrl.toString());
				scrapeUrl(setup, pageConfigXml);
				// move on
				i++;
			}
			if (LOG.isDebugEnabled())	LOG.debug("No more URLs found.");
			// i is one past the last URL found, so subtract one to get the number processed
			return i - 1;
		}
		catch (FileNotFoundException e)
		{
			if (LOG.isErrorEnabled())	LOG.error("Could not find config file '"+urlConfigXml.getAbsolutePath()+"' - no scraping was done for this WebHarvest XML.", e);
			return -1;
		}
		finally
		{
			working.delete();
		}
	}

	/**
	 * Scrapes an individual page, and passes the results on for processing.
	 * The script must contain a hardcoded URL.
	 * @param configXml XML describing how to scrape an individual page.
	 */
	protected void scrapeUrl(File configXml)
	{
		scrapeUrl((String)null, configXml);
	}

	/**
	 * Scrapes an individual page, and passes the results on for processing.
	 * @param url The URL to scrape. If null, the URL must be set in the config itself.
	 * @param configXml XML describing how to scrape an individual page.
	 */
	protected void scrapeUrl(String url, File configXml)
	{
		Map setup = new HashMap();
		if (url!=null)	setup.put(START_URL_VARIABLE, url);
		scrapeUrl(setup, configXml);
	}

	/**
	 * Scrapes an individual page, and passes the results on for processing.
	 * @param setup Optional configuration for the script
	 * @param configXml XML describing how to scrape an individual page.
	 */
	protected void scrapeUrl(Map setup, File configXml)
	{
		try
		{
			if (LOG.isDebugEnabled())	LOG.debug("Starting scrape with temp folder "+working.getAbsolutePath()+"...");
			// generate a one-off scraper based on preloaded configuration
			ScraperConfiguration config = new ScraperConfiguration(configXml);
			Scraper scraper = new Scraper(config, working.getAbsolutePath());
			setupScraperContext(setup, scraper);
			scraper.execute();

			// handle contents in some way
			pageScraped((String)setup.get(START_URL_VARIABLE), scraper.getContext());

			if (LOG.isDebugEnabled())	LOG.debug("Page scraping complete.");
		}
		catch (FileNotFoundException e)
		{
			if (LOG.isErrorEnabled())	LOG.error("Could not find config file '"+configXml.getAbsolutePath()+"' - no scraping was done for this WebHarvest XML.", e);

		}
		finally
		{
			working.delete();
		}
	}

	/**
	 * @param setup Any variables to be set before the script runs.
	 * @param scraper The object which does the scraping.
	 */
	private void setupScraperContext(Map&lt;String,Object&gt; setup, Scraper scraper)
	{
		if (setup!=null)
			for (String key : setup.keySet())
				scraper.getContext().setVar(key, setup.get(key));
	}

	/**
	 * Process a page that was scraped.
	 * @param url The URL that was scraped.
	 * @param context The contents of the scraped page.
	 */
	public abstract void pageScraped(String url, ScraperContext context);
}
</pre>
<p>Scraping a new set of data then becomes as simple as extending the class, passing in appropriate config, and pulling out whatever variables you want every time a page is scraped:</p>
<pre name="code" class="java">package scrape;

import org.webharvest.runtime.ScraperContext;
import org.webharvest.runtime.variables.Variable;

public class ActualScraper extends QuickScraper
{
	public static void main(String[] args)
	{
		try
		{
			ActualScraper scraper = new ActualScraper();
			// do the scraping
			scraper.scrapeUrlList("config/se.urls.xml", "config/se.page.xml");
		}
		catch (Exception e)
		{
			e.printStackTrace();
		}
	}

	/**
	 * @see scrape.QuickScraper#pageScraped(java.lang.String, org.webharvest.runtime.ScraperContext)
	 */
	public void pageScraped(String url, ScraperContext context)
	{
		Variable nameVar = context.getVar("name");
		if (nameVar==null)
		{
			if (LOG.isWarnEnabled())	LOG.warn("Scrape for "+url+" produced no data! Ignoring");
			return;
		}

		// log this handset's details
		if (LOG.isInfoEnabled())	LOG.info(nameVar.toString()+" has "+context.getVar("screen.width").toString()+"x"+context.getVar("screen.height").toString()+" screen");
	}
}
</pre>
<p>So there you have it &#8211; a powerful, configurable and highly effective web scraping system with almost no code written!</p>
]]></content:encoded>
			<wfw:commentRss>http://masochismtango.com/2010/02/15/webharvest-web-scraping-from-java/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Migrating Blogger to WordPress &#8211; Easy 301 Permalink Redirects</title>
		<link>http://masochismtango.com/2009/10/25/migrating-blogger-to-wordpress-easy-301-permalink-redirects/</link>
		<comments>http://masochismtango.com/2009/10/25/migrating-blogger-to-wordpress-easy-301-permalink-redirects/#comments</comments>
		<pubDate>Sun, 25 Oct 2009 14:41:58 +0000</pubDate>
		<dc:creator>Tom Godber</dc:creator>
				<category><![CDATA[Dev]]></category>
		<category><![CDATA[Web]]></category>
		<category><![CDATA[Blogger]]></category>
		<category><![CDATA[Wordpress]]></category>

		<guid isPermaLink="false">http://masochismtango.com/?p=355</guid>
		<description><![CDATA[This post explains how to automatically set up the Wordpress Redirection plugin to map imported Blogger post permalinks to their new Wordpress post permalinks.]]></description>
			<content:encoded><![CDATA[<p><img src='http://masochismtango.com/wp-content/plugins/simple-post-thumbnails/timthumb.php?src=/wp-content/thumbnails/355.gif&amp;w=48&amp;h=48&amp;zc=1&amp;ft=jpg' alt='post thumbnail' /></p>
<p>I&#8217;ve been moving the Masabi web site and blog onto WordPress, from a combination of static web content and a blog driven by Blogger.  WordPress has a great import function to move the posts across, which does most of the initial work for you.</p>
<p>However, WordPress won&#8217;t by itself set up redirects for the old Blogger to new WordPress permalinks.  The two platforms shrink post titles to URLs differently, so it&#8217;s not as simple as matching WordPress permalink structures to Blogger&#8217;s under the Settings.</p>
<p>I did see one plugin which was supposed to migrate Blogger permalinks automatically, but it didn&#8217;t work and also didn&#8217;t cover the full scope I needed &#8211; I also have legacy static html links to remap into a totally different site structure.  To achieve this I turned to the excellent <a href="http://urbangiraffe.com/plugins/redirection/">Redirection plugin</a> from John Godley.</p>
<h3>Getting Inside The Database</h3>
<p>The plugin allows you to manually set up redirects with a very friendly interface, but there&#8217;s no fun in migrating 70 blog posts by hand.  This is where SQL can come to our rescue!</p>
<p>VERY IMPORTANT: take a <a href="http://codex.wordpress.org/WordPress_Backups">full backup of your database</a> before you start messing around with SQL.  In theory this is a pretty low risk operation, but you never know!</p>
<p>The Blogger import utility saves custom fields for every imported post, with the Blogger permalink held in the <code>blogger_permalink</code> field.  Custom fields are stored on the <code>wp_postmeta</code> table, whose <code>post_id</code> column links each row back to its post on <code>wp_posts</code>.  This makes one half of the mapping very easy to set up.</p>
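<p>To see what the importer stored, a quick look at the table helps.  This is only a sketch, assuming the default <code>wp_</code> table prefix; the <code>LIMIT</code> is there just to keep the output short:</p>
<pre class="sql">-- List a few imported posts alongside their old Blogger permalinks
SELECT post_id, meta_value AS blogger_permalink
FROM wp_postmeta
WHERE meta_key='blogger_permalink'
LIMIT 10;</pre>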
<p>The other half is slightly more subtle, because WordPress permalinks are not actually stored on the <code>wp_posts</code> table.  Every post does have a GUID, but this is based on what its permalink was when you did the import &#8211; and if you imported when you created the blog and set your WordPress permalinks later, this will not reflect the post&#8217;s current permalink URI.  Ideally we&#8217;d like the 301 to point to the real end URI, so we need to get a little creative and rebuild the permalink in the way WordPress does it, from the post metadata.</p>
<p>My permalink structure looks like this:<br />
<code>YYYY/MM/DD/title/</code></p>
<p>This can be rebuilt using the following string manipulation in SQL:<br />
<code>CONCAT('/',YEAR(post_date),'/',LPAD(MONTH(post_date),2,'0'),'/',LPAD(DAY(post_date),2,'0'),'/',post_name,'/')</code></p>
<p>Notes on the SQL functions:</p>
<ul>
<li><code>CONCAT</code> just combines all of its arguments together into a single string;</li>
<li><code>LPAD</code> pads the left of the string with 0s, as the month and day in the permalink are always 2 digits long;</li>
<li><code>YEAR</code>, <code>MONTH</code> and <code>DAY</code> extract the relevant fields from the post&#8217;s creation date/time.</li>
</ul>
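<p>Before inserting anything, it is worth checking that the expression really does reproduce your live URLs.  A minimal preview query, assuming the default <code>wp_</code> table prefix and the <code>YYYY/MM/DD/title/</code> structure above (a different permalink structure would need a different <code>CONCAT</code>):</p>
<pre class="sql">-- Show the rebuilt permalink for the ten most recent published posts
SELECT ID, post_date,
  CONCAT('/',YEAR(post_date),'/',LPAD(MONTH(post_date),2,'0'),'/',LPAD(DAY(post_date),2,'0'),'/',post_name,'/') AS rebuilt_permalink
FROM wp_posts
WHERE post_status='publish' AND post_type='post'
ORDER BY post_date DESC
LIMIT 10;</pre>
<p>If a few of these paths load the right posts when pasted after your domain name, the expression matches your settings.</p>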
<p>Given this data, we can easily create an automatic import SQL statement for moving the data across:</p>
<pre class="sql">INSERT INTO wp_redirection_items
  (url, action_data, regex, group_id, status, action_type, action_code, match_type, last_access, position)
SELECT M.meta_value AS url,
  CONCAT('/',YEAR(P.post_date),'/',LPAD(MONTH(P.post_date),2,'0'),'/',LPAD(DAY(P.post_date),2,'0'),'/',P.post_name,'/') AS action_data,
  0 AS regex, 1 AS group_id, 'enabled' AS status, 'url' AS action_type,
  301 AS action_code, 'url' AS match_type, 0 AS last_access, 69 AS position
FROM wp_postmeta M, wp_posts P
WHERE M.meta_key='blogger_permalink' AND M.post_id=P.ID AND P.post_status='publish';</pre>
<p>Run this through phpMyAdmin, refresh the Redirection admin page, and you should now find that all of your permalinks have been moved across.  Note that we set the position field to an arbitrary constant, here 69, so that if we mess up we can easily delete the inserted rows and try again, without upsetting any other redirects already set up.</p>
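<p>The clean-up could look something like this, assuming no existing redirect already uses position 69 (check with the <code>SELECT</code> before running the <code>DELETE</code>):</p>
<pre class="sql">-- Review the rows created by the import
SELECT url, action_data FROM wp_redirection_items WHERE position=69;

-- Remove only those rows, leaving other redirects alone
DELETE FROM wp_redirection_items WHERE position=69;</pre>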
]]></content:encoded>
			<wfw:commentRss>http://masochismtango.com/2009/10/25/migrating-blogger-to-wordpress-easy-301-permalink-redirects/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Running Subversion Through Ant</title>
		<link>http://masochismtango.com/2009/05/28/running-subversion-through-ant/</link>
		<comments>http://masochismtango.com/2009/05/28/running-subversion-through-ant/#comments</comments>
		<pubDate>Thu, 28 May 2009 17:57:48 +0000</pubDate>
		<dc:creator>Tom Godber</dc:creator>
				<category><![CDATA[Dev]]></category>
		<category><![CDATA[Ant]]></category>
		<category><![CDATA[Eclipse]]></category>
		<category><![CDATA[Subversion]]></category>
		<category><![CDATA[svn]]></category>

		<guid isPermaLink="false">http://masochismtango.com/?p=142</guid>
		<description><![CDATA[Ant has a somewhat limited built-in Subversion task.  An alternative if you need more power (and I can't quite remember why I did, but I did) is Subclipse's Ant task - but setting it up isn't so obvious, especially if you don't use Subclipse as your main Subversion plugin.]]></description>
			<content:encoded><![CDATA[<p><img src='http://masochismtango.com/wp-content/plugins/simple-post-thumbnails/timthumb.php?src=/wp-content/thumbnails/142.jpg&amp;w=48&amp;h=48&amp;zc=1&amp;ft=jpg' alt='post thumbnail' /></p>
<p>Ant has a somewhat limited <a href="http://ant.apache.org/antlibs/svn/index.html">built-in Subversion task</a>.  An alternative if you need more power (and I can&#8217;t quite remember why I did, but I did) is <a href="http://subclipse.tigris.org/svnant.html">Subclipse&#8217;s Ant task</a> &#8211; but setting it up isn&#8217;t so obvious, especially if you don&#8217;t use Subclipse as your main Subversion plugin.</p>
<p>The Ant task provides a Java wrapper around two different ways to access Subversion: either a command line &#8216;svn&#8217; command, or the JavaHL DLL.  Initially I opted for the former (using this <a href="http://www.open.collab.net/downloads/subversion/">Windows client</a>) which appeared to work passably well, but generated reams of logging that slowed commits to a crawl; the Ant task didn&#8217;t allow you to pass additional command line parameters on to the command, so there wasn&#8217;t much that could be done about it.</p>
<p>An obscure bug, probably something to do with a slightly corrupted Subversion project in our repository, finally led me to reassess today, and after an afternoon of fun I scrapped the Windows client and went after the DLL.</p>
<p>To get it (and for some reason it took me a while to think of this), download the &#8216;update site&#8217; zip for the latest <a href="http://subclipse.tigris.org/servlets/ProjectDocumentList?folderID=2240">Subclipse Eclipse plugin</a> (currently 1.6.2), open it and pull out the four jar files with names starting <code>org.tigris.subversion.clientadapter</code>.  Note: don&#8217;t be distracted by the Update URL, which is the conventional way to actually install the plugin.</p>
<p>Drop these four jars in some suitable lib folder, and make sure they are added to Ant&#8217;s runtime classpath &#8211; in Eclipse, you do it from the <em>Window &gt; Preferences</em> dialogue under <em>Ant &gt; Runtime</em>:</p>
<p><img class="aligncenter size-full wp-image-144" title="Eclipse Ant dialogue" src="http://masochismtango.com/wp-content/uploads/2009/05/eclipse.png" alt="Eclipse Ant dialogue" width="500" height="434" /></p>
<p>You can add them to either the Ant Home or the Global section; it works either way.</p>
<p>Our entire automated tagging system now runs vastly quicker without the logging, and so far has been bug free even where the Windows client was balking about some odd metadata thing (which is the only thing it didn&#8217;t adequately log, so no easy way to fix it). Excellent.</p>
]]></content:encoded>
			<wfw:commentRss>http://masochismtango.com/2009/05/28/running-subversion-through-ant/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Masabists: Cell IDs and Location-Based Services</title>
		<link>http://masochismtango.com/2007/12/01/masabists-cell-ids-and-location-based-services/</link>
		<comments>http://masochismtango.com/2007/12/01/masabists-cell-ids-and-location-based-services/#comments</comments>
		<pubDate>Sat, 01 Dec 2007 12:18:13 +0000</pubDate>
		<dc:creator>Tom Godber</dc:creator>
				<category><![CDATA[Dev]]></category>
		<category><![CDATA[Mobile]]></category>
		<category><![CDATA[LBS]]></category>
		<category><![CDATA[Masabists]]></category>

		<guid isPermaLink="false">http://masochismtango.com/?p=182</guid>
		<description><![CDATA[It was very interesting to read about the latest update to Google Maps, one of the nicest J2ME apps around at the moment, which can now find your location without GPS. My instant reaction was – “Finally! But will the operators let them continue?”]]></description>
			<content:encoded><![CDATA[<p><img src='http://masochismtango.com/wp-content/plugins/simple-post-thumbnails/timthumb.php?src=/wp-content/thumbnails/182.jpg&amp;w=48&amp;h=48&amp;zc=1&amp;ft=jpg' alt='post thumbnail' /></p>
<p class="MsoNormal"><em>Note: this was originally posted on the <a href="http://blog.masabi.com/2007/12/cell-ids-and-location-based-services.html">Masabists blog</a>.</em></p>
<p class="MsoNormal"><span lang="EN-GB">It was very interesting to read about the <a href="http://www.theregister.co.uk/2007/11/29/google_mobile_maps/">latest update to Google Maps</a>, one of the nicest J2ME apps around at the moment, which can now find your location without GPS.<span> </span>My instant reaction was – <span style="font-style: italic;">“Finally! But will the operators let them continue?”</span></span></p>
<p class="MsoNormal"><span lang="EN-GB">Current operator location services work by triangulating signal strength from multiple base stations, which can often give good accuracy in urban areas densely packed with cells.<span> </span>They carry with them a cost – low in absolute terms but sadly quite high for a lot of possible use cases &#8211; and all sorts of privacy controls, which, whilst clearly necessary, have been a bit of a barrier to widespread adoption of Location-Based Services.</span></p>
<p class="MsoNormal"><span lang="EN-GB">Back in I think it was 2002, Masabi had a working system to track handset location by cell IDs.<span> </span>Ben, being an engineer at heart, had strapped a modem unit to a Palm PDA and written an application to read out the current cell ID and plot it onto GIF maps downloaded live from <a href="http://www.streetmap.co.uk/newmap.srf?x=529750&amp;y=179750&amp;z=1&amp;sv=529750,179750&amp;st=4&amp;ar=N&amp;mapp=newmap.srf&amp;searchp=newsearch.srf">StreetMap.co.uk</a>.<span> </span>I distinctly remember being very impressed walking down Victoria Street towards Parliament Square in Westminster and seeing it track us across the <a href="http://www.streetmap.co.uk/newmap.srf?x=529750&amp;y=179750&amp;z=1&amp;sv=529750,179750&amp;st=4&amp;ar=N&amp;mapp=newmap.srf&amp;searchp=newsearch.srf">map on this very GIF</a> with surprising accuracy</span><span lang="EN-GB">.</span></p>
<p class="MsoNormal"><span lang="EN-GB">Consensus seems to be that Google are using a very similar system, with GPS users providing location data to map out operators’ cell IDs (something I believe was explicitly mentioned).<span> </span>This suggests that Google haven’t purchased the location data from the operators.<span> </span>Why would that matter?</span></p>
<p class="MsoNormal"><span lang="EN-GB">So how did we build up our cell location database?<span> </span>And if it worked, why didn’t we commercialise it?<span> </span>The two answers are connected – we were ramping up for a launch within certain industries which could have benefitted from a single network/limited device range service.<span> </span>Unfortunately – or perhaps fortunately, with hindsight – just before a major demo, the operator we were using decided to remove the cell broadcast info that had been supplying the base station OS grid reference locations (note: the cell IDs themselves did not appear to change, as I had erroneously stated earlier).</span></p>
<p class="MsoNormal"><span lang="EN-GB">We considered some sort of effort to map cell IDs into a database, perhaps open source, but without widespread GPS ownership this was a huge task, and there was no guarantee that the operators wouldn’t choose to change the IDs at any time in the future.<span> </span>We were not interested in trying to make commercial promises where we had no control over key components.<span> </span>So we put it to rest.</span></p>
<p><span lang="EN-GB">Some JavaME devices can access the current cell ID, as can signed Symbian apps and Windows Mobile apps; Google’s <a href="http://www.google.com/support/mobile/bin/answer.py?answer=81871&amp;topic=12595">compatibility list</a> </span><span lang="EN-GB"> suggests they are targeting only these devices, which implies they are attempting something similar.<span> </span>I wish them luck! </span></p>
<p style="text-align: right;"><em>Please comment on the <a href="http://blog.masabi.com/2007/12/cell-ids-and-location-based-services.html"> original post</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://masochismtango.com/2007/12/01/masabists-cell-ids-and-location-based-services/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

