Archive for February, 2010

Aperture, Light and Stops

// February 28th, 2010 // No Comments » // Photography

This is a diagram I drew to understand aperture numbers (the f numbers on a lens) and the concept of stops. In the graph, the gap between each bar is a stop (click to enlarge):

The most confusing thing was always the non-linear nature of stops – an increase of one stop of light means that the amount of light has doubled, a reduction by a stop means a halving of the light.
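To make the non-linearity concrete, here is a quick sketch (my own illustration, not from the diagram): each one-stop reduction multiplies the f-number by √2, and the light admitted is proportional to the aperture area, i.e. 1/N², so it halves with every stop:

```java
public class Stops {
    // f-number after n one-stop reductions from f/1.0 - each stop multiplies N by the square root of 2
    static double fNumber(int n) {
        return Math.pow(Math.sqrt(2), n);
    }

    // light admitted relative to f/1.0 - proportional to aperture area, i.e. 1/N^2
    static double relativeLight(double fNumber) {
        return 1.0 / (fNumber * fNumber);
    }

    public static void main(String[] args) {
        for (int n = 0; n <= 9; n++) {
            double f = fNumber(n);
            System.out.printf("f/%.1f admits %.4f of the light of f/1.0%n", f, relativeLight(f));
        }
    }
}
```

This reproduces the familiar f/1.4, f/2, f/2.8, f/4… sequence – the numbers look irregular, but each is just √2 times the last.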

You can see the diaphragm of a lens here (the white circle in the centre, ie. the hole that lets the light through):

Verified by Visa

// February 25th, 2010 // No Comments » // Web

Plenty of people have commented on the stupidity of Verified by Visa and similar schemes, which put iframed verification forms into the web purchasing system that look remarkably like phishing forms (encouraging users to trust such embedded forms), without providing any additional security benefits. A quick glance at the economics explains why sites do this – it enables them to lower their costs by shunting fraud risk onto Visa – but from a user’s perspective it’s still bloody stupid.

Even worse is when you cannot complete a payment without it, but it doesn’t work – an experience I have just had with BA, trying to book a flight over to Queen’s Day in Amsterdam. After entering all my details, I got this:

The bank would like the following information… an empty iframe. It’s actually loading a JSP on BA’s site which delivers an empty HTML page wrapping a script that tries to trigger a form that isn’t defined in the markup. Knowing that is no great consolation…

So congratulations BA, Easyjet were undoubtedly very happy to receive some cash in exchange for a functional web experience.

Data Visualisation – Global Gender Balance

// February 18th, 2010 // No Comments » // Creative

I’ve spent a few hours playing with Tableau Public, a free version of the rather expensive Tableau data visualisation app, and it’s pretty good. After a random discussion of how Estonia has, according to the Economist world stats book, 84 guys to every 100 girls, I prepared this map in a couple of hours as an exercise in understanding how Tableau works:

Click to see full size - global male:female ratios, scaled by population size

The map shows the male:female ratio of every country as coloured crosses, from red (more women) to blue (more men); the size of the cross is proportional to the country’s population.

Quick Caveats

Tableau Public is a cut-down version of commercial software, with no ability to save files locally. In theory you can publish your diagrams to the web on their site, but that feature was a bit broken when I tried it.

This means that you have to retrieve your diagrams using screenshots, and I have no idea what the legal implications of doing that are if you intend to use them for anything but personal interest and satisfaction! It does constrain the quality a little, too.

Source Data

Being lazy, I didn’t want to type in all of the raw data from the Economist book, so I pulled it from Wikipedia: population data came from here and sex ratio from here. Copying the tables directly into Excel brought a load of unwanted links and images, so I copied them into Notepad++ – where they appear as tab-separated values – saved to disk, and reloaded the file into Excel as TSV.
The country names had a few extra spaces and other characters in them – I pulled these out with the LEFT, MID, LEN etc functions.

Tableau theoretically understands full country names, but I had mixed success getting this to work, so I imported a lookup table of two letter ISO 3166-2 codes, which are also supported and are unambiguous.

I then used VLOOKUP to pull together all this information into a single Excel sheet (remember to turn range lookup – the fourth, optional argument – off); as the data came from different sources, there was a little messing around standardising country names. The finished spreadsheet is here if you want to play with it.
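For anyone happier in code than spreadsheets, VLOOKUP with range lookup off is just an exact-match lookup – a sketch of the same join in Java (the figures and the trailing-space mismatch are illustrative, mine):

```java
import java.util.*;

public class CountryJoin {
    /**
     * Joins the two tables on country name. Rows with no exact match are dropped,
     * just as VLOOKUP with range lookup off returns #N/A for them.
     */
    static List<String> join(Map<String, Long> population, Map<String, Double> menPer100Women) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, Long> entry : population.entrySet()) {
            Double ratio = menPer100Women.get(entry.getKey()); // exact match only
            if (ratio != null) {
                rows.add(entry.getKey() + "\t" + entry.getValue() + "\t" + ratio);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, Long> population = new HashMap<>();
        population.put("Estonia", 1340000L);        // illustrative figures
        population.put("Netherlands", 16500000L);
        Map<String, Double> menPer100Women = new HashMap<>();
        menPer100Women.put("Estonia", 84.0);
        menPer100Women.put("Netherlands ", 98.0);   // trailing space - fails the exact match
        for (String row : join(population, menPer100Women)) {
            System.out.println(row);
        }
    }
}
```

The Netherlands row silently drops out because of the stray space – exactly the kind of mismatch that made standardising the country names necessary.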

Visualizing in Tableau

It really helps to look at this tutorial video before starting anything – Tableau starts with quite a blank slate!

Open up the XLS in Tableau, and it’ll make a first stab at identifying what is what from the Excel column formats. You’ll see the fields separated into text and numeric lists down the left hand side.
If you don’t see a globe next to the Country Code field, right click, go to the Geographic Role submenu, and select Country (ISO 3166-2). Tableau can now map this to a geographical location automatically. Also ensure all numeric columns are recognised as numeric with the Change Data Type submenu.

To recreate the map visualisation, follow these (approximate!) steps:

  1. Select the Country Code and Total fields, and click on the Show Me! button.
  2. Select the map diagram type (near the bottom of the list). You should see some dots across a map.
  3. Drag the Population column over to the Size box, right click and select Dimension to scale the dots by population.
  4. Drag the Total column to the Colour box and they should become shades of green; further down is a green graduation which you can click on to change, and adjust to a red-blue graduation. Click on Advanced and set the midpoint to 1.0, to make the middle grey represent a 1:1 ratio.

Your left hand panels should look a little like this now:

This is really just a very high speed starter and I’ve barely scratched the surface – I’ve got some quite complex business plan data that I’ll be dropping in later to experiment further.

The Mechanics Behind the SLR

// February 17th, 2010 // No Comments » // Photography

A quick diagram I drew to explain the innards of an SLR, which is a mechanical setup that was developed to ensure the picture you see through the optical viewfinder matches the photo that will be taken:

This is how the mirror movement redirects the light to the sensor when a photo is actually taken:

Amusingly enough, you can view the complex mechanics that made the SLR so good for film as a drawback for the digital age – Live View, long a staple of point and shoot cameras, has been late in coming to SLRs because the whole mirror mechanism has to be bypassed.

Zenitar 16mm Fisheye on 5Dmk2 – First Impressions

// February 15th, 2010 // 1 Comment » // Photography

I shoot most of my photos with wide angle lenses, so it was inevitable that I’d eventually buy a fisheye.

Zenitar 16mm fisheye photo with Canon 5Dmk2

Choosing

Ultrawide distortion is an easy thing to overuse, though, so I was looking for something in the £100-150 sweet spot – good enough to bother using, cheap enough to bother buying. Sigma and Canon models all come in over £400, so they were instantly out. That leaves two real contenders, with quite different characteristics:

  1. Zenitar 16mm f2.8 – 180 degrees corner to corner on a full frame body like the 5Dmk2, but useless on a crop sensor body
  2. Peleng 8mm f3.5 – projects a full circle on a full frame (ie. a circle in the middle of the exposure, with black all around)

As I have a full frame body, the 16mm was the obvious choice – it produces normal rectangular photos but with just enough distortion on the edges to be interesting. I am pretty certain that whilst the 8mm circle look can be cool, it would also get a lot less use than the more subtle 16mm.

Zenitar 16mm fisheye photo with Canon 5Dmk2

Purchasing

I purchased my Zenitar for about £125 from MoscowPhoto on eBay. It arrived in under 2 weeks – a week faster than they suggested it would take.
The factory ships this lens with an old M42 mount, which usually requires an adapter, but MoscowPhoto had fitted an EOS mount – this works perfectly and saves some hassle.

Zenitar 16mm fisheye photo with Canon 5Dmk2

Quality

The lens looks very 70s, and it’s relatively lightweight, but for the money build quality is fine.

Zenitar 16mm fisheye photo with Canon 5Dmk2

Picture quality is impressive for the money, as long as the lens is kept stopped down to f11 or f22. With such an enormous field of view that’s sensible anyway, as it would be hard to keep everything in focus at a wider aperture. There is a little vignetting, and in two of the four corners it is impossible to avoid small triangles of black, but it isn’t that noticeable and can usually be hidden by the composition.
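The depth-of-field point can be made concrete with the standard hyperfocal distance formula (my addition, not from the post): H = f²/(N·c) + f, with focal length f and circle of confusion c in mm and aperture N. A sketch for a 16mm lens on full frame, assuming the conventional 0.03mm circle of confusion:

```java
public class Hyperfocal {
    /**
     * Hyperfocal distance in mm: focus at this distance and everything from half of it
     * to infinity is acceptably sharp. Standard formula: H = f^2 / (N * c) + f.
     */
    static double hyperfocalMm(double focalLengthMm, double fNumber, double circleOfConfusionMm) {
        return (focalLengthMm * focalLengthMm) / (fNumber * circleOfConfusionMm) + focalLengthMm;
    }

    public static void main(String[] args) {
        double c = 0.03; // conventional full frame circle of confusion, in mm
        for (double n : new double[]{ 2.8, 11, 22 }) {
            System.out.printf("16mm at f/%.1f: hyperfocal distance %.2fm%n", n, hyperfocalMm(16, n, c) / 1000);
        }
    }
}
```

At f/11 a 16mm lens is hyperfocal at roughly 0.8m, so focused there everything from about 0.4m to infinity is sharp; at f/2.8 you would need to focus past 3m to get the same.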

Zenitar 16mm fisheye photo with Canon 5Dmk2

I don’t have the skills or the inclination to do any formal testing of the lens, but you can find my first batch of experiments on Picasa if you’d like to look. They were shot as S-RAW2, tidied a little in Lightroom and then shrunk to 1600px for Picasa, so not directly as the camera shot them, but a fair representation.

Zenitar 16mm fisheye photo with Canon 5Dmk2

My subjective opinion is that at f11 and above the field was sharp enough to make me happy shooting handheld; a tripod would no doubt help – at which point the key issue for sharpness would be focussing, hard to do across the whole frame. There is noticeable chromatic aberration, but it’s a fisheye so I won’t lose sleep over it. Colour saturation is disappointing compared to good glass, with colours coming out quite flat, but you can compensate in RAW and for an occasional-use trick lens I don’t think it’s an issue.

Zenitar 16mm fisheye photo with Canon 5Dmk2

Shooting Tips

Using such a narrow aperture obviously does make it hard to use the lens indoors. I found it could be used handheld in a dark room at ISO 2500-3200, which after some judicious editing in Lightroom produced acceptable pictures – I doubt you could use them as professional stock photos, but they’d be acceptable for small prints, web use etc. Your mileage with other bodies will vary, of course – the 5Dmk2 is spectacularly good at high ISO, so they work well together in this respect.

Zenitar 16mm fisheye photo with Canon 5Dmk2

In good light, handheld shooting was easy – remembering to manually focus was the only real issue, and was easiest done by guesstimating using the distance markings on the dial. Speaking of manual dials, it’s worth noting that everything will have to be done manually with this lens – the camera cannot read the aperture you have manually selected, so you have to work in Manual or Tv mode and do some test shots whenever the light changes. In practice this really isn’t challenging, and doesn’t require virtuoso knowledge of light metering. Obviously shooting RAW and using a body with good high ISO is very helpful here.

Zenitar 16mm fisheye photo with Canon 5Dmk2

Overall, I’m really pleased with this lens and think it makes a great companion to a full frame SLR. The level of distortion can be subtle or extreme depending on composition and distance from subject, and works particularly well if you avoid aligning with strong horizontals and verticals. That said, it will take me a while to work out what sort of compositions it can usefully be used for, but it is light enough that you can carry it around just in case, whatever the trip.

If you have a crop sensor body, however, you’ll need to look elsewhere…

WebHarvest: Easy Web Scraping from Java

// February 15th, 2010 // 9 Comments » // Dev, Web

I’ve been experimenting with data visualisation for a while now, most of which is for Masabi’s business plan, though I hope to share some offshoots soon.

I often have a need to quickly scrape some data out of a web page (or list of web pages), which can then be fed into Excel and on to specialist data visualisation tools like Tableau (available in a free public edition here – my initial impressions are positive but it’s early days yet).

To this end I have turned to WebHarvest, an excellent scriptable open source API for web scraping in Java. I really really like it, but there are some quirks and setup issues that have cost me hours so I thought I’d roll together a tutorial with the fixes.

WebHarvest Config for Maven

When it works, Maven is a lovely tool that hides dependency management for Java projects, but WebHarvest is not quite configured right out of the box to work transparently with it. (Describing Maven is beyond the scope of this post, but if you don’t know it, it’s easy to set up with the M2 plugin for Eclipse.)

This is the Maven POM I ended up with to use WebHarvest in a new JavaSE project:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
 <modelVersion>4.0.0</modelVersion>
 <groupId>WebScraping</groupId>
 <artifactId>WebScraping</artifactId>
 <packaging>jar</packaging>
 <version>0.00.01</version>
 <properties>
 <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
 </properties>

 <build>
 <plugins>
 <plugin>
 <artifactId>maven-compiler-plugin</artifactId>
 <configuration>
 <source>1.6</source>
 <target>1.6</target>
 </configuration>
 </plugin>
 </plugins>
 </build>

 <repositories>
 <repository>
 <id>wso2</id>
 <url>http://dist.wso2.org/maven2/</url>
 </repository>
 <repository>
 <id>maven-repository-1</id>
 <url>http://repo1.maven.org/maven2/</url>
 </repository>
 </repositories>
 <dependencies>
 <dependency>
 <groupId>commons-logging</groupId>
 <artifactId>commons-logging</artifactId>
 <version>1.1</version>
 <type>jar</type>
 <scope>compile</scope>
 </dependency>
 <dependency>
 <groupId>log4j</groupId>
 <artifactId>log4j</artifactId>
 <version>1.2.12</version>
 <type>jar</type>
 <scope>compile</scope>
 </dependency>
 <dependency>
 <groupId>org.webharvest.wso2</groupId>
 <artifactId>webharvest-core</artifactId>
 <version>1.0.0.wso2v1</version>
 <type>jar</type>
 <scope>compile</scope>
 </dependency>
 <!-- web harvest pom doesn't track dependencies well -->
 <dependency>
 <groupId>net.sf.saxon</groupId>
 <artifactId>saxon-xom</artifactId>
 <version>8.7</version>
 </dependency>
 <dependency>
 <groupId>org.htmlcleaner</groupId>
 <artifactId>htmlcleaner</artifactId>
 <version>1.55</version>
 </dependency>
 <dependency>
 <groupId>bsh</groupId>
 <artifactId>bsh</artifactId>
 <version>1.3.0</version>
 </dependency>
 <dependency>
 <groupId>commons-httpclient</groupId>
 <artifactId>commons-httpclient</artifactId>
 <version>3.1</version>
 </dependency>
 </dependencies>
</project>

You’ll note that the WebHarvest dependencies had to be added explicitly, because the jar does not come with a working pom listing them.

Writing A Scraping Script

WebHarvest uses XML configuration files to describe how to scrape a site – and with a few lines of Java code you can run any XML configuration and have access to any properties that the script identified from the page. This is definitely the safest way to scrape data, as it decouples the code from the web page markup – so if the site you are scraping goes through a redesign, you can quickly adjust the config files without recompiling the code they pass data to.

The site has some good example scripts to show you how to get started, so I won’t repeat them here. The easiest way to create your own is to run the WebHarvest GUI from the command line, start with a sample script, and then hack it around to get what you want – it’s an easy iterative process with good feedback in the UI.

As a simple example, this is a script to go to the Sony-Ericsson developer site’s handset gallery at http://developer.sonyericsson.com/device/searchDevice.do?restart=true, and rip each handset’s individual spec page URI:

<?xml version="1.0" encoding="UTF-8"?>
<config>
	<!-- indicates we want a loop through the list defined in <list>, running <body> for each item, where i holds the iteration index and uid the current item's value -->
	<loop item="uid" index="i">
		<!-- the list section defines what we will loop over - here, it pulls out the value attribute of all option tags -->
		<list>
			<xpath expression="//option/@value">
				<html-to-xml>
					<http url="http://developer.sonyericsson.com/device/searchDevice.do?restart=true"/>
				</html-to-xml>
			</xpath>
		</list>
		<!-- the body section lists instructions which are run for every iteration of the loop -->
		<body>
			<!-- we define a new variable for every iteration, using the iteration count as a suffix  -->
			<var-def name="uri.${i}">
				<!-- template tag is important, else the $ var syntax will be ignored and won't do any value substitutions -->
				<template>device/loadDevice.do?id=${uid}</template>
			</var-def>
		</body>
	</loop>
</config>

The handset URIs will end up in a list of variables, from uri.1 to uri.N.

The XML configuration’s syntax can take a little getting used to – it appeared quite backwards to me at first, but by messing around in the GUI you can experiment and learn pretty fast. With a basic understanding of XPath to identify parts of the web page, and perhaps a little regular expression knowledge to get at information surrounded by plain text, you can perform some very powerful scraping.

We can then define another script which will take this URI, and pull out a piece of information from the page – in this example, it will show the region(s) that the handset was released in:

<?xml version="1.0" encoding="UTF-8"?>
<config>
	<!-- get the entire page -->
	<var-def name="wholepage">
		<html-to-xml>
			<!-- NEVER try and pass in the entire URL as a single variable here! -->
			<http url="http://developer.sonyericsson.com/${uri}"/>
		</html-to-xml>
	</var-def>
	<!-- rip out the block with the specifications -->
	<var-def name="specsheet">
		<xpath expression="//div[@class='phone-specs']">
			<var name="wholepage"/>
		</xpath>
	</var-def>
	<!-- find the handset's name -->
	<var-def name="name">
		<xpath expression="//h5[contains(text(),'Phone Model')]/following-sibling::p[1]/text()">
			<var name="specsheet"/>
		</xpath>
	</var-def>
	<!-- identify the screen resolution -->
	<regexp>
		<regexp-pattern>([\d]*)x([\d]*)</regexp-pattern>
		<regexp-source>
			<xpath expression="//h5[contains(text(),'Screen Sizes')]/following-sibling::p[1]/text()">
				<var name="specsheet"/>
			</xpath>
		</regexp-source>
		<regexp-result>
			<var-def name="screen.width"><template>${_1}</template></var-def>
			<var-def name="screen.height"><template>${_2}</template></var-def>
		</regexp-result>
	</regexp>
</config>
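The ${_1} and ${_2} substitutions map to the regexp's capture groups, exactly as in Java's own regex API – for comparison, the same extraction in plain Java (the sample input string is mine, not real site markup):

```java
import java.util.regex.*;

public class ScreenSize {
    /** Pulls "WxH" out of free text, like the regexp block above; group 1 maps to ${_1}, group 2 to ${_2}. */
    static int[] parseScreenSize(String text) {
        Matcher m = Pattern.compile("(\\d+)x(\\d+)").matcher(text);
        if (!m.find()) return null;
        return new int[]{ Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)) };
    }

    public static void main(String[] args) {
        int[] size = parseScreenSize("Screen Sizes: 240x320 pixels");
        System.out.println("screen.width=" + size[0] + ", screen.height=" + size[1]);
    }
}
```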

At this point I should note the biggest gotcha with WebHarvest, which just cost me three hours of hair-tearing. In the script, this line defines the page to scrape: <http url="http://developer.sonyericsson.com/${uri}"/>, where ${uri} is a variable specified at runtime to define a URI. This works.

If you were to substitute in this perfectly sensible alternative: <http url="${url}"/>, you would end up with a completely obscure runtime exception a little like this:

Exception in thread "main" org.webharvest.exception.ScriptException: Cannot set variable in scripter: Field access: bsh.ReflectError: No such field: 1
	at org.webharvest.runtime.scripting.BeanShellScriptEngine.setVariable(Unknown Source)
	at org.webharvest.runtime.scripting.ScriptEngine.pushAllVariablesFromContextToScriptEngine(Unknown Source)
	at org.webharvest.runtime.scripting.BeanShellScriptEngine.eval(Unknown Source)
	at org.webharvest.runtime.templaters.BaseTemplater.execute(Unknown Source)
	at org.webharvest.runtime.processors.TemplateProcessor.execute(Unknown Source)
	at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
	at org.webharvest.runtime.processors.BodyProcessor.execute(Unknown Source)
	at org.webharvest.runtime.processors.VarDefProcessor.execute(Unknown Source)
	at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
	at org.webharvest.runtime.processors.BodyProcessor.execute(Unknown Source)
	at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
	at org.webharvest.runtime.processors.LoopProcessor.execute(Unknown Source)
	at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
	at org.webharvest.runtime.Scraper.execute(Unknown Source)
	at org.webharvest.runtime.Scraper.execute(Unknown Source)
	at scrape.QuickScraper.scrapeUrlList(QuickScraper.java:82)
	at scrape.QuickScraper.scrapeUrlList(QuickScraper.java:49)
	at scrape.ActualScraper.main(DhfScraper.java:37)
Caused by: Field access: bsh.ReflectError: No such field: 1 : at Line: -1 : in file:  : 

	at bsh.UtilEvalError.toEvalError(Unknown Source)
	at bsh.UtilEvalError.toEvalError(Unknown Source)
	at bsh.Interpreter.set(Unknown Source)
	... 18 more

You have been warned!

Running The Scripts From Java

WebHarvest requires very little code to run. I created this little reusable harness class to quickly run the two types of script – one to pull information from a page, and one to farm URLs from which to scrape data. You can use the first without the second, of course.

package scrape;

import java.io.*;
import java.util.*;

import org.apache.commons.logging.*;
import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.*;
import org.webharvest.runtime.variables.Variable;

/**
 * Quick hackable web scraping class.
 * @author Tom Godber
 */
public abstract class QuickScraper
{
	/** Logging object. */
	protected final Log LOG = LogFactory.getLog(getClass());
	/** Prefix for any variable scraped which defines a URL. It will be followed by a counter. */
	public static final String SCRAPED_URL_VARIABLE_PREFIX = "url.";
	/** A variable name which holds the initial URL to scrape. */
	public static final String START_URL_VARIABLE = "url";

	/** A temporary working folder. */
	private File working = new File("temp");

	/** Ensures the temp folder exists. */
	public QuickScraper()
	{
		working.mkdirs();
	}

	/**
	 * Scrapes a list of URLs which are automatically derived from a page.
	 * The initial URL must be set in the actual URL list config XML.
	 * @param urlConfigXml Path of an XML describing how to scrape the URL list.
	 * @param pageConfigXml Path of an XML describing how to scrape the individual pages found.
	 * @return The number of URLs processed, or -1 if the config could not be loaded.
	 */
	protected int scrapeUrlList(String urlConfigXml, String pageConfigXml)
	{
		return scrapeUrlList(new HashMap(), urlConfigXml, pageConfigXml);
	}

	/**
	 * Scrapes a list of URLs which are automatically derived from a page.
	 * @param setup Optional configuration for the script
	 * @param urlConfigXml Path of an XML describing how to scrape the URL list.
	 * @param pageConfigXml Path of an XML describing how to scrape the individual pages found.
	 * @return The number of URLs processed, or -1 if the config could not be loaded.
	 */
	protected int scrapeUrlList(Map setup, String urlConfigXml, String pageConfigXml)
	{
		return scrapeUrlList(setup, new File(urlConfigXml), new File(pageConfigXml));
	}

	/**
	 * Scrapes a list of URLs which are automatically derived from a page.
	 * The initial URL must be set in the actual URL list config XML.
	 * @param urlConfigXml XML describing how to scrape the URL list.
	 * @param pageConfigXml XML describing how to scrape the individual pages found.
	 * @return The number of URLs processed, or -1 if the config could not be loaded.
	 */
	protected int scrapeUrlList(File urlConfigXml, File pageConfigXml)
	{
		return scrapeUrlList(new HashMap(), urlConfigXml, pageConfigXml);
	}

	/**
	 * Scrapes a list of URLs which are automatically derived from a page.
	 * @param setup Optional configuration for the script
	 * @param urlConfigXml XML describing how to scrape the URL list.
	 * @param pageConfigXml XML describing how to scrape the individual pages found.
	 * @return The number of URLs processed, or -1 if the config could not be loaded.
	 * @throws NullPointerException If the setup map is null.
	 */
	protected int scrapeUrlList(Map setup, File urlConfigXml, File pageConfigXml)
	{
		try
		{
			if (LOG.isDebugEnabled())	LOG.debug("Starting scrape with temp folder "+working.getAbsolutePath()+"...");
			// generate a one-off scraper based on preloaded configuration
			ScraperConfiguration config = new ScraperConfiguration(urlConfigXml);
			Scraper scraper = new Scraper(config, working.getAbsolutePath());
			// initialise any config
			setupScraperContext(setup, scraper);
			// run the script
			scraper.execute();

			// rip the URL list out of the scraped content
			ScraperContext context = scraper.getContext();
			int i=1;
			Variable scrapedUrl;
			if (LOG.isDebugEnabled())	LOG.debug("Scraping performed, pulling URLs '"+SCRAPED_URL_VARIABLE_PREFIX+"n' from "+context.size()+" variables, starting with "+i+"...");
			while ((scrapedUrl = (Variable) context.get(SCRAPED_URL_VARIABLE_PREFIX+i))  != null)
			{
				if (LOG.isTraceEnabled())	LOG.trace("Found "+SCRAPED_URL_VARIABLE_PREFIX+i+": "+scrapedUrl.toString());
				// parse this URL
				setup.put(START_URL_VARIABLE, scrapedUrl.toString());
				scrapeUrl(setup, pageConfigXml);
				// move on
				i++;
			}
			if (LOG.isDebugEnabled())	LOG.debug("No more URLs found.");
			// i has already been incremented past the last URL found
			return i - 1;
		}
		catch (FileNotFoundException e)
		{
			if (LOG.isErrorEnabled())	LOG.error("Could not find config file '"+urlConfigXml.getAbsolutePath()+"' - no scraping was done for this WebHarvest XML.", e);
			return -1;
		}
		finally
		{
			working.delete();
		}
	}

	/**
	 * Scrapes an individual page, and passes the results on for processing.
	 * The script must contain a hardcoded URL.
	 * @param configXml XML describing how to scrape an individual page.
	 */
	protected void scrapeUrl(File configXml)
	{
		scrapeUrl((String)null, configXml);
	}

	/**
	 * Scrapes an individual page, and passes the results on for processing.
	 * @param url The URL to scrape. If null, the URL must be set in the config itself.
	 * @param configXml XML describing how to scrape an individual page.
	 */
	protected void scrapeUrl(String url, File configXml)
	{
		Map setup = new HashMap();
		if (url!=null)	setup.put(START_URL_VARIABLE, url);
		scrapeUrl(setup, configXml);
	}

	/**
	 * Scrapes an individual page, and passes the results on for processing.
	 * @param setup Optional configuration for the script
	 * @param configXml XML describing how to scrape an individual page.
	 */
	protected void scrapeUrl(Map setup, File configXml)
	{
		try
		{
			if (LOG.isDebugEnabled())	LOG.debug("Starting scrape with temp folder "+working.getAbsolutePath()+"...");
			// generate a one-off scraper based on preloaded configuration
			ScraperConfiguration config = new ScraperConfiguration(configXml);
			Scraper scraper = new Scraper(config, working.getAbsolutePath());
			setupScraperContext(setup, scraper);
			scraper.execute();

			// handle contents in some way
			pageScraped((String)setup.get(START_URL_VARIABLE), scraper.getContext());

			if (LOG.isDebugEnabled())	LOG.debug("Page scraping complete.");
		}
		catch (FileNotFoundException e)
		{
			if (LOG.isErrorEnabled())	LOG.error("Could not find config file '"+configXml.getAbsolutePath()+"' - no scraping was done for this WebHarvest XML.", e);

		}
		finally
		{
			working.delete();
		}
	}

	/**
	 * @param setup Any variables to be set before the script runs.
	 * @param scraper The object which does the scraping.
	 */
	private void setupScraperContext(Map setup, Scraper scraper)
	{
		if (setup!=null)
			for (String key : setup.keySet())
				scraper.getContext().setVar(key, setup.get(key));
	}

	/**
	 * Process a page that was scraped.
	 * @param url The URL that was scraped.
	 * @param context The contents of the scraped page.
	 */
	public abstract void pageScraped(String url, ScraperContext context);
}

Scraping a new set of data then becomes as simple as extending the class, passing in appropriate config, and pulling out whatever variables you want every time a page is scraped:

package scrape;

import org.webharvest.runtime.ScraperContext;
import org.webharvest.runtime.variables.Variable;

public class ActualScraper extends QuickScraper
{
	public static void main(String[] args)
	{
		try
		{
			ActualScraper scraper = new ActualScraper();
			// do the scraping
			scraper.scrapeUrlList("config/se.urls.xml", "config/se.page.xml");
		}
		catch (Exception e)
		{
			e.printStackTrace();
		}
	}

	/**
	 * @see scrape.QuickScraper#pageScraped(java.lang.String, org.webharvest.runtime.ScraperContext)
	 */
	public void pageScraped(String url, ScraperContext context)
	{
		Variable nameVar = context.getVar("name");
		if (nameVar==null)
		{
			if (LOG.isWarnEnabled())	LOG.warn("Scrape for "+url+" produced no data! Ignoring");
			return;
		}

		// store this handset's details
		if (LOG.isInfoEnabled())	LOG.info(nameVar.toString()+" has "+context.getVar("screen.width").toString()+"x"+context.getVar("screen.height").toString()+" screen");
	}
}

So there you have it – a powerful, configurable and highly effective web scraping system with almost no code written!

Angles of View and Crop Factors

// February 5th, 2010 // 1 Comment » // Photography

Here are some more diagrams I drew to try and understand photography, specifically to understand crop factors and focal lengths.

Sensor Crop Factors

Every camera has a rectangular sensor which captures light. A “full frame” sensor is the same size as an old 35mm film negative – 36×24mm. Expensive SLRs use full frame sensors, but cheaper ones (such as any Canon in the x0D, xx0D and x000D ranges) use smaller sensors. The size of the sensor is described by its crop factor.
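A quick sketch of where those numbers come from (the sensor dimensions below are approximate published figures, not from the post): the crop factor is the full frame diagonal divided by the smaller sensor's diagonal.

```java
public class CropFactor {
    /** Crop factor: the full frame diagonal divided by this sensor's diagonal. */
    static double cropFactor(double sensorWidthMm, double sensorHeightMm) {
        double fullFrameDiagonal = Math.hypot(36, 24); // ~43.3mm for a 36x24mm frame
        return fullFrameDiagonal / Math.hypot(sensorWidthMm, sensorHeightMm);
    }

    public static void main(String[] args) {
        System.out.printf("Full frame (36.0 x 24.0mm): %.2fx%n", cropFactor(36, 24));
        System.out.printf("Canon APS-C (22.3 x 14.9mm): %.2fx%n", cropFactor(22.3, 14.9)); // ~1.6x
        System.out.printf("Nikon DX (23.6 x 15.7mm): %.2fx%n", cropFactor(23.6, 15.7));    // ~1.5x
    }
}
```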

This diagram demonstrates the common crop factors found in digital SLRs (click to enlarge):

This explains why a larger crop factor will reduce vignetting (the darkening around the corners of a photo):

Note that we are talking here about conventional lenses designed for full frame cameras. Most manufacturers now also produce lenses which only work on crop factor sensors, such as Canon’s EF-S lenses. These project a circle which does not reach the edge of a full frame sensor; understandably, these will vignette worse than the equivalent length lens designed for a full frame sensor.

Camera Lens Angles of View

The focal length of a lens is the number, quoted in mm, which indicates how close or far away things will look through it. This can be considered as an ‘angle of view’ – a larger focal length will produce a smaller angle of view – and a closer image. Think of it as the angle from the lens to the left edge across to the right edge of the frame.
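The relationship can be written down exactly (standard geometry, not from the diagrams): for a sensor dimension d and focal length f, the angle of view is 2·arctan(d / 2f). A sketch comparing horizontal angles on full frame and a 1.6x crop:

```java
public class AngleOfView {
    /** Horizontal angle of view in degrees: 2 * arctan(sensorWidth / (2 * focalLength)), both in mm. */
    static double horizontalAov(double sensorWidthMm, double focalLengthMm) {
        return Math.toDegrees(2 * Math.atan(sensorWidthMm / (2 * focalLengthMm)));
    }

    public static void main(String[] args) {
        double fullFrameWidth = 36.0;
        double cropWidth = 36.0 / 1.6; // a 1.6x crop sensor is 1.6x smaller in each dimension
        for (double f : new double[]{ 16, 24, 50, 200 }) {
            System.out.printf("%3.0fmm lens: %5.1f deg on full frame, %5.1f deg on 1.6x crop%n",
                    f, horizontalAov(fullFrameWidth, f), horizontalAov(cropWidth, f));
        }
    }
}
```

The same lens always projects the same image; the smaller sensor just records a narrower slice of it, which is why a 50mm lens "feels like" a longer lens on a crop body.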

Angles of view for a “full frame” sensor (click to enlarge):

In contrast, here are the angles of view for a 1.6x crop sensor:

For easier comparison, here they are side by side: