WebHarvest: Easy Web Scraping from Java

// February 15th, 2010 // Dev, Web

I’ve been experimenting with data visualisation for a while now, mostly for Masabi’s business plan, though I hope to share some offshoots soon.

I often need to quickly scrape some data out of a web page (or list of web pages) so it can be fed into Excel and on to specialist data visualisation tools like Tableau (available in a free public edition here – my initial impressions are positive, but it’s early days yet).

To this end I have turned to WebHarvest, an excellent scriptable open source API for web scraping in Java. I really really like it, but there are some quirks and setup issues that have cost me hours, so I thought I’d roll together a tutorial with the fixes.

WebHarvest Config for Maven

When it works, Maven is a lovely tool for hiding dependency management in Java projects, but WebHarvest is not configured quite right out of the box to work transparently with it. (Describing Maven is beyond the scope of this post, but if you don’t know it, it’s easy to set up with the M2 plugin for Eclipse.)

This is the Maven POM I ended up with to use WebHarvest in a new Java SE project:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<groupId>WebScraping</groupId>
	<artifactId>WebScraping</artifactId>
	<packaging>jar</packaging>
	<version>0.00.01</version>
	<properties>
		<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
	</properties>

	<build>
		<plugins>
			<plugin>
				<artifactId>maven-compiler-plugin</artifactId>
				<configuration>
					<source>1.6</source>
					<target>1.6</target>
				</configuration>
			</plugin>
		</plugins>
	</build>

	<repositories>
		<repository>
			<id>wso2</id>
			<url>http://dist.wso2.org/maven2/</url>
		</repository>
		<repository>
			<id>maven-repository-1</id>
			<url>http://repo1.maven.org/maven2/</url>
		</repository>
	</repositories>
	<dependencies>
		<dependency>
			<groupId>commons-logging</groupId>
			<artifactId>commons-logging</artifactId>
			<version>1.1</version>
			<type>jar</type>
			<scope>compile</scope>
		</dependency>
		<dependency>
			<groupId>log4j</groupId>
			<artifactId>log4j</artifactId>
			<version>1.2.12</version>
			<type>jar</type>
			<scope>compile</scope>
		</dependency>
		<dependency>
			<groupId>org.webharvest.wso2</groupId>
			<artifactId>webharvest-core</artifactId>
			<version>1.0.0.wso2v1</version>
			<type>jar</type>
			<scope>compile</scope>
		</dependency>
		<!-- web harvest pom doesn't track dependencies well -->
		<dependency>
			<groupId>net.sf.saxon</groupId>
			<artifactId>saxon-xom</artifactId>
			<version>8.7</version>
		</dependency>
		<dependency>
			<groupId>org.htmlcleaner</groupId>
			<artifactId>htmlcleaner</artifactId>
			<version>1.55</version>
		</dependency>
		<dependency>
			<groupId>bsh</groupId>
			<artifactId>bsh</artifactId>
			<version>1.3.0</version>
		</dependency>
		<dependency>
			<groupId>commons-httpclient</groupId>
			<artifactId>commons-httpclient</artifactId>
			<version>3.1</version>
		</dependency>
	</dependencies>
</project>

You’ll note that WebHarvest’s dependencies had to be added explicitly, because its jar does not come with a working POM listing them.

Writing A Scraping Script

WebHarvest uses XML configuration files to describe how to scrape a site – and with a few lines of Java code you can run any XML configuration and have access to any properties that the script identified from the page. This is definitely the safest way to scrape data, as it decouples the code from the web page markup – so if the site you are scraping goes through a redesign, you can quickly adjust the config files without recompiling the code they pass data to.
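
The harness class later in this post wraps all of this up reusably, but the “few lines of Java” really are few. Here is a minimal sketch – the config path and the "name" variable are just placeholders for whatever your own script defines, and the calls are the same ones the harness below uses:

package scrape;

import java.io.File;

import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;
import org.webharvest.runtime.variables.Variable;

public class MinimalScrape
{
	public static void main(String[] args) throws Exception
	{
		// load an XML scraping config and run it, using "temp" as a working folder
		ScraperConfiguration config = new ScraperConfiguration(new File("config/example.xml"));
		Scraper scraper = new Scraper(config, "temp");
		scraper.execute();

		// every var-def the script created is now available from the context
		Variable name = scraper.getContext().getVar("name");
		System.out.println(name == null ? "nothing scraped" : name.toString());
	}
}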

The site has some good example scripts to show you how to get started, so I won’t repeat them here. The easiest way to create your own is to run the WebHarvest GUI from the command line, start with a sample script, and then hack it around to get what you want – it’s an easy iterative process with good feedback in the UI.

As a simple example, this is a script to go to the Sony-Ericsson developer site’s handset gallery at http://developer.sonyericsson.com/device/searchDevice.do?restart=true, and rip each handset’s individual spec page URI:

<?xml version="1.0" encoding="UTF-8"?>
<config>
	<!-- indicates we want a loop through the list defined in <list>, running <body> for each item, where the variable uid holds the current item's value and i holds its index -->
	<loop item="uid" index="i">
		<!-- the list section defines what we will loop over - here, it pulls out the value attribute of all option tags -->
		<list>
			<xpath expression="//option/@value">
				<html-to-xml>
					<http url="http://developer.sonyericsson.com/device/searchDevice.do?restart=true"/>
				</html-to-xml>
			</xpath>
		</list>
		<!-- the body section lists instructions which are run for every iteration of the loop -->
		<body>
			<!-- we define a new variable for every iteration, using the iteration count as a suffix  -->
			<var-def name="uri.${i}">
				<!-- template tag is important, else the $ var syntax will be ignored and won't do any value substitutions -->
				<template>device/loadDevice.do?id=${uid}</template>
			</var-def>
		</body>
	</loop>
</config>

The handset URIs will end up in a list of variables, from uri.1 to uri.N.
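
As a quick sketch, this is how those numbered variables come back out of the scraper context once the script has run – the reusable harness class further down does exactly this inside its scrapeUrlList() loop (the config path matches the one used in the final example):

package scrape;

import java.io.File;

import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;
import org.webharvest.runtime.ScraperContext;
import org.webharvest.runtime.variables.Variable;

public class UriListExample
{
	public static void main(String[] args) throws Exception
	{
		// run the URI-list config shown above
		ScraperConfiguration config = new ScraperConfiguration(new File("config/se.urls.xml"));
		Scraper scraper = new Scraper(config, "temp");
		scraper.execute();

		// the script defined uri.1, uri.2, ... so walk them until one is missing
		ScraperContext context = scraper.getContext();
		int i = 1;
		Variable scrapedUri;
		while ((scrapedUri = (Variable) context.get("uri." + i)) != null)
		{
			System.out.println("uri." + i + " = " + scrapedUri.toString());
			i++;
		}
	}
}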

The XML configuration’s syntax can take a little getting used to – it appeared quite backwards to me at first, but by messing around in the GUI you can experiment and learn pretty fast. With a basic understanding of XPath to identify parts of the web page, and perhaps a little regular expression knowledge to get at information surrounded by plain text, you can perform some very powerful scraping.

We can then define another script which takes one of these URIs and pulls a few pieces of information out of the page – in this example, the handset’s name and its screen resolution:

<?xml version="1.0" encoding="UTF-8"?>
<config>
	<!-- get the entire page -->
	<var-def name="wholepage">
		<html-to-xml>
			<!-- NEVER try and pass in the entire URL as a single variable here! -->
			<http url="http://developer.sonyericsson.com/${uri}"/>
		</html-to-xml>
	</var-def>
	<!-- rip out the block with the specifications -->
	<var-def name="specsheet">
		<xpath expression="//div[@class='phone-specs']">
			<var name="wholepage"/>
		</xpath>
	</var-def>
	<!-- find the handset's name -->
	<var-def name="name">
		<xpath expression="//h5[contains(text(),'Phone Model')]/following-sibling::p[1]/text()">
			<var name="specsheet"/>
		</xpath>
	</var-def>
	<!-- identify the screen resolution -->
	<regexp>
		<regexp-pattern>([\d]*)x([\d]*)</regexp-pattern>
		<regexp-source>
			<xpath expression="//h5[contains(text(),'Screen Sizes')]/following-sibling::p[1]/text()">
				<var name="specsheet"/>
			</xpath>
		</regexp-source>
		<regexp-result>
			<var-def name="screen.width"><template>${_1}</template></var-def>
			<var-def name="screen.height"><template>${_2}</template></var-def>
		</regexp-result>
	</regexp>
</config>
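
If the ${_1} and ${_2} templates in that <regexp> block look cryptic, they just refer to the pattern's captured groups. Here is the same extraction as a plain-Java sketch (the "240x320 pixels" input is an invented stand-in for the spec page text):

package scrape;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScreenSizeRegexExample
{
	public static void main(String[] args)
	{
		// same pattern as the <regexp-pattern> above, applied to an invented sample string
		Matcher m = Pattern.compile("([\\d]*)x([\\d]*)").matcher("240x320 pixels");
		if (m.find())
		{
			System.out.println("screen.width  = " + m.group(1)); // ${_1} in the script
			System.out.println("screen.height = " + m.group(2)); // ${_2} in the script
		}
	}
}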

At this point I should note the biggest gotcha with WebHarvest, which just cost me three hours of hair tearing. In the script, this line defines the page to scrape: <http url="http://developer.sonyericsson.com/${uri}"/>, where ${uri} is a variable specified at runtime to define a URI. This works.

If you were to substitute in this perfectly sensible alternative: <http url="${url}"/>, you would end up with a completely obscure runtime exception a little like this:

Exception in thread "main" org.webharvest.exception.ScriptException: Cannot set variable in scripter: Field access: bsh.ReflectError: No such field: 1
	at org.webharvest.runtime.scripting.BeanShellScriptEngine.setVariable(Unknown Source)
	at org.webharvest.runtime.scripting.ScriptEngine.pushAllVariablesFromContextToScriptEngine(Unknown Source)
	at org.webharvest.runtime.scripting.BeanShellScriptEngine.eval(Unknown Source)
	at org.webharvest.runtime.templaters.BaseTemplater.execute(Unknown Source)
	at org.webharvest.runtime.processors.TemplateProcessor.execute(Unknown Source)
	at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
	at org.webharvest.runtime.processors.BodyProcessor.execute(Unknown Source)
	at org.webharvest.runtime.processors.VarDefProcessor.execute(Unknown Source)
	at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
	at org.webharvest.runtime.processors.BodyProcessor.execute(Unknown Source)
	at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
	at org.webharvest.runtime.processors.LoopProcessor.execute(Unknown Source)
	at org.webharvest.runtime.processors.BaseProcessor.run(Unknown Source)
	at org.webharvest.runtime.Scraper.execute(Unknown Source)
	at org.webharvest.runtime.Scraper.execute(Unknown Source)
	at scrape.QuickScraper.scrapeUrlList(QuickScraper.java:82)
	at scrape.QuickScraper.scrapeUrlList(QuickScraper.java:49)
	at scrape.ActualScraper.main(DhfScraper.java:37)
Caused by: Field access: bsh.ReflectError: No such field: 1 : at Line: -1 : in file:  : 

	at bsh.UtilEvalError.toEvalError(Unknown Source)
	at bsh.UtilEvalError.toEvalError(Unknown Source)
	at bsh.Interpreter.set(Unknown Source)
	... 18 more

You have been warned!
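
The safe pattern, then, is the one the example scripts already use: keep the host hard-coded in the config's <http> tag, and only pass the changing path fragment in as a variable before the scraper runs. A minimal sketch using the same calls as the harness class in the next section (the handset id here is a made-up value, and the config path matches the final example):

package scrape;

import java.io.File;

import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;
import org.webharvest.runtime.variables.Variable;

public class SafeUriExample
{
	public static void main(String[] args) throws Exception
	{
		ScraperConfiguration config = new ScraperConfiguration(new File("config/se.page.xml"));
		Scraper scraper = new Scraper(config, "temp");
		// set only the path fragment - the config itself prepends http://developer.sonyericsson.com/
		scraper.getContext().setVar("uri", "device/loadDevice.do?id=1234"); // made-up example id
		scraper.execute();

		Variable name = scraper.getContext().getVar("name");
		System.out.println(name == null ? "no name scraped" : name.toString());
	}
}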

Running The Scripts From Java

WebHarvest requires very little code to run. I created this little reusable harness class to quickly run the two types of script – one to pull information from a page, and one to farm URLs from which to scrape data. You can use the first without the second, of course.

package scrape;

import java.io.*;
import java.util.*;

import org.apache.commons.logging.*;
import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.*;
import org.webharvest.runtime.variables.Variable;

/**
 * Quick hackable web scraping class.
 * @author Tom Godber
 */
public abstract class QuickScraper
{
	/** Logging object. */
	protected final Log LOG = LogFactory.getLog(getClass());
	/** Prefix for any scraped variable which defines a URI to visit; it will be followed by a counter (matching uri.1, uri.2... in the URL list script). */
	public static final String SCRAPED_URL_VARIABLE_PREFIX = "uri.";
	/** The variable name which holds the URI of the page being scraped (matching ${uri} in the page script). */
	public static final String START_URL_VARIABLE = "uri";

	/** A temporary working folder. */
	private File working = new File("temp");

	/** Ensures temp folder exists. */
	public QuickScraper()
	{
		working.mkdirs();
	}

	/**
	 * Scrapes a list of URLs which are automatically derived from a page.
	 * The initial URL must be set in the actual URL list config XML.
	 * @param urlConfigXml Path of an XML describing how to scrape the URL list.
	 * @param pageConfigXml Path of an XML describing how to scrape the individual pages found.
	 * @return The number of URLs processed, or -1 if the config could not be loaded.
	 */
	protected int scrapeUrlList(String urlConfigXml, String pageConfigXml)
	{
		return scrapeUrlList(new HashMap(), urlConfigXml, pageConfigXml);
	}

	/**
	 * Scrapes a list of URLs which are automatically derived from a page.
	 * @param setup Optional configuration for the script
	 * @param urlConfigXml Path of an XML describing how to scrape the URL list.
	 * @param pageConfigXml Path of an XML describing how to scrape the individual pages found.
	 * @return The number of URLs processed, or -1 if the config could not be loaded.
	 */
	protected int scrapeUrlList(Map setup, String urlConfigXml, String pageConfigXml)
	{
		return scrapeUrlList(setup, new File(urlConfigXml), new File(pageConfigXml));
	}

	/**
	 * Scrapes a list of URLs which are automatically derived from a page.
	 * The initial URL must be set in the actual URL list config XML.
	 * @param urlConfigXml XML describing how to scrape the URL list.
	 * @param pageConfigXml XML describing how to scrape the individual pages found.
	 * @return The number of URLs processed, or -1 if the config could not be loaded.
	 */
	protected int scrapeUrlList(File urlConfigXml, File pageConfigXml)
	{
		return scrapeUrlList(new HashMap(), urlConfigXml, pageConfigXml);
	}

	/**
	 * Scrapes a list of URLs which are automatically derived from a page.
	 * @param setup Optional configuration for the script
	 * @param urlConfigXml XML describing how to scrape the URL list.
	 * @param pageConfigXml XML describing how to scrape the individual pages found.
	 * @return The number of URLs processed, or -1 if the config could not be loaded.
	 * @throws NullPointerException If the setup map is null.
	 */
	protected int scrapeUrlList(Map setup, File urlConfigXml, File pageConfigXml)
	{
		try
		{
			if (LOG.isDebugEnabled())	LOG.debug("Starting scrape with temp folder "+working.getAbsolutePath()+"...");
			// generate a one-off scraper based on preloaded configuration
			ScraperConfiguration config = new ScraperConfiguration(urlConfigXml);
			Scraper scraper = new Scraper(config, working.getAbsolutePath());
			// initialise any config
			setupScraperContext(setup, scraper);
			// run the script
			scraper.execute();

			// rip the URL list out of the scraped content
			ScraperContext context = scraper.getContext();
			int i=1;
			Variable scrapedUrl;
			if (LOG.isDebugEnabled())	LOG.debug("Scraping performed, pulling URLs '"+SCRAPED_URL_VARIABLE_PREFIX+"n' from "+context.size()+" variables, starting with "+i+"...");
			while ((scrapedUrl = (Variable) context.get(SCRAPED_URL_VARIABLE_PREFIX+i))  != null)
			{
				if (LOG.isTraceEnabled())	LOG.trace("Found "+SCRAPED_URL_VARIABLE_PREFIX+i+": "+scrapedUrl.toString());
				// parse this URL
				setup.put(START_URL_VARIABLE, scrapedUrl.toString());
				scrapeUrl(setup, pageConfigXml);
				// move on
				i++;
			}
			if (LOG.isDebugEnabled())	LOG.debug("No more URLs found.");
			return i-1;
		}
		catch (FileNotFoundException e)
		{
			if (LOG.isErrorEnabled())	LOG.error("Could not find config file '"+urlConfigXml.getAbsolutePath()+"' - no scraping was done for this WebHarvest XML.", e);
			return -1;
		}
		finally
		{
			working.delete();
		}
	}

	/**
	 * Scrapes an individual page, and passes the results on for processing.
	 * The script must contain a hardcoded URL.
	 * @param configXml XML describing how to scrape an individual page.
	 */
	protected void scrapeUrl(File configXml)
	{
		scrapeUrl((String)null, configXml);
	}

	/**
	 * Scrapes an individual page, and passes the results on for processing.
	 * @param url The URL to scrape. If null, the URL must be set in the config itself.
	 * @param configXml XML describing how to scrape an individual page.
	 */
	protected void scrapeUrl(String url, File configXml)
	{
		Map setup = new HashMap();
		if (url!=null)	setup.put(START_URL_VARIABLE, url);
		scrapeUrl(setup, configXml);
	}

	/**
	 * Scrapes an individual page, and passes the results on for processing.
	 * @param setup Optional configuration for the script
	 * @param configXml XML describing how to scrape an individual page.
	 */
	protected void scrapeUrl(Map setup, File configXml)
	{
		try
		{
			if (LOG.isDebugEnabled())	LOG.debug("Starting scrape with temp folder "+working.getAbsolutePath()+"...");
			// generate a one-off scraper based on preloaded configuration
			ScraperConfiguration config = new ScraperConfiguration(configXml);
			Scraper scraper = new Scraper(config, working.getAbsolutePath());
			setupScraperContext(setup, scraper);
			scraper.execute();

			// handle contents in some way
			pageScraped((String)setup.get(START_URL_VARIABLE), scraper.getContext());

			if (LOG.isDebugEnabled())	LOG.debug("Page scraping complete.");
		}
		catch (FileNotFoundException e)
		{
			if (LOG.isErrorEnabled())	LOG.error("Could not find config file '"+configXml.getAbsolutePath()+"' - no scraping was done for this WebHarvest XML.", e);

		}
		finally
		{
			working.delete();
		}
	}

	/**
	 * @param setup Any variables to be set before the script runs.
	 * @param scraper The object which does the scraping.
	 */
	private void setupScraperContext(Map setup, Scraper scraper)
	{
		if (setup!=null)
			for (Object key : setup.keySet())
				scraper.getContext().setVar((String) key, setup.get(key));
	}

	/**
	 * Process a page that was scraped.
	 * @param url The URL that was scraped.
	 * @param context The contents of the scraped page.
	 */
	public abstract void pageScraped(String url, ScraperContext context);
}

Scraping a new set of data then becomes as simple as extending the class, passing in appropriate config, and pulling out whatever variables you want every time a page is scraped:

package scrape;

import org.webharvest.runtime.ScraperContext;
import org.webharvest.runtime.variables.Variable;

public class ActualScraper extends QuickScraper
{
	public static void main(String[] args)
	{
		try
		{
			ActualScraper scraper = new ActualScraper();
			// do the scraping
			scraper.scrapeUrlList("config/se.urls.xml", "config/se.page.xml");
		}
		catch (Exception e)
		{
			e.printStackTrace();
		}
	}

	/**
	 * @see scrape.QuickScraper#pageScraped(java.lang.String, org.webharvest.runtime.ScraperContext)
	 */
	public void pageScraped(String url, ScraperContext context)
	{
		Variable nameVar = context.getVar("name");
		if (nameVar==null)
		{
			if (LOG.isWarnEnabled())	LOG.warn("Scrape for "+url+" produced no data! Ignoring");
			return;
		}

		// store this handset's details
		if (LOG.isInfoEnabled())	LOG.info(nameVar.toString()+" has "+context.getVar("screen.width").toString()+"x"+context.getVar("screen.height").toString()+" screen");
	}
}

So there you have it – a powerful, configurable and highly effective web scraping system with almost no code written!

9 Responses to “WebHarvest: Easy Web Scraping from Java”

  1. [...] using java sample Filed under Uncategorized [...]

  2. Armen says:

    Hey I included your page link in my blog for java examples If that is ok with you. However, thought you might be able to help me figure out my problem. I run this script and I get a xquery error, why?
    here is my code:

    http://www.courtinfo.ca.gov/reference/rfp/


  3. Daniel Bell says:

    Thanks for the post! For anyone wanting to use webharvest 2 with Maven, try the following snippet:


  4. Daniel Bell says:

    Hmm… let’s try that again..


  5. Daniel Bell says:

    Third time’s the charm:
    <dependency>
    <groupId>org.webharvest.wso2</groupId>
    <artifactId>webharvest-core</artifactId>
    <version>2.0.0.wso2v1</version>
    <exclusions>
    <exclusion>
    <groupId>commons-email</groupId>
    <artifactId>commons-email</artifactId>
    </exclusion>
    <exclusion>
    <artifactId>saxon</artifactId>
    <groupId>net.sf.saxon</groupId>
    </exclusion>
    <exclusion>
    <groupId>org.htmlcleaner</groupId>
    <artifactId>htmlcleaner</artifactId>
    </exclusion>
    </exclusions>
    </dependency>
    <dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-email</artifactId>
    <version>1.2</version>
    </dependency>
    <dependency>
    <groupId>net.sf.saxon</groupId>
    <artifactId>saxon</artifactId>
    <version>8.7</version><!-- Should be 8.9 -->
    </dependency>
    <dependency>
    <groupId>net.sourceforge.htmlcleaner</groupId>
    <artifactId>htmlcleaner</artifactId>
    <version>2.2</version>
    </dependency>

  6. Khaleel says:

    scrapeUrl: java.lang.IllegalArgumentException: Host name may not be null, when I try calling scrapeUrl directly. What has gone wrong?

  7. Malik says:

    I need help in regards to some issues i am facing regarding the extraction of data can i ask here?

  8. Tom Godber says:

    I’m afraid I don’t check comments regularly and haven’t had to configure WebHarvest for a while so I doubt I’d be much help!
