<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: WebHarvest: Easy Web Scraping from Java</title>
	<atom:link href="http://masochismtango.com/2010/02/15/webharvest-web-scraping-from-java/feed/" rel="self" type="application/rss+xml" />
	<link>http://masochismtango.com/2010/02/15/webharvest-web-scraping-from-java/</link>
	<description></description>
	<lastBuildDate>Tue, 02 Aug 2011 09:59:57 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.2</generator>
	<item>
		<title>By: Khaleel</title>
		<link>http://masochismtango.com/2010/02/15/webharvest-web-scraping-from-java/comment-page-1/#comment-234</link>
		<dc:creator>Khaleel</dc:creator>
		<pubDate>Sun, 03 Jul 2011 17:13:37 +0000</pubDate>
		<guid isPermaLink="false">http://masochismtango.com/?p=478#comment-234</guid>
		<description>I get "java.lang.IllegalArgumentException: Host name may not be null" when I try calling Scrapeurl directly. What has gone wrong?</description>
		<content:encoded><![CDATA[<p>I get "java.lang.IllegalArgumentException: Host name may not be null" when I try calling Scrapeurl directly. What has gone wrong?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: ikhsan</title>
		<link>http://masochismtango.com/2010/02/15/webharvest-web-scraping-from-java/comment-page-1/#comment-228</link>
		<dc:creator>ikhsan</dc:creator>
		<pubDate>Sun, 24 Apr 2011 10:41:21 +0000</pubDate>
		<guid isPermaLink="false">http://masochismtango.com/?p=478#comment-228</guid>
		<description>Web scraping technique with Java
http://ikhsanaulia.blogspot.com/2011/04/teknik-dasar-web-scraping-menggunakan.html</description>
		<content:encoded><![CDATA[<p>Web scraping technique with Java<br />
<a href="http://ikhsanaulia.blogspot.com/2011/04/teknik-dasar-web-scraping-menggunakan.html" rel="nofollow">http://ikhsanaulia.blogspot.com/2011/04/teknik-dasar-web-scraping-menggunakan.html</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Daniel Bell</title>
		<link>http://masochismtango.com/2010/02/15/webharvest-web-scraping-from-java/comment-page-1/#comment-226</link>
		<dc:creator>Daniel Bell</dc:creator>
		<pubDate>Sat, 05 Mar 2011 13:46:59 +0000</pubDate>
		<guid isPermaLink="false">http://masochismtango.com/?p=478#comment-226</guid>
		<description>Third time&#039;s the charm:
&lt;dependency&gt;
    &lt;groupId&gt;org.webharvest.wso2&lt;/groupId&gt;
    &lt;artifactId&gt;webharvest-core&lt;/artifactId&gt;
    &lt;version&gt;2.0.0.wso2v1&lt;/version&gt;
    &lt;exclusions&gt;
        &lt;exclusion&gt;
            &lt;groupId&gt;commons-email&lt;/groupId&gt;
            &lt;artifactId&gt;commons-email&lt;/artifactId&gt;
        &lt;/exclusion&gt;
        &lt;exclusion&gt;
            &lt;artifactId&gt;saxon&lt;/artifactId&gt;
            &lt;groupId&gt;net.sf.saxon&lt;/groupId&gt;
        &lt;/exclusion&gt;
        &lt;exclusion&gt;
            &lt;groupId&gt;org.htmlcleaner&lt;/groupId&gt;  
            &lt;artifactId&gt;htmlcleaner&lt;/artifactId&gt;  
        &lt;/exclusion&gt;
    &lt;/exclusions&gt;
&lt;/dependency&gt;
&lt;dependency&gt;
    &lt;groupId&gt;org.apache.commons&lt;/groupId&gt;
    &lt;artifactId&gt;commons-email&lt;/artifactId&gt;
    &lt;version&gt;1.2&lt;/version&gt;
&lt;/dependency&gt;
&lt;dependency&gt;
    &lt;groupId&gt;net.sf.saxon&lt;/groupId&gt;
    &lt;artifactId&gt;saxon&lt;/artifactId&gt;
    &lt;version&gt;8.7&lt;/version&gt;&lt;!-- Should be 8.9 --&gt;
&lt;/dependency&gt;
&lt;dependency&gt;
    &lt;groupId&gt;net.sourceforge.htmlcleaner&lt;/groupId&gt;
    &lt;artifactId&gt;htmlcleaner&lt;/artifactId&gt;
    &lt;version&gt;2.2&lt;/version&gt;
&lt;/dependency&gt;</description>
		<content:encoded><![CDATA[<p>Third time&#8217;s the charm:<br />
&lt;dependency&gt;<br />
    &lt;groupId&gt;org.webharvest.wso2&lt;/groupId&gt;<br />
    &lt;artifactId&gt;webharvest-core&lt;/artifactId&gt;<br />
    &lt;version&gt;2.0.0.wso2v1&lt;/version&gt;<br />
    &lt;exclusions&gt;<br />
        &lt;exclusion&gt;<br />
            &lt;groupId&gt;commons-email&lt;/groupId&gt;<br />
            &lt;artifactId&gt;commons-email&lt;/artifactId&gt;<br />
        &lt;/exclusion&gt;<br />
        &lt;exclusion&gt;<br />
            &lt;artifactId&gt;saxon&lt;/artifactId&gt;<br />
            &lt;groupId&gt;net.sf.saxon&lt;/groupId&gt;<br />
        &lt;/exclusion&gt;<br />
        &lt;exclusion&gt;<br />
            &lt;groupId&gt;org.htmlcleaner&lt;/groupId&gt;<br />
            &lt;artifactId&gt;htmlcleaner&lt;/artifactId&gt;<br />
        &lt;/exclusion&gt;<br />
    &lt;/exclusions&gt;<br />
&lt;/dependency&gt;<br />
&lt;dependency&gt;<br />
    &lt;groupId&gt;org.apache.commons&lt;/groupId&gt;<br />
    &lt;artifactId&gt;commons-email&lt;/artifactId&gt;<br />
    &lt;version&gt;1.2&lt;/version&gt;<br />
&lt;/dependency&gt;<br />
&lt;dependency&gt;<br />
    &lt;groupId&gt;net.sf.saxon&lt;/groupId&gt;<br />
    &lt;artifactId&gt;saxon&lt;/artifactId&gt;<br />
    &lt;version&gt;8.7&lt;/version&gt;&lt;!-- Should be 8.9 --&gt;<br />
&lt;/dependency&gt;<br />
&lt;dependency&gt;<br />
    &lt;groupId&gt;net.sourceforge.htmlcleaner&lt;/groupId&gt;<br />
    &lt;artifactId&gt;htmlcleaner&lt;/artifactId&gt;<br />
    &lt;version&gt;2.2&lt;/version&gt;<br />
&lt;/dependency&gt;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Daniel Bell</title>
		<link>http://masochismtango.com/2010/02/15/webharvest-web-scraping-from-java/comment-page-1/#comment-225</link>
		<dc:creator>Daniel Bell</dc:creator>
		<pubDate>Sat, 05 Mar 2011 13:45:24 +0000</pubDate>
		<guid isPermaLink="false">http://masochismtango.com/?p=478#comment-225</guid>
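		<description>Hmm... let&#039;s try that again..
&lt;dependency&gt;
    &lt;groupId&gt;org.webharvest.wso2&lt;/groupId&gt;
    &lt;artifactId&gt;webharvest-core&lt;/artifactId&gt;
    &lt;version&gt;2.0.0.wso2v1&lt;/version&gt;
    &lt;exclusions&gt;
        &lt;exclusion&gt;
            &lt;groupId&gt;commons-email&lt;/groupId&gt;
            &lt;artifactId&gt;commons-email&lt;/artifactId&gt;
        &lt;/exclusion&gt;
        &lt;exclusion&gt;
            &lt;artifactId&gt;saxon&lt;/artifactId&gt;
            &lt;groupId&gt;net.sf.saxon&lt;/groupId&gt;
        &lt;/exclusion&gt;
        &lt;exclusion&gt;
            &lt;groupId&gt;org.htmlcleaner&lt;/groupId&gt;
            &lt;artifactId&gt;htmlcleaner&lt;/artifactId&gt;
        &lt;/exclusion&gt;
    &lt;/exclusions&gt;
&lt;/dependency&gt;
&lt;dependency&gt;
    &lt;groupId&gt;org.apache.commons&lt;/groupId&gt;
    &lt;artifactId&gt;commons-email&lt;/artifactId&gt;
    &lt;version&gt;1.2&lt;/version&gt;
&lt;/dependency&gt;
&lt;dependency&gt;
    &lt;groupId&gt;net.sf.saxon&lt;/groupId&gt;
    &lt;artifactId&gt;saxon&lt;/artifactId&gt;
    &lt;version&gt;8.7&lt;/version&gt;&lt;!-- Should be 8.9 --&gt;
&lt;/dependency&gt;
&lt;dependency&gt;
    &lt;groupId&gt;net.sourceforge.htmlcleaner&lt;/groupId&gt;
    &lt;artifactId&gt;htmlcleaner&lt;/artifactId&gt;
    &lt;version&gt;2.2&lt;/version&gt;
&lt;/dependency&gt;</description>
		<content:encoded><![CDATA[<p>Hmm&#8230; let&#8217;s try that again..</p>
<p>&lt;dependency&gt;<br />
    &lt;groupId&gt;org.webharvest.wso2&lt;/groupId&gt;<br />
    &lt;artifactId&gt;webharvest-core&lt;/artifactId&gt;<br />
    &lt;version&gt;2.0.0.wso2v1&lt;/version&gt;<br />
    &lt;exclusions&gt;<br />
        &lt;exclusion&gt;<br />
            &lt;groupId&gt;commons-email&lt;/groupId&gt;<br />
            &lt;artifactId&gt;commons-email&lt;/artifactId&gt;<br />
        &lt;/exclusion&gt;<br />
        &lt;exclusion&gt;<br />
            &lt;artifactId&gt;saxon&lt;/artifactId&gt;<br />
            &lt;groupId&gt;net.sf.saxon&lt;/groupId&gt;<br />
        &lt;/exclusion&gt;<br />
        &lt;exclusion&gt;<br />
            &lt;groupId&gt;org.htmlcleaner&lt;/groupId&gt;<br />
            &lt;artifactId&gt;htmlcleaner&lt;/artifactId&gt;<br />
        &lt;/exclusion&gt;<br />
    &lt;/exclusions&gt;<br />
&lt;/dependency&gt;<br />
&lt;dependency&gt;<br />
    &lt;groupId&gt;org.apache.commons&lt;/groupId&gt;<br />
    &lt;artifactId&gt;commons-email&lt;/artifactId&gt;<br />
    &lt;version&gt;1.2&lt;/version&gt;<br />
&lt;/dependency&gt;<br />
&lt;dependency&gt;<br />
    &lt;groupId&gt;net.sf.saxon&lt;/groupId&gt;<br />
    &lt;artifactId&gt;saxon&lt;/artifactId&gt;<br />
    &lt;version&gt;8.7&lt;/version&gt;&lt;!-- Should be 8.9 --&gt;<br />
&lt;/dependency&gt;<br />
&lt;dependency&gt;<br />
    &lt;groupId&gt;net.sourceforge.htmlcleaner&lt;/groupId&gt;<br />
    &lt;artifactId&gt;htmlcleaner&lt;/artifactId&gt;<br />
    &lt;version&gt;2.2&lt;/version&gt;<br />
&lt;/dependency&gt;</p>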
]]></content:encoded>
	</item>
	<item>
		<title>By: Daniel Bell</title>
		<link>http://masochismtango.com/2010/02/15/webharvest-web-scraping-from-java/comment-page-1/#comment-224</link>
		<dc:creator>Daniel Bell</dc:creator>
		<pubDate>Sat, 05 Mar 2011 13:42:34 +0000</pubDate>
		<guid isPermaLink="false">http://masochismtango.com/?p=478#comment-224</guid>
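		<description>Thanks for the post! For anyone wanting to use webharvest 2 with Maven, try the following snippet:
&lt;dependency&gt;
    &lt;groupId&gt;org.webharvest.wso2&lt;/groupId&gt;
    &lt;artifactId&gt;webharvest-core&lt;/artifactId&gt;
    &lt;version&gt;2.0.0.wso2v1&lt;/version&gt;
    &lt;exclusions&gt;
        &lt;exclusion&gt;
            &lt;groupId&gt;commons-email&lt;/groupId&gt;
            &lt;artifactId&gt;commons-email&lt;/artifactId&gt;
        &lt;/exclusion&gt;
        &lt;exclusion&gt;
            &lt;artifactId&gt;saxon&lt;/artifactId&gt;
            &lt;groupId&gt;net.sf.saxon&lt;/groupId&gt;
        &lt;/exclusion&gt;
        &lt;exclusion&gt;
            &lt;groupId&gt;org.htmlcleaner&lt;/groupId&gt;
            &lt;artifactId&gt;htmlcleaner&lt;/artifactId&gt;
        &lt;/exclusion&gt;
    &lt;/exclusions&gt;
&lt;/dependency&gt;
&lt;dependency&gt;
    &lt;groupId&gt;org.apache.commons&lt;/groupId&gt;
    &lt;artifactId&gt;commons-email&lt;/artifactId&gt;
    &lt;version&gt;1.2&lt;/version&gt;
&lt;/dependency&gt;
&lt;dependency&gt;
    &lt;groupId&gt;net.sf.saxon&lt;/groupId&gt;
    &lt;artifactId&gt;saxon&lt;/artifactId&gt;
    &lt;version&gt;8.7&lt;/version&gt;&lt;!-- Should be 8.9 --&gt;
&lt;/dependency&gt;
&lt;dependency&gt;
    &lt;groupId&gt;net.sourceforge.htmlcleaner&lt;/groupId&gt;
    &lt;artifactId&gt;htmlcleaner&lt;/artifactId&gt;
    &lt;version&gt;2.2&lt;/version&gt;
&lt;/dependency&gt;</description>
		<content:encoded><![CDATA[<p>Thanks for the post! For anyone wanting to use webharvest 2 with Maven, try the following snippet:</p>
<p>&lt;dependency&gt;<br />
    &lt;groupId&gt;org.webharvest.wso2&lt;/groupId&gt;<br />
    &lt;artifactId&gt;webharvest-core&lt;/artifactId&gt;<br />
    &lt;version&gt;2.0.0.wso2v1&lt;/version&gt;<br />
    &lt;exclusions&gt;<br />
        &lt;exclusion&gt;<br />
            &lt;groupId&gt;commons-email&lt;/groupId&gt;<br />
            &lt;artifactId&gt;commons-email&lt;/artifactId&gt;<br />
        &lt;/exclusion&gt;<br />
        &lt;exclusion&gt;<br />
            &lt;artifactId&gt;saxon&lt;/artifactId&gt;<br />
            &lt;groupId&gt;net.sf.saxon&lt;/groupId&gt;<br />
        &lt;/exclusion&gt;<br />
        &lt;exclusion&gt;<br />
            &lt;groupId&gt;org.htmlcleaner&lt;/groupId&gt;<br />
            &lt;artifactId&gt;htmlcleaner&lt;/artifactId&gt;<br />
        &lt;/exclusion&gt;<br />
    &lt;/exclusions&gt;<br />
&lt;/dependency&gt;<br />
&lt;dependency&gt;<br />
    &lt;groupId&gt;org.apache.commons&lt;/groupId&gt;<br />
    &lt;artifactId&gt;commons-email&lt;/artifactId&gt;<br />
    &lt;version&gt;1.2&lt;/version&gt;<br />
&lt;/dependency&gt;<br />
&lt;dependency&gt;<br />
    &lt;groupId&gt;net.sf.saxon&lt;/groupId&gt;<br />
    &lt;artifactId&gt;saxon&lt;/artifactId&gt;<br />
    &lt;version&gt;8.7&lt;/version&gt;&lt;!-- Should be 8.9 --&gt;<br />
&lt;/dependency&gt;<br />
&lt;dependency&gt;<br />
    &lt;groupId&gt;net.sourceforge.htmlcleaner&lt;/groupId&gt;<br />
    &lt;artifactId&gt;htmlcleaner&lt;/artifactId&gt;<br />
    &lt;version&gt;2.2&lt;/version&gt;<br />
&lt;/dependency&gt;</p>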
]]></content:encoded>
	</item>
	<item>
		<title>By: Armen</title>
		<link>http://masochismtango.com/2010/02/15/webharvest-web-scraping-from-java/comment-page-1/#comment-197</link>
		<dc:creator>Armen</dc:creator>
		<pubDate>Mon, 19 Jul 2010 23:55:36 +0000</pubDate>
		<guid isPermaLink="false">http://masochismtango.com/?p=478#comment-197</guid>
		<description>Hey, I included a link to your page in my blog for Java examples, if that is OK with you. I also thought you might be able to help me figure out my problem: when I run this script I get an XQuery error. Why?
Here is my code:

    http://www.courtinfo.ca.gov/reference/rfp/

    &lt;![CDATA[  ]]&gt;

    &lt;!-- list consists of all rows in the main HTML table on the page --&gt;

    ${url}${pdfpage}

    &lt;![CDATA[
        declare variable $doc as node() external;
        declare variable $currDate as node() external;

        {normalize-space(data($row))}
        $currDate
    ]]&gt;

    &lt;![CDATA[  ]]&gt;
</description>
		<content:encoded><![CDATA[<p>Hey, I included a link to your page in my blog for Java examples, if that is OK with you. I also thought you might be able to help me figure out my problem: when I run this script I get an XQuery error. Why?<br />
Here is my code:</p>
<p>    	<a href="http://www.courtinfo.ca.gov/reference/rfp/" rel="nofollow">http://www.courtinfo.ca.gov/reference/rfp/</a></p>
<p>        &lt;![CDATA[  ]]&gt;</p>
<p>            <!-- list consists of all rows in the main HTML table on the page --></p>
<p>                			${url}${pdfpage}</p>
<p>                  &lt;![CDATA[<br />
                      declare variable $doc as node() external;<br />
                      declare variable $currDate as node() external;</p>
<p>                           {normalize-space(data($row))}<br />
                            $currDate </p>
<p>                      ]]&gt;</p>
<p>        &lt;![CDATA[  ]]&gt;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Web-Harvest &#124; Garejoor</title>
		<link>http://masochismtango.com/2010/02/15/webharvest-web-scraping-from-java/comment-page-1/#comment-196</link>
		<dc:creator>Web-Harvest &#124; Garejoor</dc:creator>
		<pubDate>Fri, 16 Jul 2010 16:01:26 +0000</pubDate>
		<guid isPermaLink="false">http://masochismtango.com/?p=478#comment-196</guid>
		<description>[...] using java sample [...]</description>
		<content:encoded><![CDATA[<p>[...] using java sample [...]</p>
]]></content:encoded>
	</item>
</channel>
</rss>

