Blog

Scrape the web with Goutte


Last month's adventures included building a web scraper. Working to a tight schedule, I poked around the tubes and decided to give Goutte a whirl. Goutte is a simple wrapper around Guzzle and a bunch of Symfony components (such as BrowserKit and DomCrawler). In theory this makes grabbing a webpage as simple as:

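	// Assumes Goutte was installed with Composer.
	require 'vendor/autoload.php';

	use Goutte\Client;

	$client = new Client();
	// Fetch the page; request() returns a DomCrawler Crawler.
	$crawler = $client->request('GET', 'http://www.symfony.com/blog/');
	// Collect the text of every link matching the CSS selector.
	$titles = $crawler->filter('h2.post > a')->each(function ($node) {
		return $node->text();
	});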

This is a simple and powerful offering - however, bundling together a number of components like this comes at a cost as soon as you want to stray outside of the basics. The site I was scraping was an ASP site, a platform that one site describes as posing "some of the hardest challenges" in scraping. So some tips and tricks with Goutte:

Disable the JIRA reindex button

Atlassian's issue tracker JIRA maintains file-based indexes to make looking up issues faster. Certain changes (such as adding or changing custom fields) will make JIRA prompt administrators to rebuild these indexes. Unfortunately, this process makes JIRA inaccessible until it is complete (up to 20 minutes for a medium-sized instance on version 5, less on version 6). Version 6 added background reindexing; however, in my experience it takes hours and uses up nearly all the CPU on the server.


PHP Soap Server - Procedure not present

While developing a PHP SOAP service, I ran across this error after doing some updates to a WSDL:

Procedure '[name]' not present

This seems to suggest a problem with the class loaded into the SOAP server, but in fact it is typically more of a problem with the WSDL. There are a number of causes for this; I've rounded up a few of them here...

Comment Spam Sucks

I have spent considerable time over the past week going through the backlog of unapproved comments on my blog (yes, which had one post until now). Somewhere upwards of 500 comments, every single one of them spam. Mostly drug-related posts, with the occasional fake designer shoes or handbags thrown in. Last time it was a spammer trying to post large chunks of misquoted Ender’s Saga with a lone spam link in the middle, which at least proved a little interesting.
