Scrape the web with Goutte


My adventures a month ago included building a web scraper. Working to a tight schedule, I poked around the tubes and decided to give Goutte a whirl. Goutte is a simple wrapper around Guzzle and a bunch of Symfony components (such as BrowserKit and DomCrawler). In theory this makes grabbing a webpage as simple as:

	use Goutte\Client;
 
	$client = new Client();
	$crawler = $client->request('GET', 'http://www.symfony.com/blog/');
	$titles = $crawler->filter('h2.post > a')->each(function ($node) {
		return $node->text();
	});

This is a simple and powerful offering - however, bundling together a number of components like this comes at a cost as soon as you want to stray outside of the basics. The site I was scraping was an ASP site, a type described on one site as "some of the hardest challenges" in scraping. So some tips and tricks with Goutte:

Disable the JIRA reindex button

Atlassian's issue tracker JIRA maintains file-based indexes to make looking up issues faster. Certain changes (such as adding or changing custom fields) will make JIRA prompt administrators to rebuild these indexes. Unfortunately, this process makes JIRA inaccessible until it is complete (up to 20 minutes for a medium-sized instance on version 5, less on version 6). Version 6 added background reindexing, but in my experience it takes hours and uses up nearly all the CPU on the server.


PHP Soap Server - Procedure not present

While developing a PHP SOAP service, I ran across this error after doing some updates to a WSDL:

Procedure '[name]' not present

This seems to suggest a problem with the class loaded into the SOAP server, but in fact it is typically more of a problem with the WSDL. There are a number of causes for this, and I've rounded up a few of them here...
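A commonly reported cause after editing a WSDL is PHP's WSDL cache still serving the old copy. As a minimal sketch, assuming that is the culprit (it may not be one of the causes rounded up in the full post), the cache can be switched off while developing. The helper name `disableWsdlCache` is made up for illustration; only the `soap.wsdl_cache_enabled` setting comes from the SOAP extension itself.

```php
<?php
// Assumption: the SOAP server is reading a stale cached WSDL.
// disableWsdlCache() is a hypothetical helper name, not part of ext-soap.
function disableWsdlCache(): bool
{
    if (!extension_loaded('soap')) {
        return false; // ext-soap is missing, so there is nothing to configure
    }
    // soap.wsdl_cache_enabled is a standard ext-soap ini setting.
    ini_set('soap.wsdl_cache_enabled', '0');
    return true;
}

// Call this before constructing the SoapServer.
$cacheDisabled = disableWsdlCache();
```

Passing `'cache_wsdl' => WSDL_CACHE_NONE` in the SoapServer options array should have a similar effect for a single server. Copies that were already cached may also need deleting from the directory named by `soap.wsdl_cache_dir`.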

Comment Spam Sucks

I have spent considerable time over the past week going through the backlog of unapproved comments on my blog (yes, which had one post until now). Somewhere upwards of 500 comments, every single one of them spam. Mostly drug-related posts with the occasional fake designer shoes or handbags thrown in. Last time it was a spammer trying to post large chunks of misquoted Ender’s Saga with a lone spam link in the middle, which at least proved a little interesting.

Skipping version control files while FTPing

If you are developing on your desktop and FTPing files up to your server, you can quickly end up with .svn or .git folders and files hanging around making a mess of things (you are using version control, right?). Apart from being untidy, these files can become a security risk if they end up on a live site. There are a number of ways to keep them at bay - you can avoid selecting them for upload or mark them as hidden - but another convenient way is to tell your FTP client not to send them.

Looking at two popular clients - FileZilla and WinSCP - we can see how this can be done. Essentially we are filtering what is sent by name, so the technique can apply to any other files you wouldn't want to upload (such as Windows' "desktop.ini" files). You'll just have to keep in mind that it is happening in the background - trying to debug where plugin.svn.php got to could be frustrating.
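To make the name-filtering idea (and the plugin.svn.php trap) concrete, here is a rough sketch in PHP. It is not how FileZilla or WinSCP implement their filters. The function name `shouldSkip`, the pattern lists and the file names are all assumptions for illustration. It compares an exact-name filter with a looser wildcard one.

```php
<?php
// Hypothetical helper: decide whether a path should be skipped on upload.
// The patterns are illustrative assumptions, not FileZilla or WinSCP settings.
function shouldSkip(string $path, array $patterns): bool
{
    // Check every path segment so files inside an excluded folder are skipped too.
    foreach (explode('/', $path) as $segment) {
        foreach ($patterns as $pattern) {
            if (fnmatch($pattern, $segment)) {
                return true;
            }
        }
    }
    return false;
}

$exact = ['.svn', '.git', 'desktop.ini'];     // match whole names only
$loose = ['*.svn*', '*.git*', 'desktop.ini']; // wildcards on both sides

$files = ['index.php', '.svn/entries', 'lib/.git/config', 'plugin.svn.php', 'img/desktop.ini'];

// Keep only the files that no pattern excludes.
$keptExact = array_values(array_filter($files, fn($f) => !shouldSkip($f, $exact)));
$keptLoose = array_values(array_filter($files, fn($f) => !shouldSkip($f, $loose)));
```

With the exact names, index.php and plugin.svn.php are both kept. With the looser wildcards only index.php survives, and plugin.svn.php is silently left behind, which is the kind of surprise worth remembering when a file seems to go missing.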