Scrape the web with Goutte

Last month's adventures included building a web scraper. Working to a tight schedule, I poked around the tubes and decided to give Goutte a whirl. Goutte is a simple wrapper around Guzzle and a bunch of Symfony components (such as BrowserKit and DomCrawler). In theory this makes grabbing a webpage as simple as:

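	use Goutte\Client;

	// Fetch the page; request() returns a DomCrawler for the response.
	$client = new Client();
	$crawler = $client->request('GET', 'http://www.symfony.com/blog/');

	// Collect the text of every link matching the CSS selector.
	$titles = $crawler->filter('h2.post > a')->each(function ($node) {
		return $node->text();
	});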

This is a simple and powerful offering. However, bundling together a number of components like this comes at a cost as soon as you want to stray outside the basics. The site I was scraping was an ASP site, a kind of site that one source describes as posing "some of the hardest challenges" in scraping. So, some tips and tricks with Goutte: