Scrape the web with Goutte

My adventures a month ago included building a web scraper. Working to a tight schedule, I poked around the tubes and decided to give Goutte a whirl. Goutte is a simple wrapper around Guzzle and a bunch of Symfony components (such as BrowserKit and DomCrawler). In theory this makes grabbing a webpage as simple as:

	use Goutte\Client;
 
	$client = new Client();
	$crawler = $client->request('GET', 'http://www.symfony.com/blog/');
	$titles = $crawler->filter('h2.post > a')->each(function ($node) {
		return $node->text();
	});

This is a simple and powerful offering. However, bundling together a number of components like this comes at a cost as soon as you want to stray beyond the basics. The site I was scraping was an ASP site, and ASP sites are described on one site as "some of the hardest challenges" in scraping. So here are some tips and tricks for Goutte:

Using a proxy with Goutte

To use a proxy, you need to pass the details to the Guzzle client that Goutte uses, like so:

 
	use Goutte\Client;
 
	$client = new Client();
	$guzzle = $client->getClient();
	$guzzle->setDefaultOption('proxy', 'http://proxy:8080');
	$client->setClient($guzzle);

Add a form field

The form I was dealing with had some extra fields added by JavaScript when it was submitted. To send these too, insert them into the HTML before calling DomCrawler's form() function:

	$crawler = $client->request('GET', $url);
	$html = $crawler->html();
 
	$newHtml = str_replace(
			'<table id="main_table" cellpadding="0"',
			'<input type="hidden" name="extraField" value="" >
			<table id="main_table" cellpadding="0"',
			$html);
 
	$crawler->clear();
	$crawler->addHtmlContent($newHtml);
	$form = $crawler->selectButton('searchButton')->form();

There is an open issue for this problem on GitHub.

Remove a form field

Removing a field, on the other hand, is straightforward:

	$form = $crawler->selectButton('searchButton')->form();
	$form->remove('dontSendMe');

Fix form elements with full stops in them

Thanks to the use of parse_str in DomCrawler's Form object, any full stops or spaces in field names are replaced with underscores. This is terribly unhelpful, and requires bypassing Goutte/BrowserKit's built-in submit() function:

 
	$form = $crawler->selectButton('searchButton')->form();
	$formPhpValues = $form->getPhpValues();
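	// replaceKeys() is a helper of your own (not part of Goutte) that swaps '_' back to '.' in the keys
	$newPhpValues = $this->replaceKeys('_', '.', $formPhpValues);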
 
	$submit = $client->request($form->getMethod(), $form->getUri(), $newPhpValues, $form->getPhpFiles());

This assumes there aren't any fields that are meant to have an underscore.
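
The replaceKeys() helper is not shown here, so what follows is only a minimal sketch of what it might look like, written as a plain function rather than a method and assuming PHP 7 or later. The field names in the example are made up for illustration. It also shows parse_str doing the mangling, and why the caveat about underscores matters: a name that originally contained a space comes back with a full stop instead.

	// Hypothetical helper: one possible implementation of replaceKeys().
	// Recursively replaces $search with $replace in every array key.
	function replaceKeys(string $search, string $replace, array $values): array
	{
		$result = [];
		foreach ($values as $key => $value) {
			// Nested fields (e.g. name="a[b]") arrive as nested arrays.
			if (is_array($value)) {
				$value = replaceKeys($search, $replace, $value);
			}
			// Cast so that integer keys are handled by str_replace() as well.
			$result[str_replace($search, $replace, (string) $key)] = $value;
		}
		return $result;
	}

	// parse_str() turns full stops and spaces in names into underscores.
	parse_str('ctl00.main.search=foo&page size=10', $parsed);
	// $parsed is ['ctl00_main_search' => 'foo', 'page_size' => '10']

	$restored = replaceKeys('_', '.', $parsed);
	// $restored is ['ctl00.main.search' => 'foo', 'page.size' => '10']

If the form mixes names that really do contain underscores or spaces with names that contain full stops, a blanket replacement like this is not enough, and you would need to map the affected keys individually.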

Conclusion

Goutte is a powerful tool for making web scraping a simple task, but probably not the best choice if you are dealing with ASP forms :-)

Alternatives that might be worth a look: Simple HTML DOM or Selenium.

Comments

I can finally edit the HTML! I had no idea about clear(). Thank you so much!

Any idea what's good for ASP forms? I was trying out CasperJS :)
