Scrape the web with Goutte

Last month’s adventures included building a web scraper. Working to a tight schedule, I poked around the tubes and decided to give Goutte a whirl. Goutte is a simple wrapper around Guzzle and a bunch of Symfony components (such as BrowserKit and DomCrawler). In theory this makes grabbing a webpage as simple as:

use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'http://www.symfony.com/blog/');
$titles = $crawler->filter('h2.post > a')->each(function ($node) {
    return $node->text();
});

This is a simple and powerful offering. However, bundling together a number of components like this comes at a cost as soon as you want to stray outside the basics. The site I was scraping was an ASP site, a kind of site described elsewhere as posing “some of the hardest challenges” in scraping. So here are some tips and tricks with Goutte:

Using a proxy with Goutte

To use a proxy, pass the details to the Guzzle client that Goutte uses, like so:

use Goutte\Client;

$client = new Client();
$guzzle = $client->getClient();
$guzzle->setDefaultOption('proxy', 'http://proxy:8080');
$client->setClient($guzzle);

Add a form field

The form I was dealing with had some extra fields added by JavaScript when it was submitted. Goutte does not run JavaScript, so these fields need to be inserted into the HTML before calling DomCrawler’s form() function:

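$crawler = $client->request('GET', $url);
$html = $crawler->html();

// The search and replacement strings were lost from the original listing.
// The two below are placeholders: adjust them to the form and the field
// you actually need to add.
$newHtml = str_replace(
        '</form>',
        '<input type="hidden" name="extraField" value="extraValue" /></form>',
        $html
);

// Swap the crawler's content for the modified HTML, then pick up the form.
$crawler->clear();
$crawler->addHtmlContent($newHtml);
$form = $crawler->selectButton('searchButton')->form();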

There is an open issue for this problem on GitHub.

Remove a form field

Removing a field, on the other hand, is straightforward:

$form = $crawler->selectButton('searchButton')->form();
$form->remove('dontSendMe');

Fix form elements with full stops in them

Thanks to the use of parse_str in DomCrawler’s Form object, any full stops or spaces in field names are replaced with underscores. This is terribly unhelpful, and working around it requires bypassing Goutte/BrowserKit’s built-in submit() function:

$form = $crawler->selectButton('searchButton')->form();
$formPhpValues = $form->getPhpValues();
$newPhpValues = $this->replaceKeys('_', '.', $formPhpValues);

$submit = $client->request($form->getMethod(), $form->getUri(), $newPhpValues, $form->getPhpFiles());

This assumes no field name is meant to contain an underscore, because every underscore in a key gets turned into a full stop.
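
The replaceKeys() call in the snippet above is a helper of my own rather than part of Goutte or DomCrawler, and its body isn't shown. A minimal sketch of what such a helper might look like follows, written as a standalone function instead of a class method and assuming nested field arrays should be handled recursively. The field name ctl00.search is made up for illustration.

```php
<?php

/**
 * Hypothetical helper: recursively replace $search with $replace
 * in every string key of $values. Values are left untouched.
 */
function replaceKeys(string $search, string $replace, array $values): array
{
    $result = [];
    foreach ($values as $key => $value) {
        // Integer keys (for example from "field[]" inputs) are kept as-is.
        $newKey = is_string($key) ? str_replace($search, $replace, $key) : $key;
        $result[$newKey] = is_array($value)
            ? replaceKeys($search, $replace, $value)
            : $value;
    }

    return $result;
}

// parse_str() swaps the full stop in the field name for an underscore...
parse_str('ctl00.search=php&page=2', $parsed);

// ...and the helper puts it back before the request is sent.
$restored = replaceKeys('_', '.', $parsed);
```

Here $parsed ends up with the key ctl00_search, while $restored has ctl00.search again. A more careful version could compare the keys against the name attributes in the original HTML and only rewrite the ones that really contained a full stop.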

Conclusion

Goutte is a powerful tool for making web scraping a simple task, but probably not the best choice if you are dealing with ASP forms 🙂 Alternatives that might be worth a look: Simple HTML DOM or Selenium.
