A month ago’s adventures included building a web scraper. Working to a tight schedule, I poked around the tubes and decided to give Goutte a whirl. Goutte is a simple wrapper around Guzzle and a bunch of Symfony components (such as BrowserKit and DomCrawler). In theory this makes grabbing a webpage as simple as:
use Goutte\Client;

$client = new Client();
$crawler = $client->request('GET', 'http://www.symfony.com/blog/');
$titles = $crawler->filter('h2.post > a')->each(function ($node) {
    return $node->text();
});
This is a simple and powerful offering, but bundling together a number of components like this comes at a cost as soon as you want to stray outside of the basics. The site I was scraping was an ASP site, a kind of site described elsewhere as presenting “some of the hardest challenges” in scraping. So, some tips and tricks with Goutte:
Using a proxy with Goutte
To use a proxy you need to pass the details to the Guzzle client it uses like so:
use Goutte\Client;

$client = new Client();
$guzzle = $client->getClient();
$guzzle->setDefaultOption('proxy', 'http://proxy:8080');
$client->setClient($guzzle);
Add a form field
The form I was dealing with had some extra fields added by JavaScript when it was submitted. To add these, they need to be inserted into the HTML before using DomCrawler’s form() function:
$crawler = $client->request('GET', $url);
$html = $crawler->html();
$newHtml = str_replace( '
There is an open issue for this problem on GitHub.
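As a rough illustration of the HTML-editing step, here is a minimal sketch in plain PHP. The helper name injectHiddenField() and the field name extraField are made up for this example and are not part of Goutte or DomCrawler; the sketch simply places a hidden input just before the first closing form tag.

```php
<?php
// Hypothetical helper (not part of Goutte or DomCrawler): inserts a hidden
// input immediately before the first closing </form> tag in the given HTML.
function injectHiddenField(string $html, string $name, string $value): string
{
    $field = sprintf(
        '<input type="hidden" name="%s" value="%s" />',
        htmlspecialchars($name, ENT_QUOTES),
        htmlspecialchars($value, ENT_QUOTES)
    );

    // Case-insensitive search, since older ASP pages often use upper-case tags.
    $pos = stripos($html, '</form>');
    if ($pos === false) {
        return $html; // no form found, so leave the HTML untouched
    }

    // A length of 0 makes substr_replace() insert rather than overwrite.
    return substr_replace($html, $field, $pos, 0);
}

// Illustrative markup and field name; a real page will differ.
$html = '<form action="/search"><input type="submit" name="searchButton" /></form>';
$newHtml = injectHiddenField($html, 'extraField', '1');
```

The modified HTML then has to be loaded back into the crawler before calling form(). Assuming a DomCrawler version that provides them, one way is $crawler->clear() followed by $crawler->addHtmlContent($newHtml).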
Remove a form field
Removing a field, on the other hand, is straightforward:
$form = $crawler->selectButton('searchButton')->form();
$form->remove('dontSendMe');
Fix form elements with full stops in them
Thanks to the use of parse_str in DomCrawler’s Form object, any full stops or spaces in field names are replaced with underscores. This is terribly unhelpful, and requires bypassing Goutte/BrowserKit’s built-in submit() function:
$form = $crawler->selectButton('searchButton')->form();
$formPhpValues = $form->getPhpValues();
$newPhpValues = $this->replaceKeys('_', '.', $formPhpValues);
$submit = $client->request($form->getMethod(), $form->getUri(), $newPhpValues, $form->getPhpFiles());
This assumes there aren’t any fields that are meant to have an underscore.
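The replaceKeys() method called above is not shown in the post. The sketch below first demonstrates the parse_str behaviour that causes the problem, then gives one possible standalone version of such a helper. The recursive handling of nested arrays and the sample field names are assumptions for illustration.

```php
<?php
// Hypothetical stand-in for the replaceKeys() helper used above: rewrites
// every string key in a (possibly nested) array, leaving the values untouched.
function replaceKeys(string $search, string $replace, array $values): array
{
    $result = [];
    foreach ($values as $key => $value) {
        // Integer keys (list indexes) are kept as they are.
        $newKey = is_string($key) ? str_replace($search, $replace, $key) : $key;
        $result[$newKey] = is_array($value)
            ? replaceKeys($search, $replace, $value)
            : $value;
    }

    return $result;
}

// parse_str() converts the dot in the field name to an underscore.
parse_str('search.name=smith', $parsed);
// $parsed is now ['search_name' => 'smith']

// Made-up form values in the shape getPhpValues() might return them.
$formPhpValues = ['search_name' => 'smith', 'filters' => ['date_from' => '2014-01-01']];
$newPhpValues = replaceKeys('_', '.', $formPhpValues);
// $newPhpValues is ['search.name' => 'smith', 'filters' => ['date.from' => '2014-01-01']]
```

Because every underscore is converted, the same caveat applies: a field whose real name contains an underscore would be mangled by this approach.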
Conclusion
Goutte is a powerful tool for making web scraping a simple task, but probably not the best choice if you are dealing with ASP forms 🙂 Alternatives that might be worth a look: Simple HTML DOM or Selenium.
It’s impossible to complete asp form with https… or I don’t know. It’s returning null.
Any idea what’s good for ASP forms? I was trying out CasperJS 🙂
I can finally edit the HTML! I had no idea about clear(). Thank you so much!