PHP Proxy & Retry Scraper

Just another PHP boilerplate for a PHP-based scraper with built-in retry logic and flexible proxy configuration.

This is a simple PHP scraper/crawler that uses the Guzzle library to run requests in parallel. It retries failed requests up to n times and can send each request through a different proxy (which is really useful in some setups; see the proxy rotation sketch after the script). The script below fires 10 requests at https://url and retries each failing request up to 3 times (at most 4 requests per URL).

Requirements:

  • PHP
  • Composer
  • Guzzle
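
Assuming a standard Composer setup, Guzzle can be installed into the project with:

composer require guzzlehttp/guzzle
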
<?php

use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;
use GuzzleHttp\Pool;
use GuzzleHttp\Client;
use GuzzleHttp\Psr7\Response;

require __DIR__.'/vendor/autoload.php';

// Generator that yields one callable per request; the Pool pulls from it lazily.
$requestGenerator = function ($client) {
    for ($i = 0; $i < 10; $i++) {
        yield function () use ($client) {
            // Placeholder URL and proxy string; replace with your target and proxy provider.
            return $client->getAsync('https://url', ['proxy' => 'http://username:password@proxyprovider:proxyport']);
        };
    }
};

$stack = HandlerStack::create();
// Retry a request up to 3 times, but only when it actually failed:
// an exception (connection or HTTP error) or a 5xx response.
$stack->push(Middleware::retry(function ($retries, $request, $response = null, $exception = null) {
    if ($retries >= 3) {
        return false;
    }
    return $exception !== null || ($response !== null && $response->getStatusCode() >= 500);
}));
$client = new Client(['handler' => $stack]);

// Run the requests through a pool with at most 3 of them in flight at a time.
$pool = new Pool($client, $requestGenerator($client), [
    'concurrency' => 3,
    'fulfilled' => function (Response $response) {
        echo (string) $response->getBody().PHP_EOL;
    },
    'rejected' => function ($reason) {
        // $reason is the exception a request ended with after all retries,
        // e.g. a RequestException (HTTP error) or ConnectException (network error).
        echo 'Request failed: '.$reason->getMessage().PHP_EOL;
    },
]);

// Initiate the transfers and create a promise
$promise = $pool->promise();

// Force the pool of requests to complete.
$promise->wait();
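
The script above pushes every request through the same proxy string. To actually use a different proxy per request, the generator can pick one from a list on each iteration. The following is a minimal sketch assuming a hypothetical $proxies array; the proxy URLs are placeholders just like in the script above.

// Sketch: rotate through a list of proxies, one per request (round-robin).
$proxies = [
    'http://username:password@proxyprovider1:proxyport',
    'http://username:password@proxyprovider2:proxyport',
];

$requestGenerator = function ($client) use ($proxies) {
    for ($i = 0; $i < 10; $i++) {
        yield function () use ($client, $proxies, $i) {
            return $client->getAsync('https://url', [
                'proxy' => $proxies[$i % count($proxies)],
            ]);
        };
    }
};

Because the Pool only pulls requests from the generator as concurrency slots free up, each proxy is assigned lazily at the moment its request is created.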

Links