Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

curl - quickly validate a large list of URLs in PHP?

I have a database of content with free text in it. There are about 11000 rows of data, and each row has 87 columns, so there are (potentially) around 957000 fields to check for valid URLs.

I wrote a regular expression to extract everything that looks like a URL (http/s, etc.) and built up an array called $urls. I then loop through it, passing each $url to my curl_exec() call.
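
A minimal sketch of that extraction step, for illustration only. The helper name extract_urls() and the pattern are assumptions, not the asker's actual regular expression; the pattern is deliberately simple and will miss scheme-less links such as www.example.org.

```php
<?php
// Hypothetical helper: collect the unique http(s) URLs found in the free-text fields of one row.
function extract_urls(array $fields): array
{
    $urls = array();
    foreach ($fields as $text) {
        if (!is_string($text) || $text === '') {
            continue;
        }
        // Simple pattern: a scheme followed by everything up to whitespace, quotes or angle brackets.
        if (preg_match_all('~https?://[^\s"\'<>]+~i', $text, $matches)) {
            foreach ($matches[0] as $url) {
                // Drop trailing punctuation that usually belongs to the sentence, not to the URL.
                $urls[] = rtrim($url, '.,;:!?)');
            }
        }
    }
    // Remove duplicates and reindex from 0.
    return array_values(array_unique($urls));
}

$row = array(
    'See https://www.example.org/page, or http://www.example.org.',
    'No links here',
    'Again: https://www.example.org/page',
);
var_dump(extract_urls($row));
```

Deduplicating at this stage already cuts the number of requests, since the same link tends to appear in many rows.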

I have tried cURL (for each $url):

$ch = curl_init();
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT_MS, 250);
curl_setopt($ch, CURLOPT_NOBODY, 1);
curl_setopt($ch, CURLOPT_FAILONERROR, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECT_ONLY, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_HTTPGET, 1);
foreach ($urls as $url) {
    curl_setopt($ch, CURLOPT_URL, $url);
    $exec = curl_exec($ch);
    // Extra stuff here... it does add overhead, but not that much.
}
curl_close($ch);

As far as I can tell, this SHOULD work and be as fast as I can go, but it takes around 2-3 seconds per URL.

There has to be a faster way?

I am planning to run this via a cron job. It will first check my local database to see whether a URL has been checked in the last 30 days, and only check it again if not, so the workload will shrink over time. I just want to know whether cURL is the best solution, and whether I am missing something that would make it faster.
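
The 30-day rule described above could look roughly like this. It is only a sketch: urls_due_for_check() and the shape of $last_checked (url => Unix timestamp of the previous check, e.g. loaded from the local database) are assumptions made for the example.

```php
<?php
// Hypothetical helper: return the URLs that were never checked or whose last check is older than $max_age_days.
function urls_due_for_check(array $urls, array $last_checked, int $now, int $max_age_days = 30): array
{
    $cutoff = $now - $max_age_days * 86400; // 86400 seconds per day
    $due = array();
    foreach (array_unique($urls) as $url) {
        if (!isset($last_checked[$url]) || $last_checked[$url] < $cutoff) {
            $due[] = $url;
        }
    }
    return $due;
}

$now = 1700000000; // fixed timestamp so the example is reproducible; use time() in the cron job
$last_checked = array(
    'https://www.example.org/a' => $now - 10 * 86400, // checked 10 days ago
    'https://www.example.org/b' => $now - 40 * 86400, // checked 40 days ago
);
$urls = array('https://www.example.org/a', 'https://www.example.org/b', 'https://www.example.org/c');
var_dump(urls_due_for_check($urls, $last_checked, $now)); // b (stale) and c (never checked)
```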

EDIT: Based on the comment by Nick Zulu below, I now have this code:

function ODB_check_url_array($urls, $debug = true) {
  if (!empty($urls)) {
    $mh = curl_multi_init();
    foreach ($urls as $index => $url) {
      $ch[$index] = curl_init($url);
      curl_setopt($ch[$index], CURLOPT_CONNECTTIMEOUT_MS, 10000);
      curl_setopt($ch[$index], CURLOPT_NOBODY, 1);
      curl_setopt($ch[$index], CURLOPT_FAILONERROR, 1);
      curl_setopt($ch[$index], CURLOPT_RETURNTRANSFER, 1);
      curl_setopt($ch[$index], CURLOPT_CONNECT_ONLY, 1);
      curl_setopt($ch[$index], CURLOPT_HEADER, 1);
      curl_setopt($ch[$index], CURLOPT_HTTPGET, 1);
      curl_multi_add_handle($mh, $ch[$index]);
    }
    $running = null;
    do {
      curl_multi_exec($mh, $running);
    } while ($running);
    foreach ($ch as $index => $handle) {
      // key the result by URL; a curl handle cannot be used as an array key
      $return[$urls[$index]] = curl_multi_getcontent($handle);
      curl_multi_remove_handle($mh, $ch[$index]);
      curl_close($ch[$index]);
    }
    curl_multi_close($mh);
    return $return;
  }
}

1 Reply

let's see..

  • use the curl_multi api (it's the only sane choice for doing this in PHP)

  • have a max simultaneous connection limit; don't just create a connection for each url (you'll get out-of-memory or out-of-resource errors if you create a million simultaneous connections, and I wouldn't even trust the timeout errors you'd get that way)

  • only fetch the headers, because downloading the body would be a waste of time and bandwidth
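
To make the three points concrete, here is a deliberately simplified sketch that bounds concurrency with fixed-size batches. The function name head_check_in_batches() and its defaults are made up for this example. Batching is simpler than a rolling window of connections, but each batch waits for its slowest url, so treat it as an illustration of the idea rather than the fastest approach.

```php
<?php
// Sketch: check URLs in batches so that at most $max_connections transfers are open at once.
// Returns url => HTTP status code (0 when no HTTP response was received).
function head_check_in_batches(array $urls, int $max_connections = 50, int $timeout_ms = 10000): array
{
    $status = array();
    $unique = array_values(array_unique($urls));
    foreach (array_chunk($unique, max(1, $max_connections)) as $batch) {
        $mh = curl_multi_init();
        $handles = array();
        foreach ($batch as $url) {
            $ch = curl_init($url);
            if ($ch === false) {
                $status[$url] = 0;
                continue;
            }
            curl_setopt_array($ch, array(
                CURLOPT_NOBODY => true,         // HEAD request: headers only, no body
                CURLOPT_RETURNTRANSFER => true, // never echo anything
                CURLOPT_TIMEOUT_MS => $timeout_ms,
            ));
            curl_multi_add_handle($mh, $ch);
            $handles[$url] = $ch;
        }
        do {
            $mrc = curl_multi_exec($mh, $still_running);
            if ($still_running > 0) {
                curl_multi_select($mh, 1.0); // wait for activity instead of spinning
            }
        } while ($still_running > 0 && $mrc === CURLM_OK);
        foreach ($handles as $url => $ch) {
            $status[$url] = (int)curl_getinfo($ch, CURLINFO_HTTP_CODE);
            curl_multi_remove_handle($mh, $ch);
        }
        // The easy handles are released when $handles is reset for the next batch.
        curl_multi_close($mh);
    }
    return $status;
}

// Usage (performs real network requests):
// $status = head_check_in_batches(array('https://www.example.org'), 50, 5000);
```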

here is my attempt:

// if return_fault_reason is false, the return value is a simple array of the urls that validated.
// otherwise it's an array with the url as the key, containing array(bool validated, int curl_error_code, string reason) for every url.
// (a required parameter after optional ones is deprecated since PHP 8.0, so return_fault_reason has a default as well)
function validate_urls(array $urls, int $max_connections, int $timeout_ms = 10000, bool $consider_http_300_redirect_as_error = true, bool $return_fault_reason = true) : array
{
    if ($max_connections < 1) {
        throw new InvalidArgumentException("max_connections MUST be >=1");
    }
    foreach ($urls as $key => $foo) {
        if (!is_string($foo)) {
            throw new InvalidArgumentException("all urls must be strings!");
        }
        if (empty($foo)) {
            unset($urls[$key]); //?
        }
    }
    unset($foo);
    $urls = array_unique($urls); // remove duplicates.
    $ret = array();
    $mh = curl_multi_init();
    $workers = array();
    // curl handles are resources before PHP 8.0 and CurlHandle objects since 8.0; objects cannot be cast to int,
    // so derive the $workers key in a way that works for both.
    $handle_id = function ($handle) : int {
        return is_object($handle) ? spl_object_id($handle) : (int)$handle;
    };
    $work = function () use (&$ret, &$workers, $mh, $return_fault_reason, $consider_http_300_redirect_as_error, $handle_id) {
        // > If an added handle fails very quickly, it may never be counted as a running_handle
        while (1) {
            curl_multi_exec($mh, $still_running);
            if ($still_running < count($workers)) {
                break;
            }
            // block (for at most 10 seconds) until there is activity on some connection
            curl_multi_select($mh, 10);
        }
        while (false !== ($info = curl_multi_info_read($mh))) {
            if ($info['msg'] !== CURLMSG_DONE) {
                continue;
            }
            $key = $handle_id($info['handle']);
            assert(isset($workers[$key]));
            $url = $workers[$key];
            if ($info['result'] !== CURLE_OK) {
                // the transfer itself failed (timeout, dns failure, connection refused, ...)
                if ($return_fault_reason) {
                    $ret[$url] = array(false, $info['result'], "curl_exec error " . $info['result'] . ": " . curl_strerror($info['result']));
                }
            } elseif (CURLE_OK !== ($err = curl_errno($info['handle']))) {
                if ($return_fault_reason) {
                    $ret[$url] = array(false, $err, "curl error " . $err . ": " . curl_strerror($err));
                }
            } else {
                $code = (string)curl_getinfo($info['handle'], CURLINFO_HTTP_CODE);
                if ($code[0] === "3") {
                    if ($consider_http_300_redirect_as_error) {
                        if ($return_fault_reason) {
                            $ret[$url] = array(false, -1, "got a http " . $code . " redirect, which is considered an error");
                        }
                    } else {
                        if ($return_fault_reason) {
                            $ret[$url] = array(true, 0, "got a http " . $code . " redirect, which is considered a success");
                        } else {
                            $ret[] = $url;
                        }
                    }
                } elseif ($code[0] === "2") {
                    if ($return_fault_reason) {
                        $ret[$url] = array(true, 0, "got a http " . $code . " code, which is considered a success");
                    } else {
                        $ret[] = $url;
                    }
                } else {
                    // all non-2xx and non-3xx are always considered errors (500 internal server error, 400 client error, 404 not found, etc.)
                    if ($return_fault_reason) {
                        $ret[$url] = array(false, -1, "got a http " . $code . " code, which is considered an error");
                    }
                }
            }
            curl_multi_remove_handle($mh, $info['handle']);
            unset($workers[$key]);
            curl_close($info['handle']);
        }
    };
    foreach ($urls as $url) {
        while (count($workers) >= $max_connections) {
            //echo "TOO MANY WORKERS!\n";
            $work();
        }
        $neww = curl_init($url);
        if (!$neww) {
            trigger_error("curl_init() failed! probably means that max_connections is too high and you ran out of resources", E_USER_WARNING);
            if ($return_fault_reason) {
                $ret[$url] = array(false, -1, "curl_init() failed");
            }
            continue;
        }
        $workers[$handle_id($neww)] = $url;
        curl_setopt_array($neww, array(
            CURLOPT_NOBODY => 1, // HEAD request: only the headers are fetched
            CURLOPT_SSL_VERIFYHOST => 0, // certificate checks are off: we only test reachability
            CURLOPT_SSL_VERIFYPEER => 0,
            CURLOPT_TIMEOUT_MS => $timeout_ms
        ));
        curl_multi_add_handle($mh, $neww);
        //curl_multi_exec($mh, $unused_here); LIKELY TO BE MUCH SLOWER IF DONE IN THIS LOOP: TOO MANY SYSCALLS
    }
    while (count($workers) > 0) {
        //echo "WAITING FOR WORKERS TO BECOME 0!";
        //var_dump(count($workers));
        $work();
    }
    curl_multi_close($mh);
    return $ret;
}

here is some test code

$urls = [
    'www.example.org',
    'www.google.com',
    'https://www.google.com',
];
var_dump(validate_urls($urls, 1000, 1, true, false));

returns

array(0) {
}

because they all timed out (1 millisecond timeout), and fail reason reporting was disabled (that's the last argument). with reporting enabled:

$urls = [
    'www.example.org',
    'www.google.com',
    'https://www.google.com',
];
var_dump(validate_urls($urls, 1000, 1, true, true));

returns

array(3) {
  ["www.example.org"]=>
  array(3) {
    [0]=>
    bool(false)
    [1]=>
    int(28)
    [2]=>
    string(39) "curl_exec error 28: Timeout was reached"
  }
  ["www.google.com"]=>
  array(3) {
    [0]=>
    bool(false)
    [1]=>
    int(28)
    [2]=>
    string(39) "curl_exec error 28: Timeout was reached"
  }
  ["https://www.google.com"]=>
  array(3) {
    [0]=>
    bool(false)
    [1]=>
    int(28)
    [2]=>
    string(39) "curl_exec error 28: Timeout was reached"
  }
}
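
With a few hundred thousand urls you probably don't want to read that array by eye. A small sketch of how the detailed result could be condensed into counts per failure reason; summarize_failures() is a made-up helper that assumes the array(bool, int, string) entry shape shown above.

```php
<?php
// Hypothetical helper: count the failed URLs per reason string in a detailed validate_urls() result.
function summarize_failures(array $results): array
{
    $summary = array();
    foreach ($results as $url => $entry) {
        list($ok, $code, $reason) = $entry;
        if ($ok) {
            continue; // only failures are counted
        }
        $summary[$reason] = ($summary[$reason] ?? 0) + 1;
    }
    arsort($summary); // most frequent reason first
    return $summary;
}

// Sample data copied from the output above.
$results = array(
    'www.example.org' => array(false, 28, 'curl_exec error 28: Timeout was reached'),
    'www.google.com' => array(false, 28, 'curl_exec error 28: Timeout was reached'),
    'https://www.google.com' => array(false, 28, 'curl_exec error 28: Timeout was reached'),
);
var_dump(summarize_failures($results));
```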

increasing the timeout to 1000 milliseconds,

var_dump(validate_urls($urls, 1000, 1000, true, false));

returns

array(3) {
  [0]=>
  string(14) "www.google.com"
  [1]=>
  string(22) "https://www.google.com"
  [2]=>
  string(15) "www.example.org"
}

and

var_dump(validate_urls($urls, 1000, 1000, true, true));

returns

array(3) {
  ["www.google.com"]=>
  array(3) {
    [0]=>
    bool(true)
    [1]=>
    int(0)
    [2]=>
    string(50) "got a http 200 code, which is considered a success"
  }
  ["www.example.org"]=>
  array(3) {
    [0]=>
    bool(true)
    [1]=>
    int(0)
    [2]=>
    string(50) "got a http 200 code, which is considered a success"
  }
  ["https://www.google.com"]=>
  array(3) {
    [0]=>
    bool(true)
    [1]=>
    int(0)
    [2]=>
    string(50) "got a http 200 code, which is considered a success"
  }
}

and so on :) the speed should depend mostly on your bandwidth and on the $max_connections argument, which is configurable.

