javascript - Looping over urls to do the same thing

Question

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

I am trying to scrape a few sites. Here is my code:

for (var i = 0; i < urls.length; i++) {
    url = urls[i];
    console.log("Start scraping: " + url);

    page.open(url, function () {
        waitFor(function() {
            return page.evaluate(function() {
                return document.getElementById("progressWrapper").childNodes.length == 1;
            });

        }, function() {
            var price = page.evaluate(function() {
                // do something
                return price;
            });

            console.log(price);
            result = url + " ; " + price;
            output = output + "\n" + result;
        });
    });

}
fs.write('test.txt', output);
phantom.exit();

I want to scrape all sites in the array urls, extract some information and then write this information to a text file.

But there seems to be a problem with the for loop. Scraping a single site without a loop works as I want. With the loop, nothing happens at first, and then the line

console.log("Start scraping: " + url);

is printed one time too many. If urls = [a, b, c], PhantomJS prints:

Start scraping: a 
Start scraping: b 
Start scraping: c 
Start scraping:

It seems that page.open isn't called at all. I am new to JS, so I am sorry if this is a stupid question.

1 Reply

深蓝 · Answer 1 · 2021-10-16T23:50:08+0000

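PhantomJS is asynchronous: page.open() returns immediately and runs its callback later, once the page has loaded. By calling page.open() repeatedly in a loop on the same page object, you overwrite the current request with a new one before it has finished, and that one is overwritten in turn. The loop also runs to completion before any callback fires, so fs.write() and phantom.exit() are reached before anything has been scraped. You need to open the pages one after the other, for example like this: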

page.open(url, function () {
    waitFor(function() {
       // something
    }, function() {
        page.open(url, function () {
            waitFor(function() {
               // something
            }, function() {
                // and so on
            });
        });
    });
});
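
The timing problem described above can be reproduced without PhantomJS. The following is only a minimal sketch: fakeOpen is a hypothetical stand-in for page.open that uses setTimeout to call back later, and it does not model PhantomJS overwriting the pending request. It shows that the loop finishes before any callback runs, and that every callback then sees the last value of the shared url variable.

```javascript
// Hypothetical stand-in for page.open: calls back later, never immediately.
function fakeOpen(url, callback) {
    setTimeout(callback, 0);
}

var log = [];
var urls = ["a", "b", "c"];
var url; // shared by all callbacks, like the global url in the question

for (var i = 0; i < urls.length; i++) {
    url = urls[i];
    log.push("start " + url);
    fakeOpen(url, function () {
        // By the time this runs the loop is over, so url holds the last entry.
        console.log("callback sees " + url);
    });
}

// Reached before any callback has run; in the question's script this is
// the point where fs.write() and phantom.exit() are called.
log.push("loop finished");
console.log(log.join("\n"));
```

All "start" lines and "loop finished" are logged first; the callbacks only run afterwards.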

Nesting the callbacks by hand like this is tedious. Utilities such as async.js help you write cleaner code. You can install it through npm in the directory of the PhantomJS script.

var async = require("async"); // install async through npm
var tests = urls.map(function(url){
    return function(callback){
        page.open(url, function () {
            waitFor(function() {
               // something
            }, function() {
                callback();
            });
        });
    };
});
async.series(tests, function finish(){
    fs.write('test.txt', output);
    phantom.exit();
});
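
For intuition about what running the tasks in series means, here is a minimal hand-written sketch. runSeries is a hypothetical helper, not part of async.js, and it leaves out the error and result handling that the real library provides. The demo tasks use setTimeout as a stand-in for page loads.

```javascript
// Hypothetical helper (not part of async.js): runs tasks strictly in order.
// Each task receives a callback and must call it exactly once when finished.
function runSeries(tasks, done) {
    var index = 0;

    function runNext() {
        if (index >= tasks.length) {
            done();
            return;
        }
        var task = tasks[index];
        index += 1;
        task(runNext); // the next task starts only when this one calls back
    }

    runNext();
}

// Stand-in tasks: each finishes after a short delay, like a page load would.
var finished = [];
var demoTasks = ["a", "b", "c"].map(function (name) {
    return function (callback) {
        setTimeout(function () {
            finished.push(name);
            callback();
        }, 10);
    };
});

runSeries(demoTasks, function () {
    console.log("finished in order: " + finished.join(", "));
});
```

The key point is that a task is started from inside the previous task's callback, so no two page loads are ever pending at the same time.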

If you don't want any dependencies, it is also easy to define your own recursive function:

var urls = [/*....*/];

function handle_page(url){
    page.open(url, function(){
        waitFor(function() {
           // something
        }, function() {
            next_page();
        });
    });
}

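function next_page(){
    // shift() returns undefined once the array is empty
    var url = urls.shift();
    if(!url){
        phantom.exit(0);
        return; // do not fall through to handle_page(undefined)
    }
    handle_page(url);
}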

next_page();
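
The recursive version above exits without writing the collected output. A possible way to combine the queue with result collection is sketched below. scrapeAll and fakeOpenPage are hypothetical names introduced for illustration; openPage stands in for the page.open, waitFor and page.evaluate steps, and setTimeout simulates the page load so that the sketch runs outside PhantomJS.

```javascript
// Hypothetical helper: visits each URL in turn and collects one line per URL.
// openPage(url, callback) must call callback(value) once the value for
// that URL is available.
function scrapeAll(urls, openPage, onDone) {
    var queue = urls.slice(); // copy, so the caller's array is not emptied
    var lines = [];

    function nextPage() {
        if (queue.length === 0) {
            onDone(lines.join("\n"));
            return;
        }
        var url = queue.shift(); // local to this call, not shared
        openPage(url, function (price) {
            lines.push(url + " ; " + price);
            nextPage();
        });
    }

    nextPage();
}

// Stand-in for a real page load: answers after a short delay.
function fakeOpenPage(url, callback) {
    setTimeout(function () {
        callback("price-of-" + url);
    }, 10);
}

scrapeAll(["a", "b", "c"], fakeOpenPage, function (output) {
    console.log(output);
    // In a PhantomJS script, fs.write('test.txt', output) and
    // phantom.exit() would go here, after the last page is done.
});
```

Writing the file and exiting inside the final callback, rather than after the loop, is what guarantees that every page has been processed first.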
