Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
517 views
in Technique[技术] by (71.8m points)

Grab the resource contents in CasperJS or PhantomJS

I see that CasperJS has a "download" function and an "on resource received" callback but I do not see the contents of a resource in the callback, and I don't want to download the resource to the filesystem.

I want to grab the contents of the resource so that I can do something with it in my script. Is this possible with CasperJS or PhantomJS?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

This problem has been in my way for the last couple of days. The proxy solution wasn't very clean in my environment so I found out where phantomjs's QTNetworking core put the resources when it caches them.

Long story short, here is my gist. You need the cache.js and mimetype.js files: https://gist.github.com/bshamric/4717583

//for this to work, you have to call phantomjs with the cache enabled:
//usage:  phantomjs --disk-cache=true test.js

var page = require('webpage').create();
var fs = require('fs');
var cache = require('./cache');
var mimetype = require('./mimetype');

//this is the path that QTNetwork classes uses for caching files for it's http client
//the path should be the one that has 16 folders labeled 0,1,2,3,...,F
cache.cachePath = '/Users/brandon/Library/Caches/Ofi Labs/PhantomJS/data7/';

var url = 'http://google.com';
page.viewportSize = { width: 1300, height: 768 };

//when the resource is received, go ahead and include a reference to it in the cache object
page.onResourceReceived = function(response) {
  //I only cache images, but you can change this
    if(response.contentType.indexOf('image') >= 0)
    {
        cache.includeResource(response);
    }
};

//when the page is done loading, go through each cachedResource and do something with it, 
//I'm just saving them to a file
page.onLoadFinished = function(status) {
    for(index in cache.cachedResources) {
        var file = cache.cachedResources[index].cacheFileNoPath;
        var ext = mimetype.ext[cache.cachedResources[index].mimetype];
        var finalFile = file.replace("."+cache.cacheExtension,"."+ext);
        fs.write('saved/'+finalFile,cache.cachedResources[index].getContents(),'b');
    }
};

page.open(url, function () {
    page.render('saved/google.pdf');
    phantom.exit();
});

Then when you call phantomjs, just make sure the cache is enabled:

phantomjs --disk-cache=true test.js

Some notes: I wrote this for the purpose of getting the images on a page without using the proxy or taking a low res snapshot. QT uses compression on certain text file resources and you will have to deal with the decompression if you use this for text files. Also, I ran a quick test to pull in html resources and it didn't parse the http headers out of the result. But, this is useful to me, hopefully someone else will find it so, modify it if you have problems with a specific content type.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...