Rectangle 27 7

Not sure if this is the best way, but it works. It evaluates a script in the page that increases document.body.scrollTop over time, and takes a screenshot after a fixed delay.

page.open "http://www.somePage.com", (status) ->
      setTimeout(( ->
        page.evaluate(->
          pos = 0
          scroll = ->
            pos += 250
            window.document.body.scrollTop = pos
            setTimeout(scroll, 100)

          scroll()
        )

        setTimeout((->
          page.render('bild.png')
          phantom.exit()
        ), 5000)
      ), 1000)

How to scroll a page with phantomJs - Stack Overflow

phantomjs
Rectangle 27 9

I had problems getting debugging to work on Mac using Chrome Version 57.0.2987.133 (64-bit). I got the debugger to open on localhost:9000 (127.0.0.1:9000 didn't work for me), but after entering __run() (yes, with a double underscore), there was no response. I could see other JS files under Sources; mine was listed but was empty. (I did enable debugging in Chrome.)

I tried the same in Safari and it all worked as advertised.

UPDATE for Chrome (from Thiago Fernandes below): Apparently the issue is caused by Chrome not accepting the Enter key, so the workaround is to evaluate this function inside the Chrome console to get the Enter key working:

function isEnterKey(event) { return (event.keyCode !== 229 && event.keyIdentifier === "Enter") || event.keyCode === 13; }
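
To see what the patched check changes, here is a minimal sketch that feeds it a few mock event objects. The objects are hypothetical stand-ins for real keyboard events, and my assumption (not confirmed in this thread) is that newer Chrome versions no longer set event.keyIdentifier, which is why the extra keyCode === 13 test matters.

```javascript
// Patched check from the answer above.
function isEnterKey(event) {
    return (event.keyCode !== 229 && event.keyIdentifier === "Enter") || event.keyCode === 13;
}

// Hypothetical stand-ins for real keyboard events.
var oldStyleEnter = { keyCode: 13, keyIdentifier: "Enter" };   // browser that still sets keyIdentifier
var newStyleEnter = { keyCode: 13 };                            // no keyIdentifier at all
var imeComposition = { keyCode: 229, keyIdentifier: "Enter" };  // IME input, should be ignored
var letterA = { keyCode: 65 };                                  // any other key

console.log(isEnterKey(oldStyleEnter));  // true
console.log(isEnterKey(newStyleEnter));  // true
console.log(isEnterKey(imeComposition)); // false
console.log(isEnterKey(letterA));        // false
```

The second case is the interesting one: without the trailing keyCode === 13 test it would return false, which matches the "no response after __run()" symptom.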

Thanks Thiago, but what is the workaround other than installing an old version of Chrome? Otherwise I will just use Safari.

function isEnterKey(event) { return (event.keyCode !== 229 && event.keyIdentifier === "Enter") || event.keyCode === 13; }

Thanks for this - for the longest time I couldn't tell what was wrong and didn't even realize that __run() wasn't actually getting executed.

Getting remote debugging set up with PhantomJS - Stack Overflow

phantomjs
Rectangle 27 5

JavaScript code needs some time to execute. Try to have a delay between setting the page content and calling render.
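
A minimal sketch of that idea using the plain PhantomJS API rather than the Node bridge from the question. The helper name renderAfterDelay, the HTML string, the output file name and the 500 ms delay are all assumptions for illustration; a fixed delay is only a guess at how long the page's scripts need.

```javascript
// Hypothetical helper: set the content, then give the page's scripts
// some time to run before taking the screenshot.
function renderAfterDelay(page, html, outputFile, delayMs, done) {
    page.content = html;            // scripts in the content start running now
    setTimeout(function () {
        page.render(outputFile);    // render only after the delay has passed
        done();
    }, delayMs);
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var html = '<html><body><script>document.body.innerHTML = "dynamic";</script></body></html>';
    renderAfterDelay(page, html, 'out.png', 500, function () {
        phantom.exit();
    });
}
```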

Ariya, thank you for PhantomJS! And good luck with the new versions :))

node.js - page.set('content') dosen't work with dynamic content in pha...

node.js phantomjs
Rectangle 27 2

If you run PhantomJS with only phantom.javascriptEnabled = true; and try to log in to Amazon with a username and password, you get a "JavaScript disabled" message, meaning JavaScript cannot execute. When JS is not enabled, you cannot log in to Amazon, because cookies do not work.

Amazon executes small JS code to set and delete cookie before login, here is part of source code:

function setCookie(c_name,value,expiredays)
{
    var exdate=new Date();
    exdate.setDate(exdate.getDate()+expiredays);
    document.cookie=c_name+ "=" +escape(value)+
        ((expiredays==null) ? "" : ";expires="+exdate.toGMTString());
}

function checkCookieEnabled(nodeId)
{
    setCookie('amznTest','1',null);
    if(getCookie('amznTest')){
        deleteCookie('amznTest');
    }else{
        document.getElementById(nodeId).style.display = 'block';
    }
}
checkCookieEnabled('message_warning');

The fix is to enable JavaScript on the page's settings (not only on phantom) and to enable cookies on phantom:

var webPage = require('webpage');
var page = webPage.create();
page.settings.javascriptEnabled = true;
phantom.cookiesEnabled = true;
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36';

Now just submit the form with your username and password, and you can log in.

Here is a really good resource on how to log in to Amazon using PhantomJS. The same pattern can be used to log in to any other website.

phantom.cookiesEnabled = true
phantom.javascriptEnabled
cookiesEnabled = true;

How and why did I post this answer? If you run PhantomJS with phantom.javascriptEnabled = true; and try to log in to Amazon with a username and password, you get a "JavaScript disabled" message, meaning JavaScript cannot execute, and Amazon refuses to log you in because cookies are not working. Amazon executes a small piece of JS to set and delete a cookie before login (check the source of Amazon's login page). After hours of trying workarounds, I found that you have to set page.settings.javascriptEnabled = true; and not only phantom.javascriptEnabled, and then everything worked smoothly. You can try it yourself.

That sounds like a whole lot of relevant information to include in your post instead of just vague instructions for how to set default settings.

Upvoted! For casperjs users, this page object can be accessed after calling runner.start() as runner.page.javascriptEnabled = true; where runner is the casperjs instance that has been configured. Casperjs' pageSettings config option has a javascriptEnabled property, but this is not the same as the aforementioned snippet! The answer here would also work for Facebook.com login.

javascript - PhantomJS/CasperJS site login, cookies are not accepted b...

javascript login phantomjs casperjs
Rectangle 27 3

Have you installed build-essential in your system? Because weak is a module that needs to be compiled locally. I had got it working in an Ubuntu system a week ago by installing build-essential (sudo apt-get install build-essential), but now when I try in Linux Mint, it gives me the same error you are getting. During npm install -g phantom, it warns that weak 0.3.1 could not be installed.

Have you found any solution yet?

And, by the way the key difference between the system where I was successful and the current one is that, the first one was 32 bit, the latter where it is failing is 64 bit. Does the module "weak" have any problems that prevents it from installing on a 64 bit box?

sudo add-apt-repository ppa:fkrull/deadsnakes
sudo apt-get update
sudo apt-get install python2.6
sudo update-alternatives --install /usr/bin/python python /usr/bin/python2.6 20
  • python --version should give you 2.6.x. Weak module does not get installed when Python version is higher than 2.6.x.

After this make sure that your node and npm are latest versions. I have node v0.10.28 and npm v1.4.9.

If you have phantom module already installed, remove it by running npm uninstall phantom. Then run npm cache clean.

Now, install weak separately: npm install weak. If that goes through, install phantom by running npm install phantom.

Current version of node: v0.11.14-pre. Current version of npm: 1.4.9. Did all of the above. Fails part way through 'npm install weak':

> weak@0.3.1 install /home/joe/Documents/My Stuff/Programming/Angular.js Projects/NodeJS Messing/FreeAgentScraper/node_modules/weak
> node-gyp rebuild
gyp ERR! configure error
gyp ERR! stack Error: "pre" versions of node cannot be installed, use the --nodedir flag instead
gyp ERR! stack at install (/usr/local/lib/node_modules/npm/node_modules/node-gyp/lib/install.js:66:16)

Hmmm, I never faced that issue. I found a couple of Stack Overflow questions related to your issue here and here. Check them out. I guess using a stable version of node.js instead of v0.11.14-pre will resolve it. Or maybe use the --nodedir flag.

Node js and phantomjs - Cannot find module 'weak' - Stack Overflow

phantomjs
Rectangle 27 1

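A workaround is to reassign console.log to alert inside the page context, and to forward the alerts from page.onAlert in the PhantomJS script: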

window.console.log = function(msg) { alert(msg) };
page.onAlert = function(msg) {
  console.log(msg);
};

Not able to get console message from page.evaluate() to Page context (...

phantomjs
Rectangle 27 2

You don't specify what gets executed from the page with PhantomJS. You open the page with PhantomJS and all JavaScript that is executed in Chrome or Firefox is also executed in PhantomJS. It is a full browser without a "head".

There are some differences though. Clicking a download link will not trigger a download. The rendering engine which PhantomJS 1.x is based upon is nearly 4 years old, so some pages are simply rendered differently, because PhantomJS 1.x might not support that feature. (PhantomJS 2 is on the way and now in unofficial "alpha" status)

So you need to script every interaction that a user is doing on the page with JavaScript or CoffeeScript. You don't call page functions. You manipulate DOM elements to simulate a user interacting with the page in the browser. This needs to be done in such a crude way, because the PhantomJS API doesn't provide high-level user-like functions. If you want those, you have to look at CasperJS which is built on top of PhantomJS/SlimerJS.

There you actually have functions like click, wait, fetchText, etc.

I am not sure of the syntax here. I did ` var text = page.evaluate(function () { return document.title + '\n' + document.body.innerText; }); ` which gives me the text, but I need it with the HTML tags, as it would be seen in the inspector.

If you want the page source of body with the tags, then use document.body.innerHTML inside of page.evaluate just like in any other browser. If you want the complete page source, you either access page.content outside of page context or get document.documentElement.outerHTML from inside page.evaluate. Again PhantomJS is just a normal browser, so everything you type in the Chrome Developer Tools, you can do inside page.evaluate. I guess you have to learn more about JavaScript in the browser to use it well. Please ask a proper question next time and do some research.
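
A minimal sketch of the three options just mentioned. The helper names getBodyHtml and getFullHtml and the example.com URL are placeholders of mine, not part of the question.

```javascript
// These two functions run inside the page context when passed to page.evaluate().
function getBodyHtml() {
    return document.body.innerHTML;            // body markup, tags included
}

function getFullHtml() {
    return document.documentElement.outerHTML; // whole document, seen from inside the page
}

// Only runs under PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    page.open('http://example.com/', function (status) { // placeholder URL
        console.log(page.evaluate(getBodyHtml));
        console.log(page.evaluate(getFullHtml));
        console.log(page.content);             // whole document, read outside the page context
        phantom.exit();
    });
}
```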

StackOverflow (SO) is not a forum. Forum threads tend to go on forever with many turns. The good thing on SO is that there is a rigid structure: Q&A. There is very little room for discussions. I particularly dislike long comment threads, because of course I can help you, but future readers may be overwhelmed with the amount of back and forth in the comments (comments may be deleted). 20 comments in a short amount of time automatically raises a moderator flag. I can give you pointers in the comments, but the real work has to be done by you. If you can't do it, I'm happy to answer your question.

javascript - Web scraping using PhantomJS - Stack Overflow

javascript web-scraping phantomjs
Rectangle 27 1

I found a solution that works for me. The problem was that casperjs used an older version of phantomjs, so Mac users can just go to the folder where casperjs is installed. For me it was /usr/local/Cellar/casperjs/. Find the phantomjs folder, /usr/local/Cellar/casperjs/1.1-beta4/libexec/phantomjs, and replace it with a newly downloaded one from the phantomjs website.

I found that casperjs used version 1.9, but the current phantomjs is 2.1.1. I just swapped the folder for the new one and had no more problems.

node.js - In CasperJS scroll not working if we use it with PhantomJS -...

node.js phantomjs casperjs slimerjs
Rectangle 27 40

Found a way to do it and tried to adapt it to your situation. I didn't test the best way of finding the bottom of the page because I had a different context, but check it out. The problem is that you have to wait a little for the page to load, and JavaScript works asynchronously, so you have to use setInterval or setTimeout.

page.open('http://example.com/?q=houston', function () {

  // Checks for bottom div and scrolls down from time to time
  window.setInterval(function() {
      // Checks if there is a div with class=".has-more-items" 
      // (not sure if this is the best way of doing it)
      var count = page.content.match(/class=".has-more-items"/g);

      if(count === null) { // Didn't find
        page.evaluate(function() {
          // Scrolls to the bottom of page
          window.document.body.scrollTop = document.body.scrollHeight;
        });
      }
      else { // Found
        // Do what you want
        // ...
        phantom.exit();
      }
  }, 500); // Number of milliseconds to wait between scrolls

});

It worked like a charm...thanks was stuck for several days... window.document.body.scrollTop = document.body.scrollHeight;

window.scrollTo(0, Math.max(Math.max(document.body.scrollHeight,document.documentElement.scrollHeight),Math.max(document.body.offsetHeight,document.documentElement.offsetHeight),Math.max(document.body.clientHeight, document.documentElement.clientHeight)));
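
Since Math.max accepts any number of arguments, the nested calls in that one-liner can be flattened. A small sketch, assuming a hypothetical helper name fullPageHeight that takes the document as a parameter so it can be tried with a stub outside a browser:

```javascript
// Hypothetical helper: largest of the height measurements used above.
function fullPageHeight(doc) {
    var body = doc.body;
    var root = doc.documentElement;
    return Math.max(
        body.scrollHeight, root.scrollHeight,
        body.offsetHeight, root.offsetHeight,
        body.clientHeight, root.clientHeight
    );
}

// Stub document with made-up numbers, standing in for a real page.
var stubDocument = {
    body: { scrollHeight: 1200, offsetHeight: 1180, clientHeight: 600 },
    documentElement: { scrollHeight: 1500, offsetHeight: 1500, clientHeight: 768 }
};

console.log(fullPageHeight(stubDocument)); // 1500
```

In a real page the call would be window.scrollTo(0, fullPageHeight(document)). Note that with PhantomJS the helper has to be defined inside the function passed to page.evaluate, because that function cannot see variables from the outer script.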

javascript - How to scroll down with Phantomjs to load dynamic content...

javascript dom web-scraping screen-scraping phantomjs
Rectangle 27 14

For my master's thesis, I developed the library phantomjs-pool, which does exactly this. It lets you provide jobs which are then mapped to PhantomJS workers. The library handles job distribution, communication, error handling, logging, restarting and more. It was successfully used to crawl more than one million pages.

The following code executes a Google search for the numbers 0 to 9 and saves a screenshot of the page as googleX.png. Four websites are crawled in parallel (due to the creation of four workers). The script is started via node master.js.

master.js (runs in the Node.js environment)

var Pool = require('phantomjs-pool').Pool;

var pool = new Pool({ // create a pool
    numWorkers : 4,   // with 4 workers
    jobCallback : jobCallback,
    workerFile : __dirname + '/worker.js', // location of the worker file
    phantomjsBinary : __dirname + '/path/to/phantomjs_binary' // either provide the location of the binary or install phantomjs or phantomjs2 (via npm)
});
pool.start();

function jobCallback(job, worker, index) { // called to create a single job
    if (index < 10) { // index is counted up automatically for each job
        job(index, function(err) { // create the job with index as data
            console.log('DONE: ' + index); // log that the job was done
        });
    } else {
        job(null); // no more jobs
    }
}

worker.js (runs in the PhantomJS environment)

var webpage = require('webpage');

module.exports = function(data, done, worker) { // data provided by the master
    var page = webpage.create();

    // search for the given data (which contains the index number) and save a screenshot
    page.open('https://www.google.com/search?q=' + data, function() {
        page.render('google' + data + '.png');
        done(); // signal that the job was executed
    });

};

This is a great library. I'm wondering, is there a way to detect when there are no more processes to be spawned? As in, waiting, via async or a promise, after pool.start() to do something once a series of processes has completed?

Thank you. Currently there is no way to do this as simple as with async. However, you can use the callback for each individual job (which fires when one job is done) and increase a counter that way. So you are still able to detect when all jobs are finished.
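
A rough sketch of that counter idea. The helper name createCountingJobCallback is hypothetical and not part of phantomjs-pool; it only relies on the jobCallback(job, worker, index) shape shown in master.js above, and it assumes the per-job callback also fires for jobs that end with an error.

```javascript
// Hypothetical helper: builds a jobCallback that creates `totalJobs` jobs
// and calls `onAllDone` once every job's completion callback has fired.
function createCountingJobCallback(totalJobs, onAllDone) {
    var finished = 0;
    return function jobCallback(job, worker, index) {
        if (index < totalJobs) {
            job(index, function (err) {
                finished += 1;              // one more job is done
                if (finished === totalJobs) {
                    onAllDone();            // the last job has just finished
                }
            });
        } else {
            job(null);                      // no more jobs
        }
    };
}

// Stub standing in for the pool's `job` function: it "finishes" immediately.
function stubJob(data, callback) {
    if (data !== null) {
        callback(null);
    }
}

var jobCallback = createCountingJobCallback(3, function () {
    console.log('all jobs finished');
});
for (var i = 0; i < 4; i++) {
    jobCallback(stubJob, {}, i);            // logs the message once, after the third job
}
```

With the real library you would pass the result as the jobCallback option, for example jobCallback : createCountingJobCallback(10, onAllDone).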

node.js - How to manage a 'pool' of PhantomJS instances - Stack Overfl...

node.js web-scraping phantomjs jsdom
Rectangle 27 1

This was an issue that I ran into yesterday. It turns out that the example script does not work for newer versions, so I built a new Phantom Script that works for Jasmine 2.X which fixes it. You can locate the working script here in my repository:

Very nice script, though I spotted a bug : for(var j = 0; j < specs.length; j++) { console.log(" it: " + specs[i].innerText); } You should have probably done specs[j].innerText

Thanks, if you want to continue off of it and make pr I'll accept it. But I'm not working on it since I use other test runners now like karma.

jasmine - JSCover with PhantomJS - TypeError: 'null' is not an object ...

jasmine phantomjs jscoverage
Rectangle 27 6

I was having the same issue just now, and I found a way that is, in my opinion, better than attempting to send keys to PhantomJS.

Remember, PhantomJS is a headless browser - no actual window is rendered for your OS to access via keyboard shortcuts.

That being said, every time a new tab/window is opened, it is added to the driver's window handles. Each window handle has a unique identifier.

You can easily just switch to the ID of that window (and back to your original window handler - if you wish to).

# Click a link that opens a new tab ...

# You'll see there's a new window handle!
print(driver.window_handles)

# Switch to the new window handle in the window_handles list
driver.switch_to.window(driver.window_handles[1])

# Switch back to the original window
driver.switch_to.window(driver.window_handles[0])

Then it's trivial to just check the driver's current_url to ensure you are on the right page, i.e.:

assert "www.stackoverflow.com" in driver.current_url

Because PhantomJS is JavaScript-powered, you can always open a new window with driver.execute_script('window.open("http://your-url.dot")'); in place of "Click a link that opens a new tab". It's also better to name your handles than to rely on the URL, but that is OK too.

That doesn't actually answer how to open the new window, and doing it via JavaScript as I described does lose the previous userAgent (and surely more) from the capabilities. I think attaching a dummy target='_blank' link and clicking it would be a good way to do it.

python selenium phantomJS new tab not working - Stack Overflow

python selenium phantomjs
Rectangle 27 3

From what I can tell, your major issue is being notified when the postback is complete. I have mocked up a simple aspx page that simulates a long postback; it should work for your case. Once you have waited for the callback to finish, you can use standard casperjs functionality to do the scraping. I am a little worried about posting scraping instructions for a government site, so hopefully my test page will be adequate to help you figure it out.

var casper = require('casper').create({
    // verbose: true,
    logLevel: "debug"
});
casper.start();

casper.on('remote.message', function (message) {
    this.echo(message);
});


grabResults = function () {
    this.echo(this.getCurrentUrl());
};

casper.start('http://localhost:13851/default.aspx', function () {

    casper.thenClick('#Button1', function () {
        // Setup a listener for the postback complete event
        this.evaluate(function () {
            Sys.WebForms.PageRequestManager.getInstance().add_endRequest(function () {
                console.log("client: doPostback complete");
                window.onPostBackComplete = true;
            });
        });

        // Use waitFor to wait for the postback to be finished
        this.waitFor(function () {
            return this.evaluate(function () {
                return window.onPostBackComplete;
            });
        }, function then() {
            this.echo("doPostback complete");
            this.echo("value of test label: " + this.fetchText('#Label1'));
        }, function timeout() {
            this.echo("-- > timeout");
        },
        5000);
    });


});

casper.run(function () {
    this.echo("finished");
});

default.aspx (the test page loaded by the script above)

<%@ Page Language="C#" AutoEventWireup="true" %>
<!DOCTYPE html>
<script runat="server">    
    protected void Button1_Click(object sender, EventArgs e)
    {
        Label1.Text = "Slow loaded text";
        System.Threading.Thread.Sleep(1000);  // simulate a slow server
    }
</script>
<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
    <title>Sample page</title>
</head>
<body>
    <form id="form1" runat="server">
        <asp:ScriptManager ID="ScriptManager1" runat="server"></asp:ScriptManager>
        <div>
            <asp:UpdatePanel ID="UpdatePanel1" runat="server" >
                <ContentTemplate>
                    <asp:Label ID="Label1" runat="server" Text="Default Label"></asp:Label>
                    <br />
                    <asp:Button ID="Button1" runat="server" Text="Button" OnClick="Button1_Click"  />
                </ContentTemplate>
            </asp:UpdatePanel>
        </div>
    </form>
</body>
</html>

javascript - dopostback in PhantomJS/CasperJS - Stack Overflow

javascript web-scraping phantomjs casperjs
Rectangle 27 3

Looking at the phantomjs API, page.open requires a URL as the first argument, not an HTML string. This is why what you tried does not work.

However, one way that you might be able to achieve the effect of creating a page from a string is to host an empty "skeleton page," somewhere with a URL (could be localhost), and then include Javascript (using includeJs) into the empty page. The Javascript that you include into the blank page can use document.write("<p>blah blah blah</p>") to dynamically add content to the webpage.

I've never done this, but AFAIK this should work.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head></head>
<body></body>
</html>

I figured as much... I saw something where someone was loading in local files and pushing them into an array then "opening" them with Phantom. I thought they were actually loading in the html content but after looking again I realized it was just the filename! I believe the above should work for me just fine.... thanks for the assistance!

javascript - PhantomJS create page from string - Stack Overflow

javascript node.js phantomjs
Rectangle 27 2

If Node.JS is an option, might I introduce you to cheerio? It's a great library for consuming questionably-formed HTML documents. It gives you a jQuery-like API for working with a DOM-like representation of the page you're scraping. Paired with request, it makes for a pretty easy environment for scraping HTML.

Your example would end up looking something like this (error handling left as an exercise for the reader):

var cheerio = require("cheerio"),
    request = require("request");

request("http://localhost/file.html", function(err, res, data) {
  var $ = cheerio.load(data);

  var people = $('table.person');
  var results = [];

  people.each(function() {
    var $this = $(this);

    results.push({ 
      firstName: $this.find('.firstName').text(),
      lastName: $this.find('.lastName').text(),
      age: $this.find('.age').text()
    });
  });

  do_something_with(results);
});

javascript - Is it pjscrape that is slow, or is it PhantomJS? Alternat...

javascript node.js screen-scraping phantomjs
Rectangle 27 3

Heroku Toolbelt now has first class support for multiple buildpacks, so you can get a working Node and PhantomJS setup with the following:

heroku buildpacks:set https://github.com/heroku/heroku-buildpack-nodejs.git
heroku buildpacks:add --index 1 https://github.com/stomita/heroku-buildpack-phantomjs.git

Is there a working nodejs/phantomjs Heroku buildpack? - Stack Overflow

heroku phantomjs buildpack
Rectangle 27 2

I got the following to work in PhantomJS version 2.0.0. Whereas before, I was using page.open() to open a page from the filesystem and set a callback:

page.open("bench.html", pageLoadCallback);

Now, I accomplish the same thing from a string variable with the HTML page. The page.setContent() method requires a URL as the second argument, and this uses fs.absolute() to construct a file:// URL.

page.onLoadFinished = pageLoadCallback;
page.setContent(bench_str, "file://" + fs.absolute(".") + "/bench.html");

javascript - PhantomJS create page from string - Stack Overflow

javascript node.js phantomjs
Rectangle 27 2

Just wanted to mention that I recently had a similar need and discovered that I could pass file:// style references as a URL param. So I dumped my HTML string into a local file, then passed the full path to my capture script (django_screamshot), which basically uses casperjs and phantomjs plus a capture.js script.

javascript - PhantomJS create page from string - Stack Overflow

javascript node.js phantomjs