Puppeteer - Using Headless Chrome for Site Crawling
Puppeteer is a Node.js library that lets you automate Chrome. In headless mode, Chrome runs without a visible browser window: pages are still loaded and rendered, just not displayed on screen. That may sound silly, but it has many useful applications. For example, you could write a test script that checks that your website is still working correctly.
Installation
npm i puppeteer
# or
yarn add puppeteer
Usage
We are going to look at a quick example of how to log in to a site and then perform an operation on it.
Initialize Puppeteer
Puppeteer's API is asynchronous: every call, starting with launching Chrome, returns a promise, because you cannot know in advance how long it will take. The simplest way to work with it is inside an async function:
import puppeteer from 'puppeteer';
// Without ES module support (e.g. older Node versions) use instead:
// const puppeteer = require('puppeteer');

(async () => {
  // Init Puppeteer
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage(); // New page to be manipulated
  // Automation
  // Close browser
  await browser.close();
})();
This starts the browser. The headless flag defaults to true; for debugging, however, you should set it to false so that you can watch what the browser is doing.
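Switching the flag by hand gets tedious once you alternate between debugging and unattended runs. As a minimal sketch, the launch options could be derived from an environment variable. The variable name DEBUG_BROWSER and the helper buildLaunchOptions are assumptions made for this illustration and are not part of Puppeteer; headless and slowMo are existing launch options.

```javascript
// Hypothetical helper: derive Puppeteer launch options from the environment.
// DEBUG_BROWSER is an assumed variable name, not something Puppeteer reads itself.
function buildLaunchOptions(env = process.env) {
  const debug = env.DEBUG_BROWSER === '1';
  return {
    headless: !debug,        // show the browser window only while debugging
    slowMo: debug ? 100 : 0, // slow each operation down (in ms) so it can be followed
  };
}

// Usage inside the async function:
//   const browser = await puppeteer.launch(buildLaunchOptions());
```

Running the script with DEBUG_BROWSER=1 would then open a visible window, while a plain run stays headless.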
Login
To log in to the site we need the following:
- The URL of the login page
- CSS selector for the username field
- CSS selector for the password field
- CSS selector for the submit button
To obtain the selectors, you can use Chrome DevTools (F12). Right-click the element in the Elements panel and choose Copy > Copy selector.
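// USERNAME and PASSWORD are constants defined elsewhere in the script
async function logIn(page) {
  const LOGIN_URL = 'https://example.com/login';
  await page.goto(LOGIN_URL);
  await page.focus('#username');
  await page.keyboard.type(USERNAME);
  await page.focus('#password');
  await page.keyboard.type(PASSWORD);
  // Start waiting before clicking so that a fast navigation is not missed
  await Promise.all([
    page.waitForNavigation(),
    page.click('#form-submit'),
  ]);
  console.log('LOGIN COMPLETE');
}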
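The login function relies on USERNAME and PASSWORD being defined somewhere in the script. One way to avoid hard-coding them is to read them from environment variables. The sketch below assumes the hypothetical variable names SITE_USERNAME and SITE_PASSWORD and a hypothetical helper readCredentials; it fails early if either value is missing.

```javascript
// Hypothetical helper: read login credentials from environment variables.
// SITE_USERNAME and SITE_PASSWORD are assumed names chosen for this example.
function readCredentials(env = process.env) {
  const username = env.SITE_USERNAME;
  const password = env.SITE_PASSWORD;
  if (!username || !password) {
    throw new Error('Set SITE_USERNAME and SITE_PASSWORD before running the script');
  }
  return { username, password };
}

// Usage:
//   const { username: USERNAME, password: PASSWORD } = readCredentials();
```

Failing before the browser starts keeps a missing credential from showing up later as a confusing login timeout.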
Fetch all Links
Now that you are logged in, you can navigate to any page of the site and fetch all of its links.
async function analysePage(page) {
  const PAGE_URL = 'https://example.com/';
  await page.goto(PAGE_URL);
  // The callback runs inside the page, so it has access to document
  const links = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('a')).map((val) => val.href);
  });
  console.log(links);
}
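When crawling a site, the raw list usually contains duplicates, in-page anchors, and links to other domains. The following is a minimal sketch of a post-processing step, assuming you only want unique links on the same origin as the page. The helper name sameOriginLinks is hypothetical; it uses the standard URL class that ships with Node.js.

```javascript
// Hypothetical helper: keep unique links that share the origin of baseUrl.
function sameOriginLinks(links, baseUrl) {
  const origin = new URL(baseUrl).origin;
  const seen = new Set();
  for (const link of links) {
    if (!link) continue; // anchors without an href yield an empty string
    let url;
    try {
      url = new URL(link, baseUrl); // resolves relative links as well
    } catch {
      continue; // skip values that are not valid URLs
    }
    if (url.origin !== origin) continue; // drop external and non-HTTP links
    url.hash = ''; // treat "#section" variants as the same page
    seen.add(url.href);
  }
  return [...seen];
}
```

Inside analysePage you could then log sameOriginLinks(links, PAGE_URL) instead of the raw list.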
Final Code
import puppeteer from 'puppeteer';
// Without ES module support (e.g. older Node versions) use instead:
// const puppeteer = require('puppeteer');

const USERNAME = 'user';
const PASSWORD = 'user';

(async () => {
  // Init Puppeteer
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  // Automation: await each step so the browser is not closed too early
  await logIn(page);
  await analysePage(page);
  // Close browser
  await browser.close();
})();

async function logIn(page) {
  const LOGIN_URL = 'https://example.com/login';
  await page.goto(LOGIN_URL);
  await page.focus('#username');
  await page.keyboard.type(USERNAME);
  await page.focus('#password');
  await page.keyboard.type(PASSWORD);
  // Start waiting before clicking so that a fast navigation is not missed
  await Promise.all([
    page.waitForNavigation(),
    page.click('#form-submit'),
  ]);
  console.log('LOGIN COMPLETE');
}

async function analysePage(page) {
  const PAGE_URL = 'https://example.com/';
  await page.goto(PAGE_URL);
  // The callback runs inside the page, so it has access to document
  const links = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('a')).map((val) => val.href);
  });
  console.log(links);
}