Puppeteer.JS - Using Headless Chrome for Site Crawling

Puppeteer lets you automate Chrome from Node.js. Headless Chrome runs the browser without a visible window: pages are still loaded and rendered, just not displayed. That may sound pointless, but it has many useful applications. For example, you could write a test script that checks that your website is still working correctly.

Installation

npm i puppeteer
# or
yarn add puppeteer

Usage

We are going to look at a quick example of how to log in to a site and then perform an operation on it.

Initialize Puppeteer

Puppeteer's API is asynchronous: launching Chrome, opening a page, and navigating all return promises, because none of them complete immediately. The simplest way to work with it is inside an async function:

import puppeteer from 'puppeteer';
// On Node versions without ES module support, use:
// const puppeteer = require('puppeteer');

(async () => {
    // Init Puppeteer
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage(); // New page to be manipulated

    // Automation

    // Close Browser
    await browser.close();
})();

Here we start the browser. The headless option defaults to true; for debugging, set it to false so you can watch what the browser is doing.
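Editing the script every time you want to switch between debugging and headless runs gets tedious. One possible approach, shown here as a minimal sketch, is to derive the launch options from an environment variable. The helper name launchOptions and the variable name HEADFUL are assumptions made for this example; neither is part of Puppeteer.

```javascript
// Hypothetical helper: builds the options object passed to puppeteer.launch().
// HEADFUL is an illustrative variable name, not a Puppeteer setting.
function launchOptions(env = process.env) {
  // Show the browser window only when HEADFUL=1 is set; stay headless otherwise.
  return { headless: env.HEADFUL !== '1' };
}

// Usage (requires puppeteer to be installed):
// const browser = await puppeteer.launch(launchOptions());
```

With this in place, running the script as HEADFUL=1 node script.js would open a visible window, while a plain node script.js stays headless.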

Login

To log in to the site we need four things:

  • The URL of the login page
  • CSS selector for the username field
  • CSS selector for the password field
  • CSS selector for the submit button

To obtain the selectors, open Chrome DevTools (F12), right-click the element in the Elements panel, and choose Copy > Copy selector.

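// USERNAME and PASSWORD are constants defined elsewhere (see the final code)
async function logIn(page) {
	const LOGIN_URL = 'https://example.com/login'
	await page.goto(LOGIN_URL)
	await page.focus('#username')
	await page.keyboard.type(USERNAME)
	await page.focus('#password')
	await page.keyboard.type(PASSWORD)
	// Start waiting for the navigation before clicking, so it cannot be missed
	await Promise.all([
		page.waitForNavigation(),
		page.click('#form-submit'),
	])
	console.log('LOGIN COMPLETE')
}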
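Selectors copied from DevTools can be brittle: if the login form is rendered late or the markup changes, typing into a missing field fails with a fairly opaque error. A minimal sketch of a more defensive variant is shown below. The helper name fillField and the 5000 ms default timeout are assumptions for illustration; page.waitForSelector and page.type are the Puppeteer calls it relies on.

```javascript
// Hypothetical helper: waits for a field to appear, then types into it.
// `page` is expected to be a Puppeteer Page (or anything with the same two methods).
async function fillField(page, selector, value, timeout = 5000) {
  // Rejects if the element does not show up within `timeout` milliseconds
  await page.waitForSelector(selector, { timeout });
  // Focuses the element matching the selector and types the text into it
  await page.type(selector, value);
}

// Usage inside a login function:
// await fillField(page, '#username', USERNAME);
// await fillField(page, '#password', PASSWORD);
```

A failure now points at the selector that could not be found, rather than surfacing later as a failed login.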

Now that you are logged in, you can navigate to any page on the site and fetch all of its links.

async function analysePage(page) {
    const PAGE_URL = 'https://example.com/';
    await page.goto(PAGE_URL);
    // Runs inside the browser: collect the resolved href of every anchor element
    const links = await page.evaluate(() => {
        return Array.from(document.querySelectorAll('a')).map((val) => val.href);
    });
    console.log(links);
}
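For crawling, the raw link list usually needs some cleanup before it is useful: it tends to contain duplicates, links to other sites, and non-HTTP links such as mailto: addresses. The sketch below is one way to reduce it to unique same-origin URLs using the standard URL class. The function name sameOriginLinks is an assumption for this example, and dropping the #fragment is a design choice, not a requirement.

```javascript
// Hypothetical helper: keeps only unique links on the same origin as baseUrl.
function sameOriginLinks(links, baseUrl) {
  const origin = new URL(baseUrl).origin;
  const unique = new Set();
  for (const link of links) {
    let url;
    try {
      url = new URL(link, baseUrl); // also resolves relative links
    } catch {
      continue; // skip anything that is not a parseable URL
    }
    if (url.origin !== origin) continue; // skip external sites, mailto:, etc.
    url.hash = ''; // treat /page and /page#section as the same page
    unique.add(url.href);
  }
  return [...unique];
}

// Usage with the links array collected by page.evaluate():
// const toVisit = sameOriginLinks(links, 'https://example.com/');
```

The result can then be fed back into page.goto() one URL at a time to walk the site.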

Final Code

import puppeteer from 'puppeteer';
// On Node versions without ES module support, use:
// const puppeteer = require('puppeteer');

const USERNAME = 'user';
const PASSWORD = 'user';

(async () => {
  // Init Puppeteer
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  try {
    // Automation: await each step so the browser is not closed early
    await logIn(page);
    await analysePage(page);
  } finally {
    // Close Browser, even if a step above failed
    await browser.close();
  }
})();

async function logIn(page) {
  const LOGIN_URL = 'https://example.com/login';
  await page.goto(LOGIN_URL);
  await page.focus('#username');
  await page.keyboard.type(USERNAME);
  await page.focus('#password');
  await page.keyboard.type(PASSWORD);
  // Start waiting for the navigation before clicking, so it cannot be missed
  await Promise.all([
    page.waitForNavigation(),
    page.click('#form-submit'),
  ]);
  console.log('LOGIN COMPLETE');
}

async function analysePage(page) {
  const PAGE_URL = 'https://example.com/';
  await page.goto(PAGE_URL);
  // Runs inside the browser: collect the resolved href of every anchor element
  const links = await page.evaluate(() => {
    return Array.from(document.querySelectorAll('a')).map(val => val.href);
  });
  console.log(links);
}
