A Worked Example: Scraping JS Web Pages with Puppeteer in Headless Mode

Published: 2023-11-03 11:00:48 Category: Tutorials Source: Internet


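Puppeteer, from the Google Chrome team, is an automation and testing library built on Node.js and Chromium. Its biggest advantage is that it can handle dynamic content in web pages, such as content produced by JavaScript, so it simulates a real user more faithfully.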

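As an anti-crawler measure, some sites hide part of their content behind JavaScript/AJAX requests, so simply reading the a tags from the raw HTML does not work. Some sites even plant hidden "trap" elements that are invisible to users; if a script triggers one, the site concludes that the visitor is a machine. In these situations puppeteer's advantages stand out.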
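One way to stay clear of such hidden-element traps is to follow only the links a human could actually see. The sketch below is an illustration and not part of the original script: `looksVisible` and `visibleLinks` are hypothetical helper names, and the visibility test (display, visibility, non-zero box size) is an assumed heuristic that a determined site can still defeat. It uses Puppeteer's `page.$$eval` to read style and layout facts inside the browser.

```javascript
// Decide whether a link looks visible to a human, based on a few
// style and layout facts collected in the browser.
function looksVisible(info) {
  return info.display !== 'none' &&
         info.visibility !== 'hidden' &&
         info.width > 0 &&
         info.height > 0;
}

// Collect the href of every element matching `selector` that passes the check.
// `page` is expected to be a Puppeteer Page object.
async function visibleLinks(page, selector) {
  const infos = await page.$$eval(selector, elements => elements.map(el => {
    const style = window.getComputedStyle(el);
    const box = el.getBoundingClientRect();
    return {
      href: el.href,
      display: style.display,
      visibility: style.visibility,
      width: box.width,
      height: box.height
    };
  }));
  return infos.filter(looksVisible).map(info => info.href);
}
```

With a real page this could be called as `await visibleLinks(page, 'a')` after `page.goto(...)` has finished.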

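It can do the following:

Generate screenshots and PDFs of pages.

Crawl a SPA and generate pre-rendered content (i.e. "×××").

Automate form submission, UI testing, keyboard input, and so on.

Create an up-to-date automated testing environment. Run tests directly in the latest version of Chrome, using the latest JavaScript and browser features.

Capture a timeline trace of your site to help diagnose performance issues.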
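To make the first item in the list concrete, here is a minimal sketch that saves both a screenshot and a PDF of the current page. The function name `captureArtifacts` and its `outDir` and `baseName` parameters are illustrative assumptions; `page.screenshot` and `page.pdf` are Puppeteer's own calls. PDF generation has traditionally required headless mode, so treat the PDF step as dependent on how the browser was launched.

```javascript
const path = require('path');

// Save a full-page screenshot and an A4 PDF of whatever `page` currently shows.
// `page` is expected to be a Puppeteer Page object.
async function captureArtifacts(page, outDir, baseName) {
  const pngPath = path.join(outDir, baseName + '.png');
  const pdfPath = path.join(outDir, baseName + '.pdf');
  // No `type` is given, so the format follows the .png extension.
  await page.screenshot({ path: pngPath, fullPage: true });
  // Typically only available when the browser runs headless.
  await page.pdf({ path: pdfPath, format: 'A4' });
  return { pngPath, pdfPath };
}
```

A call such as `await captureArtifacts(page, './contents', 'home')` would write `home.png` and `home.pdf` into an existing `./contents` directory.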

Source repository: https://github.com/GoogleChrome/puppeteer/

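Installation

    npm i puppeteer

Note: install Node.js first, then run the command from the Node.js root directory (the same level as the npm file).

The installation downloads Chromium, which is about 120 MB.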

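After two days (about 10 hours) of experimenting and working around quite a few asynchronous pitfalls, I have a reasonable grasp of puppeteer and Node.js.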

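Scraping blog articles

Take the CSDN blog as an example: the full article text only appears after clicking "Read more", which defeats scripts that can only read the initial DOM.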

    /**
     * Load blog.csdn.net articles into local files.
     */
    const puppeteer = require('puppeteer');
    const fs = require('fs');

    // Emulate an iPhone so the site serves its mobile layout.
    const userAgent = 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1';
    // Directory that receives the screenshot and the article text files.
    const workPath = './contents';
    if (!fs.existsSync(workPath)) {
        fs.mkdirSync(workPath);
    }
    // Base URL.
    const rootUrl = 'https://blog.csdn.net/';
    // Pause after each scroll, in milliseconds.
    const maxWait = 100;
    // Maximum number of scroll-to-bottom rounds.
    const maxLoop = 10;

    (async () => {
        let url;
        let countUrl = 0;
        // Set headless: true to hide the Chromium UI.
        const browser = await puppeteer.launch({headless: false});
        const page = await browser.newPage();
        await page.setUserAgent(userAgent);
        await page.setViewport({width: 414, height: 736});
        await page.setRequestInterception(true);
        // Block image requests; every other request goes through.
        page.on('request', request => {
            if (request.resourceType() === 'image')
                request.abort();
            else
                request.continue();
        });
        await page.goto(rootUrl);
        // Scroll to the bottom repeatedly so the feed lazy-loads more entries.
        for (let i = 0; i < maxLoop; i++) {
            await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
            // Scrolling triggers no navigation, so pause briefly instead.
            await new Promise(resolve => setTimeout(resolve, maxWait));
            console.log('scroll to bottom and then wait ' + maxWait + 'ms.');
        }
        // The quality option only applies to JPEG, so save with a .jpg extension.
        await page.screenshot({path: workPath + '/screenshot.jpg', fullPage: true, quality: 100, type: 'jpeg'});
        // Links to blog articles in the feed list.
        const sel = '#feedlist_id li[data-type="blog"] h3 a';
        const hrefs = await page.evaluate((selector) => {
            return Array.from(document.querySelectorAll(selector)).map(element => element.href);
        }, sel);
        console.log('total links: ' + hrefs.length);
        await process();

        // Scrape one article per call, then recurse to handle the next URL.
        async function process() {
            if (countUrl < hrefs.length) {
                url = hrefs[countUrl];
                countUrl++;
            } else {
                await browser.close();
                return;
            }
            console.log('processing url: ' + url);
            // Declared outside try so that finally can still close the tab.
            let tab;
            try {
                tab = await browser.newPage();
                await tab.setUserAgent(userAgent);
                await tab.setViewport({width: 414, height: 736});
                await tab.setRequestInterception(true);
                // Block image requests; every other request goes through.
                tab.on('request', request => {
                    if (request.resourceType() === 'image')
                        request.abort();
                    else
                        request.continue();
                });
                await tab.goto(url);
                // Tap "read more" to expand the full article, if the button exists.
                try {
                    await tab.tap('.read_more_btn');
                } catch (err) {
                    console.log('No "read more" button found; no need to tap.');
                }
                let title = await tab.evaluate(() => document.querySelector('#article .article_title').innerText);
                let contents = await tab.evaluate(() => document.querySelector('#article .article_content').innerText);
                contents = 'TITLE: ' + title + '\nURL: ' + url + '\nCONTENTS: \n' + contents;
                // Name the file after the last path segment of the article URL.
                const name = new URL(tab.url()).pathname.split('/').pop();
                fs.writeFileSync(workPath + '/' + name + '.txt', contents);
                console.log(title + ' has been downloaded to local.');
            } catch (err) {
                console.log('url: ' + url + ' \n' + err.toString());
            } finally {
                // Close the tab even when scraping failed, then move on.
                if (tab) await tab.close().catch(() => {});
                await process();
            }
        }
    })();
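The script above scrolls a fixed number of times, which either wastes time or stops too early depending on how much the feed loads. As a hedged alternative, the sketch below keeps scrolling until the page height stops growing or a maximum is reached. `sleep` and `scrollUntilStable` are hypothetical helper names, and the assumption that "height unchanged after one pause" means "nothing more to load" will not hold for every site.

```javascript
// Resolve after `ms` milliseconds.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Scroll to the bottom until the document height stops growing or
// `maxLoops` scrolls have been made. Returns the number of scrolls performed.
// `page` is expected to be a Puppeteer Page object.
async function scrollUntilStable(page, maxLoops, delayMs) {
  let lastHeight = await page.evaluate(() => document.body.scrollHeight);
  let scrolls = 0;
  while (scrolls < maxLoops) {
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    scrolls++;
    // Give lazy-loaded content a moment to arrive.
    await sleep(delayMs);
    const height = await page.evaluate(() => document.body.scrollHeight);
    if (height === lastHeight) break; // nothing new was appended
    lastHeight = height;
  }
  return scrolls;
}
```

In the script it could stand in for the scroll loop, for example `await scrollUntilStable(page, 10, 500)`, where the 500 ms pause is an arbitrary illustrative value.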

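Execution

Closing remarks

I had long thought that, since Node.js runs JavaScript, it ought to be able to handle the JavaScript content of web pages, but I never found a suitable or efficient library. It was only after discovering puppeteer that I decided to give it a try.

That said, asynchronous code in Node.js really is a headache: this hundred-odd lines of code took me a full 10 hours.

You could extend the process() function in the code to use async.eachSeries; the recursion I used is not the best solution.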
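As one possible shape for that change, the sketch below does not use the `async` library's `eachSeries` at all: a plain `for...of` loop with `await` gives the same one-at-a-time ordering without recursion. `processInSeries` is a hypothetical name, and recording failures instead of stopping at the first one is an assumed design choice.

```javascript
// Run `worker` on each item strictly one after another.
// A failure on one item is recorded and does not stop the remaining items.
async function processInSeries(items, worker) {
  const failures = [];
  for (const item of items) {
    try {
      await worker(item);
    } catch (err) {
      failures.push({ item, error: err });
    }
  }
  return failures;
}
```

In the script, the list of links would be passed as `items` and a function that opens a tab, saves one article, and closes the tab would be passed as `worker`; the browser can then be closed right after the returned promise resolves.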

In fact, processing the pages one by one is not efficient. I originally wrote a method that closes the browser asynchronously:

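    // Poll every 3 seconds and close the browser once nothing is still running.
    // countDown is presumably the number of tabs still being processed.
    let tryCloseBrowser = setInterval(function () {
        console.log('check if any process running...');
        if (countDown <= 0) {
            clearInterval(tryCloseBrowser);
            console.log('none process running, close.');
            browser.close();
        }
    }, 3000);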

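Following this idea, the first version of the code opened several tabs at the same time. It was very efficient but had little tolerance for errors. You may want to try writing it yourself.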
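A middle ground between one tab at a time and every tab at once is to cap how many tasks run concurrently and to catch each failure individually. The following is a minimal sketch under that assumption, not the author's original version: `processWithLimit` is a hypothetical name and the result shape (`ok`, `value`, `error`) is invented for illustration.

```javascript
// Run `worker` over `items` with at most `limit` tasks in flight at once.
// Each result records success or failure, so one bad page cannot sink the rest.
async function processWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner keeps pulling the next unclaimed index until none remain.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      try {
        results[index] = { ok: true, value: await worker(items[index]) };
      } catch (err) {
        results[index] = { ok: false, error: err };
      }
    }
  }

  const runners = [];
  for (let i = 0; i < Math.min(limit, items.length); i++) {
    runners.push(runner());
  }
  await Promise.all(runners);
  return results;
}
```

Because the returned promise resolves only after every task has finished, the browser can be closed immediately afterwards, with no need for a polling timer.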
