
[Solved] Parsing and extracting Baidu search results with Python Playwright

Python crifan 1144 views 0 comments
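While working on:
[Unsolved] Automating the browser with playwright on Mac to run a Baidu search
I could already trigger the Baidu search; the next step is to extract the results.
15. Playwright API – class: Keyboard – 《Playwright v1.12 Document》 – 书栈网 · BookStack
Once an element has been obtained, see
ElementHandle | Playwright
for the operations available on it,
including reading its values: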
  • elementHandle.innerHTML()
  • elementHandle.innerText()
  • elementHandle.textContent()
  • jsHandle.getProperties()
  • jsHandle.jsonValue()
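In the Python binding these getters use snake_case names (inner_html, inner_text, text_content). A minimal sketch, assuming elem is an ElementHandle that has already been obtained from a page; the helper name read_element_values is made up for illustration:

```python
def read_element_values(elem):
    """Collect the common string views of one element.

    `elem` is expected to be a Playwright ElementHandle (sync API).
    The function name is hypothetical, not part of Playwright.
    """
    return {
        # Markup inside the element, tags included
        "inner_html": elem.inner_html(),
        # Rendered text of the element
        "inner_text": elem.inner_text(),
        # Raw text of the node and its descendants; may be None
        "text_content": elem.text_content(),
    }
```

For a plain link, inner_text and text_content usually agree; they can differ when part of the element is hidden by CSS.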
ElementHandle represents an in-page DOM element. ElementHandles can be created with the page.$(selector) method.
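So the way is to use:
page.$(selector)
to select and obtain the element.
Let me try that.
But first I should figure out how to select an element:
playwright select element
Element selectors | Playwright Chinese docs | Playwright Chinese site (bootcss.com)
Element selectors | Playwright
Better to read the core concepts first: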
Core concepts | Playwright
  • Browser
  • Browser contexts
  • Pages and frames
  • Selectors
  • Auto-waiting
  • Execution contexts: Playwright and Browser
  • Evaluation Argument
page has:
  • page.goto
  • page.fill
  • page.click
Here is how to locate and obtain an element:
// Get frame using any other selector
const frameElementHandle = await page.$('.frame-class');
That is:
page.$(someSelector)
Now look at page:
Page | Playwright
It lists:
  • page.$(selector)
  • page.$$(selector)
  • page.$eval(selector, pageFunction[, arg])
  • page.$$eval(selector, pageFunction[, arg])
“page.$(selector)#
* selector <string> A selector to query for. See working with selectors for more details.
* returns: <Promise<null|ElementHandle>>
The method finds an element matching the specified selector within the page. If no elements match the selector, the return value resolves to null.
Shortcut for main frame’s frame.$(selector).”
page.$ takes a selector and returns either null or an element handle (ElementHandle).
javascript – How to get a collection of elements with playwright? – Stack Overflow
puppeteer – How to select an option from dropdown select – Stack Overflow
webautomation – Using Playwright for Python, how do I select (or find) an element? – Stack Overflow
Getting value of input element in Playwright – Stack Overflow
Let me try:
    resultASelector = "h3[class^='t'] a"
    searchResultAList = page.$$(resultASelector)
Result:
a syntax error:
   searchResultAList = page.$$(resultASelector)
                             ^
SyntaxError: invalid syntax
So $$ is JavaScript syntax, not valid Python.
-> I need to find the Python Playwright equivalent of page.$$.
Getting Started | Playwright
This is the documentation for the Python version.
Found it:
ElementHandle | Playwright
href_element = page.query_selector("a")
href_element.click()
Very clear: use page.query_selector
->
“element_handle.query_selector(selector)#
* selector <str> A selector to query for. See working with selectors for more details.
* returns: <NoneType|ElementHandle>
The method finds an element matching the specified selector in the ElementHandle’s subtree. See Working with selectors for more details. If no elements match the selector, returns null.
element_handle.query_selector_all(selector)#
* selector <str> A selector to query for. See working with selectors for more details.
* returns: <List[ElementHandle]>
The method finds all elements matching the specified selector in the ElementHandles subtree. See Working with selectors for more details. If no elements match the selector, returns empty array.”
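The element-scoped lookup quoted above can be sketched as follows. This is only an illustration, assuming container is an ElementHandle for some block on the page; the helper name links_inside and the default selector "a" are hypothetical:

```python
def links_inside(container, link_selector="a"):
    """Return (text, href) pairs for links found inside `container` only."""
    pairs = []
    # query_selector_all on an ElementHandle searches that element's
    # subtree, not the whole page
    for link in container.query_selector_all(link_selector):
        pairs.append((link.text_content(), link.get_attribute("href")))
    return pairs
```

An empty list comes back when the container holds no matching links, so the caller does not need a None check.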
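Note that these methods
search only among the children of the current element,
whereas here I want to search the whole page.
So look in the Page documentation instead:
Page | Playwright
Sure enough, it has them too: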
“page.query_selector(selector)#
* selector <str> A selector to query for. See working with selectors for more details.
* returns: <NoneType|ElementHandle>
The method finds an element matching the specified selector within the page. If no elements match the selector, the return value resolves to null.
Shortcut for main frame’s frame.query_selector(selector).
page.query_selector_all(selector)#
* selector <str> A selector to query for. See working with selectors for more details.
* returns: <List[ElementHandle]>
The method finds all elements matching the specified selector within the page. If no elements match the selector, the return value resolves to [].
Shortcut for main frame’s frame.query_selector_all(selector).”
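The two return conventions quoted above matter when writing Python: query_selector gives None when nothing matches, while query_selector_all gives an empty list. A small sketch of handling both, assuming page is a Playwright Page (sync API); the helper names are made up for illustration:

```python
def first_match_text(page, selector, default=""):
    """Text of the first element matching `selector`, or `default`."""
    elem = page.query_selector(selector)
    if elem is None:
        # No match: query_selector returns None rather than raising
        return default
    text = elem.text_content()
    return text if text is not None else default


def count_matches(page, selector):
    """Number of elements matching `selector`; 0 when there are none."""
    # query_selector_all returns [] on no match, so len() is always safe
    return len(page.query_selector_all(selector))
```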
This is exactly what is needed: it finds the elements within the page.
So the code becomes:
    resultASelector = "h3[class^='t'] a"
    searchResultAList = page.query_selector_all(resultASelector)
Result: each item has the type
<JSHandle preview=JSHandle@node>
searchResultAList=[<JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>]
Next, read the values from these handles.
Later, when running in batch, a related problem came up:
[Solved] Python Playwright: page.query_selector_all finds no elements
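One common reason for query_selector_all coming back empty is that it runs before the result list has been rendered. Whether that was the cause in the linked post is an assumption here, not something stated in this article. A hedged sketch that waits first, using a hypothetical helper name and an arbitrary 10-second timeout:

```python
def query_all_when_ready(page, selector, timeout_ms=10_000):
    """Wait for `selector` to appear, then return all matching elements.

    `page` is expected to be a Playwright Page (sync API).
    """
    # Blocks until at least one match is visible; Playwright raises its
    # TimeoutError if nothing shows up within timeout_ms milliseconds
    page.wait_for_selector(selector, timeout=timeout_ms)
    return page.query_selector_all(selector)
```

Letting the timeout error propagate keeps "the page never loaded" distinguishable from "the page loaded with zero results".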
Moving on.
[Summary]
The final code:
    ################################################################################
    # Extract content
    ################################################################################
    resultASelector = "h3[class^='t'] a"
    searchResultAList = page.query_selector_all(resultASelector)
    print("searchResultAList=%s" % searchResultAList)
    # searchResultAList=[<JSHandle preview=JSHandle@<a target="_blank" href="http://www.baidu.com/link?…>在路上on the way - 走别人没走过的路,让别人有路可走</a>>, <JSHandle preview=JSHandle@node>, 。。。, <JSHandle preview=JSHandle@node>]
    searchResultANum = len(searchResultAList)
    print("Found %s search result:" % searchResultANum)
    for curIdx, aElem in enumerate(searchResultAList):
        curNum = curIdx + 1
        print("%s [%d] %s" % ("-"*20, curNum, "-"*20))
        title = aElem.text_content()
        print("title=%s" % title)
        baiduLinkUrl = aElem.get_attribute("href")
        print("baiduLinkUrl=%s" % baiduLinkUrl)
This parses and extracts the content of the Baidu search results:
searchResultAList=[<JSHandle preview=JSHandle@<a target="_blank" href="http://www.baidu.com/link?…>在路上on the way - 走别人没走过的路,让别人有路可走</a>>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>]
Found 10 search result:
-------------------- [1] --------------------
title=在路上on the way - 走别人没走过的路,让别人有路可走
baiduLinkUrl=http://www.baidu.com/link?url=fB3F0xZmwig9r2M_1pK4BJG00xFPLjJ85X39GheP_fzEA_zJIjX-IleEH_ZL8pfo
-------------------- [2] --------------------
title=crifan – 在路上
baiduLinkUrl=http://www.baidu.com/link?url=kmvgD1PraoULnnjUvNPQmwHFQ9uUKkXg_HWy0NI3xI11cV7evpdxyA_4FkVf3zLH
-------------------- [3] --------------------
title=crifan简介_crifan的专栏-CSDN博客_crifan
baiduLinkUrl=http://www.baidu.com/link?url=CHLWAQKOMgb23GmzVCZRIVze9kBNu6DIVoSWQqe21bWq_qZk2zDu_V3pDC1o1i5WC8qXAbUhaBIN8UO9Sjzxfa
-------------------- [4] --------------------
title=crifan的微博_微博
baiduLinkUrl=http://www.baidu.com/link?url=-QwlZ5SEmZD1R2QqdsK7ByUhxmIdX_hiFCX79hg9RTbQ11j5wXaBaYXegXU9WDk3
-------------------- [5] --------------------
title=Crifan的电子书大全 | crifan.github.io
baiduLinkUrl=http://www.baidu.com/link?url=Sgrbyd_pBsm-BTANKwSmyveSWvWj2_IqOOZzYw-SkG8tQ_C6Ccz88zZxHf3Eh1JA
-------------------- [6] --------------------
title=GitHub - crifan/crifanLib: crifan's library
baiduLinkUrl=http://www.baidu.com/link?url=NSZ5IzQ2Qag3CpGLMAbJer3QaAqI7qZOp2Ythiw6o8inoDX-5LqlzOKWTrMzQK5G
-------------------- [7] --------------------
title=在路上www.crifan.com - 网站排行榜
baiduLinkUrl=http://www.baidu.com/link?url=Tc4cbETNKpQXj-kX1pwSOcPG8l9ijRRPqokRSMHgB59rSn6GoWSBzCPu3ky3dN6Cu1pb-4HBZ2_YaVyS7qdDS_
-------------------- [8] --------------------
title=crifan的专栏_crifan_CSDN博客-crifan领域博主
baiduLinkUrl=http://www.baidu.com/link?url=OLkrWu8q9SRZuBN-KzEMO56f82IpIfvbOp-sU3jdjbVBPP3GXBw_8StJgYG-_QrK
-------------------- [9] --------------------
title=User crifan - Stack Overflow
baiduLinkUrl=http://www.baidu.com/link?url=t1rc0EGg33A-uJUiZHKkUWA8ETf6B5P8pBKo0yNCH-VTWluW3xqUlYRHjMz8bQdiN2mJROMhfkX6bY0db_bB_a
-------------------- [10] --------------------
title=crifan - Bing 词典
baiduLinkUrl=http://www.baidu.com/link?url=8z-3hYeLAQ8T4efOf4848LtAdpGdR1Ect9au4JIUB32bm2z412RDsMelFW1R2aIk
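The extraction code above assumes a page object that has already run the search. For completeness, here is a sketch of how the pieces might fit together end to end. The result selector is the one used in this post; the input selectors "#kw" and "#su", the keyword "crifan", and both function names are assumptions for illustration and may not match the original script or Baidu's current markup:

```python
RESULT_LINK_SELECTOR = "h3[class^='t'] a"  # selector used in this post


def extract_results(page, selector=RESULT_LINK_SELECTOR):
    """Return one {"title", "url"} dict per search-result link on `page`."""
    results = []
    for link in page.query_selector_all(selector):
        results.append({
            "title": link.text_content(),
            "url": link.get_attribute("href"),
        })
    return results


def run_demo(keyword="crifan"):
    """Open Baidu, search for `keyword`, and return the parsed results.

    Needs the playwright package and an installed Chromium build.
    """
    # Imported here so extract_results can be used without Playwright
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://www.baidu.com/")
        page.fill("#kw", keyword)   # assumed id of the search box
        page.click("#su")           # assumed id of the search button
        # Wait until at least one result link is rendered
        page.wait_for_selector(RESULT_LINK_SELECTOR)
        results = extract_results(page)
        browser.close()
    return results
```

Returning plain dicts instead of printing keeps the parsing step reusable, for example to write the results to a file.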
I have also replied to this post:
webautomation – Using Playwright for Python, how do I select (or find) an element? – Stack Overflow
