[Solved] Parsing and extracting Baidu search results with Playwright for Python

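Background:

[Unsolved] Using playwright on Mac to automate the browser and run a Baidu search

Along the way, the Baidu search can already be triggered; the next step is to extract the results.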

15. Playwright API – class: Keyboard – 《Playwright v1.12 Document》 – 书栈网 · BookStack

Once an element has been obtained, the

ElementHandle | Playwright

API supports all kinds of operations on it,

including reading values:

* elementHandle.innerHTML()
* elementHandle.innerText()
* elementHandle.textContent()
* jsHandle.getProperties()
* jsHandle.jsonValue()

ElementHandle represents an in-page DOM element. ElementHandles can be created with the page.$(selector) method.

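So the docs say to use:

page.$(selector)

to select and obtain the element.

Let's try it.

But first, better to work out how elements are selected in the first place: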

playwright select element

Element selectors | Playwright 中文文档 | Playwright 中文网 (bootcss.com)

Element selectors | Playwright

Better to read the core concepts first:

Core concepts | Playwright

* Browser
* Browser contexts
* Pages and frames
* Selectors
* Auto-waiting
* Execution contexts: Playwright and Browser
* Evaluation Argument

page has:

page.goto
page.fill
page.click

Here is how to locate an element:

// Get frame using any other selector
const frameElementHandle = await page.$('.frame-class');

That is:

page.$(someSelector)

Now look at page:

Page | Playwright

Found:

* page.$(selector)

Page | Playwright

Returns a single element (the first match)

* page.$$(selector)

Page | Playwright

Returns all matching elements

* page.$eval(selector, pageFunction[, arg])
* page.$$eval(selector, pageFunction[, arg])

“page.$(selector)#

* selector <string> A selector to query for. See working with selectors for more details.

* returns: <Promise<null|ElementHandle>>

The method finds an element matching the specified selector within the page. If no elements match the selector, the return value resolves to null.

Shortcut for main frame’s frame.$(selector).”

page.$ takes a selector and returns either null or an element handle (ElementHandle)

javascript – How to get a collection of elements with playwright? – Stack Overflow

puppeteer – How to select an option from dropdown select – Stack Overflow

webautomation – Using Playwright for Python, how do I select (or find) an element? – Stack Overflow

Getting value of input element in Playwright – Stack Overflow

Let's try:

    resultASelector = "h3[class^='t'] a"
    searchResultAList = page.$$(resultASelector)

Result:

A syntax error:

   searchResultAList = page.$$(resultASelector)
                             ^
SyntaxError: invalid syntax

So $$ is JS syntax, not valid Python syntax here.

-> Need to find what the Python version of Playwright uses in place of page.$$

Getting Started | Playwright

is the Python version of the docs.

Found it:

ElementHandle | Playwright

href_element = page.query_selector("a")
href_element.click()

Clear enough: use page.query_selector

* element_handle.query_selector(selector)

https://playwright.dev/python/docs/api/class-elementhandle#element_handlequery_selectorselector

* element_handle.query_selector_all(selector)

https://playwright.dev/python/docs/api/class-elementhandle#element_handlequery_selector_allselector

“element_handle.query_selector(selector)#

* selector <str> A selector to query for. See working with selectors for more details.

* returns: <NoneType|ElementHandle>

The method finds an element matching the specified selector in the ElementHandle’s subtree. See Working with selectors for more details. If no elements match the selector, returns null.

element_handle.query_selector_all(selector)#

* selector <str> A selector to query for. See working with selectors for more details.

* returns: <List[ElementHandle]>

The method finds all elements matching the specified selector in the ElementHandles subtree. See Working with selectors for more details. If no elements match the selector, returns empty array.”

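Note that these methods:

search only within the current element's subtree (its descendants),

whereas here the search needs to cover the whole page.

So look for the equivalent on page: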

Page | Playwright

Sure enough, page has them too:

* page.query_selector(selector)

https://playwright.dev/python/docs/api/class-page#pagequery_selectorselector

* page.query_selector_all(selector)

https://playwright.dev/python/docs/api/class-page#pagequery_selector_allselector

“page.query_selector(selector)#

* selector <str> A selector to query for. See working with selectors for more details.

* returns: <NoneType|ElementHandle>

The method finds an element matching the specified selector within the page. If no elements match the selector, the return value resolves to null.

Shortcut for main frame’s frame.query_selector(selector).

page.query_selector_all(selector)#

* selector <str> A selector to query for. See working with selectors for more details.

* returns: <List[ElementHandle]>

The method finds all elements matching the specified selector within the page. If no elements match the selector, the return value resolves to [].

Shortcut for main frame’s frame.query_selector_all(selector).”

That is exactly what we want:

these are the methods for finding the elements we need.

In summary:

element_handle

element_handle.query_selector(selector)

https://playwright.dev/python/docs/api/class-elementhandle#element_handlequery_selectorselector

element_handle.query_selector_all(selector)

https://playwright.dev/python/docs/api/class-elementhandle#element_handlequery_selector_allselector

page

page.query_selector(selector)

https://playwright.dev/python/docs/api/class-page#pagequery_selectorselector

page.query_selector_all(selector)

https://playwright.dev/python/docs/api/class-page#pagequery_selector_allselector
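The difference in scope between the page-level and element-level methods can be sketched roughly as follows. This is only an illustrative sketch: the helper name first_link_per_heading is hypothetical (not part of Playwright), and page is assumed to be a Playwright sync-API Page, or anything with the same query_selector_all method.

```python
from typing import Any, List, Optional


def first_link_per_heading(page: Any, heading_selector: str) -> List[Optional[Any]]:
    """For each heading found in the page, return its first <a>, or None."""
    # Page-level search: covers the whole page and returns a list,
    # which is empty when nothing matches.
    headings = page.query_selector_all(heading_selector)

    links = []
    for heading in headings:
        # Element-level search: looks only inside this heading's subtree
        # and returns None when the heading contains no <a>.
        links.append(heading.query_selector("a"))
    return links
```

Because query_selector can return None and query_selector_all can return an empty list, callers should check for both before reading values from the result.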

Now write the code:

    resultASelector = "h3[class^='t'] a"
    searchResultAList = page.query_selector_all(resultASelector)

Result:

Each item in the list prints as:

<JSHandle preview=JSHandle@node>

searchResultAList=[<JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>]

The next step is to read values from them:

* element_handle.get_attribute(name)

https://playwright.dev/python/docs/api/class-elementhandle#element_handleget_attributename

* element_handle.inner_html()

https://playwright.dev/python/docs/api/class-elementhandle#element_handleinner_html

* element_handle.inner_text()

https://playwright.dev/python/docs/api/class-elementhandle#element_handleinner_text

* element_handle.text_content()

https://playwright.dev/python/docs/api/class-elementhandle#element_handletext_content
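A minimal sketch of how these getters might be combined for a single element. The function name describe_link and the returned dictionary keys are hypothetical, chosen only for illustration; elem is assumed to be an ElementHandle from the sync API. Per the Python docs, text_content() and get_attribute() may return None, so the sketch guards against that.

```python
from typing import Any, Dict, Optional


def describe_link(elem: Any) -> Dict[str, Optional[str]]:
    """Collect commonly needed values from one element handle."""
    # text_content() may return None, so guard before stripping whitespace.
    text = elem.text_content()
    return {
        "text": text.strip() if text is not None else None,
        # get_attribute() returns None when the attribute is absent.
        "href": elem.get_attribute("href"),
        # inner_html() returns the markup inside the element.
        "html": elem.inner_html(),
    }
```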

During batch runs, a related problem also came up:

[Solved] Python Playwright: page.query_selector_all cannot find the elements

Moving on.

[Summary]

The final code:

    ################################################################################
    # Extract content
    ################################################################################
    resultASelector = "h3[class^='t'] a"
    searchResultAList = page.query_selector_all(resultASelector)
    print("searchResultAList=%s" % searchResultAList)
    # searchResultAList=[<JSHandle preview=JSHandle@<a target="_blank" href="http://www.baidu.com/link?…>在路上on the way - 走别人没走过的路,让别人有路可走</a>>, <JSHandle preview=JSHandle@node>, 。。。, <JSHandle preview=JSHandle@node>]
    searchResultANum = len(searchResultAList)
    print("Found %s search result:" % searchResultANum)
    for curIdx, aElem in enumerate(searchResultAList):
        curNum = curIdx + 1
        print("%s [%d] %s" % ("-"*20, curNum, "-"*20))
        title = aElem.text_content()
        print("title=%s" % title)
        baiduLinkUrl = aElem.get_attribute("href")
        print("baiduLinkUrl=%s" % baiduLinkUrl)

This parses and extracts the content of the Baidu search results:

searchResultAList=[<JSHandle preview=JSHandle@<a target="_blank" href="http://www.baidu.com/link?…>在路上on the way - 走别人没走过的路,让别人有路可走</a>>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>, <JSHandle preview=JSHandle@node>]
Found 10 search result:
-------------------- [1] --------------------
title=在路上on the way - 走别人没走过的路,让别人有路可走
baiduLinkUrl=http://www.baidu.com/link?url=fB3F0xZmwig9r2M_1pK4BJG00xFPLjJ85X39GheP_fzEA_zJIjX-IleEH_ZL8pfo
-------------------- [2] --------------------
title=crifan – 在路上
baiduLinkUrl=http://www.baidu.com/link?url=kmvgD1PraoULnnjUvNPQmwHFQ9uUKkXg_HWy0NI3xI11cV7evpdxyA_4FkVf3zLH
-------------------- [3] --------------------
title=crifan简介_crifan的专栏-CSDN博客_crifan
baiduLinkUrl=http://www.baidu.com/link?url=CHLWAQKOMgb23GmzVCZRIVze9kBNu6DIVoSWQqe21bWq_qZk2zDu_V3pDC1o1i5WC8qXAbUhaBIN8UO9Sjzxfa
-------------------- [4] --------------------
title=crifan的微博_微博
baiduLinkUrl=http://www.baidu.com/link?url=-QwlZ5SEmZD1R2QqdsK7ByUhxmIdX_hiFCX79hg9RTbQ11j5wXaBaYXegXU9WDk3
-------------------- [5] --------------------
title=Crifan的电子书大全 | crifan.github.io
baiduLinkUrl=http://www.baidu.com/link?url=Sgrbyd_pBsm-BTANKwSmyveSWvWj2_IqOOZzYw-SkG8tQ_C6Ccz88zZxHf3Eh1JA
-------------------- [6] --------------------
title=GitHub - crifan/crifanLib: crifan's library
baiduLinkUrl=http://www.baidu.com/link?url=NSZ5IzQ2Qag3CpGLMAbJer3QaAqI7qZOp2Ythiw6o8inoDX-5LqlzOKWTrMzQK5G
-------------------- [7] --------------------
title=在路上www.crifan.com - 网站排行榜
baiduLinkUrl=http://www.baidu.com/link?url=Tc4cbETNKpQXj-kX1pwSOcPG8l9ijRRPqokRSMHgB59rSn6GoWSBzCPu3ky3dN6Cu1pb-4HBZ2_YaVyS7qdDS_
-------------------- [8] --------------------
title=crifan的专栏_crifan_CSDN博客-crifan领域博主
baiduLinkUrl=http://www.baidu.com/link?url=OLkrWu8q9SRZuBN-KzEMO56f82IpIfvbOp-sU3jdjbVBPP3GXBw_8StJgYG-_QrK
-------------------- [9] --------------------
title=User crifan - Stack Overflow
baiduLinkUrl=http://www.baidu.com/link?url=t1rc0EGg33A-uJUiZHKkUWA8ETf6B5P8pBKo0yNCH-VTWluW3xqUlYRHjMz8bQdiN2mJROMhfkX6bY0db_bB_a
-------------------- [10] --------------------
title=crifan - Bing 词典
baiduLinkUrl=http://www.baidu.com/link?url=8z-3hYeLAQ8T4efOf4848LtAdpGdR1Ect9au4JIUB32bm2z412RDsMelFW1R2aIk
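If the results are needed as data rather than printed output, the same loop could be wrapped in a small helper. This is only a sketch under assumptions: the function name extract_search_results, the constant RESULT_LINK_SELECTOR and the "title"/"url" keys are hypothetical, and skipping links without an href is a choice made here, not something the code above does.

```python
from typing import Any, Dict, List

# Selector taken from the code above: <a> inside <h3> whose class starts with "t".
RESULT_LINK_SELECTOR = "h3[class^='t'] a"


def extract_search_results(page: Any, selector: str = RESULT_LINK_SELECTOR) -> List[Dict[str, str]]:
    """Return one {"title": ..., "url": ...} dict per matching result link."""
    results = []
    for link in page.query_selector_all(selector):
        # text_content() may return None; fall back to an empty title.
        title = (link.text_content() or "").strip()
        url = link.get_attribute("href")
        if not url:
            # Assumption: a result without an href is not useful, so skip it.
            continue
        results.append({"title": title, "url": url})
    return results
```

With a real Playwright Page, after the search has been triggered, extract_search_results(page) would be expected to return a list like the titles and Baidu link URLs printed above.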


I have also posted an answer to this thread:

webautomation – Using Playwright for Python, how do I select (or find) an element? – Stack Overflow

When reposting, please credit: 在路上 » [Solved] Parsing and extracting Baidu search results with Playwright for Python
