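Background:
[Unsolved] Using the Python crawler framework PySpider to crawl the Baidu hot list contents
As a first step, I wanted the crawler to return the hot list as results.
Change the project status to RUNNING: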

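Then click Run:

It finished running almost immediately:

Click Results:

But the results contain no data:

So something is wrong with the code.
Back to debugging.
Change the status back to TODO or STOP: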

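The likely culprit:
titleItemList = response.doc('span[class="title-content-title"]').items()
This returns a generator?
So it cannot be used beforehand (for example printed), or it ends up empty?
Changed it to: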
# titleItemList = response.doc('span[class="title-content-title"]').items()
# print("titleItemList=%s" % titleItemList)
# for eachItem in titleItemList:
for eachItem in response.doc('span[class="title-content-title"]').items():
Result: still does not seem to work.
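A quick check of the generator hypothesis, as a minimal sketch in plain Python. `make_titles` is a hypothetical stand-in for `PyQuery.items()`, not part of PySpider or pyquery. It suggests that formatting a generator with `%s` only shows its repr and does not consume anything; only iterating exhausts it.

```python
def make_titles():
    """Hypothetical stand-in for PyQuery.items(): yields items lazily."""
    for title in ["first", "second"]:
        yield title


gen = make_titles()

# Formatting the generator only produces its repr; no item is consumed.
repr_text = "titleItemList=%s" % gen

# The first real iteration still sees every item.
first_pass = list(gen)

# Iterating is what exhausts a generator: a second pass gets nothing.
second_pass = list(gen)
```

So the `print` call on its own should not be what empties the list; an empty result more likely means the selector matched nothing in the first place.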
Next guess:
is the returned dict missing the url?
Add it:
return {
    "url": response.url,
    "百度热榜标题": itemTitleStr,
}
Result:
Still does not look right.
Click the status label SUCCESS:

It jumps to the task detail page:

SUCCESS crawlBaiduHotList_PySpider.baiduHome > https://www.baidu.com/ (7 minutes ago crawled)
taskid e81c1f5749545c5f7d247b3a100ffe62
lastcrawltime 1596174360.7978182 (7 minutes ago)
updatetime 1596174360.797837 (7 minutes ago)
track.fetch 231.26ms
{
  "content": null,
  "encoding": "utf-8",
  "error": null,
  "headers": {},
  "ok": true,
  "redirect_url": null,
  "status_code": 200,
  "time": 0.23125600814819336
}
track.process 8.79ms
titleItemList=<generator object PyQuery.items at 0x108591eb8>
{
  "exception": null,
  "follows": 0,
  "logs": "titleItemList=<generator object PyQuery.items at 0x108591eb8>\n",
  "ok": true,
  "result": null,
  "time": 0.008793115615844727
}
schedule
{}
fetch
{}
process
{
  "callback": "baiduHome"
}
Nothing looks wrong here?
Reference:
Code:
The way it is written looks fine:
# for eachItem in response.doc('span[class="title-content-title"]').items():
titleItemList = response.doc('span[class="title-content-title"]').items()
print("titleItemList=%s" % titleItemList)
for eachItem in titleItemList:
    print("eachItem=%s" % eachItem)
    itemTitleStr = eachItem.text()
    print("itemTitleStr=%s" % itemTitleStr)
    return {
        "url": response.url,
        "百度热榜标题": itemTitleStr,
    }
But the debug output shows nothing after the titleItemList line.
Maybe I am using PyQuery incorrectly?
pyquery
No, the usage is correct.
Try converting it to a list:
titleItemGenerator = response.doc('span[class="title-content-title"]').items()
titleItemList = list(titleItemGenerator)
print("titleItemList=%s" % titleItemList)
Result:
titleItemList=[]
So that is the real problem: the selector matches nothing and the list comes back empty.
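An empty list can mean either a wrong selector or a fetched page that does not contain the expected elements at all. A minimal sketch of one way to tell the two apart is to look for the class name in the raw HTML. `page_has_marker` and the two sample strings are hypothetical and only for illustration; inside the callback the function could be given `response.text`.

```python
def page_has_marker(html_text, marker="title-content-title"):
    """Return True if the raw HTML mentions the class name the selector targets."""
    return marker in html_text


# Hypothetical samples standing in for response.text
full_page = '<span class="title-content-title">some headline</span>'
reduced_page = "<html><body>placeholder page</body></html>"

has_marker_full = page_has_marker(full_page)
has_marker_reduced = page_has_marker(reduced_page)
```

If the marker is missing from the raw text, the selector is not at fault and the request itself deserves a closer look.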
So keep looking for the cause. Probably the request needs a User-Agent header? Let's try:
def on_start(self):
    UserAgent_Chrome_Mac = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
    curHeaderDict = {
        "User-Agent": UserAgent_Chrome_Mac,
    }
    self.crawl('https://www.baidu.com/', callback=self.baiduHome, headers=curHeaderDict)
Result:
That fixed it. The content is finally returned:
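To compare the two kinds of request outside PySpider, here is a minimal sketch using the standard library's `urllib.request` instead of `self.crawl`. `build_request` and `fetch_length` are hypothetical helper names, and whether the response size actually differs for a given site is an assumption to verify, not something this sketch guarantees.

```python
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
)


def build_request(url, user_agent=None):
    """Build a Request, attaching a User-Agent header only when one is given."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, headers=headers)


def fetch_length(request, timeout=10):
    """Send the request and return the size of the response body in bytes."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return len(resp.read())


plain_request = build_request("https://www.baidu.com/")
browser_like_request = build_request("https://www.baidu.com/", USER_AGENT)
```

Calling `fetch_length` on both requests and comparing the numbers gives a rough idea of whether the server sends a reduced page when no browser-like User-Agent is present.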

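[Summary]
The earlier code:
response.doc('span[class="title-content-title"]').items()
did not return the list of Baidu hot list titles we wanted.
Apparent cause:
the page content returned by PySpider's plain fetch
was incomplete.
Root cause: the request had no User-Agent header.
Fix: add one: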
def on_start(self):
    UserAgent_Chrome_Mac = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
    curHeaderDict = {
        "User-Agent": UserAgent_Chrome_Mac,
    }
    self.crawl('https://www.baidu.com/', callback=self.baiduHome, headers=curHeaderDict)
After that, the results come back as expected.
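One more thing worth checking: if the `return` sits inside the `for` loop, the callback ends at the first item and only one title is returned. A minimal sketch of collecting every title before returning follows. `build_result` and the `"titles"` key are hypothetical names; in the callback, the titles would come from calling `.text()` on each item.

```python
def build_result(page_url, titles):
    """Collect all non-empty titles into one result dict."""
    title_list = [t.strip() for t in titles if t and t.strip()]
    return {"url": page_url, "titles": title_list}


# Hypothetical titles standing in for eachItem.text() values
sample_result = build_result(
    "https://www.baidu.com/",
    [" first headline ", "", "second headline"],
)
```

In the callback this could be used as `return build_result(response.url, (item.text() for item in titleItemList))`, assuming a single result dict per page is what is wanted.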
Please credit the source when reposting: 在路上 » [Solved] Results of crawling the Baidu hot list titles with PySpider