Need to crawl the videos and related information from:

xxxxxxxx Contest
http://xxx/index.php?m=Home&c=MatchNew&a=audition&act_id=3

The "老鼠xx" xxx contest has started!
http://xxx/index.php?m=Home&c=MatchNew&a=audition&act_id=4

xxx (National) xx English Contest
http://xxx/index.php?m=Home&c=MatchNew&a=audition&act_id=7
First I used the virtual-environment tool pipenv to create a virtual environment locally, then set about installing and setting up PySpider.
[Solved] pipenv install PySpider hangs at: Locking [packages] dependencies
So I started developing first, and left the pipenv lock hang to worry about later. Run:
<code>pyspider </code>
Then open the PySpider web UI (by default at http://localhost:5000) and create a new project, which starts from the default template:



<code>#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2018-07-11 14:12:12
# Project: xxx

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://xxx/index.php?m=Home&c=MatchNew&a=audition&act_id=3', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
</code>Then I went on to dig into:
http://xxx/index.php?m=Home&c=MatchNew&a=audition&act_id=3
After scrolling the page down to load more, the request sent is:
<code>POST http://xxx/index.php?m=home&c=match_new&a=get_shows

form data (url-encoded):
counter=1&order=1&match_type=2&match_name=&act_id=3
counter=2&order=1&match_type=2&match_name=&act_id=3
...
counter=5&order=1&match_type=2&match_name=&act_id=3
</code>

which returns the data we want:

<code>{"status":1,"data":[
{"id":"795","uid":"4009201","show_id":"103241451","course_id":"41758","supports":"11","rewards":"0","shares":"0","scores":"6.60","status":"1","match_type":"2","create_time":"1512790165","act_id":"3","child_type":"1","show_score":"0","head_img":"https:\/\/x.x.x\/2017-11-20\/5a129fb13791d.jpg","cover_img":"https:\/\/x.x.x\/2017-03-15\/58c8abf7eafb6.jpg","name":"\u6881\u8f69\u94ed","href":"\/index.php?m=home&c=match_new&a=video&show_id=103241451"},
{"id":"386","uid":"733919","show_id":"103099400","course_id":"46923","supports":"10","rewards":"0","shares":"1","scores":"6.40","status":"1","match_type":"2","create_time":"1512734745","act_id":"3","child_type":"1","show_score":"36","head_img":"https:\/\/x.x.x\/2017-07-30\/597d2ed157131.jpg","cover_img":"https:\/\/x.x.x\/2017-06-13\/14973432415241.jpg","name":"\u597d\u60ca\u559c","href":"\/index.php?m=home&c=match_new&a=video&show_id=103099400"},
{"id":"632","uid":"818332","show_id":"103168349","course_id":"17734","supports":"9","rewards":"0","shares":"2","scores":"6.20","status":"1","match_type":"2","create_time":"1512741739","act_id":"3","child_type":"1","show_score":"92","head_img":"https:\/\/x.x.x\/2017-04-06\/58e5d0d774270.jpg","cover_img":"https:\/\/x.x.x\/2018-06-04\/5b14e22b8850a.jpg","name":"\u97e9\u6653\u5915","href":"\/index.php?m=home&c=match_new&a=video&show_id=103168349"},
{"id":"94","uid":"5623116","show_id":"103021383","course_id":"22740","supports":"9","rewards":"0","shares":"2","scores":"6.20","status":"1","match_type":"2","create_time":"1512710369","act_id":"3","child_type":"1","show_score":"0","head_img":"http:\/\/q.qlogo.cn\/qqapp\/1104670989\/D3CE41F908B81149927A05914792468D\/100","cover_img":"https:\/\/x.x.x\/2017-12-12\/5a2f790ed12cf.jpg","name":"\u5434\u6850","href":"\/index.php?m=home&c=match_new&a=video&show_id=103021383"},
{"id":"2284","uid":"1140302","show_id":"104223263","course_id":"22740","supports":"9","rewards":"0","shares":"1","scores":"5.80","status":"1","match_type":"2","create_time":"1513163554","act_id":"3","child_type":"1","show_score":"0","head_img":"https:\/\/x.x.x\/2016-10-16\/5802ffe1b3419.jpg","cover_img":"https:\/\/x.x.x\/2017-12-12\/5a2f790ed12cf.jpg","name":"\u8d75\u6668\u6c50","href":"\/index.php?m=home&c=match_new&a=video&show_id=104223263"},
{"id":"1359","uid":"5697525","show_id":"103519915","course_id":"43716","supports":"9","rewards":"0","shares":"1","scores":"5.80","status":"1","match_type":"2","create_time":"1512879173","act_id":"3","child_type":"1","show_score":"0","head_img":"https:\/\/x.x.x\/2018-06-23\/5b2de55693ad9.jpg","cover_img":"https:\/\/x.x.x\/2017-02-23\/58ae9dec28283.jpg","name":"\u5510\u6615\u73a5","href":"\/index.php?m=home&c=match_new&a=video&show_id=103519915"},
{"id":"281","uid":"3973436","show_id":"103070053","course_id":"41758","supports":"8","rewards":"0","shares":"2","scores":"5.60","status":"1","match_type":"2","create_time":"1512731030","act_id":"3","child_type":"1","show_score":"0","head_img":"https:\/\/x.x.x\/2018-07-05\/5b3d677fe90ce.jpg","cover_img":"https:\/\/x.x.x\/2017-03-15\/58c8abf7eafb6.jpg","name":"\u6881\u4e50","href":"\/index.php?m=home&c=match_new&a=video&show_id=103070053"},
{"id":"172","uid":"4678134","show_id":"103038507","course_id":"41758","supports":"8","rewards":"0","shares":"2","scores":"5.60","status":"1","match_type":"2","create_time":"1512725033","act_id":"3","child_type":"1","show_score":"94","head_img":"https:\/\/x.x.x\/2018-01-24\/5a68647cd462b.jpg","cover_img":"https:\/\/x.x.x\/2017-03-15\/58c8abf7eafb6.jpg","name":"\u8427\u4fca\u9091","href":"\/index.php?m=home&c=match_new&a=video&show_id=103038507"},
{"id":"1918","uid":"12695261","show_id":"103897863","course_id":"43713","supports":"9","rewards":"0","shares":"0","scores":"5.40","status":"1","match_type":"2","create_time":"1512997970","act_id":"3","child_type":"1","show_score":"88","head_img":"https:\/\/x.x.x\/Public\/static\/avatar_default.png","cover_img":"https:\/\/x.x.x\/2017-02-23\/58ae9e49a1353.jpg","name":"\u8c22\u80e4\u9e92","href":"\/index.php?m=home&c=match_new&a=video&show_id=103897863"},
{"id":"1762","uid":"6098791","show_id":"103806041","course_id":"43718","supports":"9","rewards":"0","shares":"0","scores":"5.40","status":"1","match_type":"2","create_time":"1512990207","act_id":"3","child_type":"1","show_score":"95","head_img":"https:\/\/x.x.x\/1526815032729.jpg","cover_img":"https:\/\/x.x.x\/2017-02-23\/58ae9ef2e9b20.jpg","name":"\u8bfa\u8bfa\uff5e\u80d6\u80d6","href":"\/index.php?m=home&c=match_new&a=video&show_id=103806041"}
]}
</code>Next was to figure out how to send a POST request in PySpider and pass url-encoded form data.
[Solved] How to send a POST request in PySpider with form data in application/x-www-form-urlencoded format
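For reference, the url-encoded form bodies from the captured "load more" requests can be rebuilt with the standard library alone. A minimal sketch, using the anonymized host and parameters from above:

```python
from urllib.parse import urlencode

# the get_shows endpoint captured above (host anonymized)
GET_SHOWS_URL = "http://xxx/index.php?m=home&c=match_new&a=get_shows"

def build_get_shows_body(counter, order="1", match_type="2", act_id="3"):
    """Build the url-encoded form body for one 'load more' POST request."""
    return urlencode({
        "counter": counter,
        "order": order,
        "match_type": match_type,
        "match_name": "",
        "act_id": act_id,
    })

# the first few pages, matching the captured requests
bodies = [build_get_shows_body(i) for i in range(1, 6)]
print(bodies[0])  # counter=1&order=1&match_type=2&match_name=&act_id=3
```

Incrementing `counter` is all the pagination there is; every other field stays fixed per contest.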
Then I went on to generate the multiple URLs.
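Generating those entry URLs is just substituting each act_id into the shared audition URL template; a quick sketch:

```python
# the three contest ids from the target pages
actIdList = ["3", "4", "7"]

# common audition URL template (host anonymized as in the rest of the post)
auditionUrlTemplate = "http://xxx/index.php?m=Home&c=MatchNew&a=audition&act_id=%s"

auditionUrlList = [auditionUrlTemplate % actId for actId in actIdList]
for url in auditionUrlList:
    print(url)
```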
After further debugging, the following code:
<code>#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2018-07-11 14:12:12
# Project: xxx
# Author: Crifan Li
# Updated: 20180712

from pyspider.libs.base_handler import *
import re
import os
import codecs
import json
from datetime import datetime, timedelta

xxxUrlRoot = "http://xxx"
OutputFullPath = "/Users/crifan/dev/xxx/output"

MatchInfoDict = {
    # act_id -> match info
    "3": {
        "title": "xxx大赛",
        # para for http://xxx/index.php?m=home&c=match_new&a=get_shows POST
        "match_type": "2",
        "order": [
            "1",  # parent-child group
            "2"   # friends group
        ]
    },
    "4": {
        "title": "xxx2大赛",
        # para for http://xxx/index.php?m=home&c=match_new&a=get_shows POST
        "match_type": "1",
        "order": [
            "create_time",  # latest dubbing
            "scores",       # overall popularity ranking
        ]
    },
    "7": {
        "title": "yyy赛",
        # para for http://xxx?m=home&c=match_new&a=get_shows POST
        "match_type": "2",
        "order": [
            "1",  # preschool group
            "2"   # primary school group
        ]
    },
}


class Handler(BaseHandler):
    crawl_config = {
    }

    # @every(minutes=24 * 60)
    def on_start(self):
        # actIdList = ["3", "4", "7"]
        # for debug
        actIdList = ["4", "7", "3"]
        for curActId in actIdList:
            curUrl = "http://xxx/index.php?m=Home&c=MatchNew&a=audition&act_id=%s" % curActId
            self.crawl(curUrl, callback=self.indexPageCallback, save=curActId)

    # @config(age=10 * 24 * 60 * 60)
    def indexPageCallback(self, response):
        curActId = response.save
        print("curActId=%s" % curActId)
        # <ul class="list-user list-user-1" id="list-user-1">
        for each in response.doc('ul[id^="list-user"] li a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.showVideoCallback, save=curActId)

        # <ul class="list-user list-user-1" id="list-user-1">
        # <ul class="list-user list-user-2" id="list-user-2">
        curPageNum = 1
        curMatchOrderList = MatchInfoDict[curActId]["order"]
        match_type = MatchInfoDict[curActId]["match_type"]
        print("curMatchOrderList=%s,match_type=%s" % (curMatchOrderList, match_type))
        for curOrder in curMatchOrderList:
            print("curOrder=%s" % curOrder)
            getShowsParaDict = {
                "counter": curPageNum,
                "order": curOrder,
                "match_type": match_type,
                "match_name": "",
                "act_id": curActId
            }
            self.getNextPageShow(response, getShowsParaDict)

    def getNextPageShow(self, response, getShowsParaDict):
        """
        recursively get next page shows until fail
        """
        print("getNextPageShow: getShowsParaDict=%s" % getShowsParaDict)
        getShowsUrl = "http://xxx/index.php?m=home&c=match_new&a=get_shows"
        headerDict = {
            "Content-Type": "application/x-www-form-urlencoded"
        }
        timestampStr = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
        getShowsUrlWithHash = getShowsUrl + "#" + timestampStr
        # e.g. 20180712_154134_660436_1_2_4
        fakeItagForceRecrawl = "%s_%s_%s_%s" % (
            timestampStr,
            getShowsParaDict["counter"],
            getShowsParaDict["order"],
            getShowsParaDict["act_id"]
        )
        self.crawl(
            getShowsUrlWithHash,
            itag=fakeItagForceRecrawl,  # to force re-crawl for next page
            method="POST",
            headers=headerDict,
            data=getShowsParaDict,
            cookies=response.cookies,
            callback=self.parseGetShowsCallback,
            save=getShowsParaDict
        )

    def parseGetShowsCallback(self, response):
        print("parseGetShowsCallback: self=%s, response=%s" % (self, response))
        respJson = response.json
        prevPageParaDict = response.save
        print("prevPageParaDict=%s, respJson=%s" % (prevPageParaDict, respJson))
        if respJson["status"] == 1:
            respData = respJson["data"]
            # recursively try to get next page of shows
            prevPageParaDict["counter"] = prevPageParaDict["counter"] + 1
            self.getNextPageShow(response, prevPageParaDict)

            for eachData in respData:
                # print("type(eachData)=%s" % type(eachData))
                showId = eachData["show_id"]
                href = eachData["href"]
                fullUrl = xxxUrlRoot + href
                print("[%s] fullUrl=%s" % (showId, fullUrl))
                curShowInfoDict = eachData
                self.crawl(
                    fullUrl,
                    callback=self.showVideoCallback,
                    save=curShowInfoDict)
        else:
            print("!!! Failed to get shows json from %s" % response.url)

    # @config(priority=2)
    def showVideoCallback(self, response):
        print("showVideoCallback: response.url=%s" % (response.url))
        curShowInfoDictOrActId = response.save
        print("curShowInfoDictOrActId=%s" % curShowInfoDictOrActId)
        act_id = ""
        curShowInfoDict = None
        if isinstance(curShowInfoDictOrActId, str):
            act_id = curShowInfoDictOrActId
            print("para is curActId")
        elif isinstance(curShowInfoDictOrActId, dict):
            curShowInfoDict = curShowInfoDictOrActId
            print("para is curShowInfoDict")
        else:
            print("!!! can not recognize parameter for showVideoCallback")

        title = response.doc('span[class="video-title"]').text()
        show_id = ""
        name = ""
        scores = ""    # popularity
        supports = ""  # like count
        shares = ""    # share count
        # <video controls="" class="video-box" poster="https://xxx/2017-02-23/58ae9dec28283.jpg" id="myVideo">
        #   <source src="https://xxx/2017-12-15/id1513344895u878964.mp4" type="video/mp4"> 您的浏览器不支持Video标签。
        # </video>
        # videoUrl = response.doc('video source[src$=".mp4"]')
        videoUrl = response.doc('video source[src^="http"]').attr("src")
        print("title=%s" % title)
        if curShowInfoDict:
            act_id = curShowInfoDict["act_id"]
            print("inside curShowInfoDict: set act_id to %s" % act_id)
            show_id = curShowInfoDict["show_id"]
            name = curShowInfoDict["name"]
            scores = curShowInfoDict["scores"]
            supports = curShowInfoDict["supports"]
            shares = curShowInfoDict["shares"]
        else:
            # <a href="javascript:;" class="sign-btn" id="redirect_show" sid="104728193" onclick="pauseVid()">投票传送门</a>
            show_id = response.doc('a[id="redirect_show"]').attr("sid")
            # <div class="v-user">
            #   <span class="v-user-name">徐欣蕊</span>
            #   <span>热度:65.00</span>
            name = response.doc('span[class="v-user-name"]').text()
            scoresText = response.doc('div[class="v-user"] span:nth-child(2)').text()
            print("scoresText=%s" % scoresText)
            scoresMatch = re.search(r"热度:(?P<scoresFloatText>[\d\.]+)", scoresText)
            print("scoresMatch=%s" % scoresMatch)
            if scoresMatch:
                scores = scoresMatch.group("scoresFloatText")
            print("scores=%s" % scores)
            # <ul>
            #   <li class="li-1">
            #     <img src="https://x.x.x/Home/images/dubbing/icon6.png?201806116141">
            #     <span>107次</span>
            #   </li>
            #   <li class="li-2">
            #     <img src="https://x.x.x/Home/images/dubbing/icon8.png?201806116141">
            #     <span>2次</span>
            #   </li>
            # </ul>
            supportsText = response.doc('ul li[class="li-1"] span').text()
            supportsMatch = re.search(r"(?P<supportIntText>\d+)次", supportsText)
            print("supportsMatch=%s" % supportsMatch)
            if supportsMatch:
                supports = supportsMatch.group("supportIntText")
            print("supports=%s" % supports)
            sharesText = response.doc('ul li[class="li-2"] span').text()
            sharesMatch = re.search(r"(?P<sharesIntText>\d+)次", sharesText)
            print("sharesMatch=%s" % sharesMatch)
            if sharesMatch:
                shares = sharesMatch.group("sharesIntText")
            print("shares=%s" % shares)

        respDict = {
            "url": response.url,
            "act_id": act_id,
            "title": title,
            "show_id": show_id,
            "name": name,
            "scores": scores,
            "supports": supports,
            "shares": shares,
            "videoUrl": videoUrl
        }
        self.crawl(
            videoUrl,
            callback=self.saveVideoAndJsonCallback,
            save=respDict)
        return respDict

    def saveVideoAndJsonCallback(self, response):
        itemUrl = response.url
        print("saveVideoAndJsonCallback: itemUrl=%s,response=%s" % (itemUrl, response))
        itemInfoDict = response.save
        curActId = itemInfoDict["act_id"]
        print("curActId=%s" % curActId)
        matchName = MatchInfoDict[curActId]["title"]
        print("matchName=%s" % matchName)
        matchFolderPath = os.path.join(OutputFullPath, matchName)
        print("matchFolderPath=%s" % matchFolderPath)
        if not os.path.exists(matchFolderPath):
            os.makedirs(matchFolderPath)
            print("Ok to create folder %s" % matchFolderPath)
        filename = "%s-%s-%s" % (
            itemInfoDict["show_id"],
            itemInfoDict["name"],
            itemInfoDict["title"])
        print("filename=%s" % filename)
        jsonFilename = filename + ".json"
        videoSuffix = itemUrl.split(".")[-1]
        videoFileName = filename + "." + videoSuffix
        print("jsonFilename=%s,videoSuffix=%s,videoFileName=%s" % (jsonFilename, videoSuffix, videoFileName))
        # example itemInfoDict:
        # {
        #     'act_id': '7',
        #     'name': '李冉月',
        #     'scores': '22.50',
        #     'shares': '1',
        #     'show_id': '138169051',
        #     'supports': '44',
        #     'title': '【激情】坚持到底不放弃',
        #     'url': 'http://x.x.x/index.php?m=home&c=match_new&a=video&show_id=138169051',
        #     'videoUrl': 'https://cdnx.x.x/2018-06-03/152798389836832449205.mp4'
        # }
        jsonFilePath = os.path.join(matchFolderPath, jsonFilename)
        print("jsonFilePath=%s" % jsonFilePath)
        self.saveJsonToFile(jsonFilePath, itemInfoDict)
        videoBinData = response.content
        videoFilePath = os.path.join(matchFolderPath, videoFileName)
        self.saveDataToFile(videoFilePath, videoBinData)

    def saveDataToFile(self, fullFilename, binaryData):
        with open(fullFilename, 'wb') as fp:
            fp.write(binaryData)
        print("Complete save file %s" % fullFilename)

    def saveJsonToFile(self, fullFilename, jsonValue):
        with codecs.open(fullFilename, 'w', encoding="utf-8") as jsonFp:
            json.dump(jsonValue, jsonFp, indent=2, ensure_ascii=False)
        print("Complete save json %s" % fullFilename)
</code>manages to download the mp4 videos and JSON info to the local machine:
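One trick in the code above deserves a note: PySpider deduplicates tasks by URL, so to re-POST the same get_shows endpoint the script appends a timestamp fragment to the URL and passes a unique itag. Isolated from the handler, the same logic looks like this (a sketch of that trick, not a new API):

```python
from datetime import datetime

def make_forced_recrawl_task(base_url, para_dict):
    """Make the task URL and itag unique so PySpider does not skip it as a duplicate."""
    timestamp_str = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
    # the fragment keeps the real request URL intact but changes the task hash
    url_with_hash = base_url + "#" + timestamp_str
    # e.g. 20180712_154134_660436_2_1_3
    fake_itag = "%s_%s_%s_%s" % (
        timestamp_str,
        para_dict["counter"],
        para_dict["order"],
        para_dict["act_id"],
    )
    return url_with_hash, fake_itag

url, itag = make_forced_recrawl_task(
    "http://xxx/index.php?m=home&c=match_new&a=get_shows",
    {"counter": 2, "order": "1", "act_id": "3"},
)
```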



[Postscript]
[Unsolved] Running PySpider as a deployment, rather than clicking RUN in the debug UI
Still, after several hours of running, the crawl finally finished:

Over 30,000 items in total; roughly half of the (mp4) URLs appeared to be duplicates, so there were only about 15,000 distinct videos.
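Since the saved JSON files each record their videoUrl, the duplicate ratio can be checked after the fact with a small post-processing sketch (it assumes the output layout produced by the script above: per-match folders of `<show_id>-<name>-<title>.json` files under OutputFullPath):

```python
import os
import json

def count_distinct_videos(output_root):
    """Walk the output folders and count JSON files vs. unique videoUrl values."""
    seen_video_urls = set()
    total = 0
    for dirpath, _dirnames, filenames in os.walk(output_root):
        for filename in filenames:
            if not filename.endswith(".json"):
                continue
            with open(os.path.join(dirpath, filename), encoding="utf-8") as fp:
                info = json.load(fp)
            total += 1
            seen_video_urls.add(info.get("videoUrl"))
    return total, len(seen_video_urls)

# usage: total, distinct = count_distinct_videos("/Users/crifan/dev/xxx/output")
```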
Repost with attribution: 在路上 » [Solved] Using PySpider to crawl videos from a certain website