折腾:
期间,此处已经抓取到mp4视频地址了:
showVideoCallback: response.url=http://xx.xx?m=home&c=match_new&a=video&show_id=1xxx1
curShowInfoDict={‘act_id’: ‘3’, ‘child_type’: ‘1’, ‘course_id’: ‘xx’, ‘cover_img’: ‘https://xx.xx/2018-06-04/5b14e22b8850a.jpg’, ‘create_time’: ‘1512737780’, ‘head_img’: ‘https://xx.xx/2018-06-23/5b2de55693ad9.jpg’, ‘href’: ‘/index.php?m=home&c=match_new&a=video&show_id=1xxx1’, ‘id’: ‘489’, ‘match_type’: ‘2’, ‘name’: ‘唐xx’, ‘rewards’: ‘0’, ‘scores’: ‘63.20’, ‘shares’: ‘2’, ‘show_id’: ‘1xx1’, ‘show_score’: ‘0’, ‘status’: ‘1’, ‘supports’: ‘104’, ‘uid’: ‘5xx’}
title=学学xx
{‘name’: ‘唐xx’,
‘scores’: ‘63.20’,
‘shares’: ‘2’,
‘show_id’: ‘1xxx1’,
‘supports’: ‘104’,
‘title’: ‘学学xx’,
‘url’: ‘http://xx.xx?m=home&c=match_new&a=video&show_id=1xxx1’,
‘videoUrl’: ‘https://xx.xx/2017-12-08/id15xxxuxx.mp4’}

然后就需要去搞清楚:
PySpider中如何下载文件
目前能想到的是:
难道要用requests第三方库去直接下载和保存文件?
pyspider 下载文件
pyspider示例代码七:自动登陆并获得PDF文件下载地址 – microman – 博客园
pyspider – pysipder下载文件超时 – SegmentFault 思否
用的是:urllib2.urlopen去下载文件的
[python]使用pyspider下载meizitu的图片 – 简书
结果 – pyspider中文文档 – pyspider中文网
[python]使用pyspider下载meizitu的图片 – 简书
倒是可以考虑借用PySpider的自带response,把content保存到本地文件中去的。
pyspider—爬取下载图片 – silianpan – 博客园
【总结】
PySpider中用:
import re
import os
import codecs
import json
OutputFullPath = "/Users/crifan/dev/dev_root/xxx/output"
def downloadVideoCallback(self, response):
itemUrl = response.url
print("downloadVideoCallback: itemUrl=%s,response=%s" % (itemUrl, response))
if not os.path.exists(OutputFullPath):
os.makedirs(OutputFullPath)
print("Ok to create folder %s" % OutputFullPath)
itemInfoDict = response.save
filename = "%s-%s-%s" % (
itemInfoDict["show_id"],
itemInfoDict["name"],
itemInfoDict["title"])
print("filename=%s" % filename)
jsonFilename = filename + ".json"
videoSuffix = itemUrl.split(".")[-1]
videoFileName = filename + "." + videoSuffix
print("jsonFilename=%s,videoSuffix=%s,videoFileName=%s" % (jsonFilename, videoSuffix, videoFileName))
# http://xx.xx/index.php?m=home&c=match_new&a=video&show_id=xxx4
# "梁x"
# "57.20"
# "2"
# "1xx34"
# "94"
# "跟xx心"
# "http://xx.x.x/index.php?m=home&c=match_new&a=video&show_id=xx4"
# "https://xx.x.x./2017-12-08/15126857208283977349.mp4"
jsonFilePath = os.path.join(OutputFullPath, jsonFilename)
print("jsonFilePath=%s" % jsonFilePath)
self.saveJsonToFile(jsonFilePath, itemInfoDict)
videoBinData = response.content
videoFilePath = os.path.join(OutputFullPath, videoFileName)
self.saveDataToFile(videoFilePath, videoBinData)
def saveDataToFile(self, fullFilename, binaryData):
with open(fullFilename, ‘wb’) as fp:
fp.write(binaryData)
fp.close()
print("Complete save file %s" % fullFilename)
def saveJsonToFile(self, fullFilename, jsonValue):
with codecs.open(fullFilename, ‘w’, encoding="utf-8") as jsonFp:
json.dump(jsonValue, jsonFp, indent=2, ensure_ascii=False)
print("Complete save json %s" % fullFilename)实现了保存文件到对应目录:



【后记20190411】
后来整理出下载二进制文件的函数:
def saveDataToFile(fullFilename, binaryData):
"""save binary data info file"""
with open(fullFilename, 'wb') as fp:
fp.write(binaryData)
fp.close()
print("Complete save file %s" % fullFilename)
class Handler(BaseHandler):
def downloadFileCallback(self, response):
fileInfo = response.save
print("fileInfo=%s" % fileInfo)
binData = response.content
fileFullPath = os.path.join(fileInfo["saveFolder"], fileInfo["filename"])
print("fileFullPath=%s" % fileFullPath)
saveDataToFile(fileFullPath, binData)
def downloadFile(self, fileInfo):
urlToDownload = fileInfo["fileUrl"]
print("urlToDownload=%s" % urlToDownload)
self.crawl(urlToDownload,
callback=self.downloadFileCallback,
save=fileInfo)调用举例:
# download audio file
# "path": "Audio/1808/20180911222516379.mp3",
audioFileUrlTail = singleAudioDict["path"]
print("audioFileUrlTail=%s" % audioFileUrlTail)
if audioFileUrlTail:
audioFileInfo = {
"fileUrl": gResourcesRoot + "/" + audioFileUrlTail,
"filename": ("Aduios_%s_" % audioId) + audioFileUrlTail.replace("/", "_"),
"saveFolder": curSingleAudioFolder,
}
self.downloadFile(audioFileInfo)转载请注明:在路上 » 【已解决】PySpider中如何下载mp4视频文件到本地