折腾:
【记录】爬虫 爬数据 义务教育教科书 义教教科书
期间,先去爬取:
义教教科书英语八年级下册
先去用Chrome去分析看看:
是否简单能找到api和请求
简单分析了下,貌似是:
第三张图片的thumb:
https://bp.pep.com.cn/ebook/yybanjxc/files/thumb/3.jpg?200209175611

第三张图片的移动端的大图:
https://bp.pep.com.cn/ebook/yybanjxc/files/mobile/3.jpg?200209175611

然后之后其他几张图片地址都是类似的:
https://bp.pep.com.cn/ebook/yybanjxc/files/thumb/1.jpg?200209175611 https://bp.pep.com.cn/ebook/yybanjxc/files/mobile/1.jpg?200209175611 https://bp.pep.com.cn/ebook/yybanjxc/files/thumb/2.jpg?200209175611 https://bp.pep.com.cn/ebook/yybanjxc/files/mobile/2.jpg?200209175611 https://bp.pep.com.cn/ebook/yybanjxc/files/mobile/4.jpg?200209175611
然后去找找200209175611,是怎么得来的

拷贝相关内容出来,放到VSCode中,搜索看看:

相关部分是:
bookConfig.totalPageCount=147; bookConfig.largePageWidth=1024; bookConfig.largePageHeight=1432;; bookConfig.securityType="1"; bookConfig.CreatedTime ="200209175611";bookConfig.bookTitle="义教教科书英语八年级下册"; bookConfig.bookmarkCR="fedde07e3aa4fb28b08228ec8a994da9f421c6dd"; bookConfig.productName="名编辑企业版"; bookConfig.homePage="http://www.mingbianji.com";
去看看:

-》很明显,这个电子书就是这家公司,或相关技术制作的。
所以可以去爬取去试试了。
然后先去:
【已解决】Python的requests中如何下载二进制数据保存为图片文件
再去批量运行,也是OK的:

【总结】
最后完整代码是:
# 下载在线电子书:
# 义教教科书英语八年级下册
# https://bp.pep.com.cn/ebook/yybanjxc/mobile/index.html
# 的图片
# Author: Crifan Li
# Update: 20200302
import os
import requests
# bookConfig.bookTitle="义教教科书英语八年级下册";
gBookTitle = "义教教科书英语八年级下册"
# bookConfig.CreatedTime ="200209175611";
gCreateTimeStr = "200209175611"
# bookConfig.totalPageCount=147;
gTotalPageCount=147
UserAgent_Mac_Chrome = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36"
gHeaders = {
"User-Agent": UserAgent_Mac_Chrome,
}
gSaveFolder = os.path.join("output", gBookTitle)
def createFolder(folderFullPath):
"""
create folder, even if already existed
Note: for Python 3.2+
"""
os.makedirs(folderFullPath, exist_ok=True)
createFolder(gSaveFolder)
for curPageIdx in range(gTotalPageCount):
curPageNum = curPageIdx + 1
# https://bp.pep.com.cn/ebook/yybanjxc/files/thumb/1.jpg?200209175611
# https://bp.pep.com.cn/ebook/yybanjxc/files/mobile/1.jpg?200209175611
# curImageType = "thumb"
curImageType = "mobile"
curPictureUrl = "https://bp.pep.com.cn/ebook/yybanjxc/files/%s/%d.jpg?%s" % (curImageType, curPageNum, gCreateTimeStr)
print("[%d] url=%s" % (curPageNum, curPictureUrl))
saveFilename = "%s_%d.jpg" % (curImageType, curPageNum)
saveFullPath = os.path.join(gSaveFolder, saveFilename)
resp = requests.get(curPictureUrl, headers=gHeaders)
if resp.ok:
with open(saveFullPath, 'wb') as saveFp:
saveFp.write(resp.content)
print("Saved to %s" % saveFullPath)
else:
print("!!! fail to open url: %s, reason: %s, status_code" % (curPictureUrl, resp.reason, resp.status_code))继续运行后,即可下载全部147张图片:

效果不错。
转载请注明:在路上 » 【已解决】爬取bp.pep.com.cn中的义务教育教科书资源