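Previously completed:
[Solved] Autohome (汽车之家) car model and series data: supporting the new-style series page
Later I found that some old-style series pages use a different HTML structure,
so support for them has to be added as well.
It also means the earlier crawl results missed these pages: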

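Old-style series pages
Example:
brands starting with Q
->
For these old-style series, the letter index page clearly shows the guide price as 暂无 (not available), with no link: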

->
Only inside the series page itself is there a guide price (given as a range) for the individual car models.

For such pages, first add code that decides whether the page uses the new or the old layout:
carSeriesHtml = response.text
print("type(carSeriesHtml)=%s" % type(carSeriesHtml)) # <class 'str'>
# print("carSeriesHtml=%s" % carSeriesHtml)
# new-layout pages define the JS variable levelid
foundLevelId = re.search(r"var\s+levelid\s+=", carSeriesHtml)
print("foundLevelId=%s" % foundLevelId)
isNewLayoutHtml = bool(foundLevelId)
print("isNewLayoutHtml=%s" % isNewLayoutHtml)
# old-layout pages define the JS variable showCityId
foundShowCityId = re.search(r"var\s+showCityId\s+=", carSeriesHtml)
print("foundShowCityId=%s" % foundShowCityId)
isOldLayoutHtml = bool(foundShowCityId)
print("isOldLayoutHtml=%s" % isOldLayoutHtml)
Only then can each layout be handled by its own code:
if isOldLayoutHtml:
    # old layout
    pass
elif isNewLayoutHtml:
    # new layout
    # for details see the earlier post: [Solved] Autohome car model and series data: supporting the new-style series page
    pass
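The layout check above can also be wrapped in a small helper. The following is only a sketch: the function name `detect_layout` and the `"unknown"` fallback are assumptions added for illustration, while the two marker variables (`levelid`, `showCityId`) and the old-before-new order come from the code in this post.

```python
import re

# Marker variables taken from the post: new-layout pages define `levelid`,
# old-layout pages define `showCityId`.
NEW_LAYOUT_PATTERN = re.compile(r"var\s+levelid\s+=")
OLD_LAYOUT_PATTERN = re.compile(r"var\s+showCityId\s+=")


def detect_layout(html):
    """Return "old", "new" or "unknown" for a series page (hypothetical helper)."""
    # The old layout is tested first, mirroring the if/elif order used in the post.
    if OLD_LAYOUT_PATTERN.search(html):
        return "old"
    if NEW_LAYOUT_PATTERN.search(html):
        return "new"
    # Neither marker was found: let the caller decide (skip, log, retry, ...).
    return "unknown"
```

Returning an explicit `"unknown"` makes pages that match neither marker visible instead of silently dropping them.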
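Next, the content to scrape from the old layout and the processing logic.
Taking an old-style series page
as the example, the content to scrape is: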

Switching to another model year shows the same structure:

Now the processing logic.
From the HTML
<div class="car_detail " id="tab1-2">
  <div class="models">
    <!--年代-->
    <div class="header">
      <div class="car_price">
        <span class="years">2005款</span>
        <span class="price">指导价(停售):<strong class="red">6.28万-9.18万</strong></span>
        <span class="price">二手车价格:<strong class="red"><a class='cd60000' href='//www.che168.com/china/qiya/qianlima/a0_0msdgscncgpiltocsp1exs276/?pvareaid=103693'>0.39万-1.30万</a></strong></span>
。。。
<div class="car_detail current" id="tab1-1">
  <div class="models">
    <!--年代-->
    <div class="header">
      <div class="car_price">
        <span class="years">2006款</span>
        <span class="price">指导价(停售):<strong class="red">7.28万-8.58万</strong></span>
parse out:
- carModelYear: 2006款
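As a rough illustration of this step, the model years can be pulled out with the standard library alone. This is a sketch, not the crawler's actual method (the full code in this post uses PyQuery selectors); the helper name `extract_model_years` is made up for the example.

```python
import re

# Matches <span class="years">2006款</span>, with either quote style.
YEARS_PATTERN = re.compile(r"""<span\s+class=["']years["']>([^<]+)</span>""")


def extract_model_years(html):
    """Return every model-year label in document order (hypothetical helper)."""
    return [year.strip() for year in YEARS_PATTERN.findall(html)]


sample_html = """
<div class="car_detail " id="tab1-2">
  <span class="years">2005款</span>
</div>
<div class="car_detail current" id="tab1-1">
  <span class="years">2006款</span>
</div>
"""
print(extract_model_years(sample_html))
```

Each `car_detail` block carries exactly one `years` span in the sample above, so the list lines up one-to-one with the year tabs.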
To stay consistent with the earlier handling of new-style series pages,
the data for the whole series is taken from the first model.
From the HTML
<div class="models_info">
  <dl class='models_pics'>
    <dt><a href='//car.autohome.com.cn/photolist/series/2305/23796.html?pvareaid=101468'><img src='https://car0.autoimg.cn/upload/spec/1344/t_1344388912334.jpg' width='240' height='180' /></a></dt>
parse out:
- carSeriesMainImgUrl: https://car0.autoimg.cn/upload/spec/1344/t_1344388912334.jpg
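A standard-library sketch of the image lookup is shown below. The regular expression and the helper name `extract_first_main_img_url` are assumptions for illustration only; the full code in this post selects the same `img` element with a PyQuery selector instead.

```python
import re

# First <img src=...> that follows a <dl class='models_pics'> opening tag.
MAIN_IMG_PATTERN = re.compile(
    r"""<dl\s+class=["']models_pics["']>.*?<img\s[^>]*?src=["']([^"']+)["']""",
    re.S,
)


def extract_first_main_img_url(html):
    """Return the first model picture URL, or "" if none is found (hypothetical helper)."""
    found = MAIN_IMG_PATTERN.search(html)
    return found.group(1) if found else ""


sample_html = """
<div class="models_info">
  <dl class='models_pics'>
    <dt><a href='//car.autohome.com.cn/photolist/series/2305/23796.html?pvareaid=101468'><img
        src='https://car0.autoimg.cn/upload/spec/1344/t_1344388912334.jpg' width='240'
        height='180' /></a></dt>
"""
print(extract_first_main_img_url(sample_html))
```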
From:
<div class="car_price">
  <span class="price">指导价(停售):<strong class="red">7.28万-8.58万</strong></span>
parse out:
- carSeriesMsrp: 7.93万
  - = (carSeriesMinPrice + carSeriesMaxPrice) / 2
- carSeriesMinPrice: 7.28万
- carSeriesMaxPrice: 8.58万
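The price arithmetic above can be sketched as a standalone function. The regular expression and the `%.2f万` formatting follow the full code in this post; the function name `parse_msrp_range` and the `None` return for unparsable text are assumptions for the example.

```python
import re

# Same pattern as the crawler: "<min>万-<max>万", e.g. "7.28万-8.58万".
PRICE_RANGE_PATTERN = re.compile(r"(?P<minPrice>[\d.]+)万-(?P<maxPrice>[\d.]+)万")


def parse_msrp_range(price_text):
    """Split a guide-price range and derive the series MSRP as the midpoint.

    Returns None when the text is not a range (for example "暂无").
    """
    found = PRICE_RANGE_PATTERN.search(price_text)
    if not found:
        return None
    min_price = float(found.group("minPrice"))
    max_price = float(found.group("maxPrice"))
    average_price = (min_price + max_price) / 2.0
    return {
        "carSeriesMsrp": "%.2f万" % average_price,
        "carSeriesMinPrice": "%.2f万" % min_price,
        "carSeriesMaxPrice": "%.2f万" % max_price,
    }


print(parse_msrp_range("7.28万-8.58万"))
```

For the sample page this gives (7.28 + 8.58) / 2 = 7.93, matching the values listed above.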
Then continue by parsing the data of each model.
From the HTML
<div class="modelswrap">
  <!-- 信息 start -->
  <div class="models_info">
    <dl class='models_prop'>
      <dt>发动机:</dt>
      <dd><span>1.3L</span><span>1.6L</span></dd>
    </dl>
    <dl class='models_prop'>
      <dt>变速箱:</dt>
      <dd><span>手动</span><span>自动</span></dd>
      <dt>车身结构:</dt>
      <dd><span>三厢</span></dd>
    </dl>
parse out:
- carModelGearBox: 手动自动
- carModelDriveType: empty
- carModelEmissionStandards: empty
- carModelPower: 1.3L1.6L
- carModelGroupName: 1.3L1.6L 手动自动 三厢
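The mapping from the three `dd` values to the output fields can be sketched as follows. It relies on the same positional assumption as the full code in this post (engine first, gearbox second, body structure third); the regex-based tag stripping and the helper name `parse_models_prop` are illustrative choices, not the crawler's PyQuery calls.

```python
import re

DD_PATTERN = re.compile(r"<dd>(.*?)</dd>", re.S)
TAG_PATTERN = re.compile(r"<[^>]+>")


def parse_models_prop(html):
    """Map the first three <dd> values to the model-group fields (hypothetical helper)."""
    # Drop the inner <span> tags so "1.3L" and "1.6L" join into "1.3L1.6L".
    values = [TAG_PATTERN.sub("", dd).strip() for dd in DD_PATTERN.findall(html)]
    if len(values) < 3:
        return None  # unexpected structure: fewer than three properties
    engine, gear_box, body_structure = values[:3]
    return {
        "carModelGearBox": gear_box,
        "carModelDriveType": "",          # not available on old-style pages
        "carModelEmissionStandards": "",  # not available on old-style pages
        "carModelPower": engine,
        "carModelGroupName": "%s %s %s" % (engine, gear_box, body_structure),
    }


sample_html = """
<dl class='models_prop'><dt>发动机:</dt><dd><span>1.3L</span><span>1.6L</span></dd></dl>
<dl class='models_prop'><dt>变速箱:</dt><dd><span>手动</span><span>自动</span></dd>
<dt>车身结构:</dt><dd><span>三厢</span></dd></dl>
"""
print(parse_models_prop(sample_html))
```

Checking the number of values before indexing avoids an `IndexError` on pages where one of the three properties is missing.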
From the HTML
<table class='models_tab tableline' cellspacing='0' cellpadding='0' border='0'>
  <tr>
    <td class='name_d'>
      <div class='name'><a title='2006款 1.6L MT特别版GL' href='spec/2304/'>2006款 1.6L MT特别版GL</a></div>
    </td>
    <td class='price_d'>
      <div class='price01'>8.18万</div>
    </td>
parse out:
- carModelName: 2006款 1.6L MT特别版GL
- carModelSpecUrl: https://www.autohome.com.cn/spec/2304/
- carModelMsrp: 8.18万
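Two small normalizations happen at this step: the relative `href` is rebuilt into a canonical spec URL, and a price of 暂无 is stored as an empty string. A minimal sketch of both, assuming helper names (`normalize_spec_url`, `normalize_msrp`) that do not appear in the original code:

```python
import re

AUTOHOME_HOST = "https://www.autohome.com.cn"


def gen_spec_url(spec_id):
    """Build the canonical spec URL, e.g. https://www.autohome.com.cn/spec/2304/."""
    return "%s/spec/%s/" % (AUTOHOME_HOST, spec_id)


def normalize_spec_url(href):
    """Rebuild the spec URL from any href containing "spec/<id>" (hypothetical helper)."""
    found = re.search(r"spec/(?P<specId>\d+)", href)
    return gen_spec_url(found.group("specId")) if found else ""


def normalize_msrp(price_text):
    """Treat "暂无" (not available) as an empty price (hypothetical helper)."""
    return "" if "暂无" in price_text else price_text.strip()


print(normalize_spec_url("spec/2304/"))
print(normalize_spec_url("https://www.autohome.com.cn/142/spec/2304/"))
print(normalize_msrp("8.18万"))
```

Rebuilding the URL from the spec id sidesteps the wrongly resolved `https://www.autohome.com.cn/142/spec/2304/` form noted in the full code.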
The complete code supporting both new-style and old-style series pages is as follows:
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2020-08-20 22:19:20
# Project: autohome_20200819
import string
import re
import copy
from lxml import etree
from pyspider.libs.base_handler import *
AutohomeHost = "https://www.autohome.com.cn"
CarSpecPrefix = "%s/spec" % AutohomeHost # "https://www.autohome.com.cn/spec/%s/"
class Handler(BaseHandler):
UserAgent_Mac_Chrome = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"
crawl_config = {
"headers": {
"User-Agent": UserAgent_Mac_Chrome,
}
}
def genSpecUrl(self, specId):
# return "%s/%s" % (CarSpecPrefix, specId)
return "%s/%s/" % (CarSpecPrefix, specId)
# @every(minutes=24 * 60)
def on_start(self):
# autohomeEntryUrl = "https://www.autohome.com.cn/car/"
# self.crawl(autohomeEntryUrl, callback=self.carBrandListCallback)
for eachLetter in list(string.ascii_lowercase):
letterUpper = eachLetter.upper()
# # for debug
# letterUpper = "H"
print("letterUpper=%s" % letterUpper)
self.crawl("https://www.autohome.com.cn/grade/carhtml/%s.html" % eachLetter,
save={"initials": letterUpper},
callback=self.gradCarHtmlPage)
# # @config(age=10 * 24 * 60 * 60)
# def carBrandListCallback(self, response):
# print("response.url=%s" % response.url)
# # <div vos="gs" class="uibox" id="boxA" style="">
# for eachVosGs in response.doc('div[vos="gs"]').items():
# print("eachVosGs=%s" % eachVosGs)
# # self.crawl(each.attr.href, callback=self.detail_page)
# # @config(priority=2)
# def detail_page(self, response):
# return {
# "url": response.url,
# "title": response.doc('title').text(),
# }
@catch_status_code_error
def gradCarHtmlPage(self, response):
print("gradCarHtmlPage: response=", response)
# picSeriesItemList = response.doc('.rank-list-ul li div a[href*="/pic/series"]').items()
# print("picSeriesItemList=", picSeriesItemList)
# print("len(picSeriesItemList)=%s"%(len(picSeriesItemList)))
# for each in picSeriesItemList:
# self.crawl(each.attr.href, callback=self.picSeriesPage)
saveDict = response.save
print("saveDict=", saveDict)
initials = saveDict["initials"]
print("initials=", initials)
respText = response.text
# print("respText=", respText)
"""
<dl id="33" olr="6">
<dt><a href="//car.autohome.com.cn/price/brand-33.html#pvareaid=2042362"><img width="50" height="50"
src="//car2.autoimg.cn/cardfs/series/g26/M0B/AE/B3/100x100_f40_autohomecar__wKgHEVs9u5WAV441AAAKdxZGE4U148.png"></a>
<div><a href="//car.autohome.com.cn/price/brand-33.html#pvareaid=2042362">奥迪</a></div>
</dt>
"""
# brandDoc = response.doc('dl dt')
# print("brandDoc=%s" % brandDoc)
# brandListDoc = response.doc('dl[id and orl] dt')
# dlListDoc = response.doc('dl[id and orl]').items()
# dlListDoc = response.doc("dl[id*=''][orl*='']").items()
# dlListDoc = response.doc("dl[orl*='']").items()
# dlListDoc = response.doc("dl").items()
# dlListDoc = response.doc("dl:regex(id, \d+)").items()
# dlListDoc = response.doc("dl:regex(id,[0-9])").items()
# dlListDoc = response.doc("dl[id]").items()
dlListDoc = response.doc("dl[olr]").items()
print("type(dlListDoc)=%s" % type(dlListDoc))
dlList = list(dlListDoc)
print("len(dlList)=%s" % len(dlList))
print("dlList=%s" % dlList)
for curBrandIdx, eachDlDoc in enumerate(dlList):
print("%s [%d] %s" % ('#'*30, curBrandIdx, '#'*30))
dtDoc = eachDlDoc.find("dt")
# print("dtDoc=%s" % dtDoc)
# <a href="//car.autohome.com.cn/price/brand-33.html#pvareaid=2042362"><img width="50" height="50" src="//car2.autoimg.cn/cardfs/series/g26/M0B/AE/B3/100x100_f40_autohomecar__wKgHEVs9u5WAV441AAAKdxZGE4U148.png"></a>
brandLogoDoc = dtDoc.find('a img')
# print("brandLogoDoc=%s" % brandLogoDoc)
carBrandLogoUrl = brandLogoDoc.attr["src"]
print("carBrandLogoUrl=%s" % carBrandLogoUrl)
# <div><a href="//car.autohome.com.cn/price/brand-33.html#pvareaid=2042362">奥迪</a></div>
brandNameDoc = dtDoc.find('div a')
# print("brandNameDoc=%s" % brandNameDoc)
carBrandName = brandNameDoc.text()
print("carBrandName=%s" % carBrandName)
# <div class="h3-tit"><a href="//car.autohome.com.cn/price/brand-33-9.html#pvareaid=2042363">一汽-大众奥迪</a></div>
# merchantDocGenerator = response.doc("dd div[class='h3-tit'] a").items()
# ddDoc = eachDlDoc.find("dd")
ddDoc = eachDlDoc.find("dd")
# print("ddDoc=%s" % ddDoc)
merchantDocGenerator = ddDoc.items("div[class='h3-tit'] a")
merchantDocList = list(merchantDocGenerator)
# print("merchantDocList=%s" % merchantDocList)
merchantDocLen = len(merchantDocList)
print("merchantDocLen=%s" % merchantDocLen)
# <ul class="rank-list-ul" 0>
# merchantRankDocGenerator = response.doc("dd ul[class='rank-list-ul']")
# merchantRankDocGenerator = response.doc("dd ul[class='rank-list-ul']").items()
merchantRankDocGenerator = ddDoc.items("ul[class='rank-list-ul']")
merchantRankDocList = list(merchantRankDocGenerator)
# print("merchantRankDocList=%s" % merchantRankDocList)
merchantRankDocListLen = len(merchantRankDocList)
print("merchantRankDocListLen=%s" % merchantRankDocListLen)
for curIdx, merchantItem in enumerate(merchantDocList):
# for curIdx, merchantItem in enumerate(merchantDocGenerator):
# print("%s" % "="*80)
print("%s [%d] %s" % ('='*30, curIdx, '='*30))
# print("type(merchantItem)=%s" % type(merchantItem))
# print("[%d] merchantItem=%s" % (curIdx, merchantItem))
# print("[%d] merchantItem=%s" % (curIdx, merchantItem))
carMerchantName = merchantItem.text()
print("carMerchantName=%s" % carMerchantName)
merchantItemAttr = merchantItem.attr
# print("merchantItemAttr=%s" % merchantItemAttr)
carMerchantUrl = merchantItemAttr["href"]
print("carMerchantUrl=%s" % carMerchantUrl)
# curSubBrandDict = {
# "brandName": brandName,
# "carBrandLogoUrl": carBrandLogoUrl,
# "carMerchantName": carMerchantName,
# "carMerchantUrl": carMerchantUrl,
# }
# self.send_message(self.project_name, curSubBrandDict, url=carMerchantUrl)
merchantRankDoc = merchantRankDocList[curIdx]
# print("merchantRankDoc=%s" % merchantRankDoc)
# print("type(merchantRankDoc)=%s" % type(merchantRankDoc))
# type(merchantRankDoc)=<class 'lxml.html.HtmlElement'>
# merchantRankHtml = etree.tostring(merchantRankDoc)
# type(merchantRankDoc)=<class 'pyquery.pyquery.PyQuery'>
# merchantRankHtml = merchantRankDoc.html()
# print("merchantRankHtml=%s" % merchantRankHtml)
# <li id="s3170">
# carSeriesDocGenerator = merchantRankDoc.find("li")
# carSeriesDocGenerator = merchantRankDoc.find("li[id*='s']")
carSeriesDocGenerator = merchantRankDoc.items("li[id*='s']")
# print("type(carSeriesDocGenerator)=%s" % type(carSeriesDocGenerator))
carSeriesDocList = list(carSeriesDocGenerator)
# print("type(carSeriesDocList)=%s" % type(carSeriesDocList))
# print("carSeriesDocList=%s" % carSeriesDocList)
carSeriesDocListLen = len(carSeriesDocList)
# print("carSeriesDocListLen=%s" % carSeriesDocListLen)
for curSeriesIdx, eachCarSeriesDoc in enumerate(carSeriesDocList):
print("%s [%d] %s" % ('-'*30, curSeriesIdx, '-'*30))
# print("[%d] eachCarSeriesDoc=%s" % (curSeriesIdx, eachCarSeriesDoc))
# print("type(eachCarSeriesDoc)=%s" % type(eachCarSeriesDoc)) # type(eachCarSeriesDoc)=<class 'lxml.html.HtmlElement'>
# <h4><a href="//www.autohome.com.cn/3170/#levelsource=000000000_0&pvareaid=101594">奥迪A3</a></h4>
carSeriesInfoDoc = eachCarSeriesDoc.find("h4 a")
# print("type(carSeriesInfoDoc)=%s" % type(carSeriesInfoDoc))
# print("carSeriesInfoDoc=%s" % carSeriesInfoDoc)
carSeriesName = carSeriesInfoDoc.text()
print("carSeriesName=%s" % carSeriesName)
carSeriesUrl = carSeriesInfoDoc.attr.href
print("carSeriesUrl=%s" % carSeriesUrl)
# <div>指导价:<a class="red" href="//www.autohome.com.cn/3170/price.html#pvareaid=101446">19.32-23.46万</a></div>
# guide price (厂商指导价) = manufacturer's suggested retail price (厂商建议零售价格) = MSRP
# carSeriesMsrpDoc = eachCarSeriesDoc.find("div a")
carSeriesMsrpDoc = eachCarSeriesDoc.find("div a[class='red']")
# print("carSeriesMsrpDoc=%s" % carSeriesMsrpDoc)
carSeriesMsrp = carSeriesMsrpDoc.text()
print("carSeriesMsrp=%s" % carSeriesMsrp)
carSeriesMsrpUrl = carSeriesMsrpDoc.attr.href
print("carSeriesMsrpUrl=%s" % carSeriesMsrpUrl)
carSeriesDict = {
"carBrandName": carBrandName,
"carBrandLogoUrl": carBrandLogoUrl,
"carMerchantName": carMerchantName,
"carMerchantUrl": carMerchantUrl,
"carSeriesName": carSeriesName,
"carSeriesUrl": carSeriesUrl,
"carSeriesMsrp": carSeriesMsrp,
"carSeriesMsrpUrl": carSeriesMsrpUrl,
}
# self.send_message(self.project_name, carSeriesDict, url=carSeriesUrl)
self.crawl(carSeriesUrl,
callback=self.carSeriesDetailPage,
save=carSeriesDict,
)
def on_message(self, project, msg):
print("on_message: msg=%s" % msg)
return msg
@catch_status_code_error
def carSeriesDetailPage(self, response):
carSeriesDict = response.save
print("carSeriesDict=%s" % carSeriesDict)
carSeriesUrl = response.url
print("carSeriesUrl=%s" % carSeriesUrl)
carSeriesMainImgUrl = ""
carSeriesId = ""
carSeriesLevelId = ""
carSeriesMsrp = ""
carSeriesMinPrice = ""
carSeriesMaxPrice = ""
carSeriesHtml = response.text
print("type(carSeriesHtml)=%s" % type(carSeriesHtml)) # <class 'str'>
# print("carSeriesHtml=%s" % carSeriesHtml)
foundLevelId = re.search("var\s+levelid\s+=", carSeriesHtml)
print("foundLevelId=%s" % foundLevelId)
isNewLayoutHtml = bool(foundLevelId)
print("isNewLayoutHtml=%s" % isNewLayoutHtml)
foundShowCityId = re.search("var\s+showCityId\s+=", carSeriesHtml)
print("foundShowCityId=%s" % foundShowCityId)
isOldLayoutHtml = bool(foundShowCityId)
print("isOldLayoutHtml=%s" % isOldLayoutHtml)
if isOldLayoutHtml:
# brands starting with Q
# https://www.autohome.com.cn/grade/carhtml/q.html
# ->
# 东风悦达起亚-千里马
# https://www.autohome.com.cn/142/#levelsource=000000000_0&pvareaid=101594
# other examples:
#
# 一汽丰田-花冠
# https://www.autohome.com.cn/109/#levelsource=000000000_0&pvareaid=101594
#
# 昶洧-昶洧 SUV
# https://www.autohome.com.cn/4550/#levelsource=000000000_0&pvareaid=101594
"""
<div class="car_detail " id="tab1-2">
<div class="models">
<!--年代-->
<div class="header">
<div class="car_price">
<span class="years">2005款</span>
<span class="price">指导价(停售):<strong class="red">6.28万-9.18万</strong></span>
<span class="price">二手车价格:<strong class="red"><a class='cd60000' href='//www.che168.com/china/qiya/qianlima/a0_0msdgscncgpiltocsp1exs276/?pvareaid=103693'>0.39万-1.30万</a></strong></span>
。。。
<div class="car_detail current" id="tab1-1">
<div class="models">
<!--年代-->
<div class="header">
<div class="car_price">
<span class="years">2006款</span>
<span class="price">指导价(停售):<strong class="red">7.28万-8.58万</strong></span>
。。。
"""
carDetailDivGenerator = response.doc("div[class^='car_detail']").items()
print("carDetailDivGenerator=%s" % carDetailDivGenerator)
carDetailDivList = list(carDetailDivGenerator)
print("carDetailDivList=%s" % carDetailDivList)
for curDivIdx, curCarDetailDoc in enumerate(carDetailDivList):
print("%s [%d] %s" % ('#'*30, curDivIdx, '#'*30))
curCarModelGroupDict = copy.deepcopy(carSeriesDict)
# <span class="years">2006款</span>
modelYearDoc = curCarDetailDoc.find("span[class='years']")
print("modelYearDoc=%s" % modelYearDoc)
carModelYear = modelYearDoc.text()
print("carModelYear=%s" % carModelYear)
curCarModelGroupDict["carModelYear"] = carModelYear
if curDivIdx == 0:
# use first car model as series: main img, msrp, ...
"""
<div class="models_info">
<dl class='models_pics'>
<dt><a href='//car.autohome.com.cn/photolist/series/2305/23796.html?pvareaid=101468'><img
src='https://car0.autoimg.cn/upload/spec/1344/t_1344388912334.jpg' width='240'
height='180' /></a></dt>
"""
# modelMainImgDocListGenerator = response.doc("div[class='models_info'] dl[class='models_pics'] dt a img").items()
# modelMainImgDocList = list(modelMainImgDocListGenerator)
# firstModelMainImgDoc = modelMainImgDocList[0]
firstModelMainImgDoc = curCarDetailDoc.find("div[class='models_info'] dl[class='models_pics'] dt a img")
firstModelMainImgUrl = firstModelMainImgDoc.attr["src"]
print("firstModelMainImgUrl=%s" % firstModelMainImgUrl)
carSeriesMainImgUrl = firstModelMainImgUrl
print("carSeriesMainImgUrl=%s" % carSeriesMainImgUrl)
curCarModelGroupDict["carSeriesMainImgUrl"] = carSeriesMainImgUrl
# <div class="car_price">
# <span class="price">指导价(停售):<strong class="red">7.28万-8.58万</strong></span>
carPriceStrongDocGenerator = curCarDetailDoc.items("div[class='car_price'] span[class='price'] strong[class='red']")
print("carPriceStrongDocGenerator=%s" % carPriceStrongDocGenerator)
carPriceStrongDocList = list(carPriceStrongDocGenerator)
# a generator object is always truthy, so test the materialized list instead
if carPriceStrongDocList:
print("carPriceStrongDocList=%s" % carPriceStrongDocList)
carPriceStrongDoc = carPriceStrongDocList[0]
print("carPriceStrongDoc=%s" % carPriceStrongDoc)
carPriceMinMax = carPriceStrongDoc.text()
print("carPriceMinMax=%s" % carPriceMinMax)
if carPriceMinMax:
foundMinMax = re.search("(?P<minPrice>[\d\.]+)万-(?P<maxPrice>[\d\.]+)万", carPriceMinMax)
print("foundMinMax=%s" % foundMinMax)
if foundMinMax:
minPrice = foundMinMax.group("minPrice")
print("minPrice=%s" % minPrice)
minPriceFloat = float(minPrice)
print("minPriceFloat=%s" % minPriceFloat)
maxPrice = foundMinMax.group("maxPrice")
print("maxPrice=%s" % maxPrice)
maxPriceFloat = float(maxPrice)
print("maxPriceFloat=%s" % maxPriceFloat)
averageMsrpPrice = (minPriceFloat + maxPriceFloat) / 2.0
print("averageMsrpPrice=%s" % averageMsrpPrice)
carSeriesMsrp = "%.2f万" % averageMsrpPrice
print("carSeriesMsrp=%s" % carSeriesMsrp)
carSeriesMinPrice = "%.2f万" % minPriceFloat
print("carSeriesMinPrice=%s" % carSeriesMinPrice)
carSeriesMaxPrice = "%.2f万" % maxPriceFloat
print("carSeriesMaxPrice=%s" % carSeriesMaxPrice)
curCarModelGroupDict["carSeriesMsrp"] = carSeriesMsrp
curCarModelGroupDict["carSeriesMinPrice"] = carSeriesMinPrice
curCarModelGroupDict["carSeriesMaxPrice"] = carSeriesMaxPrice
print("")
"""
<div class="modelswrap">
<!-- 信息 start -->
<div class="models_info">
<dl class='models_prop'>
<dt>发动机:</dt>
<dd><span>1.3L</span><span>1.6L</span></dd>
</dl>
<dl class='models_prop'>
<dt>变速箱:</dt>
<dd><span>手动</span><span>自动</span></dd>
<dt>车身结构:</dt>
<dd><span>三厢</span></dd>
</dl>
"""
# modelsPropDdList = curCarDetailDoc.find("div[class='modelswrap'] div[class='models_info'] dl[class='models_prop'] dd")
modelsPropDdGenerator = curCarDetailDoc.items("div[class='modelswrap'] div[class='models_info'] dl[class='models_prop'] dd")
print("modelsPropDdGenerator=%s" % modelsPropDdGenerator)
modelsPropDdList = list(modelsPropDdGenerator)
print("modelsPropDdList=%s" % modelsPropDdList)
engineValueDoc = modelsPropDdList[0]
print("engineValueDoc=%s" % engineValueDoc)
engineValue = engineValueDoc.text()
print("engineValue=%s" % engineValue)
gearBoxValueDoc = modelsPropDdList[1]
print("gearBoxValueDoc=%s" % gearBoxValueDoc)
gearBoxValue = gearBoxValueDoc.text()
print("gearBoxValue=%s" % gearBoxValue)
bodyStructureValueDoc = modelsPropDdList[2]
print("bodyStructureValueDoc=%s" % bodyStructureValueDoc)
bodyStructureValue = bodyStructureValueDoc.text()
print("bodyStructureValue=%s" % bodyStructureValue)
carModelGearBox = gearBoxValue
print("carModelGearBox=%s" % carModelGearBox)
curCarModelGroupDict["carModelGearBox"] = carModelGearBox # 手动自动
curCarModelGroupDict["carModelDriveType"] = ""
curCarModelGroupDict["carModelEmissionStandards"] = ""
carModelPower = engineValue
print("carModelPower=%s" % carModelPower)
curCarModelGroupDict["carModelPower"] = carModelPower
carModelGroupName = "%s %s %s" % (engineValue, gearBoxValue, bodyStructureValue)
print("carModelGroupName=%s" % carModelGroupName)
curCarModelGroupDict["carModelGroupName"] = carModelGroupName
"""
<table class='models_tab tableline' cellspacing='0' cellpadding='0' border='0'>
<tr>
<td class='name_d'>
<div class='name'><a title='2006款 1.6L MT特别版GL' href='spec/2304/'>2006款 1.6L MT特别版GL</a></div>
</td>
<td class='price_d'>
<div class='price01'>8.18万</div>
</td>
"""
modelsTrDocGenerator = curCarDetailDoc.items("table[class^='models_tab'] tr")
print("modelsTrDocGenerator=%s" % modelsTrDocGenerator)
modelsTrDocList = list(modelsTrDocGenerator)
print("modelsTrDocList=%s" % modelsTrDocList)
for curTabIdx, curModelTrDoc in enumerate(modelsTrDocList):
print("%s [%d] %s" % ('='*30, curTabIdx, '='*30))
curCarModeDict = copy.deepcopy(curCarModelGroupDict)
print("curModelTrDoc=%s" % curModelTrDoc)
nameADoc = curModelTrDoc.find("td[class='name_d'] div[class='name'] a")
print("nameADoc=%s" % nameADoc)
carModelName = nameADoc.text()
print("carModelName=%s" % carModelName)
carModelSpecUrl = nameADoc.attr["href"]
# bug -> the relative href resolves to a wrong url:
# https://www.autohome.com.cn/142/spec/2304/
# so it needs to be replaced with
# https://www.autohome.com.cn/spec/2304/
foundSpecId = re.search("spec/(?P<specId>\d+)", carModelSpecUrl)
specId = foundSpecId.group("specId")
print("specId=%s" % specId) # 2304
carModelSpecUrl = self.genSpecUrl(specId)
print("carModelSpecUrl=%s" % carModelSpecUrl)
priceDivDoc = curModelTrDoc.find("td[class='price_d'] div[class='price01']")
print("priceDivDoc=%s" % priceDivDoc)
carModelMsrp = priceDivDoc.text()
print("carModelMsrp=%s" % carModelMsrp)
if "暂无" in carModelMsrp:
carModelMsrp = ""
print("carModelMsrp=%s" % carModelMsrp)
curCarModeDict["carModelName"] = carModelName
curCarModeDict["carModelSpecUrl"] = carModelSpecUrl
curCarModeDict["carModelMsrp"] = carModelMsrp
self.send_message(self.project_name, curCarModeDict, url=carModelSpecUrl)
elif isNewLayoutHtml:
carModelDict = copy.deepcopy(carSeriesDict)
# carSeriesUrl=https://www.autohome.com.cn/2123/#levelsource=000000000_0&pvareaid=101594
foundSeriesId = re.search("www\.autohome\.com\.cn/(?P<seriesId>\d+)/", carSeriesUrl)
carSeriesId = foundSeriesId.group("seriesId")
# carSeriesId = int(carSeriesId)
print("carSeriesId=%s" % carSeriesId) # 2123
carModelDict["carSeriesId"] = carSeriesId
"""
<div class="information-pic">
<div class="pic-main">
。。。
<picture>
。。。
<img sizes="380px" width="380" height="285"
src="//car2.autoimg.cn/cardfs/product/g1/M04/0B/F0/380x285_0_q87_autohomecar__ChwFqV8YG-aACch8AAkAdoJoSYM874.jpg"
srcset="//car2.autoimg.cn/cardfs/product/g1/M04/0B/F0/380x285_0_q87_autohomecar__ChwFqV8YG-aACch8AAkAdoJoSYM874.jpg 380w, //car2.autoimg.cn/cardfs/product/g1/M04/0B/F0/760x570_0_q87_autohomecar__ChwFqV8YG-aACch8AAkAdoJoSYM874.jpg 760w">
</picture>
"""
mainImgDoc = response.doc("div[class='information-pic'] div[class='pic-main'] picture img")
print("mainImgDoc=%s" % mainImgDoc)
carSeriesMainImgUrl = mainImgDoc.attr["src"]
print("carSeriesMainImgUrl=%s" % carSeriesMainImgUrl)
carModelDict["carSeriesMainImgUrl"] = carSeriesMainImgUrl
"""
<script type="text/javascript">
。。。
var seriesid = '2123';
var seriesname='哈弗H6';
var yearid = '0';
var brandid = '181';
var levelid = '17';
var levelname='紧凑型SUV';
var fctid = '4';
var SeriesMinPrice='9.80';
var SeriesMaxPrice='14.10';
"""
infoKeyList = [
"seriesid",
# "seriesname", # has got
# "yearid", # no need
"brandid",
"levelid",
"levelname",
# "fctid", # unknown meaning
"SeriesMinPrice",
"SeriesMaxPrice",
]
InfoDict = {}
for eachInfoKey in infoKeyList:
curPattern = "var\s+%s\s*=\s*'(?P<infoValue>[^']+)'\s*;" % eachInfoKey
print("curPattern=%s" % curPattern)
foundInfo = re.search(curPattern, carSeriesHtml)
print("foundInfo=%s" % foundInfo)
# if foundInfo:
infoValue = foundInfo.group("infoValue")
print("infoValue=%s" % infoValue)
InfoDict[eachInfoKey] = infoValue
print("InfoDict=%s" % InfoDict)
# if "seriesid" in InfoDict:
carSeriesId = InfoDict["seriesid"] # 2123
carModelDict["carSeriesId"] = carSeriesId
# carModelDict["carSeriesName"] = InfoDict["seriesname"] # 哈弗H6
# if "brandid" in InfoDict:
carModelDict["carBrandId"] = InfoDict["brandid"] # 181
# if "levelid" in InfoDict:
carSeriesLevelId = InfoDict["levelid"] # 17
carModelDict["carSeriesLevelId"] = carSeriesLevelId
# if "levelname" in InfoDict:
carModelDict["carSeriesLevelName"] = InfoDict["levelname"] # 紧凑型SUV
# if "SeriesMinPrice" in InfoDict:
carSeriesMinPrice = InfoDict["SeriesMinPrice"] # 9.80
carModelDict["carSeriesMinPrice"] = carSeriesMinPrice
# if "SeriesMaxPrice" in InfoDict:
carSeriesMaxPrice = InfoDict["SeriesMaxPrice"] # 14.10
carModelDict["carSeriesMaxPrice"] = carSeriesMaxPrice
"""
<div class="series-list">
。。。
<li class="more-dropdown">
<a href="javascript:void(0);" target="_self" data-toggle="tab" class="tab-disabled" data-target="#specWrap-3">停售款 <i class="athm-iconfont athm-iconfont-arrowdown"></i></a>
<ul class="dropdown-con" id="haltList">
<li><a href="javascript:void(0);" target="_self" data-toggle="tab" data-yearid="11691">2019款</a></li>
...
<li><a href="javascript:void(0);" target="_self" data-toggle="tab" data-yearid="3100">2011款</a></li>
</ul>
</li>
"""
haltAListDoc = response.doc("li[class='more-dropdown'] ul[id='haltList'] li a").items()
print("type(haltAListDoc)=%s" % type(haltAListDoc))
print("haltAListDoc=%s" % haltAListDoc)
for curLiIdx, eachHatADoc in enumerate(haltAListDoc):
print("%s [%d] %s" % ('%'*30, curLiIdx, '%'*30))
print("eachHatADoc=%s" % eachHatADoc)
yearName = eachHatADoc.text()
print("yearName=%s" % yearName)
yearId = eachHatADoc.attr["data-yearid"]
print("yearId=%s" % yearId)
# getHaltSpecUrl = "https://www.autohome.com.cn/ashx/car/Spec_ListByYearId.ashx?seriesid=%s&syearid=%s&levelid=%s" % (carModelDict["carSeriesId"], yearId, carModelDict["carSeriesLevelId"])
if carSeriesId and carSeriesLevelId:
getHaltSpecUrl = "https://www.autohome.com.cn/ashx/car/Spec_ListByYearId.ashx?seriesid=%s&syearid=%s&levelid=%s" % (carSeriesId, yearId, carSeriesLevelId)
# https://www.autohome.com.cn/ashx/car/Spec_ListByYearId.ashx?seriesid=2123&syearid=10379&levelid=17
print("getHaltSpecUrl=%s" % getHaltSpecUrl)
self.crawl(getHaltSpecUrl,
callback=self.haltCarSpecCallback,
save=carModelDict,
)
# """
# <div class="information-summary">
# <dl class="information-price">
# ...
# <dd class="type">
# <span class="type__item">紧凑型车</span>
# """
# carLevelDoc = response.doc("div[class='information-summary'] dl[class='information-price'] dd[class='type'] span[class='type__item']").eq(0)
# print("carLevelDoc=%s" % carLevelDoc)
# carSeriesLevelName = carLevelDoc.text()
# print("carSeriesLevelName=%s" % carSeriesLevelName)
# carModelDict["carSeriesLevelName"] = carSeriesLevelName
carSeriesContentDoc = response.doc("div[class='series-content']")
print("carSeriesContentDoc=%s" % carSeriesContentDoc)
# carSpecWrapDoc = carSeriesContentDoc.find("div[class^='spec-wrap']")
# carSpecWrapDoc = carSeriesContentDoc.find("div[class^='spec-wrap active']")
carSpecWrapListDoc = carSeriesContentDoc.items("div[class^='spec-wrap']")
print("carSpecWrapListDoc=%s" % carSpecWrapListDoc)
for curSpecWrapIdx, eachSpecWrapDoc in enumerate(carSpecWrapListDoc):
print("%s [%d] %s" % ('#'*30, curSpecWrapIdx, '#'*30))
print("eachSpecWrapDoc=%s" % eachSpecWrapDoc)
"""
<!--即将上市 start-->
<div class="spec-wrap active" id="specWrap-1">
<dl class="halt-spec">
<dt>
<div class="spec-name">
<span>参数配置未公布</span>
</div>
<dl class="halt-spec">
<dt>
<div class="spec-name">
<span>1.5升 涡轮增压 169马力 国VI</span>
</div>
"""
# dlDoc = eachSpecWrapDoc.find("dl[class='']")
# dlDoc = eachSpecWrapDoc.find("dl")
dlListDoc = eachSpecWrapDoc.items("dl")
print("dlListDoc=%s" % dlListDoc)
for curDlIdx, eachDlDoc in enumerate(dlListDoc):
print("%s [%d] %s" % ('='*30, curDlIdx, '='*30))
print("eachDlDoc=%s" % eachDlDoc)
"""
<dt>
<div class="spec-name">
<span>1.5升 涡轮增压 169马力 国VI</span>
"""
dtDoc = eachDlDoc.find("dt")
print("dtDoc=%s" % dtDoc)
groupSpecNameSpanDoc = dtDoc.find("div[class='spec-name'] span")
print("groupSpecNameSpanDoc=%s" % groupSpecNameSpanDoc)
carModelGroupName = ""
if groupSpecNameSpanDoc:
carModelGroupName = groupSpecNameSpanDoc.text()
print("carModelGroupName=%s" % carModelGroupName)
carModelDict["carModelGroupName"] = carModelGroupName
# <dd data-sift1="2020款" data-sift2="国VI" data-sift3="1.5T" data-sift4="7挡双离合" class="">
ddListDoc = eachDlDoc.items("dd")
print("ddListDoc=%s" % ddListDoc)
for curDdIdx, eachDdDoc in enumerate(ddListDoc):
print("%s [%d] %s" % ('-'*30, curDdIdx, '-'*30))
curDdAttr = eachDdDoc.attr
# print("curDdAttr=%s" % curDdAttr)
carModelYear = curDdAttr["data-sift1"]
print("carModelYear=%s" % carModelYear)
carModelEmissionStandards = curDdAttr["data-sift2"]
print("carModelEmissionStandards=%s" % carModelEmissionStandards)
carModelPower = curDdAttr["data-sift3"]
print("carModelPower=%s" % carModelPower)
carModelGearBox = curDdAttr["data-sift4"]
print("carModelGearBox=%s" % carModelGearBox)
carModelDict["carModelYear"] = carModelYear
carModelDict["carModelEmissionStandards"] = carModelEmissionStandards
carModelDict["carModelPower"] = carModelPower
carModelDict["carModelGearBox"] = carModelGearBox
"""
<div class="spec-name">
<div class="name-param">
<p data-gcjid="41511" id="spec_41511">
<a href="/spec/41511/#pvareaid=3454492" class="name">2020款 1.5GDIT 自动铂金舒适版</a>
<span class="athm-badge athm-badge--grey is-plain">停产在售</span>
<span class="athm-badge athm-badge--orange">特惠</span></p>
<p><span class="type-default">前置前驱</span><span class="type-default">7挡双离合</span></p>
</div>
</div>
"""
specNameDoc = eachDdDoc.find("div[class='spec-name']")
# print("specNameDoc=%s" % specNameDoc)
specADoc = specNameDoc.find("p a[class='name']")
# print("specADoc=%s" % specADoc)
carModelName = specADoc.text()
print("carModelName=%s" % carModelName) # 2020款 1.5GDIT 自动铂金舒适版
carModelSpecUrl = specADoc.attr["href"]
print("carModelSpecUrl=%s" % carModelSpecUrl) # https://www.autohome.com.cn/spec/41511/#pvareaid=3454492
typeDefaultListDoc = specNameDoc.items("p span[class='type-default']")
print("typeDefaultListDoc=%s" % typeDefaultListDoc)
typeDefaultList = list(typeDefaultListDoc)
print("typeDefaultList=%s" % typeDefaultList)
carModelDriveType = ""
carModelGearBox = ""
if typeDefaultList:
spanTypeDefault0 = typeDefaultList[0]
print("spanTypeDefault0=%s" % spanTypeDefault0)
carModelDriveType = spanTypeDefault0.text()
print("carModelDriveType=%s" % carModelDriveType)
spanTypeDefault1 = typeDefaultList[1]
print("spanTypeDefault1=%s" % spanTypeDefault1)
carModelGearBox = spanTypeDefault1.text()
print("carModelGearBox=%s" % carModelGearBox)
carModelDict["carModelName"] = carModelName
carModelDict["carModelSpecUrl"] = carModelSpecUrl
carModelDict["carModelDriveType"] = carModelDriveType # 前置前驱
carModelDict["carModelGearBox"] = carModelGearBox # 7挡双离合
"""
<div class="spec-guidance">
<p class="guidance-price">
<span>10.40万</span>
<a href="//j.autohome.com.cn/pc/carcounter?type=1&specId=41511&pvareaid=3454617"><i class="athm-iconpng athm-iconpng-calculator"></i></a>
</p>
</div>
<div class="spec-guidance">
<p class="guidance-price">
<span><span>暂无</span></span>
"""
specGuidanceDoc = eachDdDoc.find("div[class='spec-guidance']")
# print("specGuidanceDoc=%s" % specGuidanceDoc)
guidancePriceSpanDoc = specGuidanceDoc.find("p[class='guidance-price'] span")
# print("guidancePriceSpanDoc=%s" % guidancePriceSpanDoc)
carModelMsrp = guidancePriceSpanDoc.text()
print("carModelMsrp=%s" % carModelMsrp)
if "暂无" in carModelMsrp:
carModelMsrp = ""
print("carModelMsrp=%s" % carModelMsrp)
carModelDict["carModelMsrp"] = carModelMsrp
self.send_message(self.project_name, carModelDict, url=carModelSpecUrl)
@catch_status_code_error
def haltCarSpecCallback(self, response):
carModelDict = response.save
carModelDict = copy.deepcopy(carModelDict)
print("carModelDict=%s" % carModelDict)
respJson = response.json
print("respJson=%s" % respJson)
"""
[
{
"name": "1.5升 涡轮增压 169马力",
"speclist": [
{
"specid": 36955,
"specname": "2019款 红标 1.5GDIT 自动舒适版",
"specstate": 40,
"minprice": 102000,
"maxprice": 102000,
"fueltype": 1,
"fueltypedetail": 1,
"driveform": "前置前驱",
"drivetype": "前驱",
"gearbox": "7挡双离合",
"evflag": "",
"newcarflag": "",
"subsidy": "",
"paramisshow": 1,
"videoid": 0,
"link2sc": "http://www.che168.com/china/hafu/hafuh6/7_8/",
"price2sc": "7.58万",
"price": "10.20万",
"syear": 2019
}, {
"specid": 36956,
"specname": "2019款 红标 1.5GDIT 自动都市版",
"specstate": 40,
"minprice": 109000,
"maxprice": 109000,
"fueltype": 1,
"fueltypedetail": 1,
"driveform": "前置前驱",
"drivetype": "前驱",
"gearbox": "7挡双离合",
"evflag": "",
"newcarflag": "",
"subsidy": "",
"paramisshow": 1,
"videoid": 0,
"link2sc": "",
"price2sc": "",
"price": "10.90万",
"syear": 2019
},
...
"""
if respJson:
for eachModelGroupDict in respJson:
modelGroupName = eachModelGroupDict["name"]
modelSpecList = eachModelGroupDict["speclist"]
for eachModelDict in modelSpecList:
curCarModelDict = copy.deepcopy(carModelDict)
carModelYear = "%s款" % eachModelDict["syear"]
# carModelSpecUrl = "%s/%s" % (CarSpecPrefix, eachModelDict["specid"])
carModelSpecUrl = self.genSpecUrl(eachModelDict["specid"])
curCarModelDict["carModelGroupName"] = modelGroupName
curCarModelDict["carModelYear"] = carModelYear
curCarModelDict["carModelEmissionStandards"] = ""
curCarModelDict["carModelPower"] = ""
curCarModelDict["carModelDriveType"] = eachModelDict["drivetype"]
curCarModelDict["carModelGearBox"] = eachModelDict["gearbox"]
curCarModelDict["carModelName"] = eachModelDict["specname"]
curCarModelDict["carModelSpecUrl"] = carModelSpecUrl
curCarModelDict["carModelMsrp"] = eachModelDict["price"]
self.send_message(self.project_name, curCarModelDict, url=carModelSpecUrl)
The final batch crawl produced:
- Total URLs run by PySpider: 9368
- Total car model records: 44278
- Excel file:
- Result screenshot


When reposting, please credit: 在路上 » [Solved] Autohome car model and series data: supporting old-style series pages