
[Solved] Autohome (汽车之家) car model and series data: supporting the new series page layout

Category: Data · Author: crifan
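Background:
[Unsolved] Crawling detailed car model and series data from Autohome with Python
While working on that, and after some debugging, the crawler now supports the new layout of the series detail page.
The details are as follows.
Starting from the entry page: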
https://www.autohome.com.cn/car/
investigation showed that every car series can be reached from the per-letter index pages, which run from:
https://www.autohome.com.cn/grade/carhtml/a.html
to:
https://www.autohome.com.cn/grade/carhtml/z.html
which generalizes to:
https://www.autohome.com.cn/grade/carhtml/%s.html
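
As a minimal sketch, the 26 per-letter index URLs can be generated from that template as shown below. The helper name `build_letter_page_urls` is made up for illustration and does not appear in the original crawler.

```python
import string

# URL template for the per-letter index pages, taken from the text above
LETTER_PAGE_TEMPLATE = "https://www.autohome.com.cn/grade/carhtml/%s.html"


def build_letter_page_urls():
    """Return (initial, url) pairs for the pages a.html through z.html."""
    return [
        (letter.upper(), LETTER_PAGE_TEMPLATE % letter)
        for letter in string.ascii_lowercase
    ]


letter_pages = build_letter_page_urls()
```

Each pair carries the upper-case initial next to its URL, which mirrors how the full code below passes the initial along with each request.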
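For each of these pages:
extract some basic data for every car series:
which corresponds to:
then follow the link to the series home page (which is the series detail page):
【奥迪A3】奥迪_奥迪A3报价_奥迪A3图片_汽车之家
Most of the information shown in the screenshot above has already been captured at this point,
so only the remaining fields need to be scraped,
plus some related content, which here is:

and, from the JavaScript in the page source:
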
        <script type="text/javascript">
            ...
            var seriesid = '2123';
            var seriesname='哈弗H6';
            var yearid = '0';
            var brandid = '181';
            var levelid = '17';
            var levelname='紧凑型SUV';
            var fctid = '4';
            var SeriesMinPrice='9.80';
            var SeriesMaxPrice='14.10';

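The extracted values are:
  • carSeriesId: 2123
  • carBrandId: 181
  • carSeriesLevelId: 17
    • used as a parameter in the later request for discontinued-model data
  • carSeriesLevelName: 紧凑型SUV
  • carSeriesMinPrice: 9.80
    • unit: 万元 (10,000 CNY)
  • carSeriesMaxPrice: 14.10
    • unit: 万元 (10,000 CNY)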
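
The extraction of these inline JavaScript variables can be sketched with a regular expression, as below. This is an illustrative stand-alone version, not the exact code from the crawler: the function name `extract_js_vars` is hypothetical, and unlike the full code further down it simply skips variables that are missing from the page.

```python
import re

# Sample of the inline JavaScript from a series page (copied from the text above)
SAMPLE_JS = """
var seriesid = '2123';
var seriesname='哈弗H6';
var levelid = '17';
var levelname='紧凑型SUV';
var SeriesMinPrice='9.80';
"""


def extract_js_vars(html, keys):
    """Collect `var <key> = '<value>';` assignments; keys not found are skipped."""
    found = {}
    for key in keys:
        pattern = r"var\s+%s\s*=\s*'(?P<value>[^']*)'\s*;" % re.escape(key)
        match = re.search(pattern, html)
        if match:
            found[key] = match.group("value")
    return found


info = extract_js_vars(SAMPLE_JS, ["seriesid", "levelid", "levelname", "yearid"])
```

Skipping missing keys means an old-layout page yields an empty or partial dict instead of raising an exception, so the caller can decide how to handle it.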
The following also needs to be scraped.
Relevant HTML:
            <!--即将上市 start-->
            <div class="spec-wrap  active" id="specWrap-1">
                
                <dl class="halt-spec">
                    <dt>
                        <div class="spec-name">
                            <span>参数配置未公布</span>
                        </div>


            <dl class="halt-spec">
                <dt>
                    <div class="spec-name">
                        <span>1.5升 涡轮增压 169马力 国VI</span>
                    </div>
Parsed result:
  • carModelGroupName: 1.5升 涡轮增压 169马力 国VI
Relevant HTML:
<dd data-sift1="2020款" data-sift2="国VI" data-sift3="1.5T" data-sift4="7挡双离合" class="">
Parsed result:
  • carModelYear:2020款
  • carModelEmissionStandards:国VI
  • carModelPower:1.5T
  • carModelGearBox:7挡双离合
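
The crawler itself reads these attributes through PyQuery (via pyspider's `response.doc`). As a dependency-free sketch of the same mapping, the snippet below uses the standard library's `html.parser` instead. The class name `SiftAttrParser` and the `SIFT_FIELDS` table are illustrative names, not part of the original code.

```python
from html.parser import HTMLParser

# Mapping from the data-sift* attributes to the field names used in this post
SIFT_FIELDS = {
    "data-sift1": "carModelYear",
    "data-sift2": "carModelEmissionStandards",
    "data-sift3": "carModelPower",
    "data-sift4": "carModelGearBox",
}


class SiftAttrParser(HTMLParser):
    """Collect one dict per <dd> tag that carries data-sift* attributes."""

    def __init__(self):
        super().__init__()
        self.models = []

    def handle_starttag(self, tag, attrs):
        if tag != "dd":
            return
        attr_dict = dict(attrs)
        if "data-sift1" not in attr_dict:
            return
        # Missing or empty attributes become empty strings
        self.models.append(
            {field: attr_dict.get(name) or "" for name, field in SIFT_FIELDS.items()}
        )


sample_dd = '<dd data-sift1="2020款" data-sift2="国VI" data-sift3="1.5T" data-sift4="7挡双离合" class="">'
sift_parser = SiftAttrParser()
sift_parser.feed(sample_dd)
```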
Relevant HTML:
                    <div class="spec-name">
                        <div class="name-param">
                            <p data-gcjid="41511" id="spec_41511">
                                <a href="/spec/41511/#pvareaid=3454492" class="name">2020款 1.5GDIT 自动铂金舒适版</a>
                                <span class="athm-badge athm-badge--grey is-plain">停产在售</span>
                            <span class="athm-badge athm-badge--orange">特惠</span></p>
                            <p><span class="type-default">前置前驱</span><span class="type-default">7挡双离合</span></p>
                        </div>
                    </div>
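Parsed result:
  • carModelName: 2020款 1.5GDIT 自动铂金舒适版
  • carModelSpecUrl: https://www.autohome.com.cn/spec/41511/#pvareaid=3454492
  • carModelDriveType: 前置前驱
  • carModelGearBox: 7挡双离合
Relevant HTML: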
                    <div class="spec-guidance">
                        <p class="guidance-price">
                            <span>10.40万</span>
                            <a href="//j.autohome.com.cn/pc/carcounter?type=1&specId=41511&pvareaid=3454617"><i class="athm-iconpng athm-iconpng-calculator"></i></a>
                        </p>
                    </div>


                    <div class="spec-guidance">
                        <p class="guidance-price">
                            <span><span>暂无</span></span>
Parsed result:
  • carModelMsrp: 10.40万
    • or 暂无 ("not available"), which is then replaced with an empty string
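
The price clean-up described above can be sketched as follows. `normalize_msrp` mirrors the rule from the text. `msrp_to_yuan` is an optional extra that the original crawler does not perform: it assumes the price string always ends in 万 (10,000), which matches the samples in this post but is otherwise an assumption. Both function names are made up for illustration.

```python
from decimal import Decimal, InvalidOperation


def normalize_msrp(raw_text):
    """Return the price text, or "" when the page shows the 暂无 placeholder."""
    text = raw_text.strip()
    return "" if "暂无" in text else text


def msrp_to_yuan(msrp_text):
    """Convert e.g. '10.40万' to 104000 (yuan); return None if it cannot be parsed."""
    if not msrp_text.endswith("万"):
        return None
    try:
        # Decimal avoids float rounding when scaling by 10,000
        return int(Decimal(msrp_text[:-1]) * 10000)
    except InvalidOperation:
        return None


cleaned = normalize_msrp(" 10.40万 ")
missing = normalize_msrp("暂无")
```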
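Discontinued models (停售款) also need to be scraped.
Investigation showed that the data for discontinued models is fetched with a separate request that returns JSON.
Example request: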
https://www.autohome.com.cn/ashx/car/Spec_ListByYearId.ashx?seriesid=2123&syearid=10379&levelid=17
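
A small sketch of how such a request URL can be assembled from the three parameters. The endpoint and parameter names come from the example above; the helper name `build_halt_spec_url` is hypothetical, and `urlencode` is used here in place of the plain string formatting in the full code.

```python
from urllib.parse import urlencode

# Endpoint taken from the example request above
HALT_SPEC_ENDPOINT = "https://www.autohome.com.cn/ashx/car/Spec_ListByYearId.ashx"


def build_halt_spec_url(series_id, year_id, level_id):
    """Build the URL that returns the discontinued models of one model year."""
    query = urlencode({
        "seriesid": series_id,
        "syearid": year_id,
        "levelid": level_id,
    })
    return "%s?%s" % (HALT_SPEC_ENDPOINT, query)


example_url = build_halt_spec_url("2123", "10379", "17")
```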
The returned JSON is quite comprehensive.
For example:
[
    {
        "name": "1.5升 涡轮增压 169马力",
        "speclist": [
            {
                "specid": 36955,
                "specname": "2019款 红标 1.5GDIT 自动舒适版",
                "specstate": 40,
                "minprice": 102000,
                "maxprice": 102000,
                "fueltype": 1,
                "fueltypedetail": 1,
                "driveform": "前置前驱",
                "drivetype": "前驱",
                "gearbox": "7挡双离合",
                "evflag": "",
                "newcarflag": "",
                "subsidy": "",
                "paramisshow": 1,
                "videoid": 0,
                "link2sc": "http://www.che168.com/china/hafu/hafuh6/7_8/",
                "price2sc": "7.58万",
                "price": "10.20万",
                "syear": 2019
            }, {
                "specid": 36956,
                "specname": "2019款 红标 1.5GDIT 自动都市版",
                "specstate": 40,
                "minprice": 109000,
                "maxprice": 109000,
                "fueltype": 1,
                "fueltypedetail": 1,
                "driveform": "前置前驱",
                "drivetype": "前驱",
                "gearbox": "7挡双离合",
                "evflag": "",
                "newcarflag": "",
                "subsidy": "",
                "paramisshow": 1,
                "videoid": 0,
                "link2sc": "",
                "price2sc": "",
                "price": "10.90万",
                "syear": 2019
            },
...
Only some of the fields are needed.
The core logic is:
                    carModelYear = "%s款" % eachModelDict["syear"]
                    # carModelSpecUrl = "%s/%s" % (CarSpecPrefix, eachModelDict["specid"])
                    carModelSpecUrl = self.genSpecUrl(eachModelDict["specid"])


                    curCarModelDict["carModelGroupName"] = modelGroupName
                    curCarModelDict["carModelYear"] = carModelYear
                    curCarModelDict["carModelEmissionStandards"] = ""
                    curCarModelDict["carModelPower"] = ""
                    curCarModelDict["carModelDriveType"] = eachModelDict["drivetype"]
                    curCarModelDict["carModelGearBox"] = eachModelDict["gearbox"]
                    curCarModelDict["carModelName"] = eachModelDict["specname"]
                    curCarModelDict["carModelSpecUrl"] = carModelSpecUrl
                    curCarModelDict["carModelMsrp"] = eachModelDict["price"]
That is all that is needed.
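
To make the mapping easier to try outside of pyspider, here is a self-contained sketch of the same idea applied to a trimmed copy of the sample JSON. The function name `flatten_halt_specs` and the `series_fields` argument are illustrative assumptions. The spec URL format follows `genSpecUrl` in the full code.

```python
import json

# Trimmed copy of the sample response shown above (only the fields that are used)
SAMPLE_RESPONSE = """
[
  {
    "name": "1.5升 涡轮增压 169马力",
    "speclist": [
      {"specid": 36955, "specname": "2019款 红标 1.5GDIT 自动舒适版",
       "drivetype": "前驱", "gearbox": "7挡双离合", "price": "10.20万", "syear": 2019},
      {"specid": 36956, "specname": "2019款 红标 1.5GDIT 自动都市版",
       "drivetype": "前驱", "gearbox": "7挡双离合", "price": "10.90万", "syear": 2019}
    ]
  }
]
"""

SPEC_URL_TEMPLATE = "https://www.autohome.com.cn/spec/%s/"


def flatten_halt_specs(groups, series_fields):
    """Build one flat record per discontinued model on top of the shared series fields."""
    records = []
    for group in groups:
        for spec in group.get("speclist", []):
            record = dict(series_fields)  # fresh copy per model
            record.update({
                "carModelGroupName": group.get("name", ""),
                "carModelYear": "%s款" % spec["syear"],
                "carModelEmissionStandards": "",  # left empty, as in the logic above
                "carModelPower": "",  # left empty, as in the logic above
                "carModelDriveType": spec.get("drivetype", ""),
                "carModelGearBox": spec.get("gearbox", ""),
                "carModelName": spec.get("specname", ""),
                "carModelSpecUrl": SPEC_URL_TEMPLATE % spec["specid"],
                "carModelMsrp": spec.get("price", ""),
            })
            records.append(record)
    return records


halt_records = flatten_halt_specs(json.loads(SAMPLE_RESPONSE), {"carSeriesId": "2123"})
```

Copying the shared fields for every model keeps the records independent of each other, which is the same reason the full code calls `copy.deepcopy` inside its loop.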
The complete code is as follows:
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2020-08-19 22:19:20
# Project: autohome_20200819


import string
import re
import copy


from lxml import etree


from pyspider.libs.base_handler import *


AutohomeHost = "https://www.autohome.com.cn"
CarSpecPrefix = "%s/spec" % AutohomeHost # "https://www.autohome.com.cn/spec/%s/"


class Handler(BaseHandler):
    UserAgent_Mac_Chrome = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"
    crawl_config = {
        "headers": {
            "User-Agent": UserAgent_Mac_Chrome,
        }
    }


    def genSpecUrl(self, specId):
        # return "%s/%s" % (CarSpecPrefix, specId)
        return "%s/%s/" % (CarSpecPrefix, specId)


    # @every(minutes=24 * 60)
    def on_start(self):
        # autohomeEntryUrl = "https://www.autohome.com.cn/car/"
        # self.crawl(autohomeEntryUrl, callback=self.carBrandListCallback)
        for eachLetter in list(string.ascii_lowercase):
            letterUpper = eachLetter.upper()
            # # for debug
            # letterUpper = "H"
            print("letterUpper=%s" % letterUpper)
            self.crawl("https://www.autohome.com.cn/grade/carhtml/%s.html" % eachLetter,
                save={"initials": letterUpper},
                callback=self.gradCarHtmlPage)


    @catch_status_code_error
    def gradCarHtmlPage(self, response):
        print("gradCarHtmlPage: response=", response)


        # picSeriesItemList = response.doc('.rank-list-ul li div a[href*="/pic/series"]').items()
        # print("picSeriesItemList=", picSeriesItemList)
        # print("len(picSeriesItemList)=%s"%(len(picSeriesItemList)))
        # for each in picSeriesItemList:
        #     self.crawl(each.attr.href, callback=self.picSeriesPage)


        saveDict = response.save
        print("saveDict=", saveDict)
        initials = saveDict["initials"]
        print("initials=", initials)
        respText = response.text
        # print("respText=", respText)


        """
        <dl id="33" olr="6">
            <dt><a href="//car.autohome.com.cn/price/brand-33.html#pvareaid=2042362"><img width="50" height="50"
                    src="//car2.autoimg.cn/cardfs/series/g26/M0B/AE/B3/100x100_f40_autohomecar__wKgHEVs9u5WAV441AAAKdxZGE4U148.png"></a>
                <div><a href="//car.autohome.com.cn/price/brand-33.html#pvareaid=2042362">奥迪</a></div>
            </dt>
        """
        # brandDoc = response.doc('dl dt')
        # print("brandDoc=%s" % brandDoc)
        # brandListDoc = response.doc('dl[id and orl] dt')
        # dlListDoc = response.doc('dl[id and orl]').items()
        # dlListDoc = response.doc("dl[id*=''][orl*='']").items()
        # dlListDoc = response.doc("dl[orl*='']").items()
        # dlListDoc = response.doc("dl").items()
        # dlListDoc = response.doc("dl:regex(id, \d+)").items()
        # dlListDoc = response.doc("dl:regex(id,[0-9])").items()
        # dlListDoc = response.doc("dl[id]").items()
        dlListDoc = response.doc("dl[olr]").items()
        print("type(dlListDoc)=%s" % type(dlListDoc))
        dlList = list(dlListDoc)
        print("len(dlList)=%s" % len(dlList))
        print("dlList=%s" % dlList)
        for curBrandIdx, eachDlDoc in enumerate(dlList):
            print("%s [%d] %s" % ('#'*30, curBrandIdx, '#'*30))


            dtDoc = eachDlDoc.find("dt")
            # print("dtDoc=%s" % dtDoc)
            # <a href="//car.autohome.com.cn/price/brand-33.html#pvareaid=2042362"><img width="50" height="50" src="//car2.autoimg.cn/cardfs/series/g26/M0B/AE/B3/100x100_f40_autohomecar__wKgHEVs9u5WAV441AAAKdxZGE4U148.png"></a>
            brandLogoDoc = dtDoc.find('a img')
            # print("brandLogoDoc=%s" % brandLogoDoc)
            carBrandLogoUrl = brandLogoDoc.attr["src"]
            print("carBrandLogoUrl=%s" % carBrandLogoUrl)
            # <div><a href="//car.autohome.com.cn/price/brand-33.html#pvareaid=2042362">奥迪</a></div>
            brandNameDoc = dtDoc.find('div a')
            # print("brandNameDoc=%s" % brandNameDoc)
            carBrandName = brandNameDoc.text()
            print("carBrandName=%s" % carBrandName)


            # <div class="h3-tit"><a href="//car.autohome.com.cn/price/brand-33-9.html#pvareaid=2042363">一汽-大众奥迪</a></div>
            # merchantDocGenerator = response.doc("dd div[class='h3-tit'] a").items()
            # ddDoc = eachDlDoc.find("dd")
            ddDoc = eachDlDoc.find("dd")
            # print("ddDoc=%s" % ddDoc)


            merchantDocGenerator = ddDoc.items("div[class='h3-tit'] a")
            merchantDocList = list(merchantDocGenerator)
            # print("merchantDocList=%s" % merchantDocList)
            merchantDocLen = len(merchantDocList)
            print("merchantDocLen=%s" % merchantDocLen)


            # <ul class="rank-list-ul" 0>
            # merchantRankDocGenerator = response.doc("dd ul[class='rank-list-ul']")
            # merchantRankDocGenerator = response.doc("dd ul[class='rank-list-ul']").items()
            merchantRankDocGenerator = ddDoc.items("ul[class='rank-list-ul']")
            merchantRankDocList = list(merchantRankDocGenerator)
            # print("merchantRankDocList=%s" % merchantRankDocList)
            merchantRankDocListLen = len(merchantRankDocList)
            print("merchantRankDocListLen=%s" % merchantRankDocListLen)


            for curIdx, merchantItem  in enumerate(merchantDocList):
            # for curIdx, merchantItem  in enumerate(merchantDocGenerator):
                # print("%s" % "="*80)
                print("%s [%d] %s" % ('='*30, curIdx, '='*30))
                # print("type(merchantItem)=%s" % type(merchantItem))
                # print("[%d] merchantItem=%s" % (curIdx, merchantItem))
                # print("[%d] merchantItem=%s" % (curIdx, merchantItem))
                carMerchantName = merchantItem.text()
                print("carMerchantName=%s" % carMerchantName)
                merchantItemAttr = merchantItem.attr
                # print("merchantItemAttr=%s" % merchantItemAttr)
                carMerchantUrl = merchantItemAttr["href"]
                print("carMerchantUrl=%s" % carMerchantUrl)


                # curSubBrandDict = {
                #     "brandName": brandName,
                #     "carBrandLogoUrl": carBrandLogoUrl,
                #     "carMerchantName": carMerchantName,
                #     "carMerchantUrl": carMerchantUrl,
                # }
                # self.send_message(self.project_name, curSubBrandDict, url=carMerchantUrl)


                merchantRankDoc = merchantRankDocList[curIdx]
                # print("merchantRankDoc=%s" % merchantRankDoc)
                # print("type(merchantRankDoc)=%s" % type(merchantRankDoc))


                # type(merchantRankDoc)=<class 'lxml.html.HtmlElement'>
                # merchantRankHtml = etree.tostring(merchantRankDoc)


                # type(merchantRankDoc)=<class 'pyquery.pyquery.PyQuery'>
                # merchantRankHtml = merchantRankDoc.html()


                # print("merchantRankHtml=%s" % merchantRankHtml)


                # <li id="s3170">
                # carSeriesDocGenerator = merchantRankDoc.find("li")
                # carSeriesDocGenerator = merchantRankDoc.find("li[id*='s']")
                carSeriesDocGenerator = merchantRankDoc.items("li[id*='s']")
                # print("type(carSeriesDocGenerator)=%s" % type(carSeriesDocGenerator))
                carSeriesDocList = list(carSeriesDocGenerator)
                # print("type(carSeriesDocList)=%s" % type(carSeriesDocList))
                # print("carSeriesDocList=%s" % carSeriesDocList)
                carSeriesDocListLen = len(carSeriesDocList)
                # print("carSeriesDocListLen=%s" % carSeriesDocListLen)
                
                for curSeriesIdx, eachCarSeriesDoc in enumerate(carSeriesDocList):
                    print("%s [%d] %s" % ('-'*30, curSeriesIdx, '-'*30))
                    # print("[%d] eachCarSeriesDoc=%s" % (curSeriesIdx, eachCarSeriesDoc))
                    # print("type(eachCarSeriesDoc)=%s" % type(eachCarSeriesDoc)) # type(eachCarSeriesDoc)=<class 'lxml.html.HtmlElement'>
                    # <h4><a href="//www.autohome.com.cn/3170/#levelsource=000000000_0&pvareaid=101594">奥迪A3</a></h4>
                    carSeriesInfoDoc = eachCarSeriesDoc.find("h4 a")
                    # print("type(carSeriesInfoDoc)=%s" % type(carSeriesInfoDoc))
                    # print("carSeriesInfoDoc=%s" % carSeriesInfoDoc)
                    carSeriesName = carSeriesInfoDoc.text()
                    print("carSeriesName=%s" % carSeriesName)
                    carSeriesUrl = carSeriesInfoDoc.attr.href
                    print("carSeriesUrl=%s" % carSeriesUrl)


                    # <div>指导价:<a class="red" href="//www.autohome.com.cn/3170/price.html#pvareaid=101446">19.32-23.46万</a></div>
                    # guide price = manufacturer's suggested retail price = MSRP
                    # carSeriesMsrpDoc = eachCarSeriesDoc.find("div a")
                    carSeriesMsrpDoc = eachCarSeriesDoc.find("div a[class='red']")
                    # print("carSeriesMsrpDoc=%s" % carSeriesMsrpDoc)
                    carSeriesMsrp = carSeriesMsrpDoc.text()
                    print("carSeriesMsrp=%s" % carSeriesMsrp)
                    carSeriesMsrpUrl = carSeriesMsrpDoc.attr.href
                    print("carSeriesMsrpUrl=%s" % carSeriesMsrpUrl)


                    carSeriesDict = {
                        "carBrandName": carBrandName,
                        "carBrandLogoUrl": carBrandLogoUrl,
                        "carMerchantName": carMerchantName,
                        "carMerchantUrl": carMerchantUrl,
                        "carSeriesName": carSeriesName,
                        "carSeriesUrl": carSeriesUrl,
                        "carSeriesMsrp": carSeriesMsrp,
                        "carSeriesMsrpUrl": carSeriesMsrpUrl,
                    }
                    # self.send_message(self.project_name, carSeriesDict, url=carSeriesUrl)
                    self.crawl(carSeriesUrl,
                        callback=self.carSeriesDetailPage,
                        save=carSeriesDict,
                    )


    def on_message(self, project, msg):
        print("on_message: msg=%s" % msg)
        return msg


    @catch_status_code_error
    def carSeriesDetailPage(self, response):
        carSeriesDict = response.save
        print("carSeriesDict=%s" % carSeriesDict)


        carSeriesUrl = response.url
        print("carSeriesUrl=%s" % carSeriesUrl)


        carSeriesMainImgUrl = ""
        carSeriesId = ""
        carSeriesLevelId = ""
        carSeriesMsrp = ""
        carSeriesMinPrice = ""
        carSeriesMaxPrice = ""


        carSeriesHtml = response.text
        print("type(carSeriesHtml)=%s" % type(carSeriesHtml)) # <class 'str'>
        # print("carSeriesHtml=%s" % carSeriesHtml)


        carModelDict = copy.deepcopy(carSeriesDict)


        # carSeriesUrl=https://www.autohome.com.cn/2123/#levelsource=000000000_0&pvareaid=101594
        foundSeriesId = re.search(r"www\.autohome\.com\.cn/(?P<seriesId>\d+)/", carSeriesUrl)
        carSeriesId = foundSeriesId.group("seriesId")
        # carSeriesId = int(carSeriesId)
        print("carSeriesId=%s" % carSeriesId) # 2123
        carModelDict["carSeriesId"] = carSeriesId


        """
        <div class="information-pic">
            <div class="pic-main">
            。。。
                    <picture>
                        。。。
                        <img sizes="380px" width="380" height="285"
                            src="//car2.autoimg.cn/cardfs/product/g1/M04/0B/F0/380x285_0_q87_autohomecar__ChwFqV8YG-aACch8AAkAdoJoSYM874.jpg"
                            srcset="//car2.autoimg.cn/cardfs/product/g1/M04/0B/F0/380x285_0_q87_autohomecar__ChwFqV8YG-aACch8AAkAdoJoSYM874.jpg 380w, //car2.autoimg.cn/cardfs/product/g1/M04/0B/F0/760x570_0_q87_autohomecar__ChwFqV8YG-aACch8AAkAdoJoSYM874.jpg 760w">
                    </picture>
        """
        mainImgDoc = response.doc("div[class='information-pic'] div[class='pic-main'] picture img")
        print("mainImgDoc=%s" % mainImgDoc)
        carSeriesMainImgUrl = mainImgDoc.attr["src"]
        print("carSeriesMainImgUrl=%s" % carSeriesMainImgUrl)
        carModelDict["carSeriesMainImgUrl"] = carSeriesMainImgUrl


        """
        <script type="text/javascript">
            。。。
            var seriesid = '2123';
            var seriesname='哈弗H6';
            var yearid = '0';
            var brandid = '181';
            var levelid = '17';
            var levelname='紧凑型SUV';
            var fctid = '4';
            var SeriesMinPrice='9.80';
            var SeriesMaxPrice='14.10';
        """


        infoKeyList = [
            "seriesid",
            # "seriesname", # has got
            # "yearid", # no need
            "brandid",
            "levelid",
            "levelname",
            # "fctid", # unknown meaning
            "SeriesMinPrice",
            "SeriesMaxPrice",
        ]
        InfoDict = {}
        for eachInfoKey in infoKeyList:
            curPattern = r"var\s+%s\s*=\s*'(?P<infoValue>[^']+)'\s*;" % eachInfoKey
            print("curPattern=%s" % curPattern)
            foundInfo = re.search(curPattern, carSeriesHtml)
            print("foundInfo=%s" % foundInfo)
            # if foundInfo:
            infoValue = foundInfo.group("infoValue")
            print("infoValue=%s" % infoValue)
            InfoDict[eachInfoKey] = infoValue
        print("InfoDict=%s" % InfoDict)


        # if "seriesid" in InfoDict:
        carSeriesId = InfoDict["seriesid"] # 2123
        carModelDict["carSeriesId"] = carSeriesId
        # carModelDict["carSeriesName"] = InfoDict["seriesname"] # 哈弗H6
        # if "brandid" in InfoDict:
        carModelDict["carBrandId"] = InfoDict["brandid"] # 181
        # if "levelid" in InfoDict:
        carSeriesLevelId = InfoDict["levelid"] # 17
        carModelDict["carSeriesLevelId"] = carSeriesLevelId
        # if "levelname" in InfoDict:
        carModelDict["carSeriesLevelName"] = InfoDict["levelname"] # 紧凑型SUV
        # if "SeriesMinPrice" in InfoDict:
        carSeriesMinPrice = InfoDict["SeriesMinPrice"] # 9.80
        carModelDict["carSeriesMinPrice"] = carSeriesMinPrice
        # if "SeriesMaxPrice" in InfoDict:
        carSeriesMaxPrice = InfoDict["SeriesMaxPrice"] # 14.10
        carModelDict["carSeriesMaxPrice"] = carSeriesMaxPrice


        """
        <div class="series-list">
        。。。
            <li class="more-dropdown">
                <a href="javascript:void(0);" target="_self" data-toggle="tab" class="tab-disabled" data-target="#specWrap-3">停售款 <i class="athm-iconfont athm-iconfont-arrowdown"></i></a>
                <ul class="dropdown-con" id="haltList">
                    <li><a href="javascript:void(0);" target="_self" data-toggle="tab" data-yearid="11691">2019款</a></li>
                    ...
                    <li><a href="javascript:void(0);" target="_self" data-toggle="tab" data-yearid="3100">2011款</a></li>
                </ul>
            </li>
        """
        haltAListDoc = response.doc("li[class='more-dropdown'] ul[id='haltList'] li a").items()
        print("type(haltAListDoc)=%s" % type(haltAListDoc))
        print("haltAListDoc=%s" % haltAListDoc)
        for curLiIdx, eachHatADoc in enumerate(haltAListDoc):
            print("%s [%d] %s" % ('%'*30, curLiIdx, '%'*30))
            print("eachHatADoc=%s" % eachHatADoc)
            yearName = eachHatADoc.text()
            print("yearName=%s" % yearName)
            yearId = eachHatADoc.attr["data-yearid"]
            print("yearId=%s" % yearId)


            # getHaltSpecUrl = "https://www.autohome.com.cn/ashx/car/Spec_ListByYearId.ashx?seriesid=%s&syearid=%s&levelid=%s" % (carModelDict["carSeriesId"], yearId, carModelDict["carSeriesLevelId"])
            if carSeriesId and carSeriesLevelId:
                getHaltSpecUrl = "https://www.autohome.com.cn/ashx/car/Spec_ListByYearId.ashx?seriesid=%s&syearid=%s&levelid=%s" % (carSeriesId, yearId, carSeriesLevelId)
                # https://www.autohome.com.cn/ashx/car/Spec_ListByYearId.ashx?seriesid=2123&syearid=10379&levelid=17
                print("getHaltSpecUrl=%s" % getHaltSpecUrl)
                self.crawl(getHaltSpecUrl,
                    callback=self.haltCarSpecCallback,
                    save=carModelDict,
                )


        # """
        # <div class="information-summary">
        #     <dl class="information-price">
        #         ...
        #         <dd class="type">
        #             <span class="type__item">紧凑型车</span>
        # """
        # carLevelDoc = response.doc("div[class='information-summary'] dl[class='information-price'] dd[class='type'] span[class='type__item']").eq(0)
        # print("carLevelDoc=%s" % carLevelDoc)
        # carSeriesLevelName = carLevelDoc.text()
        # print("carSeriesLevelName=%s" % carSeriesLevelName)
        # carModelDict["carSeriesLevelName"] = carSeriesLevelName


        carSeriesContentDoc = response.doc("div[class='series-content']")
        print("carSeriesContentDoc=%s" % carSeriesContentDoc)
        # carSpecWrapDoc = carSeriesContentDoc.find("div[class^='spec-wrap']")
        # carSpecWrapDoc = carSeriesContentDoc.find("div[class^='spec-wrap active']")
        carSpecWrapListDoc = carSeriesContentDoc.items("div[class^='spec-wrap']")
        print("carSpecWrapListDoc=%s" % carSpecWrapListDoc)
        for curSpecWrapIdx, eachSpecWrapDoc in enumerate(carSpecWrapListDoc):
            print("%s [%d] %s" % ('#'*30, curSpecWrapIdx, '#'*30))
            print("eachSpecWrapDoc=%s" % eachSpecWrapDoc)
            """
            <!--即将上市 start-->
            <div class="spec-wrap  active" id="specWrap-1">
                
                <dl class="halt-spec">
                    <dt>
                        <div class="spec-name">
                            <span>参数配置未公布</span>
                        </div>


            <dl class="halt-spec">
                <dt>
                    <div class="spec-name">
                        <span>1.5升 涡轮增压 169马力 国VI</span>
                    </div>
            """
            # dlDoc = eachSpecWrapDoc.find("dl[class='']")
            # dlDoc = eachSpecWrapDoc.find("dl")
            dlListDoc = eachSpecWrapDoc.items("dl")
            print("dlListDoc=%s" % dlListDoc)
            for curDlIdx, eachDlDoc in enumerate(dlListDoc):
                print("%s [%d] %s" % ('='*30, curDlIdx, '='*30))
                print("eachDlDoc=%s" % eachDlDoc)
                """
                    <dt>
                        <div class="spec-name">
                            <span>1.5升 涡轮增压 169马力 国VI</span>
                """
                dtDoc = eachDlDoc.find("dt")
                print("dtDoc=%s" % dtDoc)
                groupSpecNameSpanDoc = dtDoc.find("div[class='spec-name'] span")
                print("groupSpecNameSpanDoc=%s" % groupSpecNameSpanDoc)
                carModelGroupName = ""
                if groupSpecNameSpanDoc:
                    carModelGroupName = groupSpecNameSpanDoc.text()
                    print("carModelGroupName=%s" % carModelGroupName)
                
                carModelDict["carModelGroupName"] = carModelGroupName


                # <dd data-sift1="2020款" data-sift2="国VI" data-sift3="1.5T" data-sift4="7挡双离合" class="">
                ddListDoc = eachDlDoc.items("dd")
                print("ddListDoc=%s" % ddListDoc)
                for curDdIdx, eachDdDoc in enumerate(ddListDoc):
                    print("%s [%d] %s" % ('-'*30, curDdIdx, '-'*30))
                    curDdAttr = eachDdDoc.attr
                    # print("curDdAttr=%s" % curDdAttr)
                    carModelYear = curDdAttr["data-sift1"]
                    print("carModelYear=%s" % carModelYear)
                    carModelEmissionStandards = curDdAttr["data-sift2"]
                    print("carModelEmissionStandards=%s" % carModelEmissionStandards)
                    carModelPower = curDdAttr["data-sift3"]
                    print("carModelPower=%s" % carModelPower)
                    carModelGearBox = curDdAttr["data-sift4"]
                    print("carModelGearBox=%s" % carModelGearBox)


                    carModelDict["carModelYear"] = carModelYear
                    carModelDict["carModelEmissionStandards"] = carModelEmissionStandards
                    carModelDict["carModelPower"] = carModelPower
                    carModelDict["carModelGearBox"] = carModelGearBox


                    """
                    <div class="spec-name">
                        <div class="name-param">
                            <p data-gcjid="41511" id="spec_41511">
                                <a href="/spec/41511/#pvareaid=3454492" class="name">2020款 1.5GDIT 自动铂金舒适版</a>
                                <span class="athm-badge athm-badge--grey is-plain">停产在售</span>
                            <span class="athm-badge athm-badge--orange">特惠</span></p>
                            <p><span class="type-default">前置前驱</span><span class="type-default">7挡双离合</span></p>
                        </div>
                    </div>
                    """
                    specNameDoc = eachDdDoc.find("div[class='spec-name']")
                    # print("specNameDoc=%s" % specNameDoc)
                    specADoc = specNameDoc.find("p a[class='name']")
                    # print("specADoc=%s" % specADoc)
                    carModelName = specADoc.text()
                    print("carModelName=%s" % carModelName) # 2020款 1.5GDIT 自动铂金舒适版
                    carModelSpecUrl = specADoc.attr["href"]
                    print("carModelSpecUrl=%s" % carModelSpecUrl) # https://www.autohome.com.cn/spec/41511/#pvareaid=3454492
                    typeDefaultListDoc = specNameDoc.items("p span[class='type-default']")
                    print("typeDefaultListDoc=%s" % typeDefaultListDoc)
                    typeDefaultList = list(typeDefaultListDoc)
                    print("typeDefaultList=%s" % typeDefaultList)
                    carModelDriveType = ""
                    carModelGearBox = ""
                    if len(typeDefaultList) >= 1:
                        spanTypeDefault0 = typeDefaultList[0]
                        print("spanTypeDefault0=%s" % spanTypeDefault0)
                        carModelDriveType = spanTypeDefault0.text()
                        print("carModelDriveType=%s" % carModelDriveType)
                    if len(typeDefaultList) >= 2:
                        # guard against entries with fewer than two type-default spans
                        spanTypeDefault1 = typeDefaultList[1]
                        print("spanTypeDefault1=%s" % spanTypeDefault1)
                        carModelGearBox = spanTypeDefault1.text()
                        print("carModelGearBox=%s" % carModelGearBox)


                    carModelDict["carModelName"] = carModelName
                    carModelDict["carModelSpecUrl"] = carModelSpecUrl
                    carModelDict["carModelDriveType"] = carModelDriveType # 前置前驱
                    carModelDict["carModelGearBox"] = carModelGearBox # 7挡双离合


                    """
                    <div class="spec-guidance">
                        <p class="guidance-price">
                            <span>10.40万</span>
                            <a href="//j.autohome.com.cn/pc/carcounter?type=1&specId=41511&pvareaid=3454617"><i class="athm-iconpng athm-iconpng-calculator"></i></a>
                        </p>
                    </div>


                    <div class="spec-guidance">
                        <p class="guidance-price">
                            <span><span>暂无</span></span>
                    """
                    specGuidanceDoc = eachDdDoc.find("div[class='spec-guidance']")
                    # print("specGuidanceDoc=%s" % specGuidanceDoc)
                    guidancePriceSpanDoc = specGuidanceDoc.find("p[class='guidance-price'] span")
                    # print("guidancePriceSpanDoc=%s" % guidancePriceSpanDoc)
                    carModelMsrp = guidancePriceSpanDoc.text()
                    print("carModelMsrp=%s" % carModelMsrp)
                    if "暂无" in carModelMsrp:
                        carModelMsrp = ""
                        print("carModelMsrp=%s" % carModelMsrp)
                    carModelDict["carModelMsrp"] = carModelMsrp


                    # send a snapshot: carModelDict is reused and mutated on the next iteration
                    self.send_message(self.project_name, copy.deepcopy(carModelDict), url=carModelSpecUrl)


    @catch_status_code_error
    def haltCarSpecCallback(self, response):
        carModelDict = response.save
        carModelDict = copy.deepcopy(carModelDict)
        print("carModelDict=%s" % carModelDict)


        respJson = response.json
        print("respJson=%s" % respJson)


        """
        [
            {
                "name": "1.5升 涡轮增压 169马力",
                "speclist": [
                    {
                        "specid": 36955,
                        "specname": "2019款 红标 1.5GDIT 自动舒适版",
                        "specstate": 40,
                        "minprice": 102000,
                        "maxprice": 102000,
                        "fueltype": 1,
                        "fueltypedetail": 1,
                        "driveform": "前置前驱",
                        "drivetype": "前驱",
                        "gearbox": "7挡双离合",
                        "evflag": "",
                        "newcarflag": "",
                        "subsidy": "",
                        "paramisshow": 1,
                        "videoid": 0,
                        "link2sc": "http://www.che168.com/china/hafu/hafuh6/7_8/",
                        "price2sc": "7.58万",
                        "price": "10.20万",
                        "syear": 2019
                    }, {
                        "specid": 36956,
                        "specname": "2019款 红标 1.5GDIT 自动都市版",
                        "specstate": 40,
                        "minprice": 109000,
                        "maxprice": 109000,
                        "fueltype": 1,
                        "fueltypedetail": 1,
                        "driveform": "前置前驱",
                        "drivetype": "前驱",
                        "gearbox": "7挡双离合",
                        "evflag": "",
                        "newcarflag": "",
                        "subsidy": "",
                        "paramisshow": 1,
                        "videoid": 0,
                        "link2sc": "",
                        "price2sc": "",
                        "price": "10.90万",
                        "syear": 2019
                    },
                    ...
        """
        if respJson:
            for eachModelGroupDict in respJson:
                modelGroupName = eachModelGroupDict["name"]
                modelSpecList = eachModelGroupDict["speclist"]
                for eachModelDict in modelSpecList:
                    curCarModelDict = copy.deepcopy(carModelDict)
                    
                    carModelYear = "%s款" % eachModelDict["syear"]
                    # carModelSpecUrl = "%s/%s" % (CarSpecPrefix, eachModelDict["specid"])
                    carModelSpecUrl = self.genSpecUrl(eachModelDict["specid"])


                    curCarModelDict["carModelGroupName"] = modelGroupName
                    curCarModelDict["carModelYear"] = carModelYear
                    curCarModelDict["carModelEmissionStandards"] = ""
                    curCarModelDict["carModelPower"] = ""
                    curCarModelDict["carModelDriveType"] = eachModelDict["drivetype"]
                    curCarModelDict["carModelGearBox"] = eachModelDict["gearbox"]
                    curCarModelDict["carModelName"] = eachModelDict["specname"]
                    curCarModelDict["carModelSpecUrl"] = carModelSpecUrl
                    curCarModelDict["carModelMsrp"] = eachModelDict["price"]


                    self.send_message(self.project_name, curCarModelDict, url=carModelSpecUrl)
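With that in place, run the crawler in bulk.
(Note that old-layout series pages are not supported. You can add your own code to skip them so that they do not raise errors.)
For example: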
        foundLevelId = re.search(r"var\s+levelid\s*=", carSeriesHtml)
        print("foundLevelId=%s" % foundLevelId)
        isNewLayoutHtml = bool(foundLevelId)

        if isNewLayoutHtml:
            # new layout: put the parsing code from above here
            pass
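
The same check can be pulled out into a small helper and tried on its own, as in the sketch below. The function name `is_new_layout` is made up for illustration. Treating the presence of a `var levelid =` assignment as the marker of the new layout follows the snippet above and is a heuristic rather than a guarantee.

```python
import re


def is_new_layout(series_html):
    """Heuristic: new-layout series pages define `var levelid = ...` in inline JS."""
    return re.search(r"var\s+levelid\s*=", series_html) is not None


new_page = "<script>var seriesid = '2123'; var levelid = '17';</script>"
old_page = "<html><body>no inline variables here</body></html>"
layout_flags = (is_new_layout(new_page), is_new_layout(old_page))
```

In a callback such as `carSeriesDetailPage`, returning early when the helper reports an old layout would skip that page without raising an error.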
In the end the crawl produced:
  • Total URLs run by PySpider: 9365
  • Total car model records: 31287
    • Excel file:
      • result screenshot
Please credit the source when reposting: 在路上 » [Solved] Autohome car model and series data: supporting the new series page layout
