[Solved] Using the Dobel (多贝云) proxy IP pool in PySpider so each request uses a different IP

Background:
[Solved] Find a good IP proxy pool to get around Dianping's anti-scraping
During that work, I had already tested the proxy IPs with code, and they indeed achieve what we want:
the IP changes on every request. (A minimal sketch of such a test follows.)
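For reference, a minimal requests-based sketch of that kind of rotation test (the loop and prints here are illustrative; the real test code is in the earlier post):

import requests

# Dobel HTTP proxy endpoint, same values as in the PySpider config below
ProxyUri = "http://xxx:xxx@http-proxy-t3.dobel.cn:9180"
proxies = {"http": ProxyUri, "https": ProxyUri}

# Through a rotating proxy, each request should report a different origin IP
for i in range(3):
    resp = requests.get("http://httpbin.org/get", proxies=proxies)
    print(i, resp.json()["origin"])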
The next step was to try them in PySpider itself,
to see whether they can bypass Dianping's anti-scraping so the data downloads normally.
First, check how PySpider's self.crawl passes a proxy parameter:
self.crawl – pyspider
“proxy
proxy server of username:password@hostname:port to use, only http proxy is supported currently.
class Handler(BaseHandler):
    crawl_config = {
        'proxy': 'localhost:8080'
    }
Handler.crawl_config can be used with proxy to set a proxy for whole project.”
So it seems it does not support the
proxies=proxies
style of argument.
But it looks like directly setting the assembled proxy address as the value should work.
Let me try it.
I used:

### Dobel (多贝云) IP proxy
# HTTP proxy gateway host and port
ProxyHost = "http-proxy-t3.dobel.cn"
ProxyPort = "9180"


# account and password
ProxyUser = "xxx"
ProxyPass = "xxx"


ProxyUri = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
        "host" : ProxyHost,
        "port" : ProxyPort,
        "user" : ProxyUser,
        "pass" : ProxyPass,
}


class Handler(BaseHandler):
    crawl_config = {
        "proxy": ProxyUri,
        ...
    }
combined with the earlier settings, keeping the existing headers:
class Handler(BaseHandler):
    crawl_config = {
        "proxy": ProxyUri,
        "connect_timeout": 100,
        "timeout": 600,
        "retries": 15,
        "headers": {
            "User-Agent": constUserAgentMacChrome,
            # "Accept": "application/json, text/javascript, */*; q=0.01",
            # "Accept": "application/json, text/javascript, text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*; q=0.8,application/signed-exchange;v=b3",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
            # "Content-Type": "application/json",
            "Origin": "http://www.dianping.com",
            "Host": "www.dianping.com",
            "Cache-Control": "no-cache",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
            "Upgrade-Insecure-Requests": "1",
            "Pragma": "no-cache",
            "Connection": "keep-alive",
            "Cookie": "_lxsdk_cuid=16a200a8282c8-055d7d49cf2d47-12316d51-fa000-16a200a8282c8; _lxsdk=16a200a8282c8-055d7d49cf2d47-12316d51-fa000-16a200a8282c8; _hc.v=f301d625-2322-83f6-b7f0-4686373a858b.1555315721; aburl=1; Hm_lvt_dbeeb675516927da776beeb1d9802bd4=1555386214; wed_user_path=33780|0; cy=509; cye=wangdu; Hm_lpvt_dbeeb675516927da776beeb1d9802bd4=1555484470; _lxsdk_s=16a29fd395b-0ac-21b-f2%7C%7C34",
            # "X-Requested-With": "XMLHttpRequest",
        }
    }
But when debugging, the previous anti-scraping page appeared again almost immediately.
So try removing the earlier cookie and related parameters.
Removed:
"Connection": "keep-alive",
"Cookie": xxx
Result: still
respUrl=https://verify.meituan.com/xxx
After opening the verification page and entering the captcha, the page could be accessed normally again.
Went back to debugging with the Dobel proxy IP, and hit an error:
[E 190422 17:56:01 base_handler:203] HTTP 599: Proxy CONNECT aborted
    Traceback (most recent call last):
      File "/Users/crifan/.local/share/virtualenvs/crawler_dianping_com-sGcMRJTS/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 196, in run_task
        result = self._run_task(task, response)
      File "/Users/crifan/.local/share/virtualenvs/crawler_dianping_com-sGcMRJTS/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 175, in _run_task
        response.raise_for_status()
      File "/Users/crifan/.local/share/virtualenvs/crawler_dianping_com-sGcMRJTS/lib/python3.6/site-packages/pyspider/libs/response.py", line 172, in raise_for_status
        six.reraise(Exception, Exception(self.error), Traceback.from_string(self.traceback).as_traceback())
      File "/Users/crifan/.local/share/virtualenvs/crawler_dianping_com-sGcMRJTS/lib/python3.6/site-packages/six.py", line 692, in reraise
        raise value.with_traceback(tb)
      File "/Users/crifan/.local/share/virtualenvs/crawler_dianping_com-sGcMRJTS/lib/python3.6/site-packages/pyspider/fetcher/tornado_fetcher.py", line 378, in http_fetch
        response = yield gen.maybe_future(self.http_client.fetch(request))
      File "/Users/crifan/.local/share/virtualenvs/crawler_dianping_com-sGcMRJTS/lib/python3.6/site-packages/tornado/httpclient.py", line 102, in fetch
        self._async_client.fetch, request, **kwargs))
      File "/Users/crifan/.local/share/virtualenvs/crawler_dianping_com-sGcMRJTS/lib/python3.6/site-packages/tornado/ioloop.py", line 458, in run_sync
        return future_cell[0].result()
      File "/Users/crifan/.local/share/virtualenvs/crawler_dianping_com-sGcMRJTS/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
        raise_exc_info(self._exc_info)
      File "<string>", line 4, in raise_exc_info
    Exception: HTTP 599: Proxy CONNECT aborted
Searched for:
多贝云 IP代理 Exception HTTP 599 Proxy CONNECT aborted
PySpider proxy Exception HTTP 599 Proxy CONNECT aborted
Postscript: it turned out this did not need solving.
I added extra test code to check whether the proxy IP was taking effect, and hit a different error:
[Solved] Error using an IP proxy in PySpider: requests.exceptions.HTTPError HTTP 404 Not Found
Then went back to the 599 problem above.
It worked again... the 599 problem was gone...
So at this point, at least, the following approach:
class Handler(BaseHandler):
    crawl_config = {
        "proxy": ProxyUri,
        "connect_timeout": 100,
        "timeout": 600,
        "retries": 15,
        ...
    }

    def testProxy(self):
        # test proxy ip
        # targetUrl = "https://api.ipify.org?format=json"
        targetUrl = "https://www.taobao.com/help/getip.php"
        self.crawl(
            targetUrl,
            # proxy=ProxyUri,
            cookies={},
            callback=self.testProxyCallback
        )
got the Dobel proxy IP working in PySpider. The result:
respUrl=https://www.taobao.com/help/getip.php
respText=ipCallback({ip:"120.11.185.123"})
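testProxyCallback itself is not shown above; a minimal sketch of what it might look like (the body is my assumption, matching the output format above):

    def testProxyCallback(self, response):
        # Log the fetched URL and the IP-echo response, so repeated runs
        # can be compared to confirm the IP really changes each time
        print("respUrl=%s" % response.url)
        print("respText=%s" % response.text)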
But over several later debugging rounds I noticed a problem:
refreshing the page while debugging did not change the request IP.
For a while I assumed this was a problem with how the proxy was being used in PySpider.
Then by chance, from:
pyspider cookies问题,用requests设置cookies成功登录微博但在pyspider中失败 – 足兆叉虫的回答 – SegmentFault 思否
I saw the IP-testing endpoint I had been meaning to try anyway:
http://httpbin.org/get
In Safari on Mac, the result is:
{
  "args": {}, 
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
    "Accept-Encoding": "gzip, deflate", 
    "Accept-Language": "zh-cn", 
    "Cookie": "_gauges_unique=1", 
    "Host": "httpbin.org", 
    "Upgrade-Insecure-Requests": "1", 
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.2 Safari/605.1.15"
  }, 
  "origin": "222.92.130.218, 222.92.130.218", 
  "url": "https://httpbin.org/get"
}
Whereas in PySpider:
    def testProxy(self):
        # test proxy ip
        # targetUrl = "https://api.ipify.org?format=json"
        # targetUrl = "https://www.taobao.com/help/getip.php"
        targetUrl = "http://httpbin.org/get"
        self.crawl(
            targetUrl,
            # proxy=ProxyUri,
            headers=self.genCurHeaders(),
            cookies={},
            callback=self.testProxyCallback
        )
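genCurHeaders() is not shown here; presumably it returns headers along the lines of the crawl_config headers earlier, keeping the cache-disabling fields and dropping Connection/Cookie. A hedged sketch:

    def genCurHeaders(self):
        # Assumed implementation: same style of headers as crawl_config above;
        # constUserAgentMacChrome is defined elsewhere in the project
        return {
            "User-Agent": constUserAgentMacChrome,
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
            "Cache-Control": "no-cache",
            "Pragma": "no-cache",
            "Upgrade-Insecure-Requests": "1",
        }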
The result: it really achieved what we wanted, a different IP on every request.
Note that the headers here already include the cache-disabling parameters:
"Cache-Control": "no-cache",
"Pragma": "no-cache",
and that this one was removed:
"Connection": "keep-alive",
-> which should make sure no cached page is returned.
-> Yet the earlier endpoint:
https://www.taobao.com/help/getip.php
kept returning the same IP on subsequent requests.
From this I infer that this Taobao API is cached somewhere, which makes it unreliable for this purpose.
By comparison:
http://httpbin.org/get
is done properly and reports accurately.
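If you do have to use an endpoint that may sit behind a cache, one common workaround (a general technique, not from the original test) is to append a throwaway timestamp parameter so every request URL is unique:

import time

# A unique query parameter makes intermediate caches treat each request as new
targetUrl = "https://www.taobao.com/help/getip.php?_t=%d" % int(time.time() * 1000)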
Later, while working on:
[Unsolved] How to break the verify.meituan.com anti-scraping verification when crawling Dianping pages
I also referred to:
python – pyspider cookies问题,用requests设置cookies成功登录微博但在pyspider中失败 – SegmentFault 思否
to confirm the PySpider version used here:
➜  crawler_dianping_com git:(master) ✗ pipenv graph
pyspider==0.3.10
  - chardet [required: >=2.2, installed: 3.0.4]
  - click [required: >=3.3, installed: 7.0]
  - cssselect [required: >=0.9, installed: 1.0.3]
  - Flask [required: >=0.10, installed: 1.0.2]
    - click [required: >=5.1, installed: 7.0]
    - itsdangerous [required: >=0.24, installed: 1.1.0]
    - Jinja2 [required: >=2.10, installed: 2.10.1]
      - MarkupSafe [required: >=0.23, installed: 1.1.1]
    - Werkzeug [required: >=0.14, installed: 0.15.2]
  - Flask-Login [required: >=0.2.11, installed: 0.4.1]
    - Flask [required: Any, installed: 1.0.2]
      - click [required: >=5.1, installed: 7.0]
      - itsdangerous [required: >=0.24, installed: 1.1.0]
      - Jinja2 [required: >=2.10, installed: 2.10.1]
        - MarkupSafe [required: >=0.23, installed: 1.1.1]
      - Werkzeug [required: >=0.14, installed: 0.15.2]
  - Jinja2 [required: >=2.7, installed: 2.10.1]
    - MarkupSafe [required: >=0.23, installed: 1.1.1]
  - lxml [required: Any, installed: 4.3.3]
  - pycurl [required: Any, installed: 7.43.0.2]
  - pyquery [required: Any, installed: 1.4.0]
    - cssselect [required: >0.7.9, installed: 1.0.3]
    - lxml [required: >=2.1, installed: 4.3.3]
  - requests [required: >=2.2, installed: 2.21.0]
    - certifi [required: >=2017.4.17, installed: 2019.3.9]
    - chardet [required: >=3.0.2,<3.1.0, installed: 3.0.4]
    - idna [required: >=2.5,<2.9, installed: 2.8]
    - urllib3 [required: >=1.21.1,<1.25, installed: 1.24.1]
  - six [required: >=1.5.0, installed: 1.12.0]
  - tblib [required: >=1.3.0, installed: 1.3.2]
  - tornado [required: >=3.2,<=4.5.3, installed: 4.5.3]
  - u-msgpack-python [required: >=1.6, installed: 2.5.1]
  - wsgidav [required: >=2.0.0, installed: 2.4.1]
    - defusedxml [required: Any, installed: 0.5.0]
    - jsmin [required: Any, installed: 2.2.2]
    - PyYAML [required: Any, installed: 5.1]
-> It is not the problematic 0.3.8 version.
-> It is the latest version, 0.3.10.
[Summary]
Once the global proxy is set in PySpider:
### Dobel (多贝云) IP proxy
# HTTP proxy gateway host and port
ProxyHost = "http-proxy-t3.dobel.cn"
ProxyPort = "9180"


# account and password
ProxyUser = "UUUU"
ProxyPass = "PPPPP"


ProxyUri = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
        "host" : ProxyHost,
        "port" : ProxyPort,
        "user" : ProxyUser,
        "pass" : ProxyPass,
}


class Handler(BaseHandler):
    crawl_config = {
        "proxy": ProxyUri,
        ...
    }
then every subsequent normal self.crawl call on a URL automatically uses this proxy.
Here ProxyUri has the form:
http://UserName:Password@http-proxy-t3.dobel.cn:9180
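Putting it together, a minimal runnable handler sketch (on_start and the callback are illustrative, not from the original project):

from pyspider.libs.base_handler import BaseHandler

class Handler(BaseHandler):
    crawl_config = {
        "proxy": ProxyUri,  # every fetch in this project goes through the proxy
    }

    def on_start(self):
        # a plain crawl now uses the proxy automatically
        self.crawl("http://httpbin.org/get", callback=self.show_ip)

    def show_ip(self, response):
        # httpbin echoes the caller's IP in the "origin" field
        return response.json.get("origin")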
[Postscript]
If the proxy account has expired,
you will see:
requests.exceptions.HTTPError: HTTP 407: Proxy Authentication Required
[E 190424 17:53:40 base_handler:203] HTTP 407: Proxy Authentication Required
    Traceback (most recent call last):
      File "/Users/crifan/.local/share/virtualenvs/crawler_dianping_com-sGcMRJTS/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 196, in run_task
        result = self._run_task(task, response)
      File "/Users/crifan/.local/share/virtualenvs/crawler_dianping_com-sGcMRJTS/lib/python3.6/site-packages/pyspider/libs/base_handler.py", line 175, in _run_task
        response.raise_for_status()
      File "/Users/crifan/.local/share/virtualenvs/crawler_dianping_com-sGcMRJTS/lib/python3.6/site-packages/pyspider/libs/response.py", line 184, in raise_for_status
        raise http_error
    requests.exceptions.HTTPError: HTTP 407: Proxy Authentication Required
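A quick way to tell an expired account apart from other proxy failures is to test the credentials outside PySpider first. A minimal sketch with requests (checkProxyAuth is a hypothetical helper, not from the original post):

import requests

def checkProxyAuth(proxyUri):
    # A 407 status (or a ProxyError on an HTTPS CONNECT) means the proxy
    # rejected the credentials, e.g. an expired account
    try:
        resp = requests.get("http://httpbin.org/get",
                            proxies={"http": proxyUri, "https": proxyUri},
                            timeout=10)
        if resp.status_code == 407:
            print("Proxy Authentication Required: check account/password")
        elif resp.ok:
            print("Proxy OK, origin IP:", resp.json().get("origin"))
        else:
            print("Unexpected status:", resp.status_code)
    except requests.exceptions.ProxyError as err:
        print("Proxy connect failed:", err)

checkProxyAuth(ProxyUri)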
