
How to Use Proxy IPs to Keep a Crawler from Being Banned

Published: 2019-06-01 16:51:34  Source: Internet

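　　While I was crawling a website at high volume, the site suddenly banned my IP and the crawler could not fetch anything further. Looking into its anti-crawler policy, I found that once a single IP's request count reaches a certain threshold, access is restricted for the rest of the day. The crawler could not stop and the job had to be finished on time, so what to do? A colleague's advice: use proxy IPs.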

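　　On that colleague's recommendation, I bought first-hand private proxy IPs from 黑洞代理IP; the next step was to use them to keep the crawler running. According to the official Python documentation, proxies can be used through the urllib.request module: its ProxyHandler class together with its build_opener and install_opener functions.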

　　The official documentation is rather formal and somewhat hard to follow. Here are the key parts:

　　class urllib.request.ProxyHandler(proxies=None)

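　　Cause requests to go through a proxy. If proxies is given, it must be a dictionary mapping protocol names to URLs of proxies. (In other words, requests are routed through a proxy; if proxies is supplied, it must be a dictionary whose keys are protocol names and whose values are the proxy URLs or proxy IP addresses.)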
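　　To make the dictionary shape concrete, here is a minimal sketch. The proxy address is a placeholder from a reserved documentation range and is an assumption, not a value from this article; substitute the host and port of a proxy you actually have. Mapping both "http" and "https" is also an assumption: a scheme that has no entry in the dictionary is not sent through the proxy.

```python
from urllib import request

# Placeholder proxy address (assumption); replace with a real host:port.
PROXY_URL = "http://203.0.113.10:8080"

# Keys are protocol names, values are the proxy URLs to use for them.
proxies = {"http": PROXY_URL, "https": PROXY_URL}

# Creating the handler does not open any network connection yet.
handler = request.ProxyHandler(proxies)
```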

　　urllib.request.build_opener([handler, ...])

　　Return an OpenerDirector instance, which chains the handlers in the order given. (That is, build_opener returns an OpenerDirector instance that chains the given handlers together in the order they were passed.)
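　　As a rough illustration, an opener built this way can also be used directly through its own open() method, in which case only requests made through that opener use the proxy. The proxy address and the User-Agent string below are placeholders chosen for this sketch, not values from the article.

```python
from urllib import request

# Placeholder proxy address (assumption); replace with a real host:port.
PROXY_URL = "http://203.0.113.10:8080"

# Chain a ProxyHandler into a new OpenerDirector.
opener = request.build_opener(request.ProxyHandler({"http": PROXY_URL}))

# Headers listed here are sent with every request made through this opener.
opener.addheaders = [("User-Agent", "Mozilla/5.0")]

# With a reachable proxy, the request below would be routed through it:
# html = opener.open("http://example.com/", timeout=10).read()
```

　　Calling opener.open() keeps the proxy local to this one opener, which can be convenient when different requests need different proxies.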

　　urllib.request.install_opener(opener)

　　Install an OpenerDirector instance as the default global opener. (That is, install_opener makes the given OpenerDirector instance the default global opener, the one that urlopen uses.)

　　If that still feels foggy, put the steps in order and it turns out to be quite simple:

　　1. Pass the proxy IP and its protocol to ProxyHandler and assign the result to an opener_support variable;

　　2. Pass opener_support to build_opener to create an opener;

　　3. Install the opener.
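　　The three steps above can be sketched roughly as follows. The helper names install_proxy_opener and fetch, the timeout default, and the choice to map both "http" and "https" are assumptions made for illustration; they are not the article's own function.

```python
from urllib import request


def install_proxy_opener(proxy_url):
    """Route later urlopen() calls through proxy_url (hypothetical helper)."""
    # Step 1: load the proxy address and its protocols into a ProxyHandler.
    opener_support = request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    # Step 2: build an opener from the handler.
    opener = request.build_opener(opener_support)
    # Step 3: install it as the process-wide default opener.
    request.install_opener(opener)
    return opener


def fetch(url, headers=None, timeout=10):
    """Fetch url with the currently installed opener (hypothetical helper)."""
    req = request.Request(url, headers=headers or {})
    with request.urlopen(req, timeout=timeout) as resp:
        return resp.read()
```

　　After install_proxy_opener("http://203.0.113.10:8080") has been called with a working proxy in place of that placeholder address, every later urlopen() call in the process, including the one inside fetch, goes through the proxy.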

　　The concrete code is as follows:

　　from urllib import request

　　def ProxySpider(url, proxy_ip, header):

　　    opener_support = request.ProxyHandler({'http': proxy_ip})