Rewriting the jandan.net (煎蛋网) image spider with scrapy-splash
This counts as V3 of the jandan.net spider.

PhantomJS tends to hang. The usual explanation online is a memory leak, though I am not sure; it may just be that I was using it wrong. So I switched to Splash + Scrapy and rewrote the spider from scratch. My system is a Mac, and this version feels quite fast.
Installing Splash

On Mac:
```bash
brew install docker
docker pull scrapinghub/splash
```
On Ubuntu:
```bash
sudo apt-get install docker.io
sudo docker pull scrapinghub/splash
```
Install the scrapy-splash extension for Python:

```bash
pip3 install scrapy-splash
```
Starting the Splash service

Expose the Splash service on local ports 8050 and 8051:
```bash
sudo docker run -p 8050:8050 -p 8051:8051 scrapinghub/splash
```
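To confirm the container is actually serving requests, one quick check (a minimal sketch, assuming the default port 8050) is to hit the render.html endpoint from Python:

```python
import requests

# A 200 response with rendered HTML means the Splash container is reachable.
resp = requests.get('http://localhost:8050/render.html',
                    params={'url': 'http://example.com', 'timeout': 10})
print(resp.status_code)
```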
Two example snippets:

Code 1:
```python
import requests
from scrapy.selector import Selector

splash_url = 'http://localhost:8050/render.html'
args = {'url': 'http://quotes.toscrape.com/js/', 'timeout': 5, 'images': 0}
response = requests.get(splash_url, params=args)

sel = Selector(response)
print(sel.css('div.quote span.text::text').extract())
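render.html returns the fully rendered HTML of the target page. The parameters used above are standard Splash render arguments: url is the page to render, timeout caps the rendering time in seconds, and images=0 tells Splash to skip downloading images, which speeds rendering up noticeably.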
Code 2:
```python
import requests
import json

lua_script = '''
function main(splash)
    splash:go("http://example.com")
    splash:wait(0.5)
    local title = splash:evaljs("document.title")
    return {title=title}
end
'''

splash_url = 'http://localhost:8050/execute'
headers = {'content-type': 'application/json'}
data = json.dumps({'lua_source': lua_script})
response = requests.post(splash_url, headers=headers, data=data)

print(response.content)
print(response.json())
```
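The /execute endpoint runs a Lua script whose main(splash) function drives the headless browser. Values can be passed into the script through splash.args; the sketch below is my own variation, assuming a hypothetical page_url argument and returning the rendered HTML instead of just the title:

```python
import json
import requests

lua_script = '''
function main(splash)
    -- splash.args holds every argument sent with the HTTP request
    splash:go(splash.args.page_url)
    splash:wait(0.5)
    return {html = splash:html()}
end
'''

data = json.dumps({'lua_source': lua_script, 'page_url': 'http://example.com'})
response = requests.post('http://localhost:8050/execute',
                         headers={'content-type': 'application/json'},
                         data=data)
print(response.json()['html'][:200])
```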
Building the jandan spider

1. Create a new Scrapy project:
```bash
scrapy startproject jiandan
```
2. Edit the project settings and add the following:
```python
USER_AGENT = "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6"

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

IMAGES_STORE = 'jd_img'
```
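If Scrapy's HTTP cache is enabled, the scrapy-splash README also recommends a Splash-aware cache storage; optional, but it belongs alongside the settings above:

```python
# Only needed if HTTPCACHE_ENABLED is used; keeps cached Splash responses distinct.
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```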
3. Generate the spider.
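The command itself did not survive in this copy; it was presumably something along these lines, with the spider name and domain taken from the code in the next step:

```bash
scrapy genspider jdimg jandan.net
```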
4. Write the spider code:
```python
import scrapy
from scrapy_splash import SplashRequest


class JdimgSpider(scrapy.Spider):
    name = 'jdimg'
    allowed_domains = ['jandan.net']
    start_urls = ["http://jandan.net/ooxx/page-{}#comments".format(str(i)) for i in range(1, 391)]

    def start_requests(self):
        for url in self.start_urls:
            print('Parsing: ', url)
            yield SplashRequest(url, args={'images': 1, 'timeout': 90})

    def parse(self, response):
        item = {}
        item['image_urls'] = []
        for url in response.css('a.view_img_link::attr(href)').extract():
            url = 'http:' + url
            item['image_urls'].append(url)
        yield item
```
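The stock ImagesPipeline picks up the image_urls field of each yielded item and saves the files under IMAGES_STORE, naming them with a hash of the URL. If the original filenames are preferred, a small subclass can override file_path; this sketch is not part of the original post, and it would need to replace the stock pipeline in ITEM_PIPELINES:

```python
from urllib.parse import urlparse

from scrapy.pipelines.images import ImagesPipeline


class JdImagesPipeline(ImagesPipeline):
    """Save each image under its original filename instead of a URL hash."""

    def file_path(self, request, response=None, info=None, *args, **kwargs):
        # e.g. http://.../large/abc123.jpg -> abc123.jpg
        return urlparse(request.url).path.split('/')[-1]
```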
5. Check for syntax errors.
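The check command also did not survive here; either of the following is a reasonable quick sanity check (my suggestion, not necessarily what was in the original), since both import every spider module:

```bash
scrapy list
scrapy check jdimg
```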
6. Run the spider:

```bash
scrapy crawl jdimg
```

or

```bash
python3 -m scrapy crawl jdimg
```
Summary:
SplashRequest(url, args={'images': 1, 'timeout': 90})

Set timeout as close to the maximum of 90 as possible; if it is too small, the page data often fails to come back.
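By default Splash rejects timeout values above 90 seconds; if a larger value is ever needed, the cap can be raised when starting the container with Splash's --max-timeout option:

```bash
sudo docker run -p 8050:8050 -p 8051:8051 scrapinghub/splash --max-timeout 300
```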
In case I forget the selector syntax again, a few variants for reference (the first and third grab the href, the others grab the text content):

```python
response.css('a.view_img_link::attr(href)').extract()
response.css('a.view_img_link::text').extract()
response.xpath('//a[@class="view_img_link"]/@href').extract()
response.xpath('//a[@class="view_img_link"]/text()').extract()
response.xpath('string(//a[@class="view_img_link"])').extract()
```
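A handy way to experiment with these selectors is scrapy shell, keeping in mind that a plain shell fetch does not run JavaScript, so a Splash-rendered page may expose more links than the shell sees:

```bash
scrapy shell 'http://jandan.net/ooxx/page-1#comments'
# then, at the prompt:
# >>> response.css('a.view_img_link::attr(href)').extract()
```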
That's it for now.