Rewriting the jandan.net image crawler with Scrapy + Splash

This is more or less version 3 of the jandan.net crawler.

PhantomJS kept hanging on me. People online blame memory leaks, but I'm not sure; it may simply be that I was using it wrong.
So I switched to Splash + Scrapy and rewrote the crawler.
I'm on a Mac, and this time it feels reasonably fast.

Install Splash

On Mac:

brew install docker
docker pull scrapinghub/splash

On Ubuntu:

sudo apt-get install docker.io  # on Ubuntu the Docker package is docker.io; follow the post-install hints if you want to run docker without sudo
sudo docker pull scrapinghub/splash

Install the scrapy-splash extension for Python

pip3 install scrapy-splash

Start the Splash service

Start the Splash service locally on ports 8050 and 8051:

sudo docker run -p 8050:8050 -p 8051:8051 scrapinghub/splash
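
Once the container is up, a quick way to check that the HTTP API is reachable is to hit the Splash web UI from Python (a minimal sketch; it only assumes port 8050 is mapped as above):

import requests

# Sanity check: the Splash web UI answers on the mapped port
# when the container started above is running.
resp = requests.get('http://localhost:8050')
print(resp.status_code)  # expect 200 if Splash is up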

Two example snippets:

Code 1:

import requests
from scrapy.selector import Selector

splash_url = 'http://localhost:8050/render.html'
# the maximum allowed timeout is 90
args = {'url': 'http://quotes.toscrape.com/js/', 'timeout': 5, 'images': 0}

response = requests.get(splash_url, params=args)
sel = Selector(text=response.text)
print(sel.css('div.quote span.text::text').extract())
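
render.html returns the page's HTML after its JavaScript has run, which is why the quotes on quotes.toscrape.com/js/ (rendered client-side) are selectable here; fetching the same URL with plain requests would return a page without them.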

Code 2:

import requests
import json

lua_script = '''
function main(splash)
    splash:go("http://example.com")
    splash:wait(0.5)
    local title = splash:evaljs("document.title")
    return {title=title}
end
'''

# the endpoint is /execute and it takes a JSON POST body
splash_url = 'http://localhost:8050/execute'
headers = {'content-type': 'application/json'}
data = json.dumps({'lua_source': lua_script})
response = requests.post(splash_url, headers=headers, data=data)
print(response.content)
print(response.json())
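
The execute endpoint also accepts extra keys in the JSON body, which show up inside the Lua script as splash.args. A minimal sketch of parameterising the script that way (the url and wait keys are names I picked for illustration):

import requests
import json

# Any extra keys in the JSON body are exposed to Lua as splash.args.<key>.
lua_script = '''
function main(splash)
    splash:go(splash.args.url)
    splash:wait(splash.args.wait)
    return {title = splash:evaljs("document.title")}
end
'''

data = json.dumps({
    'lua_source': lua_script,
    'url': 'http://example.com',  # read in Lua as splash.args.url
    'wait': 0.5,                  # read in Lua as splash.args.wait
})
response = requests.post('http://localhost:8050/execute',
                         headers={'content-type': 'application/json'},
                         data=data)
print(response.json())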

Getting started on the jandan.net crawler

1 Create a new Scrapy project

scrapy startproject jiandan

2 Edit the project settings and add the following:

# Without a User-Agent the server bans the crawler.
USER_AGENT = "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6"

SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

IMAGES_STORE = 'jd_img'
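
ImagesPipeline downloads every URL listed in an item's image_urls field and saves the files under IMAGES_STORE (it also needs Pillow, so pip3 install Pillow if it is missing). The spider below just yields plain dicts, but if you prefer a declared Item, an optional minimal sketch (the class name JiandanItem is made up here) would be:

# items.py (optional) -- the two fields that
# scrapy.pipelines.images.ImagesPipeline works with.
import scrapy


class JiandanItem(scrapy.Item):
    image_urls = scrapy.Field()  # URLs to download
    images = scrapy.Field()      # filled in by ImagesPipeline with download results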

3 Generate the spider

scrapy genspider jdimg jandan.net

4 Write the spider code

# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest


class JdimgSpider(scrapy.Spider):
    name = 'jdimg'
    allowed_domains = ['jandan.net']
    start_urls = ["http://jandan.net/ooxx/page-{}#comments".format(i) for i in range(1, 391)]

    def start_requests(self):
        for url in self.start_urls:
            print('Parsing: ', url)
            yield SplashRequest(url, args={'images': 1, 'timeout': 90})

    def parse(self, response):
        item = {}
        item['image_urls'] = []

        for url in response.css('a.view_img_link::attr(href)').extract():
            url = 'http:' + url
            item['image_urls'].append(url)

        yield item
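
SplashRequest renders each page through Splash's render.html endpoint by default and, like a normal Request, falls back to parse as the callback. The yielded dict only needs an image_urls field, since that is what the ImagesPipeline enabled in the settings above looks for when downloading.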

5 Check for syntax errors

scrapy list

6 Run the crawler

scrapy crawl jdimg

or

# I have Python 2 and 3 installed side by side, each with its own scrapy
python3 -m scrapy crawl jdimg

Summary:

SplashRequest(url, args={'images': 1, 'timeout': 90})

Set timeout to its maximum value of 90 whenever you can; if it is too small the page may not finish rendering and you won't get the data.

In case I forget the selector syntax (a quick self-check follows the list):

response.css('a.view_img_link::attr(href)').extract()
response.css('a.view_img_link::text').extract()
response.xpath('//a[@class="view_img_link"]/@href').extract()
response.xpath('//a[@class="view_img_link"]/text()').extract()
response.xpath('string(//a[@class="view_img_link"])').extract()
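
A quick way to confirm these variants return the same thing is to run them against a tiny hand-written snippet with Scrapy's Selector (a standalone sketch; the HTML and URL below are made up):

from scrapy.selector import Selector

html = '<a class="view_img_link" href="//img.example.com/demo.jpg">[view original]</a>'
sel = Selector(text=html)

print(sel.css('a.view_img_link::attr(href)').extract())           # ['//img.example.com/demo.jpg']
print(sel.css('a.view_img_link::text').extract())                 # ['[view original]']
print(sel.xpath('//a[@class="view_img_link"]/@href').extract())   # same href
print(sel.xpath('//a[@class="view_img_link"]/text()').extract())  # same text
print(sel.xpath('string(//a[@class="view_img_link"])').extract()) # ['[view original]']

Note that the href starts with //, which is why the spider prepends 'http:' before handing the URLs to ImagesPipeline.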

That's it for now.