A Jandan.net image crawler

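Having learned Python up to this point, I feel I have hit a plateau:
1 I still don't understand Python's asynchronous programming well enough
2 My Python data structures and algorithms need work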

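I picked a few crawlers from around the web to practice on and to find my weak spots.
I came across a tutorial on Jianshu: http://www.jianshu.com/p/e30b714eca67 (its code is out of date, mainly because the image addresses are now encrypted by JS), which gave me a target to practice on.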

Also, the Jandan pages contain Bitcoin-related keywords; I don't know whether browsing them turns your machine into someone else's miner.

HTML source for an image address:

<li id="comment-3535962">
<div>
<div class="row">

<div class="author"><strong
title="防伪码:5225042d3a74838ebc0ee0e742feba6989343de1" class="">积极</strong> <br>
<small><a href="#footer" title="@回复"
onclick="document.getElementById('comment').value += &#39;@&lt;a href=&quot;//jandan.net/ooxx/page-1#comment-3535962&quot;&gt;积极&lt;/a&gt;: &#39;">@4 months ago</a></span></small>
</div>
<div class="text"><span class="righttext"><a href="//jandan.net/ooxx/page-1#comment-3535962">3535962</a></span><p><img src="//img.jandan.net/img/blank.gif" onload="jandan_load_img(this)" /><span class="img-hash">3c28QnMeyAwt3SrZqZiN8EqKYkRtaHOg3GopKMpHe3t+rjK6Vv0DQkgdTn88a1kbeSW0YJp9cUoYx8cgqxfdGB+tIBzAFl6DFAplt949Va4I14HwHxGS0A</span></p>
</div>
<div class="jandan-vote">
<span class="tucao-like-container">
<a title="圈圈/支持" href="javascript:;" class="comment-like like" data-id="3535962" data-type="pos">OO</a> [<span>12</span>]
</span>
<span class="tucao-unlike-container">
<a title="叉叉/反对" href="javascript:;" class="comment-unlike unlike" data-id="3535962" data-type="neg">XX</a> [<span>126</span>]

<a href="javascript:;" class="tucao-btn" data-id="3535962"> 吐槽 [1] </a>
</span>
</div>
</div>
</div>
</li>

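See this part? <img src="//img.jandan.net/img/blank.gif" onload="jandan_load_img(this)" /><span class="img-hash">3c28QnMeyAwt3SrZqZiN8EqKYkRtaHOg3GopKMpHe3t+rjK6Vv0DQkgdTn88a1kbeSW0YJp9cUoYx8cgqxfdGB+tIBzAFl6DFAplt949Va4I14HwHxGS0A</span> The onload event triggers a decryption function in the page's JS file, which turns the hash into the real image address.
Reproducing that decryption function in Python felt a bit too hard for me, so I used PhantomJS to handle it; the source is in Crawler V1 below.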
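As an aside, the encrypted value itself is easy to read out of the raw HTML without a browser; only the decryption needs the page's JS. Here is a minimal sketch, assuming every image sits in a `<span class="img-hash">` exactly like the sample above. The function name `extract_img_hashes` and the placeholder hashes are made up for illustration.

```python
import re

# Matches the text inside <span class="img-hash">...</span>.
IMG_HASH_PATTERN = re.compile(r'<span class="img-hash">([^<]+)</span>')


def extract_img_hashes(page_html):
    """Return every img-hash string found in the page source, in order."""
    return IMG_HASH_PATTERN.findall(page_html)


# Placeholder markup shaped like the comment above; the hashes are not real.
sample = (
    '<p><img src="//img.jandan.net/img/blank.gif" onload="jandan_load_img(this)" />'
    '<span class="img-hash">EXAMPLEHASH1</span></p>'
    '<p><img src="//img.jandan.net/img/blank.gif" onload="jandan_load_img(this)" />'
    '<span class="img-hash">EXAMPLEHASH2</span></p>'
)
print(extract_img_hashes(sample))  # ['EXAMPLEHASH1', 'EXAMPLEHASH2']
```

The hashes are still encrypted at this point, so this only helps once a decryption function is available.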

Crawler V1


import re
import requests
import os
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

from PIL import Image
from io import BytesIO

# PhantomJS page settings: resource timeout (ms), image loading, disk cache.
cap = dict(DesiredCapabilities.PHANTOMJS)
cap["phantomjs.page.settings.resourceTimeout"] = 1000
cap["phantomjs.page.settings.loadImages"] = True
cap["phantomjs.page.settings.disk-cache"] = True

# The user agent must be a plain string.
cap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0"

# webdriver.PhantomJS needs Selenium 3.x; it was removed in Selenium 4.
driver = webdriver.PhantomJS(desired_capabilities=cap)

img_count = 0


def get_img(url, referer, host, filename):
    """Download one image and save it as filename in the current directory."""
    global img_count
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36",
        'Host': host,
        'Referer': referer
    }
    print('Downloading image:', url)

    img = requests.get(url, headers=headers)
    if 'gif' in filename:
        # Re-encode GIFs through Pillow before saving.
        img = Image.open(BytesIO(img.content))
        img.save(filename)
    else:
        with open(filename, 'wb') as f:
            f.write(img.content)
    img_count += 1
    print('Downloaded', img_count, 'images')


# Captures the scheme-relative href of each "view original" link.
img_pattern = re.compile(r'(?s);"><a href="(.*?)".*?view_img_link">')

folder = "ooxx"
if not os.path.exists(folder):
    os.mkdir(folder)
os.chdir(folder)

urls = ["http://jandan.net/ooxx/page-{}#comments".format(str(i)) for i in range(1, 391)]

for url in urls:
    print('Loading page:', url)
    driver.get(url)
    # Use the page URL without its fragment as the Referer.
    referer = url.split('#')[0]
    response = driver.page_source
    img_urls = img_pattern.findall(response)
    if len(img_urls) > 0:
        for img_url in img_urls:
            # img_url looks like //host/path/name.jpg
            filename = img_url.split('/')[-1]
            img_host = img_url.split('/')[2]
            print('Downloading:', filename, '.....')
            full_url = 'http:' + img_url
            get_img(full_url, referer, img_host, filename)

driver.close()
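The string splitting above works because the decrypted links are scheme-relative (`//host/path/name.jpg`). The same pieces can also be obtained from the standard library's `urllib.parse`, which may be a little more robust if a link ever carries a query string. This is only a sketch: the helper name `describe_image` and the host `img.example.com` are made up for illustration.

```python
import posixpath
from urllib.parse import urldefrag, urljoin, urlsplit


def describe_image(page_url, img_src):
    """Resolve a scheme-relative image link against the page it came from."""
    absolute = urljoin(page_url, img_src)      # borrows the page's scheme
    parts = urlsplit(absolute)
    return {
        "url": absolute,
        "host": parts.netloc,
        "filename": posixpath.basename(parts.path),
        "referer": urldefrag(page_url).url,    # page URL without "#comments"
    }


info = describe_image(
    "http://jandan.net/ooxx/page-1#comments",
    "//img.example.com/large/example.jpg",     # placeholder link
)
print(info)
```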

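This crawler has the following shortcomings:
1 It loads pages with PhantomJS, which runs a lot of unnecessary JS and makes the crawler slow (it would be much faster if I could write the decryption function myself)
2 PhantomJS stalls partway through for reasons I have not figured out: the first time at p162, the second time at p193. I don't know how to fix this yet
3 Images are downloaded in a single process and a single thread, with no async or multithreading (this could be optimized, but I don't yet know how)
4 I did not write a decryption function for the image addresses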
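For point 2, one possible workaround (it does not explain why PhantomJS stalls) is to give each page load a time limit and retry when it fails. Selenium's `driver.set_page_load_timeout(seconds)` makes `driver.get` raise an exception when the limit is exceeded; whether PhantomJS honours it reliably is an assumption I have not verified. The retry helper below is a generic sketch, and the names `load_with_retry` and `fake_loader` are made up; the fake loader stands in for `driver.get` so the example runs without a browser.

```python
import time


def load_with_retry(load_page, url, retries=3, delay=1.0):
    """Call load_page(url); on any exception, wait and try again."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return load_page(url)
        except Exception as exc:  # e.g. a Selenium timeout
            last_error = exc
            print('Attempt', attempt, 'failed for', url, ':', exc)
            time.sleep(delay)
    raise last_error


# Stand-in for a driver that stalls twice and then succeeds.
attempts = []


def fake_loader(url):
    attempts.append(url)
    if len(attempts) < 3:
        raise RuntimeError('page load stalled')
    return '<html>ok</html>'


page = load_with_retry(fake_loader, 'http://jandan.net/ooxx/page-162', delay=0)
print(page, 'after', len(attempts), 'attempts')
```

With a real driver, `load_page` would be a small function that calls `driver.get(url)` and returns `driver.page_source`.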

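So I will take this post as a chance to brush up on my JS... and come back later with a second implementation.

Attached is the source of the JS decryption function: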

var f_2zKCnAnHzB3YYiFuffz6OoAkXddvtD8v = function(m, r, d) {
    var e = "DECODE";
    var r = r ? r: "";
    var d = d ? d: 0;
    var q = 4;
    r = md5(r);
    var o = md5(r.substr(0, 16));
    var n = md5(r.substr(16, 16));
    if (q) {
        if (e == "DECODE") {
            var l = m.substr(0, q)
        }
    } else {
        var l = ""
    }
    var c = o + md5(o + l);
    var k;
    if (e == "DECODE") {
        m = m.substr(q);
        k = base64_decode(m)
    }
    var h = new Array(256);
    for (var g = 0; g < 256; g++) {
        h[g] = g
    }
    var b = new Array();
    for (var g = 0; g < 256; g++) {
        b[g] = c.charCodeAt(g % c.length)
    }
    for (var f = g = 0; g < 256; g++) {
        f = (f + h[g] + b[g]) % 256;
        tmp = h[g];
        h[g] = h[f];
        h[f] = tmp
    }
    var t = "";
    k = k.split("");
    for (var p = f = g = 0; g < k.length; g++) {
        p = (p + 1) % 256;
        f = (f + h[p]) % 256;
        tmp = h[p];
        h[p] = h[f];
        h[f] = tmp;
        t += chr(ord(k[g]) ^ (h[(h[p] + h[f]) % 256]))
    }
    if (e == "DECODE") {
        if ((t.substr(0, 10) == 0 || t.substr(0, 10) - time() > 0) && t.substr(10, 16) == md5(t.substr(26) + n).substr(0, 16)) {
            t = t.substr(26)
        } else {
            t = ""
        }
    }
    return t
};

Crawler V2


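  • The download step is now asynchronous, and it feels noticeably faster.
  • The only remaining problem in this version is the PhantomJS part: it stalls while parsing page p134, and I don't know why yet. I will keep working on it.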

import re
import requests
import os
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

from PIL import Image
from io import BytesIO

import asyncio
import functools

# PhantomJS page settings: resource timeout (ms), image loading, disk cache.
cap = dict(DesiredCapabilities.PHANTOMJS)
cap["phantomjs.page.settings.resourceTimeout"] = 1000
cap["phantomjs.page.settings.loadImages"] = True
cap["phantomjs.page.settings.disk-cache"] = True

# The user agent must be a plain string.
cap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0"

# webdriver.PhantomJS needs Selenium 3.x; it was removed in Selenium 4.
driver = webdriver.PhantomJS(desired_capabilities=cap)

img_count = 0


async def get_img(url, referer, host, filename):
    """Download one image without blocking the other download tasks."""
    global img_count
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36",
        'Host': host,
        'Referer': referer
    }
    print('Downloading image:', url)

    # requests.get blocks, so run it in the default thread pool; awaiting
    # the executor lets the other tasks make progress in the meantime.
    loop = asyncio.get_running_loop()
    img = await loop.run_in_executor(
        None, functools.partial(requests.get, url, headers=headers))
    if 'gif' in filename:
        # Re-encode GIFs through Pillow before saving.
        img = Image.open(BytesIO(img.content))
        img.save(filename)
    else:
        with open(filename, 'wb') as f:
            f.write(img.content)
    img_count += 1
    print('Downloaded', img_count, 'images')


# Captures the scheme-relative href of each "view original" link.
img_pattern = re.compile(r'(?s);"><a href="(.*?)".*?view_img_link">')

folder = "ooxx"
if not os.path.exists(folder):
    os.mkdir(folder)
os.chdir(folder)

urls = ["http://jandan.net/ooxx/page-{}#comments".format(str(i)) for i in range(1, 391)]

loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)

for url in urls:
    print('Loading page:', url)
    driver.get(url)
    # Use the page URL without its fragment as the Referer.
    referer = url.split('#')[0]
    response = driver.page_source
    img_urls = img_pattern.findall(response)
    if len(img_urls) > 0:
        tasks = []
        for img_url in img_urls:
            filename = img_url.split('/')[-1]
            img_host = img_url.split('/')[2]
            print('Downloading:', filename, '.....')
            full_url = 'http:' + img_url
            task = loop.create_task(get_img(full_url, referer, img_host, filename))
            tasks.append(task)
        # Wait for every image on this page before loading the next page.
        loop.run_until_complete(asyncio.wait(tasks))

loop.close()
driver.close()
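One thing worth checking when mixing asyncio with a blocking library such as requests: an `async def` that calls a blocking function directly never yields to the event loop, so the tasks still run one after another. A small timing experiment makes the difference visible. This is a sketch in which `time.sleep` stands in for a slow download; all the function names are made up for illustration.

```python
import asyncio
import time


def blocking_download(name, seconds=0.2):
    """Stand-in for requests.get: blocks the calling thread."""
    time.sleep(seconds)
    return name


async def fetch_inline(name):
    # Blocks the event loop, so other tasks cannot run meanwhile.
    return blocking_download(name)


async def fetch_in_thread(name):
    # Hands the blocking call to the default thread pool and awaits it.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_download, name)


async def run_all(worker, names):
    """Run worker(name) for every name and report the elapsed time."""
    start = time.perf_counter()
    results = await asyncio.gather(*(worker(n) for n in names))
    return results, time.perf_counter() - start


names = ["a.jpg", "b.jpg", "c.jpg", "d.jpg"]
inline_results, inline_time = asyncio.run(run_all(fetch_inline, names))
thread_results, thread_time = asyncio.run(run_all(fetch_in_thread, names))
print('inline:   %.2fs' % inline_time)
print('threaded: %.2fs' % thread_time)
```

On my reading, the inline version should take roughly the sum of the sleeps and the threaded version roughly the length of one sleep. An async HTTP client would be another way to get the same effect, but that is outside what this post tries.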