Learning Python up to now, I feel I've hit a bottleneck:

1. My understanding of Python's asynchronous programming is still not good enough.
2. My Python data structures and algorithms need work.
To probe those weak spots, I picked a few crawler exercises off the web. A tutorial on Jianshu caught my eye: http://www.jianshu.com/p/e30b714eca67 (its code no longer works, mainly because the image addresses are now obfuscated by JS), which made it a good practice target.
As an aside, jandan.net's pages contain Bitcoin-related keywords; I wonder whether just browsing the site turns your machine into someone else's miner.
The HTML source around an image address:
```html
<li id="comment-3535962">
  <div>
    <div class="row">
      <div class="author">
        <strong title="防伪码:5225042d3a74838ebc0ee0e742feba6989343de1" class="">积极</strong>
        <br>
        <small>
          <a href="#footer" title="@回复"
             onclick="document.getElementById('comment').value += ' @<a href=&quot;//jandan.net/ooxx/page-1#comment-3535962&quot;>积极</a>: '">@4 months ago</a>
        </small>
      </div>
      <div class="text">
        <span class="righttext">
          <a href="//jandan.net/ooxx/page-1#comment-3535962">3535962</a>
        </span>
        <p>
          <img src="//img.jandan.net/img/blank.gif" onload="jandan_load_img(this)" />
          <span class="img-hash">3c28QnMeyAwt3SrZqZiN8EqKYkRtaHOg3GopKMpHe3t+rjK6Vv0DQkgdTn88a1kbeSW0YJp9cUoYx8cgqxfdGB+tIBzAFl6DFAplt949Va4I14HwHxGS0A</span>
        </p>
      </div>
      <div class="jandan-vote">
        <span class="tucao-like-container">
          <a title="圈圈/支持" href="javascript:;" class="comment-like like" data-id="3535962" data-type="pos">OO</a>
          [<span>12</span>]
        </span>
        <span class="tucao-unlike-container">
          <a title="叉叉/反对" href="javascript:;" class="comment-unlike unlike" data-id="3535962" data-type="neg">XX</a>
          [<span>126</span>]
          <a href="javascript:;" class="tucao-btn" data-id="3535962">吐槽 [1]</a>
        </span>
      </div>
    </div>
  </div>
</li>
```
See it? `<img src="//img.jandan.net/img/blank.gif" onload="jandan_load_img(this)" /><span class="img-hash">3c28QnMeyAwt3SrZqZiN8EqKYkRtaHOg3GopKMpHe3t+rjK6Vv0DQkgdTn88a1kbeSW0YJp9cUoYx8cgqxfdGB+tIBzAFl6DFAplt949Va4I14HwHxGS0A</span>`
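Pulling that encrypted payload out of the static page source is the easy half; a quick regex sketch over the snippet above shows it (no decryption yet, just extraction):

```python
import re

# Trimmed sample of the comment markup shown above.
html = '''<p><img src="//img.jandan.net/img/blank.gif" onload="jandan_load_img(this)" />
<span class="img-hash">3c28QnMeyAwt3SrZqZiN8EqKYkRtaHOg3GopKMpHe3t+rjK6Vv0DQkgdTn88a1kbeSW0YJp9cUoYx8cgqxfdGB+tIBzAFl6DFAplt949Va4I14HwHxGS0A</span></p>'''

# Grab the ciphertext sitting between the img-hash span tags.
hashes = re.findall(r'<span class="img-hash">(.*?)</span>', html)
print(hashes[0][:4])  # → 3c28
```

The hard part is what happens to that string next.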
The `onload` event fires a decryption routine defined in one of the page's JS files. Reproducing that decryption function in Python felt beyond me at this point, so I let PhantomJS do the work instead. Source:
Crawler v1
```python
import re
import os
from io import BytesIO

import requests
from PIL import Image
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

cap = dict(DesiredCapabilities.PHANTOMJS)
cap["phantomjs.page.settings.resourceTimeout"] = 1000
cap["phantomjs.page.settings.loadImages"] = True
cap["phantomjs.page.settings.disk-cache"] = True
cap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0"
)

driver = webdriver.PhantomJS(desired_capabilities=cap)
img_count = 0


def get_img(url, referer, host, filename):
    global img_count
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36",
        "Host": host,
        "Referer": referer,
    }
    print("Downloading image:", url)
    img = requests.get(url, headers=headers)
    if "gif" in filename:
        # Re-save GIFs through Pillow
        Image.open(BytesIO(img.content)).save(filename)
    else:
        with open(filename, "wb") as f:
            f.write(img.content)
    img_count += 1
    print("Downloaded", img_count, "images")


img_pattern = re.compile(r'(?s);"><a href="(.*?)".*?view_img_link">')

folder = "ooxx"
if not os.path.exists(folder):
    os.mkdir(folder)
os.chdir(folder)

urls = ["http://jandan.net/ooxx/page-{}#comments".format(i) for i in range(1, 391)]

for url in urls:
    print("Loading page:", url)
    driver.get(url)
    referer = url.split("#")[0]  # strip the fragment for the Referer header
    response = driver.page_source
    img_urls = img_pattern.findall(response)
    for img_url in img_urls:
        filename = img_url.split("/")[-1]
        img_host = img_url.split("/")[2]
        print("Downloading:", filename, ".....")
        # img_url is protocol-relative ("//wx1.sinaimg.cn/..."), so prepend a scheme
        get_img("http:" + img_url, referer, img_host, filename)

driver.close()
```
This version leaves a few regrets:

1. PhantomJS loads the full page and runs plenty of JS the crawler doesn't need, which makes it slow (if I could write out the decryption function myself, it would be much faster).
2. PhantomJS stalls partway through for reasons I haven't identified — once at page 162, once at page 193 — and I don't yet know how to work around it.
3. Images download in a single process, single thread; no async or multithreading (an obvious optimization, but I don't know how to handle it yet).
4. I never wrote the decryption function for the image addresses.
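On point 3: since the downloads are pure I/O, even without asyncio a thread pool would be a cheap win. A minimal sketch with the standard library (`fetch` here is a hypothetical stand-in for `get_img`; the real body would be the `requests.get` call and file write):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for get_img: the real version would do requests.get(url, ...)
    # and write the response body to disk.
    return url

urls = ["http://jandan.net/ooxx/page-%d" % i for i in range(1, 4)]

# map() dispatches the calls onto worker threads and returns results in order.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch, urls))
print(results)
```

Because the GIL is released during network I/O, threads overlap the waiting even though Python bytecode itself stays serialized.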
So I'll take this as an opportunity to shore up my JS knowledge... a second implementation to follow.
For reference, the JS decryption function's source:
```javascript
var f_2zKCnAnHzB3YYiFuffz6OoAkXddvtD8v = function (m, r, d) {
    // m: the img-hash string; r: the key; d: unused in this path.
    var e = "DECODE";
    var r = r ? r : "";
    var d = d ? d : 0;
    var q = 4;
    r = md5(r);
    var o = md5(r.substr(0, 16));
    var n = md5(r.substr(16, 16));
    if (q) {
        if (e == "DECODE") {
            var l = m.substr(0, q); // first 4 chars salt the session key
        }
    } else {
        var l = "";
    }
    var c = o + md5(o + l);
    var k;
    if (e == "DECODE") {
        m = m.substr(q);
        k = base64_decode(m);
    }
    // RC4 key scheduling
    var h = new Array(256);
    for (var g = 0; g < 256; g++) {
        h[g] = g;
    }
    var b = new Array();
    for (var g = 0; g < 256; g++) {
        b[g] = c.charCodeAt(g % c.length);
    }
    for (var f = g = 0; g < 256; g++) {
        f = (f + h[g] + b[g]) % 256;
        tmp = h[g];
        h[g] = h[f];
        h[f] = tmp;
    }
    // RC4 keystream XOR
    var t = "";
    k = k.split("");
    for (var p = f = g = 0; g < k.length; g++) {
        p = (p + 1) % 256;
        f = (f + h[p]) % 256;
        tmp = h[p];
        h[p] = h[f];
        h[f] = tmp;
        t += chr(ord(k[g]) ^ (h[(h[p] + h[f]) % 256]));
    }
    // Envelope check: 10-char expiry timestamp + 16-char md5 of payload
    if (e == "DECODE") {
        if ((t.substr(0, 10) == 0 || t.substr(0, 10) - time() > 0)
                && t.substr(10, 16) == md5(t.substr(26) + n).substr(0, 16)) {
            t = t.substr(26);
        } else {
            t = "";
        }
    }
    return t;
};
```
Crawler v2
I switched the download step over to async mode, and it feels noticeably faster.
This version's remaining problem is the PhantomJS part: it hangs while parsing page 134, cause unknown. I'll keep working on it.
```python
import re
import os
import asyncio
from io import BytesIO

import requests
from PIL import Image
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

cap = dict(DesiredCapabilities.PHANTOMJS)
cap["phantomjs.page.settings.resourceTimeout"] = 1000
cap["phantomjs.page.settings.loadImages"] = True
cap["phantomjs.page.settings.disk-cache"] = True
cap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0"
)

driver = webdriver.PhantomJS(desired_capabilities=cap)
img_count = 0


async def get_img(url, referer, host, filename):
    # Caveat: requests.get() is a blocking call, so even inside a coroutine
    # the downloads still run one after another; only the scheduling is async.
    global img_count
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36",
        "Host": host,
        "Referer": referer,
    }
    print("Downloading image:", url)
    img = requests.get(url, headers=headers)
    if "gif" in filename:
        Image.open(BytesIO(img.content)).save(filename)
    else:
        with open(filename, "wb") as f:
            f.write(img.content)
    img_count += 1
    print("Downloaded", img_count, "images")


img_pattern = re.compile(r'(?s);"><a href="(.*?)".*?view_img_link">')

folder = "ooxx"
if not os.path.exists(folder):
    os.mkdir(folder)
os.chdir(folder)

urls = ["http://jandan.net/ooxx/page-{}#comments".format(i) for i in range(1, 391)]

loop = asyncio.get_event_loop()
for url in urls:
    print("Loading page:", url)
    driver.get(url)
    referer = url.split("#")[0]  # strip the fragment for the Referer header
    response = driver.page_source
    img_urls = img_pattern.findall(response)
    if img_urls:
        tasks = []
        for img_url in img_urls:
            filename = img_url.split("/")[-1]
            img_host = img_url.split("/")[2]
            print("Downloading:", filename, ".....")
            task = asyncio.ensure_future(
                get_img("http:" + img_url, referer, img_host, filename)
            )
            tasks.append(task)
        loop.run_until_complete(asyncio.wait(tasks))

driver.close()
```
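One caveat worth recording: `requests.get()` blocks the event-loop thread, so the coroutines above still download one image at a time — the perceived speedup needs more scrutiny. A fix that keeps `requests` is to push each blocking call onto the loop's default thread pool with `run_in_executor`. A self-contained sketch (`blocking_fetch` is a hypothetical stand-in for `get_img`, with `time.sleep` playing the role of the network wait):

```python
import asyncio
import time

def blocking_fetch(url):
    # Stand-in for the blocking requests.get + file write in get_img.
    time.sleep(0.2)
    return url

async def fetch_all(urls):
    loop = asyncio.get_event_loop()
    # run_in_executor moves each blocking call onto a thread-pool worker,
    # so the "downloads" genuinely overlap instead of running back to back.
    futures = [loop.run_in_executor(None, blocking_fetch, u) for u in urls]
    return await asyncio.gather(*futures)

urls = ["http://jandan.net/ooxx/page-%d" % i for i in range(1, 6)]
start = time.time()
results = asyncio.run(fetch_all(urls))
elapsed = time.time() - start
print(elapsed)  # well under the 1.0 s a serial run of five 0.2 s fetches would take
```

The same shape drops into the v2 loop: wrap the `requests.get` call in `run_in_executor` and `gather` the futures per page.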