高级篇
学会用框架,能站在巨人肩膀上的人,能力往往都不会太差。
这里我们学习的是PySpider
PySpider环境搭建 (Windows)
-
pip install pyspider
安装pyspider (前面python 已经安装了2.7)
-
下载phantomjs-2.1.1-windows
加入环境变量,动态加载js会用到
-
我们使用mysql存储
如果不需要存储到mysql, 这步可以直接跳过
安装mysql,Navicat Premium(db管理工具)
-
运行 cmd -> pyspider all
到这里环境就搭建完成了
Pyspider的基础
先了解下pyspider的功能:
Pyspider入门实践 - 定时爬取下载量
下面我们从豌豆荚,百度手机助手,应用宝爬出某个应用的下载量。
code说明:
- __init__方法是初始化方法,里面放了一些url数据,时间等。
- on_start是入口,相当于main方法
- @every(minutes=24 * 60) # 每天执行一次
- @every(minutes=1) # 每min执行一次, 这个一般调试时用,方便实时查看情况
- 页面解析的三个方法: index_wdjpage, index_baidupage,index_yingyongbaopage
- @config(age=60) # 有效期1min 一分钟后才会过期。真正抓的频率是 on_start的@ every + 这里的age。即2分钟才会run一次index_wdjpage
- @config(priority=3) # priority设置优先级,数字越高代表最先run到
Code实例
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2017-11-29 16:32:33
# Project: ZenTalk_Download
from pyspider.libs.base_handler import *
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import uniout
import re
import time
class Handler(BaseHandler):
crawl_config = {
}
def __init__(self):
self.wdjurl = 'http://www.wandoujia.com/apps/com.asus.cnzentalk'
self.baiduurl = 'http://shouji.baidu.com/software/11306355.html'
self.yingyongbaourl = 'http://sj.qq.com/myapp/detail.htm?apkName=com.asus.cnzentalk'
self.tool = Tool()
self.time = self.tool.getCurrentIntTime()
FILE_NAME = 'zentalk_download.txt'
#@every(minutes=24 * 60) # 每天执行一次
@every(minutes=1) # 每1min执行一次
def on_start(self):
self.crawl(self.wdjurl, callback=self.index_wdjpage, fetch_type='js')
self.crawl(self.baiduurl, callback=self.index_baidupage, fetch_type='js')
self.crawl(self.yingyongbaourl, callback=self.index_yingyongbaopage, fetch_type='js')
@config(age=60) # 有效期1min
@config(priority=3)
def index_wdjpage(self, response):
raw_download = response.doc('[itemprop="interactionCount"]').text()
download = self.tool.getCount(raw_download)
# write to file
self.writeToFile(download, '豌豆荚')
@config(age=60) # 有效期1min
@config(priority=2)
def index_baidupage(self, response):
raw_download = response.doc('.yui3-g .download-num').text()
download = self.tool.getCount(re.split(':', raw_download)[1])
# write to file
self.writeToFile(download, '百度手机助手')
@config(age=60) # 有效期1min
@config(priority=1)
def index_yingyongbaopage(self, response):
raw_download = response.doc('.det-ins-num').text()[:-2]
download = self.tool.getCount(raw_download)
# write to file
self.writeToFile(download, '应用宝' )
def writeToFile(self, download, platform, append=True):
if append :
f = open(self.FILE_NAME, 'a')
f.write(platform + ": " + str(download) + '\n')
else:
f = open(self.FILE_NAME, 'w')
f.write("\n------ time: " + self.tool.getCurrentTime() + "," + str(self.tool.getCurrentIntTime()) + " ------------------\n")
f.write(platform + ": " + str(download) + '\n')
f.close()
# 工具类
class Tool:
# 将超链接广告剔除
removeADLink = re.compile('<div class="link_layer.*?</div>')
# 去除img标签,1-7位空格,
removeImg = re.compile('<img.*?>| {1,7}| ')
# 删除超链接标签
removeAddr = re.compile('<a.*?>|</a>')
# 把换行的标签换为\n
replaceLine = re.compile('<tr>|<div>|</div>|</p>')
# 将表格制表<td>替换为\t
replaceTD = re.compile('<td>')
# 将换行符或双换行符替换为\n
replaceBR = re.compile('<br><br>|<br>')
# 将其余标签剔除
removeExtraTag = re.compile('<.*?>')
# 将多行空行删除
removeNoneLine = re.compile('\n+')
def replace(self, x):
x = re.sub(self.removeADLink, "", x)
x = re.sub(self.removeImg, "", x)
x = re.sub(self.removeAddr, "", x)
x = re.sub(self.replaceLine, "\n", x)
x = re.sub(self.replaceTD, "\t", x)
x = re.sub(self.replaceBR, "\n", x)
x = re.sub(self.removeExtraTag, "", x)
x = re.sub(self.removeNoneLine, "\n", x)
# strip()将前后多余内容删除
return x.strip()
# it support convert string include unit to count
# such as 6000,60.2万,1亿
def getCount(self, x):
x = self.replace(x)
if (self.isFloat(x) or str.isdigit(str(x))):
return x
else:
num = re.findall("\d+\.?\d*", x)[0]
unit = re.sub(num, "", x)
unit_number = 1
if (unit == "万"):
unit_number = 10000
elif (unit == "亿"):
unit_number =100000000
return float(num) * unit_number
def isFloat(self, number):
try:
num = float(number)
return True
except ValueError:
return False
# 获取当前时间
def getCurrentTime(self):
return time.strftime('[%Y-%m-%d %H:%M:%S]', time.localtime(time.time()))
# 获取当前时间 unix 时间戳
def getCurrentIntTime(self):
return (int)(time.mktime(time.localtime(time.time())))
Pyspider的Tip
- Pyspider的创建的项目都放在下面的位置,以db的形式存储