Crawling JD Product Data
For example, suppose we want to crawl and analyze the search results for “衬衫” (shirt).
-
First, confirm the URL
The URL after searching for “衬衫” on JD is what we need:
https://search.jd.com/Search?keyword=%E8%A1%AC%E8%A1%AB&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=%E8%A1%AC%E8%A1%AB&page=5&s=105&click=0
It's long, but we don't need all of it; the URL below is enough (I've replaced the percent-encoded keyword with the Chinese characters, which works the same):
https://search.jd.com/Search?keyword=衬衫&enc=utf-8&wq=衬衫
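Incidentally, the keyword=%E8%A1%AC%E8%A1%AB in the long URL is just the percent-encoded form of 衬衫. A quick way to verify this (a minimal Python 2 sketch, matching the script below; urllib.quote/unquote do the conversion):

# -*- coding: utf-8 -*-
import urllib
print(urllib.quote('衬衫'))                  # -> %E8%A1%AC%E8%A1%AB
print(urllib.unquote('%E8%A1%AC%E8%A1%AB'))  # -> 衬衫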
-
Pagination
There are a lot of results, so we have to page through them instead of scraping only the current page.
First, the pagination URL pattern: press F12, tick the two options below in the Network panel, click through a few pages, and watch how the URL changes.
Then look at the Request URL field under Headers:
https://search.jd.com/s_new.php?keyword=%E8%A1%AC%E8%A1%AB&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=%E8%A1%AC%E8%A1%AB&page=5&s=107&click=0
The changing page=5 value is essentially what we need; the pattern looks like page = (current visible page number) * 2 - 1.
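The rule is easy to sanity-check before wiring it into the spider: print the request URL for the first few visible pages (a minimal sketch; the s and click parameters from the captured URL are dropped here on the assumption that they are not needed for paging):

base = 'https://search.jd.com/s_new.php?keyword=%E8%A1%AC%E8%A1%AB&enc=utf-8&page='
for n in range(1, 4):             # visible pages 1, 2, 3
    print(base + str(2 * n - 1))  # -> page=1, page=3, page=5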
OK, let's look at the implementation.
What the code does:
- index_page: determine how many pages of results there are in total, i.e. how many times we need to crawl
- split_page: parse every item in the product list on each results page and write it to a file
Code:
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2017-12-14 14:21:44
# Project: TShirt_JD
from __future__ import print_function

import sys
reload(sys)
sys.setdefaultencoding('utf-8')  # Python 2 only: default str<->unicode conversions to UTF-8

from pyspider.libs.base_handler import *
from pandas import DataFrame


class Handler(BaseHandler):
    crawl_config = {
        "headers": {
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
        }
    }

    def __init__(self):
        # search keyword: 衬衫 (shirt)
        self.url = 'https://search.jd.com/Search?keyword=衬衫&enc=utf-8&wq=衬衫'
        self.splitpage_url = 'https://search.jd.com/Search?keyword=衬衫&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=衬衫&psort=4&page='
        self.tool = Tool()

    @every(minutes=1)  # re-run every minute
    def on_start(self):
        self.crawl(self.url, callback=self.index_page,
                   validate_cert=False, fetch_type='js')

    @config(age=60, priority=2)  # result valid for 1 min
    def index_page(self, response):
        # total page count, shown as <em><b>N</b></em> on the search page
        page = response.doc('em > b').text()
        print("pages count", page)
        if self.tool.isFloat(page) and float(page) > 0:
            for i in range(1, int(page) + 1):
                # JD's URL parameter follows page = (visible page number) * 2 - 1
                num = i * 2 - 1
                split_url = self.splitpage_url + str(num)
                print(split_url)
                self.crawl(split_url, callback=self.split_page,
                           validate_cert=False, fetch_type='js')

    @config(age=60, priority=1)  # result valid for 1 min
    def split_page(self, response):
        goods = response.doc('.J-goods-list > .clearfix > li')
        data_list = []
        for each in goods.items():
            price = each.find('.p-price > strong > i').text()
            name = each.find('.p-name-type-2 em').text()
            commit = each.find('.p-commit > strong > a').text()
            shop = each.find('.p-shop > span > a').text()
            onsale = each.find('.p-icons').text()
            data_list.append([price, commit, shop, onsale, name])
        frame = DataFrame(data_list, columns=['price', 'commit', 'shop', 'onsale', 'name'])
        # note: with mode='a' the header row is re-written for every page
        frame.to_csv('tshirt.csv', encoding='gb2312', mode='a')
        return {
            "url": response.url,
        }


# helper class
class Tool:
    def isFloat(self, number):
        try:
            float(number)
            return True
        except (TypeError, ValueError):
            return False
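To try it out: run pyspider all, open the web UI at http://localhost:5000, create a new project, and paste the script in. Note that fetch_type='js' relies on PhantomJS being installed (pyspider uses it to render pages), because the JD search page fills in the product list with JavaScript.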
Tips
When pandas reads a CSV file that contains Chinese text, the characters can come out garbled; passing an encoding argument avoids this, e.g.:
pd.read_csv("xxx.csv", encoding="gb2312")
pd.read_csv("xxx.csv", encoding="gbk")
The same encoding argument is needed when exporting, otherwise the file is garbled when opened in Excel (EditPlus displays it fine), e.g.:
df.to_csv("sel.csv", index=False, encoding="gb2312")
df.to_csv("sel.csv", index=False, encoding="gbk")
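Putting this together with the scraper above, reading the results back for analysis might look like the sketch below. The extra cleanup step is needed because split_page appends with mode='a', which re-writes the header row for every page (a minimal sketch; tshirt.csv is the file written above):

from __future__ import print_function
import pandas as pd

df = pd.read_csv('tshirt.csv', encoding='gb2312')
# drop the duplicated header rows that mode='a' appended for each page
df = df[df['price'] != 'price']
print(df.head())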