Learning Web Scraping - Pyspider in Practice: JD Product Data

Posted by Vivian Sun

Scraping JD Product Data

For example, suppose I want to scrape and analyze the search results for "衬衫" (shirt).

  1. First, confirm the URL

    The URL you get after searching JD for "衬衫" is the one we need.

    https://search.jd.com/Search?keyword=%E8%A1%AC%E8%A1%AB&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=%E8%A1%AC%E8%A1%AB&page=5&s=105&click=0

    That is quite long; we don't need all of it. The URL below is enough (I replaced the percent-encoded text with the Chinese characters; it works the same).

    https://search.jd.com/Search?keyword=衬衫&enc=utf-8&wq=衬衫

  2. Pagination

    There are many results, so we definitely need to page through them rather than only scrape the current page.

    First, the pagination URL pattern: press F12, tick the two options shown below, click through a few pages, and let's analyze how the URL changes as we page.

    Look at the Request URL field under Headers:

    https://search.jd.com/s_new.php?keyword=%E8%A1%AC%E8%A1%AB&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=%E8%A1%AC%E8%A1%AB&page=5&s=107&click=0

    Basically, the constantly changing page=5 value is what we need; the pattern looks like the current page number * 2 - 1 (see the URL-building sketch right after this list).

    Okay, let's look at the code implementation.
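
Here is a minimal sketch of how those two observations combine into the paged request URLs: the keyword is percent-encoded and the page parameter follows the n * 2 - 1 rule. It is Python 2 to match the spider below, and dropping the extra query parameters (qrst, rt, stop, vt, s, click) is my own simplification that may need verifying against JD:

    # -*- coding: utf-8 -*-
    # Python 2 sketch; trimming qrst/rt/stop/vt/s/click from the query string is an assumption.
    import urllib

    keyword = urllib.quote('衬衫')   # percent-encodes the keyword to %E8%A1%AC%E8%A1%AB
    base = 'https://search.jd.com/Search?keyword={kw}&enc=utf-8&wq={kw}&page={page}'

    def page_url(n):
        # observed pattern: JD's "page" parameter = visible page number * 2 - 1
        return base.format(kw=keyword, page=n * 2 - 1)

    for n in range(1, 4):
        print(page_url(n))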

Code explanation:

  • index_page: determine how many result pages there are, i.e. how many page requests we need to schedule
  • split_page: for each page, parse every item in the result list and append it to a file

Code implementation:

    #!/usr/bin/env python
    # -*- encoding: utf-8 -*-
    # Created on 2017-12-14 14:21:44
    # Project: TShirt_JD
    
    from pyspider.libs.base_handler import *
    import sys
    reload(sys)                       # Python 2 only: needed for setdefaultencoding below
    sys.setdefaultencoding('utf-8')   # let byte/unicode strings mix without explicit decoding
    import uniout                     # prints Chinese strings readably instead of as escape sequences
    import pandas as pd
    from pandas import DataFrame
    
    
    class Handler(BaseHandler):
        crawl_config = {
            "headers":{
                "user-agent":"Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
            }
        }
        
        def __init__(self):
            # search keyword: 衬衫 (shirt)
            self.url = 'https://search.jd.com/Search?keyword=衬衫&enc=utf-8&wq=衬衫'
            # per-page URL prefix; the page number is appended in index_page
            self.splitpage_url = 'https://search.jd.com/Search?keyword=衬衫&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=衬衫&psort=4&page='
            self.tool = Tool()
    
        @every(minutes=1)  # schedule on_start every minute
        def on_start(self):
            self.crawl(self.url, callback=self.index_page, validate_cert=False, fetch_type='js')
    
        @config(age=60, priority=2)  # results considered valid for 60 seconds
        def index_page(self, response):
            # total number of result pages, shown in the <em><b> element on the search page
            page = response.doc('em > b').text()
            print "pages count ", page
            if self.tool.isFloat(page) and float(page) > 0:
                for i in range(1, int(page) + 1):
                    # JD's "page" parameter is the visible page number * 2 - 1
                    num = i * 2 - 1
                    split_url = self.splitpage_url + str(num)
                    print split_url
                    self.crawl(split_url, callback=self.split_page, validate_cert=False, fetch_type='js')
              
        @config(age=60, priority=1)  # results considered valid for 60 seconds
        def split_page(self, response):
            # each <li> in the goods list is one product
            goods = response.doc('.J-goods-list > .clearfix > li')
            data_list = []
            for each in goods.items():
                price = each.find('.p-price > strong > i').text()
                name = each.find('.p-name-type-2 em').text()
                commit = each.find('.p-commit > strong > a').text()
                shop = each.find('.p-shop > span > a').text()
                onsale = each.find('.p-icons').text()
                data_list.append([price, commit, shop, onsale, name])

            # append this page's rows to the CSV (gb2312 so Excel opens it correctly)
            frame = DataFrame(data_list, columns=['price', 'commit', 'shop', 'onsale', 'name'])
            frame.to_csv('tshirt.csv', encoding='gb2312', mode='a')
            
            return {
                "url": response.url,
            }
                
    # Utility class: simple numeric validation helper
    class Tool:
       
        def isFloat(self, number):
            try:
                num = float(number)
                return True
            except ValueError:
                return False

Tips

When reading a CSV file that contains Chinese text with pandas, the characters can come out garbled; pass the encoding parameter to avoid this, e.g.:

pd.read_csv("xxx.csv", encoding="gb2312")

pd.read_csv("xxx.csv", encoding="gbk")

Pass encoding when exporting as well, otherwise the exported file looks garbled when opened in Excel (it opens fine in EditPlus), e.g.:

df.to_csv("sel.csv", index=False, encoding="gb2312")
df.to_csv("sel.csv", index=False, encoding="gbk")
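
A minimal round-trip sketch of the above, using a made-up filename (demo.csv) and sample row:

    # -*- coding: utf-8 -*-
    import pandas as pd

    # write a tiny frame containing Chinese text, then read it back with the same encoding;
    # 'demo.csv' is just an illustrative filename
    df = pd.DataFrame({'name': [u'衬衫'], 'price': [99.0]})
    df.to_csv('demo.csv', index=False, encoding='gbk')

    df2 = pd.read_csv('demo.csv', encoding='gbk')
    print(df2)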
