Scraping MM Photos with a Simple Python Crawler

License: Attribution-NonCommercial-ShareAlike 4.0 International

This article is from Suzf Blog. Unless otherwise noted, all content is original to SUZF.NET.

Please credit the source when reposting: http://suzf.net/post/1146

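Basic steps for building a crawler

Requirements analysis
Analyze the site's source code (F12)
Write a regular expression to filter the content
Write the code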

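Requirements analysis

There are lots of images I want, but I am too lazy to download them one by one. Is there a simple and effective way to do it?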

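Analyze the site's source code

Open Baidu Image Search and enter the keyword `mm`. Then click the button in the upper right to switch to the traditional paginated version, and press F12 to inspect the page source.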
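
The paginated search URL can be assembled from the keyword and a page offset. The following is only a sketch: it assumes the `tn`, `ie`, `word`, and `pn` query parameters that appear in the full script's URL are sufficient (the `gsm` value is left out), and that `pn` advances by 20 per page as it does in that script. `build_flip_url` is a hypothetical helper name, not part of the original code.

```python
from urllib.parse import urlencode

# Base address of the paginated ("flip") search page, taken from the script's URL
BASE = "http://image.baidu.com/search/flip"


def build_flip_url(word, pn):
    """Return the flip-page search URL for `word` at result offset `pn`."""
    query = urlencode({"tn": "baiduimage", "ie": "utf-8", "word": word, "pn": pn})
    return "{0}?{1}".format(BASE, query)


print(build_flip_url("mm", 20))
```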

Write the regular expression

    img = re.compile(r'(http:[^\s]*?(jpg|png|gif))')
    img_list = img.findall(html)
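
To see what this pattern returns, here is a small sketch run against a made-up HTML fragment (the `example.com` URLs are placeholders for illustration, not real search results):

```python
import re

img = re.compile(r'(http:[^\s]*?(jpg|png|gif))')

# Made-up HTML fragment, for illustration only
sample = ('<img src="http://example.com/a/cat.jpg"> '
          '<img src="http://example.com/b/dog.png">')

# With two capture groups, findall yields (full_url, extension) tuples
matches = img.findall(sample)
print(matches)

# Keep only the full URL from each tuple
urls = [full for full, _ext in matches]
print(urls)
```

Because the pattern has two groups, `findall` returns tuples rather than plain strings, which is why the full script unpacks each match as `for img, _ in img_list`.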

Write the code

    #!/usr/bin/env python3
    # -*- encoding: utf-8 -*-


    import os
    import re
    import urllib.request


    def get_html(url):
        """Fetch a page and return its source as text."""
        # Send a browser-like User-Agent header with the request
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        req = urllib.request.Request(url, headers=headers)
        page = urllib.request.urlopen(req)
        html = page.read().decode("utf8")
        return html


    def get_img(html):
        """Return (url, extension) tuples for every image link in the page."""
        img = re.compile(r'(http:[^\s]*?(jpg|png|gif))')
        img_list = img.findall(html)
        return img_list


    def download(dst_path, img_name):
        """Create dst_path if needed and return the full path for img_name."""
        # makedirs also creates any missing parent directories
        os.makedirs(dst_path, exist_ok=True)
        dst_name = os.path.join(dst_path, img_name)
        return dst_name


    if __name__ == '__main__':
        dst_path = '/home/zfsu/download'
        pn = 0
        for i in range(9):
            # pn is incremented before use, so the requests cover pn=20..180
            pn += 20
            url = "http://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=mm&pn={0}&gsm=f000000000f0".format(pn)
            html = get_html(url)
            img_list = get_img(html)
            # Each match is a (url, extension) pair; only the url is needed
            for img, _ in img_list:
                print("Download {0}".format(img))
                img_name = img.split('/')[-1]
                urllib.request.urlretrieve(img, download(dst_path, img_name))
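
The same image URL may show up more than once in a page's source, in which case the loop would fetch it repeatedly. One possible refinement, sketched here under the assumption that duplicates are exact string matches, is to deduplicate the matches while keeping their order. `unique_urls` is a hypothetical helper, not part of the original script.

```python
def unique_urls(img_list):
    """Return the URLs from (url, extension) tuples, without duplicates, in first-seen order."""
    # dict.fromkeys keeps insertion order, so the first occurrence of each URL wins
    return list(dict.fromkeys(url for url, _ext in img_list))


# Made-up matches in the shape returned by get_img()
sample = [
    ("http://example.com/a.jpg", "jpg"),
    ("http://example.com/b.png", "png"),
    ("http://example.com/a.jpg", "jpg"),
]
print(unique_urls(sample))
```

The result could be iterated in place of `img_list` in the download loop, with `for img in unique_urls(img_list):`.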

Run the program and enjoy the MM photos.