Suzf Blog

[Translation] Python Logging Howto

Jeffrey Nov 21, 2016 Python

Basic Logging Tutorial

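Logging is a means of tracking events that happen while software runs. Developers add logging calls to their code to indicate that certain events have occurred. An event is described by a descriptive message, which can optionally contain variable data (that is, data that may differ for each occurrence of the event). Events also have an importance that the developer assigns to them; this importance is also called the level or severity.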

When to use logging

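The logging module provides a set of convenience functions for simple logging usage: debug(), info(), warning(), error() and critical(). To decide when to use logging, see the table in the full tutorial, which lists the best tool for each of a set of common tasks.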
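As a minimal sketch of the five levels in use (the logger name, the message texts and the INFO threshold are illustrative assumptions, not taken from the tutorial), the following routes one message per level to an in-memory stream. It uses the logger methods that share their names with the module-level convenience functions:

```python
import io
import logging


def demo_logging(stream):
    """Emit one message per level to the given stream and return the text."""
    logger = logging.getLogger("howto_demo")  # hypothetical logger name
    logger.setLevel(logging.INFO)             # DEBUG is below the threshold
    logger.propagate = False                  # keep the root logger out of it
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s:%(message)s"))
    logger.handlers = [handler]

    logger.debug("detailed diagnostic output")       # filtered out
    logger.info("service started on port %d", 8080)
    logger.warning("disk usage at %d%%", 91)
    logger.error("could not open %s", "config.ini")
    logger.critical("shutting down")
    return stream.getvalue()


output = demo_logging(io.StringIO())
print(output)
```

Because the threshold is INFO, the debug() call produces nothing; lowering the level to logging.DEBUG would make it appear.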

Basic Steps for Building a Web Crawler

Requirements analysis
Analyze the site's source (F12)
Write regular expressions to filter the content
Write the code

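Requirements analysis

There are lots of images I want, but I am too lazy to download them by hand. Is there a simple and effective way?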
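The "write regular expressions to filter the content" step can be sketched as follows. The pattern, the helper name extract_image_urls and the sample HTML are all assumptions made for illustration; a real page would need a pattern tuned to its own markup, and fetching the page is left out on purpose.

```python
import re

# Hypothetical pattern: captures the src attribute of <img> tags that
# point at common image file types.
IMG_SRC = re.compile(
    r'<img[^>]+src=["\']([^"\']+\.(?:jpg|jpeg|png|gif))["\']',
    re.IGNORECASE,
)


def extract_image_urls(html):
    """Return every image URL found in the given HTML text."""
    return IMG_SRC.findall(html)


sample_html = '''
<div><img class="pic" src="http://example.com/a.jpg"></div>
<p>no image here</p>
<img src='/static/b.PNG' alt="b">
'''

urls = extract_image_urls(sample_html)
print(urls)
```

Regular expressions are fragile against arbitrary HTML, so this only suits pages whose structure you have already inspected with F12.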

How-to Make mod_wsgi use python2.7.3 instead of python2.6.6

Jeffrey Jun 18, 2016 Python

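The scenario:

CentOS 6.7 ships Python 2.6.x by default, while most day-to-day work is still done on 2.7.x.

So the question is: if you build a Python web server with Apache + mod_wsgi, how do you change the Python version it uses?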
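One way to check which interpreter is actually serving requests is a tiny WSGI application that reports it. This is only a diagnostic sketch, not the fix itself; it assumes the file is configured as the WSGI script, where mod_wsgi looks for a callable named application by default.

```python
import sys


def application(environ, start_response):
    """WSGI entry point that reports the interpreter running it."""
    text = "Python %s\nprefix: %s\n" % (sys.version, sys.prefix)
    body = text.encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

If the page still shows 2.6.x after the change, the old interpreter is still the one embedded in Apache.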

Bytes-to-human and human-to-bytes converter

Jeffrey Jun 06, 2016 Python

#!/usr/bin/env python

"""
Bytes-to-human / human-to-bytes converter.
Based on: http://goo.gl/kTQMs
Working with Python 2.x and 3.x.

Author: Giampaolo Rodola' <g.rodola [AT] gmail [DOT] com>
License: MIT
"""

# see: http://goo.gl/kTQMs
SYMBOLS = {
    'customary'     : ('B', 'K', 'M', 'G', 'T', 'P', 'E', 'Z', 'Y'),
    'customary_ext' : ('byte', 'kilo', 'mega', 'giga', 'tera', 'peta', 'exa',
                       'zetta', 'yotta'),
    'iec'           : ('Bi', 'Ki', 'Mi', 'Gi', 'Ti', 'Pi', 'Ei', 'Zi', 'Yi'),
    'iec_ext'       : ('byte', 'kibi', 'mebi', 'gibi', 'tebi', 'pebi', 'exbi',
                       'zebi', 'yobi'),
}

def bytes2human(n, format='%(value).1f %(symbol)s', symbols='customary'):
    """
    Convert n bytes into a human readable string based on format.
    symbols can be either "customary", "customary_ext", "iec" or "iec_ext",
    see: http://goo.gl/kTQMs

      >>> bytes2human(0)
      '0.0 B'
      >>> bytes2human(0.9)
      '0.0 B'
      >>> bytes2human(1)
      '1.0 B'
      >>> bytes2human(1.9)
      '1.0 B'
      >>> bytes2human(1024)
      '1.0 K'
      >>> bytes2human(1048576)
      '1.0 M'
      >>> bytes2human(1099511627776127398123789121)
      '909.5 Y'

      >>> bytes2human(9856, symbols="customary")
      '9.6 K'
      >>> bytes2human(9856, symbols="customary_ext")
      '9.6 kilo'
      >>> bytes2human(9856, symbols="iec")
      '9.6 Ki'
      >>> bytes2human(9856, symbols="iec_ext")
      '9.6 kibi'

      >>> bytes2human(10000, "%(value).1f %(symbol)s/sec")
      '9.8 K/sec'

      >>> # precision can be adjusted by playing with %f operator
      >>> bytes2human(10000, format="%(value).5f %(symbol)s")
      '9.76562 K'
    """
    n = int(n)
    if n < 0:
        raise ValueError("n < 0")
    symbols = SYMBOLS[symbols]
    prefix = {}
    for i, s in enumerate(symbols[1:]):
        prefix[s] = 1 << (i+1)*10
    for symbol in reversed(symbols[1:]):
        if n >= prefix[symbol]:
            value = float(n) / prefix[symbol]
            return format % locals()
    return format % dict(symbol=symbols[0], value=n)

def human2bytes(s):
    """
    Attempts to guess the string format based on default symbols
    set and return the corresponding bytes as an integer.
    When unable to recognize the format ValueError is raised.

      >>> human2bytes('0 B')
      0
      >>> human2bytes('1 K')
      1024
      >>> human2bytes('1 M')
      1048576
      >>> human2bytes('1 Gi')
      1073741824
      >>> human2bytes('1 tera')
      1099511627776

      >>> human2bytes('0.5kilo')
      512
      >>> human2bytes('0.1  byte')
      0
      >>> human2bytes('1 k')  # k is an alias for K
      1024
      >>> human2bytes('12 foo')
      Traceback (most recent call last):
          ...
      ValueError: can't interpret '12 foo'
    """
    init = s
    num = ""
    while s and s[0:1].isdigit() or s[0:1] == '.':
        num += s[0]
        s = s[1:]
    num = float(num)
    letter = s.strip()
    for name, sset in SYMBOLS.items():
        if letter in sset:
            break
    else:
        if letter == 'k':
            # treat 'k' as an alias for 'K' as per: http://goo.gl/kTQMs
            sset = SYMBOLS['customary']
            letter = letter.upper()
        else:
            raise ValueError("can't interpret %r" % init)
    prefix = {sset[0]:1}
    for i, s in enumerate(sset[1:]):
        prefix[s] = 1 << (i+1)*10
    return int(num * prefix[letter])
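Both functions rely on the same multiplier table, in which each successive symbol is worth 2**10 times the previous one. The standalone sketch below isolates that step; the helper name binary_multipliers and the shortened symbol tuple are illustrative and not part of the recipe.

```python
def binary_multipliers(symbols=('B', 'K', 'M', 'G', 'T')):
    """Map each symbol to its size in bytes, in powers of 1024."""
    table = {symbols[0]: 1}
    for i, s in enumerate(symbols[1:]):
        # '*' binds tighter than '<<', so this is 1 << ((i + 1) * 10).
        table[s] = 1 << (i + 1) * 10
    return table


table = binary_multipliers()
print(table['K'], table['M'])

# 9856 bytes expressed in K, formatted as in the recipe's doctest
formatted = "%.1f K" % (9856 / float(table['K']))
print(formatted)
```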

Extra: for an alternative version as simple as possible (no global vars, no extra args, easier to customize) see here: http://code.google.com/p/pyftpdlib/source/browse/trunk/test/bench.py?spec=svn984&r=984#137

Source: http://code.activestate.com/recipes/578019-bytes-to-human-human-to-bytes-converter/

Learning the Python re Module

Jeffrey Apr 24, 2016 Python

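The previous post, Python Regular Expression HOWTO, introduced regular expressions in detail. This post gives only a brief summary of the re module.

Metacharacter reference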

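.      matches any character except a newline
^      matches the start of the string
$      matches the end of the string
[]     matches a specified character class
?      repeats the preceding character 0 or 1 times
*      repeats the preceding character 0 or more times
{m}    repeats the preceding character exactly m times
{m,n}  repeats the preceding character m to n times
\d     matches any digit; equivalent to [0-9]
\D     matches any non-digit character; equivalent to [^0-9]
\s     matches any whitespace character; equivalent to [ \t\n\r\f\v]
\S     matches any non-whitespace character; equivalent to [^ \t\n\r\f\v]
\w     matches any alphanumeric character; equivalent to [a-zA-Z0-9_]
\W     matches any non-alphanumeric character; equivalent to [^a-zA-Z0-9_]
\b     matches the start or end of a word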
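A few of the metacharacters above can be tried directly with the re module. The sample strings are made up for illustration, and re.fullmatch assumes Python 3.4 or later.

```python
import re

# . matches any character except a newline
dot = re.findall(r"a.c", "abc a\nc a-c")

# \d with {m,n}: runs of one to four digits
digits = re.findall(r"\d{1,4}", "Order 66 shipped on 2016-04-24")

# ^ with \w and *: the leading run of word characters
leading = re.search(r"^\w*", "bob_smith rocks").group()

# $ with {m}: exactly four digits at the end of the string
year = re.search(r"\d{4}$", "released in 2016").group()

# ? makes the preceding character optional
optional = [bool(re.fullmatch(r"colou?r", w))
            for w in ("color", "colour", "colouur")]

# [] is a character class; \b is a word boundary; \s is whitespace
vowels = re.findall(r"[aeiou]", "regex")
whole_words = re.findall(r"\bcat\b", "cat concat cat.")
spaces = re.findall(r"\s", "a b\tc\nd")

print(dot, digits, leading, year, optional, vowels, whole_words, spaces)
```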

Python Regular Expression HOWTO

Jeffrey Apr 22, 2016 Python

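Abstract
This document is an introductory tutorial on using regular expressions in Python through the re module. It is gentler and more step-by-step than the corresponding section of the Library Reference.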

Introduction

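The re module was added in Python 1.5 and provides Perl-style regular expression patterns. Earlier versions of Python offered Emacs-style patterns through the regex module. Emacs-style patterns are somewhat less readable and less powerful, so avoid the regex module when writing new code, although you may still come across it in old code.

At heart, regular expressions (or REs) are a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of strings you want to match; that set might contain English sentences, e-mail addresses, TeX commands, or anything else you like. You can then ask questions such as "Does this string match the pattern?" or "Is there a match for the pattern anywhere in this string?". You can also use REs to modify a string or to split it apart in various ways.

Regular expression patterns are compiled into a series of bytecodes, which are then executed by a matching engine written in C. For advanced use, it may be necessary to pay close attention to how the engine executes a given RE, and to write the RE in a particular way so that the resulting bytecode runs faster. Optimization is not covered in this document, because it requires a good understanding of the matching engine's internals.

The regular expression language is relatively small and restricted, so not every string processing task can be done with regular expressions. There are also tasks that can be done with regular expressions, but only with expressions that turn out to be extremely complicated. In those cases, you may be better off writing Python code to do the processing; although Python code will be slower than an elaborate regular expression, it will also be easier to understand.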
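The questions posed in the introduction map onto a handful of re calls. The sketch below uses a deliberately simplistic e-mail pattern and invented sample text, purely as an illustration:

```python
import re

# Simplistic on purpose: real e-mail addresses are far more varied.
pattern = re.compile(r"[\w.]+@[\w.]+")

text = "contact: alice@example.com, bob@example.org"

# "Does this string match the pattern?" match() anchors at the start.
starts_with_address = pattern.match(text) is not None

# "Is there a match anywhere in this string?" search() scans the string.
first_address = pattern.search(text).group()

# Modifying and splitting strings with REs.
masked = pattern.sub("<hidden>", text)
fields = re.split(r",\s*", "a, b,c")

print(starts_with_address, first_address, masked, fields)
```

Compiling the pattern once with re.compile and reusing the resulting object keeps the calls short when the same pattern is applied repeatedly.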

A Brief Overview of the Python time Module

Jeffrey Mar 10, 2016 Python

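I. Introduction

The time module provides various functions for working with time.
Note: there are generally two ways to represent a time:
The first is a timestamp (an offset in seconds from 1970-01-01 00:00:00 UTC); a timestamp is unique.
The second is an array-like form, struct_time, with the nine elements listed below; the struct_time for a given timestamp differs from one time zone to another.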
year (four digits, e.g. 1998)
month (1-12)
day (1-31)
hours (0-23)
minutes (0-59)
seconds (0-59)
weekday (0-6, Monday is 0)
Julian day (day in the year, 1-366)
DST (Daylight Saving Time) flag (-1, 0 or 1): whether daylight saving time is in effect
If the DST flag is 0, the time is given in the regular time zone;
if it is 1, the time is given in the DST time zone;
if it is -1, mktime() should guess based on the date and time.
About daylight saving time: http://baike.baidu.com/view/100246.htm
About UTC: http://wenda.tianya.cn/wenda/thread?tid=283921a9da7c5aef&clk=wttpcts
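The two representations can be compared side by side. In this sketch the timestamp value is chosen for illustration (it corresponds to 2009-03-20 11:45:39 UTC), and calendar.timegm is used as the UTC counterpart of mktime():

```python
import calendar
import time

ts = 1237549539  # 2009-03-20 11:45:39 UTC

utc = time.gmtime(ts)       # struct_time in UTC
local = time.localtime(ts)  # struct_time in the machine's time zone

print(utc.tm_year, utc.tm_mon, utc.tm_mday, utc.tm_hour, utc.tm_min, utc.tm_sec)
print(local.tm_hour, local.tm_isdst)  # depends on the local time zone

# Both struct_time values convert back to the same, unique timestamp.
back_from_utc = calendar.timegm(utc)
back_from_local = time.mktime(local)
print(back_from_utc, back_from_local)
```

The hour shown for local varies with the machine's time zone, which is exactly the difference between the two struct_time values described above.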

II. Function Reference
1.asctime()
asctime([tuple]) -> string
Converts a struct_time (the current time by default) to a string.
Convert a time tuple to a string, e.g. 'Sat Jun 06 16:26:11 1998'.
When the time tuple is not present, current time as returned by localtime()
is used.

2.clock()
clock() -> floating point number
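This function behaves in two ways (as observed on Windows; on Unix, clock() returned processor time instead, and the function was removed in Python 3.8):
The first call starts the counter and returns a value close to zero;
Every later call returns the time elapsed since that first call.

Example: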

import time

if __name__ == '__main__':
    # time.clock() only exists on Python versions before 3.8
    time.sleep(1)
    print("clock1:%s" % time.clock())
    time.sleep(1)
    print("clock2:%s" % time.clock())
    time.sleep(1)
    print("clock3:%s" % time.clock())

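Output:
clock1:3.35238137808e-006
clock2:1.00004944763
clock3:2.00012040636
The first value is close to zero because the first call starts the counter; the sleep(1) before it is not counted.
The second and third values are the intervals since the first call.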
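On current Python versions, where clock() is no longer available, interval measurements like the one above are usually written with time.perf_counter(). The sketch below is that substitute rather than the original author's code, and the helper name timed is hypothetical:

```python
import time


def timed(func, *args):
    """Call func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


# sleep() accepts fractional seconds, so this waits about 50 ms.
result, elapsed = timed(time.sleep, 0.05)
print("slept for about %.3f s" % elapsed)
```

Unlike clock() on Windows, perf_counter() has an undefined reference point, so only the difference between two calls is meaningful.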

3.sleep(...)
sleep(seconds)
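Suspends the current thread for the given amount of time. Testing shows that the unit is seconds. The help text also contains the sentence below, which simply means that a fractional value such as 0.5 may be passed to sleep for less than a second:
"The argument may be a floating point number for subsecond precision."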

4.ctime(...)
ctime([seconds]) -> string
Converts a timestamp (the current time by default) to a time string.
For example:
time.ctime()
Output: 'Sat Mar 28 22:24:24 2009'

5.gmtime(...)
gmtime([seconds]) -> (tm_year, tm_mon, tm_mday, tm_hour, tm_min, tm_sec, tm_wday, tm_yday, tm_isdst)
Converts a timestamp to a struct_time in UTC (time zone 0). If seconds is omitted, the current time is used.

6.localtime(...)
localtime([seconds]) -> (tm_year, tm_mon, tm_mday, tm_hour, tm_min, tm_sec, tm_wday, tm_yday, tm_isdst)
Converts a timestamp to a struct_time in the local time zone. If seconds is omitted, the current time is used.

7.mktime(...)
mktime(tuple) -> floating point number
Converts a struct_time, interpreted as local time, to a timestamp.

8.strftime(...)
strftime(format[, tuple]) -> string
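Formats the given struct_time (the current time by default) according to the format string.
Date and time format codes in Python:
%y two-digit year (00-99)
%Y four-digit year (0000-9999)
%m month (01-12)
%d day of the month (01-31)
%H hour on the 24-hour clock (00-23)
%I hour on the 12-hour clock (01-12)
%M minute (00-59)
%S second (00-59)

%a locale's abbreviated weekday name
%A locale's full weekday name
%b locale's abbreviated month name
%B locale's full month name
%c locale's date and time representation
%j day of the year (001-366)
%p locale's equivalent of A.M. or P.M.
%U week number of the year (00-53), with Sunday as the first day of the week
%w weekday (0-6), with Sunday as 0
%W week number of the year (00-53), with Monday as the first day of the week
%x locale's date representation
%X locale's time representation
%Z name of the current time zone
%% a literal % character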

9.strptime(...)
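strptime(string, format) -> struct_time
Parses a time string according to the given format codes and returns the time in array form.
For example:
2009-03-20 11:45:39 corresponds to the format string %Y-%m-%d %H:%M:%S
Sat Mar 28 22:24:24 2009 corresponds to the format string %a %b %d %H:%M:%S %Y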
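strptime() and strftime() are inverses for a given format string. The sketch below parses the first example above, formats it back, and shows that a string that does not fit the format raises ValueError:

```python
import time

fmt = "%Y-%m-%d %H:%M:%S"

parsed = time.strptime("2009-03-20 11:45:39", fmt)
round_trip = time.strftime(fmt, parsed)
print(parsed.tm_year, parsed.tm_mon, parsed.tm_mday, round_trip)

# A month of 13 cannot be parsed with %m.
try:
    time.strptime("2009-13-01 00:00:00", fmt)
    parse_failed = False
except ValueError:
    parse_failed = True
print(parse_failed)
```

Formats that use %a or %b depend on the current locale, so the numeric form is the safer choice for data exchanged between machines.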

10.time(...)
time() -> floating point number
Returns the current time as a timestamp.

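III. Open Questions
1.Daylight saving time
In a struct_time, the DST flag does not seem to have any effect. For example:
a = (2009, 6, 28, 23, 8, 34, 5, 87, 1)
b = (2009, 6, 28, 23, 8, 34, 5, 87, 0)
a and b represent daylight saving time and standard time respectively, so their timestamps should differ by 3600 seconds, yet both convert to 646585714.0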

IV. Small Recipes
1.Getting the current time
time.time() returns the current timestamp
time.localtime() returns the current time as a struct_time
time.ctime() returns the current time as a string

2.Formatting a time as a string
To format as 2009-03-20 11:45:39:
time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())

With no time argument, the current time is used by default:
time.strftime('%Y-%m-%d %H:%M:%S')

To format as Sat Mar 28 22:24:24 2009:
time.strftime("%a %b %d %H:%M:%S %Y", time.localtime())

3.Converting a time string to a timestamp
a = "Sat Mar 28 22:24:24 2009"
b = time.mktime(time.strptime(a,"%a %b %d %H:%M:%S %Y"))

4.Converting a time string to a struct_time
time.strptime('2015-09-16 10:27:36','%Y-%m-%d %H:%M:%S')

Pitfalls of MySQLdb Parameter Handling

Jeffrey Feb 29, 2016 Python

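A few days ago yet another colleague fell into the trap of passing parameters to an SQL IN condition. Take a statement like SELECT col1, col2 FROM table1 WHERE id IN (1, 2, 3): if the argument to IN is a variable-length list, how should it be passed?

I have seen at least these variants: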

id_list = [1, 2, 3]
cursor.execute('SELECT col1, col2 FROM table1 WHERE id IN (%s)', id_list)

This fails outright, because the number of placeholders does not match the number of arguments when MySQLdb formats the string.

id_list = [1, 2, 3]
cursor.execute('SELECT col1, col2 FROM table1 WHERE id IN (%s)', (id_list,))

This is syntactically valid but semantically wrong, because the generated SQL is SELECT col1, col2 FROM table1 WHERE id IN (('1', '2', '3'))

id_list = [1, 2, 3]
id_list = ','.join([str(i) for i in id_list])
cursor.execute('SELECT col1, col2 FROM table1 WHERE id IN (%s)', id_list)

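This is semantically wrong as well, because the generated SQL is SELECT col1, col2 FROM table1 WHERE id IN ('1,2,3')

These three are the most common mistakes people make the first time they pass parameters to IN with MySQLdb. After hitting the first error and falling into the other two traps, most people switch to the following: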

id_list = [1, 2, 3]
id_list = ','.join([str(i) for i in id_list])
cursor.execute('SELECT col1, col2 FROM table1 WHERE id IN (%s)' % id_list)

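This works for trusted parameters (for example a list you generate yourself, such as range(1, 10, 2)), but because the values are not escaped, it is open to SQL injection when the parameters come from untrusted user input.

You can never let your guard down against SQL injection, hence this improved version: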

id_list = [1, 2, 3]
id_list = ','.join([str(cursor.connection.literal(i)) for i in id_list])
cursor.execute('SELECT col1, col2 FROM table1 WHERE id IN (%s)' % id_list)

This keeps SQL injection in check, but cursor.connection.literal is an internal interface and is not recommended for external use.

Which leads to this approach:

id_list = [1, 2, 3]
arg_list = ','.join(['%s'] * len(id_list))
cursor.execute('SELECT col1, col2 FROM table1 WHERE id IN (%s)' % arg_list, id_list)

This first generates as many %s placeholders as there are parameters, producing SQL such as 'SELECT col1, col2 FROM table1 WHERE id IN (%s,%s,%s)', and then passes the parameters in the safe way.
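The placeholder approach can be wrapped in a small helper. This is a sketch under stated assumptions: the name build_in_query is hypothetical, the template must contain exactly one %s (where the placeholder list goes) and no other % characters, and rejecting an empty list is a design choice made here because IN () is not valid SQL.

```python
def build_in_query(sql_template, values):
    """Return (sql, params) with one %s placeholder per value."""
    params = list(values)
    if not params:
        raise ValueError("IN () needs at least one value")
    placeholders = ",".join(["%s"] * len(params))
    return sql_template % placeholders, params


sql, params = build_in_query(
    "SELECT col1, col2 FROM table1 WHERE id IN (%s)", [1, 2, 3])
print(sql)
print(params)

# With a real MySQLdb cursor the call would then be:
# cursor.execute(sql, params)
```

Only the placeholders are interpolated into the SQL text; the values themselves still travel through the driver's own escaping.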

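All I wanted was to pass one parameter, so why is it such a hassle? Maddening!

Correction: the passage below (struck through in the original) is an incorrect conclusion reached without sufficient testing; it is kept only for the record:

I had always assumed that MySQLdb did not support passing a parameter to IN. It was only when a colleague fell into this trap again that I read the escaping code in MySQLdb and found that it automatically converts between many Python object types and the types SQL supports. For example, MySQLdb escapes the elements of a list or tuple one by one and produces a tuple, so this is the correct way to pass a parameter to IN: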

id_list = [1, 2, 3]
cursor.execute('SELECT col1, col2 FROM table1 WHERE id IN %s', (id_list,))

MySQLdb's parameter handling can be summarized as follows:

Escape the argument (id_list,) to get (('1', '2', '3'),)
Format the SQL with the escaped argument: 'SELECT col1, col2 FROM table1 WHERE id IN %s' % (('1', '2', '3'),), which yields the complete SQL: 'SELECT col1, col2 FROM table1 WHERE id IN ('1', '2', '3')'

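A rule of thumb: the argument to IN is a single unit, just like any other parameter, so stop fixating on the () that is part of the argument...

To summarize the problems with this method that were raised in the comments:

If the list has only one element, for example cursor.execute('SELECT col1, col2 FROM table1 WHERE id IN %s', ([1],)), the generated SQL is SELECT col1, col2 FROM table1 WHERE id IN ('1',), which is a syntax error.
The quotes added when escaping the list elements are left in place, so if the elements are strings the result is wrong. For example, cursor.execute('SELECT col1, col2 FROM table1 WHERE id IN %s', (["1", "2"],)) generates SELECT col1, col2 FROM table1 WHERE id IN ("'1'", "'2'"). Numeric parameters happen to work only because, when the column is defined as int and a string is passed, MySQL performs an implicit type conversion (Type Conversion in Expression Evaluation).

MySQLdb can convert and escape many types of Python objects. If you are interested, look at MySQLdb.converters and the *_escape* family of functions in _mysql.c. MySQLdb also supports custom conversion rules; see the conv parameter of MySQLdb.connect.

Source: the Internet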

How to limit Flask dev server to only one visiting ip address

Jeffrey Feb 23, 2016 Python

I'm developing a website using the Python Flask framework and I now do some devving, pushing my changes to a remote dev server. I set this remote dev server up to serve the website publicly using app.run(host='0.0.0.0').

This works fine, but I just don't want other people to view my website yet. For this reason I somehow want to whitelist my ip so that the dev server only serves the website to my own ip address, giving no response, 404s or some other non-useful response to other ip addresses. I can of course set up the server to use apache or nginx to actually serve the website, but I like the automatic reloading of the website on code changes for devving my website.
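One possible approach, sketched here as an assumption rather than an answer taken from the post, is a small WSGI middleware that rejects every client whose address is not on a whitelist. The class name IPWhitelist and the sample addresses are hypothetical, and the 403 response is just one of the "non-useful" responses the question allows.

```python
class IPWhitelist(object):
    """WSGI middleware that only lets whitelisted client addresses through."""

    def __init__(self, app, allowed):
        self.app = app
        self.allowed = set(allowed)

    def __call__(self, environ, start_response):
        # REMOTE_ADDR is the peer address as seen by the WSGI server.
        if environ.get("REMOTE_ADDR") not in self.allowed:
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return self.app(environ, start_response)


def hello_app(environ, start_response):
    """Stand-in for the real site, used only to exercise the middleware."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]


guarded = IPWhitelist(hello_app, ["203.0.113.7"])
```

With Flask, the wrapper could be attached with app.wsgi_app = IPWhitelist(app.wsgi_app, ["203.0.113.7"]) before calling app.run(host='0.0.0.0'), which keeps the dev server's automatic reloading. Behind a reverse proxy REMOTE_ADDR would be the proxy's address, so this sketch only suits a directly exposed dev server.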