Suzf  Blog


Deprecated: mysql_connect(): The mysql extension is deprecated and will be removed in the future

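Yesterday I turned on PHP's error log, and today the demo I wrote a while back for monitoring site traffic came up completely blank.
Running the data-collection script data_access.php by hand showed a warning in its output:
Deprecated: mysql_connect(): The mysql extension is deprecated and will be removed in the future: use mysqli or PDO instead
The warning gets mixed into the script's output and breaks the data structure the demo expects, hence the blank page.
The message says it plainly: the mysql extension will be removed in a future release, and mysqli or PDO should be used instead.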

Case A:
Replace mysql with mysqli

-$conn = mysql_connect("db-hostname","dbuser","password","dbname");
+$conn = mysqli_connect("db-hostname","dbuser","password","dbname");
 
-if (!$conn) {
-  die('Could not connect: ' . mysql_error());
+/* check connection */
+if (mysqli_connect_errno()) {
+    printf("Connect failed: %s\n", mysqli_connect_error());
+    exit();
 }

Official documentation: http://www.php.net/manual/zh/mysqli.query.php

Case B:
Turn off error display in php.ini. Change
display_errors = On
to
display_errors = Off

Case C:
Set the error reporting level in the PHP code so that E_DEPRECATED is excluded
error_reporting(E_ALL ^ E_DEPRECATED);

That makes the Deprecated warning go away. Still, I recommend replacing the old mysql extension with mysqli or PDO; that is where things are heading.

 

pip_faq - Cannot fetch index base URL

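I had just installed a fresh Ubuntu. Because of the many network restrictions inside the company, unauthenticated users are not allowed to download anything, so I set up a proxy with cntlm for my own use. I won't go into the cntlm configuration here; please look it up yourself.
Configure the global proxy:
$ tail -2 .profile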
export http_proxy=http://X.X.X.X:PORT
export https_proxy=http://X.X.X.X:PORT
Install paramiko:
$ sudo pip install paramiko
Downloading/unpacking paramiko
  Cannot fetch index base URL https://pypi.python.org/simple/
  Could not find any downloads that satisfy the requirement Flask
Cleaning up...
No distributions at all found for Flask
Storing debug log for failure in /home/lucy/.pip/pip.log
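Oops, that didn't work. After some Googling I ran the commands below and, to my delight, it worked. What was going on? The cache? The proxy? It looks like sudo was the cause: with its default env_reset policy, sudo starts the command with a reset environment, so the http_proxy and https_proxy variables exported in .profile never reach pip, while sudo -E preserves them.
$ rm -r ~/.pip/
$ sudo -E pip install paramiko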
Downloading/unpacking paramiko
  Downloading paramiko-1.16.0-py2.py3-none-any.whl (169kB): 169kB downloaded
Downloading/unpacking pycrypto>=2.1,!=2.4 (from paramiko)
  Downloading pycrypto-2.6.1.tar.gz (446kB): 446kB downloaded
  Running setup.py (path:/tmp/pip_build_root/pycrypto/setup.py) egg_info for package pycrypto
    
Downloading/unpacking ecdsa>=0.11 (from paramiko)
  Downloading ecdsa-0.13-py2.py3-none-any.whl (86kB): 86kB downloaded
  ... ...
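
To see whether the proxy settings actually reach a process started through sudo, a small script can print the proxy variables it inherited. This is only a sketch: the function name proxy_settings and the list of variable names are my own choices, and the script is meant to be run once with sudo and once with sudo -E for comparison.

```python
import os

# Variable names commonly honoured by command-line tools (assumed list).
PROXY_VARS = ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY", "no_proxy")


def proxy_settings(environ=None):
    """Return the proxy-related variables present in the given environment."""
    environ = os.environ if environ is None else environ
    return {name: environ[name] for name in PROXY_VARS if name in environ}


if __name__ == "__main__":
    found = proxy_settings()
    if found:
        for name, value in sorted(found.items()):
            print("%s=%s" % (name, value))
    else:
        print("no proxy variables set in this environment")
```

If the variables show up under sudo -E but not under plain sudo, the environment reset is what kept pip from reaching the index.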
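A few words about sudo
Users can also switch to root with su to run commands. However, unlike su, which starts a root shell in which every subsequent command runs as root, sudo grants temporary privileges for a single command. Because sudo grants privileges only when they are needed, it reduces the chance of damaging the system by running a command by mistake. sudo can also run commands as other users. In addition, sudo can log the commands users run, as well as failed attempts to gain privileges.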

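Options:
  -a type       use the specified BSD authentication type
  -b            run the command in the background
  -C fd         close all file descriptors >= fd
  -E            preserve the user environment when executing the command
  -e            edit files instead of running a command
  -g group      execute the command as the specified group
  -H            set the HOME variable to the target user's home directory
  -h            display the help message and exit
  -i [command]  run a login shell as the target user
  -K            remove the timestamp file completely
  -k            invalidate the timestamp file
  -l[l] command list the commands the user may run
  -n            non-interactive mode, will not prompt the user
  -P            preserve the group vector instead of setting it to the target's
  -p prompt     use the specified password prompt
  -S            read the password from standard input
  -s [command]  run a shell as the target user
  -U user       when listing, list the specified user's privileges
  -u user       run the command (or edit the file) as the specified user
  -V            display version information and exit
  -v            update the user's timestamp without running a command
  --            stop processing command line arguments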
 

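A brief look at the Python time module

I. Introduction

The time module provides various functions for working with time.
Note: there are generally two ways to represent a time:
The first is a timestamp (an offset in seconds from 1970-01-01 00:00:00 UTC); a timestamp is unique.
The second is a tuple-like struct_time with the nine elements listed below; the struct_time for the same timestamp differs from one time zone to another.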
year (four digits, e.g. 1998)
month (1-12)
day (1-31)
hours (0-23)
minutes (0-59)
seconds (0-59)
weekday (0-6, Monday is 0)
Julian day (day in the year, 1-366)
DST (Daylight Saving Time) flag (-1, 0 or 1), whether daylight saving time is in effect
If the DST flag is 0, the time is given in the regular time zone;
if it is 1, the time is given in the DST time zone;
if it is -1, mktime() should guess based on the date and time.
About daylight saving time: http://baike.baidu.com/view/100246.htm
About UTC: http://wenda.tianya.cn/wenda/thread?tid=283921a9da7c5aef&clk=wttpcts

II. Functions
1.asctime()
asctime([tuple]) -> string
Converts a struct_time (the current time by default) to a string
Convert a time tuple to a string, e.g. 'Sat Jun 06 16:26:11 1998'.
When the time tuple is not present, current time as returned by localtime()
is used.

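2.clock()
clock() -> floating point number
The behavior depends on the platform (the example output further down matches Windows):
on Windows, the first call starts a timer and returns a value close to 0;
every later call returns the wall-clock time elapsed since that first call.
On Unix, clock() returns the CPU time used by the process instead. time.clock() was deprecated in Python 3.3 and removed in Python 3.8; time.perf_counter() and time.process_time() replace it.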
Example:

import time
if __name__ == '__main__':
    time.sleep(1)
    print "clock1:%s" % time.clock()
    time.sleep(1)
    print "clock2:%s" % time.clock()
    time.sleep(1)
    print "clock3:%s" % time.clock()

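Output:
clock1:3.35238137808e-006
clock2:1.00004944763
clock3:2.00012040636
The first clock() value is close to zero because that call is what starts the timer; it is not the program's total running time (one second of sleep had already passed by then).
The second and third values are the time elapsed since the first clock() call.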
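
On current Python 3, where time.clock() no longer exists, the same kind of measurement can be sketched with time.perf_counter() (wall clock) and time.process_time() (CPU time). The helper name timed is my own; treat this as an illustration rather than a drop-in replacement for the example above.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # high-resolution wall clock
    cpu_start = time.process_time()    # CPU time of this process; sleeping is not counted
    result = func(*args, **kwargs)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return result, wall, cpu


if __name__ == "__main__":
    _, wall, cpu = timed(time.sleep, 0.2)
    print("wall: %.3fs, cpu: %.3fs" % (wall, cpu))
```

Sleeping shows up in the wall-clock figure but barely in the CPU figure, which is the same difference as between the Windows and Unix behavior of the old clock().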

3.sleep(...)
sleep(seconds)
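Suspends the calling thread for the given amount of time. Testing shows the unit is seconds. The help text also contains the following sentence, which simply means that fractional values such as 0.5 are accepted: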
“The argument may be a floating point number for subsecond precision.”

4.ctime(...)
ctime([seconds]) -> string
Converts a timestamp (the current time by default) to a time string
For example:
time.ctime()
returns: 'Sat Mar 28 22:24:24 2009'

5.gmtime(...)
gmtime([seconds]) -> (tm_year, tm_mon, tm_mday, tm_hour, tm_min, tm_sec, tm_wday, tm_yday, tm_isdst)
Converts a timestamp to a struct_time in UTC (time zone 0); if seconds is not given, the current time is used

6.localtime(...)
localtime([seconds]) -> (tm_year, tm_mon, tm_mday, tm_hour, tm_min, tm_sec, tm_wday, tm_yday, tm_isdst)
Converts a timestamp to a struct_time in the local time zone; if seconds is not given, the current time is used

7.mktime(...)
mktime(tuple) -> floating point number
Converts a struct_time, interpreted as local time, to a timestamp

8.strftime(...)
strftime(format[, tuple]) -> string
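Formats the given struct_time (the current time by default) according to the format string
Date and time format directives in Python:
%y two-digit year (00-99)
%Y four-digit year (e.g. 2009)
%m month (01-12)
%d day of the month (01-31)
%H hour on the 24-hour clock (00-23)
%I hour on the 12-hour clock (01-12)
%M minute (00-59)
%S second (00-61; values above 59 allow for leap seconds)

%a locale's abbreviated weekday name
%A locale's full weekday name
%b locale's abbreviated month name
%B locale's full month name
%c locale's date and time representation
%j day of the year (001-366)
%p locale's equivalent of AM or PM
%U week number of the year (00-53), with Sunday as the first day of the week
%w weekday (0-6), where 0 is Sunday
%W week number of the year (00-53), with Monday as the first day of the week
%x locale's date representation
%X locale's time representation
%Z name of the current time zone
%% a literal % character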

9.strptime(...)
strptime(string, format) -> struct_time
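Parses a time string into a struct_time according to the given format string
For example:
2009-03-20 11:45:39  corresponds to the format string: %Y-%m-%d %H:%M:%S
Sat Mar 28 22:24:24 2009 corresponds to the format string: %a %b %d %H:%M:%S %Y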

10.time(...)
time() -> floating point number
Returns the current time as a timestamp

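III. Open questions
1. Daylight saving time
In struct_time the DST flag seems to have no effect, for example
a = (2009, 6, 28, 23, 8, 34, 5, 87, 1)
b = (2009, 6, 28, 23, 8, 34, 5, 87, 0)
a and b represent daylight saving time and standard time respectively, so the timestamps they convert to should differ by 3600, yet both conversions return 646585714.0. The handling of tm_isdst is left to the platform's C mktime(); if the local time zone has no daylight saving rules there is typically no offset to apply, which would explain the identical results.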

IV. Small applications
1. Getting the current time in Python
time.time() returns the current timestamp
time.localtime() returns the current time as a struct_time
time.ctime() returns the current time as a string

2. Formatting a time as a string
Format as 2009-03-20 11:45:39
time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())

With no time argument, the current time is used by default
time.strftime('%Y-%m-%d %H:%M:%S')

Format as Sat Mar 28 22:24:24 2009
time.strftime("%a %b %d %H:%M:%S %Y", time.localtime())

3. Converting a time string to a timestamp
a = "Sat Mar 28 22:24:24 2009"
b = time.mktime(time.strptime(a,"%a %b %d %H:%M:%S %Y"))

4. Converting a time string to a struct_time
time.strptime('2015-09-16 10:27:36','%Y-%m-%d %H:%M:%S')
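
The conversions above use local time, so their results depend on the machine's time zone. As a sketch of a time-zone-independent round trip, the string can be treated as UTC and converted with calendar.timegm(), the UTC counterpart of time.mktime(). The two helper names are my own.

```python
import calendar
import time

FMT = "%Y-%m-%d %H:%M:%S"


def utc_string_to_timestamp(text, fmt=FMT):
    """Parse a time string as UTC and return the matching Unix timestamp."""
    parsed = time.strptime(text, fmt)   # struct_time without time zone information
    return calendar.timegm(parsed)      # interprets the struct_time as UTC


def timestamp_to_utc_string(ts, fmt=FMT):
    """Format a Unix timestamp as a UTC time string."""
    return time.strftime(fmt, time.gmtime(ts))


ts = utc_string_to_timestamp("2015-09-16 10:27:36")
print(ts)
print(timestamp_to_utc_string(ts))
```

Swapping calendar.timegm() for time.mktime() and time.gmtime() for time.localtime() gives the local-time variant used in the list above.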

 

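Pitfalls of parameter handling in MySQLdb

A few days ago yet another colleague fell into the trap of passing arguments to an SQL IN condition. Take SQL such as SELECT col1, col2 FROM table1 WHERE id IN (1, 2, 3): if the argument to IN is a list of variable length, how should it be passed?

I have seen at least these variants: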

id_list = [1, 2, 3]
cursor.execute('SELECT col1, col2 FROM table1 WHERE id IN (%s)', id_list)

This one fails with an error, because the number of placeholders and the number of arguments do not match when MySQLdb formats the SQL string.

id_list = [1, 2, 3]
cursor.execute('SELECT col1, col2 FROM table1 WHERE id IN (%s)', (id_list,))

This one is syntactically valid but semantically wrong, because the generated SQL is SELECT col1, col2 FROM table1 WHERE id IN (('1', '2', '3'))

id_list = [1, 2, 3]
id_list = ','.join([str(i) for i in id_list])
cursor.execute('SELECT col1, col2 FROM table1 WHERE id IN (%s)', id_list)

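This one is semantically wrong as well, because the generated SQL is SELECT col1, col2 FROM table1 WHERE id IN ('1,2,3')

These three are the most common mistakes when passing arguments to IN with MySQLdb for the first time. After hitting the first error and falling into the latter two traps, most people switch to the following: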

id_list = [1, 2, 3]
id_list = ','.join([str(i) for i in id_list])
cursor.execute('SELECT col1, col2 FROM table1 WHERE id IN (%s)' % id_list)

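This works for trusted arguments (for example a list you generate yourself: range(1, 10, 2)), but because the arguments are not escaped, it is open to SQL injection when the values come from untrusted user input.

You can never let your guard down against SQL injection, so an improved version followed: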

id_list = [1, 2, 3]
id_list = ','.join([str(cursor.connection.literal(i)) for i in id_list])
cursor.execute('SELECT col1, col2 FROM table1 WHERE id IN (%s)' % id_list)

This keeps SQL injection in check, but cursor.connection.literal is an internal interface and is not recommended for use from outside the library.

Which led to this approach:

id_list = [1, 2, 3]
arg_list = ','.join(['%s'] * len(id_list))
cursor.execute('SELECT col1, col2 FROM table1 WHERE id IN (%s)' % arg_list, id_list)

This approach first generates as many %s placeholders as there are arguments, producing SQL like 'SELECT col1, col2 FROM table1 WHERE id IN (%s,%s,%s)', and then passes the arguments the safe way.
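
The placeholder approach can be wrapped in a small helper so the expansion lives in one place. This is a sketch: build_in_query is a hypothetical name, not part of MySQLdb, and it assumes the base SQL contains exactly one %s and no other literal % characters.

```python
def build_in_query(base_sql, values):
    """Expand the single %s in base_sql into one %s placeholder per value.

    Returns (sql, params), ready to be passed to cursor.execute(sql, params).
    """
    values = list(values)
    if not values:
        # "IN ()" is not valid SQL, so refuse to build it.
        raise ValueError("cannot build an IN clause from an empty list")
    placeholders = ",".join(["%s"] * len(values))
    return base_sql % placeholders, values


sql, params = build_in_query(
    "SELECT col1, col2 FROM table1 WHERE id IN (%s)", [1, 2, 3])
print(sql)
print(params)
```

The values themselves never touch the SQL string; only the number of placeholders depends on the input, and the driver still escapes each value when cursor.execute(sql, params) runs.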

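All I wanted was to pass one argument; why is it this much trouble? How frustrating!

Correction: the content below was struck through in the original post. It is a wrong conclusion reached without sufficient testing and is kept only for the record:

I had always assumed MySQLdb did not support passing arguments to IN, until yet another colleague fell into this trap and I finally read the escaping code in MySQLdb. It turns out that MySQLdb converts automatically between many types of Python objects and the types SQL supports. For example, it escapes the elements of a list or tuple one by one and produces a tuple, so this is the correct way to pass arguments to IN: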

id_list = [1, 2, 3]
cursor.execute('SELECT col1, col2 FROM table1 WHERE id IN %s', (id_list,))

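The way MySQLdb processes the arguments can be simplified as:

  1. Escape the arguments (id_list,) to get (('1', '2', '3'),)
  2. Format the SQL with the escaped arguments: 'SELECT col1, col2 FROM table1 WHERE id IN %s' % (('1', '2', '3'),), which gives the complete SQL: SELECT col1, col2 FROM table1 WHERE id IN ('1', '2', '3')

As a rule of thumb: the argument to IN is a single unit just like any other argument, so stop obsessing over the () that are part of it.

A summary of the problems raised in the comments about this method:

  1. If the argument list has only one element, for example cursor.execute('SELECT col1, col2 FROM table1 WHERE id IN %s', ([1],)), the generated SQL is SELECT col1, col2 FROM table1 WHERE id IN ('1',), which is a syntax error
  2. The quotes added while escaping the list elements are left in place, so the result is wrong when the elements are strings. For example, cursor.execute('SELECT col1, col2 FROM table1 WHERE id IN %s', (["1", "2"],)) generates SELECT col1, col2 FROM table1 WHERE id IN ("'1'", "'2'"). Numeric arguments only happen to work because, when the column is defined as int and the argument is a string, MySQL performs an implicit type conversion (Type Conversion in Expression Evaluation).

MySQLdb can convert and escape many types of Python objects. If you are curious, look at MySQLdb.converters and the *_escape* family of functions in _mysql.c. MySQLdb also supports custom conversion rules; see the conv parameter of MySQLdb.connect.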

Source: the Internet

 

How to limit Flask dev server to only one visiting ip address

I'm developing a website using the Python Flask framework, pushing my changes to a remote dev server as I go. I set this remote dev server up to serve the website publicly using app.run(host='0.0.0.0').

This works fine, but I don't want other people to view my website yet. For this reason I want to whitelist my IP so that the dev server only serves the website to my own IP address and gives other IP addresses no response, a 404, or some other non-useful response. I could of course set up the server to use Apache or nginx to actually serve the website, but I like the automatic reloading on code changes while developing.
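
One possible way to do this, sketched here with nothing but the WSGI interface, is a small middleware that answers 404 to every client address not on a whitelist. The class name IPWhitelist, the stand-in demo_app and the address 203.0.113.7 are all made up for illustration; the post itself does not say which solution was chosen.

```python
class IPWhitelist(object):
    """WSGI middleware that only lets whitelisted client addresses through."""

    def __init__(self, app, allowed):
        self.app = app
        self.allowed = set(allowed)

    def __call__(self, environ, start_response):
        # REMOTE_ADDR is the peer address as seen by the WSGI server; behind a
        # reverse proxy it would be the proxy's address, not the visitor's.
        if environ.get("REMOTE_ADDR") in self.allowed:
            return self.app(environ, start_response)
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not Found"]


def demo_app(environ, start_response):
    """Stand-in for the real application (hypothetical)."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


guarded = IPWhitelist(demo_app, allowed=["203.0.113.7"])
```

With Flask, the wrapper would be applied as app.wsgi_app = IPWhitelist(app.wsgi_app, allowed=[...]) before calling app.run(host='0.0.0.0'), which keeps the dev server's automatic reloading.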

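A general fix for garbled Chinese text when scraping web pages with Python

Web pages do not share a single encoding, so a scraper runs into all kinds of page encodings.

The problem is that, because the encodings differ, the Chinese text you scrape often comes out garbled. How should that be handled?

ENV: Python 2.7.x

The most convenient way to find out a page's encoding is the chardet library (https://pypi.python.org/pypi/chardet/).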

#!/usr/bin/env python
# encoding:utf-8

import chardet
import urllib2

line = "http://suzf.net"
html_old = urllib2.urlopen(line, timeout=30).read()

# chardet guesses the encoding from the raw bytes
mychar = chardet.detect(html_old)

# the detected name may be None or in upper case, so normalize it
coding = (mychar['encoding'] or 'utf-8').lower()
print coding

if coding == 'utf-8':
    html = html_old
elif coding in ('gbk', 'gb2312'):
    # gbk is a superset of gb2312, so it can decode both
    html = html_old.decode('gbk', 'ignore').encode('utf-8')
else:
    # any other detected encoding: decode with it and re-encode as UTF-8
    html = html_old.decode(coding, 'ignore').encode('utf-8')

print html
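
The script above targets Python 2.7. As a sketch of the same idea on Python 3, the decoding step can be isolated in a function that takes the raw bytes plus the encoding name reported by a detector such as chardet.detect(raw)['encoding'], and falls back to other codecs when that guess fails. The name to_utf8 and the fallback list are my own choices.

```python
def to_utf8(raw, detected=None, fallbacks=("utf-8", "gb18030")):
    """Decode raw bytes and return them re-encoded as UTF-8 bytes."""
    candidates = []
    if detected:
        name = detected.lower()
        # gb18030 extends gbk, which extends gb2312, so use the widest codec.
        candidates.append("gb18030" if name in ("gb2312", "gbk") else name)
    candidates.extend(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding).encode("utf-8")
        except (UnicodeDecodeError, LookupError):
            continue   # wrong guess or unknown codec name: try the next one
    # Last resort: keep what can be decoded and mark the rest.
    return raw.decode("utf-8", "replace").encode("utf-8")


page = u"中文网页".encode("gbk")
print(to_utf8(page, "GB2312").decode("utf-8"))
```

Trying the detected encoding first and strict fallbacks afterwards avoids silently dropping characters, which the 'ignore' error handler in the original script would do.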

 

 

pip-faq: Error -5 while decompressing data: incomplete or truncated stream

Running `pip install flask-bootstrap` gave me this error:
-- error: Error -5 while decompressing data: incomplete or truncated stream

Installing and uninstalling other packages worked fine. Only flask-bootstrap triggered this error.

Version information:
#pip --version
pip 7.1.2 from /usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg (python 2.7)
#python --version
Python 2.7.3

The full error message:

^_^[15:36:31][root@master01 ~]#pip install flask-bootstrap
Collecting flask-bootstrap
Exception:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/basecommand.py", line 211, in main
    status = self.run(options, args)
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/commands/install.py", line 294, in run
    requirement_set.prepare_files(finder)
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/req/req_set.py", line 334, in prepare_files
    functools.partial(self._prepare_file, finder))
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/req/req_set.py", line 321, in _walk_req_to_install
    more_reqs = handler(req_to_install)
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/req/req_set.py", line 461, in _prepare_file
    req_to_install.populate_link(finder, self.upgrade)
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/req/req_install.py", line 250, in populate_link
    self.link = finder.find_requirement(self, upgrade)
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/index.py", line 486, in find_requirement
    all_versions = self._find_all_versions(req.name)
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/index.py", line 404, in _find_all_versions
    index_locations = self._get_index_urls_locations(project_name)
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/index.py", line 378, in _get_index_urls_locations
    page = self._get_page(main_index_url)
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/index.py", line 818, in _get_page
    return HTMLPage.get_page(link, session=self.session)
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/index.py", line 928, in get_page
    "Cache-Control": "max-age=600",
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/_vendor/requests/sessions.py", line 477, in get
    return self.request('GET', url, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/download.py", line 373, in request
    return super(PipSession, self).request(method, url, *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/_vendor/requests/sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/_vendor/requests/sessions.py", line 573, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/_vendor/cachecontrol/adapter.py", line 36, in send
    cached_response = self.controller.cached_request(request)
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/_vendor/cachecontrol/controller.py", line 102, in cached_request
    resp = self.serializer.loads(request, self.cache.get(cache_url))
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/_vendor/cachecontrol/serialize.py", line 108, in loads
    return getattr(self, "_loads_v{0}".format(ver))(request, data)
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/_vendor/cachecontrol/serialize.py", line 164, in _loads_v2
    cached = json.loads(zlib.decompress(data).decode("utf8"))
error: Error -5 while decompressing data: incomplete or truncated stream

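It turned out that pip's local cache was corrupted (in my environment it lives in ~/.cache/pip by default).
As a test I ran `pip install --no-cache-dir flask-bootstrap`, and it worked.
To confirm that the cache was the culprit, I ran:

pip uninstall flask-bootstrap
rm -rf ~/.cache/pip/*
pip install flask-bootstrap

or: pip install --no-cache-dir flask-bootstrap

This time it succeeded, whereas it had always failed before.
I can't say for certain how the cache got into that state, but my guess is that a pip download was interrupted, leaving corrupted data in the cache.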
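
The last frame of the traceback decompresses a cached entry with zlib and parses it as JSON. A minimal sketch can reproduce the same zlib error by cutting a compressed payload in half, which is roughly what an interrupted write would leave behind. The payload and the function name load_cached are made up for illustration; this is not pip's actual cache code.

```python
import json
import zlib

# A made-up cache entry, compressed the way the traceback suggests.
payload = zlib.compress(json.dumps({"status": 200, "body": "x" * 1000}).encode("utf8"))


def load_cached(data):
    """Mimic the decompress-then-parse step from the last traceback frame."""
    return json.loads(zlib.decompress(data).decode("utf8"))


print(load_cached(payload)["status"])   # an intact entry loads fine

try:
    load_cached(payload[:len(payload) // 2])   # simulate an interrupted write
except zlib.error as exc:
    print("corrupted cache entry:", exc)
```

Since the cached bytes are read back on every run, the failure repeats until the entry is deleted or the cache is bypassed, which matches what --no-cache-dir and clearing ~/.cache/pip achieved.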

 

Python debugging tools

This is an overview of the tools and practices I've used for debugging or profiling purposes. It is not necessarily complete; there are so many tools that I'm listing only what I think is best or relevant. If you know better tools or have other preferences, please comment below.

Logging

Yes, really. Can't stress enough how important it is to have adequate logging in your application. You should log important stuff. If your logging is good enough, you can figure out the problem just from the logs. Lots of time saved right there.

If you ever litter your code with print statements, stop now. Use logging.debug instead. You'll be able to reuse that later, disable it altogether and so on ...
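
As a small sketch of what replacing print with logging.debug can look like: the function, the logger name myapp.orders and the log format below are invented for illustration.

```python
import logging

logger = logging.getLogger("myapp.orders")   # hypothetical module name


def apply_discount(price, percent):
    """Return price reduced by percent, logging the inputs and the result."""
    logger.debug("apply_discount(price=%r, percent=%r)", price, percent)
    result = price * (100 - percent) / 100.0
    logger.debug("apply_discount -> %r", result)
    return result


if __name__ == "__main__":
    # Switch the level to logging.INFO to silence the debug lines again.
    logging.basicConfig(
        level=logging.DEBUG,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )
    apply_discount(80, 25)
```

The debug calls stay in the code; whether they produce output is decided once, in the logging configuration, instead of by commenting print statements in and out.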

Tracing

Sometimes it's better to see what gets executed. You could run step-by-step using some IDE's debugger but you would need to know what you're looking for, otherwise the process will be very slow.

In the stdlib there's a trace module which can print all the executed lines, among other things (like making coverage reports)

python -mtrace --trace script.py

This will make lots of output (every line executed will be printed so you might want to pipe it through grep to only see the interesting modules). Eg:

python -mtrace --trace script.py | egrep '^(mod1.py|mod2.py)'

Alternatives

Grepping for relevant output is not fun. Plus, the trace module doesn't show you any variables.

Hunter is a flexible alternative that allows filtering and even shows variables of your choosing. Just pip install hunter and run:

PYTHON_HUNTER="F(module='mod1'),F(module='mod2')" python script.py

Take a look at the project page for more examples.

If you're feeling adventurous then you could try smiley - it shows you the variables and you can use it to trace programs remotely.

Alternatively, if you want very selective tracing you can use aspectlib.debug.log to make existing or 3rd party code emit traces.

PDB

Very basic intro, everyone should know this by now:

import pdb
pdb.set_trace() # opens up pdb prompt

Or:

try:
    code
    that
    fails
except:
    import pdb
    pdb.post_mortem() # debugs the exception being handled; pdb.pm() only works after an uncaught exception

Or (press c to start the script):

python -mpdb script.py

Once in the REPL do:

  • c or continue
  • q or quit
  • l or list, shows source at the current frame
  • w or where, shows the traceback
  • d or down, goes down 1 frame on the traceback
  • u or up, goes up 1 frame on the traceback
  • <enter>, repeats last command
  • ! <stuff>, evaluates <stuff> as python code on the current frame
  • everything else, evaluates as python code if it's not a PDB command

Better PDB

Drop in replacements for pdb:

  • ipdb (pip install ipdb) - like ipython (autocomplete, colors etc).
  • pudb (pip install pudb) - curses based (gui-like), good at browsing sourcecode.
  • pdb++ (pip install pdbpp) - autocomplete, colors, extra commands etc.

Remote PDB

sudo apt-get install winpdb

Instead of pdb.set_trace() do:

import rpdb2
rpdb2.start_embedded_debugger("secretpassword")

Now run winpdb and go to File > Attach with the password.

Don't like Winpdb? Use PDB over TCP

Get remote-pdb and then, to open a remote PDB on first available port, use:

from remote_pdb import set_trace
set_trace() # you'll see the port number in the logs

To use some specific host/port:

from remote_pdb import RemotePdb
RemotePdb(host='0.0.0.0', port=4444).set_trace()

To connect just run something like telnet 192.168.12.34 4444. Alternatively, run socat readline tcp:192.168.12.34:4444 to get line editing and history.

Just a REPL

If you don't need a full blown debugger then just start a IPython with:

import IPython
IPython.embed()

If you don't have an attached terminal you can use manhole.

Standard Linux tools

I'm always surprised by how underused they are. You can figure out a wide range of problems with these: from performance problems (too many syscalls, memory allocations etc) to deadlocks, network issues, disk issues etc

The most useful is strace: just run sudo strace -p 12345 or strace -f command (-f means strace forked processes too) and you're set. Output is generally very large so you might want to redirect it to a file (just add &> somefile) for more analysis.

Then there's ltrace, it's just like strace but with library calls. Arguments are mostly the same.

And lsof for figuring out what the file descriptor numbers you see in ltrace / strace are for. Eg: lsof -p 12345

Better tracing

It's so easy to use and can do so many things - everyone should have htop installed!

sudo apt-get install htop
sudo htop

Now find the process you want, and press:

  • s for system call trace (strace)
  • L for library call trace (ltrace)
  • l for lsof

Monitoring

There's no replacement for good, continuous server monitoring but if you ever find yourself in that weird spot scrambling to find out why everything is slow and where the resources are going ... don't bother with iotop, iftop, htop, iostat, vmstat etc just yet, start with dstat instead! It can do most of what the aforementioned tools do, and maybe better!

It will show you data continuously, in a compact, color-coded fashion (unlike iostat, vmstat) and you can always see past data (unlike iftop, iotop, htop).

Just run this:

dstat --cpu --io --mem --net --load --fs --vm --disk-util --disk-tps --freespace --swap --top-io --top-bio-adv

There's probably a shorter way to write it but then again, shell history or aliases.

GDB

This one is a rather complicated and powerful tool, but I'm only covering the basic stuff (setup and basic commands).

sudo apt-get install gdb python-dbg
zcat /usr/share/doc/python2.7/gdbinit.gz > ~/.gdbinit
run app with python2.7-dbg
sudo gdb -p 12345

Now use:

  • bt - stacktrace (C level)
  • pystack - python stacktrace, you need to have ~/.gdbinit and use python-dbg unfortunately
  • c (continue)

Worthy mentions

  • sysdig - like strace and lsof but with superpowers.

Having segfaults? faulthandler

Rather awesome addition from Python 3.3, backported to Python 2.x

Just do this and you'll get at least an idea of what's causing the segmentation fault. Just add this in some module that's always imported:

import faulthandler
faulthandler.enable()

This won't work in PyPy unfortunately. If you can't get interactive (e.g.: use gdb) you can just set this environment variable (GNU libc only, details):

export LD_PRELOAD=/lib/x86_64-linux-gnu/libSegFault.so

Make sure the path is correct - otherwise it won't have any effect (e.g.: run locate libSegFault.so).

Quick stacktrace on a signal? faulthandler

Add this in some module that's always imported:

import faulthandler
import signal
faulthandler.register(signal.SIGUSR2, all_threads=True)

Then run kill -USR2 <pid> to get a stacktrace for all threads on the process's stderr.

Memory leaks

Well, there are plenty of tools here, some specialized on WSGI applications like Dozer, but my favorite is definitely objgraph. It's so convenient and easy to use it's amazing. It doesn't have any integration with WSGI or anything so you need to find yourself a way to run code like:

>>> import objgraph
>>> objgraph.show_most_common_types() # try to find objects to investigate
Request                  119105
function                   7413
dict                       2492
tuple                      2396
wrapper_descriptor         1324
weakref                    1291
list                       1234
cell                       1011
>>> objs = objgraph.by_type("Request")[:15] # select few Request objects
>>> objgraph.show_backrefs(objs, max_depth=15, highlight=lambda v: v in objs, filename="/tmp/graph.png") # and plot them
Graph written to /tmp/objgraph-zbdM4z.dot (107 nodes)
Image generated as /tmp/graph.png

And you get a nice diagram like this (warning: it's very large). You can also get dot output.

Memory usage

Sometimes you want to use less memory. Fewer allocations usually make applications faster and well, users like them lean and mean :)

There are lots of tools [1] but the best one in my opinion is pytracemalloc - it has very little overhead (doesn't need to rely on the speed crippling sys.settrace) compared to other tools and its output is very detailed. It's a pain to set up because you need to recompile python but apt makes it very easy to do so. In fact, it is so good that it got included in Python 3.4. See PEP-454 for details.

Just run these commands and go grab lunch or something:

apt-get source python2.7
cd python2.7-*
wget https://github.com/wyplay/pytracemalloc/raw/master/python2.7_track_free_list.patch
patch -p1 < python2.7_track_free_list.patch
debuild -us -uc
cd ..
sudo dpkg -i python2.7-minimal_2.7*.deb python2.7-dev_*.deb

Alternatively, you can use this ppa but I think it might be outdated by now. You can make your own ppa, it's easy enough.

And install pytracemalloc (note that if you're doing this in a virtualenv, you need to recreate it after the python re-install - just run virtualenv myenv):

pip install pytracemalloc

Now wrap your application in code like this:

import tracemalloc, time
tracemalloc.enable()
top = tracemalloc.DisplayTop(
    5000, # log the top 5000 locations
    file=open('/tmp/memory-profile-%s' % time.time(), "w")
)
top.show_lineno = True
try:
    pass # replace with the code that needs to be traced
finally:
    top.display()

And output is like this:

2013-05-31 18:05:07: Top 5000 allocations per file and line
#1: .../site-packages/billiard/_connection.py:198: size=1288 KiB, count=70 (+0), average=18 KiB
#2: .../site-packages/billiard/_connection.py:199: size=1288 KiB, count=70 (+0), average=18 KiB
#3: .../python2.7/importlib/__init__.py:37: size=459 KiB, count=5958 (+0), average=78 B
#4: .../site-packages/amqp/transport.py:232: size=217 KiB, count=6960 (+0), average=32 B
#5: .../site-packages/amqp/transport.py:231: size=206 KiB, count=8798 (+0), average=24 B
#6: .../site-packages/amqp/serialization.py:210: size=199 KiB, count=822 (+0), average=248 B
#7: .../lib/python2.7/socket.py:224: size=179 KiB, count=5947 (+0), average=30 B
#8: .../celery/utils/term.py:89: size=172 KiB, count=1953 (+0), average=90 B
#9: .../site-packages/kombu/connection.py:281: size=153 KiB, count=2400 (+0), average=65 B
#10: .../site-packages/amqp/serialization.py:462: size=147 KiB, count=4704 (+0), average=32 B

...

Beautiful, no?
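
On Python 3.4 and later no patched interpreter is needed, because tracemalloc ships in the standard library. Its API differs from the pytracemalloc calls shown above (tracemalloc.start() and snapshots instead of enable() and DisplayTop). A minimal sketch, with a made-up workload function build_table:

```python
import tracemalloc


def build_table(rows):
    """Allocate some memory worth measuring (hypothetical workload)."""
    return [[str(i) * 10 for i in range(50)] for _ in range(rows)]


tracemalloc.start()
table = build_table(200)
snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

# Group the traced allocations by file and line, largest first.
top = snapshot.statistics("lineno")
for stat in top[:5]:
    print(stat)
```

Each printed line names a file and line number together with the size and count of the allocations made there, much like the per-line listing above.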

[1] pytracemalloc alternatives.

EDIT: More about profiling here.

Author: Ionel Cristian Mărieș
Link: python-debugging-tools

Chinese translation: 我常用的 Python 调试工具 (Python debugging tools I use often)
Related link: Python 代码调试技巧 (Python code debugging tips)

 

Python CSV examples

Reference: The Python Standard Library CSV

Generating a CSV file with Python

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import csv

# in 'wb', w means write mode and b means binary mode
csv_file = open('test.csv', 'wb')
writer = csv.writer(csv_file)

# write a single row
writer.writerow(['Name', 'Age', 'Sex'])

data = [
    ('Lisa', 18, 'female'),
    ('jack', 20, 'male'),
    ('Danny', 19, 'female'),
]
# write several rows at once
writer.writerows(data)

csv_file.close()


"""
spamwriter = csv.writer(csvfile, dialect='excel')
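To make the generated CSV file open in Excel without garbled characters, pass dialect='excel' (which is already the default dialect of csv.writer).
The CSV file generated here did not use the dialect parameter. The "Excel" used was WPS and the Python version is 2.7.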
"""

Output

^_^[15:43:21][root@master01 ~]#cat test.csv 
Name,Age,Sex
Lisa,18,female
jack,20,male
Danny,19,female


Reading the CSV file generated by Python

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import csv

# in 'rb', r means read mode and b means binary mode
csv_file = open('test.csv', 'rb')

reader = csv.reader(csv_file)

for line in reader:
    print line

csv_file.close()

Output

['Name', 'Age', 'Sex']
['Lisa', '18', 'female']
['jack', '20', 'male']
['Danny', '19', 'female']

Reading a CSV file exported from Excel with Python
Export the Excel file in CSV format, then read the data with Python

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import csv
with open('test_dos.csv', 'rb') as csv_file:
    # the with block closes the file automatically, no close() needed
    rows = csv.reader(csv_file, dialect='excel')
    for row in rows:
        print ', '.join(row)

Output

Name, Age, Sex
Lisa, 18, female
jack, 20, male
Danny, 19, female
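
The examples above are written for Python 2.7. On Python 3 the csv module works on text files opened with newline='' rather than in binary mode. A sketch of the same write and read round trip, using a temporary directory instead of the current one (my own choice, to keep the example self-contained) and csv.DictReader to get rows keyed by the header:

```python
import csv
import os
import tempfile

rows = [("Lisa", 18, "female"), ("jack", 20, "male"), ("Danny", 19, "female")]
path = os.path.join(tempfile.mkdtemp(), "test.csv")

# newline="" lets the csv module control the line endings itself
with open(path, "w", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle)
    writer.writerow(["Name", "Age", "Sex"])
    writer.writerows(rows)

with open(path, newline="", encoding="utf-8") as handle:
    records = list(csv.DictReader(handle))   # one dict per data row

for record in records:
    print(record["Name"], record["Age"], record["Sex"])
```

As with csv.reader, every value comes back as a string, so the ages need int() before any arithmetic.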