Suzf Blog

Python 爬取中文网页乱码的通用解决办法

Jeffrey Feb 02, 2016 Python

由于网页编码不统一，我们会遇到各式各样的网页编码格式。

那么问题来了，由于编码格式不统一，爬取的中文信息往往是不尽任意的，那么如何处理它呢？

ENV: Python 2.7.x

如何取得网页的编码，用chardet库最方便(https://pypi.python.org/pypi/chardet/)。

#!/usr/bin/env python
# encoding:utf-8

import chardet
import urllib2

line = "http://suzf.net"
html_old = urllib2.urlopen(line, timeout=30).read()

mychar = chardet.detect(html_old)

coding = mychar['encoding']
print coding

if coding == 'utf-8' or coding == 'UTF-8':
    html = html_old
elif coding == 'gbk' or coding == 'GBK':
    html = html_old.decode('gbk', 'ignore').encode('utf-8')
elif coding == 'gb2312':
    html = html_old.decode('gb2312', 'ignore').encode('utf-8')

print html

pip-faq: Error -5 while decompressing data: incomplete or truncated stream

Jeffrey Dec 30, 2015 Python

在我执行 `pip install flask-bootstrap` 出现了一个这样的错误
-- error: Error -5 while decompressing data: incomplete or truncated stream

安装/卸载其他包是正常的。唯独管理flask-bootstrap 出现了这样的错误。

版本信息：
#pip --version
pip 7.1.2 from /usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg (python 2.7)
#python --version
Python 2.7.3

完整的报错信息是：

^_^[15:36:31][root@master01 ~]#pip install flask-bootstrap
Collecting flask-bootstrap
Exception:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/basecommand.py", line 211, in main
    status = self.run(options, args)
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/commands/install.py", line 294, in run
    requirement_set.prepare_files(finder)
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/req/req_set.py", line 334, in prepare_files
    functools.partial(self._prepare_file, finder))
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/req/req_set.py", line 321, in _walk_req_to_install
    more_reqs = handler(req_to_install)
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/req/req_set.py", line 461, in _prepare_file
    req_to_install.populate_link(finder, self.upgrade)
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/req/req_install.py", line 250, in populate_link
    self.link = finder.find_requirement(self, upgrade)
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/index.py", line 486, in find_requirement
    all_versions = self._find_all_versions(req.name)
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/index.py", line 404, in _find_all_versions
    index_locations = self._get_index_urls_locations(project_name)
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/index.py", line 378, in _get_index_urls_locations
    page = self._get_page(main_index_url)
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/index.py", line 818, in _get_page
    return HTMLPage.get_page(link, session=self.session)
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/index.py", line 928, in get_page
    "Cache-Control": "max-age=600",
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/_vendor/requests/sessions.py", line 477, in get
    return self.request('GET', url, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/download.py", line 373, in request
    return super(PipSession, self).request(method, url, *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/_vendor/requests/sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/_vendor/requests/sessions.py", line 573, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/_vendor/cachecontrol/adapter.py", line 36, in send
    cached_response = self.controller.cached_request(request)
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/_vendor/cachecontrol/controller.py", line 102, in cached_request
    resp = self.serializer.loads(request, self.cache.get(cache_url))
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/_vendor/cachecontrol/serialize.py", line 108, in loads
    return getattr(self, "_loads_v{0}".format(ver))(request, data)
  File "/usr/local/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg/pip/_vendor/cachecontrol/serialize.py", line 164, in _loads_v2
    cached = json.loads(zlib.decompress(data).decode("utf8"))
error: Error -5 while decompressing data: incomplete or truncated stream

原来在PIP的本地缓存损坏了（在我这里的环境中，默认情况下在 ~/.cache/pip）。
我测试了一下，试图执行 `pip install --no-cache-dir flask-bootstrap`,它工作了。
为了确认这是高速缓存，我执行：

pip uninstall flask-bootstrap
rm -rf ~/.cache/pip/*`
pip install flask-bootstrap

或者 pip install --no-cache-dir install flask-bootstrap

这次它成功了，而它之前总是失败。
我不知道该这个问题是否跟缓存的问题有关。但我的猜测是，PIP被中断下载导致缓存数据被破坏。

使用 VIM 打造 Python 开发IDE

Jeffrey Dec 25, 2015 Python

编程常用的文本编辑器就那么几种常见的, 有人喜欢Vim, 有人喜欢emacs, 也有人喜欢IDE, 例如Pycharm, eclipse等. 今天我们不谈孰优孰劣, 只要适合自己就可以了.

如果你喜欢VIM, 又希望有IDE常见的功能. 你完全可以将这些功能集成到Vim中. 但是, 对于一个初学者, 或像我一样的懒人, 一个一个的查找并试验配置这些插件未免有些太麻烦. 因此, 本文介绍 spf13-vim, 可以简单的满足我们的需要.

This is an overview of the tools and practices I've used for debugging or profiling purposes. This is not necessarily complete, there are so many tools so I'm listing only what I think is best or relevant. If you know better tools or have other preferences, please comment below.

Logging

Yes, really. Can't stress enough how important it is to have adequate logging in your application. You should log important stuff. If your logging is good enough, you can figure out the problem just from the logs. Lots of time saved right there.

If you do ever litter your code with print statements stop now. Use logging.debug instead. You'll be able to reuse that later, disable it altogether and so on ...

Tracing

Sometimes it's better to see what gets executed. You could run step-by-step using some IDE's debugger but you would need to know what you're looking for, otherwise the process will be very slow.

In the stdlib there's a trace module which can print all the executed lines amongst other this (like making coverage reports)

python -mtrace --trace script.py

This will make lots of output (every line executed will be printed so you might want to pipe it through grep to only see the interesting modules). Eg:

python -mtrace --trace script.py | egrep '^(mod1.py|mod2.py)'

Alternatives

Grepping for relevant output is not fun. Plus, the trace module doesn't show you any variables.

Hunter is a flexible alternative that allows filtering and even shows variables of your choosing. Just pip install hunter and run:

PYTHON_HUNTER="F(module='mod1'),F(module='mod2')" python script.py

Take a look at the project page for more examples.

If you're feeling adventurous then you could try smiley - it shows you the variables and you can use it to trace programs remotely.

Alternativelly, if you want very selective tracing you can use aspectlib.debug.log to make existing or 3rd party code emit traces.

PDB

Very basic intro, everyone should know this by now:

import pdb
pdb.set_trace() # opens up pdb prompt

Or:

try:
    code
    that
    fails
except:
    import pdb
    pdb.pm() # or pdb.post_mortem()

Or (press c to start the script):

python -mpdb script.py

Once in the REPL do:

c or continue
q or quit
l or list, shows source at the current frame
w or where, shows the traceback
d or down, goes down 1 frame on the traceback
u or up, goes up 1 frame on the traceback
<enter>, repeats last command
! <stuff>, evaluates <stuff> as python code on the current frame
everything else, evaluates as python code if it's not a PDB command

Better PDB

Drop in replacements for pdb:

ipdb (pip install ipdb) - like ipython (autocomplete, colors etc).
pudb (pip install pudb) - curses based (gui-like), good at browsing sourcecode.
pdb++ (pip install pdbpp) - autocomplete, colors, extra commands etc.

Remote PDB

sudo apt-get install winpdb

Instead of pdb.set_trace() do:

import rpdb2
rpdb2.start_embedded_debugger("secretpassword")

Now run winpdb and go to File > Attach with the password.

Don't like Winpdb? Use PDB over TCP

Get remote-pdb and then, to open a remote PDB on first available port, use:

from remote_pdb import set_trace
set_trace() # you'll see the port number in the logs

To use some specific host/port:

from remote_pdb import RemotePdb
RemotePdb(host='0.0.0.0', port=4444).set_trace()

To connect just run something like telnet 192.168.12.34 4444. Alternatively, run socat socat readline tcp:192.168.12.34:4444 to get line editing and history.

Just a REPL

If you don't need a full blown debugger then just start a IPython with:

import IPython
IPython.embed()

If you don't have an attached terminal you can use manhole.

Standard Linux tools

I'm always surprised of how underused they are. You can figure out a wide range of problems with these: from performance problems (too many syscalls, memory allocations etc) to deadlocks, network issues, disk issues etc

The most useful is downright strace, just run sudo strace -p 12345 or strace -f command (-f means strace forked processes too) and you're set. Output is generally very large so you might want to redirect it to a file (just add &> somefile) for more analysis.

Then there's ltrace, it's just like strace but with library calls. Arguments are mostly the same.

And lsof for figuring out what the handler numbers you see in ltrace / strace are for. Eg: lsof -p 12345

Better tracing

It's so easy to use and can do so many things - everyone should have htop installed!

sudo apt-get install htop
sudo htop

Now find the process you want, and press:

s for system call trace (strace)
L for library call trace (ltrace)
l for lsof

Monitoring

There's no replacement for good, continuous server monitoring but if you ever find yourself in that weird spot scrambling to find out why everything is slow and where are the resources going ... don't bother with iotop, iftop, htop, iostat, vmstat etc just yet, start with dstat instead! It can do most of the aforementioned tools do and maybe better!

It will show you data continuously, in a compact, color-coded fashion (unlike iostat, vmstat) and you can always see past data (unlike iftop, iotop, htop).

Just run this:

dstat --cpu --io --mem --net --load --fs --vm --disk-util --disk-tps --freespace --swap --top-io --top-bio-adv

There's probably a shorter way to write it but then again, shell history or aliases.

GDB

This one is a rather complicated and powerful tool, but I'm only covering the basic stuff (setup and basic commands).

sudo apt-get install gdb python-dbg
zcat /usr/share/doc/python2.7/gdbinit.gz > ~/.gdbinit
run app with python2.7-dbg
sudo gdb -p 12345

Now use:

bt - stacktrace (C level)
pystack - python stacktrace, you need to have ~/.gdbinit and use python-dbg unfortunately
c (continue)

Worthy mentions

sysdig - like strace and lsof but with superpowers.

Having segfaults? faulthandler

Rather awesome addition from Python 3.3, backported to Python 2.x

Just do this and you'll get at least an idea of what's causing the segmentation fault. Just add this in some module that's always imported:

import faulthandler
faulthandler.enable()

This won't work in PyPy unfortunately. If you can't get interactive (e.g.: use gdb) you can just set this environment variable (GNU libc only, details):

export LD_PRELOAD=/lib/x86_64-linux-gnu/libSegFault.so

Make sure the path is correct - otherwise it won't have any effect (e.g.: run locate libSegFault.so).

Quick stacktrace on a signal? faulthandler

Add this in some module that's always imported:

import faulthandler
import signal
faulthandler.register(signal.SIGUSR2, all_threads=True)

Then run kill -USR2 <pid> to get a stacktrace for all threads on the process's stderr.

Memory leaks

Well, there's are plenty of tools here, some specialized on WSGI applications like Dozer but my favorite is definitely objgraph. It's so convenient and easy to use it's amazing. It's doesn't have any integration with WSGI or anything so you need to find yourself a way to run code like:

>>> import objgraph
>>> objgraph.show_most_common_types() # try to find objects to investigate
Request                  119105
function                   7413
dict                       2492
tuple                      2396
wrapper_descriptor         1324
weakref                    1291
list                       1234
cell                       1011
>>> objs = objgraph.by_type("Request")[:15] # select few Request objects
>>> objgraph.show_backrefs(objs, max_depth=15, highlight=lambda v: v in objs, filename="/tmp/graph.png") # and plot them
Graph written to /tmp/objgraph-zbdM4z.dot (107 nodes)
Image generated as /tmp/graph.png

And you get a nice diagram like this (warning: it's very large). You can also get dot output.

Memory usage

Sometimes you want to use less memory. Less allocations usually make applications faster and well, users like them lean and mean :)

There are lots of tools [1] but the best one in my opinion is pytracemalloc - it has very little overhead (doesn't need to rely on the speed crippling sys.settrace) compared to other tools and it's output is very detailed. It's a pain to setup because you need to recompile python but apt makes it very easy to do so. In fact, it is so good that it got included in Python 3.4. See PEP-454 for details.

Just run these commands and go grab lunch or something:

apt-get source python2.7
cd python2.7-*
wget https://github.com/wyplay/pytracemalloc/raw/master/python2.7_track_free_list.patch
patch -p1 < python2.7_track_free_list.patch
debuild -us -uc
cd ..
sudo dpkg -i python2.7-minimal_2.7*.deb python2.7-dev_*.deb

Alternativelly, you can use this ppa but I think it might be outdated by now. You can make your own ppa, it's easy enough.

And install pytracemalloc (note that if you're doing this in a virtualenv, you need to recreate it after the python re-install - just run virtualenv myenv):

pip install pytracemalloc

Now wrap your application in code like this:

import tracemalloc, time
tracemalloc.enable()
top = tracemalloc.DisplayTop(
    5000, # log the top 5000 locations
    file=open('/tmp/memory-profile-%s' % time.time(), "w")
)
top.show_lineno = True
try:
    # code that needs to be traced
finally:
    top.display()

And output is like this:

2013-05-31 18:05:07: Top 5000 allocations per file and line
#1: .../site-packages/billiard/_connection.py:198: size=1288 KiB, count=70 (+0), average=18 KiB
#2: .../site-packages/billiard/_connection.py:199: size=1288 KiB, count=70 (+0), average=18 KiB
#3: .../python2.7/importlib/__init__.py:37: size=459 KiB, count=5958 (+0), average=78 B
#4: .../site-packages/amqp/transport.py:232: size=217 KiB, count=6960 (+0), average=32 B
#5: .../site-packages/amqp/transport.py:231: size=206 KiB, count=8798 (+0), average=24 B
#6: .../site-packages/amqp/serialization.py:210: size=199 KiB, count=822 (+0), average=248 B
#7: .../lib/python2.7/socket.py:224: size=179 KiB, count=5947 (+0), average=30 B
#8: .../celery/utils/term.py:89: size=172 KiB, count=1953 (+0), average=90 B
#9: .../site-packages/kombu/connection.py:281: size=153 KiB, count=2400 (+0), average=65 B
#10: .../site-packages/amqp/serialization.py:462: size=147 KiB, count=4704 (+0), average=32 B

...

Beautiful, no?

[1] pytracemalloc alternatives.

EDIT: More about profiling here.

Author: Ionel Cristian Mărieș
Link: python-debugging-tools

中文译文：我常用的 Python 调试工具
相关连接： Python 代码调试技巧

Python CSV 操作实例

Jeffrey Nov 09, 2015 Python

Reference: The Python Standard Library CSV

使用 Python 生成csv 文件

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import csv

# wb中的w表示写入模式，b是文件模式
csv_file = file('test.csv', 'wb')
writer = csv.writer(csv_file)

# 写入一行
writer.writerow(['Name', 'Age', 'Sex'])

data = [
    ('Lisa', 18, 'female'),
    ('jack', 20, 'male'),
    ('Danny', 19, 'female'),
]
# 写入多行
writer.writerows(data)

csv_file.close()


"""
spamwriter = csv.writer(csvfile, dialect='excel')
如果想使生成的CSV 文件可以使excel打开，而不出现乱码 请使用参数：dialect='excel'
这里我生成的 csv 文件没有使用 dialect 参数。 excel用的是 WPS，PY Version 是 2.7
"""

运行结果

^_^[15:43:21][root@master01 ~]#cat test.csv 
Name,Age,Sex
Lisa,18,female
jack,20,male
Danny,19,female

读取 Python 生成的 CSV 文件

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import csv

# rb中的r表示读取模式，b是文件模式
csv_file = file('test.csv', 'rb')

reader = csv.reader(csv_file)

for line in reader:
    print line

csv_file.close()

运行结果

['Name', 'Age', 'Sex']
['Lisa', '18', 'female']
['jack', '20', 'male']
['Danny', '19', 'female']

Python读取从excel导出的csv文件
将 excel 文件导出成CSV 格式,使用python读取数据

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import csv
with open('test_dos.csv', 'rb') as csv_file:
    rows = csv.reader(csv_file, dialect='excel')
    for row in rows:
        print ', '.join(row)

csv_file.close()

运行结果

Name, Age, Sex
Lisa, 18, female
jack, 20, male
Danny, 19, female

How-to Load CSV data into mysql use Python

Jeffrey Nov 09, 2015 Python

逗号分隔值（Comma-Separated Values，CSV，有时也称为字符分隔值，因为分隔字符也可以不是逗号），其文件以纯文本形式存储表格数据（数字和文本）。纯文本意味着该文件是一个字符序列，不含必须像二进制数字那样被解读的数据。CSV文件由任意数目的记录组成，记录间以某种换行符分隔；每条记录由字段组成，字段间的分隔符是其它字符或字符串，最常见的是逗号或制表符。通常，所有记录都有完全相同的字段序列。

CSV文件格式的通用标准并不存在，但是在RFC 4180中有基础性的描述。使用的字符编码同样没有被指定，但是7-bitASCII是最基本的通用编码。

日常工作中总是需要将采集的数据保存到数据库中对以往数据的对比。下面以MySQL数据为例说明。

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import csv
import MySQLdb

conn = MySQLdb.connect(host='localhost',
                       user='root',
                       passwd='',
                       db='test')
cur = conn.cursor()

cur.execute('DROP TABLE IF EXISTS `test_csv`;')
create_table = '''
  CREATE TABLE `test_csv` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `col1` varchar(50) NOT NULL,
  `col2` varchar(30) DEFAULT NULL,
  `col3` varchar(30) DEFAULT NULL,
  `col4` varchar(30) DEFAULT NULL,
  PRIMARY KEY (`id`)
) 
'''

cur.execute(create_table)

csv_data = csv.reader(file('test.csv'))
for row in csv_data:
    #print row
    cur.execute('INSERT INTO test_csv(col1, col2, col3, col4)'
                'VALUES("%s", "%s", "%s", "%s")',
                row)

# close the connection to the database.
conn.commit()
cur.close()
conn.close()
print "Done"

执行结果

mysql>  select * from test_csv;
+----+------+-------+--------+--------+
| id | col1 | col2  | col3   | col4   |
+----+------+-------+--------+--------+
|  1 | '1'  | 'UE1' | '6295' | '1648' |
|  2 | '2'  | 'UE9' | '9805' | '4542' |
|  3 | '3'  | 'MQ2' | 'NONE' | 'NONE' |
|  4 | '4'  | 'BD8' | 'NONE' | 'NONE' |
|  5 | '5'  | '908' | '1548' | '1099' |
|  6 | '6'  | 'dle' | '1548' | '1098' |
|  7 | '7'  | '808' | '1548' | '1099' |
|  8 | '8'  | '108' | '1548' | '1098' |
|  9 | '9'  | 'B08' | '1548' | '1098' |
+----+------+-------+--------+--------+
9 rows in set (0.00 sec)

源CSV文件

^_^[14:36:16][root@master01 ~]#cat test.csv 
1,UE1,6295,1648
2,UE9,9805,4542
3,MQ2,NONE,NONE
4,BD8,NONE,NONE
5,908,1548,1099
6,dle,1548,1098
7,808,1548,1099
8,108,1548,1098
9,B08,1548,1098

How-to use MySQL-python module

Jeffrey Nov 04, 2015 Python

对于数据库操作，和 TCP/IP 的三次握手异曲同工之妙，建立连接，执行操作，断开连接。当然这就需要建立连接的工具

Python连接mysql的方案有oursql、PyMySQL、 myconnpy、MySQL Connector 等，不过本篇说的确是另外一个类库MySQLdb，MySQLdb 是用于Python链接Mysql数据库的接口，它实现了 Python 数据库 API 规范 V2.0，基于 MySQL C API 上建立的。

Reference

Package MySQL-python

MySQLdb User's Guide

安装

yum install Mysql-python -y

pip install Mysql-python

源码解压缩进入主目录执行 python setup.py install

使用

1. 数据库的连接

MySQLdb提供了connect方法用来和数据库建立连接,接收数个参数,返回连接对象：
conn=MySQLdb.connect(host="hostname",user="username",passwd="password",db="dbname",charset="utf8")

比较常用的参数包括:
host:数据库主机名.默认是用本地主机
user:数据库登陆名.默认是当前用户
passwd:数据库登陆的秘密.默认为空
db:要使用的数据库名.没有默认值
port:MySQL服务使用的TCP端口.默认是3306
charset:数据库编码
然后,这个连接对象也提供了对事务操作的支持,标准的方法:
commit() 提交
rollback() 回滚

#!/usr/bin/env python
# -*- coding=utf-8 -*-

import MySQLdb
 
try:
    conn=MySQLdb.connect(host='localhost',user='root',passwd='',port=3306)
    
    # 使用cursor()方法获取操作游标
    cur=conn.cursor()

    # 选择数据库
    conn.select_db('test')

    # 使用execute方法执行SQL语句
    cur.execute("SELECT VERSION()")

    # 使用 fetchone() 方法获取一条数据库。
    data = cur.fetchone()
    print "Database version : %s " % data
    
    # 关闭连接
    conn.commit()
    cur.close()
    conn.close()
 
except MySQLdb.Error,e:
     print "Mysql Error %d: %s" % (e.args[0], e.args[1])

执行结果

Database version : 5.1.73

2. cursor方法执行与返回值

cursor方法提供两类操作：
1.执行命令
2.接收返回值
cursor用来执行命令的方法

#用来执行存储过程,接收的参数为存储过程名和参数列表,返回值为受影响的行数
callproc(self, procname, args)
#执行单条sql语句,接收的参数为sql语句本身和使用的参数列表,返回值为受影响的行数
execute(self, query, args)
#执行单挑sql语句,但是重复执行参数列表里的参数,返回值为受影响的行数
executemany(self, query, args)
#移动到下一个结果集
nextset(self)
cursor用来接收返回值的方法
#接收全部的返回结果行.
fetchall(self)
#接收size条返回结果行.如果size的值大于返回的结果行的数量,则会返回cursor.arraysize条数据
fetchmany(self, size=None)
#返回一条结果行
fetchone(self)
#移动指针到某一行.如果mode='relative',则表示从当前所在行移动value条,如果mode='absolute',则表示从结果集的第一行移动value条
scroll(self, value, mode='relative')
#这是一个只读属性，并返回执行execute()方法后影响的行数
rowcount

3. 数据库操作

a.创建表

#!/usr/bin/env python
# -*- coding=utf-8 -*-

import MySQLdb

 
try:
    conn=MySQLdb.connect(host='localhost',user='root',passwd='',port=3306)
    
    # 使用cursor()方法获取操作游标

    cur=conn.cursor()
    # 选择数据库
    conn.select_db('test')


    # 如果数据表已经存在使用 execute() 方法删除表。
    cur.execute("DROP TABLE IF EXISTS stu_info")

    # 创建数据表SQL语句
    sql = """CREATE TABLE stu_info (
             `id` int(10) NOT NULL AUTO_INCREMENT PRIMARY KEY,
             `name` CHAR(20) NOT NULL,
             `age` INT,
             `sex` CHAR(6))"""
    cur.execute(sql)

    conn.commit()
    cur.close()
    conn.close()
 
except MySQLdb.Error,e:
     print "Mysql Error %d: %s" % (e.args[0], e.args[1])

b. 插入数据

添加单行记录

#!/usr/bin/env python
# -*- coding=utf-8 -*-

import MySQLdb
 
# 创建连接
conn=MySQLdb.connect(host='localhost',user='root',passwd='',port=3306)
# 使用cursor()方法获取操作游标

cur=conn.cursor()
# 选择数据库
conn.select_db('test')

# 插入一条记录
sql = "insert into stu_info(name,age,sex) values(%s,%s,%s)"
cur.execute(sql,('Lisa',18,'female'))

conn.commit()
cur.close()
conn.close()

添加多行记录

#!/usr/bin/env python
# -*- coding=utf-8 -*-

import MySQLdb

# 创建连接
conn = MySQLdb.connect(host='localhost', user='root', passwd='', port=3306)
# 使用cursor()方法获取操作游标

cur = conn.cursor()
# 选择数据库
conn.select_db('test')

# 插入多条记录
sql = "insert into stu_info(name, age, sex) values(%s, %s, %s)"
mutl_line = [
    ('jack', 20, 'male'),
    ('Danny', 19, 'female'),
    ]
cur.executemany(sql, mutl_line)

conn.commit()
cur.close()
conn.close()

executemany()方法可以一次插入多条值，执行单挑sql语句,但是重复执行参数列表里的参数,返回值为受影响的行数。

c. 查询数据

#!/usr/bin/env python
# -*- coding=utf-8 -*-

import MySQLdb

# 创建连接
conn = MySQLdb.connect(host='localhost', user='root', passwd='', port=3306)
# 使用cursor()方法获取操作游标

cur = conn.cursor()
# 选择数据库
conn.select_db('test')

# 获取记录条数
rec_count = cur.execute("select * from stu_info")
print "There have %s records" % rec_count

#打印表中的数据
#rows = cur.fetchmany(rec_count)
rows = cur.fetchall()
for row in rows:
    print row

conn.commit()
cur.close()
conn.close()

执行结果

There have 3 records
(1L, 'Lisa', 18L, 'female')
(2L, 'jack', 20L, 'male')
(3L, 'Danny', 19L, 'female')

上面的代码，用来将所有的结果取出，不过打印的时候是每行一个元祖打印，现在我们使用方法，取出其中的单个数据：

import MySQLdb
 
# 创建连接
conn=MySQLdb.connect(host='localhost',user='root',passwd='',port=3306)
# 使用cursor()方法获取操作游标

cur=conn.cursor()
# 选择数据库
conn.select_db('test')

# 执行那个查询，这里用的是select语句
cur.execute("select * from stu_info")

# 使用cur.rowcount获取结果集的条数
numrows = int(cur.rowcount)
for i in range(numrows):
  row = cur.fetchone()
  print str(row[0]) + "," + row[1] + "," + str(row[2]) + "," + row[3]

conn.commit()
cur.close()
conn.close()

执行结果

1,Lisa,18,female
2,jack,20,male
3,Danny,19,female

使用字典cursor取得结果集（可以使用表字段名字访问值）

import MySQLdb
 
# 创建连接
conn=MySQLdb.connect(host='localhost',user='root',passwd='',port=3306)
# 使用cursor()方法获取操作游标

cur=conn.cursor()
# 选择数据库
conn.select_db('test')

# 获取连接上的字典cursor，注意获取的方法，
# 每一个cursor其实都是cursor的子类
cur = conn.cursor(MySQLdb.cursors.DictCursor)

# 执行语句不变
cur.execute("SELECT * FROM stu_info")

# 获取数据方法不变
rows = cur.fetchall()

# 遍历数据也不变（比上一个更直接一点）
for row in rows:
    # 这里，可以使用键值对的方法，由键名字来获取数据
    print "%s %s %s" % (str(row["id"]), row["name"], str(row["age"]))


conn.commit()
cur.close()
conn.close()

执行结果：

1 Lisa 18
2 jack 20
3 Danny 19

使用Prepared statements执行查询

import MySQLdb
 
# 创建连接
conn=MySQLdb.connect(host='localhost',user='root',passwd='',port=3306)
# 使用cursor()方法获取操作游标

cur=conn.cursor()
# 选择数据库
conn.select_db('test')


# 我们看到，这里可以通过写一个可以组装的sql语句来进行
cur.execute("UPDATE stu_info SET Name = %s WHERE Id = %s",
    ("cherry", "3"))
# 使用cur.rowcount获取影响了多少行
print "Number of rows updated: %d" % cur.rowcount


conn.commit()
cur.close()
conn.close()

执行结果

:<EOF>
Number of rows updated: 1

In [16]:

Transaction - 事务

事务机制可以确保数据一致性。
事务应该具有4个属性：原子性、一致性、隔离性、持久性。这四个属性通常称为ACID特性。
① 原子性（atomicity）。一个事务是一个不可分割的工作单位，事务中包括的诸操作要么都做，要么都不做。
② 一致性（consistency）。事务必须是使数据库从一个一致性状态变到另一个一致性状态。一致性与原子性是密切相关的。
③ 隔离性（isolation）。一个事务的执行不能被其他事务干扰。即一个事务内部的操作及使用的数据对并发的其他事务是隔离的，并发执行的各个事务之间不能互相干扰。
④ 持久性（durability）。持续性也称永久性（permanence），指一个事务一旦提交，它对数据库中数据的改变就应该是永久性的。接下来的其他操作或故障不应该对其有任何影响。

Python DB API 2.0 的事务提供了两个方法 commit 或 rollback。
对于支持事务的数据库，在Python数据库编程中，当游标建立之时，就自动开始了一个隐形的数据库事务。commit()方法游标的所有更新操作，rollback（）方法回滚当前游标的所有操作。每一个方法都开始了一个新的事务。

import MySQLdb

try:
    # 创建连接
    conn = MySQLdb.connect(
        host='localhost', user='root', passwd='', port=3306)
    # 使用cursor()方法获取操作游标

    cur = conn.cursor()
    # 选择数据库
    conn.select_db('test')

    # 如果某个数据库支持事务，会自动开启
    # 这里用的是MYSQL，所以会自动开启事务（若是MYISM引擎则不会）
    cur.execute("UPDATE stu_info SET name = %s WHERE id = %s",
                   ("tonny", "1"))
    cur.execute("UPDATE stu_infos SET name = %s WHERE id = %s",
                   ("jim", "2"))

    # 事务的特性1、原子性的手动提交
    conn.commit()

    cur.close()
    conn.close()

except MySQLdb.Error, e:
    # 如果出现了错误，那么可以回滚，就是上面的三条语句要么执行，要么都不执行 [ 存储引擎支持事物 ]
    # 如存储引擎不只是事务[MyISM], 则只提交成功的结果
    conn.rollback()
    print "Error %d: %s" % (e.args[0], e.args[1])

执行结果

Error 1146: Table 'test.stu_infos' doesn't exist

未完待续 ... ....

Python MySQLdb test for select count(*) = zero

Jeffrey Nov 03, 2015 Python

I use SELECT COUNT(*) FROM db WHERE <expression> to see if a set of records is null. So:

>>> cnt = c.fetchone()
>>> print cnt
(0L,)

My question is: how do you test for this condition?
I have a number of other ways to accomplish this. Is something like the following possible?

if cnt==(0L,):
    # do something

fetchone returns a row, which is a sequence of columns.
If you want to get the first value in a sequence, you use [0].

You could instead compare the row to (0,), as you're suggesting. But as far as I know neither the general DB-API nor the specific MySQLdb library guarantee what kind of sequence a row is; it could be a list, or a custom sequence class. So, relying on the fact that it's a tuple is probably not a good idea. And, since it's just as easy to not do so, why not be safe and portable?

So:

count_row = c.fetchone()
count = count_row[0]
if count == 0:
    do_something()

Or, putting it together in one line:

if c.fetchone()[0] == 0:
    do_something()

source： stackoverflow

ImportError: libmysqlclient.so.18: cannot open shared object file

Jeffrey Nov 03, 2015 Python

MySQL-python (1.2.5)

>>> import MySQLdb
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Python2.7.3/lib/python2.7/site-packages/MySQLdb/__init__.py", line 19, in <module>
    import _mysql
ImportError: libmysqlclient.so.18: cannot open shared object file: No such file or directory

I.创建软连接再次导入模块正常

[21:21:01][[email protected] ~]#ln -s /usr/local/mysql-5.6.26/lib/libmysqlclient.so.18 /usr/lib64/libmysqlclient.so.18


>>> import MySQLdb
>>>

Suzf Blog

keep it simple, stupid.

Tag Python

Python 爬取中文网页乱码的通用解决办法

pip-faq: Error -5 while decompressing data: incomplete or truncated stream

使用 VIM 打造 Python 开发IDE

Python debugging tools

Logging

Tracing

Alternatives

PDB

Better PDB

Remote PDB

Don't like Winpdb? Use PDB over TCP

Just a REPL

Standard Linux tools

Better tracing

Monitoring

GDB

Worthy mentions

Having segfaults? faulthandler

Quick stacktrace on a signal? faulthandler

Memory leaks

Memory usage

Python CSV 操作实例

How-to Load CSV data into mysql use Python

How-to use MySQL-python module

Python MySQLdb test for select count(*) = zero

ImportError: libmysqlclient.so.18: cannot open shared object file