Learning the Python re Module

License: Attribution-NonCommercial-ShareAlike 4.0 International

This article comes from Suzf Blog. Unless otherwise noted, all content is original to SUZF.NET.

When reposting, please credit the source: http://suzf.net/post/812

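The previous article, "Python Regular Expression HOWTO", already covered regular expressions in detail. This post only gives a brief overview of the re module.

Metacharacters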

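.      Matches any character except a newline
^      Matches the start of the string
$      Matches the end of the string
[]     Matches a specified character class
?      Repeats the preceding character 0 or 1 times
*      Repeats the preceding character 0 or more times
{m}    Repeats the preceding character exactly m times
{m,n}  Repeats the preceding character m to n times
\d     Matches any digit; equivalent to [0-9]
\D     Matches any non-digit character; equivalent to [^0-9]
\s     Matches any whitespace character; equivalent to [ \t\n\r\f\v]
\S     Matches any non-whitespace character; equivalent to [^ \t\n\r\f\v]
\w     Matches any alphanumeric character or underscore; equivalent to [a-zA-Z0-9_]
\W     Matches any non-alphanumeric character; equivalent to [^a-zA-Z0-9_]
\b     Matches a word boundary (the start or end of a word)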
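To see a few of these metacharacters in action, here is a minimal sketch. The sample strings are made up for illustration and are not from the original post.

```python
import re

# Sample text is hypothetical, for illustration only.
text = "tel: 010-1234 5678, fax: 010-8765"

# \d{3} matches exactly three digits, \d{4} exactly four.
numbers = re.findall(r"\d{3}-\d{4}", text)

# The "?" makes the preceding "u" optional (0 or 1 times).
spellings = re.findall(r"colou?r", "color colour colouur")

# \bcat\b matches "cat" only as a whole word, not inside "concat".
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

# \s+ matches one or more whitespace characters (space, tab, newline, ...).
pieces = re.split(r"\s+", "a \t b\nc")

print(numbers)      # ['010-1234', '010-8765']
print(spellings)    # ['color', 'colour']
print(whole_words)  # ['cat', 'cat']
print(pieces)       # ['a', 'b', 'c']
```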

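Importing the module and listing its contents (the output below is from Python 2; the names differ between versions)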

import re
dir(re)
['DEBUG', 'DOTALL', 'I', 'IGNORECASE', 'L', 'LOCALE', 'M', 'MULTILINE', 'S', 'Scanner', 'T', 'TEMPLATE', 'U', 'UNICODE', 'VERBOSE', 'X', '_MAXCACHE', '__all__', '__builtins__', '__doc__', '__file__', '__name__', '__package__', '__version__', '_alphanum', '_cache', '_cache_repl', '_compile', '_compile_repl', '_expand', '_pattern_type', '_pickle', '_subx', 'compile', 'copy_reg', 'error', 'escape', 'findall', 'finditer', 'match', 'purge', 'search', 'split', 'sre_compile', 'sre_parse', 'sub', 'subn', 'sys', 'template']

Commonly used functions

# Functions used for matching
re.match(pattern, string[, flags])
re.search(pattern, string[, flags])
re.split(pattern, string[, maxsplit[, flags]])
re.findall(pattern, string[, flags])
re.finditer(pattern, string[, flags])
re.sub(pattern, repl, string[, count[, flags]])
re.subn(pattern, repl, string[, count[, flags]])

# Returns a pattern object
re.compile(pattern[, flags])

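The flags argument sets the matching mode. Values can be combined with the bitwise OR operator '|' to enable several at once,
      for example re.I | re.M.

Possible values:
    re.I (I --> IGNORECASE): ignore case (the full name is in parentheses, likewise below)
    re.M (M --> MULTILINE): multiline mode; changes the behavior of '^' and '$' so they also match at the start and end of each line
    re.S (S --> DOTALL): dot-matches-all mode; changes the behavior of '.' so it also matches a newline
    re.L (L --> LOCALE): makes the predefined classes \w \W \b \B \s \S depend on the current locale
    re.U (U --> UNICODE): makes the predefined classes \w \W \b \B \s \S \d \D depend on the Unicode character properties
    re.X (X --> VERBOSE): verbose mode; the regular expression can span multiple lines, whitespace is ignored, and comments are allowed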
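A short sketch of how the flags change matching. The sample text and the variable names are assumptions made for this example.

```python
import re

# Hypothetical three-line sample text.
text = "Hello\nhello world\nHELLO again"

# Without flags, "^" only matches at the very start and case matters.
plain = re.findall(r"^hello", text)

# re.I ignores case; re.M lets "^" match at the start of every line.
both = re.findall(r"^hello", text, re.I | re.M)

# Without re.S the "." cannot cross the newline; with re.S it can.
no_dotall = re.search(r"Hello.hello", text)
dotall = re.search(r"Hello.hello", text, re.S)

# re.X allows whitespace and comments inside the pattern.
number = re.compile(r"""
    \d+      # integer part
    \.       # decimal point
    \d+      # fractional part
""", re.X)

print(plain)                               # []
print(both)                                # ['Hello', 'hello', 'HELLO']
print(no_dotall)                           # None
print(dotall is not None)                  # True
print(number.search("pi is 3.14").group()) # 3.14
```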

Match

re.match tries to match a pattern at the start of the string. If the pattern does not match at the start, match() returns None.

Syntax
re.match(pattern, string[, flags])

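Match object method	Description
group(num=0)	Returns the string matched by the whole expression. group() also accepts several group numbers at once,
                in which case it returns a tuple of the values of those groups.
groups()	Returns a tuple of all subgroup strings, from group 1 up to the last group.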
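The two methods in the table above can be sketched like this. The pattern and the sample address are illustrative only.

```python
import re

# Three capturing groups: user, domain name, top-level domain.
m = re.match(r"(\w+)@(\w+)\.(\w+)", "admin@suzf.net")

whole = m.group()         # the entire match
first = m.group(1)        # a single group
picked = m.group(1, 3)    # several group numbers at once -> tuple
everything = m.groups()   # all groups, starting from group 1

print(whole)       # admin@suzf.net
print(first)       # admin
print(picked)      # ('admin', 'net')
print(everything)  # ('admin', 'suzf', 'net')
```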

Example

#!/usr/bin/env python
# -*- encoding: utf-8 -*-

import re

# Match at the start of the string
print (re.match('www', 'www.suzf.net').span())

# No match: 'net' is not at the start of the string
print (re.match('net', 'www.suzf.net'))


# Output
(0, 3)
None


line = 'Welcome to access my blog ---> http://suzf.net'
matchObj = re.match(r'(.*) -{2,}. (.*)', line, re.M|re.I)

if matchObj:
    print("matchObj.group():", matchObj.group())
    print("matchObj.group(1):", matchObj.group(1))
    print("matchObj.group(2):", matchObj.group(2))
else:
    print("Sorry! No match!")

# Output
matchObj.group(): Welcome to access my blog ---> http://suzf.net
matchObj.group(1): Welcome to access my blog
matchObj.group(2): http://suzf.net

Search

re.search scans the whole string and returns the first successful match, or None if nothing matches.

Syntax
re.search(pattern, string, flags=0)

Example

import re

print (re.search('www', 'www.suzf.net').span())
print (re.search('net', 'www.suzf.net').span())

line = 'Welcome to access my blog ---> http://suzf.net'
matchObj = re.search(r'(.*) -{2,}. (.*)', line, re.M|re.I)

if matchObj:
    print("matchObj.group():", matchObj.group())
    print("matchObj.group(1):", matchObj.group(1))
    print("matchObj.group(2):", matchObj.group(2))
else:
    print("Sorry! No match!")

# Output
(0, 3)
(9, 12)
matchObj.group(): Welcome to access my blog ---> http://suzf.net
matchObj.group(1): Welcome to access my blog
matchObj.group(2): http://suzf.net
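Running match and search side by side on the same input makes the contrast concrete. This is a small sketch with a made-up two-line string. Note that even with re.M, match() still only tries the very start of the string.

```python
import re

# Hypothetical sample: "net" appears at the end of line 1 and the start of line 2.
text = "www.suzf.net\nnet admin"

at_start = re.match(r"net", text)     # only tries position 0
anywhere = re.search(r"net", text)    # scans forward for the first hit

# re.M changes what "^" means for search(), but not where match() looks.
multiline_match = re.match(r"^net", text, re.M)
multiline_search = re.search(r"^net", text, re.M)

print(at_start)                  # None
print(anywhere.span())           # (9, 12)
print(multiline_match)           # None
print(multiline_search.span())   # (13, 16)
```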

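Difference between re.match and re.search

re.match only matches at the start of the string; if the start of the string does not fit the regular expression, the match fails and the function returns None. re.search scans the whole string until it finds a match.

Split

re.split splits a string using a regular expression. If the pattern is wrapped in capturing parentheses, the matched separators are also included in the returned list. maxsplit is the maximum number of splits: maxsplit=1 splits once, and the default of 0 means no limit.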

Syntax (module function | method of a compiled pattern)
re.split(pattern, string[, maxsplit]) | split(string[, maxsplit])

Example

import re

test = "Trouble is a friend, yeah trouble is a friend of mine."
print(re.split(r"\s+", test))

# Split at most three times
print(re.split(r"\s+", test, maxsplit=3))

# No match: the whole string comes back unsplit
print(re.split("AAA", "158158"))

# Output
['Trouble', 'is', 'a', 'friend,', 'yeah', 'trouble', 'is', 'a', 'friend', 'of', 'mine.']
['Trouble', 'is', 'a', 'friend, yeah trouble is a friend of mine.']
['158158']
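The capturing-parentheses behavior of split is easy to miss, so here is a minimal sketch of it. The expression string is made up for illustration.

```python
import re

expr = "1+2-3"

# Plain pattern: the separators are dropped.
without_group = re.split(r"[+-]", expr)

# Pattern in capturing parentheses: the separators are kept in the list.
with_group = re.split(r"([+-])", expr)

# maxsplit=1: split once, leave the rest untouched.
limited = re.split(r"[+-]", expr, maxsplit=1)

print(without_group)  # ['1', '2', '3']
print(with_group)     # ['1', '+', '2', '-', '3']
print(limited)        # ['1', '2-3']
```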

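Findall

re.findall searches the string and returns all matching substrings as a list.

Syntax (module function | method of a compiled pattern)
re.findall(pattern, string[, flags]) | findall(string[, pos[, endpos]])

Example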

import re

num = re.compile(r'\d+')
print(num.findall('one1two2three3'))

# Output
['1', '2', '3']
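One detail worth knowing: when the pattern contains capturing groups, findall returns the groups rather than the whole match. A small sketch, with a made-up key=value string:

```python
import re

# Hypothetical input; "c=x" does not match because "x" is not a digit.
text = "a=1, b=22, c=x"

# Two groups: each result is a tuple of the two captured strings.
pairs = re.findall(r"(\w+)=(\d+)", text)

# One group: each result is just that group's string.
keys = re.findall(r"(\w+)=\d+", text)

# No groups: each result is the whole match.
whole = re.findall(r"\w+=\d+", text)

print(pairs)  # [('a', '1'), ('b', '22')]
print(keys)   # ['a', 'b']
print(whole)  # ['a=1', 'b=22']
```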

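Finditer

re.finditer searches the string and returns an iterator that yields each match (as a Match object) in order.

Syntax (module function | method of a compiled pattern)
re.finditer(pattern, string[, flags]) | finditer(string[, pos[, endpos]])

Example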

import re

num = re.compile(r'\d+')

for item in num.finditer('one1two2three3'):
    print(item.group(), end=' ')

# Output
1 2 3
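Because finditer yields Match objects instead of plain strings, you also get the position of every match. A minimal sketch, with a sample string invented for this example:

```python
import re

results = []
for m in re.finditer(r"\d+", "one1two22three333"):
    # start() and end() give the slice of the input that matched.
    results.append((m.group(), m.start(), m.end()))

print(results)  # [('1', 3, 4), ('22', 7, 9), ('333', 14, 17)]
```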

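Sub

re.sub returns the string obtained by replacing the leftmost non-overlapping matches of the pattern in the string with repl. If the pattern is not found, the string is returned unchanged. The optional argument count is the maximum number of matches to replace; it must be a non-negative integer. The default of 0 replaces all matches.

Syntax
re.sub(pattern, repl, string, count=0, flags=0)

Example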

import re

phone = "6666-666-666 # This is Phone Number"

# Delete comments
num = re.sub(r'#.*$', "", phone)
print("Phone Num:", num)

# Remove anything other than digits
num = re.sub(r'\D', "", phone)
print("Phone Num:", num)

# Output
Phone Num: 6666-666-666 
Phone Num: 6666666666
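The count argument and the repl argument deserve a quick sketch of their own. The dates and the double() helper below are made up for illustration. repl can be a string with backreferences such as \1, or a function that receives the Match object.

```python
import re

# Hypothetical sample text.
text = "2024-01-15 and 2023-12-31"

# count=1 replaces only the first (leftmost) match.
first_only = re.sub(r"\d{4}", "YYYY", text, count=1)

# \1, \2, \3 in repl refer back to the captured groups.
reordered = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)

# repl may also be a function that takes the Match and returns a string.
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, "3 apples, 10 pears")

print(first_only)  # YYYY-01-15 and 2023-12-31
print(reordered)   # 15/01/2024 and 31/12/2023
print(doubled)     # 6 apples, 20 pears
```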

Subn

re.subn does the same thing as re.sub, but returns a 2-tuple containing the new string and the number of substitutions made.

Syntax
re.subn(pattern, repl, string, count=0, flags=0)

Example

import re

phone = "6666-666-666 # This is Phone Number"

# Delete comments
num = re.subn(r'#.*$', "", phone)
print("Phone Num:", num)

# Remove anything other than digits
num = re.subn(r'\D', "", phone)
print("Phone Num:", num)

# Output
Phone Num: ('6666-666-666 ', 1)
Phone Num: ('6666666666', 25)
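One common use of the count returned by subn is checking whether anything was replaced at all. A small sketch with made-up input:

```python
import re

# Unpack the 2-tuple directly into the new string and the count.
cleaned, n = re.subn(r"\t", " ", "a\tb\tc")

# When nothing matches, the string is unchanged and the count is 0.
untouched, zero = re.subn(r"\t", " ", "abc")

print(cleaned, n)       # a b c 2
print(untouched, zero)  # abc 0
```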

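Compile

The compile function builds a regular expression object from a pattern string and optional flags. The object has a set of methods for regular expression matching and substitution.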

help(re.compile)
compile(pattern, flags=0)
    Compile a regular expression pattern, returning a pattern object.

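Example

import re

data_file = '/tmp/disk_info.csv.tmp'

# disk_info is not defined in the original post. These rows are
# hypothetical stand-ins for space-separated disk usage lines.
disk_info = [
    "/dev/sda1  50G  20G  30G  40%  /\n",
    "/dev/sdb1  100G  10G  90G  10%  /data\n",
]

# Compile once, outside the loop, so the pattern object is reused.
spacetab = re.compile(" +")

with open(data_file, 'w') as f:
    for row in disk_info:
        # Replace each run of spaces with a single comma.
        item = spacetab.sub(',', row)
        f.write(item)
# No f.close() needed: the with block closes the file.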

Notice

prog = re.compile(pattern)
result = prog.match(string)

is equivalent to

result = re.match(pattern, string)

The first form lets you reuse the compiled regular expression.
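A quick sketch of that equivalence, assuming a few made-up input strings and a hypothetical matched_text() helper. It also shows that the methods of a compiled pattern accept an optional start position.

```python
import re

pattern = r"\d+"
prog = re.compile(pattern)  # compile once, reuse below

# Sample strings are hypothetical.
lines = ["10 apples", "no digits", "7 pears"]

def matched_text(m):
    # Return the matched text, or None when there was no match.
    return m.group() if m else None

via_object = [matched_text(prog.match(s)) for s in lines]
via_module = [matched_text(re.match(pattern, s)) for s in lines]

print(via_object)                # ['10', None, '7']
print(via_object == via_module)  # True

# Pattern methods also take a start position: search from index 4 onward.
from_pos = prog.search("one1two2", 4)
print(from_pos.group())          # 2
```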

That's enough rambling for now. For more details, see the Python Standard Library documentation.