[Tutorial] Python Regular Expressions in Detail: (...) group

First, the explanation from the Python 2.7 manual:

(...)
Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(] [)].

The \number mentioned there is described as follows:

\number
Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'the end' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it will not be interpreted as a group match, but as the character with octal value number. Inside the '[' and ']' of a character class, all numeric escapes are treated as characters.
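The manual's `(.+) \1` example can be checked directly. The following is a minimal sketch, not part of the original script: the helper name `repeats_word` is made up for illustration, and it uses `re.match` so that the pattern is anchored at the start of the string, which is the reading under which 'the end' does not match.

```python
import re

# Pattern from the manual: a group, one space, then the same text again.
# The raw string keeps \1 as a backreference rather than an escape character.
DOUBLED = re.compile(r'(.+) \1')

def repeats_word(text):
    """Return the repeated part if text starts with 'X X', else None (hypothetical helper)."""
    m = DOUBLED.match(text)
    return m.group(1) if m else None

print(repeats_word('the the'))  # the
print(repeats_word('55 55'))    # 55
print(repeats_word('the end'))  # None

# re.search is not anchored, so it finds the shorter repeat 'e e' inside 'the end'.
print(DOUBLED.search('the end').group(0))  # e e
```

The last line is worth keeping in mind: whether a backreference pattern "matches" can depend on whether the search is anchored.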

The rest of this post explains what (...) means in Python and how to use it:

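1. (...) matches whatever the enclosed expression matches, and packages that part of the match as a single unit, a group.

2. A group can be referred to later on, within the same regular expression.

This is usually called a backreference: text that an earlier group has already captured is matched again further along in the pattern.

The reference is written as \N, where N is the number of the group.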

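3. Group numbering

Group 0 always stands for the entire matched string;

the groups you see in the regular expression, each enclosed in parentheses, are numbered 1, 2, 3, ... in order.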

4. To match a literal left parenthesis '(' or right parenthesis ')' in a regular expression, escape it with a backslash: '\(' and '\)'.
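The numbering and escaping rules in the list above can be sketched in a few lines. The pattern and the sample text here are made up for illustration and do not come from the original script.

```python
import re

# Two numbered groups; \( and \) match literal parentheses.
# Pattern and sample text are illustrative assumptions.
pattern = r'(\w+)\((\d+)\)'
m = re.search(pattern, 'call foo(42) now')

whole = m.group(0)  # the entire match
name = m.group(1)   # first (...) group
arg = m.group(2)    # second (...) group

print(whole)  # foo(42)
print(name)   # foo
print(arg)    # 42

# A character class is the other way the manual gives to match a literal parenthesis.
same = re.search(r'(\w+)[(](\d+)[)]', 'call foo(42) now')
print(same.groups())  # ('foo', '42')
```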

1. Basic usage of group, shown in code

The code below covers most of what there is to say about how groups work and what they mean:

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
[Tutorial] Python Regular Expressions in Detail: (...) group

Version:    2012-11-14
Author:     Crifan
"""

import re

# Note:
# This demo covers unnamed groups only. For named groups, see:
# [Tutorial] Python Regular Expressions in Detail: (?P<name>...) named group
# https://www.crifan.org/detailed_explanation_about_python_regular_express_named_group/

# The sample string is a fragment of the HTML at http://www.songtaste.com/user/351979/
reGroupTestStr = '<h1 class="h1user">crifan</h1>'
foundH1user = re.search('<h1 class="h1user">(.+?)</h1>', reGroupTestStr)

# 1. Match Objects
# When the search succeeds, the return value is a match object, which prints as something like _sre.SRE_Match object
# For details on Match Objects, see the official manual:
# http://docs.python.org/2/library/re.html#match-objects
print "foundH1user=", foundH1user  # foundH1user= <_sre.SRE_Match object at 0x023A7D60>

if foundH1user:
    # 2. matched.group(0)
    # Whatever groups the pattern contains, group(0) is always the whole matched string
    wholeMatchString = foundH1user.group(0)
    print "wholeMatchString=", wholeMatchString  # wholeMatchString= <h1 class="h1user">crifan</h1>

    # 3. matched.group(N)
    # If the pattern has several pairs of parentheses, i.e. several groups, they are group(1), group(2), group(3), ...
    # Here there is only one group, so group(1) is the value we want to extract
    h1User = foundH1user.group(1)
    print "Group(1): h1User=", h1User  # Group(1): h1User= crifan

    # 4. matched.groups()
    # groups() returns all values from group(1) onwards, as a tuple
    allMatchedGroups = foundH1user.groups()
    print "allMatchedGroups=", allMatchedGroups  # allMatchedGroups= ('crifan',)

    # 5. matched.start(N) and matched.end(N)
    # The start and end positions of a matched group are also available
    start1 = foundH1user.start(1)
    end1 = foundH1user.end(1)
    print "Group(1): start position=%d, end position=%d" % (start1, end1)  # Group(1): start position=19, end position=25

    # 6. matched.string
    # MatchObject.string is the string that was passed to re.search.
    # Here it equals MatchObject.group(0) only because the pattern matches the entire input
    foundString = foundH1user.string
    print "foundString=", foundString  # foundString= <h1 class="h1user">crifan</h1>

    # 7. get string by [startN:endN]
    # Slicing the input between start and end also yields the matched text,
    # the same value as MatchObject.group(1) above
    strByStartAndEnd = foundString[start1:end1]
    print "Group(1): strByStartAndEnd=", strByStartAndEnd  # Group(1): strByStartAndEnd= crifan

# 8. Backreference: match the value already captured by an earlier group
# Note: the following version does not work, because in an ordinary (non-raw) string
# '\1' is the single character with octal value 1, not a reference to group 1
#foundPrevMatch = re.search('<(\S+) class="h1user">.+?</\1>', reGroupTestStr)
# Either of the next two forms is correct (double the backslash, or use a raw string):
#foundPrevMatch = re.search('<(\S+) class="h1user">.+?</\\1>', reGroupTestStr)
foundPrevMatch = re.search(r'<(\S+) class="h1user">(.+?)</\1>', reGroupTestStr)
print "foundPrevMatch=", foundPrevMatch  # foundPrevMatch= <_sre.SRE_Match object at 0x01F67BA8>
if foundPrevMatch:
    # 9. Groups 1 and 2 are (\S+) and (.+?) above,
    # so they hold the tag name h1 and the h1user value respectively
    # 10. The \1 matches whatever the first group, (\S+), captured
    h1 = foundPrevMatch.group(1)
    print "h1=", h1  # h1= h1

    h1User = foundPrevMatch.group(2)
    print "h1User=", h1User  # h1User= crifan
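Two details of the script are easier to see when the pattern matches only part of the input. The sketch below is an illustration rather than part of the original script: the surrounding "before"/"after" text is made up, and `print()` is written in function form with a single argument so the snippet also runs on Python 3.

```python
import re

# Same kind of HTML fragment as in the script, with made-up text around it.
html = 'before <h1 class="h1user">crifan</h1> after'

m = re.search(r'<(\S+) class="h1user">(.+?)</\1>', html)

# .string is the whole input passed to re.search, not just the matched part.
print(m.string == html)                # True
print(m.group(0))                      # <h1 class="h1user">crifan</h1>
print(m.string[m.start(2):m.end(2)])   # crifan

# Without the r prefix, '\1' is the single character chr(1),
# so the pattern looks for that character and finds no match.
print('\1' == chr(1))                  # True
no_match = re.search('<(\\S+) class="h1user">(.+?)</\1>', html)
print(no_match)                        # None
```

Slicing `m.string` with `start(N)` and `end(N)` still recovers the group text, because the positions are offsets into the full input.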

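2. Matching multiple groups

To match just one of several groups, i.e. an OR relationship between groups, see the example in this post:

[Tutorial] Python Regular Expressions in Detail: '|' vertical bar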
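As a rough idea of what an OR between groups looks like, here is a minimal sketch. The pattern and sample strings are made up for illustration and are not taken from the linked post.

```python
import re

# Either the first group or the second one takes part in the match.
# Pattern and sample strings are illustrative assumptions.
either = re.compile(r'(cat)|(dog)')

for text in ('hot dog', 'cat nap'):
    m = either.search(text)
    # A group that did not take part in the match is reported as None.
    print(m.groups())
# (None, 'dog')
# ('cat', None)
```

Checking which entry of `groups()` is not None tells you which alternative matched.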
