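[Answer] Extracting target data with BeautifulSoup

[Question]

After fetching the data below with BeautifulSoup (along with other parts of the page, of course), I don't know how to use BeautifulSoup's own parsing methods (doing it with re seems complicated) to extract the two values I want, 24,804,000,000.00 and 1,511,750,000.00. Any help would be much appreciated!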
<tr><td width='150px'><strong>报表日期</strong></td><td style='text-align:right;'>2013-03-31</td></tr>
<tr></tr>
<tr><td colspan='5'><strong>流动资产</strong></td></tr>
<tr><td style='padding-left:30px' width='150px'><a target='_blank' href='/corp/view/vFD_FinanceSummaryHistory.php?stockid=002024&type=cbsheet1'>货币资金</a></td><td style='text-align:right;'>24,804,000,000.00</td></tr>
<tr><td style='padding-left:30px' width='150px'><a target='_blank' href='/corp/view/vFD_FinanceSummaryHistory.php?stockid=002024&type=cbsheet110'>交易性金融资产</a></td><td style='text-align:right;'>1,511,750,000.00</td></tr>
</tbody>

[Answer]

1. To extract the data, first get a clear view of the HTML structure. Formatted by hand, it looks like this:

<tr>
    <td width='150px'><strong>报表日期</strong></td>
    <td style='text-align:right;'>2013-03-31</td>
</tr>
<tr>
</tr>
<tr>
    <td colspan='5'><strong>流动资产</strong></td>
</tr>
<tr>
    <td style='padding-left:30px' width='150px'>
        <a target='_blank' href='/corp/view/vFD_FinanceSummaryHistory.php?stockid=002024&type=cbsheet1'>货币资金</a>
    </td>
    <td style='text-align:right;'>24,804,000,000.00</td>
</tr>
<tr>
    <td style='padding-left:30px' width='150px'><a target='_blank' href='/corp/view/vFD_FinanceSummaryHistory.php?stockid=002024&type=cbsheet110'>交易性金融资产</a></td>
    <td style='text-align:right;'>1,511,750,000.00</td>
</tr>
</tbody>

Now the structure is easy to see.

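2. Provided you are sure the HTML structure above will not change, you can use:

findAll(name="td", attrs={"style":"text-align:right;"})

to find the three td elements whose style is exactly text-align:right; (the report date and the two amounts).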
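
To make the result of this step concrete without installing anything, here is a minimal sketch that performs the same selection with Python 3's standard-library html.parser instead of BeautifulSoup. This is a swapped-in technique, not the approach of this post; the class name RightAlignedCells and the shortened sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class RightAlignedCells(HTMLParser):
    """Collect the text of every <td> whose style is exactly 'text-align:right;'."""

    def __init__(self):
        super().__init__()
        self.in_target = False  # True while we are inside a matching <td>
        self.cells = []         # text of each matching cell, in document order

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "td" and dict(attrs).get("style") == "text-align:right;":
            self.in_target = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_target = False

    def handle_data(self, data):
        if self.in_target:
            self.cells[-1] += data


# Shortened version of the HTML from the question (links removed)
html = """<tr><td width='150px'><strong>报表日期</strong></td><td style='text-align:right;'>2013-03-31</td></tr>
<tr><td style='padding-left:30px' width='150px'>货币资金</td><td style='text-align:right;'>24,804,000,000.00</td></tr>
<tr><td style='padding-left:30px' width='150px'>交易性金融资产</td><td style='text-align:right;'>1,511,750,000.00</td></tr>"""

parser = RightAlignedCells()
parser.feed(html)
parser.close()
print(parser.cells)  # ['2013-03-31', '24,804,000,000.00', '1,511,750,000.00']
```

The date cell is included because it carries the same style, which is why a further filter on the cell text is needed.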

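3. (I only just learned this myself.)

Next, use the text parameter to match against the string of each node (soup.string).

findAll in BeautifulSoup accepts regular expressions (re), so you can use:

findAll(name="td", attrs={"style":"text-align:right;"}, text=re.compile(r"\d+(,\d+)*\.\d+"))

to match only the two monetary values you need:

24,804,000,000.00

1,511,750,000.00

(Note: what you get back is these two strings, not the two complete td elements:

<td style='text-align:right;'>24,804,000,000.00</td>

<td style='text-align:right;'>1,511,750,000.00</td>

)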
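
The pattern can be checked on its own with the standard re module before it is handed to BeautifulSoup. The following is a small sketch in Python 3; the sample list is made up for illustration, and using search() here is an assumption about how the library applies the pattern.

```python
import re

# Digits, optional comma-separated groups, then a mandatory decimal part
money = re.compile(r"\d+(,\d+)*\.\d+")

samples = [
    "24,804,000,000.00",
    "1,511,750,000.00",
    "123,000.456",
    "23400.456",
    "2013-03-31",  # the report date: no decimal point, so it should not match
    "货币资金",      # a row label: no digits at all
]

matched = [s for s in samples if money.search(s)]
print(matched)
# ['24,804,000,000.00', '1,511,750,000.00', '123,000.456', '23400.456']
```

Because the pattern requires a decimal point followed by digits, the date 2013-03-31 is rejected even though it sits in a td with the same style.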

4. The complete code:

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Function:
[Answer] Extracting target data with BeautifulSoup

Author:     Crifan Li
Version:    2013-06-06
Contact:    https://www.crifan.org/contact_me
"""

# Written for Python 2 with BeautifulSoup 3 (the `BeautifulSoup` module).
import re
from BeautifulSoup import BeautifulSoup

def beautifulsoup_capture_money():
    """
        1. answer someone else's question
        2. demo BeautifulSoup usage: findAll(text=xxx)
    """
    html = """<tr>
    <td width='150px'><strong>报表日期</strong></td>
    <td style='text-align:right;'>2013-03-31</td>
</tr>
<tr>
</tr>
<tr>
    <td colspan='5'><strong>流动资产</strong></td>
</tr>
<tr>
    <td style='padding-left:30px' width='150px'>
        <a target='_blank' href='/corp/view/vFD_FinanceSummaryHistory.php?stockid=002024&type=cbsheet1'>货币资金</a>
    </td>
    <td style='text-align:right;'>24,804,000,000.00</td>
</tr>
<tr>
    <td style='padding-left:30px' width='150px'><a target='_blank' href='/corp/view/vFD_FinanceSummaryHistory.php?stockid=002024&type=cbsheet110'>交易性金融资产</a></td>
    <td style='text-align:right;'>1,511,750,000.00</td>
</tr>
</tbody>"""
    soup = BeautifulSoup(html)

    # The pattern \d+(,\d+)*\.\d+ matches, for example:
    #   24,804,000,000.00
    #   1,511,750,000.00
    #   123,750,000.00
    #   123,000.456
    #   23400.456
    #   ...

    foundTds = soup.findAll(name="td", attrs={"style":"text-align:right;"}, text=re.compile(r"\d+(,\d+)*\.\d+"))

    # Note: the result holds only the strings that match the pattern, not the whole td tags
    print "foundTds=", foundTds  # foundTds= [u'24,804,000,000.00', u'1,511,750,000.00']
    if foundTds:
        for eachMoney in foundTds:
            print "eachMoney=", eachMoney
            # eachMoney= 24,804,000,000.00
            # eachMoney= 1,511,750,000.00

if __name__ == "__main__":
    beautifulsoup_capture_money()

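[Summary]

findAll in BeautifulSoup also accepts a text argument, which is matched against the string value of the nodes in the soup.

Note that what comes back is not the whole HTML tag (here, not the complete td node),

but the string itself (here, each node string that matches re.compile(r"\d+(,\d+)*\.\d+")). According to the BeautifulSoup 3 documentation, name and the attribute filters are ignored once text is given, so in this example it is the pattern alone that excludes the date 2013-03-31.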
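
Since the values come back as strings, a natural next step is to turn them into numbers. The following is only a sketch in Python 3 using the standard decimal module; the helper name parse_money is hypothetical and not part of the original code.

```python
from decimal import Decimal


def parse_money(text):
    """Convert a string such as '24,804,000,000.00' into a Decimal."""
    # Drop the thousands separators; Decimal keeps the two decimal places exactly
    return Decimal(text.replace(",", ""))


found = ["24,804,000,000.00", "1,511,750,000.00"]
values = [parse_money(s) for s in found]

print(values)       # [Decimal('24804000000.00'), Decimal('1511750000.00')]
print(sum(values))  # 26315750000.00
```

Decimal is used rather than float so that amounts of this size keep their exact cents.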

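Note:

If you are not familiar with findAll, see my tutorial:

[Tutorial] BeautifulSoup, a third-party Python library for parsing HTML

which points to the tutorial on the official BeautifulSoup site, where the signature is given as:

findAll(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)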
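
The signature above belongs to BeautifulSoup 3. For readers on the newer bs4 package, here is a hedged sketch of the equivalent search, assuming bs4 is installed and Python 3 is used; the sample HTML is a shortened, made-up version of the one in the question. One difference worth knowing: in bs4, combining a tag name with string= returns the matching Tag objects, while string= on its own returns the strings.

```python
import re
from bs4 import BeautifulSoup  # bs4, not the BeautifulSoup 3 module used in this post

# Shortened sample of the HTML from the question (links removed)
html = """<table>
<tr><td width='150px'><strong>报表日期</strong></td><td style='text-align:right;'>2013-03-31</td></tr>
<tr><td style='padding-left:30px' width='150px'>货币资金</td><td style='text-align:right;'>24,804,000,000.00</td></tr>
<tr><td style='padding-left:30px' width='150px'>交易性金融资产</td><td style='text-align:right;'>1,511,750,000.00</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
money = re.compile(r"\d+(,\d+)*\.\d+")

# string= alone: the result is a list of matching strings
strings = soup.find_all(string=money)

# tag name + attrs + string=: the result is a list of <td> Tag objects
cells = soup.find_all("td", attrs={"style": "text-align:right;"}, string=money)
values = [cell.get_text() for cell in cells]

print([str(s) for s in strings])  # ['24,804,000,000.00', '1,511,750,000.00']
print(values)                     # ['24,804,000,000.00', '1,511,750,000.00']
```

Either form yields the two amounts; the second keeps the td nodes around in case their attributes are needed as well.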

When reposting, please credit: 在路上 » [Answer] Extracting target data with BeautifulSoup

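With soup.prettify() there is no need to format the HTML by hand.

Hugo (2014-02-17)

Hello. In this example, what should I do if I want to get the value of an attribute? For instance, the style value of the elements whose width is 150px? How should I select and print it? Many thanks!!

Methanol (2013-09-09)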

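Option 1: first search for width equal to 150px, which finds the three matching td nodes, then check whether each node has a style attribute. That leaves the two td elements that carry a style. Code:
```
width150pxSoupList = soup.findAll(attrs={"width":"150px"})
for width150pxSoup in width150pxSoupList:
    # "style" in tag would test the tag's children, not its attributes,
    # so use get(), which returns None when the attribute is missing
    if width150pxSoup.get("style") is not None:
        print 'width150pxSoup["style"]=', width150pxSoup["style"]
```
Option 2: alternatively, search directly for width equal to 150px together with style set to any value, which also finds the two td elements that have a style. Code:
```
width150pxStyleSoupList = soup.findAll(attrs={"width":"150px", "style":True})
for width150pxStyleSoup in width150pxStyleSoupList:
    print 'width150pxStyleSoup["style"]=', width150pxStyleSoup["style"]
```
crifan (2013-09-09)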
- That was fast! Follow-up: is the name width150pxStyleSoup just chosen arbitrarily?
  Methanol (2013-09-09)
  - I figured it out, thanks!!
    Methanol (2013-09-09)
