About regular expressions

菜头叔 2013-07-16 05:59:10
<div aria-label="5星, 747 份评分" class="rating" role="img" tabindex="-1">
<div>
<span class="rating-star">
</span>
<span class="rating-star">
</span>
<span class="rating-star">
</span>
<span class="rating-star">
</span>
<span class="rating-star">
</span>
</div>
<span class="rating-count">
747 份评分
</span>
</div>
This is a string, and I want to pull out the "5星, 747 份评分" (5 stars, 747 ratings) part. Please help!
jongsuny 2013-09-03
aria-label=\"([^"]+)\" and then take the first match.

If you want to parse it with a regex:

import re
arrs = re.findall(r'(?is)<div aria-label="(.*?)".*?>', content)
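For anyone who wants to try the two regex suggestions above, here is a minimal Python 3 sketch using only the standard re module. The helper name first_aria_label and the sample string are made up for illustration, and it assumes the attribute value is wrapped in double quotes and contains no double quote itself.

```python
import re

# Sample markup copied from the question; only the opening tag matters here.
html = '<div aria-label="5星, 747 份评分" class="rating" role="img" tabindex="-1">'

# [^"]+ stops at the closing quote, so no non-greedy match or DOTALL flag is needed.
ARIA_LABEL = re.compile(r'aria-label="([^"]+)"')


def first_aria_label(text):
    """Return the first aria-label value found in text, or None if there is none."""
    match = ARIA_LABEL.search(text)
    return match.group(1) if match else None


print(first_aria_label(html))
```

Using search instead of findall avoids building a list when only the first match is wanted. For real pages an HTML parser is usually more robust than a regex.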
crifan 2013-07-17
Borrowing from the other reply, using find directly is more concise:
#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
Function:
[Summary] Use BeautifulSoup to find a tag whose attribute value is unknown
http://www.crifan.com/python_use_beautifulsoup_find_tag_with_unknown_attribute_value/

Author:     Crifan Li
Version:    2013-07-17
Contact:    http://www.crifan.com/about/me/
"""

from BeautifulSoup import BeautifulSoup

def beautifulsoup_tag_attr_unknown():
    """
        Demo: use BeautifulSoup to find a tag whose attribute value is unknown
    """
    html = """<div aria-label="5星, 747 份评分" class="rating" role="img" tabindex="-1">
 <div>
  <span class="rating-star">
  </span>
  <span class="rating-star">
  </span>
  <span class="rating-star">
  </span>
  <span class="rating-star">
  </span>
  <span class="rating-star">
  </span>
 </div>
 <span class="rating-count">
  747 份评分
 </span>
</div>"""
    soup = BeautifulSoup(html)

    # attrs={"aria-label": True} matches a div that has the attribute, whatever its value
    foundDiv = soup.find(name="div", attrs={"aria-label": True})
    attrVal = foundDiv['aria-label']
    print "attrVal=", attrVal  # attrVal= 5星, 747 份评分

if __name__ == "__main__":
    beautifulsoup_tag_attr_unknown()
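If BeautifulSoup is not available, the same idea (take the first tag that carries a given attribute, whatever its value) can be sketched with the standard library's html.parser in Python 3. This is not the code from the post above. The class FirstAttrFinder and the function find_attr are hypothetical names, and the sample HTML is a shortened copy of the one in the question.

```python
from html.parser import HTMLParser


class FirstAttrFinder(HTMLParser):
    """Remember the value of the first occurrence of a given attribute."""

    def __init__(self, attr_name):
        super().__init__()
        self.attr_name = attr_name
        self.value = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if self.value is None:
            for name, value in attrs:
                if name == self.attr_name:
                    self.value = value
                    break


def find_attr(html, attr_name):
    """Return the value of the first attr_name attribute in html, or None."""
    finder = FirstAttrFinder(attr_name)
    finder.feed(html)
    finder.close()
    return finder.value


html = """<div aria-label="5星, 747 份评分" class="rating" role="img" tabindex="-1">
 <div>
  <span class="rating-star"></span>
 </div>
 <span class="rating-count">747 份评分</span>
</div>"""

print(find_attr(html, "aria-label"))
```

The parser walks the tags in document order, so the outer div is seen first and its aria-label is the value returned.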
panghuhu250 2013-07-16
1. BeautifulSoup's find and findAll accept an attrs parameter; passing {'aria-label': True} returns every div that has an aria-label attribute.
2. Every BeautifulSoup node behaves like a dict, so x['aria-label'] gives the value of aria-label.

In [8]: from BeautifulSoup import BeautifulSoup

In [9]: root = BeautifulSoup(u"""<div aria-label="5星, 747 份评分" class="rating" role="img" tabindex="-1">
   ...:  <div>
   ...:   <span class="rating-star">
   ...:   </span>
   ...:   <span class="rating-star">
   ...:   </span>
   ...:   <span class="rating-star">
   ...:   </span>
   ...:   <span class="rating-star">
   ...:   </span>
   ...:   <span class="rating-star">
   ...:   </span>
   ...:  </div>
   ...:  <span class="rating-count">
   ...:   747 份评分
   ...:  </span>
   ...: </div>""")

In [10]: rating = root.findAll(attrs={"aria-lable": True})

In [11]: rating
Out[11]: []

In [12]: rating = root.findAll(attrs={"aria-label": True})

In [13]: rating
Out[13]: 
[<div aria-label="5星, 747 份评分" class="rating" role="img" tabindex="-1">
<div>
<span class="rating-star">
</span>
<span class="rating-star">
</span>
<span class="rating-star">
</span>
<span class="rating-star">
</span>
<span class="rating-star">
</span>
</div>
<span class="rating-count">
  747 份评分
 </span>
</div>]

In [14]: rating[0]['aria-label']
Out[14]: u'5\u661f, 747 \u4efd\u8bc4\u5206'

In [15]: print rating[0]['aria-label']
5星, 747 份评分
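If the two numbers are needed separately rather than the whole label, the extracted string can be split with one more regex. This is a rough Python 3 sketch. The function name parse_rating is made up, and the layout it expects (a number, 星, a comma, a number, 份评分) is an assumption based on the single sample in this thread. Allowing a fractional star value is also a guess.

```python
import re

# Assumed layout: <number>星, <number> 份评分 (ASCII or full-width comma).
RATING = re.compile(r'\s*(\d+(?:\.\d+)?)\s*星\s*[,,]\s*(\d+)\s*份评分')


def parse_rating(label):
    """Split a label such as '5星, 747 份评分' into (stars, rating_count)."""
    match = RATING.match(label)
    if match is None:
        return None  # the label does not follow the assumed layout
    return float(match.group(1)), int(match.group(2))


print(parse_rating("5星, 747 份评分"))  # (5.0, 747)
```

Returning None for an unexpected layout lets the caller decide how to handle labels that do not match, instead of raising inside the helper.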
菜头叔 2013-07-16
I got this data with BeautifulSoup in the first place, so extracting "5星, 747 份评分" with BeautifulSoup works for me too.