re fails to extract any data, please help

weixin_46609022 2020-07-08 12:27:28

The relevant HTML I extracted:
<tr onclick="location.href='/city/sz.html';" style="cursor: pointer;">
<th>1</th>
<th>
<a href="/city/sz.html" title="深圳房价行情，房价概况走势，数据分析"> 深圳</a>
</th>
<th>74,929</th>
<th class="red">+18.96%</th>
<th class="red">+2.86%</th>
</tr>, <tr onclick="location.href='/city/bj.html';" style="cursor: pointer;">
<th>2</th>
<th>
<a href="/city/bj.html" title="北京房价行情，房价概况走势，数据分析"> 北京</a>
</th>
<th>62,567</th>
<th class="green">-2.09%</th>
<th class="green">-4.76%</th>
</tr>, <tr onclick="location.href='/city/sh.html';" style="cursor: pointer;">
...... (the remaining rows follow the same pattern)

My code:
import requests
from tool import useragenttool
import bs4
import re
import openpyxl

def open_url(url):
    """Fetch the URL and return the response containing the page source."""
    res = requests.get(url, headers=useragenttool.get_headers())
    return res

def find_data(res):
    datas = []
    soup = bs4.BeautifulSoup(res.text, "html.parser")
    content = soup.find(class_="gb-dataListBox")
    # print(content)
    target = content.find_all("tr", style="cursor: pointer;")
    # print(target)
    target = iter(target)

    for each in target:
        # print(each.text)
        if each.text.isnumeric():
            datas.append([
                re.search(r'(.+)', next(target).text).group(1),
                re.search(r'\d.*', next(target).text).group(),
                re.search(r'\d.*', next(target).text).group(),
                re.search(r'\d.*', next(target).text).group()])
    print(datas)

    return datas

def main():
    url = "https://www.creprice.cn/rank/cityforsale.html"
    res = open_url(url)
    datas = find_data(res)

if __name__ == '__main__':
    main()

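Why is the datas list empty when I print(datas)? I want to scrape the city, the house price, and the two percentages that follow it. I'm a beginner and can't figure this out, so any help is appreciated.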

3 replies

AutumnSea03 2020-07-28

Could this thread be marked as resolved?

weixin_46609022 2020-07-10

Thanks a lot!!

AutumnSea03 2020-07-09

import requests
import bs4
import re


def find_data():
    head = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36',
        'Connection': 'keep-alive'}
    res = requests.get('https://www.creprice.cn/rank/cityforsale.html',headers=head)
    content = bs4.BeautifulSoup(res.text, "html.parser").find(class_="gb-dataListBox")
    target = content.find_all("tr", style="cursor: pointer;")
    info_list = []
    for each in target:
        tmp_dic = dict()
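        # First run of non-Latin-1 characters, i.e. the Chinese city name
        city = re.search(r'[^\x00-\xff]+', each.text).group()
        # Price with a thousands separator, e.g. 74,929
        price = re.search(r'\d+,\d+', each.text).group()
        # Both signed percentages, in the order they appear on the page
        rate = re.findall(r'[+-]\d+(?:\.\d+)?%', each.text)
        tmp_dic[city] = [price, rate[0], rate[1]]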
        info_list.append(tmp_dic)
    print(info_list)

if __name__ == '__main__':
    find_data()

[{'深圳': ['74,929', '+18.96%', '+2.86%']}, {'北京': ['62,567', '-2.09%', '-4.76%']}, {'上海': ['54,911', '+5.85%', '-0.25%']}, {'厦门': ['47,817', '+5.66%', '+0.27%']}, {'三亚': ['38,291', '+12.01%', '+3.72%']}, {'广州': ['35,934', '+6.13%', '+5.43%']}, {'杭州': ['31,487', '+4.1%', '+3.1%']}, {'南京': ['31,416', '+2.87%', '-0.24%']}, {'福州': ['26,288', '+0.55%', '+1.78%']}, {'天津': ['25,751', '+0.14%', '+1.4%']}, {'宁波': ['23,544', '+15.65%', '+0.5%']}, {'珠海': ['23,473', '+1.43%', '-0.37%']}, {'苏州': ['23,294', '+6.32%', '-1.96%']}, {'青岛': ['21,890', '+1.65%', '+0.76%']}, {'温州': ['21,777', '+7.11%', '-1.31%']}, {'丽水': ['19,428', '+7.9%', '-2.74%']}, {'武汉': ['18,942', '+4.89%', '+0.3%']}, {'东莞': ['17,921', '+11.79%', '+0.86%']}, {'金华': ['17,279', '+5.54%', '-0.69%']}, {'成都': ['16,726', '+7.34%', '+3.11%']}, {'无锡': ['16,675', '+12.46%', '+0.13%']}, {'合肥': ['16,500', '+4.93%', '-0.73%']},...., {'鹤岗': ['2,307', '-2.19%', '-2.92%']}]
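
On why the original list came out empty: each element of target is a whole <tr>, so each.text is the text of every cell in the row joined together, and str.isnumeric() is only True when every character is numeric. The test therefore fails for every row and nothing is ever appended. Here is a minimal sketch of that, using only the standard library. The row_text string is my assumption of what .text returns for the first row quoted in the question, and parse_row is a hypothetical helper, not part of the code above.

```python
# Assumed value of each.text for the first <tr> quoted in the question
# (the whitespace layout is an assumption based on the posted markup).
row_text = "\n1\n\n 深圳\n\n74,929\n+18.96%\n+2.86%\n"

# The whole row contains newlines, Chinese characters, commas and '%',
# so the `if each.text.isnumeric()` branch never runs.
print(row_text.isnumeric())   # False
# Only a single cell such as the rank would pass that test.
print("1".isnumeric())        # True


def parse_row(text):
    """Hypothetical helper: split a row's text into its non-empty cells."""
    cells = [line.strip() for line in text.splitlines() if line.strip()]
    # Expected cell order: rank, city, price, then the percentage columns.
    rank, city, price, *rates = cells
    return {"rank": int(rank), "city": city, "price": price, "rates": rates}


print(parse_row(row_text))
```

Splitting by line avoids regular expressions entirely, but it relies on each cell sitting on its own line, as in the posted markup. If the page layout differs, iterating over the <th> cells of each row is likely the more robust route.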
