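The code is meant to parse this page: https://so.eastmoney.com/web/s?keyword=600759
and extract the following from it:
综合评分 (overall score) 54
今日表现 (today's performance) +0.09
打败了 15.52% 的股票 (beats 15.52% of stocks)
The problem is that no text is extracted at all, and I have no way to verify whether the XPath is correct (even though I got it from Chrome's "Copy XPath"). Could someone who knows XPath help me fix it?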
import random
import time

import requests
from lxml import etree


def get_pages():
    _url = 'https://so.eastmoney.com/web/s?keyword=600759'
    print('Crawling:', _url)
    # Spoof the User-Agent header
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = requests.get(_url, headers=headers)
    print(req.text)
    html2 = etree.HTML(req.content)
    time.sleep(random.randint(0, 3))
    # XPath copied from Chrome
    ttt = html2.xpath('/html/body/div[1]/div[3]/div[1]/div[2]/div[3]/div[2]/div/div[1]/div[4]/span/text()')
    print(type(html2))
    print(html2)
    print(len(ttt))  # number of text nodes matched by the XPath
    # for t in ttt:
    #     print(str(t))
    # result = etree.tostring(html2)
    # print(result.decode('utf-8'))
    return html2


if __name__ == "__main__":
    html = get_pages()
This is what the console prints:
<class 'lxml.etree._Element'>
<Element html at 0x2267db310c0>
0
Process finished with exit code 0
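Fetching this page as static HTML with requests will not work. Right-click the page, choose "View page source", and search for 综合评分: it is not there, which suggests the score is filled in by JavaScript after the page loads. The XPath copied from Chrome describes the rendered DOM, not the raw response, so it matches nothing in what requests downloads. You could try Selenium instead.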
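To make the suggestion above concrete, here is a minimal sketch rather than a tested solution for this particular site. It assumes Selenium 4 and a local Chrome are installed, and that the rendered page's visible text contains the labels in roughly the form quoted in the question. The helper names `missing_labels`, `extract_scores`, and `fetch_rendered_text` are made up for illustration. It reads the rendered text of the whole page and pulls the numbers out with regular expressions, which avoids depending on a fragile absolute XPath.

```python
import re

URL = "https://so.eastmoney.com/web/s?keyword=600759"


def missing_labels(html_text, labels):
    """Return the labels that do not occur in the given HTML text.

    Passing requests.get(...).text here is a quick way to confirm that
    the data is absent from the static response.
    """
    return [label for label in labels if label not in html_text]


def extract_scores(visible_text):
    """Pull the three values out of the page's visible text.

    The patterns assume the labels appear as quoted in the question;
    a value is None when its label is not found.
    """
    patterns = {
        "score": r"综合评分\s*(\d+(?:\.\d+)?)",
        "today": r"今日表现\s*([+-]?\d+(?:\.\d+)?)",
        "beat_percent": r"打败了\s*(\d+(?:\.\d+)?)\s*%",
    }
    result = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, visible_text)
        result[key] = match.group(1) if match else None
    return result


def fetch_rendered_text(url, wait_seconds=15):
    """Load the page in headless Chrome and return its visible text."""
    # Imported here so the helpers above work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the label shows up, i.e. the scripts have rendered it.
        WebDriverWait(driver, wait_seconds).until(
            EC.text_to_be_present_in_element((By.TAG_NAME, "body"), "综合评分")
        )
        return driver.find_element(By.TAG_NAME, "body").text
    finally:
        driver.quit()
```

Calling `extract_scores(fetch_rendered_text(URL))` would then return the three values as strings, provided the page still renders them that way. If the wait times out, the label text or the page layout has probably changed and the patterns need adjusting.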