GitHub repository for this assignment: https://github.com/wkkkjjjj/032004126.git
PSP2.1 | Personal Software Process Stages | Estimated time (minutes) | Actual time (minutes) |
---|---|---|---|
Planning | Planning | 5 | 6 |
Estimate | Estimate how long the task will take | 600 | 1000 |
Development | Development | 600 | 1000 |
Analysis | Requirements analysis (including learning new technologies) | 10 | 80 |
Design Spec | Write the design document | 10 | 20 |
Design Review | Design review | 150 | 450 |
Coding Standard | Coding standard (define suitable conventions for this development) | 5 | 5 |
Design | Detailed design | 20 | 25 |
Coding | Coding | 130 | 150 |
Code Review | Code review | 190 | 100 |
Test | Testing (self-test, fix code, commit changes) | 10 | 20 |
Reporting | Reporting | 190 | 150 |
Test Report | Test report | 20 | 15 |
Size Measurement | Measure the amount of work | 90 | 50 |
Postmortem & Process Improvement Plan | Postmortem and process improvement plan | | |
Total | | 2220 | 3071 |
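1. Get the bvid values from the HTML
2. Get the cid from each bvid
3. Use the cid to fetch the danmaku XML file and extract the danmaku
4. Use a DataFrame to find the top 20 danmaku and build a word cloud
The first three steps mainly rely on the requests library, regular expressions and some front-end knowledge. The fourth step uses matplotlib, pandas DataFrames and the wordcloud library, plus some file handling.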
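1. First copy the source of the search result pages (omitted here because it is too long), then use a regular expression to collect the first 300 BV ids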
import re

# doc1..doc6 hold the copied source of the six search result pages
pages = [doc1, doc2, doc3, doc4, doc5, doc6]
# num is how many more videos are still needed
num = 300
bvlist = []
for doc in pages:
    # [0-z] also matches punctuation such as ':' and '_', so list the
    # letter and digit ranges explicitly
    found = re.findall(r'BV[0-9A-Za-z]+', doc)
    taken = found[:num]  # take at most num ids from this page
    bvlist.extend(taken)
    num -= len(taken)
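A search result page may link to the same video more than once (for example from both the thumbnail and the title), in which case the list above would contain repeated BV ids. The following is a minimal sketch of an order-preserving dedupe. The helper name `extract_bvids` and the HTML sample are made up for illustration and are not part of the original script.

```python
import re

# Assumption: a BV id is "BV" followed by letters and digits only.
BV_PATTERN = re.compile(r"BV[0-9A-Za-z]+")

def extract_bvids(pages, limit=300):
    """Return up to `limit` distinct BV ids, in first-seen order."""
    seen = set()
    result = []
    for doc in pages:
        for bvid in BV_PATTERN.findall(doc):
            if bvid not in seen:
                seen.add(bvid)
                result.append(bvid)
                if len(result) == limit:
                    return result
    return result

# Made-up sample: the first video is linked twice.
sample = (
    '<a href="//www.bilibili.com/video/BV1t94y147Fk/">thumb</a>'
    '<a href="//www.bilibili.com/video/BV1t94y147Fk/">title</a>'
    '<a href="//www.bilibili.com/video/BV1xx411c7mD/">title</a>'
)
print(extract_bvids([sample]))  # ['BV1t94y147Fk', 'BV1xx411c7mD']
```

With a set tracking what has been seen, 300 ids always means 300 different videos.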
2. Define the cid function, which looks up the cid for a BV id
import requests

# Given a BV id, return the corresponding cid (of the video's first part)
def cid(bvid):
    url = 'https://api.bilibili.com/x/player/pagelist?bvid=' + bvid + '&jsonp=jsonp'
    headers = {
        # Cookie string copied from a logged-in browser session (value omitted here)
        'Cookie': '...',
        'origin': 'https://www.bilibili.com',
        'referer': 'https://www.bilibili.com/video/BV1t94y147Fk/',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36',
    }
    response = requests.get(url, headers=headers)
    # parse the JSON body directly; decoding it with 'unicode-escape' first
    # can corrupt escaped quotes and non-ASCII text
    dictdata = response.json()
    return dictdata['data'][0]['cid']
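The function above raises an exception when the API reports an error or returns an empty list. Below is a small sketch of a more defensive lookup, assuming the response has the shape `{"code": 0, "data": [{"cid": ...}, ...]}`. The helper name `first_cid` and both sample payloads are illustrative and were not captured from the real API.

```python
import json

def first_cid(payload):
    """Return the cid of the first part, or None if no part is available."""
    # Assumption: code == 0 means success.
    if payload.get("code") != 0:
        return None
    parts = payload.get("data") or []
    return parts[0]["cid"] if parts else None

# Illustrative payloads, not real API output.
ok = json.loads('{"code": 0, "message": "0", "data": [{"cid": 123456, "page": 1}]}')
bad = json.loads('{"code": -400, "message": "error", "data": null}')

print(first_cid(ok))   # 123456
print(first_cid(bad))  # None
```

Returning None lets the caller skip a video instead of aborting the whole run of 300.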
3. Define the wordlist function, which calls cid and, given a bvid, appends that video's danmaku to 'B站弹幕.csv'
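import csv
import re

import requests

# Append the danmaku of the video with the given bvid to 'B站弹幕.csv'
def wordlist(bvid):
    newcid = cid(bvid)
    url = f'https://api.bilibili.com/x/v1/dm/list.so?oid={newcid}'
    response = requests.get(url)
    response.encoding = 'utf-8'
    # every run of Chinese characters in the XML becomes one entry
    result = re.findall('[\u4E00-\u9FA5]+', response.text)
    with open('B站弹幕.csv', mode='a', encoding='utf-8', newline='') as file:
        csv_writer = csv.writer(file)
        for i in result:
            # one row per entry, with an initial count of 1
            csv_writer.writerow([i, 1])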
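The regular expression above keeps only runs of Chinese characters, so a danmaku containing punctuation is split into several entries, and one without Chinese characters is dropped. As an alternative sketch, the XML can be parsed so that each danmaku stays whole. This assumes each danmaku sits in a `<d>` element under the root; the helper name `danmaku_texts` and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

def danmaku_texts(xml_text):
    """Return the text of every non-empty <d> element in a danmaku XML document."""
    root = ET.fromstring(xml_text)
    return [d.text for d in root.iter("d") if d.text]

# Made-up sample in the assumed layout.
sample_xml = (
    '<i>'
    '<d p="1.5,1,25,16777215,0,0,0,0">你好,世界!</d>'
    '<d p="3.0,1,25,16777215,0,0,0,0">hello 666</d>'
    '</i>'
)
print(danmaku_texts(sample_xml))  # ['你好,世界!', 'hello 666']
```

For a real response, passing `response.content` (bytes) to `ET.fromstring` lets the parser honour the encoding declared in the file.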
4. Define the countword function, which loops over the 300 videos and stores all of their danmaku in 'B站弹幕.csv'
import pandas as pd

# Collect the danmaku and count them
def countword():
    # empty the file first
    with open('B站弹幕.csv', mode='w', encoding='utf-8') as f:
        pass
    for bv in bvlist:
        wordlist(bv)
    df = pd.read_csv('./B站弹幕.csv', header=None, names=['弹幕', '数量'])
    print(df)
    # sum the counts of identical danmaku
    gr = df.groupby(by='弹幕')
    print(gr.agg({'数量': 'sum'}))

countword()
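countword only prints the totals; the code that selects the top 20 and writes 'B站弹幕top20.csv' is not shown in the post. One possible way to do it, sketched with the standard library's `collections.Counter` rather than the author's pandas code, follows. The function name `save_top` and the demo path are hypothetical; the header row uses the column names '弹幕' and '数量' from the post.

```python
import csv
import os
import tempfile
from collections import Counter

def save_top(danmaku, path, n=20):
    """Count identical danmaku and write the n most frequent ones to a CSV file."""
    top = Counter(danmaku).most_common(n)
    with open(path, mode="w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["弹幕", "数量"])  # header row with the post's column names
        writer.writerows(top)
    return top

# Demo on a tiny made-up list, written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "top_demo.csv")
print(save_top(["哈哈", "哈哈", "好", "哈哈", "好", "来了"], demo_path, n=2))
# [('哈哈', 3), ('好', 2)]
```

Writing the header row matters if the file is later read with `index_col='弹幕'`, which looks the column up by name.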
5. Finally, save the top 20 to 'B站弹幕top20.csv' and build the word cloud with wordcloud
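import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import wordcloud

def show():
    # use a font with Chinese glyphs so the labels render ('SeiHei' was a typo)
    matplotlib.rcParams['font.family'] = 'SimHei'
    matplotlib.rcParams['font.sans-serif'] = ['SimHei']
    df = pd.read_csv('B站弹幕top20.csv', index_col='弹幕')
    df.plot(kind='bar')
    plt.xlabel('弹幕')
    plt.ylabel('数量')
    plt.title('统计结果')
    plt.show()
    w = wordcloud.WordCloud(width=500, height=500, background_color='black', font_path="msyh.ttc")
    # mylist was not defined in this function; assumption: the cloud is built
    # from every danmaku collected in 'B站弹幕.csv'
    words = pd.read_csv('B站弹幕.csv', header=None, names=['弹幕', '数量'])['弹幕']
    w.generate(" ".join(words.astype(str)))
    w.to_file("云词.png")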
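ncalls
Number of calls
tottime
Total time spent in the given function (excluding time spent in calls to sub-functions)
percall
tottime divided by ncalls
cumtime
Cumulative time spent in this function and all of its sub-functions (from call to exit)
percall
cumtime divided by the number of primitive calls
filename:lineno(function)
Identifies each function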
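These column names match the report printed by Python's built-in cProfile and pstats modules. The post does not show how the profile was produced, so the following is only a sketch of one way to get such a report. The `busy` function is a stand-in workload; in the real script the profiled call would be countword().

```python
import cProfile
import io
import pstats

def busy(n):
    """Stand-in workload used only for this demonstration."""
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
busy(100_000)
profiler.disable()

# Collect the report in a string, sorted by cumulative time, top 5 rows only.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time puts the functions that dominate the run, including everything they call, at the top of the list.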
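The profile shows that most of the time is spent in request and send, the two functions that fetch data over the network. Both are reached through the cid and wordlist functions defined in steps 2 and 3.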
By a conservative estimate they account for 90% of the running time. I think performance could be improved with multithreading.
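Because the time is spent waiting on the network, a thread pool is a natural fit. Here is a minimal sketch using the standard library's `concurrent.futures`. The names `fetch_all` and `fake_fetch` are hypothetical, and `fake_fetch` stands in for a function that would download and return one video's danmaku.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(bvids, fetch, max_workers=8):
    """Call fetch(bvid) for every id on a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, bvids))

def fake_fetch(bvid):
    """Placeholder for a real download; returns a list of danmaku strings."""
    return [bvid + "-danmaku"]

results = fetch_all(["BV1", "BV2", "BV3"], fake_fetch)
print(results)  # [['BV1-danmaku'], ['BV2-danmaku'], ['BV3-danmaku']]
```

In this design the workers only return data and the main thread writes the CSV afterwards, so several threads never append to the same file at once.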
Visualization
1. A word cloud made with the wordcloud library
2. A bar chart drawn with matplotlib from the CSV loaded into a DataFrame
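This assignment let me practice the basics of web scraping and strengthened my Python programming. I also learned how to get past network restrictions and picked up some GitHub usage. My programming skills are on the weak side, so the task was hard for me, but it improved my abilities and was a very worthwhile assignment.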