Python multithreading problem: no setDaemon, flag-based control, and the threads won't stop!

z752964360 2011-12-17 04:42:19
First, a simple example to verify whether a flag can control a thread at all:

from threading import Thread, Lock
import time

a = True

class A:
    def __init__(self):
        t = Thread(target=self.run)
        t.start()

    def run(self):
        print 'im starting'
        while a:
            print a

if __name__ == "__main__":
    aa = A()
    time.sleep(1)
    a = False

This example runs fine, which shows that a child thread can be stopped through a flag.

The next one is a crawler that fetches pages to a given depth: it extracts all the links on the front page and then fetches the content of those links. The shutdown mechanism is the same as in the example above, but it does not terminate properly.

#coding=utf-8
from BeautifulSoup import BeautifulSoup
import urllib2
from threading import Thread, Lock
from Queue import Queue
import time
import socket
socket.setdefaulttimeout(5)

flag = True  # flag that is supposed to stop the child threads

class Fetcher:  # wraps the fetch threads and the result-handling thread
    def __init__(self, th_num):
        self.opener = urllib2.build_opener(urllib2.HTTPHandler)
        self.q_req = Queue()  # task queue
        self.q_ans = Queue()  # result-handling queue
        self.urls = []  # collects the urls to crawl at the next depth
        for i in range(th_num):  # start the fetch threads
            t = Thread(target=self.thread_get)
            t.start()
        for i in range(0, 1):
            t = Thread(target=self.thread_put)
            t.start()

    def join(self):  # wait for both queues to drain
        self.q_req.join()
        self.q_ans.join()
        print '=====================im done'  # this does get printed at runtime

    def push(self, req):  # add a task to the task queue
        self.q_req.put(req)

    def thread_put(self):  # handles results: stores each page's urls in self.urls
        while flag:
            print flag  # debug: if this thread hadn't exited, it should keep printing the flag
            try:
                url = self.q_ans.get()
                self.urls.extend(url)
            except Exception, e:
                print e, 'other,excp========in=put'
            finally:
                self.q_ans.task_done()

    def thread_get(self):
        while flag:
            print flag
            try:
                req = self.q_req.get()
                urls = []
                ans = self.opener.open(req).read()  # fetch the page
                soup = BeautifulSoup(ans)  # parse the page
                for a in soup.findAll('a'):  # extract all links on the page
                    try:
                        if a['href'].startswith('http'):
                            urls.append(a['href'])
                    except Exception, ex:
                        print ex, '========================Exception=in=soup=findAll'
                self.q_ans.put(urls)  # hand the links to the put thread via the result queue
            except Exception, e:
                print e, 'other--exception----------in- threadget----'
            finally:
                print '--------------------'
                self.q_req.task_done()
        print "----------get---quiting"  # this is never printed

def run(f, links):  # clears urls before the next round, loads the next round's tasks, waits for completion
    f.urls = []
    for url in links:
        f.push(url)
    f.join()
    return f.urls

if __name__ == "__main__":
    links = ['http://www.kingdowin.com/',]
    deep = 2  # crawl depth
    f = Fetcher(10)
    while deep > 0:  # together with run(), this controls the crawl depth
        urls = run(f, links)
        deep -= 1
        links = urls
        print len(links)

    time.sleep(1)
    flag = False  # supposed to shut the threads down
    time.sleep(1)
    print "Exiting Main Thread"  # this line does get printed
    # every run hangs here with the cursor frozen; Task Manager shows 13 python threads
    # the threads are still alive but print nothing

Here is a run with the debug output:
Setting environment for using Microsoft Visual Studio 2008 x86 tools.

C:\Program Files\Microsoft Visual Studio 9.0\VC>G:

G:\>cd proxy

G:\proxy>python Quere.py
True
True
True
True
True
True
True
True
True
True
True
--------------------
True
True
=====================im done
17
--------------------
True
True
--------------------
True
True
--------------------
True
True
--------------------
True
True
--------------------
True
True
--------------------
True
True
--------------------
True
True
--------------------
True
True
--------------------
True
True
--------------------
True
True
--------------------
True
True
<urlopen error timed out> other--exception----------in- threadget----
--------------------
True
<urlopen error (11001, 'getaddrinfo failed')> other--exception----------in- thre
adget----
--------------------
True
--------------------
True
True
--------------------
True
True
--------------------
True
True
--------------------
True
True
=====================im done
195
Exiting Main Thread

# It doesn't terminate properly. If the flag weren't taking effect, it should keep printing True - but it doesn't. Task Manager still shows 10+ threads.
Confused... could it be a problem with Queue's join? But '=====================im done' did get printed!
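The hang can be reproduced in miniature with a sketch like this (the queue, flag, and worker names here are illustrative, not from the code above): join() returns as soon as every put() has a matching task_done(), yet by then the worker is already parked inside the next blocking get(), so the while-flag test is never reached again.

from threading import Thread
from Queue import Queue
import time

flag = True
q = Queue()

def worker():
    while flag:
        item = q.get()   # blocks forever once the queue is empty
        q.task_done()

t = Thread(target=worker)
t.start()
q.put('job')
q.join()                 # returns: the one task was marked done
flag = False             # too late - the worker is blocked inside get()
time.sleep(1)
print t.isAlive()        # prints True: the thread never saw the flag change
# the script itself now hangs, just like the crawler above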

z752964360 2011-12-20
Thanks angel_su, thanks askandstudy - the code above runs now!
That confirms it really was the task_done/get problem.
I won't close the thread just yet; let everyone take another look first.
z752964360 2011-12-20

def get(self, block=True, timeout=None):
    """Remove and return an item from the queue.

    If optional args 'block' is true and 'timeout' is None (the default),
    block if necessary until an item is available. If 'timeout' is
    a positive number, it blocks at most 'timeout' seconds and raises
    the Empty exception if no item was available within that time.
    Otherwise ('block' is false), return an item if one is immediately
    available, else raise the Empty exception ('timeout' is ignored
    in that case).
    """
    self.not_empty.acquire()
    try:
        if not block:
            if not self._qsize():
                raise Empty
        elif timeout is None:
            while not self._qsize():
                self.not_empty.wait()
        elif timeout < 0:
            raise ValueError("'timeout' must be a positive number")
        else:
            endtime = _time() + timeout
            while not self._qsize():
                remaining = endtime - _time()
                if remaining <= 0.0:
                    raise Empty
                self.not_empty.wait(remaining)
        item = self._get()
        self.not_full.notify()
        return item
    finally:
        self.not_empty.release()

This is the source of Queue's get(). Found the cause: if get() is called with no arguments, it waits forever - which is why the flag has no effect!!
Thanks again, everyone!
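Based on that reading of the source, here is a minimal sketch of the fix (the names are illustrative; assumes Python 2's Queue module): give get() a timeout so the loop wakes up at least once a second and re-checks the flag instead of waiting on not_empty forever.

from Queue import Queue, Empty
from threading import Thread
import time

flag = True
q = Queue()

def worker():
    while flag:                      # re-checked at least once per second
        try:
            item = q.get(timeout=1)  # wakes up even if the queue stays empty
        except Empty:
            continue
        # ... process item here ...
        q.task_done()

t = Thread(target=worker)
t.start()
q.put('job')
time.sleep(2)
flag = False                         # worker exits within about a second
t.join()
print 'worker stopped cleanly'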
askandstudy 2011-12-20
So it was get()'s problem. If you check the library documentation or step through with a debugger, you can find this kind of thing quickly yourself.

It's worth keeping a copy of the library reference for each Python version you use, for easy lookup.

get([block, [timeout]])
Remove and return an item from the queue. If optional args block is true and timeout is None (the default),
block if necessary until an item is available. If timeout is a positive number, it blocks at most timeout
seconds and raises the Empty exception if no item was available within that time. Otherwise (block is
false), return an item if one is immediately available, else raise the Empty exception (timeout is ignored in
that case). New in version 2.3: The timeout parameter.


Also, the locking in the code I posted in reply #13 was clumsy; I didn't think it through at the time. To add locks properly you should create two locks and use a different one in thread_put and in thread_get; and if thread_put runs as a single thread, it doesn't need a lock at all - if my understanding of locks is correct.
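A sketch of what that two-lock idea might look like (purely illustrative; this variant was never actually posted in the thread): one lock per queue, so the check-then-get pair on q_req never contends with the pair on q_ans.

from threading import Lock
from Queue import Queue

req_lock = Lock()  # guards the qsize()/get() pair on q_req (used by thread_get)
ans_lock = Lock()  # guards the qsize()/get() pair on q_ans (used by thread_put)

def guarded_get(q, q_lock):
    """Check-then-get as one atomic step; returns None when the queue is empty."""
    q_lock.acquire()
    try:
        if q.qsize() <= 0:
            return None
        return q.get()
    finally:
        q_lock.release()

q_req = Queue()
q_req.put('http://example.com/')
print guarded_get(q_req, req_lock)   # 'http://example.com/'
print guarded_get(q_req, req_lock)   # None - no blocking on an empty queue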

askandstudy 2011-12-19

There is a typo in one line of the if __name__=='__main__': section; it should read:

outputinfo='run return links length:%d\n' % len(links)
askandstudy 2011-12-19
Some of the middle output is omitted because of the post length limit.

E:\codes\komodoprj\python2>c:\python27\python getwebpage.py
True
True
TrueTrue

True
True
True
TrueTrue

TrueTrue

TrueTrueTrueTrueTrueTrue





TrueTrueTrue
True


TrueTrueTrueTrueTrueTrue





TrueTrueTrueTrue



--------------------
True
TrueTrueTrueTrueTrueTrue





TrueTrueTrue


TrueTrue
=====================im done
deep [2] ok

run turn links length:17

[
u'http://my.4399.com/userapp.php?id=100111', u'http://my.kingdowin.com', u'http:
//apps.renren.com/tdsheep/', u'http://www.pengyou.com/index.php?mod=appmanager&a
ct=openapp&type=qzone&appid=16488', u'http://www.playersaid.com/runescape-gold/'
, u'http://www.playersaid.com/wow-gold/', u'http://www.playersaid.com/wow-gold/'
, u'http://www.playersaid.com/runescape-gold/', u'http://www.playersaid.com/rift
-platinum/', u'http://www.renren.com', u'http://uchome.developer.manyou.com/', u
'http://www.myspace.cn/', u'http://www.facebook.com', u'http://www.pengyou.com',
u'http://www.kaixin001.com', u'http://www.linezing.com', u'http://www.linezing.
com']
True
--------------------
True
TrueTrueTrue


TrueTrueTrue


TrueTrue

True
True
True
--------------------
True
--------------------
True--------------------

--------------------
TrueTrue

--------------------
True
--------------------
True
--------------------
True
--------------------
True
--------------------
True
--------------------
True
--------------------
TrueTrue

True
True
True--------------------

TrueTrue

True
True
True
True
True
True
True
True
--------------------
True
--------------------
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
--------------------
True
True
True
True
True
True
True
True
True
True
True
<urlopen error timed out> other--exception----------in- threadget----
--------------------
True=====================im done

deep [1] ok

run turn links length:197

[u'http://my.4399.com/', u'http://my.4399.com/', u'http://my.4399.com/index.php?
ct=myapp', u'http://my.4399.com/network.html', u'http://my.4399.com/help1.php',
u'http://t.sina.com.cn/my4399', u'http://t.sina.com.cn/caiwensheng', u'http://my
.4399.com/sitemap/', u'http://my.4399.com/help1.php', u'http://my.4399.com/joinu
s/zhaopin.html', u'http://my.4399.com/joinus/', u'http://imga.4399.com/upload_pi
c/2011/icp.jpg', u'http://net.china.cn/chinese/index.htm', u'http://imga.4399.co
m/upload_pic/2011/wenwangwen.jpg', u'http://imga.4399.com/upload_pic/2011/chuban
.jpg', u'http://reg.pengyou.qq.com/emailreg.html', u'http://reg.pengyou.qq.com/e
mailreg.html', u'http://www.pengyou.com/frame.html?ADTAG=py_tuijian_wap&frame=ht
tp://imgcache.qq.com/campus/html/mobile_py.html', u'http://mobile.qq.com/pengyou
/', u'http://itunes.apple.com/cn/app/id413616230?mt=8', u'http://www.qq.com/cult
ure.shtml', u'http://www.qq.com/icp.shtml', u'http://www.qq.com/icp1.shtml', u'h
ttp://www.qq.com/?pref=pengyou', u'http://pengyou.qq.com/index.php?mod=frame&ADT
AG=py_tuijian_wap&width=1&frame=http://imgcache.qq.com/campus/html/mobile_py.htm
l', u'http://support.qq.com/cgi-bin/beta2/titlelist_simple?pn=0&order=3&fid=371'
, u'http://www.tencent.com/zh-cn/le/copyrightstatement.shtml', u'http://www.tenc
ent.com/zh-cn/le/copyrightstatement.shtml', u'http://www.miibeian.gov.cn', u'htt
p://u.discuz.net', u'http://www.comsenz.com', u'http://www.linezing.com/index.ph
p', u'http://bbs.linezing.com/', u'http://www.linezing.com/help/stat_help.html',
u'http://light.lz.taobao.com/index.php?fid=67&flag=1&r=http%3A%2F%2Flz.taobao.c
om%2F%3Ffid%3D67%26flag%3D1', u'http://tongji.linezing.com/report.html?unit_id=1
0399', u'http://www.linezing.com/help/stat_guide.html', u'http://lz.taobao.com',
u'http://light.lz.taobao.com/index.php?fid=68&flag=1&r=http%3A%2F%2Flz.taobao.c
om%2F%3Ffid%3D68%26flag%3D1', u'http://bbs.linezing.com/', u'http://bbs.linezing
.com/read.php?tid=3974&page=1&toread=1', u'http://club.alimama.com/read-htm-tid-
458726.html', u'http://bbs.linezing.com/htm_data/4/0905/63.html', u'http://bbs.l
inezing.com/htm_data/4/0905/69.html', u'http://bbs.linezing.com/htm_data/4/0905/
62.html', Trueu'http://www.linezing.com/help/stat_help.html',
u'http://www.linezing.com/help/qianyi_faq.html', u'http://www.linezing.com/help/
qianyi_faq.html#2', u'http://www.linezing.com/help/qianyi_faq.html#3', u'http://
www.linezing.com/help/qianyi_faq.html', u'http://act.life.alipay.com/shopping/pr
omotion/subject/dbmmpro/index.html?src=shangf_weis_g01', u'http://www.paidai.com
/jobs/index.php', u'http://lz.taobao.com', u'http://www.linezing.com', u'http://
www.linezing.com/index.php', u'http://bbs.linezing.com/', u'http://www.linezing.
com/help/stat_help.html', u'http://light.lz.taobao.com/index.php?fid=67&flag=1&r
=http%3A%2F%2Flz.taobao.com%2F%3Ffid%3D67%26flag%3D1', u'http://tongji.linezing.
com/report.html?unit_id=10399', u'http://www.linezing.com/help/stat_guide.html',
u'http://lz.taobao.com', u'http://light.lz.taobao.com/index.php?fid=68&flag=1&r
=http%3A%2F%2Flz.taobao.com%2F%3Ffid%3D68%26flag%3D1', u'http://bbs.linezing.com
/', u'http://bbs.linezing.com/read.php?tid=3974&page=1&toread=1', u'http://club.
alimama.com/read-htm-tid-458726.html', u'http://bbs.linezing.com/htm_data/4/0905
/63.html', u'http://bbs.linezing.com/htm_data/4/0905/69.html', u'http://bbs.line
zing.com/htm_data/4/0905/62.html', u'http://www.linezing.com/help/stat_help.html
', u'http://www.linezing.com/help/qianyi_faq.html', u'http://www.linezing.com/he
lp/qianyi_faq.html#2', u'http://www.linezing.com/help/qianyi_faq.html#3', u'http
://www.linezing.com/help/qianyi_faq.html', u'http://act.life.alipay.com/shopping
/promotion/subject/dbmmpro/index.html?src=shangf_weis_g01', u'http://www.paidai.
com/jobs/index.php', u'http://lz.taobao.com', u'http://www.linezing.com', u'http
://reg.kaixin001.com', u'http://login.kaixin001.com', u'http://tuan.kaixin001.co
m', u'http://game.kaixin001.com', u'http://xyx.kaixin001.com', u'http://game.kai
xin001.com/#sns', u'http://reg.kaixin001.com', u'http://itunes.apple.com/cn/app/
id348883057?mt=8', u'http://zhaopin.kaixin001.com', u'http://www.miibeian.gov.cn
', u'http://reg.pengyou.qq.com/emailreg.html', u'http://reg.pengyou.qq.com/email
reg.html', u'http://www.pengyou.com/frame.html?ADTAG=py_tuijian_wap&frame=http:/
/imgcache.qq.com/campus/html/mobile_py.html', u'http://mobile.qq.com/pengyou/',
u'http://itunes.apple.com/cn/app/id413616230?mt=8', u'http://www.qq.com/culture.
shtml', u'http://www.qq.com/icp.shtml', u'http://www.qq.com/icp1.shtml', u'http:
//www.qq.com/?pref=pengyou', u'http://pengyou.qq.com/index.php?mod=frame&ADTAG=p
y_tuijian_wap&width=1&frame=http://imgcache.qq.com/campus/html/mobile_py.html',
u'http://support.qq.com/cgi-bin/beta2/titlelist_simple?pn=0&order=3&fid=371', u'
http://www.tencent.com/zh-cn/le/copyrightstatement.shtml', u'http://www.tencent.
com/zh-cn/le/copyrightstatement.shtml', u'http://share.renren.com', u'http://app
.renren.com', u'http://page.renren.com', u'http://life.renren.com', u'http://xia
ozu.renren.com/', u'http://name.renren.com', u'http://school.renren.com/allpages
.html', u'http://school.renren.com/daxue/', u'http://m.renren.com', u'http://min
i.renren.com', u'http://club.renren.com', u'http://movie.renren.com', u'http://w
wv.renren.com/xn.do?ss=10113&rt=27', u'http://www.renren.com/GLogin.do', u'http:
//support.renren.com/visitor/helpcenter', u'http://support.renren.com/GetGuestbo
okList.do?action=suggest&stage=-1', u'http://safe.renren.com/findPass.do', u'htt
p://dev.renren.com/blog/127', u'http://safe.renren.com/findPass.do', u'http://bl
og.renren.com/share/331500000/10400168898', u'http://blog.renren.com/share/22221
0538/10426113607', u'http://blog.renren.com/blog/401943014/781515062', u'http://
blog.renren.com/blog/278712439/786873567', u'http://blog.renren.com/blog/2872841
11/786922626', u'http://blog.renren.com/share/287286115/10454167790', u'http://b
log.renren.com/share/287284388/10500384368', u'http://blog.renren.com/share/4023
85140/10448971688',
......some entries omitted here......
u'http://safe.renren.com/relive.do', u'http://safe.renren.com/findPass.do', u'ht
tp://safe.renren.com/findPass.do', u'https://openapi.360.cn/oauth2/authorize?cli
ent_id=5ddda4458747126a583c5d58716bab4c&response_type=code&redirect_uri=http://w
ww.renren.com/bind/tsz/tszLoginCallBack&scope=basic&display=default', u'http://w
wv.renren.com/xn.do?ss=17076&rt=1&g=2011xinsheng3', u'http://invite.renren.com/u
nrgsfidfrd.do', u'http://mobile.renren.com/mobilelink.do?psf=8000201', u'http://
mobile.renren.com/showClient?psf=41004', u'http://im.renren.com/?subver=8&word02
', u'http://im.renren.com/desktop/rrsetup-8.exe?word02', u'http://g.renren.com/?
subver=5&word02', u'http://g.renren.com/lobby/rrgamesetup-5.exe?word02', u'http:
//www.renren.com/siteinfo/about', u'http://dev.renren.com', u'http://wan.renren.
com', u'http://page.renren.com/register/regGuide/', u'http://mobile.renren.com/m
obilelink.do?psf=40002', u'http://www.nuomi.com', u'http://ads.renren.com', u'ht
tp://job.renren-inc.com/', u'http://support.renren.com/helpcenter', u'http://www
.renren.com/siteinfo/privacy', u'http://www.miibeian.gov.cn/', u'http://u.discuz
.net', u'http://www.comsenz.com']
True
True
True
True
True
True
True
True
True
True
True
----------get---quiting
----------get---quiting
----------get---quiting
----------get---quiting
----------get---quiting
----------get---quiting
----------get---quiting
----------get---quiting
----------get---quiting
----------get---quiting
Exiting Main Thread

E:\codes\komodoprj\python2>
askandstudy 2011-12-19
Let me paste the code from my file. I just ran it again and it works as-is, but I think writing it this way still has a subtle problem - it only makes the failure less likely. I believe you need a lock to guarantee that checking the queue's qsize() and calling get() on it are completed together by the same thread in one step; otherwise the blocking problem can still occur. I haven't tried controlling that with a lock yet - I've never used locks before, so I need to read up and experiment first.
Or maybe there is some better approach.
Sorry mate, I don't use QQ or MSN.

Code below (just ran it; the output is posted in the next reply):


#!/usr/bin/env python
#coding=utf-8
from BeautifulSoup import BeautifulSoup
import urllib2
from threading import Thread, Lock
from Queue import Queue
import time
import socket
socket.setdefaulttimeout(5)

flag = True  # flag that controls child-thread shutdown

class Fetcher:  # wraps the fetch threads and the result-handling thread
    def __init__(self, th_num):
        self.opener = urllib2.build_opener(urllib2.HTTPHandler)
        self.q_req = Queue()  # task queue
        self.q_ans = Queue()  # result-handling queue
        self.urls = []  # urls to crawl at the next depth
        for i in range(th_num):  # start the fetch threads
            t = Thread(target=self.thread_get)
            t.start()
        for i in range(0, 1):
            t = Thread(target=self.thread_put)
            t.start()

    def join(self):  # wait for both queues to drain
        self.q_req.join()
        self.q_ans.join()
        print '=====================im done'

    def push(self, req):  # add a task to the task queue
        self.q_req.put(req)

    def thread_put(self):  # handles results: stores each page's urls in self.urls
        while flag:
            print flag  # debug
            if self.q_ans.qsize() <= 0:  # never call a get() that could block
                time.sleep(1)
                continue
            try:
                url = self.q_ans.get()
                self.urls.extend(url)
            except Exception, e:
                print e, 'other,excp========in=put'
            finally:
                self.q_ans.task_done()

    def thread_get(self):
        while flag:
            print flag
            if self.q_req.qsize() <= 0:
                time.sleep(1)
                continue
            try:
                req = self.q_req.get()
                urls = []
                ans = self.opener.open(req).read()  # fetch the page
                soup = BeautifulSoup(ans)  # parse the page
                for a in soup.findAll('a'):  # extract all links on the page
                    try:
                        if a['href'].startswith('http'):
                            urls.append(a['href'])
                    except Exception, ex:
                        print ex, '========================Exception=in=soup=findAll'
                self.q_ans.put(urls)  # hand the links to the put thread
            except Exception, e:
                print e, 'other--exception----------in- threadget----'
            finally:
                print '--------------------'
                self.q_req.task_done()
        print "----------get---quiting"

def run(f, links):  # clears urls, loads the next round's tasks, waits for completion
    f.urls = []
    for url in links:
        f.push(url)
    f.join()
    return f.urls

if __name__ == "__main__":
    links = ['http://www.kingdowin.com/',]
    deep = 2  # crawl depth
    f = Fetcher(10)
    while deep > 0:  # together with run(), controls the crawl depth
        urls = run(f, links)
        outputinfo = 'deep [%d] ok\n' % deep
        print outputinfo
        deep -= 1
        links = urls
        outputinfo = 'run turn links length:%d\n' % len(links)
        print outputinfo
        print links

    time.sleep(1)
    flag = False  # shut the threads down
    time.sleep(3)
    print "Exiting Main Thread"


askandstudy 2011-12-19
I wrestled with this code's task_done problem for most of yesterday and it kept raising errors.
angel_su 2011-12-19
Right, locking before qsize() is one way to do it. The more common practice is to lock the shared data itself: for example, if thread_put in my code were bumped to 3 threads, you'd want to take a lock before modifying self.urls.
askandstudy 2011-12-19

#!/usr/bin/env python
#coding=utf-8
from BeautifulSoup import BeautifulSoup
import urllib2
from threading import Thread, Lock
from Queue import Queue
import time
import socket
socket.setdefaulttimeout(5)

lock = Lock()
flag = True  # flag that controls child-thread shutdown

class Fetcher:  # wraps the fetch threads and the result-handling thread
    def __init__(self, th_num):
        self.opener = urllib2.build_opener(urllib2.HTTPHandler)
        self.q_req = Queue()  # task queue
        self.q_ans = Queue()  # result-handling queue
        self.urls = []  # urls to crawl at the next depth
        for i in range(th_num):  # start the fetch threads
            t = Thread(target=self.thread_get)
            t.start()
        for i in range(0, 1):
            t = Thread(target=self.thread_put)
            t.start()

    def join(self):  # wait for both queues to drain
        self.q_req.join()
        self.q_ans.join()
        print '=====================im done'

    def push(self, req):  # add a task to the task queue
        self.q_req.put(req)

    def thread_put(self):  # handles results: stores each page's urls in self.urls
        while flag:
            #print flag  # debug
            lock.acquire()
            if self.q_ans.qsize() <= 0:
                lock.release()
                time.sleep(1)
                continue
            try:
                url = self.q_ans.get()
                lock.release()
                self.urls.extend(url)
            except Exception, e:
                print e, 'other,excp========in=put'
            finally:
                self.q_ans.task_done()

    def thread_get(self):
        while flag:
            #print flag
            lock.acquire()
            if self.q_req.qsize() <= 0:
                lock.release()
                time.sleep(1)
                continue
            try:
                req = self.q_req.get()
                lock.release()
                urls = []
                ans = self.opener.open(req).read()  # fetch the page
                soup = BeautifulSoup(ans)  # parse the page
                for a in soup.findAll('a'):  # extract all links on the page
                    try:
                        if a['href'].startswith('http'):
                            urls.append(a['href'])
                    except Exception, ex:
                        print ex, '========================Exception=in=soup=findAll'
                self.q_ans.put(urls)  # hand the links to the put thread
            except Exception, e:
                print e, 'other--exception----------in- threadget----'
            finally:
                print '--------------------'
                self.q_req.task_done()
        print "----------get---quiting"

def run(f, links):  # clears urls, loads the next round's tasks, waits for completion
    f.urls = []
    for url in links:
        f.push(url)
    f.join()
    return f.urls

if __name__ == "__main__":
    links = ['http://www.kingdowin.com/',]
    deep = 2  # crawl depth
    f = Fetcher(10)
    while deep > 0:  # together with run(), controls the crawl depth
        urls = run(f, links)
        outputinfo = 'deep [%d] ok\n' % deep
        print outputinfo
        deep -= 1
        links = urls
        outputinfo = 'run return links length:%d\n' % len(links)
        print outputinfo
        print links

    time.sleep(1)
    flag = False  # shut the threads down
    time.sleep(2)
    print "Exiting Main Thread"

askandstudy 2011-12-19

Learned something - I actually wasn't too clear on how to use locks. After reading the code above I copied the pattern, and that is enough to guarantee no blocking.
Take the lock before checking qsize(), then branch; whichever branch is taken, release the lock right after the necessary operation. That logic should be sound. I tried it and it runs, though my approach is fairly crude - the version above is written better.


[Quote=quoting reply #11 by angel_su:]

Checking qsize() doesn't guarantee the next get() won't block. I switched to get(timeout=1) and changed thread_put to 3 threads; it seems to exit properly. Give it a try:
Python code
#coding=utf-8
from BeautifulSoup import BeautifulSoup
import urllib2
from threading import Thread,Lock
from Qu……
[/Quote]
angel_su 2011-12-19
Checking with qsize() doesn't guarantee the next get() won't block. I changed it to get(timeout=1) and made thread_put 3 threads; it seems to exit properly now. Try this:
#coding=utf-8
from BeautifulSoup import BeautifulSoup
import urllib2
from threading import Thread, Lock
from Queue import Queue, Empty
import time
import socket

socket.setdefaulttimeout(5)
lock = Lock()
flag = True  # flag that controls child-thread shutdown

class Fetcher:  # wraps the fetch threads and the result-handling threads
    def __init__(self, th_num):
        self.opener = urllib2.build_opener(urllib2.HTTPHandler)
        self.q_req = Queue()  # task queue
        self.q_ans = Queue()  # result-handling queue
        self.urls = []  # urls to crawl at the next depth
        for i in range(th_num):  # start the fetch threads
            t = Thread(target=self.thread_get)
            t.start()
        for i in range(0, 3):
            t = Thread(target=self.thread_put)
            t.start()

    def join(self):  # wait for both queues to drain
        self.q_req.join()
        self.q_ans.join()
        print '=====================im done'

    def push(self, req):  # add a task to the task queue
        self.q_req.put(req)

    def thread_put(self):  # handles results: stores each page's urls in self.urls
        while flag:
            #~ print flag  # debug
            try:
                url = self.q_ans.get(timeout=1)
            except Empty:
                continue
            except Exception, e:
                print e, 'other,excp========in=put'
                break
            lock.acquire()
            self.urls.extend(url)
            lock.release()
            self.q_ans.task_done()
        print "----------put---quiting"

    def thread_get(self):
        while flag:
            #~ print flag
            try:
                req = self.q_req.get(timeout=1)
            except Empty:
                continue
            except Exception, e:
                print e, 'other--exception----------in- threadget----'
            urls = []
            try:
                ans = self.opener.open(req).read()  # fetch the page
                soup = BeautifulSoup(ans)  # parse the page
                for a in soup.findAll('a'):  # extract all links on the page
                    if a['href'].startswith('http'):
                        urls.append(a['href'])
            except Exception, ex:
                print ex, '========================Exception=in=ans/soup'
            self.q_ans.put(urls)  # hand the links to the put threads
            print '--------------------'
            self.q_req.task_done()
        print "----------get---quiting"

def run(f, links):  # clears urls, loads the next round's tasks, waits for completion
    f.urls = []
    for url in links:
        f.push(url)
    f.join()
    return f.urls

if __name__ == "__main__":
    links = ['http://www.kingdowin.com/',]
    deep = 2  # crawl depth
    f = Fetcher(10)
    while deep > 0:  # together with run(), controls the crawl depth
        urls = run(f, links)
        deep -= 1
        links = urls
        print len(links)

    time.sleep(1)
    flag = False  # shut the threads down
    time.sleep(1)
    print "Exiting Main Thread"
z752964360 2011-12-18
askandstudy, I changed the code like that and it still doesn't work!
Could you add my QQ: 752964360
askandstudy 2011-12-18
I also added a few lines of debug code, so my output differs a little from yours:


if __name__ == "__main__":
    links = ['http://www.kingdowin.com/',]
    deep = 2  # crawl depth
    f = Fetcher(10)
    while deep > 0:  # together with run(), controls the crawl depth
        urls = run(f, links)
        outputinfo = 'deep [%d] ok\n' % deep
        print outputinfo
        deep -= 1
        links = urls
        outputinfo = 'run turn links length:%d\n' % len(links)
        print outputinfo
        print links
askandstudy 2011-12-18

Learned a lot. I've played with this code for most of the day and I'm worn out, so I went back to my old method, crude as it is. Hope an expert can post a cleaner version that actually runs.
I'm done wrestling with join, get, and task_done for now.


def thread_put(self):  # handles results: stores each page's urls in self.urls
    while flag:
        print flag  # debug: if this thread hadn't exited, it should keep printing the flag
        if self.q_ans.qsize() <= 0:  # skip get() while the queue is empty
            time.sleep(1)
            continue


def thread_get(self):
    while flag:
        print flag
        if self.q_req.qsize() <= 0:
            time.sleep(1)
            continue


With three lines added to each of thread_put and thread_get, my output looks like this:


E:\codes\komodoprj>c:\python27\python temp.py
True
TrueTrue

True
TrueTrue

TrueTrue

True
True
True
--------------------
True
True
TrueTrue

TrueTrue

True
TrueTrueTrue


True
True=====================im done
deep [2] ok


run turn links length:17

[u'http://my.4399.com/userapp.php?id=100111', u'http://my.kingdowin.com', u'http
://apps.renren.com/tdsheep/', u'http://www.pengyou.com/index.php?mod=appmanager&
act=openapp&type=qzone&appid=16488', u'http://www.playersaid.com/runescape-gold/
', u'http://www.playersaid.com/wow-gold/', u'http://www.playersaid.com/wow-gold/
', u'http://www.playersaid.com/runescape-gold/', u'http://www.playersaid.com/rif
t-platinum/', u'http://www.renren.com', u'http://uchome.developer.manyou.com/',
u'http://www.myspace.cn/', u'http://www.facebook.com', u'http://www.pengyou.com'
, u'http://www.kaixin001.com', u'http://www.linezing.com', u'http://www.linezing
.com']
True
--------------------
True
True
TrueTrueTrueTrue
True



TrueTrueTrueTrue


True

--------------------
True
--------------------
True
--------------------
True
--------------------
True
True
True
True
True
True
--------------------
True
--------------------
True
True
True
True
--------------------
True
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
--------------------
True
True
True
True
True
True
True
True
True
<urlopen error timed out> other--exception----------in- threadget----
--------------------
True
True
True
True
--------------------
True
--------------------
True
True
<urlopen error timed out> other--exception----------in- threadget----
--------------------
True
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
--------------------
True
True
True
True
True
True
......many more True lines omitted......
True
True
True
True
True
True
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' 'href'========================Exception=in=soup=findAll
========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href'True
========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
--------------------
True
True
'href' ========================Exception=in=soup=findAll
--------------------
True
True
True
True
True
True
True
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
--------------------
True
True
True
True
True
True
True
True
True
True=====================im doneTrue

deep [1] ok

run turn links length:247

[
u'http://my.4399.com/', u'http://my.4399.com/', u'http://my.4399.com/index.php?c
t=myapp', u'http://my.4399.com/network.html', u'http://my.4399.com/help1.php', u
'http://t.sina.com.cn/my4399', u'http://t.sina.com.cn/caiwensheng', u'http://my.
4399.com/sitemap/', u'http://my.4399.com/help1.php', u'http://my.4399.com/joinus
/zhaopin.html', u'http://my.4399.com/joinus/', u'http://imga.4399.com/upload_pic
/2011/icp.jpg', u'http://net.china.cn/chinese/index.htm', u'http://imga.4399.com
/upload_pic/2011/wenwangwen.jpg', u'http://imga.4399.com/upload_pic/2011/chuban.
......a long stretch of fetched urls omitted......
m/knowledgebase/', u'http://www.comm100.com/forum/', u'http://www.comm100.com/em
ailmarketingnewsletter/', u'http://www.comm100.com/emailticket/', u'http://www.r
ingcentral.com', u'http://www.comm100.com/livechat/', u'http://www.comm100.com/l
ivechat/', u'http://www.comm100.com/', u'http://www.comm100.com/', u'http://www.
comm100.com/livechat/', u'http://www.comm100.com/knowledgebase/', u'http://www.c
omm100.com/forum/', u'http://www.comm100.com/emailmarketingnewsletter/', u'http:
//www.comm100.com/emailticket/', u'http://www.ringcentral.com']
True
True
True
True
True
True
True
True
True
True
True
----------get---quiting
----------get---quiting
----------get---quiting
----------get---quiting
----------get---quiting
----------get---quiting
----------get---quiting
----------get---quiting
----------get---quiting
----------get---quiting
Exiting Main Thread

E:\codes\komodoprj>



angel_su 2011-12-17
With get() like that, the thread blocks forever once the queue is empty. Try get(False) or get(timeout=...).
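To see the difference in isolation, a small sketch (not part of the original reply; the queue here is throwaway):

from Queue import Queue, Empty

q = Queue()
try:
    q.get(False)   # non-blocking: raises Empty immediately
except Empty:
    print 'empty - loop around and re-check the shutdown flag'

# By contrast, q.get() with no arguments would wait here indefinitely,
# so a flag tested at the top of the loop never gets another look.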
z752964360 2011-12-17
In the real program I do use Queue, but my get() and task_done() calls come in pairs. If the flag isn't at fault, then it must be something about Queue - but I can't find it!
z752964360 2011-12-17
Heh, iambic, this code does run! The small example up front was just to verify that the flag approach works.
It's the real application after it that fails.
iambic 2011-12-17
Your example is too long. Does it really take this much code to describe a "flag" problem?
