Python multithreading problem: no setDaemon, flag-based control, and the threads won't stop!

z752964360 2011-12-17 04:42:19
First, a simple example to verify whether a flag can control a thread at all:

from threading import Thread, Lock
import time

a = True

class A:
    def __init__(self):
        t = Thread(target=self.run)
        t.start()

    def run(self):
        print 'im starting'
        while a:
            print a

if __name__ == "__main__":
    aa = A()
    time.sleep(1)
    a = False

This example runs fine, which shows that a child thread can be stopped through a flag.

The next one is a crawler that fetches pages to a given depth: it extracts all the links on the front page and then fetches the content of those links. The shutdown mechanism is the same as in the example above, but it does not terminate properly.

#coding=utf-8
from BeautifulSoup import BeautifulSoup
import urllib2
from threading import Thread, Lock
from Queue import Queue
import time
import socket
socket.setdefaulttimeout(5)

flag = True  # flag that is supposed to stop the child threads

class Fetcher:  # wraps the fetch threads and the result-handling thread
    def __init__(self, th_num):
        self.opener = urllib2.build_opener(urllib2.HTTPHandler)
        self.q_req = Queue()  # task queue
        self.q_ans = Queue()  # result-handling queue
        self.urls = []  # collects the urls to crawl at the next depth
        for i in range(th_num):  # start the fetch threads
            t = Thread(target=self.thread_get)
            t.start()
        for i in range(0, 1):
            t = Thread(target=self.thread_put)
            t.start()

    def join(self):  # wait for both queues to drain
        self.q_req.join()
        self.q_ans.join()
        print '=====================im done'  # this does get printed at runtime

    def push(self, req):  # add a task to the task queue
        self.q_req.put(req)

    def thread_put(self):  # handles results: stores each page's urls in self.urls
        while flag:
            print flag  # debug: if this thread hadn't exited, it should keep printing the flag
            try:
                url = self.q_ans.get()
                self.urls.extend(url)
            except Exception, e:
                print e, 'other,excp========in=put'
            finally:
                self.q_ans.task_done()

    def thread_get(self):
        while flag:
            print flag
            try:
                req = self.q_req.get()
                urls = []
                ans = self.opener.open(req).read()  # fetch the page
                soup = BeautifulSoup(ans)  # parse the page
                for a in soup.findAll('a'):  # extract all links on the page
                    try:
                        if a['href'].startswith('http'):
                            urls.append(a['href'])
                    except Exception, ex:
                        print ex, '========================Exception=in=soup=findAll'
                self.q_ans.put(urls)  # hand the links to the put thread via the result queue
            except Exception, e:
                print e, 'other--exception----------in- threadget----'
            finally:
                print '--------------------'
                self.q_req.task_done()
        print "----------get---quiting"  # this is never printed

def run(f, links):  # clears urls before the next round, loads the next round's tasks, waits for completion
    f.urls = []
    for url in links:
        f.push(url)
    f.join()
    return f.urls

if __name__ == "__main__":
    links = ['http://www.kingdowin.com/',]
    deep = 2  # crawl depth
    f = Fetcher(10)
    while deep > 0:  # together with run(), this controls the crawl depth
        urls = run(f, links)
        deep -= 1
        links = urls
        print len(links)

    time.sleep(1)
    flag = False  # supposed to shut the threads down
    time.sleep(1)
    print "Exiting Main Thread"  # this line does get printed
    # every run hangs here with the cursor frozen; Task Manager shows 13 python threads
    # the threads are still alive but print nothing

Here is a run with the debug output:
Setting environment for using Microsoft Visual Studio 2008 x86 tools.

C:\Program Files\Microsoft Visual Studio 9.0\VC>G:

G:\>cd proxy

G:\proxy>python Quere.py
True
True
True
True
True
True
True
True
True
True
True
--------------------
True
True
=====================im done
17
--------------------
True
True
--------------------
True
True
--------------------
True
True
--------------------
True
True
--------------------
True
True
--------------------
True
True
--------------------
True
True
--------------------
True
True
--------------------
True
True
--------------------
True
True
--------------------
True
True
<urlopen error timed out> other--exception----------in- threadget----
--------------------
True
<urlopen error (11001, 'getaddrinfo failed')> other--exception----------in- thre
adget----
--------------------
True
--------------------
True
True
--------------------
True
True
--------------------
True
True
--------------------
True
True
=====================im done
195
Exiting Main Thread

# It doesn't terminate properly. If the flag weren't taking effect, it should keep printing True - but it doesn't. Task Manager still shows 10+ threads.
Confused... could it be a problem with Queue's join? But '=====================im done' did get printed!
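The hang can be reproduced in miniature with a sketch like this (the queue, flag, and worker names here are illustrative, not from the code above): join() returns as soon as every put() has a matching task_done(), yet by then the worker is already parked inside the next blocking get(), so the while-flag test is never reached again.

from threading import Thread
from Queue import Queue
import time

flag = True
q = Queue()

def worker():
    while flag:
        item = q.get()   # blocks forever once the queue is empty
        q.task_done()

t = Thread(target=worker)
t.start()
q.put('job')
q.join()                 # returns: the one task was marked done
flag = False             # too late - the worker is blocked inside get()
time.sleep(1)
print t.isAlive()        # prints True: the thread never saw the flag change
# the script itself now hangs, just like the crawler above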

z752964360 2011-12-20
Thanks angel_su, thanks askandstudy - the code above runs now!
That confirms it really was the task_done/get problem.
I won't close the thread just yet; let everyone take another look first.
z752964360 2011-12-20

def get(self, block=True, timeout=None):
    """Remove and return an item from the queue.

    If optional args 'block' is true and 'timeout' is None (the default),
    block if necessary until an item is available. If 'timeout' is
    a positive number, it blocks at most 'timeout' seconds and raises
    the Empty exception if no item was available within that time.
    Otherwise ('block' is false), return an item if one is immediately
    available, else raise the Empty exception ('timeout' is ignored
    in that case).
    """
    self.not_empty.acquire()
    try:
        if not block:
            if not self._qsize():
                raise Empty
        elif timeout is None:
            while not self._qsize():
                self.not_empty.wait()
        elif timeout < 0:
            raise ValueError("'timeout' must be a positive number")
        else:
            endtime = _time() + timeout
            while not self._qsize():
                remaining = endtime - _time()
                if remaining <= 0.0:
                    raise Empty
                self.not_empty.wait(remaining)
        item = self._get()
        self.not_full.notify()
        return item
    finally:
        self.not_empty.release()

This is the source of Queue's get(). Found the cause: if get() is called with no arguments, it waits forever - which is why the flag has no effect!!
Thanks again, everyone!
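Based on that reading of the source, here is a minimal sketch of the fix (the names are illustrative; assumes Python 2's Queue module): give get() a timeout so the loop wakes up at least once a second and re-checks the flag instead of waiting on not_empty forever.

from Queue import Queue, Empty
from threading import Thread
import time

flag = True
q = Queue()

def worker():
    while flag:                      # re-checked at least once per second
        try:
            item = q.get(timeout=1)  # wakes up even if the queue stays empty
        except Empty:
            continue
        # ... process item here ...
        q.task_done()

t = Thread(target=worker)
t.start()
q.put('job')
time.sleep(2)
flag = False                         # worker exits within about a second
t.join()
print 'worker stopped cleanly'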
askandstudy 2011-12-20
So it was get()'s problem. If you check the library documentation or step through with a debugger, you can find this kind of thing quickly yourself.

It's worth keeping a copy of the library reference for each Python version you use, for easy lookup.

get([block, [timeout]])
Remove and return an item from the queue. If optional args block is true and timeout is None (the default),
block if necessary until an item is available. If timeout is a positive number, it blocks at most timeout
seconds and raises the Empty exception if no item was available within that time. Otherwise (block is
false), return an item if one is immediately available, else raise the Empty exception (timeout is ignored in
that case). New in version 2.3: The timeout parameter.


Also, the locking in the code I posted in reply #13 was clumsy; I didn't think it through at the time. To add locks properly you should create two locks and use a different one in thread_put and in thread_get; and if thread_put runs as a single thread, it doesn't need a lock at all - if my understanding of locks is correct.
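A sketch of what that two-lock idea might look like (purely illustrative; this variant was never actually posted in the thread): one lock per queue, so the check-then-get pair on q_req never contends with the pair on q_ans.

from threading import Lock
from Queue import Queue

req_lock = Lock()  # guards the qsize()/get() pair on q_req (used by thread_get)
ans_lock = Lock()  # guards the qsize()/get() pair on q_ans (used by thread_put)

def guarded_get(q, q_lock):
    """Check-then-get as one atomic step; returns None when the queue is empty."""
    q_lock.acquire()
    try:
        if q.qsize() <= 0:
            return None
        return q.get()
    finally:
        q_lock.release()

q_req = Queue()
q_req.put('http://example.com/')
print guarded_get(q_req, req_lock)   # 'http://example.com/'
print guarded_get(q_req, req_lock)   # None - no blocking on an empty queue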

askandstudy 2011-12-19

There is a typo in one line of the if __name__=='__main__': section; it should read:

outputinfo='run return links length:%d\n' % len(links)
askandstudy 2011-12-19
Some of the middle output is omitted because of the post length limit.

E:\codes\komodoprj\python2>c:\python27\python getwebpage.py
True
True
TrueTrue

True
True
True
TrueTrue

TrueTrue

TrueTrueTrueTrueTrueTrue





TrueTrueTrue
True


TrueTrueTrueTrueTrueTrue





TrueTrueTrueTrue



--------------------
True
TrueTrueTrueTrueTrueTrue





TrueTrueTrue


TrueTrue
=====================im done
deep [2] ok

run turn links length:17

[
u'http://my.4399.com/userapp.php?id=100111', u'http://my.kingdowin.com', u'http:
//apps.renren.com/tdsheep/', u'http://www.pengyou.com/index.php?mod=appmanager&a
ct=openapp&type=qzone&appid=16488', u'http://www.playersaid.com/runescape-gold/'
, u'http://www.playersaid.com/wow-gold/', u'http://www.playersaid.com/wow-gold/'
, u'http://www.playersaid.com/runescape-gold/', u'http://www.playersaid.com/rift
-platinum/', u'http://www.renren.com', u'http://uchome.developer.manyou.com/', u
'http://www.myspace.cn/', u'http://www.facebook.com', u'http://www.pengyou.com',
u'http://www.kaixin001.com', u'http://www.linezing.com', u'http://www.linezing.
com']
True
--------------------
True
TrueTrueTrue


TrueTrueTrue


TrueTrue

True
True
True
--------------------
True
--------------------
True--------------------

--------------------
TrueTrue

--------------------
True
--------------------
True
--------------------
True
--------------------
True
--------------------
True
--------------------
True
--------------------
TrueTrue

True
True
True--------------------

TrueTrue

True
True
True
True
True
True
True
True
--------------------
True
--------------------
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
True
--------------------
True
True
True
True
True
True
True
True
True
True
True
<urlopen error timed out> other--exception----------in- threadget----
--------------------
True=====================im done

deep [1] ok

run turn links length:197

[u'http://my.4399.com/', u'http://my.4399.com/', u'http://my.4399.com/index.php?
ct=myapp', u'http://my.4399.com/network.html', u'http://my.4399.com/help1.php',
u'http://t.sina.com.cn/my4399', u'http://t.sina.com.cn/caiwensheng', u'http://my
.4399.com/sitemap/', u'http://my.4399.com/help1.php', u'http://my.4399.com/joinu
s/zhaopin.html', u'http://my.4399.com/joinus/', u'http://imga.4399.com/upload_pi
c/2011/icp.jpg', u'http://net.china.cn/chinese/index.htm', u'http://imga.4399.co
m/upload_pic/2011/wenwangwen.jpg', u'http://imga.4399.com/upload_pic/2011/chuban
.jpg', u'http://reg.pengyou.qq.com/emailreg.html', u'http://reg.pengyou.qq.com/e
mailreg.html', u'http://www.pengyou.com/frame.html?ADTAG=py_tuijian_wap&frame=ht
tp://imgcache.qq.com/campus/html/mobile_py.html', u'http://mobile.qq.com/pengyou
/', u'http://itunes.apple.com/cn/app/id413616230?mt=8', u'http://www.qq.com/cult
ure.shtml', u'http://www.qq.com/icp.shtml', u'http://www.qq.com/icp1.shtml', u'h
ttp://www.qq.com/?pref=pengyou', u'http://pengyou.qq.com/index.php?mod=frame&ADT
AG=py_tuijian_wap&width=1&frame=http://imgcache.qq.com/campus/html/mobile_py.htm
l', u'http://support.qq.com/cgi-bin/beta2/titlelist_simple?pn=0&order=3&fid=371'
, u'http://www.tencent.com/zh-cn/le/copyrightstatement.shtml', u'http://www.tenc
ent.com/zh-cn/le/copyrightstatement.shtml', u'http://www.miibeian.gov.cn', u'htt
p://u.discuz.net', u'http://www.comsenz.com', u'http://www.linezing.com/index.ph
p', u'http://bbs.linezing.com/', u'http://www.linezing.com/help/stat_help.html',
u'http://light.lz.taobao.com/index.php?fid=67&flag=1&r=http%3A%2F%2Flz.taobao.c
om%2F%3Ffid%3D67%26flag%3D1', u'http://tongji.linezing.com/report.html?unit_id=1
0399', u'http://www.linezing.com/help/stat_guide.html', u'http://lz.taobao.com',
u'http://light.lz.taobao.com/index.php?fid=68&flag=1&r=http%3A%2F%2Flz.taobao.c
om%2F%3Ffid%3D68%26flag%3D1', u'http://bbs.linezing.com/', u'http://bbs.linezing
.com/read.php?tid=3974&page=1&toread=1', u'http://club.alimama.com/read-htm-tid-
458726.html', u'http://bbs.linezing.com/htm_data/4/0905/63.html', u'http://bbs.l
inezing.com/htm_data/4/0905/69.html', u'http://bbs.linezing.com/htm_data/4/0905/
62.html', Trueu'http://www.linezing.com/help/stat_help.html',
u'http://www.linezing.com/help/qianyi_faq.html', u'http://www.linezing.com/help/
qianyi_faq.html#2', u'http://www.linezing.com/help/qianyi_faq.html#3', u'http://
www.linezing.com/help/qianyi_faq.html', u'http://act.life.alipay.com/shopping/pr
omotion/subject/dbmmpro/index.html?src=shangf_weis_g01', u'http://www.paidai.com
/jobs/index.php', u'http://lz.taobao.com', u'http://www.linezing.com', u'http://
www.linezing.com/index.php', u'http://bbs.linezing.com/', u'http://www.linezing.
com/help/stat_help.html', u'http://light.lz.taobao.com/index.php?fid=67&flag=1&r
=http%3A%2F%2Flz.taobao.com%2F%3Ffid%3D67%26flag%3D1', u'http://tongji.linezing.
com/report.html?unit_id=10399', u'http://www.linezing.com/help/stat_guide.html',
u'http://lz.taobao.com', u'http://light.lz.taobao.com/index.php?fid=68&flag=1&r
=http%3A%2F%2Flz.taobao.com%2F%3Ffid%3D68%26flag%3D1', u'http://bbs.linezing.com
/', u'http://bbs.linezing.com/read.php?tid=3974&page=1&toread=1', u'http://club.
alimama.com/read-htm-tid-458726.html', u'http://bbs.linezing.com/htm_data/4/0905
/63.html', u'http://bbs.linezing.com/htm_data/4/0905/69.html', u'http://bbs.line
zing.com/htm_data/4/0905/62.html', u'http://www.linezing.com/help/stat_help.html
', u'http://www.linezing.com/help/qianyi_faq.html', u'http://www.linezing.com/he
lp/qianyi_faq.html#2', u'http://www.linezing.com/help/qianyi_faq.html#3', u'http
://www.linezing.com/help/qianyi_faq.html', u'http://act.life.alipay.com/shopping
/promotion/subject/dbmmpro/index.html?src=shangf_weis_g01', u'http://www.paidai.
com/jobs/index.php', u'http://lz.taobao.com', u'http://www.linezing.com', u'http
://reg.kaixin001.com', u'http://login.kaixin001.com', u'http://tuan.kaixin001.co
m', u'http://game.kaixin001.com', u'http://xyx.kaixin001.com', u'http://game.kai
xin001.com/#sns', u'http://reg.kaixin001.com', u'http://itunes.apple.com/cn/app/
id348883057?mt=8', u'http://zhaopin.kaixin001.com', u'http://www.miibeian.gov.cn
', u'http://reg.pengyou.qq.com/emailreg.html', u'http://reg.pengyou.qq.com/email
reg.html', u'http://www.pengyou.com/frame.html?ADTAG=py_tuijian_wap&frame=http:/
/imgcache.qq.com/campus/html/mobile_py.html', u'http://mobile.qq.com/pengyou/',
u'http://itunes.apple.com/cn/app/id413616230?mt=8', u'http://www.qq.com/culture.
shtml', u'http://www.qq.com/icp.shtml', u'http://www.qq.com/icp1.shtml', u'http:
//www.qq.com/?pref=pengyou', u'http://pengyou.qq.com/index.php?mod=frame&ADTAG=p
y_tuijian_wap&width=1&frame=http://imgcache.qq.com/campus/html/mobile_py.html',
u'http://support.qq.com/cgi-bin/beta2/titlelist_simple?pn=0&order=3&fid=371', u'
http://www.tencent.com/zh-cn/le/copyrightstatement.shtml', u'http://www.tencent.
com/zh-cn/le/copyrightstatement.shtml', u'http://share.renren.com', u'http://app
.renren.com', u'http://page.renren.com', u'http://life.renren.com', u'http://xia
ozu.renren.com/', u'http://name.renren.com', u'http://school.renren.com/allpages
.html', u'http://school.renren.com/daxue/', u'http://m.renren.com', u'http://min
i.renren.com', u'http://club.renren.com', u'http://movie.renren.com', u'http://w
wv.renren.com/xn.do?ss=10113&rt=27', u'http://www.renren.com/GLogin.do', u'http:
//support.renren.com/visitor/helpcenter', u'http://support.renren.com/GetGuestbo
okList.do?action=suggest&stage=-1', u'http://safe.renren.com/findPass.do', u'htt
p://dev.renren.com/blog/127', u'http://safe.renren.com/findPass.do', u'http://bl
og.renren.com/share/331500000/10400168898', u'http://blog.renren.com/share/22221
0538/10426113607', u'http://blog.renren.com/blog/401943014/781515062', u'http://
blog.renren.com/blog/278712439/786873567', u'http://blog.renren.com/blog/2872841
11/786922626', u'http://blog.renren.com/share/287286115/10454167790', u'http://b
log.renren.com/share/287284388/10500384368', u'http://blog.renren.com/share/4023
85140/10448971688',
......some entries omitted here......
u'http://safe.renren.com/relive.do', u'http://safe.renren.com/findPass.do', u'ht
tp://safe.renren.com/findPass.do', u'https://openapi.360.cn/oauth2/authorize?cli
ent_id=5ddda4458747126a583c5d58716bab4c&response_type=code&redirect_uri=http://w
ww.renren.com/bind/tsz/tszLoginCallBack&scope=basic&display=default', u'http://w
wv.renren.com/xn.do?ss=17076&rt=1&g=2011xinsheng3', u'http://invite.renren.com/u
nrgsfidfrd.do', u'http://mobile.renren.com/mobilelink.do?psf=8000201', u'http://
mobile.renren.com/showClient?psf=41004', u'http://im.renren.com/?subver=8&word02
', u'http://im.renren.com/desktop/rrsetup-8.exe?word02', u'http://g.renren.com/?
subver=5&word02', u'http://g.renren.com/lobby/rrgamesetup-5.exe?word02', u'http:
//www.renren.com/siteinfo/about', u'http://dev.renren.com', u'http://wan.renren.
com', u'http://page.renren.com/register/regGuide/', u'http://mobile.renren.com/m
obilelink.do?psf=40002', u'http://www.nuomi.com', u'http://ads.renren.com', u'ht
tp://job.renren-inc.com/', u'http://support.renren.com/helpcenter', u'http://www
.renren.com/siteinfo/privacy', u'http://www.miibeian.gov.cn/', u'http://u.discuz
.net', u'http://www.comsenz.com']
True
True
True
True
True
True
True
True
True
True
True
----------get---quiting
----------get---quiting
----------get---quiting
----------get---quiting
----------get---quiting
----------get---quiting
----------get---quiting
----------get---quiting
----------get---quiting
----------get---quiting
Exiting Main Thread

E:\codes\komodoprj\python2>
askandstudy 2011-12-19
Let me paste the code from my file. I just ran it again and it works as-is, but I think writing it this way still has a subtle problem - it only makes the failure less likely. I believe you need a lock to guarantee that checking the queue's qsize() and calling get() on it are completed together by the same thread in one step; otherwise the blocking problem can still occur. I haven't tried controlling that with a lock yet - I've never used locks before, so I need to read up and experiment first.
Or maybe there is some better approach.
Sorry mate, I don't use QQ or MSN.

Code below (just ran it; the output is posted in the next reply):


#!/usr/bin/env python
#coding=utf-8
from BeautifulSoup import BeautifulSoup
import urllib2
from threading import Thread, Lock
from Queue import Queue
import time
import socket
socket.setdefaulttimeout(5)

flag = True  # flag that controls child-thread shutdown

class Fetcher:  # wraps the fetch threads and the result-handling thread
    def __init__(self, th_num):
        self.opener = urllib2.build_opener(urllib2.HTTPHandler)
        self.q_req = Queue()  # task queue
        self.q_ans = Queue()  # result-handling queue
        self.urls = []  # urls to crawl at the next depth
        for i in range(th_num):  # start the fetch threads
            t = Thread(target=self.thread_get)
            t.start()
        for i in range(0, 1):
            t = Thread(target=self.thread_put)
            t.start()

    def join(self):  # wait for both queues to drain
        self.q_req.join()
        self.q_ans.join()
        print '=====================im done'

    def push(self, req):  # add a task to the task queue
        self.q_req.put(req)

    def thread_put(self):  # handles results: stores each page's urls in self.urls
        while flag:
            print flag  # debug
            if self.q_ans.qsize() <= 0:  # never call a get() that could block
                time.sleep(1)
                continue
            try:
                url = self.q_ans.get()
                self.urls.extend(url)
            except Exception, e:
                print e, 'other,excp========in=put'
            finally:
                self.q_ans.task_done()

    def thread_get(self):
        while flag:
            print flag
            if self.q_req.qsize() <= 0:
                time.sleep(1)
                continue
            try:
                req = self.q_req.get()
                urls = []
                ans = self.opener.open(req).read()  # fetch the page
                soup = BeautifulSoup(ans)  # parse the page
                for a in soup.findAll('a'):  # extract all links on the page
                    try:
                        if a['href'].startswith('http'):
                            urls.append(a['href'])
                    except Exception, ex:
                        print ex, '========================Exception=in=soup=findAll'
                self.q_ans.put(urls)  # hand the links to the put thread
            except Exception, e:
                print e, 'other--exception----------in- threadget----'
            finally:
                print '--------------------'
                self.q_req.task_done()
        print "----------get---quiting"

def run(f, links):  # clears urls, loads the next round's tasks, waits for completion
    f.urls = []
    for url in links:
        f.push(url)
    f.join()
    return f.urls

if __name__ == "__main__":
    links = ['http://www.kingdowin.com/',]
    deep = 2  # crawl depth
    f = Fetcher(10)
    while deep > 0:  # together with run(), controls the crawl depth
        urls = run(f, links)
        outputinfo = 'deep [%d] ok\n' % deep
        print outputinfo
        deep -= 1
        links = urls
        outputinfo = 'run turn links length:%d\n' % len(links)
        print outputinfo
        print links

    time.sleep(1)
    flag = False  # shut the threads down
    time.sleep(3)
    print "Exiting Main Thread"


askandstudy 2011-12-19
I wrestled with this code's task_done problem for most of yesterday and it kept raising errors.
angel_su 2011-12-19
Right, locking before qsize() is one way to do it. The more common practice is to lock the shared data itself: for example, if thread_put in my code were bumped to 3 threads, you'd want to take a lock before modifying self.urls.
askandstudy 2011-12-19

#!/usr/bin/env python
#coding=utf-8
from BeautifulSoup import BeautifulSoup
import urllib2
from threading import Thread, Lock
from Queue import Queue
import time
import socket
socket.setdefaulttimeout(5)

lock = Lock()
flag = True  # flag that controls child-thread shutdown

class Fetcher:  # wraps the fetch threads and the result-handling thread
    def __init__(self, th_num):
        self.opener = urllib2.build_opener(urllib2.HTTPHandler)
        self.q_req = Queue()  # task queue
        self.q_ans = Queue()  # result-handling queue
        self.urls = []  # urls to crawl at the next depth
        for i in range(th_num):  # start the fetch threads
            t = Thread(target=self.thread_get)
            t.start()
        for i in range(0, 1):
            t = Thread(target=self.thread_put)
            t.start()

    def join(self):  # wait for both queues to drain
        self.q_req.join()
        self.q_ans.join()
        print '=====================im done'

    def push(self, req):  # add a task to the task queue
        self.q_req.put(req)

    def thread_put(self):  # handles results: stores each page's urls in self.urls
        while flag:
            #print flag  # debug
            lock.acquire()
            if self.q_ans.qsize() <= 0:
                lock.release()
                time.sleep(1)
                continue
            try:
                url = self.q_ans.get()
                lock.release()
                self.urls.extend(url)
            except Exception, e:
                print e, 'other,excp========in=put'
            finally:
                self.q_ans.task_done()

    def thread_get(self):
        while flag:
            #print flag
            lock.acquire()
            if self.q_req.qsize() <= 0:
                lock.release()
                time.sleep(1)
                continue
            try:
                req = self.q_req.get()
                lock.release()
                urls = []
                ans = self.opener.open(req).read()  # fetch the page
                soup = BeautifulSoup(ans)  # parse the page
                for a in soup.findAll('a'):  # extract all links on the page
                    try:
                        if a['href'].startswith('http'):
                            urls.append(a['href'])
                    except Exception, ex:
                        print ex, '========================Exception=in=soup=findAll'
                self.q_ans.put(urls)  # hand the links to the put thread
            except Exception, e:
                print e, 'other--exception----------in- threadget----'
            finally:
                print '--------------------'
                self.q_req.task_done()
        print "----------get---quiting"

def run(f, links):  # clears urls, loads the next round's tasks, waits for completion
    f.urls = []
    for url in links:
        f.push(url)
    f.join()
    return f.urls

if __name__ == "__main__":
    links = ['http://www.kingdowin.com/',]
    deep = 2  # crawl depth
    f = Fetcher(10)
    while deep > 0:  # together with run(), controls the crawl depth
        urls = run(f, links)
        outputinfo = 'deep [%d] ok\n' % deep
        print outputinfo
        deep -= 1
        links = urls
        outputinfo = 'run return links length:%d\n' % len(links)
        print outputinfo
        print links

    time.sleep(1)
    flag = False  # shut the threads down
    time.sleep(2)
    print "Exiting Main Thread"

askandstudy 2011-12-19

Learned something - I actually wasn't too clear on how to use locks. After reading the code above I copied the pattern, and that is enough to guarantee no blocking.
Take the lock before checking qsize(), then branch; whichever branch is taken, release the lock right after the necessary operation. That logic should be sound. I tried it and it runs, though my approach is fairly crude - the version above is written better.


[Quote=quoting reply #11 by angel_su:]

Checking qsize() doesn't guarantee the next get() won't block. I switched to get(timeout=1) and changed thread_put to 3 threads; it seems to exit properly. Give it a try:
Python code
#coding=utf-8
from BeautifulSoup import BeautifulSoup
import urllib2
from threading import Thread,Lock
from Qu……
[/Quote]
angel_su 2011-12-19
Checking with qsize() doesn't guarantee the next get() won't block. I changed it to get(timeout=1) and made thread_put 3 threads; it seems to exit properly now. Try this:
#coding=utf-8
from BeautifulSoup import BeautifulSoup
import urllib2
from threading import Thread, Lock
from Queue import Queue, Empty
import time
import socket

socket.setdefaulttimeout(5)
lock = Lock()
flag = True  # flag that controls child-thread shutdown

class Fetcher:  # wraps the fetch threads and the result-handling threads
    def __init__(self, th_num):
        self.opener = urllib2.build_opener(urllib2.HTTPHandler)
        self.q_req = Queue()  # task queue
        self.q_ans = Queue()  # result-handling queue
        self.urls = []  # urls to crawl at the next depth
        for i in range(th_num):  # start the fetch threads
            t = Thread(target=self.thread_get)
            t.start()
        for i in range(0, 3):
            t = Thread(target=self.thread_put)
            t.start()

    def join(self):  # wait for both queues to drain
        self.q_req.join()
        self.q_ans.join()
        print '=====================im done'

    def push(self, req):  # add a task to the task queue
        self.q_req.put(req)

    def thread_put(self):  # handles results: stores each page's urls in self.urls
        while flag:
            #~ print flag  # debug
            try:
                url = self.q_ans.get(timeout=1)
            except Empty:
                continue
            except Exception, e:
                print e, 'other,excp========in=put'
                break
            lock.acquire()
            self.urls.extend(url)
            lock.release()
            self.q_ans.task_done()
        print "----------put---quiting"

    def thread_get(self):
        while flag:
            #~ print flag
            try:
                req = self.q_req.get(timeout=1)
            except Empty:
                continue
            except Exception, e:
                print e, 'other--exception----------in- threadget----'
            urls = []
            try:
                ans = self.opener.open(req).read()  # fetch the page
                soup = BeautifulSoup(ans)  # parse the page
                for a in soup.findAll('a'):  # extract all links on the page
                    if a['href'].startswith('http'):
                        urls.append(a['href'])
            except Exception, ex:
                print ex, '========================Exception=in=ans/soup'
            self.q_ans.put(urls)  # hand the links to the put threads
            print '--------------------'
            self.q_req.task_done()
        print "----------get---quiting"

def run(f, links):  # clears urls, loads the next round's tasks, waits for completion
    f.urls = []
    for url in links:
        f.push(url)
    f.join()
    return f.urls

if __name__ == "__main__":
    links = ['http://www.kingdowin.com/',]
    deep = 2  # crawl depth
    f = Fetcher(10)
    while deep > 0:  # together with run(), controls the crawl depth
        urls = run(f, links)
        deep -= 1
        links = urls
        print len(links)

    time.sleep(1)
    flag = False  # shut the threads down
    time.sleep(1)
    print "Exiting Main Thread"
z752964360 2011-12-18
askandstudy, I changed the code like that and it still doesn't work!
Could you add my QQ: 752964360
askandstudy 2011-12-18
I also added a few lines of debug code, so my output differs a little from yours:


if __name__ == "__main__":
    links = ['http://www.kingdowin.com/',]
    deep = 2  # crawl depth
    f = Fetcher(10)
    while deep > 0:  # together with run(), controls the crawl depth
        urls = run(f, links)
        outputinfo = 'deep [%d] ok\n' % deep
        print outputinfo
        deep -= 1
        links = urls
        outputinfo = 'run turn links length:%d\n' % len(links)
        print outputinfo
        print links
askandstudy 2011-12-18

Learned a lot. I've played with this code for most of the day and I'm worn out, so I went back to my old method, crude as it is. Hope an expert can post a cleaner version that actually runs.
I'm done wrestling with join, get, and task_done for now.


def thread_put(self):  # handles results: stores each page's urls in self.urls
    while flag:
        print flag  # debug: if this thread hadn't exited, it should keep printing the flag
        if self.q_ans.qsize() <= 0:  # skip get() while the queue is empty
            time.sleep(1)
            continue


def thread_get(self):
    while flag:
        print flag
        if self.q_req.qsize() <= 0:
            time.sleep(1)
            continue


With three lines added to each of thread_put and thread_get, my output looks like this:


E:\codes\komodoprj>c:\python27\python temp.py
True
TrueTrue

True
TrueTrue

TrueTrue

True
True
True
--------------------
True
True
TrueTrue

TrueTrue

True
TrueTrueTrue


True
True=====================im done
deep [2] ok


run turn links length:17

[u'http://my.4399.com/userapp.php?id=100111', u'http://my.kingdowin.com', u'http
://apps.renren.com/tdsheep/', u'http://www.pengyou.com/index.php?mod=appmanager&
act=openapp&type=qzone&appid=16488', u'http://www.playersaid.com/runescape-gold/
', u'http://www.playersaid.com/wow-gold/', u'http://www.playersaid.com/wow-gold/
', u'http://www.playersaid.com/runescape-gold/', u'http://www.playersaid.com/rif
t-platinum/', u'http://www.renren.com', u'http://uchome.developer.manyou.com/',
u'http://www.myspace.cn/', u'http://www.facebook.com', u'http://www.pengyou.com'
, u'http://www.kaixin001.com', u'http://www.linezing.com', u'http://www.linezing
.com']
True
--------------------
True
True
TrueTrueTrueTrue
True



TrueTrueTrueTrue


True

--------------------
True
--------------------
True
--------------------
True
--------------------
True
True
True
True
True
True
--------------------
True
--------------------
True
True
True
True
--------------------
True
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
--------------------
True
True
True
True
True
True
True
True
True
<urlopen error timed out> other--exception----------in- threadget----
--------------------
True
True
True
True
--------------------
True
--------------------
True
True
<urlopen error timed out> other--exception----------in- threadget----
--------------------
True
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
--------------------
True
True
True
True
True
True
......many more True lines omitted......
True
True
True
True
True
True
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' 'href'========================Exception=in=soup=findAll
========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href'True
========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
--------------------
True
True
'href' ========================Exception=in=soup=findAll
--------------------
True
True
True
True
True
True
True
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
'href' ========================Exception=in=soup=findAll
--------------------
True
True
True
True
True
True
True
True
True
True=====================im doneTrue

deep [1] ok

run turn links length:247

[
u'http://my.4399.com/', u'http://my.4399.com/', u'http://my.4399.com/index.php?c
t=myapp', u'http://my.4399.com/network.html', u'http://my.4399.com/help1.php', u
'http://t.sina.com.cn/my4399', u'http://t.sina.com.cn/caiwensheng', u'http://my.
4399.com/sitemap/', u'http://my.4399.com/help1.php', u'http://my.4399.com/joinus
/zhaopin.html', u'http://my.4399.com/joinus/', u'http://imga.4399.com/upload_pic
/2011/icp.jpg', u'http://net.china.cn/chinese/index.htm', u'http://imga.4399.com
/upload_pic/2011/wenwangwen.jpg', u'http://imga.4399.com/upload_pic/2011/chuban.
......a long stretch of fetched urls omitted......
m/knowledgebase/', u'http://www.comm100.com/forum/', u'http://www.comm100.com/em
ailmarketingnewsletter/', u'http://www.comm100.com/emailticket/', u'http://www.r
ingcentral.com', u'http://www.comm100.com/livechat/', u'http://www.comm100.com/l
ivechat/', u'http://www.comm100.com/', u'http://www.comm100.com/', u'http://www.
comm100.com/livechat/', u'http://www.comm100.com/knowledgebase/', u'http://www.c
omm100.com/forum/', u'http://www.comm100.com/emailmarketingnewsletter/', u'http:
//www.comm100.com/emailticket/', u'http://www.ringcentral.com']
True
True
True
True
True
True
True
True
True
True
True
----------get---quiting
----------get---quiting
----------get---quiting
----------get---quiting
----------get---quiting
----------get---quiting
----------get---quiting
----------get---quiting
----------get---quiting
----------get---quiting
Exiting Main Thread

E:\codes\komodoprj>



angel_su 2011-12-17
With get() like that, the thread blocks forever once the queue is empty. Try get(False) or get(timeout=...).
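To see the difference in isolation, a small sketch (not part of the original reply; the queue here is throwaway):

from Queue import Queue, Empty

q = Queue()
try:
    q.get(False)   # non-blocking: raises Empty immediately
except Empty:
    print 'empty - loop around and re-check the shutdown flag'

# By contrast, q.get() with no arguments would wait here indefinitely,
# so a flag tested at the top of the loop never gets another look.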
z752964360 2011-12-17
In the real program I do use Queue, but my get() and task_done() calls come in pairs. If the flag isn't at fault, then it must be something about Queue - but I can't find it!
z752964360 2011-12-17
Heh, iambic, this code does run! The small example up front was just to verify that the flag approach works.
It's the real application after it that fails.
iambic 2011-12-17
Your example is too long. Does it really take this much code to describe a "flag" problem?
