How can I speed up this download program with multithreading?

fnzh0003 2011-06-17 06:07:08

I wrote a program that downloads the online tutorial at http://www.network-theory.co.uk/docs/pytut/:

import time
import urllib
import lxml.html
import os

time1=time.time()
os.mkdir('/tmp/python')
down='http://www.network-theory.co.uk/docs/pytut/'
file=urllib.urlopen(down).read()
root=lxml.html.fromstring(file)
tnodes = root.xpath("//div[@class='main']//ul/li/a")
# Download each linked page in turn and save it under its link text.
for x in tnodes:
    url='http://www.network-theory.co.uk/docs/pytut/'+x.get('href')
    name=x.text
    myfile=open('/tmp/python/'+name,'a')
    page=urllib.urlopen(url).read()
    myfile.write(page)
    myfile.close()
time2=time.time()
print time2-time1

I timed the download with the program above: 564 seconds, which is very slow. Multithreading could probably speed it up.
How would I rewrite it to use multiple threads?

7 replies (newest first)

angel_su 2011-06-21

name=x.text
myfile=open('/tmp/python/'+name,'a')
Try keeping these two lines exactly as they were in your original code...

fnzh0003 2011-06-21

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 505, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/home/pengtao/tut.py", line 25, in worker
    myfile=open('/tmp/python'+name,'a')
TypeError: cannot concatenate 'str' and 'list' objects

The code in reply #2 has a bug and needs fixing.
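
The traceback is consistent with `list(x.text)` turning the link text into a list of characters, which cannot be concatenated with a string. Below is a minimal Python 3 sketch of the failure and one possible fix, assuming the link text is a plain string. `build_path` and the sample title are illustrative names, not taken from the thread.

```python
import os

def build_path(directory, name):
    # os.path.join inserts the separator, so a missing trailing "/" no longer matters.
    return os.path.join(directory, name)

title = "Whetting Your Appetite"   # hypothetical stand-in for x.text

try:
    "/tmp/python" + list(title)    # str + list raises TypeError
except TypeError as exc:
    error = str(exc)

path = build_path("/tmp/python", title)
```

Using the string itself (`name = x.text`) and joining it with `os.path.join` avoids both the type error and the missing slash between the directory and the file name.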

iambic 2011-06-18

If the time is being spent on the downloads themselves, try asynchronous I/O.
Search Google for: python urllib async
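
One way this suggestion might look in Python 3 is sketched below. The standard library has no asynchronous HTTP client, so this sketch hands blocking `urllib.request` calls to the event loop's thread pool; a truly non-blocking client would need a third-party library. `download_all`, `fetch_func` and `limit` are assumed names for illustration.

```python
import asyncio
import urllib.request

def fetch(url):
    # Blocking download; it runs in a worker thread so the event loop stays free.
    with urllib.request.urlopen(url) as resp:
        return resp.read()

async def download_all(urls, fetch_func=fetch, limit=10):
    loop = asyncio.get_running_loop()
    sem = asyncio.Semaphore(limit)      # cap the number of downloads in flight

    async def one(url):
        async with sem:
            data = await loop.run_in_executor(None, fetch_func, url)
            return url, data

    pairs = await asyncio.gather(*(one(u) for u in urls))
    return dict(pairs)                  # maps each URL to its downloaded bytes

# Usage (performs real network requests):
# pages = asyncio.run(download_all(["http://www.network-theory.co.uk/docs/pytut/"]))
```

Passing a different `fetch_func` makes it possible to try the scheduling logic without touching the network.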

angel_su 2011-06-17

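Queue is thread-safe, so you don't need to do your own locking. The main thread waits on queue.join(), so it doesn't matter if a worker thread blocks forever and never exits; just remember to call task_done() after each job is finished. That's why I said earlier that it is more convenient for simple cases: there is less to worry about...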
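
The join()/task_done() pairing described here can be sketched in Python 3 as follows (the module is named `queue` there rather than `Queue`). This is an illustration under assumptions: `run_jobs` and `handle` are hypothetical names, and the workers are marked as daemon threads, which is what lets the program exit while they are still blocked in `get()`.

```python
import queue
import threading

def run_jobs(items, handle, num_workers=4):
    jobs = queue.Queue()
    for item in items:
        jobs.put(item)

    def worker():
        while True:
            item = jobs.get()        # blocks until a job is available
            try:
                handle(item)
            finally:
                jobs.task_done()     # always report completion, even on error

    for _ in range(num_workers):
        # Daemon threads do not keep the interpreter alive after main finishes.
        threading.Thread(target=worker, daemon=True).start()

    jobs.join()   # returns once every put() has a matching task_done()
```

If `task_done()` were skipped for any job, `jobs.join()` would never return, which is why it sits in a `finally` clause here.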

panghuhu250 2011-06-17

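I misread the original question. I thought the goal was a web spider, which has to filter out duplicate URLs, so I used a set.

After reading the previous reply I decided to look at how Queue is implemented. It turns out that both threading and multiprocessing have a Condition class whose wait and notify methods make it possible to avoid busy waiting.

From now on I'll read through one Python standard library source file every day.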
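
To illustrate how `wait` and `notify` avoid busy waiting, here is a simplified Python 3 toy. It is not a copy of the standard library's Queue code, and `TinyQueue` is a made-up name for this sketch.

```python
import threading

class TinyQueue:
    """A toy blocking queue built on threading.Condition (illustration only)."""

    def __init__(self):
        self._items = []
        self._cond = threading.Condition()

    def put(self, item):
        with self._cond:
            self._items.append(item)
            self._cond.notify()       # wake one waiting consumer

    def get(self):
        with self._cond:
            while not self._items:    # re-check the condition after every wakeup
                self._cond.wait()     # releases the lock while sleeping
            return self._items.pop(0)
```

A consumer calling `get()` on an empty queue sleeps inside `wait()` instead of polling in a loop, and resumes only when a producer calls `put()`.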

angel_su 2011-06-17

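Start by learning the threading module. For simple cases, using Queue may be more straightforward. Below is your code adapted to download with 10 threads...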

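import time
import urllib
import lxml.html
import os
import Queue
import threading

time1=time.time()
os.mkdir('/tmp/python')
down='http://www.network-theory.co.uk/docs/pytut/'
file=urllib.urlopen(down).read()
root=lxml.html.fromstring(file)
tnodes = root.xpath("//div[@class='main']//ul/li/a")

# Fill the queue with every link before any worker starts.
jobs = Queue.Queue()
for x in tnodes:
    jobs.put(x)

def worker():
    while True:
        try:
            # get_nowait() avoids blocking forever if another thread took the last job.
            x = jobs.get_nowait()
        except Queue.Empty:
            return
        url='http://www.network-theory.co.uk/docs/pytut/'+x.get('href')
        # As first posted, the next two lines read name=list(x.text) and
        # open('/tmp/python'+name,'a'), which raised the TypeError in fnzh0003's follow-up.
        name=x.text
        myfile=open('/tmp/python/'+name,'a')
        page=urllib.urlopen(url).read()
        myfile.write(page)
        myfile.close()
        jobs.task_done()

# Start 10 worker threads, then wait until every job has been marked done.
for i in range(10):
    threading.Thread(target=worker).start()
jobs.join()

time2=time.time()
print time2-time1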
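
For readers on Python 3, a similar 10-thread download could be sketched with `concurrent.futures.ThreadPoolExecutor` instead of a hand-built Queue. This is a swapped-in technique, not the code from this reply. `save_page`, `download_all`, `fetch_func` and the `(href, name)` pair format are assumptions made for illustration; only the base URL comes from the question.

```python
from concurrent.futures import ThreadPoolExecutor
import os
import urllib.request

BASE = "http://www.network-theory.co.uk/docs/pytut/"   # URL from the question

def fetch(url):
    # Blocking download of one page.
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def save_page(href, name, out_dir, fetch_func=fetch):
    # Download one page and write it to out_dir under the given name.
    data = fetch_func(BASE + href)
    path = os.path.join(out_dir, name)
    with open(path, "wb") as f:
        f.write(data)
    return path

def download_all(links, out_dir, fetch_func=fetch, workers=10):
    """links: iterable of (href, name) pairs, e.g. taken from the parsed index page."""
    os.makedirs(out_dir, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(save_page, href, name, out_dir, fetch_func)
                   for href, name in links]
        # result() re-raises any exception that occurred in a worker.
        return [f.result() for f in futures]
```

The executor manages the job queue and the worker threads internally, so no explicit `task_done()` or `join()` calls are needed.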

panghuhu250 2011-06-17

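See Python's multiprocessing module.

In the code below, url_pool holds all URLs waiting to be processed, and WORKER_LIMIT caps how many workers download at the same time. When a worker finishes downloading a URL, it adds the newly discovered URLs to url_pool. The main function keeps checking whether any URLs are pending and, if so, starts a new worker (a Process).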



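# multiprocessing.dummy provides a thread-backed Process with the same API.
# It is used here because the workers update the module-level url_pool;
# real processes would each get their own copy of it.
from multiprocessing.dummy import Process, Lock
from time import sleep

url_pool = set(["first_url"])   # set("first_url") would create a set of single characters
workerCounter = 0
WORKER_LIMIT = 10
l = Lock()

def worker(url):
  global workerCounter
  # download url
  # parse content to find all links that need to be downloaded
  new_url_link_list = []   # placeholder for the links found on the page
  # add the links to the global pool; the lock must be taken first
  l.acquire()
  url_pool.update(new_url_link_list)
  workerCounter -= 1       # this worker is done
  l.release()

def main():
  global workerCounter
  while True:
    l.acquire()
    if len(url_pool)==0:
      # nothing pending and nobody still working: finished
      if workerCounter == 0:
        l.release()
        return
    else:
      while workerCounter < WORKER_LIMIT and len(url_pool)>0:
        Process(target=worker, args = (url_pool.pop(),)).start()
        workerCounter += 1
    l.release()
    sleep(1)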
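
A web-spider variant of this idea also has to skip URLs it has already visited, which is where a set is useful. The Python 3 sketch below crawls level by level with a thread pool rather than using the polling loop above, so it is a different scheduling scheme. `crawl` and `get_links` are hypothetical names, and `get_links(url)` is assumed to download a page and return the URLs it links to.

```python
from concurrent.futures import ThreadPoolExecutor

def crawl(start_url, get_links, workers=10):
    seen = {start_url}        # every URL ever scheduled, used to filter duplicates
    frontier = [start_url]    # URLs to fetch in the current round
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            next_frontier = []
            # Fetch the whole round in parallel; results are merged in this thread,
            # so the set needs no lock.
            for links in pool.map(get_links, frontier):
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return seen
```

Because only the calling thread touches `seen`, the explicit lock from the polling version is not needed in this arrangement.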