A problem using XPath in a Python crawler

kui27 2014-06-28 10:40:10

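I've been practicing Python lately and using XPath to scrape content from web pages; it feels more flexible and concise than regex matching. Today, though, I ran into a problem. I searched for a long time without finding a solution, and when I check my expression against the XPath syntax I can't see any mistake, yet the result is still wrong. Could the experts on this forum give me some pointers?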



#coding:utf-8

import urllib
import urllib2

from lxml import etree

if __name__ == "__main__":
    # The goal of this code is to scrape the "更新时间" (update time) from the page below
    req_url = 'http://www.mumayi.com/android-81548.html'
    try:
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0'}
        req = urllib2.Request(req_url, headers=headers)
        content = urllib2.urlopen(req, timeout=60).read()
        if isinstance(content, unicode):
            pass
        else:
            content = content.decode('utf-8')
        #print content
        htmlSource = etree.HTML(content)

        names = htmlSource.find('.//ul[@class="istyle fl"]//li[4]')  # The problem is the index [4] on the li node: as soon as li[4] is added, the result is None
        print names.text, type(names)

...

8 replies

hanyuwei0 2014-06-30

The update time should be li[3].
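To illustrate why the index matters, here is a minimal sketch on a made-up fragment (the markup and item texts are assumptions, not the real page). It uses the standard library's `xml.etree.ElementTree` instead of lxml, since its `find()` accepts the same limited path syntax that lxml's `find()` implements. Positions are 1-based and count `li` siblings under the same parent, so in a list of three items `li[3]` matches and `li[4]` matches nothing.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: a list with exactly three <li> items.
SAMPLE = """
<div>
  <ul class="istyle fl">
    <li>Version: 1.0</li>
    <li>Size: 2 MB</li>
    <li>更新时间：2014-06-19</li>
  </ul>
</div>
"""

root = ET.fromstring(SAMPLE)

third = root.find('.//ul[@class="istyle fl"]/li[3]')
fourth = root.find('.//ul[@class="istyle fl"]/li[4]')

print(third.text)  # 更新时间：2014-06-19
print(fourth)      # None: there is no fourth <li>, so find() returns None
```

Because `find()` returns `None` instead of raising an error, accessing `.text` on the result is what fails later with the NoneType error.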

The_Third_Wave 2014-06-29

<ul class="menu fl hidden" id="menu">
                        <li class="conBox"><strong>应用：</strong><a href="http://www.mumayi.com/android/xitonggongju"

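If I'm not mistaken, you need to write out the complete attribute value. I've never tried it with an incomplete one, and this is what I always use. Haha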

names = htmlSource.find('.//ul[@class="menu fl hidden"]//li[4]')
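The point about writing the complete value can be checked with a small sketch. The fragment below is hypothetical, modeled on the `<ul>` quoted above, and it uses the standard library's `xml.etree.ElementTree` rather than lxml (both follow the same rule for `find()`): `[@class="..."]` compares the whole attribute string, so a single class name out of several does not match.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modeled on the <ul> quoted above.
SAMPLE = '<div><ul class="menu fl hidden" id="menu"><li>item</li></ul></div>'

root = ET.fromstring(SAMPLE)

full = root.find('.//ul[@class="menu fl hidden"]')  # the whole attribute value
partial = root.find('.//ul[@class="menu"]')         # only one of the class names

print(full is not None)  # True: the attribute string matches exactly
print(partial)           # None: "menu" is not equal to "menu fl hidden"
```

With lxml's `xpath()` method, which evaluates full XPath 1.0, a substring test such as `contains(@class, "menu")` is one way to match on a partial value.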

The_Third_Wave 2014-06-29

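Note that the two matches are not of the same type, so they have to be handled separately; otherwise, handle them with try/except inside a for loop.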

更新时间：
2014-06-19

The_Third_Wave 2014-06-29

Quoting kui27's reply (#4):

[quote=Quoting u013171165's reply (#3):]

[...]

I don't know exactly which level you want to scrape, so I wrote this for your reference!

What I want to scrape is this piece of information: 更新时间：2014-06-19. I tried it your way and it still raises a NoneType error. The node just isn't right.[/quote]

You still haven't read the syntax carefully!

# -*- coding: utf-8 -*-

import urllib
import urllib2
import lxml.html as HTML

if __name__ == "__main__":
    # The goal of this code is to scrape the "更新时间" (update time) from the page below
    req_url = 'http://www.mumayi.com/android-81548.html'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0'}
    req = urllib2.Request(req_url, headers=headers)
    content = urllib2.urlopen(req, timeout=60).read()
    if isinstance(content, unicode):
        pass
    else:
        content = content.decode('utf-8')
    htmlSource = HTML.fromstring(content)
    # The <span> inside the third <li> of the list
    names = htmlSource.xpath(u'//ul[@class="istyle fl"]/li[3]/span')
    print names[0].text
    # The text nodes that are direct children of the same <li>
    time = htmlSource.xpath(u'//ul[@class="istyle fl"]/li[3]/child::text()')
    print time[0]

Points, please! Haha

kui27 2014-06-29

Quoting u013171165's reply (#3):

[...]

I don't know exactly which level you want to scrape, so I wrote this for your reference!

What I want to scrape is this piece of information: 更新时间：2014-06-19. I tried it your way and it still raises a NoneType error. The node just isn't right.

The_Third_Wave 2014-06-29

#coding:utf-8

import urllib
import urllib2
import lxml.html as HTML

if __name__ == "__main__":
    # The goal of this code is to scrape the "更新时间" (update time) from the page below
    req_url = 'http://www.mumayi.com/android-81548.html'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0'}
    req = urllib2.Request(req_url, headers=headers)
    content = urllib2.urlopen(req, timeout=60).read()
    if isinstance(content, unicode):
        pass
    else:
        content = content.decode('utf-8')
    # print content
    htmlSource = HTML.fromstring(content)
    print htmlSource
    names = htmlSource.xpath(r'//ul[@class="menu fl hidden"]/li/strong')  # The problem is the index [4] on the li node: as soon as li[4] is added, the result is None
    for name in names:
        print name.text

>>> ================================ RESTART ================================
>>> 
<Element html at 0x2ab44e0>
应用：
游戏：
应用：
游戏：
>>>

I don't know exactly which level you want to scrape, so I wrote this for your reference!

angel_su 2014-06-29

The text is in the child span node. Try:
names = htmlSource.find('.//ul[@class="istyle fl"]/li[3]/span')
print names.text, names.tail
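The `text`/`tail` distinction behind this suggestion can be sketched on a single made-up `<li>` (the markup is an assumption about how the page is structured, not a copy of it). The example uses the standard library's `xml.etree.ElementTree`; lxml elements expose the same `text` and `tail` attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical <li>: the label is inside <span>, the date follows the span.
li = ET.fromstring('<li><span>更新时间：</span>2014-06-19</li>')
span = li.find('span')

print(li.text)    # None: no text comes before the first child of <li>
print(span.text)  # 更新时间：
print(span.tail)  # 2014-06-19
```

So even when the right `<li>` is found, `li.text` can still be `None`; the label is the span's `text`, and the date is the span's `tail`.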

kui27 2014-06-28

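# The problem is the index [4] on the li node: as soon as li[4] is added, the result is None. But according to the XPath tutorials I've read online, adding the index [4] to the li node is not a syntax error. I don't know why the program goes wrong, or whether the mistake is somewhere else.
names = htmlSource.find('.//ul[@class="istyle fl"]//li[4]')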
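One possible source of the confusion is that `//li[4]` is valid syntax but does not mean "the fourth `li` in the document". It means "any `li` that is the fourth `li` child of its own parent". The sketch below shows the difference on a made-up fragment (an assumption, not the real page), using the standard library's `xml.etree.ElementTree`, whose `find()` shares this behavior with lxml's `find()`.

```python
import xml.etree.ElementTree as ET

# Hypothetical page with two lists of three items each.
SAMPLE = """
<div>
  <ul><li>a1</li><li>a2</li><li>a3</li></ul>
  <ul><li>b1</li><li>b2</li><li>b3</li></ul>
</div>
"""

root = ET.fromstring(SAMPLE)
all_items = root.findall('.//li')

print(len(all_items))         # 6
print(root.find('.//li[4]'))  # None: no parent has a fourth <li> child
print(all_items[3].text)      # b1: the fourth <li> in document order
```

With lxml's `xpath()` method, the document-order variant can also be written as `(//li)[4]`.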