Web-Harvest question (100 points)

a512796048 2012-08-31 12:16:18

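100 points on offer: could someone write an example that uses Web-Harvest to scrape a web page, taking http://www.fenzhi.com/xsc1p1.html as the example?
Below is the script I wrote: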

<?xml version="1.0" encoding="UTF-8"?>
<config>
	<include path="functions.xml" />
	<file action="write" path="fz/2.xml" charset="UTF-8">
		<![CDATA[<catalog>]]>
		<empty>
			<var-def name="priceList" id="priceList">
				<xpath expression="//div[@class='winnerLink']">
					<html-to-xml>
						<http url="http://www.fenzhi.com/xsc1p1.html" />
					</html-to-xml>
				</xpath>
			</var-def>
		</empty>
		<loop item="item" index="i">
			<list>
				<var name="priceList"></var>
			</list>
			<body>
				<xquery>
					<xq-param name="item" type="node()">
						<var name="item" />
					</xq-param>
					<xq-expression>
						<![CDATA[
							declare variable $item as node() external;
							let $name := data($item)
							return
								<info name='{normalize-space($name)}'>
									<name>{normalize-space($name)}</name>
								</info>
						]]>
					</xq-expression>
				</xquery>
			</body>
		</loop>
		<![CDATA[</catalog>]]>
	</file>
</config>

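What I want is to scrape company names such as Huawei and IBM from http://www.fenzhi.com/xsc1p1.html and store them, then scrape each hyperlink (for example gsx3131.html), join it with www.fenzhi.com to form a new URL, open that URL, and scrape the company profile.
What confuses me at the moment is how to iterate over two result sets inside the same loop: the results always end up merged together.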
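One way to keep the name and the link paired is to select each link node once and read both values from that same node, instead of building two separate lists and trying to walk them in parallel. The sketch below shows the idea in plain Java with the JDK's XPath API rather than in Web-Harvest configuration. It assumes each `winnerLink` div contains an `<a>` element whose text is the company name and whose `href` is the relative page; that markup, the `SAMPLE` string, the class name `CompanyLinks`, and the file name `gsx42.html` are hypothetical and were not checked against the real page.

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CompanyLinks {
    // Hypothetical stand-in for the listing page after HTML-to-XML cleanup.
    static final String SAMPLE =
        "<html><body>"
        + "<div class='winnerLink'><a href='gsx3131.html'> Huawei </a></div>"
        + "<div class='winnerLink'><a href='gsx42.html'>IBM</a></div>"
        + "</body></html>";

    // Site root used to turn relative links into absolute URLs.
    static final URI BASE = URI.create("http://www.fenzhi.com/");

    /** Returns one {name, absoluteUrl} pair per link, both read from the same node. */
    static List<String[]> extract(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList links = (NodeList) xpath.evaluate(
                "//div[@class='winnerLink']//a",
                new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);
            List<String[]> rows = new ArrayList<>();
            for (int i = 0; i < links.getLength(); i++) {
                Element a = (Element) links.item(i);
                String name = a.getTextContent().trim();
                // resolve() joins the relative href with the site root
                String url = BASE.resolve(a.getAttribute("href")).toString();
                rows.add(new String[] {name, url});
            }
            return rows;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String[] row : extract(SAMPLE)) {
            System.out.println(row[0] + " -> " + row[1]);
        }
    }
}
```

The same shape should carry over to the Web-Harvest script: loop over the link nodes, and inside the loop body derive the name and the href from the current item, so that there is only one result set to iterate.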


a512796048 2012-08-31


Java code:

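import org.webharvest.definition.ScraperConfiguration;
import org.webharvest.runtime.Scraper;

// The class name is a placeholder; the original post showed only the main method.
public class RunScraper {
	public static void main(String[] args) {
		try {
			ScraperConfiguration config = new ScraperConfiguration(
					"H:\\workspace\\nutch\\src\\com\\jsq\\nutch\\jianjie.xml");
			// Second argument is the working directory; the scraped XML is saved here
			Scraper scraper = new Scraper(config, "E:\\tmp");
			scraper.setDebug(true);
			// Start the timer before execute() so the run can actually be measured
			long startTime = System.currentTimeMillis();
			scraper.execute();
			System.out.println("Elapsed ms: " + (System.currentTimeMillis() - startTime));
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
}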