nutch

卌艹部落 2016-05-01 09:12:02
Environment: nutch-0.9 + Tomcat 6.0 + JDK 1.6 + Cygwin on Windows 7. Following the book 《Lucene+nutch搜索引擎开发》 (Lucene + Nutch Search Engine Development), I did the following:
(1) In the nutch-0.9 working directory, created weburls.txt as the seed-URL file for the crawler, with the content: http://127.0.0.1:8080/examweb/index.htm
(2) In nutch_0.9/conf/crawl-urlfilter.txt, changed +^http://([a-z0-9]*\.)*MY_DOMAIN.NAME/ to +^http://127.0.0.1:8080/
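The "+" prefix in crawl-urlfilter.txt is Nutch syntax marking the pattern as an include rule; the rest is an ordinary Java regex. A quick standalone check (plain java.util.regex, outside Nutch, class name is my own) that the seed URL passes the new rule while off-site URLs do not:

```java
import java.util.regex.Pattern;

// Sketch: verify the include pattern from crawl-urlfilter.txt against sample URLs.
// Dots are escaped here for strictness; the filter file's unescaped dots also match.
public class UrlFilterCheck {
    public static void main(String[] args) {
        Pattern include = Pattern.compile("^http://127\\.0\\.0\\.1:8080/");
        // seed URL: accepted, so the injector keeps it
        System.out.println(include.matcher("http://127.0.0.1:8080/examweb/index.htm").find());
        // off-site URL: rejected, so the crawl stays on the local Tomcat
        System.out.println(include.matcher("http://www.sohu.com/").find());
    }
}
```

If the seed URL did not match this rule, the injector would discard it and the crawl would stop immediately with 0 records.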
In nutch_0.9/conf/nutch-site.xml, set the value of http.agent.name to localweb.com
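For reference, the property block in conf/nutch-site.xml should look roughly like this (the fetcher refuses to run if http.agent.name is empty; the description text is my own wording):

```xml
<property>
  <name>http.agent.name</name>
  <value>localweb.com</value>
  <description>Identifies the crawler in the HTTP User-Agent header.</description>
</property>
```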
(3) Opened Cygwin and ran "cd /cygdrive/d/nutch_0.9" to enter the Nutch root directory, then ran:
"bin/nutch crawl weburls.txt -dir localweb -depth 3 -topN 100 -threads 1"

The output was:
crawl started in: localweb
rootUrlDir = weburls.txt
threads = 1
depth = 3
topN = 100
Injector: starting
Injector: crawlDb: localweb/crawldb
Injector: urlDir: weburls.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: localweb/segments/20160429152555
Generator: filtering: false
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: localweb/segments/20160429152555
Fetcher: threads: 1
fetching http://127.0.0.1:8080/examweb/index.htm
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: localweb/crawldb
CrawlDb update: segments: [localweb/segments/20160429152555]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: localweb/segments/20160429152603
Generator: filtering: false
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=1 - no more URLs to fetch.
LinkDb: starting
LinkDb: linkdb: localweb/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: localweb/segments/20160429152555
LinkDb: done
Indexer: starting
Indexer: linkdb: localweb/linkdb
Indexer: adding segment: localweb/segments/20160429152555
Optimizing index.
Indexer: done
Dedup: starting
Dedup: adding indexes in: localweb/indexes
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
How do I fix this exception? I've tried some methods found online, but none of them worked. Also, when I import Nutch into Eclipse, ParseResult.java in org.apache.nutch.parse under src keeps showing errors, so I've been working directly in Cygwin instead — is it OK to do further development that way?

When crawling multiple sites with:
"bin/nutch crawl multiurls.txt -dir multiweb -depth 2 -topN 100 -threads 5"
fetching, the CrawlDb update, and the LinkDb step all complete normally, but during indexing each URL is followed by (null), for example:
Indexer: adding segment: multiweb/segments/20160429154720
Indexing [http://2sc.sohu.com/] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@98f192 (null)
........
Then running bin/nutch org.apache.nutch.searcher.NutchBean SUV to search for pages containing "SUV" returns:
Total hits: 0
Any help would be greatly appreciated!