java.net.ProtocolException: Server redirected too many times (20)

xu101q 2010-11-04 03:43:55
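Calling all Java network programming experts: I need some pointers!

I wrote a web crawler, and it throws an exception when it fetches a Google search page. How can I fix this?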

	private byte[] queryData() throws Exception {
		java.net.URL connUrl = new URL(url);
		java.net.HttpURLConnection conn = (HttpURLConnection) connUrl.openConnection();
		// Present a browser-like User-Agent.
		conn.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 2.0.50727; Maxthon 2.0)");
		ByteArrayOutputStream baos = new ByteArrayOutputStream();
		java.io.InputStream input = conn.getInputStream();
		try {
			byte[] data = new byte[1024];
			int length;
			// read() returns -1 at end of stream.
			while ((length = input.read(data)) != -1) {
				baos.write(data, 0, length);
			}
		} finally {
			// Release the stream and the connection even if reading fails.
			input.close();
			conn.disconnect();
		}
		return baos.toByteArray();
	}



The URL is: http://www.google.com.hk/search?q=%E5%A6%87%E5%A5%B3&hl=zh-CN
The exception is as follows:

java.net.ProtocolException: Server redirected too many times (20)
	at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1315)
	at com.xdtech.platform.util.source.SourceFetch.queryData(SourceFetch.java:41)
	at com.xdtech.platform.util.source.SourceFetch.queryUrl(SourceFetch.java:29)
	at com.xdtech.platform.util.source.inter.AbstractSource.queryUrl(AbstractSource.java:72)
	at com.xdtech.platform.util.source.Template.SearchFilteByTemplateChange.filterByPages(SearchFilteByTemplateChange.java:187)
	at com.xdtech.platform.service.source.IndexSourceDataService.collectDataByPage(IndexSourceDataService.java:147)
	at com.xdtech.platform.core.service.SourceFetchExecutorPool$CategoryFetch.run(SourceFetchExecutorPool.java:107)

Here, "at com.xdtech.platform.util.source.SourceFetch.queryData(SourceFetch.java:41)" refers to this line in the code:
java.io.InputStream input = conn.getInputStream();



Please, could someone help me out?

If I remove "&hl=zh-CN" from the URL, the exception goes away, but the results come back in Traditional Chinese!



xu101q 2010-11-05
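First of all, thank you very much for helping me get past the exception above! But after switching to your approach, something very strange happens, and scraping the Google page still does not work.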


source = super.queryUrl(searchUrl, urlEncode);
PatternMatcherInput input2 = new PatternMatcherInput(source.toString());
PatternMatcher matcher = new Perl5Matcher();
System.out.println("============================input2.substring "+input2.substring(10000, input2.length()-1000));
if (tf.getBlockPat() != null) {
	while (matcher.contains(input2, tf.getBlockPatPattern())) {
		try {
			WebSearchResult res = pressPage(matcher.getMatch().group(1));
			if (res != null)
				result.add(res);
		} catch (Exception ex) {
			continue;
		}
	}
}

The pressPage method is as follows:
private WebSearchResult pressPage(String blockPat) {
	System.out.println("======blockPat===="+ blockPat);
	WebSearchResult result = new WebSearchResult();
	PatternMatcher matcher = new Perl5Matcher();
	String value = blockPat;
	String url = null;
	..........................................
	........................... (code omitted)
	return result;
}

On the console, System.out.println("======blockPat===="+ blockPat); prints an empty value!

The URL used is: http://www.google.com.hk/search?q={keyword}&num=100&hl=zh-CN&lr=&newwindow=1&safe=strict&tbo=s&tbs=qdr:n&sa=X&ei=4-qFTKfGFpHEvQOJhZiIBA&ved=0CBwQpwU
The regular-expression extraction rule used is: <li class=g style="margin-bottom:8px"><h3 class="r">[\s\S]*?</cite><span class=gl></span></span></div>
input2 has already received the page source. Part of it, as printed by input2.substring, is:
<li class=g style="margin-bottom:8px"><h3 class="r"><a href="http://rent.soufu
n.com/chuzu/1_55444876_-1.htm" target=_blank class=l onmousedown="return clk(0,'
','','','16','','0CE0QFjAP')">北齿小区租房,两室一厅北京齿轮厂宿舍_北京租房网_搜
房网</a></h3><div class="s"><span class="f std" >11 秒前</span> - <b>...</b> 中
国旅游学院附中,八里庄第三小学,北京市朝阳区育人学校幼儿园:勘测设计院幼 <b>...</b
><br><span class=f><cite>rent.soufun.com/chuzu/1_55444876_-1.htm</cite><span cla
ss=gl></span></span></div>
<li class=g style="margin-bottom:8px"><h3 class="r"><a
href="http://rent.soufun.com/chuzu/1_55444889_-1.htm" target=_blank class=l onm
ousedown="return clk(0,'','','','17','','0CE8QFjAQ')">东方瑞景租房,一室一厅出租
长安街附近东方公寓房屋_北京租房网_搜房网</a></h3><div class="s"><span class="f s
td" >58 秒前</span> - <b>...</b> 周边配套:<em>大学</em>:华夏管理学院、朝阳区职
工<em>大学</em>中小学:陈经纶中学、芳草地小学 <b>...</b><br><span class=f><cite>
rent.soufun.com/chuzu/1_55444889_-1.htm</cite><span class=gl></span></span></div
>
<li class=g id=mbb18><h3 class="r"><a href="http://newhouse.wuhu.soufun.com/201
0-11-05/4000823.htm" target=_blank class=l onmousedown="return clk(0,'','','','1
8','','0CFEQFjAR')">花开收官时最美成熟季四期臻品小高层即将推出-芜湖新房网-搜房网
</a></h3><div class="s"><span class="f std" >47 秒前</span> - 物业地址弋江区九华
南路800号(安徽师范<em>大学</em>南校区对面). 交通状况可乘坐16、18、45 <b>...</b
><br><span class=f><cite>newhouse.wuhu.soufun.com/2010-11-05/4000823.htm</cite><
span class=gl></span></span></div><div class=mbl><div class=bl><span class=ch id
=mbl18 onclick="google.x(this)" style="display:inline-block"><div class=mbi></di
v><a href=# onclick="return false" class=mblink>显示来自 soufun.com·的更多搜索
结果</a></span></div></div><div id=mbf18><span></span></div>
<li class=g style="m
argin-bottom:8px"><h3 class="r"><a href="http://house.focus.cn/news/2010-11-05/1
093022.html" target=_blank class=l onmousedown="return clk(0,'','','','19','','0
CFQQFjAS')">十年城北区盛放抢最后的新牌坊- 新闻中心- 搜狐焦点网</a></h3><div clas
s="s"><span class="f std" >38 秒前</span> - <b>...</b> 在上半年销售面积100134平
米,销售金额600798157元,两项指标双双进入<em>重庆</em>前十。 <b>...</b> 2010-11-
05二次调控中高端别墅表现出众客群稳定供需两; 2010-11-04<em>大学</em>城投资 <b>...
</b><br><span class=f><cite>house.focus.cn/news/2010-11-05/1093022.html</cite><s
pan class=gl></span></span></div>
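
One possible explanation for the empty blockPat, not confirmed anywhere in this thread: the extraction rule quoted above contains no capturing parentheses, so there is no group 1 for matcher.getMatch().group(1) to return (as far as I know, Jakarta ORO returns null for a group that is not in the pattern, while group(0) is the whole match). The sketch below illustrates the idea with java.util.regex rather than ORO, which is a swapped-in library. The class name BlockExtractSketch, the method extractBlocks, and the relaxed `[^>]*` in place of the fixed style attribute (one sample item above carries id=mbb18 instead) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BlockExtractSketch {
    // Same shape as the rule in the post, with two assumed changes:
    // the whole block is wrapped in ( ) so that group(1) exists, and
    // the attributes after "class=g" are matched loosely with [^>]*.
    static final Pattern BLOCK = Pattern.compile(
        "(<li class=g[^>]*><h3 class=\"r\">[\\s\\S]*?</cite><span class=gl></span></span></div>)");

    /** Returns every result block found in the page source, in document order. */
    static List<String> extractBlocks(String html) {
        List<String> blocks = new ArrayList<>();
        Matcher m = BLOCK.matcher(html);
        while (m.find()) {
            blocks.add(m.group(1)); // group(1) is the parenthesized block
        }
        return blocks;
    }

    public static void main(String[] args) {
        String html = "<li class=g id=x><h3 class=\"r\"><a href=\"http://example.com\">t</a></h3>"
            + "<span class=f><cite>example.com</cite><span class=gl></span></span></div>";
        System.out.println(extractBlocks(html).size() + " block(s) found");
    }
}
```

With ORO, the equivalent experiment would be to either add the parentheses to the rule or call group(0) instead of group(1). Logging the exception in the catch block instead of silently continuing would also make failures inside pressPage visible.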


ChDw 2010-11-04
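		String cookie = "";
		int attempts = 0; // bound the loop so an unexpected status cannot spin forever
		while (attempts++ < 10) {
			HttpURLConnection conn = (HttpURLConnection) new URL("http://www.google.com.hk/search?q=%E5%A6%87%E5%A5%B3&hl=zh-CN").openConnection();
			if (cookie.length() != 0)
				conn.setRequestProperty("Cookie", cookie);
			conn.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 8.0)");
			// Handle redirects by hand so the cookie can be captured and resent.
			conn.setInstanceFollowRedirects(false);
			int code = conn.getResponseCode();
			if (code == HttpURLConnection.HTTP_OK)
				break;
			String setCookie = conn.getHeaderField("Set-Cookie");
			if (code == HttpURLConnection.HTTP_MOVED_TEMP && setCookie != null) {
				// Keep only the name=value part; attributes such as path or expires are not sent back.
				cookie += setCookie.split(";", 2)[0] + "; ";
			}
		}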



That's roughly the idea, although I think Apache's HttpClient would do the job too, since it supports cookies directly.
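
For readers who would rather stay with HttpURLConnection than add HttpClient, the JDK's own java.net.CookieManager (Java 6 and later) can take over the cookie bookkeeping while redirects are followed automatically. This is a swapped-in alternative, not the HttpClient approach mentioned above. The sketch below is only a minimal illustration: the local test server, the cookie name PREF=demo, and the class and method names are hypothetical stand-ins, and the server merely imitates a site that keeps redirecting until the client sends its cookie back.

```java
import com.sun.net.httpserver.HttpServer;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class CookieRedirectDemo {
    /** Hypothetical stand-in server: redirects to itself until the cookie comes back. */
    static HttpServer startLocalServer() throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/search", exchange -> {
            String cookie = exchange.getRequestHeaders().getFirst("Cookie");
            if (cookie != null && cookie.contains("PREF=demo")) {
                byte[] body = "ok".getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                exchange.getResponseBody().write(body);
            } else {
                exchange.getResponseHeaders().add("Set-Cookie", "PREF=demo; Path=/");
                exchange.getResponseHeaders().add("Location", "/search");
                exchange.sendResponseHeaders(302, -1); // -1: no response body
            }
            exchange.close();
        });
        server.start();
        return server;
    }

    /** Installs a JVM-wide cookie store, then fetches with automatic redirects. */
    static int fetchWithCookieManager() throws Exception {
        // Must be installed before openConnection() is called.
        CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
        HttpServer server = startLocalServer();
        try {
            URL url = new URL("http://127.0.0.1:" + server.getAddress().getPort() + "/search");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 8.0)");
            return conn.getResponseCode();
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("status = " + fetchWithCookieManager());
    }
}
```

Note that CookieHandler.setDefault applies to every HTTP connection in the JVM, which may or may not be what a multi-threaded crawler wants. Whether this alone is enough for the Google URL in this thread has not been verified here.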
xu101q 2010-11-04
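Could you explain that a bit more clearly?
Or give me a Google URL that works? I have tried others, such as Baidu and Google Blogs, and they all work. Only this Google URL fails!

Many thanks!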
ChDw 2010-11-04
The key point is that cookies are involved: you have to send the cookie back to the server. The User-Agent matters too.
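
To make "send the cookie back" concrete: a response may carry several Set-Cookie headers, each with attributes (path, expires and so on) that do not belong in the request. A small helper can reduce them to the name=value pairs that go into a single Cookie header. This is only a sketch under simplifying assumptions; the class name CookieHeaderSketch and the method toCookieHeader are hypothetical, and domain, path, and expiry rules are deliberately ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CookieHeaderSketch {
    /**
     * Keeps only the leading name=value pair of each Set-Cookie value
     * and joins the pairs into one string for a Cookie request header.
     */
    static String toCookieHeader(List<String> setCookieValues) {
        List<String> pairs = new ArrayList<>();
        if (setCookieValues != null) {
            for (String value : setCookieValues) {
                if (value == null) {
                    continue;
                }
                // Everything after the first ';' is an attribute, not part of the pair.
                String pair = value.split(";", 2)[0].trim();
                if (!pair.isEmpty()) {
                    pairs.add(pair);
                }
            }
        }
        return String.join("; ", pairs);
    }

    public static void main(String[] args) {
        // Made-up header values, for illustration only.
        List<String> headers = Arrays.asList("A=1; path=/", "B=2; HttpOnly");
        System.out.println(toCookieHeader(headers));
    }
}
```

In the manual-redirect loop, the input list could come from conn.getHeaderFields().get("Set-Cookie"), which returns all values of that header (or null if there are none), and the result would be passed to conn.setRequestProperty("Cookie", ...) on the next request.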
