Scraping data from a website that blocks requests based on a cookie value

gybciy1s1s1 2013-11-11 03:31:50

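I've recently been building a website, and one of its features needs to dynamically fetch the content of this page:

http://www.yesinfo.com.cn/pqs_revision/pages/jsp/popcontquery.jsp?contid=ESPU8040903&p=&param=

However, the site uses anti-scraping measures: every request gets forcibly redirected to another page, so the data can't be fetched normally. After some digging I found that the blocking is based on a cookie value. A web search suggested that httpclient can control cookies, so, working from an example I found, I wrote the following code: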
// Commons HttpClient 3.x
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

HttpClient client = new HttpClient();
GetMethod getMethod = new GetMethod("http://www.yesinfo.com.cn/pqs_revision/pages/jsp/popcontquery.jsp?contid=CCLU6579946&p=&param=");
getMethod.setRequestHeader("Accept","text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
getMethod.setRequestHeader("Accept-Language","zh-CN,zh;q=0.8");
getMethod.setRequestHeader("Cache-Control","max-age=0");
getMethod.setRequestHeader("Connection","keep-alive");
getMethod.setRequestHeader("Cookie"," WLSESSIONID=HzGcSQQPmvJv8gkmn6y0cp9mTXvCVyK2GYrGG1TfVRrvyTQby2sN!-910258253;");
getMethod.setRequestHeader("Host","www.yesinfo.com.cn");
getMethod.setRequestHeader("Referer","http://www.yesinfo.com.cn/pqs_revision/pages/jsp/popuPublic.jsp");
getMethod.setRequestHeader("User-Agent","Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36");
int status= client.executeMethod(getMethod);
System.out.println(getMethod.getResponseBodyAsString());
This retrieves the data correctly, but the WLSESSIONID value in the cookie changes every so often, so I wrote a routine to obtain the cookie value dynamically:
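// Commons HttpClient 3.x (note: URI here is org.apache.commons.httpclient.URI)
import java.io.IOException;
import org.apache.commons.httpclient.Cookie;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.URI;
import org.apache.commons.httpclient.methods.GetMethod;

public static String getCookieStr(String url, HttpClient client,
        GetMethod getMethod) throws HttpException, IOException {
    getMethod.setURI(new URI(url));
    getMethod.setRequestHeader("Accept","text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
    getMethod.setRequestHeader("Accept-Language","zh-CN,zh;q=0.8");
    getMethod.setRequestHeader("Cache-Control","max-age=0");
    getMethod.setRequestHeader("Connection","keep-alive");
    getMethod.setRequestHeader("Host","www.yesinfo.com.cn");
    getMethod.setRequestHeader("Referer","http://www.yesinfo.com.cn/pqs_revision/pages/jsp/popuPublic.jsp");
    getMethod.setRequestHeader("User-Agent","Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36");
    client.executeMethod(getMethod);

    // Join all cookies the client stored as "name=value" pairs,
    // so the result can be sent back in a Cookie request header.
    Cookie[] cookies = client.getState().getCookies();
    StringBuilder cookieHeader = new StringBuilder();
    for (Cookie c : cookies) {
        System.out.println("cookie:" + c);
        if (cookieHeader.length() > 0) {
            cookieHeader.append("; ");
        }
        cookieHeader.append(c.getName()).append('=').append(c.getValue());
    }
    return cookieHeader.toString();
}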
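I've tried many pages on that site but never get a valid cookie value. Could anyone help: is there a way to obtain the correct cookie value, or some other way to scrape a site with this kind of restriction?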
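One possible approach to the question above, sketched with only the JDK's `java.net` classes rather than Commons HttpClient: let a `CookieManager` collect whatever `Set-Cookie` headers the query page returns, then reuse the same cookie store for the data request. This is a sketch under assumptions, not a confirmed fix. It assumes the site issues WLSESSIONID through an ordinary `Set-Cookie` response header when popuPublic.jsp is loaded (if the cookie is set by JavaScript, a plain HTTP client will not see it), and that the pages are encoded as gb2312. The names `SessionCookieSketch`, `fetch`, and `cookieHeaderFor` are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class SessionCookieSketch {

    /** Returns the Cookie header value the store would send to the given URI ("" if none). */
    static String cookieHeaderFor(CookieManager manager, URI uri) {
        try {
            Map<String, List<String>> headers =
                    manager.get(uri, Collections.<String, List<String>>emptyMap());
            List<String> values = headers.get("Cookie");
            return values == null ? "" : String.join("; ", values);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Fetches a page; HttpURLConnection consults the default CookieHandler on its own. */
    static String fetch(String address, String referer, String charset) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setRequestProperty("Referer", referer);
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) != -1; ) {
                out.write(buf, 0, n);
            }
            return out.toString(charset);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // One shared cookie store for every request made by this JVM.
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ORIGINAL_SERVER);
        CookieHandler.setDefault(manager);

        String queryPage = "http://www.yesinfo.com.cn/pqs_revision/pages/jsp/popuPublic.jsp";
        String dataPage = "http://www.yesinfo.com.cn/pqs_revision/pages/jsp/popcontquery.jsp"
                + "?contid=ESPU8040903&p=&param=";

        // Step 1: load the query page so the server can hand out a session cookie.
        fetch(queryPage, "http://www.yesinfo.com.cn/", "gb2312"); // charset is an assumption
        System.out.println("cookies: " + cookieHeaderFor(manager, URI.create(dataPage)));

        // Step 2: request the data page; stored cookies are attached automatically.
        System.out.println(fetch(dataPage, queryPage, "gb2312"));
    }
}
```

If the printed cookie line stays empty after step 1, the session value is probably not delivered by a plain response header, and a browser-driving tool would be the next thing to try.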


7 replies


txlty 2013-11-25


Java's HtmlUnit should be able to fully emulate a browser.
You could also find a way to integrate PhantomJS. I gave it a try with PhantomJS:

phantom.outputEncoding = "gb2312";
var page = require('webpage').create();
page.viewportSize = { width: 1024, height: 768 };
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 5.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.72 Safari/537.36'; // Chrome
page.settings.loadImages = true;
page.settings.javascriptEnabled = true;

page.open("http://www.yesinfo.com.cn/pqs_revision/pages/jsp/popuPublic.jsp", function(status) {
    page.onUrlChanged = function(url) { // fires when the URL changes
        console.log("page url :" + url);
    };
    if (status !== 'success') {
        console.log('FAIL to load the address');
        phantom.exit();
    } else {
        // After 2 seconds, fill in the container number and submit the query form.
        window.setTimeout(function () {
            page.evaluate(function () {
                document.querySelector('input[name=cont_id]').value = 'ESPU8040903';
                document.querySelector('input[name=Submit12]').click();
            });
        }, 2000);
        // After 5 seconds, read the cookie and the result table, then exit.
        window.setTimeout(function () {
            var cookie = page.evaluate(function () {
                return document.cookie;
            });
            var result = page.evaluate(function () {
                return document.querySelector('.sub_title ~ table').innerHTML;
            });
            console.log("cookie : " + cookie);
            console.log("result : " + result);
            phantom.exit();
        }, 5000);
    }
});

Result:

gybciy1s1s1 2013-11-11


http://www.yesinfo.com.cn/pqs_revision/pages/jsp/popuPublic.jsp is the query page.

gybciy1s1s1 2013-11-11


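Quoting huxiweng's reply (#4):

I entered your URL and got redirected to: http://www.yesinfo.com.cn/publicInfoService/index.action
Quoting gybciy1s1s1's reply (#2):
Quoting huxiweng's reply (#1):
There's no registration at all... damn, I can't log in.
What do you mean, register? This is the public query section; it doesn't require registration or login.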

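Right, the site checks the cookie: when WLSESSIONID has no value, the request is redirected to another page. That is exactly the problem this thread is trying to solve. I thought of using httpclient to send the cookie along with the request, but I can't find a way to obtain a valid WLSESSIONID value dynamically.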
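Since the redirect depends on whether WLSESSIONID carries a value, a small check like the following can help confirm whether a given response actually delivers that cookie. It is only an illustrative sketch using the JDK's `java.net.HttpCookie` parser. The class name `SetCookieInspector`, the method `findCookie`, and the header values in `main` are made up, and how the raw `Set-Cookie` values are collected from the response is left to whichever HTTP client is in use.

```java
import java.net.HttpCookie;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class SetCookieInspector {

    /** Finds a non-empty value for the named cookie among raw Set-Cookie header values. */
    static Optional<String> findCookie(List<String> setCookieValues, String name) {
        for (String raw : setCookieValues) {
            List<HttpCookie> parsed;
            try {
                parsed = HttpCookie.parse(raw);
            } catch (IllegalArgumentException e) {
                continue; // skip malformed headers instead of failing the whole lookup
            }
            for (HttpCookie cookie : parsed) {
                if (cookie.getName().equalsIgnoreCase(name) && !cookie.getValue().isEmpty()) {
                    return Optional.of(cookie.getValue());
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up header values, standing in for what a real response might contain.
        List<String> headers = Arrays.asList(
                "JSESSIONID=xyz; path=/",
                "WLSESSIONID=abc123!-42; path=/; HttpOnly");
        System.out.println(findCookie(headers, "WLSESSIONID").orElse("(not set)"));
    }
}
```

An empty result for every page tried would suggest the value is produced some other way (for example by script in the browser), which matches the suggestion above to use a tool that runs the page like a real browser.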