High-point bounty: help needed with simulating a form submission in Perl

wx红杉树 2008-09-22 02:59:42
http://sitereview.bluecoat.com/sitereview.jsp
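This page takes a submitted URL and returns the category it belongs to; my goal is to get that category result.
The form is unusual: unlike an ordinary form, it seems to be generated dynamically by JavaScript. I tried the following with Perl's LWP::UserAgent: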
use strict;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new();
$ua->agent('Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)');
my $response = $ua->get("http://sitereview.bluecoat.com/sitereview.jsp",["url" => "www.yahoo.com"]);
if ($response->is_success) {
    print $response->content;
} else {
    die $response->status_line;
}
Could an expert point me in the right direction? Please look at it closely; this one really is tricky.

wx红杉树 2008-09-24
Solved. Here is the working script:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Expect exactly one argument: a file with one URL per line.
if (@ARGV != 1) {
    print "Usage: get_category.pl <target file>\n";
    exit 1;
}
my $target = $ARGV[0];
if (! -e $target) {
    print "Error: Can't find $target\n";
    exit 1;
}

my $ua = LWP::UserAgent->new();
$ua->agent('Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)');
my $categoryResult = 'Top500Category.txt';

open my $urls, '<', $target or die "Can't open $target: $!";
while (my $theurl = <$urls>) {
    chomp $theurl;
    # Post the fields directly, bypassing the JavaScript-generated form.
    my $response = $ua->post("http://sitereview.bluecoat.com/sitereview.jsp",
        [ "name" => "jscheck", "value" => "validated", "url" => $theurl ],
        'Referer' => 'http://sitereview.bluecoat.com/sitereview.jsp',
    );
    if ($response->is_success) {
        print "submit success! \n";
    } else {
        die $response->status_line;
    }
    my $content = $response->content;
    my $category = "";
    # $2 holds the text from the first category link's '>' up to the closing </a><br>.
    if ($content =~ m/This page is currently categorized as(.*?)\>(.*?)\<\/a\>\<br\>/i) {
        my $cates = $2;
        if ($cates =~ m/,/) {
            # Several categories: strip the <a ...> tags between the names.
            my @names = split(/\<\/a\>/, $cates);
            foreach my $name (@names) {
                if ($name =~ m/(.*?)\<a (.*)\"\>(.*)/i) {
                    $category .= $1 . $3;
                } else {
                    $category .= $name;
                }
            }
        } else {
            $category = $cates;
        }
    } else {
        print "$theurl no category found! \n";
    }
    print "$theurl \t $category \n";
    # Append one line per URL to the result file.
    open my $out, '>>', $categoryResult or die "Can't open $categoryResult: $!";
    printf $out "%-50s:%-50s:\n", $theurl, $category;
    close($out);
}
close($urls);
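
The category extraction in the script above can also be written as a small standalone function. The following is only a sketch, assuming each category is the text of one <a> element and that the list ends at the first <br>, as in the response quoted in this thread; the function name extract_categories and the second sample string are made up for illustration.

```perl
use strict;
use warnings;

# Return the category names that follow the "categorized as" marker.
# Assumption: every category is the text of one <a> element and the
# list ends at the first <br>.
sub extract_categories {
    my ($html) = @_;
    my ($segment) = $html =~ m{This page is currently categorized as(.*?)<br>}is
        or return;
    my @names = $segment =~ m{<a\b[^>]*>([^<]*)</a>}gi;
    return @names;
}

# Sample taken from the response quoted in this thread.
my $sample = q{This page is currently categorized as <a href="javascript:defwin('catdesc.jsp?locale=&catnum=40&catmap=6',660,550)">Search Engines/Portals</a><br>};
print join(', ', extract_categories($sample)), "\n";
```

Capturing the link texts directly avoids the split-and-reassemble step, and the function returns an empty list when the marker is missing.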
securityworld 2008-09-24
Submit it like this, which skips the form that the JavaScript generates:
my $response = $ua->post("http://sitereview.bluecoat.com/sitereview.jsp",
    [ "name" => "jscheck", "value" => "validated", "url" => "$theurl" ],
    'Referer' => 'http://sitereview.bluecoat.com/sitereview.jsp',
);
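
For comparison, the hidden form quoted elsewhere in this thread has inputs named url and jscheck (with the value validated). The sketch below builds, without sending, a POST that uses those field names. It is an assumption about what the server expects, not what the posters actually ran, and build_review_request is a made-up helper name.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Build (but do not send) the POST that the hidden form would submit.
# Field names are taken from the hidden form quoted in this thread;
# whether the server requires exactly these is an assumption.
sub build_review_request {
    my ($url_to_check) = @_;
    my $endpoint = 'http://sitereview.bluecoat.com/sitereview.jsp';
    return POST $endpoint,
        Referer => $endpoint,
        Content => [ url => $url_to_check, jscheck => 'validated' ];
}

my $req = build_review_request('www.yahoo.com');
print $req->as_string;
# To send it: my $response = LWP::UserAgent->new->request($req);
```

Printing the request before sending makes it easy to compare the body with what a browser submits.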
iambic 2008-09-22
That's long...
iambic 2008-09-22
Why does mine produce this result:
<html><head><script language="JavaScript" src="bcfunctions.js"></script><title>Web Page Review Process  </title><script>window.onload = updateEmail;</script></head><link href="templates/default.style.css" rel="stylesheet" type="text/css"><body bgcolor="#FFFFFF" topmargin=0 leftmargin=10 marginwidth=10 marginheight=0><form>	<br>	<img src='templates/default.logo.gif'></img>	<table width=640 style="color: #FFFFFF;" border="0" cellspacing="0" cellpadding="0">		<tr height=20 bgcolor=003366>			<td></td>		</tr>		<tr><td><td></tr>		<tr>			<td width="640" align="left"><h1>Web Page Review Process</h1></td>		</tr>	</table>				<br>			    <table width="640" border="0" cellspacing="0" cellpadding="0">    	<tr>        	<td valign="top" class="bodytext" style="padding-left:10px;">						<p>The page you want reviewed is <a href="http://www.yahoo.com/" target="_new" class="">http://www.yahoo.com/</a>   <small><a href="/sitereview.jsp" class="">(Check another site)</a></small><br>				This page is currently categorized as <a href="javascript:defwin('catdesc.jsp?locale=&catnum=40&catmap=6',660,550)">Search Engines/Portals</a><br>				Last Time Rated/Reviewed:  > 7 days <img onmouseover='document.getElementById("dtsDiv").style.display="block";' onmouseout='document.getElementById("dtsDiv").style.display="none";'src='images/info24.gif' width=10 height=10></img><div id='dtsDiv' style='background-color: white; position:absolute; display:none'><table style='border: 1px solid black;' width=600><thead><th  bgcolor=003366> </th></thead><tbody><tr><td class='bodytext'>The URL submitted for review was rated more than 7 days ago.  The default setting for Blue Coat SG clients to download rating changes is once a day.  
There is no need to show ratings older than this.<br><br>Since Blue Coat's desktop client K9 and certain OEM partners update differently, ratings may differ from those of a Blue Coat SG as well as those present on the Site Review Tool.</td></tr></tbody></table></div></p>				<table class='normal'><tr><td valign='top'><img src='images/warning24.gif'></img></td><td class='note'>This Web page matches a list of high-profile URLs which are rated correctly and will not be rated differently, thus it cannot be submitted via this page.</td></tr></table><BR><BR></CENTER></FORM>			<p>NOTE:  Blue Coat manages the web site ratings system used by many different software and hardware vendors. Blue Coat  does not control whether a web page is "Blocked" or "Allowed" — your Internet Use Policy controls this. For more information on how to change your Internet Use Policy, <a href="/sitereview.jsp?&host=<localserver>&port=<localport>&policy=view">click here</a>.<br><br>The Web Page Review Process does not include full malware detection. Please contact your local Blue Coat representative to learn more about layered security defenses against malware.			</td>          </tr>	</table>    <br>	<table width=640 cellspacing=0 cellpadding=0>		<tr bgcolor=003366 height=20>			<td></td>			<td></td>		</tr>		<tr>			<td height="30" align="left" style="font-family: Arial,Helvetica,sans-serif;  font-size: 9px; font-weight:normal; color: #666666;">Copyright ©2006 Blue Coat Systems. All rights reserved. </TD>			<td height="30" align="right" style="font-family: Arial,Helvetica,sans-serif;  font-size: 9px; font-weight:normal; color: #666666;">Next Generation Web Filtering  </td>		</tr>	</table>            		    </form></body></html>
wx红杉树 2008-09-22
I've found a lead. A little encouragement for now.
wx红杉树 2008-09-22
Experts, please keep helping me analyze this.
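wx红杉树 2008-09-22
I found the following information online; it may be related to my problem.
* Sometimes an address that a browser can reach normally cannot be reached with LWP. This is usually because the headers, referer, cookies, or user-agent that LWP sends differ from what the remote web server accepts. To locate the problem, compare the request the browser sends with the request LWP sends, then adjust and retry. This is often a repetitive process. I originally used Ethereal to monitor and capture the traffic, and now use the LiveHTTPHeaders extension for Firefox. LWP now also ships with its own debugging module, LWP::DebugFile, to help track down the problem.
* Also, the article mentions HTTP::Cookies::Netscape; LWP's cookie modules now support more browsers: Mozilla, Safari, and Omniweb.
* Forms are often used together with JavaScript. LWP has no engine for interpreting JavaScript, so you have to read the JavaScript in the page source and decide how to handle it.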

function Submit()
{
    .........
    self.document.location.href="verify.php";

    return false;
}

........

<form>
......

<input type=button value="Submit your page" onClick="javascript:Submit();return false;//">

In the example above, submitting the form triggers the JavaScript Submit function, which ends up loading verify.php. So you can skip all the JavaScript and submit directly to verify.php.

The part highlighted in red above is the most likely match for my problem.
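
One way to do the comparison described above is to capture the exact request LWP is about to send and set it next to what the browser sends. This is a sketch, assuming an LWP::UserAgent version that provides add_handler. The handler returns a stub response, so nothing goes over the network here; remove that return value to perform the real request.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Response;

my $ua = LWP::UserAgent->new;
$ua->agent('Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)');

my $captured;
# request_send runs just before the request would go on the wire.
# Returning a response here short-circuits the network call, so this
# example runs offline; return nothing instead to really send it.
$ua->add_handler(request_send => sub {
    my ($request) = @_;
    $captured = $request->as_string;
    return HTTP::Response->new(200, 'OK (not sent)');
});

$ua->post('http://sitereview.bluecoat.com/sitereview.jsp',
    [ url => 'www.yahoo.com' ],
    Referer => 'http://sitereview.bluecoat.com/sitereview.jsp',
);
print $captured;
```

The printed text shows the request line, the headers (including Referer and User-Agent), and the encoded form body.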
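wx红杉树 2008-09-22
I've tried this as well. It does return something, but not the result I actually need: what the browser shows and what the script gets are different.
Your code, like my earlier code, returns: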
<html>
<head>
<title>Web Page Review Process</title>
<script language="JavaScript">
function forward() {
//var currentURL = window.document.URL.toString();
//alert(currentURL);
document.hiddenForm.submit();
//var queryString = "";
//if(currentURL.indexOf("?") > 0) {
// queryString = currentURL.substring(currentURL.indexOf("?"));
//}
//alert("sitereview.jsp" + queryString);
//window.location.href="sitereview.jsp" + queryString;
}
</script>
</head>
<link href="http://sitereview.cwfservice.net/templates/default.style.css" rel="stylesheet" type="text/css"></link>
<body onload='forward()'>
<noscript>

<img src='templates/default.logo.gif'></img>
<div style='font-size: 16pt'><b>JavaScript Required</b></div>
<p class='bodytext'>
It seems JavaScript is either disabled or not supported by your browser.
<br><br>
You can enable JavaScript by changing your browser options.
</p>
</noscript>
<form name='hiddenForm' action='sitereview.jsp' method='post'>

<input type=hidden name='url' value='"www.yahoo.com"'></input>

<input type=hidden name='jscheck' value='validated'></input>
</form>

</body>

This is not correct. The correct response contains This page is currently categorized as <a href="javascript:defwin('catdesc.jsp?locale=&catnum=40&catmap=6',660,550)">Search Engines/Portals</a><br>
You can submit a URL at http://sitereview.bluecoat.com/sitereview.jsp and compare the result with what the script gets; you'll see the difference. Please keep helping, thanks.
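
The page quoted above is an intermediate page whose hidden form the browser resubmits through JavaScript. The sketch below, using only core Perl, detects that page and pulls out the hidden fields so a script could post them itself. The regex is written for the markup quoted in this post and is not a general HTML parser; hidden_fields and is_interstitial are made-up names.

```perl
use strict;
use warnings;

# Collect name/value pairs from <input type=hidden ...> tags.
# Assumption: attributes appear in the order and quoting style shown
# in the page quoted above.
sub hidden_fields {
    my ($html) = @_;
    my %fields;
    while ($html =~ m{<input\s+type=hidden\s+name='([^']*)'\s+value='([^']*)'}gi) {
        $fields{$1} = $2;
    }
    return %fields;
}

# True when the body is the JavaScript forwarding page, not the result page.
sub is_interstitial {
    my ($html) = @_;
    return $html =~ /name='hiddenForm'/ && $html !~ /currently categorized as/ ? 1 : 0;
}

my $page = <<'HTML';
<form name='hiddenForm' action='sitereview.jsp' method='post'>
<input type=hidden name='url' value='"www.yahoo.com"'></input>
<input type=hidden name='jscheck' value='validated'></input>
</form>
HTML

my %f = hidden_fields($page);
print "interstitial page\n" if is_interstitial($page);
print "$_=$f{$_}\n" for sort keys %f;
```

In the quoted page the url value carries literal double quotes, so it may be worth checking how the URL was passed in the request that produced it.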
iambic 2008-09-22
I'm not sure exactly what result you're after, but once the Referer header is added the request does return a response:

use strict;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new();
$ua->agent('Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)');
my $response = $ua->post("http://sitereview.bluecoat.com/sitereview.jsp",
    [ "url" => "www.yahoo.com", ],
    'Referer' => 'http://sitereview.bluecoat.com/sitereview.jsp',
);

if ($response->is_success) {
    print $response->content;
} else {
    die $response->status_line;
}
