How do I get the HTML source of a web site from inside my Servlet?

wwweeerrr 2004-04-29 03:33:00
From inside my Servlet, I want to fetch the HTML source of a web site, for example www.sina.com. I only need the HTML source, which I then want to write to a file. How can I get it?
lEFTmOON 2004-04-30
import java.io.*;
import java.net.*;

public class GetWebPage {
    public static void main(String args[]) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java GetWebPage hostname");
            return;
        }
        String host = args[0];
        InetAddress addr = InetAddress.getByName(host);
        // Open a plain TCP connection to the web server on port 80.
        Socket socket = new Socket(addr, 80);
        InputStream is = socket.getInputStream();
        OutputStream os = socket.getOutputStream();
        BufferedReader br = new BufferedReader(new InputStreamReader(is));
        PrintWriter pw = new PrintWriter(new OutputStreamWriter(os));
        // Request the root page; HTTP lines end with CRLF, and a blank line ends the request.
        pw.print("GET / HTTP/1.0\r\n\r\n");
        pw.flush();
        String line;
        // Print the response (headers and body) until the server closes the connection.
        while ((line = br.readLine()) != null) {
            System.out.println(line);
        }
        pw.close();
        br.close();
        socket.close();
    }
}

After compiling, running
java GetWebPage java.sun.com works without problems, but for sina the response is always:
HTTP/1.0 403 Forbidden
Server: squid/2.5.STABLE4
Mime-Version: 1.0
Date: Fri, 30 Apr 2004 01:58:23 GMT
Content-Type: text/html
Content-Length: 1080
Expires: Fri, 30 Apr 2004 01:58:23 GMT
X-Squid-Error: ERR_ACCESS_DENIED 0
X-Cache: MISS from xa-179.sina.com.cn
Connection: close

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML><HEAD><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<TITLE>ERROR: The requested URL could not be retrieved</TITLE>
<STYLE type="text/css"><!--BODY{background-color:#ffffff;font-family:verdana,sans-serif}PRE{font-family:sans-serif}--></STYLE>
</HEAD><BODY>
<H1>ERROR</H1>
<H2>The requested URL could not be retrieved</H2>
<HR noshade size="1px">
<P>
While trying to retrieve the URL:
<A HREF="http://218.30.12.179/index.html">http://218.30.12.179/index.html</A>
<P>
The following error was encountered:
<UL>
<LI>
<STRONG>
Access Denied.
</STRONG>
<P>
Access control configuration prevents your request from
being allowed at this time. Please contact your service provider if
you feel this is incorrect.
</UL>
<P>Your cache administrator is <A HREF="mailto:webmaster">webmaster</A>.


<BR clear="all">
<HR noshade size="1px">
<ADDRESS>
Generated Fri, 30 Apr 2004 01:58:23 GMT by xa-179.sina.com.cn (squid/2.5.STABLE4)
</ADDRESS>
</BODY></HTML>
There seems to be some kind of restriction.
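
One possible explanation, not confirmed anywhere in this thread: the error page quotes the requested URL as http://218.30.12.179/index.html, an IP address rather than a host name, which suggests the squid front end never learned which site was being asked for. The GET request in GetWebPage carries no Host header, so a proxy serving several sites may refuse it. The sketch below is an assumption-laden variant that adds the header; the class name HostHeaderRequest and the helper buildRequest are made up for illustration, and the server may still reject the request for other reasons.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class HostHeaderRequest {
    // Hypothetical helper: builds an HTTP/1.0 request that names the site
    // in a Host header. Each line ends with CRLF; a blank line ends the request.
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.0\r\n"
             + "Host: " + host + "\r\n"
             + "Connection: close\r\n"
             + "\r\n";
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java HostHeaderRequest hostname");
            return;
        }
        String host = args[0];
        // try-with-resources closes the socket (and its streams) on exit.
        try (Socket socket = new Socket(host, 80)) {
            OutputStream os = socket.getOutputStream();
            os.write(buildRequest(host, "/").getBytes(StandardCharsets.US_ASCII));
            os.flush();
            // ISO-8859-1 is used only so that every byte decodes to some char;
            // the real page encoding is whatever the server declares.
            BufferedReader br = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.ISO_8859_1));
            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

Run it the same way as before, for example java HostHeaderRequest www.sina.com, and compare the status line with the 403 shown above.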
su960581 2004-04-30
It's not hard! Just write a program to fetch the page.
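
A minimal sketch of such a program, assuming the standard java.net.URL and HttpURLConnection classes are acceptable instead of a raw socket. The class name PageSaver and its methods save and fetchToFile are hypothetical names chosen for this example. Because it is plain Java, fetchToFile could in principle be called from a Servlet's doGet method as well, with the target path chosen by the Servlet.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

class PageSaver {
    // Copies everything from the stream into the target file,
    // replacing the file if it already exists. Returns the byte count.
    static long save(InputStream in, Path target) throws IOException {
        return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }

    // Downloads the page at pageUrl and stores the raw bytes in target.
    // getInputStream() throws IOException for error statuses such as 403.
    static void fetchToFile(String pageUrl, Path target) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
        conn.setConnectTimeout(10_000); // milliseconds; arbitrary choice
        conn.setReadTimeout(10_000);
        try (InputStream in = conn.getInputStream()) {
            save(in, target);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java PageSaver url output-file");
            return;
        }
        fetchToFile(args[0], Paths.get(args[1]));
    }
}
```

For example, java PageSaver http://www.sina.com/ sina.html would write the response body to sina.html. The bytes are saved unchanged, so no character-set conversion is involved.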
wwweeerrr 2004-04-30
Why hasn't anyone replied?