I wrote a small test that scrapes images from web pages. It works on some pages, such as the 剑网三 (JX3) home page, but throws an error on others.

南橘ryc 2019-01-25 12:29:44

package J124Internet;

import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
* @author Alexryc
*
*/
public class python {

    // Connect to the server and download the page source

    public static String htmlSource(String link, String encoding) {
        StringBuilder stb = new StringBuilder();
        // Create the URL object
        try {
            URL url = new URL(link);
            // Open the network connection
            URLConnection uc = url.openConnection();
            // Set a User-Agent to disguise the request and get past the firewall
            uc.setRequestProperty("User-Agent", "java");
            // Download the page source
            InputStream is = uc.getInputStream();
            InputStreamReader isr = new InputStreamReader(is, encoding);
            BufferedReader br = new BufferedReader(isr);
            String line = null;
            while ((line = br.readLine()) != null) {
                stb.append(line + "\n");
            }
            // Close the streams
            br.close();
            isr.close();
            is.close();

        } catch (MalformedURLException e) {
            // Print the stack trace
            e.printStackTrace();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

        return stb.toString();
    }

    public static void main(String[] args) {
        String html = htmlSource("https://www.nanrentu.cc/mntp/", "gbk");
        String regex = "src=\"(.+?)\"";
        Pattern ptn = Pattern.compile(regex);
        //System.out.println(html);
        Matcher mc = ptn.matcher(html);

        while (mc.find()) {

            StringBuffer s = new StringBuffer(mc.group());
            // Create s3 to hold the image file name; I don't know another way to do this
            StringBuffer s3 = new StringBuffer(mc.group());
            // Convert the StringBuffer to a String
            String s1 = s.toString();
            if (s1.contains("jpg")) {
                s.delete(0, 7);
                s.delete(s.length() - 1, s.length());
                s.insert(0, "http://", 0, 7);
                System.out.println(s);
                s3.delete(0, 74);
                s3.delete(s3.length() - 1, s3.length());
                InputStream inStream = null;
                // Download the image whose URL was just printed

                try {
                    String s2 = s.toString();
                    // Build the URL of the image
                    URL url1 = new URL(s2);
                    URLConnection con = url1.openConnection();
                    inStream = con.getInputStream();
                    ByteArrayOutputStream outStream = new ByteArrayOutputStream();
                    byte[] buf = new byte[2048];
                    int len = 0;
                    while ((len = inStream.read(buf)) != -1) {
                        outStream.write(buf, 0, len);
                    }
                    inStream.close();
                    outStream.close();

                    File file = new File("F:\\a" + "/" + s3); // Where the image is saved
                    FileOutputStream op = new FileOutputStream(file);
                    op.write(outStream.toByteArray());
                    op.close();

                } catch (MalformedURLException e) {
                    // TODO Auto-generated catch block
                    e.printStackTrace();
                } catch (IOException e) {
                    // TODO Auto-generated catch block
                    e.printStackTrace();
                }
            } else {
                continue;
            }
        }

    }
}



Error output:
http://tps://tu.nanrentu.cc/uploadImg/2019/0124/46c1ca434b9bc91343694f482f35097a_c_230_345.jpg
java.net.UnknownHostException: tps
at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:220)
at java.base/java.net.Socket.connect(Socket.java:591)
at java.base/java.net.Socket.connect(Socket.java:540)
at java.base/sun.net.NetworkClient.doConnect(NetworkClient.java:182)
at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:474)
at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:569)
at java.base/sun.net.www.http.HttpClient.<init>(HttpClient.java:242)
at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:341)
at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:362)
at java.base/sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1242)
at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1181)
at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1075)
at java.base/sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:1009)
at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1581)
at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1509)
at J124Internet.python.main(python.java:85)


3 replies
南橘ryc 2019-01-25
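Quoting reply #2 by ryc1995:
[quote=Quoting reply #1 by ryc1995:]I figured out where my problem is:
s.delete(0, 7); s.delete(s.length()-1, s.length()); s.insert(0 ,"http://", 0, 7); System.out.println(s); s3.delete(0, 74); s3.delete(s3.length()-1, s3.length());[/quote] I shouldn't have written it this way. These calls are tailored to one specific site. Instead, I just need to add a few regular expressions to the file.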
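A minimal sketch of that regex-based idea, not the poster's final code. It assumes the images of interest have `src` values ending in `.jpg`, and it lets `java.net.URL` resolve each value against the page address instead of cutting characters at fixed offsets. The class name `ImageLinks`, its two helper methods, and the `img.example.com` host are hypothetical and exist only for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original post.
class ImageLinks {
    // Group 1 captures only the attribute value, so no fixed-offset trimming is needed.
    // Assumption: only values that end in ".jpg" are wanted.
    private static final Pattern IMG_SRC =
            Pattern.compile("src=\"([^\"]+?\\.jpg)\"", Pattern.CASE_INSENSITIVE);

    // Returns every matching src value resolved against the page URL.
    // Absolute, protocol-relative ("//host/x.jpg") and root-relative ("/x.jpg")
    // values are all handled by the two-argument URL constructor.
    static List<URL> extract(String pageUrl, String html) {
        List<URL> result = new ArrayList<>();
        try {
            URL base = new URL(pageUrl);
            Matcher m = IMG_SRC.matcher(html);
            while (m.find()) {
                result.add(new URL(base, m.group(1)));
            }
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Cannot build URL: " + e.getMessage(), e);
        }
        return result;
    }

    // Derives the file name from the last path segment instead of a fixed prefix length.
    static String fileName(URL url) {
        String path = url.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        String html = "<img src=\"//img.example.com/a.jpg\">"
                + "<img src=\"https://tu.nanrentu.cc/uploadImg/2019/0124/x.jpg\">";
        for (URL u : extract("https://www.nanrentu.cc/mntp/", html)) {
            System.out.println(u + " -> " + fileName(u));
        }
    }
}
```

With this approach the same code handles pages that write `src="//host/..."` and pages that write `src="https://host/..."`, and the saved file name no longer depends on how long the URL prefix happens to be.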
南橘ryc 2019-01-25
Quoting reply #1 by ryc1995:
I figured out where my problem is:
s.delete(0, 7); s.delete(s.length()-1, s.length()); s.insert(0 ,"http://", 0, 7); System.out.println(s); s3.delete(0, 74); s3.delete(s3.length()-1, s3.length());
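To see why these calls break, note that `delete(0, 7)` removes exactly seven characters. That strips `src="//` cleanly when the page uses protocol-relative links, but when the value starts with `https://` it removes `src="ht` and leaves `tps://...`, which is what the error output shows. The sketch below replays the same three calls on both kinds of input; the class name `OffsetDemo` and the `img.example.com` host are made up for the demonstration.

```java
// Hypothetical demo class that replays the StringBuffer calls from the post.
class OffsetDemo {
    // Applies the same fixed-offset edits the original code applies to each match.
    static String rebuild(String match) {
        StringBuffer s = new StringBuffer(match);
        s.delete(0, 7);                        // drops the first 7 characters
        s.delete(s.length() - 1, s.length());  // drops the closing quote
        s.insert(0, "http://", 0, 7);          // prepends "http://"
        return s.toString();
    }

    public static void main(String[] args) {
        // Protocol-relative value: the 7 removed characters are src="//
        System.out.println(rebuild("src=\"//img.example.com/a.jpg\""));
        // prints http://img.example.com/a.jpg

        // Absolute value: the 7 removed characters are src="ht
        System.out.println(rebuild("src=\"https://tu.nanrentu.cc/a.jpg\""));
        // prints http://tps://tu.nanrentu.cc/a.jpg
    }
}
```

The `s3.delete(0, 74)` call has the same weakness: it only yields the file name when the URL prefix is exactly the length it had on the page the code was first written for.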
南橘ryc 2019-01-25
I figured out where my problem is.