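import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

public void doRead() throws ClientProtocolException, IOException
{
    HttpClient httpClient = new DefaultHttpClient();
    HttpGet httpGet = new HttpGet("http://wenku.baidu.com/view/d6b2a9d2b14e852458fb5763.html");
    HttpResponse httpResponse = httpClient.execute(httpGet);
    HttpEntity entity = httpResponse.getEntity();
    InputStream in = null;
    FileOutputStream out = null;
    if(httpResponse.getStatusLine().getStatusCode()==HttpStatus.SC_OK)
    {
        in = entity.getContent();
        byte[] bytes = new byte[1024];
        out = new FileOutputStream("E:/1/1.html");
        // This reads the body of the server's response, and this is where the problem is:
        // why does a page downloaded from the server contain duplicated content,
        // for example a line that appears twice?
        while(in.read(bytes)!=-1)
        {
            out.write(bytes,0,bytes.length);
            out.flush();
        }
        out.close();
        in.close();
        httpGet.abort();
        httpClient.getConnectionManager().shutdown();
    }
}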
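When crawling web pages with HttpClient, I found that the downloaded page contains a lot of duplicated HTML. Is the problem in how this loop reads the response?
while(in.read(bytes)!=-1)
{
    out.write(bytes,0,bytes.length);
    out.flush();
}
I later found that EntityUtils.toByteArray(entity) does not have this problem, but as a beginner I would really like to know what causes the problem above.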
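The likely cause is that in.read(bytes) returns the number of bytes it actually put into the buffer, and on a network stream that number is often smaller than 1024. Writing bytes.length every time also writes whatever is still sitting in the rest of the buffer from an earlier read, which shows up as repeated HTML. Below is a minimal sketch of a loop that writes only the bytes that were read. It uses only java.io so it runs without HttpClient; the class name StreamCopy and the method copy are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class StreamCopy {

    // Copies everything from in to out and returns the number of bytes copied.
    public static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[1024];
        long total = 0;
        int len;
        // read() returns how many bytes were placed in the buffer, or -1 at end of stream.
        while ((len = in.read(buffer)) != -1) {
            // Write only the bytes filled by this read, not the whole buffer.
            out.write(buffer, 0, len);
            total += len;
        }
        out.flush();
        return total;
    }

    public static void main(String[] args) throws IOException {
        // 1500 bytes is deliberately not a multiple of the 1024-byte buffer.
        byte[] data = new byte[1500];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(data), out);
        System.out.println(copied + " bytes copied, output size " + out.size());
    }
}
```

With the loop from the question, the same 1500-byte in-memory input would produce 2048 bytes of output: the second read fills only 476 bytes, but all 1024 are written, so part of the first chunk is written again. In the original method, the same change would be to keep the return value of in.read(bytes) and pass it as the third argument of out.write. EntityUtils.toByteArray(entity) presumably avoids the problem because it keeps track of that count internally.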
One more thing: how do I control the encoding?
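On encoding: copying raw bytes to a file does not change them, so the saved file keeps whatever encoding the server sent (normally named in the charset parameter of the Content-Type header or in the page's meta tag). Encoding only comes into play when bytes are turned into text or text back into bytes. The sketch below shows that step with an explicit charset, using only the JDK. It assumes the page is UTF-8, which is an assumption and not something the question states; the class name CharsetDemo and its methods are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {

    // Turns response bytes into text using the charset the server declared.
    public static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    // Turns text back into bytes in the charset the output file should use.
    public static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        // "\u767e\u5ea6" is two Chinese characters, written as escapes so the
        // source file itself does not depend on any particular encoding.
        byte[] body = "\u767e\u5ea6".getBytes(StandardCharsets.UTF_8);

        // Decoding with the matching charset recovers the original text.
        String right = decode(body, StandardCharsets.UTF_8);
        // Decoding with a different charset (assumed wrong here) yields garbled text.
        String wrong = decode(body, StandardCharsets.ISO_8859_1);

        System.out.println("matches original: " + right.equals("\u767e\u5ea6"));
        System.out.println("wrong charset matches: " + wrong.equals("\u767e\u5ea6"));
        System.out.println("UTF-16 size: " + encode(right, StandardCharsets.UTF_16).length);
    }
}
```

For a page served in another encoding, the charset would be obtained with Charset.forName and the name the server reports, provided the JDK in use ships that charset. In HttpClient 4.x, EntityUtils.toString(entity, "UTF-8") should do the decoding step directly, using the charset from the response when one is declared and falling back to the given default otherwise.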