Garbled response when fetching a web page with HttpClient (the page is gzip-compressed)

sun0322 2019-12-07 09:01:45

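I already decompress the gzip response, but the returned result is still garbled.

For the code and the details of the error, please see the link below.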
https://blog.csdn.net/sxzlc/article/details/103439160

sun0322 2020-05-01

I'll try to close this thread by the end of the month.

sun0322 2020-01-05

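Quoting pws22's reply (#15):

All of these answers are just blind guesses. The problem is obvious: you have been hit by anti-crawler protection and you are missing a cookie. What your request got back is a piece of JS that, when run, generates the cookie. See args1? That is the key. The rest is not an encoding either; it is just JS obfuscation. Good luck.

■ Suspected cause
I ran the program a few more times in the evenings recently. The server never returned gzip, so every run finished normally. The next time the problem appears, I will try the approach in the link below and see whether it solves it.
"Anti-crawler sites require some basic HTTP headers so that the request looks like a browser" https://www.jianshu.com/p/401a25134b89
----
Thank you very much for the idea. I will close the thread once the problem is solved.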
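
For reference, a minimal sketch of the "basic browser-like headers" idea from the link above. It uses java.net.HttpURLConnection from the JDK rather than the Apache HttpClient of the posted program, and the class name BrowserLikeRequest and all header values are illustrative assumptions, not values known to satisfy this site. If the site also requires a cookie generated by JavaScript, headers alone may not be enough.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;

final class BrowserLikeRequest {

    // Illustrative values (assumptions); copy real ones from your own browser's developer tools.
    static Map<String, String> browserHeaders() {
        final Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                + "(KHTML, like Gecko) Chrome/79.0 Safari/537.36");
        headers.put("Accept", "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8");
        headers.put("Accept-Language", "zh-CN,zh;q=0.9,en;q=0.8");
        return headers;
    }

    // Prepares a GET request carrying the headers above.
    // Nothing is sent until the caller connects or reads the response.
    static HttpURLConnection prepare(final String url) throws IOException {
        final HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        for (final Map.Entry<String, String> e : browserHeaders().entrySet()) {
            conn.setRequestProperty(e.getKey(), e.getValue());
        }
        return conn;
    }

    public static void main(final String[] args) throws IOException {
        final HttpURLConnection conn = prepare("https://blog.csdn.net/sxzlc?orderby=ViewCount");
        // Shows the header that will be sent; no network traffic happens here.
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```

With Apache HttpClient the same headers could be attached with httpGet.addHeader(name, value), as the posted program already does for Accept-Charset.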

pws22 2019-12-27

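All of these answers are just blind guesses. The problem is obvious: you have been hit by anti-crawler protection and you are missing a cookie. What your request got back is a piece of JS that, when run, generates the cookie. See args1? That is the key. The rest is not an encoding either; it is just JS obfuscation. Good luck.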

sun0322 2019-12-13

Could the experts here take a look and suggest some ideas?

sun0322 2019-12-11

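The code works fine when the response is not gzip,
but when it is gzip I cannot get rid of the garbled text.
(I extracted part of the garbled text and could decode it, so I am sure it is URL-encoded.)
However, URL-decoding the whole body fails.
Could you help me see how to change the code, or suggest some ideas?
The code is below (the commented-out parts are my attempts at URL decoding; after decoding, the return value is null).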

package com.sxz.timecontroal;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.zip.GZIPInputStream;

import org.apache.http.Header;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public class CheckTimeWithNet {

    static final String LOGINURL     = "https://blog.csdn.net/sxzlc?orderby=ViewCount";
    //static final String LOGINURL     = "https://blog.csdn.net/sxzlc/article/list/2?orderby=ViewCount";

    public static void main(final String[] args) {

        final DefaultHttpClient httpclient = new DefaultHttpClient();
        final HttpGet httpGet = new HttpGet(LOGINURL);
        HttpResponse response = null;

        try {
            //httpGet.addHeader("Accept-Encoding", "gzip, deflate");
            httpGet.addHeader("Accept-Charset", "utf-8");
            response = httpclient.execute(httpGet);
        } catch (final ClientProtocolException cpException) {
            // Report the failure instead of silently ignoring it
            cpException.printStackTrace();
        } catch (final IOException ioException) {
            ioException.printStackTrace();
        }

        // No response means the request failed; stop here to avoid a NullPointerException
        if (response == null) {
            return;
        }

        // verify response is HTTP OK
        final int statusCode = response.getStatusLine().getStatusCode();
        if (statusCode != HttpStatus.SC_OK) {
            System.out.println("Unexpected HTTP status: " + statusCode);
            return;
        }

        System.out.println("---------------------Status code Info Start---------------------");
        System.out.println(response.getStatusLine());
        System.out.println("---------------------Status code Info end  ---------------------");
        System.out.println("---------------------Head Info Start---------------------");
        final Header[] hs = response.getAllHeaders();
        for (final Header h : hs) {
            System.out.println(h.getName() + ":" + h.getValue());
        }
        System.out.println("---------------------Head Info End  ---------------------");

        String getResult = null;
        try {
            // response.setEntity(new GzipDecompressingEntity(response.getEntity()));
            // getResult = EntityUtils.toString(response.getEntity(),"UTF-8");
            getResult = getStringFromResponseUzip(response);
        } catch (final Exception exception) {
            // Any exception thrown above (including one from URL decoding) leaves
            // getResult null, so print it rather than swallowing it
            exception.printStackTrace();
        }
        System.out.println(getResult);
    }

    // Returns the response body as text, gunzipping it when Content-Encoding says gzip
    public static String getStringFromResponseUzip(final HttpResponse response) throws Exception {
        if (response == null) {
            return null;
        }
        String responseText = "";
        //InputStream in = response.getEntity().getContent();
        final InputStream in = response.getEntity().getContent();
        final Header[] headers = response.getHeaders("Content-Encoding");
        for (final Header h : headers) {
            System.out.println(h.getValue());
            if (h.getValue().indexOf("gzip") > -1) {
                System.out.println("---------------------is gzip---------------------");
                //For GZip response
                try {
                    final GZIPInputStream gzin = new GZIPInputStream(in);
                    final InputStreamReader isr = new InputStreamReader(gzin, "utf-8");
                    // final InputStreamReader isr = new InputStreamReader(gzin,"ISO-8859-1");
                    responseText = getStringFromStream(isr);
                    // Turn \xNN escapes into %NN so that URLDecoder can be tried on them
                    responseText = responseText.replace("\\x", "%");
                    //responseText = URLDecoder.decode(responseText, "UTF-8");
                    // responseText = URLDecoder.decode(responseText, "ISO-8859-1");
                    // responseText = URLEncoder.encode(responseText, "utf-8");
                } catch (final IOException exception) {
                    exception.printStackTrace();
                }
                return responseText;
            }
        }
        System.out.println("---------------------is not gzip---------------------");
        responseText = EntityUtils.toString(response.getEntity(), "utf-8");
        return responseText;
    }

    // Reads the whole reader line by line, normalizing line endings to \r\n
    public static String getStringFromStream(final InputStreamReader isr) throws Exception {
        final BufferedReader br = new BufferedReader(isr);
        final StringBuilder sb = new StringBuilder();
        String tmp;
        while ((tmp = br.readLine()) != null) {
            sb.append(tmp);
            sb.append("\r\n");
        }
        br.close();
        isr.close();
        return sb.toString();
    }
}
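
Since the server sometimes compresses the response and sometimes does not, one way to make the decompression step independent of the Content-Encoding header is to check the first two bytes of the body for the gzip magic number (0x1f 0x8b). The sketch below uses only JDK classes. The names GzipSniffer, looksGzipped, decodeBody and the gzip helper are illustrative and not part of the posted program, and the helper exists only to produce test data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipSniffer {

    // Reads a stream to the end and returns its bytes.
    static byte[] readAll(final InputStream in) throws IOException {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        final byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    // A gzip stream always starts with the two magic bytes 0x1f 0x8b.
    static boolean looksGzipped(final byte[] data) {
        return data.length >= 2 && (data[0] & 0xff) == 0x1f && (data[1] & 0xff) == 0x8b;
    }

    // Decodes a raw body to text, gunzipping it first only when it is actually gzip data.
    static String decodeBody(final byte[] body, final Charset charset) throws IOException {
        if (!looksGzipped(body)) {
            return new String(body, charset);
        }
        try (InputStream gz = new GZIPInputStream(new ByteArrayInputStream(body))) {
            return new String(readAll(gz), charset);
        }
    }

    // Demo helper only: compresses text so the example has gzip input to work on.
    static byte[] gzip(final String text, final Charset charset) throws IOException {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(charset));
        }
        return out.toByteArray();
    }

    public static void main(final String[] args) throws IOException {
        final String page = "<html>\u4e2d\u6587</html>";
        final byte[] compressed = gzip(page, StandardCharsets.UTF_8);
        final byte[] plain = page.getBytes(StandardCharsets.UTF_8);
        // Both inputs should decode back to the same page text.
        System.out.println(page.equals(decodeBody(compressed, StandardCharsets.UTF_8)));
        System.out.println(page.equals(decodeBody(plain, StandardCharsets.UTF_8)));
    }
}
```

In the posted program the raw bytes could be obtained with EntityUtils.toByteArray(response.getEntity()) before passing them to a method like decodeBody.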

sun0322 2019-12-09

chunked transfer encoding...

IT_熊 2019-12-08

I'm here to learn.

sun0322 2019-12-08

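Quoting tianfang's reply (#9):

I'm in Beijing, and the server I get is openresty.

I noticed that your server is tengine.

Thank you very much for the answer!
Sometimes the response is gzip and sometimes it is not, which is very likely caused by load balancing on the server side.
I learned something!
That explains this behavior.
Only the code problem remains now. I will look into it again when I have time next weekend (and ask my colleagues at work too).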

tianfang 2019-12-08

I'm in Beijing, and the server I get is openresty. I noticed that your server is tengine.

qybao 2019-12-08

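Why did the code suddenly start working? Because the earlier code was not wrong. It is just that if the fetched page has been urlencoded and you do not URL-decode the result, any Chinese content in it shows up as escapes beginning with \x. Also, even if you add URLDecoder handling, the return value cannot be null unless an exception occurs; there is always a return value. So if you see null, an exception must have been thrown somewhere, and that is why the return value is null.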
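
If the body really does contain \xNN escapes, one alternative to replacing \x with % and running URLDecoder over the whole page is to decode only those escapes. URLDecoder.decode throws IllegalArgumentException when a % is not followed by two hex digits, and it turns every + into a space, so applying it to a full page can fail or change unrelated text. In the posted program such an exception would propagate to the catch block in main and leave getResult null, which would be consistent with the reported result, although the thread does not confirm it. A minimal sketch, assuming the escaped bytes are UTF-8; the class name HexEscapeDecoder is illustrative.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HexEscapeDecoder {

    // One or more consecutive \xNN escapes, for example \xe4\xb8\xad
    private static final Pattern RUN = Pattern.compile("(?:\\\\x[0-9A-Fa-f]{2})+");

    // Replaces each run of \xNN escapes with the UTF-8 text those bytes encode.
    // Everything else, including '%' and '+', is left untouched.
    static String decode(final String text) {
        final Matcher m = RUN.matcher(text);
        final StringBuffer out = new StringBuffer();
        while (m.find()) {
            final String run = m.group();
            final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            // Each escape is four characters: '\', 'x' and two hex digits.
            for (int i = 0; i < run.length(); i += 4) {
                bytes.write(Integer.parseInt(run.substring(i + 2, i + 4), 16));
            }
            final String decoded = new String(bytes.toByteArray(), StandardCharsets.UTF_8);
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(final String[] args) {
        // The input holds literal backslash-x escapes next to a bare '%' and a '+'.
        System.out.println(decode("100% \\xe4\\xb8\\xad\\xe6\\x96\\x87 a+b"));
    }
}
```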

sun0322 2019-12-08

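I would also like to ask what causes the behavior below:
is it the server's configuration, or something on my own machine?
-------
In the morning, when the code ran OK, the response was not gzip-compressed,
and running curl directly on the command line showed the result.

In the afternoon it stopped working, because the response was now gzip-compressed.

----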

sun0322 2019-12-08

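Quoting tianfang's reply (#5):

For negotiation at the transport level, the main HTTP headers to consider are the following.
Sent to the server:
Accept-Encoding (gzip, deflate), Accept-Charset (e.g. utf-8)
Returned by the server:
Transfer-Encoding (e.g. chunked, compress, deflate, gzip and identity),
Content-Encoding (e.g. gzip)
Content-Language, whose default may be ISO-8859-1 or may be utf-8

Character set at the content level: this charset is in Content-Type, which gives the format and encoding of the content. It is an encoding layered on top of Content-Language; when Content-Language (especially ISO-8859-1) and the Content-Type charset differ, urlencoding occurs.

Improvements:

1 Try sending the request with Accept-Charset=utf-8, so that Content-Language prefers utf-8 and the content is not urlencoded

2 In your code, the InputStreamReader encoding is not necessarily UTF-8; it should be taken from Content-Language in the HTTP headers

3 Check whether the content contains \x, and urldecode it if so

Thank you very much for the answer.
Following your suggestions I made the changes below, but they still had no effect.
■ Request code

■ Handling of the response
I tried every combination of the commented-out code below,
but as soon as the text goes through decoding, the return value is null.

■ Program output
Accept-Charset was set when sending the request

--
■ About Content-Language
I could not find it in the headers, so I used the encoding utf-8 from [Content-Type:text/html; charset=utf-8] directly.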
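
In HTTP the character set of the body is carried by the charset parameter of Content-Type, while Content-Language only names the natural language of the content, so taking the encoding from Content-Type as described here is the usual approach. A minimal sketch of extracting it with JDK classes only; the class name ContentTypeCharset and the choice of fallback are assumptions for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContentTypeCharset {

    // Matches charset=utf-8 as well as charset="utf-8", case-insensitively.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)\"?", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in a Content-Type header value,
    // or the fallback when none is given or the name is not usable.
    static Charset fromContentType(final String contentType, final Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        final Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (final IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    public static void main(final String[] args) {
        final Charset cs = fromContentType("text/html; charset=utf-8", StandardCharsets.ISO_8859_1);
        System.out.println(cs.name());
    }
}
```

The resulting Charset can be passed straight to new InputStreamReader(stream, charset) instead of a hard-coded "utf-8".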

sun0322 2019-12-08

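Quoting qybao's reply (#4):

Why did the code suddenly start working? Because the earlier code was not wrong. It is just that if the fetched page has been urlencoded and you do not URL-decode the result, any Chinese content in it shows up as escapes beginning with \x.
Also, even if you add URLDecoder handling, the return value cannot be null unless an exception occurs; there is always a return value. So if you see null, an exception must have been thrown somewhere, and that is why the return value is null.

■ Suspected cause
I still think the site is doing some special processing.
It worked in the morning because the site's response was not gzip-compressed,
whereas in the afternoon the same URL returned a gzip-compressed response, which my code could not parse correctly.
For the details, see the section 【推测出现问题的原因】 (Suspected cause) of the linked blog post.
■ Two new questions
Question 1
Why does this happen: the same URL is not gzip in the morning but is gzip in the afternoon?

Question 2
How should a gzip response be handled to solve the problem?
I tried to fix it, but without success...
My attempt is the code below, at lines 87 and 88.

---
Those are the latest open questions.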

tianfang 2019-12-08

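For negotiation at the transport level, the main HTTP headers to consider are the following.
Sent to the server:
Accept-Encoding (gzip, deflate), Accept-Charset (e.g. utf-8)
Returned by the server:
Transfer-Encoding (e.g. chunked, compress, deflate, gzip and identity),
Content-Encoding (e.g. gzip)
Content-Language, whose default may be ISO-8859-1 or may be utf-8

Character set at the content level: this charset is in Content-Type, which gives the format and encoding of the content. It is an encoding layered on top of Content-Language; when Content-Language (especially ISO-8859-1) and the Content-Type charset differ, urlencoding occurs.

Improvements:

1 Try sending the request with Accept-Charset=utf-8, so that Content-Language prefers utf-8 and the content is not urlencoded

2 In your code, the InputStreamReader encoding is not necessarily UTF-8; it should be taken from Content-Language in the HTTP headers

3 Check whether the content contains \x, and urldecode it if so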

sun0322 2019-12-08