Java String garbled text (mojibake) problem

ylczj 2011-05-09 11:58:18

str = new String(str.getBytes("GBK"), "utf-8");
str = new String(str.getBytes("utf-8"), "GBK");
Why is str different after these two lines?

ylczj 2011-05-10

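How exactly does getBytes() do the conversion? If I specify GBK or some other encoding, doesn't the returned byte data change accordingly? (Different encodings give different bytes, but once the bytes are decoded with the matching encoding the characters are the same, isn't that right?)
Is the internal implementation of getBytes and new String(byte array, encoding) available anywhere? Thanks.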
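
A minimal sketch of the behaviour asked about here, not the JDK's internal implementation. It assumes the GBK charset is available in the running JDK; the class name GetBytesDemo and the helper roundTrip are made up for illustration. getBytes does return different bytes for different charsets, and the original characters come back only when the bytes are decoded with the same charset that produced them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class GetBytesDemo {
    // Encode with the given charset, then decode the bytes with the SAME charset.
    static String roundTrip(String s, Charset cs) {
        byte[] bytes = s.getBytes(cs);
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        String s = "\u732A"; // the character 猪, written as an escape to avoid source-encoding issues
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is installed in this JDK

        byte[] utf8Bytes = s.getBytes(StandardCharsets.UTF_8);
        byte[] gbkBytes = s.getBytes(gbk);

        // Same character, different byte sequences.
        System.out.println("UTF-8 length: " + utf8Bytes.length);
        System.out.println("GBK length:   " + gbkBytes.length);
        System.out.println("same bytes?   " + Arrays.equals(utf8Bytes, gbkBytes));

        // Decoding with the matching charset restores the original string.
        System.out.println(roundTrip(s, StandardCharsets.UTF_8).equals(s));
        System.out.println(roundTrip(s, gbk).equals(s));
    }
}
```

The bytes only mean something together with the charset that produced them, which is why mixing GBK and UTF-8 between getBytes and new String does not round-trip.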

beowulf2005 2011-05-10

public class EncodeTest {

    /**
     * Prints the bytes of a string in several encodings before and after
     * each of the two conversions from the question.
     */
    public static void main(String[] args) throws Exception {
        String codeA = "UTF-8";
        String codeB = "GBK";
        String codeC = "UTF-16";

        // GBK is a variable-length encoding: 1 byte for ASCII, 2 bytes for a Chinese character.
        // UTF-8 is also variable-length: 1 to 4 bytes, 3 bytes for most Chinese characters.
        // Take 猪 as an example: GBK needs only 2 bytes, D6 ED.
        // In the first conversion the string (UTF-16 inside Java) is encoded
        // with GBK into the bytes D6 ED.
        // Those bytes are then decoded as UTF-8. D6 ED is not a valid UTF-8 sequence
        // (look up the valid UTF-8 byte ranges yourself),
        // so the two bytes are decoded as two illegal (replacement) characters.
        // In the second conversion these two characters are encoded with UTF-8
        // into 6 bytes, 3 per character. The byte length has changed.
        // Finally those 6 bytes are decoded as GBK, which gives 3 garbled characters.
        // A plain single-byte ASCII string does not have this problem, e.g. str = "X" is fine.
        // Note: getBytes("UTF-16") also emits a 2-byte byte order mark in front.
        String str = "猪";
        System.out.println(str);
        System.out.println(codeA + " : " + toHex(str.getBytes(codeA)));
        System.out.println(codeA + " : " + toBinary(str.getBytes(codeA)));
        System.out.println(codeB + " : " + toHex(str.getBytes(codeB)));
        System.out.println(codeB + " : " + toBinary(str.getBytes(codeB)));
        System.out.println(codeC + " : " + toHex(str.getBytes(codeC)));
        System.out.println(codeC + " : " + toBinary(str.getBytes(codeC)));
        System.out.println("------------------------------------------");

        // First conversion: encode with GBK, decode as UTF-8.
        str = new String(str.getBytes(codeB), codeA);
        System.out.println(str);
        System.out.println(codeA + " : " + toHex(str.getBytes(codeA)));
        System.out.println(codeA + " : " + toBinary(str.getBytes(codeA)));
        System.out.println(codeB + " : " + toHex(str.getBytes(codeB)));
        System.out.println(codeB + " : " + toBinary(str.getBytes(codeB)));
        System.out.println(codeC + " : " + toHex(str.getBytes(codeC)));
        System.out.println(codeC + " : " + toBinary(str.getBytes(codeC)));
        System.out.println("------------------------------------------");

        // Second conversion: encode with UTF-8, decode as GBK.
        str = new String(str.getBytes(codeA), codeB);
        System.out.println(str);
        System.out.println(codeA + " : " + toHex(str.getBytes(codeA)));
        System.out.println(codeA + " : " + toBinary(str.getBytes(codeA)));
        System.out.println(codeB + " : " + toHex(str.getBytes(codeB)));
        System.out.println(codeB + " : " + toBinary(str.getBytes(codeB)));
        System.out.println(codeC + " : " + toHex(str.getBytes(codeC)));
        System.out.println(codeC + " : " + toBinary(str.getBytes(codeC)));
        System.out.println("------------------------------------------");
    }

    // Formats each byte as an unsigned binary number, separated by spaces.
    private static String toBinary(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(Integer.toString(b & 0xff, 2)).append(" ");
        }
        return sb.toString();
    }

    // Formats each byte as an unsigned upper-case hex number, separated by spaces.
    private static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(Integer.toString(b & 0xff, 16).toUpperCase()).append(" ");
        }
        return sb.toString();
    }
}

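This time I posted the complete code. I think I have explained it clearly enough; if it still doesn't make sense, there is nothing more I can do. Better change careers early.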
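
One way to see that the GBK bytes really are invalid UTF-8, rather than having them silently replaced, is to decode strictly. The sketch below uses the standard java.nio.charset.CharsetDecoder with CodingErrorAction.REPORT; the class name StrictDecodeDemo and the helper isValid are hypothetical, and the GBK charset is assumed to be installed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecodeDemo {
    // Returns true if the bytes decode cleanly in the given charset.
    // new String(bytes, charset) replaces bad input instead of failing;
    // a decoder configured with REPORT throws on the first bad sequence.
    static boolean isValid(byte[] bytes, Charset cs) {
        try {
            cs.newDecoder()
              .onMalformedInput(CodingErrorAction.REPORT)
              .onUnmappableCharacter(CodingErrorAction.REPORT)
              .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String s = "\u732A"; // the character 猪
        byte[] gbkBytes = s.getBytes(Charset.forName("GBK")); // assumption: GBK is available
        byte[] utf8Bytes = s.getBytes(StandardCharsets.UTF_8);

        System.out.println("GBK bytes valid as UTF-8?   " + isValid(gbkBytes, StandardCharsets.UTF_8));
        System.out.println("UTF-8 bytes valid as UTF-8? " + isValid(utf8Bytes, StandardCharsets.UTF_8));
    }
}
```

Once the decoder has replaced the bad bytes, the original information is gone, so no later conversion can bring the character back.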

wang_huanming 2011-05-09

str = new String(str.getBytes("GBK"), "utf-8");
How is this defined? You didn't post the complete code. How can str be assigned from (str.getBytes("GBK"), "utf-8")?

vipwalkingdog 2011-05-09

The two objects created with new are not the same thing!!

eclipse_xu 2011-05-09

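GBK covers all Chinese characters, including the traditional characters used in Taiwan, and it is not the same as other encodings.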
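
Which characters GBK can actually represent can be checked with CharsetEncoder.canEncode. This is a small sketch, assuming the JDK's GBK charset is installed; the class name GbkCoverageDemo and the helper gbkCanEncode are made up for illustration. As far as I know, GBK covers the basic CJK ideographs (simplified and traditional) but not other scripts such as Korean hangul.

```java
import java.nio.charset.Charset;

public class GbkCoverageDemo {
    // Asks the JDK's GBK encoder whether it can represent the given character.
    static boolean gbkCanEncode(char c) {
        return Charset.forName("GBK").newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        System.out.println("simplified 猪:  " + gbkCanEncode('\u732A'));
        System.out.println("traditional 豬: " + gbkCanEncode('\u8C6C'));
        System.out.println("hangul 한:      " + gbkCanEncode('\uD55C'));
    }
}
```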

beowulf2005 2011-05-09

import java.util.Arrays;

public class EncodeTest {

    public static void main(String[] args) throws Exception {
        String codeA = "UTF-8";
        String codeB = "GBK";
        String codeC = "UTF-16";

        String str = "猪";
        System.out.println(str);
        System.out.println(codeA + Arrays.toString(str.getBytes(codeA)));
        System.out.println(codeB + Arrays.toString(str.getBytes(codeB)));
        System.out.println(codeC + Arrays.toString(str.getBytes(codeC)));
        System.out.println("------------------------------------------");

        // First conversion: encode with GBK, decode as UTF-8.
        str = new String(str.getBytes(codeB), codeA);
        System.out.println(str);
        System.out.println(codeA + Arrays.toString(str.getBytes(codeA)));
        System.out.println(codeB + Arrays.toString(str.getBytes(codeB)));
        System.out.println(codeC + Arrays.toString(str.getBytes(codeC)));
        System.out.println("------------------------------------------");

        // Second conversion: encode with UTF-8, decode as GBK.
        str = new String(str.getBytes(codeA), codeB);
        System.out.println(str);
        System.out.println(codeA + Arrays.toString(str.getBytes(codeA)));
        System.out.println(codeB + Arrays.toString(str.getBytes(codeB)));
        System.out.println(codeC + Arrays.toString(str.getBytes(codeC)));
        System.out.println("------------------------------------------");
    }
}

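GBK and UTF-8 are both variable-length encodings, but they use different byte lengths and byte patterns for the same character: a Chinese character takes 2 bytes in GBK and usually 3 bytes in UTF-8.
In your first conversion you encode the string (UTF-16 inside Java) into a GBK byte stream and then decode those bytes as UTF-8. The byte length no longer matches.
Take 猪 as an example.
UTF-8 needs 3 bytes for 猪 and GBK only 2 (getBytes("UTF-16") returns 4 bytes: a 2-byte byte order mark plus 2 bytes for the character), so encoding it with GBK yields 2 bytes.
Those 2 bytes are not a valid UTF-8 sequence, so decoding them as UTF-8 turns them into two replacement characters, each of which takes 3 bytes in UTF-8, 6 bytes in total. (first conversion)
Next, these two characters are encoded with UTF-8 into 6 bytes, which are then decoded as GBK at 2 bytes per character, giving 3 characters. (second conversion)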
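
The two conversions can be traced byte by byte. The following is only an illustrative sketch: the class name GarbleTrace and its helpers are made up, and it assumes the GBK charset is installed and that the JDK replaces each invalid byte with U+FFFD when decoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GarbleTrace {
    static final Charset GBK = Charset.forName("GBK"); // assumption: GBK is available

    // Formats bytes as two-digit upper-case hex values separated by spaces.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(String.format("%02X ", b & 0xff));
        }
        return sb.toString().trim();
    }

    // Repeats the two conversions from the question.
    static String garble(String s) {
        String first = new String(s.getBytes(GBK), StandardCharsets.UTF_8);
        return new String(first.getBytes(StandardCharsets.UTF_8), GBK);
    }

    public static void main(String[] args) {
        String s = "\u732A"; // the character 猪

        byte[] gbkBytes = s.getBytes(GBK);
        System.out.println("GBK bytes:         " + toHex(gbkBytes));   // D6 ED

        String first = new String(gbkBytes, StandardCharsets.UTF_8);
        System.out.println("chars after step 1: " + first.length());   // 2, both U+FFFD

        byte[] utf8Bytes = first.getBytes(StandardCharsets.UTF_8);
        System.out.println("UTF-8 bytes:       " + toHex(utf8Bytes));  // EF BF BD EF BF BD

        String second = new String(utf8Bytes, GBK);
        System.out.println("chars after step 2: " + second.length());  // 3
    }
}
```

The three characters left at the end are 锟斤拷. A pure ASCII input such as "X" passes through unchanged, because its single byte means the same thing in both encodings.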

thinkmore135 2011-05-09