Waiting online: how do Google and Baidu encode URLs? Working code closes the thread immediately

chary8088 2010-01-29 05:32:33

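When we search on Google or Baidu, Chinese characters turn into sequences like %E4%BB%A3. Everyone online says this is UTF-8, but the UTF-8 conversion code I found produces different results. How exactly is the text being encoded?

Please don't use Win32 API library functions; I want to learn how the conversion itself is implemented.

For example, searching for 代码可以立刻结贴 gives: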
http://www.google.cn/search?client=aff-cs-360se&forid=1&ie=utf-8&oe=UTF-8&q=%E4%BB%A3%E7%A0%81%E5%8F%AF%E4%BB%A5%E7%AB%8B%E5%88%BB%E7%BB%93%E8%B4%B4

Searching for 代码 gives:
http://www.google.cn/search?client=aff-cs-360se&forid=1&ie=utf-8&oe=UTF-8&q=%E4%BB%A3%E7%A0%81

cattycat 2010-01-30

Go ahead and close the thread and award the points.

cattycat 2010-01-30

Here is the code. The results match now, and you can see every byte clearly.

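#include <stdio.h>
#include <string.h>
#include <windows.h>

// Convert a string in the system ANSI code page (GBK on Chinese Windows) to UTF-16.
// The caller owns the returned buffer and must release it with delete[].
// Returns NULL on failure.
wchar_t* AnsiToUnicode(const char* buf)
{
    int len = ::MultiByteToWideChar(CP_ACP, 0, buf, -1, NULL, 0);
    if (len == 0) return NULL;

    wchar_t* wch = new wchar_t[len];
    memset(wch, 0, len * sizeof(wchar_t));  // len counts wchar_t units, not bytes
    ::MultiByteToWideChar(CP_ACP, 0, buf, -1, wch, len);

    return wch;
}

// Convert UTF-16 to UTF-8.
// The caller owns the returned buffer and must release it with delete[].
// Returns NULL on failure.
char* UnicodeToUtf8(const wchar_t* buf)
{
    int len = ::WideCharToMultiByte(CP_UTF8, 0, buf, -1, NULL, 0, NULL, NULL);
    if (len == 0) return NULL;

    char* utf8 = new char[len];
    memset(utf8, 0, len);
    ::WideCharToMultiByte(CP_UTF8, 0, buf, -1, utf8, len, NULL, NULL);

    return utf8;
}

int main()
{
    // Assumes this source file is saved in the system ANSI code page (e.g. GBK).
    char str[] = "周华健";
    wchar_t* uni = AnsiToUnicode(str);
    if (uni == NULL) return 1;
    char* utf8 = UnicodeToUtf8(uni);
    if (utf8 == NULL) { delete[] uni; return 1; }

    // Print each UTF-8 byte in hex.
    size_t n = strlen(utf8);
    for (size_t i = 0; i < n; i++)
        printf("%X ", (unsigned char)utf8[i]);
    printf("\n");

    delete[] uni;   // allocated with new[], so release with delete[]
    delete[] utf8;
    return 0;
}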

cattycat 2010-01-30

You have to convert ANSI to Unicode first, and then convert Unicode to UTF-8. Only then does it work.

shiweifu 2010-01-30

This one is definitely worth bookmarking.

healer_kx 2010-01-30

Haha, I hope everyone keeps supporting 周华健!

chary8088 2010-01-30

I tried it both with and without the %, and either way the output is identical to the input.

cattycat 2010-01-30

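Compare what Google produces with what your conversion function produces:
%E5%91%A8%E5%8D%8E%E5%81%A5
With the % signs removed:
E5 91 A8 E5 8D 8E E5 81 A5
Those 9 bytes are exactly what your function returns. So its result needs one more step: each byte has to be written as a % followed by two hex digits before it matches what appears in the URL.
In Java this is URLEncoder.encode("周华健", "UTF8");
For non-ASCII characters like these, that encode call writes a % in front of every byte.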
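A minimal C++ sketch of that last step, assuming the text has already been converted to the target encoding. PercentEncode is a hypothetical helper name, not something from the thread, and the set of characters left unescaped follows the "unreserved" set of RFC 3986, which is an assumption about what the caller wants.

```cpp
#include <string>

// Hypothetical helper (not from the thread): write every byte that is not an
// RFC 3986 "unreserved" character as '%' followed by two uppercase hex digits.
std::string PercentEncode(const std::string& bytes)
{
    static const char kHex[] = "0123456789ABCDEF";
    std::string out;
    for (std::string::size_type i = 0; i < bytes.size(); ++i)
    {
        unsigned char c = static_cast<unsigned char>(bytes[i]);
        bool unreserved = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                          (c >= '0' && c <= '9') ||
                          c == '-' || c == '_' || c == '.' || c == '~';
        if (unreserved)
        {
            out += static_cast<char>(c);
        }
        else
        {
            out += '%';
            out += kHex[c >> 4];    // high nibble
            out += kHex[c & 0x0F];  // low nibble
        }
    }
    return out;
}
```

Feeding it the nine UTF-8 bytes E5 91 A8 E5 8D 8E E5 81 A5 gives the Google form of the query; feeding it the six GB2312 bytes gives the Baidu form. Java's URLEncoder differs in a few ASCII cases (for example, it writes a space as +), so this is not a drop-in replacement for it.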

chary8088 2010-01-30

%E5%91%A8%E5%8D%8E%E5%81%A5 //周华健
This is Google's UTF-8 encoding. I am using the function below:

qp::StringA Global::UnicodeToAnsi(const wchar_t* buf)
{
    int len = ::WideCharToMultiByte(CP_ACP, 0, buf, -1, NULL, 0, NULL, NULL);
    if (len == 0) return "";

    std::vector<char> utf8(len);
    ::WideCharToMultiByte(CP_ACP, 0, buf, -1, &utf8[0], len, NULL, NULL);
    return &utf8[0];
}

Why doesn't the conversion work?

cattycat 2010-01-30

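As brother 甘草 already said, Baidu uses GB2312 and Google uses UTF-8.
Convert the Chinese text to that encoding, then put a % in front of each byte (written as two hex digits), and you get that kind of URL.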
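Going the other way makes the difference between the two sites easy to see. The sketch below is only an illustration: PercentDecode and HexValue are hypothetical helper names, and malformed escapes are simply copied through, which is one possible policy among several.

```cpp
#include <string>

// Hypothetical helper: value of one hex digit, or -1 if c is not a hex digit.
static int HexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Hypothetical helper: turn "%D6%DC..." back into raw bytes.
// The result is still in whatever charset the sender used (GB2312, UTF-8, ...).
std::string PercentDecode(const std::string& text)
{
    std::string out;
    for (std::string::size_type i = 0; i < text.size(); ++i)
    {
        if (text[i] == '%' && i + 2 < text.size())
        {
            int hi = HexValue(text[i + 1]);
            int lo = HexValue(text[i + 2]);
            if (hi >= 0 && lo >= 0)
            {
                out += static_cast<char>((hi << 4) | lo);
                i += 2;  // skip the two hex digits just consumed
                continue;
            }
        }
        out += text[i];  // ordinary character, or a malformed escape copied as-is
    }
    return out;
}
```

Decoding the Baidu parameter %D6%DC%BB%AA%BD%A1 yields 6 bytes, while decoding the Google parameter %E5%91%A8%E5%8D%8E%E5%81%A5 yields 9 bytes, for the same three characters.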

chary8088 2010-01-30

I tried these functions on %E5%91%A8%E5%8D%8E%E5%81%A5 and not one of them outputs the correct result. Where did I go wrong?
[Quote=Quoting reply #9 by dontkissbossass:]
E4%BB%A3% ==>> E4BBA3 == 代
The code below was written by loaden; no points needed.

// System ANSI code page (GBK on Chinese Windows) -> UTF-16
qp::StringW Global::AnsiToUnicode(const char* buf)
{
    int len = ::MultiByteToWideChar(CP_ACP, 0, buf, -1, NULL, 0);
    if (len == 0) return L"";

    std::vector<wchar_t> unicode(len);
    ::MultiByteToWideChar(CP_ACP, 0, buf, -1, &unicode[0], len);
    return &unicode[0];
}

// UTF-16 -> system ANSI code page (the buffer is named utf8 but holds ANSI bytes)
qp::StringA Global::UnicodeToAnsi(const wchar_t* buf)
{
    int len = ::WideCharToMultiByte(CP_ACP, 0, buf, -1, NULL, 0, NULL, NULL);
    if (len == 0) return "";

    std::vector<char> utf8(len);
    ::WideCharToMultiByte(CP_ACP, 0, buf, -1, &utf8[0], len, NULL, NULL);
    return &utf8[0];
}

// UTF-8 -> UTF-16
qp::StringW Global::Utf8ToUnicode(const char* buf)
{
    int len = ::MultiByteToWideChar(CP_UTF8, 0, buf, -1, NULL, 0);
    if (len == 0) return L"";

    std::vector<wchar_t> unicode(len);
    ::MultiByteToWideChar(CP_UTF8, 0, buf, -1, &unicode[0], len);
    return &unicode[0];
}

// UTF-16 -> UTF-8
qp::StringA Global::UnicodeToUtf8(const wchar_t* buf)
{
    int len = ::WideCharToMultiByte(CP_UTF8, 0, buf, -1, NULL, 0, NULL, NULL);
    if (len == 0) return "";

    std::vector<char> utf8(len);
    ::WideCharToMultiByte(CP_UTF8, 0, buf, -1, &utf8[0], len, NULL, NULL);
    return &utf8[0];
}
[/Quote]

chary8088 2010-01-30

[Quote=Quoting reply #16 by healer_kx:]
Quoting reply #12 by dontkissbossass:

It's probably your IE encoding setting. Opening it as UTF-8, I get this: http://www.google.cn/search?hl=zh-CN&source=hp&q=%E5%91%A8%E5%8D%8E%E5%81%A5&aq=f&oq= (Google search for 周华健)

It comes down to the browser, which encodes the query in the charset the page declares. Baidu's HTML page is written like this:
<html> <head>
<meta http-equiv="content-type" content="text/html;charset=gb2312">
<title>百度搜索_周华健 </title>
...........
while Google's is:
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
...........

Once the charset conversion is right you are done; URL encoding then just puts a % in front of each byte.
[/Quote]

So it really is UTF-8. Can Chinese characters be converted to UTF-8 directly, without those library functions?

wwq100 2010-01-29

Learned something.

healer_kx 2010-01-29

Don't learn from me, everyone. I only did HTML for two years, and as a result I'm not a specialist in anything now.

healer_kx 2010-01-29

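[Quote=Quoting reply #14 by cattycat:]
In my IE, googling 周华健 really does give 9 % signs, so it is UTF-8 after all: 3 bytes per Chinese character, each byte preceded by a %.

Learning from brother 甘草.
[/Quote]

At first I thought Google would use GB2312 too, but it turns out to be UTF-8. More and more web pages are UTF-8 these days.
The only drawback is that Chinese text costs more to transmit: 3 bytes per character instead of 2.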

healer_kx 2010-01-29

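[Quote=Quoting reply #12 by dontkissbossass:]

It's probably your IE encoding setting. Opening it as UTF-8, I get this: http://www.google.cn/search?hl=zh-CN&source=hp&q=%E5%91%A8%E5%8D%8E%E5%81%A5&aq=f&oq= (Google search for 周华健)
[/Quote]

It comes down to the browser, which encodes the query in the charset the page declares. Baidu's HTML page is written like this:
<html><head>
<meta http-equiv="content-type" content="text/html;charset=gb2312">
<title>百度搜索_周华健 </title>
...........
while Google's is:
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
...........

Once the charset conversion is right you are done; URL encoding then just puts a % in front of each byte.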

DontKissBossAss 2010-01-29

Here is another implementation. It uses no API; it is plain arithmetic written according to the UTF-8 and Unicode standards.

#include <string.h>
#include <string>
using std::wstring;

// Decode UTF-8 into a wide string using only bit arithmetic (no API calls).
// Malformed or truncated sequences become U+FFFD. Overlong forms, encoded
// surrogates and values above U+10FFFF are not rejected.
wstring UTF2Uni(const char* src, std::wstring &t)
{
    t.clear();
    if (src == NULL)
    {
        return t;
    }

    const unsigned char* p = (const unsigned char*)src;
    size_t size_s = strlen(src);
    size_t s = 0;

    while (s < size_s)
    {
        unsigned char c = p[s];
        unsigned long cp;   // decoded code point
        size_t extra;       // number of continuation bytes expected

        if ((c & 0x80) == 0)         { cp = c;        extra = 0; }  ///< 0xxx-xxxx
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }  ///< 110x-xxxx 10xx-xxxx
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }  ///< 1110-xxxx 10xx-xxxx 10xx-xxxx
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }  ///< 1111-0xxx 10xx-xxxx 10xx-xxxx 10xx-xxxx
        else                         { cp = 0xFFFD;   extra = 0; }  // stray continuation or invalid lead byte
        ++s;

        // Each continuation byte contributes its low 6 bits.
        for (; extra > 0; --extra, ++s)
        {
            if (s >= size_s || (p[s] & 0xC0) != 0x80) { cp = 0xFFFD; break; }
            cp = (cp << 6) | (p[s] & 0x3F);
        }

        if (cp >= 0x10000 && sizeof(wchar_t) == 2)
        {
            // wchar_t is 16 bits on Windows: split into a UTF-16 surrogate pair.
            cp -= 0x10000;
            t += (wchar_t)(0xD800 | (cp >> 10));
            t += (wchar_t)(0xDC00 | (cp & 0x3FF));
        }
        else
        {
            t += (wchar_t)cp;
        }
    }

    return t;
}







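// Encode a wide string as UTF-8 using only bit arithmetic (no API calls).
// Returns the number of bytes written (no terminating NUL is added), or -1 if
// utf8 is NULL, the buffer is too small, or a value is above U+10FFFF.
int Uni2UTF( const wstring& strRes, char *utf8, int nMaxSize )
{
    if (utf8 == NULL) {
        return -1;
    }
    int len = 0;
    int size_d = nMaxSize;

    for (wstring::const_iterator it = strRes.begin(); it != strRes.end(); ++it)
    {
        unsigned long cp = (unsigned long)*it;

        // With a 16-bit wchar_t (Windows), combine a surrogate pair into one code point.
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF && (it + 1) != strRes.end())
        {
            unsigned long lo = (unsigned long)*(it + 1);
            if (lo >= 0xDC00 && lo <= 0xDFFF)
            {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++it;
            }
        }

        if (cp < 0x80)
        {
            //length = 1;
            if (len >= size_d)
                return -1;

            utf8[len++] = (char)cp;
        }
        else if (cp < 0x800)
        {
            //length = 2;
            if (len + 1 >= size_d)
                return -1;

            utf8[len++] = (char)(0xc0 | ( cp >> 6 ));
            utf8[len++] = (char)(0x80 | ( cp & 0x3f ));
        }
        else if (cp < 0x10000)
        {
            //length = 3;
            if (len + 2 >= size_d)
                return -1;

            utf8[len++] = (char)(0xe0 | ( cp >> 12 ));
            utf8[len++] = (char)(0x80 | ( (cp >> 6) & 0x3f ));
            utf8[len++] = (char)(0x80 | ( cp & 0x3f ));
        }
        else if (cp < 0x110000)
        {
            //length = 4;
            if (len + 3 >= size_d)
                return -1;

            utf8[len++] = (char)(0xf0 | ( cp >> 18 ));
            utf8[len++] = (char)(0x80 | ( (cp >> 12) & 0x3f ));
            utf8[len++] = (char)(0x80 | ( (cp >> 6) & 0x3f ));
            utf8[len++] = (char)(0x80 | ( cp & 0x3f ));
        }
        else
        {
            return -1;  // beyond U+10FFFF: not a Unicode code point
        }
    }

    return len;
}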
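As a self-contained check of the bit arithmetic, here is a sketch that encodes a single code point and does not depend on the width of wchar_t. EncodeCodePoint is a hypothetical name introduced for illustration; the code point values U+4EE3 (代) and U+5468 (周) are worked out from the byte sequences quoted in this thread.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string EncodeCodePoint(unsigned long cp)
{
    std::string out;
    if (cp < 0x80)
    {
        out += static_cast<char>(cp);                          // 0xxx-xxxx
    }
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 110x-xxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xx-xxxx
    }
    else if (cp < 0x10000)
    {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;          // surrogates are not characters
        out += static_cast<char>(0xE0 | (cp >> 12));           // 1110-xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xx-xxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xx-xxxx
    }
    else if (cp <= 0x10FFFF)
    {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 1111-0xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));  // 10xx-xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xx-xxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xx-xxxx
    }
    return out;
}
```

For 代, 0x4EE3 is 0100 1110 1110 0011 in binary. Split 4 + 6 + 6 bits it becomes 0100, 111011, 100011, and adding the prefixes 1110, 10, 10 gives E4 BB A3, the bytes seen in the Google URL. Note that this only covers the Unicode to UTF-8 half. Getting from GB2312 bytes to Unicode code points needs a mapping table, because that relationship is not arithmetic.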

cattycat 2010-01-29

In my IE, googling 周华健 really does give 9 % signs, so it is UTF-8 after all: 3 bytes per Chinese character, each byte preceded by a %.

Learning from brother 甘草.

DontKissBossAss 2010-01-29

[Quote=Quoting reply #11 by healer_kx:]
public static void main(String[] args) {
	String a;
	try {
		a = URLEncoder.encode("周华健", "GB2312");
		System.out.println(a); //%D6%DC%BB%AA%BD%A1
		a = URLEncoder.encode("周华健", "UTF8");
		System.out.println(a); //%E5%91%A8%E5%8D%8E%E5%81%A5
	} catch (UnsupportedEncodingException e) {
		// TODO Auto-generated catch block
		e.printStackTrace();
	}
}
This short Java program illustrates the point. If you are on Chinese-language Windows XP, your text would normally be encoded as GB2312.

[/Quote]
Too bad C++ isn't good at web work and has no .encode method like this.

DontKissBossAss 2010-01-29

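[Quote=Quoting reply #10 by healer_kx:]
This encoding is definitely not UTF-8.
Take Baidu as an example and search for "周华健".
The URL parameter you get is %D6%DC%BB%AA%BD%A1. Read in groups of two, it is:

%D6%DC %BB%AA %BD%A1
周 华 健
If it were UTF-8, it would not look like this.

Why groups of two? Because GB2312 stores each Chinese character in two bytes, like a 16-bit wide char whose maximum value is 0xFFFF.
Written in this form, that maximum would be %FF%FF.

To be precise, this is text encoded as GB2312 and then URL-encoded.
If it were UTF-8, each Chinese character would take three bytes, so a three-character word would show 9 % signs.

[/Quote]

It's probably your IE encoding setting. Opening it as UTF-8, I get this: http://www.google.cn/search?hl=zh-CN&source=hp&q=%E5%91%A8%E5%8D%8E%E5%81%A5&aq=f&oq= (Google search for 周华健)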
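The byte-count argument in the quoted reply (two bytes per Chinese character in GB2312, three in UTF-8) can be checked with a small sketch. Both function names are hypothetical, and both assume well-formed input: the UTF-8 counter skips continuation bytes, and the GB2312 counter treats any byte with the high bit set as the first byte of a two-byte character.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count characters in well-formed UTF-8 by counting
// every byte that is not a continuation byte (10xx-xxxx).
std::size_t CountUtf8CodePoints(const std::string& utf8)
{
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < utf8.size(); ++i)
    {
        if ((static_cast<unsigned char>(utf8[i]) & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

// Hypothetical helper: count characters in well-formed GB2312 text.
// ASCII takes one byte; a byte with the high bit set starts a two-byte character.
std::size_t CountGb2312Chars(const std::string& gb)
{
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < gb.size(); ++count)
    {
        unsigned char c = static_cast<unsigned char>(gb[i]);
        i += (c >= 0x80) ? 2 : 1;
    }
    return count;
}
```

Both the 9 UTF-8 bytes from the Google URL and the 6 GB2312 bytes from the Baidu URL count as three characters, which is why the same query shows 9 % signs on one site and 6 on the other.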

healer_kx 2010-01-29

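import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// The wrapper class name UrlEncodeDemo is added here only so the snippet compiles.
public class UrlEncodeDemo {
	public static void main(String[] args) {
		String a;
		try {
			// Same text, two charsets: the bytes differ, so the %XX escapes differ.
			a = URLEncoder.encode("周华健", "GB2312");
			System.out.println(a); //%D6%DC%BB%AA%BD%A1
			a = URLEncoder.encode("周华健", "UTF8");
			System.out.println(a); //%E5%91%A8%E5%8D%8E%E5%81%A5
		} catch (UnsupportedEncodingException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		}
	}
}

This short Java program illustrates the point. If you are on Chinese-language Windows XP, your text would normally be encoded as GB2312.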