Character set encoding: experts wanted

tcher 2014-07-11 10:08:29

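Test environment: VS2008, local code page 936 (GBK, commonly called GB2312). Results are as follows.
1. The string "中国" as char*, in source files saved in three encodings: (1) UTF-8, (2) UTF-8 with BOM, (3) the default (GB2312).
Inspecting memory shows that the string from file 1 really is UTF-8, but the other two both hold double-byte GB2312 values. With printf, 1 prints garbage while 2 and 3 print correctly. After switching the console code page to 65001 (UTF-8), 1 displays correctly and 2 and 3 are garbled.
Why is the UTF-8 BOM string stored in the local code page in memory? It should be stored as is.
2. All types changed to wchar_t*, again tested with "中国". Locale set to 936.
With wprintf, 1 is garbled while 2 and 3 are fine. In memory, 2 and 3 hold Unicode (UCS-2) values, as expected. String 1 was also transcoded (it is no longer the original UTF-8), yet each Chinese character still takes 3 bytes. Since wchar_t is supposed to be converted to the internal UCS-2 encoding, it should be exactly 2 bytes. Not only is it 3 bytes, it does not even print correctly.
Why?

Would the experts please weigh in? Thanks.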

31 replies

赵4老师 2014-07-23

I recommend the ConvertZ tool for batch-converting the encoding of multiple files.

mujiok2003 2014-07-23

#include <clocale>
#include <cstdio>
#include "_gb2312.h"
#include "_utf8.h"
#include "_utf8bom.h"


int main(int argc, char* argv[])
{
	setlocale(LC_ALL, ".936");
	printf(_gb2312);
	wprintf(_gb2312w);
	
	// Do not use UTF-8 source files without a BOM
	//printf(_utf8);	
	//wprintf(_utf8w);	

	printf(_utf8bom);
	wprintf(_utf8bomw);
	

	return 0;
}

mujiok2003 2014-07-23

Do not save source code as UTF-8 without a BOM.

辰岡墨竹 2014-07-22

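Here is the situation: the encoding support of the VS IDE and the encoding support of the actual C++ compiler are two different things. Microsoft's C++ compiler does not really support UTF-8 files without a BOM; it treats them as ANSI. So "" strings are not converted at all, while L"" strings are converted from ANSI to Unicode, compounding the mistake. A UTF-8 file with a BOM is converted to Unicode (UTF-16LE) first and then processed. Reply #19 is basically right, except for one point: the compiler does not know what GB2312 is. In C++11 a UTF-8 string can be written as u8".....", but Microsoft does not support that yet.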

cngaler 2014-07-22

http://connect.microsoft.com/VisualStudio/feedback/details/341454/compile-error-with-source-file-containing-utf8-strings-in-cjk-system-locale
This issue is reported here, and Jonathan Caves of the Visual C++ Compiler Team answered:
"The compiler when faced with a source file that does not have a BOM the compiler reads ahead a certain distance into the file to see if it can detect any Unicode characters - it specifically looks for UTF-16 and UTF-16BE - if it doesn't find either then it assumes that it has MBCS. I suspect that in this case that in this case it falls back to MBCS and this is what is causing the problem."
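To make the BOM part of that answer concrete, here is a rough sketch of classifying a file by its leading bytes. The name sniff_bom is hypothetical, and this is not the compiler's actual code: it only looks at the BOM and does not model the read-ahead check for BOM-less UTF-16 that the answer mentions.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: guess a source file's encoding from its first bytes only.
// Returns "UTF-8", "UTF-16LE", "UTF-16BE", or "MBCS" when no BOM is present.
std::string sniff_bom(const unsigned char* data, std::size_t size) {
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return "UTF-8";
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return "UTF-16LE";
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return "UTF-16BE";
    // No signature: the file would be treated as the system ANSI code page,
    // which is 936 on the machine discussed in this thread.
    return "MBCS";
}
```

Under this sketch the bytes of _utf8.h (no signature) come out as "MBCS", which matches the behaviour the thread observes: the UTF-8 bytes get read as if they were code page 936.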

cngaler 2014-07-22

These two are worth a look:
http://blog.csdn.net/darkdong/article/details/6067119
http://www.cppblog.com/lauer3912/archive/2011/05/12/146281.aspx

tcher 2014-07-11

Correction to an omission in my reply: a Chinese character should be either 2 bytes or 4 bytes.

tcher 2014-07-11

Quoting lfy2217's reply (#1):

Post the code and let's discuss.

Code posted. It is very simple.

tcher 2014-07-11

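_utf8.h
// Note: this file must be saved via the VS2008 advanced save options as UTF-8 without signature
const char* _utf8 = "u8中国";
const wchar_t* _utf8w = L"u8中国";

_utf8bom.h
// Note: this file must be saved via the VS2008 advanced save options as UTF-8 with signature, i.e. with a BOM
const char* _utf8bom = "ub中国";
const wchar_t* _utf8bomw = L"ub中国";

_gb2312.h
// This should be the default, the local environment, haha
const char* _gb2312 = "gb中国";
const wchar_t* _gb2312w = L"gb中国";

charset.cpp
#include "stdafx.h"
#include <stdio.h>
#include <locale.h>
#include "_gb2312.h"
#include "_utf8.h"
#include "_utf8bom.h"

int _tmain(int argc, _TCHAR* argv[])
{
    char* plocal = setlocale(LC_ALL, ".936");

    // In memory you will find that _utf8bom and _gb2312 are identical. Amazing...
    printf("%s\n%s\n%s\n", _utf8, _utf8bom, _gb2312);

    // In memory all three have been transcoded, but in _utf8w one Chinese character
    // still takes 3 bytes after transcoding, which clearly departs from UCS-2
    // (a Chinese character should be either 2 bytes or 4 bytes).
    // MSVC-specific: %s in wprintf expects a wide string.
    wprintf(L"%s\n%s\n%s\n", _utf8w, _utf8bomw, _gb2312w);

    // Ignore this line, I was just experimenting.
    printf("%ls\n%ls\n%ls\n", _utf8w, _utf8bomw, _gb2312w);
    return 0;
}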

cngaler 2014-07-11

Post the code and let's discuss.

Dobzhansky 2014-07-11

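For C/C++ source code, which encoding does a compiled string literal use when it has no L prefix? With some compilers it depends on the encoding of the source file itself. Take this snippet:

#include <stdio.h>

int main()
{
    char* p = "好香的MM啊";
    while (*p != '\0')
    {
        /* plain char is signed here, so bytes >= 0x80 print sign-extended (ffffffxx) */
        printf("%x ", *p);
        p++;
    }
    return 0;
}

Save it twice, once in the local encoding as ccd.c and once in UTF-8 as ccdu8.c. Compile both with various compiler versions and compare the results.

vc71: results differ.
mingw (gcc 4.6.1): results differ.

E:\>cd practice
E:\practice>cl /nologo ccd.c
ccd.c
E:\practice>cl /nologo ccdu8.c
ccdu8.c
E:\practice>ccd
ffffffba ffffffc3 ffffffcf ffffffe3 ffffffb5 ffffffc4 4d 4d ffffffb0 ffffffa1
E:\practice>ccdu8.exe
ffffffe5 ffffffa5 ffffffbd ffffffe9 ffffffa6 ffffff99 ffffffe7 ffffff9a ffffff84 4d 4d ffffffe5 ffffff95 ffffff8a
E:\practice>

=================
vc80, vc10: results are identical.
vc70 says outright that it does not support Unicode files.
borland c 5.5 simply chokes on the UTF-8 file and refuses to work.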
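The ffffff prefixes in that output come from sign extension of a signed char, not from the encoding. A small sketch of a dump that prints each byte as exactly two hex digits may make the comparison easier to read; hex_dump is a made-up name for illustration, not something from the post.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: render the bytes of a narrow string as "xx xx xx".
std::string hex_dump(const char* s) {
    std::string out;
    char buf[4];  // two hex digits plus the terminating NUL
    for (; *s != '\0'; ++s) {
        // Go through unsigned char first so bytes >= 0x80 are not sign-extended.
        unsigned value = static_cast<unsigned char>(*s);
        std::snprintf(buf, sizeof buf, "%02x", value);
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

With this, the first character of the GBK build would show as ba c3 and that of the UTF-8 build as e5 a5 bd, which is the two-byte versus three-byte difference the post is pointing at.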

tcher 2014-07-11

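Eclipse is smarter after all. As everyone knows, Java uses Unicode: whatever the format of the source file, in the JVM and in memory everything is Unicode code values, and on output it is converted to the locale automatically. As for MultiByteToWideChar and the like that someone mentioned above, a design usually avoids such conversions, since they are a hassle and cost efficiency. You only convert when, say, a fixed communication program is locked into narrow characters while the consumer only supports wide characters. And the conversion has to name the source encoding; CP_ACP only converts from the local environment. Anyone who has looked into this carefully will find that for some of the trickier Chinese characters CP_ACP is simply unreliable and the conversion fails, because the local environment is only a 2-byte encoding and 2 bytes can represent at most 65536 characters. Hahaha, this problem is damned annoying and I don't want to dwell on it any longer.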

tcher 2014-07-11

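Setting the specific problem aside, let me put it plainly.
For char, the compiler should not convert and has no right to. Otherwise a hard-coded UTF-8 string compiled under different local environments would send out different bytes, which is unacceptable. Put another way: if I choose char, I know full well that it does not suit internationalization. Just do what I specified and I will bear the consequences.
wchar_t was born for internationalization and can represent every character (which is not to say that char cannot). "Every" here means that the content is forcibly converted by L or TEXT(). That is a manual step, but once it is there, everything is converted to Unicode whatever the original encoding was (on Windows this is UTF-16, 2 or 4 bytes per character). That is enough to carry every character. Finally, once a locale is specified, every string written to an output stream necessarily goes through a conversion from Unicode to the local code page. Maybe my theory is wrong; corrections are welcome.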
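The "2 or 4 bytes" remark about UTF-16 can be sketched as follows. This is only an illustration: to_utf16 is a hypothetical name, and the function assumes its input is a valid Unicode scalar value (not a surrogate, not above U+10FFFF).

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: encode one code point as UTF-16 code units.
// Assumes cp is a valid Unicode scalar value.
std::vector<std::uint16_t> to_utf16(std::uint32_t cp) {
    std::vector<std::uint16_t> units;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit unit, i.e. 2 bytes.
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        // Supplementary planes: a surrogate pair, i.e. 4 bytes.
        cp -= 0x10000;
        units.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
    }
    return units;
}
```

For 中 (U+4E2D) this yields the single unit 4E2D, while a supplementary-plane character such as U+20000 yields the pair D840 DC00.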

tcher 2014-07-11

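Quoting lfy2217's reply (#19):

Just the facts.
1. On save, both UTF-8 and UTF-8 with BOM store 中国 as e4 b8 ad e5 9b bd.
2. On read, the UTF-8 file has no BOM, so VS simply assumes GB2312 and parses the bytes as 3 characters: e4 b8, ad e5, 9b bd.
3. VS then converts these to wide characters (UTF-16LE), which gives 93 6d 5e e1 57 6d.
4. On read, a file with an encoding signature is converted to GB2312 according to that signature, which is why "中国" from the UTF-8 BOM file and from the GB2312 file are identical in memory.
Let's verify it:
char str[20] = {'\xE4','\xB8','\xAD','\xE5','\x9B','\xBD'};   // the UTF-8 representation of "中国"
printf("%s\n", str);
WCHAR wstr[40];
MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, str, -1, wstr, 40);   // converts it to wide characters
After the conversion, wstr holds 93 6d 5e e1 57 6d. That's right, as you will have noticed, it is the same value as in _utf8w. It does not depart from UCS-2; it just went through a process like the one above.

Thanks, though I see it a little differently. You say that "a file with an encoding signature is converted to GB2312 according to that signature". As far as I know, the compiler should have no right to force a conversion on a char* string, should it? If it did, how could I ever keep UTF-8 strings in memory? I did test question 1: with the bytes stored as is, I run chcp 65001 in the console and then run the program. The plain UTF-8 string displays correctly and the other two are garbled. That is exactly the case where my source encoding matches the client environment and no conversion by the compiler is needed.
As for question 2, what I meant is that wchar_t strings are converted by the compiler automatically. In other words the so-called MultiByteToWideChar is in effect called for you, and memory holds UTF-16 values. That should guarantee that whether your source is GB2312, UTF-8, JIS, Big5 or something else, as long as you declare wchar_t and set the locale, users of that locale will see correct output, provided the local character set supports the characters. Each to his own view. Thanks for your answer, and I hope we can keep discussing.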

赵4老师 2014-07-11

Pay particular attention to this:

setlocale
#pragma setlocale( "locale-string" )

Defines the locale (country and language) to be used when translating wide-character constants and string literals. Since the algorithm for converting multibyte characters to wide characters may vary by locale or the compilation may take place in a different locale from where an executable file will be run, this pragma provides a way to specify the target locale at compile time. This guarantees that the wide-character strings will be stored in the correct format. The default locale-string is "C". The "C" locale maps each character in the string to its value as a wchar_t (unsigned short).

cngaler 2014-07-11

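Just the facts.
1. On save, both UTF-8 and UTF-8 with BOM store 中国 as e4 b8 ad e5 9b bd.
2. On read, the UTF-8 file has no BOM, so VS simply assumes GB2312 and parses the bytes as 3 characters: e4 b8, ad e5, 9b bd.
3. VS then converts these to wide characters (UTF-16LE), which gives 93 6d 5e e1 57 6d.
4. On read, a file with an encoding signature is converted to GB2312 according to that signature, which is why "中国" from the UTF-8 BOM file and from the GB2312 file are identical in memory.
Let's verify it:
char str[20] = {'\xE4','\xB8','\xAD','\xE5','\x9B','\xBD'};   // the UTF-8 representation of "中国"
printf("%s\n", str);
WCHAR wstr[40];
MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, str, -1, wstr, 40);   // converts it to wide characters
After the conversion, wstr holds 93 6d 5e e1 57 6d. That's right, as you will have noticed, it is the same value as in _utf8w. It does not depart from UCS-2; it just went through a process like the one above.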
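The byte-level part of this explanation can be checked without the Windows API. The sketch below is illustrative only, with hypothetical names: decode_utf8 assumes well-formed input and does no validation, and count_gbk_chars uses a simplified lead-byte rule (0x81..0xFE) without checking the trail byte or looking anything up in the real code page 936 table.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: decode well-formed UTF-8 into code points (no validation).
std::vector<std::uint32_t> decode_utf8(const unsigned char* p, std::size_t n) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < n) {
        unsigned char b = p[i];
        std::uint32_t cp;
        std::size_t len;
        if (b < 0x80)      { cp = b;        len = 1; }
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; }
        else               { cp = b & 0x07; len = 4; }
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len && i + k < n; ++k)
            cp = (cp << 6) | (p[i + k] & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Hypothetical helper: count characters the way a double-byte reader would.
// Simplification: any byte in 0x81..0xFE is a lead byte and consumes the next byte.
std::size_t count_gbk_chars(const unsigned char* p, std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++count)
        i += (p[i] >= 0x81 && p[i] <= 0xFE && i + 1 < n) ? 2 : 1;
    return count;
}
```

Fed the six bytes e4 b8 ad e5 9b bd, the UTF-8 decoder gives two code points, U+4E2D and U+56FD, while the double-byte reading gives three characters. That matches the values quoted above: read as little-endian 16-bit units, 93 6d 5e e1 57 6d is 6D93 E15E 6D57, three 2-byte units for two Chinese characters, so each wchar_t is still 2 bytes.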

tcher 2014-07-11

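Quoting zhao4zhong1's reply (#15):

My advice is not to worry about how the editor handles things or where its handling falls short. Write a small program of your own that uses MultiByteToWideChar and WideCharToMultiByte, and convert however you like. Also a reminder: have you downloaded and installed the latest Update for the VS IDE?

Good advice, and I can let question 1 go. The key is question 2. What I want to guarantee is that whatever anyone on the project team writes, as long as it uses wide characters, ends up displayed correctly or sent to the client correctly. But the experiment shows that even with wchar_t everywhere, the plain UTF-8 string still does not display correctly. As I recall, UTF-8 should be converted to the internal UTF-16. I have to get to the bottom of this, otherwise it looks as if wide characters give us internationalization while the problem is still there. Thanks for the suggestion. I will drop question 1, but not question 2.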
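For the do-it-yourself conversion suggested in the quoted reply, one portable alternative to MultiByteToWideChar is the standard library's std::wstring_convert. This is a sketch under assumptions: utf8_to_utf16 is a hypothetical name, the facility needs C++11 (so not VS2008), and it has been deprecated since C++17, so treat it as an illustration of an explicit UTF-8 to UTF-16 step rather than a recommendation.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: convert UTF-8 bytes to UTF-16 code units at run time.
// from_bytes throws std::range_error if the input is not valid UTF-8.
std::u16string utf8_to_utf16(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);
}
```

Because the source encoding is named explicitly here (UTF-8), the result does not depend on the system code page, unlike a conversion through CP_ACP.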

tcher 2014-07-11

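Quoting zhao4zhong1's reply (#13):

Of course, EF BB BF refers to the first three bytes of the file, not the start of each line or of each string.

Haha, of course, the BOM sits at the start of the file. I marked the parts I care about in red, haha.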

赵4老师 2014-07-11

Also:

setlocale
#pragma setlocale( "locale-string" )

Defines the locale (country and language) to be used when translating wide-character constants and string literals. Since the algorithm for converting multibyte characters to wide characters may vary by locale or the compilation may take place in a different locale from where an executable file will be run, this pragma provides a way to specify the target locale at compile time. This guarantees that the wide-character strings will be stored in the correct format. The default locale-string is "C". The "C" locale maps each character in the string to its value as a wchar_t (unsigned short).

Locale

Use the setlocale function to change or query some or all of the current program locale information. "Locale" refers to the locality (the country and language) for which you can customize certain aspects of your program. Some locale-dependent categories include the formatting of dates and the display format for monetary values. For more information, see Locale Categories.

Locale-Dependent Routines

Routine | Use | setlocale Category Setting Dependence
atof, atoi, atol | Convert character to floating-point, integer, or long integer value, respectively | LC_NUMERIC
is Routines | Test given integer for particular condition. | LC_CTYPE
isleadbyte | Test for lead byte () | LC_CTYPE
localeconv | Read appropriate values for formatting numeric quantities | LC_MONETARY, LC_NUMERIC
MB_CUR_MAX | Maximum length in bytes of any multibyte character in current locale (macro defined in STDLIB.H) | LC_CTYPE
_mbccpy | Copy one multibyte character | LC_CTYPE
_mbclen | Return length, in bytes, of given multibyte character | LC_CTYPE
mblen | Validate and return number of bytes in multibyte character | LC_CTYPE
_mbstrlen | For multibyte-character strings: validate each character in string; return string length | LC_CTYPE
mbstowcs | Convert sequence of multibyte characters to corresponding sequence of wide characters | LC_CTYPE
mbtowc | Convert multibyte character to corresponding wide character | LC_CTYPE
printf functions | Write formatted output | LC_NUMERIC (determines radix character output)
scanf functions | Read formatted input | LC_NUMERIC (determines radix character recognition)
setlocale, _wsetlocale | Select locale for program | Not applicable
strcoll, wcscoll | Compare characters of two strings | LC_COLLATE
_stricoll, _wcsicoll | Compare characters of two strings (case insensitive) | LC_COLLATE
_strncoll, _wcsncoll | Compare first n characters of two strings | LC_COLLATE
_strnicoll, _wcsnicoll | Compare first n characters of two strings (case insensitive) | LC_COLLATE
strftime, wcsftime | Format date and time value according to supplied format argument | LC_TIME
_strlwr | Convert, in place, each uppercase letter in given string to lowercase | LC_CTYPE
strtod, wcstod, strtol, wcstol, strtoul, wcstoul | Convert character string to double, long, or unsigned long value | LC_NUMERIC (determines radix character recognition)
_strupr | Convert, in place, each lowercase letter in string to uppercase | LC_CTYPE
strxfrm, wcsxfrm | Transform string into collated form according to locale | LC_COLLATE
tolower, towlower | Convert given character to corresponding lowercase character | LC_CTYPE
toupper, towupper | Convert given character to corresponding uppercase letter | LC_CTYPE
wcstombs | Convert sequence of wide characters to corresponding sequence of multibyte characters | LC_CTYPE
wctomb | Convert wide character to corresponding multibyte character | LC_CTYPE
_wtoi, _wtol | Convert wide-character string to int or long | LC_NUMERIC

赵4老师 2014-07-11