Character encoding problem in VC

ocean1004 2008-08-26 06:55:26



#include <iostream>
#include <windows.h>
#include <shlwapi.h>

#pragma comment(lib, "shlwapi.lib")  // link the shell path helper library

using namespace std;

int main()
{
	// Wide-character strings go to the W version.
	const WCHAR *pwszFile = L"中问他个吧啊比";
	const WCHAR *pwszSpec = L"*吧??";
	BOOL bRet = PathMatchSpecW(pwszFile, pwszSpec);
	if (bRet)
	{
		cout << "OK" << endl;
	}

	// Narrow (char) strings go to the A version.
	const char *ptszFile = "中问他个吧啊比";
	const char *ptszSpec = "*吧??";
	bRet = PathMatchSpecA(ptszFile, ptszSpec);
	if (bRet)
	{
		cout << "OK" << endl;
	}
	return 0;
}

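Why does the char version at the bottom work as well?
I don't know much about character encodings, so I'd appreciate a detailed explanation from the experts.
To be more precise: I understand encodings in general and know that one is UNICODE and the other is single-byte, but I don't understand how VC handles encodings. Is VC converting the strings?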


老胡 2008-08-26


Copied from elsewhere:

1. wprintf
Q: sizeof(wchar_t) = ?
A: It depends on the compiler (so avoid wchar_t where possible in cross-platform code). In VC, sizeof(wchar_t) = 2.

Q: In VC, why does calling wprintf(L"测试1234") directly print nothing?
A: The locale has not been set. Do this:
setlocale(LC_ALL, "chs");
wprintf(L"%s", L"测试1234");

Or (assuming the currently active code page is chs):
char scp[16];
int cp = GetACP();
sprintf(scp, ".%d", cp);
setlocale(LC_ALL, scp);
wprintf(L"测试1234");

2. wcout
The same applies, but set the locale with std::locale:
locale loc("chs");
wcout.imbue(loc);
wcout << L"测试1234" << endl;

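This article appears to be [netsin]'s work; I was diligent enough to write it down.
Note: wprintf is a standard C library function, and wcout is a standard C++ stream object. An L"..." literal in C++ is a wide-character string, but not necessarily a Unicode string; that depends on the compiler implementation.
[乾坤一笑] says: Why do C and C++ leave L"xx" implementation-defined? Clearly for the sake of generality and portability. Bjarne's view is that the C++ approach lets programmers use any character set as the character type of a string. Besides, Unicode has already gone through several versions, and nobody knows whether it will remain suitable forever. For a detailed discussion of Unicode and a comparison with other character sets, I recommend 《无废话xml》.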
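
To see what a particular compiler does with wide literals, a small check like the following can help. This is only a sketch; the function name wide_literal_bytes is made up for illustration and the result is implementation-defined.

```cpp
#include <cstddef>

// Hypothetical helper: total storage, in bytes, of the wide literal L"VC",
// including its terminating null character.
std::size_t wide_literal_bytes() {
    return sizeof(L"VC");  // 3 wide characters * sizeof(wchar_t)
}
```

With VC this gives 6 because wchar_t is 2 bytes there; a compiler with a 4-byte wchar_t gives 12.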

The two programs below were run on Windows XP Professional (English edition) and compiled with VS2005 RTM.
// C
#include <stdio.h>
#include <locale.h>
int main( void )
{
    setlocale( LC_ALL, "chs" );
    //setlocale( LC_ALL, "Chinese-simplified" );
    //setlocale( LC_ALL, "ZHI" );
    //setlocale( LC_ALL, ".936" );
    wprintf( L"中国" );
    return 0;
}
// C++
#include <iostream>
#include <locale>
using namespace std;
int main( void )
{
    locale loc( "chs" );
    //locale loc( "Chinese-simplified" );
    //locale loc( "ZHI" );
    //locale loc( ".936" );
    wcout.imbue( loc );
    std::wcout << L"中国" << endl;
    return 0;
}
Note: do not mix setlocale and std::locale.
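
One thing the C++ version leaves out: the std::locale constructor throws std::runtime_error when the runtime does not recognize the name, and names such as "chs" are specific to the Microsoft runtime. A minimal sketch of a guarded version follows; locale_or_classic is a hypothetical helper name, not part of any library.

```cpp
#include <locale>
#include <stdexcept>

// Hypothetical helper: build a named locale, falling back to the classic "C"
// locale when the runtime does not know the name.
std::locale locale_or_classic(const char* name) {
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}
```

It could be used as wcout.imbue(locale_or_classic("chs")); with the fallback, non-ASCII output may still fail, but the program no longer terminates on an unknown locale name.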

------------------------- Note of 2006-07-05 -------------------------
"VC知识库" is encoded as: 56 43 D6 AA CA B6 BF E2 00 // ANSI encoding
L"VC知识库" in VC++ is encoded as: 56 00 43 00 E5 77 C6 8B 93 5E 00 00 // the encoding Windows calls Unicode (UTF-16LE)
L"VC知识库" in GCC (Dev-CPP4990) is encoded as: 56 00 43 00 D6 00 AA 00 CA 00 B6 00 BF 00 E2 00 00 00 // each ANSI byte is simply padded with a zero byte
L"VC知识库" in GCC (Dev-CPP4992) fails to compile with "Illegal byte sequence"
To make L"VC知识库" work in Dev-CPP4992:
a. Save the file as UTF-8 // UTF-8 is one encoding of Unicode, but it is not the same as what Windows calls Unicode
b. Remove the BOM: use a binary editor (VC, for example) to delete the first three bytes of the UTF-8 file // Linux/UNIX does not use a BOM
c. Compile and run with gcc/g++
After these steps, in dev-cpp4992:
"VC知识库" is encoded as: 56 43 E7 9F A5 E8 AF 86 E5 BA 93 00 // UTF-8, no longer ANSI, so printf/cout will print garbage
L"VC知识库" is encoded as: 56 00 43 00 E5 77 C6 8B 93 5E 00 00 // the encoding Windows calls Unicode
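
The UTF-8 and UTF-16LE byte sequences listed above can be reproduced by hand. The sketch below encodes a list of code points both ways and formats the bytes as hex. All three function names are invented for this illustration, and the terminating null is not included in the dump.

```cpp
#include <cstdio>
#include <string>

// Encode Unicode code points as UTF-8 bytes.
std::string utf8_bytes(const std::u32string& text) {
    std::string out;
    for (char32_t c : text) {
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else if (c < 0x800) {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (c >> 18));
            out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}

// Encode Unicode code points as UTF-16 little-endian bytes.
std::string utf16le_bytes(const std::u32string& text) {
    std::string out;
    auto put_unit = [&out](unsigned unit) {
        out += static_cast<char>(unit & 0xFF);         // low byte first
        out += static_cast<char>((unit >> 8) & 0xFF);  // then high byte
    };
    for (char32_t c : text) {
        if (c < 0x10000) {
            put_unit(static_cast<unsigned>(c));
        } else {  // outside the BMP: split into a surrogate pair
            unsigned v = static_cast<unsigned>(c) - 0x10000;
            put_unit(0xD800 + (v >> 10));
            put_unit(0xDC00 + (v & 0x3FF));
        }
    }
    return out;
}

// Format bytes as space-separated uppercase hex, e.g. "56 43".
std::string hex_dump(const std::string& bytes) {
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        unsigned value = static_cast<unsigned char>(bytes[i]);
        std::snprintf(buf, sizeof buf, "%02X", value);
        if (i) out += ' ';
        out += buf;
    }
    return out;
}
```

For "VC知识库", written as U"VC\u77E5\u8BC6\u5E93", hex_dump(utf16le_bytes(...)) gives 56 00 43 00 E5 77 C6 8B 93 5E and hex_dump(utf8_bytes(...)) gives 56 43 E7 9F A5 E8 AF 86 E5 BA 93, matching the listing.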
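Addendum: using wcout and wstring in mingw32 requires defining some macros, for example
#define _GLIBCXX_USE_WCHAR_T 1
#include <iostream>
int main( void )
{
    std::wcout << 1 << std::endl;
}
This compiles but fails to link. I googled it: STLport says the problem is in mingw32, and mingw32 says the problem is in Microsoft's C runtime.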

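Unicode output from printf and wprintf on the console
1. printf only provides ANSI/MB output; it cannot output a Unicode stream.
For example:
wchar_t test[] = L"测试1234";
printf("%s", test);
will not print correctly.

2. wprintf does not provide Unicode output either,
but it converts the wchar_t string to the locale's SB/MB character encoding and then outputs it.
For example:
wchar_t test[] = L"测试Test";
wprintf(L"%s", test);
prints something like ??Test, or nothing at all,
because wprintf cannot convert L"测试Test" to the default ANSI encoding. The locale has to be set first:
setlocale(LC_ALL, "chs");
wchar_t test[] = L"测试Test";
wprintf(L"%s", test);
This prints correctly,
and is equivalent to printf("%ls", test);

In summary: CRT I/O functions do not provide Unicode output.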
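
The conversion step described in point 2 can be tried in isolation with wcstombs, which converts a wide string using the current C locale. This is a rough sketch, not what the CRT does internally; to_multibyte is a hypothetical helper name.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: convert a wide string to the current C locale's
// multibyte encoding. Returns false when a character has no representation.
bool to_multibyte(const std::wstring& wide, std::string& out) {
    // With a null destination, wcstombs only reports the bytes required
    // (supported by POSIX and by the Microsoft CRT).
    std::size_t needed = std::wcstombs(nullptr, wide.c_str(), 0);
    if (needed == static_cast<std::size_t>(-1)) {
        return false;  // not convertible in this locale
    }
    std::vector<char> buf(needed + 1);  // +1 for the terminating null
    std::wcstombs(buf.data(), wide.c_str(), buf.size());
    out.assign(buf.data(), needed);
    return true;
}
```

Plain ASCII text converts in any locale. For text such as L"测试Test" the result depends on the active locale, which is why the setlocale(LC_ALL, "chs") call matters before wprintf.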

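3. The Windows console has been a true Unicode console since NT4,
but the only way to output a Unicode string is through the Windows API function WriteConsoleW.
For example:
wchar_t test[] = L"测试1234";
DWORD ws;
WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), test, (DWORD)wcslen(test), &ws, NULL);
This prints correctly without setting a locale, because it is true Unicode output and does not depend on the code page.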

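4. How to get cross-platform console output
Do not use wchar_t and wprintf, because both depend on the compiler.
ICU is IBM's mature cross-platform library with Unicode support, and it is recommended.

Below is ICU's uprintf implementation:
void uprintf(const UnicodeString &str) {
    char *buf = 0;
    int32_t len = str.length();
    int32_t bufLen = len + 16;
    int32_t actualLen;
    buf = new char[bufLen + 1];
    actualLen = str.extract(0, len, buf/*, bufLen*/); // Default codepage conversion
    buf[actualLen] = 0;
    printf("%s", buf);
    delete[] buf; // array form, to match new char[]
}
It also converts the Unicode string to the local code page first and then calls printf. That is still not Unicode output, but it is cross-platform and works well in most cases.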
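Postscript:
About the third parameter of functions such as mbstowcs(wchar_t *wcstr, const char *mbstr, size_t count):
count: The maximum number of multibyte characters to convert.
It is the number of characters in the multibyte string, counted under the currently active locale, plus one.
For example, for the string "abc赵123" in the C locale, count is strlen("abc赵123") + 1, that is, 8 + 1.
In chinese-simplified.936, count is 7 + 1.
count must be computed under the same locale in which mbstowcs is called.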
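
Instead of working out count by hand, the length can be queried from mbstowcs itself under whatever locale is active. This is a minimal sketch assuming the null-destination form of mbstowcs, which POSIX and the Microsoft CRT both support; to_wide is a hypothetical helper name.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: convert a multibyte string to a wide string under the
// current C locale. Returns false when the bytes are invalid for that locale.
bool to_wide(const std::string& mb, std::wstring& out) {
    // First pass: a null destination only counts the wide characters needed.
    std::size_t chars = std::mbstowcs(nullptr, mb.c_str(), 0);
    if (chars == static_cast<std::size_t>(-1)) {
        return false;
    }
    std::vector<wchar_t> buf(chars + 1);  // the "plus one" is the null
    // Second pass: same locale, count = character count + 1.
    std::mbstowcs(buf.data(), mb.c_str(), buf.size());
    out.assign(buf.data(), chars);
    return true;
}
```

Because both calls happen under the same locale, the count always agrees with the conversion, which is the point of the last rule above.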

yuhaozx 2008-08-26


PathMatchSpec

BOOL PathMatchSpec(
LPCTSTR pszFileParam,
LPCTSTR pszSpec
);

Searches a string using a DOS wild card match type. The string can be searched for a particular file extension, such as *.bmp, *.doc, and so on.

Returns TRUE if the string matches, or FALSE otherwise.
pszFileParam
Address of the string to be searched.
pszSpec
Address of the file type for which to search.
Example:

#include <windows.h>
#include <iostream>
#include "Shlwapi.h"

#pragma comment(lib, "shlwapi.lib")  // link against shlwapi

using namespace std;

int main( void )
{
    // String path name 1.
    char buffer_1[] = "C:\\Test\\File.txt";
    char *lpStr1 = buffer_1;
    // String path name 2.
    char buffer_2[] = "C:\\Test\\File.bmp";
    char *lpStr2 = buffer_2;
    // Wildcard spec (string 3).
    char buffer_3[] = "*.txt";
    char *lpStr3 = buffer_3;
    // String path name 4.
    char buffer_4[] = "C:\\Test\\File";
    char *lpStr4 = buffer_4;
    // Variable to get the return value from PathMatchSpecA.
    // The A version is called explicitly because the arguments are char
    // strings, so this compiles whether or not UNICODE is defined.
    int retval;
    // Test path name 1.
    retval = PathMatchSpecA(lpStr1, lpStr3);
    cout << "The contents of String 1: " << lpStr1
         << "\nThe return value from the function is " << retval << " = TRUE" << endl;
    // Test path name 2.
    retval = PathMatchSpecA(lpStr2, "*.bmp");
    cout << "The contents of String 2: " << lpStr2
         << "\nThe return value from the function is " << retval << " = TRUE" << endl;
    // Test path name 4.
    retval = PathMatchSpecA(lpStr4, lpStr2);
    cout << "The contents of String 4: " << lpStr4
         << "\nThe return value from the function is " << retval << " = FALSE" << endl;
    return 0;
}
OUTPUT:
==========
The contents of String 1: C:\Test\File.txt
The return value from the function is 1 = TRUE
The contents of String 2: C:\Test\File.bmp
The return value from the function is 1 = TRUE
The contents of String 4: C:\Test\File
The return value from the function is 0 = FALSE

jasonnbfan 2008-08-26


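The kernel has been pure Unicode ever since Windows NT.
When you call PathMatchSpecA(char *, char *), it internally does a simple conversion of the ANSI strings to Unicode strings and then calls PathMatchSpecW(WCHAR *, WCHAR *).

So for Win32 programming it is recommended to use WCHAR directly rather than char, which avoids the internal conversion and improves runtime efficiency.

For details, see Chapter 2 of Windows核心编程.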
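
To illustrate why the char version has to work on characters rather than raw bytes, here is a rough sketch. It is not the shlwapi implementation: wildcard_match is a hypothetical, simplified matcher that supports only '*' and '?', and the byte values are written by hand assuming code page 936 (GBK).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not the shlwapi implementation: '*' matches any run
// of units and '?' matches exactly one unit of type Ch.
template <typename Ch>
bool wildcard_match(const std::basic_string<Ch>& text,
                    const std::basic_string<Ch>& spec) {
    const std::size_t npos = std::basic_string<Ch>::npos;
    std::size_t t = 0, s = 0, star = npos, mark = 0;
    while (t < text.size()) {
        if (s < spec.size() && spec[s] == Ch('*')) {
            star = s++;    // remember the star, first try an empty match
            mark = t;
        } else if (s < spec.size() &&
                   (spec[s] == Ch('?') || spec[s] == text[t])) {
            ++s;
            ++t;
        } else if (star != npos) {
            s = star + 1;  // backtrack: the last star absorbs one more unit
            t = ++mark;
        } else {
            return false;
        }
    }
    while (s < spec.size() && spec[s] == Ch('*')) ++s;
    return s == spec.size();
}

// The question's strings as Unicode code points: one unit per character.
bool match_as_code_points() {
    const std::u32string file = U"\u4E2D\u95EE\u4ED6\u4E2A\u5427\u554A\u6BD4";
    const std::u32string spec = U"*\u5427??";
    return wildcard_match(file, spec);
}

// The same strings as code page 936 bytes: two units per character.
bool match_as_raw_bytes() {
    const std::string file =
        "\xD6\xD0\xCE\xCA\xCB\xFB\xB8\xF6\xB0\xC9\xB0\xA1\xB1\xC8";
    const std::string spec = "*\xB0\xC9??";
    return wildcard_match(file, spec);
}
```

match_as_code_points() returns true, while match_as_raw_bytes() returns false because the two '?' consume only the two bytes of one Chinese character. PathMatchSpecA reported a match in the original program, so it evidently treats each double-byte character as a single character, which is consistent with converting to Unicode first as described above.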

chenweigaoyu 2008-08-26