Problem calling fgetws to read an ANSI-encoded text file that contains Chinese characters

ljan 2006-12-22 06:39:28

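The text file is ANSI-encoded and contains Chinese. Its contents are:
123宾馆酒店

I want to read the contents into a Unicode array.
Because I did not want to use an MBCS-to-Unicode conversion function, I called fgetws directly. Digits such as 123 are converted to Unicode correctly (31 00 32 00 33 00), but Chinese characters such as 宾馆 are parsed as: b1 00 f6 00 ...
The Unicode value of the character "宾" is be 5b (U+5BBE stored little-endian),
so the result is wrong. Where is my understanding mistaken?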

FILE *p = _tfopen(_T("f:\\3.txt"), _T("r"));
TCHAR aaa[100];
fgetws(aaa, 30, p);

Also: is there a Unicode function that can read MBCS text directly into a Unicode array?

11 replies

ljan 2006-12-25

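Thanks, everyone, for the help.

I looked up the relevant material on code pages and have basically confirmed that the error above is caused by the locale not being set.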

Locale code page. The behavior of a number of run-time routines is dependent on the current locale setting, which includes the locale code page. (For more information, see Locale-Dependent Routines.) By default, all locale-dependent routines in the Microsoft run-time library use the code page that corresponds to the “C” locale. At run-time you can change or query the locale code page in use with a call to setlocale.

The “C” locale is defined by ANSI to correspond to the locale in which C programs have traditionally executed. The code page for the “C” locale (“C” code page) corresponds to the ASCII character set. For example, in the “C” locale, islower returns true for the values 0x61 – 0x7A only. In another locale, islower may return true for these as well as other values, as defined by that locale.

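The cause: fgetws converts its input as if by calling mbtowc, and mbtowc is one of the Locale-Dependent Routines. By default the "C" locale is in effect, and the "C" locale corresponds to the ASCII character set. With the wrong code page in effect, the function misinterprets the bytes, so the two bytes of a Chinese character are never combined into one wide character.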
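
To make the symptom concrete, here is a minimal sketch that models what was observed in this thread under the default "C" locale: every input byte comes back as one wide character with the same numeric value. `widen_bytes` is a hypothetical helper written only for illustration. It is not a CRT function, and it assumes the CRT behaves the way the b1 00 f6 00 output suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper (not part of any CRT): imitates the behaviour seen
 * in this thread under the "C" locale, where each byte is zero-extended
 * into its own wide character instead of being decoded as MBCS.
 * Returns the number of wide characters stored, excluding the terminator. */
size_t widen_bytes(const unsigned char *src, size_t len,
                   wchar_t *dst, size_t cap)
{
    size_t i;

    assert(src != NULL && dst != NULL && cap > 0);

    for (i = 0; i < len && i + 1 < cap; i++)
        dst[i] = (wchar_t)src[i];   /* 0xB1 becomes 0x00B1, not U+5BBE */
    dst[i] = L'\0';
    return i;
}
```

Fed the bytes 31 32 33 b1 f6, this produces five wide characters, 0x0031 0x0032 0x0033 0x00b1 0x00f6, which matches the dump in the question. A correct MBCS decode would instead produce four: the three digits plus U+5BBE.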

The concept that jixingzhong(瞌睡虫·星辰) mentioned, a pseudo-Unicode encoding extended from ANSI, was actually rather confusing.

AI风 2006-12-22

I have run into this problem too. You need to understand how locales work in the CRT.

setlocale(LC_ALL, "chs"); // add this line before reading
FILE *p = _tfopen(_T("f:\\3.txt"), _T("r"));
TCHAR aaa[100];
fgetws(aaa, 30, p);

Then aaa will contain the correct Unicode encoding.
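
A slightly more defensive version of the same idea, as a sketch only. `read_first_line_wide` and its return codes are hypothetical names introduced here. It sets only LC_CTYPE, on the assumption that the character conversion is the only part that matters, and it takes the locale name as a parameter because "chs" is a Microsoft CRT name that other C libraries are not expected to accept.

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: select a locale, then read the first line of a
 * multibyte text file as wide characters.
 * Returns  0 on success,
 *         -1 if the locale name is rejected,
 *         -2 if the file cannot be opened,
 *         -3 if nothing could be read or converted. */
int read_first_line_wide(const char *path, const char *locale_name,
                         wchar_t *buf, int cap)
{
    FILE *fp;
    wchar_t *got;

    assert(path != NULL && locale_name != NULL && buf != NULL && cap > 0);

    /* Must happen before the read so the conversion uses this locale. */
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    fp = fopen(path, "r");          /* text mode, as in the original post */
    if (fp == NULL)
        return -2;

    got = fgetws(buf, cap, fp);     /* converts multibyte input to wchar_t */
    fclose(fp);
    return got != NULL ? 0 : -3;
}
```

With the Microsoft CRT the call would look like read_first_line_wide("f:\\3.txt", "chs", aaa, 100). Checking the return value of setlocale is worthwhile because a rejected name silently leaves the "C" locale in place, which reproduces the original problem.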

jixingzhong 2006-12-22

Contents of test.txt: 宾馆，这个是一个测试ABCDef

In the output,
the encoding of 宾馆 begins with b1 f6,
the same as in the ANSI file.

Have a look at this article:
http://vckbase.com/document/viewdoc/?id=1317

jixingzhong 2006-12-22

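I gave it a try
in Dev C++:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

#define MAX 256

int main(void)
{
    wchar_t str[MAX] = {0};
    char s[MAX] = {0};
    size_t i;
    FILE *fp = fopen("test.txt", "r");

    if (fp == NULL) {
        perror("test.txt");
        return 1;
    }

    /* Wide read: bytes are converted using the current locale ("C" here). */
    if (fgetws(str, MAX, fp) != NULL) {
        printf("Read %lu characters from file.\n", (unsigned long)wcslen(str));
        fputws(str, stdout);
        for (i = 0; i < wcslen(str); i++)
            printf("%x\t", (unsigned)str[i]);
    }

    /* Byte read of the same line for comparison. The CRT used by Dev C++
       tolerates mixing wide and byte I/O on one stream; other C libraries
       may not. */
    rewind(fp);
    if (fgets(s, MAX, fp) != NULL) {
        printf("\nRead %lu characters from file.\n", (unsigned long)strlen(s));
        puts(s);
        for (i = 0; i < strlen(s); i++)
            printf("%x\t", (unsigned)(s[i] & 0xff));
    }

    fclose(fp);
    system("PAUSE");
    return 0;
}

Neither output shows garbled text.

However,
when I looked at the values in the array,
I found that wchar_t str[MAX] does not hold real Unicode code points.
It holds a simple widening of the ANSI bytes.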

jixingzhong 2006-12-22

Chinese characters such as 宾馆 are parsed as: b1 00 f6 00 ...

What this means is that
what was read is not real Unicode!
It is a pseudo-Unicode encoding extended from ANSI.

jixingzhong 2006-12-22

I have never tried it this way,
because a wrong encoding usually produces garbled text.

testing .......

ljan 2006-12-22

The text file is ANSI-encoded. For now I am not planning to handle Unicode-encoded text files.

ljan 2006-12-22

So what is fgetws actually meant to be used for?

from msdn:

When a Unicode stream I/O routine (such as fwprintf, fwscanf, fgetwc, fputwc, fgetws, or fputws) operates on a file that is open in text mode (the default), two kinds of character conversions take place:

Unicode-to-MBCS or MBCS-to-Unicode conversion. When a Unicode stream-I/O function operates in text mode, the source or destination stream is assumed to be a sequence of multibyte characters. Therefore, the Unicode stream-input functions convert multibyte characters to wide characters (as if by a call to the mbtowc function). For the same reason, the Unicode stream-output functions convert wide characters to multibyte characters (as if by a call to the wctomb function).

Doesn't this say that these functions perform the MBCS-to-Unicode conversion themselves?
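
For comparison, the conversion that the quoted passage describes as happening implicitly can be written out by hand: read the bytes, then convert them with mbstowcs under the current locale. This is only a sketch of that equivalent step. `line_to_wide` is a hypothetical helper, and it assumes setlocale has already selected a locale that matches the file's code page.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: convert one multibyte string to wide characters
 * using the current locale. This is the explicit form of the conversion
 * that the wide stream functions perform in text mode.
 * Returns the number of wide characters written (excluding the
 * terminator), or (size_t)-1 if the bytes are invalid in this locale. */
size_t line_to_wide(const char *mb, wchar_t *out, size_t cap)
{
    size_t n;

    assert(mb != NULL && out != NULL && cap > 0);

    n = mbstowcs(out, mb, cap - 1);  /* leave room for the terminator */
    if (n == (size_t)-1)
        return n;
    out[n] = L'\0';                  /* not written when the buffer fills */
    return n;
}
```

The result depends on the locale in exactly the same way as fgetws does, so this does not avoid the need for setlocale. It only makes the conversion step visible and lets the caller detect an invalid byte sequence through the return value.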

jixingzhong 2006-12-22