How to tell whether a character is Simplified Chinese, Traditional Chinese, or English

seikoo 2004-10-19 04:27:11
Thanks. Experts, please help...
xiaomineer 2004-10-21
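If the text is in Unicode, the Unicode FAQ I posted earlier in this thread already makes the answer clear: it cannot be determined. There are too many exceptions, and any result you force out of it will be inaccurate.

If the text is in GB2312 or Big5, you can convert it to its internal code, but the two internal code ranges overlap, as jamesfancy() 边城狂人 (James Fancy) pointed out. So your program, which judges by internal code, can only identify some of the traditional characters. In other words, the result is inaccurate.

Finally, I also think this kind of check is unnecessary. If you want to convert, just convert directly.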
seikoo 2004-10-20
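public class Charset {

    // Encode explicitly as GBK. A bare getBytes() uses the platform default
    // charset, so the result would change from machine to machine.
    private static final java.nio.charset.Charset GBK =
            java.nio.charset.Charset.forName("GBK");

    // Classifies only the FIRST character of str by its GBK bytes:
    //   1 = lead 0xB0-0xF7, trail 0xA1-0xFE (the GB2312 hanzi area)
    //   2 = lead 0xB0-0xF7, trail outside 0xA1-0xFE (GBK extension area)
    //   0 = anything else (null, blank, ASCII, other areas)
    private static int classifyFirstChar(String str) {
        // isEmpty() replaces the original == "" test, which compared references.
        if (str == null || str.trim().isEmpty()) return 0;
        byte[] bytes = str.getBytes(GBK);
        if (bytes.length < 2) return 0;
        int lead = bytes[0] & 0xFF;   // mask, because Java bytes are signed
        int trail = bytes[1] & 0xFF;
        if (lead < 0xB0 || lead > 0xF7) return 0;
        return (trail >= 0xA1 && trail <= 0xFE) ? 1 : 2;
    }

    // True if the first character falls in the GB2312 hanzi area.
    public static boolean isCS(String str) {
        return classifyFirstChar(str) == 1;
    }

    // True if the first character falls in the GBK extension area.
    // Despite the name, this does not test Big5 bytes, and it catches
    // only some traditional characters.
    public static boolean isBig5(String str) {
        return classifyFirstChar(str) == 2;
    }
}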
I'll answer it myself.
xiaomineer 2004-10-20
Q: How can I recognize from the 32 bit value of a Unicode character if this is a Chinese, Korean or Japanese character?

A: It's basically impossible and largely meaningless. It's the equivalent of asking if "a" is an English letter or a French one. There are *some* characters where one can guess based on the source information in Unihan.txt that it's traditional Chinese, simplified Chinese, Japanese, Korean, or Vietnamese, but there are too many exceptions to make this really reliable. (For example, one particularly nasty obscenity in Cantonese would probably have never been encoded for Cantonese, but has made it in for the sake of Korean, where one hopes it isn't nearly as obscene.)

The phonetic data in Unihan.txt should not be used for this purpose. A blank in the phonetic data means that nobody's supplied a reading, not that a reading doesn't exist. Because updating the Unihan database is an ongoing process, these fields will be increasingly filled out as time goes on, but they should never be taken as absolutely complete. In particular, there are obscure characters where it is known that there *is* a reading, but since the character does not occur in standard dictionaries, we are unable to supply it (e.g., U+40DF in Cantonese).

A better solution is to look at the text as a whole: if there's a fair amount of kana, it's probably Japanese, and if there's a fair amount of hangul, it's probably Korean.

The only proper mechanism, as for determining whether "chat" is spelled correctly in English or French, is to use a higher-level protocol.
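The FAQ's advice to look at the text as a whole can be sketched in Java with `Character.UnicodeScript`. This is only an illustration: the class name `ScriptGuess`, the return strings, and the 10% threshold for "a fair amount" are assumptions, not part of the FAQ.

```java
import java.lang.Character.UnicodeScript;

final class ScriptGuess {
    /** Guesses the language of CJK text from the scripts it contains. */
    static String guess(String text) {
        int han = 0, kana = 0, hangul = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            UnicodeScript script = UnicodeScript.of(cp);
            if (script == UnicodeScript.HAN) {
                han++;
            } else if (script == UnicodeScript.HIRAGANA
                    || script == UnicodeScript.KATAKANA) {
                kana++;
            } else if (script == UnicodeScript.HANGUL) {
                hangul++;
            }
        }
        int total = han + kana + hangul;
        if (total == 0) return "none";
        // Assumed threshold: 10% kana or hangul counts as "a fair amount".
        if (kana * 10 >= total && kana >= hangul) return "Japanese";
        if (hangul * 10 >= total) return "Korean";
        // Han characters alone do not say simplified vs. traditional.
        return "Han only";
    }
}
```

Note that a "Han only" result still says nothing about simplified versus traditional, which is exactly the FAQ's point.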
seikoo 2004-10-20
There does seem to be a problem... it can only detect some of the traditional characters...
xiaomineer 2004-10-20
seikoo (上下求索),
I don't know the reasoning behind your program, but I copied it and ran it on my machine,
and isBig5 returned false when I entered traditional Chinese.
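One likely reason for the differing results is that a Java String holds Unicode, and a bare `getBytes()` re-encodes it in the platform default charset, so the bytes the program inspects change from machine to machine. A sketch that avoids this dependence asks the GB2312 and Big5 encoders directly whether each character exists in their repertoires. The class name `RepertoireCheck` is made up for illustration, and the sketch assumes the JDK in use ships the `GB2312` and `Big5` charsets. It remains a heuristic, since characters present in both sets stay ambiguous.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

final class RepertoireCheck {
    /** Returns {only in GB2312, only in Big5, in both} character counts. */
    static int[] count(String text) {
        CharsetEncoder gb = Charset.forName("GB2312").newEncoder();
        CharsetEncoder big5 = Charset.forName("Big5").newEncoder();
        int[] counts = new int[3];
        for (char c : text.toCharArray()) {
            if (c < 0x80) continue;          // ASCII tells us nothing here
            boolean inGb = gb.canEncode(c);
            boolean inBig5 = big5.canEncode(c);
            if (inGb && !inBig5) counts[0]++;        // simplified-only hint
            else if (inBig5 && !inGb) counts[1]++;   // traditional-only hint
            else if (inGb && inBig5) counts[2]++;    // shared, ambiguous
        }
        return counts;
    }
}
```

A text with many "only in Big5" characters and none "only in GB2312" is probably traditional, and the reverse suggests simplified. When most characters land in the shared bucket, no conclusion is safe.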
边城狂人 2004-10-19
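Look at the byte ranges.

GB2312

Byte 1: 0xA1-0xFE
Byte 2: 0xA1-0xFE

94 rows of 94 characters each

BIG5

Byte 1: 0xA1-0xF9
Byte 2: 0x40-0x7E and 0xA1-0xFE

89 rows of 157 characters each

Traditional characters in GBK fall between 0x40 and 0xA1. As you can see, the encodings overlap, so it is really hard to tell which characters are simplified and which are traditional.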
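The GB2312 and Big5 ranges listed above can be compared mechanically. The following is a minimal sketch (the class and method names are made up for illustration) that tests a byte pair against each range and counts how many pairs are valid in both encodings.

```java
final class ByteRanges {
    /** True if (b1, b2) is a valid GB2312 double-byte position. */
    static boolean inGb2312(int b1, int b2) {
        return b1 >= 0xA1 && b1 <= 0xFE && b2 >= 0xA1 && b2 <= 0xFE;
    }

    /** True if (b1, b2) is a valid Big5 double-byte position. */
    static boolean inBig5(int b1, int b2) {
        boolean lead = b1 >= 0xA1 && b1 <= 0xF9;
        boolean trail = (b2 >= 0x40 && b2 <= 0x7E)
                || (b2 >= 0xA1 && b2 <= 0xFE);
        return lead && trail;
    }

    /** Counts the byte pairs that are valid in both encodings. */
    static int countOverlap() {
        int n = 0;
        for (int b1 = 0; b1 <= 0xFF; b1++) {
            for (int b2 = 0; b2 <= 0xFF; b2++) {
                if (inGb2312(b1, b2) && inBig5(b1, b2)) n++;
            }
        }
        return n;
    }
}
```

By these ranges, 8366 of the 8836 GB2312 positions (89 x 94 out of 94 x 94) are also valid Big5 positions, which is why a range check alone cannot tell the two encodings apart.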
seikoo 2004-10-19
What about simplified versus traditional, then?
winterxu416 2004-10-19
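In JavaScript, /[^\x00-\xff]/ig.test(str) returns true when str contains Chinese characters (strictly speaking, any character outside the \x00-\xff range).
The same regular expression works in Java to test whether a string contains Chinese characters.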
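A Java version of that test might look like the sketch below. The class name `HanCheck` is made up for illustration. The first pattern is the same range test as the JavaScript one. The second, `\p{IsHan}`, is a narrower alternative (not from the post above) that matches only Han-script characters, so kana, hangul, or Cyrillic text does not count as Chinese. Neither pattern separates simplified from traditional.

```java
import java.util.regex.Pattern;

final class HanCheck {
    // Same test as the JavaScript regex: any char outside U+0000-U+00FF.
    private static final Pattern NON_LATIN1 =
            Pattern.compile("[^\\x00-\\xff]");

    // Narrower alternative: at least one character of the Han script.
    private static final Pattern HAN = Pattern.compile("\\p{IsHan}");

    static boolean hasNonLatin1(String s) {
        return NON_LATIN1.matcher(s).find();
    }

    static boolean hasHan(String s) {
        return HAN.matcher(s).find();
    }
}
```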