Converting between Unicode and Shift JIS

Jerry_shower 2005-10-11 11:38:31
For example, in a VC6 environment, given the Unicode code of a katakana character or a Japanese kanji, how do I get its Shift JIS code?
Thanks!
Jerry_shower 2005-10-24
Asking again: does anyone know about the Mac Japanese character set?
Thanks.
Jerry_shower 2005-10-23
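There is one thing I don't understand. JIS8 (JIS X 0201) is an 8-bit Japanese encoding. In VC, where characters are stored as Unicode, how can I tell whether a character I receive belongs to that code table?
Thanks!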
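
For the JIS X 0201 question, one possible check is sketched below. It assumes the usual Unicode mapping for JIS X 0201: the Roman half matches ASCII except that byte 0x5C is the yen sign (U+00A5) and byte 0x7E is the overline (U+203E), and bytes 0xA1-0xDF are the halfwidth katakana U+FF61-U+FF9F. The name `UnicodeToJisX0201` is made up for illustration, and treating control characters as outside the set is a simplification that may not suit every protocol.

```cpp
// Maps one Unicode code point to its single JIS X 0201 byte.
// Returns true and stores the byte in *out when the character belongs
// to the 8-bit set; returns false otherwise.
// Hypothetical helper for illustration; not part of any Windows API.
bool UnicodeToJisX0201(unsigned long cp, unsigned char* out)
{
    unsigned char b;
    if (cp == 0x00A5) {
        b = 0x5C;                                   // YEN SIGN sits where ASCII has '\'
    } else if (cp == 0x203E) {
        b = 0x7E;                                   // OVERLINE sits where ASCII has '~'
    } else if (cp >= 0x20 && cp <= 0x7D && cp != 0x5C) {
        b = (unsigned char)cp;                      // rest of the Roman half
    } else if (cp >= 0xFF61 && cp <= 0xFF9F) {
        b = (unsigned char)(cp - 0xFF61 + 0xA1);    // halfwidth katakana
    } else {
        return false;                               // assumption: controls and all else rejected
    }
    if (out)
        *out = b;
    return true;
}
```

Called as `UnicodeToJisX0201(0xFF71, &b)`, this succeeds with `b == 0xB1`. The fullwidth katakana U+30A2 is rejected, because only the halfwidth forms exist in the 8-bit table.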
蒋晟 2005-10-17
I believe JIS8 is covered by ISO-2022 under the name "iso-2022-jp".

see also

http://www.digitalmars.com/d/archives/digitalmars/D/23652.html

Sun, 15 May 2005 12:46:06 -0700 "Andrew Fedoniouk" <news xx terrainformatica.com> writes:
Good idea, I like it.

FYI: On Windows MultiByteToWideChar and WideCharToMultiByte
support many encodings other than mentioned directly in MSDN.
I am using this list:

lang_t langs[] = {
{"asmo-708",708},
{"dos-720",720},
{"iso-8859-6",28596},
{"x-mac-arabic",10004},
{"windows-1256",1256},
{"ibm775",775},
{"iso-8859-4",28594},
{"windows-1257",1257},
{"ibm852",852},
{"iso-8859-2",28592},
{"x-mac-ce",10029},
{"windows-1250",1250},
{"euc-cn",51936},
{"gb2312",936},
{"hz-gb-2312",52936},
{"x-mac-chinesesimp",10008},
{"big5",950},
{"x-chinese-cns",20000},
{"x-chinese-eten",20002},
{"x-mac-chinesetrad",10002},
{"cp866",866},
{"iso-8859-5",28595},
{"koi8-r",20866},
{"koi8-u",21866},
{"x-mac-cyrillic",10007},
{"windows-1251",1251},
{"x-europa",29001},
{"x-ia5-german",20106},
{"ibm737",737},
{"iso-8859-7",28597},
{"x-mac-greek",10006},
{"windows-1253",1253},
{"ibm869",869},
{"dos-862",862},
{"iso-8859-8-i",38598},
{"iso-8859-8",28598},
{"x-mac-hebrew",10005},
{"windows-1255",1255},
{"x-ebcdic-arabic",20420},
{"x-ebcdic-cyrillicrussian",20880},
{"x-ebcdic-cyrillicserbianbulgarian",21025},
{"x-ebcdic-denmarknorway",20277},
{"x-ebcdic-denmarknorway-euro",1142},
{"x-ebcdic-finlandsweden",20278},
{"x-ebcdic-finlandsweden-euro",1143},
{"x-ebcdic-france-euro",1147},
{"x-ebcdic-germany",20273},
{"x-ebcdic-germany-euro",1141},
{"x-ebcdic-greekmodern",875},
{"x-ebcdic-greek",20423},
{"x-ebcdic-hebrew",20424},
{"x-ebcdic-icelandic",20871},
{"x-ebcdic-icelandic-euro",1149},
{"x-ebcdic-international-euro",1148},
{"x-ebcdic-italy",20280},
{"x-ebcdic-italy-euro",1144},
{"x-ebcdic-japaneseandkana",50930},
{"x-ebcdic-japaneseandjapaneselatin",50939},
{"x-ebcdic-japaneseanduscanada",50931},
{"x-ebcdic-japanesekatakana",20290},
{"x-ebcdic-koreanandkoreanextended",50933},
{"x-ebcdic-koreanextended",20833},
{"cp870",870},
{"x-ebcdic-simplifiedchinese",50935},
{"x-ebcdic-spain",20284},
{"x-ebcdic-spain-euro",1145},
{"x-ebcdic-thai",20838},
{"x-ebcdic-traditionalchinese",50937},
{"cp1026",1026},
{"x-ebcdic-turkish",20905},
{"x-ebcdic-uk",20285},
{"x-ebcdic-uk-euro",1146},
{"ebcdic-cp-us",37},
{"x-ebcdic-cp-us-euro",1140},
{"ibm861",861},
{"x-mac-icelandic",10079},
{"x-iscii-as",57006},
{"x-iscii-be",57003},
{"x-iscii-de",57002},
{"x-iscii-gu",57010},
{"x-iscii-ka",57008},
{"x-iscii-ma",57009},
{"x-iscii-or",57007},
{"x-iscii-pa",57011},
{"x-iscii-ta",57004},
{"x-iscii-te",57005},
{"euc-jp",51932},
{"iso-2022-jp",50220},
{"iso-2022-jp",50222},
{"csiso2022jp",50221},
{"x-mac-japanese",10001},
{"shift_jis",932},
{"ks_c_5601-1987",949},
{"euc-kr",51949},
{"iso-2022-kr",50225},
{"johab",1361},
{"x-mac-korean",10003},
{"iso-8859-3",28593},
{"iso-8859-15",28605},
{"x-ia5-norwegian",20108},
{"ibm437",437},
{"x-ia5-swedish",20107},
{"windows-874",874},
{"ibm857",857},
{"iso-8859-9",28599},
{"x-mac-turkish",10081},
{"windows-1254",1254},
//{(const char *)L"unicode",1200},
//{"unicodefffe",1201},
{"utf-7",65000},
{"utf-8",65001},
//{"us-ascii",20127},
{"us-ascii",1252},
{"windows-1258",1258},
{"ibm850",850},
{"x-ia5",20105},
{"iso-8859-1",1252}, //was 28591
{"macintosh",10000},
{"windows-1252",1252},
{"system",CP_ACP}
};
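
A table like this is typically searched by charset name to get the code page number to pass to the conversion functions. The sketch below shows one way to do that. The post does not show the definition of `lang_t`, so the struct layout here is an assumption, and `FindCodePage` is a hypothetical helper. Only a few Japanese entries are copied from the list to keep the example short.

```cpp
#include <string.h>

// Assumed layout of lang_t; the quoted post does not show its definition.
struct lang_t {
    const char*  name;
    unsigned int codepage;
};

// A few Japanese entries copied from the list above.
static const lang_t kJapanese[] = {
    {"euc-jp",         51932},
    {"iso-2022-jp",    50220},
    {"csiso2022jp",    50221},
    {"x-mac-japanese", 10001},
    {"shift_jis",      932}
};

// Returns the code page for an exact, lowercase charset name.
// Returns -1 when the name is not in the table; 0 is avoided as the
// "not found" value because CP_ACP is defined as 0 on Windows.
long FindCodePage(const char* name)
{
    const size_t n = sizeof(kJapanese) / sizeof(kJapanese[0]);
    for (size_t i = 0; i < n; ++i) {
        if (strcmp(kJapanese[i].name, name) == 0)
            return (long)kJapanese[i].codepage;
    }
    return -1;
}
```

For example, `FindCodePage("shift_jis")` returns 932 and `FindCodePage("x-mac-japanese")` returns 10001.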

Jerry_shower 2005-10-16
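Thanks for everyone's help, I understand it a bit better now.
But I have one more question: when calling WideCharToMultiByte, what is the code page for JIS8? I have tried every code page that has anything to do with JIS, and none of them gave the correct result!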
蒋晟 2005-10-15
In case you don't know the English terms of Japanese writing systems:

http://en.wikipedia.org/wiki/CJK
The term CJKV is used to mean CJK plus Vietnamese, which used Chinese characters prior to adopting a written language solely on Romanization.

These languages all have a shared characteristic: Their writing systems are partly or entirely based on Chinese characters—Hanzi in Chinese, Kanji in Japanese, Hanja in Korean, and Chữ nôm in Vietnamese.

http://en.wikipedia.org/wiki/Katakana
Katakana (片仮名) are a Japanese syllabary, one of the four Japanese writing systems. The others are hiragana, kanji and rōmaji. The word katakana means "partial kana".

http://en.wikipedia.org/wiki/Hiragana
Hiragana (平仮名 literally "smooth kana") are a Japanese syllabary, one of four Japanese writing systems (the others are katakana, kanji and rōmaji).
thisisll 2005-10-15
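Go to Accessories -> System Tools -> Character Map.

There you can look at the Japanese encodings.

I don't know what hiragana and so on are.
I'm not sure whether the above is any use; have a look.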
Jerry_shower 2005-10-15
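Thanks for the detailed pointers, but I still don't quite follow. All I want to know is something like this:
0x3000-0x3214: this range of Unicode codes is katakana
.............: this range of Unicode codes is hiragana
.............: this range of Unicode codes is Japanese kanji
That would be enough!
I need to decide from the Unicode code whether a symbol I receive is hiragana or a Japanese kanji.
Any help is much appreciated!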
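
The ranges asked for can be read off the Unicode block list quoted in this thread: Hiragana is 3040-309F, Katakana is 30A0-30FF (plus Katakana Phonetic Extensions at 31F0-31FF), and the Han ideographs are in CJK Unified Ideographs 4E00-9FFF, Extension A 3400-4DBF, and CJK Compatibility Ideographs F900-FAFF. Here is a minimal sketch of a classifier built on those ranges. The enum and function names are invented for illustration. Two caveats: the kanji blocks are shared with Chinese and Korean, so the code point alone cannot prove a character is Japanese, and the kana blocks also hold a few marks and unassigned positions.

```cpp
enum JaScript { kOther, kHiragana, kKatakana, kKanji };

// Classifies a code point by Unicode block only (hypothetical helper).
JaScript ClassifyJapanese(unsigned long cp)
{
    if (cp >= 0x3040 && cp <= 0x309F) return kHiragana;  // Hiragana
    if (cp >= 0x30A0 && cp <= 0x30FF) return kKatakana;  // Katakana
    if (cp >= 0x31F0 && cp <= 0x31FF) return kKatakana;  // Katakana Phonetic Extensions
    if (cp >= 0x4E00 && cp <= 0x9FFF) return kKanji;     // CJK Unified Ideographs
    if (cp >= 0x3400 && cp <= 0x4DBF) return kKanji;     // CJK Unified Ideographs Extension A
    if (cp >= 0xF900 && cp <= 0xFAFF) return kKanji;     // CJK Compatibility Ideographs
    return kOther;
}
```

With this, U+3042 is reported as hiragana, U+30A2 as katakana, and U+65E5 as kanji. U+3000, the start of the example range in the question, is the ideographic space in CJK Symbols and Punctuation and comes back as `kOther`. Halfwidth katakana in Halfwidth and Fullwidth Forms (FF00-FFEF) are not handled here.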
Jerry_shower 2005-10-13
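One more question for the poster above: what are the contiguous start and end values of hiragana, katakana, and Japanese kanji in Unicode? I have searched for a long time without finding them!
Thanks!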
蒋晟 2005-10-13
http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters

Mapping of Unicode characters

Unicode reserves 1,114,112 (= 2^20 + 2^16) code points, and currently assigns characters to more than 96,000 of those code points. The first 256 codes precisely match those of ISO 8859-1, the most popular 8-bit character encoding in the "Western world"; as a result, the first 128 characters are also identical to ASCII.

The Unicode code space for characters is divided into 17 "planes" and each plane has 65,536 (= 2^16) code points.

Basic Multilingual Plane


As of Unicode 4.1, the BMP includes the following scripts:

Basic Latin (0000–007F)
Latin-1 Supplement (0080–00FF)
Latin Extended-A (0100–017F)
Latin Extended-B (0180–024F)
IPA Extensions (0250–02AF)
Spacing Modifier Letters (02B0–02FF)
Combining Diacritical Marks (0300–036F)
Greek and Coptic (0370–03FF)
Cyrillic (0400–04FF)
Cyrillic Supplement (0500–052F)
Armenian (0530–058F)
Hebrew (0590–05FF)
Arabic (0600–06FF)
Syriac (0700–074F)
Arabic Supplement (0750–077F)
Thaana (0780–07BF)
Indic scripts:
Devanagari (0900–097F)
Bengali (0980–09FF)
Gurmukhi (0A00–0A7F)
Gujarati (0A80–0AFF)
Oriya (0B00–0B7F)
Tamil (0B80–0BFF)
Telugu (0C00–0C7F)
Kannada (0C80–0CFF)
Malayalam (0D00–0D7F)
Sinhala (0D80–0DFF)
Thai (0E00–0E7F)
Lao (0E80–0EFF)
Tibetan (0F00–0FFF)
Burmese (1000–109F)
Georgian (10A0–10FF)
Hangul Jamo (1100–11FF)
Ethiopic (1200–137F)
Ethiopic Supplement (1380–139F)
Cherokee (13A0–13FF)
Unified Canadian Aboriginal Syllabics (1400–167F)
Ogham (1680–169F)
Runic (16A0–16FF)
Filipino scripts:
Tagalog (1700–171F)
Hanunóo (1720–173F)
Buhid (1740–175F)
Tagbanwa (1760–177F)
Khmer (1780–17FF)
Mongolian (1800–18AF)
Limbu (1900–194F)
Tai Le (1950–197F)
New Tai Lue (1980–19DF)
Khmer Symbols (19E0–19FF)
Buginese (1A00–1A1F)
Phonetic Extensions (1D00–1D7F)
Phonetic Extensions Supplement (1D80–1DBF)
Combining Diacritical Marks Supplement (1DC0–1DFF)
Latin Extended Additional (1E00–1EFF)
Greek Extended (1F00–1FFF)
Symbols:
General Punctuation (2000–206F)
Superscripts and Subscripts (2070–209F)
Currency Symbols (20A0–20CF)
Combining Diacritical Marks for Symbols (20D0–20FF)
Letterlike Symbols (2100–214F)
Number Forms (2150–218F)
Arrows (2190–21FF)
Mathematical Operators (2200–22FF)
Miscellaneous Technical (2300–23FF)
Control Pictures (2400–243F)
Optical Character Recognition (2440–245F)
Enclosed Alphanumerics (2460–24FF)
Box Drawing (2500–257F)
Block Elements (2580–259F)
Geometric Shapes (25A0–25FF)
Miscellaneous Symbols (2600–26FF)
Dingbats (2700–27BF)
Miscellaneous Mathematical Symbols-A (27C0–27EF)
Supplemental Arrows-A (27F0–27FF)
Braille Patterns (2800–28FF)
Supplemental Arrows-B (2900–297F)
Miscellaneous Mathematical Symbols-B (2980–29FF)
Supplemental Mathematical Operators (2A00–2AFF)
Miscellaneous Symbols and Arrows (2B00–2BFF)
Glagolitic (2C00–2C5F)
Coptic (2C80–2CFF)
Georgian Supplement (2D00–2D2F)
Tifinagh (2D30–2D7F)
Ethiopic Extended (2D80–2DDF)
Supplemental Punctuation (2E00–2E7F)
CJK Radicals Supplement (2E80–2EFF)
Kangxi Radicals (2F00–2FDF)
Ideographic Description Characters (2FF0–2FFF)
CJK Symbols and Punctuation (3000–303F)
Hiragana (3040–309F)
Katakana (30A0–30FF)
Bopomofo (3100–312F)
Hangul Compatibility Jamo (3130–318F)
Kanbun (3190–319F)
Bopomofo Extended (31A0–31BF)
CJK Strokes (31C0–31EF)
Katakana Phonetic Extensions (31F0–31FF)
Enclosed CJK Letters and Months (3200–32FF)
CJK Compatibility (3300–33FF)
CJK Unified Ideographs Extension A (3400–4DBF)
Yijing Hexagram Symbols (4DC0–4DFF)
CJK Unified Ideographs (4E00–9FFF)
Yi Syllables (A000–A48F)
Yi Radicals (A490–A4CF)
Modifier Tone Letters (A700–A71F)
Syloti Nagri (A800–A82F)
Hangul Syllables (AC00–D7AF)
High Surrogates (D800–DB7F)
High Private Use Surrogates (DB80–DBFF)
Low Surrogates (DC00–DFFF)
Private Use Area (E000–F8FF)
CJK Compatibility Ideographs (F900–FAFF)
Alphabetic Presentation Forms (FB00–FB4F)
Arabic Presentation Forms-A (FB50–FDFF)
Variation Selectors (FE00–FE0F)
Vertical Forms (FE10–FE1F)
Combining Half Marks (FE20–FE2F)
CJK Compatibility Forms (FE30–FE4F)
Small Form Variants (FE50–FE6F)
Arabic Presentation Forms-B (FE70–FEFF)
Halfwidth and Fullwidth Forms (FF00–FFEF)
Specials (FFF0–FFFF)
Several scripts are expected to be included in the next revision of Unicode. These scripts, and their proposed code point ranges, are the following:

N'Ko (Mandekan) (07C0–07FF)
Balinese (1B00–1B7F)
Latin Extended-C (2C60–2C7F)
Phags-pa (A840–A87F)
Several other scripts are proposed for inclusion in the BMP, including:

Avestan & Pahlavi (0800–085F)
Cham (18B0–18FF)
Batak (1A20–1A5F)
Lanna (Old Tai Lue) (1A80–1AEF)
Lepcha (Rong) (1C00–1C4F)
Meithei/Manipuri (1C80–1CDF)
Santali (Ol Cemet' / Ol Chiki) (2DE0–2DFF)
Pollard Phonetic (A720–A77F)
Varang Kshiti (AA00–AA3F)
Sorang Sompeng (AA40–AA6F)
Saurashtra (AB00–AB5F)

Supplementary Multilingual Plane

Plane 1, the Supplementary Multilingual Plane, (SMP) is mostly used for historic scripts such as Linear B, but is also used for musical and mathematical symbols.
As of Unicode 4.1, Plane One includes the following scripts:

Linear B Syllabary (10000–1007F)
Linear B Ideograms (10080–100FF)
Aegean Numbers (10100–1013F)
Ancient Greek Numbers (10140–1018F)
Old Italic (10300–1032F)
Gothic (10330–1034F)
Ugaritic (10380–1039F)
Old Persian (103A0–103DF)
Deseret (10400–1044F)
Shavian (10450–1047F)
Osmanya (10480–104AF)
Cypriot Syllabary (10800–1083F)
Kharoshthi (10A00–10A5F)
Byzantine Musical Symbols (1D000–1D0FF)
Musical Symbols (1D100–1D1FF)
Ancient Greek Musical Notation (1D200–1D24F)
Tai Xuan Jing Symbols (1D300–1D35F)
Mathematical Alphanumeric Symbols (1D400–1D7FF)
Several scripts are expected to be included in the next revision of Unicode:

Phoenician
Sumero-Akkadian Cuneiform
Many other scripts are proposed for inclusion in Plane One, including:

Old Permic
Meroitic
Manichaean
Balti
Aramaic
South Arabian
Brahmi
Soyombo
Indus script
Tengwar
Cirth
Blissymbols
Basic Egyptian Hieroglyphics
Rod Numerals

thisisll 2005-10-12
The code page for Shift JIS is 932.
thisisll 2005-10-12
Use WideCharToMultiByte for this.
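
Putting the two replies together (WideCharToMultiByte with code page 932), a conversion might look roughly like the sketch below. `ToShiftJis` and `kShiftJisCodePage` are names chosen for this example. The `#ifdef _WIN32` guard is only there so the snippet still compiles on other platforms, where it returns an empty string. Note that Windows code page 932 is Microsoft's variant of Shift JIS, so a few characters may map differently from other Shift JIS tables.

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

const unsigned int kShiftJisCodePage = 932;

// Converts a NUL-terminated UTF-16 string to Shift JIS (code page 932) bytes.
// Returns an empty string on failure, or when not built for Windows.
std::string ToShiftJis(const wchar_t* text)
{
#ifdef _WIN32
    // First call: ask how many bytes are needed. Passing -1 as the input
    // length makes the count include the terminating NUL.
    int bytes = WideCharToMultiByte(kShiftJisCodePage, 0, text, -1,
                                    NULL, 0, NULL, NULL);
    if (bytes <= 0)
        return std::string();

    std::string out(static_cast<std::string::size_type>(bytes), '\0');

    // Second call: perform the conversion into the buffer.
    bytes = WideCharToMultiByte(kShiftJisCodePage, 0, text, -1,
                                &out[0], bytes, NULL, NULL);
    if (bytes <= 0)
        return std::string();

    out.resize(static_cast<std::string::size_type>(bytes - 1));  // drop the NUL
    return out;
#else
    (void)text;  // WideCharToMultiByte is a Win32 API; nothing to do here
    return std::string();
#endif
}
```

On Windows, `ToShiftJis(L"\x30A2")` (katakana A) should produce the two bytes 0x83 0x41. Characters with no mapping in code page 932 are replaced by a default character. Passing a non-NULL `lpUsedDefaultChar` as the last argument is one way to detect that.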
