Using a regular expression to convert Unicode escapes in a web page to Chinese characters

moon5284 2011-06-16 07:55:30

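As the title says. I am desperate and begging for help! Given
string test = @"<h1 id=folder_view_heading>\u6536\u4ef6\u7bb1</h1><span>\u6240\u6709\u90ae\u4ef6</span>";
I want to replace the hexadecimal Unicode escapes in the page, such as \u4ef6, with the corresponding Chinese characters and output the result...

The regex in the method below never matches. Could an expert please advise and test it: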

public static string UnicodeToGB(string content)
{
    Regex objRegex = new Regex(@"\\u([a-zA-Z0-9]{4});", RegexOptions.IgnoreCase);//&#(?<UnicodeCode>[\\d]{5})
    Match objMatch = objRegex.Match(content);
    StringBuilder sb = new StringBuilder(content);
    while (objMatch.Success)
    {
        string code = Convert.ToString(Convert.ToInt32(objMatch.Result("${UnicodeCode}"), 16), 16);
        byte[] array = new byte[2];
        array[0] = (byte)Convert.ToInt32(code.Substring(2), 16);
        array[1] = (byte)Convert.ToInt32(code.Substring(0, 2), 16);

        sb.Replace(objMatch.Value, Encoding.Unicode.GetString(array));

        objMatch = objMatch.NextMatch();
    }
    return sb.ToString();
}

Please make sure your answer actually passes a test before posting.


lstzr 2011-10-16


+1 to reply #3.

moon5284 2011-06-17


[Quote=Quoting reply #4 by lvyichang:]
If you want to extract them with a regex, it should be:

C# code
string str = @"<h1 id=folder_view_heading>\u6536\u4ef6\u7bb1</h1><span>\u6240\u6709\u90ae\u4ef6</span>";
Regex reg = new Regex(@"\\u\w{4}");
MatchCol……
[/Quote]

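Thanks to reply #3 for the wonderfully concise solution, and thanks to reply #4 for understanding exactly what I meant. Here is the solution based on reply #4, posted for everyone's reference: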

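// Requires: using System; using System.Text; using System.Text.RegularExpressions;
public static string UnicodeToGB(string content)
{
    // The named group captures the four hex digits after the literal "\u".
    // Earlier attempts that did not match: &#(?<UnicodeCode>[\\d]{5}) and @"\\u([a-zA-Z0-9]{4});"
    Regex objRegex = new Regex(@"\\u(?<UnicodeCode>[0-9a-fA-F]{4})", RegexOptions.IgnoreCase);
    Match objMatch = objRegex.Match(content);
    StringBuilder sb = new StringBuilder(content);
    while (objMatch.Success)
    {
        // Use the captured digits directly so that leading zeros are preserved.
        string code = objMatch.Groups["UnicodeCode"].Value;
        byte[] array = new byte[2];
        // Encoding.Unicode is UTF-16 little-endian, so the low byte goes first.
        array[0] = (byte)Convert.ToInt32(code.Substring(2), 16);
        array[1] = (byte)Convert.ToInt32(code.Substring(0, 2), 16);

        // Replaces every occurrence of this escape in the buffer.
        sb.Replace(objMatch.Value, Encoding.Unicode.GetString(array));

        objMatch = objMatch.NextMatch();
    }
    return sb.ToString();
}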

Reply #3's approach also solves the problem. Thanks again!

jeogegxs 2011-06-16


+1 to reply #3.

lvyichang 2011-06-16


If you want to extract them with a regex, it should be:

        string str = @"<h1 id=folder_view_heading>\u6536\u4ef6\u7bb1</h1><span>\u6240\u6709\u90ae\u4ef6</span>";
        Regex reg = new Regex(@"\\u\w{4}");
        MatchCollection mc = reg.Matches(str);
        foreach (Match ma in mc)
        {
            string s = Regex.Unescape(ma.Value);
            //.......
        }

q107770540 2011-06-16

void Main()
{
    string test = @"<h1 id=folder_view_heading>\u6536\u4ef6\u7bb1</h1><span>\u6240\u6709\u90ae\u4ef6</span>";
    Console.WriteLine(Regex.Unescape(test));
    //<h1 id=folder_view_heading>收件箱</h1><span>所有邮件</span>
}
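
One thing to keep in mind about the replies above: Regex.Unescape interprets every backslash sequence it recognizes (for example \n or \t), not only \uXXXX, so it may also alter other backslashes in the page. If only the \uXXXX escapes should be decoded, a middle ground between the two approaches might look like the sketch below. The class name UnicodeEscapes and the method name Decode are made up for illustration and do not come from the thread; it uses Regex.Replace with a MatchEvaluator instead of the manual Match/NextMatch loop.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper; the class and method names are illustrative only.
public static class UnicodeEscapes
{
    // Matches a literal backslash, 'u', then exactly four hex digits.
    private static readonly Regex EscapePattern =
        new Regex(@"\\u([0-9a-fA-F]{4})");

    public static string Decode(string content)
    {
        // The evaluator runs once per match, so all escapes are replaced in a single pass.
        return EscapePattern.Replace(content, m =>
        {
            // Parse the four captured hex digits as one UTF-16 code unit.
            int codeUnit = int.Parse(m.Groups[1].Value,
                                     NumberStyles.HexNumber,
                                     CultureInfo.InvariantCulture);
            return ((char)codeUnit).ToString();
        });
    }
}
```

Calling UnicodeEscapes.Decode(test) on the string from the original question should give the same output as reply #3, while a sequence such as \n in the input is left untouched because the pattern only matches \u followed by four hex digits.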