How to strip redundant code from Word-generated HTML

hyzkui 2009-11-24 12:06:05

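I have tens of thousands of HTML files that were converted from Word, but each one has ballooned from a .doc of a little over 100 KB to several MB, or even tens of MB.

It turns out the conversion to HTML generates a huge amount of redundant markup. Is there a way to strip out this junk?

I need program code.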


6 replies


我是一只小小小的菜鸟 2010-05-19


Could the original poster send me a copy of the code you use to convert Word to HTML? I am looking for exactly that! QQ: 104517300
bq112972@126.com

hyzkui 2009-11-24


I was out of points a moment ago, but I have some again now, so I can add points to this thread.

hyzkui 2009-11-24


I misread it. What you posted is already C# code, haha.

hyzkui 2009-11-24


Thank you very much. Is there a C# version of the code?

fonvey 2009-11-24


The previous poster is impressive!

winner2050 2009-11-24


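using System.Collections.Specialized;
using System.Text.RegularExpressions;

/// <summary>
/// Cleans up the redundant HTML generated by Word.
/// </summary>
/// <param name="html">HTML markup exported from Word.</param>
/// <returns>The markup with Word-specific comments, attributes, and tags removed.</returns>
public static string CleanWordHtml(string html)
{
    StringCollection sc = new StringCollection();
    // Get rid of unnecessary tag spans (comments and title)
    sc.Add(@"<!--(\w|\W)+?-->");
    sc.Add(@"<title>(\w|\W)+?</title>");
    // Get rid of classes and styles (unquoted class names, single-quoted style values)
    sc.Add(@"\s?class=\w+");
    sc.Add(@"\s+style='[^']+'");
    // Get rid of unnecessary tags
    // Earlier variant that also removed <div> but kept <font> and <strong>:
    //sc.Add(@"<(meta|link|/?o:|/?style|/?div|/?st\d|/?head|/?html|body|/?body|/?span|!\[)[^>]*?>");
    sc.Add(@"<(meta|link|/?o:|/?style|/?font|/?strong|/?st\d|/?head|/?html|body|/?body|/?span|!\[)[^>]*?>");
    // Get rid of empty paragraph tags (elements that contain only &nbsp;)
    sc.Add(@"(<[^>]+>)+&nbsp;(</\w+>)+");
    // Remove bizarre v: attributes attached to <img> tags
    sc.Add(@"\s+v:\w+=""[^""]+""");
    // Remove extra blank lines (collapses three or more consecutive CRLF line breaks into one)
    sc.Add(@"(\n\r){2,}");
    // Apply every pattern in order, deleting whatever it matches
    foreach (string s in sc)
    {
        html = Regex.Replace(html, s, "", RegexOptions.IgnoreCase);
    }
    return html;
}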
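
Since the question is about tens of thousands of files, here is a minimal sketch of how a routine like CleanWordHtml above might be run over a whole folder. It is not from the thread: the class name WordHtmlBatchCleaner, the method names, the "*.htm*" file filter, and the two command-line arguments are hypothetical, and it assumes the files are UTF-8 (Word often saves in a local code page such as gb2312, in which case a different Encoding would have to be passed in). The regular expressions follow the reply above and are compiled once, because they get reused for every file.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public static class WordHtmlBatchCleaner
{
    private const RegexOptions Options = RegexOptions.IgnoreCase | RegexOptions.Compiled;

    // Patterns taken from CleanWordHtml, built once and reused for every file.
    private static readonly Regex[] Patterns =
    {
        new Regex(@"<!--(\w|\W)+?-->", Options),
        new Regex(@"<title>(\w|\W)+?</title>", Options),
        new Regex(@"\s?class=\w+", Options),
        new Regex(@"\s+style='[^']+'", Options),
        new Regex(@"<(meta|link|/?o:|/?style|/?font|/?strong|/?st\d|/?head|/?html|body|/?body|/?span|!\[)[^>]*?>", Options),
        new Regex(@"(<[^>]+>)+&nbsp;(</\w+>)+", Options),
        new Regex(@"\s+v:\w+=""[^""]+""", Options),
        new Regex(@"(\n\r){2,}", Options),
    };

    // Applies each pattern in order and deletes whatever it matches.
    public static string Clean(string html)
    {
        foreach (Regex pattern in Patterns)
        {
            html = pattern.Replace(html, "");
        }
        return html;
    }

    // Cleans every .htm/.html file under inputDir and writes the result to the
    // same relative path under outputDir, leaving the originals untouched.
    // Returns the number of files processed.
    public static int CleanDirectory(string inputDir, string outputDir, Encoding encoding)
    {
        int count = 0;
        foreach (string path in Directory.EnumerateFiles(inputDir, "*.htm*", SearchOption.AllDirectories))
        {
            string relative = path.Substring(inputDir.Length)
                .TrimStart(Path.DirectorySeparatorChar, Path.AltDirectorySeparatorChar);
            string target = Path.Combine(outputDir, relative);
            Directory.CreateDirectory(Path.GetDirectoryName(target));
            File.WriteAllText(target, Clean(File.ReadAllText(path, encoding)), encoding);
            count++;
        }
        return count;
    }

    public static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.Error.WriteLine("usage: WordHtmlBatchCleaner <inputDir> <outputDir>");
            return;
        }
        // Assumption: the input files are UTF-8.
        int processed = CleanDirectory(args[0], args[1], Encoding.UTF8);
        Console.WriteLine("Cleaned " + processed + " file(s).");
    }
}
```

Writing to a separate output folder is a deliberate choice in this sketch: the patterns are aggressive (they drop every class, style, span and font), so it seems safer to compare a few cleaned files against the originals before discarding anything.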