How do I filter HTML down to plain text, the way IE can save a page as a .txt file?

shyworm 2003-06-20 02:13:52

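After fetching a page with WebClient.DownloadData, I want to filter it down to plain text (with no HTML formatting information). Any help would be appreciated.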

11 replies

shyworm 2003-06-23

Thanks to everyone for the help, the problem is solved!

cdshelf 2003-06-21

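It is best to convert <BR>, <HR>, <TR>, <P> and <DIV> into newline characters. That preserves the original line breaks, so the resulting text stays readable.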
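A minimal sketch of that suggestion, assuming the page is already in a string. The class and method names (HtmlLineBreaks, ToTextWithBreaks) are made up for illustration and do not come from the thread; regular expressions are only an approximation of real HTML parsing.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlLineBreaks
{
    // Hypothetical helper: turns block-level tags into newlines,
    // then strips whatever tags remain.
    public static string ToTextWithBreaks(string html)
    {
        // <br>, <hr> and the opening/closing forms of <tr>, <p>, <div> become "\n".
        // The \b keeps tags such as <pre> from matching the "p" alternative.
        string s = Regex.Replace(html, @"<\s*/?\s*(br|hr|tr|p|div)\b[^>]*>", "\n",
                                 RegexOptions.IgnoreCase);
        // Remove every other tag.
        s = Regex.Replace(s, @"<[^>]+>", "");
        // Collapse a newline plus any surrounding whitespace into one newline.
        s = Regex.Replace(s, @"[ \t]*\n\s*", "\n");
        return s.Trim();
    }
}
```

For example, HtmlLineBreaks.ToTextWithBreaks("<div>Hello</div><p>World<br>again</p>") yields three lines: Hello, World, again. Script and style blocks should be removed before this step, as the other replies describe.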

nean 2003-06-21

Leaving a marker here and bumping the thread, heh.

redflute 2003-06-21

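If your system has no Japanese input method installed, you may see some garbled characters, but that doesn't matter: they are all comments!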

redflute 2003-06-21

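Just yesterday I happened to come across a program like this, webget.cs. Compile it yourself and you can use it.
The comments are in Japanese, though; I'll translate them as best I can.

-------------********* Source code *******---------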

using System;
using System.Drawing;
using System.Collections;
using System.ComponentModel;
using System.Windows.Forms;
using System.Data;
using System.Diagnostics;
using System.IO;
using System.Text;
using System.Web;
using System.Net;

namespace HTMLGet
{
/// <summary>
/// Summary description for Form1.
/// </summary>
public class Form1 : System.Windows.Forms.Form
{
private System.Windows.Forms.Button button1;
/// <summary>
/// Required designer variable.
/// </summary>
private System.ComponentModel.Container components = null;

public Form1()
{
//
// Required for Windows Form Designer support.
InitializeComponent();

//
// TODO: Add any constructor code after the InitializeComponent call.
}

/// <summary>
/// Clean up any resources being used.
/// </summary>
protected override void Dispose( bool disposing )
{
if( disposing )
{
if (components != null)
{
components.Dispose();
}
}
base.Dispose( disposing );
}

#region Windows Form Designer generated code
/// <summary>
/// Required method for Designer support. Do not modify
/// the contents of this method with the code editor.
/// </summary>
private void InitializeComponent()
{
this.button1 = new System.Windows.Forms.Button();
this.SuspendLayout();
//
// button1
//
this.button1.Location = new System.Drawing.Point(136, 48);
this.button1.Name = "button1";
this.button1.TabIndex = 0;
this.button1.Text = "button1";
this.button1.Click += new System.EventHandler(this.button1_Click);
//
// Form1
//
this.AutoScaleBaseSize = new System.Drawing.Size(5, 12);
this.ClientSize = new System.Drawing.Size(292, 266);
this.Controls.AddRange(new System.Windows.Forms.Control[] {
this.button1});
this.Name = "Form1";
this.Text = "Form1";
this.ResumeLayout(false);

}
#endregion

/// <summary>
/// The main entry point for the application.
/// </summary>
[STAThread]
static void Main()
{
Application.Run(new Form1());
}

private void button1_Click(object sender, System.EventArgs e)
{
Read();
}

private void Read()
{
Stream stream = null;
StreamReader sr = null;

try
{
WebRequest webReq = HttpWebRequest.Create( "http://xww.fxsz.com.cn" );
webReq.Method = "GET";
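// Timeout is given in milliseconds. The original comment says "time out after 1 second",
// but 1000000 ms is roughly 16.7 minutes; use 1000 for a 1-second timeout.
webReq.Timeout = 1000000;
// Use IE's proxy settings.
webReq.Proxy = System.Net.WebProxy.GetDefaultProxy();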

WebResponse webRes = webReq.GetResponse();
// Get the response stream.
stream = webRes.GetResponseStream();
// Wrap it in a StreamReader so the text can be handled line by line.
// "x-euc-jp" is a Japanese encoding; change it to match the page being fetched.
sr=new System.IO.StreamReader(stream, Encoding.GetEncoding("x-euc-jp"));

string str;
str = sr.ReadToEnd();
MessageBox.Show(str);
//Debug.WriteLine(str);
}
catch (Exception exc)
{
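// Replace it with an easier-to-understand message,
// keeping the original exception as the inner exception.
//throw(new Exception("Could not connect to xxx."));
throw new Exception("Could not connect to XXX", exc);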
}
finally
{
if (sr != null) sr.Close();
if (stream != null) stream.Close();
}
}
}
}

benzite 2003-06-21

Here is a text-extraction routine quoted from someone else; give it a try.
s is the source code of the web page.

private String fetchText(String s)
{
//Requires: using System; using System.Text.RegularExpressions;
//Filter out HTML and JavaScript from the page, leaving only body text
s = Convert.ToString(Regex.Match(s, @"<body.+?</body>", RegexOptions.Singleline | RegexOptions.IgnoreCase)); //strip everything but <BODY>
s = Regex.Replace(s, "<script[^>]*?>.*?</script>", "", RegexOptions.Singleline | RegexOptions.IgnoreCase); //strip JavaScript
s = Regex.Replace(s, "<[^>]*>", ""); //strip HTML tags
s = Regex.Replace(s, "&(copy|#169);|&(quot|#34);|&(amp|#38);|&(lt|#60);|&(gt|#62);|&(nbsp|#160);|&(iexcl|#161);|&(cent|#162);|&(pound|#163);|·", " "); //replace common entities with a space
s = s.Replace("\t", " "); //strip tabs
s = Regex.Replace(s, "([\r\n])+", " "); //strip carriage returns
s = Regex.Replace(s, "\\s\\s+", " "); //strip white space (must be last)
return s.Trim();
}
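The routine above replaces entities with spaces, so a literal "&lt;" in the page text is lost. If the characters should be kept, one option is to decode the entities instead. The sketch below assumes a runtime that provides System.Net.WebUtility (.NET Framework 4.0 or later, or .NET Core); the class and method names are hypothetical and not part of the quoted program.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class EntityDecodeDemo
{
    // Hypothetical helper: strips tags, then decodes HTML entities
    // rather than replacing them with spaces.
    public static string StripTagsAndDecode(string html)
    {
        // Strip tags first, so that a decoded "<" is never mistaken for a tag.
        string s = Regex.Replace(html, @"<[^>]+>", "");
        // Turn &lt; &amp; &nbsp; and numeric entities into the characters they stand for.
        s = WebUtility.HtmlDecode(s);
        // &nbsp; decodes to U+00A0; make it an ordinary space.
        s = s.Replace('\u00A0', ' ');
        // Collapse runs of whitespace.
        s = Regex.Replace(s, @"\s+", " ");
        return s.Trim();
    }
}
```

With this version, "<p>1 &lt; 2 &amp;&nbsp;3 &gt; 2</p>" comes out as "1 < 2 & 3 > 2". The body extraction and script removal steps from the routine above would still run before it.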

saucer 2003-06-20

Do multiple passes, feeding the result of each pass into the next (Singleline lets "." match across line breaks inside script and style blocks), for example:

str = System.Text.RegularExpressions.Regex.Replace(YourString,@"<script[^>]*>.*?</script>","",RegexOptions.IgnoreCase | RegexOptions.Singleline);

str = System.Text.RegularExpressions.Regex.Replace(str,@"<style[^>]*>.*?</style>","",RegexOptions.IgnoreCase | RegexOptions.Singleline);

str = System.Text.RegularExpressions.Regex.Replace(str,@"<[^>]+>","");

shyworm 2003-06-20

net_lover(孟子E章): What are you talking about?

I don't need to write HTML code; I need C# code that processes HTML and converts it to plain text.

孟子E章 2003-06-20

<body onclick="alert(document.documentElement.innerText)">
<a href="xxxxxxxxxxxx">dddddddddd</a>

shyworm 2003-06-20

You are right, but it does not work with <style> or <script>.
Still, your approach is useful, thanks a lot!

Any more hints?

saucer 2003-06-20

Use regular expressions to filter out tags, for example (this might not always work, e.g. with <script>...):

str = System.Text.RegularExpressions.Regex.Replace(YourString,@"<[^>]+>","");
