VB.NET: how to read a web page's content (without the markup)

apple_37 2008-06-15 12:27:00

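I want to get the visible content of a web page (for example www.baidu.com), but what I get back is full of markup, like this:
<html> <head> <meta http-equiv=Content-Type content="text/html;charset=gb2312"> <title>百度一下，你就知道
......

Is it possible to keep only the visible text (what you see when browsing the page), without markup such as <html> <head> <, so that the result looks like this:

"百度一下，你就知道 "

My program is below. I hope someone can help me modify it so that the markup is stripped out.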

Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
    Dim Doc As New System.Net.WebClient
    Dim TempText As String
    ' GetString decodes the downloaded bytes directly into a String.
    TempText = System.Text.Encoding.Default.GetString(Doc.DownloadData("http://quality.w2.cdnhost.cn/1.htm"))

    RichTextBox1.Text = TempText
End Sub


a523194491 2008-06-16

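// Requires: using System.Text.RegularExpressions;
public static string FilterScript(string content)
{
    if (string.IsNullOrEmpty(content))
    {
        return content;
    }
    // Remove whole <script>...</script> blocks. The lazy quantifier (*?) ends
    // each match at the first closing tag, so text between two scripts survives.
    string regexstr = @"<script[^>]*>[\s\S]*?</script[^>]*>";
    content = Regex.Replace(content, regexstr, string.Empty, RegexOptions.IgnoreCase);
    // Remove any unpaired opening or closing script tags that are left over.
    content = Regex.Replace(content, "<script[^>]*>", string.Empty, RegexOptions.IgnoreCase);
    return Regex.Replace(content, "</script[^>]*>", string.Empty, RegexOptions.IgnoreCase);
}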

a523194491 2008-06-16

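public static string RemoveHtml(string content)
{
    // Strip script blocks first so their contents do not survive as text.
    string newstr = FilterScript(content);
    if (string.IsNullOrEmpty(newstr))
    {
        return newstr;
    }
    // Then remove every remaining tag, keeping only the text between tags.
    string regexstr = @"<[^>]*>";
    return Regex.Replace(newstr, regexstr, string.Empty, RegexOptions.IgnoreCase);
}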
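
Since the question is about VB.NET, the two C# helpers in these replies could be ported roughly as follows. This is only a sketch: the module name HtmlText is made up for illustration, and the script pattern uses a lazy quantifier so that each script block ends at its own closing tag.

```vbnet
Imports System.Text.RegularExpressions

' Hypothetical module name; place it anywhere in the project.
Module HtmlText
    ' Removes <script>...</script> blocks, then any stray script tags.
    Public Function FilterScript(ByVal content As String) As String
        If String.IsNullOrEmpty(content) Then Return content
        content = Regex.Replace(content, "<script[^>]*>[\s\S]*?</script[^>]*>", String.Empty, RegexOptions.IgnoreCase)
        content = Regex.Replace(content, "<script[^>]*>", String.Empty, RegexOptions.IgnoreCase)
        Return Regex.Replace(content, "</script[^>]*>", String.Empty, RegexOptions.IgnoreCase)
    End Function

    ' Removes scripts first, then every remaining tag.
    Public Function RemoveHtml(ByVal content As String) As String
        Dim withoutScripts As String = FilterScript(content)
        If String.IsNullOrEmpty(withoutScripts) Then Return withoutScripts
        Return Regex.Replace(withoutScripts, "<[^>]*>", String.Empty)
    End Function
End Module
```

In the button handler from the question, the last line would then become RichTextBox1.Text = HtmlText.RemoveHtml(TempText).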

it_gz_xi 2008-06-16

Just passing by to learn from this.

panxuejian 2008-06-16

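You can simply use the WebBrowser control.

To read the content of the title tag, use WebBrowser.Document.Title directly.

The same control also gives you the page source, the character set, and other information about the page.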
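
A minimal sketch of the WebBrowser approach is below. It assumes a Windows Forms form whose designer already contains controls named Button1, WebBrowser1 and RichTextBox1 (these names are assumptions, not taken from the reply). Besides Document.Title, it reads Document.Body.InnerText, which returns the text of the page body without tags.

```vbnet
Imports System.Windows.Forms

' Assumes the designer part of Form1 declares Button1, WebBrowser1 and RichTextBox1.
Public Class Form1
    Private Sub Button1_Click(ByVal sender As Object, ByVal e As EventArgs) Handles Button1.Click
        ' Navigation is asynchronous; the text is read in DocumentCompleted.
        WebBrowser1.Navigate("http://www.baidu.com")
    End Sub

    Private Sub WebBrowser1_DocumentCompleted(ByVal sender As Object, ByVal e As WebBrowserDocumentCompletedEventArgs) Handles WebBrowser1.DocumentCompleted
        Dim doc As HtmlDocument = WebBrowser1.Document
        If doc Is Nothing OrElse doc.Body Is Nothing Then Return

        Me.Text = doc.Title                   ' content of the <title> tag
        RichTextBox1.Text = doc.Body.InnerText ' body text without markup
    End Sub
End Class
```

DocumentCompleted can fire more than once on pages that contain frames, so the text box may be refreshed several times before the page has finished loading.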

mfineky 2008-06-15

' Requires at the top of the file: Imports System.Text.RegularExpressions
Public html As String

Public Sub Button1_Click(ByVal sender As Object, ByVal e As System.EventArgs) Handles Button1.Click
    Dim Doc As New System.Net.WebClient
    Dim TempText As String
    ' GetString decodes the downloaded bytes directly into a String.
    TempText = System.Text.Encoding.Default.GetString(Doc.DownloadData("http://www.baidu.com"))

    TextBox1.Text = TempText

    html = TempText
End Sub

Protected Sub Button2_Click(ByVal sender As Object, ByVal e As System.EventArgs) Handles Button2.Click
    TextBox2.Text = checkStr(html)
End Sub

Public Function checkStr(ByVal html As String) As String
    If String.IsNullOrEmpty(html) Then Return html

    Dim opts As RegexOptions = RegexOptions.IgnoreCase

    ' Lazy quantifiers (*?) stop at the first closing tag instead of the last one.
    html = Regex.Replace(html, "<script[\s\S]*?</script *>", "", opts) ' remove <script></script> blocks
    html = Regex.Replace(html, " href *= *[^>]*?script *:", "", opts) ' remove href=javascript: on <a> tags
    html = Regex.Replace(html, " on\w+ *=", " _disibledevent=", opts) ' disable on... event attributes
    html = Regex.Replace(html, "<iframe[\s\S]*?</iframe *>", "", opts) ' remove iframe blocks
    html = Regex.Replace(html, "<frameset[\s\S]*?</frameset *>", "", opts) ' remove frameset blocks
    html = Regex.Replace(html, "<img[^>]+>", "", opts) ' remove img tags
    html = Regex.Replace(html, "</?p>", "", opts) ' remove <p> and </p> tags
    html = Regex.Replace(html, "<[^>]*>", "") ' remove every remaining tag, including <strong>
    html = html.Replace("&nbsp;", " ") ' turn non-breaking-space entities into plain spaces

    Return html
End Function
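There still seem to be some errors, but this is the general idea: use regular expressions to filter out the markup. You can refine the regular expressions yourself.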
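
One possible refinement, sketched here under assumptions: after the tags are gone, the text can still contain entities such as &amp; or &nbsp; and long runs of whitespace. The helper below (the names TextCleanup and TidyText are made up for illustration) decodes entities with System.Net.WebUtility.HtmlDecode, which is available from .NET Framework 4.0, and collapses whitespace.

```vbnet
Imports System.Net
Imports System.Text.RegularExpressions

' Hypothetical helper; run it on text whose tags have already been stripped.
Module TextCleanup
    Public Function TidyText(ByVal stripped As String) As String
        If String.IsNullOrEmpty(stripped) Then Return stripped

        ' Turn entities such as &amp; and &nbsp; into the characters they stand for.
        Dim decoded As String = WebUtility.HtmlDecode(stripped)

        ' &nbsp; decodes to U+00A0; treat it like an ordinary space.
        decoded = decoded.Replace(ChrW(160), " "c)

        ' Collapse runs of spaces, tabs and line breaks into one space.
        Return Regex.Replace(decoded, "\s+", " ").Trim()
    End Function
End Module
```

It could be applied to the result of the filter, for example TextBox2.Text = TextCleanup.TidyText(checkStr(html)).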

apple_37 2008-06-15