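Experts, please advise: how do I get all the links on an HTML page? Is this the right approach? (Waiting online; I will close the thread as soon as it is solved. Many thanks!)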

coolfire729 2003-02-08 11:19:18

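How can I use C# to get all the links on an HTML page?
Do I parse the page's HTML source, scan it character by character for keywords such as <a href=>, and save what I find?
Is there a better way? The result should be similar to FlashGet's "download all links" feature. Experts, please advise (a code example would be ideal). Thanks!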

25 replies (newest first)

chestnuts 2003-02-08

I'm so grateful I could cry.
The money I spent on Internet access wasn't wasted.

chestnuts 2003-02-08

OK.

coolfire729 2003-02-08

OK.
Awarding the points.

chestnuts 2003-02-08

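The bug? I've already fixed it.
It's this statement. It differs only slightly from the original; you should be able to spot the difference.
Regex.Match(sHTMLContent, "href\\s*=\\s*(?:[\"'](?<1>[^\"]*)|(?<1>\\S+)[\"'])", RegexOptions.IgnoreCase);
Only this one line changes; edit yours to match.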

coolfire729 2003-02-08

Great, great, ha ha. Let me say it once more: thank you!

chestnuts 2003-02-08

Enough talk. Hand over the points.

coolfire729 2003-02-08

So how do I fix that bug?

chestnuts 2003-02-08

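My NetEase POPO ID:
chestnuts@netease.com
Check MSDN; its table of contents has a section on regular expressions.
It is hard to follow, though.
The best approach is to take the source code I gave you and look things up in Dynamic Help.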

coolfire729 2003-02-08

Great, as it happens I still don't know where the bug in this code is. Thank you (sniffling with gratitude...)

coolfire729 2003-02-08

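Ha ha, thank you 橙子鸟, and thanks to all the experts.
By the way, what is a regular expression? I've heard of them for a long time but never looked into them. Any pointers would be welcome.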

chestnuts 2003-02-08

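If you need it, I can help you modify the source.
My earlier version was something I tinkered with for fun. I lost it to a System Restore, but I still remember exactly how it worked.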

chestnuts 2003-02-08

That one is ASP.NET; you can't see its source at all!
What I gave you is damn well correct! The best!
From now on, when you see me, don't doubt me!
Just give me the points!

coolfire729 2003-02-08

http://www.planetsourcecode.com/vb/scripts/ShowCode.asp?txtCodeId=861&lngWId=10
I've had a look and am downloading the code now. It looks very good. Thank you.

coolfire729 2003-02-08

Taking a look... Thanks in advance to all the experts; let me look first.

How does http://www.softwaremaker.net/DotNetApps/HTMLContentParser/Index.aspx do it? I can't see the source code.

chestnuts 2003-02-08

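This program has no part that analyzes the things to be downloaded.
For other resources you want to download, parse the src attribute instead.
I once built my own HTML parsing program based on this one.
Unfortunately, a System Restore wiped it out :(
I suggest you study this program carefully; it is small and very elegant.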
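
Editorial sketch of the "parse src" suggestion above, in C# since that is what the original question asks for. This is not chestnuts's lost program; the class name `AttributeExtractor` and its `Extract` method are hypothetical, and the pattern is an assumption that covers only double-quoted, single-quoted, and unquoted attribute values. It can be fooled by a `>` or an attribute-like string inside another attribute's value.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class AttributeExtractor
{
    // Returns the value of `attribute` from every <tag ...> found in `html`.
    public static List<string> Extract(string html, string tag, string attribute)
    {
        // Accepted value forms: "double-quoted", 'single-quoted', or unquoted
        // (an unquoted value ends at whitespace, a quote, or '>').
        string pattern =
            "<" + Regex.Escape(tag) + @"\b[^>]*?\b" + Regex.Escape(attribute) +
            @"\s*=\s*(?:""(?<v>[^""]*)""|'(?<v>[^']*)'|(?<v>[^\s>""']+))";

        var values = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
        {
            values.Add(m.Groups["v"].Value);
        }
        return values;
    }
}
```

Calling `AttributeExtractor.Extract(html, "img", "src")` would collect image addresses, and `Extract(html, "a", "href")` would collect link targets. The values come back exactly as written in the page, so relative addresses still need to be resolved against the page URL before downloading.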

chestnuts 2003-02-08

By the way, the bug report at the very bottom is the one I filed :)

chestnuts 2003-02-08

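The parsing isn't done the way described above.
Instead, you parse href,
specifically with a regular expression:
Regex.Match(sHTMLContent, "href\\s*=\\s*(?:[\"'](?<1>[^\"]*)|(?<1>\\S+)[\"'])", RegexOptions.IgnoreCase);
Here is the simplest program I know of. It has a bug, but once you study it carefully you'll know how to fix it.
See
http://www.planetsourcecode.com/vb/scripts/ShowCode.asp?txtCodeId=861&lngWId=10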
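
Editorial sketch: for readers who want a complete loop around the href idea, the following C# shows one way to collect every match rather than only the first. It is not chestnuts's code. The class name `LinkExtractor` and the pattern are assumptions; the pattern handles double-quoted, single-quoted, and unquoted values, including the unquoted form `<a href=/Expert/topic/3242.htm </a>` raised elsewhere in this thread. A regular expression is not a full HTML parser, so unusual markup may still slip through.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class LinkExtractor
{
    // href = "double-quoted" | 'single-quoted' | unquoted.
    // .NET allows the same group name in each alternative.
    private static readonly Regex HrefPattern = new Regex(
        @"href\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)'|(?<url>[^\s>""']+))",
        RegexOptions.IgnoreCase);

    // Returns the raw href values in document order.
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            links.Add(m.Groups["url"].Value);
        }
        return links;
    }
}
```

`Regex.Matches` walks the whole input, which is the same effect as calling `Match` once and then `NextMatch()` in a loop.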

coolfire729 2003-02-08

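I'm a beginner and haven't tried this yet; it seems a bit complicated.

to ar7_top(黑白呸，男生女生呸): regarding "then look for href=" " string pairs": some links look like this: <a href=/Expert/topic/3242.htm </a>
What should I do about those?

Is there a better way, for example an interface on the browser ActiveX control?

Also, the downloaded HTML page doesn't include its images. Why is that? Do I need to find them the same way as the links and download them separately?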

孟子E章 2003-02-08

http://www.softwaremaker.net/DotNetApps/HTMLContentParser/Index.aspx

孟子E章 2003-02-08

Here is one in VB.NET.

This code goes into a class file called HTMLContentParser.vb
    '///////////////////////////
    Imports System
    Imports System.Collections
    Imports System.IO
    Imports System.Net
    Imports System.Text
    Imports System.Text.RegularExpressions
    Imports Microsoft.VisualBasic

    Public Class HTMLContentParser

        'Download the page at sURL and return its HTML.
        'On failure, the exception message is returned instead.
        Function Return_HTMLContent(ByVal sURL As String) As String
            Dim sStream As Stream
            Dim URLReq As HttpWebRequest
            Dim URLRes As HttpWebResponse
            Try
                URLReq = CType(WebRequest.Create(sURL), HttpWebRequest)
                URLRes = CType(URLReq.GetResponse(), HttpWebResponse)
                sStream = URLRes.GetResponseStream()
                Return New StreamReader(sStream).ReadToEnd()
            Catch ex As Exception
                Return ex.Message
            End Try
        End Function

        'Return the href value of each matched link as an absolute URL.
        'Note: "a.*" is greedy, so when several links share one line,
        'only the last href on that line is captured.
        Function ParseHTMLLinks(ByVal sHTMLContent As String, ByVal sURL As String) As ArrayList
            Dim rRegEx As Regex
            Dim mMatch As Match
            Dim aMatch As New ArrayList()
            rRegEx = New Regex("a.*href\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))", _
                RegexOptions.IgnoreCase Or RegexOptions.Compiled)
            mMatch = rRegEx.Match(sHTMLContent)
            While mMatch.Success
                Dim sMatch As String
                sMatch = ProcessURL(mMatch.Groups(1).ToString, sURL)
                aMatch.Add(sMatch)
                mMatch = mMatch.NextMatch()
            End While
            Return aMatch
        End Function

        'Return the src value of each matched image as an absolute URL.
        'The greedy "img.*" has the same one-match-per-line limitation.
        Function ParseHTMLImages(ByVal sHTMLContent As String, ByVal sURL As String) As ArrayList
            Dim rRegEx As Regex
            Dim mMatch As Match
            Dim aMatch As New ArrayList()
            rRegEx = New Regex("img.*src\s*=\s*(?:""(?<1>[^""]*)""|(?<1>\S+))", _
                RegexOptions.IgnoreCase Or RegexOptions.Compiled)
            mMatch = rRegEx.Match(sHTMLContent)
            While mMatch.Success
                Dim sMatch As String
                sMatch = ProcessURL(mMatch.Groups(1).ToString, sURL)
                aMatch.Add(sMatch)
                mMatch = mMatch.NextMatch()
            End While
            Return aMatch
        End Function

        Private Function ProcessURL(ByVal sInput As String, ByVal sURL As String) As String
            'Find out if the sURL has a "/" after the Domain Name
            'If not, give a "/" at the end
            'First, check out for any slash after the
            'Double Dashes of the http://
            'If there is NO slash, then end the sURL string with a SLASH
            If InStr(8, sURL, "/") = 0 Then
                sURL += "/"
            End If

            'FILTERING
            'Filter down to the Domain Name Directory from the Right
            Dim iCount As Integer
            For iCount = sURL.Length To 1 Step -1
                If Mid(sURL, iCount, 1) = "/" Then
                    sURL = Left(sURL, iCount)
                    Exit For
                End If
            Next

            'Filter out the "&gt;" from the Left
            For iCount = 1 To sInput.Length
                If Mid(sInput, iCount, 4) = "&gt;" Then
                    sInput = Left(sInput, iCount - 1) 'Stop and Take the Char before
                    Exit For
                End If
            Next

            'Filter out unnecessary Characters
            sInput = sInput.Replace("&lt;", Chr(39))
            sInput = sInput.Replace("&gt;", Chr(39))
            sInput = sInput.Replace("&quot;", "")
            sInput = sInput.Replace("'", "")

            'Relative links are joined onto the page directory; absolute ones pass through
            If (sInput.IndexOf("http://") < 0) Then
                If (Not (sInput.StartsWith("/")) And Not (sURL.EndsWith("/"))) Then
                    Return sURL & "/" & sInput
                Else
                    If (sInput.StartsWith("/")) And (sURL.EndsWith("/")) Then
                        Return sURL.Substring(0, sURL.Length - 1) + sInput
                    Else
                        Return sURL + sInput
                    End If
                End If
            Else
                Return sInput
            End If
        End Function

    End Class
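
Editorial sketch: the last step of the class above turns a relative link into an absolute one by string concatenation. As an alternative, and not part of the posted program, the .NET `System.Uri` class can do the resolution, including root-relative links such as /Expert/topic/3242.htm and `..` segments. The class name `UrlResolver` is hypothetical and the example.com addresses in the note below are placeholders.

```csharp
using System;

// Hypothetical helper, named for illustration only.
public static class UrlResolver
{
    // Resolves `link` against the page it was found on.
    // Returns null when either argument cannot be parsed as a URI.
    public static string Resolve(string pageUrl, string link)
    {
        Uri baseUri;
        Uri absolute;
        if (!Uri.TryCreate(pageUrl, UriKind.Absolute, out baseUri))
        {
            return null;
        }
        if (!Uri.TryCreate(baseUri, link, out absolute))
        {
            return null;
        }
        return absolute.AbsoluteUri;
    }
}
```

With a page URL of http://example.com/dir/page.htm, the link img/a.gif resolves under /dir/, a link starting with "/" resolves from the site root, and a link that is already absolute is returned unchanged.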