Help needed: scraping website content with C#

hunanseo 2009-11-15 05:23:40

http://video.baidu.com/v?word=nba&ct=301989888&rn=20&pn=0&db=0&s=0&fbl=1024
I want to extract entries like the following from that page's HTML source:

<a href="http://v.ku6.com/show/fkD6WckjXMA_CLcY.html" onmousedown="return vc(this,'3602152116,1743565720','19',1,'16','08:45')" title="NBA精彩突破过人镜头" target="_blank"><img src="http://v1.baidu.com/itn?u=3602152116,1743565720" alt="NBA精彩突破过人镜头"></a>

How should I write this in a C# WinForms application? Thanks.

Offering 20 points for help. I don't have many points to give.

12 replies (newest first)

hunanseo 2009-11-17

Thanks. Closing the thread.

-过客- 2009-11-16

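[Quote=Quoting hunanseo's reply on floor 8:]
But I'd like to know: what is wrong with my regex?
As soon as it matches, it grabs a huge chunk of HTML.
[/Quote]

Every quantifier in your pattern is greedy, and each one is applied to the dot ".", which matches almost any character, so of course each match swallows a large chunk of the page.

Reference:
"Regex Basics: Greedy and Non-Greedy Modes"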
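
The difference can be seen on a small input. The following is a minimal sketch, not code from the thread: the class name GreedyVsLazyDemo, the helper FirstUrl, and the two-link sample string are all made up for illustration. It compares the greedy `.+` with the lazy `.+?` and with a negated character class `[^"]+`.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // Hypothetical sample: two links on one line, like the scraped result page.
    public const string Sample =
        "<a href=\"http://example.com/1.html\" title=\"first\"></a>" +
        "<a href=\"http://example.com/2.html\" title=\"second\"></a>";

    // Returns the "url" group of the first match of the pattern in Sample.
    public static string FirstUrl(string pattern)
    {
        return Regex.Match(Sample, pattern).Groups["url"].Value;
    }

    public static void Main()
    {
        // Greedy: .+ runs to the end of the line and backtracks only to the
        // LAST double quote, so the group spans both links.
        Console.WriteLine(FirstUrl("href=\"(?<url>.+)\""));

        // Lazy: .+? stops at the first double quote it can.
        // prints http://example.com/1.html
        Console.WriteLine(FirstUrl("href=\"(?<url>.+?)\""));

        // Negated class: [^"]+ can never cross a double quote.
        // prints http://example.com/1.html
        Console.WriteLine(FirstUrl("href=\"(?<url>[^\"]+)\""));
    }
}
```

For attribute values, the negated class is usually the safer choice of the two fixes, because it cannot run past the closing quote even when a later part of the pattern fails and the engine backtracks.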

xuejie09242 2009-11-16

Following along to learn.

hunanseo 2009-11-16

<td><div\s+class=x><a\s+href="(?<videourl>.+)"\s+onmousedown=".+"\s+title="(?<title>.+)"\s+target="_blank"><img\s+src="(?<imgurl>.+)"\s+alt=".+\"></[a]>

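Thanks, and thank you for your regular expression.
But I'd like to know: what is wrong with my regex?

As soon as it matches, it grabs a huge chunk of HTML.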

-过客- 2009-11-15

Like this?

/// <summary>
/// Fetches the HTML source of a page by URL.
/// </summary>
/// <param name="url">URL</param>
/// <param name="encoding">Encoding of the page</param>
/// <returns>The page source as a string</returns>
private string GetHtmlCode(string url, Encoding encoding)
{
    System.Net.HttpWebRequest request = (System.Net.HttpWebRequest)System.Net.WebRequest.Create(url);
    request.UserAgent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022)";
    // Dispose the response, stream, and reader even if reading throws.
    using (System.Net.WebResponse response = request.GetResponse())
    using (System.IO.Stream resStream = response.GetResponseStream())
    using (System.IO.StreamReader sr = new System.IO.StreamReader(resStream, encoding))
    {
        return sr.ReadToEnd();
    }
}

// Usage
string src = GetHtmlCode(@"http://video.baidu.com/v?word=nba&ct=301989888&rn=20&pn=0&db=0&s=0&fbl=1024", Encoding.GetEncoding("gb2312"));
Regex reg = new Regex(@"<a(?!\s+href=""/"">)[^>]*><img[^>]*></a>");
MatchCollection mc = reg.Matches(src);
foreach (Match m in mc)
{
    // Append each matched <a><img></a> block to the text box, with a separator.
    richTextBox2.Text += m.Value + "\n-----------------------\n";
}
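
The pattern above returns each thumbnail link as a whole tag. To pull out the video URL, title, and image URL separately, as the question asks, one possible sketch uses named groups with negated character classes instead of ".+". The class name ThumbLinkExtractor, the method Extract, and the exact pattern are illustrative assumptions. The sample string is the link quoted in the original question. The pattern also assumes that href is the first attribute of the tag and that attribute values are double-quoted, as in the HTML posted in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ThumbLinkExtractor
{
    // The link quoted in the original question.
    public const string Sample =
        @"<a href=""http://v.ku6.com/show/fkD6WckjXMA_CLcY.html"" " +
        @"onmousedown=""return vc(this,'3602152116,1743565720','19',1,'16','08:45')"" " +
        @"title=""NBA精彩突破过人镜头"" target=""_blank"">" +
        @"<img src=""http://v1.baidu.com/itn?u=3602152116,1743565720"" alt=""NBA精彩突破过人镜头""></a>";

    // [^""] (a single " inside the verbatim string) stops at the closing quote,
    // and [^>] stops at the end of the tag, so no group can spill into later HTML.
    private static readonly Regex ThumbLink = new Regex(
        @"<a\s+href=""(?<videourl>[^""]+)""[^>]*?\stitle=""(?<title>[^""]*)""[^>]*>\s*" +
        @"<img\s+src=""(?<imgurl>[^""]+)""[^>]*>\s*</a>",
        RegexOptions.IgnoreCase);

    // Returns one { videourl, title, imgurl } array per thumbnail link in html.
    public static List<string[]> Extract(string html)
    {
        var items = new List<string[]>();
        foreach (Match m in ThumbLink.Matches(html))
        {
            items.Add(new[]
            {
                m.Groups["videourl"].Value,
                m.Groups["title"].Value,
                m.Groups["imgurl"].Value
            });
        }
        return items;
    }

    public static void Main()
    {
        foreach (string[] item in Extract(Sample))
        {
            Console.WriteLine(item[0] + " | " + item[1] + " | " + item[2]);
        }
    }
}
```

Text-only links such as `<a href="/v?word=NBA..."><span>` are skipped, because they have no title attribute and no img child. If the page markup changes, an HTML parser is likely to be more robust than any single regex.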

kai8341 2009-11-15

Here to learn.

hunanseo 2009-11-15

The HTML looks like this:

<td><div class=x><a href="http://news.joy.cn/video/842859.htm" onmousedown="return vc(this,'0,0','1',1,'1','00:00')" title="NBA常规赛五大好球" target="_blank"><img src="http://file1.joy.cn/Boke/0487/0343/aaaleo/Thumbnail/NBA1_d9fab0c3c43b47e5b4a2d09baf3265aa.jpg" alt="NBA常规赛五大好球"></a></div><div class=r><p><a target="_blank" href="http://news.joy.cn/video/842859.htm" onmousedown="return vc(this,'0,0','1',2,'1','00:00');" title="NBA常规赛五大好球"><font color=#c60a00>NBA</font>常规赛五大好球</a></p><p>分类：<bdo class="fl"><a href="/v?word=NBA&ct=301989888&pn=0&db=0&s=1"><span><font color=#c60a00>NBA</font></span></a>, <a href="/v?word=%B3%A3%B9%E6%C8%FC&ct=301989888&pn=0&db=0&s=1"><span>常规赛</span></a>, <a href="/v?word=%BD%F8%C7%F2&ct=301989888&pn=0&db=0&s=1"><span>进球</span></a></bdo></p><span class="su">news.joy.cn</span></div></td><td><div class=x><a href="http://games.joy.cn/video/562525.htm" onmousedown="return vc(this,'0,0','2',1,'1','00:00')" title="NBA五大神秘天王之鲨鱼" target="_blank"><img src="http://webpic.megajoy.com/onlinegame/download/1257824612461.jpg" alt="NBA五大神秘天王之鲨鱼"></a></div><div class=r><p><a target="_blank" href="http://games.joy.cn/video/562525.htm" onmousedown="return vc(this,'0,0','2',2,'1','00:00');" title="NBA五大神秘天王之鲨鱼"><font color=#c60a00>NBA</font>五大神秘天王之鲨鱼</a></p><p>分类：<bdo class="fl"><a href="/v?word=NBA&ct=301989888&pn=0&db=0&s=1"><span><font color=#c60a00>NBA</font></span></a>, <a href="/v?word=%CC%EC%CD%F5&ct=301989888&pn=0&db=0&s=1"><span>天王</span></a>, <a href="/v?word=%F6%E8%D3%E3&ct=301989888&pn=0&db=0&s=1"><span>鲨鱼</span></a></bdo></p><span class="su">games.joy.cn</span></div></td><td>

hunanseo 2009-11-15

http://v.ku6.com/show/fkD6WckjXMA_CLcY.html

Thanks, everyone. If I want to extract addresses like the one above, how should I write it? This is the regex I wrote, but it doesn't work:

<td><div\s+class=x><a\s+href="(?<videourl>.+)"\s+onmousedown=".+"\s+title="(?<title>.+)"\s+target="_blank"><img\s+src="(?<imgurl>.+)"\s+alt=".+\"></[a]>

wuyq11 2009-11-15

You can only scrape the page content; the video itself is hard to get.
public static string GetHtml(string URL, out string cookie)
{
    WebRequest wr = WebRequest.Create(URL);
    wr.Credentials = CredentialCache.DefaultCredentials;
    using (WebResponse wp = wr.GetResponse())
    using (StreamReader reader = new StreamReader(wp.GetResponseStream(), Encoding.UTF8))
    {
        // Assumption: the out parameter is meant to carry the Set-Cookie header.
        // The original post never assigned it, which does not compile.
        cookie = wp.Headers["Set-Cookie"];
        return reader.ReadToEnd();
    }
}
Or:
HttpWebRequest req = (HttpWebRequest)WebRequest.Create(URL);
req.Accept = "*/*";
req.Referer = "";
req.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Maxthon; .NET CLR 1.1.4322)";
req.Method = "GET";
HttpWebResponse webResponse = (HttpWebResponse)req.GetResponse();
// Raw response headers, in case they are needed later.
string header = webResponse.Headers.ToString();
Stream getStream = webResponse.GetResponseStream();
StreamReader sr = new StreamReader(getStream, Encoding.UTF8);
string html = sr.ReadToEnd();
sr.Close();
getStream.Close();
webResponse.Close();

cc_net 2009-11-15

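I mistyped it; it should be HttpWebRequest.

You can send the request with either POST or GET.

// ReqUrl, TimeOut, encode, pageHtml, and ReplaceHtml are defined elsewhere in the poster's code.
HttpWebRequest musicPageReq = (HttpWebRequest)WebRequest.Create(ReqUrl);
musicPageReq.AllowAutoRedirect = false;
musicPageReq.Method = "GET";
musicPageReq.Timeout = TimeOut;
try
{
    // Get the page response
    using (HttpWebResponse musicPageRes = (HttpWebResponse)musicPageReq.GetResponse())
    {
        // Continue only if the HTTP status is 200
        if (musicPageRes.StatusCode == HttpStatusCode.OK)
        {
            // Get the response stream and read the page HTML from it
            using (Stream pageStream = musicPageRes.GetResponseStream())
            using (StreamReader reader = new StreamReader(pageStream, encode))
            {
                pageHtml = ReplaceHtml(reader.ReadToEnd());
            }
        }
    }
}
catch (WebException)
{
    // The original post ends at the try block; this handler is an assumed placeholder.
    // Timeouts and HTTP error responses end up here.
}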