100 points offered: help with news scraping

村长_乐 2010-09-25 10:42:28

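I need to scrape news and have never done this before, so please explain in as much detail as possible, or post some code. Thanks.
The page I want to scrape is http://golf.sina.com.cn/scrollnews.html. It contains several categories; I also need to scrape each item's detail page and insert the details into my database. How can I implement this?
Thanks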

村长_乐 2010-09-28

[Quote=Reply #79 by wangkun9999:]
There is no comma or terminator after pubDate:

C# code

System.Text.RegularExpressions.Regex reg = new System.Text.RegularExpressions.Regex(@"(?is)category:""(?<category>[^""]*?)"",\s+cLink:""(?<cLink>[^""]*?)"",\s+title:……
[/Quote]
Argh... I forgot about that...
Thank you!!
Just got back from lunch; closing the thread... thanks, everyone...

wangkun9999 2010-09-28

There is no comma or terminator after pubDate, so the pattern must end right after its closing quote:

System.Text.RegularExpressions.Regex reg = new System.Text.RegularExpressions.Regex(@"(?is)category:""(?<category>[^""]*?)"",\s+cLink:""(?<cLink>[^""]*?)"",\s+title:""(?<title>[^""]*?)"",\s+link:""(?<link>[^""]*?)"",\s+media:""(?<media>[^""]*?)"",\s+author:""(?<author>[^""]*?)"",\s+pubDate:""(?<pubDate>[^""]*?)""", System.Text.RegularExpressions.RegexOptions.IgnoreCase);

System.Text.RegularExpressions.MatchCollection m = reg.Matches(str.Substring(str.IndexOf("item:"),str.Length-str.IndexOf("item:"))); // the string to search
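
For anyone who wants to try the corrected pattern in isolation, here is a minimal sketch. It assumes the feed text looks like the sample posted in this thread; the class name SinaRssParser and the method ParseItems are made up for the example. The point it illustrates is that the pattern stops right after pubDate's closing quote, because pubDate is the last field of each item and has no trailing comma.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any library.
public static class SinaRssParser
{
    // One item per match. The pattern ends at pubDate's closing quote:
    // pubDate is the last field, so nothing (no comma) follows it.
    private static readonly Regex ItemRegex = new Regex(
        @"category:""(?<category>[^""]*)"",\s*" +
        @"cLink:""(?<cLink>[^""]*)"",\s*" +
        @"title:""(?<title>[^""]*)"",\s*" +
        @"link:""(?<link>[^""]*)"",\s*" +
        @"media:""(?<media>[^""]*)"",\s*" +
        @"author:""(?<author>[^""]*)"",\s*" +
        @"pubDate:""(?<pubDate>[^""]*)""");

    public static MatchCollection ParseItems(string script)
    {
        // Start at "item:" so the feed-level pubDate is never considered.
        int start = script.IndexOf("item:", StringComparison.Ordinal);
        return ItemRegex.Matches(start >= 0 ? script.Substring(start) : script);
    }

    public static void Main()
    {
        // Sample copied from the data posted in this thread (one item only).
        string sample = @"var sinaRss = {pubDate:""2010-09-25 6:29"", link:"""",
item:[
    {
        category:""美 巡 赛"",
        cLink:""http://golf.sina.com.cn/pgatour.html"",
        title:""图文-巡回锦标赛第四轮 福瑞克夺冠后兴奋不已"",
        link:""http://sports.sina.com.cn/golf/p/2010-09-27/09585221946.shtml"",
        media:""新浪体育讯"",
        author:"""",
        pubDate:""2010/09/27 9:58""
    },";

        foreach (Match m in ParseItems(sample))
        {
            Console.WriteLine(m.Groups["title"].Value + " | " + m.Groups["pubDate"].Value);
        }
    }
}
```

Each named group (category, cLink, title, link, media, author, pubDate) can then be read from `m.Groups[...]` and written to the database.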

weikai4321 2010-09-28

mark...

嗷嗷 2010-09-28

Learned something here, thanks

村长_乐 2010-09-28

[Quote=Reply #70 by qdwangle:]
C# code

var sinaRss = {pubDate:"2010-09-25 6:29", link:"",
item:[

{
category:"美巡赛",
cL……
[/Quote]
Could someone write a regex for me? Once this problem is solved I'll close the thread...
Thanks!

leiziaitudou 2010-09-27

http://topic.csdn.net/u/20100925/20/7173d428-9659-4996-a787-31d2d842cd24.html?32210
Take a look at this

leiziaitudou 2010-09-27

RSS

村长_乐 2010-09-27

<!--[294,57,15] published at 2010-09-27 10:01:49 from #187 by system-->
var sinaRss = {pubDate:"2010-09-25 6:29", link:"",
item:[
    {
        category:"美 巡 赛",
        cLink:"http://golf.sina.com.cn/pgatour.html",
        title:"图文-巡回锦标赛第四轮 福瑞克夺冠后兴奋不已",
        link:"http://sports.sina.com.cn/golf/p/2010-09-27/09585221946.shtml",
        media:"新浪体育讯",
        author:"",
        pubDate:"2010/09/27 9:58"
    },
    {
        category:"美 巡 赛",
        cLink:"http://golf.sina.com.cn/pgatour.html",
        title:"图文-巡回锦标赛第四轮 福瑞克与妻子分享喜悦",
        link:"http://sports.sina.com.cn/golf/p/2010-09-27/09575221944.shtml",
        media:"新浪体育讯",
        author:"",
        pubDate:"2010/09/27 9:57"
    },
    {
        category:"美 巡 赛",
        cLink:"http://golf.sina.com.cn/pgatour.html",
        title:"图文-巡回锦标赛第四轮 福瑞克紧紧拥抱爱妻",
        link:"http://sports.sina.com.cn/golf/p/2010-09-27/09565221943.shtml",
        media:"新浪体育讯",
        author:"",
        pubDate:"2010/09/27 9:56"
    },

How do I capture the pubDate inside item:? Every attempt to read it fails.

System.Text.RegularExpressions.Regex reg = new System.Text.RegularExpressions.Regex(@"(?is)category:""(?<category>[^""]*?)"",\s+cLink:""(?<cLink>[^""]*?)"",\s+title:""(?<title>[^""]*?)"",\s+link:""(?<link>[^""]*?)"",\s+media:""(?<media>[^""]*?)"",\s+author:""(?<author>[^""]*?)"",\s+pubDate:""(?<pubDate>[^""]*?)"",\s+", System.Text.RegularExpressions.RegexOptions.IgnoreCase);

System.Text.RegularExpressions.MatchCollection m = reg.Matches(str.Substring(str.IndexOf("item:"),str.Length-str.IndexOf("item:"))); // the string to search

Only pubDate cannot be extracted: no error is raised, but nothing is matched. If I leave pubDate out of the pattern, I do get values...

skyaspnet 2010-09-27

The replies above have already explained it clearly

打一壶酱油 2010-09-27

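You can do this with Lucene + Heritrix: starting from seed URLs, fetch pages breadth-first or depth-first and parse them, using regular expressions for the parsing. It isn't actually hard; I have mostly just forgotten the details, haha. Back in the day I used this setup to build a system that downloaded comics from 迅雷, 精明眼 and other such sites.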

村长_乐 2010-09-26

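[Quote=Reply #61 by oturer:]
If you want to save it to a database, my approach would be:
when a link is clicked, [check whether the database already holds the content for that link]. If it does, read it from there; if not, parse the target page, display the content, and save it to the database at the same time. If it is not in the database and the link cannot be opened either, just show a message such as "expired or deleted".
News content pages contain very few images, so I don't think they affect speed much; there's no need for multithreading.
[/Quote]
I'll handle all of that according to my own database and requirements, but without the image paths none of it is any use...
The key question is how to get the image paths... Thanks!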

beg200710 2010-09-26

[Quote=Reply #17 by wwfgu00ing:]
C# code
/// <summary>
/// Reads a URL
/// </summary>
/// <param name="url"></param>
/// <returns></returns>
private System.String readUrlHTML(System.String url……
[/Quote]

Bump

oturer 2010-09-26

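If you want to save it to a database, my approach would be:
when a link is clicked, [check whether the database already holds the content for that link]. If it does, read it from there; if not, parse the target page, display the content, and save it to the database at the same time. If it is not in the database and the link cannot be opened either, just show a message such as "expired or deleted".
News content pages contain very few images, so I don't think they affect speed much; there's no need for multithreading.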
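
The look-up-first flow described above could be sketched roughly as follows. This is only an illustration under stated assumptions: a Dictionary stands in for the database table, the page download and parsing are passed in as a delegate, and the names ArticleCache, GetContent and fetchAndParse are invented for the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: "check the database first, otherwise fetch, show and save".
public class ArticleCache
{
    // Stand-in for the database table: link -> parsed article content.
    private readonly Dictionary<string, string> store = new Dictionary<string, string>();

    // Downloads and parses one page; supplied by the caller.
    private readonly Func<string, string> fetchAndParse;

    public ArticleCache(Func<string, string> fetchAndParse)
    {
        this.fetchAndParse = fetchAndParse;
    }

    public string GetContent(string link)
    {
        string content;
        if (store.TryGetValue(link, out content))
        {
            return content; // already saved: read it back
        }

        try
        {
            content = fetchAndParse(link); // not saved yet: fetch and parse the page
        }
        catch (Exception)
        {
            // Neither stored nor reachable: tell the user instead of failing.
            return "This article has expired or been deleted.";
        }

        store[link] = content; // save it for the next request
        return content;
    }
}

public static class ArticleCacheDemo
{
    public static void Main()
    {
        // Fake fetcher so the sketch runs without network access.
        var cache = new ArticleCache(link => "parsed content of " + link);
        Console.WriteLine(cache.GetContent("http://example.com/news/1.shtml"));
        Console.WriteLine(cache.GetContent("http://example.com/news/1.shtml")); // served from the store
    }
}
```

In a real page the Dictionary would be replaced by a SELECT on the link column followed by an INSERT when nothing is found.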

No1bigtooth 2010-09-26

Following this thread to learn

oturer 2010-09-26

Apart from the images inside the article content, I think everything else can be treated as ads

oturer 2010-09-26

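I think you should first find the outermost HTML marker that surrounds the images you want, then use that marker to pull out the contents of the <img tags inside it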
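
A rough sketch of that idea: cut out the region between two markers, then collect the src of every <img> tag inside it. The marker strings and the names ImageExtractor and GetImageUrls are assumptions made for the example; the real markers depend on the target page's HTML. Regex-based HTML parsing is fragile, so treat this as a starting point only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration.
public static class ImageExtractor
{
    // Captures a non-empty src value, quoted or unquoted.
    private static readonly Regex ImgSrc = new Regex(
        @"<img\b[^>]*?\bsrc\s*=\s*[""']?(?<src>[^""'\s>]+)",
        RegexOptions.IgnoreCase);

    public static List<string> GetImageUrls(string html, string startMarker, string endMarker)
    {
        var urls = new List<string>();

        // Restrict the search to the region that holds the article content.
        int start = html.IndexOf(startMarker, StringComparison.OrdinalIgnoreCase);
        if (start < 0) return urls;
        int end = html.IndexOf(endMarker, start + startMarker.Length, StringComparison.OrdinalIgnoreCase);
        if (end < 0) end = html.Length;
        string region = html.Substring(start, end - start);

        foreach (Match m in ImgSrc.Matches(region))
        {
            urls.Add(m.Groups["src"].Value);
        }
        return urls;
    }

    public static void Main()
    {
        // The container markup below is invented for the demo.
        string html =
            "<img src=\"http://ads.example.com/banner.gif\">" +
            "<div id=\"content\"><p>text</p>" +
            "<img src=\"http://www.baidu.com/image/1.jpg\"><img src=\"\"></div>";

        foreach (string url in GetImageUrls(html, "<div id=\"content\">", "</div>"))
        {
            Console.WriteLine(url);
        }
    }
}
```

Images outside the markers (typically ads) and tags with an empty src are skipped; each returned URL can then be downloaded and its saved file name stored in the database.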

村长_乐 2010-09-26

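I need the image URLs so that I can save the images locally and store the saved file names in the database...
Right now I can't get the image URLs.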

oturer 2010-09-26

Do you want to remove these images or keep them?

村长_乐 2010-09-26

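[Quote=Reply #54 by oturer:]
C# code
protected void Page_Load(object sender, EventArgs e)
{
//string strurl="http://blog.hnce.net"; // URL of the page to fetch
string strurl = " http://info.secu.hc360.com/list/news.sht……
[/Quote]
Thanks!!! I can now read the list successfully, and the content too, using regexes. One remaining problem: given <img src="http://www.baidu.com/image/1.jpg">, how do I read the image path? The page contains many <img src=""> tags like this; is there a good way to find the right ones quickly?
The other problem is that it is too slow; as reply #48 said, it would be best to use multiple threads...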
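
On the speed problem, one possible direction is to download several pages at the same time. The sketch below is an assumption-laden illustration, not the approach from reply #48 (which is not shown here): it uses Parallel.ForEach from .NET 4, and the names ParallelFetcher, FetchAll and maxParallel are invented. The download itself is passed in as a delegate so the example runs without network access.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper for illustration.
public static class ParallelFetcher
{
    // Fetches every URL using at most maxParallel concurrent workers.
    // Returns url -> page text for the URLs that could be fetched.
    public static IDictionary<string, string> FetchAll(
        IEnumerable<string> urls, Func<string, string> fetch, int maxParallel)
    {
        var results = new ConcurrentDictionary<string, string>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };

        Parallel.ForEach(urls, options, url =>
        {
            try
            {
                results[url] = fetch(url);
            }
            catch (Exception)
            {
                // Skip pages that fail to download; the others still complete.
            }
        });

        return results;
    }

    public static void Main()
    {
        var links = new List<string> { "page1", "page2", "page3" };

        // Fake fetcher for the demo; see the note below for a real one.
        IDictionary<string, string> pages = FetchAll(links, url => "<html>" + url + "</html>", 2);

        Console.WriteLine(pages.Count);
    }
}
```

For real downloads the delegate could be `url => new System.Net.WebClient().DownloadString(url)`, creating one WebClient per call because a single instance does not support concurrent operations. Keep maxParallel small so the target site is not flooded with requests.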

oturer 2010-09-26

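using System;
using System.Net;
using System.Text;

// Page_Load and StripHTML belong in the ASP.NET page's code-behind class.
protected void Page_Load(object sender, EventArgs e)
{
    //string strurl = "http://blog.hnce.net";   // URL of the page to fetch
    string strurl = "http://info.secu.hc360.com/list/news.shtml";

    WebClient myWebClient = new WebClient();    // create the WebClient instance

    // Network credentials used to authenticate the request to the Internet resource.
    myWebClient.Credentials = CredentialCache.DefaultCredentials;

    // Download the resource and return it as a byte array.
    byte[] pagedata = myWebClient.DownloadData(strurl);

    // Use exactly one of the two decoding lines below, chosen to match the page's character encoding.
    // Use this one if the page is GB2312. Encoding.Default is the system ANSI code page,
    // so this only works where that code page is GB2312/GBK:
    string result = Encoding.Default.GetString(pagedata);
    // Use this one if the page is UTF-8 (my blog is UTF-8, so I use it for the blog URL):
    //string result = Encoding.UTF8.GetString(pagedata);

    // Cut out the news list. The markers and the offsets are specific to this page's HTML.
    //  result = result.Substring(result.IndexOf("<!--all industry start-->")+22 , (result.IndexOf("<!--all industry end-->")-result.IndexOf("<!--all industry start-->")-22) );
    result = result.Substring(result.IndexOf("jrsd_0914") + 37, (result.IndexOf("1099916") - result.IndexOf("jrsd_0914") - 37));
    result = StripHTML(result);
    // Route every link through Default.aspx, which scrapes the target page in turn.
    result = result.Replace("href=\"", "href=\"Default.aspx?key=http://info.secu.hc360.com");

    Response.Write(result);   // show the fetched content in the web page
}

// Console variant of the same download.
public static void Main()
{
    try
    {
        WebClient MyWebClient = new WebClient();
        MyWebClient.Credentials = CredentialCache.DefaultCredentials;

        //Byte[] pageData = MyWebClient.DownloadData("http://blog.hnce.net");
        Byte[] pageData = MyWebClient.DownloadData("http://info.secu.hc360.com/list/news.shtml");
        // Pick the encoding that matches the page, as in Page_Load above.
        string pageHtml = Encoding.UTF8.GetString(pageData);
        Console.WriteLine(pageHtml);
    }
    catch (WebException webEx)
    {
        Console.Write(webEx.ToString());
    }
}

// Returns the news list <div> from its opening tag up to its first closing </div>.
private string StripHTML(string strHtml)
{
    int divs = strHtml.IndexOf("<div class=\"list\" style=\"padding-top:7px;\">");
    if (divs < 0) return strHtml;               // marker not found: leave the input unchanged
    string strOutput = strHtml.Substring(divs);
    int dive = strOutput.IndexOf("</div>");     // index relative to strOutput, not strHtml
    if (dive < 0) return strOutput;
    return strOutput.Substring(0, dive);
}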

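The code can be used by copy-paste. It only scrapes the news list page, not the content pages, but the approach for content pages is the same.
First you need to understand the HTML structure of the target page, then cut out the list you want, and then rewrite the links accordingly.
In my case that is result = result.Replace("href=\"", "href=\"Default.aspx?key=http://info.secu.hc360.com");
In Default.aspx I then repeat the same kind of operation to scrape the useful information from the target page.
Regular expressions are an effective way to strip out useless HTML.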
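
The three steps described above (cut out the list, strip useless HTML with a regex, rewrite the links) could look roughly like this. It is a sketch under assumptions: the comment markers in the demo page are invented, and ListPageCleaner, Between, KeepOnlyLinks and RewriteLinks are hypothetical names, not part of the code posted above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration.
public static class ListPageCleaner
{
    // Returns the text between two markers, or "" if either marker is missing.
    public static string Between(string html, string startMarker, string endMarker)
    {
        int start = html.IndexOf(startMarker, StringComparison.Ordinal);
        if (start < 0) return string.Empty;
        start += startMarker.Length;
        int end = html.IndexOf(endMarker, start, StringComparison.Ordinal);
        return end < 0 ? string.Empty : html.Substring(start, end - start);
    }

    // Removes <script>/<style> blocks, then every tag except <a ...> and </a>.
    public static string KeepOnlyLinks(string html)
    {
        string noScripts = Regex.Replace(
            html, @"<(script|style)\b[^>]*>.*?</\1\s*>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return Regex.Replace(noScripts, @"<(?!/?a\b)[^>]*>", "", RegexOptions.IgnoreCase);
    }

    // Prefixes every href so links go through the local page, as in the post above.
    public static string RewriteLinks(string html, string prefix)
    {
        return html.Replace("href=\"", "href=\"" + prefix);
    }

    public static void Main()
    {
        // Invented page fragment; real markers depend on the target site.
        string page =
            "<html><!--list start--><ul><li><a href=\"/news/1.shtml\">One</a></li></ul><!--list end--></html>";

        string list = Between(page, "<!--list start-->", "<!--list end-->");
        string links = KeepOnlyLinks(list);
        Console.WriteLine(RewriteLinks(links, "Default.aspx?key=http://info.secu.hc360.com"));
    }
}
```

Searching for the end marker only after the start marker avoids the negative-length Substring calls that hard-coded offsets can produce when the page layout changes.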