Extracting links from HTML with a regular expression!

Daqing 2011-05-31 05:19:40

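I need to extract the links from an HTML fragment, along with the content of its <P> paragraphs. It would be even better if the alt text of the image links could be extracted as well. Thanks in advance, everyone!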

8 replies


Daqing 2011-06-01

[Quote=Quoting reply #5 by ojlovecd:]
C# code

string inputs = "<div class=\"box_01\"> <a href=\"http://tech.sina.com.cn/digi/dc/2011-05-18/09425539462.shtml\" target=\"_blank\"><img src=\"http://i1.sinaimg.cn/IT/U5311P2T1D5……
[/Quote]Thanks!

Ray720_KIllua 2011-05-31

[Quote=Quoting reply #2 by kingdom_0:]
C# code

string inputs = "<div class=\"box_01\"> <a href=\"http://tech.sina.com.cn/digi/dc/2011-05-18/09425539462.shtml\" target=\"_blank\"><img src=\"http://i1.sinaimg.cn/IT/U5311P2T1D5539462F2755D……
[/Quote]
Why hasn't 过客 shown up to answer yet?
I'd like to learn from this.
What does [""'^#] mean?

脾气不坏 2011-05-31

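Is the HTML just this <div></div> part?
If so, it should be easy.
Hyperlink regex: http://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?
The <p> content should be fairly easy to extract: first use <p>.*</p> to pull out <p>最近，国外某大学的学生为了自己的毕业设计，...</p>, then simply replace() away the <p> and </p>.

The same method works for alt.

It feels a bit laborious, though. Let's wait for an expert to post a more efficient regex.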
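The two-step idea in this reply could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `TwoStepExtract`, its method names, and the example.com input are made up here, and the paragraph pattern uses a lazy `.*?` rather than the greedy `.*` from the post so that several paragraphs are not swallowed by one match. Note that the URL pattern's last character class contains a space, so on free text it may run past the end of a URL.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TwoStepExtract
{
    // URL pattern copied unchanged from the post above.
    const string UrlPattern = @"http://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";

    // Step 1: collect every substring that looks like an http:// URL.
    public static List<string> Urls(string html)
    {
        var urls = new List<string>();
        foreach (Match m in Regex.Matches(html, UrlPattern))
            urls.Add(m.Value);
        return urls;
    }

    // Step 2: match whole <p>...</p> elements, then strip the two tags.
    // Assumption: lazy .*? instead of the post's greedy .* so that two
    // paragraphs in one input are not merged into a single match.
    public static List<string> Paragraphs(string html)
    {
        var paragraphs = new List<string>();
        foreach (Match m in Regex.Matches(html, @"<p>.*?</p>",
                     RegexOptions.IgnoreCase | RegexOptions.Singleline))
        {
            paragraphs.Add(Regex.Replace(m.Value, @"</?p>", "", RegexOptions.IgnoreCase));
        }
        return paragraphs;
    }

    static void Main()
    {
        // Hypothetical input, shaped like the sample in this thread.
        string html = "<div><a href=\"http://example.com/news/1.shtml\" target=\"_blank\">title</a>"
                    + "<p>first</p> <p>second</p></div>";
        foreach (string u in Urls(html)) Console.WriteLine("url:\t{0}", u);
        foreach (string p in Paragraphs(html)) Console.WriteLine("p:\t{0}", p);
    }
}
```

The URL step picks up href and src values alike, so the caller cannot tell them apart; that is one reason the single-pattern answers elsewhere in the thread capture the attribute name too.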

我姓区不姓区 2011-05-31

using System;
using System.Text.RegularExpressions;

string inputs = "<div class=\"box_01\"> <a href=\"http://tech.sina.com.cn/digi/dc/2011-05-18/09425539462.shtml\" target=\"_blank\"><img src=\"http://i1.sinaimg.cn/IT/U5311P2T1D5539462F2755DT20110518095231.jpg\" width=\"135\" height=\"85\" alt=\"徕卡昂贵镜头遭遇切片\" /></a><h3><a href=\"http://tech.sina.com.cn/digi/dc/2011-05-18/09425539462.shtml\" target=\"_blank\">徕卡昂贵镜头遭遇切片</a></h3><p>最近，国外某大学的学生为了自己的毕业设计，...</p> </div>";

// Three alternatives: a quoted href/src value (groups 2 and 4),
// the content of a <p> element (group 6), or a quoted alt value (group 9).
// (?is) ignores case and lets '.' match newlines.
// Attribute values that contain whitespace are not matched, because [^\s]+? excludes it.
string patterns = @"(?is)((href|src)=(['""])*([^\s]+?)\3)|(<p>(.*?)</p>)|(alt=(['""])*([^\s]+?)\8)";

MatchCollection matches = Regex.Matches(inputs, patterns);
foreach (Match match in matches)
{
    if (!string.IsNullOrEmpty(match.Groups[2].Value))
    {
        // href or src attribute
        Console.WriteLine("type:\t{0}", match.Groups[2].Value);
        Console.WriteLine("href|src:\t{0}", match.Groups[4].Value);
    }
    else if (!string.IsNullOrEmpty(match.Groups[5].Value))
    {
        // <p> element
        Console.WriteLine("type:\tp");
        Console.WriteLine("Content:\t{0}", match.Groups[6].Value);
    }
    else if (!string.IsNullOrEmpty(match.Groups[7].Value))
    {
        // alt attribute
        Console.WriteLine("type:\talt");
        Console.WriteLine("alt:\t{0}", match.Groups[9].Value);
    }
    Console.WriteLine();
}

/*
Output:

type:   href
href|src:       http://tech.sina.com.cn/digi/dc/2011-05-18/09425539462.shtml

type:   src
href|src:       http://i1.sinaimg.cn/IT/U5311P2T1D5539462F2755DT20110518095231.jpg

type:   alt
alt:    徕卡昂贵镜头遭遇切片

type:   href
href|src:       http://tech.sina.com.cn/digi/dc/2011-05-18/09425539462.shtml

type:   p
Content:        最近，国外某大学的学生为了自己的毕业设计，...
*/
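The numbered groups in the code above (2, 4, 6, 9) are easy to miscount when the pattern changes. As a possible variation, not the poster's own code, the same three kinds of match can be written with named groups. Everything named here (`NamedGroupExtract`, `Extract`, the group names `attr`, `q`, `value`, `para`) is invented for this sketch, and it deliberately differs from the original in one respect: the value is matched with `.*?` up to the closing quote, so an alt text containing spaces is captured too.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class NamedGroupExtract
{
    // Either a quoted href/src/alt attribute, or the content of a <p> element.
    // \k<q> requires the closing quote to be the same character as the opening one.
    static readonly Regex Pattern = new Regex(
        @"\b(?<attr>href|src|alt)\s*=\s*(?<q>['""])(?<value>.*?)\k<q>|<p\b[^>]*>(?<para>.*?)</p>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns (kind, text) pairs in document order; kind is href, src, alt or p.
    public static List<KeyValuePair<string, string>> Extract(string html)
    {
        var items = new List<KeyValuePair<string, string>>();
        foreach (Match m in Pattern.Matches(html))
        {
            if (m.Groups["attr"].Success)
            {
                items.Add(new KeyValuePair<string, string>(
                    m.Groups["attr"].Value.ToLowerInvariant(), m.Groups["value"].Value));
            }
            else
            {
                items.Add(new KeyValuePair<string, string>("p", m.Groups["para"].Value));
            }
        }
        return items;
    }

    static void Main()
    {
        // Hypothetical input, shaped like the sample in this thread.
        string html = "<a href=\"http://example.com/x.shtml\" target=\"_blank\">"
                    + "<img src='http://example.com/i.jpg' alt=\"two words\" /></a><p>text</p>";
        foreach (var item in Extract(html))
            Console.WriteLine("{0}:\t{1}", item.Key, item.Value);
    }
}
```

Like any regex over HTML, this stays a heuristic: it knows nothing about comments, scripts, or unquoted attribute values, so a real HTML parser is the safer choice for arbitrary pages.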

脾气不坏 2011-05-31

[Quote=Quoting reply #3 by kingdom_0:]

There is no <p> tag in it...
[/Quote]
"<div class=\"box_01\"> <a href=\"http://tech.sina.com.cn/digi/dc/2011-05-18/09425539462.shtml\" target=\"_blank\"><img src=\"http://i1.sinaimg.cn/IT/U5311P2T1D5539462F2755DT20110518095231.jpg\" width=\"135\" height=\"85\" alt=\"徕卡昂贵镜头遭遇切片\" /></a><h3><a href=\"http://tech.sina.com.cn/digi/dc/2011-05-18/09425539462.shtml\" target=\"_blank\">徕卡昂贵镜头遭遇切片</a></h3><p>最近，国外某大学的学生为了自己的毕业设计，...</p> </div>"

Yes, there is one.

kingdom_0 2011-05-31

There is no <p> tag in it...

kingdom_0 2011-05-31

using System;
using System.Text.RegularExpressions;

string inputs = "<div class=\"box_01\"> <a href=\"http://tech.sina.com.cn/digi/dc/2011-05-18/09425539462.shtml\" target=\"_blank\"><img src=\"http://i1.sinaimg.cn/IT/U5311P2T1D5539462F2755DT20110518095231.jpg\" width=\"135\" height=\"85\" alt=\"徕卡昂贵镜头遭遇切片\" /></a><h3><a href=\"http://tech.sina.com.cn/digi/dc/2011-05-18/09425539462.shtml\" target=\"_blank\">徕卡昂贵镜头遭遇切片</a></h3><p>最近，国外某大学的学生为了自己的毕业设计，...</p> </div>";

// Group 1 is the attribute name (href, src or alt);
// group 2 is its value with the surrounding delimiters still attached.
string patterns = @"(?is)(href|src|alt)=+([""'^#][\w\S]*[""'>])";

MatchCollection matches = Regex.Matches(inputs, patterns);
foreach (Match match in matches)
{
    Console.WriteLine("type:        {0}", match.Groups[1].Value);
    Console.WriteLine("value:       {0}", match.Groups[2].Value);
    Console.WriteLine();
}
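On the question raised in the thread about `[""'^#]`: inside a C# verbatim string, `""` stands for one literal double quote, so the regex engine sees the character class `["'^#]`, which matches exactly one of `"`, `'`, `^` or `#`. The small check below illustrates that, and also shows that group 2 of the pattern keeps the quotes around the value. The class name `CharClassDemo` and its members are hypothetical names for this sketch only.

```csharp
using System;
using System.Text.RegularExpressions;

static class CharClassDemo
{
    // The opening class from the pattern above, as a verbatim string.
    public const string OpenClass = @"[""'^#]";

    // Pattern copied unchanged from the post above.
    public const string Pattern = @"(?is)(href|src|alt)=+([""'^#][\w\S]*[""'>])";

    // True if the single character c is accepted by the opening class.
    public static bool MatchesOpen(char c)
    {
        return Regex.IsMatch(c.ToString(), "^" + OpenClass + "$");
    }

    // Group 2 of the first match, or null when nothing matches.
    public static string FirstValue(string html)
    {
        Match m = Regex.Match(html, Pattern);
        return m.Success ? m.Groups[2].Value : null;
    }

    static void Main()
    {
        foreach (char c in new[] { '"', '\'', '^', '#', 'a' })
            Console.WriteLine("{0}\t{1}", c, MatchesOpen(c));

        // The attribute is followed by a space, as in the thread's sample.
        string raw = FirstValue("<a href=\"http://example.com/\" target=\"_blank\">");
        Console.WriteLine(raw);            // value with its quotes
        Console.WriteLine(raw.Trim('"'));  // quotes removed
    }
}
```

Because `[\w\S]*` greedily consumes every non-whitespace character, the pattern depends on whitespace following the attribute. For an input such as `<a href="x">text</a>` it runs on to the final `>` and captures `"x">text</a>`.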

Daqing 2011-05-31

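[Quote=Quoting the original post by tsapi:]
I need to extract the links from an HTML fragment, along with the content of its <P> paragraphs. It would be even better if the alt text of the image links could be extracted as well. Thanks in advance, everyone!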
[/Quote]

string inputs = "<div class=\"box_01\"> <a href=\"http://tech.sina.com.cn/digi/dc/2011-05-18/09425539462.shtml\" target=\"_blank\"><img src=\"http://i1.sinaimg.cn/IT/U5311P2T1D5539462F2755DT20110518095231.jpg\" width=\"135\" height=\"85\" alt=\"徕卡昂贵镜头遭遇切片\" /></a><h3><a href=\"http://tech.sina.com.cn/digi/dc/2011-05-18/09425539462.shtml\" target=\"_blank\">徕卡昂贵镜头遭遇切片</a></h3><p>最近，国外某大学的学生为了自己的毕业设计，...</p> </div>";

string patterns = @"(href|HREF|src|SRC|<p>)={1,}([""'^#][\w\S]*[""'>|</p>])";

MatchCollection matches = Regex.Matches(inputs, patterns);
foreach (Match match in matches)
{
    Console.WriteLine("type:        {0}", match.Groups[1].Value);
    Console.WriteLine("href:        {0}", match.Groups[2].Value);
    Console.WriteLine("title:       {0}", match.Groups[3].Value);
    Console.WriteLine("Content:     {0}", match.Groups[4].Value);
    Console.WriteLine();
}

This is my source code; inputs holds my HTML markup. Thanks. All I need right now is a correct regex!
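Three properties of the pattern in this post probably explain why it falls short. The `<p>` alternative must be followed by `=`, so it can never match a paragraph. `[""'>|</p>]` is a character class that matches one character, not an alternation between `>` and `</p>`. And the pattern has only two capturing groups, so `Groups[3]` and `Groups[4]` are always empty. The sketch below checks these points; the class name `OriginalPatternCheck` and its members are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

static class OriginalPatternCheck
{
    // Pattern copied unchanged from the post above.
    public const string Original =
        @"(href|HREF|src|SRC|<p>)={1,}([""'^#][\w\S]*[""'>|</p>])";

    // True if the pattern finds anything in the given text.
    public static bool Finds(string text)
    {
        return Regex.IsMatch(text, Original);
    }

    // Number of capturing groups, not counting group 0 (the whole match).
    public static int CapturingGroups()
    {
        return new Regex(Original).GetGroupNumbers().Length - 1;
    }

    static void Main()
    {
        // No match: "<p>" is followed by text, never by '='.
        Console.WriteLine(Finds("<p>text</p>"));

        // An href attribute does match.
        Console.WriteLine(Finds("href=\"x\" "));

        // Only groups 1 and 2 exist.
        Console.WriteLine(CapturingGroups());

        // Asking for a group that does not exist yields an empty string.
        Match m = Regex.Match("href=\"x\" ", Original);
        Console.WriteLine(m.Groups[3].Value.Length);
    }
}
```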