regular expression again, again and again

qinglinmeng 2002-09-20 02:36:28

1. Can you suggest some books on regular expressions? I mean electronic ones, downloadable or emailable.

2. How do I write a regular expression to get the name, address, and zip from HTML like this?

<tr>
<td>name 1 here
</td>
<td>address 1 here
<font...>zip here
</td>
</tr>
<tr>
<td>name 2 here
</td>
<td>
<font...>zip here

</td>
</tr>
<tr>
<td>name 3 here
</td>
<td>address 3 here
</td>
</tr>
<tr>
<td>name 4 here
</td>
</tr>


8 replies

qinglinmeng 2002-09-21

I checked the documentation for the HTML parser and found it more difficult to use, because I can only find specific tags, not specific patterns. For example, I can find all the <tr>s, but I can't find <tr><td>...</td></tr> patterns. That makes it difficult for me.
I can't remember exactly, but I think MSDN says regular expressions can be used to parse HTML and XML. I will check later.
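
A tag-by-tag parser can still recognise a <tr><td>...</td></tr> pattern if the handler remembers which tags are currently open. The sketch below uses Python's standard `html.parser` as a stand-in for the parser discussed in this thread; the class name `RowCollector` and the sample markup are made up for illustration, and `<font>` is a simplified stand-in for the elided `<font...>` tag in the question.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of each <td>, grouped by its enclosing <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text chunks of the cell being read, or None outside <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Collapse runs of whitespace and newlines inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


parser = RowCollector()
parser.feed("<tr><td>name 2 here\n</td><td>\n<font>zip here\n\n</td></tr>"
            "<tr><td>name 4 here\n</td></tr>")
parser.close()
print(parser.rows)  # [['name 2 here', 'zip here'], ['name 4 here']]
```

Because each row arrives as its own list, a row with a missing cell cannot borrow data from the next row.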

qqchen79 2002-09-20

In most regex implementations, the default behavior for repetition is "longest match" (greedy). There is also a special syntax for "shortest match" (lazy).
For example, <tr>.*</tr> will consume everything up to the last </tr> of the whole table (provided . is allowed to match newlines), while <tr>(.*?)</tr> will match only a single row.
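
A minimal sketch of the difference, using Python's `re` module as a stand-in (the `*?` syntax is the same in .NET). The two-row sample string is made up for illustration.

```python
import re

# Two table rows on separate lines, similar to the sample in the question.
html = "<tr><td>name 1</td></tr>\n<tr><td>name 2</td></tr>"

# re.DOTALL lets "." match newlines, so a match can cross line breaks.
greedy = re.findall(r"<tr>.*</tr>", html, re.DOTALL)
lazy = re.findall(r"<tr>(.*?)</tr>", html, re.DOTALL)

print(len(greedy))  # 1: a single match from the first <tr> to the last </tr>
print(lazy)         # ['<td>name 1</td>', '<td>name 2</td>']
```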

I don't really know whether C#/.NET supports this kind of semantics.

And in your case, I think you could simply use <tr> and </tr> as the match boundaries, then look for the information you need between them.
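
The boundary idea can be sketched as two passes: first isolate each <tr>...</tr> block, then search only inside that block. This is an illustration in Python, not the poster's code; the helper name `rows_to_cells` is hypothetical, and `<font>` is a simplified stand-in for the elided `<font...>` tag.

```python
import re

html = """<tr>
<td>name 1 here
</td>
<td>address 1 here
<font>zip here
</td>
</tr>
<tr>
<td>name 4 here
</td>
</tr>"""

# Pass 1 pattern: one lazy match per row. Pass 2 pattern: one lazy match per cell.
ROW = re.compile(r"<tr>(.*?)</tr>", re.DOTALL | re.IGNORECASE)
CELL = re.compile(r"<td>(.*?)</td>", re.DOTALL | re.IGNORECASE)


def rows_to_cells(text):
    """Return one list of raw cell strings per <tr>...</tr> block."""
    return [[cell.strip() for cell in CELL.findall(row)]
            for row in ROW.findall(text)]


cells_per_row = rows_to_cells(html)
print([len(cells) for cells in cells_per_row])  # [2, 1]
```

The second pass never sees text from another row, so a missing address or zip simply yields a shorter list for that row.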

qqchen79 2002-09-20

Regex is not intended for parsing languages such as HTML or C. It has limited power and a much stricter syntax (it handles regular grammars only, which don't allow nesting).
In your case, if you want your parser to be more tolerant, I agree with Saucer that you should use the Browser control instead.
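
A small illustration of the nesting problem, again in Python with a made-up sample: when a table sits inside a cell, a lazy <tr>...</tr> match stops at the inner </tr> instead of the one that closes the outer row.

```python
import re

# An outer row whose cell contains a nested table with its own row.
nested = "<tr><td>outer<table><tr><td>inner</td></tr></table></td></tr>"

match = re.search(r"<tr>(.*?)</tr>", nested, re.DOTALL)
print(match.group(1))  # <td>outer<table><tr><td>inner</td>
```

The captured text ends in the middle of the inner table, which is not a complete row at either level.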

qinglinmeng 2002-09-20

Thank you, qqchen79(知秋一叶).
Do you mean I should do two matches: first match the <tr></tr>, then match the detailed information in the returned string?
I think that's OK for simple HTML like I wrote above. But for a more complex one, like a real HTML page downloaded from a website, there will be more <tr>...</tr>s than we need. We will get a lot of junk information, consume more memory, and make the program more complex. I think that may not be the best solution. I hope I can get everything in a single, powerful regular expression.
.NET does support the shortest match: it's "[\w\W]*?". I tried this in my code, but I find it hard to control in a complex regular expression. In a simple one it works fine, but when the regex becomes complex, it works in the strange way that I mentioned above.
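
For reference, `[\w\W]*?` is a lazy "any character, including newlines" skip. The sketch below, in Python, shows it behaving like `.*?` under the dot-matches-newline option (`re.DOTALL` here; `RegexOptions.Singleline` is the .NET counterpart). The sample cell is made up, with `<font>` standing in for the elided `<font...>` tag.

```python
import re

html = "<td>address 1 here\n<font>zip here\n</td>"

any_char = re.search(r"<td>([\w\W]*?)</td>", html)     # class matches any character
dotall = re.search(r"<td>(.*?)</td>", html, re.DOTALL)  # "." may match newlines
plain_dot = re.search(r"<td>(.*?)</td>", html)          # "." stops at a newline

print(any_char.group(1) == dotall.group(1))  # True
print(plain_dot)                             # None
```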

qinglinmeng 2002-09-20

Thanks again, saucer.
I was out this afternoon after posting my question.
About my code:
First, a zip usually has a format of ([\d\s]{3} [\d\s]{3}), like h3a 2b1. Let's make it simple and use a telephone number instead of the zip. A telephone number is quite regular: \([\d]{3}\)[\d]{3}-[\d]{4}.
Second, my real problem is this. When it tries to match 1, it finds everything. That's OK, easy. But when it tries to match 2, it finds only the telephone number, not the address. Then it looks for a matching address in the following HTML code, and it finds one in 3. That is to say, match 2 will get name, zip of 2 and phone number of 3. It will totally skip 3 and I get 2 matches.
I want to know if there is an elegant way to limit it to a <tr>...</tr> scope. Currently, I am using {0,n} to limit the length it can skip, but I think it's ugly.
Third, thanks for the book and the other articles. I will check them.
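
On the second point, one way to keep a lazy skip inside a single row is to guard every skipped character with a negative lookahead, so the skip can never step over </tr>. The sketch below is in Python under several assumptions: the rows, names, and phone numbers are made up, the patterns are not the poster's actual ones, and `<font>` stands in for the elided `<font...>` tag. Negative lookahead `(?!...)` is also available in .NET regular expressions.

```python
import re

# Row "name 3" has no phone number; row "name 5" has one.
html = (
    "<tr><td>name 3</td><td>address 3</td></tr>\n"
    "<tr><td>name 5</td><td>address 5 <font>(514)555-0105</td></tr>\n"
)

PHONE = r"\(\d{3}\)\d{3}-\d{4}"

# Unscoped: the lazy skip may run past </tr> into the next row.
unscoped = re.compile(r"<tr>\s*<td>([^<]*)</td>[\w\W]*?(" + PHONE + ")")

# Scoped: each skipped character must not be the start of "</tr>".
scoped = re.compile(
    r"<tr>\s*<td>([^<]*)</td>(?:(?!</tr>)[\w\W])*?(" + PHONE + ")"
)

print(unscoped.findall(html))  # [('name 3', '(514)555-0105')]  wrong pairing
print(scoped.findall(html))    # [('name 5', '(514)555-0105')]
```

With the guard in place, a row that lacks the phone number fails to match on its own instead of pulling data from the row after it, which removes the need for a {0,n} length limit.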

saucer 2002-09-20

Look at how this guy tried it in C#:

http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&threadm=trmhq93s1g2h9f%40corp.supernews.com&rnum=2&prev=/groups%3Fq%3DC%2523%2BMSHTML.HTMLDocument%26hl%3Den%26lr%3D%26ie%3DUTF-8%26oe%3DUTF-8%26selm%3Dtrmhq93s1g2h9f%2540corp.supernews.com%26rnum%3D2

saucer 2002-09-20

Suggestion: instead of parsing the HTML string yourself, you should use the IE parsing engine. See how it is done in VB:

Parsing HTML without Using the Browser Control
http://codeguru.earthweb.com/vb_internet/htmlparser.html

saucer 2002-09-20

1.
Mastering Regular Expressions, 2nd Edition
By Jeffrey E. F. Friedl
O'Reilly

Regular Expressions
http://www.opennc.org/onlinepubs/7908799/xbd/re.html

Regular expressions (Baidu search)
http://www1.baidu.com/baidu?word=%d5%fd%d4%f2%b1%ed%b4%ef%ca%bd&cl=3&tn=cnyahoo

2. How do you tell
<tr>
<td>name 2 here
</td>
<td>
<font...>zip here

</td>
</tr>

and

<tr>
<td>name 3 here
</td>
<td>address 3 here
</td>
</tr>

apart? I mean, assuming the first cell is always the name, how do you distinguish between address and zip?
