regular expression again, again and again

qinglinmeng 2002-09-20 02:36:28
1. Can you suggest some books on regular expressions? I mean electronic ones, downloadable or emailable.

2. How do I write a regular expression to get the name, address, and zip from HTML like this?

<tr>
<td><font ...>name 1 here</font>
</td>
<td><font ...>address 1 here
<font...>zip here</font>
</font></td>
</tr>
<tr>
<td><font ...>name 2 here</font>
</td>
<td>
<font...>zip here
</font>
</td>
</tr>
<tr>
<td><font ...>name 3 here</font>
</td>
<td><font ...>address 3 here
</font></td>
</tr>
<tr>
<td><font ...>name 4 here</font>
</td>
</tr>

8 replies
qinglinmeng 2002-09-21
I checked the documentation for the HTML parser and found it more difficult to use, because I can only look for specific tags, not specific patterns. For example, I can find all the <tr>s, but I can't check for <tr><td>...</td></tr> patterns. That makes it difficult for me.
I can't remember exactly, but I think MSDN says regular expressions can be used for HTML and XML parsing. I will check later.
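For what it's worth, a tag-based parser can still recover the <tr><td>...</td></tr> pattern if you keep a little state while the tags stream by. Here is a minimal sketch in Python (not the .NET or IE parser discussed in this thread), using the standard library's html.parser. The names RowCollector and collect_rows are made up for illustration.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Groups the text of each <td> by its enclosing <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the <tr> currently open, if any
        self._cell = None   # text fragments of the <td> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # Collapse runs of whitespace and newlines into single spaces.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def collect_rows(html):
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Each row comes back as a list of cell texts, so a row with only a name simply has one element. This sketch does not handle nested tables.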
qqchen79 2002-09-20
In most regex implementations, the default behavior for repetition is "longest match". There is also a special syntax for "shortest match".
For example, <tr>.*</tr> will consume everything up to the end of the whole table, while <tr>(.*?)</tr> will match only a single row.

I don't really know whether C#/.NET supports this kind of semantics.

Also, for your problem, I think you could simply use <tr> and </tr> as the match boundaries, then look for the info you need between them.
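A small sketch of the difference, written in Python rather than C#. The sample string and function names are made up for illustration. Note that "." does not match a newline by default, so the sketch passes re.DOTALL (the .NET counterpart is RegexOptions.Singleline).

```python
import re

# Two table rows separated by a newline (made-up sample data).
HTML = "<tr><td>name 1</td></tr>\n<tr><td>name 2</td></tr>"


def greedy_rows(html):
    # ".*" takes as much as it can, so one match runs from the first <tr> to the last </tr>.
    return re.findall(r"<tr>.*</tr>", html, re.DOTALL)


def lazy_rows(html):
    # ".*?" takes as little as it can, so each match stops at the nearest </tr>.
    return re.findall(r"<tr>(.*?)</tr>", html, re.DOTALL)
```

Here greedy_rows(HTML) returns a single match covering both rows, while lazy_rows(HTML) returns the contents of each row separately.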
qqchen79 2002-09-20
Regex is not intended for parsing languages such as HTML or C. It has very limited power and a much stricter syntax (regular grammars only, which don't allow nesting).
In your case, if you want your parser to be more tolerant, I agree with saucer that you should use the Browser control instead.
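To illustrate the nesting point, here is a minimal Python sketch with a made-up nested table. A shortest-match pattern has no way to pair each opening tag with its own closing tag, so it stops at the first </table> it sees, which belongs to the inner table.

```python
import re

# An outer table whose only cell contains an inner table (made-up sample).
NESTED = (
    "<table><tr><td>"
    "<table><tr><td>inner</td></tr></table>"
    "</td></tr></table>"
)


def first_table(html):
    # Returns the text of the first <table>...</table> match, or None.
    m = re.search(r"<table>.*?</table>", html, re.DOTALL)
    return m.group(0) if m else None
```

The result starts at the outer <table> but ends at the inner </table>, so it is neither table. The greedy form has the opposite problem when several tables sit side by side.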
qinglinmeng 2002-09-20
Thank you, qqchen79 (知秋一叶).
Do you mean I should do two matches: first match the <tr></tr>, then match the detailed information in the returned string?
I think that's OK for simple HTML like the sample I wrote above. But in a more complex case, like a real HTML page downloaded from a website, there will be more <tr>...</tr>s than we need. We would get a lot of junk information, consume more memory, and make the program more complex. I think that may not be the best solution. I hope I can get everything with a single, powerful regular expression.
.NET does support shortest match: it's "[\w\W]*?". I tried this in my code, but I find it hard to control in a complex regular expression. In a simple one it works fine, but when the regex becomes complex it behaves in the strange way I mentioned above.
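For reference, the two-pass idea can be sketched roughly like this in Python (not C#). The function name extract_rows and the three patterns are assumptions for illustration, and they expect reasonably well-formed rows without nested tables.

```python
import re

# Pass 1: isolate each row. Pass 2: isolate each cell inside that row.
ROW_RE = re.compile(r"<tr\b[^>]*>(.*?)</tr>", re.DOTALL | re.IGNORECASE)
CELL_RE = re.compile(r"<td\b[^>]*>(.*?)</td>", re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")  # any remaining tag, e.g. <font ...>


def extract_rows(html):
    rows = []
    for row in ROW_RE.finditer(html):
        cells = []
        for cell in CELL_RE.finditer(row.group(1)):
            text = TAG_RE.sub(" ", cell.group(1))   # drop inner tags
            cells.append(" ".join(text.split()))    # normalize whitespace
        rows.append(cells)
    return rows
```

Because the second pass only ever sees the text of one row, a field missing from one row cannot be filled in from the next row. Rows you don't need can be discarded as soon as they are seen, so the extra memory is limited to one row at a time.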
qinglinmeng 2002-09-20
Thanks again, saucer.
I was out this afternoon after posting my question.
About my code:
First, a zip usually has a format like ([\d\s]{3} [\d\s]{3}), e.g. h3a 2b1. Let's make it simple and use a telephone number instead of the zip. A telephone number is quite regular: \([\d]{3}\)[\d]{3}-[\d]{4}.
Second, my real problem is this. When it tries to match row 1, it finds everything; that's OK, easy. But when it tries to match row 2, it finds only the telephone number, not the address. It then looks for a matching address in the following HTML code and finds one in row 3. That is to say, match 2 will get the name and zip of row 2 and the phone number of row 3. It will totally skip row 3 and produce only 2 matches.
I want to know if there is an elegant way to limit the match to a <tr>...</tr> scope. Currently I am using {0,n} to limit the length it can skip, but I think that's ugly.
Third, thanks for the book and the other articles. I will check them.
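One way to keep a single regex inside a <tr>...</tr> scope is to replace the unrestricted skip "[\w\W]*?" with a skip that refuses to step over </tr>, using a negative lookahead: (?:(?!</tr>).)*?. The sketch below is in Python, but lookahead is also available in .NET. The sample data, the pattern names NAIVE and SCOPED, and the two helper functions are made up for illustration, and the phone format is the one given in the post above.

```python
import re

# Three rows (made-up data); the second row has a name but no phone number.
HTML = (
    '<tr>\n<td><font size="2">Ann</font>\n</td>\n'
    '<td><font size="2">1 Main St\n<font size="2">(514)555-0101</font>\n</font></td>\n</tr>\n'
    '<tr>\n<td><font size="2">Bob</font>\n</td>\n</tr>\n'
    '<tr>\n<td><font size="2">Cy</font>\n</td>\n'
    '<td><font size="2">(514)555-0103\n</font></td>\n</tr>'
)

# Unrestricted lazy skip: free to run past </tr> into the following rows.
NAIVE = re.compile(
    r"<tr>\s*<td>\s*<font[^>]*>([^<]*)</font>[\w\W]*?(\(\d{3}\)\d{3}-\d{4})"
)

SCOPED = re.compile(
    r"""
    <tr>\s*
    <td>\s*<font[^>]*>\s*(?P<name>[^<]*?)\s*</font>\s*</td>
    (?:                                   # optional phone, searched only in this row
        (?:(?!</tr>).)*?                  # lazy skip that never steps over the row end
        (?P<phone>\(\d{3}\)\d{3}-\d{4})
    )?
    (?:(?!</tr>).)*?                      # whatever is left of the row
    </tr>
    """,
    re.DOTALL | re.VERBOSE,
)


def naive_pairs(html):
    return NAIVE.findall(html)


def scoped_pairs(html):
    return [(m.group("name"), m.group("phone")) for m in SCOPED.finditer(html)]
```

On this sample, naive_pairs pairs Bob with Cy's phone number and never reports Cy, which is the behavior described in the post. scoped_pairs reports all three names, with None for Bob's missing phone.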




saucer 2002-09-20
See how this person tried it in C#:

http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&threadm=trmhq93s1g2h9f%40corp.supernews.com&rnum=2&prev=/groups%3Fq%3DC%2523%2BMSHTML.HTMLDocument%26hl%3Den%26lr%3D%26ie%3DUTF-8%26oe%3DUTF-8%26selm%3Dtrmhq93s1g2h9f%2540corp.supernews.com%26rnum%3D2
saucer 2002-09-20
Suggestion: instead of parsing the HTML string yourself, use the IE parsing engine. See how it is done in VB:

Parsing HTML without Using the Browser Control
http://codeguru.earthweb.com/vb_internet/htmlparser.html
saucer 2002-09-20
1.
Mastering Regular Expressions, 2nd Edition
By Jeffrey E. F. Friedl
O'Reilly

Regular Expressions
http://www.opennc.org/onlinepubs/7908799/xbd/re.html

Regular expressions (Baidu search for 正则表达式)
http://www1.baidu.com/baidu?word=%d5%fd%d4%f2%b1%ed%b4%ef%ca%bd&cl=3&tn=cnyahoo

2. How do you tell
<tr>
<td><font ...>name 2 here</font>
</td>
<td>
<font...>zip here
</font>
</td>
</tr>

and

<tr>
<td><font ...>name 3 here</font>
</td>
<td><font ...>address 3 here
</font></td>
</tr>


apart? I mean, assuming the first cell is always the name, how do you distinguish between an address and a zip?
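One possible answer is to classify the second cell by its format. The Python sketch below assumes a Canadian-style postal code (letter, digit, letter, optional space, digit, letter, digit), inferred only from an example such as "h3a 2b1". The pattern and the name classify are assumptions for illustration, and anything that doesn't fit the postal format is treated as an address.

```python
import re

# Assumed postal code shape: letter digit letter, optional space, digit letter digit.
POSTAL_RE = re.compile(r"[A-Za-z]\d[A-Za-z] ?\d[A-Za-z]\d")


def classify(text):
    # fullmatch requires the whole (stripped) cell text to fit the pattern.
    return "zip" if POSTAL_RE.fullmatch(text.strip()) else "address"
```

For example, classify("h3a 2b1") gives "zip" and classify("12 Main St") gives "address".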
