(Regex question) Extracting email addresses from files

fx397993401 2012-03-16 07:21:38
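I have recently been following Stanford's online open course on natural language processing, and the current homework is to extract the email addresses from a given set of files.
Assignment page: https://www.coursera.org/nlp/assignment/view?assignment_id=2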

File download: http://spark-public.s3.amazonaws.com/nlp/homework/pa1-spamlord-v3.tar.gz
After unpacking:
$cd python
$python SpamLord.py ../data/dev/ ../data/devGOLD runs a very simple starter version of the program

We then have to improve it.
For example, there are emails in the following formats:

jurafsky(at)cs.stanford.edu
jurafsky at csli dot stanford dot edu
<script type="text/javascript">obfuscate('stanford.edu','jurafsky')</script> # displays as a correctly formatted email address in the browser

The program must produce the following results:
jurafsky@stanford.edu
jurafsky@cs.stanford.edu
jurafsky@csli.stanford.edu
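
For the obfuscate(...) case above, a minimal sketch of one possible approach is to capture the two quoted arguments and swap them around. The names OBFUSCATE_RE and decode_obfuscate are made up for illustration and are not part of the assignment's starter code; the sketch also assumes the arguments are always single-quoted, as in the example.

```python
import re

# Hypothetical pattern: matches obfuscate('domain','user') with optional
# spaces around the arguments (assumes single-quoted arguments).
OBFUSCATE_RE = re.compile(r"obfuscate\(\s*'([^']+)'\s*,\s*'([^']+)'\s*\)")

def decode_obfuscate(line):
    """Return user@domain for every obfuscate('domain','user') call in line."""
    return ['%s@%s' % (user, domain)
            for domain, user in OBFUSCATE_RE.findall(line)]

html = "<script type=\"text/javascript\">obfuscate('stanford.edu','jurafsky')</script>"
print(decode_obfuscate(html))
```

Because the domain is the first argument and the user name the second, the two captured groups are swapped when the address is built.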

I also found the following nasty formats in the course files:
pal at cs stanford edu
<em>ada@graphics.stanford.edu</em>
(email to support at gradiance dt com)
<em>ada@graphics.stanford.edu</em>
ouster (followed by “@cs.stanford.edu”)
engler@lcs.mit.edu
jurafsky@stanford.edu
jurafsky.ssss @stanford.edu
jurassfsky @ stanford.edu
jurafsky(at)cs.stanford.edu
jurafskssssy (at) cs.stanford.edu
sssjurafsky at csli dot stanford dot edu
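All of them must return the correct result.
So far I have written five regular expressions.
Testing locally, there are still six emails that I cannot capture. Does anyone have a better regex approach?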

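I think my preprocessing is flawed. Someone suggested converting characters such as ( and ) to spaces and leaving the spaces themselves untouched.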

1. I convert everything to lower case, delete the spaces, and replace dot with .
2. Then I apply the regexes:

my_first_pat = '([\w+\.?]+)@([\w+\.]+)edu'
email_pat1 = '([\w+\.?\;?]+)\(?at\)?([\w+\.?\;?]+)edu'
email_pat2 = '\<em\>([\w+]+).*\;([\w+\.?\;?]+)edu'
email_pat4 = '([\w+\.?]+)\S*@([\w+\.]+)edu'
email_pat3 = 'obfuscate\(\'(\w+.edu)\'\,\'(\w+)\''
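
One detail about the patterns above: inside a character class such as [\w+\.?], the + and ? are literal characters rather than quantifiers. As a rough sketch of an alternative, the separators can be described directly in the regex instead of deleting spaces beforehand. The names AT, SEP, EMAIL_RE and find_edu_emails are invented for this example, and it only covers .edu addresses written with @, (at), " at ", ".", " dot " or plain spaces. It does not handle the "followed by" or "dt com" cases, and a plain-word " at " can produce false positives.

```python
import re

# Assumed separators (illustrative, not from the starter code)
AT = r"(?:\s*@\s*|\s*\(at\)\s*|\s+at\s+)"   # "@", "(at)" or " at "
SEP = r"(?:\s*\.\s*|\s+dot\s+|\s+)"         # ".", " dot " or spaces

# user, an "at" separator, one or more "label + separator", then "edu"
EMAIL_RE = re.compile(
    r"([a-z0-9][\w.]*)" + AT + r"((?:[a-z0-9-]+" + SEP + r")+)edu\b")

def find_edu_emails(line):
    """Return normalized user@domain.edu addresses found in line."""
    emails = []
    for user, domain in EMAIL_RE.findall(line.lower()):
        # domain still carries its original separators; turn them into dots
        emails.append('%s@%sedu' % (user, re.sub(SEP, '.', domain)))
    return emails

for sample in ["jurafsky(at)cs.stanford.edu",
               "jurafsky at csli dot stanford dot edu",
               "pal at cs stanford edu",
               "<em>ada@graphics.stanford.edu</em>"]:
    print(find_edu_emails(sample))
```

Keeping the spaces in the input lets the pattern tell where the user name starts, which is lost once all spaces are deleted.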
RabbitLBJ 2012-03-19
[Quote=Quoting reply #1 by fx397993401:]

Python code

import sys
import os
import re
import pprint

my_first_pat = '([\w+\.?]+)@([\w+\.]+)edu'

email_pat1 = '(?:email)?([\w+\.?\;]+)\s*at\s*([\w+\.?\;?]+)edu'
email_pat2 = '\<em\>([\w+]+).*\……
[/Quote]
++
fx397993401 2012-03-16

import sys
import os
import re
import pprint

# raw strings keep the regex backslashes exactly as written
my_first_pat = r'([\w+\.?]+)@([\w+\.]+)edu'

email_pat1 = r'(?:email)?([\w+\.?\;]+)\s*at\s*([\w+\.?\;?]+)edu'
email_pat2 = r'\<em\>([\w+]+).*\;([\w+\.?\;?]+)edu'
email_pat4 = r'([\w+\.?]+).*@([\w+\.]+)edu'
email_pat3 = r"obfuscate\('(\w+.edu)','(\w+)'"

email_pat5 = r'email([\w+\.?]+)@([\w+\.]+)edu'
email_pat6 = r'\<address\>([\w+]+)where([\w+dom]+)edu'


phone_pat1 = r'\+?\d?-?\)?(\d{3})\)?-?(\d{3})-?(\d{4})'
phone_pat2 = r'\(\s*(\d{3})\s*\)\s*(\d{3})\s*-?\s*(\d{4})'
phone_pat3 = r'(\d{3})\s*-?\s*(\d{3})\s*-?\s*(\d{4})'
phone_pat4 = r'(\d{3})\s*\S+\s*(\d{3})\s*\S+\s*(\d{4})'


"""
TODO
This function takes in a filename along with the file object (or
an iterable of strings) and scans its contents against regex patterns.
It returns a list of (filename, type, value) tuples where type is either
an 'e' or a 'p' for e-mail or phone, and value is the formatted phone
number or e-mail. The canonical formats are:
(name, 'p', '###-###-####')
(name, 'e', 'someone@something')
If the numbers you submit are formatted differently they will not
match the gold answers

NOTE: ***don't change this interface***, as it will be called directly by
the submit script
"""
def process_file(name, f):
    # note that debug info should be printed to stderr
    # sys.stderr.write('[process_file]\tprocessing file: %s\n' % (path))
    res = []
    for line_t in f:
        raw = line_t.lower()
        # str.replace() matches literal text, so replace('(|)', ' ') was a no-op;
        # turn each parenthesis into a space explicitly
        line_t = raw.replace('(', ' ').replace(')', ' ')
        #line_t = raw
        line = line_t.replace('dot','.')
        # after this step no '@' is left in line, so only the 'at' pattern
        # (email_pat1) can match; the patterns containing '@' find nothing
        line = line.replace('@','at')
        matches = re.findall(my_first_pat,line)
        for m in matches:
            email = '%s@%sedu' % m
            res.append((name,'e',email))

        matches = re.findall(email_pat1,line)
        for m in matches:
            email = '%s@%sedu' % m
            res.append((name,'e',email))

        matches = re.findall(email_pat2,line)
        for m in matches:
            email_t = '%s@%sedu' % m
            email = email_t.replace(';','.')
            res.append((name,'e',email))

        matches = re.findall(email_pat4,line)
        for m in matches:
            email = '%s@%sedu' % m
            res.append((name,'e',email))

        matches = re.findall(email_pat5,line)
        for m in matches:
            email = '%s@%sedu' % m
            res.append((name,'e',email))

        matches = re.findall(email_pat6,line)
        for m in matches:
            email_t = '%s@%sedu' % m
            email = email_t.replace('dom','.')
            res.append((name,'e',email))

        # email_pat3 and phone_pat1 rely on the parentheses, so they are
        # matched against the lower-cased line before the parentheses were removed
        matches = re.findall(email_pat3,raw)
        for m in matches:
            #print m
            email = m[1]+'@'+m[0]
            res.append((name,'e',email))

        matches = re.findall(phone_pat1,raw)
        for m in matches:
            phone = '%s-%s-%s' % m
            res.append((name,'p',phone))

    return res

"""
You should not need to edit this function, nor should you alter
its interface as it will be called directly by the submit script
"""
def process_dir(data_path):
    # get candidates
    guess_list = []
    for fname in os.listdir(data_path):
        if fname[0] == '.':
            continue
        path = os.path.join(data_path,fname)
        f = open(path,'r')
        f_guesses = process_file(fname, f)
        guess_list.extend(f_guesses)
    return guess_list

"""
You should not need to edit this function.
Given a path to a tsv file of gold e-mails and phone numbers
this function returns a list of tuples of the canonical form:
(filename, type, value)
"""
def get_gold(gold_path):
    # get gold answers
    gold_list = []
    f_gold = open(gold_path,'r')
    for line in f_gold:
        gold_list.append(tuple(line.strip().split('\t')))
    return gold_list

"""
You should not need to edit this function.
Given a list of guessed contacts and gold contacts, this function
computes the intersection and set differences, to compute the true
positives, false positives and false negatives. Importantly, it
converts all of the values to lower case before comparing
"""
def score(guess_list, gold_list):
    guess_list = [(fname, _type, value.lower()) for (fname, _type, value) in guess_list]
    gold_list = [(fname, _type, value.lower()) for (fname, _type, value) in gold_list]
    guess_set = set(guess_list)
    gold_set = set(gold_list)

    tp = guess_set.intersection(gold_set)
    fp = guess_set - gold_set
    fn = gold_set - guess_set

    pp = pprint.PrettyPrinter()
    #print('Guesses (%d): ' % len(guess_set))
    #pp.pprint(guess_set)
    #print('Gold (%d): ' % len(gold_set))
    #pp.pprint(gold_set)
    # print(...) with a single argument behaves the same on Python 2 and 3
    print('True Positives (%d): ' % len(tp))
    pp.pprint(tp)
    print('False Positives (%d): ' % len(fp))
    pp.pprint(fp)
    print('False Negatives (%d): ' % len(fn))
    pp.pprint(fn)
    print('Summary: tp=%d, fp=%d, fn=%d' % (len(tp),len(fp),len(fn)))

"""
You should not need to edit this function.
It takes in the string path to the data directory and the
gold file
"""
def main(data_path, gold_path):
    guess_list = process_dir(data_path)
    gold_list = get_gold(gold_path)
    score(guess_list, gold_list)

"""
commandline interface takes a directory name and gold file.
It then processes each file within that directory and extracts any
matching e-mails or phone numbers and compares them to the gold file
"""
if __name__ == '__main__':
    if (len(sys.argv) != 3):
        print('usage:\tSpamLord.py <data_dir> <gold_file>')
        sys.exit(0)
    main(sys.argv[1],sys.argv[2])
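
The listing defines four phone patterns but only applies phone_pat1. As a rough sketch, a single pattern with optional parentheses and flexible separators might cover the same ground. PHONE_RE and find_phones are hypothetical names, and the set of separators (spaces, dots, hyphens) is an assumption about the data rather than something taken from the assignment.

```python
import re

# Hypothetical pattern: optional "(", area code, optional ")", then two more
# digit groups separated by any mix of spaces, dots and hyphens. The
# lookarounds stop it from matching inside a longer run of digits.
PHONE_RE = re.compile(
    r"(?<!\d)\(?(\d{3})\)?[\s.-]*(\d{3})[\s.-]*(\d{4})(?!\d)")

def find_phones(line):
    """Return phone numbers found in line in the ###-###-#### format."""
    return ['%s-%s-%s' % m for m in PHONE_RE.findall(line)]

for sample in ["(650) 723-0293", "650-723-0293", "+1 650 723 0293"]:
    print(find_phones(sample))
```

Each match is a tuple of the three captured digit groups, so it can be formatted the same way process_file formats the matches of phone_pat1.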
