Newbie looking for a perfect regex to split a domain name => (hostname, domain, suffix) -_-!

多加旺 2011-12-21 02:31:14

Split a domain name into (hostname, domain, suffix).
For example:
www.baidu.com => ("www.","baidu",".com")
abc.baidu.com.cn => ("abc.","baidu",".com.cn")
baidu.com => ("www.","baidu",".com")
baidu.com.cn => ("www.","baidu",".com.cn")
www.wx.js.cn => ("www.","wx",".js.cn")
wx.js.cn => ("www.","wx",".js.cn")

The suffix is not just .com, .net, or .cn. It has to cover every possible form: .com .info .net .me .mobi .org .us .biz .xxx .ca .mx .tv .ws .com.ag .net.ag .org.ag .ag .js.cn, and many more.

My regex:
([a-z0-9_-]{1,32}\.)+([a-z0-9_-]{1,32})((\.[a-z]{2,4})(.[a-z]{1,2})?)

Breakdown:
Host: ([a-z0-9_-]{1,32}\.)+
Domain: ([a-z0-9_-]{1,32})
Suffix: ((\.[a-z]{2,4})(.[a-z]{1,2})?)

My code:

import re

def phraseDomainName(namestr):
    # A name without any dot cannot be split.
    if namestr.count('.') < 1:
        return False
    # A bare "domain.suffix" gets the default host "www.".
    if namestr.count('.') == 1:
        namestr = "www." + namestr
    regex = r"([a-z0-9_-]{1,32}\.)+([a-z0-9_-]{1,32})((\.[a-z]{2,4})(.[a-z]{1,2})?)"
    match = re.search(regex, namestr)
    # re.search returns None when nothing matches; guard before calling group().
    if match is None:
        return False
    if match.group(1) is not None:
        hostname = match.group(1)
    else:
        hostname = ""
    if match.group(2) is not None:
        domain = match.group(2)
    else:
        return False
    if match.group(3) is not None:
        suffix = match.group(3)
    else:
        return False
    return (hostname, domain, suffix)

while True:
    domain = input("input url:")
    print(phraseDomainName(domain))

input("end")
Right now wx.js.cn and abc.com.cn are not parsed correctly.
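A quick way to see where the pattern goes wrong is to print its capture groups for the failing names. The sketch below reuses the pattern from the post unchanged (only written as a raw string); the helper name `show_groups` is made up for illustration.

```python
import re

# The pattern from the post, unchanged apart from being a raw string.
PATTERN = re.compile(
    r"([a-z0-9_-]{1,32}\.)+([a-z0-9_-]{1,32})((\.[a-z]{2,4})(.[a-z]{1,2})?)"
)

def show_groups(name):
    """Return (host, domain, suffix) as captured by the original pattern."""
    m = PATTERN.search(name)
    if m is None:
        return None
    return (m.group(1), m.group(2), m.group(3))

for name in ("www.baidu.com", "wx.js.cn", "abc.com.cn"):
    print(name, show_groups(name))
```

For wx.js.cn the groups come out as ('wx.', 'js', '.cn'), and for abc.com.cn as ('abc.', 'com', '.cn'). The suffix part `(\.[a-z]{2,4})` is already satisfied by `.cn` alone and the second part is optional, so once the greedy host group backs off by one label the engine accepts that split. Nothing in the shape of the labels says that js.cn or com.cn is a suffix, which suggests some explicit suffix list is needed.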


libralibra 2011-12-22


Yesterday I overlooked the last case, the second-level suffix in wx.js.cn. Adding .js.cn to the list fixes it.

#! /usr/bin/env python

def parseURL(url):
    lookList = ['.com.cn', '.js.cn']  # expand this list if necessary
    backList = ['.comcn', '.jscn']    # mapped to lookList one by one

    # replace each 2nd level suffix with a dot-free placeholder
    for look, back in zip(lookList, backList):
        if look in url:
            url = url.replace(look, back)

    # split: default to the "www" host when only "domain.suffix" is given
    if url.count('.') == 1 and not url.startswith('www.'):
        url = 'www.' + url
    firstDot = url.index('.')
    secondDot = url.index('.', firstDot + 1)

    # recover the 2nd level suffix
    for look, back in zip(lookList, backList):
        if back in url:
            url = url.replace(back, look)

    # return (host, domain, suffix)
    return (url[:firstDot], url[firstDot+1:secondDot], url[secondDot+1:])

def main():
    testurl = ['www.baidu.com', 'abc.baidu.com.cn', 'baidu.com', 'baidu.com.cn', 'www.wx.js.cn', 'wx.js.cn']
    for url in testurl:
        print(url, parseURL(url))

if __name__ == '__main__':
    main()

Output:

www.baidu.com ('www', 'baidu', 'com')
abc.baidu.com.cn ('abc', 'baidu', 'com.cn')
baidu.com ('www', 'baidu', 'com')
baidu.com.cn ('www', 'baidu', 'com.cn')
www.wx.js.cn ('www', 'wx', 'js.cn')
wx.js.cn ('www', 'wx', 'js.cn')
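Since the original question asked for a regex, the same list idea can also be folded into a pattern. This is only a sketch under assumptions: `SUFFIXES` is a hypothetical, deliberately incomplete table, and `split_with_regex` is an illustrative name, not something from the thread. The host part is matched lazily, so the shortest host (and therefore the longest listed suffix) is tried first.

```python
import re

# Hypothetical, incomplete suffix table; a real one would be far longer.
SUFFIXES = [".com.cn", ".js.cn", ".com", ".net", ".org", ".cn"]

# Each suffix is escaped so its dots match literally.
_alternatives = "|".join(re.escape(s) for s in SUFFIXES)

# Lazy, optional host: the engine first tries no host at all, then grows
# the host one character at a time, so the longest listed suffix wins.
DOMAIN_RE = re.compile(
    r"^(?:(?P<host>[a-z0-9_.-]+?)\.)??"
    r"(?P<domain>[a-z0-9_-]{1,32})"
    r"(?P<suffix>" + _alternatives + r")$"
)

def split_with_regex(name):
    """Return (host, domain, suffix) with dots kept, or None if no suffix fits."""
    m = DOMAIN_RE.match(name.lower())
    if m is None:
        return None
    host = m.group("host") or "www"  # default host, as in the question
    return (host + ".", m.group("domain"), m.group("suffix"))

for name in ("www.baidu.com", "abc.baidu.com.cn", "baidu.com", "wx.js.cn"):
    print(name, split_with_regex(name))
```

The tuples keep the dots, matching the format in the question, e.g. ("www.", "wx", ".js.cn") for wx.js.cn. A name whose ending is not in the table returns None rather than a guess.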

libralibra 2011-12-21


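No regex here, just plain string handling. For second-level suffixes such as .com.cn, if there are several you have to extend the replacement lists yourself; there won't be many of them.

Code:
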
#! /usr/bin/env python

def parseURL(url):
    lookList = ['.com.cn']  # expand this list if necessary
    backList = ['.comcn']   # mapped to lookList one by one

    # replace each 2nd level suffix with a dot-free placeholder
    for look, back in zip(lookList, backList):
        if look in url:
            url = url.replace(look, back)

    # default to the "www" host when only "domain.suffix" is given
    if url.count('.') == 1 and not url.startswith('www.'):
        url = 'www.' + url
    firstDot = url.index('.')
    secondDot = url.index('.', firstDot + 1)

    # recover the 2nd level suffix
    for look, back in zip(lookList, backList):
        if back in url:
            url = url.replace(back, look)

    return (url[:firstDot], url[firstDot+1:secondDot], url[secondDot+1:])

def main():
    testurl = ['www.baidu.com', 'abc.baidu.com.cn', 'baidu.com', 'baidu.com.cn', 'www.wx.js.cn', 'wx.js.cn']
    for url in testurl:
        print(url, parseURL(url))

# unit test
if __name__ == '__main__':
    main()

Test output:

www.baidu.com ('www', 'baidu', 'com')
abc.baidu.com.cn ('abc', 'baidu', 'com.cn')
baidu.com ('www', 'baidu', 'com')
baidu.com.cn ('www', 'baidu', 'com.cn')
www.wx.js.cn ('www', 'wx', 'js.cn')
wx.js.cn ('wx', 'js', 'cn')
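The last line shows what happens when a suffix is missing from the list. As a variation on the same list-driven idea, the sketch below checks only the end of the name against a suffix table, longest entries first, so no placeholder replacement is needed. `KNOWN_SUFFIXES` and `split_domain` are hypothetical names, and the table is deliberately incomplete; a fuller table (for example one derived from the Public Suffix List) would have to be supplied separately.

```python
# Hypothetical, incomplete suffix table; extend as needed.
KNOWN_SUFFIXES = ("com.cn", "js.cn", "com", "net", "org", "cn")

def split_domain(name, suffixes=KNOWN_SUFFIXES):
    """Return (host, domain, suffix) without dots, or None if no suffix fits."""
    name = name.lower().strip(".")
    # Try longer suffixes first so "com.cn" wins over "cn".
    for suffix in sorted(suffixes, key=len, reverse=True):
        tail = "." + suffix
        if name.endswith(tail):
            rest = name[:-len(tail)]
            if not rest:
                return None
            # Everything before the last remaining dot is the host.
            host, _, domain = rest.rpartition(".")
            return (host or "www", domain, suffix)
    return None

for name in ("www.baidu.com", "abc.baidu.com.cn", "baidu.com.cn", "wx.js.cn"):
    print(name, split_domain(name))
```

With this table wx.js.cn comes out as ('www', 'wx', 'js.cn'). Because only the end of the string is examined, labels in the middle of a name are never rewritten, and hosts with several labels stay in one piece.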