Python: extracting certain columns from text

chosen86 2014-08-15 02:13:09

[14/Jul/2014:16:39:22 CST] [4019943168] 10.6.99.163 test1 "CONNECT" STARTED 0 0 0 (10.6.99.163:63548 -> 10.203.19.28:8080)
[14/Jul/2014:16:39:24 CST] [4019808000] 10.6.99.163 test1 "CONNECT" ISERROR 0 0 - (10.6.99.163:63551 -> proxy.sgm.shanghaigm.com:8080)
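Every line of the text I need to process has this form, but only the four columns marked in red are useful to me. How can I extract these four columns and write them to a new file? Thanks!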

17 replies

chosen86 2014-09-09

Quoting reply #3 by u013171165:

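import re

# Read the source log and write the four extracted fields to a new file.
with open(r'C:\Users\admin-ZH\Desktop\111.txt', 'r') as fr, \
        open(r'C:\Users\admin-ZH\Desktop\222.txt', 'w') as fw:
    for text in fr:
        # Timestamp: the text between "[" and " CST]".
        times = re.findall(r'(?<=\[).*?(?= CST\])', text)[0]
        # First IPv4 address on the line; [0] raises IndexError if there is none.
        ips = re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', text)[0]
        # User name: the text between that IP and the next double quote.
        test = re.findall(r'(?<={}).*?(?=")'.format(re.escape(ips)), text)[0].strip()
        # Destination: the text between "->" and ")". The .format(ips) call has
        # no effect here because the pattern contains no {} placeholder.
        ip_port = re.findall(r'(?<=->).*?(?=\))'.format(ips), text)[0].strip()
        infos = ' '.join([times, ips, test, ip_port]) + '\n'
        fw.write(infos)

# Print the result file to check it.
with open(r'C:\Users\admin-ZH\Desktop\222.txt', 'r') as fr:
    for text in fr:
        print(text)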

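As a fellow student, I would still advise the OP not to simply ask for ready-made code.

About the regex that matches test, (?<={}).*?(?="): could you explain why it is written this way?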
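
For readers with the same question, here is a minimal sketch of what that pattern does, run against the first sample line from the question. The variable names `line`, `ip`, `pattern` and `user` are illustrative only, and the `re.escape` call is an addition that is not in the quoted code.

```python
import re

# Sample line taken from the original question.
line = ('[14/Jul/2014:16:39:22 CST] [4019943168] 10.6.99.163 test1 '
        '"CONNECT" STARTED 0 0 0 (10.6.99.163:63548 -> 10.203.19.28:8080)')

# First IPv4-looking token on the line (the client address).
ip = re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', line)[0]

# (?<=IP)  lookbehind: the match must start right after the IP text
# .*?      non-greedy: take as few characters as possible
# (?=")    lookahead: stop right before the next double quote
# Neither lookaround is part of the matched text itself.
pattern = r'(?<={}).*?(?=")'.format(re.escape(ip))

# The raw match still carries the surrounding spaces, hence strip().
user = re.findall(pattern, line)[0].strip()
print(repr(user))
```

The `{}` is simply the placeholder that `.format()` fills with the IP found on the same line, so the pattern means "whatever sits between this IP and the next double quote".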

chosen86 2014-08-27

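Quoting reply #4 by u013171165:

Oh, the line ip_port = re.findall(r'(?<=->).*?(?=\))'.format(ips), text)[0].strip() was copy-pasted, so it contains a bit more than it needs.

Sorry to bother you again, one more question: how should the regex be written to extract just the port number (separately from the IP)? I have never studied regex and could not get it right after a long time. I want to add the port number to the final join. Thanks.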
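
One possible way to do this, as a sketch only: capture the destination host and the port as two separate groups. It assumes the destination always appears as `-> host:port)` at the end of the line, as in the two sample lines; the names `dest`, `host` and `port` are made up for the example.

```python
import re

# Second sample line from the original question.
line = ('[14/Jul/2014:16:39:24 CST] [4019808000] 10.6.99.163 test1 '
        '"CONNECT" ISERROR 0 0 - '
        '(10.6.99.163:63551 -> proxy.sgm.shanghaigm.com:8080)')

# Capture the host and the port after "->" as two separate groups.
# [^\s:]+  host name or IP (no whitespace, no colon)
# \d+      port digits, which must be followed by the closing ")"
dest = re.search(r'->\s*([^\s:]+):(\d+)\)', line)

if dest:
    host, port = dest.groups()
    print(' '.join([host, port]))
```

With the two values separated like this, `port` can be added as one more item in the list passed to `' '.join(...)`.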

chosen86 2014-08-17

Quoting reply #14 by u013171165:

[quote=reply #13 by chosen86]
[quote=reply #12 by u013171165]
[quote=reply #11 by chosen86]
After running the code I get this error:
  File "extract_1.py", line 6, in <module>
    ips = re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', text)[0]
IndexError: list index out of range
Is it an index problem?
[/quote]
Index out of range. Your problem got solved and you still have not closed the thread.
[/quote]
Don't worry, I have already given out the points on the other thread, and you will get the ones for this thread too. Does this [0] take the first of all the matches for the IP regex? Or what should it be changed to?
[/quote]
If even index 0 is out of range, your data contains no IP at all, so that record is of no use to you. Just handle it with try/except IndexError and move on to the next record when the exception is raised. Alternatively, first check whether there is any data (len(ips) != 0) and only then continue processing; otherwise go on to the next record.

Understood. The text I am processing contains some error records that have no IP or user name. I will keep trying, thanks!
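
A minimal sketch of the two options suggested in the quoted reply, run on an in-memory list instead of a file. The second entry in `lines` is a made-up stand-in for an error record, and the names `IP_RE`, `found_try` and `found_check` are illustrative only.

```python
import re

IP_RE = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')

lines = [
    '[14/Jul/2014:16:39:22 CST] [4019943168] 10.6.99.163 test1 "CONNECT" '
    'STARTED 0 0 0 (10.6.99.163:63548 -> 10.203.19.28:8080)',
    'malformed record with no address',  # made-up stand-in for an error line
]

# Option 1: index directly and skip the line when IndexError is raised.
found_try = []
for text in lines:
    try:
        ip = IP_RE.findall(text)[0]
    except IndexError:
        continue  # no IP on this line, go to the next one
    found_try.append(ip)

# Option 2: check that findall returned something before indexing.
found_check = []
for text in lines:
    ips = IP_RE.findall(text)
    if len(ips) != 0:
        found_check.append(ips[0])

print(found_try, found_check)
```

Both loops skip the second line and keep the same result, so the choice between them is mostly a matter of taste.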

The_Third_Wave 2014-08-17

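Quoting reply #13 by chosen86:

Don't worry, I have already given out the points on the other thread, and you will get the ones for this thread too. Does this [0] take the first of all the matches for the IP regex? Or what should it be changed to?

If even index 0 is out of range, your data contains no IP at all, so that record is of no use to you. Just handle it with try/except IndexError and move on to the next record when the exception is raised. Alternatively, first check whether there is any data (len(ips) != 0) and only then continue processing; otherwise go on to the next record.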

chosen86 2014-08-17

Quoting reply #12 by u013171165:

Index out of range. Your problem got solved and you still have not closed the thread.

Don't worry, I have already given out the points on the other thread, and you will get the ones for this thread too. Does this [0] take the first of all the matches for the IP regex? Or what should it be changed to?

The_Third_Wave 2014-08-17

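Quoting reply #11 by chosen86:

After running the code I get this error:
  File "extract_1.py", line 6, in <module>
    ips = re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', text)[0]
IndexError: list index out of range
Is it an index problem?

Index out of range. Your problem got solved and you still have not closed the thread.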

chosen86 2014-08-17

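Quoting reply #9 by u013171165:

ip_port = re.findall(r'(?<=->).*?(?=:)', text)[0].strip()

After running the code I get this error:
  File "extract_1.py", line 6, in <module>
    ips = re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', text)[0]
IndexError: list index out of range
Is it an index problem?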

chosen86 2014-08-15

Quoting reply #9 by u013171165:

ip_port = re.findall(r'(?<=->).*?(?=:)', text)[0].strip()

thx!

The_Third_Wave 2014-08-15

Quoting reply #8 by chosen86:

If I do not want the port number after the IP, can this line simply be dropped?

ip_port = re.findall(r'(?<=->).*?(?=:)', text)[0].strip()

chosen86 2014-08-15

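Quoting reply #4 by u013171165:

Oh, the line ip_port = re.findall(r'(?<=->).*?(?=\))'.format(ips), text)[0].strip() was copy-pasted, so it contains a bit more than it needs.

If I do not want the port number after the IP, can this line simply be dropped?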

惟愿莲心不染尘 2014-08-15

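import re

# Sample line from the question (note the extra spaces after the client IP).
Str = r'''[14/Jul/2014:16:39:22 CST] [4019943168] 10.6.99.163   test1 "CONNECT" STARTED 0 0 0 (10.6.99.163:63548 -> 10.203.19.28:8080)'''
# Timestamp, e.g. 14/Jul/2014:16:39:22
p = re.compile(r"\d+/\w+/\d+:\d+:\d+:\d+")
print(p.findall(Str))
# Client IP: an IPv4 address with whitespace on both sides
p = re.compile(r'\s(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s')
print(p.findall(Str))
# Destination host:port after "->"
p = re.compile(r"->\s(.*:\d+)")
print(p.findall(Str))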
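
As a variation on the separate patterns in this reply (not what the poster wrote), the fields can also be captured in one pass with a single pattern and named groups. This is only a sketch. It assumes every useful line has the same layout as the sample, and that the four wanted columns are the timestamp, client IP, user name and destination, which is what the replies in this thread extract. The group names `time`, `client`, `user` and `dest` are invented for the example.

```python
import re

LINE_RE = re.compile(
    r'\[(?P<time>[^\]]+?) CST\]\s+'             # timestamp without the zone
    r'\[\d+\]\s+'                               # numeric id, skipped
    r'(?P<client>\d{1,3}(?:\.\d{1,3}){3})\s+'   # client IP
    r'(?P<user>\S+)\s+'                         # user name
    r'.*->\s*(?P<dest>[^\s)]+)\)'               # destination host:port
)

line = ('[14/Jul/2014:16:39:22 CST] [4019943168] 10.6.99.163   test1 '
        '"CONNECT" STARTED 0 0 0 (10.6.99.163:63548 -> 10.203.19.28:8080)')

m = LINE_RE.search(line)
# search() returns None for lines that do not fit the layout,
# so malformed records can be skipped without an IndexError.
fields = m.groupdict() if m else None
print(fields)
```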

chosen86 2014-08-15

Quoting reply #5 by u013171165:

The two steps can be combined. If you post again I will not paste code; this is very simple.

Thanks a lot. I have only just started learning and the task is rather urgent; I hope you understand.

The_Third_Wave 2014-08-15

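Quoting reply #2 by chosen86:

[quote=reply #1 by u013171165]
Didn't I already write that for you?
[/quote]
It is not the same question. The previous one was about extracting the useful lines and writing them to a file. Now, starting from that file, I want to extract these four columns and write them to a new file, so that the final file contains only the four columns.

The two steps can be combined. If you post again I will not paste code; this is very simple.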

The_Third_Wave 2014-08-15

Oh, the line ip_port = re.findall(r'(?<=->).*?(?=\))'.format(ips), text)[0].strip() was copy-pasted, so it contains a bit more than it needs.

The_Third_Wave 2014-08-15

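import re

# Read the source log and write the four extracted fields to a new file.
with open(r'C:\Users\admin-ZH\Desktop\111.txt', 'r') as fr, \
        open(r'C:\Users\admin-ZH\Desktop\222.txt', 'w') as fw:
    for text in fr:
        # Timestamp: the text between "[" and " CST]".
        times = re.findall(r'(?<=\[).*?(?= CST\])', text)[0]
        # First IPv4 address on the line; [0] raises IndexError if there is none.
        ips = re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', text)[0]
        # User name: the text between that IP and the next double quote.
        test = re.findall(r'(?<={}).*?(?=")'.format(re.escape(ips)), text)[0].strip()
        # Destination: the text between "->" and ")". The .format(ips) call has
        # no effect here because the pattern contains no {} placeholder.
        ip_port = re.findall(r'(?<=->).*?(?=\))'.format(ips), text)[0].strip()
        infos = ' '.join([times, ips, test, ip_port]) + '\n'
        fw.write(infos)

# Print the result file to check it.
with open(r'C:\Users\admin-ZH\Desktop\222.txt', 'r') as fr:
    for text in fr:
        print(text)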

As a fellow student, I would still advise the OP not to simply ask for ready-made code.

chosen86 2014-08-15