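I have two files in the same format. Each line has two columns: a label and its associated number. The two files have different line counts, and most labels appear in both. I need to find the differences between them and write the result to a new file.
With OS commands, I can sort both files first and then run diff -u 1.txt 2.txt to get a combined output, then filter it with grep on the "+" and "-" markers.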
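For reference, the sort-then-diff idea can also be sketched with Python's standard difflib module. This is only a rough equivalent of the shell pipeline, not a replacement for it; the function name sorted_diff is made up for illustration, and the inputs are assumed to be plain lists of lines.

```python
import difflib


def sorted_diff(lines_a, lines_b):
    """Sort both inputs, run a unified diff, and keep only the +/- lines."""
    a = sorted(line.strip() for line in lines_a if line.strip())
    b = sorted(line.strip() for line in lines_b if line.strip())
    # n=0 drops context lines; lineterm="" avoids trailing newlines.
    diff = list(difflib.unified_diff(a, b, fromfile="1.txt",
                                     tofile="2.txt", lineterm="", n=0))
    # The first two entries are the "---"/"+++" file headers; skip them,
    # then drop the "@@" hunk markers by keeping only +/- lines.
    return [d for d in diff[2:] if d.startswith(("+", "-"))]


if __name__ == "__main__":
    old = ["x 1", "y 2", "z 3"]
    new = ["x 1", "y 5", "w 3"]
    for line in sorted_diff(old, new):
        print(line)
```

Like the diff -u output, this reports changed labels as a separate "-" line and "+" line, so it does not pair the two values of one label on a single row.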
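In Python I tried doing the comparison with pandas, loading the files into two DataFrames and outputting the rows that differ, but the result never matches what I need.
File 1.txt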
e39aa0582fb3d25b42a77fb9cb09472d 99
e39aa0582fb3d25b42a77fb9cb094721 99
e39aa0582fb3d25b42a77fb9cb094722 98
e39aa0582fb3d25b42a77fb9cbc94722 99
e39aa0582fb3d25b42a77fb9cb094723 99
e39aa0582fb3d25b42a77fb9cb094724 99
e39aa0582fb3d25b42a77fb9cb094725 99
File 2.txt
e39aa0582fb3d25b42a77fb9cb09472d 99
e39aa0582fb3d25b42a77fb9cb094721 99
e39aa0582fb3d25b42a77fb9cb094722 99
e39aa0582fb3d25b42a77fb9cb094723 99
e39aa0582fb3d25b42a77fb9cb094724 98
e39aa0582fb3d25b42a77fb9cb094734 98
e39aa0582fb3d25b42a77fb9cb094725 99
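I want to compare the two files and output only the differences. The desired result looks like the following: the two files are joined on uuid, and only the rows that differ are shown.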
uuid1 num1
e39aa0582fb3d25b42a77fb9cb094722 98.0
e39aa0582fb3d25b42a77fb9cbc94722 99.0
e39aa0582fb3d25b42a77fb9cb094724 99.0
e39aa0582fb3d25b42a77fb9cb094734 NaN
Ideally, the result would look like this:
uuid1 num1_x num1_y
e39aa0582fb3d25b42a77fb9cb094722 98.0 99.0
e39aa0582fb3d25b42a77fb9cbc94722 99.0 NaN
e39aa0582fb3d25b42a77fb9cb094724 99.0 98.0
e39aa0582fb3d25b42a77fb9cb094734 NaN 98.0
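One possible way to get a table of this shape is an outer merge on uuid1 followed by a filter on rows whose two values differ. The sketch below assumes each uuid appears at most once per file; the function name diff_by_uuid and the in-memory sample data are made up for illustration. Row order of the result can vary between pandas versions, so it should not be relied on.

```python
import io

import pandas as pd


def diff_by_uuid(src1, src2):
    """Return rows whose number differs, or whose uuid is in only one input."""
    cols = ["uuid1", "num1"]
    df1 = pd.read_csv(src1, sep=" ", header=None, names=cols)
    df2 = pd.read_csv(src2, sep=" ", header=None, names=cols)
    # The outer merge keeps uuids found in only one file; the missing side is NaN.
    merged = pd.merge(df1, df2, how="outer", on="uuid1")
    # NaN never compares equal to a number, so one-sided uuids are kept too.
    changed = merged["num1_x"] != merged["num1_y"]
    return merged[changed].reset_index(drop=True)


if __name__ == "__main__":
    # Paths such as "1.txt" and "2.txt" can be passed instead of StringIO.
    f1 = io.StringIO("a 99\nb 98\nc 99\n")
    f2 = io.StringIO("a 99\nb 99\nd 98\n")
    print(diff_by_uuid(f1, f2))
```

The result can then be written out with something like to_csv("diff.txt", sep=" ", index=False, na_rep="NaN"), where diff.txt is just a placeholder file name.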
I could not get this result with the pandas DataFrame compare or diff methods. Here is my sample code:
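import pandas as pd

# Load both files (space-separated, no header row) and name the columns
fd1 = pd.read_table('1.txt', sep=' ', header=None)
fd1.columns = ['uuid1', 'num1']
fd2 = pd.read_table('2.txt', sep=' ', header=None)
fd2.columns = ['uuid1', 'num1']
# Outer merge on uuid1: yields num1_x (from 1.txt) and num1_y (from 2.txt)
merge1 = pd.merge(fd1, fd2, how='outer', on='uuid1')
# Attempt with compare: rows are matched by row index (here, row position), not by uuid1
compare1 = fd2.compare(fd1, align_axis=1, keep_shape=True, keep_equal=True)
# Difference per uuid; NaN when the uuid exists in only one file
merge1['diff'] = merge1['num1_x'] - merge1['num1_y']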
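A likely reason the compare call does not give the wanted output is that DataFrame.compare matches rows by index label, which with the default index means row position, so an extra row in one file shifts everything after it. A minimal sketch of one workaround is to index both frames by uuid1 and reindex them to the union of uuids before comparing. The helper name compare_by_uuid is hypothetical, and the sketch assumes uuids are unique within each file.

```python
import pandas as pd


def compare_by_uuid(df1, df2):
    """Align two frames on uuid1, then let compare report differing rows."""
    a = df1.set_index("uuid1")
    b = df2.set_index("uuid1")
    # compare needs identical labels, so give both frames the same index.
    all_ids = a.index.union(b.index)
    a = a.reindex(all_ids)  # uuids missing from df1 become NaN rows
    b = b.reindex(all_ids)  # uuids missing from df2 become NaN rows
    # Result columns are ("num1", "self") and ("num1", "other").
    return a.compare(b)


if __name__ == "__main__":
    left = pd.DataFrame({"uuid1": ["a", "b", "c"], "num1": [99, 98, 99]})
    right = pd.DataFrame({"uuid1": ["a", "b", "d"], "num1": [99, 99, 98]})
    print(compare_by_uuid(left, right))
```

Alternatively, the merge1 frame from the snippet above is already close: since NaN != 0 evaluates to True, filtering with merge1[merge1['diff'] != 0] should keep both the changed rows and the rows present in only one file.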