37,743
社区成员




class test:
def __init__(self, name, value)
self.name = name
self.value = value
tlist = []
tlist.append(test('a', 1))
tlist.append(test('d', 2))
tlist.append(test('a', 2))
tlist.append(test('b', 8))
tlist.sort(lambda p1, p2:cmp(p1.value, p2.value))
class test:
def __init__(self, name, value):
self.name = name
self.value = value
def show(self):
print self.name,self.value
tlist = []
tlist.append(test('a', 1))
tlist.append(test('d', 2))
tlist.append(test('a', 2))
tlist.append(test('b', 8))
def my_sort(p1,p2):
if p1.name == p2.name:
return cmp(p1.value,p2.value)
else:
return cmp(p1.name,p2.name)
tlist.sort(my_sort)
for i in tlist:
i.show()
def fourth_word(ln1,ln2):
lst1 = string.split(ln1)
lst2 = string.split(ln2)
#-- Compare "long" lines
if len(lst1) >= 4 and len(lst2) >= 4:
return cmp(lst1[3:],lst2[3:])
#-- Long lines before short lines
elif len(lst1) >= 4 and len(lst2) < 4:
return -1
#-- Short lines after long lines
elif len(lst1) < 4 and len(lst2) >= 4:
return 1
else: # Natural order
return cmp(ln1,ln2)
2.1.1 Problem: Quickly sorting lines on custom criteria
Sorting is one of the real meat-and-potatoes algorithms of text processing and, in fact, of most programming. Fortunately for Python developers, the native [].sort method is extraordinarily fast. Moreover, Python lists with almost any heterogeneous objects as elements can be sorted—Python cannot rely on the uniform arrays of a language like C (an unfortunate exception to this general power was introduced in recent Python versions where comparisons of complex numbers raise a TypeError; and [1+1j,2+2j].sort() dies for the same reason; Unicode strings in lists can cause similar problems).
SEE ALSO: complex 22;
The list sort method is wonderful when you want to sort items in their "natural" order—or in the order that Python considers natural, in the case of items of varying types. Unfortunately, a lot of times, you want to sort things in "unnatural" orders. For lines of text, in particular, any order that is not simple alphabetization of the lines is "unnatural." But often text lines contain meaningful bits of information in positions other than the first character position: A last name may occur as the second word of a list of people (for example, with first name as the first word); an IP address may occur several fields into a server log file; a money total may occur at position 70 of each line; and so on. What if you want to sort lines based on this style of meaningful order that Python doesn't quite understand?
The list sort method [].sort() supports an optional custom comparison function argument. The job this function has is to return -1 if the first thing should come first, return 0 if the two things are equal order-wise, and return 1 if the first thing should come second. The built-in function cmp() does this in a manner identical to the default [].sort() (except in terms of speed, 1st.sort() is much faster than 1st.sort(cmp)). For short lists and quick solutions, a custom comparison function is probably the best thing. In a lot of cases, you can even get by with an in-line lambda function as the custom comparison function, which is a pleasant and handy idiom.
When it comes to speed, however, use of custom comparison functions is fairly awful. Part of the problem is Python's function call overhead, but a lot of other factors contribute to the slowness. Fortunately, a technique called "Schwartzian Transforms" can make for much faster custom sorts. Schwartzian Transforms are named after Randal Schwartz, who proposed the technique for working with Perl; but the technique is equally applicable to Python.
The pattern involved in the Schwartzian Transform technique consists of three steps (these can more precisely be called the Guttman-Rosler Transform, which is based on the Schwartzian Transform):
Transform the list in a reversible way into one that sorts "naturally."
Call Python's native [].sort() method.
Reverse the transformation in (1) to restore the original list items (in new sorted order).
The reason this technique works is that, for a list of size N, it only requires O(2N) transformation operations, which is easy to amortize over the necessary O(N log N) compare/flip operations for large lists. The sort dominates computational time, so anything that makes the sort more efficient is a win in the limit case (this limit is reached quickly).
Below is an example of a simple, but plausible, custom sorting algorithm. The sort is on the fourth and subsequent words of a list of input lines. Lines that are shorter than four words sort to the bottom. Running the test against a file with about 20,000 lines—about 1 megabyte—performed the Schwartzian Transform sort in less than 2 seconds, while taking over 12 seconds for the custom comparison function sort (outputs were verified as identical). Any number of factors will change the exact relative timings, but a better than six times gain can generally be expected.
schwartzian_sort.py
# Timing test for "sort on fourth word"
# Specifically, two lines >= 4 words will be sorted
# lexographically on the 4th, 5th, etc.. words.
# Any line with fewer than four words will be sorted to
# the end, and will occur in "natural" order.
import sys, string, time
wrerr = sys.stderr.write
# naive custom sort
def fourth_word(ln1,ln2):
lst1 = string.split(ln1)
lst2 = string.split(ln2)
#-- Compare "long" lines
if len(lst1) >= 4 and len(lst2) >= 4:
return cmp(lst1[3:],lst2[3:])
#-- Long lines before short lines
elif len(lst1) >= 4 and len(lst2) < 4:
return -1
#-- Short lines after long lines
elif len(lst1) < 4 and len(lst2) >= 4:
return 1
else: # Natural order
return cmp(ln1,ln2)
# Don't count the read itself in the time
lines = open(sys.argv[1]).readlines()
# Time the custom comparison sort
start = time.time()
lines.sort(fourth_word)
end = time.time()
wrerr("Custom comparison func in %3.2f secs\n" % (end-start))
# open('tmp.custom','w').writelines(lines)
# Don't count the read itself in the time
lines = open(sys.argv[1]).readlines()
# Time the Schwartzian sort
start = time.time()
for n in range(len(lines)): # Create the transform
1st = string.split(lines[n])
if len(lst) >= 4: # Tuple w/ sort info first
lines[n] = (1st[3:], lines[n])
else: # Short lines to end
lines[n] = (['\377'], lines[n])
lines.sort() # Native sort
for n in range(len(lines)): # Restore original lines
lines[n] = lines[n] [1]
end = time.time()
wrerr("Schwartzian transform sort in %3.2f secs\n" % (end-start))
# open('tmp.schwartzian','w').writelines(lines)
Only one particular example is presented, but readers should be able to generalize this technique to any sort they need to perform frequently or on large files.