Large-Scale Problem Solving

姐夫 2011-06-28 03:27:00

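The following is adapted from other people's posts.
1. You are given two files, A and B, each storing 5 billion URLs. Each URL takes 64 bytes and the memory limit is 4 GB. Find the URLs common to A and B. What if there are three, or even n, files?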
Analysis:
Use a Bloom filter for approximate set membership, which is what the intersection needs. The key idea is to map each element (here, a URL) to k bit positions using k different hash functions, then set those k bits to 1 in a bitmap. For a given URL, if one or more of its k bits are not 1, the URL is definitely not in the set. Otherwise, i.e., all k bits are 1, it is probably in the set.
The key problem in Bloom filtering is choosing the number of hash functions k and the bitmap length m (in bits), given the number of elements n and the target error rate e. With lg denoting the base-2 logarithm, we have in general:
m = 1.44 * n * lg( 1 / e )
k = ln 2 * ( m / n )
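For example, when the error rate e = 0.01, these formulas give m ≈ 9.6 n and k ≈ 7. (The frequently quoted m = 13 n actually corresponds to an error rate of about 0.002.)
So for 5 billion URLs at e = 0.01 we need about 5 billion × 9.6 ≈ 48 billion bits to record whether each URL has appeared. The available memory is 4G × 8 = 32 billion bits. Using those 32 billion bits instead (m ≈ 6.4 n) raises the error rate to roughly 0.05. So, first run the Bloom filter over the URLs in file A to build a bitmap, then check each URL in B against this bitmap. Note that this is an approximate solution. With this step in place, extending to multiple files is not much of a problem.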
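The approach above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `bloom_parameters`, `BloomFilter`, and `probable_common` are made up for this example, and the k hash functions are simulated by double hashing over a single SHA-256 digest, which is one common choice and not something the post prescribes.

```python
import hashlib
import math


def bloom_parameters(n, e):
    """Return (m, k) from the formulas above: m bits and k hash functions."""
    m = math.ceil(1.44 * n * math.log2(1 / e))
    k = max(1, round(math.log(2) * m / n))
    return m, k


class BloomFilter:
    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item):
        # Assumption: derive k positions by double hashing one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))


def probable_common(urls_a, urls_b, n, e=0.01):
    """Return URLs of B that are probably in A (false positives possible)."""
    m, k = bloom_parameters(n, e)
    bloom = BloomFilter(m, k)
    for url in urls_a:
        bloom.add(url)
    return [url for url in urls_b if url in bloom]
```

Every URL that really is in both files is returned; a small fraction of URLs that appear only in B may be returned as well, which is why the result is approximate.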

2. Given massive log data, extract the IP that visited Baidu the most times on a given day.
Analysis:
Use a hash table for frequency counting. In general this requires that all distinct elements fit in memory. Always consider hashing first because of its efficiency.
Since there are at most 2^32 (4G) IPv4 addresses, we just use a hash table to count the frequency of each one.
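A minimal sketch of the counting step, assuming the log has already been filtered down to the day and site of interest and that the IP is the first whitespace-separated field of each line (both are assumptions about the log format, not something the post specifies). The function name `most_frequent_ip` is made up for this example.

```python
from collections import Counter


def most_frequent_ip(log_lines):
    """Return (ip, count) for the most frequent IP, or None for empty input."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1  # assumed: the first field is the client IP
    return counts.most_common(1)[0] if counts else None
```

If the distinct IPs do not fit in memory, the same function can be applied to each of several smaller files produced by partitioning the log on a hash of the IP.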

3. A file contains some phone numbers, each an 8-digit number. Determine whether any of them are duplicated.
Analysis:
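Use a bitmap for existence queries on integers. Given an integer n, we just look up the n-th bit to see whether the number exists or not.
An 8-digit phone number is at most 99,999,999, so the bitmap needs 10^8 bits, which is only a dozen or so MB of memory. If we want to count how many times each number occurs instead, extend the bitmap to an integer array; in that case the bitmap effectively becomes a hash table whose hash function is the number itself.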
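The bitmap lookup can be sketched like this. The class `Bitmap` and the function `find_duplicates` are illustrative names, and the default size of 10^8 bits (about 12.5 MB) assumes the phone numbers are given as integers in the range 0 to 99,999,999.

```python
class Bitmap:
    def __init__(self, size):
        self.bits = bytearray((size + 7) // 8)  # one bit per possible value

    def test_and_set(self, n):
        """Set bit n and return whether it was already set."""
        byte, mask = n >> 3, 1 << (n & 7)
        seen = bool(self.bits[byte] & mask)
        self.bits[byte] |= mask
        return seen


def find_duplicates(numbers, size=10**8):
    """Return the set of numbers that occur more than once."""
    bitmap = Bitmap(size)
    duplicates = set()
    for n in numbers:
        if bitmap.test_and_set(n):
            duplicates.add(n)
    return duplicates
```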

4. Among 250 million integers, find how many integers are not repeated. Memory is not large enough to hold all 250 million integers.

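Extend the bitmap to use 2 bits per number: 0 means not seen, 1 means seen once, 2 means seen twice or more. Alternatively, instead of packing 2 bits per number, we can use two separate bitmaps to simulate this 2-bit map. This still needs only a few tens of MB of memory.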
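A sketch of the 2-bit map, assuming the values are non-negative integers below a known bound `value_range` (signed values would need an offset first). The function name `count_unique` and its parameters are made up for this example.

```python
def count_unique(numbers, value_range):
    """Count the values in [0, value_range) that occur exactly once."""
    # 2 bits per value, so 4 values share one byte.
    states = bytearray((value_range * 2 + 7) // 8)
    for n in numbers:
        byte, shift = n >> 2, (n & 3) * 2
        state = (states[byte] >> shift) & 3
        if state < 2:  # 0 -> 1 (seen once), 1 -> 2 (seen twice or more)
            cleared = states[byte] & ~(3 << shift) & 0xFF
            states[byte] = cleared | ((state + 1) << shift)
    once = 0
    for byte in states:
        for shift in (0, 2, 4, 6):
            if (byte >> shift) & 3 == 1:
                once += 1
    return once
```

The memory used is `value_range / 4` bytes, so the cost depends on the range of the values and not on how many numbers are read.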

5. Find the 100 largest numbers among 1,000,000 numbers.
Analysis:
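Use a heap to keep only a small fraction of the large-scale data in memory.
We only need to maintain a min-heap of 100 elements. If the current number is smaller than the smallest number in the heap, discard it; otherwise remove the smallest number from the heap and insert the current number. The complexity is O(n log m), where m is the heap size and n is the number of elements.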
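The min-heap procedure can be sketched with Python's standard `heapq` module. The function name `top_k` is illustrative.

```python
import heapq


def top_k(numbers, k=100):
    """Return the k largest numbers, largest first, using a size-k min-heap."""
    heap = []
    for x in numbers:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            # x beats the current minimum: pop the minimum and push x.
            heapq.heapreplace(heap, x)
    return sorted(heap, reverse=True)
```

Only k numbers are held at any time, so the input can be a stream read from disk.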

6. Find the median of 500 million ints.
Analysis:
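Use buckets to divide integer-related problems and conquer the parts separately. A merge step is probably required.
There are at most 2^32 distinct integers. Divide them into 2^16 buckets, so that each bucket covers a range of 2^32 / 2^16 = 64K values. Count how many of the 500 million numbers fall into each bucket; this tells us which bucket the median falls in. Going into that bucket, we can then find the median quickly.
Note that Problem 4 can also be solved with this idea: if the bit array cannot fit entirely in memory, first partition the numbers into multiple buckets, then process the buckets one by one.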
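A two-pass sketch of the bucket idea, assuming unsigned 32-bit values and a hypothetical callable `read_numbers` that returns a fresh iterator over the data each time it is called (for example, by reopening the file). For an even count it returns the lower of the two middle values.

```python
def bucket_median(read_numbers, bucket_bits=16, value_bits=32):
    """Return the (lower) median using one counting pass and one bucket pass."""
    shift = value_bits - bucket_bits
    counts = [0] * (1 << bucket_bits)
    total = 0
    for x in read_numbers():           # pass 1: count numbers per bucket
        counts[x >> shift] += 1
        total += 1
    if total == 0:
        raise ValueError("no numbers")
    target = (total - 1) // 2          # 0-based rank of the lower median
    seen = 0
    for bucket, c in enumerate(counts):
        if seen + c > target:          # the median lies in this bucket
            break
        seen += c
    # Pass 2: keep only the numbers that fall into the chosen bucket.
    in_bucket = sorted(x for x in read_numbers() if x >> shift == bucket)
    return in_bucket[target - seen]
```

If the data is heavily skewed and the chosen bucket is itself too large to sort in memory, the same counting step can be repeated inside that bucket on the low 16 bits.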

7. There is a 1 GB file in which each line is one word. No word is longer than 16 bytes, and the memory limit is 1 MB. Return the 100 most frequent words.
Analysis:
Using Trie to record a dictionary.
Here each word contains at most 16 characters. Normally a dictionary contains only a few thousand words, so it can be handled within 1 MB of memory.
Or,
Using External Sorting to sort the words.
Then scan the sorted file once to find the 100 most frequent words. A min-heap can be used to maintain them.
Or,
Using Map to store the frequency of the words.

The Trie and Map approaches both depend on the assumption that the dictionary fits in memory. External sorting does not require this.
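For the external sorting route, the final scan over the sorted words can be sketched as below. It assumes the sort itself has already been done by some external tool and that `sorted_words` is an iterator over the sorted lines; the function name `top_words_from_sorted` is made up for this example.

```python
import heapq
from itertools import groupby


def top_words_from_sorted(sorted_words, k=100):
    """Return up to k (count, word) pairs, most frequent first."""
    heap = []  # min-heap of (count, word); the root is the weakest candidate
    for word, run in groupby(sorted_words):
        count = sum(1 for _ in run)    # equal words are adjacent after sorting
        if len(heap) < k:
            heapq.heappush(heap, (count, word))
        elif (count, word) > heap[0]:
            heapq.heapreplace(heap, (count, word))
    return sorted(heap, reverse=True)
```

Only one run counter and k heap entries are held at a time, which is what lets this step fit into a very small memory budget.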

Now you know how to solve the problems below:

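8. Massive data is distributed across 100 computers. Devise an efficient way to compute the TOP 10 of this data.

9. There are 10 million strings, some of which are identical (duplicates). Remove all the duplicates and keep the strings that are not duplicated. How would you design and implement this?

10. How do you find the single most repeated item in massive data?

11. Given tens of millions or hundreds of millions of data items (with duplicates), find the N items that occur most often.

12. A text file has about ten thousand lines, one word per line. Find the ten most frequent words. Describe your approach and give a time complexity analysis.

13. A text file, again find the ten most frequent words, but this time the file is long, say hundreds of millions or a billion lines; in any case it cannot be read into memory at once. What is the best solution?

14. There are 10 files of 1 GB each. Every line of every file stores a user query, and queries may repeat within each file. Sort the queries by frequency.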

7 replies

天下第一好大人 2011-06-29

Exactly, which is why I said it is not enough.

An int covers a range of 2^32 values; at 2 bits per value, that is 2^33 bits = 8G bits = 1G bytes.

[Quote=Quoting reply #6 by perfectzjf:]
The maximum and minimum of int are both known. You can take it that this bounds the integer range.
[/Quote]

姐夫 2011-06-29

The maximum and minimum of int are both known. You can take it that this bounds the integer range.

天下第一好大人 2011-06-28

The problem does not bound the integer range. How would you represent them with 500 million bits?

[Quote=Quoting reply #4 by yaoweijq:]
500000000.0/8/1024/1024
/8 gives bytes
/1024 gives KB
/1024 again gives MB

Quoting reply #2 by gogdizzy:
Problem 4: how could it possibly take only a few tens of MB of memory...
[/Quote]