Large-Scale Problem Solving

姐夫 2011-06-28 03:27:00
The following is adapted and rewritten from other people's posts.
1. You are given two files, A and B, each storing 5 billion URLs; each URL takes 64 bytes, and the memory limit is 4 GB. Find the URLs common to A and B. What if there are three, or even n, files?
Analysis:
Using Bloom Filtering for set-membership queries. The key idea is to map each element (here, a URL) to k positions via k different hash functions, then set those k bits to 1 in a bitmap. For a given URL, if one or more of its k bits is not 1, the URL is definitely not contained in the set. Otherwise, i.e., if all k bits are 1, it is probably contained in the set.
The key design problem of Bloom filtering is choosing the number of hash functions k and the bitmap length m, given the number of elements n and the acceptable error rate e. In general we have:
m = 1.44 * n * log2( 1 / e )
k = ln 2 * ( m / n )
For example, at error rate e = 0.01 these give m ≈ 9.6 n and k ≈ 7.
So for 5 billion URLs we would need about 5 billion × 9.6 ≈ 48 billion bits to mark them, while the available memory is 4 GB × 8 = 32 billion bits. Squeezing into those 32 billion bits (about 6.4 bits per URL) raises the error rate, but keeps it under 5%. So: first run every URL of file A through the Bloom filter to build the bitmap, then check each URL of B against that bitmap. Note that the answer is approximate. Once this step works, extending to three or n files poses no real difficulty.
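
A minimal sketch of this approach in Python, assuming the URLs sit in two line-delimited files; the names A.txt and B.txt, the MD5-based double hashing, and the exact parameter values are illustrative choices, not part of the original problem:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m bits, k bit positions derived by double hashing."""

    def __init__(self, m_bits, k):
        self.m = m_bits
        self.k = k
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        # Split one MD5 digest into two 64-bit halves and combine them
        # (double hashing) to simulate k independent hash functions.
        digest = hashlib.md5(item).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, item):
        # False: definitely absent. True: probably present.
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))

bf = BloomFilter(m_bits=32 * 10**9, k=5)   # 32e9 bits = the full 4 GB budget
with open("A.txt", "rb") as fa:            # hypothetical input files
    for url in fa:
        bf.add(url.rstrip())
with open("B.txt", "rb") as fb:
    for url in fb:
        if bf.might_contain(url.rstrip()):
            print(url.decode())            # probably common to A and B (approximate)
```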

2. Massive log data: extract the IP that visited Baidu the most times on a given day.
Analysis:
Using a Hash Table for existence and frequency queries. In general, this requires that the set of distinct elements fits in memory; always consider hashing first, due to its efficiency.
Since there are at most 2^32 (about 4 billion) IP addresses, a hash table is enough to count the frequency of each one.
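
A sketch with Python's collections.Counter; the log file name and the assumption that each line starts with the client IP are illustrative:

```python
from collections import Counter

counts = Counter()
with open("access.log") as log:        # hypothetical log, IP as the first field
    for line in log:
        counts[line.split()[0]] += 1

ip, hits = counts.most_common(1)[0]    # the most frequent IP and its count
print(ip, hits)
```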

3. A file contains some phone numbers, each an 8-digit number. Determine whether any number is duplicated.
Analysis:
Using a Bit Map for existence queries over integers. Given an integer n, just look up the n-th bit to see whether the number has appeared or not.
An 8-digit phone number is at most 99,999,999, so the bitmap needs only a dozen or so MB of memory (10^8 bits = 12.5 MB). To count how many times each number appears instead, extend the bitmap to an array of integers; at that point the bitmap has effectively become a hash table whose hash function is the number itself.
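
A sketch of the bitmap check, assuming one 8-digit number per line in a hypothetical phones.txt:

```python
bits = bytearray(10**8 // 8)           # one bit per possible number: 12.5 MB

def seen(n):
    return bits[n >> 3] & (1 << (n & 7))

def mark(n):
    bits[n >> 3] |= 1 << (n & 7)

has_duplicate = False
with open("phones.txt") as f:          # hypothetical input file
    for line in f:
        n = int(line)
        if seen(n):
            has_duplicate = True
            break
        mark(n)
print(has_duplicate)
```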

4. Among 250 million integers, count how many integers appear exactly once; memory cannot hold all 250 million integers.

Extend the bit map by using 2 bits per number: 0 means not seen, 1 means seen once, and 2 means seen twice or more. Alternatively, instead of a literal 2-bit encoding, two ordinary bit maps can simulate the 2-bit map. This still takes only a few dozen MB of memory.
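
A sketch of the 2-bit map, assuming the integers arrive one per line in a hypothetical ints.txt. Note that covering the full 32-bit range costs 2 × 2^32 bits = 1 GB, which is exactly the point debated in the replies below:

```python
table = bytearray(2**32 // 4)              # 4 two-bit counters per byte: 1 GB

def bump(n):
    # Saturating 2-bit counter: 0 = unseen, 1 = seen once, 2 = seen >= twice.
    byte, shift = n >> 2, (n & 3) * 2
    c = (table[byte] >> shift) & 3
    if c < 2:
        table[byte] = (table[byte] & ~(3 << shift) & 0xFF) | ((c + 1) << shift)

with open("ints.txt") as f:                # hypothetical input file
    for line in f:
        bump(int(line) & 0xFFFFFFFF)       # map to an unsigned 32-bit index

# Count the slots left at exactly 1 (slow in pure Python; shown for clarity).
unique = sum((b >> s) & 3 == 1 for b in table for s in (0, 2, 4, 6))
print(unique)
```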

5. Find the largest 100 numbers among 1,000,000 numbers.
Analysis:
Using a Heap to maintain a small fraction of the large-scale data.
Maintain a min-heap of 100 elements: if the current number is smaller than the heap's minimum, discard it; otherwise remove the minimum and insert the current number. The complexity is n * log(m), where m is the heap size and n is the number of elements.
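
A sketch with Python's heapq; the random input is just a stand-in for the million numbers:

```python
import heapq, random

def top_k(numbers, k=100):
    heap = []                          # min-heap holding the k largest so far
    for x in numbers:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:              # beats the current smallest of the top k
            heapq.heapreplace(heap, x)
    return sorted(heap, reverse=True)

print(top_k(random.random() for _ in range(1_000_000))[:5])
```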

6. Find the median of 500 million ints.
Analysis:
Using Buckets to divide integer-related problems and conquer the pieces separately; a merge step may be required.
There are at most 2^32 distinct integers. Split the range into 2^16 buckets, so each bucket covers 2^32 / 2^16 = 64K values. Count how many of the 500 million numbers fall into each bucket; this tells you which bucket contains the median. Then go into that single bucket and find the median quickly.
Note that problem 4 can also be solved with this idea: if the bit array cannot fit in memory as a whole, first partition the numbers into several buckets, then process the buckets one by one.
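
A sketch of the two-pass bucket idea for problem 6, assuming the numbers are unsigned 32-bit values delivered by a replayable source (a plain list stands in for reading the file twice); for an even count this returns the lower median:

```python
def median_by_buckets(read_all, total):
    counts = [0] * (1 << 16)
    for x in read_all():                   # pass 1: histogram over high 16 bits
        counts[x >> 16] += 1
    target = total // 2                    # 0-based index of the (lower) median
    for bucket, c in enumerate(counts):
        if target < c:
            break                          # the median falls in this bucket
        target -= c
    # Pass 2: materialize just that one bucket (~total / 65536 values on average).
    inner = sorted(x for x in read_all() if x >> 16 == bucket)
    return inner[target]

data = [7, 3, 3, 900000, 42, 65536 * 5, 8]     # toy stand-in for 500M ints
print(median_by_buckets(lambda: data, len(data)))   # prints 8
```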

7. A 1 GB file contains one word per line; each word is at most 16 bytes, and the memory limit is 1 MB. Return the 100 most frequent words.
Analysis:
Using a Trie to record the dictionary.
Here each word contains at most 16 characters, and a typical dictionary holds only a few thousand distinct words, so it can be handled within 1 MB of memory.
Or,
Using External Sorting to sort the words.
Then scan the sorted file once, counting frequencies as you go; a min-heap of size 100 maintains the current top words.
Or,
Using a Map to store the frequency of each word.

The Trie and the Map both rely on the assumption that the dictionary fits in memory; External Sorting does not.
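
A trie-based sketch for the in-memory variant, assuming one word per line in a hypothetical words.txt:

```python
import heapq

class TrieNode:
    __slots__ = ("children", "count")
    def __init__(self):
        self.children = {}
        self.count = 0

def add_word(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.count += 1                        # a word's frequency lives at its last node

def top_words(root, k=100):
    out, stack = [], [(root, "")]
    while stack:                           # walk the trie, collect (count, word)
        node, prefix = stack.pop()
        if node.count:
            out.append((node.count, prefix))
        for ch, child in node.children.items():
            stack.append((child, prefix + ch))
    return heapq.nlargest(k, out)

root = TrieNode()
with open("words.txt") as f:               # hypothetical input: one word per line
    for line in f:
        add_word(root, line.strip())
for count, word in top_words(root):
    print(word, count)
```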

Now you know how to solve the problems below:

8. Massive data is distributed across 100 machines. Find an efficient way to compute the TOP 10 of the whole data set.

9. 10 million strings, some of which are duplicates. Remove all the duplicated ones and keep only the strings that have no duplicates. How would you design and implement this?

10. How do you find the single most frequently repeated item in massive data?

11. Tens of millions, or hundreds of millions, of data items (with duplicates): find the top N most frequent items.

12. A text file of about ten thousand lines, one word per line. Find the ten most frequent words; give your approach and a time-complexity analysis.

13. Again find the ten most frequent words in a text file, but this time the file is long, say hundreds of millions or billions of lines, and cannot be read into memory at once. Give the optimal solution.

14. There are 10 files of 1 GB each; every line of every file stores a user query, and queries may repeat across files. Sort the queries by their frequency.
7 replies
Exactly, and that's why I said it isn't enough.

An int ranges over 2^32 values; at 2 bits per value that is 2^33 bits = 8G bits = 1 GB.

[Quote=Reply #6 by perfectzjf:]
The max and min of an int are known, so you can take the problem as bounding the integer range.
[/Quote]
姐夫 2011-06-29
The max and min of an int are known, so you can take the problem as bounding the integer range.
The problem statement does not bound the integer range. How would you represent that with 500 million bits?

[Quote=Reply #4 by yaoweijq:]
500000000.0/8/1024/1024
/8 gives bytes
/1024 gives KB
another /1024 gives MB

Quote from reply #2 by gogdizzy:
Problem 4: how could that take only a few dozen MB of memory?
[/Quote]
yaoweijq 2011-06-28

500000000.0/8/1024/1024
/8 gives bytes
/1024 gives KB
another /1024 gives MB
[Quote=Reply #2 by gogdizzy:]
Problem 4: how could that take only a few dozen MB of memory?
[/Quote]
cnmhx 2011-06-28
nice job!
Problem 4: how could that take only a few dozen MB of memory?
xuexiaodong2009 2011-06-28
Impressive, I hadn't come across these before.
