The current report is the result of a study that addressed the following charge:
Assess the current state of data analysis for mining of massive sets and streams of data,
Identify gaps in current practice and theory, and
Propose a research agenda to fill those gaps.
Thus, this report examines the frontiers of research that is enabling the analysis of massive data. The major research areas covered are as follows:
Data representation, including characterizations of the raw data and transformations that are often applied to data, particularly transformations that attempt to reduce the representational complexity of the data;
Computational complexity issues and how the understanding of such issues supports characterization of the computational resources needed and of trade-offs among resources;
Statistical model-building in the massive data setting, including data cleansing and validation;
Sampling, both as part of the data-gathering process and as a key methodology for data reduction; and
Methods for including humans in the data-analysis loop through means such as crowdsourcing, where humans are used as a source of training data for learning algorithms, and visualization, which not only helps humans understand the output of an analysis but also provides human input into model revision.
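The purpose of these message-digest algorithms is to "compress" a large message into a secure format before it is signed with a private key by digital signature software (that is, to transform a byte string of arbitrary length into a large integer of fixed length). MD2, MD4, and MD5 each take a message of arbitrary length and produce a 128-bit message digest. Although the three algorithms are broadly similar in structure, the design of MD2 differs from that of MD4 and MD5, because MD2 was optimized for 8-bit machines, whereas MD4 and MD5 target 32-bit computers. Descriptions and C source code for the three algorithms are given in Internet RFCs 1319 (MD2), 1320 (MD4), and 1321 (MD5), published in April 1992; RFC 1321, the authoritative document for MD5, was submitted to the IETF by Ronald L. Rivest.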
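The fixed-length property described above can be illustrated with a minimal sketch. It assumes Python's standard `hashlib` module is available with MD5 enabled (some restricted builds disable it); the helper name `md5_digest_as_int` is hypothetical and introduced only for this illustration.

```python
import hashlib


def md5_digest_as_int(message: bytes) -> int:
    """Return the 128-bit MD5 digest of `message` as a non-negative integer.

    Hypothetical helper for illustration only; it is not part of any library.
    """
    digest = hashlib.md5(message).digest()  # always 16 bytes, i.e. 128 bits
    return int.from_bytes(digest, byteorder="big")


if __name__ == "__main__":
    # Inputs of very different lengths all map to a 128-bit value.
    for message in (b"", b"abc", b"x" * 1_000_000):
        value = md5_digest_as_int(message)
        # Print the input length and the digest as 32 hexadecimal digits.
        print(len(message), f"{value:032x}")
```

Whatever the input length, the resulting integer is below 2^128, which is the sense in which an arbitrary-length byte string is reduced to a fixed-length large integer before signing.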