String[] str = {"12","121"};
String[] str1 = {"12","121"};
String[] str2 = {"12","121","122"};
String[] str3 = {"123","1234","125","126","111"};
String[] str4 = {"1234","126","125","1232","111"};
String[] str6 = {"12","121","222","1433","1234","126","125","1232"};
// OccurrenceCounter and MAX_ELEMENTS_IN_AN_ARRAY are defined elsewhere and are not shown in this excerpt.
public static void main(String[] args) {
    int samples = 5000;       // distinct values an element can take
    int arrayCount = 200000;  // number of arrays to index
    int tests = 1000;         // number of lookups to time
    OccurrenceCounter counter = new OccurrenceCounter();
    Random random = new Random();
    // Fill every array with random values in [0, samples).
    String[][] arrays = new String[arrayCount][MAX_ELEMENTS_IN_AN_ARRAY];
    for (String[] array : arrays) {
        for (int i = 0; i < array.length; i++) {
            array[i] = String.valueOf(random.nextInt(samples));
        }
    }
    long start1 = System.nanoTime(); // start of initialization
    counter.addAll(arrays);
    int count = 0;
    long start2 = System.nanoTime(); // start of the lookups
    for (int i = 0; i < tests; i++) {
        String x = String.valueOf(random.nextInt(samples));
        String y = String.valueOf(random.nextInt(samples));
        // String z = String.valueOf(random.nextInt(samples));
        count += counter.count(x, y /* , z */);
    }
    System.out.println(count);
    long end = System.nanoTime();
    System.out.println((end - start1) * 0.000000001); // initialization + lookups, in seconds
    System.out.println((end - start2) * 0.000000001); // lookups only, in seconds
}
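The benchmark above calls addAll and count on an OccurrenceCounter class that does not appear in this excerpt. As a rough illustration of the "index once, query many times" idea only, here is a minimal sketch that assumes an inverted index from each element to a BitSet of array indices. The class name OccurrenceCounterSketch and its internals are hypothetical and are not the poster's implementation.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; NOT the OccurrenceCounter used in the benchmark above.
class OccurrenceCounterSketch {
    // element -> indices of the arrays that contain that element
    private final Map<String, BitSet> index = new HashMap<>();
    private int nextArrayId = 0;

    /** Indexes more arrays; can be called repeatedly for incremental additions. */
    void addAll(String[][] arrays) {
        for (String[] array : arrays) {
            int id = nextArrayId++;
            for (String element : array) {
                index.computeIfAbsent(element, k -> new BitSet()).set(id);
            }
        }
    }

    /** Returns how many indexed arrays contain every one of the given strings. */
    int count(String... strings) {
        if (strings.length == 0) {
            return 0;
        }
        BitSet result = null;
        for (String s : strings) {
            BitSet bits = index.get(s);
            if (bits == null) {
                return 0; // this element never occurs, so no array can match
            }
            if (result == null) {
                result = (BitSet) bits.clone(); // copy so the index stays intact
            } else {
                result.and(bits); // keep only arrays that also contain s
            }
        }
        return result.cardinality();
    }

    public static void main(String[] args) {
        OccurrenceCounterSketch counter = new OccurrenceCounterSketch();
        counter.addAll(new String[][] {
            {"12", "121"}, {"12", "121"}, {"12", "121", "122"}
        });
        System.out.println(counter.count("12", "121")); // prints 3
        System.out.println(counter.count("12", "122")); // prints 1
    }
}
```

In this sketch the memory cost grows with the number of distinct elements times the number of arrays, which is one way a space-for-time trade-off can show up.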
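I ran a test: the lookup speed (end - start2) fully meets the requirement. Finding the occurrence counts of 500 pairs among 200,000 arrays (5 elements per array, at most 5,000 distinct sample values in total) takes less than 200 ms. Initialization, however, is comparatively slow at over 1 second, and it may use as many as 900 partitions.
In other words, this algorithm is very fast to query but slow to initialize, and it is purely a space-for-time trade-off. Whether to adopt it depends entirely on the OP's situation. If the sample data is relatively fixed while different combinations are queried frequently, or if the data changes only through small batches of additions (addAll can be called multiple times), this method is a good fit. Conversely, if the large data set changes often, so that it may be completely different after two or three queries, it is not suitable.
public static List<Set<String>> allCombinations(String[] array) {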
    List<Set<String>> combinations = new LinkedList<Set<String>>();
    // Iterate the bitmasks in reverse, from 2^N - 1 down to 1;
    // bit j of the mask selects array[j].
    for (int i = 1 << array.length; i-- > 1;) {
        Set<String> combination = new HashSet<String>();
        for (int j = 0; j < array.length; j++) {
            if ((i & (1 << j)) != 0) {
                combination.add(array[j]);
            }
        }
        // Keep only combinations of at least two distinct elements.
        if (combination.size() >= 2) {
            combinations.add(combination);
        }
    }
    return combinations;
}
public static void main(String[] args) {
    String[][] arrays = {{"1", "2", "3", "4", "5"}, {"1", "3", "5"}};
    Map<Set<String>, Integer> counter = new HashMap<Set<String>, Integer>();
    for (String[] array : arrays) {
        List<Set<String>> allCombinations = allCombinations(array);
        for (Set<String> combination : allCombinations) {
            int count = 0;
            if (counter.containsKey(combination)) {
                count = counter.get(combination);
            }
            counter.put(combination, count + 1);
        }
    }
    System.out.println(counter);
}
The conceptual code is above.
Apart from some tuning of the containers' initial sizes, this is basically all there is to it.
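As one possible reading of the container-size tuning mentioned above, the following sketch presizes the HashMap and replaces the containsKey/get/put sequence with Map.merge. The class name CombinationCountSketch and the expectedCombinations parameter are assumptions made for illustration. It also deduplicates each array first, so a combination is counted at most once per array; this differs from the conceptual code, which can count the same combination more than once when an array contains repeated elements.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the tuning idea; names and parameters are assumptions.
class CombinationCountSketch {
    /**
     * Counts, for every combination of two or more distinct elements,
     * the number of arrays in which that combination occurs.
     * expectedCombinations is only a sizing hint for the map.
     */
    static Map<Set<String>, Integer> countCombinations(String[][] arrays,
                                                       int expectedCombinations) {
        // Presize so the map does not rehash repeatedly while it fills up.
        Map<Set<String>, Integer> counter =
                new HashMap<>(Math.max(16, expectedCombinations * 4 / 3 + 1));
        for (String[] array : arrays) {
            // Drop repeated elements so each combination counts once per array.
            String[] distinct =
                    new LinkedHashSet<>(Arrays.asList(array)).toArray(new String[0]);
            for (int mask = 3; mask < (1 << distinct.length); mask++) {
                if (Integer.bitCount(mask) < 2) {
                    continue; // skip single-element subsets
                }
                Set<String> combination = new HashSet<>();
                for (int j = 0; j < distinct.length; j++) {
                    if ((mask & (1 << j)) != 0) {
                        combination.add(distinct[j]);
                    }
                }
                counter.merge(combination, 1, Integer::sum);
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        String[][] arrays = {{"1", "2", "3", "4", "5"}, {"1", "3", "5"}};
        Map<Set<String>, Integer> counts = countCombinations(arrays, 64);
        System.out.println(counts.size()); // prints 26
        System.out.println(counts.get(new HashSet<>(Arrays.asList("1", "3", "5")))); // prints 2
    }
}
```

Each array of n distinct elements contributes 2^n - n - 1 combinations, so this approach is only practical while arrays stay short, as in the 5-element case discussed in this thread.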
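/**
 * Counts the number of arrays in which the given strings occur together.
 *
 * @param strings
 *            the strings that must occur together
 * @return the number of occurrences
 */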
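That is not what the OP asked for! The OP does not want the occurrence count of a single combination, but the occurrence counts of all possible combinations! See the code in reply #49, or help me optimize it; I would be very grateful!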