擂台：超大整数高精度快速算法

gxqcn 2004-05-10 02:08:44

我用 C++ 开发了一套《超大整数高精度快速算法》，前后花了十余年。

实现了如下功能：

⊙ 高精度快速加法
⊙ 高精度快速减法
⊙ 高精度快速乘法
⊙ 高精度快速除法
⊙ 高精度快速同余
⊙ 高精度快速乘方
⊙ 高精度快速开方
⊙ 高精度快速乘方取模
⊙ 高精度快速计算 Fibonacci 数列
⊙ 高精度快速阶乘、双阶乘
⊙ 高精度快速求排列
⊙ 高精度快速求组合
⊙ 高精度快速求最大公约数（支持群组运算）
⊙ 高精度快速求最小公倍数（支持群组运算）
⊙ 高精度快速“等幂和”（支持群组运算）
⊙ 高精度快速任意进位制转换
⊙ 高精度计时器（有暂停、累计、复位等功能）
⊙ 强大而灵活的输出

该套算法效率非常高，举例如下：

在 P4 1.7GHz 256MB 的机器上：
生成所有不大于 1亿的素数仅需 0.828s；
精确计算 100,000! 需要 2.078s；
精确计算 200,000! 需要 6.935s。

（我的算法纯粹是数学上的优化，未内嵌一行汇编代码）

最新版下载地址：http://maths.myrice.com/software.htm#02

因为 CSDN 是一个高手云集的地方，所以特借这块宝地广泛征集类似算法或软件，以期达到相互促进的目的。

注：
1、声明, 我的程序快只能说明我对此算法下了功夫，并不表明水平有多高，事实上，我的水平很有限（甚至还不会汇编）；
2、并不要求所有功能均须强于我现有的软件，有一两项也请及时告诉我，谢谢！
3、算法或软件不限于自身原创，欢迎提供第三方链接信息，谢谢！

...全文

4437 158 打赏收藏转发到动态举报

写回复

用AI写文章

158 条回复

切换为时间正序

请发表友善的回复…

发表回复

不羡仙z 2021-07-20

打赏
举报

100000!我只用了50ms啊

zhuzebin 2004-08-19

打赏
举报

Mark!

gxqcn 2004-07-21

打赏
举报

喜讯：HugeCalc Ver 3.0.0.1 (beta) 发布了！
=========================================

它采用了“快速数论算法”，速度得以再次提升！
里面的参数参考了apfloat，同时改写和优化了许多核心代码。

计算一百万的阶乘，在 P4 1.7GHz / 256MB RAM 的机器上需要 13.500s，
速度已超过 apfloat!

有兴趣者可下载：http://www.freewebs.com/maths/download/HugeCalc.rar
正式版将于 2004-08-08 正式发布（现仅需完善 help 文档了）。

//=================================================================

据说，论坛有回复 30 条限制（对楼主居然也不网开一面？），
所以，如果我一旦发言受限，将会立即结帖，并新开贴讨论。

原本打算来者有分，可惜我的级别太低，只能提供 100 分的给分权限，
到时，我将尽可能地公平给分，如果有不满意的，还请多多包涵！

（在本贴申明，主要是节省一次宝贵的发言机会；同时也消除一些不必要/无意义的误会）

wsmagic 2004-07-21

打赏
举报

本贴有续集，欢迎继续

请大家转至：
http://community.csdn.net/Expert/topic/3197/3197332.xml?temp=.4934198

gxqcn 2004-07-21

打赏
举报

test report:

P4 CPU 1.70GHz / 256MRAM / WinXP
================================

10,000! 0.031s 0.28462... x 10^35,660
20,000! 0.109s 0.18192... x 10^77,338
40,000! 0.390s 0.20916... x 10^166,714
80,000! 1.328s 0.30977... x 10^357,507
100,000! 1.969s 0.28242... x 10^456,574
200,000! 6.438s 0.14202... x 10^973,351
400,000! 20.828s 0.25344... x 10^2,067,110
800,000! 66.921s 0.56846... x 10^4,375,040
--------2004.05.28

HugeCalc 3.0.0.1 (beta)
10,000! 0.031s 0.28462... x 10^35,660
20,000! 0.078s 0.18192... x 10^77,338
40,000! 0.219s 0.20916... x 10^166,714
80,000! 0.562s 0.30977... x 10^357,507
100,000! 0.671s 0.28242... x 10^456,574
200,000! 1.640s 0.14202... x 10^973,351
400,000! 4.015s 0.25344... x 10^2,067,110
800,000! 8.859s 0.56846... x 10^4,375,040
1,000,000! 13.469s 0.82639... x 10^5,565,709
--------2004.07.21

apfloat 2.40 (shines(郭子) 提供的 FactorialApfloat.exe，2004-06-21)
10,000! 0.039s
20,000! 0.089s
40,000! 0.225s
80,000! 0.721s
100,000! 0.830s
200,000! 2.118s
400,000! 5.134s
800,000! 11.411s
1,000,000! 17.120s

Mathematica 4.0.0.0
10,000! 0.031s
20,000! 0.141s
40,000! 0.406s
80,000! 1.094s
100,000! 1.422s
200,000! 4.046s
400,000! 10.016s
800,000! 24.719s
1,000,000! 44.219s

shines77 2004-07-16

打赏
举报

汇报一下，目前我着眼于FFT的研究，FFT的代码优化是个严重的问题

yaos 2004-07-16

打赏
举报

shines啊，是全部95秒啊，计算时间才23秒啊

刚做输出到字符中，105秒，郁闷啊，嘿嘿，转换时间105-23=82，倒！！
linux下top查看的，99%以上CPU时间，还是比较准确的，不过内存占用一直在变，有动态内存管理，哎

下边是他的原代码，计算fact的函数

/* mpz_fac_ui(result, n) -- Set RESULT to N!.

Copyright 1991, 1993, 1994, 1995, 2000, 2001, 2002 Free Software Foundation,
Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library; see the file COPYING.LIB. If not, write to
the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston,
MA 02111-1307, USA. */

#include "gmp.h"
#include "gmp-impl.h"
#include "longlong.h"

/* Enhancements:

Data tables could be used for results up to 3 or 4 limbs to avoid
fiddling around with small quantities.

The product accumulation might be worth splitting out into something that
could be used elsewhere too.

The tree of partial products should be done with TMP_ALLOC, not mpz_init.
It should be possible to know a maximum size at each level.

Factors of two could be stripped from k to save some multiplying (but not
very much). The same could be done with factors of 3, perhaps.

The prime factorization of n! is easy to determine, it might be worth
using that rather than a simple 1 to n. The powering of primes could do
some squaring instead of multiplying. There's probably other ways to use
some squaring too. */

/* for single non-zero limb */
#define MPZ_SET_1_NZ(z,n) do { mpz_ptr __z = (z); ASSERT ((n) != 0); PTR(__z)[0] = (n); SIZ(__z) = 1; } while (0)

/* for single non-zero limb */
#define MPZ_INIT_SET_1_NZ(z,n) do { mpz_ptr __iz = (z); ALLOC(__iz) = 1; PTR(__iz) = __GMP_ALLOCATE_FUNC_LIMBS (1); MPZ_SET_1_NZ (__iz, n); } while (0)

/* for src>0 and n>0 */
#define MPZ_MUL_1_POS(dst,src,n) do { mpz_ptr __dst = (dst); mpz_srcptr __src = (src); mp_size_t __size = SIZ(__src); mp_ptr __dst_p; mp_limb_t __c; ASSERT (__size > 0); ASSERT ((n) != 0); MPZ_REALLOC (__dst, __size+1); __dst_p = PTR(__dst); __c = mpn_mul_1 (__dst_p, PTR(__src), __size, n); __dst_p[__size] = __c; SIZ(__dst) = __size + (__c != 0); } while (0)

void
mpz_fac_ui (mpz_ptr result, unsigned long int n)
{
#if SIMPLE_FAC
/* Be silly. Just multiply the numbers in ascending order. O(n**2). */
unsigned long int k;
mpz_set_ui (result, 1L);
for (k = 2; k <= n; k++)
mpz_mul_ui (result, result, k);
#else

/* Be smarter. Multiply groups of numbers in ascending order until the
product doesn't fit in a limb. Multiply these partial product in a
balanced binary tree fashion, to make the operand have as equal sizes
as possible. When the operands have about the same size, mpn_mul
becomes faster. */

unsigned long k;
mp_limb_t p, p1, p0;

/* Stack of partial products, used to make the computation balanced
(i.e. make the sizes of the multiplication operands equal). The
topmost position of MP_STACK will contain a one-limb partial product,
the second topmost will contain a two-limb partial product, and so
on. MP_STACK[0] will contain a partial product with 2**t limbs.
To compute n! MP_STACK needs to be less than
log(n)**2/log(BITS_PER_MP_LIMB), so 30 is surely enough. */
#define MP_STACK_SIZE 30
mpz_t mp_stack[MP_STACK_SIZE];

/* TOP is an index into MP_STACK, giving the topmost element.
TOP_LIMIT_SO_FAR is the largets value it has taken so far. */
int top, top_limit_so_far;

/* Count of the total number of limbs put on MP_STACK so far. This
variable plays an essential role in making the compututation balanced.
See below. */
unsigned int tree_cnt;

/* for testing with small limbs */
if (MP_LIMB_T_MAX < ULONG_MAX)
ASSERT_ALWAYS (n <= MP_LIMB_T_MAX);

top = top_limit_so_far = -1;
tree_cnt = 0;
p = 1;
for (k = 2; k <= n; k++)
{
/* Multiply the partial product in P with K. */
umul_ppmm (p1, p0, p, (mp_limb_t) k);

#if GMP_NAIL_BITS == 0
#define OVERFLOW (p1 != 0)
#else
#define OVERFLOW ((p1 | (p0 >> GMP_NUMB_BITS)) != 0)
#endif
/* Did we get overflow into the high limb, i.e. is the partial
product now more than one limb? */
if OVERFLOW
{
tree_cnt++;

if (tree_cnt % 2 == 0)
{
mp_size_t i;

/* TREE_CNT is even (i.e. we have generated an even number of
one-limb partial products), which means that we have a
single-limb product on the top of MP_STACK. */

MPZ_MUL_1_POS (mp_stack[top], mp_stack[top], p);

/* If TREE_CNT is divisable by 4, 8,..., we have two
similar-sized partial products with 2, 4,... limbs at
the topmost two positions of MP_STACK. Multiply them
to form a new partial product with 4, 8,... limbs. */
for (i = 4; (tree_cnt & (i - 1)) == 0; i <<= 1)
{
mpz_mul (mp_stack[top - 1],
mp_stack[top], mp_stack[top - 1]);
top--;
}
}
else
{
/* Put the single-limb partial product in P on the stack.
(The next time we get a single-limb product, we will
multiply the two together.) */
top++;
if (top > top_limit_so_far)
{
if (top > MP_STACK_SIZE)
abort();
/* The stack is now bigger than ever, initialize the top
element. */
MPZ_INIT_SET_1_NZ (mp_stack[top], p);
top_limit_so_far++;
}
else
MPZ_SET_1_NZ (mp_stack[top], p);
}

/* We ignored the last result from umul_ppmm. Put K in P as the
first component of the next single-limb partial product. */
p = k;
}
else
/* We didn't get overflow in umul_ppmm. Put p0 in P and try
with one more value of K. */
p = p0;
}

/* We have partial products in mp_stack[0..top], in descending order.
We also have a small partial product in p.
Their product is the final result. */
if (top < 0)
MPZ_SET_1_NZ (result, p);
else
MPZ_MUL_1_POS (result, mp_stack[top--], p);
while (top >= 0)
mpz_mul (result, result, mp_stack[top--]);

/* Free the storage allocated for MP_STACK. */
for (top = top_limit_so_far; top >= 0; top--)
mpz_clear (mp_stack[top]);
#endif
}

yaos 2004-07-16

打赏
举报

有点不公平啊，GMP输出到屏幕，你们输出到文件

嘿嘿，windows系统下，输出到屏幕一个近5M的文本，恐怕100秒不够吧
不很明白linux的屏幕输出原理，肯定比windows快很多，但是肯定比文件操作慢

用手表也不准确，输出前，可能有隐含操作

我找GMP转换函数直接转换吧

嘿嘿

shines77 2004-07-16

打赏
举报

to yaos:
昨天晚上我还担心你自己写的阶乘算法会有些慢，原来GMP自己写了阶乘的函数mpz_fac_ui(f, n);，总的来说GMP真的有点慢，计算时间都有些不太快（不知其算法怎么样，不好说，相信不一定是最好的），转换的过程就自然不必说了，一定是长过计算时间的

to liangbch(宝宝)：
我上面问“不知道你的输出时间是不是包括转换进制的时间，还是包括了其他诸如显示输出或输出到文本的时间，”是问yaos的，不过现在知道他是包括转换进制和显示屏幕的时间，
我的输出时间和你是一样的，都是包括转换成字符串，格式化，打开文件，写、关闭文件这些I/O时间

你的输出真是快到不行了，看来充分利用缓冲区是一个很好的技巧，“2.文件写入使用fwrite函数，一次写64k左右.”，看我的测试结果，看看10000的时候，够快吧，怎么比你上面测试的还快啊：（

please input a num, 0:exit
n=?10000
计算过程耗时0.058362s
输出过程耗时0.002710
please input a num, 0:exit
n=?100000
计算过程耗时3.389233s
输出过程耗时0.020861
please input a num, 0:exit
n=?200000
计算过程耗时12.100887s
输出过程耗时0.041514

yaos 2004-07-16

打赏
举报

介于karatsuba和FFT中间的乘法：
T(n) = O(n*2^sqrt(2*logn)*logn)

yaos 2004-07-16

打赏
举报

介于karatsuba和FFT中间的乘法：
T(n) = O(n*2*sqrt(2*logn)*logn)

gxqcn 2004-07-16

打赏
举报

哦，明白了，在这个特定的前提下（a、b 均小于 p），
a * b % p 最多只需调一次 div 指令，不必判断高位是否要做 div 的问题。
多谢提醒。

yaos 2004-07-16

打赏
举报

__asm
{
mov eax, a
mov edx, b
mul edx
div p
}

输出 eax商 ed余数

gxqcn 2004-07-16

打赏
举报

我用的是“unsigned __int64 型直接求余数”，
可否告诉我后两者如何实现，因为你的汇编实在太强了。

liangbch 2004-07-16

打赏
举报

to gxq:
影响 FNT 算法最重要是的模乘法(a * b % p)的速度如何，你的模乘法是如何实现的，是使用double型求商再使用乘法-减法余数，还是使用unsigned __int64 型直接求余数，还是使用整数汇编指令求余，还是使用整形汇编指令以乘代除法求余数。
我觉得你可以在这个方面作一些尝试，看看那个更快。

yaos 2004-07-16

打赏
举报

SSE2专门有乘加指令计算
a*b+c*d

gxqcn 2004-07-16

打赏
举报

这正是我的新开发的FNT偏慢的瓶颈所在。
尽管我采用了更合理的数据结构，和更优化的流程。
看来，又有一块硬骨头需要啃下去了。。。

yaos 2004-07-16

打赏
举报

哦，我曾经做过FPU代码优化，没成功
不知SSE2能行不，嘿嘿，用汇编速度至少提高３－１０倍

wsmagic 2004-07-16

打赏
举报

to yaos:
就是快速傅立叶变换啊。

我用printf("%d", data[i]);这样的方式，在Windows下输出10000000!的结果大致是50多秒，赛扬900，应该有理由相信Linux下屏幕输出会快一点，如果使用STL下的 cout << 输出那个才叫慢呢，虽然屏幕好像没那么花眼，但是也柔和得太慢了：）

to chenzhengzhanglu(陈正)：
我想大家都在努力啊

我在这个贴子里也只能以这个ID出现了，因为shines回复已经超过30次了（汗，这个蝈蝈俊，我想骂死他的）

yaos 2004-07-16