A Parallel GNFS Algorithm Based on a Reliable Look-Ahead Block Lanczos Method for Integer Factorization
Laurence T. Yang^{1,2}, Li Xu^2, Man Lin^2, and John Quinn^2

1 Department of Computer Science and Engineering, Jiangsu Polytechnic University, Changzhou, Jiangsu Province, 213164, P.R. China
2 Department of Computer Science, St. Francis Xavier University, Antigonish, Nova Scotia, B2G 2W5, Canada
{lyang, x2002uwf, mlin, jquinn}@stfx.ca

Abstract. The Rivest-Shamir-Adleman (RSA) algorithm is a very popular and secure public-key cryptosystem, but its security relies on the difficulty of factoring large integers. The General Number Field Sieve (GNFS) algorithm is currently the best known method for factoring large integers of more than 110 digits. Our previous work on the parallel GNFS algorithm, which integrated Montgomery's block Lanczos method to solve the large and sparse linear systems over GF(2), is less reliable. In this paper, we have successfully implemented and integrated the parallel GNFS algorithm with a new look-ahead block Lanczos method for solving the large and sparse linear systems generated by the GNFS algorithm. The new method is based on the look-ahead technique and is more reliable, avoiding the breakdowns of the algorithm that arise over GF(2). It can also find more dependencies than Montgomery's block Lanczos method with fewer iterations. Detailed experimental results on a SUN cluster are presented in this paper as well.

1 Introduction

Today, the Rivest-Shamir-Adleman (RSA) algorithm [21] is the most popular algorithm in public-key cryptosystems, and it is widely used in real-world applications such as Internet Explorer, email systems, online banking, cell phones, etc. The security of this algorithm relies mainly on the difficulty of factoring large integers. Many integer factorization algorithms have been developed; examples are trial division [22], Pollard's p-1 algorithm [19], Lenstra's elliptic curve factorization (ECM) [13], the Quadratic Sieve (QS) [20] and the General Number Field Sieve (GNFS) [1,2,3,15]. GNFS is the best known method so far for factoring large composite numbers of more than 110 digits. Although the GNFS algorithm is the fastest algorithm known, it still takes a long time to factor large integers. In order to reduce the execution time, one natural solution is to use parallel computers. The GNFS algorithm contains several
steps. The most time consuming step is sieving, which is used to generate enough relations. This step is very suitable for parallelization because the relations are generated independently. Another step that can benefit from parallel processing is Montgomery's block Lanczos method [16], which is used to solve the large and sparse linear systems over GF(2) generated by the GNFS algorithm. The disadvantage of this block Lanczos method is its unreliability. The look-ahead block Lanczos method proposed in [8] overcomes this disadvantage and improves the overall reliability of the block Lanczos algorithm. There are numerous references on look-ahead block Lanczos methods [6,7,9,18], but none of those methods can be applied directly to the field GF(2). The algorithm we are developing and implementing is well suited to solving the generated large and sparse linear systems over small finite fields such as GF(2). In this paper we have successfully developed and implemented the look-ahead block Lanczos method and integrated it with the GNFS algorithm for integer factorization.

The rest of the paper is organized as follows: we first briefly describe the original GNFS algorithm in Section 2. Then we present two block Lanczos methods, namely Montgomery's block Lanczos method [16] and the look-ahead block Lanczos method [8], in Sections 3 and 4, respectively. Section 5 describes the parallel implementation, and Section 6 presents the corresponding performance results.

2 The GNFS Algorithm

The General Number Field Sieve (GNFS) algorithm [2,3,15] is derived from the number field sieve (NFS) algorithm developed by Lenstra et al. [14]. It is the fastest known algorithm for integer factorization. The idea of GNFS comes from the congruence-of-squares method [12]. Suppose we want to factor an integer n that has two prime factors p and q. Assume we have two integers s and r such that s^2 \equiv r^2 \pmod{n}. Since n = pq, the following conditions must hold [2]:

pq \mid (s^2 - r^2) \;\Rightarrow\; pq \mid (s-r)(s+r) \;\Rightarrow\; p \mid (s-r)(s+r) \text{ and } q \mid (s-r)(s+r).

We know that if c \mid ab and \gcd(b,c) = 1, then c \mid a. So p, q, r and s must satisfy p \mid (s-r) or p \mid (s+r), and q \mid (s-r) or q \mid (s+r). Based on this, it can be proved that we can find factors of n by computing the greatest common divisors \gcd(n, s+r) and \gcd(n, s-r), with probability 2/3 (see [2]). Therefore, the essence of the GNFS algorithm is to factor n by computing \gcd(n, s+r) and \gcd(n, s-r); a small sketch of this gcd step is given after the step list below. There are six major steps [15]:

1. Selecting parameters: choose an integer m \in \mathbb{Z} and a polynomial f which satisfies f(m) \equiv 0 \pmod{n}.
2. Defining three factor bases: rational factor base R, algebraic factor base A and quadratic character base Q.


Table 1. The composite number n and the results after integer factorization

name      number
tst10030  727563736353655223147641208603 = 743774339337499 × 978204944528897
F739      680564733841876926926749214863536422914 = 5704689200685129054721 × 59649589127497217
tst15045  799356282580692644127991443712991753990450969 = 32823111293257851893153 × 24353458617583497303673
Briggs51  556158012756522140970101270050308458769458529626977 = 1236405128000120870775846228354119184397 × 449818591141
tst20061  1241445153765162090376032461564730757085137334450817128010073 = 1127192007137697372923951166979 × 1101360855918052649813406915187
tst25076  3675041894739039405533259197211548846143110109152323761665377505538520830273 = 69119855780815625390997974542224894323 × 53169119831396634916152282437374262651

3. Sieving: generate enough pairs (a, b) (relations) to build a linear dependence.
4. Processing relations: filter out useful pairs (a, b) found from sieving.
5. Building up and solving a large and sparse linear system over GF(2).
6. Square root: use the results from the previous step to generate two perfect squares, then factor n.
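To make the congruence-of-squares idea above concrete, the following minimal C sketch performs the final gcd step using the GMP library [4], which our implementation also links against. The values n = 8051, s = 90, r = 7 are a made-up toy example satisfying s^2 ≡ r^2 (mod n); they are unrelated to the test cases in Table 1.

```c
/* Toy illustration of the congruence-of-squares step:
 * given s and r with s^2 = r^2 (mod n), factors of n are
 * recovered from gcd(n, s+r) and gcd(n, s-r).
 * Build: gcc cong_squares.c -lgmp */
#include <stdio.h>
#include <gmp.h>

int main(void)
{
    mpz_t n, s, r, t, g1, g2;
    mpz_init(n); mpz_init(s); mpz_init(r);
    mpz_init(t); mpz_init(g1); mpz_init(g2);

    mpz_set_ui(n, 8051);   /* n = 83 * 97 (toy example)      */
    mpz_set_ui(s, 90);     /* s^2 = 8100 = 49 + 8051         */
    mpz_set_ui(r, 7);      /* r^2 = 49, so s^2 = r^2 (mod n) */

    mpz_add(t, s, r);      /* t = s + r       */
    mpz_gcd(g1, n, t);     /* gcd(n, s + r)   */

    mpz_sub(t, s, r);      /* t = s - r       */
    mpz_gcd(g2, n, t);     /* gcd(n, s - r)   */

    gmp_printf("n = %Zd = %Zd * %Zd\n", n, g1, g2);

    mpz_clear(n); mpz_clear(s); mpz_clear(r);
    mpz_clear(t); mpz_clear(g1); mpz_clear(g2);
    return 0;
}
```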

Based on previous studies, the most time consuming step is step 3, sieving. In our previous work [23,24], we have successfully implemented the sieving in parallel with very scalable performance. In this paper, we focus on another time consuming part, namely solving the large and sparse linear systems over GF(2) in parallel.

3 Montgomery's Block Lanczos Method

Montgomery's block Lanczos method was proposed by Montgomery in 1995 [16]. It is a variant of the standard Lanczos method [10,11]; both Lanczos methods are used to solve large and sparse linear systems. In the standard Lanczos method, suppose we have a symmetric matrix A \in \mathbb{R}^{n \times n}. Based on the notation used in [16], the method can be described as follows:

w_0 = b, \qquad w_i = A w_{i-1} - \sum_{j=0}^{i-1} \frac{w_j^T A^2 w_{i-1}}{w_j^T A w_j} \, w_j.   (1)

The iteration stops when w_i = 0. The vectors \{w_0, w_1, \ldots, w_{m-1}\} form a basis of span\{b, Ab, A^2 b, \ldots\} with the properties

w_i^T A w_i \ne 0, \quad 0 \le i < m,   (2)
w_i^T A w_j = w_j^T A w_i = 0, \quad 0 \le i < j < m.   (3)



The solution x can be computed as follows:

x = \sum_{j=0}^{m-1} \frac{w_j^T b}{w_j^T A w_j} \, w_j.   (4)

Furthermore, the iteration for w_i can be simplified to

w_i = A w_{i-1} - \frac{(A w_{i-1})^T (A w_{i-1})}{w_{i-1}^T (A w_{i-1})} \, w_{i-1} - \frac{(A w_{i-2})^T (A w_{i-1})}{w_{i-2}^T (A w_{i-2})} \, w_{i-2}.
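As a concrete illustration of the recurrences (1)-(4), the following self-contained C sketch applies the simplified standard Lanczos iteration to a small dense symmetric positive definite system over the reals and accumulates the solution as in (4). The 4 x 4 matrix and right-hand side are arbitrary example values of ours, and the sketch is not the sparse GF(2) solver used inside GNFS.

```c
/* Standard Lanczos iteration (real arithmetic) for a small dense
 * symmetric positive definite system A x = b, following Eqs. (1)-(4).
 * The 4x4 matrix below is an arbitrary illustrative example.
 * Build: gcc -std=c99 lanczos_demo.c */
#include <stdio.h>

#define N 4

static void matvec(const double A[N][N], const double v[N], double out[N]) {
    for (int i = 0; i < N; i++) {
        out[i] = 0.0;
        for (int j = 0; j < N; j++) out[i] += A[i][j] * v[j];
    }
}

static double dot(const double a[N], const double b[N]) {
    double s = 0.0;
    for (int i = 0; i < N; i++) s += a[i] * b[i];
    return s;
}

int main(void) {
    double A[N][N] = {{4,1,0,0},{1,3,1,0},{0,1,2,1},{0,0,1,3}};
    double b[N] = {1, 2, 0, 1};

    double w_prev2[N] = {0}, w_prev[N], w[N];
    double Aw_prev[N], Aw_prev2[N] = {0};
    double x[N] = {0};

    for (int i = 0; i < N; i++) w_prev[i] = b[i];     /* w_0 = b */
    matvec(A, w_prev, Aw_prev);

    for (int iter = 0; iter < N; iter++) {
        /* accumulate x += (w^T b / w^T A w) * w for the current w */
        double coef = dot(w_prev, b) / dot(w_prev, Aw_prev);
        for (int i = 0; i < N; i++) x[i] += coef * w_prev[i];

        /* three-term recurrence for the next w */
        double c1 = dot(Aw_prev, Aw_prev) / dot(w_prev, Aw_prev);
        double c2 = (iter > 0)
                  ? dot(Aw_prev2, Aw_prev) / dot(w_prev2, Aw_prev2) : 0.0;
        double norm = 0.0;
        for (int i = 0; i < N; i++) {
            w[i] = Aw_prev[i] - c1 * w_prev[i] - c2 * w_prev2[i];
            norm += w[i] * w[i];
        }
        if (norm < 1e-24) break;                      /* w_i = 0: stop */

        for (int i = 0; i < N; i++) {                 /* shift iterates */
            w_prev2[i] = w_prev[i];
            Aw_prev2[i] = Aw_prev[i];
            w_prev[i] = w[i];
        }
        matvec(A, w_prev, Aw_prev);
    }

    printf("x = ");
    for (int i = 0; i < N; i++) printf("%.6f ", x[i]);
    printf("\n");
    return 0;
}
```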

The total time for the standard Lanczos method is O(dn^2) + O(n^2), where d is the average number of nonzero entries per column.

Montgomery's block Lanczos method is an extension of the standard Lanczos method to the field GF(2). The major problem when working over GF(2) is that inner products are very likely to become zero because of the binary entries; the algorithm then breaks down and cannot proceed. Montgomery's block Lanczos method is the first attempt to avoid such breakdowns by using N vectors at a time (N is the length of the computer word). Instead of iterating on single vectors, whose inner products easily vanish, we iterate on subspaces. First we generate subspaces satisfying

\mathcal{W}_i \text{ is } A\text{-invertible}, \qquad \mathcal{W}_j^T A \mathcal{W}_i = \{0\} \ (i \ne j), \qquad A\mathcal{W} \subseteq \mathcal{W}, \quad \text{where } \mathcal{W} = \mathcal{W}_0 + \mathcal{W}_1 + \cdots + \mathcal{W}_{m-1}.   (5)

Then we define x to be

x = \sum_{j=0}^{m-1} W_j (W_j^T A W_j)^{-1} W_j^T b,   (6)

where W_j is a basis of \mathcal{W}_j. The iteration of the standard Lanczos method is changed to

W_i = V_i S_i,
V_{i+1} = A W_i S_i^T + V_i - \sum_{j=0}^{i} W_j C_{i+1,j} \quad (i \ge 0),   (7)
\mathcal{W}_i = \langle W_i \rangle,

in which

C_{i+1,j} = (W_j^T A W_j)^{-1} W_j^T A (A W_i S_i^T + V_i).   (8)

This iteration stops when V_i^T A V_i = 0, at which point i = m. The iteration can also be further simplified to

V_{i+1} = A V_i S_i S_i^T + V_i D_{i+1} + V_{i-1} E_{i+1} + V_{i-2} F_{i+1},


where D_{i+1}, E_{i+1}, F_{i+1} are

D_{i+1} = I_N - W_i^{inv}\,(V_i^T A^2 V_i S_i S_i^T + V_i^T A V_i),
E_{i+1} = -W_{i-1}^{inv}\, V_i^T A V_i S_i S_i^T,
F_{i+1} = -W_{i-2}^{inv}\,(I_N - V_{i-1}^T A V_{i-1} W_{i-1}^{inv})(V_{i-1}^T A^2 V_{i-1} S_{i-1} S_{i-1}^T + V_{i-1}^T A V_{i-1}) S_i S_i^T.

Here S_i is an N \times N_i projection matrix (N is the length of the computer word and N_i \le N). The cost of Montgomery's block Lanczos method is thereby reduced to O(n^2) + O(dn^2/N).
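The role of the word length N can be seen in how blocks of vectors over GF(2) are handled in practice. The following C fragment, an illustration of ours rather than code from [15] or [16], stores a block of N = 64 binary vectors of length n as n 64-bit words and computes the block product V^T W over GF(2) with bitwise operations; operations of this kind replace the scalar inner products that would otherwise frequently vanish.

```c
/* Word-packed GF(2) block operations: a block of N = 64 vectors of
 * length n is stored as n 64-bit words (word i holds the i-th entry
 * of all 64 vectors).  The block product V^T W over GF(2) is then a
 * 64x64 bit matrix built from ANDs and XORs.  Illustrative sketch. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define WORD_BITS 64

/* out[k] = row k of the 64x64 GF(2) matrix V^T W,
 * where V and W are n x 64 bit-packed blocks. */
void block_inner_product(const uint64_t *V, const uint64_t *W,
                         size_t n, uint64_t out[WORD_BITS])
{
    for (int k = 0; k < WORD_BITS; k++) out[k] = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t vi = V[i];
        /* bit k of V[i] decides whether row i of W is XOR-ed
         * into row k of the result */
        for (int k = 0; k < WORD_BITS; k++)
            if ((vi >> k) & 1u)
                out[k] ^= W[i];
    }
}

int main(void)
{
    /* two tiny blocks with n = 3 rows, only a few bits set */
    uint64_t V[3] = {0x1, 0x3, 0x2};
    uint64_t W[3] = {0x2, 0x2, 0x1};
    uint64_t P[WORD_BITS];

    block_inner_product(V, W, 3, P);
    printf("row 0 of V^T W = 0x%llx\n", (unsigned long long)P[0]);
    printf("row 1 of V^T W = 0x%llx\n", (unsigned long long)P[1]);
    return 0;
}
```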

4 Look-Ahead Block Lanczos Method

The look-ahead block Lanczos method over small finite fields such as GF(2) that we develop in this paper is mainly based on the method proposed in [8]. This look-ahead block Lanczos method has some advantages over Montgomery's block Lanczos method. First of all, it is bi-orthogonalizing, so the input matrix generated from GNFS does not need to be symmetric. In order to apply Montgomery's block Lanczos method, we need to multiply the coefficient matrix A with A^T; however, over GF(2) the rank of the product A^T A is, in general, much less than that of A. Thus, when applied to find elements of the nullspace of A, Montgomery's block Lanczos method may find many spurious vectors. Secondly, and more importantly, it solves the breakdown problem we mentioned before, namely W_i^T A W_i = \{0\}. Due to the limited space, we only outline the algorithm in this paper.

First we choose v_0 and u_0 from K^{n \times N}. Then we compute v_1, v_2, \ldots, v_{m-1} and u_1, u_2, \ldots, u_{m-1}, trying to achieve the following conditions:

- K(A^T, u_0) = \bigoplus_{i=0}^{m-1} u_i and K(A, v_0) = \bigoplus_{i=0}^{m-1} v_i.
- Each subspace u_i and v_i is of dimension at most N.
- For all 0 \le i \ne j \le m-1, the bi-orthogonality condition u_i^T A v_j = 0 holds.

Then we can decompose the vector spaces u_i and v_i. Define variables \bar{v}_i, \bar{u}_i, \hat{v}_i, \hat{u}_i, \sigma_i^v and \sigma_i^u having the properties

- \hat{v}_i^T A v_i = 0,
- u_i^T A \hat{v}_i = 0,
- \bar{u}_i^T A \bar{v}_i is invertible,

and

u_i := (\bar{u}_i \,|\, \hat{u}_i) = u_i \sigma_i^u,   (9)
v_i := (\bar{v}_i \,|\, \hat{v}_i) = v_i \sigma_i^v,   (10)

where \sigma_i^v and \sigma_i^u are two invertible matrices in K^{N \times N}. This decomposition may be computed by performing a Gauss-Jordan decomposition of the matrix u_i^T A v_i and using the output to select the independent row and column vectors, which then correspond to the columns of \bar{v}_i and \bar{u}_i, respectively. The matrices \sigma_i^u and \sigma_i^v permute these columns to the front and apply the row and column dependencies, respectively, to give \hat{u}_i and \hat{v}_i. We write \bar{v}_i = v_i \nu_i and \bar{u}_i = u_i \mu_i for the matrices representing this decomposition. Through this, v_{i+1} and u_{i+1} can then be computed by:
v_{i+1} = A v_i - \sum_{k=0}^{i} \bar{v}_k (\bar{u}_k^T A \bar{v}_k)^{-1} \bar{u}_k^T A^2 v_i,   (11)

u_{i+1} = A^T u_i - \sum_{k=0}^{i} \bar{u}_k (\bar{v}_k^T A^T \bar{u}_k)^{-1} \bar{v}_k^T (A^T)^2 u_i.   (12)

In computing u_{i+2} and v_{i+2} in the next iteration, we have the following situation:

(u_{i-1} \,|\, u_i)^T A (v_{i-1} \,|\, v_i \,|\, v_{i+1} \,|\, A v_{i+1}) =
\begin{pmatrix} r_{i-1,i-1} & 0 & 0 & u_{i-1}^T A^2 v_{i+1} \\ 0 & r_{i,i} & r_{i,i+1} & s_{i+1,i+2} \end{pmatrix}.   (13)
Since r_{i-1,i-1} and r_{i,i} are assumed invertible (modifying the procedure to operate when one of them is not invertible is straightforward), elimination steps that zero u_{i-1}^T A^2 v_{i+1} and s_{i+1,i+2} are performed. The cases where r_{i,i+1} has full rank and where it does not are handled differently. We continue in the same manner until all rows corresponding to u_i have an associated invertible minor. The iterative formulas then simplify to:
u_{i+1} = A^T u_i - \sum_{k=0}^{i} \bar{u}_k^i ((\bar{v}_k^i)^T A^T \bar{u}_k^i)^{-1} (\bar{v}_k^i)^T (A^T)^2 u_i,   (14)

v_{i+1} = A v_i - \sum_{k=0}^{i} \bar{v}_k^i ((\bar{u}_k^i)^T A \bar{v}_k^i)^{-1} (\bar{u}_k^i)^T A^2 v_i.   (15)

The elimination and decomposition steps presented above do not by themselves yield sufficient orthogonality conditions to allow easy computation of a candidate system solution. We therefore continue the elimination and decomposition until the matrix has a permuted block diagonal structure, in which the non-zero parts are clustered as closely around the diagonal as possible; please refer to [8] for details. Eventually, the solution x can be calculated by:
x = \sum_{i=0}^{m-1} \bar{v}_i^m ((\bar{u}_i^m)^T A \bar{v}_i^m)^{-1} (\bar{u}_i^m)^T b.   (16)
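The decomposition described above hinges on a Gauss-Jordan style computation over GF(2) that identifies an invertible minor, i.e. the independent rows and columns of u_i^T A v_i. The following C fragment is a generic illustration of such a rank-profile computation on a word-packed N x N bit matrix; it is our own sketch and not the routine used in Hovinen's code [8].

```c
/* Gauss-Jordan elimination over GF(2) on an N x N bit matrix whose
 * rows are packed into 64-bit words.  Marks the pivot (independent)
 * columns, i.e. the columns that would be kept in the "barred" part
 * of the decomposition.  Illustrative sketch only. */
#include <stdint.h>
#include <stdio.h>

#define N 64

/* rows[] is modified in place; indep[c] is set to 1 if column c is
 * a pivot column.  Returns the rank of the matrix. */
int gf2_rank_profile(uint64_t rows[N], int indep[N])
{
    int rank = 0;
    for (int c = 0; c < N; c++) indep[c] = 0;

    for (int c = 0; c < N && rank < N; c++) {
        /* find a row at or below 'rank' with a 1 in column c */
        int pivot = -1;
        for (int r = rank; r < N; r++)
            if ((rows[r] >> c) & 1u) { pivot = r; break; }
        if (pivot < 0) continue;          /* column c is dependent */

        /* move the pivot row up and clear column c everywhere else */
        uint64_t tmp = rows[rank]; rows[rank] = rows[pivot]; rows[pivot] = tmp;
        for (int r = 0; r < N; r++)
            if (r != rank && ((rows[r] >> c) & 1u))
                rows[r] ^= rows[rank];

        indep[c] = 1;
        rank++;
    }
    return rank;
}

int main(void)
{
    uint64_t M[N] = {0};
    int indep[N];

    /* tiny example: rows 0 and 1 are independent, row 2 = row 0 ^ row 1 */
    M[0] = 0x5;            /* 0101 */
    M[1] = 0x3;            /* 0011 */
    M[2] = 0x6;            /* 0110 */

    int rank = gf2_rank_profile(M, indep);
    printf("rank = %d, independent columns:", rank);
    for (int c = 0; c < N; c++)
        if (indep[c]) printf(" %d", c);
    printf("\n");
    return 0;
}
```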

5 Parallel Implementation Details

As we mentioned before, the most time consuming part of GNFS is sieving. This part has already been parallelized in our previous work [23,24], and this paper is built on top of that previous parallel implementation. Our overall parallel code is built on the sequential GNFS source code from Monico [15].



5.1 Hardware and Programming Environment

The whole implementation is built on two software packages: the sequential GNFS code from Monico [15] (written in ANSI C) and the sequential look-ahead block Lanczos code from Hovinen [8] (written in C++). For the parallel implementation, the MPICH1 (Message Passing Interface) library [5], version 1.2.5.2, is used. GMP 4.x [4] is also used for multiple-precision arithmetic. We use the GNU compiler to compile the whole program and MPICH1 [17] as our MPI library. The cluster we use is a Sun cluster from the University of New Brunswick, Canada, whose system configuration is:

- Model: Sun Microsystems V60.
- Architecture: x86 cluster.
- Processor count: 164.
- Master processor: 3 GB registered DDR-266 ECC SDRAM.
- Slave processor: 2 to 3 GB registered DDR-266 ECC SDRAM.

In the program, each slave processor only communicates with the master processor. Figure 1 shows the flow chart of our parallel program.
[Figure 1: flow chart of the parallel program - Main Program, MPI_Init, master and slave processes, MPI_Finalize, Main Program.]

Fig. 1. Each processor does the sieving at the same time, and all the slave nodes send their results back to the master node
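The master/slave structure of Figure 1 can be summarized by the following minimal MPI sketch in C. It only illustrates how each process could sieve its own sub-range and how the results are gathered at rank 0 with a reduction; the function do_sieve_range and the chunking scheme are hypothetical placeholders, not the actual logic of our implementation.

```c
/* Master/slave skeleton corresponding to Fig. 1: every process sieves
 * its own sub-range, and the results are collected at the master
 * (rank 0).  do_sieve_range() is a hypothetical placeholder.
 * Build: mpicc sieve_skel.c && mpirun -np 4 ./a.out */
#include <mpi.h>
#include <stdio.h>

static long do_sieve_range(long a_lo, long a_hi)
{
    /* placeholder: pretend every range yields (a_hi - a_lo) relations */
    return a_hi - a_lo;
}

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const long a_min = 0, a_max = 1000000;   /* whole sieving interval */
    long chunk = (a_max - a_min) / nprocs;
    long lo = a_min + rank * chunk;
    long hi = (rank == nprocs - 1) ? a_max : lo + chunk;

    long found = do_sieve_range(lo, hi);     /* every process sieves */

    long total = 0;                          /* results go to the master */
    MPI_Reduce(&found, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("collected %ld relations from %d processes\n", total, nprocs);

    MPI_Finalize();
    return 0;
}
```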

6 Performance Evaluation

We have six test cases, each with a different size of n; they are all listed in Table 1. The sieving time increases as the size of n increases. Table 2 shows the average sieving time for each n with one processor. Table 3 shows the number of processors used for each test case.


Table 2. Average sieving time for each n

name      number of sieves  average sieve time (s)
tst10030  1                 35.6
F739      1                 28.8
tst15045  5                 50.6
Briggs51  3                 85.67
tst20061  7                 560.6
tst25076  7                 4757.91

Table 3. Number of processors for each test case

name      number of slave processors
tst10030  1, 2, 4, 8, 16
F739      1, 2, 4, 8, 16
tst15045  1, 2, 4, 8, 16
Briggs51  1, 2, 4, 8, 16
tst20061  1, 2, 4, 8, 16
tst25076  1, 2, 4, 8, 16
[Figure 2 plots total execution time (s) against the number of processors for tst100 and F7.]

Fig. 2. Execution time for tst100 and F7
[Figure 3 plots total execution time (s) against the number of processors for tst150 and Briggs, and for tst200.]

Fig. 3. Execution time for tst150, Briggs and tst200


[Figure 4 plots total sieve time (s) against the number of processors for tst100, F7, tst150 and Briggs, and for tst200.]

Fig. 4. Sieve time for tst100, F7, tst150, Briggs and tst200
[Figure 5 plots total execution time and total sieve time (s), and speedup and efficiency, against the number of processors for tst250.]

Fig. 5. Total execution time, sieve time, speedup and efficiency for test case tst250
[Figure 6 plots speedup and parallel efficiency against the number of processors for Briggs, F7, tst100, tst150 and tst200.]

Fig. 6. Speedups and parallel efficiency

Figures 2 and 3 show the total execution time for each test case in seconds. The total sieve times for test cases tst100, F7, tst150, Briggs and tst200 are presented in Figure 4. Figure 5 gives the total execution time, sieve time, speedup and parallel efficiency for test case tst250 with different processor numbers. Figure 6 gives the speedups and parallel efficiency for each test case with different processor numbers.
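The speedup and parallel efficiency reported in Figures 5 and 6 are computed from the measured times; assuming the usual definitions (not spelled out in the text above), with T_1 the single-processor time and T_p the time on p processors,

S_p = \frac{T_1}{T_p}, \qquad E_p = \frac{S_p}{p}.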



Additionally, based on our comparisons on a few limited test cases, the look-ahead block Lanczos method can find more dependencies than Montgomery's block Lanczos method with fewer iterations. We will report the details in future publications.

Acknowledgements
We would like to thank C. Monico of Texas Tech University and B. Hovinen of the University of Waterloo. Our work is based on their sequential source codes, and they also helped us with some technical problems through email. Dr. Silva from IBM gave us many good ideas for our parallel program.

References
1. M. E. Briggs. An introduction to the general number field sieve. Master's thesis, Virginia Polytechnic Institute and State University, 1998.
2. M. Case. A beginner's guide to the general number field sieve. Oregon State University, ECE575 Data Security and Cryptography Project, 2003.
3. J. Dreibellbis. Implementing the general number field sieve. Pages 5-14, June 2003.
4. T. Granlund. The GNU Multiple Precision Arithmetic Library. TMG Datakonsult, Boston, MA, USA, 2.0.2 edition, June 1996.
5. W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, 1994.
6. M. H. Gutknecht. Block Krylov space methods for linear systems with multiple right-hand sides. In The Joint Workshop on Computational Chemistry and Numerical Analysis (CCNA2005), Tokyo, Dec 2005.
7. M. H. Gutknecht and T. Schmelzer. A QR-decomposition of block tridiagonal matrices generated by the block Lanczos process. In Proceedings IMACS World Congress, Paris, July 2005.
8. B. Hovinen. Blocked Lanczos-style algorithms over small finite fields. Master's thesis, University of Waterloo, Canada, 2004.
9. R. Lambert. Computational Aspects of Discrete Logarithms. PhD thesis, University of Waterloo, 1996.
10. C. Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. Journal of Research of the National Bureau of Standards, 45:255-282, 1950.
11. C. Lanczos. Solutions of linear equations by minimized iterations. Journal of Research of the National Bureau of Standards, 49:33-53, 1952.
12. A. K. Lenstra. Integer factoring. Designs, Codes and Cryptography, 19(2-3):101-128, 2000.
13. H. W. Lenstra. Factoring integers with elliptic curves. Annals of Mathematics (2), 126:649-673, 1987.
14. H. W. Lenstra, C. Pomerance, and J. P. Buhler. Factoring integers with the number field sieve. In The Development of the Number Field Sieve, volume 1554 of Lecture Notes in Mathematics, pages 50-94. Springer-Verlag, New York, 1993.
15. C. Monico. General number field sieve documentation. GGNFS Documentation, Nov 2004.
16. P. L. Montgomery. A block Lanczos algorithm for finding dependencies over GF(2). In Proceedings of EUROCRYPT '95, volume 921 of LNCS, pages 106-120. Springer, 1995.



17. MPICH. http://www-unix.mcs.anl.gov/mpi/mpich/.
18. B. N. Parlett, D. R. Taylor, and Z. A. Liu. A look-ahead Lanczos algorithm for unsymmetric matrices. Mathematics of Computation, 44:105-124, 1985.
19. J. M. Pollard. Theorems on factorization and primality testing. In Proceedings of the Cambridge Philosophical Society, pages 521-528, 1974.
20. C. Pomerance. The quadratic sieve factoring algorithm. In Proceedings of the EUROCRYPT 84 Workshop on Advances in Cryptology: Theory and Applications of Cryptographic Techniques, pages 169-182. Springer-Verlag, 1985.
21. R. L. Rivest, A. Shamir, and L. M. Adleman. A method for obtaining digital signatures and public-key cryptosystems. Technical Report MIT/LCS/TM-82, 1977.
22. M. C. Wunderlich and J. L. Selfridge. A design for a number theory package with an optimized trial division routine. Communications of the ACM, 17(5):272-276, 1974.
23. L. Xu, L. T. Yang, and M. Lin. Parallel general number field sieve method for integer factorization. In Proceedings of the 2005 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA-05), pages 1017-1023, Las Vegas, USA, June 2005.
24. L. T. Yang, L. Xu, and M. Lin. Integer factorization by a parallel GNFS algorithm for public key cryptosystem. In Proceedings of the 2005 International Conference on Embedded Software and Systems (ICESS-05), pages 683-695, Xi'an, China, December 2005.