Lecture 3: Huge-scale optimization problems
Yurii Nesterov, CORE/INMA (UCL)

September 10, 2012


Outline

1. Problem sizes
2. Random coordinate search
3. Confidence level of solutions
4. Sparse optimization problems
5. Sparse updates for linear operators
6. Fast updates in computational trees
7. Simple subgradient methods
8. Application examples


Nonlinear Optimization: problem sizes

Class         Operations   Dimension       Iter. cost     Memory
Small-size    All          10^0 - 10^2     n^4 -> n^3     Kilobyte: 10^3
Medium-size   A^{-1}       10^3 - 10^4     n^3 -> n^2     Megabyte: 10^6
Large-scale   Ax           10^5 - 10^7     n^2 -> n       Gigabyte: 10^9
Huge-scale    x + y        10^8 - 10^12    n -> log n     Terabyte: 10^12

Sources of Huge-Scale problems:
- Internet (New)
- Telecommunications (New)
- Finite-element schemes (Old)
- Partial differential equations (Old)


Very old optimization idea: Coordinate Search

Problem:
$$\min_{x \in \mathbb{R}^n} f(x) \qquad (f \text{ is convex and differentiable}).$$

Coordinate relaxation algorithm. For $k \ge 0$ iterate:
1. Choose the active coordinate $i_k$.
2. Update $x_{k+1} = x_k - h_k f'_{i_k}(x_k)\, e_{i_k}$, ensuring $f(x_{k+1}) \le f(x_k)$.

($e_i$ is the $i$-th coordinate vector in $\mathbb{R}^n$.)

Main advantage: Very simple implementation.
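A minimal Python sketch of this relaxation scheme; the random coordinate choice, the step-halving rule that enforces monotonicity, and the small quadratic test function are illustrative assumptions, not part of the slides:

    import numpy as np

    def coordinate_search(grad_i, f, x0, h0=1.0, iters=1000, rng=np.random.default_rng(0)):
        # Coordinate relaxation: pick a coordinate, step along -f'_i, and
        # halve the step until the monotonicity condition f(x+) <= f(x) holds.
        x = np.array(x0, dtype=float)
        for _ in range(iters):
            i = rng.integers(len(x))          # active coordinate i_k (random choice)
            g = grad_i(x, i)
            h = h0
            while h > 1e-12:
                x_new = x.copy()
                x_new[i] -= h * g             # x_{k+1} = x_k - h_k f'_{i_k}(x_k) e_{i_k}
                if f(x_new) <= f(x):
                    x = x_new
                    break
                h *= 0.5                      # ensure f(x_{k+1}) <= f(x_k)
        return x

    # Example: f(x) = 1/2 <Ax, x> - <b, x> with a small SPD matrix A.
    A = np.array([[2.0, 0.5], [0.5, 1.0]]); b = np.array([1.0, 1.0])
    f = lambda x: 0.5 * x @ A @ x - b @ x
    grad_i = lambda x, i: A[i] @ x - b[i]
    print(coordinate_search(grad_i, f, np.zeros(2)))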


Possible strategies

1. Cyclic moves. (Difficult to analyze.)
2. Random choice of coordinate. (Why?)
3. Choose the coordinate with the maximal directional derivative.

Complexity estimate: assume $\|f'(x) - f'(y)\| \le L\,\|x - y\|$ for all $x, y \in \mathbb{R}^n$, and choose $h_k = \frac{1}{L}$. Then
$$f(x_k) - f(x_{k+1}) \ \ge\ \frac{1}{2L}\,|f'_{i_k}(x_k)|^2 \ \ge\ \frac{1}{2nL}\,\|f'(x_k)\|^2 \ \ge\ \frac{1}{2nLR^2}\,(f(x_k) - f^*)^2.$$
(For the Gradient Method, drop the factor $n$.) Hence,
$$f(x_k) - f^* \ \le\ \frac{2nLR^2}{k}, \qquad k \ge 1.$$

This is the only theoretical result known for CDM!


Criticism

Theoretical justification:
- Complexity bounds are not known for most of the schemes.
- The only justified scheme needs computation of the whole gradient. (Why not use GM then?)

Computational complexity:
- Fast differentiation: if the function is defined by a sequence of operations, then $C(f') \le 4\,C(f)$.
- Can we do anything without computing the function's values?

Result: CDM are almost out of computational practice.


Google problem

Let $E \in \mathbb{R}^{n\times n}$ be an incidence matrix of a graph. Denote $e = (1,\dots,1)^T$ and $\bar E = E \cdot \mathrm{diag}(E^T e)^{-1}$. Thus, $\bar E^T e = e$.

Our problem is as follows: Find $x^* \ge 0$ such that $\bar E x^* = x^*$.

Optimization formulation:
$$f(x) \ \stackrel{\mathrm{def}}{=}\ \tfrac{1}{2}\,\|\bar E x - x\|^2 + \tfrac{1}{2}\,[\langle e, x\rangle - 1]^2 \ \to\ \min_{x \in \mathbb{R}^n}.$$
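A minimal dense Python sketch of this construction; the tiny 0/1 incidence matrix and the weight 1/2 on the penalty term are illustrative assumptions (real instances are sparse and huge):

    import numpy as np

    def make_E_bar(E):
        # E_bar = E * diag(E^T e)^{-1}: normalize each column to sum to one
        col_sums = E.T @ np.ones(E.shape[0])
        return E / col_sums

    def f(x, E_bar):
        g = E_bar @ x - x
        return 0.5 * np.dot(g, g) + 0.5 * (np.sum(x) - 1.0) ** 2

    E = np.array([[0., 1., 1.],
                  [1., 0., 0.],
                  [1., 1., 0.]])     # hypothetical 3-node link structure
    E_bar = make_E_bar(E)            # column-stochastic: E_bar^T e = e
    x = np.ones(3) / 3
    print(f(x, E_bar))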


Huge-scale problems

Main features:
- The size is very big ($n \ge 10^7$).
- The data is distributed in space.
- The requested parts of the data are not always available.
- The data may be changing in time.

Consequences: the simplest operations become expensive or infeasible:
- Update of the full vector of variables.
- Matrix-vector multiplication.
- Computation of the objective function's value, etc.


Structure of the Google Problem

Let us look at the gradient of the objective:
$$\nabla_i f(x) = \langle a_i, g(x)\rangle + [\langle e, x\rangle - 1], \quad i = 1,\dots,n, \qquad g(x) = \bar E x - x \in \mathbb{R}^n, \quad \bar E = (a_1,\dots,a_n).$$

Main observations:
- The coordinate move $x_+ = x - h_i \nabla_i f(x)\, e_i$ needs $O(p_i)$ a.o. ($p_i$ is the number of nonzero elements in $a_i$).
- The diagonal Hessian entries $d_i \stackrel{\mathrm{def}}{=} \mathrm{diag}_i\!\left(\nabla^2 f\right) = \mathrm{diag}_i\!\left(\bar E^T \bar E + ee^T\right) = \frac{1}{p_i} + 1$ are available. We can use them for choosing the step sizes ($h_i = \frac{1}{d_i}$).

Reasonable coordinate choice strategy? Random!
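A minimal sketch of one such coordinate move with cost O(p_i); the column store `cols` (nonzero indices and values of a_i) and the incrementally maintained quantities are illustrative assumptions:

    import numpy as np

    def coordinate_step(i, x, g, s, cols, d):
        """One move x+ = x - h_i * grad_i f(x) * e_i with h_i = 1/d_i.
        g = E_bar @ x - x and s = <e, x> are maintained incrementally."""
        idx, val = cols[i]                      # nonzeros of column a_i
        grad_i = val @ g[idx] + (s - 1.0)       # <a_i, g(x)> + [<e,x> - 1]
        delta = -grad_i / d[i]
        x[i] += delta
        g[idx] += delta * val                   # update the E_bar @ x part
        g[i] -= delta                           # update the "- x" part
        s += delta                              # update <e, x>
        return x, g, s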


Random coordinate descent methods (RCDM)

$$\min_{x \in \mathbb{R}^N} f(x) \qquad (f \text{ is convex and differentiable}).$$

Main Assumption:
$$|f'_i(x + h_i e_i) - f'_i(x)| \ \le\ L_i\,|h_i|, \qquad h_i \in \mathbb{R},\ i = 1,\dots,N,\ x \in \mathbb{R}^N,$$
where $e_i$ is a coordinate vector. Then
$$f(x + h_i e_i) \ \le\ f(x) + f'_i(x)\,h_i + \frac{L_i}{2}\,h_i^2, \qquad x \in \mathbb{R}^N,\ h_i \in \mathbb{R}.$$

Define the coordinate steps: $T_i(x) \stackrel{\mathrm{def}}{=} x - \frac{1}{L_i}\, f'_i(x)\, e_i$. Then
$$f(x) - f(T_i(x)) \ \ge\ \frac{1}{2L_i}\,[f'_i(x)]^2, \qquad i = 1,\dots,N.$$


Random choice for coordinates
We need a special random counter $R_\alpha$, $\alpha \in \mathbb{R}$:
$$\mathrm{Prob}\,[i] \ =\ p^{(i)}_\alpha \ =\ L_i^\alpha \cdot \left[\sum_{j=1}^N L_j^\alpha\right]^{-1}, \qquad i = 1,\dots,N.$$
Note: $R_0$ generates the uniform distribution.

Method RCDM($\alpha$, $x_0$). For $k \ge 0$ iterate:
1) Choose $i_k = R_\alpha$.
2) Update $x_{k+1} = T_{i_k}(x_k)$.
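A minimal sketch of RCDM(alpha, x0); the callback grad_i returning the partial derivative f'_i(x) is an assumed interface:

    import numpy as np

    def rcdm(grad_i, L, x0, alpha=1.0, iters=1000, rng=np.random.default_rng(0)):
        L = np.asarray(L, dtype=float)
        p = L**alpha / np.sum(L**alpha)   # Prob[i] proportional to L_i^alpha
        x = np.array(x0, dtype=float)
        for _ in range(iters):
            i = rng.choice(len(L), p=p)   # random counter R_alpha
            x[i] -= grad_i(x, i) / L[i]   # coordinate step T_i(x)
        return x

Note that rng.choice with explicit probabilities costs O(N) per draw; the binary-tree counter described later in these slides brings this down to O(log N).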


Complexity bounds for RCDM
We need to introduce the following norms for $x, g \in \mathbb{R}^N$:
$$\|x\|_\alpha = \left[\sum_{i=1}^N L_i^\alpha\, [x^{(i)}]^2\right]^{1/2}, \qquad \|g\|_\alpha^* = \left[\sum_{i=1}^N \frac{1}{L_i^\alpha}\, [g^{(i)}]^2\right]^{1/2}.$$

After $k$ iterations, RCDM($\alpha$, $x_0$) generates a random output $x_k$, which depends on $\xi_k = \{i_0,\dots,i_k\}$. Denote $\phi_k = E_{\xi_{k-1}} f(x_k)$.

Theorem. For any $k \ge 1$ we have
$$\phi_k - f^* \ \le\ \frac{2}{k}\left[\sum_{j=1}^N L_j^\alpha\right] \cdot R_{1-\alpha}^2(x_0),$$
where $R_\beta(x_0) = \max\limits_x \left\{\max\limits_{x^*\in X^*} \|x - x^*\|_\beta :\ f(x) \le f(x_0)\right\}$.


Interpretation

Denote $S_\alpha = \sum\limits_{i=1}^N L_i^\alpha$.

1. $\alpha = 0$. Then $S_0 = N$, and we get
$$\phi_k - f^* \ \le\ \frac{2N}{k}\cdot R_1^2(x_0).$$
Note: We use the metric $\|x\|_1^2 = \sum_{i=1}^N L_i\,[x^{(i)}]^2$. A matrix with diagonal $\{L_i\}_{i=1}^N$ can have its norm equal to $n$.

Hence, for GM we can guarantee the same bound. But its cost of iteration is much higher!


Interpretation
2. $\alpha = \tfrac{1}{2}$. Denote
$$D_\infty(x_0) = \max_x\left\{\max_{y\in X^*}\ \max_{1\le i\le N} |x^{(i)} - y^{(i)}| :\ f(x)\le f(x_0)\right\}.$$
Then $R_{1/2}^2(x_0) \le S_{1/2}\, D_\infty^2(x_0)$, and we obtain
$$\phi_k - f^* \ \le\ \frac{2}{k}\left[\sum_{i=1}^N L_i^{1/2}\right]^2 \cdot D_\infty^2(x_0).$$
Note: For first-order methods, the worst-case complexity of minimizing over a box depends on $N$. Since $S_{1/2}$ can be bounded, RCDM can be applied in situations where the usual GM fails.


Interpretation

3. $\alpha = 1$. Then $R_0(x_0)$ is the size of the initial level set in the standard Euclidean norm. Hence,
$$\phi_k - f^* \ \le\ \frac{2}{k}\sum_{i=1}^N L_i \cdot R_0^2(x_0) \ =\ \frac{2N}{k}\cdot\left[\frac{1}{N}\sum_{i=1}^N L_i\right]\cdot R_0^2(x_0).$$
The rate of convergence of GM can be estimated as
$$f(x_k) - f^* \ \le\ \frac{\gamma}{2k}\, R_0^2(x_0),$$
where $\gamma$ satisfies the condition $f''(x) \preceq \gamma\cdot I$, $x\in\mathbb{R}^N$.

Note: the maximal eigenvalue of a symmetric matrix can reach its trace. In the worst case, the rate of convergence of GM is the same as that of RCDM.


Minimizing the strongly convex functions
Theorem. Let $f(x)$ be strongly convex with respect to $\|\cdot\|_{1-\alpha}$ with convexity parameter $\sigma_{1-\alpha} > 0$. Then, for $\{x_k\}$ generated by RCDM($\alpha$, $x_0$) we have
$$\phi_k - f^* \ \le\ \left(1 - \frac{\sigma_{1-\alpha}}{S_\alpha}\right)^k (f(x_0) - f^*).$$

Proof: Let $x_k$ be generated by RCDM after $k$ iterations. Let us estimate the expected result of the next iteration:
$$f(x_k) - E_{i_k}\big(f(x_{k+1})\big) \ =\ \sum_{i=1}^N p^{(i)}_\alpha\,\big[f(x_k) - f(T_i(x_k))\big] \ \ge\ \sum_{i=1}^N \frac{p^{(i)}_\alpha}{2L_i}\,[f'_i(x_k)]^2 \ =\ \frac{1}{2S_\alpha}\big(\|f'(x_k)\|^*_{1-\alpha}\big)^2 \ \ge\ \frac{\sigma_{1-\alpha}}{S_\alpha}\,(f(x_k) - f^*).$$
It remains to compute the expectation in $\xi_{k-1}$.


Confidence level of the answers

Note: We have proved that the expected values of the random $f(x_k)$ are good. Can we guarantee anything after a single run?

Confidence level: the probability $\beta \in (0,1)$ that some statement about the random output is correct.

Main tool: Chebyshev inequality ($\xi \ge 0$):
$$\mathrm{Prob}\,[\xi \ge T] \ \le\ \frac{E(\xi)}{T}.$$

Our situation:
$$\mathrm{Prob}\,[f(x_k) - f^* \ge \epsilon] \ \le\ \frac{1}{\epsilon}\,[\phi_k - f^*] \ \le\ 1 - \beta.$$
We need $\phi_k - f^* \le \epsilon\cdot(1-\beta)$. Too expensive for $\beta \approx 1$?


Regularization technique
Consider $f_\mu(x) = f(x) + \frac{\mu}{2}\,\|x - x_0\|^2_{1-\alpha}$. It is strongly convex. Therefore, we can obtain $\phi_k - f_\mu^* \le \epsilon\cdot(1-\beta)$ in
$$O\!\left(\frac{S_\alpha}{\mu}\,\ln\frac{1}{\epsilon\,(1-\beta)}\right)$$
iterations.

Theorem. Define $\alpha = 1$, $\mu = \frac{\epsilon}{4R_0^2(x_0)}$, and choose
$$k \ \ge\ 1 + \frac{8 S_1 R_0^2(x_0)}{\epsilon}\left[\ln\frac{2 S_1 R_0^2(x_0)}{\epsilon} + \ln\frac{1}{1-\beta}\right].$$
Let $x_k$ be generated by RCDM(1, $x_0$) as applied to $f_\mu$. Then $\mathrm{Prob}\,(f(x_k) - f^* \le \epsilon) \ge \beta$.

Note: $\beta = 1 - 10^{-p}$ gives $\ln\frac{1}{1-\beta} = \ln 10^p \approx 2.3\,p$.


Implementation details: Random Counter
Given the values $L_i$, $i = 1,\dots,N$, generate efficiently a random $i \in \{1,\dots,N\}$ with probabilities $\mathrm{Prob}\,[i = k] = L_k \big/ \sum_{j=1}^N L_j$.

Solution:
a) Trivial: $O(N)$ operations per sample.
b) Assume $N = 2^p$. Define $p+1$ vectors $S_k \in \mathbb{R}^{2^{p-k}}$, $k = 0,\dots,p$:
$$S_0^{(i)} = L_i, \quad i = 1,\dots,N; \qquad S_k^{(i)} = S_{k-1}^{(2i)} + S_{k-1}^{(2i-1)}, \quad i = 1,\dots,2^{p-k},\ k = 1,\dots,p.$$

Algorithm: Make the choice in $p$ steps, from top to bottom. If element $i$ of $S_k$ is chosen, then choose in $S_{k-1}$ either $2i$ or $2i-1$ in accordance with the probabilities $\frac{S_{k-1}^{(2i)}}{S_k^{(i)}}$ or $\frac{S_{k-1}^{(2i-1)}}{S_k^{(i)}}$.

Difference: for $N = 2^{20} > 10^6$ we have $p = \log_2 N = 20$.
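A minimal Python sketch of counter (b); indices are 0-based and N is assumed to be a power of two:

    import numpy as np

    def build_counter(L):
        # levels[0] holds the L_i; each next level sums adjacent pairs (the S_k).
        levels = [np.asarray(L, dtype=float)]
        while len(levels[-1]) > 1:
            s = levels[-1]
            levels.append(s[0::2] + s[1::2])
        return levels

    def sample(levels, rng):
        # Walk from the root down, picking a child proportionally to its partial sum.
        i = 0
        for k in range(len(levels) - 2, -1, -1):
            left, right = levels[k][2 * i], levels[k][2 * i + 1]
            i = 2 * i if rng.random() < left / (left + right) else 2 * i + 1
        return i          # index with Prob[i] = L_i / sum_j L_j

    rng = np.random.default_rng(0)
    levels = build_counter([1.0, 2.0, 3.0, 2.0])   # N = 2^2, p = 2
    counts = np.bincount([sample(levels, rng) for _ in range(8000)], minlength=4)

When a single L_i changes, only the p partial sums on its root path need to be refreshed, so updates of the counter are also O(log N).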


Sparse problems
Problem: $\min_{x\in Q} f(x)$, where $Q$ is closed and convex in $\mathbb{R}^N$, and $f(x) = \Psi(Ax)$, where $\Psi$ is a simple convex function:
$$\Psi(y_1) \ \ge\ \Psi(y_2) + \langle \Psi'(y_2),\, y_1 - y_2\rangle, \qquad y_1, y_2 \in \mathbb{R}^M,$$
and $A : \mathbb{R}^N \to \mathbb{R}^M$ is a sparse matrix.

Let $p(x) \stackrel{\mathrm{def}}{=}$ the number of nonzeros in $x$. Sparsity coefficient: $\gamma(A) \stackrel{\mathrm{def}}{=} \frac{p(A)}{MN}$.

Example 1: Matrix-vector multiplication. Computation of the vector $Ax$ needs $p(A)$ operations. The initial complexity $MN$ is reduced by the factor $\gamma(A)$.


Gradient Method

$$x_0 \in Q, \qquad x_{k+1} = \pi_Q\big(x_k - h\,f'(x_k)\big), \qquad k \ge 0.$$

Main computational expenses:
- Projection onto a simple set $Q$ needs $O(N)$ operations.
- Displacement $x_k \to x_k - h\,f'(x_k)$ needs $O(N)$ operations.
- $f'(x) = A^T \Psi'(Ax)$. If $\Psi$ is simple, then the main efforts are spent on two matrix-vector multiplications: $2p(A)$ operations.

Conclusion: As compared with full matrices, we accelerate by the factor $\gamma(A)$.

Note: For Large- and Huge-scale problems, we often have $\gamma(A) \approx 10^{-4}\dots10^{-6}$. Can we get more?


Sparse updating strategy
Main idea: After the update $x_+ = x + d$ we have $y_+ \stackrel{\mathrm{def}}{=} Ax_+ = Ax + Ad$. What happens if $d$ is sparse?

Denote $\sigma(d) \stackrel{\mathrm{def}}{=} \{j :\ d^{(j)} \ne 0\}$. Then
$$y_+ \ =\ y + \sum_{j\in\sigma(d)} d^{(j)}\cdot Ae_j.$$
Its complexity,
$$\kappa_A(d) \ =\ \sum_{j\in\sigma(d)} p(Ae_j),$$
can be VERY small! Indeed,
$$\kappa_A(d) \ =\ M\sum_{j\in\sigma(d)}\gamma(Ae_j) \ =\ \gamma(d)\cdot\frac{1}{p(d)}\sum_{j\in\sigma(d)}\gamma(Ae_j)\cdot MN \ \le\ \gamma(d)\cdot\max_{1\le j\le N}\gamma(Ae_j)\cdot MN.$$
If $\gamma(d) \le c\,\gamma(A)$ and $\gamma(Ae_j) \le c\,\gamma(A)$, then $\kappa_A(d) \le c^2\cdot\gamma^2(A)\cdot MN$.

Expected acceleration: $(10^{-6})^2 = 10^{-12}$, i.e. 1 second instead of 32,000 years!
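A minimal sketch of this update; the column store `cols` (nonzero row indices and values of each Ae_j) is an assumed data structure:

    import numpy as np

    def sparse_update(y, d_items, cols):
        """y <- y + sum_{j in sigma(d)} d^(j) * A e_j.
        d_items: (j, d_j) pairs with d_j != 0; cols[j]: (row_indices, values) of column j of A."""
        for j, dj in d_items:
            idx, val = cols[j]
            y[idx] += dj * val      # cost p(A e_j) per column, no full M-vector pass
        return y

    # Usage: only the entries of y touched by the two used columns change.
    cols = {0: (np.array([0, 3]), np.array([1.0, 2.0])),
            2: (np.array([1]), np.array([-1.0]))}
    y = np.zeros(4)
    sparse_update(y, [(0, 0.5), (2, 2.0)], cols)   # y = [0.5, -2.0, 0.0, 1.0]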


When it can work?

Simple methods: no full-vector operations! (Is it possible?)
Simple problems: functions with sparse gradients.

Examples
1. Quadratic function $f(x) = \frac{1}{2}\langle Ax, x\rangle - \langle b, x\rangle$. Its gradient $f'(x) = Ax - b$, $x\in\mathbb{R}^N$, is not sparse even if $A$ is sparse.
2. Piece-wise linear function $g(x) = \max\limits_{1\le i\le m}[\,\langle a_i, x\rangle - b^{(i)}\,]$. Its subgradient $g'(x) = a_{i(x)}$, where $i(x):\ g(x) = \langle a_{i(x)}, x\rangle - b^{(i(x))}$, can be sparse if $a_{i(x)}$ is sparse!

But: we need a fast procedure for updating the max-operations.


Fast updates in short computational trees
Def: Function $f(x)$, $x\in\mathbb{R}^n$, is short-tree representable if it can be computed by a short binary tree with height $\approx \ln n$.

Let $n = 2^k$ and the tree have $k+1$ levels: $v_{0,i} = x^{(i)}$, $i = 1,\dots,n$. The size of each next level is half the size of the previous one:
$$v_{i+1,j} \ =\ \psi_{i+1,j}(v_{i,2j-1},\, v_{i,2j}), \qquad j = 1,\dots,2^{k-i-1},\ i = 0,\dots,k-1,$$
where the $\psi_{i,j}$ are some bivariate functions.

[Figure: binary computational tree with leaves $v_{0,1},\dots,v_{0,n}$, intermediate nodes $v_{1,j},\dots,v_{k-1,j}$, and root $v_{k,1}$.]


Main advantages

Important examples (symmetric functions):
$$f(x) = \|x\|_p,\ p\ge1: \qquad \psi_{i,j}(t_1,t_2) \equiv \big[\,|t_1|^p + |t_2|^p\,\big]^{1/p},$$
$$f(x) = \ln\Big(\sum_{i=1}^n e^{x^{(i)}}\Big): \qquad \psi_{i,j}(t_1,t_2) \equiv \ln\big(e^{t_1} + e^{t_2}\big),$$
$$f(x) = \max_{1\le i\le n} x^{(i)}: \qquad \psi_{i,j}(t_1,t_2) \equiv \max\{t_1, t_2\}.$$

The binary tree requires only $n-1$ auxiliary cells. Its value needs $n-1$ applications of $\psi_{i,j}(\cdot,\cdot)$ ($\approx n$ operations). If $x_+$ differs from $x$ in one entry only, then re-computing $f(x_+)$ needs only $k \approx \log_2 n$ operations.

Thus, we can have pure subgradient minimization schemes with Sublinear Iteration Cost.
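A minimal sketch for the max-example: a full tree of partial maxima is stored, and a single-entry change re-computes only the path to the root in O(log n):

    import numpy as np

    def build_tree(x):
        # levels[0] = leaves; levels[i+1][j] = max(levels[i][2j], levels[i][2j+1])
        levels = [np.array(x, dtype=float)]
        while len(levels[-1]) > 1:
            v = levels[-1]
            levels.append(np.maximum(v[0::2], v[1::2]))
        return levels

    def update(levels, i, value):
        # Change one entry and re-compute only the root path: O(log n) operations.
        levels[0][i] = value
        for k in range(1, len(levels)):
            i //= 2
            levels[k][i] = max(levels[k - 1][2 * i], levels[k - 1][2 * i + 1])
        return levels[-1][0]          # current value of f(x) = max_i x^(i)

    x = [3.0, 1.0, 4.0, 1.0]          # n = 2^2
    levels = build_tree(x)            # f(x) = 4.0
    print(update(levels, 2, 0.5))     # re-computes f after x^(3) <- 0.5, gives 3.0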


Simple subgradient methods
I. Problem: $f^* \stackrel{\mathrm{def}}{=} \min\limits_{x\in Q} f(x)$, where
- $Q$ is closed and convex,
- $\|f'(x)\| \le L(f)$ for all $x\in Q$,
- the optimal value $f^*$ is known.

Consider the following optimization scheme (B. Polyak, 1967):
$$x_0 \in Q, \qquad x_{k+1} = \pi_Q\!\left(x_k - \frac{f(x_k) - f^*}{\|f'(x_k)\|^2}\, f'(x_k)\right), \qquad k \ge 0.$$
Denote $f_k^* = \min\limits_{0\le i\le k} f(x_i)$. Then for any $k\ge0$ we have:
$$f_k^* - f^* \ \le\ \frac{L(f)\,\|x_0 - \pi_{X^*}(x_0)\|}{(k+1)^{1/2}}, \qquad \|x_k - x^*\| \ \le\ \|x_0 - x^*\|, \quad x^*\in X^*.$$
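A minimal Python sketch of this scheme; the callbacks f, fprime and the projection are placeholders supplied by the user (Q = R^N by default), not part of the slides:

    import numpy as np

    def polyak_method(f, fprime, f_star, x0, iters=1000, project=lambda x: x):
        x = np.array(x0, dtype=float)
        best = f(x)
        for _ in range(iters):
            g = fprime(x)
            gap = f(x) - f_star
            if gap <= 0.0:
                break
            x = project(x - gap / np.dot(g, g) * g)   # Polyak step length
            best = min(best, f(x))                    # record value f_k^*
        return best, x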


Proof:

Let us fix $x^*\in X^*$. Denote $r_k(x^*) = \|x_k - x^*\|$. Then
$$r_{k+1}^2(x^*) \ \le\ \left\|x_k - \frac{f(x_k)-f^*}{\|f'(x_k)\|^2}\,f'(x_k) - x^*\right\|^2 \ =\ r_k^2(x^*) - 2\,\frac{f(x_k)-f^*}{\|f'(x_k)\|^2}\,\langle f'(x_k),\, x_k - x^*\rangle + \frac{(f(x_k)-f^*)^2}{\|f'(x_k)\|^2}$$
$$\le\ r_k^2(x^*) - \frac{(f(x_k)-f^*)^2}{\|f'(x_k)\|^2} \ \le\ r_k^2(x^*) - \frac{(f_k^*-f^*)^2}{L^2(f)}.$$
From this reasoning, $\|x_{k+1} - x^*\| \le \|x_k - x^*\|$ for all $x^*\in X^*$.

Corollary: Assume $X^*$ has a recession direction $d^*$. Then
$$\|x_k - \pi_{X^*}(x_0)\| \ \le\ \|x_0 - \pi_{X^*}(x_0)\|, \qquad \langle d^*, x_k\rangle \ \ge\ \langle d^*, x_0\rangle.$$
(Proof: consider $x^* = \pi_{X^*}(x_0) + \alpha d^*$, $\alpha \ge 0$.)


Constrained minimization (N.Shor (1964) & B.Polyak)
II. Problem: $\min\limits_{x\in Q}\{f(x):\ g(x)\le 0\}$, where
- $Q$ is closed and convex,
- $f$, $g$ have uniformly bounded subgradients.

Consider the following method. It has a step-size parameter $h > 0$.

If $g(x_k) > h\,\|g'(x_k)\|$, then (A): $x_{k+1} = \pi_Q\!\left(x_k - \frac{g(x_k)}{\|g'(x_k)\|^2}\,g'(x_k)\right)$;
else (B): $x_{k+1} = \pi_Q\!\left(x_k - \frac{h}{\|f'(x_k)\|}\,f'(x_k)\right)$.

Let $F_k \subseteq \{0,\dots,k\}$ be the set of (B)-iterations, and $f_k^* = \min\limits_{i\in F_k} f(x_i)$.

Theorem: If $k > \|x_0 - x^*\|^2/h^2$, then $F_k \ne \emptyset$ and
$$f_k^* - f(x^*) \ \le\ hL(f), \qquad \max_{i\in F_k} g(x_i) \ \le\ hL(g).$$
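A minimal sketch of this switching scheme; the callbacks and the projection are again user-supplied placeholders:

    import numpy as np

    def shor_polyak(f, fprime, g, gprime, x0, h, iters=1000, project=lambda x: x):
        # Step (A) reduces infeasibility; step (B) moves a length h along f'.
        x = np.array(x0, dtype=float)
        best_val, best_x = np.inf, None
        for _ in range(iters):
            gx, gp = g(x), gprime(x)
            if gx > h * np.linalg.norm(gp):                 # case (A)
                x = project(x - gx / np.dot(gp, gp) * gp)
            else:                                           # case (B)
                if f(x) < best_val:
                    best_val, best_x = f(x), x.copy()       # record f_k^* over F_k
                fp = fprime(x)
                x = project(x - h / np.linalg.norm(fp) * fp)
        return best_val, best_x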


Computational strategies
1. Constants $L(f)$, $L(g)$ are known (e.g. Linear Programming). We can take $h = \frac{\epsilon}{\max\{L(f), L(g)\}}$. Then we need to decide on the number of steps $N$ (easy!).
Note: The standard advice is $h = \frac{R}{(N+1)^{1/2}}$ (much more difficult!).

2. Constants $L(f)$, $L(g)$ are not known. Start from a guess. Restart from scratch each time we see the guess is wrong. The guess is doubled after each restart.

3. Tracking the record value $f_k^*$: double run.

Other ideas are welcome!


Application examples

Observations:
1. Very often, Large- and Huge-scale problems have repetitive sparsity patterns and/or limited connectivity:
   - Social networks.
   - Mobile phone networks.
   - Truss topology design (local bars).
   - Finite element models (2D: four neighbors, 3D: six neighbors).
2. For $p$-diagonal matrices, the sparse-update cost is $\kappa_A(d) \le p^2$.


Nonsmooth formulation of Google Problem
Main property of the spectral radius ($A \ge 0$): If $A \in \mathbb{R}^{n\times n}_+$, then
$$\rho(A) \ =\ \min_{x\ge0}\ \max_{1\le i\le n}\ \frac{1}{x^{(i)}}\,\langle e_i, Ax\rangle.$$
The minimum is attained at the corresponding eigenvector.

Since $\rho(\bar E) = 1$, our problem is as follows:
$$f(x) \ \stackrel{\mathrm{def}}{=}\ \max_{1\le i\le N}\ \big[\langle e_i, \bar E x\rangle - x^{(i)}\big] \ \to\ \min_{x\ge0}.$$
Interpretation: Maximizing the self-esteem!

Since $f^* = 0$, we can apply Polyak's method with sparse updates.

Additional features: the optimal set $X^*$ is a convex cone. If $x_0 = e$, then the whole sequence is separated from zero:
$$\langle x^*, e\rangle \ \le\ \langle x^*, x_k\rangle \ \le\ \|x^*\|_1\cdot\|x_k\|_\infty \ =\ \langle x^*, e\rangle\cdot\|x_k\|_\infty.$$
Goal: Find $\bar x \ge 0$ such that $\|\bar x\|_\infty \ge 1$ and $f(\bar x) \le \epsilon$. (The first condition is satisfied automatically.)


Computational experiments: Iteration Cost
We compare Polyak's GM with sparse update (GMs ) with the standard one (GM ). Setup: Each agent has exactly p random friends. Thus, (A) p 2 . Iteration Cost: GMs p 2 log2 N , Time for N 1024 2048 4096 8192 16384 104 iterations (A) GMs 1632 3.00 1792 3.36 1888 3.75 1920 4.20 1824 4.69 (p = 32) GM 2.98 6.41 15.11 139.92 408.38 GM pN . Time for 103 iterations (p = 16) N (A) GMs GM 131072 576 0.19 213.9 262144 592 0.25 477.8 524288 592 0.32 1095.5 1048576 608 0.40 2590.8 1 sec 100 min!