
Thursday, January 13, 2011

Annealing algorithm

http://en.wikipedia.org/wiki/Simulated_annealing

The name and inspiration come from annealing in metallurgy, a technique involving heating and controlled cooling of a material to increase the size of its crystals and reduce their defects.

The heat causes the atoms to become unstuck from their initial positions (a local minimum of the internal energy) and wander randomly through states of higher energy; the slow cooling gives them more chances of finding configurations with lower internal energy than the initial one.

1 Overview
In the simulated annealing (SA) method, each point s of the search space is analogous to a state of some physical system, and the function E(s) to be minimized is analogous to the internal energy of the system in that state. The goal is to bring the system, from an arbitrary initial state, to a state with the minimum possible energy.

1.1 The basic iteration

At each step, the SA heuristic considers some neighbouring state s' of the current state s, and probabilistically decides between moving the system to state s' or staying in state s. These probabilities ultimately lead the system to move to states of lower energy. Typically this step is repeated until the system reaches a state that is good enough for the application, or until a given computation budget has been exhausted.

1.2 The neighbours of a state

1 The neighbours of a state are new states of the problem that are produced after altering the given state in some particular way.

2 For example, in the travelling salesman problem, each state is typically defined as a particular permutation of the cities to be visited. The neighbours of some particular permutation are the permutations produced, for example, by interchanging a pair of adjacent cities (this is only an example; "neighbouring states" can be defined in different ways). The action taken to alter the solution in order to find neighbouring solutions is called a "move", and different "moves" give different neighbours. (A small code sketch of such a move is given after this list.)

3 These moves usually result in minimal alterations of the solution, as the previous example illustrates, in order to help the algorithm optimize the solution as far as possible while retaining the already optimal parts of the solution and affecting only the suboptimal parts. In the previous example (travelling salesman problem), the parts of the solution are the segments of the tour.

4 Searching for neighbours to a state is fundamental to optimization because the final solution will come after a tour of successive neighbours.
(1) Simple heuristics move by finding best neighbour after best neighbour and stop when they have reached a solution which has no neighbours that are better solutions.
    The problem with this approach is that a solution without any immediate neighbours that are better solutions is not necessarily the optimum. It would only be the optimum if it were shown that no alteration of the solution whatsoever gives a better solution, not just one particular kind of alteration.
    For this reason it is said that simple heuristics can only reach local optima and not the global optimum.
(2) Metaheuristics, although they also optimize through the neighbourhood approach, differ from heuristics in that they can move through neighbours that are worse solutions than the current solution.
(3) Simulated annealing in particular doesn't even try to find the best neighbour. Because worse neighbours can be accepted, the search cannot become permanently stuck in a local optimum, and in theory, if the metaheuristic could run for an infinite amount of time, the global optimum would be found.
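
To make the idea of a "move" concrete, here is a minimal Python sketch (my own illustration, not from the article) of the adjacent-swap neighbour for the travelling salesman example. The function name and the representation of a tour as a list of city indices are just assumptions made for the example.

import random

def adjacent_swap_neighbour(tour):
    # Return a neighbouring tour obtained by swapping one pair of adjacent cities.
    # This is only one possible "move"; other moves define other neighbourhoods.
    new_tour = tour[:]                        # copy, so the current state stays intact
    i = random.randrange(len(new_tour) - 1)   # pick a random adjacent pair (i, i+1)
    new_tour[i], new_tour[i + 1] = new_tour[i + 1], new_tour[i]
    return new_tour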

1.3 Acceptance probabilities

The probability of making the transition from the current state s to a candidate new state s' is specified by an acceptance probability function P(e,e',T), that depends on the energies e = E(s) and e' = E(s') of the two states, and on a global time-varying parameter T called the temperature.

1 Requirements on the acceptance probabilities:
(1) The annealing algorithm may sometimes make the system energy go uphill
Acceptance probabilities P(e,e',T) must be non-zero when e' > e, meaning that the system may move to the new state even when it is worse (has a higher energy) than the current one. It is this feature that prevents the method from becoming stuck in a local minimum—a state that is worse than the global minimum, yet better than any of its neighbours.
(2) The probability that the system energy goes downhill should be greater than that of it going uphill
During the process in which T goes to zero, the probability P(e,e',T) must tend to zero if e' > e and to a positive value if e' < e. That way, for sufficiently small values of T, the system will increasingly favour moves that go "downhill" (to lower energy values), and avoid those that go "uphill".
(3) When T reaches zero, the annealing algorithm only searches for a local optimum; it can search for the global optimum only while T is larger than zero
In particular, when T becomes 0, the procedure reduces to the greedy algorithm, which makes the move only if it goes downhill. (In other words, once T reaches zero the algorithm can only descend to a local optimum, whereas while T is larger than zero it can still escape towards the global optimum.)
In the original description of SA, the probability P(e,e',T) was defined as 1 when e' < e — i.e., the procedure always moved downhill when it found a way to do so, irrespective of the temperature. Many descriptions and implementations of SA still take this condition as part of the method's definition. However, this condition is not essential for the method to work, and one may argue that it is both counter-productive and contrary to its principle.
(4) When the system decides to take uphill moves, smaller uphill moves are more likely to be taken than larger ones (see the sketch after this list)
The P function is usually chosen so that the probability of accepting a move decreases when the difference e' − e increases—that is, small uphill moves are more likely than large ones. However, this requirement is not strictly necessary, provided that the above requirements are met.
(5) When T is larger, the evolution of the system state tends to be sensitive only to coarser energy variations; when T is smaller, it becomes sensitive to finer energy variations
Given these properties, the temperature T plays a crucial role in controlling the evolution of the state s of the system vis-a-vis its sensitivity to the variations of system energies. To be precise, for a large T, the evolution of s is sensitive to coarser energy variations, while it is sensitive to finer energy variations when T is small.
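
As a concrete illustration of requirements (1)-(4), here is a small Python sketch of the most commonly used acceptance function (the Kirkpatrick-style exponential form discussed again in section 3.3). The function name is my own and the numbers in the trailing comments are only approximate.

import math

def acceptance_probability(e, e_new, T):
    # Classic acceptance function: always accept downhill moves; accept an
    # uphill move with a probability that shrinks as the energy increase
    # grows or as the temperature T falls.
    if e_new < e:
        return 1.0                      # downhill: always accept
    if T <= 0:
        return 0.0                      # T = 0: reduces to greedy descent
    return math.exp(-(e_new - e) / T)   # uphill: exp(-(e' - e)/T)

# The larger T is, the more likely a given uphill move is to be accepted:
# acceptance_probability(10.0, 11.0, 5.0) is about 0.82
# acceptance_probability(10.0, 11.0, 0.5) is about 0.14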

1.4 The annealing schedule

Maybe "annealing schedule" means how to schedule the dropping process of the temperature. For example, adopting a large step size to make it drop fast, or  small step size to make it drop slowly.

The name and inspiration of the algorithm demand an interesting feature related to the temperature variation to be embedded in the operational characteristics of the algorithm. This necessitates a gradual reduction of the temperature as the simulation proceeds. The algorithm starts initially with T set to a high value (or infinity), and then it is decreased at each step following some annealing schedule—which may be specified by the user, but must end with T = 0 towards the end of the allotted time budget. In this way, the system is expected to wander initially towards a broad region of the search space containing good solutions, ignoring small features of the energy function; then drift towards low-energy regions that become narrower and narrower; and finally move downhill according to the steepest descent heuristic.
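
For instance, here is a Python sketch of two possible cooling schedules written in the form temp(r) used by the pseudocode below, where r is the fraction of the time budget spent so far. The function names and the starting temperature are assumptions made for the example.

def linear_temp(r, t0=10.0):
    # Linear schedule: T falls from t0 straight down to 0 as r goes from 0 to 1.
    return t0 * (1.0 - r)

def geometric_temp(r, t0=10.0, t_end=1e-3):
    # Geometric schedule: multiplicative decay from t0 towards a small
    # final temperature t_end; slower near the end than the linear schedule.
    # (The text above asks for T = 0 at the very end, so in practice one
    # may clamp the final value to 0.)
    return t0 * (t_end / t0) ** r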


[Images: SimulatedAnnealingFast.jpg, SimulatedAnnealingSlow.jpg]
Example illustrating the effect of the cooling schedule on the performance of simulated annealing. The problem is to rearrange the pixels of an image so as to minimize a certain potential energy function, which causes similar colours to attract at short range and repel at a slightly larger distance. The elementary moves swap two adjacent pixels. These images were obtained with a fast cooling schedule (left) and a slow cooling schedule (right), producing results similar to amorphous and crystalline solids, respectively.

It can be shown[4] that for any given finite problem, the probability that the simulated annealing algorithm terminates with the global optimal solution approaches 1 as the annealing schedule is extended. This theoretical result, however, is not particularly helpful, since the time required to ensure a significant probability of success will usually exceed the time required for a complete search of the solution space.

2 Pseudocode

The following pseudocode implements the simulated annealing heuristic as described above. It starts from a state s0 and continues to either a maximum of kmax steps or until a state with an energy of emax or less is found. In the process, the call neighbour(s) should generate a randomly chosen neighbour of a given state s; the call random() should return a random value in the range [0,1]. The annealing schedule is defined by the call temp(r), which should yield the temperature to use, given the fraction r of the time budget that has been expended so far.
s ← s0; e ← E(s)                                // Initial state, energy.
sbest ← s; ebest ← e                            // Initial "best" solution
k ← 0                                           // Energy evaluation count.
while k < kmax and e > emax                     // While time left & not good enough:
  snew ← neighbour(s)                           // Pick some neighbour.
  enew ← E(snew)                                // Compute its energy.
  if P(e, enew, temp(k/kmax)) > random() then   // Should we move to it?
    s ← snew; e ← enew                          // Yes, change state.
  if enew < ebest then                          // Is this a new best?
    sbest ← snew; ebest ← enew                  // Save 'new neighbour' to 'best found'.
  k ← k + 1                                     // One more evaluation done
return sbest                                    // Return the best solution found.

Actually, the "pure" SA algorithm does not keep track of the best solution found so far: it does not use the variables sbest and ebest, it lacks the second if inside the loop, and, at the end, it returns the current state s instead of sbest. While remembering the best state is a standard technique in optimization that can be used in any metaheuristic, it does not have an analogy with physical annealing — since a physical system can "store" a single state only.
In strict mathematical terms, saving the best state is not necessarily an improvement, since one may have to specify a smaller kmax in order to compensate for the higher cost per iteration and since there is a good probability that sbest equals s in the final iteration anyway. However, the step sbest ← snew happens only on a small fraction of the moves. Therefore, the optimization is usually worthwhile, even when state-copying is an expensive operation.[citation needed]
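
For reference, here is a minimal runnable Python transcription of the pseudocode above (my own sketch, keeping the sbest/ebest bookkeeping discussed in this section). The callables E, neighbour, temp and P are assumed to be supplied by the caller, as in the parameter list of section 3.

import random

def simulated_annealing(s0, E, neighbour, temp, P, k_max, e_max):
    s, e = s0, E(s0)                      # initial state and energy
    s_best, e_best = s, e                 # best solution found so far
    k = 0
    while k < k_max and e > e_max:        # while time left and not good enough
        s_new = neighbour(s)              # pick some neighbour
        e_new = E(s_new)                  # compute its energy
        if P(e, e_new, temp(k / k_max)) > random.random():
            s, e = s_new, e_new           # accept the move
        if e_new < e_best:                # is this a new best?
            s_best, e_best = s_new, e_new # remember it
        k += 1                            # one more evaluation done
    return s_best                         # return the best solution found

Combined with, say, the adjacent-swap neighbour sketched in section 1.2, a tour-length energy function, one of the schedules sketched in section 1.4 and the acceptance function from section 1.3, this reproduces the behaviour described above.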

3 Selecting the parameters

In order to apply the SA method to a specific problem, one must specify the following parameters:
(1) State space
(2) Candidate generator procedure neighbour()
(3) Energy (goal) function E()
(4) Annealing schedule temp()
(5) Initial temperature
(6) Acceptance probability function P()

These choices can have a significant impact on the method's effectiveness. Unfortunately, there are no choices of these parameters that will be good for all problems, and there is no general way to find the best choices for a given problem. The following sections give some general guidelines.

3.1 Diameter of the search graph

Simulated annealing may be modelled as a random walk on a search graph, whose vertices are all possible states, and whose edges are the candidate moves. An essential requirement for the neighbour() function is that it must provide a sufficiently short path on this graph from the initial state to any state which may be the global optimum. (In other words, the diameter of the search graph must be small.) In the travelling salesman example above, for instance, the search space for n = 20 cities has n! = 2432902008176640000 (2.4 quintillion) states; yet the neighbour generator function that swaps two consecutive cities can get from any state (tour) to any other state in at most n(n − 1)/2 = 190 steps. (Note: n(n − 1)/2 is the maximum number of swaps needed, i.e. an upper bound on the length of the path between any two states, which is the diameter of the search graph. Also, "neighbouring states" need not be defined only by interchanging two adjacent cities; the neighbourhood relation in the graph can be defined in other ways.)
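
The n(n − 1)/2 bound can be checked directly: counting the adjacent swaps needed to turn one tour into another is just a bubble-sort style walk through this search graph. The helper below is my own illustration; the function name is made up.

def adjacent_swaps_needed(start, target):
    # Count the adjacent swaps required to turn `start` into `target`;
    # for tours of n cities this is never more than n(n - 1)/2.
    tour = start[:]
    swaps = 0
    for i, city in enumerate(target):
        j = tour.index(city)          # where the wanted city currently sits
        while j > i:                  # bubble it left, one adjacent swap at a time
            tour[j - 1], tour[j] = tour[j], tour[j - 1]
            swaps += 1
            j -= 1
    return swaps

# Worst case for n = 20: reversing the whole tour takes exactly 190 swaps.
# adjacent_swaps_needed(list(range(20)), list(reversed(range(20))))  ->  190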

3.2 Transition probabilities

For each edge (s,s') of the search graph, one defines a transition probability, which is the probability that the SA algorithm will move to state s' when its current state is s.

This probability depends on:
(1) the current temperature, as specified by the annealing schedule temp();
(2) the order in which the candidate moves are generated by the neighbour() function (maybe because, as candidate moves are generated over successive iterations, the temperature keeps dropping, so each candidate may be tested at a different temperature);
(3) the acceptance probability function P(). (Note that the transition probability is not simply P(e,e',T), which is only the acceptance probability, because the candidates are tested serially.)

3.3 Acceptance probabilities

The specification of neighbour(), P(), and temp() is partially redundant. In practice, it's common to use the same acceptance function P() for many problems, and adjust the other two functions according to the specific problem.
In the formulation of the method by Kirkpatrick et al., the acceptance probability function P(e,e',T) was defined as 1 if e' < e, and exp(−(e' − e)/T) otherwise. This formula was superficially justified by analogy with the transitions of a physical system; it corresponds to the Metropolis-Hastings algorithm, in the case where the proposal distribution of Metropolis-Hastings is symmetric. However, this acceptance probability is often used for simulated annealing even when the neighbour() function, which is analogous to the proposal distribution in Metropolis-Hastings, is not symmetric, or not probabilistic at all. As a result, the transition probabilities of the simulated annealing algorithm do not correspond to the transitions of the analogous physical system, and the long-term distribution of states at a constant temperature T need not bear any resemblance to the thermodynamic equilibrium distribution over states of that physical system, at any temperature. Nevertheless, most descriptions of SA assume the original acceptance function, which is probably hard-coded in many implementations of SA.

3.4 Efficient candidate generation

When choosing the candidate generator neighbour(), one must consider that after a few iterations of the SA algorithm, the current state is expected to have much lower energy than a random state. Therefore, as a general rule, one should skew the generator towards candidate moves where the energy of the destination state s' is likely to be similar to that of the current state. This heuristic (which is the main principle of the Metropolis-Hastings algorithm) tends to exclude "very good" candidate moves as well as "very bad" ones; however, the latter are usually much more common than the former, so the heuristic is generally quite effective.

In the travelling salesman problem above, for example, swapping two consecutive cities in a low-energy tour is expected to have a modest effect on its energy (length); whereas swapping two arbitrary cities is far more likely to increase its length than to decrease it. Thus, the consecutive-swap neighbour generator is expected to perform better than the arbitrary-swap one, even though the latter could provide a somewhat shorter path to the optimum (with n − 1 swaps, instead of n(n − 1) / 2).

A more precise statement of the heuristic is that one should try first candidate states s' for which P(E(s),E(s'),T) is large. For the "standard" acceptance function P above, it means that E(s') − E(s) is on the order of T or less. Thus, in the travelling salesman example above, one could use a neighbour() function that swaps two random cities, where the probability of choosing a city pair vanishes as their distance increases beyond T.
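
A sketch along those lines, assuming a distance(a, b) function for city pairs and a fixed temperature T at construction time (a real implementation would re-read the current temperature on every call), might look like this; all names and the retry cap are made up for the illustration.

import math
import random

def make_temperature_aware_neighbour(distance, T, max_tries=1000):
    # Build a neighbour() that swaps two random cities, preferring pairs
    # whose distance is small compared with the current temperature T.
    def neighbour(tour):
        n = len(tour)
        i, j = random.sample(range(n), 2)              # fallback pair
        for _ in range(max_tries):
            a, b = random.sample(range(n), 2)
            # keep this pair with a weight that vanishes as its distance grows beyond T
            if random.random() < math.exp(-distance(tour[a], tour[b]) / max(T, 1e-12)):
                i, j = a, b
                break
        new_tour = tour[:]
        new_tour[i], new_tour[j] = new_tour[j], new_tour[i]
        return new_tour
    return neighbour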

3.5 Barrier avoidance

When choosing the candidate generator neighbour() one must also try to reduce the number of "deep" local minima: states (or sets of connected states) that have much lower energy than all their neighbouring states. Such "closed catchment basins" of the energy function may trap the SA algorithm with high probability (roughly proportional to the number of states in the basin) and for a very long time (roughly exponential in the energy difference between the surrounding states and the bottom of the basin).
As a rule, it is impossible to design a candidate generator that will satisfy this goal and also prioritize candidates with similar energy. On the other hand, one can often vastly improve the efficiency of SA by relatively simple changes to the generator. In the traveling salesman problem, for instance, it is not hard to exhibit two tours A, B, with nearly equal lengths, such that (0) A is optimal, (1) every sequence of city-pair swaps that converts A to B goes through tours that are much longer than both, and (2) A can be transformed into B by flipping (reversing the order of) a set of consecutive cities. In this example, A and B lie in different "deep basins" if the generator performs only random pair-swaps; but they will be in the same basin if the generator performs random segment-flips.
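
A sketch of the segment-flip generator mentioned above (the function name is my own):

import random

def segment_flip_neighbour(tour):
    # Reverse the order of a random consecutive segment of the tour
    # (a 2-opt style move); this can cross "barriers" that plain
    # pair-swaps cannot.
    i, j = sorted(random.sample(range(len(tour)), 2))
    return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]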

3.6 Cooling schedule

The physical analogy that is used to justify SA assumes that the cooling rate is low enough for the probability distribution of the current state to be near thermodynamic equilibrium at all times. Unfortunately, the relaxation time—the time one must wait for the equilibrium to be restored after a change in temperature—strongly depends on the "topography" of the energy function and on the current temperature. In the SA algorithm, the relaxation time also depends on the candidate generator, in a very complicated way. Note that all these parameters are usually provided as black box functions to the SA algorithm.
Therefore, in practice the ideal cooling rate cannot be determined beforehand, and should be empirically adjusted for each problem. The variant of SA known as thermodynamic simulated annealing tries to avoid this problem by dispensing with the cooling schedule, and instead automatically adjusting the temperature at each step based on the energy difference between the two states, according to the laws of thermodynamics.