FNC Notes | Chapter 1

Floating-point numbers

Floating-point numbers: $\pm(1+f) \times 2^n$

  • $1+f$ is the significand and $f \in [0, 1)$.
  • $n$ is the exponent.

Significand format:
$$
\begin{aligned}
f&=\sum_{i=1}^d b_i 2^{d-i} \quad \text{where} \quad b_{i} \in {0,1} \\
&= 2^{-d} \sum_{i=0}^{d-1} b_i 2^i
\end{aligned}
$$

  • $d$ is the number of bits representing the significand, called binary precision.
  • There are $2^d$ significands, which are evenly spaced by $2^{-d}$ from $0$ to $2^d-1$.
    • Example ($d=2$): $1, 1+\frac{1}{4}, 1+\frac{2}{4}, 1+\frac{3}{4}$

Machine precision: $\epsilon_{\text{mach}}=2^{-d}$

  • Machine precision inherently indicates the relative error of $\text{fl}(x)$ mapping a real number to the nearest floating-point number.
    $$
    \frac{|\text{fl}(x)-x|}{|x|} \le \frac{2^n\times 2^{-d-1}}{2^n} = \frac{1}{2} \epsilon_{\text{mach}}
    $$
    • $|\text{fl}(x)-x| \le \frac{1}{2}\times 2^{-d}$
    • $2^n \le x < 2^{n+1}$
    • Equivalent to $\text{fl(x)}=x(1+\epsilon)$ for some $\epsilon \le \frac{1}{2}\epsilon_{\text{mach}}$

Accuracy

  • Absolute accuracy: $|\widetilde{x}-x|$
  • Relative accuracy: $\dfrac{|\widetilde{x}-x|}{|x|}$
  • Accurate digits: $-\log_{10} \dfrac{|\widetilde{x}-x|}{|x|}$

Problems and conditioning

Subtractive cancellation (loss of significance): Adding $-1.0012$ to $1.0000$, we get $-0.0012=-1.2 \times 10^3$. The significant digits reduce from $5$ to $2$.

Condition number is the ratio of the output’s relative error to the input’s relative error.
$$
\kappa_f(x) = \left|\dfrac{xf’(x)}{f(x)}\right|
$$

  • For composite $h(x)=f(g(x))$, the condition number is $\kappa_h(x)=\kappa_f(g(x))\kappa_g(x)$

Error estimation: With a smalle $\epsilon$, the estimated relative error is
$$
\left|\dfrac{f(x+\epsilon x)-f(x)}{f(x)}\right| \approx \kappa_f(x)|\epsilon|
$$

  • If $\kappa_f\approx 10^d$, then computing $f(x)$ loses up to $d$ digits.

Algorithms

Horner’s algorithm(秦九韶算法) is an efficient polynomial evaluation.
$$
\begin{aligned}
p(x) &= c_1 + c_2x + \dots + c_nx^{n-1} \\
&= (((c_nx + c_{n-1})x + c_{n-2})x+\dots+c_2)x+c_1
\end{aligned}
$$

  • Regular method takes $n-1$ additions and $\frac{n(n-1)}{2}$ or $2n-3$ (cached $x_i$) multiplications, while the Horner’s algorithm takes $n-1$ additions and $n-1$ multiplications.

Stability

Backward error: $\widetilde{f}$ is an algorithm for $f$. The approximate result $\widetilde{y}=\widetilde{f}(x)$. If there is a value $\widetilde{x}$ such that $f(\widetilde{x})=\widetilde{y}$, then the relative backward error in $\widetilde{y}$ is $\left|\dfrac{\widetilde{x}-x}{x}\right|$ and the absolute backward error is $\left|\widetilde{x}\right|$

Backward error

  • It measures the change to the original input that reproduces the result that was found by the algorithm.
  • Small backward error implies the algorithm is stable, but a stable algorithm may produce a large backward error.
    • Example $f(x)=x+1$ with $x < \frac{1}{2}\epsilon_{\text{mach}}$: $\ \widetilde{f}(x=\frac{1}{4}\epsilon_{\text{mach}})=1=f(\widetilde{x}=0)$, the backward error is $100%$.