Posted 2026-04-11Updated 2026-04-15Fundamentals of Numerical Computation Notes3 minutes read (About 471 words)

Floating-point numbers

Floating-point numbers: $\pm(1+f) \times 2^n$

$1+f$ is the significand and $f \in [0, 1)$.
$n$ is the exponent.

Significand format:
$$
\begin{aligned}
f&=\sum_{i=1}^d b_i 2^{d-i} \quad \text{where} \quad b_{i} \in {0,1} \\
&= 2^{-d} \sum_{i=0}^{d-1} b_i 2^i
\end{aligned}
$$

$d$ is the number of bits representing the significand, called binary precision.
There are $2^d$ significands, which are evenly spaced by $2^{-d}$ from $0$ to $2^d-1$.
- Example ($d=2$): $1, 1+\frac{1}{4}, 1+\frac{2}{4}, 1+\frac{3}{4}$

Machine precision: $\epsilon_{\text{mach}}=2^{-d}$

Machine precision inherently indicates the relative error of $\text{fl}(x)$ mapping a real number to the nearest floating-point number.
$$
\frac{|\text{fl}(x)-x|}{|x|} \le \frac{2^n\times 2^{-d-1}}{2^n} = \frac{1}{2} \epsilon_{\text{mach}}
$$
- $|\text{fl}(x)-x| \le \frac{1}{2}\times 2^{-d}$
- $2^n \le x < 2^{n+1}$
- Equivalent to $\text{fl(x)}=x(1+\epsilon)$ for some $\epsilon \le \frac{1}{2}\epsilon_{\text{mach}}$

Accuracy

Absolute accuracy: $|\widetilde{x}-x|$
Relative accuracy: $\dfrac{|\widetilde{x}-x|}{|x|}$
Accurate digits: $-\log_{10} \dfrac{|\widetilde{x}-x|}{|x|}$

Problems and conditioning

Subtractive cancellation (loss of significance): Adding $-1.0012$ to $1.0000$, we get $-0.0012=-1.2 \times 10^3$. The significant digits reduce from $5$ to $2$.

Condition number is the ratio of the output’s relative error to the input’s relative error.
$$
\kappa_f(x) = \left|\dfrac{xf’(x)}{f(x)}\right|
$$

For composite $h(x)=f(g(x))$, the condition number is $\kappa_h(x)=\kappa_f(g(x))\kappa_g(x)$

Error estimation: With a smalle $\epsilon$, the estimated relative error is
$$
\left|\dfrac{f(x+\epsilon x)-f(x)}{f(x)}\right| \approx \kappa_f(x)|\epsilon|
$$

If $\kappa_f\approx 10^d$, then computing $f(x)$ loses up to $d$ digits.

Algorithms

Horner’s algorithm（秦九韶算法） is an efficient polynomial evaluation.
$$
\begin{aligned}
p(x) &= c_1 + c_2x + \dots + c_nx^{n-1} \\
&= (((c_nx + c_{n-1})x + c_{n-2})x+\dots+c_2)x+c_1
\end{aligned}
$$

Regular method takes $n-1$ additions and $\frac{n(n-1)}{2}$ or $2n-3$ (cached $x_i$) multiplications, while the Horner’s algorithm takes $n-1$ additions and $n-1$ multiplications.

Stability

Backward error: $\widetilde{f}$ is an algorithm for $f$. The approximate result $\widetilde{y}=\widetilde{f}(x)$. If there is a value $\widetilde{x}$ such that $f(\widetilde{x})=\widetilde{y}$, then the relative backward error in $\widetilde{y}$ is $\left|\dfrac{\widetilde{x}-x}{x}\right|$ and the absolute backward error is $\left|\widetilde{x}\right|$

Backward error

It measures the change to the original input that reproduces the result that was found by the algorithm.
Small backward error implies the algorithm is stable, but a stable algorithm may produce a large backward error.
- Example $f(x)=x+1$ with $x < \frac{1}{2}\epsilon_{\text{mach}}$: $\ \widetilde{f}(x=\frac{1}{4}\epsilon_{\text{mach}})=1=f(\widetilde{x}=0)$, the backward error is $100%$.

Floating-point numbers

Problems and conditioning

Algorithms

Stability

Categories

Recents

Archives