An Improved Algorithm for hypot(a,b)

Carlos F. Borges

arXiv:1904.09481·math.NA·June 17, 2019

An Improved Algorithm for hypot(a,b)

Carlos F. Borges

PDF

Open Access

TL;DR

This paper introduces a new, faster, and more accurate algorithm for computing the hypotenuse of two floating point numbers, improving upon existing library functions in Julia and C.

Contribution

The paper presents four novel algorithms for hypot computation and compares their performance to standard library implementations.

Findings

01

New algorithms outperform existing functions in speed and accuracy

02

Simulation results demonstrate significant improvements

03

Enhanced hypot computation benefits scientific and engineering applications

Abstract

We develop a fast and accurate algorithm for evaluating $a^{2} + b^{2}$ for two floating point numbers $a$ and $b$ . Library functions that perform this computation are generally named {\tt hypot(a,b)}. We will compare four approaches that we will develop in this paper to the current resident library function that is delivered with Julia 1.1 and to the code that has been distributed with the C math library for decades. We will demonstrate the performance of our algorithms by simulation.

Tables2

Table 1. Table 1. Testing done with a , b ∼ 𝒩 ( 0 , 1 ) similar-to 𝑎 𝑏 𝒩 0 1 a,b\sim\mathcal{N}(0,1)

Method	One ulp errors (%)	Two ulp errors (%)
Julia1.1_hypot()	35.08	0.16
clib_hypot()	12.91	0
Naive (Unfused)	16.70	0
Naive (Fused)	13.02	0
Corrected (Unfused)	0.54	0
Corrected (Fused)	0	0

Table 2. Table 2. Percentage of incorrectly rounded results by relative scale for a ∼ U ( 2 N , 2 N + 1 ) similar-to 𝑎 𝑈 superscript 2 𝑁 superscript 2 𝑁 1 a\sim U(2^{N},2^{N+1}) and b ∼ U ( 1 , 2 ) similar-to 𝑏 𝑈 1 2 b\sim U(1,2)

			Naive		Corrected
N	Julia1.1_hypot()	clib_hypot()	Unfused	Fused	Unfused	Fused
0	29.2282102	11.9488456	15.5629732	12.6486134	0.8005489	0
1	36.6616214	13.8082644	17.1661734	13.6513319	0.6212053	0
2	39.193388	13.1111126	17.087418	13.048634	0.5320131	0
3	39.9510329	13.0036249	17.2513067	12.9993049	0.2760227	0
4	40.1496371	12.9966526	17.307515	12.9960858	0.0799361	0
5	40.2052568	12.9972241	17.3246431	12.9970895	0.0212722	0
6	40.2194854	12.9962604	17.3277278	12.996271	0.005506	0
7	40.2215552	12.9957037	17.3283368	12.9957233	0.0013865	0
8	40.2199042	12.995122	17.3271957	12.9951272	0.0003476	0
9	40.2189412	12.9970594	17.3293785	12.997061	8.73E-05	0
10	40.2212047	12.9959983	17.3285051	12.9959978	2.31E-05	0
11	40.2206441	12.9956756	17.3285992	12.9956754	6.00E-06	0
12	40.1675266	12.9963116	17.3284901	12.9963118	7.00E-07	0
13	40.7464532	12.9970054	17.3285922	12.9970032	7.00E-07	0
14	44.5853178	12.9967234	17.3279259	12.9967244	0	0
15	45.3009456	12.997981	17.3296289	12.99798	0	0
16	45.3487089	12.9976653	17.3282103	12.9976649	0	0
17	45.3496228	12.9961767	17.3260454	12.9961782	0	0
18	45.351359	12.9986208	17.3302856	12.9986212	0	0
19	45.3513136	12.9964465	17.3285827	12.9964469	0	0
20	45.3510285	12.9965624	17.3302195	12.9965623	0	0
21	45.3503113	12.9941995	17.3281136	12.9942002	0	0
22	45.3454522	12.9965376	17.3279103	12.996537	0	0
23	45.3151619	12.9971009	17.3284377	12.9971044	0	0
24	45.1841774	13.0202852	17.3391219	13.0202826	0	0
25	46.6619492	13.2828222	17.261221	13.2828215	0	0
26	58.5652028	16.8179593	21.0366657	13.7850471	0	0
27	0	4.0541284	4.3904588	2.1950835	0	0
28	0	0	0	0	0	0
29	0	0	0	0	0	0

Equations30

f l (x) = x (1 + δ)

f l (x) = x (1 + δ)

x = f l (x (1 + δ))

x = f l (x (1 + δ))

a^{2} + b^{2}

a^{2} + b^{2}

\frac{( b / a ) ^{2}}{2} \leq \frac{ϵ}{4}

\frac{( b / a ) ^{2}}{2} \leq \frac{ϵ}{4}

z = (a^{2} + b^{2}) (1 + δ_{1}) .

z = (a^{2} + b^{2}) (1 + δ_{1}) .

h

h

\frac{3 ϵ}{4} + \frac{ϵ ^{2}}{8}

\frac{3 ϵ}{4} + \frac{ϵ ^{2}}{8}

\frac{5 ϵ}{4} + \frac{3 ϵ ^{2}}{8}

\frac{5 ϵ}{4} + \frac{3 ϵ ^{2}}{8}

∣ z - (a^{2} + b^{2}) ∣ \leq ϵz

∣ z - (a^{2} + b^{2}) ∣ \leq ϵz

a^{2} + b^{2} + x = f l (a^{2} + b^{2})

a^{2} + b^{2} + x = f l (a^{2} + b^{2})

a^{2} + b^{2} + x = h^{2}

a^{2} + b^{2} + x = h^{2}

h^{2} - x = h - \frac{x}{2 h} + O (x^{2})

h^{2} - x = h - \frac{x}{2 h} + O (x^{2})

x

x

a (2 δ - a) + (δ - 2 (a - b)) δ

a (2 δ - a) + (δ - 2 (a - b)) δ

2 δ (a - 2 b) + (4 δ - b) b + δ^{2}

2 δ (a - 2 b) + (4 δ - b) b + δ^{2}

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsNumerical Methods and Algorithms

Full text

An Improved Algorithm for hypot(a,b)

Carlos F. Borges

Department of Applied Mathematics

Naval Postgraduate School

Monterey CA 93943

[email protected]

(Date: April 30, 2019.)

Abstract.

We develop a fast and accurate algorithm for evaluating $\sqrt{a^{2}+b^{2}}$ for two floating point numbers $a$ and $b$ . Library functions that perform this computation are generally named hypot(a,b). We will compare four approaches that we will develop in this paper to the current resident library function that is delivered with Julia 1.1 and to the code that has been distributed with the C math library for decades. We will demonstrate the performance of our algorithms by simulation.

Key words and phrases:

hypot(), floating point, fused multiply-add, IEEE 754

2000 Mathematics Subject Classification:

Primary 65Y04

Given two floating point numbers $a$ and $b$ we wish to compute $\sqrt{a^{2}+b^{2}}$ quickly and accurately. Library functions that perform this computation are generally named hypot(a,b). And although the IEEE 754 standard suggests (but does not require) that computer languages provide a version that gives a correctly rounded answer, we are not aware of any current language in wide use that does so.

We note that the problem is trivial if either operand is zero and further that the magnitudes of the operands are important but the signs are not. Therefore, we shall confine our in depth analysis to the case where $a\geq b>0$ . Enforcing this condition will be an initialization step when we develop an actual code. This paper will be restricted to the case where all floating point calculations are done in IEEE 754 compliant arithmetic using round-to-nearest rounding, although many of the results can be extended to other formats under proper conditions.

1. Existing approaches

The need to compute $\sqrt{a^{2}+b^{2}}$ is common in numerical computation and so there are a few existing standard codes for doing so. A common textbook approach is the widely known stable algorithm for this computation that is used in Julia 1.1 which we shall call Julia1.1_hypot(a,b). It essentially amounts to

Algorithm 1.

Julia1.1_hypot()* *

if* a == 0 then *

* h = 0 *

else

* r = b/a *

* h = asqrt(1+rr) *

end* *if

This approach avoids unecessary overflow/underflow that might occur in an interim calculation when very large/small inputs are squared. In the non-trivial case it accomplishes this by rescaling so that the quantity $r$ that must be squared is $1\geq r\geq 0$ and this cannot lead to an overflow/underflow error prior to the calculation of the square root. Overflow is only possible if the true value of $\sqrt{a^{2}+b^{2}}$ is beyond the range of the floating point format.

The downside of this approach is that it rescales even when rescaling is unecessary and that rescaling generates additional rounding error. We will see in our experiements that this approach sometimes give answers that are as far as two ulps from the correctly rounded answer.

The second existing approach is the code that has been delivered with the standard C math library since at least 1991. The code appears in appendix B and we shall refer to it as clib_hypot(). It deals with widely varying operands and prevents intermediate overflow/underflow using methods similar to those we will use in ours.111There is a clear flaw in the threshold used for widely varying operands and it is much broader than necessary. It works by using a variety of tricks to artificially extend the precision of the computed $a^{2}+b^{2}$ so that they get a correctly rounded value (or very nearly so). This is then passed to the sqrt() function. What we will observe in the testing is that all this additional work can be effectively superseded by a single fused multiply-add.

2. Mathematical Preliminaries

We begin with a few definitions. We shall denote by $\mathbb{F}\subset\mathbb{R}$ the set of all strictly positive floating point numbers in the normalized range of the current format. We define $fl(x):\mathbb{R}\rightarrow\mathbb{F}$ to be a function such that $fl(x)$ is the element of $\mathbb{F}$ that is closest to $x$ provided $x$ lies within the normalized range of the current format. We define machine epsilon, which we will denote by $\epsilon$ , to be the difference between $1$ and the closest larger element of $\mathbb{F}$ . For example, in IEEE 754 double precision $\epsilon=2^{-52}$ . We define $F_{max}$ and $F_{min}$ to be, respectively, the largest and smallest elements of $\mathbb{F}$ .

We will also need the following three lemmas. The first two both relate to floating point computations.

Lemma 1.

Let $x\in\mathbb{R}$ lie within the normalized range of the current format. Then

[TABLE]

for some $\delta$ satisfying $|\delta|<\frac{\epsilon}{2}$

This lemma is very well known and can be found in [1]. A lesser known but related lemma is:222To see why it’s true simply consider the operation 1.0*(1.0 - eps/3.9).

Lemma 2.

Let $x\in\mathbb{F}$ be any normalized floating point number. Then

[TABLE]

for any $\delta$ satisfying $|\delta|<\frac{\epsilon}{4}$ .

And finally, this analytical lemma will be useful in what follows and is easily established by examining the Maclaurin series expansion of $\sqrt{1+x}$ :

Lemma 3.

If $x>0$ then $\sqrt{1+x}<1+\frac{x}{2}$

3. Operands with widely differing magnitudes

To see what happens when $a>b$ have widely differing magnitudes observe that

[TABLE]

where the inequality follows by replacing $x$ with $(b/a)^{2}$ in lemma 3. Now, if

[TABLE]

then lemma 2 guarantees that $a=fl(\sqrt{a^{2}+b^{2}})$ . We can rewrite the inequality and accept $a$ as the correctly rounded answer whenever $b\leq a\sqrt{\epsilon/2}$ . Note that this form will also yield the exact answer whenever $b=0$ . If we test for this case first in our code then we will not have to separately check for zero operands in the initialization phase.

4. Operands without widely differing magnitudes

When the operands do not have widely differing magnitudes then our approach will be to compute $a^{2}+b^{2}$ with some care and use the built-in sqrt() function. There could be an unecessary floating point exception if the calculation of $a^{2}+b^{2}$ leads to an overflow/underflow/denormalization and we need to avoid that where possible. Clearly, there can be no overflow if $a\leq\sqrt{F_{max}/2}$ and there can be no underflow/denormalization provided that $b\geq\sqrt{F_{min}}$ . If one of these is violated333At this stage of the process it is not possible to violate both because such a pair of operands would have widely differing magnitudes. we can avoid the exception by rescaling the operands, doing the calculation, and then scaling back. When such rescaling is necessary there are two options. Some codes use the standard library functions frexp() and ldexp() to directly change the exponents and hence effect the rescaling without incurring rounding errors. The only downside of this is that it can be a bit more costly and it is not as direct. A better approach is to rescale by multiplying by a power of the base. This has the same effect as directly changing the exponents (i.e. there is no rounding error) but it leverages the hardware floating point operations which are generally much faster. For this to work the rescaling constant needs to be an appropriate integer power of the base. The precise range of appropriate values is dependent on the particular floating point format but one relatively format independent method for finding an appropriate rescaling value is to use eps(sqrt(floatmin(T))) where T is the floating point type. This will give a rescaling whose value is smaller than $\sqrt{F_{min}}$ and whose reciprocal is larger than $\sqrt{F_{max}/2}$ . This can be used to rescale and then scale back the operands with no rounding error. Moreover, the cumbersome calculation of this constant should be taken care of at compile time so there is no additional cost.

A little rounding error analysis will be very useful at this point. Let $z\in\mathbb{F}$ be the result of our floating point calculation of $a^{2}+b^{2}$ . Because this calculation will be done in floating point there must be some $\delta_{1}$ , whose value we do not yet know, such that

[TABLE]

Now let $h=\mbox{{\tt sqrt}}(z)$ . Since we are assuming that the sqrt() function is correctly rounded lemma 1 implies that $h=\sqrt{z}(1+\delta_{2})$ for some $|\delta_{2}|<\frac{\epsilon}{2}$ . Putting this all together and invoking lemma 3 leads to

[TABLE]

The first thing we should observe is that if we were able to compute a correctly rounded $z$ then $|\delta_{1}|<\frac{\epsilon}{2}$ . Plugging in above we see that the relative error in the computed $h$ could be as large as

[TABLE]

and hence the computed $h$ , after rounding, could be as much as one $ulp(h)$ different from the correctly rounded true value.

In a similar fashion if $|\delta_{1}|<\frac{3\epsilon}{2}$ then the relative error in the computed $h$ could be as large as

[TABLE]

and hence the computed $h$ , after rounding, could not be more than one $ulp(h)$ different from the correctly rounded true value. Careful rounding error analysis of the computation $z=a*a+b*b$ shows that the computed value satisfies

[TABLE]

and hence our computed answer must be within one ulp of the correctly rounded one. An excellent treatment can be found in section 5.3 of [2].

4.1. Naive Approaches

At this point we propose two naive algorithms for the computation in the event that the operands are not widely separated. First is the direct calculation

Algorithm 2.

Naive (Unfused)

h = sqrt(aa+bb)* *

And the second uses the fused multiply-add operation which allows us to get a more accurate calculation of $a^{2}+b^{2}$ .

Algorithm 3.

Naive (Fused)

h = sqrt(fma(a,a,bb)) *

4.2. Differential Correction

As we noted at the end of section 4 the correctly rounded square root of the correctly rounded $a^{2}+b^{2}$ can still be off by as much as one ulp. This hints at the possibility that working harder to compute $a^{2}+b^{2}$ more accurately may not be the best path to a better answer. That leads us to this final approach which we shall see has a great deal of merit.

A subtle and highly underappreciated phenomenon of floating point computation is that formally invertible functions are often not one-to-one in correctly rounded floating point. The square root function is a perfect example. If we let $S=\left\{x\in\mathbb{F}|1\leq x<4\right\}$ then it is clear that sqrt() maps the finite set $S$ onto a proper subset of itself and hence is not one-to-one by the pigeonhole principle.

Why is this is important? Because it tells us that we should not be looking for a correction $x$ that

[TABLE]

but rather we should be looking for a correction $x$ such that if $h=sqrt(a*a+b*b)$ is the computed hypoteneuse we have

[TABLE]

If we had such an $x$ then we could use the Maclaurin series expansion

[TABLE]

truncated to first order to correct our computed hypoteneuse value.444We note that our testing indicates that there is no point in taking the Maclaurin series correction past the first order term.

Setting $h^{2}-x=a^{2}+b^{2}$ allows us to find that $x=h^{2}-a^{2}-b^{2}$ and although we cannot generally compute $x$ exactly, we do want to do so very carefully. If we let $\delta=h-a$ then the reader can verify that

[TABLE]

Note that because $h$ and $a$ are nearly equal numbers there is no rounding error in computing $\delta=h-a$ . The precise arrangement and evaluation order of this formula is critical to get the best performance over the full range of values.

The form above gives good performance over the full range of operands but we can do better if we switch approaches based on the sizes of the arguments. In particular, if $h\leq 2b$ then we will set $\delta=h-b$ and use

[TABLE]

otherwise, we will set $\delta=h-a$ and use

[TABLE]

in both cases $\delta$ is computed without rounding error. It is worth noting that there are many other ways to compute $x$ and one can break this problem into lots of tiny pieces and do better but we feel that this approach is the best tradeoff between accuracy and efficient code. We can now use this to apply a correction to the computed value of $h$ which leads to our third algorithm which uses only the standard floating point operations.

Algorithm 4.

Corrected (Unfused)

h = sqrt(aa+bb)* *

if* h <= 2ay* then *

* delta = h-ay *

* x = ax(2delta - ax)+(delta-2(ax-ay))delta *

else

* delta = h-ax *

* x = 2delta*(ax-2ay) + (4delta-ay)ay + deltadelta* *

end* *if

h = h - x/(2h) *

Finally, if it happens that we are on an architecture that supports the fused multiply-add then it is possible to compute the correction far more accurately. This leads to our final code:

Algorithm 5.

Corrected (Fused)

h = sqrt(fma(a,a,bb)) *

h_sq = hh *

a_sq = aa *

x = fma(-b,b,h_sq-a_sq) + fma(h,h,-h_sq) - fma(a,a,-a_sq)* *

h = h - x/(2h) *

5. Testing

We now test our proposed algorithms against the two existing approaches. As a baseline for testing purposes we will use the big float format in Julia to do an arbitrary precision calculation of the hypoteneuse and then cast the result to a double precision floating point number to get a correctly rounded value that we shall call $\bar{h}$ . We note that in some exceedingly rare situations it may not be true that this yields the correctly rounded value (due to the effects of double rounding) but it is certainly good enough for our purposes.

Our first test will consist of creating $10^{9}$ random pairs of double precision floating point numbers where both are distributed according to a standard normal distribution. We will run all of our algorithms on each pair as well as computing $\bar{h}$ . We summarize the percentage of times each algorithm differed from $\bar{h}$ by exactly one ulp and exactly two ulps in table 1.

We immediately observe several things. First of all, the Julia1.1_hypot() code which is a very widely known approach to the problem performs rather horribly. The willingness to add rounding error to escape intermediate overflow/underflow problems destroys accuracy. Not only does it have the worst performance of all the approaches it is the only one that can produce two ulp errors (although this is admittedly rare). Second, the clib_hypot() and the Naive (Fused) approaches yield very comparable results. Both are basically trying to improve accuracy by computing $a^{2}+b^{2}$ with additional precision and so it is not surprising that they behave similarly even though the get the additional precision in different ways. Unfortunately, both suffer from the effects of iterated rounding. Finally, and most importantly, we see very clearly the superiority of the corrected approach (whether fused or unfused). When leveraging the fused multiply-add we get correctly rounded results every time. And the unfused version runs almost as well.

A further experiment will more completely display the differences in these algoritms. This time we track their accuracy over a range of relative scales of the operands. For this experiment we will use uniformly distributed operands. In particular, we will let $a\sim U(2^{N},2^{N+1})$ and $b\sim U(1,2)$ . In effect, our test pairs will be floating point numbers whose significands are uniformly distributed, but whose exponents differ by $N$ . We will run $10^{9}$ random pairs for each value $N=0,1,...,29$ and will compute the percentage of incorrectly rounded results for each of the six approaches. The results appear graphically in figure 1 as well as numerically in table 2. One can see the stunning difference in performance between the approaches. Note that in table 2 we observe that we are getting true correctly rounded results more than $99\%$ of the time for all relative scales for the unfused version of the corrected approach.

Julia code for all of the algorithms proposed in this paper appear in Appendix A.

6. Conclusions

Both the naive (unfused) and naive (fused) algorithms are clear improvements on the Julia1.1_hypot() code. Moreover, with its more accurate calculation of $a^{2}+b^{2}$ the naive (fused) algorithm gives results that are directly comparable to the results from clib_hypot() with far less work. In many ways, this demonstrates the limits of trying to do better by just computing $a^{2}+b^{2}$ more accurately. However, the hands down winner in all of this is the corrected code (both versions). It requires substantially less work than the clib_hypot() code but delivers vastly superior results.

Acknowledgments

The author would like to thank Prof. Lucas Wilcox of the Naval Postgraduate School for invaluable help with the subtleties of the Julia programming language, and also Claude-Pierre Jeannerod of INRIA for several very helpful comments on the analysis.

Appendix A: Julia Source Code for the Tested Algorithms

function MyHypot1(x::T,y::T) where T<:AbstractFloat # Naive (Unfused) #Return Inf if either or both imputs is Inf (Compliance with IEEE754) if isinf(x) || isinf(y) return convert(T,Inf) end

# Order the operands
ax,ay = abs(x), abs(y)
if ay > ax
    ax,ay = ay,ax
end

# Widely varying operands
if ay < ax*sqrt(eps(T)/2)  #Note: This also gets ay == 0
    return ax
end

# Operands do not vary widely
scale = eps(sqrt(floatmin(T)))  #Rescaling constant
if ax > sqrt(floatmax(T)/2)
    ax = ax*scale
    ay = ay*scale
    scale = inv(scale)
elseif ay < sqrt(floatmin(T))
    ax = ax/scale
    ay = ay/scale
else
    scale = one(scale)
end
sqrt(ax*ax+ay*ay)*scale

end

function MyHypot2(x::T,y::T) where T<:AbstractFloat # Naive (Fused) #Return Inf if either or both imputs is Inf (Compliance with IEEE754) if isinf(x) || isinf(y) return convert(T,Inf) end

# Order the operands
ax,ay = abs(x), abs(y)
if ay > ax
    ax,ay = ay,ax
end

# Widely varying operands
if ay <= ax*sqrt(eps(T)/2)  #Note: This also gets ay == 0
    return ax
end

# Operands do not vary widely
scale = eps(sqrt(floatmin(T)))  #Rescaling constant
if ax > sqrt(floatmax(T)/2)
    ax = ax*scale
    ay = ay*scale
    scale = inv(scale)
elseif ay < sqrt(floatmin(T))
    ax = ax/scale
    ay = ay/scale
else
    scale = one(scale)
end
sqrt(muladd(ax,ax,ay*ay))*scale

end

function MyHypot3(x::T,y::T) where T<:AbstractFloat # Corrected (Unfused) #Return Inf if either or both imputs is Inf (Compliance with IEEE754) if isinf(x) || isinf(y) return convert(T,Inf) end

# Order the operands
ax,ay = abs(x), abs(y)
if ay > ax
    ax,ay = ay,ax
end

# Widely varying operands
if ay <= ax*sqrt(eps(T)/2)  #Note: This also gets ay == 0
    return ax
end

# Operands do not vary widely
scale = eps(sqrt(floatmin(T)))  #Rescaling constant
if ax > sqrt(floatmax(T)/2)
    ax = ax*scale
    ay = ay*scale
    scale = inv(scale)
elseif ay < sqrt(floatmin(T))
    ax = ax/scale
    ay = ay/scale
else
    scale = one(scale)
end
h = sqrt(ax*ax+ay*ay)

# This is a well balanced single branch code
# delta = h-ax
#h -= ( delta*(2*(ax-ay)) + (2*delta-ay)*ay + delta*delta)/(2*h)
# End single branch

# This is a well balanced twopar branch code
if h <= 2*ay
    delta = h-ay
    h -= (ax*(2*delta - ax)+(delta-2*(ax-ay))*delta)/(2*h)
else
    delta = h-ax
    h -= ( 2*delta*(ax-2*ay) + (4*delta-ay)*ay + delta*delta)/(2*h)
end
# End two branch code

h*scale

end

function MyHypot4(x::T,y::T) where T<:AbstractFloat # Corrected (Fused) #Return Inf if either or both imputs is Inf (Compliance with IEEE754) if isinf(x) || isinf(y) return convert(T,Inf) end

# Order the operands
ax,ay = abs(x), abs(y)
if ay > ax
    ax,ay = ay,ax
end

# Widely varying operands
if ay <= ax*sqrt(eps(T)/2)  #Note: This also gets ay == 0
    return ax
end

# Operands do not vary widely
scale = eps(sqrt(floatmin(T)))  #Rescaling constant
if ax > sqrt(floatmax(T)/2)
    ax = ax*scale
    ay = ay*scale
    scale = inv(scale)
elseif ay < sqrt(floatmin(T))
    ax = ax/scale
    ay = ay/scale
else
    scale = one(scale)
end
h = sqrt(fma(ax,ax,ay*ay))
h_sq = h*h
ax_sq = ax*ax
x = fma(-ay,ay,h_sq-ax_sq) + fma(h,h,-h_sq) - fma(ax,ax,-ax_sq)
h-=x/(2*h)
h*scale

end

Appendix B: C Source Code for clib_hypot()

/* @(#)e_hypot.c 1.3 95/01/18 / /

====================================================
Developed at SunSoft, a Sun Microsystems, Inc. business.
Permission to use, copy, modify, and distribute this
software is freely granted, provided that this notice
is preserved.
==================================================== */

#include "cdefs-compat.h" //__FBSDID(" $FreeBSD: src/lib/msun/src/e_hypot.c,v 1.14 2011/10/15 07:00:28 das Exp$ ");

/* __ieee754_hypot(x,y) *

Method : ΨIf (assume round-to-nearest) z=xx+y*y *Ψhas error less than sqrt(2)/2 ulp, than *Ψsqrt(z) has error less than 1 ulp (exercise).

ΨSo, compute sqrt(xx+yy) with some care as Ψfollows to get the error below 1 ulp: * ΨAssume x>y>0; Ψ(if possible, set rounding to round-to-nearest) Ψ1. if x > 2y use ΨΨx1x1+(yy+(x2(x+x1))) for xx+yy Ψwhere x1 = x with lower 32 bits cleared, x2 = x-x1; else Ψ2. if x <= 2y use ΨΨt1y1+((x-y)(x-y)+(t1y2+t2y)) *Ψwhere t1 = 2x with lower 32 bits cleared, t2 = 2x-t1, *Ψy1= y with lower 32 bits chopped, y2 = y-y1. *ΨΨ *ΨNOTE: scaling may be necessary if some argument is too *Ψ large or too tiny *

Special cases: *Ψhypot(x,y) is INF if x or y is +INF or -INF; else *Ψhypot(x,y) is NAN if x or y is NAN.
Accuracy:
Ψhypot(x,y) returns sqrt(x^2+y^2) with error less
Ψthan 1 ulps (units in the last place) */

#include <float.h> #include <openlibm_math.h>

#include "math_private.h"

OLM_DLLEXPORT double __ieee754_hypot(double x, double y) { Ψdouble a,b,t1,t2,y1,y2,w; Ψint32_t j,k,ha,hb;

ΨGET_HIGH_WORD(ha,x); Ψha &= 0x7fffffff; ΨGET_HIGH_WORD(hb,y); Ψhb &= 0x7fffffff; Ψif(hb > ha) {a=y;b=x;j=ha; ha=hb;hb=j;} else {a=x;b=y;} Ψa = fabs(a); Ψb = fabs(b); Ψif((ha-hb)>0x3c00000) {return a+b;} /* x/y > 260 / Ψk=0; Ψif(ha > 0x5f300000) {Ψ/ a>2500 / Ψ if(ha >= 0x7ff00000) {Ψ/ Inf or NaN / Ψ u_int32_t low; Ψ / Use original arg order iff result is NaN; quieten sNaNs. / Ψ w = fabs(x+0.0)-fabs(y+0.0); Ψ GET_LOW_WORD(low,a); Ψ if(((ha&0xfffff)|low)==0) w = a; Ψ GET_LOW_WORD(low,b); Ψ if(((hb^0x7ff00000)|low)==0) w = b; Ψ return w; Ψ } Ψ / scale a and b by 2**-600 / Ψ ha -= 0x25800000; hb -= 0x25800000;Ψk += 600; Ψ SET_HIGH_WORD(a,ha); Ψ SET_HIGH_WORD(b,hb); Ψ} Ψif(hb < 0x20b00000) {Ψ/ b < 2**-500 / Ψ if(hb <= 0x000fffff) {Ψ/ subnormal b or 0 / Ψ u_int32_t low; ΨΨGET_LOW_WORD(low,b); ΨΨif((hb|low)==0) return a; ΨΨt1=0; ΨΨSET_HIGH_WORD(t1,0x7fd00000);Ψ/ t1=2^1022 / ΨΨb = t1; ΨΨa = t1; ΨΨk -= 1022; Ψ } else {ΨΨ/ scale a and b by 2^600 / Ψ ha += 0x25800000; Ψ/ a = 2^600 / ΨΨhb += 0x25800000;Ψ/ b = 2^600 / ΨΨk -= 600; ΨΨSET_HIGH_WORD(a,ha); ΨΨSET_HIGH_WORD(b,hb); Ψ } Ψ} / medium size a and b / Ψw = a-b; Ψif (w>b) { Ψ t1 = 0; Ψ SET_HIGH_WORD(t1,ha); Ψ t2 = a-t1; Ψ w = sqrt(t1t1-(b(-b)-t2(a+t1))); Ψ} else { Ψ a = a+a; Ψ y1 = 0; Ψ SET_HIGH_WORD(y1,hb); Ψ y2 = b - y1; Ψ t1 = 0; Ψ SET_HIGH_WORD(t1,ha+0x00100000); Ψ t2 = a - t1; Ψ w = sqrt(t1y1-(w(-w)-(t1y2+t2b))); Ψ} Ψif(k!=0) { Ψ u_int32_t high; Ψ t1 = 1.0; Ψ GET_HIGH_WORD(high,t1); Ψ SET_HIGH_WORD(t1,high+(k<<20)); Ψ return t1*w; Ψ} else return w; }

#if LDBL_MANT_DIG == 53 __weak_reference(hypot, hypotl); #endif

Bibliography2

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] M. L. Overton , Numerical Computing with IEEE Floating Point Arithmetic , Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2001.
2[2] C-P. Jeannerod and S. M. Rump , On Relative Errors of Floating-Point Operations: Optimal Bounds and Applications , Mathematics of Computation, American Mathematical Society, 2018, 87, pp. 803–819.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Taxonomy

An Improved Algorithm for hypot(a,b)

Abstract.

Key words and phrases:

2000 Mathematics Subject Classification:

1. Existing approaches

Algorithm 1**.**

2. Mathematical Preliminaries

Lemma 1**.**

Lemma 2**.**

Lemma 3**.**

3. Operands with widely differing magnitudes

4. Operands without widely differing magnitudes

4.1. Naive Approaches

Algorithm 2**.**

Algorithm 3**.**

4.2. Differential Correction

Algorithm 4**.**

Algorithm 5**.**

5. Testing

6. Conclusions

Acknowledgments

Appendix A: Julia Source Code for the Tested Algorithms

Appendix B: C Source Code for clib_hypot()

Algorithm 1.

Lemma 1.

Lemma 2.

Lemma 3.

Algorithm 2.

Algorithm 3.

Algorithm 4.

Algorithm 5.