date: Mon, 10 Mar 2008 21:06:13 +0100
group: microsoft.public.vstudio.development
Re: 64 bit integers still slow on X64 ?!?!?
MitchAlsup wrote:
> Integer 64-bit pipeline back-to-back issue delay:
> Add/Sub: 1 cycle scalarity 3/cycle
> Mul: 5 cycles scalarity 1/cycle
> Div: 67 cycles scalarity 1/cycle
The divide time is long enough that some interesting workarounds
become feasible:
I'm willing to bet that nearly all divisors will actually fit inside 53
bits, i.e. inside a double's mantissa, not needing the remaining 11 bits
of the 64-bit integer, in which case...
>
> Float 64-bit pipeline back-to-back issue delay
> Add/Sub: 1 cycle scalarity 1/cycle
> Mul: 1 cycle scalarity 1/cycle
> Div: 17 cycles scalarity 1/cycle shared with FMul.
... a 17-cycle fp divide might do a pretty good job!
In fact, since using the fp divider will require a fixup step anyway,
why not go all the way and start by calculating an approximate
reciprocal, then use that in a two-stage process to generate the exact
answer?
If you can set up your code to do two or four of these divisions at the
same time, with independent divisors, then you can use the SSE 12-bit
1/x lookup opcode (RCPPS), one Newton-Raphson (NR) iteration to get
~float precision, then convert to double and do another iteration for
near double precision.
Multiply the dividend by this ~50-bit reciprocal, convert to integer,
back-multiply and subtract.
One more iteration like this and you have four exact results, probably
in close to the same time as a single 64-bit DIV opcode. :-)
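
A minimal scalar sketch of the steps above, in C, for one unsigned
division at a time. It is only an illustration under stated assumptions:
div_via_fp is a made-up name, the divisor must be nonzero and fit in 53
bits, and a plain float divide stands in for the SSE lookup plus the
first NR iteration. It shows the arithmetic, not the speed; a fast
version would do two or four divisions at once in SSE registers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: n / d for unsigned 64-bit n, using a floating-point
 * reciprocal plus integer fixups. Assumes d != 0 and d < 2^53. */
uint64_t div_via_fp(uint64_t n, uint64_t d)
{
    assert(d != 0 && d < (UINT64_C(1) << 53));

    /* ~float-precision reciprocal. Assumption: this float divide stands in
     * for the 12-bit lookup followed by one Newton-Raphson iteration. */
    float  r0 = 1.0f / (float)d;

    /* Convert to double and do one more NR step: r = r * (2 - d*r). */
    double dd = (double)d;              /* exact, since d < 2^53 */
    double r  = (double)r0;
    r = r * (2.0 - dd * r);

    /* First stage: estimate the quotient, clamping so that the conversion
     * to integer stays in range when (double)n rounds up to 2^64. */
    double   qf = (double)n * r;
    uint64_t q  = (qf >= 18446744073709551616.0) ? UINT64_MAX : (uint64_t)qf;

    /* Back-multiply and subtract. The true difference is small, so the
     * wrapped unsigned result is reinterpreted as a signed value. */
    int64_t rem = (int64_t)(n - q * d);

    /* Second stage: the same trick applied to the leftover. */
    int64_t q2 = (int64_t)((double)rem * r);
    q   += (uint64_t)q2;
    rem -= q2 * (int64_t)d;

    /* Final fixup: bring the remainder into [0, d). */
    while (rem < 0)           { q--; rem += (int64_t)d; }
    while (rem >= (int64_t)d) { q++; rem -= (int64_t)d; }
    return q;
}
```

For example, div_via_fp(100, 7) takes the same path as any other input
and agrees with the compiler's own 100 / 7.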
> Divide is eating your lunch, and multiply is not doing you any favors.
> In Barcelona, int div comes down into the 20-odd cycle range thanks to
> dedicated int div circuitry.
At which point tricky code becomes less useful.
Terje
>
> Mitch
--
-
"almost all programming can be viewed as an exercise in caching"
date: Mon, 10 Mar 2008 21:58:24 +0100
author: Terje Mathisen