[m-dev.] deleting tags_high

Paul Bone paul at bone.id.au
Fri Jul 27 12:14:16 AEST 2018


On Fri, Jul 27, 2018 at 03:50:04AM +0200, Zoltan Somogyi wrote:
> 
> 
> On Fri, 27 Jul 2018 10:54:47 +1000, Paul Bone <paul at bone.id.au> wrote:
> > It surprises me that tags_low is faster. 
> 
> It shouldn't. With low tags, given a tagged pointer, subtracting the known
> primary tag (e.g. -1) and adding the offset of the desired field (e.g. +8)
> to the value of the base pointer can be done together (by adding -1+8 = +7
> to the value of the tagged pointer). Load and store instructions in most
> instruction sets include the ability to specify such small offsets, so the arithmetic
> is not incurring any cost over the cost of the load or store itself. With high tags,
> the combined offsets cannot possibly fit into the load or store instruction
> (they would need 64 bits, whereas load/store instructions typically have room
> for 8 or 16 bits). The masking off of the high tag will typically take two or three
> instructions (load the tag value constant in the low order bits of a register,
> shift it up to the required bit positions, then mask), and *then* do the load
> or store. Since the address to access depends on the result of the mask op,
> there is no room for instruction level parallelism either. This is why tags high
> is not just slower, but significantly slower than tags low. It also has more instructions,
> making the I-cache less effective.

Yeah, that makes sense, plus the use of the extra register (small compared
to the other operations).  This is also going to be slow when you're
following pointers from one struct to another, serialising them all with the
memory reads.
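
To make the difference concrete, here is a minimal C sketch of the two field
accesses.  It is not the Mercury runtime's code: word_t, mk_low, field_low,
mk_high, field_high and the 3-bit high tag layout are all names and choices
made up for illustration, and it assumes word-aligned cells whose addresses
have their top 3 bits clear (true for user-space addresses on current
x86_64).

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t word_t;                 /* hypothetical machine-word type */

/* Low tags: cells are word aligned, so the low bits of a cell address are 0. */
#define LOW_TAG_MASK   ((word_t)sizeof(word_t) - 1)

/* High tags: hypothetical layout that keeps the tag in the top 3 bits. */
#define HIGH_TAG_BITS  3
#define HIGH_TAG_SHIFT (sizeof(word_t) * 8 - HIGH_TAG_BITS)
#define HIGH_TAG_MASK  (~(word_t)0 << HIGH_TAG_SHIFT)

/* Put a primary tag in the low bits of a cell address. */
static word_t mk_low(const word_t *cell, word_t tag)
{
    assert(((word_t)cell & LOW_TAG_MASK) == 0 && tag <= LOW_TAG_MASK);
    return (word_t)cell | tag;
}

/*
 * Low-tag field access when the tag is known: removing the tag and adding
 * the field offset collapse into one constant, e.g. -1 + 8 = +7 for tag 1
 * and field 1 with 8-byte words.  A compiler can normally fold that
 * constant into the displacement of the load itself.
 */
static word_t field_low(word_t tagged, word_t known_tag, word_t field)
{
    const char *p = (const char *)tagged - known_tag + field * sizeof(word_t);
    return *(const word_t *)p;
}

/* Put a primary tag in the high bits; assumes those address bits are 0. */
static word_t mk_high(const word_t *cell, word_t tag)
{
    assert(((word_t)cell & HIGH_TAG_MASK) == 0);
    assert(tag < ((word_t)1 << HIGH_TAG_BITS));
    return (word_t)cell | (tag << HIGH_TAG_SHIFT);
}

/*
 * High-tag field access: the tag has to be masked off with a word-sized
 * constant before the load, and the load's address depends on the result
 * of that mask.
 */
static word_t field_high(word_t tagged, word_t field)
{
    const word_t *cell = (const word_t *)(tagged & ~HIGH_TAG_MASK);
    return cell[field];
}
```

In field_low the untagging is just part of the address arithmetic, while in
field_high the mask is a separate step that the load has to wait for.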

> > Hrm, although the GC would need to
> > handle this the way it handles low bits, which I don't know but it might
> > treat them as interior pointers. 
> 
> Yes, we register 0-3 (on 32 bit machines) and 0-7 (on 64 bit machines)
> as possible offsets for boehm gc, so any value at any such offset to
> the start of an allocated block counts as a pointer to the block.
> 
> > One benefit of tags_high is that on x86_64 (at least at the moment) there
> > are 16 bits available at the high end of a pointer.
> > 
> > https://en.wikipedia.org/wiki/X86-64#Virtual_address_space_details
> 
> Look up reports of the transition from the IBM S/360 to the S/370 in the 1970s.
> Both used 32 bit words, but addresses were only 24 bits on the S/360.
> People who stashed other info in the top 8 bits of words containing pointers
> regretted it when the S/370 came along, which extended pointer size to 32 bits
> (though IBM literature calls them 31 bit, since the top bit had another use).

Yes, in that link I provided you can also see where Mozilla hit this problem
on some architectures.  In the context of JS it does make sense, since it
makes NaN boxing possible.

> The rule "In addition, the AMD specification requires that the most significant
> 16 bits of any virtual address, bits 48 through 63, must be copies of bit 47"
> on the page you link to was specified by AMD specifically to avoid any similar
> problems. I know because one of the architects involved said so on comp.arch
> at the time.

:-)  Yes, you'd need to clear those bits (to 0, since the address is in user
space) before dereferencing the pointer, and you'd have the problem you
mentioned above, with the load immediate, shift and mask.
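
As a rough C sketch of what that would involve with 48-bit virtual addresses:
the helper names (pack_high16, tag_of, addr_of, is_canonical48) are
hypothetical, and the 48-bit width is an assumption about current x86_64
implementations, not something that is guaranteed to stay true.

```c
#include <assert.h>
#include <stdint.h>

#define ADDR_BITS 48                                   /* assumed address width */
#define ADDR_MASK ((UINT64_C(1) << ADDR_BITS) - 1)

/* Stash a 16-bit tag in bits 48..63 of a user-space address. */
static uint64_t pack_high16(uint64_t addr, uint16_t tag)
{
    assert((addr & ~ADDR_MASK) == 0);                  /* user space: top bits are 0 */
    return addr | ((uint64_t)tag << ADDR_BITS);
}

/* Recover the tag. */
static uint16_t tag_of(uint64_t word)
{
    return (uint16_t)(word >> ADDR_BITS);
}

/* Clear the tag bits again; this must happen before any dereference. */
static uint64_t addr_of(uint64_t word)
{
    return word & ADDR_MASK;
}

/* AMD's rule: bits 48..63 must all be copies of bit 47. */
static int is_canonical48(uint64_t addr)
{
    uint64_t top = addr >> (ADDR_BITS - 1);            /* bits 47..63, 17 bits */
    return top == 0 || top == 0x1FFFF;
}
```

A tagged word fails is_canonical48 for any non-zero tag, so dereferencing it
without the addr_of step would fault rather than silently work.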

> However, just because there are 16 bits available NOW does not mean that
> those same 16 bits will ALWAYS be available. The page you link to even has
> diagrams illustrating this fact.

Yes, I acknowledged that in my first e-mail; it was something you said
when I raised this previously.  But I raised it again because I had
forgotten that the load+shift+mask costs more than the free offset
calculation.

> > So if using 16 or maybe 17 bits as high tag bits sounds interesting, this is
> > Also remember that BoehmGC's minimum allocation and alignment is two machine
> > words.  So you could use 4 or 3 low tag bits.
> 
> Not all memory cells containing Mercury terms are in memory allocated by boehm;
> some are in statically generated data structures. We would need to ensure that
> these are aligned on 16 byte boundaries as well. Does anyone know how portable
> gcc's __attribute__((aligned(16))) is?

Good point.
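
One possible way to hedge on portability, sketched in C: prefer C11's
_Alignas where the compiler supports it and fall back to the gcc attribute
(which clang also accepts) or MSVC's __declspec(align).  The CELL_ALIGN16
macro, static_cell and low_bits_free are hypothetical names, and I haven't
checked this against the compilers Mercury actually supports.

```c
#include <assert.h>
#include <stdint.h>

/* Pick whichever 16-byte alignment spelling the compiler offers. */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
  #define CELL_ALIGN16 _Alignas(16)                    /* standard C11 */
#elif defined(__GNUC__)
  #define CELL_ALIGN16 __attribute__((aligned(16)))    /* gcc and clang */
#elif defined(_MSC_VER)
  #define CELL_ALIGN16 __declspec(align(16))           /* MSVC */
#else
  #error "no known way to request 16-byte alignment"
#endif

/* A statically allocated two-word cell, standing in for generated static data. */
static CELL_ALIGN16 const uintptr_t static_cell[2] = { 1, 2 };

/* True if the low `bits` bits of p are zero, i.e. free to hold a tag. */
static int low_bits_free(const void *p, unsigned bits)
{
    return ((uintptr_t)p & (((uintptr_t)1 << bits) - 1)) == 0;
}
```

With this, low_bits_free(static_cell, 4) should hold, which is what 4 low tag
bits would need from static data as well as from heap cells.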

> (It would be nice to have someone on the C standardization committee again :-)

:-)


-- 
Paul Bone
http://paul.bone.id.au

