
A strange idea came to me today while thinking about garbage collector design. On priors, probably not original, but I don't remember seeing it before, so!

There are a couple of usual mechanisms for allowing the mutator to hook object reclamation in GCs. Finalizers are a traditional one, but they don't give the mutator much control over their timing, sometimes leading to exciting concurrency issues, and object resurrection during finalization seems to be a source of further complications in both GC and application design. Java has been on the path to deprecating Object.finalize for a while now.

(I'll refer to the procedure that's meant to execute after an object becomes unreachable as the “hook”, below, to make it clearer that it's not always a finalizer in the above narrower sense.)

Then there are guardians, such as the ones in Guile; will executors in Racket seem similar. These accumulate registered objects that would otherwise have been collected, then allow the mutator to dequeue them for cleanup. That avoids the synchronization issues of finalizers, but guardians interact weirdly with weak references, and resurrection still feels thorny to me; Andy Wingo has written about some related issues.

Things get simpler if you don't allow the hook access to the original object. The most recent place I've run into this personally is SBCL's sb-ext:finalize, whose documentation warns that you shouldn't close over the object in the hook closure or else it will never get collected. Similarly, in Java, phantom references are now promoted over finalizers; they use a queueing mechanism similar to guardians but can't be dereferenced. The Cleaner class abstracts over this to provide an interface that handles the queue processing in the background; it too warns not to close over the original object in the hook.
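
As a tiny sketch of the Java flavour of that (hedged: whether and when the hook actually runs is entirely up to the collector, and the class and message here are just for illustration):

    import java.lang.ref.Cleaner;

    public class CleanerDemo {
        public static void main(String[] args) throws InterruptedException {
            Cleaner cleaner = Cleaner.create();
            Object tracked = new Object();
            // The hook must not capture 'tracked': doing so keeps it strongly
            // reachable from the cleaner, so it would never become collectable.
            cleaner.register(tracked, () -> System.out.println("hook ran"));
            tracked = null;      // drop the last strong reference
            System.gc();         // only a hint; the hook runs some time afterwards, if at all
            Thread.sleep(100);
        }
    }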

But the problem with not having access to the original object is that it's common to need some of the information from it in the hook. For instance, finalization of GC objects is often used to backstop cleanup of foreign resources, but the hook needs the underlying resource handle for that—and if the object also provides explicit cleanup, then the hook needs to know whether it's been done already or not, so duplicating the shared info once isn't enough. The main way I know of to handle this is to box the shared info in a sub-object that the hook closes over, but the most straightforward way to do that also results in extra indirection during normal access. Keeping two copies of the info and only propagating writes avoids indirection on read but is more awkward to implement correctly. The Cleanable objects that Cleaner hands back promise to run their action at most once, whether invoked explicitly or by the collector, which helps a lot with the ergonomics of the easier cases (you trigger the same path for implicit and explicit cleanup) but still involves jumping through hoops.
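
For concreteness, here's roughly what the boxed-state pattern looks like with Cleaner, as I understand it; the ForeignBuffer and State names and the stand-in release call are mine, purely illustrative:

    import java.lang.ref.Cleaner;

    final class ForeignBuffer implements AutoCloseable {
        private static final Cleaner CLEANER = Cleaner.create();

        // The shared info lives in its own object so the cleanup action never
        // references the ForeignBuffer itself (which would keep it reachable forever).
        private static final class State implements Runnable {
            private long handle;                  // the foreign resource handle

            State(long handle) { this.handle = handle; }

            @Override public void run() {         // runs at most once, explicitly or via the GC
                // stand-in for the real foreign release call
                System.out.println("releasing handle " + handle);
                handle = 0;
            }
        }

        private final State state;                // normal access pays this extra indirection
        private final Cleaner.Cleanable cleanable;

        ForeignBuffer(long handle) {
            this.state = new State(handle);
            this.cleanable = CLEANER.register(this, state);
        }

        @Override public void close() {
            cleanable.clean();                    // explicit cleanup takes the same path, once
        }
    }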

So: what if, instead, the GC met the mutator halfway? Rather than either allowing access to the original object during finalization (with attendant complexities around reachability and resurrection) or disallowing it completely (with attendant difficulties around sharing state with the hook), you'd allocate a corpse object at registration time, which the mutator never sees while the original object is live. At finalization time, the GC would automatically copy info from the now-dead object into the corpse and pass that to the hook, while not allowing the original object pointer to escape.
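
Purely to pin down the shape of the idea, an invented registration signature might look something like the following; nothing here corresponds to an existing API, and how the slots to copy get designated is deliberately left vague:

    import java.util.function.Consumer;

    // Entirely hypothetical sketch, only to illustrate the shape of the idea;
    // no existing collector exposes anything like this.
    interface CorpseFinalization<T, C> {
        // Registration allocates (or accepts) the corpse up front. When obj later
        // becomes unreachable, the collector copies the designated slots of the
        // dead object into the corpse and hands only the corpse to the hook;
        // the dead object's pointer never escapes to mutator code.
        void register(T obj, C corpse, Consumer<C> hook);
    }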

Choosing what to copy could be done in a few ways. I like the idea of being able to designate any subset of the slots (maybe a static subset per concrete type if that makes it easier to synthesize the corpse type), but that could be pretty hairy to provide depending on the details. Simpler alternatives would be to copy everything (but that's probably more than needed) or extract a single slot (but the application has to degrade to the sub-object case above if the data is more complicated). In the case of a single source slot that's a pointer or can be converted into a tagged immediate, you could just reuse the slot where you'd have stored the corpse reference and avoid taking an extra allocation.

What have I reinvented? And is this useful? Or is it just too complicated or otherwise inferior to current practice?

Addendum like half an hour later (sigh): aha, I've found a problem I didn't think of the first time! If any of the copied slots are heap references, they could still create resurrection issues indirectly, so you still have to make sure the sequence of operations of the collector doesn't trip over that. Phooey. That's not so bad I guess—you could queue all the copied stuff to be marked along the way, and I think maybe it'd mean that ‘good’ acyclic cases can avoid triggering weird weak reference interactions or requiring extra collection cycles as compared to the guardian/will-executor approach where the original pointer always escapes. You could check for the ‘bad’ case and raise a ruckus, or try to exclude it a priori using static type restrictions, but those could both get awkward.


(This is also posted on the fediverse at mastodon.online.)

From the catalog of thoughts I had hazy intentions of actually researching and writing up in detail, and then didn't, and sat on it for like five years or whatever—so, screw it, have the sloppy version: a certain perspective on how the silo model of per-app data wound up predominating in the consumer sphere.

(Probably it's been done before, but I don't know where. We still don't have good concept-shape reference search. Maybe someone who reads this will know?)

In UX, it seems common wisdom that your design should try to follow the user's model of the interaction. What do they expect to happen? What meshes with their approach and habits? The deeper version of this (at least from my peanut-gallery perspective) was involved in the focus shift from “UI” to “UX”: if the machine model is too at odds with the user model, remake the application to fit the user, rather than dictating from the manual and expecting the user to comply.

Some of you probably remember growing up with PCs around the time they started coming into the mainstream, and becoming the de facto tech support hub for less computing-oriented family members. So you might remember this conversation:

“But where did you save the letter?”

“In Word.”

Among other things, user-managed storage and the lower prevalence of networking made this less practical as a machine model once use spanned multiple computers. Lower-capacity physical media made user management of storage volumes more pressing. (How many diskettes does that come on?) And, a bit further down the line, the lack of sophisticated search infrastructure made user organization of files more immediately beneficial.

With subsidized datacenter storage and an assumption of high-speed Internet access everywhere?

Observe what the default user model was, back in the 1990s.

Solve for the equilibrium.


Welcome to Cubetown

For somewhat ridiculous reasons which will for now remain unspecified, I recently wound up thinking about how one would neatly represent the rotation group of the cube (also known as chiral octahedral symmetry) in software, primarily for the purpose of applying its elements to 3D geometry. For background, in the general realm of 3D, one can represent arbitrary rotations as linear transformations using 3×3 matrices (or 4×4 ones, in 3DH coordinates), which can be applied to vectors easily through matrix–vector multiplication and can be easily composed with other linear transformations (or projective ones, in 3DH). For rotations specifically, one can also reach for unit quaternions, which are more compact and numerically better-behaved to manipulate in themselves. So there we see two different breadths of representation, from any 3D linear transformation down to the rotation group of the sphere—but what about going down to the rotation group of the cube, specifically?

Unlike the conceptually continuous rotation group of the sphere, which thus encourages the use of floating-point numbers to model its elements, the rotation group of the cube is finite, consisting of 24 discrete possibilities. Stack Exchange is full of examples of ways to think about the derivation thereof, of which I find the “choose one of six faces to point up, then choose one of its four edges to be the north one” approach the most geometrically intuitive to imagine. (I wasn't aware until very recently of the isomorphism with the permutation group of four elements, in fact, or the relationship of that to the four long diagonals of the cube!)

Swizzles and sign bits

So what approach do we want to base our representation on? Since the primary use here is transforming 3D vectors, we can start by looking at the 3D rotation matrices for quarter-turns around each axis, since any cube rotation can be composed of a sequence of those. Each one has the effect of swapping the two other axes and negating one of them in the process; for instance, in a right-handed coordinate system, a quarter-turn anticlockwise rotation about the positive Z axis takes (x, y, z) to (−y, x, z). From this we can also observe that the result of a cube rotation can be represented as a permutation of the axes followed by a negation of some subset of them.

There are 6 (= 3!) permutations of the axes and 8 (= 2^3) possible subsets of them to negate. Multiplied together, that gives 48 possibilities, which is twice as many as we want. In fact, this corresponds to the full symmetry group of the cube, allowing both rotation and reflection operations. So how do we constrain that to only the rotations? Looking again at the effect of a quarter-turn about an axis, we find that each quarter-turn simultaneously alters the number of negations by 1 (by turning one negative axis positive or vice versa) and flips the parity of the permutation between even and odd (by doing a single transposition). The identity rotation has zero negations and the (even) identity permutation, so an even permutation must be paired with an even number of negations, and an odd permutation must be paired with an odd number of negations, neatly halving the number of valid transformations and giving us an easy way to tell which are which.

Since we're in discrete territory and aiming at a software use, let's get out our unsigned integers. Among the six permutations of the axes, there's one even and one odd permutation for each initial element, so we can map those to even and odd numbers—equivalent to using the low bit for the parity—and let the next two bits represent the initial element between 0 and 2, giving us the numbering {0=XYZ, 1=XZY, 2=YZX, 3=YXZ, 4=ZXY, 5=ZYX}. Meanwhile, subsets of an ordered finite set can be conveniently numbered in binary as bitmasks, so we could let the negation mask be a number from 0 to 7 with bits 0, 1, and 2 set if the X, Y, and Z axes respectively are negated after the permutation—but due to the no-reflections constraint above, we only need to know the signs of two of the axes to reconstruct the sign of the third based on the permutation parity, so we drop the Z bit and keep only X and Y.

Putting those together, we have the following bitfield-ish representation:

  • Bits 4–3: initial axis of the permutation (0=X, 1=Y, 2=Z)
  • Bit 2: parity of the permutation (0=even, 1=odd)
  • Bit 1: set if the Y axis is negated after permutation
  • Bit 0: set if the X axis is negated after permutation

… and that forms a dense numbering from 0 to 23!
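
To make that concrete, here's a small Java-ish sketch of decoding an index and applying it to a vector. It assumes one particular reading of the convention, namely that the permutation lists the source axis feeding each output slot, with negations applied to the outputs; as noted further down, the forward-versus-reverse choice is arbitrary.

    // Source axis (0=X, 1=Y, 2=Z) feeding each output slot, indexed by the
    // permutation numbering {0=XYZ, 1=XZY, 2=YZX, 3=YXZ, 4=ZXY, 5=ZYX}.
    static final int[][] PERM = {
        {0, 1, 2}, {0, 2, 1}, {1, 2, 0}, {1, 0, 2}, {2, 0, 1}, {2, 1, 0}
    };

    // Apply rotation index r (0..23) to a vector: permute the source axes into
    // the output slots, then negate the flagged outputs.
    static double[] apply(int r, double[] v) {
        int axis   = (r >> 3) & 3;       // bits 4-3: initial axis of the permutation
        int parity = (r >> 2) & 1;       // bit 2: permutation parity
        int sy     = (r >> 1) & 1;       // bit 1: negate output Y
        int sx     = r & 1;              // bit 0: negate output X
        int sz     = parity ^ sy ^ sx;   // Z sign reconstructed from the low three bits
        int p      = axis * 2 + parity;  // permutation index 0..5
        return new double[] {
            sx == 0 ? v[PERM[p][0]] : -v[PERM[p][0]],
            sy == 0 ? v[PERM[p][1]] : -v[PERM[p][1]],
            sz == 0 ? v[PERM[p][2]] : -v[PERM[p][2]],
        };
    }
    // Sanity check: the anticlockwise quarter-turn about +Z from earlier encodes
    // as 0b01101 = 13, and apply(13, {x, y, z}) gives {-y, x, z} as expected.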

What do the operations look like?

For transforming 3D vectors, it seems more convenient to have the expanded permutation already available as easily addressable elements, and similarly an expanded +1/−1 representation of the signs. In the environment I was originally thinking of this in, the type system only allowed ‘heavyweight’ objects for user-defined types to begin with, so having six extra byte fields in each one seemed fine, with the instances themselves constructed at static init time and stashed in a 24-element array. To get the expanded permutation from an index, you only need the second and third elements since you already have the first, and 4 bits × 6 permutations gives us a 24-bit lookup table that fits in an immediate constant, which feels more sympathetic to the CPU than trying to do axis-number arithmetic modulo 3. The sign bit for the post-permutation Z axis can also be reconstructed as the XOR of the low three bits of the index.
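
A sketch of that immediate-constant table, under the same reading as above (putting the second output axis in the low two bits of each nibble is another arbitrary choice):

    // Second and third output axes of each permutation, 2 bits apiece, one nibble
    // per permutation index (low nibble = permutation 0): nibble = second | (third << 2).
    // {0=XYZ, 1=XZY, 2=YZX, 3=YXZ, 4=ZXY, 5=ZYX} packs to 9, 6, 2, 8, 4, 1.
    static final int PERM_TAIL = 0x148269;

    static int secondAxis(int p) { return (PERM_TAIL >>> (4 * p)) & 3; }
    static int thirdAxis(int p)  { return (PERM_TAIL >>> (4 * p + 2)) & 3; }

    // And the reconstructed sign of the post-permutation Z axis:
    static int zSign(int r) { return (r ^ (r >> 1) ^ (r >> 2)) & 1; }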

Given a dense numbering, composition and inversion seem like reasonable places for lookup tables. Doing it the easy way, a 24×24 lookup table of uint8 is 576 bytes, but since the indices are only 5 bits long, you could alternatively pack 6 of them to a uint32 or 12 of them to a uint64, for 16 bytes per row, and drop that down to 16×24 = 384 bytes. If you were willing to reconstruct the parity bit as the XOR of the parities of the two input rotations, you could pack eight 4-bit elements per uint32 and lower that to 12×24 = 288 bytes, but the position of the parity bit makes the bit-juggling really awkward. The inversion table only has one row but is otherwise similar.
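
The 12-per-uint64 packing might look something like this; the table contents would be generated once up front, say by composing the expanded permutation-and-sign forms and re-encoding, which I'm not showing here:

    // Composition table packed twelve 5-bit entries per long: two longs per row,
    // 24 rows, 384 bytes in total. The entry for (a, b) lives in COMPOSE[a * 2 + b / 12]
    // at bit offset 5 * (b % 12). Filling it in is a one-off brute-force job.
    static final long[] COMPOSE = new long[24 * 2];

    static int compose(int a, int b) {
        long word = COMPOSE[a * 2 + b / 12];
        return (int) ((word >>> (5 * (b % 12))) & 31);
    }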

Fetching the rotation of an angle about an axis also seems like a reasonable place for a tiny lookup table: 4 angles × 3 axes × 5 bits = 60 bits total. An angle expressed as a number of quarter-turns is easily normalized by masking, and inverting the reference direction of the axis or switching between clockwise and anticlockwise both just negate the angle.
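
Sketching that with the same conventions as the earlier code (right-handed axes, anticlockwise-positive quarter-turns); the twelve packed values below follow from that particular decoding, so treat them as illustrative rather than gospel:

    // The 12 axis quarter-turns packed 5 bits apiece (60 bits): the rotation for
    // q quarter-turns about axis a lives at bit offset 5 * (4 * a + q).
    static final long AXIS_TURNS =
          (0L  <<  0) | (6L  <<  5) | (2L  << 10) | (4L  << 15)   // about X: 0, 1, 2, 3 quarter-turns
        | (0L  << 20) | (20L << 25) | (1L  << 30) | (21L << 35)   // about Y
        | (0L  << 40) | (13L << 45) | (3L  << 50) | (14L << 55);  // about Z

    static int aboutAxis(int axis, int quarterTurns) {
        int q = quarterTurns & 3;  // normalize by masking; negative angles wrap correctly
        return (int) ((AXIS_TURNS >>> (5 * (4 * axis + q))) & 31);
    }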

Calculating the mapping of a face, given a face index from 0 to 5 where the low bit is the sign and the upper two bits are the axis, is a lookup in the expanded permutation for the new axis and an XOR of the sign bits; depending on whether you store the forward or reverse permutation and whether you want the forward or reverse mapping of the face, you might need to invert the rotation first.
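
And a sketch of the face mapping under the same reading, taking sign bit 1 to mean the negative face (yet another arbitrary choice). As written this gives the reverse mapping, i.e. which source face ends up on a given output face, so the forward direction would invert the rotation first:

    // Face index: (axis << 1) | signBit, with signBit 1 meaning the negative face.
    // Uses PERM from the earlier sketch.
    static int sourceFaceOf(int r, int face) {
        int outAxis = face >> 1;
        int outSign = face & 1;
        int p = ((r >> 3) & 3) * 2 + ((r >> 2) & 1);     // permutation index 0..5
        int srcAxis = PERM[p][outAxis];                   // which source axis feeds this output
        int negBit;                                       // is that output axis negated?
        if (outAxis == 0)      negBit = r & 1;
        else if (outAxis == 1) negBit = (r >> 1) & 1;
        else                   negBit = (r ^ (r >> 1) ^ (r >> 2)) & 1;
        return (srcAxis << 1) | (outSign ^ negBit);
    }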

Even though it mostly doesn't map meaningfully for the operations we're performing, I also just kind of love that the identity rotation is represented as 0.

Further directions

(or, that's what I'd call this section if I were trying to pretty it up for a journal)

There's a bunch of obvious micro-variations on this idea that I haven't really explored. Moving the parity bit to the bottom would make using a 4-bit-per-element lookup table for the composition operator easier, but looking up the permutation would become harder because its index would be split up; that could be an improvement overall, especially if you're keeping the pre-expanded objects around per above, but it maybe loses some elegance along the way. Keeping the Z sign instead of the parity is another alternative, which to me feels worse at a glance, but I could see other people having different intuitions there. Whether to index on the forward or reverse permutation of the axes makes a difference in the accesses when transforming vectors, and if you have hardware support for swizzling that wants a certain format, then maybe you want to align with that. Same thing for whether to model the negations as applied before or after the permutation.

In general, if I wanted to know performance differences between micro-level implementation variations, I'd want to actually measure, which I haven't. My baseline guess would be that if you're also doing other 3D geometry manipulation, then most of the stuff in here is cheap enough that it gets overwhelmed, and you might just fuse a given rotation into your existing matrix math before the expensive part happens elsewhere anyway, but some use cases might care.

But mostly I just went “goodness, that's a nice dense numbering for those that has a bunch of cute properties in terms of what you can do with binary integer manipulation”. So there it is. If you've seen it before, please do tell me where! I expect most of my ideas to not be original in an absolute sense, after all…
