Wednesday, May 25, 2011

When null isn't just null -- value representation in ParaSail

The qualifier optional may be applied to any type in ParaSail, effectively producing a type with one additional value, namely the null value.  In many languages, null is only used for pointer types, and is often represented as a zero address.  In ParaSail, we want to be able to add the null value to any type without presuming that the zero value is available for that purpose, so this means that the representation for null will vary depending on the particular type.

For simple types, coming up with a null value is pretty straightforward.  For unsigned integer types, a value one greater than the maximum value could be used for null.   For example, given an unsigned type going from 0 to 100, the value 101 could be used to represent null.  Similarly, for enumeration types that use an internal unsigned integer representation, such as 0 for #false and 1 for #true, the null value could be represented by one more than the maximum internal unsigned integer code, for example 2 for the null value of an optional Boolean type.  For a signed integer type, one less than the minimum value might make sense, particularly on a 2's complement machine where there is one "extra" negative value anyway.  So for a 64-bit signed integer, one might use -2**63+1 .. 2**63-1 for "normal" values of the type, and reserve -2**63 for the null value.  Most floating point representations include the notion of "Not a Number" (NaN), and some NaN value would make sense for null.  Since there are no run-time checks in ParaSail (checking all being done at compile-time), it would be fine to use a "non-signaling" NaN for null.

For more complex types, the representation for null is a bit more interesting.  One common kind of type would be a simple "wrapper," that is, a type defined by a class that has exactly one var component.  For example:
class Box<Contents_Type is Assignable<>> is
    var Contents : Contents_Type;
  exports
    function Create(Value : Contents_Type) -> Box is
      return (Contents => Value);
    end function Create;
    ...
end class Box; 
In this case, it would be nice to have the wrapper type use exactly the same representation as that of the underlying component type (e.g. Contents_Type).  This would mean that the null value for the wrapper would be the same as the null value for the component type.  This does mean that the component type must not itself be marked as optional, as then there would be no way to distinguish the wrapper being non-null but containing a single null component, from the case where the wrapper itself is null.

So our conclusion is that a wrapper type can use the same representation as its component type so long as the component type is not itself marked optional.  If the component type is itself marked optional, then the wrapper needs to allow for its "own" representation for null, which might in some cases be accommodated by simply allowing for yet one more value if the component type is "simple," or a level of indirection for a more complex component type.

Now what about more complex types?  For example, a type defined by a class with multiple var components:
class Pair<Element_Type is Assignable<>> is
    var Left : Element_Type;
    var Right : Element_Type;
  exports
    function Cons(Left, Right : Element_Type) -> Pair is
      return (Left => Left, Right => Right);
    end function Cons;
    ...
end class Pair;  
One obvious representation for a type with multiple components is as a sequence of memory cells long enough to accommodate the representation of each of the components, and then some provision for representing null, which could be by piggy-backing on one of the components if it is not itself allowed to be null, or by adding an additional bit to indicate null-ness.  However, in our initial (prototype, ParaSail-Virtual-Machine-based) implementation, we have chosen to represent every object using a single 64-bit memory cell.  This basically means that if the value cannot fit in a single 64-bit cell, it will need to use a level of indirection.  To simplify further, we won't be doing any "packing" initially, so even if the components are each quite short (such as booleans), we will nevertheless go to an indirect representation.  We do anticipate supporting packed arrays, but that would initially be handled by doing explicit shifting and masking, rather than by building in the notion of packing into the ParaSail Virtual Machine instruction set.  In the ParaSail Virtual Machine, pretty much everything occupies a single 64-bit word.

So back to the initial question -- how will we represent objects with multiple components (or with a single component whose type is marked optional)?  And how will we represent null for such objects?  One important thing to remember is that (large) ParaSail objects live in regions, and the regions are associated with stack frames.  Whenever assigning a value to an object, if the new value can't "fit" in the same space as its current value, then the space for the old value needs to be released and the space for the new value needs to be allocated, in the appropriate region.  Since a "ref var" parameter (or subcomponent thereof) might be assigned a new value, it won't necessarily be known at compile-time what region the object being assigned "lives" in.  This suggests that every value for a large object must identify its region, so that its space can be released back to that region when it is freed, and so that the new value can be allocated in that region.  This same requirement applies even if the object has a null value to begin with.  Hence, we are left with the conclusion that a null value for a "large" type needs to identify its region.

Another issue for "large" objects is that we need to know how to reclaim the entire "subtree" of subcomponents when we release the enclosing object.  In some cases we will know the type at compile-time, but in the case of polymorphic objects, or in cases where the type of the object is a formal type parameter of the enclosing module, we won't necessarily know.  This implies that we may want to have some kind of "type-id" associated with large objects.  This then results in the following representation for large objects:

    A 64-bit pointer to:
         [Region, Type-Id, component1, component2, component3, ...]

where the Type-Id provides enough information to know which of the components are themselves "large" and hence need to be recursively released.  In addition, there would probably be a special null Type-Id, which would allow the "is null" operation to be implemented relatively efficiently by comparing the Type-Id against some kind of standard "Null-Type-Id" value.  Each region would only need a single null value, which pointed back to the region, and had the Null-Type-Id.  Such null objects would not be reclaimed if they were the "old" value of the left-hand side of an assignment, since there is only one such value per region, and it is shared among all objects in that region currently having a null value.

So to summarize, we now see that null is not a single value, but rather each simple type potentially has its own representation for null, and for "large" types that use a level of indirection, there is a separate null value for each region.  So the "is null" operation is not necessarily a simple check for equality with the global null value, but instead would depend on the type, and in particular for "large" objects, would be a check of the Type-Id field of the object to see if it has the Null-Type-Id.

On a final note, when a function returns "large" values, it needs to be "told" in which region to allocate the value(s) being returned.  One simple way to do this is for the "large" output parameter(s) to be initialized with a (large) null value by the caller.  Since such a null value identifies the region, the function can use that information when allocating the space for the returned value.

8 comments:

  1. What value would a "null" double-float have? IEEE 754 completely constrains the bit pattern of +/-Inf. You could use one of the many possible NaN bit patterns, but other numerical libraries may prefer to use those bit patterns to convey meaningful information (that was the original intent). Just curious ;-)

    ReplyDelete
  2. The expectation is that there would be user-definable "null"() and "is_null()" operators, so a particular floating-point type could determine what is the representation of null. It could even be a module parameter, so that different instances of the same "Float" module could use different representations for null.

    ReplyDelete
  3. You know it's just going to be NaN so don't bother thinking harder ;-P

    Seriously, it seems like defining null requires a negotiation between ParaSail code and calls to foreign functions. The negotiation would have to be made separately for each foreign library.

    ReplyDelete
  4. Is there really that much variation in the handling of NaN is libraries? If you had some specific examples I would be interested.

    ReplyDelete
  5. Honestly, I've never spent much time looking at NaN bit patterns. I don't know of a library which uses the NaN payload (the values of the nonspecified bits) to convey information. If you want to experiment, you can use something like Matlab's

    typecast(x,'uint64')

    which reinterprets the bits of the double-precision floating-point value x as a 64-bit unsigned integer.

    The important distinction is between quiet and signalling NaNs, which differ by a bit. This has caused me some trouble when calling noninteractive foreign libraries from interactive programming languages that set the FPU to make all NaNs signal.

    The 2008 standard says: "Quiet NaNs should, by means left to the implementer’s discretion, afford retrospective diagnostic information inherited from invalid or unavailable data and results. To facilitate propagation of diagnostic information contained in NaNs, as much of that information as possible should be preserved in NaN results of operations."

    This suggests you could make all NaNs "null," but pick a specific NaN to be the value you get when you ask for "null." That's what you likely would have done anyway, and there is plenty of precedent for this. Just don't make 0/0=1 like APL did, and we'll be happy ;-P

    ReplyDelete
  6. Is_Null(NaN) and Is_NaN(null) seem like a reasonable pair of promises.

    ReplyDelete
  7. I think that's OK as long as null() always returns quiet NaN. (Computing NaN via 1/0 may signal, depending on whether the "NaNs always signal" bit is set.)

    ReplyDelete
  8. Here is an update on the current representation of "null" for "large" objects:

    We were representing a large "null" value by a reference to an object with a header specifying the appropriate region id and a "null" type-id. In the current implementation, we have now eliminated the extra level of indirection in the representation of "null." Large objects are normally represented by the (virtual) address of their header. We now reserve a big enough part of the virtual address space for "null" values, so we can encode both the fact that the address represents a "null" value, as well as the region id where the contents of the object should be allocated if it is ever assigned a non-null value.

    For example, a large null value might always have the high 32 bits being zero, with the low 32 bits being the region-id. The actual representation for any given ParaSail implementation will depend on what parts of the 64-bit virtual address space would not normally be used for object storage.

    ReplyDelete