Value type companions, encapsulated

source link: http://cr.openjdk.java.net/~jrose/values/encapsulating-val.html

John Rose for Valhalla EG, July 2022 (ver 0.2)

Background

(We will start with background information. The new stuff comes afterward. Impatient readers can find a very quick summary of restrictions at the end.)

Affordances of C.ref

Every class or interface C comes with a companion type, the reference type C.ref derived from C which describes any expression (variable, return value, array element, etc.) whose values are either null or are instances of a concrete class derived from C.

We are not in the habit of distinguishing C.ref from C, but the distinction is there. For example, if we call Object::getClass on a variable of type C.ref we might not get C.class; we might even get a null pointer exception! Put another way, C as a class means a particular class declaration, while C.ref as a type means a variable which can refer to instances of class C or any subclass. Also C.ref can be null, which is of no class at all. One can view the result of Object::getClass as a type rather than a mere class, since the API of Class includes representation of types like int and C.val as well as classes. In any case, the fact that a class can now have two associated types requires a clearer distinction between classes and types.
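The class-versus-type distinction can already be observed in today's Java. The following is a minimal sketch in standard Java (no Valhalla syntax); the class name ClassVersusType and its helper method are made up for illustration. A variable of static type Number may report a different class from Object::getClass, and a null reference reports none at all.

```java
final class ClassVersusType {
    // The parameter has static type Number (a ref-type); the result is the
    // dynamic class of whatever instance the reference points to.
    static Class<?> dynamicClassOf(Number n) {
        return n.getClass();   // throws NullPointerException when n is null
    }
}
```

Calling dynamicClassOf(Integer.valueOf(42)) yields Integer.class rather than Number.class, and calling it with null throws a NullPointerException.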

We are so very used to working with reference types (for short, ref-types) that we sometimes forget all that they do for us in addition to their linkage to specific classes:

  • C.ref gives a starting point for accessing C’s members.
  • C.ref provides abstraction: C or a subtype might not be loaded yet.
  • C.ref provides the standard uninitialized value null.
  • C.ref can link C objects into graphs, even circular ones.
  • C.ref has a known size, one “machine word”, carefully tuned by the JVM.
  • C.ref allows a single large object to be shared from many locations.
  • C.ref with an identity class can centralize access to mutable state.
  • C.ref values uniformly convert to and from general types like Object.
  • C.ref values are polymorphic (for non-final C), with varying Object::getClass values.
  • C.ref is safe for publication if the fields of C are final.

When I store a bunch of C objects into an object array or list, sort it, and then share it with another thread, I am using several of the above properties; if the other thread down-casts the items to C.ref and works on them it relies on those properties.

If I implement C as a doubly-linked list data structure or (alternatively) a value-based class with tree structure, I am using yet more of the above properties of references.

If my C object has a lot of state and I pass out many pointers to it, and perhaps compute and cache interesting values in its mutable fields, I am again relying on the special properties of references, as well as of identity classes (if fields are mutable).

By the way, in the JVM, variables of type C.ref (some of them at least) are associated not with C alone, but with the so-called L-descriptor, spelled LC;. When we talk about C.ref we are usually talking about those L-descriptors in the JVM, as well.

I don’t need to think much about this portfolio of properties as I go about my work. But if they were to somehow fail, I would notice bugs in my code sooner or later.

One of the big consequences of this overall design is that I can write a class C which has full control over its instance states. If it is mutable, I can make its fields private and ensure that mutations occur only under appropriate locking conditions. Or if I declare it as a value-based class, I can ensure that its constructor only allows legitimate instances to be constructed. Under those conditions, I know that every single instance of my class will have been examined and accepted by the class constructor, and/or whatever factory and mutator methods I have created for it. If I did my job right, not even a race condition can create an invalid state in one of my objects.

Any instance state of C which has been reached without being produced from a constructor, factory, mutator, or constant of C can be called non-constructed. Of course, inside a class any state whatever can be constructed, subject to the types of fields and so on. But the author of the class gets to decide which states are legitimate, and the decisions are enforced by access control at the boundaries of the encapsulation.

The author of an encapsulation determines whether the constant C.default is part of the public API or not. Therefore, the value of C.default is non-constructed only if C.val is privatized.

So if I code my class right, using access control to keep bad states away from my clients, my class’s external API will have no non-constructed states.

Reflection and serialization provide additional modes of access to a class’s API. The author of an encapsulation must be given control over these modes of access as well. (This is discussed further below.) If the author of C allows deserialization of C values not otherwise constructible via the public API, those values must be regarded as constructed, not non-constructed, but the API may also be regarded as poorly designed.

Costs of C.ref

In that case why have value types at all, if references are so powerful? The answer is that reference-based abstraction pays for its benefits with particular costs, costs that Java programmers do not always wish to pay:

  • A reference (usually) requires storage for a pointer to the object.
  • A reference (usually) requires storage for a header embedded inside the object.
  • Access to an object’s fields (usually) requires extra cycles to chase the pointer.
  • The GC expends effort administering a singular “home location” for every object.
  • Cache line invalidation near that home location can cause useless memory traffic.
  • A reference must be able to represent null; tightly-packed types like int and long would need to add an extra bit somewhere to cover this.

The major alternative to references, as provided by Valhalla, is flat class instances, where instance fields are laid out immediately in their containers, in place of a pointer which points to them stored elsewhere. Neither alternative is always better than the other, which is why Java has both int and Integer types and their arrays, and why Valhalla will offer a corresponding choice for value classes.
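The existing int versus Integer choice already shows the trade-off in miniature. This sketch uses only standard Java; the class name FlatVersusReference is made up for illustration. An int[] element is stored in place and starts as all-zero bits, while an Integer[] element is a pointer that starts as null.

```java
final class FlatVersusReference {
    // Flat storage: the element lives directly in the array, initially zero.
    static int flatDefault() {
        int[] flat = new int[1];
        return flat[0];
    }

    // Reference storage: the element is a pointer, initially null.
    static Integer referenceDefault() {
        Integer[] refs = new Integer[1];
        return refs[0];
    }
}
```

The flat array has no way to say "no value here", which is the extra bit that a reference pays for.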

Alternative affordances of C.val

Now, instances of a value class can be laid out flat in their containing variables. But they can also be “boxed” in the heap, for classic reference-based access. Therefore, a value class C has not one but two companion types associated with it, not only the reference companion C.ref but also the value companion C.val. Only value classes have value companions, naturally. The companion C.val is called a value type (or val-type for short), by contrast with any reference type, whether Object.ref or C.ref.

The two companion types are closely related and perform some of the same jobs:

  • C.ref and C.val both give a starting point for accessing C’s members.
  • C.ref and C.val can link C instances into acyclic graphs.
  • C.ref and C.val values uniformly convert to and from general types like Object.

For these jobs, it usually doesn’t matter which type companion does the work.

Specifically,

  • An expression of the form myc.method() cares about the class of myc but not which companion type it is. The same point is true (probably) of methods like Class::getMethods which ignore the distinction between the mirrors C.ref.class and C.val.class.
  • I can build a tree of C nodes using children lists of either companion type. (If however my C node contains direct child fields they cannot be of the C.val type.)
  • Converting a variable myc to Object (or, respectively, casting an Object to store in myc), does the same kind of thing regardless of which companion type myc has. The only difference is that null cannot be a result if myc is C.val (or, respectively, that null is rejected as a C.val value).

Despite the similarities, many properties of a value companion type are subtly different from any reference type:

  • C.val is non-abstract: You must load its class file before making a variable.
  • C.val cannot nest except by reference; C cannot declare a C.val field.
  • C.val does not represent the value null.
  • C.val is routinely flattenable, avoiding headers and indirection pointers.
  • C.val has configurable size, depending on C’s non-static fields.
  • C.val heap variables (fields, array elements) are initialized to all-zeroes.
  • C.val might not be safe for publication (even though its fields are final).

The overall effect is that a C.val variable has a very specific concrete format, a flattened set of application-defined fields, often without added overhead from object headers and pointer chasing.

The JVM distinguishes C.val by giving it a different descriptor, a so-called Q-descriptor of the form QC;, and it also provides a so-called secondary mirror C.val.class which is similar to the built-in primitive mirrors like int.class.

As the Valhalla performance model notes, flattening may be expected but is not fully guaranteed. A C.val stored in an Object container is likely to be boxed on the heap, for example. But C.val instances created as bytecode temporaries, arguments, and return values are likely to be flattened into machine registers, and C.val fields and array elements (at least below certain size thresholds) are also likely to be flattened into heap words.

As a special feature, C.ref is potentially flattenable if C is a value class. There are additional terms and conditions for flattening C.ref, however. If C is not yet loaded, nothing can be done: Remember that reference types have full abstraction as one of their powers, and this means building data structures that can refer to them even before they are loaded. But a class file can request that the JVM “peek” at a class to see if it is a value class.

This request is conveyed via the Preload attribute defined in recent drafts of JEP 8277163 (Value Objects). If this request is acted on early enough (at the JVM’s discretion), then the JVM can choose to lay out some or all C.ref values as flattened C.val values plus a boolean or other sentinel value which indicates the null state.

If the JVM succeeds in flattening a C.ref variable, the JMM still requires that racing reads to such a variable will always return a consistent, safely published state. The atomicity or non-atomicity of the C.val companion type has no effect on the races possible to a C.ref variable. Thus, flattening a C.ref variable with a non-atomic value type is not simply a matter of adding a null channel field to a struct, if races are possible on that variable. Most machines today provide hardware atomicity only to 128 bits, so racing updates must probably be accomplished within the limits of 64- or 128-bit reads and writes, for a flattened C.ref. It seems likely that the heap buffering enjoyed by today’s value-based classes will also be the technique of choice in the future, at least for larger value classes, when their containers are in the heap. Since JVM stack and locals can never race, adjoining a null state for a C.ref value can be a simple matter of allocating another calling sequence register or stack slot, for an argument or return value.

Pitfalls of C.val

The advantages of value companion types imply some complementary disadvantages. Hopefully they are rarely significant, but they must sometimes be confronted.

  • C.val might need to load a class file which is somehow unloadable.
  • C.val will fail to load if its instance layout directly or indirectly includes a C.val field or subfield.
  • C.val will throw an exception if you try to assign a null to it.
  • C.val may have surprising costs for multi-word footprint and assignment (and so might C.ref if that is flattened).
  • C.val is initialized to its all-zero value, which might be non-constructed.
  • C.val might allow data races on its components, creating values which are non-constructed.

The footprint issue shows up most strongly if you have many copies of the same C.val value; each copy will duplicate all the fields, as opposed to many copies of the same C.ref reference, which are likely to all point to a single heap location with one copy of all the fields.

Flat value size can also affect methods like Arrays.sort, which perform many assignments of the base type, and must move all fields on each assignment. If a C.val array has many words per element, then the costs of moving those words around may dominate a sort request. For array sorting there are ways to reduce such costs transparently, but it is still a “law of physics” that editing a whole data structure will have costs proportional to the size of the edited portions of the data structure, and C.ref arrays will often be somewhat more compact than C.val arrays. Programmers and library authors will have to use their heads when deciding between the new alternatives given by value classes.
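One way to picture the cost trade-off is to sort a permutation of small indices instead of moving the wide elements themselves. The sketch below is in standard Java and is only an illustration of that general idea, not a description of what Arrays.sort or the JVM would actually do for C.val arrays; the class name IndirectSort and its method are hypothetical.

```java
import java.util.Arrays;

final class IndirectSort {
    // Returns the positions of 'keys' in ascending key order. The (possibly
    // wide) elements stay where they are; only one-word indices are moved.
    static Integer[] sortedOrder(long[] keys) {
        Integer[] order = new Integer[keys.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Long.compare(keys[a], keys[b]));
        return order;
    }
}
```

For keys {30, 10, 20} the resulting order is positions 1, 2, 0. A final pass can then move each wide element once, rather than once per comparison-driven swap.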

But the last two pitfalls are hardest to deal with, because they both have to do with non-constructed states. These states are the all-zero state with the second-to-last pitfall, and (with the last pitfall) the state obtained by mixing two previous states by means of a pair of racing writes to the same mutable C.val variable in the heap. Unlike reference types, value types can be manipulated to create these non-constructed states even in well-designed classes.
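The racing-writes case can be replayed deterministically. The following sketch, in standard Java, models a flattened two-field slot as a small mutable object and steps through one unlucky interleaving of two writers by hand. The names TearingSketch and FlatSlot, and the invariant lo == hi, are assumptions made for illustration; a real race would of course not be this predictable.

```java
final class TearingSketch {
    // Stand-in for one flattened C.val slot in the heap with two fields.
    // The imagined class invariant is lo == hi.
    static final class FlatSlot {
        long lo;
        long hi;
    }

    // Replays one interleaving of writer A storing (1, 1) and writer B
    // storing (2, 2), each writing field by field without atomicity.
    static long[] replayInterleaving() {
        FlatSlot slot = new FlatSlot();
        slot.lo = 1;   // A writes its first field
        slot.lo = 2;   // B writes its first field
        slot.hi = 2;   // B writes its second field
        slot.hi = 1;   // A writes its second field
        return new long[] { slot.lo, slot.hi };
    }
}
```

The slot ends up holding (2, 1), a mix of the two writes that neither writer constructed and that violates the imagined invariant.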

Now, it may be that a public constructor (or factory) might be perfectly able to create a zero state or an arbitrary field combination, no strings attached. In that case, the class author is enforcing few or no invariants on the states of the value class. Many numeric classes, like complex numbers, are like this: Initialization to all-zeroes is no problem, and races between components are acceptable, compared to the costs of excluding races. The worst a race condition can ever do is create a state that is legitimately constructed via the class API. We can say that a class which is this permissive has no non-constructed states at all.

Such a class will sometimes choose to permit races in order to get faster loads and stores from fields and arrays. A similar choice is made by today’s C ABIs: When they define 128-bit complex numbers, they do not mandate 128-bit atomic loads and stores for them, even if the platform supports such stores. This allows compilers a more flexible choice between a pair of 64-bit memory operations (both of which are probably atomic but not mutually coherent) or a single 128-bit memory operation (which may or may not be atomic). The JVM is likely to prefer pairs of 64-bit operations for such accesses, if the class permits non-atomicity. If the class requires atomicity, the JVM is likely to use an extra layer of heap buffering, or else a somewhat slower (but properly atomic) 128-bit load or store on Intel or ARM.

(The reader may recall that early JVMs accepted races on the high and low halves of 64-bit integers as well; this is no longer a widespread issue, but bigger value types like complex raise the same issue again, and we need to provide class authors the same solution, if it fits their class.)

There are also some classes for which there are no good defaults, or for which a good default is definitely not the all-zero bit pattern. Authors of such types will often wish to make that bit pattern inaccessible to their clients and provide some factory or constant that gives the real default. We expect that such types will choose the C.ref companion, and rely on the extra null checks to ensure correct initialization.

Other classes may need to avoid other non-constructed values that may arise from data races, perhaps for reasons of reliability or security. This is a subtle trade-off; very few class authors begin by asking themselves about the consequences of data races on mutable members, and even fewer will ask about races on whole instances of value types, especially given that fields in value types are always immutable. For this reason, we will set safety as the default, so that a class (like complex numbers) which is willing to tolerate data races must declare its tolerance explicitly. Only then will the JVM drop the internal costs of race exclusion.

Whether to tolerate the all-zero bit pattern is a simpler decision. Still, it turns out to be useful to give a common single point of declarative control to handle all non-constructed states, both the default value of C.val and its mysterious data races.

So different encapsulation authors will want to make different choices. We will give them the means to make these choices. And (spoiler alert) we will make the safest choice be the default choice.

Privatization to the rescue

(Here are the important details about the encapsulation of value types. The impatient reader may enjoy the very quick summary of restrictions at the end of this document.)

In order to hide non-constructed states, the value companion C.val may be privatized by the author of the class C. A privatized value companion is effectively withdrawn from clients and kept private to its own class (and to nestmates). Inside the class, the value companion can be used freely, fully under control of the class author.

But untrusted clients are prevented from building uninitialized fields or arrays of type C.val. This prevents such clients from creating (either accidentally or purposefully) non-constructed states of type C.val. How privatization is declared and enforced is discussed in the rest of this document.

(To review, for those who skipped ahead, non-constructed states are those not created under control of the class C by constructors or other accessible API points. A non-constructed state may be either an uninitialized variable of C.val, or the result of a data race on a shared mutable variable of type C.val. The class itself can work internally with such values all day long, but we exclude external access to them by default.)

Atomicity as well

As a second tactic, a value class C may select whether or not the JVM enforces atomicity of all occurrences of its value companion C.val. A non-atomic value companion is subject to data races, and if it is not privatized, external code may misuse C.val variables (in arrays or mutable fields) to create non-constructed states via data races.

A value companion which is atomic is not subject to data races. This will be the default if the class C does not explicitly request non-atomicity. This gives safety by default and limits non-constructed states to only the all-zero initial value. The techniques to support this are similar to the techniques for implementing non-tearing of variables which are declared volatile; it is as if every variable of an atomic value type has some (not all) of the costs of volatility.

The JVM is likely to flatten such an atomic value only up to the largest available atomically settable memory unit, usually 128 bits. Values larger than that are likely to be boxed, or perhaps treated with some other expensive transactional technique. Containers that are immutable can still be fully flattened, since they are not subject to data races.

The behavior of an atomic C.val is aligned with that of C.ref. A reference to a value class C never admits data races on C’s fields. The reason for this is simple: A C.ref value is a C.val instance boxed on the heap in a single immutable box-class field of type C.val. (Actually, the JVM may partially or wholly flatten the representation of C.ref if it can get away with it; full flattening is likely for JVM locals and stack values, but any such secret flattening is undetectable by the user.) Since it is final all the way down (to C’s fields) any C.ref value is safely published without any possibility of data races. Therefore, an extra declaration of non-atomicity in C affects only the value companion C.val.
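The boxed representation can be imitated in today's Java with an immutable class whose fields are all final. This is a minimal sketch under that assumption; BoxedPublication and Range are hypothetical names, and the invariant lo <= hi is chosen only for illustration. Because the shared variable holds a single reference, one write replaces the whole value at once.

```java
final class BoxedPublication {
    // Immutable box: final fields, validated once by the constructor.
    static final class Range {
        final long lo;
        final long hi;

        Range(long lo, long hi) {
            if (lo > hi) {
                throw new IllegalArgumentException("lo > hi");
            }
            this.lo = lo;
            this.hi = hi;
        }
    }

    // The shared variable stores only a reference. Java reference reads and
    // writes are atomic, so a reader sees an old box or a new box, never a mix.
    private Range current = new Range(0, 0);

    void store(Range r) {
        current = r;
    }

    Range load() {
        return current;
    }
}
```

A racing reader may observe a stale Range, but every Range it can observe passed the constructor's check, which is the property the text attributes to C.ref.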

It seems that there are use cases which justify all four combinations of both choices (privatization and declared non-atomicity), although it is natural to try to boil down the size of the matrix.

  • C.val private & atomic is the default and safest configuration, hiding the most non-constructed states outside of C and all data races even inside of C. There are some runtime costs.

  • C.val public & non-atomic is the opposite, with fewer runtime costs. It must be explicitly declared. It is desirable for numerics like complex numbers, where all possible bitwise states are meaningful. It is analogous to the situation of a naturally non-atomic primitive like long.

  • C.val public & atomic allows everybody to see the all-zero initial value but no racing non-constructed states. This is analogous to the situation of a naturally atomic primitive like int.

  • C.val private & non-atomic allows C complete access to and control over non-constructed states, but C also has the ability to work internally on arrays of non-atomic elements. C should take care not to leak internally-created flat arrays to untrusted clients, lest they use data races to hammer non-constructed values into those arrays.

It is logically possible, but there does not seem to be a need, for allowing a single class C to work with both kinds of arrays, atomic and non-atomic. (In principle, the dynamic typing of Java arrays would support this, as long as each array was configured at its creation.) The effect of this can be simulated by wrapping a non-atomic class C in another wrapper class WC which is atomic. Then C.val[] arrays are non-atomic and WC.val[] arrays are atomic, yet each kind of array can have the same “payload”, a repeated sequence of the fields of C.

Privatization in code

For source code and bytecode, privatization is enforced by performing access checks on names.

Privatization rules in the language

We will stipulate that a value class C always has a value companion type C.val, even if it is never declared or used. And we give the author of C some control over how clients may use the type C.val, in a manner roughly similar to nested member classes like C.M.

Specifically, the declaration of C always selects an access mode for its value companion C.val from one of the following three choices:

  • C.val is declared private
  • C.val is declared public
  • C.val is declared, but neither public nor private

If C.val is declared private, then only nestmates of C may access C.val. If it is neither public nor private, only classes in the same runtime package as C may access it. If it is declared public, then any class that can access C may also access C.val.

As an independent choice, the declaration of C may select an atomicity for its value companion C.val from one of the following two choices:

  • C.val is explicitly declared non-atomic
  • C.val is not explicitly declared non-atomic, and is thus atomic

If there is no explicit access declaration for C.val in the code of C, then C.val is declared private and atomic. That is, we set the default to the safest and most restrictive choice.

In source code, these declarations are applied to explicit occurrences of the type name C.val. The access modification of C.val is also transferred to the implicitly declared name C.default.

The syntax looks like this:

class C {
  //only one of the following lines may be specified
  //the first line is the default
  private value companion C.val;  //nestmates only
  value companion C.val;          //package-mates only
  public value companion C.val;   //all may access
  // the non-atomic modifier may be present:
  private non-atomic value companion C.val;
  public non-atomic value companion C.val;
  non-atomic value companion C.val;
}

When a type name C.val or an expression C.default is used by a class X, there are two access checks that occur. First, access from X to the class C is checked according to the usual rules of Java. If access to C is permitted, a second check is done if the companion is not declared public. If the companion is declared private, then X and C must be nestmates, or else access will fail. If the companion is neither public nor private, then X and C must be in the same package, or else access will fail.
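The two-step check can be written down as a small decision function. This is a sketch in standard Java that models the rule as stated above, not compiler or JVM code; the names CompanionAccessSketch, CompanionAccess, and mayUseCompanion are hypothetical, and the boolean parameters stand in for facts the compiler would determine itself.

```java
final class CompanionAccessSketch {
    // The three access modes a value companion can be declared with.
    enum CompanionAccess { PRIVATE, PACKAGE, PUBLIC }

    // Decides whether class X may mention C.val or C.default.
    static boolean mayUseCompanion(boolean classAccessible,
                                   CompanionAccess access,
                                   boolean nestmates,
                                   boolean samePackage) {
        if (!classAccessible) {
            return false;              // first check: ordinary access to C
        }
        switch (access) {
            case PUBLIC:
                return true;           // no second check for a public companion
            case PRIVATE:
                return nestmates;      // X and C must be nestmates
            default:
                return samePackage;    // neither public nor private
        }
    }
}
```

For example, a client in another package that can see C is still refused when the companion is PRIVATE or PACKAGE, and is admitted only when it is PUBLIC.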

Example privatized value companion

Here is an example of a class which refuses to construct its default value, and which prevents clients from seeing that state:

class C {
  int neverzero;
  public C(int x) {
    if (x == 0)  throw new IllegalArgumentException();
    neverzero = x;
  }
  public void print() { System.out.println(this); }

  private value companion C.val;  //privatized (also the default)

  // some valid uses of C.val follow:
  public C.val[] flatArray() { return new C.val[]{ this }; }
  private static C.ref nonConstructedZero() {
    return (new C.val[1])[0];  //OK:  C.val private but available
  }
  public static C.ref box(C.val val) { return val; }  //OK param type
  public C.val unbox() { return this; }  //OK return type

  // valid use of private C.default, with Lookup negotiation
  public static
  C.ref defaultValue(java.lang.invoke.MethodHandles.Lookup lookup) {
    if (!lookup.in(C.class).hasFullPrivilegeAccess())
      return null;     //…or throw
    return C.default;  //OK: default for me and maybe also for thee
  }
}

// non-nestmate client:
class D {
  static void passByValue(C x) {
    C.ref ref = C.box(x);   //OK, although x is null-checked
    if (false)  C.box((C.ref) null);  //would throw NPE
    assert ref == x;
  }

  static Object useValue(C x) {
    x.unbox().print();   //OK, invoke method on C.val expression
    var xv = x.unbox();  //OK, although C.val is non-denotable
    xv.print();          //OK
    //> C.val xv = x.unbox();  //ERROR: C.val is private
    return xv;  //OK, originally from legitimate method of C
  }

  static Object arrays(C x) {
    var a = x.flatArray();
    //> C.val[] va = a;  //ERROR: C.val is private
    Arrays.toString(a);  //OK
    C.ref[] a2 = a;      //covariant array assignment
    C.ref[] na = new C.ref[1];
    //> na = new C.val[1];  //ERROR: C.val is private
    return a[0];  //constructed values only
  }
}

The above code shows how a privatized value companion can and cannot be used. The type name may never be mentioned. Apart from that restriction, client code can work with the value companion type as it appears in parameters, return values, local variables, and array elements. In this, a privatized companion behaves like other non-denotable types in Java.
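For comparison, here is how an existing non-denotable type behaves in standard Java (version 10 or later, for var). The sketch uses an anonymous class, whose type cannot be written by name; the class name NonDenotableDemo is made up for illustration. The type can still flow through a local variable and have its members used.

```java
final class NonDenotableDemo {
    static int sumOfAnonymousFields() {
        // The anonymous class type has no name that source code can mention,
        // yet 'var' infers it and its fields remain usable.
        var point = new Object() {
            final int x = 3;
            final int y = 4;
        };
        return point.x + point.y;
    }
}
```

The method returns 7, even though no declaration in the program could spell the type of point.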

Rationale: Note that a companion type is not a real class. Therefore it cannot appeal, precisely, to the existing provisions (in JLS or JVMS) for enforcing class accessibility. But because it is a type, and today nearly all types are classes (and interfaces), users have a right to expect that encapsulation of companion types will “feel like” encapsulation of type names. More precisely, users will hope to re-use their knowledge about how type name access works when reasoning about companion types. We aim to accommodate that hope. If it works, users won’t have to think very often about the class-vs-type distinction. That is also why the above design emulates pre-existing usage patterns for non-denotable types.

Privatization in translation

When a value class is compiled to a class file, some metadata is included to record the explicit declaration or implicit status of the value companion.

The access selection of C’s value companion (public, package, private) is encoded in the value_flags field of the ValueClass attribute of the class information in the class file of C.

The value_flags field (16 bits) has the following legitimate values:

  • zero: C.val default access, non-atomic
  • ACC_PUBLIC: C.val public access, non-atomic
  • ACC_PRIVATE: C.val private access, non-atomic
  • ACC_FINAL: C.val default access, atomic
  • ACC_FINAL|ACC_PUBLIC: C.val public access, atomic
  • ACC_FINAL|ACC_PRIVATE: C.val private access, atomic

Other values are rejected when the class file is loaded.
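The six legitimate encodings can be checked with a small decoder. This sketch in standard Java borrows the numeric values of the existing access flags from java.lang.reflect.Modifier; the class ValueFlagsSketch and the method describe are hypothetical, and the sketch signals a bad value with IllegalArgumentException as a stand-in for whatever error the JVM would raise at class load time.

```java
import java.lang.reflect.Modifier;

final class ValueFlagsSketch {
    static final int ACC_PUBLIC = Modifier.PUBLIC;    // 0x0001
    static final int ACC_PRIVATE = Modifier.PRIVATE;  // 0x0002
    static final int ACC_FINAL = Modifier.FINAL;      // 0x0010

    // Decodes a value_flags field, rejecting anything outside the six
    // legitimate values listed above.
    static String describe(int valueFlags) {
        int access = valueFlags & ~ACC_FINAL;
        String mode;
        if (access == 0) {
            mode = "default access";
        } else if (access == ACC_PUBLIC) {
            mode = "public access";
        } else if (access == ACC_PRIVATE) {
            mode = "private access";
        } else {
            throw new IllegalArgumentException(
                "illegal value_flags: 0x" + Integer.toHexString(valueFlags));
        }
        boolean atomic = (valueFlags & ACC_FINAL) != 0;
        return "C.val " + mode + (atomic ? ", atomic" : ", non-atomic");
    }
}
```

Combinations such as ACC_PUBLIC|ACC_PRIVATE, or any bit outside these three, fall through to the rejection branch.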

The choice of ACC_FINAL for this job is arbitrary. It basically means “please ensure safe publication of final fields of this class, even for fields inside flattened instances.” The race conditions of a non-atomic variable of type C.val are about the same as (are isomorphic to) the race conditions for the states reachable from a non-varying non-null variable of type MC.ref, where MC is a hypothetical identity class containing the same instance fields as C, but whose fields are not declared final. (Remember that C, being a value class, must have declared its fields final.) Omitting ACC_FINAL above means about the same as using the non-final fields of MC to store C.val states. Omitting ACC_FINAL is less safe for programmers, but much easier to implement in the JVM, since it can just peek and poke the fields retail, instead of updating the whole instance value in a wholesale transaction.

That is, if you see what I mean… ACC_VOLATILE would be another clever pun along the same lines, since a volatile variable of type long is one which suppresses tearing race conditions. But volatile means additional things as well. Other puns could be attempted with ACC_STATIC, ACC_STRICT, ACC_NATIVE, and more. John likes ACC_FINAL because of the JMM connection to final fields.

(JVM ISSUE #0: Can we kill the ACC_VALUE modifier bit? Do we really care that jlr.Modifiers kind-of wants to own the reflection of the contextual modifier value? Who are the customers of this modifier bit, as a bit? John doesn’t care about it personally, and thinks that if we are going to have an attribute we can get rid of the flag bit. One implementation issue with killing ACC_VALUE is that class attributes are processed very late during class loading, while class access-flags are processed very early. It may be easier to do some kinds of structural checks on the fly during class loading even before class attributes are processed. Yet this also seems like a poor reason to use a modifier bit.)

Perhaps some kind of “poetic justice” would be attained by replacing the outgoing and redundant ACC_SUPER bit with an incoming and largely-redundant ACC_IDENTITY bit in the same position in the access_flags item. That would allow everything else to go into a class attribute at the bottom of the class file, as suggested, and would be neutral in pressure on access flag bit positions.

(JVM ISSUE #1: What if the attribute is missing; do we reject the class file or do we infer value_flags=ACC_PRIVATE|ACC_FINAL? Let’s just reject the file.)

(JVM ISSUE #2: Is this ValueClass attribute really a good place to store the “atomic” bit as well? This attribute is a green-field for VM design, as opposed to the brown-field of modifier bits. The above language assumes the atomic bit belongs in there as well.)

A use of a value companion C.val, in any source file, is generally translated to a use of a Q-descriptor QC;:

  • a field declaration of C.val translates to a field-info with a Q-descriptor
  • a method or constructor declaration that mentions C.val mentions a corresponding Q-descriptor in its method descriptor
  • a use of a field resolves a CONSTANT_Fieldref with a Q-descriptor component
  • a use of a method or constructor uses a CONSTANT_Methodref (or CONSTANT_InterfaceMethodref) with a Q-descriptor component
  • a CONSTANT_Class entry may contain a Q-descriptor or an array type whose element type is a Q-descriptor
  • a verifier type record may refer to CONSTANT_Class which contains a Q-descriptor
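To make the spelling of these descriptors concrete, the sketch below builds descriptor strings from a binary class name in standard Java. The L form follows the existing class-file convention; the Q form follows the proposal described in this document and is not accepted by released JVMs. The class DescriptorSketch and its methods are hypothetical helpers.

```java
final class DescriptorSketch {
    // Builds "L<internal-name>;" for the reference companion or
    // "Q<internal-name>;" for the value companion.
    static String fieldDescriptor(String binaryName, boolean valueCompanion) {
        String internalName = binaryName.replace('.', '/');
        return (valueCompanion ? "Q" : "L") + internalName + ";";
    }

    // An array descriptor prefixes its element descriptor with '['.
    static String arrayDescriptor(String elementDescriptor) {
        return "[" + elementDescriptor;
    }
}
```

With these helpers, a field of type C.val would be described as QC; and an array of C.val as [QC;, while C.ref keeps the familiar LC; form.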

Privatization is enforced for these uses only as much as is needed to ensure that classes cannot create uninitialized values, fields, and arrays.

If an access from bytecode to a privatized Q-descriptor fails, an exception is thrown; its type is IllegalAccessError, a subtype of IncompatibleClassChangeError. Generally speaking such an exception diagnoses an attempt by bytecode to make an access that would have been prevented by the static compiler, if the Java source program had been compiled together as a whole.

When a field of Q-descriptor type is declared in a class file, the descriptor is resolved early, before the class is linked, and that resolution includes an access check which will fail unless the class being loaded has access to C.val, as determined by loading C and inspecting its ValueClass attribute. These checks prevent untrusted clients of C from creating non-constructed zero values in any of their fields.

The timing of these checks, on fields, is aligned with the internal logic of the JVM which consults the class file of C to answer other related questions about field types: (a) whether C is in fact a value class, and (b) what is the layout of C.val, in case the JVM wishes to flatten the value in a containing field. A third check, (c) whether the C.val companion is accessible, happens at the same time. This is early during class loading for non-static fields, and during class preparation for static fields.

Privatization is not enforced for non-field Q-descriptors, that occur in method and constructor signatures, and in state descriptions for the verifier. This is because mere use of Q-descriptors to describe pre-existing values cannot (by itself) expose non-constructed values, when those values are on stack or in locals.

This can happen invisibly at the source-code level as well. An API might be designed to return values of a privatized type from its methods or fields, and/or accept values of a privatized type into its methods, constructors, or fields. In general, the bytecode for a client of such an API will work with a mix of Q-descriptor and L-descriptor values.

The verifier’s type system uses field descriptor types, and thus can “see” both Q-descriptors and L-descriptors. Clients of a class with a privatized companion are likely to work mostly with L-descriptor values but may also have Q-descriptor values in locals and on stack.

When feeding an L-descriptor value to an API point that accepts a Q-descriptor, the verifier needs help to keep the types straight. In such cases, the bytecode compiler issues checkcast instructions to adjust types to keep the verifier happy, and in this case the operand of the checkcast would be of the form CONSTANT_Class["QC;"].

(JVM ISSUE #3: The Q/L distinction in the verifier helps the interpreter avoid extra dynamic null checks around putfield, putstatic, and the invoke instructions. This distinction requires an explicit bytecode to fix up Q/L mismatches; the checkcast bytecode serves this purpose. That means checkcast requires the ability to work with privatized types. It requires us to make the dynamic permission check when other bytecodes try to use the privatized type. All this seems acceptable, but we could try to make a different design in which CONSTANT_Class resolution fails immediately if it contains an inaccessible Q-descriptor. That design might require a new bytecode which does what checkcast does today on a Q-descriptor.)

Meanwhile, arrays are rich sources of non-constructed zero values. They appear in bytecode as follows:

  • A C.val[] array construction uses anewarray with a CONSTANT_Class type for the Q-descriptor; this is new to Valhalla.
  • Such an array construction may also use multianewarray with an appropriate array type.
  • An array element is read from heap to stack by aaload; the verifier type of the stacked value is copied from the verifier type of the array itself.
  • An array element is written from stack to heap by aastore; the verifier type of the stored value is merely constrained to the type Object.

Note that there are no static type annotations on array access instructions. The practical impact of this is that, if an array of a privatized type C.val is passed outside of C, then any values in that array become accessible outside of C. Moreover, if C.val is non-atomic, clients may be able to inflict data races on the array.

Thus, the best point of control over misuse of arrays is their creation, not their access. Array creation is controlled by CONSTANT_Class constant pool entries and their access checking. When an anewarray or multianewarray tries to create an array, the CONSTANT_Class constant pool entry it uses must be consulted to see if the element type is privatized and inaccessible to the current class, and IllegalAccessError thrown if that is the case.

All this leads to special rules for resolving an entry of the form CONSTANT_Class["QC;"]. When resolving such a constant, the class file for C is loaded, and C is access checked against the current class. (This is just what happens when CONSTANT_Class["C"] gets resolved.) Next, the ValueClass attribute for C is examined; it must exist, and if it indicates privatization of C.val, then access is checked for C.val against the current class.

If that access to a privatized companion would fail, no exception is thrown, but the constant pool entry is resolved into a special restricted state. Thus, a resolved constant pool entry of the form CONSTANT_Class["QC;"] can have the following states:

  • Error, because C is inaccessible or doesn’t exist or is not a value class.
  • Full resolution, so C.val is ready for general use in the current class.
  • Restricted resolution, so C.val is ready for restricted use in the current class.

That last state happens when C is accessible but C.val is not.
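The resolution rule above can be summarized as a small decision function. This is a sketch only: the class `QClassResolution`, its `State` enum, and the boolean inputs are hypothetical stand-ins for what a JVM would learn by loading C and reading its ValueClass attribute, not a real JVM data structure.

```java
/** Hypothetical model of resolving CONSTANT_Class["QC;"] against the current class. */
final class QClassResolution {
    enum State { ERROR, FULL, RESTRICTED }

    static State resolve(boolean classAccessible,
                         boolean isValueClass,
                         boolean companionPrivatized,
                         boolean companionAccessible) {
        // C must exist, be accessible, and be a value class; otherwise resolution fails.
        if (!classAccessible || !isValueClass) {
            return State.ERROR;
        }
        // A privatized companion that the current class cannot access does not
        // raise an error; the entry is only marked as restricted.
        if (companionPrivatized && !companionAccessible) {
            return State.RESTRICTED;
        }
        return State.FULL;
    }

    public static void main(String[] args) {
        System.out.println(resolve(true, true, true, false));  // RESTRICTED
        System.out.println(resolve(true, true, false, false)); // FULL
        System.out.println(resolve(false, true, true, true));  // ERROR
    }
}
```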

Likewise, a constant pool entry of the form CONSTANT_Class["[QC;"] (or a similar form with more leading array brackets) can have three states, error, full resolution, and restricted resolution.

Pre-Valhalla CONSTANT_Class entries which do not mention Q-descriptors have only two resolved states, error and full resolution.

As required above, the checkcast bytecode treats full resolution and restricted resolution states the same.

But when the anewarray or multianewarray instruction is executed, it must throw an access error if its CONSTANT_Class is not fully resolved (either it is an error or is restricted). This is how the JVM prevents creation of arrays whose component type is an inaccessible value companion type, even if the class file does not correspond to correct Java source code.

Here are all the classfile constructs that could refer to a CONSTANT_Class constant in the restricted state, and whether they respect it (throwing IllegalAccessError):

  • checkcast ignores the restriction and proceeds
  • instanceof ignores the restriction (consistent with checkcast)
  • anewarray and multianewarray respect the restriction and throw
  • ldc throws (consistent with C.val.class in source code)
  • bootstrap arguments throw (consistent with ldc)
  • verifier types ignore the restriction and continue checking
  • (FIXME: There must be more than this.)
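The list above amounts to a lookup table from "kind of use" to "does it respect the restricted state". A minimal sketch of that table follows. The names `RestrictedUses` and `Use` are hypothetical, and the table covers only the entries listed above (the original list is itself marked as incomplete).

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical table: which uses of a restricted CONSTANT_Class throw IllegalAccessError. */
final class RestrictedUses {
    enum Use { CHECKCAST, INSTANCEOF, ANEWARRAY, MULTIANEWARRAY, LDC, BOOTSTRAP_ARGUMENT, VERIFIER_TYPE }

    // Uses that respect the restriction; every other listed use ignores it.
    private static final Set<Use> RESPECTS_RESTRICTION =
            EnumSet.of(Use.ANEWARRAY, Use.MULTIANEWARRAY, Use.LDC, Use.BOOTSTRAP_ARGUMENT);

    /** True if this use would throw when the constant is in the given state. */
    static boolean throwsIllegalAccessError(Use use, boolean restricted) {
        return restricted && RESPECTS_RESTRICTION.contains(use);
    }

    public static void main(String[] args) {
        for (Use use : Use.values()) {
            System.out.println(use + " on restricted constant throws: "
                    + throwsIllegalAccessError(use, true));
        }
    }
}
```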

Q-descriptors not in CONSTANT_Class constants are naturally immune to privatization restrictions. In particular, CONSTANT_MethodType constants can successfully refer to mirrors of privatized companions.

Uses of CONSTANT_Class constants which forbid Q-descriptors and their arrays are also naturally immune, since they will never encounter a constant resolved in the restricted state. These include new, aconst_init, the class sub-operands of CONSTANT_Methodref and its friends, exception catch-types, and various attributes like NestHost and InnerClasses: All of the above are allowed to refer only to proper classes, and not to their value companions or arrays.

Nevertheless, an aconst_init bytecode must throw an access error when applied to a class with an inaccessible privatized value companion. This is worth noting because the constant pool entry for aconst_init does not mention a Q-descriptor, unlike the array construction bytecodes.

Perhaps regular class constants of the form CONSTANT_Class["C"] would also benefit slightly from a restricted state, which would be significant only to the aconst_init bytecode, and ignored by all the above “naturally immune” usages. If a JVM implementation takes this option, the same access check would be performed and recorded for both CONSTANT_Class["C"] and CONSTANT_Class["QC;"], but would be respected only by aconst_init (for the former) and anewarray and the other cases noted above (for the latter but not the former). On the other hand, the particular issue would become moot if aconst_init, like withfield, were restricted to the nest of its class, because then privatization would not matter.

The net effect of these rules, so far, is that neither source code nor class files can directly make uninitialized variables of type C.val, if the code or class file was not granted access to C.val via C. Specifically, fields of type C.val cannot be declared nor can arrays of type C.val[] be constructed.

This includes class files as correctly derived from valid source code or as “spun” by dodgy compilers or even as derived validly from old source code that has changed (and revoked some access).

Remember that new nestmates can be injected at runtime via the Lookup API, which checks access and then loads new code that enjoys the same access. The level of access depends in detail on the selection of ClassOption.NESTMATE (for nestmate injection) or not (for package-mate injection). The JVM uses common rules for these injected nestmates or package-mates and for normally compiled ones.

There are no restrictions on the use of C.ref, beyond the basic access restrictions imposed by the language and JVM on the name C. Access checks for regular references to classes and interfaces are unchanged throughout all of the above.

There are more holes to be plugged, however. It will turn out that arrays are once again a problem. But first let’s examine how reflection interacts with companion types and access control.

Privatization and APIs

Beyond the language there are libraries that must take account of the privatization of value companions. We start on the shared boundary between language and libraries, with reflection.

Reflecting privatization

Every companion type is reflected by a Java class mirror of type java.lang.Class. A Java class mirror also represents the class underlying the type. The distinction between the concept of class and companion type is relatively uninteresting, except for a value class C, which has two companion types and thus two mirrors.

In Java source code the expression C.class obtains the mirror for both C and its companion C.ref. The expression C.val.class obtains the mirror for the value companion, if C is a value class. Both expressions check access to C as a whole, and C.val.class also checks access to the value companion (if it was privatized).

But it is a generally recognized fact that Java class mirrors are less secure than the Java class types that the mirrors represent. It is easy to write code that obtains a mirror on a class C without directly mentioning the name C in source code. One can use reflective lookup to get such mirrors, and without even trying one may also “stumble upon” mirrors to inaccessible classes and companion types. Here are some simple examples:

Class<?> lookup() throws ClassNotFoundException {
  var name = "java.util.Arrays$ArrayList";
  //or name = "java.lang.AbstractStringBuilder";
  //> java.lang.invoke.MethodHandles.lookup().findClass(name);  //ERROR
  return Class.forName(name);  //OK!
}
Class<?> stumble1() {
  //> return java.util.Arrays.ArrayList.class;  //ERROR
  return java.util.Arrays.asList().getClass();  //OK!
}
Class<?> stumble2() {
  //> return java.lang.AbstractStringBuilder.class;  //ERROR
  return StringBuilder.class.getSuperclass();  //OK!
}
Class<?> stumble3() {
  //> return C.val.class;  //ERROR if C.val is privatized
  return C.ref.class.asValueType();  //OK!
}

Therefore, access checking class names is not and cannot be the whole story for protecting classes and their companion types from reflective misuse. If a mirror is obtained that refers to an inaccessible non-public class or privatized companion, the mirror will “defend itself” against illegal access by checking whether the caller has appropriate permissions. The same goes for method, constructor, and field mirrors derived from the class mirror: You can reflect a method but when you try to call it all of the access checks (including the check against the class) are enforced against you, the caller of the reflective API.
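This self-defense can already be observed in today's Java, independent of value classes. The sketch below (the class `MirrorDefense` is a hypothetical name) reflects the public method `size` through the mirror of the non-public class behind `Arrays.asList`, and then through the public interface `List`. Only the call through the public mirror succeeds.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.List;

/** Hypothetical demo: a mirror of a non-public class can be had, but not freely used. */
final class MirrorDefense {
    /** Reflects size() via the given mirror and reports what happens on invocation. */
    static String invokeSizeVia(Class<?> mirror) {
        List<String> list = Arrays.asList("a", "b");
        try {
            Method size = mirror.getMethod("size"); // reflecting the method is allowed
            return "ok: " + size.invoke(list);      // invoking it is access checked
        } catch (NoSuchMethodException | IllegalAccessException | InvocationTargetException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // The mirror of the private class java.util.Arrays$ArrayList, "stumbled upon".
        System.out.println(invokeSizeVia(Arrays.asList().getClass())); // IllegalAccessException
        // The same method, reached through a public interface mirror.
        System.out.println(invokeSizeVia(List.class));                 // ok: 2
    }
}
```

The method is public in both cases; the first call fails because the access check also covers the declaring class, which the caller cannot access.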

The checking of the caller has two possible shapes. Either a caller sensitive method looks directly at its caller, or the call is delegated through an API that requires negotiation with a MethodHandles.Lookup object that was previously checked against a caller.
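The second shape, negotiation through a Lookup object, can be sketched with the existing java.lang.invoke API. In this hypothetical example (`LookupNegotiation`, `Vault`, `grant`, and `fetch` are illustrative names), a class creates a Lookup in its own code and hands it out; the receiving code gets exactly the access that the Lookup carries, while a public Lookup is refused.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

/** Hypothetical demo of access negotiated through a MethodHandles.Lookup. */
final class LookupNegotiation {
    static final class Vault {
        private static String secret() { return "from the vault"; }

        /** Vault shares its own access rights by creating a Lookup in its own code. */
        static MethodHandles.Lookup grant() { return MethodHandles.lookup(); }
    }

    /** Access is decided by the Lookup passed in, not by the identity of this caller. */
    static String fetch(MethodHandles.Lookup lookup) {
        try {
            return (String) lookup
                    .findStatic(Vault.class, "secret", MethodType.methodType(String.class))
                    .invoke();
        } catch (IllegalAccessException e) {
            return "denied";
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(fetch(Vault.grant()));                // from the vault
        System.out.println(fetch(MethodHandles.publicLookup())); // denied
    }
}
```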

Now, if a class C is accessible but its value companion C.val is privatized, all of C’s public methods and other API points are accessible (via both companion types), and access is restricted only for those very specific operations that could create non-constructed instances (via a variable of companion type C.val). And this boils down to a limitation on array creation. If you cannot use either source code or reflection to create an array of type C.val[], then you cannot create the conditions necessary to build non-constructed instances.

Reflective APIs should be available to report the declared properties of value companions. It is enough to add the following two methods:

  • Class::isNonAtomic is true only of mirrors of value companions which have been declared non-atomic. On some JVM implementations it may additionally be true of long.class and/or double.class.

  • Class::getModifiers, when applied to a mirror of a value companion, will return a modifier bit-mask that reflects the declared access. (This is compatible with the current behavior of HotSpot for primitive mirrors, which appear as if they were somehow declared public, with abstract and final thrown in to boot.)

(Note that most reflective access checking should take care to work with the reference mirror, not the value mirror, as the modifier bits of the two mirrors might differ.)

Privatization and arrays

There are a number of standard API points for creating Java array objects. When they create arrays containing uninitialized elements, then a non-constructed default value can appear. Even when they create properly initialized arrays, if the type is declared non-atomic, then non-constructed states can be created by races.

  • java.lang.reflect.Array::newInstance takes an element mirror and length and builds an array. The elements of the returned array are initialized to the default value of the selected element type.
  • java.util.Arrays::copyOf and copyOfRange can extend the length of an existing array to include new uninitialized elements.
  • A special overloading of java.util.Arrays::copyOf can request a different type of the new array copy.
  • java.util.Collection::toArray (an interface method) may extend the length of an existing array, but does not add uninitialized elements.
  • java.lang.invoke.MethodHandles.arrayConstructor creates a method handle that creates uninitialized arrays of a given type, as if by the anewarray bytecode.
  • The serialization API contains an operator for materializing arrays of arbitrary type from the wire format.
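Several of these API points can be exercised in today's Java to see where default values come from. The sketch below uses only existing JDK methods; since C.val arrays are not available outside Valhalla prototypes, ordinary element types stand in, and their defaults (0 or null) play the role that C.default would play for a flat array. The class name `DefaultsFromArrays` is hypothetical.

```java
import java.lang.invoke.MethodHandles;
import java.lang.reflect.Array;
import java.util.Arrays;

/** Hypothetical demo: standard API points that expose default-valued array elements. */
final class DefaultsFromArrays {
    static int viaNewInstance() {
        int[] fresh = (int[]) Array.newInstance(int.class, 2);
        return fresh[1]; // element was never assigned
    }

    static Integer viaCopyOf() {
        Integer[] grown = Arrays.copyOf(new Integer[] { 7 }, 3);
        return grown[2]; // new trailing element
    }

    static Number viaRetypingCopyOf() {
        // The overloading that requests a different array type for the copy.
        Number[] grown = Arrays.copyOf(new Integer[] { 7 }, 2, Number[].class);
        return grown[1];
    }

    static String viaArrayConstructor() {
        try {
            String[] fresh = (String[]) MethodHandles.arrayConstructor(String[].class).invoke(2);
            return fresh[0];
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(viaNewInstance());      // 0
        System.out.println(viaCopyOf());           // null
        System.out.println(viaRetypingCopyOf());   // null
        System.out.println(viaArrayConstructor()); // null
    }
}
```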

The basic policy for all these API points is to conservatively limit the creation of arrays of type C.val[] if C.val is not public.

  • java.lang.reflect.Array::newInstance will throw IllegalArgumentException if the element type is privatized. (See below for a possible caller-sensitive enhancement.)

  • java.util.Arrays::copyOf and copyOfRange will throw instead of creating uninitialized elements, if the element type is privatized. If only previously existing array elements are copied, there is no check, and this is a common use case (e.g., in ArrayList::toArray).

  • The special overloading of java.util.Arrays::copyOf will refuse to create an array of any non-atomic privatized type. (This refusal protects against non-constructed states arising from data races.) It also incorporates the restrictions of its sibling methods, against creating uninitialized elements (even of an atomic type).

  • java.lang.invoke.MethodHandles.arrayConstructor will refuse to create a factory method handle if the element type is privatized.

  • java.util.Collection::toArray needs implementation review; as it is built on top of the previous API points, it may possibly fail if asked to lengthen an array of privatized type. Note that many implementations of toArray use Arrays.copyOf in a safe manner, which does not create uninitialized elements.

  • java.util.stream.Stream::toArray, the various List::toArray, and other clients of Arrays::copyOf or Array::newInstance need implementation review. Where a generic API is involved, the assumption is often that non-flat reference arrays are being created, and in that case no outage is possible, since reference companion arrays can always be freely created. For specialized generics with flat types, additional implementation work is required, in general, to ensure that flat arrays can be created by parties with the right to do so.

  • The serialization API should restrict its array creation operator. Serialization methods should not attempt to serialize flat arrays either. It is enough to serialize arrays of the reference type.

API ISSUE #1: Should we relax construction rules for zero-length arrays? This would add complexity but might be a friendly move for some use cases. A zero-length array can never expose non-constructed states. It may, however, serve as a misleading “witness” that some code has gained permission to work with flat arrays. It’s safer to disallow even zero-length arrays.

API ISSUE #2: What about public value companions of non-public inaccessible classes? In source code, we do not allow arrays of private classes to be made, or of their public value companions. Should we be more permissive in this case? We could specify that where a value companion has to be checked against a client, its original class gets checked as well; this would exclude some use cases allowed by the above language, which only takes effect if the companion is privatized. An extra check for a public companion seems like busy-work and a source of unnecessary surprises, though. Let’s not.

There are probably legitimate use cases for arrays of privatized types, with which the new restrictions on the above API points would interfere. So as a backup, we will make API adjustments to work with privatized array types, with an extra handshake to perform the access check (via either caller sensitivity or negotiation with an instance of MethodHandles.Lookup).

  • java.lang.reflect.Array::newInstance should probably be made caller sensitive, so it can refrain from throwing if a privatized element type is accessible to the caller. (Alternatively, a new caller-sensitive API point could be made, such as Array::newFlatInstance. But a new API point seems unnecessary in this case, and caller-sensitivity is common practice in this method’s package.) Note that, as is typical of core reflection API points, many uses of newInstance will not benefit from the caller sensitivity.

  • java.util.Arrays::copyOf and copyOfRange may be joined by additional “companion friendly” methods of a similar character which fill new array elements with some other specified fill value, and/or which cyclically replicate the contents of the original array, and/or which call a functional interface to provide missing elements. The details of this are a matter for library designers to decide. Adding caller sensitivity to these API points is probably the wrong move.

  • java.lang.invoke.MethodHandles::arrayConstructor will be joined by a method of the same name on MethodHandles.Lookup which performs a companion check before allowing the array constructor method handle to be returned. It will not check the class, just the companion. Note that the use of caller sensitivity in the Lookup API is concentrated on the factory method Lookup::lookup, which is the starting point for Lookup-based negotiation.
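As a rough idea of what “companion friendly” copy methods might look like, here is a sketch of the three variants mentioned above: fill with a supplied value, generate missing elements from a function, and cyclically replicate the original. The class and method names are hypothetical, and the details are explicitly left to library designers. The sketch runs on reference arrays and still allocates via Arrays.copyOf internally, so it illustrates only the API shape in which the caller never observes an uninitialized element.

```java
import java.util.Arrays;
import java.util.function.IntFunction;

/** Hypothetical sketch of copy methods that never return uninitialized elements. */
final class CompanionFriendlyCopies {
    /** Lengthens (or truncates) the array, filling any new slots with the given value. */
    static <T> T[] copyOfFilled(T[] original, int newLength, T fill) {
        T[] copy = Arrays.copyOf(original, newLength);
        if (newLength > original.length) {
            Arrays.fill(copy, original.length, newLength, fill);
        }
        return copy;
    }

    /** Lengthens the array, asking the generator for each missing element by index. */
    static <T> T[] copyOfGenerated(T[] original, int newLength, IntFunction<? extends T> generator) {
        T[] copy = Arrays.copyOf(original, newLength);
        for (int i = original.length; i < newLength; i++) {
            copy[i] = generator.apply(i);
        }
        return copy;
    }

    /** Lengthens the array by repeating the original contents cyclically. */
    static <T> T[] copyOfCyclic(T[] original, int newLength) {
        if (original.length == 0 && newLength > 0) {
            throw new IllegalArgumentException("nothing to replicate");
        }
        T[] copy = Arrays.copyOf(original, newLength);
        for (int i = original.length; i < newLength; i++) {
            copy[i] = original[i % original.length];
        }
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(copyOfFilled(new String[] { "a" }, 3, "z")));          // [a, z, z]
        System.out.println(Arrays.toString(copyOfGenerated(new Integer[] { 1 }, 3, i -> i * 10))); // [1, 10, 20]
        System.out.println(Arrays.toString(copyOfCyclic(new Integer[] { 1, 2 }, 5)));              // [1, 2, 1, 2, 1]
    }
}
```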

Miscellaneous privatization checks

Besides newly-created or extended arrays, there are a few API points in java.lang.invoke which expose default values of reflectively determined types. Like the array creation methods, they must simply refuse to expose default values of privatized value companions.

  • MethodHandles::zero and MethodHandles::empty will simply refuse to produce a result of a privatized C.val type. Clients with a legitimate need to produce such default values can use MethodHandles::filterReturnValue and/or MethodHandles::constant to create equivalent handles, assuming they already possess the default value.

  • MethodHandles::explicitCastArguments will refuse to convert from a nullable reference to a privatized C.val type. Clients with a legitimate need to convert nulls to privatized values can use conditional combinators to do this “the hard way”.

  • MethodHandle::asType will refuse to convert from a void return to a privatized C.val type, similarly to explicitCastArguments.

  • The method Lookup::accessCompanion will be defined analogously to Lookup::accessClass. If Lookup::accessClass is applied to a companion, it will check both the class and the companion, whereas Lookup::accessCompanion will look only at the possible privatization of the companion. (Thus it can simply refer to Reflection::verifyCompanionType.)

To support reflective checks against array elements which may be privatized companion types, an internal method of the form jdk.internal.reflect.Reflection::verifyCompanionType may be defined. It will pass any reference type (regardless of class accessibility) and for a value companion it will check access of the companion (but not the class itself).

Building companion-safe APIs

The method Lookup::arrayConstructor gives enough of a “hook” to create all kinds of safe but friendly APIs in privileged JDK code. The methods in java.util could make use of this privileged API to quickly adapt their internal code to create arrays in cases they are refused by the existing methods Array.newInstance and Arrays.copyOf.

For example, a checked method MethodHandles.Lookup::defaultValue(C) may be added to provide the default value C.default if its companion C.val is accessible. It will operate as if it first creates a one-element array of the desired type, and then loads the element.

Or, a caller-sensitive method Class::defaultValue or Class::newArray could be added which check the caller and return the requested result. All such methods can be built on top of MethodHandles.Lookup.

In general, a library API may be designed to preserve some aspect of companion safety, as it allows untrusted code to work with arrays of privatized value type, while preventing non-constructed states of that type from being materialized. Each such safe and friendly API has to make a choice about how to prevent clients from creating non-constructed states, or perhaps how to allow clients to gain privilege to do so. Some points are worth remembering:

  • An unprivileged client must not obtain C.default if C.val is privatized.
  • An unprivileged client must not obtain a non-empty C.val[] array if C.val is privatized and non-atomic.
  • It’s safe to build new (non-empty, mutable) arrays from (non-empty, mutable) old arrays, as long as new elements containing C.default do not appear.
  • If a new array is somehow frozen or wrapped so as to be effectively immutable, it is safe as long as it does not expose C.default values.
  • If a value companion is public, there is no need for any restriction.
  • Also, unrestricted use can be gated by a Lookup object or caller sensitivity.

In the presence of a reconstruction capability, either in the language or in a library API or as provided by a single class, avoiding non-constructed instances includes allowing legitimate reconstruction requests; each legitimate reconstruction request must somehow preserve the intentions of the class’s designer. Reconstruction should act as if field values had been legitimately (from C’s API) extracted, transformed, and then again legitimately (to C’s API) rebuilt into an instance of C.

Serialization is an example of reconstruction, since field values can be edited in the wire format. Proposed with expressions for records are another example of reconstruction. The withfield bytecode is the primitive reconstruction operator, and must be restricted to nestmates of C since it can perform all physically possible field updates. Reconstruction operations defined outside of C must be designed with great care if they use elevated privileges beyond what C provides directly. Given the historically tricky nature of deserialization, more work is needed to consider what serialization of a C.val actually means and how it interacts with default reconstitution behaviours. One likely possibility is that wire formats should only work with C.ref types with proper construction paths (enforced by serialization), and leave conversion to C.val types to deserialization code inside the encapsulation of C.

JNI, like serialization, allows creation of arrays which is hard to constrain with access checks. We have a choice of at least two positions on this. We could allow JNI full permission to create any kind of arrays, thus effectively allowing it “inside the nest” of any value class, as far as array construction goes. Or, we could say that JNI (like Arrays::copyOf) is absolutely forbidden to create uninitialized arrays of privatized value type. The latter is probably acceptable. As with other API points, programmers with a legitimate need to create flat privatized arrays can work around the limitations of the “nice” API points by using more complex ones that incorporate the necessary access checks.

Summary of user model

A value class C has a value companion C.val which denotes the null-hostile (zero-initialized) fully flattenable value type for C.

Like other type members of C, C.val can be declared with an access modifier (public or private or neither). It is therefore quite possible that clients of C might be prevented from using the companion type.

The operations on C.val are almost the same as the operations on plain C (C.ref), so a private C.val is usually not a burden.

Operations which are unique to C.val, and which therefore may be restricted to you, are:

  • declaring a field of type C.val
  • making an array with element type C.val
  • getting the default flat value C.default
  • asking for the mirror C.val.class

Library routines which create empty flattenable arrays of C.val might not work as expected, when C.val is not public. You’ll have to find a workaround, such as:

  • use a plain C reference array to hold your data
  • use a different API point which is friendly to privatized C.val types
  • ask C politely to build such an array for you
  • crack into C with a reflective API and build your own

If you looked closely at the code for C above, you might have noticed that it uses its private type C.val in its public API. This is allowed. Just be aware that null values will not flow through such API points. When you get a C.val value into your own code, you can work on it perfectly freely with the type C (which is C.ref).

If a value companion C.val is declared public, the class has declared that it is willing to encounter its own default value C.default coming from untrusted code. If it is declared private, only the class’s own nest can work with C.default. If the value companion is neither public nor private, the class has declared that it is willing to encounter its own default within its own package.

If a class has declared its companion non-atomic, it is willing to encounter states arising from data races (across multiple fields) in the same places it is willing to encounter its default value.
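The three access levels described above can be condensed into a single predicate. This is only a model of the user-visible rule: the class `CompanionAccess`, the `Declared` enum, and the boolean inputs are hypothetical, and the real checks are performed by the compiler and the JVM as described earlier.

```java
/** Hypothetical model: who may encounter C.default, given the declared access of C.val. */
final class CompanionAccess {
    enum Declared { PUBLIC, PACKAGE, PRIVATE }

    static boolean mayUseDefault(Declared companion, boolean sameNest, boolean samePackage) {
        switch (companion) {
            case PUBLIC:  return true;        // any client, including untrusted code
            case PACKAGE: return samePackage; // neither public nor private
            case PRIVATE: return sameNest;    // only the class's own nest
            default: throw new AssertionError(companion);
        }
    }

    public static void main(String[] args) {
        System.out.println(mayUseDefault(Declared.PRIVATE, false, true)); // false
        System.out.println(mayUseDefault(Declared.PACKAGE, false, true)); // true
        System.out.println(mayUseDefault(Declared.PUBLIC, false, false)); // true
    }
}
```

For a companion declared non-atomic, the same predicate describes where states arising from data races may be encountered.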

Summary of restrictions

From the implementation point of view, the salient task is restricting clients from illegitimately obtaining non-constructed values of C, if the author of C has asked for such restrictions. (Recall that a non-constructed value of C is one obtained without using C’s constructor or other public API.) Here are the generally enforced restrictions regarding a privatized type C.val:

  • You cannot mention the name C.val or C.default in code.
  • You cannot create and load bytecodes which would implement such a mention.
  • You cannot obtain C.default from a mirror of C or C.val.
  • You cannot create a new C.val[] array from a mirror of C or C.val.
  • You cannot lengthen an existing C.val[] array to contain uninitialized elements.
  • You cannot copy an existing array as a new C.val[] array, if C.val is declared non-atomic.

Even so, let us suppose you are an accident-prone client of C. Ignoring the above restrictions, you might go about obtaining a non-constructed value of C in several ways, and there is an answer from the system in each case that stops you:

  • You can mention the C.val or C.default directly in code, in various ways.
  • After obtaining the mirror C.val.class (by one of several means), you can call Class::defaultValue, MethodHandles::zero, or a similar API point.
  • If you can declare a field of type C.val directly you can extract an initial value (or a data-race result, if C.val is non-atomic).
  • If you can indirectly create an array of type C.val, you can extract an initial value (or a data-race result, if C.val is non-atomic).

And there are a number of ways you might attempt to indirectly create an array of type C.val[]:

  • Indirectly create it from a mirror using Array::newInstance or Arrays::copyOf or MethodHandles::arrayConstructor or another similar API point.
  • Create it from a pre-existing array of the same type using Object::clone or Arrays::copyOf or another similar API point.
  • Specify such an array on a serialization wire format and deserialize it.

Using C.val or C.default directly is blocked if C privatizes its value companion, unless you are coding a nestmate or package-mate of C. These checks are applied both at compile time and when the JVM resolves names, so they apply equally to source code and bytecodes created by any means whatsoever.

There are no realistic restrictions on obtaining a mirror to a companion type C.val. (Accidental and casual direct use of C.val.class is prevented by access restrictions on the type name C.val. But there are many ways to get around this limitation.) Therefore any method or API which could violate the above generally enforced restrictions must perform an appropriate dynamic access check on behalf of its mirror argument.

Such a dynamic access check can be made negotiable by an appeal to caller sensitivity or a Lookup check, so a correctly configured call can avoid the restriction. For some simple methods (perhaps Arrays::copyOf or MethodHandles::zero) there is no negotiation. Depending on the use case, access failure can be worked around via a “negotiable” API point like Lookup::arrayConstructor.

