0

Storing data in pointers

 6 months ago
source link: https://lobste.rs/s/5417dx/storing_data_pointers
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Storing data in pointers

andyc

edited 15 days ago

| link

Any reference on the 5 bit encoding? This is in APIs that get called from Objective C or Swift?

The low bit tagging seems like an obvious win and portable win these days, although I know Lua avoided it because it’s not strictly ANSI C. Rust also got rid of small string optimization early in its life, apparently due to code size

https://old.reddit.com/r/rust/comments/2slcs8/small_string_optimization_remove_as_mut_vec/

Though honestly I would have expected fewer heap allocations and cache misses to be a win for most string workloads

snej

15 days ago

| link

You got it — the reference I remember is from Mike Ash’s blog, which has been dormant for a few years, but the archives are a treasure trove of low-level info about Apple’s runtime.

The CoreFoundation types are exposed in Obj-C as NSString, NSNumber, NSDate, NSValue. They also show up in Swift for bridging purposes, but the native Swift string and number classes are implemented differently (in Swift.)

The various tagging schemes that objc (and by proxy swifts interop) uses are internal implementation details that can (and have) changed so it’s not API. Instead objc_msgSend and family handle it directly - similar to the myriad refcount stores and what not.

I was actually looking at Mike Ash’s blog this week for info on tagged pointers:

How about 5 bits? This isn’t totally ludicrous. There are probably a lot of strings which are just lowercase, for example. 5 bits gives 32 possible values. If you include the whole lowercase alphabet, there are 6 extra values, which you could allot to the more common uppercase letters, or some symbols, or digits, or some mix. If you find that some of these other possibilities are more common, you could even remove some of the less common lowercase letters, like q. 5 bits per character gives eleven characters if we save room for length, or twelve if we borrow a symbol and use a terminator.

https://mikeash.com/pyblog/friday-qa-2015-07-31-tagged-pointer-strings.html


I was actually looking at the blog because I was wondering if ref counting in the unused 16 bits of a pointer might be feasible. It would give you up to 65k references, which is more than enough for many (most?) use cases. That would slim down the size of ref counted values and might make them as cache friendly as a GCed value. Might not be thread safe though.

  1. andyc

    15 days ago

    | link

    Wow, skimming the rest of the post, this is a lot more subtle than I would have expected, and also relatively recent – OS X 10.10 as of 2014.

    1. If the length is between 0 and 7, store the string as raw eight-bit characters.
    2. If the length is 8 or 9, store the string in a six-bit encoding, using the alphabet “eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX”.
    3. If the length is 10 or 11, store the string in a five-bit encoding, using the alphabet “eilotrm.apdnsIc ufkMShjTRxgC4013”

    The five-bit alphabet is extremely limited, and doesn’t include the letter b! That letter must not be common enough to warrant a place in the 32 hallowed characters of the five-bit alphabet

    Pretty crazy!

    I think if you control the entire OS and the same NSString is used everywhere, this makes more sense.

    For what I’m doing, we have to call into libc more, and pass it C strings, so we don’t control that part. The allocation to decode and make it look like a C string is problematic. Not just slow, but creates an ownership problem.

    1. I think if you control the entire OS and the same NSString is used everywhere, this makes more sense.

      It makes me wonder if that is a headwind for adoption to Swift on other platforms. Is the language so tuned to performance on a single platform that it makes the code difficult to port?

      It seems like they are also doing some pretty intricate things with C++ interop. As a outside observer (and a relatively ignorant one at that), it seems like it would be very difficult to generalize some of this work.

      1. snej

        14 days ago

        | link

        This stuff is part of CoreFoundation, not Swift. Swift on Apple platforms has some pretty sophisticated interop / FFI with it, for compatibility with Objective-C code, but that isn’t present in Swift on other platforms.

      2. It shouldn’t be hard to port. In GNUstep, we adopted a compressed strings in pointers encoding some years before Apple (not the 5-bit one. Doing that well requires some analysis of data that I didn’t have access to from run-time profiling of a large set of real-world applications). The interface for iterating over strings allows the caller to provide a buffer. These strings are, by definition, of small bounded length and so converting to a C string for interoperability is trivial with the caller providing the buffer on its stack.

        It does have some very nice performance properties. A lot of dictionaries use small strings as keys. If you check the length on the way in and try to convert mutable strings used for lookup to small strings then you know that the result either is or isn’t a small string. This lets you skip a bunch of comparisons after you’ve found the hash bucket.

      3. Not really. In fact, Swift on Linux doesn’t have certain complexities that are present on Apple platforms due to the necessity of ObjC interop.

      4. andyc

        15 days ago

        | link

        I haven’t followed Swift closely, but I do have the feeling that the focus is entirely on Apple’s platform, and there are some fairly hard tradeoffs with respect to portability / open source.

        Just like I think of Google’s entire stack as a vertically integrated embedded system (hardware up to cloud and apps), Apple seems to be architected in a similar way. They get some pretty big benefits from the vertical integration and control

        Andreas Kling mentioned that Apple is a huge inspiration for SerenityOS – basically doing everything yourself and not taking dependencies, which is kind of the opposite of most open source, which is about reuse and portability

        It seems like if Swift were going to be popular on Linux or Windows, that would have already happened by now. Looks like it’s about 9 years since the first release now

  2. snej

    edited 14 days ago

    | link

    You can’t put the ref-count in a pointer, because a pointer is a reference. If you increment a count in a pointer, that doesn’t affect any other pointers to the object, so no one else knows you added a reference.

    CoreFoundation does (IIRC) store refcounts outside objects. My memory is hazy, but it might reserve a few bits for an internal refcount, and when that pins at its max value, the real refcount moves to a global hash table. The answer probably lies in Mike Ash’s blog. (Swift-native class objects don’t do this, though.)

    [Update: just saw your other post below. What CF does uses spare bits in a pointer field inside the object; I thought you meant putting the refcount in pointers to the object.]

    1. No, you were right in the first place. I was totally thinking about things wrong. I got wrapped around the tree a bit. Thank you for the correction.

    2. asb

      14 days ago

      | link

      It was this one - I added a link to it this morning after this thread!

  3. The refcount for a normal objc object is kept in a few points of the isa pointer, and then moved to a side table once the refcount exceeds the available bits. The result is that there’s no major memory overhead to the refcount in normal objects.

    1. I did see that in this article from Mike Ash’s site:

      https://www.mikeash.com/pyblog/friday-qa-2013-09-27-arm64-and-you.html

      It made me happy to see that it wasn’t a totally stupid idea on my part! :]


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK