iliana :trans_heart: is a user on cybre.space. You can follow them or interact with them if you have an account anywhere in the fediverse.
iliana :trans_heart: @iliana

uhhhh hmm

let’s say I want to store the ASCII integer representations of values 0..15 inclusive in memory

if I allow the memory to overlap I can represent it as

151413121109876

thus 15 and 5 overlap, 11 and 10 overlap

this *might* be the shortest possible representation

is there already a known algorithm for this? is this problem called something?

@iliana just a guess but it's probably come up in compilers/languages that do string interning?

@iliana and/or include strings in the text section of their executables

@chr @iliana yeeeeah, also maybe in compression ( esp shared dictionary stuff a la sdch or zstandard? although what i've actually seen people doing there has been pretty brute force, so maybe not. 🤷 )

@gdkar @iliana @chr shared dictionary definitely sounds like it's in the right area

@iliana your example looks ambiguous, you won't be able to reliably decode that

@iliana Huh. This is really clever, and has all sorts of implications on things like gzip dictionaries

@iliana Lemme clarify since there are already doubters: gzip lets you say "a symbol with length x comes from location y in the dictionary buffer", so the ambiguous locations aren't a problem. The more bytes your symbols are, and the more times they're repeated, the more worthwhile this packing is (just like compression in general).

Never heard it specifically named, but if I was backed into a corner I'd say order-independant entropy coding.

@iliana this feels very close to group theory, but I'm not a maths

@iliana if you do it bit by bit, you could overlap leading and trailing zeroes, though processors are not good at arbitrary bit offsets.

@iliana
are you requiring things to be 8bit aligned? If not you might be able to shorten it. (though a 15-short lookup table is still not very big and it's probably faster to use)

@iliana reminds me of a prefix tree (or suffix tree in this case) with a flat representation (I forget what they call that kind of representation)

@iliana Yes! This is known as the shortest superstring problem. Lots more info here geeksforgeeks.org/shortest-sup

tl;dr it's NP-hard

@bilbono oh interesting! I wonder if it's still NP-hard if you assume the input is a range of integers (i.e. 0..15 per the example).

@iliana @bilbono yeah, the self-delimiting constraint feels... funky, but iiiii can't articulate exactly why.

@iliana Hmm, the more I think about it the more I feel like it shouldn't be NP-hard in the case of a range of integers. There's too much structure/regularity to exploit. Nothing more than a hunch though.

@iliana There's a similar thing called De Bruijn sequences, though that's for listing all n-length strings of a size-k alphabet. This example is too irregular to reliably reproduce.

en.wikipedia.org/wiki/De_Bruij

I’ve decided I’m going to brute force it since I specifically need 0..255 inclusive lol

@iliana seems to me for that case it's pretty straightforward:

3 digit numbers: no overlap, all appear as is
2 digit numbers: all appear as subsets of the 100s
1 digit numbers: all appear as subsets of any group of 10s, so just reuse some random one (24x for example)

@iliana might not be optimal but i can't think of anything terribly clever to make it better (125200 for example could get you 252 for free...but you can only do that for 120-125 so it only saves you like 20 characters overall)

@chr @iliana i feel like it'd be pretty easy to optimize for the 210s just from 1X21X3 right??

@chr @iliana Even with a range of integers (0..255) this problem is a lot more complicated than I thought.

For instance: 1222121112. Every three-digit substring is a unique number in the range! That's as dense as you can get.

@iliana Sounds like an interesting problem to try to solve in Forth, especially since I haven't done any coding lately. I'm not sure if this has a name for it, but I'm reminded of a trick the programmers who worked on a pinball game called Banzai Run did: they ran out of sound board memory, and needed an "and" so they made it the "an" out of "welcome race fans" and called it a day.

@iliana De Bruijn sequences have been suggested, but I'm wondering if you can do it by constructing a De Bruijn graph[1] and deleting some vertices. You can safely ignore all numbers below 10^n if your count goes up to 2*10^n as you indicated (for 255), so this should be simple enough.

1: en.wikipedia.org/wiki/De_Bruij