Tell HN: People forget that you can stick any data at the end of a bash script

 1 year ago
source link: https://news.ycombinator.com/item?id=36605869

166 points by BasedAnon 6 hours ago | 78 comments
This is a neat trick I've used to write self-extracting software or scripts that extract files from archives by just using
    tail -c <number of bytes for the binary> "$0"
All you have to do is make sure you append an explicit 'exit' to the end of your program before your new 'data section', so that bash won't parse any of the 'data section'.

One thing to bear in mind is that if you append binary data, most text editors will corrupt it on save, so when I want to make changes I just delete all the binary and reappend it.
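
A minimal sketch of the approach described in the post, under stated assumptions: the file names `payload.bin`, `selfextract.sh` and `extracted.bin` are hypothetical, and the payload is a short text string standing in for real binary data. The stub reads its own last N bytes with `tail -c`, and the explicit `exit` stops the shell before it reaches the appended data.

```shell
#!/bin/sh
# Sketch only: builds a self-extracting script in a scratch directory.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical payload; any bytes would do.
printf 'hello payload\n' > payload.bin
size=$(wc -c < payload.bin | tr -d ' ')

# The stub copies the last $size bytes of itself to extracted.bin.
# 'exit 0' keeps the shell from ever parsing the appended data.
cat > selfextract.sh <<EOF
#!/bin/sh
tail -c $size "\$0" > extracted.bin
exit 0
EOF
cat payload.bin >> selfextract.sh

sh selfextract.sh
cmp payload.bin extracted.bin && echo "payload round-tripped"
```

The byte count is baked into the stub when it is built, so the payload has to be re-appended (and the count regenerated) whenever the data changes.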

If you care less about space efficiency and more about maintainability of the script, you can also encode the binary as base64 and put an
  echo '...base64 data...' | base64 -d > somefile
in your script.

Or add compression to reclaim at least some of the wasted space:

  echo '...base64 gzipped data...' | base64 -d | gunzip > somefile
Also note that bash accepts line breaks in quoted strings and the base64 utility has an "ignore garbage" option that lets it skip over e.g. whitespace in its input. You can use those to break up the base64 over multiple lines:
  echo '
    ...base64 gzipped data...
    ...more data...
    ...even more data...
  ' | base64 -di | gunzip > somefile
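
A rough sketch of producing and decoding such indented, multi-line base64, assuming GNU coreutils (`base64 -w` and `-i` are GNU options and may not exist elsewhere). The file names `original`, `indented.b64` and `somefile` are made up for illustration.

```shell
#!/bin/sh
# Sketch only: assumes GNU coreutils base64.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

printf 'some file contents\n' > original   # hypothetical input file

# Encode: gzip, base64 on a single line, then wrap at 16 columns and
# indent each line, mimicking data pasted into an indented quoted string.
gzip -c original | base64 -w 0 | fold -w 16 | sed 's/^/    /' > indented.b64

# Decode: -i (ignore garbage) skips the spaces used for indentation.
base64 -di < indented.b64 | gunzip > somefile

cmp original somefile && echo "round trip ok"
```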
If you care about maintainability, you keep the binary data out of the source file and have a build process.
For something small, I would take a data.bin file rather than a build process. But yes.
You can also use here-documents to avoid hitting any argv length limits:
    { base64 -d | gunzip > output; } <<EOF12345
    ...data...
    EOF12345
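
One way to generate a script of this shape, sketched under the assumption that `base64 -d` is available (GNU coreutils or similar). The names `original`, `unpack.sh` and `output` are hypothetical. The here-document delimiter is quoted in the generated script so the shell does no expansion inside the data.

```shell
#!/bin/sh
# Sketch only: writes unpack.sh with the data embedded in a here-document.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

printf 'here-doc payload\n' > original   # hypothetical input file

{
    printf '%s\n' '#!/bin/sh'
    printf '%s\n' "{ base64 -d | gunzip > output; } <<'EOF12345'"
    gzip -c original | base64
    printf '%s\n' 'EOF12345'
} > unpack.sh

sh unpack.sh
cmp original output && echo "unpacked ok"
```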
An even simpler way would be to include a marker to denote the end of the shell script, and the start of the data. For example, if you put this in extract.sh
    #!/bin/sh
    sed -E '1,/^START-OF-TAR-DATA$/d' "$0" | tar xvzf -
    exit
    START-OF-TAR-DATA
and then run:
    cat extract.sh ../foobar.tar.gz > foobar.tar.gz.sh
You can then run foobar.tar.gz.sh to self-extract. You still get the benefit of being able to modify the shell script without needing to count lines or characters, and without sacrificing any compression.
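
The marker approach can be tried end to end with something like the following sketch. It assumes GNU sed and tar; the `foobar` directory, its `hello.txt` file and the `out` directory are invented for the demonstration.

```shell
#!/bin/sh
# Sketch only: builds foobar.tar.gz.sh and runs it in a separate directory.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical archive contents.
mkdir foobar
printf 'hi\n' > foobar/hello.txt
tar czf foobar.tar.gz foobar

# The stub deletes everything up to and including the marker line,
# then hands the rest of the file to tar.
cat > extract.sh <<'EOF'
#!/bin/sh
sed -E '1,/^START-OF-TAR-DATA$/d' "$0" | tar xzf -
exit
START-OF-TAR-DATA
EOF
cat extract.sh foobar.tar.gz > foobar.tar.gz.sh

mkdir out
cd out
sh ../foobar.tar.gz.sh
cat foobar/hello.txt
```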
Just to be sure I’m following you correctly, what is the advantage of zipping the base64 data vs having the original binary, zipped if you like?
As I understood, you base64 the zipped data on input and the other way around on output.

The reasoning being that the base64'd binary data is safe from being corrupted when the file is edited in text editors, as a response to the warning stated on the last paragraph of the original post.

Is there an encoding that is less wasteful than base64 but not vulnerable to text editor corruption issues? I think avoiding 0x0 to 0x20 should be enough to not get corrupted by text editors, though base64 avoids a lot more than that.
If you can count on every printable ascii character being not-mangled, you can use ascii85/base85/Z85 (5 "ascii characters" to 4 bytes) instead of base64.
There's probably a base(bigger number) with Unicode chars today
While a couple of people suggested Base65536, that encoding isn't particularly compact, and it can't be as elegant as 65536 would suggest because it has to dodge special cases in unicode.

It's almost always the case that either Base32768 is denser, or encodings with 2^17 or 2^20 characters are denser.

But you need to make sure to use utf-16 or utf-32 instead of utf-8, or you may be worse off.
This trick is used in the demoscene. Instead of using -c, I use -n,
  tail -n +2 $0
The -n +2 option means “starting at line 2”, which is what you want if you cram your script into one line. You can make an executable packed with lzma this way,
  a=`mktemp`;tail -n+2 $0|unxz>$a;chmod +x $a;$a;rm $a;exit
This is the polite way to do it, using mktemp. You can save some bytes if you don’t care about that stuff.
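
A sketch of the same one-line-stub layout, with two swaps that should be read as assumptions: gzip stands in for lzma/xz so that only ubiquitous tools are needed, and the "executable" is a tiny shell script named `payload.sh` (hypothetical) rather than a real binary. It also assumes the directory used by `mktemp` allows executing files.

```shell
#!/bin/sh
# Sketch only: packs a small executable behind a one-line stub.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical payload: a script standing in for a real executable.
printf '#!/bin/sh\necho "hello from payload"\n' > payload.sh

# Line 1 is the whole stub; everything from line 2 on is compressed data.
printf '%s\n' 'a=$(mktemp);tail -n +2 "$0"|gunzip>"$a";chmod +x "$a";"$a";rm "$a";exit' > packed.sh
gzip -c payload.sh >> packed.sh

output=$(sh packed.sh)
echo "$output"
```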
There must be a way to run something without needing a temp file...
yup. after that you can use the global var DATA to access the data injected after the __END__
Are you sure that Perl took it from ruby and not the other way around?

(edit: a subsequent correction has obsoleted this comment)

A very large Electronic Medical Records company shipped an extremely large shell script to us for an install.

Upon examination it contained binary data and a command to extract it to a file and then installed the application.

This was the “efficient” way to ship and install the binary.

Shell archive it was called? There used to be a lot of installers like that.
Minor nit: every "shar" I've seen (from distant memory) used a "here document" rather than appending (possibly binary) data to the end of a shell script.

https://en.wikipedia.org/wiki/Here_document

This reminds me of a job I had 15+ years ago where we did code reviews by emailing files to one another with our changes. It worked like this with the first part of the file being a script and the end of the file being a base64 encoded zip of the changed files. We had tooling that would pack them, but unpacking was done by execution.

What could possibly go wrong with emailing executable scripts?

This is a great trick, but no one should ever run someone else's script that does this unless they have verified the script line by line beforehand.
I don't think I've ever read through the Nvidia binary drivers that way. (They're named *.run but are basically shar files)
Sure, but that's turtles all the way down... any time you run untrusted code, you are making a risk based decision, usually based on the provenance of the code.
Maybe? People run all manner of binaries/installers without checking them; I'm not sure why these sorts of things require any EXTRA scrutiny.
In Perl, __DATA__ indicates the beginning of the data section of the file. A portable way to provide test data or sample data.

https://perldoc.perl.org/functions/__DATA__

Since zip files use a directory at the end, you can make a kind of mullet file - script at the front, archive at the back. I generated single-file runnable Java binaries like that at one point.
Java JAR files are similar, but reversed. You can add anything you want to the beginning of the JAR file (or is it any ZIP file?) so long as it doesn't include the Zip file header "PK". So, I use this to prepend a bash script that ultimately calls
    java -jar $0
It makes it very easy to set up and use Java-based command line programs on a server.
this sounds incredibly useful but I couldn't get it to work. I just get
    java.util.zip.ZipException: invalid CEN header (bad signature)
        at java.base/java.util.zip.ZipFile$Source.zerror(ZipFile.java:1623)
if I try to do anything with a JAR file that has leading text. I'm creating it just using
   echo 'java -jar $0' | cat - test.jar > test.run.jar
Is there more to it?
Technically, you should update the offset to the central directory in the Zip footer, along with the offsets to each file header in each central directory entry. If you don’t, the zip file reader has to apply some heuristics to locate the central directory; not all readers implement these heuristics, and those that do won’t always be robust.

The “unzip” utility can be useful as a sanity check; run “unzip -t” to test the integrity of the file.

I did a similar thing for a lowish volume embedded product. The update files are just bash scripts with a tar file cat'd on them. The unit just looks for a particular file on an external flash drive to run and the bash script runs, copies off a tar and checks that it has the right hash. Super simple and flexible when customers need me to do something special. Like extract some specific log onto a flash drive.
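
A hedged sketch of what such an update file could look like; it is not the commenter's actual script. It assumes GNU sed, tar and `sha256sum`, and every name in it (`update.tar`, `stub.in`, `update.sh`, the `@HASH@` placeholder, the `__TAR_BELOW__` marker, the `unit` directory) is invented for illustration.

```shell
#!/bin/sh
# Sketch only: an update script with a tar appended and a hash check.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical update contents.
mkdir update
printf 'v2\n' > update/version.txt
tar cf update.tar update
expected=$(sha256sum update.tar | cut -d' ' -f1)

# Stub template: copies the tar off the end of the script, verifies its
# SHA-256 against the value baked in at build time, then unpacks it.
cat > stub.in <<'EOF'
#!/bin/sh
set -eu
sed '1,/^__TAR_BELOW__$/d' "$0" > payload.tar
actual=$(sha256sum payload.tar | cut -d' ' -f1)
if [ "$actual" != "@HASH@" ]; then
    echo "hash mismatch, refusing to install" >&2
    exit 1
fi
tar xf payload.tar
exit 0
__TAR_BELOW__
EOF
sed "s/@HASH@/$expected/" stub.in | cat - update.tar > update.sh

# Simulate the unit running the update from another directory.
mkdir unit
cd unit
sh ../update.sh
cat update/version.txt
```

Because the hash lives in the script itself, this only catches accidental corruption; it does nothing against someone who can rewrite the whole file.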
I can vaguely remember that many programs used to install themselves this way under Linux.
It was used on Unix systems even before that.
definitely used something similar on VAX/VMS called VMS_SHARE (https://www.glaver.org/ftp/multinet-contributed-software/vms...) circa '90-91

in fact I found an old archive of mine floating around on usenet and wrote a python script to unpack it. Looking at the original, it was using a scripting-language bootstrap to make a COM script unpack the embedded original code.

Lots of commercial Linux software still uses this for installing their stuff. It’s a neat trick.
I've seen it recently with the Conda and Mamba package managers.
This is my default approach to writing installers for the Unices. The program is compressed and added to the end of the script, and the script does the unpacking and any needed setup/configuration for the specific platform it's getting installed on.

I don't append it in binary form, though. I uuencode it. That way, there is no danger in using text editors.

Why uuencode? Base64 is the de facto standard these days.
Sorry, I did mean base64. I have a bad habit of calling all "binary as text" encodings "uuencode". I usually catch myself before I put it in writing, though.
I've used both, but only briefly. I think I used uuencode when using uucp. And Base64 in one of my Python programs.

What are their pros and cons, in your opinion?

Base64 is slightly more space efficient. Other than that it's just more popular and better supported.
Got it, thanks.

Yes, uuencode / uudecode are probably older too.

They are from the uucp dialup comms era of networking.

That's what uuencode / uudecode were once used for.
This works for any sh-type script, not just bash :) It will work with sh, ksh and even [t]csh.
That's how I made a bash backdoor once. It was just a script somewhere on the FS, until it unpacked itself and executed the rest of the rootkit.

Long story but trust me that I had good intentions.

BASIC and Perl had or have something like that too.

IIRC, Perl copied it from BASIC, because BASIC came much before Perl.

And, again, IIRC, I've read about the shar (shell archive) method that someone else commented about in this thread (and which even has a Wikipedia entry), in either the classic Kernighan and Pike book, The Unix Programming Environment (which I've recommended here multiple times before), or in some Unix man pages, long ago.

So it's quite an old method.

This reminds me of ZX Spectrum Basic where all the graphics, sound, and level layouts were defined using DATA lines at the end of the program.
Or any machine code routines you wanted to POKE into memory.

A suppressed obscure part of my lizard brain secretly wishes I could just code for 8bit computers from the 80s, just with all the modern niceties like text editors, assemblers and emulators etc.

You could also put the binary data in the first line of the Basic program after the ‘rem’ command, change the line number to 0 using the poke command, so that it’s not possible to edit this line. The second line would run the code using ‘randomize usr’. There were also fun tricks with control sequences, that would hide the ‘rem’ command and the line number, and put something like “Cracked by Bill Gilbert (c) 1982” instead. Gosh, why do I still remember all this nonsense after all these years…
I use a fun little hack, a la awk:

```
#!/usr/local/bin/bash

echo "HELLO"

TAIL_REMOTE_MARKER=`awk '/^__THE_REMOTE_PART__/{flag=1;next}/^__END_THE_REMOTE_PART__/{flag=0;exit}flag' ${0}`

eval "$TAIL_REMOTE_MARKER"

exit 0

__THE_REMOTE_PART__

echo "WORLD"

__END_THE_REMOTE_PART__
```

I used to do something similar for Windows executable files. Append a large file to the end as necessary.
I seem to recall that you can do the opposite as well: stash some extra data at the end of a binary file. The 'tclkit' system used this to package up an executable with the scripts you wanted to ship.
I vaguely remember this is what OCaml does for one format of its executable.
This is a malware technique.

I am not saying don't do it. But that is mostly where I see this type of trick.

Malware is about intent and consent, not executable format.
>portswigger does that for the burpsuite installers.

Wow, that triggered my wordplay radar, which I'm working on as a fun side line these days, thanks :)

port, suite (sweet)

swig, burp

I think this is how GOG ships the Linux version of Battletech.
I believe this is how GOG ships all of its Linux titles, all of the installs I've used from them are downloaded as a single *.sh file. I just checked an example game, and it looks to be using this method.
Makeself archives are classic self-extracting tarballs that do exactly that...
I don't understand this website. It is too hard and I don't understand anything. Can anyone help me with this?