

source link: https://developers.slashdot.org/story/23/08/19/0150232/why-darpa-hopes-to-distill-old-binaries-into-readable-code

Why DARPA Hopes To 'Distill' Old Binaries Into Readable Code

Researchers at Georgia Tech have developed a prototype pipeline for the Defense Advanced Research Projects Agency (DARPA) that can "distill" binary executables into human-intelligible code so that it can be updated and deployed in "weeks, days, or hours, in some cases." The work is part of a five-year, $10 million project with the agency. The Register reports:

After running an executable through the university's "distillation" process, software engineers should be able to examine the generated HAR (highly abstract representation), figure out what the code does, make changes to add new features, patch bugs, or improve security, and turn the HAR back into executable code, says GT associate professor and project participant Brendan Saltaformaggio.

This would be useful for, say, updating complex software that was written by a contractor or internal team, where the source code is no longer (or never was) to hand, neither are its creators, and stuff needs to be fixed up. Reverse engineering the binary and patching in an update by hand can be a little hairy, hence DARPA's desire for something a bit more solid and automatic. The idea is to use this pipeline to freshen up legacy or outdated software that may have taken years and millions of dollars to develop some time ago.

Saltaformaggio told El Reg his team has the entire process working from start to finish, and with some level of stability, too. "DARPA sets challenges they like to use to test the capabilities of a project," he told us over the phone. "So far we've handled every challenge problem DARPA's thrown at us, so I'd say it's working pretty well."

Saltaformaggio said his team's pipeline disassembles binaries into a graph structure with pseudo-code, presented in a way that developers can navigate, and replace or add parts in C and C++. Sorry, Java devs and Pythonistas: Saltaformaggio tells us there's no reason the system couldn't work with other programming languages, "but we're focused on C and C++. Other folks would need to build out support for that."

Along with being able to deconstruct, edit, and reconstruct binaries, the team said its processing pipeline is also able to comb through HARs and remove extraneous routines. The team has also, we're told, baked in verification steps to ensure changes made to code within hardware ranging from jets and drones to plain-old desktop computers work exactly as expected with no side effects.

Since compilers strip out "unnecessary" data like comments and the names of variables (because they mean nothing to a computer), recovering all that missing, and often essential, metadata has long been a niggly, prickly, and pernicious obstacle to disassembling binary or object code back into assembler (and, in some cases, back into "something that resembles C").
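To make that loss concrete, here is a small hypothetical before/after (the function, names, and numbers are invented for illustration, not taken from any real tool's output):

    /* Original source: names and comments carry the intent. */
    /* Convert a raw 12-bit ADC reading into tenths of a degree C. */
    static int sensor_to_decicelsius(int raw_adc)
    {
        return (raw_adc * 5) / 8 - 400;  /* 0.5 C per 8 counts, -40 C offset */
    }

    /* Roughly what a decompiler gets back from the stripped binary:
       the arithmetic survives, but every hint of intent is gone. */
    int sub_401040(int a1)
    {
        return a1 * 5 / 8 - 400;
    }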

Short of having some kind of AI that "knows" about commonly used interfaces/libraries, and which can identify them in the compiled code's disassembly and pair them up, there is no easy way to get back to something genuinely human-readable.
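One long-established non-AI version of that pairing is byte-signature matching against known library code (IDA's FLIRT and Ghidra's Function ID work along these lines). A minimal sketch of the idea, with invented names and a deliberately simplified scheme (real databases also mask relocated bytes and hash the remainder):

    #include <stddef.h>

    /* One known library-function fingerprint: leading bytes plus a mask
       (mask byte 0x00 = wildcard for relocated addresses, 0xff = must match). */
    struct sig {
        const char *name;
        const unsigned char *bytes;
        const unsigned char *mask;
        size_t len;
    };

    static int sig_matches(const unsigned char *code, const struct sig *s)
    {
        for (size_t i = 0; i < s->len; i++)
            if ((code[i] ^ s->bytes[i]) & s->mask[i])
                return 0;
        return 1;
    }

    /* Scan a table of fingerprints; NULL means "unknown, name it sub_XXXX". */
    const char *identify(const unsigned char *code,
                         const struct sig *sigs, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            if (sig_matches(code, &sigs[i]))
                return sigs[i].name;
        return NULL;
    }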

Even then, there are situations where the code never really was "human readable", such as hand-assembled performance-focused code, where attempting this kind of operation will seriously degrade its value -- or software that modifies itself in memory at runtime (like SecuROM).
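For a feel of why run-time self-modification breaks static analysis, here is a minimal sketch (assuming Linux/x86-64; the byte values and the cast through a function pointer are the usual JIT-demo idiom, not production code):

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* x86-64 encoding of "mov eax, 1; ret" -- statically, this returns 1. */
        unsigned char stub[] = { 0xb8, 0x01, 0x00, 0x00, 0x00, 0xc3 };

        unsigned char *code = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (code == MAP_FAILED)
            return 1;
        memcpy(code, stub, sizeof stub);

        code[1] = 42;  /* patch the immediate at run time */

        mprotect(code, 4096, PROT_READ | PROT_EXEC);
        int (*fn)(void) = (int (*)(void))code;
        printf("%d\n", fn());  /* prints 42; the bytes on disk said 1 */
        return 0;
    }

Any decompiler working from the file on disk reconstructs the "return 1" version; only dynamic analysis sees what actually executes.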

I wish DARPA all the luck in the world, but this is something that people have been wanting to do for aaaaages.

  • (Additionally, this toolkit would make a lot of closed source software vendors shit solid gold bricks, and reach impulsively for their lawyers and cease-and-desist orders, like Catholics reaching for a crucifix when they catch even the faintest hint of something 'satanic')

The idea with the addendum is that a tool that could easily produce human-readable code (and not just raw disassembly, whose "what it's doing" structure is obtuse and difficult to tease out) would make many hardware vendors "Very Upset."

See, for instance, nVidia and their binary blob drivers, or Broadcom with their binary blob radio firmware.

        Being able to generate human-readable code using an AI assistive tool (assuming it's worth a shit-- which is a whole other ball of wax), means also being able to easily produce human-readable documentation about a binary blob, and what it's doing.

        That means trade secrets and other things that are obfuscated inside such a blob could be revealed and disseminated quickly.

        Hence the note about reaching for C&Ds.

        • Re:

          And yet Ghidra freely exists.

Yeah, you aren't going to sue the NSA. Your ass will disappear.

          • Re:

Disassembly and reverse-compiling seem to be different from what this is doing. Compilers and assemblers have advanced manyfold since the Apple II days, and run optimization routines against your source code to cut the crap out. They reduce libraries to only what is needed, shorten the variable and function names, and implement system calls from behind your functions directly in code. Reversing this optimization typically yields a ball of mess (a sketch follows below). Debugging is not easy with it. This project
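As a hypothetical illustration of that "ball of mess" (invented code, not output from any particular tool): a compiler may strength-reduce the multiply out of a simple loop, and a decompiler can only echo back the transformed version.

    /* What the programmer wrote: */
    int sum_squares(int n)
    {
        int s = 0;
        for (int i = 0; i < n; i++)
            s += i * i;
        return s;
    }

    /* What might come back after optimization + decompilation:
       i*i is maintained incrementally (sq += 2i+1), so the intent
       "sum of squares" is no longer visible anywhere. */
    int sub_401080(int a1)
    {
        int result = 0;
        int sq = 0;    /* tracks i*i */
        int step = 1;  /* tracks 2*i + 1 */
        for (int i = 0; i < a1; i++) {
            result += sq;
            sq += step;
            step += 2;
        }
        return result;
    }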
        • Re:

          I beg to differ - even producing documentation from someone else's well-written original code is often a nightmare. I don't see it getting any easier using AI de-compiled code, which will almost certainly be less readable.

          It'll be a lot easier than doing so from disassembled assembly code, but that's not saying much.

          And my bet is we'll have intentionally-obfuscating compilers coming out any day now in order to reduce the risk - just like we had back in the day when CPUs were simple linear processors, and c

          • Re:

            No doubt. I even have problems going back to old code I wrote, and remind myself to better document it in the future (which I don't do.) AI seems to be the answer to everything, or at least attracting loads of cash. Given some of the challenges and problems AI has had with stuff that should be relatively straightforward, such as looking up case law or even writing a simple article, trying to understand code is likely to be a source of humor for quite some time.

I think the NSA can work out some amount of money, or a honeypot can buy 'source' code. Failing that, there is always 'national security'.
But a good developer/hacker doesn't need this "human-readable" code, as what the decompilers produce is already pretty readable to them. Anyone really wanting to know these "trade secrets" can already easily get at them using current decompilers. It's not like decompiled code is unreadable to humans who know their shit; yeah, a script kiddie might not know it.
        • Re:

          More like it would make them reach for locked down processors that only execute crap with their digital signatures on it and force them upon the public. Everywhere in every machine, every CPU.

          After all, the whole point of proprietary software is to sell you something that tells the machine how to do something that you cannot describe yourself. So why would they have a problem forbidding you from ever being allowed to describe anything? Hell, it's a monopoly at that point. You want the machine to do someth
    • Re:

For lost source code, most copyright laws have exceptions that explicitly allow decompiling, fixing, porting, and recompiling.

  • Re:

You'll never get the comments back, nor will you have meaningful variable names... but an advanced disassembler would be able to map out the execution paths, provide order and consistency, and even break the code into objects (a toy sketch of that path-following appears after this comment).

    Honestly, I still wouldn't call that 'human readable' as I have a bit of trouble reviewing my own code after a few years and it's no longer fresh in my mind. And I comment my code.

    Code that is organized purely based on how it executes seems like a great thing, but realistically you
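A toy sketch of that execution-path mapping, using a one-byte ISA invented purely for illustration (real disassemblers do this over x86/ARM, usually with a worklist instead of recursion):

    #include <stdbool.h>
    #include <stddef.h>

    enum { RET = 0x00, JMP = 0x01, BRANCH = 0x02 };  /* invented opcodes */

    #define TEXT_SIZE 256
    static bool reachable[TEXT_SIZE];

    /* Recursive-descent disassembly: follow control flow from an entry
       point so that data bytes mixed into the code are never decoded. */
    static void explore(const unsigned char *code, size_t pc)
    {
        while (pc + 1 < TEXT_SIZE && !reachable[pc]) {
            reachable[pc] = true;
            switch (code[pc]) {
            case RET:                      /* path ends here */
                return;
            case JMP:                      /* follow the target only */
                pc = code[pc + 1];
                break;
            case BRANCH:                   /* two successors: taken... */
                explore(code, code[pc + 1]);
                pc += 2;                   /* ...and fall-through */
                break;
            default:                       /* ordinary op, fall through */
                pc += 1;
            }
        }
    }

Everything left unmarked after explore(code, entry) is data (or dead code), which a plain linear-sweep disassembler would have happily mis-decoded as instructions.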

  • by Gravis Zero ( 934156 ) on Saturday August 19, 2023 @09:13AM (#63779814)

    From TFA:

    We know what you're thinking: Uncle Sam is reinventing decompilation. It certainly sounds like it. There are lots of decompilation and reverse-engineering tools out there for turning executable machine-level code into corresponding source code in human-readable high-level language like C or C++. That decompiled source, however, tends to be messy and hard to follow, and is typically used for figuring out how a program works and whether any bugs are exploitable.

    From what we can tell, this DARPA program seeks a highly robust, automated method of converting executable files into a high-level format developers can not only read – a highly abstract representation, or HAR, in this case – but also edit to remove flaws and add functionality, and reassemble it all back into a program that will work as expected. That's a bit of a manual, error-prone chore even for highly skilled types using today's reverse-engineering tools, which isn't what you want near code going into things like aircraft.

    DARPA instead seems to want a decompilation-and-recompilation system that is reliable, easy enough to use, and incorporates stuff you'd expect from a military research nerve center, such as formal verification of a program's modifications.

  • No, they're trying to vastly improve on de-compiling tools. Tools that can disassemble object/executable code (which is pretty trivial, e.g., objdump does it) but then analyze the assembly to back out C or C++ code that might have originally created the code.

    Think "ftoc" but replace FORTRAN with "exectuable/object code".

I've done some decompiling of (small) executables, using non-AI tools that often did little more than treat C as an assembler language. They did usually get things like "this is a subroutine" right and created parts like "int a113s_r4r5r6(int r4, int r5, int r6) {.... return r6; }".

So we got C code that we could recompile, and while the output was not byte-for-byte identical, the resulting recompiled code was "correct". We could theoretically edit the resulting C code, but because all the labels, variable names, etc. are obviously stripped out, the decompiler had to generate *something*, so we got stuff like I mentioned above.

As a subject-matter expert, most of my job was trying to recognize what the code was *really* doing, replacing the decompiler-generated names with my best educated guess as to what the function/variable was really doing or might be called. The decompiler didn't always (or even usually) recognize things like "this is an array access", and had instead emitted code like

    int *v123;
    int v3245234;
    v123 = v345235;
    v123 += v3544;
    v3245234 = *v123;

That's C-as-assembly, essentially. But by recognizing the pattern and making substitutions like

    v345235 == "a"
    v3245234 == "b"
    v3544 == "i"

    we might recast that as

    int b = a[i];

    I'm sure the AI parts here are geared towards doing that sort of thing better and more accurately. Not to mention being able to compare object code against known object code in the wild and find the corresponding source code, e.g., when FOSS software got included, or when libc code was statically linked.
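Putting those fragments together, a self-contained version of the before/after might read like this (keeping the placeholder names from above; the function wrappers are added here so the snippets compile, and are not from the original tool):

    /* Decompiler-style output: array indexing flattened into pointer math. */
    int v3245234;

    void sub_401200(int *v345235, int v3544)
    {
        int *v123 = v345235;   /* v345235 == "a" */
        v123 += v3544;         /* v3544   == "i" */
        v3245234 = *v123;      /* v3245234 == "b" */
    }

    /* The same routine after recognition and renaming: */
    int b;

    void read_element(int *a, int i)
    {
        b = a[i];
    }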

I see myself potentially replacing refactoring by using this path: compile existing code, run an AI-assisted decompile, then proceed to understand the code. At least this way I would have a consistent starting point when faced with some hastily written chicken scratch that was squeezed out by program management.

      Might even find a bunch of trivial bugs and convenient optimizations this way too.

Back in the day (late 80s / early 90s, when there was still software manually written in assembly), that was one of the strengths of the "Sourcer" disassembler: instead of merely dumping machine code as human-readable mnemonics, it actually tried to understand what the code does and give meaningful names and comments. It did so by having a lot of knowledge in its database.

      so instead of merely:

      out dx, al

      you got:

      out dx, al ; switch the PC speaker timer output to single trigger

(that's how I learned how to play digital
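For flavor, a hedged reconstruction of the kind of routine that annotated instruction would sit in (DOS-era C with Borland-style outportb()/inportb(); the 8253 timer channel 2 one-shot trick as a crude PWM DAC is the classic technique, but this code is illustrative, not Sourcer output):

    #include <dos.h>

    /* Play 8-bit PCM on the PC speaker: put timer 2 into mode 0
       ("single trigger"); each write of a sample re-arms the one-shot,
       so the output pulse width tracks the sample value. */
    void play_sample(const unsigned char *pcm, unsigned len)
    {
        unsigned i;

        outportb(0x43, 0x90);              /* timer 2, LSB only, mode 0 */
        outportb(0x61, inportb(0x61) | 3); /* gate timer 2 onto the speaker */

        for (i = 0; i < len; i++) {
            outportb(0x42, pcm[i]);        /* pulse width = sample value */
            /* ...wait one sample period here (e.g., poll timer 0)... */
        }
    }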

  • Re:

There are plenty of programs that are just compiled from high-level languages down to Lego-like assembly macros and finally end in assembly.

If the source code is lost, decompiling them into something C-like and then compiling them again might already be enough to port them to a different processor/system.

    • Re:

Clearly you have never tried that. It doesn't work for C or anything else that compiles down to assembler. It will work for Java, sort of, but it is unreliable, and for a large enough body of code the chance that a bug gets introduced approaches 1 quite quickly. This is the sort of technology that works for some simple cases but doesn't scale well to large applications. That doesn't mean you couldn't make one that works, but it is going to be very expensive and definitely won't be done at a university (this is th

  • Re:

    Well, given they already have ghidra courtesy of the NSA [nsa.gov], that part's down pat.

One of the biggest problems with ghidra is the inability to easily define external data type libraries. (These contain type references, function signatures, and data structure names/layouts for a given library.) You can build some custom ones for the currently disassembled binary, but they cannot be exported and imported into another project. If you could, you'd be able to help the disassembler quite a bit in that regard. No AI
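For reference, the information such a type library carries maps naturally onto a plain C header; Ghidra can ingest headers like this via its "Parse C Source..." feature to populate a data type archive (all names below are hypothetical):

    /* pkt.h -- hypothetical library interface, written for the disassembler */
    typedef unsigned short u16;
    typedef unsigned int   u32;

    struct pkt_header {    /* 8-byte wire layout */
        u16 magic;
        u16 len;
        u32 crc32;
    };

    /* Function signatures let the decompiler propagate argument and
       return types through every call site it finds. */
    int  pkt_parse(const struct pkt_header *hdr, unsigned char *payload);
    void pkt_free(struct pkt_header *hdr);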

