Learning assembly?


I'll go against the grain of most answers and recommend Knuth's MMIX, a MIPS-like RISC architecture. It won't be as practically useful as x86 or ARM assembly languages (not that they're all that crucial themselves in most real-life jobs these days... ;-), but it WILL unlock for you the magic of Knuth's latest version of the greatest-ever masterpiece on deep low-level understanding of algorithms and data structures -- TAOCP, "The Art of Computer Programming". The links from the two URLs I've quoted are a great way to start exploring this possibility!


Also, what does MSVC++ use for the inline assembly code? MASM?

At the end of the day, this means that an expert assembly programmer and an expert disassembler are two different specialties. Commonly they're found in the same person, but they're really separate, and learning how to be an excellent assembly coder won't help you that much to learn reverse engineering.

I know that when you compile anything with a high-level language, you get a lot of "garbage" code that wouldn't be needed if it were coded directly in assembly. I also understand that there's a difference between an expert assembly programmer and an expert disassembler. But the same could be said about almost everything else.

My concern is that while in theory I could read the papers and get a grasp of what they mean, until I start writing things myself I don't believe I'll truly understand it. You say I can start by changing small parts of code, but to do that I first must know what kind of assembly "flavour" IDA Pro, for example, uses.

The assembly you would write by hand and the assembly generated by a compiler are often very different when viewed from a high level. Of course, the innards of the program will be very similar (there are only so many different ways to encode a = b + c, after all), but they're not the trouble when you're trying to reverse engineer something. The compiler will add a ton of boilerplate code to even simple executables: last time I compared, "Hello World" compiled by GCC was about 4kB, while if written by hand in assembly it's around 100 bytes. It's worse on Windows: last time I compared (admittedly, this was last century) the smallest "Hello World" I could get my Windows compiler of then-choice to generate was 52kB! Usually this boilerplate is only executed once, if at all, so it doesn't much affect program speed -- like I said above, the core of the program, the part where most execution time is spent, is usually pretty similar whether compiled or written by hand.

What you want to do is grab the IA-32 and AMD64 (both are covered together) architecture manuals from Intel and AMD, and look through the early sections on instructions and opcodes. Maybe read a tutorial or two on assembly language, just to get the basics of assembly language down. Then grab a small sample program that you're interested in and disassemble it: step through its control flow and try to understand what it's doing. See if you can patch it to do something else. Then try again with another program, and repeat until you're comfortable enough to try to achieve a more useful goal. You might be interested in things like "crackmes", produced by the reverse engineering community, which are challenges for people interested in reverse engineering to try their hand at, and hopefully learn something along the way. They range in difficulty from basic (start here!) to impossible.


I've heard of MASM. If I'm not mistaken, it has a lot of "high level" features that I don't see when I look at disassembled code. I'd like to program in something that looks exactly like the code most disassemblers output, if that makes sense.

MASM, with its quirks, bugs, and so-called high-level features, has done more to confuse assembly programmers, beginners and experts alike, than anything else I can think of.

Start with MASM32 and from there look at FASM. But you'll have fun with MASM.

That would basically be like writing op codes, which doesn't really make sense. Learning MASM32 will help you understand how code looks in a debugger. You may also like to check out OllyDbg: ollydbg.de

You don't have to use all of the MASM features just because they're there; you can make things as hard to read as you want, if you think you'll learn more that way.

You don't understand assembly. You need to understand it. An opcode is a number. Debuggers will attempt to resolve opcodes to their instructions (sometimes it's hard). You need to understand the basic instructions. Learning MASM will help you do this. No more needs to be said.
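To make "an opcode is a number" concrete, here is a minimal sketch in Python of the lookup a debugger performs. It only knows a handful of one-byte x86 opcodes as they are interpreted in 32-bit mode; the names `mnemonic` and `decode` are made up for this illustration, and real x86 decoding (prefixes, ModRM, immediates) is far more involved.

```python
# Toy lookup for a few one-byte x86 opcodes (32-bit mode interpretation).
# Prefixes, ModRM/SIB bytes and immediates are deliberately not handled.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

FIXED = {0x90: "nop", 0xC3: "ret", 0xCC: "int3"}


def mnemonic(opcode: int) -> str:
    """Return the mnemonic for one opcode byte, or a raw 'db' line if unknown."""
    if opcode in FIXED:
        return FIXED[opcode]
    if 0x50 <= opcode <= 0x57:      # push r32: register number is in the low 3 bits
        return f"push {REG32[opcode & 7]}"
    if 0x58 <= opcode <= 0x5F:      # pop r32: same register numbering
        return f"pop {REG32[opcode & 7]}"
    return f"db 0x{opcode:02x}"     # unknown to this toy table: show the raw byte


def decode(code: bytes) -> list[str]:
    """Treat every byte as a complete instruction (only valid for this toy subset)."""
    return [mnemonic(b) for b in code]


# A typical tiny function body: save ebp, do nothing, restore ebp, return.
for line in decode(bytes([0x55, 0x90, 0x5D, 0xC3])):
    print(line)
```

The point is only that the assembler and the debugger are two directions of the same table; the mnemonics are for humans.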


"Write your own disassembler" - I agree, it's how I learned it best. (What's up with "But I don't even know how to write an disassembler"?) LOL.

By writing your own disassembler, you:

1) learn the instruction set for the specific processor

2) learn the nuances of how to write code in assembler for said processor, such that you can wiggle every opcode bit in every instruction

3) learn the instruction set better than most engineers who use that instruction set to make their living

I have done this many times and continue to do this. In this case, where your primary goal is reading and not writing assembler, I feel this applies.

I have some msp430 examples at github.com/dwelch67, plus a few instruction set simulators for experimenting with, including learning asm, etc.

I recommend the gcc tools. mingw32 is an easy way to use the gcc tools on Windows if x86 is your target. If not, mingw32 plus msys is an excellent platform for generating a cross compiler from binutils and gcc sources (generally pretty easy). mingw32 has some advantages over cygwin, like significantly faster programs, and you avoid the cygwin DLL hell. gcc and binutils will allow you to write in C or assembler and disassemble your code, and there are more web pages than you can read showing you how to do any one or all of the three. If you are going to be doing this with a variable-length instruction set, I highly recommend you use a tool set that includes a disassembler. A third-party disassembler for x86, for example, is going to be a challenge to use, as you never really know if it has disassembled correctly. Some of this is operating-system dependent too: the goal is to compile the modules to a binary format that contains information distinguishing instructions from data, so the disassembler can do a more accurate job. Your other choice for this primary goal is to have a tool that can compile directly to assembler for your inspection, then hope that when it compiles to a binary format it creates the same instructions.

I'm going with you! Just bought a MSP430 and a book on it... :)

In your case there are a couple of problems. I normally recommend the ARM instruction set to start with; there are more ARM-based products shipped today than any other (x86 computers included). But it is unlikely that you are using ARM now without knowing enough assembler for it to write startup code or other routines, so knowing ARM may or may not help with what you are trying to do. The second and more important reason for ARM first is that the instruction lengths are fixed size and aligned. Disassembling variable-length instructions like x86 can be a nightmare as your first project, and the goal here is to learn the instruction set, not to create a research project. Third, ARM is a well-done instruction set: registers are created equal and don't have individual special nuances.
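As a rough illustration of why fixed-size, aligned instructions make a first disassembler easy, here is a minimal Python sketch of a linear sweep over classic 32-bit ARM words. It is an assumption-laden toy: it assumes little-endian code, handles only unconditional register-to-register data-processing instructions with no shift and no flag update, and prints everything else as a raw `.word`. The function names are invented for this example.

```python
import struct

# Opcode field (bits 24..21) for a few ARM data-processing instructions.
OPCODES = {0b0000: "and", 0b0001: "eor", 0b0010: "sub",
           0b0100: "add", 0b1100: "orr", 0b1101: "mov"}


def decode_word(word: int) -> str:
    """Decode one 32-bit ARM word; anything outside the tiny subset becomes '.word'."""
    cond = word >> 28                                # 0xE means "always"
    register_form = ((word >> 25) & 0b111) == 0      # bits 27..25 all zero
    plain_rm = ((word >> 4) & 0xFF) == 0             # bits 11..4 zero: Rm used unshifted
    sets_flags = (word >> 20) & 1                    # S bit
    op = (word >> 21) & 0xF
    if cond == 0xE and register_form and plain_rm and not sets_flags and op in OPCODES:
        rn = (word >> 16) & 0xF
        rd = (word >> 12) & 0xF
        rm = word & 0xF
        if OPCODES[op] == "mov":                     # mov ignores the Rn field
            return f"mov r{rd}, r{rm}"
        return f"{OPCODES[op]} r{rd}, r{rn}, r{rm}"
    return f".word 0x{word:08x}"


def linear_sweep(code: bytes) -> list[str]:
    """Every instruction is 4 bytes, so just walk the buffer one word at a time."""
    count = len(code) // 4
    words = struct.unpack(f"<{count}I", code[:count * 4])
    return [decode_word(w) for w in words]


sample = struct.pack("<4I", 0xE0810002, 0xE1A00001, 0xE0443005, 0xE12FFF1E)
for text in linear_sweep(sample):
    print(text)
```

Growing the decoder one instruction class at a time, while checking it against the output of a trusted disassembler such as the one in binutils, is the ping-pong exercise this answer describes.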

Oh, about disassembling variable-length instruction sets: you cannot simply start at the beginning and disassemble every four-byte word linearly through memory as you would with ARM, or every two bytes as with the msp430 (the msp430 has variable-length instructions, but you can still get by going linearly through memory if you start at the entry points from the interrupt vector table). For variable length you want to find an entry point based on a vector table or knowledge about how the processor boots, and follow the code in execution order. You have to decode each instruction completely to know how many bytes it uses; then, if the instruction is not an unconditional branch, assume the next byte after that instruction is the start of another instruction. You have to store all possible branch addresses as well and assume those are the starting byte addresses for more instructions. The one time I was successful I made several passes through the binary. Starting at the entry point, I marked that byte as the start of an instruction, then decoded linearly through memory until hitting an unconditional branch. All branch targets were tagged as starting addresses of an instruction. I made multiple passes through the binary until I found no new branch targets. If at any time you find, say, a 3-byte instruction but for some reason you have tagged the second byte as the beginning of an instruction, you have a problem. If the code was generated by a high-level compiler this shouldn't happen unless the compiler is doing something evil; if the code has hand-written assembler (like, say, an old arcade game) it is quite possible that there will be conditional branches that can never happen, like r0=0 followed by a jump if not zero. You may have to hand-edit those out of the binary to continue. For your immediate goals, which I assume will be on x86, I don't think you will have a problem.

So you will have to figure out what processor you want to start with. I suggest the msp430 or ARM first, with ARM either first or second, then the chaos of x86. No matter what platform, any platform worth using has data sheets or programmer's reference manuals free from the vendor that include the instruction set as well as the encoding of the opcodes (the bits and bytes of the machine language). For the purpose of learning what the compiler does, and how to write code that the compiler doesn't have to struggle with, it is good to know a few instruction sets and see how the same high-level code is implemented on each instruction set with each compiler at each optimization setting. You don't want to get into optimizing your code only to find that you have made it better for one compiler/platform but much worse for every other.
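The execution-order approach described above can be sketched in a few lines of Python. Everything here is hypothetical: `TOY_ISA` is an invented variable-length instruction set (not x86 or msp430) whose branch operand is a one-byte absolute address, and the sketch uses a worklist rather than repeated passes over the binary, which reaches the same fixed point.

```python
# Hypothetical toy ISA: opcode byte -> (mnemonic, total instruction length in bytes).
TOY_ISA = {0x00: ("nop", 1), 0x01: ("load", 2), 0x02: ("jmp", 2),
           0x03: ("jnz", 2), 0x04: ("loadw", 3), 0xFF: ("halt", 1)}


def find_instruction_starts(code: bytes, entry: int = 0) -> set[int]:
    """Follow the code in execution order and return every instruction start address."""
    starts: set[int] = set()       # addresses known to begin an instruction
    interior: set[int] = set()     # addresses known to be operand bytes
    pending = [entry]              # branch targets still to be explored
    while pending:
        pc = pending.pop()
        while 0 <= pc < len(code) and pc not in starts:
            if pc in interior:
                raise ValueError(f"address {pc} is both an operand byte and an instruction start")
            if code[pc] not in TOY_ISA:
                raise ValueError(f"unknown opcode 0x{code[pc]:02x} at address {pc}")
            name, length = TOY_ISA[code[pc]]
            if pc + length > len(code):
                raise ValueError(f"instruction at address {pc} runs past the end of the code")
            body = range(pc + 1, pc + length)
            if any(addr in starts for addr in body):
                raise ValueError(f"instruction at address {pc} overlaps a tagged instruction start")
            starts.add(pc)
            interior.update(body)
            if name in ("jmp", "jnz"):
                pending.append(code[pc + 1])   # operand is the absolute branch target
            if name in ("jmp", "halt"):
                break                          # no fall-through after these
            pc += length
    return starts


# Address 6 holds a data byte (0x04) that a linear sweep would misread as 'loadw'.
program = bytes([0x01, 0x05,   # 0: load 5
                 0x03, 0x07,   # 2: jnz 7
                 0x02, 0x08,   # 4: jmp 8
                 0x04,         # 6: data, never executed
                 0x00,         # 7: nop
                 0xFF])        # 8: halt
print(sorted(find_instruction_starts(program)))
```

The two `ValueError` checks for overlapping addresses correspond to the "you have a problem" case, where a byte ends up tagged as both the middle of one instruction and the start of another.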

Thanks for the reply. But I don't even know how to write a disassembler.

The short (okay, slightly shortER) answer to your question: write a disassembler to learn an instruction set. I would start with something RISCy and easy to learn like ARM. Once you know one instruction set, others become much easier to pick up, often in a few hours; by the third instruction set you can start writing code almost immediately, using the datasheet/reference manual for the syntax. All processors worth using have a datasheet or reference manual that describes the instructions down to the bits and bytes of the opcodes. Learn a RISC processor like ARM and a CISC like x86 enough to get a feel for the differences: things like having to go through registers for everything versus being able to perform operations directly on memory with fewer or no registers, three-operand instructions versus two, etc. As you tune your high-level code, compile for more than one processor and compare the output. The most important thing you will learn is that no matter how well the high-level code is written, the quality of the compiler and the optimization choices made make a huge difference in the actual instructions. I recommend llvm and gcc (with binutils); neither produces great code, but they are multi-platform and multi-target and both have optimizers. And both are free, and you can easily build cross compilers from sources for various target processors.

Write your own disassembler. Not for the purpose of making the next greatest disassembler; this one is strictly for you. The goal is to learn the instruction set, whether you are learning assembler on a new platform or remembering assembler for a platform you once knew. Start with only a few lines of code, adding registers for example, and ping-pong between disassembling the binary output and adding more and more complicated instructions on the input side.


MIPS is nice. 68000 is, too, and if you learn 68000 you can write binaries that run in MAME. :-)

One of the standard pedagogic assembly languages out there is MIPS. You can get MIPS simulators (spim) and various teaching materials for it.

Personally, I'm not a fan. I rather like IA32.


Are you doing other dev work on Windows? In which IDE? If it's VS, then there's no need for an additional IDE just to read disassembled code: debug your app (or attach to an external app), then open the disassembly window (in the default settings, that's Alt+8). Step and watch memory/registers as you would through normal code. You might also want to keep a registers window open (Alt+5 by default).

Finally, you can benefit quite a bit from reading some low-level blogs. These byte-size info bits work best for me, personally.

If you care to invest in a book, here is a nice introductory text. Search Amazon for 'x86' and you'll get many others. You can get several other directions from another question here.

Intel provides free manuals that give both a survey of the basic architecture (registers, processor units, etc.) and a full instruction reference. As the architecture matures and gets more complex, the 'basic architecture' manuals grow less and less readable. If you can get your hands on an older version, you'd probably have a better place to start (even the P3 manuals; they explain the same basic execution environment better).


I wouldn't focus on trying to write programs in assembly, at least not at first. If you're on x86 (which I assume you are, since you're using Windows), there are tons of weird special cases that it's kind of pointless to learn. For example, many instructions assume you're operating on a register that you don't explicitly name, and other instructions work on some registers but not others.

I would learn just enough about your intended architecture that you understand the basics, then just jump right in and try to understand your compiler's output. Arm yourself with the Intel manuals and just dive right into your compiler's output. Isolate the code of interest into a small function, so you can be sure to understand the entire thing.

I would consider the basics to be:

  • registers: how many are there, what are their names, and what are their sizes?
  • operand order: add eax, ebx means "Add ebx to eax and store the result in eax".
  • calling conventions: how are parameters passed to a function?
  • FPU: learn the basics of the floating-point stack and how you convert to/from fp.

A lot of the time it will be surprising what the compiler emits. Make it a puzzle of figuring out why the heck the compiler thought this would be a good idea. It will teach you a lot.
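To make the operand-order, register-size, and implicit-register points from this answer concrete, here is a small Python model of three x86 behaviours. It is only a sketch: the register file is a plain dict, flags are ignored, and the helper names (`add`, `div`, `subregisters`) are invented for the illustration.

```python
MASK32 = 0xFFFFFFFF


def add(regs: dict, dst: str, src: str) -> None:
    """add dst, src (Intel operand order): dst = dst + src, wrapping at 32 bits."""
    regs[dst] = (regs[dst] + regs[src]) & MASK32


def div(regs: dict, src: str) -> None:
    """div src: unsigned divide of the 64-bit value edx:eax by src.

    eax and edx are never named in the instruction; they are implicit operands.
    The quotient goes to eax and the remainder to edx.
    """
    if regs[src] == 0:
        raise ZeroDivisionError("the CPU raises a divide error here")
    dividend = (regs["edx"] << 32) | regs["eax"]
    quotient, remainder = divmod(dividend, regs[src])
    if quotient > MASK32:
        raise OverflowError("quotient does not fit in eax; the CPU raises a divide error")
    regs["eax"], regs["edx"] = quotient, remainder


def subregisters(eax: int) -> dict:
    """ax, al and ah are views onto parts of eax, not separate registers."""
    return {"ax": eax & 0xFFFF, "al": eax & 0xFF, "ah": (eax >> 8) & 0xFF}


regs = {"eax": 7, "ebx": 3, "ecx": 0, "edx": 0}
add(regs, "eax", "ebx")     # eax is both an input and the destination
div(regs, "ebx")            # only ebx is written in the instruction text
print(regs, subregisters(0x12345678))
```

Forgetting to clear edx before a div is a classic surprise, and it is exactly the kind of special case you pick up faster from reading compiler output than from writing programs from scratch.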

It will probably also help to arm yourself with Agner Fog's manuals, especially the instruction listing one. It will tell you roughly how expensive each instruction is, though this is harder to directly quantify on modern processors. But it will help explain why, for example, the compiler goes so far out of its way to avoid issuing an idiv instruction.
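As one example of what avoiding a divide can look like, compilers commonly replace division by a constant with a multiply and a shift. The Python sketch below shows the unsigned 32-bit divide-by-10 case; the signed case that idiv covers uses the same idea with extra adjustments, which are left out here. The name `div10` is made up for the example.

```python
MAGIC = 0xCCCCCCCD   # ceil(2**35 / 10)


def div10(n: int) -> int:
    """Compute n // 10 for an unsigned 32-bit n with one multiply and one shift."""
    if not 0 <= n <= 0xFFFFFFFF:
        raise ValueError("n must fit in an unsigned 32-bit register")
    # On x86 this is a widening multiply followed by a right shift of the
    # 64-bit product; no divide instruction is involved.
    return (n * MAGIC) >> 35


for n in (0, 9, 10, 99, 123456789, 0xFFFFFFFF):
    print(n, div10(n), n // 10)
```

Seeing a strange constant like 0xCCCCCCCD next to a multiply and a shift in compiler output is usually a sign of this transformation.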

My only other piece of advice is to always use Intel syntax instead of AT&T when you have a choice. I used to be pretty neutral on this point, until the day I realized that some instructions are totally different between the two (for example, movslq in AT&T syntax is movsxd in Intel syntax). Since the manuals are all written using Intel syntax, just stick with that.
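For the simple cases, the two syntaxes differ mostly in operand order and decoration. The Python sketch below converts a register-to-register instruction from Intel to AT&T syntax. It is deliberately restricted to a few mnemonics and the eight 32-bit general registers, because, as noted above, some instructions have entirely different names in the two syntaxes and a mechanical rewrite would get them wrong. `intel_to_att` is a hypothetical helper, not part of any real tool.

```python
REG32 = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"}
SIMPLE_OPS = {"mov", "add", "sub", "and", "or", "xor", "cmp"}


def intel_to_att(line: str) -> str:
    """Convert 'op dst, src' (Intel) to 'opl %src, %dst' (AT&T) for 32-bit registers."""
    mnemonic, operands = line.strip().split(None, 1)
    dst, src = (part.strip() for part in operands.split(","))
    if mnemonic not in SIMPLE_OPS:
        raise ValueError(f"mnemonic {mnemonic!r} is outside this sketch's table")
    if dst not in REG32 or src not in REG32:
        raise ValueError("only 32-bit register-to-register forms are handled")
    # AT&T: source first, destination last, '%' on registers, 'l' suffix for 32 bits.
    return f"{mnemonic}l %{src}, %{dst}"


print(intel_to_att("add eax, ebx"))
print(intel_to_att("mov ecx, edx"))
```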


I found Hacking: The Art of Exploitation to be an interesting and useful way into this topic... I can't say that I have ever used the knowledge directly, but that's really not why I read it. It gives you a much richer appreciation of the instructions that your code compiles to, which has occasionally been useful in understanding subtler bugs.

Don't be put off by the title. Most of the first part of the book is "Hacking" in the Eric Raymond sense of the word: creative, surprising, almost sneaky ways to solve tough problems. I was (and maybe you will be) a lot less interested in the security aspects.


I started out learning MIPS, which is a very compact 32-bit architecture. It is a reduced instruction set, but that's what makes it easy to grasp for beginners. You will still be able to understand how assembly works without getting overwhelmed by complexity. You can even download a nice little IDE, which will allow you to compile your MIPS code: clicky. Once you get the hang of it, I think it will be much easier to move on to more complex architectures. At least that's what I thought :) At that point you will have the essential knowledge of memory allocation and management, logic flow, debugging, testing, etc.
