
Look at the generated assembly code and see if it uses a call or jmp instruction for the recursive call on x86 (for other architectures, look up the corresponding instructions). You can use nm and objdump to get just the assembly corresponding to your function. Consider the following function:

int fact(int n)
{
  return n <= 1 ? 1 : n * fact(n-1);
}

Compile as

gcc fact.c -c -o fact.o -O2
# get starting address and size of function fact from nm
ADDR=$(nm --print-size --radix=d fact.o | grep ' fact$' | cut -d ' ' -f 1,2)
# strip leading 0's to avoid being interpreted by objdump as octal addresses
STARTADDR=$(echo $ADDR | cut -d ' ' -f 1 | sed 's/^0*\(.\)/\1/')
SIZE=$(echo $ADDR | cut -d ' ' -f 2 | sed 's/^0*//')
STOPADDR=$(( $STARTADDR + $SIZE ))

# now disassemble the function and look for a recursive call of the form
# call addr <fact+offset>   (printed as callq by some x86-64 disassemblers)
if objdump --disassemble fact.o --start-address=$STARTADDR --stop-address=$STOPADDR | \
    grep -qE 'callq? +[0-9a-f]+ <fact\+'
then
    echo "fact is NOT tail recursive"
else
    echo "fact is tail recursive"
fi

When run on the above function, this script prints "fact is tail recursive". When compiled with -O3 instead of -O2, it curiously prints "fact is NOT tail recursive".

Note that this might yield false negatives, as ehemient pointed out in his comment. This script will only yield the right answer if the function contains no recursive calls to itself at all, and it also doesn't detect sibling recursion (e.g. where A() calls B() which calls A()). I can't think of a more robust method at the moment that doesn't involve having a human look at the generated assembly, but at least you can use this script to easily grab the assembly corresponding to a particular function within an object file.

Naïve factorial and Fibonacci functions are not good examples. Compilers are often subjected to dumb benchmarks with them, so compiler writers make sure to add special optimizations that apply specifically to functions that look like fact/fib.

@Adam Rosenfield: Your solution worked for me with a few minor changes. For example, I'm using C++, and my function is a member function, so the function name becomes something mangled like "_ZN5cycle4stepEib".

@Adam Rosenfield: I find it interesting that it makes that call tail-recursive at all with -O2. Meanwhile, with -O3 it "unrolls the loop" some 10 times, and then actually calls itself.

Just found this question and couldn't help but point out that the code isn't even tail recursive in nature, since the last operation in the function is actually multiplication, not the recursive call to fact.
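For comparison, a version that is tail recursive at the source level, in the accumulator style a compiler effectively derives when it does optimize fact, could look like the sketch below; fact_acc is a hypothetical helper name:

/* Accumulator-passing form: the recursive call is the last operation,
   so a compiler that performs tail-call optimization can turn it into
   a jmp instead of a call. */
static int fact_acc(int n, int acc)
{
  return n <= 1 ? acc : fact_acc(n - 1, n * acc);
}

int fact(int n)
{
  return fact_acc(n, 1);
}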

g++ - How do I check if gcc is performing tail-recursion optimization?...

gcc g++ tail-recursion

At any given time, you cannot use more registers than the CPU offers; however, you can re-use the same register for multiple values, one after the other. That's called register allocation, and register spilling is when values move between the CPU registers and the program's stack via the rSP stack pointer register.

I assume what you call "unnamed registers" are such spilled values. In addition to the registers listed in your question, more recent x86-64 architectures also offer MMX, SSE, AVX registers for storage and some operations, thus increasing your number of registers. Be careful not to trash non-volatile registers though, i.e. check the calling convention of your machine and operating system.
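As a rough illustration of spilling, here is a sketch (external_work is a hypothetical function, and the exact spill decisions depend on the compiler, optimization level and calling convention):

/* Sketch: more values stay live across a call than there are callee-saved
   registers (rbx, rbp, r12-r15 in the x86-64 System V ABI), so a typical
   compiler at -O2 has to park some of them in the stack frame and reload
   them after the call returns. */
void external_work(void);   /* assumed to be defined in another file */

long live_across_call(long a, long b, long c, long d, long e, long f)
{
  long g = a * b, h = c * d, i = e * f, j = a + f;
  external_work();                            /* clobbers caller-saved registers */
  return a + b + c + d + e + f + g + h + i + j;
}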

I think what he's calling "unnamed registers" are actually the large number of internal, non-programmer-accessible registers used by the CPU to implement register renaming.

He's referring to the register renaming technique, not spilling.

@LuVnhPhc Mmm, I don't know. "Unnamed registers" and register renaming seem unrelated; register renaming should not be associated with hidden/unnamed registers. Especially considering that some architectures effectively have register windows, which may be closer to what the OP meant. If they meant internal registers, then register renaming is just a small part of them, so again I think it is irrelevant.

Because the OP talks about x86, I assume x86 architectural registers. Micro-architectural register banks are not accessible and, I think, irrelevant in the context of the question.

@MargaretBloom Register renaming means you have a lot of registers, but only a few are exposed to the user. From a user's perspective the remaining registers are unknown and unnamed, and they can't be accessed directly.

assembly - There aren't enough registers in x86-64 processor - Stack O...

assembly x86-64

I think this question has a false assumption. It's mainly just RISC-obsessed academics who call x86 ugly. In reality, the x86 ISA can do in a single instruction operations which would take 5-6 instructions on RISC ISAs. RISC fans may counter that modern x86 CPUs break these "complex" instructions down into microops; however:

  • In many cases that's only partially true or not true at all. The most useful "complex" instructions in x86 are things like mov %eax, 0x1c(%esp,%edi,4), i.e. addressing modes, and these are not broken down (see the C sketch after this list).
  • What's often more important on modern machines is not the number of cycles spent (because most tasks are not CPU-bound) but the instruction cache impact of the code. 5-6 fixed-size (usually 32-bit) instructions will impact the cache a lot more than one complex instruction that's rarely more than 5 bytes.
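A small C sketch of the addressing-mode point from the first bullet; the store below typically becomes a single x86 instruction with base register, scaled index and displacement, whereas a load/store RISC needs separate address arithmetic plus the store itself:

/* gcc usually emits one instruction of the form
     mov %eax, 0x1c(%ecx,%edx,4)
   for this store (exact registers vary): base, index scaled by 4 and a
   constant displacement all folded into the addressing mode. */
void store_element(int *base, int i, int x)
{
  base[i + 7] = x;    /* displacement 7 * sizeof(int) = 28 = 0x1c */
}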

x86 really absorbed all the good aspects of RISC about 10-15 years ago, and the remaining qualities of RISC (actually the defining one - the minimal instruction set) are harmful and undesirable.

Aside from the cost and complexity of manufacturing CPUs and their energy requirements, x86 is the best ISA. Anyone who tells you otherwise is letting ideology or agenda get in the way of their reasoning.

On the other hand, if you are targeting embedded devices where the cost of the CPU counts, or embedded/mobile devices where energy consumption is a top concern, ARM or MIPS probably makes more sense. Keep in mind, though, that you'll still have to deal with the extra RAM and binary size needed to handle code that's easily 3-4 times larger, and you won't be able to get near the performance. Whether this matters depends a lot on what you'll be running on it.

That's why I qualified "the best" with "aside from the cost...and their energy requirements".

I think Intel's throttling down the CPU speed, and smaller die sizes have largely eliminated the power differential. The new Celeron dual 64-bit CPU with 64k L1 and 1MB L2 caches is a 7.5 watt chip. It's my "Starbucks" hangout machine, and the battery life is ridiculously long and will run rings around a P6 machine. As a guy doing mostly floating point computations I gave up on RISC a long time ago. It just crawls. SPARC in particular was atrociously glacial. The perfect example of why RISC sucks was the Intel i860 CPU. Intel never went THERE again.

@RocketRoy: 7.5 watt isn't really acceptable for a device that's powered 24/7 (and not performing useful computations the whole time) or running off a 3.7v/2000mAh battery.

@RocketRoy "Intel i860 CPU. Intel never went THERE again." After a little research, the i860 sounds a lot like Itanium: VLIW, compiler-ordered instruction parallelism....

assembly - Why is x86 ugly? Why is it considered inferior when compare...

assembly x86 mips x86-64 cpu-architecture

Pico Lisp recently (in the past few years) switched from C to x86-64 assembler. That's the only example I can think of that was undertaken in the "modern" era. There are some older Lisps bootstrapped from assembler still in use. Actually wait, someone recently wrote a Scheme in ARM assembler (http://armpit.sourceforge.net/index.html). I don't know why they would have done such a crazy-sounding thing and I haven't looked at it closely. Of course it's very common to write in C and add some asm functions to implement call/cc or the like.

The BDS C compiler from the 1980's was written in 8080 assembler and the source code was released a few years ago, but it's mostly of historical interest.

PicoLisp also has a reference implementation written in Java. (Which is also used to compile picolisp when there is no picolisp on the system already.)

Programming languages implemented in assembly language - Stack Overflo...

programming-languages assembly forth

Use VirtualAllocEx() to allocate a block of executable memory inside of the target process, then use WriteProcessMemory() to write x86 or x64 machine instructions into that memory block as needed. Have those instructions call LoadLibrary(), GetProcAddress(), and the exported DLL function as needed. Then use CreateRemoteThread() to execute the memory block. Your injector cannot call the exported DLL function directly if it is running in a separate process. The exported function has to be loaded and called within the context of the target process. And do not subtract the return value of LoadLibrary() from the return value of GetProcAddress(). GetProcAddress() returns a direct memory pointer to the function, so it can be called directly.

Update: a variation of this is to put all of your injected code inside of the DLL's entry point (or have the entry point spawn a thread to run the code) when it is called with the DLL_PROCESS_ATTACH reason. Thus no need to export any functions from the DLL. Then you can use VirtualAllocEx() and WriteProcessMemory() to store the DLL's path into the target process, and then use CreateRemoteThread() to invoke LoadLibrary() directly. Kernel functions always have the same memory address across processes, so your injecting process can call GetProcAddress() within its own address space to get the address of LoadLibrary() and then pass that pointer to the lpStartAddress parameter of CreateRemoteThread(). This way, you don't have to worry about writing any x86/x64 assembly code.

This technique is described in more detail in Section 3 of this article:

The DllMain / DLL_PROCESS_ATTACH idea is pretty clever - I hadn't considered the possibility of handling allocation or function resolution on the target side. Wouldn't calling GetProcAddress within the context of the injecting (64-bit) process give you the address of the 64-bit kernel32's LoadLibrary though? (Win7SP1, for instance, has 64-bit LoadLibraryW at 0x776B6F80, and 32-bit LoadLibraryW at 0x76C14913)
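A minimal sketch of that LoadLibrary-path variant, assuming the injector and the target process have the same bitness (as the comment above notes, the 32-bit and 64-bit kernel32 are mapped at different addresses); pid and dll_path are placeholders and error handling is trimmed:

#include <windows.h>
#include <string.h>

int inject_dll(DWORD pid, const char *dll_path)
{
  HANDLE proc = OpenProcess(PROCESS_ALL_ACCESS, FALSE, pid);
  if (!proc) return 0;

  /* 1. allocate memory in the target and copy the DLL path into it */
  SIZE_T len = strlen(dll_path) + 1;
  LPVOID remote = VirtualAllocEx(proc, NULL, len, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
  WriteProcessMemory(proc, remote, dll_path, len, NULL);

  /* 2. kernel32 is mapped at the same base in same-bitness processes, so
     the injector's own LoadLibraryA address is valid in the target too */
  LPTHREAD_START_ROUTINE entry = (LPTHREAD_START_ROUTINE)
      GetProcAddress(GetModuleHandleA("kernel32.dll"), "LoadLibraryA");

  /* 3. run LoadLibraryA(remote) in the target; the DLL's DllMain then
     runs with the DLL_PROCESS_ATTACH reason */
  HANDLE thread = CreateRemoteThread(proc, NULL, 0, entry, remote, 0, NULL);
  if (thread) { WaitForSingleObject(thread, INFINITE); CloseHandle(thread); }

  VirtualFreeEx(proc, remote, 0, MEM_RELEASE);
  CloseHandle(proc);
  return thread != NULL;
}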

c - x86 Code Injection into an x86 Process from a x64 Process - Stack ...

c winapi assembly x86 x86-64

I'm not sure why it hasn't been mentioned yet that this approach is completely misguided and easily broken by compiler upgrades, etc. It would make a lot more sense to determine the time value you want to wait until and spin polling the current time until the desired value is exceeded. On x86 you could use rdtsc for this purpose, but the more portable way would be to call clock_gettime (or the variant for your non-POSIX OS) to get the time. Current x86_64 Linux will even avoid the syscall for clock_gettime and use rdtsc internally. Or, if you can handle the cost of a syscall, just use clock_nanosleep to begin with...
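A sketch of that spin-until-deadline approach on a POSIX system, using clock_gettime with CLOCK_MONOTONIC (on older glibc you may need to link with -lrt):

#include <time.h>
#include <stdint.h>

/* Busy-wait until the given number of nanoseconds has elapsed. Unlike a
   calibrated delay loop, this keeps working when the compiler or CPU gets
   faster; precision is limited by the clock resolution and by how long one
   polling iteration takes. */
static void spin_wait_ns(uint64_t delay_ns)
{
  struct timespec now, deadline;
  clock_gettime(CLOCK_MONOTONIC, &deadline);

  deadline.tv_nsec += (long)(delay_ns % 1000000000u);
  deadline.tv_sec  += (time_t)(delay_ns / 1000000000u);
  if (deadline.tv_nsec >= 1000000000L) {      /* normalize the timespec */
    deadline.tv_nsec -= 1000000000L;
    deadline.tv_sec  += 1;
  }

  do {
    clock_gettime(CLOCK_MONOTONIC, &now);
  } while (now.tv_sec < deadline.tv_sec ||
           (now.tv_sec == deadline.tv_sec && now.tv_nsec < deadline.tv_nsec));
}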

c - How to prevent compiler optimization on a small piece of code? - S...

c optimization gcc avr-gcc

All processors operate on bits; we call that machine code, and it can take on very different flavors for different reasons, from building a better mousetrap to patents protecting ideas. From a user's perspective every processor uses some flavor of machine code; some internally convert that to microcode, another machine code, and others don't. When you hear x86 vs ARM vs MIPS vs PowerPC, those are not just company names: each has its own instruction set, its own machine code, for its respective processors. x86 instruction sets, although evolving, still resemble their history, and you can easily pick out x86 code from the others. The same is true for all of the companies: you can see the MIPS legacy in MIPS and the ARM legacy in ARM, and so on.

So to run a program on a processor, at some point it has to be converted into the machine code for that processor, and then the processor can handle it. Various languages and tools do this in various ways. A compiler is not required to compile from the high-level language to assembly language, but it is convenient. First, you basically need an assembler for that processor anyway, so the tool is there. Second, it can be much easier to debug the compiler by looking at human-readable assembly language rather than at the bits and bytes of machine code. Some compilers, like Java, Python, and the old Pascal compilers, have a universal machine code (each language has its own), universal in the sense that Java on x86 and Java on ARM do the same thing up to that point; then a target-specific (x86, ARM, MIPS) interpreter decodes the universal bytecode and executes it on the native processor. But ultimately it has to become the machine code for the processor it is running on.

There is also some history behind this method of compiling in layers. I would argue it is somewhat the Unix building-block approach: make one block do the front end and another block the back end that outputs asm, then the asm-to-object step is its own tool, and linking the object with others is its own tool. Each block can be contained and developed with controlled inputs and outputs, and at times substituted with another block that fits in the same place. Compiler classes teach this model, so you will see it replicated in new compilers and new languages: the front end parses the text of the high-level language into an intermediate, compiler-specific binary code, then the back end takes that internal code and turns it into assembly for the target processor. This allows, for example, gcc and many others to change that back end so the front and middle can be reused for different targets. Then separately there is an assembler, and also a separate linker, each a tool in its own right.

People keep trying to re-invent the keyboard and mouse, but folks are comfortable enough with the old way that they stick with it even if the new invention is much better. The same is true of compilers and operating systems and so many other things: we go with what we know, and with compilers that often means compiling to assembly language.

Are all programs eventually converted to assembly instructions? - Stac...

assembly

If you can link against the C library, you could call the printf function. On most x86 systems, variadic functions use the cdecl calling convention - arguments are pushed on the stack from right to left (so, first the register value, then the string containing %d), and you have to clean up the stack (add to %esp) after the call.

For more details, please specify what system you're on. (And if you can't link against the C library, you'll need to convert the value into a string by hand in whatever base you want to print.)
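If linking against the C library is possible, one way to see the exact sequence is to write the call in C and read the assembly gcc generates for it; a sketch (print_reg.c is an arbitrary file name):

#include <stdio.h>

/* Build with "gcc -m32 -S print_reg.c" and read print_reg.s: for the cdecl
   call below, the arguments end up on the stack right to left (the format
   string, the leftmost argument, sits at the top of the stack) and the
   caller is responsible for cleaning up the stack afterwards. */
void print_value(int value)    /* stand-in for the register's contents */
{
  printf("%d\n", value);
}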

How do I print the contents of a register in x86 assembly to the conso...

assembly x86

I'm not familiar with the particulars of x64 instructions, so I can't help with rewriting the assembly code to support 64-bit, but I can tell you that Embarcadero's 64-bit compiler does not currently allow you to mix Pascal and Assembly in the same function. You can only write all-Pascal or all-Assembly functions, no mixing at all (a Pascal function can call an Assembly function and vice versa, but they cannot coexist together like in x86). So you will have to rewrite your methods.

It's worth noting that the main reason why you can't mix ASM and Pascal in the same routine is to avoid causing stack frame problems; they're subject to much stricter requirements on Win64 than on Win32. So code that uses asm tricks to set up fake stack frames is something you probably don't want to replicate anyway.

@MarcusAdams x86 opcodes won't work on x64 without modifications if you use pointers within. But some very basic 32-bit code may work in x64 (like function Increment(A: integer): integer; asm inc eax end;). But Chau Chee Yang's kind of code won't work, since it relies on the calling convention and pointers.

What is also interesting is that AFAIK the x64 back-end of FPC is able to nest asm blocks inside a begin..end. Not bad for an Open Source compiler supporting x64 since 2006! ;) I just found out that the great TotalCommander tool is using FPC for its upcoming 64 bit version.

Delphi XE2: Convert a ASM method for Win64 platform - Stack Overflow

delphi delphi-xe2 win64

QueryPerformanceCounter leads to a call inside the OAL (part of the BSP) where the highest-resolution progressive counter available in the system should be used to return a 64-bit value. This looks like an issue with that specific BSP. On x86 the number of hardware timers is limited, and it may be that some driver is using the counter behind QueryPerformanceCounter to implement timers with less than 1 ms resolution. The value seems to be reset and then continue to increase; this may be due to a driver setting a timer for itself (maybe using only the lower 32-bit part of the timer registers). If you have the source code of your BSP, you can find the OEMQueryPerformanceCounter implementation, check which registers it uses, and check whether other components of the BSP access them (or other registers that can affect their operation).
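While digging through the BSP, a small wrapper on the application side can at least log when the counter jumps backwards; a sketch, assuming console output is available (checked_qpc is a hypothetical helper):

#include <windows.h>
#include <stdio.h>

/* Report any backwards jump of the performance counter; such a jump points
   at another BSP component touching the same hardware timer.
   Not thread-safe; sketch only. */
static LONGLONG checked_qpc(void)
{
  static LONGLONG last = 0;
  LARGE_INTEGER now;

  QueryPerformanceCounter(&now);
  if (now.QuadPart < last)
    printf("QueryPerformanceCounter went backwards by %ld ticks\n",
           (long)(last - now.QuadPart));
  last = now.QuadPart;
  return now.QuadPart;
}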

c++ - Time running backwards with QueryPerformanceCounter() - Stack Ov...

c++ timer embedded windows-ce performancecounter

Would it help if you write a simple call to sprintf in C and use gcc -S foo.c?
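For instance (foo.c is just a placeholder name), compiling a wrapper like the one below with gcc -S foo.c shows the x86-64 calling sequence for a variadic function:

#include <stdio.h>

/* In the generated assembly, buf is passed in %rdi, the format string in
   %rsi and the integer in %edx; %eax/%al is typically cleared before the
   call because no vector registers carry variadic arguments here under the
   System V x86-64 ABI. */
void format_value(char *buf, int value)
{
  sprintf(buf, "value=%d", value);
}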

x86 64 - How can I call sprintf from x86_64 assembly? - Stack Overflow

assembly x86-64 varargs

You size the CPU to the task at hand and resources available. That doesn't rule out a CISC or x86 architecture, it just means it may be overkill for some apps.

mips - Why is x86 ugly? aka Why is x86 considered inferior when compar...

x86 mips x86-64 computer-architecture assembly

Well, even though it isn't what some people would call the "best" assembly out there, I would recommend learning x86 / x86-64 as it is the most widely used. To run the program, you can simply use GCC to assemble it into a binary and then run it from your console.

How do I start learning Assembly - Stack Overflow

assembly

A good place to start would be Jeff Duntemann's book, Assembly Language Step-by-Step. The book is about x86 programming under Linux. As I recall, a previous version of the book covered programming under Windows. It's a beginner's book in that it starts at the beginning: bits, bytes, binary arithmetic, etc. You can skip that part if you like, but it might be a good idea to at least skim it.

I think the best way to learn ASM coding is by 1) learning the basics of the hardware and then 2) studying others' code. The book I mentioned above is worthwhile. You might also be interested in The Art of Assembly Language Programming.

I've done quite a bit of assembly language programming in my time, although not much in the last 15 years or so. As one commenter pointed out, the slight size and performance gains are hard to justify when I take into account the increased development and maintenance time compared to a high level language.

That said, I wouldn't discourage you from your quest to become more efficient with ASM. Becoming more familiar with how a processor works at that level can only improve your HLL programming skills, as well.

Getting assembly language programming skills - Stack Overflow

assembly