Optimization of Microchip PIC XC8 compiler in Free and Pro mode

I don’t like programming PICs in C. In fact, I used to hate it because of the poor quality of the C compilers.

When I started programming PIC microcontrollers in 1998 there were not many options for programming them in C. As far as I remember, only Hi-Tech, IAR and CCS had compilers (not even Microchip had its own), and the code they generated was quite horrible. The fault, however, lay not with the compiler vendors but with the PIC core architecture.

In those days Microchip had only what we now know as the ‘base-line’ (12C50X) and ‘mid-range’ (16C54,16F84,16F87X…) architectures. Those cores were so simple that it was not easy to make a C compiler for them: little memory, scarce resources, a small instruction set, few addressing modes…
Anyway, who needs a C compiler for such simple architectures?

Years later Microchip released the more C-oriented PIC17/PIC18 architecture, and a new range of C compilers was created for the new PICs. Finally we had “reasonably efficient” tools to program Microchip microcontrollers in C!

Two years ago Microchip bought Hi-Tech and renamed its PICC compiler XC8. With this move, Microchip gave its customers a cheap and decent C compiler, as the old and deprecated C18 compiler was, in my opinion, full of bugs and not worth working with.

I still use ASM to program the PIC12 and PIC16 families. I do program the PIC18 devices in C, but I have often had to dive into the assembly of the generated binary to optimize it.
In those optimizations I have seen compilers do weird things, and I have long wanted to write about them.
Today I am only going to write briefly about how the Free mode of the XC8 compiler bloats the binary to make the Pro version look more efficient.

I have spent the past week working on a new design and, as my most important requirement is the size of the PCB, I decided to use the new PIC 12F1840.
This microcontroller uses the new ‘enhanced mid-range’ architecture, a revision of the classic 16F architecture with new C-oriented features that should make C programming more efficient. I decided to break my rule of not programming a PIC16 in C and use the XC8 compiler to test that “C-oriented” core.

The firmware has a time-critical section where an algorithm is executed to measure the frequency of a signal. Running at 16MHz, I calculated I had 120 instruction cycles to execute the algorithm: getting the CCP1 comparator value, checking that it deviates only slightly from the previously captured values, calculating the median and storing it in a buffer.
Could the C compiler be efficient enough to run the algorithm in just those 120 cycles? I estimated between 30 and 40 cycles if I coded it in ASM, so I was asking for a C compiler just 25% to 33% as efficient as hand-written assembly.
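The budget above can be checked with a few lines of C. This is only a sketch: it assumes that the 16MHz figure is the oscillator frequency and that the core executes one instruction cycle every 4 oscillator periods, as mid-range PICs do. The function names are mine, chosen for illustration.

```c
#include <stdint.h>

/* Instruction cycles per second, assuming one instruction cycle
 * every 4 oscillator periods (Fosc/4). */
uint32_t instruction_rate_hz(uint32_t fosc_hz)
{
    return fosc_hz / 4u;
}

/* Length in nanoseconds of a window of `cycles` instruction cycles. */
uint32_t budget_ns(uint32_t fosc_hz, uint32_t cycles)
{
    return (uint32_t)((uint64_t)cycles * 1000000000ull
                      / instruction_rate_hz(fosc_hz));
}

/* Share of the cycle budget consumed, in percent, rounded down. */
uint32_t budget_used_percent(uint32_t used_cycles, uint32_t budget_cycles)
{
    return used_cycles * 100u / budget_cycles;
}
```

Under those assumptions, budget_ns(16000000, 120) gives a 30 microsecond window, the 30 to 40 cycle ASM estimate uses 25% to 33% of it, and a 230-cycle routine needs almost twice the available time.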

Well… the answer is NO. XC8 was far from efficient enough to fit the algorithm: the generated code ran in 230 instruction cycles. In the end, I had to check the disassembly listing and replace the poorly optimized C statements with #ASM blocks.

I bet nobody is surprised by this result and, as you can imagine, I am not writing this post to tell you “C is bad. Real programmers use ASM”. I am writing it to show you a questionable technique Microchip uses to “motivate” you to use the paid version of XC8 instead of the free one.

Some time before Microchip bought Hi-Tech, Hi-Tech wanted to release a free version of its PICC compiler (the current XC8).
They decided to license a version of the compiler with no optimizations enabled, so the Lite (free) version was fully functional but not efficient in terms of size and speed. Even without the optimizations, the Lite version was very attractive, especially to hobbyists and small companies who might prefer to use a more powerful PIC rather than pay for a license to get a more efficient compiler.
To encourage the purchase of a Pro license, Hi-Tech decided to make the gap between the Lite and Pro versions even bigger by inserting garbage into the generated binary file in Lite mode.

When Microchip bought Hi-Tech, they kept the licensing scheme, and to this day the Free version of XC8 has this “free compulsory de-optimizer” feature.

Let’s see an example of how this ‘deoptimizer’ works!

A simple example

The following examples were built with MPLAB X 1.95 and XC8 1.21, both the most recent versions as of 28 December 2013.

Let’s analyze how XC8 compiles this C code:

This snippet stores in an 8-bit (UINT8) variable called maxDev the value of the lower byte of the CCPR1 register divided by 16.
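The listing itself is not reproduced here. Going by this description and by the statement quoted later in the article (maxDev = CCPR1L >> 4), a host-compilable sketch of it could look like the code below. CCPR1L is declared here as an ordinary variable standing in for the real special function register, and the wrapper function update_max_dev is a hypothetical name added only so the statement can be exercised off-target.

```c
#include <stdint.h>

typedef uint8_t UINT8;      /* 8-bit type name used in the article */

/* Stand-in for the low byte of the CCPR1 capture register.
 * On the real device this is a special function register provided
 * by the device header, not a plain variable (assumption for the
 * sake of a host build). */
volatile UINT8 CCPR1L;

UINT8 maxDev;

/* Hypothetical wrapper around the single statement under study. */
void update_max_dev(void)
{
    maxDev = CCPR1L >> 4;   /* low byte of CCPR1 divided by 16 */
}
```

With CCPR1L holding 0xA7, maxDev ends up as 0x0A.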

Using XC8 1.21 in Free mode, the assembled result is:

(Please note that this is a copy-paste of a disassembly listing. The first column is the address, the second is the opcode and the third is the mnemonic.)

This code is 10 instructions long and takes 22 instruction cycles to execute. OH GOD!

This is how I would do it:

Just 5 instructions and 5 cycles: half the size and less than a quarter of the execution time.

What is wrong with the ASM generated by the XC8 compiler? Three things:

  1. It uses a loop to rotate the register as many times as indicated (4 in this example).
    This is the right way to write it if you care about code size rather than speed and you are going to rotate the register at least 5 times. If you rotate it fewer than 5 times, it is smaller to just repeat the rotate instruction as many times as you need (as in my proposed ASM code).
    The drawback of the loop is speed: it is 4 times slower than just stringing the rotate instructions together.
  2. The value of CCPR1L is stored in a temporary variable at 0x72 and operated on there. To save the byte of the temporary variable and the 2 instruction cycles spent moving the values, I would move the CCPR1L value directly to the maxDev variable and manipulate it there.
    The reason the compiler doesn’t make this saving is that it doesn’t want the variable maxDev to hold incorrect values, so it calculates the final value in a temporary variable before copying it to maxDev.
    In my proposed ASM code, the maxDev variable is written with 3 different incorrect values before getting the final correct one. If an interrupt occurs in the middle of the calculation and the ISR accesses maxDev, it will read an incorrect value! As I am the one programming the ISR, I know this is not going to happen, so I manipulate maxDev directly instead of using a temporary variable. XC8 is not that smart, so it assumes every variable must be accessed in an atomic and safe way.
  3. What the hell is going on in lines 37 and 38!? Why are we moving the W register to the temporary variable 0x78 just to read it back into W in the next instruction? It makes no sense to read it back if you already have the value in W!
    In case you are wondering, the value stored in 0x78 is not used anywhere. In fact, the variable is overwritten by another nonsensical operation like this one within the next 10 instructions.
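Points 1 and 2 can be modelled in plain C to see what an ISR could observe. This is a sketch, not the compiler's actual output: it assumes the hand-written version stores the value already shifted once and then shifts maxDev three more times, which is consistent with the "3 different incorrect values" mentioned above. The trace buffer and all function names are hypothetical helpers that record every write to maxDev.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACE 8

/* Every value written to maxDev, in order (illustration only). */
uint8_t trace[MAX_TRACE];
size_t trace_len;

/* Write to maxDev and remember what an ISR could have seen. */
void store_max_dev(uint8_t *maxDev, uint8_t value)
{
    *maxDev = value;
    if (trace_len < MAX_TRACE) {
        trace[trace_len++] = value;
    }
}

/* Compiler style: shift in a loop on a temporary, then one store. */
void shift_via_temp(uint8_t ccpr1l, uint8_t *maxDev)
{
    uint8_t tmp = ccpr1l;
    for (uint8_t n = 4; n != 0; n--) {
        tmp = (uint8_t)(tmp >> 1);
    }
    store_max_dev(maxDev, tmp);
}

/* Hand-written style: unrolled shifts performed directly on maxDev. */
void shift_in_place(uint8_t ccpr1l, uint8_t *maxDev)
{
    store_max_dev(maxDev, (uint8_t)(ccpr1l >> 1));
    store_max_dev(maxDev, (uint8_t)(*maxDev >> 1));
    store_max_dev(maxDev, (uint8_t)(*maxDev >> 1));
    store_max_dev(maxDev, (uint8_t)(*maxDev >> 1));
}
```

For an input of 0xA7, the first version writes maxDev once (0x0A), while the second writes 0x53, 0x29 and 0x14 before the final 0x0A. That is harmless only when no interrupt handler reads maxDev in between.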

Points 1 and 2 are understandable. Compilers use generic algorithms to translate C to ASM that are good enough in most cases but, because they are “too general, wide and safe”, they are not optimal. Better compilation algorithms or a second optimization pass should improve this.
However, I don’t understand point 3. Why those useless instructions? Is this just an isolated, one-off incident?

Trying to improve the compilation

I am going to try to modify the C source so that no #ASM blocks are needed to optimize the code.
I refactor the last example to this:

Note that this C snippet is NOT EXACTLY the same as the previous one. Here I am explicitly assigning intermediate values to the maxDev variable before reaching the final value, while in the previous example I was atomically writing the correct value to maxDev (remember point 2?). I expect this code to be assembled into something much closer to the ASM I proposed.

This is the result:

WHAAAT!? The structure is very close to what I was expecting, but… why are lines 31 and 32 moving from W to a temp var and from the temp var back to W? Again, the value stored in the temp var is not used anywhere. And why is it using LSRF for the first rotation and BCF CARRY + RRF for the following ones?

Anyway, this solution is much better in speed. 10 instructions, 10 cycles.

Note: LSRF is an instruction introduced in the enhanced mid-range architecture that performs a logical shift right, that is, a shift that fills the vacated bit with 0 instead of the Carry bit. It is virtually the same as clearing the Carry flag and then doing a regular RRF (its left-shift counterpart, LSLF, relates to RLF in the same way).
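The equivalence in this note can be checked exhaustively with a small C model of the two sequences. The model is a sketch covering only the file register and the Carry flag (other status bits are ignored), and the type and function names are mine.

```c
#include <stdbool.h>
#include <stdint.h>

/* Minimal model: one file register plus the Carry flag. */
typedef struct {
    uint8_t reg;
    bool carry;
} Alu;

/* RRF: rotate right through Carry. The old Carry enters bit 7 and
 * the old bit 0 becomes the new Carry. */
Alu rrf(Alu s)
{
    Alu r;
    r.reg = (uint8_t)((s.reg >> 1) | (s.carry ? 0x80u : 0x00u));
    r.carry = (s.reg & 0x01u) != 0;
    return r;
}

/* LSRF: logical shift right. A 0 enters bit 7 and the old bit 0
 * becomes the new Carry. */
Alu lsrf(Alu s)
{
    Alu r;
    r.reg = (uint8_t)(s.reg >> 1);
    r.carry = (s.reg & 0x01u) != 0;
    return r;
}

/* BCF STATUS,C: clear the Carry flag. */
Alu bcf_carry(Alu s)
{
    s.carry = false;
    return s;
}

/* True if LSRF and BCF CARRY + RRF agree for every register value
 * and both initial Carry states. */
bool lsrf_matches_bcf_rrf(void)
{
    for (unsigned v = 0; v < 256u; v++) {
        for (int c = 0; c < 2; c++) {
            Alu s = { (uint8_t)v, c != 0 };
            Alu a = lsrf(s);
            Alu b = rrf(bcf_carry(s));
            if (a.reg != b.reg || a.carry != b.carry) {
                return false;
            }
        }
    }
    return true;
}
```

In this model the two forms produce the same register value and the same Carry for all 512 input states; the only difference is that LSRF does it in one instruction instead of two.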

OK! Let’s refactor again!

Logic says that the assembled code should be exactly the same!
Well… remember! XC8 doesn’t respond to logic or reason:

16 instructions, 16 cycles. It is getting worse!

HEY! LSRF instruction is back! No more BCF Carry + RRF!
And again we have those instructions that move W to a useless temporary variable. 4 times! I am starting to see a pattern here… it looks like each time we have an assignment operator (=), XC8 inserts two useless instructions that move the assigned value to a temporary variable and then read it back.

Using the Pro version

The previous examples were compiled with the free version of XC8 (Linux ver. 1.21). When you use the compiler it warns you with this message:

Running this compiler in PRO mode, with Omniscient Code Generation enabled,
produces code which is typically 40% smaller than in Free mode.

Where does this 40% improvement come from? From real optimizations, or from removing those useless artifacts that seem to be inserted on purpose?

Let’s compile the previous examples using the XC8 Pro version (Linux ver. 1.21, activated in 60-day trial mode).
This is the first example:

WOW! IT REALLY DID A GOOD OPTIMIZATION!
From the 10 instructions / 22 cycles generated in Free mode, the code has been reduced to only 3 instructions / 3 cycles. It is even better than my proposed ASM code by 2 instructions!!!
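The Pro listing is not reproduced here, so the following is an assumption about how a 4-bit right shift can fit in 3 instructions rather than a description of what XC8 actually emitted. One well-known PIC idiom is to swap the nibbles of the byte (SWAPF), mask off the high nibble (ANDLW 0x0F) and store the result. The C model below only checks that this trick is arithmetically equal to a shift by 4; the function names are mine.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of SWAPF: exchange the high and low nibbles of a byte. */
uint8_t swap_nibbles(uint8_t v)
{
    return (uint8_t)((v << 4) | (v >> 4));
}

/* Swap, then keep only the low nibble (model of ANDLW 0x0F). */
uint8_t shift4_via_swap(uint8_t v)
{
    return (uint8_t)(swap_nibbles(v) & 0x0Fu);
}

/* True if the swap-and-mask trick equals v >> 4 for every byte. */
bool swap_trick_matches_shift(void)
{
    for (unsigned v = 0; v < 256u; v++) {
        if (shift4_via_swap((uint8_t)v) != (uint8_t)(v >> 4)) {
            return false;
        }
    }
    return true;
}
```

The trick works only because the shift count is exactly one nibble; other shift counts need a different approach.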

What about the second example?

This is exactly the ASM I would code! Well played XC8! Well played…

And the third?

It produces exactly the same code as the previous example. This is what we should expect, as both C snippets are equivalent, unlike the Free version, which produces different code for each.

So far the XC8 Pro version works really well. It optimized these simple examples as well as I could manually. In fact, I am so surprised by how it treated the maxDev = CCPR1L >> 4 statement in example #1 that I had to run some experiments with rotations. Let’s play with the number of rotations:

And let’s see the disassembly listing:

Uhmmm… this time I am not so surprised by the compiler’s output. I would code it like this:

A little bit more efficient: you save two instructions/cycles.

And if I don’t even care about using the variable maxDev to store intermediate values (read point 2 of disassembly listing #1), this is even more efficient:

Let’s do it again but this time rotating 6 times:

The disassembly listing using Pro version:

So… the compiler falls back on the generic algorithm it used when compiling example #1 in Free mode. It detects that, when rotating 6 times, a loop is more compact than inserting as many LSRF instructions as needed.
I am a little bit disappointed here as I was hoping to see a “trick” like the compiler did when compiling the example #1 in Pro mode. I was expecting something like

It looks like the Pro version of XC8 is clever enough to apply different compilation patterns as needed to optimize the resulting binary. The Free version, on the other hand, always uses the same generic algorithm. Paying for the Pro version of XC8 is therefore justified for those who need to get the most out of their microcontrollers.

The Free “de-optimizer”

Do you remember my hypothesis about the free XC8 version bloating the binary code with useless instructions? Could it be done on purpose, to make the Pro version look more attractive compared with the Free version?

As I stated before, it looks like in Free mode the XC8 compiler bloats every assignment (the = operator) with two useless instructions that store the assigned value in a temporary variable.
Let’s compare more disassembly listings using Free and Pro mode to verify this theory:

In Free mode (XC8 1.21):

In Pro mode:

Yes! The two useless instructions are present when using the free version and disappear with the Pro mode.
One more example:

As this statement can be rewritten as irRecord=irRecord+1, will the Free mode insert those padding instructions? The answer is YES.

Compiling with the Free version (XC8 1.21):

In Pro mode:

The last example:

Compiling with the Free version (XC8 1.21):

In Pro mode:

It is clear that XC8 is inserting padding instructions when working in Free mode to make it look slower, but it could be worse… much worse in fact! Only a few months ago the free version of XC8 was using another dirty trick to make the compiled binary bigger and slower.

Now we are going to study this C code:

And we are going to compile it using XC8 in Free mode, but this time using the one-year-old 1.12 version instead of the current 1.21.

WOW! This simple bit test needs 8 instructions and takes 6 or 8 cycles to run (depending on the value of flagIR). The instruction at 0x001D is never even executed!

Let’s compare this code with the one generated by the XC8 1.12 Pro version:

Only 3 instructions and 3 cycles. No padding GOTOs that consume space and execution cycles.

Another example:

In Free mode:

Pro version:

It is obvious that when running in Free mode, XC8 is bloating the generated code for if-else structures with useless GOTOs. Remember that in the PIC microarchitecture each jump takes 2 instruction cycles, so these padding GOTOs have a big impact on execution speed.
Strangely, when using a switch structure, no bloating is performed.

Fortunately, this ‘GOTO padding feature’ disappeared in version 1.20, released in July 2013. In fact, look what the “Change Log” says about version 1.20:

I compiled these examples with the old Hi-Tech PICC compiler and got the same results, or worse: the Lite (free) version is bloated with even more useless instructions, like random NOPs.

Conclusions

XC8 is a decent compiler. It is not as mature as compilers for other architectures like MIPS (PIC32) or ARM, but you can expect efficiency and stability (something you couldn’t expect from the deprecated Microchip C18 compiler). Of course, I still prefer to use ASM for PIC12 and PIC16, as the resulting binary is about half the size of what you get when programming in C. However, for non-critical tasks where space and speed are not constraints, XC8 is your friend.

Is the Pro version worth the $995? It depends…

It is true that the Pro version generates more efficient binaries, but I think the 40%-80% gain that Microchip claims over the Free version is a little bit exaggerated. It is probably closer to 30%-40%, and that is even counting the bloat XC8 adds to the binary in Free mode.

If you are an electronics company, it is easy to recoup the investment in the Pro compiler. If the Pro version lets your code fit in a smaller device and saves 5 cents per microcontroller, you will recover that money very soon. This kind of company would probably pay $1000 even for just a 10% gain in efficiency.
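To put a number on “very soon”, here is a tiny break-even calculation using the two figures quoted in this post: a license price of $995 and a saving of 5 cents per microcontroller. The function name is mine, and working in integer cents is just a convenience to avoid floating point.

```c
#include <stdint.h>

/* Number of units that must be produced before the per-unit saving
 * covers the license price. Both amounts are in cents; the result
 * is rounded up to a whole unit. */
uint32_t break_even_units(uint32_t license_cents,
                          uint32_t saving_cents_per_unit)
{
    return (license_cents + saving_cents_per_unit - 1u)
           / saving_cents_per_unit;
}
```

With those inputs, break_even_units(99500, 5) gives 19,900 units, a volume a large manufacturer can reach quickly but a small batch producer may never see.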

However, it is harder to justify for small companies or freelance engineers, where smaller production batches are common. In these cases it is cheaper to use, if possible, a bigger and faster microcontroller with the Free version of the compiler. Or, what is worse, these small companies will prefer another microcontroller vendor with better tools!

Most of Microchip’s competitors, like Atmel or ARM (Cortex M0), offer free C compilers (usually based on the GNU toolchain) without any optimization restrictions. Others, like TI with its MSP430, offer their tools with only a binary size restriction, so they promote the use of their products in small projects but force professionals to license the tools for bigger projects.

Microchip bought XC8 from Hi-Tech, a company whose only business was making compilers. For Hi-Tech, disabling some optimizations in the Lite (free) version of its PICC compiler and adding some “deoptimization” features was absolutely understandable. They had to make a clear distinction between their free and paid compilers so people had reasons to buy the Pro version. However, Microchip’s business is selling microchips, not compilers, so they should make all their tools free to promote the use of their products.

Microchip’s recent decision to optimize the GOTOs makes me think that Microchip plans to provide compilers without restrictions in the near future. They are probably waiting some time before doing so because they don’t want to drive out of business all those third-party compiler vendors who have supported the company for so many years. Microchip could be giving them a few courtesy years to change their business before releasing a free, full-featured compiler.