Microprocessor Design – We're Running Out Of Ideas

MONDAY, 3 JULY 2023

It used to be the case that you could go out to the store every year and buy a phone or computer twice as fast as your current one. And then it was every two years. Then three. Next thing you know, you're stuck using the same computer from 10 years ago because there isn't anything out there worth upgrading to. And now you find it getting increasingly sluggish when running the latest software. So, what gives?

BUILDING A BETTER SWITCH | Believe it or not, every single digital component relies on the fast and accurate switching of a sufficient number of switches to achieve the desired output. For a long time we were limited by the lack of a miniaturisable switch. You had, of course, electromechanical relays, and later, vacuum tubes, but they could only get so small and switch so fast. Later, it was found that semiconductors such as silicon or germanium behaved as switches when doped with impurities in specific patterns and sequences, and thus were born diodes, transistors, thyristors, and so forth. It was only a matter of time before someone realised that it was possible to print these switches on the same piece of semiconductor using techniques borrowed from the world of lithography, and then to join them up together with wires to make any arbitrary circuit.

And after some stumbling around, the transistor was discovered to be increasingly power efficient the smaller it was built, due to a happy coincidence of the scaling laws driving its operation. So every 18 months or so, in accordance with Moore's Law, your friendly local fab would figure out a way to print smaller transistors on a slice of silicon. As these were necessarily smaller and more efficient, every year microprocessor design teams would scramble to use the extra transistors to make their design go faster. Initially, there was heady progress as even a clunker of a design could be reasonably fast, as long as the transistors could go fast enough to compensate for all the design flaws. But this did not last.
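The "happy coincidence" can be put into numbers. The sketch below assumes the classic constant-field scaling rules (usually credited to Dennard, and not named in this article): shrink every dimension by a factor k, and capacitance and voltage fall by k while switching frequency rises by k. The function name and the `voltage_scales` flag are made up for illustration.

```python
def scale_transistor(k, voltage_scales=True):
    """Relative figures after shrinking every dimension by a factor of k.

    All values are relative to the unscaled transistor (k = 1).
    """
    capacitance = 1 / k                          # smaller structures hold less charge
    voltage = 1 / k if voltage_scales else 1.0   # assumption: voltage shrinks too
    frequency = k                                # shorter distances switch faster
    # Dynamic switching power is proportional to C * V^2 * f.
    power_per_transistor = capacitance * voltage ** 2 * frequency
    density = k ** 2                             # transistors per unit area
    return {
        "power_per_transistor": power_per_transistor,
        "density": density,
        "power_density": power_per_transistor * density,
    }


if __name__ == "__main__":
    print(scale_transistor(2)["power_density"])                        # 1.0
    print(scale_transistor(2, voltage_scales=False)["power_density"])  # 4.0
```

With voltage scaling, four times as many transistors fit in the same area for the same heat output. Once the voltage can no longer be lowered, the same shrink quadruples the heat per unit area, which is roughly the wall described later in this article.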

The first sign of trouble was when the transistors got too small to be printed. This is because lithography remained the only efficient way to print transistors en masse: light shining on a mask cast a patterned shadow onto a photosensitive coating on the silicon wafer, rapidly creating any arbitrary pattern on the surface that could be used to guide subsequent processing steps. This was fine until the individual lines on the mask became smaller than the wavelength of the light, at which point the mask essentially turned into an expensive diffraction grating that blurred everything. Initially, the fix was simple enough: just find a laser that could generate a shorter wavelength of light with enough power. But this was easier said than done. Each new wavelength required the development of new photoresists sensitive to the new wavelength, new materials to block or reshape the light, and new equipment and new process flows to account for these differences.

Eventually, it got to the point where it was just cheaper to work around the diffraction limit, where one might add tiny fillets to the mask pattern to negate the effects of diffraction on the resulting light distribution, or use the initial pattern as a template to create smaller patterns. This came to a head in recent years, when shrinking the wavelength into the extreme ultraviolet became the only sane way to continue building smaller structures reliably at an acceptable speed. However, it is inherently impractical, as the light involved is so energetic that everything, including air, is opaque to it, so the entire process must take place in a vacuum chamber using mirrors to control the light. The light itself ended up being generated by vapourising droplets of molten tin with a multi-kilowatt laser as they drip past a special collecting mirror. Moreover, to enhance the efficiency of the mirrors, the light could not hit the mirrors at more than a grazing angle, resulting in a narrow field of view that limited how much area could be exposed at any one time.
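To get a feel for why the wavelength matters so much, here is a rough sketch using the Rayleigh criterion, a standard rule of thumb for the smallest printable feature. The wavelengths and numerical apertures are commonly quoted figures for deep-ultraviolet immersion and extreme-ultraviolet scanners, not values taken from this article, and the function name is invented for illustration.

```python
def min_half_pitch_nm(wavelength_nm, numerical_aperture, k1=0.25):
    """Smallest printable half-pitch under the Rayleigh criterion.

    k1 is a process-dependent factor; 0.25 is the theoretical floor for a
    single exposure, and real processes usually sit somewhat above it.
    """
    return k1 * wavelength_nm / numerical_aperture


if __name__ == "__main__":
    # Assumed figures: 193 nm immersion (NA 1.35) and 13.5 nm EUV (NA 0.33).
    print(round(min_half_pitch_nm(193.0, 1.35), 1))  # 35.7
    print(round(min_half_pitch_nm(13.5, 0.33), 1))   # 10.2
```

The workarounds described above amount to squeezing k1 or exposing the same layer several times. Changing the wavelength is the only knob that moves the limit itself.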

The next bit of trouble arose when the transistors became so small they started behaving like wires, and the wires became so small they started behaving like resistors. Replacing the aluminium in the wires with copper helped, although now additional care had to be taken to lay down barriers to keep the copper from diffusing into the silicon and destroying all the transistors built on it. But then the copper wires had to become so small that electrons started being unable to tell the difference between the wires and empty space, causing their apparent resistance to shoot up disproportionately. Cobalt, indium, and molybdenum were tried, as their smaller grain boundaries compelled electrons to follow the wire boundaries more scrupulously, but their low heat conductivity, high coefficient of thermal expansion, and fragility proved no end of trouble for foundry companies. Meanwhile, the issue of transistors not working below a certain size was neatly sidestepped by placing them on their side and by using special coatings to enhance the electric fields there, among others. However, these serve only to delay the inevitable.

To be sure, improvements from the manufacturing side are still possible, but they become increasingly expensive and impractical. To improve electrical performance further, there has been a drive to utilise vertical space more effectively. Hence, the transistors in use were first changed from simple 2D structures that could be printed onto a surface into sideways finFETs that had to be etched. And now manufacturers are proceeding to the logical conclusion of arranging these transistors as 3D stacks of wires or sheets. This, too, can only improve the switching ability of the transistor so far, and manufacturers are already looking into alternative ways of shrinking the wiring between these miniature transistors without increasing the resistance too much. Thus far, the approach has been to rewire them so that power can be delivered from directly above, or to distribute power upwards from the other side of the chip, where there are fewer constraints on how big the wires can get. But now we are faced with a highly complex process that has nearly impossible tolerances, is almost impossible to evaluate due to the sheer number of structures that have to be inspected, and requires more hardware investment for the same incremental increase in throughput. And so, we are seeing a trend where increases in performance continue, albeit at a slower rate than before, but do not translate into a cost reduction for existing hardware.

To make matters worse, these processes now take so much time, money and expertise to set up that only a few companies in the world remain capable of keeping up with the bleeding edge, and even then it takes so long for these companies to respond to any change in demand that supply and demand are essentially uncorrelated. This results in regular boom and bust cycles, which is simply unsustainable in such a demanding industry. To some extent, we have seen manufacturers pushing back by requiring customers to prepay for capacity years in advance, but this only kicks the problem down the road.

BUILDING A BETTER PROCESSOR | Meanwhile, in the world of the chip designer, things started going wrong at about the same time. It used to be the case that they relied solely on Moore's law for massive speed improvements, given that the first microprocessors made a lot of design compromises to compensate for the low number of transistors per chip available to them. Moreover, the low expectations of consumers at the time relative to what was actually possible meant that there was no real need to optimise these chips. Even then, there was obvious low-hanging fruit, so when it came to expanding the capabilities of these early chips, such improvements were rapidly adopted once it became possible to implement them. These included useful things such as adding internal support for numbers larger than 255, or support for larger memory sizes, or the ability to execute instructions in fewer clock cycles, among others. There was also a drive to integrate as many chips as possible into the central processing unit, so instead of a memory controller, a math coprocessor, and so forth, your central processing unit could now do all that and more.
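To see why native support for numbers larger than 255 mattered, consider what an 8-bit processor has to do without it. The sketch below is an illustration in Python rather than the instruction sequence of any particular chip, and the helper names are made up: the number is split into bytes, and each byte is added in turn while a carry is passed along.

```python
def add_multibyte(a: list[int], b: list[int]) -> tuple[list[int], int]:
    """Add two equal-length little-endian byte lists, one byte at a time.

    Returns the result bytes and the final carry-out (0 or 1).
    """
    if len(a) != len(b):
        raise ValueError("operands must have the same number of bytes")
    result = []
    carry = 0
    for byte_a, byte_b in zip(a, b):
        total = byte_a + byte_b + carry  # at most 255 + 255 + 1 = 511
        result.append(total & 0xFF)      # keep the low 8 bits
        carry = total >> 8               # the ninth bit becomes the carry
    return result, carry


def to_bytes_le(value: int, width: int) -> list[int]:
    """Split a non-negative integer into `width` bytes, lowest byte first."""
    return [(value >> (8 * i)) & 0xFF for i in range(width)]


def from_bytes_le(data: list[int]) -> int:
    """Reassemble a little-endian byte list into an integer."""
    return sum(byte << (8 * i) for i, byte in enumerate(data))


if __name__ == "__main__":
    total, carry = add_multibyte(to_bytes_le(300, 2), to_bytes_le(500, 2))
    print(from_bytes_le(total), carry)  # 800 0
```

Every extra byte costs another pass through the loop, so a processor that handles wider numbers in a single step gets the same answer in a fraction of the instructions.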

But it wasn't immediately clear where to go from there. The aforementioned toxic combination of high resistance and high switching power soon meant that chip designers were now faced with the uncomfortable fact that they could no longer count on raw switching speed to drive performance, and that each increase in complexity had to be balanced against the increased power consumption. With general purpose processors hitting scaling limits, it now made sense to create specialised chips targeting specific workloads to avoid the overhead of general purpose processors. And thus the concept of accelerator chips would emerge, the most prominent being the graphics processing unit.

Throughout all this, the preferred processor design was also in flux, as opinions differed on how to push chip design further. One could revamp the instruction set to make it easier for the processor to decode what had to be done from the instruction code supplied to it, or one could give the programmer the tools to tell the processor how to run more efficiently. Others would prefer to spend their transistor budget to add support for complicated operations such as division or square roots, so as not to rely on inefficient approximations of these operations using obscure arithmetic tricks. In what is now termed a complex instruction set architecture, all sorts of new instructions were being added on an ad hoc basis to natively implement various simple programming tasks. It soon became apparent that the decoding of these complex instruction sets was extremely energy intensive, and there was then a push in the other direction, towards the fewest possible instructions that could be decoded in the simplest possible way, with the net result finding wide use to this very day in mobile computers.

Then there was the concept of tasking the programmer, or at least the programmer writing the compiler that converts programming code to machine readable code, with thinking about how to shuffle data around optimally. The very long instruction word (VLIW) paradigm makes this explicit by requiring the programmer to group instructions into blocks that are then simultaneously executed. But this belies the difficulty of finding instructions that can be simultaneously executed, and of keeping track of how long each instruction would take to complete. In a related approach, the single instruction, multiple data (SIMD) paradigm allows a single instruction to perform operations on multiple streams of data at the same time. Instead of adding single pairs of numbers at a time, now you could add entire arrays of numbers to each other in one go. While these instructions would see limited uptake in general purpose processors, they found widespread adoption in specialised processors such as graphics processing units and digital signal processors, which target highly parallelisable workloads such as image processing that involve iterative computations on large amounts of data. The opposite was also considered, and computer architectures that could directly interpret high level code were built. But, as they locked you into a single programming language and were difficult to debug, they mainly exist today as a paradigm in which one can safely run untrusted code by executing it in a simulated computer that can only run code of a specific type.
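The difference between adding pairs one at a time and adding whole arrays in one go can be mimicked in a few lines. This is only a cartoon: Python does not issue SIMD instructions here, the four-lane width is an arbitrary assumption, and the counters merely tally how many "instructions" each style would need.

```python
LANES = 4  # hypothetical vector width: four numbers per instruction


def scalar_add(xs, ys):
    """Add element by element; one 'instruction' per pair of numbers."""
    if len(xs) != len(ys):
        raise ValueError("arrays must be the same length")
    out, instructions = [], 0
    for x, y in zip(xs, ys):
        out.append(x + y)
        instructions += 1
    return out, instructions


def simd_add(xs, ys, lanes=LANES):
    """Add in chunks of `lanes` numbers; one 'instruction' per chunk."""
    if len(xs) != len(ys):
        raise ValueError("arrays must be the same length")
    out, instructions = [], 0
    for start in range(0, len(xs), lanes):
        chunk_x = xs[start:start + lanes]
        chunk_y = ys[start:start + lanes]
        out.extend(x + y for x, y in zip(chunk_x, chunk_y))
        instructions += 1
    return out, instructions


if __name__ == "__main__":
    xs, ys = list(range(10)), [10] * 10
    print(scalar_add(xs, ys)[1])  # 10
    print(simd_add(xs, ys)[1])    # 3
```

Both versions produce the same sums. The saving only materialises when the same operation applies to every element, which is why image and signal processing benefit so readily.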

And yet, painstaking design overhauls would continue to be made to general purpose processors of each design paradigm, and later to other specialised processors to help make them faster. One early step was to break up each computation into smaller, simpler stages that could execute faster. However, this created a whole host of potential bottlenecks that now had to be considered when designing a chip in order to avoid leaving performance on the table. Meanwhile, for reasons of ease of use and compatibility, most general purpose processors would deal with the instruction decoding issue by taking the more conservative route of adding an additional stage to translate the instruction set into something more scalable. Processors could also now run multiple instructions simultaneously, allowing them to execute different parts of the same linear strand of code at the same time, and they also gained the ability to execute multiple programs simultaneously to take full advantage of available resources. Speculative execution was also introduced at this time, in which the CPU would guess how a decision would pan out and calculate the resulting implications even before the decision had been reached. Of course, if it guessed wrongly, there would be speed and security penalties. However, this could only be scaled so far, as this approach required numerous energy intensive connections to be made across different parts of the chip.
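How might a processor "guess how a decision would pan out"? One textbook scheme, used here purely as an illustration and not necessarily what any given chip does, is a two-bit saturating counter that remembers which way a branch has tended to go. The function name and starting state below are assumptions.

```python
def count_mispredictions(outcomes, initial_state=2):
    """Run a 2-bit saturating counter over a sequence of branch outcomes.

    States 0 and 1 predict 'not taken'; states 2 and 3 predict 'taken'.
    Returns the number of wrong guesses.
    """
    state = initial_state
    wrong = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            wrong += 1
        # Nudge the counter towards the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return wrong


if __name__ == "__main__":
    loop_branch = [True] * 9 + [False]  # a loop that runs ten times
    alternating = [True, False] * 5     # a branch that flips every time
    print(count_mispredictions(loop_branch))  # 1
    print(count_mispredictions(alternating))  # 5
```

A predictable loop costs a single wrong guess, whereas an erratic branch is wrong half the time, and every wrong guess means discarding the work done on the strength of it.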

At the same time, processors also began to outperform the storage they ran from, as it turns out that reliably storing data is inherently slower than performing an operation, especially since the finite speed of light becomes a non-negligible limit at these scales on how quickly information can be passed to the processor. Thus, it became necessary to add tiers of faster and nearer memory to the system, as improvements in manufacturing processes allowed the extravagant waste of millions of transistors on the processing die on something as mundane as storage. But this would again run into a wall, as large caches require more energy and take longer to access, while occupying expensive die area, all while the other parts of the chip had to be scaled up in order to utilise the additional bandwidth efficiently.
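The payoff from tiers of nearer memory can be estimated with the usual average access time calculation. Every latency and hit rate below is an invented round number chosen for illustration, not a measurement of any real chip, and the function name is hypothetical.

```python
def average_access_time(levels, memory_latency):
    """Average memory access time for a cache hierarchy, in clock cycles.

    `levels` is a list of (hit_latency, hit_rate) pairs ordered from the
    nearest cache to the farthest; `memory_latency` is the cost of going
    all the way out to main memory.
    """
    # Work backwards: the cost of a miss at one level is the average
    # cost of the level below it.
    miss_cost = memory_latency
    for hit_latency, hit_rate in reversed(levels):
        miss_cost = hit_latency + (1.0 - hit_rate) * miss_cost
    return miss_cost


if __name__ == "__main__":
    main_memory = 200  # assumed cycles to reach main memory
    print(round(average_access_time([], main_memory), 1))                      # 200
    print(round(average_access_time([(4, 0.9)], main_memory), 1))              # 24.0
    print(round(average_access_time([(4, 0.9), (12, 0.8)], main_memory), 1))   # 9.2
```

Under these assumptions, the first small cache cuts the average wait by nearly an order of magnitude and the second tier helps again, but with diminishing returns for the die area spent.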

Thus dawned the multiprocessor era, as chip designers realised that instead of adding more complexity for marginal benefit, it sufficed to provide more processors per chip for a mostly linear increase in benefit. This wasn't always useful, as it turns out that software had to be rewritten to take advantage of the extra threads, and even when software developers did so, Amdahl's law limits the maximum speedup to the inverse of the proportion of the task that cannot be parallelised. While this is not an issue for massively parallel tasks acting on arrays, such as rendering or video encoding, most desktop software or games see only a 2–4x speedup. Ultimately, this would be limited by the amount of power needed to run all cores at a reasonable speed, and the fact that eventually a large enough chip would be impossible to manufacture due to the larger number of things that could go wrong in the manufacturing process.
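Amdahl's law is simple enough to check directly. In the sketch below, the parallel fractions of 50% and 75% are assumptions picked because they reproduce the 2x and 4x ceilings; real programs would have to be measured.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup on `cores` processors when only part of the task scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def amdahl_limit(parallel_fraction):
    """Best possible speedup, however many cores are thrown at the task."""
    serial_fraction = 1.0 - parallel_fraction
    if serial_fraction == 0.0:
        return float("inf")
    return 1.0 / serial_fraction


if __name__ == "__main__":
    print(amdahl_limit(0.5))                   # 2.0
    print(amdahl_limit(0.75))                  # 4.0
    print(round(amdahl_speedup(0.75, 8), 2))   # 2.91
    print(round(amdahl_speedup(0.75, 64), 2))  # 3.82
```

Going from 8 to 64 cores buys less than one extra multiple of speed for a task that is three-quarters parallel, which is why piling on cores stops paying off so quickly for everyday software.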

The problem now is that chip designers and semiconductor manufacturers alike have painted themselves into a corner where there are no longer obvious ways to provide massive improvements over existing technology, under existing constraints. As the old adage goes, one must pick between power, performance, and area (and hence cost) when designing a chip. Power can no longer be ignored, since currently, the high resistance of the wires and plateauing improvements in transistor efficiency mean that only a fraction of the transistors on a microprocessor can be used at any given time, lest the entire chip melt. Thus, while achievable transistor density continues to increase, transistor utilisation is facing a hard wall, as it turns out that it is impossible to efficiently cool something putting out more heat per unit area than a nuclear reactor. There are ways around this, of course, but when we start talking about making chips so thin that flexibility becomes an issue, or more power hungry than a space heater, or drilling cooling channels into them, one can't help but raise an eyebrow. Nor can area be ignored, due to the limited availability of leading edge processes as the machines we need to make them happen can only be produced and installed so quickly, as well as the high cost involved in manufacturing each wafer.

RETHINKING THE PROCESSOR | Can we discard our constraints and start afresh? One natural solution is to adopt a heterogeneous computing approach, in which we split up a processor into a grab bag of specialised coprocessors to get the best of all worlds while keeping total chip cost within bounds. Thus, you would have a reasonably fast CPU for general computation, which offloads massively parallel tasks such as graphics processing to a GPU that is slower but capable of performing many concurrent computations, or to a DSP to perform basic image processing functions. Later, the need may emerge to incorporate still more chips to speed up different types of calculations, such as deep learning or cryptography operations, essentially heralding a return to the era of the coprocessor chip. Another approach, as shown by the Mill architecture, offloads the task of rearranging the input instructions to the programmer so that the processor can focus on sequential execution.

Alternatively, the necessary high bandwidth links can be scaled up further in an advanced packaging paradigm. In the case of high bandwidth memory (HBM), this was used to give processors faster access to memory, so that less time is spent waiting for new data, while 2.5D and 3D packaging have also allowed companies to pick the optimal process with which to print different parts of a chip. We can scale bandwidth further by bolting cache dies onto a processor, and in fact it makes sense to try to disaggregate processors into smaller chiplets that remain tightly interconnected. Among other benefits, defect rates would be reduced due to the chiplets' reduced complexity and the possibility of manufacturing them under more specialised conditions, and the possibility of vertical stacking allows long, energy-hungry interconnections to be avoided.
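A back-of-the-envelope sketch shows why smaller dies waste less silicon. It assumes the simple Poisson yield model, a made-up defect density of 0.001 defects per square millimetre, and that chiplets can be tested before assembly; packaging costs and assembly losses are ignored, so treat the numbers as illustrative only.

```python
import math


def poisson_yield(area_mm2, defects_per_mm2):
    """Fraction of dies expected to be defect-free under a Poisson model."""
    return math.exp(-area_mm2 * defects_per_mm2)


def silicon_per_good_product(die_area_mm2, dies_per_product, defects_per_mm2):
    """Wafer area consumed, on average, to obtain one working product."""
    good_fraction = poisson_yield(die_area_mm2, defects_per_mm2)
    return dies_per_product * die_area_mm2 / good_fraction


if __name__ == "__main__":
    defect_density = 0.001  # assumed defects per square millimetre
    monolithic = silicon_per_good_product(800, 1, defect_density)
    chiplets = silicon_per_good_product(200, 4, defect_density)
    print(round(monolithic))  # 1780
    print(round(chiplets))    # 977
```

Under these assumptions, a single 800 square millimetre die consumes nearly twice as much wafer per working product as four 200 square millimetre chiplets, because one defect writes off far less silicon.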

We can also rethink the strict von Neumann hierarchy in which every data transfer must pass through the CPU, as this results in unnecessary data transfers to and from the CPU. Instead, approaches such as direct memory access allow us to bypass the CPU in any given system when passing data between devices. Meanwhile, in the in-memory processing paradigm, one tries to bring the data to be processed as close to the processor as possible, although the utility for more complicated workloads is limited by the expense of memory fast enough to make it worthwhile. In the processing-in-memory approach, this is taken to its logical conclusion by fully integrating the processor and the memory, but to date these approaches have yet to take off, as the processes for manufacturing fast processors and fast memory tend to be mutually exclusive.

RETHINKING THE SOFTWARE | In the end, all of these efforts would come to naught without software and developers to make use of all these new capabilities. There is a need for a complete rethink of software paradigms to take advantage of this brave new world emphasising parallelism and an economy of data flow. To some extent, there are new programming paradigms that provide an alternative to the classic linear control flow, such as graphics card programming interfaces that expose the tools needed for a programmer to run multiple copies of a simple program in parallel across the available compute resources. We also need to consider ways of improving the translation process from programming language to machine code, as the compilers that do so can always be improved, and all programming languages should implement well defined ways to allow the programmer to peel back and bypass the abstraction inherent in them in order to achieve speed and safety improvements.

The shift towards increased abstraction in human-computer interactions has, to some extent, caused a lack of curiosity towards low level computer design. This affects us at all levels, as it not only leads to a dearth of expertise on how to improve on existing microprocessor designs, but also creates a situation where inefficient code is written due to a lack of appreciation for the nuances of the underlying hardware. Programmers need to be made aware of potential inefficiencies in their code and of the tools with which they can identify and fix these in a safe manner. And indeed, our failure to recognise this is leading to unsafe, slow code that threatens to undo the progress that we have made thus far in building faster computers.

And this is the situation we find ourselves in today. We have faster hardware, but it still falls far short of our ever increasing expectations of what a computer should do. This, in turn, is fed by the expectation that new hardware should be able to do more, even as this results in complex, poorly optimised software that runs slowly on less capable hardware. And, with the manufacturing process becoming ever more complex and expensive, at some point there will be a reckoning when we need to rethink our expectations about what our hardware is capable of. Hopefully, this will result in an increased appreciation of the underlying design and how it can be leveraged to its full potential. And then, just maybe, your devices might stop getting slower with each update.

Clifford Sia studied medicine at St. Catharine's College but happens to have a passing interest in computers and also helps run the BlueSci website. Artwork by Barbara Neto-Bradley.