Cambridge University Science Magazine
It is 1947 and the computer scientists working on the Harvard Mark II have reached a dead end. The theory is correct, the logic is sound, but the machine still isn’t working. On 9 September, operators investigating the new high-speed electromagnetic relays discover a moth stuck inside and famously record it as a 'bug'. Computer science has kept the phrase in its vernacular ever since.

The blue screen of death, known for its garish blue colour and alarming messages, is the error screen shown after a fatal system crash. Encountered by professional and amateur users alike, it has become synonymous with both failure and despair. It is also a useful way into the wider problem, because a crash can be caused by hardware failure, faulty software, a virus, or a mixture of all three. Exploring these three avenues gives a sense of just how complex the problem of ‘program perfection’ really is.

Keeping hardware working consistently has required borrowing solutions from electrical engineers and physicists. Hardware that never crashes remains out of reach, so computer scientists have instead built systems that are prepared for failure. Early systems ran multiple processors doing exactly the same thing, so that if one crashed, another was ready to carry on.

Engineers had been using the terms ‘bug’ and ‘glitch’ to refer specifically to electronic or mechanical problems, and in the early days of computer science it was assumed that bugs would likewise be confined to hardware. Software, seen as virtual and intangible, was considered immune.

Yet time and time again, when scientists hit unforeseen barriers, rather than being limited by them, they discover a new field of study. On 6 May 1949, the EDSAC (Electronic Delay Storage Automatic Calculator) computer began operation and correctly generated tables of squares. It was a triumph in computing: the first practical, working ‘stored-program’ computer, one which holds its program instructions in electronic memory. Three days later, however, it hit a glitch. A program written to enumerate prime numbers was not giving the right results, and on closer inspection the code itself turned out to be at fault. This mistake opened up the exploration of what we now call software development, and led Sir Maurice Wilkes, founder of the Computer Laboratory in Cambridge and creator of EDSAC, to famously realise that “a large part of my life from then on was going to be spent in finding mistakes in my own programs.”

There have been many notable examples where mistakes in software have led to catastrophic results. Just as we cannot write down all the digits of an irrational number, computers cannot store numbers with infinite precision; instead they use ‘floating-point’ numbers, typically represented as a fixed number of significant digits multiplied by a base raised to an exponent. Each number is allocated a fixed amount of storage in memory, and when a value becomes too large for the space allotted to it, the result can be catastrophic. Exactly this type of problem caused the Ariane 5 space rocket to self-destruct just 37 seconds after launch, costing the European Space Agency close to 1 billion USD.
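In the Ariane 5 failure, a 64-bit floating-point value was squeezed into a 16-bit integer that could not hold it. The sketch below reproduces that kind of silent mangling in Java; the variable name and the value 40000.0 are invented for illustration and are not taken from the actual flight software.

```java
public class OverflowDemo {
    public static void main(String[] args) {
        // A sensor reading held as a 64-bit floating-point number
        // (the value is illustrative, not real flight data).
        double horizontalVelocity = 40000.0;

        // Forcing it into a 16-bit integer, which can only hold values
        // between -32768 and 32767: the surplus bits are discarded and
        // the stored value is silently, drastically wrong.
        short stored = (short) horizontalVelocity;

        System.out.println("original: " + horizontalVelocity); // 40000.0
        System.out.println("stored:   " + stored);             // -25536
    }
}
```

Nothing in the program flags the discrepancy; the wrong number simply flows on into the rest of the system.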

Bugs in software also lead to loopholes or vulnerabilities that allow hackers and viruses to infiltrate. In April 2010, the QAKBOT virus was discovered on over one thousand NHS computers. The virus had been secretly uploading sensitive information onto servers. Viruses exploit vulnerabilities in code at various levels, often at the interaction between programs or at weaknesses in protocols. These flaws cost businesses billions of pounds each year.

Though mistakes are costly, they are also very difficult to avoid. In the traditional software design paradigm, development follows the ‘V model’: the project moves from general objectives to detailed functional requirements, from which a functional design is produced. From the design, the project can begin its coding or implementation phase, after which it is tested and then released. Most of the difficulty lies in translation, that is, in moving between these levels. It is very challenging to make sure that a design completely fulfils the stated requirements, and harder still to implement the design faithfully.

As computers increased in memory and speed, complexity grew exponentially and layer upon layer of abstraction was added, until it became incredibly difficult to be confident that a solution would be glitch-free. Even if assured of perfection at a given level of abstraction, it would be extremely difficult to confirm that functionality at lower levels was similarly effective. Businesses now depend on increasing levels of IT infrastructure, so whilst from a scientific perspective it seems rather unsatisfying to knowingly release a flawed product, it is currently the only commercially viable solution.

If your favourite games console crashes, it is a little inconvenient; if an aircraft’s computer system fails at 40,000 feet, it is much more serious. The realisation that some errors are more acceptable than others is part of risk assessment and management. The current approach is to deal with bugs pragmatically: instead of aiming for perfection, software houses catalogue the bugs they cannot fix in time for release and then issue patches gradually for those deemed critical.

In mathematics, the study of sets (collections of mathematical or physical objects) is known as set theory. Because computer programming relies on natural expression, such as ‘and…if…or…not’, it in a sense uses what mathematicians call ‘intuitive’ or ‘naïve’ set theory. This is a non-formalised theory that uses natural language (as opposed to precise mathematical language) to describe and study sets. But the language of naïve set theory often lacks rigorous definition. The programming input and grammar may then be relatively ambiguous, requiring more interpretation by the computer and making it easier to write incorrect code.

Yet perhaps there is hope after all. Other systems exist for examining objects, groupings and collections. One such system is type theory, which could become crucial for computer scientists in proving the correctness of programs before coding even begins, allowing potential bugs to be detected and eliminated before they can cause any problems.

Web pages validate email addresses by accepting input only if it contains an ‘@’ symbol, which reduces the chance of sending an email to a non-existent address. Similarly, ‘type systems’ assign types to variables and functions with the aim of ruling out impossible behaviour, such as trying to numerically add the word ‘hello’ to the number 5. A more subtle consequence is that, with sufficiently strict rules on types, it becomes genuinely hard to write incorrect code that will run at all.
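As a minimal sketch of this idea (the method addFive is our own invention, not taken from any particular library), here is how a Java compiler uses declared types to reject a nonsensical call before the program ever runs:

```java
public class TypeCheckDemo {
    // The parameter is declared as an int, so only whole numbers are accepted.
    static int addFive(int x) {
        return x + 5;
    }

    public static void main(String[] args) {
        System.out.println(addFive(10)); // fine: prints 15

        // The call below never compiles. The compiler reports
        // "incompatible types: String cannot be converted to int",
        // so the mistake is caught long before the code can misbehave.
        // System.out.println(addFive("hello"));
    }
}
```

The stricter the rules, the more of these mistakes are pushed back from runtime crashes to compile-time complaints.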

Such is the promise of type theory that in 1989, the EU began heavily funding projects that looked at developing type systems and ways to prove a program’s correctness. One of the projects that emerged from this investment was Mobius, which sought to incorporate so-called ‘proof-carrying code’, allowing code to be certified as bug-free.

Although poetry and humour benefit greatly from the ambiguity of language, beauty in mathematics and science is most commonly found in simplicity and clarity. The two most popular programming languages, Java and C (based on October 2010 figures from the TIOBE index), contain a good deal of ambiguity, which makes programming quicker but also leads to bugs.

Both of these languages have what is known as an ambiguous grammar, meaning there are statements that are valid but could be read in more than one way; it is left to the compiler to decide what they really mean. Understandably, this lack of precision makes it very hard to prove that a piece of code is correct.
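The classic illustration is the ‘dangling else’, sketched below in Java (the snippet is our own, constructed for illustration): a human reader can attach the else to either if, and the indentation suggests one reading, but the compiler quietly applies another, binding the else to the nearest if.

```java
public class DanglingElse {
    public static void main(String[] args) {
        boolean loggedIn = false;
        boolean isAdmin = false;

        // The indentation suggests the else belongs to the first if,
        // so a reader might expect "Please log in" to be printed.
        // Java, like C, attaches the else to the nearest if instead,
        // so with loggedIn == false nothing is printed at all.
        if (loggedIn)
            if (isAdmin)
                System.out.println("Welcome, admin");
        else
            System.out.println("Please log in");
    }
}
```

The program compiles without complaint; the mismatch between what was meant and what was written only surfaces when it runs.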

Languages free of ambiguous grammar do exist. One is SPARK, which is designed so that any compiler of the language will interpret a given program in exactly the same way. SPARK is grounded in mathematical logic and formalism, and while traditional coders may not welcome such intensity of mathematical rigour, it is precisely this painstaking rigour that may finally banish the blue screen of death into obscurity.

Wing Yung Chan is a 1st year undergraduate in the Computer Laboratory