On the practicality of regex for email address processing

A colleague recently pointed me to a blog post: On the Futility of Email Regex Validation. For the sake of brevity I will refer to it as Futility in this article.

I admit that writing a regex that can successfully identify whether a string conforms to the RFC 5322 definition of an Internet Message header is an entertaining challenge, but Futility is not a useful guide for the practical programmer.

This is because it conflates RFC 5322 message headers with RFC 5321 address literals; which in simple language means that what constitutes a valid SMTP email address is different from what constitutes a valid message header in general. It is also because it incites the reader to become preoccupied with edge cases that are theoretically possible from a standards point of view, but which I will demonstrate have an infinitesimal probability of occurring “in the wild”.

This article will expand upon both of these assertions, will discuss a few possible use cases for email regex, and will conclude with annotated “cookbook” examples of practical email regex.

RFC 5321 supersedes 5322

The universality of SMTP for the transmission of email means that as a practical matter, no examination of email address formatting is complete without a close reading of the relevant IETF RFC, which is 5321.

5322 considers email addresses as simply a generic message header with no special case rules applying to it. This means that comments enclosed in parentheses are valid, even in a domain name.

The test suite referenced in Futility includes 10 tests which contain comments, or diacritical or unicode characters and indicates that 8 of them represent valid email addresses. This is incorrect because RFC 5321 is explicit in stating that the domain name portions of email addresses “are restricted for SMTP purposes to consist of a sequence of letters, digits, and hyphens drawn from the ASCII character set.”

In the context of constructing a regular expression, it is hard to overstate the degree to which this constraint simplifies matters, especially as regards determining excessive string length. The annotation of the examples will highlight this below. It also implies some other practical considerations in the context of validation that we will explore further on.

Mailbox names in the wild

Strictly speaking, both RFCs call the portion of the email address to the left of the “@” symbol the “local-part”; this article calls it the “mailbox” for short. Both RFCs allow considerable latitude in what characters are allowable in the mailbox portion. The only significant practical constraint is that quotes or parentheses must be balanced, something that is a real challenge to verify in vanilla regex.

However real world mailbox implementations are again the measure which the practical programmer should employ. As a rule, the people who pay us frown upon 90% of our billable hours being directed to resolving the 10% of theoretical edge cases that might possibly not even exist at all in real life.

Let’s look at the dominant email mailbox providers, consumer and business, and consider what types of email addresses they permit.

For consumer email, I did some primary research, using a list of 5,280,739 email addresses that were leaked from Twitter accounts. Based on 115 million twitter accounts, this gives us a 99% confidence level with a 0.055% margin of error for the entire population of Twitter, which would be very representative of the general population of all Internet email addresses. Here is what I learned:

  • 82% of addresses contained only ascii alphanumeric characters,
  • 15% contained only ascii alphanumeric and dots (ascii periods), for 97% of all addresses,
  • 3% contained only ascii alphanumeric, dots, and dashes, for a nominal 100% of email addresses.

However, this is a rounded 100%. For the trivia lovers out there, I also found:

  • 38 addresses with underscores for 0.00072% of the total
  • 27 with plus signs for 0.00051%, and
  • 1 address with unicode characters representing 0.00002% of the total.

The net effect is that assuming email address mailboxes contain only ascii alphanumeric, dots, and dashes will give you roughly 99.9987% accuracy, just short of 5 9’s, for consumer emails.
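The sampling figures in this section can be sanity-checked with a few lines of arithmetic. The sketch below assumes the standard margin-of-error formula for a proportion (worst case p = 0.5) with a finite population correction; the function name margin_of_error is illustrative, and the counts are the ones reported above.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(sample, population, confidence=0.99, p=0.5):
    """Margin of error for a proportion, with finite population correction."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided z-score
    fpc = sqrt((population - sample) / (population - 1))
    return z * sqrt(p * (1 - p) / sample) * fpc

SAMPLE = 5_280_739        # leaked addresses analysed
POPULATION = 115_000_000  # Twitter accounts, as stated above

moe_pct = 100 * margin_of_error(SAMPLE, POPULATION)

# Addresses whose mailbox needs more than alphanumerics, dots, and dashes.
rare = 38 + 27 + 1        # underscores + plus signs + unicode
coverage_pct = 100 * (1 - rare / SAMPLE)

print(f"margin of error: {moe_pct:.3f}%")
print(f"coverage: {coverage_pct:.5f}%")
```

With these inputs the margin of error comes out at roughly 0.055%, and the three simple character classes cover about 99.9987% of the sample.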

For business emails, Datanyze reports that 6,771,269 companies use 91 different email hosting solutions. However the Pareto distribution holds, and 95.19% of those mailboxes are hosted by just 10 service providers.

Gmail for business (34.35% market share)

Google allows only ascii letters, numbers, and dots when creating a mailbox. It will however accept the plus sign when receiving email.

Microsoft Exchange Online (33.60%)

Allows only ascii letters, numbers, and dots.

GoDaddy Email Hosting (14.71%)

Uses Microsoft 365, allows only ascii letters, numbers, and dots.

7 additional providers (12.53%)

Not documented.

Unfortunately we can only be certain of 82% of businesses and we do not know how many mailboxes that represents. However we do know that of the Twitter email addresses, only 400 out of 173,467 domains had more than 100 individual email mailboxes represented. I believe that most of the remaining 99% of domains were business email addresses. In terms of mailbox naming policies at the server or domain level, I propose that it is reasonable to take these 237,592 email addresses as representing a population of 1 billion business email addresses with 99% confidence level and 0.25% margin of error, giving us close to 3 9’s when assuming that an email address mailbox contains only ascii alphanumeric, dots, and dashes.

Use cases

Again, with practicality foremost in our minds, let us consider under what circumstances we might need to programmatically identify a valid email address.

New account creation/user signups

In this use case, a prospective new customer is trying to create an account. There are two high-level strategies we might consider. In the first case we attempt to verify that the email address that the new user provides is valid, and proceed with account creation synchronously. There are two reasons why you might not want to take this approach. The first is that although you might be able to validate that the email address has a valid form, it might nonetheless not exist. The other reason is that at any kind of scale, synchronous is a red flag word, which should cause the pragmatic programmer to consider instead a fire and forget model where a stateless web front end passes form information to a microservice or API which will asynchronously validate the email by sending a unique link which will trigger the completion of the account creation process.
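To make the fire and forget model concrete, here is a minimal sketch of the confirmation-link step. Everything in it is an assumption for illustration: the in-memory PENDING store, the function names start_signup and confirm_signup, and the example.com URL are hypothetical, and a real service would persist tokens and hand the link to a mail queue.

```python
import secrets
import time

# Hypothetical in-memory store; a real service would use a database or cache.
PENDING = {}

def start_signup(email, ttl_seconds=3600, now=None):
    """Record a pending signup and return the confirmation link to email out."""
    now = time.time() if now is None else now
    token = secrets.token_urlsafe(32)  # unguessable, URL-safe token
    PENDING[token] = (email, now + ttl_seconds)
    # example.com and the /confirm path are placeholders.
    return f"https://example.com/confirm?token={token}"

def confirm_signup(token, now=None):
    """Return the confirmed email, or None if the token is unknown or expired."""
    now = time.time() if now is None else now
    email, expires_at = PENDING.pop(token, (None, 0))
    return email if email is not None and now <= expires_at else None
```

The account is only created when confirm_signup returns an address, so no regex ever has to decide whether the mailbox really exists.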

Contact forms

In the case of a simple contact form, of the sort often used to download white papers, the potential downside to accepting strings that look like a valid email but are not, is that you are lowering the quality of your marketing database by failing to validate if the email address really exists. So once again, the fire and forget model is a better option than programmatic validation of the string entered in a form.

Parsing of referrer logs and other large volumes of data

Which leads us to the real use case for programmatic email address identification in general, and regex in particular: anonymizing or mining large chunks of unstructured text.

I first came across this use case assisting a security researcher who needed to upload referrer logs to a fraud detection database. The referrer logs contained email addresses that needed to be anonymized before leaving the company’s walled garden.

These were files with hundreds of millions of lines, and there were hundreds of files a day. “Lines” could be close to a thousand characters long. Iterating through the characters in a line, applying complex tests (e.g. is this the first occurrence of @ in the line and is it part of a file name such as imagefile@2x.png?) using loops and standard string functions would have created a time complexity that was impossibly large. In fact the in-house development team of this (very large) company had declared it an impossible task.

I wrote the following compiled regex:

import re

search_pattern = re.compile(r"[a-zA-Z0-9\!\#\$\%\'\*\+\-\^\_\`\{\|\}\~\.]+(@|\%40)(?!(\w+\.)*(jpg|png))(([\w\-]+\.)+([\w\-]+))")

and dropped it into the following Python list comprehension:

results = [search_pattern.sub("redacted@example.com", line) for line in file]

I cannot remember just how fast it was, but it was fast. My friend could run it on a laptop and be done in minutes. It was accurate. We clocked it at 5 9’s looking at both false negatives and false positives.
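As a small runnable illustration of the redaction step, the sketch below applies the pattern to two made-up log lines. The sample lines and the redact helper are assumptions for demonstration, and the pattern is written with the delimiter alternatives (@ or its URL-encoded form %40) grouped so that it compiles in Python.

```python
import re

search_pattern = re.compile(
    r"[a-zA-Z0-9\!\#\$\%\'\*\+\-\^\_\`\{\|\}\~\.]+(@|\%40)"
    r"(?!(\w+\.)*(jpg|png))(([\w\-]+\.)+([\w\-]+))"
)

def redact(line):
    """Replace anything that looks like an email address with a placeholder."""
    return search_pattern.sub("redacted@example.com", line)

# Hypothetical referrer-log lines, invented for this example.
sample_lines = [
    "GET /signup?email=alice%40example.com&img=logo@2x.png HTTP/1.1",
    "ref=https://example.org/a?u=bob.smith@mail.example.co.uk",
]

for line in sample_lines:
    print(redact(line))
```

The negative lookahead is what leaves the file name logo@2x.png alone while the two addresses are replaced.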

My job was made somewhat easy by the fact that, as referrer logs, they could only contain URL “legal” characters, so I was able to map out any collisions, which I documented in the repo readme. Also I could have made it even simpler (and faster) if I had performed the email address analysis and learned with assurance that all that was needed to get to the 5 9’s target was ascii alphanumeric, dots, and dashes. Nonetheless this is a good example of practicality and scoping the solution to fit the actual problem to be solved. One of the greatest quotes in all of programming lore and history is the great Ward Cunningham’s admonition to take a second to remember exactly what you are trying to accomplish, and then ask yourself “What is the simplest thing that could possibly work?”

In the use case of parsing out (and optionally transforming) an email address from a large amount of unstructured text this solution was definitely the simplest thing I could think of.

Annotated cookbook

Like I said at the beginning, I found the idea of building an RFC 5322 compliant regex amusing, so I will show you composable chunks of regex to deal with various aspects of the standard and explain how the regex polices that. At the end I will show you what it looks like all assembled.

The structure of an email address is:

  1. The mailbox
    1. Legal characters
    2. Single dots (double dots are not legal)
    3. Folded White Space (RFC 5322 craziness)
    4. (A complete regex solution would also include balanced parentheses and/or quotes, but I do not have that yet. And very possibly never will.)
  2. The delimiter (@)
  3. The domain name
    1. Standard dns parsable domains
    2. IPv4 address literals
    3. IPv6 address literals
      1. IPv6-full
      2. IPv6-comp (for compressed)
        1. 1st form (2+ 16-bit groups of zero in the middle)
        2. 2nd form (2+ 16-bit groups of zero in the beginning)
        3. 3rd form (2 16-bit groups of zero at the end)
        4. 4th form (8 16-bit groups of zero)
      3. IPv6v4-full
      4. IPv6v4-comp (compressed)
        1. 1st form
        2. 2nd form
        3. 3rd form
        4. 4th form

Now for the regex.

Mailbox

^(?<mailbox>([a-zA-Z0-9\+\!\#\$\%\&\'\*\-\/\=\?\+\_\{\}\|\~]|(?<singleDot>(?<!\.)(?<!^)\.(?!\.))|(?<foldedWhiteSpace>\s?\&\#13\;\&\#10\;.)){1,64})

First we have ^ which “anchors” the first character at the beginning of the string. This is to be used if validating a string that is supposed to contain nothing but a valid email. It makes sure that the first character is legal. If the use case is instead to find an email in a longer string, omit the anchor.

Next we have (?<mailbox>. This names the capture group for convenience. Inside the captured group are the three regex chunks separated by the alternate match symbol | which means that a character can match any one of the three expressions. Part of writing good (performant and predictable) regex is to make sure that the three expressions are mutually exclusive. That is to say that a substring that matches one, will definitely not match either of the other two. To do this we use specific character classes instead of the dreaded .*.

Unconditionally legal characters

[a-zA-Z0-9\+\!\#\$\%\&\'\*\-\/\=\?\+\_\{\}\|\~]

The first alternate match is a character class enclosed in square brackets, which captures all ascii characters that are legal in an email mailbox except the dot, “folded white space”, the double quote, and the parentheses. The reason why we excluded them is that they are only conditionally legal, which is to say that there are rules about how you can use them that have to be validated. We handle them in the next 2 alternate matches.

singleDot

(?<singleDot>(?<!\.)(?<!^)\.(?!\.))

The first such rule concerns the dot (period). In a mailbox, the dot is only allowed as a separator between two strings of legal characters, so two consecutive dots are not legal. To prevent a match if there are two consecutive dots, we use the regex negative lookbehind (?<!\.) which specifies that the next character (a dot) will not match if there is a dot preceding it. Regex lookarounds can be chained. There is another negative lookbehind before we get to the dot, (?<!^), which enforces the rule that the dot cannot be the first character of the mailbox. After the dot, there is a negative lookahead (?!\.); this prevents a dot being matched if it is immediately followed by a dot.
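The lookarounds are easier to trust once you see them reject something. The following is a sketch in Python, which spells named groups as (?P<name>...) rather than (?<name>...); the helper name is_plausible_mailbox is illustrative, the folded white space alternate is left out, and the character class is a de-escaped copy of the one above.

```python
import re

# Unconditionally legal characters (same set as the class above).
LEGAL = r"[a-zA-Z0-9!#$%&'*+\-/=?_{}|~]"
# A dot that is not first, not preceded by a dot, and not followed by a dot.
SINGLE_DOT = r"(?P<singleDot>(?<!\.)(?<!^)\.(?!\.))"
# Between 1 and 64 repetitions of either alternate.
MAILBOX = re.compile(rf"(?P<mailbox>(?:{LEGAL}|{SINGLE_DOT}){{1,64}})")

def is_plausible_mailbox(text):
    """True if the whole string matches the simplified mailbox chunk."""
    return MAILBOX.fullmatch(text) is not None

for candidate in ["john.smith", "john..smith", ".john", "a" * 65]:
    print(candidate[:20], is_plausible_mailbox(candidate))
```

A trailing dot still passes this chunk on its own; in the article's pattern that case is rejected by the delimiter chunk, which refuses an @ preceded by a dot.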

foldedWhiteSpace

(?<foldedWhiteSpace>\s?\&\#13\;\&\#10\;.)

This is some RFC 5322 nonsense about allowing multi-line headers in messages. I am ready to bet that in the history of email addresses there has never been anyone who seriously created an address with a multiline mailbox (they might have done it as a joke). But I am playing the 5322 game so here it is, the string of unicode characters that creates the Folded White Space as an alternate match.

Balanced double quotes and parenthesis

Both RFCs allow for the use of double quotes as a way of enclosing (or escaping) characters that would normally be illegal. They also allow for enclosing comments in parentheses so that they will be human readable, but not considered by the mail transfer agent (MTA) when interpreting the address. In both cases the characters are only legal if balanced. This means that there has to be a pair of characters, one that opens and one that closes.

I am tempted to write that I have discovered a demonstrationem mirabilem, however this probably only works posthumously. The truth is that this is non-trivial in vanilla regex. I have an intuition that the recursive nature of “greedy” regex might be exploited to advantage, however I am unlikely to devote the time necessary to attack this problem for the next few years, and so in the very best tradition, I leave it as an exercise for the reader.

Mailbox length

{1,64}

Something that actually does matter is the maximum length of a mailbox: 64 characters. So after we close the group of alternates, and just before the final parenthesis that closes the mailbox capture group, we use a quantifier between curly braces to specify that we must match any of our alternates at least one time and no more than 64 times.

atSign

\s?(?<atSign>(?<!\-)(?<!\.)\@(?!\@))

The delimiter chunk starts off with the special case \s? because according to Futility a space is legal just before the delimiter and I am just taking their word for it. The rest of the capture group follows a similar pattern to singleDot: it will not match if preceded by a dot or a dash, or if followed immediately by another @.
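Here is a quick sketch of the delimiter chunk in Python, using re.search so that the chunk can be tried in isolation. The candidate strings are made up for illustration.

```python
import re

# Optional space, then an @ that is not preceded by "-" or "." and not followed by "@".
AT_SIGN = re.compile(r"\s?(?P<atSign>(?<!-)(?<!\.)@(?!@))")

candidates = [
    "john@example.com",
    "john.@example.com",
    "john-@example.com",
    "john @example.com",
]
for candidate in candidates:
    print(candidate, bool(AT_SIGN.search(candidate)))
```

On its own the chunk only inspects the characters next to the delimiter; a doubled @ is only fully excluded once the mailbox and domain chunks surround it, because neither of them can consume the extra @.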

Domain name

Here, as in the mailbox, we have 3 alternate matches. And the last of these has nested in it another 4 alternate matches.

Standard dns parsable

(?<dns>[[:alnum:]]([[:alnum:]\-]{0,63}\.){1,24}[[:alnum:]\-]{1,63}[[:alnum:]])

This will not pass several of the tests in Futility, but as mentioned earlier, it complies strictly with RFC 5321 which has the final word.
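Python's built-in re module does not understand POSIX bracket classes such as [[:alnum:]], so the sketch below substitutes the explicit ASCII range A-Za-z0-9, which is an assumption that matches the RFC 5321 restriction to ASCII letters and digits. The test domains are invented for illustration.

```python
import re

ALNUM = "A-Za-z0-9"  # ASCII stand-in for [:alnum:]
DNS = re.compile(
    rf"(?P<dns>[{ALNUM}]([{ALNUM}\-]{{0,63}}\.){{1,24}}[{ALNUM}\-]{{1,63}}[{ALNUM}])"
)

def is_dns_domain(text):
    """True if the whole string matches the translated dns chunk."""
    return DNS.fullmatch(text) is not None

for domain in ["example.com", "mail.example.co.uk", "-bad.example.com", "münchen.de"]:
    print(domain, is_dns_domain(domain))
```

The non-ASCII domain is rejected, which is exactly the behaviour that makes this chunk fail some of the Futility tests.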

IPv4

(?<IPv4>\[((?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\])

There is not too much to say about this. This is a well-known and easily available regex for IPv4 addresses.
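For completeness, a short sketch of the IPv4 chunk in Python. The octet sub-pattern is factored into a variable (OCTET, an illustrative name) purely for readability; the resulting pattern is the same as the one above apart from the (?P<name>...) spelling of the named group.

```python
import re

# One decimal octet, 0 to 255.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
# Four octets separated by dots, wrapped in square brackets.
IPV4_LITERAL = re.compile(rf"(?P<IPv4>\[({OCTET}\.){{3}}{OCTET}\])")

def is_ipv4_literal(text):
    """True if the whole string is a bracketed IPv4 address literal."""
    return IPV4_LITERAL.fullmatch(text) is not None

for literal in ["[192.168.0.1]", "[256.1.1.1]", "192.168.0.1", "[1.2.3]"]:
    print(literal, is_ipv4_literal(literal))
```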

IPv6

(?<IPv6>(?<IPv6Full>(\[IPv6(\:[0-9a-fA-F]{1,4}){8}\]))|(?<IPv6Comp1>\[IPv6\:((([0-9a-fA-F]{1,4})\:){1,3}(\:([0-9a-fA-F]{1,4})){1,5}?\])|\[IPv6\:((([0-9a-fA-F]{1,4})\:){1,5}(\:([0-9a-fA-F]{1,4})){1,3}?\]))|(?<IPv6Comp2>(\[IPv6\:\:(\:[0-9a-fA-F]{1,4}){1,6}\]))|(?<IPv6Comp3>(\[IPv6\:([0-9a-fA-F]{1,4}\:){1,6}\:\]))|(?<IPv6Comp4>(\[IPv6\:\:\:)\])|(?<IPv6v4Full>(\[IPv6(\:[0-9a-fA-F]{1,4}){6}\:((?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3})(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\])|(?<IPv6v4Comp1>\[IPv6\:((([0-9a-fA-F]{1,4})\:){1,3}(\:([0-9a-fA-F]{1,4})){1,5}?(\:((?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))\])|\[IPv6\:((([0-9a-fA-F]{1,4})\:){1,5}(\:([0-9a-fA-F]{1,4})){1,3}?(\:((?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))\]))|(?<IPv6v4Comp2>(\[IPv6\:\:(\:[0-9a-fA-F]{1,4}){1,5}(\:((?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))\]))|(?<IPv6v4Comp3>(\[IPv6\:([0-9a-fA-F]{1,4}\:){1,5}\:(((?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))\]))|(?<IPv6v4Comp4>(\[IPv6\:\:\:((?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3})(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\]))

I was unable to find a good regular expression for IPv6 (and IPv6v4) addresses, so I wrote my own, carefully following the Backus/Naur notated rules from RFC 5321. I will not annotate every sub group of the IPv6 regex, but I have named every subgroup to make it easy to pick apart and see what is going on. Nothing too interesting really except maybe the way I combined greedy matching on the “left” side and non-greedy on the “right” in the IPv6Comp1 capture group.
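A regex this long benefits from an independent cross-check when you build test cases. The sketch below is not the regex approach at all: it swaps in Python's standard ipaddress module to parse whatever sits between the "[IPv6:" tag and the closing bracket. The function name is_ipv6_literal is illustrative, and because ipaddress follows general IPv6 text rules rather than the RFC 5321 grammar, the two may disagree at the edges.

```python
import ipaddress

def is_ipv6_literal(text):
    """True if text looks like [IPv6:<address>] and the address parses."""
    prefix, suffix = "[IPv6:", "]"
    if not (text.startswith(prefix) and text.endswith(suffix)):
        return False
    try:
        ipaddress.IPv6Address(text[len(prefix):-len(suffix)])
    except ValueError:  # AddressValueError is a subclass of ValueError
        return False
    return True

for literal in ["[IPv6:2001:db8::1]", "[IPv6:::ffff:192.0.2.1]", "[IPv6:2001:db8::gggg]", "[2001:db8::1]"]:
    print(literal, is_ipv6_literal(literal))
```

Any literal on which the regex and this check disagree is a good candidate for a closer look at the relevant named subgroup.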

The full monty

I have saved the final regex, along with the test data from Futility, and enhanced by some IPv6 test cases of my own, to Regex101. I hope you have enjoyed this article and that it proves to be useful and a time saver for many of you.

AZW

AI and the Consciousness Gap

AI means a lot of things to a lot of people. Usually what it means is not very well thought out. It is felt, it is intuited. It is either adored, worshipped or deemed blasphemous, profane, to be feared.

In this article, I explore what society at large really means by artificial intelligence as opposed to what researchers or computer scientists mean. I want to clarify for the non-technical audience what can realistically be expected from AI, and more importantly, what is just unrealistic pie-in-the-sky speculation.

I am worried that blind fear (or in some cases worship) of AI is being used to manipulate society. Politicians, business people, and media personalities craft narratives around AI that stir up deep emotions that they use to their advantage. Meanwhile the truth is only to be found in dense technical literature that is out of reach for the ordinary person.

What do we mean by intelligence?

Is intelligence a purely human characteristic? Most people will consider some dogs to be more intelligent than others. Or dogs to be more intelligent than Guinea pigs, so clearly intelligence is something that an animal can have.

If a dog can have intelligence, can a bird? How about an earthworm, or a plant? Where do we draw the line? There are many definitions of intelligence, but one that I like (from Wikipedia) is:

“the ability to perceive or infer information, and to retain it as knowledge to be applied towards adaptive behaviors within an environment or context.”

Seems reasonable? Some people argue that by this definition even plants can be intelligent. So why not computers?

This is the intelligence to which AI researchers generally refer. Yet the reality is that when average people speak of or think about AI, they are not thinking about plants or animals. Most people would not get too excited one way or another about the idea that a computer might be able to operate at the level of a plant, a guinea pig, or even a dog.

Equally, people do not really care if a machine can do what its creator intends it to do; what it is programmed to do. Handguns are created to kill people, and they do, yet no one worries that handguns will develop consciousness and kill all the humans, as they are “programmed” to do.

If we are going to be honest, we must admit that what the average person means by “artificial intelligence” is really “artificially like a human”, and what they worry about (or celebrate) is the possibility that AIs could spontaneously come up with motivations of their own.

The human factor

Motivation is key; human motivation has a different quality than that of all other animals, and this difference is arguably what makes humans unique.

The great mathematician and computer pioneer Alan Turing suggested that instead of asking what it is to be human, which he considered a practical impossibility, we could simply say: “if it walks like a duck and quacks like a duck.”

Turing’s proposal was that if a computer program could pass a written interview without the interviewer realizing that they were interacting with a computer, then the computer could be said to be for all practical purposes an artificial intelligence that was human in character. However this does not really answer the question of whether a computer program could make the same kind of complex decisions in real time that a human does, whether it can have motivations that were not preprogrammed into it.

Above all, a written interview encompasses only a very small part of human experience. We live in a world of action, and many actions have very real and immediate consequences. The decision of whether or not to walk down a dark alley is a complex one, and many of the factors that influence our decision are unknown. Every day, we make decisions without knowing all the data, and the fact that the human race survives and thrives is indisputable proof that on the whole we are amazingly successful at doing this, if one defines success as survival.

Value making machines

How we pull this trick off is a matter of debate. But most serious scholars would agree that it is almost unavoidably the case that the human propensity for assigning value is at the heart of things.

Humans are “value making machines” and we do it at a speed that no computer network could hope to match. If you hear an unexpected sound, you will have a reflex action in 170 milliseconds. To put this in perspective: it is 30 milliseconds faster than Google’s recommended time for the first byte of data to hit your browser after you click on a link, and Google recommends a 500 millisecond time for the page to finish loading after the first byte.

Loading a web page is a trivial action for a computer, and still it takes more than twice as long for the computer to do this as it does for your brain to figure out whether an unexpected noise is threatening (a snapping twig or a metallic sound) or delightful (ice cream truck bells or a child’s laughter), which, by the way, it does instantly and with virtually no data at all.

Identifying something as threatening or delightful is what we mean by assigning value, and it is something that we do instinctively, automatically, “without thinking”. Yet without doing it, we would be unable to “think” as we know it.

Say you are thirsty; you must solve the problem of what to drink, then you must solve the problem of getting it.

Suppose your choice of what to drink is between pond water and fresh sparkling well-water. You are thinking right now that obviously we want the well-water. But what if you are an escaped prisoner, and the pond water is safely out of sight in the forest, and the well is in a town square where you might be seen? Now you are thinking pond water. You are motivated by thirst, but you are more highly motivated to remain free. This is because the value you place on staying free is much higher than the value you place on the freshness of your water.

Human motivation is completely dependent upon values. A thirsty dog will simply drink from the first available water it finds, because its motivation is survival, unaffected by values.

Let’s consider an example unrelated to survival: would you push a button that would definitely kill one person, or would you refuse to push it even if it meant that 10 people might die? Your thinking about this question is not in any way related to your own physical survival, yet it has a moral urgency that few people could deny. Your thinking will be entirely directed by the values that you assign while evaluating your options.

Where do these values come from? 

The truth is that we don’t know.

For some it is a question of faith. Sincerely religious people believe that our values are a reflection of God’s will. The few people who truly believe in evolutionary theory and all of its implications would say that our values are those which allow us to survive and which consequently perpetuate themselves.

Almost everyone else holds the view that values are a self-evident truth. That we have these values because we all have them, that it is just obviously so. As Sam Harris puts it:

“When we really believe that something is factually true or morally good, we also believe that another person, similarly placed, should share our conviction”.

As far as a claim to understanding goes, this is weak tea. It provides no useful grounding for explaining why some values are culturally dependent, while others seem to be universal or nearly so. It does not explain the origin of values, and so for the most part, the religious and the evolutionists notwithstanding, we have no useful explanation of how values work.

And it is in values that we find the consciousness gap. In the above example you are motivated by thirst, a biological factor. An AI, equipped with the appropriate sensors, could also have that sort of motivation, for example the need to charge a battery that is running low.

But what motivates you to take a picture of something beautiful to share with your loved ones, or to argue politics with your friends? What motivates you to watch a scary movie or to learn a sport? What about the button pushing example? On what basis could a computer make a decision like that without a human first providing values such as one life is worth less than ten, or that killing is wrong no matter what the circumstances? Our morality, our ability to assign value in the blink of an eye, has evolved over millions of years. How would a computer (in a single generation mind you, because they have no natural reproductive mechanism) arrive at a moral grounding on its own, without a blueprint first being provided by a human?

Would we even want it to? Human history suggests that tens of thousands of generations were required before we arrived at what we now consider civilized behaviour. Would we really want a race of AIs to make the same slow and painful march towards civilization?

Sure we could give them a “jump-start” but then we are back to the consciousness gap: if we program in the jump start, basically programming in the values of some human, can the AI be said to truly have human-like consciousness, capable of spontaneous motivations? As you may have guessed, I believe the answer is probably not.

Wherein the danger

This is not to say that AI is totally and completely uninvolved with anything dangerous or any area for concern. The use of AI is no more and no less subject to the law of unintended consequences than any other field of human endeavour. We can be absolutely certain that the use of AI for things such as curating the content of your social media feed will lead directly to unforeseen results that many people do not like. 

This is not due to any inherent quality of AI. It is the nature of the world in which we live. It is inherent to human decision making, and whether that decision is to import Cane Toads into Australia (to control pests), or to use AI to make automated stock purchases, the overwhelming probability is that catastrophe will ensue.

Party tricks

We cannot imagine trying to present the “button problem” to a dog. We have no reason to believe that a dog would have any moral framework within which to make the problem relevant, no reason to believe that a dog would care one way or another.

We have no more reason to believe that we could present this problem to an AI than to a dog. We have no evidence at all that any AI anywhere has a moral framework guided or informed by the sort of spontaneous value-making that marks humans. It is irrelevant that computers can calculate faster than a human or can predict certain classes of problems faster or better. Winning a chess game against a human is a landmark in programming and computer science, but it is of little consequence in the real world. The ability to calculate a very large number of permutations within a very strict, and very limited, set of rules is in no way indicative of general intelligence or the ability for a computer to develop consciousness.

Computers being able to triage and diagnose medical conditions as well as — or possibly better than — human doctors is also not quite as impressive as it sounds. To be sure it is very helpful to automate checklists and decision trees; in a medical emergency a computer is faster than cracking a book. And it is true that without these checklists, humans are prone to all sorts of perceptual and cognitive biases, but you would be very wrong if you assumed that the AI that can do triage can also decide if the shadow in front of your car is a cardboard box or a child on a tricycle.

Sensationalism sells. The sky is always falling, and the failure of last week’s prophecies of Armageddon (the “Y2K bug” or New York City’s West Side Highway under water by 2019) never seems to slake people’s thirst for this week’s prediction of impending doom. The same can be said for sensationalist idealism. The repeated failure of utopian ideologies to actually produce the predicted earthly paradise seems in no way to hinder the convictions of the true believers that this time will be different.

AI is nowhere near as exciting, or mysterious, or dangerous, or magnificent, as its promoters and detractors would have you think. It has become for the most part a branch of the mathematics of probability, glorified actuarial work. It has all the sex appeal of an otaku or anorak lecturing on their favorite subject. It is a one trick pony.

In short: it is nothing to worry about or get excited about. We return you to our regularly scheduled program. 

Framework or language?

An addendum to my very personal history of programming

Programmers today…

…do not really know where the language stops and the framework begins.

What do I mean by this?

Up until about 1988 most programs that a person (like you) would use had been programmed from the ground up by a handful of programmers (often just one) using a 3GL (3rd generation language). The key words being: from the ground up.

As I explain in the first article of this series, 3GLs abstract assembly or machine language into reserved words.¹ A programming language is a collection of reserved words and some rules about grammar that restrict how one can use those words in a way that will not confuse the compiler (which expands the words into a series of machine language instructions). Together, this is referred to as the language’s syntax.

SmallTalk has only 6!

So back to pre-1988 software, if you look at the code of a program written before that time, the only words you will see besides the handful of reserved words are the names of variables and functions that the programmers created. This is what most people think of as programming.

The image comparison code (written in C) below uses two reserved words: for and double. The two underlined words (fabs and printf) are functions from included libraries (math.h and stdio.h). When a library is included like this it is called a dependency. All of the other words in this program are either variables or comments written by the programmer.

By simple word count, 90% of this code was written by the programmer. Less than 10% of the code is someone else’s, 4% from the language and 6% from the two libraries.

/* Accumulate the per-channel absolute difference, normalized to 0..1 */
for(x=0; x < im1->width; x++)
{
   for(y=0; y < im1->height; y++)
   {
      totalDiff += fabs( GET_PIXEL(im1, x, y)[RED_C] - GET_PIXEL(im2, x, y)[RED_C] ) / 255.0;
      totalDiff += fabs( GET_PIXEL(im1, x, y)[GREEN_C] - GET_PIXEL(im2, x, y)[GREEN_C] ) / 255.0;
      totalDiff += fabs( GET_PIXEL(im1, x, y)[BLUE_C] - GET_PIXEL(im2, x, y)[BLUE_C] ) / 255.0;
   }
}
/* Report the average difference as a percentage */
printf("%lf\n", 100.0 * totalDiff / (double)(im1->width * im1->height * 3));

Now let's look at the Java version of the same function:

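import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public enum ImgDiffPercent {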
    ;
 
    public static void main(String[] args) throws IOException {
        // https://rosettacode.org/mw/images/3/3c/Lenna50.jpg
        // https://rosettacode.org/mw/images/b/b6/Lenna100.jpg
        BufferedImage img1 = ImageIO.read(new File("Lenna50.jpg"));
        BufferedImage img2 = ImageIO.read(new File("Lenna100.jpg"));
 
        double p = getDifferencePercent(img1, img2);
        System.out.println("diff percent: " + p);
    }
 
    private static double getDifferencePercent(BufferedImage img1, BufferedImage img2) {
        int width = img1.getWidth();
        int height = img1.getHeight();
        int width2 = img2.getWidth();
        int height2 = img2.getHeight();
        if (width != width2 || height != height2) {
            throw new IllegalArgumentException(String.format("Images must have the same dimensions: (%d,%d) vs. (%d,%d)", width, height, width2, height2));
        }
 
        long diff = 0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                diff += pixelDiff(img1.getRGB(x, y), img2.getRGB(x, y));
            }
        }
        long maxDiff = 3L * 255 * width * height;
 
        return 100.0 * diff / maxDiff;
    }
 
    private static int pixelDiff(int rgb1, int rgb2) {
        int r1 = (rgb1 >> 16) & 0xff;
        int g1 = (rgb1 >>  8) & 0xff;
        int b1 =  rgb1        & 0xff;
        int r2 = (rgb2 >> 16) & 0xff;
        int g2 = (rgb2 >>  8) & 0xff;
        int b2 =  rgb2        & 0xff;
        return Math.abs(r1 - r2) + Math.abs(g1 - g2) + Math.abs(b1 - b2);
    }
}

Everything in bold is a keyword, everything underlined is a function from an imported library. This listing is about 65% code written by the programmer, 20% code from the language syntax, and 15% dependencies upon external libraries.
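For anyone who wants to reproduce this kind of tally, here is a minimal sketch in Java. It is not the method used for the figures above. The class name WordShare, the two category sets, and the one-line sample are assumptions made purely for illustration. It counts identifiers only, so a real tally would also have to decide what to do with comments and string literals.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordShare {
    // Hypothetical category sets, sized for the C fragment shown earlier.
    static final Set<String> KEYWORDS = Set.of("for", "double");
    static final Set<String> LIBRARY = Set.of("fabs", "printf");

    // A "word" here is anything that looks like a C or Java identifier.
    static final Pattern WORD = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    /** Returns word counts in the order {language, libraries, programmer}. */
    static int[] tally(String source) {
        int[] counts = new int[3];
        Matcher m = WORD.matcher(source);
        while (m.find()) {
            String word = m.group();
            if (KEYWORDS.contains(word)) {
                counts[0]++;
            } else if (LIBRARY.contains(word)) {
                counts[1]++;
            } else {
                counts[2]++; // everything else is assumed to be the programmer's
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample line, not taken from the listings above.
        String sample = "for (x = 0; x < w; x++) total += fabs(a - b); printf(fmt, (double) total);";
        int[] c = tally(sample);
        int all = c[0] + c[1] + c[2];
        System.out.printf("language %d%%, libraries %d%%, programmer %d%%%n",
                100 * c[0] / all, 100 * c[1] / all, 100 * c[2] / all);
    }
}
```

The interesting part is not the arithmetic but the two sets at the top: to fill them in correctly you have to know which words belong to the language and which belong to a library.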

I am not saying that this is a bad thing. I am merely making a statement of fact. As it happens, including code from libraries is an important productivity enhancer. There is no justifiable reason for the average programmer to re-invent the wheels of the language: reserved words, library classes, and functions. So in most modern code this trend has only accelerated, with less and less code written by the programmer and more and more written by someone else in the form of a library. And although this is not inherently a bad thing, many agree that things have gone too far.

And magic happens…

In early 2016, what felt like half of the Internet broke because a programmer removed an 11-line program called left-pad from a public repository called npm. It turned out that some of the biggest, most used JavaScript frameworks in the world included a dependency on left-pad rather than type out the ten lines of code below:

function leftpad (str, len, ch) {
    str = String(str);
    var i = -1;    
    if (!ch && ch !== 0) ch = ' ';
    len = len - str.length;
    while (++i < len) {
        str = ch + str;
    }
    return str;
}

Another npm package called isArray had 18 million downloads in February of 2016, and is a dependency for 72 other npm packages. Behind those 18 million downloads, and those 72 packages, are programmers who used an include rather than type this one line of code:

return toString.call(arr) == '[object Array]';

Now I’m just a country boy, but to me this pretty clearly indicates that the programmers that created these 72 npm packages either had the most twisted sense of humor I have ever seen, or really had no idea of what was in isArray and how JavaScript actually works. I take it to be an example of cargo cult programming at its most extreme.

To further drive home the point that most modern programmers blindly use class libraries without understanding what is in them, I refer you to Jordan Scales's sobering (and depressing) account of his personal reaction to the left-pad fiasco.


Get off my lawn


So where am I going with all of this?

My point is that “programming” as the average person imagines it hardly exists today. The only programmers “writing code” in the form of new algorithms are either working at very big Internet companies, or are writing specialized image, video, or sound processing software for a startup.

The armies of “kids today” working in the salt mines of corporate and government IT are doing something else entirely. The coding they do is the software equivalent to meme creation and social media posts, complete with post-modern pop culture references. Only instead of recycling pictures of Clint Eastwood, Good Guy Greg, or Scumbag Steve, they are cutting and pasting code and indiscriminately using libraries such as left-pad or isArray. They do not really know where the language ends and the framework begins. It is all just one big soup to them.

And although I am not a “kid”, I am scarcely better myself. I describe myself as a cargo cult programmer (reluctantly but honestly). Some of you may be familiar with the epic Story of Mel. To read The Story of Me just buy my book.


[1] In ALGOL, FORTRAN, and PL/1 there are no reserved words, only keywords. The difference is not really that important in the context of this article. In this article I will use reserved words to refer to both.


Pyramids, salt mines, or sausage factories?

Working in a modern day IT shop

“Most software today is very much like an Egyptian pyramid with millions of bricks piled on top of each other, with no structural integrity, but just done by brute force and thousands of slaves.”


Alan Kay spoke these words in an interview that he gave to the ACM’s Queue magazine in 2004. Things have only become more so since then.

For this article, let’s agree that when Kay says “most software today” he is not talking about Google, Facebook, or Netflix (especially not in 2004), and in this article I am excluding them as well. As huge as Google’s codebase is, it is a drop in the bucket. By some estimates, an equivalent amount of new code is brought into the world every week, and most of that code is below the radar.

A few hundred thousand lines added to your banking app so that you can deposit a cheque by taking a photo, multiplied by all of the banks in the world.

A few thousand lines to allow a customer to process a merchandise return, multiplied by all of the suppliers in the world.

These billions of lines of code are not being written by Steve Yegge, or Chris Coyier, or Arun Gupta, or Ethan Marcotte, or Fredrik Lundh, or anyone else whose blog posts you are reading. They are being written by these developers, in this workplace…

Convergys office. Photo from Glassdoor

… and many similar workplaces around the world.

Salt mines

Kay uses the image of slaves and pyramids. The image most often in my mind is that of the salt mines. For most of human history salt was an almost priceless luxury. Salt mining — often done by slave or prison labor — was one of the most dangerous occupations in a world where life for most people was already nasty, brutish, and short.

Nobody is physically dying in these modern day salt mines, but there is a lot of misery.

In these workplaces, success is elusive, and most humans require at least occasional success to make them feel as though they are making a positive difference.

This is not trivial.

Large government and enterprise IT development projects are very rarely unmitigated successes. Serious research indicates that on the average, only 30% of projects are considered total successes by the executives who are paying for the work, and that there is another 20% that are not considered total failures (but still failures to some degree).

Now add to this the fact that successes and failures are not evenly distributed. Some (very very few) IT shops are generally successful, and account for most of the successes. For the rest, failure (whether partial or total) is the norm.

How do you think the internal discussions around these failed projects go? Do you imagine that they are pleasant for anyone? In the failing IT shops, most team meetings are about impossible targets, missed deadlines, budget overruns, and failed deployments. I have spent most of my career in these salt mines and I don’t remember many project meetings that started with the words “high-fives all around”.

The misery is real.

In fact, true success is so elusive that on the few occasions when one of my projects was wildly, amazingly, successful, senior executives would refuse to approve a celebration because they were waiting for the other shoe to drop. Recognition would finally come many months later, with some bloodless ceremony in a quarterly divisional meeting where the presenting executive would inevitably get some important fact about the project wrong – usually leaving out one or more team members – creating a demoralizing effect far exceeding any feelings of goodwill that the tepid ceremony might have otherwise produced.

Sausage factories

I am, of course, referring to the timeless aphorism:

“Laws are like sausages. It’s better not to see them being made.”

Believe me when I tell you that the same is true of enterprise software applications.

The vast majority of IT workers are not permanent employees of the companies they are coding for. Typically they work for one of several contractors who are in heated competition with each other within the account. They have the Sisyphean task of working on code that has dependencies on, and conflicts with, code that other developers (working for one of the competing contractors) are working on.

Everything breaks constantly, and everyone is pointing the finger at everyone else. It is a miracle that anything works at all, and the only way it does is by driving it all down to the lowest common denominator, and liberally applying chewing gum and baling wire (or duct tape if you prefer).

If engineers built bridges the way these companies build software, 90% of the bridges in the world would look like this:

or maybe this:

Why?

Because these bridges are built by hand, not industrially. This is the way we build software today.

Is there any hope?

Take a look at what industrial bridge production looks like:

http://gph.is/1YlwmSW

One day enterprise application development will look like this too; dropping in prefabricated segments that were carefully engineered to fit together perfectly. If we can do it for a 200 ton concrete slab, 90 feet in the air, we can figure out how to do it in software.

And which bridge construction crew would you rather be working on? Unless you are an atavistic lunatic with a death wish: the industrialized one.

I cannot exaggerate the levels of true despair I have witnessed in the salt mines of enterprise IT. Tens of millions of workers suffer through every day because IT is a good living, a good way to make decent money, but they are not motivated, not happy, not proud of what the team has built.

Industrializing IT will improve the lives of millions of people in a meaningful way. And it will happen. The demand for applications far exceeds the supply of experienced and competent developers to create them. Throughout history, the solution to this problem has been industrialization, which allows unskilled or semi-skilled labor to produce high quality results. Software applications will not be a historical exception.

This is going to happen.

Interpreted, compiled. what. ever.

CPUs speak only one language; they don’t care what language you write in


Programmers tend to make a big deal over the supposed difference between compiled languages and interpreted ones. Or dynamic languages vs. statically typed languages.

The conventional wisdom goes like this: A compiled language is stored in machine code and is executed by the CPU with no delay, an interpreted language is converted to machine language one instruction at a time, which makes it run slowly. Dynamic languages are slower because of the overhead of figuring out type at runtime.

The reality is nothing like this

A long long time ago, when programmers wrote to the bare metal and compilers were unsophisticated, this simplistic view may have been somewhat true. But there are two things that make this patently false today.

The first thing is that the popular programming languages these days (Java, JavaScript, Python, Scala, Kotlin, the .NET languages, Ruby) all run on virtual machines. The virtual machines themselves are written in a variety of languages*.

The second thing is that VMs make it easier to observe the program’s execution, which in turn makes JIT (Just in Time) compilation possible.

So how about a real example that illustrates why this matters: Java.

Java is compiled. Or is it? Well… sort of not really. Yes there is a compiler that takes your source and creates Java bytecode, but do you know what the JVM does with that bytecode as soon as it gets it?

First it builds a tree. It disassembles the Java bytecode and builds a semantic tree so that it can figure out what the source code was trying to do. In order to accomplish this it must undo all of the snazzy optimizations that the compiler has so carefully figured out. It throws away the compiled code.

That sounds crazy! So why does it do this? The answer to that question is the key to understanding the title of this article.

The best way to understand code is to watch it running

This applies to humans, but it applies just as well to compilers. The reason why the modern JVM undoes the compilation and the optimizations is that “conventionally” compiled Java bytecode runs too slowly on the JVM. To attain the speed of execution for which Java is known these days, the JIT has to throw away the code that was statically compiled (and statically optimized), “watch” the code running in the JVM, and make optimizations based on the code’s actual behaviour at run time**.
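One rough way to see this “watching” at work is to time the same method over several batches of calls. The sketch below is only an illustration: the class name WarmUp, the method names, and the batch sizes are my own assumptions, and a single run like this is nowhere near a rigorous benchmark. On a JIT-enabled JVM the later batches often run faster than the first, but the numbers depend entirely on the JVM, its settings, and the hardware.

```java
public class WarmUp {
    // A deliberately simple hot loop; a JIT may compile it once it has run often enough.
    static long sumOfSquares(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += (long) i * i;
        }
        return total;
    }

    // Runs one batch of calls and returns the elapsed time in nanoseconds.
    static long timeBatch(int calls, int n) {
        long sink = 0;
        long start = System.nanoTime();
        for (int c = 0; c < calls; c++) {
            sink += sumOfSquares(n);
        }
        long elapsed = System.nanoTime() - start;
        // Use the result so the loop cannot be discarded as dead code.
        if (sink == 42) {
            System.out.println(sink);
        }
        return elapsed;
    }

    public static void main(String[] args) {
        // Same work in every batch; only the JVM's knowledge of the code changes.
        for (int batch = 1; batch <= 5; batch++) {
            System.out.printf("batch %d: %d ns%n", batch, timeBatch(1_000, 10_000));
        }
    }
}
```

On HotSpot you can run the same class with the -Xint option, which disables the JIT and interprets everything, and compare how the batch times behave.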

Don’t go around saying things like “[insert language name here] is too slow because it is interpreted” because that is simply not true. Languages are just syntax and can have many implementations (for example there are several implementations of Java). You could say “[insert language name here] is slow because the interpreter/VM doesn’t have JIT” and that might be true.

Languages are not fast or slow

C++ is not a faster language. C++ runs fast simply because more hours have been spent on the compiler optimizations than for any other language, so of course they are better. It is worth noting that programs compiled using PGC++ and Clang are regularly 2 or 3 times faster than the same source code compiled using the AOCC compiler. This is proof that it is the compiler and its optimizations — not the language itself — that dramatically affect execution performance.

Java is generally considered next fastest, and that is because it has had more hours invested in its JIT compiler than anything except C/C++.

Framework or Language redux

But it is not all down to the compiler. I have already written about the dangers of unexamined libraries and frameworks. The article could also have been titled Syntax or Library: Do you know which one you are using? I asked a trusted friend to review this article, and he brought up the very good point that memory access patterns are the big performance culprit overall. He made the point that C programs benefit from the fact that…

C only has primitive arrays and structs (but it’s very liberal with pointers so you can pretty much wrangle anything into these primitive structures with enough patience). Working with hashmaps or trees can be very painful, so you avoid those. This has the advantage of nudging you towards data layouts that make effective use of memory and caches. There’s also no closures and dynamically created functions, so your code and data mostly stay in predictable places in memory.

Then look at something like Ruby, where the language design encourages maximum dynamism. Everything tends to be behind a hashmap lookup, even calling a fixed unchanging method on an object (because someone might have replaced that method with another since you last called it). Anything can get moved or wrapped in yet another anonymous closure. This creates memory access patterns with a scattered “mosaic” of little objects scattered all over the place, and the code spends its time hunting each piece of the puzzle from memory which then points to the next one.

In short, C encourages very predictable memory layouts, while Ruby encourages very unpredictable memory layouts. An optimizing compiler can’t do much to fix this.

I had to agree. Which led me to articulate my point thusly: Programmers who do not understand where syntax stops and libraries begin are doomed to write programs whose execution they do not really understand.

My belief is that it is more difficult to write a truly awful C program because (if it runs at all) it would be too much work to manually reproduce the memory chaos that Ruby so casually produces.
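The difference between the two layouts can be sketched without leaving a single language. The example below is a hedged illustration in Java, not a benchmark and not Pauli's code: the class name Layouts, the method names, and the element count are all assumptions. It sums the same numbers twice, once from a primitive array (one contiguous block of memory) and once from a HashMap of boxed values (a hash lookup plus pointer chasing for every element).

```java
import java.util.HashMap;
import java.util.Map;

public class Layouts {
    // Contiguous primitives: one block of memory, read front to back.
    static long sumArray(long[] values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    // Boxed values behind hash lookups: every get() hashes the key and follows references.
    // Assumes the map holds an entry for every key in 0..n-1.
    static long sumMap(Map<Integer, Long> values, int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += values.get(i);
        }
        return total;
    }

    public static void main(String[] args) {
        int n = 1_000_000; // arbitrary size chosen for illustration
        long[] array = new long[n];
        Map<Integer, Long> map = new HashMap<>();
        for (int i = 0; i < n; i++) {
            array[i] = i;
            map.put(i, (long) i);
        }

        long t0 = System.nanoTime();
        long a = sumArray(array);
        long t1 = System.nanoTime();
        long b = sumMap(map, n);
        long t2 = System.nanoTime();

        System.out.printf("array: sum=%d in %d us%n", a, (t1 - t0) / 1_000);
        System.out.printf("map:   sum=%d in %d us%n", b, (t2 - t1) / 1_000);
    }
}
```

Both methods compute the same sum with the same syntax-level loop; what differs is where the data lives. Timings from a single cold run like this are crude and will vary from machine to machine, so treat them as a hint rather than a measurement.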

We then had an interesting chat about how a certain large tech company created a “cleaned-up PHP++”. He has some interesting things to say; maybe he will write an article about that.

Thank you for your help Pauli.

So an implicit part of my contention is that modern programming languages lower the bar so that many programmers do not think about basic computer science (memory structures and computational complexity), and therefore have no basis upon which to understand how their programs will execute.

The other part of my contention is that any Turing complete language could run about as quickly as any other when considered from a pure syntax perspective. For example I believe that it would be absolutely possible to create a high performance implementation of Ruby on, let’s say, the JVM. I readily acknowledge that most current Ruby code would break on that system, but that is a result of the programming choices made in the standard libraries, not a fundamental constraint of the language (syntax) itself.

I have to say (as a self-admitted cargo cult programmer) that it is definitely possible that I just don’t understand Ruby syntax and/or the Church–Turing thesis.

CPUs (or VMs) speak only one language

Ruby and its shotgun approach to memory management notwithstanding, any programming language that had as many hours invested in optimizing its compilation as Java or C would run just as fast as Java or C. This is because CPUs (and VMs) speak only one language: machine language. No matter what language you write in, sooner or later it gets compiled to machine language, and the things that affect performance are how fundamental computer science principles are implemented in the standard libraries and how effective the compilation is, not when it happens.

The moral of the story is that programmers should spend less time crushing out on languages and more time understanding how they work under the hood.

I will finish with a quote from Alan Perlis’s foreword to Abelson and Sussman’s Structure and Interpretation of Computer Programs:

“…computers are never large enough or fast enough. Each breakthrough in hardware technology leads to more massive programming enterprises, new organizational principles, and an enrichment of abstract models. Every reader should ask himself periodically ‘‘Toward what end, toward what end?’’ — but do not ask it too often lest you pass up the fun of programming for the constipation of bittersweet philosophy.”


* The original Sun JVM was written in C. The original IBM JVM was written in Smalltalk. Other JVMs have been written in C++.
The Java API (the class libraries without which most Java programmers would be unable to make even a simple application) is written in Java itself for the most part.

** It is worth noting that this happens again when the machine language of the VM “hits” the hardware CPU, which immediately takes a group of instructions, breaks them apart, looks for patterns it can optimize, and then rebuilds it all so it can pipeline microcode.