I'm Matthew Setter. I'm a security researcher, privacy advocate, and a software engineer. I’ve been developing software since 2000. This blog is focused on helping you write more secure software and protect your online privacy.
Why We Have To Restart the Mantra of High Performance Applications
Regular Expressions / Development / High Performance - September 3rd, 2015
There was a time when it was normal to expect that your application would be as performant as possible. But over the last decade or so, with the advent of cheap hardware, applications have become larger and more capable, yet demand far more resources than they once did. Sure, they give us more, but at what cost? Today, filtered through the lens of one of the tougher programming concepts, regular expressions, I consider the partly lost art of efficient programming, the potential reasons for and against it, and the excuses people make to avoid doing what they should.
Recently, as I've been building the application which supports my podcast's website, I found that I needed to do an increasing amount of work with strings and text.
The reason is that the site's data is based on text files, each a combination of YAML front matter and a Markdown body. You can see a sample one in the GitHub repository.
The YAML front matter stores meta information such as the published date, slug, title, guest list, audio file size and so on. The Markdown content contains the show synopsis, description, and related links.
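To make the file layout concrete, here is a minimal sketch of separating the front matter from the Markdown body. It assumes the common convention of wrapping the front matter in lines containing only `---`, which may not be exactly how the site's files are delimited. The sample content, field values, and the `split_front_matter` name are all made up for illustration.

```python
import re

# Hypothetical sample file; the values are invented for illustration.
SAMPLE = """---
title: An Example Episode
slug: an-example-episode
---
### Synopsis

A short description of the show.
"""

# Assumption: front matter sits between two '---' lines at the very top of the file.
FRONT_MATTER = re.compile(r"\A---[ \t]*\n(.*?)\n---[ \t]*\n(.*)\Z", re.DOTALL)


def split_front_matter(text):
    """Return (front_matter, body); front_matter is '' when none is found."""
    match = FRONT_MATTER.match(text)
    if match is None:
        return "", text
    return match.group(1), match.group(2)


front, body = split_front_matter(SAMPLE)
print(front)  # the raw front matter, ready to hand to a YAML parser
print(body)   # the Markdown body
```

The front matter string would then go to a YAML parser and the body to a Markdown renderer; neither step is shown here.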
The more specific reason why I have had to do an increasing amount of work with strings is that, originally, I was just rendering the Markdown body as HTML as is. But over time I found that the site could be so much better, and so much more sophisticated, if I extracted the individual sections in the Markdown body.
I wanted to render the information in a more sophisticated manner, essentially rendering the synopsis, related links, and other information in separate locations. The catch is, however, I didn't want to change the file format, as it was so simple and concise.
So the question was how to write as little code as possible while doing as much as I possibly could. The first option that came to mind was using regular expressions. Some people seem to see them as voodoo, as Brian Behlendorf, one of the original developers of Apache, says they can be.
I won't deny that they're difficult and challenging. But they're also a thing of beauty. Some people argue that regular expressions are a software language unto themselves. I don't quite agree, but I'm increasingly of that persuasion. Take this one:
This expression retrieves a section of the content, starting from the header and including everything up to the next header. The header is composed of one or more hashes, which in Markdown parlance set the header's level in the same way that h1 to h6 do in HTML, followed by one or more space-separated words.
The words, however, can only contain upper or lowercase letters from a-z and a space. No special characters are allowed. I don't want to consider, for a moment, just how involved the code might need to be if I wasn't using a regular expression. I'd likely have to do all manner of string searching, matching and extracting. All things which this expression does for me - in one line.
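As an illustration only, here is a sketch in Python of a pattern that behaves as described: one or more hashes, a space, space-separated words made of the letters a-z in either case, and then the content up to the next header. It is not the expression used on the site, and the `SECTION` pattern, the `extract_sections` helper, and the sample body are all assumptions made for the example.

```python
import re

SECTION = re.compile(
    r"^(#+) ([A-Za-z]+(?: [A-Za-z]+)*)[ \t]*\n"  # header: hashes, then letter-only words
    r"(.*?)"                                      # section content, as little as possible
    r"(?=^#+ |\Z)",                               # stop before the next header or at the end
    re.MULTILINE | re.DOTALL,
)

# Hypothetical Markdown body, invented for illustration.
BODY = """### Synopsis
A short description of the show.

### Related Links
- First link
- Second link
"""


def extract_sections(markdown):
    """Map each header's text to the content that follows it."""
    return {m.group(2): m.group(3).strip() for m in SECTION.finditer(markdown)}


sections = extract_sections(BODY)
print(sections["Synopsis"])
print(sections["Related Links"])
```

The lookahead at the end is what keeps one section from swallowing the next: it checks for an upcoming header without consuming it, so the following match can start there.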
Regular Expressions Aren't Easy - And People Make Excuses
Now to be fair, these aren't the simplest of things to write; I won't try to pretend that they are. And I know that a lot of people who I've worked with, and naturally those I've not, are often put off by them.
However, I find it sad, especially as we're working with computers, and computers require lots of personal investment and continuous learning to do properly, that people take the longer route because regular expressions can seem so daunting.
Alright, I'll be candid here for a moment and say that I also think some people are plain lazy and don't want to stretch themselves. Again, this is sad. But I'll discount those types of people.
Then there's another pseudo-argument I hear so often:
What if you leave and no one else knows anything about regular expressions? What do we do then?
This excuse is brought up so often, masquerading as a legitimate reason, when in truth it's not one at all. Think about it: if you bring a technology, service, product, or technique into a company, or any organisation, something which hadn't formerly existed there, it's incumbent upon you to do two things:
- Firstly, you need to thoroughly document it, in an easily accessible location
- Secondly, you need to teach it to others
When both of these are done properly this excuse really is just that - an excuse. For documentation, there are so many choices. You can maintain a Wiki, add to the existing one, or use some other form of documentation.
For teaching, the options once again abound. Whether you do it in a weekly/fortnightly/monthly developer meeting, create a screencast, or have an informal chat amongst the development team, you have to teach the others on your team. You can't operate in a silo-style environment, or make yourself irreplaceable (or un-sackable).
Reasons Why Regular Expressions Lead to Better Code
The key reason why regular expressions lead to better code stems primarily from the simplification and efficiency they bring. They create simpler and more efficient code because they require you to write less code. It's at this moment that I wish I had a working example of an alternative to the above expression.
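For a rough sense of scale, here is a sketch of what a hand-rolled alternative might look like in Python, with no regular expressions at all. It follows the behaviour described earlier: a header is one or more hashes, a space, and space-separated words made only of the letters a-z, and a section runs up to the next header. The function names and the sample text are hypothetical, and this is only one of many ways to write it.

```python
def header_title(line):
    """Return the header text if the line is a valid header, otherwise None."""
    hashes = len(line) - len(line.lstrip("#"))
    if hashes == 0 or not line[hashes:].startswith(" "):
        return None
    title = line[hashes + 1:].strip()
    words = title.split(" ")
    # Every word must be non-empty and made of ASCII letters only.
    if not title or not all(w.isascii() and w.isalpha() for w in words):
        return None
    return title


def extract_sections_manually(markdown):
    """Map each header's text to the content that follows it."""
    sections = {}
    current = None
    lines = []
    for line in markdown.splitlines():
        title = header_title(line)
        if title is not None:
            # A new header closes the section that was being collected.
            if current is not None:
                sections[current] = "\n".join(lines).strip()
            current, lines = title, []
        elif current is not None:
            lines.append(line)
    if current is not None:
        sections[current] = "\n".join(lines).strip()
    return sections


# Hypothetical Markdown body, invented for illustration.
body = "### Synopsis\nA short description.\n\n### Related Links\n- First link\n- Second link\n"
result = extract_sections_manually(body)
print(result)
```

Even in this small form there are two functions, several branches, and some bookkeeping state, each of which would need its own tests.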
But moving on: when you have less code, you naturally have fewer potential points of failure. When you have fewer potential points of failure, you have to write fewer tests to validate your code for issues which may arise. This then continues in a positive spiral.
As you have less code to maintain, both application and test code, you have a lower overall time investment. Whether you're maintaining the existing code, writing more tests for it, or extending it, adding newer or better features, there's overall less to do.
As you have a lower investment, you need fewer people, both developmental and non-developmental staff, and the people you do have don't have as large a cognitive load to bear. This cognitive load is reduced at two points.
Firstly, during the initial period where they have to become familiar with your code, and secondly, over the course of the time which they'll spend maintaining the code. As a knock-on effect of the reduced codebase, it should therefore follow that the code will, or could, be both more efficient and easier to optimise.
This has two excellent benefits:
- A reduced environmental impact
- A reduced financial impact
Whether you're an environmentalist (or have even a basic concern for our impact on the environment in which we live), or you're more the hard-headed financial type, there's a win to be had with this approach.
There was a time, quite some decades ago, when assembly language was still widely used. I recall a conversation about that era, though I don't recall the person, nor the application which they were talking about. But the kernel of the conversation was that there was a sense of competitiveness to make programs run as efficiently as they could.
Bloat was bad - plain and simple. If your application could run more efficiently, you were honour-bound to make it happen. But this was also in a time when computer hardware was nowhere near as readily available as it is today, in large part due to cost. So there was a financial imperative to take this approach.
Then hardware became cheap, and the narrative changed.
As years went by, the narrative I heard was that if there was a problem, we just had to throw more hardware at the problem, because it was cheap - and developer time was expensive. Sure, this is a logical and pragmatic decision to make. And for a short period, I agree that it's fine. But not as a long-term solution.
As a long-term solution, it's wrong. Increasingly, just like web page bloat, applications become larger and slower. But the attitude remained. Throw more hardware at it, it's cheap. This attitude misses the point. You shouldn't need to take this approach, most of the time - not if your software is well designed.
Now this is a simplistic statement, as an entire argument on this point would take quite some time. But the point I'm making is that it's important to refocus on application performance, whether from a financial or ecological perspective.
And one way to do this is to look to techniques such as regular expressions, which admittedly involve a lot of personal investment, but which, as a whole, can lead to smaller, better-written applications.