You came up with the idea, you put together the team. You’ve got a working application and the customers love it! You breathe a sigh of relief. “The Hard Part™ is over”, you think.
That’s when the stream of support requests starts coming in.
I keep waiting and nothing happens
Mmmm.
The loading thing just keeps spinning
Oohmmm
I don’t know if I’m doing something wrong but it’s just taking forever
You’re sure it’ll be fine, it’s just going a little slow is all… but the emails keep coming.
First it was going slow and now it won’t load at all.
Oh that’s not good.
What’s going on?! The site’s down!
Congratulations! You’ve just been hugged to death.
Now what?!
A little understanding goes a long way
It goes without saying that we want the application to remain online and performant, but to get where we’re going, we need to know where we’re starting.
Here is a very high-level abstraction of what makes up a web or mobile application.
Frontend code
- Displays information from the backend for the user to interact with.
- For a web app, this would be the website (Who’d have thought!)
- For a mobile app, this would be the application the end user uses.
- Generally runs on the user’s device.
Backend code
- Stores, modifies, and otherwise computes data displayed on the frontend.
- There can be (and often are) performance issues here.
- Runs on your infrastructure.
Infrastructure
- The computers (and other software) that run your application.
- It includes the hardware (Like how much CPU or RAM are available), networking, and more.
- Will generally include other software required by your application like database or web serving software.
- This is where your backend code runs.
External Services
- Any external applications (Like APIs) used by your application.
- An example of this would be using Google Maps in your application.
Putting It All Together
A simple way to think about this is that the frontend and backend code are a car. You can tune its engine or the aerodynamics of its body to increase its speed and efficiency.
The infrastructure can be thought of as the fuel and oil for the car.
While you can tune your fuel and oil a bit to give your car more power or efficiency, their primary purpose is to provide a steady and reliable supply of energy for the car to get you over to FoodCo for dinner. Even the best car isn’t good without fuel!
Performance and reliability issues can involve any of these portions. In our experience we’ve generally seen most issues on the backend (The engine isn’t tuned right), followed by issues with resource constraints on the infrastructure (Insufficient fuel).
Of the above we have control over the frontend code, the backend code, and the infrastructure.
In most cases we will not have significant control over the end user’s device or any external services used by the application.
An important thing to keep in mind is that for each of the broad categories mentioned, there are numerous smaller components within.
Show Me The Data
This should help us understand that our problems (The website is going down or isn’t performant) could be caused by issues in multiple areas.
So how do we go about finding what effect each of the above has on the application?
Like many other complex systems, the best place to start is identifying and collecting critical information, then providing the information in an easily interpreted manner.
We pull out of our driveway to head to FoodCo and the car stops moving. Is something more serious wrong with the car or is it simply out of fuel? Luckily we can glance down at the handy dandy fuel gauge and see we may have overestimated how fuel efficient our Hummer H4 was (we expected to get down the street before a refill at least).
On a car the fuel gauge helps you understand if you’re about to run out of gas so you can get a refill. If your fuel gauge is fancy it may even calculate how much fuel you have, your MPG, and how far you can travel before a refill. Neat!
Infrastructure monitoring is kind of like the fuel gauge on your car. It collects information like how much RAM, CPU, storage, or bandwidth you’re using. It can help you understand how many resources are being used and whether you’re about to run out of something.
Infrastructure Monitoring Software Examples: AWS CloudWatch, Zabbix, and many others.
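To make the fuel gauge idea a bit more concrete, here is a minimal sketch in Python of what a very small infrastructure check might look like. It only uses the standard library, and the function names (`collect_snapshot`, `find_warnings`) and the threshold values are our own assumptions for illustration. Real tools like the ones listed above collect far more and do it continuously.

```python
import os
import shutil


def collect_snapshot(path="/"):
    """Gather a few basic resource readings from the current machine."""
    disk = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    try:
        # 1-minute load average; not available on every platform (e.g. Windows).
        load_1m = os.getloadavg()[0]
    except (AttributeError, OSError):
        load_1m = None
    return {
        "disk_used_pct": 100 * disk.used / disk.total if disk.total else 0.0,
        "load_1m": load_1m,
        "cpu_count": os.cpu_count() or 1,
    }


def find_warnings(snapshot, max_disk_pct=90.0, max_load_per_cpu=1.0):
    """Return warnings for readings above the (assumed) limits."""
    warnings = []
    if snapshot["disk_used_pct"] > max_disk_pct:
        warnings.append("disk is almost full")
    load = snapshot["load_1m"]
    if load is not None and load / snapshot["cpu_count"] > max_load_per_cpu:
        warnings.append("CPU load is high")
    return warnings
```

Calling `find_warnings(collect_snapshot())` on a schedule is the "glance at the gauge" step. Note that it tells you that you are running low, not why.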
Back to our problem, we could certainly get a bigger fuel tank or keep putting more gas in the car but that’s going to be very expensive over time.
We may decide to refill the car this time (we really want dinner) but we need to figure out why its fuel economy is so bad.
The fuel gauge (or Infrastructure Monitoring) might tell us how much we have left or how much we’ve used, but it doesn’t tell us why the car (your application’s code) is using so much fuel.
Since our fuel economy was in the range of about 5 feet per tank, we probably want to figure that out. We’d need more information, so we’d test the car further. We could examine its engine, put it in a wind tunnel, and run other more rigorous tests.
For your application, this critical information is collected with Application Performance Monitoring (APM) tools.
Application Performance Monitoring tools collect far more information about the performance of your application. They measure things like your error rates or failures (as well as information about each error), how long each request takes, and more. In many cases they will even automatically profile (collect detailed information about) particular requests, like slow-running ones, so they can be analyzed and remediated.
APM Software Examples: Stackify, New Relic APM, and others.
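As a rough illustration of the kind of data these tools gather, here is a small Python sketch that times each call to a request handler and records whether it failed. Everything in it (the `monitored` decorator, the `REQUEST_LOG` store, the `checkout` handler) is hypothetical and made up for this example. Commercial APM products hook in automatically and capture much more detail.

```python
import time
from collections import defaultdict
from functools import wraps

# endpoint name -> list of (duration_seconds, succeeded) tuples
REQUEST_LOG = defaultdict(list)


def monitored(endpoint):
    """Record how long each call takes and whether it raised an error."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
            except Exception:
                REQUEST_LOG[endpoint].append((time.perf_counter() - start, False))
                raise  # the error is recorded, not hidden
            REQUEST_LOG[endpoint].append((time.perf_counter() - start, True))
            return result
        return wrapper
    return decorator


def summarize(endpoint):
    """Return request count, error rate, and slowest request for an endpoint."""
    records = REQUEST_LOG[endpoint]
    if not records:
        return {"requests": 0, "error_rate": 0.0, "slowest_seconds": 0.0}
    failures = sum(1 for _, ok in records if not ok)
    return {
        "requests": len(records),
        "error_rate": failures / len(records),
        "slowest_seconds": max(duration for duration, _ in records),
    }


@monitored("checkout")
def checkout(cart_total):
    """Hypothetical request handler used only to demonstrate the decorator."""
    if cart_total <= 0:
        raise ValueError("cart is empty")
    return f"charged {cart_total}"
```

After a few calls to `checkout`, `summarize("checkout")` reports how many requests were seen, what fraction failed, and how long the slowest one took. Those are the same kinds of numbers the paragraph above describes.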
While it’s certainly possible to solve some problems without appropriate monitoring tooling, these tools provide significant (and ongoing) time savings during problem remediation and also allow a deeper understanding of the ongoing performance and availability of your application.
A perhaps more important benefit is that this tooling can reveal problems before they rear their ugly head (Ideally preventing a stream of angry support emails).
We’ve found that in many cases our clients:
- May have some Infrastructure Monitoring but it is not being properly reviewed and acted upon (for example by ensuring sufficient resources are available).
- Have no Application Performance Monitoring tooling in place, which leads to an insufficient understanding of the application’s performance and reliability over time.
- Do not allocate sufficient time to evaluating and improving the performance of their code.
The key takeaway is you aren’t going to know what’s wrong if there are no eyes and hands on the problem.
It is critical to implement appropriate Infrastructure and Application Performance Monitoring and to allocate time for problem discovery and remediation.
A little bit of this, a little bit of that
So we’ve started gathering all the information we need to figure out where things are going wrong.
This is where the technical team will dive in and interpret the gathered data.
They might ask questions like:
- Are there any particular resource constraints or bottlenecks with the infrastructure?
- What portions of the code are particularly inefficient?
- What portions of the code are used frequently?
- Are there any particular patterns we can discern from the data?
- What portion do we decide to fix first?
- What will provide the best cost/benefit?
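One simple way to combine the "inefficient" and "used frequently" questions is to multiply how often a piece of code runs by how long it takes, then sort. The Python sketch below does that. The `EndpointStats` class and all of the numbers are made up for illustration, and a real team would also weigh how hard each fix is.

```python
from dataclasses import dataclass


@dataclass
class EndpointStats:
    name: str
    calls_per_hour: int
    avg_seconds: float

    @property
    def total_seconds_per_hour(self):
        # Frequency times cost: how much time this code consumes overall.
        return self.calls_per_hour * self.avg_seconds


def rank_by_total_time(stats):
    """Sort endpoints so the biggest overall time consumers come first."""
    return sorted(stats, key=lambda s: s.total_seconds_per_hour, reverse=True)


# Made-up numbers for illustration only.
observed = [
    EndpointStats("monthly_report", calls_per_hour=4, avg_seconds=30.0),
    EndpointStats("product_search", calls_per_hour=9000, avg_seconds=0.5),
    EndpointStats("login", calls_per_hour=2000, avg_seconds=0.25),
]

for entry in rank_by_total_time(observed):
    print(f"{entry.name}: {entry.total_seconds_per_hour:.0f} s/hour")
```

With these made-up numbers, product_search comes out on top (4,500 seconds per hour) even though monthly_report is by far the slowest single request (30 seconds each, but only 120 seconds per hour in total). That is why "slow" and "used frequently" need to be looked at together.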
If you’ve been following along with our car analogy then you could imagine there are multiple ways to solve our imaginary Hummer’s problems.
These choices largely boil down to three categories:
- Keep providing the vehicle (your code) with more fuel (infrastructure) in some way.
- Make the vehicle (your code) more efficient so that it uses less fuel (infrastructure).
- A bit of column A, a bit of column B.
It’s relatively easy to refill the Hummer with gas and maybe even install a bigger gas tank but the additional fuel (infrastructure) costs will add up over time.
Improving the Hummer’s efficiency/performance is likely to be more time-consuming (and difficult), but it will continue to provide long-term benefits with reduced fuel costs and a better driving experience for the user.
In the vast majority of cases we advocate tackling the issues from both sides.
We always want the application working, so we ensure it receives sufficient resources (infrastructure) to operate reliably, then work towards improving the application’s code so that it is cost efficient, performant, and reliable for the end user.
Now that we’ve diagnosed our problems and decided to move forward, what might problem remediation look like?
Pass Me The Wrench
The technical team has interpreted the data and they’ve determined the various problems that exist. We want to make sure the code has as many resources as it needs and we want the code to use those resources efficiently.
Some of the things a team might do to resolve some of these issues:
For the Infrastructure
- Manually modify the infrastructure when more resources are required (Blech!).
- Automate the infrastructure so more resources can be added more easily and with fewer errors.
- Further automate the infrastructure so resources can be requested and provided as needed (generally known as auto-scaling).
- Where relevant, tune any software like the web server or database server.
- Implement caching to reduce overall load.
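To give a feel for what auto-scaling boils down to, here is a hedged Python sketch of a simple proportional scaling rule. The function name `desired_servers`, the 60% CPU target, and the minimum and maximum server counts are all assumptions chosen for the example. Actual auto-scaling services have their own policies and settings.

```python
import math


def desired_servers(current_servers, avg_cpu_pct, target_cpu_pct=60.0,
                    min_servers=2, max_servers=10):
    """Pick a server count that brings average CPU use back near the target."""
    if current_servers < 1:
        raise ValueError("current_servers must be at least 1")
    # Total work is roughly current_servers * avg_cpu_pct; spread it so each
    # server sits at about target_cpu_pct.
    needed = math.ceil(current_servers * avg_cpu_pct / target_cpu_pct)
    # Clamp between a floor (for reliability) and a ceiling (for cost control).
    return max(min_servers, min(max_servers, needed))
```

For example, four servers running at 90% CPU would be scaled up to six, while four servers idling at 15% would be scaled down to the floor of two. The ceiling is what keeps the fuel bill from growing without limit.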
For the Code
- Improve or refactor frequently used code.
- Improve or refactor particularly slow but less frequently utilized code.
- Implement caching at the application level.
- Remove features that are performance-intensive and not valuable to end users.
- Ensure the application degrades “gracefully” during periods of very high load (for example, by disabling certain performance-intensive features).
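Two of the ideas in the list above, application-level caching and graceful degradation, can be sketched in a few lines of Python. This is only an illustration under our own assumptions: `recommendations_for` stands in for some expensive piece of work, `render_home_page` is a hypothetical page builder, and the 85% load threshold is arbitrary.

```python
from functools import lru_cache

CALLS = {"recommendations": 0}  # counts how often the slow work really runs


@lru_cache(maxsize=1024)
def recommendations_for(user_id):
    """Stand-in for an expensive computation or database query."""
    CALLS["recommendations"] += 1
    return tuple(f"product-{(user_id + offset) % 100}" for offset in range(3))


def render_home_page(user_id, current_load_pct, degrade_above_pct=85.0):
    """Build page data, skipping the expensive extras when load is very high."""
    page = {"user_id": user_id, "degraded": False}
    if current_load_pct > degrade_above_pct:
        # Graceful degradation: core page still works, the costly feature is off.
        page["degraded"] = True
        page["recommendations"] = ()
    else:
        page["recommendations"] = recommendations_for(user_id)
    return page
```

Rendering the page twice for the same user under normal load only runs the expensive function once, because the second call is served from the cache. Under very high load the page still renders, just without recommendations.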
The exact fix will, of course, depend on the exact problem at hand, but this should give a broad overview of what to expect.
Need a hand?
While we think everyone should understand the basics, we know you might not want to pop open the hood every time something goes wrong.
Let’s have a short conversation about the problems you’re facing and how we could help.
Summary
If you’ve made it this far then congratulations! You’ve survived an onslaught of terrible car-based analogies and will leave a bit more informed about how many web/mobile apps work, the importance of collecting critical data, and how a technical team may tackle the problem.
Now time for some dinner at FoodCo.