Apr 2, 2024
News
Operating software can be difficult at the best of times, but when your software provides little to no information on what it is doing and how fast it is doing it, it can make the whole experience extremely stressful.
At RHE, we operate a suite of applications providing a variety of services to public authorities. These authorities trust us to provide responsive, accurate and well-maintained products, a responsibility we take very seriously.
Over the past few years the technical department at RHE has been driven to improve the observability of the products it provides to help better pre-empt, prevent or respond to disruptions to service.
Why observability is important
We all know when a website is down. Our browser loads endlessly, we are presented with an indecipherable error message, or perhaps even an amusing graphic tells us something isn’t quite right (think of Google’s dinosaur game when you’ve forgotten to turn the Wi-Fi back on).
In these moments, we trust that the operators of the service we are trying to access know there is a problem and are frantically working to bring the website back online as soon as possible. However, many software operators may never know, without manually checking, that their service is no longer operating as expected. This happens because they are not actively monitoring the status of their service. The service isn’t observable enough, so it remains offline until a member of staff discovers the problem or, worse still, a client calls to tell them. I’m sure most technical readers of this article have been in this situation before: the sinking feeling as the client informs you that they couldn’t complete their transaction because the website is down, followed by the frantic efforts to log in and check the logs for errors.
Observability reduces this stress. By observing our applications more closely, we can get early warning signs of service disruptions or degradations. We can know that a service has gone offline before the unexpected client phone call and, if we are efficient enough, have corrected this issue before anyone notices.
At RHE we have worked tirelessly over the past few years to get as many of our applications as possible into this position, ensuring that we know when something has gone wrong or isn’t working as well as it should be, so we can take remedial action and ensure our clients are getting the best possible service.
How RHE has achieved better observability
Over the past few years RHE has made use of popular technologies, such as AWS CloudWatch Alarms, Metrics and Logs Insights, to better analyse the behaviour of our applications. We feed information from these services into alerts, reports and dashboards to give us a holistic view of our software estate and up-to-the-minute reports on its status.
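To illustrate how a threshold alarm of this kind behaves, here is a minimal sketch in plain Python. It does not call AWS and is not RHE's configuration. It only mimics the "M out of N datapoints" rule that CloudWatch alarms use, and it ignores details such as the treatment of missing data. The function name, parameter names and sample values are assumptions made for illustration.

```python
def alarm_state(datapoints, threshold, evaluation_periods, datapoints_to_alarm):
    """Simplified alarm rule: look at the most recent `evaluation_periods`
    datapoints and alarm when at least `datapoints_to_alarm` of them
    exceed `threshold`."""
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        # Not enough history yet to make a decision.
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for value in recent if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# Hypothetical response times in milliseconds, one datapoint per minute.
response_times_ms = [120, 900, 950, 130, 980]
state = alarm_state(response_times_ms, threshold=800,
                    evaluation_periods=5, datapoints_to_alarm=3)
```

Requiring several breaching datapoints rather than one is a common way to avoid alerting on a single slow minute.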
We have used technologies such as Sentry for many years to keep us informed about unexpected errors. Sentry gives us a convenient, one-size-fits-all approach to exception monitoring: it provides SDKs in multiple programming languages, with low-effort instrumentation tools that let us start monitoring application exceptions quickly. With the addition of the CloudWatch-based monitoring, we can look out not only for those unexpected errors, but also for signals that may indicate a pending disruption or an ongoing degradation of service.
In addition to this, we have been rolling out more fine-tuned observability tools throughout our estate to help identify the root cause of service degradations or outages as expeditiously as possible. We have made use of open source technology such as OpenTelemetry to instrument our applications with granular traces and metrics designed to give us information on performance trends, error rates and resource utilisation.
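As a rough illustration of what a trace span records, the sketch below times a unit of work and notes whether it succeeded. It deliberately uses only the Python standard library and is a stand-in for the idea, not the OpenTelemetry API and not RHE's instrumentation. The names `span`, `finished_spans` and `migrate_batch` are hypothetical.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory store; a real tracer would export spans to a backend.
finished_spans = []


@contextmanager
def span(name, **attributes):
    """Record the duration and outcome of a block of work, span-style."""
    record = {"name": name, "attributes": dict(attributes), "status": "ok"}
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        # Mark the span as failed, then let the error propagate as normal.
        record["status"] = "error"
        record["attributes"]["exception.type"] = type(exc).__name__
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        finished_spans.append(record)


def migrate_batch(batch_id, rows):
    """Hypothetical unit of work, used here only to show the pattern."""
    with span("migrate_batch", batch_id=batch_id) as current:
        current["attributes"]["row_count"] = len(rows)
        return [row.upper() for row in rows]
```

Each finished span carries a name, a duration, a status and some attributes, which is the kind of detail that makes it possible to see which step of a routine is slow or failing.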
"The technical department at RHE is on a mission to ensure that if there is a reduction in the quality of any service we provide, we know about it first, identify the source quickly and mitigate it as fast as possible."
Recently, we have been moving clients from version 1 of The Noise App over to version 2. During this migration we leaned heavily on telemetry to ensure we could bring excited clients over to the new service as quickly and reliably as possible. Using the telemetry we introduced, we were able to identify performance problems with our migration routine, find troublesome areas and remedy them incredibly quickly. Without this level of information, we would have needed to spend enormous amounts of time running, analysing and iterating on our routines to ensure they performed as expected. This is just one of the many ways RHE uses emerging and modern technology to improve the quality of the services it offers.
Identifying traits of value
Observability in software can create a wealth of information on the behaviour and performance of an application. For people just starting out on their observability journey, knowing what to look for in an enormous database of traces and metrics can be overwhelming, and creating alarms on each individual performance variable simply isn’t practical or necessary in most cases. So at RHE we focus on the metrics we know will be of the highest value to our clients. We do this using a structured logging mechanism and established tools, such as AWS X-Ray. Together, these give us the ability to filter our logs and traces in search of those that are of most interest to us. At RHE our top three priorities are always the same:
1. Availability of service
2. Service error rate
3. Response times
We endeavour to have time-series data and detailed logging on all of these metrics to give us a clear picture of our applications’ performance and ensure high levels of availability.
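The combination of structured logging and the three priorities above can be sketched as follows. This is a minimal standard-library example under stated assumptions, not RHE's logging mechanism: the `fields` attribute, the `status` and `duration_ms` keys, the in-memory handler and the sample requests are all hypothetical.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        # "fields" is a hypothetical attribute passed through `extra=`.
        payload = {"level": record.levelname, "message": record.getMessage()}
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


class ListHandler(logging.Handler):
    """Collect formatted lines in memory; stands in for a real log destination."""

    def __init__(self, lines):
        super().__init__()
        self.lines = lines

    def emit(self, record):
        self.lines.append(self.format(record))


def availability(health_checks):
    """Fraction of health checks that passed (True means healthy)."""
    return sum(health_checks) / len(health_checks)


def error_rate(entries):
    """Fraction of logged requests that returned a 5xx status."""
    return sum(1 for e in entries if e["status"] >= 500) / len(entries)


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # ceiling of pct * n / 100
    return ordered[rank - 1]


lines = []
handler = ListHandler(lines)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("observability_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

# Hypothetical sample traffic: (HTTP status, response time in ms).
for status, duration_ms in [(200, 100), (200, 200), (500, 300), (200, 400)]:
    logger.info("request handled",
                extra={"fields": {"status": status, "duration_ms": duration_ms}})

# Because every line is JSON, the logs can be parsed and filtered by field.
entries = [json.loads(line) for line in lines]
```

With entries in this shape, `error_rate(entries)` and `percentile([e["duration_ms"] for e in entries], 95)` give the error rate and a response-time figure, while `availability` works on a separate list of health-check results.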
Pre-empting performance problems
Improving an application’s observability delivers more than simply identifying problems in production. At RHE, we are able to use our improved observability to monitor an application’s performance throughout the software development lifecycle, all the way from the developer’s local machine, up to our test environments and finally into production. Throughout this process, members of our development team are able to view, analyse and act on the metrics generated by an application to ensure the change they are making meets our high standards of quality.
This means that before a developer’s change even leaves their machine, they have a wealth of information on the performance of that change, and potentially indicators of possible bottlenecks or problems. Not only does this allow us to deliver change faster by finding problems earlier in our development lifecycle, but it also means there is less chance of a performance issue making it to one of our production environments.
Should a performance issue make it off a developer’s machine, other developers are then empowered to view and act on poor performance indicators by reviewing the same information generated by the application in a pre-production environment, providing yet another safety net and even further protection for our production environments.
I’m incredibly proud of the progress we have made at RHE in improving our observability, and I am sure that, as time goes on, we will be able to fine-tune our processes and routines to ensure our clients are getting maximum value from our products.
With thanks to Full Stack Developer Terence Jefferies for this article.