Today we are going to talk about logs, quantitative metrics, and how to observe them in order to increase the team reaction rate and reduce the system waiting time in case of an incident.
Erlang/OTP as a framework and ideology of developing distributed systems provides us with regulated approaches to development, proper tools, and implementation of standard components. Let’s assume that we have exploited OTP’s potential and have gone all the way from prototype to production. Our Erlang project feels perfectly fine on production servers. The code base is constantly developing, new requirements are emerging alongside with functionality, new people are joining the team. Everything is, like…fine? However, from time to time something might go wrong and any technical issues combined with human factor lead to crash.
As you just can’t have all the bases covered and foresee all the potential dangers of failures and problems, then you have to reduce the system waiting time via proper management and software solutions.