Issue 35

Logs and metrics in Java

Vasile Mihali
Senior Software Engineer @ Arobs
PROGRAMMING

As we all know, we are all more or less dependent on logs. The scope of this article is not to present a Java logging framework, but a higher level of abstraction for collecting, storing and analyzing logs, and the way good logs can give us a better overview of what our software application does and how well it does it.

But what are these logs and why do we need them? I personally consider a log a small point, but keep in mind that multiple points can draw a line, and multiple lines a shape or a direction that should be understood.

What does that mean?

In a monolithic application, logs are usually grouped by day (e.g. JBoss filename + ".yyyy-MM-dd"), and analyzing them can be very tedious, requiring good knowledge of regular expressions and Linux commands, or of a scripting language (e.g. Scala, Python). These applications are also usually deployed on external hosting services (cloud), so connecting from outside to fetch the log files (which can be quite large, even archived) quickly becomes time consuming, the analysis phase gets awkward, and too often the logs are not even checked unless something stops working or someone complains about the speed of the system.

Even around monolithic applications there are additional systems (proxies, load balancers), and these have their own logs too; the information they hold is precious, since it can describe how the application is used, security attacks, response times and so on.

But how can this be approached when a distributed architecture, i.e. microservices, is desired? How do we collect the logs from each machine? How do we connect them and make them meaningful? How easy is it for us to do that?

Img. Load-balancer/proxy (tutorialspoint.com)

And if the logs are analyzed only when something goes wrong, then in some situations it may already be too late. So, in my opinion, continuous log analysis is required at least in the first months after deployment, until things become stable (even in QA). What "stable" means differs from project to project, depending on functionality and the product contract with the client.

Img. Micro-service architecture (tigerteam.dk)

Let us not forget that what our logs contain is just as important, or even more so, because their content has a big impact on how well the application can be understood.

Here developers have full power over the content, which can be custom made, but should still follow a consistent pattern across the whole project and its modules. Since a lot of Java code does backend jobs (processing data, data migration, async tasks, reports, etc.) that provide no direct feedback (as a web request/response does), this information becomes even more important for confirming that the application works correctly.

Our solution - and not only ours:

ELK (Elasticsearch, Logstash (+ loggregator), Kibana)

Img.Logs - pipeline (blog.xebia.fr - Vincent S.)

Process:

  1. Applications use a console appender (log4j + slf4j) and write to stdout
  2. The loggregator agent (Cloud Foundry) picks up the stream from stdout
  3. Logstash receives the logs (ETL: transforming dates into generic formats, etc.)
  4. The logs are then indexed in Elasticsearch
  5. Kibana is used to view them
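Step 1 above is just a console appender writing to stdout; everything downstream parses that stream. As a minimal, dependency-free sketch (the article's setup uses log4j + slf4j; the class and field names here are hypothetical), an application can emit key=value lines that a Logstash kv-style filter can later split into fields:

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class StdoutLogSketch {

    // Build a single key=value log line; downstream tooling splits on '=' pairs.
    static String logLine(String level, String message, Map<String, String> fields) {
        String extras = fields.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(" "));
        return Instant.now() + " " + level + " " + message
                + (extras.isEmpty() ? "" : " " + extras);
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("method", "GET");
        fields.put("code", "200");
        fields.put("durationMillis", "25");
        // Step 1: write to stdout; an agent (loggregator) forwards the stream.
        System.out.println(logLine("INFO", "request-completed", fields));
    }
}
```

Keeping one event per line and a stable key=value shape is what makes the later ETL step in Logstash cheap.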

So we have a centralized system where the logs from all apps are grouped together, with common formats and many other enhancements that make them easier to understand, correlate and interpret.

Metrics. Core Metrics.

The Metrics library from Coda Hale (now Dropwizard Metrics) is one of the most popular, being used in many successful projects (e.g. Cassandra).

Developers first create the "registry" and a reporter:

// shared registry plus a reporter that periodically writes all metrics to the log
Logger log = LoggerFactory.getLogger(MetricsUtils.class);
MetricRegistry metrics = new MetricRegistry();
Slf4jReporter reporter = Slf4jReporter.forRegistry(metrics)
       .outputTo(log)
       .convertRatesTo(TimeUnit.SECONDS)
       .convertDurationsTo(TimeUnit.MILLISECONDS)
       .build();
reporter.start(1, TimeUnit.MINUTES); // emit a report every minute

And it is used through:

  1. counters

Counter failedMigrationSessions = metrics.counter(
       MetricRegistry.name(MigrationUtilsImpl.class,
               "migration-sessions-failed"));

  2. histograms

Histogram migrationSizes = metrics.histogram(
       MetricRegistry.name(MigrationUtilsImpl.class,
               "migration-sizes"));

  3. timers

Timer migrationTimer = metrics.timer(
       MetricRegistry.name(MigrationUtilsImpl.class,
               "migration-timer"));

These metrics are then updated from the application code:

public void incIdsToMigrate(long d) {
        migrationSizes.update(d);
}

public void incFailedMigrationSessions() {
        failedMigrationSessions.inc();
}

// migration util class
@Override
public Response migrate(String entityCanonicalName, List idsToMigrate) {
        // time the whole migration session
        final Timer.Context context = MetricsUtils.getInstance().getMigrationTimer().time();
        final MigrationSession migrationSession =
                new MigrationSession(entityCanonicalName, idsToMigrate.size());
        try {
                … // validation, others
                MetricsUtils.getInstance().incIdsToMigrate(idsToMigrate.size());
                …
        } finally {
                context.stop();
                if (migrationSession.isGeneralError()) {
                        MetricsUtils.getInstance().incFailedMigrationSessions();
                }
        }
}

Logs:

[MetricsUtils.info:110] type=TIMER, name=com....MigrationUtilsImpl.migration-timer, count=463, min=14185.432459, max=120866.415675, mean=48638.78311404348, stddev=33618.49316878979, median=33732.338869, p75=68521.92468899999, p95=120788.85010139999, p98=120866.415675, p99=120866.415675, p999=120866.415675, mean_rate=0.0023544359078945367, m1=2.964393875E-314, m5=2.2626576149598838E-82, m15=4.532485314583121E-30, rate_unit=events/second, duration_unit=milliseconds

[MetricsUtils.info:108] type=HISTOGRAM, name=com…..MigrationUtilsImpl.migration-sizes, count=463, min=4, max=35, mean=19.541666666666668, stddev=12.021644127831763, median=19.5, p75=33.75, p95=35.0, p98=35.0, p99=35.0, p999=35.0

[MetricsUtils.info:104] type=COUNTER, name=com….MigrationUtilsImpl.migration-sessions-failed, count=0

Continuous monitoring, day by day, week by week, month by month, in a web app and not only, can also provide a good sense of how the application is behaving:

[LoggingFilter.filter:100] 13bd85d2-1464-406a-70ee-99ef10545989 - method=GET code=200 url=workbases - Round trip response duration=25 millis
[LoggingFilter.filter:100] 084f6732-8824-4311-7d3d-d8596f8bfc43 - method=GET code=200 url=workbases - Round trip response duration=29 millis
[LoggingFilter.filter:100] 71ddc0fa-281f-4f80-6664-1374aa14b0c2 - method=GET code=200 url=registrationruns - Round trip response duration=232 millis
[LoggingFilter.filter:100] ebbf5225-7800-46b9-5e1c-bfec7d25dec1 - method=GET code=200 url=registrationruns - Round trip response duration=318 millis
[LoggingFilter.filter:100] cb721bc3-adea-4bfd-708d-8f8a7c39c9aa - method=POST code=201 url=registrationruns - Round trip response duration=5 millis

Logs like these let us keep statistics such as request rates, status codes and response times per endpoint.
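The LoggingFilter implementation itself is not shown here, but because every line carries method, code, url and duration in a fixed shape, such statistics can be derived directly from the log stream. A hypothetical sketch that aggregates the mean round-trip duration per endpoint from lines in the format above:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogStats {

    // Matches the "url=... duration=N millis" parts of a LoggingFilter line.
    private static final Pattern LINE =
            Pattern.compile("url=\\s*(\\S+).*duration=(\\d+) millis");

    // Aggregate the mean round-trip duration (millis) per endpoint.
    static Map<String, Double> meanDurationByUrl(List<String> lines) {
        Map<String, List<Long>> byUrl = new HashMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (m.find()) {
                byUrl.computeIfAbsent(m.group(1), k -> new ArrayList<>())
                     .add(Long.parseLong(m.group(2)));
            }
        }
        Map<String, Double> means = new HashMap<>();
        byUrl.forEach((url, durations) -> means.put(url,
                durations.stream().mapToLong(Long::longValue).average().orElse(0)));
        return means;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "method=GET code=200 url=workbases - Round trip response duration=25 millis",
            "method=GET code=200 url=workbases - Round trip response duration=29 millis",
            "method=GET code=200 url=registrationruns - Round trip response duration=232 millis");
        // workbases averages 27.0 ms, registrationruns 232.0 ms
        System.out.println(meanDurationByUrl(sample));
    }
}
```

In practice this aggregation is exactly what Kibana does over the indexed fields; the sketch only shows that the log format itself carries enough structure for it.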

Conclusion

Moving from monolithic apps to distributed architectures brings a lot of good things, such as flexibility and scalability, but at the same time new challenges. One of them is how to manage logs: the way they are saved and accessed. Let us also not forget what the logs contain. It is important to have logs that give us a sense of the application over time and help us understand how it is used, and, not least, its current performance and future needs.

Continuous monitoring and correlation of logs will provide a good vision of what can be enhanced and of how stable the system is. A really important aspect of a distributed orchestration is visibility into its state and the ability to react to variations (because variation is the first sign that a system is not stable or scalable).
