
Tuesday, June 4, 2019

Exposing JVM metrics using Prometheus!!


In today's fast-changing environment, it is really tough to debug or get insight into applications running in production. There are several ways to do this, but exposing metrics with Prometheus is the best option; if we use Grafana along with Prometheus, we can visualize them as well.

Sometimes we also need to create our own custom metrics, and we can achieve that with the same approach. So let's dive into the implementation.



Pre-release artifacts are being published frequently, but are NOT intended for production use.
In Gradle:
compile 'org.springframework.metrics:spring-metrics:latest.release'
compile 'io.prometheus:simpleclient_common:latest.release'
Or in Maven:
<dependency>
  <groupId>org.springframework.metrics</groupId>
  <artifactId>spring-metrics</artifactId>
  <version>${metrics.version}</version>
</dependency>
<dependency>
  <groupId>io.prometheus</groupId>
  <artifactId>simpleclient_common</artifactId>
  <version>${prom.version}</version>
</dependency>
Enable metrics in your Spring Boot application with @EnablePrometheusMetrics:
@SpringBootApplication
@EnablePrometheusMetrics
public class MyApp {
}

@RestController
@Timed
class PersonController {
    Map<Integer, Person> people = new HashMap<>();

    public PersonController(MeterRegistry registry) {
        // constructs a gauge to monitor the size of the population
        registry.mapSize("population", people);
    }

    @GetMapping("/api/people")
    public List<Person> listPeople() {
        return new ArrayList<>(people.values());
    }

    @GetMapping("/api/person/")
    public Person findPerson(@PathVariable Integer id) {
        return people.get(id);
    }
}
@EnablePrometheusMetrics also applies @EnablePrometheusScraping to your Spring Boot application which enables a Spring Boot Actuator endpoint at /prometheus that presents a Prometheus scrape with the appropriate format.
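For illustration, the payload served at /prometheus uses the Prometheus text exposition format; the series names and values below are only hypothetical:
# HELP population Size of the people map
# TYPE population gauge
population 3.0
# HELP http_server_requests_count Total number of timed requests
# TYPE http_server_requests_count counter
http_server_requests_count{method="GET",status="200",uri="/api/people"} 42.0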
Here is an example scrape_config to add to prometheus.yml:
scrape_configs:
  - job_name: 'spring'
    metrics_path: '/prometheus'
    static_configs:
      - targets: ['HOST:PORT']
In this sample code, multiple dimensional time series are created with a variety of metrics:
  1. Adding @Timed to the controller creates a Timer time series named http_server_requests which by default contains dimensions for the HTTP status of the response, HTTP method, exception type if the request fails, and the pre-variable substitution parameterized endpoint URI.
  2. Calling mapSize on our meter registry adds a Gauge time series named population that changes when observed by a metrics backend or exporter.
  3. JVM garbage collection metrics are published automatically.
  4. If you are using logback, counts will be collected for log events at each level.
Let's break down the key pieces of the instrumentation API in detail.


Meters and Registries

A meter is the interface for collecting a set of measurements (which we individually call metrics) about your application. spring-metrics ships with a supported set of Meter primitives including Timer, Counter, Gauge, DistributionSummary, and LongTaskTimer. Note that different meter types result in a different number of metrics. For example, while there is a single metric that represents a Gauge, a Timer measures both the number of timed events and the total time of all events timed.
A registry creates and manages your application's set of meters. Exporters use the meter registry to iterate over the set of meters instrumenting your application, and then further iterate over each meter's metrics, generally resulting in a time series in the metrics backend for each combination of metrics and dimensions.
Three meter registry implementations are provided out of the box: SpectatorMeterRegistry, PrometheusMeterRegistry, and SimpleMeterRegistry. The registries are generally created incidentally by selecting a backend via @EnableAtlasMetrics, @EnablePrometheusMetrics, etc.
Then, inject the MeterRegistry elsewhere to create meters that you can use to instrument your app:
@RestController
public class MyController {
  List<Person> people = new ArrayList<Person>();
  Counter steveCounter;
  Timer findPersonTimer;

  public MyController(MeterRegistry registry) {
      // registers a gauge to observe the size of the population
      registry.collectionSize("population", people);

      // register a counter of questionable usefulness
      steveCounter = registry.counter("find_steve" /* optional tags here */);

      // register a timer -- though for request timing it is easier to use @Timed
      findPersonTimer = registry.timer("http_requests", "method", "GET");
  }

  @GetMapping("/api/person")
  public Person findPerson(@RequestParam String q) {
      return findPersonTimer.record(() -> { // use the timer!
          if(q.toLowerCase().contains("steve")) {
              steveCounter.increment(); // use the counter
          }

          return people.stream().filter(p -> /* etc */).findAny().orElse(null);
      });
  }
}


Dimensions/Tags

A meter is uniquely identified by its name and dimensions. We use the terms dimensions and tags interchangeably, and the spring-metrics interface is Tag simply because it is shorter.
As a general rule it should be possible to use the name as a pivot. Dimensions allow a particular named metric to be sliced to drill down and reason about the data. This means that if just the name is selected, then the user can drill down using other dimensions and be able to reason about the value being shown.
Suppose we are trying to measure the number of threads in a thread pool and the number of rows in a database table.

Recommended approach

registry.counter("threadpool_size", "id", "server_requests")
registry.counter("db_size", "table", "users")
This variant provides enough context so that if just the name is selected the value can be reasoned about and is at least potentially meaningful. For example if we select threadpool_size we can see the total number of threads in all pools. Then we can group by or select an id to drill down further or perform comparative analysis on the contribution of each functional area to the number of threads consumed by the instrumented app.
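In Prometheus terms (sketch queries only), this is the difference between looking at the total and then slicing it by the id dimension:
sum(threadpool_size)            # total threads across all instrumented pools
sum(threadpool_size) by (id)    # contribution of each pool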

Bad approach

registry.counter("size",
    "class", "ThreadPool",
    "id", "server_requests");

registry.counter("size",
    "class", "Database",
    "table", "users");
In this approach, if we select size we will get a value that is an aggregate of the number of threads and the number of items in a database. This time series is not useful without further dimensional drill-down.

Metric and tag naming

spring-metrics employs a naming convention that separates words with '_'. Camel casing is also perfectly acceptable. We recommend staying away from '-' and '.' as word separators to maintain the greatest degree of independence from a particular monitoring backend. '-' is interpreted as metric subtraction in some monitoring systems (e.g. Prometheus), and '.' is used to flatten tags into hierarchical names when shipping metrics to hierarchical backends like Graphite.
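As a quick illustration of the convention (the metric names here are arbitrary):
registry.counter("http_server_requests"); // preferred: words separated with '_'
registry.counter("httpServerRequests");   // acceptable: camel case
registry.counter("http.server.requests"); // avoid: '.' and '-' interact badly with some backends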


Counters

Counters report a single metric, a count. The Counter interface allows you to increment by a fixed amount, and isn't opinionated about whether that fixed amount may be negative.
When building graphs and alerts off of counters, generally you should be most interested in measuring the rate at which some event is occurring over a given time interval. Consider a simple queue, counters could be used to measure things like the rate at which items are being inserted and removed.
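As a minimal sketch (the class and metric names are hypothetical), such a queue could be instrumented with two counters:
class InstrumentedQueue {
    private final Queue<String> queue = new ArrayDeque<>();
    private final Counter inserts;
    private final Counter removes;

    InstrumentedQueue(MeterRegistry registry) {
        inserts = registry.counter("queue_inserts");
        removes = registry.counter("queue_removes");
    }

    void enqueue(String item) {
        queue.add(item);
        inserts.increment(); // rate(queue_inserts[1m]) then yields insertions per second
    }

    String dequeue() {
        removes.increment();
        return queue.poll();
    }
}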
It's tempting at first to conceive of visualizing absolute counts rather than a rate, but carefully consider that the absolute count is usually both a function of the rapidity with which something is used and the longevity of the application instance under instrumentation. Building dashboards and alerts of the rate of a counter per some interval of time disregards the longevity of the app.
Be sure to read through the timer section before jumping into using a counter, as timers record a count of the timed events as a separate metric. For those pieces of code you intend to time, you do NOT need to add a counter separately!
The following code simulates a real counter whose rate exhibits some perturbation over a short time window.
// Colt (cern.jet.random) supplies the random distribution; Reactor's Flux drives the periodic increments
RandomEngine r = new MersenneTwister64(0);
Normal dist = new Normal(0, 1, r);

MeterRegistry registry = ...
Counter counter = registry.counter("counter");

Flux.interval(Duration.ofMillis(10))
        .doOnEach(d -> {
            if (dist.nextDouble() + 0.1 > 0) {
                counter.increment();
            }
        })
        .blockLast();


Grafana-rendered counter
Counter over a positive-biased random walk.
Prometheus Query

rate(counter[10s])
Representing a counter without rate normalizing over some time window is rarely useful, as the representation is a function of both the rapidity with which the counter is incremented and the longevity of the service. Below you can see how the counter drops back to zero on service restart. The rate normalized graph above would return back to a value around 55 as soon as the new instance (say on a production deployment) was in service.


Grafana-rendered counter (no rate)
Counter over the same random walk, no rate normalization.
Prometheus Query

counter


Timers

Timers are useful for measuring short-duration latencies and the frequency of such events. All implementations of Timer report at least the total time and count of events as separate time series.
As an example, consider a graph showing request latency to a typical web server. The server can be expected to respond to many requests quickly, so the timer will be getting updated many times per second.
The appropriate base unit for timers varies by metrics backend for good reason. Prometheus recommends recording timings in seconds (as this is technically a base unit), but records the value as a double. spring-metrics is decidedly un-opinionated about this, but because of the potential for confusion it requires a TimeUnit when interacting with Timers. spring-metrics is aware of the preferences of each implementation and stores your timing in the appropriate base unit for that implementation.
public interface Timer extends Meter {
    ...
    void record(long amount, TimeUnit unit);
    double totalTime(TimeUnit unit);
}
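A minimal usage sketch of this interface (the timer name is arbitrary):
Timer timer = registry.timer("my_timer");
timer.record(250, TimeUnit.MILLISECONDS);                // stored in the backend's preferred base unit (seconds for Prometheus)
double totalSeconds = timer.totalTime(TimeUnit.SECONDS); // read back the accumulated time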
The Prometheus Timer produces two counter time series with different names:
  1. ${name}_count - Total number of all calls.
  2. ${name}_sum - Total time of all calls.
For the same reasons cited in the Counters section, it is generally most useful to rate normalize these time series to reason about them. Since Prometheus keeps track of discrete events across all time, it has the advantage of allowing for the selection of an arbitrary time window across which to normalize at query time (e.g. rate(timer_count[10s]) provides a notion of requests per second over 10 second windows).


Prometheus-rendered timer
Timer over a simulated service.
Prometheus Queries

  1. Average latency: rate(timer_sum[10s])/rate(timer_count[10s])
  2. Throughput (requests per second): rate(timer_count[10s])


Long Task Timers

The long task timer is a special type of timer that lets you measure time while the event being measured is still running. A regular timer does not record the duration until the task is complete.
Now consider a background process to refresh metadata from a data store. For example, Edda caches AWS resources such as instances, volumes, auto-scaling groups etc. Normally all data can be refreshed in a few minutes. If the AWS services are having problems it can take much longer. A long duration timer can be used to track the overall time for refreshing the metadata.
In a Spring application, it is common for such long-running processes to be implemented with @Scheduled. spring-metrics provides a special @Timed annotation for instrumenting these processes with a long task timer:
@Timed(value = "aws_scrape", longTask = true)
@Scheduled(fixedDelay = 360000)
void scrapeResources() {
    // find instances, volumes, auto-scaling groups, etc...
}
If we wanted to alert when this process exceeds threshold, with a long task timer we will receive that alert at the first reporting interval after we have exceeded the threshold. With a regular timer, we wouldn't receive the alert until the first reporting interval after the process completed, over an hour later!


Prometheus-rendered long task timer
Simulated back-to-back long tasks.
Prometheus Query

longTaskTimer{statistic="duration"}


Gauges

A gauge is a handle to get the current value. Typical examples for gauges would be the size of a collection or map or number of threads in a running state.
spring-metrics takes the stance that gauges should be sampled and not set, so there is no information about what might have occurred between samples. After all, any intermediate values set on a gauge are lost by the time the gauge value is reported to a metrics backend anyway, so there seems to be little value in setting those intermediate values in the first place.
If it helps, think of a Gauge as a heisengauge - a meter that only changes when it is observed.
The MeterRegistry interface contains a number of convenience methods for instrumenting collections, maps, executors, and caches with gauges.
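A minimal sketch using the collectionSize and mapSize helpers seen in the earlier examples:
List<String> activeSessions = new ArrayList<>();
Map<Integer, Person> people = new HashMap<>();

// the gauges sample size() only when the backend observes them
registry.collectionSize("active_sessions", activeSessions);
registry.mapSize("population", people);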
Lastly, Gauges are useful for monitoring things with natural upper bounds. We don't recommend using a gauge to monitor things like request count, as they can grow without bound for the duration of an application instance's life.
In Prometheus, a gauge is a generalization of a counter that also happens to allow for decrementing. If you view a gauge as something that is actively set by application code rather than sampled, it is clear that your code would have to increment and decrement the gauge as the size of the thing being measured changes. Diligent incrementing and decrementing throughout the application code ultimately yields the same result as the heisengauge.


Distribution Summary

A distribution summary is used to track the distribution of events. It is wholly similar to a timer, but more general in that the size does not have to be a period of time. For example, a distribution summary could be used to measure the payload sizes of requests hitting a server.
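A minimal sketch, assuming the summaryBuilder shown in the sections below and a record(double) method on DistributionSummary (payload here is a placeholder byte[]):
DistributionSummary payloadSizes = registry.summaryBuilder("request_payload_size").create();

// record the size in bytes of each payload as it arrives
payloadSizes.record(payload.length);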


Quantile Statistics

Timers and distribution summaries can be enriched with quantiles computed in your app prior to shipping to a monitoring backend.
Timer timer = meterRegistry.timerBuilder("my_timer")
                .quantiles(WindowSketchQuantiles.quantiles(0.5, 0.95).create())
                .create();
For distribution summaries, you can use summaryBuilder(name) which mirrors this construction.
This would result in additional gauges with tags quantile=0.5 and quantile=0.95. The 0.95 quantile is the value below which 95% of observations in a group of observations fall. 0.5 represents the median of our observations thus far.
It is also possible to indicate that you want to compute quantiles in an @Timed annotation:
@RestController
public class MyController {
    @Timed(value = "list_people", quantiles = {0.5, 0.95})
    @GetMapping("/api/people")
    public List<Person> listPeople() { ... }
}
Four quantile algorithms are provided out of the box with different tradeoffs:
  • WindowSketchQuantiles - The importance of an observation is decayed as it ages. This is the most computationally costly algorithm.
WindowSketchQuantiles.quantiles(0.5, 0.95)
    .error(0.01) // OPTIONAL, defaults to 0.05
    .create()
  • Frugal2UQuantiles - Successive approximation algorithm that converges towards the true quantile with enough observations. This is the least costly algorithm, but it exhibits a higher error ratio for early observations.
Frugal2UQuantiles
    // the closer the initial estimate (100) is to the true quantile, the faster it converges
    .quantile(0.95, 100)
    .quantile(0.5, 150)
    .create()
  • CKMSQuantiles - Allows you to tradeoff computational complexity for error ratio on a per quantile basis. Often, it is desirable for higher quantiles to have a lower error ratio (e.g. 0.99 at 1% error vs. 0.5 at 5% error). Still more computationally expensive than Frugal.
CKMSQuantiles
    .quantile(0.95, 0.01)
    .quantile(0.5, 0.05)
    .create()
  • GKQuantiles - Allows you to tradeoff computational complexity for error ratio across all quantiles. This is used inside of WindowSketchQuantiles.
GKQuantiles.quantiles(0.5, 0.95)
    .error(0.01) // OPTIONAL, defaults to 0.05
    .create()
Here is a demonstration of all four algorithms operating simultaneously on the same distribution:

Quantile Algorithms


Histogram Statistics

Timers and distribution summaries can be enriched with histogram statistics that yield a counter time series for each of a set of buckets.
Histograms can be used to compute quantiles or other summary statistics in some monitoring backends (e.g. Prometheus). Because histogram buckets are exposed as individual counters to the monitoring backend, it is possible to aggregate observations across a distributed system and compute summary statistics like quantiles for an entire cluster.
Naturally, the error rate of the computed summary statistic will be higher because of the lossy nature of bucketing data.
spring-metrics supports both cumulative and non-cumulative (normal) histograms and provides a set of generators for each.
DistributionSummary hist = meterRegistry.summaryBuilder("hist")
        .histogram(CumulativeHistogram.buckets(linear(0, 10, 20)))
        .create();
For timers, you can use timerBuilder(name) which mirrors this construction.
This sample constructs a cumulative histogram consisting of 20 buckets, one every 10 units beginning at 0.
To construct a normal histogram, use the generators on NormalHistogram.
For timers, be sure to specify the TimeUnit that your buckets represent. The bucket tag value on the time series will be normalized to the expected time base unit of the monitoring backend (e.g. seconds on Prometheus, nanoseconds on Atlas). In this way, you can keep your histograms backend agnostic.
CumulativeHistogram.buckets(linear(0, 10, 20), TimeUnit.MILLISECONDS);


Cache Monitoring

Guava caches can be instrumented with the registry, but it is important that you call recordStats() on the CacheBuilder, as it is not possible to turn this on after the Cache is constructed.
@Repository
class PersonRepository {
    LoadingCache<String, Person> personBySsn;

    public PersonRepository(MeterRegistry registry) {
        personBySsn = Meters.monitor(registry, CacheBuilder.newBuilder().recordStats().build(),
            "people_cache", // base metric name
            "lookup_key", "ssn" // <- any number of tag key/value pairs
        );
    }
}
Cache instrumentation results in several gauges whose names are prefixed by the provided name ("people_cache" in this example), corresponding to the stats recorded in CacheStats.
The original cache instance is unchanged by instrumentation.


Data Source Monitoring

Data sources can be instrumented with the registry. This requires the DataSourcePoolMetadataProvider automatically configured by Spring Boot, so it only works in a Spring Boot context where these providers are configured.
@Configuration
class MyConfiguration {
    @Autowired
    private DataSource dataSource;

    @Autowired
    private Collection<DataSourcePoolMetadataProvider> metadataProviders;

    @Autowired
    private Environment env;

    @Autowired
    private MeterRegistry registry;

    @PostConstruct
    private void instrumentDataSource() {
        Meters.monitor(
            registry,
            dataSource,
            metadataProviders,
            "data_source", // base metric name
            "stack", env.acceptsProfiles("prod") ? "prod" : "test", // <- any number of tags
        );
    }
}
Data source instrumentation results in gauges representing the currently active, maximum allowed, and minimum allowed connections in the pool. Each of these gauges has a name which is prefixed by the provided name ("data_source" in this example).
The original data source instance is unchanged by instrumentation.


Executor and ExecutorService Monitoring

Executor and ExecutorService instances can be instrumented with the registry. This includes any specializations of these types created by java.util.concurrent.Executors. Additionally, you can directly monitor ThreadPoolTaskExecutor and ThreadPoolTaskScheduler in a wholly similar way, but they must be initialized prior to attempting to instrument them.
@Configuration
class MyConfiguration {
    @Bean("worker_pool")
    ExecutorService workerPool(MeterRegistry registry) {
        return Meters.monitor(registry,
            Executors.newFixedThreadPool(8),
            "worker_pool",
            "threads", "8" // any number of tag key value pairs
        );
    }
}
ExecutorService instrumentation results in a composite counter that tracks the number of submitted, active, and completed tasks. Additionally, a timer records the execution time of tasks (plus a count of such tasks, since Timers always track both count and totalTime statistics).
Executor instrumentation just records the execution time.


Web Instrumentation

spring-metrics contains built-in instrumentation for timings of requests made to Spring MVC and Spring WebFlux server endpoints.

Web MVC and Annotation-Based WebFlux

Adding @EnableMetrics to your @SpringBootApplication class autoconfigures these interceptors.
The interceptors need to be enabled for every request handler or controller that you want to time. Add @Timed to:
  1. A controller class to enable timings on every request handler in the controller.
  2. A method to enable for an individual endpoint. This is not necessary if you have it on the class.
  3. A method with longTask = true to enable a long task timer for the method. Long task timers require a separate metric name, and can be stacked with a short task timer.
@RestController
@Timed // (1)
public class MyController {
    @GetMapping("/api/people")
    @Timed // (2)
    @Timed(value = "all_people", longTask = true) // (3)
    public List<Person> listPeople() { ... }
}
The Timer is registered with a name of http_server_requests by default. This can be changed by setting spring.metrics.web.server_requests.name.
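For example, in application.properties (the replacement name here is arbitrary):
spring.metrics.web.server_requests.name=http_requests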
The Timer contains a set of dimensions for every request, governed by the primary bean WebfluxTagConfigurer or WebmvcTagConfigurer (depending on which programming model you are using) registered in your application context. If you don't provide such a bean, a default implementation is selected which adds the following dimensions:
  1. method, the HTTP method (e.g. GET, PUT)
  2. status, the numeric HTTP status code (e.g. 200, 201, 500)
  3. uri, the URI template prior to variable substitution (e.g. /api/person/{id})
  4. exception, the simple name of the exception class thrown (only if an exception is thrown)
In addition to the default tags provided, you can add fixed tags to individual controllers or request methods via the extraTags attribute on @Timed:
@Timed(extraTags = {"authenticated", "false"})

Webflux Functional

spring-metrics contains a filter that you can add to a RouterFunction to instrument timings to its routes.
RouterFunctionMetrics metrics = new RouterFunctionMetrics(registry);

// OPTIONAL: the default is to record tags on method and status
metrics.defaultTags((req, resp) -> { /* custom tags here */ });

RouterFunction<ServerResponse> routes = RouterFunctions
    .route(GET("/person/").and(accept(APPLICATION_JSON)),
        request -> ServerResponse.ok().build())
    .filter(metrics.timer(
        "http_server_requests", // metric name
        "instance", MY_INSTANCE_ID // optional tags
    ));
The filter applies to all routes defined by this router function.
Separately, a router function generator is provided to add a scraping endpoint to a Webflux functional application:
PrometheusMeterRegistry meterRegistry = new PrometheusMeterRegistry();
RouterFunction<ServerResponse> route = route(GET("/prometheus"),
    PrometheusFunctions.scrape(meterRegistry));
You can compose this router function with any other router functions that are instrumented with metrics.
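A one-line sketch of such a composition using RouterFunction's and(...), where routes and route refer to the two snippets above:
RouterFunction<ServerResponse> all = routes.and(route);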

Client-side HTTP Instrumentation

Enabling metrics in your Spring Boot application configures a BeanPostProcessor for RestTemplate, so every instance you create via the application context will be instrumented.
A timer is recorded for each invocation that includes tags for URI (before parameter substitution), host, and status. The name of this timer is http_client_requests, and can be changed via the spring.metrics.web.client_requests.name property.
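A minimal sketch (the URL and Person type are placeholders); the only requirement is that the RestTemplate is created as a bean in the application context:
@Bean
public RestTemplate restTemplate() {
    return new RestTemplate(); // instrumented by the BeanPostProcessor
}

// each call made through the instrumented template is timed under http_client_requests
Person person = restTemplate.getForObject("https://example.com/api/person/{id}", Person.class, 42);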


Scheduling Instrumentation

Enabling metrics in your Spring Boot application, plus enabling AOP, configures AOP advice that times @Scheduled methods. For a method to be timed, it must be annotated with @Timed("my_metric_name"), including a metric name.
Depending on the duration of the scheduled task, you may want to choose to time the method with a LongTaskTimer, a Timer, or both. Below is an example of measuring both long task and regular timings to a scheduled task:
@Timed("beep")
@Timed(value = "long_beep", longTask = true)
@Scheduled(fixedRate = 1000)
void longBeep() {
    // calculate the meaning of life, then beep...
    System.out.println("beep");
}
