
Avro Schema Versioning and Performance – Part 2

July 21, 2016 Author: Matt Dailey

Version dependency is an important consideration, especially in large distributed systems where it's usually impossible to upgrade all components in unison. This is the second in a series of posts about how we use Avro schema versioning in Rocana Ops to help minimize this dependency with minimal performance impact.

In our first post, we discussed how Rocana Ops leverages Apache Avro for data interchange and third-party integrations. After describing how we evolve our Avro schema, we noted that we wanted to run microbenchmarks to improve our Avro decode performance. Check out part 1 of this series: High Performance Avro Schema Versioning in Rocana Ops.

To get started with microbenchmarking, we used the Java Microbenchmark Harness (JMH), a part of the OpenJDK project. JMH is a great tool that helps you avoid many pitfalls commonly seen when attempting to write microbenchmarks from scratch. These include working around JIT compiler optimizations and de-optimizations, and avoiding measuring JIT compilation time. Brian Goetz's 2005 article on JVM microbenchmarking goes into more detail about the pitfalls that JMH combats.
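
For contrast, here is a minimal sketch of our own (not from JMH or Rocana Ops) of the kind of hand-rolled timing loop JMH replaces. It exhibits exactly those pitfalls: the earliest iterations measure interpreted and still-compiling code, and because the parsed result is discarded, the JIT is free to eliminate the work entirely:

package com.rocana.microbenchmarks;

// A naive, hand-rolled timing loop of the kind JMH replaces.  The number
// it prints is untrustworthy: warmup and JIT compilation time are mixed
// into the measurement, and the unused result invites dead-code elimination.
public class NaiveBenchmark {

 public static void main(String[] args) {
   int iterations = 10_000_000;
   long start = System.nanoTime();
   for (int i = 0; i < iterations; i++) {
     Integer.parseInt("101"); // result discarded: a dead-code elimination risk
   }
   long elapsedNanos = System.nanoTime() - start;
   System.out.printf("%.1f ns/op%n", (double) elapsedNanos / iterations);
 }
}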

Creating benchmarks in JMH is straightforward. They can be set up similar to JUnit tests, and executed with JMH's Runner class. Here is a snippet to show the setup and benchmark:

package com.rocana.microbenchmarks;

import com.rocana.event.Event;
import com.rocana.kafka.EventDecoder;
import com.rocana.kafka.EventEncoder;

import com.google.common.collect.Maps;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Level;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

import java.nio.ByteBuffer;

// JMH annotation to say each Thread running the benchmarks will have its own
// copy of the state of this class
@State(Scope.Thread)
public class EventBenchmark {

 private EventDecoder decoder;

 private byte[] eventBytes;

 // Run this setup once before each Iteration.  An Iteration is a collection
 // of Invocations of the benchmark method over a period of time.
 @Setup(Level.Iteration)
 public void setup() throws Exception {
   // EventEncoder and EventDecoder are our custom classes for encoding and decoding
   // the Event class
   EventEncoder encoder = new EventEncoder();
   decoder = new EventDecoder();

   // Event is a class generated from an Avro schema.  Here, we use the builder
   // to create an Event with any proper default values
   Event event = Event.newBuilder()
     .setId("id_101")
     .setTs(System.currentTimeMillis())
     .setEventTypeId(101)
     .setService("service")
     .setLocation("location")
     .setHost("host")
     .setBody(ByteBuffer.wrap(new byte[0]))
     .setAttributes(Maps.<String, String>newHashMap())
     .build();

   // serialize the Event to bytes, normally for transport or storage.  Here
   // we use the bytes for benchmarking the decoder
   eventBytes = encoder.toBytes(event);
 }

 @Benchmark
 public Event testDecode() {
   // decode the bytes as an Event, benchmarking the process
   return decoder.fromBytes(eventBytes);
 }

 @Benchmark
 public void noop() {}
}

There are a few lessons in this code that are not obvious:

  • Always return something from the benchmark. The JVM's JIT compiler is very good at noticing if a method has no effect, so void methods are prone to being optimized to a no-op. (An alternative using JMH's Blackhole is sketched after this list.)
  • The benchmark method does not contain a loop. The JIT can also optimize a loop to a single iteration if it notices there is no side effect to the loop. JMH handles iterating your benchmark in a JIT-safe way.
  • Including a noop benchmark is useful to show the maximum throughput of any operation. This also helps clue you in to any issues in other benchmarks, such as the JIT optimizing your benchmark to a no-op.
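
When returning a value is inconvenient, JMH also offers a second way to defeat dead-code elimination: the Blackhole. Here is a minimal sketch of our own, not code from Rocana Ops:

package com.rocana.microbenchmarks;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Thread)
public class BlackholeBenchmark {

 private long input = 101L;

 // JMH injects the Blackhole as a parameter; consuming a value through it
 // keeps the JIT from treating the computation as dead code, just as
 // effectively as returning the value would
 @Benchmark
 public void testWithBlackhole(Blackhole blackhole) {
   blackhole.consume(Long.toHexString(input));
 }
}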

The code to execute these benchmarks is also straightforward:

package com.rocana.microbenchmarks;

import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.concurrent.TimeUnit;

/** Replaces org.openjdk.jmh.Main with our preferred defaults */
public class Main {

 public static void main(String[] args) throws RunnerException {
   Options opt = new OptionsBuilder()
     .include("com.rocana.microbenchmarks.*") // include our benchmarks
     .forks(1)                                // number of times to run iterations in a separate process
     .mode(Mode.All)                          // run all modes (Throughput, AverageTime, etc.)
     .timeUnit(TimeUnit.MICROSECONDS)         // report results in microseconds (usec)
     .shouldFailOnError(true)
     .build();

   new Runner(opt).run();
 }
}

After building and launching an executable jar, we can run the benchmarks. With the default settings, each benchmark runs for about 40 seconds in each of three run modes (Throughput, AverageTime, and SampleTime), plus a little more time for the SingleShotTime mode. This comes to roughly 120 seconds per benchmarked method.
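
When that default run length is too slow for day-to-day iteration, JMH's per-class annotations can shorten it. The values below are a sketch, not the settings we use, and note that options set explicitly on the Runner (like the Mode.All in our Main above) take precedence over annotations:

package com.rocana.microbenchmarks;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Warmup;

import java.util.concurrent.TimeUnit;

// run a single mode with shorter warmup and measurement phases for
// faster (though less precise) feedback while iterating on code
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5, time = 1)        // 5 x 1s warmup instead of 20 x 1s
@Measurement(iterations = 5, time = 1)   // 5 x 1s measurement instead of 20 x 1s
@Fork(1)
public class QuickFeedbackBenchmark {

 @Benchmark
 public long testQuick() {
   return System.nanoTime();
 }
}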

The benchmark output is itself human-readable, giving more useful statistics than just the mean. Here is some output for the Throughput mode on one benchmark method.

# Warmup: 20 iterations, 1 s each
# Measurement: 20 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
... Iteration output ...

Result "testDecode":
  1.266 ±(99.9%) 0.063 ops/us [Average]
  (min, avg, max) = (1.142, 1.266, 1.336), stdev = 0.073
  CI (99.9%): [1.203, 1.329] (assumes normal distribution)

First, our benchmark tells us this implementation of fromBytes can perform, on average, 1.266 decodes per microsecond, or roughly 0.79 microseconds per decode. More importantly, the 99.9% confidence interval tells us the throughput falls between 1.203 and 1.329 operations per microsecond, giving us reasonable upper and lower bounds on performance.

One thing to note: these microbenchmarks on their own may not tell the whole story for performance in real-world scenarios, so extrapolating these statistics to scale is probably not advised. What they do allow us to do is comparative analysis between implementations of our benchmarked methods.
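
As a sketch of that comparative pattern (with an illustrative array-copy example of our own standing in for competing decoder implementations), you benchmark the candidates side by side against identical input:

package com.rocana.microbenchmarks;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

import java.util.Arrays;

@State(Scope.Thread)
public class CopyComparisonBenchmark {

 private byte[] source;

 @Setup
 public void setup() {
   source = new byte[1024];
 }

 // candidate 1: manual element-by-element copy
 @Benchmark
 public byte[] copyWithLoop() {
   byte[] copy = new byte[source.length];
   for (int i = 0; i < source.length; i++) {
     copy[i] = source[i];
   }
   return copy;
 }

 // candidate 2: JDK library copy
 @Benchmark
 public byte[] copyWithArraysCopyOf() {
   return Arrays.copyOf(source, source.length);
 }
}

Because both candidates run in the same JMH session with identical settings, the relative difference between their scores and confidence intervals is meaningful even when the absolute numbers are not.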

In the final post of this series, Avro Schema Versioning and Performance – Part 3, we discuss how we used these tools to improve Avro Event decoding performance in Rocana Ops.

