Pivoting Elasticsearch data

As I've possibly mentioned here before, ClueTrust is using Elasticsearch to perform analysis of log information. Recently, I finally decided to take some our telemetry inforamtion and pull it in to Elasticsearch as a data exploration and statistical tool.

Importing structured XML data into Elasticsearch

Although there are some filters and logstash methods that have this capability, the XML that we use is extremely regular (strict schemas, etc.), and I felt that it would be better to directly and intentionally import based on the DOM that I'd created in 2009 when preparing for Cartographica to ship.

For purposes of illustration, the basic form of the Cartographica telemetry files is:

Preamble
Crash logs (yep, they're embedded)
Event stream
- Errors
- Events (launch, quit, and other)
- Exceptions
- Statistics (at quit and other times)

Due to the way that Elasticsearch works, it turns out this is a really workable input, generating a (possibly too verbose) set of items from each telemetry report, including:

Report
Launch
Event
Crash
Error
Exception
Statistic

For most of these items, the format is regular and arguments are inserted directly into the record (so a Crash has a crash log along with some interpretive data as well as the preamble from the report). This holds true for Events and Errors as basically individual data points inside of a preamble+launch context. The only oddity is the Statistic report which contains many "columns" of data for each statistic event. It's not lost on me that this idempotent data set is very SNMP-like.

Searching for meaning among the data

Due to the choice to separate these out as separate objects in Elasticsearch, most statistical information is straightforward to ascertain. Want to know what formats are most popular? Look for importVector or importRaster events and tabulate the number of times each format and/or driver are used. Interested in how frequently a particular analysis tool is used? Look for its corresponding event.

When (and how) to pivot your data

The one piece that had me stumped for a few days was: how do I determine how many active users are on which version of macOS? I've got launch data and a unique (but pseudonymous) host identifier. For example, you can create buckets based on the OS and count unique host IDs... unfortunately, that creates a data problem with users who have upgraded during the time period being examined. Using this technique, a user who was running macOS 11 and upgrading on each release day would account for 10 separate macOS counters.

What I really needed was to look at the data for just the most recent report for each host ID. Basically, I needed to pivot around the host ID. After much too much time trying to find a complex way through this, I finally searched on "pivot elasticsearch" and found Pivot Transformations, which turns out to be just what I needed. By creating a transformed index with the latest method, I was able to get an index that only pointed to the most recent documents for each host ID. Once I had this, I could aggregate using terms to find the operating system, resulting in a bucket of OS versions used most recently by each host ID.

Pivot Transformations create an alternate index to documents in another index. In my case, I used a latest transformation, which maintains only the most recent item for each unique key, based on the specified timestamp field and possibly limited by a filter. In my case:

{
  "source": {
    "index": [
      "ct-app-logs-*"
    ]
  },
  "latest": {
    "unique_key": [
      "report.host.id"
    ],
    "sort": "report.timestamp"
  },
  "description": "Application Hosts",
  "frequency": "1m",
  "dest": {
    "index": "ct-app-hosts"
  }
}

creates the new index based on the existing ct-app-logs-* pattern and pulling out items by the unique report.host.id key, using report.timestamp to determine which item is the most recent. This boils down 12 indexes containing 18.5M documents into a single index containing 16K documents.

The destination index, ct-app-hosts was set up ahead of time using a basic clone of the original index.

If desired, I could add a query key which would have limited the scope of the documents.

Upping the ante for aggregation

Once I got this going, I was having some issues pulling information out of the data due to some variances in how versions were managed. In particular, I was interested in seeing major OS versions (macOS 10.15, macOS 11, iOS 14, etc.) and maybe the same for the application version.

To facilitate this, I used runtime fields, setting up 2 additional mappings in the destination index (ct-app-hosts) pointed at above. To do this, I PUT a new index definition for the index, citing the following:

{
	"mappings": {
	  "runtime": {
	    "major_app": {
	      "type":"keyword",
	      "script": {
	        "source": """
def myField = doc['report.application.version'];
if (myField.empty)
	emit("");
else {
	def dom	= myField.value;
	for( String suffix : ['a','d','b']) {
		if (dom.indexOf(suffix)>0) {
		  dom = dom.substring(0,dom.indexOf(suffix));
		}
	}
	int last = dom.lastIndexOf('.');
	if (last == dom.indexOf('.'))
	  emit(dom);
	else
	  emit(dom.substring(0,last));
}""",
            "lang" : "painless"
          }
        },
        "major_os" : {
          "type" : "keyword",
          "script" : {
            "source" : """
def myField = doc['report.host.os.version'];
if (myField.empty)
	emit("");
else {
	def dom	= myField.value;
	int last = dom.lastIndexOf('.');
	def major = dom.substring(0,last);
	if (major.startsWith('11') || major=='10.16') {
	  emit('11');
	} else {
	  emit(major);
	}
}""",
            "lang" : "painless"
          }
        }
      },

This creates a new column, major_app which:

Checks that report.application.version exists
Removes any suffix starting with a, b or d
Removes the last . (unless it is also the first ., such as in 1.4)

Similarly, it creates a column major_os which:

Checks that report.host.os.version exists
Removes the last .
Makes sure to emit 11 for 10.16

With these two powerful tools, I was able to create a clear, concise, and constantly-up-to-date resource for OS and application usage.