<h1>Vaulting AWS credentials</h1>
<p><em>2024-03-20, Gaige B. Paulsen</em></p>
<p>I've been describing our Hashicorp Vault journey here at ClueTrust
in a number of posts. Chief among the reasons to use Vault is its
ability to generate and rotate credentials with specific systems and
services.</p>
<p>I've written before about
<a href="https://www.gaige.net/vault-local-testing-setup.html">PostgreSQL credential management</a>
using Vault, which has been quite successful. This has allowed for short-term,
tightly-scoped credentials when interacting with our PostgreSQL servers,
meaning that not only are we definitely not storing credentials in code,
but the credentials we are using are only minimally potent if taken out
of the environment.</p>
<p>This weekend, I began using the
<a href="https://developer.hashicorp.com/vault/api-docs/secret/aws">AWS Secrets Engine</a>
with the intention of creating a similar pattern for accessing our AWS
resources.</p>
<p>Although ClueTrust mostly runs its own physical infrastructure, we do have
some systems that we run in other environments, including
geographically-diverse nameservers we run in AWS. As such, we use AWS
credentials to deploy and maintain those (through Ansible, of course).</p>
<p>This weekend's effort was to move from using ansible-vault-stored
secrets (which I would hand-rotate every 3-6 months) to using
Hashicorp-Vault-stored secrets, which would be created as needed and
would expire quickly.</p>
<h2>STS Credentials</h2>
<p>I have a fondness for minimized blast radius both in scope and time,
which leads to a preference for using AWS STS tokens (see
<a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp.html">Temporary Security Tokens in AWS</a>).</p>
<p>Using STS in Vault involves a bit of a trade-off. When you create
your AWS role in Vault, that role must have the ability to create
and maintain STS tokens, but it must also carry all of the entitlements
that you will be granting in that environment. This can be scoped
per AWS Secrets Engine (so you can run multiple accounts, each at its
own endpoint), but the intent is to trust Vault with the union of all
the privileges you need. Initially this may feel
risky, as you're concentrating risk in that one set of credentials.
However, if you were to use multiple static AWS roles and store them
in the same Vault, you'd still have the same blast-radius scope, and
each of those credentials would likely have a larger temporal blast
radius.</p>
<p>By using STS credentials, you can scope minimally in time and
specify permissions by existing AWS IAM groups, policy ARNs, or a
specific policy document. This gives Vault the flexibility to issue
tightly-scoped credentials.</p>
<p>In my case, I used <code>federation_token</code>, which provides maximum flexibility
on the Vault side. Also available are <code>assumed_role</code> (which uses
STS to assume a role that the Vault-assigned AWS user is able to
assume) and <code>iam_user</code>, for which Vault creates a new user for each
request and then deletes that user after a TTL. Vault also supports
static roles, which are similar to static credentials in databases.</p>
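<p>For comparison, a role using <code>assumed_role</code> might be written along
these lines (a minimal sketch; the role name and ARN are placeholders, not
values from my setup):</p>
<div class="codehilite"><pre><code># hypothetical assumed_role-based Vault role; the ARN is an example only
vault write aws/roles/deploy-assumed \
    credential_type=assumed_role \
    role_arns=arn:aws:iam::123456789012:role/deploy
</code></pre></div>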
<h2>AWS Policies and User</h2>
<p>To enable the use of <code>federation_token</code>, your AWS user needs to have permission to use <code>sts:GetFederationToken</code>.</p>
<p>I created a Policy called <code>Federator</code> as such:</p>
<div class="codehilite"><pre><span></span><code><span class="p">{</span>
<span class="w"> </span><span class="nt">"Version"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2012-10-17"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"Statement"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"Sid"</span><span class="p">:</span><span class="w"> </span><span class="s2">"VisualEditor0"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"Effect"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Allow"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"Action"</span><span class="p">:</span><span class="w"> </span><span class="s2">"sts:GetFederationToken"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"Resource"</span><span class="p">:</span><span class="w"> </span><span class="s2">"*"</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">]</span>
<span class="p">}</span>
</code></pre></div>
<p>A second policy (<code>Change-self-access-keys</code>) to allow for self-rotation of access keys:</p>
<div class="codehilite"><pre><span></span><code><span class="p">{</span>
<span class="w"> </span><span class="nt">"Version"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2012-10-17"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"Statement"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"Effect"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Allow"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"Action"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="s2">"iam:ListUsers"</span><span class="p">,</span>
<span class="w"> </span><span class="s2">"iam:GetAccountPasswordPolicy"</span>
<span class="w"> </span><span class="p">],</span>
<span class="w"> </span><span class="nt">"Resource"</span><span class="p">:</span><span class="w"> </span><span class="s2">"*"</span>
<span class="w"> </span><span class="p">},</span>
<span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"Effect"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Allow"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"Action"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="s2">"iam:*AccessKey*"</span><span class="p">,</span>
<span class="w"> </span><span class="s2">"iam:GetUser"</span>
<span class="w"> </span><span class="p">],</span>
<span class="w"> </span><span class="nt">"Resource"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="s2">"arn:aws:iam::*:user/${aws:username}"</span>
<span class="w"> </span><span class="p">]</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">]</span>
<span class="p">}</span>
</code></pre></div>
<p>and then a user (<code>vault-federation</code>) which was assigned these two
policies and the additional permissions required to do
actual work.</p>
<p>You can assign these directly, although I did so through
groups.</p>
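<p>If you're doing this from the CLI rather than the console, the shape is
roughly this (a sketch; the group name and account ID are placeholders):</p>
<div class="codehilite"><pre><code># hypothetical CLI equivalent; account ID and group name are examples
aws iam create-user --user-name vault-federation
aws iam create-group --group-name vault-federators
aws iam attach-group-policy --group-name vault-federators \
    --policy-arn arn:aws:iam::123456789012:policy/Federator
aws iam attach-group-policy --group-name vault-federators \
    --policy-arn arn:aws:iam::123456789012:policy/Change-self-access-keys
aws iam add-user-to-group --user-name vault-federation --group-name vault-federators
</code></pre></div>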
<h2>Establishing AWS secrets in Vault</h2>
<p>The process of configuring Vault for AWS involves:</p>
<ol>
<li>
<p>Enable your new secrets store in Vault:</p>
<div class="codehilite"><pre><span></span><code>vault secrets enable aws
</code></pre></div>
</li>
<li>
<p>Create the IAM Role(s), Policies, and Users that you need in AWS (discussed below)</p>
</li>
<li>
<p>Register your User (for <code>federation_token</code>, you'll need an
actual user; for <code>assumed_role</code> and <code>iam_user</code> you can authenticate
to a Role) with Vault:</p>
<div class="codehilite"><pre><span></span><code>vault<span class="w"> </span>write<span class="w"> </span>aws/config/root<span class="w"> </span><span class="nv">access_key</span><span class="o">=</span>YYY<span class="w"> </span><span class="se">\ </span>
<span class="w"> </span><span class="nv">secret_key</span><span class="o">=</span>XXX<span class="w"> </span><span class="nv">region</span><span class="o">=</span>us-east-1<span class="w"> </span><span class="se">\</span>
<span class="w"> </span><span class="nv">sts_endpoint</span><span class="o">=</span>https://sts.us-east-1.amazonaws.com<span class="w"> </span><span class="nv">sts_region</span><span class="o">=</span>us-east-1
</code></pre></div>
<p>(Note: for <code>federation_token</code> users, you'll want to assign an
<code>sts_endpoint</code> and <code>sts_region</code> to enable use across non-default
regions. It's usually best to choose a region that you frequently
operate in.)</p>
</li>
<li>
<p>Once you've registered the "root" user (poorly named, but it is what it
is), you should rotate the credentials so that only vault has them. To do this:</p>
<div class="codehilite"><pre><span></span><code>vault<span class="w"> </span>write<span class="w"> </span>-force<span class="w"> </span>aws/config/rotate-root
</code></pre></div>
</li>
</ol>
<h2>Creating AWS roles in Vault</h2>
<p>Now the fun begins. As is typical in Vault, you'll establish
individual "roles" which can be used to retrieve credentials
from AWS, and through Vault policies you'll determine which
users can access those roles.</p>
<p>I tend to create these roles through scripts and check them
into a git repository, so I will use a command like this:</p>
<div class="codehilite"><pre><span></span><code>vault<span class="w"> </span>write<span class="w"> </span>aws/roles/<span class="nv">$role</span><span class="w"> </span>-<span class="w"> </span><<span class="w"> </span>aws/roles/<span class="nv">$role</span>.json
</code></pre></div>
<p>in order to load my roles. An example of the JSON for a role I'm using is:</p>
<div class="codehilite"><pre><span></span><code><span class="p">{</span>
<span class="w"> </span><span class="nt">"credential_type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"federation_token"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"default_sts_ttl"</span><span class="p">:</span><span class="w"> </span><span class="mi">300</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"iam_groups"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"max_sts_ttl"</span><span class="p">:</span><span class="w"> </span><span class="mi">3600</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"policy_arns"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="s2">"arn:aws:iam::aws:policy/AmazonEC2FullAccess"</span>
<span class="w"> </span><span class="p">],</span>
<span class="w"> </span><span class="nt">"policy_document"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span>
<span class="p">}</span>
</code></pre></div>
<p>I could use the <code>iam_groups</code> list to apply one or more
groups as the policy, <code>policy_document</code> to pass a document
containing the precise policy to assign (assuming it's a subset of the
policies the role has), or (as I'm doing here) pass complete ARNs in a list.</p>
<p>All that remains is to make sure this role is accessible from
your users or roles in Vault (left as an exercise to the reader).</p>
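<p>As a starting point, a Vault policy granting access to this role could
look like the following sketch (the policy name is hypothetical; adjust the
path to your mount and role):</p>
<div class="codehilite"><pre><code># minimal sketch: allow generating STS credentials for the ec2-admin role
vault policy write ec2-admin-sts - <<'EOF'
path "aws/sts/ec2-admin" {
  capabilities = ["create", "update"]
}
EOF
</code></pre></div>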
<h2>Using AWS credentials from Vault</h2>
<p>Use of the credentials is straightforward:</p>
<ol>
<li>Request the credential through CLI or API<div class="codehilite"><pre><span></span><code>%<span class="w"> </span>vault<span class="w"> </span>write<span class="w"> </span>aws/sts/ec2-admin<span class="w"> </span><span class="nv">ttl</span><span class="o">=</span>15m<span class="w"> </span>
Key<span class="w"> </span>Value
---<span class="w"> </span>-----
lease_id<span class="w"> </span>aws/sts/ec2-admin/NNN
lease_duration<span class="w"> </span>14m59s
lease_renewable<span class="w"> </span><span class="nb">false</span>
access_key<span class="w"> </span>XXX
secret_key<span class="w"> </span>YYY
security_token<span class="w"> </span>ZZZ
ttl<span class="w"> </span>14m59s
</code></pre></div>
</li>
<li>Use these credentials for the next 15 minutes as necessary</li>
</ol>
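<p>One way to consume these is through the standard AWS environment
variables; a sketch (assuming <code>jq</code> is available, and using the role
name from the example above):</p>
<div class="codehilite"><pre><code># request once, then export the three values the AWS SDKs and CLI expect
creds=$(vault write -format=json aws/sts/ec2-admin ttl=15m)
export AWS_ACCESS_KEY_ID=$(echo "$creds" | jq -r .data.access_key)
export AWS_SECRET_ACCESS_KEY=$(echo "$creds" | jq -r .data.secret_key)
export AWS_SESSION_TOKEN=$(echo "$creds" | jq -r .data.security_token)
</code></pre></div>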
<h1>Poetry in Production</h1>
<p><em>2023-11-05, Gaige B. Paulsen</em></p>
<p>I regularly use <a href="https://python-poetry.org">poetry</a> to isolate development
environments as I'm putting applications together.
I've been happy with it, and there are a number of methods that I've developed
for using poetry in various environments.</p>
<p>For production, there are a number of different mechanisms used
by people in the poetry community:</p>
<ul>
<li>use poetry directly (<code>poetry run application</code>)</li>
<li>install directly in the root environment after exporting
requirements (<code>poetry export --without dev -o requirements.txt</code>)</li>
<li>use an in-tree virtual environment</li>
</ul>
<h2>Decision process</h2>
<p>I've tried all of them, and you can make them all work. However,
after some investigation, I've settled, for now, on the
in-tree virtual environment for ease of use.</p>
<p>I'd recommend against installing directly in the root environment
if possible, because of the conflicts that arise if you need to install
more than one application's dependencies on a server. Generally, you'd
think this would be unnecessary, since you should be isolating your
servers anyway, and in a minimalist container environment that's likely true.</p>
<p>In our case, since we use slightly heavier-weight containers
(Solaris Zones), I occasionally have other tools
(like background processes that may be working on the same data)
in the same zone. As such, you can still run into conflicts for
dependencies and the virtualenv isolation brings some benefits.</p>
<p>Once you've determined that you're running in a virtual
environment, the question becomes where to put your virtual
environment data. For development, I prefer to leave it in
the default (cache) directories because it's easier for me
to remake those environments en masse when I upgrade the
Python interpreter(s).</p>
<p>For production environments, the rebuild problem isn't an
issue and the execution environments are generally limited.
At this point, it's really a matter of tidiness and standardization.</p>
<h2>Implementation</h2>
<p>For our Solaris zones, the virtual environments could go anywhere, but I prefer
the ability to nuke and reconstitute from source quickly,
without having to go hunt down the virtualenv directory.</p>
<p>In the Docker environments that I use for some applications, the
problem becomes a bit more acute. Since I want to create the
install environment using the <code>python-dev</code> container and then
deploy it using a runtime container, I need to copy
everything over, and a standard location makes that easier.</p>
<p>As such, my installation process tends to be:</p>
<ol>
<li>
<p>Set up the in-tree virtual environment:</p>
<div class="codehilite"><pre><span></span><code>pip<span class="w"> </span>install<span class="w"> </span>poetry
poetry<span class="w"> </span>config<span class="w"> </span>virtualenvs.in-project<span class="w"> </span><span class="nb">true</span>
</code></pre></div>
</li>
<li>
<p>Capture the execution environment (which should be in <code>.venv</code>):</p>
<div class="codehilite"><pre><span></span><code>poetry<span class="w"> </span>env<span class="w"> </span>info<span class="w"> </span>--path
</code></pre></div>
</li>
<li>
<p>Export the main-only requirements to a file for installation
(skipping hashes in our environment because we have some home-built
packages that we don't gather hashes on yet):</p>
<div class="codehilite"><pre><span></span><code>poetry<span class="w"> </span><span class="nb">export</span><span class="w"> </span>--only<span class="w"> </span>main<span class="w"> </span>--without-hashes<span class="w"> </span>--output<span class="w"> </span>/tmp/requirements.txt
</code></pre></div>
</li>
<li>
<p>Install the requirements in the virtual environment (note the
explicit <code>.venv</code> path; a bare <code>pip</code> would install into the
system Python):</p>
<div class="codehilite"><pre><code>.venv/bin/pip install -r /tmp/requirements.txt
</code></pre></div>
</li>
</ol>
<p>I have used these successfully both in Dockerfile (using
multi-stage builds and copying the <code>.venv</code> over) and in
Ansible for deployment in our Solaris zones environments.</p>
<h2>Ansible</h2>
<p>This isn't a complete Ansible playbook, but it should give you
an idea of how to construct an effective one:</p>
<div class="codehilite"><pre><span></span><code><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Install poetry</span>
<span class="w"> </span><span class="nt">pip</span><span class="p">:</span>
<span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">poetry</span>
<span class="w"> </span><span class="nt">state</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">present</span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Set up the in-tree virtual environment</span>
<span class="w"> </span><span class="nt">command</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">poetry config virtualenvs.in-project true</span>
<span class="w"> </span><span class="nt">args</span><span class="p">:</span>
<span class="w"> </span><span class="nt">chdir</span><span class="p">:</span><span class="w"> </span><span class="s">'{{</span><span class="nv"> </span><span class="s">program_base</span><span class="nv"> </span><span class="s">}}'</span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Capture the execution environment</span>
<span class="w"> </span><span class="nt">command</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">poetry env info --path</span>
<span class="w"> </span><span class="nt">register</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">poetry_env</span>
<span class="w"> </span><span class="nt">args</span><span class="p">:</span>
<span class="w"> </span><span class="nt">chdir</span><span class="p">:</span><span class="w"> </span><span class="s">'{{</span><span class="nv"> </span><span class="s">program_base</span><span class="nv"> </span><span class="s">}}'</span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Export the main-only requirements to a file for installation</span>
<span class="w"> </span><span class="nt">command</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">poetry export --only main --without-hashes --output /tmp/requirements.txt</span>
<span class="w"> </span><span class="nt">args</span><span class="p">:</span>
<span class="w"> </span><span class="nt">chdir</span><span class="p">:</span><span class="w"> </span><span class="s">"{{</span><span class="nv"> </span><span class="s">poetry_env.stdout</span><span class="nv"> </span><span class="s">}}"</span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Install the requirements in the virtual environment</span>
<span class="w"> </span><span class="nt">pip</span><span class="p">:</span>
<span class="w"> </span><span class="nt">requirements</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">/tmp/requirements.txt</span>
<span class="w"> </span><span class="nt">virtualenv</span><span class="p">:</span><span class="w"> </span><span class="s">"{{</span><span class="nv"> </span><span class="s">poetry_env.stdout</span><span class="nv"> </span><span class="s">}}"</span>
</code></pre></div>
<p>In this case <code>program_base</code> is the directory where the <code>pyproject.toml</code>
file is located.</p>
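<p>Invocation is then just a matter of supplying that variable; for
example (the playbook name and path are hypothetical):</p>
<div class="codehilite"><pre><code>ansible-playbook deploy.yml -e program_base=/opt/myapp
</code></pre></div>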
<h2>Docker version</h2>
<p>In the Docker version, you'd use a multi-stage build to create the
<code>.venv</code> and then copy it over to the runtime container. Here's a
simplified example:</p>
<div class="codehilite"><pre><span></span><code><span class="k">FROM</span><span class="w"> </span><span class="s">python:3.9</span><span class="w"> </span><span class="k">as</span><span class="w"> </span><span class="s">python-dev</span>
<span class="c"># Install poetry</span>
<span class="k">RUN</span><span class="w"> </span>pip<span class="w"> </span>install<span class="w"> </span>poetry
<span class="c"># Set up the in-tree virtual environment</span>
<span class="k">WORKDIR</span><span class="w"> </span><span class="s">/app</span>
<span class="k">RUN</span><span class="w"> </span>poetry<span class="w"> </span>config<span class="w"> </span>virtualenvs.in-project<span class="w"> </span>true
<span class="c"># Capture the execution environment</span>
<span class="k">RUN</span><span class="w"> </span>poetry<span class="w"> </span>env<span class="w"> </span>info<span class="w"> </span>--path<span class="w"> </span>><span class="w"> </span>/tmp/poetry_env_path
<span class="c"># Export the main-only requirements to a file for installation</span>
<span class="k">RUN</span><span class="w"> </span>poetry<span class="w"> </span><span class="nb">export</span><span class="w"> </span>--only<span class="w"> </span>main<span class="w"> </span>--without-hashes<span class="w"> </span>--output<span class="w"> </span>/tmp/requirements.txt
<span class="c"># Install the requirements in the virtual environment</span>
<span class="k">RUN</span><span class="w"> </span>pip<span class="w"> </span>install<span class="w"> </span>-r<span class="w"> </span>/tmp/requirements.txt
<span class="c"># Runtime container</span>
<span class="k">FROM</span><span class="w"> </span><span class="s">python:3.9</span>
<span class="c"># Copy the virtual environment from the python-dev stage</span>
<span class="k">COPY</span><span class="w"> </span>--from<span class="o">=</span>python-dev<span class="w"> </span>/app/.venv<span class="w"> </span>/app/.venv
<span class="c"># Set the virtual environment as the default Python environment</span>
<span class="k">ENV</span><span class="w"> </span><span class="nv">PATH</span><span class="o">=</span><span class="s2">"/app/.venv/bin:</span><span class="nv">$PATH</span><span class="s2">"</span>
<span class="c"># Copy your application code to the container</span>
<span class="k">COPY</span><span class="w"> </span>.<span class="w"> </span>/app
<span class="c"># Set the working directory</span>
<span class="k">WORKDIR</span><span class="w"> </span><span class="s">/app</span>
<span class="c"># Run your application</span>
<span class="k">CMD</span><span class="w"> </span><span class="p">[</span><span class="s2">"/app/.venv/bin/python"</span><span class="p">,</span><span class="w"> </span><span class="s2">"app.py"</span><span class="p">]</span>
</code></pre></div>
<p>This is a simplified example, but it should give you a good start.
If you are installing only modules (for example if your build
steps result in wheels), you will need to make some modifications.</p>
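<p>For the wheel case, the change is roughly to build in the dev stage and
then install the artifact into the virtual environment; a sketch:</p>
<div class="codehilite"><pre><code># hypothetical wheel-based variant of the install step
poetry build --format wheel
.venv/bin/pip install dist/*.whl
</code></pre></div>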
<h1>Renovating git tags</h1>
<p><em>2023-10-21, Gaige B. Paulsen</em></p>
<p>I've been very happy using <a href="https://www.mend.io/renovate/">Renovate</a>
(the free version) for use on my personal projects. I've previously
discussed running it on one of my k8s clusters.</p>
<p>Today, I was trying to deal with a very specific problem: I needed
to track a dependency via git tags, instead of tracking the head
of the main branch.</p>
<p>Originally, I expected I'd be able to just set the branch in the
<code>.gitmodules</code> file and it'd do "the right thing." Turns out,
not so much.</p>
<p>I tried a number of ways to leverage the default configuration, but
couldn't get that working. So, I decided to take matters into my
own hands and use a custom config.</p>
<p>Since my usual config enables <code>git-submodules</code>, I need to disable it
for my custom manager to work.</p>
<div class="codehilite"><pre><span></span><code><span class="w"> </span><span class="nt">"git-submodules"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">"enabled"</span><span class="p">:</span><span class="w"> </span><span class="kc">false</span><span class="w"> </span><span class="p">},</span>
<span class="w"> </span><span class="nt">"customManagers"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"customType"</span><span class="p">:</span><span class="w"> </span><span class="s2">"regex"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"fileMatch"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="w"> </span><span class="s2">"(^|/)\\.gitmodules$"</span><span class="w"> </span><span class="p">],</span>
<span class="w"> </span><span class="nt">"datasourceTemplate"</span><span class="p">:</span><span class="w"> </span><span class="s2">"git-tags"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"matchStrings"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="s2">"url = (?<depName>.*?)#(?<currentValue>.*?)\\s"</span>
<span class="w"> </span><span class="p">],</span>
<span class="w"> </span><span class="nt">"versioningTemplate"</span><span class="p">:</span><span class="w"> </span><span class="s2">"semver"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"depTypeTemplate"</span><span class="p">:</span><span class="w"> </span><span class="s2">"dependencies"</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">]</span>
</code></pre></div>
<p>For those who haven't dealt with the <code>customManagers</code> before, they're
very powerful. Basically, you can use RegEx to extract data from the
file and describe exactly which <code>datasource</code>, <code>versioning</code> and more
you want applied.</p>
<p>In this case, I'm pulling the <code>url</code> and looking for a <code>#</code> to indicate
the tag. Originally, I used the standard <code>branch</code> mechanism, but for some
reason the first application of the <code>branch</code> version left the URL with
the <code>#</code> marker appended.</p>
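<p>For illustration, a hypothetical <code>.gitmodules</code> entry that the regex
above would match (the <code>#v1.2.3</code> suffix on the URL carries the tag
Renovate tracks):</p>
<div class="codehilite"><pre><code>[submodule "libfoo"]
    path = vendor/libfoo
    url = https://github.com/example/libfoo.git#v1.2.3
</code></pre></div>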
<p>Previously, I'd used this in some of my docker files, based on the
<code>regexManager</code>, which honestly is basically the same thing. I'm not sure
why there are two ways to do this, nor why <em>both</em> are in the documentation,
but the other place I've done this is:</p>
<div class="codehilite"><pre><span></span><code><span class="w"> </span><span class="nt">"regexManagers"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"fileMatch"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"(^|/)Chart\\.yaml$"</span><span class="p">],</span>
<span class="w"> </span><span class="nt">"matchStrings"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="s2">"#\\s?renovate: image=(?<depName>.*?)\\s?appVersion:\\s?\\\"?(?<currentValue>[\\w+\\.\\-]*)"</span>
<span class="w"> </span><span class="p">],</span>
<span class="w"> </span><span class="nt">"datasourceTemplate"</span><span class="p">:</span><span class="w"> </span><span class="s2">"docker"</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">]</span>
</code></pre></div>
<p>And I also used it in
<a href="https://www.gaige.net/renovating-ansible.html">Renovating Ansible</a> where I used
it to update based on a gitlab tag, but in an explicit manner, not
as part of a git submodule.</p>
<p>The key difference here appears to be the manager name (and differences in
how I ran the tests).</p>
<h1>Sonoma Arq warning</h1>
<p><em>2023-10-15, Gaige B. Paulsen</em></p>
<p>After upgrading to Sonoma, I started occasionally (and then
repeatedly) noticing warning messages and errors related to
cloud files on my laptop and desktop machines, in the area
managed by iCloud. The specific files aren't important, although
they seem to be related to applications (mostly on the phone) that
have transient caches.</p>
<p>However, to stop the myriad warnings/errors, I found a new setting
in Arq's backup options for:</p>
<blockquote>
<p>When a dataless ('cloud-only') file is encountered:</p>
</blockquote>
<p>With three settings:</p>
<ul>
<li>Materialize: pull a copy to the system before continuing (Arq
warns this can be time-consuming and data-intensive)</li>
<li>Report an error: this is the default and puts an error in your
logs (and sends an email if you have emails sent on errors)</li>
<li>Ignore: just ignore this file and don't back it up</li>
</ul>
<p>At home, where time and bandwidth aren't as important as a clean backup,
I selected <code>Materialize</code> and on my laptop, I selected <code>Ignore</code>, since
I don't want to slow things down on the road. The most important items to
back up from the road are the ones I'm creating or editing there.</p>
<p>With those changes in place, things seem to be working fine.</p>
<h1>Booting Dell servers over SMB</h1>
<p><em>2023-10-13, Gaige B. Paulsen</em></p>
<p>The first time I did this I didn't document it very well, making the next
time more time-consuming, so here's the rundown.</p>
<p>It's not a secret that we use some older Dell hardware as servers in our
datacenter. We've been pretty happy with it since switching away from HP
and one of the reasons is that the iDRAC system seems much more stable,
useful, and featureful than the HP counterpart. (Oh, and HP puts their
firmware updates behind a paywall, which is not desirable.)</p>
<p>Setting up a server to boot over SMB using an ISO CD/DVD image is relatively
trivial, but does require a bit of preparation. You can also boot over NFS,
but I've found it a bit less reliable in the past, and honestly NFS is
more painful to administer if you're not actively using it than
<a href="https://www.samba.org">samba</a>.</p>
<h2>Setting up the Samba SMB Server</h2>
<ol>
<li>
<p>Install <code>samba</code> (your OS may vary here)</p>
</li>
<li>
<p>Create an appropriate user (<code>dell</code> in my case)</p>
</li>
<li>
<p>Configure the share. I use a very simple share from <code>/tmp</code>:</p>
<div class="codehilite"><pre><span></span><code><span class="k">[tmp]</span>
<span class="w"> </span><span class="na">comment</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">Temporary file source</span>
<span class="w"> </span><span class="na">path</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">/tmp/</span>
<span class="w"> </span><span class="na">read only</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">yes</span>
<span class="w"> </span><span class="na">public</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">yes</span>
</code></pre></div>
</li>
<li>
<p>Configure the user password (user <code>dell</code> in this case)</p>
<div class="codehilite"><pre><span></span><code>smbpasswd<span class="w"> </span>-a<span class="w"> </span>dell
</code></pre></div>
<p>enter and confirm the password as prompted</p>
</li>
<li>
<p>(Re-)start the SMB daemons (example on SmartOS; note the second
service is <code>nmbd</code>, the NetBIOS name daemon):</p>
<div class="codehilite"><pre><code>svcadm enable samba:smbd
svcadm enable samba:nmbd
</code></pre></div>
</li>
<li>
<p>Place your files in <code>/tmp</code> for pickup</p>
</li>
</ol>
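<p>Before heading to the iDRAC, it's worth a quick sanity check that the
share is reachable (this assumes <code>smbclient</code> is installed; substitute
your server's address):</p>
<div class="codehilite"><pre><code># hypothetical check: list the share contents as the dell user
smbclient //192.0.2.10/tmp -U dell -c 'ls'
</code></pre></div>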
<h2>Booting on the Dell</h2>
<p><strong>Note:</strong> you need to have an enterprise license to boot like this.</p>
<ol>
<li>
<p>Log in to the iDRAC</p>
</li>
<li>
<p>Navigate to the <strong>Server</strong> view</p>
</li>
<li>
<p>Click on the <strong>Attached Media</strong> tab</p>
</li>
<li>
<p>Fill in the following:</p>
<table>
<thead>
<tr>
<th>Field</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Image File Path</td>
<td><code>//</code><em>IP or domain</em><code>/</code><em>mount point</em><code>/</code><em>file name</em></td>
</tr>
<tr>
<td>Domain name</td>
<td><em>blank</em></td>
</tr>
<tr>
<td>User Name</td>
<td><em>user name</em></td>
</tr>
<tr>
<td>Password</td>
<td><em>user password</em></td>
</tr>
<tr>
<td>Expired or invalid certificate action</td>
<td><code>Ignore</code></td>
</tr>
</tbody>
</table>
</li>
<li>
<p>Click <strong>Connect</strong></p>
<p>After a few seconds, you should get a confirmation of the connection;
<em>Connection Status</em> will read <em>Connected</em></p>
</li>
<li>
<p>Go to the <strong>Virtual Console</strong> and make sure your boot sequence checks
for CD/DVDs and then <strong>Reboot</strong></p>
</li>
</ol>
<h1>Exploring distroless images</h1>
<p><em>2023-10-13, Gaige B. Paulsen</em></p>
<p>Distroless images are all the rage in the container space these
days due to the reduced attack surface. This is great and also
results in much thinner images. But, when an image isn't behaving
it can cause some additional trouble as you try to figure out what
may be missing or broken without the ability to access the image.</p>
<ol>
<li>Pull the image (if not already present)</li>
<li>Run a container (this mounts the image to create the filesystem)</li>
<li>Export the image contents<div class="codehilite"><pre><span></span><code>docker<span class="w"> </span><span class="nb">export</span><span class="w"> </span>hungry_mcnulty<span class="w"> </span>>contents.tar
</code></pre></div>
</li>
</ol>
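<p>Putting that together (a sketch; the image name is an example, and I'm
substituting <code>docker create</code> for the run step, since it materializes
the filesystem without starting a container that may have no shell):</p>
<div class="codehilite"><pre><code>docker pull gcr.io/distroless/static-debian12
# the trailing command is never run; create just needs one to exist
docker create --name probe gcr.io/distroless/static-debian12 true
docker export probe > contents.tar
tar tvf contents.tar | less
</code></pre></div>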
<p>This will provide the contents of the image and the container, so it's
good for debugging.</p>
<h2>Image-only solutions</h2>
<p>If you just want to explore the layers and files in the image, you
may find tools like <a href="https://github.com/wagoodman/dive">dive</a>
(available on the Mac through <a href="https://brew.sh">brew</a>) an appealing
solution... well, if you like a UI in the terminal.</p>
<h1>Flask and vault</h1>
<p><em>2023-10-09, Gaige B. Paulsen</em></p>
<p>When using dynamic database credentials with Flask, we need to make
sure that the flask instance picks up the right credentials, renews
them when necessary, and uses the right roles.</p>
<p>My flask code is pretty intertwined with the database changes here, so
pardon the dust, but I think it's relatively easy to follow.</p>
<p>Configuration parameters are either from the config file or they are
taken from environment variables.</p>
<table>
<thead>
<tr>
<th>Parameter</th>
<th>Required</th>
<th>Purpose</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td>VAULT_ROLE</td>
<td>✓</td>
<td>dynamic database role to use</td>
<td>None</td>
</tr>
<tr>
<td>DB_ROLE</td>
<td></td>
<td>role to assume in connection</td>
<td>None</td>
</tr>
<tr>
<td>SQLALCHEMY_DATABASE_URI</td>
<td>✓</td>
<td>URI for database</td>
<td><em>no default</em></td>
</tr>
</tbody>
</table>
<p>The application below is named <code>telemetry_ingest</code> and uses
<code>TELEMETRY_INGEST</code> as the prefix for any environment variables
that are used for configuration. This is mostly interesting if you
are going to adapt this code elsewhere, since you need to remember
to pull those out.</p>
<p>Vault use is triggered by the presence of the <code>VAULT_ROLE</code> parameter,
since the vault credentials may or may not be necessary depending
on the environment. If they are present in the config, this code will
push them to the libraries, otherwise they'll come as <code>None</code> and
<code>hvac</code> will use its defaults from the environment or statically.</p>
<p>Authentication data is stored in the <code>auth</code> global in this module and
is initialized when the application starts. The logic to get and renew
the authentication data is in <code>get_vault_credentials()</code>.</p>
<p>Of particular interest is the event handling at the bottom in the
<code>with app.app_context()</code> stanza. This adds event handlers for
<code>do_connect</code> (called before the connection, so we can load the
credentials), <code>checkout</code> (called when a connection is "checked out" to
do something, where we verify the connection), and <code>connect</code>
(where we set the database role if requested). Finally, the standard
configuration is done, registering the blueprint for the actions.</p>
<div class="codehilite"><pre><span></span><code><span class="kn">import</span> <span class="nn">datetime</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Optional</span>
<span class="kn">import</span> <span class="nn">hvac</span>
<span class="kn">from</span> <span class="nn">flask</span> <span class="kn">import</span> <span class="n">Flask</span>
<span class="kn">from</span> <span class="nn">sqlalchemy</span> <span class="kn">import</span> <span class="n">event</span>
<span class="kn">from</span> <span class="nn">sqlalchemy.exc</span> <span class="kn">import</span> <span class="n">DisconnectionError</span>
<span class="n">auth</span> <span class="o">=</span> <span class="p">{}</span>
<span class="k">def</span> <span class="nf">create_app</span><span class="p">(</span><span class="n">test_config</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
<span class="c1"># create database</span>
<span class="c1"># create and configure the app</span>
<span class="n">app</span> <span class="o">=</span> <span class="n">Flask</span><span class="p">(</span><span class="vm">__name__</span><span class="p">)</span>
<span class="n">app</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">from_object</span><span class="p">(</span><span class="s2">"telemetry_ingest.default_settings"</span><span class="p">)</span>
<span class="k">if</span> <span class="n">test_config</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
<span class="c1"># If we want to read from py file instead of prefixed variables</span>
<span class="c1"># if os.environ['TELEMETRY_INGEST_SETTINGS']:</span>
<span class="c1"># app.config.from_envvar('TELEMETRY_INGEST_SETTINGS')</span>
<span class="n">app</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">from_prefixed_env</span><span class="p">(</span><span class="s2">"TELEMETRY_INGEST"</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="c1"># load the test config if passed in</span>
<span class="n">app</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">from_mapping</span><span class="p">(</span><span class="n">test_config</span><span class="p">)</span>
<span class="kn">from</span> <span class="nn">telemetry_ingest.models</span> <span class="kn">import</span> <span class="n">db</span>
<span class="n">db</span><span class="o">.</span><span class="n">init_app</span><span class="p">(</span><span class="n">app</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">use_vault</span><span class="p">()</span> <span class="o">-></span> <span class="nb">bool</span><span class="p">:</span>
<span class="k">if</span> <span class="n">requested_credential</span><span class="p">()</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
<span class="k">return</span> <span class="kc">False</span>
<span class="k">return</span> <span class="kc">True</span>
<span class="k">def</span> <span class="nf">requested_role</span><span class="p">()</span> <span class="o">-></span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
<span class="k">if</span> <span class="s2">"DB_ROLE"</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">app</span><span class="o">.</span><span class="n">config</span><span class="p">:</span>
<span class="k">return</span> <span class="kc">None</span>
<span class="k">return</span> <span class="n">app</span><span class="o">.</span><span class="n">config</span><span class="p">[</span><span class="s2">"DB_ROLE"</span><span class="p">]</span>
<span class="k">def</span> <span class="nf">requested_credential</span><span class="p">()</span> <span class="o">-></span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
<span class="k">if</span> <span class="s2">"VAULT_ROLE"</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">app</span><span class="o">.</span><span class="n">config</span><span class="p">:</span>
<span class="k">return</span> <span class="kc">None</span>
<span class="k">return</span> <span class="n">app</span><span class="o">.</span><span class="n">config</span><span class="p">[</span><span class="s2">"VAULT_ROLE"</span><span class="p">]</span>
<span class="k">def</span> <span class="nf">get_vault_credentials</span><span class="p">(</span><span class="n">existing</span><span class="o">=</span><span class="kc">None</span><span class="p">):</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">use_vault</span><span class="p">():</span>
<span class="k">return</span> <span class="kc">None</span>
<span class="n">client</span> <span class="o">=</span> <span class="n">hvac</span><span class="o">.</span><span class="n">Client</span><span class="p">(</span>
<span class="n">url</span><span class="o">=</span><span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s2">"VAULT_ADDR"</span><span class="p">],</span> <span class="n">token</span><span class="o">=</span><span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="p">[</span><span class="s2">"VAULT_TOKEN"</span><span class="p">]</span>
<span class="p">)</span>
<span class="k">assert</span> <span class="n">client</span><span class="o">.</span><span class="n">is_authenticated</span><span class="p">()</span>
<span class="k">if</span> <span class="n">existing</span> <span class="ow">is</span> <span class="ow">not</span> <span class="kc">None</span><span class="p">:</span>
<span class="k">if</span> <span class="p">(</span>
<span class="n">existing</span><span class="p">[</span><span class="s2">"response"</span><span class="p">][</span><span class="s2">"renewable"</span><span class="p">]</span>
<span class="ow">and</span> <span class="n">datetime</span><span class="o">.</span><span class="n">datetime</span><span class="o">.</span><span class="n">now</span><span class="p">()</span> <span class="o"><</span> <span class="n">existing</span><span class="p">[</span><span class="s2">"vault_expire"</span><span class="p">]</span>
<span class="p">):</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">renew_response</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">sys</span><span class="o">.</span><span class="n">renew_lease</span><span class="p">(</span><span class="n">existing</span><span class="p">[</span><span class="s2">"vault_lease_id"</span><span class="p">])</span>
<span class="n">new_auth</span> <span class="o">=</span> <span class="n">existing</span>
<span class="n">new_auth</span><span class="p">[</span>
<span class="s2">"vault_expire"</span>
<span class="p">]</span> <span class="o">=</span> <span class="n">datetime</span><span class="o">.</span><span class="n">datetime</span><span class="o">.</span><span class="n">now</span><span class="p">()</span> <span class="o">+</span> <span class="n">datetime</span><span class="o">.</span><span class="n">timedelta</span><span class="p">(</span>
<span class="n">seconds</span><span class="o">=</span><span class="n">renew_response</span><span class="p">[</span><span class="s2">"lease_duration"</span><span class="p">]</span>
<span class="p">)</span>
<span class="n">new_auth</span><span class="p">[</span>
<span class="s2">"vault_renew"</span>
<span class="p">]</span> <span class="o">=</span> <span class="n">datetime</span><span class="o">.</span><span class="n">datetime</span><span class="o">.</span><span class="n">now</span><span class="p">()</span> <span class="o">+</span> <span class="n">datetime</span><span class="o">.</span><span class="n">timedelta</span><span class="p">(</span>
<span class="n">seconds</span><span class="o">=</span><span class="n">renew_response</span><span class="p">[</span><span class="s2">"lease_duration"</span><span class="p">]</span> <span class="o">/</span> <span class="mi">2</span>
<span class="p">)</span>
<span class="k">return</span> <span class="n">new_auth</span>
<span class="k">except</span> <span class="n">hvac</span><span class="o">.</span><span class="n">v1</span><span class="o">.</span><span class="n">exceptions</span><span class="o">.</span><span class="n">VaultError</span><span class="p">:</span>
<span class="n">app</span><span class="o">.</span><span class="n">logger</span><span class="o">.</span><span class="n">debug</span><span class="p">(</span><span class="s2">"lease renewal failed"</span><span class="p">)</span>
<span class="k">pass</span>
<span class="n">read_response</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">secrets</span><span class="o">.</span><span class="n">database</span><span class="o">.</span><span class="n">generate_credentials</span><span class="p">(</span>
<span class="n">requested_credential</span><span class="p">()</span>
<span class="p">)</span>
<span class="n">app</span><span class="o">.</span><span class="n">logger</span><span class="o">.</span><span class="n">debug</span><span class="p">(</span><span class="s2">"new lease"</span><span class="p">)</span>
<span class="n">new_auth</span> <span class="o">=</span> <span class="p">{</span>
<span class="s2">"user"</span><span class="p">:</span> <span class="n">read_response</span><span class="p">[</span><span class="s2">"data"</span><span class="p">][</span><span class="s2">"username"</span><span class="p">],</span>
<span class="s2">"password"</span><span class="p">:</span> <span class="n">read_response</span><span class="p">[</span><span class="s2">"data"</span><span class="p">][</span><span class="s2">"password"</span><span class="p">],</span>
<span class="s2">"vault_lease_id"</span><span class="p">:</span> <span class="n">read_response</span><span class="p">[</span><span class="s2">"lease_id"</span><span class="p">],</span>
<span class="s2">"vault_expire"</span><span class="p">:</span> <span class="n">datetime</span><span class="o">.</span><span class="n">datetime</span><span class="o">.</span><span class="n">now</span><span class="p">()</span>
<span class="o">+</span> <span class="n">datetime</span><span class="o">.</span><span class="n">timedelta</span><span class="p">(</span><span class="n">seconds</span><span class="o">=</span><span class="n">read_response</span><span class="p">[</span><span class="s2">"lease_duration"</span><span class="p">]),</span>
<span class="s2">"vault_renew"</span><span class="p">:</span> <span class="n">datetime</span><span class="o">.</span><span class="n">datetime</span><span class="o">.</span><span class="n">now</span><span class="p">()</span>
<span class="o">+</span> <span class="n">datetime</span><span class="o">.</span><span class="n">timedelta</span><span class="p">(</span><span class="n">seconds</span><span class="o">=</span><span class="n">read_response</span><span class="p">[</span><span class="s2">"lease_duration"</span><span class="p">]</span> <span class="o">/</span> <span class="mi">2</span><span class="p">),</span>
<span class="s2">"response"</span><span class="p">:</span> <span class="n">read_response</span><span class="p">,</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">new_auth</span>
<span class="k">global</span> <span class="n">auth</span>
<span class="n">auth</span> <span class="o">=</span> <span class="n">get_vault_credentials</span><span class="p">()</span>
<span class="k">with</span> <span class="n">app</span><span class="o">.</span><span class="n">app_context</span><span class="p">():</span>
<span class="c1"># https://docs.sqlalchemy.org/en/20/core/engines.html#custom-dbapi-args</span>
<span class="nd">@event</span><span class="o">.</span><span class="n">listens_for</span><span class="p">(</span><span class="n">db</span><span class="o">.</span><span class="n">engine</span><span class="p">,</span> <span class="s2">"do_connect"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">provide_credentials</span><span class="p">(</span><span class="n">dialect</span><span class="p">,</span> <span class="n">conn_rec</span><span class="p">,</span> <span class="n">cargs</span><span class="p">,</span> <span class="n">cparams</span><span class="p">):</span>
<span class="k">if</span> <span class="n">use_vault</span><span class="p">():</span>
<span class="k">global</span> <span class="n">auth</span>
<span class="n">cparams</span><span class="p">[</span><span class="s2">"user"</span><span class="p">]</span> <span class="o">=</span> <span class="n">auth</span><span class="p">[</span><span class="s2">"user"</span><span class="p">]</span>
<span class="n">cparams</span><span class="p">[</span><span class="s2">"password"</span><span class="p">]</span> <span class="o">=</span> <span class="n">auth</span><span class="p">[</span><span class="s2">"password"</span><span class="p">]</span>
<span class="nd">@event</span><span class="o">.</span><span class="n">listens_for</span><span class="p">(</span><span class="n">db</span><span class="o">.</span><span class="n">engine</span><span class="p">,</span> <span class="s2">"checkout"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">validate_checkout</span><span class="p">(</span><span class="n">dbapi_connection</span><span class="p">,</span> <span class="n">connection_record</span><span class="p">,</span> <span class="n">connection_proxy</span><span class="p">):</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">use_vault</span><span class="p">():</span>
<span class="k">return</span>
<span class="k">global</span> <span class="n">auth</span>
<span class="k">if</span> <span class="n">datetime</span><span class="o">.</span><span class="n">datetime</span><span class="o">.</span><span class="n">now</span><span class="p">()</span> <span class="o">></span> <span class="n">auth</span><span class="p">[</span><span class="s2">"vault_renew"</span><span class="p">]:</span>
<span class="n">app</span><span class="o">.</span><span class="n">logger</span><span class="o">.</span><span class="n">debug</span><span class="p">(</span><span class="s2">"credentials expired"</span><span class="p">)</span>
<span class="n">auth</span> <span class="o">=</span> <span class="n">get_vault_credentials</span><span class="p">(</span><span class="n">auth</span><span class="p">)</span>
<span class="k">raise</span> <span class="n">DisconnectionError</span><span class="p">()</span>
<span class="nd">@event</span><span class="o">.</span><span class="n">listens_for</span><span class="p">(</span><span class="n">db</span><span class="o">.</span><span class="n">engine</span><span class="p">,</span> <span class="s2">"connect"</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">set_role_on_connect</span><span class="p">(</span><span class="n">dbapi_connection</span><span class="p">,</span> <span class="n">connection_record</span><span class="p">):</span>
<span class="k">if</span> <span class="n">requested_role</span><span class="p">()</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
<span class="k">return</span>
<span class="k">with</span> <span class="n">dbapi_connection</span><span class="o">.</span><span class="n">cursor</span><span class="p">()</span> <span class="k">as</span> <span class="n">cursor</span><span class="p">:</span>
<span class="n">cursor</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="s2">"SET ROLE '"</span> <span class="o">+</span> <span class="n">requested_role</span><span class="p">()</span> <span class="o">+</span> <span class="s2">"'"</span><span class="p">)</span>
<span class="kn">from</span> <span class="nn">telemetry_ingest.routes</span> <span class="kn">import</span> <span class="n">telemetry</span><span class="p">,</span> <span class="n">redirection</span>
<span class="n">app</span><span class="o">.</span><span class="n">register_blueprint</span><span class="p">(</span><span class="n">telemetry</span><span class="p">)</span>
<span class="n">app</span><span class="o">.</span><span class="n">register_blueprint</span><span class="p">(</span><span class="n">redirection</span><span class="p">)</span>
<span class="n">db</span><span class="o">.</span><span class="n">create_all</span><span class="p">()</span>
<span class="k">return</span> <span class="n">app</span>
</code></pre></div>
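<p>Running this locally then comes down to supplying the environment the
code reads; a hypothetical invocation (the names and values are examples
only):</p>
<div class="codehilite"><pre><code># VAULT_ADDR/VAULT_TOKEN are read directly; the rest use the app's prefix
export VAULT_ADDR='https://vault.example.net:8200'
export VAULT_TOKEN="$(vault print token)"
export TELEMETRY_INGEST_VAULT_ROLE='telemetry-rw'
export TELEMETRY_INGEST_SQLALCHEMY_DATABASE_URI='postgresql://127.0.0.1:5432/telemetry'
flask --app telemetry_ingest run
</code></pre></div>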
<h1>Vault local testing setup</h1>
<p><em>2023-10-09, Gaige B. Paulsen</em></p>
<p>When I was confirming the configurations for my vault management of
database credentials, I used a local postgresql and vault server. This
may also be useful for development (especially testing code that
may exercise the vault and database interactions).</p>
<p>This can make it relatively easy to watch all of the pieces and throw
away any side-effects that you don't want in your production or
staging servers.</p>
<ol>
<li>
<p>If you're using vault elsewhere in your system, make sure you clear
out your credentials from any cache while doing this work or you
may accidentally modify a running system.</p>
</li>
<li>
<p>Make sure you have a running postgresql server</p>
</li>
<li>
<p>Start a vault development server: <code>vault server -dev</code>, which will
automatically unseal itself and run in memory, so there's no need
to clean up after yourself (although you'll have to start the whole
process again if you kill the server)</p>
</li>
<li>
<p>Create a new admin user in your local database. I use a separate user
for vault so that I don't run into a problem in break-glass scenarios, but
you can also do that with local user override. I use <code>local-vault</code>,
and for the sake of this example, set the password to <code>changeme</code>.</p>
</li>
<li>
<p><code>vault secrets enable database</code> to turn on the database secret manager</p>
</li>
<li>
<p>Create the database config:</p>
<div class="codehilite"><pre><span></span><code>vault<span class="w"> </span>write<span class="w"> </span>database/config/localpg<span class="w"> </span><span class="se">\</span>
<span class="w"> </span><span class="nv">plugin_name</span><span class="o">=</span>postgresql-database-plugin<span class="w"> </span><span class="se">\</span>
<span class="w"> </span><span class="nv">allowed_roles</span><span class="o">=</span><span class="s2">"*"</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span><span class="nv">connection_url</span><span class="o">=</span><span class="s2">"postgresql://{{username}}:{{password}}@127.0.0.1:5432/postgres?sslmode=disable"</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span><span class="nv">username</span><span class="o">=</span><span class="s2">"local-vault"</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span><span class="nv">password</span><span class="o">=</span><span class="s2">"changeme"</span>
</code></pre></div>
<p>This command will create the record for <code>localpg</code>, setting it up to manage
the <code>local-vault</code> credentials on the local postgres server</p>
</li>
<li>
<p>Force rotation of the password (this will make it so that you don't know
the root-standin password):</p>
<div class="codehilite"><pre><span></span><code>vault<span class="w"> </span>write<span class="w"> </span>-force<span class="w"> </span>database/rotate-root/localpg
</code></pre></div>
</li>
<li>
<p>Now, for each of the user types we want vault to be able to issue for this
database, we need to create roles using the parameters we determined previously.</p>
<p>For the ownership user, this is:</p>
<div class="codehilite"><pre><span></span><code>vault<span class="w"> </span>write<span class="w"> </span>database/roles/owner-test<span class="w"> </span><span class="se">\</span>
<span class="w"> </span><span class="nv">db_name</span><span class="o">=</span>localpg<span class="w"> </span><span class="se">\</span>
<span class="w"> </span><span class="nv">creation_statements</span><span class="o">=</span><span class="s2">"CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}' IN ROLE \"owner\" NOINHERIT;"</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span><span class="nv">revocation_statements</span><span class="o">=</span><span class="s2">"drop user \"{{name}}\";"</span><span class="w"> </span><span class="se">\</span>
<span class="w"> </span><span class="nv">default_ttl</span><span class="o">=</span>60s<span class="w"> </span><span class="se">\</span>
<span class="w"> </span><span class="nv">max_ttl</span><span class="o">=</span>60m
</code></pre></div>
<p>The braced variables are replaced by vault, so this will result in a user
with a name, password, and expiration chosen by vault, which has access
to the abilities of the <code>owner</code> role, but only if it is explicitly assumed
using <code>SET ROLE owner</code>.</p>
<p><strong>NOTE</strong>: for real use, you want longer TTLs than 60s and 60m, but these
were used here because it is a test environment and we want to verify both
the renewal and expiration of the credentials.</p>
</li>
<li>
<p>Now, you should be able to retrieve the credentials from vault manually
(a programmatic sketch follows this list):</p>
<div class="codehilite"><pre><span></span><code>vault<span class="w"> </span><span class="nb">read</span><span class="w"> </span>database/creds/owner-test
</code></pre></div>
<p>Once you've created the credential, you should be able to access it and
also see the credentials with <code>\du</code> in the <code>psql</code> interface. You should
see expiration times for the credentials you create, and they should
automatically disappear after that time.</p>
</li>
</ol>
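<p>If you'd rather exercise the same flow from code instead of the CLI, here's a
minimal sketch using the <code>hvac</code> client (the same call my Django-and-vault
post uses); the address and token are the dev-server defaults, and
<code>owner-test</code> is the role created above:</p>
<div class="codehilite"><pre><code>import hvac

# dev-server address/token are assumptions; substitute your own values
client = hvac.Client(url="http://127.0.0.1:8200", token="hvs.dev-only-token")

creds = client.secrets.database.generate_credentials(
    "owner-test", mount_point="database"
)
print(creds["data"]["username"], creds["data"]["password"])
print("lease:", creds["lease_id"], "ttl:", creds["lease_duration"])
</code></pre></div>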
Postgres roles and privileges2023-10-09T08:21:00-04:002023-10-09T08:21:00-04:00Gaige B. Paulsentag:www.gaige.net,2023-10-09:/postgres-roles-and-privileges.html<p>This is part of a multi-part series on using postgres databases, vault,
and a variety of other tools to effect short-lived database credentials
for real use.</p>
<p>As postgres uses user and role interchangeably, so will I, although I'll
generally try to use <em>user</em> to refer to a role with login …</p><p>This is part of a multi-part series on using postgres databases, vault,
and a variety of other tools to effect short-lived database credentials
for real use.</p>
<p>As postgres uses user and role interchangeably, so will I, although I'll
generally try to use <em>user</em> to refer to a role with login permissions.</p>
<h2>Postgres roles and privileges</h2>
<p>There are some really interesting and powerful capabilities for managing roles
in postgres, and as I've become more familiar with them, I understand why so many
of us are in the habit of granting overly-broad privileges in the database realm.</p>
<p>A few important things to know about roles and privileges:</p>
<ul>
<li>
<p>The owner of a new object is the user/role that created that object.
As such, any default privileges for objects owned by that user must
either be assigned by that user, or assigned on behalf of the user.
Be careful and consistent about which role creates tables and sequences
so that you avoid ownership problems. Only the owner is allowed to alter tables
and sequences.</p>
</li>
<li>
<p>By default, existing roles are granted no privileges on new objects.
In order to ensure that new tables, sequences, etc. are usable by the
roles that we are creating, privileges must either be granted explicitly each
time a new object is created, or you must set an appropriate default
privilege for the <code>public</code> role or an explicit role. <strong>NOTE</strong>: default
privileges for new objects are owner-dictated, and thus the
defaults must be granted by the owner of the table, or on behalf of
the owner of the table, or they won't work as desired.</p>
</li>
<li>
<p>There are rules about <a href="https://www.postgresql.org/docs/current/role-removal.html">dropping roles</a>
that may not be obvious. A role that owns objects cannot be dropped until those
objects are removed or their ownership is given to another role. Additionally, any
privileges granted to the role need to be revoked before dropping the role.</p>
</li>
<li>
<p>By default, all users have <code>create</code> privileges in the <code>public</code> schema for
each database. This is <em>not</em> necessary for the creation of temporary tables,
but generally makes it easier for all new users to work with a database.
When designing your access controls, it probably makes sense to revoke all
unnecessary privileges from public
(<code>revoke create on schema public from public</code>).</p>
</li>
<li>
<p>A role's privileges will be inherited by any member granted that role
unless inheritance is disabled, either by creating the member with
<code>noinherit</code> or by marking the grant itself. Inheritance may make sense for
privileges that don't confer ownership, but if you are trying to ensure that
you don't have unexpectedly-owned tables, you will want membership in your
ownership roles to be <code>noinherit</code>.</p>
</li>
<li>
<p>It's worth noting that non-owning users with the <code>createrole</code> permission
do not have the ability to directly set their roles to be any arbitrary
role on the system. However, I've found nothing to prevent a user with
<code>createrole</code> permission from granting itself any role on the system except
superuser. Keep in mind that these role-granting permissions are potent.</p>
</li>
</ul>
<h2>Role design</h2>
<p>Designing an appropriate role structure is complicated and this is not intended
to be one-size-fits-all in the least. However, I hope it serves as a useful
starting-off point. This design expects to be used with dynamic roles through
vault, and I'll detail that at the end.</p>
<p>The recommended "group" roles are:</p>
<table>
<thead>
<tr>
<th>role</th>
<th>purpose</th>
<th>privileges</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>postgres</code></td>
<td>superuser role (maybe break glass)</td>
<td><code>SUPERUSER</code>, <code>LOGIN</code></td>
</tr>
<tr>
<td><code>vault-admin</code></td>
<td>used by vault to manage dynamic users</td>
<td><code>CREATEROLE</code>, <code>LOGIN</code></td>
</tr>
<tr>
<td><code>owner</code></td>
<td>create/modify tables</td>
<td><code>ALL on SCHEMA public</code> and <code>ALL</code> on tables and sequences</td>
</tr>
<tr>
<td><code>readwrite</code></td>
<td>read/write to tables</td>
<td><code>USAGE on SCHEMA public</code> and <code>ALL</code> on tables and sequences</td>
</tr>
<tr>
<td><code>readonly</code></td>
<td>read-only on tables</td>
<td><code>USAGE on SCHEMA public</code> and <code>SELECT</code> on tables and <code>SELECT</code> on sequences</td>
</tr>
</tbody>
</table>
<p>NOTE: The <code>readwrite</code> user is granted <code>ALL</code> on tables and sequences.
This may be a bit more than necessary; consider whether you want to grant
<code>REFERENCES</code> and <code>TRIGGER</code> on tables, since these are frequently unnecessary.
On sequences, <code>USAGE</code> is a separate privilege, but there is little reason to
grant <code>UPDATE</code> without <code>USAGE</code>, since <code>USAGE</code> alone already covers the
common <code>nextval</code> case.</p>
<p>When creating the dynamic roles (assuming use with vault),
there will be three role templates used:</p>
<table>
<thead>
<tr>
<th>role</th>
<th>creation statement</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>owner</code></td>
<td><code>CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}' IN ROLE \"owner\" NOINHERIT;</code></td>
</tr>
<tr>
<td><code>readwrite</code></td>
<td><code>CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}' IN ROLE \"readwrite\" INHERIT;</code></td>
</tr>
<tr>
<td><code>readonly</code></td>
<td><code>CREATE ROLE \"{{name}}\" WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}' IN ROLE \"readonly\" INHERIT;</code></td>
</tr>
</tbody>
</table>
<p>The creation statements are all very similar, except for the specifics of
which role is assigned and if they are inherited.</p>
<p>In my case, I'm choosing to use <code>inherit</code> for the non-ownership roles
because it removes the need to <code>set role</code>. However, it's not necessary,
and unless you need the (temporary) user name, explicitly setting the role is
probably preferred.</p>
<p>This structure can be used for any application framework, and with or without
a secrets manager like vault, but since my example here started as a vault
example, the creation of these items would be sequenced thusly (a concrete
sketch follows the list):</p>
<ol>
<li>Create the new database</li>
<li>Create the <code>owner</code> role and grant it appropriate privileges</li>
<li>Create the <code>readwrite</code> role and grant it the appropriate privileges
and defaults in the schema on behalf of <code>owner</code></li>
<li>Create the <code>readonly</code> role and grant it the appropriate privileges
and defaults in the schema on behalf of <code>owner</code></li>
</ol>
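<p>As a concrete sketch of those steps (run as a superuser or similarly
privileged role), the SQL might look like the following; the privilege lists
are assumptions you should adapt to your application, and <code>appdb</code> is a
hypothetical database name:</p>
<div class="codehilite"><pre><code>import psycopg  # psycopg 3; adapt the connect call for psycopg2

SETUP = [
    # step 2: the ownership role; members are added NOINHERIT and must SET ROLE
    'CREATE ROLE "owner" NOLOGIN',
    'GRANT ALL ON SCHEMA public TO "owner"',
    # step 3: readwrite, plus defaults so future owner-created objects are usable
    'CREATE ROLE "readwrite" NOLOGIN',
    'GRANT USAGE ON SCHEMA public TO "readwrite"',
    'GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO "readwrite"',
    'ALTER DEFAULT PRIVILEGES FOR ROLE "owner" IN SCHEMA public'
    ' GRANT SELECT, INSERT, UPDATE, DELETE ON TABLES TO "readwrite"',
    'ALTER DEFAULT PRIVILEGES FOR ROLE "owner" IN SCHEMA public'
    ' GRANT USAGE, SELECT, UPDATE ON SEQUENCES TO "readwrite"',
    # step 4: readonly, same pattern with SELECT only
    'CREATE ROLE "readonly" NOLOGIN',
    'GRANT USAGE ON SCHEMA public TO "readonly"',
    'GRANT SELECT ON ALL TABLES IN SCHEMA public TO "readonly"',
    'ALTER DEFAULT PRIVILEGES FOR ROLE "owner" IN SCHEMA public'
    ' GRANT SELECT ON TABLES TO "readonly"',
]

with psycopg.connect("dbname=appdb user=postgres") as conn:
    with conn.cursor() as cur:
        for statement in SETUP:
            cur.execute(statement)
</code></pre></div>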
<p>At this point, you have what you need to create the database admin user,
the reader, and the writer. If you're using vault, then use the
creation statements above. If not, you can use similar statements
manually so that if you have multiple users, you can control access
centrally.</p>
Django and vault2023-10-09T08:20:00-04:002023-10-09T08:20:00-04:00Gaige B. Paulsentag:www.gaige.net,2023-10-09:/django-and-vault.html<p>When using dynamic database credentials with Django, we need to make
sure that the django instance picks up the right credentials, renews
them when necessary, and uses the right roles.</p>
<p>This post includes the background and the necessary code.</p>
<h2>Migration and creation</h2>
<p>Migration and creation provide special problems because of …</p><p>When using dynamic database credentials with Django, we need to make
sure that the django instance picks up the right credentials, renews
them when necessary, and uses the right roles.</p>
<p>This post includes the background and the necessary code.</p>
<h2>Migration and creation</h2>
<p>Migration and creation pose special problems because they modify
database objects. For this, we either need to assume the role
(as mentioned above) that owns the objects, or we need a separate user.</p>
<p>I intend to try using a separate user for migrations in the future, but
for now, I have a single role that the temporary user will assume which
has access to read and write as well as own and maintain
the tables, sequences, etc.</p>
<p>Since the temporary user has the ability to create objects, there
are some <a href="https://www.gaige.net/vaulting-database-credentials.html#ownership-and-roles">ownership issues</a> that will create problems
if new objects end up owned by the temporary user.</p>
<p>I'm using the database option <code>assume_role</code> to assume a permanent
postgresql role after connecting to limit ownership confusion.
Support for assuming roles for a session was
<a href="https://docs.djangoproject.com/en/4.2/ref/databases/#role">added in 4.2</a>
and is effected by adding <code>'OPTIONS': {'assume_role': 'test-owner'}</code> to the
database definition in your configuration.</p>
<h2>Renewing credentials</h2>
<p>To renew the credentials, we're going to need to wrap the database access
so that credentials are retrieved at startup and refreshed when required.</p>
<p>For this, I took inspiration from the
<a href="https://github.com/aws-samples/aws-secrets-manager-credential-rotation-without-container-restart/blob/main/webapp/app/codecompose/db/backends/secretsmanager/mysql/base.py">AWS Samples for secrets manager rotation</a>
which used nearly the same mechanism, but with Secrets Manager instead of Vault.</p>
<p>Effectively, I created a new database backend in <code>my-app/db/backends/postgresql</code>
using the <code>django.db.backends.postgresql</code> as a base and then in the <code>DATABASES</code>
configuration stanza, I referred to <code>my-app.db.backends.postgresql</code> as the
database <code>ENGINE</code>.</p>
<p>In the code below, <code>DatabaseCredentials</code> are used to store the database
credentials while they are live. The credentials are stored by the
<code>DatabaseWrapper</code> in instance storage, retrieving the credentials
at init time and passing the <code>settings_dict</code> from the original
<code>DATABASES</code> block along so that we can pick up any salient information.</p>
<p>There are a number of vault-related parameters, all prefixed with <code>VAULT_</code>:</p>
<table>
<thead>
<tr>
<th>Parameter</th>
<th>Required</th>
<th>Purpose</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>VAULT_ROLE</code></td>
<td>*</td>
<td>vault dynamic role name</td>
<td>None</td>
</tr>
<tr>
<td><code>VAULT_STATIC_ROLE</code></td>
<td>*</td>
<td>vault static role name</td>
<td>None</td>
</tr>
<tr>
<td><code>VAULT_MOUNT_POINT</code></td>
<td></td>
<td>database secret store mount point</td>
<td><code>database</code></td>
</tr>
<tr>
<td><code>VAULT_ADDR</code></td>
<td></td>
<td>URL for the vault</td>
<td>None</td>
</tr>
<tr>
<td><code>VAULT_TOKEN</code></td>
<td></td>
<td>Token for accessing vault</td>
<td>None</td>
</tr>
</tbody>
</table>
<p><code>*</code>: At least one of <code>VAULT_ROLE</code> and <code>VAULT_STATIC_ROLE</code> must be included.</p>
<p>If either <code>VAULT_ADDR</code> or <code>VAULT_TOKEN</code> is empty, the <code>hvac</code> library
will provide its defaults, reading first from the environment and then
using static defaults.</p>
<p>The <code>DatabaseWrapper</code> provides an override for <code>get_new_connection</code>,
adding a set of credentials, renewing them if necessary, and
then forwarding along to the underlying wrapper after the credentials
are replaced.</p>
<div class="codehilite"><pre><span></span><code><span class="kn">import</span> <span class="nn">logging</span>
<span class="kn">import</span> <span class="nn">hvac</span>
<span class="kn">from</span> <span class="nn">django.core.exceptions</span> <span class="kn">import</span> <span class="n">ImproperlyConfigured</span>
<span class="kn">from</span> <span class="nn">django.db</span> <span class="kn">import</span> <span class="n">DEFAULT_DB_ALIAS</span>
<span class="kn">from</span> <span class="nn">django.db.backends.postgresql</span> <span class="kn">import</span> <span class="n">base</span>
<span class="k">try</span><span class="p">:</span>
<span class="k">try</span><span class="p">:</span>
<span class="c1"># noinspection PyPep8Naming</span>
<span class="kn">import</span> <span class="nn">psycopg</span> <span class="k">as</span> <span class="nn">Database</span>
<span class="k">except</span> <span class="ne">ImportError</span><span class="p">:</span>
<span class="c1"># noinspection PyPep8Naming</span>
<span class="kn">import</span> <span class="nn">psycopg2</span> <span class="k">as</span> <span class="nn">Database</span>
<span class="k">except</span> <span class="ne">ImportError</span><span class="p">:</span>
<span class="k">raise</span> <span class="n">ImproperlyConfigured</span><span class="p">(</span><span class="s2">"Error loading psycopg2 or psycopg module"</span><span class="p">)</span>
<span class="n">logger</span> <span class="o">=</span> <span class="n">logging</span><span class="o">.</span><span class="n">getLogger</span><span class="p">(</span><span class="vm">__name__</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">DatabaseCredentials</span><span class="p">:</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">settings_dict</span><span class="p">:</span> <span class="nb">dict</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">creds</span> <span class="o">=</span> <span class="kc">None</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">"init vault credentials"</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">credential_name</span> <span class="o">=</span> <span class="n">settings_dict</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">"VAULT_ROLE"</span><span class="p">,</span> <span class="kc">None</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">static_credential_name</span> <span class="o">=</span> <span class="n">settings_dict</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">"VAULT_STATIC_ROLE"</span><span class="p">,</span> <span class="kc">None</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">mount_point</span> <span class="o">=</span> <span class="n">settings_dict</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">"VAULT_MOUNT_POINT"</span><span class="p">,</span> <span class="s2">"database"</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">vault_url</span> <span class="o">=</span> <span class="n">settings_dict</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">"VAULT_ADDR"</span><span class="p">,</span> <span class="kc">None</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">vault_token</span> <span class="o">=</span> <span class="n">settings_dict</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s2">"VAULT_TOKEN"</span><span class="p">,</span> <span class="kc">None</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">client</span> <span class="o">=</span> <span class="n">hvac</span><span class="o">.</span><span class="n">Client</span><span class="p">(</span><span class="n">url</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">vault_url</span><span class="p">,</span> <span class="n">token</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">vault_token</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">refresh_now</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">get_conn_params_from_vault</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">conn_params</span><span class="p">):</span>
<span class="n">conn_params</span><span class="p">[</span><span class="s2">"user"</span><span class="p">]</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">creds</span><span class="p">[</span><span class="s2">"username"</span><span class="p">]</span>
<span class="n">conn_params</span><span class="p">[</span><span class="s2">"password"</span><span class="p">]</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">creds</span><span class="p">[</span><span class="s2">"password"</span><span class="p">]</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s2">"Getting db creds: user=</span><span class="si">{</span><span class="bp">self</span><span class="o">.</span><span class="n">creds</span><span class="p">[</span><span class="s1">'username'</span><span class="p">]</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="k">return</span>
<span class="k">def</span> <span class="nf">refresh_now</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s2">"refreshing credentials for </span><span class="si">{</span><span class="bp">self</span><span class="o">.</span><span class="n">credential_name</span><span class="si">}</span><span class="s2">"</span><span class="p">)</span>
<span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="n">static_credential_name</span><span class="p">:</span>
<span class="n">our_creds</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">client</span><span class="o">.</span><span class="n">secrets</span><span class="o">.</span><span class="n">database</span><span class="o">.</span><span class="n">get_static_credentials</span><span class="p">(</span>
<span class="bp">self</span><span class="o">.</span><span class="n">credential_name</span><span class="p">,</span> <span class="n">mount_point</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">mount_point</span>
<span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">our_creds</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">client</span><span class="o">.</span><span class="n">secrets</span><span class="o">.</span><span class="n">database</span><span class="o">.</span><span class="n">generate_credentials</span><span class="p">(</span>
<span class="bp">self</span><span class="o">.</span><span class="n">credential_name</span><span class="p">,</span> <span class="n">mount_point</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">mount_point</span>
<span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">creds</span> <span class="o">=</span> <span class="n">our_creds</span><span class="p">[</span><span class="s2">"data"</span><span class="p">]</span>
<span class="k">class</span> <span class="nc">DatabaseWrapper</span><span class="p">(</span><span class="n">base</span><span class="o">.</span><span class="n">DatabaseWrapper</span><span class="p">):</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">settings_dict</span><span class="p">,</span> <span class="n">alias</span><span class="o">=</span><span class="n">DEFAULT_DB_ALIAS</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">database_credentials</span> <span class="o">=</span> <span class="n">DatabaseCredentials</span><span class="p">(</span><span class="n">settings_dict</span><span class="p">)</span>
<span class="nb">super</span><span class="p">()</span><span class="o">.</span><span class="fm">__init__</span><span class="p">(</span><span class="n">settings_dict</span><span class="p">,</span> <span class="n">alias</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">get_new_connection</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">conn_params</span><span class="p">):</span>
<span class="k">try</span><span class="p">:</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">"get connection"</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">database_credentials</span><span class="o">.</span><span class="n">get_conn_params_from_vault</span><span class="p">(</span><span class="n">conn_params</span><span class="p">)</span>
<span class="n">conn</span> <span class="o">=</span> <span class="nb">super</span><span class="p">(</span><span class="n">DatabaseWrapper</span><span class="p">,</span> <span class="bp">self</span><span class="p">)</span><span class="o">.</span><span class="n">get_new_connection</span><span class="p">(</span><span class="n">conn_params</span><span class="p">)</span>
<span class="k">return</span> <span class="n">conn</span>
<span class="k">except</span> <span class="n">Database</span><span class="o">.</span><span class="n">OperationalError</span> <span class="k">as</span> <span class="n">e</span><span class="p">:</span>
<span class="c1"># there doesn't appear to be a good way to check for a specific error</span>
<span class="c1"># other than to read the string and look for "authentication failed"</span>
<span class="k">if</span> <span class="s2">"authentication failed"</span> <span class="ow">not</span> <span class="ow">in</span> <span class="nb">str</span><span class="p">(</span><span class="n">e</span><span class="p">):</span>
<span class="k">raise</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span><span class="s2">"Authentication error. Going to refresh secret and try again."</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">database_credentials</span><span class="o">.</span><span class="n">refresh_now</span><span class="p">()</span>
<span class="bp">self</span><span class="o">.</span><span class="n">database_credentials</span><span class="o">.</span><span class="n">get_conn_params_from_vault</span><span class="p">(</span><span class="n">conn_params</span><span class="p">)</span>
<span class="n">conn</span> <span class="o">=</span> <span class="nb">super</span><span class="p">(</span><span class="n">DatabaseWrapper</span><span class="p">,</span> <span class="bp">self</span><span class="p">)</span><span class="o">.</span><span class="n">get_new_connection</span><span class="p">(</span><span class="n">conn_params</span><span class="p">)</span>
<span class="n">logger</span><span class="o">.</span><span class="n">info</span><span class="p">(</span>
<span class="s2">"Successfully refreshed secret and established new database connection."</span>
<span class="p">)</span>
<span class="k">return</span> <span class="n">conn</span>
</code></pre></div>
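<p>For reference, a <code>DATABASES</code> stanza wired up to this backend might look
like the sketch below; the module path (a hypothetical <code>myapp</code>) and all
values are illustrative:</p>
<div class="codehilite"><pre><code># settings.py (sketch): USER and PASSWORD are omitted because the wrapper
# injects them from vault on every new connection
DATABASES = {
    "default": {
        "ENGINE": "myapp.db.backends.postgresql",  # the custom backend above
        "NAME": "postgres",
        "HOST": "127.0.0.1",
        "PORT": "5432",
        "VAULT_ROLE": "owner-test",
        "VAULT_MOUNT_POINT": "database",
        "VAULT_ADDR": "http://127.0.0.1:8200",  # or leave these unset and let
        "VAULT_TOKEN": "hvs.dev-only-token",    # hvac read the environment
        "OPTIONS": {"assume_role": "test-owner"},
    }
}
</code></pre></div>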
Kubernetes Load Balancer Reset2023-10-07T10:50:00-04:002023-10-07T10:50:00-04:00Gaige B. Paulsentag:www.gaige.net,2023-10-07:/kubernetes-load-balancer-reset.html<p>This morning I had the need to change the IP address configuration for
the load balancer in our k8s cluster. The basics of changing the main
pool in <a href="https://metallb.universe.tf/">metallb</a> were straightforward
enough, but when I applied my changes, I didn't get what I needed.</p>
<p>So, what happened? Originally, I wasn't …</p><p>This morning I had the need to change the IP address configuration for
the load balancer in our k8s cluster. The basics of changing the main
pool in <a href="https://metallb.universe.tf/">metallb</a> were straightforward
enough, but when I applied my changes, I didn't get what I needed.</p>
<p>So, what happened? Originally, I wasn't thinking like a kubernetes cluster,
so I'd not realized that the load balancer itself drives
addressing. As such, I was focussed on things like restarting the pods that
had been assigned addresses. This did no good, although I did learn
how to force <a href="https://fluxcd.io">fluxcd</a> to
<a href="https://fluxcd.io/flux/installation/configuration/helm-drift-detection/">detect and correct helm chart drift</a>
which was super useful when I accidentally deleted a deployment that
didn't come back.</p>
<p>Once I realized that <code>metallb</code> is a kubernetes <em>operator</em> that looks at
the <code>Service</code> CRD for <em>type</em> <code>LoadBalancer</code> and plumbs the address and
network for those services, I needed to focus on the <code>metallb</code> controller
pod, as that was what wasn't effecting the changes.</p>
<p>I looked at the logs and realized it would not pick up the new
configuration because there were IP addresses in the old range in use.
It turns out that there's no way to automatically update these, but
if you restart your controller deployment:</p>
<div class="codehilite"><pre><span></span><code>kubectl<span class="w"> </span>rollout<span class="w"> </span>restart<span class="w"> </span>deployment<span class="w"> </span>metallb-controller<span class="w"> </span>-n<span class="w"> </span>metallb-system
</code></pre></div>
<p>Then it will pick up the new configuration and assign new addresses to
any service already in use.</p>
Recovering longhorn backups2023-10-07T10:50:00-04:002023-10-07T10:50:00-04:00Gaige B. Paulsentag:www.gaige.net,2023-10-07:/recovering-longhorn-backups.html<p>Another chapter in my learning kubernetes the hard way, this time Longhorn.</p>
<p>Probably ill-advisedly, I'm using ephemeral volumes for my storage
volumes in Longhorn <em>and</em> have a habit of leaving the nodes in the
cluster as they're being rebuilt. Generally, this isn't a problem. This weekend,
I was a bit …</p><p>Another chapter in my learning kubernetes the hard way, this time Longhorn.</p>
<p>Probably ill-advisedly, I'm using ephemeral volumes for my storage
volumes in Longhorn <em>and</em> have a habit of leaving the nodes in the
cluster as they're being rebuilt. Generally, this isn't a problem. This weekend,
I was a bit too cavalier about handling the rebuild process and
didn't prune the temporary volumes each time I replaced a storage node,
resulting in all of my storage getting nuked.</p>
<p>In my case, even the "persistent" volumes are all just cache, so it didn't
really matter. However, since I'm also backing this up to an S3-compatible
storage system, it gave me an opportunity to try retrieval.</p>
<h2>Ephemeral storage in persistent nodes</h2>
<p>If I were to remove nodes completely and bring them up afresh, I wouldn't have
these problems. However, I've had the practice of draining/cordoning nodes and
then rebuilding them and re-establishing them in the cluster without removing them
completely from the cluster.</p>
<p>This is marginally faster, but results in the system thinking it was just
temporarily disconnected instead of completely gone. Because of this, the
longhorn storage expects the volumes to be there and present. The current failure
mode is for them to indicate an error, but not allocate new space. This makes
sense in terms of restoring from backups; but it's not helpful in my case, since it
takes up the volume slot and prevents the other nodes from rebuilding.</p>
<p>This weekend, I rebuilt all 4 of my storage nodes, resulting in a complete loss
of data. Configuration was, of course, fine, since that's in the etcd (which I
didn't screw up this time).</p>
<h2>Restoring the pvc</h2>
<p>As an experiment, I wanted to try restoring the backups from my S3-like backup
storage to see if it would work. This is the process that worked for me:</p>
<ol>
<li>
<p>Quiesce the dependent pod by scaling down to zero:</p>
<div class="codehilite"><pre><span></span><code>kubectl scale deploy --replicas=0 renovate-whitesource-renovate
</code></pre></div>
</li>
<li>
<p>Use the GUI to restore the backup to a new volume (named appropriately). In
my case, I named it <code>mend-restored</code></p>
</li>
<li>
<p>Wait for the restore to finish</p>
</li>
<li>
<p>Delete the old Volume (GUI)</p>
</li>
<li>
<p>Create PV/PVC on the backup (GUI). Use the existing name for the PVC.</p>
</li>
<li>
<p>Once the PV and PVC are available, scale the dependent pod back up:</p>
<div class="codehilite"><pre><span></span><code>kubectl scale deploy --replicas=1 renovate-whitesource-renovate
</code></pre></div>
</li>
</ol>
Vaulting Database Credentials2023-09-25T07:06:00-04:002023-09-25T07:06:00-04:00Gaige B. Paulsentag:www.gaige.net,2023-09-25:/vaulting-database-credentials.html<p>Over the past year, I've been experimenting with
<a href="https://www.hashicorp.com">Hashicorp</a>
<a href="https://www.hashicorp.com/products/vault">Vault</a>, using the
open-source/community version for some internal experiments, including
some with high availability.</p>
<p>In a separate article, I'll go over a
<a href="https://www.gaige.net/vault-local-testing-setup.html">test configuration of Vault</a>,
but all of the notes here are agnostic to the use of HCP
(Hashicorp's …</p><p>Over the past year, I've been experimenting with
<a href="https://www.hashicorp.com">Hashicorp</a>
<a href="https://www.hashicorp.com/products/vault">Vault</a>, using the
open-source/community version for some internal experiments, including
some with high availability.</p>
<p>In a separate article, I'll go over a
<a href="https://www.gaige.net/vault-local-testing-setup.html">test configuration of Vault</a>,
but all of the notes here are agnostic to the use of HCP
(Hashicorp's cloud services) or a private instance.</p>
<h2>Setting up database connections</h2>
<p>Contrary to what I originally thought, you only need to set up a single database
configuration for each database server/cluster. (You can create more if you need
to silo your controls further, but doing so can make the role map unnecessarily
confusing). Specifically, the <code>connection_url</code> from the database configuration
does not limit database access to the roles created through or administered by
the vault database connection.</p>
<p>In most cases, set up a single database config to your <code>postgres</code> database and
use it as the connection for all of your static and dynamic roles.</p>
<p>I'll walk through setting up a local vault and database connection
for a test scenario in
<a href="https://www.gaige.net/vault-local-testing-setup.html">Local Testing</a>.</p>
<h3>Vault connection user</h3>
<p>You'll definitely want to create a separate admin user for use by
vault in managing credentials. Once you configure your connection with this
user and rotate the password, you'll not have access to that password and
as such it won't be available for break-glass or other administrative
duties.</p>
<p>Recommendations from the internet are that the vault admin user just be
able to provision users (<code>CREATEROLE</code> privilege can grant membership
in other roles), and then create a postgresql role which can be
used (assumed) by the vault role in the event that it needs to create
database objects that have ownership (tables, sequences, etc.).
Your specific use case may vary, but least privilege would be
the goal with multiple roles.</p>
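<p>As a sketch (names borrowed from my local-testing post; the password only
matters until vault's <code>rotate-root</code> replaces it), the bootstrap for that
connection user might be:</p>
<div class="codehilite"><pre><code>import psycopg  # psycopg 3

# run once as the postgres superuser
with psycopg.connect("dbname=postgres user=postgres", autocommit=True) as conn:
    # CREATEROLE lets vault provision and drop the dynamic users
    conn.execute(
        "CREATE ROLE \"local-vault\" WITH LOGIN CREATEROLE PASSWORD 'changeme'"
    )
</code></pre></div>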
<h3>Static vs dynamic roles</h3>
<p>Vault has two types of roles for accessing databases, static and dynamic. The
static roles have user-specified usernames, whereas the dynamic roles create
a new (usually short-lived) user while being used. The biggest difference is
that the dynamic roles are deleted once they're no longer in use (see caveat
below on ownership).</p>
<p>A static role adopts a role that already exists in the database, and its
password is automatically rotated after a specified period of time.</p>
<p>A dynamic role is a unique user that lasts for a shorter period
of time, with extensions possible by renewing the lease, and is
deleted after its expiration time. Generally speaking, this is the target
state, as dynamic roles provide isolation and limit the usefulness of
leaked credentials.</p>
<h3><a id="ownership-and-roles"></a>Ownership and roles</h3>
<p>One problem with dynamic roles comes into play when creating objects. Quoting from
the postgresql
<a href="https://www.postgresql.org/docs/current/role-removal.html">documentation on DROP ROLE</a>:</p>
<blockquote>
<p>Because roles can own database objects and can hold privileges to
access other objects, dropping a role is often not just a matter
of a quick DROP ROLE. Any objects owned by the role must first be
dropped or reassigned to other owners; and any permissions granted
to the role must be revoked.</p>
</blockquote>
<p>So, although the dynamic roles give us the safety we desire, they do create some
complications when objects may be owned by those roles. To solve for this, we add
some complexity by creating a new role to hold the privileges, then give the
dynamic roles
<a href="https://www.postgresql.org/docs/current/role-membership.html#ROLE-MEMBERSHIP">role membership</a>
in that role, and finally we assume the role for each database session, ensuring
that we are acting on behalf of the role we assume, including its privileges and
ownership.</p>
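<p>Concretely, a session using a vault-issued dynamic user might look like this
sketch (the credential values and the <code>appdb</code> database are illustrative):</p>
<div class="codehilite"><pre><code>import psycopg  # psycopg 3

dynamic_user = "v-token-owner-tes-abc123"  # illustrative values, as returned by
dynamic_password = "A1a-example"           # vault read database/creds/owner-test

with psycopg.connect(host="127.0.0.1", dbname="appdb",
                     user=dynamic_user, password=dynamic_password) as conn:
    with conn.cursor() as cur:
        cur.execute('SET ROLE "owner"')  # act on behalf of the permanent role
        cur.execute("CREATE TABLE widgets (id serial PRIMARY KEY)")
        # widgets is owned by "owner", not the temporary user, so the user
        # can be dropped cleanly when its lease expires
</code></pre></div>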
<p>Creating the roles may take a little time, but it has a couple of other nice side
effects, including placing the determination of specific access in the database
itself instead of hard-coding it into your vault configuration.</p>
<p>Further reading:</p>
<ul>
<li><a href="https://www.gaige.net/postgres-roles-and-privileges.html">Postres roles and priviliges</a>
for details on how I designed my roles.</li>
<li><a href="https://www.gaige.net/django-and-vault.html">Django and vault</a>
describes how I integrated vault into my django code.</li>
<li><a href="https://www.gaige.net/vault-local-testing-setup.html">Vault local testing setup</a>
demonstrates creating a local testing setup for vault and postgres</li>
</ul>
Kubernetes etcd near disaster2023-07-30T07:06:00-04:002023-07-30T07:06:00-04:00Gaige B. Paulsentag:www.gaige.net,2023-07-30:/kubernetes-etcd-near-disaster.html<p>This post is mostly a warning to me for the future, but hopefully it'll prevent
somebody else from going through the same problem. I've been running a small
Kubernetes cluster for a couple of years now, mostly as an experiment and to
keep my skills tuned for new tooling. Part …</p><p>This post is mostly a warning to me for the future, but hopefully it'll prevent
somebody else from going through the same problem. I've been running a small
Kubernetes cluster for a couple of years now, mostly as an experiment and to
keep my skills tuned for new tooling. Part of that has been making sure I use
reasonable tooling and automate as much as I can.</p>
<p>So far, I've been pretty happy using:</p>
<ul>
<li><a href="https://rke.docs.rancher.com">rke</a></li>
<li><a href="https://docs.gitlab.com/ee/user/clusters/agent/install/">GitLab CI/CD Workflow</a></li>
<li><a href="https://metallb.universe.tf">metallb</a></li>
</ul>
<h2>Keeping kubernetes up to date</h2>
<p>I've been in the habit of keeping my k8s cluster up-to-date for some time.
Usually just redeploying the RKE cluster whenever there's a notable upgrade
and keeping major dependencies up to date using automation based on
<a href="https://www.mend.io">mend</a>.</p>
<p>This has worked well for the kubernetes infrastructure, but since the "machines"
that run my cluster are all VMs, I also want to update them occasionally, updating
the underlying OS and proving that I can rebuild the environment.</p>
<p>For worker nodes in the cluster, this has generally worked well. I have a
basic Ansible-automated process:</p>
<ol>
<li>Set the node to unschedulable</li>
<li>Replace the node</li>
<li>Run the <code>rke</code> command to bring the node back into the cluster</li>
</ol>
<p>This has worked fine, probably because the worker nodes, once quiesced, are
not unique in any way.</p>
<h2>Updating my control plane nodes</h2>
<p>The problem came when I went to follow the same procedure for my control
plane nodes. I took the first node offline, built a new node, and ran
<code>rke</code> to bring the node back online. Everything seemed to be functioning
well, so I went on to the next node and everything came to a halt.</p>
<p>It took me a few minutes to figure out what I'd done, but the key is that,
unlike the worker nodes, the nodes running etcd are a bit special. In
particular, they carry a unique ID that is embedded in their local database.</p>
<p>Running <code>docker exec etcd etcdctl --write-out=table member list</code>, you can see
the ID on the left:</p>
<div class="codehilite"><pre><span></span><code><span class="nb">+------------------+---------+-------------+--------------------------+--------------------------+------------+</span>
<span class="c">| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |</span>
<span class="nb">+------------------+---------+-------------+--------------------------+--------------------------+------------+</span>
<span class="c">| 51e442e065ed8da9 | started | etcd</span><span class="nb">-</span><span class="c">node</span><span class="nb">-</span><span class="c">1 | https://etcd</span><span class="nb">-</span><span class="c">node</span><span class="nb">-</span><span class="c">1:2380 | https://etcd</span><span class="nb">-</span><span class="c">node</span><span class="nb">-</span><span class="c">1:2379 | false |</span>
<span class="c">| 7c17ab818595f4fe | started | etcd</span><span class="nb">-</span><span class="c">node</span><span class="nb">-</span><span class="c">0 | https://etcd</span><span class="nb">-</span><span class="c">node</span><span class="nb">-</span><span class="c">0:2380 | https://etcd</span><span class="nb">-</span><span class="c">node</span><span class="nb">-</span><span class="c">0:2379 | false |</span>
<span class="c">| d085086f6d909371 | started | etcd</span><span class="nb">-</span><span class="c">node</span><span class="nb">-</span><span class="c">2 | https://etcd</span><span class="nb">-</span><span class="c">node</span><span class="nb">-</span><span class="c">2:2380 | https://etcd</span><span class="nb">-</span><span class="c">node</span><span class="nb">-</span><span class="c">2:2379 | false |</span>
<span class="nb">+------------------+---------+-------------+--------------------------+--------------------------+------------+</span>
</code></pre></div>
<p>Everything was fine when I deleted the first node: since the environment was
configured for HA, 2 of its 3 members were still available, which meant it still
had a voting majority. And, when the node was re-provisioned, it was still fine,
because the replacement registered as a fourth member and 3 of the 4 etcd
members were reachable. However, when I went to replace the second node, the
entire cluster failed, because etcd reached a state where only 2 of the 4
registered members were available, short of the 3 votes needed for quorum.</p>
<p>Once I was able to ascertain that the original node hadn't been replaced in
the node list, the solution was relatively simple, and I deleted the errant
node using:</p>
<div class="codehilite"><pre><span></span><code>docker<span class="w"> </span><span class="nb">exec</span><span class="w"> </span>etcd<span class="w"> </span>etcdctl<span class="w"> </span>member<span class="w"> </span>remove<span class="w"> </span><span class="o">[</span>id<span class="o">]</span>
</code></pre></div>
<p>For future work with the etcd nodes, I've made modifications to
my MOP and my Ansible scripting:</p>
<ol>
<li>
<p>Make sure to check the status of the etcd membership before replacing
any node</p>
<p><code>docker exec etcd etcdctl --write-out=table member list</code></p>
</li>
<li>
<p>Make an explicit snapshot of the cluster before replacing the nodes:</p>
<p><code>rke etcd snapshot-save --name extra_snapshot.db --config cluster.yml</code></p>
</li>
<li>
<p>Remove the old etcd node as soon as possible to prevent negative effect on the quorum:</p>
<p><code>docker exec etcd etcdctl member remove [id]</code></p>
</li>
</ol>
<h2>Recovery from failure</h2>
<p>In order to get the cluster back into working shape, I did do a restore and
rebuild once I figured out what had gone on. This also involved using the
most recent backup from etcd. (I also took a backup of the botched etcd
situation before restoring).</p>
<div class="codehilite"><pre><span></span><code>rke<span class="w"> </span>etcd<span class="w"> </span>snapshot-save<span class="w"> </span>--name<span class="w"> </span>disaster.db<span class="w"> </span>--config<span class="w"> </span>cluster.yml
rke<span class="w"> </span>etcd<span class="w"> </span>snapshot-restore<span class="w"> </span>--name<span class="w"> </span>extra_snapshot.db<span class="w"> </span>--config<span class="w"> </span>cluster.yml
rke<span class="w"> </span>up
</code></pre></div>
<p>Note that there is reasonable
<a href="https://etcd.io/docs/v2.3/admin_guide/#disaster-recovery">disaster recovery</a>
documentation in etcd's documentation.</p>
Elastic index correction2023-07-03T07:04:00-04:002023-07-03T07:04:00-04:00Gaige B. Paulsentag:www.gaige.net,2023-07-03:/elastic-index-correction.html<p>Recently, I noticed a problem with my Index Lifecycle Management (ILM) not appropriately
rotating indexes. The error was not super clear, but I did notice that the existing
index had just reached 90 days without closing and that was the first move in the
ILM. It was clear that the …</p><p>Recently, I noticed a problem with my Index Lifecycle Management (ILM) not appropriately
rotating indexes. The error was not super clear, but I did notice that the existing
index had just reached 90 days without closing and that was the first move in the
ILM. It was clear that the 30-day rollover wasn't happening.</p>
<p>The primary problem was easy to solve, which was to make sure that the write index
was set correctly and the index was attached to the template with the alias:</p>
<div class="codehilite"><pre><span></span><code><span class="err">PUT</span><span class="w"> </span><span class="kc">f</span><span class="err">ilebea</span><span class="kc">t</span><span class="mf">-8.0.0-2023-04-01-000001</span>
<span class="p">{</span>
<span class="w"> </span><span class="nt">"aliases"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"filebeat-8.0.0"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"is_write_index"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">}</span>
<span class="p">}</span>
</code></pre></div>
<p>That resolved part of the problem, but the roll-over then occurred and it created
<code>filebeat-8.0.0-2023-04-01-000002</code>, which definitely wasn't what I wanted (although
in truth that date is just for the humans; the ILM uses the write dates).</p>
<p>To fix this, I needed to:</p>
<ol>
<li>
<p>Stop the ILM</p>
<p><code>POST _ilm/stop</code></p>
</li>
<li>
<p>Create a new index using the date fields</p>
<p><code>PUT %3Cfilebeat-8.0.0-%7Bnow%2Fd%7D-000001%3E</code> (the URL-encoded form
of <code>&lt;filebeat-8.0.0-{now/d}-000001&gt;</code>)</p>
</li>
<li>
<p>Set the write index correctly for both indexes:</p>
<p><code>POST /_aliases</code></p>
<div class="codehilite"><pre><span></span><code><span class="p">{</span>
<span class="nt">"actions"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"add"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"index"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"filebeat-8.0.0-2023-07-03-000001"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"alias"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"filebeat-8.0.0"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"is_write_index"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="kc">true</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">},</span>
<span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"add"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"index"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"filebeat-8.0.0-2023-04-01-000002"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"alias"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"filebeat-8.0.0"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"is_write_index"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="kc">false</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">}</span>
<span class="p">]</span>
<span class="p">}</span>
</code></pre></div>
</li>
<li>
<p>Turn ILM back on</p>
<p><code>POST _ilm/start</code></p>
</li>
</ol>
<p>Relatively straightforward. The only hiccup was that the interim index was now out of
sync with the ILM program, showing:</p>
<div class="codehilite"><pre><span></span><code><span class="n">illegal_argument_exception</span><span class="o">:</span><span class="w"> </span><span class="n">index</span><span class="w"> </span><span class="o">[</span><span class="n">filebeat</span><span class="o">-</span><span class="mf">8.0</span><span class="o">.</span><span class="mi">0</span><span class="o">-</span><span class="mi">2023</span><span class="o">-</span><span class="mi">04</span><span class="o">-</span><span class="mi">01</span><span class="o">-</span><span class="mi">000002</span><span class="o">]</span><span class="w"> </span><span class="k">is</span><span class="w"> </span><span class="n">not</span><span class="w"> </span><span class="n">the</span><span class="w"> </span><span class="n">write</span><span class="w"> </span><span class="n">index</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="n">alias</span><span class="w"> </span><span class="o">[</span><span class="n">filebeat</span><span class="o">-</span><span class="mf">8.0</span><span class="o">.</span><span class="mi">0</span><span class="o">]</span>
</code></pre></div>
<p>Since the temporary rollover index was small and didn't contain anything essential,
I decided to delete it. There were some postings for older versions of ES that suggested
ways of fixing this, but with 8+ they didn't seem to work.</p>
<p>Also of note: last year, I described
<a href="https://www.gaige.net/renaming-elasticsearch-indexes.html">Renaming Elasticsearch indexes</a>
when the situation arose to change the name of an index template.</p>
Poetry in GitLab2023-05-14T12:31:00-04:002023-05-14T12:31:00-04:00Gaige B. Paulsentag:www.gaige.net,2023-05-14:/poetry-in-gitlab.html<p>This weekend, I had occasion to build a new python-based utility and
leaned in to my existing poetry tooling in order to do so. While starting
the new project, I wanted to take advantage of some gitlab automation I'd
previously used on other projects, so I figured I'd document it …</p><p>This weekend, I had occasion to build a new python-based utility and
leaned in to my existing poetry tooling in order to do so. While starting
the new project, I wanted to take advantage of some gitlab automation I'd
previously used on other projects, so I figured I'd document it here.</p>
<h2>Tooling overview for automation</h2>
<p>The purpose of the gitlab automation here is to go from a feature
branch to a new release without having to do any of the work myself.</p>
<p>I'm using a bunch of tools to achieve this:</p>
<ul>
<li><a href="https://python-poetry.org/docs/"><code>poetry</code></a> for dependency management and packaging</li>
<li><a href="https://commitizen-tools.github.io/commitizen/"><code>commitizen</code></a> for enforcing
<a href="https://www.conventionalcommits.org/en/v1.0.0/">conventional commits</a> and
managing release notes</li>
<li><a href="https://docs.pytest.org/en/7.3.x/"><code>pytest</code></a> for test running and reporting</li>
<li><a href="https://tox.wiki/en/latest/"><code>tox</code></a> for test automation in multiple language
versions (currently 3.10 and 3.11)</li>
</ul>
<p>And, for good measure, I'll mention <a href="https://gitlab.com">GitLab</a>
(the Pro version) for source repository and CI/CD,
and <a href="https://www.jetbrains.com/pycharm/features/">JetBrains PyCharm</a>, which I use
as my IDE most of the time.</p>
<h2>Automating the poetry delivery pipeline</h2>
<p>Once I've got the project building and tests running, then I want to start
rolling it out in versions. I first established this pipeline for another
command-line tool (<code>certalerter</code>, my alerting tool for <code>certlogger</code>), so
adapting for a new project should be straightforward.</p>
<p>I'm going to elide the coding and testing and stick to the automation for this post,
and mostly do it by going through my <code>.gitlab-ci.yml</code> file a bit at a time.</p>
<h3>Overall workflow</h3>
<p>I've broken the workflow into 5 stages and limited it to run on merge
requests, on commits to the main branch that aren't part of an open merge
request, and on tag creation (mostly to handle releases).</p>
<p>I'm running all of this in docker (possibly k8s, but I haven't specifically
enabled that yet). My python is pretty clean and I haven't had any problems
with portability.</p>
<div class="codehilite"><pre><span></span><code><span class="nt">workflow</span><span class="p">:</span>
<span class="w"> </span><span class="nt">rules</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">if</span><span class="p">:</span><span class="w"> </span><span class="s">'$CI_PIPELINE_SOURCE</span><span class="nv"> </span><span class="s">==</span><span class="nv"> </span><span class="s">"merge_request_event"'</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">if</span><span class="p">:</span><span class="w"> </span><span class="s">'$CI_COMMIT_BRANCH</span><span class="nv"> </span><span class="s">&&</span><span class="nv"> </span><span class="s">$CI_OPEN_MERGE_REQUESTS'</span>
<span class="w"> </span><span class="nt">when</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">never</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">if</span><span class="p">:</span><span class="w"> </span><span class="s">'$CI_COMMIT_BRANCH'</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">if</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">$CI_COMMIT_TAG</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">if</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">$CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH</span>
<span class="nt">default</span><span class="p">:</span>
<span class="w"> </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">python:3.11</span>
</code></pre></div>
<p>There are 5 stages, most of which are pretty straightforward:</p>
<div class="codehilite"><pre><span></span><code><span class="nt">stages</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">build</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">test</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">bump</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">package</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">release</span>
</code></pre></div>
<h3>Building the project</h3>
<p>Building the project is pretty straightforward: load in <code>poetry</code> to
get our environment and then let it <code>build</code>. I've chosen to capture
the distribution binaries (<code>whl</code> and <code>tar.gz</code> files) in the artifacts
paths so that they don't need to be rebuilt for the testing phase. I'm
not using the <code>PyPI</code> repository from gitlab yet, because I don't want
every build to be uniquely kept there, but that's addressed in the
<code>package</code> phase later.</p>
<div class="codehilite"><pre><span></span><code><span class="nt">build-job</span><span class="p">:</span>
<span class="w"> </span><span class="nt">stage</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">build</span>
<span class="w"> </span><span class="nt">script</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">pip install poetry</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">poetry build</span>
<span class="w"> </span><span class="nt">artifacts</span><span class="p">:</span>
<span class="w"> </span><span class="nt">paths</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">dist/ct_nagios_plugins*.whl</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">dist/ct_nagios_plugins*.tar.gz</span>
<span class="w"> </span><span class="nt">expire_in</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">1 week</span>
<span class="w"> </span><span class="nt">interruptible</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span>
</code></pre></div>
<h3>Testing and coverage</h3>
<p>In order to enable pushing automatically to production, I feel it's necessary
to have well-maintained test suites. As such, code is tested on every commit
(and in all major environments) and coverage is tracked to catch when the
tests are back-sliding.</p>
<p>Each of the test environments starts from the appropriate python image,
and then only <code>tox</code> and the <code>coverage</code> tools are installed.
Since the goal here is to create a stand-alone package, I want to take care
not to introduce any unintended <code>poetry</code> dependencies into the test
environment.</p>
<p>The funky <code>grep</code>/<code>sed</code>/<code>awk</code> bit is to tease the coverage out of the
coverage file for use by gitlab. The <code>|| true</code> at the end of it ensures
that being unable to get coverage through this method doesn't spoil the
stage.</p>
<p>Finally, the test logs (<code>junit-*.xml</code>) and coverage reports (<code>coverage-*.xml</code>)
are stored as artifacts.</p>
<div class="codehilite"><pre><span></span><code><span class="nt">test</span><span class="p">:</span>
<span class="w"> </span><span class="nt">needs</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">build-job</span>
<span class="w"> </span><span class="nt">parallel</span><span class="p">:</span>
<span class="w"> </span><span class="nt">matrix</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">PYTHON_VERSION</span><span class="p">:</span><span class="w"> </span><span class="s">"3.11"</span>
<span class="w"> </span><span class="nt">TOXENV</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">py311</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">PYTHON_VERSION</span><span class="p">:</span><span class="w"> </span><span class="s">"3.10"</span>
<span class="w"> </span><span class="nt">TOXENV</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">py310</span>
<span class="w"> </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">python:${PYTHON_VERSION}</span>
<span class="w"> </span><span class="nt">stage</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">test</span>
<span class="w"> </span><span class="nt">script</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">pip install tox coverage</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">tox --installpkg dist/*.whl</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">coverage xml -o coverage-${PYTHON_VERSION}.xml</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="p p-Indicator">></span>
<span class="w"> </span><span class="no">grep ^\<coverage coverage-${PYTHON_VERSION}.xml</span>
<span class="w"> </span><span class="no">| sed -n -e 's/.*line-rate=\"\([0-9.]*\)\".*/\1/p'</span>
<span class="w"> </span><span class="no">| awk '{print "CodeCoverageOverall =" $1*100}'</span>
<span class="w"> </span><span class="no">|| true</span>
<span class="w"> </span><span class="nt">interruptible</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span>
<span class="w"> </span><span class="nt">artifacts</span><span class="p">:</span>
<span class="w"> </span><span class="nt">reports</span><span class="p">:</span>
<span class="w"> </span><span class="nt">junit</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">junit-*.xml</span>
<span class="w"> </span><span class="nt">coverage_report</span><span class="p">:</span>
<span class="w"> </span><span class="nt">coverage_format</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">cobertura</span>
<span class="w"> </span><span class="nt">path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">coverage-*.xml</span>
<span class="w"> </span><span class="nt">coverage</span><span class="p">:</span><span class="w"> </span><span class="s">'/^CodeCoverageOverall</span><span class="nv"> </span><span class="s">=(\d+\.\d+)$/'</span>
</code></pre></div>
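<p>Unrolled, that coverage-extraction pipeline looks like this (the file name
is shown for the 3.11 run):</p>
<div class="codehilite"><pre><span></span><code># pull the line-rate attribute off the <coverage ...> root element
grep '^<coverage' coverage-3.11.xml \
  | sed -n -e 's/.*line-rate="\([0-9.]*\)".*/\1/p' \
  | awk '{print "CodeCoverageOverall =" $1*100}' \
  || true
# emits, e.g.: CodeCoverageOverall =97.3
</code></pre></div>
<p>The <code>coverage</code> regular expression at the bottom of the job then
picks that line back out of the job log.</p>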
<h3>Bumping versions</h3>
<p>In addition to the previously-mentioned rules for running, the version bump
is very selective. It will only run on commits to <code>main</code> where <code>bump</code> is not
part of the commit message. This (hopefully) prevents it from running twice
without need, and it should also stop loops.</p>
<p>Note the use of <code>CI_BUMP_TOKEN</code> here, which is a Personal Access Token (PAT)
for GitLab that has permissions to <code>read_repository</code> and <code>write_repository</code>
so that it can be used to write back to the repo. When I tried this originally,
I expected to be able to commit back to my own repo, but ran into trouble, so
using the PAT here makes that straightforward. The <code>CI_BUMP_GITLAB_ID</code> is
probably not necessary, as <code>__token__</code> should suffice.</p>
<p>Using <code>poetry</code> and <code>cz</code> here guarantees that all of the
expected steps run, but it also results in the above requirements: because
<code>cz bump</code> commits back to the repository, it needs
<code>write_repository</code>, which the <code>CI_JOB_TOKEN</code> specifically
doesn't have. If I weren't committing back, but just setting a tag or
release, I could easily do that with the API.</p>
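<p>For illustration, creating a release purely through the API with the job
token would look something like this (the tag name and description are
placeholders):</p>
<div class="codehilite"><pre><span></span><code>curl --request POST \
  --header "JOB-TOKEN: ${CI_JOB_TOKEN}" \
  --header "Content-Type: application/json" \
  --data '{ "tag_name": "v1.2.3", "ref": "main", "description": "Release v1.2.3" }' \
  "${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/releases"
</code></pre></div>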
<p>In my case, I use a specific <em>PAT</em> scoped to this repository, so that I can limit
the blast radius. I'd be happier if there were a way to request a read/write
<code>CI_JOB_TOKEN</code> for certain stages, but even if that were available, it's
not clear how that could be governed effectively without giving all
stages in the pipeline access.</p>
<div class="codehilite"><pre><span></span><code><span class="c1"># need to clean in case tagging is screwy, since `git clean` doesn't know to remove tags</span>
<span class="nt">bump</span><span class="p">:</span>
<span class="w"> </span><span class="nt">needs</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">test</span>
<span class="w"> </span><span class="nt">stage</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">bump</span>
<span class="w"> </span><span class="nt">variables</span><span class="p">:</span>
<span class="w"> </span><span class="nt">GIT_STRATEGY</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">clone</span>
<span class="w"> </span><span class="nt">script</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">pip install poetry</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">poetry install</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">git config --global user.email "${GITLAB_USER_EMAIL}"</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">git config --global user.name "${GITLAB_USER_NAME}"</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">exit_code=0</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">poetry run cz bump --annotated-tag --changelog || exit_code=$?</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">echo "$exit_code is exit code ; $? was result"</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="p p-Indicator">|</span>
<span class="w"> </span><span class="no">if [ $exit_code -eq 0 ]</span>
<span class="w"> </span><span class="no">then</span>
<span class="w"> </span><span class="no">git remote set-url origin ${CI_SERVER_PROTOCOL}://${CI_BUMP_GITLAB_ID}:${CI_BUMP_TOKEN}@${CI_SERVER_HOST}/${CI_PROJECT_PATH}</span>
<span class="w"> </span><span class="no">git push origin --follow-tags HEAD:${CI_COMMIT_BRANCH}</span>
<span class="w"> </span><span class="no">elif [ $exit_code -eq 21 ]</span>
<span class="w"> </span><span class="no">then</span>
<span class="w"> </span><span class="no">echo "Skipping push with no version change"</span>
<span class="w"> </span><span class="no">elif [ $exit_code -eq 3 ]</span>
<span class="w"> </span><span class="no">then</span>
<span class="w"> </span><span class="no">echo "Skipping push with no commits"</span>
<span class="w"> </span><span class="no">else</span>
<span class="w"> </span><span class="no">echo "cz error code $exit_code"</span>
<span class="w"> </span><span class="no">exit $exit_code</span>
<span class="w"> </span><span class="no">fi</span>
<span class="w"> </span><span class="nt">rules</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">if</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">$CI_COMMIT_BRANCH==$CI_DEFAULT_BRANCH && $CI_COMMIT_TITLE =~ /^bump/</span>
<span class="w"> </span><span class="nt">when</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">never</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">if</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">$CI_COMMIT_BRANCH==$CI_DEFAULT_BRANCH</span>
<span class="c1"># skip on bump, because you'll never bump after bump</span>
</code></pre></div>
<h3>Packaging the job</h3>
<p>As with the <code>bump</code> stage, the <code>package</code> stage runs only at specific
times. In particular, it will only run directly following a <code>bump</code> commit on
the <code>main</code> branch.</p>
<p>Theoretically, I could use <code>poetry</code> and its <code>publish</code>
command, but in this case, <code>twine</code> is fine (and dedicated to the task).</p>
<div class="codehilite"><pre><span></span><code><span class="nt">package-job</span><span class="p">:</span>
<span class="w"> </span><span class="nt">stage</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">package</span>
<span class="w"> </span><span class="nt">script</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">pip install twine</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">TWINE_PASSWORD=${CI_JOB_TOKEN} TWINE_USERNAME=gitlab-ci-token python -m twine upload --verbose --disable-progress-bar --repository-url ${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/packages/pypi dist/*</span>
<span class="w"> </span><span class="nt">rules</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">if</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">$CI_COMMIT_BRANCH==$CI_DEFAULT_BRANCH && $CI_COMMIT_TITLE =~ /^bump/</span>
</code></pre></div>
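<p>For comparison, the <code>poetry</code>-native equivalent would be roughly
the following (untested in this pipeline; the repository name
<code>gitlab</code> is arbitrary):</p>
<div class="codehilite"><pre><span></span><code>poetry config repositories.gitlab "${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/packages/pypi"
poetry publish --repository gitlab --username gitlab-ci-token --password "${CI_JOB_TOKEN}"
</code></pre></div>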
<h3>Finishing off the release</h3>
<p>The final phase, which only happens on tagged commits, is to create the release. GitLab
makes this <a href="https://docs.gitlab.com/ee/user/project/releases/release_cicd_examples.html#create-a-release-when-a-git-tag-is-created">easy</a>
by directly supporting the release process in the CI file. The <code>awk</code>
one-liner trims <code>CHANGELOG.md</code> down to just the newest release's notes
for use as the release description; an unrolled version follows the job definition.</p>
<div class="codehilite"><pre><span></span><code><span class="nt">release_job</span><span class="p">:</span>
<span class="w"> </span><span class="nt">stage</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">release</span>
<span class="w"> </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">registry.gitlab.com/gitlab-org/release-cli:latest</span>
<span class="w"> </span><span class="nt">rules</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">if</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">$CI_COMMIT_TAG</span>
<span class="w"> </span><span class="nt">script</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">echo "Running the release job for $CI_COMMIT_TAG."</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="s">"awk</span><span class="nv"> </span><span class="s">'/^##</span><span class="nv"> </span><span class="s">Unreleased/</span><span class="nv"> </span><span class="s">{</span><span class="nv"> </span><span class="s">next</span><span class="nv"> </span><span class="s">}</span><span class="nv"> </span><span class="s">;</span><span class="nv"> </span><span class="s">/^##</span><span class="nv"> </span><span class="s">/</span><span class="nv"> </span><span class="s">{</span><span class="nv"> </span><span class="s">r++</span><span class="nv"> </span><span class="s">;</span><span class="nv"> </span><span class="s">if</span><span class="nv"> </span><span class="s">(</span><span class="nv"> </span><span class="s">r</span><span class="nv"> </span><span class="s"><2)</span><span class="nv"> </span><span class="s">{</span><span class="nv"> </span><span class="s">print</span><span class="nv"> </span><span class="s">;</span><span class="nv"> </span><span class="s">next</span><span class="nv"> </span><span class="s">}</span><span class="nv"> </span><span class="s">else</span><span class="nv"> </span><span class="s">{</span><span class="nv"> </span><span class="s">exit</span><span class="nv"> </span><span class="s">}</span><span class="nv"> </span><span class="s">};</span><span class="nv"> </span><span class="s">/^/</span><span class="nv"> </span><span class="s">{</span><span class="nv"> </span><span class="s">print</span><span class="nv"> </span><span class="s">}</span><span class="nv"> </span><span class="s">;'</span><span class="nv"> </span><span class="s"><</span><span class="nv"> </span><span class="s">CHANGELOG.md</span><span class="nv"> </span><span class="s">>INCREMENTAL_CHANGELOG.md"</span>
<span class="w"> </span><span class="nt">release</span><span class="p">:</span>
<span class="w"> </span><span class="nt">tag_name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">$CI_COMMIT_TAG</span>
<span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="s">'v$CI_COMMIT_TAG'</span>
<span class="w"> </span><span class="nt">description</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">INCREMENTAL_CHANGELOG.md</span>
</code></pre></div>
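<p>For the curious, the changelog-trimming <code>awk</code> one-liner is roughly
equivalent to this more readable form:</p>
<div class="codehilite"><pre><span></span><code>awk '
  /^## Unreleased/ { next }                # drop the Unreleased heading itself
  /^## /           { if (++r >= 2) exit }  # stop at the second release heading
                   { print }               # everything else passes through
' CHANGELOG.md > INCREMENTAL_CHANGELOG.md
</code></pre></div>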
Monitor fleet aging2023-05-14T10:40:00-04:002023-05-14T10:40:00-04:00Gaige B. Paulsentag:www.gaige.net,2023-05-14:/monitor-fleet-aging.html<h2>Background</h2>
<p>Generally speaking, I refresh most of my systems pretty regularly, spurred on
by security concerns, general hygiene, a desire to make sure the automation
doesn't age out, and certificate expiration.</p>
<p>Although I don't need to refresh systems due to certificate expiration, it
has historically been the easiest indicator of …</p><h2>Background</h2>
<p>Generally speaking, I refresh most of my systems pretty regularly, spurred on
by security concerns, general hygiene, a desire to make sure the automation
doesn't age out, and certificate expiration.</p>
<p>Although I don't need to refresh systems due to certificate expiration, it
has historically been the easiest indicator of systems that are getting a
little long in the tooth.</p>
<p>Working on some systems this weekend, I noticed some out-of-date copies of
postgresql...really out of date...like close to a year old. This is what
sent me off on this weekend's adventure.</p>
<h2>What do you mean by refresh and why?</h2>
<p>Given our penchant for building everything using Ansible, when I indicate
I'm refreshing a system, that means the old VM gets taken down and a new one
is built to then-current specifications as a replacement.</p>
<p>Rob and I have nurtured this workflow for years (ever since moving to using
<a href="https://docs.ansible.com">ansible</a> for automation). In all cases, I build
staging environments before production and in most cases there are some
reasonable automated tests for that process.</p>
<p>As to why? The answer is mostly one of convenience, although there are
security arguments as well: refreshing both picks up the latest versions of
libraries (including any vulnerability fixes) and dislodges anything bad that
may be sitting on the virtual machines.</p>
<h2>Monitoring the fleet age</h2>
<p>Based on the recent discovery of some aging systems, I figured that I
should find a way to add this process to our monitoring system, the
venerable <a href="https://www.nagios.org">Nagios</a>.</p>
<p>This didn't need to be particularly complex, but I needed the nagios
server to reach out to the SmartOS Global Zones in order to get information
about the running VMs. Historically, we've done this with captive SSH, using
dedicated keys and lines in <code>~/.ssh/authorized_keys</code> which take advantage
of the <code>command=</code> option in order to run a program, potentially with
information from the incoming SSH connection. Results come back as text,
preferably encoded as JSON or similar.</p>
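<p>On the receiving end, that looks roughly like the following line in
<code>~/.ssh/authorized_keys</code> on the global zone (the script path and key
material here are placeholders):</p>
<div class="codehilite"><pre><span></span><code>command="/opt/custom/bin/report-vm-ages",no-port-forwarding,no-agent-forwarding,no-pty,no-X11-forwarding ssh-ed25519 AAAAC3...rest-of-key... nagios-age-check
</code></pre></div>
<p>With a captive command in place, whatever the client asks to run is ignored;
sshd runs the configured program instead, and the original request is available
to it in <code>SSH_ORIGINAL_COMMAND</code> if parameters are needed.</p>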
<h3>A new python framework for ssh requests</h3>
<p>Most of our previous commands piggy-backed on the <code>check_by_ssh</code> checker,
which is a standard nagios plugin. However, that command assumes that we
put all of the intelligence at the other end of the line (on the recipient)
and basically run the checks there. That could be done, but the need to do
date math made coming up with an appropriate one-liner a bit ridiculous,
so I decided to go with python.</p>
<p>The python code was straightforward, and I used my existing <code>poetry</code>-based
environment as a starting point, creating a couple of new commands which
I'd install on the nagios servers: one for SmartOS and another for AWS.</p>
<p>By making use of my existing <code>poetry</code> workflows, I got a number of things for
free, including updating release notes, packaging releases in gitlab, etc.</p>
<h3>Integrating with nagios</h3>
<p>The <code>nagios</code> integration should have been simple, but for one small issue:
I needed to parameterize the global zone host so that the check could be
run against it.</p>
<p>After some digging through the
<a href="https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/4/en/objectdefinitions.html#service">documentation for nagios</a>,
I found the section on <a href="https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/4/en/macros.html">custom macro variables</a>,
which is exactly what I needed in this case. I wanted to add a new variable
<code>_GZHOST</code> to my existing host definitions which would indicate which host to
query about the underlying VM. I already had this information in the <code>PARENTS</code>
field, which I thought I could use as <code>$HOSTPARENTS$</code>, but it turns out that
for some reason that's not exposed.</p>
<p>In this case, I was able to use <code>$_HOSTGZHOST$</code> in my <code>command</code> definition in
<code>commands.cfg</code>, resulting in:</p>
<div class="codehilite"><pre><span></span><code><span class="nv">define</span><span class="w"> </span><span class="nv">command</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nv">command_name</span><span class="w"> </span><span class="nv">check_smartos_vm_age</span>
<span class="w"> </span><span class="nv">command_line</span><span class="w"> </span><span class="o">/</span><span class="nv">opt</span><span class="o">/</span><span class="nv">local</span><span class="o">/</span><span class="nv">bin</span><span class="o">/</span><span class="nv">ct</span><span class="o">-</span><span class="nv">smartos</span><span class="o">-</span><span class="nv">vm</span><span class="w"> </span><span class="o">-</span><span class="nv">H</span><span class="w"> </span><span class="p">$</span><span class="nv">_HOSTGZHOST</span><span class="p">$</span><span class="w"> </span><span class="p">$</span><span class="nv">ARG2</span><span class="p">$</span><span class="w"> </span><span class="o">-</span><span class="nv">i</span><span class="w"> </span><span class="p">$</span><span class="nv">USER5</span><span class="p">$</span><span class="o">/</span><span class="nv">smartos</span><span class="o">-</span><span class="nv">age</span><span class="o">-</span><span class="nv">check</span><span class="o">-</span><span class="nv">key</span><span class="w"> </span><span class="p">$</span><span class="nv">HOSTNAME</span><span class="p">$</span>
<span class="p">}</span>
</code></pre></div>
<p>With:</p>
<ul>
<li><code>$_HOSTGZHOST$</code> having the Global Zone host</li>
<li><code>$ARG2$</code> being a placeholder for optional parameters (such as overriding the timelines)</li>
<li><code>$USER5$</code> pointing to our directory for storing ssh keys</li>
<li><code>$HOSTNAME$</code> the name of the VM to check</li>
</ul>
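<p>On the host side, the custom variable is just one more directive in the
host definition; a sketch (the host names are invented):</p>
<div class="codehilite"><pre><span></span><code>define host {
    use         generic-host
    host_name   ns1.example.net
    address     203.0.113.5
    parents     gz01.example.net
    _GZHOST     gz01.example.net
}
</code></pre></div>
<p>Nagios exposes host custom variables to commands with a <code>$_HOST</code>
prefix, which is where <code>$_HOSTGZHOST$</code> comes from.</p>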
<h2>Results</h2>
<p>In the end, I found a few more out-of-date systems than I was expecting,
including one I could have sworn I'd refreshed just earlier this week.
So, I'm pretty happy with the system.</p>
Subtasks and Redirection2023-05-06T07:30:00-04:002023-05-06T07:30:00-04:00Gaige B. Paulsentag:www.gaige.net,2023-05-06:/subtasks-and-redirection.html<h2>Background</h2>
<p>As part of an ongoing effort to keep Cartographica up to date with recent
changes in libraries that we compile from source, notably
<a href="https://gdal.org">GDAL</a> and <a href="https://proj.org">Proj</a>, I'm in the midst
of a refresh of those <a href="https://www.gaige.net/git-subtrees-for-perforce-users.html">subtrees</a>
in the frameworks that I build from them. Over the past few years …</p><h2>Background</h2>
<p>As part of an ongoing effort to keep Cartographica up to date with recent
changes in libraries that we compile from source, notably
<a href="https://gdal.org">GDAL</a> and <a href="https://proj.org">Proj</a>, I'm in the midst
of a refresh of those <a href="https://www.gaige.net/git-subtrees-for-perforce-users.html">subtrees</a>
in the frameworks that I build from them. Over the past few years, both
of these projects have expanded test coverage and modernized their build
architectures (using <a href="http://cmake.org">CMake</a>) and I've improved validation
and coverage by integrating these tests into my Xcode build environment.</p>
<p>Up until the Cartographica 1.6 release, where I made available the
<a href="https://blog.cartographica.com/command-line-tools-for-gdal-and-proj.html">Command Line Tools for GDAL and PROJ</a>,
I didn't have a way to do acceptance testing on the final product,
so I integrated these tests into the unit tests for the frameworks.</p>
<h2>A Problem with Shell Redirection</h2>
<p>Many of the tests, for PROJ especially, involve invoking
CLI commands with a set of parameters and validating that the exact results
are as expected. In order to support this, I created an Objective-C class
that spawns a <code>/bin/sh</code> shell (although the specific flavor doesn't seem
to have much effect on the problem) using the executable-bit-marked shell
script as <em>arg0</em> with the necessary arguments and environment variables in
place.</p>
<p>This has worked well since I built this structure in 2014. However, the most
recent updates elicited failures because the diffs in the tests stopped matching.
First check was to run the test manually, which resulted in... success.
That was a bit unexpected, since I'm running the same commands in effectively
the same environment in both cases...but, of course, it is not quite the same.</p>
<p>To run the tests from within the <a href="https://developer.apple.com/documentation/xctest">XCTest</a>
structure, I am running in code, and that means that I need to spawn the
task using a sub-shell, which in my case involves spawning an
<a href="https://developer.apple.com/documentation/foundation/nstask/"><code>NSTask</code></a>, and
waiting for it to complete in order to gather the results.</p>
<p>Looking at the results, the key difference is that when run in my <code>NSTask</code>, the
redirection of the <code>stdout</code> and <code>stderr</code> to the same location <em>in the script</em>
works differently than it does from the command line. When run from the command
line, they are separately buffered, causing the results to appear as:</p>
<div class="codehilite"><pre><span></span><code>Attempt to use coordinate operation Inverse of WGS 84 to EGM2008 height (1) failed.
49 2 0 <span class="gs">* *</span> inf
</code></pre></div>
<p>When run inside of the <code>NSTask</code>, the results are less useful:</p>
<div class="codehilite"><pre><span></span><code><span class="mf">49</span><span class="w"> </span><span class="mf">2</span><span class="w"> </span><span class="mf">0</span><span class="w"> </span><span class="n">Attempt</span><span class="w"> </span><span class="kr">to</span><span class="w"> </span><span class="n">use</span><span class="w"> </span><span class="n">coordinate</span><span class="w"> </span><span class="n">operation</span><span class="w"> </span><span class="n">Inverse</span><span class="w"> </span><span class="n">of</span><span class="w"> </span><span class="n">WGS</span><span class="w"> </span><span class="mf">84</span><span class="w"> </span><span class="kr">to</span><span class="w"> </span><span class="n">EGM2008</span><span class="w"> </span><span class="n">height</span><span class="w"> </span><span class="p">(</span><span class="mf">1</span><span class="p">)</span><span class="w"> </span><span class="n">failed</span><span class="mf">.</span>
<span class="o">*</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">inf</span>
</code></pre></div>
<p>The code for the underlying command echoes the initial coordinates
(<code>49 2 0 </code>) to <code>stdout</code> before the error occurs, then sends
the error to <code>stderr</code>, and then continues printing the result to
<code>stdout</code>, ending with the <code>\n</code> that signals EOL and flushes
the buffer.</p>
<p>It's not at all clear why the buffering behaves differently when the
script is executed from within the shell directly rather than from within the
<code>NSTask</code>. In this case, the actual redirection happens as part of the
script and not as part of the original shell from which the script is being run. I speculate
that there's some kind of default handling that is getting passed through to the
script from the original shell, and when I use <code>NSTask</code> it is coming from there instead.</p>
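<p>One plausible mechanism (speculation on my part): stdio chooses line
buffering only when a stream is attached to a terminal and fully buffers
otherwise, so what the parent process attaches to the descriptors can change
the interleaving. A trivial probe along those lines:</p>
<div class="codehilite"><pre><span></span><code>#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* a line-buffered stdout flushes at each \n; a pipe-backed stdout is
       fully buffered and flushes only when the buffer fills or at exit */
    printf("stdout is%s a tty\n", isatty(fileno(stdout)) ? "" : " not");
    return 0;
}
</code></pre></div>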
<p>The test-jig code itself is pretty straightforward:</p>
<div class="codehilite"><pre><span></span><code><span class="p">-</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span><span class="nf">runScriptTest:</span><span class="p">(</span><span class="bp">NSString</span><span class="o">*</span><span class="p">)</span><span class="nv">script</span><span class="w"> </span><span class="nf">withExecutable:</span><span class="p">(</span><span class="bp">NSString</span><span class="o">*</span><span class="p">)</span><span class="nv">executable</span><span class="w"> </span><span class="nf">andArguments:</span><span class="p">(</span><span class="bp">NSArray</span><span class="o"><</span><span class="bp">NSString</span><span class="o">*>*</span><span class="w"> </span><span class="n">_Nullable</span><span class="p">)</span><span class="nv">userArguments</span>
<span class="p">{</span>
<span class="w"> </span><span class="bp">NSBundle</span><span class="w"> </span><span class="o">*</span><span class="n">testBundle</span><span class="w"> </span><span class="o">=</span><span class="p">[</span><span class="bp">NSBundle</span><span class="w"> </span><span class="n">bundleForClass</span><span class="o">:</span><span class="w"> </span><span class="p">[</span><span class="nb">self</span><span class="w"> </span><span class="k">class</span><span class="p">]];</span>
<span class="w"> </span><span class="bp">NSString</span><span class="w"> </span><span class="o">*</span><span class="n">executablePath</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">[</span><span class="n">testBundle</span><span class="w"> </span><span class="n">pathForAuxiliaryExecutable</span><span class="o">:</span><span class="w"> </span><span class="n">executable</span><span class="p">];</span>
<span class="w"> </span><span class="n">XCTAssertNotNil</span><span class="p">(</span><span class="w"> </span><span class="n">executablePath</span><span class="p">,</span><span class="w"> </span><span class="s">@"Need executable %@"</span><span class="p">,</span><span class="w"> </span><span class="n">executable</span><span class="p">);</span>
<span class="w"> </span><span class="bp">NSString</span><span class="w"> </span><span class="o">*</span><span class="n">scriptPath</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">[</span><span class="n">testBundle</span><span class="w"> </span><span class="n">pathForResource</span><span class="o">:</span><span class="w"> </span><span class="n">script</span><span class="w"> </span><span class="n">ofType</span><span class="o">:</span><span class="nb">nil</span><span class="p">];</span>
<span class="w"> </span><span class="n">XCTAssertNotNil</span><span class="p">(</span><span class="w"> </span><span class="n">scriptPath</span><span class="p">,</span><span class="w"> </span><span class="s">@"Need script %@"</span><span class="p">,</span><span class="w"> </span><span class="n">script</span><span class="p">);</span>
<span class="w"> </span>
<span class="w"> </span><span class="bp">NSString</span><span class="w"> </span><span class="o">*</span><span class="n">runDir</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">NSProcessInfo</span><span class="p">.</span><span class="n">processInfo</span><span class="p">.</span><span class="n">environment</span><span class="p">[</span><span class="s">@"TMPDIR"</span><span class="p">];</span>
<span class="w"> </span><span class="n">XCTAssertNotNil</span><span class="p">(</span><span class="w"> </span><span class="n">runDir</span><span class="p">,</span><span class="w"> </span><span class="s">@"Need runPath %@"</span><span class="p">,</span><span class="w"> </span><span class="n">script</span><span class="p">);</span>
<span class="w"> </span><span class="n">XCTAssertNotEqualObjects</span><span class="p">(</span><span class="n">runDir</span><span class="p">,</span><span class="w"> </span><span class="s">@"/"</span><span class="p">);</span>
<span class="w"> </span><span class="n">NSTask</span><span class="w"> </span><span class="o">*</span><span class="n">childTask</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">[[</span><span class="n">NSTask</span><span class="w"> </span><span class="n">alloc</span><span class="p">]</span><span class="w"> </span><span class="n">init</span><span class="p">];</span>
<span class="w"> </span>
<span class="w"> </span><span class="bp">NSArray</span><span class="w"> </span><span class="o">*</span><span class="n">arguments</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="l">@[</span><span class="n">scriptPath</span><span class="p">,</span><span class="w"> </span><span class="n">executablePath</span><span class="p">,</span><span class="w"> </span><span class="nb">self</span><span class="p">.</span><span class="n">nadPath</span><span class="l">]</span><span class="p">;</span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">userArguments</span><span class="p">)</span>
<span class="w"> </span><span class="n">arguments</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">[</span><span class="n">arguments</span><span class="w"> </span><span class="n">arrayByAddingObjectsFromArray</span><span class="o">:</span><span class="w"> </span><span class="n">userArguments</span><span class="p">];</span>
<span class="w"> </span><span class="n">childTask</span><span class="p">.</span><span class="n">arguments</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">arguments</span><span class="p">;</span>
<span class="w"> </span><span class="n">childTask</span><span class="p">.</span><span class="n">executableURL</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">[</span><span class="bp">NSURL</span><span class="w"> </span><span class="n">fileURLWithPath</span><span class="o">:</span><span class="w"> </span><span class="s">@"/bin/sh"</span><span class="p">];</span>
<span class="w"> </span><span class="n">childTask</span><span class="p">.</span><span class="n">currentDirectoryURL</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">[</span><span class="bp">NSURL</span><span class="w"> </span><span class="n">fileURLWithPath</span><span class="o">:</span><span class="w"> </span><span class="n">runDir</span><span class="p">];</span>
<span class="w"> </span><span class="n">childTask</span><span class="p">.</span><span class="n">environment</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">[</span><span class="nb">self</span><span class="w"> </span><span class="n">environmentWithResources</span><span class="p">];</span>
<span class="w"> </span>
<span class="w"> </span><span class="bp">NSError</span><span class="w"> </span><span class="o">*</span><span class="n">error</span><span class="p">;</span>
<span class="w"> </span><span class="n">XCTAssertTrue</span><span class="p">([</span><span class="n">childTask</span><span class="w"> </span><span class="n">launchAndReturnError</span><span class="o">:</span><span class="w"> </span><span class="o">&</span><span class="n">error</span><span class="p">],</span><span class="w"> </span><span class="s">@"Launch failed %@"</span><span class="p">,</span><span class="w"> </span><span class="n">error</span><span class="p">);</span>
<span class="w"> </span>
<span class="w"> </span><span class="p">[</span><span class="n">childTask</span><span class="w"> </span><span class="n">waitUntilExit</span><span class="p">];</span>
<span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">status</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">[</span><span class="n">childTask</span><span class="w"> </span><span class="n">terminationStatus</span><span class="p">];</span>
<span class="w"> </span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">status</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div>
<p>A minimal C program doesn't have any problem with this:</p>
<div class="codehilite"><pre><span></span><code><span class="cp">#include</span><span class="w"> </span><span class="cpf"><unistd.h></span>
<span class="cp">#include</span><span class="w"> </span><span class="cpf"><stdio.h></span>
<span class="cp">#include</span><span class="w"> </span><span class="cpf"><string.h></span>
#include <errno.h>   /* errno carries the reason when execle fails */
<span class="kt">int</span><span class="w"> </span><span class="nf">main</span><span class="p">(</span><span class="kt">int</span><span class="w"> </span><span class="n">argc</span><span class="p">,</span><span class="w"> </span><span class="kt">char</span><span class="w"> </span><span class="o">**</span><span class="n">argv</span><span class="p">)</span>
<span class="p">{</span>
<span class="w"> </span><span class="k">const</span><span class="w"> </span><span class="kt">char</span><span class="w"> </span><span class="o">*</span><span class="n">env</span><span class="p">[]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="s">"PROJ_DATA=Proj4Tests.xctest/Contents/Resources/for_tests"</span><span class="p">,</span>
<span class="w"> </span><span class="nb">NULL</span>
<span class="w"> </span><span class="p">};</span>
<span class="w"> </span><span class="kt">int</span><span class="w"> </span><span class="n">result</span><span class="p">;</span>
<span class="w"> </span><span class="n">result</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">execle</span><span class="p">(</span><span class="w"> </span><span class="s">"/bin/sh"</span><span class="p">,</span><span class="w"> </span><span class="s">"/bin/sh"</span><span class="p">,</span><span class="w"> </span><span class="s">"Proj4Tests.xctest/Contents/Resources/testvarious"</span><span class="p">,</span><span class="w"> </span><span class="s">"Proj4Tests.xctest/Contents/MacOS/cs2cs"</span><span class="p">,</span><span class="w"> </span><span class="s">"Proj4Tests.xctest/Contents/Resources/for_tests"</span><span class="p">,</span><span class="w"> </span><span class="nb">NULL</span><span class="p">,</span><span class="w"> </span><span class="n">env</span><span class="p">);</span>
    printf("result = %d (%s)\n", result, strerror(errno));  /* execle returns only on failure, leaving the reason in errno */
<span class="p">}</span>
</code></pre></div>
<h2>A solution by replacing NSTask</h2>
<p>Doing some further experimentation, I don't end up with interleaved output
from the subprocess if I use <code>posix_spawn</code> instead of spawning with <code>NSTask</code>.</p>
<p>Adapting my original code, this seems to work:</p>
<div class="codehilite"><pre><span></span><code><span class="p">-</span> <span class="p">(</span><span class="kt">int</span><span class="p">)</span><span class="nf">runScriptTest:</span><span class="p">(</span><span class="bp">NSString</span><span class="o">*</span><span class="p">)</span><span class="nv">script</span><span class="w"> </span><span class="nf">withExecutable:</span><span class="p">(</span><span class="bp">NSString</span><span class="o">*</span><span class="p">)</span><span class="nv">executable</span><span class="w"> </span><span class="nf">andArguments:</span><span class="p">(</span><span class="bp">NSArray</span><span class="o"><</span><span class="bp">NSString</span><span class="o">*>*</span><span class="w"> </span><span class="n">_Nullable</span><span class="p">)</span><span class="nv">userArguments</span>
<span class="p">{</span>
<span class="w"> </span><span class="bp">NSBundle</span><span class="w"> </span><span class="o">*</span><span class="n">testBundle</span><span class="w"> </span><span class="o">=</span><span class="p">[</span><span class="bp">NSBundle</span><span class="w"> </span><span class="n">bundleForClass</span><span class="o">:</span><span class="w"> </span><span class="p">[</span><span class="nb">self</span><span class="w"> </span><span class="k">class</span><span class="p">]];</span>
<span class="w"> </span><span class="bp">NSString</span><span class="w"> </span><span class="o">*</span><span class="n">executablePath</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">[</span><span class="n">testBundle</span><span class="w"> </span><span class="n">pathForAuxiliaryExecutable</span><span class="o">:</span><span class="w"> </span><span class="n">executable</span><span class="p">];</span>
<span class="w"> </span><span class="n">XCTAssertNotNil</span><span class="p">(</span><span class="w"> </span><span class="n">executablePath</span><span class="p">,</span><span class="w"> </span><span class="s">@"Need executable %@"</span><span class="p">,</span><span class="w"> </span><span class="n">executable</span><span class="p">);</span>
<span class="w"> </span><span class="bp">NSString</span><span class="w"> </span><span class="o">*</span><span class="n">scriptPath</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">[</span><span class="n">testBundle</span><span class="w"> </span><span class="n">pathForResource</span><span class="o">:</span><span class="w"> </span><span class="n">script</span><span class="w"> </span><span class="n">ofType</span><span class="o">:</span><span class="nb">nil</span><span class="p">];</span>
<span class="w"> </span><span class="n">XCTAssertNotNil</span><span class="p">(</span><span class="w"> </span><span class="n">scriptPath</span><span class="p">,</span><span class="w"> </span><span class="s">@"Need script %@"</span><span class="p">,</span><span class="w"> </span><span class="n">script</span><span class="p">);</span>
<span class="w"> </span>
<span class="w"> </span><span class="bp">NSString</span><span class="w"> </span><span class="o">*</span><span class="n">runDir</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="bp">NSProcessInfo</span><span class="p">.</span><span class="n">processInfo</span><span class="p">.</span><span class="n">environment</span><span class="p">[</span><span class="s">@"TMPDIR"</span><span class="p">];</span>
<span class="w"> </span><span class="n">XCTAssertNotNil</span><span class="p">(</span><span class="w"> </span><span class="n">runDir</span><span class="p">,</span><span class="w"> </span><span class="s">@"Need runPath %@"</span><span class="p">,</span><span class="w"> </span><span class="n">script</span><span class="p">);</span>
<span class="w"> </span><span class="n">XCTAssertNotEqualObjects</span><span class="p">(</span><span class="n">runDir</span><span class="p">,</span><span class="w"> </span><span class="s">@"/"</span><span class="p">);</span>
<span class="w"> </span>
<span class="w"> </span><span class="bp">NSArray</span><span class="w"> </span><span class="o">*</span><span class="n">arguments</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="l">@[</span><span class="n">scriptPath</span><span class="p">,</span><span class="w"> </span><span class="n">executablePath</span><span class="p">,</span><span class="w"> </span><span class="nb">self</span><span class="p">.</span><span class="n">nadPath</span><span class="l">]</span><span class="p">;</span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">userArguments</span><span class="p">)</span>
<span class="w"> </span><span class="n">arguments</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">[</span><span class="n">arguments</span><span class="w"> </span><span class="n">arrayByAddingObjectsFromArray</span><span class="o">:</span><span class="w"> </span><span class="n">userArguments</span><span class="p">];</span>
<span class="w"> </span>
<span class="w"> </span><span class="k">const</span><span class="w"> </span><span class="kt">char</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">const</span><span class="w"> </span><span class="n">env</span><span class="p">[]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="p">[</span><span class="bp">NSString</span><span class="w"> </span><span class="n">stringWithFormat</span><span class="o">:</span><span class="w"> </span><span class="s">@"PROJ_DATA=%@"</span><span class="p">,</span><span class="w"> </span><span class="p">[</span><span class="nb">self</span><span class="p">.</span><span class="n">nadPath</span><span class="w"> </span><span class="n">stringByAppendingPathComponent</span><span class="o">:</span><span class="s">@"for_tests"</span><span class="p">]].</span><span class="n">UTF8String</span><span class="p">,</span>
<span class="w"> </span><span class="nb">NULL</span>
<span class="w"> </span><span class="p">};</span>
<span class="w"> </span>
<span class="w"> </span><span class="k">const</span><span class="w"> </span><span class="kt">char</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">const</span><span class="w"> </span><span class="n">args</span><span class="p">[]</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="s">"/bin/sh"</span><span class="p">,</span>
<span class="w"> </span><span class="n">scriptPath</span><span class="p">.</span><span class="n">UTF8String</span><span class="p">,</span>
<span class="w"> </span><span class="n">executablePath</span><span class="p">.</span><span class="n">UTF8String</span><span class="p">,</span>
<span class="w"> </span><span class="nb">self</span><span class="p">.</span><span class="n">nadPath</span><span class="p">.</span><span class="n">UTF8String</span><span class="p">,</span>
<span class="w"> </span><span class="nb">NULL</span>
<span class="w"> </span><span class="p">};</span>
<span class="w"> </span>
<span class="w"> </span><span class="n">pthread_chdir_np</span><span class="p">(</span><span class="n">runDir</span><span class="p">.</span><span class="n">UTF8String</span><span class="p">);</span>
<span class="w"> </span>
    pid_t pid;
    /* posix_spawn returns 0 on success or an error number (not -1/errno) */
    int err = posix_spawn( &pid, "/bin/sh", NULL, NULL, (char *const*)args, (char *const*)env);
    if (err != 0) {
        printf("posix_spawn failed: %s\n", strerror(err));
        return(-1);
    }
    printf("parent sees child %d\n", pid);
    int status;
    pid_t waited = waitpid(pid, &status, 0);
    if (waited<0) {
        if (errno == EINTR) {
            /* retry with the original pid, which the failed call must not clobber */
            waited = waitpid(pid, &status, 0);
        }
        if (waited<0) {
            printf("Error waiting %d %d\n", pid, errno);
            return(-2);
        }
    }
<span class="w"> </span><span class="n">printf</span><span class="p">(</span><span class="s">"child completed with %d</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span><span class="w"> </span><span class="n">status</span><span class="p">);</span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">status</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div>
<p>Two things of note here:</p>
<ul>
<li><code>pthread_chdir_np</code> is not a public method for macOS -- other third-party
applications, like Chrome, use it, but it's not sanctioned and could
go away (see the alternative sketched after this list). I'm less concerned
about this in a test jig than in code that would go to end users.</li>
<li>The little dance around <code>waitpid</code> being called twice is related to
receiving a signal, which I am pretty certain is <code>SIGCHLD</code> being sent.
However, I'm not comfortable ignoring it because I may not be the only
one spawning a child task.</li>
</ul>
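<p>As an aside on the first point: macOS 10.15 and later offer a chdir that
is scoped to the spawned child rather than the calling thread. It's still an
<code>_np</code> call, but at least it's declared in the public
<code>spawn.h</code>; a sketch:</p>
<div class="codehilite"><pre><span></span><code>#include <spawn.h>

posix_spawn_file_actions_t actions;
posix_spawn_file_actions_init(&actions);
/* the chdir happens in the child, leaving the caller's cwd alone */
posix_spawn_file_actions_addchdir_np(&actions, runDir.UTF8String);
int err = posix_spawn(&pid, "/bin/sh", &actions, NULL,
                      (char *const*)args, (char *const*)env);
posix_spawn_file_actions_destroy(&actions);
</code></pre></div>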
fp-concat Accuracy2023-05-06T05:53:00-04:002023-05-06T05:53:00-04:00Gaige B. Paulsentag:www.gaige.net,2023-05-06:/fp-concat-accuracy.html<p>My previous post about <a href="https://www.gaige.net/proj-floating-point-error-investigation.html">proj floating point investigation</a>
discussed an issue that I'd tracked down to the OS level. However, it's clear
that this relates to an underlying change to code compiled by Xcode (and/or
the <a href="https://llvm.org">LLVM toolchain</a> that it is built upon).</p>
<p>Based on a <a href="https://mjtsai.com/blog/2023/04/03/xcode-14-3/">post about Xcode …</a></p><p>My previous post about <a href="https://www.gaige.net/proj-floating-point-error-investigation.html">proj floating point investigation</a>
discussed an issue that I'd tracked down to the OS level. However, it's clear
that this relates to an underlying change to code compiled by Xcode (and/or
the <a href="https://llvm.org">LLVM toolchain</a> that it is built upon).</p>
<p>Based on a <a href="https://mjtsai.com/blog/2023/04/03/xcode-14-3/">post about Xcode 14.3</a>
in <a href="https://mjtsai.com/blog/">Michael Tsai's blog</a>, I started looking at a
change in the compiler around the handling of <code>fp-contract</code>, which controls
the use of contractions when optimizing floating point operations.</p>
<p>There's certainly been some <a href="https://discourse.llvm.org/t/fp-contraction-fma-on-by-default/64975">debate about the change</a>
introduced in Clang 14 to the handling of the
<code>fp-contract</code> flag, and I'm not going to take a stand on which behavior is "better",
but I am going to note that the change was unexpected (in a minor release)
and had notable, if not significant, effects on floating point handling in
Cartographica.</p>
<p>The change in behavior means that Xcode 14.3 (Clang 14) by default
contracts a floating point multiply followed by an add into a single
fused multiply-add instruction, which eliminates the intermediate rounding
between the two operations. Although this likely makes the calculations
more accurate, it runs the risk of diverging from existing results
and can create complexities in testing.</p>
<p>In practice, I haven't found a large number of differences, but in some
cases, there are variances that are causing some difficulties in test
management.</p>
<p>The change itself was in how Clang handles the <code>fp-contract</code>
(floating-point contraction) flag, bringing it more into alignment with the
C/C++ standards. The details of the flag handling are a bit esoteric,
so I'll leave those to the reader, but there are a
<a href="https://clang.llvm.org/docs/UsersManual.html#controlling-floating-point-behavior">variety of options</a>
for the setting, and the default changed between Clang 13
and Clang 14, effectively moving from <code>off</code> to <code>on</code>.</p>
<p>If you want to see an illustration of what this does from a code generation
perspective, there's a good <a href="https://godbolt.org/z/WK8zMzq8s">comparison using godbolt</a>
between default Clang 13 and Clang 14, as well as Clang 14 with
<code>-ffp-contract=off</code>, showing the behavior change.
(
<a href="https://godbolt.org/#z:OYLghAFBqd5TKALEBjA9gEwKYFFMCWALugE4A0BIEAZgQDbYB2AhgLbYgDkAjF%2BTXRMiAZVQtGIHgBYBQogFUAztgAKAD24AGfgCsp5eiyahUAUgBMAIUtXyKxqiIEh1ZpgDC6egFc2TEABmAA5ydwAZAiZsADk/ACNsUikAdnIAB3QlYhcmL19/INDM7OchSOi4tkTknjSHbCdckSIWUiJ8vwCQ%2B2xHMqYWtqIK2ISk1PtW9s7CnqVpkaix6om6gEp7dB9SVE4uGnp0FiIAahV0tpPsWiOT05ZyU8Pjs/j107MUmy0AQVOAQ8AFRmQIAEXioJ%2B/0BLFsoLBgShZj%2BgNOpGwRB2TAeyL%2BXzBKN%2BRJe92yJkYAH0FtcOMJbq8Hk9SW8Pl9oWiMVjSDiWED4p9rKckYFoQSiQB6CWnYCoVAgZ7pAC0GGEpBYTgRQi1NBon3BpyY6FOqqI6qaQgAdJLpWjAQBxDweU6YdDYJRMMBcM4AayNAHdTkh0IGSKd6AQ2MQTfJzQNTmGotkcOdWkRsHSiEpzsbLBZtRYLKc2D4Fi6CEoWPFGKdiDaZXKFTgaCwfPQiAiW2WJP6WABPbOmuNEkd/KUmowmBU0ZVDjUd8HoXWAisJpDYF3YFttjtj6WoSegRUq2PzrU4gEKudOU7%2B4hIKIPVO05i737jg/GI8zk9qs/grszgVCMfQ3R0PC4TZ6G4ABWfgAi4HRyHQbgPHhIUlG2XYN0sQI%2BHIIhtEgzYfRAaRpEtCwAE5C2CGCeC0QJGOkFI0mgrhpH4NgQBgrRyAQpCUK4fglBAPjCMQyDyDgWAUAwNh0gYJJKGoeTFMYZJPxMBitD4nAADcCD2AA1AhsH9AB5dJmG4fC6HbJJRIgeIiPIeIojaPtbP4dzWFIPsLPiXRGgk/D5MzCymHoLzJPIHA2C/SRYsIDEmn091XOwdRGh8dNXKidN2KQiN4nVfyvBwbyCNISNvM2Q4WGAJRTPMqybN4fhBGEMQJE4GQ5GEZQ1E0WL9B4QwvzQdC7BK0TIE2dB0gGUSuCVdRUFOJULMCTaEt2JAESUH0%2ByMUCRL6ELcjcJhPG8LoDAiZYqhqAwShyIRZgCca3oGUZnrWXp%2BmaRZPoMBoLUGRY/vGWopmGUHxppdpodWWpNkwnY9ikKDYPg1yhNOdRggANiVYnpAnL9TgYy0tFp04IHwYgyEFPD1n4CSdHWTZ1xYHBkggHGOK4ni%2BIE/ghJEsSCKI7nyFImCKOJiwQmCHhlekYILC0YJQnYwI8diyWZckzYZIQeAIDk9AFKUigqAgNS7ZAYAaLi7BDJMszLOshC7IYdNSCclzYt8zyqrD/zAuCpwqvC19Iui1z4sS/YkJSy70pWpCspyvLYoKvpXJKsq%2Bwq/Z8LNWqOvqowmpan32rs%2BQeskfqusUFQNFc/QLAmqdzGsWxDAIeI5sF5CltyFa1o2radqVPbUAO8EjpOlgzsBy7XAgdwEbCG6UZe76snevI7sKE/SlyI%2BAfBgYhhmC%2Bvq3iHH6WSoYbBkHn%2B/4Zb7RlsTGfUhZwX4vjbghMSZkwprKDaVFKIMyZiQUgrMeDsxNlzHm2A%2BYTAnuxTi5BuK8XAUbbgUtxKy2kogK2aAbbqWUg7J2Gk0CHhkLpd2ntsANzan7TqAdHLUBDkhSOMV8KiOjiFOONsIpRRiunDMqdk4EFSs4LOmVsqoFymnfghcir8BLp5cuVUq7cRrgIOuzVva8Kqh3VufVZAdyGt3UaQR%2B6mGmiPMe8AFpTyEDPdam1tq7TaMvQ6x1TrYHOkDHee9f7jUep/VGr1T4DH3j9G%2BT0v6Iwum/H%2BBQX732Bv/LJyTEb5PuuUkpSTj7oywljdBhhcakMEpAompNyaUxMNTCidMtBIMICgtBGDObEXILzfm1ASJkRgpaQIKRiY8B4MEeZ9EUhUR4BYGCTThZENFi0iW5D7DS1GXLUi0hZnzMWcs1ZdQNlbJ2QbA5yEjmnKFhYQ2rThKYLGelIOV1pBAA%3D%3D">longer godbolt link here if the short one ever goes stale</a>).</p>
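<p>As a minimal local sketch of the same comparison (my own example, not the exact
godbolt source), a function like the following can be compiled once with the defaults
and once with <code>-ffp-contract=off</code> to see the difference:</p>
<pre><code>/* clang -O2 -S muladd.c                    - Clang 14 may emit a single FMA
   clang -O2 -ffp-contract=off -S muladd.c  - separate multiply and add     */
double muladd(double a, double b, double c) {
    /* with contraction on, the intermediate rounding of a*b disappears */
    return a * b + c;
}
</code></pre>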
<p>I'm still on the fence over whether this is actually a code problem or a test
problem, but at the moment, it's really feeling more like the latter.
The IEEE floating point standards define the fused multiply-add operation, which is
clearly intended to remove some error in combining floating point
operations while also improving speed.</p>
xcodes for xcode switching2023-04-29T15:50:00-04:002023-04-29T15:50:00-04:00Gaige B. Paulsentag:www.gaige.net,2023-04-29:/xcodes-for-xcode-switching.html<p>As part of digging through my various problems with Xcode 14.3 (Feedback
FB12154691, FB12154887, and some
<a href="https://www.gaige.net/proj-floating-point-error-investigation.html">test case issues involving floating point math</a>),
I needed to install Xcode 14.2 to move my buildfarm backwards.
Although this didn't entirely fix the problem, it was an essential
element of the …</p><p>As part of digging through my various problems with Xcode 14.3 (Feedback
FB12154691, FB12154887, and some
<a href="https://www.gaige.net/proj-floating-point-error-investigation.html">test case issues involving floating point math</a>),
I needed to install Xcode 14.2 to move my buildfarm backwards.
Although this didn't entirely fix the problem, it was an essential
element of the debugging and remediation.</p>
<p>For manual work, there's a great
<a href="https://xcodereleases.com">list of Xcode releases</a> available with
direct links to Apple's downloads and release notes.</p>
<p>Since I needed to do this across five different machines,
automation was on my mind, so I looked for the latest in tooling to
help with this install process.</p>
<p>The latest in this field is <a href="https://github.com/XcodesOrg/xcodes"><code>xcodes</code></a>,
open source tooling for installing one or more copies of Xcode and
switching between them.</p>
<h2><code>xcodes</code> Commands</h2>
<ul>
<li>Use <code>xcodes installed</code> to find out which versions are installed</li>
<li>Use <code>xcodes install XX.YY</code> to install a specific version</li>
<li>Use <code>xcodes select XX.YY</code> to make the specified version the default</li>
<li>Use <code>xcodes uninstall XX.YY</code> to uninstall a specific version</li>
</ul>
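<p>For example, a typical sequence on a build machine might look like this
(the version numbers are illustrative):</p>
<pre><code>xcodes installed        # list the versions already on this machine
xcodes install 14.2     # install a specific version
xcodes select 14.2      # make it the active Xcode
</code></pre>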
<h2>Minimizing traffic and logins</h2>
<p>To retrieve the Xcode installer from Apple, you need to be logged in
to a developer account, and that means credentials. Coordinating that
with automation is painful, and would also mean pulling every
installer I need (each time I need it) across the internet.</p>
<p>As I'll be automating installation on multiple systems, I decided that
I'd cache the items that I need in order to save time and bandwidth.</p>
<p>To download the packages into your cache:</p>
<pre><code>xcodes download --directory CACHE_DIR 13.2.1
</code></pre>
<h2>Automating installation</h2>
<p>For installation (via Ansible), I'm using the following
(assuming <code>item.version</code> contains the version on input):</p>
<div class="codehilite"><pre><span></span><code><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">copy xcode installer</span>
<span class="w"> </span><span class="nt">copy</span><span class="p">:</span>
<span class="w"> </span><span class="nt">src</span><span class="p">:</span><span class="w"> </span><span class="s">"{{</span><span class="nv"> </span><span class="s">xcode_cache</span><span class="nv"> </span><span class="s">}}/{{</span><span class="nv"> </span><span class="s">item.package</span><span class="nv"> </span><span class="s">}}"</span>
<span class="w"> </span><span class="nt">dest</span><span class="p">:</span><span class="w"> </span><span class="s">"{{</span><span class="nv"> </span><span class="s">root_home</span><span class="nv"> </span><span class="s">}}/Downloads/{{</span><span class="nv"> </span><span class="s">item.package</span><span class="nv"> </span><span class="s">}}"</span>
<span class="w"> </span><span class="nt">owner</span><span class="p">:</span><span class="w"> </span><span class="s">"{{</span><span class="nv"> </span><span class="s">owner</span><span class="nv"> </span><span class="s">}}"</span>
<span class="w"> </span><span class="nt">group</span><span class="p">:</span><span class="w"> </span><span class="s">"{{</span><span class="nv"> </span><span class="s">group</span><span class="nv"> </span><span class="s">}}"</span>
<span class="w"> </span><span class="nt">mode</span><span class="p">:</span><span class="w"> </span><span class="s">'0644'</span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="s">"install</span><span class="nv"> </span><span class="s">xcode</span><span class="nv"> </span><span class="s">{{</span><span class="nv"> </span><span class="s">item.version</span><span class="nv"> </span><span class="s">}}"</span>
<span class="w"> </span><span class="nt">command</span><span class="p">:</span>
<span class="w"> </span><span class="nt">cmd</span><span class="p">:</span><span class="w"> </span><span class="s">"xcodes</span><span class="nv"> </span><span class="s">install</span><span class="nv"> </span><span class="s">--experimental-unxip</span><span class="nv"> </span><span class="s">--path</span><span class="nv"> </span><span class="s">{{</span><span class="nv"> </span><span class="s">root_home</span><span class="nv"> </span><span class="s">}}/Downloads/{{</span><span class="nv"> </span><span class="s">item.package</span><span class="nv"> </span><span class="s">}}</span><span class="nv"> </span><span class="s">{{</span><span class="nv"> </span><span class="s">item.version</span><span class="nv"> </span><span class="s">}}"</span>
<span class="w"> </span><span class="nt">become</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">remove installer</span>
<span class="w"> </span><span class="nt">file</span><span class="p">:</span>
<span class="w"> </span><span class="nt">path</span><span class="p">:</span><span class="w"> </span><span class="s">"{{</span><span class="nv"> </span><span class="s">root_home</span><span class="nv"> </span><span class="s">}}/.Trash/{{</span><span class="nv"> </span><span class="s">item.package</span><span class="nv"> </span><span class="s">}}"</span>
<span class="w"> </span><span class="nt">state</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">absent</span>
<span class="w"> </span><span class="nt">ignore_errors</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span>
</code></pre></div>
<p>This uses <code>xcodes</code> to install from the locally-copied <code>xip</code> file
rather than downloading it again (along with enabling the experimental fast unxip code).</p>
<p>This code is called using <code>include_tasks</code> from a loop in my main
ansible ci-bot file that installs appropriate versions:</p>
<div class="codehilite"><pre><span></span><code><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">determine current xcodes</span>
<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">shell</span><span class="p p-Indicator">:</span>
<span class="w"> </span><span class="nt">cmd</span><span class="p">:</span><span class="w"> </span><span class="s">"xcodes</span><span class="nv"> </span><span class="s">installed</span><span class="nv"> </span><span class="s">|</span><span class="nv"> </span><span class="s">cut</span><span class="nv"> </span><span class="s">-f1</span><span class="nv"> </span><span class="s">-d'</span><span class="nv"> </span><span class="s">'"</span>
<span class="w"> </span><span class="w w-Error"> </span><span class="nt">register</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">xcodes_installed</span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">install xcode if missing</span>
<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">include_tasks</span><span class="p p-Indicator">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">ci-xcode.yml</span>
<span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">loop</span><span class="p p-Indicator">:</span><span class="w"> </span><span class="s">"{{</span><span class="nv"> </span><span class="s">xcode_versions</span><span class="nv"> </span><span class="s">}}"</span>
<span class="w"> </span><span class="nt">when</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">item.version not in xcodes_installed.stdout_lines</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">ansible_distribution_version >= item.min_os</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">item.max_os is not defined or (ansible_distribution_version < item.max_os )</span>
</code></pre></div>
<p>The first task gets a list of current xcodes (to keep from
reinstalling) and then installs only if it's not already
installed and the version is appropriate for the version
of macOS that we're installing on.</p>
<p><code>xcode_versions</code> looks like this:</p>
<div class="codehilite"><pre><span></span><code><span class="nt">xcode_versions</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">version</span><span class="p">:</span><span class="w"> </span><span class="s">'13.2.1'</span>
<span class="w"> </span><span class="nt">package</span><span class="p">:</span><span class="w"> </span><span class="s">'Xcode-13.2.1+13C100.xip'</span>
<span class="w"> </span><span class="nt">min_os</span><span class="p">:</span><span class="w"> </span><span class="s">'11.3.0'</span>
<span class="w"> </span><span class="nt">max_os</span><span class="p">:</span><span class="w"> </span><span class="s">'12.0.0'</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">version</span><span class="p">:</span><span class="w"> </span><span class="s">'14.3'</span>
<span class="w"> </span><span class="nt">package</span><span class="p">:</span><span class="w"> </span><span class="s">'Xcode-14.3.0+14E222b.xip'</span>
<span class="w"> </span><span class="nt">min_os</span><span class="p">:</span><span class="w"> </span><span class="s">'13.0.0'</span>
<span class="c1"># intentionally out-of-order because 14.2 is preferred right now</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">version</span><span class="p">:</span><span class="w"> </span><span class="s">'14.2'</span>
<span class="w"> </span><span class="nt">package</span><span class="p">:</span><span class="w"> </span><span class="s">'Xcode-14.2.0+14C18.xip'</span>
<span class="w"> </span><span class="nt">min_os</span><span class="p">:</span><span class="w"> </span><span class="s">'12.5.0'</span>
</code></pre></div>
Proj Floating Point Error Investigation2023-04-29T09:27:00-04:002023-04-29T09:27:00-04:00Gaige B. Paulsentag:www.gaige.net,2023-04-29:/proj-floating-point-error-investigation.html<h2>TL;DR</h2>
<p>macOS 13.3 or 13.3.1 incorporated a change that is affecting calculations in
proj for applications running on those versions of the OS. The change appears
to be relatively subtle, only affecting a single test in a single projection
and only on x86_64, not arm, but …</p><h2>TL;DR</h2>
<p>macOS 13.3 or 13.3.1 incorporated a change that is affecting calculations in
proj for applications running on those versions of the OS. The change appears
to be relatively subtle, only affecting a single test in a single projection
and only on x86_64, not arm, but nonetheless resulting in a failure on macOS
in the standard gie tests for <a href="https://github.com/OSGeo/PROJ">proj</a>.</p>
<p>Since the tests work successfully on other platforms and on versions of
macOS prior to 13.3, I'm going to assume for now that it is a variation
in floating point execution that is peculiar to macOS and only on
Intel CPUs. I did file an <a href="https://github.com/OSGeo/PROJ/issues/3714">issue</a>
with the project in case this turns out not to be a bug in macOS.</p>
<h2>Investigating</h2>
<p>Last weekend, I decided to finally upgrade my macOS build farm, which I'd
been putting off because it would require 5 macOS upgrades and
5 Xcode upgrades. At this point, my automation for the former is inadequate,
so I usually do that process by hand. The latter I had similarly done
by hand (until moving that work to <code>xcodes</code> this week).</p>
<p>Unfortunately, I hadn't upgraded my desktop MacPro to 13.3.1 prior to
doing the upgrades in the build farm, so I didn't know what kind of fun
I had in store.</p>
<p>Once I'd finished the upgrades, I ran a test build across the build farm
and the Intel-based Ventura (13.3.1) machine was the only one that was
failing. I went back to my MacPro (still running 13.2.1) and I had no
problems. A few days later, I upgraded the MacPro to 13.3.1, and
suddenly, it was causing the same problem. At this point, I've isolated
the problem (somewhat accidentally) to Intel-based macOS 13.3.1 platforms.</p>
<p>As a side note, testing in this case was made much easier by the fact
that I'd released the proj and gdal
<a href="https://blog.cartographica.com/command-line-tools-for-gdal-and-proj.html">CLI tools</a>
with the last major release of Cartographica, meaning I could run the
functional portion of the tests without having to use xctest.</p>
<p>I was having some other problems with Xcode 14.3, so I decided to
back down the Xcode version thinking that may be causing my problem.
(some more on that is detailed in
<a href="https://www.gaige.net/xcodes-for-xcode-switching.html">Xcodes for Xcode Switching</a>).</p>
<p>After downgrading to Xcode 14.2, the aforementioned inaccuracy was
still happening, which seemed to rule out Xcode as the cause of
<em>this</em> problem.</p>
<p>Unfortunately, I'd already upgraded all of my macOS 13.2 machines to
13.3, so I decided to create a VM to run the regression tests against
macOS 13.2. Thanks to the <a href="https://mrmacintosh.com/macos-ventura-13-full-installer-database-download-directly-from-apple/">list of Apple macOS Downloads for Ventura</a>,
I was able to download the installer and create a VM under
<a href="https://www.parallels.com">Parallels</a> to do my tests. It took a bit
of time, but once operating, I could confirm the problems were only
happening on macOS 13.3.1 on Intel-based CPUs.</p>
<p>I'll post an end to this story when it happens, but it's a bit frightening
to see floating point changes coming in without significant notification
in minor macOS updates.</p>
<p>I did find 2 potential work-arounds, which I detailed in my issue with
the Proj owners above. I'll post a follow-up article when the final
issue is resolved.</p>
Renovating GitLab registries2023-04-23T09:35:00-04:002023-04-23T09:35:00-04:00Gaige B. Paulsentag:www.gaige.net,2023-04-23:/renovating-gitlab-registries.html<p>I've already written a bit about using renovate to keep dependencies current using
Renovate On Prem in
<a href="https://www.gaige.net/renovating-gitlab-repos.html">Renovating GitLab Repos</a>.
This has been working well. However, there are a couple of twists
that I figured I'd document in the event that people run into them.</p>
<p>For single-repositories with public dependencies …</p><p>I've already written a bit about using renovate to keep dependencies current using
Renovate On Prem in
<a href="https://www.gaige.net/renovating-gitlab-repos.html">Renovating GitLab Repos</a>.
This has been working well. However, there are a couple of twists
that I figured I'd document in the event that people run into them.</p>
<p>For single repositories with public dependencies, the default configuration works without
much tweaking. As I mentioned in my previous article, there are a few nuances for dealing
with git submodules and other dependency types that are served by gitlab.</p>
<p>I noticed this first with the <code>git-submodules</code> module, basically that it wasn't
authenticating and thus wasn't able to determine updates for self-hosted submodules.
Additionally, as I expanded use to other repositories, I noticed that checking
gitlab-hosted helm charts (<code>helm</code> module) and gitlab-hosted docker containers
(<code>docker</code> module) were also failing. In these cases, it is unclear (even with
debugging on) whether the token auth was being used due to the prior <code>matchHost</code>
records or not. However, I was able to confirm that, for the docker registries at least,
I couldn't log in with a bearer token, and I'm assuming a similar problem was at play
with the helm repository.</p>
<p>The fix in my configuration was a <code>hostRules</code> array with a
set of <code>matchHost</code> directives which are used to
map the authentication mechanisms to specific hosts.</p>
<div class="codehilite"><pre><span></span><code><span class="nt">"hostRules"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">"matchHost"</span><span class="p">:</span><span class="w"> </span><span class="s2">"{{ requiredEnv "</span><span class="err">CI_SERVER_HOST</span><span class="s2">" }}"</span><span class="p">,</span><span class="w"> </span><span class="nt">"token"</span><span class="p">:</span><span class="s2">"{{ requiredEnv "</span><span class="err">RENOVATE_GITLAB_TOKEN</span><span class="s2">" }}"</span><span class="w"> </span><span class="p">},</span>
<span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">"matchHost"</span><span class="p">:</span><span class="w"> </span><span class="s2">"{{ requiredEnv "</span><span class="err">CI_SERVER_HOST</span><span class="s2">" }}"</span><span class="p">,</span><span class="w"> </span><span class="nt">"hostType"</span><span class="p">:</span><span class="w"> </span><span class="s2">"docker"</span><span class="p">,</span><span class="w"> </span><span class="nt">"username"</span><span class="p">:</span><span class="w"> </span><span class="s2">"token"</span><span class="p">,</span><span class="w"> </span><span class="nt">"password"</span><span class="p">:</span><span class="s2">"{{ requiredEnv "</span><span class="err">RENOVATE_GITLAB_TOKEN</span><span class="s2">" }}"</span><span class="w"> </span><span class="p">},</span>
<span class="w"> </span><span class="p">{</span><span class="w"> </span><span class="nt">"matchHost"</span><span class="p">:</span><span class="w"> </span><span class="s2">"{{ requiredEnv "</span><span class="err">CI_SERVER_HOST</span><span class="s2">" }}"</span><span class="p">,</span><span class="w"> </span><span class="nt">"hostType"</span><span class="p">:</span><span class="w"> </span><span class="s2">"helm"</span><span class="p">,</span><span class="w"> </span><span class="nt">"username"</span><span class="p">:</span><span class="w"> </span><span class="s2">"token"</span><span class="p">,</span><span class="w"> </span><span class="nt">"password"</span><span class="p">:</span><span class="s2">"{{ requiredEnv "</span><span class="err">RENOVATE_GITLAB_TOKEN</span><span class="s2">" }}"</span><span class="w"> </span><span class="p">}</span>
<span class="p">]</span>
</code></pre></div>
<p>Originally, I'd expected that Renovate would create a default <code>hostRule</code>
based on the server and gitlab token. However, even if that is
the case for some items, it doesn't work for all of them.
I've reported this as a shortcoming, as I would
expect it to try the current token (basically what I'm forcing to happen here), but
it does not.</p>
<p>These three lines effectively match the <code>CI_SERVER_HOST</code> (the
gitlab server) for authentication by default to the <code>RENOVATE_GITLAB_TOKEN</code>
using a bearer token (hence the use of <code>token</code>) and then override
that for both the <code>docker</code> and <code>helm</code> repositories because they
require <code>username</code> and <code>password</code>.</p>
<p><strong>Warning:</strong> this does store the token in a clear-text configuration file
instead of using Kubernetes Secrets.</p>
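<p>One way around that (sketched here with placeholder names, not my exact setup) would be
to keep the token in a Kubernetes Secret and inject it into the container as an
environment variable, so only the <code>RENOVATE_GITLAB_TOKEN</code> reference appears in
configuration:</p>
<pre><code># hypothetical Secret wiring for the renovate container
env:
  - name: RENOVATE_GITLAB_TOKEN
    valueFrom:
      secretKeyRef:
        name: renovate-tokens   # placeholder Secret name
        key: gitlab-token       # placeholder key
</code></pre>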
Renovating Ansible2023-04-17T06:20:00-04:002023-04-17T06:20:00-04:00Gaige B. Paulsentag:www.gaige.net,2023-04-17:/renovating-ansible.html<p>Most of the system administration work that I do has been automated using
Ansible, as I've mentioned in posts here, including
<a href="https://www.gaige.net/deploying-with-gitlab.html">Deploying with GitLab</a>.</p>
<p>Now that I've got Renovate in place
(<a href="https://www.gaige.net/renovating-gitlab-repos.html">Renovating GitLab Repos</a>),
I am starting to look at how to expand beyond my existing automations
in order to …</p><p>Most of the system administration work that I do has been automated using
Ansible, as I've mentioned in posts here, including
<a href="https://www.gaige.net/deploying-with-gitlab.html">Deploying with GitLab</a>.</p>
<p>Now that I've got Renovate in place
(<a href="https://www.gaige.net/renovating-gitlab-repos.html">Renovating GitLab Repos</a>),
I am starting to look at how to expand beyond my existing automations
in order to let the computers do a bit more of the work.</p>
<p>This weekend's project involved experimenting with the ad hoc, regex-based
integrations with Renovate to enable renovating files that might not otherwise
be in a form that most dependency managers would recognize.</p>
<p>Conceptually, it makes a lot of sense. Renovate separates the ability to understand
different <em>datasources</em>, which provide data on new dependency versions, from <em>managers</em>,
which are used to determine which dependencies are used. This separation, and the
level of control and customization enabled by the configuration files, enables some
interesting use cases.</p>
<p>Thanks to documentation for
<a href="https://docs.renovatebot.com/modules/manager/regex/">Custom Manager Support using Regex</a>,
the implementation for my case was pretty straightforward.</p>
<h1>Example ansible build process (SmartOS GitLab Runner)</h1>
<p>Because Rob and I are using SmartOS, there are frequently components that need to be built
separately for the OS, because they aren't available in the package manager or are not
directly supported by vendors. One such example is the GitLab runner for SmartOS.</p>
<p>Soon after starting to use GitLab, I submitted a
<a href="https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/3053">MR</a> to fix some
incompatibilities with SmartOS. That was not accepted, but was expanded upon in
a <a href="https://gitlab.com/gitlab-org/gitlab-runner/-/merge_requests/3189">later MR</a>
fixing not only SmartOS, but a number of other Unix-style OSes.</p>
<p>Thankfully, the changes have remained stable and I've had no problem pulling them forward
with each release. However, since it's not a supported OS, there's no build for it.
As with many tools, I've been building this using a bespoke Ansible script in order to make
sure that I have the latest tools and environment.</p>
<p>In my playbook, I define the build version of GitLab runner using a variable, defined as:</p>
<div class="codehilite"><pre><span></span><code><span class="nt">vars</span><span class="p">:</span>
<span class="w"> </span><span class="nt">gitlab_runner_version</span><span class="p">:</span><span class="w"> </span><span class="s">'v15.10.1'</span>
</code></pre></div>
<p>Using the RegEx Manager, I was able to mark this dependency using a comment:</p>
<div class="codehilite"><pre><span></span><code><span class="nt">vars</span><span class="p">:</span>
<span class="w"> </span><span class="c1"># renovate: datasource=gitlab-tags depName=gitlab-org/gitlab-runner</span>
<span class="w"> </span><span class="nt">gitlab_runner_version</span><span class="p">:</span><span class="w"> </span><span class="s">'v15.10.1'</span>
</code></pre></div>
<p>by using a custom <code>renovate.json</code> file in the project:</p>
<div class="codehilite"><pre><span></span><code><span class="p">{</span>
<span class="w"> </span><span class="nt">"regexManagers"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"fileMatch"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"^*\\.yml$"</span><span class="p">],</span>
<span class="w"> </span><span class="nt">"matchStrings"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="s2">"renovate: datasource=(?<datasource>.*?) depName=(?<depName>.*?)( versioning=(?<versioning>.*?))?\\s.*_version: '(?<currentValue>.*)'\\s"</span>
<span class="w"> </span><span class="p">],</span>
<span class="w"> </span><span class="nt">"versioningTemplate"</span><span class="p">:</span><span class="w"> </span><span class="s2">"{{#if versioning}}{{{versioning}}}{{else}}semver{{/if}}"</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">]</span>
<span class="p">}</span>
</code></pre></div>
<p>This configuration pulls out the comment (starting with the <code>renovate:</code> string) and
then extracts the <code>datasource</code>, dependency name (<code>depName</code>), the current value of
the version (<code>currentValue</code>), and optionally the versioning scheme (<code>versioning</code>).</p>
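<p>For instance, a variable that exercises the optional <code>versioning</code> group would
look like this (the dependency here is a made-up example):</p>
<pre><code>vars:
  # renovate: datasource=github-tags depName=example-org/example-tool versioning=loose
  example_tool_version: 'v1.2.3'
</code></pre>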
<p>Once checked in, this is now recognized by Renovate and it now generates MRs as necessary:</p>
<p><img src="https://www.gaige.net/images/renovating-gitlab-runner.png" alt="MR for GitLab Runner" /></p>
<p>The MR is most of the battle, as I've got plenty of experience automating ansible through
GitLab.</p>
<p>Note: Yes, I realize that I'm not using the SmartOS runner to build the SmartOS runner
directly, and I may do that in the future. That's a project for another day, since it
requires refactoring further how I'm managing builds.</p>
Resurrecting old posts2023-04-12T04:28:00-04:002023-04-12T04:28:00-04:00Gaige B. Paulsentag:www.gaige.net,2023-04-12:/resurrecting-old-posts.html<p>Seemingly appropriate for the week after Easter, I've gone through some old draft posts
and decided to publish them.</p>
<p>A couple that were mostly ready and I decided to push out:</p>
<ul>
<li><a href="https://www.gaige.net/slathering-xcode-variants.html">Slathering Xcode Variants</a></li>
<li><a href="https://www.gaige.net/test-without-building-and-spm.html">Test without building and SPM</a></li>
</ul>
<p>And one that I did a bunch of additional work and …</p><p>Seemingly appropriate for the week after Easter, I've gone through some old draft posts
and decided to publish them.</p>
<p>A couple that were mostly ready and I decided to push out:</p>
<ul>
<li><a href="https://www.gaige.net/slathering-xcode-variants.html">Slathering Xcode Variants</a></li>
<li><a href="https://www.gaige.net/test-without-building-and-spm.html">Test without building and SPM</a></li>
</ul>
<p>And one that I did a bunch of additional work and brought it up to date with this week:</p>
<ul>
<li><a href="https://www.gaige.net/moving-selenium-tests-in-house.html">Moving Selenium tests in-house</a></li>
</ul>
Moving Selenium tests in-house2023-04-11T07:45:00-04:002023-04-11T07:45:00-04:00Gaige B. Paulsentag:www.gaige.net,2023-04-11:/moving-selenium-tests-in-house.html<p>Ed. Note: I started this article nearly a year ago, but got stuck on the Kubernetes piece.
Now that I've resolved that, I'm publishing it.</p>
<p>I've been a very happy user of <a href="https://saucelabs.com">SauceLabs</a>
for testing for many years. However, I don't
make a lot of use of it, and recently …</p><p>Ed. Note: I started this article nearly a year ago, but got stuck on the Kubernetes piece.
Now that I've resolved that, I'm publishing it.</p>
<p>I've been a very happy user of <a href="https://saucelabs.com">SauceLabs</a>
for testing for many years. However, I don't
make a lot of use of it, and recently I've been trying to figure out how to cut down on
my dependence on external SaaS services (and, for that matter, some external paid-for
software). As part of this move, I decided to look at how I could run my
<a href="https://www.selenium.dev">selenium</a> tests without any third-party services.</p>
<p>I've been running the Selenium tests locally as part of development testing for years,
which has worked reasonably well (although be careful doing automated tests with
Safari, it's not well suited for testing on a machine that's also in active use).
However, using Chrome has been fine, and that works great when I need to run a quick
re-test before committing code to the repo and letting the CI run.</p>
<h2>SauceLabs as a workhorse</h2>
<p>However, for CI, I've had to put together some pretty complex cases. Generally, I don't
like exposing test systems to the internet when not necessary (at least prior to the
stage phase). As such, I ran the django test jig on one of my systems (frequently in
docker) and used SauceLab's <a href="https://docs.saucelabs.com/secure-connections/sauce-connect/">Sauce Connect Proxy</a>
to proxy back to my docker container and access the server running on localhost.
Despite the network delays involved in round-tripping to California for every test
command and going through a bespoke VPN, the process worked well and, save occasional
errors in communication, was a reliable testing environment.</p>
<p>Now, SauceLabs has a lot to recommend it, and it's got a really nice UI for teams (or
for that matter individuals), but the pricing model has changed over the years. When
I first signed up, I was paying $50/month for access, and that's remained the same as
I'm on a Legacy plan. However, I've been hesitant to move off of it because the Legacy
plans aren't available any longer and the remaining plans are a bit pricey for my use
case, starting at $149/month. To be fair, their pricing <em>model</em> changed and they're now
offering unlimited minutes for each parallel test. However, I don't need unlimited minutes.
In fact, I looked around and noticed that there are some other providers that handle
per-minute pricing, so originally I looked at that. However, the problem came back to
having to expose my system to the internet, unless I wanted to put up a proxy and
authentication, which just seemed unnecessarily annoying.</p>
<h2>Enter the Selenium Docker container</h2>
<p>I hadn't really looked at this before because I wasn't running much docker
or any kubernetes. Generally, my only docker was run locally on my Mac as a useful
way to fire up a linux environment when necessary.</p>
<p>This all changed last summer when I <a href="https://www.gaige.net/docker-on-smartos.html">moved to GitLab</a>,
leading me not only to run docker containers directly on my SmartOS infrastructure,
but also opening the door to Debian-based docker environments for testing.</p>
<p>With docker already running and GitLab already spinning up docker containers (and
containers in Kubernetes as well these days), I had a chance to look at the Selenium
Docker containers as a tool once again.</p>
<p>All told, there was a bit of experimentation to get it working, but I managed to get
it functioning in the docker environment with some small tweaks. In particular, I needed
to add:</p>
<div class="codehilite"><pre><span></span><code>--docker-shm-size 2000000000
</code></pre></div>
<p>to my runner (setting up for 2GB of shared memory), and setting a feature flag during
the run in order to ensure networking was happy (below). Once that was running, things
worked very well.</p>
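<p>For reference, the equivalent setting in the runner's <code>config.toml</code>
(assuming the docker executor) looks like this, to my understanding:</p>
<pre><code>[runners.docker]
  shm_size = 2000000000
</code></pre>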
<p>I also tried getting it running in my k8s runner environment, where I had less success,
until I finally set out to finish this article (nearly a year later). In the intervening
time, there were posts about the chrome shared memory exhaustion that was plaguing my
execution.</p>
<p>I need to keep tweaking this, but from this
<a href="https://github.com/elgalu/docker-selenium/issues/20">issue</a> on docker-selenium, I got the
pointer I needed to a
<a href="https://stackoverflow.com/questions/46085748/define-size-for-dev-shm-on-container-engine/46434614#46434614">stack overflow answer</a>
which recommended:</p>
<div class="codehilite"><pre><span></span><code><span class="nt">spec</span><span class="p">:</span>
<span class="w"> </span><span class="nt">volumes</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">dshm</span>
<span class="w"> </span><span class="nt">emptyDir</span><span class="p">:</span>
<span class="w"> </span><span class="nt">medium</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">Memory</span>
<span class="w"> </span><span class="nt">containers</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">gcr.io/project/image</span>
<span class="w"> </span><span class="nt">volumeMounts</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">mountPath</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">/dev/shm</span>
<span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">dshm</span>
</code></pre></div>
<p>This effectively translates into the following runner requirement in gitlab:</p>
<div class="codehilite"><pre><span></span><code> [[runners.kubernetes.volumes.empty_dir]]
name = "empty-dir"
mount_path = "/dev/shm"
medium = "Memory"
</code></pre></div>
<p>In the end, once I was past the shared memory issue and had adjusted for the
pod networking (which requires localhost instead of the bridged network with
local DNS), I was able to stabilize the tests. And, it turns out they're a bit
faster than the pure docker environment tests.</p>
<p>Even so, the Kubernetes runs remained flaky enough that I won't use them in a
production environment.</p>
<p>The CI/CD configuration is similar to the docker configuration with a few changes. This will
be detailed in the next section.</p>
<h2>Selenium-test configuration</h2>
<div class="codehilite"><pre><span></span><code><span class="nt">selenium-test</span><span class="p">:</span>
<span class="w"> </span><span class="nt">tags</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">[</span><span class="nv">docker</span><span class="p p-Indicator">,</span><span class="nv">selenium</span><span class="p p-Indicator">]</span>
<span class="w"> </span><span class="nt">stage</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">test</span>
<span class="w"> </span><span class="nt">interruptible</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span>
<span class="w"> </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">python:3.11</span>
<span class="w"> </span><span class="nt">variables</span><span class="p">:</span>
<span class="w"> </span><span class="nt">FF_NETWORK_PER_BUILD</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">1</span>
<span class="w"> </span><span class="nt">GRID_URL</span><span class="p">:</span><span class="w"> </span><span class="s">"http://selenium__standalone-chrome:4444"</span>
<span class="w"> </span><span class="nt">services</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">selenium/standalone-chrome:4</span>
<span class="c1"># alias: selenium</span>
<span class="w"> </span><span class="nt">script</span><span class="p">:</span>
<span class="w"> </span><span class="c1"># following is used for internal/sauceconnect use</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">apt-get update</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">apt-get install -y --no-install-recommends curl</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">curl -sSL https://install.python-poetry.org | python3 -</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">export PATH="~/.local/bin:$PATH"</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">poetry install</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">poetry run coverage erase</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">mkdir -p output</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">curl $GRID_URL</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">poetry run coverage run --branch ./manage.py test selenium_tests</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">mv .coverage .coverage-selenium-${CI_JOB_NAME}</span>
<span class="w"> </span><span class="nt">artifacts</span><span class="p">:</span>
<span class="w"> </span><span class="nt">when</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">always</span>
<span class="w"> </span><span class="nt">paths</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">.coverage-selenium-${CI_JOB_NAME}</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">output</span>
<span class="w"> </span><span class="nt">reports</span><span class="p">:</span>
<span class="w"> </span><span class="nt">junit</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">reports/junit.xml</span>
</code></pre></div>
<p>This job uses the <code>python:3.11</code> container to run the tests, along with a service
container using the <code>selenium/standalone-chrome</code> image as a sidecar in the docker
environment to run the browser.</p>
<p>The <code>FF_NETWORK_PER_BUILD</code> setting provides a
<a href="https://docs.gitlab.com/runner/configuration/feature-flags.html">private network</a>,
which prevents problems from interaction between multiple builds theoretically
running on the same machine (in my case, this would only happen in kubernetes).</p>
<p>The <code>GRID_URL</code> points at the hostname that docker generates automatically for the
sidecar, <code>selenium__standalone-chrome</code>, at port <code>4444</code>.</p>
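<p>Concretely, a minimal sketch of the driver setup (assuming the Python selenium 4
bindings; this is not my exact test code):</p>
<pre><code>import os
from selenium import webdriver

# connect to the grid running in the sidecar container
options = webdriver.ChromeOptions()
driver = webdriver.Remote(
    command_executor=os.environ["GRID_URL"],
    options=options,
)
</code></pre>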
<p>Beyond the variables, the rest is bootstrapping and running the tests that require
selenium. Since I'd been using it with SauceLabs, I didn't have to do much to modify
the tests except make sure they're only using standard testing code (no need to send
telemetry information about builds to SauceLabs nowadays).</p>
<p>In this case, I load up poetry from the standard location, install it, set the path,
install my app with <code>poetry install</code>, and follow that with <code>poetry run coverage erase</code> to make
sure there's no stale coverage data.</p>
<p>Finally, I run the test through coverage, and rename the coverage results, so that they
don't overwrite the coverage results from my unit test step and can be combined.</p>
<p>Once the run is complete (regardless of the results), the coverage, output, and testing
reports are uploaded.</p>
<h3>Kubernetes modifications</h3>
<p>Since the kubernetes environment is a bit different, it uses a slightly modified
configuration, directly extending the docker configuration above:</p>
<div class="codehilite"><pre><span></span><code><span class="nt">kube-test</span><span class="p">:</span>
<span class="w"> </span><span class="nt">extends</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">selenium-test</span>
<span class="w"> </span><span class="nt">tags</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">[</span><span class="nv">docker</span><span class="p p-Indicator">,</span><span class="nv">kubernetes</span><span class="p p-Indicator">]</span>
<span class="w"> </span><span class="nt">variables</span><span class="p">:</span>
<span class="w"> </span><span class="nt">FF_NETWORK_PER_BUILD</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">1</span>
<span class="w"> </span><span class="nt">KUBERNETES_SERVICE_CPU_REQUEST</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">2</span>
<span class="w"> </span><span class="nt">KUBERNETES_SERVICE_MEMORY_REQUEST</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">4Gi</span>
<span class="w"> </span><span class="nt">GRID_URL</span><span class="p">:</span><span class="w"> </span><span class="s">"http://localhost:4444"</span>
</code></pre></div>
<p>The key configuration change here is the change in the <code>GRID_URL</code>, due to the difference
in network handling.</p>
<h2>Bonus round: combined coverage</h2>
<p>Since I've now got my unit tests and my UI tests running in the same pipeline,
I wanted to combine coverage, since the GitLab method of averaging doesn't really
represent full test coverage. If both suites covered the same code, averaging would
give too much credit; if one had much lower coverage, it wouldn't represent the total
either. So what we really need
to do is pull both sets of coverage information and merge them. I do this using the
<code>coverage combine</code> command, which is intended to combine multiple coverage files into the
same report. By doing this, the actual coverage of code is represented.</p>
<div class="codehilite"><pre><span></span><code><span class="nt">combine-coverage</span><span class="p">:</span>
<span class="w"> </span><span class="nt">tags</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">[</span><span class="nv">docker</span><span class="p p-Indicator">]</span>
<span class="w"> </span><span class="nt">stage</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">report</span>
<span class="w"> </span><span class="nt">interruptible</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span>
<span class="w"> </span><span class="nt">image</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">python:3.11</span>
<span class="w"> </span><span class="nt">needs</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">selenium-test</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">django-test</span>
<span class="w"> </span><span class="nt">script</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">python3 -m venv venv</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">source venv/bin/activate</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">pip3 install coverage</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">coverage combine .coverage-*</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">mkdir -p reports</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">coverage xml -o reports/coverage.xml</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">coverage html -d public</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="p p-Indicator">></span>
<span class="w"> </span><span class="no">grep ^\<coverage reports/coverage.xml</span>
<span class="w"> </span><span class="no">| sed -n -e 's/.*line-rate=\"\([0-9.]*\)\".*/\1/p'</span>
<span class="w"> </span><span class="no">| awk '{print "CodeCoverageOverall =" $1*100}'</span>
<span class="w"> </span><span class="no">|| true</span>
<span class="w"> </span><span class="nt">coverage</span><span class="p">:</span><span class="w"> </span><span class="s">'/^CodeCoverageOverall</span><span class="nv"> </span><span class="s">=(\d+\.\d+)$/'</span>
<span class="w"> </span><span class="nt">artifacts</span><span class="p">:</span>
<span class="w"> </span><span class="nt">when</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">always</span>
<span class="w"> </span><span class="nt">paths</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">public</span>
<span class="w"> </span><span class="nt">reports</span><span class="p">:</span>
<span class="w"> </span><span class="nt">coverage_report</span><span class="p">:</span>
<span class="w"> </span><span class="nt">coverage_format</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">cobertura</span>
<span class="w"> </span><span class="nt">path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">reports/coverage.xml</span>
</code></pre></div>
<h2>Overall results</h2>
<p>Reliability of running in pure docker has been about as good as I've seen with SauceLabs
over the years, with much better control, and substantially lower costs. If your
environment allows, I'd encourage making use of it.</p>
Renovating GitLab Repos2023-04-10T06:35:00-04:002023-04-10T06:35:00-04:00Gaige B. Paulsentag:www.gaige.net,2023-04-10:/renovating-gitlab-repos.html<p>Over the past week, I've been working on getting my various dependencies
up to date in my GitLab instance repositories. The tool I'm using is
<a href="https://renovatebot.com">Mend Renovate</a>, an open-source solution
by the folks at <a href="https://mend.io">Mend</a> (formerly WhiteSource).</p>
<p>Let me state up front that I don't love the
<a href="https://github.com/renovatebot/renovate/blob/main/license">license</a> here,
it's …</p><p>Over the past week, I've been working on getting my various dependencies
up to date in my GitLab instance repositories. The tool I'm using is
<a href="https://renovatebot.com">Mend Renovate</a>, an open-source solution
by the folks at <a href="https://mend.io">Mend</a> (formerly WhiteSource).</p>
<p>Let me state up front that I don't love the
<a href="https://github.com/renovatebot/renovate/blob/main/license">license</a> here,
it's AGPL (formerly MIT for versions prior to version 12.0.0), but for my purposes,
it's OK since I'm not planning on modifying it (other than potentially to submit bug
fixes or improvements) and I'm not providing the application as a service to others
(which is the key additional restriction).</p>
<p>Generally speaking, you're going to want a container environment for running
Renovate. Although you can run it using NPM, you'll need an environment
in which the code and dependencies are available for inspection, which means having all
of the dependency-retrieving tooling installed alongside it. For this, and
for isolation, I'm using containers.</p>
<p>Documentation for self-hosting Renovate in general is pretty extensive and is available for
<a href="https://docs.renovatebot.com/getting-started/running/#whitesource-renovate-on-premises">Self-hosted Mend Renovate</a>
online. Fair warning: there are a <em>lot</em> of knobs on this software.</p>
<h1>Bootstrapping renovate</h1>
<p>My first experiments involved using the well-maintained
<a href="https://gitlab.com/renovate-bot/renovate-runner">GitLab Runner</a>, which is freely available
and is itself kept updated by Mend to the latest docker image. The docs for running in this configuration
are straightforward and provide a sufficient understanding of how to install in order to
get your first results quickly.</p>
<p>The recommendation is to use a dedicated private project to host the runner, and I concur
with that. I have a dedicated group for experiments like this and it fit well in that
location.</p>
<p>You'll need to define a number of CI/CD variables in order for it to work, but that's
straightforward and well documented.</p>
<p>Initially, I used the <code>RENOVATE_EXTRA_FLAGS</code> to specify individual projects instead
of using automated onboarding. As a rule, I found the explicit support to work well,
and wildcards were OK, but regex was very finicky, especially when using negatives
via the <code>!</code> prefix.</p>
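<p>As an illustration, the explicit form looked roughly like this (the project paths
are placeholders):</p>
<pre><code>RENOVATE_EXTRA_FLAGS="--autodiscover=false group/project-a group/project-b"
</code></pre>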
<p>Make sure you put in a GitHub token as well as a GitLab token, since you will want to
have authenticated requests to GitHub in order to avoid the rate limits.</p>
<p>My final <code>.gitlab-ci.yml</code> used the <em>full</em> image in order to be able to handle a broader
array of dependencies and allowed for manual as well as scheduled operation:</p>
<div class="codehilite"><pre><span></span><code>include:
<span class="w"> </span>-<span class="w"> </span>remote:<span class="w"> </span>https://gitlab.com/renovate-bot/renovate-runner/-/raw/v12.13.0/templates/renovate.gitlab-ci.yml
image:<span class="w"> </span><span class="cp">${</span><span class="n">CI_RENOVATE_IMAGE_FULL</span><span class="cp">}</span>
renovate:
<span class="w"> </span>rules:
<span class="w"> </span>-<span class="w"> </span>if:<span class="w"> </span><span class="nv">$CI_PIPELINE_SOURCE</span><span class="w"> </span>==<span class="w"> </span>"schedule"
<span class="w"> </span>-<span class="w"> </span>if:<span class="w"> </span><span class="nv">$CI_PIPELINE_SOURCE</span><span class="w"> </span>==<span class="w"> </span>"web"
<span class="w"> </span>-<span class="w"> </span>if:<span class="w"> </span><span class="nv">$CI_PIPELINE_SOURCE</span><span class="w"> </span>==<span class="w"> </span>"push"
</code></pre></div>
<p>Note: you may want to let Renovate keep the pinned template version up-to-date.</p>
<p>Once you've got this running, individual repositories carry their configurations in
<code>renovate.json</code> files (which may be stored in a variety of locations; I generally put
mine in <code>.gitlab/renovate.json</code>). If no configuration file is present, the repository is ignored.</p>
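<p>An onboarding file can be as small as a single preset; here's a minimal sketch to
extend with your own rules:</p>
<div class="codehilite"><pre><span></span><code>{
  "$schema": "https://docs.renovatebot.com/renovate-schema.json",
  "extends": ["config:base"]
}
</code></pre></div>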
<h1>Permissions and users</h1>
<p>While it is possible to run Renovate as yourself, especially when you're only running it
on your repositories, there is less confusion if you have a separate bot user dedicated to
providing these updates.</p>
<p>By using a separate user, you can fully scope access, which is especially important
if you have admin access to your repositories or have access to a wider array of repos than
you want to give Renovate access to.</p>
<p>Further, if you decide to go with Mend Renovate On-Premises, you'll find that the webhooks
are basically ineffective if Renovate can't distinguish your actions from its own by the
user making the change.</p>
<p>In the end, I thought the separate user was definitely worth it, because I was also able to enable
autodiscovery, since each repo (or group) had been intentionally onboarded to Renovate.</p>
<h1>Mend Renovate On-Premises</h1>
<p>There's also a version of Renovate that's designed to run with its own scheduler and
respond to webhooks from GitLab. This version,
<a href="https://github.com/mend/renovate-on-prem">Mend Renovate On-Premises</a>,
is what I've moved to over the weekend. It definitely wants a stable container
environment to run in, and one in which your source control system (in my case GitLab)
and the Renovate server can communicate with each other (otherwise, you
won't get the benefits of the webhooks, and might as well stick with the easier-to-manage
cron-only version above).</p>
<p>This is licensed software from Mend, and requires a key in order to run. The keys are
available <a href="https://www.mend.io/free-developer-tools/renovate/on-premises/">for Free</a>
once you request them. It took a little time, and you should expect to be prospected for
sales, but that's certainly fair.</p>
<p>Once the key is received, you'll need to prepare to configure your environment. In my
case, this is a small Kubernetes environment. There's a Helm chart in the aforementioned
repository that is mostly well documented and covers the basics: installation with Helm
and configuration for both GitHub and GitLab, each either self-hosted or SaaS.</p>
<p>Setting up the environment via Helm was straightforward, and I created an application
in my GitLab-managed environment, using <code>.gotmpl</code> files to fill in my secret information
from the GitLab CI/CD variables and runtime information.</p>
<p>My <code>helmfile.yaml</code> looks like this:</p>
<div class="codehilite"><pre><span></span><code><span class="nt">repositories</span><span class="p">:</span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">renovate-on-prem</span>
<span class="w"> </span><span class="nt">url</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">https://mend.github.io/renovate-on-prem</span>
<span class="nt">releases</span><span class="p">:</span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">renovate</span>
<span class="w"> </span><span class="nt">namespace</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">YOUR-NAMESPACE-HERE</span>
<span class="w"> </span><span class="nt">chart</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">renovate-on-prem/whitesource-renovate</span>
<span class="w"> </span><span class="nt">version</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">3.1.3</span>
<span class="w"> </span><span class="nt">installed</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span>
<span class="w"> </span><span class="nt">values</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">values.yaml.gotmpl</span>
</code></pre></div>
<p>which adds the chart repository and then installs a pinned version of the
chart using the included values file.</p>
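<p>With those files in place and the <code>requiredEnv</code> values exported (or injected by CI),
bringing the release up is a single command. A sketch, assuming <code>helmfile</code> and a
kubeconfig for the target cluster are available:</p>
<div class="codehilite"><pre><span></span><code># render values.yaml.gotmpl and install/upgrade the release
export RENOVATE_LICENSE_KEY='...'   # plus the other requiredEnv variables above
helmfile apply
</code></pre></div>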
<p>My <code>values.yaml.gotmpl</code> is a bit more complex:</p>
<div class="codehilite"><pre><span></span><code><span class="nt">credentials</span><span class="p">:</span>
<span class="w"> </span><span class="nt">gitlab_access_token</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">{{</span><span class="w"> </span><span class="nv">requiredEnv "RENOVATE_GITLAB_TOKEN"</span><span class="w"> </span><span class="p p-Indicator">}}</span>
<span class="w"> </span><span class="nt">github_access_token</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">{{</span><span class="w"> </span><span class="nv">requiredEnv "RENOVATE_GITHUB_TOKEN"</span><span class="w"> </span><span class="p p-Indicator">}}</span>
<span class="nt">renovate</span><span class="p">:</span>
<span class="w"> </span><span class="nt">acceptWhiteSourceTos</span><span class="p">:</span><span class="w"> </span><span class="s">'y'</span>
<span class="w"> </span><span class="nt">licenseKey</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">{{</span><span class="w"> </span><span class="nv">requiredEnv "RENOVATE_LICENSE_KEY"</span><span class="w"> </span><span class="p p-Indicator">}}</span>
<span class="w"> </span><span class="nt">renovatePlatform</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">gitlab</span>
<span class="w"> </span><span class="nt">renovateEndpoint</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">{{</span><span class="w"> </span><span class="nv">requiredEnv "GITLAB_API_URL"</span><span class="w"> </span><span class="p p-Indicator">}}</span>
<span class="w"> </span><span class="nt">renovateToken</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">{{</span><span class="w"> </span><span class="nv">requiredEnv "RENOVATE_GITLAB_TOKEN"</span><span class="w"> </span><span class="p p-Indicator">}}</span>
<span class="w"> </span><span class="nt">githubComToken</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">{{</span><span class="w"> </span><span class="nv">requiredEnv "RENOVATE_GITHUB_TOKEN"</span><span class="w"> </span><span class="p p-Indicator">}}</span>
<span class="w"> </span><span class="nt">webhookSecret</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">{{</span><span class="w"> </span><span class="nv">requiredEnv "RENOVATE_WEBHOOK_TOKEN"</span><span class="w"> </span><span class="p p-Indicator">}}</span>
<span class="w"> </span><span class="nt">config</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">|</span>
<span class="w"> </span><span class="no">module.exports = {</span>
<span class="w"> </span><span class="no">"autodiscoverFilter": ["!/{data,experiment,imported}/.*/"],</span>
<span class="w"> </span><span class="no">"packageRules": [</span>
<span class="w"> </span><span class="no">{</span>
<span class="w"> </span><span class="no">"matchUpdateTypes": ["major", "minor", "patch", "digest", "bump"],</span>
<span class="w"> </span><span class="no">"addLabels": ["dependencies"]</span>
<span class="w"> </span><span class="no">},</span>
<span class="w"> </span><span class="no">{"matchLanguages": ["ruby"], "addLabels": ["ruby"]},</span>
<span class="w"> </span><span class="no">{"matchLanguages": ["java"], "addLabels": ["java"]},</span>
<span class="w"> </span><span class="no">{"matchLanguages": ["python"], "addLabels": ["python"]},</span>
<span class="w"> </span><span class="no">{"matchLanguages": ["php"], "addLabels": ["php"]},</span>
<span class="w"> </span><span class="no">{"matchLanguages": ["js"], "addLabels": ["js"]},</span>
<span class="w"> </span><span class="no">{"matchLanguages": ["docker"], "addLabels": ["docker"]},</span>
<span class="w"> </span><span class="no">{"matchLanguages": ["git-submodules"], "addLabels": ["submodule"]}</span>
<span class="w"> </span><span class="no">],</span>
<span class="w"> </span><span class="no">"git-submodules": {"enabled": true},</span>
<span class="w"> </span><span class="no">"hostRules": [</span>
<span class="w"> </span><span class="no">{ "matchHost": "{{ requiredEnv "CI_SERVER_HOST" }}", "token":"{{ requiredEnv "RENOVATE_GITLAB_TOKEN" }}" },</span>
<span class="w"> </span><span class="no">{ "matchHost": "{{ requiredEnv "CI_SERVER_HOST" }}", "hostType": "docker", "username": "token", "password":"{{ requiredEnv "RENOVATE_GITLAB_TOKEN" }}" },</span>
<span class="w"> </span><span class="no">{ "matchHost": "{{ requiredEnv "CI_SERVER_HOST" }}", "hostType": "helm", "username": "token", "password":"{{ requiredEnv "RENOVATE_GITLAB_TOKEN" }}" }</span>
<span class="w"> </span><span class="no">],</span>
<span class="w"> </span><span class="no">"dependencyDashboard":"true",</span>
<span class="w"> </span><span class="no">"dependencyDashboardLabels": ["dashboard"]</span>
<span class="w"> </span><span class="no">}</span>
<span class="nt">podSecurityContext</span><span class="p">:</span>
<span class="w"> </span><span class="nt">fsGroup</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">1000</span>
<span class="nt">cachePersistence</span><span class="p">:</span>
<span class="w"> </span><span class="nt">enabled</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span>
<span class="w"> </span><span class="nt">storageClass</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">longhorn</span>
<span class="nt">ingress</span><span class="p">:</span>
<span class="w"> </span><span class="nt">enabled</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span>
<span class="w"> </span><span class="nt">ingressClassName</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">nginx</span>
<span class="w"> </span><span class="nt">annotations</span><span class="p">:</span>
<span class="w"> </span><span class="nt">cert-manager.io/cluster-issuer</span><span class="p">:</span><span class="w"> </span><span class="s">"YOUR_CERT_ISSUER"</span>
<span class="w"> </span><span class="nt">nginx.ingress.kubernetes.io/whitelist-source-range</span><span class="p">:</span><span class="w"> </span><span class="s">"YOUR_CI_SERVER"</span>
<span class="w"> </span><span class="nt">hosts</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">YOUR_RENOVATE_HOSTNAME</span>
<span class="w"> </span><span class="nt">tls</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">secretName</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">renovate-fe-cert</span>
<span class="w"> </span><span class="nt">hosts</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">YOUR_RENOVATE_HOSTNAME</span>
</code></pre></div>
<p>Before using this, take a good look at it. There are some common items here, and some
lessons learned.</p>
<ul>
<li>If you don't set the <code>fsGroup</code> and you use the cache, you may find that the cache
is not writable. The application runs as UID/GID 1000, so this setting just makes
sure the application can write to it.</li>
<li>I'm using Longhorn for my local storage, so <code>storageClass</code> is set to that. If you're
using something different, make the appropriate changes.</li>
<li>For getting the webhooks, I am using an nginx ingress controller, MetalLB, and
cert-manager with support for ACME. I constrained access to the ingress server
using the <code>nginx.ingress.kubernetes.io/whitelist-source-range</code> annotation. TLS is
configured, and my GitLab server is checking for appropriate TLS certificates.</li>
<li>The weird <code>hostRules</code> lines are because I need to authenticate back to the GitLab server
for a couple of repository types. I've pulled this discussion into
<a href="https://www.gaige.net/renovating-gitlab-registries.html">Renovating GitLab Registries</a> after
finding some further nuance.</li>
<li>The documentation on Renovate has a lot of settings, and there are multiple ways to
set defaults. In this case, I chose to force settings by applying them globally. You
can also set these per-repository using <code>renovate.json</code> files (which will be sought out as the
indicator that you want Renovate to run in those repos). Read the documentation and look
for best practices.</li>
<li>The defaults mostly worked, but I like to have my PRs labeled, so I added the various
<code>packageRules</code> in order to set labels.</li>
<li>I find the <a href="https://docs.renovatebot.com/key-concepts/dashboard/">Dependency Dashboard</a>
to be useful, so I configured it (and also set
a label for it, so I could do an easy global search for a sort-of dashboard of
dashboards).</li>
</ul>
<p>Although the Helm chart seems up-to-date for the most part, I did eventually start
playing with the image tags in order to get a more up-to-date version of the underlying
image. Do this at your own peril. There's nothing theoretically wrong with it, as the
images are released in some fashion, but you may want to stick with the chart's defaults.</p>
<p>If you do decide to go with the modified image:</p>
<div class="codehilite"><pre><span></span><code><span class="nt">image</span><span class="p">:</span>
<span class="w"> </span><span class="nt">repository</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">whitesource/renovate</span>
<span class="w"> </span><span class="nt">tag</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">VERSION</span>
<span class="w"> </span><span class="nt">pullPolicy</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">IfNotPresent</span>
</code></pre></div>
<p>is what I added to retrieve a newer Renovate image; you'll need to visit
Docker Hub in order to find the latest tag.</p>
<h1>Results</h1>
<p>All told, I'm happy with the implementation. It took a little while to bootstrap, but
it is doing a good job at finding my updates and providing a pretty uncluttered and
automated mechanism for keeping up to date.</p>
Bacula pruning old storage2022-10-05T07:22:00-04:002022-10-05T07:22:00-04:00Gaige B. Paulsentag:www.gaige.net,2022-10-05:/bacula-pruning-old-storage.html<p>I note with some amusement the fact that I wrote on this exact day last year about
this same subject (in much more detail).</p>
<p>The reason for the new message on this subject is that I'm still cleaning up some of the
decisions I made when first using Bacula.</p>
<h2>The …</h2><p>I note with some amusement the fact that I wrote on this exact day last year about
this same subject (in much more detail).</p>
<p>The reason for the new message on this subject is that I'm still cleaning up some of the
decisions I made when first using Bacula.</p>
<h2>The Problem</h2>
<p>Before I realized that I really needed to have three pools to make things work correctly,
I was storing the full, differential, and incremental backups in the same pool. This
turned out to be a bad decision, and I rectified it prior to my last note on the topic—
however, not until I'd accumulated a lot of "volumes" of data that were still in my
main pool (the one I continue to use for my full backups).</p>
<p>If you look at volumes the way that Bacula does, they're basically tapes. As such,
a full volume is considered to take up no more space than an empty one.
This makes sense for tapes, since the tendency in a tape-based system is to rotate media
physically through offsite and onsite storage and then to tape libraries or robots for
reuse.
For on-disk volumes, though, this poses a different problem: empty on-disk volumes and
full ones do not take up the same amount of space.</p>
<p>In my case, I was running low on disk space on my storage volume and when I looked I
noticed many volumes that had not been written to in quite some time and were marked
as "expired". These volumes would eventually be reused, but since that pool had run
for quite some time with full, differential, and incremental backups in it, I had a
large number of "tapes" that contained expired data to write over. As the full backups
take a relatively constant number of volumes, they would take a few more years to
overwrite the volumes used for differential and incremental backups.</p>
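<p>A quick way to eyeball the state of a pool from bconsole is the <code>list volumes</code>
command (the pool name here is from my setup; yours will differ):</p>
<div class="codehilite"><pre><span></span><code>list volumes pool=Cloud-CT
</code></pre></div>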
<p>In short, I had a lot of <em>used</em> volumes taking up space (both in the catalog and in
storage) that should have been purged.</p>
<h2>The Solution</h2>
<p>In order to stop Bacula from continuing to allow the "old" storage to lie around,
I needed to delete the volumes. Leaving them made sense if you think like tape: once you've bought
the tape, you might as well leave the data on it (ignoring the security concerns) until
you need to reuse it. But this isn't tape, and there's a significant benefit to keeping
the number of volumes right-sized for our environment.</p>
<p>In my case, I wanted to remove all volumes from my <code>Cloud-CT</code> media pool that
were more than 2 years old. Here, the bconsole <code>sql</code> command came in handy, as
I was able to directly query the catalog database:</p>
<div class="codehilite"><pre><span></span><code><span class="k">select</span><span class="w"> </span><span class="s1">'prune yes volume='</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="n">volumename</span><span class="w"> </span><span class="k">from</span><span class="w"> </span><span class="n">media</span><span class="w"> </span><span class="k">where</span><span class="w"> </span><span class="n">lastwritten</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="s1">'2020-10-03'</span><span class="w"> </span><span class="k">and</span><span class="w"> </span><span class="n">mediatype</span><span class="o">=</span><span class="s1">'Cloud-CT'</span><span class="w"> </span><span class="k">order</span><span class="w"> </span><span class="k">by</span><span class="w"> </span><span class="n">volumename</span><span class="p">;</span>
</code></pre></div>
<p>The above query resulted in a number of lines that could be pasted directly into
bconsole in order to prune the volumes.</p>
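<p>The output is just a column of ready-to-paste bconsole commands, one per volume
(the volume names here are hypothetical):</p>
<div class="codehilite"><pre><span></span><code>prune yes volume=Cloud-CT-0412
prune yes volume=Cloud-CT-0413
prune yes volume=Cloud-CT-0414
</code></pre></div>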
<p>Based on some examples, and out of an abundance of caution, I decided to purge the volumes
before deleting them. This was likely an unnecessary step, but it ensured that my catalog
database was cleaned as well.</p>
<p>Once done, I was able to rerun the sql command, replacing <code>prune</code> with <code>delete</code> to delete
the unnecessary volumes.</p>
<p>This cleared up all the near-side volumes, removing the storage that they consumed as
well as their catalog entries, so that they would not be reused in the future.</p>
<p>For the far-side (cloud) copies, I opted to directly purge those using a find command:</p>
<div class="codehilite"><pre><span></span><code>find<span class="w"> </span>bacula-west/<span class="w"> </span>-mtime<span class="w"> </span>+720<span class="w"> </span>-exec<span class="w"> </span>rm<span class="w"> </span>-r<span class="w"> </span><span class="se">\{\}</span><span class="w"> </span><span class="se">\;</span>
</code></pre></div>
<p>where <code>bacula-west</code> is the name of the storage location.</p>
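<p>If you try something similar, it's worth previewing what will match before handing
the list to <code>rm</code>:</p>
<div class="codehilite"><pre><span></span><code># same age predicate, but only print the matches
find bacula-west/ -mtime +720 -print
</code></pre></div>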
<h2>Summary</h2>
<p>All told, if I'd known originally what a mess I was creating by using a single pool, I
would have resolved that earlier, but this is how we learn.</p>
Test without building and SPM2022-05-17T07:45:00-04:002022-05-17T07:45:00-04:00Gaige B. Paulsentag:www.gaige.net,2022-05-17:/test-without-building-and-spm.html<p>Another day, another set of testing issues. As mentioned in my previous post,
<a href="https://www.gaige.net/slathering-xcode-variants.html">Slathering Xcode Variants</a>,
I've been making some use of Xcode's capability to build a test package and
separately run that test package on a different machine, possibly with a different
version of macOS or even a different …</p><p>Another day, another set of testing issues. As mentioned in my previous post,
<a href="https://www.gaige.net/slathering-xcode-variants.html">Slathering Xcode Variants</a>,
I've been making some use of Xcode's capability to build a test package and
separately run that test package on a different machine, possibly with a different
version of macOS or even a different CPU architecture.</p>
<p>In order to meet my testing needs, I've generally been building with the latest Xcode
on the latest version of macOS and then running tests on both that version and the
previous version of macOS. I've been running in this mode for months, without any
difficulty, until this week.</p>
<p>Earlier this week, I added a new SPM dependency to one of the Xcode projects in my
workspace. Since this project is large, the organization is a single workspace with
multiple projects in it (today that number is 16) and a handful of SPM dependencies
(3 across the whole set today).</p>
<p>After adding the dependency and running full tests on my Mac Pro (x86_64, macOS Monterey),
things were looking good, so I pushed the new files to my CI server. A few minutes later,
my Big Sur tests both failed. Looking at the logs, I found:</p>
<div class="codehilite"><pre><span></span><code><span class="nv">Package</span>.<span class="nv">resolved</span><span class="w"> </span><span class="nv">file</span><span class="w"> </span><span class="nv">is</span><span class="w"> </span><span class="nv">corrupted</span><span class="w"> </span><span class="nv">or</span><span class="w"> </span><span class="nv">malformed</span><span class="c1">; fix or delete the file to continue: unsupported schema version 2</span>
</code></pre></div>
<p>This only happened on the Big Sur machines, and they are running Xcode 13.2.x (the last
version before 13.3 came along and stopped running on Big Sur).</p>
<p>Not surprisingly, this is because this file was created on my Monterey Mac.</p>
<h2>Trying without Fastlane</h2>
<p>I'll caution that these are all still running within Fastlane, so it's possible that I'm
shooting myself in the foot by not pushing all the parameters to the command line
manually. I may, at some point, give that a try and see if that solves the problem.</p>
<p>Doing some manual testing confirmed that when not using Fastlane, I could easily skip the
problematic code when running in CI by targeting the <code>xctestrun</code> file directly instead
of using the Scheme and Workspace (thus avoiding the workspace evaluation).</p>
<p>For example:</p>
<div class="codehilite"><pre><span></span><code>xcodebuild<span class="w"> </span>test-without-building<span class="w"> </span>⏎
<span class="w"> </span>-xctestrun<span class="w"> </span>test_build/Build/Products/Cartographica_Cartographica-Exhaustive_macosx12.3-arm64-x86_64.xctestrun<span class="w"> </span>⏎
<span class="w"> </span>-destination<span class="w"> </span><span class="s1">'platform=macOS'</span><span class="w"> </span>⏎
<span class="w"> </span>-resultBundlePath<span class="w"> </span><span class="s1">'output/Cartographica.xcresult'</span>
</code></pre></div>
<p>(once again using ⏎ to denote, counter-intuitively, that the line breaks here are only
for readability and should be left out).</p>
<p>This command explicitly uses the <code>-xctestrun</code> option, pointing it at the specific <code>xctestrun</code>
file instead of using the Scheme and Workspace.</p>
<h2>Back to Fastlane</h2>
<p>Eventually, I decided I still wanted to use Fastlane (although I feel like I'm
using few enough features that there may be a future blog post about kicking it
to the curb as well).</p>
<p>I'm having to be more careful about how I build and run my tests (notably: needing to
make sure I don't bump the test libraries too far). However, things are working and
the process survived at least the initial jump to Ventura.</p>
<p>My build and test matrix now looks like this:</p>
<div class="codehilite"><pre><span></span><code><span class="nt">build-mac</span><span class="p">:</span>
<span class="w"> </span><span class="nt">stage</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">build</span>
<span class="w"> </span><span class="nt">variables</span><span class="p">:</span>
<span class="w"> </span><span class="nt">GIT_CLONE_PATH</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">${CI_BUILDS_DIR}/${CI_PROJECT_PATH}</span>
<span class="w"> </span><span class="nt">before_script</span><span class="p">:</span><span class="w"> </span><span class="nv">*mac_build_prep</span>
<span class="w"> </span><span class="nt">tags</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">[</span><span class="nv">xcode14</span><span class="p p-Indicator">,</span><span class="nv">codesigning</span><span class="p p-Indicator">,</span><span class="nv">build</span><span class="p p-Indicator">]</span>
<span class="w"> </span><span class="nt">needs</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">[</span><span class="nv">check-servers</span><span class="p p-Indicator">]</span>
<span class="w"> </span><span class="nt">interruptible</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span>
<span class="w"> </span><span class="nt">script</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">bundle exec fastlane --verbose cibuild</span>
<span class="w"> </span><span class="nt">artifacts</span><span class="p">:</span>
<span class="w"> </span><span class="nt">paths</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">test_build/Build/Products</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">test_build/Build/SourcePackages</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">test_build/Logs</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">test_build/info.plist</span>
<span class="w"> </span><span class="nt">expire_in</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">1 day</span>
<span class="nt">.test_template</span><span class="p">:</span>
<span class="w"> </span><span class="nt">stage</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">test</span>
<span class="w"> </span><span class="nt">needs</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">job</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">build-mac</span>
<span class="w"> </span><span class="nt">artifacts</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span>
<span class="w"> </span><span class="nt">variables</span><span class="p">:</span>
<span class="w"> </span><span class="nt">GIT_CLONE_PATH</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">${CI_BUILDS_DIR}/${CI_PROJECT_PATH}</span>
<span class="w"> </span><span class="nt">CTRunningUnderTest</span><span class="p">:</span><span class="w"> </span><span class="s">'YES'</span>
<span class="w"> </span><span class="nt">coverage</span><span class="p">:</span><span class="w"> </span><span class="s">'/CodeCoverageOverall</span><span class="nv"> </span><span class="s">=\d+\.\d+/'</span>
<span class="w"> </span><span class="nt">interruptible</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">true</span>
<span class="w"> </span><span class="nt">before_script</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">export PATH=~/.rbenv/shims:${PATH}</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">export FL_SLATHER_ARCH=`uname -m`</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">echo $PATH</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">bundle install</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">rm -rf ${CI_PROJECT_DIR}/DerivedData/Build/ProfileData</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">rm -rf ${CI_PROJECT_DIR}/output/*</span>
<span class="w"> </span><span class="nt">artifacts</span><span class="p">:</span>
<span class="w"> </span><span class="nt">when</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">always</span>
<span class="w"> </span><span class="nt">reports</span><span class="p">:</span>
<span class="w"> </span><span class="nt">junit</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">output/report.junit</span>
<span class="w"> </span><span class="nt">coverage_report</span><span class="p">:</span>
<span class="w"> </span><span class="nt">coverage_format</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">cobertura</span>
<span class="w"> </span><span class="nt">path</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">output/cobertura.xml</span>
<span class="w"> </span><span class="nt">paths</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">output/scan/*.log</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">output/*.xcresult</span>
<span class="nt">.coverage_script</span><span class="p">:</span><span class="w"> </span><span class="nl">&coverage_script</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="p p-Indicator">></span>
<span class="w"> </span><span class="no">grep ^\<coverage output/cobertura.xml</span>
<span class="w"> </span><span class="no">| sed -n -e 's/.*line-rate=\"\([0-9.]*\)\".*/\1/p'</span>
<span class="w"> </span><span class="no">| awk '{print "CodeCoverageOverall =" $1*100}'</span>
<span class="w"> </span><span class="no">|| true</span>
<span class="nt">test</span><span class="p">:</span>
<span class="w"> </span><span class="nt">extends</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">.test_template</span>
<span class="w"> </span><span class="nt">parallel</span><span class="p">:</span>
<span class="w"> </span><span class="nt">matrix</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">PROCESSOR</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">[</span><span class="nv">arm64</span><span class="p p-Indicator">]</span>
<span class="w"> </span><span class="nt">OS</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">bigsur</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">PROCESSOR</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">[</span><span class="nv">arm64</span><span class="p p-Indicator">,</span><span class="nv">x86_64</span><span class="p p-Indicator">]</span>
<span class="w"> </span><span class="nt">OS</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">monterey</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">PROCESSOR</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">[</span><span class="nv">arm64</span><span class="p p-Indicator">,</span><span class="nv">x86_64</span><span class="p p-Indicator">]</span>
<span class="w"> </span><span class="nt">OS</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">ventura</span>
<span class="w"> </span><span class="nt">tags</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">codesigning</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">${OS}</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">${PROCESSOR}</span>
<span class="w"> </span><span class="nt">script</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">bundle exec fastlane --verbose citest</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nv">*coverage_script</span>
</code></pre></div>
<p><em>Ed. Note</em>: Updated for Ventura</p>
<p>And my fastfile for running the tests (without building) uses a number of different
commands:</p>
<ul>
<li><code>test_from_build</code> (to run the tests based on the testplan)</li>
<li><code>build_for_tests</code> (to create the binaries for running the tests)</li>
</ul>
<p>Neither of these is called directly; both are called from the <code>cibuild</code> and <code>citest</code> lanes above.</p>
<p>Note that <code>build_for_tests</code> takes an argument of an array of testplans to build.</p>
<div class="codehilite"><pre><span></span><code><span class="n">default_platform</span><span class="p">(</span><span class="ss">:mac</span><span class="p">)</span>
<span class="n">my_xcargs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span>
<span class="n">platform</span><span class="w"> </span><span class="ss">:mac</span><span class="w"> </span><span class="k">do</span>
<span class="w"> </span><span class="n">desc</span><span class="w"> </span><span class="s1">'Build for testing'</span>
<span class="w"> </span><span class="n">lane</span><span class="w"> </span><span class="ss">:build_for_tests</span><span class="w"> </span><span class="k">do</span><span class="w"> </span><span class="o">|</span><span class="n">options</span><span class="o">|</span>
<span class="w"> </span><span class="n">final_xcargs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="o">[</span><span class="n">my_xcargs</span><span class="p">,</span><span class="w"> </span><span class="s1">'ONLY_ACTIVE_ARCH=NO'</span><span class="o">].</span><span class="n">join</span><span class="p">(</span><span class="s1">' '</span><span class="p">)</span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="n">options</span><span class="o">[</span><span class="ss">:testplan</span><span class="o">]</span>
<span class="w"> </span><span class="n">final_xcargs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="o">[</span><span class="n">my_xcargs</span><span class="p">,</span><span class="w"> </span><span class="s1">'ONLY_ACTIVE_ARCH=NO'</span><span class="o">]+</span><span class="n">args_with_prefix</span><span class="p">(</span><span class="n">options</span><span class="o">[</span><span class="ss">:testplan</span><span class="o">]</span><span class="p">,</span><span class="s1">'-testPlan '</span><span class="p">))</span><span class="o">.</span><span class="n">reject</span><span class="p">(</span><span class="o">&</span><span class="ss">:empty?</span><span class="p">)</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="s1">' '</span><span class="p">)</span>
<span class="w"> </span><span class="k">end</span>
<span class="w"> </span><span class="n">run_tests</span><span class="p">(</span><span class="ss">scheme</span><span class="p">:</span><span class="w"> </span><span class="n">options</span><span class="o">[</span><span class="ss">:scheme</span><span class="o">]</span><span class="p">,</span>
<span class="w"> </span><span class="ss">configuration</span><span class="p">:</span><span class="w"> </span><span class="s1">'Debug'</span><span class="p">,</span>
<span class="w"> </span><span class="ss">code_coverage</span><span class="p">:</span><span class="w"> </span><span class="kp">true</span><span class="p">,</span>
<span class="w"> </span><span class="ss">address_sanitizer</span><span class="p">:</span><span class="w"> </span><span class="kp">false</span><span class="p">,</span>
<span class="w"> </span><span class="ss">output_types</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span><span class="p">,</span>
<span class="w"> </span><span class="ss">disable_slide_to_type</span><span class="p">:</span><span class="w"> </span><span class="kp">false</span><span class="p">,</span><span class="w"> </span><span class="c1"># note: this gets around a macos bug caused by assuming ios in fastlane</span>
<span class="w"> </span><span class="ss">clean</span><span class="p">:</span><span class="w"> </span><span class="n">options</span><span class="o">[</span><span class="ss">:clean</span><span class="o">]</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="kp">false</span><span class="p">,</span>
<span class="w"> </span><span class="ss">xcargs</span><span class="p">:</span><span class="w"> </span><span class="n">final_xcargs</span><span class="p">,</span>
<span class="w"> </span><span class="ss">derived_data_path</span><span class="p">:</span><span class="w"> </span><span class="s2">"test_build"</span><span class="p">,</span>
<span class="w"> </span><span class="ss">build_for_testing</span><span class="p">:</span><span class="w"> </span><span class="kp">true</span><span class="p">)</span>
<span class="w"> </span><span class="k">end</span>
<span class="w"> </span><span class="n">desc</span><span class="w"> </span><span class="s1">'Runs built tests'</span>
<span class="w"> </span><span class="n">lane</span><span class="w"> </span><span class="ss">:test_from_build</span><span class="w"> </span><span class="k">do</span><span class="w"> </span><span class="o">|</span><span class="n">options</span><span class="o">|</span>
<span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="n">options</span><span class="o">[</span><span class="ss">:testplan</span><span class="o">]</span>
<span class="w"> </span><span class="n">final_xcargs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="o">[</span><span class="n">my_xcargs</span><span class="o">]+</span><span class="n">args_with_prefix</span><span class="p">(</span><span class="n">options</span><span class="o">[</span><span class="ss">:testplan</span><span class="o">]</span><span class="p">,</span><span class="s1">'-testPlan '</span><span class="p">))</span><span class="o">.</span><span class="n">reject</span><span class="p">(</span><span class="o">&</span><span class="ss">:empty?</span><span class="p">)</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="s1">' '</span><span class="p">)</span>
<span class="w"> </span><span class="k">else</span>
<span class="w"> </span><span class="n">final_xcargs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">my_xcargs</span>
<span class="w"> </span><span class="k">end</span>
<span class="w"> </span><span class="n">run_tests</span><span class="p">(</span><span class="ss">scheme</span><span class="p">:</span><span class="w"> </span><span class="n">options</span><span class="o">[</span><span class="ss">:scheme</span><span class="o">]</span><span class="p">,</span>
<span class="w"> </span><span class="ss">configuration</span><span class="p">:</span><span class="w"> </span><span class="s1">'Debug'</span><span class="p">,</span>
<span class="w"> </span><span class="ss">code_coverage</span><span class="p">:</span><span class="w"> </span><span class="kp">true</span><span class="p">,</span>
<span class="w"> </span><span class="ss">address_sanitizer</span><span class="p">:</span><span class="w"> </span><span class="kp">false</span><span class="p">,</span>
<span class="w"> </span><span class="ss">output_types</span><span class="p">:</span><span class="w"> </span><span class="s2">"junit"</span><span class="p">,</span>
<span class="w"> </span><span class="ss">disable_slide_to_type</span><span class="p">:</span><span class="w"> </span><span class="kp">false</span><span class="p">,</span><span class="w"> </span><span class="c1"># note: this gets around a macos bug caused by assuming ios in fastlane</span>
<span class="w"> </span><span class="ss">clean</span><span class="p">:</span><span class="w"> </span><span class="n">options</span><span class="o">[</span><span class="ss">:clean</span><span class="o">]</span><span class="w"> </span><span class="o">||</span><span class="w"> </span><span class="kp">false</span><span class="p">,</span>
<span class="w"> </span><span class="ss">xcargs</span><span class="p">:</span><span class="w"> </span><span class="n">final_xcargs</span><span class="p">,</span>
<span class="w"> </span><span class="ss">derived_data_path</span><span class="p">:</span><span class="w"> </span><span class="s2">"test_build"</span><span class="p">,</span>
<span class="w"> </span><span class="ss">output_directory</span><span class="p">:</span><span class="w"> </span><span class="s1">'output'</span><span class="p">,</span>
<span class="w"> </span><span class="ss">test_without_building</span><span class="p">:</span><span class="w"> </span><span class="kp">true</span><span class="p">)</span>
<span class="w"> </span><span class="k">end</span>
<span class="k">end</span>
</code></pre></div>
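<p>For completeness, fastlane lanes take <code>key:value</code> options on the command line, so a
manual invocation of these lanes looks roughly like the following (the scheme and testplan
names are hypothetical):</p>
<div class="codehilite"><pre><span></span><code>bundle exec fastlane build_for_tests scheme:Cartographica testplan:Exhaustive
bundle exec fastlane test_from_build scheme:Cartographica testplan:Exhaustive
</code></pre></div>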
Slathering Xcode variants2022-05-10T06:23:00-04:002022-05-10T06:23:00-04:00Gaige B. Paulsentag:www.gaige.net,2022-05-10:/slathering-xcode-variants.html<p>I've been doing quite a bit of experimentation with recent features in Xcode lately,
especially as regards trying to efficiently run my GitLab-powered Mac Mini build farm.</p>
<p>Recently, as I've been doing some work on CartoMobile, I've been updating the testing
code there and stole some ideas from the Cartographica …</p><p>I've been doing quite a bit of experimentation with recent features in Xcode lately,
especially as regards trying to efficiently run my GitLab-powered Mac Mini build farm.</p>
<p>Recently, as I've been doing some work on CartoMobile, I've been updating the testing
code there and stole some ideas from the Cartographica test suites, which I intentionally
build for both Apple Silicon and x86_64 and then run on 2 OS variants with each processor
family. In this case, I collect coverage information from all 4 and then merge them because
I have variant code that runs on different CPUs and versions of the OS (more the former
than the latter, because some libraries are specific to one architecture or the other).</p>
<p>In Cartographica, the CI code follows these steps:</p>
<ol>
<li>
<p>Build the code for testing</p>
</li>
<li>
<p>Run a matrix job across the CPUs and Operating Systems that I need to test,
collecting junit and coverage information</p>
<ul>
<li>Current macOS and x86_64</li>
<li>Current macOS and arm64</li>
<li>Previous macOS and x86_64</li>
<li>Previous macOS and arm64</li>
</ul>
</li>
</ol>
<p>For CartoMobile, I did something similar, but ran into a problem with <code>slather</code> in doing
so, and also realized that I was likely wasting time and effort.</p>
<p>First, the problem that I ran into was specifically with running slather without pointing
it at the correct directory. In this case, I wasn't pointing slather at the migrated
directory when checking coverage, so it failed to find the coverage files when running.</p>
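<p>For anyone hitting the same thing: slather accepts an explicit build directory, so the
fix amounts to pointing it at the derived-data path used for the test run. A sketch with
illustrative project and scheme names:</p>
<div class="codehilite"><pre><span></span><code>bundle exec slather coverage \
  --build-directory test_build \
  --scheme CartoMobile \
  CartoMobile.xcodeproj
</code></pre></div>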
<p>However, more importantly, this led me to the realization that the way that I was going
about this was wrong for CartoMobile. Unlike Cartographica, where I was doing matrixed
coverage because I have code that only operates on specific CPU architectures or
versions of macOS, CartoMobile has a single set of code that runs on all SDKs,
and since I can't run coverage tests on the <code>iphoneos</code> SDK, that meant that all that
was interesting was running the coverage tests for the simulator on both iPadOS and iOS.</p>
<p>In addition, for CartoMobile, I also run TSAN and ASAN tests (with UBSAN set). Although
the builds for those can take a while, the runtimes are short, and the builds are
completely independent. Further, coverage for these runs is not important, since
coverage tests can (and should) be run without the sanitizers. Thus, unlike Cartographica,
for CartoMobile, I decided to matrix the build and test functions together.
The result is:</p>
<ol>
<li>
<p>Build for simulator and test with Coverage, collecting junit and coverage</p>
</li>
<li>
<p>Build and run simulator using an Xcode test plan (.xctestplan) that runs with:</p>
<ul>
<li>TSAN + UBSAN</li>
<li>ASAN + UBSAN</li>
</ul>
</li>
<li>
<p>Build and run my iOS snapshots on a minimum iPad and iPhone simulator (UI Test)</p>
</li>
<li>
<p>Build and run my iOS snapshots on all sizes and languages (AppStore snapshots)</p>
</li>
</ol>
Renaming Elasticsearch indexes2022-04-04T06:23:00-04:002022-04-04T06:23:00-04:00Gaige B. Paulsentag:www.gaige.net,2022-04-04:/renaming-elasticsearch-indexes.html<p>I've been an <a href="https://www.elastic.co/what-is/elk-stack">ELK Stack</a>
(Elasticsearch, Logstash, Kibana, and Beats) user for quite some time, using
exclusively the open source version of the stack.</p>
<p>Generally it works well and, with some exceptions, supports our mostly-Solaris
based environment (using LX zones to host most of the beefier components, and
using custom-built …</p><p>I've been an <a href="https://www.elastic.co/what-is/elk-stack">ELK Stack</a>
(Elasticsearch, Logstash, Kibana, and Beats) user for quite some time, using
exclusively the open source version of the stack.</p>
<p>Generally it works well and, with some exceptions, supports our mostly-Solaris
based environment (using LX zones to host most of the beefier components, and
using custom-built beats and senders for the lightweight senders).</p>
<h2>Index Lifecycle Management (ILM)</h2>
<p>A couple of years ago, I started using ILM, which automatically rotates indexes
through various
<a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-index-lifecycle.html">stages of life</a>
based on use of the index:</p>
<ul>
<li>hot: index is being actively updated and queried</li>
<li>warm: index is not being updated, but is still being queried</li>
<li>cold: no longer being updated, but queried infrequently</li>
<li>frozen: no longer being updated, queried rarely</li>
<li>delete: no longer needed and may be deleted</li>
</ul>
<p>Generally, you set up an index name pattern which is followed automatically, creating
a user-and-machine-friendly <code>YYYY-MM-DD-NNN</code> suffix, such as
<code>mail-7.4.2-2020-04-22-0027</code> so that the name denotes its creation date and has
a numeric discriminator in case you need to rotate due to volume instead of dates.
ILM supports rotation based on a number of factors, and it's common to specify a
monthly rotation plus a file size rotation, in order to keep indexes a manageable size.</p>
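<p>For reference, the hot-phase portion of such a policy looks like this (a sketch using
the same 50gb/30d rollover values that appear later in this post):</p>
<div class="codehilite"><pre><span></span><code>PUT _ilm/policy/mail
</code></pre></div>
<div class="codehilite"><pre><span></span><code>{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {"max_size": "50gb", "max_age": "30d"}
        }
      }
    }
  }
}
</code></pre></div>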
<h2>Fixing a bad index template</h2>
<p>During one of my index creation steps, I used the wrong name template and ended up
with all of my subsequent indexes (for over a year) being named with the same date
and an incrementing discriminator value. This was practically fine, but annoying
because it made it difficult to determine if the indexes were being deleted and rotated
through stages correctly.</p>
<p>The fix for this was to make sure that the index being written by my logstash component
was <code><mail-8.0.0-{now/d}-000032></code> (as an example), not <code>mail-8.0.0-{now/d}-000032</code>; the
difference being when the value is evaluated. In the former, the date-math expression is
kept with the index and re-evaluated at each rollover; in the latter, it's evaluated once
when the index is created, and the result is a
<code>provided_name</code> of <code>mail-8.0.0-</code><em><code>date-created</code></em><code>-000032</code>, which means the date won't
change.</p>
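<p>As a side note, if you ever need to bootstrap such a rollover target by hand, the
date-math name has to be URL-encoded in the request path, and the write alias attached at
creation (the names here are illustrative):</p>
<div class="codehilite"><pre><span></span><code>PUT %3Cmail-8.0.0-%7Bnow%2Fd%7D-000001%3E
</code></pre></div>
<div class="codehilite"><pre><span></span><code>{
  "aliases": {
    "mail": {"is_write_index": true}
  }
}
</code></pre></div>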
<p>I tested this change by modifying the <code>provided_name</code> of the running index, and then
executing the manual rollover by sending:</p>
<div class="codehilite"><pre><span></span><code>POST <rollover-target>/_rollover/<target-index>
</code></pre></div>
<p>and specifying the new index template in the <code><target-index></code> and then subsequently
setting <code>index.lifecycle.indexing_complete</code> to <code>true</code> on the index so that the lack
of an <a href="https://discuss.elastic.co/t/index-management-error-index-is-not-the-write-index-for-alias/180894/2">automated rollover didn't cause error messages</a>.</p>
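<p>Setting that flag is just another settings update on the old index (the index name is
illustrative):</p>
<div class="codehilite"><pre><span></span><code>PUT mail-8.0.0-2022.01.15-000031/_settings
</code></pre></div>
<div class="codehilite"><pre><span></span><code>{
  "index.lifecycle.indexing_complete": true
}
</code></pre></div>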
<h2>Changing historic names</h2>
<p>The remaining problem was the names of the old indexes. Although Elasticsearch does not
have a <em>rename</em> command, it does have two other useful commands, <code>reindex</code> and <code>clone</code>.</p>
<p><code>reindex</code> will reindex all of the documents in the original index into a new
index, which allows you to change the format of the index and settings prior to reindexing
the data (in fact, you must create and provision the new index first, or you're likely
going to either delete the reindexed copy or re-reindex it).</p>
<p><code>clone</code> makes a complete clone of the index by using hard links (if possible on the
underlying OS). This makes it particularly fast (at least for the primary shards) and
allows you to create the new index with all of the attributes of the old index.</p>
<p>So, for the equivalent of renaming the indexes, you <code>clone</code> the old index to the new
name, and then <code>delete</code> the original. In this case, you're only rebuilding the replicas
and by deleting the old index, the hard links become referenced only by the new index.</p>
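<p>Concretely, the rename amounts to three calls: mark the source read-only (a prerequisite
for <code>clone</code>), clone it to the new name, and delete the original (the index names are
illustrative):</p>
<div class="codehilite"><pre><span></span><code>PUT mail-8.0.0-old-name/_settings
{"index.blocks.write": true}

POST mail-8.0.0-old-name/_clone/mail-8.0.0-2021.06.01-000010

DELETE mail-8.0.0-old-name
</code></pre></div>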
<h2>Setting the ILM time</h2>
<p>The final problem was that all of these indexes now were believed by ILM to be brand
new. They were going to get rotated into another phase (or deleted) based on the
date that I cloned them, not based on the last write date.</p>
<p>As an aside: the way that ILM looks at indexes is to consider the key date for the index
(the <code>lifecycle_date_millis</code>) to be the creation date of
the index until it is closed by its first rollover, at which point the
<code>lifecycle_date_millis</code> is set to that first rollover date. This way, initial ILM actions are based
on the creation date of the index (rollover 30d after creation, for example) and subsequent
actions are based on the date that the index was closed.</p>
<p>By cloning the index, I'd reset the creation date, and thus the <code>lifecycle_date_millis</code>.
Not surprisingly, this was a pretty easy fix: determine the rollover date and then
reset the value.</p>
<p>In my case, I double-checked the expected dates by executing a timestamp query:</p>
<div class="codehilite"><pre><span></span><code>GET mail-7.4.2-2022.02.23-000131/_search?size=0
</code></pre></div>
<div class="codehilite"><pre><span></span><code><span class="p">{</span>
<span class="w"> </span><span class="nt">"aggs"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"max_date"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"max"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="nt">"field"</span><span class="p">:</span><span class="w"> </span><span class="s2">"@timestamp"</span><span class="p">}</span>
<span class="w"> </span><span class="p">},</span>
<span class="w"> </span><span class="nt">"min_date"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"min"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="nt">"field"</span><span class="p">:</span><span class="w"> </span><span class="s2">"@timestamp"</span><span class="p">}</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">}</span>
<span class="p">}</span>
</code></pre></div>
<p>And then updating the index settings:</p>
<div class="codehilite"><pre><span></span><code>PUT mail-7.4.2-2022.02.23-000131/_settings
</code></pre></div>
<div class="codehilite"><pre><span></span><code><span class="p">{</span>
<span class="w"> </span><span class="nt">"settings"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"index"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"lifecycle.origination_date"</span><span class="p">:</span><span class="w"> </span><span class="mi">1646629200000</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">}</span>
<span class="p">}</span>
</code></pre></div>
<p>Finally, check the ILM information:</p>
<div class="codehilite"><pre><span></span><code>GET mail-7.4.2-2022.02.23-000131/_ilm/explain
</code></pre></div>
<p>and verify that the age is as expected:</p>
<div class="codehilite"><pre><span></span><code><span class="p">{</span>
<span class="w"> </span><span class="nt">"indices"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"mail-7.4.2-2022.02.23-000131"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"index"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"mail-7.4.2-2022.02.23-000131"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"managed"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"policy"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"mail"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"lifecycle_date_millis"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mi">1646629200000</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"age"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"28.25d"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"phase"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"hot"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"phase_time_millis"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mi">1649003776346</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"action"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"complete"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"action_time_millis"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mi">1649004257477</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"step"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"complete"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"step_time_millis"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mi">1649004257477</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"phase_execution"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"policy"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"mail"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"phase_definition"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"min_age"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"0ms"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"actions"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"rollover"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"max_size"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"50gb"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"max_age"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"30d"</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">},</span>
<span class="w"> </span><span class="nt">"version"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mi">3</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"modified_date_in_millis"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="mi">1647091493509</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">}</span>
<span class="p">}</span>
</code></pre></div>
<p>And you're set.</p>
License to talk upgraded2022-03-17T06:40:00-04:002022-03-17T06:40:00-04:00Gaige B. Paulsentag:www.gaige.net,2022-03-17:/license-to-talk-upgraded.html<p>After nearly 20 years with a Technician class ham license, I've finally taken (and passed)
the test to upgrade my license to General class. Next step is to try for my Extra class
in an attempt to upgrade before my 20th anniversary as a ham next year.</p>
First look 2021 M1 MacBook Pro2021-11-05T18:00:00-04:002021-11-05T18:00:00-04:00Gaige B. Paulsentag:www.gaige.net,2021-11-05:/first-look-2021-m1-macbook-pro.html<p>I last bought a MacBook Pro from Apple in November of 2019, in the midst of a bout of
travel that was about to come to an end. In point of fact, I haven't used my trusty
MacBook Pro much in the last 19 months, since the COVID-19 pandemic started …</p><p>I last bought a MacBook Pro from Apple in November of 2019, in the midst of a bout of
travel that was about to come to an end. In point of fact, I haven't used my trusty
MacBook Pro much in the last 19 months, since the COVID-19 pandemic started, but it
still is my go-to machine for running to the hosting center, taking with me on the few
trips we've been on, or just hanging out with Carol on the weekends. One thing that I
will say about it is that it's been a hot and relatively noisy machine, and it has led
me to keep a power cord in the front parlor.</p>
<p>This is all coming to an end, as Apple "finally" released their professional M1 laptops,
and they look awesome. So far, I've only had mine for the weekend, so this is likely
to just be a first look at how it's working so far.</p>
<h1>M1 migration</h1>
<p>As I've written about before, I generally make a couple of full backups (one to
TimeMachine, one using CarbonCopyCloner) before I migrate to a new Mac using Apple's
Migration Assistant, and this time was no different than previous. Given that both the
2019 and 2021 MacBook Pros have Thunderbolt 3, I was able to hook them up over that
connection for some super-fast copying—after one false start.</p>
<p>Turns out that if you want to hook up two Macs over TB3 for Migration Assistant, you
want to have that connection in place before starting Migration Assistant on either of
them. Apple is typically smart about looking at what networking is available, and will choose
the TB3 direct connection over wired Ethernet or WiFi, but that appears to happen
only if the connection is seen relatively early in the process. My first time through with these
two machines resulted in an 18-hour estimate based on running over WiFi, so I
aborted the process and restarted it after getting both machines connected via TB3.</p>
<p>The process went off without a hitch and everything expected moved over. As is usually the
case nowadays, authorizations needed to be redone for the new hardware, but other than
that and a couple of pieces of software that needed to be reactivated, it was a
straightforward process.</p>
<h1>Developer Performance</h1>
<p>I haven't done as thorough a breakdown as I did with
<a href="https://www.gaige.net/developing-on-a-2019-mac-pro.html">Developing on a 2019 Mac Pro</a>, but I can report
that the preliminary results on Cartographica are excellent. Previously, the
2019 MacBook Pro took about twice as long to build and about 12% less time to run tests
than the Mac Pro. The test times have flattened out between the Mac Pro and the Intel
MacBook Pro, and the Mac Pro still builds in about half the time.
This is owing to some changes in my test jig, expansion of the software during the 1.5
rollout, and overall improvement in test coverage.</p>
<p>The Cartographica results for the M1 MacBook Pro are extremely encouraging:</p>
<table>
<thead>
<tr>
<th>Machine</th>
<th style="text-align:right">Build Time</th>
<th style="text-align:right">Test Time</th>
</tr>
</thead>
<tbody>
<tr>
<td>2021 MacBook Pro (8+2 core M1 Max, 64GB)</td>
<td style="text-align:right">101s</td>
<td style="text-align:right">157s</td>
</tr>
<tr>
<td>2019 Mac Pro (16 core Xeon, 96GB)</td>
<td style="text-align:right">107s</td>
<td style="text-align:right">167s</td>
</tr>
<tr>
<td>2020 Mac Mini (4+4 core M1, 16GB)</td>
<td style="text-align:right">145s</td>
<td style="text-align:right">191s</td>
</tr>
<tr>
<td>2019 MacBook Pro (8 core i9, 32GB)</td>
<td style="text-align:right">202s</td>
<td style="text-align:right">167s</td>
</tr>
<tr>
<td>2018 Mac Mini (6 core i7, 32GB)</td>
<td style="text-align:right">251s</td>
<td style="text-align:right">178s</td>
</tr>
</tbody>
</table>
<p>Tests built against 128c43ef using Xcode 13.1.</p>
<p>It's important to note that the current version of Cartographica is not M1 native.
Unfortunately, a reasonably popular raster format depends on a third-party library that I
have not yet been able to isolate and that is not available for Apple Silicon on the Mac.
This means that all the tests run in emulation on any non-Intel Mac.</p>
<p>It's also interesting to note that the 2021 MacBook Pro ran the whole build/test
process with no fans and was still at 100% battery when it finished. I ran the
tests for the 2019 MacBook Pro while plugged in. Carol could hear the fans 7ft
away. Running on battery didn't substantially change either the performance or
the noise.</p>
<h1>Software compatibility</h1>
<p>So far, most of what I use on my MacBook Pro has been working fine. I did have some
old drivers for my Logitech MX Ergo trackball that didn't run correctly. Installing the
latest version of Logitech Options fixed that problem.</p>
<h2>Docker</h2>
<p>Docker was a bit of a surprise. I guess I hadn't been paying attention, so I didn't
expect it to automatically use QEMU to run an Intel VM in order to run my local DNS
server (the only long-running Docker container that I use). I haven't played around
much with ARM-based containers, but I now have the perfect place to build them.</p>
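<p>If you'd rather be explicit about emulation than rely on the automatic behavior, Docker's <code>--platform</code> flag forces an architecture; a quick sketch (the <code>alpine</code> image here is just an example):</p>
<div class="codehilite"><pre><span></span><code># request an amd64 container; on Apple Silicon this runs under QEMU
docker run --rm --platform linux/amd64 alpine uname -m
# prints: x86_64
</code></pre></div>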
<h2>Brew</h2>
<p>Brew was great. I'd done some work with it when I put the M1 in the build farm, so
I was aware that the Intel and AS versions run in separate directory trees, making
it easy to install partially-Intel and partially-AS. At this point, I have only a couple
of items that are on Intel; everything else is compiled, and bottled, for
Apple Silicon. There are good directions on the <a href="https://brew.sh">brew</a> site and
other places on the web.</p>
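<p>For reference, the two trees are <code>/usr/local</code> (Intel) and <code>/opt/homebrew</code> (Apple Silicon). With both installed, a Rosetta 2 invocation can drive the Intel copy; a sketch (the package name is hypothetical):</p>
<div class="codehilite"><pre><span></span><code># native Apple Silicon brew
/opt/homebrew/bin/brew install some-package
# Intel brew, run under Rosetta 2
arch -x86_64 /usr/local/bin/brew install some-package
</code></pre></div>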
<h1>Battery performance</h1>
<p>So far, the battery performance has been crazy-good. I ran on laptop power for about
3 hours during one stretch this weekend and that resulted in a 13% drop in battery. It
wasn't the heaviest use I could do, but it was representative (I watched a bit of video,
surfed the net, and did some compilation and testing).</p>
<p>Following up on this, I let the battery run down over 3 days and then sat in the parlor,
unplugged, installing and removing software from brew and recompiling and debugging
Cartographica. That got me almost 4 hours of run time with intense workloads and heavy
networking (as well as a bunch of background processing for reindexing, etc.)</p>
<h1>Fans and lap comfort</h1>
<p>By my reckoning, it's still shorts weather (it was 69°F yesterday), so the lap
experience is without the benefit of jeans. As such, I can happily report that while you
might find the MacBook Pro provides a little warming on cold days, it does not
scorch your legs like its i9 predecessor did.</p>
<p>As for fans, maybe I'll write a follow-up to this when I notice them, but so far if
they're running, I couldn't tell you.</p>
<h1>Conclusion</h1>
<p>It's just a first look, but wow is this an interesting machine! So far, software and
hardware have worked great; I'm mostly getting used to the changes in macOS Monterey and
have basically forgotten about the <em>notch</em> (although I've been a long-time user of
<a href="https://www.macbartender.com/">Bartender</a> and I did move to using the Bartender Bar
again on the MacBook Pro to avoid collisions with my many menu bar items).</p>
<p>This machine is an amazing first step into professional Apple Silicon, and based on it,
I fully expect that whatever they replace my 2019 Intel Mac Pro with will be an absolute
beast, and likely in ways that I won't find necessary.</p>
<p>My next question is: for my needs, is the 16" really necessary? I've bought the largest
laptop that Apple makes since the
<a href="https://en.wikipedia.org/wiki/Macintosh_Portable">1989 Macintosh Portable</a> and although
I'm very happy that I no longer carry a 16-lb beast with 1MB of RAM and a 16MHz 68000, the
question remains whether the additional screen real estate is really worth the tray table
issues on airplanes and the extra weight in my backpack. Previously, the 16" (and the 15"
before that, and the 17" before that) provided improved thermals, higher-spec'd processors,
and frequently better options for discrete graphics. All of these reasons appear to be
gone (save the "high power" mode that only exists on the 16"), so the question for me
is going to be: does the additional screen real estate counter the weight and seatback tray
compatibility? It remains to be seen.</p>
Bacula pruning2021-10-05T07:22:00-04:002021-10-05T07:22:00-04:00Gaige B. Paulsentag:www.gaige.net,2021-10-05:/bacula-pruning.html<p>After 18 months of using Bacula and sending copies of my data to the cloud (in this
case, cloud I operate in another location) using an S3-compatible storage mechanism,
I noticed I had a lot of data sitting around on my current server for backups. When
I set out to …</p><p>After 18 months of using Bacula and sending copies of my data to the cloud (in this
case, cloud I operate in another location) using an S3-compatible storage mechanism,
I noticed I had a lot of data sitting around on my current server for backups. When
I set out to move to Bacula, I decided to use long retention times for my core
monthly full backups, which resulted in more than a small number of terabytes of data.</p>
<p>At the time of the implementation (and still the case at the time of this writing), the
automatic options in Bacula for pruning/truncating local copies of cloud datasets were:</p>
<ul>
<li><em>No</em> (do not remove cache)</li>
<li><em>AfterUpload</em> (each part removed directly after upload)</li>
<li><em>AtEndOfJob</em> (each part removed at the end of the job)</li>
</ul>
<p>None of these would work for me, as I want to retain the data for months locally, only
giving up my cached copy when I'm outside of my normal restore window, or when I need the
space.</p>
<p>There are a number of ways to prune, depending on how much you want to get into the
Bacula mindset.</p>
<h3>Manual purge using find</h3>
<p>It turns out that if you leave the label intact (the label being <code>part.1</code>
in the volume directory), you can delete any parts in the cloud volume and they will
be auto-retrieved during a restore. This will allow you to override any settings you
have in <code>bacula-dir.conf</code> for your <code>CacheRetention</code> and just manually purge in any
way you like. In my case, I made use of find:</p>
<div class="codehilite"><pre><span></span><code>find<span class="w"> </span>.<span class="w"> </span>-regextype<span class="w"> </span>posix-egrep<span class="w"> </span>-regex<span class="w"> </span><span class="s1">'.*\/Vol-.*\/part\.([2-9]|..+)'</span><span class="w"> </span>-exec<span class="w"> </span>rm<span class="w"> </span><span class="se">\{\}</span><span class="w"> </span><span class="se">\;</span>
</code></pre></div>
<p>This particular command uses a POSIX regular expression to find any file in any directory
starting <code>Vol-</code> and named <code>part.N</code> where <em>N</em> is any number other than 1.</p>
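<p>Since the regular expression is easy to get wrong, a dry run is worth doing first; the same expression with <code>-print</code> in place of <code>-exec rm</code> just lists what would be removed:</p>
<div class="codehilite"><pre><span></span><code># list (without deleting) every part other than part.1
find . -regextype posix-egrep -regex '.*\/Vol-.*\/part\.([2-9]|..+)' -print
</code></pre></div>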
<h3>Manual pruning using bconsole</h3>
<p>Bacula's console (bconsole) has a Cloud command which can be used to force a
prune operation. The <code>cloud prune</code> command respects the <code>CacheRetention</code> setting and
has a number of command-line parameters to allow you to specify what you want to prune.
You can prune by <code>storage</code>, <code>pool</code>, or even <code>MediaType</code>. There is also a parameter
to prune <code>AllPools</code>.</p>
<p>In my case, I used:</p>
<div class="codehilite"><pre><span></span><code>cloud prune AllFromPool Storage=Cloud-CT Pool=File
</code></pre></div>
<p>which breaks down to:</p>
<ul>
<li><code>cloud</code> command</li>
<li><code>prune</code> sub-command</li>
<li><code>AllFromPool</code>: run the prune operation on all volumes in the pool</li>
<li><code>Storage=</code>: use the specific Storage definition (in this case <code>Cloud-CT</code>)</li>
<li><code>Pool=</code>: use the specific Pool (in this case <code>File</code>)</li>
</ul>
<p>For ClueTrust, we use 3 different pools in our storage:</p>
<ul>
<li><code>File</code> for the full backups (historical naming convention)</li>
<li><code>Inc-File</code> for the daily incremental backups (from the last File backup)</li>
<li><code>Diff-File</code> for the weekly differential backups (from the last File backup)</li>
</ul>
<p>In this case, I only want to purge the full backups that are outside of the range of the
incremental and differential backups. To that end, I've set the <code>CacheRetention</code>
appropriately in my <code>bacula-dir.conf</code> file and so I can trust bacula to clear these
correctly.</p>
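<p>For reference, <code>CacheRetention</code> lives in the Pool resource of <code>bacula-dir.conf</code>; a minimal sketch (the retention value is an assumption, pick one that covers your restore window):</p>
<div class="codehilite"><pre><span></span><code># bacula-dir.conf (excerpt)
Pool {
  Name = File
  Pool Type = Backup
  Cache Retention = 60 days
}
</code></pre></div>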
<h3>Automatic pruning using bacula admin jobs</h3>
<p>I've read that this is possible, but I haven't found the appropriate documentation yet.
At this point, I can't recommend it, but the other two processes work fine and are easily
scripted if need be.</p>
Rclone to the rescue2021-10-05T06:01:00-04:002021-10-05T06:01:00-04:00Gaige B. Paulsentag:www.gaige.net,2021-10-05:/rclone-to-the-rescue.html<p>Back in September of last year, I wrote in
<a href="https://www.gaige.net/bacula-6-months-on.html">Bacula: 6 months on</a>
that cloud backups required <code>part.0</code> in order to be recognized for automatic part retrieval.</p>
<p>While this was mostly accurate, the critical file is actually <code>part.1</code>. As such, when
referencing my own blog post when trimming …</p><p>Back in September of last year, I wrote in
<a href="https://www.gaige.net/bacula-6-months-on.html">Bacula: 6 months on</a>
that cloud backups required <code>part.0</code> in order to be recognized for automatic part retrieval.</p>
<p>While this was mostly accurate, the critical file is actually <code>part.1</code>. As such, when
I referenced my own blog post while trimming my bacula storage, I deleted the wrong files,
leaving my "volumes" without labels, and thus rendering automatic part retrieval
inoperative.</p>
<p>To remedy this, I needed to sync the <code>part.1</code> files back into my local cache. As I'm
using an S3-style storage mechanism for my remotes, I decided to use
<a href="https://rclone.org">Rclone</a> to bring
the files back.</p>
<p>The command looks like this (this is the safe version that doesn't copy anything, hence
<code>-n</code>; remove that flag when you're satisfied it's good to go):</p>
<div class="codehilite"><pre><span></span><code>rclone<span class="w"> </span>sync<span class="w"> </span>-n<span class="w"> </span>--no-update-modtime<span class="w"> </span>s3-server-ref:s3-bucket<span class="w"> </span><span class="nb">local</span><span class="w"> </span>--include<span class="w"> </span><span class="s2">"part.1"</span>
</code></pre></div>
<p>Breaking this command down:</p>
<ul>
<li>Use the <code>rclone</code> command</li>
<li><code>sync</code> subcommand will synchronize from source to destination</li>
<li><code>--no-update-modtime</code> attempts to leave the modification time the same</li>
<li><code>s3-server-ref</code> is a reference to an Rclone server created with <code>rclone config</code></li>
<li><code>s3-bucket</code> is the bucket (or path) of the source</li>
<li><code>local</code> is the path to the local destination</li>
<li><code>--include "part.1"</code> is a filter that only copies the specific filename</li>
</ul>
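<p>As a sanity check before syncing, <code>rclone lsf</code> with the same filter will list (recursively, with <code>-R</code>) exactly the label files that would be copied:</p>
<div class="codehilite"><pre><span></span><code>rclone lsf -R --include "part.1" s3-server-ref:s3-bucket | wc -l
</code></pre></div>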
<p>Running the command as a dry run (<code>-n</code>) produces output like the following (partial results shown):</p>
<div class="codehilite"><pre><span></span><code><span class="m">2021</span>/10/05<span class="w"> </span><span class="m">10</span>:06:56<span class="w"> </span>NOTICE:<span class="w"> </span>Vol-0998/part.1:<span class="w"> </span>Not<span class="w"> </span>copying<span class="w"> </span>as<span class="w"> </span>--dry-run
<span class="m">2021</span>/10/05<span class="w"> </span><span class="m">10</span>:06:56<span class="w"> </span>NOTICE:<span class="w"> </span>Vol-1101/part.1:<span class="w"> </span>Not<span class="w"> </span>copying<span class="w"> </span>as<span class="w"> </span>--dry-run
<span class="m">2021</span>/10/05<span class="w"> </span><span class="m">10</span>:06:56<span class="w"> </span>NOTICE:<span class="w"> </span>Vol-1105/part.1:<span class="w"> </span>Not<span class="w"> </span>copying<span class="w"> </span>as<span class="w"> </span>--dry-run
<span class="m">2021</span>/10/05<span class="w"> </span><span class="m">10</span>:06:56<span class="w"> </span>NOTICE:<span class="w"> </span>Vol-1106/part.1:<span class="w"> </span>Not<span class="w"> </span>copying<span class="w"> </span>as<span class="w"> </span>--dry-run
<span class="m">2021</span>/10/05<span class="w"> </span><span class="m">10</span>:06:56<span class="w"> </span>NOTICE:<span class="w"> </span>Vol-1110/part.1:<span class="w"> </span>Not<span class="w"> </span>copying<span class="w"> </span>as<span class="w"> </span>--dry-run
<span class="m">2021</span>/10/05<span class="w"> </span><span class="m">10</span>:06:56<span class="w"> </span>NOTICE:<span class="w"> </span>Vol-0999/part.1:<span class="w"> </span>Not<span class="w"> </span>copying<span class="w"> </span>as<span class="w"> </span>--dry-run
</code></pre></div>
<p>which indicates that each of these files would be copied if this had not been a dry run.</p>
<p>Re-run the command removing <code>-n</code> and you should get the desired results.</p>
GitLab stuck MR2021-09-26T17:40:00-04:002021-09-26T17:40:00-04:00Gaige B. Paulsentag:www.gaige.net,2021-09-26:/gitlab-stuck-mr.html<p>MRs (Merge Requests) in GitLab are similar to PRs (Pull Requests) in GitHub, although
the process and language around them are slightly different. The name specifically
refers to the request to merge into another branch from a branch (or repository) that
isn't the same. Simple enough.</p>
<p>Most of the time …</p><p>MRs (Merge Requests) in GitLab are similar to PRs (Pull Requests) in GitHub, although
the process and language around them are slightly different. The name specifically
refers to the request to merge into another branch from a branch (or repository) that
isn't the same. Simple enough.</p>
<p>Most of the time since moving to GitLab I've been extremely happy with both our productivity
and the stability of GitLab, especially considering that we're running it from a docker
container in SmartOS. All told, it's been a really good experience.</p>
<p>With that said, it hasn't come without bumps. Mostly these are related to the operating
environment, and result in relatively straightforward failures, such as an inability
to see memory consumption correctly.</p>
<p>Today, though, I got a bonus problem. This one took some real time and resulted in not
only digging through gitlab.com's issues, but dropping into the rails console for gitlab
to resolve it.</p>
<p>I'm not sure what actually caused this problem, and it may well be that I did or am
doing something problematic. But, the result was that I had a Merge Request (MR) that
was in the "merging" state for nearly an hour. Considering most other merges have taken
seconds, even with my large codebase for Cartographica, it seemed like something was wrong.</p>
<p>I dug around on <a href="https://gitlab.com">gitlab.com</a> looking for some answers and ran across a couple
of aging examples of a similar "stuck" MR.</p>
<p>If you find yourself in this position, you may want to look at:</p>
<p><a href="https://gitlab.com/gitlab-org/gitlab-foss/-/issues/18048">Issue 18048</a> which details a
few different diagnostics and work-arounds.</p>
<p>I used the gitlab console by logging in to the server container and</p>
<div class="codehilite"><pre><span></span><code>gitlab-rails<span class="w"> </span>console
</code></pre></div>
<p>to get the console running and then:</p>
<div class="codehilite"><pre><span></span><code><span class="n">proj</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="no">Project</span><span class="o">.</span><span class="n">find_by_full_path</span><span class="p">(</span><span class="s1">'namespace/myproject'</span><span class="p">)</span>
<span class="n">mr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">proj</span><span class="o">.</span><span class="n">merge_requests</span><span class="o">.</span><span class="n">find_by</span><span class="p">(</span><span class="ss">iid</span><span class="p">:</span><span class="w"> </span><span class="mi">123</span><span class="p">)</span>
<span class="n">mr</span><span class="o">.</span><span class="n">state</span>
</code></pre></div>
<p>This prints out the status of the merge request, which in my case was <code>locked</code>.</p>
<p>I did the following:</p>
<ol>
<li>Verified that the record looked OK by using <code>mr.valid?</code></li>
<li>Unlocked the MR using <code>mr.unlock_mr</code>, which resulted in <code>true</code></li>
</ol>
<p>After that, the state returned to <code>open</code> and I was able to go to the UI and merge it.</p>
<p>Further down there were recommendations for going all the way to the database console
using:</p>
<div class="codehilite"><pre><span></span><code><span class="c1"># login to prostgres instance</span>
sudo<span class="w"> </span>gitlab-rails<span class="w"> </span>dbconsole
--<span class="w"> </span>check<span class="w"> </span>the<span class="w"> </span>list<span class="w"> </span>of<span class="w"> </span>locked<span class="w"> </span>merge<span class="w"> </span>requests
<span class="k">select</span><span class="w"> </span>id,<span class="w"> </span>iid<span class="w"> </span>as<span class="w"> </span>merge_id,<span class="w"> </span>source_branch,<span class="w"> </span>target_branch,<span class="w"> </span>locked_at,<span class="w"> </span>merge_error,<span class="w"> </span>merge_commit_sha,<span class="w"> </span>in_progress_merge_commit_sha,<span class="w"> </span>merge_status,<span class="w"> </span>state<span class="w"> </span>from<span class="w"> </span>merge_requests<span class="w"> </span>where<span class="w"> </span><span class="nv">state</span><span class="o">=</span><span class="s1">'locked'</span><span class="w"> </span>order<span class="w"> </span>by<span class="w"> </span>id<span class="w"> </span>desc<span class="p">;</span>
</code></pre></div>
<p>But, I didn't find that necessary.</p>
<p>I'll put here (lest I forget it later) that I also made use of the log tailing mechanism
(<code>gitlab-ctl tail</code>) which can also be directed at a specific log by adding an argument.</p>
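<p>For example, to follow just the rails logs (the argument is any log directory under <code>/var/log/gitlab</code>; <code>gitlab-rails</code> is a common one):</p>
<div class="codehilite"><pre><span></span><code>gitlab-ctl tail gitlab-rails
</code></pre></div>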
Pelican plugin updates2021-08-29T07:44:00-04:002021-08-29T07:44:00-04:00Gaige B. Paulsentag:www.gaige.net,2021-08-29:/pelican-plugin-updates.html<p>One of the advantages of our recent pivot to gitlab is that I'm spending some time
looking at existing repositories and doing some updates.</p>
<p>Most of my repos are private and hosted on our private gitlab server. For public code,
I generally place it on <a href="https://github.com">GitHub</a>.
With the recent automation …</p><p>One of the advantages of our recent pivot to gitlab is that I'm spending some time
looking at existing repositories and doing some updates.</p>
<p>Most of my repos are private and hosted on our private gitlab server. For public code,
I generally place it on <a href="https://github.com">GitHub</a>.
With the recent automation changes on GitHub, I've been doing some updates to get CI
running there as well. For the most part, this uses whatever templates exist for the
project (in the case of plugins, for example) or the GitHub default.</p>
<h2>Updating pelican plugins</h2>
<p>I've mentioned before my move to
<a href="https://www.gaige.net/gaiges-pages-moves-to-static-generation.html">static site generation</a>,
and also the plugins that I've created or modified for Pelican:</p>
<ul>
<li><a href="https://www.gaige.net/update-to-nginx_alias_map.html">nginx_alias_map</a></li>
<li><a href="https://www.gaige.net/pre-commit-and-pelican.html">markdown-it-reader</a></li>
</ul>
<p>Yesterday, I decided to do a little clean-up of both of these plugins. In particular:</p>
<ol>
<li>Updated <code>nginx_alias_map</code> to use the Pelican cookiecutter template</li>
<li>Added tests to both plugins</li>
<li>Verified that CI was working correctly for both plugins</li>
<li>Fixed a Python 3.6-specific bug in <code>nginx_alias_map</code></li>
<li>Bumped versions to 1.0/Production and changed environment to <code>Plugins</code></li>
</ol>
<p>The big change here is that <code>nginx_alias_map</code> is now on <a href="https://pypi.org">PyPI</a>
and can now be installed as a dependency without having to use a git
submodule as I'd been doing previously.</p>
<p>If you're interested, check out:</p>
<ul>
<li><a href="https://pypi.org/project/pelican-nginx-alias-map/">nginx_alias_map</a></li>
<li><a href="https://pypi.org/project/pelican-markdown-it-reader/">markdown-it-reader</a></li>
</ul>
<p>They're both easier to use now, with <code>pip install pelican-nginx-alias-map</code> or
<code>pip install pelican-markdown-it-reader</code> as the installation method.</p>
<p>To support using <code>pre-commit</code> with pelican (as detailed in
<a href="https://www.gaige.net/pre-commit-and-pelican.html">Pre-commit and Pelican</a>), I have
also updated my <code>mdformat</code> plugins:</p>
<ul>
<li><a href="https://github.com/gaige/mdformat-footnote">mdformat-footnote</a></li>
<li><a href="https://github.com/gaige/mdformat-pelican">mdformat-pelican</a></li>
</ul>
<p>These (respectively) avoid reformatting or errors when using the footnote plugin
with markdown-it, or when using pelican-specific items, such as
<code>{filename}</code>.</p>
Deploying with Gitlab2021-07-25T18:49:00-04:002021-07-25T18:49:00-04:00Gaige B. Paulsentag:www.gaige.net,2021-07-25:/deploying-with-gitlab.html<p>In June, I mentioned in an article about <a href="https://www.gaige.net/docker-on-smartos.html">Docker on SmartOS</a>
that we are doing some work with <a href="https://gitlab.com">GitLab</a>
these days as a replacement for my venerable <a href="https://gitolite.com">Gitolite</a> server (and, to an increasing
extent <a href="https://jenkins.io">Jenkins</a>).</p>
<h2>Deploying from Pelican</h2>
<p>I'm likely going to write more on GitLab in the near future …</p><p>In June, I mentioned in an article about <a href="https://www.gaige.net/docker-on-smartos.html">Docker on SmartOS</a>
that we are doing some work with <a href="https://gitlab.com">GitLab</a>
these days as a replacement for my venerable <a href="https://gitolite.com">Gitolite</a> server (and, to an increasing
extent <a href="https://jenkins.io">Jenkins</a>).</p>
<h2>Deploying from Pelican</h2>
<p>I'm likely going to write more on GitLab in the near future, but for now, I'd
like to document some things I've learned about deploying with Gitlab.</p>
<p>This blog is deployed in a semi-automated fashion. As mentioned
<a href="https://www.gaige.net/static-pages-18-months-on.html">previously</a>,
it is compiled using <a href="https://blog.getpelican.com"><code>pelican</code></a>
and served as static pages using <a href="http://nginx.org">nginx</a>.</p>
<p>As such, once modifications are made, I'm ready to verify that they look OK
and work correctly on the Stage Server; once I'm happy with that deployment,
it's time to push to production.</p>
<p>Historically, I started out by doing a complete rebuild of the server serving
up the pages. That got tedious if I was writing a lot of posts (or, at least,
if I was writing posts more frequently than OS releases and nginx releases).
Eventually, I modified my <a href="https://www.ansible.com">ansible</a> scripts so that
they had a tag for <code>publish</code> which would skip the re-provisioning process
and the process of building new certs, etc. and just deploy the latest pelican,
build the pages, and reset the cache. In fact, it would do so in a separate
directory, so that it would flash-cut the web pages.</p>
<p>While rolling out GitLab, I started playing with the CI tools and realized there
was a lot I could do with it, much of it more easily than I could with Jenkins.
As such, an automatic build to stage followed by a manually-triggered build to
production was simple to configure.</p>
<p>So, I set out on my next automation journey with GitLab...</p>
<h2>Access control</h2>
<p>One nice thing about running the CI under the rubric of the SCM is that you can
grant permissions to do source-related things just from the SCM. This makes it
simple to pull from multiple repositories and perform other SCM-specific tasks.</p>
<p>However, this doesn't specifically extend beyond the CI and SCM and into the
deployment. So, my next question was how to control access to the hosts and
make sure that I could control them, and retrieve the code without trouble.</p>
<p>Further, I wanted to re-use the ansible playbooks that I used to deploy the
systems (albeit with tags to reduce the plays), while limiting access to the
stage and production servers (not the SmartOS global zones they're deployed
from). Since I was reusing these mechanisms, I wanted to leave the existing
ssh-based access controls in place.</p>
<p>As an aside, I could now switch my deployment method for git repositories to
using deployment or personal access tokens, but I'd rather not right now.</p>
<h2>SSH solution</h2>
<p>My existing deployment pattern automatically deals with what I refer to as
<code>ssh_access_keys</code>, which are SSH keys that are used for root access to the
servers. These are generally used infrequently (there are separate deployment
keys that are multi-server), but when accessing only the VM, the
<code>ssh_access_keys</code> are precisely the right tool.</p>
<p>When running on the CI server, I need to adopt the ssh key as part of the
CI process, and I use <code>ssh-agent</code> to do that (one agent per running CI
process, segregated by the socket/pid combination). It's simple to start this by
using:</p>
<div class="codehilite"><pre><span></span><code>eval $(ssh-agent -s)
</code></pre></div>
<p>This creates the agent and sets the shell variables so the agent is reachable.</p>
<p>Then comes the real trick: loading the ssh key into the agent. I had a vague
recollection that it was possible to load a key from a shell variable, and
here's how to do it:</p>
<div class="codehilite"><pre><span></span><code>echo "$HOST_DEPLOY_KEY" | tr -d '\r' | ssh-add -
</code></pre></div>
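<p>If your runner's shell environment outlives the job, it may be worth killing the agent once the job is done; a minimal sketch for an <code>after_script</code> step:</p>
<div class="codehilite"><pre><span></span><code># kills the agent identified by SSH_AGENT_PID and clears the variables
eval $(ssh-agent -k)
</code></pre></div>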
<h2>Ansible ssh control paths</h2>
<p>While getting this put together, I ran across an issue with the length of the
path for <code>ANSIBLE_SSH_CONTROL_PATH</code>, which is used by SSH to persist connections
(in our configuration). Especially on Solaris (and derivatives, like SmartOS),
there's a length limit on the control file's path, which caused a problem with the
relatively deep nesting that gitlab runners use for their paths. The solution
was to define a bespoke path:</p>
<div class="codehilite"><pre><span></span><code><span class="k">export</span><span class="w"> </span><span class="n">ANSIBLE_SSH_CONTROL_PATH_DIR</span><span class="o">=/</span><span class="n">tmp</span><span class="o">/$</span><span class="p">{</span><span class="n">CI_JOB_ID</span><span class="p">}</span><span class="o">-$</span><span class="p">{</span><span class="n">CI_COMMIT_SHORT_SHA</span><span class="p">}</span><span class="o">/.</span><span class="n">ansible</span><span class="o">/</span><span class="n">cp</span>
</code></pre></div>
<p>Note that this path is in <code>/tmp</code>, not in <code>~</code>, and certainly not in the build directory;
it does, however, change for every job and repo.</p>
<h2>Final gitlab script</h2>
<div class="codehilite"><pre><span></span><code><span class="w"> </span><span class="nt">script</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">eval $(ssh-agent -s)</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">echo "$HOST_DEPLOY_KEY" | tr -d '\r' | ssh-add -</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">export HOME=$(pwd)</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">export ANSIBLE_SSH_CONTROL_PATH_DIR=/tmp/${CI_JOB_ID}-${CI_COMMIT_SHORT_SHA}/.ansible/cp</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="s">'git</span><span class="nv"> </span><span class="s">config</span><span class="nv"> </span><span class="s">--global</span><span class="nv"> </span><span class="s">url."https://gitlab-ci-token:${CI_JOB_TOKEN}@your.git.server/".insteadOf</span><span class="nv"> </span><span class="s">git@your.git.server:'</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">git clone https://gitlab-ci-token:${CI_JOB_TOKEN}@your.git.server/playbooks/ansible-web.git</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">cd ansible-web</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">ansible-galaxy install -r requirements.yml -f</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">ansible-playbook -i stage -t publish --vault-password-file $VAULT_SECRET -e cert_renew_days=0 pelican.yml</span>
</code></pre></div>
<p>Putting it all together:</p>
<ul>
<li>set up SSH (lines 1 & 2)</li>
<li>set HOME so that we're not stomping on another cache; this may not be necessary
if you can guarantee that only one runner will be running at a time in each
account (line 3)</li>
<li>set the ansible control path (line 4)</li>
<li>rewrite our git URLs globally (line 5)</li>
<li>check out our ansible playbook repository (line 6)</li>
<li>change into the checkout and update the ansible galaxy requirements (lines 7 & 8)</li>
<li>run the playbook against our stage server (line 9)</li>
</ul>
<p>You might be wondering about line 5, where we use an interesting feature of git
to rewrite the URLs. This might not be explicitly necessary if I were to allow
the ssh key that I use for deployment access to all of my dependencies in my
git repo. However, I've left it that way for future compatibility and because
it confines this particular script to being run by the CI server.</p>
<p>So, for those keeping score, the gitlab server runs the gitlab script on a SmartOS host
which runs the gitlab agent, and thus the ansible runs on SmartOS.
Theoretically the ansible could run on basically anything (my Jenkins versions of
this ran on macOS Jenkins nodes), but our provisioning is done from SmartOS these days,
so keeping things the same is a good thing.</p>
<h2>Manually-triggered releases</h2>
<p>I mentioned in the beginning that I was going to be manually triggering the release
to production. This is done using a rule in the GitLab CI configuration:</p>
<div class="codehilite"><pre><span></span><code><span class="nt">deploy-prod</span><span class="p">:</span>
<span class="w"> </span><span class="nt">tags</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">[</span><span class="nv">ansible</span><span class="p p-Indicator">]</span>
<span class="w"> </span><span class="nt">stage</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">prod</span>
<span class="w"> </span><span class="nt">environment</span><span class="p">:</span>
<span class="w"> </span><span class="nt">name</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">production</span>
<span class="w"> </span><span class="nt">url</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">https://${SERVER}</span>
<span class="w"> </span><span class="nt">script</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="s">"...</span><span class="nv"> </span><span class="s">see</span><span class="nv"> </span><span class="s">above</span><span class="nv"> </span><span class="s">..."</span>
<span class="w"> </span><span class="nt">rules</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">if</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">$CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH</span>
<span class="w"> </span><span class="nt">when</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">manual</span>
</code></pre></div>
<p>This job definition requires that the runner be tagged <code>ansible</code>, names the stage
<code>prod</code>, sets up an environment for prod with the URL pointing at our final
server, includes the script above, and then conditionally (and only on the
main branch) holds for manual release.</p>
<h2>Script locations</h2>
<p>One additional note I'll make is that I made some potentially interesting
decisions on where to place the gitlab scripts. Since I tend to have multiple
hosts (or groups) using the same ansible plays, I knew I wanted
a place to share the scripts calling them. Their requirements tend to be more
aligned with the ansible playbooks than the code that is deployed. As such, I
placed the gitlab-ci jobs as templates in my ansible playbook repositories in
a <code>gitlab-deploy</code> directory. I aligned the names with the playbooks.</p>
<p>To call these, I use the <code>include</code> directive in the <code>.gitlab-ci.yml</code> files for
the repositories I'm deploying:</p>
<div class="codehilite"><pre><span></span><code><span class="c1"># This will work, but not on the python runner (yet)</span>
<span class="nt">include</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">project</span><span class="p">:</span><span class="w"> </span><span class="s">'playbooks/ansible-web'</span>
<span class="w"> </span><span class="nt">file</span><span class="p">:</span><span class="w"> </span><span class="s">'gitlab-deploy/pelican.yml'</span>
<span class="nt">variables</span><span class="p">:</span>
<span class="w"> </span><span class="nt">SERVER_GROUP</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">gaiges_pages</span>
<span class="w"> </span><span class="nt">SERVER</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">www.gaige.net</span>
</code></pre></div>
<p>Note the additional variables. Since this deployment script is used by both
the <a href="https://gaige.net">Gaige's Pages</a> and <a href="https://blog.cartographica.com">Cartographica</a>
blogs, I needed a way to pass in the server and group names.</p>
Docker on SmartOS2021-06-11T10:00:00-04:002021-06-11T10:00:00-04:00Gaige B. Paulsentag:www.gaige.net,2021-06-11:/docker-on-smartos.html<p>This spring, there was some movement on the Illumos/SmartOS front in
implementing features to better support running LX zones with Linux variants.
Since Docker images (generally) run on Linux underpinnings, support for running
Docker images on SmartOS is dependent upon this support working correctly.</p>
<p>For those familiar with …</p><p>This spring, there was some movement on the Illumos/SmartOS front in
implementing features to better support running LX zones with Linux variants.
Since Docker images (generally) run on Linux underpinnings, support for running
Docker images on SmartOS is dependent upon this support working correctly.</p>
<p>For those familiar with <a href="https://www.joyent.com/triton/on-premise">Triton</a>, you
know that Triton can run Docker directly as part of its standard configuration. But,
for those of us who don't run Triton, but do run SmartOS, there are some steps that can
be taken to use Docker images under SmartOS.</p>
<p>To provide a concrete example, I'm going to use GitLab as the example in this article.</p>
<h2>Docker on SmartOS, the harder way</h2>
<p>I'm completely stealing that line (and adapting some of the content) from a
<a href="https://jasper.la/posts/docker-on-smartos-the-harder-way/">2016 blog post</a>
by Jasper Lievisse Adriaanse of the same name.</p>
<p>His summary of using Docker on SmartOS was the best resource that I found for
creating virtual machines using <code>vmadm</code>.</p>
<h3>Getting docker images on SmartOS</h3>
<p><a href="https://hub.docker.com">Docker Hub</a> is "the world's largest library... for container images"
and as such is basically where you want to go to get your docker images.</p>
<p>To get access to Docker Hub, you need to add the source to <code>imgadm</code>:</p>
<div class="codehilite"><pre><span></span><code><span class="c1"># imgadm sources --add-docker-hub</span>
</code></pre></div>
<p>As noted in the post, <code>imgadm avail</code> doesn't work against Docker Hub, so you'll need
to search there manually or get it directly from another source. Once you know
what docker image you need, you can add it using <code>imgadm import</code>.</p>
<div class="codehilite"><pre><span></span><code><span class="c1"># imgadm import gitlab/gitlab-ee:latest</span>
</code></pre></div>
<p>As is common with docker, some of the items in the image descriptor can be left
off, most notably <code>:latest</code> can be omitted and the <code>latest</code> tag will be used by
default.</p>
<p>Once you have the images loaded, you can see them weeded out from the rest of your images
by using:</p>
<div class="codehilite"><pre><span></span><code><span class="c1"># imgadm list --docker</span>
UUID<span class="w"> </span>REPOSITORY<span class="w"> </span>TAG<span class="w"> </span>IMAGE_ID<span class="w"> </span>CREATED
a001c571-b91a-a5b0-c251-0514a0d4a174<span class="w"> </span>gitlab/gitlab-ce<span class="w"> </span>latest<span class="w"> </span>sha256:42486<span class="w"> </span><span class="m">2021</span>-06-07T19:28:36Z
9080e799-d964-782e-e369-87d339e50798<span class="w"> </span>gitlab/gitlab-ee<span class="w"> </span>latest<span class="w"> </span>sha256:1f383<span class="w"> </span><span class="m">2021</span>-06-07T19:36:13Z
</code></pre></div>
<h3>Reasoning through the requirements</h3>
<p>One disadvantage of using SmartOS natively for docker in comparison to using docker
on Linux is that there isn't a docker control daemon to set things up for you. As
such, you'll need to dig a bit into the requirements in order to make sure you
have all of the right settings to get up and running.</p>
<p>You'll need to take a look at the docker parameters. Some of these parameters are baked
into the images during the build phase, and others are usually shown in command-line
arguments in the instructions to run the code. As is frequently the case, there are
some of each to pay attention to in gitlab.</p>
<p>The <a href="https://docs.gitlab.com/omnibus/docker/">instructions for running gitlab in docker</a>
(as of 2021-06-11) call for the following docker command line:</p>
<div class="codehilite"><pre><span></span><code>sudo<span class="w"> </span>docker<span class="w"> </span>run<span class="w"> </span>--detach<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--hostname<span class="w"> </span>gitlab.example.com<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--publish<span class="w"> </span><span class="m">443</span>:443<span class="w"> </span>--publish<span class="w"> </span><span class="m">80</span>:80<span class="w"> </span>--publish<span class="w"> </span><span class="m">22</span>:22<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--name<span class="w"> </span>gitlab<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--restart<span class="w"> </span>always<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--volume<span class="w"> </span><span class="nv">$GITLAB_HOME</span>/config:/etc/gitlab<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--volume<span class="w"> </span><span class="nv">$GITLAB_HOME</span>/logs:/var/log/gitlab<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--volume<span class="w"> </span><span class="nv">$GITLAB_HOME</span>/data:/var/opt/gitlab<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>gitlab/gitlab-ee:latest
</code></pre></div>
<p>Let's look at the key parameters here:</p>
<table>
<thead>
<tr>
<th>argument</th>
<th>docker</th>
<th>SmartOS json</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>hostname</td>
<td>gitlab.example.com</td>
<td>hostname</td>
<td></td>
</tr>
<tr>
<td>publish</td>
<td>443:443, 80:80, 22:22</td>
<td>N/A</td>
<td>See network section</td>
</tr>
<tr>
<td>name</td>
<td>gitlab</td>
<td>alias</td>
<td>I used the FQDN here</td>
</tr>
<tr>
<td>restart</td>
<td>always</td>
<td>N/A</td>
<td>no equivalent</td>
</tr>
<tr>
<td>volume</td>
<td><em>various</em></td>
<td>filesystems</td>
<td>See file system section</td>
</tr>
</tbody>
</table>
<p>In addition to the parameters on the command line, there are also parameters inherent in
the docker container that we need to propagate to the SmartOS JSON.</p>
<p>We can see the key information in the JSON that comes with the docker images by using</p>
<div class="codehilite"><pre><span></span><code><span class="c1"># imgadm info <uuid></span>
</code></pre></div>
<p>which will output the json for the image.</p>
<p>The key section to look for is the <code>tags</code> section, which in this version of gitlab-ee
contains:</p>
<div class="codehilite"><pre><span></span><code><span class="nt">"tags"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"docker:repo"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gitlab/gitlab-ee"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"docker:id"</span><span class="p">:</span><span class="w"> </span><span class="s2">"sha256:1f38337b3401d2536562e4323999233b665aa41a2e6ef2c7509a0b938e53d94d"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"docker:architecture"</span><span class="p">:</span><span class="w"> </span><span class="s2">"amd64"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"docker:tag:latest"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"docker:config"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"Cmd"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="s2">"/assets/wrapper"</span>
<span class="w"> </span><span class="p">],</span>
<span class="w"> </span><span class="nt">"Entrypoint"</span><span class="p">:</span><span class="w"> </span><span class="kc">null</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"Env"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="s2">"PATH=/opt/gitlab/embedded/bin:/opt/gitlab/bin:/assets:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"</span><span class="p">,</span>
<span class="w"> </span><span class="s2">"LANG=C.UTF-8"</span><span class="p">,</span>
<span class="w"> </span><span class="s2">"TERM=xterm"</span>
<span class="w"> </span><span class="p">],</span>
<span class="w"> </span><span class="nt">"WorkingDir"</span><span class="p">:</span><span class="w"> </span><span class="s2">""</span>
<span class="w"> </span><span class="p">}</span>
</code></pre></div>
<p>Again, we'll look at the key parameters of the <code>docker:config</code> object:</p>
<table>
<thead>
<tr>
<th>json path</th>
<th>value</th>
<th>SmartOS json</th>
<th>notes</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>Cmd</code></td>
<td><code>['/assets/wrapper']</code></td>
<td><code>docker:cmd</code></td>
<td></td>
</tr>
<tr>
<td><code>Entrypoint</code></td>
<td><code>null</code></td>
<td><code>docker:entrypoint</code></td>
<td>drop in this case, since it's empty</td>
</tr>
<tr>
<td><code>Env</code></td>
<td><code>[ "PATH=...", ... ]</code></td>
<td><code>docker:env</code></td>
<td>The entire JSON here needs to be encoded as a single string value</td>
</tr>
<tr>
<td><code>WorkingDir</code></td>
<td><code>""</code></td>
<td><code>docker:workdir</code></td>
<td></td>
</tr>
<tr>
<td><code>User</code></td>
<td><em>not present</em></td>
<td><code>docker:user</code></td>
<td>When commands need to run as a specific user</td>
</tr>
</tbody>
</table>
<p>It's important to note that the "dockery" configuration elements are all strings, so you
need to appropriately quote them to get them in the <code>internal_metadata</code> portion of your
json for <code>vmadm</code>.</p>
<h3>Setting up the storage</h3>
<p>Your specific mileage may vary. In many cases, images may run only with ephemeral storage,
in which case, you have nothing to do for <code>filesystems</code>, but in this example case, we have
three specific mount points: <code>/etc/gitlab</code>, <code>/var/log/gitlab</code>, and <code>/var/opt/gitlab</code>. In
our SmartOS systems, we use a separate <code>data</code> pool (usually spinning rust, in contrast
to the <code>zones</code> pool, which is all SSD), so you'll see that in the example. Further,
we have a naming standard for volumes that requires the FQDN and then the mount point.</p>
<p>Once you've figured out what ZFS file systems you need, create them using <code>zfs create</code>. I'll
leave the details as an exercise for the reader.</p>
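<p>That said, a minimal sketch following the naming standard above (the dataset names are assumptions based on the example FQDN):</p>
<div class="codehilite"><pre><span></span><code># create the parent dataset and the three mount points
zfs create -p data/gitlab.example.com/config
zfs create data/gitlab.example.com/logs
zfs create data/gitlab.example.com/data
</code></pre></div>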
<h3>Constructing the vmadm json</h3>
<p>As is frequently the case (I may address doing this in ansible at a later date, but
so far these are all bespoke), you want to have a json file containing the parameters
of the new zone, so that you can pass them along using <code>vmadm create -f x.json</code>.</p>
<p>Now, we need to put together what we know from the information we've gathered so far:</p>
<ol>
<li>Start with a template LX zone</li>
<li>Make sure <code>brand</code> is <code>lx</code> and <code>docker</code> is <code>true</code></li>
<li>Set the docker image UUID in <code>image_uuid</code></li>
<li>Set up your network as required (see the <a href="#gitlab_network">gitlab note</a> below for
an understanding of why there are two interfaces)</li>
<li>Configure your file systems based on the required mount points</li>
<li>Put the operative docker information in the <code>internal_metadata</code> section</li>
<li>You will need to put an <code>owner_uuid</code> in the json because it is theoretically required
by the firewall code.</li>
</ol>
<div class="codehilite"><pre><span></span><code><span class="p">{</span>
<span class="w"> </span><span class="nt">"alias"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gitlab.example.com"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"hostname"</span><span class="p">:</span><span class="w"> </span><span class="s2">"gitlab.example.com"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"image_uuid"</span><span class="p">:</span><span class="w"> </span><span class="s2">"9080e799-d964-782e-e369-87d339e50798"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"owner_uuid"</span><span class="p">:</span><span class="w"> </span><span class="s2">"f834f98a-cac8-11eb-8ca3-cbddba9a698b"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"nics"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"nic_tag"</span><span class="p">:</span><span class="w"> </span><span class="s2">"vlan"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"ips"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="s2">"X.X.X.X/24"</span>
<span class="w"> </span><span class="p">],</span>
<span class="w"> </span><span class="nt">"gateways"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="s2">"X.X.X.1"</span><span class="p">],</span>
<span class="w"> </span><span class="nt">"vlan_id"</span><span class="p">:</span><span class="w"> </span><span class="mi">100</span>
<span class="w"> </span><span class="p">},</span>
<span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"nic_tag"</span><span class="p">:</span><span class="w"> </span><span class="s2">"vlan"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"ips"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="s2">"Y.Y.Y.Y/24"</span>
<span class="w"> </span><span class="p">],</span>
<span class="w"> </span><span class="nt">"vlan_id"</span><span class="p">:</span><span class="w"> </span><span class="mi">200</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">],</span>
<span class="w"> </span><span class="nt">"filesystems"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"source"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/data/gitlab.example.com/config"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"target"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/etc/gitlab"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"lofs"</span>
<span class="w"> </span><span class="p">},</span>
<span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"source"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/data/gitlab.example.com/logs"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"target"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/var/log/gitlab"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"lofs"</span>
<span class="w"> </span><span class="p">},</span>
<span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"source"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/data/gitlab.example.com/data"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"target"</span><span class="p">:</span><span class="w"> </span><span class="s2">"/var/opt/gitlab"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"type"</span><span class="p">:</span><span class="w"> </span><span class="s2">"lofs"</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">],</span>
<span class="w"> </span><span class="nt">"brand"</span><span class="p">:</span><span class="w"> </span><span class="s2">"lx"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"docker"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"kernel_version"</span><span class="p">:</span><span class="w"> </span><span class="s2">"5.4.0"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"max_physical_memory"</span><span class="p">:</span><span class="w"> </span><span class="mi">8192</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"maintain_resolvers"</span><span class="p">:</span><span class="w"> </span><span class="kc">true</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"resolvers"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="s2">"8.8.8.8"</span>
<span class="w"> </span><span class="p">],</span>
<span class="w"> </span><span class="nt">"quota"</span><span class="p">:</span><span class="mi">100</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"internal_metadata"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"docker:cmd"</span><span class="p">:</span><span class="w"> </span><span class="s2">"[\"/assets/wrapper\"]"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"docker:env"</span><span class="w"> </span><span class="p">:</span>
<span class="w"> </span><span class="s2">"[ \"PATH=/opt/gitlab/embedded/bin:/opt/gitlab/bin:/assets:⏎</span>
<span class="s2"> /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin\",⏎</span>
<span class="s2"> \"LANG=C.UTF-8\",⏎</span>
<span class="s2"> \"TERM=xterm\" ]"</span>
<span class="w"> </span><span class="p">}</span>
<span class="p">}</span>
</code></pre></div>
<p><strong>NOTE:</strong> I've split the <code>docker:env</code> line above for readability. You'll need to keep
that as a single line for it to work correctly. Remember that all of the docker items are
strings, so they can't be split up. I've marked the returns for readability with <code>⏎</code> above.</p>
<p>Once this is prepared, you should be able to bring the zone up with</p>
<div class="codehilite"><pre><span></span><code><span class="c1"># vmadm create -f myzone.json</span>
</code></pre></div>
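<p>If you're unsure about the payload, <code>vmadm</code> can also validate it without
creating anything; a quick sanity check (assuming the file name above) looks like:</p>
<div class="codehilite"><pre><code># check the create payload for schema problems before using it
vmadm validate create -f myzone.json
</code></pre></div>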
<h3>Networking <em>or</em> Protecting your container from the internet</h3>
<p>One important thing to note about docker containers is that they all assume they're not
accessible from the internet. Under normal circumstances they're running on a loopback
or maybe an internal network, but certainly not on the internet. Depending on how your
systems are set up, this may be an issue, as <em>internal</em> services are not protected using
any normal process. You may get lucky and the folks who build the Docker container may
have used loopback for everything not going off-container, but don't count on it.</p>
<p>This is a job well suited to the SmartOS built-in firewall system, which is controlled
by <code>fwadm</code> from the global zone. Generally speaking, you want to firewall everything except the ports that
were specifically mentioned in the <code>publish</code> argument to the docker command.</p>
<p>This is where the <code>owner_uuid</code> comes in. It turns out this is necessary to run the firewall
in the global zone. No big deal, and if you forgot to set it before bringing the zone up,
you can add it later using <code>vmadm update</code> before you enable the firewall.</p>
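<p>As a sketch of that after-the-fact fix (reusing the zone UUID from the firewall
examples below and the owner UUID from the JSON above):</p>
<div class="codehilite"><pre><code># set the owner on an existing zone, then read it back to confirm
vmadm update 85effadf-f4d3-63ca-cec4-8549b8797f75 owner_uuid=f834f98a-cac8-11eb-8ca3-cbddba9a698b
vmadm get 85effadf-f4d3-63ca-cec4-8549b8797f75 | json owner_uuid
</code></pre></div>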
<ol>
<li>
<p>Start the firewall for your container</p>
<div class="codehilite"><pre><span></span><code><span class="c1"># fwadm start 85effadf-f4d3-63ca-cec4-8549b8797f75</span>
</code></pre></div>
</li>
<li>
<p>Enable access to your ports</p>
<div class="codehilite"><pre><span></span><code>fwadm<span class="w"> </span>add<span class="w"> </span>-e<span class="w"> </span>--desc<span class="w"> </span><span class="s1">'allow docker ports'</span><span class="w"> </span>-O<span class="w"> </span>5a2e68b0-ca8d-11eb-944f-8b6840c190dc⏎
<span class="w"> </span><span class="s2">"FROM any ⏎</span>
<span class="s2"> TO vm 85effadf-f4d3-63ca-cec4-8549b8797f75 ⏎</span>
<span class="s2"> ALLOW tcp (PORT 80 AND PORT 443 AND PORT 22)"</span>
</code></pre></div>
<p>(again, I'm using the counter-intuitive <code>⏎</code> to mean you <em>should not</em> put a return there)</p>
<p>This should be self-explanatory, but the value after <code>vm</code> is the current zone's UUID,
and the value after <code>-O</code> is the owner UUID.</p>
</li>
</ol>
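<p>Once the rule is in place, it's worth confirming the firewall state from the global
zone; a quick check (same zone UUID as above) might be:</p>
<div class="codehilite"><pre><code># confirm the firewall is enabled for the zone and list the active rules
fwadm status 85effadf-f4d3-63ca-cec4-8549b8797f75
fwadm list
</code></pre></div>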
<h3>Debugging</h3>
<p>In the end, the VMs here are just LX zones running on SmartOS, so you can still access
their filesystems and you can execute commands on them using <code>zlogin</code> (if you're careful).</p>
<ul>
<li>Look in <code>/zones/${UUID}/logs/stdio.log</code> for stdio of the docker environment</li>
<li>Start a shell using <code>zlogin -i ${UUID} /native/usr/vm/sbin/dockerexec /bin/sh</code></li>
<li>You can run an arbitrary command in the container using <code>zlogin -i ${UUID} /native/usr/vm/sbin/dockerexec</code>;
the <code>/bin/sh</code> is just a specifically useful example</li>
</ul>
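<p>For example, assuming the zone UUID is in <code>$UUID</code>, a quick look at what a
misbehaving container is doing might be:</p>
<div class="codehilite"><pre><code># watch the container's stdout/stderr
tail -f /zones/${UUID}/logs/stdio.log
# then poke around inside the container itself
zlogin -i ${UUID} /native/usr/vm/sbin/dockerexec /bin/sh
</code></pre></div>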
<h3>Additional hints</h3>
<p>Here are some practical hints that I have developed by getting some of these images
(and others) running in SmartOS:</p>
<ul>
<li>
<p><a name="gitlab_network"></a>The gitlab binaries appear to really want a private interface in order to look for other
nodes in a potential cluster. As such, I found I needed to add a second, private
network interface so that it didn't get confused. This was made clear by error messages
thrown by the startup code.</p>
</li>
<li>
<p>Some Linux calls still don't work exactly the same in LX zones. This seems to be particularly
the case with process and system information gathering. In the case of gitlab, the CE
version ran with few changes to the configuration; but the EE version required reducing
the worker count to "0" (actually 1 worker, but that's the semaphore). This is another
place to look when debugging LX zones in general, and docker images in particular.</p>
</li>
</ul>
<h3>Additional links</h3>
<ul>
<li><a href="https://www.cyber-tec.org/2018/02/11/run-docker-images-on-smartos/">Run Docker images on SmartOS</a></li>
<li><a href="https://github.com/joyent/smartos-live/tree/master/src/dockerinit">dockerinit source</a></li>
</ul>
Pivoting Elasticsearch data2021-05-31T10:00:00-04:002021-05-31T10:00:00-04:00Gaige B. Paulsentag:www.gaige.net,2021-05-31:/pivoting-elasticsearch-data.html<p>As I've possibly mentioned here before, ClueTrust is using Elasticsearch to
perform analysis of log information. Recently, I finally decided
to take some of our telemetry information and pull it into Elasticsearch as a
data exploration and statistical tool.</p>
<h2>Importing structured XML data into Elasticsearch</h2>
<p>Although there are some filters …</p><p>As I've possibly mentioned here before, ClueTrust is using Elasticsearch to
perform analysis of log information. Recently, I finally decided
to take some of our telemetry information and pull it into Elasticsearch as a
data exploration and statistical tool.</p>
<h2>Importing structured XML data into Elasticsearch</h2>
<p>Although there are some filters and logstash methods that have this capability,
the XML that we use is extremely regular (strict schemas, etc.), and I felt that
it would be better to directly and intentionally import based on the DOM that
I'd created in 2009 when preparing for Cartographica to ship.</p>
<p>For purposes of illustration, the basic form of the Cartographica telemetry
files is:</p>
<ul>
<li>Preamble</li>
<li>Crash logs (yep, they're embedded)</li>
<li>Event stream
<ul>
<li>Errors</li>
<li>Events (launch, quit, and other)</li>
<li>Exceptions</li>
<li>Statistics (at quit and other times)</li>
</ul>
</li>
</ul>
<p>Due to the way that Elasticsearch works, it turns out this is a really workable
input, generating a (possibly too verbose) set of items from each telemetry
report, including:</p>
<ul>
<li>Report</li>
<li>Launch</li>
<li>Event</li>
<li>Crash</li>
<li>Error</li>
<li>Exception</li>
<li>Statistic</li>
</ul>
<p>For most of these items, the format is regular and arguments are inserted
directly into the record (so a Crash has a crash log along with some interpretive
data as well as the preamble from the report). This holds true for Events and
Errors as basically individual data points inside of a preamble+launch context.
The only oddity is the Statistic report which contains many "columns" of data
for each statistic event. It's not lost on me that this idempotent data set is
very SNMP-like.</p>
<h2>Searching for meaning among the data</h2>
<p>Due to the choice to break these out as separate objects in Elasticsearch,
most statistical information is straightforward to ascertain. Want to know
what formats are most popular? Look for <code>importVector</code> or <code>importRaster</code>
events and tabulate the number of times each format and/or driver are used.
Interested in how frequently a particular analysis tool is used? Look for
its corresponding event.</p>
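<p>As a sketch of that kind of tabulation (note that <code>event.name</code> and
<code>event.format</code> are illustrative field names, not necessarily our real mapping),
a terms aggregation does the counting:</p>
<div class="codehilite"><pre><code># count vector imports by format; field names here are hypothetical
curl -s -XGET 'localhost:9200/ct-app-logs-*/_search' \
  -H 'Content-Type: application/json' -d '
{
  "size": 0,
  "query": { "term": { "event.name": "importVector" } },
  "aggs": { "formats": { "terms": { "field": "event.format" } } }
}'
</code></pre></div>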
<h2>When (and how) to pivot your data</h2>
<p>The one piece that had me stumped for a few days was: how do I determine how
many active users are on which version of macOS? I've got launch data and a
unique (but pseudonymous) <a href="https://www.macgis.com/privacy#hostid">host identifier</a>.
The obvious approach is to create buckets based on the OS and count unique host IDs...
unfortunately, that creates a data problem with users who have upgraded during
the time period being examined. Using this technique, a user who was running
macOS 11 and upgraded on each release day would account for 10 separate macOS
version counts.</p>
<p>What I really needed was to look at the data for just the most recent report
for each host ID. Basically, I needed to pivot around the host ID. After much
too much time trying to find a complex way through this, I finally searched on
"pivot elasticsearch" and found <a href="https://www.elastic.co/guide/en/elasticsearch/reference/master/transform-overview.html">Pivot Transformations</a>,
which turns out to be just what I needed. By creating a transformed index with
the latest method, I was able to get an index that only pointed to the most
recent documents for each host ID. Once I had this, I could aggregate using
<code>terms</code> to find the operating system, resulting in a bucket of OS versions used
most recently by each host ID.</p>
<p>Pivot Transformations create an alternate index to documents in another index.
In my case, I used a <code>latest</code> transformation, which maintains only the most recent
item for each unique key, based on the specified timestamp field and possibly
limited by a filter. In my case:</p>
<div class="codehilite"><pre><span></span><code><span class="p">{</span>
<span class="w"> </span><span class="nt">"source"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"index"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="s2">"ct-app-logs-*"</span>
<span class="w"> </span><span class="p">]</span>
<span class="w"> </span><span class="p">},</span>
<span class="w"> </span><span class="nt">"latest"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"unique_key"</span><span class="p">:</span><span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="s2">"report.host.id"</span>
<span class="w"> </span><span class="p">],</span>
<span class="w"> </span><span class="nt">"sort"</span><span class="p">:</span><span class="w"> </span><span class="s2">"report.timestamp"</span>
<span class="w"> </span><span class="p">},</span>
<span class="w"> </span><span class="nt">"description"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Application Hosts"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"frequency"</span><span class="p">:</span><span class="w"> </span><span class="s2">"1m"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"dest"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"index"</span><span class="p">:</span><span class="w"> </span><span class="s2">"ct-app-hosts"</span>
<span class="w"> </span><span class="p">}</span>
<span class="p">}</span>
</code></pre></div>
<p>creates the new index based on the existing <code>ct-app-logs-*</code> pattern and
pulling out items by the unique <code>report.host.id</code> key, using <code>report.timestamp</code>
to determine which item is the most recent. This boils down 12 indexes containing
18.5M documents into a single index containing 16K documents.</p>
<p>The destination index, <code>ct-app-hosts</code>, was set up ahead of time using a basic
clone of the original index.</p>
<p>If desired, I could add a <code>query</code> key which would have limited the scope of the
documents.</p>
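<p>For instance, creating the transform through the transform API with such a
<code>query</code> added might look like this (the transform name and filter field are
made up for illustration):</p>
<div class="codehilite"><pre><code># create the latest-transform, limited to one application's reports
curl -s -XPUT 'localhost:9200/_transform/ct-app-hosts-latest' \
  -H 'Content-Type: application/json' -d '
{
  "source": {
    "index": ["ct-app-logs-*"],
    "query": { "term": { "report.application.name": "Cartographica" } }
  },
  "latest": { "unique_key": ["report.host.id"], "sort": "report.timestamp" },
  "frequency": "1m",
  "description": "Application Hosts",
  "dest": { "index": "ct-app-hosts" }
}'
</code></pre></div>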
<h2>Upping the ante for aggregation</h2>
<p>Once I got this going, I was having some issues pulling information out of the
data due to some variances in how versions were managed. In particular, I was
interested in seeing major OS versions (macOS 10.15, macOS 11, iOS 14, etc.) and
maybe the same for the application version.</p>
<p>To facilitate this, I used runtime fields, setting up 2 additional mappings in
the destination index (<code>ct-app-hosts</code>) pointed at above. To do this, I <code>PUT</code> a
new index definition for the index, containing the following:</p>
<div class="codehilite"><pre><span></span><code><span class="p">{</span>
<span class="w"> </span><span class="nt">"mappings"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"runtime"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"major_app"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"type"</span><span class="p">:</span><span class="s2">"keyword"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"script"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"source"</span><span class="p">:</span><span class="w"> </span><span class="s2">"""</span>
<span class="s2">def myField = doc['report.application.version'];</span>
<span class="s2">if (myField.empty)</span>
<span class="s2"> emit("");</span>
<span class="s2">else {</span>
<span class="s2"> def dom = myField.value;</span>
<span class="s2"> for( String suffix : ['a','d','b']) {</span>
<span class="s2"> if (dom.indexOf(suffix)>0) {</span>
<span class="s2"> dom = dom.substring(0,dom.indexOf(suffix));</span>
<span class="s2"> }</span>
<span class="s2"> }</span>
<span class="s2"> int last = dom.lastIndexOf('.');</span>
<span class="s2"> if (last == dom.indexOf('.'))</span>
<span class="s2"> emit(dom);</span>
<span class="s2"> else</span>
<span class="s2"> emit(dom.substring(0,last));</span>
<span class="s2">}"""</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"lang"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"painless"</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">},</span>
<span class="w"> </span><span class="nt">"major_os"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"type"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"keyword"</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"script"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="nt">"source"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"""</span>
<span class="s2">def myField = doc['report.host.os.version'];</span>
<span class="s2">if (myField.empty)</span>
<span class="s2"> emit("");</span>
<span class="s2">else {</span>
<span class="s2"> def dom = myField.value;</span>
<span class="s2"> int last = dom.lastIndexOf('.');</span>
<span class="s2"> def major = dom.substring(0,last);</span>
<span class="s2"> if (major.startsWith('11') || major=='10.16') {</span>
<span class="s2"> emit('11');</span>
<span class="s2"> } else {</span>
<span class="s2"> emit(major);</span>
<span class="s2"> }</span>
<span class="s2">}"""</span><span class="p">,</span>
<span class="w"> </span><span class="nt">"lang"</span><span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="s2">"painless"</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="p">},</span>
</code></pre></div>
<p>This creates a new column, <code>major_app</code>, which:</p>
<ol>
<li>Checks that <code>report.application.version</code> exists</li>
<li>Removes any suffix starting with <code>a</code>, <code>b</code> or <code>d</code></li>
<li>Truncates everything from the last <code>.</code> onward (unless that <code>.</code> is also the first, such as in <code>1.4</code>)</li>
</ol>
<p>Similarly, it creates a column <code>major_os</code> which:</p>
<ol>
<li>Checks that <code>report.host.os.version</code> exists</li>
<li>Truncates everything from the last <code>.</code> onward</li>
<li>Makes sure to emit <code>11</code> for <code>10.16</code></li>
</ol>
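<p>Put together, the original question ("how many active hosts run each major OS
version?") becomes a single terms aggregation against the pivoted index; a sketch
using the index and runtime field from above:</p>
<div class="codehilite"><pre><code># one bucket per major OS version, counting only each host's latest report
curl -s -XGET 'localhost:9200/ct-app-hosts/_search' \
  -H 'Content-Type: application/json' -d '
{
  "size": 0,
  "aggs": { "os_versions": { "terms": { "field": "major_os" } } }
}'
</code></pre></div>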
<p>With these two powerful tools, I was able to create a clear, concise, and
constantly-up-to-date resource for OS and application usage.</p>
Always check your arguments2021-04-24T13:16:00-04:002021-04-24T13:16:00-04:00Gaige B. Paulsentag:www.gaige.net,2021-04-24:/always-check-your-arguments.html<p>Quite a while back, RS wrote a comprehensive ansible role for handling
<a href="https://letsencrypt.org">Let's Encrypt</a> certificate issuance and renewal.
We both use this role extensively, which is why it was a significant issue when it
suddenly started throwing type errors deep inside of the
<a href="https://www.dnspython.org">dnspython</a> library during an <code>nsupdate</code> call in …</p><p>Quite a while back, RS wrote a comprehensive ansible role for handling
<a href="https://letsencrypt.org">Let's Encrypt</a> certificate issuance and renewal.
We both use this role extensively, which is why it was a significant issue when it
suddenly started throwing type errors deep inside of the
<a href="https://www.dnspython.org">dnspython</a> library during an <code>nsupdate</code> call in a critical
part of the script.</p>
<p>A cursory examination of the component parts indicated that the most likely cause was
a change to the <code>dnspython</code> library, which had recently been upgraded from 1.16 to 2.0.
Although there wasn't anything we could find online indicating other people had suffered
this breakage (which should have been a clue), it hadn't been out very long, it crashed
in a module that indicated it was checking something with IPv6, we use a lot of IPv6 on
our systems, many people use no IPv6, and well, we hadn't changed anything...</p>
<p>This was an annoyance, but relatively easy to avoid in one of the following ways:</p>
<ul>
<li>Pin the <code>dnspython</code> libraries to <2.0 in <code>pip</code></li>
<li>On the Mac, use <code>brew</code>'s <code>ansible</code> and manually roll back the <code>dnspython</code> libraries
in the installed version<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup></li>
</ul>
<p>I used both, as we ran ansible on both SmartOS and macOS.</p>
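<p>For the record, the pip pinning is the one-liner you'd expect:</p>
<div class="codehilite"><pre><code># hold dnspython back to the 1.x series
pip install 'dnspython<2.0'
# or, equivalently, add a line to requirements.txt:
#   dnspython<2.0
</code></pre></div>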
<h2>Taking brew hackery up a notch</h2>
<p>After maintaining this for a while, I needed to upgrade some modules in <code>ansible</code>, and
needed to keep my CI environment (running on Macs under <a href="https://www.jenkins.io">Jenkins</a>)
in sync with what we were running on my desktop, laptop, and servers; and that led me to
<a href="https://docs.brew.sh/How-to-Create-and-Maintain-a-Tap">create my own tap</a> in homebrew
by cloning the standard ansible formula and using my own repository.</p>
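<p>If you want to try the same thing, the rough shape of it (with
<code>myuser/mytap</code> standing in for your own tap name) is:</p>
<div class="codehilite"><pre><code># create a tap skeleton, copy a pinned formula into it, and install from it
brew tap-new myuser/mytap
brew extract --version=2.9.13 ansible myuser/mytap
brew install myuser/mytap/ansible@2.9.13
</code></pre></div>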
<p>The addition of this tap meant that I could configure this and test it once, but I could
deploy it on all of my Macs (and anyone else who had access to the tap on my private
git server could do the same).</p>
<h2>One thing leads to another</h2>
<p>After a few more months of using this tap on my Macs (and slowly moving ahead the <code>ansible</code>
version on the SmartOS machines, but keeping <code>dnspython</code> pinned), I needed to upgrade the
version of ansible at home (due to a project that I'll likely write about later, using
<code>ansible</code> to configure my Jenkins agents). The driver here was the need to execute
<code>homebrew</code> commands on an M1 Mac, something that didn't work out of the box with <code>ansible</code>
2.9, which is what I was pinned to.</p>
<p>Ever-hopeful, I first decided to see if my aforementioned problem was "fixed" by unlinking<sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup>
my private tap's version of <code>ansible</code>, and installing homebrew's version.</p>
<p>Sadly, running the <code>ansible</code> playbook just resulted in the familiar crash. I looked at it
for a few minutes, decided the bug that was introduced in summer 2020 was still there and
set about building a new tap for version 3.2.0 of <code>ansible</code>. This went smoothly, but after
updating my formula, installing took a <em>long</em> time, on the order of a few minutes. Why was
the standard homebrew install so much faster?</p>
<h2>A bottle for monsieur?</h2>
<p>Quick investigation led to the fact that most brew packages are installed these days using
bottles: pre-built versions of the entire subdirectory that ends up in the Cellar. That
seemed like it was a significant win, especially since I was going to install this at least
5 times each update, so I decided to figure out how to create my own custom bottles for
my custom tap.</p>
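<p>The bottling workflow itself is short, although you have to repeat it on every
platform you care about; a sketch for one formula in a private tap (names again
illustrative):</p>
<div class="codehilite"><pre><code># build from source in a bottle-friendly way, then roll the bottle
brew install --build-bottle myuser/mytap/ansible@3.2.0
brew bottle myuser/mytap/ansible@3.2.0
# brew bottle prints a bottle block to paste back into the formula
</code></pre></div>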
<p>Thanks to a good article on <a href="https://web.archive.org/web/20201130235507/https://yehowshuaimmanuel.com/software_misc/custom_homebrew/cutsom_homebrew//">Custom Tap and Bottles with Homebrew</a>
by <a href="https://web.archive.org/web/20201130235507/https://yehowshuaimmanuel.com/">Yehowshua Immanuel</a>, I was on my way quickly after
rebuilding from my tap formula once for each platform of Mac that I run (Intel Catalina,
Intel Big Sur, and ARM Big Sur at this time).</p>
<h2>The final verdict</h2>
<p>After all this work, and getting a great solution in place for working around the
perceived bug in <code>dnspython</code>, I took another quick look at the bug that was popping up in
our role. I'd contributed to random python projects in the past and also contributed to
<code>ansible</code> directly, so I was familiar with the process and figured I could track the
problem down. I fired up <a href="https://www.jetbrains.com/pycharm/">pycharm</a> to get a little
better perspective on the particular bugs and settled in to reproduce a minimal set of the
problem with the <code>nsupdate</code> command in <code>ansible</code>.</p>
<p>A few minutes (literally) into the investigation, I found myself looking at what
seemed like completely reasonable arguments to the <code>dns.query.tcp</code> method, which were
raising exceptions due to not being able to determine whether my hostname was an IPv4
or IPv6 address. I immediately checked the current docs for <code>nsupdate</code> in <code>ansible</code> and,
indeed, the <code>server</code> argument is now designated an IP address (v4 or v6). Checking whether
we'd just been lucky and ignoring this all along, I went back to the <code>ansible 2.9</code>
documentation and verified that it was silent on the issue of what was in the string argument.</p>
<p>At some point between 2.9 of <code>ansible</code> and 3.0, they documented the change caused by
the underlying library, and I missed that change.</p>
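<p>One way to adapt a playbook that had been passing a hostname is to resolve the
name yourself and hand <code>nsupdate</code> an address; a hedged sketch of the idea
(the hostname is hypothetical):</p>
<div class="codehilite"><pre><code># the server argument must now be an IPv4/IPv6 address, not a hostname
NS_IP=$(dig +short ns1.example.com A | head -1)
echo "updating via ${NS_IP}"
</code></pre></div>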
<p>A few take-aways:</p>
<ol>
<li>Once again, a reminder that checking your arguments against current documentation is
often time well spent.</li>
<li>Assuming that a behavior which goes against your expectations is a bug, when nobody else
is complaining about it, is often a recipe for a lot of work.</li>
<li>Homebrew is a really well thought out package manager, and if you have a need to maintain your
own tools, it may be well worth it to use private taps and bottles; they're easy to
create and super-easy to use.</li>
</ol>
<p>Every once in a while, it's good to have your own assumptions challenged. I made a point
of commenting on the <a href="https://github.com/ansible-collections/community.general/issues/698">bug report</a>
for <code>ansible</code> regarding this, filed by someone else. Hopefully they'll find my information
useful.</p>
<hr class="footnotes-sep" />
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>This experience lead me to a nifty thing about <code>brew</code>, which is that many installations
have every dependency installed in the Cellar directory for that specific package, including (for most python tools),
it's own copy of site-packages. This makes it very easy to pin specific versions of dependencies
and be able to run a number of python tools with different libraries and even interpreters. <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn2" class="footnote-item"><p>Everyone who uses <code>ansible</code> should be familiar with the <code>link</code> and <code>unlink</code> commands,
which allow you to keep a version or command installed while switching to another one. In
my case, since I was using a tap that had named versions (the best example of this I can
think of is Postgresql, which has separate versions for current, 12, 11, 10, 9.6 and even
some of the deprecated versions--use at your own peril). So, I could <code>brew unlink ansible@2.9.13</code>
and <code>brew install ansible</code> and get my private copy to move out of the way and use the
brew-standard version for testing. <a href="#fnref2" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
2021 Backup Software2021-03-28T10:42:00-04:002021-03-28T10:42:00-04:00Gaige B. Paulsentag:www.gaige.net,2021-03-28:/2021-backup-software.html<p>As we approach another <a href="http://www.worldbackupday.com">World Backup Day</a>, I figured it
was time for me to revise my 2013 <a href="https://www.gaige.net/backup-software.html">backup article</a>
for a more up-to-date view of what my backup situation is and what I am currently
recommending.</p>
<p>My basic backup strategy, as outlined in the previous article, hasn't changed
significantly …</p><p>As we approach another <a href="http://www.worldbackupday.com">World Backup Day</a>, I figured it
was time for me to revise my 2013 <a href="https://www.gaige.net/backup-software.html">backup article</a>
for a more up-to-date view of what my backup situation is and what I am currently
recommending.</p>
<p>My basic backup strategy, as outlined in the previous article, hasn't changed
significantly; however, the tools, services, and locations have. There's now general
acceptance of the 3-2-1 backup strategy, which is:
<ul>
<li>3 copies of your data (1 live, 2 backups)</li>
<li>2 different backup media/packages</li>
<li>1 copy offsite</li>
</ul>
<p>For the most part, I think that provides a good minimum basis, although I have been
adhering to it only as a minimum; in particular, I've been recommending an extra,
physically distant offsite location.</p>
<p>This year's post is big enough, it needs a TOC:</p>
<ul>
<li><a href="#Locations">Locations</a></li>
<li><a href="#Software">Software and Services</a></li>
<li><a href="#Encryption">Encryption</a></li>
<li><a href="#Photos">Photos</a></li>
<li><a href="#Conclusion">In Conclusion</a></li>
<li><a href="#OldSoftware">Afterward: Previous software suggestions</a></li>
</ul>
<h2>Locations <a id="Locations"></a></h2>
<p>With some reasonable bandwidth now widely available for most people who are using
computers, an offsite backup that is up-to-date should be considered table stakes.</p>
<p>Reiterating my advice from 7 years ago: a minimum of two physically distinct locations
is a must: one at home/on-site for ease of access, and at least one kept remote.
My last article suggested a safe deposit box, but I'm going to recommend against that
unless you don't have sufficient online access to systems physically diverse from
your location. Frankly, there's just too high of a likelihood that you won't remember
to update it. I suggest that the second copy be an online service, so that
you can make sure it is always available for you.</p>
<p>Again, I would suggest at least 2 locations. They should be far enough apart that
they're unlikely to suffer the same fate in the event of a disaster. If you are using
an online service for the second backup, try to find one that keeps two distinct copies
of your data (I'll describe one way to do that below).</p>
<h2>Software and Services <a id="Software"></a></h2>
<p>Last time, I described myself as being a big fan of paid-for backup software. Although
I'd agree with that in general still, I'll note that of the 5 packages that I called
out last time, there have been some changes.</p>
<p>Here are some specific suggestions:</p>
<h3>Time Machine <a id="TimeMachine"></a></h3>
<p>Time Machine is Apple's built-in backup software for many
versions of macOS, and is the only program that stayed on this list since the
last time. It provides version storage as well as very simple
administration, and can be used easily with an externally connected hard drive.
Of course, it's not very useful for off-site backup. However, for local backup
it is easy to set up and easy to restore data from.</p>
<h3>Arq</h3>
<p><a href="https://www.arqbackup.com">Arq</a> backup is a server-agnostic client-based offsite
backup package. They provide a software package that can use many different systems
for storage, including BackBlaze's B2, Amazon's S3 and Glacier, Dropbox, Google
Cloud and Drive, Microsoft OneDrive or SharePoint, external and network drives,
and just about any S3-compatible storage service (think <a href="https://min.io/">minio</a>,
<a href="https://wasabi.com">Wasabi</a> and others).</p>
<p>Personally, we're using Arq to back up to a pair of minio servers that we run:
one on each coast. We encrypt with complex keys at the client before the backups
are sent to the cloud, so we're confident that we're safe from prying eyes.</p>
<p>Arq's been around since 2009 and has been providing similar capabilities that whole
time. In the case of Arq, you are purchasing a software package (with updates if
you are in maintenance) and you will need to provide separately for your storage.
Some might find that fiddly and a disadvantage.</p>
<p>For those wondering, there were some hiccups at the start of Arq v6 (first new release
since v5) and I felt that the author responded well to them. Lots of criticism,
as is the way of it in this day and age. His response was to buckle down and
accelerate the release of v7, which came out earlier this year, with a tuned-up
new Mac-native interface (one of the issues with v6 was the Electron interface).
I had no trouble with either v6 or v7, but I'm happy to be on v7 now and have
had good luck with backup and restore.</p>
<h3>Carbon Copy Cloner</h3>
<p><a href="https://bombich.com">Carbon Copy Cloner</a> is a package that clones Mac hard drives
(and SSDs, really any storage). They have been around a long time (since 2002) and shown
steady progress of solid software improvement. Even with the challenging
changes over the last few years for the Mac, the folks at Bombich Software have managed to
engineer their way around the new Apple choices with aplomb and have won their way back
to being my preferred disk cloning software. When I need to make a bootable (or
just carry-able) copy of an existing drive, I turn to Carbon Copy Cloner.</p>
<h3>Bacula <a id="Bacula"></a></h3>
<p>I'm a big proponent of Bacula for server backups. It requires a bit of an investment
up-front to figure out your backup plans and you need to be willing to put in the
time to understand the options and configuration. But, if you're looking for
something for servers, I would suggest checking my articles on Bacula:</p>
<ul>
<li><a href="https://www.gaige.net/welcome-bacula.html">Welcome Bacula</a></li>
<li><a href="https://www.gaige.net/bacula-6-months-on.html">Bacula 6 months on</a></li>
<li><a href="https://www.gaige.net/bacula-restore-testing.html">Bacula restore testing</a></li>
</ul>
<h3>BackBlaze</h3>
<p><a href="https://www.backblaze.com">BackBlaze</a> came onto the scene a number of years back and was
initially an also-ran behind CrashPlan and Carbonite at the time. My, how things
have changed. They're not perfect (note some recent Facebook-related unforced
errors on their web site), but they reportedly provide a reliable service and
charge reasonable prices. I've never been a customer of theirs for backup, but I
have used their S3-compatible storage system (B2) for offsite storage and found
them to be reasonable. They have the option of letting you self-key, which means
they won't be able to tell what you are backing up.</p>
<h2>Encryption <a id="Encryption"></a></h2>
<p>This is not an option. You need encryption. If you have enough operational
fortitude to keep your own keys, you should do that (as opposed to having
them escrowed by a backup provider).</p>
<p>It's especially important when keeping data off-site
to make sure that data is encrypted using strong encryption and with keys that
are only available to you. This is possible with some services like CrashPlan
and BackBlaze, and with the new entrant above, Arq.
Any data which is intentionally taken off-site should be stored in some
encrypted form. Keep in mind that if you designate your own keys, you are going
to have to safely store these keys in a
manner that they will not be lost by whatever event causes your data to be
lost. My suggestion is <strong>store a copy of your keys in a safe deposit box</strong>.
There are electronic methods of storing this, but why mess around? If your biggest
concern is losing the data, then keep those keys <em>unencrypted</em> in the safe deposit
box. If your biggest concern is somebody getting your data, then keep the keys
<em>encrypted</em> in the safe deposit box. With that said, in the case of a real disaster
(one you don't survive), determine how you want those keys to convey to your
heirs and assigns.</p>
<h2>Photos <a id="Photos"></a></h2>
<p>This year I wanted to add a separate item for photos (and video, for that matter).
Most people have a large amount of their data wrapped up in photos and videos
these days. You should treat this data basically as you should all valuable data
and have backups, in multiple locations, and with multiple methods.</p>
<p>Many of you may be using Google Photos or Apple Photos to store and manage your
photos. That's fair enough, but those services do little to prevent
accidental (or malicious) destruction of photo data.</p>
<h3>Apple Photos</h3>
<p>If you're using Apple Photos to manage photos, you should consider having a single
machine with sufficient disk space to be designated for full-resolution backups. The
process is pretty simple:</p>
<ol>
<li>Log in to your AppleID on that machine (in your own account if it's a shared computer)</li>
<li>Start Photos</li>
<li>Choose <strong>Photos > Preferences</strong> and select the <strong>iCloud</strong> tab.</li>
<li>Under here, make sure <strong>Download Originals to this Mac</strong> is checked</li>
<li>Back up your user account (or the location of the Photos library if you've moved it)
as you would any other valuable data</li>
</ol>
<h3>DSLR users</h3>
<p>If you're using a DSLR and RAW or very high resolution photography, you should
consider backing up that data one more time. I usually have a staging area for
photos while I'm on a trip (when we could take trips) and that tends to stay
around quite a bit longer than it theoretically needs to. It's a second set of
suspenders beyond the belt and suspenders that I'm already wearing.</p>
<h2>In Conclusion <a id="Conclusion"></a></h2>
<p>It doesn't really matter so much how you decide to back up your data; it just
matters that you do back up your data. If there's something that you care
about, back it up. If you care about the data being secure, encrypt it. If
for some reason you believe you care about the data but don't care about
it being secure, think again.</p>
<h2>Afterword: Previous software suggestions <a id="OldSoftware"></a></h2>
<p>I figured some of you may be interested in knowing the fate of recommended products
gone by. Since you may have followed my advice and chosen one or more of these, I'll
sum up the current thinking on each.</p>
<h3>CrashPlan</h3>
<p>I'd been a strong proponent of <a href="https://www.crashplan.com/en-us/">CrashPlan</a>
in the last article, and in some ways, I still like it.
The client, although "native", is still poorly
designed, but data integrity and speed are still fine.
What's changed is the business model. When I wrote this, you could still get a
personal subscription; now there are only "small business" and "Enterprise"
subscriptions, and outside of the "Enterprise" version, you're being pushed
toward using their backup service. That's fine, insofar as it goes, but at this
point I think there are better players, like BackBlaze (above), for personal
backup.</p>
<h3>SuperDuper</h3>
<p><a href="https://www.shirt-pocket.com/SuperDuper/SuperDuperDescription.html">SuperDuper!</a>
is a package that clones hard drives on the
Mac from one device to another. Over the years I have
switched back and forth between Carbon Copy Cloner and SuperDuper! Recently, the
folks at Shirt Pocket software have been slower to adapt to Apple's changes and
I'm currently in a Carbon Copy Cloner phase, and that's what I recommend.</p>
<h3>BRU Server</h3>
<p>The company announced they were <a href="https://www.gaige.net/images/Tolisgroup-SOL.png">going out of business</a>;
the product was later purchased by <a href="https://macsales.com">OWC</a>.
Maybe OWC will do something with it in the years to
come, but I no longer advise its use. If you're looking for something for servers,
I would suggest checking the section above on <a href="#Bacula">Bacula</a>.</p>
<h3>Retrospect</h3>
<p>Historically (in the old days), I used <a href="https://www.retrospect.com">Retrospect</a>,
which went downhill significantly when Dantz was acquired by EMC.
The software product was spun back out into Retrospect, Inc. in November of
2011, and the word is that it has improved markedly since then.
I gave it a try again once, but have not used it in production,
nor have I tried recent versions.</p>
pre-commit and Pelican2021-03-28T09:00:00-04:002021-03-28T09:00:00-04:00Gaige B. Paulsentag:www.gaige.net,2021-03-28:/pre-commit-and-pelican.html<h2>Putting pre-commit to use</h2>
<p>I mentioned in a <a href="https://www.gaige.net/pre-commit.html">previous post</a> about pre-commit, a
tool for maintaining code consistency through simple management of pre-commit checks.</p>
<p>The first place I decided to give this a whirl was on my blog sites. As you may be aware,
I moved my blog sites (both …</p><h2>Putting pre-commit to use</h2>
<p>I mentioned in a <a href="https://www.gaige.net/pre-commit.html">previous post</a> about pre-commit, a
tool for maintaining code consistency through simple management of pre-commit checks.</p>
<p>The first place I decided to give this a whirl was on my blog sites. As you may be aware,
I moved my blog sites (both <a href="https://gaige.net">Gaige's Pages</a> and <a href="https://blog.cartographica.com">The Cartographica Blog</a>)
to <a href="https://www.gaige.net/gaiges-pages-moves-to-static-generation.html">static sites</a> some time back.</p>
<p>Pelican markdown files have a preamble that is set apart by a blank line: basically,
a set of colon-delimited key-value pairs that are rudimentarily parsed and then passed
to the interpreter. It looks like this:</p>
<div class="codehilite"><pre><span></span><code><span class="nl">Title</span><span class="p">:</span><span class="w"> </span><span class="n">My</span><span class="w"> </span><span class="n">Blog</span><span class="w"> </span><span class="n">Post</span>
<span class="nl">Date</span><span class="p">:</span><span class="w"> </span><span class="mi">2021</span><span class="mo">-03</span><span class="mi">-28</span><span class="w"> </span><span class="mo">07</span><span class="o">:</span><span class="mi">48</span>
<span class="cp"># Some bloggy stuff</span>
<span class="n">Content</span><span class="w"> </span><span class="n">text</span><span class="w"> </span><span class="n">is</span><span class="w"> </span><span class="n">here</span><span class="p">...</span><span class="w"> </span><span class="n">Oh</span><span class="p">,</span><span class="w"> </span><span class="n">see</span><span class="w"> </span><span class="n">my</span><span class="w"> </span><span class="p">[</span><span class="n">previous</span><span class="w"> </span><span class="n">post</span><span class="p">]({</span><span class="n">filename</span><span class="p">}</span><span class="n">previous</span><span class="o">-</span><span class="n">post</span><span class="p">.</span><span class="n">md</span><span class="p">)</span>
</code></pre></div>
<p>In addition to the front matter, there are also some replacement items that can be used to
reference generated data. For example: <code>{filename}</code> indicates that the path to the
stored file should be substituted.</p>
<p>I had noticed there was a Markdown plugin for pre-commit using
<a href="https://github.com/executablebooks/mdformat">mdformat</a>, and so I figured I'd give that a
try. Initial results were good. It provided a lot of clean-up for free. On the downside:
it also quoted all of the <code>{filename}</code> and similar references, such that they would no
longer work as references. And, it also eliminated my footnotes.</p>
<p>My initial <code>.pre-commit-config.yaml</code> looked like this:</p>
<div class="codehilite"><pre><span></span><code><span class="c1"># See https://pre-commit.com for more information</span>
<span class="c1"># See https://pre-commit.com/hooks.html for more hooks</span>
<span class="nt">repos</span><span class="p">:</span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">repo</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">https://github.com/pre-commit/pre-commit-hooks</span>
<span class="w"> </span><span class="nt">rev</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">v3.2.0</span>
<span class="w"> </span><span class="nt">hooks</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">id</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">trailing-whitespace</span>
<span class="w"> </span><span class="nt">exclude</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">^.*\.md$</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">id</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">end-of-file-fixer</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">id</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">check-yaml</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">id</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">check-added-large-files</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">id</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">check-json</span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">repo</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">https://github.com/executablebooks/mdformat</span>
<span class="w"> </span><span class="nt">rev</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">0.5.7</span>
<span class="w"> </span><span class="nt">hooks</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">id</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">mdformat</span>
<span class="w"> </span><span class="c1"># optional</span>
<span class="w"> </span><span class="nt">args</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="s">'--number'</span>
<span class="w"> </span><span class="nt">additional_dependencies</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">mdformat-tables</span>
</code></pre></div>
<p>Note here a couple of items:</p>
<ul>
<li>I have excluded <code>^.*\.md$</code> from <code>trailing-whitespace</code>, this was specifically to deal
with the fact that I had some two-space-at-end-of-previous-line implementations for handling
forced line-breaks. This is one of a few ways of doing this, but was required for use with
the <code>python-markdown</code> module that's used by default with pelican</li>
<li>I have added the <code>mdformat</code> plugin with a number of options and dependencies</li>
<li><code>--number</code> as an argument to <code>mdformat</code> forces it to number ordered list items. I prefer
that for readability.</li>
<li><code>mdformat-tables</code> adds table handling to mdformat (by default it uses a strict version
of Markdown called <a href="https://commonmark.org">Commonmark</a>), so any extensions must be enabled
with intention</li>
</ul>
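<p>With the config in place, the formatter can be exercised across the whole tree
without waiting for a commit:</p>
<div class="codehilite"><pre><code># run just the mdformat hook against every file in the repository
pre-commit run mdformat --all-files
</code></pre></div>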
<h2>mdformat plugins</h2>
<p>With things mostly working, I looked at the <code>mdformat</code> documentation to see if I could
make changes to the way it operated. Fortunately, there was a plug-in architecture that
allowed for the modification of both parsing and output behavior.</p>
<h3>Footnotes</h3>
<p>Although there's support for footnotes in the underlying markdown parser that's used
by <code>mdformat</code> (<a href="https://github.com/executablebooks/markdown-it-py">markdown-it-py</a>,
based on the Javascript-based <a href="https://github.com/markdown-it/markdown-it">markdown-it</a>),
that support wasn't built in to the mdformat code. So, I decided that I'd take a look
at <code>mdformat-tables</code> and see if I could do something similar for footnotes, since
the code for both tables and footnotes is included in the underlying package as options.</p>
<p>The result is the <a href="https://github.com/gaige/mdformat-footnote">mdformat_footnote</a> plugin,
which uses the existing parser (the hard part) and formats the footnotes appropriately.</p>
<p>This plugin can be installed using <code>pip install mdformat_footnote</code> or by adding <code>mdformat_footnote</code>
to the list of items in the <code>additional_dependencies</code> list in the <code>.pre-commit-config.yaml</code>
file.</p>
<h3>Pelican-specific items</h3>
<p>In this initial case,
all I needed to do was change the output so that it didn't replace the <code>{}</code> characters
inside of links. The code was straightforward, and after some playing around, I created
the <a href="https://github.com/gaige/mdformat-pelican">mdformat_pelican</a> plugin for use with
mdformat and pelican.</p>
<p>You can look at the code above, or install it with <code>pip install mdformat_pelican</code> to get
the latest version from <a href="https://pypi.org">pypi.org</a>.</p>
<p>Implementing the <a href="https://github.com/gaige/mdformat-pelican/tree/v0.0.2">initial code</a> was
straightforward. Effectively, the code hijacks the <code>render_token</code> function and modifies
the <code>token.attrs</code> just before they're rendered, correcting any erroneously-quoted
URLs.</p>
<p>This worked great, across nearly all of my files. Except for a couple that had square
brackets in their metadata fields. For example, a post about Queen guitarist Brian
May receiving his doctorate in Astrophysics had this front matter:</p>
<div class="codehilite"><pre><span></span><code><span class="nt">Date</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">2007-08-03 07:26</span>
<span class="nt">Alias</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">/node/4836,/article.php?story=20070803092627654</span>
<span class="nt">Tags</span><span class="p">:</span>
<span class="nt">Category</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">general news</span>
<span class="nt">Title</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">[</span><span class="nv">He's</span><span class="p p-Indicator">]</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">a killer... astrophysicist?</span>
</code></pre></div>
<p>which <code>mdformat</code> dutifully turned into <code>\[He's\] a killer... astrophysicist?</code>,
which <code>pelican</code> didn't know how to interpret, so the backslashes ended up in my content
pages...not desired.</p>
<p>Since I already had a Pelican plugin for <code>mdformat</code>, I decided to make it a bit more
pelican-y, by marking the front matter as off-limits. This was a little trickier, but
had good results. As you can see in the <a href="https://github.com/gaige/mdformat-pelican/blob/master/mdformat_pelican/plugin.py">plugin source</a>,
understanding the front matter required adding a new block rule to the parser, and
then adding the code to render that rule's output later in <code>render_token</code>.</p>
<p>Since the format is very rigid (basically, collect everything until you reach the first
blank line), it was easy to implement.</p>
<p>So, now my current working <code>.pre-commit-config.yaml</code> looked like this:</p>
<div class="codehilite"><pre><span></span><code><span class="c1"># See https://pre-commit.com for more information</span>
<span class="c1"># See https://pre-commit.com/hooks.html for more hooks</span>
<span class="nt">repos</span><span class="p">:</span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">repo</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">https://github.com/pre-commit/pre-commit-hooks</span>
<span class="w"> </span><span class="nt">rev</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">v3.2.0</span>
<span class="w"> </span><span class="nt">hooks</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">id</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">trailing-whitespace</span>
<span class="w"> </span><span class="nt">exclude</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">^.*\.md$</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">id</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">end-of-file-fixer</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">id</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">check-yaml</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">id</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">check-added-large-files</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">id</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">check-json</span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">repo</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">https://github.com/executablebooks/mdformat</span>
<span class="w"> </span><span class="nt">rev</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">0.5.7</span>
<span class="w"> </span><span class="nt">hooks</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">id</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">mdformat</span>
<span class="w"> </span><span class="c1"># optional</span>
<span class="w"> </span><span class="nt">args</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="s">'--number'</span>
<span class="w"> </span><span class="nt">additional_dependencies</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">mdformat-tables</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">mdformat-black</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">mdformat_footnote</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">mdformat_pelican</span>
<span class="nt">exclude</span><span class="p">:</span><span class="w"> </span><span class="p p-Indicator">|</span>
<span class="w"> </span><span class="no">(?x)(</span>
<span class="w"> </span><span class="no">^output/|</span>
<span class="w"> </span><span class="no">^themes/|</span>
<span class="w"> </span><span class="no">^venv/|</span>
<span class="w"> </span><span class="no">^content/NewZealand/</span>
<span class="w"> </span><span class="no">)</span>
</code></pre></div>
<p>This adds my new plugins (both <code>mdformat_footnote</code> and <code>mdformat_pelican</code>) and
also adds an exclusion for some files in my pre-commit hooks. The directories that aren't actually
committed (<code>output</code>, <code>venv</code>) wouldn't be checked anyway, but I also have a set of badly-formatted
HTML files in <code>content/NewZealand</code> that I don't want to fix yet.</p>
<p>This turned out well, but there were a couple of constructs that the parser in Pelican and the
parser in mdformat could not agree on: in particular, the indentation requirements for
items containing newlines within ordered lists that have multiple paragraphs in them.</p>
<p>In the end, that would lead me to write a new plugin for Pelican to replace the
Markdown parser.</p>
<h2>Markdown parser plugin for Pelican</h2>
<p>The plugin architecture for <code>mdformat</code> is pretty good, but the one for Pelican is very
mature and well thought-out. I've created plugins for Pelican before, notably the
<a href="https://www.gaige.net/pelican-plugin-for-nginx-redirection.html">Nginx alias maps plugin</a>.</p>
<p>Also, plugins to replace the Markdown reader in Pelican already existed. As such,
the lift was pretty light (a rough sketch follows the list):</p>
<ol>
<li>Get a base plugin working</li>
<li>Parse the metadata (simple <code>:</code> split of each line before the first blank line)</li>
<li>Load the <code>MarkdownIt</code> package and configure with a few settings (tables, footnotes, and definition lists)</li>
<li>Add hooks to rewrite the <code>\{filename\}</code> items back to <code>{filename}</code></li>
<li>Finally, add a new <code>fence</code> formatter that uses <a href="https://pygments.org">Pygments</a> to format code</li>
</ol>
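<p>As a rough sketch of those steps (illustrative only, not the actual plugin source: the class
name, file handling, and metadata parsing here are simplified assumptions, and the hooks for
rewriting the <code>{filename}</code> links and the Pygments <code>fence</code> renderer are omitted):</p>
<div class="codehilite"><pre><span></span><code>from markdown_it import MarkdownIt
from pelican import signals
from pelican.readers import BaseReader


class MarkdownItReader(BaseReader):
    # Illustrative reader; the real plugin also rewrites {filename} links
    # and installs a Pygments-based fence renderer.
    enabled = True
    file_extensions = ['md', 'markdown']

    def read(self, source_path):
        with open(source_path, encoding='utf-8') as f:
            raw = f.read()
        # Metadata: a simple ':' split of each line before the first blank line.
        header, _, body = raw.partition('\n\n')
        metadata = {}
        for line in header.splitlines():
            name, _, value = line.partition(':')
            key = name.strip().lower()
            metadata[key] = self.process_metadata(key, value.strip())
        # Tables are enabled here; footnotes and definition lists would come
        # from mdit_py_plugins (omitted in this sketch).
        md = MarkdownIt('commonmark').enable('table')
        return md.render(body), metadata


def add_reader(readers):
    # Take over the standard Markdown extensions.
    readers.reader_classes['md'] = MarkdownItReader


def register():
    signals.readers_init.connect(add_reader)
</code></pre></div>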
<p>The code is available on <a href="https://github.com">GitHub</a> in the <a href="https://github.com/gaige/markdown-it-reader">markdown-it-reader</a>
repository and can be installed using <code>pip install pelican-markdown-it-reader</code>.</p>
<p>This plugin must be enabled on your site by adding it to the list of <code>PLUGINS</code> in
your <code>pelican.py</code> file.</p>
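<p>For example (the module name below is my illustration; check the repository's README for the
canonical value):</p>
<div class="codehilite"><pre><span></span><code># Your Pelican settings file (pelican.py in my case; often pelicanconf.py)
PLUGINS = [
    # ...any other plugins...
    'markdown_it_reader',  # hypothetical module name for this reader plugin
]
</code></pre></div>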
pre-commit2021-03-28T08:00:00-04:002021-03-28T08:00:00-04:00Gaige B. Paulsentag:www.gaige.net,2021-03-28:/pre-commit.html<h3>Introducing pre-commit hooks</h3>
<p>I recently became aware of the open-source project: <a href="https://pre-commit.com">pre-commit</a>,
which is "A framework for managing and maintaining multi-language pre-commit hooks."</p>
<p>The key feature of pre-commit is that it creates an execution environment for itself in
order to enable running hooks without messing with (or creating conflicts with …</p><h3>Introducing pre-commit hooks</h3>
<p>I recently became aware of the open-source project: <a href="https://pre-commit.com">pre-commit</a>,
which is "A framework for managing and maintaining multi-language pre-commit hooks."</p>
<p>The key feature of pre-commit is that it creates an execution environment for itself in
order to enable running hooks without messing with (or creating conflicts with) your
development or operating environment.</p>
<p>pre-commit uses a configuration file (<code>.pre-commit-config.yaml</code>) to determine which
hooks to run and how to handle them. These hooks can come from public or private
repositories and there's a pretty solid mechanism for dealing with dependencies.</p>
<p>I'd discovered the pre-commit project when looking at another open-source project that
used it to maintain a level of code consistency. To carry this out, pre-commit can be
used to update files during the commit phase to bring those files into alignment with
the standards (and verify correctness, run lint programs, etc.).</p>
<p>Pre-commit can be installed on macOS using <a href="https://brew.sh/">homebrew</a> with:</p>
<div class="codehilite"><pre><span></span><code><span class="c1"># brew install pre-commit</span>
</code></pre></div>
<p>Once installed, you can add a pre-commit configuration by creating a <code>.pre-commit-config.yaml</code>
file and putting appropriate contents in it. Here's an example config that cleans up yaml,
end of file, and trailing whitespace:</p>
<div class="codehilite"><pre><span></span><code><span class="nt">repos</span><span class="p">:</span>
<span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">repo</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">https://github.com/pre-commit/pre-commit-hooks</span>
<span class="w"> </span><span class="nt">rev</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">v2.3.0</span>
<span class="w"> </span><span class="nt">hooks</span><span class="p">:</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">id</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">check-yaml</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">id</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">end-of-file-fixer</span>
<span class="w"> </span><span class="p p-Indicator">-</span><span class="w"> </span><span class="nt">id</span><span class="p">:</span><span class="w"> </span><span class="l l-Scalar l-Scalar-Plain">trailing-whitespace</span>
</code></pre></div>
<p>Install the git hook scripts with the install command:</p>
<div class="codehilite"><pre><span></span><code><span class="c1"># pre-commit install</span>
</code></pre></div>
<p>And now give it a shot by running the command against all of your files:</p>
<div class="codehilite"><pre><span></span><code><span class="c1"># pre-commit run -a</span>
</code></pre></div>
<p>It's not necessary to use <code>pre-commit run</code>, since the whole point of the pre-commit hooks
is that they run before code is committed. However, if you make a change to your pre-commit
configuration and want to bring the older content into line, you can use <code>pre-commit run</code> to
clean up existing code to the new standards.</p>
<p>All told, it's pretty spiffy. I decided to use it to
<a href="https://www.gaige.net/pre-commit-and-pelican.html">check my blog content</a>,
but that's a story for another <a href="https://www.gaige.net/pre-commit-and-pelican.html">post</a>.</p>
Bacula Restore Testing2020-09-27T10:33:00-04:002020-09-27T10:33:00-04:00Gaige B. Paulsentag:www.gaige.net,2020-09-27:/bacula-restore-testing.html<p>Originally this was going to contain a brief <a href="https://www.gaige.net/bacula-6-months-on.html">Bacula, 6 months on</a>
section at the start. Of course, that became much too detailed, so I split them up; however,
I would encourage you to read it.</p>
<h2>Restore Testing</h2>
<p>Backup is the most obvious part of doing backups. Almost everyone's aware …</p><p>Originally this was going to contain a brief <a href="https://www.gaige.net/bacula-6-months-on.html">Bacula, 6 months on</a>
section at the start. Of course, that became much too detailed, so I split them up; however,
I would encourage you to read it.</p>
<h2>Restore Testing</h2>
<p>Backup is the most obvious part of doing backups. Almost everyone's aware they need to
perform them, and so most businesses and some individuals make sure that they have one or
more backup systems and destinations.</p>
<p>However, it's surprising how infrequently real restore tests get performed. Frequently
this is because of the difficulty of shutting down a production system for verification,
because the restore commands are unwieldy, or because it's hard to find the
disk space necessary to do a complete restore.</p>
<p>With that said, restore testing is absolutely crucial and becomes even more so as your
backup systems become more complex. However, it also gets quite a bit more difficult the
more complex your systems become, and as such hasn't always been something that I've done
well.</p>
<p>Fortunately, one of the side-effects of moving to Bacula was that we needed to perform
backup and restore tests when we were considering the solution and that has resulted in
some solid steps for performing restore tests.</p>
<h2>Verify vs Restore</h2>
<p>Frequently, you'll see "Verify" options along with backups and often even default read-after-write
operations which can validate that the data written is what was expected. Those options,
especially the latter, were absolutely essential in the days of tape backups, where you'd
occasionally have a tape go bad during the writing process. However, these days, the need
for verify-after-write is not nearly as strong.</p>
<p>Similarly, you'll see "Verify" operations which attest that the data stored on your
backup "volumes" matches what is currently on your machine. These are sometimes
used as a substitute for actual restore testing, as they simulate the restore process. In
effect, they exercise the catalog, validate the contents of the backup (sometimes), and
compare them to the bytes in the files on disk (also sometimes). These are good to a point, but really
no substitute for doing what you're going to do in an emergency.</p>
<p>Performing actual restores on a regular basis not only exercises all mechanisms of the storage
and restore process, but also tends to lead to automating your disaster recovery scenarios
and improving familiarity with the details of the restore process.</p>
<h2>Near In-place restore</h2>
<p>Because of the way that Bacula is structured, a client must exist in order to be the target
of a restore. In addition, it must contain the correct encryption keys. As such, the most
straightforward restore test is to restore the contents of your volumes to another location
on the existing client. As long as you have enough storage space, this is a pretty
non-invasive restore and can be done by using the standard restore commands and adding
the <code>where=</code> argument (or adjusting the restore parameter's <code>where</code> value). After running
the restore, use your favorite diffing utility to determine if everything is as it should
be and you've got a basic restore verification.</p>
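<p>A minimal sketch of that comparison step in Python (the paths here are hypothetical; any
recursive diff tool works just as well):</p>
<div class="codehilite"><pre><span></span><code>import filecmp

# Compare the original tree against the restored copy (paths are examples).
original = '/var/www'
restored = '/restore/var/www'


def report(d):
    # Files present in only one tree, or whose contents differ.
    for name in d.left_only:
        print(f'missing from restore: {d.left}/{name}')
    for name in d.right_only:
        print(f'unexpected in restore: {d.right}/{name}')
    # shallow=False forces a byte-for-byte comparison, not just os.stat().
    mismatch = filecmp.cmpfiles(d.left, d.right, d.common_files, shallow=False)[1]
    for name in mismatch:
        print(f'contents differ: {d.left}/{name}')
    for sub in d.subdirs.values():
        report(sub)


report(filecmp.dircmp(original, restored))
</code></pre></div>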
<h2>Restoring to staging</h2>
<p>A more complex scenario involves restoring to a staging device. With our current configuration
of explicitly separate staging and production environments, this can be a little tricky, but
it has the advantage of allowing you to do an in-situ replacement and validate the full
restoration process.</p>
<p>Restoring to a "similar" production device would serve nearly the same purpose, but in our
case, the staging systems are designed to be safe copies of the production environment, and
there's very little "stream crossing" to be done.</p>
<p>With that said, if your staging and production environments are separate, you'll need to
do the following in order to restore on a staging system:</p>
<ol>
<li>Register your staging system with your production bacula director by adding a client
stanza for it in the <code>bacula-dir.conf</code> and <code>reload</code> the config in <code>bconsole</code></li>
<li>Place the encryption key (public and private halves) on your staging server, so the data
will decrypt appropriately</li>
<li>Add the production director to your <code>bacula-fd.conf</code> on the staging client, so that it
can request the restoration</li>
<li>Restart the File Daemon on the staging client</li>
<li>Use <code>status client</code> to make sure that you can reach the client</li>
<li>(Belt and suspenders) I like to also disable the File Daemon on the original client to
guarantee that you don't fat finger something and make yourself very unhappy</li>
<li>Run the restore command in <code>bconsole</code> selecting the appropriate files, setting the
client to your staging server, and setting the restoration path to <code>/</code> so that you
write to the original location</li>
<li>Once complete, remove the staging server from your production director's <code>bacula-dir.conf</code>
(to avoid confusion or accidental backups in the future), and <code>reload</code> in <code>bconsole</code>.
You may also want/need to delete the client in <code>bconsole</code>, which will make sure it's not
available.</li>
<li>Replace the configuration on the staging device as necessary</li>
</ol>
Bacula 6 months on2020-09-26T08:04:00-04:002020-09-26T08:04:00-04:00Gaige B. Paulsentag:www.gaige.net,2020-09-26:/bacula-6-months-on.html<p>It's been about six months since I originally wrote <a href="https://www.gaige.net/welcome-bacula.html">Welcome Bacula</a>,
describing our transition to Bacula from our previous solution (and a bit of history even before that). If you
haven't read it, it might be worth a read.</p>
<p>Although not quite 6 months since I wrote the first piece …</p><p>It's been about six months since I originally wrote <a href="https://www.gaige.net/welcome-bacula.html">Welcome Bacula</a>,
describing our transition to Bacula from our previous solution (and a bit of history even before that). If you
haven't read it, it might be worth a read.</p>
<p>Although not quite 6 months since I wrote the first piece, it's now been over 6 months since
we started using <a href="https://bacula.org">Bacula</a>. The results have been extremely good:</p>
<ul>
<li>Performance has been excellent</li>
<li>The backup mechanism has been highly reliable</li>
<li>Locally-cached cloud backups propagate to the cloud easily (and reliably)</li>
<li>Pre-transmission compression and encryption have improved performance and security</li>
<li>Text-based configuration files have improved automation of clients and servers</li>
</ul>
<h3>Performance</h3>
<p>I'm going to start with performance, which has been an unexpected (and uncontested) win
in comparison to our previous solution. Nightly incremental backups are finishing in 10-12
minutes after backing up a bit over 3GB in 2500 files across 17 machines (remember that
a lot of systems we have don't store persistent data). Weekly differentials take a few minutes
longer and tend to contain about double the number of files and data. Full
backups take around 21 hours, backing up 350GB in 3 million files (including local storage
and the push to the offsite storage).</p>
<p>With our previous solution, we carved out a 10-hour window for full backups for
each of 4 backup sets covering 13 systems (the ease of automation in Bacula has resulted
in our backing up a few more machines), and about an hour a day for each of the 4 backup
sets. They didn't always take that long, but running the backup sets in parallel was
not a good idea™. Full backups took about 41 hours (total, not running in parallel) to
back up about 325GB. These backups were compressed, but not encrypted. In addition, the
push to offsite storage was a separate operation and itself took a substantial amount of
time (including requiring an encryption step). Generally, we could expect full backups to
be completed, encrypted, and replicated to offsite within 48-72 hours of the start of the
cycle.</p>
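<p>Put roughly in numbers (a back-of-the-envelope comparison using only the figures quoted
above; a sketch, not a benchmark):</p>
<div class="codehilite"><pre><span></span><code># Rough throughput comparison from the figures quoted above.
old_gb, old_hours = 325, 41   # previous solution, excluding the offsite push
new_gb, new_hours = 350, 21   # Bacula, including the offsite push

print(f'old: {old_gb / old_hours:.1f} GB/hour')   # ~7.9 GB/hour
print(f'new: {new_gb / new_hours:.1f} GB/hour')   # ~16.7 GB/hour
print(f'speedup: {(new_gb / new_hours) / (old_gb / old_hours):.1f}x')  # ~2.1x
</code></pre></div>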
<h3>Reliability</h3>
<p>We've had excellent reliability out of Bacula. Error messages are delivered in a timely
fashion (via email mostly) and status information during the job is readily accessible.
The previous solution never had any reliability problems (that we were aware of), although
getting real-time status information was always a bit of a chore. The previous solution's GUI (which
Bacula does not have) was underwhelming and vague, but its CLI was too machine-friendly.
In this case, Bacula aligns much better with our needs and desires. I'm very comfortable
with CLIs and although GUIs are nice when they're done really well, for a facility like this
I'd rather have a good CLI any day.</p>
<h3>Cloud experience</h3>
<p>I need to give the caveat here that we don't use what most people would call "the cloud"
these days, as we have enough geographic diversity that we replicate to our own equipment
in another data center. However, the concepts are the same, and since we're using an
S3-compatible storage mechanism, I think the comparison to S3 or B2 is reasonable.</p>
<p>Bacula uses a fairly intelligent cloud cache which uploads backups in chunks as they are
completed. I'm still not entirely certain whether this stops the backup process in order
to upload or whether it uploads in parallel. Given that the backup isn't considered finished
until the cloud send has been attempted, it doesn't make much difference to us. You'll
note that I said "has been attempted". In the event that the cloud send fails, the backup
continues and an error is logged. Any parts whose upload didn't complete during the
backup can be re-uploaded later.</p>
<p>It's worth noting that the cloud backups are just simple syncs of the directories from
the cache, so you can actually use any mechanism you like to send them off site. However, using
the built-in drivers also allows the system to pull the backups (piecemeal as required) during
a restore, which is a nice feature if you're bandwidth constrained either in network or
pricing.</p>
<p>I'll note here that the <code>part.1</code> of the backup (inside of the "volume" directory) is the
label and is required for the automatic pull to work. If that's deleted for some reason and
you need to restore from a volume that's completely offline, you'll need to at least pull
the <code>part.1</code> file back to the storage server's cache to get the automatic pull to work.</p>
<p><em>Ed Note: this previously read <code>part.0</code>, which isn't actually created by Bacula</em></p>
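<p>As an illustration of pulling that label back by hand (a hedged sketch using <code>boto3</code>;
the endpoint, bucket, volume, and cache paths are made up, and any S3-compatible client or
<code>rclone</code> would do the same job):</p>
<div class="codehilite"><pre><span></span><code>import boto3

# Fetch just the volume label (part.1) back into the storage daemon's cache
# so Bacula's automatic piecemeal pull can work again. Names are examples.
s3 = boto3.client('s3', endpoint_url='https://s3.example.net')
s3.download_file(
    'bacula-offsite',                     # bucket holding the cloud parts
    'Vol-0042/part.1',                    # the label part inside the volume dir
    '/opt/bacula/cache/Vol-0042/part.1',  # storage daemon's local cache path
)
</code></pre></div>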
<h3>Automating configuration</h3>
<p>As should be clear from the rest of this blog, Rob and I use Ansible for building basically
everything we run. In many ways, the most significant advantage of the move to Bacula was
being able to automate the configuration of both clients and servers without difficulty.
As such, we have test and production environments and we're able to validate new versions
and configuration ideas when we need to.</p>
<h3>Conclusion</h3>
<p>All told, as with any good migration of a long-standing system, the main take-away is: I
wish I'd done this sooner. Bacula may be too fiddly for some people, but our environment
is complex and highly automated. As such, we constrained the fiddling mostly to our initial
configuration and have been able to craft a solution that is well suited to our environment
and needs.</p>
Trapped in the ice2020-06-26T06:29:00-04:002020-06-26T06:29:00-04:00Gaige B. Paulsentag:www.gaige.net,2020-06-26:/trapped-in-the-ice.html<p>We've heard it all before: AWS is expensive, and you need to watch out for the hidden
sharp edges of their pricing model. Today I provide a small lesson in that concept.</p>
<h1>History</h1>
<p>ClueTrust has run through a number of backup methodologies over the years, originally
using Retrospect (when they …</p><p>We've heard it all before: AWS is expensive, and you need to watch out for the hidden
sharp edges of their pricing model. Today I provide a small lesson in that concept.</p>
<h1>History</h1>
<p>ClueTrust has run through a number of backup methodologies over the years, originally
using Retrospect (when they were their original, independent selves) to tape, then moving to
BRU to handle more multi-platform capabilities, eventually deprecating tape and mirroring
to an off-site storage system, and most recently, our <a href="https://www.gaige.net/welcome-bacula.html">move to Bacula</a>.</p>
<p>BRU didn't have a Glacier module, so I wrote (and re-wrote) a series of Python scripts
that handled backing up, storing metadata (because Glacier doesn't allow you to choose
the names of your storage units) and purging older archives when appropriate.</p>
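<p>As a rough sketch of the metadata-tracking idea (not the original scripts; the vault name
and the index-file layout here are made up for illustration):</p>
<div class="codehilite"><pre><span></span><code>import json

import boto3

# Glacier assigns opaque archive IDs, so keep our own name-to-ID index locally.
glacier = boto3.client('glacier')


def upload(path, vault='offsite-backups', index_file='archives.json'):
    with open(path, 'rb') as f:
        result = glacier.upload_archive(
            vaultName=vault,
            archiveDescription=path,
            body=f,
        )
    # Record the mapping so the archive can be found (or purged) later.
    try:
        with open(index_file) as f:
            index = json.load(f)
    except FileNotFoundError:
        index = {}
    index[path] = result['archiveId']
    with open(index_file, 'w') as f:
        json.dump(index, f, indent=2)
</code></pre></div>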
<h1>Melting the glacier</h1>
<p>As part of our work this year, we've been looking at various storage models for our new
datastore. Since Bacula is capable of supporting S3, we looked at storing off-site data
using S3-compatible servers in a couple of locations. On the open-source side, this is
powered by <a href="https://min.io">minio</a>, but we also considered using the new <a href="https://backblaze.com">Backblaze</a>
<a href="https://www.backblaze.com/b2/docs/s3_compatible_api.html">S3 Compatible API</a>.</p>
<p>Either way, it was clear that raw Glacier, as we'd been doing in the past, wasn't going
to make any sense for us going forward.</p>
<p>In the intervening years, Amazon had done a nice job of reducing the price of storage, and
even retrieval (in bulk) for Glacier, and we only have about 3TB of data sitting there right
now. This costs us approximately $12/month to store. Still $144/year, which at today's prices
will get you just about 1.5 4TB SMR drives per year (don't get me started on SMR, <a href="https://www.youtube.com/watch?v=8hdJTwaTl8I">especially
since we use ZFS</a>).</p>
<p>We're just beyond the 90-day window of deleting data from Glacier, so I took a look at what
it would cost us to download and archive the data from Glacier locally and just delete the
rest of it. For those of you unfamiliar with Glacier, there's a minimum 90-day retention
policy; if you delete your data in under 90 days, you pay for the entire 90 days for that data.</p>
<h1>Getting you coming and going</h1>
<p>This section title is mildly misleading: AWS doesn't charge you for upload (except transactions),
but they do charge for storage ($0.004/GB/mo right now) and for transfers out of Glacier to
the internet ($0.09/GB).<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup></p>
<p>So, to offload my 3TB of data it would cost 0.09 * 3000 or $270 (+$7.5 for the bulk retrieval fee).</p>
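<p>Worked out in a few lines (a sketch; the $0.0025/GB bulk-retrieval rate is inferred from
the $7.50 fee quoted above, and the other rates are those listed):</p>
<div class="codehilite"><pre><span></span><code>gb = 3000                      # ~3TB parked in Glacier

storage = gb * 0.004           # $12.00/month at $0.004/GB/mo
transfer_out = gb * 0.09       # $270.00 to the internet at $0.09/GB
bulk_retrieval = gb * 0.0025   # $7.50 (rate inferred from the fee quoted above)

print(f'storage: ${storage:.2f}/mo (${storage * 12:.2f}/yr)')
print(f'one-time offload: ${transfer_out + bulk_retrieval:.2f}')
</code></pre></div>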
<p>We don't have all of that data locally stored (retention policies are tricky, and glacier
guarantees a certain level of redundancy), so we will slowly delete that data as it ages
out of our retention policy and hope that we don't need to restore it (and pay
the retrieval fee). So, Glacier's got us as a customer for a few more months, declining from
$12/month, due to the financial lock-in of the retrieval price.</p>
<hr class="footnotes-sep" />
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>Pricing as of 2020-06-26 06:49. If you're reading this more than 6 months from now, pricing has probably gone down again. <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
Static pages-18 months on2020-05-31T21:31:00-04:002020-05-31T21:31:00-04:00Gaige B. Paulsentag:www.gaige.net,2020-05-31:/static-pages-18-months-on.html<p>In 2018, I wrote about the move to
<a href="https://www.gaige.net/gaiges-pages-moves-to-static-generation.html">convert Gaige's Pages to a static generation model</a>.</p>
<p>I <a href="https://www.gaige.net/follow-up-on-static-pages.html">followed this up</a> in mid-December of that year
describing the drop in processing and response time.</p>
<p>After 18 months of running the site (and Cartographica's Blog as well) on Pelican, I wanted
to …</p><p>In 2018, I wrote about the move to
<a href="https://www.gaige.net/gaiges-pages-moves-to-static-generation.html">convert Gaige's Pages to a static generation model</a>.</p>
<p>I <a href="https://www.gaige.net/follow-up-on-static-pages.html">followed this up</a> in mid-December of that year
describing the drop in processing and response time.</p>
<p>After 18 months of running the site (and Cartographica's Blog as well) on Pelican, I wanted
to revisit the graph and look at today's performance.</p>
<a href="https://www.gaige.net/images/large/2020-05-31-Performance-gaigespages.png">
<img src="https://www.gaige.net/images/large/2020-05-31-Performance-gaigespages.png" width=619 height=212>
</a>
<p>Not only are things still looking super zippy (average time to deliver of 127ms), but it's
pretty darned consistent. Not surprisingly, most of the time (90+% of it) is spent doing
the TLS handshake.</p>
<p>In December, 2018 (on the same hardware) we were seeing an average of 125ms
(minimum of 18ms, maximum of 805ms); now those numbers are almost unchanged, with an average
of 127ms (minimum of 18ms, and a maximum of 178ms).</p>
<p>In the days of the wild west of the internet (pre-SSL/TLS), this site would have been
delivering in an average of 20ms.</p>
XCTest + CoreData = ouch2020-05-31T17:55:00-04:002020-05-31T17:55:00-04:00Gaige B. Paulsentag:www.gaige.net,2020-05-31:/xctest-coredata-ouch.html<p>I put this up in hopes that somebody runs across it more quickly than I did...</p>
<p>This weekend, as a "break", I decided to do some work updating an ancient (2003-vintage)
piece of code that I wrote when I was doing extensive blogging. I'm not certain it'll ever
leave my …</p><p>I put this up in hopes that somebody runs across it more quickly than I did...</p>
<p>This weekend, as a "break", I decided to do some work updating an ancient (2003-vintage)
piece of code that I wrote when I was doing extensive blogging. I'm not certain it'll ever
leave my computers, but it was an opportunity to play around with some technologies that
I'd honestly not touched in years, including CoreData.</p>
<p>Among the things that I did was modify my code to use the more modern <code>NSPersistentContainer</code>,
in hopes that I could experiment some with CloudKit. Although it's likely that I'll do that
manually, at least at first, the thought of trying out the latest way of doing this made
sense (to me, at the time).</p>
<p>Unfortunately, I have a habit of writing unit tests. I say unfortunately not because I don't
see enormous value in them (in fact, I uncovered a long-standing bug in the existing code
with the first test that I wrote). I say unfortunately, because a lot of people don't write
them, or don't take them as seriously, and that means that strange interactions between
tests and other parts of the OS tend to be harder to find.</p>
<p>In this case, the <code>NSPersistentContainer</code> came back to bite me hard. As I later located,
there are some other people who have <a href="https://forums.raywenderlich.com/t/multiple-warnings-when-running-unit-tests-in-sample-app/74860/5">seen this problem</a>,
and it caused some <a href="https://stackoverflow.com/questions/51851485/multiple-nsentitydescriptions-claim-nsmanagedobject-subclass">problems for them</a>
as well.</p>
<p>In my case, I'd already created a container for my CoreData stack, and that was part of what
caused me the pain. In order to test that independently, I created a test bundle which executed
stand-alone. This bundle (of course) had a copy of the model file. Here's where my problems started.</p>
<p>Due to things going on behind the scenes and differences between Objective-C's and Swift's
handling of namespaces (oh, I didn't mention that I was also doing all the new code in
swift, with an existing, albeit small, Objective-C code base?) I had to make sure that
I was passing the <code>managedObjectModel</code> parameter to my <code>NSPersistentContainer</code> init method
so that it would find the right one (otherwise in the test bundle it would fail entirely).</p>
<p>In order to do this, I needed code to grab the bundle of the class I was using so that I
could guarantee that the right bundle was being loaded. Easy enough, I wrote a small class
method:</p>
<div class="codehilite"><pre><span></span><code><span class="kd">static</span> <span class="kd">func</span> <span class="nf">managedObjectModel</span><span class="p">()</span> <span class="p">-></span> <span class="bp">NSManagedObjectModel</span> <span class="p">{</span>
<span class="kd">let</span> <span class="nv">bundle</span> <span class="p">=</span> <span class="n">Bundle</span><span class="p">(</span><span class="k">for</span><span class="p">:</span> <span class="kc">self</span><span class="p">)</span>
<span class="k">return</span> <span class="bp">NSManagedObjectModel</span><span class="p">.</span><span class="n">mergedModel</span><span class="p">(</span><span class="n">from</span><span class="p">:</span> <span class="p">[</span><span class="n">bundle</span><span class="p">])</span><span class="o">!</span>
<span class="p">}</span>
</code></pre></div>
<p>and in my initializer for my container class, I loaded up the container:</p>
<div class="codehilite"><pre><span></span><code><span class="kd">let</span> <span class="nv">container</span> <span class="p">=</span> <span class="bp">NSPersistentContainer</span><span class="p">(</span><span class="n">name</span><span class="p">:</span> <span class="s">"SiteQuoter"</span><span class="p">,</span> <span class="n">managedObjectModel</span><span class="p">:</span> <span class="n">SQExceptionManager</span><span class="p">.</span><span class="n">managedObjectModel</span><span class="p">())</span>
</code></pre></div>
<p>Which worked great for my first test; every subsequent test, however, started throwing non-fatal
warnings. Admittedly, I have a real problem with warnings, both at compile time and run time,
and I don't like them in tests either, so I spent too much time trying to track this down.
Note: the tests succeeded despite the warning that:</p>
<div class="codehilite"><pre><span></span><code><span class="nv">Failed</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="nv">find</span><span class="w"> </span><span class="nv">a</span><span class="w"> </span><span class="nv">unique</span><span class="w"> </span><span class="nv">match</span><span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="nv">an</span><span class="w"> </span><span class="nv">NSEntityDescription</span><span class="w"> </span><span class="nv">to</span><span class="w"> </span><span class="nv">a</span><span class="w"> </span><span class="nv">managed</span><span class="w"> </span><span class="nv">object</span><span class="w"> </span><span class="nv">subclass</span>
</code></pre></div>
<p>After looking around (see above links), I confirmed that other people were also seeing
the unexpected caching of the <code>NSManagedObjectModel</code> and set about to make sure I only
created one of them (hoping that would solve my <code>NSPersistentContainer</code>-related problem).</p>
<p>I won't disclose exactly how long it took to figure out the magic incantation, but I will
disclose that Xcode's code analysis system was crashing most of the time that I wasn't doing
it right, and thus "functionality was limited".</p>
<p>In the end, I replaced my static method with a lazily-initialized class variable:</p>
<div class="codehilite"><pre><span></span><code><span class="kd">static</span> <span class="kd">var</span> <span class="nv">managedObjectModel</span><span class="p">:</span> <span class="bp">NSManagedObjectModel</span> <span class="p">=</span> <span class="p">{</span>
<span class="kd">let</span> <span class="nv">managedObjectModel</span> <span class="p">=</span> <span class="bp">NSManagedObjectModel</span><span class="p">.</span><span class="n">mergedModel</span><span class="p">(</span><span class="n">from</span><span class="p">:</span> <span class="p">[</span><span class="n">Bundle</span><span class="p">(</span><span class="k">for</span><span class="p">:</span> <span class="n">SQExceptionManager</span><span class="p">.</span><span class="kc">self</span><span class="p">)])</span><span class="o">!</span>
<span class="k">return</span> <span class="n">managedObjectModel</span>
<span class="p">}()</span>
</code></pre></div>
<p>Which is now working like a champ. Thanks to <code>rennarda</code> for their February answer to the
aforementioned SO post.</p>
<p>I believe that the piece that threw me a number of times was the reference for the class in
the <code>Bundle(for:)</code> call. For some reason, that regularly crashed the editor code and all
the other solutions I tried (like referencing the class directly) failed. In the end, it
seems as if the problem with referencing the class directly in that case had to do with the
compiler inferring that I was trying to call an initializer instead of referencing the class.</p>
So much LDAP, so little time2020-05-01T12:00:00-04:002020-05-01T12:00:00-04:00Gaige B. Paulsentag:www.gaige.net,2020-05-01:/so-much-ldap-so-little-time.html<h2>The background</h2>
<p>Many years ago, all of my systems were pets. I tried to make them easier to manage by
standardizing on a single operating system (MacOS X Server at the time) and used management
tools that were part of that suite.</p>
<p>As time moved forward, Apple decided to concentrate …</p><h2>The background</h2>
<p>Many years ago, all of my systems were pets. I tried to make them easier to manage by
standardizing on a single operating system (MacOS X Server at the time) and used management
tools that were part of that suite.</p>
<p>As time moved forward, Apple decided to concentrate on the iPhone instead of the Xserve as
the next big thing and reduced their efforts on the server front. First, the hardware platform I was using
(Xserve) disappeared, and then MacOS X Server started taking big hits in functionality.</p>
<p>Meanwhile, Rob and I were moving our systems to be much more like cattle than pets. We
had standardized on SmartOS systems for running lightweight zones, and had standardized on
Ansible for reducing the overhead of rebuilding systems. In as many ways as we could,
all information on the servers, except for data which was rapidly changing or under user
control, was moved into a highly-reproducible configuration management system that allowed
us to try out new versions, run tests (some automated) and keep everything up to date by
nuking and paving the servers and rebuilding them from scratch each quarter or so.</p>
<p>The primary directory management system for MacOS X is LDAP. The implementation, called
Open Directory, was compatible with open-source and closed-source LDAP servers and scaled
pretty well in large environments. In small environments, the GUI was good enough that it
was not painful.</p>
<p>As we moved to SmartOS, though, we lost that built-in integration, and I had to create a
set of Ansible roles to connect SmartOS to LDAP in order to continue to use the same servers
(originally) and replicas of them based on OpenLDAP later. This all worked pretty well,
except that the LDAP servers themselves were pretty hand-tuned.</p>
<h2>State of affairs, pre-pandemic</h2>
<p>It took me a few years of spare time to get all of my systems under management, and the
last few systems to go were a set of LDAP servers that ran multiple domains worth of user
configuration for mail systems that I was running. These LDAP servers ran in an HA configuration
and generally worked really well. They also had securely hashed passwords, which I couldn't
un-hash and which used an algorithm incompatible with our selected OS (this becomes important later).</p>
<p>As a part of my final push to get things under control, I decided to finally bite the bullet
and move the LDAP servers forward to the latest releases and get them on the automation
train. It required serious care to make sure that everything worked correctly, but our
habit of running separate instances for testing helped markedly in finding problems with
the system before taking it live.</p>
<p>I'll note here that I did have some other systems that served mail to other groups of users
that were built more recently. These used standard password hashes and were uncomplicated
by the use of LDAP.</p>
<h2>Quarantine Administration</h2>
<p>During the end of March and the beginning of April, I finally got the testing systems running
to my satisfaction. It seemed like the transition to production would be easy enough, since
the systems using the servers hardly ever changed, users generally weren't resetting their
passwords, and besides, there was no self-serve interface for that function anyway.</p>
<p>On the appointed day, I took one of the production servers offline (leaving it
in a dormant state that could be resuscitated quickly) and brought up one of the new servers
in the HA configuration, with the database of the previous production server. All seemed fine
on the servers I was testing on, but then I noticed a small number of users were having
trouble logging in.</p>
<p>I looked at the LDAP logs and there were no suspicious entries, but I noted in the database
itself that the users having trouble had no password entries whatsoever. I rolled back
the servers, but unfortunately the new data had been propagated to the other server.
Rebuilding from the most recent backup had the same problem... as far as I can tell, the
issue at this point was that a small number of users had something in their password records
that was failing the LDAP data dump that I had used to reload things. Unfortunately, this
was the same dump I was using to rebuild the database, and that meant the old passwords
were effectively gone.</p>
<h2>The absence of complexity</h2>
<p>Although it's pretty clear that the problem was one of bad backups, I needed to get the
users back up and running, and I had no idea <em>why</em> the backups were bad. It might have been
something about the ancient MacOS X Server LDAP schema that I'd pulled forward, or some
change in the underlying configuration, but at this point, I didn't have time to figure it
out. I needed to get things back up and running.</p>
<p>Here is where I made a fine choice, with some prodding by Rob. Seeing that I was going to
need to reset passwords anyway for a number of users, I reached out to my user base and
requested new hashed passwords from them. But, since I was going to have to request these
passwords, I broadened the scope and got new hashes from everyone. This meant that I could
remove the complication of LDAP from my systems.</p>
<p>At the end of the day, I had everybody back up and running and a set of systems that are
even easier to operate than they were before.</p>
<p>I don't want anyone to take away from this that LDAP is bad. It isn't; it has places where
it's definitely the right solution. However, a small datacenter application with a slowly-changing
userbase and people with habits of good password hygiene is not one of those.</p>
<p>Let this serve as a reminder that we should always be open to making the right big changes
when given the opportunity.</p>
Welcome Bacula2020-03-31T13:46:00-04:002020-03-31T13:46:00-04:00Gaige B. Paulsentag:www.gaige.net,2020-03-31:/welcome-bacula.html<p>I wasn't originally going to write this up on the blog, but considering that we've just
finished our transition from our old backup software (BRU, no link) to <a href="https://bacula.org">Bacula</a>
community edition <em>and</em> considering that it's World Backup Day, it seemed like it would
make sense.</p>
<p>As many of you are …</p><p>I wasn't originally going to write this up on the blog, but considering that we've just
finished our transition from our old backup software (BRU, no link) to <a href="https://bacula.org">Bacula</a>
community edition <em>and</em> considering that it's World Backup Day, it seemed like it would
make sense.</p>
<p>As many of you are likely aware, ClueTrust hosts equipment at a top-tier datacenter for
providing services to our datacenter customers and our software customers alike.</p>
<h2>From Retrospect to BRU</h2>
<p>Since a lot of data that exists at the datacenter is not easily replaceable, we've had
on-site backups since early on in our operation of the racks of servers. At the time that
we started our backup journey, that was an LTO library hooked up to an Apple Xserve (RIP)
via Fibre Channel. Because of those particulars (Macintosh-based server, clients of a variety
of types, tape library), we had an extremely limited choice of backup software, and BRU
was basically it. At the time (December, 2006) we had been through a couple of tumultuous
years (literally, starting in December of 2003) evaluating various versions of BRU while
they got their MacOS X Server ducks in a row, and while Retrospect (our previous backup
software provider) didn't even seem interested in the Macintosh market at the time.</p>
<p>The journey with BRU was always a bit strained (I won't go into it here, but I believe
we're both happier to be out of that dysfunctional relationship). My expectations for
their responsiveness and customer-orientation were rarely met, and although there was a
lot of work on the MacOS X platform in 2003-2010, the release cadence for BRU Server from
that point seemed to grind to a near-halt. With that said, we have never lost a file with
BRU, the backups were always readable, and the format was simple enough that it gave
us confidence that even if an archive became corrupted, we could retrieve most of the data
from it.</p>
<h2>Taking it to the cloud</h2>
<p>By 2014, our preferred method of sending backups off-site no longer required me to take my
car to the datacenter and pull tapes out of the rack. Instead, we were moving to an
offsite storage mechanism that used "cloud storage". In our case, that meant <a href="https://aws.amazon.com/glacier/">AWS Glacier</a>.</p>
<p>There was no direct support for Glacier (or any other off-site backup mechanism) built
in to BRU, but they did have a disk-to-disk-to-tape model that could be run without the
"to-tape" part, which lead to my creating a bespoke Python solution for uploading our
archives to Glacier. I would not recommend that to most people, as the process is a bit
arduous and maintaining your own critical backup software is not recommended if you don't
have the discipline to regularly test it (especially when you don't control the server).</p>
<p>The solution we put together took advantage of the mostly self-contained nature
of the BRU archives to shoot the data (encrypted after the fact, but otherwise unchanged)
to Glacier.</p>
<p>By 2015, as I mentioned in <a href="https://www.gaige.net/smartos-postfix-and-ipv6.html">SmartOS, Postfix and IPv6</a>,
we were in the process of shutting down our Xserves and replacing them with SmartOS.
Although BRU Client worked fine on the Solaris variant, we were never able to get the
licensing module to work with SmartOS, despite attempts to work with the BRU engineers.
As such, we ended up running our backup server in a sub-optimal configuration, a KVM-based
Ubuntu environment with a raw disk partition for scratch. Obviously, this would have been
much better if we'd been able to run on SmartOS with a LOFS partition directly taking advantage
of ZFS, but that was something we were never able to achieve.</p>
<p>Since certain catalog data wasn't readily extracted from the per-machine archives,
I re-engineered our custom solution in 2015 to make sure that we were storing
all of the salient metadata (type of backup, date, machine) in a way that would be more
easily addressed. This allowed us to find and remove old incrementals and so forth in
Glacier.</p>
<p>So, at this point, we had a custom off-site storage solution, hand-baked encryption, and
we were running in a KVM machine instead of running directly on the OS. Not an optimal
solution. Especially so when we had little hope that our chosen OS would be moving
forward in BRU-land. Things were working, but it required a lot of work to keep it up.</p>
<h2>Heading into the future</h2>
<p>As 2020 dawned, Rob and I were working on a number of datacenter initiatives, including
moving to a new SmartOS hardware platform and establishing beachheads in some other
locations. As part of this, I was looking to see what options we had for self-hosting
our off-site backups. Glacier wasn't hideously expensive (and its price per byte decreases
occasionally), but if we're going to have off-site hardware, why not put our off-site
backups there?</p>
<p>The prospect of multiple sites also started me thinking about our current choice of a
commercial software solution. BRU wasn't unreasonably priced, but running a second server
would be a separate instance and that'd be a separate license. We could run the backups
over the internet from our other datacenter(s), but that would be a weird configuration and
likely not a performant one.</p>
<p>At this point, the idea hit that it was time to evaluate a solution that meets our 2020
needs, not our 2003 needs. As such, the requirements were:</p>
<ul>
<li>Open Source solution (if possible)</li>
<li>Support for a wide variety of OS, including SmartOS, Linux, macOS</li>
<li>Well-documented storage format</li>
<li>Classic Full, Incremental (optionally Differential) backups</li>
<li>Off-site cloud storage with compatibility with open-source storage solutions</li>
<li>Built-in public-private key encryption (preferably e2e from the device being backed up)</li>
<li>Built-in transport encryption and positive identification</li>
<li>Zero client trust required</li>
<li>Easy scripted installations of client and server</li>
<li>Flexible and scriptable configuration</li>
</ul>
<p>I looked around at a number of solutions, including the eventual winner, <a href="https://bacula.org">Bacula</a>,
and stalwarts such as <a href="http://amanda.org">Amanda</a>, as well as a ton of other, younger
solutions. Many of the newer solutions were either cloud-first or cloud-required; they often
trusted the client too much (such as handing the cloud credentials to the client), and
almost none of them had old-school multi-level backups, instead going for the much more
modern, Time-Machiney approach of a perpetually fresh backup.</p>
<p>I'm a big fan of Time Machine on macOS, but it's not the only backup I choose to use and
if I'm going to have a single backup mechanism, it's not going to be one where the loss of
some kind of long-term incrementally-updated database will result in sadness. As it stands,
I've watched multiple times in the last decade as my Time Machine backups became corrupt
or needed to be moved off of older hardware. It's an extremely convenient capability, but
it's also brittle.</p>
<h2>The choice: Bacula</h2>
<p>So, after all that looking around, I turned my sights on Bacula as the leading contender.</p>
<ul>
<li>It has an open source version (yes, there have been some issues in the past with the
update cadence of the open source version, which led to a fork named <a href="https://www.bareos.com">bareos</a>)</li>
<li>OpenSolaris is a supported OS, as are all of our other required OS</li>
<li>There is <a href="https://www.bacula.org/5.0.x-manuals/en/developers/developers/Overall_Storage_Format.html">storage format documentation</a></li>
<li>Backups are of the traditional Full, Incremental, Differential variety (although it also supports creating new synthetic Full backups)</li>
<li>Recent versions directly support S3-compatible off-site storage (including <a href="https://min.io">Minio</a>, and with Minio's help, <a href="https://backblaze.com">Backblaze</a>)</li>
<li>Encryption is end-to-end (except for attributes<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup> ) and uses public-private key encryption with optional multiple keys</li>
<li>TLS transport encryption and unique passwords for identifying each component to each other component
for positive identification</li>
<li>Clients are not in control and not allowed to contact the director directly</li>
<li>Installation from source, or a binary package (available for some platforms directly from
their website) is simple and easily scripted</li>
<li>Configuration parameters are all stored in text files which can be scripted easily</li>
</ul>
<p>All told, it hit all of our specifications and came in at a great price ($0), with available
commercial support if necessary and fully open source code.</p>
<p>Testing went well and I was able to script the building and packaging process as well as the
installation process on both the client and server end without difficulty.</p>
<p>In fact, one of the side-effects of a free solution is that we're now able to run a complete
test setup which mirrors our production setup and allows for easy validation of configuration
changes and upgrades.</p>
<p>There has been some difficulty with the built-in cloud support, but at least some of that
was owing to my problems getting the Minio-Backblaze gateway going. Now that it's functional,
things seem to be working better. In addition, the mechanism for uploading data to Backblaze
(or S3, or Minio) is straightforward enough that uploading manually using <a href="https://rclone.org">rclone</a>
and downloading using the restore process in Bacula was completely successful.</p>
<p>By the way, performance has been excellent. It's not extremely fast when dealing with large
numbers of very small files (presumably file attribute overhead there), but it is highly
performant on large files and even the small file performance is acceptable. Because of the
text-based configuration, I've been able to do quite a bit of experimentation, and our nightly
incremental backup across 18 different machines finishes in 8 minutes. Obviously, a full takes
substantially longer, but through the use of separate "tape changers" we're able to keep the
administratively-separate data separated while still running concurrent backups.</p>
<hr class="footnotes-sep" />
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>Bacula encrypts the data, but not attributes such as filenames, dates, modes, owners, etc.
Although contents of your backups are protected, frequently the metadata can be just as important
as the data in the file itself. As such, this begs for some kind of further encryption if you
are sending this data offsite for third-party storage. <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
ssh key choices2020-03-07T16:36:00-05:002020-03-07T16:36:00-05:00Gaige B. Paulsentag:www.gaige.net,2020-03-07:/ssh-key-choices.html<p>This weekend, Rob and I had been testing the use of hardware keys to secure ssh sessions,
especially for back-end console access and certain administrative functions. Since the
hardware keys are a special case, and cannot be added to the <code>ssh-agent</code>, we were slinging
around a fair number of command …</p><p>This weekend, Rob and I had been testing the use of hardware keys to secure ssh sessions,
especially for back-end console access and certain administrative functions. Since the
hardware keys are a special case, and cannot be added to the <code>ssh-agent</code>, we were slinging
around a fair number of command lines with <code>-i <keyfile></code> on them to point ssh at the key
we wanted it to offer. Also as part of the diagnostics, I was running with <code>-v</code>, so that
I could tell exactly what was going on. This is when I was reminded that ssh's choice of
keys isn't always what I expect (a problem I'd run into previously when maintaining keys
for a number of customer projects on the same device).</p>
<p>Without looking at the code, the observed key offer sequence appears to be:</p>
<ol>
<li>
<p>Keys added to your currently-available <em>ssh-agent</em> (unless you have disabled that with
<code>-o IdentitiesOnly=yes</code>)</p>
</li>
<li>
<p>Keys in your <code>~/.ssh/config</code> file referenced by the <code>IdentityFile</code> directive and matching
the host pattern (these are cumulative).</p>
</li>
<li>
<p>Keys specified with <code>-i</code> on the command line</p>
</li>
</ol>
<p>With that said, here are a few useful ssh-related notes:</p>
<ul>
<li>
<p>When you know you're going to need to use a password (as opposed to a key), use</p>
<div class="codehilite"><pre><span></span><code>ssh<span class="w"> </span>-o<span class="w"> </span><span class="nv">PubkeyAuthentication</span><span class="o">=</span>no
</code></pre></div>
</li>
</ul>
<p>which I have used <a href="https://www.keyboardmaestro.com/main/">Keyboard Maestro</a> to alias to
<code>sshP</code> in Terminal. This prevents running out of login attempts before you get a chance
to enter the password, as most ssh servers will only allow 3-5 attempts,
including pubkeys and interactive passwords.</p>
<ul>
<li>
<p>If you need to prevent your config file from being used, <code>-F /dev/null</code> will override
your config file with an empty one.</p>
</li>
<li>
<p>Of course, you can always use <code>ssh-add -D</code> to remove all keys from the <em>ssh-agent</em>, but
that affects all terminal sessions on your machine.
As an alternative, you can avoid consulting the <em>ssh-agent</em> by unsetting the shell variable
<code>SSH_AUTH_SOCK</code>, which is used to locate the authentication socket. Since this is a shell
variable, it only affects the shell that you perform it in, so it leaves your other
terminal windows able to use the agent's keys.</p>
</li>
<li>
<p>None of the identity commands affect the operation of the <code>-A</code> command line switch (or
the corresponding <code>ForwardAgent yes</code> directive in your config file). So, even if you
use <code>-o IdentitiesOnly=yes</code> to keep the session initiation from offering the keys in
the agent at the time, an <code>-A</code> flag on the command line will allow you to use the keys
on further communication from the target host (useful for things like bastion hosts, such
as the ones we're securing).</p>
</li>
</ul>
Update to nginx_alias_map2020-02-26T11:47:00-05:002020-02-26T11:47:00-05:00Gaige B. Paulsentag:www.gaige.net,2020-02-26:/update-to-nginx_alias_map.html<p>I've been doing a bunch of maintenance on my two blogs (company and personal) and one
purpose has been to track down malformed and mis-mapped URLs on the site. Since both
have been through changes in the underlying blog engine a couple of times, there are
multiple sets of URLs …</p><p>I've been doing a bunch of maintenance on my two blogs (company and personal) and one
purpose has been to track down malformed and mis-mapped URLs on the site. Since both
have been through changes in the underlying blog engine a couple of times, there are
multiple sets of URLs that point to the same content. Generally, since there was a long
period of time between each of the respins, the search engines picked up the changes in
URLs, but occasionally I will see log entries for the two-engines-ago format, and I'd like
to fix those as well, especially since the same content is still on the site in most
cases.</p>
<p>The most recent versions (the <a href="https://blog.cartographica.com">Cartographica blog</a> was on
SquareSpace and <a href="https://gaige.net">Gaige's Pages</a> was in Drupal) are already mapped and
had simple mappings due to good URI choices. However, both of these blogs were previously
(initially) in <a href="https://www.geeklog.net">Geeklog</a>, and it had a format that was based on
the <code>article.php</code> file and a query string.</p>
<p>As mentioned in my <a href="https://www.gaige.net/pelican-plugin-for-nginx-redirection.html">nginx_alias_maps</a> post
last year (this week, it turns out), I had written a bit of code to produce nginx maps to
handle redirections.</p>
<p>However, as written, the <code>map</code> that I was generating was using the <code>$uri</code> variable in
nginx, which cooks the URI by removing things like the query string. Obviously, this won't
work for the old Geeklog query string-based redirection, so I needed to move to using the
full <code>$request_uri</code>. That was fine, but came with drawbacks as well. As I mentioned, the
<code>$uri</code> variable is cooked by nginx, removing relative directory traversals, double-slashes,
query strings, etc. For most of my URIs, this is a much better fit. As such, I decided
to complicate the plugin a bit and add support specifically for URIs which contained a <code>?</code>
as an indicator of query strings, and to process them in a second stage map. It's a little
more time consuming, although it's not noticeable on my blogs.</p>
<p>The solution was to run the <code>$uri</code> map for any URIs not containing query strings and then
run the <code>$request_uri</code> map for any URIs that did contain them. So, if you had an alias
entry such as the one for
<a href="https://www.gaige.net/load-up-those-album-covers.html">Load up those album covers</a> (header shown here):</p>
<div class="codehilite"><pre><span></span><code><span class="n">Date</span><span class="o">:</span><span class="w"> </span><span class="mi">2003</span><span class="o">-</span><span class="mi">04</span><span class="o">-</span><span class="mi">29</span><span class="w"> </span><span class="mi">11</span><span class="o">:</span><span class="mi">41</span>
<span class="n">Alias</span><span class="o">:</span><span class="w"> </span><span class="sr">/node/4921,/</span><span class="n">article</span><span class="o">.</span><span class="na">php</span><span class="o">?</span><span class="n">story</span><span class="o">=</span><span class="mi">2003042913413622</span>
<span class="n">Tags</span><span class="o">:</span>
<span class="n">Category</span><span class="o">:</span><span class="w"> </span><span class="n">macintosh</span>
<span class="n">Title</span><span class="o">:</span><span class="w"> </span><span class="n">Load</span><span class="w"> </span><span class="n">up</span><span class="w"> </span><span class="n">those</span><span class="w"> </span><span class="n">album</span><span class="w"> </span><span class="n">covers</span>
</code></pre></div>
<p>the code will generate entries in two maps:</p>
<div class="codehilite"><pre><span></span><code><span class="k">map</span><span class="w"> </span><span class="nv">$uri</span><span class="w"> </span><span class="nv">$redirect_uri_1</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="kn">~^/node/4921$</span><span class="w"> </span><span class="s">https://</span><span class="nv">$server_name/load-up-those-album-covers.html</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">map</span><span class="w"> </span><span class="nv">$request_uri</span><span class="w"> </span><span class="nv">$redirect_uri</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="kn">default</span><span class="w"> </span><span class="nv">$redirect_uri_1</span><span class="p">;</span>
<span class="w"> </span><span class="kn">~^/article\.php\?story=2003042913413622$</span><span class="w"> </span><span class="s">https://</span><span class="nv">$server_name/load-up-those-album-covers.html</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div>
<p>Note here that the first map maps to <code>$redirect_uri_1</code> and the second one maps to
<code>$redirect_uri</code>, with a default value of <code>$redirect_uri_1</code>. Because of the way that
nginx evaluates maps, you can't use <code>$redirect_uri</code> in both cases.</p>
<p>As with previous versions, you need to include the map in your <code>http</code> stanza in your nginx
configuration, and you also need to check the value of <code>$redirect_uri</code> and send it back
as a redirect if present:</p>
<div class="codehilite"><pre><span></span><code><span class="k">include</span><span class="w"> </span><span class="n">/opt/web/output/alias_map.txt</span>;
<span class="k">server</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="kn">listen</span><span class="w"> </span><span class="s">*:80</span><span class="w"> </span><span class="s">ssl</span><span class="p">;</span>
<span class="w"> </span><span class="kn">server_name</span><span class="w"> </span><span class="s">example.server</span><span class="p">;</span>
<span class="w"> </span><span class="c1"># Redirection logic</span>
<span class="w"> </span><span class="kn">if</span><span class="w"> </span><span class="s">(</span><span class="w"> </span><span class="nv">$redirect_uri</span><span class="w"> </span><span class="s">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="kn">return</span><span class="w"> </span><span class="mi">301</span><span class="w"> </span><span class="nv">$redirect_uri</span><span class="p">;</span>
<span class="w"> </span><span class="p">}</span>
<span class="w"> </span><span class="kn">location</span><span class="w"> </span><span class="s">/</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="kn">alias</span><span class="w"> </span><span class="s">/opt/web/output</span><span class="p">;</span>
<span class="w"> </span><span class="p">}</span>
<span class="p">}</span>
</code></pre></div>
<p>Of course, if you only have one or the other type of redirection, the code will make sure
to only create a single-stage map.</p>
<p>Updated code is now available as
<a href="https://github.com/gaige/nginx_alias_map">nginx_alias_map</a> on <a href="https://github.com">github</a>.</p>
Client Certs and Intermediate CAs2020-02-24T10:12:00-05:002020-02-24T10:12:00-05:00Gaige B. Paulsentag:www.gaige.net,2020-02-24:/client-certs-and-intermediate-cas.html<h1>Why client certificates?</h1>
<p>RS wrote about <a href="https://technotes.seastrom.com/2016/11/01/preventing-drivebys-with-client-certs.html">Preventing drive-bys with client certs</a>
and although we'd discussed this method for some time, I hadn't deployed it yet. However,
some recent log-spelunking had led me to determine that I liked the idea of a second
layer of protection on some of my sites …</p><h1>Why client certificates?</h1>
<p>RS wrote about <a href="https://technotes.seastrom.com/2016/11/01/preventing-drivebys-with-client-certs.html">Preventing drive-bys with client certs</a>
and although we'd discussed this method for some time, I hadn't deployed it yet. However,
some recent log-spelunking had led me to determine that I liked the idea of a second
layer of protection on some of my sites.</p>
<p>Just for clarification, this is being implemented as a first layer of authentication and
will only grant access to the basic functionality of the server. There will be further
login requirements beyond that. However, as Rob notes in his post, there is significant
advantage to keeping the wild internet away from the login screen (or anything else they
might be able to exploit). So, as Rob says, this is drive-by prevention.</p>
<p>Prior to this exercise, I'd occasionally used nginx's built-in authentication to require
another user name and password before accessing the site, but this is a bit tedious due to the
way that browser prompts and password systems (such as 1Password) work. Client-side certificates
don't have this problem, at least in the Apple ecosystem, where authorization is tied to the
device's authentication.</p>
<p>(TL;DR: include the CRLs for every CA in the chain above the client certificates, all the way to the root)</p>
<h1>Intermediate CAs</h1>
<p>As RS's article is sufficient to get you going, why is this article necessary? Because I
have a tendency to take opportunities like this to explore unnecessarily complex models,
such that I can understand how the internals work and can employ them as needed in the future.</p>
<p>In this particular case, I have run an internal CA now for over a decade. This is used only
on internal communications, although before <a href="https://letsencrypt.org">Let's Encrypt</a>, it was
also used for securing web sites that weren't going to be accessed by people outside of the
organization (and family and friends). With LE dropping the marginal cost of certificates
to zero (and the fixed cost to the automation that RS and I have already created), the
private CA hasn't been getting a lot of use.</p>
<p>A few years back, my original CA cert expired and it caused a notable amount of pain (this
was still before LE was in general release), mostly because I needed to send out new certificates
to each of my users and make sure they installed them on all of their various devices.
In order to head this off for the future, I wanted a long-running CA certificate, something
with an expiration beyond the date of my likely retirement, so something in the 2040s...</p>
<p>In the "real world" Root CAs have frequently had this kind of duration. For example the DigiCert
High Assurance EV Root CA expires in 2031, and was originally issued in 2006; so, basically
25 years. However, they also have significant security on their Root CAs (for example, physical
HSMs holding the keys) and do not issue normal certificates directly from the Root, but instead
issue from intermediate CAs that have much shorter expirations. So, I figured I could approximate
this using a secure USB storage device (tamper-proof and encrypted with a long PIN) for my
Root, and by issuing certificates off of intermediates with much more limited lifespans.</p>
<p>For the few servers I've used this on, it has worked well. I trust the Root certificate on
each of my devices and then the web servers send the intermediate (signed by the root) and
the server certificate along with it.</p>
<h1>Intermediate CAs and debugging client-side certificates</h1>
<p>Since I already have the intermediate CAs (it turned out I'd created one for client certificates
and another for server certificates originally), it seemed like an easy enough exercise to
take RS's recipe and apply it to my client certs. I generated a new private key and a new
client certificate for myself and then went about the configuration.</p>
<p>Simple enough, I took the intermediate CA and uploaded it to my server, along with the
current Certificate Revocation List (CRL) and placed them into the <code>ssl_client_certificate</code>
and <code>ssl_crl</code> stanzas respectively, making sure to turn <code>ssl_verify_client on;</code> as well.</p>
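<p>For reference, the relevant part of the configuration looked roughly like this (a minimal sketch; the paths and server name here are hypothetical):</p>
<div class="codehilite"><pre><code>server {
    listen 443 ssl;
    server_name protected.example.com;

    ssl_certificate         /etc/nginx/certs/server.crt;
    ssl_certificate_key     /etc/nginx/certs/server.key;

    # Verify client certificates against the intermediate CA
    ssl_client_certificate  /etc/nginx/certs/client-intermediate-ca.pem;
    ssl_crl                 /etc/nginx/certs/client.crl;
    ssl_verify_client       on;
}
</code></pre></div>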
<p>That didn't work. No good error message, just an SSL error from both Safari and Chrome.
I should note here that debugging client certificates requires quitting your browser frequently.
If you don't, there's often some piece of state that will create either false negatives or
false positives. So, while testing, basically quit your browser and re-start it between
every attempt. However, I've found no benefit to restarting the machine.</p>
<p>I tried adding the Root CA to the Intermediate certificate in <code>ssl_client_certificate</code>, but
this was unnecessary (as it should be if the client contains an authorized copy of the root,
and the server is explicitly instructed to use the intermediate certificate as the base for
authorization).</p>
<p>I'll note here that adding <code>debug</code> to the end of my nginx config's <code>error_log</code> line
would have been helpful at this point. There were definitely errors occurring and only the
server knew what they were.</p>
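<p>In nginx terms, that's just a matter of appending the level to the <code>error_log</code> directive (log path hypothetical):</p>
<div class="codehilite"><pre><code>error_log /var/log/nginx/error.log debug;
</code></pre></div>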
<p>Once I turned on debug, it was clear that the CRL verification was causing difficulty. I
validated that by commenting out the <code>ssl_crl</code> line in the nginx config and restarting the
server and my browser, after which things worked.</p>
<p>No CRL is probably a bad idea, so I did some looking around and it became clear that nginx
wanted to check both the CRL of the Root and the Intermediate, so I concatenated the
two CRL files and uploaded them to the server, and now the server is working fine.</p>
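<p>The fix itself was a one-liner; assuming CRL files named as below (the names are hypothetical), something like:</p>
<div class="codehilite"><pre><code># Combine the Root and Intermediate CRLs into the file referenced by ssl_crl
cat root-ca.crl intermediate-ca.crl > client.crl
</code></pre></div>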
<p>In retrospect, it made sense. If you're going to ask for client certificate verification and
you're going to provide a CRL, you should provide CRLs all the way to the root.</p>
Larry Tesler at NCSA2020-02-19T13:40:00-05:002020-02-19T13:40:00-05:00Gaige B. Paulsentag:www.gaige.net,2020-02-19:/larry-tesler-at-ncsa.html<p>Today I read of the passing of <a href="https://en.wikipedia.org/wiki/Larry_Tesler">Larry Tesler</a>,
a computer scientist with a long and storied career, spanning Xerox PARC, Apple, Amazon,
Yahoo, and others. He's considered the father of the modeless interaction model (think
Cut/Copy/Paste on the Mac).</p>
<p>I met Larry in the late 1980s, when …</p><p>Today I read of the passing of <a href="https://en.wikipedia.org/wiki/Larry_Tesler">Larry Tesler</a>,
a computer scientist with a long and storied career, spanning Xerox PARC, Apple, Amazon,
Yahoo, and others. He's considered the father of the modeless interaction model (think
Cut/Copy/Paste on the Mac).</p>
<p>I met Larry in the late 1980s, when I was at NCSA and he was visiting there on sabbatical
from Apple, with his wife (a geophysicist with a time grant on our supercomputer). As the
head Mac enthusiast at the facility (ok, maybe Brand Fortner had that title at the time, but I
was a Mac developer and he was more management), one might think that Larry and I would hit
it off. And, perhaps we would have, if it hadn't been for the Mac II Ethernet Card.</p>
<p>Back in those days, NCSA had a great relationship with Apple. We had a lab full of Macs to
go with our Cray, and we liked to talk about them. In 1986, we'd released NCSA Telnet, my
first serious professional software and a key communications package for the Mac in those
days. Also, we'd been part of the educational beta test for the Macintosh II<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup>.</p>
<p>When Larry arrived with his wife to begin their work at NCSA, he brought with him a brand
new Mac II with an even newer Ethernet card. These cards were so new that they were not available
to the general public, or even to most of the folks who were part of the beta test; instead, we were
still using LocalTalk<sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup> for our networking.</p>
<p>Larry comes in and assembles his Mac II, complete with test Ethernet card and I'm brought in
to hook up the network connection at his desk. Introductions are made, although I had no
idea who he was, and he certainly didn't know me from Adam. I hooked up the thin-net connection
to the back of the Mac II, pulled out a trusty floppy to load the latest NCSA Telnet onto
the machine, and sat down to get things working.</p>
<p>At the time, I don't recall if we had any other systems that had Ethernet adapters, but there
were some units available for the Mac SE with the Processor Direct Slot, and I believe I'd
already gotten the Ethernet drivers working there. Regardless of the timeline, we did manage
to get the networking working quickly, and NCSA Telnet was humming along... for a few minutes
and then everything came to a halt. The Mac II had crashed and there was no clear cause.</p>
<p>Being two software guys, Larry and I both assumed that there was something wrong with NCSA
Telnet and its interaction with the Ethernet driver. I came back in and sat down to take a
further look at what was going on. Debugging tools on the Mac weren't great in those days, but
there was always Macsbug, and if the timing was right, maybe TMON.</p>
<p>After trying to figure it out for a while, we pulled the plug on the Ethernet card and
tried NCSA Telnet over LocalTalk. That worked fine, and would allow him to get to work if
he wanted to. Meanwhile, I had access to the machine in off-hours to try and figure the
problem out. I came back to the office later in the day and tried again. It took much
longer to reproduce the problem, but it eventually came back. I restarted again and let the
system just sit while I went to talk to Tim or Tom, or one of the other folks on the team to
see if they had any bright ideas. Nothing new from them, but when we came back, the system
had crashed. The odd part was that NCSA Telnet wasn't loaded and, in those days, you only had
one application running at a time. Further, NCSA Telnet didn't use any resident drivers
as it had full access to the hardware when it was running. Curious...</p>
<p>The next day, I got a frantic call that the machine was acting up, even on LocalTalk. As
far as Larry was concerned, this was proof positive that the problem was with NCSA Telnet and
not with the Ethernet card. I went over to check out the problem and noticed that, unlike
the previous day, we'd left the Ethernet card plugged in. I unplugged it and we tried to
reproduce the problem, and could not. The problem was now coming into focus.</p>
<p>A few days later, some of these cards had made it to other universities in the test program
and I was getting email telling me that they were experiencing crashing problems. Of course,
it looked like NCSA Telnet might be involved, since it was one of the communications programs
that would use Ethernet. However, in these cases as well, the problems also seemed to
occasionally manifest when NCSA Telnet wasn't running.</p>
<p>Eventually (not sure if this was within a week or two), we were able to put the pieces together
and realize that high traffic on the network caused the Ethernet hardware to freeze up,
causing the Mac to follow suit. Apparently, Apple's test labs were much more orderly and
had a lot fewer collisions, whereas our crowded educational networks were collision-fests and
this was directly related to the problems on the card. A re-spin of the hardware was in
order and it wasn't too long before we had working Ethernet cards in the Mac II that could
handle a network with real activity on it.</p>
<p>Unfortunately, I never really got to know Larry; although we exchanged pleasantries while he
was visiting, he was thoroughly involved in the geophysics work at the time. Plus, I don't
think that the Ethernet card situation resulted in a lot of good will...</p>
<p>Interestingly, while doing some searching for this piece, I ran across Kent Beck's
<a href="https://medium.com/@kentbeck_7670/larry-tesler-1945-2020-b910429f12eb">Piece on Larry Tesler</a>
and if you notice about half-way down the page, you'll get to an anecdote about a geological
core sample browser; that was what he was working on.</p>
<hr class="footnotes-sep" />
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>As an aside, we had to keep the Mac II in a locked room, and agree to a number
of limitations on use, including a specific prohibition on taking off the cover. A bit
disappointing, since it had 6 NuBus slots. However, when it arrived, it had no cover. Technically,
we were never in violation. <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn2" class="footnote-item"><p>I'm not sure if this was before or after they renamed AppleTalk to LocalTalk,
but since AppleTalk eventually became the name of the protocol stack and LocalTalk the name
of the medium, I have disambiguated the two here. <a href="#fnref2" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
Overwatch Leaves nVidia's GeForce Now2020-02-12T09:11:00-05:002020-02-12T09:11:00-05:00Gaige B. Paulsentag:www.gaige.net,2020-02-12:/overwatch-leaves-nvidias-geforce-now.html<p>According to an <a href="https://www.pcworld.com/article/3526458/activision-blizzard-just-pulled-its-games-from-nvidias-geforce-now.html">article</a>
on <a href="https://www.pcworld.com">PCWorld</a>, <a href="https://www.blizzard.com/en-us/">Activision-Blizard</a> has pulled
all of their titles from <a href="https://www.nvidia.com/en-us/">nVidia's</a>
<a href="https://www.nvidia.com/en-us/geforce-now/">GeForce Now</a>.</p>
<p>In my days as CTO of <a href="https://haste.net">Haste</a> (a service that improves network connections for gamers),
I had occasion to spend a fair amount of time playing <a href="https://us.shop.battle.net/en-us/product/overwatch?p=20991">Overwatch</a> as part of
our test regime …</p><p>According to an <a href="https://www.pcworld.com/article/3526458/activision-blizzard-just-pulled-its-games-from-nvidias-geforce-now.html">article</a>
on <a href="https://www.pcworld.com">PCWorld</a>, <a href="https://www.blizzard.com/en-us/">Activision-Blizard</a> has pulled
all of their titles from <a href="https://www.nvidia.com/en-us/">nVidia's</a>
<a href="https://www.nvidia.com/en-us/geforce-now/">GeForce Now</a>.</p>
<p>In my days as CTO of <a href="https://haste.net">Haste</a> (a service that improves network connections for gamers),
I had occasion to spend a fair amount of time playing <a href="https://us.shop.battle.net/en-us/product/overwatch?p=20991">Overwatch</a> as part of
our test regime. As a Mac user (and the only one on our technical team), I had a much more
difficult time than my co-workers getting things to run smoothly and beautifully. While Richard and Taric were
playing on their huge honking Razer and SurfaceBooks with high-end nVidia graphics, I had to
settle for running games (via <a href="https://support.apple.com/boot-camp">Bootcamp</a>) on my
then current MacBook Pro. I had the highest-end graphics available in it, but the Radeon Pro
at the time was no match for the then-current GeForce, especially given the thermal constraints
of the 15" MacBook Pro at the time.</p>
<p>Having done quite a bit of team-based FPS back in the <a href="https://en.wikipedia.org/wiki/InterCon_Systems_Corporation">InterCon</a>
days (shout out to <a href="https://www.bungie.net">Bungie's</a> early history as a Mac games company,
and the awesome game, <a href="http://marathon.bungie.org">Marathon</a>), I needed to see if I could
reduce the technical discrepancy between my rig and my fellow players. So, I proceeded to
do the only thing a laptop Mac user could do, short of buying a separate PC gaming rig, and
I acquired an AKiTiO (now <a href="https://macsales.com">OWC</a>) <a href="https://eshop.macsales.com/item/AKiTiO/NODET3IAO/">Node</a>
eGPU box, capable of running a modern, beefy GPU connected to my Mac. Since I was not trying
to use my internal Retina display and I wasn't trying to use it under macOS, I didn't have
any significant problems getting things set up, and now I was cooking with gas with an
external 4K monitor that I could drive at max resolution and 60Hz.</p>
<p>Of course, those of you who know me know this didn't help my gameplay that much, but at least
I no longer felt left behind and there wasn't any real doubt that my skill was showing through
in the game.</p>
<p>But, in the end, the system was a bit fiddly, not very portable, and required shutting down
from macOS and rebooting into Windows, which I needed to keep up-to-date in order to play
the game. Not a great experience.</p>
<p>When I returned to DC and stepped back to being an advisor at Haste, I sold off my GeForce
card (they depreciate quickly), mothballed the AKiTiO and stopped playing Overwatch.</p>
<p>Fast forward a couple of years, and nVidia announces that they're finally going live with
GeForce Now, and not only is it reasonably priced, but there would be a free tier. With that,
I couldn't put off giving it a try. Honestly, the gameplay blew me away. Considering that
I was playing without having to leave macOS or reboot, or add any more hardware to my machine,
this was pretty awesome. I signed up, and jumped back into Overwatch, enjoying myself for
the better part of an hour before calling it quits for the day.</p>
<p>Unfortunately, that looks like it has quickly come to an end, because of Blizzard's actions,
and now I will have to decide if there's another game that's worth taking up. But, at least
I no longer feel like I'll be left out when my friends are discussing some great new game
which isn't (yet) available on the Mac.</p>
Developing on a 2019 Mac Pro2020-01-08T14:55:00-05:002020-01-08T14:55:00-05:00Gaige B. Paulsentag:www.gaige.net,2020-01-08:/developing-on-a-2019-mac-pro.html<p>There's been a lot of discussion about the 2019 Mac Pro and various assertions that it's over-designed,
overpriced, or underpowered. Since I decided to replace my venerable 2013 Mac Pro<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup>
with a 2019 Mac Pro, I figured I'd write up my experience with the device as a developer.</p>
<h2>The …</h2><p>There's been a lot of discussion about the 2019 Mac Pro and various assertions that it's over-designed,
overpriced, or underpowered. Since I decided to replace my venerable 2013 Mac Pro<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup>
with a 2019 Mac Pro, I figured I'd write up my experience with the device as a developer.</p>
<h2>The Codebase</h2>
<p>So that people have context for what the codebase is that I'm talking about here, I currently
ship 3 "products":</p>
<ul>
<li><a href="https://macgis.com">Cartographica</a> (Macintosh GIS product)</li>
<li><a href="https://cartomobile.com">CartoMobile</a> (Mobile GIS for field data entry - iOS)</li>
<li><a href="https://loadmytracks.com">LoadMyTracks</a> (Free utility for working with GPS devices on macOS)</li>
</ul>
<p>Cartographica is by far and away the largest of these three projects. Besides over 250K lines
that I have written over the years (CLOC lines of code, not counting whitespace or comments),
there are another approximately 1.2M lines of third-party code<sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup> that gets compiled every time I
do a Clean and Build.</p>
<p>LoadMyTracks is much smaller, about 44K lines of code by me (some shared with Cartographica,
but for the purposes of this article, I'm interested in describing what's compiled) and about
37K lines of third-party code.</p>
<p>CartoMobile is in the middle, with approximately 134K lines written mostly by me (the remainder of
the ClueTrust code was written by a contractor early on in CartoMobile's life) and about
380K lines of third-party code.</p>
<p>Beyond the code, we have a fair number of tests that run in the system as well. Cartographica has
on the order of 3500, CartoMobile has around 700, and LoadMyTracks around 150. These are
each mostly unit tests, with some integration tests as well. The UI tests (a combination of
manual and automated) are not included here, as they don't run automatically in the CI environment.</p>
<h2>Build Environment</h2>
<p>I try to keep on the latest build environments wherever possible, which means a fair
amount of work maintaining the code base (and removing deprecations). Thus, the build environment
is Xcode 11.3 as of the arrival of the 2019 Mac Pro. I'm using Xcode's parallel testing, so
it's not out of the ordinary for the 2019 Mac Pro to have 8 or 10 copies of Cartographica
running for the in-process tests and a similar number of <code>xctest</code> instances running during
the stand-alone tests.</p>
<p>For Continuous Integration (CI), I run <a href="https://jenkins.io">Jenkins</a> on a set of Mac Minis, with most of the build
work being done by a 2018 Mac Mini. CI builds are run using <a href="https://fastlane.tools">fastlane</a> and Jenkins pipelines.</p>
<h2>Before and after</h2>
<h3>Cartographica</h3>
<p>Now that I've set out the environment and the challenge, let's see how the Mac Pro did with it.
Clean building on the 2013 Mac Pro (6-core) takes 358s, or close to 6 minutes. Clean building
on the 2019 Mac Pro (16-core) takes 82s. For the record, building in the Xcode GUI directly on the 8-core
2018 Mac Mini takes 177s.</p>
<p>The tests themselves are a different matter. The Mac Mini is still beating the 2019 by about
21s of wallclock time (131s vs 110s), mostly owing to 2 single-threaded tests (run in parallel) that take
51s on the Mac Pro and only 31s on the Mac Mini. I'm tracking this issue down, but in this case, the
Mac Mini is still running 10.14 and the 2019 (like its predecessor) is running 10.15.2.
Since these long-running tests involve spawning a subprocess to run a shell script which executes a unix-level
executable approximately 400 times, I'm guessing this is related to a difference in the
shell execution. For reference, the 2013 Mac Pro took over 180s to run these tests. Interestingly,
the MacBook Pro (running 10.15.2) runs the tests similarly to the Mac Mini, taking 35s to
run the long shell-script based test.</p>
<table>
<thead>
<tr>
<th>Machine</th>
<th style="text-align:right">Build Time</th>
<th style="text-align:right">Test Time</th>
<th style="text-align:right">Scripted Test</th>
</tr>
</thead>
<tbody>
<tr>
<td>2019 Mac Pro (16 core Xeon)</td>
<td style="text-align:right">82s</td>
<td style="text-align:right">131s</td>
<td style="text-align:right">51s</td>
</tr>
<tr>
<td>2019 MacBook Pro (8 core i9)</td>
<td style="text-align:right">169s</td>
<td style="text-align:right">116s</td>
<td style="text-align:right">35s</td>
</tr>
<tr>
<td>2018 Mac Mini (6 core i7)</td>
<td style="text-align:right">177s</td>
<td style="text-align:right">110s</td>
<td style="text-align:right">31s</td>
</tr>
<tr>
<td>2013 Mac Pro (6 core Xeon)</td>
<td style="text-align:right">358s</td>
<td style="text-align:right">223s</td>
<td style="text-align:right">>180s</td>
</tr>
</tbody>
</table>
<h3>LoadMyTracks</h3>
<p>Differences on LoadMyTracks were much smaller, which I assume is because the code base is so much
smaller. The test times are almost identical.</p>
<table>
<thead>
<tr>
<th>Machine</th>
<th style="text-align:right">Build Time</th>
<th style="text-align:right">Test Time</th>
</tr>
</thead>
<tbody>
<tr>
<td>2019 Mac Pro (16 core Xeon)</td>
<td style="text-align:right">16.5s</td>
<td style="text-align:right">3s</td>
</tr>
<tr>
<td>2019 MacBook Pro (8 core i9)</td>
<td style="text-align:right">18.8s</td>
<td style="text-align:right">3s</td>
</tr>
<tr>
<td>2018 Mac Mini (6 core i7)</td>
<td style="text-align:right">15.5s</td>
<td style="text-align:right">3s</td>
</tr>
<tr>
<td>2013 Mac Pro (6 core Xeon)</td>
<td style="text-align:right">43.1s</td>
<td style="text-align:right">4s</td>
</tr>
</tbody>
</table>
<h3>CartoMobile</h3>
<p>CartoMobile's larger codebase (thus more amenable to compiling in parallel) results in another
significant win for the 2019 Mac Pro over the 2013 (and the MacBook Pro this time). As with
Cartographica, the Mac Mini comes in 3rd behind the 2019 MacBook Pro.</p>
<table>
<thead>
<tr>
<th>Machine</th>
<th style="text-align:right">Build Time</th>
</tr>
</thead>
<tbody>
<tr>
<td>2019 Mac Pro (16 core Xeon)</td>
<td style="text-align:right">20.8s</td>
</tr>
<tr>
<td>2019 MacBook Pro (8 core i9)</td>
<td style="text-align:right">32.1s</td>
</tr>
<tr>
<td>2018 Mac Mini (6 core i7)</td>
<td style="text-align:right">47.2s</td>
</tr>
<tr>
<td>2013 Mac Pro (6 core Xeon)</td>
<td style="text-align:right">48.5s</td>
</tr>
</tbody>
</table>
<h2>Conclusion</h2>
<p>Not surprisingly, it looks like the larger the codebase, and the more amenable to being
built or tested in parallel, the better the 2019 Mac Pro runs. This was pretty much the case
with the previous Mac Pro as well, when compared to the Mac Minis and laptops at the time.</p>
<p>I don't have an iMac Pro to perform these tests on, but based on the significant performance
improvements over the 6- and 8-core machines, I expect I'd want the 14- or 18-core iMac Pro
to try and reach the same performance. However, those two machines are clocked at 2.5 and 2.3GHz
respectively, which is significantly lower than the 2019 Mac Pro 16-core's 3.2GHz.</p>
<p>Assuming that the 18-core would be sufficient, a similarly configured machine (2TB NVMe SSD, default
video card, 64GB of RAM) would run me $8278 list. The 2019 Mac Pro configuration I purchased,
along with 64GB of 3rd Party RAM and LG 5K display, listed at $8799 + $1299 + $442 = $10,540.
It's definitely $2,262 more, although if I were purchasing the iMac, I'd probably upgrade the
video card, since I don't have any option to do it later (reducing the difference by $700).
In addition, if it hadn't been for the strange spot that the 2013 Mac Pro existed in
(Thunderbolt 2 not quite having enough bandwidth for 5K 60Hz on a single cable), I wouldn't
need to replace my Dell 5K monitor that I'd used with it for the last 5 years, which would
drop the difference by another $1299.</p>
<p>With that said, for compiling my main App, the processor speed reduction of close to
30% might be an issue.</p>
<p>In the end, the value proposition of a Mac Pro vs. competing platforms
is a judgement call for anyone who is considering it. The combination
of the flexibility (I changed graphics cards and expanded RAM in my previous "cheese grater"
models) and the external monitor to me makes enough of a difference to come down on the side
of the Mac Pro. Assuming that Apple keeps up with it, I may end up replacing the computer
and keeping the components, such as upgraded graphics cards and maybe RAM. If they don't
keep up with the upgrades, the replaceable CPU is likely to result in opportunities to improve
performance, as had been done with previous models containing replaceable CPUs.</p>
<hr class="footnotes-sep" />
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>Seriously, I was pretty happy with it, despite the clear lack of upgradability and limits to expansion. <a href="#fnref1" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn2" class="footnote-item"><p>Shout out here to the most significant third-party code that we use: <a href="https://gdal.org">GDAL</a>,
<a href="https://sparkle-project.org">Sparkle</a>, and <a href="https://proj.org">Proj</a>. <a href="#fnref2" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
gitignore as a service2019-12-09T13:02:00-05:002019-12-09T13:02:00-05:00Gaige B. Paulsentag:www.gaige.net,2019-12-09:/gitignore-as-a-service.html<p>When you're looking to quickly create an appropriate .gitignore file for a new repository,
you can save yourself some time, and possibly aggravation, by using <a href="https://gitignore.io">gitignore.io</a>.</p>
<p>Available as either a website with a very simple interface (and completion), or as a simple
API-based service <a href="https://docs.gitignore.io/install/command-line">documentation for the API</a>
and …</p><p>When you're looking to quickly create an appropriate .gitignore file for a new repository,
you can save yourself some time, and possibly aggravation, by using <a href="https://gitignore.io">gitignore.io</a>.</p>
<p>Available as either a website with a very simple interface (and completion) or as a simple
API-based service; <a href="https://docs.gitignore.io/install/command-line">documentation for the API</a>
(including how to call it from the command line) is available from the site.</p>
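<p>For example, following the invocation pattern the site documents (the particular list of targets here is just an example):</p>
<div class="codehilite"><pre><code># Generate a .gitignore for a macOS/Xcode/Python project
curl -sL https://www.gitignore.io/api/macos,xcode,python > .gitignore
</code></pre></div>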
<p>Source code is available on GitHub, and the license is a nice MIT license.</p>
<p>I've generally found the information to be pretty exhaustive, although occasionally you can
run into policy decisions, so I find it useful to grab a copy and then review it. Remember,
if you miss a file that you want to have saved, it may just be left out of your repository.</p>
<p>One additional note: if you like the default for a set of items you generated from the
web site, the top of the generated file contains the API URL to pull it in the future.</p>
<p>All told, a really useful service, thanks to those who created and have contributed to this!</p>
Ansible become: useful and dangerous2019-11-26T20:51:00-05:002019-11-26T20:51:00-05:00Gaige B. Paulsentag:www.gaige.net,2019-11-26:/ansible-become-useful-and-dangerous.html<p>OK, now that I have your attention with the catchy title, let me get right into the
reason behind this post. Rob has been doing a lot of work lately on a set of roles
to provision raspberry pi systems. I'm grateful for the work in this area, because
frankly …</p><p>OK, now that I have your attention with the catchy title, let me get right into the
reason behind this post. Rob has been doing a lot of work lately on a set of roles
to provision raspberry pi systems. I'm grateful for the work in this area, because
frankly, I find them a bit annoying to get boot-strapped. We're not ready to
fully publish the workflow yet, but I expect that'll happen in due time, likely on Rob's
fine <a href="https://technotes.seastrom.com">Technotes</a> blog.</p>
<h2>About become</h2>
<p><code>become</code> and its associates (<code>become_user</code>, <code>become_method</code>, and the least-used <code>become_flags</code>)
provide a way to handle privilege escalation in situations where you otherwise are logging
in to a system with an unprivileged user. In our particular context, the default configuration
on a raspberry pi has a single default user who can <code>sudo</code>, but can't execute privileged
commands natively.</p>
<p>By setting <code>become: true</code>, execution will by default escalate to user <code>root</code> using
the method <code>sudo</code>. Clearly, <code>become_user</code> and <code>become_method</code> provide a way to modify who
and how you become.</p>
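<p>A minimal sketch of what this looks like in a play (the module and package here are just examples):</p>
<div class="codehilite"><pre><code>- hosts: pis
  tasks:
    - name: Install ntp (requires root)
      apt:
        name: ntp
        state: present
      become: true          # escalate via sudo, the default method
</code></pre></div>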
<h2>A warning about over-scoped <code>become</code></h2>
<p>In cases like the raspberry pi, it's tempting to just run all commands using become, and
there's a seemingly convenient command-line flag to do just this, <code>-b</code>. By asserting
<code>-b</code>, you tell ansible to execute all commands with escalated privilege.</p>
<p>Notice that I said <em>all commands</em>? This <em>includes</em> commands that are executed locally using
ansible's <code>local_action</code> command. My initial reaction to this was surprise, figuring that
<code>local_action</code>, as opposed to <code>delegate_to: localhost</code> would likely avoid the <code>become</code>
capability, but in retrospect, it's a good bit of consistency.</p>
<p>So, don't over-become, and don't use the <code>-b</code> command-line flag. If you have a host, such
as the aforementioned pi, that you need to <code>become: true</code> for all commands, you can
assign <code>ansible_become=yes</code> using your configuration method of choice (stick it in your
inventory file, add it to <code>group_vars/all.yml</code>, or maybe even in a specific subgroup if
you have a mix of machines that have varying <code>become</code> needs).</p>
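<p>For example, in an INI-style inventory (hostname hypothetical):</p>
<div class="codehilite"><pre><code>[pis]
pi1.example.com ansible_become=yes
</code></pre></div>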
<h2>Other interesting uses for become</h2>
<p><code>become</code> is a really useful feature, and we regularly use it in plays for command-specific
one-offs, such as using <code>become: yes</code> and <code>become_user: postgres</code> to run PostgreSQL commands
with the benefit of being the <code>postgres</code> user. Or with live content systems like django,
I use <code>become_user: www</code> to run some of the maintenance scripts so that we don't have to
manually re-chown all of the files that are collected for static web service.</p>
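<p>A sketch of the PostgreSQL case (the database name is hypothetical):</p>
<div class="codehilite"><pre><code>- name: Create the application database as the postgres user
  postgresql_db:
    name: exampledb
  become: yes
  become_user: postgres
</code></pre></div>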
<h2>Summary</h2>
<p>All told, <code>become</code> is really useful, but pay attention to where you're using it, and just
avoid entirely using <code>-b</code>, because you never really know what role might include some
<code>local_action</code> that has a bug and might execute <code>rm -rf /</code> for you...</p>
NetNewsWire rises again!2019-11-11T19:17:00-05:002019-11-11T19:17:00-05:00Gaige B. Paulsentag:www.gaige.net,2019-11-11:/netnewswire-rises-again.html<p>One of the very first posts on this blog (16 years ago in the beginning of 2003) was
entitled <a href="https://www.gaige.net/all-of-your-favorite-sites-at-a-glance.html">All of your favorite sites at a glance</a>
which discussed a new pair of apps (<a href="https://ranchero.com/netnewswire/">NetNewsWire</a>
and NetNewsWire Lite) that I'd just started using.</p>
<p>Considering the fallout for RSS from the …</p><p>One of the very first posts on this blog (16 years ago in the beginning of 2003) was
entitled <a href="https://www.gaige.net/all-of-your-favorite-sites-at-a-glance.html">All of your favorite sites at a glance</a>
which discussed a new pair of apps (<a href="https://ranchero.com/netnewswire/">NetNewsWire</a>
and NetNewsWire Lite) that I'd just started using.</p>
<p>Considering the fallout for RSS from the Google Reader shutdown in 2013 and all that involved,
it is surprising, but heartening, that those of us who remain independent bloggers have
clung to the RSS standard for making it easy to keep up with sites we're interested in.</p>
<p>I wanted to take a few minutes to add my congratulations to <a href="https://inessential.com/">Brent Simmons</a>,
who has himself been blogging for over 20 years, and add my thanks to Brent and all of the
contributors who made the new version of NetNewsWire possible. I'm now back to using it
and it's the best experience I've had reading RSS on my Mac in years (well, since the last
great NetNewsWire).</p>
Separating Ansible roles for fun and profit2019-11-11T13:41:00-05:002019-11-11T13:41:00-05:00Gaige B. Paulsentag:www.gaige.net,2019-11-11:/separating-ansible-roles-for-fun-and-profit.html<p>At ClueTrust, we use a lot of automation to run our systems. It's mostly how just a
couple of us can manage hundreds of virtual servers and keep them up-to-date and
operational.</p>
<p>A few years back, I moved from using <a href="https://puppet.org">Puppet</a> to
<a href="https://ansible.org">Ansible</a>, mostly at the suggestion of RS, who …</p><p>At ClueTrust, we use a lot of automation to run our systems. It's mostly how just a
couple of us can manage hundreds of virtual servers and keep them up-to-date and
operational.</p>
<p>A few years back, I moved from using <a href="https://puppet.org">Puppet</a> to
<a href="https://ansible.org">Ansible</a>, mostly at the suggestion of RS, who was finding Ansible
a solid choice for both network and server automation.</p>
<p>At a high level, my problems with Puppet were that my code tended to rot pretty quickly
due to changing dependencies, and the Ruby-based DSLs were overly complex and hard to
debug a few months after using them last.</p>
<p>Ansible is based on Python (a language I'm very comfortable with), and although there's a
bit less elegance in an ordered series of steps in a configuration file (a playbook in ansible),
there's a whole lot less difficulty in figuring out what the system is doing.</p>
<p>After years of working with Puppet, I wanted to deliberately move to Ansible in
a way that would maximize ease of re-use in similar situations. This led me to the
following guiding principles:</p>
<ol>
<li>
<p>Scope and general nature of Ansible code should get more specialized and opinionated
as it moves from commands to roles to plays to playbooks to environments.</p>
</li>
<li>
<p>At every level, use generalized code from the previous level when reasonable.</p>
</li>
<li>
<p>As you move to more specialized code, the code becomes less useful to other groups.
By the time you reach a playbook, the code should mostly be tying together reusable lower-
level modules (especially roles and commands) and applying inventory data to them.</p>
</li>
<li>
<p>Wherever possible, parameterize items that are easy to parameterize. Especially when
writing roles, use of parameters that can be defaulted or automatically set in the role
can make it much easier to expand the use of a role across operating systems. In our case,
although most of our roles are aimed at SmartOS, there are a growing number that can also
be applied to Linux environments when necessary.</p>
</li>
<li>
<p>Discrete units of code should be treated as separate and therefore stored in their own
repositories. For example, since Roles are expected to be highly reusable, they're always
each stored in their own individual git repository and included using Galaxy-style
<code>requirements.yml</code> files. (Note: ansible-galaxy can point at arbitrary git and hg repositories
as well as the marketplace. This includes private git repos, which is exactly what we do.
The format is <code>- src: git+user@server.fqdn/repo</code>; a minimal sketch follows this list.)
Similarly, separating environments containing playbooks into separate repos also provides
advantages.</p>
</li>
</ol>
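<p>A minimal <code>requirements.yml</code> sketch, following the format above (the role name and version are hypothetical):</p>
<div class="codehilite"><pre><code># requirements.yml
- src: git+user@server.fqdn/repo
  name: myrole
  version: v1.2.0
</code></pre></div>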
<h2>Why so many Repos?</h2>
<p>Looking at number 5 above might make you think that we have a lot of repositories for our
ansible roles, and you'd be right.
Not only does this promote reuse across the organization, but it also provides an easy
point at which to make a role public if appropriate. Further, eventually you may need to
make a breaking change in a role. If you do this in a separate repo, you can just set the
specific version in the <code>requirements.yml</code> file for playbooks that don't need, or can't
use the newer version just yet. However, this can be done on a role-by-role basis.
In contrast, using a single omnibus repository for all roles will make this difficult if
you need different versions of 2 roles in a single playbook.</p>
<h2>Ansible environments</h2>
<p>Similarly, when we build playbooks, we tend to put them in what I call <code>environments</code>.
An ansible environment is a location that contains playbooks, inventories, and requirements.
Because they contain inventories, they also frequently contain customization data related
to similar hosts. Although these can be administrative domains, they're commonly also
functional. However, due to the level of reuse of roles, the environments can be split in
any way that makes sense. At a minimum, the environment contains the following (a sketch of a typical layout follows the list):</p>
<ul>
<li>at least one inventory</li>
<li>at least one playbook</li>
<li>a <code>requirements.yml</code> file</li>
<li>various vars directories (<code>group_vars</code> and <code>host_vars</code>)</li>
<li>various files directories (<code>files</code> and <code>templates</code> are well known, but we also use
<code>host_files</code> for common host-specific files)</li>
<li>a README.md file</li>
</ul>
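<p>Put together, a typical environment repository (names hypothetical) ends up looking like this:</p>
<div class="codehilite"><pre><code>my-environment/
    prod                 # production inventory
    stage                # staging inventory
    site.yml             # one or more playbooks
    requirements.yml     # roles pulled in via ansible-galaxy
    group_vars/
        all.yml
        prod.yml
        stage.yml
    host_vars/
    files/
    templates/
    host_files/          # per-host files
    README.md
</code></pre></div>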
<p>It's not uncommon for us to begin with an environment that contains one or more
simple playbooks for related systems, and then evolve those playbooks over time into roles.
This is effectively the next obvious step in refactoring ansible for us. If, inside
an environment, a series of steps becomes so common that it exists across many playbooks,
it is likely moved to an included play. From there, if it's usable across other environments,
it is then turned into a role and gets its own repository. In this way, we can control the
complexity of the playbooks and environments and standardize our roles to minimize
configuration drift between similar systems.</p>
<h2>Stage vs Prod</h2>
<p>Everyone has their own way of doing inventories, but since I'm discussing our environments
here, I figured I'd also touch on how I do inventories to manage stage vs production. For
the most part, I tend to work with 1:1 stage and production systems. This way, whenever we
need to validate a specific system, we have a way to do so without having to cobble something
together by hand. Not that I necessarily test every individual host every time, but by
keeping the mechanism standardized, it's easy to do so when necessary.</p>
<p>Generally speaking, I have 2 inventories in each environment, <code>prod</code> and <code>stage</code>. By specifying
these on the command line explicitly (<code>-i prod</code> or <code>-i stage</code>), it is always clear whether
you're going to affect crucial systems or not. At the base of these inventories, I include
every host in a group named for the inventory. As such, we can use <code>stage.yml</code> and <code>prod.yml</code>
in the <code>group_vars</code> directory to specify items that are specific to the two environments.</p>
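<p>Concretely, the top of a <code>prod</code> inventory might look like this (group and host names hypothetical):</p>
<div class="codehilite"><pre><code>[prod:children]
webservers
dbservers

[webservers]
web1.example.com

[dbservers]
db1.example.com
</code></pre></div>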
<p>Since <code>all.yml</code> in <code>group_vars</code> will be at the bottom of the priority, it's easy to do
things like temporarily change the location or version of a binary in stage by putting
the stage value in <code>stage.yml</code>. However, it's generally safer to define all of the vars
that might be shared in <code>all.yml</code>. Obvious exceptions to this are items that, if shared,
would cause problems, such as database server addresses. In these cases, put those
explicitly only in the <code>stage.yml</code> and <code>prod.yml</code>, so that they're undefined if left out.</p>
<h2>Host-specific files</h2>
<p>For the most part, it makes sense that an environment's key configuration files would be
in <code>files</code> or <code>templates</code>, and that items like nginx configurations for specific hosts or
classes of system would be called out at the top level and configured using variables.</p>
<p>However, there are cases where we either use roles to auto-generate files or have files
in a particular environment that are always different for each machine in the environment
and might be too cumbersome to put in the configuration. Especially in the first case,
we need to be able to check for existence quickly, and to create or modify the data as
necessary, without affecting any other configuration parameters. In this case, we use
our host-specific files directory <code>host_files</code>. Accessed as <code>host_specific_files</code> in
<code>all.yml</code>, it is defined thusly:</p>
<div class="codehilite"><pre><span></span><code><span class="nt">host_specific_files</span><span class="p">:</span><span class="w"> </span><span class="s">"{{</span><span class="nv"> </span><span class="s">inventory_dir</span><span class="nv"> </span><span class="s">}}/host_files/{{</span><span class="nv"> </span><span class="s">inventory_hostname</span><span class="nv"> </span><span class="s">}}"</span>
</code></pre></div>
<p>And, thus for every individual inventory host, it specifies a particular directory in the
environment (relative to the inventory). Files can then be placed inside this directory
(or in sub-directories) that are created the first time a host is provisioned and saved
when the environment is committed to the repository. Obviously, these should only be
public files unless they are encrypted using ansible-vault. However, in cases of some
persistent private keys for machines or services, we will frequently encrypt those and
store them in the repo, with the encryption keys kept locally on the user's machine.</p>
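<p>As a hypothetical example of how this gets used, a task that restores a saved, host-specific file looks like any other <code>copy</code>:</p>
<div class="codehilite"><pre><code>- name: Restore this host's saved ssh host key
  copy:
    src: "{{ host_specific_files }}/ssh_host_ed25519_key.pub"
    dest: /etc/ssh/ssh_host_ed25519_key.pub
    mode: "0644"
</code></pre></div>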
<p>Note: this isn't for high-security items; obviously, those should be much more limited in
access. However, this mechanism does provide a good way to limit re-provisioning
of certain keys (ssh host keys, for example) that might otherwise cause people to become
complacent about seeing frequent changes.</p>
SSH probe bad behaviors and the sshd settings that make them worse2019-08-26T11:49:00-04:002019-08-26T11:49:00-04:00Gaige B. Paulsentag:www.gaige.net,2019-08-26:/ssh-probe-bad-behaviors-and-the-sshd-settings-that-make-them-worse.html<p>Over the past few weeks, RS and I have noticed an increasing number of unexplained
failures logging in over SSH with both manual and automated means. On most of the servers
it was just an inconvenience, but as it started to become more frequent, it became a
significant issue for …</p><p>Over the past few weeks, RS and I have noticed an increasing number of unexplained
failures logging in over SSH with both manual and automated means. On most of the servers
it was just an inconvenience, but as it started to become more frequent, it became a
significant issue for some of our automation scripts.</p>
<p>However, the most significant effects were to our git server which only allows access
over ssh. Not surprisingly, this server gets frequent, short-duration connections from
our CI server as well as from individual git clients that are checking status against
their remotes. As such, the sudden spate of failures caused us to look into the issue.</p>
<p>Upon looking at the machine in question, it was clear that the usual ssh background probing
was going on, but that the bot that was probing us was getting confused and leaving the
connection open. This was causing a build-up of <code>ESTABLISHED</code> connections which was running
up against the default limit of 10 on SmartOS. Individually, these were not coming in very
quickly (probably in order to not trigger banning software), but since they were taking
a very long time to transition from <code>ESTABLISHED</code> to <code>CLOSED</code>, they were taking up space
in the table of 10 and causing additional inbound connections (from our legitimate users)
to be unceremoniously shut down.</p>
<p>Upon further investigation of the <code>sshd_config</code> file, I noted that the <code>LoginGraceTime</code>
in SmartOS is set to a default value of 600. Each connection had 10 minutes to wait
around until a successful login occurred or until it was disconnected. That seems a bit
long even if you're allowing password authentication, but in our key-only environment it
is extreme, and so we cut it back to 8 seconds.</p>
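<p>The change itself is a single line in <code>sshd_config</code>:</p>
<div class="codehilite"><pre><code># Default is 600 (10 minutes); in a key-only environment, failed logins
# don't need anywhere near that long
LoginGraceTime 8
</code></pre></div>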
<p>Interestingly, this caused a fair number of connections to be stuck in <code>FIN_WAIT_1</code>,
an indication that the bot that was probing us was just terminating the connection on its
end and moving on, without sending any indication to our side. That might just be a crash
on their side, or it might be a tuned bot. The distinction would be hard to make, but
fortunately, the sshd process quits immediately upon closing its side of the connection
due to <code>LoginGraceTime</code> expiring, so the result is a sufficient reduction in the number
of outstanding sshd's in <code>ESTABLISHED</code>, which fixed our immediate problem.</p>
<p>It was a bit of a surprise that this hadn't bitten us on other machines, but considering that most
of our automations completely destroy and rebuild the VM over ssh, the likelihood that
a significant number of abandoned connections build up during the build process is pretty
low.</p>
<p>We will be updating our standard sshd configuration to take care of this in the next
build cycle and I've updated machines which have seen this problem repeatedly on an
individual basis.</p>
Git subtrees for Perforce users2019-08-18T08:00:00-04:002019-08-18T08:00:00-04:00Gaige B. Paulsentag:www.gaige.net,2019-08-18:/git-subtrees-for-perforce-users.html<p>For many years, I was a happy <a href="https://perforce.com">Perforce</a> user. Despite
clearly not fitting their precise model, I had a three-user license which allowed
me and my bots to appropriately work on my code base. I have a number of pretty
complex projects, which often have overlapping code and I took …</p><p>For many years, I was a happy <a href="https://perforce.com">Perforce</a> user. Despite
clearly not fitting their precise model, I had a three-user license which allowed
me and my bots to appropriately work on my code base. I have a number of pretty
complex projects, which often have overlapping code and I took advantage of their
evolving code sharing mechanisms. Initially by using a single repository with
<em>workspaces</em> that included code from different locations, and then moving to
the more powerful (and, in my mind, more easily understood) streams paradigm.</p>
<h2>Moving slowly to Git</h2>
<p>I generally follow an active trunk (master) strategy where new development
is done on the trunk and branches are used to pull off releases. Generally
speaking, I tend to develop like I've got a team even when it's just me.</p>
<p>As Perforce evolved, they tried to expand their offerings to include more git-like
capabilities, in particular having local copies of your repositories. Obviously,
this carried very similar benefits and disadvantages of git, but since lightweight
branching and offline use were becoming more important, the perforce model needed
to adapt.</p>
<p>It was through this lens that I started to move my repositories over to git.
Frankly, I'd come up with adaptations which allowed me to make use of a central
server system without compromising mobility (in particular, for years my primary
server ran on my laptop, with a mirror running on my CI server), but over time
the effort to maintain the synchronized servers became more significant. Perforce
has a nice git front end for their server (git-fusion) which allows you to read
and write Perforce repos using a git interface, giving you most of the
advantages of both systems, but it falls down right where I tended to need it,
on handling incorporated streams. The git submodule overlay was tedious and
finicky at best and eventually convinced me to move fully to git.</p>
<p>At some other point in time, I'll discuss <a href="https://gitolite.com">gitolite</a>,
which is what I'm using as a git server right now, but that won't be important
for the rest of this story.</p>
<h2>Adoption of git in my complex code base</h2>
<p>Initially, I thought I was going to be able to use git-fusion and submodules
to pull my Perforce streams out of the repositories and keep things in sync,
but that proved problematic. Maybe it was the number of files or interesting
merge tactics I'd experimented with, but in the end, I found that my large,
complex repositories were best exported without Perforce's help with the
submodules. The result, though, was that I had huge, monolithic, git repositories
each of which contained complete histories of all of my submodules.</p>
<p>You'll note that I'm already using the word 'submodule' here, and that's not
an accident. I tend to develop my shared code modules with their own projects,
their own tests, and separate versioning and branching. With Perforce, this
minimized the amount of duplication in the code base, and kept the system clean.
Further, it led to my creation of a CI environment that would test all of my
modules individually as well as inside the larger projects. All told, I'm fond
of it.</p>
<p>So, as I moved to a git lifestyle, I followed instructions by others on how
to split my repositories using <code>git filter-branch</code> and successfully teased out the
submodules that I shared between my major projects. It took a bit of trial and
error, but it also proved a great experiment into the general forgiveness of git.
All told, not having to commit to a central server makes it a lot easier to
verify your full operation before you make a big mistake.</p>
<h2>Git submodules and subtrees</h2>
<p>When I started working my code into git, I'd read a number of pieces on the
various advantages and disadvantages of git submodules and git subtrees. The
basic feel of most of these articles was "Why you should never use [insert the
other technique here]". Generally speaking, I didn't see a lot of benefit to the
use of subtrees. The vast majority of the complaints seemed to be that submodules
were a pain (they can be complex), and they're difficult to deal with if you
need to make a lot of changes to other people's code. In my case, my submodules
were almost exclusively internal. Thus, the issue with OPC (other people's code) wasn't a big deal.</p>
<p>Until I got to a library that I like to festoon with my own framework. In particular,
I'm talking about the venerable <a href="https://gdal.org">GDAL</a>, a widely-used library
for raster and vector I/O used by much of the GIS industry. My macOS framework
is a pretty large and complex beast, with multiple subcomponents (GDAL has a
variety of optional libraries) and some private modifications. When using it
in Perforce, I'd used a multi-stream system: an <code>//import/gdal</code> repo that
carried an exact copy of the GDAL source and a <code>//GDALFramework</code> stream that
was used to build the framework. The latter repo was imported into my
application workspace using the stream functions. So, I'd grab the latest GDAL
(which at the beginning was using svn) and then check that in to my <code>//import/gdal</code>
repo, which I'd then merge into my <code>//GDALFramework</code> stream and fix any
requirements or compilation/test errors, then I'd check that in and update the
stream in my main application. It wasn't horrific, but getting to the point
where I could run my application tests against it was a long trek, so I didn't
update as frequently as I'd like.</p>
<p>After I moved to git, my framework remained an intrinsic part of my source, along
with its included <code>gdal</code> library. That worked fine, as long as I didn't need to
update anything, but the cross-stream import branch information was long out of
date, and I couldn't reasonably import a new version of GDAL without a lot of
care. Enter git subtrees.</p>
<p>I had considered using GDAL as a git submodule and just forking it for the
few changes I would need to make. That would work reasonably well, in theory,
but the build environment that I use isn't the same as the standard environment
and that meant that I couldn't reasonably test the code before committing into
master, which I deem a no-no (ok, I could, but it would mean coordinating separate
submodules with a set of special branches, which is do-able, but a pain in the
neck). By using git subtrees instead, the imported code
remains tightly coupled with the surrounding code, but isolated into an
appropriate subdirectory. I can still submit changes back to the GDAL repo when
appropriate, but I also get to fully test the code before I put it onto my
master branch.</p>
<p>Subtrees are easy to work with, especially in comparison to submodules. As long
as you're willing to commit to forward momentum in lock-step with your updated
subtrees, the mechanism works great. Add a subtree in a subdirectory:</p>
<div class="codehilite"><pre><span></span><code><span class="go">git subtree add --prefix GDALFramework/gdal gdal v2.4.2 --squash</span>
</code></pre></div>
<p>and you've got the code you want, right where you asked for it. Explaining that
line a bit: it adds the code from the <code>gdal</code> remote (which I added
using <code>git remote add</code>) at the <code>v2.4.2</code> tag into the
subdirectory <code>GDALFramework/gdal</code>, and uses the <code>--squash</code>
option to collapse the imported history into a single commit rather than
bringing every upstream commit into the local repo. Once that's
done, you can make changes to your heart's content, and merge in changes from
the original repo by doing:</p>
<div class="codehilite"><pre><span></span><code><span class="go">git subtree pull --prefix GDALFramework/gdal gdal <Branch id> --squash</span>
</code></pre></div>
<p>where <code>&lt;branch id&gt;</code> is whatever branch or commit id you want to update to. Do
your tests, make your changes, verify that everything is working and you're good.
Commit your changes when you're ready.</p>
<p>When you've got something that you want to commit back to the community, you
will need to fork the original repository and push to that origin so that you
can prepare your changes for assimilation:</p>
<div class="codehilite"><pre><span></span><code><span class="go">git subtree push --prefix=GDALFramework/gdal mygdalfork master</span>
</code></pre></div>
<p>I'm happy with my decision to use a combination of submodules and subtrees. I'm
sure that either method could be used for the purposes I'm using the other method
for, but I find the distinction is useful. In particular, I can easily experiment
with code branches between my apps using submodules, and I can work easily with
code from others which needs some adaptation using subtrees.</p>
Pelican plugin for NGINX redirection2019-02-20T09:32:00-05:002019-02-20T09:32:00-05:00Gaige B. Paulsentag:www.gaige.net,2019-02-20:/pelican-plugin-for-nginx-redirection.html<p>When I set out to move Gaige's Pages to a static web generator, chronicled in
<a href="https://www.gaige.net/gaiges-pages-moves-to-static-generation.html">Gaige's Pages moves to static generation</a>,
I stated one of the reasons that I favored <a href="https://github.com/getpelican/pelican">Pelican</a>
was because it is written in <code>python</code>, which is a language that I'm intimately familiar
with.</p>
<p>Not surprisingly, that …</p><p>When I set out to move Gaige's Pages to a static web generator, chronicled in
<a href="https://www.gaige.net/gaiges-pages-moves-to-static-generation.html">Gaige's Pages moves to static generation</a>,
I stated one of the reasons that I favored <a href="https://github.com/getpelican/pelican">Pelican</a>
was because it is written in <code>python</code>, which is a language that I'm intimately familiar
with.</p>
<p>Not surprisingly, that decision became useful pretty quickly. As I was working on moving
the <a href="https://blog.cartographica.com">Cartographica Blog</a> from SquareSpace to Pelican, I
had some concerns about the redirection method I used in Gaige's Pages, the
<a href="https://github.com/Nitron/pelican-alias"><code>pelican-alias</code></a> plugin.</p>
<p>The <code>pelican-alias</code> plugin is a highly useful piece of code, especially if you're going to
be placing your pelican site on a server you don't control. The method of redirection is
to place a file at the original location and then redirect using HTML. This is effective
without propagating multiple copies of your pages in multiple locations (as would be the
case if you used a mapping in your web server); however, it has two undesirable effects:</p>
<ol>
<li>It causes a slight browser-induced delay for the HTML reload command to be recognized
and executed</li>
<li>It doesn't tell search engines to permanently relocate your pages to their new location</li>
</ol>
<p>I realized that the problem I was looking to solve was slightly different. In my case, I have
complete control over the web server (<a href="http://nginx.org">nginx</a> in my case),
and therefore can provide configuration information to it directly,
including having it redirect using HTTP 301 and 302 codes.</p>
<p>Furthermore, since I have a fine static blog engine with plugin support written in a language that
I am comfortable with, and with plenty of example code, I was able to pull together a pretty
simple plugin to create a map from the <code>alias</code> attribute in my blog postings to the final
published URI. The result is now available as
<a href="https://github.com/gaige/nginx_alias_map">nginx_alias_map</a> on <a href="https://github.com">github</a>.</p>
<p>I'm now running two sites using it and both seem to be performing admirably.</p>
<p>Code is published under the MIT license and pull requests are welcome.</p>
XCUITests and macOS2019-02-05T14:23:00-05:002019-02-05T14:23:00-05:00Gaige B. Paulsentag:www.gaige.net,2019-02-05:/xcuitests-and-macos.html<p>A number of years ago, I set out to automate a set of manual tests that we've been
using for years to validate functionality and UI in Cartographica. I've been through
a lot of technologies over the years, some expensive commercial tools, some open
source technologies. I won't go through …</p><p>A number of years ago, I set out to automate a set of manual tests that we've been
using for years to validate functionality and UI in Cartographica. I've been through
a lot of technologies over the years, some expensive commercial tools, some open
source technologies. I won't go through an exhaustive list of what we used and
how we found them, but I will say that the last technology we used was <a href="https://appium.io">Appium</a>,
a tool created mostly for mobile (macOS does benefit from its younger, more
popular sibling occasionally) which models itself on <a href="https://www.seleniumhq.org">Selenium</a>
and uses that popular web testing tool as the orchestration layer.</p>
<p>When I moved to using Appium, I'd already been doing something similar using
a hand-modified version of <a href="https://github.com/pyatom/pyatom">Pyatom</a>, a python-based automated
testing framework for the Mac that uses the macOS Accessibility Framework to provide testing. I
implemented some custom modifications for testing Cartographica, but the system required that all
tests be written manually in python. That's fine, I'm good with python, but there were no recording
mechanisms or other tools to make the process easier.</p>
<p>After happening upon Appium when trying to automate testing for CartoMobile, I realized I could take
advantage of the Selenium offshoot to get some much-needed tooling around my testing for Cartographica.
This worked pretty well, although the recording tools never turned out to be a good shortcut, so I
ended up writing all of my tests in Python manually anyway. Still, there were third-party tools, and
integrations with popular testing and continuous integration frameworks, such as Jenkins.</p>
<p>In the end, I added about 25 tests for Cartographica using Appium before the relative brittleness got
to me. In addition to being a bit difficult to manage on the Mac, I ran into some significant timing
problems and issues with item occlusion when doing drag testing. Although that wasn't the end of the
world, it was a large time sink, and the tests themselves, even when hardened, had a failure rate of 3-8%
per run, which meant they weren't so much a gate as a hurdle. My last check-in notes from when I was still
hoping for better GUI test development were from 2014, and read "Try once again to get the drag-and-drop
tests working under 10.10". They did not work, even after many days of effort, and I finally gave up and
went back to manually running and validating the tests.</p>
Dynamic XCTests2019-01-20T22:04:00-05:002019-01-20T22:04:00-05:00Gaige B. Paulsentag:www.gaige.net,2019-01-20:/dynamic-xctests.html<p>For a number of years, Cartographica has had a lot of tests (on the order of 1500+), but a few of them are
quite a bit bigger than they should have been, owing mostly to their data-driven nature.</p>
<p>This post describes the method used to provide dynamic test creation for Cartographica …</p><p>For a number of years, Cartographica has had a lot of tests (on the order of 1500+), but a few of them are
quite a bit bigger than they should have been, owing mostly to their data-driven nature.</p>
<p>This post describes the method used to provide dynamic test creation for Cartographica.</p>
<h2>Cartographica File Formats</h2>
<p>Cartographica uses a couple of big libraries to provide a wide variety of file import and export capabilities,
a situation which is not uncommon in the GIS world, as most of the players, including ESRI, use (and make available)
libraries for accessing file formats that they either create or need access to. The result is that many applications
have access to a large number of file formats.</p>
<p>In order to deal with this large variety of formats in an intentional way, I created an internal catalog that is used
by Cartographica to handle most operations with file formats, including:</p>
<ul>
<li>User-readable format names</li>
<li>System-usable identifiers</li>
<li>Format limitations</li>
<li>Available test files and verification data</li>
<li>Identification of implementation source</li>
<li>Identification of licensing</li>
</ul>
<p>The files that contain this information are shipped as an internal part of the
application, and are part of the source bundle used to create the software.</p>
<p>As such, the test information in the file format data file provides essential information for programmatically testing
Cartographica's import and export capabilities (including round-trip testing where applicable).</p>
<h2>Implementing Dynamic Testing</h2>
<p>Historically, Cartographica's file format testing appeared in the test system as a small set of tests which ran for
a long time and provided little debugging information when a specific file format test failed. In addition, due to
the data-driven nature, it was difficult to test just a single file format.</p>
<p>Over time, I considered a number of different ways to create dynamic tests, most of which required code generation,
which was creating too much complexity for me to approve of. I needed a solution that would work within the confines
of our Continuous Integration build system (Jenkins) and would create obvious signals if the tests either failed or
were not executed for some reason.</p>
<p>This weekend, I took another stab at handling the problem and was successful. The mechanism is straightforward,
almost elegant, and very Objective-C.</p>
<p>After choosing to spend some time in this area, I read up on the official mechanism for creating dynamic tests, using
<code>defaultTestSuite</code>, which looked like it would be a good option, except that it has a side-effect of not playing
well with the IDE for re-running tests, and I really wanted to be able to do that. Because the suite creation code
is only called when running the whole suite, it doesn't make available the individual tests if they are re-executed.</p>
<p>The examples I'd found also resulted in a large number of identically-named methods being called, which was pretty useless
in terms of providing visibility to the tests. The solution to this problem was to create separate implementations
using the dynamic runtime properties of Objective-C. By doing so, the second problem (identical naming) was solved and
I could create uniquely-named, highly-identifiable test names.</p>
<p>As an added bonus, this also paved the way to fixing my first complaint through the removal of code. By creating all
of my tests with names starting with <code>test</code>, the system will auto-discover the tests (as long as they are created
in the <code>+initialize</code> method). Therefore, I was able to remove the code which otherwise created the default TestSuite,
since the Objective-C introspection for methods beginning with <code>test</code> would find all of my new dynamic tests.</p>
<p>I had help from a couple of sites in putting this approach together, and the result has been great. My 2 file format
tests are now approximately 200, and they are individually named for what they do and which data types they exercise.</p>
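<p>A pleasant side effect of the unique selector names is that a single format's
tests can be re-run in isolation from the command line. Something like the
following sketch (the scheme and test names here are hypothetical):</p>
<div class="codehilite"><pre><span></span><code>xcodebuild test -scheme Cartographica \
    -only-testing:CartographicaTests/ImportExportFileTests/testVectorImport_external_mytype
</code></pre></div>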
<h2>In-depth Code Review</h2>
<p>The particular code that I'm using works with hundreds of different file formats, and most of them have
sample data files and information that allows me to verify that they're all loading correctly. To index
all of this information, I use a strongly-typed XML Schema for these descriptions. The strong typing
facilitates rigorous validation of the files. Since all of this information (including driver names)
is in this XML file, the best way to truly test this data is by running automated tests that interpret the
files.</p>
<p>The actual code implementation is relatively simple:</p>
<ul>
<li><code>+initialize</code> method to parse the file and set up the tests</li>
<li><code>+CTUTaddInstanceMethodWithSelectorName</code> used to register a block for the method</li>
</ul>
<p>Our test registration code calls the add instance method with the block that executes the actual tests.
The parameter passed to the block is an instance of <code>ImportExportFileTests</code>, which is just a subclass of <code>XCTestCase</code>,
since that's what XCTest expects to call <code>test*</code> methods on.</p>
<p>In this particular example code, we're passing in a <code>NSString</code> containing an XML fragment which contains
information on the individual test files. The <code>-runFileTestWithFileNode</code> method referred to inside of
the block is a basic test routine which I was able to reuse without modification. In this case, I chose
to run all the tests for a file format in the same test method.</p>
<p>Looking at the code below, <code>checkTestFileList</code> verifies that at least one test file exists for this file
format and driver (and the level of test that we're doing... some tests take a long time and are only
executed when <code>isExhaustive</code> is set).</p>
<p><code>fileFormatName</code> is the human-readable name of the format, <code>fileFormatType</code> is the UTI for the file type,
some of which are vendor-specific, others are defined in Cartographica. For the UTIs we define, we create
associations for <code>com.ClueTrust.Cartographica.external.&lt;type_name&gt;</code>, where <code>&lt;type_name&gt;</code> is a unique
identifier for each type. Since that's a large, mostly useless, prefix and we're guaranteed that it's
unique (since we check that elsewhere when validating the file format file), we shorten it to <code>external</code>.
Finally, <code>.</code> is replaced with <code>_</code> to comply with method name requirements.</p>
<p>The block passed to <code>+CTUTaddInstanceMethodWithSelectorName:block:</code> executes the tests themselves, using
the standard <code>XCTest</code> assertions to flag problems.</p>
<div class="codehilite"><pre><span></span><code><span class="bp">NSString</span><span class="w"> </span><span class="o">*</span><span class="n">testMethodBaseName</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s">@"testVectorImport_"</span><span class="p">;</span>
<span class="bp">NSArray</span><span class="w"> </span><span class="o">*</span><span class="n">checkTestFileList</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">[</span><span class="n">fileFormat</span><span class="w"> </span><span class="n">nodesForXPath</span><span class="o">:</span><span class="w"> </span><span class="n">testFileSearchString</span><span class="w"> </span><span class="n">error</span><span class="o">:</span><span class="nb">nil</span><span class="p">];</span>
<span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">checkTestFileList</span><span class="p">.</span><span class="n">count</span><span class="o"><</span><span class="mi">1</span><span class="p">)</span>
<span class="w"> </span><span class="k">continue</span><span class="p">;</span>
<span class="bp">NSString</span><span class="w"> </span><span class="o">*</span><span class="n">fileFormatName</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">[[</span><span class="n">fileFormat</span><span class="w"> </span><span class="n">attributeForName</span><span class="o">:</span><span class="w"> </span><span class="s">@"name"</span><span class="p">]</span><span class="w"> </span><span class="n">stringValue</span><span class="p">];</span>
<span class="bp">NSString</span><span class="w"> </span><span class="o">*</span><span class="n">fileFormatType</span><span class="w"> </span><span class="o">=</span><span class="p">[[[[</span><span class="n">fileFormat</span><span class="w"> </span><span class="n">attributeForName</span><span class="o">:</span><span class="s">@"typeID"</span><span class="p">]</span><span class="w"> </span><span class="n">stringValue</span><span class="p">]</span>
<span class="w"> </span><span class="nl">stringByReplacingOccurrencesOfString</span><span class="p">:</span><span class="s">@"com.ClueTrust.Cartographica.external"</span><span class="w"> </span><span class="n">withString</span><span class="o">:</span><span class="w"> </span><span class="s">@"external"</span><span class="p">]</span>
<span class="w"> </span><span class="nl">stringByReplacingOccurrencesOfString</span><span class="p">:</span><span class="w"> </span><span class="s">@"."</span><span class="w"> </span><span class="n">withString</span><span class="o">:</span><span class="s">@"_"</span><span class="p">];</span>
<span class="bp">NSString</span><span class="w"> </span><span class="o">*</span><span class="n">testName</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">[</span><span class="n">testMethodBaseName</span><span class="w"> </span><span class="n">stringByAppendingString</span><span class="o">:</span><span class="w"> </span><span class="n">fileFormatType</span><span class="p">];</span>
<span class="n">NSXMLDocument</span><span class="w"> </span><span class="o">*</span><span class="n">doc</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">[</span><span class="n">NSXMLDocument</span><span class="w"> </span><span class="n">documentWithRootElement</span><span class="o">:</span><span class="w"> </span><span class="p">[</span><span class="n">fileFormat</span><span class="w"> </span><span class="k">copy</span><span class="p">]];</span>
<span class="bp">NSString</span><span class="w"> </span><span class="o">*</span><span class="n">xmlString</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">[</span><span class="n">doc</span><span class="w"> </span><span class="n">XMLString</span><span class="p">];</span>
<span class="p">[</span><span class="nb">self</span><span class="w"> </span><span class="n">CTUTaddInstanceMethodWithSelectorName</span><span class="o">:</span><span class="w"> </span><span class="n">testName</span><span class="w"> </span><span class="n">block</span><span class="o">:^</span><span class="p">(</span><span class="n">ImportExportFileTests</span><span class="w"> </span><span class="o">*</span><span class="n">test</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="n">NSXMLDocument</span><span class="w"> </span><span class="o">*</span><span class="n">xml</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">[[</span><span class="n">NSXMLDocument</span><span class="w"> </span><span class="n">alloc</span><span class="p">]</span><span class="w"> </span><span class="n">initWithXMLString</span><span class="o">:</span><span class="w"> </span><span class="n">xmlString</span><span class="w"> </span><span class="n">options</span><span class="o">:</span><span class="w"> </span><span class="mi">0</span><span class="w"> </span><span class="n">error</span><span class="o">:</span><span class="nb">nil</span><span class="p">];</span>
<span class="w"> </span><span class="n">NSAssert</span><span class="p">(</span><span class="w"> </span><span class="n">xml</span><span class="p">,</span><span class="w"> </span><span class="s">@"Need XML"</span><span class="p">);</span>
<span class="w"> </span><span class="bp">NSArray</span><span class="w"> </span><span class="o">*</span><span class="n">testFileList</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">[</span><span class="n">xml</span><span class="p">.</span><span class="n">rootElement</span><span class="w"> </span><span class="n">nodesForXPath</span><span class="o">:</span><span class="w"> </span><span class="n">testFileSearchString</span><span class="w"> </span><span class="n">error</span><span class="o">:</span><span class="nb">nil</span><span class="p">];</span>
<span class="w"> </span><span class="n">NSAssert</span><span class="p">(</span><span class="w"> </span><span class="n">testFileList</span><span class="p">,</span><span class="w"> </span><span class="s">@"no test files for %@"</span><span class="p">,</span><span class="w"> </span><span class="n">fileFormatName</span><span class="p">);</span>
<span class="w"> </span><span class="n">NSAssert</span><span class="p">(</span><span class="w"> </span><span class="n">testFileList</span><span class="p">.</span><span class="n">count</span><span class="o">></span><span class="mi">0</span><span class="p">,</span><span class="w"> </span><span class="s">@"Need >1 test"</span><span class="p">);</span>
<span class="w"> </span><span class="n">test</span><span class="p">.</span><span class="n">readRasterPixels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">isExhaustive</span><span class="p">;</span>
<span class="w"> </span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">NSXMLElement</span><span class="w"> </span><span class="o">*</span><span class="n">testFile</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">testFileList</span><span class="p">)</span><span class="w"> </span><span class="p">{</span>
<span class="w"> </span><span class="p">[</span><span class="n">test</span><span class="w"> </span><span class="n">runFileTestWithFileNode</span><span class="o">:</span><span class="w"> </span><span class="n">testFile</span><span class="w"> </span><span class="n">formatName</span><span class="o">:</span><span class="w"> </span><span class="n">fileFormatName</span><span class="w"> </span><span class="n">asRaster</span><span class="o">:</span><span class="w"> </span><span class="n">isRaster</span><span class="p">];</span>
<span class="w"> </span><span class="p">}</span>
<span class="p">}];</span>
</code></pre></div>
<p>The actual code we use to add the selector is thanks mostly to a
<a href="http://stackoverflow.com/questions/6357663/casting-a-block-to-a-void-for-dynamic-class-method-resolution">Stack Overflow Posting</a>
which describes exactly how to do this.</p>
<div class="codehilite"><pre><span></span><code><span class="p">+</span> <span class="p">(</span><span class="kt">BOOL</span><span class="p">)</span><span class="nf">CTUTaddInstanceMethodWithSelectorName:</span><span class="p">(</span><span class="bp">NSString</span><span class="w"> </span><span class="o">*</span><span class="p">)</span><span class="nv">selectorName</span><span class="w"> </span><span class="nf">block:</span><span class="p">(</span><span class="kt">void</span><span class="p">(</span><span class="o">^</span><span class="p">)(</span><span class="kt">id</span><span class="p">))</span><span class="nv">block</span>
<span class="p">{</span>
<span class="w"> </span><span class="c1">// don't accept nil name</span>
<span class="w"> </span><span class="n">NSParameterAssert</span><span class="p">(</span><span class="n">selectorName</span><span class="p">);</span>
<span class="w"> </span><span class="c1">// don't accept NULL block</span>
<span class="w"> </span><span class="n">NSParameterAssert</span><span class="p">(</span><span class="n">block</span><span class="p">);</span>
<span class="w"> </span><span class="c1">// See http://stackoverflow.com/questions/6357663/casting-a-block-to-a-void-for-dynamic-class-method-resolution</span>
<span class="w"> </span><span class="kt">id</span><span class="w"> </span><span class="n">impBlockForIMP</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="p">(</span><span class="k">__bridge</span><span class="w"> </span><span class="kt">id</span><span class="p">)(</span><span class="k">__bridge</span><span class="w"> </span><span class="kt">void</span><span class="w"> </span><span class="o">*</span><span class="p">)(</span><span class="n">block</span><span class="p">);</span>
<span class="w"> </span><span class="kt">IMP</span><span class="w"> </span><span class="n">myIMP</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">imp_implementationWithBlock</span><span class="p">(</span><span class="n">impBlockForIMP</span><span class="p">);</span>
<span class="w"> </span><span class="kt">SEL</span><span class="w"> </span><span class="n">selector</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">NSSelectorFromString</span><span class="p">(</span><span class="n">selectorName</span><span class="p">);</span>
<span class="w"> </span><span class="k">return</span><span class="w"> </span><span class="n">class_addMethod</span><span class="p">(</span><span class="nb">self</span><span class="p">,</span><span class="w"> </span><span class="n">selector</span><span class="p">,</span><span class="w"> </span><span class="n">myIMP</span><span class="p">,</span><span class="w"> </span><span class="s">"v@:"</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div>
Fastlane + Jenkins Pipelines (Gaige gets his Java on)2019-01-02T22:11:00-05:002019-01-02T22:11:00-05:00Gaige B. Paulsentag:www.gaige.net,2019-01-02:/fastlane-jenkins-pipelines-gaige-gets-his-java-on.html<h2>Jenkins</h2>
<p>For years, I've been using <a href="https://jenkins-ci.org">Jenkins</a> as a CI environment
at ClueTrust. For those unfamiliar with Jenkins, it's a long-running open-source
project built in Java for doing Continuous Integration. It'll work on just about
any platform that can run Java (although it's most at home on Unix machines) and …</p><h2>Jenkins</h2>
<p>For years, I've been using <a href="https://jenkins-ci.org">Jenkins</a> as a CI environment
at ClueTrust. For those unfamiliar with Jenkins, it's a long-running open-source
project built in Java for doing Continuous Integration. It'll work on just about
any platform that can run Java (although it's most at home on Unix machines) and
it can be used with nearly any development toolset. It's been a helpful tool,
but not without its difficulties.</p>
<p>Most of my personal development for the past few years has been for macOS and
iOS and thus requires Xcode, which means that my Jenkins slaves need to run on
macOS machines. In my case, I run both the slaves and the master on Mac Minis
(including a brand-spanking new one that really screams). Over the years, running
CI on a Mac has been difficult, and required quite a bit of manual tweaking of
the environment to keep up with changes in Apple's toolchain.</p>
<h2>Enter Fastlane</h2>
<p>Over the past few years, development has progressed on an iOS-mostly (macOS-kinda)
development tool called <a href="https://fastlane.tools">Fastlane</a>. The original developer,
<a href="https://twitter.com/KrauseFx">Felix Krause</a> has been working on Fastlane for years
and has moved from independent to Twitter and now Google.</p>
<p>In a nutshell, it shepherds most of the build/test/deploy cycle for iOS and macOS
software, providing pretty output and a fast cycle.</p>
<p>When I went through our last feature release for CartoMobile in the fall of 2018, I
ran into some annoying changes in the build process that caused some difficulties
with building and testing in my desktop and laptop environments,
and especially automating the screenshot process.</p>
<p>I looked to Fastlane as a solution to this problem, and was quite
pleased with the results. In fact, they were so good, I moved my other active
projects, including Cartographica, to Fastlane for build/test/deploy. Despite
comments about Fastlane being mostly for mobile development, I found it quite
easy to work with for macOS development as well.</p>
<p>Of course, if it's worth doing in the manual build process, then it's worth
automating, so I set out to follow the Fastlane instructions to get it working
with Jenkins. It's clear that somebody had the same idea, as there's explicit
support for running Fastlane under Jenkins. Although it required using the
KPP Management Plugin, which I hadn't previously used, it worked quite well.</p>
<h2>Moving to Pipelines</h2>
<p>Fast forward to this winter, and I've been doing some more complex build and
test cycles, and reading up on the more recent techniques for using Jenkins and
decided it was time to try Pipelines. The value proposition was that you could
create a more intelligent pipeline of actions. Whereas Jenkins has long had
the ability to chain builds together, or to use build steps, the proliferation of
individual "projects" necessary to handle a complex build and test cycle could
get pretty unwieldy. For example, Cartographica had a 4-project setup, not counting
the individual libraries that are built separately.</p>
<p>Applying pipelines to Cartographica wasn't a difficult process, thanks to the
<a href="https://wiki.jenkins.io/display/JENKINS/Convert+To+Pipeline+Plugin">Convert to Pipeline Plugin</a>
provided by Infostretch. The automated conversion worked well, except that it didn't
know how to convert the KPP Management Plugin. After a cursory stop in the Pipeline
Syntax page, it became clear that the KPP Management Plugin was old enough (6 years
since the last update at that point) that it didn't use the required code base to
get automatic support for Pipelines. Sad trombone... But, hey, I'm a programmer,
I've got this.</p>
<h2>Enter the Javas</h2>
<p>I'd done some recent Java work at Haste, but it'd been quite a while since my
last serious effort to pick up someone else's code and run with it in that
language, especially a plugin to yet another, much larger codebase. It turns out
that the Jenkins folks (and the folks who made the original plugin) did a good
job of making things pretty sane. Once I'd reoriented my brain to Java, it was
pretty straightforward to modify the code to support pipelines, and the result
is that after a few hours, I now have a completely functional Jenkins Pipeline
for Cartographica using Fastlane and my modified KPP Management plugin.</p>
<p>My modified source code can be found at <a href="https://github.com/gaige/kpp-management-plugin">gaige/kpp-management-plugin</a>
on <a href="https://github.com">GitHub</a>.</p>
<p>Helpful links:</p>
<ul>
<li><a href="https://jenkins.io/blog/2016/05/25/update-plugin-for-pipeline/">Refactoring a Jenkins plugin for compatibility with Pipeline jobs </a></li>
<li><a href="https://wiki.jenkins.io/display/JENKINS/Plugin+tutorial">Jenkins Plugin Tutorial</a></li>
</ul>
Follow-up on static pages2018-12-16T10:57:00-05:002018-12-16T10:57:00-05:00Gaige B. Paulsentag:www.gaige.net,2018-12-16:/follow-up-on-static-pages.html<p>At the beginning of the month, I <a href="https://www.gaige.net/gaiges-pages-moves-to-static-generation.html">wrote</a>
about the move to convert Gaige's Pages to a static generation model.
Today I'm following up with some performance graphs. There's
absolutely nothing surprising here, but it's good to see nonetheless that things work
as they should.</p>
<a href="https://www.gaige.net/images/large/2018-12-16-Performance-gaigespages.png">
<img src="https://www.gaige.net/images/large/2018-12-16-Performance-gaigespages.png" width=619 height=212>
</a>
<p>Look to the right of …</p><p>At the beginning of the month, I <a href="https://www.gaige.net/gaiges-pages-moves-to-static-generation.html">wrote</a>
about the move to convert Gaige's Pages to a static generation model.
Today I'm following up with some performance graphs. There's
absolutely nothing surprising here, but it's good to see nonetheless that things work
as they should.</p>
<a href="https://www.gaige.net/images/large/2018-12-16-Performance-gaigespages.png">
<img src="https://www.gaige.net/images/large/2018-12-16-Performance-gaigespages.png" width=619 height=212>
</a>
<p>Look to the right of the red line in the middle to see the new site and the left for the
old dynamic site.</p>
<p>As any good programmer does, I am now trading my time to create the posts (x1 for just me)
to reduce the time for readers (x2 at least, one might hope). Even if no human reads
it, I'm saving the indexing bots time.</p>
<p>Average total page load time (DNS+Connection+SSL+First Byte+Download) went from 844ms on
the old site (minimum of 208ms, maximum of 60,169ms) to 125ms (minimum of 18ms, maximum of 805ms),
so as you can see, our worst-case scenario in the last week was better than the average
in preceding weeks.</p>
<p>As a note for those who are interested, the only significant variation in the previous
dynamic site was the time to first byte. This is the time between the hand-off
to the CMS (after SSL negotiation) and the receipt of the first byte by the browser. This time
has gone from a minimum of 205ms on the old site to a maximum of 4ms on the new one.</p>
Gaige's Pages moves to static generation2018-11-30T16:30:00-05:002018-11-30T16:30:00-05:00Gaige B. Paulsentag:www.gaige.net,2018-11-30:/gaiges-pages-moves-to-static-generation.html<p>Gaige's Pages has been through a lot of changes over the last 15 years, since I
did the first major revamp of the site. At that time, I was converting from
a statically generated site that I was manually creating (with a little help
from DreamWeaver) to Geeklog, a venerable …</p><p>Gaige's Pages has been through a lot of changes over the last 15 years, since I
did the first major revamp of the site. At that time, I was converting from
a statically generated site that I was manually creating (with a little help
from DreamWeaver) to Geeklog, a venerable, geeky CMS with many more features
than I needed.</p>
<h1>A little history</h1>
<p>At the time (January 2003), I was expecting to create more content; and, boy did
I! During that year, my most prolific blogging year, I managed to crank out
over 1400 individual blog posts, ranging from a 3059-word missive about the
future of media to about 800 pieces less than 100 words in length that mostly
point at other people's thoughts (with some commentary). I wrote about 50 pieces
that would be considered essay-length, which isn't bad, except when you consider
that I really didn't have a full time job that year.</p>
<p>My interest in blogging waned as I got back into working with other people, and
as Carol & I got married in 2004.</p>
<table>
<thead>
<tr>
<th>Year</th>
<th style="text-align:right">Posts</th>
<th style="text-align:right">Word Count</th>
</tr>
</thead>
<tbody>
<tr>
<td>2003</td>
<td style="text-align:right">1493</td>
<td style="text-align:right">214283</td>
</tr>
<tr>
<td>2004</td>
<td style="text-align:right">304</td>
<td style="text-align:right">48612</td>
</tr>
<tr>
<td>2005</td>
<td style="text-align:right">191</td>
<td style="text-align:right">35872</td>
</tr>
<tr>
<td>2006</td>
<td style="text-align:right">149</td>
<td style="text-align:right">30039</td>
</tr>
<tr>
<td>2007</td>
<td style="text-align:right">52</td>
<td style="text-align:right">12761</td>
</tr>
<tr>
<td>2008</td>
<td style="text-align:right">41</td>
<td style="text-align:right">15988</td>
</tr>
<tr>
<td>2009</td>
<td style="text-align:right">15</td>
<td style="text-align:right">7539</td>
</tr>
<tr>
<td>2010</td>
<td style="text-align:right">4</td>
<td style="text-align:right">1418</td>
</tr>
<tr>
<td>2011</td>
<td style="text-align:right">1</td>
<td style="text-align:right">582</td>
</tr>
<tr>
<td>2012</td>
<td style="text-align:right">0</td>
<td style="text-align:right">0</td>
</tr>
<tr>
<td>2013</td>
<td style="text-align:right">5</td>
<td style="text-align:right">1632</td>
</tr>
<tr>
<td>2014</td>
<td style="text-align:right">2</td>
<td style="text-align:right">178</td>
</tr>
<tr>
<td>2015</td>
<td style="text-align:right">7</td>
<td style="text-align:right">2202</td>
</tr>
<tr>
<td>2016</td>
<td style="text-align:right">0</td>
<td style="text-align:right">0</td>
</tr>
<tr>
<td>2017</td>
<td style="text-align:right">1</td>
<td style="text-align:right">574</td>
</tr>
<tr>
<td>2018</td>
<td style="text-align:right">2</td>
<td style="text-align:right">959</td>
</tr>
</tbody>
</table>
<h1>Blog Engines</h1>
<p>Over the intervening 15 years, I've changed blog engines infrequently. Frankly,
it's an annoying process with few significant upsides when you aren't writing
frequently.</p>
<h2>Geeklog (2003)</h2>
<p>I moved to GeekLog in 2003 when I was thinking I wanted to write more and was
tired of dealing with hand-coding HTML (even with the help of DreamWeaver).
This was a reasonable choice at the time, and got me a hybrid HTML editor
and a content indexing system (along with a ton of stuff I didn't need, like
a calendar system).</p>
<p>In 2006, when I acquired Cartographica.com in advance of releasing Cartographica
(in 2008 for beta, and 2009 for public release), I used GeekLog as the basis for
the Cartographica blog as well. This provided me with a converged platform and
the ability to leverage my knowledge.</p>
<h2>SquareSpace (2011+ Cartographica)</h2>
<p>In 2011, as Cartographica was gaining steam, we hired a bright youngster to do
some significant blogging, however GeekLog was showing its age and wasn't really
adequate for blogging with images unless you were a pro with HTML, so I moved
Cartographica to the hosted service <a href="https://www.squarespace.com">SquareSpace</a>,
which we still use.</p>
<h2>Drupal (2013?)</h2>
<p>After moving Cartographica's blog to SquareSpace, I realized that I didn't
particularly enjoy using GeekLog any longer, and support for it was waning.
Wordpress was the new hotness, but it was way too maintenance heavy (see any
security blog), so I wanted something with the right level of geek, good
maintenance and good code hygiene. After consulting with some friends, I
decided on Drupal. Drupal's been good, but I've just not been blogging that
frequently, and the effort necessary to maintain and defend a full-fledged
CMS doesn't seem worthwhile for a single-writer blog that is infrequently
updated. In truth, for the last few years I've spent more time updating my
blog software than writing for my blog.</p>
<h2>Static blog engines</h2>
<p>A few months ago, I started looking at moving the Pages to a static blog engine.
My thinking is that I'm not blogging that much these days, and I'd rather use
my scant blog-related time to actually write than to do administrative tasks on
the server; especially those involving security patches. In addition, I prefer to
write my posts in Markdown. Most of the static blogging engines work with Markdown
and that just seemed the right direction for me to take.</p>
<h3><del>Jekyll</del></h3>
<p>I really tried to use Jekyll. <a href="https://technotes.seastrom.com">RS</a>, my oft-times
partner in technology, considers it "good enough to offset being one of two
unfortunate cases of ruby in my life, the other being brew.sh".
It seems to be well liked by most people who
use it, so I set out to translate my 2000+ blog posts into Markdown for
assimilation into Jekyll.</p>
<p>Unfortunately, I ran into two problems: first, it's good for small blogs, but
regeneration times on large ones are really long; second, it's written in Ruby.
I know what you're thinking: I shouldn't be bashing an entire programming
language which is likely used by millions of people. Well, millions of people
smoke, too, and although both things are completely legal, I wouldn't consider
either safe for your health.</p>
<p>It's not so much the ruby syntax that bothers me, it's the ethos. Jekyll is great
for people who like it and have no problems with it, but when my site generation was
taking literally hours, or never finishing, I decided that I'd take a look at the
code to make sure there wasn't some weird bug I was triggering.</p>
<p>We'll never know if that was the case, because after finding no reasonable way
to even figure out what module was consuming time, I just gave up.</p>
<h3>Pelican</h3>
<p>I took to Google to find something that fit the bill. If I was going to have to
potentially debug this puppy, I wanted it to be written in a language that I am
comfortable with and one whose ethos involves making code that can be read and
debugged by someone other than the original author.</p>
<p>Into Google, I typed <code>static blog engine python markdown jinja2</code>. I wasn't sure
I'd find something, but I wanted <code>python</code> for maintainability, <code>static blog engine</code>
so that I didn't have to work if I wasn't changing anything, <code>markdown</code> for a
fast and familiar writing environment (Hello, <a href="https://www.barebones.com">BBEdit!</a>)
and finally <code>jinja2</code> for any templating.
The heavier-duty <a href="https://www.macgis.com">MacGIS</a> web site had already been moved
to Django, which uses <code>Python</code> and (optionally) <code>jinja2</code> as a templating language.
Similarly, I'd moved from <a href="https://puppet.org">Puppet</a> to <a href="https://ansible.org">Ansible</a>
a couple of years ago (another story, many thanks to Rob), which is also based on
<code>Python</code> and <code>jinja2</code>.</p>
<p>The first recommendation was <a href="https://github.com/getpelican/pelican"><code>Pelican</code></a>,
which has been my answer for this stage of Gaige's Pages.</p>
<p>The transition has not been without hiccups, and I even ran into a bug which I
needed to diagnose; but, diagnose it I did, because I could use all of my tricks
for debugging <code>Python</code>, including <a href="https://jetbrains.com/pycharm">PyCharm</a>.</p>
Looking for a Nav system (Revisited: 2018)2018-11-30T15:00:00-05:002018-11-30T15:00:00-05:00Gaige B. Paulsentag:www.gaige.net,2018-11-30:/looking-for-a-nav-system-revisited-2018.html<p>If you want a good feel for the advancements in Navigation systems in the last
10 years, you should check out my piece
<a href="https://www.gaige.net/looking-for-a-nav-system.html">Looking for a Nav system</a>
from 2008. The article went through my key issues that lead to my recommendation
of the TomTom's in those days.</p>
<p><strong>TL;DR …</strong></p><p>If you want a good feel for the advancements in Navigation systems in the last
10 years, you should check out my piece
<a href="https://www.gaige.net/looking-for-a-nav-system.html">Looking for a Nav system</a>
from 2008. The article went through my key issues that lead to my recommendation
of the TomTom's in those days.</p>
<p><strong>TL;DR:</strong> Today, I just think you should use whatever is best integrated with your phone
or what gives you the best traffic or routes in your most frequent driving
areas.</p>
<p>Running through the items from the article:</p>
<ol>
<li>
<p><strong>The Nav system should (mostly) trust you</strong>
In the original article, I stated "believe" you, but I think trust is the real
issue. If I drive in a different direction, the unit should figure there may
have been a good reason for it and give some priority (at least a tie breaker)
to the human driver.</p>
</li>
<li>
<p><strong>MapShare updating technology</strong>
I thought this was going to be a helpful tech at the time, and it is, but it's
now supplanted by much better connected algorithms for navigation. Not only
does your Nav system believe you, but if you drive over a road that the
device doesn't know about often enough, there's a good chance it'll show up
in a future update.</p>
</li>
<li>
<p><strong>Better pricing on map updates</strong>
Not much better pricing than free, or at least included with your device. Take
your pick of how to pay: subscription for "pro" software, free for Apple's
defaults, or with your data in the case of everyone else. But, you generally
won't be paying a fee unless you are using pro software.</p>
</li>
<li>
<p><strong>Better Management Software</strong>
No need for this any longer. Once your mapping system locates you, as long
as you're online, you have the maps you need. There are some systems designed
to allow you to choose the areas to cache, but generally that's not necessary
except for specialized systems for activities like hiking.</p>
</li>
<li>
<p><strong>IQ Routing</strong>
Thanks to the connected nature, this is a non-issue. Everyone has segment-based
timing these days, automatically reported back, usually while you drive.</p>
</li>
<li>
<p><strong>Advanced Lane Guidance</strong>
Table stakes again.</p>
</li>
</ol>
<p>There are still some difficulties:</p>
<ul>
<li><strong>mounting brackets are still a problem</strong>
Unless you have a navigation system that coordinates with your car
(Apple's CarPlay and Google's equivalent, Android Auto), you're probably
struggling with some difficult-to-handle device to hold your phone.</li>
<li><strong>Privacy</strong>
Now that everyone's driving is being monitored by their navigation system,
some amount of private information that was previously unavailable is now
available to bad actors. You may trust your phone provider and OS manufacturer
(although that may be a bad idea in some cases), but you have to watch any
app that has access to your location outside of when the app is running.</li>
<li><strong>Cell coverage == navigability</strong>
It's getting better, but in many cases, if you lose your cell coverage, your
nav system not only loses its ability to get real-time traffic data, but you
may be stranded without a map.</li>
</ul>
<p>Tune in in 2028, and maybe I'll update this topic again!</p>
Codesigning ate my Sunday2018-11-11T16:49:00-05:002018-11-11T16:49:00-05:00Gaige B. Paulsentag:www.gaige.net,2018-11-11:/codesigning-ate-my-sunday.html<p>I have a version of Cartographica that I need to push out before the end of the
year, due to a certificate expiration on one of my long-term servers. As a
bulwark against problems occurring just at the turn of the year and to make sure
that users can use …</p><p>I have a version of Cartographica that I need to push out before the end of the
year, due to a certificate expiration on one of my long-term servers. As a
bulwark against problems occurring just at the turn of the year and to make sure
that users can use the 1.4.x series of Cartographica, I set out to sign and
release 1.4.9, a version of 1.4.8 containing only this signature fix.</p>
<p>And then, codesigning ate my Sunday.</p>
<p>It seemed simple enough, I ran my release build script, it created a disk image
(dmg) file and pushed it up to our web server for release. Then I added the
release notes into Feeder.app (which I use to push the RSS feeds) and ran one
last test before I pushed it out to the users:</p>
<ul>
<li>File downloaded fine</li>
<li>Disk Image opened fine</li>
<li>Cartographica application copied fine</li>
</ul>
<p>And then Cartographica crashed upon start, giving the relatively cryptic:</p>
<pre> Reason: no suitable image found. Did find:
/Applications/Cartographica.app/.../XXX.dylib: code signing blocked mmap() of /Applications/Cartographica.app/.../XXX.dylib
</pre>
<p>Well, that was unexpected. What ensued was a wide variety of searches on the
internet mostly leading to people's expired keys and certificates or signing and
testing the wrong items, but eventually I noticed that the library in question
(supplied by a third party) was a fat library (i386 and x86_64). Could that be
related?</p>
<p>It appears that version 2 code signatures contain the following: "Format=bundle
with Mach-O thin (x86_64)". That's the same as our last shipping version in
August 2017, but the 2017 variant works just fine. Maybe something related to
Mojave (on which I'm running both the signing and the execution of the binary)?</p>
<p>I lipo'd the library to remove the i386 code and rebuilt. Then I re-signed the
binary. Seems fine on pre-10.14 machines, but won't work on 10.14. Even
if I sign on 10.13, it works fine on 10.13, but not on 10.14.</p>
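<p>For anyone walking the same path, the inspection and thinning steps look
roughly like this (the library name and paths are hypothetical):</p>
<div class="codehilite"><pre><span></span><code># show the signature details, including the "Format=" line discussed above
codesign -dv --verbose=4 /Applications/Cartographica.app
# list the architectures in the fat library, then strip the i386 slice
lipo -info XXX.dylib
lipo XXX.dylib -remove i386 -output XXX-thin.dylib
</code></pre></div>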
<p>I spent time looking through any number of internet postings about similar
problems (mostly iOS, which wasn't necessarily applicable) and they were mostly
related to the actual certificates being used to sign. You may want to give
those a try if you got here looking for a solution to this problem on iOS...</p>
<p>After about 5 hours I finally decided to throw in the towel, save for one last
effort. I rebuilt with code signing on my MacBook Pro (still running 10.13,
because I hadn't had time to upgrade it yet) and Xcode 7 (the last version that
can build the 1.4 branch without substantial updates). Under normal
circumstances, I don't build the binaries with code signing on (at least not in
the older versions) because of inconsistencies in the way that code signing
worked. Further, it was just as easy to re-sign everything as part of my
scripted steps to get the DMG file built and uploaded.</p>
<p>This time, I changed the code signing parameters to sign with my Developer ID
certificate and rebuilt for archive, taking that archive and manually deploying
the disk image (thanks to <a href="https://c-command.com/dropdmg/">DropDMG</a>, this was a
simple command line operation). The build was successful and ran on 10.13
without a hitch.</p>
<p>Testing on 10.14 also ran without a hitch, so it looks like I've finally solved
the last 1.4 issue.</p>
TPS Reports? (Testing PostgreSQL under SmartOS)2018-10-22T06:47:00-04:002018-10-22T06:47:00-04:00Gaige B. Paulsentag:www.gaige.net,2018-10-22:/tps-reports-testing-postgresql-under-smartos.html<p>Rob and I are working on updating our standard environment in our data
centers. As may be clear already, we're big proponents of SmartOS,
which has been working really well for our needs. We're also big
proponents of automation (and, in particular, Ansible).</p>
<p>Due to a hardware failure this weekend …</p><p>Rob and I are working on updating our standard environment in our data
centers. As may be clear already, we're big proponents of SmartOS,
which has been working really well for our needs. We're also big
proponents of automation (and, in particular, Ansible).</p>
<p>Due to a hardware failure this weekend, we had to promote our test
environment to a production machine while waiting for a power supply to
be sourced (yeah, our next environments are going to be dual power
supply, now that we have dual feeds to the rack). This left me without
a test box, which turned out to be more of a problem than I'd realized,
because some of my automated tests for my macOS and iOS development
require a functioning server.</p>
<p>Enter Toys-R-Us... well, of course, Toys-R-Us has exited, but in doing
so, they provided us the opportunity to pick up some recent-model Dell
kit at ridiculously low prices. As Rob had recently installed my two
machines (R330s) in the rack, I now have the opportunity to take them
for a real-world spin. As an added bonus, we've got 10K 300GB SAS
drives in one (running mirrored) and 500GB SATA SSDs (Samsung 850 EVO)
running in the other (also mirrored). Compare and contrast this to our
existing HP DL160g6 servers which have RAIDZ2 4-drive x 4TB 5400RPM WD
Red drives, and it looks like we've got an opportunity for a bit of a
bake off.</p>
<p>My first test is to take a look at the PostgreSQL performance on these
systems. An apples-to-apples comparison will be a bit harder, but we've got some
interesting matchups here anyway, so I've rolled out <code>pgbench</code> for
the testing and have chosen to simulate my workload using
<code>TPC-B (sortof)</code>, which is the tool's default, with 100 clients on 10 threads and
100 transactions per client (<code>pgbench -c 100 -j 10 -t 100</code>).</p>
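<p>For anyone reproducing this, the full sequence is roughly the following
sketch; the <code>-i</code> run initializes and populates the standard pgbench
tables (the database name and scale factor here are hypothetical):</p>
<div class="codehilite"><pre><span></span><code># create and populate the pgbench tables
pgbench -i -s 100 benchdb
# 100 clients on 10 threads, 100 transactions per client
pgbench -c 100 -j 10 -t 100 benchdb
</code></pre></div>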
<table>
<thead>
<tr>
<th>System</th>
<th style="text-align:right">Latency Average</th>
<th style="text-align:right">TPS (w/estab)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Dell R330 w/SSD</td>
<td style="text-align:right">135.242</td>
<td style="text-align:right">739.415489</td>
</tr>
<tr>
<td>Dell R330 w/Quickly Spinning Rust</td>
<td style="text-align:right">456.189</td>
<td style="text-align:right">219.207000</td>
</tr>
<tr>
<td>HP DL160g6 w/Spinning Rust</td>
<td style="text-align:right">996.330</td>
<td style="text-align:right">100.368323</td>
</tr>
</tbody>
</table>
<p>So, there you have it, the not-so-surprising news that mirrored SSDs are
faster than mirrored 300GB 10K SAS drives when running certain
PostgreSQL benchmarks on a system doing nothing else.</p>
Jumping a dead 2000 Boxster S2017-10-21T12:49:00-04:002017-10-21T12:49:00-04:00Gaige B. Paulsentag:www.gaige.net,2017-10-21:/jumping-a-dead-2000-boxster-s.html<p>I'm generally extremely happy with my 17-year old 2000 Boxster S that I bought
new. However, running back and forth between Atlanta means that I've not been
driving it as much recently (although quite a bit more than I did in
the 2008-2010 period). Last year, I …</p><p>I'm generally extremely happy with my 17-year old 2000 Boxster S that I bought
new. However, running back and forth between Atlanta means that I've not been
driving it as much recently (although quite a bit more than I did in
the 2008-2010 period). Last year, I replaced the battery with the same basic
unit with a 48-month warranty, figuring that 7 years was a long time for a
battery to last (yes, the previous one was replaced in 2010).</p>
<p>After returning from one of my trips back home, I tried to start the car,
resulting in a very disappointing lack of lights when I opened the vehicle. I
figured it was probably time to appropriate one of those nifty battery-powered
jump-starting units (which I did, more on that later).</p>
<p>However, when I approached the car, I realized that getting the hood open (or
front trunk, or bonnet, depending on your opinion on such things) was going to be a problem. The 2000
model has manual trunk releases, but the central locking system protects those
releases through the use of a solenoid-actuated locking bar. Without power, that
bar doesn't move.</p>
<p>After scouring the internet, I found a lot of information about getting past
this problem on different models, but each of them is a bit different. Models a
couple of years younger (I believe 2003+) have a terminal in the fuse box for
externally energizing the electrical system to get past this. However, my model requires a
slightly hackier operation.</p>
<p>Required items for this operation:</p>
<ul>
<li>12V supply sufficient to start the car (I used a battery jump unit, but another car will also do)</li>
<li>6"+ piece of wire stripped at both ends</li>
<li>Jumper cables</li>
<li>(Your keys)</li>
</ul>
<h2>Operation</h2>
<ol>
<li>Open the door using the key to access the interior of the vehicle</li>
<li>Remove the fuse box cover (located in the driver's side footwell)</li>
<li>Using the fuse-removal tool, remove the C3 fuse (you should check the
enclosed fuse map and look for "Central Locking", but it's C3 on my vehicle and
elsewhere on the internet)</li>
<li>Insert a piece of wire in the C3 socket and replace the fuse</li>
<li>Connect the negative end of your 12V supply to the body of the car (I used
the door hinge)</li>
<li>Connect the positive end of your 12V supply to the other end of the wire
connected to the C3 fuse socket.
At this point, you should see interior lights energize and your fan may come
on. If you don't see signs of life, there's something wrong with the wire
placement or the fuse (or a deeper problem with your vehicle).</li>
<li>Rotate the door latch manually (to convince the car that the driver's side
door is shut).
This is essential and cost me a couple of weeks (time, not effort) in trying
to figure this out.</li>
<li>Now use the key in the driver's side door to lock and unlock the doors.
You should hear the click as the solenoid pulls the locking bar back and you
should be able to freely operate the hood latch.</li>
<li><strong>Pull the door handle to unset the door latch - Failure to do this could
damage your door system</strong></li>
<li>Detach your cables from the car and remove the wire</li>
<li>Now, pop the hood and complete the jump-start operation</li>
</ol>
Sun SparcStation 20 vs Raspberry Pi2015-12-27T12:17:00-05:002015-12-27T12:17:00-05:00Gaige B. Paulsentag:www.gaige.net,2015-12-27:/sun-sparcstation-20-vs-raspberry-pi.html<p>For those of us old timers, here's an amusing <a href="http://eschatologist.net/blog/?p=266">shootout between a SparcStation
20 and a Raspberry Pi</a>, including both
the Pi and the Pi2 (but unfortunately not the Pi0).</p>
<p>For those unfamiliar with the SS20, it was a workhorse desktop in the 1990's.
Early pizza-box design and pricing out …</p><p>For those of us old timers, here's an amusing <a href="http://eschatologist.net/blog/?p=266">shootout between a SparcStation
20 and a Raspberry Pi</a>, including both
the Pi and the Pi2 (but unfortunately not the Pi0).</p>
<p>For those unfamiliar with the SS20, it was a workhorse desktop in the 1990's.
Early pizza-box design and pricing out at well over $10,000 at the time
(according to the article, closer to $25K today). It was an awesome desktop
workstation, and was seen in some equipment racks as a "cheap" alternative
server platform. For comparison, the Pi sells for $20, and the Pi2 for $35.</p>
<p>The short version is that a computer which costs roughly 1/1000 of the price
performs about 7x as fast across the tested operations (and for the record, there is no
category where the SparcStation wins). For those keeping score, that's a 7000x
increase in price/performance in 20 years.</p>
SmartOS, Postfix and IPv62015-12-18T06:46:00-05:002015-12-18T06:46:00-05:00Gaige B. Paulsentag:www.gaige.net,2015-12-18:/smartos-postfix-and-ipv6.html<p>As part of completing our shut-down of 2007-vintage Xserves at the hosting
center, we're moving a lot of servers to SmartOS (or at least SmartOS-hosted
VMs). We've been really happy with the system so far. Here's a quick story of
the power of this environment.</p>
<p>As part of the transition …</p><p>As part of completing our shut-down of 2007-vintage Xserves at the hosting
center, we're moving a lot of servers to SmartOS (or at least SmartOS-hosted
VMs). We've been really happy with the system so far. Here's a quick story of
the power of this environment.</p>
<p>As part of the transition, I decided to take care of a long-standing intention
to allow IPv6 mail delivery. The latest releases make IPv6 much easier to deal
with in SmartOS and so I figured I'd turn it on and see how it went. Early
tests looked good, but there was little IPv6 traffic initially. However, once
Google/Gmail started throwing data our way, it became clear there was a
problem. IPv6 reverse lookups were failing in Postfix and that was causing
mail not to be delivered. Thankfully, it was a 450 error, so the mail would
get queued on the server and be tried again later, but it wasn't a good place
to be, so we turned off IPv6 and went in search of the root cause.</p>
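<p>For the record, falling back to IPv4-only delivery is a one-liner. This is
the generic Postfix knob, not necessarily the exact change we made that night:</p>
<div class="codehilite"><pre><code># restrict Postfix to IPv4 while investigating
postconf -e 'inet_protocols = ipv4'
# changing inet_protocols requires a full stop/start, not just a reload
postfix stop && postfix start
</code></pre></div>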
<p>A quick look at the postfix source code made it clear that the OS (and not
postfix's internal DNS code) was being used to look this up. I decided to
play with it a little longer, grab source code so that I could do some
experiments, and then I went to bed.</p>
<p>In the morning, I did what you do when you have standard configurations and
SmartOS: I stood up a new server and ran some tests on it. The new server was
configured with the latest (15.3.0) build of SmartOS and the 2015Q3 packages,
resulting in an upgrade to the 3.0.2 version of Postfix. Once set up, I ran
some email through it to confirm whether the upgrade resolved the
problem, and it did.</p>
<p>The moral of the story (which I wish I'd realized a few hours earlier
yesterday): here in the future, it's easier to stand up another machine to
check whether a bug has been fixed than to read the release notes and source
code.</p>
<p>From a discovery/knowledge perspective, I'm not sure I like this new world.
From an efficiency perspective, disposable virtual hardware is a luxury and a
huge time saver.</p>
Replacing a RAID set under El Capitan2015-12-04T07:32:00-05:002015-12-04T07:32:00-05:00Gaige B. Paulsentag:www.gaige.net,2015-12-04:/replacing-a-raid-set-under-el-capitan.html<p>Over Thanksgiving, one of the two drives in my "Big Disk" RAID started
reporting SMART failures (it was a mirror of two 2TB drives that I used to store
large things that aren't worth having on the SSD on my Mac Pro). Generally
speaking, my response to failures with SMART (especially with cheap spinning …</p><p>Over Thanksgiving, one of the two drives in my "Big Disk" RAID started
reporting SMART failures (it was a mirror of two 2TB drives that I used to store
large things that aren't worth having on the SSD on my Mac Pro). Generally speaking, my response to
failures with SMART (especially with cheap spinning rust drives) is to replace
the drive immediately and if it's a set of drives in a RAID to consider
replacing both of them and bumping to the next most efficient capacity.</p>
<p>For this year's upgrade (it's actually been 2+ years since these
drives were installed), the most cost-efficient drives are 4TB. I don't think
I'm likely to fill them up any time soon, but pictures and video aren't
getting smaller, and I sometimes have to store some large GIS database on
these while I'm working with them, and the new drives are NAS rated and have a
3 year warranty, so I need to anticipate use for 3 years out.</p>
<p>With El Capitan, Apple's Disk Utility program has become... well, pretty
lackluster. However, from a RAID perspective, I've used the command line for
years because of the desire to have explicit control over what's being done.</p>
<p>The key objectives:</p>
<ul>
<li>Replace the 2TB mirrored RAID with a 4TB mirrored RAID</li>
<li>Retain the existing Time Machine history</li>
<li>Retain the existing CrashPlan history</li>
</ul>
<p>The process was quick and successful, and here are the steps that I took:</p>
<ol>
<li>
<p>Using an external SATA to Thunderbolt dock, I attached and formatted the new 4TB drive, naming it something other than its final name. Because of the way OS X handles name collisions, I didn't want any confusion in the devices, so I chose to only have one drive mounted with a particular name at a particular time.</p>
</li>
<li>
<p>Using <a href="https://bombich.com">Carbon Copy Cloner</a>, I cloned the old 2TB RAID's contents onto the new 4TB drive. Averaging >100MB/s, this still took a while (about 4 hours for the 1.4TB of active storage)</p>
</li>
<li>
<p>Once the clone was complete, I unmounted the 2TB RAID and pulled both of the constituent drives from the <a href="http://www.lacie.com/products/thunderbolt/5big-thunderbolt-2/">LaCie 5Big Thunderbolt</a> (link is to the TB2 version of this device) and set them aside.</p>
</li>
<li>
<p>Before setting the old drives down, I marked them with a Sharpie, taking care to mark the <strong>FAILED</strong> drive so that it could be destroyed, and the <strong>OK</strong> drive so that it could be retained if desired as a snapshot. I'll keep it for at least a few weeks, but then it'll either go to the safe deposit box, or the disk shredder. Drives are always marked with the RAID name and date they were taken out of service, along with any appropriate legends (Personal and Confidential, copyright, etc)</p>
</li>
<li>
<p>Unmount the 4TB drive from the external SATA to Thunderbolt dock and install it, and its future mirror, into the enclosure</p>
</li>
<li>
<p>Once the 4TB drive remounts from the enclosure, make sure that Time Machine isn't running (and either wait for it to stop, or cancel the run before proceeding)</p>
</li>
<li>
<p>Now, convert the 4TB drive to the new RAID by using Terminal and the command line:</p>
<div class="codehilite"><pre><code>diskutil appleRAID enable mirror /dev/disk3s2
</code></pre></div>
<p>You will need to replace <em>/dev/disk3s2</em> with the slice number corresponding to
the volume on the disk, not the disk device itself. This can be gathered by
using diskutil's list command if you are uncertain.</p>
</li>
<li>
<p>Rename the 4TB drive to the name of the original RAID (this takes care of the CrashPlan issue, since CrashPlan only knows the path to volumes)</p>
</li>
<li>
<p>Now that we have the RAID set prepared and the volume renamed, we need to make sure that Time Machine thinks it's the same drive for content purposes. To do this we'll use Terminal again:</p>
<div class="codehilite"><pre><code>sudo tmutil associatedisk -a "/Volumes/RAID" \
    /Volumes/TM/Backups.backupdb/Machine/Latest/RAID
</code></pre></div>
<p>Names have been changed for privacy, but the basics are clear: you need to
replace <em>/Volumes/RAID</em> with the UNIX path of your volume as mounted on
your computer, and the second path is the path to the volume name in the
<em>Latest</em> Time Machine backup. Do not associate with another backup or you'll
get an error.</p>
</li>
<li>
<p>To confirm this process, you can start a TM backup and see if you see a message from backupd in the <strong>Console</strong> app like this:</p>
<div class="codehilite"><pre><code>com.apple.backupd[1931]: Inheritance scan may be required for '/Volumes/RAID',
associated with previous UUID: XXXXX-XXXX-XXXX-XXXX-XXXXXXX
</code></pre></div>
<p>This indicates that Time Machine will try and sync up the volume histories.</p>
</li>
<li>
<p>Once the initial Time Machine backup is complete, start rebuilding the RAID (this is going to take a <em>long</em> time, which is why I do it after the initial TM backup), using <strong>Terminal</strong> again:</p>
<div class="codehilite"><pre><code>diskutil appleRAID add member /dev/disk4 disk7
</code></pre></div>
<p>replacing <em>/dev/disk4</em> with the device you want to add, and <em>disk7</em> with the
RAID's Device Node (you can again find these using diskutil's list command).
You can watch the rebuild's progress as shown in the sketch after this list.</p>
</li>
</ol>
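<p>To keep an eye on the rebuild, <code>diskutil</code> will report the mirror's
status and progress directly (device names here are from my setup; yours will
vary):</p>
<div class="codehilite"><pre><code># show all AppleRAID sets, their members, and rebuild progress
diskutil appleRAID list

# and the device/slice layout, if you need to double-check nodes
diskutil list
</code></pre></div>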
<p>A bit time consuming, but everything worked like a charm, and I have full
access to my Time Machine history for my new RAID set.</p>
The Age of Deception2015-10-22T06:20:00-04:002015-10-22T06:20:00-04:00Gaige B. Paulsentag:www.gaige.net,2015-10-22:/the-age-of-deception.html<p>Occasionally, in the vast expanse of the internet there are gems from people I
know and respect. I'm not going to summarize, because the entire article,
<a href="http://stephenhultquist.com/thoughts/2015/09/24/the-age-of-deception/">The Age of Deception</a> , is worth reading by itself.</p>
<p>Thanks, ssh.</p>
Obama Won't Seek Access to Encrypted User Data2015-10-22T05:07:00-04:002015-10-22T05:07:00-04:00Gaige B. Paulsentag:www.gaige.net,2015-10-22:/obama-wont-seek-access-to-encrypted-user-data.html<p>Somehow in the midst of all of the craziness around here, I missed that, as
the New York Times reports, <a href="http://www.nytimes.com/2015/10/11/us/politics/obama-wont-seek-access-to-encrypted-user-data.html?smprod=nytcore-iphone&smid=nytcore-iphone-share&_r=0">Obama Won’t Seek Access to Encrypted User
Data</a>. For the time being, they appear to have agreed to the rationale
that a back door provides as much entree to …</p><p>Somehow in the midst of all of the craziness around here, I missed that, as
the New York Times reports, <a href="http://www.nytimes.com/2015/10/11/us/politics/obama-wont-seek-access-to-encrypted-user-data.html?smprod=nytcore-iphone&smid=nytcore-iphone-share&_r=0">Obama Won’t Seek Access to Encrypted User
Data</a>. For the time being, they appear to have agreed to the rationale
that a back door provides as much entree to the criminal element as it does to
law enforcement, and that the benefits don't exceed the costs. I'm sure that
part of this is also due to the political capital that would have to be
expended to go against the interests of the tech industry, and the potential
economic damage if the US continues to be a place where it is believed that
the government has ready access to stored and in-flight data.</p>
<p>At least for the time being, this is good news for US companies that want to
compete in international markets (especially Europe) where user data protection
is given more weight (in law, at least) than it is here.</p>
<p>The <a href="https://eff.org">EFF</a> has a <a href="https://www.eff.org/deeplinks/2015/10/partial-victory-obama-encryption-policy-reject-laws-mandating-backdoors-leaves">more cautionary take</a> on this announcement.</p>
Familial DNA Searching2015-10-22T04:59:00-04:002015-10-22T04:59:00-04:00Gaige B. Paulsentag:www.gaige.net,2015-10-22:/familial-dna-searching.html<p>Wired had an article last week entitled <a href="http://www.wired.com/2015/10/familial-dna-evidence-turns-innocent-people-into-crime-suspects/">Your Relative's DNA Could Turn You
Into a Suspect</a>, in which they describe the practice of using
familial DNA searching to locate suspects. There are interesting implications
here, especially with regard to public DNA search resources like Ancestry.com.</p>
<p>Thanks to <a href="https://www.schneier.com/blog/">Bruce Schneier's Blog …</a></p><p>Wired had an article last week entitled <a href="http://www.wired.com/2015/10/familial-dna-evidence-turns-innocent-people-into-crime-suspects/">Your Relative's DNA Could Turn You
Into a Suspect</a>, in which they describe the practice of using
familial DNA searching to locate suspects. There are interesting implications
here, especially with regard to public DNA search resources like Ancestry.com.</p>
<p>Thanks to <a href="https://www.schneier.com/blog/">Bruce Schneier's Blog</a> for the
link.</p>
Time (Saver) Machine2015-08-09T09:14:00-04:002015-08-09T09:14:00-04:00Gaige B. Paulsentag:www.gaige.net,2015-08-09:/time-saver-machine.html<p>Over the past couple of weeks, I once again reacquainted myself with the joy
of using Time Machine as a backup system. (Please use more than one; at least
one off-site and one on-site would be a good idea. Consider CrashPlan for the
off-site version; we've used it for years and …</p><p>Over the past couple of weeks, I once again reacquainted myself with the joy
of using Time Machine as a backup system. (Please use more than one; at least
one off-site and one on-site would be a good idea. Consider CrashPlan for the
off-site version; we've used it for years and are very happy with it.)</p>
<p>In this case, I needed to borrow a computer from my cluster of Mac Minis that
make up my build and testing farm. In particular, we needed a clean machine
that could be used and wiped, so I wanted to back up the system, install a
fresh copy of OS X 10.10.4 (the machine was running pre-release 10.10.5), take
it to the trade show, and then reverse the process when I came back.</p>
<p>My network environment at home includes a Mac Mini server running an older
version of the OS and serving as a Time Machine server to machines that aren't
easily connected directly to disk. This is an easy configuration to use if
you have any machine that is running all of the time, especially with recent
versions of OS X that can have OS X Server grafted on.</p>
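<p>Incidentally, if you'd rather point a client at a network destination from
the command line than through System Preferences, <code>tmutil</code> can do
it (the user, host, and share names here are made up for illustration):</p>
<div class="codehilite"><pre><code># aim Time Machine at a share on the always-on server; -p prompts for the password
sudo tmutil setdestination -p afp://backupuser@miniserver.local/Backups
</code></pre></div>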
<p>The process went without a hitch. I hadn't been backing up that machine
(that was a bit of a surprise, since I thought I'd learned my lesson a couple
of years ago when I lost the hard drive on another one of my Mac Minis), but
turning on Time Machine was easy and the backup (over wired Ethernet) went
quickly, especially since that machine is basically a simple test box, so it
doesn't have much installed on it.</p>
<p>Installation was a bit more of a hassle: since I normally run that machine
without a monitor, I needed to hook it up to my TV to install (have I
mentioned recently how much I enjoy the fact that we have HDMI these days?),
and the USB stick I was installing from wasn't USB 3, so it was a bit slow.</p>
<p>The machine worked great at the conference, and since it didn't have anything
of value on it, I just restored over top of it (after erasing the hard drive
partition) using Recovery Mode. My third-party Bluetooth keyboard didn't help
much when trying to hold CMD-R to bring up Recovery Mode, so I had to drag
down the one remaining USB keyboard in order to boot it. Once that boot was
done, I selected my Time Machine server, logged in, and chose the backup to
restore. I waited a little while for the recovery to complete, and wow!
Completely functioning machine!</p>
<p>If we were doing a lot of shows (especially with dicey environments like Black
Hat), then I would probably have bought a machine just to use for this
purpose, but in this case, the cost was only a couple of minutes of my time
and a couple hours of the device's time. All told, a great experience.</p>
Academia's Tug-of-war with the NSA over Encryption2014-11-18T06:33:00-05:002014-11-18T06:33:00-05:00Gaige B. Paulsentag:www.gaige.net,2014-11-18:/academias-tug-of-war-with-the-nsa-over-encryption.html<p>There's an excellent article, <a href="https://medium.com/stanford-select/keeping-secrets-84a7697bf89f">Keeping Secrets</a>, on <a href="https://medium.com">medium</a> today
(originally from the <a href="https://stanfordmag.org/contents/keeping-secrets">November/December 2014 issue of Stanford
Magazine</a>) about the
conflict between academic work on cryptography and the NSA's role in national
security. Most of the focus is on what happened and not on who was right or
wrong …</p><p>There's an excellent article, <a href="https://medium.com/stanford-select/keeping-secrets-84a7697bf89f">Keeping Secrets</a>, on <a href="https://medium.com">medium</a> today
(originally from the <a href="https://stanfordmag.org/contents/keeping-secrets">November/December 2014 issue of Stanford
Magazine</a>) about the
conflict between academic work on cryptography and the NSA's role in national
security. Most of the focus is on what happened and not on who was right or
wrong. Particularly interesting is the early section about the increasing
understanding of the necessity of encryption for data security and computer
communication that started as early as the 1970's. Once again, thanks to
Bruce Schneier's excellent blog, <a href="https://www.schneier.com">Schneier on Security</a>, for the pointer.</p>
Nice retrospective podcast on Real Genius2014-07-14T05:49:00-04:002014-07-14T05:49:00-04:00Gaige B. Paulsentag:www.gaige.net,2014-07-14:/nice-retrospective-podcast-on-real-genius.html<p>If you have interest at all in 1980's "culture", the tech industry, and/or the
movie <a href="http://www.imdb.com/title/tt0089886/combined"><em>Real Genius</em></a>, you
should check out the <a href="http://www.imore.com/reviewcast">iMore Review</a> program
<a href="http://www.imore.com/review-16-real-genius">Review 16: Real Genius</a>.</p>
<p>Don Melton, Matt Drance, Guy English, and Rene Ritchie do a great job of
running down the highs and lows …</p><p>If you have interest at all in 1980's "culture", the tech industry, and/or the
movie <a href="http://www.imdb.com/title/tt0089886/combined"><em>Real Genius</em></a>, you
should check out the <a href="http://www.imore.com/reviewcast">iMore Review</a> program
<a href="http://www.imore.com/review-16-real-genius">Review 16: Real Genius</a>.</p>
<p>Don Melton, Matt Drance, Guy English, and Rene Ritchie do a great job of
running down the highs and lows of this classic.</p>
Nice set of Nagios scripts for OS X2013-12-12T04:05:00-05:002013-12-12T04:05:00-05:00Gaige B. Paulsentag:www.gaige.net,2013-12-12:/nice-set-of-nagios-scripts-for-os-x.html<p>When digging around for information about Apple's new Caching Server, I
happened across this informative article about <a href="http://www.yesdevnull.net/2013/10/os-x-mavericks-server-setting-up-caching-server/">Caching Server for
Mavericks</a> by Dan Barrett. Definitely worth a read if you're
interested in finding out how to make the most of your network connection with
your Macs.</p>
<p>However, from there, I …</p><p>When digging around for information about Apple's new Caching Server, I
happened across this informative article about <a href="http://www.yesdevnull.net/2013/10/os-x-mavericks-server-setting-up-caching-server/">Caching Server for
Mavericks</a> by Dan Barrett. Definitely worth a read if you're
interested in finding out how to make the most of your network connection with
your Macs.</p>
<p>However, from there, I noticed a link to the <a href="https://github.com/jedda/OSX-Monitoring-Tools">Nagios plugins for OS
X</a> on github. The plugins
contain a lot of useful functionality for monitoring OS X systems, and appear
to support many versions, supposedly back to 10.4. They are all local
scripts, so they need to run on the systems that you are checking.</p>
<p>There's a lot of useful stuff here.</p>
Great piece on user interface evolution2013-05-13T07:15:00-04:002013-05-13T07:15:00-04:00Gaige B. Paulsentag:www.gaige.net,2013-05-13:/great-piece-on-user-interface-evolution.html<p>Matt Gemmell penned a great piece on User Interface design and evolution as it
relates to, well, a lot of things. It's definitely worth the time to read
<a href="https://web.archive.org/web/20210412222742/https://mattgemmell.com/tail-wagging/">Tail Wagging</a>.</p>
Backup Software2013-03-31T08:05:00-04:002013-03-31T08:05:00-04:00Gaige B. Paulsentag:www.gaige.net,2013-03-31:/backup-software.html<p>I'd never heard of <a href="http://www.worldbackupday.com">World Backup Day</a> before
seeing an article about it in <a href="http://wired.com">Wired</a> today, but it sounds
like a good idea, especially for those whose friends may partake in a little
bit of the April Foolery tomorrow.</p>
<p>So, it's a good time for me to discuss backup software …</p><p>I'd never heard of <a href="http://www.worldbackupday.com">World Backup Day</a> before
seeing an article about it in <a href="http://wired.com">Wired</a> today, but it sounds
like a good idea, especially for those whose friends may partake in a little
bit of the April Foolery tomorrow.</p>
<p>So, it's a good time for me to discuss backup software and strategies. I'm not
going to speak specifically about how I perform backups, but here are some key
packages and concepts that are good to think of when you are considering a
backup strategy. And, yes, this should be a strategy, not a specific backup.
Unless you feel that all of your data is easily replaceable (your photos, your
business plan, your accounting data, your scanned legal documents, etc), you
need to take this seriously.</p>
<h2>Locations</h2>
<p>Sometimes people ask me where they should store their backups. My belief is
that a minimum of two physically distant locations is a must. Keeping backups
at home is cheap and convenient, and results in complete loss in the case of a
fire. Those of you with "fire safes" need to keep in mind that most "fire
safes" are rated for documents. The general rule is that they keep things
cooler than 350° which is great, except that slides, CDs, DVDs, etc. tend to
become useless after any reasonable amount of time spent at 125° and above, so
if you are going to keep your backup safe in a safe, then make sure you have
something that is media rated, not document rated.</p>
<p>And then think about flooding, earthquakes, and sinkholes. Each of these can
take out your media pretty easily and irreparably, and if your main copy of
your data is also in your house, you're looking at a total loss.</p>
<p>Safe Deposit Boxes are a reasonable place to keep spare hard drives and CD/DVD
copies of data. They're unlikely to fail in the same way as your home or
laptop, so you have some diversity, and generally speaking they are safe. Most
folks don't encrypt data which goes to a safe deposit box, and that's a
two-edged sword: your data is in its easiest-to-access form, but that's true not
only for you, but anyone else who rummages in your box.</p>
<p>I would suggest at least 2 locations. They should be far enough apart that
they're unlikely to suffer the same fate in the event of a disaster.</p>
<h2>Software</h2>
<p>I'm a big fan of paid-for backup software. Here are some specific packages:</p>
<p><strong>CrashPlan</strong> CrashPlan is best known as a service for backup. They also
provide hosted enterprise solutions, which allow you to have servers at their
own location. The software can also be used to back up to any location where
there is another CrashPlan user. This means that you and your buddy can provide
storage space for each other's backups, guaranteeing that if one of your houses
goes up in flames there is a copy of the data at the other house.
That's a neat feature, although I've never used it.
Generally speaking, I have found the CrashPlan software to be reliable and their
support services to be adequate. I can comfortably suggest their service to
most users for off-site backup, as it provides significant encryption and their
systems do constant reliability checks on the stored data.</p>
<p><strong>Time Machine</strong> Time Machine is Apple's built-in backup software for many
versions of OS X. It provides version storage as well as very simple
administration, and can be used easily with an externally connected hard drive.
Of course, it's not very useful for off-site backup. However, for local backup
it is easy to set up and easy to restore data from.</p>
<p><strong>BRU Server</strong> I use BRU Server in daily use at ClueTrust for the machines in
the hosting center. It's not the prettiest software package (by a mile),
but it works and works reliably. We had a serious hardware event over the first
of the year, and it came through with flying colors. The software itself
installs on a server and then individual agents are installed on each client
machine. Agents are available for just about any operating system that you can
imagine, including more varieties of UNIX than I thought even existed anymore.
I don't have much experience with the Windows agent, but the Mac agents work
well, and the UNIX agents also function just fine.</p>
<p><strong>SuperDuper!</strong> SuperDuper is a package that clones hard drives on the
Mac from one device to another. This provides you with a completely bootable
version of the device as of the time that you created the clone.
These backups are exact duplicates, and that means that you don't
get to go back and look at previous versions of the files.</p>
<p><strong>Retrospect</strong> Historically (in the old days), I used Retrospect,
which went downhill significantly when Dantz was acquired by EMC.
The software product was spun back out into Retrospect, Inc. in November of
2011, and the word is that it has improved markedly since then.
I gave it another try at one point, but have not used it in production,
nor have I tried recent versions; reliable sources say that it is
getting better.</p>
<h2>Encryption</h2>
<p>When possible, do this. It's especially important when keeping data off-site
to make sure that data is encrypted using strong encryption and with keys that
are only available to you. This is possible with some services like CrashPlan,
which allow you to designate your own keys and is possible when you store data
encrypted on your own hard drives. Any data which is intentionally taken
off-site should be stored in some encrypted form. Keep in mind that if you
designate your own keys, you are going to have to safely store these keys in a
manner that they will not be lost by whatever event causes your data to be
lost.</p>
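<p>As an illustration of the do-it-yourself end of this (a generic sketch, not
any particular product's mechanism), encrypting an archive with a passphrase
before it leaves the house can be as simple as:</p>
<div class="codehilite"><pre><code># encrypt a backup archive with AES-256 before taking it off-site
tar czf - ~/Documents | openssl enc -aes-256-cbc -salt -out documents.tar.gz.enc

# and decrypt it when you need it back
openssl enc -d -aes-256-cbc -in documents.tar.gz.enc | tar xzf -
</code></pre></div>
<p>The same caveat applies: the passphrase is now the thing you can't afford to
lose.</p>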
<h2>In Conclusion</h2>
<p>It doesn't really matter so much how you decide to back up your data; it just
matters that you do back up your data. If there's something that you care
about, back it up. If you care about the data being secure, encrypt it. If
for some reason you believe you care about the data but don't care about it
being secure, think again.</p>
FirstToDisclose.org repository for invention disclosure2013-03-29T04:22:00-04:002013-03-29T04:22:00-04:00Gaige B. Paulsentag:www.gaige.net,2013-03-29:/firsttodiscloseorg-repository-for-invention-disclosure.html<p>Launched just ahead of the first move by the US to switch from first-to-invent
to first-to-file, a new site,
FirsttoDisclose.org has launched, with the idea
of putting your materials in the public view in order to limit other people
from attempting to claim patent protection for ideas you are …</p><p>Launched just ahead of the first move by the US to switch from first-to-invent
to first-to-file, a new site,
FirsttoDisclose.org has launched, with the idea
of putting your materials in the public view in order to limit other people
from attempting to claim patent protection for ideas you are using. I'm not
a lawyer, but from what I can see, this is mostly useful for forcing ideas to
be disclosed for public use (i.e. intent to eventually place in the public
domain).</p>
<p>It's an interesting idea, and has gotten some press coverage, but I'm not sure
how much traction it will get, especially in this format.</p>
<p>Some further information about the move toward First-to-File is presented in
an article by two IP lawyers from Steptoe & Johnson LLP:
<a href="https://web.archive.org/web/20130320011803/http://www.steptoe.com/assets/htmldocuments/DJ%20-%20PATENT%20LAW%20-%20KOVELMAN%20PETERSON%20-%202011.pdf">Patent law creates a first-to-disclose system</a>
(PDF), which discusses exceptions in the prior art handling mechanism in the
new law, something that leaves some differentiation between the US system and
those of other first-to-file systems. It also describes some interesting
scenarios where the US system might lead people to intentionally disclose
earlier in some fields, which would preserve their rights in the US, and
possibly create problems outside of the US. There have been similar
problems in the past with PPAs here in the US and foreign rights.</p>
<p>According to the FAQ (confirmed by a
<a href="https://web.archive.org/web/20151017113341/http://www.blipclinic.org/2013/03/brooklyn-law-students-launch-firsttodisclose-org-in-anticipation-of-new-america-invents-act-patent-priority-rules/">press release</a> from the clinic), the site was set up by members of the
Brooklyn Law Incubator and Policy Clinic (BLIP), and is registered to an
individual with an address in Brooklyn.</p>
<p>As of today, there is one test disclosure up on the site, and no information
on the number of registrants.</p>
<p><em>Ed Note</em>: Sadly, this site went away and is now being used by a law firm for
other purposes. So much for a repository of disclosures.</p>
New server, new design2013-03-28T08:08:00-04:002013-03-28T08:08:00-04:00Gaige B. Paulsentag:www.gaige.net,2013-03-28:/new-server-new-design.html<p>Hi folks. We're back on the air with a new server and a new design.
Hopefully this less noisy design is a bit more palatable. Any links to the
old site will cease to function today, but there weren't many anyway
(according to Google Webmaster Tools), and we have preserved …</p><p>Hi folks. We're back on the air with a new server and a new design.
Hopefully this less noisy design is a bit more palatable. Any links to the
old site will cease to function today, but there weren't many anyway
(according to Google Webmaster Tools), and we have preserved all of the
content.</p>
<p>My plan is to post more in 2013 than I did in 2012, which I believe I have
already achieved by posting a single article.</p>
<p>Cheers for now, and subscribe to the new RSS feed to keep up to date.</p>
Where'd my darned flash go!2011-12-27T14:37:00-05:002011-12-27T14:37:00-05:00Gaige B. Paulsentag:www.gaige.net,2011-12-27:/whered-my-darned-flash-go.html<p>I received a question this afternoon from my cousin about the amount of free
flash in her MacBook Air and figured that the answer would probably be useful
to others as well. Note that none of this is officially from Apple, so it
might be wrong, but I have had …</p><p>I received a question this afternoon from my cousin about the amount of free
flash in her MacBook Air and figured that the answer would probably be useful
to others as well. Note that none of this is officially from Apple, so it
might be wrong, but I have had quite a few SSDs and the vast majority of it is
correct, or at least a jumping off point.</p>
<p>So, my tech-savvy cousin sent me the output of "df -k" in terminal and
wondered: "Where's the rest of my hard disk space?" As background, she has a
MacBook Air with a 128GB Hard disk and the output of df is:</p>
<div class="codehilite"><pre><span></span><code><span class="go">Filesystem 1024-blocks Used Available Capacity Mounted on</span>
<span class="go">/dev/disk0s2 117649480 114646588 2746892 98%</span>
<span class="go">/devfs 179 179 0 100%</span>
<span class="go">/devmap -hosts 0 0 0 100%</span>
<span class="go">/netmap auto_home 0 0 0 100% /home</span>
</code></pre></div>
<p>There are a number of things that are at play in this.</p>
<p>First, 128GB is 128,000,000,000 bytes and the 1024-blocks are in 1024-byte
chunks. So, your 117,649,480 blocks work out to 120.4730 billion bytes. Then, if you're
running Lion (which you should be), there will be the recovery partition,
which is hidden but absorbs about 1.3GB on my MacBook Pro.</p>
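<p>A quick sanity check on that arithmetic (nothing clever, just the block-size
conversion):</p>
<div class="codehilite"><pre><code># 117,649,480 blocks of 1024 bytes each
echo $(( 117649480 * 1024 ))
# 120473067520 bytes, i.e. about 120.47 of the decimal GB drives are sold in
</code></pre></div>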
<p>The remainder is likely to be formatting slop and wear leveling space. The
latter is used to make sure that things don't go haywire as SSDs have some
peculiar requirements as to how data is written, etc.... this is one of the
reasons that Apple bought a flash memory controller manufacturer in Israel
late last week.</p>
<p>At any rate, as for the "private" space, that's the disk space that is required to make
your computer work well. The VM is what is called "virtual memory backing
store" and provides space for the computer to store stuff that should have
been in memory if your computer had 8 or 16 or 32GB of RAM. Without it, you
would have to quit more frequently out of programs in order to get other
programs to run. (Incidentally, there is no VM on the iPhone and iPad, because
they just kill off programs that are taking up memory in the background, but
that can't be done with Apps on the Mac very easily, since they were mostly
written a long time ago... the good news is that with Lion there is
infrastructure for apps to support this kind of behavior in the future). The
Sleep Image is what allows your computer to go into deep sleep and not use
any/much battery when you have the lid closed. Here you can be thankful you
have only 4GB of RAM, because my laptop eats 8GB for its sleep image....</p>
<p>So, in summary, this all looks pretty normal under the circumstances. You can
turn off the sleep image, but I would strongly suggest against it. As a
general rule, you want to keep between 8-16GB free as a minimum on SSDs, and
30-50% free on rotating (old style) disks in order to maximize performance.
The fuller your disk, the more rewrites your flash will take and the sooner
it will wear out.</p>
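<p>If you're curious where that space actually lives, you can look at it
directly (paths as on Lion; the sizes will track your installed RAM):</p>
<div class="codehilite"><pre><code># the VM backing store and the sleep image live here
ls -lh /private/var/vm/

# hibernatemode is the knob that controls the sleep image
pmset -g | grep hibernatemode
</code></pre></div>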
<p>In the end, as with all disks, it's a balance between performance and
longevity vs what you want to have with you. The trade-offs are different
between rotating media and solid-state, in that disks can wear out without
much writing, whereas an SSD that is read most of the time will have very
little "wear".</p>
Verb-first AppleScript commands2010-11-03T13:35:00-04:002010-11-03T13:35:00-04:00Gaige B. Paulsentag:www.gaige.net,2010-11-03:/verb-first-applescript-commands.html<p>This is a note for those of us who might run into this problem. When working
on some changes for Cartographica, I ran into some difficulty when using a
verb-first command that could take either a list of one type of user-defined
object or another type of user-defined object.</p>
<p>The …</p><p>This is a note for those of us who might run into this problem. When working
on some changes for Cartographica, I ran into some difficulty when using a
verb-first command that could take either a list of one type of user-defined
object or another type of user-defined object.</p>
<p>The list worked fine (once I realized that it was going to come in as a set of
<code>NSScriptObjectSpecifiers</code> that needed to be evaluated). However, I couldn't get
the <code>NSScriptCommand</code> to receive the call for the stand-alone object of a
different type. AppleScript kept complaining that the class in question didn't
respond to that message. Well, mostly right, except that the command, which
was specified elsewhere, should have been taking care of that, or so I thought.</p>
<p>If you have an <code>NSScriptCommand</code>-based command that takes direct parameters, you
need the class to specify that it responds to that message, even if you do so
by adding an empty method descriptor in the <code>responds-to</code>. If <code>responds-to</code> isn't
there, it gets blocked up at a higher level.</p>
<p>So, the solution was simply adding an empty <code>responds-to</code> for the command that
I'd added and then leaving the work to the <code>NSScriptCommand</code>-derived class.</p>
<p>I hope this saves somebody some time.</p>
Flash, ubiquitous mediocrity2010-04-30T03:44:00-04:002010-04-30T03:44:00-04:00Gaige B. Paulsentag:www.gaige.net,2010-04-30:/flash-ubiquitous-mediocrity.html<p>So with all the wringing of hands and gnashing of teeth over Adobe Flash in
the last few days, I just wanted to make sure I got this little tidbit in.
When it comes to user experience, Adobe just doesn't understand. Their
"platform" of flash brings a mediocrity born of …</p><p>So with all the wringing of hands and gnashing of teeth over Adobe Flash in
the last few days, I just wanted to make sure I got this little tidbit in.
When it comes to user experience, Adobe just doesn't understand. Their
"platform" of flash brings a mediocrity born of the Internet of 5-10 years ago
into your browser window. Why? Because, with Flash you can build once and
deploy everywhere.... Wait a second, wasn't that Java? Have you ever used a
Java application? They just don't feel right on any platform and neither does
Flash. My troubles today began when I headed over to the Adobe site to give
them my bi-annual upgrade fee...</p>
<p>As painful as it can be to spend $600 every two years or so to keep your Adobe
products up to date, they still have some of the best professional products
out there, especially Photoshop, Illustrator, and InDesign. I tolerate
DreamWeaver, but less and less every year, especially since I have some very
talented people working the web work for me these days.</p>
<p>So, I headed on over to Adobe and got a big Flash window to start with.
Unnecessary animation, and with no video playing it had 100% of one of my 16 cores
going. Now, to be absolutely clear, there is <strong>no reason for this CPU
use</strong>. They have some vector animation going on and are sitting in a tight
loop while waiting for me to do something on the screen.</p>
<p>I continued on the purchase track, navigating the information about the
various types of upgrades and noting that you just can't get everything you
want without buying Master Suite, but as with last time, I skipped it as I
will likely not need the other features anyway.</p>
<p>Clicking on "Check Out," I went to another, slow-loading flash page. This
finally finished, but rather than do a narrative, I'm just going to list out
the things that just plain don't work well for me on the Adobe site, all due
to Flash:</p>
<ul>
<li>1Password secure login credential manager can't be used to log in; I had to go dig out my password. Thankfully, at least they support paste, but that was as far as I was able to automate my login process.</li>
<li>1Password secure credit card entry doesn't work either. Probably redundant to the first one, but it was highly annoying.</li>
<li>Scrolling with my magic mouse doesn't work. Yes, there's a very standard event handling API on the Mac for this, but apparently since it isn't part of the Adobe platform, my user experience as a Mac user is going to suffer. If this were an HTML5 Application, the scrolling and user experience would be appropriate for whatever platform I was running on, not some middle-of-the-road platform developed by Adobe that lowers every platform to the lowest common denominator.</li>
<li>Double-click selection behavior doesn't work. Double-click and drag doesn't do extended word selection like it does in every Macintosh Application.</li>
<li>And this is all without commenting on the extremely lame use of the Akamai (Java-based) download manager.... 3 different windows opening every time I add another file, and there were 7 of them..... lame.</li>
</ul>
<p>This is just a small list. When you are working in a Flash environment, things
just feel "wrong". It's not always easy to put your finger on it, although the
additional background CPU usage certainly contributes to some sluggishness on
the entire machine, not to mention in the Flash "application".</p>
<p>I understand the purpose of using Flash. For Adobe, it's a way to try and
outdo Java (develop once and deploy everywhere), which never worked that well for
the users (or for Sun for that matter). For designers, it's a way to develop
an application without having to hire a programmer. Adobe even describes it
as drag-and-drop programming. That's all well and good, but as a long-time
developer, I understand that the quality of a product is often based on the
amount of work that you put into it. I don't doubt that you can create
reasonable user experiences with Flash, but I don't think you can create them
easily. What Flash enables is piss-poor quality content creation for
everyone.... ubiquitous mediocrity.</p>
Apple replaces ADC with Mac Developer Program2010-03-05T04:15:00-05:002010-03-05T04:15:00-05:00Gaige B. Paulsentag:www.gaige.net,2010-03-05:/apple-replaces-adc-with-mac-developer-program.html<p>Wow! Apple certainly wants more people to sign up as Mac developers, that's
for sure. In changes made today, the ADC Premier and Select tiers have
disappeared and they have been replaced with Mac Developer through the current
developer portal (which has the iPhone developer sign-up as well). Mac and …</p><p>Wow! Apple certainly wants more people to sign up as Mac developers, that's
for sure. In changes made today, the ADC Premier and Select tiers have
disappeared and they have been replaced with Mac Developer through the current
developer portal (which has the iPhone developer sign-up as well). Mac and
iPhone are now $99/year each and don't include the hardware discount, which
has been an historical benefit of the more expensive developer program. More
importantly, you can't buy the Premier version, which includes your ticket to
the next WWDC, so you'll be fighting the crowd with everyone else.</p>
First media rant of the year2010-01-02T12:28:00-05:002010-01-02T12:28:00-05:00Gaige B. Paulsentag:www.gaige.net,2010-01-02:/first-media-rant-of-the-year.html<p>That didn't take too long. Dana Milbank of the Washington Post has a column
coming out tomorrow (which I won't link to because of my theory that this is
all about the flame bait and web hits). The article basically states that
Glenn Beck is more admired than The Pope …</p><p>That didn't take too long. Dana Milbank of the Washington Post has a column
coming out tomorrow (which I won't link to because of my theory that this is
all about the flame bait and web hits). The article basically states that
Glenn Beck is more admired than The Pope. Ah, the cherry-picking! So, the
original <a href="http://www.gallup.com/poll/124895/Clinton-Edges-Palin-Admired-Woman.aspx">poll</a>
(<a href="http://www.Gallup.com">Gallup</a>, based on 1025 surveyed adults,
estimated +/- 4% error rate) does seem to show that if you look at one
particular piece of it whilst ignoring everything else. However....</p>
<p>If you actually read the numbers, you see that in the Men category, everybody
below Barack Obama (for the aggregate general public of all political
persuasions) is literally within the margin of error of ZERO. That's right,
nobody but Barack Obama had more than 4% of the combined total, and only
George W Bush (the Republicans' #1 at 11% of Republicans) and Nelson Mandela
had more than 2% of the combined total.</p>
<p>Even when you go to the Republicans only, Glenn Beck still had 3%. To me, this
looks like the Washington Post is paying their columnists for either the # of
comments or the # of hits on their web site, because that'd be about the only
reason to write such a reactionary piece of drivel.</p>
<p>And, I'm not even sure that you could argue that this is "good for the liberal
cause" as he likes to be. Most people will read maybe the headline and the
first paragraph, and they'll believe that this is true. Republicans will think
that Beck is well thought of by everybody but them, and Democrats will think
that all Republicans are idiots because they admire this guy. In the end,
neither is the case.</p>
<p>Today, we heard of the passing of Deborah Howell, the former Ombudsman of the
Washington Post. I can't say that she was a personal friend, but I did have
correspondence with her on a few occasions and she was one of the reasons I
thought the Post might recover...I can't help but think that I'm happy she
never had to read this drivel.</p>
Pogue on an author's view of DRM for books2009-12-19T04:58:00-05:002009-12-19T04:58:00-05:00Gaige B. Paulsentag:www.gaige.net,2009-12-19:/pogue-on-an-authors-view-of-drm-for-books.html<p>David Pogue (<a href="http://www.NYTimes.com">NY Times</a>) has written a blog entry
about his experience as an author <a href="http://pogue.blogs.nytimes.com/2009/12/17/should-e-books-be-copy-protected/">selling an ebook with no
DRM</a>...
and it <em>wasn't</em> the end of the world, or even his career as an
author. With the nook and a possible Apple tablet coming to contend with the
<a href="http://www.Kindle.com">Kindle …</a></p><p>David Pogue (<a href="http://www.NYTimes.com">NY Times</a>) has written a blog entry
about his experience as an author <a href="http://pogue.blogs.nytimes.com/2009/12/17/should-e-books-be-copy-protected/">selling an ebook with no
DRM</a>...
and it <em>wasn't</em> the end of the world, or even his career as an
author. With the nook and a possible Apple tablet coming to contend with the
<a href="http://www.Kindle.com">Kindle</a>, will the publishing industry realize that
this may just be another lock-in like Apple and the iPod (which basically
killed music DRM)?</p>
<p>He goes on in the article to describe how scared the publishing industry is
about ebooks and the classic problems with DRM (hurts the people who are
honest, never really gets in the way of the people who aren't).</p>
<p>Then, he gets to the interesting part. It turns out that he and
<a href="http://www.OReilly.com">O'Reilly</a> got together to do an experiment and
offered one of his Windows books online without DRM (but with pricing). The
book showed up quickly on file sharing sites, as was the case when he released
it in DRM'd form.</p>
<p>Speaking as somebody in the software industry, I feel for his concerns about
completely removing DRM. Without some form of reminder to people that the
books that they are reading are theirs and not free for the taking, it's
likely that even naturally honest people will find themselves "loaning" their
books out, only to have the recipient keep the books for a long time, which is
something that just doesn't happen with a physical book (at least not without
the original owner losing the ability to use the book while it's loaned out).</p>
<p>However, those of us in the software industry have long realized (except for
the saturation sellers, like Microsoft and Adobe) that the "copy protection"
portion of the software is there to keep the honest people honest and the
folks who are on the fence feeling bad, and it just doesn't have any effect on
those who are prone to copying... there's just no way to help those people
from their kleptomaniacal ways.</p>
<p>As a Kindle owner, I hope that the DRM for books is resolved soon, since I'm
very hesitant to buy books on the Kindle in its current form. However, I am
enjoying reading my PDF books that I purchased from places like
<a href="http://pragprog.com/">Pragmatic</a>, which offers DRM-free books in PDF (and now
I find out mobi) format that can be used on just about anything.</p>
Regret The Error: The Year in Media Errors and Corrections2009-12-17T04:03:00-05:002009-12-17T04:03:00-05:00Gaige B. Paulsentag:www.gaige.net,2009-12-17:/regret-the-error-the-year-in-media-errors-and-corrections.html<p><a href="https://web.archive.org/web/20100528163814/http://www.regrettheerror.com/2009/12/16/crunks-2009-the-year-in-media-errors-and-corrections/">Crunks 2009: The Year in Media Errors and
Corrections</a> is well worth a read to anyone who watches the
media at all. In particular, the "Correction of the Year" (a 9/11 vs 911
confusion) and the pointer to <a href="https://web.archive.org/web/20120705073113/http://j-source.ca/article/when-should-editors-unpublish-online-news-reports">When should editors "unpublish" online news
reports?</a> from the
<a href="http://j-source.ca/">Canadian Journalism …</a></p><p><a href="https://web.archive.org/web/20100528163814/http://www.regrettheerror.com/2009/12/16/crunks-2009-the-year-in-media-errors-and-corrections/">Crunks 2009: The Year in Media Errors and
Corrections</a> is well worth a read to anyone who watches the
media at all. In particular, the "Correction of the Year" (a 9/11 vs 911
confusion) and the pointer to <a href="https://web.archive.org/web/20120705073113/http://j-source.ca/article/when-should-editors-unpublish-online-news-reports">When should editors "unpublish" online news
reports?</a> from the
<a href="http://j-source.ca/">Canadian Journalism Project</a> were very
interesting.</p>
AT&T's complaining about iPhone users2009-12-13T06:51:00-05:002009-12-13T06:51:00-05:00Gaige B. Paulsentag:www.gaige.net,2009-12-13:/atts-complaining-about-iphone-users.html<p>I'm an iPhone user... my wife (Hi, Carol!) is an iPhone user, and I even have
an iPhone set up for development purposes that doesn't get used for anything
else (despite the fact that we pay for it monthly). I also have an AT&T
Data card, for use when …</p><p>I'm an iPhone user... my wife (Hi, Carol!) is an iPhone user, and I even have
an iPhone set up for development purposes that doesn't get used for anything
else (despite the fact that we pay for it monthly). I also have an AT&T
Data card, for use when hot spots are either unavailable or too annoying.
Generally speaking, I've had the same experience as most AT&T users in the
DC area, "meh". But, this latest complaining from AT&T about iPhone data
usage has gotten me a little hot under the collar.</p>
<p>According to widespread reports, AT&T's CEO (Ralph de la Vega) has been
complaining that those damnable iPhone users have created a situation where 3%
of the users are using 40% of the data on his network, and that's the reason
why his suddenly-very-popular-not-because-of-its-stellar-3g-coverage network
is bogged down.</p>
<p>Well, Ralph, I'm here to tell you that you're extracting a heck of a lot of
money from me for very little data usage. Carol & I use our iPhones pretty
heavily, as far as I'm concerned, both in the house and out. But, I've
gone and looked at it, and we're averaging about 150MB/month in data usage
each. For this privilege, we pay the iPhone data surcharge (that doesn't cover
SMS any more) of $39.95 per month (that's in excess of our charges for voice
usage).</p>
<p>Comparison time here. I mentioned before that I have a data card, which has a
5GB limit and costs me $60/month to use. Based on the relative price, I would
expect that the iPhone user is paying for 2/3 of that 5GB, or about
3.33GB/month. Based on our usage (total of 300MB/month for 3 phones), we are
paying 39.95*3/300MB or $0.40/MB for our data usage. Yes, if the use
increased, we wouldn't pay more, but to put this into perspective, just on my
not-very-oft-used data card, I use an average of about 300MB/month. At
$60/month, that means I'm paying $60/300MB or $0.20/MB for data usage, or
literally half of what we're paying for our iPhone usage.</p>
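<p>The back-of-the-envelope version of that comparison (monthly totals are the
rough averages quoted above):</p>
<div class="codehilite"><pre><code># three iPhone data plans, ~300MB/month combined
echo "scale=2; (39.95 * 3) / 300" | bc    # => .39, call it $0.40/MB

# one data card at $60/month, also ~300MB/month
echo "scale=2; 60 / 300" | bc             # => .20, i.e. $0.20/MB
</code></pre></div>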
<p>So, I don't see where AT&T is getting screwed here.</p>
<p>However, in reality, they haven't claimed anything other than 3% of the users
are using 40% of the data. This is pretty normal for ISPs. Since time
immemorial, ISPs have griped about the disproportionate use of the high-end
users over the low end. What they don't complain about is the 97% of users who
get crammed into the 60% usage. As a matter of fact, I don't ever remember
having an ISP complain that they were charging the low-end users too much...
nor that they didn't want the high end users touting their products, just that
the high end users were "unfairly using the resources", which is pure,
unmitigated greed.</p>
<p>Now, some of you may be saying that I should be more sympathetic to ISPs....
the heck with that, I've been an ISP, I've made money being an ISP, and I've
been the part owner of ISPs in 16 countries around the world, including Japan,
the Philippines, Canada, the US, Brasil, Chile, Germany, France, Switzerland,
Uruguay, Argentina, Panama, just to name some. In the end of the day, every
ISP gripes about data usage, but none of them complain about the word-of-mouth
from the power users and nobody sees usage patterns that are any different.
Data use in the ISP industry is progressively up and to the right. People
consume more data every day than the day before. ISPs don't have the luxury of
sitting still like phone companies of old and letting their ancient equipment
languish, they have to innovate and upgrade or they will be surpassed by their
competition.</p>
<p>In the old days, in America, we used to call this competition and call it a
good thing. We used to discuss how capitalism and competition was good for
customers and good for international stature, as it kept us competitive in the
light of everything going on in the world. However, I guess that's not good
for America any more. What seems to be good for America (at least according to
AT&T) is to complain about their customers until they stop using what
they're paying for and sit down and take it like a child being told what to
do. Sounds a bit like communism to me: everybody should use the same data so
that the centrally-planned AT&T can move slowly and with plenty of
bureaucracy.</p>
Aperture and laptops2009-10-06T14:52:00-04:002009-10-06T14:52:00-04:00Gaige B. Paulsentag:www.gaige.net,2009-10-06:/aperture-and-laptops.html<p>For those of you digital photography enthusiasts who use Aperture, here's a
little bit of info on how I deal with going out of town with my laptop, but
wanting to do the "real work" on my desktop when I get home.</p>
<p>It turns out that if you are careful …</p><p>For those of you digital photography enthusiasts who use Aperture, here's a
little bit of info on how I deal with going out of town with my laptop, but
wanting to do the "real work" on my desktop when I get home.</p>
<p>It turns out that if you are careful, you can save yourself a lot of time and
effort. I have a new idea that I haven't tried yet, that might have some cute
results for working on the home network, but that is a more involved test that
I'll do later.</p>
<p>Right now, what I do is import the photo files into my Photos directory
manually (usually a subdirectory), and then import those files by reference
(don't copy the data) into a new Aperture Project on the laptop. This way, I
can play around with them and sort them prior to getting back to my main
computer.</p>
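<p>For those who like to script that manual staging step, here's a minimal
Python sketch of how I could populate a dated subdirectory of the Photos
directory before importing by reference. The card path, folder layout, and
extension list are assumptions about my own setup, not anything Aperture
requires.</p>
<pre><code>#!/usr/bin/env python
# Stage card contents into a dated subfolder of the Photos directory
# so Aperture can then import them by reference (leaving the data in
# place). Paths and extensions here are hypothetical examples.

import shutil
from datetime import date
from pathlib import Path

CARD = Path("/Volumes/EOS_DIGITAL/DCIM")      # assumed card mount point
PHOTOS = Path.home() / "Pictures" / "Photos"  # my managed Photos directory
KEEP = {".cr2", ".nef", ".jpg", ".jpeg"}      # extensions worth staging

def stage_photos():
    dest = PHOTOS / date.today().isoformat()  # e.g. Photos/2009-10-06
    dest.mkdir(parents=True, exist_ok=True)
    for src in sorted(CARD.rglob("*")):
        if src.is_file() and src.suffix.lower() in KEEP:
            target = dest / src.name
            if not target.exists():           # don't clobber earlier copies
                shutil.copy2(src, target)     # copy, never move, off the card
    return dest

if __name__ == "__main__":
    print("Staged into", stage_photos())
</code></pre>
<p>Once the files are staged, I point Aperture's import at that dated folder
and choose the option that leaves the files in their current location, which
gives the import-by-reference behavior described above.</p>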
<p>When I return home, I then use the Export... command to export the Project
created above, taking care not to let it consolidate the data into the project
file. I then import this file over AppleShare by mounting the laptop disk on
the desktop machine. This imports the important data and leaves the files on
the laptop, where they are still available.</p>
<p>Once I'm happy with that, I use the Consolidate command to consolidate the
photos into the vault on my desktop machine, taking care to copy, not move,
the data. This way, the project on my laptop is still completely functional
until I decide to delete the data.</p>
<p>Once I'm satisfied that everything is moved over, I then delete the sub-folder
in the Photos directory and get my disk space back on the laptop.</p>
My take on Macintosh security2009-09-03T05:19:00-04:002009-09-03T05:19:00-04:00Gaige B. Paulsentag:www.gaige.net,2009-09-03:/my-take-on-macintosh-security.html<p>Ah, a new release.... must be time for another slew of articles aimed at
getting press and money for the "security" folks out there. For those of us
with Macintoshes, here is my take on the whole Macintosh virus situation.</p>
<p>Every time a new OS release comes out, a whole …</p><p>Ah, a new release.... must be time for another slew of articles aimed at
getting press and money for the "security" folks out there. For those of us
with Macintoshes, here is my take on the whole Macintosh virus situation.</p>
<p>Every time a new OS release comes out, a whole mess of security
"professionals", especially those with recent books (such as Miller's <a href="http://www.amazon.com/gp/product/0470395362?ie=UTF8&tag=cartographica-20&linkCode=as2&camp=1789&creative=390957&creativeASIN=0470395362">The Mac
Hacker's Handbook</a>),
are being interviewed by every Tom, Dick, and Harry, repeating the same
drivel that we've been hearing about Macintosh security for years, which
basically amounts to:</p>
<p><em>Oh yeah? Well, if more Macintoshes were sold, then there'd be a lot more
viruses for the Mac, I tell you.... just you wait!</em></p>
<p>Now, it may well be true that if there were more Macs out there, there would
be more reason to go after the Macintosh and it would tend to lead people to
write more viruses for them. It may also not be true, and I've <em>never</em> seen
any indications that there is a statistical basis for this complaint.</p>
<p>However, let's grant for the moment that it's a possibility and start looking
at the kinds of exploits that tend to show up for the Macintosh in these
articles. Generally speaking, and I'm not going to cite individual articles
here because I haven't done a complete statistical analysis of them, the kinds
of exploits that show up for the Macintosh are trojan horses, a class of
malicious software that the user downloads and runs or installs. Once you've
done that, you're open to a number of potential problems, including the
stealing of data and the deletion of files that are not protected.</p>
<p>There are two key takeaways about trojan horses on the Mac: first, they are not
the same as viruses; and second, they are limited in what they can do to your
system <strong>unless you give them power</strong>. Now, this part in bold is important. If
you download a questionable piece of software from the Internet (or any
software for that matter, since most really don't need this facility) and the
software prompts you for a password to your system during the installation
process, you should seriously consider saying "no". If you say yes, you
have no granular control over what it might do to your system, as you
have provided it with escalated privileges to access all data and services on
your Macintosh.</p>
<p>Here is another thing that makes a big difference to Macintosh users: no
in-the-wild viruses. There are basically no programs that exist today that can
infect Macintoshes without the user taking specific action (opening a program
in particular). Through the use of Quarantine, which has been around since
Leopard, Apple tries to warn you the first time you open a piece of software,
telling you where it was downloaded from and asking you if you're sure you want
to run it. It only happens the first time you run each program, so it doesn't
provide an overwhelming number of "are you sure" dialogs.</p>
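<p>If you're curious whether a particular download is flagged, you can peek at
the quarantine attribute from a script. Here's a small Python sketch that
shells out to the stock <code>xattr</code> tool; the attribute name is real, but
treat the rest as an illustration rather than a supported API.</p>
<pre><code>#!/usr/bin/env python
# Report whether a file carries Leopard's quarantine attribute.
# Relies on the xattr command that ships with Mac OS X.

import subprocess
import sys

QUARANTINE_ATTR = "com.apple.quarantine"

def quarantine_info(path):
    """Return the raw attribute value, or None if the file is clean."""
    result = subprocess.run(
        ["xattr", "-p", QUARANTINE_ATTR, path],
        capture_output=True, text=True,
    )
    return result.stdout.strip() if result.returncode == 0 else None

if __name__ == "__main__":
    info = quarantine_info(sys.argv[1])
    if info:
        print("Quarantined:", info)  # typically flags, timestamp, and agent
    else:
        print("No quarantine attribute set.")
</code></pre>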
<p>Once you install a program on one Macintosh, the likelihood of it spreading
virally (without you or the user of the computer specifically starting the
program in question) is really, really low. I say really, really low, because
there were some programs that managed this feat before Leopard by hiding
executables in what looked like data files. However, Quarantine makes that
virtually impossible these days.</p>
<p>Most importantly, the kinds of worms that have infected Windows and other
systems over the years (a worm being a particularly vicious type of malware
that makes its entrance behind the scenes, infects the computer, and uses it as
a jumping-off point to infect more), have been almost absent from the Mac
(there was a report of one in 2006/2007 using Bonjour as a vector, but that
was patched by Apple on all affected systems and the worm appeared to only
show up after that problem was disclosed).</p>
<p>People can argue until they're blue in the face about why Macs tend to have a
lot less trouble than PCs. Frankly, the amount of open administrative software
that lives on (especially older) Windows machines is a good portion of the
problem here. For years, Windows 2000 and other versions had the ability for
network administrators to broadcast a message to every user on a network that
was then displayed on their screens. This was a horrible idea, since it had
absolutely no security whatsoever involved in it and basically allowed anyone
with knowledge of your network address to send a message to your screen that
popped up as if it were from the OS. To make matters worse, there were
security problems with the program that put up the window and they were
exploited to deliver worms and other viruses on the Windows platform. This is
not an isolated case, either.</p>
<p>Architecturally, there's definitely more that Apple can do about security on
the Macintosh and I hope that we continue to see the kind of sandboxing that
is being used by Apple on the iPhone slowly creep its way into the Mac. By
using this judiciously, they could allow only authorized programs to do
things on the system, and they could build a much better permissions model than
the otherwise-dangerous all-or-nothing approach that the installers tend to
take these days. I'd love to see something along the lines of an installation
dialog for VMware (as an example) that requests permission to "add kernel
extensions and startup items" and then have the OS grant just the permissions to
install items in those places. More importantly, for programs that use the
installer just to put things into special locations, such a scheme could
prevent them from doing other things behind the scenes (like installing kernel
extensions) without your knowledge. I know I'd think twice if a graphics
program requested permission to install a kernel extension.</p>
<p>But, for the time being, the Macintosh is a pretty safe platform, as long as
users are vigilant. Keep up to date on your software updates and don't run
programs with questionable pedigrees.</p>
<p>NOTE: Today's <a href="http://www.Wired.com">Wired</a>
<a href="http://www.wired.com/gadgetlab/2009/09/security-snow-leopard">article</a> pretty
much caused this article to be written. I have to say that you must admire a
magazine that continues such superlative reporting as telling us that "In Snow
Leopard, Apple has added security enhancements including Executive Disable"...
executive disable? Sounds like something you'd use in a bad movie to remove
your competition, did you mean Execute Disable (XD), a technology that's been
around for years and was one of the most touted security features of the last
3 generations of processors? Oh, you know, that whole accuracy thing isn't
important. Wonder how well you did on the other facts? Probably about the
same: interview a couple of guys who are shilling a book and reprint their
stuff, along with whatever you can find in a quick Google search. No offense to
Google. For more humor, the next line: "Apple also added hardware-enforced
Data Execution Prevention" is basically a reference to the <em>Exact Same
Technology</em>. Curiously, Apple's only technology mention is of "hardware-based
execute disable for heap memory", which I'll note doesn't mention disabling
executives at all!</p>
Building IBPlugins under Leopard and Snow Leopard2009-08-31T10:40:00-04:002009-08-31T10:40:00-04:00Gaige B. Paulsentag:www.gaige.net,2009-08-31:/building-ibplugins-under-leopard-and-snow-leopard.html<p>This is a pretty esoteric topic, but I ran into it and maybe the google will
help somebody find my solution before they waste too much time. I have a
custom control with a custom IBPlugin which we use in Cartographica. The
IBPlugin compiles in all 4 binary modes, but …</p><p>This is a pretty esoteric topic, but I ran into it and maybe the google will
help somebody find my solution before they waste too much time. I have a
custom control with a custom IBPlugin which we use in Cartographica. The
IBPlugin compiles in all 4 binary modes, but I restricted it to ppc and i386,
because the IB framework that you have to link against doesn't support 64-bit
under Leopard. Enter Snow Leopard...</p>
<p>So, under Snow Leopard, IB (Interface Builder) is a 64-bit application and
both IB and <code>CompileXIB</code> complain that the plugin doesn't have the right number
of bits in it. So, I needed to build 64-bit, which wasn't difficult.
Unfortunately, the hard part is that if you try to link the 64-bit version
under Leopard, it chokes.</p>
<p>This wouldn't be a huge problem, since I'm building under Snow Leopard (even
the 10.5 compatible version) right now, but for safety's sake and because I
haven't upgraded our buildbots yet, I run the automated tests on Leopard.</p>
<p>My solution was rather simple: I pointed the target's architecture setting at
a user-defined variable and then built for all architectures on Snow Leopard.
For the Leopard compiles (which are all done by script anyway), I just put in a
define that overrides that variable at compile time, and it worked fine (there's
a rough sketch of the idea below).</p>
<p>Since this is built as a part of the multi-stage automated build process using
dependencies, I couldn't just set up something simple, so in my case I had to
make sure that the variable in question wasn't used by anything else.</p>
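<p>In case it helps anyone wire up something similar, here's a rough Python
sketch of that scripted override. The project, target, and variable names are
made up for illustration; the one real trick it leans on is that
<code>xcodebuild</code> accepts build-setting overrides on the command line.</p>
<pre><code>#!/usr/bin/env python
# Drive xcodebuild with an architecture override. This assumes the
# Xcode target's ARCHS setting points at a user-defined variable
# (called PLUGIN_ARCHS here), so the Leopard buildbot can drop the
# 64-bit slices that it can't link.

import subprocess

def build_plugin(snow_leopard):
    archs = "ppc i386 ppc64 x86_64" if snow_leopard else "ppc i386"
    subprocess.check_call([
        "xcodebuild",
        "-project", "MyControl.xcodeproj",  # hypothetical project
        "-target", "MyControlIBPlugin",     # hypothetical IBPlugin target
        "-configuration", "Release",
        "PLUGIN_ARCHS=" + archs,            # overrides the user-defined setting
    ])

if __name__ == "__main__":
    build_plugin(snow_leopard=False)  # what the Leopard script would run
</code></pre>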
Alton Brown hacks the Kitchen2009-08-26T05:40:00-04:002009-08-26T05:40:00-04:00Gaige B. Paulsentag:www.gaige.net,2009-08-26:/alton-brown-hacks-the-kitchen.html<p>Cool <a href="http://gizmodo.com/5344726/alton-brown-safe-and-scary-kitchen-hacks">piece</a>
on <a href="http://www.Gizmodo.com">Gizmodo</a> about Alton Brown's favorite Kitchen hacks.</p>
Snow Leopard Releases Friday2009-08-25T14:29:00-04:002009-08-25T14:29:00-04:00Gaige B. Paulsentag:www.gaige.net,2009-08-25:/snow-leopard-releases-friday.html<p>Many of you have already seen that Snow Leopard (OS X 10.6) will be releasing
on Friday. I've been running it as my primary OS on my laptop since the WWDC
in June and look forward to getting a real install on there. You're probably
also aware that PPC …</p><p>Many of you have already seen that Snow Leopard (OS X 10.6) will be releasing
on Friday. I've been running it as my primary OS on my laptop since the WWDC
in June and look forward to getting a real install on there. You're probably
also aware that PPC machines will not run 10.6 (and beyond), so this upgrade
is Intel Macintosh only. Here is a roadmap through the upgrade products.</p>
<p>First of all, this cat is coming pretty cheap for those who already have
Leopard (either through purchasing a machine with Leopard on it, or upgrading
in the past). The MSRP for Snow Leopard is $29 ($49 for a family pack).</p>
<p>Apple's other option is a pretty good deal for pre-Leopard Intel users who are
also behind on iLife and are interested in iWork. You can get a bundle (called
the Mac Box Set) for $169 MSRP including Snow Leopard, iWork '09 and iLife '09
($229 for the family pack of all three).</p>
<p>Of course, if you order direct from Apple you can get it on Friday. However,
Amazon (and other retailers) are providing a financial incentive to order
from them (the savings are totaled up at the end of this post).</p>
<ul>
<li>Snow Leopard single-user upgrade $24.99</li>
<li>Snow Leopard Family Pack upgrade $43.99</li>
<li>Mac Box Set (with Snow Leopard) $149.99</li>
<li>Mac Box Set Family Pack (with Snow Leopard) $199.99</li>
</ul>
<p>And for those of you on the server side, there's been a big drop in price from
the previous $999 and $499 prices for Unlimited and 10-user. Snow Leopard
Server will only be offered in Unlimited and the price has been moved to $499
(the previous 10-user price). Amazon has it for $444.99.</p>
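<p>As promised above, here's a trivial Python sketch that totals up the
savings using the MSRPs and Amazon prices listed in this post.</p>
<pre><code>#!/usr/bin/env python
# Compare Apple's MSRPs with the Amazon prices quoted above.

PRICES = {
    # product: (MSRP, Amazon price)
    "Snow Leopard single-user": (29.00, 24.99),
    "Snow Leopard Family Pack": (49.00, 43.99),
    "Mac Box Set": (169.00, 149.99),
    "Mac Box Set Family Pack": (229.00, 199.99),
    "Snow Leopard Server": (499.00, 444.99),
}

for product, (msrp, amazon) in PRICES.items():
    saving = msrp - amazon
    print("%-26s save $%6.2f (%4.1f%%)" % (product, saving, 100 * saving / msrp))
</code></pre>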
Timing a UPS2009-08-21T09:39:00-04:002009-08-21T09:39:00-04:00Gaige B. Paulsentag:www.gaige.net,2009-08-21:/timing-a-ups.html<p>In line with my BackUPS/SmartUPS story earlier today, I wanted to say a little
something about how I "watched" it.</p>
<p>First, I am very pleased with my
<a href="http://www.amazon.com/gp/product/B00009MDBU?ie=UTF8&tag=cartographica-20&linkCode=as2&camp=1789&creative=390957&creativeASIN=B00009MDBU">KILL-A-WATT</a>
for figuring out the real load (not needing to trust the meter on the UPS,
which was reasonably accurate), so I …</p><p>In line with my BackUPS/SmartUPS story earlier today, I wanted to say a little
something about how I "watched" it.</p>
<p>First, I am very pleased with my
<a href="http://www.amazon.com/gp/product/B00009MDBU?ie=UTF8&tag=cartographica-20&linkCode=as2&camp=1789&creative=390957&creativeASIN=B00009MDBU">KILL-A-WATT</a>
for figuring out the real load (not needing to trust the meter on the UPS,
which was reasonably accurate), so I know that my equipment was using between
290 and 340 watts.</p>
<p>Next, I pulled the BackUPS out into a well-lit area and plugged it in (to top
it off). Then, I plugged 3 100-Watt lamps into it to generate about 300W of
power use.</p>
<p>Then came the realization that "there's a technology for that"... originally I
was going to sit and stare at the meter just like I did timing out the SmartUPS
1500 yesterday afternoon (with a bit more reason, since there's no readout on
it). However, in this case, there was a screen and I figured I'd use my
camcorder to grab a movie of it running down so I didn't have to take readings
manually or watch the clock. It worked flawlessly.</p>
<p>In the end, I just waited for the lights to go out, wound back the "tape"
(it's a memory-based camcorder, so there really isn't any tape in it), and
paid attention to the time stamp at each significant event. Then I took a few
minutes to write it up and sent it off to APC.</p>
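<p>The arithmetic at the end is trivial, but for completeness, here's the sort
of throwaway Python I'd use to turn the recorded timestamps into a runtime and
a delivered-energy figure. The timestamps below are placeholders, not my
actual readings.</p>
<pre><code>#!/usr/bin/env python
# Convert camcorder timestamps into runtime and delivered watt-hours.
# Assumes the test doesn't cross midnight.

from datetime import datetime

FMT = "%H:%M:%S"

def runtime_minutes(start, end):
    """Minutes between pulling the plug and the lights going out."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 60.0

if __name__ == "__main__":
    minutes = runtime_minutes("14:05:00", "14:27:30")  # placeholder times
    load_watts = 300  # three 100-watt lamps
    print("Runtime: %.1f minutes at %d W" % (minutes, load_watts))
    print("Delivered: %.0f Wh" % (load_watts * minutes / 60.0))
</code></pre>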
<p>Of course, they want me to "recalibrate" it and try it again... so I agreed,
since I now have a way to make that only expend wall-clock time.</p>