Developer Guide

This document is a reference for developers of projects within the pubtools family.

Project setup

Naming

If you are introducing a new pubtools project, please follow these conventions:

  • For task libraries:
    • Name the project pubtools-<foo> where foo identifies the most relevant target system. (example: pubtools-pulp is for pushing content to Pulp.)

    • Put all Python code under the “pubtools” namespace.

    • (“Task library” means the scope of the project is to provide console_scripts-based tasks, as described in a later section.)

  • For any other kind of project:
    • If you are developing a project associated with pubtools but not a task library, there is no specific convention. Give the project any sensible name.

    • If the project can be of general use, you do not have to name it with a pubtools- prefix.

Target platforms

Projects should aim to be compatible with the following environments:

  • Python 3.9 and later

  • Red Hat flavored operating systems (RHEL, CentOS, Fedora)

Understanding tasks

Task interface

The interface to any pubtools task is a standard console_scripts entry point. See the setuptools manual for information on these.

Entry points by convention are named as <project_name>-<task_name>. For example, the pubtools-pulp project has a publish task which is defined in setup.py as:

entry_points={
  "console_scripts": [
    "pubtools-pulp-publish = pubtools._pulp.tasks.publish:entry_point",
  ]
},

Inputs

Tasks should accept all necessary inputs to perform their work via:

  • command-line arguments
    • read from sys.argv using any preferred method (e.g. argparse)

    • do not use for secrets

  • environment variables
    • read from os.environ as usual

    • recommended method of passing secrets

Note that reading data from local files is not recommended. This is because it breaks or at least complicates the Standalone vs Hosted paradigm: if you read inputs or configuration from files, you must somehow arrange for those files to be deployed onto the hosted environment and cleaned up when no longer needed.

There’s a notable exception to this: if you need to process very large amounts of data, you may find that the argument list becomes too long to launch a new process (see argmax).

This is not a problem within the hosted environment, since that doesn’t launch new processes for new tasks. However, argument limits may interfere with local development and testing. You may therefore wish to add some optional functionality to read from local files as a workaround to the limit, to be used primarily during development.

Outputs

Tasks should use the following approaches to produce output. (Note we refer to logging-style outputs here, not the primary side-effects a task aims to implement).

  • use the Logging system to log messages

  • use the pushcollector library to save additional (small) files and push item metadata
    • example: save a snapshot in JSON format of some remote resource prior to updating, to ensure an audit trail exists

Standalone vs Hosted

Tasks provided by pubtools libraries are designed to work within two different contexts, with significant differences between them. These are:

  • Invoked directly as command-line interface tools
    • We refer to this as Standalone execution.

  • Invoked from within a Python-based service using the “entry points” system
    • We refer to this as Hosted execution.

The following table summarizes the major differences between these contexts:

Standalone

Hosted

usage scenario

development, testing, emergencies

production

inputs

sys.argv populated by command-line arguments

sys.argv populated by the hosting service

secrets

environment variables set by caller

environment variables set by the hosting service

sys.stdout

connected to standard output

redirected to logger

logging

logging subsystem uses default (empty) configuration

loggers are configured prior to task execution

hooks

Hooks are available

hooks are available, hosting service may provide some @hookimpls

pushsource

pushsource library uses default (generic) configuration

pushsource library is bound to specific environments

pushcollector

pushcollector library uses default local backend

pushcollector library uses a backend specific to the hosting service

In principle, the pubtools family of projects are able to be integrated with any number of services. In practice, tasks are almost always hosted in one specific system known as Pub.

Argument conventions

In this section we document some common arguments supported across multiple task libraries.

If you need to support any of the functionality below, it is best to follow these conventions. Using consistent arguments across the projects helps to avoid errors while integrating tasks into Pub or other task hosting environments.

Arguments with multiple values

For cases where you want to accept multiple values as a list, it’s recommended to support both the following argument styles simultaneously:

--key val1 --key val2 --key val3

This style tends to be simpler for programmatic usage.

--key val1,val2,val3

This style is more user-friendly for humans.

(Of course, you may not accept this style if you need values which themselves contain the , character.)

--debug: enable debug logging

As explained at Avoid changing logger configuration, task libraries in general should refrain from configuring loggers; in production scenarios, this is the responsibility of the task hosting environment.

However, while locally developing and testing task libraries, there will commonly arise the need to enable verbose logging. If you want to enable this, it’s suggested to provide a --debug option with semantics:

no --debug

Enable root logger at INFO level.

--debug

Enable root logger at INFO level and this project’s logger at DEBUG level.

--debug more than once

Enable even more loggers at DEBUG level, perhaps even the root logger, depending on how may times the argument is used.

Keep in mind that this option is intended for local development or testing purposes only. The option should never be used when the task runs in a hosted environment.

--skip, --step: execute subset of task

Some tasks might be made up of several discrete steps, and you may want to allow callers to skip some of them in certain cases.

If so, it’s suggested to enable this via the following arguments:

--skip a,b,c

Don’t execute steps "a", "b" or "c".

--step a,b,c

Execute only steps "a", "b" and "c".

Here are some implementation guidelines for these arguments:

  • Follow the advice in Arguments with multiple values.

  • Don’t validate step names.
    • If passed any unknown steps, simply ignore them.

    • Validating the step names would make the interface more brittle (e.g. removing a step is a backwards-incompatible interface change).

    • If you’re familiar with ansible - compare with ansible tags, where it’s not an error to specify a tag matching zero tasks.

  • The target audience includes developers and power users.
    • By default, a task should do the right thing without having to be passed any --step or --skip.

    • It may be used during development to focus only on relevant parts of code.

    • It may be used in production as an emergency workaround to skip failing but non-critical portions of a task.

    • It’s not expected that all possible combinations of steps will work or shall be tested.

--source: obtain content via pushsource

If your task consumes content via the pushsource library, it’s strongly recommended to use the URL-based configuration mechanism from the library, and accept the URL via a --source argument:

--source <some-pushsource-url>

Use content from the specified source.

Accepting other arguments to configure the pushsource library is discouraged. Strictly using URLs for configuration will ensure that your task is forwards-compatible with new features and improvements in future versions of that library.

Logging

Libraries can make use of the standard logging module. Following the guidelines below will help ensure that your logging works effectively in both the standalone and hosted execution modes.

Use loggers as primary output mechanism

Use standard Logger objects for user-oriented output from your task. Don’t write to stdout or stderr.

(Note: although it’s mentioned earlier that stdout will be redirected to a logger within the hosted environment, this is intended as a last resort to avoid losing valuable information while debugging; relying on it is bad practice.)

Avoid changing logger configuration

When invoked in the hosted environment, loggers will already be configured by the time your task’s entry point is called. In this case, you must not adjust or overwrite the configuration of the loggers (particularly the root logger). Logger configuration is considered to be the responsibility of the hosting service; tasks should avoid interfering with it.

Note that the standard logging.basicConfig() function implements useful behavior: it installs basic log configuration writing to sys.stderr if and only if the root logger is not already configured.

Therefore, a simple and recommended way to set up loggers correctly for both the standalone and hosted case is to insert a call to basicConfig near the top of your task’s entry point, such as:

logging.basicConfig(level=logging.INFO)

Name loggers after your project

If your project is named pubtools-foo, you should use loggers under the pubtools.foo hierarchy. This ensures there is a simple and predictable way to control the logging for each project.

Use appropriate log levels

Log levels within the pubtools projects are typically used as follows:

DEBUG

Messages of interest to developers.

Hidden by default, only enabled when debugging problems. Log anything you may find useful while debugging.

INFO

Messages of interest to end-users.

This level and all later levels are enabled by default for pubtools.* loggers when executing in a hosted environment.

Should be verbose enough to understand any major side-effects of a task (e.g. writing data to a remote system), but not so verbose as to cause performance issues or make the log unreasonably noisy.

WARNING

Messages indicating a potential but non-fatal issue, or adding information which may explain the root cause of an error encountered elsewhere.

ERROR

Messages indicating that something has certainly gone wrong and some action is likely needed.

Refrain from using these levels for any recoverable errors (e.g. if task retries an operation, only use ERROR after all retries are exhausted).

Note that users will sometimes be concerned or confused by ERROR logs even if a task ultimately ends up succeeding. A useful rule of thumb is to ask yourself: when this code path is reached, do I want users to file a ticket/bug for me? If the answer is “No”, then ERROR may be too severe.

Structured logging via extra

The logging system supports adding structured metadata onto log events via the optional extra argument, which is a dict of arbitrary key-value pairs.

Within the pubtools family of projects we have the convention of an event attribute, of the following form:

{
  "event": {
    "type": "foo-happened",  # brief string identifying what just happened
    "key1": "value1",        # & then any arbitrary fields relating to
    "key2": "value2",        # this event (don't add timestamp, it's implied)
  }
}

If your project creates logging events using this structure, these may be collected by the hosting environment and used in various interesting ways (e.g. recorded into metrics-collecting systems).

It’s recommended to make use of this whenever a logged event might be interesting in order to programmatically monitor the performance or health of a service.

Here is an example from pubtools-pulplib, logging metadata on the Pulp task queue.

LOG.info(
  "Still waiting on Pulp, load: %s running, %s waiting",
  running_count,
  waiting_count,
  extra={
    "event": {
      "type": "awaiting-pulp",
      "running-tasks": running_count,
      "waiting-tasks": waiting_count,
    }
  },
)

Any string may be used as an event type, though a few conventions exist:

event.type

Usage

<x>-start

The current task has started a step named <x>.

<x>-end

Step <x> completed normally.

<x>-error

Step <x> returned early due to a fatal error.