The last 6 months have been…eventful. After dealing with some big life events, some expected and others not, I’ve found myself with time to get back into exploratory programming. I always have some side projects going, but it’s hard to make serious progress while also holding down a full-time day job.


Over the last year I’ve spent a lot of time using and hacking on Metaflow. Metaflow is a framework/library/CLI tool for building and deploying machine learning (ML) workflows. The project has a strong focus on usability and developer friendliness, and it shows. Metaflow can run locally on your laptop or on any of the 3 major clouds - AWS, GCP, Azure - seamlessly, with zero code changes. In the cloud, Metaflow supports AWS Step Functions and Kubernetes. The best part, from a developer’s perspective, is the coherent model the framework presents.

from metaflow import FlowSpec, step

class HelloWorldFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.do_something)

    @step
    def do_something(self):
        print("Hello, world!")
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    HelloWorldFlow()

A workflow is a class which inherits from metaflow.FlowSpec. This enables a bunch of clever metaprogramming which protects the workflow author from having to care about the nitty-gritty details of deployment and execution. Workflow steps are merely Python functions decorated with @step.
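The metaprogramming at work here is worth a quick sketch. This is not Metaflow’s actual implementation - the names `step`, `FlowMeta`, and `MiniFlow` below are purely illustrative - but it shows the general trick: a decorator tags methods, and the class machinery collects them into a registry at class-creation time.

```python
# Toy sketch of decorator-plus-metaclass step registration.
# `step`, `FlowMeta`, and `MiniFlow` are illustrative names,
# NOT Metaflow internals.

def step(fn):
    fn._is_step = True  # tag the method so the metaclass can find it
    return fn

class FlowMeta(type):
    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        # Collect every tagged method into a step registry on the class.
        cls._steps = {
            attr: obj for attr, obj in namespace.items()
            if getattr(obj, "_is_step", False)
        }
        return cls

class MiniFlow(metaclass=FlowMeta):
    pass

class GreetFlow(MiniFlow):
    @step
    def start(self):
        print("starting")

    @step
    def end(self):
        print("done")

# The "framework" now knows every step without the author registering
# anything by hand.
print(sorted(GreetFlow._steps))  # ['end', 'start']
```

With the registry in place, a real framework can build an execution plan, serialize step state between runs, and so on - all without the workflow author writing anything beyond decorated methods.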

# Local execution
> python hello_world.py run

# Remote execution on Kubernetes
> python hello_world.py run --with kubernetes

Metaflow is a deep subject and I’m only scratching the surface here. To learn more check out the project’s docs.

Improving Metaflow’s Model

tl;dr Writing a workflow in Python isn’t that hard. Managing, packaging, and distributing the workflow code can be very difficult.

There’s no denying Metaflow is great and one of the easiest orchestration frameworks to use. IMO, it has one flaw that makes using Metaflow in practice harder than strictly necessary: it’s written in Python. Before going further I want to say Python is a fine language suitable for solving many, many different kinds of problems. That doesn’t mean Python is perfect, though. The interpreter’s GIL (Global Interpreter Lock) is the best-known foible, but there are others.

The Python ecosystem sports a variety of project management tooling. pip. conda. virtualenv. poetry. pipenv. Picking one can be hard for someone who doesn’t keep up with the latest. A similar situation exists around packaging and distribution tooling. There are also subtle interactions between project management and packaging tools which add yet another layer of complexity to manage.


I thought it’d be interesting to prototype what I see as the next iteration of Metaflow’s core concepts in Elixir. BEAM languages have a solid concurrency story and improved performance thanks to the recently added JIT compiler. Elixir itself has excellent tooling for project management, code packaging, and distribution. Finally, thanks to the great folks building Livebook and the family of elixir-nx projects, Elixir’s ML support is rapidly improving.

I’ve started a GitHub repo, Dagger, and committed my first iteration of work this weekend. Workflows are Elixir modules and steps are Elixir functions declared using the defstep macro. Internally, Dagger constructs a DAG from the flow module’s source at compile time to reduce runtime overhead.

My first goal is to get the DAG representation right. Currently Dagger builds its DAG via Elixir metaprogramming facilities. I’m generally happy with its current form but want to incorporate support for step typespecs before changing focus.

defmodule HelloWorldFlow do
  use Dagger.Flow

  defstep start() do
    next_step(:hello_world)
  end

  defstep hello_world() do
    IO.puts("Hello, world!")
    next_step(:finish)
  end

  defstep finish() do
  end
end
Once compiled, a workflow module’s DAG can be retrieved via dag/0.

iex(1)> HelloWorldFlow.dag()
%Dagger.Graph{
  module: HelloWorldFlow,
  file_name: "/Users/kevsmith/repos/dagger/examples/hello_world.ex",
  sanitized_name: "hello-world-flow",
  steps: %{
    {HelloWorldFlow, :finish} => %Dagger.Graph.Step{
      line: 13,
      arity: 0,
      name: :finish,
      fun_name: :finish,
      inputs: [],
      return: :unknown,
      next: nil
    },
    {HelloWorldFlow, :hello_world} => %Dagger.Graph.Step{
      line: 8,
      arity: 0,
      name: :hello_world,
      fun_name: :hello_world,
      inputs: [],
      return: :unknown,
      next: :finish
    },
    {HelloWorldFlow, :start} => %Dagger.Graph.Step{
      line: 4,
      arity: 0,
      name: :start,
      fun_name: :start,
      inputs: [],
      return: :unknown,
      next: :hello_world
    }
  }
}
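The structure above translates naturally into any language with records and maps. Here’s a Python analogue - a sketch whose field names mirror Dagger’s Step struct, not Dagger’s actual code - plus a traversal that walks the DAG by following each step’s next pointer:

```python
# Python analogue of Dagger's compile-time DAG representation.
# Field names mirror the Dagger.Graph.Step struct; this is a sketch,
# not the real implementation.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Step:
    name: str
    arity: int = 0
    inputs: list = field(default_factory=list)
    next: Optional[str] = None  # name of the following step; None for finish

@dataclass
class Graph:
    module: str
    steps: dict  # step name -> Step

    def order(self):
        """Walk the DAG from start, following each step's next pointer."""
        chain, current = [], "start"
        while current is not None:
            chain.append(current)
            current = self.steps[current].next
        return chain

dag = Graph(
    module="HelloWorldFlow",
    steps={
        "start": Step("start", next="hello_world"),
        "hello_world": Step("hello_world", next="finish"),
        "finish": Step("finish"),
    },
)
print(dag.order())  # ['start', 'hello_world', 'finish']
```

Because the graph is plain data, a runner can traverse it, persist it, or ship it to a scheduler without ever importing the workflow’s code - which is exactly the property that makes compile-time extraction attractive.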

Dagger runs several validations to ensure the resulting DAG is well-formed:

  1. All workflows must have start/0 and finish/0 public functions.
  2. Other workflow steps can have unrestricted arity.
  3. Every step, except for finish/0, must call next_step/1.
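In Python terms, those validations boil down to a few structural checks over the step table. Here’s a sketch (names and error messages are illustrative, not Dagger’s), modeling the graph as a dict of step name to next-step name:

```python
# Sketch of Dagger-style DAG validation. The graph is a dict of
# step name -> next step name (None for finish); names and error
# messages are illustrative, not Dagger's.

def validate(steps: dict):
    # start/0 and finish/0 must both exist.
    for required in ("start", "finish"):
        if required not in steps:
            raise ValueError(f"DAG is missing the {required} step")
    # Every step except finish must name a next step, and that
    # step must actually exist in the graph.
    for name, next_step in steps.items():
        if name == "finish":
            continue
        if next_step is None:
            raise ValueError(f"step {name} never calls next_step/1")
        if next_step not in steps:
            raise ValueError(f"step {name} points at unknown step {next_step}")

validate({"start": "hello_world", "hello_world": "finish", "finish": None})  # ok
try:
    validate({"start": "hello_world", "hello_world": "finish"})
except ValueError as err:
    print(err)  # DAG is missing the finish step
```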

Dagger has 2 constraints on how execution is advanced:

  • next_step/1 must be invoked from the body of a step function.
  • next_step/1 cannot be called from within a conditional expression.

I hope to remove both constraints in the future. For now they’re necessary to simplify the initial implementation.
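To see why the conditional restriction helps: extracting a static DAG means finding the next_step call without running the code, and a call buried inside a conditional makes the successor ambiguous. Dagger does the equivalent against Elixir’s AST in its macros; here’s a hypothetical Python analogue of the same check using the ast module:

```python
# Sketch: statically detect whether next_step(...) appears directly in
# a function body rather than nested inside a conditional. A Python
# analogue of Dagger's compile-time constraint, not Dagger's code.
import ast
import textwrap

def top_level_next_step(src: str) -> bool:
    """True iff next_step(...) is a direct statement of the function
    body, not nested inside an if/loop/try."""
    fn = ast.parse(textwrap.dedent(src)).body[0]
    for stmt in fn.body:
        if (isinstance(stmt, ast.Expr)
                and isinstance(stmt.value, ast.Call)
                and isinstance(stmt.value.func, ast.Name)
                and stmt.value.func.id == "next_step"):
            return True
    return False

ok = """
def hello_world():
    print("Hello, world!")
    next_step("finish")
"""
bad = """
def hello_world():
    if flag:
        next_step("finish")
"""
print(top_level_next_step(ok))   # True
print(top_level_next_step(bad))  # False: the call is inside a conditional
```

A top-level call gives the analyzer exactly one unambiguous successor per step; lifting the restriction would mean either evaluating conditions at build time or supporting multiple outgoing edges per step.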

I’ve commented out finish/0 to illustrate Dagger’s compile-time validations in action.

> mix compile
Compiling 1 file (.ex)

== Compilation error in file examples/hello_world.ex ==
** (Dagger.MissingStepError) DAG Elixir.HelloWorldFlow is missing the finish step
    lib/dagger/graph.ex:104: Dagger.Graph.validate_required_steps!/1
    lib/dagger/graph.ex:48: Dagger.Graph.validate!/1
    lib/dagger/compiler.ex:69: Dagger.Compiler.finalize!/2
    expanding macro: Dagger.Flow.__before_compile__/1
    examples/hello_world.ex:1: HelloWorldFlow (module)

Next Steps

After the DAG representation is finished I’ll focus on local execution, followed by executing workflows on plain vanilla Kubernetes. I want to explore ways to reduce the complexity and overhead of working with containers, perhaps by composing container images on the fly. More on that soon. Later, Dagger will host a metadata server which will track workflow execution statistics and status.