Low-fat Skylark rules – saving memory with depsets

In my previous post on aspects, I used a Bazel aspect to generate a simple Makefile for a project. In particular, I passed a list of .o files up the tree like so:

  dotos = [ctx.label.name + ".o"]
  for dep in ctx.rule.attr.deps:
    # Create a new array by concatenating this .o with all previous .o's.
    dotos += dep.dotos

  return struct(dotos = dotos)

In a toy example, this works fine. However, in a real project, we might have tens of thousands of .o files across the build tree. Every cc_library would create a new array and copy every .o file into it, only to move up the tree and make another copy. It’s very inefficient.

Enter nested sets. Basically, you can create a set with pointers to other sets, which isn’t inflated until its needed. Thus, you can build up a set of dependencies using minimal memory.

To use nested sets instead of arrays in the previous example, replace the lists in the code with depset and |:

  dotos = depset([ctx.label.name + ".o"])
  for dep in ctx.rule.attr.deps:
    dotos = dotos | dep.dotos

Nested sets use | for union-ing two sets together.

“Set” isn’t a great name for this structure (IMO), since they’re actually trees and, if you think of them as sets, you’ll be very confused about their ordering if you try to iterate over them.

For example, let’s say you have the following macro in a .bzl file:

def order_test():
  srcs = depset(["src1", "src2"])
  first_deps = depset(["dep1", "dep2"])
  second_deps = depset(["dep3", "dep4"])
  src_and_deps = srcs | first_deps
  everything = second_deps | src_and_deps

  for item in everything:
    print(item)

Now call this from a BUILD file:

load('//:playground.bzl', 'order_test')
order_test()

And “build” the BUILD file to run the function:

$ bazel build //:BUILD
WARNING: /usr/local/google/home/kchodorow/test/a/playground.bzl:7:5: dep1.
WARNING: /usr/local/google/home/kchodorow/test/a/playground.bzl:7:5: dep2.
WARNING: /usr/local/google/home/kchodorow/test/a/playground.bzl:7:5: src1.
WARNING: /usr/local/google/home/kchodorow/test/a/playground.bzl:7:5: src2.
WARNING: /usr/local/google/home/kchodorow/test/a/playground.bzl:7:5: dep3.
WARNING: /usr/local/google/home/kchodorow/test/a/playground.bzl:7:5: dep4.

How did that code end up generating that ordering? We start off with one set containing src1 and src2:

Add the first deps:

And then create a deps set and add the tree we previously created to it:

Then the iterator does a postorder traversal.

This is just the default ordering, you can specify a different ordering. See the docs for more info on depset.

Custom, locally-sourced output filenames

Skylark lets you use templates in your output file name, e.g., this would create a file called target.timestamp:

touch = rule(
    outputs = {"date_and_time": "%{name}.timestamp"},
    implementation = _impl,
)

So if you had touch(name = "foo") in a BUILD file and built :foo, you’d get foo.timestamp.

I’d always used %{name}, but I found out the other day that you can actually use other attributes, too. For example, you could have:

greet = rule(
    attrs = {"my_name": attr.string()},
    outputs = {"greeting": "hi-there-%{my_name}"},
    implementation = _impl,
)

Then if you have greet(name = "a-greeting", my_name = "kristina") and build :a-greeting, you’ll get “hi-there-kristina” as an output file.

The entire source for this example is available as a GitHub gist (all four lines of implementation function not shown above).

Communicating between Bazel rules: how to use Skylark providers

Rules in Bazel often need information from their dependencies. My previous post touched on a special case of this: figuring out what a dependency’s runfiles are. However, Skylark is actually capable of passing arbitrary information between rules using a system known as providers.

Suppose we have a rule, analyze_flavors, that figures out what all of the flavors are in a dish. Our build file looks like:

load(":food.bzl", "analyze_flavors")

analyze_flavors(
    name = "burger",
    ingredients = [
        ":beef",
        ":ketchup",
    ],
)

analyze_flavors(
    name = "beef",
    tastes_like = "umame",
)

analyze_flavors(
    name = "ketchup",
    tastes_like = "sweet",
)

We want to build up a flavor profile for :burger, based on its ingredients.

To do this, food.bzl looks like:

def _flavor_impl(ctx):
  # Build up a flavor profile from this rule & its ingredients.
  flavor_profile = []
  for ingredient in ctx.attr.ingredients:
    if ingredient.flavor != None:
      flavor_profile += ingredient.flavor

  if ctx.attr.tastes_like != "":
    flavor_profile += [ctx.attr.tastes_like]

  # Write the list of flavors to a file.
  ctx.file_action(
    output = ctx.outputs.out,
    content = "%s tastes like %sn" % (
        ctx.label.name, " and ".join(flavor_profile))
  )

  # Return the list of flavors so it can be used by rules that depend on this.
  return struct(flavor = flavor_profile)

analyze_flavors = rule(
    attrs = {
        "ingredients": attr.label_list(),
        "tastes_like": attr.string(),
    },
    outputs = {"out": "flavors-of-%{name}"},
    implementation = _flavor_impl,
)

The highlighted lines are where the rule returns a provider, flavor, to be consumed by its reverse dependencies (the targets depending on it).

Our BUILD file gives us the following build graph:

graph

:burger depends on :beef and :ketchup. :beef and :ketchup each provide :burger with a flavor. Thus, if we build :burger and check its output file, we get:

$ bazel build :burger

INFO: Found 1 target...
Target //:burger up-to-date:
  bazel-bin/flavors-of-burger

INFO: Elapsed time: 0.270s, Critical Path: 0.00s
INFO: Build completed successfully, 2 total actions
$ cat bazel-bin/flavors-of-burger
burger tastes like umame and sweet

This can be used to communicate rich information from rule-to-rule in Skylark. See the Skylark cookbook for another example of providers.

Collecting transitive runfiles with skylark

Bazel has a concept it calls runfiles for files that a binary uses during execution. For example, a binary might need to read in a CSV, an ssh key, or a .json file. These files are generally specified separately from your sources for a couple of reasons:

  • Bazel can understand that it is a runtime, not compile dependency (so if the runfiel changes, the binary does not need to be rebuilt).
  • The type is less restrictive: most rules have restrictions on what its sources can “look like” (e.g., Java sources end in .java or .jar, Go sources end in .go, Python sources end in .py or .pyc, etc.).

Thus, these runfiles are often specified in a separate data attribute.

If you’re writing a skylark rule that combines several executables, you will probably want the skylark rule to also combine the runfiles for all of them. Let’s create a rule that can combine several executables and include all of their runfiles. As a toy example, I created a rule below that creates an executable. The rule has one attribute, data, that can be other files or rules. The executable, when run, will just list all of the runfiles it has available.

For example, suppose you had the following BUILD file:

list_runfiles(
    name = 'main-course', 
    data = ['lasagna.txt']
)

If we ran bazel run :main-course, it would print:

./my_workspace/main-course
./my_workspace/lasagna.txt
./MANIFEST

However, we can also provide list_runfiles targets as data. For example, our BUILD file could say:

list_runfiles(
    name = 'main-course', 
    data = [
        'lasagna.txt',
        ':side-dishes',
        ':drink',
    ]
)

list_runfiles(
    name = 'side-dishes', 
    data = [
        ':soup',
        ':salad',
    ]
)

list_runfiles(
    name = 'soup', 
    data = ['gazpacho.txt']
)

list_runfiles(
    name = 'salad', 
    data = ['waldorf.txt']
)

list_runfiles(
    name = 'drink', 
    data = ['milk.txt']
)

Then running bazel run :main-course will print:

.
./my_workspace
./my_workspace/side-dishes
./my_workspace/waldorf.txt
./my_workspace/salad
./my_workspace/milk.txt
./my_workspace/main-course
./my_workspace/lasagna.txt
./my_workspace/gazpacho.txt
./my_workspace/soup
./my_workspace/drink
./MANIFEST

:main_course has collected all of its transitive runfiles in its runfiles tree.

Here’s the list_runfiles rule definition:

def _list_runfiles_impl(ctx): 
  ctx.file_action(
    output = ctx.outputs.executable,
    content = 'n'.join([
        "#!/bin/bash",   
        "cd $0.runfiles",
        "find"
    ]),
    executable = True)
  return struct(runfiles = ctx.runfiles(collect_data = True))

list_runfiles = rule(
    attrs = {
        "data": attr.label_list(
            allow_files = True,
            cfg = DATA_CFG,
        ),
    },
    executable = True,
    implementation = _list_runfiles_impl,
)

The key is the line is runfiles = ctx.runfiles(collect_data = True). collect_data will automatically “harvest” the runfiles from data, srcs, and deps attributes. We’ve only defined data for this rule, so that’s what it will use.

Note that this won’t work if you change "data": attr.label_list( to something not covered by collect_data, e.g., "stuff": attr.label_list(. (Theoretically this should be covered by transitive_files, but I couldn’t actually get that working.)

Runfiles