Why use Hamilton?¶

There are many choices for building dataflows/pipelines/workflows/ETLs. Let’s compare Hamilton to some of the other options to help answer this question.

Comparison to Other Frameworks¶

There are a lot of frameworks out there, especially in the pipeline space. This section should help you figure out when to use Hamilton alongside another framework, when to use it in place of one, and when another framework is the better fit altogether.

Let’s go over some groups of “competitive” or “complementary” products. For a basic overview, see the product matrix on the homepage.

Orchestration Systems¶

Examples include Airflow, Metaflow, Prefect, and Dagster.

Hamilton is not, in itself, a macro (i.e. high-level) task orchestration system. While it does orchestrate functions, and the DAG abstraction is very powerful, it does not provision compute or schedule long-running jobs. Hamilton works well in conjunction with these macro systems, and it provides capabilities that many of them lack: fine-grained lineage, highly readable code, and self-documenting pipelines.

Hamilton can be used within any Python orchestration system in the following ways:

  1. Hamilton DAGs can be called within orchestration system tasks. See the Hamilton + Airflow example. The integration is generally trivial: all you have to do is call out to the Hamilton library within your task. If your orchestrator supports Python, you’re good to go. Some pseudocode (if your orchestrator handles scripts, as Airflow does):

    # my_task.py
    from hamilton import driver

    import my_transformations  # module that defines your Hamilton functions

    dr = driver.Driver({}, my_transformations)  # config dict + function module(s)
    output = dr.execute(['final_var'], inputs=...)  # inputs elided -- pseudocode
    do_something_with(output)
    
  2. Hamilton DAGs can be broken up to run as components within an orchestration system. Because the driver accepts overrides, you can run part of the DAG in each task, overriding node values with the outputs of the previous task (plus any static inputs/configuration), and pass the results on to the next task. This is more of a manual/power-user feature. Some pseudocode:

    # my_task.py
    from hamilton import driver

    import my_functions  # module that defines your Hamilton functions

    prior_inputs = load_relevant_task_results()  # outputs of the previous task
    desired_outputs = ['final_var_1', 'final_var_2']
    dr = driver.Driver({}, my_functions)
    output = dr.execute(
        desired_outputs,
        inputs=...,  # static inputs/configuration -- elided, pseudocode
        overrides=prior_inputs,  # short-circuits recomputing upstream nodes
    )
    save_for_later(output)
    

Feature Stores¶

Examples include Feast, Tecton, and Hopsworks.

One can think of Hamilton as being your “feature definition store”, where the “store” is code + git. While it does not provide all the capabilities of a standard feature store, it provides a source of truth for the code that generated the features, and that code can be run portably. So, if your desire is simply to run the same code in different environments, and to have an online/offline store of features, you can use Hamilton both to save the features offline and to generate features online on the fly.

See the feature engineering example for more possibilities, as well as our blog posts on the topic.

Note that at a small scale you probably don’t need a true feature store – recomputing derived features both in an ETL and online can be perfectly efficient, as long as you have some database to look base values up in (or have them passed in).

Also note that joins and aggregations can get tricky. We often recommend using our “polymorphic function definitions”, i.e. functions decorated with @config.when, to either load non-online-friendly features from a feature store or do an external lookup to simulate an online join.
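
For illustration, here is a minimal sketch of that pattern (the function, column, and helper names are hypothetical; @config.when and the __suffix naming convention are Hamilton features):

    # features.py
    import pandas as pd

    from hamilton.function_modifiers import config


    @config.when(mode='offline')
    def account_age__offline(accounts: pd.DataFrame) -> pd.Series:
        """Batch case: compute the feature from a warehouse extract."""
        return (pd.Timestamp.now() - accounts['created_at']).dt.days


    @config.when(mode='online')
    def account_age__online(account_id: int) -> pd.Series:
        """Online case: simulate the join with an external lookup."""
        return lookup_account_age(account_id)  # hypothetical lookup against your online store

Both functions resolve to the same node name (account_age); the config you pass to the driver (e.g. {'mode': 'online'}) selects which implementation runs.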

We expect Hamilton to play a prominent role in the way feature stores work in the future.

Data Science Ecosystems/ML platforms¶

Examples include Databricks, Dataiku, and Amazon SageMaker.

We’ve grouped a whole suite of platforms into the same bucket here; they tend to offer many capabilities, all related to ML. Hamilton can be used in conjunction with these platforms in a variety of ways. For example, you can use Hamilton to generate features for a model that you train in one of these platforms, or you can use Hamilton to generate a model using the platform’s compute and then save it to the platform’s registry.
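
As a rough sketch of the first pattern (the module, output names, and platform calls below are all hypothetical stand-ins):

    from hamilton import driver

    import feature_module  # your Hamilton feature definitions (hypothetical)

    dr = driver.Driver({}, feature_module)
    result = dr.execute(['feature_matrix', 'label'], inputs={'raw_data_path': ...})
    # hand the outputs to whatever platform you train on:
    model = platform_train(result['feature_matrix'], result['label'])  # hypothetical platform call
    platform_registry.save(model)  # hypothetical registry call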

Registries / Experiment Tracking¶

Examples include MLflow, Weights & Biases, and Neptune.

Most pipelines have a “reverse ETL” problem – they need to get the results of the pipeline into some sort of datastore or registry. Hamilton can be used in conjunction with these tools as the glue code that ties everything together. For example, you can use Hamilton to generate a model and then store metrics computed by Hamilton in one of these “destinations”.

There are three main ways to integrate with these tools:

  • inside a function that Hamilton orchestrates (see the sketch after this list)

  • outside Hamilton (e.g. in a script that calls Hamilton)

  • using “materializers” (see the materializers documentation and this blog).
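
For the first approach, here is a minimal sketch, assuming MLflow as the destination (the function and parameter names are hypothetical; mlflow.start_run and mlflow.log_metric are standard MLflow APIs):

    # metrics.py
    import mlflow


    def logged_accuracy(accuracy: float, mlflow_run_name: str) -> float:
        """Pushes a metric computed upstream in the DAG to MLflow as a side effect."""
        with mlflow.start_run(run_name=mlflow_run_name):
            mlflow.log_metric('accuracy', accuracy)
        return accuracy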

See this ML reference post for examples of how to use Hamilton with these tools.

Python Dataframe/manipulation Libraries¶

Examples include pandas, Polars, and Ibis.

Hamilton works with any Python dataframe/data-manipulation library. See our examples folder for how to use Hamilton with these libraries.
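
To give a flavor, here is a minimal sketch of what pandas-based Hamilton functions look like (the file, column, and function names are hypothetical):

    # my_transformations.py
    import pandas as pd


    def base_df(data_path: str) -> pd.DataFrame:
        """Loads the raw data; 'data_path' is supplied as a driver input."""
        return pd.read_csv(data_path)


    def spend_per_signup(base_df: pd.DataFrame) -> pd.Series:
        """A derived column: Hamilton wires in 'base_df' by parameter name."""
        return base_df['spend'] / base_df['signups']

A driver pointed at this module can then request spend_per_signup directly, and Hamilton will load and transform the data in dependency order.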

Python “big data” systems¶

These are the systems you would reach for when you want to scale up your data processing.

Examples include Spark (PySpark), Dask, and Ray.

These all provide capabilities to either (a) express and execute computation over datasets in Python, or (b) parallelize it – often both. Hamilton has a variety of integrations with these systems. The basic idea is that Hamilton can delegate execution of the DAG to them via the GraphAdapter abstraction and Lifecycle Hooks.
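
As an illustrative sketch, assuming Dask as the backend (the adapter’s module path and constructor vary across Hamilton versions, so treat the details below as approximate):

    from dask import distributed

    from hamilton import base, driver
    from hamilton.plugins import h_dask  # older releases expose this under hamilton.experimental

    import my_functions  # your Hamilton transform module (hypothetical)

    client = distributed.Client(processes=False)  # a local Dask "cluster" for testing
    adapter = h_dask.DaskGraphAdapter(client, base.PandasDataFrameResult())
    dr = driver.Driver({}, my_functions, adapter=adapter)
    output = dr.execute(['final_var'])  # nodes now execute as Dask tasks
    client.close()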

See our examples folder to see how to use Hamilton with these systems.