Monday, November 28, 2011

Writing testable ETL processes with Rhino-ETL

On a recent project we had to integrate several external data sources into the application database. Some of these sources were csv files, while another one was an Oracle database. Our application, meanwhile, used a SQL Server database. Furthermore some of the data had to be loaded automatically, while other data had to be loaded by a business user. We tossed around several ideas for how to accomplish this: SSIS, Informatica, something custom built. We finally arrived on Ayende Rahien's (aka Oren Eini) ETL framework Rhino-ETL. It looked like it would fit the bill as it could be integrated into any .NET application and it allowed us to write unit and integration tests around it.

Unfortunately there is a dearth of information around how to use this framework. The only good piece of documentation is this great video tutorial. While a good starting point, I thought I'd write a blog to show how to get started with the framework.

Rhino ETL is based on a pipeline concept. Each process consists of a bunch of operations strung together. Each operation's output is fed into the next operation as input. The operations interact by the following method:

public interface IOperation : IDisposable
{
    ...
    IEnumerable<Row> Execute(IEnumerable<Row> rows);
    ...
}

So each operation takes in a collection of Rows and outputs a collection of Rows. A Row is basically an instance of Dictionary<object, object> with an added twist: no exceptions are thrown if a key does not exist. A Row will simply return null when a key does not exist in the dictionary. Each operation can also leverage the oft unused C# keyword of yield return. A typical operation might looks something like this:

public class MyOperation : AbstractOperation
{
    public override IEnumerable<Row> Execute(IEnumerable<Row> rows)
    {
        foreach (var row in rows)
        {
            //process the row
            yield return row;
        }
    }
}

The yield return keyword instructs the compiler to generate an iterator that, for each row, would execute whatever code you had in the foreach statement. This allows us to defer execution of each row to the point in time when it passes through the operation pipeline.

Since your operations expose the Execute method, you can easily write unit tests around each operation, with any kind of data as input. Voila, by using this framework and developing a good suite of unit tests, your ETL process has become far more robust than if you were using a tool like SSIS.

The main thing to note about Rhino-ETL is that you will have to code all of your operations. Typically your operations will inherit from the AbstractOperation class. Some other useful operations:
SqlBulkInsertOperation
Set up the schema and do a bulk insert into a table. Useful for inserting large data sets quickly and efficiently
SqlBatchOperation
Batch your sql operations to reduce server roundtrips
BranchingOperation
Send rows to multiple operation pipelines
JoinOperation
Join your rows to a result set from another operation
PartialProcessOperation
Typically used with BranchingOperations to create an operation pipeline within a branch

A process is typically created by inheriting from the EtlProcess class.

public class MyProcess : EtlProcess
{
    protected override Initialize()
    {
        Register(new MyOperation());
        //register other operations
    }
}

A final note, in order to get any output from your etl process you will need to set log4net up. A simple console output can be created by the following entry in your App.config:

<log4net debug="false">
    <appender name="console" type="log4net.Appender.ConsoleAppender, log4net">
      <threshold value="WARN" />
      <layout type="log4net.Layout.PatternLayout,log4net">
        <param name="ConversionPattern" value="%d [%t] %-5p %c - %m%n"/>
      </layout>
    </appender>
    <root>
      <level value="INFO" />
      <appender-ref ref="console"/>
    </root>
  </log4net>

and by adding the following line of code to your application startup:

log4net.Config.XmlConfigurator.Configure();

To conclude, while you do give up fancy designers and useful operations like SSIS's Fuzzy Lookup, the benefit gained from being able to write unit tests around your ETL process can be invaluable. Due to its simplicity and the fact that it can be easily integrated into other .NET applications, Rhino ETL is a very useful tool to have in your arsenal.