
Many times in my career I’ve worked on Extract Transform Load (ETL) projects. Often they have grown organically through stored procedures, take hours to run and are fragile.

Below are a few questions my university friend challenged me to answer on an IMDB dataset.

I decided to focus on the veracity of the answers over speed, as ultimately that was the big unknown in the systems I've worked on:

Is the ETL package doing the right thing / what is it missing?

And, paraphrasing a Nick Craver tweet:
 

My general approach is the dumbest, simplest thing that works well

 

Source code, data and answers for this article are here, and the IMDB dataset was extracted from here

Sample Questions

  1. How many female actors are listed
  2. The number of female actors and the number of male actors as a single query
  3. The movie titles and number of directors involved for movies with more than 6 directors

Strategy

The ‘source of truth’ data comes from CSV files (see the source project data directory), so the first thing to do is protect these files and get them under source control.

Then I imported the CSV data with Excel (Data, Import), saving as xlsx (included in the source project data directory), and drew a rough DB diagram on paper to trial the relationships.

I used SSMS database diagrams to enter the schema and the simple one-to-many (i.e. link table) Foreign Key constraints.


I used a SQL Server Database Project in Visual Studio to keep the schema under source control.

  • C# with CsvHelper to load in the delimited text files (Extract)
  • C# to analyse the data for anomalies and fix (Transform)
  • C# with Dapper to insert (Load)
  • SQL to answer questions once the verified clean data was loaded
  • C# to answer the harder questions, favouring clarity over speed; iterative code is fine to start with


DB project in Visual Studio comparing source control to the (localdb)\mssqllocaldb database

DB Schema

I used the same schema and naming convention as the source even though I don’t like it: Link to my article on db naming

Some observations were:

  • Composite keys in link tables
  • Capitalisation of column names

ETL with C#

  • Extract from CSV
  • Transform and explore - do we need any data cleansing?
    • whitespace where it shouldn’t be
    • unusual characters - mitigated by using UTF-8 in the db and parameterising all queries
    • custom validation
  • Load using Dapper

Below is the salient code:


using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient; // or Microsoft.Data.SqlClient, depending on the package used
using System.Diagnostics;
using System.Globalization;
using System.IO;
using System.Linq;
using CsvHelper;
using Dapper;

var sw = Stopwatch.StartNew();
using var db = Util.GetOpenConnection();

// A simple flag to switch parts of the ETL load on and off (some parts took a while)
bool writeToDb = true;
if (writeToDb)
{
    // clear down db first in correct order
    db.Execute("DELETE FROM MoviesToDirectors");
    // snip..
    db.Execute("DELETE FROM Actors");
}

// 1.Extract Actors
var actors = LoadActorsFromCsv();
Console.WriteLine($"Total Actors imported from csv is {actors.Count}"); // 98,690

// Transform
foreach (var actor in actors)
{
    // check for leading or trailing whitespace
    if (ListStringsContainLeadingOrTrailingWhitespace(new List<string>
        {actor.actorid, actor.name, actor.sex}))
        Console.WriteLine($"whitespace in {actor.actorid}");

    // check for interesting/bad/unusual characters that could affect results eg LF?..
    // all queries are parameterised so no probs with ' chars
    // okay due to db handling unicode UTF-8 and using nvarchar to hold strings

    // Custom validation
    if (!IsSexACapitalMOrF(actor.sex))
        Console.WriteLine("non M or F in Actor sex column");

    var sql = @"
    INSERT Actors
    VALUES (@actorid, @name, @sex)";
    // don't need INSERT INTO
    // don't need to specify column names
    // dapper will map column names from actor object
    if (writeToDb) db.Execute(sql, actor);
}

Console.WriteLine($"done in {sw.Elapsed.TotalSeconds}");

// snip...

private static List<Actor> LoadActorsFromCsv()
{
    using var reader = new StreamReader("..\\..\\..\\..\\..\\data\\actors.csv");
    using var csv = new CsvReader(reader, CultureInfo.InvariantCulture);
    csv.Configuration.Delimiter = ";";
    return csv.GetRecords<Actor>().ToList();
}

public class Actor
{
    // favouring the simplest data type string in this load
    // until I understand the data (ie what edge cases are there)
    public string actorid { get; set; }
    public string name { get; set; }
    public string sex { get; set; }
}

public class Util
{
    public static IDbConnection GetOpenConnection()
    {
        var connection = new SqlConnection(  
            "Server=(localdb)\\mssqllocaldb;Database=IMDBChallenge;Trusted_Connection=True;"
            + "MultipleActiveResultSets=true");
        connection.Open();
        return connection;
    }
}

These load scripts are very useful, as I can rebuild the database on any machine from a clone of source control. This is a very simple load script with no thought given to insert performance; I could use SqlBulkCopy and FastMember if Dapper isn’t fast enough. Using .AsParallel() above didn’t produce any gains, and sometimes actually hurt performance.
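If these row-by-row Dapper inserts ever did become the bottleneck, a minimal SqlBulkCopy plus FastMember sketch might look something like this (reusing the Actors table and the Actor list from the load above; untested here):

using FastMember;

// stream the in-memory Actor list straight into the Actors table
using var connection = (SqlConnection)Util.GetOpenConnection();
using var bulkCopy = new SqlBulkCopy(connection) { DestinationTableName = "Actors" };
using var actorReader = ObjectReader.Create(actors, "actorid", "name", "sex");

// map the Actor properties onto the table columns explicitly
bulkCopy.ColumnMappings.Add("actorid", "actorid");
bulkCopy.ColumnMappings.Add("name", "name");
bulkCopy.ColumnMappings.Add("sex", "sex");

bulkCopy.WriteToServer(actorReader);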

There is a MiniProfiler for console apps which looks useful if you want to see performance timings alongside good logging.
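I haven’t wired it in here, but the basic console pattern from the MiniProfiler docs looks roughly like this (the step names are just placeholders):

using StackExchange.Profiling;

var profiler = MiniProfiler.StartNew("IMDB ETL");

using (profiler.Step("Extract actors"))
{
    var actors = LoadActorsFromCsv();
}

// dump the timing tree to the console at the end of the run
Console.WriteLine(profiler.RenderPlainText());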

Serilog Logging

Serilog is my logger of choice. Here are the Serilog configuration basics.

Log.Logger = new LoggerConfiguration()
     .WriteTo.Console()
     .WriteTo.File("../../../logs/log-.txt", rollingInterval: RollingInterval.Day)
     .CreateLogger();
Log.Information("Starting");

Serilog console output showing a destructured object

Preserving Object Structure - the @ destructuring operator is useful, as shown above in the output and below in code:

foreach (var r in ratings)
{
    if (ListStringsContainLeadingOrTrailingWhitespace(new List<string> { r.movieid, r.rank.ToString(), r.votes.ToString(), r.distribution }))
        // @Destructuring my own Rating type (r)
        Log.Warning("Ratings whitespace somewhere in {@R} ", r);
}

Unit Tests

I write xUnit parameterised tests for the functions that do the checking. Easy to do when you’re in C#.


[Theory]
[InlineData(null, true)]
[InlineData("", true)]
[InlineData("              ", true)]
[InlineData("asdf", false)]
[InlineData(" asdf", true)]
[InlineData("asdf ", true)]
[InlineData(" asdf  ", true)]
[InlineData("asdf asdf", false)]
public static void ContainsLeadingOrTrailingWhitespace_Tests(string s, bool expected)
{
    var result = ContainsLeadingOrTrailingWhitespace(s);
    Assert.Equal(expected, result);
}

Errors and Logging

Try/catch and log out to Serilog. Often I’ll use a railway-oriented style pipeline, i.e. the data has to go through all the checks before it is allowed into the db. If it goes off on a branch line, it is not allowed into the db; it must go to the errors table.
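As a rough sketch of that idea applied to the actor load (the Errors table and the Validate helper here are hypothetical; the checks are the ones from the Transform step above):

// each row rides the happy rails through every check; any failure diverts it
// to an Errors table instead of the Actors table
foreach (var actor in actors)
{
    var problem = Validate(actor); // null means every check passed
    if (problem == null)
    {
        db.Execute("INSERT Actors VALUES (@actorid, @name, @sex)", actor);
    }
    else
    {
        Log.Warning("Rejected actor {ActorId}: {Problem}", actor.actorid, problem);
        db.Execute("INSERT Errors VALUES (@Source, @SourceId, @Problem)",
            new { Source = "Actors", SourceId = actor.actorid, Problem = problem });
    }
}

static string Validate(Actor a)
{
    if (ListStringsContainLeadingOrTrailingWhitespace(new List<string> { a.actorid, a.name, a.sex }))
        return "leading or trailing whitespace";
    if (!IsSexACapitalMOrF(a.sex))
        return "sex column is not a capital M or F";
    return null; // made it down the happy path
}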

CsvHelper Errors

If you have problems in the source of truth then you have to be able to trap them.

How to dump out problem rows in the CSV which cannot be read

This strategy catches any issues that CsvHelper may have with the input, e.g. malformed rows.
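With the older CsvHelper Configuration API used in the load above, one way to do that is the BadDataFound callback; a sketch:

using var reader = new StreamReader("..\\..\\..\\..\\..\\data\\actors.csv");
using var csv = new CsvReader(reader, CultureInfo.InvariantCulture);
csv.Configuration.Delimiter = ";";

// log (or divert to an errors file/table) any row CsvHelper cannot parse,
// rather than losing it silently
csv.Configuration.BadDataFound = context =>
    Log.Warning("Bad CSV row: {RawRecord}", context.RawRecord);

var actors = csv.GetRecords<Actor>().ToList();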

Querying with SQL

SQL is great for answering some of the easier questions once the data is loaded. Some notes to my future self (with an example query after the list) include:

  • count(1) vs count(*) - often claimed to differ in performance, but SQL Server optimises both the same way
  • group by having - interesting syntax
  • multiple left join pipeline strategy
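For example, question 3 falls out of a join plus GROUP BY ... HAVING, assuming a Movies table with movieid and name columns alongside the MoviesToDirectors link table cleared in the load script above:

SELECT m.name, COUNT(1) AS NumberOfDirectors
FROM Movies m
JOIN MoviesToDirectors md ON md.movieid = m.movieid
GROUP BY m.movieid, m.name
HAVING COUNT(1) > 6;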

Exploring data with C#

I find that my functional, set-based SQL skills wane over time, and it is easier to start thinking about a problem in an iterative manner first (i.e. get it working first, then work on performance later).

As we have the load scripts, we can use these data structures and LINQ to analyse the data and answer the harder questions.

I found a good strategy to be: think of the data structure I want to query to get the answer, then work back:

class Foo
{
    public string maleActorID { get; set; }
    public List<Bar> listBars { get; set; }
}

class Bar
{
    public string femaleActorIDworkedWith { get; set; }
    public List<string> movieIDsWorkedOnWithThisFemaleActor { get; set; }
}
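For example, assuming a casts list loaded from the movies-to-actors CSV in the same way as the actors (the Cast class below is hypothetical), the Foo/Bar structure can be built iteratively with a couple of lookups:

// Hypothetical link record loaded from the movies-to-actors CSV,
// in the same style as the Actor class above.
public class Cast
{
    public string actorid { get; set; }
    public string movieid { get; set; }
}

// Build the Foo/Bar structure iteratively: for each male actor, the female
// actors he worked with and the movies they shared.
var sexById = actors.ToDictionary(a => a.actorid, a => a.sex);
var moviesByActor = casts.GroupBy(c => c.actorid)
    .ToDictionary(g => g.Key, g => g.Select(c => c.movieid).ToList());
var actorsByMovie = casts.GroupBy(c => c.movieid)
    .ToDictionary(g => g.Key, g => g.Select(c => c.actorid).ToList());

var foos = new List<Foo>();
foreach (var male in actors.Where(a => a.sex == "M"))
{
    if (!moviesByActor.TryGetValue(male.actorid, out var movieIds)) continue;

    // female co-actor id -> movie ids shared with this male actor
    var shared = new Dictionary<string, List<string>>();
    foreach (var movieId in movieIds)
        foreach (var coActorId in actorsByMovie[movieId])
            if (sexById.TryGetValue(coActorId, out var sex) && sex == "F")
            {
                if (!shared.TryGetValue(coActorId, out var sharedMovies))
                    shared[coActorId] = sharedMovies = new List<string>();
                sharedMovies.Add(movieId);
            }

    foos.Add(new Foo
    {
        maleActorID = male.actorid,
        listBars = shared.Select(kvp => new Bar
        {
            femaleActorIDworkedWith = kvp.Key,
            movieIDsWorkedOnWithThisFemaleActor = kvp.Value
        }).ToList()
    });
}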

Conclusion

This is a simple strategy, but an extremely effective one if you are a C# developer. Most of the issues I’ve seen with ETL packages have been with DB admins starting them, or people just trying their best.

There are many, many tools out there to help you; however, I would always advise that the simpler, the better.