Inside the Data Seeder — YAML, Visitors, and Event-Sourced Aggregates

In a previous post, I wrote about why we built a structured data seeder and the range of problems it solves. This post goes one level deeper: how the pipeline is actually wired together.

The short version: YAML files describe the seed data, C# DTOs hold it, the visitor pattern transforms each record into a domain aggregate, and repositories persist everything through the proper event-sourced channels. Let me walk through each layer.

The Pipeline at a Glance

YAML file → DataSeedReader → YamlDotNet deserializes → SeedDto
  → SeederBase processes each item
    → SeedVisitor per entity type
      → domain aggregate instantiated
        → DbInitializer persists via repository

Every tenant seed starts with an HTTP request: POST /tenants/{id}/run-seeder. The handler triggers AppDbInitializer, which in turn calls four domain-specific initializers (accounting, operations, inventory, catalog). Each initializer reads its YAML, deserializes it, runs its visitors, and persists the result.
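To make the fan-out concrete, here is a minimal sketch of one entry point delegating to four domain initializers. Everything in it is an assumption for illustration: the IDomainDbInitializer interface, the RecordingInitializer stand-in, and the signature of InitializeFor() are hypothetical, and the real AppDbInitializer almost certainly does more.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical contract: one initializer per domain area.
public interface IDomainDbInitializer
{
    string Domain { get; }
    void InitializeFor(Guid tenantId);
}

// Stand-in for a real domain initializer. It only records that it ran,
// where the real one would read YAML, run visitors, and persist.
public sealed class RecordingInitializer : IDomainDbInitializer
{
    private readonly List<string> _log;

    public RecordingInitializer(string domain, List<string> log)
    {
        Domain = domain;
        _log = log;
    }

    public string Domain { get; }

    public void InitializeFor(Guid tenantId) => _log.Add(Domain);
}

// Fans a single tenant seed request out to every domain initializer, in order.
public sealed class AppDbInitializerSketch
{
    private readonly IReadOnlyList<IDomainDbInitializer> _initializers;

    public AppDbInitializerSketch(IReadOnlyList<IDomainDbInitializer> initializers)
        => _initializers = initializers;

    public void InitializeFor(Guid tenantId)
    {
        foreach (var initializer in _initializers)
            initializer.InitializeFor(tenantId);
    }
}
```

Keeping each domain behind the same small interface means the top-level initializer never needs to know what a domain seeds, only that it can be asked to.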

YAML and DTOs

Each domain area has an all_data.yml file per dataset. The top-level keys in the YAML must match the property names on the corresponding *SeedDto class exactly.

CreditTerms:
  - Id: e0e93ca7-620e-4230-b8f5-312cd00d10b4
    Name: NET7
    Availability:
      - Customer
      - Vendor
    DueInDays: 7

Companies:
  - Id: cec21f16-4556-49b2-bea4-266b8534ce96
    Name: Super Freight Company
    BusinessRelationships:
      - Vendor

IDs in the YAML are stable GUIDs. They’re permanent once assigned, because they may be referenced as cross-dependencies inside visitors or by existing data. Names must be unique within a list, because that’s how the visitor helpers look up related records.

The DTOs live in a separate project that gets published as a NuGet package. That’s not an accident: external systems (legacy data export APIs, for example) need to produce YAML that the seeder can consume, so the DTO contracts have to be independently distributable.
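As a rough illustration of how the YAML above could map onto DTOs, here is a sketch whose property names mirror the YAML keys one-to-one. The class names, the BusinessRelationship enum, the ExampleSeedDto container, and the members of IEntityDtoSeed are all assumptions, not the real contracts from the NuGet package. The DuplicateNames helper is also hypothetical; it just shows one way to enforce the rule that names are unique within a list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of the DTO interface; the real IEntityDtoSeed may differ.
public interface IEntityDtoSeed
{
    Guid Id { get; }
    string Name { get; }
}

public enum BusinessRelationship { Customer, Vendor }

// Mirrors one entry under the CreditTerms key.
public sealed class CreditTermDto : IEntityDtoSeed
{
    public Guid Id { get; set; }
    public string Name { get; set; } = "";
    public List<BusinessRelationship> Availability { get; set; } = new();
    public int DueInDays { get; set; }
}

// Mirrors one entry under the Companies key.
public sealed class CompanyDto : IEntityDtoSeed
{
    public Guid Id { get; set; }
    public string Name { get; set; } = "";
    public List<BusinessRelationship> BusinessRelationships { get; set; } = new();
}

// Top-level container: each property name matches a top-level YAML key exactly.
public sealed class ExampleSeedDto
{
    public List<CreditTermDto> CreditTerms { get; set; } = new();
    public List<CompanyDto> Companies { get; set; } = new();
}

public static class SeedChecks
{
    // Returns every name that appears more than once in a single list.
    public static IReadOnlyList<string> DuplicateNames(IEnumerable<IEntityDtoSeed> items) =>
        items.GroupBy(i => i.Name, StringComparer.Ordinal)
             .Where(g => g.Count() > 1)
             .Select(g => g.Key)
             .ToList();
}
```

A check like DuplicateNames could run right after deserialization, so a duplicated name fails loudly before any visitor tries to resolve it.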

The Visitor Pattern

Once the YAML is deserialized into a *SeedDto, the SeederBase iterates over each collection of entity DTOs and dispatches each one to the appropriate visitor. The ordering matters: if entity B depends on entity A, entity A’s collection must appear earlier in GetEntitiesDataFrom().

Each visitor extends SeedVisitorBase and implements a single Visit() method that returns a domain aggregate:

internal class CreditReasonSeedVisitor
    : SeedVisitorBase
{
    public CreditReasonSeedVisitor(AccountingSeedDto seed, ISeedEntity entitySeeder)
        : base(seed, entitySeeder) { }

    public override CreditReason Visit(CreditReasonDto dto)
    {
        return new CreditReason(dto.Id, dto.Code, dto.Name);
    }
}

When an entity depends on another aggregate that was already seeded earlier in the run, the visitor retrieves it by calling GetEntity():

var postingAccount = GetEntity(ca => ca.Id == dto.PostingAccountId);
return new BankAccount(dto.Id, dto.AccountNumber, dto.RoutingNumber, postingAccount.Id);
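The reason collection order matters shows up clearly in a small in-memory sketch. The types below are stand-ins: ChartAccount is a guessed name for the posting account aggregate, the DTO records are hypothetical, and this GetEntity<T>() is only an approximation of what the real base class offers. The point is that a lookup can only find aggregates that an earlier step has already created.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical DTOs and aggregates, reduced to the fields the example needs.
public sealed record ChartAccountDto(Guid Id, string Name);
public sealed record BankAccountDto(Guid Id, string AccountNumber, string RoutingNumber, Guid PostingAccountId);

public sealed record ChartAccount(Guid Id, string Name);
public sealed record BankAccount(Guid Id, string AccountNumber, string RoutingNumber, Guid PostingAccountId);

public sealed class OrderedSeederSketch
{
    // Everything seeded so far in this run, in creation order.
    private readonly List<object> _seeded = new();

    public IReadOnlyList<object> Seeded => _seeded;

    // Finds an aggregate created earlier in the same run, or fails loudly.
    public T GetEntity<T>(Func<T, bool> predicate) where T : class =>
        _seeded.OfType<T>().FirstOrDefault(predicate)
        ?? throw new InvalidOperationException(
            $"No seeded {typeof(T).Name} matched; is its collection ordered before its dependents?");

    public void SeedChartAccounts(IEnumerable<ChartAccountDto> dtos)
    {
        foreach (var dto in dtos)
            _seeded.Add(new ChartAccount(dto.Id, dto.Name));
    }

    // Depends on chart accounts, so it must run after SeedChartAccounts.
    public void SeedBankAccounts(IEnumerable<BankAccountDto> dtos)
    {
        foreach (var dto in dtos)
        {
            var postingAccount = GetEntity<ChartAccount>(ca => ca.Id == dto.PostingAccountId);
            _seeded.Add(new BankAccount(dto.Id, dto.AccountNumber, dto.RoutingNumber, postingAccount.Id));
        }
    }
}
```

Calling SeedBankAccounts() before SeedChartAccounts() throws in this sketch, which is the same class of failure a misordered collection would produce in the real seeder.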

When it needs to look up another DTO within the full seed by name (rather than by ID), it calls FindDataFor() or FindIdFor(). This is what makes the YAML human-friendly: the file says category: fruit, and the visitor resolves that name to the actual entity.
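Name-based resolution can be sketched in a few lines. The CategoryDto and ProductDto records are invented for the example, and this FindIdFor() is a hypothetical stand-in whose signature and exact-match comparison are assumptions; the real helper lives on the visitor base class and may behave differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical DTOs: the product refers to its category by name, not by ID.
public sealed record CategoryDto(Guid Id, string Name);
public sealed record ProductDto(Guid Id, string Name, string Category);

public static class SeedLookupSketch
{
    // Resolves a human-friendly name to the ID of the matching DTO.
    // Fails when the name is missing or ambiguous, which is why names
    // have to be unique within a list.
    public static Guid FindIdFor(IEnumerable<CategoryDto> categories, string name)
    {
        var matches = categories.Where(c => c.Name == name).ToList();
        return matches.Count switch
        {
            1 => matches[0].Id,
            0 => throw new InvalidOperationException($"No category named '{name}' in the seed."),
            _ => throw new InvalidOperationException($"Category name '{name}' is not unique."),
        };
    }
}
```

A visitor for the product would call something like FindIdFor(seed.Categories, dto.Category) and pass the resulting ID to the aggregate's constructor.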

Adding a New Entity: Seven Steps

When a new aggregate type needs seeding support, the process is consistent across all domains:

1. Define the YAML shape and add entries to every relevant dataset.
2. Add a DTO class in Seeder.IO implementing IEntityDtoSeed.
3. Add a property to the domain’s *SeedDto class.
4. Create a visitor in its own file in the domain’s Seeder/ folder.
5. Register the visitor in the seeder’s MakeOrderedVisitors() and update GetEntitiesDataFrom() and MakeSeedEntitiesFrom().
6. Wire persistence in the *DbInitializer by calling Persist(seededEntities.MyNewThings).
7. Write a fast test that verifies the full pipeline: YAML fixture → deserialization → visitor → aggregate.

That last step runs entirely in memory. There’s no database involved, and the test is marked [Trait("Category", "Fast")]. It’s the quickest way to confirm that a new DTO, visitor, and YAML entry all connect correctly before touching a running environment.

It seems like a lot, but it’s mostly boilerplate. The visitor is the one step that can take real work, and only for the more complex aggregates. With the AI tools we have now, most of that code is written for us, so adding an entity costs little and the seeder keeps delivering value.

Testing the Pipeline

Seeding is tested at three levels.

At the unit level, each domain has a fast deserialization test that reads a small YAML fixture, runs the DataBuilder, and asserts expected counts and field values on the resulting entities. No database, no application process, runs in milliseconds.

At the handler level, the SeedTenantHandler is tested in isolation, verifying that InitializeFor() is called exactly once and that the TenantRegistered event is published.

At the integration level, there’s a full end-to-end seeding test that creates a real tenant, calls InitializeFor(), and queries the database to confirm that entities across all four domains were persisted correctly, with spot-checks on specific GUIDs and names.

What the Architecture Buys You

The indirection — DTOs, visitors, ordered collections, SeederBase — looks like overhead until you need to add a tenth entity type, extend an existing visitor, or debug why a cross-reference is failing. At that point, having each piece in its own clearly-named place makes the system navigable.

What I’m learning is that the patterns we reach for in application code (aggregates, events, separation of concerns) tend to pay dividends when we apply them consistently to the supporting infrastructure too. The seeder isn’t glamorous, but it’s been one of the more reliable parts of the system, and I think the design choices behind it are a big reason why.

Claudio Lassala's Blog
