<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://www.roji.org/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.roji.org/" rel="alternate" type="text/html" /><updated>2025-10-31T11:28:47+01:00</updated><id>https://www.roji.org/feed.xml</id><title type="html">Shay Rojansky’s Blog</title><subtitle>Microsoft software engineer working on .NET data access and perf, member of the Entity Framework team. Lead dev of Npgsql, the PostgreSQL provider.</subtitle><author><name>Shay Rojansky</name></author><entry><title type="html">Queryable PostgreSQL arrays in EF Core 8.0</title><link href="https://www.roji.org/queryable-pg-arrays-in-ef8" rel="alternate" type="text/html" title="Queryable PostgreSQL arrays in EF Core 8.0" /><published>2023-05-20T00:00:00+02:00</published><updated>2023-05-20T00:00:00+02:00</updated><id>https://www.roji.org/queryable-pg-arrays-in-ef8</id><content type="html" xml:base="https://www.roji.org/queryable-pg-arrays-in-ef8"><![CDATA[<h2 id="queryable-collections">Queryable collections?</h2>

<p>EF Core 8.0 preview4 has just been released, and one of the big features it introduces is queryable primitive collections. This is a really cool feature that allows mapping primitive collections (e.g. <code class="language-plaintext highlighter-rouge">int[]</code>) to the database, and performing all imaginable LINQ queries over them. Before reading further here, please read the <a href="https://devblogs.microsoft.com/dotnet/announcing-ef8-preview-4">EF Core blog post</a> on this feature; more info and examples are also available in the <a href="https://learn.microsoft.com/en-us/ef/core/what-is-new/ef-core-8.0/whatsnew#collections-of-primitive-types">EF What’s New documentation</a>.</p>

<p>The rest of this post will discuss PostgreSQL-specific aspects of this feature, which is also fully supported starting with version 8.0.0-preview.4 of the PostgreSQL EF provider.</p>

<h2 id="contains-over-parameter">Contains over parameter</h2>

<p>The EF blog post starts with <a href="https://devblogs.microsoft.com/dotnet/announcing-ef8-preview-4/#translating-linq-contains-with-a-parameter-collection">a tricky problem</a>: how to translate the LINQ Contains operator when the list of values is a parameter?</p>

<div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">var</span> <span class="n">names</span> <span class="p">=</span> <span class="k">new</span><span class="p">[]</span> <span class="p">{</span> <span class="s">"Blog1"</span><span class="p">,</span> <span class="s">"Blog2"</span> <span class="p">};</span>

<span class="kt">var</span> <span class="n">blogs</span> <span class="p">=</span> <span class="k">await</span> <span class="n">context</span><span class="p">.</span><span class="n">Blogs</span>
    <span class="p">.</span><span class="nf">Where</span><span class="p">(</span><span class="n">b</span> <span class="p">=&gt;</span> <span class="n">names</span><span class="p">.</span><span class="nf">Contains</span><span class="p">(</span><span class="n">b</span><span class="p">.</span><span class="n">Name</span><span class="p">))</span>
    <span class="p">.</span><span class="nf">ToArrayAsync</span><span class="p">();</span>
</code></pre></div></div>

<p>The solution introduced in preview4 serializes the <code class="language-plaintext highlighter-rouge">names</code> .NET array into a string containing a JSON array representation, and then uses a SQL function to parse the values out in SQL. Here’s the SQL Server sample:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Executed</span> <span class="n">DbCommand</span> <span class="p">(</span><span class="mi">49</span><span class="n">ms</span><span class="p">)</span> <span class="p">[</span><span class="k">Parameters</span><span class="o">=</span><span class="p">[</span><span class="o">@</span><span class="n">__names_0</span><span class="o">=</span><span class="s1">'["Blog1","Blog2"]'</span> <span class="p">(</span><span class="k">Size</span> <span class="o">=</span> <span class="mi">4000</span><span class="p">)],</span> <span class="n">CommandType</span><span class="o">=</span><span class="s1">'Text'</span><span class="p">,</span> <span class="n">CommandTimeout</span><span class="o">=</span><span class="s1">'30'</span><span class="p">]</span>

<span class="k">SELECT</span> <span class="p">[</span><span class="n">b</span><span class="p">].[</span><span class="n">Id</span><span class="p">],</span> <span class="p">[</span><span class="n">b</span><span class="p">].[</span><span class="n">Name</span><span class="p">]</span>
<span class="k">FROM</span> <span class="p">[</span><span class="n">Blogs</span><span class="p">]</span> <span class="k">AS</span> <span class="p">[</span><span class="n">b</span><span class="p">]</span>
<span class="k">WHERE</span> <span class="k">EXISTS</span> <span class="p">(</span>
    <span class="k">SELECT</span> <span class="mi">1</span>
    <span class="k">FROM</span> <span class="n">OpenJson</span><span class="p">(</span><span class="o">@</span><span class="n">__names_0</span><span class="p">)</span> <span class="k">AS</span> <span class="p">[</span><span class="n">n</span><span class="p">]</span>
    <span class="k">WHERE</span> <span class="p">[</span><span class="n">n</span><span class="p">].[</span><span class="n">value</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="n">b</span><span class="p">].[</span><span class="n">Name</span><span class="p">])</span>
</code></pre></div></div>

<p>The nice thing about PostgreSQL is that it has full, first-class support for array types in the database - quite a unique feature. So we don’t have to mess around with JSON at all - we can simply send the .NET array directly as a parameter and use it in SQL as follows:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Executed</span> <span class="n">DbCommand</span> <span class="p">(</span><span class="mi">10</span><span class="n">ms</span><span class="p">)</span> <span class="p">[</span><span class="k">Parameters</span><span class="o">=</span><span class="p">[</span><span class="o">@</span><span class="n">__names_0</span><span class="o">=</span><span class="p">{</span> <span class="s1">'Blog1'</span><span class="p">,</span> <span class="s1">'Blog2'</span> <span class="p">}</span> <span class="p">(</span><span class="n">DbType</span> <span class="o">=</span> <span class="k">Object</span><span class="p">)],</span> <span class="n">CommandType</span><span class="o">=</span><span class="s1">'Text'</span><span class="p">,</span> <span class="n">CommandTimeout</span><span class="o">=</span><span class="s1">'30'</span><span class="p">]</span>

<span class="k">SELECT</span> <span class="n">b</span><span class="p">.</span><span class="nv">"Id"</span><span class="p">,</span> <span class="n">b</span><span class="p">.</span><span class="nv">"Name"</span>
<span class="k">FROM</span> <span class="nv">"Blogs"</span> <span class="k">AS</span> <span class="n">b</span>
<span class="k">WHERE</span> <span class="n">b</span><span class="p">.</span><span class="nv">"Name"</span> <span class="o">=</span> <span class="k">ANY</span> <span class="p">(</span><span class="o">@</span><span class="n">__names_0</span><span class="p">)</span>
</code></pre></div></div>

<p>In fact, the EF PostgreSQL provider has done this for a few years already, freeing PostgreSQL users from the performance problems that users of other databases had to contend with (<a href="https://github.com/dotnet/efcore/issues/13617">see this issue</a>). So preview4 doesn’t bring any improvements around this specific problem - we were already doing the optimal thing.</p>

<h2 id="fully-queryable-arrays">Fully queryable arrays</h2>

<p>However, even though the EF PostgreSQL provider has long supported arrays, its support for querying over them has been quite limited. Now, EF 8.0 preview4 unlocks generalized LINQ querying over primitive collections - once again by converting them to JSON, and using a SQL function to unpack them to a relational rowset. For example, the following LINQ query:</p>

<div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">var</span> <span class="n">tags</span> <span class="p">=</span> <span class="k">new</span><span class="p">[]</span> <span class="p">{</span> <span class="s">"Tag1"</span><span class="p">,</span> <span class="s">"Tag2"</span> <span class="p">};</span>

<span class="kt">var</span> <span class="n">blogs</span> <span class="p">=</span> <span class="k">await</span> <span class="n">context</span><span class="p">.</span><span class="n">Blogs</span>
    <span class="p">.</span><span class="nf">Where</span><span class="p">(</span><span class="n">b</span> <span class="p">=&gt;</span> <span class="n">b</span><span class="p">.</span><span class="n">Tags</span><span class="p">.</span><span class="nf">Intersect</span><span class="p">(</span><span class="n">tags</span><span class="p">).</span><span class="nf">Count</span><span class="p">()</span> <span class="p">&gt;=</span> <span class="m">2</span><span class="p">)</span>
    <span class="p">.</span><span class="nf">ToArrayAsync</span><span class="p">();</span>
</code></pre></div></div>

<p>… is now translated to the following SQL with SQL Server:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Executed</span> <span class="n">DbCommand</span> <span class="p">(</span><span class="mi">48</span><span class="n">ms</span><span class="p">)</span> <span class="p">[</span><span class="k">Parameters</span><span class="o">=</span><span class="p">[</span><span class="o">@</span><span class="n">__tags_0</span><span class="o">=</span><span class="s1">'["Tag1","Tag2"]'</span> <span class="p">(</span><span class="k">Size</span> <span class="o">=</span> <span class="mi">4000</span><span class="p">)],</span> <span class="n">CommandType</span><span class="o">=</span><span class="s1">'Text'</span><span class="p">,</span> <span class="n">CommandTimeout</span><span class="o">=</span><span class="s1">'30'</span><span class="p">]</span>

<span class="k">SELECT</span> <span class="p">[</span><span class="n">b</span><span class="p">].[</span><span class="n">Id</span><span class="p">],</span> <span class="p">[</span><span class="n">b</span><span class="p">].[</span><span class="n">Name</span><span class="p">],</span> <span class="p">[</span><span class="n">b</span><span class="p">].[</span><span class="n">Tags</span><span class="p">]</span>
<span class="k">FROM</span> <span class="p">[</span><span class="n">Blogs</span><span class="p">]</span> <span class="k">AS</span> <span class="p">[</span><span class="n">b</span><span class="p">]</span>
<span class="k">WHERE</span> <span class="p">(</span>
    <span class="k">SELECT</span> <span class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span class="p">)</span>
    <span class="k">FROM</span> <span class="p">(</span>
        <span class="k">SELECT</span> <span class="p">[</span><span class="n">t</span><span class="p">].[</span><span class="n">value</span><span class="p">]</span>
        <span class="k">FROM</span> <span class="n">OPENJSON</span><span class="p">([</span><span class="n">b</span><span class="p">].[</span><span class="n">Tags</span><span class="p">])</span> <span class="k">AS</span> <span class="p">[</span><span class="n">t</span><span class="p">]</span> <span class="c1">-- column collection</span>
        <span class="k">INTERSECT</span>
        <span class="k">SELECT</span> <span class="p">[</span><span class="n">t1</span><span class="p">].[</span><span class="n">value</span><span class="p">]</span>
        <span class="k">FROM</span> <span class="n">OPENJSON</span><span class="p">(</span><span class="o">@</span><span class="n">__tags_0</span><span class="p">)</span> <span class="k">AS</span> <span class="p">[</span><span class="n">t1</span><span class="p">]</span> <span class="c1">-- parameter collection</span>
    <span class="p">)</span> <span class="k">AS</span> <span class="p">[</span><span class="n">t0</span><span class="p">])</span> <span class="o">&gt;=</span> <span class="mi">2</span>
</code></pre></div></div>

<p>This uses the SQL Server <code class="language-plaintext highlighter-rouge">OpenJson</code> function to unpack the JSON array column and parameter into rowsets, over which the LINQ operators are translated in the standard way.</p>

<p>Now let’s see how this works on PostgreSQL!</p>

<p>First, here’s our .NET entity type:</p>

<div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="k">class</span> <span class="nc">Blog</span>
<span class="p">{</span>
    <span class="k">public</span> <span class="kt">int</span> <span class="n">Id</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="k">set</span><span class="p">;</span> <span class="p">}</span>
    <span class="k">public</span> <span class="kt">string</span><span class="p">?</span> <span class="n">Name</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="k">set</span><span class="p">;</span> <span class="p">}</span>
    <span class="k">public</span> <span class="kt">string</span><span class="p">[]</span> <span class="n">Tags</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="k">set</span><span class="p">;</span> <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This creates the following table:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="nv">"Blogs"</span> <span class="p">(</span>
  <span class="nv">"Id"</span> <span class="nb">integer</span> <span class="k">GENERATED</span> <span class="k">BY</span> <span class="k">DEFAULT</span> <span class="k">AS</span> <span class="k">IDENTITY</span><span class="p">,</span>
  <span class="nv">"Name"</span> <span class="nb">text</span><span class="p">,</span>
  <span class="nv">"Tags"</span> <span class="nb">text</span><span class="p">[]</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
  <span class="k">CONSTRAINT</span> <span class="nv">"PK_Blogs"</span> <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="p">(</span><span class="nv">"Id"</span><span class="p">)</span>
<span class="p">);</span>
</code></pre></div></div>

<p>Note that <code class="language-plaintext highlighter-rouge">Tags</code> is a PostgreSQL array - <code class="language-plaintext highlighter-rouge">text[]</code>, and not a simple string column containing a JSON array. Aside from mapping .NET arrays more directly and naturally, this has the following advantages:</p>

<ul>
  <li>It’s stored more efficiently: array elements are stored in the same efficient binary encoding that PostgreSQL uses for regular, non-array values.</li>
  <li>It’s also transferred more efficiently. The same binary encoding is used when reading and writing the elements, meaning that we don’t need to constantly serialize and parse JSON.</li>
  <li>Arrays provide more database type safety; it’s impossible for the column to contain anything other than the defined array type. Similar type safety may be achievable with a JSON array via complex check constraints, but this is more complicated and probably less efficient.</li>
</ul>
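<p>As a side note, this native array mapping isn’t specific to EF - at the ADO.NET level, Npgsql accepts .NET arrays as parameters directly. Here’s a minimal sketch (the connection string is a placeholder, and the table and column names are simply the ones from this post’s examples) showing a <code class="language-plaintext highlighter-rouge">text[]</code> parameter used with <code class="language-plaintext highlighter-rouge">= ANY</code>, with no JSON serialization anywhere:</p>

```csharp
using System;
using Npgsql;

// Hypothetical connection string - adjust for your environment.
await using var conn = new NpgsqlConnection("Host=localhost;Database=blogging");
await conn.OpenAsync();

// The .NET string[] is sent as a native PostgreSQL text[] parameter.
await using var cmd = new NpgsqlCommand(
    """SELECT "Id", "Name" FROM "Blogs" WHERE "Name" = ANY (@names)""", conn);
cmd.Parameters.AddWithValue("names", new[] { "Blog1", "Blog2" });

await using var reader = await cmd.ExecuteReaderAsync();
while (await reader.ReadAsync())
    Console.WriteLine(reader.GetString(1));
```

This is the same mechanism the EF provider uses under the hood when it parameterizes an array.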

<p>The LINQ query above translates to the following SQL:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">Executed</span> <span class="n">DbCommand</span> <span class="p">(</span><span class="mi">14</span><span class="n">ms</span><span class="p">)</span> <span class="p">[</span><span class="k">Parameters</span><span class="o">=</span><span class="p">[</span><span class="o">@</span><span class="n">__tags_0</span><span class="o">=</span><span class="p">{</span> <span class="s1">'Tag1'</span><span class="p">,</span> <span class="s1">'Tag2'</span> <span class="p">}</span> <span class="p">(</span><span class="n">DbType</span> <span class="o">=</span> <span class="k">Object</span><span class="p">)],</span> <span class="n">CommandType</span><span class="o">=</span><span class="s1">'Text'</span><span class="p">,</span> <span class="n">CommandTimeout</span><span class="o">=</span><span class="s1">'30'</span><span class="p">]</span>

<span class="k">SELECT</span> <span class="n">b</span><span class="p">.</span><span class="nv">"Id"</span><span class="p">,</span> <span class="n">b</span><span class="p">.</span><span class="nv">"Name"</span><span class="p">,</span> <span class="n">b</span><span class="p">.</span><span class="nv">"Tags"</span>
<span class="k">FROM</span> <span class="nv">"Blogs"</span> <span class="k">AS</span> <span class="n">b</span>
<span class="k">WHERE</span> <span class="p">(</span>
  <span class="k">SELECT</span> <span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)::</span><span class="nb">int</span>
  <span class="k">FROM</span> <span class="p">(</span>
      <span class="k">SELECT</span> <span class="n">t</span><span class="p">.</span><span class="n">value</span>
      <span class="k">FROM</span> <span class="k">unnest</span><span class="p">(</span><span class="n">b</span><span class="p">.</span><span class="nv">"Tags"</span><span class="p">)</span> <span class="k">AS</span> <span class="n">t</span><span class="p">(</span><span class="n">value</span><span class="p">)</span>
      <span class="k">INTERSECT</span>
      <span class="k">SELECT</span> <span class="n">t1</span><span class="p">.</span><span class="n">value</span>
      <span class="k">FROM</span> <span class="k">unnest</span><span class="p">(</span><span class="o">@</span><span class="n">__tags_0</span><span class="p">)</span> <span class="k">AS</span> <span class="n">t1</span><span class="p">(</span><span class="n">value</span><span class="p">)</span>
  <span class="p">)</span> <span class="k">AS</span> <span class="n">t0</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="mi">2</span>
</code></pre></div></div>

<p>Where SQL Server had the <code class="language-plaintext highlighter-rouge">OPENJSON</code> function, on PostgreSQL we use the <a href="https://www.postgresql.org/docs/current/functions-array.html#id-1.5.8.25.6.2.2.17.1.1.1"><code class="language-plaintext highlighter-rouge">unnest</code></a> function to expand the array into a relational rowset; conceptually things are very similar, except that the thing being expanded is a native PostgreSQL array rather than a string value containing a JSON array.</p>
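<p>If you want to see what <code class="language-plaintext highlighter-rouge">unnest</code> does on its own, it’s easy to experiment in psql; each array element becomes a row (and the <code class="language-plaintext highlighter-rouge">WITH ORDINALITY</code> variant additionally numbers the elements, 1-based):</p>

```sql
-- Expand an array literal into a rowset; each element becomes a row.
SELECT * FROM unnest(ARRAY['a', 'b', 'c']) AS t(value);

-- WITH ORDINALITY adds a 1-based position column alongside each element.
SELECT * FROM unnest(ARRAY['a', 'b', 'c']) WITH ORDINALITY AS t(value, ordinal);
```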

<p>So far, so good: we can use arbitrary LINQ operators to query PostgreSQL array columns (and parameters), and the EF provider translates those by “unnesting” the array and then using regular SQL over that.</p>

<h2 id="and-a-bonus-postgresql-specialized-translations">And a bonus: PostgreSQL specialized translations</h2>

<p>We could stop there - queryable arrays are already a powerful, flexible new mechanism for your LINQ queries. But PostgreSQL also provides a rich set of functions and operators for working with arrays - far beyond what’s possible with JSON arrays in other databases. For example, let’s say you want to index an element in the array:</p>

<div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">var</span> <span class="n">blogs</span> <span class="p">=</span> <span class="k">await</span> <span class="n">context</span><span class="p">.</span><span class="n">Blogs</span>
    <span class="p">.</span><span class="nf">Where</span><span class="p">(</span><span class="n">b</span> <span class="p">=&gt;</span> <span class="n">b</span><span class="p">.</span><span class="n">Tags</span><span class="p">[</span><span class="m">2</span><span class="p">]</span> <span class="p">==</span> <span class="s">"foo"</span><span class="p">)</span>
    <span class="p">.</span><span class="nf">ToArrayAsync</span><span class="p">();</span>
</code></pre></div></div>

<p>On SQL Server this translates to the following:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="p">[</span><span class="n">b</span><span class="p">].[</span><span class="n">Id</span><span class="p">],</span> <span class="p">[</span><span class="n">b</span><span class="p">].[</span><span class="n">Name</span><span class="p">],</span> <span class="p">[</span><span class="n">b</span><span class="p">].[</span><span class="n">Tags</span><span class="p">]</span>
<span class="k">FROM</span> <span class="p">[</span><span class="n">Blogs</span><span class="p">]</span> <span class="k">AS</span> <span class="p">[</span><span class="n">b</span><span class="p">]</span>
<span class="k">WHERE</span> <span class="n">JSON_VALUE</span><span class="p">([</span><span class="n">b</span><span class="p">].[</span><span class="n">Tags</span><span class="p">],</span> <span class="s1">'$[2]'</span><span class="p">)</span> <span class="o">=</span> <span class="s1">N'foo'</span>
</code></pre></div></div>

<p>… whereas on PostgreSQL, we can simply do the following:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">b</span><span class="p">.</span><span class="nv">"Id"</span><span class="p">,</span> <span class="n">b</span><span class="p">.</span><span class="nv">"Name"</span><span class="p">,</span> <span class="n">b</span><span class="p">.</span><span class="nv">"Tags"</span>
<span class="k">FROM</span> <span class="nv">"Blogs"</span> <span class="k">AS</span> <span class="n">b</span>
<span class="k">WHERE</span> <span class="n">b</span><span class="p">.</span><span class="nv">"Tags"</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="s1">'foo'</span>
</code></pre></div></div>

<p>For something more complex, the provider can translate queries of the following form:</p>

<div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">var</span> <span class="n">tags</span> <span class="p">=</span> <span class="k">new</span><span class="p">[]</span> <span class="p">{</span> <span class="s">"Tag1"</span><span class="p">,</span> <span class="s">"Tag2"</span> <span class="p">};</span>

<span class="kt">var</span> <span class="n">blogs</span> <span class="p">=</span> <span class="k">await</span> <span class="n">context</span><span class="p">.</span><span class="n">Blogs</span>
    <span class="p">.</span><span class="nf">Where</span><span class="p">(</span><span class="n">b</span> <span class="p">=&gt;</span> <span class="n">tags</span><span class="p">.</span><span class="nf">All</span><span class="p">(</span><span class="n">t</span> <span class="p">=&gt;</span> <span class="n">b</span><span class="p">.</span><span class="n">Tags</span><span class="p">.</span><span class="nf">Contains</span><span class="p">(</span><span class="n">t</span><span class="p">)))</span>
    <span class="p">.</span><span class="nf">ToArrayAsync</span><span class="p">();</span>
</code></pre></div></div>

<p>This, in effect, queries Blogs where the Tags column contains all elements in the <code class="language-plaintext highlighter-rouge">tags</code> parameter. It so happens that PostgreSQL has an array containment operator, so we translate this to:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">b</span><span class="p">.</span><span class="nv">"Id"</span><span class="p">,</span> <span class="n">b</span><span class="p">.</span><span class="nv">"Name"</span><span class="p">,</span> <span class="n">b</span><span class="p">.</span><span class="nv">"Tags"</span>
<span class="k">FROM</span> <span class="nv">"Blogs"</span> <span class="k">AS</span> <span class="n">b</span>
<span class="k">WHERE</span> <span class="o">@</span><span class="n">__tags_0</span> <span class="o">&lt;@</span> <span class="n">b</span><span class="p">.</span><span class="nv">"Tags"</span>
</code></pre></div></div>

<p>This translation - and several like it - has been implemented for several years already; but 8.0.0-preview.4 brings a few new ones. For example:</p>

<div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">var</span> <span class="n">tags</span> <span class="p">=</span> <span class="k">new</span><span class="p">[]</span> <span class="p">{</span> <span class="s">"Tag1"</span><span class="p">,</span> <span class="s">"Tag2"</span> <span class="p">};</span>

<span class="kt">var</span> <span class="n">blogs</span> <span class="p">=</span> <span class="k">await</span> <span class="n">context</span><span class="p">.</span><span class="n">Blogs</span>
    <span class="p">.</span><span class="nf">Where</span><span class="p">(</span><span class="n">b</span> <span class="p">=&gt;</span> <span class="n">b</span><span class="p">.</span><span class="n">Tags</span><span class="p">.</span><span class="nf">Intersect</span><span class="p">(</span><span class="n">tags</span><span class="p">).</span><span class="nf">Any</span><span class="p">())</span>
    <span class="p">.</span><span class="nf">ToArrayAsync</span><span class="p">();</span>
</code></pre></div></div>

<p>This queries Blogs where there’s any overlap between the Tags column and the <code class="language-plaintext highlighter-rouge">tags</code> parameter:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">b</span><span class="p">.</span><span class="nv">"Id"</span><span class="p">,</span> <span class="n">b</span><span class="p">.</span><span class="nv">"Name"</span><span class="p">,</span> <span class="n">b</span><span class="p">.</span><span class="nv">"Tags"</span>
<span class="k">FROM</span> <span class="nv">"Blogs"</span> <span class="k">AS</span> <span class="n">b</span>
<span class="k">WHERE</span> <span class="n">b</span><span class="p">.</span><span class="nv">"Tags"</span> <span class="o">&amp;&amp;</span> <span class="o">@</span><span class="n">__tags_0</span>
</code></pre></div></div>

<p>Moving on from set operations, we now also translate Skip and Take to array slicing operations. For example:</p>

<div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">var</span> <span class="n">blogs</span> <span class="p">=</span> <span class="k">await</span> <span class="n">context</span><span class="p">.</span><span class="n">Blogs</span>
    <span class="p">.</span><span class="nf">Where</span><span class="p">(</span><span class="n">b</span> <span class="p">=&gt;</span> <span class="n">b</span><span class="p">.</span><span class="n">Tags</span><span class="p">.</span><span class="nf">Skip</span><span class="p">(</span><span class="m">2</span><span class="p">).</span><span class="nf">Contains</span><span class="p">(</span><span class="s">"Tag1"</span><span class="p">))</span>
    <span class="p">.</span><span class="nf">ToArrayAsync</span><span class="p">();</span>
</code></pre></div></div>

<p>… translates to:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">b</span><span class="p">.</span><span class="nv">"Id"</span><span class="p">,</span> <span class="n">b</span><span class="p">.</span><span class="nv">"Name"</span><span class="p">,</span> <span class="n">b</span><span class="p">.</span><span class="nv">"Tags"</span>
<span class="k">FROM</span> <span class="nv">"Blogs"</span> <span class="k">AS</span> <span class="n">b</span>
<span class="k">WHERE</span> <span class="s1">'Tag1'</span> <span class="o">=</span> <span class="k">ANY</span> <span class="p">(</span><span class="n">b</span><span class="p">.</span><span class="nv">"Tags"</span><span class="p">[</span><span class="mi">3</span><span class="p">:])</span>
</code></pre></div></div>

<p>Note that the C# 2 has been transformed to a 3, since PostgreSQL arrays are 1-based, not 0-based. We can do the same for Take:</p>

<div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">var</span> <span class="n">blogs</span> <span class="p">=</span> <span class="k">await</span> <span class="n">context</span><span class="p">.</span><span class="n">Blogs</span>
    <span class="p">.</span><span class="nf">Where</span><span class="p">(</span><span class="n">b</span> <span class="p">=&gt;</span> <span class="n">b</span><span class="p">.</span><span class="n">Tags</span><span class="p">.</span><span class="nf">Take</span><span class="p">(</span><span class="m">2</span><span class="p">).</span><span class="nf">Contains</span><span class="p">(</span><span class="s">"Tag1"</span><span class="p">))</span>
    <span class="p">.</span><span class="nf">ToArrayAsync</span><span class="p">();</span>
</code></pre></div></div>

<p>… which translates to:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">b</span><span class="p">.</span><span class="nv">"Id"</span><span class="p">,</span> <span class="n">b</span><span class="p">.</span><span class="nv">"Name"</span><span class="p">,</span> <span class="n">b</span><span class="p">.</span><span class="nv">"Tags"</span>
<span class="k">FROM</span> <span class="nv">"Blogs"</span> <span class="k">AS</span> <span class="n">b</span>
<span class="k">WHERE</span> <span class="s1">'Tag1'</span> <span class="o">=</span> <span class="k">ANY</span> <span class="p">(</span><span class="n">b</span><span class="p">.</span><span class="nv">"Tags"</span><span class="p">[:</span><span class="mi">2</span><span class="p">])</span>
</code></pre></div></div>

<p>… and the provider even combines the two - <code class="language-plaintext highlighter-rouge">Skip(1).Take(2)</code>, for example, generates:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">b</span><span class="p">.</span><span class="nv">"Id"</span><span class="p">,</span> <span class="n">b</span><span class="p">.</span><span class="nv">"Name"</span><span class="p">,</span> <span class="n">b</span><span class="p">.</span><span class="nv">"Tags"</span>
<span class="k">FROM</span> <span class="nv">"Blogs"</span> <span class="k">AS</span> <span class="n">b</span>
<span class="k">WHERE</span> <span class="s1">'Tag1'</span> <span class="o">=</span> <span class="k">ANY</span> <span class="p">(</span><span class="n">b</span><span class="p">.</span><span class="nv">"Tags"</span><span class="p">[</span><span class="mi">2</span><span class="p">:</span><span class="mi">3</span><span class="p">])</span>
</code></pre></div></div>

<p>For all the specialized translations supported by the provider, <a href="https://www.npgsql.org/efcore/mapping/array.html">see this doc page</a>. But remember - even if a specialized translation isn’t available, the provider will now use <code class="language-plaintext highlighter-rouge">unnest</code> to expand your array to a rowset, and then employ standard SQL to compose query operators on top of it.</p>

<h3 id="summary">Summary</h3>

<p>The PostgreSQL provider has supported arrays for quite a while, but 8.0.0-preview.4 brings a major upgrade to array support: arbitrary LINQ operators can now be used, and some specialized PostgreSQL translations have been added to make your SQL tighter and more efficient. Let us know about cool querying ideas or any bugs!</p>]]></content><author><name>Shay Rojansky</name></author><summary type="html"><![CDATA[Queryable collections?]]></summary></entry><entry><title type="html">When “UTC everywhere” isn’t enough - storing time zones in PostgreSQL and SQL Server</title><link href="https://www.roji.org/storing-timezones-in-the-db" rel="alternate" type="text/html" title="When “UTC everywhere” isn’t enough - storing time zones in PostgreSQL and SQL Server" /><published>2021-11-10T00:00:00+01:00</published><updated>2021-11-10T00:00:00+01:00</updated><id>https://www.roji.org/storing-timezones-in-the-db</id><content type="html" xml:base="https://www.roji.org/storing-timezones-in-the-db"><![CDATA[<h2 id="when-utc-everywhere-isnt-enough">When “UTC everywhere” isn’t enough</h2>

<p>I’ve been dealing a lot with timestamps, time zones and databases lately - especially on PostgreSQL (<a href="/postgresql-dotnet-timestamp-mapping">see this blog post</a>), but also in general. Recently, on the Entity Framework Core community standup, <a href="https://www.youtube.com/watch?v=ZLJLfImuFqM&amp;list=PLdo4fOcmZ0oX-DBuRG4u58ZTAJgBAeQ-t&amp;index=2">we also hosted Jon Skeet</a> and chatted about NodaTime, timestamps, time zones, UTC and how they all relate to databases - I highly recommend watching that!</p>

<p>Now, a lot has been said about “UTC everywhere”; according to this pattern, all date/time representations in your system should always be in UTC, and if you get a local timestamp externally (e.g. from a user), you convert it to UTC as early as possible. The idea is to quickly clear away all the icky timezone-related problems, and to have a UTC-only nirvana from that point on. While this works well for many cases - e.g. when you just want to record when something happened in the global timeline - it is not a silver bullet, and you should think carefully about it. Jon Skeet already explained this better than I could, so go read his <a href="https://codeblog.jonskeet.uk/2019/03/27/storing-utc-is-not-a-silver-bullet/">blog post on this</a>. As a very short tl;dr, time zone conversion rules may change after the moment you perform the conversion, so the user-provided local timestamp (and time zone) may start converting to a <em>different</em> UTC timestamp at some point! As a result, for events which take place on a specific time in a specific time zone, it’s better to store the local timestamp and the time zone (not offset!).</p>

<p>So let’s continue Jon’s blog post, and see how to actually perform that on two real databases - PostgreSQL and SQL Server. Following Jon’s preferred option, we want to store the following in the database:</p>

<ol>
  <li>The user-provided local timestamp.</li>
  <li>The user-provided time zone ID. This is <em>not</em> an offset, but rather a daylight savings-aware time zone, represented as a string.</li>
  <li>A UTC timestamp that’s computed (or generated) from the above two values. This can be used to order the rows by their occurrence on the global timeline, and can even be indexed.</li>
</ol>

<p>In Jon’s <a href="https://nodatime.org">NodaTime</a> library, the <a href="https://nodatime.org/3.0.x/api/NodaTime.ZonedDateTime.html">ZonedDateTime</a> type precisely represents the first two values above. Unfortunately, databases typically don’t have such a type; SQL Server does have <code class="language-plaintext highlighter-rouge">datetimeoffset</code>, but an offset is not a time zone (it isn’t daylight savings-aware). So we must use separate columns to represent the data above.</p>

<p>We’ll start with PostgreSQL, but we’ll later see how things work with SQL Server as well. The code samples below will show Entity Framework Core, but the same should be doable with any other data access layer as well.</p>

<h2 id="postgresql">PostgreSQL</h2>

<p>PostgreSQL conveniently has a type called <code class="language-plaintext highlighter-rouge">timestamp without time zone</code> for local timestamps in an unspecified time zone, and a badly-named type called <code class="language-plaintext highlighter-rouge">timestamp with time zone</code>, for UTC timestamps (no time zone is actually persisted); those are perfect for our two timestamps. We also want the UTC timestamp to be generated from the two other values, so we’ll set up a <a href="https://www.postgresql.org/docs/current/ddl-generated-columns.html">PostgreSQL generated column</a> (called <a href="https://docs.microsoft.com/ef/core/modeling/generated-properties#computed-columns">computed column</a> by EF Core) to do that. Here’s the minimal EF Core model and context, using <a href="https://www.npgsql.org/efcore/mapping/nodatime.html">the NodaTime plugin</a>:</p>

<div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="k">class</span> <span class="nc">EventContext</span> <span class="p">:</span> <span class="n">DbContext</span>
<span class="p">{</span>
    <span class="k">public</span> <span class="n">DbSet</span><span class="p">&lt;</span><span class="n">Event</span><span class="p">&gt;</span> <span class="n">Events</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="k">set</span><span class="p">;</span> <span class="p">}</span>

    <span class="k">protected</span> <span class="k">override</span> <span class="k">void</span> <span class="nf">OnConfiguring</span><span class="p">(</span><span class="n">DbContextOptionsBuilder</span> <span class="n">optionsBuilder</span><span class="p">)</span>
        <span class="p">=&gt;</span> <span class="n">optionsBuilder</span><span class="p">.</span><span class="nf">UseNpgsql</span><span class="p">(</span><span class="s">@"Host=localhost;Username=test;Password=test"</span><span class="p">,</span> <span class="n">o</span> <span class="p">=&gt;</span> <span class="n">o</span><span class="p">.</span><span class="nf">UseNodaTime</span><span class="p">());</span>

    <span class="k">protected</span> <span class="k">override</span> <span class="k">void</span> <span class="nf">OnModelCreating</span><span class="p">(</span><span class="n">ModelBuilder</span> <span class="n">modelBuilder</span><span class="p">)</span>
        <span class="p">=&gt;</span> <span class="n">modelBuilder</span><span class="p">.</span><span class="n">Entity</span><span class="p">&lt;</span><span class="n">Event</span><span class="p">&gt;(</span><span class="n">b</span> <span class="p">=&gt;</span>
            <span class="p">{</span>
                <span class="n">b</span><span class="p">.</span><span class="nf">Property</span><span class="p">(</span><span class="n">b</span> <span class="p">=&gt;</span> <span class="n">b</span><span class="p">.</span><span class="n">UtcTimestamp</span><span class="p">)</span>
                    <span class="p">.</span><span class="nf">HasComputedColumnSql</span><span class="p">(</span><span class="s">@"""LocalTimestamp"" AT TIME ZONE ""TimeZoneId"""</span><span class="p">,</span> <span class="n">stored</span><span class="p">:</span> <span class="k">true</span><span class="p">);</span>

                <span class="n">b</span><span class="p">.</span><span class="nf">HasIndex</span><span class="p">(</span><span class="n">b</span> <span class="p">=&gt;</span> <span class="n">b</span><span class="p">.</span><span class="n">UtcTimestamp</span><span class="p">);</span>
            <span class="p">});</span>
<span class="p">}</span>

<span class="k">public</span> <span class="k">class</span> <span class="nc">Event</span>
<span class="p">{</span>
    <span class="k">public</span> <span class="kt">int</span> <span class="n">Id</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="k">set</span><span class="p">;</span> <span class="p">}</span>

    <span class="k">public</span> <span class="n">LocalDateTime</span> <span class="n">LocalTimestamp</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="k">set</span><span class="p">;</span> <span class="p">}</span>
    <span class="k">public</span> <span class="n">Instant</span> <span class="n">UtcTimestamp</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="k">set</span><span class="p">;</span> <span class="p">}</span>
    <span class="k">public</span> <span class="kt">string</span> <span class="n">TimeZoneId</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="k">set</span><span class="p">;</span> <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This causes the following table to be created:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="nv">"Events"</span> <span class="p">(</span>
  <span class="nv">"Id"</span> <span class="nb">integer</span> <span class="k">GENERATED</span> <span class="k">BY</span> <span class="k">DEFAULT</span> <span class="k">AS</span> <span class="k">IDENTITY</span><span class="p">,</span>
  <span class="nv">"LocalTimestamp"</span> <span class="nb">timestamp</span> <span class="k">without</span> <span class="nb">time</span> <span class="k">zone</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
  <span class="nv">"UtcTimestamp"</span> <span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span> <span class="k">GENERATED</span> <span class="n">ALWAYS</span> <span class="k">AS</span> <span class="p">(</span><span class="nv">"LocalTimestamp"</span> <span class="k">AT</span> <span class="nb">TIME</span> <span class="k">ZONE</span> <span class="nv">"TimeZoneId"</span><span class="p">)</span> <span class="n">STORED</span><span class="p">,</span>
  <span class="nv">"TimeZoneId"</span> <span class="nb">text</span> <span class="k">NULL</span><span class="p">,</span>
  <span class="k">CONSTRAINT</span> <span class="nv">"PK_Events"</span> <span class="k">PRIMARY</span> <span class="k">KEY</span> <span class="p">(</span><span class="nv">"Id"</span><span class="p">)</span>
<span class="p">);</span>
</code></pre></div></div>

<p>A few notes on the above:</p>

<ul>
  <li>The <code class="language-plaintext highlighter-rouge">AT TIME ZONE</code> operator in the generated column definition converts our local timestamp to a UTC timestamp, using the time zone recorded in the other column.</li>
  <li>PostgreSQL uses IANA/Olson timezone IDs - this is what you need to store in <code class="language-plaintext highlighter-rouge">TimeZoneId</code>. These time zones look like <code class="language-plaintext highlighter-rouge">Europe/Berlin</code>, and are not the Windows time zones that .NET developers are usually used to. The good news is that .NET 6.0 contains <a href="https://devblogs.microsoft.com/dotnet/date-time-and-time-zone-enhancements-in-net-6/">time zone improvements</a> which allow working with IANA/Olson time zones.</li>
  <li><code class="language-plaintext highlighter-rouge">UtcTimestamp</code> is a <em>stored</em> generated column, meaning that its value gets computed whenever the row is modified, and gets persisted in the table just like any other column. Databases usually also support non-stored generated columns, which get computed every time upon select, but PostgreSQL does not support these yet. This distinction will actually be important further down.</li>
  <li>We create an index over our generated column, which allows us to efficiently perform queries on our events, e.g. get all of them sorted on the global timeline.</li>
</ul>
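<p>To get a feel for what <code class="language-plaintext highlighter-rouge">AT TIME ZONE</code> does with a local timestamp, here’s a quick sketch you can run in psql (assuming the connection’s <code class="language-plaintext highlighter-rouge">TimeZone</code> is set to UTC, which only affects how the result is displayed):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT '2021-11-10 10:00:00'::timestamp AT TIME ZONE 'Europe/Berlin';
-- 2021-11-10 09:00:00+00: the local timestamp is interpreted as Berlin time
-- (UTC+01:00 on that date) and converted to the corresponding UTC instant
</code></pre></div></div>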

<p>Perfect, job done… or is it?</p>

<p>The astute reader will have noticed that since our UTC timestamp is a stored generated column, it’s computed when we insert the row, and is not recomputed again unless the row changes. So what happens if the time zone database actually changes after that? That’s right - our UTC timestamp may no longer be correct, and that’s exactly the problem we wanted to fix by preserving the original, user-provided local time and time zone! To “resync” the UTC timestamp, we can recreate the column after a time zone database change (or just periodically):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ALTER</span> <span class="k">TABLE</span> <span class="nv">"Events"</span> <span class="k">DROP</span> <span class="k">COLUMN</span> <span class="nv">"UtcTimestamp"</span><span class="p">;</span>
<span class="k">ALTER</span> <span class="k">TABLE</span> <span class="nv">"Events"</span> <span class="k">ADD</span> <span class="k">COLUMN</span> <span class="nv">"UtcTimestamp"</span> <span class="nb">timestamp</span> <span class="k">with</span> <span class="nb">time</span> <span class="k">zone</span> <span class="k">GENERATED</span> <span class="n">ALWAYS</span> <span class="k">AS</span> <span class="p">(</span><span class="nv">"LocalTimestamp"</span> <span class="k">AT</span> <span class="nb">TIME</span> <span class="k">ZONE</span> <span class="nv">"TimeZoneId"</span><span class="p">)</span> <span class="n">STORED</span><span class="p">;</span>
</code></pre></div></div>

<p>Note that all this assumes you actually need the UTC timestamp as a database column; an alternative would be to omit it, and to perform the time zone conversion in your queries. For example, with the NodaTime plugin you can do the following:</p>

<div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">var</span> <span class="n">events</span> <span class="p">=</span> <span class="k">await</span> <span class="n">ctx</span><span class="p">.</span><span class="n">Events</span>
    <span class="p">.</span><span class="nf">OrderBy</span><span class="p">(</span><span class="n">e</span> <span class="p">=&gt;</span> <span class="n">e</span><span class="p">.</span><span class="n">LocalTimestamp</span><span class="p">.</span><span class="nf">InZoneLeniently</span><span class="p">(</span><span class="n">DateTimeZoneProviders</span><span class="p">.</span><span class="n">Tzdb</span><span class="p">[</span><span class="n">e</span><span class="p">.</span><span class="n">TimeZoneId</span><span class="p">]).</span><span class="nf">ToInstant</span><span class="p">())</span>
    <span class="p">.</span><span class="nf">ToListAsync</span><span class="p">();</span>
</code></pre></div></div>

<p>This will translate to the following query:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">e</span><span class="p">.</span><span class="nv">"Id"</span><span class="p">,</span> <span class="n">e</span><span class="p">.</span><span class="nv">"LocalTimestamp"</span><span class="p">,</span> <span class="n">e</span><span class="p">.</span><span class="nv">"TimeZoneId"</span>
<span class="k">FROM</span> <span class="nv">"Events"</span> <span class="k">AS</span> <span class="n">e</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">e</span><span class="p">.</span><span class="nv">"LocalTimestamp"</span> <span class="k">AT</span> <span class="nb">TIME</span> <span class="k">ZONE</span> <span class="n">e</span><span class="p">.</span><span class="nv">"TimeZoneId"</span>
</code></pre></div></div>

<p>This effectively does the same thing as the generated column above, but performs the time zone conversion at query time; this ensures the up-to-date time zone database is always used, and takes up no disk space. The main disadvantage, of course, is that you can’t have an index over the UTC timestamp, so operations like sorting will be slow.</p>

<h2 id="sql-server">SQL Server</h2>

<p>Let’s see how this whole thing works on another database - SQL Server. We’ll do pretty much the same thing, but to change things up, we’ll just use the native BCL <code class="language-plaintext highlighter-rouge">DateTime</code> type instead of NodaTime (although a NodaTime plugin for the SQL Server provider <a href="https://github.com/StevenRasmussen/EFCore.SqlServer.NodaTime">does exist</a>). As before, here’s the minimal EF Core model and context:</p>

<div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="k">class</span> <span class="nc">EventContext</span> <span class="p">:</span> <span class="n">DbContext</span>
<span class="p">{</span>
    <span class="k">public</span> <span class="n">DbSet</span><span class="p">&lt;</span><span class="n">Event</span><span class="p">&gt;</span> <span class="n">Events</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="k">set</span><span class="p">;</span> <span class="p">}</span>

    <span class="k">protected</span> <span class="k">override</span> <span class="k">void</span> <span class="nf">OnConfiguring</span><span class="p">(</span><span class="n">DbContextOptionsBuilder</span> <span class="n">optionsBuilder</span><span class="p">)</span>
        <span class="p">=&gt;</span> <span class="n">optionsBuilder</span><span class="p">.</span><span class="nf">UseSqlServer</span><span class="p">(</span><span class="s">@"&lt;connection string&gt;"</span><span class="p">)</span>

    <span class="k">protected</span> <span class="k">override</span> <span class="k">void</span> <span class="nf">OnModelCreating</span><span class="p">(</span><span class="n">ModelBuilder</span> <span class="n">modelBuilder</span><span class="p">)</span>
        <span class="p">=&gt;</span> <span class="n">modelBuilder</span><span class="p">.</span><span class="n">Entity</span><span class="p">&lt;</span><span class="n">Event</span><span class="p">&gt;(</span><span class="n">b</span> <span class="p">=&gt;</span>
            <span class="p">{</span>
                <span class="n">b</span><span class="p">.</span><span class="nf">Property</span><span class="p">(</span><span class="n">b</span> <span class="p">=&gt;</span> <span class="n">b</span><span class="p">.</span><span class="n">UtcTimestamp</span><span class="p">)</span>
                    <span class="p">.</span><span class="nf">HasComputedColumnSql</span><span class="p">(</span><span class="s">@"[LocalTimestamp] AT TIME ZONE [TimeZoneId] AT TIME ZONE 'UTC'"</span><span class="p">,</span> <span class="n">stored</span><span class="p">:</span> <span class="k">true</span><span class="p">);</span>

                <span class="n">b</span><span class="p">.</span><span class="nf">HasIndex</span><span class="p">(</span><span class="n">b</span> <span class="p">=&gt;</span> <span class="n">b</span><span class="p">.</span><span class="n">UtcTimestamp</span><span class="p">);</span>
            <span class="p">});</span>
<span class="p">}</span>

<span class="k">public</span> <span class="k">class</span> <span class="nc">Event</span>
<span class="p">{</span>
    <span class="k">public</span> <span class="kt">int</span> <span class="n">Id</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="k">set</span><span class="p">;</span> <span class="p">}</span>

    <span class="k">public</span> <span class="n">DateTime</span> <span class="n">LocalTimestamp</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="k">set</span><span class="p">;</span> <span class="p">}</span>
    <span class="k">public</span> <span class="n">DateTimeOffset</span> <span class="n">UtcTimestamp</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="k">set</span><span class="p">;</span> <span class="p">}</span>
    <span class="k">public</span> <span class="kt">string</span> <span class="n">TimeZoneId</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="k">set</span><span class="p">;</span> <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>A couple of notes, comparing this to PostgreSQL:</p>

<ul>
  <li>On SQL Server, <code class="language-plaintext highlighter-rouge">AT TIME ZONE</code> returns a <code class="language-plaintext highlighter-rouge">datetimeoffset</code> type - that’s why <code class="language-plaintext highlighter-rouge">UtcTimestamp</code> is a <code class="language-plaintext highlighter-rouge">DateTimeOffset</code>. If you really want <code class="language-plaintext highlighter-rouge">UtcTimestamp</code> to be a <code class="language-plaintext highlighter-rouge">DateTime</code>, you can add a conversion back from <code class="language-plaintext highlighter-rouge">datetimeoffset</code> to <code class="language-plaintext highlighter-rouge">datetime2</code>.</li>
  <li>The computed column SQL is a bit more complicated: we first convert the local timestamp to a <code class="language-plaintext highlighter-rouge">datetimeoffset</code> in the user’s time zone, and then to a UTC <code class="language-plaintext highlighter-rouge">datetimeoffset</code>.</li>
</ul>

<p>Looks great… except that trying to create the table gives us the following error: <code class="language-plaintext highlighter-rouge">Computed column 'UtcTimestamp' in table 'Events' cannot be persisted because the column is non-deterministic</code>. SQL Server is stricter than PostgreSQL here: since the <code class="language-plaintext highlighter-rouge">AT TIME ZONE</code> operator depends on an external time zone database - which can change at any time - it is non-deterministic, and therefore cannot be used in a computed column definition. In effect, SQL Server is alerting you to the danger discussed above - your UTC timestamp may become out of sync with its inputs.</p>

<p>If you’re willing to give up the index, then unlike PostgreSQL you can use a non-stored computed column instead:</p>

<div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">modelBuilder</span><span class="p">.</span><span class="n">Entity</span><span class="p">&lt;</span><span class="n">Event</span><span class="p">&gt;()</span>
    <span class="p">.</span><span class="nf">Property</span><span class="p">(</span><span class="n">e</span> <span class="p">=&gt;</span> <span class="n">e</span><span class="p">.</span><span class="n">UtcTimestamp</span><span class="p">)</span>
    <span class="p">.</span><span class="nf">HasComputedColumnSql</span><span class="p">(</span><span class="s">@"[LocalTimestamp] AT TIME ZONE [TimeZoneId] AT TIME ZONE 'UTC'"</span><span class="p">);</span>
</code></pre></div></div>

<p>Note that we removed the <code class="language-plaintext highlighter-rouge">stored: true</code> we had before (the default is non-stored). This column cannot be indexed, and effectively fulfils the same purpose as the PostgreSQL query we saw above. If you do want an indexed column, then you’ll have to set up a database trigger to keep <code class="language-plaintext highlighter-rouge">UtcTimestamp</code> up to date:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">OR</span> <span class="k">ALTER</span> <span class="k">TRIGGER</span> <span class="n">Events_UPDATE</span> <span class="k">ON</span> <span class="n">Events</span>
    <span class="k">AFTER</span> <span class="k">INSERT</span><span class="p">,</span> <span class="k">UPDATE</span>
    <span class="k">AS</span>
<span class="k">BEGIN</span>
    <span class="k">SET</span> <span class="n">NOCOUNT</span> <span class="k">ON</span><span class="p">;</span>

    <span class="c1">-- A set-based UPDATE joining on INSERTED handles multi-row</span>
    <span class="c1">-- INSERTs/UPDATEs correctly, which row-by-row variable assignment does not</span>
    <span class="k">UPDATE</span> <span class="n">e</span>
    <span class="k">SET</span> <span class="p">[</span><span class="n">UtcTimestamp</span><span class="p">]</span> <span class="o">=</span> <span class="n">i</span><span class="p">.[</span><span class="n">LocalTimestamp</span><span class="p">]</span> <span class="k">AT</span> <span class="nb">TIME</span> <span class="k">ZONE</span> <span class="n">i</span><span class="p">.[</span><span class="n">TimeZoneId</span><span class="p">]</span> <span class="k">AT</span> <span class="nb">TIME</span> <span class="k">ZONE</span> <span class="s1">'UTC'</span>
    <span class="k">FROM</span> <span class="p">[</span><span class="n">Events</span><span class="p">]</span> <span class="k">AS</span> <span class="n">e</span>
    <span class="k">JOIN</span> <span class="n">INSERTED</span> <span class="k">AS</span> <span class="n">i</span> <span class="k">ON</span> <span class="n">e</span><span class="p">.</span><span class="n">Id</span> <span class="o">=</span> <span class="n">i</span><span class="p">.</span><span class="n">Id</span><span class="p">;</span>
<span class="k">END</span><span class="p">;</span>
</code></pre></div></div>

<p>If you’re using EF Core Migrations, you can use <a href="https://docs.microsoft.com/ef/core/managing-schemas/migrations/managing#adding-raw-sql">raw SQL to define this trigger</a>. Note that it’s now your responsibility to redo the conversions when the time zone database changes, just like with PostgreSQL above.</p>
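<p>In a migration, that can look something like the following - a rough sketch, where the migration class name is made up and the trigger SQL (the full <code class="language-plaintext highlighter-rouge">CREATE OR ALTER TRIGGER</code> statement from above) is elided for brevity:</p>

<div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using Microsoft.EntityFrameworkCore.Migrations;

public partial class AddUtcTimestampTrigger : Migration
{
    protected override void Up(MigrationBuilder migrationBuilder)
        // Paste the full CREATE OR ALTER TRIGGER statement here
        =&gt; migrationBuilder.Sql(@"CREATE OR ALTER TRIGGER Events_UPDATE ON Events ...");

    protected override void Down(MigrationBuilder migrationBuilder)
        =&gt; migrationBuilder.Sql("DROP TRIGGER Events_UPDATE;");
}
</code></pre></div></div>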

<h2 id="some-closing-words">Some closing words</h2>

<p>It’s interesting to compare PostgreSQL and SQL Server on what is considered a non-deterministic function (and therefore, what can be used in a computed column). I sent a <a href="https://www.postgresql.org/message-id/CADT4RqDVBbqSbQVH_v_vS5_9DPhjsfmQw07E+q-ddR_XfZjffw@mail.gmail.com">message</a> about this to the PostgreSQL maintainers, and Tom Lane explained that if we’re absolutely strict, then even string comparison isn’t really deterministic, since it depends on collation rules which may also change. One could claim that if users need an auto-updating column that uses <code class="language-plaintext highlighter-rouge">AT TIME ZONE</code>, they’ll end up doing it with a trigger in any case, like we’ve done above for SQL Server; so we may as well make it easier and not disallow it in generated columns. It’s the user’s responsibility to take care of resyncing in any case.</p>

<p>Finally, if you think that converting a local date to UTC is simple - even when we know the time zone - then I encourage you to read the “Ambiguous and skipped times” section in Jon Skeet’s <a href="https://codeblog.jonskeet.uk/2019/03/27/storing-utc-is-not-a-silver-bullet/">post</a>. Timestamps are just so much fun.</p>]]></content><author><name>Shay Rojansky</name></author><summary type="html"><![CDATA[When “UTC everywhere” isn’t enough]]></summary></entry><entry><title type="html">Mapping .NET Timestamps to PostgreSQL</title><link href="https://www.roji.org/postgresql-dotnet-timestamp-mapping" rel="alternate" type="text/html" title="Mapping .NET Timestamps to PostgreSQL" /><published>2021-10-10T00:00:00+02:00</published><updated>2021-10-10T00:00:00+02:00</updated><id>https://www.roji.org/postgresql-dotnet-timestamp-mapping</id><content type="html" xml:base="https://www.roji.org/postgresql-dotnet-timestamp-mapping"><![CDATA[<p><strong>INTERESTED IN TIMESTAMPS? SEE ALSO <a href="/storing-timezones-in-the-db">When “UTC everywhere” isn’t enough - storing time zones in PostgreSQL and SQL Server</a></strong></p>

<p>Npgsql 6.0 contains some significant changes to how timestamps are mapped between .NET and PostgreSQL - most applications will need to react to this (although a compatibility flag exists). This post gives the context for these changes, going over the timestamp types on both sides and the problems in mapping them.</p>

<h2 id="postgresql-timestamps">PostgreSQL timestamps</h2>

<p>As with most things, PostgreSQL conforms to the SQL standard when it comes to timestamps (<a href="https://www.postgresql.org/docs/current/datatype-datetime.html">full docs</a>): it has a <code class="language-plaintext highlighter-rouge">timestamp without time zone</code> and a <code class="language-plaintext highlighter-rouge">timestamp with time zone</code> type (the shorter aliases are <code class="language-plaintext highlighter-rouge">timestamp</code> and <code class="language-plaintext highlighter-rouge">timestamptz</code>). <code class="language-plaintext highlighter-rouge">timestamptz</code> is perhaps the worst-named type in the world: it does <strong>not</strong>, in fact, store a time zone in the database, but rather a UTC timestamp; this causes lots of confusion for users expecting to persist a full timezone-aware timestamp to PostgreSQL. In this sense, <code class="language-plaintext highlighter-rouge">timestamptz</code> is different from the SQL Server <a href="https://docs.microsoft.com/sql/t-sql/data-types/datetimeoffset-transact-sql"><code class="language-plaintext highlighter-rouge">datetimeoffset</code></a> type (but see the note below on why offsets may be a bad idea). What <code class="language-plaintext highlighter-rouge">timestamptz</code> <strong>is</strong> good for is storing and interacting with UTC timestamps, or globally agreed-upon points in time, where the time zone does not matter. For example, when recording the time a transaction took place, you typically store a UTC timestamp, and then display it in the user’s local time zone, as reported by their web browser; this allows you to show the same timestamp to multiple users, each in their own time zone, and also supports users being in one time zone today and another tomorrow. This is sometimes called doing “UTC everywhere”, and it tends to work well as a default pattern.</p>

<p>In the relatively rarer cases where you need to store the time zone along with a timestamp, a separate column must be used alongside your timestamp column, typically holding a string representation of the time zone (e.g. <code class="language-plaintext highlighter-rouge">Europe/Berlin</code>)<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>.</p>

<p>The other type - <code class="language-plaintext highlighter-rouge">timestamp</code> - can be used to store a timestamp whose time zone is unknown, implicit or assumed to be local. It’s really important to understand that this does not represent a specific point in time unless coupled with some time zone: the same date/time combination corresponds to different universal instants in different time zones. PostgreSQL does have a <a href="https://www.postgresql.org/docs/current/runtime-config-client.html#GUC-TIMEZONE"><code class="language-plaintext highlighter-rouge">TimeZone</code></a> connection state parameter, which defines the “local time zone” of the connection; it’s defined in your PostgreSQL configuration by default, and can be changed in your connection. When converting a <code class="language-plaintext highlighter-rouge">timestamp</code> into a <code class="language-plaintext highlighter-rouge">timestamptz</code> (remember: the latter means “UTC”), PostgreSQL will treat your <code class="language-plaintext highlighter-rouge">timestamp</code> as a local timestamp, and convert it to UTC based on the connection’s current <code class="language-plaintext highlighter-rouge">TimeZone</code>. However, fiddling around with your connection’s <code class="language-plaintext highlighter-rouge">TimeZone</code> and depending on your database to do timezone conversions usually isn’t a practical way to do things - you typically want to store and retrieve UTC timestamps from your database, and do any conversions to/from local timezones in your application, when interacting with users.</p>
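<p>You can see this conversion behavior for yourself in psql - a quick sketch (the offset shown in the output depends on the connection’s <code class="language-plaintext highlighter-rouge">TimeZone</code> setting):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SET TimeZone = 'Europe/Berlin';

-- The unspecified local timestamp is interpreted in the connection's TimeZone:
SELECT '2021-10-10 10:00:00'::timestamp::timestamptz;
-- 2021-10-10 10:00:00+02: the stored UTC instant is 2021-10-10 08:00:00,
-- since Berlin was at UTC+02:00 on that date
</code></pre></div></div>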

<h2 id="net-timestamps">.NET timestamps</h2>

<p>The .NET situation around timestamps is… not pretty… .NET has some basic flaws in this area which have been with us since the beginning of time, and cannot be corrected without introducing unacceptable breaking changes. The .NET timestamp arsenal includes two main types: <a href="https://docs.microsoft.com/dotnet/api/system.datetime"><code class="language-plaintext highlighter-rouge">DateTime</code></a> and <a href="https://docs.microsoft.com/dotnet/api/system.datetimeoffset"><code class="language-plaintext highlighter-rouge">DateTimeOffset</code></a>.</p>

<p><code class="language-plaintext highlighter-rouge">DateTime</code> unsurprisingly contains a date and a time, but also a <a href="https://docs.microsoft.com/dotnet/api/system.datetimekind">Kind</a> property which can be <code class="language-plaintext highlighter-rouge">Utc</code>, <code class="language-plaintext highlighter-rouge">Local</code> or <code class="language-plaintext highlighter-rouge">Unspecified</code>: <code class="language-plaintext highlighter-rouge">Utc</code> is pretty self-explanatory, <code class="language-plaintext highlighter-rouge">Local</code> means a timestamp in the timezone of the machine where .NET is running, and <code class="language-plaintext highlighter-rouge">Unspecified</code> is, well, not very specified. One problematic aspect of DateTime is that these very different concepts are represented via the same .NET type: if a function accepts a DateTime, which Kind should you pass in? What happens when you compare a UTC DateTime with an Unspecified one? (The answer is that the timestamps will be compared disregarding the Kind, which I can’t imagine can produce fruitful results in any sane application). To know more about DateTime’s failings, see this <a href="https://blog.nodatime.org/2011/08/what-wrong-with-datetime-anyway.html">excellent blog post</a> by Jon Skeet.</p>
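<p>The comparison pitfall is easy to demonstrate with plain .NET, nothing database-specific:</p>

```csharp
using System;

// Two timestamps with identical ticks but very different meanings:
var utc = new DateTime(2021, 1, 1, 10, 0, 0, DateTimeKind.Utc);
var unspecified = new DateTime(2021, 1, 1, 10, 0, 0, DateTimeKind.Unspecified);

// Equality and ordering look only at the ticks - the Kind is ignored:
Console.WriteLine(utc == unspecified);         // True
Console.WriteLine(utc.CompareTo(unspecified)); // 0
```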

<p>DateTimeOffset is at least less ambiguous than DateTime: it’s a date and time, plus a timezone offset. Taken together, these identify a specific instant in time, and so a DateTimeOffset can always be unambiguously converted to a UTC timestamp, if needed. Its API still has some issues (see Jon’s post above), but in my opinion, the main problem with this type is that it gives the illusion of being timezone-aware without delivering on it. An offset (e.g. <code class="language-plaintext highlighter-rouge">UTC+01:00</code>) is <strong>not</strong> a timezone (e.g. IANA/Olson <code class="language-plaintext highlighter-rouge">Europe/Berlin</code>): timezones contain information about daylight saving time, which a simple offset does not; Berlin is sometimes at <code class="language-plaintext highlighter-rouge">UTC+01:00</code> and sometimes at <code class="language-plaintext highlighter-rouge">UTC+02:00</code>. This is especially important if you’re going to do arithmetic on a timestamp: if you add a few hours to a timestamp, an accurate result would have to take daylight savings into account. And if you’re not doing arithmetic, then you may not need the timezone in the first place (why not just use UTC?). The same criticism goes for the SQL Server <code class="language-plaintext highlighter-rouge">datetimeoffset</code> type: it makes you think you’re good, while neglecting daylight savings time.</p>
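<p>The offset-vs-timezone distinction is easy to see with <code>TimeZoneInfo</code> (IANA IDs such as <code>Europe/Berlin</code> resolve cross-platform on .NET 6+; on older .NET on Windows you'd need the Windows zone ID instead):</p>

```csharp
using System;

var berlin = TimeZoneInfo.FindSystemTimeZoneById("Europe/Berlin");

// The same time zone yields different offsets depending on the instant:
var winter = berlin.GetUtcOffset(new DateTime(2021, 1, 15, 12, 0, 0, DateTimeKind.Utc));
var summer = berlin.GetUtcOffset(new DateTime(2021, 7, 15, 12, 0, 0, DateTimeKind.Utc));

Console.WriteLine(winter); // 01:00:00 (standard time)
Console.WriteLine(summer); // 02:00:00 (daylight saving time)
```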

<p>Oh, and since I mentioned Jon Skeet above, you should absolutely take a look at his <a href="https://nodatime.org/">NodaTime</a> library: this is how date/time types are done right. I’d recommend that any serious application that needs to deal with timestamps seriously consider using it, and Npgsql even fully supports it (both at the <a href="https://www.npgsql.org/doc/types/nodatime.html">ADO.NET</a> and <a href="https://www.npgsql.org/efcore/mapping/nodatime.html">EF Core</a> levels).</p>

<h2 id="mapping-net-to-postgresql">Mapping .NET to PostgreSQL</h2>

<p>One of the tasks of a database driver is to map two different type systems to one another; in our case, the .NET types must be mapped to the PostgreSQL ones. The mapping is sometimes simple (e.g. .NET <code class="language-plaintext highlighter-rouge">long</code> corresponds perfectly to a PostgreSQL <code class="language-plaintext highlighter-rouge">bigint</code>), but sometimes it’s quite complex. You guessed it: timestamps fall in the latter basket.</p>

<p>One curious thing with PostgreSQL <code class="language-plaintext highlighter-rouge">timestamptz</code>, is that while it’s stored as a UTC timestamp in the database, its textual representation is a local timestamp based on the <code class="language-plaintext highlighter-rouge">TimeZone</code> connection parameter: reading a <code class="language-plaintext highlighter-rouge">timestamptz</code> as text yields something like <code class="language-plaintext highlighter-rouge">2004-10-19 10:23:54+02</code>. Unfortunately this odd behavior shaped Npgsql’s original timestamp mapping in a significant way: reading a <code class="language-plaintext highlighter-rouge">timestamptz</code> returns a Local DateTime<sup id="fnref:1:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. Among other things, this means you cannot round-trip a UTC DateTime: you can send it just fine, but when you read it back, you get a converted local timestamp. A similar thing was done with DateTimeOffset: Npgsql converted it to UTC before sending, and returned a DateTimeOffset in the machine’s time zone when reading (remember, no timezone or offset is actually stored in the database!): if I send a DateTimeOffset with offset <code class="language-plaintext highlighter-rouge">+02:00</code> on a machine configured with offset <code class="language-plaintext highlighter-rouge">+01:00</code>, it would be saved to UTC but read back with <code class="language-plaintext highlighter-rouge">+01:00</code>, with Npgsql doing all the conversions. This state of affairs led to a lot of general confusion, and made it quite difficult to support simple “UTC everywhere” programming, where you send a UTC timestamp to the database, and read it back in the same way.</p>

<p>In Npgsql 6.0, we redid the timestamp mapping with the following principles in mind:</p>

<ul>
  <li>1st-class support for the “UTC everywhere” pattern, and promote it as the default timestamp strategy.</li>
  <li>Cleanly separate UTC timestamps from non-UTC timestamps as two different types, and disallow mixing them to protect against accidental errors.</li>
  <li>Values should always be round-trippable - whatever you send to PostgreSQL, you should get the same thing back. If we can’t roundtrip it, we should refuse to write it.</li>
  <li>Values should never undergo any implicit timezone conversions when being read or written. Any conversions should be done by the user, making them clear and explicit in the code.</li>
</ul>

<p>This means the following concrete things:</p>

<ul>
  <li>We now send UTC DateTime as <code class="language-plaintext highlighter-rouge">timestamptz</code>, and Local/Unspecified DateTime as <code class="language-plaintext highlighter-rouge">timestamp</code>; trying to send a non-UTC DateTime as <code class="language-plaintext highlighter-rouge">timestamptz</code> will throw an exception, etc. In effect, Npgsql is creating a strict type distinction between the different DateTime Kinds (which is how they should have been represented in the first place in .NET).<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></li>
  <li>We only allow sending a DateTimeOffset with offset 0 (as <code class="language-plaintext highlighter-rouge">timestamptz</code>): since the offset isn’t stored in the database, it can’t be round-tripped.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></li>
  <li>Reading a <code class="language-plaintext highlighter-rouge">timestamptz</code> will now yield a UTC DateTime or DateTimeOffset - no more implicit conversions.</li>
  <li>By nature, EF Core must make mapping decisions solely based on the type (DateTime) and cannot take the Kind into account, so we have to pick one type or the other as the default. Since “UTC everywhere” should generally be preferred, we now map DateTime to <code class="language-plaintext highlighter-rouge">timestamptz</code>.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></li>
  <li>Corresponding changes were done to the NodaTime mappings, though the situation is much simpler there, since the different concepts are represented by different .NET types (e.g. <a href="https://nodatime.org/3.0.x/userguide/concepts">Instant vs. LocalDateTime</a>).</li>
</ul>
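<p>In code, the new rules look roughly like this - a hedged sketch: it assumes an open <code>NpgsqlConnection</code> named <code>conn</code> and a hypothetical <code>events</code> table with a <code>timestamptz</code> column, and the exact exception message may vary between versions:</p>

```csharp
using System;
using Npgsql;
using NpgsqlTypes;

await using var cmd = new NpgsqlCommand(
    "INSERT INTO events (occurred_at) VALUES ($1)", conn);

// A UTC DateTime maps to timestamptz and round-trips unchanged:
cmd.Parameters.Add(new() { Value = DateTime.UtcNow });
await cmd.ExecuteNonQueryAsync();

// A Local DateTime maps to timestamp; forcing it into timestamptz throws,
// instead of being silently converted behind your back:
cmd.Parameters[0] = new NpgsqlParameter
{
    Value = DateTime.Now,
    NpgsqlDbType = NpgsqlDbType.TimestampTz
};
try
{
    await cmd.ExecuteNonQueryAsync();
}
catch (InvalidCastException)
{
    // e.g. "Cannot write DateTime with Kind=Local to PostgreSQL type
    // 'timestamp with time zone', only UTC is supported."
}
```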

<p>As you can imagine, the above implies a lot of breaking changes… This is not something we did lightly, but we do believe our users will end up in a better place. However, we’ve also provided a backwards compatibility flag which allows reverting to the previous behavior; <a href="https://www.npgsql.org/doc/types/datetime.html">see the documentation</a>.</p>
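<p>The compatibility flag itself is an <code>AppContext</code> switch, set at application startup before any Npgsql operations take place:</p>

```csharp
using System;

// Revert to the pre-6.0 timestamp behavior:
AppContext.SetSwitch("Npgsql.EnableLegacyTimestampBehavior", true);
```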

<p>Please let us know what you think! Don’t hesitate to open questions on the <a href="https://github.com/npgsql/npgsql">Npgsql</a> or <a href="https://github.com/npgsql/efcore.pg">EF Core provider</a> repos, or to ping me <a href="">on twitter</a>.</p>

<h2 id="appendix-what-postgresql-or-the-sql-standard-got-wrong">Appendix: what PostgreSQL (or the SQL standard) got wrong</h2>

<p>For those interested, here are a few thoughts on flaws in the PostgreSQL timestamp system - though the SQL standard is probably the one at fault here. There’s not much to be done about these, but it’s important to be aware of them.</p>

<ul>
  <li>The naming is quite atrocious:
    <ul>
      <li><code class="language-plaintext highlighter-rouge">timestamp with time zone</code> has a timezone in its textual representation, but not in its storage.</li>
      <li>This is also inconsistent with the type <code class="language-plaintext highlighter-rouge">time with time zone</code>, which <em>does</em> store an offset in the database.</li>
      <li>The <code class="language-plaintext highlighter-rouge">timestamp</code> name is bound to make people use it as the default, though it probably is not the thing most applications want.</li>
    </ul>
  </li>
  <li>PostgreSQL implicitly casts between <code class="language-plaintext highlighter-rouge">timestamp</code> and <code class="language-plaintext highlighter-rouge">timestamptz</code>, making it easy to accidentally get a timezone conversion; it would have been better to require explicit conversions instead. For example, the <a href="https://www.postgresql.org/docs/13/functions-datetime.html#FUNCTIONS-DATETIME-TABLE"><code class="language-plaintext highlighter-rouge">extract</code></a> function accepts a <code class="language-plaintext highlighter-rouge">timestamp</code>, so passing a <code class="language-plaintext highlighter-rouge">timestamptz</code> would cause an implicit timezone conversion.</li>
  <li>It’s arguably a bad idea for the <code class="language-plaintext highlighter-rouge">timestamptz</code> textual representation to contain a local representation.</li>
  <li><code class="language-plaintext highlighter-rouge">timestamp</code> is sometimes treated as a local timestamp (e.g. when converting it to <code class="language-plaintext highlighter-rouge">timestamptz</code>), but sometimes is simply an unspecified timestamp.</li>
  <li><code class="language-plaintext highlighter-rouge">time with time zone</code> makes no sense.</li>
</ul>
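<p>The implicit-conversion pitfall above is easy to trip over in psql (illustrative values; assumes the session <code>TimeZone</code> is <code>Europe/Berlin</code>, in winter):</p>

```sql
SET TimeZone = 'Europe/Berlin';

-- 10:00 UTC is implicitly converted to 11:00 Berlin time before extraction:
SELECT extract(hour FROM '2021-01-01 10:00:00+00'::timestamptz);
-- → 11
```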

<p><strong>INTERESTED IN TIMESTAMPS? SEE ALSO <a href="/storing-timezones-in-the-db">When “UTC everywhere” isn’t enough - storing time zones in PostgreSQL and SQL Server</a></strong></p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">

      <p>For more information on when “UTC Everywhere” is less appropriate and how to deal with it, see this great <a href="https://codeblog.jonskeet.uk/2019/03/27/storing-utc-is-not-a-silver-bullet/">post</a> by Jon Skeet. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:1:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p>
    </li>
    <li id="fn:2" role="doc-endnote">

      <p>Note that Npgsql did the timezone conversion based on the machine’s timezone, rather than based on the PostgreSQL <code class="language-plaintext highlighter-rouge">TimeZone</code>, so did not match the PostgreSQL behavior in any case. This was because .NET had no way of parsing the PostgreSQL IANA/Olson timezone IDs until <a href="https://devblogs.microsoft.com/dotnet/date-time-and-time-zone-enhancements-in-net-6/">.NET 6.0</a>. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">

      <p>Incidentally, this is the only case in the Npgsql type mapping system where the PostgreSQL type depends not only on the CLR type (DateTime), but also on its value (the Kind). This isn’t trivial to do, and especially to do efficiently - thanks once again, DateTime! <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">

      <p>There are a few specific cases where we allow non-round-trippability. For one, PostgreSQL has only microsecond precision, whereas the .NET types have tick precision (100 nanoseconds); the driver silently truncates the extra precision rather than throwing. We also allow writing <code class="language-plaintext highlighter-rouge">default(DateTime)</code> as <code class="language-plaintext highlighter-rouge">timestamptz</code> even though its Kind is Unspecified. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Shay Rojansky</name></author><summary type="html"><![CDATA[INTERESTED IN TIMESTAMPS? SEE ALSO When “UTC everywhere” isn’t enough - storing time zones in PostgreSQL and SQL Server]]></summary></entry><entry><title type="html">Query parameters, batching and SQL rewriting</title><link href="https://www.roji.org/parameters-batching-and-sql-rewriting" rel="alternate" type="text/html" title="Query parameters, batching and SQL rewriting" /><published>2021-08-17T00:00:00+02:00</published><updated>2021-08-17T00:00:00+02:00</updated><id>https://www.roji.org/parameters-batching-and-sql-rewriting</id><content type="html" xml:base="https://www.roji.org/parameters-batching-and-sql-rewriting"><![CDATA[<p>In the upcoming version 6.0 of the Npgsql PostgreSQL driver for .NET, we implemented what I think of as “raw mode” (<a href="https://github.com/npgsql/npgsql/pull/3852">#3852</a>). In a nutshell, this means that you can now use Npgsql without it doing anything to the SQL you provide it - it will simply send your queries as-is to PostgreSQL, without parsing or rewriting them in any way. Explaining what this means is a great opportunity to go into some interesting aspects of database programming - so let’s dive in.</p>

<h2 id="parameters">Parameters</h2>

<p>Parameters are important in database programming: instead of putting values directly into your SQL query, you integrate a placeholder which references a parameter value that’s delivered separately. This is important for preventing SQL injection attacks, but also helps performance through plan caching and prepared statements. Anybody who’s used .NET’s database API (ADO.NET) knows how parameters work:</p>

<div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">var</span> <span class="n">cmd</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">NpgsqlCommand</span><span class="p">(</span><span class="s">"SELECT * FROM employees WHERE first_name = @FirstName AND age = @Age"</span><span class="p">,</span> <span class="n">conn</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Parameters</span> <span class="p">=</span>
    <span class="p">{</span>
        <span class="k">new</span><span class="p">(</span><span class="s">"FirstName"</span><span class="p">,</span> <span class="s">"Shay"</span><span class="p">),</span>
        <span class="k">new</span><span class="p">(</span><span class="s">"Age"</span><span class="p">,</span> <span class="m">18</span><span class="p">)</span>
    <span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>

<p>A command has a collection of parameters, and each parameter has a name and a value. Pretty straightforward… or is it?</p>

<p>It turns out that while some databases accept such name/value parameter pairs (e.g. SQL Server), PostgreSQL actually has a positional parameter system! Rather than the named parameter placeholders <code class="language-plaintext highlighter-rouge">@FirstName</code> and <code class="language-plaintext highlighter-rouge">@Age</code>, it expects to get <code class="language-plaintext highlighter-rouge">$1</code> and <code class="language-plaintext highlighter-rouge">$2</code>, which refer to positions in the parameter list. And indeed - there’s quite a zoo of parameter placeholders once you look around: Oracle does have named parameters like SQL Server, but uses a colon as the prefix (so <code class="language-plaintext highlighter-rouge">:Age</code> rather than <code class="language-plaintext highlighter-rouge">@Age</code>). In ODBC, parameter placeholders are simply question marks, which also bind positionally to parameters (this means that it’s impossible to refer to the same parameter twice without sending it twice as well).</p>

<p>What a mess. Now, the ADO.NET documentation calls all this out, <a href="https://docs.microsoft.com/dotnet/framework/data/adonet/configuring-parameters-and-parameter-data-types#working-with-parameter-placeholders">clearly stating</a> that parameter placeholders vary across data providers. In fact, if you look at the <a href="https://docs.microsoft.com/dotnet/api/system.data.common.dbparametercollection">DbParameterCollection</a> class, you’ll find a collection that is both named like a dictionary (for SQL Server, Oracle…) and ordered like an array (for ODBC, PostgreSQL). But at some prehistoric moment in Npgsql’s history, someone made the decision to support named parameters. This was probably done to make it easier to port applications from SQL Server to PostgreSQL, without having to change any SQL - not a bad idea. Unfortunately, that also means that Npgsql has to internally parse your SQL query and rewrite it to send the following to PostgreSQL:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">employees</span> <span class="k">WHERE</span> <span class="n">first_name</span> <span class="o">=</span> <span class="err">$</span><span class="mi">1</span> <span class="k">AND</span> <span class="n">age</span> <span class="o">=</span> <span class="err">$</span><span class="mi">2</span>
</code></pre></div></div>

<h2 id="batching">Batching</h2>

<p>As fascinating as the above mess is, let’s leave it for a second and concentrate on something else - statement batching. When you want to execute two unrelated SQL statements, it’s far more efficient to send both at the same time, and not wait for the first to complete before sending the second. In principle, any type of SQL statement can be batched in this way: an UPDATE and a DELETE, 5 different SELECTs, anything; if you’re not already batching where you could be, I highly recommend giving it a try.</p>

<p>The current way to perform batching with ADO.NET looks like this:</p>

<div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">var</span> <span class="n">cmd</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">NpgsqlCommand</span><span class="p">(</span><span class="s">"SELECT * FROM employees; SELECT * FROM departments"</span><span class="p">,</span> <span class="n">conn</span><span class="p">);</span>
</code></pre></div></div>

<p>You simply pack two SQL statements into a single command - separated by a semicolon - and execute that command as a single batch. Pretty straightforward… or is it?</p>

<p>The above works as-is on SQL Server, but the situation is a bit more complicated on PostgreSQL. PostgreSQL supports two protocols on the wire: the simple protocol and the extended protocol. The former does allow sending multiple statements as above, but has no support for parameters, prepared statements and various other features. At some point in the past, Npgsql actually used the simple protocol, and got around the lack of parameter support by interpolating parameter values directly into the SQL (client-side binding); this meant Npgsql needed to know how to generate (and parse) string representations of all supported data types, and that’s still inefficient due to the lack of real parameterization and prepared statements. Modern Npgsql exclusively uses the extended protocol, where each protocol message corresponds to exactly one SQL statement, with its own parameter list - no semicolons allowed.</p>

<p>So how does the above batching code work? You guessed it! Npgsql parses the SQL, locates the semicolons and breaks up the command’s text into multiple extended protocol messages.</p>
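<p>To get a feel for why this is delicate, here's a toy splitter - emphatically <em>not</em> Npgsql's actual parser; it ignores escaped quotes, dollar-quoting, comments and much more - which at least knows not to split inside string literals:</p>

```csharp
using System;
using System.Collections.Generic;

static List<string> SplitBatch(string sql)
{
    var statements = new List<string>();
    var start = 0;
    var inLiteral = false;

    for (var i = 0; i < sql.Length; i++)
    {
        switch (sql[i])
        {
            case '\'':
                inLiteral = !inLiteral; // naive: no escaping, no dollar-quoting
                break;
            case ';' when !inLiteral:
                statements.Add(sql[start..i].Trim());
                start = i + 1;
                break;
        }
    }

    var last = sql[start..].Trim();
    if (last.Length > 0)
        statements.Add(last);

    return statements;
}

// The semicolon inside the string literal is correctly left alone:
var parts = SplitBatch("SELECT 'a;b'; SELECT * FROM departments");
Console.WriteLine(parts.Count); // 2
Console.WriteLine(parts[0]);    // SELECT 'a;b'
```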

<h2 id="so-whats-the-big-deal">So what’s the big deal?</h2>

<p>We’ve seen two reasons why Npgsql has to mess around with your command’s SQL: to rewrite your named parameter placeholders into PostgreSQL native positional ones, and to break up any multiple statements for batching. But why should we care about all that?</p>

<ol>
  <li>Parsing the PostgreSQL SQL dialect isn’t trivial. For example, we must avoid manipulating string literals, which may contain semicolons or text that looks like placeholders. Of course, Npgsql doesn’t include a full SQL parser - that would be very hard to do - but rather a very small parser that knows the absolute minimum in order to perform its job. Now, we haven’t had any bugs recently, but I’m sure that if I really dove in there, I could produce cases where the parser mistakenly identifies a placeholder or semicolon where it shouldn’t, or vice versa. It’s an inherently unsafe situation.</li>
  <li>Beyond correctness, both parsing and producing the rewritten SQL is work, which can hurt performance. The longer the SQL query and the more parameters it has, the more overhead this process adds to query execution. Nobody wants that.</li>
  <li>When managing a parameter collection (e.g. <code class="language-plaintext highlighter-rouge">NpgsqlParameterCollection</code>), we have to maintain an internal dictionary that indexes parameters by their name. If we didn’t have to handle names, the collection would become a simple list - this is more efficient.</li>
  <li>Lastly and most importantly, I hate it. I believe a database driver’s job is to transmit the SQL users give it, without manipulating it in any way. Simple, easy, efficient, no frills.</li>
</ol>

<p>So what can be done about this?</p>

<h2 id="going-raw">Going raw</h2>

<p>The first step towards removing SQL manipulation is the introduction of a proper, 1st-class batching API: rather than packing multiple statements into a single semicolon-delimited string, a structured API would allow the user to manage multiple statements within a batch. The driver would then receive a batch which is already broken down into the various statements, and would no longer need to search for semicolons. The upcoming .NET 6.0 features <a href="https://github.com/dotnet/runtime/issues/28633">a new ADO.NET batching API</a> which does precisely this:</p>

<div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">var</span> <span class="n">batch</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">NpgsqlBatch</span><span class="p">(</span><span class="n">conn</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">BatchCommands</span> <span class="p">=</span>
    <span class="p">{</span>
        <span class="k">new</span><span class="p">(</span><span class="s">"SELECT * FROM employees"</span><span class="p">),</span>
        <span class="k">new</span><span class="p">(</span><span class="s">"SELECT * FROM departments"</span><span class="p">),</span>
    <span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>
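<p>Executing and consuming such a batch looks much like a multi-result command (a sketch assuming the <code>batch</code> above and an open connection; <code>NextResultAsync</code> moves between the statements' result sets):</p>

```csharp
await using var reader = await batch.ExecuteReaderAsync();

do
{
    while (await reader.ReadAsync())
    {
        // process the current statement's rows, e.g. reader.GetString(0)
    }
}
while (await reader.NextResultAsync());
```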

<p>The question of parameter placeholders is a bit trickier. Starting with Npgsql 6.0, you can do the following:</p>

<div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">var</span> <span class="n">cmd</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">NpgsqlCommand</span><span class="p">(</span><span class="s">"SELECT * FROM employees WHERE first_name = $1 AND age = $2"</span><span class="p">,</span> <span class="n">conn</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Parameters</span> <span class="p">=</span>
    <span class="p">{</span>
        <span class="k">new</span><span class="p">()</span> <span class="p">{</span> <span class="n">Value</span> <span class="p">=</span> <span class="s">"Shay"</span> <span class="p">},</span>
        <span class="k">new</span><span class="p">()</span> <span class="p">{</span> <span class="n">Value</span> <span class="p">=</span> <span class="m">18</span> <span class="p">}</span>
    <span class="p">}</span>
<span class="p">};</span>
</code></pre></div></div>

<p>Our parameters no longer have names! And since that’s the case (<a href="https://docs.microsoft.com/dotnet/api/system.data.common.dbparameter.parametername"><code class="language-plaintext highlighter-rouge">NpgsqlParameter.ParameterName</code></a> is null), Npgsql implicitly switches into “raw mode”, where it no longer performs any parsing or rewriting of your SQL. One consequence of this is that “legacy batching” - multiple semicolon-separated statements - is no longer supported; if you use positional parameters, you must also use the new batching API. If you use named parameters, Npgsql will continue behaving as before, rewriting your SQL in order to maintain full backwards compatibility.</p>

<p>Everything seems to be neatly taken care of - except for one small point. If your command has no parameters at all, Npgsql cannot be sure that there isn’t a semicolon hiding somewhere in your SQL, and must grudgingly fall back to parsing; not doing so would break backwards compatibility. So we added an AppContext switch which allows opting into raw mode everywhere, always:</p>

<div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">AppContext</span><span class="p">.</span><span class="nf">SetSwitch</span><span class="p">(</span><span class="s">"Npgsql.EnableSqlRewriting"</span><span class="p">,</span> <span class="k">false</span><span class="p">);</span>

<span class="kt">var</span> <span class="n">cmd</span> <span class="p">=</span> <span class="k">new</span> <span class="nf">NpgsqlCommand</span><span class="p">(</span><span class="s">"SELECT * FROM employees"</span><span class="p">,</span> <span class="n">conn</span><span class="p">);</span>
</code></pre></div></div>

<p>Disabling rewriting will make your queries fail if they contain any named parameters or make use of semicolons for batching. Aside from optimizing the zero-parameters case, this switch can ensure that your application is always communicating in the safest and most efficient way with PostgreSQL.</p>

<h2 id="epilog">Epilog</h2>

<p>All the above is available in Npgsql as of 6.0.0-preview7. Unfortunately, positional parameters and the new batching API require changes in layers used over Npgsql: I’m not sure to what extent Dapper supports positional parameters, and EF Core requires some changes in order to support everything too; it’s unfortunately too late in the EF Core release cycle to make that happen, but I plan to work on that for EF Core 7.0.</p>

<p>One last point… The new .NET batching API wasn’t introduced just so that Npgsql could avoid parsing its SQL. While SQL Server does natively support multiple semicolon-separated statements in a single command (or “batch” in SQL Server parlance), there are some significant drawbacks to doing so - <a href="https://docs.microsoft.com/en-us/archive/blogs/dataaccess/does-ado-net-update-batching-really-do-something">read this old post for the details</a>. We also have good reason to believe that the MySQL provider can benefit from a better batching API as well - so lots to look forward to.</p>

<p>Oh, and thanks to <a href="https://github.com/NinoFloris/">@NinoFloris</a> for some very helpful conversations on this!</p>

<p><strong>UPDATE 2022-05-09</strong>: Amazing timing… PostgreSQL 14 has introduced new syntax which breaks Npgsql’s SQL parsing logic, and will probably be non-trivial to recognize properly… see <a href="https://github.com/npgsql/npgsql/issues/4445">this issue</a>. This shows what a bad idea it is for a driver to be parsing SQL.</p>]]></content><author><name>Shay Rojansky</name></author><summary type="html"><![CDATA[In the upcoming version 6.0 of the Npgsql PostgreSQL driver for .NET, we implemented what I think of as “raw mode” (#3852). In a nutshell, this means that you can now use Npgsql without it doing anything to the SQL you provide it - it will simply send your queries as-is to PostgreSQL, without parsing or rewriting them in any way. Explaining what this means is a great opportunity to go into some interesting aspects of database programming - so let’s dive in.]]></summary></entry><entry><title type="html">EF Core 7.0 Update Performance Improvements</title><link href="https://www.roji.org/efcore-7-update-perf" rel="alternate" type="text/html" title="EF Core 7.0 Update Performance Improvements" /><published>2021-07-12T00:00:00+02:00</published><updated>2021-07-12T00:00:00+02:00</updated><id>https://www.roji.org/efcore-7-update-perf</id><content type="html" xml:base="https://www.roji.org/efcore-7-update-perf"><![CDATA[<p>For the 7.0.0-preview6 release of Entity Framework Core, <a href="https://devblogs.microsoft.com/dotnet/announcing-ef7-preview6/">I wrote a blog post about the update pipeline performance improvements introduced into EF Core 7</a>.</p>]]></content><author><name>Shay Rojansky</name></author><summary type="html"><![CDATA[For the 7.0.0-preview6 release of Entity Framework Core, I wrote a blog post about the update pipeline performance improvements introduced into EF Core 7.]]></summary></entry><entry><title type="html">EF Core 6.0 Performance Improvements</title><link href="https://www.roji.org/efcore-6-perf" rel="alternate" type="text/html" 
title="EF Core 6.0 Performance Improvements" /><published>2021-05-25T00:00:00+02:00</published><updated>2021-05-25T00:00:00+02:00</updated><id>https://www.roji.org/efcore-6-perf</id><content type="html" xml:base="https://www.roji.org/efcore-6-perf"><![CDATA[<p>For the 6.0.0-preview4 release of Entity Framework Core, <a href="https://devblogs.microsoft.com/dotnet/announcing-entity-framework-core-6-0-preview-4-performance-edition/">I wrote a blog post about the performance improvements introduced into EF Core 6</a>.</p>]]></content><author><name>Shay Rojansky</name></author><summary type="html"><![CDATA[For the 6.0.0-preview4 release of Entity Framework Core, I wrote a blog post about the performance improvements introduced into EF Core 6.]]></summary></entry><entry><title type="html">The Curious Case of Commands and Cancellation</title><link href="https://www.roji.org/db-commands-and-cancellation" rel="alternate" type="text/html" title="The Curious Case of Commands and Cancellation" /><published>2020-10-15T00:00:00+02:00</published><updated>2020-10-15T00:00:00+02:00</updated><id>https://www.roji.org/db-commands-and-cancellation</id><content type="html" xml:base="https://www.roji.org/db-commands-and-cancellation"><![CDATA[<p>Async brought a world of goodness (and complexity) to .NET, including the concept of cancellation: since async operations are by their nature supposed to take a while, it makes sense to allow us to cancel mid-way and exit early. Like async in general, cancellation took some time to propagate everywhere - socket operations only started honoring cancellation token in <a href="https://github.com/dotnet/runtime/issues/23736">.NET Core 3.0</a>. For the upcoming 5.0 release of Npgsql, the PostgreSQL database driver, a lot of work is going on to provide a good command cancellation story (thanks <a href="https://github.com/vonzshik">@vonzshik</a>!!), and it is far more complicated than you’d think.</p>

<p><a href="https://docs.microsoft.com/dotnet/api/system.data.common.dbcommand">DbCommand</a> is the standard .NET type which represents a query you run against a database - you set SQL and parameters on it, invoke <a href="https://docs.microsoft.com/dotnet/api/system.data.common.dbcommand.executereaderasync">ExecuteReaderAsync</a>, and it gives you back a <a href="https://docs.microsoft.com/dotnet/api/system.data.common.dbdatareader">DbDataReader</a> which allows you to consume the results:</p>

<div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">await</span> <span class="k">using</span> <span class="nn">var</span> <span class="n">cmd</span> <span class="p">=</span> <span class="n">connection</span><span class="p">.</span><span class="nf">CreateCommand</span><span class="p">();</span>
<span class="n">cmd</span><span class="p">.</span><span class="n">CommandText</span> <span class="p">=</span> <span class="s">"SELECT something_from_the_database"</span><span class="p">;</span>

<span class="k">await</span> <span class="k">using</span> <span class="nn">var</span> <span class="n">reader</span> <span class="p">=</span> <span class="k">await</span> <span class="n">cmd</span><span class="p">.</span><span class="nf">ExecuteReaderAsync</span><span class="p">();</span>
<span class="c1">// The query has been started and is now running in the background</span>
<span class="c1">// Consume results via the reader</span>
</code></pre></div></div>

<p>Now, back in the old days - before async was even a thing - DbCommand already had a <a href="https://docs.microsoft.com/en-us/dotnet/api/system.data.common.dbcommand.cancel?view=netcore-3.1#System_Data_Common_DbCommand_Cancel">Cancel</a> method. This method attempts to cancel the ongoing execution of the query, on a best-effort basis, by doing whatever is appropriate for your database. When async came along, all the database types were retrofitted with async methods accepting cancellation tokens: <a href="https://docs.microsoft.com/dotnet/api/system.data.common.dbcommand.executereaderasync">DbCommand.ExecuteReaderAsync</a>, <a href="https://docs.microsoft.com/dotnet/api/system.data.common.dbdatareader.readasync">DbDataReader.ReadAsync</a>, etc. Logically, invoking the cancellation token is the async analog of calling the old DbCommand.Cancel - both semantically mean the same thing. Or so it would seem.</p>

<p>When you pass a cancellation token to some method, the general expectation is for the token to control that specific invocation; if you trigger the token, that invocation should terminate as early as possible and throw an <a href="https://docs.microsoft.com/dotnet/api/system.operationcanceledexception">OperationCanceledException</a>. The simplest example is <a href="https://docs.microsoft.com/dotnet/api/system.net.http.httpclient.getasync#System_Net_Http_HttpClient_GetAsync_System_String_">HttpClient.GetAsync</a>: that method call represents one potentially long process, and the cancellation token can abort that process; when the method completes, you know nothing lingers in the background. The database API, in contrast, is more complex: when DbCommand.ExecuteReaderAsync completes, the query has only just started, is (likely) still running, and may continue running for a very long time. The DbDataReader it returns allows you to start processing the result stream, possibly in parallel, while the database server is still running the query and sending results back.</p>

<p>So ExecuteReaderAsync starts some background process (the query), which doesn’t complete when the method itself completes - why is that significant? One question this raises is how one goes about cancelling the query <em>after</em> ExecuteReaderAsync completes; the traditional DbCommand.Cancel API doesn’t have this problem, because it’s a method on the DbCommand, rather than a token you pass to some method call.</p>

<p>Another related question is what happens with the various methods on DbDataReader which also accept a cancellation token, such as ReadAsync: what should happen when the token for ReadAsync is triggered? The usual expectation in the async world is again, for the token to only cancel the method to which it was passed, i.e. ReadAsync; but if we do that, we’re left with no token-based means to cancel the query at all - which is a pretty important requirement. We could tell users who want to cancel the query to rig their cancellation tokens to call our old DbCommand.Cancel:</p>

<div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="k">async</span> <span class="n">Task</span> <span class="nf">ExecuteSomethingAsync</span><span class="p">(</span><span class="n">DbConnection</span> <span class="n">connection</span><span class="p">,</span> <span class="kt">string</span> <span class="n">sql</span><span class="p">,</span> <span class="n">CancellationToken</span> <span class="n">cancellationToken</span> <span class="p">=</span> <span class="k">default</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">await</span> <span class="k">using</span> <span class="nn">var</span> <span class="n">cmd</span> <span class="p">=</span> <span class="n">connection</span><span class="p">.</span><span class="nf">CreateCommand</span><span class="p">();</span>
    <span class="n">cmd</span><span class="p">.</span><span class="n">CommandText</span> <span class="p">=</span> <span class="n">sql</span><span class="p">;</span>

    <span class="k">await</span> <span class="k">using</span> <span class="nn">var</span> <span class="n">reader</span> <span class="p">=</span> <span class="k">await</span> <span class="n">cmd</span><span class="p">.</span><span class="nf">ExecuteReaderAsync</span><span class="p">(</span><span class="n">cancellationToken</span><span class="p">);</span>
    <span class="k">using</span> <span class="nn">var</span> <span class="n">registration</span> <span class="p">=</span> <span class="n">cancellationToken</span><span class="p">.</span><span class="nf">Register</span><span class="p">(()</span> <span class="p">=&gt;</span> <span class="n">cmd</span><span class="p">.</span><span class="nf">Cancel</span><span class="p">());</span>

    <span class="c1">// Process results</span>
<span class="p">}</span>
</code></pre></div></div>

<p>But this would have to be done everywhere where a potentially cancellable command needs to be executed, and isn’t very discoverable. Finally, it just doesn’t seem incredibly useful to allow ReadAsync to be cancelled while leaving the query itself running; it’s simply unlikely that a later retry would produce useful results where an earlier ReadAsync was cancelled. Yes, this business of “detached” async background processes which don’t correspond to method calls isn’t entirely trivial.</p>

<p>The solution we opted for was to treat ReadAsync’s token - and indeed, all tokens accepted by methods on DbDataReader - in the same way as we treat ExecuteReaderAsync’s: triggering it cancels the query. This solves the typical user requirement: if a cancellation token comes from somewhere (rigged to some GUI button, for example) and is passed to all database methods, then triggering the token cancels the query.</p>
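<p>In practice, this means a single cancellation token can simply flow through the entire interaction. A minimal sketch (assuming a provider, such as Npgsql 5.0, that implements these semantics):</p>

<div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code>public async Task ConsumeAsync(DbConnection connection, CancellationToken cancellationToken)
{
    await using var cmd = connection.CreateCommand();
    cmd.CommandText = "SELECT something_long_running";

    await using var reader = await cmd.ExecuteReaderAsync(cancellationToken);
    try
    {
        while (await reader.ReadAsync(cancellationToken))
        {
            // Process a row. If the token is triggered at any point - whether we
            // happen to be inside ReadAsync or not - the query itself is cancelled.
        }
    }
    catch (OperationCanceledException)
    {
        // The query was cancelled; nothing is left running in the background.
    }
}
</code></pre></div></div>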

<p>This does have one peculiar consequence. It is quite standard for async methods to start with the following:</p>

<div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="k">async</span> <span class="n">Task</span> <span class="nf">SomeLongThing</span><span class="p">(</span><span class="n">CancellationToken</span> <span class="n">cancellationToken</span> <span class="p">=</span> <span class="k">default</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">cancellationToken</span><span class="p">.</span><span class="nf">ThrowIfCancellationRequested</span><span class="p">();</span>

    <span class="c1">// ... actual stuff ...</span>
<span class="p">}</span>
</code></pre></div></div>

<p>This performs an upfront check on the token, and immediately returns a cancelled Task if the method is invoked with an already-cancelled token. But in our case, calling the method with a cancelled token is actually the way to request cancellation of the query. It’s a bit odd, but it works and does what people generally want.</p>
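<p>Concretely, this leads to a somewhat unusual but workable pattern - a sketch, again assuming a provider with the semantics described above:</p>

<div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code>public async Task CancelMidway(DbCommand cmd, CancellationTokenSource cts)
{
    await using var reader = await cmd.ExecuteReaderAsync(cts.Token);

    cts.Cancel();

    // The token is already cancelled; rather than bailing out with an up-front
    // check, this call is what actually requests cancellation of the running
    // query, before throwing OperationCanceledException.
    await reader.ReadAsync(cts.Token);
}
</code></pre></div></div>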

<p>Hate it? Love it? Let us know.</p>]]></content><author><name>Shay Rojansky</name></author><summary type="html"><![CDATA[Async brought a world of goodness (and complexity) to .NET, including the concept of cancellation: since async operations are by their nature supposed to take a while, it makes sense to allow us to cancel mid-way and exit early. Like async in general, cancellation took some time to propagate everywhere - socket operations only started honoring cancellation tokens in .NET Core 3.0. For the upcoming 5.0 release of Npgsql, the PostgreSQL database driver, a lot of work is going on to provide a good command cancellation story (thanks @vonzshik!!), and it is far more complicated than you’d think.]]></summary></entry><entry><title type="html">C# 8 Nullable Reference Types, old TFMs and Multitargeting</title><link href="https://www.roji.org/nullable-reference-types-with-old-tfms" rel="alternate" type="text/html" title="C# 8 Nullable Reference Types, old TFMs and Multitargeting" /><published>2020-01-04T00:00:00+01:00</published><updated>2020-01-04T00:00:00+01:00</updated><id>https://www.roji.org/nullable-reference-types-with-old-tfms</id><content type="html" xml:base="https://www.roji.org/nullable-reference-types-with-old-tfms"><![CDATA[<p>C# 8.0 finally brought us nullable reference types (NRTs), which allow us to annotate our reference types as non-nullable and get compiler warnings for code that may be in violation. As libraries and applications in the .NET ecosystem opt into this feature, C# code will get safer and more self-documenting, as it’s immediately clear which variables can hold null and which can’t. <a href="https://docs.microsoft.com/en-ca/dotnet/csharp/nullable-references">Here are the C# docs for NRTs</a>, and you may also want to check out <a href="https://devblogs.microsoft.com/dotnet/try-out-nullable-reference-types/">this blog post to get started</a>.</p>

<p>There’s one problem though: C# 8.0 is only supported when targeting at least .NET Core 3.0 or .NET Standard 2.1, so if your project has to target an older TFM (say, .NET Standard 2.0 or even .NET Framework), you can’t officially use this feature. Unfortunately, some of us maintain software that can’t always target the newest shiny thing, but we’d still like to get the benefits of NRTs. No problem! As this is a compiler-only feature without any runtime requirements, there’s nothing <em>really</em> preventing you from turning it on in your csproj:</p>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="nt">&lt;PropertyGroup&gt;</span>
    <span class="nt">&lt;TargetFramework&gt;</span>netstandard2.0<span class="nt">&lt;/TargetFramework&gt;</span>
    <span class="nt">&lt;LangVersion&gt;</span>8.0<span class="nt">&lt;/LangVersion&gt;</span>
    <span class="nt">&lt;Nullable&gt;</span>enable<span class="nt">&lt;/Nullable&gt;</span>
  <span class="nt">&lt;/PropertyGroup&gt;</span>
</code></pre></div></div>

<p>And this will actually work! Until it doesn’t, that is… In some rare cases, things won’t work as they should when targeting the old TFM:</p>

<div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">string</span> <span class="nf">Foo</span><span class="p">(</span><span class="kt">string</span><span class="p">?</span> <span class="n">s</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">Debug</span><span class="p">.</span><span class="nf">Assert</span><span class="p">(</span><span class="n">s</span> <span class="p">!=</span> <span class="k">null</span><span class="p">);</span>
    <span class="k">return</span> <span class="n">s</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Since <code class="language-plaintext highlighter-rouge">s</code> is a nullable string, using it in a non-nullable context will generate a warning. Now, let’s say that we know that in this particular context, <code class="language-plaintext highlighter-rouge">s</code> cannot be null, and wish to assert that. This code will compile just fine on recent TFMs, since the compiler knows that if <code class="language-plaintext highlighter-rouge">Debug.Assert</code> returns successfully, <code class="language-plaintext highlighter-rouge">s</code> can’t be null. However, when targeting an older TFM, this code will generate a warning. To be fair, this is a relatively rare corner case: most NRT code does compile correctly even on older BCLs.</p>
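<p>As a purely local workaround, the null-forgiving operator (<code class="language-plaintext highlighter-rouge">!</code>) silences the warning - but it silences it on every TFM, throwing away exactly the compiler check we want to keep. A sketch:</p>

<div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code>static string Foo(string? s)
{
    Debug.Assert(s != null);
    return s!; // "!" suppresses the nullability warning; it has no runtime effect
}
</code></pre></div></div>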

<p>A way around this is to target a newer TFM where NRTs are fully supported, and simply disable nullability on the older one; in effect, we’ll be using the newer TFM to do the compiler verifications that our code is null-correct. So we can simply modify our csproj to do the following:</p>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="nt">&lt;PropertyGroup&gt;</span>
    <span class="nt">&lt;TargetFrameworks&gt;</span>netstandard2.0;netstandard2.1<span class="nt">&lt;/TargetFrameworks&gt;</span>
    <span class="nt">&lt;LangVersion&gt;</span>8.0<span class="nt">&lt;/LangVersion&gt;</span>
  <span class="nt">&lt;/PropertyGroup&gt;</span>

  <span class="nt">&lt;PropertyGroup</span> <span class="na">Condition=</span><span class="s">" '$(TargetFramework)' != 'netstandard2.0' "</span><span class="nt">&gt;</span>
    <span class="nt">&lt;Nullable&gt;</span>enable<span class="nt">&lt;/Nullable&gt;</span>
  <span class="nt">&lt;/PropertyGroup&gt;</span>
</code></pre></div></div>

<p>Great! There’s just one problem - our build will now generate warning CS8632 - The annotation for nullable reference types should only be used in code within a ‘#nullable’ annotations context - since in the older TFM we’re using the NRT feature without having turned it on. No problem, we can just ignore that warning for the old TFM by adding the following:</p>

<div class="language-xml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nt">&lt;PropertyGroup</span> <span class="na">Condition=</span><span class="s">" $(Nullable) != 'enable' "</span><span class="nt">&gt;</span>
  <span class="nt">&lt;NoWarn&gt;</span>$(NoWarn);CS8632<span class="nt">&lt;/NoWarn&gt;</span>
<span class="nt">&lt;/PropertyGroup&gt;</span>
</code></pre></div></div>

<p>That’s it. You now have a project targeting two TFMs, with NRTs enabled on the newer TFM and disabled on the older. Happy fun nullificating your projects!</p>]]></content><author><name>Shay Rojansky</name></author><summary type="html"><![CDATA[C# 8.0 finally brought us nullable reference types (NRTs), which allow us to annotate our reference types as non-nullable and get compiler warnings for code that may be in violation. As libraries and applications in the .NET ecosystem opt into this feature, C# code will get safer and more self-documenting, as it’s immediately clear which variables can hold null and which can’t. Here are the C# docs for NRTs, and you may also want to check out this blog post to get started.]]></summary></entry><entry><title type="html">Conceptual and API documentation with Docfx, Github Actions and Github Pages</title><link href="https://www.roji.org/docfx-with-github-actions" rel="alternate" type="text/html" title="Conceptual and API documentation with Docfx, Github Actions and Github Pages" /><published>2019-10-03T00:00:00+02:00</published><updated>2019-10-03T00:00:00+02:00</updated><id>https://www.roji.org/docfx-with-github-actions</id><content type="html" xml:base="https://www.roji.org/docfx-with-github-actions"><![CDATA[<p>A good software project is (among other things!) measured by the quality of its documentation, but setting up a good documentation workflow isn’t trivial. There are generally two kinds of documentation: conceptual articles, which are written manually (e.g. in markdown) and API documentation which is generated directly from source code. Docfx is a great tool which knows how to generate a single, seamless site from these two documentation types, but reaching doc nirvana is still hard:</p>

<ul>
  <li>The whole process should be fully automated: devs shouldn’t ever need to run docfx manually. We’re better than that. Let’s call this a continuous documentation pipeline.</li>
  <li>Conceptual docs are sometimes managed in a separate repo, raising the question of how to bring multiple docs together.</li>
  <li>The Npgsql case is even more complex: the same site has documentation for two projects (both the base Npgsql driver and the EF Core provider).</li>
</ul>

<p>This post will describe the new documentation pipeline used by Npgsql to solve these challenges, using <a href="https://dotnet.github.io/docfx/">docfx</a>, <a href="https://help.github.com/en/categories/automating-your-workflow-with-github-actions">Github Actions</a> for automation and <a href="https://pages.github.com/">Github Pages</a> for hosting. It assumes you’re (somewhat) familiar with docfx, and won’t go into the details of configuring it.</p>

<h2 id="automating-docfx-with-github-actions">Automating Docfx with Github Actions</h2>

<p>Let’s concentrate on conceptual documentation for now. For people hosting static (or Jekyll) sites on Github Pages, life is simple: edit a file and push, and your changes are automatically live. When a processing system such as docfx is used, we have to run docfx, and the resulting HTML site needs to be hosted somewhere. We can use Github Actions to do this for us.</p>

<p>Create a file called <code class="language-plaintext highlighter-rouge">.github/workflows/build-documentation.yml</code> in a repo containing some articles and a docfx.json:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">name</span><span class="pi">:</span> <span class="s">Build Documentation</span>

<span class="na">on</span><span class="pi">:</span>
  <span class="na">push</span><span class="pi">:</span>
    <span class="na">branches</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">master</span>

<span class="na">jobs</span><span class="pi">:</span>
  <span class="na">build</span><span class="pi">:</span>

    <span class="na">runs-on</span><span class="pi">:</span> <span class="s">ubuntu-18.04</span>

    <span class="na">steps</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Checkout repo</span>
      <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v1</span>
      <span class="na">with</span><span class="pi">:</span>
        <span class="na">path</span><span class="pi">:</span> <span class="s">docs</span>
        <span class="na">fetch-depth</span><span class="pi">:</span> <span class="m">1</span>

    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Get mono</span>
      <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
        <span class="s">apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys 3FA7E0328081BFF6A14DA29AA6A19B38D3D831EF</span>
        <span class="s">echo "deb https://download.mono-project.com/repo/ubuntu stable-bionic main" | sudo tee /etc/apt/sources.list.d/mono-official-stable.list</span>
        <span class="s">sudo apt-get update</span>
        <span class="s">sudo apt-get install mono-complete --yes</span>

    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Get docfx</span>
      <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
        <span class="s">curl -L https://github.com/dotnet/docfx/releases/latest/download/docfx.zip -o docfx.zip</span>
        <span class="s">unzip -d .docfx docfx.zip</span>

    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Build docs</span>
      <span class="na">run</span><span class="pi">:</span>  <span class="s">mono .docfx/docfx.exe</span>
</code></pre></div></div>

<p>Let’s go over the above. We’ve created a workflow called “Build Documentation” that will run every time something is pushed to our repo’s master branch. It first clones our repo into a directory called docs, and to save time we fetch only 1 commit deep (who needs all that history). Now, since I’m a diehard Linux dude, we’ll be running on Ubuntu; this unfortunately means that we need to install mono, since docfx only runs on .NET Framework (guys, it’s 2019 and .NET Core 3.0 has just been released…). I won’t go into the technicalities of this step, and if you prefer Windows you can skip it entirely.</p>

<p>Once mono is installed, we get the latest version of docfx by fetching it from their Github releases page, and unzip it into some directory. At this point we’re ready to go, and can run docfx - hurray! Could it be this simple?</p>

<h2 id="repos-repos">Repos, repos…</h2>

<p>Well, uh, no… What are we going to do with all those HTML files that docfx generated? They need to be hosted somewhere. If you have some external hosting service, at this point you’d pack the outputs into a ZIP and send it off somewhere. But if you use Github Pages for your hosting, you may be tempted to host your site in the same repo which contains the sources. While this may seem like a good idea, it probably isn’t: to actually go live, you need to push a new commit containing these HTMLs; creating a new commit in your sources repo would mean you have to pull it the next time you want to make a change. But in any case, who wants a source repo to contain generated HTML artifacts - that’s like committing your compiled objects alongside your sources, yuck.</p>

<p>So we’ll open a new repo whose sole purpose is to host our static HTML files: this will be our publicly-visible repo. Our workflow will clone that repo, make sure that docfx generates its outputs into its directory, and finally commit and push the changes to it. Let’s add this additional fragment after our repo’s checkout:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Checkout live docs repo</span>
      <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v1</span>
      <span class="na">with</span><span class="pi">:</span>
        <span class="na">repository</span><span class="pi">:</span> <span class="s">npgsql/livedocs</span>
        <span class="na">ref</span><span class="pi">:</span> <span class="s">master</span>
        <span class="na">fetch-depth</span><span class="pi">:</span> <span class="m">1</span>
        <span class="na">path</span><span class="pi">:</span> <span class="s">docs/live</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Clear live docs repo</span>
      <span class="na">run</span><span class="pi">:</span> <span class="s">rm -rf live/*</span>
</code></pre></div></div>

<p>This time we need to specify which repo we want to clone, since it isn’t the repo where the workflow is running. We also specify to clone it into a <code class="language-plaintext highlighter-rouge">live</code> directory inside our sources repo; when docfx runs, it will automatically generate HTMLs into that directory. Once that’s done, all that’s left is to commit and push those changes to the live repo - but unfortunately that’s a bit complicated.</p>

<p>To push changes to another repo, we’re going to need an access token with the proper permissions, so let’s head over to Github and generate one with <code class="language-plaintext highlighter-rouge">repo</code> permissions, by following <a href="https://help.github.com/en/articles/creating-a-personal-access-token-for-the-command-line">these instructions</a>. Now, I know we want to go fast, but we absolutely <em>cannot</em> insert that access token inside our workflow YAML: this is a public file, and putting our token there would give the world write access to our repo - not cool (don’t do this even for private repos!). Fortunately, Github Actions has a <a href="https://help.github.com/en/articles/virtual-environments-for-github-actions#creating-and-using-secrets-encrypted-variables">secrets management feature</a>, which allows you to store your access token and reference it safely from your YAML. Follow the instructions on that page to store the token as a secret called DOC_MGMT_TOKEN, and insert the following fragment at the bottom of your workflow:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Commit and push</span>
      <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
        <span class="s">cd live</span>
        <span class="s">git config --global user.email "noreply@npgsql.org"</span>
        <span class="s">git config --global user.name "Automated System"</span>
        <span class="s">git add .</span>
        <span class="s">git commit -m "Automated update" --author $GITHUB_ACTOR</span>
        <span class="s">header=$(echo -n "ad-m:$" | base64)</span>
        <span class="s">git -c http.extraheader="AUTHORIZATION: basic $header" push origin HEAD:master</span>
</code></pre></div></div>

<p>We unfortunately have to jump through some hoops - this should ideally be simpler. We:</p>

<ul>
  <li>Enter the live repo’s directory, where our HTMLs have been generated</li>
  <li>Configure our name and email with git, as these will appear in the commit we’re about to create</li>
  <li>Add all files</li>
  <li>Create the commit</li>
  <li>Git push the commit to the live docs repo, doing some magic to include our access token in the HTTP header for proper authentication</li>
</ul>

<p>… and we have a fully-working, automated documentation pipeline - just push any changes to see it appearing live! Now we’re done, right?</p>
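<p>One last note on the token: it never appears in the workflow YAML itself - it’s referenced via Github’s secrets context. Roughly, assuming the secret was stored as DOC_MGMT_TOKEN as above (the DOC_TOKEN variable name is arbitrary):</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    - name: Commit and push
      env:
        DOC_TOKEN: ${{ secrets.DOC_MGMT_TOKEN }}
      run: |
        # ... build the basic-auth header from $DOC_TOKEN and push, as shown above ...
</code></pre></div></div>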

<h2 id="api-documentation">API documentation</h2>

<p>I know… I promised we’d also do API documentation here. It’s not so hard after what we’ve already been through. After <a href="https://dotnet.github.io/docfx/tutorial/walkthrough/walkthrough_create_a_docfx_project_2.html">configuring your docfx.json</a> appropriately, if your (conceptual) doc repo is separate from your actual project(s), you will simply need to add a workflow step to clone it into a directory where docfx will look for it:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Checkout Npgsql</span>
      <span class="na">uses</span><span class="pi">:</span> <span class="s">actions/checkout@v1</span>
      <span class="na">with</span><span class="pi">:</span>
        <span class="na">repository</span><span class="pi">:</span> <span class="s">npgsql/npgsql</span>
        <span class="na">ref</span><span class="pi">:</span> <span class="s">master</span>
        <span class="na">fetch-depth</span><span class="pi">:</span> <span class="m">1</span>
        <span class="na">path</span><span class="pi">:</span> <span class="s">docs/Npgsql</span>
</code></pre></div></div>

<p>Note that Npgsql follows <a href="https://datasift.github.io/gitflow/IntroducingGitFlow.html">Gitflow</a>, which means that the latest released version can always be found in the master branch - so that’s where we generate API docs from. Your git workflow may be different, adjust accordingly.</p>

<p>At this point, a doc rebuild is triggered whenever something is pushed to our <em>conceptual</em> repo, which is great, but we also want our project repo to trigger a rebuild! So we add another trigger at the beginning of our doc repo’s workflow file:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">on</span><span class="pi">:</span>
  <span class="na">repository_dispatch</span><span class="pi">:</span>
  <span class="na">push</span><span class="pi">:</span>
    <span class="na">branches</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">master</span>
</code></pre></div></div>

<p>The added <a href="https://help.github.com/en/articles/events-that-trigger-workflows#external-events-repository_dispatch"><em>repository dispatch</em></a> is basically an event that can be triggered externally via a simple HTTP POST request. All that’s left is to drop the following workflow in our project repo, under <code class="language-plaintext highlighter-rouge">.github/workflows/trigger-doc-build.yml</code>:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">name</span><span class="pi">:</span> <span class="s">Trigger Documentation Build</span>

<span class="na">on</span><span class="pi">:</span>
  <span class="na">push</span><span class="pi">:</span>
    <span class="na">branches</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">master</span>

<span class="na">jobs</span><span class="pi">:</span>
  <span class="na">build</span><span class="pi">:</span>

    <span class="na">runs-on</span><span class="pi">:</span> <span class="s">ubuntu-18.04</span>

    <span class="na">steps</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Trigger documentation build</span>
      <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
        <span class="s">curl -X POST \</span>
             <span class="s">-H "Authorization: token $" \</span>
             <span class="s">-H "Accept: application/vnd.github.everest-preview+json" \</span>
             <span class="s">-H "Content-Type: application/json" \</span>
             <span class="s">--data '{ "event_type": "Npgsql push to master" }' \</span>
             <span class="s">https://api.github.com/repos/npgsql/doc/dispatches</span>
</code></pre></div></div>

<p>Note that we need an access token here as well, and have to configure it on our <em>project</em> repo, since that is where the workflow runs. Once this is done, every push to your project repo’s master branch will result in a doc rebuild (Npgsql even has two repos triggering rebuilds of the same doc site).</p>

<h2 id="nirvana">Nirvana</h2>

<p>Once this is all properly set up, you hopefully never have to think about syncing docs ever again…! If you think the above is useful (or hate it), please drop me a comment below - any improvement suggestions would be welcome as well!</p>

<p>Oh, and here are the full files so you can see it all put together. Feel free to wander around <a href="https://github.com/npgsql/doc">the doc repo</a> or <a href="https://github.com/npgsql/npgsql">the Npgsql project repo</a> to see how it all fits together.</p>

<ul>
  <li><a href="/assets/2019-10-03-docfx-with-github-actions/build-documentation.yml">build-documentation.yml</a></li>
  <li><a href="/assets/2019-10-03-docfx-with-github-actions/trigger-doc-build.yml">trigger-doc-build.yml</a></li>
  <li><a href="/assets/2019-10-03-docfx-with-github-actions/docfx.json">docfx.json</a> (in case it floats your boat)</li>
</ul>]]></content><author><name>Shay Rojansky</name></author><summary type="html"><![CDATA[A good software project is (among other things!) measured by the quality of its documentation, but setting up a good documentation workflow isn’t trivial. There are generally two kinds of documentation: conceptual articles, which are written manually (e.g. in markdown) and API documentation which is generated directly from source code. Docfx is a great tool which knows how to generate a single, seamless site from these two documentation types, but reaching doc nirvana is still hard:]]></summary></entry><entry><title type="html">EFCore 3.0 for PostgreSQL - Advanced JSON Support</title><link href="https://www.roji.org/efcore-pg-advanced-json" rel="alternate" type="text/html" title="EFCore 3.0 for PostgreSQL - Advanced JSON Support" /><published>2019-09-26T00:00:00+02:00</published><updated>2019-09-26T00:00:00+02:00</updated><id>https://www.roji.org/efcore-postgres-json-support</id><content type="html" xml:base="https://www.roji.org/efcore-pg-advanced-json"><![CDATA[<h1 id="json-and-databases">JSON and Databases</h1>

<p>Most relational databases have had some sort of native support for JSON for quite a while now; PostgreSQL introduced its first JSON support in version 9.2, back in 2012, and the more optimized <code class="language-plaintext highlighter-rouge">jsonb</code> type in 2014. JSON types have in part been the relational database response to the NoSQL movement, with its pervasive, schema-less JSON documents: look, we can do it too! But the marriage of a traditional relational schema with non-relational documents has proven very powerful indeed; complex data no longer have to be represented via sprawling, relational models involving endless joins, and islands of fluid, schema-less content within a stricter relational model brought some very welcome flexibility.</p>

<p>Database JSON support usually means that some operations on JSON data can be performed in the database; after all, simply storing and loading JSON documents is quite useless in itself. For JSON to really shine, we need to be able to ask for all JSON documents satisfying some condition - and to do it efficiently. At the most basic level, PostgreSQL supports the following:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">some_table</span> <span class="k">WHERE</span> <span class="n">customer</span><span class="o">-&gt;&gt;</span><span class="s1">'name'</span> <span class="o">==</span> <span class="s1">'Joe'</span><span class="p">;</span>
</code></pre></div></div>

<p>Assuming the <code class="language-plaintext highlighter-rouge">some_table</code> table has a JSON column named <code class="language-plaintext highlighter-rouge">customer</code>, this query will make PostgreSQL examine each row’s document, and return those rows where the <code class="language-plaintext highlighter-rouge">name</code> key is equal to “Joe”. <a href="https://www.postgresql.org/docs/current/datatype-json.html#JSON-INDEXING">Proper indexing</a> can make this perform very fast, and PostgreSQL has <a href="https://www.postgresql.org/docs/current/functions-json.html">a plethora of other operators and functions</a> that can be used to construct JSON queries.</p>
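<p>For example, a GIN index over the <code class="language-plaintext highlighter-rouge">jsonb</code> column can make containment queries fast; here is a sketch, with illustrative index and table names:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Index the entire document; the default jsonb_ops operator class
-- supports containment (@&gt;) and key-existence queries
CREATE INDEX ix_some_table_customer ON some_table USING GIN (customer);

-- This containment query can now use the index
SELECT * FROM some_table WHERE customer @&gt; '{ "name": "Joe" }';
</code></pre></div></div>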

<p>Now, the syntax above is entirely PostgreSQL-specific: other databases have other ways of expressing such queries. SQL/JSON standardization is underway, and PostgreSQL 12 <a href="https://paquier.xyz/postgresql-2/postgres-12-jsonpath/">will support jsonpath queries</a>, which should finally provide a cross-database way to describe JSON queries. Unfortunately, the non-standardized nature of JSON support has meant that ORMs have often stayed away from it, and developers have been forced to drop down to raw SQL if they wanted to access JSON goodness.</p>
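<p>As a sketch of what this looks like, the query above could be written with PostgreSQL 12's jsonpath support as follows (the <code class="language-plaintext highlighter-rouge">@@</code> operator checks a jsonpath predicate against a document):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT * FROM some_table WHERE customer @@ '$.name == "Joe"';
</code></pre></div></div>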

<p>No more! Release 3.0.0 of the Npgsql Entity Framework Core provider for PostgreSQL brings some exciting new JSON support, leveraging a unique feature of C#’s LINQ to express database JSON queries in a strongly-typed and natural way. The rest of this post will present the key new features; <a href="http://www.npgsql.org/efcore/mapping/json.html">consult the documentation for a more complete description</a>.</p>

<h1 id="strongly-typed-access-via-pocos">Strongly-typed access via POCOs</h1>

<p>Without further ado, you can now define an EF Core entity as follows:</p>

<div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="k">class</span> <span class="nc">SomeEntity</span>   <span class="c1">// Maps to a database table</span>
<span class="p">{</span>
    <span class="k">public</span> <span class="kt">int</span> <span class="n">Id</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="k">set</span><span class="p">;</span> <span class="p">}</span>
    <span class="p">[</span><span class="nf">Column</span><span class="p">(</span><span class="n">TypeName</span> <span class="p">=</span> <span class="s">"jsonb"</span><span class="p">)]</span>
    <span class="k">public</span> <span class="n">Customer</span> <span class="n">Customer</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="k">set</span><span class="p">;</span> <span class="p">}</span>
<span class="p">}</span>

<span class="k">public</span> <span class="k">class</span> <span class="nc">Customer</span>    <span class="c1">// Maps to a JSON column in the table</span>
<span class="p">{</span>
    <span class="k">public</span> <span class="kt">string</span> <span class="n">Name</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="k">set</span><span class="p">;</span> <span class="p">}</span>
    <span class="k">public</span> <span class="kt">int</span> <span class="n">Age</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="k">set</span><span class="p">;</span> <span class="p">}</span>
    <span class="k">public</span> <span class="n">Order</span><span class="p">[]</span> <span class="n">Orders</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="k">set</span><span class="p">;</span> <span class="p">}</span>
<span class="p">}</span>

<span class="k">public</span> <span class="k">class</span> <span class="nc">Order</span>       <span class="c1">// Part of the JSON column</span>
<span class="p">{</span>
    <span class="k">public</span> <span class="kt">decimal</span> <span class="n">Price</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="k">set</span><span class="p">;</span> <span class="p">}</span>
    <span class="k">public</span> <span class="kt">string</span> <span class="n">ShippingAddress</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="k">set</span><span class="p">;</span> <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>Our <code class="language-plaintext highlighter-rouge">SomeEntity</code> type - which maps to a database table - contains an arbitrary user type (or POCO, plain-old-CLR-object), which is mapped to a PostgreSQL <code class="language-plaintext highlighter-rouge">jsonb</code> column via the <code class="language-plaintext highlighter-rouge">[Column]</code> data annotation attribute. This is really <em>all</em> you have to do, and everything will work as expected: Npgsql will use the new <a href="https://devblogs.microsoft.com/dotnet/try-the-new-system-text-json-apis/">System.Text.Json</a> to serialize and deserialize your instances to JSON data. Note also that our POCO, <code class="language-plaintext highlighter-rouge">Customer</code>, contains an array of another POCO, <code class="language-plaintext highlighter-rouge">Order</code>; this will also just work as expected, with the array of orders appearing inside the customer’s JSON document.</p>

<p>That’s it, couldn’t be simpler. No need for additional <code class="language-plaintext highlighter-rouge">customer</code> and <code class="language-plaintext highlighter-rouge">order</code> tables, with joins all around. But what about querying, as promised above? No problem:</p>

<div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">var</span> <span class="n">joes</span> <span class="p">=</span> <span class="n">context</span><span class="p">.</span><span class="n">CustomerEntries</span>
    <span class="p">.</span><span class="nf">Where</span><span class="p">(</span><span class="n">e</span> <span class="p">=&gt;</span> <span class="n">e</span><span class="p">.</span><span class="n">Customer</span><span class="p">.</span><span class="n">Name</span> <span class="p">==</span> <span class="s">"Joe"</span><span class="p">)</span>
    <span class="p">.</span><span class="nf">ToList</span><span class="p">();</span>
</code></pre></div></div>

<p>This will produce the PostgreSQL-specific JSON syntax we saw above. Once again: we’re using natural C# and LINQ to express an SQL query over a JSON column in our database.</p>
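<p>For illustration, the generated SQL looks roughly like the following - the exact table alias, identifier quoting and parameterization may differ:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT s."Id", s."Customer"
FROM "CustomerEntries" AS s
WHERE s."Customer"-&gt;&gt;'Name' = 'Joe'
</code></pre></div></div>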

<h1 id="weakly-typed-access-via-jsondocument">Weakly-typed access via JsonDocument</h1>

<p>Mapping POCOs is great when your JSON documents have a stable schema, but JSON is frequently used precisely when things are fluid: a document in one row could have a certain key which another document might not. A strongly-typed POCO is inappropriate for mapping in these circumstances, but never fear - there’s a solution for that as well. System.Text.Json also comes with a Document Object Model (DOM) for accessing JSON documents: you use types such as <a href="https://docs.microsoft.com/en-us/dotnet/api/system.text.json.jsondocument"><code class="language-plaintext highlighter-rouge">JsonDocument</code></a> and <a href="https://docs.microsoft.com/en-us/dotnet/api/system.text.json.jsonelement"><code class="language-plaintext highlighter-rouge">JsonElement</code></a> for weakly-typed access. These can also be mapped:</p>

<div class="language-c# highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">public</span> <span class="k">class</span> <span class="nc">SomeEntity</span>
<span class="p">{</span>
    <span class="k">public</span> <span class="kt">int</span> <span class="n">Id</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="k">set</span><span class="p">;</span> <span class="p">}</span>
    <span class="k">public</span> <span class="n">JsonDocument</span> <span class="n">Customer</span> <span class="p">{</span> <span class="k">get</span><span class="p">;</span> <span class="k">set</span><span class="p">;</span> <span class="p">}</span>
<span class="p">}</span>

<span class="kt">var</span> <span class="n">joes</span> <span class="p">=</span> <span class="n">context</span><span class="p">.</span><span class="n">CustomerEntries</span>
    <span class="p">.</span><span class="nf">Where</span><span class="p">(</span><span class="n">e</span> <span class="p">=&gt;</span> <span class="n">e</span><span class="p">.</span><span class="n">Customer</span><span class="p">.</span><span class="nf">GetProperty</span><span class="p">(</span><span class="s">"Name"</span><span class="p">).</span><span class="nf">GetString</span><span class="p">()</span> <span class="p">==</span> <span class="s">"Joe"</span><span class="p">)</span>
    <span class="p">.</span><span class="nf">ToList</span><span class="p">();</span>
</code></pre></div></div>

<p>This will produce the same SQL as above.</p>

<h1 id="closing-words">Closing Words</h1>

<p>This hopefully gave a good overview of this new JSON feature, which should make PostgreSQL JSON operations accessible to EF Core users - <a href="http://www.npgsql.org/efcore/mapping/json.html">the full documentation is available here</a>. Based on feedback, the plan is also to look into supporting JSON in other database providers, such as SQL Server or SQLite; standardized SQL/JSON may provide an opportunity for generic, cross-database support.</p>

<p>Please send positive and negative feedback via Twitter (<a href="https://twitter.com/shayrojansky">@shayrojansky</a>) or by opening an issue <a href="https://github.com/npgsql/Npgsql.EntityFrameworkCore.PostgreSQL/">on the provider repo</a>. And have fun!</p>