2024-09-05

Unpacking ZIO Schema's Accessors

ZIO Schema is a ZIO-based library for modeling the schema of data structures as first-class values.

A big part of it is the automated, macro-based derivation of zio.schema.Schema from your case classes. It has been adopted by many libraries within the ZIO ecosystem, which makes it a good candidate for codec derivation in your next ZIO-powered library. What makes it even cooler is an underdocumented feature: Accessors. There is not much information about it available on the internet, so I’ve decided to share some of my findings.

ZIO Apache Parquet

I’ve recently been hacking on my new library, ZIO Apache Parquet. The main reason I decided to start building this library was a genuine interest in what ZIO Schema is capable of.

It is no secret that being able to filter data is an essential requirement when you work with big data sets. From the very beginning, I didn’t like the idea of repeating the type-unsafe filter predicate API that the rest of the languages suffer from. I’m sure you know what I’m talking about: this clunky col("foo") == "bar" type of code that doesn’t provide any guarantees. I wanted to get rid of it and forget about it forever, for the sake of all the fathers of type theory.
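
To make the complaint concrete, here is a minimal sketch of such a stringly typed predicate API. UntypedCol, col, and Eq are hypothetical names invented for this illustration and do not come from any real library; === stands in for ==, which Scala does not let you redefine.

```scala
object StringlyTyped {
  // Hypothetical untyped predicate: neither the column name nor the value is checked.
  final case class Eq(column: String, value: Any)

  final case class UntypedCol(name: String) {
    def ===(value: Any): Eq = Eq(name, value)
  }

  def col(name: String): UntypedCol = UntypedCol(name)

  final case class MyRecord(a: Int, b: String)

  // Both lines compile, although "c" is not a field of MyRecord
  // and "a" holds an Int rather than a String.
  val typo: Eq      = col("c") === "bar"
  val wrongType: Eq = col("a") === "bar"
}
```

Nothing ties the predicate to MyRecord, so both mistakes would only surface at runtime.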

ZIO SQL

It was my idée fixe for a while. I had been thinking about it in the background. Once I got some free time, after some brief research, I came across another library from the ZIO ecosystem: ZIO SQL. I noticed one interesting detail in how they manage access to the fields (columns) of the case classes (tables). You can find it right in their documentation.

That was an “aha” moment! Of course, I jumped straight into the ZIO Schema documentation… and found nothing about it. The only mention I found was on the documentation page for the zio-schema-optics module. Fortunately, there is a brilliant talk by Jaroslav Regec on YouTube, “Peeking Inside the Engine of ZIO SQL”, that sheds some light on this topic. The explanation of the “Accessors API” and what you can achieve with it starts at 7:10.

Profunctor optics

You might say, “Nah, there’s nothing new to see here. Look at any library that implements profunctor optics, like quicklens or Monocle.” These libraries do provide a set of predefined combinators that allow you to access and manipulate the fields of case classes. However, there’s an important distinction here: they work with actual data values within case class instances. In other words, they’re focused on accessing and modifying the contents of existing objects.

In our case, the case classes represent not data, but rather schemas. We need to be able to build an arbitrary DSL over the fields. In the case of ZIO SQL and ZIO Apache Parquet, the fields of a case class represent the columns of a table in the database and the columns of a Parquet dataset respectively. Moreover, we want to have a rich set of operators to be applied to the data contained in the columns.
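
For contrast, a value-level lens can be sketched as follows. This is a hand-rolled illustration with hypothetical names, not the API of quicklens or Monocle.

```scala
object ValueLevelOptics {
  final case class MyRecord(a: Int, b: String)

  // A bare-bones lens: it reads and updates a field of an existing value.
  final case class Lens[S, A](get: S => A, set: (S, A) => S) {
    def modify(s: S)(f: A => A): S = set(s, f(get(s)))
  }

  val aLens: Lens[MyRecord, Int] =
    Lens[MyRecord, Int](_.a, (record, newA) => record.copy(a = newA))

  val record: MyRecord  = MyRecord(1, "foo")
  val updated: MyRecord = aLens.modify(record)(_ + 1)
}
```

Every operation here needs a MyRecord instance to act on. A schema-level accessor has no instance at all: it only describes the field, which is what lets it stand for a column.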

No more talk, let’s dive deep into this already.

Accessor Builder

An entry point to it is zio.schema.AccessorBuilder:

trait AccessorBuilder {
  type Lens[F, S, A]
  type Prism[F, S, A]
  type Traversal[S, A]

  def makeLens[F, S, A](product: Schema.Record[S], term: Schema.Field[S, A]): Lens[F, S, A]

  def makePrism[F, S, A](sum: Schema.Enum[S], term: Schema.Case[S, A]): Prism[F, S, A]

  def makeTraversal[S, A](collection: Schema.Collection[S, A], element: Schema[A]): Traversal[S, A]
  
}

As you can see, it is the same old profunctor optics, but you can define your own implementation of Lens, Prism, and Traversal. That’s what changes everything. It means we are not limited to the regular optics combinators such as modify, set, and get.

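For this, only type Lens and def makeLens need a real implementation. Prism and Traversal still have to be defined for the class to compile, so they are stubbed out as Unit:

final class ExprAccessorBuilder(typeTags: Map[String, TypeTag[?]]) extends AccessorBuilder {

  // A lens is a named, typed Parquet column
  override type Lens[F, S, A] = Column.Named[A, F]

  // Prisms and traversals are not needed here
  override type Prism[F, S, A]  = Unit
  override type Traversal[S, A] = Unit

  override def makeLens[F, S, A](product: Schema.Record[S], term: Schema.Field[S, A]): Column.Named[A, F] = {
    val name = term.name.toString
    // Look up the type tag of the field by its name
    implicit val typeTag: TypeTag[A] = typeTags(name).asInstanceOf[TypeTag[A]]

    Column.Named[A, F](name)
  }

  override def makePrism[F, S, A](sum: Schema.Enum[S], term: Schema.Case[S, A]): Unit = ()

  override def makeTraversal[S, A](collection: Schema.Collection[S, A], element: Schema[A]): Unit = ()

}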

This piece is taken from my ZIO Apache Parquet library, where Lens is Column.Named, a type that represents a Parquet column and supports standard comparison operators such as >, <=, ==, and so on. makeLens is called once for each field of a case class. This way, we build the complete set of columns for an arbitrary Schema.Record representing a case class of arity N.
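
The same idea can be tried without any Parquet machinery. Below is a minimal sketch, assuming zio-schema and zio-schema-derivation are on the classpath; FieldNameBuilder and Person are hypothetical names made up for this example. Each lens is simply the name of the field it points at.

```scala
import zio.schema.{AccessorBuilder, DeriveSchema, Schema}

// A toy builder: a lens is just the field name, prisms and traversals are unused.
object FieldNameBuilder extends AccessorBuilder {
  override type Lens[F, S, A]   = String
  override type Prism[F, S, A]  = Unit
  override type Traversal[S, A] = Unit

  override def makeLens[F, S, A](product: Schema.Record[S], term: Schema.Field[S, A]): String =
    term.name.toString

  override def makePrism[F, S, A](sum: Schema.Enum[S], term: Schema.Case[S, A]): Unit = ()

  override def makeTraversal[S, A](collection: Schema.Collection[S, A], element: Schema[A]): Unit = ()
}

final case class Person(name: String, age: Int)

object Person {
  // The precise CaseClass2 type is kept so the tuple type of the accessors stays visible.
  val schema: Schema.CaseClass2[String, Int, Person] = DeriveSchema.gen[Person]

  // One accessor per field, in declaration order.
  val (nameField, ageField) = schema.makeAccessors(FieldNameBuilder)
}
```

Annotating the schema as a plain Schema[Person] would hide the Accessors type, so the tuple could no longer be destructured this way.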

All this happens inside a special method called Schema.makeAccessors:

def makeAccessors(b: AccessorBuilder): Accessors[b.Lens, b.Prism, b.Traversal]

where Accessors is an abstract type member:

type Accessors[Lens[_, _, _], Prism[_, _, _], Traversal[_, _]]

These two are overridden for each subtype of zio.schema.Schema, including zio.schema.Schema.Record. Let’s have a look at CaseClass2 as an example:

sealed trait CaseClass2[A1, A2, Z] extends Record[Z] { self =>

  type Accessors[Lens[_, _, _], Prism[_, _, _], Traversal[_, _]] =
    (Lens[Field1, Z, A1], Lens[Field2, Z, A2])

  override def makeAccessors(
      b: AccessorBuilder
    ): (b.Lens[Field1, Z, A1], b.Lens[Field2, Z, A2]) =
      (b.makeLens(self, field1), b.makeLens(self, field2))
      
}
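
The mechanism itself, an abstract type constructor on the builder combined with a path-dependent result type, can be reproduced in miniature. The sketch below is a deliberately simplified, self-contained model: Builder, Record2, Column, and ColumnBuilder are hypothetical stand-ins, not the real ZIO Schema types, and the field-name type parameter is left out.

```scala
object MiniAccessors {
  // Simplified stand-in for AccessorBuilder: lenses only, fields identified by name.
  trait Builder {
    type Lens[S, A]
    def makeLens[S, A](fieldName: String): Lens[S, A]
  }

  // Simplified stand-in for Schema.CaseClass2.
  final class Record2[A1, A2, Z](field1: String, field2: String) {
    type Accessors[Lens[_, _]] = (Lens[Z, A1], Lens[Z, A2])

    // The result type depends on the builder instance that is passed in.
    def makeAccessors(b: Builder): Accessors[b.Lens] =
      (b.makeLens[Z, A1](field1), b.makeLens[Z, A2](field2))
  }

  final case class Column[A](name: String)

  // A builder whose lenses are typed columns.
  object ColumnBuilder extends Builder {
    type Lens[S, A] = Column[A]
    def makeLens[S, A](fieldName: String): Column[A] = Column[A](fieldName)
  }

  final case class MyRecord(a: Int, b: String)

  val schema = new Record2[Int, String, MyRecord]("a", "b")

  // Accessors[ColumnBuilder.Lens] reduces to a plain tuple of columns.
  val columns: (Column[Int], Column[String]) = schema.makeAccessors(ColumnBuilder)
}
```

Because the return type mentions b.Lens, swapping in a different builder changes the type of the resulting tuple without touching Record2.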

The last step is calling the makeAccessors method, passing in our own implementation of AccessorBuilder:

val accessorBuilder = 
  new ExprAccessorBuilder(typeTag.asInstanceOf[TypeTag.Record[A]].columns)
  
schema.makeAccessors(accessorBuilder)

The result has the type:

schema.Accessors[accessorBuilder.Lens, accessorBuilder.Prism, accessorBuilder.Traversal]

For example, let’s define a very simple case class:

case class MyRecord(a: Int, b: String)

Let’s say we derived a zio.schema.Schema for it. The result of calling schema.makeAccessors(accessorBuilder) will be:

(Column.Named[Int, "a"], Column.Named[String, "b"])

What are those “a” and “b” literal singleton types? They are the well-known “phantom types”. They can be very useful if you need some extra information that is available only at compile time, for example in your macros. I’ll leave that to be explored in one of my next blog posts.
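
As a small taste of what a literal type parameter gives you, here is a minimal sketch that recovers the field name from the type alone. It assumes Scala 2.13 or Scala 3, where literal types and ValueOf are available; Column here is a hypothetical stand-in, not the library’s Column.Named.

```scala
object PhantomNames {
  // F is a phantom type: no field of Column stores it.
  final class Column[A, F <: String] {
    // ValueOf[F] is provided by the compiler for literal singleton types.
    def name(implicit literal: ValueOf[F]): F = literal.value
  }

  val a = new Column[Int, "a"]

  // The name comes from the type parameter, not from any data held by `a`.
  val recovered: String = a.name
}
```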

Conclusion

Now we can clearly see that we changed the representation of the MyRecord.a field. Instead of a value of type Int, it became Column.Named[Int, "a"], which is obviously not a value anymore but a kind of schema. This shift in perspective opens up exciting possibilities for crafting a domain-specific language (DSL) tailored to your needs.
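
As a closing illustration, here is a minimal sketch of what such a DSL could look like. Column is a simplified stand-in for Column.Named, and Predicate, Gt, Eq, and And are hypothetical names; none of this is the actual ZIO Apache Parquet implementation.

```scala
object TypedColumns {
  // A predicate is plain data that a query engine could interpret later.
  sealed trait Predicate { self =>
    def &&(that: Predicate): Predicate = And(self, that)
  }
  final case class Gt[A](column: String, value: A)        extends Predicate
  final case class Eq[A](column: String, value: A)        extends Predicate
  final case class And(left: Predicate, right: Predicate) extends Predicate

  // The value type A restricts what a column can be compared with.
  final case class Column[A, F <: String](name: String) {
    def >(value: A): Predicate   = Gt(name, value)
    def ===(value: A): Predicate = Eq(name, value)
  }

  val a = Column[Int, "a"]("a")
  val b = Column[String, "b"]("b")

  val filter: Predicate = a > 3 && b === "foo"

  // a === "foo" does not compile: "a" is an Int column.
}
```

Unlike the stringly typed col("foo") == "bar" style, a comparison against a value of the wrong type is rejected by the compiler.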