Engineering at Karat: YAML, JSON, and data modeling


Engineering leader, Kyle Wenholz, shares a Karat data modeling project

Data modeling is awesome. Or hard. It depends on the day of the week. Any engineer at a startup will attest your data models are probably evolving faster than your code. We’re always riding the bike as we try to build it.

I’ve shared some PR-approved quotes about the engineering culture at Karat, but they never really capture that thrill and terror. So the powers that be decided we should just tell a story about the bike, which in this case is migrating a few old data models.

Here’s how we did it.

Background

The Karat engineering team uses a lot of simple data models: users, events, etc. They’re often a row in a relational table with some foreign keys, non-nullable columns, default values, and not much else. We also store some things in S3 like videos, but those locations are referenced with a pretty simple table.

Rails hides a lot of the complexity for us, but it’s not all that complex and we do plenty of data analysis on those raw tables without Ruby to lend a hand. Life gets more complicated when we start talking about questions and results: the bread and butter of over 60,000 Karat conducted interviews.

“But Kyle,” you say, “isn’t a question just some words and a punctuation mark?”

The answer is a good old “yes, but.”

Loosely defined and ignoring a bunch of complexity we have a few definitions:

Question: a yes/no, multiple choice, or open ended response query for knowledge. May have multiple parts, diagrams, external resources, constraints, or follow ups.

Result: an ordered list of question answers during an interview with an attached snapshot of the questions as asked. This also comes with some scoring of the correctness.

Notice all the caveats, qualifiers, and commas? You get the point.

The Data Modeling Problem

Since the dawn of Karat, some wise engineering folks made a great tradeoff and decided to punt on figuring out what exactly the shape of a question should be in a relational database.

How did they describe it? They used a text column with YAML stuffed inside.

ALTER TABLE questions ADD COLUMN content RAW NOT NULL;    

 

As long as the YAML conformed to a general shape and the UI and scoring engines could understand it, we were good.

# Notice this is all pretty human readable, hooray YAML.   
internal_title: Best first question                        
duration: 600 # time allowance, seconds                    
content: |                                                 
  type: group                                              
  title: A group of excellent questions                    
  competency: tenderheartedness                            
  components:                                              
  - type: knowledge                                        
    title: 100 Acre Wood                                   
    prompt: |                                              
      Where can a bear find the most honey?                
    fields:                                                
      approach:                                            
        options:                                           
          - approach: In a log with a buzz                 
            quality: optimal                               
          - approach: Where the owls live                  
            quality: suboptimal                            
  - type: programming                                      
    title: Something less fun                              
    prompt: |                                              
      Tell me about a time you traversed a tricky array.   

 

This allowed a lot of quick iteration without reshaping the tables every time we wanted a new question/result format. Just update the UI components and the scoring classes to handle the new shape (types in the example), and voila.

Time travel with versions

Those YAML blobs worked really well for years. Over the past eight months though we started asking questions about how results and questions were changing over the course of some edits. For context, we quality control most interviews to ensure accurate scoring, professional conduct, and excellence. This can change the results initially filled in (also, spelling). In particular, we wanted to do this analysis at scale rather than as one-offs.

Fortunately, we track versions of our records using a fun little gem called PaperTrail. At first, it was for disaster recovery, but now it could give us time travel superpowers.

Our first table for PaperTrail recorded objects and changes as a YAML blob sitting in a text column. In our Rails app, it was fast/easy enough to search questions for the UI by rehydrating the YAML into plain old Ruby objects (n < 100), and we never look over results in aggregate in that app.

kyle=# \d+ versions;                                                                                                                                   
Table "public.versions"                                                                                                                                
       Column       |            Type             | Collation | Nullable |               Default                | Storage  | Stats target | Description
--------------------+-----------------------------+-----------+----------+--------------------------------------+----------+--------------+------------
 id                 | integer                     |           | not null | nextval('versions_id_seq'::regclass) | plain    |              |            
 item_type          | character varying           |           | not null |                                      | extended |              |            
 item_id            | integer                     |           | not null |                                      | plain    |              |            
 event              | character varying           |           | not null |                                      | extended |              |            
 whodunnit          | character varying           |           |          |                                      | extended |              |            
 object             | text                        |           |          |                                      | extended |              |            
 created_at         | timestamp without time zone |           |          |                                      | plain    |              |            
 object_changes     | text                        |           |          |                                      | extended |              |            
 transaction_id     | integer                     |           |          |                                      | plain    |              |            
 request_id         | character varying           |           |          |                                      | extended |              |            
Indexes:                                                                                                                                               
    "versions_pkey" PRIMARY KEY, btree (id)                                                                                                            
    "index_versions_on_item_type_and_item_id" btree (item_type, item_id)                                                                               
    "index_versions_on_transaction_id" btree (transaction_id)                                                                                          

Here’s the table. It’s simple and works for any object, but those text columns are rather unstructured in the database.

Analyzing a YAML blob, questions or results, within a YAML blob, versions, with SQL was turning into a string matching nightmare. Running a Ruby script over the data was also becoming slower than we care to admit.

SELECT *                                   
FROM questions                             
WHERE content LIKE '%type: programming%';  

I just want the sweet questions. Now imagine if I want to see the sweet questions that changed (YAML inside YAML).

Loving unstructured data modeling with a little more structure

At this point, you might think we would embrace creating a real table and schema for these questions. It would give us:

  • More structured SQL analysis
  • Faster processing (no need to hydrate records or string match to inspect their innards)
  • A concrete format for questions and results

(Un)fortunately, we decided a concrete definition of a question was still too much work to maintain. Karat plans to iterate on existing content and new formats at an increasing rate. We started thinking about alternatives. A little over a year ago we made the move from MySQL to Postgres. We did this for a lot of reasons at the time, but one was because we liked some of the specialty column types like JSONB. JSONB allows you to define a column that holds some JSON and query the innards with SQL. In particular, JSONB is a slower insert, faster operated on, version of the JSON column type, so we skipped straight to JSONB because our workload is more read-heavy.

# Does this JSONB contain a `type` of `group`? It's an abbreviated version of our YAML                                  
# above, but you get the picture.                                                                                       
SELECT '{"content": {"type": "group", "title": "A group of excellent questions"}}'::jsonb#>>'{content, type}' = 'group';
-- true                                                                                                                 

 

This would solve some of our analysis issues (we could leverage the database before rehydrating records) and allow us to embrace the rate of content changes. (Always keeping in mind that records don’t have the same shape, but that’s a whole other discussion.) We got to writing some backfill scripts and migrating our code to read and write this new format. (Never let anyone tell you JSON is a pretty subset of YAML.) It became clear the lack of structure was getting out of hand as we discovered edge cases and bugs in some older code. To resolve this problem, we turned to JSON Schema.

# This JSON document can tell a JSON Schema validator that I want my data to have three 
# specific properties with the specified types.                                         
{                                                                                       
  "type": "object",                                                                     
  "properties": {                                                                       
    "internal_title":      { "type": "string" },                                        
    "duration":     { "type": "integer" },                                              
    "content":   { "type": "object" },                                                  
  },                                                                                    
  "required": ["internal_title", "duration", "content"]                                 
}                                                                                      

 

A flexible schema, coupled with a nice little gem called activerecord_json_validator, allowed us to create a few guard rails (incremental change is a virtue), and build some confidence we understood the shape of our data. We ran a few validation scripts to test this, learned of a few “dark” questions we needed to fix, and were on our way to migrating the columns.

At the same time, we moved our version columns to JSONB to make it all match. (At the time of writing not all columns are JSONB at Karat.) I won’t go into too many details on that migration, but we saw a lot of the same benefits across historical data, which is a juicy topic for another blog post. Versions using JSONB meant we could query details of the changes with SQL. Not always pretty, but possible in our existing reporting infrastructure.

So what happened?

Discovery of our problem–insufficient data analysis–followed by analysis, exploration, implementation, and backfill took a few side project months to complete. Our systems kept running and we churned out incremental benefits from the project along the way. Once we put the columns in place and backfilled the data:

  • We started doing some fun historical analysis (the subject of an aforementioned future blog post). This unlocked some really interesting insights into our most consistent interview engineers and opportunities to provide better guidance on questions
  • Our Ruby code actually got faster. We have several batch jobs that join some more complicated models, translate enum columns to human-readable text, and do some calculations. A few of these were taking on the order of 75 KBs/second (>30 minutes total)before the changes. The faster deserialization of JSON rather than YAML was 36% faster. We thought this might be a nice benefit, but we didn’t expect it to be so large at our humble scale.
  • Lastly, this major schema change and backfill created some interesting momentum. Now that we could query them with SQL, we discovered our versions tables were much too large to reasonably query. (Terribly indexed to boot.) So we started splitting those out (per the PaperTrail docs) to boost performance. And did it ever.

Now I have to change all my graphs to work with more than 30 days of data . . . Because we finally can!

Quote from an actual engineer

“We’re pleased with how we managed to evolve a simple solution to a complicated problem into a still simple solution. Someday we may decide it’s too simple, but for now we have power, flexibility, and a small investment.”

Leave a Reply

Your email address will not be published. Required fields are marked *