Interactively Visualizing a Data Model With 700 Entities

Flexport’s domain — international logistics — is complex, and most engineers have little knowledge of it when they join. We learn along the way. This approach has been effective in that we’ve looked at the industry with fresh eyes and focused on underlying customer needs. But it’s been challenging in that our codebase has evolved incrementally, with engineers moving quickly to keep up with business growth. This has been especially challenging for our data model and internal documentation.

When I joined the company a year ago, I found myself wishing for an overall view of the data model that I could use to understand the system at a high level. At the beginning of my software career 27 years ago, my mentor introduced me to the wonderful book Object-Oriented Modeling and Design by James Rumbaugh. Ever since, understanding the data model of any problem, system, or software project has been my key to gaining deep insight into it, and an absolute necessity for bringing the project to a successful completion.

This inspired me to launch an internal data visualization project, which I’ll explain in this post.

A map of the system

Any hiker knows that one of the most important things to bring along on a 5-day journey into the backcountry is a good topographic map. That is what a clear, well-documented data model is — a map — a visual representation of the system as a whole in the most compact and information-dense format possible.


A Data Model is Like a Map — the Lay of The Land at a Glance

Desired Criteria

Since the purpose of any map is to gain a fast, intuitive sense of the “lay of the land” and also jump into fine detail whenever necessary, we adopted three criteria for our desired solution:

Maximize visual understandability
Put tool-tips on anything and everything that could possibly provide more detailed information
Have HTTP links on anything that could possibly be cross-linked

The goal was to navigate the entire data model effortlessly and understand it quickly.

Prior Art

Our back-end system is built on Rails and has around 700 models. We originally made an attempt at visualization using the ERD Gem to generate a single combined diagram and printed it on our 4-foot-wide plotter. Unfortunately, the result was a 10-foot-long plot that was just a jumble of boxes and lines: too much information to be helpful.

We also incorporated the annotate_models Gem, which automatically annotates each model’s Ruby source file with the schema of the corresponding table, in the form of comments at the bottom of the file (see the example below). While helpful, this did not meet our desire to have an overview “map” of our entire system.

# == Schema Info
#
# Table name: line_items
#
#  id          :integer(11)  not null, primary key
#  quantity    :integer(11)  not null
#  product_id  :integer(11)  not null
#  unit_price  :float
#  order_id    :integer(11)
#

The end result

Given that the available tools did not meet our need, we decided to dig deeper. The source file for each of these 700 models had the name of the responsible team right in the header, assigning it to one of our approximately 15 engineering teams. Furthermore, some of the larger teams had the models broken down by Rails “engines”, and further into Ruby modules. So there was a natural three-level hierarchy that could be automatically extracted out of the source-code itself. Taking advantage of this structure we were able to generate navigable, interactive data model diagrams that achieved our goal of intelligible and helpful data model visualization.
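To make the three-level hierarchy concrete, here is a minimal Ruby sketch of how a team / engine / module key might be derived for each model file. The "# Team:" header format, the engines/<name>/app/models/ directory layout, and all method and struct names are assumptions made for illustration; the post does not show the actual conventions.

```ruby
# Hypothetical sketch: derive a team / engine / module key for a model file.
# The "# Team: <name>" header and the engines/<engine>/app/models/ layout are
# assumptions for illustration, not the conventions of the real codebase.
ModelLocation = Struct.new(:team, :engine, :ruby_module, :model)

def locate_model(path, source)
  team = source[/^#\s*Team:\s*(\S+)/i, 1] || "unassigned"
  engine = path[%r{engines/([^/]+)/}, 1] || "main"
  # Directories below app/models/ are treated as Ruby module namespaces.
  relative = path[%r{app/models/(.+)\.rb\z}, 1].to_s
  parts = relative.split("/")
  model = parts.pop
  ModelLocation.new(team, engine, parts.join("/"), model)
end

files = {
  "engines/customs/app/models/filing/entry.rb" => "# Team: customs\nclass Entry; end\n",
  "app/models/shipment.rb" => "# Team: operations\nclass Shipment; end\n",
}

locations = files.map { |path, source| locate_model(path, source) }
hierarchy = locations.group_by { |loc| [loc.team, loc.engine, loc.ruby_module] }
hierarchy.each { |key, models| puts "#{key.inspect}: #{models.map(&:model).join(', ')}" }
```

Grouping on the full three-part key yields one bucket per diagram, which lines up with the idea of one diagram per team, engine, or module.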

The animation below shows the final result. Starting from a top-level index (image with multiple straight arrows), we navigate to a team’s diagram, then, via a Foreign Key, to another team’s diagram, then to a data-dictionary for a particular model and, finally, to the diagram of yet another team via a Foreign Key in the data dictionary.

Each of our 15 teams is assigned its own color.

Navigating the Data Model of a system with 700 database tables

The generation scripts we wrote create three types of artifacts:

1. An index page in two formats

First, a visual index showing the teams, engines and modules along with their relationships. Each color corresponds to a team; the arrows correspond to foreign key dependencies between the teams, engines and modules. The darker the arrow, the stronger the linkage.

Visual Index to Flexport’s Data Models grouped by Team, Engine, Module. The Lines Show Foreign Key Relationships.
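One plausible way to turn "the darker the arrow, the stronger the linkage" into code is to count the foreign keys between each pair of groups and map the count to a gray level. The sketch below does this in Ruby; the linear scale and the method names are assumptions, since the post does not describe the actual formula.

```ruby
# Count foreign keys between groups and map each count to a gray level.
# Each foreign key is given as [source_group, target_group]; the linear
# scale below is an assumption, not the article's actual formula.
def linkage_counts(foreign_keys)
  foreign_keys
    .reject { |source, target| source == target } # ignore links inside one group
    .tally
end

def arrow_color(count, max_count)
  # 200 (light gray) for the weakest links, down to 0 (black) for the strongest.
  level = 200 - (200.0 * count / max_count).round
  format("#%02x%02x%02x", level, level, level)
end

foreign_keys = [
  %w[billing shipments], %w[billing shipments], %w[billing shipments],
  %w[customs shipments], %w[billing billing],
]
counts = linkage_counts(foreign_keys)
max = counts.values.max
counts.each do |(source, target), count|
  puts %(#{source} -> #{target} [color="#{arrow_color(count, max)}"];)
end
```

Each printed line is a DOT edge statement, so the shading can be handed straight to Graphviz as a color attribute.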

Second, though not shown in the diagram above, the same landing page also contains an alphabetical list of all 700 models, allowing our engineers to search for a model they are interested in.

2. A visual representation for the models and their relationships

Data Model Diagram for one of the “Engines”. Diagrams Are Legible Up to About 45 Models.

As in the index, each color corresponds to a team. The lines represent Foreign Key relationships between models, including the multiplicity.

3. An HTML data-dictionary

Drilling down to each individual model shows an HTML data dictionary, containing:

general information about it (name of database table, name of Ruby class, link to GitHub repo, etc)
list of attributes
list of outgoing Foreign Key relationships (belongs_to)
list of incoming Foreign Key relationships (has_one, has_many)
list of Polymorphic Associations

An HTML data-dictionary for one of the models. YAML annotation files will provide descriptions in the future.

Visual Understandability, Tool-Tips, Navigation

Following the three criteria mentioned earlier, each model on a Data Model Diagram has the following features:

Visual Understandability

Colors on Foreign Key links indicate which team’s model the link points to
Relationship line symbols indicate multiplicity (one-to-one, one-to-many, etc)
Role name (“Removal Reason”) is indicated on relationship lines as appropriate

Tool-Tips

Hovering over model title (“Service Item Removal Reason”) shows general info about the model
Hovering over any attribute (e.g. “Note”) shows info about the attribute
Hovering over any of the “buttons” (134, 2, Doc) shows where the link navigates to — e.g. “134” navigates to another team diagram with 134 models in it, and “2” navigates to a Ruby module within that team with just two models in it.

HTTP Links

Clicking on the title (“Service Item Removal Reason”) navigates to the Data Dictionary
Clicking on any attribute navigates to its specific entry in the Data Dictionary
Clicking on one of the number buttons (e.g. 134) navigates to the Data Model Diagram which contains the entity pointed to by the Foreign Key
Clicking on the “Doc” symbol navigates to the Data Dictionary of the model pointed to by the Foreign Key

Overall, these features enable the user to navigate through and understand the entire data model with minimum effort.
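Graphviz can carry tool-tips and hyperlinks into its SVG output through the standard tooltip and URL attributes. As a rough sketch (not necessarily how our generator does it), the Ruby snippet below emits one DOT node with both attributes set. The table name in the tool-tip and the dictionary/<id>.html naming scheme are hypothetical placeholders.

```ruby
# Emit one DOT node whose SVG rendering carries a tooltip and a hyperlink.
# `tooltip` and `URL` are standard Graphviz attributes; the file naming
# scheme (dictionary/<id>.html) is a hypothetical placeholder.
def dot_quote(text)
  # Escape double quotes and backslashes so the value stays one DOT string.
  escaped = text.to_s.gsub(/["\\]/) { |ch| "\\#{ch}" }
  %("#{escaped}")
end

def linked_node(id, label:, tooltip:, url:)
  attrs = { label: label, tooltip: tooltip, URL: url }
  pairs = attrs.map { |name, value| "#{name}=#{dot_quote(value)}" }
  "#{id} [#{pairs.join(', ')}];"
end

puts linked_node(
  "service_item_removal_reason",
  label: "Service Item Removal Reason",
  tooltip: "Table: service_item_removal_reasons",
  url: "dictionary/service_item_removal_reason.html"
)
```

This sketch attaches the link to the whole node; per-attribute links and buttons like the ones described above would need finer-grained labels.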

CI Pipeline Integration

Finally, our backend-infra team was able to integrate the data visualization generation process into our Continuous Integration pipeline, so these artifacts are generated after every build and always stay up-to-date. This was an important requirement, as without automatic updates during the normal course of development this kind of documentation can become obsolete quickly.

Implementation details

The following diagram gives a high-level overview of the extraction and generation process.

Overview of Model Documentation Generation Process

Schema Extraction

A Ruby script uses Rails’ Reflection API to extract metadata about all the models and their relationships and dumps all this information into a YAML file. This process takes roughly 45 seconds on a 2018 MacBook Pro.

The substantive code for this 120-line script is the following…

First, the application must be eager-loaded:

# Initialize the application.
require "./config/environment.rb"
Rails.application.eager_load!
Rails.application.config.eager_load_namespaces.each(&:eager_load!)

The actual extraction part boils down to the following:

# Extract info about each model.
ActiveRecord::Base.descendants.each do |model|
  # …Write model parameters…
  model.columns.each do |column|
    # …Write column parameters…
  end
  model.reflect_on_all_associations.each do |association|
    # …Write association parameters…
  end
end
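To give a feel for the "write" steps, here is a self-contained sketch that runs without Rails. Plain Structs stand in for the model, column, and association objects, using the same accessor names Rails exposes (name, type, null, macro, class_name). The YAML layout and key names are assumptions, since the real file format is not shown here.

```ruby
require "yaml"

# Stand-ins for the metadata Rails reflection would provide; the key names
# and YAML layout are assumptions, since the post does not show them.
Column = Struct.new(:name, :type, :null)
Association = Struct.new(:macro, :name, :class_name)
Model = Struct.new(:name, :table_name, :columns, :associations)

def model_to_hash(model)
  {
    "name" => model.name,
    "table" => model.table_name,
    "columns" => model.columns.map do |c|
      { "name" => c.name, "type" => c.type.to_s, "nullable" => c.null }
    end,
    "associations" => model.associations.map do |a|
      { "macro" => a.macro.to_s, "name" => a.name.to_s, "class_name" => a.class_name }
    end,
  }
end

def dump_schema(models)
  YAML.dump(models.map { |m| model_to_hash(m) })
end

line_item = Model.new(
  "LineItem", "line_items",
  [Column.new("id", :integer, false), Column.new("unit_price", :float, true)],
  [Association.new(:belongs_to, :order, "Order")]
)
puts dump_schema([line_item])
```

Converting everything to plain strings, booleans, hashes, and arrays before dumping keeps the YAML readable from any language, which matters when a different program consumes it.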

File Generation

Code (written in C#) ingests the YAML file produced by the Ruby script above, assembles its output into an object-oriented representation of the schema and generates:

1. The HTML for the index file and the data dictionary for each model
2. Files that describe the Data Model Diagrams in DOT, the declarative graph-description language used by the open-source graph visualization tool Graphviz

The DOT Language is actually very simple in its syntax, and it was fairly trivial to write an object-oriented wrapper around it. This language is made up of three basic concepts: the graph, nodes and edges. On top of that, there is a plethora of attributes to control the behavior of each of these entities.

Here is a sample piece of code that creates an “edge” in the graph to represent an inheritance relationship and sets several attributes on it:

Edge edge = new Edge() {
  Source = model.Superclass.Id,
  Destination = model.Id,
};
edge.SetAttrGraph("dir", "both") // Allows for both ends of line to be decorated
  .SetAttrGraph("arrowsize", 1.5)
  .SetAttrGraph("fontname", "Helvetica")
  .SetAttrGraph("arrowhead", "none")
  .SetAttrGraph("arrowtail", "onormal")
  .SetAttrGraph("tailport", "s"); // Forces arrow to connect to center bottom.
return edge;
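Our wrapper is written in C# and is not reproduced in full here. As a language-neutral illustration of the three DOT concepts (graph, nodes, edges) plus attributes, the following Ruby sketch builds a small graph and prints its DOT text. The class and method names are hypothetical, and values are not escaped, so it is only suitable for simple identifiers.

```ruby
# A tiny, illustrative DOT builder: one graph, plus nodes and edges that
# each carry a hash of attributes. Class and method names are hypothetical.
# Attribute values are not escaped, so keep them free of double quotes.
class DotGraph
  def initialize(name)
    @name = name
    @statements = []
  end

  def node(id, attrs = {})
    @statements << "#{id}#{format_attrs(attrs)};"
    self # returning self allows chained calls
  end

  def edge(from, to, attrs = {})
    @statements << "#{from} -> #{to}#{format_attrs(attrs)};"
    self
  end

  def to_dot
    body = @statements.map { |line| "  #{line}" }.join("\n")
    "digraph #{@name} {\n#{body}\n}\n"
  end

  private

  def format_attrs(attrs)
    return "" if attrs.empty?
    pairs = attrs.map { |key, value| %(#{key}="#{value}") }
    " [#{pairs.join(', ')}]"
  end
end

graph = DotGraph.new("models")
graph.node("order", shape: "box").node("line_item", shape: "box")
graph.edge("line_item", "order", label: "belongs_to", arrowhead: "normal")
puts graph.to_dot
```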

3. The C# code invokes graphviz on each of the DOT files.

This runs surprisingly fast: generating roughly 150 model diagrams takes only about 10 seconds on a 2018 MacBook Pro.

Graphviz provides several layout engines. We have found that for the Model Diagrams, the “dot” engine, which arranges nodes in layers, gives the nicest results. For the index, however, we used the “fdp” engine (which treats the graph layout as a physics optimization problem) to produce a reasonable diagram; see the picture near the top of the article.
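For readers who want to try this, the sketch below shows how a script might build the Graphviz command line for each DOT file. It is written in Ruby for brevity, whereas our generator is C#. The dot and fdp executables and the -Tsvg and -o flags are standard Graphviz; the method names and file paths are illustrative only.

```ruby
# Build the Graphviz command line for one DOT file. `dot` and `fdp` are
# real Graphviz executables and -Tsvg / -o are real flags; the method
# names and file paths are illustrative.
LAYOUT_ENGINES = %w[dot fdp].freeze

def graphviz_command(dot_path, engine: "dot")
  raise ArgumentError, "unsupported engine: #{engine}" unless LAYOUT_ENGINES.include?(engine)
  # Swap a trailing .dot for .svg, or append .svg if there is no .dot suffix.
  svg_path = dot_path.sub(/(\.dot)?\z/, ".svg")
  [engine, "-Tsvg", dot_path, "-o", svg_path]
end

def render(dot_path, engine: "dot")
  # system returns true only if Graphviz is installed and exits successfully.
  system(*graphviz_command(dot_path, engine: engine))
end

p graphviz_command("diagrams/billing.dot")
p graphviz_command("diagrams/index.dot", engine: "fdp")
```

Passing the command as an argument list rather than a single string avoids shell quoting problems with unusual file names.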

Continuous Integration Hooks

Finally, our backend-infra team created some magic to integrate all this into our post-build process and push the resulting HTML and SVG files into a secure Amazon S3 Bucket — visible to all in the company through our ‘go’ links as ‘go/models’.

Conclusion and Future Work

It is still a little early to tell how pervasive the use of this tool is throughout the company, but our Data Science Team, in particular, has expressed a strong interest.

Annotations

While the current tool, as is, is very helpful, one thing is still lacking — high-level verbal descriptions of what the different models and attributes are intended to represent. The tool contains a feature where every Ruby <model>.rb file can have a parallel YAML file for model and attribute annotations.
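As an illustration of what such a parallel annotation file could look like, the Ruby sketch below merges attribute descriptions from YAML into a list of attribute names. The file layout (a description key plus an attributes map), the sample descriptions, and the method name are all hypothetical, not the tool's actual format.

```ruby
require "yaml"

# Hypothetical annotation file format: a model description plus
# per-attribute descriptions. The real tool's format is not shown in the post.
SAMPLE_ANNOTATIONS = <<~ANNOTATIONS
  description: One product line on a customer order.
  attributes:
    quantity: Number of units ordered.
    unit_price: Price per unit at the time of ordering.
ANNOTATIONS

def describe_attributes(attribute_names, annotation_yaml)
  annotations = YAML.safe_load(annotation_yaml) || {}
  described = annotations.fetch("attributes", {})
  # Attributes without an annotation get a visible placeholder.
  attribute_names.to_h { |name| [name, described.fetch(name, "(no description yet)")] }
end

describe_attributes(%w[id quantity unit_price], SAMPLE_ANNOTATIONS).each do |name, text|
  puts "#{name}: #{text}"
end
```

Showing a placeholder for missing entries makes the gaps visible in the data dictionary, which may nudge teams to fill them in.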

Other Ideas

Another idea we’ve been considering is to parse Rails migration files, extract which models and attributes they added, and connect this with Git Pull Request (PR) information. Then, we would be able to annotate the models and their attributes with links to these Pull Requests. Reading the description of a Pull Request would likely provide enough context to understand why the models or attributes were created.
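A first cut at the migration-parsing half of this idea might look like the Ruby sketch below, which scans migration source text for create_table and add_column calls. It is deliberately naive and purely exploratory: it ignores change_table blocks, columns defined inside create_table, and dynamically built names, and it does nothing about the Pull Request lookup. The method and constant names are hypothetical.

```ruby
# Scan migration source text for add_column and create_table calls.
# Regex-based and deliberately naive: it ignores change_table blocks,
# t.<type> column definitions, and dynamically built names.
ADD_COLUMN = /add_column\s*\(?\s*:(\w+)\s*,\s*:(\w+)/
CREATE_TABLE = /create_table\s*\(?\s*:(\w+)/

def schema_additions(migration_source)
  {
    tables: migration_source.scan(CREATE_TABLE).flatten,
    columns: migration_source.scan(ADD_COLUMN).map { |table, column| "#{table}.#{column}" },
  }
end

migration = <<~MIGRATION
  class AddNoteToLineItems < ActiveRecord::Migration[5.2]
    def change
      add_column :line_items, :note, :text
      create_table :removal_reasons do |t|
        t.string :name
      end
    end
  end
MIGRATION

p schema_additions(migration)
```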

Yet another idea is to automatically add an HTTP link to each Ruby model file, linking it directly to its generated data dictionary.

Future: From Database Models to API Models

As we move to break up our rather monolithic code-base that completely relies on the database schemas for communication into a partitioned system that communicates via APIs, our hope is to continue to use this tool to auto-generate understandable “maps” for our API-based system.

Exciting times ahead — stay tuned, and if you’re interested in this sort of thing we are hiring!
