
TensorFlow Sad Story

 2 years ago
source link: https://medium.com/geekculture/tensorflow-sad-story-cf8e062d84ba

I have been using PyTorch for several years now, and have always enjoyed it. It is clear, intuitive, flexible, and fast. Then I was confronted with an opportunity to do my new computer vision projects in TensorFlow. This is where the story begins.

Image by author

TensorFlow is a well-established, widely used framework. It couldn’t be that bad, I said to myself. Given Google’s engineering resources and ML expertise, I expected they would get it right. Now, after working with TensorFlow for the last year, I can say that it is a messy assembly of poorly maintained, ill-directed, and buggy pieces of code. Below is a list of things that I encountered.

As a disclaimer, it is important to mention that I am not a TensorFlow expert, this is not a comprehensive overview, and some things could have escaped my attention, and definitely did.

Installation

The fun starts right from the installation. As a prerequisite, one needs to install both the NVIDIA CUDA and cuDNN libraries. For some reason, it especially bothers me that to download cuDNN one has to register and log in on the NVIDIA portal. Next, the installed CUDA and cuDNN versions have to match the TensorFlow version, otherwise it will not work. And that compatibility table is hard to find even if you know that it exists. These dependencies are a constant source of problems: it is hard to upgrade and downgrade for experiments, and some libraries were routinely missing for me.
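To make the matching problem concrete, here is a sketch of a few rows from the official "tested build configurations" table, as I recall them, encoded as a lookup; the version pairs should be double-checked against the current table before pinning anything:

```python
# Each TensorFlow release is only tested against one exact CUDA/cuDNN pair.
# (Rows recalled from the official tested-build-configurations table;
# verify them before relying on this.)
TESTED_BUILDS = {
    "2.3": {"cuda": "10.1", "cudnn": "7.6"},
    "2.4": {"cuda": "11.0", "cudnn": "8.0"},
    "2.5": {"cuda": "11.2", "cudnn": "8.1"},
}

def check_compat(tf_version: str, cuda: str, cudnn: str) -> bool:
    """Return True only if the CUDA/cuDNN pair matches the tested build."""
    row = TESTED_BUILDS.get(tf_version)
    return row is not None and row["cuda"] == cuda and row["cudnn"] == cudnn

print(check_compat("2.4", "11.0", "8.0"))  # True
print(check_compat("2.4", "11.2", "8.1"))  # False: that pair belongs to 2.5
```

Note that mismatching by one minor version in either direction is enough to break the GPU setup, which is exactly what makes upgrade/downgrade experiments painful.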

Besides tensorflow itself, there is an additional Python package, tensorflow-addons, which people often use. It is supposed to hold less frequently used code, but in effect, for developers who always install both, it is just a redundant split of the code into two pieces. Of course, the versions must match exactly; there is a compatibility table for this too.

Finally, there is tensorflow-gpu. Despite my best efforts, I was not able to figure out why it still gets continuously updated releases on pip, with the same versions as tensorflow, despite its reported retirement two years ago.

The installation is tough, but hey, you have to be tough for what is coming.

Eager execution is a hoax

Before I got to know the real eager execution, this sentence from the official v2 documentation actually made me hopeful:

TensorFlow 2.0 executes eagerly (like Python normally does) and in 2.0, graphs and sessions should feel like implementation details.

Image by artificialintelligencememes.

The major problem with eager execution is that it often runs slower than graph mode. For my CNN models, eager execution is 5 times slower. This immediately renders it a niche debugging tool at best; who would ever want to run it by default?

In the TensorFlow documentation one can find this clarification:

For compute-heavy models, such as ResNet50 training on a GPU, eager execution performance is comparable to tf.function execution. But this gap grows larger for models with less computation and there is work to be done for optimizing hot code paths for models with lots of small operations.

Despite the promised comparable performance, the 5x slowdown that I measured was indeed with ResNet50. But I also used a custom head and a custom loss, and that was apparently too much for eager mode to handle. The Keras team is aware of the problems, and their default execution mode is graph mode. They even specifically recommend against using eager execution:

[Screenshot of the Keras documentation recommending against eager execution]

Note that Keras is the main recommended API for TensorFlow. And yet, TensorFlow 2.0 allowed itself to proudly pronounce that it executes eagerly by default.
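The eager-vs-graph gap is easy to measure yourself. Here is a minimal sketch, assuming TensorFlow 2.x is installed; the chain of tf.sin calls is just a stand-in for a model with many small operations, the regime where eager dispatch overhead dominates:

```python
import time
import tensorflow as tf

def step(x):
    # Many tiny ops: eager mode pays Python dispatch overhead per op,
    # which is exactly where the slowdown is largest.
    for _ in range(200):
        x = tf.sin(x)
    return x

graph_step = tf.function(step)  # same computation, traced into a graph

x = tf.random.normal([256, 256])
graph_step(x)  # warm-up: the first call pays the one-time tracing cost

t0 = time.perf_counter(); eager_out = step(x); t_eager = time.perf_counter() - t0
t0 = time.perf_counter(); graph_out = graph_step(x); t_graph = time.perf_counter() - t0
print(f"eager: {t_eager * 1e3:.1f} ms, graph: {t_graph * 1e3:.1f} ms")
```

Both calls compute the same result; only the execution machinery differs, so any timing difference you see is pure framework overhead.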

tf.data.Dataset API is a mess

The recommended way to build data pipelines in TensorFlow v2 is the tf.data.Dataset API. And it does look nice on the surface, with its “intuitive” modular structure and promised performance improvements. But after experimenting with it for many days, I am disappointed.

The ML task that I worked on for this article is a basic image classification problem. Using the old Sequence API for the same image loading and augmentation pipeline achieved similar GPU utilization and was even slightly faster than tf.data.Dataset.

Image by fossa from imgflip.

But worse than that, the tf.data.Dataset API is notoriously difficult to work with and hard to debug. One seemingly unnecessary thing that you need to maintain is the argument types, which must be specified explicitly. If you use tf.py_function, you additionally need to specify the output tensor shapes, which is not well explained in the documentation, so you are almost guaranteed to spend time finding it out the hard way.
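The shape gotcha looks roughly like this in practice; a minimal sketch, assuming TensorFlow 2.x, where load_image and the 32x32 shape are hypothetical stand-ins for a real Python-side loader:

```python
import numpy as np
import tensorflow as tf

def load_image(path):
    # Stand-in for real Python-side loading/augmentation (path is a tf.string).
    del path
    return np.zeros((32, 32, 3), np.float32)

def tf_load(path):
    img = tf.py_function(load_image, [path], tf.float32)
    # tf.py_function returns a tensor with completely unknown static shape.
    # Without this line, downstream Keras layers see <unknown> and fail
    # in surprising places far from the actual cause.
    img.set_shape([32, 32, 3])
    return img

ds = tf.data.Dataset.from_tensor_slices(["a.png", "b.png"]).map(tf_load)
for img in ds.take(1):
    print(img.shape)  # (32, 32, 3)
```

Comment out the set_shape call and the pipeline still builds, which is what makes the eventual failure so hard to trace back here.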

For the processing functions, one needs to decide between tf.function, tf.py_function, and tf.numpy_function. tf.function is the compiled version and does not support eager execution at all. But tf.py_function can be slow, as each call acquires the Python GIL. You are welcome.

API duplication

Guess how many 2D convolution layer implementations we need? Apparently, 7 is just the right number, according to a TensorFlow dev:

Some of them are low- and high-level APIs, some are developed by third parties, and some are deprecated. But note how none of this is mentioned on the respective documentation pages.

This duplication is everywhere across TensorFlow. As a result, it is hard to know which of the functions is the correct one to use, what the differences are, and whether some are compatible with each other but not with the rest. This is not a theoretical problem: developers have to spend their precious mental energy every day on the dumb activity of picking the right function out of 7.

Sloppy development

Below is a collection of several other issues that I came across.

  1. ImageDataGenerator, the TensorFlow image augmentation utility, does not allow local random seeding, as all modern libraries do. Instead, it relies on the global numpy.random.seed(). Using a global seed significantly complicates the development of reproducible experiments, especially with multithreading.
  2. The Sequence data API method on_epoch_end() does not receive the epoch as an argument. For a couple of releases it was not even called, so a missing argument might seem like a small nuisance in comparison. That case gives you a sense of the test coverage in TensorFlow (much smaller than it should be).
  3. GPU memory management in TF is bad. First, by default it takes all the GPU memory on model initialization, regardless of the actual model size. Fortunately, there is a config option that allows the allocated memory to grow incrementally on demand. But then there is a real issue: allocated GPU memory cannot be released after use; the only way to free it is to kill the process. Facepalm.
  4. TF training logs always contain several obscure warning messages, and depending on the release, the warnings differ. I personally dislike warning messages in my logs and try to resolve them. But in this case it is impossible, because it is usually one TF function complaining about another TF function. They should really pick a different way of communicating.
  5. Finally, the model.predict() function leaks RAM. Isn’t that the main API call for production inference? Isn’t TF supposed to at least be good for production?
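For contrast with point 1, here is what local seeding looks like with plain NumPy, the very library whose global state ImageDataGenerator hijacks; a minimal sketch:

```python
import numpy as np

# Global seeding (the ImageDataGenerator way): anything else touching
# NumPy's global stream silently changes "your" random numbers.
np.random.seed(42)
first = np.random.rand(3)

np.random.seed(42)
np.random.rand(1)            # some other library consuming the global stream
shifted = np.random.rand(3)  # the "reproducible" draws just moved

# Local seeding: each pipeline owns an independent Generator, so parallel
# augmentation workers cannot perturb each other's streams.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
print(np.array_equal(rng_a.random(3), rng_b.random(3)))  # True
print(np.array_equal(first, shifted))                    # False
```

The Generator API has been available since NumPy 1.17, so there is no technical obstacle to supporting local seeds in an augmentation pipeline.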

Conclusions

Just switch to PyTorch.

Versions

This article is based on TensorFlow 2.3.0 to 2.5.0.

