SAS vs ROOT. Roasting SAS and comparing to ROOT

 3 years ago
source link: https://mightynotes.wordpress.com/2019/11/18/sas-vs-root-roasting-sas-and-comparing-to-root/

Recently, some guys from the Department of Finance asked on the school’s discussion group for someone to help them in a data analysis contest (hosted by SAS). What could go wrong? I have heard good things about SAS since I was 10. My mother’s company uses SAS extensively. In the worst case, we could write the ML algorithms directly in SAS. I’m already familiar with another data analysis framework, ROOT. Shouldn’t be too hard, right? And I absolutely love expensive toys.

Oh, crap. SAS stinks.

Introduction

SAS, for those wondering, is a very expensive statistics package by the SAS Institute. It is now mainly used by enterprises for legacy reasons, and by people without an IT or statistics background because of the amount of advertising SAS does. And by very expensive I mean: this thing costs about US$9,000 for a basic single-user Windows package, and the Linux/mainframe version costs about US$100,000 for just the software itself, not including user fees.

ROOT, on the other hand, is a statistics package developed by CERN and used to discover the Higgs boson. ROOT supports everything SAS does and is absolutely free (if you can’t tell, I love ROOT). It also comes with, IMO, some of the most stupendously ridiculous yet powerful technology. Also, you use C++ to interact with ROOT. How heartwarming is that :3

The competition gave us remote access (via RDP) to some real, anonymized customer data and asked us to predict the behavior of other customers. Sounds fair enough! Though I only had a week to run preprocessing and build ML models. Fortunately, we were allowed to use Python or R for data pre- and post-processing.

Figure: The environment we got. Windows Server 2012 with the entire SAS stack.

Also, I filed two formal vulnerability reports informing the host about potential data leaks. The first covered getting data out via SSH and SCP directly, and the second covered sending encrypted data over UDP. I believe transferring data via QUIC still worked at the end of the competition. I didn’t even try; the security just falls apart. At least they remembered to block DNS.

There’s also Excel, the software data analysts hate the most.

We got a free 2-day training course from SAS. The first day covered SAS Enterprise Guide and the second covered SAS Viya. SAS EG is fine; it is basically Scratch for statistics. Viya is the beast in the room. It is SAS’s big data analysis tool, supporting multiple ML models and claimed to be blazing fast.

Let’s be clear. SAS Viya sucks. I’ll address why one by one.

First, SAS Viya only supports basic ML models. I’m not asking for deep learning. They don’t even have some of the best models for commercial cases, like xgboost and projective likelihood estimation. Viya only supports the basic algorithms, like random forest, gradient boosting, and linear/logistic regression.

Secondly, using SAS from Python is a nightmare. Python-SWAT is available to us to interact with Viya’s Cloud Analytic Services (CAS) subsystem. This thing works through a RESTful API. Seriously? Did you even consider performance at all? That’s not even the problem. CAS has some really interesting properties: after model training, the model is stored as a SAS table on the server. To use the trained model for inference, we need to download the model and then re-upload it to CAS. WAT? Also, CAS will end up in an infinite loop if we ask it to run inference with a model that doesn’t exist, say because of a typo. Furthermore, SWAT won’t throw exceptions when things go wrong. Instead, it prints the error to stdout and returns an error code. For the number 42’s sake. Gosh.
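To illustrate why error codes are so easy to miss, here is a minimal sketch of the usual workaround: wrap every call in a check that turns a non-zero status into an exception. It is written in C++ to match the rest of this post. `ActionResult`, `run_action` and `checked` are hypothetical names made up for the illustration; none of this is the SWAT or CAS API.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical result of a remote action: a status code plus a message,
// mimicking an API that reports failure without throwing.
struct ActionResult {
    int status;           // 0 means success, anything else is a failure
    std::string message;  // human-readable reason supplied by the server
};

// Hypothetical stand-in for a remote call; it fails when the table name is empty.
ActionResult run_action(const std::string& table) {
    if (table.empty()) return {1, "table does not exist"};
    return {0, "ok"};
}

// Turn a non-zero status into an exception so a failure cannot be silently ignored.
ActionResult checked(const ActionResult& result) {
    if (result.status != 0) {
        throw std::runtime_error("action failed (status " +
                                 std::to_string(result.status) + "): " + result.message);
    }
    return result;
}

// Example usage: success passes through, failure surfaces as an exception.
bool example_usage() {
    assert(checked(run_action("model")).status == 0);
    try {
        checked(run_action(""));  // simulates asking for a model that is not there
    } catch (const std::runtime_error&) {
        return true;  // the failure was impossible to miss
    }
    return false;
}
```

With a wrapper like this, a typo in a table name stops the script at the failing call instead of scrolling past as a line on stdout.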

Finally, and most importantly: Viya does everything in memory! What!? As a package for big data, you have to load everything into memory before doing the job? I asked the trainer about it with a straight face when he brought it up. “But how would you deal with 1TB of data when you only have 700GB of system RAM?” I asked. And he replied, “Get a server with more RAM!” Crap….

I forgot to mention: Viya’s GUI is so slow that it slows down my workflow. Yet it is the only way I can do ML with it until I figure out all the bugs and quirks that SWAT has.

Figure: SAS Viya

ROOT, despite some of its flaws, does a lot of things correctly. Foremost, ROOT runs on modern C++ instead of the SAS language, which looks awfully like COBOL. Secondly, ROOT’s TTree and RDataFrame support fully on-disk operation. So given enough storage space, you can process terabytes of data on a single machine (you can, but it’s gonna be slow).

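// Lazily open the "data" tree across every file matching the glob; nothing is loaded into RAM up front.
auto rdf = ROOT::RDataFrame("data", "some_data_*.root");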
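To make the in-memory versus on-disk difference concrete, here is a minimal sketch in plain C++ of processing data in fixed-size chunks. It is not how ROOT implements TTree or RDataFrame. `streamed_mean` and `example_usage` are hypothetical names, and the input format (raw binary doubles) is an assumption chosen to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Mean of raw binary doubles read from `in`. At most `chunk` values are held
// in memory at a time, so the input can be much larger than system RAM.
double streamed_mean(std::istream& in, std::size_t chunk = 4096) {
    assert(chunk > 0);
    std::vector<double> buffer(chunk);
    double sum = 0.0;
    std::size_t count = 0;
    while (in) {
        in.read(reinterpret_cast<char*>(buffer.data()),
                static_cast<std::streamsize>(chunk * sizeof(double)));
        // gcount() reports how many bytes the last read actually delivered.
        const std::size_t got = static_cast<std::size_t>(in.gcount()) / sizeof(double);
        for (std::size_t i = 0; i < got; ++i) sum += buffer[i];
        count += got;
    }
    return count == 0 ? 0.0 : sum / static_cast<double>(count);
}

// Example usage: an in-memory stream stands in for a large file on disk.
double example_usage() {
    const std::vector<double> values = {1.0, 2.0, 3.0, 4.0, 5.0};
    const std::string bytes(reinterpret_cast<const char*>(values.data()),
                            values.size() * sizeof(double));
    std::istringstream in(bytes);
    return streamed_mean(in, 2);  // processes the data two values at a time
}
```

For a real dataset you would pass a `std::ifstream` opened with `std::ios::binary` instead of the string stream. Memory use stays bounded by the chunk size no matter how big the file is, which is the property the trainer’s “get more RAM” answer gives up.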

The rest of ROOT benefits a lot from this on-disk operation. ROOT’s ML package, TMVA, can therefore train on very large datasets even when you don’t have anything close to enough memory to hold the dataset. TMVA also supports many state-of-the-art ML models.

Here’s the kicker: unlike SAS, where I have to use a wrapper with abstractions in order to use SAS from Python, ROOT uses information from its internal C++ JIT to automatically generate Python bindings. This means using ROOT from Python is exactly the same as in C++. You can still put a wrapper around it for a Pythonic interface. The point is that knowledge transfer is easy in ROOT.

And the fact that ROOT runs on C++ means you can interact with basically any library you want, and even talk directly to the OS if you wish. It also supports a lot of low-level features for very fast code.

SAS vs ROOT

In conclusion, I made this table to show how bad my experience with SAS was and why I really love ROOT.

                        SAS Viya          ROOT
Price                   Very expensive    Free
License                 Proprietary       LGPL
On disk analysis        No                Yes
Distributed computing   Yes (default)     Yes (via PROOF)
ML models               Basic             Medium
Statistic               Good              Good
Plotting                Good              Interactive plots
Community               Good              Good
Package availability    Bad               Good (C++)
Entry barrier           Low               High
Plot saving             image files       image, LaTeX, C..
Language                SAS               C++
General use             No                Why not? It’s C++
Serialization           basic             Every ROOT and STL class.
DB support              yes               same, but more

Some more roasting

Figure: screenshots just showing the SAS UI and SWAT and how bad they are.
Each page in SAS takes more than 5s to load. (Jupyter is cool.)
