57

3 Reasons Why I’m Ditching SSIS for Python

 4 years ago
source link: https://www.tuicool.com/articles/FbUrAbR
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

y2Ufey3.jpg!web

Photo by Chris Ried on Unsplash

I’ve been using the Microsoft SQL Server technology stack for more than a decade, and while I continue to be extremely bullish about it, I’ve lately changed my tune on a key component of it, namely SQL Server Integration Services , or SSIS for short. SSIS is a very powerful tool to perform extract, transform, and load (ETL) workflows on data, and can interact with pretty much any format out there. And while I’ve mostly seen it used in the context of loading data into or out of SQL Server, that certainly isn’t its only use.

I’ve authored more than my share of SSIS packages over the years, and while I still feel it’s a tremendous tool to have in your arsenal (and one that in many cases may be the only one available in large enterprises with strict standards around technology usage), I’ve now decided that for reasons I’ll outline below, I’d prefer using Python for most, if not all, ETL needs. This is especially true when combining Python with two modules specifically made for manipulating and analyzing data at scale, namely Dask and Pandas .

Python is free and open source

Python is a completely open source language, and is maintained by the Python Software Foundation . It, and a huge number of its packages, are available completely free of charge, and you can easily contribute to the underlying source code if you see a bug or need a feature. For example, Dask and Pandas combined have had over 25,000 commits and 9,000 forks on GitHub. Both are very active projects and have large, distributed, and active communities behind them. In addition, Python can talk to pretty much any data source using other open source packages; from CSV files , to Kafka , toscraping web sites. Python as a whole is hugely popular and growing, jumping from seventh to fourth place in Stack Overflow’s 2019 Developer Survey .

SSIS, on the other hand, requires you to license any machine running it just as you would any other machine running a full instance of SQL Server. So, if you want to follow good practice and offload your ETL processing on to a machine different from your SQL Server instance, you’ll have to pay fully to license that machine. Assuming you can successfully navigate the rather complex licensing for SQL Server, think about this: many ETL workloads are batch operations, meaning they tend to run at scheduled times during the day and otherwise sit idle. Do you really want to be paying (sometimes big) money for something that’s used once a day or sometimes less? And while it’s true that SSIS is extensible, I’ve yet to see the kind of broad open source toolset available that you see on the Python side.

Using tools like Dask, Python is inherently horizontally scalable

Dask was designed specifically to handle the issue of datasets that are too large to fit in memory on a single node, and is scalable across many nodes . As a result, you can easily scale the size of your data processing environment as your needs dictate, using tools that your organization might already use (like Kubernetes ) without writing complex code to distribute data across nodes. I’ve personally used it to crunch tens of gigabytes of data on my laptop using nothing but the inbuilt local distributed model , and I didn’t have to change the way I wrote the data processing code one bit.

SSIS, at least as best I can tell, has no inherent way to distribute processing across multiple computers, at least not without complex solutions like that proposed on this ServerFault thread . While SSIS as of SQL Server 2017 does have Scale-Out functionality, this is more for distributing work at the package and task level (e.g. running individual parts of a package in a distributed way), so if any of those single tasks process a large amount of data, you’re still limited.

Python code is inherently testable

As my good friend Nick Hodges wrote recently, unit testing your code is kind of important . This is just as true when dealing with ETL workflows; how else can you ensure without some kind of manual observation that your ETL process is producing the correct outputs given the expected inputs (or equally important, how it handles things when it gets unexpected data)?

Python has a number of useful unit testing frameworks, such as unittest or PyTest . By breaking up your ETL processes into consumable units of code, you can easily ensure expected behavior and make changes without fear of inadvertently breaking something.

In contrast, SSIS doesn’t come with any easy way to write unit tests, and the two most oft mentioned frameworks out there are either inactive or appear to have been largely transitioned to proprietary products .

The bottom line is this: in this rabid SQL Server fan’s opinion, if you’re developing new ETL workflows, the better choice is Python rather than SSIS. Mind you, SSIS is still extremely powerful, and if you’ve got a robust existing team supporting it and have invested heavily in SQL Server licensing, then there’s certainly no reason to retrofit all of those. And as I said at the outset, in many large organizations SSIS (or Informatica to give another example of a very popular proprietary ETL tool) is the de-facto standard. Still, it’s clear that Python deserves a seat at the table, given its explosive growth.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK