source link: https://medium.com/the-stata-guide/the-art-of-the-code-e3d44efd84cb
The Art of the Code

Reading published code is hard work, and trying to emulate its complex structure is even harder. While we can untangle code into smaller core parts, can we link these parts back together to form a coherent whole? In this guide we discuss the art of coding: why coding is not given the attention it needs, why it seems difficult, and how we can approach these issues. And yes, we should all learn to code.

Coding at higher levels requires the ability to interweave a lot of small parts into an intricate tapestry. At lower levels, we can usually patch together some script that gives us the outputs we want. These outputs are typically valued more than the code itself, especially in a field like economics. I have touched upon this point in earlier guides (here and here), where I talk about organization and replicability, especially if someone else has to use our scripts. The more coherent and neater our code is, the easier it is for it to live on. Even if no one else ever has to use our code, if we ourselves cannot read our old files, that implies a lack of structure. And we are simply not taught how to approach and structure code. This issue is not uncommon, and it sits at the heart of an important problem: coding is taught as a means to an end, and not as an end in itself. Furthermore, knowing how code actually functions inside computers helps a lot.

The coding curve is not linear

Let’s start from the point about replicating code mentioned above. It is important to know that the gap between one level of coding and the next is not linear; it is exponential. Each new level is a lot more than the sum of its parts. It requires not only the ability to piece together a large puzzle, but also internalizing a certain way of thinking.

Let’s take the simple example of a two-dimensional jigsaw puzzle that contains many small pieces which fit together in only one possible way. The puzzle also has fixed dimensions. It is therefore a well-defined problem with only one solution. But we can arrive at this solution in several ways. For example, we can pick out the edge pieces first and figure out the puzzle boundary. We can also sort the pieces by color, which gives us some indication of where they are likely to fit. Or we can just go for the brute-force method, where we check the shapes of the pieces to see where they fit.

Each of the above processes is a method, or a function, that converts an input into an output. Each of these methods will eventually solve the puzzle, but some will be faster than others, or at least perform better on some parts. There is no right or wrong method here. We can also invoke different algorithms at different times, or even use several of them at the same time. It all depends on our objective function: speed, fun, or both.

Now assume a more complex puzzle in three or more dimensions. How do we solve this problem? We can take a building-block approach and start at the bottom to create a foundation, usually by defining core modules, and then keep building new blocks on top of the old ones until the puzzle is finished. This method only works if the blocks are fully independent of each other. If different parts are inter-linked, then we need to start building different layers simultaneously in the early stages. This is systems thinking for code.

So what goes into coding? Programmers usually start with a code shell, sort of like the outline of a 2-D puzzle, that produces some basic results and defines the boundaries. The aim is to produce a minimum working example. We can then start embellishing it with additional options, crossing all the t’s and dotting all the i’s, so that eventually our program covers ALL the possible outcomes it needs to produce. An extremely important yet overlooked point is that a piece of code should also know what it is NOT supposed to produce. In short, garbage in should give an error, not garbage out.

Once all the pieces are in place, and the outputs are giving us what they are supposed to give, the next (overlooked) step is to clean up the code. This entails (i) making it compact and efficient by better inter-linking the different parts, (ii) reprogramming parts of the code to reduce the memory footprint, and (iii) making it look neat and documenting it. These under-the-hood processes require several iterations and are usually skipped, simply because the focus is on producing neater outputs and finishing up those papers. By this time, the code itself is probably on the back burner. But as journals push more and more for transparency, and more and more code becomes available in the public domain, the focus has also shifted back to the process that produces the outputs. It is therefore even more important to start sooner rather than later, because untangling the convoluted mess of some syntax becomes a whole project on its own.
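
To make the garbage-in point concrete, here is a minimal sketch of input validation in Stata. The program name mymean and its behaviour are just an illustration, not something from the guides: the syntax command rejects non-numeric input outright, and the program exits with an error when there is nothing to compute.

    * a hypothetical program that refuses to turn garbage in into garbage out
    capture program drop mymean
    program define mymean
        syntax varname(numeric)                   // errors out if the variable is missing or not numeric
        quietly summarize `varlist'
        if r(N) == 0 {
            display as error "`varlist' has no nonmissing observations"
            exit 2000                             // Stata's "no observations" error code
        }
        display as result "mean of `varlist' = " r(mean)
    end

    sysuse auto, clear
    mymean price                                  // valid numeric input: prints the mean
    capture noisily mymean make                   // make is a string, so the program stops with an error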

While starting with a shell and working towards a final piece of code is a fairly standard process, the key difference is that more seasoned programmers will already be thinking of the final product early on. This requires having a very clear top-down meta view of the final output, while using a bottom-up approach to interweave the parts. Experienced coders are also constantly striving to program sparsely, with as little syntax as possible, and to maximize computational efficiency, using as little memory as possible. Learning these things of course takes time, but if you are aware of this aspect in the early stages, it will pay huge dividends later on.

In order to code better, knowing how a piece of code interacts with your computer also really helps.

It’s all ones and zeros

Unlike solutions to real-world problems, which can be structured or unstructured (or lucky guesses), puzzles inside computers are solved using algorithms. Since computers are just ones and zeros, so are the solutions. If you understand this, and can structure your coding around it, then you have already advanced a level.

Let’s take the topic of machine learning (ML), a buzzword that is liberally thrown around, and start with a classic ML example: recognizing handwritten characters. Assume that you write some character on a piece of paper, scan it, and open it on your screen. How would you go about writing an algorithm to read this image and convert it into a number? While we can see and process images in one go (because brains are amazing), a computer only sees ones and zeros. All it can “read” are the pixels on the screen and whether they are light, dark, or partially shaded. So how does one convert this information into a character that the computer itself recognizes?

Without turning this into an ML lecture: in very simple terms, a computer cannot identify a character on its own. What it can do very well, and extremely fast, is a bunch of mathematical calculations. Some of these calculations require solvers, core programs designed to find a solution to some numerical problem. But what are we solving? Since a computer does not know what a letter or a number looks like, it can be trained to find patterns by feeding it a lot of data on existing characters, for example a large dataset that maps images (set A) to characters (set B). This is called a training dataset. A computer can use this information to solve a minimization problem that detects close matches to our image. And what is it matching? Basically, the pixels from our scanned example against a bunch of pixels from the training data. A computer cannot tell us what character it has found. All it can give us are probabilistic matches from set A to set B based on the information elicited from the training data. The methods used to determine these matches are a whole science of their own that we will leave for another day. And here is another lesson: computers are very good at doing mathematical calculations very fast, and that’s about it. Until computers become sentient (insert another buzzword here: AI), this is what we are stuck with.
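
As a toy illustration of this pixel-matching idea, here is a sketch in Mata (Stata’s matrix language). The 3x3 “images” and their labels are made up: each row of the training matrix is a flattened image, and the scanned character is assigned the label of the training row it is closest to.

    mata:
        // training data: each row is a flattened 3x3 image of a known character
        train  = (1,1,1, 0,0,1, 0,0,1 \ 0,1,0, 0,1,0, 0,1,0)
        labels = ("7" \ "1")
        // a new, slightly noisy scan of a character
        scan   = (1,1,1, 0,1,1, 0,0,1)
        // distance between the scan and every training image
        d = sqrt(rowsum((train :- scan):^2))
        p = order(d, 1)                          // rows sorted from closest to farthest
        labels[p[1]]                             // the best probabilistic match: "7"
    end

Real handwriting recognition works on far larger images and far cleverer distance measures, but the principle of matching pixels against training data is the same.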

In the land of 1s and 0s, solvers are king

Let’s talk about these solvers for a bit, since without them one cannot do much, including running regressions, doing ML, or even playing games. Solvers are essentially basic algorithms that solve specific problems: systems of equations, optimization problems, finding minimum distances, and so on. These are the core programs on which other programs, including other algorithms, are built. They are the building blocks of large puzzles.
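
For instance, here is a small sketch (with a made-up two-equation system) of how a solver is typically invoked in Mata: lusolve() solves a linear system using an LU decomposition under the hood, so we never invert anything by hand.

    mata:
        A = (2, 1 \ 1, 3)          // coefficients of 2x + y = 5 and x + 3y = 10
        b = (5 \ 10)
        lusolve(A, b)              // returns 1 and 3, the values of x and y
    end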

In economics, students are suddenly thrown into a world of computer syntax, where they are taught that typing reg y x spits out a regression output. Some students will also know from econometrics courses that behind this OLS is the famous matrix solution (X'X)^-1 * (X'y). A few might even learn how to program this directly using some matrix algebra. But almost no one will know how a computer deals with this problem. For example, how does a computer know how to invert the matrix? We know how to invert a 2x2 or a 3x3 matrix by hand using the Gauss-Jordan algorithm (the method taught in high schools and colleges), but computers can also utilize various other algorithms to invert n-dimensional matrices, for example LU or QR decomposition methods that closely approximate the exact answer using numerical solutions. But the methods have to be figured out first: once we have worked out how to invert 2- or 3-dimensional matrices by hand, a computer can take that set of instructions and apply it to n-dimensional matrices.
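
To see both routes side by side, here is a minimal sketch using Stata’s bundled auto dataset (the variable choice is arbitrary): the textbook (X'X)^-1 X'y formula via invsym(), and a numerical least-squares solver, qrsolve(), which relies on a QR decomposition. Both should reproduce what regress reports.

    sysuse auto, clear
    mata:
        y  = st_data(., "price")
        X  = (st_data(., ("mpg", "weight")), J(st_nobs(), 1, 1))   // regressors plus a constant
        b1 = invsym(cross(X, X)) * cross(X, y)                     // textbook matrix formula
        b2 = qrsolve(X, y)                                         // least squares via QR decomposition
        (b1, b2)                                                   // the two routes agree
    end
    regress price mpg weight                                       // compare with Stata's built-in command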

Let’s take another famous solver, the A* algorithm, which finds the shortest path between two nodes on a network. Think, for example, about searching Google Maps for an optimal route between two locations, where optimal can mean the shortest distance, the shortest time, or the fewest transfers. A computer can use A* to efficiently search a large graph and provide us with different solutions. Only recently has A* become somewhat redundant for smaller problems, since computational power is often strong enough to just brute-force evaluate all the possible distance combinations.
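
The idea is easiest to see without the heuristic. Below is a toy sketch in Mata of Dijkstra’s algorithm, which is what A* reduces to when its heuristic is set to zero; the four-node distance matrix is invented for illustration.

    mata:
        D = (0,1,4,. \ 1,0,2,6 \ 4,2,0,3 \ .,6,3,0)    // pairwise edge lengths; . means no direct edge
        n = rows(D)
        dist = J(n, 1, .)                              // shortest known distance from node 1
        dist[1] = 0
        done = J(n, 1, 0)
        for (k = 1; k <= n; k++) {
            u = .
            for (i = 1; i <= n; i++) {                 // visit the closest node not yet processed
                if (!done[i] & (u == . | dist[i] < dist[u])) u = i
            }
            done[u] = 1
            for (v = 1; v <= n; v++) {                 // relax the edges leaving that node
                if (D[u,v] < . & dist[u] + D[u,v] < dist[v]) dist[v] = dist[u] + D[u,v]
            }
        }
        dist                                           // shortest distance from node 1 to every node
    end

A* adds a heuristic on top of this, for example the straight-line distance to the destination, to decide which node to visit next, which lets it ignore large parts of the graph.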

Another example is the family of algorithms that generate a convex hull for a set of points. These are “search” algorithms that, through different heuristics, try to find the outline of a shape that contains all the points. Other examples of search algorithms include gradient descent and hill-climbing methods. Since one can search for something in several different ways, much like solving the jigsaw puzzle, different search algorithms have their strengths and weaknesses for different problems.
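
To give a flavour of such a search, here is a tiny gradient-descent sketch in Mata that minimizes the made-up function f(x) = (x - 3)^2 by repeatedly stepping against the slope; the step size and starting point are arbitrary choices.

    mata:
        x    = 0                       // starting guess
        rate = 0.1                     // step size
        for (i = 1; i <= 100; i++) {
            grad = 2 * (x - 3)         // derivative of (x - 3)^2 at the current point
            x    = x - rate * grad     // move downhill, against the gradient
        }
        x                              // converges to 3, the minimizer
    end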

Let’s go back to why we are talking about solvers so much. As economists we don’t need to know about solvers, since we are only interested in the outcomes, but as coders we have to understand how they work. A lot of this had to be figured out in the early days of computing, when there were huge restrictions on memory and speed. Several core solvers, like matrix inversion and gradient searches, evolved to optimize speed and efficiency. Their core structure can be traced back to low-level languages like Assembly, Fortran, and C/C++. Low-level languages are characterized by being “close” to the computer in terms of talking in ones and zeros. Since new languages are built on these languages, they also inherently contain the knowledge of the early programs. In other words, computer languages are the cumulative sum of their past. A lot of the first-generation solvers and algorithms have been carried forward and are still utilized in the languages we use today. Take any software, find the documentation of one of its solvers, and it will eventually lead you back to some early-day code. The earlier generation of coders (especially in economics) knows this much better than our generation, and this understanding of the code-machine interface also makes it easier for them to understand programming.

New and more efficient solvers are still being developed. A great example comes from the Quake III Arena source code, which was made public in 2005. Quake III was a trend-setting game from 1999 whose engine rendered 3-D graphics with a small memory footprint. One of the things that went viral in the coding community was an algorithm inside its engine for approximating inverse square roots, a calculation that graphics code has to do constantly. This algorithm is referred to as the fast inverse square root and became famous as a classic of low-level optimization.
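
The clever part of that algorithm is a bit-level trick that produces a very good initial guess, which needs raw access to the float’s binary representation and so cannot be reproduced in Stata. The refinement step it then applies can be, though: a Newton iteration that converges to 1/sqrt(x). Here is a sketch with an arbitrary input and a deliberately crude starting guess.

    mata:
        x = 7
        y = 0.5                               // crude initial guess for 1/sqrt(7)
        for (i = 1; i <= 4; i++) {
            y = y * (1.5 - 0.5 * x * y * y)   // Newton step for f(y) = 1/y^2 - x
        }
        (y, 1/sqrt(x))                        // the iterate next to the exact value
    end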

Start coding (very) early

So how do we improve our coding literacy? Mastering computer languages is not impossible; it just takes time, because a lot has been developed in the last few decades. Learning to code in the 2000s was easier than it is now, simply because the amount of content available was relatively more manageable. New languages are constantly being created and are used to create even more new stuff. Currently, there are no limits to what one can achieve. Contemporary languages also have strong overlaps and can talk to each other a lot more than the earlier languages could. All in all, this can be a bit intimidating. Early on, languages had their “territories” with some overlaps; for example, all of them could run an OLS. During the time I was doing my PhD coursework (2007–2008), Stata was the main language for econometrics, SPSS for qualitative work, MATLAB for handling matrices, while R was an up-and-coming software. Python existed somewhere else and was not even considered a competitor for econometrics, as it is now. So depending on what one wanted to achieve, it was fairly easy to pick a language and specialize in it. It was also easy to follow the developments, simply because innovations were fewer. In short, starting to learn a language is definitely tougher now, not because it is more difficult, but because the choice set has expanded considerably and keeping up is challenging. As economists, we need to learn about methods, applications, code inputs, and code outputs.

Going back even further, the first piece of code I ever wrote was in Turbo Basic around 1990, which really helped me internalize thinking in code. Learning Basic on a 286 running DOS entailed writing some lines of syntax to produce a box on the screen. As a kid, it was a life-changing experience. It also introduced the concept of a “language sandbox”, where one can construct whatever one wants as long as one stays within the defined space of that language. But unlike sandbox games (like Minecraft), the boundaries of languages are not fixed. We can program in more stuff and expand the boundaries. There is no upper limit to what can be done.

There are also issues in how coding is taught: in most cases, as a means to an end, not an end in itself. Coding, at least in economics, is essentially taught to support research and to conduct some analysis that gets us some outputs. That is also why it is taught fairly late, usually only when it is absolutely needed in undergraduate programs. But it should really be taught in schools, even as early as primary school, just like art or music or singing or even science labs. It can also be made fun and interactive. Some kids will internalize it and some won’t, but this exposure to coding literacy is extremely important given how much our lives are determined by algorithms.

The role of the Stata Guide

The last point I would like to mention here is how all of this relates to the Stata Guide. This is the anniversary post of the guide; the first post went up a year ago. First of all, a big thanks to all of you who constantly engage with it! A comment I frequently get about the visualizations is: why don’t I package them up? This is a step I intentionally want to avoid. While nice visualizations are fun to make, and I really like figuring them out, there is a lot of learning in the process itself. Breaking the code down into smaller steps also helps with understanding how Stata works. Take, for example, the advanced mapping guide, where I asked the question: how can we draw as many spatial layers as we want in Stata, given that we can only draw a few layers using spmap, a package I really like? In order to do this, I had to untangle the code in spmap and recreate it. In the process, a lot of things cleared up. While I am unlikely to do advanced cartography in Stata on a regular basis, I now have a very good understanding of how Stata draws and handles shapes. This in itself has helped me write new guides. Plus, explaining code in writing is something I had to learn, and it’s not easy! The alternative would have been to just dump everything in some online repository and let the users figure it out. In short, the aim of the Stata Guide is to push early-level coders up to the middle level, where they can start reading and using more advanced code, and once they reach this point, they can choose to take the code to the next level.

About the author

I am an economist by profession and I have been using Stata since 2003. I am currently based in Vienna, Austria where I work at the Vienna University of Economics and Business (WU) and at the International Institute for Applied Systems Analysis (IIASA). You can see my profile, research, and projects on GitHub. You can connect with me via Medium, Twitter, LinkedIn or simply via email: [email protected].

The Stata Guide releases awesome new content regularly. Subscribe, Clap, and/or Follow the guide if you like the content!

