11

Python's Array: Working With Numeric Data Efficiently

 8 months ago
source link: https://realpython.com/python-array/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

Understanding Arrays in Programming

Some developers treat arrays and Python’s lists as synonymous. Others argue that Python doesn’t have traditional arrays, as seen in languages like C, C++, or Java. In this brief section, you’ll try to answer whether Python has arrays.

Arrays in Computer Science

To understand arrays better, it helps to zoom out a bit and look at them through the lens of theory. This will clarify some baseline terminology, including:

  • Abstract data types
  • Data structures
  • Data types

Computer science models collections of data as abstract data types (ADTs) that support certain operations like insertion or deletion of elements. These operations must satisfy additional constraints that describe the abstract data type’s unique behaviors.

The word abstract in this context means these data types leave the implementation details up to you, only defining the expected semantics or the set of available operations that an ADT must support. As a result, you can often represent one abstract data type using a few alternative data structures, which are concrete implementations of the same conceptual approach to organizing data.

Programming languages usually provide a few data structures in the form of built-in data types as a convenience so that you don’t have to implement them yourself. This means you can focus on solving more abstract problems instead of starting from scratch every time. For example, the Python dict data type is a hash table data structure that implements the dictionary abstract data type.

To reiterate the meaning of these terms, abstract data types define the desired semantics, data structures implement them, and data types represent data structures in programming languages as built-in syntactic constructs.

Some of the most common examples of abstract data types include these:

In some cases, you can build more specific kinds of abstract data types on top of existing ADTs by incorporating additional constraints. For instance, you can build a stack by modifying the queue or the other way around.

As you can see, the list of ADTs doesn’t include arrays. That’s because the array is a specific data structure representing the list abstract data type. The list ADT dictates what operations the array must support and which behaviors it should exhibit. If you’ve worked with the Python list, then you should already have a pretty good idea of what the list in computer science is all about.

Note: Don’t confuse the list abstract data type in computer science with the list data type in Python, which represents the former. Similarly, it’s easy to mistake the theoretical array data structure for a specific array data type, which many programming languages provide as a convenient primitive type built into their syntax.

The list abstract data type is a linear collection of values forming an ordered sequence of elements. These elements follow a specific arrangement, meaning that each element has a position relative to the others, identified by a numeric index that usually starts at zero. The list has a variable but finite length. It may or may not contain values of different types, as well as duplicates.

The interface of the list abstract data type resembles Python’s list, typically including the following operations:

List ADT Python’s list
Get an element by an index fruits[0]
Set an element at a given index fruits[0] = "banana"
Insert an element at a given index fruits.insert(0, "banana")
Delete an element by an index fruits.pop(0), del fruits[0]
Delete an element by a value fruits.remove("banana")
Delete all elements fruits.clear()
Find the index of a given element fruits.index("banana")
Append an element at the right end fruits.append("banana")
Merge with another list fruits.extend(veggies), fruits + veggies
Sort elements fruits.sort()
Get the number of elements len(fruits)
Iterate over the elements iter(fruits)
Check if an element is present "banana" in fruits

Now that you understand where the array data structure fits into the bigger picture, it’s time to take a closer look at it.

The Array as a Data Structure

To visualize the traditional array data structure graphically, imagine a series of identically sized rectangular boxes lined up side by side, each holding a distinct value of a common type. For example, here’s what an array of integers could look like:

An array of numbers

An array of numbers

This array has seven elements, indexed from zero to six, all of which are integer numbers. Note that some of these values occur more than once.

At first glance, this looks just like the standard list in Python. After all, both represent the list abstract data type covered earlier. What makes the array data structure special is a few unique features. In particular, an array must be:

  • Homogeneous: All elements in an array share a common data type, allowing them to have a uniform size. This is different from Python lists, which are allowed to contain elements of mixed types.
  • Contiguous: Array elements are adjacent to each other in the computer memory, occupying a contiguous block of memory space. No other data is allowed between the beginning and the end of an array. Such a layout corresponds to the way your computer memory is logically organized.
  • Fixed-Size: Because other data may follow immediately after the block of memory occupied by your array, expanding the block would overwrite that data, potentially causing severe errors or malfunctions. Therefore, when you allocate memory for an array, you must know its size up front without the possibility to change it later.

The first two traits make the array a random-access data structure, letting you instantly retrieve an element at any given index without having to traverse the whole array. When you know the memory address of the first element in the array, you can calculate the address of the desired element by applying a straightforward math formula:

A formula for an address of an array element

To find the memory address Ai of an element located at index i in the array, you can take the address of your array, which is the same as the address of the first element A0 at index zero, and then add an offset. Because all elements have equal sizes, you can obtain the memory address offset by multiplying the target index i by the single element’s size in bytes.

This formula works as long as all elements have an identical size and appear next to each other. The downside of such a memory layout is the array’s fixed size, which becomes apparent once you wish to append, insert, or remove elements. However, you can only simulate resizing by making another array, copying existing elements to it, and disregarding your old array.

As a result, you end up frequently asking the operating system for an ever larger block of memory, and you waste time moving data from one place to another. It’s what makes insertions and deletions so inefficient with pure arrays.

One way to mitigate this limitation is to build an abstraction on top of your array that eagerly allocates more memory than necessary and keeps track of the current number of elements against the array’s capacity. When the underlying array becomes full, you can reallocate and copy the data to a larger array. That’s roughly what the Python list does under the surface to cope with dynamic resizing for you.

What about elements of the same type but variable sizes, such as strings?

Strings can have several representations in the computer memory. For example, the length-prefixed representation allocates an extra byte to encode the number of characters that follow. Another popular representation is the null-terminated string, which is essentially an array of bytes terminated by a special sentinel value called the null character. With such a representation, you can create an array of pointers to strings:

An array of character strings

An array of character strings

The array depicted above consists of pointers, or numeric addresses, each indicating a specific memory area where the given string begins. Even though the individual elements can have variable sizes and be scattered all over your computer’s memory, their corresponding pointers are plain integers of uniform size that fit into an array.

As it turns out, the array data structure isn’t the only way to implement the list abstract data type. Next up, you’ll learn about a popular alternative—the linked list.

Arrays vs Linked Lists

At this point, you know that a list in computer science is an abstract data type and an array is one of its specific implementations. But why would you want to have more than one implementation, and how do you know which one to use?

The choice of a given data structure is vital for performance and memory efficiency. Most data structures offer a trade-off between the two, and they usually provide better time-space complexity for certain operations at the expense of others. Therefore, you should always pick a suitable data structure based on the requirements of the specific problem that you’re trying to solve.

For example, you can implement the list abstract data type using either an array or a linked list. Both are sequences of elements, but arrays will be more suitable in scenarios involving frequent lookups, while linked lists will work better when you anticipate a greater number of insertions or deletions.

Arrays always offer instant access to arbitrary elements but require extra time to add a new element. Conversely, linked lists allow for quick insertions but may take longer to find an element by index. That trade-off is the result of how both data structures organize data in memory.

If you’re interested in how the linked list achieves efficient insertions and deletions, then check out the relevant section of the corresponding article on Wikipedia to learn more. Now, the time has finally come to reveal whether Python has arrays or not.

Arrays in Python

Although a Python list represents the list abstract data type, its implementation isn’t the typical array or linked list. When you look at the CPython source code, you’ll find that the underlying PyListObject is a dynamic array of pointers to objects, allowing Python’s list to combine the benefits of both data structures. It’s cleverly optimized, ensuring quick lookups as well as insertions under most circumstances.

Python takes the idea of keeping an array of pointers to character strings, which was outlined before, a step further. It does so by retaining extra information about the data type of a value being pointed to from an internal array.

This makes Python’s list capable of storing heterogeneous elements of arbitrary types. On the other hand, this and the over-allocation make the Python list less memory-efficient than the classic array that you’d typically find in lower-level programming languages like C.

To sum up, Python internally relies on the array data structure but doesn’t directly expose it to you. The closest you can get to an array in Python is with the standard library’s array or NumPy’s ndarray data types. However, neither one of them provides a true array, as they can only hold numeric values.

On the Real Python website, you’ll find tons of resources on NumPy if you’d like to learn more about that library. And up next, you’ll read about the array module.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK