source link: https://www.codesd.com/item/memory-consumption-of-python-list-lists.html

Memory Consumption of Python Lists of Lists


Some code I was recently working on turned out to use around 200 MB of memory to run, and I'm stumped as to why it needs that much.

Basically it mapped a text file onto a list where each character in the file was its own two-item list: the character, and how often it had shown up so far (starting from zero).

So 'abbac...' would be [['a','0'],['b','0'],['b','1'],['a','1'],['c','0'],...]

For a text file 1 million characters long, it used 200 MB.

Is this reasonable or was it something else my code was doing? If it is reasonable, was it because of the high number of lists? Would [a,0,b,0,b,1,a,1,c,0...] take up substantially less space?
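For scale, here is a rough accounting sketch (my own illustration, not from the original post) of what the nested-list layout costs per character on CPython; exact numbers vary by Python version and platform:

```python
import sys

# Build 1000 two-item inner lists, mimicking the [char, count] structure.
pairs = [[c, '0'] for c in 'abbac' * 200]

shallow = sys.getsizeof(pairs)                # outer list: one pointer per item
inner = sum(sys.getsizeof(p) for p in pairs)  # each small inner list object
per_char = (shallow + inner) / len(pairs)

print("bytes per character (list objects only):", per_char)
```

Each inner list alone typically costs tens of bytes before the character and count objects it points to are counted, so something on the order of 200 MB for a million characters is not implausible for this layout.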


If you do not need the list itself, then I fully subscribe to @Lattyware's solution of using a generator.
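A minimal sketch of that generator approach (my own illustration, not Lattyware's exact code): it yields each (character, occurrences-so-far) pair lazily, so only the per-character counters are ever held in memory.

```python
from collections import defaultdict

def char_counts(s):
    """Yield (char, occurrences-so-far) pairs lazily, one at a time."""
    seen = defaultdict(int)
    for char in s:
        yield char, seen[char]
        seen[char] += 1

# Usage: nothing is stored beyond the counters in `seen`.
list(char_counts('abbac'))  # [('a', 0), ('b', 0), ('b', 1), ('a', 1), ('c', 0)]
```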

However, if that's not an option then perhaps you could compress the data in your list without loss of information by storing only the positions for each character in the file.

import random
import string

def track_char(s):
    # Make sure all characters have the same case
    s = s.lower()
    d = dict((k, []) for k in set(s))
    for position, char in enumerate(s):
         d[char].append(position)
    return d

st = ''.join(random.choice(string.ascii_uppercase) for _ in range(50000))
d = track_char(st)

len(d["a"])

# Total occurrences of the character found at position 2
for char, vals in d.items():
    if 2 in vals:
         print("Character %s has %s occurrences" % (char, len(d[char])))
Character c has 1878 occurrences

# Occurrences of the character at position 2, up to and including that position
for char, vals in d.items():
    if 2 in vals:
        print("Character %s has %s occurrences so far" % (char, len([x for x in d[char] if x <= 2])))
Character c has 1 occurrences so far
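Since the position lists produced by track_char are sorted, the running count at any position can also be recovered in O(log n) with bisect rather than a linear scan (a sketch building on the dict structure above, not part of the original answer):

```python
import bisect

def count_so_far(d, char, position):
    """Occurrences of `char` at or before `position`, given a dict
    mapping each character to its sorted list of positions."""
    return bisect.bisect_right(d.get(char, []), position)

# Position dict for the string 'abbac'
d = {'a': [0, 3], 'b': [1, 2], 'c': [4]}
count_so_far(d, 'b', 2)  # 2: 'b' has occurred twice by index 2
```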

This way there is no need to duplicate the character each time it occurs, and you retain the information about all of its occurrences.

To compare the object size of your original list and this approach, here's a test:

import random
import string
from sys import getsizeof

# random generation of a string with 50k characters
st = ''.join(random.choice(string.ascii_uppercase) for _ in range(50000))

# Function that returns the original list for this string
def original_track(s):
    l = []
    for position, char in enumerate(s):
        l.append([char, position])
    return l

# Testing sizes
original_list = original_track(st)
dict_format = track_char(st)

getsizeof(original_list)
406496
getsizeof(dict_format)
1632

As you can see, the dict format is roughly 250× smaller, and the difference should be even more pronounced for larger strings. Note, though, that sys.getsizeof is shallow: it measures only the outer container, not the nested lists, strings, or integers, so both numbers understate the true footprint.
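To account for the nested objects themselves, a recursive size helper can be used (a sketch, not part of the original answer). With it, the position dict still comes out smaller, though by a much smaller factor than the shallow numbers suggest, since the position integers dominate both layouts.

```python
import sys

def deep_sizeof(obj, seen=None):
    """Approximate total size of an object plus everything it references."""
    seen = set() if seen is None else seen
    if id(obj) in seen:          # don't double-count shared objects
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_sizeof(k, seen) + deep_sizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_sizeof(item, seen) for item in obj)
    return size

nested = [['a', 0], ['b', 0], ['b', 1]]
print(deep_sizeof(nested), ">", sys.getsizeof(nested))
```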


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK