When requesting a VM for data analysis, one of the first things you need to do is to determine how much memory is required to work with your data set. Below are some common questions and answers to help you address this.

 

Is there a simple formula for determining my memory requirements?

 

There is a simple formula to determine how much memory you need to load a data set:

  • [number of rows]  x [number of columns] x [number of bytes used for data type]  = [number of total bytes needed] 

 

For example, if you have a data set that consists of numbers stored as 4-byte values, and the size of the data set is 2,500,000 rows with 44 columns, the amount of memory space you need to load this data set would be:

 

  2,500,000 rows x 44 columns x 4 bytes  =  440,000,000 bytes (420 megabytes, where megabytes = bytes / 2^20)
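The calculation above can be sketched as a small helper (the function name is illustrative, not part of any particular library):

```python
def estimate_bytes(rows, cols, bytes_per_value):
    """Estimate the memory needed to load a data set whose columns all share one data type."""
    return rows * cols * bytes_per_value

total = estimate_bytes(2_500_000, 44, 4)
print(total)                 # 440000000 bytes
print(round(total / 2**20))  # 420 megabytes
```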

 

You will need to do a little more work when not all of your columns have the same data type. In those cases, calculate the memory requirement for each data type separately, then sum the results. For example, if you had a data set with 2,500,000 rows with 30 columns of 4-byte numbers and 14 columns of 8-byte numbers, then your memory requirements would be:

 

  2,500,000 x 30 x 4  =  300,000,000 bytes 
+ 2,500,000 x 14 x 8  =  280,000,000 bytes 
                              580,000,000 bytes (553 megabytes)
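The per-type sums above follow the same pattern, so one way to sketch them (again, the function name is only illustrative) is:

```python
def estimate_bytes_mixed(rows, col_specs):
    """col_specs is a list of (number_of_columns, bytes_per_value) pairs, one per data type."""
    return sum(rows * n_cols * width for n_cols, width in col_specs)

# 30 columns of 4-byte numbers plus 14 columns of 8-byte numbers:
total = estimate_bytes_mixed(2_500_000, [(30, 4), (14, 8)])
print(total)                 # 580000000 bytes
print(round(total / 2**20))  # 553 megabytes
```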
 

If you have a data type, such as a character string, that allows for variable lengths, you can use the average byte length of the values in that column as the data type size. For example, if your data set above had 2,500,000 rows with 42 columns of 8-byte numbers, one column of strings with an average length of 7 bytes, and another column of strings with an average length of 23 bytes, then your memory requirements would be:

 

  2,500,000 x 42 x 8  =  840,000,000 bytes 
+ 2,500,000 x 1 x 7   =   17,500,000 bytes 
+ 2,500,000 x 1 x 23  =   57,500,000 bytes 
                              915,000,000 bytes (873 megabytes)
 

We have no specific recommendation for the type of average (mean, median, etc.) to use for the "average" length for a column of character values. If you have the option of calculating different types of average sizes for a character column, it is a good practice to overestimate your memory requirements by using the average with the largest value.
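If your data are already accessible in Python, comparing averages for a string column is straightforward. The column values below are made up for illustration; the point is that the mean (7.0) exceeds the median (6.0) here, so the mean is the safer, overestimating choice:

```python
# Hypothetical string column; real values would come from your data set.
values = ["NY", "California", "TX", "North Carolina"]

byte_lengths = [len(v.encode("utf-8")) for v in values]   # [2, 10, 2, 14]

mean_len = sum(byte_lengths) / len(byte_lengths)          # 28 / 4 = 7.0

sorted_lens = sorted(byte_lengths)                        # [2, 2, 10, 14]
median_len = (sorted_lens[1] + sorted_lens[2]) / 2        # (2 + 10) / 2 = 6.0

print(mean_len, median_len)  # 7.0 6.0 -> use the larger (mean) to size memory
```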

 


Are there any tips for reducing memory requirements?

 

You won't always have as much control as you would like over how much memory your data sets and calculations require. If you do not face any memory constraints, then you do not need to worry about whether your memory footprint could have been smaller. If you do face memory constraints, below are some options to consider if they are available to you. Note that these are only general tips; they may not apply depending on what application or programming language you are using or what type of functions you are calling.

  • If you don't need all of the columns or rows, make a copy of the data set that includes only the columns and rows you need, and use that as your working copy.
  • If the data types in your data set are oversized for the values they hold, make a copy of the data set using smaller data types for the columns that allow it. For example, if you have a text field stored as Unicode but it only contains basic ASCII characters, you can reclaim memory space by saving it as a single-byte character type. Likewise, if you have an 8-byte bigint that only stores numbers from 0 to a few thousand, you can reclaim space by saving it as a 2-byte integer. Note that smaller data types may affect the precision with which you can perform calculations, depending on the calculation and the language or application used.
  • For the results of calculations, use the smallest data type that will contain the range of possible answers.
  • If you are performing a series of calculations that produce intermediate data sets, dispose of those data sets when you no longer need them. If you will reuse them, offload them to a file if you are able to do so and re-read them when they are needed again.
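As one illustration of the smaller-data-type tip, assuming your data are held in NumPy arrays (the values below are made up), converting a column whose values fit in 2 bytes cuts its footprint to a quarter of the 8-byte original:

```python
import numpy as np

# Hypothetical column: 2,500,000 values in the range 0..2999, stored as 8-byte integers.
big = np.arange(2_500_000, dtype=np.int64) % 3000

# int16 holds -32768..32767, so these values fit safely in 2 bytes.
small = big.astype(np.int16)

print(big.nbytes)    # 20000000 bytes
print(small.nbytes)  # 5000000 bytes
```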