When requesting a VM for data analysis, one of the first things you need to do is to determine how much memory is required to work with your data set. Below are some common questions and answers to help you address this.
There is a simple formula to determine how much memory you need to load a data set:

memory (bytes) = number of rows × number of columns × bytes per value
For example, if you have a data set that consists of numbers stored as 4-byte values, and the size of the data set is 2,500,000 rows with 44 columns, the amount of memory space you need to load this data set would be:
2,500,000 rows × 44 columns × 4 bytes = 440,000,000 bytes (420 megabytes, where megabytes = bytes / 2^20)
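The formula above is easy to script as a quick sanity check. Below is a minimal Python sketch; the function names are illustrative, not from any particular library:

```python
def estimate_bytes(rows: int, cols: int, bytes_per_value: int) -> int:
    """Estimate the memory needed to load a data set whose
    values all have the same fixed width."""
    return rows * cols * bytes_per_value


def to_megabytes(n_bytes: int) -> float:
    """Convert bytes to megabytes (1 megabyte = 2**20 bytes)."""
    return n_bytes / 2**20


# The example above: 2,500,000 rows x 44 columns of 4-byte numbers
n = estimate_bytes(2_500_000, 44, 4)
print(n)                       # 440000000
print(round(to_megabytes(n)))  # 420
```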
You will need to do a little more work when not all of your columns have the same data type. In those cases, calculate the memory requirement for each data type and then sum the results. For example, if you had a data set with 2,500,000 rows, 30 columns of 4-byte numbers, and 14 columns of 8-byte numbers, then your memory requirements would be:
2,500,000 × 30 × 4 = 300,000,000 bytes
2,500,000 × 14 × 8 = 280,000,000 bytes
Total: 580,000,000 bytes (553 megabytes)

If you have a data type, such as a character string, that allows variable lengths, use the average byte length of the values in that column as the data type size. For example, if your data set above had 2,500,000 rows with 42 columns of 8-byte numbers, one column of strings with an average length of 7 bytes, and another column of strings with an average length of 23 bytes, then your memory requirements would be:

2,500,000 × 42 × 8 = 840,000,000 bytes
2,500,000 × 1 × 7 = 17,500,000 bytes
2,500,000 × 1 × 23 = 57,500,000 bytes
Total: 915,000,000 bytes (873 megabytes)

We have no specific recommendation for which type of average (mean, median, etc.) to use for the length of a character column. If you have the option of calculating several kinds of average for a character column, it is good practice to overestimate your memory requirements by using the average with the largest value.
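The mixed-type calculation generalizes naturally: group the columns by width and sum the per-group requirements. A minimal Python sketch, assuming you describe each group as a (number of columns, bytes per value) pair and pass the average byte length for variable-length columns:

```python
def estimate_mixed_bytes(rows: int, column_groups: list[tuple[int, int]]) -> int:
    """Estimate memory for a data set with mixed column widths.

    column_groups is a list of (number_of_columns, bytes_per_value)
    pairs; for variable-length columns such as strings, pass the
    average byte length of the values as bytes_per_value.
    """
    return rows * sum(n_cols * width for n_cols, width in column_groups)


# 30 columns of 4-byte numbers plus 14 columns of 8-byte numbers
print(estimate_mixed_bytes(2_500_000, [(30, 4), (14, 8)]))          # 580000000

# 42 columns of 8-byte numbers plus two string columns with
# average lengths of 7 and 23 bytes
print(estimate_mixed_bytes(2_500_000, [(42, 8), (1, 7), (1, 23)]))  # 915000000
```

This reproduces both worked examples above.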
...
You won't always have as much control as you would like over how much memory your data sets and calculations require. If you do not face memory constraints, you do not need to worry about whether your memory footprint could have been smaller. If you do face memory constraints, below are some options to consider if they are available to you. Note that these are only general tips; they may not apply depending on which application or programming language you are using, or which functions you are calling.