- Created by James L Coombes, last modified on Apr 01, 2019
When requesting a VM for data analysis, one of the first things you need to do is to determine how much memory is required to work with your data set. Below are some common questions and answers to help you address this.
There is a simple formula to determine how much memory you need to load a data set:
- [number of rows] x [number of columns] x [number of bytes used for data type] = [number of total bytes needed]
For example, if you have a data set that consists of numbers stored as 4-byte values, and the size of the data set is 2,500,000 rows with 44 columns, the amount of memory space you need to load this data set would be:
2,500,000 rows x 44 columns x 4 bytes = 440,000,000 bytes (420 megabytes) ( megabytes = bytes / 2^20 )
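The arithmetic above can be sketched in a few lines of Python. The function names `estimate_bytes` and `to_megabytes` are illustrative only and are not part of any calculator referenced on this page.

```python
def estimate_bytes(rows, columns, bytes_per_value):
    """Return the bytes needed to hold rows x columns fixed-size values."""
    return rows * columns * bytes_per_value


def to_megabytes(num_bytes):
    """Convert bytes to megabytes, using 1 megabyte = 2**20 bytes."""
    return num_bytes / 2**20


total = estimate_bytes(2_500_000, 44, 4)
print(total, round(to_megabytes(total)))  # 440000000 420
```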
You will need to do a little more work when not all of your columns have the same data type. In those cases, calculate the memory requirement for each data type, then sum the results. For example, if you had a data set of 2,500,000 rows with 30 columns of 4-byte numbers and 14 columns of 8-byte numbers, then your memory requirements would be:
2,500,000 x 30 x 4 = 300,000,000 bytes
+ 2,500,000 x 14 x 8 = 280,000,000 bytes
580,000,000 bytes (553 megabytes)
If you have a data type, such as a character string, that allows for variable lengths, then you can use the average byte length of the values in that column as the data type size. For example, if your data set above had 2,500,000 rows with 42 columns of 8-byte numbers, one column of strings with an average length of 7 bytes, and another column of strings with an average length of 23 bytes, then your memory requirements would be:
2,500,000 x 42 x 8 = 840,000,000 bytes
+ 2,500,000 x 1 x 7 = 17,500,000 bytes
+ 2,500,000 x 1 x 23 = 57,500,000 bytes
915,000,000 bytes (873 megabytes)
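A minimal sketch of the same per-type sum, assuming each group of columns is described by a hypothetical `(column_count, bytes_per_value)` pair. For string columns the second number is the average byte length.

```python
def estimate_dataset_bytes(rows, column_groups):
    """Sum rows x columns x bytes over (column_count, bytes_per_value) groups."""
    return sum(rows * count * size for count, size in column_groups)


# 42 columns of 8-byte numbers, plus two string columns that
# average 7 bytes and 23 bytes per value.
groups = [(42, 8), (1, 7), (1, 23)]
total = estimate_dataset_bytes(2_500_000, groups)
print(total, round(total / 2**20))  # 915000000 873
```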
We have no specific recommendation for the type of average (mean, median, etc.) to use for the "average" length for a column of character values. If you have the option of calculating different types of average sizes for a character column, it is a good practice to overestimate your memory requirements by using the average with the largest value.
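One way to follow this advice is to compute more than one kind of average and keep the largest. The sketch below uses a made-up sample of values and assumes UTF-8 encoding; the function names are illustrative only.

```python
import statistics


def byte_length_averages(values, encoding="utf-8"):
    """Return the mean and median encoded byte length of the values."""
    lengths = [len(value.encode(encoding)) for value in values]
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
    }


def conservative_average(values, encoding="utf-8"):
    """Pick the largest of the averages so the estimate errs on the high side."""
    return max(byte_length_averages(values, encoding).values())


# Hypothetical sample of a character column.
sample = ["Austin", "Dallas", "San Antonio", "El Paso", "Houston"]
print(byte_length_averages(sample))  # {'mean': 7.4, 'median': 7}
print(conservative_average(sample))  # 7.4
```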
A data type stores its value using a particular number of bytes, and that number determines how many different values the data type allows. For numbers, for example, a single byte allows for a range of 256 numbers (either 0 to 255 or -128 to +127, depending on whether the type is signed). A single byte has the benefit of using a small amount of memory, but it can only distinguish 256 different things. If you wanted to assign the 50 US states their own number code, then you could use a single byte to do that. But if you wanted to assign each of the 3000+ US counties their own number code, then you would need a data type that uses more than one byte.
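The state and county examples can be checked with a short calculation. This is a sketch only: `bytes_for_codes` is a hypothetical helper, and 3,000 is used as a stand-in lower bound for the number of counties.

```python
def bytes_for_codes(distinct_values):
    """Smallest whole number of bytes that gives each value its own code."""
    if distinct_values < 1:
        raise ValueError("need at least one value to encode")
    # Codes run from 0 to distinct_values - 1, so size the largest code.
    bits = max(1, (distinct_values - 1).bit_length())
    return (bits + 7) // 8  # round up to whole bytes


print(bytes_for_codes(50))    # 1 (50 states fit in one byte)
print(bytes_for_codes(3000))  # 2 (3,000 counties need two bytes)
```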
Below is a table showing the range of possible numbers by byte size.
Smallest and largest number values allowed by byte size

Bytes | Bits | Unsigned Integer* Smallest | Unsigned Integer* Largest | Signed Integer** Smallest | Signed Integer** Largest | Floating Point Decimal*** Smallest | Floating Point Decimal*** Largest |
---|---|---|---|---|---|---|---|
1 | 8 | 0 | 255 | -128 | 127 | | |
2 | 16 | 0 | 65,535 | -32,768 | 32,767 | | |
3 | 24 | 0 | 16,777,215 | -8,388,608 | 8,388,607 | | |
4 | 32 | 0 | 4,294,967,295 | -2,147,483,648 | 2,147,483,647 | -3.40282e+38 | 3.40282e+38 |
8 | 64 | 0 | 18,446,744,073,709,551,615 | -9,223,372,036,854,775,808 | 9,223,372,036,854,775,807 | -1.79769e+308 | 1.79769e+308 |
* The range of values for unsigned integers is the full range that is physically allowed by the number of bits.
** The range of values for signed integers is the range that is physically allowed by the number of bits minus one. One bit is excluded, because that bit is used to indicate whether the integer is positive or negative.
*** The ranges of floating point decimal numbers are based on the single precision format (32-bit) and double precision format (64-bit) defined by the IEEE 754 standard. Other 32-bit and 64-bit floating point formats exist, but these are the ones most commonly used in computing.
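The integer limits in the table follow directly from the bit counts, and the floating point limits can be read back from the IEEE 754 formats. The sketch below uses illustrative function names and assumes that Python's own `float` is an IEEE 754 double, which is the case on mainstream platforms.

```python
import struct
import sys


def unsigned_range(bits):
    """Smallest and largest unsigned integer for a given bit count."""
    return 0, 2**bits - 1


def signed_range(bits):
    """Smallest and largest signed integer for a given bit count."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


for bits in (8, 16, 24, 32, 64):
    print(bits, unsigned_range(bits), signed_range(bits))

# Largest finite single-precision value: bit pattern 0x7F7FFFFF.
max_float32 = struct.unpack(">f", bytes.fromhex("7f7fffff"))[0]
# Largest finite double-precision value, as reported by Python itself.
max_float64 = sys.float_info.max
print(f"{max_float32:.5e}", f"{max_float64:.5e}")
```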
A character data type is one that does not store a number but a single character such as the letter "A", a punctuation mark such as "?", a symbol such as "%" or really anything that is not a number. A single character is stored as a 1-, 2-, 3- or 4-byte value depending on the character encoding scheme that is used. The character encoding scheme determines how many different text characters can be used. ASCII is a 7-bit encoding scheme, normally stored one character per byte, that defines 128 characters; 8-bit extensions of ASCII use the whole byte and allow for 256. ASCII includes all upper and lower case English letters, digits, punctuation marks and some symbols. Unicode allows for many more symbols and characters from other languages. When two bytes are used to store each character, up to 65,536 different characters are possible. When four bytes are used, 4,294,967,296 different values are possible, although the Unicode standard limits itself to 1,114,112 code points. So, the more bytes an encoding scheme uses the larger its "alphabet", but also the greater the memory each individual character requires.
For example, in ASCII the word "Cat" requires 3 bytes to store. In UCS-4, a form of Unicode that uses 4 bytes per character, the same word, "Cat", requires 12 bytes to store.
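You can measure the byte length of a word under different encodings directly. The sketch below uses Python's built-in codecs; UTF-32 stands in for UCS-4 since both use 4 bytes per character, and the `-le` variants are chosen so that no byte order mark is added to the count.

```python
word = "Cat"

# Encoded size of the same word under several encodings.
sizes = {
    name: len(word.encode(name))
    for name in ("ascii", "utf-8", "utf-16-le", "utf-32-le")
}
print(sizes)  # {'ascii': 3, 'utf-8': 3, 'utf-16-le': 6, 'utf-32-le': 12}
```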
There is no entirely consistent terminology for data types across programming languages and applications, though many similarities do exist. You should always consult the documentation for the version of a language or application you are using, but you can use the table below to see some of the more common data types.
Note that different languages/applications can use the same term for objects of different byte sizes. Also, newer versions of a given language/application may increase the size of a previously existing data type instead of defining a new data type. Some languages/applications may introduce their own additional restrictions on the range of numbers allowed for a given byte size. For example, the C standard only guarantees that the int data type has at least 16 bits; older 16-bit compilers stored it using 16 bits, while most current compilers store it using 32 bits. Also, R stores each logical value using 4 bytes even though it carries only a single bit of information. The 4 bytes allow for many other numbers, but R does not allow their use when the data type is defined as a logical.
Byte Size | Bit Size | Data Types By Application/Language | Lowest Number | Highest Number | Some Common Uses For Data Types of This Byte Size | ||||||||
Matlab | SAS | Stata** | R*** | C/C++**** | C# | Python | TSQL | MySQL | |||||
1* | 1 | _bool | bool | boolean | bit | bit | 0 | 1 | True/False, Yes/No, +/- | ||||
1 | 8 | uint8 | character | uint8_t, char | byte | byte | char, varchar | unsigned tinyint, char, varchar, binary | 0 | 255 | Integers; ASCII characters; Unicode characters: UTF-8 variant format; Basic colors and audio | ||
int8 | tinyint | int8_t, char | sbyte | tinyint | tinyint | -128 | 127 | ||||||
byte | -127 | 100 | |||||||||||
2 | 16 | uint16 | uint16_t | ushort, char | str | nchar, nvarchar | unsigned smallint | 0 | 65,535 | Integers; Unicode characters: UTF-8 variant , UTF-16 variant, UCS-2 formats; High colors, CD quality audio | |||
int16 | smallint | int, int16_t, short | short | smallint | smallint | -32,768 | 32,767 | ||||||
int | -32,767 | 32,740 | |||||||||||
3 | 24 | unsigned medium int | 0 | 16,777,215 | Integers; Unicode characters: UTF-8 variant format; Date only, time only; True colors, DVD/Blu-ray quality audio | ||||||||
mediumint, date, time | -8,388,608 | 8,388,607 | |||||||||||
4 | 32 | uint32 | uint32_t | uint | smalldatetime | unsigned int, timestamp | 0 | 4,294,967,295 | Integers; Floating point decimal numbers (1 bit for sign, 8 bits for exponent, 23 bits for fraction); Datetime in seconds; Unicode characters: UTF-8 variant, UTF-16 variant, UTF-32, UCS-4 formats; Deep colors (30-bit and up), high quality audio | ||||
int32 | integer | integer | int, int32_t, long | int | int | integer | int | -2,147,483,648 | 2,147,483,647 | ||||
long | -2,147,483,647 | 2,147,483,620 | |||||||||||
single | float | float | real | float | -3.40282 x 10^38 | 3.40282 x 10^38 | |||||||
float | -1.70141 x 10^38 | 1.70141 x 10^38 | |||||||||||
logical | 0 | 1 | |||||||||||
8 | 64 | uint64 | uint64_t | ulong | datetime | unsigned bigint, datetime | 0 | 18,446,744,073,709,551,615 | Integers; Larger and more precise floating point decimal numbers (1 bit for sign, 11 bits for exponent, 52 bits for fraction); Datetime in milliseconds | ||||
int64 | bigint | int64_t, long, long long | long | bigint | bigint | -9,223,372,036,854,775,808 | 9,223,372,036,854,775,807 | ||||||
double | double | numeric | double | float | float | double, real | -1.79769 x 10^308 | 1.79769 x 10^308 | |||||
double | −8.98846 x 10^307 | 8.98846 x 10^307 |
* Single bit data types still need to be stored inside an entire byte, but some applications and languages, such as TSQL, are able to combine several discrete bit values into a single byte in order to save memory space.
** Stata has its own range of allowed numbers for each data type. For example, even though the number 105 will fit into one byte, Stata will instead store it as a 2-byte int. This is because Stata reserves the highest values of each storage type to represent missing values.
*** In R, unless a number is explicitly defined as an integer (for example with as.integer() or the L suffix), R will treat it as an 8-byte "numeric" data type. R logical data types are stored as 4-byte values even though they carry only a single bit of information.
**** In C, the sizes of the data types char, int, short, long, and long long vary by compiler and platform. The C standard only sets minimum sizes.
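If Python is available on the machine you plan to use, a quick way to see the sizes its C compiler chose is the standard `ctypes` module. This is only a sketch: the results describe the platform the script runs on, so they may differ between your workstation and your VM.

```python
import ctypes

# Map C type names to the ctypes objects that mirror them.
c_types = {
    "char": ctypes.c_char,
    "short": ctypes.c_short,
    "int": ctypes.c_int,
    "long": ctypes.c_long,
    "long long": ctypes.c_longlong,
}

# Size in bytes of each type on the current platform.
sizes = {name: ctypes.sizeof(ctype) for name, ctype in c_types.items()}
for name, size in sizes.items():
    print(f"{name}: {size} bytes ({size * 8} bits)")
```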
You can find a basic memory requirements calculator at:
https://secure.mccombs.utexas.edu/public/datasetmemorysizer/default.aspx
Keep in mind that you do not just need to determine the byte size of columns in your initial data set. That just determines how much memory you need to load the initial data set. If you will use that data in calculations, you will need to calculate the data types and memory requirements of the resulting data set.
For example, consider a very simple data set with just one row and two columns (A and B), each with a 2-byte number. If you want to add a third column (C) that involves a calculation of the first two columns, the data type requirements of column C will depend on the type of calculation you are performing.
If the value in column C below is the result of a calculation involving the values in columns A and B, the exact value of C depends on what kind of calculation that is (addition, subtraction, multiplication, etc.). Also, the number of bytes required by C depends on how large that result might be.
A (2-bytes) | B (2-bytes) | C (?-bytes) |
---|---|---|
500 | 950 | ? |
- If C is defined as A + B then C = 1,450, therefore column C could also use a 2-byte data type to store this value, since that would be big enough to store this number.
- If C is defined as A x B then C = 475,000, therefore column C would need to use something larger than a 2-byte data type to store this value, since two bytes is too small to store this number.
- If C is defined as A / B then C = 0.5263..., therefore column C would need to use a 4-byte or 8-byte floating point data type depending on how precise you want your decimal number to be.
In the real world, your data sets and calculations won't be this simple, but you would still apply the same principle: when creating a new column based on values of existing columns, you need to determine what the largest possible result might be.
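A minimal sketch of that principle for integer results: work out the extreme values the calculation could produce, then pick the smallest signed integer size that holds them. The helper `smallest_signed_bytes` is hypothetical, and the example assumes both input columns are 2-byte signed integers that may take any value in their range.

```python
def smallest_signed_bytes(lowest, highest, sizes=(1, 2, 4, 8)):
    """Return the smallest byte size whose signed range covers lowest..highest."""
    for size in sizes:
        bits = size * 8
        if -(2 ** (bits - 1)) <= lowest and highest <= 2 ** (bits - 1) - 1:
            return size
    raise OverflowError("result does not fit in any of the given sizes")


# Full range of a 2-byte signed integer column.
lo, hi = -(2**15), 2**15 - 1

# Worst case for A + B: both columns at the same extreme.
sum_bytes = smallest_signed_bytes(lo + lo, hi + hi)

# Worst case for A x B: check every combination of extremes.
products = [lo * lo, lo * hi, hi * hi]
product_bytes = smallest_signed_bytes(min(products), max(products))

print(sum_bytes, product_bytes)  # 4 4
```

If you know your actual values are much smaller than the full range of the column type, you can pass those tighter bounds instead, for example `smallest_signed_bytes(0, 1450)`, which returns 2.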
When you have a choice in selecting data types, choose one with a byte size large enough for what you need, but no larger than that. If you don't have a choice, because your application enforces its own default rules for byte sizes (such as the case with R), then just be aware of what those byte sizes are and plan accordingly.
You won't always have the control you would like over how much memory your data sets and calculations require. If you do not face any memory constraints, then you do not need to worry about whether your memory footprint could have been smaller. If you do face memory constraints, below are some options to consider if they are available to you. Note that these are only general tips. They may not all apply, depending on what application or programming language you are using or what type of functions you are calling.
- If you don't need to use all of the columns or rows, make a copy of the data set that only includes the columns and rows you need. Use that as your working copy.
- If the values in your data set use oversized data types, make a copy of the data set using smaller byte size data types for the columns that will allow it. For example, if you have a text field stored as Unicode, but it only contains basic ASCII characters, then you can reclaim memory space by saving it as a single-byte data type. Also, if you have an 8-byte bigint that is used to store only numbers from 0 to a few thousand, then you can reclaim space by saving it as a 2-byte integer. Note that this may affect the precision with which you can perform calculations, depending on the calculation and the language or application used.
- For results of calculations, use the smallest data type that will contain the range of possible answers.
- If you are performing a series of calculations that result in intermediate data sets, dispose of those data sets when you no longer need them. If you will reuse them, then offload them to a file if you are able to do so and re-read them when they are needed again.
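As one illustration of the second and fourth tips, the sketch below uses Python's standard `array` module to store the same made-up values first as 8-byte integers and then as 2-byte integers, and then releases the larger copy. Other languages and libraries have their own equivalents, and the item sizes noted in the comments are the ones found on common platforms.

```python
from array import array

# Hypothetical column holding only numbers from 0 to a few thousand.
values = list(range(5000))

wide = array("q", values)    # 'q': signed integer, 8 bytes per item
narrow = array("H", values)  # 'H': unsigned integer, 2 bytes per item

wide_bytes = wide.itemsize * len(wide)
narrow_bytes = narrow.itemsize * len(narrow)
print(wide_bytes, narrow_bytes)

# Dispose of the oversized copy once the smaller one is in place.
del wide
```

Storing a value outside the smaller type's range, such as `array("H", [70000])`, raises an `OverflowError`, so an unsafe downsizing fails loudly rather than silently changing the data.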