|Friday, 24 May 2013|
Written by Andreas Roth
Choosing the right data type
Choosing the right data type is not as easy as you might think, especially when you are planning to write a program that should run without modification on several systems, platforms and compilers. There are several basic data types such as integers, characters, real numbers and perhaps even strings. In the following I will explain the issue using C++ as an example, but the key points also apply to other programming languages.
Data types in C/C++
In C/C++ the data types char, short, int and long represent integers, and each of them can be either signed or unsigned. This makes eight different integer types in total. For real numbers there are three types: float, double and long double. To make the situation more complicated, the same data type does not always have the same value range and representation on different platforms and compilers. The question is which type to use for the task at hand.
The value range of a data type depends on its width and on whether it is signed or unsigned. An 8-bit signed char can hold values from -128 to +127. For some problems this is enough, but you get into trouble if you want to store the value 200 in it. Check the value range of the numbers in the problem at hand and choose the data type that fits best. Also ask yourself: is it possible that even larger or smaller numbers can arise? Once you have determined the required value range, the next question is whether to use the signed or the unsigned version of the data type.
Signed vs. Unsigned
Very often the compiler raises a warning about a comparison between a signed and an unsigned value, or about an assignment from signed to unsigned. You may ignore such a warning if you are certain that it is not a problem, but in some cases it points to a serious bug.
To demonstrate the problem, consider the following example. The task is to generate several sine values and store them in an array. So you write a simple loop which puts the calculated sine values into an array:
Try to compile this example (e.g. gcc -o sine sine.c -lm -pedantic) and see for yourself that the result is not what you might expect. The program outputs the following lines:
sine[ 0]=0
sine[ 1]=4010
sine[ 2]=7958
sine[ 3]=11780
sine[ 4]=15416
sine[ 5]=18809
sine[ 6]=21905
sine[ 7]=24656
sine[ 8]=27018
sine[ 9]=28954
sine[ 10]=30433
sine[ 11]=31433
sine[ 12]=31936
sine[ 13]=31936
sine[ 14]=31433
sine[ 15]=30433
sine[ 16]=28954
sine[ 17]=27018
sine[ 18]=24656
sine[ 19]=21905
sine[ 20]=18808
sine[ 21]=15415
sine[ 22]=11779
sine[ 23]=7957
sine[ 24]=4010
sine[ 25]=0
sine[ 26]=61526
sine[ 27]=57578
sine[ 28]=53756
sine[ 29]=50120
sine[ 30]=46727
sine[ 31]=43631
sine[ 32]=40880
sine[ 33]=38518
sine[ 34]=36582
...
As you can see, the result is as expected up to n=25, but what happens when n gets larger? This is the result of mixing signed and unsigned types. The array of sine values is declared as unsigned, but the result of sin() is signed, so negative values show up as large positive numbers. Note that in this example the compiler does not even complain about the problem. The fix is very easy (of course): change the type of the sine array from unsigned to signed and the results are as expected:
... sine[ 22]=11779 sine[ 23]=7957 sine[ 24]=4010 sine[ 25]=0 sine[ 26]=-4010 sine[ 27]=-7958 ...
One important aspect of choosing the right data type is keeping your code portable. If you use data types that are only available on one compiler or system, you have to put a lot of effort into porting the code to another compiler or system. Portability is much easier to achieve if you use the data types defined by the language standard. These data types must be present on any compiler that claims to conform to the standard. For example, the C99 standard specifies a header file called stdint.h which defines int32_t for a signed 32-bit integer and uint16_t for an unsigned 16-bit integer.
Sometimes you do not care much whether 32 or 64 bits are used to represent a simple number. The best example is a counter variable in a loop. In such cases you can simply use an integer type without specifying the exact width. For example, to count from 0 to 30 you can simply use an int or unsigned. But keep the value range of the data type and the signed/unsigned issue in mind.
Special data types for special situations
Most libraries introduce their own data types for special use cases. These data types are based on the native data types, but they have one advantage: they improve the readability of your program.
A variable n of type unsigned does not tell the reader that it is going to store a process identifier. If you declare it as pid_t n instead, it is much clearer what the variable is meant for. At this point you may say, "But I can also accomplish this by choosing the right name for the variable!" You are right, but choosing the right data type increases readability further. And there is another reason to use these special data types instead of the native ones. Let's assume that on your OS the maximum number of processes is 2^16, so you choose an unsigned short to represent the process identifier. After several years and several thousand lines of code you increase the maximum number to 2^32. Now you need an unsigned int for the PIDs, and you have to change every occurrence of unsigned short to unsigned int wherever it holds a process identifier. If you had used the type pid_t instead, the change would be quite simple: only change the definition of pid_t and that's all.
There are several well-known types that should be used in certain situations. A very good example is the type size_t in C and C++. It is meant to measure the size or length of an object or buffer. Many functions of the C/C++ standard library use size_t when the size of a buffer needs to be specified. For example, strlen returns the length of the given string in characters as a size_t. Still, many people use an unsigned or int to represent the size of a buffer. Using size_t in such cases makes the function and its parameters easier to understand. So choose size_t whenever you need the length or size of an object, buffer or string.
Choosing a data type may also influence the performance of your program. 64-bit arithmetic on a 32-bit machine must be implemented by the compiler (or its runtime libraries), since most 32-bit machines do not have 64-bit arithmetic in their regular instruction set. For example, an addition of two 64-bit values must be carried out using several instructions, whereas an addition of two 32-bit values can be done with a single instruction. For some applications the difference matters, especially if such an operation is performed very frequently.
Normally the endianness of values does not matter as long as your program does not interact with programs on other platforms. But if you intend to exchange information over a network, you have to make sure that every participant uses the same representation of the data. On little-endian machines (like those from AMD or Intel) integers are stored with the least significant byte at the lowest memory address. Big-endian machines (like those from Motorola and some PowerPCs), on the other hand, put the most significant byte at the lowest memory address.
Most data transferred over a network is sent in big-endian byte order, which is therefore also called network byte order. This ensures that machines can communicate with each other even if they use very different hardware.
As you have seen, choosing the right data type is sometimes not easy at all. You have to consider which value range is required and whether you need negative values or not. We have seen that you can reduce the effort of porting your code from one system to another if you use portable data types. Two minor points mentioned in this article are performance considerations and endianness. After reading this article you should be able to choose the right data type for your problem.
|Copyright © by AR Soft 2005-2013|