When doing some cracking challenges, I stumbled upon a call to a function named __ctype_b_loc(void). It turns out there is no really good resource on the
internet explaining what this function is and what it does. So here are my notes
on what it is and how it is used.
__ctype_b_loc is an internal glibc function used by the functions found in <ctype.h>. The <ctype.h> functions adjust their behavior when the locale is changed.
If we take a look at the file /usr/include/ctype.h we can see that __ctype_b_loc is only declared there (it is defined elsewhere, in the C library) and that it carries the const attribute:
extern const unsigned short int **__ctype_b_loc (void)
__THROW __attribute__ ((__const__));
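The function returns the address of a pointer into an array of 384 entries. 384 is chosen so that the array can be indexed by any unsigned char value [0,255], by EOF (-1), or by any signed char value [-128,-1).
For each character the array contains an unsigned short int describing the
properties of the character (uppercase, alphabetic, numeric, whitespace, ...).
Each property is one bit of this unsigned short int (2 bytes). For example, on a little-endian machine such as x86-64, the value 0x0100 means that the only property of the
character is that it is uppercase (on a big-endian machine the same property would read 0x0001).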
The enum declared in <ctype.h> shows the correspondence between bits and
properties:
enum
{
_ISupper = _ISbit (0), /* UPPERCASE. */
_ISlower = _ISbit (1), /* lowercase. */
_ISalpha = _ISbit (2), /* Alphabetic. */
_ISdigit = _ISbit (3), /* Numeric. */
_ISxdigit = _ISbit (4), /* Hexadecimal numeric. */
_ISspace = _ISbit (5), /* Whitespace. */
_ISprint = _ISbit (6), /* Printing. */
_ISgraph = _ISbit (7), /* Graphical. */
_ISblank = _ISbit (8), /* Blank (usually SPC and TAB). */
_IScntrl = _ISbit (9), /* Control character. */
_ISpunct = _ISbit (10), /* Punctuation. */
_ISalnum = _ISbit (11) /* Alphanumeric. */
};
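_ISbit(bit) returns an int with a single bit set. On a big-endian machine this is simply the bit at location "bit": for instance _ISbit(3) = 0b1000 and _ISbit(0) = 0b0001.
On a little-endian machine such as x86-64, glibc defines _ISbit so that the two bytes of the 16-bit mask are swapped: _ISbit(0) = 0x0100 and _ISbit(8) = 0x0001.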
The functions isalnum(int), isupper(int), isblank(int) (and so on) actually use this enum and the table returned by __ctype_b_loc to test whether the
character is alphanumeric, uppercase, blank, etc.
These functions all use __isctype(c, type) under the hood.
__isctype(c, type) returns an int indicating whether the character passed as
the parameter c has the property type: nonzero if it does, 0 otherwise.
To do so it indexes the table returned by __ctype_b_loc:
# define __isctype(c, type) \
((*__ctype_b_loc ())[(int) (c)] & (unsigned short int) type)
To get a better understanding of how __ctype_b_loc works, let's reverse
engineer our own code.
The following program calls every function that uses __isctype(c, type).
#include <ctype.h>
#include <stdio.h>


int main(int argc, char const *argv[])
{
    int i;
    char A = 'A';
    printf("isupper");
    i = isupper(A);
    printf("islower");
    i = islower(A);
    printf("isalpha");
    i = isalpha(A);
    printf("isdigit");
    i = isdigit(A);
    printf("isxdigit");
    i = isxdigit(A);
    printf("isspace");
    i = isspace(A);
    printf("isprint");
    i = isprint(A);
    printf("isgraph");
    i = isgraph(A);
    printf("isblank");
    i = isblank(A);
    printf("iscntrl");
    i = iscntrl(A);
    printf("ispunct");
    i = ispunct(A);
    printf("isalnum");
    i = isalnum(A);
    return 0;
}
We compile the code with gcc test_ctype.c -o test_ctype -g and open the
compiled executable in radare2 with r2 -d -AA test_ctype. The main function starts with the usual prologue:
0x00400596 push rbp ; test_ctype.c:6 {
0x00400597 mov rbp, rsp
0x0040059a sub rsp, 0x20
Then edi and rsi (which hold argc and argv) are saved on the stack, since the functions called later are free to overwrite these registers:
0x0040059e mov dword [local_14h], edi
0x004005a1 mov qword [local_20h], rsi
Then we find the initialization of the variable A, whose content is the character 'A':
0x004005a5 mov byte [local_5h], 0x41 ; test_ctype.c:8 char A = 'A'; ; 'A' ; 65
After that we can see the first call to printf.
Remember that the first 6 integer parameters of a function under Linux and OS X
are passed in the registers rdi, rsi, rdx, rcx, r8 and r9.
The printf call needs to know which string to display (its address is stored in edi).
Because printf is variadic, it also needs to know how many vector registers carry floating point parameters (0 in our
case): that information is stored in eax (only al is significant).
0x004005a9 mov edi, str.isupper ; test_ctype.c:9 printf("isupper"); ; 0x400874 ; "isupper"
0x004005ae mov eax, 0
0x004005b3 call sym.imp.printf ;[1] ; int printf(const char *format)
The function __ctype_b_loc is then called. It takes no argument and returns the const unsigned short int ** in rax.
0x004005b8 call sym.imp.__ctype_b_loc ; test_ctype.c:10 i = isupper(A); ;[2]
const unsigned short int ** is a pointer to a pointer. Dereferencing it once gives the pointer to the array, which is
stored back in rax (rax = *rax):
0x004005bd mov rax, qword [rax]
Now that rax contains a pointer to the array storing the properties of each
character, the isupper function needs to find the information related to our
character 'A'. So our character is loaded in rdx, sign-extended to 64 bits (this is why the table also has to accept negative indices).
0x004005c0 movsx rdx, byte [local_5h]
Now we need to find the entry storing the information about our character 'A'.
Since our table stores properties as unsigned short int, each entry takes 2 bytes (enough to hold the 12 property bits).
So at rax we will have the properties for the character 0x00, at rax+2 we
will have the properties for the character 0x01, etc.
To find the entry corresponding to the character 'A' (0x41 in ASCII), we
need to look at rax+0x41*2. This is what the assembly code does here:
0x004005c5 add rdx, rdx ; rdx = rdx * 2 = 0x41 * 2
0x004005c8 add rax, rdx ; rax = rax + 0x41 * 2
The unsigned short int storing the properties for the character 'A'
is then loaded into eax:
0x004005cb movzx eax, word [rax] ; rax points to the unsigned short int property. Load it in eax, zero-extended
0x004005ce movzx eax, ax ; zero-extend the low 2 bytes again (redundant, emitted by the unoptimized build)
Now that we have the unsigned short int containing the properties of 'A',
we want to know if the character 'A' is uppercase.
According to the enum declared in <ctype.h>, if 'A' is uppercase, then the
_ISupper bit, _ISbit(0), of the unsigned short int should be set to 1.
Let's take a look at this integer:
:> dr ax
0x0000d508
We can see that it is equal to 0xd508. This is the value itself, not a raw memory dump, so nothing has to be swapped to use it. However, on little-endian machines such as Intel processors, glibc's _ISbit swaps the two bytes of every mask, so the bit numbers of the enum only line up with the usual bit positions once the two bytes are swapped to 0x08d5:
:> ? 0x08d5
2261 0x8d5 04325 2.2K 0000:08d5 2261 "\xd5\b" 0b0000100011010101 2261.0 2261.000000f 2261.000000 0t10002202
With the binary display shown above, the bit on the extreme right is bit 0 and the bit on the extreme left is bit 15. Compare this with the value as it actually sits in the register:
:> ? 0xd508
54536 0xd508 0152410 53.3K 0000:0508 54536 "\b\xd5" 0b1101010100001000 54536.0 54536.000000f 54536.000000 0t2202210212
Since the assembly code works on the register value, we will continue this analysis with 0xd508. Here is each bit of 0xd508, from left to right, with the enum bit number it corresponds to:
Enum bit number | Value | Description |
---|---|---|
7 | 1 | graphical |
6 | 1 | printing |
5 | 0 | whitespace |
4 | 1 | hexadecimal numeric |
3 | 0 | numeric |
2 | 1 | alphabetic |
1 | 0 | lowercase |
0 | 1 | uppercase |
15 | 0 | (unused) |
14 | 0 | (unused) |
13 | 0 | (unused) |
12 | 0 | (unused) |
11 | 1 | alphanumeric |
10 | 0 | punctuation |
9 | 0 | control |
8 | 0 | blank |
So here we can see that the character 'A' is graphical, printing, a hexadecimal digit, alphabetic, uppercase and alphanumeric.
If we only want to know whether 'A' is uppercase, we need to apply the bit mask 0x100 (the value of _ISupper on a little-endian machine)
to the property integer. If the result is nonzero, then the property is true:
:> ? 0xd508 & 0x100
256 0x100 0400 256 0000:0100 256 "\x01" 0b0000000100000000 256.0 256.000000f 256.000000 0t100111
This is what is done in the rest of the assembly code:
0x004005d1 and eax, 0x100 ; Test for uppercase
0x004005d6 mov dword [local_4h], eax ; store result in variable i
Now we know how __ctype_b_loc works!
Here is the full assembly code for the isupper call:
0x00400596 push rbp ; test_ctype.c:6 {
0x00400597 mov rbp, rsp
0x0040059a sub rsp, 0x20
0x0040059e mov dword [local_14h], edi
0x004005a1 mov qword [local_20h], rsi
0x004005a5 mov byte [local_5h], 0x41 ; test_ctype.c:8 char A = 'A'; ; 'A' ; 65
0x004005a9 mov edi, str.isupper ; test_ctype.c:9 printf("isupper"); ; 0x400874 ; "isupper"
0x004005ae mov eax, 0
0x004005b3 call sym.imp.printf ;[1] ; int printf(const char *format)
0x004005b8 call sym.imp.__ctype_b_loc ; test_ctype.c:10 i = isupper(A); ;[2]
0x004005bd mov rax, qword [rax]
0x004005c0 movsx rdx, byte [local_5h]
0x004005c5 add rdx, rdx ; '('
0x004005c8 add rax, rdx ; '('
0x004005cb movzx eax, word [rax]
0x004005ce movzx eax, ax
0x004005d1 and eax, 0x100
0x004005d6 mov dword [local_4h], eax