What is __ctype_b_loc

2018 May 1

While working on some cracking challenges, I stumbled upon a call to a function named __ctype_b_loc(void). I could not find a good resource online explaining what this function is and what it does, so here are my notes on what it is and how it is used.

About __ctype_b_loc

__ctype_b_loc is an internal function used by functions found in <ctype.h> :

  • int isalnum(int c)
  • int isalpha(int c)
  • ...

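The <ctype.h> functions adjust their behavior when the locale changes. If we take a look at the file /usr/include/ctype.h we can see that __ctype_b_loc is a function declared extern (its definition lives inside the C library) and marked with the const attribute, which tells the compiler that the call has no side effects, so repeated calls may be merged :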

extern const unsigned short int **__ctype_b_loc (void)
     __THROW __attribute__ ((__const__));

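The function returns the address of a pointer into an array of 384 entries. That pointer does not point at the start of the array but 128 entries into it, so that it can be indexed by :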

  • any unsigned char value [0,255]
  • EOF (-1)
  • any signed char value [-128,-1)

For each character the array contains an unsigned short int (2 bytes) describing the properties of the character (uppercase, alphabetic, numeric, whitespace, ...). Each property is a single bit of that value. For example, on x86 the uppercase bit is 0x0100, so an entry equal to 0x0100 would mean that the only property of the character is that it is uppercase.
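To make the index range concrete, here is a minimal sketch that walks the table over a range of indices and counts the entries that have a given property. It assumes glibc (where <ctype.h> exposes __ctype_b_loc and the _IS* masks) and the default "C" locale. count_with_property is a hypothetical helper name of mine, not something the library provides.

```c
#include <ctype.h>

/* Hypothetical helper: count the indices in [lo, hi] whose table entry
   has at least one of the bits in `mask` set.
   Assumes glibc; valid indices are -128..255. */
int count_with_property(unsigned short mask, int lo, int hi)
{
    /* Dereference once to get the pointer into the 384-entry array. */
    const unsigned short *table = *__ctype_b_loc();
    int count = 0;

    for (int c = lo; c <= hi; c++) {
        if (table[c] & mask)
            count++;
    }
    return count;
}
```

Calling count_with_property(_ISdigit, -128, 255) visits all 384 valid indices, negative ones included, which only works because the returned pointer sits in the middle of the array.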

The enum described in <ctype.h> shows the correspondence between bits and properties :

enum
{
  _ISupper = _ISbit (0),        /* UPPERCASE.  */
  _ISlower = _ISbit (1),        /* lowercase.  */
  _ISalpha = _ISbit (2),        /* Alphabetic.  */
  _ISdigit = _ISbit (3),        /* Numeric.  */
  _ISxdigit = _ISbit (4),       /* Hexadecimal numeric.  */
  _ISspace = _ISbit (5),        /* Whitespace.  */
  _ISprint = _ISbit (6),        /* Printing.  */
  _ISgraph = _ISbit (7),        /* Graphical.  */
  _ISblank = _ISbit (8),        /* Blank (usually SPC and TAB).  */
  _IScntrl = _ISbit (9),        /* Control character.  */
  _ISpunct = _ISbit (10),       /* Punctuation.  */
  _ISalnum = _ISbit (11)        /* Alphanumeric.  */
};

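_ISbit(bit) returns an int with a single bit set. On a big-endian machine this is simply 1 << bit, so _ISbit(3) = 0b1000 and _ISbit(0) = 0b0001. On a little-endian machine such as x86, the macro also swaps the two bytes of the 16-bit value, so _ISbit(0) = 0x0100 and _ISbit(8) = 0x0001. This keeps the byte layout of the table in memory identical on both kinds of machine.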
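As far as I can tell from glibc's <ctype.h>, _ISbit has two definitions selected by the host byte order. The sketch below writes both as plain functions so they can be compared side by side. The function names are mine, and the bodies are my transcription of the macro, not the header itself.

```c
/* Big-endian hosts: enum bit n is simply value bit n. */
unsigned isbit_big_endian(unsigned bit)
{
    return 1u << bit;
}

/* Little-endian hosts: same bit, but with the two bytes of the
   16-bit mask swapped (bits 0..7 move up, bits 8..15 move down). */
unsigned isbit_little_endian(unsigned bit)
{
    return bit < 8 ? (1u << bit) << 8 : (1u << bit) >> 8;
}
```

With these, isbit_little_endian(0) gives 0x0100, which is the mask that shows up later in the disassembly of isupper.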

The functions isalnum(int), isupper(int), isblank(int), and so on use this enum and the table returned by __ctype_b_loc to test whether a character is alphanumeric, uppercase, blank, etc.

These functions all use __isctype(c, type) under the hood. __isctype(c, type) returns a nonzero int if the character passed as the parameter c has the property type, and 0 otherwise. To do so it uses the table returned by __ctype_b_loc :

# define __isctype(c, type) \
  ((*__ctype_b_loc ())[(int) (c)] & (unsigned short int) type)
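The same lookup can be written as an ordinary function, which makes it easy to compare against the real <ctype.h> functions. This is only a sketch assuming glibc. my_isctype is a hypothetical name, and like the macro it performs no range check on c.

```c
#include <ctype.h>

/* Same lookup as the __isctype macro, written as a function.
   `c` must be EOF or fit in -128..255, exactly as for the macro. */
int my_isctype(int c, unsigned short mask)
{
    const unsigned short *table = *__ctype_b_loc();

    /* Nonzero if any requested property bit is set for `c`. */
    return table[c] & mask;
}
```

For instance my_isctype('A', _ISupper) is nonzero whenever isupper('A') is, but note that neither is guaranteed to be exactly 1.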

Understanding __ctype_b_loc with reverse engineering

To get a better understanding of how __ctype_b_loc works, let's reverse engineer our own code. The following program calls every <ctype.h> function that is built on __isctype(c, type).

#include <ctype.h>
#include <stdio.h>

int main(int argc, char const *argv[])
{
    int i;
    char A = 'A';
    printf("isupper");
    i = isupper(A);
    printf("islower");
    i = islower(A);
    printf("isalpha");
    i = isalpha(A);
    printf("isdigit");
    i = isdigit(A);
    printf("isxdigit");
    i = isxdigit(A);
    printf("isspace");
    i = isspace(A);
    printf("isprint");
    i = isprint(A);
    printf("isgraph");
    i = isgraph(A);
    printf("isblank");
    i = isblank(A);
    printf("iscntrl");
    i = iscntrl(A);
    printf("ispunct");
    i = ispunct(A);
    printf("isalnum");
    i = isalnum(A);
    return 0;
}

We compile the code with gcc test_ctype.c -o test_ctype -g and open the compiled executable in radare2 with r2 -d -AA test_ctype. The main function starts with the usual prologue :

0x00400596      push rbp                    ; test_ctype.c:6 {
0x00400597      mov rbp, rsp
0x0040059a      sub rsp, 0x20

Then edi and rsi, which hold argc and argv, are saved on the stack (an unoptimized build spills its arguments to memory) :

0x0040059e      mov dword [local_14h], edi
0x004005a1      mov qword [local_20h], rsi

Then we find the declaration of the variable A, whose content is the character 'A':

0x004005a5      mov byte [local_5h], 0x41   ; test_ctype.c:8     char A = 'A';    ; 'A' ; 65

After that we can see the first call to printf. Remember that under the System V AMD64 calling convention (used by Linux and OS X), the first 6 integer or pointer parameters of a function are passed in the registers rdi, rsi, rdx, rcx, r8 and r9. The printf call needs the address of the format string (stored in edi). Because printf is variadic, it also needs to know how many vector registers are used for floating point arguments (0 in our case): that number is passed in al, hence the mov eax, 0.

0x004005a9      mov edi, str.isupper        ; test_ctype.c:9     printf("isupper");    ; 0x400874 ; "isupper"
0x004005ae      mov eax, 0
0x004005b3      call sym.imp.printf         ;[1] ; int printf(const char *format)

The function __ctype_b_loc is then called. It takes no argument and returns the unsigned short int** in rax.

0x004005b8      call sym.imp.__ctype_b_loc  ; test_ctype.c:10     i = isupper(A); ;[2]

The returned unsigned short int** is a pointer to a pointer. Dereferencing it once loads the pointer to the table into rax (rax = *rax) :

0x004005bd      mov rax, qword [rax]

Now that rax contains a pointer to the array storing the properties of each character, the isupper code needs to find the entry for our character 'A'. So our character is loaded into rdx, sign-extended from 1 byte to 8 bytes by movsx.

0x004005c0      movsx rdx, byte [local_5h]
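The sign extension matters because plain char is signed on x86-64 Linux, so a byte such as 0xE9 becomes a negative index. That is why the table has to accept indices down to -128. A small sketch of the two possible widenings (the function names are mine):

```c
/* What movsx does: widen a signed 8-bit value, keeping its sign. */
long index_sign_extended(signed char c)
{
    return (long)c;
}

/* What movzx would do instead: widen as unsigned, giving 0..255. */
long index_zero_extended(signed char c)
{
    return (long)(unsigned char)c;
}
```

For an ASCII character like 'A' both give 0x41. They only differ for bytes with the high bit set.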

Now we need to find the entry storing the information about our character 'A'. Since the table stores properties as unsigned short int, each entry takes 2 bytes (enough to hold the 12 property bits). So at rax we have the entry for the character 0x00, at rax+2 the entry for the character 0x01, etc...

To find the entry corresponding to the character 'A' (0x41 in hex ASCII), we need to look at rax+0x41*2. This is what the assembly code does here :

0x004005c5      add rdx, rdx           ; rdx = rdx * 2 = 0x41 * 2
0x004005c8      add rax, rdx           ; rax = rax + 0x41 * 2
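The same address computation can be spelled out in C by treating the table as raw bytes. This is a sketch assuming glibc and a 2-byte unsigned short. entry_at_byte_offset is a hypothetical helper that mirrors the two add instructions rather than using normal array indexing.

```c
#include <ctype.h>
#include <string.h>

/* Read the entry for `c` the way the assembly does: start from the
   table pointer, add c to itself (a byte offset of 2 * c), then load
   2 bytes from that address. `c` must be in -128..255. */
unsigned short entry_at_byte_offset(int c)
{
    const unsigned char *base = (const unsigned char *)*__ctype_b_loc();
    long offset = c;
    unsigned short entry;

    offset += offset;                              /* add rdx, rdx */
    memcpy(&entry, base + offset, sizeof entry);   /* add rax, rdx, then load */
    return entry;
}
```

The result is the same value as (*__ctype_b_loc())[c]. The compiler just performs the multiplication by sizeof(unsigned short) for us when we index normally.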

The unsigned short int storing the properties for the character 'A' is then loaded into eax:

0x004005cb      movzx eax, word [rax] ; load the 2-byte entry that rax points to, zero-extended into eax
0x004005ce      movzx eax, ax         ; zero-extend ax again (redundant, left in by the unoptimized build)

Reading an entry of __ctype_b_loc

Now that we have the unsigned short int containing the properties of 'A', we want to know if the character 'A' is uppercase. According to the enum declared in <ctype.h>, if 'A' is uppercase, then the _ISupper bit, _ISbit(0), of the unsigned short int should be set to 1.

Let's take a look at this integer:

:> dr ax
0x0000d508

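We can see that it is equal to 0xd508. On a little-endian machine such as x86, the _ISbit macro swaps the two bytes of every mask, so enum bit n does not sit at position n of this value. Swapping the two bytes back gives 0x08d5, where the enum bit numbers line up with the binary positions (which is more natural to read as a human):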

:> ? 0x08d5
2261 0x8d5 04325 2.2K 0000:08d5 2261 "\xd5\b" 0b0000100011010101 2261.0 2261.000000f 2261.000000 0t10002202

With the binary display shown above, the rightmost bit is bit 0 and the leftmost bit is bit 15, matching the enum numbering. Compare with the value as it actually appears in ax:

:> ? 0xd508
54536 0xd508 0152410 53.3K 0000:0508 54536 "\b\xd5" 0b1101010100001000 54536.0 54536.000000f 54536.000000 0t2202210212
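The conversion between the two views is a plain 16-bit byte swap. A minimal sketch (swap16 is a name I made up for the example):

```c
#include <stdint.h>

/* Exchange the high and low bytes of a 16-bit value. */
uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}
```

Applied to 0xd508 this gives 0x08d5, and applying it again gives 0xd508 back.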

Since we are analyzing x86 code, we will continue with the value as it appears in the register, 0xd508. Reading its binary form from left to right, here is what each bit means (bit numbers are those of the enum):

Bit number   Value   Description
7            1       graphical
6            1       printing
5            0       whitespace
4            1       hexadecimal numeric
3            0       numeric
2            1       alphabetic
1            0       lowercase
0            1       uppercase
15           0       (unused)
14           0       (unused)
13           0       (unused)
12           0       (unused)
11           1       alphanumeric
10           0       punctuation
9            0       control
8            0       blank

So here we can see that the character 'A' is :

  • uppercase
  • alphabetic
  • hexadecimal numeric (0123456789ABCDEF)
  • printing
  • graphical
  • alphanumeric
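This decoding can be automated. The sketch below takes a raw entry as seen in a register on a little-endian machine and lists the properties it encodes. It does not call into libc at all: the names table and the mask function are my own transcription of the enum, and describe_props is a hypothetical helper.

```c
#include <string.h>

/* Property names in enum order: index n is enum bit n of <ctype.h>. */
static const char *const prop_names[12] = {
    "upper", "lower", "alpha", "digit", "xdigit", "space",
    "print", "graph", "blank", "cntrl", "punct", "alnum"
};

/* Mask for enum bit n as it appears in a register on a little-endian
   host (the two bytes are swapped relative to the enum numbering). */
static unsigned mask_little_endian(unsigned bit)
{
    return bit < 8 ? (1u << bit) << 8 : (1u << bit) >> 8;
}

/* Hypothetical helper: write the space-separated names of the
   properties set in `value` into `out`, then return `out`.
   Names that do not fit in `size` bytes are dropped. */
char *describe_props(unsigned value, char *out, size_t size)
{
    size_t used = 0;

    if (size == 0)
        return out;
    out[0] = '\0';

    for (unsigned bit = 0; bit < 12; bit++) {
        if (!(value & mask_little_endian(bit)))
            continue;

        const char *name = prop_names[bit];
        size_t len = strlen(name);
        size_t need = len + (used ? 1 : 0);   /* name plus separator */

        if (used + need >= size)
            break;                            /* keep room for the NUL */
        if (used)
            out[used++] = ' ';
        memcpy(out + used, name, len + 1);    /* copies the NUL too */
        used += len;
    }
    return out;
}
```

Feeding it 0xd508 with a 64-byte buffer yields "upper alpha xdigit print graph alnum", the same six properties as the list above.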

If we only want to know if 'A' is uppercase, we need to apply the bit mask 0x100 to the property integer. If the result is nonzero, then the property is true:

:> ? 0xd508 & 0x100
256 0x100 0400 256 0000:0100 256 "\x01" 0b0000000100000000 256.0 256.000000f 256.000000 0t100111

This is what is done in the rest of the assembly code:

0x004005d1      and eax, 0x100     ; Test for uppercase
0x004005d6      mov dword [local_4h], eax ; store result in variable i

Now we know how __ctype_b_loc works !

Here is the full assembly code

0x00400596      push rbp                    ; test_ctype.c:6 {
0x00400597      mov rbp, rsp
0x0040059a      sub rsp, 0x20
0x0040059e      mov dword [local_14h], edi
0x004005a1      mov qword [local_20h], rsi
0x004005a5      mov byte [local_5h], 0x41   ; test_ctype.c:8     char A = 'A';    ; 'A' ; 65
0x004005a9      mov edi, str.isupper        ; test_ctype.c:9     printf("isupper");    ; 0x400874 ; "isupper"
0x004005ae      mov eax, 0
0x004005b3      call sym.imp.printf         ;[1] ; int printf(const char *format)
0x004005b8      call sym.imp.__ctype_b_loc  ; test_ctype.c:10     i = isupper(A); ;[2]
0x004005bd      mov rax, qword [rax]
0x004005c0      movsx rdx, byte [local_5h]
0x004005c5      add rdx, rdx                ; '('
0x004005c8      add rax, rdx                ; '('
0x004005cb      movzx eax, word [rax]
0x004005ce      movzx eax, ax
0x004005d1      and eax, 0x100
0x004005d6      mov dword [local_4h], eax