如何基于LLVM IR分析全局函数指针的使用情况

内核中的全局函数指针大致可以分为两种：一种是全局变量本身是一个函数指针，另一种是全局结构体的某个子域是函数指针，例如：

  
struct funcptr {
	void (*foo)();
}s;

void (*fp)(void);

而对这些函数指针的调用，反映到IR上，通常为：

  
fp():
  store void ()* @indirect, void ()** @fp, align 8
  %8 = load void ()*, void ()** @fp, align 8
  call void %8()
s.foo():
  %9 = load void (...)*, void (...)** getelementptr inbounds (%struct.funcptr, %struct.funcptr* @s, i32 0, i32 0), align 8
  %10 = bitcast void (...)* %9 to void ()*
  call void %10()

我们可以根据IR的特征来找到对于这些全局变量的使用情况。

如何识别函数指针

LLVM IR对于全局变量和栈上的变量处理是类似的，都将其看作指针。也就是说，即使对于一个int类型的全局变量a，直接访问其type也是指针类型。例如：

  
C-Level:
	int a = 0;
IR：
	@a = dso_local global i32 0, align 4
Type:
	i32*

因此，对于所有的全局变量，只有通过gv.getType()->getPointerElementType()才能得到真正的类型，再在这个类型的基础上对该变量的类型进行分类。函数指针首先其本身类型为Pointer，其PointerElementType为Function。而对于结构体，或结构体类型的指针，需要对其类型进行一个类BFS的算法，如果类型为Pointer，则对其PointerElementType进行判断，如果该类型为Struct/Array，再依次判断Struct的字段类型或Array中的元素类型，对应地有如下代码：

  
bool hasFuncptr(llvm::Type *type, llvm::GlobalVariable *gv)
{
    assert(isa<ArrayType>(type) || isa<StructType>(type));
    assert(gv != nullptr);
    if (isa<ArrayType>(type))
    {
        if (gv->hasInitializer())
            if (isa<FunctionType>(type->getArrayElementType()))
                return true;
    }
    else if (isa<StructType>(type))
    {
        queue<llvm::Type *> que;
        que.push(type);
        set<llvm::Type *> typeset;
        while (!que.empty())
        {
            llvm::Type *cur = que.front();
            que.pop();
            if (typeset.find(cur) != typeset.end())
                continue;
            else
                typeset.insert(cur);
            if (isa<PointerType>(cur))
            {
                if (isa<FunctionType>(cur->getPointerElementType()))
                {
                    return true;
                }
                else
                    que.push(cur->getPointerElementType());
            }
            else if (isa<StructType>(cur))
            {
                if (gv->hasInitializer())
                {
                    for (int i = 0; i < cur->getStructNumElements(); i++)
                    {
                        que.push(cur->getStructElementType(i));
                    }
                }
            }
            else if (isa<ArrayType>(cur))
            {
                if (gv->hasInitializer())
                    que.push(cur->getArrayElementType());
            }
            else
                continue;
        }
    }
    return false;
}

void check_type(llvm::Module *module)
{
    for (auto &gv : module->getGlobalList())
    {
        Type *type = gv.getType()->getPointerElementType();
        if (isa<PointerType>(type))
        {
            if (isa<FunctionType>(type->getPointerElementType()))
            {
	            // 说明是函数指针
            }
            else
            {
                while (isa<PointerType>(type))
                {
	                // 指针指向指针，继续判断
                    type = type->getPointerElementType();
                }
                if (isa<StructType>(type) || isa<ArrayType>(type))
                {
	                // 结构体或数组
                    if (hasFuncptr(type, &gv))
                        // 子域中存在函数指针
                }
                else
                {
	                // 其他类型，必不可能包含函数指针
                }
            }
        }
        else if (isa<StructType>(type) || isa<ArrayType>(type))
        {
            // 全局变量本身是结构体或数组
            if (hasFuncptr(type, &gv))
	            //子域中存在函数指针
        }
        else
        {
            // 普通数据，必不包含函数指针
        }
    }
}

如何追踪函数指针的调用

对于llvm::Value，可以通过users()找到该变量的所有use点，例如下面这段代码就构成了对fp的一条def-use链：

  
@fp = common dso_local global void ()* null, align 8
%4 = load void ()*, void ()** @fp, align 8
call void %4()

从上面这段IR，我们似乎可以得到一个非常简单的判断方式：从已经判定为指针类型的全局变量开始，顺着def-use链进行DFS，遇到call指令之后检查其是否是间接调用，如果是间接调用，则说明调用了这个函数指针。

然而，上述方法在很大程度上会存在误报和漏报的现象：

当这个函数指针作为另一个间接调用的参数时，沿着def-use链依然会出现间接调用，但是实际上调用的并不是这个函数指针，从而会出现误报。
当某个结构体的子域是函数指针时，这个结构体的另一个子域作为作为另一个间接调用的参数，沿着def-use链依然会出现间接调用，但是调用的并不是作为函数指针的这个子域，从而会出现误报
当函数指针作为函数调用参数时，在callee中通过该参数调用而不是通过该全局变量调用，由于这个参数并不在该全局变量的def-use链上，因此无法将该调用关联到该全局变量本身，从而会出现漏报。例如：

  
C-Level:
void (*fp)(void);

void test(void (*funcptr)(void)) {
    funcptr();
}

void (*ffp)(void (*)(void));

int main() {
    ffp = test;
    test(fp);
    ffp(fp);
}

IR：
@ffp = common dso_local global void (void ()*)* null, align 8
@fp = common dso_local global void ()* null, align 8

; Function Attrs: noinline nounwind optnone
define dso_local void @test(void ()* %0) #0 {
  %2 = alloca void ()*, align 8
  store void ()* %0, void ()** %2, align 8
  %3 = load void ()*, void ()** %2, align 8
  call void %3()
  ret void
}

; Function Attrs: noinline nounwind optnone
define dso_local i32 @main() #0 {
  store void (void ()*)* @test, void (void ()*)** @ffp, align 8
  %1 = load void ()*, void ()** @fp, align 8
  call void @test(void ()* %1)
  %2 = load void (void ()*)*, void (void ()*)** @ffp, align 8
  %3 = load void ()*, void ()** @fp, align 8
  call void %2(void ()* %3)
  ret i32 0
}

那么fp的def-use链为

  
@fp = common dso_local global void ()* null, align 8
%1 = load void ()*, void ()** @fp, align 8
call void @test(void ()* %1)

@fp = common dso_local global void ()* null, align 8
%3 = load void ()*, void ()** @fp, align 8
call void %2(void ()* %3)

对于第一种情况，我们可以通过判断间接调用的calledOperand是否在def-use链上，如果在，那么说明实际调用的很可能就是追踪的这个全局指针。如果不在，那么这个全局变量可能就是作为参数被用于该call指令。

对于第二种情况，也是同样的道理。如果调用的是该全局变量的某个子域，那么在def-use 链上一定存在一条getElementptr指令，将该结构体的子域加载到某个临时变量，随后call该临时变量。而如果只是将全局变量的某个子域作为函数参数传递，那么calledOperand必然不会在def-use链上，所以也不会关联到对该全局变量的调用。

分析代码如下：

  
void checkUsepoint(llvm::Module *module)
{
    for (auto &gv : module->getGlobalList())
    {
        std::string gv_name = gv.getName().str();
        std::set<llvm::Value *> def_chain;
        queue<llvm::Value *> que;
        que.push(&gv);
        def_chain.insert(&gv);
        while (!que.empty())
        {
            llvm::Value *cur = que.front();
            que.pop();
            for (auto user : cur->users())
            {
                if (CallInst *call = dyn_cast<CallInst>(user))
                {
                    if (call->isIndirectCall())
                    {
                        auto called_operand = call->getCalledOperand();
                        if (def_chain.find(called_operand) != def_chain.end()) {
                            //  出现了对该全局变量的调用
                        }   
                    }
                }
                que.push(user);
                def_chain.insert(user);
            }
        }
    }
}

对于第三种情况：除了main函数中对全局变量的调用之外，test函数中的call void %3()实际上也是调用的全局函数指针fp，但是由于数据流的断裂，上述分析方案无法追踪到该调用。那么如何在过程间重建数据流呢？对于不具有可变参数的函数，我们可以根据参数的下标，找到函数形参列表中与该下标对应的下标。例如，全局变量fp的use链上的call void @test(void ()* %1)这条指令，其参数%1在fp的use链上，对应的下标为0，那么根据calledFunction找到test这个函数，通过getArg(0)找到参数列表中的void ()* %0，再在函数test中对%0进行追踪。然而，从%0无法通过 def-use链找到call void %3()这条指令，因为%3与%0实际上是别名关系，其所在的def-use链为：

  
  %2 = alloca void ()*, align 8
  %3 = load void ()*, void ()** %2, align 8
  call void %3()

因此这种情况需要辅以别名分析，找到该函数内部该参数的别名，对这些别名追踪其use点是否存在间接调用。

上述方法只对direct Call有效，因为对于间接调用，其calledFunction返回为nullptr。根据c++ - How can I get Function Name of indirect call from CallInst in LLVM - Stack Overflow，对于间接调用无法直接获得其对应的函数。KSplit中对于间接调用是通过匹配callsite与每个函数的参数列表和返回值类型，如果能够对应，则将该函数认为是callsite的candidate，随后再对每个candidate进行分析。然而此方法会得到的分析结果最终是overapproximate的，并且开销过高，并不适用。

如何基于LLVM IR分析全局函数指针的使用情况

如何识别函数指针

如何追踪函数指针的调用

Further Reading

Static single-assignment form (SSA)

Tai-e Assignment 1 活跃变量分析和迭代求解器

Tai-e Assignment 2 常量传播和Worklist求解器