[CPU] How to Implement CPU(or Core) Pinning


봉황대 in CS


등 긁는 봉황대 2024. 8. 5. 22:08

What is CPU Pinning?

'CPU pinning' means making a specific process or thread run only on a designated CPU core.

CPU pinning is a stricter form of CPU affinity,

where you bind a process or a thread to a specific CPU core, and prevent it from running on any other core.

What is CPU Affinity?

'CPU affinity' here means 'the ability to specify which CPU cores a process or a thread can run on':

a specific process or thread is allowed to run only within a designated set of CPU cores.

By default, the OS scheduler decides at the kernel level which process or thread runs on which core,

but the programmer can also specify this manually or programmatically.
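As a minimal sketch of the "programmatically" part, assuming Linux with glibc: sched_getaffinity() and sched_setaffinity() read and change the set of CPUs a thread may use. The helper names allowedCpus and restrictToCpus are made up for this illustration.

```cpp
#include <sched.h>  // sched_getaffinity, sched_setaffinity, cpu_set_t (Linux/glibc)
#include <vector>

// Returns the ids of the CPUs the calling thread is currently allowed to run on.
// Returns an empty vector if the affinity mask cannot be read.
std::vector<int> allowedCpus() {
    cpu_set_t set;
    CPU_ZERO(&set);
    std::vector<int> cpus;
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {  // pid 0 = calling thread
        return cpus;
    }
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) {
            cpus.push_back(cpu);
        }
    }
    return cpus;
}

// Restricts the calling thread to the given CPUs (affinity to a set).
// Passing a single CPU id turns this into pinning. Returns true on success.
bool restrictToCpus(const std::vector<int> &cpus) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu : cpus) {
        CPU_SET(cpu, &set);
    }
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}
```

The same thing can be done manually from a shell with the taskset command, without touching the program.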

CPU affinity offers the following benefits.

Making several (related) processes use the same CPU → cache misses ↓
Preventing a process from migrating to another CPU while it runs → consistency in performance ↑
Overhead from task migration or context switching ↓

With CPU pinning, migration is ruled out entirely, and so are context switches

and preemption by other processes that want to run on the same core, provided that core is reserved for the pinned thread.

Typical benefits that follow from this are:

Context switching, memory access latency, synchronization overhead ↓
Cache utilization ↑ : the process or thread keeps running on the same core.

All of these benefits work together
to avoid potential errors or crashes caused by resource conflicts or unexpected interruptions.

Challenges

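CPU pinning does not automatically improve performance; as with any technique, there is a trade-off.

The most critical problem is that restricting the CPU cores a program may run on makes the system less flexible.

As an extreme example, if that core fails completely, the program cannot run at all, so the core becomes a single point of failure.

Also, applying pinning without considering factors such as the number of workers and the NUMA architecture can actually lower performance.

In that case the OS scheduler was already doing a good job on its own, and pinning simply overrode it.

(I describe a similar experience at the bottom of this post.)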

How to Implement CPU(or Core) Pinning

I based the implementation on Blink-Hash. \^0^/

https://github.com/chahk0129/Blink-hash/blob/master/include/util.h#L129

First, check how the CPU architecture is laid out, and decide the CPU (or core) allocation order to match it.

The Linux server I want to run this code on has the following architecture:

(48 physical cores, 96 logical cores / see the previous post: https://eunajung01.tistory.com/163)

Cross-NUMA latency stays low only when many threads are pinned to the same NUMA node,

so I defined the allocation order as socket 0, 0, 1, 1.

#include <cstddef>  // size_t

// Selects which allocation map pinThreadToCore() uses.
bool isHyperThreading_enabled = true;

constexpr static size_t NUMBER_OF_LOGICAL_CORES = 96;
constexpr static size_t NUMBER_OF_PHYSICAL_CORES = 48;

static int coreAllocationMap_hyperThreading[] = {
        0, 1, 2, 3, 4, 5, 6, 7,
        8, 9, 10, 11, 12, 13, 14, 15,
        16, 17, 18, 19, 20, 21, 22, 23,  // socket 0

        48, 49, 50, 51, 52, 53, 54, 55,
        56, 57, 58, 59, 60, 61, 62, 63,
        64, 65, 66, 67, 68, 69, 70, 71,  // socket 0

        24, 25, 26, 27, 28, 29, 30, 31,
        32, 33, 34, 35, 36, 37, 38, 39,
        40, 41, 42, 43, 44, 45, 46, 47,  // socket 1

        72, 73, 74, 75, 76, 77, 78, 79,
        80, 81, 82, 83, 84, 85, 86, 87,
        88, 89, 90, 91, 92, 93, 94, 95   // socket 1
};

static int coreAllocationMap_numa[] = {
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,            // socket 0
        12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,  // socket 0
        24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,  // socket 1
        36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47   // socket 1
};

(1) If hyper-threading is enabled (isHyperThreading_enabled == true),

cores are allocated in the order defined in coreAllocationMap_hyperThreading;

(2) if hyper-threading is not enabled (isHyperThreading_enabled == false),

cores are allocated in the order defined in coreAllocationMap_numa.
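The two maps above are hard-coded for one particular server. As a rough sketch, a similar order could be derived from the topology files Linux exposes under /sys/devices/system/cpu. The names LogicalCpu, readInt, readTopology, and buildAllocationOrder are made up for this illustration, and the sketch assumes that one socket corresponds to one NUMA node, which is not true on every machine.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <string>
#include <tuple>
#include <vector>

// One logical CPU as described by Linux sysfs.
struct LogicalCpu {
    int id;      // logical CPU number (the N in cpuN)
    int socket;  // topology/physical_package_id
    int core;    // topology/core_id (unique only within a socket)
};

// Reads a single integer from a file; returns -1 if it cannot be read.
int readInt(const std::string &path) {
    std::ifstream in(path);
    int value = -1;
    if (!(in >> value)) {
        return -1;
    }
    return value;
}

// Enumerates cpu0, cpu1, ... and stops at the first CPU whose topology
// cannot be read (for example a missing or offline CPU).
std::vector<LogicalCpu> readTopology() {
    std::vector<LogicalCpu> cpus;
    for (int id = 0;; ++id) {
        const std::string base =
            "/sys/devices/system/cpu/cpu" + std::to_string(id) + "/topology/";
        const int socket = readInt(base + "physical_package_id");
        const int core = readInt(base + "core_id");
        if (socket < 0 || core < 0) {
            break;
        }
        cpus.push_back({id, socket, core});
    }
    return cpus;
}

// Orders logical CPUs socket by socket. Within a socket, the first hardware
// thread of every core comes before the second hardware threads.
std::vector<int> buildAllocationOrder(std::vector<LogicalCpu> cpus) {
    std::sort(cpus.begin(), cpus.end(),
              [](const LogicalCpu &a, const LogicalCpu &b) { return a.id < b.id; });

    std::vector<std::tuple<int, int, int>> keyed;  // (socket, sibling rank, id)
    for (std::size_t i = 0; i < cpus.size(); ++i) {
        int rank = 0;  // lower-numbered CPUs that share this physical core
        for (std::size_t j = 0; j < i; ++j) {
            if (cpus[j].socket == cpus[i].socket && cpus[j].core == cpus[i].core) {
                ++rank;
            }
        }
        keyed.emplace_back(cpus[i].socket, rank, cpus[i].id);
    }
    std::sort(keyed.begin(), keyed.end());

    std::vector<int> order;
    for (const auto &key : keyed) {
        order.push_back(std::get<2>(key));
    }
    return order;
}
```

Calling buildAllocationOrder(readTopology()) would then give an order of the same shape as coreAllocationMap_hyperThreading: all logical CPUs of the first socket, followed by those of the next one.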

Below is the function that pins each thread to its designated core.

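#include <pthread.h>  // pthread_setaffinity_np, pthread_self (glibc extension)
#include <sched.h>    // cpu_set_t, CPU_ZERO, CPU_SET
#include <cstdlib>    // std::exit
#include <cstring>    // std::strerror
#include <iostream>

// Pins the calling thread to the core chosen for threadId.
// Thread ids beyond the map size wrap around, so several threads may share a core.
inline void pinThreadToCore(size_t threadId) {
    cpu_set_t cpu;
    CPU_ZERO(&cpu);

    if (isHyperThreading_enabled) {
        CPU_SET(coreAllocationMap_hyperThreading[threadId % NUMBER_OF_LOGICAL_CORES], &cpu);
    } else {
        CPU_SET(coreAllocationMap_numa[threadId % NUMBER_OF_PHYSICAL_CORES], &cpu);
    }

    // pthread_setaffinity_np() returns 0 on success and an error number otherwise.
    const int error = pthread_setaffinity_np(pthread_self(), sizeof(cpu), &cpu);
    if (error != 0) {
        std::cerr << "pinThreadToCore() failed: " << std::strerror(error) << std::endl;
        std::exit(1);
    }
}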

[Reference] https://man7.org/linux/man-pages/man3/pthread_setaffinity_np.3.html

The pthread_setaffinity_np() function sets the CPU affinity mask of the thread thread

to the CPU set pointed to by cpuset.
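To check that a pin actually took effect, the mask can be read back with pthread_getaffinity_np(), and sched_getcpu() reports the CPU the thread is running on at that moment. A small sketch, assuming Linux with glibc; PinReport and pinAndVerify are names made up for this illustration.

```cpp
#include <pthread.h>  // pthread_setaffinity_np, pthread_getaffinity_np
#include <sched.h>    // cpu_set_t, CPU_* macros, sched_getcpu
#include <thread>

// Result of a pin attempt.
struct PinReport {
    bool maskMatches;  // true if the kernel reports exactly the requested CPU
    int runningOn;     // CPU the thread was running on right after pinning, or -1
};

// Pins the calling thread to `cpu`, then reads the affinity mask back.
PinReport pinAndVerify(int cpu) {
    cpu_set_t wanted;
    CPU_ZERO(&wanted);
    CPU_SET(cpu, &wanted);
    if (pthread_setaffinity_np(pthread_self(), sizeof(wanted), &wanted) != 0) {
        return {false, -1};
    }

    cpu_set_t actual;
    CPU_ZERO(&actual);
    if (pthread_getaffinity_np(pthread_self(), sizeof(actual), &actual) != 0) {
        return {false, -1};
    }

    return {CPU_EQUAL(&wanted, &actual) != 0, sched_getcpu()};
}
```

Calling pinAndVerify from inside a worker thread (for example right after pinThreadToCore would normally run) and logging the report is a cheap way to confirm the allocation map is being applied as intended.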

To have the threads call it, define the following (the function argument is the function each thread will run):

#include <cstdint>     // uint64_t
#include <functional>  // std::ref
#include <thread>
#include <vector>

// Entry point of each thread: pin first, then run the user function.
template<typename Func, typename Arg>
inline void threadTodo(uint64_t threadId, Func function, Arg &arg) {
    pinThreadToCore(threadId);
    function(static_cast<void *>(&arg));
}

template<typename Func, typename Args>
inline void startThreads(uint64_t numberOfThreads, Func function, Args &args) {
    // std::vector instead of a variable-length array, which is not standard C++.
    std::vector<std::thread> threads;
    threads.reserve(numberOfThreads);

    for (uint64_t threadId = 0; threadId < numberOfThreads; ++threadId) {
        threads.emplace_back(threadTodo<Func, typename Args::value_type>,
                             threadId, function, std::ref(args[threadId]));
    }
    for (auto &thread : threads) {
        thread.join();
    }
}

For example, if each thread should run the function below,

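void *insert(void *threadArgs) {
    auto *args = static_cast<ThreadArgs_insert *>(threadArgs);

    for (uint64_t i = args->startOfWorkload; i < args->endOfWorkload; ++i) {
        // Assumption: entry is the i-th key-value pair of the workload;
        // `workload` is a hypothetical member name not shown in the original.
        const auto &entry = args->workload[i];
        args->db.insert(entry.key, entry.value);
    }

    return nullptr;
}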

then calling startThreads as below pins as many threads as you want to their cores

and then runs the given function on each of them.

startThreads(numberOfThreads, insert, threadArgs);
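The definition of ThreadArgs_insert is not shown in this post, so here is a self-contained sketch of the same calling pattern with a toy workload. ThreadArgs_sum, sumRange, and sumInParallel are hypothetical names, and pinThreadToCore is simplified here to pick from the CPUs the process is allowed to use instead of the allocation maps, so that the sketch runs on any Linux machine.

```cpp
#include <pthread.h>
#include <sched.h>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical per-thread arguments (stand-in for ThreadArgs_insert).
struct ThreadArgs_sum {
    uint64_t startOfWorkload;
    uint64_t endOfWorkload;
    uint64_t result;
};

// Same void *(void *) shape as insert(): sums the integers in [start, end).
void *sumRange(void *threadArgs) {
    auto *args = static_cast<ThreadArgs_sum *>(threadArgs);
    args->result = 0;
    for (uint64_t i = args->startOfWorkload; i < args->endOfWorkload; ++i) {
        args->result += i;
    }
    return nullptr;
}

// Simplified pinning (assumption, not the map-based version): pins the calling
// thread to the (threadId mod n)-th of the n CPUs it is allowed to use.
inline void pinThreadToCore(uint64_t threadId) {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return;
    const int count = CPU_COUNT(&allowed);
    if (count == 0) return;
    uint64_t target = threadId % static_cast<uint64_t>(count);
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (!CPU_ISSET(cpu, &allowed)) continue;
        if (target-- == 0) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            pthread_setaffinity_np(pthread_self(), sizeof(one), &one);
            return;
        }
    }
}

template<typename Func, typename Arg>
inline void threadTodo(uint64_t threadId, Func function, Arg &arg) {
    pinThreadToCore(threadId);
    function(static_cast<void *>(&arg));
}

template<typename Func, typename Args>
inline void startThreads(uint64_t numberOfThreads, Func function, Args &args) {
    std::vector<std::thread> threads;
    threads.reserve(numberOfThreads);
    for (uint64_t threadId = 0; threadId < numberOfThreads; ++threadId) {
        threads.emplace_back(threadTodo<Func, typename Args::value_type>,
                             threadId, function, std::ref(args[threadId]));
    }
    for (auto &thread : threads) thread.join();
}

// Splits [0, n) across the threads, runs them pinned, and adds up the results.
uint64_t sumInParallel(uint64_t numberOfThreads, uint64_t n) {
    if (numberOfThreads == 0) return 0;
    std::vector<ThreadArgs_sum> threadArgs(numberOfThreads);
    const uint64_t chunk = n / numberOfThreads;
    for (uint64_t t = 0; t < numberOfThreads; ++t) {
        threadArgs[t].startOfWorkload = t * chunk;
        threadArgs[t].endOfWorkload = (t + 1 == numberOfThreads) ? n : (t + 1) * chunk;
        threadArgs[t].result = 0;
    }
    startThreads(numberOfThreads, sumRange, threadArgs);
    uint64_t total = 0;
    for (const auto &args : threadArgs) total += args.result;
    return total;
}
```

The container passed as args only needs a value_type and operator[], so a std::vector of argument structs with one element per thread is enough.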

Evaluation

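I measured performance with and without core pinning,

increasing the number of threads through 1, 4, 8, 16, 32, 48, 64, 80, and 96.

(I will add a graph later when I have time.)

Up to 48 threads, core pinning improved performance sharply (48 threads in particular gave the best result),

but beyond that, performance was actually worse than without pinning.

Thinking about why:

With the hyper-threading map, the first 48 allocation slots (as many as there are physical cores) all belong to socket 0, so up to 48 threads the NUMA layout is used very well

(all threads run within the same socket, so resource utilization is good and there is no cross-NUMA latency).

Once the count goes past 48, some threads are allocated to the other socket, and that made things slower.

In that range, letting the scheduler distribute the threads on its own worked better.

I am still wondering what to do beyond 48 threads.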
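The socket placement described above can be checked with a small sketch. It assumes isHyperThreading_enabled == true and relies on the socket labels in coreAllocationMap_hyperThreading: map positions 0 to 47 are socket 0 and positions 48 to 95 are socket 1. The names kLogicalCores, socketOfThread, and threadsPerSocket are made up for this illustration.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kLogicalCores = 96;  // same value as NUMBER_OF_LOGICAL_CORES

// Socket of the logical CPU that pinThreadToCore would choose for `threadId`
// under the hyper-threading map: positions 0-47 hold socket 0's logical CPUs
// (0-23 and 48-71), positions 48-95 hold socket 1's (24-47 and 72-95).
constexpr int socketOfThread(std::size_t threadId) {
    return (threadId % kLogicalCores) < 48 ? 0 : 1;
}

// Counts how many of `numberOfThreads` pinned threads land on each socket.
std::array<std::size_t, 2> threadsPerSocket(std::size_t numberOfThreads) {
    std::array<std::size_t, 2> counts{0, 0};
    for (std::size_t threadId = 0; threadId < numberOfThreads; ++threadId) {
        ++counts[socketOfThread(threadId)];
    }
    return counts;
}
```

For the thread counts measured here, 48 threads all stay on socket 0, while 64 threads are split 48 to 16, 80 threads 48 to 32, and 96 threads 48 to 48. The first run that crosses sockets is therefore the one with 64 threads, which is where the slowdown appeared.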

References

https://www.linkedin.com/advice/1/what-advantages-disadvantages-using-cpu-affinity
