导言:本文是在小组内的一个分享,介绍协程实现的几种方法和优化策略,对比GoLang中goroutine实现方式及调度器的设计,与常见后台服务器设计模式对比,使用协程的优劣分析。
some questions
Q1: multitasking ?
time-sharing (1960s,voluntarily/hardware interrupt to relinquish the CPU)
real-time (real-time constraint)
Multiple processes
(heavyweight)are allowed to share processors (CPUs) and other system resources.Threads
(lightweight)are scheduled preemptively and are described as lightweight processes because switching between threads does not involve changing the memory context. It can make use of machines with multiple processors.- other ?
Q2: context switch ?
Most commonly, within some scheduling scheme, one process must be switched out of the CPU so another process can run. This context switch can be triggered by the process making itself unrunnable, such as
by waiting for an I/O or synchronization operation to complete
. On a pre-emptive multitasking system, the scheduler may also switch out processes which are still runnable. To prevent other processes from being starved of CPU time, preemptive schedulers often configure a timer interrupt to fire when a process exceeds its time slice. This interrupt ensures that the scheduler will gain control to perform a context switch.In computing, a
context switch
is the process of storing and restoring the state (more specifically, the execution context) of a process or thread so that execution can be resumed from the same point at a later time. This enables multiple processes to share a single CPU and is an essential feature of amultitasking operating system
.
- thread switch/process switch/task switch
- mode switch (switching between user mode and kernel mode)
- register switch
- stack frame switch
- address space switch (changing virtual memory to physical memory map)
Q3: the cost of context switches ?
Context switches are usually
computationally intensive
. Switching from one process to another requires a certain amount of time for doing the administration – saving and loading registers and memory maps, updating various tables and lists, etc. For example, in the Linux kernel, context switching involves switching registers, stack pointer, and program counter.
Q4: context switching can be selective and store only those registers that need storing ?
- context switching between
threads
in the same process is typically faster than context switching betweenprocesses
. Thread switching is relatively cheap: it requires a context switch (saving and restoring registers and stack pointer), but does not change virtual memory and is thus cache-friendly (leaving TLB valid).- a context switch between threads requires system calls (involving the OS kernel), which can cost more than thousand CPU cycles on x86 CPUs. By contrast,
transferring control
among them requires only fewer than hundred CPU cycles because it does not involve system calls as it is done within a single thread.- Making coroutines fast (reuse of stacks and a lightweight swapcontext implementation.)
Q5: pre-emptive multitasking (time slice/instruction cycle) VS co-operative multitasking (non-preemptive multitasking) ?
Preemptive multitasking
involves the use of aninterrupt mechanism
which suspends the currently executing process and invokes a scheduler to determine which process should execute next. Therefore, all processes will get some amount of CPU time at any given time.As a
cooperatively multitasked system
relies on each process regularly giving up time to other processes on the system, one poorly designed program can consume all of the CPU time for itself, either by performingextensive calculations
or bybusy waiting
; both would cause the whole system to hang. In a server environment, this is a hazard that makes the entire environment unacceptably fragile. A cooperative multitasking system wherein processes or tasksmust be explicitly programmed to yield when they do not need system resources
.
user thread
Coroutines(a language-level construct)are computer program components that generalize subroutines for nonpreemptive multitasking
, by allowing multiple entry points for suspending and resuming execution at certain locations. Coroutines are well-suited for implementing more familiar program components such as cooperative tasks
, exceptions, event loop
, iterators, infinite lists and pipes.
According to Donald Knuth, the term coroutine was coined by Melvin Conway in 1958
. Coroutines originated as an assembly language method
, but are supported in some high-level programming languages. Early examples include Simula(1965) and Modula-2(1978). More recent examples are Ruby(1995), Lua(1993), and Go(2009).
var q := new queue
coroutine produce
loop
while q is not full
create some new items
add the items to q
yield to consume
coroutine consume
loop
while q is not empty
remove some items from q
use the items
yield to produce
yield:In computer science, yield is an action that occurs in a computer program during multithreading, of forcing a processor to
relinquish control
of the current running thread, and sending it tothe end of the running queue
, of the same scheduling priority.
switch-case
Stack-based coroutines
Coroutines in C
// have ten successive calls to the function return the numbers 0 through 9.
#include <stdio.h>
int function(void)
{
static int i = 0;
for (; i < 10; ) {
return i++;
}
return 0;
}
int function2(void) {
static int i, state = 0;
switch (state) {
case 0: goto LABEL0;
case 1: goto LABEL1;
}
LABEL0: /* start of function */
for (i = 0; i < 10; i++) {
state = 1; /* so we will come back to LABEL1 */
return i;
LABEL1:; /* resume control straight after the return */
}
return -1;
}
int function3(void)
{
static int i, state = 0;
switch (state) {
case 0: /* start of function */
for (i = 0; i < 10; i++) {
state = 1; /* so we will come back to "case 1" */
return i;
case 1:; /* resume control straight after the return */
}
}
return -1;
}
int function4(void)
{
static int i, state = 0;
switch (state) {
case 0: /* start of function */
for (i = 0; i < 10; i++) {
state = __LINE__ + 2; /* so we will come back to "case __LINE__" */
return i;
case __LINE__:; /* resume control straight after the return */
}
}
return -1;
}
#define Begin() static int state=0; switch(state) { case 0:
#define Yield(x) do { state=__LINE__; return x; case __LINE__:; } while (0)
#define End() }
int function5(void)
{
static int i;
Begin();
for (i = 0; i < 10; i++)
Yield(i);
End();
return -1;
}
int main(int argc, char *argv[])
{
printf("function() return [%d]\n", function());
printf("function() return [%d]\n", function());
printf("function2() return [%d]\n", function2());
printf("function2() return [%d]\n", function2());
printf("function3() return [%d]\n", function3());
printf("function3() return [%d]\n", function3());
printf("function4() return [%d]\n", function4());
printf("function4() return [%d]\n", function4());
printf("function5() return [%d]\n", function5());
printf("function5() return [%d]\n", function5());
return 0;
}
利用
switch-case
的分支跳转特性,以及预编译的__LINE__
宏,实现了一种隐式状态机,最终实现了yield
语义。
non-local jump
早期C语言标准不支持嵌套函数声明(Boost.Lambda和C++11的lambda已经支持),因此goto
只能完成在同一个函数中的跳转。而setjmp()/longjmp()
可以从嵌套函数中跳出。
#include <stdio.h>
#include <stdlib.h>
#include <setjmp.h>
static jmp_buf env;
static void func(int rvar, int vvar)
{
printf("func(): rvar[%d] vvar[%d]\n", rvar, vvar);
longjmp(env, 1);
}
int main(int argc, char *argv[])
{
register int rvar = 1;
volatile int vvar = 2;
if (setjmp(env) == 0) {
rvar = 3;
vvar = 4;
func(rvar, vvar);
} else {
printf("after longjmp(): rvar[%d] vvar[%d]\n", rvar, vvar);
}
return 0;
}
/* output: ?
* func(): rvar[3] vvar[4]
* after longjmp(): rvar[3] vvar[4]
*/
在调用
setjmp()
时,env会保存当前进程的程序计数器(PC)、栈指针寄存器(SP)以及进程的其他信息,这些信息可以保证longjump()
调用完成后从初始的位置继续执行。但是因为setjmp()
实现无法保证保存所有寄存器和临时栈信息,因此只能用于简单的语境中。C99 defines setjmp()/longjmp() to provide non-local jumps but it does not require that longjmp() preserves the current stack frame. Therefore, jumping into a function which was exited via a call to longjmp() is undefined.
context control
C library functions(ucontext.h
)提供了四个函数(setcontext/getcontext/makecontext/swapcontext)实现上下文的控制。
typedef struct ucontext {
unsigned long int uc_flags;
struct ucontext *uc_link;
stack_t uc_stack;
mcontext_t uc_mcontext;
sigset_t uc_sigmask;
...
} ucontext_t;
// example
char stack[SIGSTKSZ];
getcontext(&context);
context.uc_link = &main_context;
context.uc_stack.ss_sp = stack;
context.uc_stack.ss_size = sizeof(stack);
// void makecontext(ucontext_t *ucp, void (*func)(), int argc, ...);
makecontext(&context, (void (*)(void)) func, 2, &context, &main_context2);
#include <stdio.h>
#include <ucontext.h>
#include <unistd.h>
int main(int argc, const char *argv[]){
ucontext_t context;
getcontext(&context);
puts("Hello world");
sleep(1);
setcontext(&context);
return 0;
}
/* output: ?
*
*/
Since POSIX.1-2003 ucontext_t is
deprecated
and wasremoved
in POSIX.1-2008. POSIX Threads recommended.The third argument of makecontext() specifies the number of integer arguments that follow which will require function pointer cast if func will accept those arguments which is undefined in C99. The arguments in the var-arg list are required to be
integers
, passing pointers in var-arg list is not guaranteed to work, especially it will fail for architectures where pointers are larger than integers.
Boost.Context
Boost.Context is a foundational library that provides a sort of cooperative multitasking on a single thread
.
Boost.Context must be built for the particular compiler(s) and CPU architecture(s)s being targeted. Boost.Context includes assembly code and, therefore, requires GNU AS for supported POSIX systems, MASM for Windows/x86 systems and ARMasm for Windows/arm systems.
// Each instance of fcontext_t represents a context (CPU registers and stack space). Together with its related functions jump_fcontext() and make_fcontext() it provides a execution control transfer mechanism similar interface like ucontext_t. fcontext_t and its functions are located in boost::context and the functions are declared as extern "C".
#include <boost/context/all.hpp>
struct stack_t
{
void *sp;
std::size_t size;
};
struct fcontext_t
{
< platform specific >
stack_t fc_stack;
};
intptr_t jump_fcontext( fcontext_t * ofc, fcontext_t const* nfc, intptr_t vp, bool preserve_fpu = true);
fcontext_t * make_fcontext( void * sp, std::size_t size, void(* fn)(intptr_t) );
Performance of context switch
Platform | ucontext_t | fcontext_t with fpu (Floating-point unit) | fcontext_t without fpu | boost::function |
---|---|---|---|---|
AMD Athlon 64 DualCore 4400+ (32bit Linux) | 846 cycles | 65 cycles | 43 cycles | 15 cycles |
Intel Core2 Q6700 (64bit Linux) | 1481 cycles | 172 cycles | 63 cycles | 25 cycles |
ucontext_t
preserves signal mask between context switches which involves system calls consuming a lot of CPU cycles (ucontext_t
is slower by perfomance_link[factor 13x] relative tofcontext_t
).
goroutine
In Go, each concurrently executing activity is called a goroutine
.
f(); // call f() wait for it to return
go f(); // create a new goroutine that calls f() don't wait
Go scheduler
Why create a userspace scheduler when the operating system can schedule threads for you? The Go scheduler (Go 1.1 is the new scheduler)
model | des | traits |
---|---|---|
N:1 | several userspace threads are run on one OS thread | This has the advantage of being very quick to context switch but cannot take advantage of multi-core systems |
1:1 | one thread of execution matches one OS thread | It takes advantage of all of the cores on the machine, but context switching is slow because it has to trap through the OS |
M:N | It schedules an arbitrary number of goroutines onto an arbitrary number of OS threads | Get quick context switches and take advantage of all the cores in your system (The main disadvantage of this approach is the complexity it adds to the scheduler) |
Here are 2 threads (M), each holding a context (P), each running a goroutine (G). In order to run goroutines, a thread must hold a context. The number of contexts is set on startup to the value of the GOMAXPROCS
environment variable or through the runtime function GOMAXPROCS().
syscall
a thread giving up its context so that another thread can run it. The scheduler makes sure there are enough threads to run all contexts.
Stealing work
When a context runs out, it will try to steal about half of the runqueue from another context. This makes sure there is always work to do on each of the contexts, which in turn makes sure that all threads are working at their maximum capacity.
Difference
traits | process/thread | goroutine |
---|---|---|
stacks | a fixed-size block of memroy, often as large as 2MB. | growable stacks, not fixed, typically 2KB, it grows and shrinks as needed. |
scheduling | OS kernel scheduler, every few milliseconds, a hardware timer interrupts the processor, which causes a kernel function called the scheduler to be invoked. | the Go runtime contains its own scheduler that uses a technique known as m:n scheduling, because it schedules m goroutines on n OS threads. The Go scheduler is not invoked periodically by a hardware timer, but implicitly by certain Go language constructs. |
GOMAXPROCS | NULL | determine how many OS threads may be actively executing Go code simultaneously. |
identity | thread-local storage | a simpler style of programming, explicit |
go 1.2之前,goroutine只会在两种情况下被调度: runtime.GoSched()和systemcall。但是从1.2开始,为了避免饿死其它goroutine,在发生任意函数调用的时候,都有机会触发scheduler。所以从1.2开始如果goroutine中是纯计算,没有任何系统调用,scheduler仍然有机会介入,不会永远独占CPU。(Refer: preemptive scheduling)
TDF
TDF是腾讯内部的一个基于协程的自研后台开发框架,使用ucontext,N:1模型。
ready_list = [A, B, C]
while (!ready_list.empty()) {
for co in ready_list {
co.process()
if (co.finished())
ready_list.remove(co)
}
}
内部协程调度模型
内部事件触发流程
协程优化
Just using an off-the-shelf library like libcoroutine isn’t as fast as you might think. The improvement comes from two optimizations: reuse of stacks and a lightweight swapcontext implementation. (Refer: Making coroutines fast)
conclusion
Scheduling can be done at the kernel level or user level, and multitasking can be done preemptively or cooperatively.
type | switch mode | scheduling | concept |
---|---|---|---|
process/thread | kernel level | preemptive | stack-based structure, one function will be caller and the other will be callee |
coroutine | user level | cooperative | stop thinking of one process as the caller and the other as the callee, resume control from just after the return statement |
同步等待处理模型
* 优点:可以按照人的正常思维实现业务逻辑,开发效率高。
* 缺点:同步接口会导致排队和毛刺现象,并发处理能力弱。
异步多路复用处理模型
* 优点:并发处理能力强。
* 缺点:需要处理各种回调,业务代码实现复杂,开发效率低。
协程(同步编程异步执行)处理模型
进程和线程是抢占式切换,不知道会在什么时刻发生切换,操作系统为了切换到另外的线程再切回到原来的位置,需要保存大部分的寄存器和其他信息。而协程在网络IO函数里发现可能堵塞的情况下进行主动切换,只需要保存函数调用路径中可能发生变化的寄存器(ESP、EIP以及callee-saved寄存器),因此协程之间的切换更高效。
优点:
- without race conditions
- light-weight context switch
- understandable (First do this, then do that)
缺点:
- responsiveness (can not use blocking system calls)
- coroutine crashes a process
- explicit scheduling/starving problem (Preemptive multithreading is generally considered the superior approach, as it allows the operating system to determine when a context switch should occur.)