Virtual Machine Design and Implementation: Components of a Virtual Machine

virtual machine
├── Loaders and dynamic linkers ──▶ For example, convert symbols to addresses
├── Execution engine ──▶ Execute opcode or machine code
├── Memory manager
│   ├──▶ Use a traditional memory manager, such as malloc and free; the focus is on memory allocation
│   └──▶ Or use the virtual machine's own allocation strategy, with more attention to memory reclamation; for example, a garbage collector
└── Extension ──▶ FFI

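How the PHP kernel optimizes literals

A few days ago I noticed that the PHP kernel handles literals in a rather blunt way: every time the compiler produces a constant, it appends it straight to literals. As a result, the same constant can be stored several times, which is clearly unnecessary. After spending several hours optimizing this part of the code, I discovered that opcache already takes care of it: the function zend_optimizer_compact_literals merges equivalent literals.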
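The idea of merging equivalent literals can be sketched in C as follows. This is only an illustration and not the code of zend_optimizer_compact_literals: the literal and literal_table types and the add_literal_dedup function are hypothetical names, only integers and strings are modelled, and duplicates are rejected at insertion time with a linear scan rather than compacted afterwards.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_LITERALS 64

/* Hypothetical, simplified literal: integers and strings only. */
typedef enum { LIT_LONG, LIT_STRING } lit_kind;

typedef struct {
    lit_kind kind;
    int64_t lval;      /* valid when kind == LIT_LONG */
    const char *sval;  /* valid when kind == LIT_STRING */
} literal;

typedef struct {
    literal items[MAX_LITERALS];
    int count;
} literal_table;

static int literal_equals(const literal *a, const literal *b) {
    if (a->kind != b->kind) return 0;
    if (a->kind == LIT_LONG) return a->lval == b->lval;
    return strcmp(a->sval, b->sval) == 0;
}

/* Return the index of an equal literal if one exists; otherwise append.
 * Returns -1 when the table is full. */
static int add_literal_dedup(literal_table *t, literal lit) {
    assert(t != NULL);
    for (int i = 0; i < t->count; i++) {
        if (literal_equals(&t->items[i], &lit)) return i;
    }
    if (t->count == MAX_LITERALS) return -1;
    t->items[t->count] = lit;
    return t->count++;
}
```

With such a table, compiling the constant 1 twice yields the same index both times, so the operands of both oplines can share one stored value.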

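How the PHP kernel handles symbols

While compiling a PHP script, the PHP kernel converts each symbol name into the address of the storage that holds the symbol's data (note: not the address of the symbol itself, but the address of the storage for its data).

As we know, an opline has the following structure: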

typedef union _znode_op {
    uint32_t constant;
    uint32_t var;
    uint32_t num;
} znode_op;

struct _zend_op {
    const void *handler;
    znode_op op1;
    znode_op op2;
    znode_op result;
    uint32_t extended_value;
    uint32_t lineno;
    zend_uchar opcode;
    zend_uchar op1_type;
    zend_uchar op2_type;
    zend_uchar result_type;
};

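znode_op is a union whose members are all uint32_t, so it holds a single 32-bit number that can store information related to an operand's address. This means that once a PHP script has been compiled, the symbol names can be discarded; they have all been converted to addresses.

The core function for converting a name to an address is lookup_cv: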

static int lookup_cv(zend_string *name) /* {{{ */{
    zend_op_array *op_array = CG(active_op_array);
    int i = 0;
    zend_ulong hash_value = zend_string_hash_val(name);

    /* Linear search: reuse the slot if this name has been seen before. */
    while (i < op_array->last_var) {
        if (ZSTR_H(op_array->vars[i]) == hash_value
         && zend_string_equals(op_array->vars[i], name)) {
            return EX_NUM_TO_VAR(i);
        }
        i++;
    }
    /* New name: take the next slot, growing vars in steps of 16. */
    i = op_array->last_var;
    op_array->last_var++;
    if (op_array->last_var > CG(context).vars_size) {
        CG(context).vars_size += 16; /* FIXME */
        op_array->vars = erealloc(op_array->vars, CG(context).vars_size * sizeof(zend_string*));
    }

    op_array->vars[i] = zend_string_copy(name);
    return EX_NUM_TO_VAR(i);
}

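This code determines where each CV (compiled variable) is stored on the stack: EX_NUM_TO_VAR turns the variable's index into its offset in the stack frame. It also means that the size of the stack is already fixed at compile time.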
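To make the relationship between a variable index, its offset and the frame size concrete, here is a small C sketch. The constants SLOT_SIZE and HEADER_SLOTS and the functions num_to_var, var_to_num and frame_size are assumptions made for illustration; the real values behind EX_NUM_TO_VAR depend on the PHP build.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sizes; the real ones depend on the PHP build. */
#define SLOT_SIZE    16u  /* stands in for the size of one variable slot */
#define HEADER_SLOTS 5u   /* stands in for the slots taken by the frame header */

/* Index of a variable slot -> byte offset from the start of the frame. */
static uint32_t num_to_var(uint32_t n) {
    return (n + HEADER_SLOTS) * SLOT_SIZE;
}

/* Byte offset -> index of the variable slot (inverse of num_to_var). */
static uint32_t var_to_num(uint32_t offset) {
    assert(offset % SLOT_SIZE == 0 && offset / SLOT_SIZE >= HEADER_SLOTS);
    return offset / SLOT_SIZE - HEADER_SLOTS;
}

/* Bytes needed for a frame with last_var CVs and T temporaries.
 * Both counts are known once compilation has finished. */
static size_t frame_size(uint32_t last_var, uint32_t T) {
    return (size_t)(HEADER_SLOTS + last_var + T) * SLOT_SIZE;
}
```

Because the offset is a pure function of the index, the compiler can store it directly in the operand, and the executor only has to add it to the frame pointer.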

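How the PHP kernel converts an op_array from compile time to run time

In the function pass_two, the PHP kernel converts an op_array from its compile-time form to its run-time form.

The conversion shows up mainly in the following places:

Reallocating literals

literals and opcodes, which were originally stored in separate allocations, are merged into one contiguous block of memory. Besides the performance gain from contiguous memory, there is another benefit: when an opline is executed, the corresponding literal can be reached directly through an offset, without passing op_array around, which saves one parameter (previously the literal had to be fetched through op_array->literals).

Resetting the constant value of constants

znode_op::constant ultimately stores the offset of the constant relative to its opline.

Right after the AST has been compiled into opcodes, znode_op::constant stores the index of the constant in the literals array.

After the conversion from compile time to run time, znode_op::constant holds the offset relative to the opline instead.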
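The index-to-offset rewrite can be sketched in C as below. This is a toy model under stated assumptions, not the kernel's pass_two: toy_op, toy_literal, toy_pass_two and toy_rt_constant are hypothetical names, every toy opline is assumed to have exactly one constant operand, and literals are plain 64-bit integers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct { uint32_t constant; uint8_t opcode; } toy_op; /* hypothetical */
typedef int64_t toy_literal;

/* Merge oplines and literals into one malloc'd block (oplines first) and
 * rewrite each constant from "index into lits" to "byte offset from this
 * opline". Returns the merged block, or NULL on allocation failure. */
static toy_op *toy_pass_two(const toy_op *ops, uint32_t n_ops,
                            const toy_literal *lits, uint32_t n_lits) {
    size_t ops_bytes = sizeof(toy_op) * n_ops;
    size_t align = sizeof(toy_literal);
    size_t lit_start = (ops_bytes + align - 1) / align * align; /* keep literals aligned */
    char *block = malloc(lit_start + sizeof(toy_literal) * n_lits);
    if (block == NULL) return NULL;

    memcpy(block, ops, ops_bytes);
    memcpy(block + lit_start, lits, sizeof(toy_literal) * n_lits);

    toy_op *merged = (toy_op *)block;
    for (uint32_t i = 0; i < n_ops; i++) {
        assert(merged[i].constant < n_lits);
        char *lit_addr = block + lit_start + sizeof(toy_literal) * merged[i].constant;
        merged[i].constant = (uint32_t)(lit_addr - (char *)&merged[i]);
    }
    return merged;
}

/* Run time: the opline alone is enough to reach its literal. */
static toy_literal toy_rt_constant(const toy_op *opline) {
    return *(const toy_literal *)((const char *)opline + opline->constant);
}
```

Note that toy_rt_constant takes no op_array argument, which is exactly the saving described above. The caller owns the returned block and releases it with free.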

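Resetting the var value of temporary variables

znode_op::var ultimately stores the offset of the variable relative to execute_data.

As we know, the offset of an IS_CV variable relative to execute_data is already determined by EX_NUM_TO_VAR when the variable is compiled. For an IS_TMP_VAR variable, however, znode_op::var only records the ordinal of the temporary; its offset relative to execute_data is not known yet. It therefore has to be settled during the conversion from compile time to run time.

Why does only IS_TMP_VAR need this conversion, and not IS_CV? The answer lies in the design of the PHP stack frame, which looks like this: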

/*
 * Stack Frame Layout (the whole stack frame is allocated at once)
 * ==================
 *
 *                             +========================================+
 * EG(current_execute_data) -> | zend_execute_data                      |
 *                             +----------------------------------------+
 *     EX_VAR_NUM(0) --------> | VAR[0] = ARG[1]                        |
 *                             | ...                                    |
 *                             | VAR[op_array->num_args-1] = ARG[N]     |
 *                             | ...                                    |
 *                             | VAR[op_array->last_var-1]              |
 *                             | VAR[op_array->last_var] = TMP[0]       |
 *                             | ...                                    |
 *                             | VAR[op_array->last_var+op_array->T-1]  |
 *                             | ARG[N+1] (extra_args)                  |
 *                             | ...                                    |
 *                             +----------------------------------------+
 */

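As the layout shows, IS_CV variables come first and IS_TMP_VAR variables follow them. At the moment an IS_TMP_VAR is emitted, the final number of IS_CV variables is still unknown, so the offset of the temporary relative to execute_data cannot be computed yet. That is why the conversion of IS_TMP_VAR has to be deferred to a later stage.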
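A minimal C sketch of this deferred fix-up follows. The names toy_operand, make_cv, make_tmp and fixup_operand, as well as the slot size and header size, are assumptions for illustration only; the flag values mirror the bit-flag style used for operand types.

```c
#include <assert.h>
#include <stdint.h>

#define SLOT_SIZE    16u  /* hypothetical size of one variable slot */
#define HEADER_SLOTS 5u   /* hypothetical number of slots used by the frame header */

enum { TOY_IS_TMP_VAR = 1 << 1, TOY_IS_CV = 1 << 3 };

typedef struct { uint32_t var; uint8_t type; } toy_operand;

static uint32_t slot_offset(uint32_t slot) {
    return (slot + HEADER_SLOTS) * SLOT_SIZE;
}

/* Compile time: a CV gets its final offset immediately. */
static toy_operand make_cv(uint32_t cv_index) {
    toy_operand op = { slot_offset(cv_index), TOY_IS_CV };
    return op;
}

/* Compile time: a temporary only records "I am temporary number n". */
static toy_operand make_tmp(uint32_t tmp_index) {
    toy_operand op = { tmp_index, TOY_IS_TMP_VAR };
    return op;
}

/* Pass two: last_var is final now, so temporaries can be placed after the CVs. */
static void fixup_operand(toy_operand *op, uint32_t last_var) {
    assert(op != 0);
    if (op->type == TOY_IS_TMP_VAR) {
        op->var = slot_offset(last_var + op->var);
    }
}
```

With three CVs, temporary 0 lands in slot 3, directly behind the last CV, while CV operands pass through the fix-up unchanged.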

How the PHP kernel determines how many operands an opcode has

First, all zend_ast node kinds in the PHP kernel are defined in the file zend_ast.h:

#define ZEND_AST_SPECIAL_SHIFT      6
#define ZEND_AST_IS_LIST_SHIFT 7
#define ZEND_AST_NUM_CHILDREN_SHIFT 8

enum _zend_ast_kind {
// other node kinds omitted

/* 0 child nodes */
ZEND_AST_MAGIC_CONST = 0 << ZEND_AST_NUM_CHILDREN_SHIFT,
ZEND_AST_TYPE,
ZEND_AST_CONSTANT_CLASS,

/* 1 child node */
ZEND_AST_VAR = 1 << ZEND_AST_NUM_CHILDREN_SHIFT,
ZEND_AST_CONST,
ZEND_AST_UNPACK,
ZEND_AST_UNARY_PLUS,
ZEND_AST_UNARY_MINUS,
ZEND_AST_CAST,
ZEND_AST_EMPTY,
ZEND_AST_ISSET,
ZEND_AST_SILENCE,
ZEND_AST_SHELL_EXEC,
ZEND_AST_CLONE,
ZEND_AST_EXIT,
ZEND_AST_PRINT,
ZEND_AST_INCLUDE_OR_EVAL,
ZEND_AST_UNARY_OP,
ZEND_AST_PRE_INC,
ZEND_AST_PRE_DEC,
ZEND_AST_POST_INC,
ZEND_AST_POST_DEC,
ZEND_AST_YIELD_FROM,
ZEND_AST_CLASS_NAME,

ZEND_AST_GLOBAL,
ZEND_AST_UNSET,
ZEND_AST_RETURN,
ZEND_AST_LABEL,
ZEND_AST_REF,
ZEND_AST_HALT_COMPILER,
ZEND_AST_ECHO,
ZEND_AST_THROW,
ZEND_AST_GOTO,
ZEND_AST_BREAK,
ZEND_AST_CONTINUE,

/* 2 child nodes */
ZEND_AST_DIM = 2 << ZEND_AST_NUM_CHILDREN_SHIFT,
ZEND_AST_PROP,
ZEND_AST_NULLSAFE_PROP,
ZEND_AST_STATIC_PROP,
ZEND_AST_CALL,
ZEND_AST_CLASS_CONST,
ZEND_AST_ASSIGN,
ZEND_AST_ASSIGN_REF,
ZEND_AST_ASSIGN_OP,
ZEND_AST_BINARY_OP,
ZEND_AST_GREATER,
ZEND_AST_GREATER_EQUAL,
ZEND_AST_AND,
ZEND_AST_OR,
ZEND_AST_ARRAY_ELEM,
ZEND_AST_NEW,
ZEND_AST_INSTANCEOF,
ZEND_AST_YIELD,
ZEND_AST_COALESCE,
ZEND_AST_ASSIGN_COALESCE,

ZEND_AST_STATIC,
ZEND_AST_WHILE,
ZEND_AST_DO_WHILE,
ZEND_AST_IF_ELEM,
ZEND_AST_SWITCH,
ZEND_AST_SWITCH_CASE,
ZEND_AST_DECLARE,
ZEND_AST_USE_TRAIT,
ZEND_AST_TRAIT_PRECEDENCE,
ZEND_AST_METHOD_REFERENCE,
ZEND_AST_NAMESPACE,
ZEND_AST_USE_ELEM,
ZEND_AST_TRAIT_ALIAS,
ZEND_AST_GROUP_USE,
ZEND_AST_CLASS_CONST_GROUP,
ZEND_AST_ATTRIBUTE,
ZEND_AST_MATCH,
ZEND_AST_MATCH_ARM,
ZEND_AST_NAMED_ARG,

/* 3 child nodes */
ZEND_AST_METHOD_CALL = 3 << ZEND_AST_NUM_CHILDREN_SHIFT,
ZEND_AST_NULLSAFE_METHOD_CALL,
ZEND_AST_STATIC_CALL,
ZEND_AST_CONDITIONAL,

ZEND_AST_TRY,
ZEND_AST_CATCH,
ZEND_AST_PROP_GROUP,
ZEND_AST_PROP_ELEM,
ZEND_AST_CONST_ELEM,

/* 4 child nodes */
ZEND_AST_FOR = 4 << ZEND_AST_NUM_CHILDREN_SHIFT,
ZEND_AST_FOREACH,

/* 5 child nodes */
ZEND_AST_PARAM = 5 << ZEND_AST_NUM_CHILDREN_SHIFT,
};

We can see that the kinds with n child nodes start at n << ZEND_AST_NUM_CHILDREN_SHIFT. Conversely, the number of child nodes can be recovered from a kind like this:

zend_ast_kind >> ZEND_AST_NUM_CHILDREN_SHIFT
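The encoding can be tried out with a reduced enum. The sketch below is not the kernel's header: the TOY_ names are hypothetical and only a handful of kinds are listed, but the numbering scheme is the same one shown above.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_AST_NUM_CHILDREN_SHIFT 8

/* Reduced, hypothetical copy of the numbering scheme: each group starts at
 * (number of children) << 8, and later members count up from there. */
enum toy_ast_kind {
    TOY_AST_MAGIC_CONST = 0 << TOY_AST_NUM_CHILDREN_SHIFT,
    TOY_AST_TYPE,
    TOY_AST_VAR = 1 << TOY_AST_NUM_CHILDREN_SHIFT,
    TOY_AST_CONST,
    TOY_AST_DIM = 2 << TOY_AST_NUM_CHILDREN_SHIFT,
    TOY_AST_ASSIGN,
    TOY_AST_METHOD_CALL = 3 << TOY_AST_NUM_CHILDREN_SHIFT,
    TOY_AST_FOR = 4 << TOY_AST_NUM_CHILDREN_SHIFT,
    TOY_AST_FOREACH,
    TOY_AST_PARAM = 5 << TOY_AST_NUM_CHILDREN_SHIFT
};

/* The low 8 bits distinguish kinds within a group; the bits above them
 * hold the child count. */
static uint32_t toy_ast_num_children(enum toy_ast_kind kind) {
    uint32_t n = (uint32_t)kind >> TOY_AST_NUM_CHILDREN_SHIFT;
    assert(n <= 5); /* this sketch defines no kind with more children */
    return n;
}
```

A consequence of this scheme is that a group can hold at most 256 kinds before its numbering would run into the next group.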

Shift/reduce conflicts

Today I wrote a grammar that has shift/reduce conflicts. The grammar rules are as follows:

%{
#include <stdio.h>
#include <string.h>
#include <iostream> /* the semantic actions below use std::cout */
#include "zend_compile.h"
#include "zend_opcode.h"
#include "zend_vm.h"

#define YYDEBUG 1

#define zendparse yyparse

extern int yylex(void);
extern int yyparse(void);
extern FILE *yyin;
extern int yylineno;

int yywrap()
{
return 1;
}

void yyerror(const char *s)
{
printf("[error] %s, in line %d\n", s, yylineno);
}

int main(int argc, char const *argv[])
{
return 0;
}
%}

%left '+' '-'
%left '*' '/'
%left '(' ')'

%token <ident> T_ECHO "'echo'"
%token <ast> T_LNUMBER "integer"
%token <ast> T_VARIABLE "variable"

%union {
zend_ast *ast;
}

%type <ast> top_statement statement
%type <ast> expr
%type <ast> echo_expr
%type <ast> scalar
%type <ast> top_statement_list
%type <ast> variable

%%

start:
top_statement_list { CG(ast) = $1; }
;

top_statement_list:
top_statement_list top_statement { $$ = zend_ast_list_add($1, $2); }
| %empty { $$ = zend_ast_create_list(0, ZEND_AST_STMT_LIST); }
;

top_statement:
statement { $$ = $1; }
;

statement:
T_ECHO echo_expr ';' { $$ = $2; }
| expr ';' { $$ = $1; }
;

echo_expr:
expr {
std::cout << "create echo zend_ast" << std::endl;
$$ = zend_ast_create_1(ZEND_AST_ECHO, 0, $1);
}
;

expr:
variable '=' expr {
std::cout << "create assign zend_ast" << std::endl;
$$ = zend_ast_create_2(ZEND_AST_ASSIGN, 0, $1, $3);
}
|
expr '+' expr {
std::cout << "create + zend_ast" << std::endl;
$$ = zend_ast_create_binary_op(ZEND_ADD, $1, $3);
}
| expr '-' expr {
std::cout << "create - zend_ast" << std::endl;
$$ = zend_ast_create_binary_op(ZEND_SUB, $1, $3);
}
| expr '*' expr {
std::cout << "create * zend_ast" << std::endl;
$$ = zend_ast_create_binary_op(ZEND_MUL, $1, $3);
}
| expr '/' expr {
std::cout << "create / zend_ast" << std::endl;
$$ = zend_ast_create_binary_op(ZEND_DIV, $1, $3);
}
| '(' expr ')' { $$ = $2; }
| scalar { $$ = $1; }
;

scalar:
T_LNUMBER { $$ = $1; }
;

variable:
T_VARIABLE { $$ = $1; }

%%

Running the command that generates the parser gives:

bison -d -Wcounterexamples zend_language_parser.y

zend_language_parser.y: warning: 4 shift/reduce conflicts [-Wconflicts-sr]
zend_language_parser.y: warning: shift/reduce conflict on token '+' [-Wcounterexamples]
Example: variable '=' expr • '+' expr
Shift derivation
expr
↳ variable '=' expr
↳ expr • '+' expr
Reduce derivation
expr
↳ expr '+' expr
↳ variable '=' expr •

# other warnings omitted

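As the output shows, bison warns about 4 shift/reduce conflicts.

So what is a shift/reduce conflict? It means that once the parser has read the lookahead token, it could either reduce the tokens already on the parse stack or shift the lookahead token onto the stack. That ambiguity is a shift/reduce conflict.

OK, let's look at the warning above. It tells us that for the input string: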

$a = 1 + 1

a shift/reduce conflict occurs.

Let's see how the state changes if the parser prefers to shift:

Parse stack     Action    Remaining input
(empty)         init      $a = 1 + 1
$a              shift     = 1 + 1
$a =            shift     1 + 1
$a = 1          shift     + 1
$a = 1 +        shift     1
$a = 1 + 1      shift     (empty)
$a = expr       reduce    (empty)
expr            reduce    (empty)

The corresponding AST is:

        +-------------+
        | ZEND_ASSIGN |
        +-------------+
               |
       +-------+--------+
       |                |
       v                v
    +----+        +----------+
    | $a |        | ZEND_ADD |
    +----+        +----------+
                        |
                  +-----+-----+
                  |           |
                  v           v
                +---+       +---+
                | 1 |       | 1 |
                +---+       +---+

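Evaluating this AST gives $a a final value of 2, which matches what mainstream languages do.

Now let's see how the state changes if the parser prefers to reduce: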

Parse stack     Action    Remaining input
(empty)         init      $a = 1 + 1
$a              shift     = 1 + 1
$a =            shift     1 + 1
$a = 1          shift     + 1
expr            reduce    + 1
expr +          shift     1
expr + 1        shift     (empty)
expr            reduce    (empty)

The corresponding AST is:

              +----------+
              | ZEND_ADD |
              +----------+
                    |
         +----------+----------+
         |                     |
         v                     v
  +-------------+            +---+
  | ZEND_ASSIGN |            | 1 |
  +-------------+            +---+
         |
   +-----+-----+
   |           |
   v           v
+----+       +---+
| $a |       | 1 |
+----+       +---+

Evaluating this AST gives $a a final value of 1, which is not what we intended.
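The difference between the two trees can be checked with a tiny evaluator. The C sketch below is an illustration only: the node type, the eval function and the two run_ functions are hypothetical and unrelated to zend_ast; each run_ function builds one of the two trees by hand and returns the value left in $a.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { N_NUM, N_VAR, N_ADD, N_ASSIGN } node_kind;

typedef struct node {
    node_kind kind;
    int64_t value;                 /* N_NUM: the literal value */
    int64_t *slot;                 /* N_VAR: storage of the variable */
    const struct node *lhs, *rhs;  /* N_ADD / N_ASSIGN: children */
} node;

static int64_t eval(const node *n) {
    assert(n != NULL);
    switch (n->kind) {
    case N_NUM: return n->value;
    case N_VAR: return *n->slot;
    case N_ADD: return eval(n->lhs) + eval(n->rhs);
    case N_ASSIGN: {
        int64_t v = eval(n->rhs);
        *n->lhs->slot = v;         /* lhs is expected to be an N_VAR node */
        return v;
    }
    }
    return 0;
}

/* Shift preferred: $a = (1 + 1). Returns the final value of $a. */
static int64_t run_shift_first(void) {
    int64_t a = 0;
    node one = { N_NUM, 1, NULL, NULL, NULL };
    node var = { N_VAR, 0, &a, NULL, NULL };
    node add = { N_ADD, 0, NULL, &one, &one };
    node assign = { N_ASSIGN, 0, NULL, &var, &add };
    eval(&assign);
    return a;
}

/* Reduce preferred: ($a = 1) + 1. Returns the final value of $a. */
static int64_t run_reduce_first(void) {
    int64_t a = 0;
    node one = { N_NUM, 1, NULL, NULL, NULL };
    node var = { N_VAR, 0, &a, NULL, NULL };
    node assign = { N_ASSIGN, 0, NULL, &var, &one };
    node add = { N_ADD, 0, NULL, &assign, &one };
    eval(&add);
    return a;
}
```

In the second tree the addition still produces 2, but that result is discarded; the assignment has already stored 1 in $a by the time the addition runs.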

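So a shift/reduce conflict can cause operations to be carried out in an unintended order. If you have studied the classic if ... else shift/reduce conflict in the official bison documentation, you know that it is solved by rewriting the grammar so that the conflict disappears (that example is a bit convoluted, which is why I did not use it here). In our case the conflict can be resolved by setting token precedence: we only need to give = a lower precedence than +: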

%left '='
%left '+' '-'
%left '*' '/'
%left '(' ')'

With this in place, when the parse stack looks like this:

Parse stack     Action    Remaining input
$a = 1          shift     + 1

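and the lookahead token is +, the parser does not reduce: + has a higher precedence than the rule variable '=' expr, which takes its precedence from =, so + is shifted instead. This guarantees that 1 + 1 is reduced first later on, which in turn preserves operator precedence.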

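Writing a PHP Compiler Step by Step: Executing Opcodes

In the previous article we successfully translated the AST into opcodes. One benefit is that opcodes are linear and contiguous, which matches the way a CPU executes machine instructions one after another and makes them easy for people to follow. However, we have not yet set the handler for each opcode.

In this article we implement the execution of these opcodes. This part is fairly difficult.

First, let's sort out the relationship between an opcode and its handler, using PHP's implementation as a reference. We start with our _zend_op: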

struct _zend_op {
    const void *handler; /* needed by zend_vm_set_opcode_handler later in this article */
    znode_op op1;
    znode_op op2;
    znode_op result;
    unsigned char opcode;
    char op1_type;
    char op2_type;
    char result_type;
};

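This layout is a form of three-address code, which makes later data-flow analysis easier.

Variables, literals and so on have types, and because they do, operand 1 and operand 2 can appear in many combinations. This is a Cartesian product. Since there is also more than one opcode, we end up with the following Cartesian product: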

opcode × op1 × op2

Here is an example as a diagram:

opcode        op1 type      op2 type
ZEND_ADD      IS_CONST      IS_CONST
ZEND_SUB      IS_TMP_VAR    IS_TMP_VAR
ZEND_MUL
ZEND_DIV

That gives us 4 * 2 * 2 = 16 spec handlers:

ZEND_ADD_IS_CONST_IS_CONST
ZEND_ADD_IS_CONST_IS_TMP_VAR
ZEND_ADD_IS_TMP_VAR_IS_CONST
ZEND_ADD_IS_TMP_VAR_IS_TMP_VAR
# and so on

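Suppose that opcodes are numbered sequentially from 0, that operand types are also numbered from 0, and that the spec handlers are laid out in memory strictly in that order. Then we can locate a spec handler from opcode, op1_type and op2_type, much like indexing a three-dimensional array. The formula is: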

opcode * (number of op1_type values) * (number of op2_type values) + (op1_type code) * (number of op2_type values) + (op2_type code)

Our implementation is built around this formula.
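Written out in C, the formula is ordinary row-major indexing into a three-dimensional table. The function name spec_handler_index and the constant NUM_OP_TYPES are assumptions for this sketch; 5 is used because five operand types are defined later in the article.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_OP_TYPES 5u /* assumed number of operand type codes (0..4) */

/* Row-major index into a conceptual table
 * handlers[opcode][op1_code][op2_code]. */
static uint32_t spec_handler_index(uint32_t opcode, uint32_t op1_code, uint32_t op2_code) {
    assert(op1_code < NUM_OP_TYPES && op2_code < NUM_OP_TYPES);
    return opcode * NUM_OP_TYPES * NUM_OP_TYPES
         + op1_code * NUM_OP_TYPES
         + op2_code;
}
```

For example, opcode 1 with operand codes 1 and 2 maps to 1 * 25 + 1 * 5 + 2 = 32. Every opcode reserves 25 consecutive entries, whether or not all combinations are implemented.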

First, let's define the operand types:

#define OP_TYPE_MAP(XX)      \
    XX(IS_UNUSED, 0)         \
    XX(IS_CONST, 1 << 0)     \
    XX(IS_TMP_VAR, 1 << 1)   \
    XX(IS_VAR, 1 << 2)       \
    XX(IS_CV, 1 << 3)

enum op_type_e {
#define OP_TYPE_GEN(name, value) name = value,
    OP_TYPE_MAP(OP_TYPE_GEN)
#undef OP_TYPE_GEN
};

enum op_type_code_e {
#define OP_TYPE_CODE_GEN(name, value) _##name##_CODE,
    OP_TYPE_MAP(OP_TYPE_CODE_GEN)
#undef OP_TYPE_CODE_GEN
};
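The two enums describe the same five operand types in two numberings: bit flags (0, 1, 2, 4, 8) and dense codes (0 to 4) suitable for indexing. The following C sketch shows one way to convert between them with the same X-macro. The function op_type_to_code and the OP_TYPE_COUNT member are assumptions, and the dense names are written as name##_CODE without the leading underscore, because identifiers that start with an underscore and a capital letter are reserved in C.

```c
#include <assert.h>

#define OP_TYPE_MAP(XX)      \
    XX(IS_UNUSED, 0)         \
    XX(IS_CONST, 1 << 0)     \
    XX(IS_TMP_VAR, 1 << 1)   \
    XX(IS_VAR, 1 << 2)       \
    XX(IS_CV, 1 << 3)

/* Bit-flag values: 0, 1, 2, 4, 8. */
enum op_type_e {
#define OP_TYPE_GEN(name, value) name = value,
    OP_TYPE_MAP(OP_TYPE_GEN)
#undef OP_TYPE_GEN
};

/* Dense codes: 0, 1, 2, 3, 4, followed by the count. */
enum op_type_code_e {
#define OP_TYPE_CODE_GEN(name, value) name##_CODE,
    OP_TYPE_MAP(OP_TYPE_CODE_GEN)
#undef OP_TYPE_CODE_GEN
    OP_TYPE_COUNT
};

static_assert(OP_TYPE_COUNT == 5, "five operand types are expected");

/* Map a bit-flag operand type to its dense code, or -1 if it is not a
 * single known type. */
static int op_type_to_code(int op_type) {
    switch (op_type) {
#define OP_TYPE_CASE_GEN(name, value) case name: return name##_CODE;
    OP_TYPE_MAP(OP_TYPE_CASE_GEN)
#undef OP_TYPE_CASE_GEN
    default: return -1;
    }
}
```

Keeping both numberings in one X-macro list means that adding an operand type in a single place updates the flags, the dense codes and the conversion together.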

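Next we can write our spec handlers. As shown above, there are several operand types. For a given opcode, however, the action to perform is always the same; only the operand types differ, and with them the way the operands are fetched. Writing every handler for every opcode by hand would be very expensive to maintain, so we want a code generation mechanism: write generic template code once and generate the handlers from it.

Here is the template code: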

ZEND_VM_HANDLER(0, ZEND_NOP, CONST|TMPVAR, CONST|TMPVAR)
{
return 0;
}

ZEND_VM_HANDLER(1, ZEND_ADD, CONST|TMPVAR, CONST|TMPVAR)
{
int64_t op1, op2;

op1 = GET_OP1();
op2 = GET_OP2();
op_array->literals[opline->result.var] = op1 + op2;
return 0;
}

ZEND_VM_HANDLER(2, ZEND_SUB, CONST|TMPVAR, CONST|TMPVAR)
{
int64_t op1, op2;

op1 = GET_OP1();
op2 = GET_OP2();
op_array->literals[opline->result.var] = op1 - op2;
return 0;
}

ZEND_VM_HANDLER(3, ZEND_MUL, CONST|TMPVAR, CONST|TMPVAR)
{
int64_t op1, op2;

op1 = GET_OP1();
op2 = GET_OP2();
op_array->literals[opline->result.var] = op1 * op2;
return 0;
}

ZEND_VM_HANDLER(4, ZEND_DIV, CONST|TMPVAR, CONST|TMPVAR)
{
int64_t op1, op2;

op1 = GET_OP1();
op2 = GET_OP2();
op_array->literals[opline->result.var] = op1 / op2;
return 0;
}

ZEND_VM_HANDLER(136, ZEND_ECHO, CONST|TMPVAR, UNUSED)
{
int64_t op1;

op1 = GET_OP1();
printf("%lld", (long long) op1);
return 0;
}

As you can see, it is very simple. The numbers 0, 1, 2, 3, 4 and 136 are the opcode numbers.
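Before any code can be generated, the header line of each template has to be taken apart into the opcode number, the opcode name and the two operand specifications. The C sketch below does this with sscanf for the simple four-argument form used in the templates above. It is an assumption-laden illustration: handler_def and parse_handler_header are hypothetical names, and the character ranges inside the scansets rely on behaviour that common C libraries provide but the C standard leaves implementation-defined.

```c
#include <assert.h>
#include <stdio.h>

typedef struct {
    int code;            /* opcode number, e.g. 1 */
    char name[64];       /* opcode name, e.g. "ZEND_ADD" */
    char op1_spec[64];   /* e.g. "CONST|TMPVAR" */
    char op2_spec[64];   /* e.g. "CONST|TMPVAR" or "UNUSED" */
} handler_def;

/* Returns 1 if line starts with a well-formed
 * ZEND_VM_HANDLER(<number>, <NAME>, <op1 spec>, <op2 spec>) header,
 * 0 otherwise. Spaces in the format match any amount of whitespace. */
static int parse_handler_header(const char *line, handler_def *out) {
    assert(line != NULL && out != NULL);
    char close = 0;
    int n = sscanf(line,
                   "ZEND_VM_HANDLER( %d , %63[A-Z_] , %63[A-Z_|] , %63[A-Z_|] %c",
                   &out->code, out->name, out->op1_spec, out->op2_spec, &close);
    return n == 5 && close == ')';
}
```

Lines that do not match, such as the braces and statements of a handler body, return 0 and can be treated as code belonging to the most recently parsed header.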

Next, we write the code generation script in PHP:

<?php

#define IS_UNUSED 0 /* Unused operand */
#define IS_CONST (1 << 0)
#define IS_TMP_VAR (1 << 1)
#define IS_VAR (1 << 2)
#define IS_CV (1 << 3) /* Compiled variable */

define('ZEND_VM_OP_UNUSED', 1 << 0);
define('ZEND_VM_OP_CONST', 1 << 1);
define('ZEND_VM_OP_TMPVAR', 1 << 2);
define('ZEND_VM_OP_VAR', 1 << 3);
define('ZEND_VM_OP_CV', 1 << 4);

$op_types_map = array(
"UNUSED" => ZEND_VM_OP_UNUSED,
"CONST" => ZEND_VM_OP_CONST,
"TMPVAR" => ZEND_VM_OP_TMPVAR,
"VAR" => ZEND_VM_OP_VAR,
"CV" => ZEND_VM_OP_CV,
);

$op1_get = array(
"UNUSED" => "nullptr",
"CONST" => "opline->op1.num",
"TMPVAR" => "op_array->literals[opline->op1.var]",
"VAR" => "nullptr",
"CV" => "nullptr",
);

$op2_get = array(
"UNUSED" => "nullptr",
"CONST" => "opline->op2.num",
"TMPVAR" => "op_array->literals[opline->op2.var]",
"VAR" => "nullptr",
"CV" => "nullptr",
);

$opcodes = [];
$max_opcode = 0;
$spec_names = [];

function parse_operand_spec($def, $lineno, $str, &$flags)
{
global $op_types_map;

$flags = 0;
$a = explode("|", $str);
foreach ($a as $val) {
if (isset($op_types_map[$val])) {
$flags |= $op_types_map[$val];
} else {
die("ERROR ($def:$lineno): Wrong operand type '$str'\n");
}
}

return array_flip($a);
}

function gen_handler($f, $opcode)
{
global $op1_get, $op2_get, $spec_names, $op_types_map;

$opTypes = array_keys($op_types_map);

foreach ($opTypes as $op1Type) {
foreach ($opTypes as $op2Type) {
if (isset($opcode['op1'][$op1Type]) && isset($opcode['op2'][$op2Type])) {
$specialized_replacements = [
"/GET_OP1\(([^)]*)\)/" => $op1_get[$op1Type],
"/GET_OP2\(([^)]*)\)/" => $op2_get[$op2Type],
];

$name = $opcode['op'];
$templateCode = $opcode['code'];

$spec_name = $name."_SPEC"."_".$op1Type."_".$op2Type;
$spec_names[] = $spec_name;
fputs($f, "static int $spec_name(zend_op_array *op_array, const zend_op *opline) "); // same signature as opcode_handler_t
$code = preg_replace(array_keys($specialized_replacements), array_values($specialized_replacements), $templateCode);
fputs($f, $code);
} else {
$spec_names[] = 'nullptr';
}
}
}
}

function gen_spec_handlers($f)
{
global $spec_names;

fputs($f, "\tstatic const void * const spec_handlers[] = {\n");
foreach ($spec_names as $spec_name) {
fputs($f, "\t\t(void *) $spec_name,\n");
}
fputs($f, "\t};\n");

fputs($f, "\tzend_spec_handlers = spec_handlers;\n");
}

function gen_vm_execute_code($f)
{
fputs($f, "void zend_execute(zend_op_array *op_array) {\n");
fputs($f, "\tfor (size_t i = 0; i < op_array->last; i++) {\n");
fputs($f, "\t\tzend_op *opline = &(op_array->opcodes[i]);\n");
fputs($f, "\t\t((opcode_handler_t)opline->handler)(op_array, opline);\n");
fputs($f, "\t}\n");
fputs($f, "}\n\n");
}

function gen_vm_init_code($f)
{
fputs($f, "void zend_vm_init() {\n");

gen_spec_handlers($f);

fputs($f, "}\n");
}

function gen_executor_code($f)
{
global $opcodes, $max_opcode;

// define
fputs($f, "const void * const *zend_spec_handlers;\n");
fputs($f, "typedef int (*opcode_handler_t) (zend_op_array *op_array, const zend_op *opline);\n\n");

// Generate zend_vm_get_opcode_handler() function

fputs($f, "static uint32_t zend_vm_get_opcode_handler_idx(const zend_op *opline)\n");
fputs($f, "{\n");
fputs($f, "\tstatic int zend_vm_decode[IS_CV + 1] = {0};\n\n");
fputs($f, "\t#define OP_TYPE_CODE_GEN(name, value) zend_vm_decode[name] = _##name##_CODE;\n");
fputs($f, "\t\tOP_TYPE_MAP(OP_TYPE_CODE_GEN)\n");
fputs($f, "\t#undef OP_TYPE_CODE_GEN\n\n");
fputs($f, "\tuint32_t offset = 0;\n");
fputs($f, "\toffset += opline->opcode * 5 * 5;\n");
fputs($f, "\toffset += zend_vm_decode[(int) opline->op1_type] * 5;\n");
fputs($f, "\toffset += zend_vm_decode[(int) opline->op2_type];\n");
fputs($f, "\treturn offset;\n");
fputs($f, "}\n\n");

fputs($f, "const void *zend_vm_get_opcode_handler(const zend_op *opline)\n");
fputs($f, "{\n");
fputs($f, "\tuint32_t offset = zend_vm_get_opcode_handler_idx(opline);\n");
fputs($f, "\treturn zend_spec_handlers[offset];\n");
fputs($f, "}\n\n");

fputs($f, "void zend_vm_set_opcode_handler(zend_op *opline)\n");
fputs($f, "{\n");
fputs($f, "\topline->handler = zend_vm_get_opcode_handler(opline);\n");
fputs($f, "}\n\n");

for ($i = 0; $i <= $max_opcode; $i++) {
// Opcode numbers without a definition get an empty entry, so their table slots are filled with nullptr.
gen_handler($f, $opcodes[$i] ?? []);
}

gen_vm_execute_code($f);

gen_vm_init_code($f);
}

function gen_vm(string $def)
{
global $opcodes, $max_opcode;

$in = file($def);

$lineno = 0;
$handler = 0;

foreach ($in as $line) {
if (strpos($line, "ZEND_VM_HANDLER(") === 0) {
if (preg_match(
"/^ZEND_VM_HANDLER\(\s*([0-9]+)\s*,\s*([A-Z_]+)\s*,\s*([A-Z_|]+)\s*,\s*([A-Z_|]+)\s*(,\s*([A-Z_|]+)\s*)?(,\s*SPEC\(([A-Z_|=,]+)\)\s*)?\)/",
$line,
$m
) == 0) {
die("ERROR ($def:$lineno): Invalid ZEND_VM_HANDLER definition.\n");
}

$code = (int)$m[1];
$op = $m[2];
$op1 = parse_operand_spec($def, $lineno, $m[3], $flags1);
$op2 = parse_operand_spec($def, $lineno, $m[4], $flags2);
$flags = $flags1 | ($flags2 << 8);

if ($code > $max_opcode) {
$max_opcode = $code;
}

if (isset($opcodes[$code])) {
die("ERROR ($def:$lineno): Opcode with code '$code' is already defined.\n");
}
if (isset($opnames[$op])) {
die("ERROR ($def:$lineno): Opcode with name '$op' is already defined.\n");
}
$handler = $code;

$opcodes[$code] = array("op"=>$op,"op1"=>$op1,"op2"=>$op2,"code"=>"","flags"=>$flags);
} else {
$opcodes[$handler]['code'] .= $line;
}
}

ksort($opcodes);

$f = fopen(__DIR__ . "/zend_vm_opcodes.h", "w+") or die("ERROR: Cannot create zend_vm_opcodes.h\n");
fputs($f, "#pragma once\n\n");

foreach ($opcodes as $code => $dsc) {
$op = str_pad($dsc["op"], 20);
fputs($f, "#define $op $code\n");
}
fclose($f);
echo "zend_vm_opcodes.h generated successfully.\n";

$f = fopen(__DIR__ . "/zend_vm_opcodes.cc", "w+") or die("ERROR: Cannot create zend_vm_opcodes.cc\n");
fputs($f, "#include \"zend_vm_opcodes.h\"\n\n");

fputs($f, "static const char *zend_vm_opcodes_names[".($max_opcode + 1)."] = {\n");
for ($i = 0; $i <= $max_opcode; $i++) {
fputs($f, "\t".(isset($opcodes[$i]["op"])?'"'.$opcodes[$i]["op"].'"':"nullptr").",\n");
}
fputs($f, "};\n\n");

fputs($f, "const char* zend_get_opcode_name(unsigned char opcode) {\n"); // unsigned: opcode 136 does not fit in a signed char
fputs($f, "\treturn zend_vm_opcodes_names[opcode];\n");
fputs($f, "}\n");

fclose($f);
echo "zend_vm_opcodes.cc generated successfully.\n";

$f = fopen(__DIR__ . "/zend_vm_execute.h", "w+") or die("ERROR: Cannot create zend_vm_execute.h\n");
fputs($f, "#pragma once\n\n");
fputs($f, "#include <stdint.h>\n");
fputs($f, "#include <stddef.h>\n");
fputs($f, "#include \"zend_compile.h\"\n\n");

gen_executor_code($f);
fclose($f);
echo "zend_vm_execute.h generated successfully.\n";
}

gen_vm(__DIR__ . "/zend_vm_def.h");

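Running this script generates the files zend_vm_opcodes.h, zend_vm_opcodes.cc and zend_vm_execute.h.

They contain two core functions: zend_vm_init and zend_execute.

zend_vm_init fills a block of memory with the addresses of our spec handlers, so that a spec handler can be found with the formula described above.

zend_execute is very simple: it just executes each opline in turn.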
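To show how the generated pieces fit together, here is a hand-written C miniature of the same design: specialized handlers, a handler table indexed by the formula, a pass that stores a handler in each opline, and the execution loop. Everything in it is a simplified assumption (the toy_ and TOY_ names, immediate integer operands, a fixed array of eight temporary slots without bounds checks); it is not the output of the script.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { CODE_UNUSED, CODE_CONST, CODE_TMPVAR, CODE_VAR, CODE_CV, NUM_OP_TYPES };
enum { TOY_NOP = 0, TOY_ADD = 1, NUM_OPCODES };

typedef struct toy_op toy_op;
typedef struct toy_op_array toy_op_array;
typedef int (*toy_handler_t)(toy_op_array *op_array, const toy_op *opline);

struct toy_op {
    toy_handler_t handler;  /* chosen once by toy_pass_two */
    int64_t op1, op2;       /* immediate value (CONST) or slot number (TMPVAR) */
    uint32_t result;        /* slot that receives the result */
    uint8_t opcode, op1_code, op2_code;
};

struct toy_op_array {
    toy_op *opcodes;
    size_t last;
    int64_t slots[8];       /* temporaries */
};

/* One body per operand combination; only the operand fetch differs. */
static int TOY_ADD_SPEC_CONST_CONST(toy_op_array *a, const toy_op *op) {
    a->slots[op->result] = op->op1 + op->op2;
    return 0;
}
static int TOY_ADD_SPEC_CONST_TMPVAR(toy_op_array *a, const toy_op *op) {
    a->slots[op->result] = op->op1 + a->slots[op->op2];
    return 0;
}
static int TOY_ADD_SPEC_TMPVAR_CONST(toy_op_array *a, const toy_op *op) {
    a->slots[op->result] = a->slots[op->op1] + op->op2;
    return 0;
}
static int TOY_ADD_SPEC_TMPVAR_TMPVAR(toy_op_array *a, const toy_op *op) {
    a->slots[op->result] = a->slots[op->op1] + a->slots[op->op2];
    return 0;
}

static toy_handler_t spec_handlers[NUM_OPCODES * NUM_OP_TYPES * NUM_OP_TYPES];

static size_t spec_index(unsigned opcode, unsigned c1, unsigned c2) {
    assert(opcode < NUM_OPCODES && c1 < NUM_OP_TYPES && c2 < NUM_OP_TYPES);
    return (size_t)opcode * NUM_OP_TYPES * NUM_OP_TYPES + c1 * NUM_OP_TYPES + c2;
}

/* Fill the table; combinations without a handler stay NULL. */
static void toy_vm_init(void) {
    spec_handlers[spec_index(TOY_ADD, CODE_CONST, CODE_CONST)] = TOY_ADD_SPEC_CONST_CONST;
    spec_handlers[spec_index(TOY_ADD, CODE_CONST, CODE_TMPVAR)] = TOY_ADD_SPEC_CONST_TMPVAR;
    spec_handlers[spec_index(TOY_ADD, CODE_TMPVAR, CODE_CONST)] = TOY_ADD_SPEC_TMPVAR_CONST;
    spec_handlers[spec_index(TOY_ADD, CODE_TMPVAR, CODE_TMPVAR)] = TOY_ADD_SPEC_TMPVAR_TMPVAR;
}

/* Resolve the handler of every opline once, before execution. */
static void toy_pass_two(toy_op_array *a) {
    for (size_t i = 0; i < a->last; i++) {
        toy_op *op = &a->opcodes[i];
        op->handler = spec_handlers[spec_index(op->opcode, op->op1_code, op->op2_code)];
    }
}

/* Run the oplines in order; -1 means an opline had no handler. */
static int toy_execute(toy_op_array *a) {
    for (size_t i = 0; i < a->last; i++) {
        const toy_op *op = &a->opcodes[i];
        if (op->handler == NULL) return -1;
        op->handler(a, op);
    }
    return 0;
}
```

The lookup cost is paid once per opline in toy_pass_two; the loop in toy_execute then only performs an indirect call per opline.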

After that, we only need to set the handler for each opline. The code is:

// set opcode spec handler
void pass_two(zend_op_array *op_array) {
    for (size_t i = 0; i < op_array->last; i++) {
        zend_op *opline = &(op_array->opcodes[i]);
        zend_vm_set_opcode_handler(opline);
    }
}

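Finally, we call zend_vm_init, pass_two and zend_execute in the file zend_language_parser.y.

How to find the exact handler name for an opline

When analyzing the handler for an opcode, we usually infer the specific handler from the opcode naming rules. With PHP 8, however, we can use the JIT debug facility to see the handler for an opcode quickly. Here is an example.

Given the following code: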

$a = [1, 2, 3];
$a[2];

The handler for $a[2] has a rather long name that is hard to work out in one go. With a small php.ini configuration we can get the handler easily:

zend_extension=opcache.so
opcache.enable=1
opcache.enable_cli=1

opcache.jit=1201
opcache.jit_buffer_size=64M
opcache.jit_debug=0x01

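In this configuration, the last digit of opcache.jit=1201 selects the minimal JIT level, in which the generated code still calls the standard VM handlers, and opcache.jit_debug=0x01 dumps the generated assembly. The output is: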

JIT$/Users/hantaohuang/codeDir/cCode/php-src/test.php: ; (/Users/hantaohuang/codeDir/cCode/php-src/test.php)
# other assembly omitted
mov $ZEND_ASSIGN_SPEC_CV_CONST_RETVAL_UNUSED_HANDLER, %rax
# other assembly omitted
mov $ZEND_FETCH_DIM_R_INDEX_SPEC_CV_CONST_HANDLER, %rax
# other assembly omitted

As shown, the handler is ZEND_FETCH_DIM_R_INDEX_SPEC_CV_CONST_HANDLER.

How the PHP kernel generates zend_vm_opcodes.h

This article is based on PHP 8 commit 14806e0824ecd598df74cac855868422e44aea53.

First, zend_vm_opcodes.h is generated by the script Zend/zend_vm_gen.php. The script takes zend_vm_def.h and zend_vm_execute.skl as input and generates the files zend_vm_execute.h and zend_vm_opcodes.h.

+---------------+      +---------------------+
| zend_vm_def.h |      | zend_vm_execute.skl |
+---------------+      +---------------------+
        |                         |
        +------------+------------+
                     |
                     v
            +-----------------+
            | zend_vm_gen.php |
            +-----------------+
                     |
        +------------+------------+
        |                         |
        v                         v
+-------------------+  +-------------------+
| zend_vm_opcodes.h |  | zend_vm_execute.h |
+-------------------+  +-------------------+

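Taking zend_vm_gen.php as the starting point of the analysis, let's look at the key steps that generate zend_vm_opcodes.h and zend_vm_execute.h.

The first is the function gen_vm, which scans the code in zend_vm_def.h line by line.

When it reaches a ZEND_VM_HELPER, it executes the following code: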

if (strpos($line,"ZEND_VM_HELPER(") === 0 ||
strpos($line,"ZEND_VM_INLINE_HELPER(") === 0 ||
strpos($line,"ZEND_VM_COLD_HELPER(") === 0 ||
strpos($line,"ZEND_VM_HOT_HELPER(") === 0) {
// Parsing helper's definition
if (preg_match(
"/^ZEND_VM(_INLINE|_COLD|_HOT)?_HELPER\(\s*([A-Za-z_]+)\s*,\s*([A-Z_|]+)\s*,\s*([A-Z_|]+)\s*(?:,\s*SPEC\(([A-Z_|=,]+)\)\s*)?(?:,\s*([^)]*)\s*)?\)/",
$line,
$m) == 0) {
die("ERROR ($def:$lineno): Invalid ZEND_VM_HELPER definition.\n");
}
$inline = !empty($m[1]) && $m[1] === "_INLINE";
$cold = !empty($m[1]) && $m[1] === "_COLD";
$hot = !empty($m[1]) && $m[1] === "_HOT";
$helper = $m[2];
$op1 = parse_operand_spec($def, $lineno, $m[3], $flags1);
$op2 = parse_operand_spec($def, $lineno, $m[4], $flags2);
$param = isset($m[6]) ? $m[6] : null;
if (isset($helpers[$helper])) {
die("ERROR ($def:$lineno): Helper with name '$helper' is already defined.\n");
}

// Store parameters
if (ZEND_VM_KIND == ZEND_VM_KIND_GOTO
|| ZEND_VM_KIND == ZEND_VM_KIND_SWITCH
|| (ZEND_VM_KIND == ZEND_VM_KIND_HYBRID && $hot)) {
foreach (explode(",", $param) as $p) {
$p = trim($p);
if ($p !== "") {
$params[$p] = 1;
}
}
}

$helpers[$helper] = array("op1"=>$op1,"op2"=>$op2,"param"=>$param,"code"=>"","inline"=>$inline,"cold"=>$cold,"hot"=>$hot);

if (!empty($m[5])) {
$helpers[$helper]["spec"] = parse_spec_rules($def, $lineno, $m[5]);
}

$handler = null;
$list[$lineno] = array("helper"=>$helper);

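We won't dig into the details of this code. In short, it matches the ZEND_VM_HELPER on the current line of zend_vm_def.h with a regular expression and stores the relevant information in the global variable $helpers. For example: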

ZEND_VM_HELPER(zend_add_helper, ANY, ANY, zval *op_1, zval *op_2)
=>
[
"zend_add_helper" =>
[
"op1" => [
ANY:0
],
"op2" => [
ANY:0
],
"param" => "zval *op_1, zval *op_2",
"code" => "",
"inline" => false,
"cold" => false,
"hot" => false,
]
]

Then the following branch:

else if ($handler !== null) {
// Add line of code to current opcode handler
$opcodes[$handler]["code"] .= $line;
} else if ($helper !== null) {
// Add line of code to current helper
$helpers[$helper]["code"] .= $line;
}

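appends the lines of code from zend_vm_def.h. For code that belongs to a ZEND_VM_HELPER, it executes $helpers[$helper]["code"] .= $line;. For example, once all lines have been appended, we get the following information: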

ZEND_VM_HELPER(zend_add_helper, ANY, ANY, zval *op_1, zval *op_2)
{
USE_OPLINE

SAVE_OPLINE();
if (UNEXPECTED(Z_TYPE_INFO_P(op_1) == IS_UNDEF)) {
op_1 = ZVAL_UNDEFINED_OP1();
}
if (UNEXPECTED(Z_TYPE_INFO_P(op_2) == IS_UNDEF)) {
op_2 = ZVAL_UNDEFINED_OP2();
}
add_function(EX_VAR(opline->result.var), op_1, op_2);
if (OP1_TYPE & (IS_TMP_VAR|IS_VAR)) {
zval_ptr_dtor_nogc(op_1);
}
if (OP2_TYPE & (IS_TMP_VAR|IS_VAR)) {
zval_ptr_dtor_nogc(op_2);
}
ZEND_VM_NEXT_OPCODE_CHECK_EXCEPTION();
}
=>
[
"zend_add_helper" =>
[
"op1" => [
ANY:0
],
"op2" => [
ANY:0
],
"param" => "zval *op_1, zval *op_2",
"code" => " USE_OPLINE
SAVE_OPLINE();
if (UNEXPECTED(Z_TYPE_INFO_P(op_1) == IS_UNDEF)) {
op_1 = ZVAL_UNDEFINED_OP1();
}
if (UNEXPECTED(Z_TYPE_INFO_P(op_2) == IS_UNDEF)) {
op_2 = ZVAL_UNDEFINED_OP2();
}
add_function(EX_VAR(opline->result.var), op_1, op_2);
if (OP1_TYPE & (IS_TMP_VAR|IS_VAR)) {
zval_ptr_dtor_nogc(op_1);
}
if (OP2_TYPE & (IS_TMP_VAR|IS_VAR)) {
zval_ptr_dtor_nogc(op_2);
}
ZEND_VM_NEXT_OPCODE_CHECK_EXCEPTION();",
"inline" => false,
"cold" => false,
"hot" => false,
]
]

When it reaches a ZEND_VM_HANDLER, it executes the following code:

if (strpos($line,"ZEND_VM_HANDLER(") === 0 ||
strpos($line,"ZEND_VM_INLINE_HANDLER(") === 0 ||
strpos($line,"ZEND_VM_HOT_HANDLER(") === 0 ||
strpos($line,"ZEND_VM_HOT_NOCONST_HANDLER(") === 0 ||
strpos($line,"ZEND_VM_HOT_NOCONSTCONST_HANDLER(") === 0 ||
strpos($line,"ZEND_VM_HOT_SEND_HANDLER(") === 0 ||
strpos($line,"ZEND_VM_HOT_OBJ_HANDLER(") === 0 ||
strpos($line,"ZEND_VM_COLD_HANDLER(") === 0 ||
strpos($line,"ZEND_VM_COLD_CONST_HANDLER(") === 0 ||
strpos($line,"ZEND_VM_COLD_CONSTCONST_HANDLER(") === 0) {
// Parsing opcode handler's definition
if (preg_match(
"/^ZEND_VM_(HOT_|INLINE_|HOT_OBJ_|HOT_SEND_|HOT_NOCONST_|HOT_NOCONSTCONST_|COLD_|COLD_CONST_|COLD_CONSTCONST_)?HANDLER\(\s*([0-9]+)\s*,\s*([A-Z_]+)\s*,\s*([A-Z_|]+)\s*,\s*([A-Z_|]+)\s*(,\s*([A-Z_|]+)\s*)?(,\s*SPEC\(([A-Z_|=,]+)\)\s*)?\)/",
$line,
$m) == 0) {
die("ERROR ($def:$lineno): Invalid ZEND_VM_HANDLER definition.\n");
}
$hot = !empty($m[1]) ? $m[1] : false;
$code = (int)$m[2];
$op = $m[3];
$len = strlen($op);
$op1 = parse_operand_spec($def, $lineno, $m[4], $flags1);
$op2 = parse_operand_spec($def, $lineno, $m[5], $flags2);
$flags = $flags1 | ($flags2 << 8);
if (!empty($m[7])) {
$flags |= parse_ext_spec($def, $lineno, $m[7]);
}

if ($len > $max_opcode_len) {
$max_opcode_len = $len;
}
if ($code > $max_opcode) {
$max_opcode = $code;
}
if (isset($opcodes[$code])) {
die("ERROR ($def:$lineno): Opcode with code '$code' is already defined.\n");
}
if (isset($opnames[$op])) {
die("ERROR ($def:$lineno): Opcode with name '$op' is already defined.\n");
}
$opcodes[$code] = array("op"=>$op,"op1"=>$op1,"op2"=>$op2,"code"=>"","flags"=>$flags,"hot"=>$hot);
if (isset($m[9])) {
$opcodes[$code]["spec"] = parse_spec_rules($def, $lineno, $m[9]);
if (isset($opcodes[$code]["spec"]["NO_CONST_CONST"])) {
$opcodes[$code]["flags"] |= $vm_op_flags["ZEND_VM_NO_CONST_CONST"];
}
if (isset($opcodes[$code]["spec"]["COMMUTATIVE"])) {
$opcodes[$code]["flags"] |= $vm_op_flags["ZEND_VM_COMMUTATIVE"];
}
}
$opnames[$op] = $code;
$handler = $code;
$helper = null;
$list[$lineno] = array("handler"=>$handler);
}

Again, we won't go into the details. In short, this code uses a regular expression to match the ZEND_VM_HANDLER on the current line of zend_vm_def.h and stores the relevant information in the global variable $opcodes. For example:

ZEND_VM_HOT_NOCONSTCONST_HANDLER(1, ZEND_ADD, CONST|TMPVARCV, CONST|TMPVARCV)
=>
[
1 =>
[
"op" => "ZEND_ADD",
"op1" => [
"CONST" => 0,
"TMPVARCV" => 1
],
"op2" => [
"CONST" => 0,
"TMPVARCV" => 1
],
"code" => "",
"flags" => 2827,
"hot" => "HOT_NOCONSTCONST_"
]
]

In this result:

1 => [
"op" => "ZEND_ADD"
]

This is simply the 1 and ZEND_ADD from ZEND_VM_HOT_NOCONSTCONST_HANDLER(1, ZEND_ADD, CONST|TMPVARCV, CONST|TMPVARCV). They are used to define the opcode, which corresponds to this line in zend_vm_opcodes.h:

#define ZEND_ADD 1

Next, the operand entries:

"CONST" => 0,
"TMPVARCV" => 1

These are the positions of the individual types within CONST|TMPVARCV. In effect, they are what you get from:

array_flip(explode("|", "CONST|TMPVARCV"))

That expression yields the index mapping shown above.

As for the flags entry:

"flags" => 2827

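It is computed as (CONST|TMPVARCV) | ((CONST|TMPVARCV) << 8), that is, the op1 flags go into the low byte and the op2 flags are shifted into the next byte. The values of CONST and TMPVARCV can be found in the $vm_op_decode variable in zend_vm_gen.php.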
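
As a worked check of the value 2827, here is a small C sketch of the same calculation. ZEND_VM_OP_SPEC and ZEND_VM_OP_CONST use the values from $vm_op_flags; the value of ZEND_VM_OP_TMPVARCV and the rule that each operand type also carries ZEND_VM_OP_SPEC are assumptions, chosen because they reproduce 2827. The function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define ZEND_VM_OP_SPEC     (1u << 0)
#define ZEND_VM_OP_CONST    (1u << 1)
#define ZEND_VM_OP_TMPVARCV (1u << 3)   /* assumed value, see lead-in */

/* Assumed decode table: every operand type also carries ZEND_VM_OP_SPEC. */
static uint32_t decode_type(const char *name, size_t len)
{
    if (len == 5 && strncmp(name, "CONST", 5) == 0) {
        return ZEND_VM_OP_SPEC | ZEND_VM_OP_CONST;
    }
    if (len == 8 && strncmp(name, "TMPVARCV", 8) == 0) {
        return ZEND_VM_OP_SPEC | ZEND_VM_OP_TMPVARCV;
    }
    return 0;
}

/* Splits "A|B|..." on '|', ORs the bits of each alternative and
 * reports how many alternatives were seen in *count. */
static uint32_t operand_flags(const char *spec, int *count)
{
    uint32_t flags = 0;
    int n = 0;
    while (*spec != '\0') {
        size_t len = strcspn(spec, "|");
        flags |= decode_type(spec, len);
        n++;
        spec += len;
        if (*spec == '|') {
            spec++;
        }
    }
    *count = n;
    return flags;
}

/* op1 flags in the low byte, op2 flags in the next byte. */
static uint32_t handler_flags(const char *op1_spec, const char *op2_spec)
{
    int n1, n2;
    uint32_t f1 = operand_flags(op1_spec, &n1);
    uint32_t f2 = operand_flags(op2_spec, &n2);
    return f1 | (f2 << 8);
}
```

Under these assumptions each operand contributes 0x0b, so `handler_flags("CONST|TMPVARCV", "CONST|TMPVARCV")` gives 0x0b | (0x0b << 8) = 0x0b0b = 2827.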

Next, for a ZEND_VM_HANDLER, each following line is appended with $opcodes[$handler]["code"] .= $line;, just as for ZEND_VM_HELPER.

The generator then starts writing zend_vm_opcodes.h:

// Generate opcode #defines (zend_vm_opcodes.h)
$code_len = strlen((string)$max_opcode);
$f = fopen(__DIR__ . "/zend_vm_opcodes.h", "w+") or die("ERROR: Cannot create zend_vm_opcodes.h\n");

// Insert header
out($f, HEADER_TEXT);
fputs($f, "#ifndef ZEND_VM_OPCODES_H\n#define ZEND_VM_OPCODES_H\n\n");
fputs($f, "#define ZEND_VM_SPEC\t\t" . ZEND_VM_SPEC . "\n");
fputs($f, "#define ZEND_VM_LINES\t\t" . ZEND_VM_LINES . "\n");
fputs($f, "#define ZEND_VM_KIND_CALL\t" . ZEND_VM_KIND_CALL . "\n");
fputs($f, "#define ZEND_VM_KIND_SWITCH\t" . ZEND_VM_KIND_SWITCH . "\n");
fputs($f, "#define ZEND_VM_KIND_GOTO\t" . ZEND_VM_KIND_GOTO . "\n");
fputs($f, "#define ZEND_VM_KIND_HYBRID\t" . ZEND_VM_KIND_HYBRID . "\n");
if ($GLOBALS["vm_kind_name"][ZEND_VM_KIND] === "ZEND_VM_KIND_HYBRID") {
fputs($f, "/* HYBRID requires support for computed GOTO and global register variables*/\n");
fputs($f, "#if (defined(__GNUC__) && defined(HAVE_GCC_GLOBAL_REGS))\n");
fputs($f, "# define ZEND_VM_KIND\t\tZEND_VM_KIND_HYBRID\n");
fputs($f, "#else\n");
fputs($f, "# define ZEND_VM_KIND\t\tZEND_VM_KIND_CALL\n");
fputs($f, "#endif\n");
} else {
fputs($f, "#define ZEND_VM_KIND\t\t" . $GLOBALS["vm_kind_name"][ZEND_VM_KIND] . "\n");
}
fputs($f, "\n");

This code is straightforward: it writes these definitions directly into zend_vm_opcodes.h.

foreach($vm_op_flags as $name => $val) {
fprintf($f, "#define %-24s 0x%08x\n", $name, $val);
}

This code writes the contents of $vm_op_flags from zend_vm_gen.php into zend_vm_opcodes.h in hexadecimal:

$vm_op_flags = array(
    "ZEND_VM_OP_SPEC" => 1<<0,
    "ZEND_VM_OP_CONST" => 1<<1,
    // others omitted
);

=>

#define ZEND_VM_OP_SPEC          0x00000001
#define ZEND_VM_OP_CONST         0x00000002
// others omitted

Next:

foreach ($opcodes as $code => $dsc) {
$code = str_pad((string)$code,$code_len," ",STR_PAD_LEFT);
$op = str_pad($dsc["op"],$max_opcode_len);
if ($code <= $max_opcode) {
fputs($f,"#define $op $code\n");
}
}

This uses the $opcodes collected above to define the opcodes, for example:

#define ZEND_NOP                          0
#define ZEND_ADD                          1
// others omitted

Next:

$code = str_pad((string)$max_opcode,$code_len," ",STR_PAD_LEFT);
$op = str_pad("ZEND_VM_LAST_OPCODE",$max_opcode_len);
fputs($f,"\n#define $op $code\n");

fputs($f, "\n#endif\n");

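This defines the largest opcode number in the PHP kernel, which tells us how many opcodes there are (numbers 0 through 199 make 200 in total). For example:
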
#define ZEND_VM_LAST_OPCODE             199

At this point zend_vm_opcodes.h has been generated. Next, zend_vm_opcodes.c is generated.

In it:

fputs($f,"static const char *zend_vm_opcodes_names[".($max_opcode + 1)."] = {\n");
for ($i = 0; $i <= $max_opcode; $i++) {
fputs($f,"\t".(isset($opcodes[$i]["op"])?'"'.$opcodes[$i]["op"].'"':"NULL").",\n");
}
fputs($f, "};\n\n");

This defines the names of all opcodes, for example:

static const char *zend_vm_opcodes_names[200] = {
    "ZEND_NOP",
    "ZEND_ADD",
    // others omitted
};

The index into the zend_vm_opcodes_names array is the id of the opcode. So, to get the name of an opcode, we can write:

zend_vm_opcodes_names[ZEND_ADD]
=>
"ZEND_ADD"

Next:

fputs($f,"static uint32_t zend_vm_opcodes_flags[".($max_opcode + 1)."] = {\n");
for ($i = 0; $i <= $max_opcode; $i++) {
fprintf($f, "\t0x%08x,\n", isset($opcodes[$i]["flags"]) ? $opcodes[$i]["flags"] : 0);
}
fputs($f, "};\n\n");

This defines the flags of each opcode. For example:

static uint32_t zend_vm_opcodes_flags[200] = {
    0x00000000,
    0x00000b0b,
    // others omitted
};

We have already covered how the flags value is computed; to recap:

$flags = $flags1 | ($flags2 << 8);

Next:

fputs($f, "ZEND_API const char* ZEND_FASTCALL zend_get_opcode_name(zend_uchar opcode) {\n");
fputs($f, "\tif (UNEXPECTED(opcode > ZEND_VM_LAST_OPCODE)) {\n");
fputs($f, "\t\treturn NULL;\n");
fputs($f, "\t}\n");
fputs($f, "\treturn zend_vm_opcodes_names[opcode];\n");
fputs($f, "}\n");

This defines a function that returns the name of an opcode. The generated result is:

ZEND_API const char* ZEND_FASTCALL zend_get_opcode_name(zend_uchar opcode) {
    if (UNEXPECTED(opcode > ZEND_VM_LAST_OPCODE)) {
        return NULL;
    }
    return zend_vm_opcodes_names[opcode];
}

It first checks whether the opcode exists; if it does, the name is returned, otherwise NULL is returned.

Next:

fputs($f, "ZEND_API uint32_t ZEND_FASTCALL zend_get_opcode_flags(zend_uchar opcode) {\n");
fputs($f, "\tif (UNEXPECTED(opcode > ZEND_VM_LAST_OPCODE)) {\n");
fputs($f, "\t\topcode = ZEND_NOP;\n");
fputs($f, "\t}\n");
fputs($f, "\treturn zend_vm_opcodes_flags[opcode];\n");
fputs($f, "}\n");

This defines a function that returns the flags of an opcode. The generated result is:

ZEND_API uint32_t ZEND_FASTCALL zend_get_opcode_flags(zend_uchar opcode) {
    if (UNEXPECTED(opcode > ZEND_VM_LAST_OPCODE)) {
        opcode = ZEND_NOP;
    }
    return zend_vm_opcodes_flags[opcode];
}

It first checks whether the opcode exists; if it does, its flags are returned, otherwise the flags of ZEND_NOP (that is, 0) are returned.
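
The two accessors follow the same pattern: a table indexed by opcode number, guarded by a bounds check. A minimal stand-alone sketch with a two-entry table follows; all `demo_` and `DEMO_` names are hypothetical and only mimic the generated code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical two-opcode instruction set. */
enum { DEMO_NOP = 0, DEMO_ADD = 1, DEMO_LAST_OPCODE = 1 };

static const char *const demo_names[DEMO_LAST_OPCODE + 1] = {
    "DEMO_NOP",
    "DEMO_ADD",
};

static const uint32_t demo_flags[DEMO_LAST_OPCODE + 1] = {
    0x00000000,
    0x00000b0b,
};

/* Unknown opcodes have no name. */
static const char *demo_get_opcode_name(uint8_t opcode)
{
    if (opcode > DEMO_LAST_OPCODE) {
        return NULL;
    }
    return demo_names[opcode];
}

/* Unknown opcodes fall back to the flags of the no-op entry. */
static uint32_t demo_get_opcode_flags(uint8_t opcode)
{
    if (opcode > DEMO_LAST_OPCODE) {
        opcode = DEMO_NOP;
    }
    return demo_flags[opcode];
}
```

The different fallbacks matter to callers: a NULL name has to be checked before printing, while the flags accessor always returns a usable value.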

At this point zend_vm_opcodes.c has been generated. Next, zend_vm_execute.h is generated.

// Support for ZEND_USER_OPCODE
out($f, "static user_opcode_handler_t zend_user_opcode_handlers[256] = {\n");
for ($i = 0; $i < 255; ++$i) {
out($f, "\t(user_opcode_handler_t)NULL,\n");
}
out($f, "\t(user_opcode_handler_t)NULL\n};\n\n");

This defines the zend_user_opcode_handlers array, whose entries are all NULL initially. The generated result is:

static user_opcode_handler_t zend_user_opcode_handlers[256] = {
    (user_opcode_handler_t)NULL,
    (user_opcode_handler_t)NULL,
    (user_opcode_handler_t)NULL,
    // others omitted
    (user_opcode_handler_t)NULL
};

Next:

out($f, "static zend_uchar zend_user_opcodes[256] = {");
for ($i = 0; $i < 255; ++$i) {
if ($i % 16 == 1) out($f, "\n\t");
out($f, "$i,");
}
out($f, "255\n};\n\n");

This defines zend_user_opcodes. The generated result is:

static zend_uchar zend_user_opcodes[256] = {0,
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,
17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,
33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,
49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,
65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,
81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,
97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,
113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,
129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,
145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,
161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,
177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,
193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,
209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,
225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,
241,242,243,244,245,246,247,248,249,250,251,252,253,254,255
};

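In other words, 256 zend_user_opcodes are supported in total.

Next, gen_executor is called to generate code from the template file Zend/zend_vm_execute.skl. This function also scans zend_vm_execute.skl line by line.

The first line of zend_vm_execute.skl is:
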
{%DEFINES%}

This means that some definitions have to be generated in zend_vm_execute.h. They are produced as follows:

out($f,"#define SPEC_START_MASK        0x0000ffff\n");
out($f,"#define SPEC_EXTRA_MASK 0xfffc0000\n");
out($f,"#define SPEC_RULE_OP1 0x00010000\n");
out($f,"#define SPEC_RULE_OP2 0x00020000\n");
out($f,"#define SPEC_RULE_OP_DATA 0x00040000\n");
out($f,"#define SPEC_RULE_RETVAL 0x00080000\n");
out($f,"#define SPEC_RULE_QUICK_ARG 0x00100000\n");
out($f,"#define SPEC_RULE_SMART_BRANCH 0x00200000\n");
out($f,"#define SPEC_RULE_COMMUTATIVE 0x00800000\n");
out($f,"#define SPEC_RULE_ISSET 0x01000000\n");
out($f,"#define SPEC_RULE_OBSERVER 0x02000000\n");

These are rules that describe the operands of an opcode. For example, SPEC_RULE_OP1 means that operand 1 is used and that it supports at least two types. The corresponding code is:

if (isset($dsc["op1"]) && !isset($dsc["op1"]["ANY"])) {
$count = 0;
foreach ($op_types_ex as $t) {
if (isset($dsc["op1"][$t])) {
$def_op1_type = $t;
$count++;
}
}
if ($count > 1) {
$spec_op1 = true;
$specs[$num] .= " | SPEC_RULE_OP1";
$def_op1_type = "ANY";
}
}
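
The same decision can be sketched in C by treating the accepted operand types as a bitmask. The bit positions and the name `op1_spec_rule` are hypothetical; only SPEC_RULE_OP1 and its value come from the definitions above.

```c
#include <stdint.h>

#define SPEC_RULE_OP1 0x00010000u

/* One bit per entry of $op_types_ex; the bit positions are made up
 * for this sketch. */
enum {
    T_ANY      = 1 << 0,
    T_CONST    = 1 << 1,
    T_TMPVARCV = 1 << 2,
    T_TMPVAR   = 1 << 3,
    T_TMP      = 1 << 4,
    T_VAR      = 1 << 5,
    T_UNUSED   = 1 << 6,
    T_CV       = 1 << 7
};

/* Returns SPEC_RULE_OP1 when operand 1 accepts more than one concrete
 * type, and 0 when it accepts ANY, nothing, or exactly one type. */
static uint32_t op1_spec_rule(unsigned accepted)
{
    if (accepted == 0 || (accepted & T_ANY)) {
        return 0;
    }
    int count = 0;
    for (unsigned bits = accepted; bits != 0; bits >>= 1) {
        count += (int)(bits & 1u);
    }
    return count > 1 ? SPEC_RULE_OP1 : 0;
}
```

For an operand declared as CONST|TMPVARCV the count is 2, so the rule is added; for a single type or ANY there is nothing to specialize on.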

Next:

out($f,"static const uint32_t *zend_spec_handlers;\n");
out($f,"static const void * const *zend_opcode_handlers;\n");
out($f,"static int zend_handlers_count;\n");
if ($kind == ZEND_VM_KIND_HYBRID) {
out($f,"#if (ZEND_VM_KIND == ZEND_VM_KIND_HYBRID)\n");
out($f,"static const void * const * zend_opcode_handler_funcs;\n");
out($f,"static zend_op hybrid_halt_op;\n");
out($f,"#endif\n");
}
out($f,"#if (ZEND_VM_KIND != ZEND_VM_KIND_HYBRID) || !ZEND_VM_SPEC\n");
out($f,"static const void *zend_vm_get_opcode_handler(zend_uchar opcode, const zend_op* op);\n");
out($f,"#endif\n\n");
if ($kind == ZEND_VM_KIND_HYBRID) {
out($f,"#if (ZEND_VM_KIND == ZEND_VM_KIND_HYBRID)\n");
out($f,"static const void *zend_vm_get_opcode_handler_func(zend_uchar opcode, const zend_op* op);\n");
out($f,"#else\n");
out($f,"# define zend_vm_get_opcode_handler_func zend_vm_get_opcode_handler\n");
out($f,"#endif\n\n");
}

This defines some variables and functions depending on ZEND_VM_KIND. The generated result is:

static const uint32_t *zend_spec_handlers;
static const void * const *zend_opcode_handlers;
static int zend_handlers_count;
#if (ZEND_VM_KIND == ZEND_VM_KIND_HYBRID)
static const void * const * zend_opcode_handler_funcs;
static zend_op hybrid_halt_op;
#endif
#if (ZEND_VM_KIND != ZEND_VM_KIND_HYBRID) || !ZEND_VM_SPEC
static const void *zend_vm_get_opcode_handler(zend_uchar opcode, const zend_op* op);
#endif

#if (ZEND_VM_KIND == ZEND_VM_KIND_HYBRID)
static const void *zend_vm_get_opcode_handler_func(zend_uchar opcode, const zend_op* op);
#else
# define zend_vm_get_opcode_handler_func zend_vm_get_opcode_handler
#endif

zend_vm_gen.php uses ZEND_VM_KIND_HYBRID mode by default.

Next comes a long stretch of code that defines macros such as:

HYBRID_NEXT()
HYBRID_SWITCH()
HYBRID_CASE(op)
HYBRID_BREAK()
HYBRID_DEFAULT

Next, gen_executor_code is called to generate the detailed handlers for each opcode. For example, operands can have the following types:

$op_types_ex = array(
"ANY",
"CONST",
"TMPVARCV",
"TMPVAR",
"TMP",
"VAR",
"UNUSED",
"CV",
);

So an opcode can have at most (number of op1 types) × (number of op2 types) handlers. Hence the following code:

// Produce specialized executor
$op1t = $op_types_ex;
// for each op1.op_type
foreach($op1t as $op1) {
$op2t = $op_types_ex;
// for each op2.op_type
foreach($op2t as $op2) {
// for each handlers in helpers in original order
foreach ($list as $lineno => $dsc) {
if (isset($dsc["handler"])) {
$num = $dsc["handler"];
foreach (extra_spec_handler($opcodes[$num]) as $extra_spec) {
// Check if handler accepts such types of operands (op1 and op2)
if (isset($opcodes[$num]["op1"][$op1]) &&
isset($opcodes[$num]["op2"][$op2])) {
// Generate handler code
gen_handler($f, 1, $kind, $opcodes[$num]["op"], $op1, $op2, isset($opcodes[$num]["use"]), $opcodes[$num]["code"], $lineno, $opcodes[$num], $extra_spec, $switch_labels);
}
}
} else if (isset($dsc["helper"])) {
$num = $dsc["helper"];
foreach (extra_spec_handler($helpers[$num]) as $extra_spec) {
// Check if handler accepts such types of operands (op1 and op2)
if (isset($helpers[$num]["op1"][$op1]) &&
isset($helpers[$num]["op2"][$op2])) {
// Generate helper code
gen_helper($f, 1, $kind, $num, $op1, $op2, $helpers[$num]["param"], $helpers[$num]["code"], $lineno, $helpers[$num]["inline"], $helpers[$num]["cold"], $helpers[$num]["hot"], $extra_spec);
}
}
} else {
var_dump($dsc);
die("??? $kind:$num\n");
}
}
}
}

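In this code, $list holds the names of all helpers and the codes of all opcodes, keyed by source line number so that the original order is preserved. For example:

[ "helper" => "zend_add_helper" ],
[ "handler" => 1 ],
[ "helper" => "zend_sub_helper" ],
[ "handler" => 2 ],
// rest omitted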

For a helper, the information about the helper function is fetched from $helpers.

For a handler, the information about the opcode is fetched from $opcodes.

Here:

$opcodes[$num]["op1"]
$opcodes[$num]["op2"]

These entries hold all the types supported by operand 1 and operand 2 of this opcode; we collected this information earlier, during parsing.
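
To make the nested enumeration concrete, here is a reduced C sketch for a single opcode. It only walks the op1 × op2 grid and emits one name per accepted pair; the bitmask encoding, `opcode_desc` and `enumerate_specializations` are hypothetical, and the naming is simplified (no special case for ANY, no extra spec variants).

```c
#include <stdio.h>
#include <string.h>

#define NUM_TYPES 8

/* Same order as $op_types_ex. */
static const char *const op_types_ex[NUM_TYPES] = {
    "ANY", "CONST", "TMPVARCV", "TMPVAR", "TMP", "VAR", "UNUSED", "CV"
};

typedef struct {
    const char *op;
    unsigned op1_mask;   /* bit i set => op_types_ex[i] is accepted */
    unsigned op2_mask;
} opcode_desc;

/* Writes one newline-terminated name per accepted (op1, op2) pair into
 * `out` and returns the number of pairs. */
static int enumerate_specializations(const opcode_desc *d, char *out, size_t cap)
{
    int count = 0;
    size_t used = 0;
    if (cap > 0) {
        out[0] = '\0';
    }
    for (int i = 0; i < NUM_TYPES; i++) {          /* each op1 type */
        for (int j = 0; j < NUM_TYPES; j++) {      /* each op2 type */
            if (!(d->op1_mask & (1u << i)) || !(d->op2_mask & (1u << j))) {
                continue;
            }
            if (used < cap) {
                int n = snprintf(out + used, cap - used, "%s_SPEC_%s_%s\n",
                                 d->op, op_types_ex[i], op_types_ex[j]);
                if (n > 0) {
                    used += (size_t)n;
                }
            }
            count++;
        }
    }
    return count;
}
```

For ZEND_ADD, where both operands accept CONST and TMPVARCV, only 4 of the 64 possible pairs pass the check, which is why the upper bound is rarely reached.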

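Whether it is a helper or an opcode handler, extra_spec_handler is called to enumerate the extra specializations, and a spec function is generated for each of them. While a spec is generated, the handler's code from zend_vm_def.h is rewritten; the substitution rules live in the function gen_code.

Once the specs for the handlers have been generated, the {%DEFINES%} placeholder in the template file has been fully replaced.

Next, the {%EXECUTOR_NAME%} placeholder in the template is replaced, which is where the zend_execute function starts to be generated:
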
case "EXECUTOR_NAME":
out($f, $m[1].$executor_name.$m[3]."\n");
break;

Here the name is execute.

Next, the {%HELPER_VARS%} placeholder in the template is replaced:

case "HELPER_VARS":
    // code omitted
    break;

The generated result is:

#ifdef ZEND_VM_IP_GLOBAL_REG
const zend_op *orig_opline = opline;
#endif
#ifdef ZEND_VM_FP_GLOBAL_REG
zend_execute_data *orig_execute_data = execute_data;
execute_data = ex;
#else
zend_execute_data *execute_data = ex;
#endif

Next, the {%INTERNAL_LABELS%} placeholder in the template is replaced:

out($f,$prolog."if (UNEXPECTED(execute_data == NULL)) {\n");
out($f,$prolog."\tstatic const void * const labels[] = {\n");
gen_labels($f, $spec, ($kind == ZEND_VM_KIND_HYBRID) ? ZEND_VM_KIND_GOTO : $kind, $prolog."\t\t", $specs);
out($f,$prolog."\t};\n");

This defines a static variable named labels, which means it is shared by every call to zend_execute. The generated code is:

if (UNEXPECTED(execute_data == NULL)) {
static const void * const labels[] = {
(void*)&&ZEND_NOP_SPEC_LABEL,
(void*)&&ZEND_ADD_SPEC_CONST_CONST_LABEL,
(void*)&&ZEND_ADD_SPEC_CONST_TMPVARCV_LABEL,
(void*)&&ZEND_ADD_SPEC_CONST_TMPVARCV_LABEL,
(void*)&&ZEND_NULL_LABEL,
(void*)&&ZEND_ADD_SPEC_CONST_TMPVARCV_LABEL,
(void*)&&ZEND_ADD_SPEC_TMPVARCV_CONST_LABEL,
(void*)&&ZEND_ADD_SPEC_TMPVARCV_TMPVARCV_LABEL,
(void*)&&ZEND_ADD_SPEC_TMPVARCV_TMPVARCV_LABEL,
(void*)&&ZEND_NULL_LABEL,
(void*)&&ZEND_ADD_SPEC_TMPVARCV_TMPVARCV_LABEL,
(void*)&&ZEND_ADD_SPEC_TMPVARCV_CONST_LABEL,
(void*)&&ZEND_ADD_SPEC_TMPVARCV_TMPVARCV_LABEL,
(void*)&&ZEND_ADD_SPEC_TMPVARCV_TMPVARCV_LABEL,
(void*)&&ZEND_NULL_LABEL,
(void*)&&ZEND_ADD_SPEC_TMPVARCV_TMPVARCV_LABEL,
(void*)&&ZEND_NULL_LABEL,
(void*)&&ZEND_NULL_LABEL,
(void*)&&ZEND_NULL_LABEL,
(void*)&&ZEND_NULL_LABEL,
(void*)&&ZEND_NULL_LABEL,
(void*)&&ZEND_ADD_SPEC_TMPVARCV_CONST_LABEL,
(void*)&&ZEND_ADD_SPEC_TMPVARCV_TMPVARCV_LABEL,
(void*)&&ZEND_ADD_SPEC_TMPVARCV_TMPVARCV_LABEL,
(void*)&&ZEND_NULL_LABEL,
(void*)&&ZEND_ADD_SPEC_TMPVARCV_TMPVARCV_LABEL,
(void*)&&ZEND_NULL_LABEL
// rest omitted
};

In other words, the labels table is set up on the first call to zend_execute, the one made with execute_data == NULL.

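Next, a series of HYBRID_SWITCH and HYBRID_CASE entries is generated. They correspond to the pointers in the labels variable and to the handlers we generated. Later we will write a small demo to explain how this switch ... case works.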
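
As a rough preview of the idea, here is a minimal sketch of label-table dispatch in C. It is not the generated executor: the three opcodes and the function `run` are made up, the label table relies on the GNU C "labels as values" extension, and a plain switch is used as a fallback on other compilers.

```c
/* Hypothetical three-instruction bytecode. */
enum { OP_INC = 0, OP_DOUBLE = 1, OP_HALT = 2 };

/* Runs the program in `code` on the accumulator `acc` and returns the
 * accumulator when OP_HALT is reached. */
static int run(const unsigned char *code, int acc)
{
#if defined(__GNUC__)
    /* Addresses of the labels below, indexed by opcode. Opcodes outside
     * the table are not checked in this sketch. */
    static const void *const labels[] = { &&do_inc, &&do_double, &&do_halt };
    const unsigned char *ip = code;

    goto *labels[*ip];            /* dispatch the first instruction */
do_inc:
    acc += 1;
    goto *labels[*++ip];          /* jump straight to the next handler */
do_double:
    acc *= 2;
    goto *labels[*++ip];
do_halt:
    return acc;
#else
    /* Portable fallback: the same program driven by a switch. */
    for (const unsigned char *ip = code; ; ip++) {
        switch (*ip) {
        case OP_INC:    acc += 1; break;
        case OP_DOUBLE: acc *= 2; break;
        default:        return acc;
        }
    }
#endif
}
```

In the label-table version each handler jumps directly to the next one, so there is no central loop; the table plays the role that labels plays in the generated code.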

Next, $specs is generated:

static const uint32_t specs[] = {
0,
1 | SPEC_RULE_OP1 | SPEC_RULE_OP2,
26 | SPEC_RULE_OP1 | SPEC_RULE_OP2,
51 | SPEC_RULE_OP1 | SPEC_RULE_OP2 | SPEC_RULE_COMMUTATIVE,
// others omitted
};

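SPEC_RULE_OP1 and SPEC_RULE_OP2 have already been explained. What about the number in front of them? It is the index, within the labels variable, of the first spec handler of the current opcode. That is rather abstract, so the diagram below illustrates it: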

  +-------------+                 +-------------+
  |    specs    |                 |   labels    |
  +-------------+                 +-------------+

  +------+----+------+            +--------------------------------+
  |      | 0  |      |----------->|      ZEND_NOP_SPEC_LABEL       |
  +------+----+------+            +--------------------------------+
  |      | 1  |      |----------->|ZEND_ADD_SPEC_CONST_CONST_LABEL |
  +------+----+------+            +--------------------------------+
  |      | 26 |      |-------+    |ZEND_ADD_SPEC_CONST_TMPVARCV_LAB|
  +------+----+------+       |    +--------------------------------+
  |      | 51 |      |       |    |              ...               |
  +------+----+------+       |    +--------------------------------+
  |                  |       +--->|ZEND_SUB_SPEC_CONST_CONST_LABEL |
  |                  |            +--------------------------------+
  |       ...        |            |ZEND_SUB_SPEC_CONST_TMPVARCV_LAB|
  |                  |            +--------------------------------+
  |                  |            |              ...               |
  |                  |            |                                |
  +------------------+            +--------------------------------+
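
The lookup from a specs entry to a labels index can be sketched as follows. The masks come from the definitions above; the slot order CONST, TMP, VAR, UNUSED, CV is an assumption read off the labels listing (25 slots per two-operand opcode), and `label_index` is a hypothetical name.

```c
#include <stdint.h>

#define SPEC_START_MASK 0x0000ffffu
#define SPEC_RULE_OP1   0x00010000u
#define SPEC_RULE_OP2   0x00020000u

/* Assumed order of operand types inside one opcode's block of labels. */
enum {
    SLOT_CONST = 0, SLOT_TMP = 1, SLOT_VAR = 2, SLOT_UNUSED = 3, SLOT_CV = 4,
    SLOT_COUNT = 5
};

/* Base index from the low bits of the spec, plus an offset that is built
 * from the operand slots the spec says it is specialized on. */
static uint32_t label_index(uint32_t spec, int op1_slot, int op2_slot)
{
    uint32_t offset = 0;
    if (spec & SPEC_RULE_OP1) {
        offset = offset * SLOT_COUNT + (uint32_t)op1_slot;
    }
    if (spec & SPEC_RULE_OP2) {
        offset = offset * SLOT_COUNT + (uint32_t)op2_slot;
    }
    return (spec & SPEC_START_MASK) + offset;
}
```

With the ZEND_ADD entry `1 | SPEC_RULE_OP1 | SPEC_RULE_OP2`, a CONST/CONST pair maps to index 1 and a CV/CV pair to index 1 + 4*5 + 4 = 25, so the next opcode's block can start at 26, which matches the second number in specs.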

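This completes the code generation performed by zend_vm_gen.php.

Source analysis of pass_two in the PHP kernel

This article is based on PHP 8 commit 14806e0824ecd598df74cac855868422e44aea53.

Let's first look at how a PHP script is turned into opcodes, in the function zend_compile:
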
// some code removed
static zend_op_array *zend_compile(int type)
{
    if (!zendparse()) {
        init_op_array(op_array, type, INITIAL_OP_ARRAY_SIZE);

        zend_compile_top_stmt(CG(ast));
        pass_two(op_array);
    }

    return op_array;
}

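In summary:

1. zendparse is called to perform lexical and syntactic analysis and produce the AST.
2. init_op_array and zend_compile_top_stmt are called to turn the AST into opcodes. At this stage the handlers of the opcodes have not been set yet, and some of the data is still in its compile-time form.
3. pass_two is called to convert compile-time information into run-time information and to set the handler of each opcode.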

Let's look at the actual code:

ZEND_API void pass_two(zend_op_array *op_array)
{
zend_op *opline, *end;

if (!ZEND_USER_CODE(op_array->type)) {
return;
}

#if ZEND_USE_ABS_CONST_ADDR
if (CG(context).opcodes_size != op_array->last) {
op_array->opcodes = (zend_op *) erealloc(op_array->opcodes, sizeof(zend_op)*op_array->last);
CG(context).opcodes_size = op_array->last;
}
if (CG(context).literals_size != op_array->last_literal) {
op_array->literals = (zval*)erealloc(op_array->literals, sizeof(zval) * op_array->last_literal);
CG(context).literals_size = op_array->last_literal;
}
#else
op_array->opcodes = (zend_op *) erealloc(op_array->opcodes,
ZEND_MM_ALIGNED_SIZE_EX(sizeof(zend_op) * op_array->last, 16) +
sizeof(zval) * op_array->last_literal);
if (op_array->literals) {
memcpy(((char*)op_array->opcodes) + ZEND_MM_ALIGNED_SIZE_EX(sizeof(zend_op) * op_array->last, 16),
op_array->literals, sizeof(zval) * op_array->last_literal);
efree(op_array->literals);
op_array->literals = (zval*)(((char*)op_array->opcodes) + ZEND_MM_ALIGNED_SIZE_EX(sizeof(zend_op) * op_array->last, 16));
}
CG(context).opcodes_size = op_array->last;
CG(context).literals_size = op_array->last_literal;
#endif

/* Needs to be set directly after the opcode/literal reallocation, to ensure destruction
* happens correctly if any of the following fixups generate a fatal error. */
op_array->fn_flags |= ZEND_ACC_DONE_PASS_TWO;

opline = op_array->opcodes;
end = opline + op_array->last;
while (opline < end) {
if (opline->op1_type == IS_CONST) {
ZEND_PASS_TWO_UPDATE_CONSTANT(op_array, opline, opline->op1);
} else if (opline->op1_type & (IS_VAR|IS_TMP_VAR)) {
opline->op1.var = EX_NUM_TO_VAR(op_array->last_var + opline->op1.var);
}
if (opline->op2_type == IS_CONST) {
ZEND_PASS_TWO_UPDATE_CONSTANT(op_array, opline, opline->op2);
} else if (opline->op2_type & (IS_VAR|IS_TMP_VAR)) {
opline->op2.var = EX_NUM_TO_VAR(op_array->last_var + opline->op2.var);
}
if (opline->result_type & (IS_VAR|IS_TMP_VAR)) {
opline->result.var = EX_NUM_TO_VAR(op_array->last_var + opline->result.var);
}
ZEND_VM_SET_OPCODE_HANDLER(opline);
opline++;
}

return;
}

In it:

#if ZEND_USE_ABS_CONST_ADDR
if (CG(context).opcodes_size != op_array->last) {
op_array->opcodes = (zend_op *) erealloc(op_array->opcodes, sizeof(zend_op)*op_array->last);
CG(context).opcodes_size = op_array->last;
}
if (CG(context).literals_size != op_array->last_literal) {
op_array->literals = (zval*)erealloc(op_array->literals, sizeof(zval) * op_array->last_literal);
CG(context).literals_size = op_array->last_literal;
}
#else

This branch is used on 32-bit machines. There, opcodes and literals are each reallocated to their exact final size, which avoids wasting memory.

    op_array->opcodes = (zend_op *) erealloc(op_array->opcodes,
ZEND_MM_ALIGNED_SIZE_EX(sizeof(zend_op) * op_array->last, 16) +
sizeof(zval) * op_array->last_literal);
if (op_array->literals) {
memcpy(((char*)op_array->opcodes) + ZEND_MM_ALIGNED_SIZE_EX(sizeof(zend_op) * op_array->last, 16),
op_array->literals, sizeof(zval) * op_array->last_literal);
efree(op_array->literals);
op_array->literals = (zval*)(((char*)op_array->opcodes) + ZEND_MM_ALIGNED_SIZE_EX(sizeof(zend_op) * op_array->last, 16));
}
CG(context).opcodes_size = op_array->last;
CG(context).literals_size = op_array->last_literal;
#endif

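This branch is used on 64-bit machines. There, opcodes is reallocated so that it can hold all the oplines (rounded up to a 16-byte boundary) plus all the literals, and literals is then copied to the end of opcodes. As a result, opcodes and literals live in one contiguous block of memory.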
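
The layout step can be sketched with plain malloc/realloc. This is an illustration only: `demo_op` and `demo_lit` are hypothetical stand-ins for zend_op and zval with arbitrary sizes, `ALIGN_UP` stands in for ZEND_MM_ALIGNED_SIZE_EX, and the standard allocator replaces erealloc/efree.

```c
#include <stdlib.h>
#include <string.h>

/* Rounds `size` up to the next multiple of `align` (a power of two). */
#define ALIGN_UP(size, align) (((size) + (align) - 1) & ~((size_t)(align) - 1))

typedef struct { unsigned char raw[24]; } demo_op;   /* stand-in for zend_op */
typedef struct { unsigned char raw[16]; } demo_lit;  /* stand-in for zval */

/* Grows the opline buffer so the literals fit right behind it, copies the
 * literals over, frees the old literal table and repoints *lits into the
 * new block. Returns the new opline buffer, or NULL if realloc fails. */
static demo_op *pack_literals(demo_op *ops, size_t n_ops,
                              demo_lit **lits, size_t n_lits)
{
    size_t lit_offset = ALIGN_UP(sizeof(demo_op) * n_ops, 16);
    demo_op *packed = realloc(ops, lit_offset + sizeof(demo_lit) * n_lits);
    if (packed == NULL) {
        return NULL;
    }
    if (*lits != NULL) {
        memcpy((char *)packed + lit_offset, *lits, sizeof(demo_lit) * n_lits);
        free(*lits);
        *lits = (demo_lit *)((char *)packed + lit_offset);
    }
    return packed;
}
```

With three 24-byte oplines the literals start at byte 80 rather than 72, because the opline area is padded up to the next 16-byte boundary.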

Next comes the loop over the oplines:

while (opline < end) {
    if (opline->op1_type == IS_CONST) {
        ZEND_PASS_TWO_UPDATE_CONSTANT(op_array, opline, opline->op1);
    } else if (opline->op1_type & (IS_VAR|IS_TMP_VAR)) {
        opline->op1.var = EX_NUM_TO_VAR(op_array->last_var + opline->op1.var);
    }
    if (opline->op2_type == IS_CONST) {
        ZEND_PASS_TWO_UPDATE_CONSTANT(op_array, opline, opline->op2);
    } else if (opline->op2_type & (IS_VAR|IS_TMP_VAR)) {
        opline->op2.var = EX_NUM_TO_VAR(op_array->last_var + opline->op2.var);
    }
    if (opline->result_type & (IS_VAR|IS_TMP_VAR)) {
        opline->result.var = EX_NUM_TO_VAR(op_array->last_var + opline->result.var);
    }
    ZEND_VM_SET_OPCODE_HANDLER(opline);
    opline++;
}

Here ZEND_PASS_TWO_UPDATE_CONSTANT is called to convert constants from their compile-time form to their run-time form. Let's look at this macro:

/* constant-time constant */
# define CT_CONSTANT_EX(op_array, num) \
((op_array)->literals + (num))

# define CT_CONSTANT(node) \
CT_CONSTANT_EX(CG(active_op_array), (node).constant)

#if ZEND_USE_ABS_CONST_ADDR

/* run-time constant */
# define RT_CONSTANT(opline, node) \
(node).zv

/* convert constant from compile-time to run-time */
# define ZEND_PASS_TWO_UPDATE_CONSTANT(op_array, opline, node) do { \
(node).zv = CT_CONSTANT_EX(op_array, (node).constant); \
} while (0)

#else

/* At run-time, constants are allocated together with op_array->opcodes
* and addressed relatively to current opline.
*/

/* run-time constant */
# define RT_CONSTANT(opline, node) \
((zval*)(((char*)(opline)) + (int32_t)(node).constant))

/* convert constant from compile-time to run-time */
# define ZEND_PASS_TWO_UPDATE_CONSTANT(op_array, opline, node) do { \
(node).constant = \
(((char*)CT_CONSTANT_EX(op_array, (node).constant)) - \
((char*)opline)); \
} while (0)

#endif

On 32-bit machines, the logic taken is:

# define CT_CONSTANT_EX(op_array, num) \
((op_array)->literals + (num))

/* run-time constant */
# define RT_CONSTANT(opline, node) \
(node).zv

/* convert constant from compile-time to run-time */
# define ZEND_PASS_TWO_UPDATE_CONSTANT(op_array, opline, node) do { \
(node).zv = CT_CONSTANT_EX(op_array, (node).constant); \
} while (0)

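As we know, at compile time (node).constant stores the index of the literal in (op_array)->literals, that is, 1, 2, 3 and so on.

After the compile-time to run-time conversion, the operand, now accessed as (node).zv, holds the absolute address of the literal in (op_array)->literals.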

Now let's look at the logic taken on 64-bit machines:

# define CT_CONSTANT_EX(op_array, num) \
((op_array)->literals + (num))

/* run-time constant */
# define RT_CONSTANT(opline, node) \
((zval*)(((char*)(opline)) + (int32_t)(node).constant))

/* convert constant from compile-time to run-time */
# define ZEND_PASS_TWO_UPDATE_CONSTANT(op_array, opline, node) do { \
(node).constant = \
(((char*)CT_CONSTANT_EX(op_array, (node).constant)) - \
((char*)opline)); \
} while (0)

#endif

We can see that after the compile-time to run-time conversion, (node).constant holds the address of the literal relative to the current opline. Because opcodes and literals sit in one contiguous block of memory on 64-bit machines, a relative offset is enough. As shown below:

+----------------------------+
|          opcodes           |
+----------------------------+

+----------------------------+
|          opline1           |--+
+----------------------------+  |
|          opline2           |  |
+----------------------------+  |
|          opline3           |  |
+----------------------------+  |
|                            |  |
|           ......           |  |
|     Continuous memory      |  |
|                            |  |
|                            |  |
+----------------------------+  |
|          literal1          |<-+
+----------------------------+
|          literal2          |
+----------------------------+
|          literal3          |
+----------------------------+
|                            |
|           ......           |
|                            |
+----------------------------+
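
The relative addressing can be tried out in isolation. The sketch below mirrors the two 64-bit macros with hypothetical types: `demo_op` has a single operand field, `to_runtime` plays the role of ZEND_PASS_TWO_UPDATE_CONSTANT and `rt_constant` the role of RT_CONSTANT. It assumes the oplines and literals really are in one block, as in the diagram.

```c
#include <stdint.h>

typedef struct { int value; } demo_lit;

/* Compile time: `constant` is an index into the literal table.
 * Run time: `constant` is a byte offset from this opline to the literal. */
typedef struct { uint32_t constant; } demo_op;

/* Rewrites op->constant from a literal index into a byte offset measured
 * from the opline itself. */
static void to_runtime(demo_op *op, const demo_lit *literals)
{
    const char *target = (const char *)(literals + op->constant);
    op->constant = (uint32_t)(int32_t)(target - (const char *)op);
}

/* Recovers the literal from the opline address plus the stored offset. */
static const demo_lit *rt_constant(const demo_op *op)
{
    return (const demo_lit *)((const char *)op + (int32_t)op->constant);
}
```

Because the offset is relative to each opline, two oplines that refer to the same literal store different values, and the whole block can be moved in memory without fixing up any operand.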

Finally:

ZEND_VM_SET_OPCODE_HANDLER(opline);

This step sets the handler that corresponds to each opcode.