Author
Roberto Ierusalimschy (Department of Computer Science, PUC-Rio, Rio de Janeiro, Brazil roberto@inf.puc-rio.br)
Luiz Henrique de Figueiredo(IMPA-Instituto de Matematica Pura e Aplicada, Rio de Janeiro, Brazil lhf@impa.br)
Waldemar Celes(Department of Computer Science, PUC-Rio, Rio de Janeiro, Brazil celes@inf.puc-rio.br)

翻译: xjump.me#at#gmail.com, @xjump
原文: http://www.lua.org/doc/jucs05.pdf

Abstract  We discuss the main novelties of the implementation of Lua 5.0: its register-based virtaul machine, the new algorithm for optimizing tables used as arrays, the implementation of closures, and the addition of coroutines.
Key Words  compilers, virtual machines, hash tables, closures, coroutines
Category  D.3.4, D.3.3, D.3.2, E.2

摘要  我们讨论了Lua 5.0实现上的主要优点:基于寄存器的虚拟机,tables用于数组时新的优化算法,闭包的实现,以及协程。
关键字  编译器,虚拟机,Hash表,闭包,协程
类别  D.3.4,D.3.3,D.3.2,E.2

1 Introduct
1 介绍

Lua was born in an academic laboratory as a tool for in-house software development but somehow was adopted by several industrial projects around the world and is now widely used in the game industry. [1]
Lua诞生于一个学术实验室,当时是作为一个内部软件开发工具的,但逐渐被移植到了全世界的各个行业的工程中,而且现在广泛的用于游戏行业。[1]

How do we account for this widespread use of Lua? We believe that the answer lies in our design and implementation goals: to provide an embeddable scripting language that is simple, efficient, portable and ligthweight. These have been our main goals since the birth of Lua in 1993 and they have been respected during its evolution. (For a history of development of Lua up to just before the release of Lua 5.0, see [12].) These features, plus the fact that Lua has been designed from the start to be embedded into larger application, account for its early adoption by the industry. [2]
我们该如何解释Lua的广泛应用?我们相信答案在我们的设计和实现目标中:提供一个可嵌入的脚本语言,她简单、高效、可移植和轻量级。这些(目标)是自1993年Lua诞生以来我们的主要目标,而且在她的演变中得到了认可。(关于Lua 5.0之前的版本发布历史,参考 [12]。)这些特性,以及Lua从一开始就设计成可嵌入到更大的应用程序中,(这一特性)使得她早早地得到了行业的认可。[2]

Widespread use generates demand for language feature. Several features of Lua have been motivated by industrial needs and user feedback. Important examples are the introduction of coroutines in Lua 5.0 and the implementation of incremental gargage collection in the upcoming Lua 5.1. Both features are specially important for games.
广泛的使用产生了对语言特性的需求。Lua的一些特性是由行业需求和用户反馈驱动的。重点的例子有Lua 5.0对协程的引入和即将发布的Lua 5.1对垃圾收集(gc)的增加。这两个特性对游戏来说是极其重要的。

In this paper, we discuss the main novelties of the implementation of Lua 5.0, compared to Lua 4.0:
在这篇论文中,我们讨论了与Lua 4.0 相比 Lua 5.0实现上的主要优点:

Register-based virtual machine: Traditionally, most virtual machines intended for actual execution are stack based, a trend that started with Pascal's Pmachine [15] and continues today with Java's JVM and Microsoft's .Net environment. Currently, however, there has been a growing interest in registerbased virtual machines (for instance, the planned new virtual machine for Perl 6 (Parrot) will be register based [17]). As far as we know, the virtual machine of Lua 5.0 is the first register-based virtual machine to have a wide use. This virtual machine is presented in Section 7.
基于寄存器的虚拟机:通常,大多数虚拟机为了?执行都是基于栈的,一个从Pascal的Pmachine [15]持续到今天的Java的JVM和微软的.Net环境都是这个趋势。然而,现在对基于寄存器的虚拟机的兴趣在增加(例如,计划中的Perl 6(Parrot)将是基于寄存器的 [17]).据我们了解,Lua 5.0是第一个广泛应用的基于寄存器的虚拟机。这个虚拟机将在Section 7讲解。

New algorithm for optimizing tables used as arrays: Unlike other scripting languages, Lua does not offer an array type. Instead, Lua programmers use regular tables with integer indices to implement arrays. Lua 5.0 uses a new algorithm that detects whether tables are being used as arrays and automatically stores the values associated to numeric indices in an actual array, instead of adding them to the hash table. This algorithm is discussed in Section 4.
新算法对tables用于数组时的优化:与其他脚本语言不同的是,Lua 没有提供数组类型。然而,Lua 程序员使用regular tables配合整数下标来实现数组。Lua 5.0使用一个新的算法来检查tables是否用于数组,并且在真正的数组中自动存储关联到整数下标的值,而不是将他们添加到Hash表。这个算法将在Section 4讨论。

The implementation of closures: Lua 5.0 supports first-class functions with lexical scoping. This mechanism poses a well-known difficulty for languages that use an array-based stack to store activation records. Lua uses a novel approach to function closures that keeps local variables in the (array-based) stack and only moves them to the heap if they go out of scope while being referred by nested functions. The implementation of closures is discussed in Section 5.
闭包的实现:Lua 5.0 从语法上支持第一类型函数。这个机制对于使用基于数组的栈来存储活动记录的语言来说,是一个众所周知的难题。Lua使用了一个新颖的解决方案来(实现)闭包的功能,这些闭包将局部变量保存在(基于数组的)栈上,并且只在他们超出作用域,同时又被包含的函数引用时,才会被移到堆上。闭包的实现将会在Section 5讨论。

The addition of coroutines: Lua 5.0 introduced coroutines in the language. Although the implementation of coroutines is more or less traditional, we present a short overview in Section 6 for completeness.
协程的加入:Lua 5.0 引入了协程。尽管协程的实现有些传统,我们在Section 6进行概述来保证(文章的)完整性。

The other sections complement or give background to this discussion. In Section 2 we present an overview of Lua's design goals and how those goals have driven its implementation. In Section 3 we describe how Lua represents its values. Although the representation itself has no novelties, we need this material for the other sections. Finally, in Section 8 we present a small benchmark and draw some conclusions.
其他Section覆盖或给出了讨论的背景(资料)。在Section 2,我们给出了Lua的设计目标概述,并且说明这些目标如何驱动它的实现。在Section 3,我们描述了Lua如何表示它的值。尽管这个表示(方法)并无新意,但其他Section需要这些内容。最后,在Section 8,我们展示了一个小的性能测试以及给出一些结论。

2 An Overview of Lua's Design and Implementation
2 Lua设计与实现概述

As mentioned in the introduction, the goals in our implementation of Lua are:
在介绍(Section 1)中,我们提到对于Lua,我们的实现目标是:

Simplicity: We seek the simplest language we can afford and the simplest C code that implements this language. This implies a simple syntax with a small number of language constructs, not far from the tradition.
简单:我们寻求能够提供的最简单的语言以及该语言的最简单C代码实现。这意味着一个简单语法附带少数的语言构件,与传统(语言实现)相差不大。

Effciency: We seek fast compilation and fast execution of Lua programs. This implies a fast, smart, one-pass compiler and a fast virtual machine.
高效:我们寻求Lua程序的快速编译与快速执行。这意味着(Lua有)一个快速、智能、单次的编译器和一个快速的虚拟机。

Portability: We want Lua to run on as many platforms as possible. We want to be able to compile the Lua core unmodified everywhere and to run Lua programs unmodified on every platform that has a suitable Lua interpreter. This implies a clean ANSI C implementation with special attention to portability issues, such as avoiding dark corners of C and its libraries, and ensuring that it also compiles cleanly as C++. We seek warning-free compilations.
可移植:我们希望Lua尽可能运行在更多的平台上。我们希望能够在任何(环境)编译Lua核心而不需要修改(Lua源代码),希望能够在拥有配套Lua解释器的每个平台运行Lua程序而不需要修改。这意味着(Lua是)一个纯净的ANSI C实现,同时对移植性问题(如避免C和C库的死角)特殊关注,并且保证作为C++编译时也是纯净的。我们寻求没有警告的编译。

Embeddability: Lua is an extension language; it is designed to provide scripting facilities to larger programs. This and the other goals imply the existence of a C API that is simple and powerful, but which relies mostly on built-in C types.
可嵌入:Lua是一个扩展语言;它被设计成为更大的程序提供脚本基础设施。这个目标和其他目标意味着存在(一些)简单和强大的C API,而它们大部分依赖内置C类型。

Low embedding cost:We want it to be easy to add Lua to an application without bloating it. This implies tight C code and a small Lua core, with extensions being added as user libraries.
低嵌入成本:我们希望添加Lua到应用程序是简单的,(Lua)内部不消耗太多(内存)。这意味着(Lua的实现是)紧凑的C代码和小的Lua内核,使用扩展来添加用户库。

These goals are somewhat conficting. For instance, Lua is frequently used as a data-description language, for storing and loading configuration files and sometimes quite large databases (Lua programs with a few megabytes are not uncommon). This implies that we need a fast Lua compiler. On the other hand, we want Lua programs to execute fast. This implies a smart compiler, one that generates good code for the virtual machine. So, the implementation of the Lua compiler has to balance between these two requirements. However, the compiler cannot be too large; otherwise it would bloat the whole package. Currently the compiler accounts for approximately 30% of the size of the Lua core. For memory-limited applications, such as embedded systems, it is possible to embed Lua without the compiler. Lua programs are then precompiled off-line and loaded at run time by a tiny module (which is also fast because it loads binary files).
这些目标有些互相冲突。例如,Lua被频繁的用作数据描述语言,来存储和加载配置文件,有时这些配置文件是较大的数据库(Lua程序使用几MB并非不常见)。这意味着,我们需要一个快速的Lua编译器。另一方面,我们希望Lua程序执行快速。这意味着,(我们需要)一个智能的编译器,它能为虚拟机产生高质量的代码。所以,Lua编译器的实现需要在这两个需求上做平衡。然而,编译器不能太大;否则,它的所有包都会变大。当前Lua编译器大约占Lua核心大小的30%。对于内存受限的应用程序,如嵌入式系统,嵌入Lua而不需要编译器是可行的。Lua程序将会离线预编译,在运行时由一个极小的模块加载(因为这个模块加载二进制文件,所以它也很快)。

Lua uses a hand-written scanner and a hand-written recursive descent parser. Until version 3.0, Lua used a parser produced by yacc [13], which proved a valuable tool when the language's syntax was less stable. However, the hand-written parser is smaller, more effcient, more portable, and fully reentrant. It also provides better error messages.
Lua使用一个手写的扫描器和一个手写的递归下降解析器。3.0版以前,Lua使用一个由yacc [13]产生的解析器。当语言的语法不太稳定时,yacc提供了一个有用的工具。然而,手写解析器更小,更高效,移植性更强,并且是完全可重入的。同时也提供了更好的错误信息。

The Lua compiler uses no intermediate representation. It emits instructions for the virtual machine 'on the fly' as it parses a program. Nevertheless, it does perform some optimizations. For instance, it delays the generation of code for base expressions like variables and constants. When it parses such expressions, it generates no code; instead, it uses a simple structure to represent them. Therefore, it is very easy to check whether an operand for a given instruction is a constant or a local variable and use those values directly in the instruction, thus avoiding unnecessary and costly moves (see Section 3).
Lua编译器不使用中间表示。它在解析程序时直接生成虚拟机指令。然而,它也做了一些优化。例如,它为基本表达式如变量和常量延迟生成指令。当它解析这些表达式时,它不生成任何代码;相反,它使用一个简单的结构表示他们(表达式)。因此,检查一个给定的指令的操作数是常量还是局部变量就很容易,并且在指令中直接使用这些值,从而避免了不必要和昂贵的(内存)移动(参见Section 3)。

To be portable across many different C compilers and platforms, Lua cannot use several tricks commonly used by interpreters, such as direct threaded code [8, 16]. Instead, it uses a standard while switch dispatch loop. Also, at places the C code seems unduly complicated, but the complication is there to ensure portability. The portability of Lua's implementation has increased steadily throughout the years, as Lua got compiled under many different C compilers in many different platforms (including several 64-bit platforms and some 16-bit platforms).
为了跨多种C编译器和平台移植,Lua不能使用常见的解释器使用的一些技巧,例如直接线程化的代码 [8,16]。相对地,它使用了一个标准的while-switch分发循环。同时,某些地方的C代码看上去过于复杂,但是,这里的复杂性是为了保证移植性。这些年,随着Lua在许多不同的平台(包括一些64位平台和一些16位平台)上不同C编译器上被成编译,Lua实现代码的移植性在稳步提升。

We consider that we have achieved our design and implementation goals. Lua is a very portable language: it runs on any machine with an ANSI C compiler, from embedded systems to mainframes. Lua is really lightweight: for instance, on Linux its stand-alone interpreter, complete with all standard libraries, takes less than 150 Kbytes; the core is less than 100 Kbytes. Lua is effcient: independent benchmarks [2, 4] show Lua as one of the fastest language implementations in the realm of scripting languages (i.e., interpreted and dynamically-typed languages). We also consider Lua a simple language, being syntactically similar to Pascal and semantically similar to Scheme, but this is subjective.
我们认为,我们达到了我们的设计和实现目标。Lua是一个高度可移植的语言:它运行在任何拥有ANSI C编译器的机器上,(包括)从嵌入式系统到主要框架。Lua是真正的轻量级:例如,在Linux上它是一个独立的解释器,完整包含所有的标准库,只占用不到150KB(磁盘空间);核心小于100KB。Lua是高效的:独立的性能测试 [2,4]表明,Lua(的实现)是脚本语言(即,解释和动态类型语言)中最快的语言实现之一。我们还认为,Lua是一个简单的语言,与Pascal的语法相似并且与Scheme的语义相似,但这(种观点)是主观的。

3 The Representation of Values
3 值的表示

Lua is a dynamically-typed language: types are attached to values rather than to variables. Lua has eight basic types: nil, boolean, number, string, table, function, userdata, and thread. Nil is a marker type having only one value, also called nil. Boolean values are the usual true and false. Numbers are double-precision foating-point numbers, corresponding to the type double in C, but it is easy to compile Lua using float or long instead. (Several games consoles and smaller machines lack hardware support for double.) Strings are arrays of bytes with an explicit size, and so can contain arbitrary binary data, including embedded zeros. Tables are associative arrays, which can be indexed by any value (except nil) and can hold any value. Functions are either Lua functions or C functions written according to a protocol for interfacing with the Lua virtual machine. Userdata are essentially pointers to user memory blocks, and come in two avors: heavy, whose blocks are allocated by Lua and are subject to garbage collection, and light, whose blocks are allocated and freed by the user. Finally, threads represent coroutines. Values of all types are first-class values: we can store them in global variables, local variables and table fields, pass them as arguments to functions, return them from functions, etc.
Lua是一门动态类型语言:类型和值绑定在一起,而不是和变量绑定在一起。Lua用于8种基本类型:nil,boolean,number,string,table,function,userdata,还有thread。Nil是一个标记类型,只有一个值,也叫做nil。Boolean值是通常的true和false。Numbers是双精度浮点数,与C语言中的double类型一致,但是将Lua编译成使用float或long来代替(double)也是很容易的。(一些游戏控制台和小机器缺乏对double的硬件支持。)Strings是字节数组,包含一个明确的大小,而且可以容纳任意二进制数据,包含(数据中间)嵌入0的(二进制数据)。Tables是关联数组,可以被任意值(除了nil)索引,并且可以包含任意值。Functions可以是Lua函数也可以是C函数,(要求这些)C函数是按照Lua虚拟机的接口协议来编写的。Userdata是指向用户内存的指针,而且有两种形式:重型,(内存)块是由Lua分配的,并且交由gc处理(内存回收),轻型,(内存)块是由用户分配和释放的。最后,threads表示协程。所有类型的值都是第一类型值:我们可以将他们存储在全局变量、局部变量和table的单元中,将他们作为参数传给functions,从functions返回,等等。

typedef struct {    
    int t;    Value v;
} TObject;
typedef union {    
    GCObject *gc;    
    void *p;    
    lua_Number n;    
    int b;
} Value;

Figure 1: Lua values are represented as tagged unions

Lua represents values as tagged unions, that is, as pairs (t; v), where t is an integer tag identifying the type of the value v, which is a union of C types implementing Lua types. Nil has a single value. Booleans and numbers are implemented as 'unboxed' values: v represents values of those types directly in the union. This implies that the union must have enough space for a double. Strings, tables, functions, threads, and userdata values are implemented by reference: v contains pointers to structures that implement those values. Those structures share a common head, which keeps information needed for garbage collection. The rest of the structure is specific to each type.
Lua用带标记的联合来表示值,也即,通过键值对(t;v)(来表示),其中,t是一个表示值v的类型的整数。这个(联合)是一个实现Lua类型的C语言联合(union 类型)。Nil只有一个值。Booleans和numbers作为'未装箱'值来实现的:v直接使用联合来表示这些类型的值。这意味着,这个联合必须拥有足够的空间来存储double。Strings,tables,functions,threads,以及userdata的值通过引用来实现:v包含执行实现这些值的指针。这些结构拥有一个通用头部,用来保存垃圾收集(gc)需要的信息。结构的其他部分是每个类型特有的。

Figure 1 shows a glimpse of the actual implementation of Lua values. TObject is the main structure in this implementation: it represents the tagged unions (t; v) described above. Value is the union that implements the values. Values of type nil are not explicitly represented in the union because the tag is enough to identify them. The field n is used for numbers (by default, lua_Number is double). The field b is used for booleans. The field p is used for light userdata. The field gc is used for the other values (strings, tables, functions, heavy userdata, and threads), which are those subject to garbage collection.
图1展示了Lua值实现的概况。TObject是这个实现中主要的结构:它表示上述带标记的联合(t;v)。Value是实现值的联合。类型nil的值并没有明确的在union中表示,因为标记就足以标示它们。(union中)的域n是给numbers(默认情况下,lua_Number是double)用的。域b是给booleans用的。域p是给轻型userdata用的。域gc是给其他类型用的(strings,tables,functions,重型userdata,以及threads),这些都由垃圾收集(gc)(来释放内存)。

One consequence of using tagged unions to represent Lua values is that copying values is a little expensive: on a 32-bit machine with 64-bit doubles, the size of a TObject is 12 bytes (or 16 bytes, if doubles are aligned on 8- byte boundaries) and so copying a value requires copying 3 (or 4) machine words. However, it is difficult to implement a better representation for values in ANSI C. Several dynamically-typed languages (e.g., the original implementation of Smalltalk80 [9]) use spare bits in each pointer to store the value's type tag. This trick works in most machines because, due to alignment, the last two or three bits of a pointer are always zero, and therefore can be used for other purposes. However, this technique is neither portable nor implementable in ANSI C. The C standard does not even ensures that a pointer fits in any integral type and so there is no standard way to perform bit manipulation over pointers.
使用带标记的联合来表示Lua值的一个结果是,拷贝值有点昂贵:在一个32位的机器上使用64位doubles,TObject的大小是12字节(或者16字节,如果double是8字节对齐的话),因此拷贝一个值需要拷贝3(或4)个机器字。然而,使用ANSI C实现一个更好的值表示方式是困难的。一些动态类型语言(如,Smalltalk80 [9]的原始实现)使用每个指针的备用位来存储值的类型。这种技巧在大多数机器上凑效,因为,由于对齐,指针的最后2到3位总是0,从而可以用于其他目的。然而,这种技术既不具备可移植性,也不具备ANSI C可实现性。C标准甚至没有保证一个指针符合任何整型,从而没有标准方法来完成指针上的位操作。

Another option to reduce the size of a value would be to keep the explicit tag, but to avoid putting a double in the union. For instance, all numbers could be represented as heap-allocated objects, just like strings. (Python uses this technique, except that it preallocates some small integer values.) However, that representation would make the language quite slow. Alternatively, integer values could be represented as unboxed values, directly inside the union, while foatingpoint values would go to the heap. That solution would greatly increase the complexity of the implementation of all arithmetic operations.
为了减少值的大小,另一个选择是保存明确的标记,但是避免将一个double放入联合中。例如,所有的numbers必须用堆分配的对象来表示,就像strings。(Python使用这种技术,所不同的是它(在内存池)预分配了一些小整数值。)然而,这种重新表示会使语言变得很慢。另外,整数值本该作为未装箱值出现,直接在联合中,而浮点数应该在堆上。这种方案会激增所有算术运算实现的复杂性。

Like earlier interpreted languages, such as Snobol [11] and Icon [10], Lua internalizes strings using a hash table: it keeps a single copy of each string with no duplications. Moreover, strings are immutable: once internalized, a string cannot be changed. Hash values for strings are computed by a simple expression that mixes bitwise and arithmetic operations, thus shuffling all bits involved. Hash values are saved when the string is internalized to allow fast string comparison and table indexing. The hash function does not look at all bytes of the string if the string is too long. This allows fast hashing of long strings. Avoiding loss of performance when handling long strings is important because they are common in Lua. For instance, it is usual to process files in Lua by reading them completely into memory into a single long string.
像早期的解释语言一样,如Snobol [11] 和Icon [10],Lua 使用hash表内化了strings:它使用每个string的单一拷贝而没有使用复制。此外,strings是不可变的:一旦被内化,一个string就不能修改。string的Hash值由一个简单的表达式计算,这个表达式混合了位处理和算术运算,从而使所有的位得到处理。当string内化时,Hash值得到保存,所以允许高速string比较和table索引。如果string太长,hash函数不会计算string的所有字节。这允许长string的快速hash。避免处理长string时的性能损失是非常重要的,因为这在Lua中很常见。例如,Lua中常见的处理文件的方式,是通过将他们完全读入内存,保存在单个的长string中。

4 Tables
4 Tables

Tables are the main in fact, the only data-structuring mechanism in Lua. Tables play a key role not only in the language but also in its implementation. Effort spent on a good implementation of tables is rewarded in the language because tables are used for several internal tasks, with no qualms about performance. This helps to keep the implementation small. Conversely, the absence of any other data-structuring mechanism places a pressure on an efficient implementation of tables.
Tables是Lua中主要的事实上,是唯一的数据结构实现机制。Tables不仅在语言上扮演关键角色,在实现上也是如此。在Lua中,花在一个好的tables实现上的努力是值得肯定的,因为tables用于一些内部任务,这些任务毫无疑问与性能有关。这会保证实现小巧。反过来,其他数据结构机制的缺乏,也给tables的高效实现带来了压力。

Tables in Lua are associative arrays, that is, they can be indexed by any value (except nil) and can hold values of any type. Moreover, they are dynamic in the sense that they may grow when data is added to them (by assigning a value to a hitherto non-existent field) and shrink when data is removed from them (by assigning nil to a field).
在Lua中,tables是关联数组,也就是说,他们可以被任意值(除了nil)索引,并且可以存储任意类型的值。更重要的是,当数据添加(通过对不存在的域赋值)以及数据删除(通过对一个域赋值nil)时,table可以扩张和收缩,他们是动态的。

Unlike many other scripting languages, Lua does not have an array type. Arrays are represented as tables with integer keys. The use of tables for arrays bring benefits to the language. The main one is simplicity: Lua does not need two different sets of operators to manipulate tables and arrays. Moreover, programmers do not have to choose between the two representations. The implementation of sparse arrays is trivial in Lua. In Perl, for instance, you can run out of memory if you try to run the program

$a[1000000000]=1;

as it triggers the creation of an array with one billion elements. The equivalent Lua program,

a={[1000000000]=1};

creates a table with a single entry.
与其他脚本语言不同的是,Lua没有数组类型。数组通过tables和整型的keys来表示。使用tables来表示数组为Lua带来了好处。最主要的一个就是简单:Lua不需要两个不同操作来操纵tables和数组。进而,程序员不需要在两种表示间做选择。Lua中稀疏数组的实现是不值一提的。在Perl中,例如,当你运行

$a[1000000000]=1;

时,可能会内存不够用,因为它将触发创建一个拥有10亿个元素的数组。相同的Lua程序

a={[1000000000]=1};

创建一个只包含一个项的table。

Figure 2: A Lua table

Until Lua 4.0, tables were implemented strictly as hash tables: all pairs were explicitly stored. Lua 5.0 brought a new algorithm to optimize the use of tables as arrays: it optimizes pairs with integer keys by not storing the keys and storing the values in an actual array. More precisely, in Lua 5.0, tables are implemented as hybrid data structures: they contain a hash part and an array part. Figure 2 shows a possible configuration for a table with the pairs

"x"->9.2,
1->100,
2->200,
3->300

Note the array part on the right: it does not store the integer keys. This division is made only at a low implementation level; access to table fields is transparent, even to the virtual machine. Tables automatically and dynamically adapt their two parts according to their contents: the array part tries to store the values corresponding to integer keys from 1 to some limit n. Values corresponding to non-integer keys or to integer keys outside the array range are stored in the hash part.
直到Lua 4.0,tables一直是实现成严格的hash表:所有的(键值)对都是明确存储的。Lua 5.0引入了一个新算法来优化使用tables来表示数组:它通过不存储键而在一个真正的数组中存储值,来优化拥有整型键的(键值)对。更准确的说,在Lua 5.0中,tables实现成混合数据结构:他们包含一个hash部分和一个数组部分。图2展示了一个可能的设置,这个table拥有键值对

"x"->9.2,
1->100,
2->200,
3->300

注意右边的数组部分:它并没有存储整数keys。这种分离仅仅是在底层实现的;访问table的域是透明的,甚至对于虚拟机(来说,也是如此)。talbes根据他们的内容,自动地、动态地适配这两部分:数组部分尝试存储key为1到某个有限的整数n相关的值。非整数key或整数key超出数组大小的值将会存储在hash部分。

When a table needs to grow, Lua recomputes the sizes for its hash part and its array part. Either part may be empty. The computed size of the array part is the largest n such that at least half the slots between 1 and n are in use (to avoid wasting space with sparse arrays) and there is at least one used slot between n=2+1 and n (to avoid a size n when n=2 would do). After computing the new sizes, Lua creates the new parts and re-inserts the elements from the old parts into the new ones. As an example, suppose that a is an empty table; both its array part and hash part have size zero. If we execute a[1]=v, the table needs to grow to accommodate the new key. Lua will choose n = 1 for the size of the new array part (with a single entry 1 -> v). The hash part will remain empty.
当一个table需要扩展时,Lua重新计算它的hash部分和数组部分。每个部分可能为空。数组部分的计算就是计算
1,使得从1到n至少一半的数组位置被使用时的最大n(为了避免稀疏数组浪费内存)
而且
2,使得在n=2+1到n之间,至少有一个位置被使用(避免当n=2时重新计算n)。
新的大小计算后,Lua创建新的部分,并将老部分中的元素重新插入新部分。例如,假设一个空table;数组部分和hash部分大小都为0。如果我们执行

a[1]=v,

这个talbe就要扩展来匹配新的key。Lua将会选择n=1作为新的数组部分(只有一项1->v)。hash部分仍然为空。

This hybrid scheme has two advantages. First, access to values with integer keys is faster because no hashing is needed. Second, and more important, the array part takes roughly half the memory it would take if it were stored in the hash part, because the keys are implicit in the array part but explicit in the hash part. As a consequence, if a table is being used as an array, it performs as an array, as long as its integer keys are dense. Moreover, no memory or time penalty is paid for the hash part, because it does not even exist. The converse holds: if the table is being used as an associative array, and not as an array, then the array part is likely to be empty. These memory savings are important because it is common for a Lua program to create many small tables, for instance when tables are used to implement objects.
这种混合模式有两个优点。第一,因为不需要hash计算,访问整数key的值将会更快。第二,更重要的是,数组部分占用内存几乎是使用hash来存储的一半,因为keys在数组部分是隐含的,而在hash部分是明确的。结果就是,如果一个table用作数组,只要它的整数key是稠密的,它就是一个数组。更重要的是,没有内存和时间消耗花在hash部分,因为它并不存在。反过来就是:如果table用作关联数组,而不是数组,那么数组部分将是空的。这些内存节省是重要的,因为Lua程序创建大量小的table是很常见的,例如使用talbe来实现对象的时候。

The hash part uses a mix of chained scatter table with Brent's variation[3]. A main invariant of these tables is that if an element is not in its main position (i.e., the original position given by its hash value), then the colliding element is in its own main position. In other words, there are collisions only when two elements have the same main position (i.e., the same hash values for that table size). There are no secondary collisions. Because of that, the load factor of these tables can be 100% without performance penalties.
hash部分采用包含Brent变更的混合散链表。这些表主要不变的是,如果一个元素不在它的主位置(即,通过hash值求得的原始位置),那么来碰撞的原始key将在它的主位置。换句话说,当且仅当两个元素有相同的主位置(即,这种table大小下有相同的hash值)时,才会有碰撞。没有第二种碰撞。因此,这些table的加载对性能消耗100%无影响。

5 Functions and Closures
5 函数和闭包

When Lua compiles a function it generates a prototype containing the virtual machine instructions for the function, its constant values (numbers, literal strings, etc.), and some debug information. At run time, whenever Lua executes a function...end expression, it creates a new closure. Each closure has a reference to its corresponding prototype, a reference to its environment (a table wherein it looks for global variables), and an array of references to upvalues, which are used to access outer local variables.
当Lua编译一个function时,它为这个function生成一个包含虚拟机指令的原型,function的常量(numbers,字面strings等等),以及一些调试信息。运行时,只要Lua执行一个function...end表达式,它就创建一个闭包。每个闭包拥有一个指向对应原型的引用,一个指向环境信息的引用(一个用于查找全局变量的table),以及一个指向upvalues的引用,upvalues用来访问外部局部变量。

The combination of lexical scoping with first-class functions creates a well known difficulty for accessing outer local variables. Consider the example in Figure 3. When add2 is called, its body accesses the outer local variable x (function parameters in Lua are local variables). However, by the time add2 is called, the function add that created add2 has already returned. If x was created in the stack, its stack slot would no longer exist.

function add (x)
  return function (y)
           return x+y
         end
end
add2 = add(2)
print(add2(5))

Figure 3: Access to outer local variables

使用第一类型functions词法范围内的组合产生了一个众所周知的访问外部局部变量的难题。考虑图3所示的例子。当 add2 被调用时,它的函数体访问外部局部变量x(函数参数在Lua中是局部变量)。然而,当add2被调用时,创建add2的函数add已经返回。如果x是在栈中创建的,它的栈空间已不存在。

Most procedural languages avoid this problem by restricting lexical scoping (e.g., Python), not providing first-class functions (e.g., Pascal), or both (e.g., C). Functional languages do not have those restrictions. Research in non-pure functional languages like Scheme and ML has created a vast body of knowledge about compilation techniques for closures (e.g., [19, 1, 21]).[3] However, those works do not try to limit the complexity of the compiler. For instance, just the control-flow analysis of Bigloo, an optimizer Scheme compiler [20], is more than ten times larger than the whole Lua implementation: The source for module Cfa of Bigloo 2.6f has 106,350 lines, versus 10,155 lines for the core of Lua 5.0. As explained in Section 2, Lua needs something simpler.
大多数面向过程语言通过来避免这个问题:
1,限制语法返回(如,Python),
2,不提供第一类型functions(如,Pascal),
3,或者两者(如,C)。
函数式语言没有这些限制。研究非纯粹函数式语言,如Scheme和ML,产生了关于闭包编译技术的大量知识(例如,[19,1,21])。然而,这些工作没有尝试限制编译器的复杂性。例如,仅就Bigloo的控制流分析,一个Scheme编译器的优化器,就比整个Lua实现大10倍:Bigloo 2.6f 的 module Cfa有106,350行代码,而Lua 5.0核心才10,155行代码。Lua需更简单,这在Section 2已解释。

Lua uses a structure called an upvalue to implement closures. Any outer local variable is accessed indirectly through an upvalue. The upvalue originally points to the stack slot wherein the variable lives (Figure 4, left). When the variable goes out of scope, it migrates into a slot inside the upvalue itself (Figure 4, right). Because access is indirect through a pointer in the upvalue, this migration is transparent to any code that reads or writes the variable. Unlike its inner functions, the function that declares the variable accesses it as it accesses its own local variables: directly in the stack.


Figure 4: An upvalue before and after being closed

Lua使用一个叫做upvalue的结构体来实现闭包。任何外部局部变量通过一个upvalue间接访问。upvalue起初指向变量存活的栈区(图4,左)。当变量超出作用域,它合并到upvalue自己的一个区中(图4,右)。由于访问时通过一个upvalue中的指针间接进行的,这种合并对于任何读写变量的代码都是透明的。不像内部函数,申明变量的函数访问它就像它访问自己的局部变量:直接在栈上(访问)。

Mutable state is shared correctly among closures by creating at most one upvalue per variable and reusing it as needed. To ensure this uniqueness, Lua keeps a linked list with all open upvalues (that is, those that still point to the stack) of a stack (the pending vars list in Figure 4). When Lua creates a new closure, it goes through all its outer local variables. For each one, if it can find an open upvalue in the list, it reuses that upvalue. Otherwise, Lua creates a new upvalue and links it in the list. Notice that the list search typically probes only a few nodes, because the list contains at most one entry for each local variable that is used by a nested function. Once a closed upvalue is no longer referred by any closure, it is eventually garbage collected.
通过每个变量最多创建一个upvalu,并且按需重用它,使得可变的状态正确的在闭包间共享。为了保证这种一致性,Lua保存了一个关于栈(有待处理的变量列表见图4)拥有所有开放upvalues(即,仍然指向栈)的链表.当Lua创建一个新的闭包时,它遍历所有它的外表局部变量。对于每一个变量,如果能够在链表中找到一个开放upvalue,就重用upvalue。否则,Lua创建一个新的upvalue,并且将它加入链表。注意,链表的查找只是测试少量节点,因为对于每个被嵌套函数使用的局部变量来说,链表至多包含一项。一旦一个闭合upvalue不再被任何闭包引用,最终会(被作为)垃圾收集(gc)。

It is possible for a function to access an outer local variable that does not belong to its immediately enclosing function, but to an outer function. In that case, even by the time the closure is created, the variable may no longer exist in the stack. Lua solves this case by using at closures [5]. With at closures, whenever a function accesses an outer variable that is not local to its enclosing function, the variable also goes to the closure of the enclosing function. Thus, when a function is instantiated, all variables that go into its closure are either in the enclosing function's stack or in the enclosing function's closure.
一个函数,访问外部变量是可能的,这个外部变量不属于它的中间包含函数,但属于一个外部函数。在这种情况下,即使在闭包创建时,变量也可能不在栈上。Lua通过即时闭包(at closures)解决这种问题。使用即时闭包(at closures),一个函数只要访问不位于包含函数中的外部变量,这个变量也会进入包含函数的闭包。所以,当一个函数实例化时,它的闭包使用到的所有变量要么在其包含函数的栈上,要么在包含函数的闭包中。

6 Threads and Coroutines
6 线程和协程

Since version 5.0, Lua implements asymmetric coroutines (also called semisymmetric coroutines or semi-coroutines) [7]. Those coroutines are supported by three functions from the Lua standard library: create, resume, and yield. (These functions live in the coroutine namespace.) The create function receives a 'main' function and creates a new coroutine with that function. It returns a value of type thread that represents that coroutine. (Like all values in Lua, coroutines are first-class values.) The resume function (re)starts the execution of a given coroutine, calling its main function. The yield function suspends the execution of the running coroutine and returns the control to the call that resumed that coroutine.
从Lua 5.0版开始,Lua实现了非对称协程(也叫做半对称协程或半协程)[7]。这些协程由Lua标准库的3个函数提供支持:create,resume,和yield。(这些函数位于coroutine命名空间。)create函数接受一个'main'函数,并且使用这个函数创建几个新的协程。它返回一个表示协程的thread类型的值。(与所有Lua中的值一样,协程是第一类型值。)resume函数(重新)启动一个给定的协程,调用它的main函数。yield函数挂起正在运行协程的执行,并且返回控制权给重新启动这个协程的调用。

Conceptually, each coroutine has its own stack. (Concretely, each coroutine has two stacks, as we shall discuss in Section 7, but we can consider them as a single abstract stack.) Coroutines in Lua are stackful, in the sense that we can suspend a coroutine from inside any number of nested calls. The interpreter simply puts aside the entire stack for later use and continues running on another stack. A program can restart any suspended coroutine at will. The garbage collector collects stacks whose coroutines are no longer accessible.
从概念上说,每个协程都有它自己的栈。(具体地,每个协程有两个栈,我们将在Section 7讨论,然而,我们可以把它们看作单个的抽象栈。我们可以从任意多层调用中挂起一个协程,在这种情况下,Lua中的协程有很多栈。解释器简单地将整个栈放在一边为后续使用,在另一个栈上继续运行。一个程序可以根据需要重启任何挂起的协程。垃圾收集器会收集那些栈,他们的协程不可访问。

The combination of stackfulness and first-class status makes coroutines, as implemented in Lua, is equivalent to one-shot continuations. As such, they allow programmers to implement several advanced control mechanisms, such as cooperative multithreading, generators, symmetric coroutines, backtracking, etc.[7].
如Lua的实现,多栈以及第一类状态的组合,使得协程等价于一次性的延续。因此,它们允许程序员实现许多高级控制机制,如协作多线程,生成器,对称协程,回溯等等。[7]

A key point in the implementation of coroutines in Lua is that the interpreter cannot use its internal C stack to implement calls in the interpreted code. (The Python community calls an interpreter that follows that restriction a stackless interpreter [23].) When the main interpreter loop executes a call operation, it creates a new slot in the stack, adjusts several pointers, and continues the loop with the instructions of the called function. Similarly, a return operation removes the top stack slot, adjusts pointers, and continues the loop with the instructions of the calling function. Not by coincidence, that is exactly what a real CPU does to perform function calls.
Lua中实现协程的关键点是,解释器无法使用内部C栈来实现被解释代码中的调用。(Python社区称一个满足这个约束的解释器为stackless解释器[23]。)当主要的解释器循环执行一个调用操作,它创建一个新的栈区,调整一些指针,并且通过被调函数的指令来继续当前循环。相似的,一个返回操作去掉顶部的栈区,调整指针,并且通过调用函数的指令来继续当前循环。无独有偶,这正是真正的CPU在实现函数调用时做的。

When the interpreter executes a resume, however, it does a recursive call to the main interpreter function. This new invocation is responsible for executing the resumed coroutine, using the coroutine stack to perform calls and returns. When this new loop executes an yield, it returns to the previous interpreter invocation, leaving the coroutine stack with any pending calls. In other words, Lua uses the C stack to keep track of the stack of active coroutines at any given time. Each yield returns to the previous interpreter loop, which is the one that called the corresponding resume.
然而,当解释器执行一个resume(操作),它进行了一个对main解释器函数的递归调用。这个新的调用有义务执行那个resume的协程,使用协程的栈进行调用和返回。但这个新的循环执行一个yield(操作)时,它返回到前一个计时器调用,把所有的等待调用留给协程栈。话句话说,在任意给定时刻,Lua使用C栈来保存活动协程的栈跟踪。每个yield返回前一个解释器循环,这个循环就叫做对应resume。

A source of difficulties in the implementation of coroutines in some languages is how to handle references to outer local variables. Because a function running in a coroutine may have been created in another coroutine, it may refer to variables in a diffierent stack. This leads to what some authors call a cactus structure [18]. The use of at closures, as we discussed in Section 5, avoids this problem altogether.
在某些语言中,实现协程的一个难点是如何处理外部局部变量的引用。因为运行在协程中的函数可能在另外的协程中创建,它可能引用其他栈中的变量。这会导致某些作者称作的cactus结构[18]。使用我们在Section 5讨论到的即时闭包(at closures),(可以)完全避免这个问题。

7 The Virtual Machine
7 虚拟机

Lua runs programs by first compiling them into instructions ('opcodes') for a virtual machine and then executing those instructions. For each function that Lua compiles it creates a prototype, which contains an array with the opcodes for the function and an array of Lua values (TObjects) with all constants (literal strings and numerals) used by the function.
Lua通过将程序编译成虚拟机指令('opcodes'),然后执行这些指令的方式来运行程序。对每个Lua编译的function,它会创建一个原型,这个原型包含一个拥有function的opcodes数组,以及包含一个拥有所有被function使用的常量(字面strings和numbers)的Lua值(TObjects)数组。

For ten years (since 1993, when Lua was first released), Lua used a stackbased virtual machine, in various incarnations. Since 2003, with the release of Lua 5.0, Lua uses a register-based virtual machine. This register-based machine also uses a stack, for allocating activation records, wherein the registers live. When Lua enters a function, it preallocates from the stack an activation record large enough to hold all the function registers. All local variables are allocated in registers. As a consequence, access to local variables is specially efficient.
10年来(从1993年,Lua第一次发布始),Lua以不同的方式使用基于栈的虚拟机。从2003年开始,随着Lua 5.0的发布,Lua使用基于寄存器的虚拟机。这个基于寄存器的虚拟机也使用一个栈,来分配活动的记录,寄存器就在其中。当Lua进入一个function,它预先在栈上分配一个足以容纳function寄存器的活动记录。所有的变量都在寄存器中分配。因此,访问局部变量非常高效。

Register-based code avoids several 'push' and 'pop' instructions that stackbased code needs to move values around the stack. Those instructions are particularly expensive in Lua, because they involve the copy of a tagged value, as discussed in Section 3. So, the register architecture both avoids excessive copying of values and reduces the total number of instructions per function. Davis et al. [6] argue in defense of register-based virtual machines and provide hard data on the improvement of Java bytecode. Some authors also defend register-based virtual machines based on their suitability for on-the-fly compilation (see [24], for instance).
基于寄存器的代码避免多个'push'和'pop'指令,基于栈的代码需要这些指令在栈上移动值。在Lua中,这些指令时极其昂贵的,因为它们涉及拷贝带标记的值,如Section 3讨论到的(那样)。因此,寄存器架构既避免了大量的值拷贝,又减少了每个function的总指令数。Davis等[6]反对基于寄存器的虚拟机而且提供了改进java字节码的严格数据。一些作者也反对基于栈的虚拟机,因为它们要适应 空中编译(如[24])。

There are two problems usually associated with register-based machines: code size and decoding overhead. An instruction in a register machine needs to specify its operands, and so it is typically larger than a corresponding instruction in a stack machine. (For instance, the size of an instruction in Lua's virtual machine is four bytes, while the size of an instruction in several typical stack machines, including the ones previously used by Lua, is one or two bytes.) On the other hand, register machines generate less opcodes than stack machines, so the total code size is not much larger.
通常,有两个与基于寄存器虚拟机相关的问题:代码大小和过度解码。一条寄存器虚拟机中的指令需要特化它的操作数,所以它会比对应的栈虚拟机中的指令要大得多。(例如,Lua虚拟机中的一条指令是4字节,而其它栈虚拟机指令,包括以前Lua中用到的,为1或2字节。)另一方面,寄存器虚拟机产生比栈虚拟机少的操作码,所以总的代码大小不会大多少。

Most instructions in a stack machine have implicit operands. The corresponding instructions in a register machine must decode their operands from the instruction. Such decoding adds overhead to the interpreter. There are several factors that ameliorate this overhead. First, stack machines also spend some time manipulating implicit operands (e.g., to increment or decrement the stack top). Second, because in a register machine all operands are inside the instruction and the instruction is a machine word, the operand decoding involves only cheap operations, such as logical operations. Moreover, instructions in stack machines frequently need multi-byte operands. For instance, in the Java VM, goto and branch instructions use a two-byte displacement. Due to alignment, the interpreter cannot fetch such operands at once (at least not with portable code, where it must always assume worst-case alignment restrictions). On a register machine, because the operands are inside the instruction, the interpreter does not have to fetch them independently.
大部分栈虚拟机指令都有隐含操作数。对应的寄存器虚拟机指令必须从指令中解码它们的操作数。这些解码(运算)增加解释器开销。有多种因素改进这个开销。首先,栈虚拟机也要在操作隐含操作数上花时间(例如,为了抬高或降低栈顶)。其次,由于在寄存器虚拟机中,所有的操作数都在指令中,而且指令时一个机器字,操作数解码运算只引入低开销运算,比如逻辑运算。更重要的是,栈虚拟机指令大多需要多字节操作数。例如,在Java 虚拟机中,跳转和分支指令使用2字节置换。由于对齐,解释器不能一次性获取这些操作数(至少不能使用可移植代码,这里必须总是假设最坏情况的对齐约束。)在寄存器虚拟机中,由于操作数在指令中,解释器就不需要单独来获取。

There are 35 instructions in Lua's virtual machine. Most instructions were chosen to correspond directly to language constructs: arithmetic, table creation and indexing, function and method calls, setting variables and getting values. There is also a set of conventional jump instructions to implement control structures. Figure 5 shows the complete set, together with a brief summary of what each instruction does, using the following notation: R(X) means the Xth register. K(X) means the Xth constant. RK(X) means either R(X) or K(X-k), depending on the value of X | it is R(X) for values of X smaller than k (a build parameter, typically 250). G[X] means the field X in the table of globals. U[X] means the Xth upvalue. For a detailed discussion of Lua's virtual machine instructions, see [14, 22].

MOVE        A B R(A)   := R(B)
LOADK       A Bx R(A)  := K(Bx)
LOADBOOL    A B C R(A) := (Bool)B; if (C) PC++
LOADNIL     A B R(A)   := ... := R(B) := nil
GETUPVAL    A B R(A)   := U[B]
GETGLOBAL   A Bx R(A)  := G[K(Bx)]
GETTABLE    A B C R(A) := R(B)[RK(C)]
SETGLOBAL   A Bx G[K(Bx)] := R(A)
SETUPVAL    A B U[B]   := R(A)
SETTABLE    A B C R(A)[RK(B)] := RK(C)
NEWTABLE    A B C R(A) := {} (size = B,C)
SELF        A B C R(A+1) := R(B); R(A) := R(B)[RK(C)]
ADD         A B C R(A) := RK(B) + RK(C)
SUB         A B C R(A) := RK(B) - RK(C)
MUL         A B C R(A) := RK(B) * RK(C)
DIV         A B C R(A) := RK(B) / RK(C)
POW         A B C R(A) := RK(B) ^ RK(C)
UNM         A B R(A)   := -R(B)
NOT         A B R(A)   := not R(B)
CONCAT      A B C R(A) := R(B) .. ... .. R(C)
JMP         sBx        PC += sBx
EQ          A B C if ((RK(B) == RK(C)) ~= A) then PC++
LT          A B C if ((RK(B) < RK(C)) ~= A) then PC++
LE          A B C if ((RK(B) <= RK(C)) ~= A) then PC++
TEST        A B C if (R(B) <=> C) then R(A) := R(B) else PC++
CALL        A B C R(A), ... ,R(A+C-2) := R(A)(R(A+1), ... ,R(A+B-1))
TAILCALL    A B C return R(A)(R(A+1), ... ,R(A+B-1))
RETURN      A B return R(A), ... ,R(A+B-2) (see note)
FORLOOP     A sBx R(A)+=R(A+2); if R(A) 
TFORLOOP    A C R(A+2), ... ,R(A+2+C) := R(A)(R(A+1), R(A+2));
TFORPREP    A sBx if type(R(A)) == table then R(A+1):=R(A), R(A):=next;
SETLIST     A Bx R(A)[Bx-Bx%FPF+i] := R(A+i), 1 <= i <= Bx%FPF+1
SETLISTO    A Bx
CLOSE A     close stack variables up to R(A)
CLOSURE     A Bx R(A) := closure(KPROTO[Bx], R(A), ... ,R(A+n))    

Figure 5: The instructions in Lua's virtual machine

Lua虚拟机有35条指令。入选的大部分指令和语言结构直接对应:算术计算,table创建和索引,function和方法调用,变量赋值和取值。也有一套常规的跳转指令用于实现控制结构。图5展示了完整的(指令)集合,附带着每个指令做什么的简单说明,使用的标注如下:R(X)表示第X个寄存器。K(X)表示第X个常量。RK(X)要么表示R(X),要么表示K(X-k),取决于X的值——对于小于k(一个编译参数,通常为250)的X来说是R(X)。G[X]表示全局table中的域X。U[X]表示第X个upvalue。关于Lua虚拟机指令的详细讨论,参考[14,22]。

Registers are kept in the run-time stack, which is essentially an array. Thus, access to registers is fast. Constants and upvalues are stored in arrays and so access to them is also fast. The table of globals is an ordinary Lua table. It is accessed via hashing but with good performance, because it is indexed only with strings (corresponding to variable names), and strings pre-compute their hash values, as mentioned in Section 2.
寄存器保存在运行时栈中,运行时栈是一个基本的数组。因此,访问寄存器是快速的。常量和upvalues保存在数组中,所以访问他们也是快速的。全局table是一个原生的Lua table。它由具有高性能hash操作来访问,因为它仅由strings(与变量名对应)来索引,而且如Section 2提到的,strings会预先计算hash值。

The instructions in Lua's virtual machine take 32 bits divided into three or four fields, as shown in Figure 6. The OP field identifies the instruction and takes 6 bits. The other fields represent operands. Field A is always present and takes 8 bits. Fields B and C take 9 bits each. They can be combined into an 18-bit field: Bx (unsigned) and sBx (signed).


Figure 6: Instruction layout

Lua虚拟机指令把32位分成3到4个域,如图6所示。域OP标记指令,占6位。其他域表示操作数。域A总是占8位。域B和C每个占9位。他们可以组成一个18位的域:Bx(无符号的)和sBx(有符号的)。

Most instructions use a three-address format, where A points to the register that will hold the result and B and C point to the operands, which can be either a register or a constant (using the representation RK(X) explained above). With this format, several typical operations in Lua can be coded in a single instruction. For instance, the increment of a local variable, such as a = a + 1, is coded as ADD x x y, where x represents the register holding the local variable and y represents the constant 1. An assignment like a = b.f, when both a and b are local variables, is also coded as the single instruction GETTABLE x y z, where x is the register for a, y is the register for b, and z is the index of the string constant 'f'. (In Lua, the syntax b.f is syntactic sugar for b['f'], that is, b indexed by the string 'f'.)
大部分指令使用3地址格式,A指向结果寄存器,B和C指向操作数,操作数可以是寄存器,也可以是常量(如上所释,使用RK(X)表示)。使用这种格式,Lua中一些常见的操作可以编码到单条指令中。例如,局部变量的增加,如a=a+1,可以编码为ADD x x y,这里的x表示寄存器,y表示常量1。一个赋值(操作)如 a=b.f,当a和b都是局部变量时,也可以编码为单条指令

GETTABLE x y z,

这里的x是寄存器存储a,y是寄存器存储b,z是string常量索引下标'f'。(在Lua中,格式b.f是b['f']的语法糖,即,string 'f'索引b。)

Branch instructions pose a difficulty because they need to specify two operands to be compared plus a jump offset. Packing all this data inside a single instruction would limit jump offsets to 256 (assuming a signed 9-bit field). The solution adopted in Lua is that, conceptually, a test instruction simply skips the next instruction when the test fails; this next instruction is a regular jump, which uses an 18-bit offset. Actually, because a test instruction is always followed by a jump instruction, the interpreter executes both instructions together. That is, when executing a test instruction that succeeds, the interpreter immediately fetches the next instruction and does the jump, instead of doing it in the next dispatch cycle. Figure 7 shows an example of Lua code and the corresponding bytecode. Note the structure of the conditional and jump instructions just described.

function max (a,b)  
    local m = a  
    if b > a then    
        m = b  
    end  
    return m
end
1 MOVE 2 0 0 ; R(2) = R(0)
2 LT 0 0 1 ; R(0) < R(1) ?
3 JMP 1 ; to 5 (4+1)
4 MOVE 2 1 0 ; R(2) = R(1)
5 RETURN 2 2 0 ; return R(2)
6 RETURN 0 1 0 ; return

Figure 7 Bytecode for a Lua function

分支指令遇到了一个困难,因为它们需要指定两个操作数比较,加上一个偏移地址跳转。将所有的这些参数打包到单条指令会使跳转偏移地址限制在256内(假定一个有符号整数使用用9位的域)。从概念上讲,Lua中的解决方案是,当测试指令失败时,简单的跳过下一条指令;下一条指令是一个固定的jump,这个jump使用18位偏移地址。事实上,由于一条测试指令总是伴随着一条jump指令,解释器一起执行两条指令。因此,当执行一条测试指令成功时,解释器立即获取下一条指令并且跳转,而不是在下一个分发循环中完成跳转。图7展示一段Lua代码和对应的字节码。注意刚刚描述的条件和跳转指令的结构。

Figure 8 shows a small sample of the optimizations performed by the Lua compiler. Figure 9 shows the same code compiled for Lua 4.0, which used a stack based virtual machine with 49 instructions. Note how the switch to a registerbased virtual machine allowed the generation of much shorter code. Each executable statement in this example compiles to a single instruction in Lua 5.0, but needs three or four instructions in Lua 4.0.

local a,t,i
a = a+i
a = a+1
a = t[i]
1: LOADNIL 0 2 0
2: ADD 0 0 2
3: ADD 0 0 250 ; 1
4: GETTABLE 0 1 2

Figure 8: Register-based opcode (Lua 5.0)

图8展示了Lua编译器优化的小例子。

local a,t,i
a = a+i
a = a+1
a = t[i]
1: PUSHNIL 3
2: GETLOCAL 0 ; a
3: GETLOCAL 2 ; i
4: ADD
5: SETLOCAL 0 ; a
6: GETLOCAL 0 ; a
7: ADDI 1
8: SETLOCAL 0 ; a
9: GETLOCAL 1 ; t
10: GETINDEXED 2 ; i
11: SETLOCAL 0 ; 

Figure 9: Stack-based opcode (Lua 4.0)

图9展示了Lua 4.0编译的代码,Lua 4.0使用基于栈的虚拟机,拥有49条指令。注意切换到基于寄存器的虚拟机如何允许生成更短小的代码。此例中,每个可执行语句在Lua 5.0里编译成单条指令,但在Lua 4.0里需要3到4条指令。

For function calls, Lua uses a kind of register window. It evaluates the call arguments in successive registers, starting with the first unused register. When it performs the call, those registers become part of the activation record of the called function, which therefore can access its parameters as regular local variables. When this function returns, those registers are put back into the activation record of the caller.
对于函数调用,Lua使用寄存器窗口。它在后续各寄存器(始于第一个未使用的寄存器)中评估调用参数。在完成call过程中,这些寄存器变成被调函数活动记录的一部分,因此被调函数可以把它的参数当做常规局部变量来访问。当函数返回时,这些寄存器放回到调用者的活动记录中。

Lua uses two parallel stacks for function calls. (Actually, each coroutine has its own pair of stacks, as we discussed in Section 6.) One stack has one entry for each active function. This entry stores the function being called, the return address when the function does a call, and a base index, which points to the activation record of the function. The other stack is simply a large array of Lua values that keeps those activation records. Each activation record keeps all temporary values of the function (parameters, local variables, etc.). Actually, we can see each entry in the second stack as a variable-size part of a corresponding entry in the first stack.
Lua在函数调用中使用两个并行栈。(事实上,每个协程都有自己的栈对,如我们在Section 6中讨论的。)一个栈对每个活动函数都有一个元素。这个元素保存被调函数,函数调用时的返回地址,以及指向函数的索引基址。另一个栈是个简单的Lua值大数组,保存那些活动记录。每个活动记录保存函数的所有临时值(参数,局部变量,等等)。事实上,我们可以把第二个栈的每个元素看作是对应的第一个栈元素的可变部分。

8 Conclusion
8 结论

In this paper we have presented the most innovative aspects of the implementation of Lua 5.0: its register-based virtual machine, the new algorithm for optimizing tables used as arrays, and the implementation of closures.
在这篇论文中,我们展示了Lua 5.0实现中最革新的部分:它基于寄存器的虚拟机,table用于数组时的新优化算法,以及闭包的实现。

To our knowledge, Lua is the first language in wide use to adopt a registerbased virtual machine. The optimization for tables allows a table to be partially implemented as an array when it is used that way (that is, when it has enough keys in a range 1... n). Its implementation of closures is also unique, combining the use of an array-based stack with lexically scoped first-order functions, without complex control-flow analysis.
据我们所知,Lua是第一个引入寄存器虚拟机广泛应用的语言。对table的优化允许一个table部分实现为一个数组,当作为数组那样使用的时候(即,它在1...n的返回内有足够的keys)。它的闭包的实现是独特的,组合了基于数组的栈,词法范围的一阶函数,而没有复杂的控制流程分析。

The table in Figure 10 shows some simple performance comparisons between the old implementation and the new one. The tests were run on an Intel Pentium IV machine with 512 Mbytes running Linux 2.6, with Lua compiled with gcc 3.3. Lua 4.0 uses neither the register-based virtual machine (its machine is stack based) nor the table-array optimization. Lua 5' is Lua 5.0 without table-array optimization, tail calls, and dynamic stacks (related to coroutines); Lua 5' is essentially Lua 4.0 with the new register-based virtual machine.
图10中的表格展示了老的实现和新的实现的性能比较。该测试在一台Intel Pentium IV 机器,拥有512M内存,运行Linux 2.6的机器上进行,Lua由gcc 3.3编译。Lua 4.0既没有使用基于寄存器虚拟机(它的虚拟机是基于栈的)也没有使用table-数组优化。Lua 5'为Lua 5.0未使用table-数组优化,尾调用,以及动态栈(与协程相关);Lua 5'基本上是Lua 5.0+新的基于寄存器虚拟机。

We took all test cases from The Great Computer Language Shootout [2], except the first one (sum), which is a simple loop to add all integers from 1 to n. This first test spends most of its time in the virtual machine; it shows that the new virtual machine can be more than twice as fast as the old one. The other tests spend more time in other tasks (function calls, table/array access, etc.), so the gain in the virtual machine has a smaller effect on the total time. In the tests that use arrays (sieve, heapsort, and matrix ), the combination of the new virtual machine with the new optimization for arrays can reduce the running time up to 40%.
除了第一个(求和)从1到n简单的循环相加外,我们通过伟大的计算机语言Shootout[2]来执行测试用例。第一次测试在虚拟机中花费最多的时间;它显示新虚拟机比旧虚拟机快2倍以上。其它测试在另外的任务上花更多时间(函数调用,table/数组访问等等。),所以,从虚拟机获得的(性能提升)对总时间影响较小。在使用数组的测试(过滤,堆排序,以及矩阵)中,新虚拟机和新数组优化的组合减少40%的运行时间。

The complete code of Lua 5.0 is available for browsing at Lua's web site: http://www.lua.org/source/5.0/.
Lua 5.0的完整代码可以通过浏览Lua的网站获取:http://www.lua.org/source/5.0/。

program Lua 4.0 Lua 5' Lua 5.0
sum(2e7) 1.23 0.54 (44%) 0.54 (44%)
fibo(30) 0.95 0.68 (72%) 0.69 (73%)
ack(8) 1.00 0.86 (86%) 0.88 (88%)
random(1e6) 1.04 0.96 (92%) 0.96 (92%)
sieve(100) 0.93 0.82 (88%) 0.57 (61%)
heapsort(5e4) 1.08 1.05 (97%) 0.70 (65%)
matrix(50) 0.84 0.82 (98%) 0.59 (70%)

Figure 10: Benchmarks (times in seconds; percentages are relative to Lua 4.0)

Acknowledgments
致谢

Edgar Toernig provided a key insight into the implementation of closures. Thatcher Ulrich made a previous implementation of coroutines in Lua 4.0 that worked as a proof-of-concept for our implementation in Lua 5.0. The authors are partially supported by grants from CNPq (grants 302608/2002-8, 300392/2003-6, and 401109/2003-8), FINEP (CT-INFO 01/2003), and Microsoft Research (2nd Rotor RFP).

References
参考文献

  1. A. W. Appel. Empirical and analytic study of stack versus heap cost for languages with closures. Journal of Functional Programming, 6(1):47-74, 1996.
  2. D. Bagley. The great computer language shootout. http://www.bagley.org/~doug/shootout/.
  3. R. P. Brent. Reducing the retrieval time of scatter storage techniques. Communications of the ACM, 16(2):105-109, 1973.
  4. A. Calpini. The great Win32 computer language shootout. http://dada.perl.it/shootout/.
  5. L. Cardelli. Compiling a functional language. In LISP and Functional Programming, pages 208-217, 1984.
  6. B. Davis, A. Beatty, K. Casey, D. Gregg, and J. Waldron. The case for virtual register machines. In Proceedings of the 2003 Workshop on Interpreters, Virtual Machines and Emulators, pages 41-49. ACM Press, 2003.
  7. A. de Moura, N. Rodriguez, and R. Ierusalimschy. Coroutines in Lua. Journal of Universal Computer Science, 10(7):910{925, July 2004.
  8. M. A. Ertl and D. Gregg. The structure and performance of efficient interpreters. Journal of Instruction-Level Parallelism, 5:1{25, Nov. 2003.
  9. A. Goldberg and D. Robson. Smalltalk-80: the language and its implementation. Addison-Wesley, 1983.
  10. R. E. Griswold and M. T. Griswold. The Implementation of the Icon Programming Language. Princeton University Press, 1986.
  11. R. E. Griswold, J. F. Poage, and I. P. Polonsky. The SNOBOL 4 Programming Language. Prentice-Hall, 1971.
  12. R. Ierusalimschy, L. H. de Figueiredo, and W. Celes. The evolution of an extension language: A history of Lua. In Proceedings of V Brazilian Symposium on Programming Languages, pages B-14-B-28, 2001.
  13. S. C. Johnson. YACC: Yet another compiler compiler. CS TR 32, Bell Labs, July 1975.
  14. K.-H. Man. A no-frills introduction to Lua 5 VM instructions. http://luaforge.net/docman/?group_id=83.
  15. S. Pemberton and M. Daniels. Pascal Implementation: The P4 Compiler and Interpreter. Ellis Horwood, 1982.
  16. I. Piumarta and F. Riccardi. Optimizing direct threaded code by selective inlining. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 291-300, Montreal, Canada, June 1998.
  17. A. Randal, D. Sugalski, and L. Toetsch. Perl 6 and Parrot Essentials. O'Reilly, second edition, 2004.
  18. M. L. Scott. Programming Language Pragmatics. Morgan Kaufmann, 2000.
  19. M. Serrano. Control flow analysis: a functional language compilation paradigm. In 10th ACM Symposium on Applied Computing, pages 118-122, Nashville, TN, Feb. 1995.
  20. M. Serrano and P. Weis. Bigloo: A portable and optimizing compiler for strict functional languages. In 2nd Static Analysis Symposium, pages 366-381, Glasgow, Scotland, Sept. 1995. LNCS 983.
  21. Z. Shao and A. W. Appel. Space-efficient closure representations. In ACM Conference on Lisp and Functional Programming, pages 150-161, June 1994.
  22. Z. A. Shaw. The Lua virtual machine. http://www.zedshaw.com/writings/luas-lvm-instructions/luas_lvm_instructions/luas_lvm_instructions.html.
  23. C. Tismer. Continuations and stackless Python. In Proceedings of the 8th International Python Conference, Arlington, VA, 2000.
  24. P. Winterbottom and R. Pike. The design of the Inferno virtual machine. http://www.herpolhode.com/rob/hotchips.html.

-EOF-