Meterstick: Benchmarking Performance Variability in Cloud and Self-Hosted Minecraft-like Games (Extended Technical Report)


Due to increasing popularity and strict performance requirements, online games have become a cloud-based and self-hosted workload of interest for the performance engineering community. One of the most popular types of online games is the Minecraft-like Game (MLG), in which players can terraform the environment. The most popular MLG, Minecraft, provides not only entertainment, but also educational support and social interaction, to over 130 million people world-wide. MLGs currently support their many players by replicating isolated instances, each of which supports only up to a few hundred players under favorable conditions. In practice, as we show here, the real upper limit of supported players can be much lower. In this work, we posit that performance variability is a key cause for the lack of scalability in MLGs, investigate experimentally the causes of performance variability, and derive actionable insights. We propose an operational model for MLGs, which extends the state-of-the-art with essential aspects, e.g., the consideration of environment-based workloads, which are sizable workload components that do not depend on player input (once set in action). Starting from this model, we design the first benchmark that focuses on MLG performance variability, defining specialized workloads, metrics, and processes. We conduct real-world benchmarking of MLGs, both cloud-based and self-hosted. We find that environment-based workloads and cloud deployment are significant sources of performance variability: peak latency degrades sharply to 20.7 times the arithmetic mean, and exceeds the performance requirements by a factor of 7.4. We derive actionable insights for game developers, game operators, and other stakeholders to tame performance variability.






The gaming industry is the world’s largest entertainment industry [1]: world-wide, games engage over 3 billion players and yield over $170 billion in revenue [2].



In this work, we focus on MLGs, an emergent and highly popular type of game where users can change almost every part of the environment. The canonical example of an MLG is Minecraft, which is already the best-selling game of all time [3].



All MLGs, including Minecraft, present an important challenge to the performance engineering community: although their user-bases can exceed 100 million active users per month (Minecraft has over 130 million active users per month [4], more than, e.g., macOS [5]), their scalability is limited to only 200-300 players under favorable conditions [6]. (MLGs support high concurrency by creating separate replicas of their virtual worlds, essentially sharding state and not allowing cross-instance interaction.) What limits MLG scalability? In this work, we posit performance variability is a key limit to MLG scalability, and design and use an MLG benchmark focusing on this concept.



MLGs represent an important and unique class of online multiplayer games. Most importantly, MLGs allow players to create, modify, and remove the game’s objects (e.g., player apparel) and geographical features (e.g., terrain) [7]. Players can and do significantly change the virtual environment by building (e.g., creating cities) and terraforming (e.g., moving mountains). Moreover, some game objects and features are self-acting, that is, they act even when no player input is applied to them; players can use them to create dynamic elements, by “programming” the environment with combinations of self-acting parts.



MLGs are also a rich class of software artifacts: although Minecraft is the best known example of an MLG, to-date, thousands of commercial MLGs exist and a rich modding community has also emerged.



Beyond gaming, MLGs also act as platforms for many societally relevant activities, e.g., as safe-havens for press fighting against real-world censorship [8], as collaborative urban-design tools [9], as visualization tools for metabolic pathways [10], and as safe-spaces for autistic people [11, 12].



In this work, we posit and show empirical evidence that current MLGs experience significant performance variability. Figure 1 gives an example of performance variability in MLGs: even with a single connected player, the game’s performance varies from good (response time below 60 ms) to unplayable (above 118 ms). (We discuss this and similar real-world experiments in Section V-B.) Performance variability prevents MLG service providers from giving strict Quality of Service (QoS) guarantees, and simultaneously incentivizes overprovisioning of resources and limiting the number of players that can interact together. For example, Minecraft Realms, a cloud-based Minecraft service offered by Microsoft, limits the maximum number of players in a single instance to only ten (10)!



By contrast, Hypixel, the largest Minecraft server with a record of 216,762 online players [13], stitches together thousands of MLG instances using specialized tools to achieve this player count, and players in different instances cannot interact.



Although emerging industry trends such as Dynamic Resolution Scaling (DRS) [14, 15] address performance variability challenges in video rendering, performance variability in the interactive simulation of virtual worlds is poorly understood.



Related work on evaluating MLG performance has shown that MLGs do not scale well, supporting at most 200-300 players under favorable conditions [6]. However, the state-of-the-art does not consider or explain MLG performance variability, which can negatively affect the player’s experience, even when the number of players is far below the supported maximum.



Addressing this important gap, in this work we evaluate the performance variability of popular MLGs to better understand its causes and how to limit its impact. Our main contribution is four-fold:



C1 An operational model of MLGs, including a new model for MLG-specific workloads (Section II).



Because MLGs allow players to terraform the terrain, they support new types of workload not available in traditional online games.



C2 The design of Meterstick, a benchmark that quantifies performance variability in MLGs (Section III). To this end, we propose a novel performance variability metric, and define a benchmarking approach to produce it experimentally. Our benchmark supports common deployment environments for MLGs offered as a service, in particular, both cloud-based and self-hosted. Our benchmark is the first to quantify performance variability in MLGs.



C3 Real-world experiments using Meterstick (Section V) leading to actionable insights (Section VI). We evaluate the performance variability of three popular MLGs, running on two popular commercial cloud providers and one local compute cluster. We investigate how their performance variability is affected by MLG-specific workloads and their deployment-environment, and derive actionable insights.



C4 Adhering to open-science and reproducibility principles, we publish Findable, Accessible, Interoperable, and Reusable (FAIR [16]) data and Free-access Open-Source Software (FOSS) artifacts, available on Zenodo [17].



II Operational Model of Minecraft-like Games



For contribution C1, first, we summarize a state-of-the-art operational model and the related reference architecture for MLGs (Section II-A). Second, we define environment-based workloads as MLG-specific workloads, caused by terrain and entity simulation; Section II-B defines the resulting MLG workload model. Third, we model the operational elements of these workloads (Section II-C).



II-A Reference Architecture for MLGs



Earlier work defines a high-level reference architecture for MLGs [6] (see Figure 2).



Overall, MLGs use a client-server architecture and are commonly deployed in cloud environments. Players run a client on their own device, which connects to a server instance running in the cloud. Popular cloud providers such as Amazon AWS and Microsoft Azure provide tutorials for running these servers on their platform. Microsoft, the company that currently owns Minecraft, markets it as a cloud-based service through Minecraft Realms, which offers players a fully-managed Minecraft instance for a monthly fee [18]. Additionally, many smaller companies offer MLGs as a service; Section V-A gives an overview of such companies.



The client (1) has two main tasks. First, it translates player input into in-game actions, which it speculatively applies to the local state and also sends to the server for validation. The client-server communication uses an implementation-specific protocol (4) that may be used by different MLGs. Second, the client renders the game state for visual display to the player, at a fixed frequency.



The server (2) is responsible for performing all in-game (virtual-world) simulations, maintaining the global state, and disseminating state-updates to clients. The game loop (3) performs simulations by applying state-updates to the global state in discrete steps (ticks), at a fixed frequency. In MLGs, this frequency is set to 20 Hz, or 50 ms per tick. If a tick takes less than 50 ms, the MLG waits for the next scheduled tick start. However, if a tick exceeds 50 ms, the tick frequency drops below 20 Hz, and the server is said to be in an overloaded state. While in an overloaded state, the game fails to meet its QoS requirements, which can cause players to experience game stuttering, visual inconsistency (rubber-banding, where elements of the game appear to sporadically teleport), and increased input latency. Prior work has shown direct causality between increased input latency and reduced player experience [19, 20, 21].
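For illustration, the sketch below shows such a fixed-rate tick loop in Python: the loop waits when a tick finishes early, and flags an overloaded state when a tick exceeds the 50 ms budget. The `simulate_tick` callable is a placeholder, not the implementation of any specific MLG.

```python
import time

TICK_INTERVAL = 0.050  # 20 Hz target: 50 ms per tick

def run_game_loop(simulate_tick, num_ticks):
    """Run num_ticks iterations at a fixed 20 Hz rate, flagging overload."""
    next_deadline = time.monotonic()
    for _ in range(num_ticks):
        start = time.monotonic()
        simulate_tick()                        # apply all state updates for this tick
        duration = time.monotonic() - start
        next_deadline += TICK_INTERVAL
        sleep_for = next_deadline - time.monotonic()
        if sleep_for > 0:
            time.sleep(sleep_for)              # tick finished early: wait for the schedule
        else:
            # Tick exceeded its budget: the server is overloaded; re-anchor the schedule.
            print(f"overloaded: tick took {duration * 1000:.1f} ms")
            next_deadline = time.monotonic()
```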



MLGs generate workloads, both data- and compute-intensive, that do not exist in other types of games. In contrast to traditional games, MLGs allow modifications to the terrain. This requires the game server to simulate terrain changes and manage terrain state alongside the player state and entity state found in traditional games (5). In contrast to other types of state, terrain state can be both data-intensive, when the virtual world is large, and compute-intensive, because it can change without direct player input.



II-B Workloads in MLGs



This section presents our workload model for MLGs, which focuses on players, terrain, and entities. We discuss each of these components, in turn, focusing on unique and challenging aspects. We mark the novel aspects introduced by our research with “novel”.



Figure 3 presents a visual overview of our model. Beyond the state-of-the-art, our workload model captures environment-based workloads, which are caused by simulating the modifiable environment itself, and scale independently from the number of active players. In Figure 3, Terrain Simulation and Entities are examples of environment-based workloads.



II-B1 Workload from Players (known)



Players cause workload for MLGs, and games in general, through their actions. MLGs support player-actions found in traditional games, e.g., player movement and interactions, and also MLG-specific actions, e.g., to modify terrain. For player movement, the game computes collisions to prevent players from walking through obstacles such as walls, and disseminates location-changes to other players. Players can also interact with other players and entities (i.e., objects), for example by collecting resources and exchanging them with other players.



An important difference between MLGs and traditional games is support for player-actions that modify the terrain.



In MLGs, players can terraform: create, modify, and destroy the terrain, as well as the buildings standing on it. This can generate resource-intensive workloads in two ways. First, players can change a large part of the terrain in a short amount of time, for example through the use of explosives. This is both compute- and data-intensive, because the game needs to compute the new terrain, and communicate state updates to keep a consistent view across all players. Second, players can construct dynamic elements such as simulated constructs, which increase the complexity of the terrain simulation and are discussed in Section II-B2. The impact of player workloads has been previously examined both in the context of traditional video game architectures and for MLGs specifically [22, 6].



II-B2 Workload from Terrain Simulation (novel)



In contrast to traditional games, a significant part of the MLG workload can come from generating and simulating the terrain. MLGs typically present players with an endless open world. This world is split into areas, which are lazily generated when players come near them. Once the terrain is generated, the game simulates it and allows players to modify it.
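A minimal sketch of this lazy generation, assuming a chunk-keyed cache and a placeholder terrain generator (real MLGs use noise-based generation and larger view distances):

```python
import random

CHUNK_SIZE = 16      # blocks per chunk edge (the Minecraft convention)
VIEW_DISTANCE = 2    # chunks kept generated around each player

chunks = {}          # (cx, cz) -> chunk data, filled in lazily

def generate_chunk(cx, cz):
    # Placeholder terrain generator; a real MLG runs noise-based generation here.
    rng = random.Random(hash((cx, cz)))
    return [[rng.randint(0, 4) for _ in range(CHUNK_SIZE)] for _ in range(CHUNK_SIZE)]

def ensure_chunks_around(player_x, player_z):
    """Lazily generate every chunk within VIEW_DISTANCE of the player."""
    pcx, pcz = player_x // CHUNK_SIZE, player_z // CHUNK_SIZE
    for cx in range(pcx - VIEW_DISTANCE, pcx + VIEW_DISTANCE + 1):
        for cz in range(pcz - VIEW_DISTANCE, pcz + VIEW_DISTANCE + 1):
            if (cx, cz) not in chunks:
                chunks[(cx, cz)] = generate_chunk(cx, cz)
```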



We identify four important components of terrain simulation: physics, lighting, plant growth, and simulated constructs. Although physics and lighting simulations are present in traditional games, the modifiable nature of the terrain makes it significantly more challenging to manage such constructs in MLGs. Unlike static environments, where physics simulation only needs to happen for the relatively few entities that can move through the world, MLGs need to perform physics simulations on the many blocks that compose the terrain itself. For example, a bridge can collapse when a player removes its support pillars, or the terrain underneath them. Once the bridge has collapsed, the bridge no longer casts shadow, so the simulator needs to recompute lighting (frequently) at runtime; static environments do not have this dynamic workload.



Plant growth is an example of a dynamic element unique to MLGs. In modifiable worlds, plants and trees change over time, reshaping the nearby terrain, thus generating new workload.



Through terrain modification, players can create simulated constructs. In a simulated construct, players place together dynamic elements (e.g., plants, automatic croppers) to achieve a certain goal. For example, many players build irrigation systems that grow and harvest vegetables automatically, with high yield. Such systems can leverage tens to hundreds of dynamic elements, whose interaction generates compute-intensive workload for terrain simulation.



In MLGs, as we show in Section V, even a single player can overload the game simulator. This is in part because, in MLGs, a single player can trigger complex simulations, for example, by creating simulated constructs of arbitrary size. By contrast, in traditional games, only the number of concurrent players is strongly correlated with workload intensity.



II-B3 Workload from Entities (novel)



An entity is an object that exists in the virtual world but is not a player or terrain. Examples include Non-Playable Characters (NPCs), mobiles (i.e., mobs), and items (e.g., a sword). Entities can typically move or be moved by players and collide with each other. Here we describe two important aspects of entity simulation which are challenging for MLGs. First, games typically instantiate entities at spawn points, e.g., spawn an NPC at a spawn point in a dark cave when a player is about to enter. In contrast to static environments, where game developers typically place these spawn points manually, MLGs need to compute spawn points dynamically because terrain modification may obstruct existing spawn points.



Second, NPCs use path-finding algorithms to move around the map. Static worlds pre-compute overlay graphs with viable NPC locations, improving computational efficiency. In contrast, MLGs have changing terrain, so they must compute path-finding graphs dynamically, leading to additional compute-intensive workload.
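To make this recomputation concrete, the sketch below shows a grid path search whose passability check reads the live terrain state, so any nearby terrain update simply forces a fresh search. It uses plain breadth-first search for brevity; production MLGs use optimized variants such as A* (an assumption about typical practice, not a claim about any specific game).

```python
from collections import deque

def find_path(passable, start, goal):
    """Breadth-first path search over the current terrain grid.

    `passable(cell)` consults the live, modifiable terrain; after a terrain
    state update near an NPC, the path must be recomputed by calling this again.
    """
    frontier = deque([start])
    came_from = {start: None}
    while frontier:
        x, z = frontier.popleft()
        if (x, z) == goal:
            path, cell = [], goal
            while cell is not None:           # walk back through predecessors
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        for nxt in [(x + 1, z), (x - 1, z), (x, z + 1), (x, z - 1)]:
            if nxt not in came_from and passable(nxt):
                came_from[nxt] = (x, z)
                frontier.append(nxt)
    return None  # no path exists in the current terrain configuration
```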



II-C Operational Model of MLGs



We detail in this section the game loop used by MLGs. We define the operational model as the set of operations, and of the events triggering and linking them, of individual components in the implementation of the game loop. Novel in this work, we analyze the performance implications of the unique aspects of MLG workloads (see Section II-B) when executed with the MLG operational model.



Figure 4 depicts a constructed, generalized, and simplified operational model of the MLG game loop. To run the game loop, the game server orchestrates three main components, the Networking Queues (component 1 in Figure 4), the Game Loop (2), and the Game State (3), which we discuss in turn. The Networking Queues (1) buffer messages between the game clients and the server. When a client sends a player-action to the server, it is buffered in the incoming network queue until the next tick. When the server needs to send a state-update to one or several clients, it forwards the message to the networking queues, to be further buffered in the outgoing queue or sent directly to the client.



The Game Loop (2) simulates the virtual world and is the core of the game server. In an MLG, the game loop consists of three elements: players, the terrain, and entities. These elements correspond to the workloads specified in Section II-B. Figure 4 shows each of these elements, and how they differ from the others. For each element, its simulation typically requires the Game Loop to read the current Game State (3), and may result in terrain state changes that need to be persisted (i.e., written). In this section, we focus on the terrain state because it is idiosyncratic to MLGs. Below, we discuss each simulation element in turn.



The Player Handler (4) is driven by player actions, which the Game Loop retrieves from the Networking Queues once per tick. For example, players can move or build something in the virtual world. Because the terrain can obstruct the player from performing these actions, the Player Handler must read the terrain state in the vicinity of the player. If the action is successful, player actions that affect the terrain (e.g., building) need to be written back to the global Game State.



Terrain Simulation (5) is largely independent from player input, and is instead driven by terrain state updates. When a terrain state update occurs, the Terrain Simulation applies its simulation rules to the new state. For example, a terrain simulation rule such as if terrain is not physically supported, it falls down can be triggered when a player removes the keystone from a bridge. These rules trigger in a loop, where each iteration informs the adjacent terrain that it is no longer supported. The resulting state changes are written back to the global Game State.
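The sketch below illustrates such a rule cascade under a simplified "unsupported terrain falls" rule; the support condition (a block or the ground directly beneath) is an assumption for illustration, not the rule set of any specific MLG.

```python
from collections import deque

def collapse_unsupported(blocks, removed):
    """Cascade a simplified 'unsupported terrain falls' rule after one removal.

    `blocks` is a set of (x, y, z) solid blocks. A block counts as supported
    if a block, or the ground at y == 0, is directly beneath it. Removing one
    block re-checks its neighbors; each fall re-checks *its* neighbors, and so
    on: the inherently sequential rule chain described above.
    """
    def supported(block):
        x, y, z = block
        return y == 0 or (x, y - 1, z) in blocks

    blocks.discard(removed)
    queue, fallen = deque([removed]), []
    while queue:
        x, y, z = queue.popleft()
        # Re-check the six face neighbors of the changed cell.
        for n in [(x + 1, y, z), (x - 1, y, z), (x, y + 1, z),
                  (x, y - 1, z), (x, y, z + 1), (x, y, z - 1)]:
            if n in blocks and not supported(n):
                blocks.discard(n)     # the neighbor lost its support: it falls
                fallen.append(n)
                queue.append(n)       # ...and its own neighbors must be re-checked
    return fallen
```

For example, removing the base block of a pillar makes every block above it fall, one rule activation at a time.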



Entities (6) are primarily driven by the Game State, including the state of the terrain, players, and the entities themselves. Entities such as NPCs need the terrain state primarily for pathfinding. In some cases, entities may themselves modify the terrain. For example, an NPC may place or remove terrain, or items such as explosives may remove large parts of terrain all at once.



Although, from a performance perspective, it is desirable to run the game loop elements concurrently, there are two challenges with this approach. First, while these elements are in principle independent, they have implicit dependencies through the game state which they access. Individual elements can only run concurrently as long as they do not access the same state. Second, terrain simulation rules can cause a sequence of state changes which cannot be parallelized, as is the case in the bridge example above.



Using our operational model for MLGs, we formulate two implications for MLG performance variability. First, because environment-based workloads do not rely on the presence of players, large environment-based workloads can cause ticks to exceed their maximum duration, even with few or no players connected. Second, because player simulation and environment-based workloads must be completed sequentially when they access the same state, even small environment-based workloads can affect tick duration if they are spatially clustered.



III Meterstick Benchmark Design



To address contribution C2, we design Meterstick, a benchmark for evaluating performance variability in MLGs. We define eight requirements for Meterstick, and discuss how Meterstick addresses these requirements in its high-level and detailed design.



III-A System Requirements



Here we describe our eight requirements for Meterstick. We define the first three requirements specifically for our use case. The last five relate to benchmarking computer systems in general, and are based on existing guidelines [23, 24].



R1 Captures performance variability of MLGs: Meterstick must be capable of capturing relevant performance metrics at a granularity sufficient for analysis of variability. The specific measure of variability must be applicable and meaningful in the context of MLGs.



R2 Validity of workloads: The workloads used in benchmarking of the MLG should be representative of real-world use and address the workload types listed in Section II-B.



R3 Relevant metrics and experiments: The benchmark should support relevant experiments to isolate different sources of variability, and collect meaningful metrics to allow suitable analysis of these sources.



R4 Fairness: The benchmark should provide a fair assessment for compatible systems. In particular, bias towards any one system should be limited.



R5 Ease of Use: The benchmark should be easy to configure and use with any compatible system.



R6 Clarity: The benchmark should present results to the user in a way that is suitable for system performance variability.



R7 Portability: The benchmark should support common deployment environments and be easy to port to others.



R8 Scalability: Benchmark workloads should be scalable to accommodate benchmarking on increasingly powerful hardware or with more performant systems.



III-B Design Overview



Here we present the design of Meterstick, our system for benchmarking performance variability in MLGs. Figure 5 presents Meterstick’s high-level design. We discuss the benchmark workloads (addresses R2) and metrics (partially addresses R3) in more detail in Section III-C and Section III-E, respectively.



In our design, the user mainly interacts with Meterstick through its Configuration component (1). The Configuration allows the user to capture performance variability by specifying the duration and number of iterations of experiments (partially addresses R1). The Configuration further allows users to configure benchmark parameters, such as the systems under test and workload, and deployment parameters, such as machine IP addresses (partially addresses R5).



After specifying the configuration, the user launches Meterstick. This triggers the Deployment component (2), which deploys components and software dependencies to remote machines specified in the configuration. To complete these actions, the user needs to specify only a set of IP addresses of ssh-accessible machines. This makes Meterstick portable (R7), and allows users to evaluate MLG performance variability under cloud or self-hosted deployments.
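A minimal sketch of such a deployment step, assuming plain `ssh` and `scp` access; the artifact name and remote layout are illustrative, not Meterstick’s actual file structure:

```python
import subprocess

def deploy(hosts, artifact="meterstick.tar.gz", remote_dir="~/meterstick"):
    """Copy the benchmark artifact to each ssh-accessible worker and unpack it."""
    for host in hosts:
        # Create the remote directory, copy the artifact, and unpack it in place.
        subprocess.run(["ssh", host, f"mkdir -p {remote_dir}"], check=True)
        subprocess.run(["scp", artifact, f"{host}:{remote_dir}/"], check=True)
        subprocess.run(
            ["ssh", host, f"tar -xzf {remote_dir}/{artifact} -C {remote_dir}"],
            check=True,
        )

# Usage: deploy(["10.0.0.11", "10.0.0.12"])  # only IP addresses are required
```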



When deployment is complete, the Deployment component hands control to the Control Server (3). Meterstick follows a Controller/Worker pattern, with the Control Server as the controller, and the Control Clients as the workers (4). The Control Server contains the operation logic, and is responsible for synchronizing the operation of all workers by exchanging messages with the Control Client running on each worker. The Control Server and Clients exchange the messages enumerated in Table I. Depending on the configuration, the Control Client instructs the worker to run player emulation or the MLG.



Meterstick uses one or more workers for player emulation (5). These workers emulate players by connecting to the MLG and automatically sending player actions based on programmed behavior. Meterstick implements this by using the player emulation from Yardstick [6], an existing MLG benchmark which we compare to Meterstick in detail in Section VII.



One worker runs the MLG (6), which is the system under test. Meterstick captures the MLG’s performance variability metrics using the player emulator (5), the Metric Externalizer (7), and the System Metrics Collector (8). We detail the operation of these components and the metrics they collect, including our novel metric to capture performance variability, in Section III-E.



When the benchmark experiments are done, the Control Server activates the Data Retrieval component (9). This component moves the collected data from the worker nodes to the user’s local machine, where it pre-processes the data through aggregation and reformatting. The Data Visualization component (10) takes as input the processed data and automatically outputs basic plots for MLG performance and performance variability. Users can view these plots, and, if desired, provide their own advanced plotting scripts for in-depth analysis (concludes R5, R6).



III-C Benchmark Workloads (address R2, R4, R8)



This section presents Meterstick’s workloads. Meterstick uses the workload model presented in Section II, which divides workloads into three main components: players, terrain simulation, and entities. By using this model, Meterstick’s workloads are applicable to MLGs in general, thus avoiding favoring specific systems (partially addresses R4). In practice, the user specifies in the Configuration (1) only the player and terrain-simulation parts of the workload, as entities are a result of terrain simulation (spawning, see Section II-B).



As future systems may become sufficiently performant to mitigate the impact of Meterstick’s workloads, Meterstick supports workload scaling (R8). To increase Meterstick’s workload complexity, the user can increase the number of players to scale the player workload, and use Meterstick’s scale parameter to select higher-complexity versions of the virtual world.



In the remainder of this section, we describe Meterstick’s workloads and how they fulfill our validity requirement (R2).



III-C1 The Player-Based Workload



Meterstick uses a player-based workload facilitated by the player emulation component (5). This workload connects 25 players which move randomly inside a shared 32-by-32 area. The workload represents a high-density area in MLGs and allows Meterstick to compare the impact of environment-based workloads with a traditional player-based workload. We select a player count of 25 based on the Minecraft Wiki’s dedicated-server recommendation [25], as well as the recommendations from various commercial cloud providers (see Section V-A).
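The following sketch illustrates the emulated movement behavior (a random walk clamped to the shared area); it is an illustration of the workload, not Yardstick’s actual player-emulation code:

```python
import random

AREA = 32  # the 25 emulated players share a 32-by-32 area

def random_walk_step(pos):
    """One emulated-player movement action: a random step clamped to the area."""
    x, z = pos
    dx, dz = random.choice([(-1, 0), (1, 0), (0, -1), (0, 1)])
    return (min(max(x + dx, 0), AREA - 1), min(max(z + dz, 0), AREA - 1))

# Place 25 emulated players, then advance all of them by one step per tick.
players = [(random.randrange(AREA), random.randrange(AREA)) for _ in range(25)]
players = [random_walk_step(p) for p in players]
```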



III-C2 The Environment-Based Workloads



The environment workload is determined by the world that is loaded into the MLG. Meterstick uses four environment workloads, each using its own world. To cover the range of valid workloads (R2), we include two worlds that result in a best-case and a worst-case workload, respectively. During all environment-based workloads, Meterstick connects to the game a single player that performs no actions. This is necessary to correctly capture the response time metric discussed in Section III-E. The remainder of this section describes the worlds and their resulting workloads.



The Control world results in a best-case workload while still being realistic. The Control world is an unmodified world generated by Minecraft version 1.16.4 using the seed -392114485. The measured results of this workload are used as a workload-level baseline to compare the other workloads.



The TNT world contains a 16-by-16-by-14 cuboid filled with TNT blocks which are set to explode around 20 seconds after a player connects. In the systems tested, TNT operates by spawning an entity, which can be interacted with by other entities, including other TNT entities. Thus, when a large section of TNT is activated, the MLG must perform a large number of both entity-collision and physics calculations. Intentionally creating large-scale TNT chain reactions is a popular activity, which can be observed in a plethora of community-made content. For example, a video from 2018 that shows a chain reaction of thousands of TNT blocks has 21 million views [26].



The Farm world contains multiple resource farms, which are simulated constructs built by players to automatically generate in-game resources. Table III gives an overview of all resource farms present in the Farm world. The Entity and Stone farms are activated at a fixed interval of around 4 seconds, whereas the Kelp farm and Item sorter use event-based activation. All of these farms rely on entities in their functioning. The Entity farm relies on spawning entities and manipulating their movement. The Stone and Kelp farms continuously destroy blocks, which creates passive entities that represent items. These item entities are then transported through terrain simulation rules (e.g., liquid physics simulation).



The Lag world results in a worst-case workload. This world contains a simulated construct known in the MLG community as a Lag Machine. Lag Machines are a subset of simulated constructs designed to cause high computational load for the MLG, either to stress test it, or to crash it as part of a denial-of-service attack. The design of the Lag Machine used in this workload is publicly available, provided by a community creator with 52 thousand subscribers [27]. We chose it because it operates based on terrain simulation rules. Specifically, it uses many logic-gate constructs in a small area to cause a high volume of simulation rule activations. Importantly, the simulation rules this Lag Machine uses are generally non-malicious, being both common and useful in the normal operation of a Minecraft server, for example forming the basis for simulated constructs such as an operational digital computer [28].



III-D Configuration Parameters



Meterstick is highly configurable; the list of configuration parameters is given in Table IV. New systems can easily be added by placing servers in the servers folder. Similarly, since the deployment process simply copies the servers folder to the remote node, new world workloads can be added by copying them into the worlds directory.



The experiment can be configured by specifying the duration, the number of iterations, which servers to run, how many players to connect, and the behavior of those players. Since the nodes that the deployment tool connects to can be anywhere that is network-accessible, nodes can be chosen that are geographically distant or within the same datacenter, in order to measure (or avoid) the performance impact caused by public network infrastructure.



Although the benchmark workloads are already computationally intense, future hardware and systems may become sufficiently performant to mitigate their impact. Thus, the TNT, Farm, and Lag workloads described in Section III-C2 are tileable, and can be extended by copying sections of the terrain to achieve increasing complexity. The benchmark provides versions of these worlds in which the relevant sections have been copied to create double- and quadruple-complexity versions. The ‘scale’ parameter accepts the values 1, 2, and 4 to select between these scaled world versions.
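The sketch below illustrates the tiling idea on a hypothetical block-map representation of a world; the shipped worlds are pre-tiled, so this is illustrative only:

```python
def tile_region(world, region, scale):
    """Scale workload complexity by tiling a world region `scale` times along x.

    `world` maps (x, y, z) -> block id; `region` is the bounding box
    ((x0, y0, z0), (x1, y1, z1)) of the construct to copy.
    """
    (x0, y0, z0), (x1, y1, z1) = region
    width = x1 - x0 + 1
    # Snapshot the blocks of the original construct.
    base = [(pos, blk) for pos, blk in world.items()
            if x0 <= pos[0] <= x1 and y0 <= pos[1] <= y1 and z0 <= pos[2] <= z1]
    for copy in range(1, scale):               # scale=1 leaves the world unchanged
        for (x, y, z), blk in base:
            world[(x + copy * width, y, z)] = blk  # place a shifted copy
    return world
```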



III-E Metrics (address R1, R3, R4)



This section describes the metrics collected by Meterstick, selected to fulfill R3. We first describe our novel Variability Index (VI) metric that quantifies performance variability (concludes R1). Next, we describe the application-level and system-level metrics collected by Meterstick. Our VI metric and all application and system metrics are general to MLGs to avoid bias for specific implementations (concludes R4). Table V gives an overview of the collected metrics.



III-E1 Variability Index



We describe here our novel Variability Index (VI) metric, a normalized metric based on cycle-to-cycle jitter [33, 34]. In the context of MLGs, cycle-to-cycle jitter is measured as the difference in duration between consecutive ticks. The full metric equation is shown in Equation 1. Here, $N_a$ is the actual number of ticks, $N_e$ is the expected number of ticks, $t_i$ is the duration of the $i$-th game tick, and $b$ is the overloaded threshold in milliseconds.



$$\mathrm{VI} = \frac{\sum_{n=1}^{N_a} \left|\max(b, t_n) - \max(b, t_{n-1})\right|}{N_e \times 2b} \qquad (1)$$



Using this metric as a measure of variability, the range of possible values is 0 to 1. A VI of 0 indicates no variance: either no ticks in the trace have a duration larger than $b$ ms, or all ticks have the same duration of at least 50 ms. If either of these conditions holds, tick frequency is constant throughout the trace. A VI of 1 indicates maximum variance, and is reached when the sum of differences between consecutive ticks is equal to twice the duration of the trace (i.e., $N_e \times 2b$). To see why this is the case, consider the following example consisting of three ticks: a tick with duration $b$, followed by a tick lasting the remaining time up until a final tick with duration $b$. We define the first tick to have a duration difference of 0. The other two ticks each have a duration difference of $N_e \times b - 3b$ with the previous tick. Summing these differences gives $N_e \times 2b - 6b$, which converges to $N_e \times 2b$ for increasing values of $N_e$ (i.e., increasing trace duration).
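A direct transcription of Equation 1 into Python, assuming $N_e$ is derived from the trace duration and the threshold $b$ (the first tick is defined to have a duration difference of 0, as above):

```python
def variability_index(tick_durations_ms, trace_duration_ms, b=50.0):
    """Variability Index (Equation 1): normalized cycle-to-cycle jitter.

    tick_durations_ms holds the measured tick durations t_i; b is the
    overloaded threshold (50 ms for the 20 Hz MLG game loop).
    """
    n_e = trace_duration_ms / b                 # expected tick count N_e
    clipped = [max(b, t) for t in tick_durations_ms]
    jitter = sum(abs(clipped[n] - clipped[n - 1])
                 for n in range(1, len(clipped)))
    return jitter / (n_e * 2 * b)               # normalized to [0, 1]
```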



III-E2 Application-level Metrics



Meterstick collects three application-level metrics: response time, tick duration, and tick distribution.



Response time is how system latency becomes visible to the user. Lower values are better; values exceeding genre-dependent thresholds make the latency noticeable to players, and can even make the game unplayable. The response time is measured as the time between a player taking an action and the results of that action becoming visible. During this time, the action is sent over the network to the MLG server, added to an input queue, simulated during the next tick, has its resulting state changes added to the output queue, and is then sent back to the client over the network. In our workload model (Figure 4), this is visible as the time difference between the client sending player actions (A) and receiving state updates (C). Meterstick captures this metric by having a player send chat messages to all players (including itself), and measuring how long it takes for the player to receive its own message.
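A sketch of this probe, assuming a `send_chat` callable provided by the emulated client; the `probe:` message format and nonce scheme are illustrative, not Meterstick’s actual wire format:

```python
import time
import uuid

pending = {}  # nonce -> send timestamp

def send_probe(send_chat):
    """Send a self-addressed chat message tagged with a unique nonce."""
    nonce = uuid.uuid4().hex[:8]
    pending[nonce] = time.monotonic()
    send_chat(f"probe:{nonce}")

def on_chat_received(message):
    """On receiving our own probe back, return the response time in ms."""
    if message.startswith("probe:"):
        nonce = message.split(":", 1)[1]
        if nonce in pending:
            return (time.monotonic() - pending.pop(nonce)) * 1000
    return None
```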



While tick duration and tick distribution cannot be directly observed by players, MLGs typically expose these metrics through interfaces commonly used by debugging tools. Meterstick’s Metric Externalizer (7) uses these interfaces to gain access to these metrics without requiring access to the game’s source code. The tick duration is the amount of time it takes the MLG to complete a single tick (i.e., one iteration of the game loop, B in Figure 4), and the tick distribution is the percentage of tick time the MLG spent simulating each workload component, such as simulating entities. Both metrics are directly related to game response time and are important indicators of game performance. More detail about the relationship between these metrics is available in Section II-A.



III-E3 System-level Metrics



Meterstick captures system-level metrics to allow users to perform a more in-depth performance analysis. Meterstick collects system-level metrics using the System Metrics Collector (8), which queries the operating system twice per second.



Meterstick collects CPU utilization, memory usage, the number of operating-system threads associated with the MLG, disk I/O, and network I/O. These metrics allow users to analyze causes of high tick duration, and check for potential performance bottlenecks.
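A sketch of such a collector using the third-party psutil library (an assumption for illustration; Meterstick’s actual collector may differ):

```python
import time
import psutil  # third-party: pip install psutil

def sample_metrics(pid, interval=0.5):
    """Poll OS-level metrics twice per second for the MLG server process."""
    proc = psutil.Process(pid)
    while True:
        yield {
            "cpu_percent": psutil.cpu_percent(),          # system-wide CPU utilization
            "memory_mb": proc.memory_info().rss / 2**20,  # resident memory of the MLG
            "threads": proc.num_threads(),                # OS threads of the MLG
            "disk_io": psutil.disk_io_counters()._asdict(),
            "net_io": psutil.net_io_counters()._asdict(),
        }
        time.sleep(interval)
```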






V Real-World Experiments



To address contribution C3, we present here the setup and results from our real-world experiments. We use Meterstick to evaluate the performance variability of three MLGs: Minecraft, Forge, and PaperMC. Our experiments use the workloads described in Section III-C, and are conducted on two commercial cloud environments, Amazon AWS and Microsoft Azure, as well as on DAS-5, a supercomputer for scientific workloads [35]. Our Main Findings are:



MF1 Player experience is negatively impacted by performance variability (Section V-B). We find that the maximum game response time can be up to 20.7 times higher than the arithmetic mean, and can exceed the threshold that makes the game unplayable by more than a factor of 7.4.



MF2 Environment-based workloads cause significant performance variability (Section V-C). We find that environment-based workloads introduce significant performance variability, and can overload and crash the game.



MF3 MLGs exhibit increased variability in commercial cloud environments compared to self-hosted environments (Section V-D). We show that both AWS and Azure introduce additional performance variability between iterations of the same workload compared to DAS-5, with the minimum observed VI on AWS and Azure exceeding the maximum observed VI on DAS-5.



MF4 Processing the state of entities is computationally expensive (Section V-E).



MF5 The common hardware resource recommendations are insufficient to avoid performance variability (Section V-F). The recommended node size exhibits high performance variability and high mean tick duration. Larger node sizes result in lower values of both, such that a node with 8 vCPUs reduces performance variability and mean tick duration to acceptable levels.



V-A Experimental Setup



In this section we describe our experimental setup. In our experiments, we evaluate three MLGs, i.e., the systems under test, in three different environments.



V-A1 System under test



We use in our experiments three MLGs that use the Minecraft protocol: the original Minecraft as developed by Mojang [36], Forge, and PaperMC. We select these services because of their popularity and utility.



Forge is the most popular MLG for operating modified (i.e., modded) services [37]. Of the top-50 most downloaded Minecraft mods, 45 work exclusively with Forge. Of the 5 mods that are not exclusive to Forge, only one is incompatible with it [38]. PaperMC is marketed as a high-performance alternative to Minecraft [39]. While the PaperMC project does not quantify its performance improvement over Minecraft, it does provide documentation of its optimizations, which include extensive changes to threading models and virtual environment processing.



V-A2 Deployment Environment



We evaluate the MLGs in two commercial cloud environments, Amazon AWS and Microsoft Azure, and DAS-5, a supercomputer for academic and educational use [35]. We choose AWS and Azure because they are the two cloud environments with the biggest market share, with 32% and 20% respectively [63]. We use DAS-5 to evaluate how commercial cloud environments affect the performance variability of MLGs, compared to self-hosting these games on dedicated hardware.



Our experiments on cloud environments use T3.Large and Standard_D2_v3 nodes respectively. Both node types are equipped with 2 vCPUs and 8 GB memory. We choose these nodes based on the default hardware configurations recommended by Minecraft service providers as well as guidelines published by AWS and Azure [61, 62]. An incomplete overview of these defaults is shown in Table VI.



On DAS-5, we use a regular node, which is equipped with a dual 8-core 2.4 GHz processor and 64 GB memory, and limit the number of CPU cores available to the MLG by setting its CPU affinity to two cores, unless indicated otherwise. Because the MLGs used in our experiments run on the Java Virtual Machine (JVM), we limit the memory available to the MLG in all experiments by setting the JVM’s maximum heap size to 4 GB.
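For concreteness, these limits can be applied with standard tools, namely Linux taskset for CPU affinity and the JVM -Xmx flag for the heap cap; a minimal sketch, in which the server jar name and core indices are illustrative:

```python
import subprocess

# Pin the MLG server to two cores and cap its JVM heap at 4 GB,
# mirroring the resource limits described above.
subprocess.run([
    "taskset", "-c", "0,1",                     # CPU affinity: cores 0 and 1 only
    "java", "-Xmx4G", "-jar", "server.jar", "nogui",
])
```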



V-B MF1: Player experience is negatively impacted by performance variability



Due to significant performance variability, the median and mean game response times give an optimistic view of game performance; the performance observed by players is worse. Figure 6 depicts the result, and shows that the maximum game response time can be up to 20.7 times higher than the arithmetic mean, and can exceed the threshold that makes the game unplayable by more than a factor of 7.4.



Figure 6 shows the response time (horizontal axis) for two MLGs (Minecraft in green, and Forge in blue) under three different workloads (vertical axis). The workloads and the response time metric are described in Section III-C and Section III-E2, respectively. The whiskers extend to 1.5 times the interquartile range, and the pink diamond indicates the arithmetic mean. The Noticeable Delay line (at 60 ms, in orange) and the Unplayable Game line (at 118 ms, in red) mark, respectively, the latency at which delay becomes noticeable to players and the latency at which the game becomes so unresponsive it is unplayable [21, 20].



Under the Control workload (top two boxes), more than 95% of response time samples are below the noticeable threshold for both Minecraft and Forge. However, the maximum value for Forge (514 ms) is 20.7 times larger than the mean, and the maximum value for Minecraft (679 ms) exceeds the Unplayable threshold of 118 ms by a factor of 7.4. These outliers occur directly after the player connects to the game. This means that, even with good average performance, the game can still be unplayable if players frequently connect, which is a common occurrence in online multiplayer games.



Compared to the Control workload, the Farm and TNT workloads show significantly more performance variability, further affecting player experience. In all cases, the mean and median values give an overly optimistic view of the game’s performance. For the Farm workload, the mean and median values for Forge (third bar from the top) indicate the response time is noticeable, but not unplayable. However, the plot shows outliers up to 876 ms, which is over 7 times the Unplayable threshold. For Minecraft (fourth bar from the top), the mean and median values indicate that the response time is not noticeable for players. However, the plot shows that performance variability causes the response time to exceed the Noticeable threshold more than 25% of the time (the box’s right edge exceeds the Noticeable threshold), and to exceed the Unplayable threshold up to 5% of the time. The TNT workload causes the highest performance variability for both Forge and Minecraft (bottom two boxes, with interquartile ranges of 547 ms for Forge and 503 ms for Minecraft). In both cases, the median response time is below the Noticeable threshold, while the maximum observed values (indicated with black arrows) are at least 19 times larger than the Unplayable threshold.



Based on these results, we conclude that the mean and median values give an overly optimistic view of MLG performance, and that performance variability in MLGs results in noticeable and unplayable game latency, negatively affecting players.



V-C MF2: Environment-based workloads cause significant performance variability



Environment-based workloads cause significantly increased performance variability for each game and in each environment tested, and can overload or crash the game. Figure 7 shows the performance variability of each MLG when running environment-based workloads on AWS and DAS-5. Compared to the Control workload, each MLG in each environment exhibits higher performance variability when operating environment-based workloads.



Figure 7 shows performance variability, quantified using VI (see Equation 1). The three top-level rows show three environment configurations, each containing five workloads. The color and shape of the marker indicate one of three MLGs.



The results show that for all three games in each environment, the game’s performance variability when running an environment-based workload (i.e., Farm, TNT, Lag) is significantly higher than for the Control workload. With the exception of PaperMC on AWS (red bubbles in the top row), the game’s performance variability is also higher than for the Players workload, which provides evidence that environment-based workloads can cause more performance variability than the player workload. Further analysis into the behavior of PaperMC reveals that it contains performance optimizations specifically for handling TNT explosions, improving its performance on the TNT workload, and Redstone, a simulated block type which is used in the Farm workload (analysis of PaperMC given in Appendix A). This provides evidence that the performance variability caused by these environment-based workloads is known to the MLG community.



Of all workloads, the Lag workload causes the most performance variability. Further analysis reveals that this is caused by the Lag workload significantly extending the duration of every other tick, resulting in a near-maximum VI. There are no results for running the Lag workload on AWS, because all three games crash when a player joins and environment simulation begins. The corresponding increase in tick duration causes the player’s connection request to time out, and the MLG stops forcibly when it cannot write player data to file.



To further analyze the impact of performance variability, we show in Figure 8 the game’s tick duration over time for each game when running on AWS. The dashed black line indicates the overloaded threshold at 50 ms, and the green line allows calibrating the vertical axis across the four sub-plots.



The low performance variability observed when running the Control workload in Figure 7 is visible in the top-left sub-plot in Figure 8 as three relatively stable curves with few spikes. In contrast, the high performance variability observed for the Farm and TNT workloads is visible in the top-right and bottom-left sub-plots as jittery curves. The Farm workload depicted in the top-right shows curves which change value at high frequency, resulting in high VI. PaperMC’s tick durations are frequently below the 50 ms threshold, resulting in lower VI. The TNT workload depicted in the bottom-left shows curves which change value at a much lower frequency, but reach significantly higher values, exceeding 2500 ms for both Minecraft and Forge. Similar to the Farm workload, PaperMC’s tick durations are frequently below the 50 ms threshold, resulting in lower VI.



V-D MF3: MLGs exhibit increased variability in commercial cloud environments



In our experiments, all MLGs show increased performance variability (i.e., VI) when run on the AWS and Azure cloud environments, compared to the self-hosted DAS-5. Figure 9 shows VI across 50 iterations of the Player workload of all three games (on the vertical axis) in DAS-5 (green), Azure (blue), and AWS (red). For more details on this workload, see Section III-C1.



The results show that all three games have the lowest VI (horizontal position of boxes) and the lowest spread in VI (width of boxes) when run on DAS-5. The maximum VI observed on DAS-5 is 0.021 (Forge), which is smaller than 0.029, the minimum VI observed on AWS and Azure (PaperMC on Azure).



From this result, we highlight two surprising observations. First, no game performs best in all environments. On DAS-5, PaperMC performs best, slightly outperforming Minecraft, with median VIs of 0.007 and 0.010, respectively. Although PaperMC also has the lowest median VI on Azure, it simultaneously has the highest interquartile range: 0.028, 0.009, and 0.011 for PaperMC, Forge, and Minecraft, respectively. Moreover, on AWS, PaperMC is the worst-performing game, with a median VI of 0.094. Second, neither cloud performs best for all games. While AWS performs better for Minecraft and Forge, Azure performs best for PaperMC.



Increased performance variability in commercial cloud environments is a well-documented phenomenon [64, 65, 66, 67], with a wide variety of sources identified as causes of increased variability, including hardware manufacturing differences, shared tenancy of hardware and networks, specific software configurations, and resource allocation and scheduling systems. With so many variables operating in the context of commercial cloud hosting, it is infeasible to identify a single source responsible for the variability of these games, especially since commercial cloud hosting companies do not make internal data on resource allocation and shared tenancy publicly available. However, we can conclude that this variability observably impacts the performance of MLGs, and can be compared across MLGs and commercial cloud services.



V-E MF4: Processing entity-state is computationally expensive



Entity workload components account for a large majority of computation time and state update messages.



Figure 10 shows that entity-related workload components contribute the majority of non-waiting tick computation time. After entities, the next most time-intensive component is environment rule processing, followed by block creation and destruction.



The horizontal axis shows the percentage of tick computation time. The vertical axis shows the MLG (left) and the workload (right). The color of each bar indicates the workload operation, and the width of each bar corresponds to its share of tick computation time throughout the experimental duration in the AWS environment.



Entities account for a majority of non-waiting tick time during every workload on each server. Most notably, PaperMC has a much smaller proportion of entity-calculation time under each workload compared to Minecraft and Forge. During the TNT workload, in which Minecraft and Forge show large percentages of entity tick computation time compared to both the Control and Farm workloads, PaperMC shows only a small increase. This relative reduction in entity computation may be the reason PaperMC manages its comparatively high performance during the TNT workload, as seen in MF1.



Table VII shows that entity-related state updates account for a majority of messages sent to the client from the server in all configurations except PaperMC running the Farm workload. Conversely, entity-related state updates account for only a small percent of network bytes sent.



Thus, we find that efficient computation and dissemination of entity state is a crucial performance challenge for MLGs. Unlike environment processing, which exhibits spatial locality, entities require computation and take actions regardless of proximity to players, terrain, or other entities. Their ability to move freely within the grid world means that physics simulation features such as gravity, liquid dynamics, and collision must be computed each tick for each entity. As MLGs have no pre-calculated paths, way-points, or fixed locations, entities must use grid-based pathfinding, and their paths must be recalculated upon each terrain state update within their vicinity. Similarly, entities in MLGs require computation even before they have been created. As there are no set locations for entities to appear, spawn locations must be calculated each tick from the configuration of local game state: terrain, player positions, and existing entities.



V-F MF5: Using recommended hardware results in significant performance variability



Recommended hardware configurations in cloud environments result in unacceptable levels of performance variability, which degrades player experience. By using more powerful cloud hardware, performance variability can be limited to acceptable levels. Figure 11 shows this result, showing both the mean tick duration and VI for varying node types in AWS.



Companies that specialize in cloud hosting of MLGs commonly list recommended hardware configurations, with the most frequent recommendation being 2 vCPUs and 4 GB memory. An overview of these recommendations is available in Table VI. These recommended values are significantly lower than those listed on the community-driven Minecraft wiki, which recommends a dedicated full CPU (e.g., Intel i5 or i7, or AMD Ryzen 5 or 7) and 6 GB memory [25]. This indicates that players experienced performance problems with the recommended hardware configuration.



Figure 11 shows that using the recommended hardware configuration as listed by cloud-hosting companies, which corresponds to the t3.large node type, results in poor performance and significant performance variability. On this node size, each MLG becomes significantly overloaded by environment-based workloads and exhibits high performance variability.



The larger node types t3.xlarge and t3.2xlarge have 4 and 8 vCPUs respectively [68]. While the t3.xlarge provides better performance and less performance variability than the t3.large, it remains insufficient to keep the mean tick time below 50 ms. The t3.2xlarge node type is required to provide sufficiently low mean tick duration. However, this node type still shows significant performance variability for Minecraft (green cross) and Forge (blue square), which means these games can still become overloaded temporarily, as shown in Section V-C.



Interestingly, we observe that the benefit of more powerful hardware varies per MLG. Specifically, while PaperMC’s (red circle) performance variability (i.e., VI) increases significantly when decreasing hardware resources, from 0.025 to 0.08 in the top-right sub-plot, it is the only game whose mean tick duration stays well below the 50 ms threshold. Further analysis shows that, while PaperMC becomes overloaded and its tick duration exceeds 50 ms, it produces significantly fewer extreme outliers, preventing this performance problem from becoming visible in the mean tick duration.



VI Actionable Insights



In this section we provide a summary of actionable insights based on our main findings.



AI1 Game developers and hardware producers should report performance variability when evaluating the performance of online games, using measures of variance such as VI (see Section III-E) and the distribution of game response time and frames per second (FPS). Games must provide consistently good performance to their users. Our experiments show that MLGs can be overloaded and become unplayable, even when mean and median performance values indicate good performance (MF1, Figure 6).



AI2 Game developers and hardware producers should include environment-based workloads in their benchmarks for MLGs. It is not sufficient to evaluate the performance of MLGs using only large numbers of players (i.e., player-based workloads). Environment-based workloads cause significant performance variability in MLGs and can make them unplayable (MF2, Figure 7), and must therefore be included in MLG benchmarks.



AI3 Players should choose their cloud environment depending on their MLG, and should consider self-hosting their game. Our results indicate that choice of best cloud provider depends on the MLG. Minecraft and Forge obtain lower performance variability on AWS, while PaperMC obtains lower performance variability on Azure (MF3, Figure 9). Moreover, self-hosting remains a valuable alternative, resulting in significantly lower performance variability overall.



AI4 MLG service providers should increase their hardware recommendations. Prior work has shown that when asked to estimate in advance the hardware requirements of a given program, users either pick a provided default configuration, or overestimate to an extreme value to avoid performance issues [69, 70, 71]. We find that recommended hardware configurations for hosting MLGs on cloud environments are insufficient (MF5) and conclude that users who employ the first strategy will experience decreased quality of service which may cause them to switch to a competing commercial cloud provider.



To prevent this, commercial cloud providers should update hardware recommendations in line with our findings in Figure 11: a 2-core size is insufficient, and a node with 8 cores is necessary for smooth operation. The 4-core size provides a balance of cost and performance. Beyond these recommendations, commercial cloud providers should use our benchmark to determine adequate hardware allocations capable of fulfilling the service requirements of MLGs under realistic workloads, and further adapt resource scheduling to be aware of the performance patterns of MLGs.



Similarly, the second category of users who seek to avoid performance issues when operating MLG cloud environments should choose node sizes comparable to the 8 core t3.2xlarge node, or use our benchmark to compare both various cloud providers and the specific implementations of MLGs.



AI5 Game developers should engineer MLGs to reduce the impact of environment-based workloads. Engineering for this goal can reduce the impact (i.e., performance variability) of environment-based workloads by 60% on the same hardware, as shown by PaperMC operating the Farm workload on AWS (Figure 7). The goal of the PaperMC project is to implement a high-performance MLG, including efficient processing of environment-based workloads; we provide an analysis of PaperMC in Appendix A. Developers creating MLGs should consider and mitigate the impact of environment-based workloads, and use our benchmark to measure, analyze, and subsequently reduce the performance variability caused by such workloads.



VII Related Work



We present in this section an overview of related work. Overall, this study is the first to evaluate performance variability in MLGs. This is challenging because there is neither a generally accepted set of relevant workloads for MLGs, nor a standardized metric to quantify performance variability in computer systems.



Closest to our work, Yardstick is an MLG benchmark used to show the limited scalability of MLGs [6]. The authors use Yardstick to evaluate the scalability and network characteristics of several MLG services. However, Yardstick does not quantify performance variability, resulting in optimistic results. Moreover, the authors do not evaluate MLG performance under environment-based workloads, or when deployed in a commercial cloud environment.



The MineRL competition [72] provides a dataset of Minecraft player sessions. These sessions provide demonstrations to train artificial intelligence systems to complete a challenging in-game task. In contrast, the workloads used in this work focus on commonly observed patterns in the MLG community.



There exist several systems that aim to improve the scalability of MLGs. Manycraft [73] increases the maximum number of players in a Minecraft instance by using Kiwano. Kiwano [74] allows horizontal scaling of virtual environments through Voronoi partitioning, but requires a static environment, which disables the MLG’s modifiable world and is incompatible with environment-based workloads.



Koekepan [75] is another distributed architecture for Minecraft. Similar to Manycraft, Koekepan scales horizontally by partitioning the world into zones. Koekepan supports a modifiable environment, but currently lacks a performance evaluation.



Iosup et al. [76] find that commercial cloud environments exhibit significant yearly and daily performance variability patterns. The authors show that performance variability varies per cloud operator, and use simulation experiments to show that this can affect the performance of applications, including a social online game. In contrast, our benchmark uses real-world experiments to evaluate the effect of performance variability on MLGs, which are real-time online games.



VIII Conclusion



Online gaming is a popular and lucrative part of the entertainment industry that raises important performance challenges. In this work, we focus on the performance and scalability of Minecraft and Minecraft-like games, which are important exemplars of online games in which players have fine-grained control over the virtual environment (MLGs). Earlier work has shown that MLGs scale only to 200-300 players under favorable conditions; and to as low as 10 players in commercial MLG-as-a-service operations.



In this work, we posit that performance variability is an important cause for the lack of scalability in MLGs. We make a four-fold contribution to better understand the behavior of these systems. First, we propose a novel workload model for MLGs, which identifies important sources of performance variability and explains how it arises in practical MLG operation. Second, we design and implement Meterstick, the first benchmark to evaluate performance variability in MLGs. Meterstick uses realistic workload types, including novel, environment-based workloads, and can evaluate MLGs running both in self-hosted environments and in clouds such as Amazon AWS and Microsoft Azure. Third, we use Meterstick to perform real-world experiments and analyze the results. We find that performance variability negatively affects players in MLGs, that both environment-based workloads and cloud environments can cause significant performance variability, and that MLGs exhibit significant performance variability when using the recommended hardware configuration. This leads us to formulate five actionable insights. Fourth, we release FAIR and FOSS artifacts that enable reproducibility for this work.



In future work, we aim to create a public scoreboard where operators of MLG-as-a-service can publish benchmark scores. We also aim to conduct user studies linking values of VI with player-perceived quality of experience for different games and MLGs.