
English translation fanatics, do you dare take on the challenge?


Our concurrency control schemes are designed for a partitioned main-memory database system similar to H-Store. This section gives an overview of the relevant aspects of the H-Store design.
Traditional concurrency control comprises nearly half the CPU instructions executed by a database engine. This suggests that avoiding concurrency control can improve throughput substantially. H-Store was therefore designed explicitly to avoid this overhead.
2.1 Transactions as Stored Procedures
H-Store only supports executing transactions that have been pre-declared as stored procedures. Each stored procedure invocation is a single transaction that must either abort or commit before returning results to the client. Eliminating ad-hoc transactions removes client stalls, reducing the need for concurrency.
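To illustrate this execution model, here is a minimal Python sketch (with hypothetical names, not H-Store's actual interface): clients can only invoke procedures registered ahead of time, and each invocation runs to completion as one transaction.

    # Sketch: transactions exist only as pre-declared stored procedures. Each
    # invocation runs start-to-finish as one transaction; there is no ad-hoc
    # SQL session that could hold a transaction open while the client stalls.
    PROCEDURES = {}

    def stored_procedure(func):
        """Pre-declare a function as an invokable stored procedure."""
        PROCEDURES[func.__name__] = func
        return func

    @stored_procedure
    def new_order(db, warehouse_id, items):
        # Deterministic code interleaved with database operations.
        order_id = db.get("next_order_id", 1)
        db["next_order_id"] = order_id + 1
        db[("order", order_id)] = {"warehouse": warehouse_id, "items": items}
        return order_id

    def invoke(db, proc_name, *args):
        """One client request: run the whole procedure, then return its result."""
        return PROCEDURES[proc_name](db, *args)

    print(invoke({}, "new_order", 42, ["item-1"]))   # prints 1
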
2.2 No Disk
Today, relatively low-end 1U servers can have up to 128 GB of RAM, which gives a data center rack of 40 servers an aggregate RAM capacity of over 5 TB. Thus, a modest amount of hardware can hold all but the largest OLTP databases in memory, eliminating the need for disk access during normal operation. This eliminates disk stalls, which further reduces the need for concurrent transactions.
Traditional databases rely on disk to provide durability. However, mission-critical OLTP applications need high availability, which means they use replicated systems. H-Store takes advantage of replication for durability as well as high availability. A transaction is committed when it has been received by k > 1 replicas, where k is a tuning parameter.
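A minimal sketch of that commit rule (our own simplification, with synchronous in-memory lists standing in for replicas):

    # Sketch: durability comes from replication, not disk. A transaction counts
    # as committed once k > 1 replicas have received it.
    def replicate_and_commit(transaction, replicas, k):
        acks = 0
        for replica in replicas:
            replica.append(transaction)   # replica "receives" the transaction
            acks += 1
        return acks >= k                  # committed only if k copies exist

    replicas = [[], [], []]               # three in-memory replica logs
    print(replicate_and_commit("new_order(42)", replicas, k=2))   # True
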
2.3 Partitioning
Without client and disk stalls, H-Store simply executes transactions from beginning to completion in a single thread. To take advantage of multiple physical machines and multiple CPUs, the data must be divided into separate partitions. Each partition executes transactions independently.
The challenge becomes dividing the application's data so that each transaction only accesses one partition. For many OLTP applications, partitioning the application manually is straightforward. For example, the TPC-C OLTP benchmark can be partitioned by warehouse so that an average of 89% of the transactions access a single partition. There is evidence that developers already do this to scale their applications, and academic research provides some approaches for automatically selecting a good partitioning key. However, unless the partitioning scheme is 100% effective in making every transaction access only a single partition, coordination for multi-partition transactions causes network stalls, and executing a transaction to completion without stalls is no longer possible. In this paper, we focus on what the system should do in this case.
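For example, warehouse-based routing can be approximated as follows (a sketch with hypothetical names; in the prototype, the mapping of work to partitions is done manually):

    # Sketch: partition data by warehouse id so that most TPC-C style
    # transactions touch exactly one partition.
    N_PARTITIONS = 4

    def partition_for(warehouse_id):
        """Map a warehouse id to the partition that owns its rows."""
        return warehouse_id % N_PARTITIONS

    def partitions_touched(warehouse_ids):
        """A transaction touching one partition can skip coordination."""
        return {partition_for(w) for w in warehouse_ids}

    print(partitions_touched([7]))        # {3}: single partition transaction
    print(partitions_touched([7, 12]))    # {3, 0}: multi-partition transaction
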
3. EXECUTING TRANSACTIONS
In this section, we describe how our prototype executes transactions. We begin by describing the components of our system and the execution model. We then discuss how single partition and multi-partition transactions are executed.



3.1 System Components
The system is composed of three types of processes, shown in Figure 1. First, the data is stored in partitions: each partition is a single process that stores a portion of the data in memory and executes stored procedures using a single thread. For each partition, there is a primary process and k − 1 backup processes, which ensures that data survives k − 1 failures. Second, a single process called the central coordinator is used to coordinate all distributed transactions. This ensures that distributed transactions are globally ordered, and is described in Section 3.3. Finally, the client processes are end-user applications that issue transactions to the system in the form of stored procedure invocations. When the client library connects to the database, it downloads the part of the system catalog describing the available stored procedures, network addresses for the partitions, and how data is distributed. This permits the client library to direct transactions to the appropriate processes.
Transactions are written as stored procedures, composed of deterministic code interleaved with database operations. The client invokes transactions by sending a stored procedure invocation to the system. The system distributes the work across the partitions. In our prototype, the mapping of work to partitions is done manually, but we are working on a query planner to do this automatically.
Each transaction is divided into fragments. A fragment is a unit of work that can be executed at exactly one partition. It can be some mixture of user code and database operations. A single partition transaction, for example, is composed of one fragment containing the entire transaction. A multi-partition transaction is composed of multiple fragments with data dependencies between them.
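A rough way to picture this decomposition (hypothetical structures, not the prototype's classes):

    # Sketch: a fragment is the unit of work sent to exactly one partition.
    from dataclasses import dataclass, field

    @dataclass
    class Fragment:
        partition: int          # the single partition that executes this work
        operations: list        # mixture of user code and database operations
        depends_on: list = field(default_factory=list)   # fragments whose
                                                          # results are needed

    # Single partition transaction: one fragment containing everything.
    single = [Fragment(partition=0, operations=["read A", "write A"])]

    # Multi-partition transaction: read at P1, then use the result at P2.
    read_p1 = Fragment(partition=1, operations=["read X"])
    write_p2 = Fragment(partition=2, operations=["write Y := f(X)"],
                        depends_on=[read_p1])
    multi = [read_p1, write_p2]
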
3.2 Single Partition Transactions
When a client determines that a request is a single partition transaction, it forwards it to the primary partition responsible for the data. The primary uses a typical primary/backup replication protocol to ensure durability. In the failure-free case, the primary reads the request from the network and sends a copy to the backups. While waiting for acknowledgments, the primary executes the transaction. Since it is a single partition transaction, it does not block. When all acknowledgments from the backups are received, the result of the transaction is sent to the client. This protocol ensures the transaction is durable, as long as at least one replica survives a failure.
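The failure-free path can be sketched like this (synchronous stand-ins for the network; not the real replication code):

    # Sketch: the primary forwards the request to its backups, executes while
    # the acknowledgments are in flight, and replies to the client only after
    # every backup has acknowledged.
    def execute_single_partition(request, primary_state, backups, run):
        acks = []
        for backup in backups:
            backup["log"].append(request)     # send a copy to the backup
            acks.append(True)                 # stand-in for its acknowledgment
        result = run(primary_state, request)  # execute while "waiting"
        if all(acks):                         # all backups have the request,
            return result                     # so the transaction is durable
        raise RuntimeError("backup failed: run the recovery protocol")

    def run(state, request):                  # toy stored procedure
        state[request] = state.get(request, 0) + 1
        return state[request]

    backups = [{"log": []}, {"log": []}]
    print(execute_single_partition("inc:counter", {}, backups, run))   # 1
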
No concurrency control is needed to execute single partition transactions. In most cases, the system executes these transactions without recording undo information, resulting in very low overhead. This is possible because transactions are annotated to indicate if a user abort may occur. For transactions that have no possibility of a user abort, concurrency control schemes that guarantee that deadlock will not occur (see below) do not keep an undo log. Otherwise, the system maintains an in-memory undo buffer that is discarded when the transaction commits.
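One way to picture that undo buffer (a sketch that assumes undo records are simply before-images of modified keys):

    # Sketch: record a before-image per write; drop the buffer on commit,
    # replay it in reverse on a user abort.
    class UndoBuffer:
        def __init__(self, table):
            self.table = table
            self.records = []                       # (key, previous value)

        def write(self, key, value):
            self.records.append((key, self.table.get(key)))
            self.table[key] = value

        def commit(self):
            self.records.clear()                    # nothing to keep

        def abort(self):
            for key, old in reversed(self.records): # newest change first
                if old is None:
                    self.table.pop(key, None)       # the key did not exist
                else:
                    self.table[key] = old
            self.records.clear()

    table = {"x": 1}
    undo = UndoBuffer(table)
    undo.write("x", 2)
    undo.write("y", 9)
    undo.abort()
    print(table)                                    # {'x': 1}
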
3.3 Multi-Partition Transactions
In general, multi-partition transactions can have arbitrary data dependencies between transaction fragments. For example, a transaction may need to read a value stored at partition P1 in order to update a value at partition P2. To ensure multi-partition transactions execute in a serializable order without deadlocks, we forward them through the central coordinator, which assigns them a global order. Although this is a straightforward approach, the central coordinator limits the rate of multi-partition transactions. To handle more multi-partition transactions, multiple coordinators must be used. Previous work has investigated how to globally order transactions with multiple coordinators, for example by using loosely synchronized clocks. We leave selecting the best alternative to future work, and only evaluate a single-coordinator system in this paper.
The central coordinator divides the transaction into fragments and sends them to the partitions. When responses are received, the coordinator executes application code to determine how to continue the transaction, which may require sending more fragments. Each partition executes fragments for a given transaction sequentially.
Multi-partition transactions are executed using an undo buffer, and use two-phase commit (2PC) to decide the outcome. This allows each partition of the database to fail independently. If the transaction causes one partition to crash or the network splits during execution, other participants are able to recover and continue processing transactions that do not depend on the failed partition. Without undo information, the system would need to block until the failure is repaired.
The coordinator piggybacks the 2PC “prepare” message onto the last fragment of a transaction. When the primary receives the final fragment, it sends all the fragments of the transaction to the backups and waits for acknowledgments before sending the final results to the coordinator. This is equivalent to forcing the participant's 2PC vote to disk. Finally, when the coordinator has all the votes from the participants, it completes the transaction by sending a “commit” message to the partitions and returning the final result to the application.
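A condensed sketch of that message flow (in-process calls standing in for network messages; the names are our own):

    # Sketch: the "prepare" rides on the last fragment; each primary replicates
    # the transaction's fragments to its backups before voting, and the
    # coordinator commits once every participant has voted to commit.
    def handle_final_fragment(fragments, backups, execute):
        results = [execute(f) for f in fragments]
        for backup in backups:            # replicating to the backups is the
            backup.extend(fragments)      # moral equivalent of forcing the
        return "vote-commit", results     # 2PC vote to disk

    def coordinator_run(fragments_by_partition, participants):
        votes, results = [], []
        for part in participants:
            vote, res = handle_final_fragment(
                fragments_by_partition[part["id"]], part["backups"], part["execute"])
            votes.append(vote)
            results.extend(res)
        decision = "commit" if all(v == "vote-commit" for v in votes) else "abort"
        for part in participants:
            part["decision"] = decision   # partitions apply or undo accordingly
        return results if decision == "commit" else None

    execute = lambda fragment: ("done", fragment)
    participants = [{"id": "P1", "backups": [[]], "execute": execute},
                    {"id": "P2", "backups": [[]], "execute": execute}]
    print(coordinator_run({"P1": ["read X"], "P2": ["write Y"]}, participants))
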
When executing multi-partition transactions, network stalls can occur while waiting for data from other partitions. This idle time can introduce a performance bottleneck, even if multi-partition transactions only comprise a small fraction of the workload. On our experimental systems, described in Section 5, the minimum network round-trip time between two machines connected to the same gigabit Ethernet switch was measured using ping to be approximately 40 µs. The average CPU time for a TPC-C transaction in our system is 26 µs. Thus, while waiting for a network acknowledgment, the partition could execute at least two single partition transactions. Some form of concurrency control is needed to permit the engine to do useful work while otherwise idle. The challenge is to not reduce the efficiency of simple single partition transactions. The next section describes two concurrency control schemes we have developed to address this issue.



I don't dare translate this -

