Introduction

The source text is a paper by Rick Cattell on scalable data stores, both structured (SQL) and NoSQL, the latter including key-value, document, and column-oriented stores. (Translator's note: NoSQL is central to supporting big-data applications. Rendering NoSQL as “unstructured” is not quite accurate, because the more common reading is “Not Only SQL”. In other words, NoSQL does not stand in opposition to structured SQL; it can cover both structured and unstructured data.)

Paper information

Scalable SQL and NoSQL Data Stores
Rick Cattell
Originally published in 2010, last revised December 2011

ABSTRACT

In this paper, we examine a number of SQL and so-called “NoSQL” data stores designed to scale simple OLTP-style application loads over many servers. Originally motivated by Web 2.0 applications, these systems are designed to scale to thousands or millions of users doing updates as well as reads, in contrast to traditional DBMSs and data warehouses. We contrast the new systems on their data model, consistency mechanisms, storage mechanisms, durability guarantees, availability, query support, and other dimensions. These systems typically sacrifice some of these dimensions, e.g. database-wide transaction consistency, in order to achieve others, e.g. higher availability and scalability.

Note: Bibliographic references for systems are not listed, but URLs for more information can be found in the System References table at the end of this paper.

Caveat: Statements in this paper are based on sources and documentation that may not be reliable, and the systems described are “moving targets,” so some statements may be incorrect. Verify through other sources before depending on information here. Nevertheless, we hope this comprehensive survey is useful! Check for future corrections on the author's web site cattell.net/datastores.

Disclosure: The author is on the technical advisory board of Schooner Technologies and has a consulting business advising on scalable databases.

1. OVERVIEW

In recent years a number of new systems have been designed to provide good horizontal scalability for simple read/write database operations distributed over many servers. In contrast, traditional database products have comparatively little or no ability to scale horizontally on these applications. This paper examines and compares the various new systems.

(Translator's note, Yol: in other words, the paper compares the various NoSQL designs, for example classifying and briefly introducing systems such as MongoDB and HBase.)

Many of the new systems are referred to as “NoSQL” data stores. The definition of NoSQL, which stands for “Not Only SQL” or “Not Relational”, is not entirely agreed upon. (Translator's note: “Not Relational” is read here as weakly relational, in comparison with a system such as MySQL.) For the purposes of this paper, NoSQL systems generally have six key features:

1. the ability to horizontally scale “simple operation” throughput over many servers,
2. the ability to replicate and to distribute (partition) data over many servers,
3. a simple call level interface or protocol (in contrast to a SQL binding),
4. a weaker concurrency model than the ACID transactions of most relational (SQL) database systems,
5. efficient use of distributed indexes and RAM for data storage, and
6. the ability to dynamically add new attributes to data records.

The systems differ in other ways, and in this paper we contrast those differences. They range in functionality from the simplest distributed hashing, as supported by the popular memcached open source cache, to highly scalable partitioned tables, as supported by Google's BigTable [1]. In fact, BigTable, memcached, and Amazon's Dynamo [2] provided a “proof of concept” that inspired many of the data stores we describe here:

- Memcached demonstrated that in-memory indexes can be highly scalable, distributing and replicating objects over multiple nodes.
- Dynamo pioneered the idea of eventual consistency as a way to achieve higher availability and scalability: data fetched are not guaranteed to be up-to-date, but updates are guaranteed to be propagated to all nodes eventually.
- BigTable demonstrated that persistent record storage could be scaled to thousands of nodes, a feat that most of the other systems aspire to.

A key feature of NoSQL systems is “shared nothing” horizontal scaling: replicating and partitioning data over many servers. This allows them to support a large number of simple read/write operations per second. This simple operation load is traditionally called OLTP (online transaction processing), but it is also common in modern web applications.

The NoSQL systems described here generally do not provide ACID transactional properties: updates are eventually propagated, but there are limited guarantees on the consistency of reads. Some authors suggest a “BASE” acronym in contrast to the “ACID” acronym:

- BASE = Basically Available, Soft state, Eventually consistent
- ACID = Atomicity, Consistency, Isolation, and Durability

The idea is that by giving up ACID constraints, one can achieve much higher performance and scalability.

However, the systems differ in how much they give up. For example, most of the systems call themselves “eventually consistent”, meaning that updates are eventually propagated to all nodes, but many of them provide mechanisms for some degree of consistency, such as multiversion concurrency control (MVCC).

Proponents of NoSQL often cite Eric Brewer's CAP theorem [4], which states that a system can have only two out of three of the following properties: consistency, availability, and partition tolerance. The NoSQL systems generally give up consistency. However, the trade-offs are complex, as we will see.
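
To make “eventually consistent” concrete, here is a minimal sketch in Python. It is not how Dynamo or any other system named in the paper is implemented; the class names `Replica` and `EventuallyConsistentStore` are hypothetical, and asynchronous delivery is simulated with an in-memory queue. It only illustrates the property described above: a read may return stale data until pending updates have reached every node.

```python
from collections import deque


class Replica:
    """One node holding its own copy of the data (hypothetical name)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def read(self, key):
        return self.data.get(key)


class EventuallyConsistentStore:
    """Applies a write on one replica and delivers it to the others later."""

    def __init__(self, replica_names):
        self.replicas = [Replica(n) for n in replica_names]
        self.pending = deque()  # undelivered updates: (replica, key, value)

    def write(self, key, value):
        # Apply the update on the first replica immediately...
        first, *others = self.replicas
        first.data[key] = value
        # ...and queue it for later delivery to the remaining replicas.
        for replica in others:
            self.pending.append((replica, key, value))

    def propagate(self):
        # Deliver every queued update; afterwards all replicas agree.
        while self.pending:
            replica, key, value = self.pending.popleft()
            replica.data[key] = value


store = EventuallyConsistentStore(["a", "b", "c"])
store.write("user:1", "Alice")
stale = [r.read("user:1") for r in store.replicas]  # before propagation
store.propagate()
fresh = [r.read("user:1") for r in store.replicas]  # after propagation
print(stale)  # ['Alice', None, None]
print(fresh)  # ['Alice', 'Alice', 'Alice']
```

Between `write` and `propagate`, replicas “b” and “c” answer reads with the old state. That window is the price paid for accepting the write without waiting for every node.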

New relational DBMSs have also been introduced to provide better horizontal scaling for OLTP, when compared to traditional RDBMSs. After examining the NoSQL systems, we will look at these SQL systems and compare the strengths of the approaches. The SQL systems strive to provide horizontal scalability without abandoning SQL and ACID transactions. We will discuss the trade-offs here.

In this paper, we will refer to both the new SQL and NoSQL systems as data stores, since the term “database system” is widely used to refer to traditional DBMSs. However, we will still use the term “database” to refer to the stored data in these systems. All of the data stores have some administrative unit that you would call a database: data may be stored in one file, or in a directory, or via some other mechanism that defines the scope of data used by a group of applications. Each database is an island unto itself, even if the database is partitioned and distributed over multiple machines: there is no “federated database” concept in these systems (as with some relational and object-oriented databases), allowing multiple separately-administered databases to appear as one. Most of the systems allow horizontal partitioning of data, storing records on different servers according to some key; this is called “sharding”. Some of the systems also allow vertical partitioning, where parts of a single record are stored on different servers.
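
The difference between horizontal partitioning (sharding) and vertical partitioning can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the placement scheme of any system in the survey: the server names, the helper functions `shard_for` and `split_vertically`, and the choice of CRC32 modulo the server count are all hypothetical.

```python
import zlib

SERVERS = ["server0", "server1", "server2"]  # hypothetical server names


def shard_for(key, servers=SERVERS):
    """Horizontal partitioning (sharding): choose a server from the record key."""
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return servers[zlib.crc32(key.encode("utf-8")) % len(servers)]


def split_vertically(record, column_groups):
    """Vertical partitioning: split one record's attributes into parts."""
    return [{name: record[name] for name in group if name in record}
            for group in column_groups]


records = {
    "alice": {"email": "alice@example.com", "bio": "long text", "photo": b"\x00\x01"},
    "bob": {"email": "bob@example.com", "bio": "more text", "photo": b"\x02\x03"},
}

# Horizontal: each whole record is placed on one server, chosen by its key.
placement = {key: shard_for(key) for key in records}

# Vertical: parts of a single record are separated so they can live apart.
hot, cold = split_vertically(records["alice"], [["email"], ["bio", "photo"]])
print(placement)
print(hot, cold)
```

Simple modulo placement moves most keys when a server is added or removed. Real systems commonly use other schemes for that reason, consistent hashing in Dynamo's case, and the sketch makes no attempt to model them.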

1.1 Scope of this Paper

Before proceeding, some clarification is needed in defining “horizontal scalability” and “simple operations”. These define the focus of this paper.

By “simple operations”, we refer to key lookups, reads and writes of one record or a small number of records. This is in contrast to complex queries or joins, read-mostly access, or other application loads. With the advent of the web, especially Web 2.0 sites where millions of users may both read and write data, scalability for simple database operations has become more important. For example, applications may search and update multi-server databases of electronic mail, personal profiles, web postings, wikis, customer records, online dating records, classified ads, and many other kinds of data. These all generally fit the definition of “simple operation” applications: reading or writing a small number of related records in each operation.

The term “horizontal scalability” means the ability to distribute both the data and the load of these simple operations over many servers, with no RAM or disk shared among the servers. Horizontal scaling differs from “vertical” scaling, where a database system utilizes many cores and/or CPUs that share RAM and disks. Some of the systems we describe provide both vertical and horizontal scalability, and the effective use of multiple cores is important, but our main focus is on horizontal scalability, because the number of cores that can share memory is limited, and horizontal scaling generally proves less expensive, using commodity servers. Note that horizontal and vertical partitioning are not related to horizontal and vertical scaling, except that they are both useful for horizontal scaling.

1.2 Systems Beyond our Scope

Some authors have used a broad definition of NoSQL, including any database system that is not relational. Specifically, they include:

- Graph database systems: Neo4j and OrientDB provide efficient distributed storage and queries of a graph of nodes with references among them.
- Object-oriented database systems: Object-oriented DBMSs (e.g., Versant) also provide efficient distributed storage of a graph of objects, and materialize these objects as programming language objects.
- Distributed object-oriented stores: Very similar to object-oriented DBMSs, systems such as GemFire distribute object graphs in-memory on multiple servers.

These systems are a good choice for applications that must do fast and extensive reference-following, especially where data fits in memory. Programming language integration is also valuable. Unlike the NoSQL systems, these systems generally provide ACID transactions. Many of them provide horizontal scaling for reference-following and distributed query decomposition, as well. Due to space limitations, however, we have omitted these systems from our comparisons. The applications and the necessary optimizations for scaling for these systems differ from the systems we cover here, where key lookups and simple operations predominate over reference-following and complex object behavior. It is possible these systems can scale on simple operations as well, but that is a topic for a future paper, and proof through benchmarks.

Data warehousing database systems provide horizontal scaling, but are also beyond the scope of this paper. Data warehousing applications are different in important ways:

- They perform complex queries that collect and join information from many different tables.
- The ratio of reads to writes is high: that is, the database is read-only or read-mostly.

There are existing systems for data warehousing that scale well horizontally. Because the data is infrequently updated, it is possible to organize or replicate the database in ways that make scaling possible.

1.3 Data Model Terminology

Unlike relational (SQL) DBMSs, the terminology used by NoSQL data stores is often inconsistent. For the purposes of this paper, we need a consistent way to compare the data models and functionality.

All of the systems described here provide a way to store scalar values, like numbers and strings, as well as BLOBs. Some of them also provide a way to store more complex nested or reference values. The systems all store sets of attribute-value pairs, but use different data structures, specifically:

- A “tuple” is a row in a relational table, where attribute names are pre-defined in a schema, and the values must be scalar. The values are referenced by attribute name, as opposed to an array or list, where they are referenced by ordinal position.
- A “document” allows values to be nested documents or lists as well as scalar values, and the attribute names are dynamically defined for each document at runtime. A document differs from a tuple in that the attributes are not defined in a global schema, and this wider range of values is permitted.
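
The contrast between a tuple and a document can be shown with plain Python data structures. This is only an analogy, assuming a hypothetical `Person` schema: a `namedtuple` stands in for a relational row and a `dict` for a document. No particular data store's API is implied, and `namedtuple` does not itself restrict values to scalars; the example simply keeps them scalar.

```python
from collections import namedtuple

# A "tuple": attribute names are fixed in advance by a schema.
# Values are kept scalar here by convention; namedtuple does not enforce that.
Person = namedtuple("Person", ["name", "age", "city"])  # hypothetical schema
row = Person(name="Alice", age=30, city="Oslo")

# Values are referenced by attribute name rather than by ordinal position.
assert row.age == 30

# A "document": attribute names are chosen per document at runtime, and
# values may be scalars, lists, or nested documents.
doc_a = {
    "name": "Alice",
    "emails": ["alice@example.com", "a@example.org"],  # list value
    "address": {"city": "Oslo", "zip": "0150"},        # nested document
}
doc_b = {"name": "Bob", "nickname": "bobby"}           # different attributes

# Adding a new attribute to one document needs no schema change.
doc_b["age"] = 41

# The tuple type rejects an attribute that its schema does not define.
try:
    Person(name="Carol", age=25, city="Rome", nickname="caz")
    schema_enforced = False
except TypeError:
    schema_enforced = True

print(row, doc_a["address"]["city"], sorted(doc_b), schema_enforced)
```

The two documents carry different attribute sets and one of them nests a further document, which is exactly what the fixed schema of the tuple rules out.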