From Wikipedia
Older version, mostly by "Comps" A database is an organized collection of data, today typically in digital form. The data are typically organized to model relevant aspects of reality (for example, the availability of rooms in hotels), in a way that supports processes requiring this information (for example, finding a hotel with vacancies). The term database is correctly applied to the data and their supporting data structures, and not to the database management system (DBMS). The database data collection with DBMS is called a database system. The term database system implies that the data is managed to some level of quality (measured in terms of accuracy, availability, usability, and resilience) and this in turn often implies the use of a general-purpose database management system (DBMS).[1] A general-purpose DBMS is typically a complex software system that meets many usage requirements to properly maintain its databases which are often large and complex. The utilization of databases is now so widespread that virtually every technology and product relies on databases and DBMSs for its development and commercialization, or even may have DBMS software embedded in it. Also, organizations and companies, from small to large, depend heavily on databases for their operations. Well known DBMSs include Oracle, IBM DB2, Microsoft SQL Server, Microsoft Access, PostgreSQL, MySQL, and SQLite. A database is not generally portable across different DBMS, but different DBMSs can inter-operate to some degree by using standards like SQL and ODBC together to support a single application built over more than one database. A DBMS also needs to provide effective run-time execution to properly support (e.g., in terms of performance, availability, and security) as many database end-users as needed. A way to classify databases involves the type of their contents, for example: bibliographic, document-text, statistical, or multimedia objects. Another way is by their application area, for example: accounting, music compositions, movies, banking, manufacturing, or insurance. The term database may be narrowed to specify particular aspects of organized collection of data and may refer to the logical database, to the physical database as data content in computer data storage or to many other database sub-definitions. HistoryDatabase conceptThe database concept has evolved since the 1960s to ease increasing difficulties in designing, building, and maintaining complex information systems (typically with many concurrent end-users, and with a large amount of diverse data). It has evolved together with database management systems which enable the effective handling of databases. Though the terms database and DBMS define different entities, they are inseparable: a database's properties are determined by its supporting DBMS. The Oxford English dictionary cites[citation needed] a 1962 technical report as the first to use the term "data-base." With the progress in technology in the areas of processors, computer memory, computer storage and computer networks, the sizes, capabilities, and performance of databases and their respective DBMSs have grown in orders of magnitudes. For decades it has been unlikely that a complex information system can be built effectively without a proper database supported by a DBMS. The utilization of databases is now spread to such a wide degree that virtually every technology and product relies on databases and DBMSs for its development and commercialization, or even may have such embedded in it. Also, organizations and companies, from small to large, heavily depend on databases for their operations.No widely accepted exact definition exists for DBMS. However, a system needs to provide considerable functionality to qualify as a DBMS. Accordingly its supported data collection needs to meet respective usability requirements (broadly defined by the requirements below) to qualify as a database. Thus, a database and its supporting DBMS are defined here by a set of general requirements listed below. Virtually all existing mature DBMS products meet these requirements to a great extent, while less mature either meet them or converge to meet them. Evolution of database and DBMS technology
In the earliest database systems, efficiency was perhaps the primary concern, but it was already recognized that there were other important objectives. One of the key aims was to make the data independent of the logic of application programs, so that the same data could be made available to different applications. The first generation of database systems were navigational,[2] applications typically accessed data by following pointers from one record to another. The two main data models at this time were the hierarchical model, epitomized by IBM's IMS system, and the Codasyl model (Network model), implemented in a number of products such as IDMS. The relational model, first proposed in 1970 by Edgar F. Codd, departed from this tradition by insisting that applications should search for data by content, rather than by following links. This was considered necessary to allow the content of the database to evolve without constant rewriting of applications. Relational systems placed heavy demands on processing resources, and it was not until the mid 1980s that computing hardware became powerful enough to allow them to be widely deployed. By the early 1990s, however, relational systems were dominant for all large-scale data processing applications, and they remain dominant today (2012) except in niche areas. The dominant database language is the standard SQL for the Relational model, which has influenced database languages also for other data models. Because the relational model emphasizes search rather than navigation, it does not make relationships between different entities explicit in the form of pointers, but represents them rather using primary keys and foreign keys. While this is a good basis for a query language, it is less well suited as a modeling language. For this reason a different model, the entity-relationship model which emerged shortly later (1976), gained popularity for database design. In the period since the 1970s database technology has kept pace with the increasing resources becoming available from the computing platform: notably the rapid increase in the capacity and speed (and reduction in price) of disk storage, and the increasing capacity of main memory. This has enabled ever larger databases and higher throughputs to be achieved. The rigidity of the relational model, in which all data is held in tables with a fixed structure of rows and columns, has increasingly been seen as a limitation when handling information that is richer or more varied in structure than the traditional 'ledger-book' data of corporate information systems: for example, document databases, engineering databases, multimedia databases, or databases used in the molecular sciences. Various attempts have been made to address this problem, many of them gathering under banners such as post-relational or NoSQL. Two developments of note are the object database and the XML database. The vendors of relational databases have fought off competition from these newer models by extending the capabilities of their own products to support a wider variety of data types. General-purpose DBMSA DBMS has evolved into a complex software system and its development typically requires thousands of person-years of development effort.[citation needed] Some general-purpose DBMSs, like Oracle, Microsoft SQL Server, and IBM DB2, have been undergoing upgrades for thirty years or more. General-purpose DBMSs aim to satisfy as many applications as possible, which typically makes them even more complex than special-purpose databases. However, the fact that they can be used "off the shelf", as well as their amortized cost over many applications and instances, makes them an attractive alternative (Vs. one-time development) whenever they meet an application's requirements.Though attractive in many cases, a general-purpose DBMS is not always the optimal solution: When certain applications are pervasive with many operating instances, each with many users, a general-purpose DBMS may introduce unnecessary overhead and too large "footprint" (too large amount of unnecessary, unutilized software code). Such applications usually justify dedicated development. Typical examples are email systems, though they need to possess certain DBMS properties: email systems are built in a way that optimizes email messages handling and managing, and do not need significant portions of a general-purpose DBMS functionality. Types of people involvedThree types of people are involved with a general-purpose DBMS:
Database machines and appliances
Main article: Database machine
In the 1970s and 1980s attempts were made to build database systems
with integrated hardware and software. The underlying philosophy was
that such integration would provide higher performance at lower cost.
Examples were IBM System/38, the early offering of Teradata, and the Britton Lee, Inc. database machine. Another approach to hardware support for database management was ICL's CAFS
accelerator, a hardware disk controller with programmable search
capabilities. In the long term these efforts were generally unsuccessful
because specialized database machines could not keep pace with the
rapid development and progress of general-purpose computers. Thus most
database systems nowadays are software systems running on
general-purpose hardware, using general-purpose computer data storage.
However this idea is still pursued for certain applications by some
companies like Netezza and Oracle (Exadata).Database researchDatabase research has been an active and diverse area, with many specializations, carried out since the early days of dealing with the database concept in the 1960s. It has strong ties with database technology and DBMS products. Database research has taken place at research and development groups of companies (e.g., notably at IBM Research, who contributed technologies and ideas virtually to any DBMS existing today), research institutes, and academia. Research has been done both through theory and prototypes. The interaction between research and database related product development has been very productive to the database area, and many related key concepts and technologies emerged from it. Notable are the Relational and the Entity-relationship models, the atomic transaction concept and related Concurrency control techniques, Query languages and Query optimization methods, RAID, and more. Research has provided deep insight to virtually all aspects of databases, though not always has been pragmatic, effective (and cannot and should not always be: research is exploratory in nature, and not always leads to accepted or useful ideas). Ultimately market forces and real needs determine the selection of problem solutions and related technologies, also among those proposed by research. However, occasionally, not the best and most elegant solution wins (e.g., SQL). Along their history DBMSs and respective databases, to a great extent, have been the outcome of such research, while real product requirements and challenges triggered database research directions and sub-areas.The database research area has several notable dedicated academic journals (e.g., ACM Transactions on Database Systems-TODS, Data and Knowledge Engineering-DKE, and more) and annual conferences (e.g., ACM SIGMOD, ACM PODS, VLDB, IEEE ICDE, and more), as well as an active and quite heterogeneous (subject-wise) research community all over the world. Database type examplesThe following are examples of various database types. Some of them are not main-stream types, but most of them have received special attention (e.g., in research) due to end-user requirements. Some exist as specialized DBMS products, and some have their functionality types incorporated in existing general-purpose DBMSs. Though may differ in nature and functionality, these various types typically have to comply with the usability requirements below to comply as databases.
Main article: Active database
Main article: Cloud database
Main article: Data warehouse
Main article: Distributed database
Main article: Document-oriented database
Main article: Embedded database
Main articles: Federated database system and Heterogeneous Database System
Main article: Graph database
Main article: Hypertext
Main article: In-memory database
Main article: Knowledge base
Main article: Parallel database
Main article: Real time database
If a DBMS system responses users' request in a given time period, it can be regarded as a real time database.
Main article: Spatial database
A spatial database can store the data with multidimensional features.
The queries on such data include location based queries, like "where is
the closest hotel in my area".
Main article: Temporal database
A temporal database is a database with built-in time aspects, for
example a temporal data model and a temporal version of Structured Query
Language (SQL). More specifically the temporal aspects usually include
valid-time and transaction-time.
Main article: Unstructured data
Major database usage requirementsThe major purpose of a database is to provide the information system (in its broadest sense) that utilizes it with the information the system needs according to its own requirements. A certain broad set of requirements refines this general goal. These database requirements translate to requirements for the respective DBMS, to allow conveniently building a proper database for the given application. If this goal is met by a DBMS, then the designers and builders of the specific database can concentrate on the application's aspects, and not deal with building and maintaining the underlying DBMS. Also, since a DBMS is complex and expensive to build and maintain, it is not economical to build such a new tool (DBMS) for every application. Rather it is desired to provide a flexible tool for handling databases for as many as possible given applications, i.e., a general-purpose DBMS.Functional requirementsCertain general functional requirements need to be met in conjunction with a database. They describe what is needed to be defined in a database for any specific application.Defining the structure of data: Data modeling and Data definition languagesThe database needs to be based on a data model that is sufficiently rich to describe in the database all the needed respective application's aspects. A data definition language exists to describe the databases within the data model. Such language is typically data model specific.Manipulating the data: Data manipulation languages and Query languagesA database data model needs support by a sufficiently rich data manipulation language to allow all database manipulations and information generation (from the data) as needed by the respective application. Such language is typically data model specific.Protecting the data: Setting database security types and levelsThe DB needs built-in security means to protect its content (and users) from dangers of unauthorized users (either humans or programs). Protection is also provided from types of unintentional breach. Security types and levels should be defined by the database owners.Describing processes that use the data: Workflow and Business process modeling
Main articles: Workflow and Business process modeling
Manipulating database data often involves processes of several
interdependent steps, at different times (e.g., when different people's
interactions are involved; e.g., generating an insurance policy). Data
manipulation languages are typically intended to describe what is needed
in a single such step. Dealing with multiple steps typically requires
writing quite complex programs. Most applications are programmed using
common programming languages and software development tools. However the area of process description has evolved in the frameworks of workflow and business processes
with supporting languages and software packages which considerably
simplify the tasks. Traditionally these frameworks have been out of the
scope of common DBMSs, but utilization of them has become common-place,
and often they are provided as add-on's to DBMSs.Operational requirementsOperational requirements are needed to be met by a database in order to effectively support an application when operational. Though it typically may be expected that operational requirements are automatically met by a DBMS, in fact it is not so in most of the cases: To be met substantial work of design and tuning is typically needed by database administrators. This is typically done by specific instructions/operations through special database user interfaces and tools, and thus may be viewed as secondary functional requirements (which are not less important than the primary).AvailabilityA DB should maintain needed levels of availability, i.e., the DB needs to be available in a way that a user's action does not need to wait beyond a certain time range before starting executing upon the DB. Availability also relates to failure and recovery from it (see Recovery from failure and disaster below): Upon failure and during recovery normal availability changes, and special measures are needed to satisfy availability requirements.PerformanceUsers' actions upon the DB should be executed within needed time ranges.Isolation between usersWhen multiple users access the database concurrently the actions of a user should be uninterrupted and unaffected by actions of other users. These concurrent actions should maintain the DB's consistency (i.e., keep the DB from corruption).Recovery from failure and disaster
Main articles: Data recovery and Disaster recovery
All computer systems, including DBMSs, are prone to failures for many
reasons (both software and hardware related). Failures typically
corrupt the DB, typically to the extent that it is impossible to repair
it without special measures. The DBMS should provide automatic recovery from failure procedures that repair the DB and return it to a well defined state.Backup and restore
Main article: Backup
Sometimes it is desired to bring a database back to a previous state
(for many reasons, e.g., cases when the database is found corrupted due
to a software error, or if it has been updated with erroneous data). To
achieve this a backup operation is done occasionally or
continuously, where each desired database state (i.e., the values of its
data and their embedding in database's data structures) is kept within
dedicated backup files (many techniques exist to do this effectively).
When this state is needed, i.e., when it is decided by a database
administrator to bring the database back to this state (e.g., by
specifying this state by a desired point in time when the database was
in this state), these files are utilized to restore that state.Data independence
Main article: Data independence
Data independence pertains to a database's life cycle (see Database building, maintaining, and tuning
below). It strongly impacts the convenience and cost of maintaining an
application and its database, and has been the major motivation for the
emergence and success of the Relational model, as well as the
convergence to a common database architecture. In general the term "data
independence" means that changes in the database's structure do not
require changes in its application's computer programs, and that changes
in the database at a certain architectural level (see below) do not
affect the database's levels above. Data independence is achieved to a
great extent in contemporary DBMS, but it is not completely attainable,
and achieved at different degrees for different types of database
structural changes.Major database functional areasThe functional areas are domains and subjects that have evolved in order to provide proper answers and solutions to the functional requirements above.Data models
Main articles: Data model and Database model
A data model is an abstract structure that provides the means to
effectively describe specific data structures needed to model an
application. As such a data model needs sufficient expressive power to
capture the needed aspects of applications. These applications are often
typical to commercial companies and other organizations (like
manufacturing, human-resources, stock, banking, etc.). For effective
utilization and handling it is desired that a data model is relatively
simple and intuitive. This may be in conflict with high expressive power
needed to deal with certain complex applications. Thus any popular
general-purpose data model usually well balances between being intuitive
and relatively simple, and very complex with high expressive power. The
application's semantics is usually not explicitly expressed in the
model, but rather implicit (and detailed by documentation external to
the model) and hinted to by data item types' names (e.g., "part-number")
and their connections (as expressed by generic data structure types
provided by each specific model).Early data modelsThese models were popular in the 1960s, 1970s, but nowadays can be found primarily in old legacy systems. They are characterized primarily by being navigational with strong connections between their logical and physical representations, and deficiencies in data independence.Hierarchical model
Main article: Hierarchical database model
In the Hierarchical model different record types (representing real-world entities) are embedded in a predefined hierarchical (tree-like)
structure. This hierarchy is used as the physical order of records in
storage. Record access is done by navigating through the data structure
using pointers combined with sequential accessing.This model has been supported primarily by the IBM IMS DBMS, one of the earliest DBMSs. Various limitations of the model have been compensated at later IMS versions by additional logical hierarchies imposed on the base physical hierarchy. Network model
Main article: Network model (database)
In this model a hierarchical relationship between two record types (representing real-world entities) is established by the set construct. A set consists of circular linked lists
where one record type, the set owner or parent, appears once in each
circle, and a second record type, the subordinate or child, may appear
multiple times in each circle. In this way a hierarchy may be
established between any two record types, e.g., type A is the owner of
B. At the same time another set may be defined where B is the owner of
A. Thus all the sets comprise a general directed graph (ownership defines a direction), or network construct. Access to records is either sequential (usually in each record type) or by navigation in the circular linked lists.This model is more general and powerful than the hierarchical, and has been the most popular before being replaced by the Relational model. It has been standardized by CODASYL. Popular DBMS products that utilized it were Cincom Systems' Total and Cullinet's IDMS. Inverted file model
Main article: Inverted index
An inverted file or inverted index
of a first file, by a field in this file (the inversion field), is a
second file in which this field is the key. A record in the second file
includes a key and pointers to records in the first file where the
inversion field has the value of the key. This is also the logical
structure of contemporary database indexes. The related Inverted file data model utilizes inverted files of primary database files to efficiently directly access needed records in these files.Notable for using this data model is the ADABAS DBMS of Software AG, introduced in 1970. ADABAS has gained considerable customer base and exists and supported until today. In the 1980s it has adopted the Relational model and SQL in addition to its original tools and languages. Relational model
Main article: Relational model
The relational model is a simple model that provides flexibility. It organizes data based on two-dimensional arrays known as relations, or tables as related to databases. These relations consist of a heading and a set of zero or more tuples
in arbitrary order. The heading is an unordered set of zero or more
attributes, or columns of the table. The tuples are a set of unique
attributes mapped to values, or the rows of data in the table. Data can
be associated across multiple tables with a key. A key is a single, or
set of multiple, attribute(s) that is common to both tables. The most
common language associated with the relational model is the Structured
Query Language (SQL), though it differs in some places.Entity-relationship model
Main article: Entity-relationship model
Object model
Main articles: Object model and Object database
A variety of these ways have been tried[by whom?] for storing objects in a database. Some products have approached the problem from the application-programming side, by making the objects manipulated by the program persistent. This also typically requires the addition of some kind of query language, since conventional programming languages do not provide language-level functionality for finding objects based on their information content. Others[which?] have attacked the problem from the database end, by defining an object-oriented data model for the database, and defining a database programming language that allows full programming capabilities as well as traditional query facilities. Object relational model
Main article: Object-relational database
XML as a database data model
Main articles: XML database and XML
Other database models
Some of these extensions to the relational model integrate concepts from technologies that pre-date the relational model. For example, they allow representation of a directed graph with trees on the nodes. The German company sones implements this concept in its GraphDB. Some post-relational products extend relational systems with non-relational features. Others arrived in much the same place by adding relational features to pre-relational systems. Paradoxically, this allows products that are historically pre-relational, such as PICK and MUMPS, to make a plausible claim to be post-relational. The resource space model (RSM) is a non-relational data model based on multi-dimensional classification.[9] Database languagesDatabase languages are dedicated programming languages, tailored and utilized to
SQL for the Relational model
Main article: SQL
A major Relational model language supported by all the relational DBMSs and a standard.SQL was one of the first commercial languages for the relational model. Despite not adhering to the relational model as described by Codd, it has become the most widely used database language.[10][11] Though often described as, and to a great extent is a declarative language, SQL also includes procedural elements. SQL became a standard of the American National Standards Institute (ANSI) in 1986, and of the International Organization for Standards (ISO) in 1987. Since then the standard has been enhanced several times with added features. However, issues of SQL code portability between major RDBMS products still exist due to lack of full compliance with, or different interpretations of the standard. Among the reasons mentioned are the large size, and incomplete specification of the standard, as well as vendor lock-in. OQL for the Object model
Main article: OQL
An object model language standard (by the Object Data Management Group) that has influenced the design of some of the newer query languages like JDOQL and EJB QL, though they cannot be considered as different flavors of OQL.XQuery for the XML modelXQuery is an XML based database language (also named XQL). SQL/XML combines XQuery and XML with SQL.[12]Database architectureDatabase architecture (to be distinguished from DBMS architecture; see below) may be viewed, to some extent, as an extension of data modeling. It is used to conveniently answer requirements of different end-users from a same database, as well as for other benefits. For example, a financial department of a company needs the payment details of all employees as part of the company's expenses, but not other many details about employees, that are the interest of the human resources department. Thus different departments need different views of the company's database, that both include the employees' payments, possibly in a different level of detail (and presented in different visual forms). To meet such requirement effectively database architecture consists of three levels: external, conceptual and internal. Clearly separating the three levels was a major feature of the relational database model implementations that dominate 21st century databases.[13]
The above three-level database architecture also relates to and being motivated by the concept of data independence which has been described for long time as a desired database property and was one of the major initial driving forces of the Relational model. In the context of the above architecture it means that changes made at a certain level do not affect definitions and software developed with higher level interfaces, and are being incorporated at the higher level automatically. For example, changes in the internal level do not affect application programs written using conceptual level interfaces, which saves substantial change work that would be needed otherwise. In summary, the conceptual is a level of indirection between internal and external. On one hand it provides a common view of the database, independent of different external view structures, and on the other hand it is uncomplicated by details of how the data is stored or managed (internal level). In principle every level, and even every external view, can be presented by a different data model. In practice usually a given DBMS uses the same data model for both the external and the conceptual levels (e.g., relational model). The internal level, which is hidden inside the DBMS and depends on its implementation (see Implementation section below), requires a different level of detail and uses its own data structure types, typically different in nature from the structures of the external and conceptual levels which are exposed to DBMS users (e.g., the data models above): While the external and conceptual levels are focused on and serve DBMS users, the concern of the internal level is effective implementation details. Database security
Main article: Database security
Database security deals with all various aspects of protecting the
database content, its owners, and its users. It ranges from protection
from intentional unauthorized database uses to unintentional database
accesses by unauthorized entities (e.g., a person or a computer
program).The following are major areas of database security (among many others). Access control
Main article: Access control
Database access control deals with controlling who (a person or a
certain computer program) is allowed to access what information in the
database. The information may comprise specific database objects (e.g.,
record types, specific records, data structures), certain computations
over certain objects (e.g., query types, or specific queries), or
utilizing specific access paths to the former (e.g., using specific
indexes or other data structures to access information).Database access controls are set by special authorized (by the database owner) personnel that uses dedicated protected security DBMS interfaces. Data security
Main articles: Data security and Encryption
The definition of data security varies and may overlap with other
database security aspects. Broadly it deals with protecting specific
chunks of data, both physically (i.e., from corruption, or destruction,
or removal; e.g., see Physical security),
or the interpretation of them, or parts of them to meaningful
information (e.g., by looking at the strings of bits that they comprise,
concluding specific valid credit-card numbers; e.g., see Data encryption).Database audit
Main article: Database audit
Database audit primarily involves monitoring that no security breach,
in all aspects, has happened. If security breach is discovered then all
possible corrective actions are taken.Database design
Main article: Database design
Database design is done before building it to meet needs of end-users
within a given application/information-system that the database is
intended to support. The database design defines the needed data and
data structures that such a database comprises. A design is typically
carried out according to the common three architectural levels of a
database (see Database architecture above). First, the conceptual level
is designed, which defines the over-all picture/view of the database,
and reflects all the real-world elements (entities) the database intends
to model, as well as the relationships among them. On top of it the
external level, various views of the database, are designed according to
(possibly completely different) needs of specific end-user types. More
external views can be added later. External views requirements may
modify the design of the conceptual level (i.e., add/remove entities and
relationships), but usually a well designed conceptual level for an
application well supports most of the needed external views. The
conceptual view also determines the internal level (which primarily
deals with data layout in storage) to a great extent. External views
requirement may add supporting storage structures, like materialized
views and indexes, for enhanced performance. Typically the internal
layer is optimized for top performance, in an average way that takes
into account performance requirements (possibly conflicting) of
different external views according to their relative importance. While
the conceptual and external levels design can usually be done
independently of any DBMS (DBMS-independent design software packages
exist, possibly with interfaces to some specific popular DBMSs), the
internal level design highly relies on the capabilities and internal
data structure of the specific DBMS utilized (see the Implementation
section below).A common way to carry out conceptual level design is to use the entity-relationship model (ERM) (both the basic one, and with possible enhancement that it has gone over), since it provides a straightforward, intuitive perception of an application's elements and semantics. An alternative approach, which preceded the ERM, is using the Relational model and dependencies (mathematical relationships) among data to normalize the database, i.e., to define the ("optimal") relations (data record or tupple types) in the database. Though a large body of research exists for this method it is more complex, less intuitive, and not more effective than the ERM method. Thus normalization is less utilized in practice than the ERM method. The ERM may be less subtle than normalization in several aspects, but it captures the main needed dependencies which are induced by keys/identifiers of entities and relationships. Also the ERM inherently includes the important inclusion dependencies (i.e., an entity instance that does not exist (has not been explicitly inserted) cannot appear in a relationship with other entities) which usually have been ignored in normalization.[14] In addition the ERM allows entity type generalization (the Is-a relationship) and implied property (attribute) inheritance (similarly to the that found in the object model). Another aspect of database design is its security. It involves both defining access control to database objects (e.g., Entities, Views) as well as defining security levels and methods for the data itself (See Database security above). Entities and relationships
Main article: Entity-relationship model
The most common database design methods are based on the entity
relationship model (ERM, or ER model). This model views the world in a
simplistic but very powerful way: It consists of "Entities" and the
"Relationships" among them. Accordingly a database consists of entity and relationship types, each with defined attributes
(field types) that model concrete entities and relationships. Modeling a
database in this way typically yields an effective one with desired
properties (as in some normal forms; see normalization below).
Such models can be translated to any other data model required by any
specific DBMS for building an effective database.Database normalization
Main article: Database normalization
In the design of a relational database, the process of organizing database relations to minimize redundancy is called normalization.
The goal is to produce well-structured relations so that additions,
deletions, and modifications of a field can be made in just one relation
(table) without worrying about appearance and update of the same field
in other relations. The process is algorithmic
and based on dependencies (mathematical relations) that exist among
relations' field types. The process result is bringing the database
relations into a certain "normal form". Several normal forms exist with different properties.Database building, maintaining, and tuning
Main article: Database tuning
After designing a database for an application arrives the stage of building the database. Typically an appropriate general-purpose DBMS can be selected to be utilized for this purpose. A DBMS provides the needed user interfaces
to be utilized by database administrators to define the needed
application's data structures within the DBMS's respective data model.
Other user interfaces are used to select needed DBMS parameters (like
security related, storage allocation parameters, etc.).When the database is ready (all its data structures and other needed components are defined) it is typically populated with initial application's data (database initialization, which is typically a distinct project; in many cases using specialized DBMS interfaces that support bulk insertion) before making it operational. In some cases the database becomes operational while empty from application's data, and data are accumulated along its operation. After completing building the database and making it operational arrives the database maintenance stage: Various database parameters may need changes and tuning for better performance, application's data structures may be changed or added, new related application programs may be written to add to the application's functionality, etc. Miscellaneous areasDatabase migration between DBMSs
Implementation: Database management systemsor How database usage requirements are met
Main article: Database management system
A database management system (DBMS) is a system that allows to
build and maintain databases, as well as to utilize their data and
retrieve information from it. A DBMS defines the database type that it
supports, as well as its functionality and operational capabilities. A
DBMS provides the internal processes for external applications built on
them. The end-users of some such specific application are usually
exposed only to that application and do not directly interact with the
DBMS. Thus end-users enjoy the effects of the underlying DBMS, but its
internals are completely invisible to end-users. Database designers and
database administrators interact with the DBMS through dedicated
interfaces to build and maintain the applications' databases, and thus
need some more knowledge and understanding about how DBMSs operate and
the DBMSs' external interfaces and tuning parameters.A DBMS consists of software that operates databases, providing storage, access, security, backup and other facilities to meet needed requirements. DBMSs can be categorized according to the database model(s) that they support, such as relational or XML, the type(s) of computer they support, such as a server cluster or a mobile phone, the query language(s) that access the database, such as SQL or XQuery, performance trade-offs, such as maximum scale or maximum speed or others. Some DBMSs cover more than one entry in these categories, e.g., supporting multiple query languages. Database software typically support the Open Database Connectivity (ODBC) standard which allows the database to integrate (to some extent) with other databases. The development of a mature general-purpose DBMS typically takes several years and many man-years. Developers of DBMS typically update their product to follow and take advantage of progress in computer and storage technologies. Several DBMS products like Oracle and IBM DB2 have been in on-going development since the 1970s-1980s. Since DBMSs comprise a significant economical market, computer and storage vendors often take into account DBMS requirements in their own development plans. DBMS architecture: major DBMS componentsDBMS architecture specifies its components (including descriptions of their functions) and their interfaces. DBMS architecture is distinct from database architecture. The following are major DBMS components:
Database storage
Main article: Computer data storage
Database storage is the container of the physical materialization of a database. It comprises the Internal (physical) level in the database architecture. It also contains all the information needed (e.g., metadata, "data about the data", and internal data structures) to reconstruct the Conceptual level and External level
from the Internal level when needed. It is not part of the DBMS but
rather manipulated by the DBMS (by its Storage engine; see above) to
manage the database that resides in it. Though typically accessed by a
DBMS through the underlying Operating system (and often utilizing the operating systems' File systems
as intermediates for storage layout), storage properties and
configuration setting are extremely important for the efficient
operation of the DBMS, and thus are closely maintained by database
administrators. A DBMS, while in operation, always has its database
residing in several types of storage (e.g., memory and external
storage). The database data and the additional needed information,
possibly in very large amounts, are coded into bits. Data typically
reside in the storage in structures that look completely different from
the way the data look in the conceptual and external levels, but in ways
that attempt to optimize (the best possible) these levels'
reconstruction when needed by users and programs, as well as for
computing additional types of needed information from the data (e.g.,
when querying the database).In principle the database storage can be viewed as a linear address space, where every bit of data has its unique address in this address space. Practically only a very small percentage of addresses is kept as initial reference points (which also requires storage), and most of the database data is accessed by indirection using displacement calculations (distance in bits from the reference points) and data structures which define access paths (using pointers) to all needed data in effective manner, optimized for the needed data access operations. DataCoding the data and Error-correcting codes
Main articles: Code, Character encoding, Error detection and correction, and Cyclic redundancy check
Data compression
Main article: Data compression
Data compression methods allow in many cases to represent a string of
bits by a shorter bit string ("compress") and reconstruct the original
string ("decompress") when needed. This allows to utilize substantially
less storage (tens of percents) for many types of data at the cost of
more computation (compress and decompress when needed). Analysis of
trade-off between storage cost saving and costs of related computations
and possible delays in data availability is done before deciding whether
to keep certain data in a database compressed or not.Data compression is typically controlled through the DBMS's data definition interface, but in some cases may be a default and automatic. Data encryption
Main article: Cryptography
For security reasons certain types of data (e.g., credit-card information) may be kept encrypted
in storage to prevent the possibility of unauthorized information
reconstruction from chunks of storage snapshots (taken either via
unforeseen vulnerabilities in a DBMS, or more likely, by bypassing it).Data encryption is typically controlled through the DBMS's data definition interface, but in some cases may be a default and automatic. Data storage typesThis collection of bits describes both the contained database data and its related metadata (i.e., data that describes the contained data and allows computer programs to manipulate the database data correctly). The size of a database can nowadays be tens of Terabytes, where a byte is eight bits. The physical materialization of a bit can employ various existing technologies, while new and improved technologies are constantly under development. Common examples are:
Databases are usually too expensive (in terms of importance and needed investment in resources, e.g., time, money, to build them) to be lost by a power interruption. Thus at any point in time most of their content resides in nonvolatile storage. Even if for operational reason very large portions of them reside in volatile storage (e.g., tens of Gigabytes in volatile memory, for in-memory databases), most of this is backed-up in nonvolatile storage. A relatively small portion of this, which temporarily may not have nonvolatile backup, can be reconstructed by proper automatic database recovery procedures after volatile storage content loss. More examples of storage types:
Storage metrics
Protecting storage device content: Device mirroring (replication) and RAID
Main articles: Disk mirroring and RAID
Database storage layoutDatabase bits are laid-out in storage in data-structures and grouping that can take advantage of both known effective algorithms to retrieve and manipulate them and the storage own properties. Typically the storage itself is design to meet requirements of various areas that extensively utilize storage, including databases. A DBMS in operation always simultaneously utilizes several storage types (e.g., memory, and external storage), with respective layout methods.Database storage hierarchyA database, while in operation, resides simultaneously in several types of storage. By the nature of contemporary computers most of the database part inside a computer that hosts the DBMS resides (partially replicated) in volatile storage. Data (pieces of the database) that are being processed/manipulated reside inside a processor, possibly in processor's caches. These data are being read from/written to memory, typically through a computer bus (so far typically volatile storage components). Computer memory is communicating data (transferred to/from) external storage, typically through standard storage interfaces or networks (e.g., fibre channel, iSCSI). A storage array, a common external storage unit, typically has storage hierarchy of it own, from a fast cache, typically consisting of (volatile and fast) DRAM, which is connected (again via standard interfaces) to drives, possibly with different speeds, like flash drives[disambiguation needed] and magnetic disk drives (non-volatile). The drives may be connected to magnetic tapes, on which typically the least active parts of a large database may reside, or database backup generations.Typically a correlation exists currently between storage speed and price, while the faster storage is typically volatile. Data structures
Main article: Database storage structures
Databases may store data in many data structure types.[15] Common examples are the following:
Application data and DBMS dataA typical DBMS cannot store the data of the application it serves alone. In order to handle the application data the DBMS need to store this data in data structures that comprise specific data by themselves. In addition the DBMS needs its own data structures and many types of bookkeeping data like indexes and logs. The DBMS data is an integral part of the database and may comprise a substantial portion of it.Database indexing
Main article: Index (database)
Indexing
is a technique for improving database performance. The many types of
indexes share the common property that they reduce the need to examine
every entry when running a query. In large databases, this can reduce
query time/cost by orders of magnitude. The simplest form of index is a
sorted list of values that can be searched using a binary search
with an adjacent reference to the location of the entry, analogous to
the index in the back of a book. The same data can have multiple indexes
(an employee database could be indexed by last name and hire date.)Indexes affect performance, but not results. Database designers can add or remove indexes without changing application logic, reducing maintenance costs as the database grows and database usage evolves. Given a particular query, the DBMS' query optimizer is responsible for devising the most efficient strategy for finding matching data. Indexes can speed up data access, but they consume space in the database, and must be updated each time the data is altered. Indexes therefore can speed data access but slow data maintenance. These two properties determine whether a given index is worth the cost. Database data clusteringIn many cases substantial performance improvement is gained if different types of database objects that are usually utilized together are laid in storage in proximity, being clustered. This usually allows to retrieve needed related objects from storage in minimum number of input operations (each sometimes substantially time consuming). Even for in-memory databases clustering provides performance advantage due to common utilization of large caches for input-output operations in memory, with similar resulting behavior.For example it may be beneficial to cluster a record of an item in stock with all its respective order records. The decision of whether to cluster certain objects or not depends on the objects' utilization statistics, object sizes, caches sizes, storage types, etc. In a relational database clustering the two respective relations "Items" and "Orders" results in saving the expensive execution of a Join operation between the two relations whenever such a join is needed in a query (the join result is already ready in storage by the clustering, available to be utilized). Database materialized views
Main article: Materialized view
Often storage redundancy is employed to increase performance. A common example is storing materialized views, which consist of frequently needed external views
or query results. Storing such views saves the expensive computing of
them each time they are needed. The downsides of materialized views are
the overhead incurred when updating them to keep them synchronized with
their original updated database data, and the cost of storage
redundancy.Database and database object replication
Main article: Database replication
Database transactions
Main article: Database transaction
As with every software system, a DBMS that operates in a faulty
computing environment is prone to failures of many kinds. A failure can
corrupt the respective database unless special measures are taken to
prevent this. A DBMS achieves certain levels of fault tolerance by
encapsulating operations within transactions. The concept of a database transaction (or atomic transaction)
has evolved in order to enable both a well understood database system
behavior in a faulty environment where crashes can happen any time, and recovery
from a crash to a well understood database state. A database
transaction is a unit of work, typically encapsulating a number of
operations over a database (e.g., reading a database object, writing,
acquiring lock, etc.), an abstraction supported in database and also
other systems. Each transaction has well defined boundaries in terms of
which program/code executions are included in that transaction
(determined by the transaction's programmer via special transaction
commands).ACID rules
Main article: ACID
Every database transaction obeys the following rules:
Isolation, concurrency control, and lockingIsolation provides the ability for multiple users to operate on the database at the same time without corrupting the data.
Query optimization
Main articles: Query optimization and Query optimizer
A query is a request for information from a database. It can be as
simple as "finding the address of a person with SS# 123-45-6789," or
more complex like "finding the average salary of all the employed
married men in California between the ages 30 to 39, that earn less than
their wives." Queries results are generated by accessing relevant
database data and manipulating it in a way that yields the requested
information. Since database structures are complex, in most cases, and
especially for not-very-simple queries, the needed data for a query can
be collected from a database by accessing it in different ways, through
different data-structures, and in different orders. Each different way
typically requires different processing time. Processing times of a same
query may have large variance, from a fraction of a second to hours,
depending on the way selected. The purpose of query optimization,
which is an automated process, is to find the way to process a given
query in minimum time. The large possible variance in time justifies
performing query optimization, though finding the exact optimal way to
execute a query, among all possibilities, is typically very complex,
time consuming by itself, may be too costly, and often practically
impossible. Thus query optimization typically tries to approximate the
optimum by comparing several common-sense alternatives to provide in a
reasonable time a "good enough" plan which typically does not deviate
much from the best possible result.DBMS support for the development and maintenance of a database and its application
References
|
Blog of Yoav Raz
Tuesday, April 23, 2013
Database
Subscribe to:
Posts (Atom)