31 December 2009

DBMS: Database Design (Just the Quotes)

"The database is the heart of our information processing systems and not simply a by-product of processing. […] A database that has been properly structured and organized, and which actually represents and contains that data of an enterprise, will stabilize the information processing systems. [...] The use of data is reckoning, the generation of information which will in turn be used to manage the enterprise. The generation of information is the only reason for having a database." (John K Lyon, "What is a Data Base", Bulletin of ACM SIGFIDET 5(1) , 1973)

"A database has structure which exists in three forms. It has a perceived structure, which is the way that each of us looks at the database. […] The physical structure of the database relates to the manner in which data is physically represented within the database in terms of sort sequence, record and field sizes, and its existence on one or more master files. […] The logical structure describes the way things really exist within the environment. The entities, i.e., the things within the enterprise, really exist; they have properties which distinguish them from all the other things within the business; and the relationships which exist among the entities." (John K Lyon, "What is a Data Base", Bulletin of ACM SIGFIDET 5(1) , 1973)

"Two of the most difficult areas of data-base management are the design of an information structure and the reduction of that structure to a data structure which is compatible with and managed by the DBMS. […] Data-base management systems are tools to be applied by the users of these systems to build an accurate and useful model of their organization and its information needs. To accomplish this, the information structure must accurately define and characterize the items of data and the relations among them that are of interest to the users. This is no small task, for it demands a knowledge of the organization and the distribution of information among its various parts." (Robert W Taylor & Randall L Frank, "CODASYL Data-Base Management Systems", 1976)

"A variety of reasons make careful database design essential. These include data redundancy, application performance, data independence, data security, and ease of programming. All are important factors in the data processing environment, and all can be adversely affected by a poor database design." (Mark L Gillenson, "Database: Step by Step", 1985)

"Database design refers to the process of organizing the data fields needed by one or more applications into an organized structure. This structure must foster the required relationships among the fields while conforming to the physical constraints of the particular database management system in use. There are two parts to the process that people usually associate with the term 'database design'. One is 'logical database design' which has two components. The first involves organizing the data fields into nonredundant groupings based on the data relationships. The second involves an initial organizing of those logical groupings into structures based on the nature of the DBMS and of the applications that will use the data. […] The second part of database design is 'physical database design, which refers to refitting the previously  described derived structures to conform to the performance and operational idiosyncrasies of the DBMS, again guided by the application's processing requirements." (Mark L Gillenson, "Database: Step by Step", 1985)

"The conceptual model is the focal point of the process of database design. All activities either converge upon or emanate from the conceptual model. All structures, through mappings either into or out of the conceptual model, must be to some extent compatible with it." (Andrew Pletch, "Conceptual Modeling in the Classroom", SIGMOD Record 18(1), 1989)

"Because one has to be an optimist to begin an ambitious project, it is not surprising that underestimation of completion time is the norm." (Fernando J Corbató, "On Building Systems That Will Fail", 1991)

"It is important to emphasize the value of simplicity and elegance, for complexity has a way of compounding difficulties and as we have seen, creating mistakes. My definition of elegance is the achievement of a given functionality with a minimum of mechanism and a maximum of clarity." (Fernando J Corbató, "On Building Systems That Will Fail", 1991)

"A database is basically a collection of related facts (e.g. a company's personnel records, or a bus timetable). Discovering the essential kinds of facts that underlie an application, and the conditions that apply to them, is both interesting and illuminating. The quality of the database design used for these facts and conditions is important. Just as a house built from a good architectural plan is more likely to be safe and convenient for living, a well-designed database simplifies the task of ensuring that its facts are correct and easy to get at." (Terry Halpin, "Conceptual Schema & Relational Database Design" 2nd Ed., 1995)

"If an application requires maintenance and retrieval of large amounts of data, a DBMS offers many advantages over manual record-keeping systems. Operations on data may often be performed faster. Data may be stored compactly on disk, and redundancy may be reduced by data sharing. Many data errors can be avoided by automatic integrity checking. With multi-user systems, access rights to data can be enforced by the system. People can spend more time on creative design rather than on routine tasks more suited to computers."(Terry Halpin, "Conceptual Schema & Relational Database Design" 2nd Ed., 1995)

"No matter how sophisticated the information system, if we give it the wrong picture of our UoD to start with, we can't expect to get much sense out of it. This is one aspect of the gigo (Garbage In Garbage Out) principle. Most of the problems with many database applications can be traced to bad database design." (Terry Halpin, "Conceptual Schema & Relational Database Design" 2nd Ed., 1995)

"When we design a database for a particular application, we create a model of it. Technically, the application area that we are modeling is called the universe of discourse (UoD), since it is the world (or universe) that we are interested in talking (or discoursing) about. Typically the UoD is a 'part' of the 'real world'. To build a good model requires a good understanding of the world we are modeling, and hence is a task ideally suited to people rather than machines. The main challenge is to describe the UoD clearly and precisely. Great care is required here, since errors introduced at this stage filter through to later stages in software development, and the later the errors are detected the more expensive they are to remove." (Terry Halpin, "Conceptual Schema & Relational Database Design" 2nd Ed., 1995)

"A bad logical design means that a good physical design cannot be performed. Good logical design is crucial to good database performance, and a bad logical design will result in a physical design that attempts to cover up the weaknesses in it. A bad logical design is hard to change, and once the system is implemented it will be almost impossible to do so." (Ken England, "Microsoft SQL Server 2000 Performance Optimization and Tuning Handbook", 2001)

"[…] physical database design is not a static, one-off process. Once the database has gone into production, the user requirements are likely to change. Even if they do not, the database data is likely to be volatile, and tables are likely to grow." (Ken England, "Microsoft SQL Server 2000 Performance Optimization and Tuning Handbook", 2001)

"The physical design process is a key phase in the overall design process. It is too often ignored until the last minute in the vain hope that performance will be satisfactory. Without a good physical design, performance is rarely satisfactory and throwing hardware at the problem is rarely completely effective. There is no substitute for a good physical design, and the time and effort spent in the physical design process will be rewarded with an efficient and well-tuned database, not to mention happy users!" (Ken England, "Microsoft SQL Server 2000 Performance Optimization and Tuning Handbook", 2001)

"There is no right way to design a database; there are a number of possible approaches and all these may be perfectly valid. It is sometimes said that performance tuning is an art, not a science. This may be true, but it is important to undertake performance tuning experiments with the same kind of rigorous, controlled conditions under which scientific experiments are performed." (Ken England, "Microsoft SQL Server 2000 Performance Optimization and Tuning Handbook", 2001)

"One problem area for refactoring is databases. Most business applications are tightly coupled to the database schema that supports them. That's one reason that the database is difficult to change. Another reason is data migration. Even if you have carefully layered your system to minimize the dependencies between the database schema and the object model, changing the database schema forces you to migrate the data, which can be a long and fraught task." (Martin Fowler et al, "Refactoring: Improving the Design of Existing Code", 2002)

29 December 2009

DBMS: Database Management Systems (Just the Quotes)

Disclaimer: The following definitions are intended for tracing the long road of defining what Database Management Systems (DBMS) are about, therefore the value of the quotes is regarded from the historical perspective.

"The software that allows one or many persons to use and/or modify this data is a database management system (DBMS). A major role of the DBMS is to allow the user to deal with the data in abstract terms, rather than as the computer stores the data. In this sense, the DBMS acts as an interpreter for a (very) high-level language such as APL, ideally allowing the user to specify what must be done, with little or no attention on the user's part to the detailed algorithms or data representation used by the system. However, in the case of a DBMS, there may be even less relationship between the data as seen by the user and as stored in the computer, than between APL arrays and the representation of these arrays in memory." (Jeffrey D Ullman, "Principles of Database Systems", 1980)

"An integrated data management system, that is a true database system, is one in which data can be held nonredundantly while at the same time, a query which requires a mixture of different kinds of data (such as the query that we've been looking at) can be specified in a single command from the highest level programming interface. Any system that does not have this property really should not call itself a 'database' system; 'file management' system would be a more accurate term." (Mark L Gillenson, "Database: Step by Step", 1985)

"The DBMS is a program (or a set of programs) that allows stored data to be integrated, reduces data duplication, ensures data integrity, eliminates program dependency on file formats, and allows even  complicated objects to be easily represented and retrieved. In short, a DBMS is the program that 'processes the database'." (David M Kroenke & Kathleen Dolan , Database Processing: Fundamentals, design, implementation” 3rd Ed., 1988)

"A database management system (DBMS) consists of a collection of interrelated data and a set of programs to access that data. The collection of data, usually referred to as the database, contains information about one particular enterprise. The primary goal of a DBMS is to provide an environment that is both convenient and efficient to use in retrieving and storing database information." (Henry F. Korth & Abraham Silberschatz, "Database System Concepts" 2nd Ed., 1991)

"A database management system is a collection of interrelated files and a set of programs that allow users to access and modify these files. A major purpose of a database system is to provide users with an abstract view of the data. That is, the system hides certain details of how the data is stored and maintained. However, in order for the system to be usable, data must be retrieved efficiently. This concern has lead to the design of complex data structures for the representation of data in the database. Since many database systems users are not computer-trained, the complexity is hidden from them through several levels of abstraction in order to simplify their interaction with the system." (Henry F. Korth & Abraham Silberschatz, "Database System Concepts" 2nd Ed., 1991)

"A DBMS is a large collection of computer processes that can be entirely software or firmware that operate in a generalized or specialized computer that provides languages or mechanisms to: Define the logical database component of a specific database either directly or through reference to the IRDS metadata, that is, the data record types, data record elements, and relationships. Define the physical component of a specific database, that is, the storage structure definition, and the techniques for access strategies, data loading and updates, and database backup. Specify interrogations that access data through one or more types of languages such as host languages like COBOL or FORTRAN, and/or natural languages like query-update, report writers, or procedure-oriented. Specify system control facilities, that is, audit trails, backup and recovery, concurrent operations, and the like." (Michael M Gorman, "Database Management Systems: Understanding and Applying Database Technology", 1991)

"Database packages, more correctly called Database Management Systems or DBMSs, such as Oracle and Microsoft Access provide us with all the facilities necessary to design and use an application database. A DBMS can be viewed as having three components: (1) The user interface is the software that talks' with the user in creating and  accessing the database. […] (2) The database engine: this is the software that manages the storage and accessing of the physical data stored on disk. (3) The data dictionary, a repository of information about the application database." (Colin Ritchie, "Relational Database Principles" 2nd Ed., 2002)

"A Database Management System (DBMS) is an organised set of facilities for accessing and maintaining one or more databases. A DBMS is a shell which surrounds a database or series of databases and through which all interactions take place with the database." (Paul Beynon-Davies, "Database Systems" 3rd Ed. 2004)

"A software system that enables users to define, create, and maintain the database and also provides controlled access to this database." (Thomas M Connolly & Carolyn E Begg, "Database Solutions: A step-by-step guide to building databases", 2004)

"A product or tool that manages any kind of database, not just relational databases." (Rod Stephens, "Beginning Database Design Solutions", 2009)

DBMS: Databases (Just the Quotes)

Disclaimer: The definitions below are intended for tracing the long road of defining what databases  are about, therefore the value of the quotes should be regarded from the historical perspective.

"Historically, data base systems evolved as generalized access methods [...] As a result, most data base systems emphasize the question of how data may be stored or accessed, but they ignore the question of what the data means to the people who use it." (John F Sowa, "Conceptual Graphs for a Data Base Interface", 1976)

"A database is a set of data whose structure can be described by a connected graph, and whose instances can be accessed simultaneously." (Rob Gerritsen et al, "On Some Metrics for Databases or What is a Very Large Database?", 1977)

"Data […] that is stored more-or-less permanently in a computer we term a database." (Jeffrey D Ullman, "Principles of Database Systems", 1980)

"When we talk informally about a database, we refer to a collection of mutually related data, to the computer hardware that is used to store it, and to the programs used to manipulate it. […] A database is a collection of related data. The data storage for a database is accomplished by the use of one or more files. All files of one database are accessible from one computer or from any interconnected computer if the database is distributed over several computers." (Gio Wiederhold, "Database Design" 2nd Ed., 1983)

"[...] there is a class of methodologies - really more of an environment - for storing, accessing, and in general, managing data which can form the foundation for survival in a world which is going to become ever  more data oriented as time goes on. That class of methodologies or environment has come to be known as database." (Mark L Gillenson, "Database: Step by Step", 1985)

"A database is a self-describing collection of integrated records. […] A database is self-describing in that it contains, in addition to application data, a description of its own structure. […] A database is more than a collection of files. A database includes not only files, but also a data dictionary and a description of the relationships among the records in the files. These relationship descriptions are stored and recalled during database. […] A database is a data model of an organization." (David M Kroenke & Kathleen Dolan, , "Database Processing: Fundamentals, design, implementation" 3rd Ed., 1988)

"A database is not a technology. Rather, it is an expression of organization, clarity, and precision. It may or may not be computerized. If it is, it may exist on a microcomputer, a minicomputer, or a large mainframe. Finally, a database may or may not be centralized. However a database is implemented and operated, success is impossible without the codification of and adherence to data semantics, which are the rules for meaning, validity, and usage." (Michael M Gorman, "Database Management Systems: Understanding and Applying Database Technology", 1991)

"What is a database? It's a file - that's an electronic version of information sheets and binders. It enables you to efficiently use your information. […] Just as a binder contains information sheets, a database holds records. […] First, a database enables you to store larger amounts of information than a binder. […] Second, a database enables you to retrieve your information more efficiently, more flexibly, and more creatively. […] Third, a database can manipulate the information it contains. […] Fourth, a database enables you to print your information on Information envelopes, labels, fax cover sheets, reports, and letters." (Guy Kawasaki, "Database 101", 1991)

"When a database is computerized, it represents the automation of the knowledge component of a business, which is manifest through the business's quality operation, planning, and management With a successful database, the managers of a business can research the past, organize the present, and plan for the future."  (Michael M Gorman, "Database Management Systems: Understanding and Applying Database Technology", 1991)

"Conceptually, the database is a set of sentences expressing propositions asserted to be true of the UoD [Universe of Discourse]. Since sentences may be added to or deleted from the database, the database may undergo transitions from one state to another. However, at any particular time, the sentences populating the database must individually and collectively conform to the application specific grammar or design plan which is the conceptual schema." (Terry Halpin, "Conceptual Schema & Relational Database Design" 2nd Ed., 1995)

"In using a data base, first look at the metadata, then look at the data. [...] The old computer acronym GIGO (Garbage In, Garbage Out) applies to the use of large databases. The issue is whether the data from the database will answer the research question. In order to determine this, the investigator must have some idea about the nature of the data in the database - that is, the metadata." (Gerald van Belle, "Statistical Rules of Thumb", 2002)

"To be worthy of being called a database, a system must have two essential properties. 1. It holds data as an integrated system of records. 2. It contains self-describing information." (Colin Ritchie, "Relational Database Principles" 2nd Ed., 2002)

"A database can be considered as an organised collection of data which is meant to represent some UoD [Universe of Discourse]. […] A database is made up of two parts: an intensional part and an extensional part. The intension of a database is a set of definitions which describe the structure or organisation of a given database. The extension of a database is the total set of data in the database. The intension of a database is also referred to as its schema." (Paul Beynon-Davies, "Database Systems" 3rd Ed. 2004)

"A database in manual terms is analogous to a filing cabinet, or more accurately to a series of filing cabinets. A database is a structured repository for data. The overall purpose of such a repository is to maintain data for some set of organisational objectives. Traditionally, such objectives fall within the domain of administration. Most database systems are built to retain the data required for the running of the day-to-day activities of an organisation. […] Structure normally implies some logical division, usually hierarchical. Hence we speak of a database as being a collection of data structures. […] Each data structure in a database is in turn also a hierarchical collection of data. In manual terms, such data structures would be folders hung in a filing cabinet." (Paul Beynon-Davies, "Database Systems" 3rd Ed. 2004)

"Databases rarely begin their life empty. More often the starting point in their lifecycle is a data conversion from some previously exiting data source. And by a cruel twist of fate, it is usually a rather violent beginning. Data conversion usually takes the better half of new system implementation effort and almost never goes smoothly." (Arkady Maydanchik, "Data Quality Assessment", 2007)

"[...] data conversion is the most difficult part of any system implementation. The error rate in a freshly populated new database is often an order of magnitude above that of the old system from which the data is converted. As a major source of the data problems, data conversion must be treated with the utmost respect it deserves." (Arkady Maydanchik, "Data Quality Assessment", 2007)

"The corporate data universe consists of numerous databases linked by countless real-time and batch data feeds. The data continuously move about and change. The databases are endlessly redesigned and upgraded, as are the programs responsible for data exchange. The typical result of this dynamic is that information systems get better, while data deteriorates. This is very unfortunate since it is the data quality that determines the intrinsic value of the data to the business and consumers. Information technology serves only as a magnifier for this intrinsic value. Thus, high quality data combined with effective technology is a great asset, but poor quality data combined with effective technology is an equally great liability." (Arkady Maydanchik, "Data Quality Assessment", 2007)

"As we know, every action produces an equal and opposite reaction. Adding security reduces performance. Sharding and partitioning the database affords greater performance and distribution but creates complexity that is difficult to manage. Adding robust monitoring can generate huge volumes of log data to be stored, rotated, secured, and cleansed. Keeping the design “simple” often defers the interests of flexibility until later, where it becomes very expensive." (Eben Hewitt, "Technology Strategy Patterns: Architecture as strategy" 2nd Ed., 2019)

22 December 2009

DBMS: Databases (Unconventional Quotes)

"It is almost everywhere the case that soon after it is begotten the greater part of human wisdom is laid to rest in repositories." (Georg C Lichtenberg, Notebook K, 1789-1793) 

"An observatory, like any other laboratory of research, may concentrate its attention upon either one of two widely different objects: the accumulation of great stores of data in existing departments of knowledge, or the opening up and exploration of new fields of investigation." (George E Hale, "Ten Years' Work of a Mountain Observatory", 1915)

"As information itself becomes the largest business in the world, data banks know more about individual people than the people do themselves. The more the data banks record about each one of us, the less we exist." (Marshall McLuhan & Wilfred Watson, "From Cliché To Archetype" , 1970) 

"Cyberspace. A consensual hallucination experienced daily by billions of legitimate operators, in every nation, by children being taught mathematical concepts. [...] A graphic representation of data abstracted from banks of every computer in the human system. Unthinkable complexity. Lines of light ranged in the nonspace of the mind, clusters and constellations of data." (William Gibson, "Neuromancer", 1984)

"Information is data that has been given meaning by way of relational connection. This 'meaning' can be useful, but does not have to be. In computer parlance, a relational database makes information from the data stored within it. (Russell L Ackoff, "Towards a Systems Theory of Organization", 1985)

"I found myself answering the same questions asked frequently of me by different people. It would be so much easier if everyone could just read my database." (Tim Berners-Lee, "Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web", 1999)

"Perhaps there is supranatural: reason beyond the normal definitions of fact or data-based logic; something that only makes sense if you can see a bigger picture of reality. Maybe that is where faith fits in." (William P Young, "The Shack", 2007)

"Like the French, database programmers speak a slightly different language, and since we’re working with a database we need to learn some vocabulary." (Allen B Downey, "Think Stats: Probability and Statistics for Programmers", 2011)

"We’ve got one huge advantage - people believe what they see in databases. They’ve never learned the most important rule of cyberspace - computers don’t lie but liars can compute." (Terry Hayes, "I Am Pilgrim", 2013)

"Things get even messier when linked into networks, which can literally scatter one’s mind even at today’s rudimentary levels of connectivity. The 'transactive memory system' called Google is already rewiring the parts of our brains that used to remember facts locally; now those circuits store search protocols for remote access of a distributed database. And Google doesn’t come anywhere close to the connectivity of a real hive mind." (Peter Watts, "Echopraxia", 2014)

"Finally, most identity thefts occur when databases are not securely managed. So, my advice? Don’t ever end up in a database." (Joseph Fink & Jeffrey Cranor, "Welcome to Night Vale", 2015)

"Everything is 'smart' now. The library cataloging system is smart, classification and indexing information entered into a uniform online database. People wept and lamented the loss of the old cards, then forgot them. They pretty much forget everything they weep over and lament." (Christopher Sorrentino, "The Fugitives", 2017)

"I'm an oracle of the past. I can accurately predict up to 1 minute in the future, by thoroughly investigating the last 2 years of your life. Also, I look like an old database – flat and full of useless info." (Will Advise & Jarod Kintz, "Nothing is here…") 

"Practically all large corporations insure their data bases against loss or damage or against their inability to gain access to them. Some day, on the corporate balance sheet, there will be an entry which reads, 'Information', for in most cases, the information is more valuable than the hardware which processes it." (Grace M Hopper)

"The body of truth has a much larger data base than the conscience of lies." (Anthony T Hincks)

20 December 2009

ERP Systems: ERP Software Predictions for 2010

ERP Systems

    Again on Panorama Consulting Group’s 360 ° Blog post on Top Ten ERP Software Predictions for 2010. Sorry, given the current economical context, excepting the eventual acquisition moves and few sporadic attempts to change something in ERP vendors’ strategy I don’t feel anything big coming for the next year. Most probably ERP vendors like any other companies will be more focused on cutting down losses, reducing the headcount, introducing one or two features in their products, maybe putting on hold some of the projects they are working on unless immediate profit is expected, the shift changing from “thinking big” to “focused thinking”, keeping the flag up.

    It’s true that the crisis we are going through stresses the importance of having tangible benefits after implementing an ERP solution, cutting down the costs, breaking such projects in easily chewing pieces in the attempt of reducing the risks and obtaining results faster, though whether phased rollouts is best approach for that it’s more a philosophical question. Even more, an ERP solution is just a piece of the puzzle, you have to consider in the end the overall infrastructure, the many other systems floating around as isolated islands, the culture of people, the maturity of the business, and maybe the most important - the means by which an ERP system could be leveraged to higher level of performance, and maybe this is the best factor considered when calculating the ROI; however you can’t achieve that if you’re not having in place adequate (business intelligence) tools and mindset to derive benefit out of your ERP system! Maybe that’s the direction CIOs have to follow – striving for performance and eliminating waste, thinking lean and green, sustaining business value.

    The crisis stressed also the importance of having in place adequate risk management, though that have been on managers’ plate since several good years, and if they ignored it until now, most probably they won’t change things over night, as some experience and again mindset is required in order to get things right.

    I agree on the increased adoption of SaaS by SMBs, though I wonder if that can be achieved at large scale, how SMBs will deal with specific requirements, how they will integrate the SaaS ERP solution with the existing systems. I don’t think the ERP and SaaS market is mature enough to address the challenges concerning the merging of the two solutions, of course that don’t mean that attempts won’t be made in this direction.

    For small vendors will be difficult to enter on a market dominated of giants like Oracle and SAP, it takes lot of time and effort to come up with a feasible ERP system, maybe small vendors can better address the requirements of SMBs by offering more customized solutions, better and cheaper support than big vendors do. Everything is possible, first of all you need to have a good product, a good marketing strategy, gain customers’ trust and maintain it over time.

ERP Systems: The right ERP software

ERP Implementation

    Reading an interesting post in Panorama Consulting Group’s 360 ° Blog on Top Ten ERP Software Predictions for 2010, I was struck by the syntagm “choose the right software for their organizations”. I know from my own experience that this is a quest for Pandora’s Box, no matter how much we like to be confident and connoisseur about it, in the end is just philosophy.

    Putting myself in the position of IT Manager or whoever implied in taking decisions related to software adoption I would ask myself: “what’s the right ERP software for my company?”. In theory things are not complicated, I evaluate my requirements and the functionality provided in the various ERP solutions, the costs involved with them, the amount of time and resources I can afford, and in the end I choose whatever may seem appropriate for my business model. It sounds simple, isn’t it? Of course, this supposing that I understand my business as a whole, its infrastructure and its culture, the issues it faces, the short term needs vs. the long term needs, that the requirements are defined upfront, etc.

    In order to be sure that I’m doing the right thing, I even discuss with the sales representatives of the ERP vendors, have maybe one or two presentation sessions supplemented by a Q&A session in which my most experienced workers from each department express their concerns, ask for details, etc. Now jumping over the fact that the presentational skills and convincing tone of the sales representative might be a decisive factor in choosing a solution, I realize that the most important ERP solutions offer relatively similar functionality, most probably the differences rely in details. Now, there are a few questions that might occur to me… How much the people in the room, who maybe never worked with an ERP system, would understand what an ERP system is about? How much can they articulate their needs, identify which are the details that makes the most important impact on the business? How much the sales representative has understood my business and the overall context? In definitive he’s just trying to sell a product, how much he’s willing to dive into my requirements, analyze them and identify feasible solutions? Actually for that a few presentation sessions are not enough, it might take weeks, a whole team of resources, multiple iterations until you’ll come up with a feasible solution. Even then, once the ERP system is in place you observe that it doesn’t look exactly with what you wanted, with what you team intended, but that’s normal for IT solutions, unfortunately.

    Sometime after Go Live, most probably the employees will understand what an ERP system is about – sticking to the processes, data ownership, more time spent on data entry and data management, unified implementation starting with the strategic planning and ending with the booking of revenue, dealing with issues not considered during implementation, functionality that is not so easy to use as expected so Excel or MS Access seems to be a more flexible solution, difficulty of changing the system and processes when needed by the business, more systems need to be integrated with it, that the 360 ° overview of the business is just a myth, and so on. From my experience I observed that the users have great expectations before and during implementing an ERP system, though the reality brings them down to earth, almost no magic behind the software the whole company was talking about, just a different way of approaching things!

    The fact is that the decision of going with one ERP solution is validated only by the final outcome considered on long term, the impact it has on the business, the overall adoption and the degree to which it will fit the business needs, the flexibility of modifying it when needed. On the other side, if the ERP system doesn’t impact the business in a negative way, then the system can be considered successful, even if creepy things come to the surface from time to time. Was it the right decision? That I will not know for sure unless I’m moving to another ERP solution and I can compare the outcome with what I had in place, otherwise we just consider hypothetical situations. Actually the comparison might not be well founded because in such situation I benefit from the experience of already implementing an ERP system, I better understand the issues I was confronted with and eventually better address them in the new implementation.

    The important point I would like to highlight is that a company needs to have a certain maturity when going with an ERP solution, situation that needs to be addressed by vendors or/and organizations themselves in order to increase the chances of success, otherwise the mixture of acronyms like ROI, CIOs, SMBs, SaaS, ERP, CRM in vision philosophies are just nice stories to read before going to bed. Of course the learning by doings syntagm can be applied to ERP implementations too, though the costs are too high for such a scenario. The state of art in ERP world – the vendor wants to sell you a product and profit on customer’s expense also after doing that, often not being interested whether the product fits the purpose as long more issues lead to be more income, while on the other side the customer wants an affordable flexible solution that allows bringing the business to higher level of performance. Most probable something must be changed in how the two parties work, and this might be, at least from my point of view, the most important challenge for the next years.

05 December 2009

DBMS: Databases (An Introduction)

Database Management

So, you’ve heard the reporting guy or somebody else talking about getting some data or a report from the database, or you found out that you can’t use one of the fancy applications your company has in place just because the database is not available. Dam, that database must be something important! 
Thus you may wonder what a database is, and, with a few clicks, you find out in Wikipedia what all is about, a database is “an integrated collection of logically related records or files consolidated into a common pool that provides data for one or more multiple uses” [1]. This definition doesn’t clear up things at all, isn’t it? Terms like “integrated collection of logical related records” or “files consolidated into a common pool” even if they seem semantically right seems to be hard to digest. Without pretending to give a better definition, I would define a database as a logical and physical structure used to contain data in a consistent form. For sure, this definition won’t revolutionize the world of databases, and most probably might be other similar definitions out there, better formulated and sustained. 

Actually also this definition might need some clarifications, I’m talking about a logical structure because there is a logic on how the data are stored, physical structure because the data are stored on a physical device (e.g. computer memory, hard disk or any other type of storage device). I used contained and not stored, because containment imply certain control over the structure in comparison with the simple storage of data. Ok, this being said, somebody would question: hey, also a delimited text file can store data in a consistent form, what’s the difference between a database and a delimited text file? At a first view, I would say the difference resides in context, in the fact that a database “exists” in the context of a database management system (DBMS), a (software) system that provides a mechanism for storage, retrieval, modification and management of data. There are DBMS that store their data as delimited flat files, known also as flat file database, however such simple structures hardly cope with the requirements of modern DBMS, that need to provide a scalable, reliable and secure storage system. Why is a delimited flat file a simple structure?! This affirmation needs indeed some further explanation…

A delimited text file is a text file in which the chunks of data are delimited by special characters and stored in a tabular format with rows and columns, much like in Excel if you want, though Excel offers a better visual structure that facilitates data visualization and manipulation. The delimited text file supposes the existence of at least two types of delimiters, one to delimit the columns, usually a colon, semicolon or pipe, and one to delimit the rows, usually a combination of carriage return and line feed. Unfortunately such a structure imposes one important problem: what happens with the chunks of text containing a character or a set of characters used already as delimiters? Therefore the text is typically encompassed between two quotes, and in case a double quote exists already in a text, then the contained double quote it’s duplicated facilitating thus text’s interpretation. There could be even more problems, given the two systems of writing a number, comma versus dot, could be tricky to use a comma delimited file, semicolon or tab would be a better delimiter. An application that reads such a file would need thus to understand the delimiters used, and even more the format in which numeric and date values are stored, of whether the first row contains the column names, being required to store such metadata in a second file.

In contrast, a database stores its data in multiple tables designed around logical entities (e.g. Vendors, Customers, etc.) with attributes (e.g. Name, Address, City, Country, etc) translated into columns and the actual values forming a record in a table, resulting thus again a row/column structure for each table. In a flat file database, each table is stored in one file, while in a typical database the tables are stored together in one file, this implying the existence of another delimiter for tables themselves, not to mention that each table has its own structure with different column names, though the columns names, their data types and several types of metadata on columns and tables are stored in auxiliary tables.

And that’s not all, normally a simple search functionality might be enough in order to find a value in a text file, though for databases that’s something complicated to achieve because first of all a “search” is performed against one or more tables, and even against two or more databases, thus first of all, for efficiency, is needed a mechanism that reflects where a table begins, secondly it makes sense to have an explicit and/or implicit unique identifier (UID) for a record, that would allow identifying a record in a table in a unique manner. Such unique identifiers might be a combination of one or more columns, the best candidates from a performance standpoint being (positive) integer values, though text and as types of values could qualify as such too. A table could have more than one such unique identifiers, for example an integer running number (sequence) used especially for fast retrieval of records, and another attribute (e.g. Material Number) or a combination of several attributes (e.g. Material Number, Vendor, Serial Number) from a entity standpoint. One of unique identifiers, typically single attributes, is a candidate for the primary key, used by other tables to reference a record from the respective table. Why is needed such a mechanism when in theory one table can store all the data, much like in Excel fashion?! It’s mainly a question of design, storage and maintenance efficiency, being more feasible of storing redundant data in a table of their own and referencing the corresponding record from the main table, having thus a relationship between the two tables, the identifier that references the primary key of the table thus formed being called the foreign key. Even if not always feasible, the foreign key could be based on multiple attributes, fact that increases relation’s complexity, because when the data are retrieve, the primary/foreign key columns are used to merge the results from the two tables into a common result set. Even if logical, the mechanism can be become complicated, the relationship needing to be stored together with the actual data in a structure of its own, even more a reference constrain can be enforced in order to assure that references are not invalidated by the deletion of a record.

Tables can become really big, ranging from a few thousand of records to millions and even milliards records, making data retrieval quite complex. Therefore it makes sense to have in place a mechanism that allows faster data retrieval at least for the most used attributes in searching, and that can be achieved with the help of indexes created based on one or more columns, they are stored in a dedicated structure and include a reference to the actual record. Indexes allow a DBMS to perform a search based on the columns used in an index on an optimized index structure rather than performing the search on the table itself, and this, from performance point of view, can make quite a difference for big tables.

Data are retrieved from a database with the help of a query, a well structured statement based on SQL (Structured Language Query) standard, specifying typically the columns to be retrieved, together with the tables they belong and the primary/foreign columns used to build the relation (join) between tables, and the columns on which the search will be performed on. A query can be more complex than that, it can include other statement modifiers, subqueries, views and functions. A subquery is a query nested inside of another query, referred also as nested query, the nesting can go even several levels, making queries hard to “read” and maintain. Fortunately a database allows encapsulating a query in database objects like views, functions and stored procedures, enabling a better management of queries, facilitating also code reuse. A database view can be seen as a virtual table based on a single statement query, storing no data on its own and being used to limit the number of columns or records retrieved. In contrast with views, user-defined functions and stored procedures can be based on multi-statement code, the main difference between the two residing in the fact that functions can be used in queries, while stored procedures are executed individually, this implying also some differences on the manner they are executed; there are many more other differences between these two types of database objects, mainly DBMS vendor related.

 Another important topic in the world of databases is concurrency, handling “simultaneous” requests coming from multiple “users” and this involves access of the same piece of data by multiple users. A DBMS has to take care in the background of such scenarios, avoid when possible and address locks, restrict users the access to data they are not entitled to see or modify. The access to data and database objects is based on accounts and roles, and even if a user has accidental access to the server on which the database stored, above the binary encoding of data, a DBMS might even encrypt the data and database objects.

As can be seen, there is a whole arsenal of database objects associated with a database, typically stored independently of the actual database that holds the data, in tables spanning over one or more databases that belong to the internal kitchen of the DBMS. This arsenal makes the difference between a simple delimited text file and a database, DBMS vendors building in their solutions even more concepts mechanism in order to address issues like reliability, accessibility, scalability, performance, security, heterogeneity and so on.

Now, after this being said, the Wikipedia definition for a database seems to be relatively accurate, and that up to a point because relatively recently appeared the concept of in-memory database (IMDB) that primarily relies on the main memory for data storage. This doesn’t make the respective definition less valuable, though frankly I prefer my own definition, hopefully I haven’t left anything important out of it.


References:
[1] Wikipedia. 2009. Database. [Online] Available from: http://en.wikipedia.org/wiki/Database (Accessed: 5 December 2009)

07 November 2009

SQL Server Troubleshooting: AdventureWorks requires FILESTREAM enabled II (one year later)

   I tried a few days ago to reinstall the AdventureWorks examples, I knew about the FILESTREAM issue mentioned in a previous post, though this time after I enabled FILESTREAM via Management Studio I got another error message:
     "AdeventureWorks2008 OLTP requires FILESTEAM which is not enabled for the instance you selected. FILESTREAM must be enabled BOTH via the SQL Server Configuration Manager and via Management Studio for this instance."

    I searched for the issues without any hits, then on how to install the sample and found a post on Stuart Cox’s Tech Punch blog.     I tried to set the FILESTREAM values in SQL Server Configuration Manager and didn’t worked, getting the following error message:    

"There was an unknown error applying the FILESTREAM settings. Check the parameters are valid"    

The search returned several hits but not for Vista. In the end following without success the script-based solution proposed on MSDN, I tried my luck in the register and using regedit tool I found the entry for Filestream in Software/Microsoft SQL Server/MSSQL10.MSSQLSERVER/MSSQLServer/Filestream, and changed Enabled Level to 2. Now the values in SQL Server Configuration Manager were updated and after I installed the AdeventureWorks example everything looks to be ok.

04 November 2009

Database Architecture: Views II

    Apart from the various versions of SQL ANSI standards, each RDBMS vendor takes the liberty to implement its own solutions in accordance with the problems it’s trying to address – performance, parameterization, partitioning, distribution or whatever other problem might arise. There are several types of views evolved from such attempts, several other names being mentioned in database literature:

Union view – view based on the union of one or more database objects.

Join view - view based on the join of one or more database objects.

Hierarchical view – view based on hierarchical relationships processing techniques.

Base view – view consisting of all of the rows from a table and a subset of columns.

Derived view – view created from a base view by selecting only a subset of rows.

Projection view – in NetWeaver, view used to hide the fields of a table, containing one table and all its rows [1].

In-line view – it’s a “pseudo view” and refers to (in-line) self-contained subqueries used inside of other queries, without actually defining a view. The query can be run independently from the query in which is contained.

Updatable view – views that allow data changes on the tables they are based on.

Maintenance view – in NetWeaver, view that allows viewing and updating data from a set of tables that form a logical unit [1].

Object view - virtual object table based on user-defined types that allow conversion of relational tables into object-relational tables, facilitating thus the use of object-oriented programming techniques without converting existing tables, and the consumption of relational data by object oriented applications [3].

Federative view – view in a federated database system, a meta-database which integrates multiple autonomous database systems.

Consolidated view – view which provides data in a form more frequently required by users [4].

Indexed view or materialized view – is a virtual table that allows storing the output of a query, being ideal for aggregating data across multiple rows, the values being updated and materialized, being queried without continuous recalculation.

Regular view – view that stores no data, as opposed to an indexed view.

Parameterized view – is a view which, much like a function or stored procedure, accepts a set of parameters that dictates the output. In SQL Server parameterized views are implemented with the help of inline table-valued functions.

Partitioned view – a view formed by the union of more tables with the same structure and found within one or multiple autonomous SQL Server instances.

Distributed-partitioned views (DPV) – are portioned views working across multiple instances

System view – refers to the views used by RDBMS vendors to expose metadata about their systems. For example MS SQL Server exposes several types of system views:
catalog views – return Database Engine related metadata, according to BOL they are the “most efficient way to obtain, transform, and present customized forms of this information” [2].
information schema views - "provide an internal, system table-independent view of the SQL Server metadata", which “enable applications to work correctly although significant changes have been made to the underlying system tables” [2].
compatibility views – views implementing backwards compatibility functionality, implemented previously within tables.
• replication views – views based on replication system tables used for implementing SQL Server data replication.
dynamic management views - "return server state information that can be used to monitor the health of a server instance, diagnose problems, and tune performance" [2].

System internal views - "views that make available low-level storage metadata for the SQL Server Database Engine" [2].

Help view – implemented in NetWeaver, a help view it’s not a typical database view, being based on a primary table and a search table used in NetWeaver’s online help system.

Conceptual view – concept-based view in a conceptual data model which maps the concepts and relationship between them.


References:
[1] SAP. 2009. SAP NetWeaver 7.0 Views. [Online] Available from: http://help.sap.com/saphelp_nw70/helpdata/en/cf/21ecc5446011d189700000e8322d00/frameset.htm (Accessed: 2 November 2009)
[2] MSDN. 2009. SQL Server Books Online. [Online] Available from: http://msdn.microsoft.com/en-us/library/ms130214.aspx (Accessed: 4 November 2009)
[3] Oracle. 2000. Chapter 14: Object Views. Oracle 8i Concepts. [Online] Available from: http://www.cis.unisa.edu.au/LearningResources/oracle/server.815/a67781/c13obvw.htm (Accessed: 4 November 2009)
[4] Sybase. 2009. Consolidated views. [Online] Available from: http://infocenter.sybase.com/help/topic/com.sybase.help.sqlanywhere.11.0.1/dbreference_en11/views-s-4117677.html (Accessed: 2 November 2009)

31 October 2009

Database Architecture: Views I

    Typically the RDBMS vendors offer on top of their table-based storage 3 types of database objects used for data access: stored procedures, functions and views, each with their pluses and minuses. The views, same as the functions and stored procedures, can be seen as abstraction layers that stand between the physical database and users, allowing functionality reuse, logical structuring and better code maintenance . As Tony Regerson mentions, logically (and it only is logically) a view can be visualized as a virtual table, but from an optimization perspective a view is not a virtual table – “this is extremely important to remember when designing queries that you want to scale and perform well in a real environment”(T. Regerson, 2007). The view as virtual table is maybe the most realistic definition for a view, without reusing the term of view redundantly, as in “a view offers a view within a data set based on one or more tables”.


    Over the years I saw several pros and cons against using views, even several fights on whether to use views or not in database-based development. I’m not an SQL guru, though I often dealt with various applications making heavily use of database side programming, and I’m using the old fashioned views on a daily basis because of one of the below benefits.


    Structure the code in logical table-like query-able units that can cover many possibilities of querying the same data and which can be reused in other database objects, including views (nesting views). It can be thus created an abstract model on top of the existing table model, with multiple levels of details and perspectives.


    Hide implementation complexity from users, being easier to use than the base tables. There are situations in which users want to query the data by themselves, instead of using an existing report, a simple view can reduce the complexity of a query by hiding the implementation details and it simplifies data manipulation by focusing on a restrained scope specific to the problem is supposed to model. Making use of a often met quip, the users don’t have to “reinvent the wheel”, a view reducing concomitantly the effort and potential mistakes which can occur in queries’ construction; as somebody was remarking sarcastically, “not everybody is a Bill Gates”, so you can’t expect from users to be good query developers and know in and outs of a database model.


    When accessing data over MS Access or other similar tools, views might allow reducing the network traffic and increase query performance compared with the queries built on top of linked tables, in theory the fewer linked tables, the less network traffic and better the performance as the heavy processing is moved from the client to the server. Based on their use, the reduction of network traffic and the increased in performance can be relative.


    Stored as database objects, views keep code duplication and maintenance at minimum, offering better code traceability and testability. Imagine creating a set of queries for each possible scenario the users might use to query a set of tables, the simplest such scenario being the one in which the only difference resides in the attributes the users need; it’s must easier to create a view with all the attributes, the users selecting then only the attributes they need. In some situations might be necessary to create several views for the same set of tables, when are targeted different levels of details or views. There is another important aspect, the more queries you use on the same topic, the more maintenance work is needed when changes occurred in their logic; not to mention that often queries are run directly by users, the synchronization and testing of the various versions of the same query can become a nightmare. Even more, working with a named entity facilitates the communication between users and developers, allowing to easier identifying the piece of code in discussion.


    Views allow changing the order of attributes, adding formatting, calculations and scalar functions on top of the existing attributes, using table-valued functions and views.


    Views allow enforcing security at object, vertical (column) and horizontal (rows) level, partitioning thus data access based on the required security levels.



    Another “unorthodox” feature of views is that they allow developers to directly insert, update or delete data from tables they are based on, however this can be done under certain circumstances (e.g. such views need to reference directly and individually table’s columns, thus no aggregations and transformations can be performed).


    There are several cases against using views, the most important of them stresses the lower views’ performance when compared with the one of queries or stored procedures, however the benefit of caching query-plans can be relative, and the balancing between the cost of flexibility and maintainability vs. the cost of performance can favour the views. I find somehow paradoxical that in applications programming everybody goes for OOP weighting reusability more than performance, often creating allegorical monsters deviating from their purpose, and when it comes to databases, everyone is against views though the performance decrease is not so big, while the structures built derive intrinsically from data. On the other side it’s true that databases have a higher workload and are more sensitive to it, but I think people should learn to make use of tools and knowledge more wisely than hiding behind philosophies.


    Tony Regerson considers that views are more difficult to understand when nested, this making tuning and debugging more difficult. I can partially agree with him; from my point of view the code becomes easier to read, understand, maintain and test when logic is split in smaller fragments, and even if it can be more difficult to identify the source of a field from the final query, for this needing to traverse multiple views, the benefits of views can diminish the impact of this downside. It remains the problem of tuning, though actually this reflects the need for smarter database engines, while the other understanding related issues reflect the need for development tools that can identify the source tables and fields from nested objects. Documentation like ERDs and other types of mappings can improve the understanding of nested views.


    Several applications I worked with, for example Oracle APPS, built the data access on top of views, each form being based on one or more views, which encapsulate the basic layer of business logic, further processing and customization being done in the form itself. This architecture facilitates to some degree developers’ work. The same views can be used also for the various reports, however such views often have the inconvenience they resume only on the attributes needed by a form, thus the adding of other attributes resume at extending the views, the creation of similar views that include the needing attributes, or at the recombination of views. The extension of standard views is not recommended because there is the risk the vendor might decline the responsibility for the errors occurring in a view, another important risk residing the fact that the views might be replaced with the installation of patches or during version upgrade. The combination of views might lead to the unnecessary reuse of the same table more than once, resulting in performance decrease which can be substantial. The remaining approach, often the best, is to create another view replicating the needed logic, though the developed views need to be synchronized with the standard views in case changes occur.


    A similar scenario occurs when users recombine views in order to obtain the attributes needed in their analysis, again is recommended to provide a view which offers the same needed behaviour. The problem is that it’s difficult to identify such issues, because usually the developer has low visibility on how the user is using the tables. This could be theoretically avoided by better communication between users and developers, by moving the code, when possible, from personal databases to production databases.


    Doing changes through (updatable) views diminishes the chances of trapping or raising customized errors from code, letting the application or DBMS to handle the issue and all the implications derived from it. This eventual issue could be partially solved by view triggers, however this functionality is provided only by a few vendors. In web or desktop applications, such direct data access increases also the risk for SQL injection attacks, stored procedure based access being more recommended for such architectures.


    There are several other aspects which need to be considered with views, facts resulting mainly from the various types of views; I’ll try to summarize them in a second post.


References:

T. Regerson. (2007). Views – they offer no optimisation benefits they are simply inline macros use sparingly. [Online] Available from:
http://sqlblogcasts.com/blogs/tonyrogerson/archive/2008/01/03/views-they-offer-no-optimisation-benefits-they-are-simply-inline-macros-use-sparingly.aspx (Accessed: 12 January 2007)

22 September 2009

SQL Reloaded: Incremental Update Technique(s)

In Cognos Enterprise Business Intelligence Developer Guides mentioned in the previous post is repeated over and over again in the WHERE clause a set of constraints like in the below example(e.g. Accounts Payable Analysis for Oracle e-Business Suite, p. 21), involving more or less than 3 tables:
 (A.CREATION_DATE > TO_DATE(:FROM_DATETIME,'YYYYMMDDHH24MISS') AND A.CREATION_DATE <= TO_DATE(':TO_DATETIME','YYYYMMDDHH24MISS')) OR (B.CREATION_DATE > TO_DATE(:FROM_DATETIME,'YYYYMMDDHH24MISS') AND B.CREATION_DATE <= TO_DATE(':TO_DATETIME','YYYYMMDDHH24MISS')) OR (C.CREATION_DATE > TO_DATE(:FROM_DATETIME,'YYYYMMDDHH24MISS') AND C.CREATION_DATE <= TO_DATE(':TO_DATETIME','YYYYMMDDHH24MISS'))     For 3 tables it's not a lot of work, though the set of constraints becomes more complicated with the increase in the number of tables. It can be simplified by using the LEAST & GREATEST Oracle functions, resulting the equivalent writting: (LEAST(A.CREATION_DATE, B.CREATION_DATE, C.CREATION_DATE) > TO_DATE(:FROM_DATETIME,'YYYYMMDDHH24MISS') AND GREATEST(A.CREATION_DATE, B.CREATION_DATE, C.CREATION_DATE) <= TO_DATE(':TO_DATETIME','YYYYMMDDHH24MISS'))     The second formulation is simpler and should have the same performance as the first formulation, however it has a small impediment – the LEFT JOINs and LEAST/GREATEST behaviour when working with NULL values. If one of the parameters provided to the two functions is NULL then the result will be NULL, thing which can mess up a query. That’s not difficult to avoid by using for example the NVL function. Supposing that there is a LEFT JOIN between A and two other tables B & C, the above constraints can be rewritten as: (LEAST(A.CREATION_DATE, NVL(B.CREATION_DATE, A.CREATION_DATE), NVL(C.CREATION_DATE, A.CREATION_DATE)) > TO_DATE(:FROM_DATETIME,'YYYYMMDDHH24MISS') AND GREATEST(A.CREATION_DATE, NVL(B.CREATION_DATE, A.CREATION_DATE), NVL(C.CREATION_DATE, A.CREATION_DATE)) <= TO_DATE(':TO_DATETIME','YYYYMMDDHH24MISS'))     One might arrive to write such constraints for example when attempting to do an incremental update in a data warehouse based on the Creation Date from multiple source tables, loading the records having a Creation Date greater than the date of the previous update (of course the records existing already in data warehouse needs to be deleted or updated). The Creation Date works only for transactional data which don’t change after their creation, for the other types of cases it should be considered the Last Update Date (Date of the Last Update for a record). Also this approach has its flows, working well when records are not deleted from the table; this issue could be fixed for example by creating a trigger on deletion and save the ID of the record deleted in a third party table or actually delete the record from the third party database when the deletion happens in the source table.

ERP Systems: Learning about Oracle APPS internals II

    In a previous post I mentioned the FND_TABLES & FND _VIEWS as sources that can be used to find more about Oracle APPS internals. The two tables are pretty useful though they don't contain all the tables/views created in Oracle unless they were registered in them. So some of the objects developed for example during the various customization or integration phases might not be found in them, therefore it makes sense to check also ALL_ALL_TABLES & ALL_VIEWS tables, providing all the tables available for a certain schema.

SELECT *
FROM ALL_ALL_TABLES
WHERE OWNER = :OWNER

SELECT *
FROM ALL_VIEWS
WHERE OWNER = :OWNER

    Getting the differences between the two sources it’s pretty easy:

SELECT A.*
FROM ALL_ALL_TABLES A
    LEFT JOIN APPS.FND_TABLES B
        ON A.TABLE_NAME = B.TABLE_NAME
WHERE A.OWNER = :OWNER
AND B.TABLE_NAME IS NULL

    If you are interested on Oracle internals you can find more goodies by querying the Oracle dictionary table:

SELECT *
FROM DICT
WHERE TABLE_NAME LIKE '%VIEW%'

    One of such goodies is the view that stores the list of Materialized views:

SELECT *
FROM DBA_MVIEWS

    PS: I know that these are pretty basic stuff for APPS DBAs though not all of us discovered them from beginning.

13 September 2009

SQL Reloaded: Randomized Data Selection

 A few years back I was asked to provide a random set of data from a data collection. Dealing only with a small number of records and as they were in Excel, it was much easier to use the Excel’s RAND function. Even if SQL Server provides a RAND function too, it returns the same value within a data set when the same parameter is provided. In order to avoid this behavior it would be useful if could be provided a distinct parameter value for each row, for example using the data set's integer unique identifier or eventually create one, thus if ID is such a column the query would look something like that: 

SELECT TOP 10 RAND(ID) PERC, ID, N1, N2 
FROM 
ORDER BY RAND(ID)     

The fact that RAND will return always the same value for the same parameter value might be a problem because the above query will always return the same records. In exchange you can use a multiplier for the parameter provided to RAND function - for example in a first attempt you can use RAND(id), in a second RAND(id*2), in a third RAND(id*3), etc.     

This approach offers unfortunately a relatively "weak" randomization which might be not sufficient for repetitive selections, therefore under SQL Server 2005+ could be used instead a CLR-based UDF (user defined function) and use the random functions available in VB, C# or any other CLR programming language, or even the .Net Random class designed to produce a sequence of numbers that meet certain statistical requirements for randomness.

12 September 2009

SQL Reloaded: Number of Records I (via COUNT)

Getting the number of records from a table is straightforward, a simple SELECT doing the job:

-- general syntax 
SELECT count(*) NoRecords
FROM <schema>.<table>

-- example 
SELECT count(*) NoRecords
FROM [Person].[Address]

Some prefer using a count(1) instead of a count(*), though it doesn't make a difference considering how SQL Server handles the two scripts in the background.

-- general syntax 
SELECT count(1) NoRecords
FROM <schema>.<table>

-- example 
SELECT count(1) NoRecords
FROM [Person].[Address]

Same can be achieved by using a SUM:

-- general syntax 
SELECT SUM(1) NoRecords
FROM <schema>.<table>

-- example 
SELECT SUM(1) NoRecords
FROM [Person].[Address]

However the SUM is typically used when counting only certain records within the same line (the attributes must be know however beforehand):

-- counting the occurrences for a certain values via SUM
SELECT SUM(CASE WHEN Class IS NULL THEN 1 ELSE 0 END) [Nulls]
, SUM(CASE WHEN Class = 'H' THEN 1 ELSE 0 END) [H]
, SUM(CASE WHEN Class = 'L' THEN 1 ELSE 0 END) [L]
, SUM(CASE WHEN Class = 'M' THEN 1 ELSE 0 END) [M]
, count(*) NoRecords
FROM [Production].[Product]

The sane output can be used via a GROUP BY based on the grouping attributes:

-- counting the occurrences for a certain attribute via COUNT
SELECT Class
, count(*) NoRecords
FROM [Production].[Product]
GROUP BY Class

-- counting the occurrences for a group of attributes via COUNT
SELECT Class
, Color
, count(*) NoRecords
FROM [Production].[Product]
GROUP BY Class
, Color

If you want to implement the functionality in a function or stored procedure and provide the name of the table as parameter, then things get a little more complicated as the query needs to be built dynamically on runtime. In stored procedures a query built dynamically can be usually executed using EXEC or EXECUTE stored procedures, though none of them will do as we need to reuse the actual value returned by the query. This can be achieved with the help of a third stored procedure called sp_executesql, and the work can be done relatively easily with the below stored procedure:

-- dropping the stored procedure
--DROP PROCEDURE IF EXISTS dbo.pGetNumberRecords

-- creating the stored procedure
CREATE PROCEDURE dbo.pGetNumberRecords(
  @Table varchar(100) 
, @NumberRecords int OUTPUT 
)
AS 
DECLARE @SglQuery nvarchar(250) 

IF IsNull(@table,'')<>'' 
BEGIN 
  SET @SglQuery = N'SELECT @NumberRecords = count(*) FROM ' + @table 

  --get the number of records 
  EXEC sp_executesql @Query = @SglQuery
                 , @params = N'@NumberRecords INT OUTPUT' 
                 , @NumberRecords = @NumberRecords OUTPUT 
END 

-- testing the stored procedure 
DECLARE @NumberRecords int 
EXEC dbo.pGetNumberRecords '[Production].[Product]', @NumberRecords OUTPUT 
SELECT @NumberRecords 

The overhead of using a stored procedure for getting the records count is bearable though the solution is not necessarily elegant, a nicer approach being the use of functions for the same purpose:

-- dropping the function
--DROP FUNCTION IF EXISTS dbo.GetNumberRecords

-- creating the test function
CREATE FUNCTION dbo.GetNumberRecords(
@table as nvarchar(100)) 
RETURNS int 
AS 
BEGIN 
DECLARE @sql varchar(200) 
DECLARE @NumberRecords int

IF IsNull(@table,'')<>'' 
BEGIN 
  SET @sql = N'SELECT @NumberRecords = count(*) FROM ' + @table 

  --get the number of records 
  EXEC sp_executesql @Query = @sql
                 , @params = N'@NumberRecords INT OUTPUT' 
                 , @NumberRecords = @NumberRecords OUTPUT 
END 

RETURN @NumberRecords
END

-- testing the function
SELECT dbo.GetNumberRecords('[AdventureWorks2014].[Production].[Product]')

Unfortunately UDFs have the restriction that they can’t execute stored procedures, and thus the above technique can’t be used. Even if the function is created, when attempting to run it, the following error occurs:
Msg 557, Level 16, State 2, Line 185
Only functions and some extended stored procedures can be executed from within a function.

There is actually an alternative for executing pass-through queries in functions by using OPENQUERY, OPENROWSET or OPENDATASOURCE functions, allowing thus to execute DML statements against a given linked server, as is the case of first two functions, respectively using ad-hoc connection information as is the case of the third function. For exemplification I chose OPENQUERY function, this being the code created to return the number of records from a specified table (you'll need to replace with your server, see below):

--dropping the test function 
-- DROP FUNCTION IF EXISTS dbo.GetNumberRecordsTest

-- creating the test function
CREATE FUNCTION dbo.GetNumberRecordsTest() 
RETURNS int 
AS 
BEGIN 
  RETURN ( 
     SELECT NoRecords 
  FROM OPENQUERY( <server_name>, 'SET NOCOUNT ON; SELECT count(*) NoRecords FROM [AdventureWorks2014].[Production].[Product]') AS O) 
END 

-- retrieving the number of records
SELECT dbo.GetNumberRecordsTest()

 For this piece of code to work is first necessary to refer the server as a linked server:

EXEC sp_serveroption <server_name>, 'Data Access', true

Withought doing this first you’ll get the error:
Msg 7411, Level 16, State 1, Procedure GetNumberRecords, Line 4 Server '>server_name<' is not configured for DATA ACCESS.

You can use sp_helpserver to get server's name:
-- getting server's name
EXEC sp_helpserver 

The above function doesn’t help much. It would be helpful if we could provide the table’s name as parameter and store the query string in a variable, something like:

-- creating the test function
CREATE FUNCTION dbo.GetNumberRecords(
@TableName as nvarchar(100)) 
RETURNS int 
AS 
BEGIN 
  DECLARE @sql varchar(200) 
  SET @sql = 'SET NOCOUNT ON; SELECT count(*) NoRecords FROM ' + @TableName

  RETURN ( 
     SELECT NoRecords 
  FROM OPENQUERY( , @sql) AS O) 
END 

-- retrieving the number of records
SELECT dbo.GetNumberRecords()

Unfortunately OPENQUERY function same as OPENROWSET and OPENDATASOURCE expect a constant parameter as query and therefore the error (as per SQL Server 2005):

Msg 102, Level 15, State 1, Procedure GetNumberRecords, Line 12 Incorrect syntax near '@sql'. Expecting STRING or TEXT_LEX.

I wonder why the SQL Server team hasn’t allowed the use of variables as parameters for OPENQUERY functions, if not in the previous SQL Server versions at least in SQL Server 2008! Anyway, another impediment would be raised by the deterministic nature of functions.
Related Posts Plugin for WordPress, Blogger...

About Me

My photo
IT Professional with more than 24 years experience in IT in the area of full life-cycle of Web/Desktop/Database Applications Development, Software Engineering, Consultancy, Data Management, Data Quality, Data Migrations, Reporting, ERP implementations & support, Team/Project/IT Management, etc.