I read a fascinating paper here by folks involved with the original C-Store project that was the genesis of Vertica (thanks to this blog for providing the link to this paper). I hadn’t realized that Vertica essentially sprang right out of academia, and it’s really interesting to see how the original research concepts and principles translated to production code. Most interesting is that many of the principles held up!
One aspect of Vertica that I read mixed reviews about was the notion of a Projection. ParAccel says that projections are a crutch that Vertica uses to improve performance, Vertica says that projections are a misunderstood core feature. This confusion about projections is so rampant that Vertica wrote a three-part blog post to explain what projections are and why they’re good.
The summary on projections is as follows. Think about a regular relational table. In most relational databases, a plain table will be stored as a heap, which is an unordered pile of rows. Fast to insert into (just throw the inserted row on the pile), generally slow to query from (look at every row to find what you need). To access the data quicker in this standard scenario, you could:
1) Store the data in some sort of ordered manner instead of in a heap. The ordering will give most queries a shot at not having to scan every row.
2) Build indexes on top of the data that order the data in various ways that are useful to common queries.
#1 above is what is called a Clustered Index in SQL Server or an Index Organized Table in Oracle, and it is a good technique for improving the performance of most queries. There’s a penalty during data insertion because the trees need to be split and rebalanced, but the payoff usually outweighs the cost.
#2 above is the standard approach of ordering the data and storing pointers into the heap for data retrieval. There are variations of this, such as including additional columns in the index, that allow index access to fully satisfy the query without needing to follow the pointers to the actual data rows.
With this background, we’re close to being able to talk about Vertica’s projections. The final piece is to talk about Materialized Views. MV’s are queries (views) whose results are persisted in the database for future access. They are very powerful in relational databases because they can act as aggregate tables, denormalizations, and other precalculations without the original query directly accessing them – the optimizer simply uses them if they contain the data has been requested.
So we’ve got three notions: data stored in a heap with indexes for fast access, a Clustered Index or IOT where the data is stored in an optimized way so that indexes may not be needed, and materialized views where data is stored in a heap/CI/IOT and then stored again in a form that supports querying. How do Vertica’s projections fit in here?
Projections are the actual data storage mechanisms in Vertica. They’re not indexes on top of data, they are the data. They’re like clustered indexes in that the data is stored in sorted order to enable excellent compression and fast retrieval. They’re like materialized views in that the collection of columns they contain is arbitrarily defined by the admin. They’re like covering indexes in that excellent projections contain all the columns a query needs.
At one extreme, each column of data is stored just once. At the other extreme, various columns are stored multiple times due to their membership in different projections. Let’s look at an example. Below is a transaction table consisting of three projections:
The projections are as follows:
- Projection 1 holds the data for columns OrderNbr, ItemId, and Qty
- Projection 2 holds the data for columns ItemID, Qty, and UnitPrice
- Projection 3 holds the data for columns OrderNbr, Item Id, Qty, and Discount
Together, all the columns in the table are represented in at least one projection. The projections themselves are useful subsets of the overall set of columns in the table, such that the query optimizer could choose which projection contains most or all of the columns that need to be processed by the query. Each projection stores the actual data of the columns it contains, so all projections take up space.
Vertica diverges from theory in the sense that it defines a fourth (or zeroeth?) projection, the superprojection. The superprojection contains every column in the table and ensures that there’s a baseline storage and retrieval mechanism for Vertica to use without the data modeler having to explicitly define it. All the other projections are in addition to the super projection.
And now, finally, we can circle back to the mixed reviews on Vertica’s projections.
Vertica’s clarification of all the FUD surrounding projections is this:
“The idea that Vertica needs a special projection for every query in order to get good performance is just wrong.” (from the link above)
This statement is in response to ParAccel and others telling prospects that Vertica is compensating for slow queries by building special ‘materialized views’ that pre-store only the data that is needed for queries. Instead, the argument is that a few well-chosen projections will result in very good performance for the majority of queries, and the standard superprojection will handle everything else. This is no different than indexes on SQL Server handling the majority of queries and the clustered index handling the rest.
There is an argument to be made that Vertica’s redundant storage of column data in the projections makes them like materialized views, but the counter to this statement is that the data compression is so significant that you can afford to carry around a few copies of a column.
Ultimately, it feels like Vertica’s core storage mechanism, the projection, is being used against it pretty well by competitors. I myself like the whole projection idea, although I think it would be ideal if the superprojection could be replaced by a complete set of projections that together represented each column in the table once and only once. This would allow you to provide an optimal ‘starter’ set of projections that would handle most queries efficiently without redundant column storage or need for additional projections in many cases.