Querying Several Columns
Columnar data warehouses deliver fast search results because of the way the data is stored. However, this storage optimization is lost when you issue a “SELECT * FROM foo;” style query against a table with many columns. The database engine has to read each column individually and then “glue” the columns back together to reconstruct each row. The performance impact becomes noticeable once you exceed roughly a hundred columns, and the number of rows you pull amplifies the effect.
Break your workload down and be as selective as possible about the columns you pull for processing. If you select only a handful of columns and dimensions, you can readily run computations across billions of rows in seconds.
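A minimal sketch of the difference, using a hypothetical wide table named events (the table and column names are illustrative, not from the original):

```sql
-- Forces the engine to read every column and reassemble full rows;
-- on a 200-column table this defeats the columnar storage layout:
SELECT * FROM events;

-- Reads only the two columns the aggregation actually needs:
SELECT event_date, COUNT(*) AS event_count
FROM events
GROUP BY event_date;
```

The second query touches a small fraction of the stored data, which is where columnar engines do their best work.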
Queries Without Sort Keys
Sort keys are like indexes, but for columnar databases. Say you have a table of books with two columns: one contains each book’s title and the other each book’s author. If you sort the table by title, you can search your books by title with sub-second performance. But if you instead issue a “SELECT COUNT(*) FROM books WHERE author LIKE ‘Bob%’;” style query, you trigger a full table scan because the database is unaware of any ordering in the author column. The database has to scan every value in the table to find the rows that match your query.
For columnar databases that allow only one sort key, a good approach is to sort on the most commonly queried column. For example, sorting your data on a date column lets you restrict a query to a particular time range and dramatically reduce the amount of data scanned.
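As a sketch, here is how that might look using Amazon Redshift’s SORTKEY syntax (other columnar warehouses declare sort keys differently, and the published_date column is an assumption added for illustration):

```sql
-- Redshift-style DDL: physically sort the table on the column
-- that queries filter on most often.
CREATE TABLE books (
    title          VARCHAR(256),
    author         VARCHAR(256),
    published_date DATE
)
SORTKEY (published_date);

-- A range filter on the sort key lets the engine skip storage blocks
-- outside the requested window instead of scanning the whole table:
SELECT COUNT(*)
FROM books
WHERE published_date BETWEEN '2020-01-01' AND '2020-12-31';
```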
Nested Subqueries and Calculated Columns
Nested subqueries and calculated columns can have a massive impact on query performance. Neither practice is inherently bad, and calculated columns are prevalent in SQL and reports. However, both affect how memory is used. Memory management in data warehouse software running on a single machine differs significantly from memory management in massively parallel processing (MPP) columnar data warehouse technology. If the differences aren’t taken into account, query times can balloon from a few minutes to hours.
Continuing with the table of books example above, suppose you have sorted all of your books by title and want a count of all the books that share the same second word in their titles; for instance, “Brave New World” matches “A New Hope”. The database now has to determine the second word of each of the billions of titles it stores. This calculation runs for every title regardless of any sort key or index, and the result of each calculation has to be stored somewhere, ideally in memory. Most MPP systems have a large total amount of memory across the cluster but relatively little memory on each individual node. Storing the calculated result set can exceed the memory capacity of a single node and force a “spill to disk,” which means writing to physical storage.
You can avoid hitting disk by creating staging or temporary tables that perform the needed calculations beforehand, in separate processes. Doing this instead of performing the calculations on demand inside a large nested subquery requires more planning and work to break apart your query, but the performance gain justifies the effort.
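One way to sketch this staging step for the second-word example, assuming a Postgres-derived warehouse such as Redshift where SPLIT_PART is available (the table and column names are illustrative):

```sql
-- Staging step, run as its own process: materialize the calculated
-- column once, rather than computing it inside a nested subquery.
CREATE TEMP TABLE title_words AS
SELECT SPLIT_PART(title, ' ', 2) AS second_word
FROM books;

-- The aggregation now reads one small, precomputed column:
SELECT second_word, COUNT(*) AS num_books
FROM title_words
GROUP BY second_word
ORDER BY num_books DESC;
```

Because the expensive per-row calculation happens once, up front, the final aggregation stays within node memory instead of spilling to disk.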
We have fun tackling these kinds of challenges and love helping other companies facing them.