Data Warehousing | Amazon Redshift

Introduction to Amazon Redshift

What is AWS Redshift and why is it different?

If you are a data analyst or part of reporting team, no doubt you have heard of “data warehousing”. If your company also uses AWS as its cloud provider, you are likely to use Redshift. Redshift is a fully managed data warehousing service by AWS.

Prior to diving into Redshift, I would like to reflect on a common journey many companies take. Before going to the cloud, many companies will have their reporting databases hosted on MS SQL Server, Oracle, MySQL, and the sorts. These are great database technologies suited for fast application access. Microsoft takes SQL Server one step further to add the BI suite of tools to transform SQL Server into a warehouse ecosystem of tools. Slowly over time, as data grows and reporting needs increase, these on-premises systems are migrated to Redshift if using AWS. However, the data warehousing methodology and the reporting mindsets do not evolve to adapt to using cloud technologies. As part of this article, I’ll try to highlight some differences between the “traditional” warehouses (for lack of better term) and Redshift.

Redshift is a distributed and highly scalable data warehouse solution. It has some clear differentiators that make it one of the leading solutions in cloud. Let’s take these one by one and discuss.

Scalability

Data is ever-growing and the need to report on it is becoming critical. Redshift is a clustered solution; containing multiple nodes and the ability to grow and shrink the compute and storage. There are 3 types of nodes available and storage ranges from 5TB to 2PB depending on selected nodes. There are two methods to scale a cluster; “classic” which incurs downtime and “elastic” no downtime. Both methods have their pros and cons but will not be discussed in this post.

Recently, AWS has introduced scaling compute and storage independently allowing flexibility and cost control as data grows or compute needs change over time.

Performance

Redshift is designed and built as a massively parallel processing database. This means the performance is scaled efficiently based on the size of the data across multiple nodes. Redshift is based on columnar storage which is highly beneficial for analytical queries. As much as hardware has a role, so does schema design. Redshift offers multiple optimisations including compression and encoding at a column level. Combining all these, result in sub-minute query times even on terabytes of data.

Security

Security is ingrained into Redshift. It offers everything from disk-level encryption, to table and column permissions. IAM policies allow controlling access to the infrastructure as well as multiple authentication methods that allow control at connection level (JDBC/ODBC). A RBAC model is the foundation of all authentication and authorisation.

Reliability

Amazon Redshift is a highly available and durable managed service. Providing 99.9% online uptime, Redshift provides auto-recovery from hardware failures and even network availability. A robust set of tooling to manage reliable service for end-users through the granular Workload Management (WLM) system.

Amazon Redshift is a massively parallel processing data warehouse solution offering highly reliable, consistent, and secure service.