“Don’t forget to give time for yourself” is a common piece of advice given by experienced parents. So when it was my time, I got straight to work on my next home project.
I had a Raspberry Pi 1B and 2B+ and a Raspberry Pi camera v1. My RPi 2B+ was already running HomeAssistant and, being slightly more powerful, could perform some light processing, so I decided to use this RPi to stream the video itself. I knew that video streaming is challenging and can be compute-intensive, so the first step was to give the onboard GPU about 50% of the…
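For context, the GPU memory split on a Raspberry Pi is commonly set with the `gpu_mem` option in `/boot/config.txt`. A hedged sketch — the value below is illustrative, not necessarily what was used here:

```ini
# /boot/config.txt -- illustrative value, not from the post.
# Allocates RAM (in MB) to the GPU; the camera and H.264 encoder
# pipeline on these boards runs on the GPU, so it needs a larger share.
gpu_mem=256
```

A reboot is required for the new split to take effect.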
This description is a good summary of this format. This post will talk about the features of the format and why it is beneficial to analytical data queries in the data warehouse or lake.
BBC micro:bit is a cheap microcontroller with a few inputs/outputs, buttons and, more importantly, a Bluetooth radio. For one of my Raspberry Pi (RPi) projects, I needed an Analogue-to-Digital Converter (ADC) to read some sensor values. Some of you may know the RPi does not have any analogue inputs, so you need an external converter like the MCP3008 chip.
I didn’t want to use a breadboard for my RPi, so I looked in my box of electronics components and found the BBC micro:bit. The nice thing is that the micro:bit has 3 ADC inputs that are easily accessible and is…
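The micro:bit's analogue pins return 10-bit readings (0–1023). A small illustrative helper for converting a raw reading to volts — assuming a 3.3 V reference, which holds on USB power but can be lower on batteries:

```python
# Convert a micro:bit analogue reading (10-bit, 0-1023) to volts.
# Assumes a 3.3 V reference voltage; on battery power the supply
# (and hence the reference) can be lower, so treat this as approximate.

ADC_MAX = 1023
V_REF = 3.3

def adc_to_volts(raw: int) -> float:
    if not 0 <= raw <= ADC_MAX:
        raise ValueError("reading out of range")
    return raw / ADC_MAX * V_REF
```

On the micro:bit itself the raw value would come from something like `pin0.read_analog()` in MicroPython.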
For Data Engineers, and even analysts, it is important to understand the technology in order to utilise it fully and efficiently. In many cases, Redshift is seen as a traditional database like SQL Server, and its management is left to DBAs. I would argue that if Redshift best practices are followed, the role of a dedicated DBA diminishes to occasional management and upkeep.
In this post, we’ll discover the architecture and understand the effect and impact each component has on queries.
From 10,000 ft, Redshift appears like any other relational database with fairly standard SQL and entities like tables, views, stored procedures, and usual data…
What is AWS Redshift and why is it different?
If you are a data analyst or part of a reporting team, no doubt you have heard of “data warehousing”. If your company also uses AWS as its cloud provider, you are likely to use Redshift. Redshift is a fully managed data warehousing service by AWS.
Prior to diving into Redshift, I would like to reflect on a common journey many companies take. Before going to the cloud, many companies will have their reporting databases hosted on MS SQL Server, Oracle, MySQL, and the like. These are great database technologies suited for…
AWS DMS is a service designed to migrate one database to another, whether that is an on-premises DB to AWS RDS or a self-managed DB on AWS EC2 to RDS. The intent is simple, with the assumption that a migration is usually short-lived. DMS not only allows migrating an entire database but also continuous replication of changes while the full load takes place.
Technically, there is no restriction on the duration of a migration task. Having an essentially infinitely running migration task means it could be used to stream DB changes to S3. …
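A DMS task is driven by a table-mapping document that selects which schemas and tables to replicate. A minimal illustrative selection rule — the schema and table names below are placeholders, not from the post:

```json
{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-orders",
      "object-locator": {
        "schema-name": "public",
        "table-name": "orders"
      },
      "rule-action": "include"
    }
  ]
}
```

With a migration type of full-load-and-CDC, the same task first copies existing rows and then keeps streaming ongoing changes.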
Earlier this month (December 2019) I had a chance to talk at the Python London meetup hosted by ComparetheMarket.com. While thinking of topics, rather than talk about something work-related, I decided to talk about something I do as a hobby: turning my home into a smart home.
Many of you may have heard of the term MQTT in relation to IoT. MQTT is a publish/subscribe protocol for small devices that do not have much computing power or network bandwidth. MQTT stands for MQ Telemetry Transport, named after the IBM MQ service. It has become a standard in machine-to-machine communications…
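MQTT routes messages by hierarchical topic names, and subscriptions can use `+` (one level) and `#` (all remaining levels) wildcards. A minimal, illustrative matcher in plain Python — not a real client (with a broker you would use a library such as paho-mqtt), and it ignores edge cases like `$`-prefixed system topics:

```python
def topic_matches(filter_: str, topic: str) -> bool:
    """Check an MQTT topic against a subscription filter.

    '+' matches exactly one topic level; '#' matches any remaining levels.
    Simplified sketch of the matching rules, not a full implementation.
    """
    f_parts = filter_.split("/")
    t_parts = topic.split("/")
    for i, f in enumerate(f_parts):
        if f == "#":
            return True  # multi-level wildcard swallows the rest
        if i >= len(t_parts):
            return False  # filter is deeper than the topic
        if f != "+" and f != t_parts[i]:
            return False  # literal level must match exactly
    return len(f_parts) == len(t_parts)
```

For example, `home/+/temperature` matches `home/kitchen/temperature` but not `home/kitchen/door/temperature`.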
The previous posts discussed how the MongoDB oplog works. In this post, we look at how to tail it and introduce a Python module for tailing the MongoDB oplog.
I developed this as a fun exercise to see how the different pieces fit together and what needs to be considered for tailing the oplog.
At a high level, this module works with technologies in AWS but is designed to be extensible. The module has 2 responsibilities: collect oplog entries and push them to a stream. As discussed in the first post, I use the pymongo package. …
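The module itself isn't shown here, but the core of tailing is a resumable query against `local.oplog.rs`. An illustrative sketch of the pure parts — function names are mine, not the module's; with pymongo, the filter would be passed to `find()` with a tailable-await cursor:

```python
from typing import Any, Dict

def resume_filter(last_ts: Any) -> Dict[str, Any]:
    """Query filter to resume tailing after the last processed entry.

    `last_ts` is the BSON Timestamp of the last oplog entry already
    pushed to the stream (any comparable value works for illustration).
    """
    return {"ts": {"$gt": last_ts}}

def is_data_op(entry: Dict[str, Any]) -> bool:
    # Forward only inserts ('i'), updates ('u') and deletes ('d');
    # skip no-ops ('n') and DB commands ('c').
    return entry.get("op") in ("i", "u", "d")
```

With pymongo this would be used roughly as `client.local["oplog.rs"].find(resume_filter(ts), cursor_type=CursorType.TAILABLE_AWAIT)`, looping over the cursor as new entries arrive.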
In a previous post, I covered what the MongoDB oplog is and its semantics. In this post, I will look at how to process it to get the new state of documents.
First, let’s remind ourselves of the data manipulation operations: Insert, Update & Delete. For Inserts and Deletes, only the o field exists, with either the full document or just the _id being deleted. For Updates, the o field contains the updates as $unset commands and o2 notes the _id of the document being updated. We can ignore c (DB Commands) and n (NOOP) operations as these do…
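To make those semantics concrete, here is an illustrative pure-Python sketch that replays oplog-style entries against an in-memory store keyed by _id. It assumes the classic $set/$unset update format (newer MongoDB versions use a different internal diff format), and the function name is mine:

```python
from typing import Any, Dict

def apply_entry(store: Dict[Any, Dict], entry: Dict[str, Any]) -> None:
    """Apply one oplog-style entry to an in-memory document store."""
    op = entry["op"]
    if op == "i":
        # Insert: o holds the full document.
        doc = entry["o"]
        store[doc["_id"]] = dict(doc)
    elif op == "d":
        # Delete: o holds just the _id being deleted.
        store.pop(entry["o"]["_id"], None)
    elif op == "u":
        # Update: o2 identifies the document, o carries the changes.
        doc = store.get(entry["o2"]["_id"])
        if doc is None:
            return  # unknown document (e.g. we started tailing mid-stream)
        for field, value in entry["o"].get("$set", {}).items():
            doc[field] = value
        for field in entry["o"].get("$unset", {}):
            doc.pop(field, None)
    # 'n' (NOOP) and 'c' (DB command) entries are intentionally ignored.
```

Feeding entries through in order yields the latest state of each document, which is exactly what a data-ingestion pipeline needs.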
MongoDB, similar to other databases, operates using a transaction log internally. In MongoDB’s case, it is called the oplog. I have been looking into the oplog to understand the operations a bit more, so as to process them for data ingestion. This post documents my learnings.
Oplog is a log of every internal operation used for replication in a MongoDB cluster. In a sharded cluster, each replica set has its own oplog. The oplog in all cases is a capped collection and can be accessed like any other collection in MongoDB. …