What is Data Engineering and why you should consider learning it as a Data Scientist
- Simbarashe Chikaura
- Jul 4, 2020
- 9 min read
Updated: Jul 5, 2020
I think it’s now safe to say Data Analytics has truly taken over the global business landscape. Even in the local Zimbabwean context, I’ve come across more data-related job ads this year alone than I had in the previous two years combined, and we’re only in July. If you’ve worked in Zim corporate environments, you’ll know that they are extremely conservative and it takes something special to motivate a change in their strategies.
With Data Analytics now being an appreciated and recognised field, the once murky & interchangeable roles it comprises have now taken distinct forms, duties and responsibilities. I’m talking of course of roles like Data Analysts, Data Scientists and Data Engineers. Granted, they all have overlapping skill sets, but the competency level associated with the core skills of each role varies, in some instances very drastically. This article will focus on the last of these roles and why it is important for Data Scientists (or even Data Analysts) to polish up on the core skills of a Data Engineer.
What is a Data Engineer and what does s/he do?
[Figure: a high-level data pipeline - data flows in, is explored and transformed, then passed on to downstream processes like machine learning models]
At a high level, we can describe data engineering using a data pipeline. As the figure above shows, a data pipeline receives data, explores it, then ultimately passes it to other processes like, but not limited to, machine learning models. If you’ve ever come across the database Extract-Transform-Load (ETL) process, then this might sound familiar. However, they’re not the same thing. ETL is actually a cog in the data pipeline architecture, usually done at the very beginning.
Likewise, if you’ve ever carried out a machine learning project as a Data Scientist/Analyst, then that workflow would also look familiar. However, those are local projects that are carried out in self-contained environments like a Jupyter notebook. They involve relatively small datasets that consume relatively low resources (computer processing and memory) so performance is never an issue. Now the converse of that situation is where a Data Engineer shines. A Data Engineer manages data analytics workflows at scale. The term ‘at scale’ in this context refers to a data analytics workflow that receives large amounts of data and is accessed by more than a few users, simultaneously. For data to be available in such a parallel structure, it’ll have to sit on a physical server, or in the cloud. A Data Engineer builds an efficient system for accessing that data.
More than any other role in data analytics, the Data Engineer needs to have solid programming skills. These go beyond just querying data or building an ML model to actually understanding fundamentals like data structures & algorithms and database architecture. The 3 most valuable resources on a server, cloud or otherwise (and even on your local machine), are:
storage - the actual hard disk space
computing power - CPU/processor
memory - the RAM.
No matter how powerful a server is, if the architecture of a data analytics workflow is not built in the most efficient way possible, these 3 resources will be used up and the system will become extremely difficult to use.
The Data Engineer builds the data pipeline we discussed before with these things in mind. I find the best way of understanding concepts is by way of an example. Let’s say we have a logistics business that is entirely digital, i.e., the only interface it has with its customers is its mobile application. Let’s call the app ‘DigiShip’ for reference’s sake. The way DigiShip works is that a client logs into the app, requests a driver & vehicle type for the size of their package, supplies a destination and then pays a calculated fee for the delivery. DigiShip has 3 analytics professionals on its team, and these are their responsibilities:
Data Analyst
Explores the data searching for insights that might be useful for the business, e.g., the average number of customers in a certain area, revenue per vehicle type, how orders behave over time, how customers are distributed, etc
Tracks metrics and KPIs
Generates reports using dashboards and visualizations
Data Scientist
Develops hypotheses based on the exploration results of the data analyst
Develops machine learning products that can be used to increase revenue, e.g., a recommendation system that suggests where drivers should park to anticipate customer orders, reducing latency and increasing the number of trips per day
Data Engineer
Makes sure that the duties of the above 2 team members are executed in the most efficient manner possible.
The Data Engineer would achieve this by building a pipeline/workflow:
The application data is triggered by events, and these events are saved in a NoSQL database in the form of logs. The logs are nothing more than JSON data structures. An event contains all the data for a transaction, for example, coordinates, package type, driver name, customer name, etc. The Data Analyst and the Data Scientist need this data to be able to execute their duties, but it is not optimized for fast and efficient retrieval. The Data Engineer would then design an algorithm that loops over the JSONs and transforms them into an ordered structure for each category. This algorithm would need to be as programmatically efficient as possible. The ordered data structures (e.g. a list of lists) are then loaded into a database in the form of different tables. This is what is known as ETL.
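To make that concrete, here is a minimal sketch of what such an ETL step could look like in Python. The field names, file layout and SQLite target are all assumptions for illustration; a real pipeline would follow the app’s actual schema and database.

```python
import json
import sqlite3
from pathlib import Path

# Hypothetical event fields -- the real DigiShip logs would define their own schema.
FIELDS = ("order_id", "customer_name", "driver_name", "package_type", "destination", "fee")

def extract_transform(log_dir):
    """Loop over raw JSON event logs and flatten them into ordered rows."""
    rows = []
    for path in Path(log_dir).glob("*.json"):
        with open(path) as f:
            event = json.load(f)
        # One ordered tuple per event; missing keys become None instead of raising errors.
        rows.append(tuple(event.get(field) for field in FIELDS))
    return rows

def load(rows, db_path="digiship.db"):
    """Load the ordered rows into a relational table for fast querying."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id TEXT, customer_name TEXT, driver_name TEXT, "
        "package_type TEXT, destination TEXT, fee REAL)"
    )
    con.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?, ?)", rows)
    con.commit()
    con.close()

load(extract_transform("event_logs/"))
```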
The Data Analyst has a routine job description which more or less follows the same pattern every day. In other words, s/he makes the same queries for data on a regular basis. The Data Engineer extends the pipeline to meet this request. The most efficient “SELECT” statements are designed for all of the scenarios in which the analyst might need data. The queries are then ported to the platform the Data Analyst uses, e.g., Tableau, in an abstracted manner such that the Data Analyst won’t be able to tamper with them.
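One simple way to achieve that abstraction is to pre-write the approved queries and expose only named report functions, so the analyst never touches the SQL itself. A sketch, again with hypothetical table and column names:

```python
import sqlite3

# Pre-written, vetted queries -- the analyst calls these by name and never edits the SQL.
QUERIES = {
    "revenue_per_vehicle": "SELECT package_type, SUM(fee) FROM orders GROUP BY package_type",
    "orders_by_destination": "SELECT destination, COUNT(*) FROM orders GROUP BY destination",
}

def run_report(name, db_path="digiship.db"):
    """Execute one of the approved queries and return its rows."""
    con = sqlite3.connect(db_path)
    try:
        return con.execute(QUERIES[name]).fetchall()
    finally:
        con.close()

print(run_report("revenue_per_vehicle"))
```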
The Data Scientist develops ML models that increase revenue for the business. However, training these models consumes a lot of memory and computing power. If left unchecked, the models can potentially overwhelm the entire server, crippling the business by crashing the application. The Data Engineer then sets up a Docker container for the ML model to run in. A container is an isolated environment to which you can dedicate a fixed share of the server’s resources. If a model is constrained to those resources, it won’t overwhelm the rest of the server.
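As a rough illustration, here is how those resource caps might be applied using the Docker SDK for Python (`pip install docker`). The image name and training script are invented for this example:

```python
import docker  # the Docker SDK for Python

client = docker.from_env()

# Run the training job in a container capped at 4 GB of RAM and 2 CPUs,
# so a runaway model can't starve the application sharing the server.
container = client.containers.run(
    image="digiship-ml:latest",       # hypothetical image holding the training code
    command="python train_model.py",  # hypothetical training entry point
    mem_limit="4g",                   # hard ceiling on memory
    nano_cpus=2_000_000_000,          # 2 CPUs (the API counts in billionths of a CPU)
    detach=True,
)
print(container.logs())
```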
Both the Data Scientist and Analyst need to publish their results when they’re done with their cycles. This might require writing data back to the database. The Data Engineer provides the most efficient SQL queries for this to happen (a small sketch follows below). End of pipeline.
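That write-back step could be as simple as one batched transaction rather than row-by-row inserts; the table and sample rows below are invented for illustration:

```python
import sqlite3

def publish_results(rows, db_path="digiship.db"):
    """Write model outputs back in a single batched transaction."""
    con = sqlite3.connect(db_path)
    with con:  # one transaction: one commit for the whole batch, not one per row
        con.execute(
            "CREATE TABLE IF NOT EXISTS driver_recommendations "
            "(driver_name TEXT, suggested_zone TEXT, expected_trips REAL)"
        )
        con.executemany("INSERT INTO driver_recommendations VALUES (?, ?, ?)", rows)
    con.close()

publish_results([("T. Moyo", "Avondale", 11.5), ("R. Ncube", "CBD", 9.0)])
```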

As you can probably tell, the Data Engineer architects a lot of systems in the course of all this.
Skills required to be a Data Engineer
This list is not in any way exhaustive. It’s only intended to provide you with the context of skills one should possess to be a competent Data Engineer.
Database architecture: SQL, PostgreSQL, MySQL, MS-SQL Server, etc
Data structures and algorithms
Data mining
Manipulating data using packages like numpy, pandas, etc
Distributed computing: cloud systems (AWS, GCP, Azure, etc), docker, Spark, Hadoop, Kubernetes, etc
Machine Learning optimization
Data exploration
Why you should care about Data Engineering as a Data Scientist/Analyst
Most of my experience has been as a data analytics consultant/freelancer, and this has exposed me to a lot of different scenarios where analytics can be applied. In some of these scenarios, a Data Engineer (or at the very least an efficient pipeline) is already in place, so my only tasks are to carry out the analysis or train a model. In other cases, though, no such pipeline exists. I will tell you about one experience I had, which is actually the motivation for this article.
I was engaged by a company that runs an app which searches for public transport options in a city. The user logs in and searches possible modes of transport, routes and destinations. All this information is saved in a JSON-like data structure (similar to the example we discussed above). The JSONs were saved on disk (on the server), and the company needed an efficient way to explore the JSON files, query the required information from the keys in each JSON and supply the resultant data to different areas of the organisation, whether as intermediate data or analysed data. I was also to build a recommender system for the app, but I had to source all the data myself.
This sounded easy to me at first, until they supplied me with a sample of the JSON files. If I succeeded in developing an algorithm that could query the data, I’d get the job. Those JSON files are to date the messiest sources of data I have ever seen. The keys weren’t consistent across files, and some of the data was missing altogether. The actual data I needed from the logs was nested in other keys within the JSON. This made it difficult to develop an algorithm that could uniformly mine data without returning an error. My solution was to include a lot of conditional logic in my data scraper and to facilitate this using up to THREE nested for loops (bad idea, I’ll explain why). So after a few tries the algorithm worked and I submitted it to the company. In less than an hour, I got a message back saying my code “wasn’t of a high enough quality” for it to be accepted. At this point I was very much an “as long as it works” kind of Data Analyst - I didn’t care about efficiency. I was confused, so I asked why my algorithm was being rejected when it was completing the task they gave me. The guy on the other end was kind enough to explain it to me:
my algorithm had a bottleneck (a design flaw that slows down execution dramatically as the size of the data increases)
the bottleneck was due to my THREE nested for loops. This essentially meant that my algorithm had a time complexity of order O(N³): if the data being fed into the algorithm tripled, the running time would multiply by 3³ = 27. For context, the sample data I was given had 900 JSON files and these took around 6 minutes to go through. If the number of JSON files were increased to 2,700, the time would balloon to roughly 27 × 6 = 162 minutes. That’s about 2.7 hours running a single algorithm. Furthermore, the actual app received these JSONs in the thousands every hour. Imagine the harm my algorithm would have done there.
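To illustrate the kind of fix that was expected, here is a simplified before-and-after sketch. The keys (`records`, `route_id`, `route_name`) are invented for this example; the point is replacing an inner scan with a one-off hash index:

```python
# Slow version: for every file, scan every record, and for every record
# scan the whole lookup list -- three nested loops that blow up as inputs grow.
def extract_slow(files, lookup):
    results = []
    for f in files:
        for record in f["records"]:
            for entry in lookup:  # full scan on every single record
                if entry["route_id"] == record.get("route_id"):
                    results.append((entry["route_id"], entry["route_name"]))
    return results

# Faster version: build a dictionary index once, then do O(1) lookups
# inside a single pass over the data.
def extract_fast(files, lookup):
    by_route = {entry["route_id"]: entry["route_name"] for entry in lookup}
    results = []
    for f in files:
        for record in f["records"]:
            name = by_route.get(record.get("route_id"))  # tolerates missing keys
            if name is not None:
                results.append((record["route_id"], name))
    return results
```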
This was my first push into data engineering. In all the roles I had been in before that, the heavy lifting was abstracted away from me by team members or personnel from other departments. However, the more you explore data analytics opportunities, the more you’ll encounter scenarios like the one I described above. Now I’m not asking you to be a jack of all data analytics trades, but to know enough of each to be a competent Data Scientist/Analyst. I wasn’t asked to design something complex like a distributed system, but simply to build an efficient querying algorithm. I failed to do that because I had not exposed myself to elementary data engineering skills. A simple course on data structures and the time complexity of algorithms would’ve equipped me with those skills (as it since has), but I was so stuck in my Data Analyst silo that I never thought it important.
The biggest wins from learning data engineering as a data scientist are knowing how systems are designed, how data is structured, and how both interact with the code you’re writing and the processes you’re running. Furthermore, you may find yourself in an organization with budget constraints, or one where it simply doesn’t make sense to hire an entire team of data analytics professionals. Having data engineering skills will allow you to handle projects efficiently from end to end.
Conclusion
In conclusion, data engineering is definitely something you should learn (at a beginner’s level at the very least). Not only will it improve the quality of your code, but you’ll find yourself asking questions before you carry out tasks. Questions like “will numpy arrays be more efficient for this solution than a pandas dataframe?” or “will this recommender system consume fewer resources if I use an apriori algorithm rather than tensorflow?”. In the end you’ve nothing to lose by picking up data engineering skills, and everything to gain.
Learning resources to consider
edX - CS50’s Introduction to Computer Science: this course will teach you to think like a computer scientist. It has a very good ‘Introduction to C’ topic that’ll teach you how code is implemented at a low level in a computer. This will encourage you to use higher-level programming languages like Python more efficiently and more responsibly. (free; $90 for a verified certificate)
Dataquest - Data Engineer Path: this learning path will equip you with most of the skills a data engineer needs: PostgreSQL, data structures, time complexity of algorithms, numpy and pandas for big data, etc. ($49/mo subscription)
DataCamp - Introduction to Data Engineering: this course covers ground quite similar to the Dataquest path, but at a much cheaper price. ($29/mo subscription)
Udacity - Data Engineering Nanodegree Program: the most diverse of all the options on this list. The program is 5 months long and covers topics like data modelling, data warehouses, cloud data lakes, and more. Intermediate Python and SQL are prerequisites. It is, however, the most expensive of the bunch. ($349/mo subscription, self-paced)
These are probably the ones you should take if you’re a beginner. If you’re past that, then you might be ready for professional courses/certifications like the ones cloud giants provide, but that’s an article for another day.
I am a data analytics enthusiast who likes to read and write about data. I also do a lot of freelance work so please do hit me up if you want a job done. Here's a link to my portfolio if you'd like to see some of my work.