This blog post was written jointly by Martin Rusnak from Rusnak Consulting and Bujar Bakiu.
Not that long ago (and in some places this may still be the case), companies had multiple one-dimensional data teams. Each team was composed of only Data Scientists, only Business Analysts, or only Data Engineers. With this setup, companies struggled to integrate data products into their wider software architecture. Commonly accepted reasons for this are:
Lack of communication between teams. Requirements prioritized by one team were not aligned with those of the other teams. For instance, if the Data Science team needed to explore the new marketing campaign data, it had to wait for the Data Engineering team to make the data available.
Considering solutions in isolation. Data Scientists might optimize for accuracy during testing and evaluation without considering how the solution performs during inference. Serving the model in production then became a huge challenge for the operations team.
Overall, there were huge gaps when building end-to-end processes like automation, orchestration, and testing.
Modern Data Team Hats
To solve the problems with one-dimensional teams, the proven approach of cross-functional teams came to the rescue. In these teams, different members focus more on Data Analysis, Data Science, ML Engineering, etc. They work together, bringing more depth, a wider scope of information, and a diversity of opinions to reach their goal.
We believe that there are no clear boundaries between the roles one can play in the team. Therefore, in this post, we call these roles hats. A hat is a position someone holds when discussing or solving a problem.
Every team is different; however, these are the terms most commonly used to describe the hats in a data team.
The Data Engineering hat builds reliable data pipelines and data infrastructure. They serve as a bridge with the infrastructure team to deploy specialized components and upgrades. They take care of integrating other data sources and implementing data quality checks. If needed, data versioning is implemented by this hat. A big part of the work is also optimizing performance, both when ingesting data and when answering queries. The most commonly used tools are:
Orchestration, e.g. Airflow, Dagster, Prefect
Data processing, e.g. Pandas, Spark, Dask
Data warehousing, e.g. BigQuery, Redshift, Hive
Data versioning, e.g. DVC, Pachyderm
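To make the data quality checks mentioned for the Data Engineering hat more concrete, here is a minimal sketch in plain Python. The check names, the `CheckResult` type, and the `order_id` and `amount` columns are assumptions made up for illustration; in a real pipeline such checks would more likely be expressed with the team's orchestration or warehousing tools.

```python
from dataclasses import dataclass
from typing import Any

Row = dict[str, Any]


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def check_not_null(rows: list[Row], column: str) -> CheckResult:
    """Pass only if every row has a non-null value in `column`."""
    missing = sum(1 for row in rows if row.get(column) is None)
    return CheckResult(f"not_null({column})", missing == 0, f"{missing} missing value(s)")


def check_unique(rows: list[Row], column: str) -> CheckResult:
    """Pass only if no value in `column` appears more than once."""
    values = [row.get(column) for row in rows]
    duplicates = len(values) - len(set(values))
    return CheckResult(f"unique({column})", duplicates == 0, f"{duplicates} duplicate value(s)")


def check_range(rows: list[Row], column: str, low: float, high: float) -> CheckResult:
    """Pass only if every non-null value in `column` lies within [low, high]."""
    out_of_range = sum(
        1
        for row in rows
        if row.get(column) is not None and not (low <= row[column] <= high)
    )
    return CheckResult(
        f"range({column})", out_of_range == 0, f"{out_of_range} value(s) out of range"
    )


# Hypothetical batch of ingested records with deliberate problems.
rows = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": 2, "amount": None},
    {"order_id": 2, "amount": -5.0},
]

results = [
    check_not_null(rows, "order_id"),
    check_not_null(rows, "amount"),
    check_unique(rows, "order_id"),
    check_range(rows, "amount", 0.0, 10_000.0),
]

for result in results:
    status = "OK" if result.passed else "FAILED"
    print(f"{status:6} {result.name}: {result.detail}")
```

A pipeline step could stop the load, or route the batch to quarantine, whenever any result has `passed` set to `False`.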
The Analytics Engineering hat is occupied primarily with cleaning and transforming the data. Together with the Data Engineering hat, they bring software engineering best practices, like version control, automated testing, and deployment, to analytics code. Tools usually used:
Data warehousing, e.g. BigQuery, Redshift, Snowflake
Transformation, e.g. dbt, Dataform
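As an illustration of what automated testing of analytics code can look like, the sketch below pairs a small cleaning transformation with a test. It uses plain Python as a stand-in; with tools like dbt the transformation would typically be written in SQL. The function name `clean_customers` and the `id` and `email` columns are hypothetical.

```python
from typing import Any

Row = dict[str, Any]


def clean_customers(rows: list[Row]) -> list[Row]:
    """Normalize emails and keep only the first row seen for each email."""
    seen: set[str] = set()
    cleaned: list[Row] = []
    for row in rows:
        # Treat a missing email as empty, then trim and lowercase it.
        email = (row.get("email") or "").strip().lower()
        if not email or email in seen:
            continue  # drop rows without an email and duplicate emails
        seen.add(email)
        cleaned.append({**row, "email": email})
    return cleaned


def test_clean_customers() -> None:
    raw = [
        {"id": 1, "email": " Ana@Example.org "},
        {"id": 2, "email": "ana@example.org"},  # duplicate after normalization
        {"id": 3, "email": None},  # missing email
    ]
    assert clean_customers(raw) == [{"id": 1, "email": "ana@example.org"}]


test_clean_customers()
```

Keeping the transformation and its test in version control means a CI job can run the test on every change before the code is deployed.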
The Data Analyst hat interrogates the data, looking for insights to support data-driven decision making. They collaborate closely with the Analytics Engineering hat and share many of its skills. They visualize the data to help everyone make sense of it. Tools used:
Visualization, e.g. Metabase, Looker, Power BI, Tableau
Transformation, e.g. dbt, Dataform, SQL
The Data Scientist hat finds the best way to model the data for predictions. They have strong skills in feature engineering. People wearing this hat have deep knowledge of machine learning techniques, statistics, and analytics. Tools used:
ML libraries, e.g. scikit-learn, XGBoost
Deep Learning libraries, e.g. TensorFlow, PyTorch
Experiment tracking, e.g. MLflow, Kubeflow, Aim
Feature store, e.g. Feast, Hopsworks
Explainability, e.g. LIME, SHAP
The Machine Learning Engineering hat brings in a thorough knowledge of software engineering best practices. They productionize ML models to solve business needs and integrate them with the organization's current infrastructure. They build the infrastructure for A/B testing, distributed model training, and ML workflow orchestration, and they extend existing platforms. Tools used:
Orchestration, e.g. MLflow, Kubeflow, Flyte, Kubernetes
Model serving, e.g. Seldon Core, BentoML, TensorFlow Serving, TorchServe
Training, e.g. Horovod, Ray
Feature store, e.g. Feast, Hopsworks
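One small building block of the A/B testing infrastructure mentioned above is assigning each user to a variant in a stable way. The sketch below shows one common approach, hashing the user and experiment identifiers, using only the Python standard library. The function name, the variant labels, and the `treatment_share` parameter are assumptions for illustration, not part of any tool listed here.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'treatment' or 'control' for one experiment."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Use the first 32 bits of the hash as a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x1_0000_0000
    return "treatment" if bucket < treatment_share else "control"


# The same user always lands in the same variant of a given experiment.
first = assign_variant("user-42", "new-ranking-model")
second = assign_variant("user-42", "new-ranking-model")
print(first, first == second)
```

Because the assignment depends only on the identifiers, no lookup table is needed, and including the experiment name in the hash keeps assignments in different experiments independent of each other.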
The MLOps hat focuses on integrating automation and monitoring at all steps of ML system construction. They bring DevOps best practices to the team, such as continuous integration, automated deployment, and model monitoring. The most commonly used tools are:
Model monitoring, e.g. WhyLabs, Evidently
Automation, e.g. GitLab CI, GitHub Actions
Infrastructure, e.g. Terraform, Kubernetes, Helm charts
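To give a feel for what model monitoring can involve, here is a minimal sketch of one widely used drift measure, the Population Stability Index (PSI), which compares the distribution of a feature at training time with its distribution in production. This is a simplified standard-library version with equal-width bins, written for illustration; the function name and parameters are assumptions, and dedicated monitoring tools offer more robust implementations. Both samples must be non-empty.

```python
import math


def population_stability_index(
    reference: list[float], current: list[float], bins: int = 10, eps: float = 1e-6
) -> float:
    """Compare two samples of one feature; 0.0 means identical binned distributions."""
    low, high = min(reference), max(reference)
    width = (high - low) / bins or 1.0  # avoid a zero width for constant data

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for value in values:
            index = int((value - low) / width)
            # Values outside the reference range fall into the edge bins.
            counts[min(max(index, 0), bins - 1)] += 1
        # Floor at eps so empty bins do not cause log(0) or division by zero.
        return [max(count / len(values), eps) for count in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


# Hypothetical feature values: training data versus shifted production data.
training = [i / 100 for i in range(100)]
production = [value + 0.5 for value in training]

print(population_stability_index(training, training))
print(population_stability_index(training, production))
```

A monitoring job could compute this value on a schedule and raise an alert when it exceeds a threshold the team has agreed on.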
The Product Manager hat is usually separate from the other, very technical hats. They make sure that what is being developed brings value to the users and stakeholders.
A team will likely not contain all these hats. Which ones are required depends on the team size, the challenge at hand, and many other factors. Often, one person wears more than one hat.
At Data Max, we focus on covering all the hats mentioned here. We are proud of our expertise and are eager to share our knowledge. Reach out to us at firstname.lastname@example.org.