An Introduction to Data Science for Technology Leaders

Leverage machine learning capabilities to fulfill your organization's mission statement.

Author: Nicole Janeway | Post Date: Jan 3, 2022 | Last Update: Jun 4, 2024

Increasingly, Data Science is recognized as a key capability for capitalizing on data as a strategic resource. In this article, we outline how leaders in Data Management and Information Technology can use Data Science and Machine Learning to open the door to new opportunities.

Data Science can uncover complex patterns in large datasets, including text and images. Organizations can use tools based on Machine Learning to revolutionize operations, ways of working, and interactions with customers and constituents.

Here, we demystify Data Science so that decision makers can clearly understand how to leverage this field to meet their needs.

Getting Started
Data Science Team
Considerations
Statistical Techniques
Natural Language Processing
Computer Vision
Other Applications
Conclusion

Getting Started

When an organization begins thinking about data as a strategic asset, it gains the capability to experiment, to speak a common language across functions, and to better serve its customers. Data Science provides the information and tools to enable these improvements.

As an organization's data maturity increases, it can shift from basic reporting to predictive analytics. Data scientists use statistical methods and algorithms to derive predictive insight from large quantities of data. These modeling techniques can help an organization attain data-driven decision making.

Machine Learning is the subfield of Data Science in which algorithms learn relationships directly from data. Its most prominent branch, deep learning, uses artificial neural networks to identify complex relationships in data. This discipline holds the promise to deliver advanced data products such as chatbots, text synthesis, and image recognition.

Through Data Science and Machine Learning, organizations can operate more efficiently. These domains improve employee satisfaction by reducing manual, tedious tasks and freeing up more time for value-added work. Moreover, an initiative that leverages these techniques can produce tools to dramatically improve customer service.

Whether your organization has advanced data capabilities or is just getting started in this space, you'll need to think carefully about how your team is structured in order to empower data scientists to do their best work.

Data Science Team

In order for a Data Science initiative to be successful, the organization must be committed to good data management practices across the board. This includes foundational training in data literacy, data quality reporting processes, and the establishment of a data governance charter and supporting institutions, such as a data governance council. Read more about how to initiate Data Strategy.

As an advanced function, Data Science is enabled by supporting teams such as Human Centered Design, DataOps, and DevSecOps.

Human Centered Design: responsible for ensuring the end user is at the center of solutions developed by the Data Scientists. They conduct interviews, construct customer personas, and help translate business needs into technical requirements.
DataOps: responsible for data architecture and engineering. This group is tasked with maintaining high-quality data that flows from transactional data sources to endpoints such as data warehouses, where it can be accessed by business units, data analysts, and data scientists. Data quality issues should be reported to the DataOps team through a formal process so that each issue can be rectified at its source.
DevSecOps: responsible for creating the pipelines that move data through the organization and deliver data products to the end user, while maintaining focus on data security. This team should be comfortable working with infrastructure as code, the process of maintaining and provisioning servers based on programmable requirements.

Supported by these teams, data scientists can capitalize on insights latent within an organization's data sources. The work of a data scientist is distinguished from that of a data analyst by the use of computer programming (typically in the languages R or Python) and statistical methodologies to conduct hypothesis testing, produce advanced analytics, and generate predictive insight.

Machine learning engineers work with large datasets typically composed of labeled training examples. The data could come from an organization's transactional database, or it could be text or images. After data preparation, the machine learning engineer passes the training data into a neural network, a complex mathematical model composed of inputs, weights and biases, an activation function, etc. These components form layers of neurons — much like the natural neural network upon which this computing metaphor is based.
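
To make these components concrete, here is a minimal sketch of a single pass of data through a tiny neural network, written in plain Python. The layer sizes, weights, and biases are made-up values for illustration (in practice they are learned from the training data), and the sigmoid is just one common choice of activation function.

```python
import math

def sigmoid(z):
    """Activation function: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """One neuron: weighted sum of the inputs plus a bias, then the activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

def layer(inputs, weight_rows, biases):
    """A layer is a group of neurons that all read the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Hypothetical values for illustration only; real weights come from training.
hidden_weights = [[0.5, -0.2], [0.1, 0.8]]
hidden_biases = [0.0, -0.1]
output_weights = [[1.2, -0.7]]
output_biases = [0.05]

def forward(inputs):
    """Pass the inputs through the hidden layer, then the output layer."""
    hidden = layer(inputs, hidden_weights, hidden_biases)
    return layer(hidden, output_weights, output_biases)[0]

prediction = forward([1.0, 2.0])  # a single score between 0 and 1
```

Training consists of repeatedly adjusting the weights and biases so that predictions like this one move closer to the labels in the training examples.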

Machine Learning results in impressive data products such as chatbots, image recognition tools, recommendation engines, and many more applications. These capabilities may be referred to collectively as artificial intelligence.

Considerations

Here are some questions to ask as your team considers the implementation of Data Science.

Is my organization ready?

Not all challenges can or should be addressed by advanced statistical techniques. A data scientist is not needed, for example, to perform a basic linear regression or ANOVA, both of which are built into Microsoft Excel and other business intelligence tools.

Do we have a sufficient quantity of data?

Generally speaking, Data Science, and Machine Learning in particular, requires a very large dataset. As a rule of thumb, around 5,000 labeled observations per category are required to obtain acceptable performance from a neural network. Matching human performance generally requires far more: on the order of 10 million labeled observations in total.
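
A quick way to apply the 5,000-per-category rule of thumb is to count the labeled examples in each category before any modeling starts. The sketch below assumes the labels are available as a simple list; the function name and the example labels are hypothetical.

```python
from collections import Counter

MIN_PER_CATEGORY = 5000  # rule-of-thumb threshold from the paragraph above

def underrepresented(labels, minimum=MIN_PER_CATEGORY):
    """Return the categories that have fewer labeled examples than the minimum."""
    counts = Counter(labels)
    return {category: n for category, n in counts.items() if n < minimum}

# Hypothetical labels for a fraud classifier, for illustration only.
labels = ["not_fraud"] * 12000 + ["fraud"] * 300
shortfalls = underrepresented(labels)
```

Here `shortfalls` flags the "fraud" category, which has only 300 examples. A result like this suggests collecting more labeled data or starting with a simpler statistical model.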

Is this model better than the baseline?

Data scientists should always compare their work against benchmarks. Before building a statistical or machine learning approach, the Data Science team should evaluate the quality of results achieved from naive methodology such as:

Taking an average of past outcomes
Projecting the most recent observations forward in time
Using linear regression

This way, the team will have three benchmarks to compare their work against.
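
As a rough sketch, the three benchmarks can each be written in a few lines of plain Python. The history values and the held-out "actual" figure are invented for illustration, and the linear regression here uses time order as its only predictor, which is an assumption.

```python
def mean_baseline(history):
    """Predict the average of all past outcomes."""
    return sum(history) / len(history)

def last_value_baseline(history):
    """Project the most recent observation forward."""
    return history[-1]

def linear_trend_baseline(history):
    """Fit a least-squares line through the points (0, y0), (1, y1), ...
    and extend it one step ahead. Needs at least two observations."""
    n = len(history)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(history) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, history))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return intercept + slope * n

baselines = {
    "mean": mean_baseline,
    "last_value": last_value_baseline,
    "linear_trend": linear_trend_baseline,
}

history = [10, 12, 14, 16]  # hypothetical past outcomes
actual_next = 18            # hypothetical held-out value

# Absolute error of each benchmark on the held-out value.
errors = {name: abs(fn(history) - actual_next) for name, fn in baselines.items()}
```

Any statistical or machine learning model the team builds should beat the best of these errors on held-out data before it is considered worth deploying.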

Model development should then proceed in order of increasing complexity. If a neural network (high complexity) doesn't significantly outperform a random forest (moderate complexity), then modeling efforts should focus on the random forest. This approach reduces unnecessary complexity in order to save on runtime and compute power.
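
One way to make this rule explicit is to walk through the candidate models from simplest to most complex and only move up when the improvement is meaningful. The sketch below is illustrative: the scores are hypothetical validation results (higher is better), and the `min_gain` threshold is an assumption that each team would set for itself.

```python
def choose_model(candidates, min_gain=0.01):
    """Pick a model from a list of (name, score) pairs ordered from simplest
    to most complex. A more complex model replaces the current pick only if
    it improves the score by at least min_gain."""
    chosen_name, chosen_score = candidates[0]
    for name, score in candidates[1:]:
        if score - chosen_score >= min_gain:
            chosen_name, chosen_score = name, score
    return chosen_name

# Hypothetical validation scores, for illustration only.
candidates = [
    ("linear regression", 0.71),
    ("random forest", 0.84),
    ("neural network", 0.845),
]
chosen = choose_model(candidates)
```

With these numbers the random forest is selected: it clearly beats linear regression, while the neural network's extra half a point does not justify its added complexity.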

Where should we incorporate humans in the loop?

Data Science doesn't take place in a vacuum — decision makers should carefully consider where and how to integrate human experts into the workflow.

In The Signal and the Noise, Nate Silver of FiveThirtyEight fame describes how baseball scouts represent a hybrid model that combines statistics and human intuition. In Silver's experience, the scouts outperformed the statistical models, which isn't surprising: scouts use quantitative analysis as well as additional sources of information, such as their sense of the athlete's mental preparedness, to make their judgments.

Humans still have an important role to play in tuning models, determining where their use is appropriate, and interpreting outputs.

Statistical Techniques

Data Science encompasses statistical techniques that began with the invention of regression analysis in the early 19th century. The field offers a suite of methods to assess data, resulting in prediction, classification, and clustering.

Regression: used to predict a continuous variable. For example, a healthcare system might deploy a regression algorithm using patient data to predict length of stay.
Classification: used to assign observations to predetermined categories. For example, a tax bureau might conduct anomaly detection by using a logistic model to classify returns as fraudulent or not fraudulent.
Clustering: used to create groupings from unlabeled data. For example, clustering could be used to better understand connections within a transaction dataset, thereby enabling investigators to detect money laundering.
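
To illustrate clustering, here is a minimal sketch of k-means, one common clustering algorithm, applied to one-dimensional data. The transaction amounts and starting centers are invented, and real investigations would use many more features than a single amount.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny k-means on one-dimensional data, starting from the given centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assign each value to its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group (unchanged if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical transaction amounts, for illustration only.
amounts = [20, 25, 30, 35, 9000, 9500, 9900]
centers, groups = kmeans_1d(amounts, centers=[0, 10000])
```

No labels are supplied, yet the algorithm separates the small everyday transactions from the unusually large ones, which an investigator could then examine more closely.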

Rather than building on formalized business logic rules, as Robotic Process Automation (RPA) does, Data Science is built on statistical methodologies. All these techniques involve an element of probability. Data products are highly flexible and respond well to the incorporation of more data.
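
The contrast between a fixed business rule and a probabilistic model can be sketched in a few lines. Both functions below are hypothetical: the threshold, the feature names, and the coefficients are invented for illustration, and a real logistic model would estimate its coefficients from labeled tax returns.

```python
import math

def rule_based_flag(amount_claimed):
    """Fixed business rule: a hard threshold gives a yes/no answer."""
    return amount_claimed > 10000

def fraud_probability(amount_claimed, prior_flags):
    """Logistic model: turn a weighted sum of features into a probability."""
    # Hypothetical coefficients; in practice these are fitted to labeled data.
    z = -4.0 + 0.0002 * amount_claimed + 1.5 * prior_flags
    return 1.0 / (1.0 + math.exp(-z))

p_low = fraud_probability(2000, 0)    # small claim, no prior flags
p_high = fraud_probability(15000, 2)  # large claim, two prior flags
```

The rule returns only true or false, whereas the model returns a probability that reviewers can rank, threshold, and revisit as more data arrives.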

Natural Language Processing

One powerful application of Machine Learning is Natural Language Processing (NLP). Data Scientists can deploy state-of-the-art neural networks to turn unstructured text data into business insights and user applications. Here are some capabilities of NLP:

Text Summarization: generate short summaries of long documents
Sentence Classification: automatically categorize sentences within a document
Named Entity Recognition: extract words or phrases that represent a concept of interest
Text Regression: predict numerical values (e.g., prices) from text descriptions
Unsupervised Topic Modeling: discover latent themes buried in large document sets
Document Similarity: find related documents based on thematic similarity
Open-Domain Question-Answering: submit questions to a large text corpus and receive exact answers

As an example of how these techniques could be used in the public sector, a grant processing organization might use NLP to create automatic summaries of applications. A preliminary stage of review could intelligently search the application for key terminology, taking advantage of a neural network's capability to understand context and leverage synonyms. Finally, the reviewers might use document similarity to compare the application against previous grant submissions, and then deploy a predictive model to evaluate the potential success of the new application.
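
The document similarity step can be illustrated with a deliberately simple stand-in: cosine similarity over word counts. This is not the neural network approach described above (it cannot recognize synonyms or context), and the application texts are invented, but it shows the basic idea of scoring how closely two documents match.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Bag of words: lowercase the text and count each word."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors.
    0 means no shared words; 1 means identical word proportions."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical grant application summaries, for illustration only.
new_application = "Funding for community water quality monitoring"
past_applications = {
    "A": "Community water quality monitoring program funding",
    "B": "Youth orchestra instrument purchase",
}

new_counts = word_counts(new_application)
scores = {
    key: cosine_similarity(new_counts, word_counts(text))
    for key, text in past_applications.items()
}
most_similar = max(scores, key=scores.get)
```

A production system would typically replace the word counts with embeddings from a language model so that related phrasing is matched even when the exact words differ.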

Computer Vision

The field of Computer Vision encompasses all image-related capabilities of Machine Learning. Three prominent use cases:

Image Classification: automatically categorize images across various dimensions
Image Detection: determine whether an image contains a specific entity
Image Regression: predict numerical values from photos

A defense agency might use these techniques to transform satellite images into situational intelligence. A city government could use traffic cameras at intersections to better understand local patterns in multi-modal transit, including bicycles, scooters, and pedestrians. Finally, a manufacturing operation could set up a camera to record the readings of manual gauges on dated equipment, using computer vision to turn the images into data.

Other Applications

Data Science can also be used to achieve the following:

Recommendation engines: systems for predicting user preferences. A civic technology app store might use this type of algorithm to suggest content on its platform.
Reinforcement learning: an iterative approach to challenges such as real-time vehicle routing
Generative AI: creative content produced by a large language model
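
As a small illustration of the recommendation engine idea, the sketch below suggests apps that users with overlapping histories have used. The user names, app names, and scoring scheme are all hypothetical; production systems rely on far richer signals and models.

```python
from collections import Counter

def recommend(target_user, history, top_n=2):
    """Suggest items used by people whose history overlaps with the target
    user's, skipping anything the target user already has."""
    seen = history[target_user]
    scores = Counter()
    for user, items in history.items():
        if user == target_user:
            continue
        overlap = len(seen & items)  # shared items act as a similarity weight
        for item in items - seen:
            scores[item] += overlap
    return [item for item, score in scores.most_common(top_n) if score > 0]

# Hypothetical usage history for a civic technology app store.
history = {
    "ana": {"transit_map", "park_finder"},
    "ben": {"transit_map", "park_finder", "permit_tracker"},
    "cy": {"transit_map", "bike_share"},
    "dee": {"tax_estimator"},
}
suggestions = recommend("ana", history)
```

Items favored by the most similar users rank first, and items from users with nothing in common are never suggested.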

Conclusion

Investing in Data Science supports a gradual maturation from basic reporting and business intelligence to predictive analytics and Machine Learning. Ultimately, Data Science can help organizations improve their services through streamlined operations, advanced insights, and augmented intelligence.

Nicole Janeway Bills

Data Strategy Professionals Founder & CEO

Nicole offers a proven track record of applying Data Strategy and related disciplines to solve clients' most pressing challenges. She has worked as a Data Scientist and Project Manager for federal and commercial consulting teams. Her business experience includes natural language processing, cloud computing, statistical testing, pricing analysis, ETL processes, and web and application development.