Comprehensive Glossary of Data Analytics & BI Terms

Welcome to our comprehensive glossary dedicated to the essential terms and concepts within the realms of Data Analytics and Business Intelligence (BI). Navigating the complex landscape of data can be daunting, especially with the continuous emergence of new technologies, methodologies, and terminologies. Whether you are a student, budding analyst, data scientist, business executive, or simply a curious learner, this glossary is designed to be your navigational compass, illuminating the path with clear, concise definitions.

Our glossary acts as a dynamic resource, providing clarity and understanding for terms ranging from the foundational elements of data to advanced analytical techniques. Each term is carefully explained with the aim of offering a balanced view, accessible to readers of various levels of expertise. Here, you’ll not only find definitions but also brief explanations that provide context and relevance to real-world applications.

Embark on your learning journey with confidence, armed with a resource that demystifies the jargon and intricacies of Data Analytics and BI. Use this glossary as a reference, a study aid, or as a tool to facilitate communication in professional settings. Dive in, explore, and enhance your knowledge and comprehension of the language of data!

A . B . C . D . E . F . G . H . I . J . K . L . M . N . O . P . Q . R . S . T . U . V . W . X . Z

A

  • Algorithm: A set of rules or procedures for solving a problem. In data analytics, algorithms are used to analyze and process data to extract valuable insights.
  • Analytics: The science of examining data to draw conclusions and support decision-making. It involves collecting, processing, and analyzing large datasets to uncover patterns and trends.
  • Anomaly Detection: The identification of data points, events, or observations that deviate significantly from the expected pattern in a dataset. It is widely used in fraud detection, quality control, and system monitoring.
  • API (Application Programming Interface): A set of rules that allows different software entities to communicate with each other. In data analytics, APIs are often used to access data from external services or platforms.
  • Association: A relationship between two entities or objects.
  • Attribute: A property or characteristic of an entity. Attributes hold the data that describes entities.
  • Augmented Analytics: The use of advanced technologies like machine learning and AI to automate data preparation, insight discovery, and sharing. It augments human intelligence, making the analytics process faster and more accessible to non-experts.
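To make the Algorithm and Anomaly Detection entries concrete, here is a minimal sketch in Python of one common anomaly-detection rule: flagging values that fall outside the interquartile range (IQR) fences. The data is hypothetical and the 1.5 multiplier is the conventional default, not a universal constant.

```python
import statistics

def iqr_anomalies(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] as anomalies."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

data = [10, 12, 11, 13, 12, 11, 95, 10, 12]
print(iqr_anomalies(data))  # the 95 is far outside the normal range
```

A z-score test (see the Z section) is an alternative rule; the IQR approach is often preferred for skewed data because quartiles are robust to extreme values.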

B

  • BI (Business Intelligence): A technology-driven process that analyzes data and presents actionable information to help executives, managers, and other corporate end-users make informed business decisions.
  • Business Analytics: A subset of business intelligence, business analytics focuses on statistical analysis and the use of business data to predict and improve business performance.
  • Big Data: Extremely large or complex datasets that traditional data-processing software cannot handle, analyzed to reveal patterns, trends, and associations. Big data is commonly characterized by its volume, variety, and velocity.

C

  • Calculated Metric: A metric derived from mathematical calculations on one or more existing measures. It’s used to create new insights from the available data.
  • Cardinality: The numerical nature of the relationship between two entities or tables, such as one-to-one, one-to-many, or many-to-many.
  • Column: In a table, a column holds data for a single attribute of an entity.
  • CSV (Comma Separated Values): A simple file format used to store tabular data, such as a spreadsheet or database. Each line of the file represents a data record, and each record consists of one or more fields, separated by commas.
  • Composite Key: A primary key that consists of more than one attribute.
  • Constraint: Rules enforced on data columns to preserve their accuracy and reliability.
  • Conceptual Model: An abstract representation of the relationships and entities within a system. It focuses on the high-level understanding of the system and provides a foundation for creating more detailed models. The conceptual model helps in defining the structure and scope of the data model, serving as a blueprint for designing the database schema and relationships between entities.
  • Correlation: A statistical measure that describes the association between two variables.
  • Clustering (Cluster Analysis): A technique used to group data points or items that are similar to each other. It's often used in market research, pattern recognition, and data analysis to identify and leverage patterns within the data.
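To illustrate the Correlation entry, here is a small sketch that computes the Pearson correlation coefficient from scratch in Python. The `ad_spend` and `sales` figures are invented for demonstration; real data would rarely be perfectly linear.

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

ad_spend = [1, 2, 3, 4, 5]
sales = [2, 4, 6, 8, 10]  # perfectly linear in ad_spend
print(pearson_r(ad_spend, sales))  # ≈ 1.0, a perfect positive correlation
```

Values near +1 or -1 indicate a strong linear association; values near 0 indicate little or no linear association. Note that correlation does not imply causation.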

D

  • Database: A structured collection of data that can be easily accessed, managed, and updated.
  • Data Dictionary: A collection of descriptions of data objects or items in a data model.
  • Data Exploration: The initial process of analyzing a dataset to discover its main characteristics and understand its structure, variables, and values. Data exploration is crucial for becoming familiar with a dataset, identifying anomalies, and detecting patterns or trends. This process often involves summarizing the main characteristics of a dataset using visual methods (charts, graphs, etc.) and descriptive statistics.
  • Data Lake: A storage repository that holds a vast amount of raw data in its native format. Data lakes allow for the storage and analysis of unstructured data, which is not possible with traditional databases.
  • Data Mart: A subset of a data warehouse that supports the requirements of a particular department or function.
  • Data Migration: This is the process of selecting, preparing, extracting, and transforming data and permanently transferring it from one computer storage system to another. This process is often necessary when an organization decides to use a new computing system or application. Data migration is crucial for ensuring that data is accurately and securely transferred, and it is accessible and functional within the new system. This process often involves data cleansing and the addition of new data structures.
  • Data Mining: The practice of examining large databases to generate new information. Data mining techniques discover patterns and relationships in the data that may not be apparent through traditional analytics.
  • Data Modeling: The process of creating a data model for the data to be stored in a database. This process involves defining how data is connected, accessed, and stored.
  • Data Science: An interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
  • Data Source: The location from which data originates. This could be a database, a data warehouse, a data lake, or external data sources accessed via APIs or other means.
  • Data Transformation: The process of converting data from one format or structure into another. It often involves cleaning, aggregating, enriching, and reformatting the data.
  • Data Type: The kind of data that can be stored in an attribute, such as integer, string, date, etc.
  • Data Walls: Visual displays of data used to track and improve the performance of individuals or groups within an organization. Data walls are often used in education and sales to encourage improvement and competition.
  • Data Warehouse: A central repository of integrated data collected from one or more disparate sources. It stores historical and current data in one place and is used for creating reports and data analysis. Data warehouses are essential components in the field of business intelligence, allowing for the retrieval and analysis of data to support decision-making processes.
  • Decision Trees: A predictive modeling approach. Decision trees are used for classification and regression tasks, providing a graphical representation that illustrates the decision-making process. The tree is constructed in a way that splits datasets into subsets based on the value of input variables, ultimately leading to a predicted output or decision.
  • Descriptive Analytics: This is the initial stage of data processing, summarizing and visualizing historical data to identify patterns, trends, and insights. Descriptive analytics helps businesses understand what has happened in the past and analyze the performance metrics, providing a solid basis for further analysis and decision-making.
  • Dimension (DIM or DIMS): A structure that categorizes data. Dimensions are used to slice and dice data in a data warehouse, providing a means to organize and group data. For example, a “Time” dimension might include hierarchy levels such as year, quarter, month, and day. Dimensions help in analyzing data in various ways and are crucial for creating meaningful reports.
  • Dimensionality Reduction: This is a technique used in data analytics and machine learning to reduce the number of input variables in a dataset. Dimensionality reduction is essential when dealing with datasets that have a large number of variables (high-dimensional), as it helps in reducing computational complexity, mitigating the risk of overfitting, and improving model performance.
  • Drill-Down: The process of exploring and visualizing data at more detailed levels. Users start with high-level data and then navigate down to more granular data by focusing on specific elements. Drill-down functionality is crucial in dashboards and reports, helping analysts and decision-makers to understand the details behind the summarized data.
  • Dashboard: A visual interface that presents data in an easy-to-read manner, often using charts and graphs. Dashboards are commonly used in business intelligence to display key performance indicators (KPIs).
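The Descriptive Analytics and Data Exploration entries above both center on summarizing a dataset's main characteristics. As a minimal sketch, the Python standard library's statistics module can produce a basic descriptive summary of a (hypothetical) numeric column:

```python
import statistics

def describe(values):
    """Summarize a numeric column: count, mean, min, max, sample std dev."""
    return {
        "count": len(values),
        "mean": round(statistics.fmean(values), 2),
        "min": min(values),
        "max": max(values),
        "stdev": round(statistics.stdev(values), 2),
    }

monthly_sales = [120, 135, 150, 145, 160, 155]  # illustrative data
print(describe(monthly_sales))
```

In practice this kind of summary is the first step of data exploration, usually alongside visual methods such as histograms and box plots.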

E

  • Entity: A thing or object of importance that needs to be represented in a database.
  • ETL (Extract, Transform, Load): A process that involves copying data from one or more sources into a destination system which represents the data differently from the source or in a different context.
  • Exploratory Data Analysis (EDA): An approach to analyzing datasets to summarize their main characteristics, often using statistical graphics and other data visualization methods.
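The ETL entry can be sketched end to end in a few lines of Python using only the standard library. Here the "source" is an in-memory CSV string standing in for a file, and the "destination" is an in-memory SQLite database; both are stand-ins chosen to keep the example self-contained.

```python
import csv
import io
import sqlite3

# Extract: read raw records from the source (a CSV here).
raw = "name,amount\nAlice, 100 \nBob,250\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: trim stray whitespace and cast amounts to integers.
clean = [(r["name"].strip(), int(r["amount"].strip())) for r in rows]

# Load: insert into the destination table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (name TEXT, amount INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?)", clean)
total = con.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 350
```

Production ETL pipelines add scheduling, error handling, and validation, but the extract-transform-load shape is the same.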

F

  • Fact Table: The central table in a star or snowflake schema that stores quantitative measures (facts), such as sales amounts or quantities, along with foreign keys referencing the related dimension tables.
  • Feature Engineering: The process of using domain knowledge to create features that make machine learning algorithms work. Feature engineering is crucial for applying machine learning effectively.
  • Foreign Key: An attribute or set of attributes in a table that refers to the primary key of another table.
  • Forecasting: The process of making predictions about future values based on historical data. This technique is used in various fields, including finance and weather prediction.
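As a minimal illustration of the Forecasting entry, here is one of the simplest possible methods: predicting the next value as the average of the most recent observations (a moving-average forecast). The sales series is hypothetical, and real forecasting would typically account for trend and seasonality.

```python
def moving_average_forecast(history, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)

sales = [100, 110, 120, 130, 140]
print(moving_average_forecast(sales))  # (120 + 130 + 140) / 3 = 130.0
```

Note that a moving average lags behind a steadily rising series, which is why trend-aware methods such as linear regression (see the L section) are often used instead.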

G

  • Grain (Granularity): The level of detail or depth of the data stored in a database or a dataset. Specifically, in data warehousing and business intelligence, the grain of data represents the finest level at which data is stored. Understanding the grain is crucial for effective data modeling and analysis as it influences how data can be interpreted and used.
  • GUI (Graphical User Interface): A type of user interface that allows users to interact with software through graphical icons and visual indicators, often used in data analytics tools for ease of navigation and operation.

H

  • Hierarchy: In data modeling, hierarchy refers to a structured arrangement of items in which items are organized into levels, with each level representing a certain degree of granularity or detail.
  • Histogram: A graphical representation of the distribution of a dataset, usually depicted as bars. It provides a visual interpretation of numerical data by showing the number of data points that fall within a range of values.
  • Hadoop: An open-source framework for distributed storage and processing of large data sets. Hadoop is designed to scale up from single servers to thousands of machines, each providing computation and storage.
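The Histogram entry boils down to counting how many values fall into each fixed-width bin, which can be sketched in a few lines of Python. The ages below are invented sample data.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count how many values fall into each fixed-width bin."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))

ages = [21, 22, 25, 31, 33, 38, 41, 44, 45, 47]
print(histogram(ages, bin_width=10))  # {20: 3, 30: 3, 40: 4}
```

Charting tools draw each bin as a bar whose height is its count; the counting logic underneath is exactly this.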

I

  • Index: A data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index data structure. Indexes are used to quickly locate a data record given its search key without having to search every row in a database table each time a database table is accessed. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.
  • Insights: Valuable pieces of information derived from data analysis. Insights often reveal trends, patterns, or anomalies that may be significant for business strategies and decision-making.
  • IoT (Internet of Things): Refers to the network of physical devices that are embedded with sensors and software to collect and exchange data. IoT generates large amounts of data that can be analyzed for insights.
  • In-memory Computing: A technology that stores data in the system’s main RAM (rather than on traditional disk drives) to improve performance, providing faster data retrieval and analysis.

J

  • JSON (JavaScript Object Notation): A lightweight data-interchange format that is easy for humans to read and write and easy for machines to parse and generate. It is often used for asynchronous browser/server communication.
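A short example makes the JSON entry concrete: the payload below is an invented API-style response, parsed and re-serialized with Python's built-in json module.

```python
import json

payload = '{"user": "alice", "events": [{"type": "click", "ts": 1700000000}]}'
data = json.loads(payload)            # parse JSON text into Python objects
print(data["user"])                   # alice
print(data["events"][0]["type"])      # click
print(json.dumps(data))               # serialize back to a JSON string
```

Because nearly every modern language ships a comparable parser, JSON has become the default interchange format for web APIs and many data pipelines.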

K

  • K-Means: An unsupervised clustering algorithm that partitions a dataset into k clusters, assigning each data point to the cluster whose centroid (mean) is nearest, then recomputing the centroids until the assignments stabilize.
  • Key: An attribute or set of attributes that uniquely identifies an instance of an entity.
  • KPI (Key Performance Indicator): A measurable value that demonstrates how effectively a company is achieving key business objectives. KPIs are used by organizations to evaluate their success at reaching targets.
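The K-Means entry can be sketched in pure Python for the one-dimensional case: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points, and repeat. The data and starting centroids below are invented; real implementations work in many dimensions and choose starting centroids automatically.

```python
import statistics

def kmeans_1d(points, centroids, iters=10):
    """Tiny 1-D k-means: assign points to the nearest centroid, then re-center."""
    for _ in range(iters):
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [statistics.fmean(ps) if ps else c for c, ps in clusters.items()]
    return sorted(centroids)

data = [1, 2, 3, 10, 11, 12]
print(kmeans_1d(data, centroids=[1, 10]))  # [2.0, 11.0]
```

The algorithm converges to two cluster centers, one per natural grouping in the data. K-means is sensitive to the initial centroids and to the choice of k, which is why production libraries run multiple random restarts.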

L

  • Linear Regression: A statistical method used to model the relationship between a dependent variable and one or more independent variables. The method assumes the relationship between the variables is linear.
  • Log Files: Files that record events occurring in an operating system or other software, or messages exchanged between users of a communication application.
  • Logical Model: A representation of the logical entities, attributes, and relationships among the entities. It provides a conceptual view of the data, abstracting away from physical storage and implementation details. The logical model typically serves as a blueprint for designing the physical database and helps clarify how data should be organized and how relationships among data are handled.
  • Lookup Table: A table that holds discrete values that can be used to represent or translate other values. Lookup tables are often used in data transformation processes to map source values to target values.
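As a concrete companion to the Linear Regression entry, here is an ordinary least squares fit for a single independent variable, written from scratch. The x and y values are invented and chosen to lie exactly on a line so the fitted coefficients are easy to verify.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]  # exactly y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

With the coefficients in hand, predicting a new value is just `slope * x + intercept`, which is how regression connects to the Forecasting entry above.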

M

  • Machine Learning (ML): A subset of AI that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. ML is pivotal for analyzing large volumes of data and making data-driven predictions or recommendations.
  • Measure: In data analytics, a measure is a quantifiable data point or metric that can be analyzed. Measures are typically numerical data that can be aggregated.
  • Metric: A quantifiable measure used to track and assess the status of a specific process. In data analytics, metrics are used to provide insights and are the basis for further analysis.
  • Metadata: Data that provides information about other data. Metadata summarizes basic information about data, making finding and working with particular instances of data easier.
  • Multidimensional Analysis: Analytical processing that involves viewing data through various dimensions. It enables the user to analyze data from different perspectives and supports complex calculations.
  • Multidimensional Cubes: In OLAP databases, these are data structures that allow fast retrieval of data for analytical queries. Each dimension represents a different perspective for analysis.

N

  • Nested Queries: SQL queries where one query (the inner query) is embedded within another query (the outer query). Nested queries are used to retrieve data that will be used in the main query as a condition to further restrict the data to be retrieved.
  • Normalization: A process used to organize a database to reduce redundancy and improve data integrity by grouping properties into tables based on relationships.
  • NoSQL: A class of database systems that provide a mechanism for storage and retrieval of data that is modeled in means other than tabular relations used in relational databases. NoSQL is particularly useful for storing unstructured or semi-structured data.
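The Nested Queries entry is easiest to see in running SQL. In the sketch below (using Python's built-in sqlite3 with invented order data), the inner query computes the average order amount and the outer query uses that result as its filter condition.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("Alice", 50), ("Bob", 200), ("Cara", 300)])

# The inner query's single value (the average) restricts the outer query.
big_spenders = con.execute("""
    SELECT customer FROM orders
    WHERE amount > (SELECT AVG(amount) FROM orders)
    ORDER BY customer
""").fetchall()
print(big_spenders)  # [('Bob',), ('Cara',)]
```

Only the customers whose order amount exceeds the overall average are returned, which is exactly the "inner result restricts the outer query" pattern the definition describes.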

O

  • OLAP (Online Analytical Processing): A category of software tools that allows users to analyze data from multiple dimensions, supporting complex calculations, trend analysis, and sophisticated data modeling.
  • OLAP Cube: A multi-dimensional array of data optimized for querying and reporting. Cubes are used in OLAP (Online Analytical Processing) databases to allow users to analyze data along multiple dimensions.
  • OLTP (Online Transaction Processing): A category of data processing that supports real-time, transaction-oriented applications. OLTP systems are optimized for fast query processing and maintaining data integrity in multi-access environments.
  • Outlier: An observation that lies an abnormal distance from other values in a random sample from a population. In data analysis, identifying outliers is crucial for accurate data interpretation.

P

  • Pattern Recognition: The process of identifying and classifying data patterns or regularities. It's crucial in various applications, including data mining, image and speech analysis, and statistics.
  • Predictive Analytics: Techniques that use statistical algorithms and machine learning to identify patterns in data and forecast future outcomes and trends. Predictive analytics does not state definitively what will happen; it provides an estimate of what is likely.
  • Prescriptive Analytics: Analytics that not only anticipates what will happen and when, but also recommends actions and explains their likely impact, so that organizations can benefit from the predictions.

Q

  • Quantitative Data: Data that can be measured and written down with numbers. It’s often collected for statistical analysis to help understand patterns and make predictions.
  • Query: A request to retrieve data from a database. Queries locate specific data by applying filter criteria.

R

  • R (Programming Language): A programming language and free software environment for statistical computing and graphics. It's widely used for data analysis and visualization.
  • Record: A row in a table, which contains data about a specific item.
  • Regression Analysis: A set of statistical processes for estimating the relationships among variables. It helps in understanding how the value of the dependent variable changes when any one of the independent variables is varied.
  • Relational Database: A type of database that uses a structure that allows users to identify and access data in relation to another piece of data in the database, often used to organize and manage large amounts of data.
  • Report: A document that visually communicates the results of data analysis. Reports often include graphs, charts, tables, and narrative text to convey information and insights derived from data. They can be interactive or static, and they serve as a vital tool for decision-makers to understand business performance, trends, and areas that need attention.
  • Row: A record in a database table.

S

  • Sample: A subset of individuals or data points from within a statistical population.
  • Schema: A blueprint or structure that represents the logical configuration of a database. It defines how data is organized and how relationships between data are handled. Schemas are used to map out the structure of data and to define constraints on the data, ensuring that data in the database is accurate and reliable.
  • Segmentation: The process of dividing a dataset or population into smaller groups (segments) whose members share common characteristics, such as segmenting customers by purchasing behavior.
  • SQL (Structured Query Language): A domain-specific language used to manage and manipulate relational databases, including querying data, updating data, inserting data, and deleting data from a database.
  • Self-Service BI: This is a form of business intelligence where end-users can create their own reports and dashboards without technical assistance. Self-service BI tools are designed to be user-friendly, allowing individuals with no technical expertise to visualize and analyze data, thereby enabling them to make informed business decisions.
  • Slice and Dice: The ability to break down a dataset into smaller parts and look at it from different angles and levels of detail. This process helps users analyze various dimensions of data to extract meaningful insights. Users can "slice" data to view a subset and "dice" data to analyze it in different ways.
  • Sentiment Analysis: A technique used to determine the attitude, opinion, or feeling expressed in a piece of text, which is essential for social media monitoring, product reviews, and customer service.
  • Snowflake Schema: An extension of the star schema used in a data warehouse, where the related dimension tables are normalized, resulting in a structure that uses less disk space and looks like a snowflake.
  • Star Schema: A type of database schema in data warehousing where a central fact table is connected to one or more dimension tables using foreign keys. It resembles a star, with the fact table in the center and the dimension tables radiating outwards.
  • Statistical Analysis: The collection and interpretation of data in order to uncover underlying patterns.
  • Stored Procedure: A precompiled collection of one or more SQL statements and, optionally, control-of-flow statements, stored under a name and processed on the database server. Stored procedures can be invoked by triggers, other stored procedures, or applications, and are used for tasks such as data validation, access control, and performance improvement.
  • Structured Data: Data that adheres to a pre-defined data model and is therefore straightforward to analyze.
  • Supervised Learning: A type of machine learning where the algorithm is trained on a labeled dataset, which means that each training example is paired with an output label.
  • Surrogate Key: A unique identifier for either an entity in the modeled world or an object in the database. It is a system-generated, artificial key, not derived from the application data. Surrogate keys are often used as the primary key in a table, serving as a stand-in for natural keys that are cumbersome or have other issues.
  • System Performance: The effectiveness and efficiency of a computational system in processing and analyzing data to generate desired insights and reports. In the context of data analytics and BI, it encompasses several aspects including query performance, data loading speed, data transformation efficiency, and responsiveness of the data visualization and reporting tools. Optimizing system performance is crucial for ensuring that data analytics and BI tools can handle large datasets and complex analyses in a timely manner, providing users with the insights they need without unnecessary delay. Performance can be influenced by hardware specifications, system architecture, database design, and the efficiency of the algorithms used for data processing and analysis.
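Several of the entries above (Star Schema, Schema, SQL) come together in one runnable sketch: a miniature star schema built with Python's sqlite3, where a fact table of sales joins to a product dimension table and revenue is aggregated per product. The tables and figures are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Dimension table: descriptive attributes of each product.
con.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT)")
# Fact table: measures plus a foreign key into the dimension.
con.execute("CREATE TABLE fact_sales (product_id INTEGER, revenue INTEGER)")
con.executemany("INSERT INTO dim_product VALUES (?, ?)",
                [(1, "Widget"), (2, "Gadget")])
con.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                [(1, 100), (1, 50), (2, 75)])

# Join the fact table to its dimension and aggregate revenue per product.
result = con.execute("""
    SELECT d.name, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product d ON d.product_id = f.product_id
    GROUP BY d.name
    ORDER BY d.name
""").fetchall()
print(result)  # [('Gadget', 75), ('Widget', 150)]
```

A snowflake schema would take the same shape but further normalize the dimension table, for example splitting product category into its own table.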

T

  • Table: A structure that organizes data into rows and columns.
  • Table Joins: A method in SQL for retrieving data from two or more tables based on related columns between them. Types include INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN.
  • Tabular Models: A type of data model used in analysis, particularly with Microsoft Analysis Services, that is efficient for querying and offers rapid performance with huge amounts of data.
  • Transaction: A sequence of queries that represents a logical unit of work.
  • Time Series Data: A series of data points indexed, listed, or graphed in time order. It's often used to track changes over time, such as monitoring stock prices or tracking sales data.
  • Training Data: The dataset used to train a machine learning model. The training data helps the model understand and learn the relationships among the data.

U

  • Unique Key: A set of one or more attributes that uniquely identifies each record in a database table. While similar to a primary key, tables can have multiple unique keys but only one primary key. A unique key constraint ensures that all values in the specified column(s) are unique across the table. Each unique key corresponds to a specific record, and no two records can have the same unique key value.
  • Unstructured Data: Information that either doesn't have a pre-defined data model or isn't organized in a pre-defined manner. It includes formats like text, images, and videos.
  • Unsupervised Learning: A type of machine learning where the algorithm is given data without explicit instructions on what to do with it. The system tries to learn the patterns and the structure from the data.

V

  • Variable: A characteristic or attribute that can assume different values. In data analytics and machine learning, variables can be categorized as dependent (target) or independent (feature).
  • View: A virtual table that represents the result of a SELECT query.
  • Visualization: The representation of data in a graphical or pictorial format. Visualization tools and techniques help analysts to understand complex data sets by arranging data in a visual context.

W

  • Web Analytics: The process of analyzing the behavior of visitors to a website. It helps in attracting more visitors, retaining or attracting new customers, or increasing the dollar volume each customer spends.

X

  • XML (eXtensible Markup Language): A markup language designed to store and transport data. It uses tags to define elements within the data, making it both human-readable and machine-readable.

Z

  • Z-Score: A statistical measurement that describes a value's relationship to the mean of a group of values. It is measured in terms of standard deviations from the mean, helping identify outliers in the data.
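The Z-Score entry translates directly into a few lines of Python: subtract the mean from each value and divide by the standard deviation. The data below is invented, with one value deliberately far from the rest.

```python
import statistics

def z_scores(values):
    """Standardize values: how many standard deviations each is from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [round((v - mean) / sd, 2) for v in values]

data = [10, 12, 11, 13, 54]
print(z_scores(data))  # the last value's score stands far above the others
```

A common rule of thumb treats values with a z-score beyond roughly 2 or 3 (in absolute value) as outlier candidates, linking this entry back to Anomaly Detection and Outlier above.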

Learn Power BI by Studying Real-World Reports


Download free PBIX report files used in real-world situations and amend them to suit your own projects and reports.


Directory Listing Report

This PBIX file shows what is possible using Power BI's various map visuals (Map, Filled Map, and ESRI ArcGIS). The data was taken from a listing on the Microsoft website, then transformed and modeled to allow for geographic grouping, filtering, and price comparison.

Demystifying Business Intelligence

This book introduces the reader to the world of Business Intelligence: what it is and what its uses are. It explains in detail the various components that make up a BI solution for businesses and professionals of any size.