Introduction to Big Data Techniques

Learning outcomes
- Describe the parts of fintech that matter for gathering and analyzing financial data.
- Describe Big Data, artificial intelligence, and machine learning.
- Describe applications of Big Data and data science to investment management.

SEE THIS BEFORE EXAM

Big Data = 3 Vs + 1 warning V
- Volume: a huge amount of data.
- Velocity: data arrive fast, often in real time or near real time.
- Variety: data come in many formats and from many sources.
- Veracity: when using Big Data for inference or prediction, always ask whether the data are credible and reliable.

Data types
- Structured data: fits into tables.
- Semistructured data: partly organized, partly messy.
- Unstructured data: cannot be stored cleanly in rows and columns.

Main alternative data sources
- Individuals
- Business processes
- Sensors

Main ML classes
- Supervised learning: labeled inputs and outputs.
- Unsupervised learning: unlabeled data, search for structure.
- Deep learning: neural networks with many hidden layers.

Key processing steps
- Capture
- Curation
- Storage
- Search
- Transfer

Text tools
- Text analytics: computer analysis of large text or voice datasets.
- NLP: programs that analyze and interpret human language.

Big Data looks exciting. What is the exam trap?
Big volume is useless if veracity is weak, data are biased, or the dataset is wrong for the analysis.

A model predicts beautifully on training data but fails on new data. What happened?
Think overfitting first: it learned noise as if it were truth.

You are given labels and targets. Which ML bucket?
Supervised learning.

You are grouping firms without labels. Which ML bucket?
Unsupervised learning.

You need trading decisions from real-time prices. What system feature matters?
Low latency, because delays can break the application.

MEMORISE

Fintech is finance plus technology-driven innovation.
Big Data is not just big files; it is volume, velocity, variety, and often a veracity problem.
AI performs tasks that traditionally required human intelligence.
ML tries to find the pattern and apply the pattern.
NLP is the text-and-language weapon inside the larger text analytics world.

Core Idea

This module is about what happens when finance collides with massive data, fast computing, and machines that can learn patterns humans would miss.
The source frames this through fintech. What is fintech: technology-driven innovation in the design and delivery of financial services and products.
In common usage, fintech can also refer to the companies building these tools and the business sector around them.
Early fintech automated routine tasks. Later systems executed decisions through explicit rules. Now some systems learn and make decisions using far more complex logic.
These tools matter not only for quant managers but also for fundamental managers using hybrid decision-making processes.

Fintech in Investment Analysis

The source highlights two areas that matter most for quantitative investment analysis: analysis of large datasets and analytical tools.
The first area is straightforward. There is far more traditional data and far more alternative data than older workflows were built to handle.
Traditional data include prices, volumes, corporate financial statements, economic indicators, annual reports, regulatory filings, earnings figures, and conference calls.
Alternative data come from non-traditional sources like social media, sensor networks, web traffic, emails, texts, satellites, and company exhaust.
What is company exhaust: information generated in the normal course of doing business.
The second area is the toolset itself. AI-based techniques may spot complex, non-linear relationships that older statistical methods can miss or process too slowly.
Analysts can use these tools to sort through gigantic filing sets, annual reports, and earnings calls to identify what matters most.

WHEN THE MACHINE READS BEFORE THE ANALYST FINISHES COFFEE

Imagine earnings season opening like floodgates. Filings pile up, conference-call transcripts start landing, and commentary keeps pouring in. A human team can sample. A machine can sweep the whole field, rank the language shifts, and flag where sentiment cracked before the headline numbers did.

Big Data

Big Data refers to the vast amount of information generated by industry, governments, individuals, and electronic devices.
What is Big Data: extremely large and diverse datasets that often arrive quickly and require new ways to store, process, and analyze them.
The source says Big Data is characterized by three Vs: volume, velocity, and variety.
Volume means the dataset is huge, often millions or billions of data points.
Velocity means the data are recorded and transmitted much faster than before, often in real time or near real time.
Variety means the data come from many sources and in many formats rather than one clean tabular structure.
When Big Data is used for inference or prediction, the source adds a fourth V: veracity.
What is veracity: the credibility and reliability of the data source and the truthfulness of the data content.
This is the grown-up warning in the whole section. Big Data does not solve the old quality problem; it amplifies it.

Structured, Semistructured, and Unstructured Data

Big Data can be structured, semistructured, or unstructured.
Structured data fit naturally into tables and databases, where each field represents the same type of information.
Semistructured data have some organizational features but do not fit neatly into traditional rows-and-columns storage.
Unstructured data are disparate and unorganized and cannot be represented cleanly in tabular form.
Social media posts, emails, texts, voice recordings, pictures, blogs, scanners, and sensors often create unstructured data.
Why does this matter: unstructured data usually need specialized applications or customized code before analysts can use them meaningfully.
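A minimal sketch of why structure matters, using made-up records (the ticker ABC, the field names, and the sample sentence are illustrative assumptions, not from the source). The same fact is stored three ways: the structured and semistructured versions can be read with standard parsers, while the unstructured version needs customized code, here a crude pattern match.

```python
import csv
import io
import json
import re

# Hypothetical sample records, invented for illustration only.
structured = "ticker,quarter,revenue_musd\nABC,Q1,120.5\n"
semistructured = (
    '{"ticker": "ABC", "quarter": "Q1", '
    '"metrics": {"revenue_musd": 120.5}, "notes": ["one-off gain"]}'
)
unstructured = (
    "ABC said first-quarter revenue came in at about $120.5 million, "
    "helped by a one-off gain."
)

# Structured: every row has the same fields, so a generic CSV reader works.
row = next(csv.DictReader(io.StringIO(structured)))
revenue_from_table = float(row["revenue_musd"])

# Semistructured: keys and nesting give some organization,
# but records may differ in shape from one document to the next.
doc = json.loads(semistructured)
revenue_from_json = doc["metrics"]["revenue_musd"]

# Unstructured: no fields at all, so custom code must locate the number.
match = re.search(r"\$(\d+(?:\.\d+)?) million", unstructured)
revenue_from_text = float(match.group(1)) if match else None

print(revenue_from_table, revenue_from_json, revenue_from_text)
```

The regex only works for this one phrasing, which is the point: unstructured text needs purpose-built handling before it becomes usable data.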

Sources of Big Data

The source names five broad origins of Big Data: financial markets, businesses, governments, individuals, and sensors.
Financial-market data include equity, fixed income, futures, options, and other derivatives data.
Business data include corporate financials, commercial transactions, credit card purchases, supply-chain data, and point-of-sale scanner data.
Government data include trade, economic, employment, and payroll data.
Individual-generated data include product reviews, search logs, personal data trails, and social media posts.
Sensor-generated data include satellite imagery, geolocation, shipping information, traffic patterns, and other machine-generated streams.
The Internet of Things sits inside this sensor story. What is the Internet of Things: a network of physical devices embedded with electronics, sensors, software, and connections so they can interact and share information.

Alternative Data

Alternative data are non-traditional data used to support investment decisions and models.
The source classifies the three main alternative-data sources as data generated by individuals, business processes, and sensors.
Data generated by individuals are often unstructured and can appear as text, video, photos, audio, website clicks, and browsing behavior.
Data generated by business processes are often structured and can act as leading or real-time indicators of performance.
What is a leading indicator here: information that may signal performance before traditional quarterly or annual reports arrive.
Sensor data often have the greatest scale, sometimes orders of magnitude larger than the other streams.
Investors use alternative data to search for factors affecting prices, improve asset selection, improve trade execution, and uncover trends.
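A small worked example of the leading-indicator idea, assuming hypothetical weekly card-spending totals for one retailer (all figures are invented, and the function name is illustrative). It compares quarter-to-date spending with the same weeks a year earlier to get a rough growth read before any official report exists.

```python
# Hypothetical weekly card-spending totals (in $ thousands), invented for illustration.
last_year_weeks = [100, 102, 98, 101, 99, 100]
this_year_weeks = [108, 110, 107, 111, 109, 110]

def growth_estimate(current, prior):
    """Year-over-year growth of summed spending over matching weeks."""
    return sum(current) / sum(prior) - 1.0

nowcast = growth_estimate(this_year_weeks, last_year_weeks)
print(f"Quarter-to-date spending growth: {nowcast:.1%}")
```

This is only a sketch: card data cover a sample of customers, so the veracity and selection-bias questions still apply before trusting the number.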

SEEING A BUSINESS BEFORE THE QUARTER CLOSES

A retail chain may not have reported quarterly sales yet, but scanner data, card transactions, parking-lot imagery, and web traffic already whisper the story. The point is not magic. The point is timing. Alternative data can act before the official report catches up.

Legal and Ethical Caution

The source explicitly warns that alternative data can create legal and ethical issues.
Web scraping may capture personal information protected by regulation or shared without explicit knowledge and consent.
Best practices are still evolving, and guidance can differ across jurisdictions.
Exam-wise, remember this as a compliance smell test: data that look public are not automatically safe to use just because a machine collected them.

Big Data Challenges

Big Data creates practical problems before analysis even begins.
The source asks whether the dataset suffers from selection bias, missing data, or outliers, whether its volume is sufficient, and whether it is fit for the intended analysis.
What is selection bias: a dataset distortion caused by how observations were chosen rather than by the true underlying phenomenon.
What are outliers: observations far away from the rest that can distort models or summaries.
Most datasets must be sourced, cleansed, and organized before analysis.
Alternative data make this harder because they are often unstructured and more qualitative than quantitative.
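A minimal sketch of two of these checks, missing data and outliers, on hypothetical sensor readings (the numbers are invented). The outlier rule is an assumption, not something the source specifies: it uses a median-based "modified z-score" with the conventional 0.6745 scaling constant and a 3.5 cutoff.

```python
import statistics

# Hypothetical daily foot-traffic counts; None marks a missing reading.
readings = [510, 495, None, 502, 4990, 488, None, 505, 499, 507]

def quality_report(values, z_cutoff=3.5):
    """Count missing points and flag outliers with a median-based z-score."""
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    med = statistics.median(present)
    # Median absolute deviation: far less distorted by the outliers
    # themselves than a standard deviation would be.
    mad = statistics.median([abs(v - med) for v in present])
    outliers = [
        v for v in present
        if mad > 0 and 0.6745 * abs(v - med) / mad > z_cutoff
    ]
    return {"n": len(values), "missing": missing, "outliers": outliers}

report = quality_report(readings)
print(report)
```

Flagging is the easy part. Deciding whether 4990 is a sensor glitch or a real event is where human judgment comes back in.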

Artificial Intelligence

Artificial intelligence refers to computer systems capable of performing tasks that traditionally required human intelligence.
What is AI used for here: to perform or assist with complex analysis, pattern recognition, and decision-support tasks at human-comparable or superior levels.
An early AI example is the expert system, which tried to simulate human expertise using rule-based logic like if-then rules.
Since the 1980s, finance has used AI tools such as neural networks in fraud detection for abnormal charges and claims.
The broader story is that stronger processors and better networks allowed AI to move from toy problems into logistics, data mining, finance, and diagnosis.
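To make the expert-system idea concrete, here is a minimal sketch of rule-based logic for screening a card charge. The rules, thresholds, and field names are hypothetical. The key feature is that a human wrote every rule explicitly; nothing is learned from data.

```python
# Hypothetical if-then rules; thresholds and field names are invented.
RULES = [
    (lambda c: c["amount"] > 5000, "amount above 5000"),
    (lambda c: c["country"] != c["home_country"], "charge outside home country"),
    (lambda c: c["hour"] < 5, "charge between midnight and 5am"),
]

def review_charge(charge):
    """Apply every explicit rule and return the reasons that fired."""
    return [reason for test, reason in RULES if test(charge)]

charge = {"amount": 7200, "country": "FR", "home_country": "US", "hour": 14}
flags = review_charge(charge)
print(flags)
```

Every flag can be traced to a named rule, which makes this kind of system easy to explain but unable to spot patterns nobody thought to write down.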

Machine Learning

Machine learning is a branch of AI that grew out of this evolution. What is machine learning: computer-based techniques that extract knowledge from large datasets by learning patterns from examples.
The source gives a very exam-friendly summary: find the pattern, apply the pattern.
ML does not require assumptions about the data's underlying probability distribution in the way many traditional methods do.
In ML, the algorithm is given inputs and may also be given outputs, depending on the learning setup.
Training occurs as the algorithm identifies relationships in the data and refines its learning process.
The source says ML usually splits data into training, validation, and test datasets.
Training data help the model learn historical relationships between inputs and outputs.
Validation data help tune the model and check whether the learned relationships hold up.
Test data evaluate whether the model predicts well on new data.
Human judgment still matters. Why is human judgment used: people must understand the data, clean the data, and choose appropriate analytical techniques.
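The three-way split can be sketched as follows. The 60/20/20 proportions, the function name, and the random shuffle are assumptions for illustration; the source does not prescribe them.

```python
import random

def three_way_split(records, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle a copy of the records and cut it into training, validation, and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]                    # learn the relationships
    valid = shuffled[n_train:n_train + n_valid]   # tune and sanity-check the model
    test = shuffled[n_train + n_valid:]           # final check on unseen data
    return train, valid, test

observations = list(range(100))  # stand-ins for 100 labeled observations
train, valid, test = three_way_split(observations)
print(len(train), len(valid), len(test))
```

Shuffling is used here only for simplicity. With time-ordered financial data, a chronological split is often preferred so the model is never tested on dates earlier than those it trained on.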

Overfitting and Underfitting

Overfitting is one of the biggest exam traps in this reading.
What is overfitting: the model learns the training data too precisely and treats noise as if it were true signal.
An overfitted model can look brilliant on the training set and fail embarrassingly on a different dataset.
Underfitting is the opposite mistake. What is underfitting: the model is too simple and treats true structure as if it were noise.
Underfitted models fail to discover important patterns that actually exist in the data.
Some ML methods also look like black boxes. What is black box here: a model whose path from input to output is hard to explain clearly.
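A minimal sketch of overfitting on simulated data (the data-generating rule, noise level, and both toy models are assumptions for illustration). The "memorizer" reproduces each training point exactly, noise included, so its training error is zero by construction. The straight line is simpler and cannot chase the noise.

```python
import random

rng = random.Random(42)

def make_data(n):
    """Hypothetical data: y = 2x plus random noise. The true pattern is the line."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + rng.gauss(0, 3)) for x in xs]

train, new = make_data(30), make_data(30)

def memorizer(x):
    """Overfit model: return the y of the closest training x, noise and all."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def fit_line(data):
    """Ordinary least-squares line, a simpler model that averages the noise away."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    slope = (sum((x - mx) * (y - my) for x, y in data)
             / sum((x - mx) ** 2 for x, _ in data))
    return lambda x: my + slope * (x - mx)

def mse(model, data):
    """Mean squared prediction error of a model on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

line = fit_line(train)
for name, model in [("memorizer", memorizer), ("line", line)]:
    print(f"{name}: train MSE {mse(model, train):.2f}, "
          f"new-data MSE {mse(model, new):.2f}")
```

The memorizer looks perfect on the data it studied and then pays for it on new data. The exact numbers depend on the random draw, but its new-data error is typically worse than the line's.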

MODEL TRAP

If a model looks perfect on the data it studied, do not clap yet. The source's warning is brutal: perfect fit can mean overtraining, false relationships, and prediction failure on new data.

Types of Machine Learning

Supervised learning uses labeled training data, meaning the algorithm is given both inputs and known outputs.
Supervised learning is useful when you want prediction, such as forecasting returns or predicting whether a market will be up, down, or flat.
Unsupervised learning uses data without labels and tries to uncover structure, groupings, or relationships.
Grouping companies into peer clusters based on characteristics instead of sector labels is a clean unsupervised example from the source.
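The peer-clustering example can be sketched with k-means, one common clustering algorithm. The choice of k-means, the firm names, and the numbers are assumptions for illustration, not something the source specifies. Note that no labels are supplied: the algorithm only sees the characteristics.

```python
# Hypothetical firms described by (revenue growth %, debt-to-equity). Values are invented.
firms = {
    "A": (25.0, 0.2), "B": (22.0, 0.3), "C": (27.0, 0.1),
    "D": (3.0, 1.5), "E": (2.0, 1.8), "F": (4.0, 1.6),
}

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=10):
    """Plain k-means: assign points to the nearest center, then re-center each group."""
    centers = points[:k]  # naive start: the first k points
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centers[i]))
            groups[nearest].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return [min(range(k), key=lambda i: dist2(p, centers[i])) for p in points]

names = list(firms)
labels = kmeans([firms[n] for n in names], k=2)
clusters = dict(zip(names, labels))
print(clusters)
```

The cluster numbers themselves carry no meaning; what matters is which firms end up together. In practice the features would usually be rescaled first so that one large-valued characteristic does not dominate the distance.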
Deep learning uses neural networks, often with many hidden layers, to perform multistage, non-linear processing.
Deep learning can operate in supervised or unsupervised settings.
Why are hidden layers used: they let the model build from simple patterns toward more complex representations in stages.
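A minimal sketch of multistage, non-linear processing: a forward pass through a tiny network with two hidden layers. The weights are hand-picked, hypothetical values, and the ReLU and sigmoid functions are common choices assumed here for illustration. A real network would learn its weights from data and have far more layers and units.

```python
import math

def relu(values):
    """Non-linear step: negative signals are switched off."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """One layer: each output is a weighted sum of all inputs plus a bias."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]

# Hypothetical, hand-picked weights for a 3-input network.
W1, b1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]
W2, b2 = [[1.0, -1.0], [0.5, 0.5]], [0.0, -0.1]
W3, b3 = [[0.7, -0.3]], [0.05]

def forward(x):
    h1 = relu(dense(x, W1, b1))    # first hidden layer: simple combinations of inputs
    h2 = relu(dense(h1, W2, b2))   # second hidden layer: combinations of combinations
    z = dense(h2, W3, b3)[0]       # output layer: a single raw score
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid squashes the score into (0, 1)

score = forward([1.0, 2.0, 3.0])
print(round(score, 4))
```

Each layer only sees the previous layer's output, which is how the staged build-up from simple to complex patterns happens.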

WHEN THE NETWORK STARTS SEEING WHAT HUMANS ONLY GLIMPSE

Watson winning Jeopardy, DeepMind beating Go masters, and recommendation engines quietly steering product choices all come from the same deeper shift: the machine is no longer just obeying fixed instructions. It is building layered pattern recognition from enormous data exposure.

Data Science

Data science is the interdisciplinary field that combines computer science, statistics, and related disciplines to extract information from data.
Data scientists are the people trying to convert raw data into usable insight for business and investment decisions.
The structure of the data matters because unstructured alternative data often need special treatment before analysis begins.

Data Processing Methods

The source highlights five key processing methods: capture, curation, storage, search, and transfer.
Capture refers to how data are collected and transformed into a format the analytical process can use.
Low-latency systems matter when automated trading depends on real-time prices and market events.
What is low latency: minimal delay in communication and processing.
Curation refers to cleaning the data and ensuring quality and accuracy.
Storage refers to how data are recorded, archived, and accessed, and to the underlying database design.
Search refers to how the system queries and retrieves the requested content from very large datasets.
Transfer refers to how data move from the source or storage location into the analytical tool.
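The five steps can be sketched end to end on a toy feed. The feed contents, table name, and function names are hypothetical, and an in-memory SQLite database stands in for real storage.

```python
import csv
import io
import sqlite3

# Hypothetical raw feed; tickers, prices, and the incomplete row are invented.
RAW_FEED = "ticker,price\nABC,101.5\nXYZ,\nABC,102.0\nXYZ,55.25\n"

def capture(raw_text):
    """Capture: read the raw feed into records the rest of the process can use."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def curate(records):
    """Curation: drop rows with missing prices and convert text to numbers."""
    return [(r["ticker"], float(r["price"])) for r in records if r["price"]]

def store(rows):
    """Storage: write the clean rows to a database (in-memory here)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE prices (ticker TEXT, price REAL)")
    conn.executemany("INSERT INTO prices VALUES (?, ?)", rows)
    return conn

def search(conn, ticker):
    """Search: query only the requested content."""
    return conn.execute(
        "SELECT price FROM prices WHERE ticker = ? ORDER BY rowid", (ticker,)
    )

def transfer(cursor):
    """Transfer: move the query result into the analytical tool (a plain list here)."""
    return [price for (price,) in cursor]

conn = store(curate(capture(RAW_FEED)))
abc_prices = transfer(search(conn, "ABC"))
print(abc_prices)
```

At real scale each step becomes its own engineering problem, but the order of operations stays the same.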

Data Visualization

Visualization is not cosmetic. It is a thinking tool for understanding large and complex datasets.
Traditional structured data can be shown with tables, charts, and trends.
Unstructured or multidimensional data often need richer visual forms like interactive 3D graphics, heat maps, tree diagrams, network graphs, tag clouds, and mind maps.
What is a tag cloud: a visual display where words appear larger when they occur more frequently in the source text.
What is a mind map: a visualization showing how concepts relate to one another rather than just how often words appear.
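The tag-cloud idea reduces to counting words and scaling font size by frequency. A minimal sketch, assuming a made-up text snippet, a tiny stop-word list, and a linear size scale (all three are illustrative choices):

```python
import re
from collections import Counter

# Hypothetical snippet of text, invented for illustration.
text = ("Growth slowed but margin held. Margin pressure may return "
        "if growth stays weak. Margin matters.")
STOPWORDS = {"but", "may", "if", "the", "a"}

# Lowercase, split into words, and drop filler words.
words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
counts = Counter(words)

def font_size(count, max_count, smallest=10, largest=40):
    """Scale font size linearly with word frequency."""
    return smallest + (largest - smallest) * count / max_count

max_count = max(counts.values())
cloud = {word: font_size(n, max_count) for word, n in counts.most_common(3)}
print(cloud)
```

A mind map would need more than counts: it has to capture which concepts relate to which, not just how often each word appears.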

Text Analytics and NLP

Text analytics uses computer programs to analyze and derive meaning from large text- or voice-based datasets.
This includes filings, reports, earnings calls, social media, emails, postings, and surveys.
More advanced text analytics can include lexical analysis, that is, analyzing how often words occur and recognizing patterns in words and phrases.
Natural language processing is an important application inside this area.
What is NLP: a field combining computer science, AI, and linguistics to analyze and interpret human language.
NLP can handle translation, speech recognition, text mining, sentiment analysis, and topic analysis.
NLP can also support compliance monitoring, fraud detection, and confidentiality controls in employee communications.
In investing, NLP can analyze analyst commentary, annual reports, call transcripts, news articles, and policy-maker communications at a scale no human team can match.
It can tag sentiment changes before an analyst formally changes a buy, hold, or sell recommendation.
It can also detect subtle topic shifts in central-bank or policy communications around inflation, output, or rate policy.
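A minimal sketch of sentiment tagging using a word-list approach. The word lists, the scoring formula, and both call excerpts are hypothetical. Production NLP systems are far more sophisticated, but the core idea of turning language into a number that can be tracked over time is the same.

```python
import re

# Tiny hypothetical word lists; a real application would use a much larger, tested lexicon.
POSITIVE = {"strong", "growth", "improved", "confident", "record"}
NEGATIVE = {"weak", "decline", "uncertain", "headwinds", "cautious"}

def sentiment_score(text):
    """(positive words - negative words) / total words, so the score lies in [-1, 1]."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)

# Hypothetical excerpts from two consecutive earnings calls, invented for illustration.
last_call = "We delivered strong growth and record margins and remain confident."
this_call = "Demand was weak and we are cautious given uncertain headwinds."

change = sentiment_score(this_call) - sentiment_score(last_call)
print(f"Sentiment change between calls: {change:+.2f}")
```

A sharp negative change like this is the kind of signal that could be flagged before any formal recommendation changes. A simple word count cannot handle negation or sarcasm, which is one reason real systems go well beyond it.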

WHEN A CENTRAL BANK WHISPERS INSTEAD OF SHOUTS

Policymakers do not always move markets by changing the rate. Sometimes they move markets by changing the language around the rate. NLP matters because it can catch when a topic grows louder, softer, or more anxious before the headline action arrives.

Programming Languages and Databases

The source names several common programming languages used in data science.
Python is open-source and approachable and underlies many fintech applications.
R is open-source and historically strong in statistics, ML, optimization, econometrics, and financial analysis.
Java runs across different machines and operating systems and supports many internet applications.
C and C++ are specialized for speed and processing performance and are used in algorithmic and high-frequency trading.
Excel VBA helps automate workflows, update data, run macros, gather web data, and build customized reports.
On the database side, SQL works for structured data stored in rows and columns on servers.
SQLite is a structured-data database embedded into programs and is common in mobile applications.
NoSQL is used for unstructured data that cannot be summarized in traditional tables.

Final Exam Traps

Big Data is not defined by size alone. The full exam frame is volume, velocity, variety, and often veracity.
Alternative data are not the same as traditional corporate or market data.
ML still needs human judgment, clean data, and enough data.
Overfitting means learning noise as signal; underfitting means missing real structure.
Supervised learning uses labels; unsupervised learning does not.
NLP sits inside text analytics and is directly useful for sentiment, topic, and language interpretation in finance.
