Siksha Sarovar

Siksha Sarovar (sikshasarovar.com) is a free educational web application that helps students in India learn programming and prepare for academic and competitive exams. The platform offers structured coding courses (C, C++, Python, Java, HTML, CSS, PHP, Power BI, AI, Machine Learning, Data Science), complete university curriculum notes for BCA/MCA students with previous year question papers, Class 10 and Class 12 CBSE/HBSE school notes, and dedicated preparation material for SSC, UPSC, Banking, Railway and other government exams. Browsing the site is completely free and requires no account. Users may optionally sign in with Google solely to save their learning progress, quiz scores and personal preferences across devices.

Types of Data: Structured, Unstructured & Semi-Structured

Lesson 7 of 37 in the free Data Science notes on Siksha Sarovar, written by Rohit Jangra.

Types of Data

Understanding the different types of data is fundamental to Data Science, because the type of data determines which tools, storage systems, and analytical techniques can be applied. Data can be broadly categorized into three types based on its organization and format.

---

1. Structured Data

Definition: Structured data is highly organized data that conforms to a predefined schema (format). It resides in fixed fields within a record or file, making it easily searchable and queryable.

Characteristics:

  • Has a well-defined data model (rows and columns).
  • Stored in Relational Databases (RDBMS) like MySQL, PostgreSQL, Oracle.
  • Commonly estimated at roughly 20% of all data generated worldwide (a rough industry figure, not a precise measurement).
  • Easy to enter, store, query, and analyze.

Examples:

  • Employee records in an HR database (Name, ID, Salary, Department).
  • Transaction records in a banking system (Account No, Amount, Date).
  • Sensor readings stored in time-series databases (Timestamp, Temperature, Humidity).

Advantages:

  • Can be queried using standard SQL.
  • Well-suited for traditional Business Intelligence (BI) tools.
  • Data integrity is enforced through schemas and constraints.

Disadvantages:

  • Rigid schema makes it difficult to adapt to changing data requirements.
  • Limited in representing complex or hierarchical data.
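
The points above can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The employees table, its columns, and the sample rows are assumptions invented for illustration; they are not taken from any real system.

```python
import sqlite3

# In-memory database; the table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE employees (
        id         INTEGER PRIMARY KEY,
        name       TEXT    NOT NULL,
        department TEXT    NOT NULL,
        salary     REAL    NOT NULL CHECK (salary >= 0)
    )
    """
)
conn.executemany(
    "INSERT INTO employees (id, name, department, salary) VALUES (?, ?, ?, ?)",
    [
        (1, "Asha", "HR", 52000.0),
        (2, "Vikram", "IT", 61000.0),
        (3, "Meena", "IT", 58000.0),
    ],
)

# Because every row has the same fields, a standard SQL query works directly.
it_staff = conn.execute(
    "SELECT name, salary FROM employees WHERE department = ? ORDER BY salary DESC",
    ("IT",),
).fetchall()
print(it_staff)

# The schema enforces integrity: a negative salary violates the CHECK constraint.
try:
    conn.execute(
        "INSERT INTO employees (id, name, department, salary) VALUES (?, ?, ?, ?)",
        (4, "Ravi", "HR", -100.0),
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print("Invalid row rejected:", rejected)
```

The same rigidity that rejects the invalid row is what makes schema changes costly: adding a new attribute means altering the table for every record.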

---

2. Unstructured Data

Definition: Unstructured data is data that does not have a predefined data model or is not organized in a predefined manner. It is often text-heavy but may also contain dates, numbers, and other facts.

Characteristics:

  • Does not conform to a tabular (rows and columns) format.
  • Commonly estimated at roughly 80% of all data generated worldwide (a rough industry figure, not a precise measurement).
  • Requires specialized tools and techniques (NLP, Computer Vision, Deep Learning) for analysis.
  • Stored in NoSQL databases, Data Lakes, or file systems.

Examples:

  • Social media posts (tweets, comments, status updates).
  • Images and videos (medical scans, surveillance footage, YouTube videos).
  • Audio files (call center recordings, podcasts, voice assistants).
  • Email bodies and attachments.
  • PDF documents and word processing files.

Advantages:

  • Contains extremely rich and diverse information.
  • Captures context that structured data cannot (tone, sentiment, visual content).

Disadvantages:

  • Difficult to search, query, and analyze without advanced preprocessing.
  • Storage and processing are more expensive and complex.
  • Extracting value requires specialized AI/ML techniques.
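
As a rough sketch of why preprocessing is needed, the snippet below turns free text into countable tokens using only the Python standard library. The sample posts and the tokenize helper are hypothetical, and real NLP pipelines do far more than this simple word count.

```python
import re
from collections import Counter

# Hypothetical customer comments: free text with no fixed fields.
posts = [
    "Loved the new update!! Battery life is great.",
    "battery drains too fast after the update :(",
    "Great camera, terrible battery.",
]


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


# Preprocessing step: turn raw text into countable tokens.
word_counts = Counter(word for post in posts for word in tokenize(post))

# Only after preprocessing can a simple question be asked of the text.
print(word_counts["battery"])       # 3
print(word_counts.most_common(1))   # [('battery', 3)]
```

Counting words says nothing about tone or sentiment; capturing those is where the specialized techniques listed above come in.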

---

3. Semi-Structured Data

Definition: Semi-structured data falls between structured and unstructured data. It does not conform to the strict tabular schema of a relational database, but it contains tags, markers, or keys that separate data elements and express hierarchies.

Characteristics:

  • Has some organizational properties but does not fit neatly into a table.
  • Self-describing: contains metadata (tags or keys) that defines the data structure.
  • Examples include markup languages and serialization formats.

Examples:

  • JSON (JavaScript Object Notation): Widely used in web APIs.
  • XML (eXtensible Markup Language): Used in web services and configuration files.
  • HTML: Web pages have structure (tags) but content is unstructured.
  • CSV files with inconsistent columns (for example, rows with differing numbers of fields).
  • Log files: Server and application logs with semi-consistent formats.

Advantages:

  • More flexible than structured data.
  • Easier to parse than fully unstructured data.
  • Widely used in modern web applications and APIs.
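
A minimal sketch of working with semi-structured data, using Python's built-in json module. The API response below is hypothetical; the point is that the two records carry different keys, so missing fields have to be handled explicitly when flattening them into rows.

```python
import json

# Hypothetical API response; note that the two records have different keys.
raw = """
[
  {"id": 1, "name": "Asha", "contact": {"email": "asha@example.com"}},
  {"id": 2, "name": "Vikram", "skills": ["Python", "SQL"]}
]
"""

records = json.loads(raw)

# Keys act as self-describing tags; missing ones must be handled explicitly.
rows = []
for record in records:
    rows.append(
        {
            "id": record["id"],
            "name": record["name"],
            "email": record.get("contact", {}).get("email"),
            "skills": ", ".join(record.get("skills", [])),
        }
    )

for row in rows:
    print(row)
```

The nested contact object and the skills list are examples of hierarchy that a single flat table row cannot hold without this kind of flattening.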

---

Comprehensive Comparison Table

| Feature | Structured Data | Semi-Structured Data | Unstructured Data |
|---|---|---|---|
| Schema | Predefined, rigid | Partial / Flexible | None |
| Format | Rows and Columns | JSON, XML, HTML | Text, Images, Video |
| Storage | RDBMS (MySQL, PostgreSQL) | NoSQL (MongoDB), Files | Data Lakes, Blob Storage |
| Search/Query | Easy (SQL) | Moderate (JSONPath, XPath) | Difficult (requires AI/ML) |
| Approx. share of world's data (rough estimates; categories overlap) | ~20% | ~5-10% | ~80% |
| Example | Excel spreadsheet | API response (JSON) | YouTube video |
| Analysis Tools | SQL, Excel, Tableau | Python, Spark | NLP, Computer Vision |
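
To make the "Moderate (JSONPath, XPath)" entry concrete, here is a small sketch that queries XML with an XPath-style expression using Python's built-in xml.etree.ElementTree, which supports a limited subset of XPath. The library document and its element names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document; element and attribute names are illustrative.
xml_text = """
<library>
  <book genre="data"><title>Intro to SQL</title><year>2019</year></book>
  <book genre="fiction"><title>The Long Road</title><year>2015</year></book>
  <book genre="data"><title>Learning Pandas</title><year>2021</year></book>
</library>
"""

root = ET.fromstring(xml_text)

# Select only <book> elements whose genre attribute equals "data".
data_titles = [
    book.findtext("title") for book in root.findall("./book[@genre='data']")
]
print(data_titles)  # ['Intro to SQL', 'Learning Pandas']
```

The query depends on knowing the tag and attribute names in advance, which is why this sits between a SQL query on a fixed schema and AI-based analysis of raw content.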

---

Data Types in Statistics

Beyond the structural classification, data can also be classified by its statistical nature:

Quantitative (Numerical) Data

Data that can be measured and expressed as numbers.

  • Discrete: Countable values (e.g., Number of students = 30).
  • Continuous: Measurable values on a continuous scale (e.g., Temperature = 36.7°C).
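
The difference shows up in how each kind of value is summarized. The sketch below uses invented sample values and an arbitrary 37.0 cut-off chosen only to demonstrate binning.

```python
from collections import Counter
from statistics import mean

# Hypothetical samples, invented for illustration.
students_per_class = [30, 28, 30, 32, 30]        # discrete: whole-number counts
temperatures_c = [36.7, 37.1, 36.4, 38.2, 36.9]  # continuous: measurements

# Discrete values repeat exactly, so a frequency table is meaningful.
class_size_counts = Counter(students_per_class)
print(class_size_counts[30])  # 3

# Continuous values rarely repeat, so they are usually grouped into ranges.
# The 37.0 boundary is an arbitrary cut-off for this example.
bins = Counter(
    "37.0 or above" if t >= 37.0 else "below 37.0" for t in temperatures_c
)
print(bins["below 37.0"], bins["37.0 or above"])  # 3 2

average_temperature = mean(temperatures_c)
print(round(average_temperature, 2))  # 37.06
```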

Qualitative (Categorical) Data

Data that represents categories or groups.

  • Nominal: No inherent order (e.g., Colors: Red, Blue, Green).
  • Ordinal: Has a meaningful order (e.g., Education Level: High School < Bachelor's < Master's).
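
A short sketch of what the distinction means in practice: nominal values can only be counted, while ordinal values can also be sorted once an explicit ranking is supplied. The survey answers and the rank mapping below are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical survey answers, invented for illustration.
colors = ["Red", "Blue", "Red", "Green", "Red"]  # nominal
education = [                                     # ordinal
    "Bachelor's",
    "High School",
    "Master's",
    "Bachelor's",
    "High School",
]

# Nominal: only counting and the mode make sense; there is no ranking.
most_common_color = Counter(colors).most_common(1)[0][0]

# Ordinal: an explicit rank lets us sort and take a median category.
rank = {"High School": 0, "Bachelor's": 1, "Master's": 2}
ordered = sorted(education, key=rank.__getitem__)
median_level = ordered[len(ordered) // 2]

print(most_common_color)  # Red
print(median_level)       # Bachelor's
```

The rank numbers only encode order; the gap between 0 and 1 is not assumed to equal the gap between 1 and 2, so averaging them would not be meaningful.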

Statistical Data Types Summary

| Type | Sub-Type | Order | Example |
|---|---|---|---|
| Quantitative | Discrete | Inherent (numeric) | Number of cars (1, 2, 3) |
| Quantitative | Continuous | Inherent (numeric) | Weight (65.5 kg) |
| Qualitative | Nominal | No order | Blood Group (A, B, O, AB) |
| Qualitative | Ordinal | Has order | Rating (Poor, Average, Good) |

Summary

  • Data is classified as Structured (roughly 20% by common estimates), Unstructured (roughly 80%), or Semi-Structured, which sits between the two.
  • Structured data is organized in tables; Unstructured lacks a predefined schema.
  • Semi-Structured data (JSON, XML) is organized by tags or keys but is more flexible than a fixed table schema.
  • Statistically, data can be Quantitative (numbers) or Qualitative (categories).
  • Understanding data types is crucial for choosing the right storage, tools, and analysis techniques.