📚 Biostatistics: Understanding Data Variables, Measurement, and Collection
Source Information: This study material is compiled from a lecture audio transcript and copy-pasted text, likely from a presentation or notes, provided by the Institute of Epidemiology and Biostatistics with Medical Informatics, University St. Cyril and Methodius, Medicine Faculty.
🎯 Introduction to Biostatistics
Biostatistics is a vital discipline that applies statistical methods to biological and medical data. It is fundamental for understanding health phenomena, disease patterns, and treatment effectiveness. The core functions of biostatistics involve:
- ✅ Summarizing data: Condensing raw data into meaningful forms.
- ✅ Analyzing data: Applying statistical tests to uncover relationships and trends.
- ✅ Graphically presenting data: Visualizing data for clearer interpretation.
A foundational concept in biostatistics is the statistical data variable, which refers to any characteristic that can vary or differ among individuals or groups. These variables are the specific items about which data are collected, forming the basis for all subsequent statistical analysis.
📊 Statistical Data Variables: Types and Classification
Understanding data variables is crucial for selecting appropriate statistical methods. Variables are broadly categorized into two main types: Categorical (Qualitative) and Numerical (Quantitative).
1. Categorical (Qualitative) Variables
📚 Definition: These variables describe individuals as belonging to specific categories or groups. They do not have a unit of measurement.
- Characteristics: Individuals are assigned to one of several categories.
- Examples:
- Gender (e.g., male, female)
- Satisfaction status (e.g., satisfied, neutral, not satisfied)
- Marital status (e.g., single, married, divorced)
- Eye color (e.g., blue, brown, green)
- Vaccination status (e.g., vaccinated, unvaccinated)
- Health condition (e.g., good, fair, poor)
- Type of symptoms (e.g., fever, cough, headache)
- ⚠️ Important Note: Even if categorical variables are coded with numbers (e.g., 1=female, 2=male), they remain categorical. The numbers are merely labels, not quantities.
Categorical variables are further divided into two types:
a. Nominal Variables
📚 Definition: Categorical variables that have no intrinsic order or ranking among their categories.
- Characteristics: The order in which categories are listed does not change their meaning.
- Examples:
- Sex: (female, male) – can also be (male, female) without changing meaning.
- Blood groups: (A, B, AB, O) – any order is acceptable.
- Nationality: (e.g., American, British, Japanese) – no inherent order.
- Dichotomous (Binary) Variables: A special type of nominal variable with only two possible categories.
- Examples:
- Sex (Male, Female)
- Answer to a question (Yes, No)
- Disease status (Diseased, Not diseased)
b. Ordinal Variables
📚 Definition: Categorical variables that have a meaningful order or ranking among their categories, but the differences between categories may not be equal or quantifiable.
- Characteristics: Categories can be logically ordered from lowest to highest, or vice versa.
- Examples:
- BMI status: (underweight, normal, overweight, obese, extremely obese) – there's a clear progression.
- Quality rating: (excellent, good, medium, poor, very bad) – indicates a scale from best to worst.
- Pain score: (no pain, low pain, moderate pain, severe pain) – represents increasing pain intensity.
- Social class: (low class, middle class, high class) – implies a social hierarchy.
- ⚠️ Important Note: Similar to nominal variables, if ordinal variables are coded numerically (e.g., 1=very bad, 5=excellent), they are still ordinal. The numbers represent rank, not a measurable quantity.
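The point that numeric codes are labels rather than quantities can be illustrated with a small sketch. The two codings and the five responses below are hypothetical. The sketch shows that the category counts stay the same under either coding, while an arithmetic mean of the codes changes with the arbitrary choice of labels and therefore carries no meaning.

```python
from collections import Counter

# Hypothetical coded responses: both lists describe the same five people.
sex_coding_a = [1, 2, 2, 1, 2]   # 1 = female, 2 = male
sex_coding_b = [0, 1, 1, 0, 1]   # 0 = female, 1 = male

labels_a = {1: "female", 2: "male"}
labels_b = {0: "female", 1: "male"}


def frequencies(codes, labels):
    """Count how many cases fall into each labelled category."""
    return Counter(labels[code] for code in codes)


def mean(values):
    """Arithmetic mean, computed only to show it is not meaningful for codes."""
    return sum(values) / len(values)


freq_a = frequencies(sex_coding_a, labels_a)
freq_b = frequencies(sex_coding_b, labels_b)
mean_a = mean(sex_coding_a)
mean_b = mean(sex_coding_b)

print(freq_a == freq_b)   # the category counts agree under both codings
print(mean_a, mean_b)     # the "means" differ because the codes are arbitrary
```

Frequencies depend only on category membership, so they are a safe summary for coded categorical data.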
2. Numerical (Quantitative) Variables
📚 Definition: These variables are either measured or counted, represented by numbers, and always possess a measurement unit.
- Characteristics: They provide quantitative information.
- Examples:
- Weight (e.g., in kg)
- Height (e.g., in cm)
- Age (e.g., in years)
- Incubation period (e.g., in days)
- Antibody titer (e.g., in units/mL)
- Erythrocyte count (e.g., in cells/µL)
Numerical variables are further divided into two types:
a. Discrete Variables
📚 Definition: Numerical variables that can take only integer numbers (whole numbers) and usually represent a count of something.
- Characteristics: There are distinct, separate values; no values between consecutive integers are possible.
- Examples:
- Number of kids in a family (e.g., 0, 1, 2, 3...)
- Number of stents inserted into the coronaries (e.g., 1, 2, 3...)
- Number of patient visits to the hospital (e.g., 0, 1, 2...)
b. Continuous Variables
📚 Definition: Numerical variables that can take any real numerical value, including decimals, within a given range. They involve precise measurement.
- Characteristics: There are infinitely many possible values between any two given values.
- Examples:
- Weight (e.g., 65.3 kg, 72.85 kg)
- Height (e.g., 175.2 cm, 160.0 cm)
- Blood glucose level (e.g., 98.5 mg/dL, 120.1 mg/dL)
- Body temperature (e.g., 36.6 °C, 37.1 °C)
💡 How to Identify Variable Types: A Step-by-Step Guide
1️⃣ Step 1: Is there a unit of measurement?
- If No ➡️ It is Categorical.
- If Yes ➡️ It is Numerical.
2️⃣ Step 2: For Categorical variables: Is there an order?
- If No ➡️ It is Nominal.
- If Yes ➡️ It is Ordinal.
3️⃣ Step 3: For Numerical variables: Is it counted or measured?
- If Counted (integer values) ➡️ It is Discrete.
- If Measured (can have decimals) ➡️ It is Continuous.
Example Dataset Analysis:
Let's apply the steps to a sample dataset:
| Student | Sex | Blood group | BMI | BMI group | N of courses | Body temp. |
| :------ | :----- | :---------- | :---- | :------------- | :----------- | :--------- |
| 1 | male | O | 17.8 | Underweight | 4 | 36.6 |
| 2 | female | AB | 26 | Overweight | 5 | 37.1 |
| 3 | male | A | 24.5 | Healthy weight | 4 | 36.9 |
| 4 | male | B | 31.6 | Obese | 4 | 36.8 |
- Sex: No unit of measurement, no order (male/female are just labels) ➡️ Nominal (Dichotomous), Categorical.
- Blood group: No unit of measurement, no order (A, B, AB, O are just labels) ➡️ Nominal, Categorical.
- BMI group: No unit of measurement, clear order (Underweight < Healthy < Overweight < Obese) ➡️ Ordinal, Categorical.
- N of courses: Has a unit (courses), counted (whole numbers) ➡️ Discrete, Numerical.
- BMI: Has a unit (kg/m²), measured (can have decimals) ➡️ Continuous, Numerical.
- Body temp: Has a unit (°C), measured (can have decimals) ➡️ Continuous, Numerical.
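The three-step guide and the dataset classification above can be sketched as a small decision function. This is a minimal illustration, not a standard library routine: the function name `classify_variable` and its parameters are assumptions. The answers to the guide's questions (unit, order, counted) still have to be supplied by a person who knows the variable.

```python
def classify_variable(has_unit, is_ordered=False, is_counted=False):
    """Return (main type, subtype) by following the three-step guide."""
    if not has_unit:
        # Step 2: categorical variables split on whether categories are ordered.
        return ("Categorical", "Ordinal" if is_ordered else "Nominal")
    # Step 3: numerical variables split on counted versus measured.
    return ("Numerical", "Discrete" if is_counted else "Continuous")


# Answers to the guide's questions for each column of the sample dataset.
answers = {
    "Sex": dict(has_unit=False, is_ordered=False),
    "Blood group": dict(has_unit=False, is_ordered=False),
    "BMI group": dict(has_unit=False, is_ordered=True),
    "N of courses": dict(has_unit=True, is_counted=True),
    "BMI": dict(has_unit=True, is_counted=False),
    "Body temp.": dict(has_unit=True, is_counted=False),
}

classified = {name: classify_variable(**a) for name, a in answers.items()}

for name, (main_type, subtype) in classified.items():
    print(f"{name}: {subtype}, {main_type}")
```

Step 1 is checked first because the other two questions only make sense once the main type is known.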
📏 Measuring Scales
Measuring scales provide further detail on the nature of numerical and ordinal data. There are three primary types:
1. Ratio Scale
📚 Definition: Applies to variables that have a true zero point, meaning zero signifies the complete absence of the measured quantity. Ratios between values are meaningful.
- Characteristics: All mathematical operations (addition, subtraction, multiplication, division) are valid.
- Example: Weight. A weight of 0 kg means no weight. A person weighing 30 kg is exactly twice as heavy as a person weighing 15 kg. Other examples include height, age, and income.
2. Interval Scale
📚 Definition: Applies to variables that have no true zero point. The intervals between values are meaningful and equal, but ratios are not.
- Characteristics: Differences between values are meaningful (addition and subtraction are valid), but ratios of values are not (multiplication and division are not meaningful).
- Example: Temperature in Celsius (°C). 0 °C does not mean the absence of heat; it is an arbitrary reference point. So 30 °C is not twice as hot as 15 °C, but the difference between 10 °C and 20 °C is the same as between 20 °C and 30 °C (10 °C in both cases).
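One way to see the difference between the ratio and interval scales is to change the unit and check whether the ratio survives. The sketch below is an illustration, not part of the lecture material. The Fahrenheit, Kelvin and pound conversions are added here as assumptions for the demonstration, and the pound factor is approximate. The weight ratio stays at 2 in any unit. The temperature ratio changes with the unit, although equal differences remain equal.

```python
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32


def celsius_to_kelvin(c):
    return c + 273.15


def kg_to_lb(kg):
    return kg * 2.20462  # approximate conversion factor


# Ratio scale: the ratio of two weights is the same in kg and in lb.
weight_ratio_kg = 30 / 15
weight_ratio_lb = kg_to_lb(30) / kg_to_lb(15)

# Interval scale: the "ratio" of two temperatures depends on the unit.
temp_ratio_c = 30 / 15
temp_ratio_f = celsius_to_fahrenheit(30) / celsius_to_fahrenheit(15)
temp_ratio_k = celsius_to_kelvin(30) / celsius_to_kelvin(15)

# Equal differences stay equal after conversion.
diff_f_low = celsius_to_fahrenheit(20) - celsius_to_fahrenheit(10)
diff_f_high = celsius_to_fahrenheit(30) - celsius_to_fahrenheit(20)

print(weight_ratio_kg, weight_ratio_lb)
print(temp_ratio_c, temp_ratio_f, temp_ratio_k)
print(diff_f_low, diff_f_high)
```

Kelvin has a true zero, so only the Kelvin ratio (about 1.05) says how much "hotter" 30 °C is than 15 °C in a physical sense.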
3. Ordinal Scale
📚 Definition: Applies to variables that have an order or ranking, but the differences between categories are not necessarily equal or quantifiable.
- Characteristics: Only comparisons of "greater than" or "less than" are meaningful.
- Examples:
- Pain score (e.g., 1-10 scale): A score of 8 is more pain than 4, but it's not necessarily "twice" the pain, and the difference between 1 and 2 might not be the same as between 7 and 8.
- Social class (e.g., low, middle, high).
- 💡 Insight: Sometimes, ordinal variables with a large number of levels (like a 10-level pain score) might be treated as discrete numerical variables for certain analyses, though this is a simplification.
📈 Levels of Data Measurement and Transformation
Data variables exist in a hierarchy of measurement levels, which dictates how they can be transformed. A variable's type can be changed, but only in one direction, from a more detailed level to a less detailed one: Numerical Continuous → Numerical Discrete → Ordinal → Nominal
- Example: Age Transformation
- Numerical Continuous: Exact age (e.g., 25.7 years, 30.1 years).
- Numerical Discrete: Age in years (e.g., 25 years, 30 years).
- Ordinal: Age groups (e.g., 18-25 years, 26-35 years, 36-45 years).
- Nominal: Simplified categories (e.g., "young" vs. "old").
⚠️ Critical Principle: Whenever possible, collect your data at the highest level (numerical continuous or numerical discrete). This approach preserves the most information, ensures greater accuracy, and provides flexibility for later categorization or analysis without losing detail.
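The one-way nature of the transformation can be sketched with the age example. The function names are hypothetical, and the cutoff of 30 years for "young" versus "old" is an assumption chosen only for illustration. Each step discards detail, which is why the chain cannot be run in reverse.

```python
import math


def to_discrete(exact_age):
    """Continuous -> discrete: keep only completed years."""
    return math.floor(exact_age)


def to_age_group(age_years):
    """Discrete -> ordinal: the age groups used in the example above."""
    if 18 <= age_years <= 25:
        return "18-25 years"
    if 26 <= age_years <= 35:
        return "26-35 years"
    if 36 <= age_years <= 45:
        return "36-45 years"
    return None  # ages outside the example's groups are left ungrouped


def to_simplified(age_years, cutoff=30):
    """Two simplified categories; the cutoff is a hypothetical choice."""
    return "young" if age_years < cutoff else "old"


for exact_age in (25.7, 30.1):
    years = to_discrete(exact_age)
    print(exact_age, years, to_age_group(years), to_simplified(years))

# Different exact ages collapse into the same group, so the exact value
# cannot be recovered from the group label.
same_group = to_age_group(to_discrete(26.2)) == to_age_group(to_discrete(34.9))
print(same_group)
```

If only the age group had been collected, none of the more detailed levels could be produced afterwards.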
📝 Data Collection Methods
Accurate and reliable data collection is paramount in medical research. Methods vary based on study design and research objectives, broadly classified as primary (collected directly) or secondary (collected from existing sources).
Common data collection methods include:
- Surveys and Questionnaires:
- ✅ Purpose: Efficiently gather data from a large number of respondents.
- Administration: In-person, phone, mail, or online.
- Types of Questions: Can include both open-ended (allowing free text) and closed-ended (multiple choice, yes/no) questions.
- Interviews:
- ✅ Purpose: More direct and often in-depth data collection, allowing for clarification and probing.
- Administration: Typically in-person or via video call.
- Direct Measurements:
- ✅ Purpose: Obtain objective and precise data.
- Methods: Physical examinations, laboratory tests (e.g., blood glucose), imaging studies (e.g., X-rays, MRI).
- Output: Quantitative health parameters (e.g., blood pressure, cholesterol levels).
- Medical Records:
- ✅ Purpose: Access historical patient data.
- Content: Patient history, diagnoses, treatments, outcomes.
- Advantage: Electronic Health Records (EHRs) streamline data access and analysis.
- Census:
- ✅ Purpose: Complete enumeration of an entire population.
- Characteristics: Comprehensive, typically conducted periodically (e.g., every ten years for population census). Requires extensive inquiry.
💻 Data Entry Principles
After data collection, especially from paper forms, data entry is a critical step to prepare information for analysis. The goal is to arrange data into a structured computer file, usually a spreadsheet.
Characteristics of a Well-Arranged Datasheet:
- ✅ Each column represents one variable.
- ✅ Each row represents a case (e.g., an individual patient).
- ✅ The unit of measurement is unified within each column (e.g., all weights in kg, not a mix of kg and lbs).
- ✅ Each cell contains only one data point.
- ✅ Nominal and ordinal data are coded using numeric codes.
Examples of Numeric Coding for Categorical Data:
- Severity of disease:
- Mild → 1
- Moderate → 2
- Severe → 3
- Binary (Yes/No):
- Yes → 1
- No → 0
- Severity of Pain:
- No pain → 0
- Mild pain → 1
- Moderate pain → 2
- Severe pain → 3
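The coding schemes above can be kept in a small codebook so that every response is coded the same way. This is a minimal sketch using plain dictionaries. The names `encode` and `decode` are hypothetical helpers, and the sample responses are invented for illustration.

```python
# Codebooks taken from the coding examples above.
SEVERITY_CODES = {"Mild": 1, "Moderate": 2, "Severe": 3}
YES_NO_CODES = {"Yes": 1, "No": 0}
PAIN_CODES = {"No pain": 0, "Mild pain": 1, "Moderate pain": 2, "Severe pain": 3}


def encode(responses, codebook):
    """Replace each text response with its numeric code.

    An unknown response raises KeyError, which flags typos during entry.
    """
    return [codebook[response] for response in responses]


def decode(codes, codebook):
    """Map numeric codes back to their text labels."""
    reverse = {code: label for label, code in codebook.items()}
    return [reverse[code] for code in codes]


# Invented sample responses.
severity = encode(["Mild", "Severe", "Moderate"], SEVERITY_CODES)
answers = encode(["Yes", "No", "Yes"], YES_NO_CODES)
pain_labels = decode([0, 3, 1], PAIN_CODES)

print(severity)
print(answers)
print(pain_labels)
```

Keeping the codebook next to the data file means the codes can always be translated back to their categories.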
Handling Multiple Answers:
- 💡 If a question allows multiple selections (e.g., "Which chronic conditions do you have?"), create a separate column for each choice.
- Code each choice as 1 (Yes) or 0 (No).
- Example: For chronic conditions (DM, CVD, Hypertension):

| DM | CVD | Hypertension |
| :-- | :-- | :----------- |
| 1 | 0 | 1 |
| 1 | 0 | 0 |
| 0 | 1 | 1 |
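The one-column-per-choice rule can be sketched as follows. The helper name `to_indicator_row` is hypothetical, and the three responses are chosen to reproduce the chronic-conditions example.

```python
CONDITIONS = ["DM", "CVD", "Hypertension"]


def to_indicator_row(selected, choices=CONDITIONS):
    """Turn one respondent's selections into a 1/0 value per choice."""
    return {choice: int(choice in selected) for choice in choices}


# Each set holds the conditions ticked by one respondent.
responses = [
    {"DM", "Hypertension"},
    {"DM"},
    {"CVD", "Hypertension"},
]

rows = [to_indicator_row(selected) for selected in responses]

print(" | ".join(CONDITIONS))
for row in rows:
    print(" | ".join(str(row[choice]) for choice in CONDITIONS))
```

Each column is then an ordinary dichotomous variable and can be analyzed on its own.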
⚠️ Tips for Data Entry of Numeric Variables:
- Be precise: Enter exact values (e.g., 1.56, not 1.5 or 1.6).
- Only numbers: Enter numerical values, not text (e.g., 2, not two).
- Keep consistent units: Use one unit throughout a column (e.g., all cm or all m).
- Don't write the unit: The unit should be in the column header, not in the cell (e.g., 2, not 2 times or 2 years).
- Use basic measurements: Enter raw data like weight and height; calculated values like BMI can be derived later.
- Don't categorize: Collect exact numeric values (e.g., exact age 27, not 20-25 years). Categorization can be done during analysis.
- Only one data element per cell: Avoid combined representations (e.g., for gestational age, enter 142 days or 20 weeks, not 20+2).
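Several of these tips can be checked automatically before analysis. The sketch below is one possible approach, not a prescribed procedure. The function names are hypothetical, and the rule that a clean cell is a plain number with an optional sign and decimal part is an assumption. It also shows a gestational age of 20+2 (weeks and days) stored as a single number of days, and BMI derived from raw weight and height.

```python
import re

# A clean numeric cell: optional minus sign, digits, optional decimal part.
NUMERIC_CELL = re.compile(r"-?\d+(\.\d+)?")


def is_clean_numeric(cell):
    """True if the cell holds only a number, with no text, unit, or range."""
    return NUMERIC_CELL.fullmatch(cell.strip()) is not None


def gestational_age_days(weeks, days):
    """Store a 'weeks+days' gestational age as a single number of days."""
    return weeks * 7 + days


def bmi(weight_kg, height_m):
    """Derive BMI (kg/m^2) from raw weight and height."""
    return weight_kg / height_m ** 2


for cell in ["1.56", "2", "two", "2 years", "20-25", "20+2"]:
    print(repr(cell), is_clean_numeric(cell))

print(gestational_age_days(20, 2))
print(round(bmi(70, 1.75), 1))  # hypothetical weight and height
```

Cells that fail the check can be listed and corrected against the paper forms before any analysis starts.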
✅ Conclusion
A solid grasp of statistical data variables, their classification, and associated measurement scales is the bedrock of biostatistics. Coupled with judicious data collection methods and meticulous data entry practices, this foundational knowledge ensures that raw data is accurately captured, structured, and prepared for rigorous statistical analysis. This, in turn, enables sound scientific inquiry and evidence-based decision-making in health and medicine.








