From Raw HTML to Clean Data: Structured Extraction Techniques

The raw HTML of a webpage is messy. Extracting clean, structured data requires understanding selectors, handling edge cases, and normalizing data into consistent formats.

Selector Strategies

Two main approaches exist for locating data in HTML: CSS selectors and XPath. Each has strengths depending on the situation.

CSS Selectors

CSS selectors are intuitive for anyone familiar with web development. They're the default choice for most scraping tasks.

// Select by class
document.querySelector('.product-title')

// Select by attribute
document.querySelector('[data-product-id]')

// Nested selection
document.querySelector('.product-card .price')
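On a listing page, these selectors are usually combined: find every card, then query inside each card so that fields from different products cannot mix. The sketch below assumes the class names shown above; extractProducts is a hypothetical name, and root is assumed to be a Document or Element.

```javascript
// Hypothetical helper: collect title and price text from every product card.
// `root` is assumed to expose querySelectorAll (a Document or Element).
function extractProducts(root) {
  return Array.from(root.querySelectorAll('.product-card'), (card) => {
    // Query relative to the card, not the whole document
    const title = card.querySelector('.product-title')
    const price = card.querySelector('.price')
    return {
      // A missing element becomes null instead of throwing
      title: title ? title.textContent.trim() : null,
      price: price ? price.textContent.trim() : null
    }
  })
}

// Usage in a browser console:
// extractProducts(document)
```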

XPath

XPath is more powerful for complex selections, especially when you need to navigate up the DOM tree or select by text content.

// Select by text content
//div[contains(text(), 'Price:')]

// Navigate to parent
//span[@class='price']/parent::div

// Select nth element
(//div[@class='product'])[3]
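In a browser, expressions like these can be run through document.evaluate. The helper below is a minimal sketch: xpathAll is a hypothetical name, doc is assumed to be a DOM Document, and the constant 7 is the numeric value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, written out so the function does not depend on the XPathResult global.

```javascript
// Numeric value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE
const ORDERED_NODE_SNAPSHOT_TYPE = 7

// Hypothetical helper: run an XPath expression and return the matches as an array.
// `doc` is assumed to be a DOM Document (for example, `document` in a browser).
function xpathAll(doc, expression, contextNode = doc) {
  const result = doc.evaluate(expression, contextNode, null, ORDERED_NODE_SNAPSHOT_TYPE, null)
  const nodes = []
  for (let i = 0; i < result.snapshotLength; i++) {
    nodes.push(result.snapshotItem(i))
  }
  return nodes
}

// Usage in a browser console:
// xpathAll(document, "//span[@class='price']/parent::div")
```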

Handling Edge Cases

Real-world data is messy. Your extraction code needs to handle:

Missing elements - Always check if an element exists
Multiple formats - Prices might be "$100" or "100 USD"
Whitespace - Extra spaces, newlines, and tabs
HTML entities - &nbsp; and other encoded characters
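The first and last cases in this list can be sketched as follows. Both function names are hypothetical, root is assumed to be anything that exposes querySelector, and the entity table covers only a handful of entities for illustration. A DOM parser already decodes entities when you read textContent, so manual decoding mostly matters for raw strings pulled out with a regex or from an API response.

```javascript
// Hypothetical helper: read an element's text without throwing when it is missing.
// `root` is assumed to expose querySelector (a Document or Element).
function textOrNull(root, selector) {
  const el = root.querySelector(selector)
  return el ? el.textContent : null
}

// Illustrative table only; real pages use many more entities.
// &nbsp; is mapped to a plain space here for convenience.
const ENTITIES = {
  '&nbsp;': ' ',
  '&amp;': '&',
  '&lt;': '<',
  '&gt;': '>',
  '&quot;': '"',
  '&#39;': "'"
}

// Hypothetical helper: decode the entities listed above in a raw string.
// A single pass avoids double-decoding input such as "&amp;lt;".
function decodeEntities(raw) {
  return raw.replace(/&(?:nbsp|amp|lt|gt|quot|#39);/g, (match) => ENTITIES[match])
}
```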

Data Normalization

Once extracted, data needs to be normalized for consistency:

Text Cleaning

function cleanText(text) {
  // \s matches spaces, tabs, and newlines, so one pass collapses them all
  return text
    .replace(/\s+/g, ' ')
    .trim()
}

Price Parsing

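function parsePrice(priceStr) {
  // Assumes "," is a thousands separator and "." is the decimal point
  const cleaned = priceStr.replace(/[^0-9.,]/g, '')
  const value = parseFloat(cleaned.replace(/,/g, ''))
  // Return null rather than NaN when the string contains no number
  return Number.isNaN(value) ? null : value
}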
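parsePrice assumes a period is the decimal point, which breaks on prices written like "1.234,56". One possible workaround is to guess the decimal separator from its position. The function below is a sketch of that idea, not a complete solution: parsePriceFlexible is a hypothetical name, and the heuristic (the last separator counts as the decimal mark only when one or two digits follow it) is an assumption that will misread ambiguous inputs such as "1,23".

```javascript
// Hypothetical variant of parsePrice that guesses the decimal separator.
// Heuristic (an assumption, not a standard): the last "." or "," is the
// decimal separator only if exactly one or two digits follow it.
function parsePriceFlexible(priceStr) {
  const cleaned = priceStr
    .replace(/[^0-9.,]/g, '')   // keep digits and separators only
    .replace(/[.,]+$/, '')      // drop trailing punctuation such as a sentence period
  const match = cleaned.match(/^(.*)[.,](\d{1,2})$/)
  const whole = match ? match[1] : cleaned
  const fraction = match ? match[2] : ''
  // Any separators left in the whole part are treated as thousands separators
  const digits = whole.replace(/[.,]/g, '')
  if (digits === '' && fraction === '') return null
  return parseFloat(`${digits || '0'}.${fraction || '0'}`)
}
```

If the target site uses a single known locale, hard-coding its separators is likely to be more reliable than guessing.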

Date Standardization

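function standardizeDate(dateStr) {
  const parsed = new Date(dateStr)
  // An unparseable string yields an Invalid Date, and toISOString() would throw
  if (Number.isNaN(parsed.getTime())) return null
  // toISOString() reports UTC, so local-time inputs can shift by a day
  return parsed.toISOString().split('T')[0]
}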
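How new Date(dateStr) interprets non-ISO strings depends on the JavaScript engine, so when a site's date format is known it can be safer to parse it explicitly. The sketch below assumes the input is DD/MM/YYYY; both that format and the name parseDayMonthYear are illustrative choices.

```javascript
// Hypothetical helper: convert "DD/MM/YYYY" (assumed input format) to "YYYY-MM-DD".
function parseDayMonthYear(dateStr) {
  const match = dateStr.trim().match(/^(\d{1,2})\/(\d{1,2})\/(\d{4})$/)
  if (!match) return null
  const [, day, month, year] = match.map(Number)
  // Build the date in UTC so the local time zone cannot shift the day
  const date = new Date(Date.UTC(year, month - 1, day))
  // Reject rollovers such as 31/02/2024, which Date would move into March
  const sameDay =
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day
  return sameDay ? date.toISOString().split('T')[0] : null
}
```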

Building Robust Extractors

A robust extractor combines multiple strategies:

Try the primary selector
Fall back to alternative selectors
Clean and normalize the data
Validate the result
Log warnings for unexpected patterns
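The steps above can be sketched as a single function. This is one possible shape, not a definitive implementation: extractField and its option names are hypothetical, root is assumed to expose querySelector, and selectors lists the primary selector first, followed by fallbacks.

```javascript
// Hypothetical sketch: try selectors in order, clean, validate, and log warnings.
function extractField(root, selectors, options = {}) {
  const {
    normalize = (text) => text,             // e.g. a price or date parser
    validate = () => true,                  // return false to reject a value
    warn = (message) => console.warn(message)
  } = options

  for (const [index, selector] of selectors.entries()) {
    const el = root.querySelector(selector)
    if (!el) continue
    // A fallback match suggests the page layout may have changed
    if (index > 0) warn(`Primary selector missed; matched fallback "${selector}"`)
    // Collapse whitespace before handing the text to the normalizer
    const value = normalize(el.textContent.replace(/\s+/g, ' ').trim())
    if (validate(value)) return value
    warn(`Value from "${selector}" failed validation: ${JSON.stringify(value)}`)
  }
  warn(`No usable value found for selectors: ${selectors.join(', ')}`)
  return null
}

// Usage in a browser console (selectors are illustrative):
// extractField(document, ['.price', '[data-price]'], { validate: (v) => /\d/.test(v) })
```

Returning null instead of throwing lets a batch run continue past a bad page, while the warnings record which pages need a closer look.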

Tools for Testing

Before running your scraper at scale, test your selectors:

Browser DevTools - Test selectors in the Console
JSON Vision - Our free extension for inspecting API responses
Scrapy Shell - Interactive testing for Python scrapers

Clean data extraction is part art, part science. With practice, you'll develop intuition for building extractors that handle real-world messiness gracefully.