Data Extraction

This guide shows how to extract structured data from web pages with Monitoro Herd, using its selector system and transformation pipelines.

Understanding Selectors

Herd extracts data from web pages through a declarative, JSON-based selector system: you describe the shape of the result you want, and Herd fills it in from the page.

Basic Extraction

The simplest form of extraction maps each output key to a CSS selector and returns the text of the first matching element:

// Extract basic text content
const data = await page.extract({
  title: 'h1',           // Extracts the main heading
  description: 'p',      // Extracts the first paragraph
  link: 'a'              // Extracts the first link text
});

console.log(data.title);       // "Welcome to Our Website"
console.log(data.description); // "This is our homepage."

Advanced Selector Syntax

To read an attribute or apply transformations, use the expanded object syntax, where _$ holds the CSS selector:

const data = await page.extract({
  title: {
    _$: 'h1',            // CSS selector
    attribute: 'id'      // Extract the ID attribute instead of text
  },
  price: {
    _$: '.price',        // Target price element
    pipes: ['parseNumber'] // Apply transformation
  }
});

Extracting Lists of Items

To extract multiple elements that match a pattern, use the _$r (repeat) selector:

const data = await page.extract({
  items: {
    _$r: '.item',        // Find all elements with class "item"
    title: 'h2',         // For each item, get the title
    price: '.price',     // For each item, get the price
    date: 'time'         // For each item, get the date
  }
});

// Access the extracted items
data.items.forEach(item => {
  console.log(`${item.title}: ${item.price}, Posted: ${item.date}`);
});

Nested Extraction

You can nest selectors to extract hierarchical data:

const data = await page.extract({
  product: {
    name: '.product-name',
    details: {
      _$: '.product-details',
      specs: {
        _$r: '.spec-item',
        label: '.spec-label',
        value: '.spec-value'
      }
    }
  }
});
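
A nested result is often easier to consume once the repeated part is reshaped. The sketch below assumes the schema above yields `product.details.specs` as an array of `{ label, value }` objects, which this guide implies but does not show. The `sampleData` values and the `specsToObject` helper are hypothetical.

```javascript
// Hypothetical result shape for the nested schema above (illustrative data only).
const sampleData = {
  product: {
    name: 'Trail Backpack',
    details: {
      specs: [
        { label: 'Weight', value: '1.2 kg' },
        { label: 'Volume', value: '30 L' }
      ]
    }
  }
};

// Turn the list of { label, value } pairs into a lookup object keyed by label.
function specsToObject(specs) {
  return Object.fromEntries(specs.map(spec => [spec.label, spec.value]));
}

const specs = specsToObject(sampleData.product.details.specs);
console.log(specs.Weight); // "1.2 kg"
```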

Special Selectors

Beyond plain CSS selectors, Herd supports a selector for the current context element and extraction of element properties:

Root Selector (`:root`)

The :root selector refers to the current element in context:

const data = await page.extract({
  items: {
    _$r: '.item',
    someElement: ':root',        // Extract text of the .item element itself
    classes: {
      _$: ':root',
      attribute: 'class'  // Extract class attribute of the same element
    }
  }
});
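
The `class` attribute comes back as a single string, so a small post-processing step can split it into individual class names. This is a sketch on hypothetical data; `sampleItem` and `classList` are illustrative names, not part of Herd.

```javascript
// Hypothetical item as returned by the schema above (illustrative data only).
const sampleItem = { someElement: 'Blue Widget', classes: 'item  featured on-sale' };

// Split a class attribute on whitespace, ignoring empty fragments and missing values.
function classList(classAttribute) {
  return (classAttribute || '').split(/\s+/).filter(Boolean);
}

console.log(classList(sampleItem.classes)); // [ 'item', 'featured', 'on-sale' ]
```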

Property Extraction

You can extract JavaScript properties from elements:

const data = await page.extract({
  dimensions: {
    _$: '.box',
    property: 'getBoundingClientRect'  // Get element dimensions
  },
  html: {
    _$: '.content',
    property: 'innerHTML'  // Get inner HTML
  }
});

Transformation Pipelines

Herd's transformation pipelines, or pipes, clean and convert extracted values:

Available Transformations

Pipe	Description	Example Input	Example Output
`trim`	Removes whitespace from start/end	`" Hello "`	`"Hello"`
`toLowerCase`	Converts text to lowercase	`"HELLO"`	`"hello"`
`toUpperCase`	Converts text to uppercase	`"hello"`	`"HELLO"`
`parseNumber`	Extracts numbers from text	`"$1,2K.45"`	`1200.45`
`parseDate`	Converts text to date	`"2024-01-15"`	`"2024-01-15T00:00:00.000Z"`
`parseDateTime`	Converts text to datetime	`"2024-01-15T12:00:00Z"`	`"2024-01-15T12:00:00.000Z"`

Using Transformations

Apply transformations using the pipes property:

const data = await page.extract({
  price: {
    _$: '.price',
    pipes: ['parseNumber']  // Convert "$1,234.56" to 1234.56
  },
  title: {
    _$: 'h1',
    pipes: ['trim', 'toLowerCase']  // Apply multiple transformations
  }
});
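
To make the idea of chaining concrete, here is a minimal sketch of how a list of pipe names could be applied to a value. It assumes pipes run left to right in the order listed, which this guide does not state explicitly. The `pipeFunctions` map and `applyPipes` helper are illustrative stand-ins, not Herd's implementation.

```javascript
// Illustrative stand-ins for three of the pipes listed above; not Herd's implementation.
const pipeFunctions = new Map([
  ['trim', value => value.trim()],
  ['toLowerCase', value => value.toLowerCase()],
  ['toUpperCase', value => value.toUpperCase()]
]);

// Apply each named pipe in order, feeding the output of one into the next.
function applyPipes(value, pipeNames) {
  return pipeNames.reduce((current, name) => {
    const fn = pipeFunctions.get(name);
    if (!fn) throw new Error(`Unknown pipe: ${name}`);
    return fn(current);
  }, value);
}

console.log(applyPipes('  Welcome to Our Website ', ['trim', 'toLowerCase']));
// "welcome to our website"
```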

Handling Currency and Large Numbers

The parseNumber transformation handles currency symbols, thousands separators, and magnitude suffixes such as M and T:

const data = await page.extract({
  price1: {
    _$: '.price-1',  // Contains "$1,234.56"
    pipes: ['parseNumber']  // Result: 1234.56
  },
  price2: {
    _$: '.price-2',  // Contains "$1.5M"
    pipes: ['parseNumber']  // Result: 1500000
  },
  price3: {
    _$: '.price-3',  // Contains "1.5T€"
    pipes: ['parseNumber']  // Result: 1500000000000
  }
});
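
For intuition, the conversions above can be approximated in plain JavaScript. This is only a sketch: it assumes commas are thousands separators and recognizes the suffixes K, M, B, and T, and Herd's real parseNumber may differ. `parseNumberSketch` and `MULTIPLIERS` are hypothetical names.

```javascript
// Multipliers for the magnitude suffixes this sketch recognizes (an assumption).
const MULTIPLIERS = { K: 1e3, M: 1e6, B: 1e9, T: 1e12 };

// Illustrative approximation only; Herd's real parseNumber may behave differently.
function parseNumberSketch(text) {
  // Drop thousands separators, then find the first number and an optional suffix.
  const match = text.replace(/,/g, '').match(/(-?\d+(?:\.\d+)?)\s*([KMBT])?/i);
  if (!match) return null;
  const base = parseFloat(match[1]);
  const suffix = match[2] ? match[2].toUpperCase() : null;
  return suffix ? base * MULTIPLIERS[suffix] : base;
}

console.log(parseNumberSketch('$1,234.56')); // 1234.56
console.log(parseNumberSketch('$1.5M'));     // 1500000
console.log(parseNumberSketch('1.5T€'));     // 1500000000000
```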

Real-World Examples

Let’s look at some practical examples of data extraction:

E-commerce Product Listing

Extract products from a search results page:

const searchResults = await page.extract({
  products: {
    _$r: '[data-component-type="s-search-result"]',
    title: {
      _$: 'h2 .a-link-normal',
      pipes: ['trim']
    },
    price: {
      _$: '.a-price .a-offscreen',
      pipes: ['parseNumber']
    },
    rating: {
      _$: '.a-icon-star-small .a-icon-alt',
      pipes: ['trim']
    },
    reviews: {
      _$: '.a-size-base.s-underline-text',
      pipes: ['trim']
    }
  }
});
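
The rating and review count above are only trimmed, so they remain strings. The sketch below converts them to numbers after extraction. It assumes the rating text looks like "4.5 out of 5 stars" and the review count like "1,234"; both formats, the `sampleProduct` data, and the `normalizeProduct` helper are assumptions for illustration.

```javascript
// Hypothetical product as returned by the schema above (illustrative data only).
const sampleProduct = {
  title: 'Wireless Mouse',
  price: 24.99,
  rating: '4.5 out of 5 stars',
  reviews: '1,234'
};

// Convert the rating and review-count strings into numbers, or null if absent.
function normalizeProduct(product) {
  const ratingMatch = /^(\d+(?:\.\d+)?)/.exec(product.rating || '');
  const reviewDigits = (product.reviews || '').replace(/[^\d]/g, '');
  return {
    ...product,
    rating: ratingMatch ? parseFloat(ratingMatch[1]) : null,
    reviews: reviewDigits ? parseInt(reviewDigits, 10) : null
  };
}

const normalized = normalizeProduct(sampleProduct);
console.log(normalized.rating, normalized.reviews); // 4.5 1234
```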

News Article List

Extract articles from a news site:

const articles = await page.extract({
  items: {
    _$r: '.item',
    title: {
      _$: 'h2',
      pipes: ['trim', 'toLowerCase']
    },
    price: {
      _$: '.price',
      pipes: ['parseNumber']
    },
    date: {
      _$: 'time',
      pipes: ['parseDate']
    }
  }
});
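
Because parseDate produces ISO 8601 strings (see the table of transformations), the extracted articles can be ordered by date afterwards. A minimal sketch on hypothetical data follows; `sampleArticles` and `sortNewestFirst` are illustrative names, and treating a missing date as `null` is an assumption.

```javascript
// Hypothetical articles after the parseDate pipe has run (illustrative data only).
const sampleArticles = [
  { title: 'older story', date: '2024-01-15T00:00:00.000Z' },
  { title: 'newest story', date: '2024-03-02T00:00:00.000Z' },
  { title: 'undated story', date: null }
];

// Sort newest first; articles without a parseable date go last.
function sortNewestFirst(articles) {
  const time = article => {
    const parsed = Date.parse(article.date);
    return Number.isNaN(parsed) ? -Infinity : parsed;
  };
  // `|| 0` guards against NaN when both articles lack a date.
  return [...articles].sort((a, b) => (time(b) - time(a)) || 0);
}

console.log(sortNewestFirst(sampleArticles).map(a => a.title));
// [ 'newest story', 'older story', 'undated story' ]
```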

Advanced Techniques

Handling Dynamic Content

For content that loads after the initial page load, wait for the element to appear before extracting:

// Wait for dynamic content to load
await page.waitForElement('#dynamic span');

// Then extract the content
const data = await page.extract({
  content: '#dynamic span'
});
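
If the element never appears, the wait could block for a long time. Whether waitForElement has its own timeout option is not covered here, so the sketch below uses a generic wrapper built on Promise.race instead. `withTimeout` is a hypothetical helper, not part of Herd.

```javascript
// Reject if the given promise does not settle within `ms` milliseconds.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Intended use (assumes `page` is a Herd page as in the examples above):
// await withTimeout(page.waitForElement('#dynamic span'), 10000, 'waitForElement');
```

Wrapping the wait this way turns a silent hang into an error that the calling code can catch and report.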

Extracting Page Metadata

Extract information about the page itself:

const pageInfo = await page.extract({
  title: 'title',
  metaDescription: {
    _$: 'meta[name="description"]',
    attribute: 'content'  // The description is stored in the content attribute, not as text
  },
  canonicalUrl: {
    _$: 'link[rel="canonical"]',
    attribute: 'href'
  }
});

Tips for Effective Extraction

Use Specific Selectors: The more specific your CSS selectors, the more reliable your extraction.
Test Incrementally: Build your extraction schema step by step, testing each part.
Handle Missing Data: Always account for elements that might not exist on the page.
Apply Appropriate Transformations: Use pipes to clean and format data as needed.
Combine with Interactions: For complex sites, interact with the page before extraction.
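
As one way to act on the missing-data tip, extracted items can be filtered after the fact. The sketch below assumes a missing field comes back as `null`, `undefined`, or an empty string, which this guide does not specify. The `sampleItems` data and the `keepComplete` helper are hypothetical.

```javascript
// Hypothetical result of a repeated extraction (illustrative data only).
const sampleItems = [
  { title: 'Blue Widget', price: '$10.00', date: '2024-01-15' },
  { title: 'Red Widget', price: null, date: '2024-01-16' },
  { title: '', price: '$7.50', date: '2024-01-17' }
];

// Keep only items where every required field has a non-empty value.
function keepComplete(items, requiredFields) {
  return items.filter(item =>
    requiredFields.every(field => {
      const value = item[field];
      return value !== null && value !== undefined && String(value).trim() !== '';
    })
  );
}

const complete = keepComplete(sampleItems, ['title', 'price']);
console.log(complete.length); // 1
```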

Next Steps

Now that you understand Herd’s data extraction system, you can:

Create complex extraction schemas for any website
Transform raw data into structured, usable formats
Build powerful automations that collect and process web data
