Index Optimization for Real-Time Filtering of Tens of Millions of Records with Astro & MySQL


We were facing a thorny performance issue. A “Task Monitoring Dashboard” in an internal admin panel needed to display, filter, sort, and paginate a MySQL table with nearly 80 million rows. The initial version, where the frontend made API requests to a backend that performed direct queries, worked fine with a small dataset. But as the tasks table grew, any operation involving multi-conditional filtering or sorting resulted in API response times exceeding 30 seconds. The page would completely freeze, rendering it unusable.

The initial stack was Astro with React UI components and a simple Node.js API layer. The bottleneck was clearly the database. Simply adding single-column indexes on the filter fields had a negligible effect. The problem was the dynamic nature of user queries: they might filter by task status, by a creation date range, or sort by both status and assignee simultaneously. This combinatorial query pattern made it difficult for any single-column index to be effective.

An initial idea was to bypass the API layer and leverage Astro’s Server-Side Rendering (SSR) capabilities. By connecting directly to the database during the page request, we could co-locate the data fetching and page rendering logic. This would save one network round-trip but wouldn’t solve the fundamental problem of slow database queries. The real challenge was designing database indexes to support these highly dynamic, multi-dimensional queries and enabling Astro to fetch and render the data in milliseconds on the server.

Phase 1: The Naive Implementation and Runaway Queries

Let’s review the initial design that led to this disaster. Here is a simplified structure of the tasks table:

-- Simplified 'tasks' table schema
CREATE TABLE `tasks` (
  `id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  `uuid` VARCHAR(36) NOT NULL,
  `task_name` VARCHAR(255) NOT NULL,
  `status` ENUM('PENDING', 'RUNNING', 'SUCCESS', 'FAILED', 'CANCELLED') NOT NULL,
  `priority` TINYINT UNSIGNED NOT NULL DEFAULT '0',
  `assignee_id` INT UNSIGNED,
  `created_at` DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `updated_at` DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `payload` JSON,
  PRIMARY KEY (`id`),
  KEY `idx_status` (`status`), -- Initial single-column index
  KEY `idx_created_at` (`created_at`) -- Initial single-column index
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

In the server-side script of our Astro component (the component script between the --- code fences of an .astro file), we retrieved filter conditions from the URL’s query parameters and constructed the SQL.

---
// src/pages/dashboard.astro
import TaskTable from '../components/TaskTable.tsx';
import Pagination from '../components/Pagination.tsx';
import dbClient from '../lib/mysql'; // A hypothetical database client

// Get filter parameters from the URL
const url = new URL(Astro.request.url);
const page = parseInt(url.searchParams.get('page') || '1', 10);
const status = url.searchParams.get('status'); // e.g., 'FAILED'
const assignee = url.searchParams.get('assignee');
const limit = 50;
const offset = (page - 1) * limit;

// Build the WHERE clause
let whereClauses = ['1=1'];
const queryParams = [];

if (status) {
  whereClauses.push('status = ?');
  queryParams.push(status);
}
if (assignee) {
  whereClauses.push('assignee_id = ?');
  queryParams.push(parseInt(assignee, 10));
}

const whereSql = whereClauses.join(' AND ');

// Query total count for pagination
const totalQuery = `SELECT COUNT(*) as count FROM tasks WHERE ${whereSql}`;
const [totalRows] = await dbClient.query(totalQuery, queryParams);
const totalTasks = totalRows[0].count;
const totalPages = Math.ceil(totalTasks / limit);

// Query data for the current page, ordered by creation time descending
const dataQuery = `
  SELECT uuid, task_name, status, priority, assignee_id, created_at
  FROM tasks
  WHERE ${whereSql}
  ORDER BY created_at DESC
  LIMIT ? OFFSET ?
`;
const [tasks] = await dbClient.query(dataQuery, [...queryParams, limit, offset]);

// ... render logic
---
<Layout>
  <!-- Filter UI components here -->
  <TaskTable client:load tasks={tasks} />
  <Pagination currentPage={page} totalPages={totalPages} />
</Layout>

This code worked fine when the tasks table had only a few tens of thousands of rows. But at the scale of tens of millions, running EXPLAIN on the dataQuery reveals a disastrous result, especially when a user navigates to a high page number (e.g., page 1000).

EXPLAIN SELECT uuid, task_name, status, priority, assignee_id, created_at
FROM tasks
WHERE status = 'FAILED'
ORDER BY created_at DESC
LIMIT 50 OFFSET 50000;

The EXPLAIN output from MySQL shows that even when the index on the status field is used for filtering, it cannot help with the sort. The worse culprit, though, is OFFSET. LIMIT 50 OFFSET 50000 means MySQL must locate 50,050 records matching the criteria, discard the first 50,000, and return only the last 50. Without a suitable index covering the sort order, this involves a massive filesort operation and data scan, and the cost grows linearly as the page number increases. (The separate COUNT(*) query for totalPages performs a similarly expensive scan at this scale.) This is the classic “deep pagination” problem.
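For reference, a representative EXPLAIN output (in \G format) at this point looks roughly like the following; the row estimate is illustrative and depends entirely on your data distribution:

           id: 1
  select_type: SIMPLE
        table: tasks
   partitions: NULL
         type: ref
possible_keys: idx_status
          key: idx_status
      key_len: 1
          ref: const
         rows: 15204833
     filtered: 100.00
        Extra: Using filesort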

Phase 2: Redemption Through Indexing - Composite and Covering Indexes

The core of the problem is that our query pattern is “filter + sort.” A single idx_status or idx_created_at index cannot satisfy both the WHERE and ORDER BY clauses efficiently. MySQL typically chooses only one index per query. If it uses idx_status for filtering, the sorting must be done in memory or a temporary file (filesort). If it tries to use idx_created_at to satisfy the sorting, the filtering on status becomes inefficient.

The solution is to design a Composite Index whose column order matches the query pattern. According to the Leftmost Prefix Rule, the order of columns in the index is critical. Our query logic is:

  1. Filter by status (optional)
  2. Filter by assignee_id (optional)
  3. Sort by created_at

A seemingly reasonable composite index would be (status, assignee_id, created_at).

-- Create a smarter composite index
ALTER TABLE `tasks` ADD INDEX `idx_query_panel` (`status`, `assignee_id`, `created_at` DESC);

Now, let’s re-examine our queries. When a user filters by both status and assignee_id and sorts by created_at, this index works perfectly: EXPLAIN shows ref access on the index, and because the matching index entries are already ordered by created_at, Using filesort disappears from the Extra column. (Note that explicit DESC index keys require MySQL 8.0; older versions parse and ignore the modifier, though MySQL can still scan an ascending index in reverse to satisfy the sort.)

But what if the user only filters by status? By the Leftmost Prefix Rule, the (status) part of idx_query_panel still works for filtering, but the sort no longer comes for free: because assignee_id sits between status and created_at in the index, the entries matching a given status are not ordered purely by created_at, and MySQL falls back to a filesort. And if the user filters only by assignee_id, the index is useless, because assignee_id is not a leftmost prefix.
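All three cases can be verified with EXPLAIN. A quick sketch, assuming idx_query_panel is the only secondary index in play (the value 42 is a placeholder):

-- Case 1: both prefix columns pinned by equality -> index handles filter AND sort
EXPLAIN SELECT id FROM tasks
WHERE status = 'FAILED' AND assignee_id = 42
ORDER BY created_at DESC LIMIT 50;
-- type: ref, no "Using filesort"

-- Case 2: status only -> the (status) prefix filters, but the sort needs a filesort,
-- because assignee_id sits between status and created_at in the index
EXPLAIN SELECT id FROM tasks
WHERE status = 'FAILED'
ORDER BY created_at DESC LIMIT 50;
-- type: ref, Extra: Using filesort

-- Case 3: assignee_id only -> not a leftmost prefix, the index cannot be used
EXPLAIN SELECT id FROM tasks
WHERE assignee_id = 42
ORDER BY created_at DESC LIMIT 50;
-- type: ALL (full table scan), plus a filesort for the ORDER BY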

The takeaway is that indexes must be designed around the most common query patterns. After consulting with the operations team, we identified the two most frequent ones:

  1. Filter by status, sort by created_at.
  2. Filter by status and assignee_id, sort by created_at.

Therefore, idx_query_panel(status, created_at DESC) and idx_query_panel_assignee(status, assignee_id, created_at DESC) would be better choices. Let’s focus on the first, most fundamental scenario.

-- A composite index designed for the core use case
ALTER TABLE `tasks` DROP INDEX `idx_query_panel`; -- Remove the old one
ALTER TABLE `tasks` ADD INDEX `idx_status_created` (`status`, `created_at` DESC);

This index, idx_status_created, can efficiently handle requests that “filter by status, sort by time.” The query WHERE status = ? ORDER BY created_at DESC can fully leverage this index, avoiding a filesort.
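For completeness, the second index from the list above (the one the Phase 3 code assumes when filtering by assignee) is created the same way:

-- Companion index for the "status + assignee" pattern
ALTER TABLE `tasks` ADD INDEX `idx_query_panel_assignee` (`status`, `assignee_id`, `created_at` DESC);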

But we can do even better. The current query SELECT uuid, task_name, ... still requires a table lookup. After finding the primary key id via the index, MySQL has to go back to the clustered index to fetch the data for columns like uuid and task_name. When many rows are returned (as in the case of a deep pagination scan), these lookups represent a significant overhead.

This is where a Covering Index comes into play. If an index contains all the columns required by a query, MySQL doesn’t need to perform table lookups; it can return the data directly from the index. This is a massive performance boost.

We can add all the fields displayed on the list page to our index.

-- The ultimate form: a covering index
-- Note: The index becomes very large, a classic space-for-time tradeoff
ALTER TABLE `tasks` DROP INDEX `idx_status_created`;
ALTER TABLE `tasks` ADD INDEX `idx_cover_dashboard` (
  `status`,
  `created_at` DESC,
  `assignee_id`,
  `uuid`,
  `task_name`,
  `priority`
);

Now, the Extra column in the EXPLAIN output will show Using index. This means the entire query—from filtering and sorting to fetching the final data—is completed within the idx_cover_dashboard index structure, without ever touching the primary table data. This is the pinnacle of query performance.
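Verifying this is straightforward. Note that the SELECT list must stay within the indexed columns, otherwise MySQL reverts to table lookups:

EXPLAIN SELECT status, created_at, assignee_id, uuid, task_name, priority
FROM tasks
WHERE status = 'FAILED'
ORDER BY created_at DESC
LIMIT 50;
-- Extra: Using index  (no table lookups, no filesort)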

Phase 3: Taming Deep Pagination with a Cursor-Based Approach

The covering index solved the filtering, sorting, and table lookup problems, but the linear performance degradation caused by OFFSET still exists. We need a pagination method that doesn’t rely on OFFSET, commonly known as Keyset Pagination or cursor-based pagination.

The principle is simple: instead of telling the database to “skip N rows,” we tell it to “start fetching from the point I last saw (the cursor).” This “cursor” is the value of the sorting column from the last record of the previous page. In our case, that’s created_at.
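In raw SQL, the difference between the two styles looks like this (the timestamp is an illustrative cursor value):

-- OFFSET pagination: walks 50,050 index entries, discards 50,000
SELECT uuid, task_name FROM tasks
WHERE status = 'FAILED'
ORDER BY created_at DESC
LIMIT 50 OFFSET 50000;

-- Keyset pagination: seeks straight to the cursor position, reads 50 entries
SELECT uuid, task_name FROM tasks
WHERE status = 'FAILED' AND created_at < '2024-03-01 12:00:00'
ORDER BY created_at DESC
LIMIT 50;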

Here’s the refactored server-side code in Astro:

// src/lib/data-fetcher.ts

import dbClient from './mysql';

// Define a more robust data fetching function
interface FetchTasksParams {
  status?: string;
  assigneeId?: number;
  limit?: number;
  // The cursor is the created_at timestamp of the last record from the previous page.
  // For the first page, this value is undefined.
  cursor?: string; 
}

interface FetchTasksResult {
  tasks: any[];
  nextCursor: string | null;
}

// Production-grade task fetching logic
export async function fetchTasks({
  status,
  assigneeId,
  limit = 50,
  cursor
}: FetchTasksParams): Promise<FetchTasksResult> {
  try {
    const whereClauses: string[] = [];
    const queryParams: (string | number | Date)[] = []; // Date is needed for the cursor value below

    // 1. Build the base WHERE conditions
    if (status) {
      whereClauses.push('status = ?');
      queryParams.push(status);
    }
    if (assigneeId) {
      // Relies on the (status, assignee_id, created_at) index, idx_query_panel_assignee
      whereClauses.push('assignee_id = ?');
      queryParams.push(assigneeId);
    }

    // 2. Handle the cursor - this is the key
    if (cursor) {
      // The cursor is an ISO format string "YYYY-MM-DDTHH:mm:ss.sssZ"
      // We need to tell MySQL to find records *before* this timestamp
      whereClauses.push('created_at < ?');
      queryParams.push(new Date(cursor));
    }
    
    const whereSql = whereClauses.length > 0 ? `WHERE ${whereClauses.join(' AND ')}` : '';

    // 3. Construct the final query
    // Note: ORDER BY must strictly match the index order
    const query = `
      SELECT 
        status, 
        created_at, 
        assignee_id, 
        uuid, 
        task_name, 
        priority
      FROM tasks
      ${whereSql}
      ORDER BY created_at DESC
      LIMIT ?
    `;

    // Fetch N+1 records to check if there's a next page
    const [tasks] = await dbClient.query(query, [...queryParams, limit + 1]);

    // 4. Determine if there is a next page and generate nextCursor
    let nextCursor: string | null = null;
    if (tasks.length > limit) {
      // If N+1 records are returned, there's a next page
      // We return only N records to the client and use the N-th record to generate the next cursor
      const lastTask = tasks[limit - 1]; 
      // Convert the Date object to an ISO string for URL transport
      nextCursor = lastTask.created_at.toISOString();
      tasks.splice(limit); // Remove the extra record
    }

    return { tasks, nextCursor };

  } catch (error) {
    // In a real project, detailed logging should be here
    console.error('Failed to fetch tasks:', error);
    // Throw a controlled error to the frontend
    throw new Error('Database query failed.');
  }
}

In our Astro page, we would use it like this:

---
// src/pages/dashboard.astro
import { fetchTasks } from '../lib/data-fetcher';

const url = new URL(Astro.request.url);
const status = url.searchParams.get('status') || undefined;
const cursor = url.searchParams.get('cursor') || undefined;

let result;
let error = null;

try {
  result = await fetchTasks({ status, cursor, limit: 50 });
} catch (e) {
  error = e instanceof Error ? e.message : 'An unknown error occurred.';
}

// The pagination component now needs to handle a cursor instead of a page number
// The "Next Page" button link will be /dashboard?status=FAILED&cursor=...
const nextPageUrl = result?.nextCursor
  ? `/dashboard?status=${encodeURIComponent(status || '')}&cursor=${encodeURIComponent(result.nextCursor)}`
  : null;

---
<Layout>
  {error && <p class="error">{error}</p>}
  {result && <TaskTable client:load tasks={result.tasks} />}
  {nextPageUrl && <a href={nextPageUrl}>Next Page</a>}
  <!-- A more complex component would be needed to handle "previous page" and "first page".
       Keyset pagination makes "previous" more complex, so often only "next" and "back to start" are provided. -->
</Layout>

The performance of this method is constant. No matter which page you are on, the query time is virtually the same because it always starts scanning from a precise location (created_at < ?) and reads limit records forward. With the help of the covering index, this is extremely fast.

Architecture Diagram and Final Flow

The entire request-response flow can be described with the following diagram:

sequenceDiagram
    participant User
    participant Browser
    participant Astro as Astro (SSR)
    participant MySQL as MySQL Database

    User->>Browser: Clicks "Next Page" (URL contains cursor)
    Browser->>Astro: GET /dashboard?status=FAILED&cursor=...
    Astro->>Astro: Parse URL params (status, cursor)
    Astro->>MySQL: Execute optimized SELECT query (WHERE created_at < cursor ORDER BY created_at DESC LIMIT 51)
    Note right of Astro: Uses covering index idx_cover_dashboard,<br/>query performance is stable at under 5ms
    MySQL-->>Astro: Returns 51 records
    Astro->>Astro: Process result: take first 50, generate nextCursor from the 50th record
    Astro->>Browser: Return server-rendered HTML page
    Browser->>User: Display new page of data and new "Next Page" link

The key to this architecture is moving the data-intensive operations entirely to the server and leveraging deep optimization of database indexes and query patterns to ensure SSR performance. Astro acts as the glue layer and rendering engine, while the UI component library (regardless of choice) handles lightweight client-side interactions (like client-side sorting or highlighting after client:load), with the heavy lifting already completed by the server.

Limitations and Future Directions

This solution isn’t a silver bullet. First, implementing keyset pagination is more complex than traditional OFFSET pagination, especially for handling “previous page” or jumping to an arbitrary page number. Our implementation simplifies this by only providing a “next page” link. Scenarios requiring full pagination controls might need a more complex bi-directional cursor logic.
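A related caveat that our implementation glosses over: created_at is not guaranteed to be unique (DATETIME stores second precision by default), so rows sharing a timestamp can be skipped or duplicated across page boundaries. The usual fix is to add the primary key as a tiebreaker in both the ORDER BY and the cursor condition; a minimal sketch:

-- The cursor now carries (created_at, id) from the last row of the previous page
SELECT uuid, task_name, status, priority, assignee_id, created_at, id
FROM tasks
WHERE status = 'FAILED'
  AND (created_at < ? OR (created_at = ? AND id < ?))
ORDER BY created_at DESC, id DESC
LIMIT 51;
-- Check EXPLAIN after this change: depending on how the index is defined,
-- MySQL may still sort the small set of tied rows.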

Second, the covering index strategy is a classic example of a space-for-time tradeoff. The idx_cover_dashboard index can become quite large, as it includes multiple columns. This increases the overhead of write operations (INSERT, UPDATE), as this large index must be updated with every write. In write-heavy, read-light scenarios, the cost of this strategy must be carefully evaluated.

Finally, if future query dimensions are added, such as filtering by priority, we would need to decide whether to expand this “all-in-one” index or create multiple, smaller indexes tailored to specific query patterns. This involves continuous monitoring and analysis of business query patterns. When query combinations become exceedingly complex, relying solely on MySQL may no longer be the best choice. Introducing a dedicated search engine like Elasticsearch to handle complex, multi-dimensional filtering and aggregation could be a more scalable evolutionary path.

