MirrorWeb offers automated website and social media archiving services with full text search capability for all content. The UK government hired MirrorWeb to provide search services across 20 years of archived data from over 4,800 websites. In this session, MirrorWeb discusses the technology stack they built using Amazon Elasticsearch Service (Amazon ES) to search across the 333 million unique documents (over 120 TB) that they indexed within a 10-hour period. They discuss how they moved data from on-premises to Amazon S3 using AWS Snowball and then processed that data using Amazon EC2 Spot Instances, reducing costs by over 90%. They also talk about how they used AWS Lambda to ingest data into Amazon ES. Finally, they share best practices for building a large-scale document search architecture.
Publisher: Amazon Web Services
You can watch this video also at the source.