Task 1 - What is content discovery?
- It’s a very broad term: basically any content you can find that’s relevant to whatever you’re doing
- Could be:
- Old pages left online but removed from navigation, config files, backups, screenshots someone posted publicly by accident, etc.
- Three main ways of discovering said content
- Manually
- Automated
- OSINT (Open-Source Intelligence)
Task 2 - Manual Discovery - Robots.txt
- Common file that tells search engine crawlers which pages to avoid. We aren’t robots, so we can read it and see exactly what the crawlers are told not to look at.
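
A minimal sketch of pulling the "avoid" list out of a robots.txt, assuming you have already saved the file's text. The function name `disallowed_paths` and the sample contents are made up for illustration.

```python
def disallowed_paths(robots_txt: str) -> list[str]:
    """Return every path named in a Disallow rule."""
    paths = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path:  # an empty Disallow blocks nothing, so skip it
                paths.append(path)
    return paths


# Hypothetical robots.txt contents, not taken from a real site.
sample = """User-agent: *
Disallow: /staff-portal
Allow: /
Disallow:
"""
print(disallowed_paths(sample))  # ['/staff-portal']
```

Each path that comes back is worth opening in a browser by hand.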
Task 3 - Manual Discovery - Favicon
Task 4 - Manual Discovery - Sitemap.xml
- The opposite of Robots.txt! This is what search engines should index.
- Can sometimes contain pages that are still active but not necessarily meant to be.
- Makes it easy to see a list of (some) URLs/content
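
Getting that list of URLs out of a sitemap can be sketched like this, assuming the file follows the standard sitemaps.org schema and you already have its text. `sitemap_urls` and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml: str) -> list[str]:
    """Return the text of every <loc> element in a sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


# Hypothetical sitemap contents, not taken from a real site.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/old-news</loc></url>
</urlset>"""
print(sitemap_urls(sample))  # ['https://example.com/', 'https://example.com/old-news']
```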
Task 5 - Manual Discovery - HTTP Headers
- Often the headers will include the webserver software, or other identifying information if not hidden.
- Good for finding out which software versions are running!
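
A rough sketch of picking the identifying headers out of a response, assuming you already have the headers as a dict (for example, copied from `curl -v` output). The two header names checked and the sample values are my own picks, not a complete list.

```python
# Headers that commonly reveal server software; an assumed, incomplete list.
INTERESTING = ("server", "x-powered-by")


def identifying_headers(headers: dict[str, str]) -> dict[str, str]:
    """Keep only the headers that tend to name software and versions."""
    return {name: value for name, value in headers.items() if name.lower() in INTERESTING}


# Hypothetical response headers, not from a real server.
sample = {
    "Server": "nginx/1.18.0 (Ubuntu)",
    "X-Powered-By": "PHP/7.4.3",
    "Content-Type": "text/html",
}
print(identifying_headers(sample))
# {'Server': 'nginx/1.18.0 (Ubuntu)', 'X-Powered-By': 'PHP/7.4.3'}
```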
Task 6 - Framework Stack
- If a framework is identified, perhaps it has a default admin page? Go from there!
Task 7 - OSINT - Google Hacking / Dorking
- Google Dorking is just a fun name for using Google’s more advanced search features to find content without scanning/tooling.
- You can search for specific sites, words in urls, and a lot more
- Placeholder - I’ll be making a personal cheatsheet for this!
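
Until the cheatsheet exists, here is a small sketch of gluing search operators into one query string. `site:` and `inurl:` are real Google operators; the helper `build_dork` and the example values are hypothetical.

```python
def build_dork(terms: str = "", **operators: str) -> str:
    """Join operator:value pairs and plain search terms into one query."""
    parts = [f"{op}:{value}" for op, value in operators.items()]
    if terms:
        parts.append(terms)
    return " ".join(parts)


# Only results from one site whose URL contains "login".
print(build_dork("admin", site="example.com", inurl="login"))
# site:example.com inurl:login admin
```

The resulting string just gets pasted into the search box.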
Task 8 - OSINT - Wappalyzer
- I’ve actually used this tool for a while, check it out here
Task 9 - OSINT - Wayback Machine
- I never thought about using the Wayback Machine for OSINT, but after reading about it being used like this, it makes perfect sense.
Task 10 - Github
- People leave secrets in public repos all the time. Version history keeps a record and snitches on them, even if they later remove or change the secret, unless they are careful to scrub the history too.
- Hilariously, when I was first starting to code (7 years ago!) I left a Discord bot token in a public repo. Never did that again ;)
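
To show how history snitches, here is a rough sketch that scans patch text (the kind of output `git log -p` prints) for added or removed lines that look like hard-coded secrets. The regex is a naive guess at what a secret looks like, and `leaked_lines` plus the sample patch are made up.

```python
import re

# Naive assumption: a secret is a suspicious name assigned a quoted literal.
SECRET_HINT = re.compile(
    r"(token|secret|password|api[_-]?key)\s*[:=]\s*[\"'][^\"']+[\"']",
    re.IGNORECASE,
)


def leaked_lines(patch_text: str) -> list[str]:
    """Return added/removed patch lines that look like hard-coded secrets."""
    hits = []
    for line in patch_text.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file headers, not content
        if line.startswith(("+", "-")) and SECRET_HINT.search(line):
            hits.append(line)
    return hits


# Hypothetical patch: the token was "removed", but the old line is still in history.
sample_patch = """--- a/bot.py
+++ b/bot.py
-TOKEN = "abc123-not-a-real-token"
+TOKEN = os.environ["BOT_TOKEN"]
"""
print(leaked_lines(sample_patch))  # ['-TOKEN = "abc123-not-a-real-token"']
```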
Task 11 - Amazon S3 Buckets
- S3 is a storage service provided by Amazon AWS.
- Files are given permissions, and if they are set incorrectly, you could find a lot of information you aren’t meant to see.
- The format for bucket URLs is
http(s)://{name}.s3.amazonaws.com where {name} is whatever the org chose.
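
Since the format is predictable, candidate bucket URLs can be generated from guessed names. A minimal sketch, where the suffixes are my own guesses at what an org might pick and `bucket_urls` is a hypothetical helper:

```python
def bucket_urls(org: str, suffixes=("assets", "www", "public", "private")) -> list[str]:
    """Build candidate bucket URLs of the form https://{name}.s3.amazonaws.com."""
    # The suffix list is an assumption, not a known naming rule.
    return [f"https://{org}-{suffix}.s3.amazonaws.com" for suffix in suffixes]


for url in bucket_urls("example"):
    print(url)
# https://example-assets.s3.amazonaws.com
# https://example-www.s3.amazonaws.com
# https://example-public.s3.amazonaws.com
# https://example-private.s3.amazonaws.com
```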
Task 12 - Automated Discovery
- This is the process of using tools to automatically search for content, rather than doing it yourself. Never would have guessed that… ;)
- Wordlists
- Lists of common words, sometimes passwords, etc
- Tools
- There are a ton! Do some googling here, it’s what I did :)
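
The core idea behind these tools can be sketched in a few lines: take a wordlist and turn each word into a URL to try. This only builds the list (a real tool would then request each URL and look at the status code). `candidate_urls` and the sample wordlist are hypothetical.

```python
def candidate_urls(base_url: str, wordlist_text: str) -> list[str]:
    """Turn each non-empty, non-comment wordlist entry into a URL to try."""
    base = base_url.rstrip("/")
    urls = []
    for line in wordlist_text.splitlines():
        word = line.strip()
        if not word or word.startswith("#"):
            continue  # skip blank lines and comments
        urls.append(f"{base}/{word.lstrip('/')}")
    return urls


# Hypothetical wordlist contents.
words = "admin\n\n# comment\n/backup\n"
print(candidate_urls("http://example.com/", words))
# ['http://example.com/admin', 'http://example.com/backup']
```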