Student Internship Position (Lookyloo-03)

Overview - Student Internship Position - Lookyloo executive summary, integration with third party services

Lookyloo started as a side project aiming to help internal teams in media organisations to have an overview of the content loaded on their websites. It is now used on a daily basis by CIRCL in order to analyse phishing and other malicious websites in the context of incident response.

The current status of the system makes it relatively simple for an analyst with a good understanding of web technologies to understand what is going on on a website, but there is still work to do in order to make the capture easier to understand and analyse by less technical users.

One way to make their life easier is to integrate third party services in Lookyloo, and to allow submitting information gathered during a capture to other dedicated systems.

The goals of this internship are the following:

  • Getting used to the concepts and limitations of web scraping with the following tutorial
  • Getting used to Lookyloo. The testing page, and the examples are a good starting point. Expand them if needed.
  • Create an overview page summarizing the information gathered by the capture on Lookyloo, and by the 3rd party modules (if enabled), so that all the information is visible in one place.
  • Create a module to integrate Lookyloo with Pandora (the project will be opensourced by the time the internship starts).
  • (Optional) Create a module for CIRCL Passive DNS
  • (Optional) Create a module for BGP Ranking
  • (Optional) Figure out how to represent nodes containing calls to tracking/Ads infrastructures, probably using uBlock Origin datasets.
  • Write a report explaining the findings made along the way

Summary page

Lookyloo is mainly an investigation tool allowing the users to dig into a capture and find the specific details they’re looking for, and this is not going to change. Even so, it would be useful to have a place where we can see all the main information about the capture without having to click through multiple views.

The first task will be to figure out which information (statistics of the capture itself, reports from 3rd party services, heuristics from similar captures, …) we want to represent. Then, we need to decide how to represent it in a way that makes sense, and finally implement it.

It is important to keep in mind that Lookyloo can be used in an information security context (i.e. investigating malicious websites), but it is also used on perfectly legitimate websites to investigate privacy violations, and by administrators of complex websites to have an overview of their infrastructure.
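As an illustration of the kind of capture statistics a summary page could show, here is a minimal sketch that derives a few counters from a HAR document. It only relies on the standard HAR layout (log, entries, request.url, response.content.mimeType). The function name summarise_har and the sample data are hypothetical and are not part of Lookyloo or har2tree.

```python
from collections import Counter
from urllib.parse import urlsplit


def summarise_har(har: dict) -> dict:
    """Compute simple statistics from a HAR document (hypothetical helper)."""
    entries = har.get("log", {}).get("entries", [])
    hosts = Counter()
    mime_types = Counter()
    for entry in entries:
        # Count requests per hostname.
        url = entry.get("request", {}).get("url", "")
        hosts[urlsplit(url).hostname or "(unknown)"] += 1
        # Count responses per MIME type, ignoring parameters such as charset.
        mime = entry.get("response", {}).get("content", {}).get("mimeType", "")
        mime_types[mime.split(";")[0].strip() or "(unknown)"] += 1
    return {
        "total_requests": len(entries),
        "unique_hosts": len(hosts),
        "requests_per_host": dict(hosts),
        "mime_types": dict(mime_types),
    }


# Made-up sample data, for illustration only.
sample_har = {
    "log": {
        "entries": [
            {"request": {"url": "https://example.com/"},
             "response": {"content": {"mimeType": "text/html; charset=utf-8"}}},
            {"request": {"url": "https://example.com/app.js"},
             "response": {"content": {"mimeType": "application/javascript"}}},
            {"request": {"url": "https://cdn.example.net/logo.png"},
             "response": {"content": {"mimeType": "image/png"}}},
        ]
    }
}

summary = summarise_har(sample_har)
print(summary)
```

Which of these figures actually belong on the summary page, and how they are presented, is part of the first task described above.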

Pandora

Pandora is a tool developed initially by CERT-AG that CIRCL was tasked to turn into an open source project. At the time of writing this document, it is not available publicly, but it will be by the time the internship starts.

Pandora is a quick analysis tool for suspicious files that can be either uploaded directly on the web interface, or forwarded by mail. The integration with Lookyloo will consist in pushing URLs extracted from the emails or files to Lookyloo and triggering a capture. From Lookyloo, we want to take resources gathered during a capture and submit them to Pandora for analysis.
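The first half of that flow, extracting URLs from a forwarded email, can be sketched with the Python standard library alone. This is only an illustration under simplifying assumptions: the name extract_urls, the regular expression and the sample message are hypothetical, only text/plain parts are inspected, and the actual submission to Lookyloo is left out because its interface is not described in this document.

```python
import re
from email import message_from_string
from typing import List

# Deliberately simple pattern; real-world URL extraction needs more care.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(raw_email: str) -> List[str]:
    """Return the unique URLs found in the text/plain parts of an email."""
    message = message_from_string(raw_email)
    urls: List[str] = []
    for part in message.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        text = payload.decode(charset, errors="replace")
        for url in URL_PATTERN.findall(text):
            # Drop punctuation that commonly trails a URL in running text.
            url = url.rstrip(".,;:)")
            if url not in urls:
                urls.append(url)
    return urls


# Made-up sample message, for illustration only.
raw = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: Invoice\n"
    "\n"
    "Please check http://example.com/login, then https://example.org/a?b=1.\n"
    "Again: http://example.com/login\n"
)

found = extract_urls(raw)
print(found)
```

Each URL in the resulting list would then be a candidate for a capture.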

Other integrations

If time allows, you will implement more 3rd party modules to integrate Lookyloo with other systems. This part will be discussed later on, depending on the needs at the time.

Write-up of the findings

Web technologies are extremely versatile, allowing the developers to do a lot of extremely odd things. Many of them are untangled by Lookyloo and the libraries used in the project. You will certainly discover things you (and the maintainers of the project) didn’t know about, so it will be important to document them.

Current status of the project

Lookyloo connects together a few tools in a consistent manner:

  • Scrapy, a webcrawling framework (Python).
  • Splash, a webservice used for rendering the website and generating the HTTP Archive (HAR) file (runs in a Docker container).
  • ETE Toolkit, a Python framework for the analysis and visualization of (phylogenetic) trees (Python).
  • D3JS, for the visualisation of the tree in the browser (JavaScript).

  • ScrapySplashWrapper, a simplistic library relying on Scrapy to filter the resources to open on the website under investigation. It then queries Splash, and formats and returns the data generated by it (Python 3.8+).
  • har2tree, a library that generates an ETE Toolkit tree from the HAR file, and other data returned by Splash (Python 3.8+).
  • Lookyloo glues all the parts together (Python 3.8+, JavaScript, CSS, HTML). Note that the webserver used is Flask.
  • Pandora - the project will be opensourced by the time the internship starts

The current code is stable but needs a lot of improvements in order to support the required features.

Your task is to understand the code and interfaces to other services and bring the code to the next level.

Your work will be part of the daily activities of CIRCL, and will be used by countless people doing lookups against our web service.

If this is a challenge you would like to accept, talk to us!

Qualification

  • Must be an EU citizen with a valid work permit in Luxembourg
  • Must be eligible for a student internship in the field of information security and/or computer science
  • Must have a high level of ethics due to the nature of the work
  • Must be fluent in English, Unix, git, and Python. JavaScript and web development in general would be a plus.
  • Contributions performed under this internship will be released as free software

How to apply

The application package must include the following:

  • A resume in ASCII text format
  • A motivation letter explaining why you are interested in the internship

The package is to be sent to info(@)circl.lu indicating reference internship-lookyloo-03.

Application deadline

The deadline for the application is the 15th of March 2022. Applications received after the deadline will not be considered.

Classification of this document

TLP:WHITE information may be distributed without restriction, subject to copyright controls.