Student Internship and Bachelorarbeit Position (Bachelorarbeit-02)

Overview - Student Internship and Bachelorarbeit Position - Lookyloo integration with takedown infrastructure and automatic classification

Lookyloo started as a side project aiming to help internal teams in media organisations to have an overview of the content loaded on their websites. It is now used on a daily basis by CIRCL in order to analyse phishing and other malicious websites in the context of incident response.

The system currently makes it relatively simple for an analyst with a good understanding of web technologies to see what is happening on a website. However, given the amount of malicious content submitted to the platform, takedowns have to be handled through a partly automated process that ensures the owner(s) of the websites/infrastructure are informed and that the content is then made inaccessible to potential victims.

In a second phase, we want to automatically classify the captures based on their content, with minimal user interaction, in order to save the analyst time.

Internship (3 months)

The goals of the internship are twofold:

  1. give the intern time to become familiar with the tools developed by CIRCL, understand their use cases, and install and configure all of them in a development environment. This step is very important, as all the tools will be used at one point or another during the 6 months, and it is important to have a mental model of the current infrastructure.

Related resources:

  • Getting used to the concepts and limitations of web scraping following the workshop given at hack.lu
  • Setting up a working environment that includes Lookyloo (+ monitoring), Lacus, Pandora, and MISP
  • Understand with the team the needs for automation in the takedown process using a ticketing system (RTIR in CIRCL’s case)
  2. work in close collaboration with the incident response team to understand their work, how they handle malicious websites, and finally how they (attempt to) take them down. There is already a lot of tooling in place to do that, but Lookyloo isn’t completely integrated, and the core goal of the internship is to finalize that integration.

Tasks:

  • Decide on a workflow and steps to integrate Lookyloo into this process
  • Figure out the missing parts in the current code base and develop the code needed accordingly
  • [Optional] Refactor web interface in order to remove dependency on jQuery
  • [Optional] Improve captcha bypass (ongoing issue, needs constant updates)
  • [Optional] Trigger the same captures using different proxies from different countries

This internship will be carried out in collaboration with another internship in order to avoid duplicating work and to give a practical example of open-source development in a small team. Both interns will pick specific standalone tasks depending on their interests, but the goal by the end of the internship is to have a complete workflow to manage and track takedown requests.

Bachelorarbeit (3 months)

The handling and takedown is partially automated, but an analyst still has to find malicious websites manually. The current tool helps them find these websites relatively quickly, but there is no pre-processing of the massive number of captures (~200,000/month) triggered on Lookyloo, so many are very probably missed.

The Bachelorarbeit will be an introduction to machine learning, used to automatically attach tags (type of hosting, parking pages, presence of captcha, type of content, …) based on indicators gathered during the capture.

If there is enough time, it would be nice to use the machine learning skills learned in the first phase to automatically flag potential GDPR violations on arbitrary websites.

Current status of the project

Lookyloo connects together a few tools in a consistent manner:

  • Playwright, a framework for Web Testing and Automation. Used here to capture a URL, get a HAR export, a screenshot, and the cookiejar.
  • ETE Toolkit, a Python framework for the analysis and visualization of (phylogenetic) trees (Python).
  • D3JS, for the visualisation of the tree in the browser (JavaScript).

  • Lacus, a capturing system using Playwright, exposed as a web service.
  • har2tree, a library that generates an ETE Toolkit tree from the HAR file, and other data returned by Splash (Python 3.8+)
  • Lookyloo glues all the parts together (Python 3.8+, JavaScript, CSS, HTML). Note that the web server used is Flask.
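
To give a feel for the kind of data involved, here is a minimal sketch of how the requests in a HAR export could be arranged into a tree. It is not har2tree's actual algorithm: it assumes the standard HAR layout (log, entries, request, headers) and uses only the Referer header to find a parent, whereas a real capture needs to account for redirects, iframes, and more. The function names and the sample data are hypothetical and only serve as illustration.

```python
from collections import defaultdict

# Hypothetical, hand-written HAR fragment (not the output of a real capture).
SAMPLE_HAR = {
    "log": {"entries": [
        {"request": {"url": "https://example.com/", "headers": []}},
        {"request": {"url": "https://ads.example.net/frame.html",
                     "headers": [{"name": "Referer",
                                  "value": "https://example.com/"}]}},
        {"request": {"url": "https://ads.example.net/pixel.gif",
                     "headers": [{"name": "referer",
                                  "value": "https://ads.example.net/frame.html"}]}},
        {"request": {"url": "https://example.com/logo.png", "headers": []}},
    ]}
}


def referer_of(entry):
    """Return the Referer header value of a HAR entry, or None."""
    for header in entry["request"].get("headers", []):
        if header["name"].lower() == "referer":
            return header["value"]
    return None


def build_tree(har):
    """Return (root_url, {parent_url: [child_url, ...]}).

    Assumption: the first entry is the root. An entry whose referer has
    not been seen yet is attached to the root, and repeated URLs are
    skipped so the result is always a tree.
    """
    entries = har["log"]["entries"]
    if not entries:
        return None, {}
    root = entries[0]["request"]["url"]
    seen = {root}
    children = defaultdict(list)
    for entry in entries[1:]:
        url = entry["request"]["url"]
        if url in seen:
            continue
        parent = referer_of(entry)
        if parent not in seen:
            parent = root
        children[parent].append(url)
        seen.add(url)
    return root, dict(children)


def render(url, children, depth=0):
    """Return the tree below `url` as a list of indented lines."""
    lines = ["  " * depth + url]
    for child in children.get(url, []):
        lines.extend(render(child, children, depth + 1))
    return lines


if __name__ == "__main__":
    root, children = build_tree(SAMPLE_HAR)
    print("\n".join(render(root, children)))
```

In this sample, the pixel ends up below the frame that requested it, while the logo, which has no referer, is attached directly to the root page.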

The current code is stable but needs a lot of improvements in order to support the required features.

Your task is to understand the code and interfaces to other services and bring the code to the next level.

Your work will become part of the daily activities of CIRCL and will be used by countless people doing lookups against our web service.

If this is a challenge you would like to accept, talk to us!

Qualification

  • Must be an EU citizen with a valid work permit in Luxembourg
  • Must be eligible for a student internship in the field of information security and/or computer science
  • Must have a high level of ethics due to the nature of the work
  • Must be fluent in English, Unix, git, and Python. JavaScript and web development in general would be a plus.
  • Contributions performed under this internship will be released as free software

How to apply

The application package must include the following:

  • A resume in ASCII text format
  • A motivation letter explaining why you are interested in the internship

The package is to be sent to info(@)circl.lu indicating reference internship-bachelorarbeit-02.

Application deadline

The deadline for the application is the 15th of January 2022. Applications received after the deadline will not be considered.

Classification of this document

TLP:CLEAR information may be distributed without restriction, subject to copyright controls.