Datasets
**Comprehensive data is accessible via the provided
[link](https://github.com/llm-refactoring/llm-refactoring-plugin/tree/main/datasets)**.
To validate our technique, we used the following datasets:
1. _Community Corpus-A_ consists of **122** Java methods and their corresponding Extract
Method refactorings collected
from five open-source repositories: MyWebMart, SelfPlanner, WikiDev, JHotDraw, and JUnit.
This dataset previously served as the foundation for evaluating various state-of-the-art
Extract Method refactoring
automation tools, including JExtract, JDeodorant, SEMI, GEMS, and REMS.
2. _Community Corpus-B_: Silva et al. maintain an active corpus containing 448 Java methods,
each accompanied by the Extract Method refactorings that open-source developers actually
performed. Automatable changes represent “clean” commits in which the developers made a
single type of change, i.e., performed only an Extract Method refactoring. We focus on this
category because we aim to replicate the exact development scenarios in which our tool only
performs refactorings and does not extend the code's functionality. This resulted in
**154** replicable refactorings.
3. _Extended Corpus_: To enhance the robustness of our evaluation with a sizable oracle of
actual refactorings performed by developers, we constructed the Extended Corpus. To create
it, we used RefactoringMiner to detect _Extract Method_ refactorings. We ran it on two
highly regarded open-source repositories: IntelliJ Community Edition and CoreNLP. After
filtering to remove refactoring commits that mixed feature additions (the one-liners and
the extracted methods whose body overlapped a large proportion of the host method), we
retained **2,849** _Extract Methods_ from these repositories.
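
The filtering applied to the Extended Corpus can be illustrated with a minimal sketch. This is not the pipeline used to build the dataset: the record layout, the function names, and especially the `0.8` overlap cutoff are assumptions made for illustration, since the text above only says "a large proportion" without giving a number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractMethodCandidate:
    """One mined Extract Method instance (hypothetical record layout)."""
    host_method_lines: int           # lines in the host method before extraction
    extracted_lines: int             # lines moved into the extracted method
    mixed_with_feature_change: bool  # commit also added functionality


# Assumed cutoff: the dataset description does not state the actual threshold.
MAX_OVERLAP_RATIO = 0.8


def keep_candidate(c: ExtractMethodCandidate,
                   max_overlap: float = MAX_OVERLAP_RATIO) -> bool:
    """Return True if the candidate survives all three filters."""
    if c.mixed_with_feature_change:
        return False  # commit is not a pure refactoring
    if c.extracted_lines <= 1:
        return False  # one-liner extraction
    if c.host_method_lines <= 0:
        return False  # malformed record, cannot compute overlap
    overlap = c.extracted_lines / c.host_method_lines
    return overlap < max_overlap  # drop near-total extractions


def filter_candidates(candidates):
    """Keep only the candidates that pass keep_candidate, preserving order."""
    return [c for c in candidates if keep_candidate(c)]
```

For example, a candidate that extracts 5 of 20 host lines in a pure refactoring commit would be kept, while a one-line extraction, a 9-of-10-line extraction, or a commit that also adds a feature would be dropped under these assumed settings.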
Dataset details are available via the **following
[link](https://github.com/llm-refactoring/llm-refactoring-plugin/tree/main#dataset-details)**.