# Should I shuffle samples with cross-validation?

## Метаданные

- **Канал:** Data School
- **YouTube:** https://www.youtube.com/watch?v=E5zOAtlrQqE
- **Источник:** https://ekstraktznaniy.ru/video/23716

## Транскрипт

### Segment 1 (00:00 - 03:00) []

When I ran cross_val_score earlier in this chapter, I passed an integer, 3 in this case, that specified the number of cross-validation folds. This code shows you what happens "under the hood" when you specify cv=3 for a classification problem. I'm going to walk through this code so you understand what's happening and you can modify it when needed. First, you'll notice that we're importing a class called StratifiedKFold. It's known as a cross-validation splitter, which means that its role is to split datasets. We create an instance of this class and pass it a 3 so that it will create 3 folds. And then we can pass this instance to cross_val_score instead of an integer. It's called StratifiedKFold because it uses stratified sampling to ensure that the class proportions are approximately equal in each fold. For example, if 40% of the passengers in the dataset survived, then stratified sampling ensures that about 40% of each fold is survived passengers. In other words, it ensures that each fold is representative of the entire dataset. Stratified sampling is desirable because it produces more reliable cross-validation scores, and again, scikit-learn will do this for you by default. Another good thing to know about StratifiedKFold is that by default, it does not shuffle the samples before splitting. Thus, there is nothing random about this process, and as such you will get the same results every time you run cross_val_score. In most cases, it doesn't matter whether you shuffle the samples before splitting. However, if the order of the samples in your dataset is not arbitrary, then it's important to randomly shuffle the samples when cross-validating. For example, you could imagine that if your dataset was sorted by one of the features, then some folds would only have high values of that feature and other folds would only have low values of that feature, which could result in unreliable cross-validation scores. If you do need to shuffle the samples, you simply modify the cross-validation splitter by setting shuffle to True, and then pass that splitter object to cross_val_score. Note that because you are introducing randomness into the process by shuffling, you should also set a random_state to ensure reproducibility. In summary, if you have a classification problem and the samples are in an arbitrary order, you can just pass an integer to the cv parameter of cross_val_score, and it will use stratified sampling without shuffling. If your samples are not in an arbitrary order, you should use StratifiedKFold as your splitter and set shuffle to True, and then pass the splitter object to the cv parameter of cross_val_score, as I did above. Finally, it's worth mentioning that if you're working on a regression problem instead, and you need to shuffle the samples, you should use the KFold class instead of the StratifiedKFold class, because stratified sampling does not apply to regression problems.