top of page
  • Writer's pictureMauricio Cordeiro

Python for Geosciences: Scatter Plots and PDF reports (step by step)

Updated: Jan 23

Learn how to create scatter plots from satellite imagery and automate a PDF reporting engine with the results


Hi, and welcome back to part 6 of Python for Geosciences. One advantage of using a programming language like Python to perform satellite image analysis instead of a geospatial software is that we can automate any part of the process. This automation can be used in a production chain to return a periodically output(for example, monthly water surface, annual deforestation, among others) or it can be used to perform a historical analysis (create a deforestation time series, for example). In either usage, we will be generating huge amount of output data and just the process of reviewing it can be overwhelming depending on the periodicity and time frame involved.

I describe one example of such automation in my "Water Detection in High Resolution Satellite Images using the waterdetect python package" story (there is also a peer-reviewed paper about it [1]). I’ve developed the waterdetect package in 2019/2020, during my PhD, with the objective of supporting "real time" water quality analysis in large scale. One of the goals was to make it fast compared to other algorithms (ex. an entire Sentinel 2 tile with 120M pixels can be processed in less than 5 minutes). Such performance was achieved by a mix of supervised and unsupervised machine learning techniques such as agglomerative clustering and Naïve Bayes classifier (more details in this story here).

I have already used the waterdetect package to process more than 4.000 Sentinel 2 images and it can be overwhelming to quality check all those outputs. In order to make this screening more efficient I created a PDF reporting engine that encapsules a lower resolution RGB of the scenes, the overlay of applied masks (clouds mask, water mask, etc.) and the main graphs into a single PDF file. In Figure 1 we can see an example of the waterdetect report in a Sentinel 2 scene from Mexico.

In the previous posts we have already learnt how to create RGB representation and overlays with masks, so now it is time to put them all together into a PDF report. Additionally we will learn how to create another type of graph very useful for image analysis, the Scatter Plots. Lets get started!

Step 1- Install PyPDF2

We have already installed the libraries necessary for much of the work. The matplotlib library we use for displaying the images and graphs is able of writing directly to .PNG, .JPG or .PDF files, so the only part missing is a library for merging them all together into a single report. For that, we will use the simple PyPDF2 package (here), that can be installed in our environment using the command:

!pip install PyPDF2

from a notebook cell. The “!” tells the notebook to execute the command in the shell instead of executing it as a Python command. After that, we can test importing the library.

from PyPDF2 import PdfFileMerger

Step 2- Create the RGB PDF

Now we will use the functions that we have already defined in previous posts to create a PDF with a RGB representation of the scene and another PDF with the overlay of the RGB with the clouds, shadows and water masks. A simple modification will be done in our load_landsat_image, in order to load the bands in the desired resolution, as the entire image would be too large for the PDF file.

To keep the aspect ratio intact, we will receive as parameter just the width and we will then calculate the height accordingly and pass it to the rasterio's read function. After that, we create our RGB image using the create_rgb function. Note that we clip values higher than 1, because we need values between [0…1] to avoid issues with the PDF creation afterwards. After creating the RGB, instead of displaying it with plt.imshow, we will save it directly to a .PDF file with plt.imsave. We will also load the B5 (NIR) and QA_PIXEL bands to use them in the next step.

Step 3- Create masked RGB PDF files

To create a cloud and cloud shadow masked RGB, we will use the code snippet created in Part 4, most precisely the functions get_mask and plot_masked_rgb. In addition, we will also use the normalized_difference function from Part 2 to create the NDWI index. The water mask will be defined as pixels with NDWI greater than 0.0. As we did in the previous step, we will save it as a PDF file and display the result just for visualization purposes.

Step 4- Creating Scatter Plots

In Part 3- Spectral Analysis (here), we used plt.plot function to plot the mean reflectance spectrum of some targets. Another useful visualization for satellite image analysis are scatter plots graphs. To make a scatter plot we will use the plt.scatter function. The function receives two lists or two vectors containing the values for X and Y axis. The trick here is to convert our 2-D bands into 1-D vectors before passing the values to plt.scatter. The easiest way to change the shape of an array is to use the numpy’s reshape function. If we pass the argument -1 to a dimensions’ shape, it means that the shape should be inferred from the length of the array and remaining dimensions. Let’s see an example and try to plot the graph at the end:

Original shape: (1021, 1000)
Fixing first dimension as size 2: (2, 510500)
Reshaping it to a vector: (1021000,)
Note: If the data is not reshaped in 1-D array, the plt.scatter function will do it internally, however it is a good practice to have control over the shapes we are dealing with.

Well… the graph worked, but it is not meaningful. We need to do some tweaks here. First, the values should be scaled by 0.0001 factor to correctly represent the reflectance. Additionally, we have pixels with 0 value on all bands. That’s because we have no-data pixel that should be treated before plotting the graph. We could do this by ignoring 0 values, but the correct way to check for no-data pixels is to load the image’s no-data mask and we will use just the pixels that have valid data.

Another problem is that we are plotting more than 1 million points. In this case, we can randomly resample to a smaller quantity of pixels. To accomplish this, we could use numpy.random.choice function and pass the quantity of pixels we want to use. The problem with this approach is that each time we call the choice function, it will make a completely random choice and we will not be selecting the same pixels for each band. Instead, we will create a n-sized vector with random indices using numpy.random.randint, and them get the pixels corresponding to these indices. The maximum index value will depend on the quantity of pixels, after we filter for the valid data.

Our graph is looking much better, but we can still make some improvements before it is ready for the report. In this graph we don’t know exactly to which target each pixel belongs to. We could use different colors to represent the different targets (clouds, shadows, water and land), for example.

For that, we will start creating 3 lists representing the labels, the masks and the colors for each group. We will define the following groups:

  • land

  • clouds

  • shadows

  • water

We already have masks for them, except for the land. We can them define the land mask as the inverse (‘~’ operand) of the other 3 masks combined, like so:

land = ~(clouds | shadows | water)

groups = ['land', 'clouds', 'shadows', 'water']
masks = [land, clouds, shadows, water]
colors = ['green', 'orange', 'brown', 'blue']

Now, we need to loop through these 3 lists (groups, masks , and colors) in parallel. For example, when we are in group ‘clouds’, we need the clouds mask and the orange color and successively. This can be done using the zip function.

for group, mask, color in zip(groups, masks, colors):
    print(f'Group: {group} - Color: {color}')

Code output:
Group: land - Color: green
Group: clouds - Color: orange
Group: shadows - Color: brown
Group: water - Color: blue

In the loop, we will plot each group with a different color and also a different label to create the legend. The last thing we will do is to set the markers of the legend as opaque.

Step 5- Save Scatter Plots

Now that we’ve succeeded in creating a scatter plot, it is important to wrap it up into a reusable function. The function shall be generic so that we could also plot an index (ex. NDWI) at one of the axis. Also, we will use plt.savefig function to save the figure as a PDF. To test, we will call our new function using B5 and NDWI as X and Y axes, respectively.

Step 6- Create PDF

Now that we have the PDF files already created, we can use the PdfFileMerger class from PyPDF2 package. We will create a function that just receive the PDFs (as a list of filenames) and merge them all into a single report. The PDF report can be seen in Figure 2.

Voilà our simple PDF report. As I mentioned before, it will be very useful for screening and sharing your results.


Here's the entire code for this tutorial.


Today we’ve learned how to create another type of graph for our satellite image analysis, the scatter plot. Combining different bands in the axis can make it possible to distinguish different types of land cover, such as crops, cities, etc. Additionally, we saw how to save all the outputs in lower resolution PDFs and how to merge them into simple PDF reports. It will make easier to verify the outputs or share the results.

If you liked this story or if you have any doubts or suggestions, please leave your comments and don’t forget to “clap” for it. And if you want to receive more stories like this, you can follow me on medium. Hope you liked, so stay tuned and see you next story.

Other posts from Python for Geosciences series

Link for the other posts of Python for Geosciences series:


[1] Cordeiro, M. C. R.; Martinez, J.-M.; Peña-Luque, S. Automatic Water Detection from Multidimensional Hierarchical Clustering for Sentinel-2 Images and a Comparison with Level 2A Processors. Remote Sensing of Environment 2021, 253, 112209.

68 views0 comments


bottom of page